* [PATCH v5 0/9] Initial PVHv2 Dom0 support
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel; +Cc: boris.ostrovsky, konrad.wilk

Hello,

This is the first batch of the PVHv2 Dom0 support series, which includes
everything up to the point where the ACPI tables for Dom0 are crafted. I've
decided to leave the last part of the series (the one that contains the PCI
config space handlers and other emulation/trapping related code) separate,
in order to focus and ease the review. This is of course not yet functional:
one might be able to partially boot a Dom0 kernel, provided that it doesn't
try to access any physical devices and that the panic in setup.c is removed.

The full series can also be found on a git branch in my personal git repo:

git://xenbits.xen.org/people/royger/xen.git dom0_hvm_v5

Each patch contains the changelog between versions.

This series is based on top of Andrew Cooper's "x86/cpuid: Handling of simple
leaves in guest_cpuid()". The dependency is only functional; there should be
no merge conflicts as a result of one going in before or after the other.

Thanks, Roger.



* [PATCH v5 1/9] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
from physical or emulated devices over event channels using PIRQs. This
applies to both DomU and Dom0 PVHv2 guests.

Introduce a new XEN_X86_EMU_USE_PIRQ flag to notify Xen whether an HVM guest
can route physical interrupts (even from emulated devices) over event
channels, and is thus allowed to use some of the PHYSDEV ops.
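
For illustration only (not part of this patch), a guest could probe for the
capability via the features hypercall; the HYPERVISOR_xen_version wrapper
below follows the Linux naming convention and is an assumption here:

    #include <xen/interface/features.h>   /* XENFEAT_hvm_pirqs */
    #include <xen/interface/version.h>    /* XENVER_get_features */

    static bool hvm_pirqs_available(void)
    {
        /* XENFEAT_hvm_pirqs lives in the first feature submap. */
        struct xen_feature_info fi = { .submap_idx = 0 };

        if ( HYPERVISOR_xen_version(XENVER_get_features, &fi) )
            return false;

        return fi.submap & (1U << XENFEAT_hvm_pirqs);
    }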

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Update docs.

Changes since v2:
 - Change local variable name to currd instead of d.
 - Use currd where it makes sense.
---
 docs/misc/hvmlite.markdown        | 20 ++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c            | 25 ++++++++++++++++---------
 xen/arch/x86/physdev.c            |  5 +++--
 xen/common/kernel.c               |  3 ++-
 xen/include/public/arch-x86/xen.h |  4 +++-
 5 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/docs/misc/hvmlite.markdown b/docs/misc/hvmlite.markdown
index 898b8ee..b2557f7 100644
--- a/docs/misc/hvmlite.markdown
+++ b/docs/misc/hvmlite.markdown
@@ -75,3 +75,23 @@ info structure that's passed at boot time (field rsdp_paddr).
 
 Description of paravirtualized devices will come from XenStore, just as it's
 done for HVM guests.
+
+## Interrupts ##
+
+### Interrupts from physical devices ###
+
+Interrupts from physical devices are delivered using native methods, in
+order to take advantage of hardware assisted virtualization functions,
+like posted interrupts. This implies that PVHv2 guests with physical
+devices will also have the necessary interrupt controllers in order to
+manage the delivery of interrupts from those devices, using the same
+interfaces that are available on native hardware.
+
+### Interrupts from paravirtualized devices ###
+
+Interrupts from paravirtualized devices are delivered using event channels;
+see [Event Channel Internals][event_channels] for more detailed information
+about event channels. Delivery of those interrupts can be configured in the
+same way as for HVM guests; check xen/include/public/hvm/params.h and
+xen/include/public/hvm/hvm_op.h for more information about the available
+delivery methods.
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 63748dc..41a44c9 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3757,10 +3757,12 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
 static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *currd = current->domain;
+
     switch ( cmd )
     {
     default:
-        if ( !is_pvh_vcpu(current) || !is_hardware_domain(current->domain) )
+        if ( !is_pvh_domain(currd) || !is_hardware_domain(currd) )
             return -ENOSYS;
         /* fall through */
     case PHYSDEVOP_map_pirq:
@@ -3768,7 +3770,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
-        return do_physdev_op(cmd, arg);
+        return ((currd->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
+                is_pvh_domain(currd)) ?
+                   do_physdev_op(cmd, arg) : -ENOSYS;
     }
 }
 
@@ -3801,17 +3805,20 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 static long hvm_physdev_op_compat32(
     int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *d = current->domain;
+
     switch ( cmd )
     {
-        case PHYSDEVOP_map_pirq:
-        case PHYSDEVOP_unmap_pirq:
-        case PHYSDEVOP_eoi:
-        case PHYSDEVOP_irq_status_query:
-        case PHYSDEVOP_get_free_pirq:
-            return compat_physdev_op(cmd, arg);
+    case PHYSDEVOP_map_pirq:
+    case PHYSDEVOP_unmap_pirq:
+    case PHYSDEVOP_eoi:
+    case PHYSDEVOP_irq_status_query:
+    case PHYSDEVOP_get_free_pirq:
+        return (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                    compat_physdev_op(cmd, arg) : -ENOSYS;
         break;
     default:
-            return -ENOSYS;
+        return -ENOSYS;
         break;
     }
 }
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 5a49796..0bea6e1 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -94,7 +94,8 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
     int pirq, irq, ret = 0;
     void *map_data = NULL;
 
-    if ( domid == DOMID_SELF && is_hvm_domain(d) )
+    if ( domid == DOMID_SELF && is_hvm_domain(d) &&
+         (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         /*
          * Only makes sense for vector-based callback, else HVM-IRQ logic
@@ -265,7 +266,7 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
     if ( ret )
         goto free_domain;
 
-    if ( is_hvm_domain(d) )
+    if ( is_hvm_domain(d) && (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         spin_lock(&d->event_lock);
         if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index d0edb13..a82f55f 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -332,7 +332,8 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             case guest_type_hvm:
                 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
                              (1U << XENFEAT_hvm_callback_vector) |
-                             (1U << XENFEAT_hvm_pirqs);
+                             ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                                 (1U << XENFEAT_hvm_pirqs) : 0);
                 break;
             }
 #endif
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index 363c8cc..73c829a 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -293,12 +293,14 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_IOMMU           (1U<<_XEN_X86_EMU_IOMMU)
 #define _XEN_X86_EMU_PIT            8
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
+#define _XEN_X86_EMU_USE_PIRQ       9
+#define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
     uint32_t emulation_flags;
 };
 
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel
  Cc: Kevin Tian, Jan Beulich, konrad.wilk, George Dunlap,
	Andrew Cooper, Jun Nakajima, boris.ostrovsky, Roger Pau Monne

There's nothing wrong with allowing the domain to perform DMA transfers to
MMIO areas that it can already access from the CPU, and this allows us to
remove the hack in set_identity_p2m_entry for PVH Dom0.
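
As a usage sketch of the updated helper (illustrative, matching the hunk
below): for a direct MMIO page it now yields

    unsigned int flags = p2m_get_iommu_flags(p2m_mmio_direct, mfn);

    /*
     * flags == IOMMUF_readable                   if mfn is in mmio_ro_ranges
     * flags == IOMMUF_readable | IOMMUF_writable otherwise
     */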

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Check for mmio_ro_ranges, this requires passing the mfn to the function, and
   fixing the callers.
---
 xen/arch/x86/mm/p2m-ept.c |  5 +++--
 xen/arch/x86/mm/p2m-pt.c  | 17 ++++++++++-------
 xen/arch/x86/mm/p2m.c     |  9 ---------
 xen/include/asm-x86/p2m.h |  6 +++++-
 4 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 13cab24..0e316ba 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -672,7 +672,7 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
     uint8_t ipat = 0;
     bool_t need_modify_vtd_table = 1;
     bool_t vtd_pte_present = 0;
-    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt);
+    unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn);
     bool_t needs_sync = 1;
     ept_entry_t old_entry = { .epte = 0 };
     ept_entry_t new_entry = { .epte = 0 };
@@ -798,7 +798,8 @@ ept_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
 
         /* Safe to read-then-write because we hold the p2m lock */
         if ( ept_entry->mfn == new_entry.mfn &&
-             p2m_get_iommu_flags(ept_entry->sa_p2mt) == iommu_flags )
+             p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) ==
+             iommu_flags )
             need_modify_vtd_table = 0;
 
         ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma);
diff --git a/xen/arch/x86/mm/p2m-pt.c b/xen/arch/x86/mm/p2m-pt.c
index 3b025d5..a23d0bd 100644
--- a/xen/arch/x86/mm/p2m-pt.c
+++ b/xen/arch/x86/mm/p2m-pt.c
@@ -499,7 +499,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
     l2_pgentry_t l2e_content;
     l3_pgentry_t l3e_content;
     int rc;
-    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt);
+    unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn);
     /*
      * old_mfn and iommu_old_flags control possible flush/update needs on the
      * IOMMU: We need to flush when MFN or flags (i.e. permissions) change.
@@ -565,9 +565,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         {
             if ( flags & _PAGE_PSE )
             {
-                iommu_old_flags =
-                    p2m_get_iommu_flags(p2m_flags_to_type(flags));
                 old_mfn = l1e_get_pfn(*p2m_entry);
+                iommu_old_flags =
+                    p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        _mfn(old_mfn));
             }
             else
             {
@@ -609,9 +610,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn,
                                    0, L1_PAGETABLE_ENTRIES);
         ASSERT(p2m_entry);
-        iommu_old_flags =
-            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)));
         old_mfn = l1e_get_pfn(*p2m_entry);
+        iommu_old_flags =
+            p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)),
+                                _mfn(old_mfn));
 
         if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) )
             entry_content = p2m_l1e_from_pfn(mfn_x(mfn),
@@ -637,9 +639,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
         {
             if ( flags & _PAGE_PSE )
             {
-                iommu_old_flags =
-                    p2m_get_iommu_flags(p2m_flags_to_type(flags));
                 old_mfn = l1e_get_pfn(*p2m_entry);
+                iommu_old_flags =
+                    p2m_get_iommu_flags(p2m_flags_to_type(flags),
+                                        _mfn(old_mfn));
             }
             else
             {
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 6a45185..7e33ab6 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1053,16 +1053,7 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
         ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
                             p2m_mmio_direct, p2ma);
     else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
-    {
         ret = 0;
-        /*
-         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
-         * but iomem regions are not mapped with IOMMU. This makes sure that
-         * RMRRs are correctly mapped with IOMMU.
-         */
-        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
-            ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
-    }
     else
     {
         if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 7035860..271d379 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -824,7 +824,7 @@ void p2m_altp2m_propagate_change(struct domain *d, gfn_t gfn,
 /*
  * p2m type to IOMMU flags
  */
-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
+static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn)
 {
     unsigned int flags;
 
@@ -840,6 +840,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
     case p2m_grant_map_ro:
         flags = IOMMUF_readable;
         break;
+    case p2m_mmio_direct:
+        flags = IOMMUF_readable;
+        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
+            flags |= IOMMUF_writable;
+        break;
     default:
         flags = 0;
         break;
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

Split the Dom0 builder into two different functions, one for PV (and classic
PVH) and another one for PVHv2. Introduce a new command line parameter
called 'dom0' that can be used to request the creation of a PVHv2 Dom0 by
setting the 'pvh' sub-option. A panic has also been added if a user tries to
use dom0=pvh; it will be removed once all the code is in place. While there,
remove the dom0_shadow entry from the command line documentation; the option
was lacking a description and was not functional.
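
For example (illustrative), booting Xen with the option below requests a
PVHv2 Dom0 that uses shadow paging:

    dom0=pvh,shadow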

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Move common sanity BUG_ONs and process_pending_softirqs to construct_dom0.
 - Remove the non-existent documentation about the dom0_shadow option.
 - Fix the define of dom0_shadow to be 'false' instead of 0.
 - Move the parsing of the dom0 command line option to domain_build.c.
 - s/hvm/pvh.

Changes since v3:
 - Correctly declare the parameter list.
 - Add a panic if dom0=hvm is used. This will be removed once all the code is in
   place.

Changes since v2:
 - Fix coding style.
 - Introduce a new dom0 option that allows passing several parameters.
   Currently supported ones are hvm and shadow.

Changes since RFC:
 - Add documentation for the new command line option.
 - Simplify the logic in construct_dom0.
---
 docs/misc/xen-command-line.markdown | 20 +++++++++--
 xen/arch/x86/domain_build.c         | 66 +++++++++++++++++++++++++++++++++----
 xen/arch/x86/setup.c                |  9 +++++
 xen/include/asm-x86/setup.h         |  7 ++++
 4 files changed, 92 insertions(+), 10 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 0138978..9f33c23 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -646,9 +646,6 @@ restrictions set up here. Note that the values to be specified here are
 ACPI PXM ones, not Xen internal node numbers. `relaxed` sets up vCPU
 affinities to prefer but be not limited to the specified node(s).
 
-### dom0\_shadow
-> `= <boolean>`
-
 ### dom0\_vcpus\_pin
 > `= <boolean>`
 
@@ -656,6 +653,23 @@ affinities to prefer but be not limited to the specified node(s).
 
 Pin dom0 vcpus to their respective pcpus
 
+### dom0
+> `= List of [ pvh | shadow ]`
+
+> Sub-options:
+
+> `pvh`
+
+> Default: `false`
+
+Flag that makes a dom0 boot in PVHv2 mode.
+
+> `shadow`
+
+> Default: `false`
+
+Flag that makes a dom0 use shadow paging.
+
 ### dom0pvh
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 243df96..4d555b1 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -191,11 +191,40 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
 }
 
 #ifdef CONFIG_SHADOW_PAGING
-static bool_t __initdata opt_dom0_shadow;
+bool __initdata opt_dom0_shadow;
 boolean_param("dom0_shadow", opt_dom0_shadow);
 #else
-#define opt_dom0_shadow 0
+#define opt_dom0_shadow false
 #endif
+bool __initdata dom0_pvh;
+
+/*
+ * List of parameters that affect Dom0 creation:
+ *
+ *  - pvh               Create a PVHv2 Dom0.
+ *  - shadow            Use shadow paging for Dom0.
+ */
+static void __init parse_dom0_param(char *s)
+{
+    char *ss;
+
+    do {
+        ss = strchr(s, ',');
+        if ( ss )
+            *ss = '\0';
+
+        if ( !strcmp(s, "pvh") )
+            dom0_pvh = true;
+#ifdef CONFIG_SHADOW_PAGING
+        else if ( !strcmp(s, "shadow") )
+            opt_dom0_shadow = true;
+#endif
+
+        s = ss + 1;
+    } while ( ss );
+}
+custom_param("dom0", parse_dom0_param);
 
 static char __initdata opt_dom0_ioports_disable[200] = "";
 string_param("dom0_ioports_disable", opt_dom0_ioports_disable);
@@ -951,7 +980,7 @@ static int __init setup_permissions(struct domain *d)
     return rc;
 }
 
-int __init construct_dom0(
+static int __init construct_dom0_pv(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
     module_t *initrd,
@@ -1008,12 +1037,8 @@ int __init construct_dom0(
     paddr_t mpt_alloc;
 
     /* Sanity! */
-    BUG_ON(d->domain_id != 0);
-    BUG_ON(d->vcpu[0] == NULL);
     BUG_ON(v->is_initialised);
 
-    process_pending_softirqs();
-
     printk("*** LOADING DOMAIN 0 ***\n");
 
     d->max_pages = ~0U;
@@ -1655,6 +1680,33 @@ out:
     return rc;
 }
 
+static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
+                                     unsigned long image_headroom,
+                                     module_t *initrd,
+                                     void *(*bootstrap_map)(const module_t *),
+                                     char *cmdline)
+{
+
+    printk("** Building a PVH Dom0 **\n");
+
+    return 0;
+}
+
+int __init construct_dom0(struct domain *d, const module_t *image,
+                          unsigned long image_headroom, module_t *initrd,
+                          void *(*bootstrap_map)(const module_t *),
+                          char *cmdline)
+{
+    /* Sanity! */
+    BUG_ON(d->domain_id != 0);
+    BUG_ON(d->vcpu[0] == NULL);
+
+    process_pending_softirqs();
+
+    return (is_hvm_domain(d) ? construct_dom0_pvh : construct_dom0_pv)
+           (d, image, image_headroom, initrd, bootstrap_map, cmdline);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 0ccef1d..f52f269 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1545,6 +1545,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( opt_dom0pvh )
         domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
 
+    if ( dom0_pvh )
+    {
+        panic("Building a PVHv2 Dom0 is not yet supported.");
+        domcr_flags |= DOMCRF_hvm |
+                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
+                         DOMCRF_hap : 0);
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+    }
+
     /* Create initial domain 0. */
     dom0 = domain_create(0, domcr_flags, 0, &config);
     if ( IS_ERR(dom0) || (alloc_dom0_vcpu0(dom0) == NULL) )
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index c65b79c..47b9442 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -57,4 +57,11 @@ extern uint8_t kbd_shift_flags;
 extern unsigned long highmem_start;
 #endif
 
+#ifdef CONFIG_SHADOW_PAGING
+extern bool opt_dom0_shadow;
+#else
+#define opt_dom0_shadow false
+#endif
+extern bool dom0_pvh;
+
 #endif
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

Craft the Dom0 e820 memory map and populate it. Introduce a helper to remove
memory pages that are shared between Xen and a domain, and use it to remove
the low 1MB RAM regions from dom_io so that they can be assigned to a PVHv2
Dom0.
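
A sketch of the per-region alignment applied when crafting the guest e820
(illustrative; with PAGE_ORDER_4K these are plain 4K roundings):

    /* Round each host RAM region inwards to page boundaries; skip the
     * region if nothing remains. */
    start = ROUNDUP(entry->addr, PAGE_SIZE);
    end   = (entry->addr + entry->size) & PAGE_MASK;
    if ( start >= end )
        continue;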

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - Move process_pending_softirqs to previous patch.
 - Fix off-by-one errors in some checks.
 - Make unshare_xen_page_with_guest __init.
 - Improve unshare_xen_page_with_guest by making use of already existing
   is_xen_heap_page and put_page.
 - s/hvm/pvh/.
 - Use PAGE_ORDER_4K in pvh_setup_e820 in order to keep consistency with the
   p2m code.

Changes since v3:
 - Drop get_order_from_bytes_floor, it was only used by
   hvm_populate_memory_range.
 - Switch hvm_populate_memory_range to use frame numbers instead of full memory
   addresses.
 - Add a helper to steal the low 1MB RAM areas from dom_io and add them to Dom0
   as normal RAM.
 - Introduce unshare_xen_page_with_guest in order to remove pages from dom_io,
   so they can be assigned to other domains. This is needed in order to remove
   the low 1MB RAM regions from dom_io and assign them to the hardware_domain.
 - Simplify the loop in hvm_steal_ram.
 - Move definition of map_identity_mmio into this patch.

Changes since v2:
 - Introduce get_order_from_bytes_floor as a local function to
   domain_build.c.
 - Remove extra asserts.
 - Make hvm_populate_memory_range return an error code instead of panicking.
 - Fix comments and printks.
 - Use ULL suffix instead of casting to uint64_t.
 - Rename hvm_setup_vmx_unrestricted_guest to
   hvm_setup_vmx_realmode_helpers.
 - Only subtract two pages from the memory calculation; they will be used
   by the MADT replacement.
 - Remove some comments.
 - Remove printing allocation information.
 - Don't stash any pages for the MADT, TSS or ident PT, those will be
   subtracted directly from RAM regions of the memory map.
 - Count the number of iterations before calling process_pending_softirqs
   when populating the memory map.
 - Move the initial call to process_pending_softirqs into construct_dom0,
   and remove the ones from construct_dom0_hvm and construct_dom0_pv.
 - Make memflags global so it can be shared between alloc_chunk and
   hvm_populate_memory_range.

Changes since RFC:
 - Use IS_ALIGNED instead of checking with PAGE_MASK.
 - Use the new %pB specifier in order to print sizes in human readable form.
 - Create a VM86 TSS for hardware that doesn't support unrestricted mode.
 - Subtract guest RAM for the identity page table and the VM86 TSS.
 - Split the creation of the unrestricted mode helper structures to a
   separate function.
 - Use preemption with paging_set_allocation.
 - Use get_order_from_bytes_floor.
---
 xen/arch/x86/domain_build.c | 299 +++++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/mm.c           |  16 +++
 xen/include/asm-x86/mm.h    |   2 +
 3 files changed, 312 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 4d555b1..fbce1c2 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -22,6 +22,7 @@
 #include <xen/compat.h>
 #include <xen/libelf.h>
 #include <xen/pfn.h>
+#include <xen/guest_access.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
 static long __initdata dom0_max_nrpages = LONG_MAX;
 
+/* Size of the VM86 TSS for virtual 8086 mode to use. */
+#define HVM_VM86_TSS_SIZE   128
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -244,11 +248,12 @@ boolean_param("ro-hpet", ro_hpet);
 #define round_pgup(_p)    (((_p)+(PAGE_SIZE-1))&PAGE_MASK)
 #define round_pgdown(_p)  ((_p)&PAGE_MASK)
 
+static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
+
 static struct page_info * __init alloc_chunk(
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
             avail -= max_pdx >> s;
     }
 
-    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
+    need_paging = opt_dom0_shadow ||
+                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
+                                                   !paging_mode_hap(d)));
     for ( ; ; need_paging = 0 )
     {
         nr_pages = dom0_nrpages;
@@ -365,7 +372,8 @@ static unsigned long __init compute_dom0_nr_pages(
         avail -= dom0_paging_pages(d, nr_pages);
     }
 
-    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
+    if ( is_pv_domain(d) &&
+         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
          ((dom0_min_nrpages <= 0) || (nr_pages > min_pages)) )
     {
         /*
@@ -581,6 +589,7 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
     struct e820entry *entry, *entry_guest;
     unsigned int i;
     unsigned long pages, cur_pages = 0;
+    uint64_t start, end;
 
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
@@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
             continue;
         }
 
-        *entry_guest = *entry;
-        pages = PFN_UP(entry_guest->size);
+        /*
+         * Make sure the start and length are aligned to PAGE_SIZE, because
+         * that's the minimum granularity of the 2nd stage translation. Since
+         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
+         * order to prevent this code from getting out of sync.
+         */
+        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
+        end = (entry->addr + entry->size) &
+              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1);
+        if ( start >= end )
+            continue;
+
+        entry_guest->type = E820_RAM;
+        entry_guest->addr = start;
+        entry_guest->size = end - start;
+        pages = PFN_DOWN(entry_guest->size);
         if ( (cur_pages + pages) > nr_pages )
         {
             /* Truncate region */
@@ -1680,15 +1703,281 @@ out:
     return rc;
 }
 
+static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
+                                       unsigned long nr_pages, bool map)
+{
+    int rc;
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, _gfn(pfn), nr_pages, _mfn(pfn));
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_WARNING
+                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
+                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
+            break;
+        }
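+        /* rc > 0 means rc pages were processed before preemption; adjust
+         * the range and retry after processing softirqs. */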
+        nr_pages -= rc;
+        pfn += rc;
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
+/* Populate an HVM memory range using the biggest possible order. */
+static int __init pvh_populate_memory_range(struct domain *d,
+                                            unsigned long start,
+                                            unsigned long nr_pages)
+{
+    unsigned int order, i = 0;
+    struct page_info *page;
+    int rc;
+#define MAP_MAX_ITER 64
+
+    order = MAX_ORDER;
+    while ( nr_pages != 0 )
+    {
+        unsigned int range_order = get_order_from_pages(nr_pages + 1);
+
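+        /*
+         * get_order_from_pages(nr_pages + 1) - 1 is the largest order that
+         * covers no more than nr_pages, i.e. the biggest chunk that still
+         * fits in the remaining range.
+         */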
+        order = min(range_order ? range_order - 1 : 0, order);
+        page = alloc_domheap_pages(d, order, memflags);
+        if ( page == NULL )
+        {
+            if ( order == 0 && memflags )
+            {
+                /* Try again without any memflags. */
+                memflags = 0;
+                order = MAX_ORDER;
+                continue;
+            }
+            if ( order == 0 )
+            {
+                printk("Unable to allocate memory with order 0!\n");
+                return -ENOMEM;
+            }
+            order--;
+            continue;
+        }
+
+        rc = guest_physmap_add_page(d, _gfn(start), _mfn(page_to_mfn(page)),
+                                    order);
+        if ( rc != 0 )
+        {
+            printk("Failed to populate memory: [%#lx,%lx): %d\n",
+                   start, start + (1UL << order), rc);
+            return -ENOMEM;
+        }
+        start += 1UL << order;
+        nr_pages -= 1UL << order;
+        if ( (++i % MAP_MAX_ITER) == 0 )
+            process_pending_softirqs();
+    }
+
+    return 0;
+#undef MAP_MAX_ITER
+}
+
+static int __init pvh_steal_ram(struct domain *d, unsigned long size,
+                                paddr_t limit, paddr_t *addr)
+{
+    unsigned int i = d->arch.nr_e820;
+
+    while ( i-- )
+    {
+        struct e820entry *entry = &d->arch.e820[i];
+
+        if ( entry->type != E820_RAM || entry->size < size )
+            continue;
+
+        /* Subtract from the beginning. */
+        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
+        {
+            *addr = entry->addr;
+            entry->addr += size;
+            entry->size -= size;
+            return 0;
+        }
+    }
+
+    return -ENOMEM;
+}
+
+static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
+{
+    p2m_type_t p2mt;
+    uint32_t rc, *ident_pt;
+    uint8_t *tss;
+    mfn_t mfn;
+    paddr_t gaddr;
+    unsigned int i;
+
+    /*
+     * Steal some space from the last found RAM region. One page will be
+     * used for the identity page tables, and the remaining space for the
+     * VM86 TSS. Note that after this not all e820 regions will be aligned
+     * to PAGE_SIZE.
+     */
+    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
+    {
+        printk("Unable to find memory to stash the identity map and TSS\n");
+        return -ENOMEM;
+    }
+
+    /*
+     * Identity-map page table is required for running with CR0.PG=0
+     * when using Intel EPT. Create a 32-bit non-PAE page directory of
+     * superpages.
+     */
+    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                              &mfn, &p2mt, 0, &rc);
+    if ( ident_pt == NULL )
+    {
+        printk("Unable to map identity page tables\n");
+        return -ENOMEM;
+    }
+    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
+        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
+                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+    unmap_domain_page(ident_pt);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
+    gaddr += PAGE_SIZE;
+    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
+
+    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                         &mfn, &p2mt, 0, &rc);
+    if ( tss == NULL )
+    {
+        printk("Unable to map VM86 TSS area\n");
+        return 0;
+    }
+
+    memset(tss, 0, HVM_VM86_TSS_SIZE);
+    unmap_domain_page(tss);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
+
+    return 0;
+}
+
+static void __init pvh_steal_low_ram(struct domain *d, unsigned long start,
+                                     unsigned long nr_pages)
+{
+    unsigned long mfn;
+
+    ASSERT(start + nr_pages <= PFN_DOWN(MB(1)));
+
+    for ( mfn = start; mfn < start + nr_pages; mfn++ )
+    {
+        struct page_info *pg = mfn_to_page(mfn);
+        int rc;
+
+        rc = unshare_xen_page_with_guest(pg, dom_io);
+        if ( rc )
+        {
+            printk("Unable to unshare Xen mfn %#lx: %d\n", mfn, rc);
+            continue;
+        }
+
+        share_xen_page_with_guest(pg, d, XENSHARE_writable);
+        rc = guest_physmap_add_entry(d, _gfn(mfn), _mfn(mfn), 0, p2m_ram_rw);
+        if ( rc )
+            printk("Unable to add mfn %#lx to p2m: %d\n", mfn, rc);
+    }
+}
+
+static int __init pvh_setup_p2m(struct domain *d)
+{
+    struct vcpu *v = d->vcpu[0];
+    unsigned long nr_pages;
+    unsigned int i;
+    int rc;
+    bool preempted;
+#define MB1_PAGES PFN_DOWN(MB(1))
+
+    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
+
+    pvh_setup_e820(d, nr_pages);
+    do {
+        preempted = false;
+        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                              &preempted);
+        process_pending_softirqs();
+    } while ( preempted );
+
+    /*
+     * Memory below 1MB is identity mapped.
+     * NB: this only makes sense when booted from legacy BIOS.
+     */
+    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
+    if ( rc )
+    {
+        printk("Failed to identity map low 1MB: %d\n", rc);
+        return rc;
+    }
+
+    /* Populate memory map. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        unsigned long addr, size;
+
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        addr = PFN_DOWN(d->arch.e820[i].addr);
+        size = PFN_DOWN(d->arch.e820[i].size);
+
+        ASSERT(addr >= MB1_PAGES || addr + size <= MB1_PAGES);
+
+        if ( addr >= MB1_PAGES )
+            rc = pvh_populate_memory_range(d, addr, size);
+        else
+            pvh_steal_low_ram(d, addr, size);
+
+        if ( rc )
+            return rc;
+    }
+
+    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
+    {
+        /*
+         * Since Dom0 cannot be migrated, we will only setup the
+         * unrestricted guest helpers if they are needed by the current
+         * hardware we are running on.
+         */
+        rc = pvh_setup_vmx_realmode_helpers(d);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+#undef MB1_PAGES
+}
+
 static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    int rc;
 
     printk("** Building a PVH Dom0 **\n");
 
+    iommu_hwdom_init(d);
+
+    rc = pvh_setup_p2m(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 physical memory map\n");
+        return rc;
+    }
+
     return 0;
 }
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a5521f1..721a587 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -475,6 +475,22 @@ void share_xen_page_with_guest(
     spin_unlock(&d->page_alloc_lock);
 }
 
+int __init unshare_xen_page_with_guest(struct page_info *page,
+                                       struct domain *d)
+{
+    if ( page_get_owner(page) != d || !is_xen_heap_page(page) )
+        return -EINVAL;
+
+    if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
+        put_page(page);
+
+    /* Remove the owner and clear the flags. */
+    page->u.inuse.type_info = 0;
+    page_set_owner(page, NULL);
+
+    return 0;
+}
+
 void share_xen_page_with_privileged_guests(
     struct page_info *page, int readonly)
 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 93a073d..3d02ebb 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -276,6 +276,8 @@ struct spage_info
 #define XENSHARE_readonly 1
 extern void share_xen_page_with_guest(
     struct page_info *page, struct domain *d, int readonly);
+extern int unshare_xen_page_with_guest(struct page_info *page,
+                                       struct domain *d);
 extern void share_xen_page_with_privileged_guests(
     struct page_info *page, int readonly);
 extern void free_shared_domheap_page(struct page_info *page);
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne, konrad.wilk

The current __hvm_copy assumes that the destination memory belongs to the
current vcpu, but this is not always the case: during the PVHv2 Dom0 build
the hvm copy functions are used while current is the idle vcpu. Add a new
vcpu parameter to hvm copy in order to solve that. Note that only
hvm_copy_to_guest_phys_vcpu is introduced, because it is the only variant
currently required in order to build a PVHv2 Dom0.

While there, also assert that the passed vcpu belongs to an HVM guest.
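
As a usage sketch (mirroring how the Dom0 builder uses it later in this
series):

    /* Copy into d's physmap while current is the idle vCPU. */
    rc = hvm_copy_to_guest_phys_vcpu(gaddr, buf, size, d->vcpu[0]);
    if ( rc )    /* HVMCOPY_okay is 0; anything else is an error. */
        /* handle the HVMCOPY_* error */;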

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v4:
 - New in the series.
---
 xen/arch/x86/hvm/hvm.c            | 40 ++++++++++++++++++++++++---------------
 xen/include/asm-x86/hvm/support.h |  2 ++
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 41a44c9..600a04d 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3077,15 +3077,16 @@ void hvm_task_switch(
 #define HVMCOPY_linear     (1u<<2)
 static enum hvm_copy_result __hvm_copy(
     void *buf, paddr_t addr, int size, unsigned int flags, uint32_t pfec,
-    pagefault_info_t *pfinfo)
+    pagefault_info_t *pfinfo, struct vcpu *vcpu)
 {
-    struct vcpu *curr = current;
     unsigned long gfn;
     struct page_info *page;
     p2m_type_t p2mt;
     char *p;
     int count, todo = size;
 
+    ASSERT(has_hvm_container_vcpu(vcpu));
+
     /*
      * XXX Disable for 4.1.0: PV-on-HVM drivers will do grant-table ops
      * such as query_size. Grant-table code currently does copy_to/from_guest
@@ -3110,7 +3111,7 @@ static enum hvm_copy_result __hvm_copy(
 
         if ( flags & HVMCOPY_linear )
         {
-            gfn = paging_gva_to_gfn(curr, addr, &pfec);
+            gfn = paging_gva_to_gfn(vcpu, addr, &pfec);
             if ( gfn == gfn_x(INVALID_GFN) )
             {
                 if ( pfec & PFEC_page_paged )
@@ -3137,12 +3138,12 @@ static enum hvm_copy_result __hvm_copy(
          * - 32-bit WinXP (& older Windows) on AMD CPUs for LAPIC accesses,
          * - newer Windows (like Server 2012) for HPET accesses.
          */
-        if ( !nestedhvm_vcpu_in_guestmode(curr)
-             && is_hvm_vcpu(curr)
+        if ( !nestedhvm_vcpu_in_guestmode(vcpu)
+             && is_hvm_vcpu(vcpu)
              && hvm_mmio_internal(gpa) )
             return HVMCOPY_bad_gfn_to_mfn;
 
-        page = get_page_from_gfn(curr->domain, gfn, &p2mt, P2M_UNSHARE);
+        page = get_page_from_gfn(vcpu->domain, gfn, &p2mt, P2M_UNSHARE);
 
         if ( !page )
             return HVMCOPY_bad_gfn_to_mfn;
@@ -3150,7 +3151,7 @@ static enum hvm_copy_result __hvm_copy(
         if ( p2m_is_paging(p2mt) )
         {
             put_page(page);
-            p2m_mem_paging_populate(curr->domain, gfn);
+            p2m_mem_paging_populate(vcpu->domain, gfn);
             return HVMCOPY_gfn_paged_out;
         }
         if ( p2m_is_shared(p2mt) )
@@ -3172,9 +3173,9 @@ static enum hvm_copy_result __hvm_copy(
             {
                 static unsigned long lastpage;
                 if ( xchg(&lastpage, gfn) != gfn )
-                    gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
+                    printk(XENLOG_DEBUG "%pv guest attempted write to read-only"
                              " memory page. gfn=%#lx, mfn=%#lx\n",
-                             gfn, page_to_mfn(page));
+                             vcpu, gfn, page_to_mfn(page));
             }
             else
             {
@@ -3182,7 +3183,7 @@ static enum hvm_copy_result __hvm_copy(
                     memcpy(p, buf, count);
                 else
                     memset(p, 0, count);
-                paging_mark_dirty(curr->domain, _mfn(page_to_mfn(page)));
+                paging_mark_dirty(vcpu->domain, _mfn(page_to_mfn(page)));
             }
         }
         else
@@ -3202,18 +3203,25 @@ static enum hvm_copy_result __hvm_copy(
     return HVMCOPY_okay;
 }
 
+enum hvm_copy_result hvm_copy_to_guest_phys_vcpu(
+    paddr_t paddr, void *buf, size_t size, struct vcpu *vcpu)
+{
+    return __hvm_copy(buf, paddr, size,
+                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, vcpu);
+}
+
 enum hvm_copy_result hvm_copy_to_guest_phys(
     paddr_t paddr, void *buf, int size)
 {
     return __hvm_copy(buf, paddr, size,
-                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_to_guest | HVMCOPY_phys, 0, NULL, current);
 }
 
 enum hvm_copy_result hvm_copy_from_guest_phys(
     void *buf, paddr_t paddr, int size)
 {
     return __hvm_copy(buf, paddr, size,
-                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL);
+                      HVMCOPY_from_guest | HVMCOPY_phys, 0, NULL, current);
 }
 
 enum hvm_copy_result hvm_copy_to_guest_linear(
@@ -3222,7 +3230,8 @@ enum hvm_copy_result hvm_copy_to_guest_linear(
 {
     return __hvm_copy(buf, addr, size,
                       HVMCOPY_to_guest | HVMCOPY_linear,
-                      PFEC_page_present | PFEC_write_access | pfec, pfinfo);
+                      PFEC_page_present | PFEC_write_access | pfec, pfinfo,
+                      current);
 }
 
 enum hvm_copy_result hvm_copy_from_guest_linear(
@@ -3231,7 +3240,7 @@ enum hvm_copy_result hvm_copy_from_guest_linear(
 {
     return __hvm_copy(buf, addr, size,
                       HVMCOPY_from_guest | HVMCOPY_linear,
-                      PFEC_page_present | pfec, pfinfo);
+                      PFEC_page_present | pfec, pfinfo, current);
 }
 
 enum hvm_copy_result hvm_fetch_from_guest_linear(
@@ -3240,7 +3249,8 @@ enum hvm_copy_result hvm_fetch_from_guest_linear(
 {
     return __hvm_copy(buf, addr, size,
                       HVMCOPY_from_guest | HVMCOPY_linear,
-                      PFEC_page_present | PFEC_insn_fetch | pfec, pfinfo);
+                      PFEC_page_present | PFEC_insn_fetch | pfec, pfinfo,
+                      current);
 }
 
 unsigned long copy_to_user_hvm(void *to, const void *from, unsigned int len)
diff --git a/xen/include/asm-x86/hvm/support.h b/xen/include/asm-x86/hvm/support.h
index ba5899c..1a34d35 100644
--- a/xen/include/asm-x86/hvm/support.h
+++ b/xen/include/asm-x86/hvm/support.h
@@ -71,6 +71,8 @@ enum hvm_copy_result hvm_copy_to_guest_phys(
     paddr_t paddr, void *buf, int size);
 enum hvm_copy_result hvm_copy_from_guest_phys(
     void *buf, paddr_t paddr, int size);
+enum hvm_copy_result hvm_copy_to_guest_phys_vcpu(
+    paddr_t paddr, void *buf, size_t size, struct vcpu *vcpu);
 
 /*
  * Copy to/from a guest linear address. @pfec should include PFEC_user_mode
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

Introduce a helper to parse and load the Dom0 kernel image.
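
The guest-physical layout produced, as a sketch:

    /*
     * [kernel image]       placed according to its ELF program headers
     * last_addr onwards (growing from the end of the loaded image):
     *   initrd             (if present)
     *   command line       (NUL terminated, if present)
     *   hvm_modlist_entry  (if an initrd is present)
     *   hvm_start_info     (start_info_addr; %ebx points here at entry)
     */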

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - s/hvm/pvh.
 - Use hvm_copy_to_guest_phys_vcpu.

Changes since v3:
 - Change one error message.
 - Indent "out" label by one space.
 - Introduce hvm_copy_to_phys and slightly simplify the code in hvm_load_kernel.

Changes since v2:
 - Remove debug messages.
 - Don't hardcode the number of modules to 1.
---
 xen/arch/x86/domain_build.c | 143 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index fbce1c2..4f5f712 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -39,6 +39,7 @@
 #include <asm/hpet.h>
 
 #include <public/version.h>
+#include <public/arch-x86/hvm/start_info.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -1959,12 +1960,146 @@ static int __init pvh_setup_p2m(struct domain *d)
 #undef MB1_PAGES
 }
 
+static int __init pvh_load_kernel(struct domain *d, const module_t *image,
+                                  unsigned long image_headroom,
+                                  module_t *initrd, char *image_base,
+                                  char *cmdline, paddr_t *entry,
+                                  paddr_t *start_info_addr)
+{
+    char *image_start = image_base + image_headroom;
+    unsigned long image_len = image->mod_end;
+    struct elf_binary elf;
+    struct elf_dom_parms parms;
+    paddr_t last_addr;
+    struct hvm_start_info start_info;
+    struct hvm_modlist_entry mod = { 0 };
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    int rc;
+
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    {
+        printk("Error trying to detect bz compressed kernel\n");
+        return rc;
+    }
+
+    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
+    {
+        printk("Unable to init ELF\n");
+        return rc;
+    }
+#ifdef VERBOSE
+    elf_set_verbose(&elf);
+#endif
+    elf_parse_binary(&elf);
+    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
+    {
+        printk("Unable to parse kernel for ELFNOTES\n");
+        return rc;
+    }
+
+    if ( parms.phys_entry == UNSET_ADDR32 )
+    {
+        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
+        return -EINVAL;
+    }
+
+    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
+           parms.guest_ver, parms.loader,
+           elf_64bit(&elf) ? "64-bit" : "32-bit");
+
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
+    elf.dest_size = parms.virt_kend - parms.virt_kstart;
+
+    /*
+     * NB: we need to set vCPU 0 of Dom0 as the current vCPU (instead of the
+     * idle one) because elf_load_binary calls raw_{copy_to/clear}_guest, and
+     * the target domain is not passed anywhere. This is very similar to what
+     * is done during classic PV Dom0 creation, where Xen switches to the Dom0
+     * page tables. We also cannot pass a struct domain or vcpu to
+     * elf_load_binary, since the interface is shared with the toolstack, and
+     * it doesn't have any notion of a domain or vcpu struct.
+     */
+    saved_current = current;
+    set_current(v);
+    rc = elf_load_binary(&elf);
+    set_current(saved_current);
+    if ( rc < 0 )
+    {
+        printk("Failed to load kernel: %d\n", rc);
+        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
+        return rc;
+    }
+
+    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_guest_phys_vcpu(last_addr,
+                                         mfn_to_virt(initrd->mod_start),
+                                         initrd->mod_end, v);
+        if ( rc )
+        {
+            printk("Unable to copy initrd to guest\n");
+            return rc;
+        }
+
+        mod.paddr = last_addr;
+        mod.size = initrd->mod_end;
+        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
+    }
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
+    memset(&start_info, 0, sizeof(start_info));
+    if ( cmdline != NULL )
+    {
+        rc = hvm_copy_to_guest_phys_vcpu(last_addr, cmdline,
+                                         strlen(cmdline) + 1, v);
+        if ( rc )
+        {
+            printk("Unable to copy guest command line\n");
+            return rc;
+        }
+        start_info.cmdline_paddr = last_addr;
+        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
+    }
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_guest_phys_vcpu(last_addr, &mod, sizeof(mod), v);
+        if ( rc )
+        {
+            printk("Unable to copy guest modules\n");
+            return rc;
+        }
+        start_info.modlist_paddr = last_addr;
+        start_info.nr_modules = 1;
+        last_addr += sizeof(mod);
+    }
+
+    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
+    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    rc = hvm_copy_to_guest_phys_vcpu(last_addr, &start_info,
+                                     sizeof(start_info), v);
+    if ( rc )
+    {
+        printk("Unable to copy start info to guest\n");
+        return rc;
+    }
+
+    *entry = parms.phys_entry;
+    *start_info_addr = last_addr;
+
+    return 0;
+}
+
 static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    paddr_t entry, start_info;
     int rc;
 
     printk("** Building a PVH Dom0 **\n");
@@ -1978,6 +2113,14 @@ static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = pvh_load_kernel(d, image, image_headroom, initrd, bootstrap_map(image),
+                         cmdline, &entry, &start_info);
+    if ( rc )
+    {
+        printk("Failed to load Dom0 kernel\n");
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0
From: Roger Pau Monne @ 2017-01-19 17:29 UTC
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

PVHv2 Dom0 is limited to 128 vCPUs, as are all HVM guests at the moment. Fix
dom0_max_vcpus so that it takes this limitation into account by checking the
dom0_pvh variable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Change since v4:
 - Fix coding style to match the rest of the function.

Changes since v3:
 - New in the series.
---
 xen/arch/x86/domain_build.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 4f5f712..55caa30 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -40,6 +40,7 @@
 
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
+#include <public/hvm/hvm_info_table.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -176,6 +177,8 @@ unsigned int __init dom0_max_vcpus(void)
         max_vcpus = opt_dom0_max_vcpus_max;
     if ( max_vcpus > MAX_VIRT_CPUS )
         max_vcpus = MAX_VIRT_CPUS;
+    if ( dom0_pvh && max_vcpus > HVM_MAX_VCPUS )
+        max_vcpus = HVM_MAX_VCPUS;
 
     return max_vcpus;
 }
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs
  2017-01-19 17:29 [PATCH v5 0/9] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (6 preceding siblings ...)
  2017-01-19 17:29 ` [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
@ 2017-01-19 17:29 ` Roger Pau Monne
  2017-01-26 13:46   ` Jan Beulich
  2017-01-19 17:29 ` [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
  8 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-19 17:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

Initialize Dom0 BSP/APs and setup the memory and IO permissions. This also sets
the initial BSP state in order to match the protocol specified in
hvm/start_info.h.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_build.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 55caa30..ca5d393 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -41,6 +41,7 @@
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
 #include <public/hvm/hvm_info_table.h>
+#include <public/hvm/hvm_vcpu.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -2096,6 +2097,56 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
     return 0;
 }
 
+static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
+                                 paddr_t start_info)
+{
+    vcpu_hvm_context_t cpu_ctx;
+    struct vcpu *v = d->vcpu[0];
+    int cpu, i, rc;
+
+    cpu = v->processor;
+    for ( i = 1; i < d->max_vcpus; i++ )
+    {
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_dom0_vcpu(d, i, cpu);
+    }
+
+    memset(&cpu_ctx, 0, sizeof(cpu_ctx));
+
+    cpu_ctx.mode = VCPU_HVM_MODE_32B;
+
+    cpu_ctx.cpu_regs.x86_32.ebx = start_info;
+    cpu_ctx.cpu_regs.x86_32.eip = entry;
+    cpu_ctx.cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET;
+
+    cpu_ctx.cpu_regs.x86_32.cs_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ds_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ss_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.tr_limit = 0x67;
+    cpu_ctx.cpu_regs.x86_32.cs_ar = 0xc9b;
+    cpu_ctx.cpu_regs.x86_32.ds_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.ss_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.tr_ar = 0x8b;
+
+    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
+    if ( rc )
+    {
+        printk("Unable to setup Dom0 BSP context: %d\n", rc);
+        return rc;
+    }
+
+    rc = setup_permissions(d);
+    if ( rc )
+        panic("Unable to setup Dom0 permissions: %d\n", rc);
+
+    update_domain_wallclock_time(d);
+
+    return 0;
+}
+
 static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2124,6 +2175,15 @@ static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = pvh_setup_cpus(d, entry, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 CPUs: %d\n", rc);
+        return rc;
+    }
+
+    clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
+
     return 0;
 }
 
-- 
2.10.1 (Apple Git-78)



* [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-01-19 17:29 [PATCH v5 0/9] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (7 preceding siblings ...)
  2017-01-19 17:29 ` [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs Roger Pau Monne
@ 2017-01-19 17:29 ` Roger Pau Monne
  2017-01-26 14:16   ` Jan Beulich
  8 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-19 17:29 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich,
	konrad.wilk

Create a new MADT table that contains the topology exposed to the guest. A
new XSDT table is also created, in order to filter the tables that we want
to expose to the guest, plus the Xen crafted MADT. This in turn requires Xen
to also create a new RSDP in order to make it point to the custom XSDT.

Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
reside in reserved regions. This is needed because some memory maps don't
properly account for all the memory used by ACPI, so it's common to find ACPI
tables in reserved regions.
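The end result is that Dom0 sees a table chain along these lines (the
native RSDP stays in place, but points to the native XSDT and is not
expected to be used):

  RSDP (Xen-crafted)
   -> XSDT (Xen-crafted)
       -> MADT (Xen-crafted)
       -> native tables that pass the filtering described above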

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v4:
 - s/hvm/pvh.
 - Use hvm_copy_to_guest_phys_vcpu.
 - Don't allocate up to E820MAX entries for the Dom0 memory map and instead
   allow pvh_add_mem_range to dynamically grow the memory map.
 - Add a comment about the lack of x2APIC MADT entries.
 - Change acpi_intr_overrides to unsigned int and the max iterator bound to
   UINT_MAX.
 - Set the MADT version as the minimum version between the hardware value and
   our supported version (4).
 - Set the MADT IO APIC ID to the current value of the domain vioapic->id.
 - Use void * when subtracting two pointers.
 - Fix indentation of nr_pages and use PFN_UP instead of DIV_ROUND_UP.
 - Change wording of the pvh_acpi_table_allowed error message.
 - Make j unsigned in pvh_setup_acpi_xsdt.
 - Move initialization of local variables with declarations in
   pvh_setup_acpi_xsdt.
 - Reword the comment about the allocated size of the xsdt custom table.
 - Fix line splitting.
 - Add a comment regarding the layering violation caused by the usage of
   acpi_tb_checksum.
 - Pass IO APIC NMI sources found in the MADT to Dom0.
 - Create x2APIC entries if the native MADT also contains them.
 - s/acpi_intr_overrrides/acpi_intr_overrides/.
 - Make sure the MADT is properly mapped into Dom0, or else Dom0 might not be
   able to access the output of the _MAT method depending on the
   implementation.
 - Get the first ACPI processor ID and use that as the base processor ID of the
   crafted MADT. This is done so that local/x2 APIC NMI entries match with the
   local/x2 APIC objects.

Changes since v3:
 - Use hvm_copy_to_phys in order to copy the tables to Dom0 memory.
 - Return EEXIST for overlapping ranges in hvm_add_mem_range.
 - s/ov/ovr/ for interrupt override parsing functions.
 - Constify intr local variable in acpi_set_intr_ovr.
 - Use structure assignment for type safety.
 - Perform sizeof using local variables in hvm_setup_acpi_madt.
 - Manually set revision of crafted/modified tables.
 - Only map tables to guest that reside in reserved or ACPI memory regions.
 - Copy the RSDP OEM signature to the crafted RSDP.
 - Pair calls to acpi_os_map_memory/acpi_os_unmap_memory.
 - Add memory regions for allowed ACPI tables to the memory map and then
   perform the identity mappings. This avoids having to call modify_identity_mmio
   multiple times.
 - Add a FIXME comment regarding the lack of multiple vIO-APICs.

Changes since v2:
 - Completely reworked.
---
 xen/arch/x86/domain_build.c | 517 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 517 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index ca5d393..3331f3b 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -23,6 +23,7 @@
 #include <xen/libelf.h>
 #include <xen/pfn.h>
 #include <xen/guest_access.h>
+#include <xen/acpi.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -38,6 +39,8 @@
 #include <asm/io_apic.h>
 #include <asm/hpet.h>
 
+#include <acpi/actables.h>
+
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
 #include <public/hvm/hvm_info_table.h>
@@ -50,6 +53,14 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
 /* Size of the VM86 TSS for virtual 8086 mode to use. */
 #define HVM_VM86_TSS_SIZE   128
 
+static unsigned int __initdata acpi_intr_overrides;
+static struct acpi_madt_interrupt_override __initdata *intsrcovr;
+
+static unsigned int __initdata acpi_nmi_sources;
+static struct acpi_madt_nmi_source __initdata *nmisrc;
+
+static bool __initdata acpi_x2apic;
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -1812,6 +1823,58 @@ static int __init pvh_steal_ram(struct domain *d, unsigned long size,
     return -ENOMEM;
 }
 
+/* NB: memory map must be sorted at all times for this to work correctly. */
+static int __init pvh_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
+                                    unsigned int type)
+{
+    struct e820entry *map;
+    unsigned int i;
+
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        uint64_t rs = d->arch.e820[i].addr;
+        uint64_t re = rs + d->arch.e820[i].size;
+
+        if ( rs == e && d->arch.e820[i].type == type )
+        {
+            d->arch.e820[i].addr = s;
+            return 0;
+        }
+
+        if ( re == s && d->arch.e820[i].type == type &&
+             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )
+        {
+            d->arch.e820[i].size += e - s;
+            return 0;
+        }
+
+        if ( rs >= e )
+            break;
+
+        if ( re > s )
+            return -EEXIST;
+    }
+
+    map = xzalloc_array(struct e820entry, d->arch.nr_e820 + 1);
+    if ( !map )
+    {
+        printk(XENLOG_WARNING "E820: out of memory to add region\n");
+        return -ENOMEM;
+    }
+
+    memcpy(map, d->arch.e820, i * sizeof(*d->arch.e820));
+    memcpy(map + i + 1, d->arch.e820 + i,
+           (d->arch.nr_e820 - i) * sizeof(*d->arch.e820));
+    map[i].addr = s;
+    map[i].size = e - s;
+    map[i].type = type;
+    xfree(d->arch.e820);
+    d->arch.e820 = map;
+    d->arch.nr_e820++;
+
+    return 0;
+}
+
 static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
 {
     p2m_type_t p2mt;
@@ -2147,6 +2210,453 @@ static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
     return 0;
 }
 
+static int __init acpi_count_intr_ovr(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+    acpi_intr_overrides++;
+    return 0;
+}
+
+static int __init acpi_set_intr_ovr(struct acpi_subtable_header *header,
+                                    const unsigned long end)
+{
+    const struct acpi_madt_interrupt_override *intr =
+        container_of(header, struct acpi_madt_interrupt_override, header);
+
+    *intsrcovr = *intr;
+    intsrcovr++;
+
+    return 0;
+}
+
+static int __init acpi_count_nmi_src(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+    acpi_nmi_sources++;
+    return 0;
+}
+
+static int __init acpi_set_nmi_src(struct acpi_subtable_header *header,
+                                   const unsigned long end)
+{
+    const struct acpi_madt_nmi_source *src =
+        container_of(header, struct acpi_madt_nmi_source, header);
+
+    *nmisrc = *src;
+    nmisrc++;
+
+    return 0;
+}
+
+static int __init acpi_check_x2apic(struct acpi_subtable_header *header,
+                                    const unsigned long end)
+{
+    acpi_x2apic = true;
+    return 0;
+}
+
+static int __init pvh_setup_acpi_madt(struct domain *d, paddr_t *addr)
+{
+    struct acpi_table_madt *madt;
+    struct acpi_table_header *table;
+    struct acpi_madt_io_apic *io_apic;
+    struct acpi_madt_local_apic *local_apic;
+    struct acpi_madt_local_x2apic *x2apic;
+    acpi_status status;
+    unsigned long size;
+    unsigned int i, max_vcpus;
+    int rc;
+
+    /* Count number of interrupt overrides in the MADT. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
+                          acpi_count_intr_ovr, UINT_MAX);
+
+    /* Count number of NMI sources in the MADT. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_count_nmi_src,
+                          UINT_MAX);
+
+    /* Check if there are x2APIC entries. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC, acpi_check_x2apic, 1);
+
+    max_vcpus = dom0_max_vcpus();
+    /* Calculate the size of the crafted MADT. */
+    size = sizeof(*madt);
+    size += sizeof(*io_apic);
+    /*
+     * The APIC ID field of the local APIC struct is only 1 byte wide, so we
+     * need to limit the number of local APIC entries to 128, because we
+     * only use even numbers as APIC IDs (half of the 8-bit ID space).
+     */
+    size += sizeof(*local_apic) *
+            min(max_vcpus, 1U << (sizeof(local_apic->id) * 8 - 1));
+    size += sizeof(*intsrcovr) * acpi_intr_overrides;
+    size += sizeof(*nmisrc) * acpi_nmi_sources;
+    size += acpi_x2apic ? sizeof(*x2apic) * dom0_max_vcpus() : 0;
+
+    madt = xzalloc_bytes(size);
+    if ( !madt )
+    {
+        printk("Unable to allocate memory for MADT table\n");
+        return -ENOMEM;
+    }
+
+    /* Copy the native MADT table header. */
+    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
+    if ( !ACPI_SUCCESS(status) )
+    {
+        printk("Failed to get MADT ACPI table, aborting.\n");
+        return -EINVAL;
+    }
+    madt->header = *table;
+    madt->address = APIC_DEFAULT_PHYS_BASE;
+    /*
+     * NB: this is currently set to 4, which is the revision in the ACPI
+     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
+     * tables described in the headers.
+     */
+    madt->header.revision = min_t(unsigned char, table->revision, 4);
+
+    /*
+     * Setup the IO APIC entry.
+     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
+     * per domain. This must be fixed in order to provide the same amount of
+     * IO APICs as available on bare metal, and with the same IDs as found in
+     * the native IO APIC MADT entries.
+     */
+    if ( nr_ioapics > 1 )
+        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
+               nr_ioapics);
+    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
+    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
+    io_apic->header.length = sizeof(*io_apic);
+    io_apic->id = domain_vioapic(d)->id;
+    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
+
+    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
+    for ( i = 0; i < min(max_vcpus, 1U << (sizeof(local_apic->id) * 8)); i++ )
+    {
+        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
+        local_apic->header.length = sizeof(*local_apic);
+        local_apic->processor_id = i;
+        local_apic->id = i * 2;
+        local_apic->lapic_flags = ACPI_MADT_ENABLED;
+        local_apic++;
+    }
+
+    /* Setup interrupt overrides. */
+    intsrcovr = (struct acpi_madt_interrupt_override *)local_apic;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ovr,
+                          acpi_intr_overrides);
+
+    /* Setup NMI sources. */
+    nmisrc = (struct acpi_madt_nmi_source *)intsrcovr;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_set_nmi_src,
+                          acpi_nmi_sources);
+
+    /*
+     * NB: x2APIC entries will only be provided if they are also present
+     * in the native MADT.
+     */
+    x2apic = (struct acpi_madt_local_x2apic *)nmisrc;
+    for ( i = 0; acpi_x2apic && i < dom0_max_vcpus(); i++ )
+    {
+        x2apic->header.type = ACPI_MADT_TYPE_LOCAL_X2APIC;
+        x2apic->header.length = sizeof(*x2apic);
+        x2apic->uid = i;
+        x2apic->local_apic_id = i * 2;
+        x2apic->lapic_flags = ACPI_MADT_ENABLED;
+        x2apic++;
+    }
+
+    ASSERT(((void *)x2apic - (void *)madt) == size);
+    madt->header.length = size;
+    /*
+     * Calling acpi_tb_checksum here is a layering violation, but
+     * introducing a wrapper for such simple usage seems overkill.
+     */
+    madt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, madt), size);
+
+    /* Place the new MADT in guest memory space. */
+    if ( pvh_steal_ram(d, size, GB(4), addr) )
+    {
+        printk("Unable to find allocate guest RAM for MADT\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( pvh_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add MADT region to memory map\n");
+
+    rc = hvm_copy_to_guest_phys_vcpu(*addr, madt, size, d->vcpu[0]);
+    xfree(madt);
+    if ( rc )
+    {
+        printk("Unable to copy MADT into guest memory\n");
+        return rc;
+    }
+
+    return 0;
+}
+
+static bool __init acpi_memory_banned(unsigned long address,
+                                      unsigned long size)
+{
+    unsigned long mfn, nr_pages, i;
+
+    mfn = PFN_DOWN(address);
+    nr_pages = PFN_UP((address & ~PAGE_MASK) + size);
+    for ( i = 0; i < nr_pages; i++ )
+        if ( !page_is_ram_type(mfn + i, RAM_TYPE_RESERVED) &&
+             !page_is_ram_type(mfn + i, RAM_TYPE_ACPI) )
+            return true;
+
+    return false;
+}
+
+static bool __init pvh_acpi_table_allowed(const char *sig,
+                                          unsigned long address,
+                                          unsigned long size)
+{
+    static const char __init banned_tables[][ACPI_NAME_SIZE] = {
+        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
+        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
+    unsigned int i;
+
+    for ( i = 0; i < ARRAY_SIZE(banned_tables); i++ )
+        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
+            return false;
+
+    /* Make sure the table doesn't reside in a RAM region. */
+    if ( acpi_memory_banned(address, size) )
+    {
+        printk("Skipping table %.4s because it resides in a non-ACPI, non-reserved region\n",
+               sig);
+        return false;
+    }
+
+    return true;
+}
+
+static int __init pvh_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
+                                      paddr_t *addr)
+{
+    struct acpi_table_xsdt *xsdt;
+    struct acpi_table_header *table;
+    struct acpi_table_rsdp *rsdp;
+    unsigned long size = sizeof(*xsdt);
+    unsigned int i, j, num_tables = 0;
+    paddr_t xsdt_paddr;
+    int rc;
+
+    /*
+     * Restore the original DMAR table signature: we are going to filter it
+     * from the new XSDT that is presented to the guest, so it no longer
+     * makes sense to have its signature zapped.
+     */
+    acpi_dmar_reinstate();
+
+    /* Count the number of tables that will be added to the XSDT. */
+    for ( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( pvh_acpi_table_allowed(sig,
+                                    acpi_gbl_root_table_list.tables[i].address,
+                                    acpi_gbl_root_table_list.tables[i].length) )
+            num_tables++;
+    }
+
+    /*
+     * No need to add or subtract anything: struct acpi_table_xsdt already
+     * includes one array slot, and the original MADT we have filtered out
+     * makes room for the custom-built MADT we are going to add.
+     */
+    size += num_tables * sizeof(xsdt->table_offset_entry[0]);
+
+    xsdt = xzalloc_bytes(size);
+    if ( !xsdt )
+    {
+        printk("Unable to allocate memory for XSDT table\n");
+        return -ENOMEM;
+    }
+
+    /* Copy the native XSDT table header. */
+    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
+    if ( !rsdp )
+    {
+        printk("Unable to map RSDP\n");
+        return -EINVAL;
+    }
+    xsdt_paddr = rsdp->xsdt_physical_address;
+    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
+    table = acpi_os_map_memory(xsdt_paddr, sizeof(*table));
+    if ( !table )
+    {
+        printk("Unable to map XSDT\n");
+        return -EINVAL;
+    }
+    xsdt->header = *table;
+    acpi_os_unmap_memory(table, sizeof(*table));
+
+    /* Add the custom MADT. */
+    xsdt->table_offset_entry[0] = madt_addr;
+
+    /* Copy the addresses of the rest of the allowed tables. */
+    for ( i = 0, j = 1; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( pvh_acpi_table_allowed(sig,
+                                    acpi_gbl_root_table_list.tables[i].address,
+                                    acpi_gbl_root_table_list.tables[i].length) )
+            xsdt->table_offset_entry[j++] =
+                acpi_gbl_root_table_list.tables[i].address;
+    }
+
+    xsdt->header.revision = 1;
+    xsdt->header.length = size;
+    /*
+     * Calling acpi_tb_checksum here is a layering violation, but
+     * introducing a wrapper for such simple usage seems overkill.
+     */
+    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
+
+    /* Place the new XSDT in guest memory space. */
+    if ( pvh_steal_ram(d, size, GB(4), addr) )
+    {
+        printk("Unable to find guest RAM for XSDT\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( pvh_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add XSDT region to memory map\n");
+
+    rc = hvm_copy_to_guest_phys_vcpu(*addr, xsdt, size, d->vcpu[0]);
+    xfree(xsdt);
+    if ( rc )
+    {
+        printk("Unable to copy XSDT into guest memory\n");
+        return rc;
+    }
+
+    return 0;
+}
+
+static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
+{
+    struct acpi_table_rsdp rsdp, *native_rsdp;
+    unsigned long pfn, nr_pages;
+    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
+    unsigned int i;
+    int rc;
+
+    /* Scan top-level tables and add their regions to the guest memory map. */
+    for ( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+        unsigned long addr = acpi_gbl_root_table_list.tables[i].address;
+        unsigned long size = acpi_gbl_root_table_list.tables[i].length;
+
+        /*
+         * Make sure the original MADT is also mapped, so that Dom0 can
+         * properly access the data returned by _MAT methods in case it's
+         * re-using MADT memory.
+         */
+        if ( pvh_acpi_table_allowed(sig, addr, size) ||
+             (strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE) == 0 &&
+              !acpi_memory_banned(addr, size)) )
+            pvh_add_mem_range(d, addr, addr + size, E820_ACPI);
+    }
+
+    /* Identity map ACPI e820 regions. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_ACPI &&
+             d->arch.e820[i].type != E820_NVS )
+            continue;
+
+        pfn = PFN_DOWN(d->arch.e820[i].addr);
+        nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
+                          d->arch.e820[i].size);
+
+        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        if ( rc )
+        {
+            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+            return rc;
+        }
+    }
+
+    rc = pvh_setup_acpi_madt(d, &madt_paddr);
+    if ( rc )
+        return rc;
+
+    rc = pvh_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
+    if ( rc )
+        return rc;
+
+    /* Craft a custom RSDP. */
+    memset(&rsdp, 0, sizeof(rsdp));
+    memcpy(rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
+    native_rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(rsdp));
+    if ( !native_rsdp )
+    {
+        printk("Unable to map native RSDP\n");
+        return -EINVAL;
+    }
+    memcpy(rsdp.oem_id, native_rsdp->oem_id, sizeof(rsdp.oem_id));
+    acpi_os_unmap_memory(native_rsdp, sizeof(rsdp));
+    rsdp.revision = 2;
+    rsdp.xsdt_physical_address = xsdt_paddr;
+    rsdp.rsdt_physical_address = 0;
+    rsdp.length = sizeof(rsdp);
+    /*
+     * Calling acpi_tb_checksum here is a layering violation, but
+     * introducing a wrapper for such simple usage seems overkill.
+     */
+    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                      ACPI_RSDP_REV0_SIZE);
+    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                               sizeof(rsdp));
+
+    /*
+     * Place the new RSDP in guest memory space.
+     *
+     * NB: this RSDP is not going to replace the original RSDP, which
+     * should still be accessible to the guest. However that RSDP is
+     * going to point to the native RSDT, and should not be used.
+     */
+    if ( pvh_steal_ram(d, sizeof(rsdp), GB(4), &rsdp_paddr) )
+    {
+        printk("Unable to allocate guest RAM for RSDP\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( pvh_add_mem_range(d, rsdp_paddr, rsdp_paddr + sizeof(rsdp),
+                           E820_ACPI) )
+        printk("Unable to add RSDP region to memory map\n");
+
+    /* Copy RSDP into guest memory. */
+    rc = hvm_copy_to_guest_phys_vcpu(rsdp_paddr, &rsdp, sizeof(rsdp),
+                                     d->vcpu[0]);
+    if ( rc )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return rc;
+    }
+
+    /* Copy RSDP address to start_info. */
+    rc = hvm_copy_to_guest_phys_vcpu(start_info +
+                                     offsetof(struct hvm_start_info,
+                                              rsdp_paddr),
+                                     &rsdp_paddr,
+                                     sizeof(((struct hvm_start_info *)
+                                             0)->rsdp_paddr), d->vcpu[0]);
+    if ( rc )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return rc;
+    }
+
+    return 0;
+}
+
 static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2182,6 +2692,13 @@ static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = pvh_setup_acpi(d, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 ACPI tables: %d\n", rc);
+        return rc;
+    }
+
     clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
 
     return 0;
-- 
2.10.1 (Apple Git-78)



* Re: [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0
  2017-01-19 17:29 ` [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
@ 2017-01-19 17:32   ` Andrew Cooper
  2017-01-26 13:39   ` Jan Beulich
  1 sibling, 0 replies; 68+ messages in thread
From: Andrew Cooper @ 2017-01-19 17:32 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: boris.ostrovsky, Jan Beulich, konrad.wilk

On 19/01/17 17:29, Roger Pau Monne wrote:
> PVHv2 Dom0 is limited to 128 vCPUs, as are all HVM guests at the moment. Fix
> dom0_max_vcpus so it takes this limitation into account by poking at the
> dom0_pvh variable.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2017-01-19 17:29 ` [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
@ 2017-01-20  6:41   ` Tian, Kevin
  2017-01-20 10:28     ` Roger Pau Monne
  2017-01-20 18:44   ` Andrew Cooper
  1 sibling, 1 reply; 68+ messages in thread
From: Tian, Kevin @ 2017-01-20  6:41 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Nakajima, Jun, konrad.wilk, George Dunlap, Andrew Cooper,
	Jan Beulich, boris.ostrovsky

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: Friday, January 20, 2017 1:30 AM
> 
> There's nothing wrong with allowing the domain to perform DMA transfers to
> MMIO areas that it already can access from the CPU, and this allows us to
> remove the hack in set_identity_p2m_entry for PVH Dom0.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[...]
> @@ -840,6 +840,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t
> p2mt)
>      case p2m_grant_map_ro:
>          flags = IOMMUF_readable;
>          break;
> +    case p2m_mmio_direct:
> +        flags = IOMMUF_readable;
> +        if ( rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
> +            flags |= IOMMUF_writable;

I suppose you meant !rangeset_contains_singleton...
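i.e. something like:

    case p2m_mmio_direct:
        flags = IOMMUF_readable;
        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
            flags |= IOMMUF_writable;
        break;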

Thanks
Kevin

* Re: [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2017-01-20  6:41   ` Tian, Kevin
@ 2017-01-20 10:28     ` Roger Pau Monne
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-20 10:28 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Nakajima, Jun, konrad.wilk, George Dunlap, Andrew Cooper,
	Jan Beulich, xen-devel, boris.ostrovsky

On Fri, Jan 20, 2017 at 06:41:06AM +0000, Tian, Kevin wrote:
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: Friday, January 20, 2017 1:30 AM
> > 
> > There's nothing wrong with allowing the domain to perform DMA transfers to
> > MMIO areas that it already can access from the CPU, and this allows us to
> > remove the hack in set_identity_p2m_entry for PVH Dom0.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> [...]
> > @@ -840,6 +840,10 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t
> > p2mt)
> >      case p2m_grant_map_ro:
> >          flags = IOMMUF_readable;
> >          break;
> > +    case p2m_mmio_direct:
> > +        flags = IOMMUF_readable;
> > +        if ( rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) )
> > +            flags |= IOMMUF_writable;
> 
> I suppose you meant !rangeset_contains_singleton...

Yes, thanks for catching that, it's now fixed.

Roger.


* Re: [PATCH v5 1/9] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2017-01-19 17:29 ` [PATCH v5 1/9] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
@ 2017-01-20 18:41   ` Andrew Cooper
  2017-01-23 12:28     ` Roger Pau Monne
  0 siblings, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-20 18:41 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: boris.ostrovsky, Jan Beulich, konrad.wilk

On 19/01/17 17:29, Roger Pau Monne wrote:
> @@ -3768,7 +3770,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      case PHYSDEVOP_eoi:
>      case PHYSDEVOP_irq_status_query:
>      case PHYSDEVOP_get_free_pirq:
> -        return do_physdev_op(cmd, arg);
> +        return ((currd->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||

How about:

#define uses_pirqs(d)      (!!((d)->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ))

to match the other emulation flags predicates (or some suitable naming)?

Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2017-01-19 17:29 ` [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
  2017-01-20  6:41   ` Tian, Kevin
@ 2017-01-20 18:44   ` Andrew Cooper
  2017-01-22  4:45     ` Tian, Kevin
  1 sibling, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-20 18:44 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Kevin Tian, Jun Nakajima, konrad.wilk, George Dunlap,
	Jan Beulich, boris.ostrovsky

On 19/01/17 17:29, Roger Pau Monne wrote:
> There's nothing wrong with allowing the domain to perform DMA transfers to
> MMIO areas that it already can access from the CPU, and this allows us to
> remove the hack in set_identity_p2m_entry for PVH Dom0.
>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Subject to Kevin's fix, Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>


* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-19 17:29 ` [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2017-01-20 19:03   ` Andrew Cooper
  2017-01-23 12:58     ` Roger Pau Monne
  2017-01-20 19:13   ` Boris Ostrovsky
  2017-01-26 11:43   ` Jan Beulich
  2 siblings, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-20 19:03 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: boris.ostrovsky, Jan Beulich, konrad.wilk

On 19/01/17 17:29, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 0ccef1d..f52f269 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1545,6 +1545,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( opt_dom0pvh )
>          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>  
> +    if ( dom0_pvh )
> +    {
> +        panic("Building a PVHv2 Dom0 is not yet supported.");

This would be cleaner immediately before the return 0 of
construct_dom0_pvh(), would it not?

Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>



* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-19 17:29 ` [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
  2017-01-20 19:03   ` Andrew Cooper
@ 2017-01-20 19:13   ` Boris Ostrovsky
  2017-01-20 19:27     ` Andrew Cooper
  2017-01-26 11:43   ` Jan Beulich
  2 siblings, 1 reply; 68+ messages in thread
From: Boris Ostrovsky @ 2017-01-20 19:13 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Andrew Cooper, Jan Beulich, konrad.wilk


> diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
> index 243df96..4d555b1 100644
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -191,11 +191,40 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
>  }
>  
>  #ifdef CONFIG_SHADOW_PAGING
> -static bool_t __initdata opt_dom0_shadow;
> +bool __initdata opt_dom0_shadow;
>  boolean_param("dom0_shadow", opt_dom0_shadow);

This (boolean_param) is not needed any longer.



> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 0ccef1d..f52f269 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1545,6 +1545,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( opt_dom0pvh )
>          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>  
> +    if ( dom0_pvh )
> +    {
> +        panic("Building a PVHv2 Dom0 is not yet supported.");
> +        domcr_flags |= DOMCRF_hvm |
> +                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
> +                         DOMCRF_hap : 0);


This probably has been discussed already --- are we going to support
shadow paging for PVH?

-boris





* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-20 19:13   ` Boris Ostrovsky
@ 2017-01-20 19:27     ` Andrew Cooper
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Cooper @ 2017-01-20 19:27 UTC (permalink / raw)
  To: Boris Ostrovsky, Roger Pau Monne, xen-devel; +Cc: Jan Beulich, konrad.wilk

On 20/01/17 19:13, Boris Ostrovsky wrote:
>> diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
>> index 243df96..4d555b1 100644
>> --- a/xen/arch/x86/domain_build.c
>> +++ b/xen/arch/x86/domain_build.c
>> @@ -191,11 +191,40 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
>>  }
>>  
>>  #ifdef CONFIG_SHADOW_PAGING
>> -static bool_t __initdata opt_dom0_shadow;
>> +bool __initdata opt_dom0_shadow;
>>  boolean_param("dom0_shadow", opt_dom0_shadow);
> This (boolean_param) is not needed any longer.

Yes it is, for backwards compatibility.

Which is a good point.  The docs change should be modified to still refer
to it, mark it as deprecated, and point at dom0=shadow.

>
>
>
>> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
>> index 0ccef1d..f52f269 100644
>> --- a/xen/arch/x86/setup.c
>> +++ b/xen/arch/x86/setup.c
>> @@ -1545,6 +1545,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>      if ( opt_dom0pvh )
>>          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>>  
>> +    if ( dom0_pvh )
>> +    {
>> +        panic("Building a PVHv2 Dom0 is not yet supported.");
>> +        domcr_flags |= DOMCRF_hvm |
>> +                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
>> +                         DOMCRF_hap : 0);
>
> This probably has been discussed already --- are we going to support
> shadow paging for PVH?

ISTR agreeing that it should stay.  No point excluding something which
already works and is already supported in other cases.

~Andrew


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-19 17:29 ` [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2017-01-20 19:41   ` Andrew Cooper
  2017-01-23 11:23     ` Jan Beulich
  2017-01-23 14:11   ` Boris Ostrovsky
  2017-01-26 12:41   ` Jan Beulich
  2 siblings, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-20 19:41 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: boris.ostrovsky, Jan Beulich, konrad.wilk

On 19/01/17 17:29, Roger Pau Monne wrote:
> +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> +{
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr;
> +    unsigned int i;
> +
> +    /*
> +     * Steal some space from the last found RAM region. One page will be
> +     * used for the identity page tables, and the remaining space for the
> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                              &mfn, &p2mt, 0, &rc);
> +    if ( ident_pt == NULL )
> +    {
> +        printk("Unable to map identity page tables\n");
> +        return -ENOMEM;
> +    }
> +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);

Please can you make a helper for this and dedup it with shadow_enable(). 
Something like:

void write_pse_identmap(uint32_t *l2)

rather than duplicating this particular piece of magic.  (It can
probably even be static inline.)
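As a rough sketch (exact name and placement up for grabs):

    static inline void write_pse_identmap(uint32_t *l2)
    {
        unsigned int i;

        /* Fill a 32-bit non-PAE page directory with 4M identity entries. */
        for ( i = 0; i < PAGE_SIZE / sizeof(*l2); i++ )
            l2[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
                     _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
    }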

> +    unmap_domain_page(ident_pt);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> +    gaddr += PAGE_SIZE;
> +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
> +
> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);
> +    if ( tss == NULL )
> +    {
> +        printk("Unable to map VM86 TSS area\n");
> +        return 0;
> +    }
> +
> +    memset(tss, 0, HVM_VM86_TSS_SIZE);

Do we actually need to 0 this?  Don't we guarantee to hand out zero'd
pages during construction?  (I can't actually recall.  Perhaps it is
better to explicitly clear it.)

~Andrew


* Re: [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function
  2017-01-19 17:29 ` [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function Roger Pau Monne
@ 2017-01-20 19:45   ` Andrew Cooper
  2017-01-23 13:50     ` Roger Pau Monne
  2017-01-26 12:51   ` Jan Beulich
  2017-01-26 13:33   ` Jan Beulich
  2 siblings, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-20 19:45 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: boris.ostrovsky, konrad.wilk

On 19/01/17 17:29, Roger Pau Monne wrote:
> @@ -3172,9 +3173,9 @@ static enum hvm_copy_result __hvm_copy(
>              {
>                  static unsigned long lastpage;
>                  if ( xchg(&lastpage, gfn) != gfn )
> -                    gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
> +                    printk(XENLOG_DEBUG, "%pv guest attempted write to read-only"
>                               " memory page. gfn=%#lx, mfn=%#lx\n",
> -                             gfn, page_to_mfn(page));
> +                             vcpu->domain, gfn, page_to_mfn(page));

Just vcpu here, or hitting this printk() will try to follow
d->shared_info as if it were v->domain.
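i.e.:

    printk(XENLOG_DEBUG, "%pv guest attempted write to read-only"
           " memory page. gfn=%#lx, mfn=%#lx\n",
           vcpu, gfn, page_to_mfn(page));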

With this fixed, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2017-01-20 18:44   ` Andrew Cooper
@ 2017-01-22  4:45     ` Tian, Kevin
  0 siblings, 0 replies; 68+ messages in thread
From: Tian, Kevin @ 2017-01-22  4:45 UTC (permalink / raw)
  To: Andrew Cooper, Roger Pau Monne, xen-devel
  Cc: George Dunlap, boris.ostrovsky, Jan Beulich, Nakajima, Jun

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Saturday, January 21, 2017 2:44 AM
> 
> On 19/01/17 17:29, Roger Pau Monne wrote:
> > There's nothing wrong with allowing the domain to perform DMA transfers to
> > MMIO areas that it already can access from the CPU, and this allows us to
> > remove the hack in set_identity_p2m_entry for PVH Dom0.
> >
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Subject to Kevin's fix, Reviewed-by: Andrew Cooper
> <andrew.cooper3@citrix.com>

If you won't send out a new version, then here is mine:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>



* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-20 19:41   ` Andrew Cooper
@ 2017-01-23 11:23     ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-23 11:23 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky, Roger Pau Monne

>>> On 20.01.17 at 20:41, <andrew.cooper3@citrix.com> wrote:
> On 19/01/17 17:29, Roger Pau Monne wrote:
>> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>> +                         &mfn, &p2mt, 0, &rc);
>> +    if ( tss == NULL )
>> +    {
>> +        printk("Unable to map VM86 TSS area\n");
>> +        return 0;
>> +    }
>> +
>> +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> 
> Do we actually need to 0 this?  Don't we guarantee to hand out zero'd
> pages during construction?  (I can't actually recall.  Perhaps it is
> better to explicitly clear it.)

No, we don't zero before handing out; we zero after reclaiming
memory from a dying guest or from the hypervisor.

Jan



* Re: [PATCH v5 1/9] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2017-01-20 18:41   ` Andrew Cooper
@ 2017-01-23 12:28     ` Roger Pau Monne
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-23 12:28 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky, Jan Beulich

On Fri, Jan 20, 2017 at 06:41:21PM +0000, Andrew Cooper wrote:
> On 19/01/17 17:29, Roger Pau Monne wrote:
> > @@ -3768,7 +3770,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      case PHYSDEVOP_eoi:
> >      case PHYSDEVOP_irq_status_query:
> >      case PHYSDEVOP_get_free_pirq:
> > -        return do_physdev_op(cmd, arg);
> > +        return ((currd->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
> 
> How about:
> 
> #define uses_pirqs(d)      (!!((d)->arch.emulation_flags &
> XEN_X86_EMU_USE_PIRQ))

I've named it "has_pirq" in order to match the other options, which all use
the has_ prefix (has_use_pirq sounds too strange IMHO).
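i.e. something like:

#define has_pirq(d)        (!!((d)->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ))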

> to match the other emulation flags predicates (or some suitable naming)?
> 
> Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks.

Roger.


* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-20 19:03   ` Andrew Cooper
@ 2017-01-23 12:58     ` Roger Pau Monne
  2017-01-23 12:59       ` Andrew Cooper
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-23 12:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky, Jan Beulich

On Fri, Jan 20, 2017 at 07:03:10PM +0000, Andrew Cooper wrote:
> On 19/01/17 17:29, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> > index 0ccef1d..f52f269 100644
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -1545,6 +1545,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >      if ( opt_dom0pvh )
> >          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
> >  
> > +    if ( dom0_pvh )
> > +    {
> > +        panic("Building a PVHv2 Dom0 is not yet supported.");
> 
> This would be cleaner immediately before the return 0 of
> construct_dom0_pvh(), would it not?
> 
> Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

I've moved the panic to the end of construct_dom0_pvh, and marked dom0_shadow
as deprecated in the document, would you like me to keep the R-B?

Roger.


* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-23 12:58     ` Roger Pau Monne
@ 2017-01-23 12:59       ` Andrew Cooper
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Cooper @ 2017-01-23 12:59 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky, Jan Beulich

On 23/01/17 12:58, Roger Pau Monne wrote:
> On Fri, Jan 20, 2017 at 07:03:10PM +0000, Andrew Cooper wrote:
>> On 19/01/17 17:29, Roger Pau Monne wrote:
>>> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
>>> index 0ccef1d..f52f269 100644
>>> --- a/xen/arch/x86/setup.c
>>> +++ b/xen/arch/x86/setup.c
>>> @@ -1545,6 +1545,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>      if ( opt_dom0pvh )
>>>          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>>>  
>>> +    if ( dom0_pvh )
>>> +    {
>>> +        panic("Building a PVHv2 Dom0 is not yet supported.");
>> This would be cleaner immediately before the return 0 of
>> construct_dom0_pvh(), would it not?
>>
>> Otherwise, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
I've moved the panic to the end of construct_dom0_pvh and marked dom0_shadow
as deprecated in the docs; would you like me to keep the R-B?

Yeah - sounds fine.

~Andrew


* Re: [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function
  2017-01-20 19:45   ` Andrew Cooper
@ 2017-01-23 13:50     ` Roger Pau Monne
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-23 13:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky

On Fri, Jan 20, 2017 at 07:45:35PM +0000, Andrew Cooper wrote:
> On 19/01/17 17:29, Roger Pau Monne wrote:
> > @@ -3172,9 +3173,9 @@ static enum hvm_copy_result __hvm_copy(
> >              {
> >                  static unsigned long lastpage;
> >                  if ( xchg(&lastpage, gfn) != gfn )
> > -                    gdprintk(XENLOG_DEBUG, "guest attempted write to read-only"
> > +                    printk(XENLOG_DEBUG, "%pv guest attempted write to read-only"
> >                               " memory page. gfn=%#lx, mfn=%#lx\n",
> > -                             gfn, page_to_mfn(page));
> > +                             vcpu->domain, gfn, page_to_mfn(page));
> 
> Just vcpu here, or hitting this printk() will try to follow
> d->shared_info as if it were v->domain.
>
> With this fixed, Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks, not sure why I've used domain here instead of vcpu.

Roger.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-19 17:29 ` [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
  2017-01-20 19:41   ` Andrew Cooper
@ 2017-01-23 14:11   ` Boris Ostrovsky
  2017-01-23 14:43     ` Roger Pau Monne
  2017-01-26 12:41   ` Jan Beulich
  2 siblings, 1 reply; 68+ messages in thread
From: Boris Ostrovsky @ 2017-01-23 14:11 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel; +Cc: Andrew Cooper, Jan Beulich


>  
> +static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> +                                       unsigned long nr_pages, bool map)
> +{
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)

This can be taken outside the loop.

-boris

> +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> +            break;
> +        }
> +        nr_pages -= rc;
> +        pfn += rc;
> +        process_pending_softirqs();
> +    }
> +
> +    return rc;
> +}
> +




* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-23 14:11   ` Boris Ostrovsky
@ 2017-01-23 14:43     ` Roger Pau Monne
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-23 14:43 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: xen-devel, Andrew Cooper, Jan Beulich

On Mon, Jan 23, 2017 at 09:11:06AM -0500, Boris Ostrovsky wrote:
> 
> >  
> > +static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
> > +                                       unsigned long nr_pages, bool map)
> > +{
> > +    int rc;
> > +
> > +    for ( ; ; )
> > +    {
> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> 
> This can be taken outside the loop.

Maybe I can instead make map const, and the compiler should optimize this
itself?

I find it a little cumbersome to store function pointers, i.e.:

int (*mapf)(struct domain *, gfn_t, unsigned long, mfn_t) = ...;
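which spelled out in full would be something like:

    int (*mapf)(struct domain *, gfn_t, unsigned long, mfn_t) =
        map ? map_mmio_regions : unmap_mmio_regions;
    ...
    rc = mapf(d, _gfn(pfn), nr_pages, _mfn(pfn));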

Roger.


* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-19 17:29 ` [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
  2017-01-20 19:03   ` Andrew Cooper
  2017-01-20 19:13   ` Boris Ostrovsky
@ 2017-01-26 11:43   ` Jan Beulich
  2017-01-26 16:36     ` Roger Pau Monne
  2 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 11:43 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -191,11 +191,40 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
>  }
>  
>  #ifdef CONFIG_SHADOW_PAGING
> -static bool_t __initdata opt_dom0_shadow;
> +bool __initdata opt_dom0_shadow;
>  boolean_param("dom0_shadow", opt_dom0_shadow);
>  #else
> -#define opt_dom0_shadow 0
> +#define opt_dom0_shadow false

No need to duplicate the #define from the header here.

> @@ -1008,12 +1037,8 @@ int __init construct_dom0(
>      paddr_t mpt_alloc;
>  
>      /* Sanity! */
> -    BUG_ON(d->domain_id != 0);
> -    BUG_ON(d->vcpu[0] == NULL);
>      BUG_ON(v->is_initialised);

What about this last one?

Jan



* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-19 17:29 ` [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
  2017-01-20 19:41   ` Andrew Cooper
  2017-01-23 14:11   ` Boris Ostrovsky
@ 2017-01-26 12:41   ` Jan Beulich
  2017-01-27 11:14     ` Tim Deegan
  2017-01-27 12:23     ` Roger Pau Monne
  2 siblings, 2 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 12:41 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>  static long __initdata dom0_min_nrpages;
>  static long __initdata dom0_max_nrpages = LONG_MAX;
>  
> +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> +#define HVM_VM86_TSS_SIZE   128

I continue to be puzzled by this value. Why 128? I think this really
needs to be clarified in the comment.

> @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
>              avail -= max_pdx >> s;
>      }
>  
> -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> +    need_paging = opt_dom0_shadow ||
> +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> +                                                   !paging_mode_hap(d)));

What is the !paging_mode_hap() part good for? It's being taken care
of by checking opt_dom0_shadow already, isn't it? Alternatively, to
make the distinction more obvious, I'd suggest

    need_paging = has_hvm_container_domain(d)
                  ? !iommu_hap_pt_share || !paging_mode_hap(d)
                  : opt_dom0_shadow;

> @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>              continue;
>          }
>  
> -        *entry_guest = *entry;
> -        pages = PFN_UP(entry_guest->size);
> +        /*
> +         * Make sure the start and length are aligned to PAGE_SIZE, because
> +         * that's the minimum granularity of the 2nd stage translation. Since
> +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
> +         * order to prevent this code from getting out of sync.
> +         */
> +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);

You definitely don't need to use _AC() in C code. But the whole thing
can anyway simply be

        start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);

(albeit I'd like to note that if anything we'd have to be prepared
for page sizes > 4k, not smaller ones, and the whole idea of
PAGE_ORDER_4K breaks in that case).

> +        end = (entry->addr + entry->size) &
> +              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1 );

On top of the above, please remove the stray blank from near
the end of this statement.

> +static int __init pvh_steal_ram(struct domain *d, unsigned long size,
> +                                paddr_t limit, paddr_t *addr)
> +{
> +    unsigned int i = d->arch.nr_e820;
> +
> +    while ( i-- )
> +    {
> +        struct e820entry *entry = &d->arch.e820[i];
> +
> +        if ( entry->type != E820_RAM || entry->size < size )
> +            continue;
> +
> +        /* Subtract from the beginning. */
> +        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
> +        {
> +            *addr = entry->addr;
> +            entry->addr += size;
> +            entry->size -= size;

The comment says so, but why from the beginning? Wouldn't it be
better to steal from the end of the highest range below 4Gb, to
keep an overall more conventional layout?

> +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> +{
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr;
> +    unsigned int i;
> +
> +    /*
> +     * Steal some space from the last found RAM region. One page will be
> +     * used for the identity page tables, and the remaining space for the
> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;
> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                              &mfn, &p2mt, 0, &rc);
> +    if ( ident_pt == NULL )
> +    {
> +        printk("Unable to map identity page tables\n");
> +        return -ENOMEM;
> +    }
> +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> +    unmap_domain_page(ident_pt);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> +    gaddr += PAGE_SIZE;
> +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

This comes too late - the page table setup above also requires
page alignment (and with that, adding PAGE_SIZE would not break
the alignment requirement). Even more, the code below doesn't
strictly require page alignment, it only requires for the range to
not cross a page boundary.

> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);
> +    if ( tss == NULL )
> +    {
> +        printk("Unable to map VM86 TSS area\n");
> +        return 0;
> +    }
> +
> +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> +    unmap_domain_page(tss);
> +    put_page(mfn_to_page(mfn_x(mfn)));
> +    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
> +
> +    return 0;

While I've seen the code a number of times by now, I still can't
help disliking the early success return (accompanied by an error
message). I think this not being a mistake would be more obvious
with

    if ( tss )
    {
    }
    else
        printk();
    return 0;

> +static int __init pvh_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    unsigned int i;
> +    int rc;
> +    bool preempted;
> +#define MB1_PAGES PFN_DOWN(MB(1))
> +
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> +
> +    pvh_setup_e820(d, nr_pages);
> +    do {
> +        preempted = false;
> +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                              &preempted);
> +        process_pending_softirqs();
> +    } while ( preempted );
> +
> +    /*
> +     * Memory below 1MB is identity mapped.
> +     * NB: this only makes sense when booted from legacy BIOS.
> +     */
> +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);

MB1_PAGES

> +    if ( rc )
> +    {
> +        printk("Failed to identity map low 1MB: %d\n", rc);
> +        return rc;
> +    }
> +
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        unsigned long addr, size;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        addr = PFN_DOWN(d->arch.e820[i].addr);
> +        size = PFN_DOWN(d->arch.e820[i].size);
> +
> +        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
> +
> +        if ( addr >= MB1_PAGES )
> +            rc = pvh_populate_memory_range(d, addr, size);
> +        else
> +            pvh_steal_low_ram(d, addr, size);

Would you mind shortening the ASSERT() expression above by
moving it into the else branch here?

Jan


* Re: [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function
  2017-01-19 17:29 ` [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function Roger Pau Monne
  2017-01-20 19:45   ` Andrew Cooper
@ 2017-01-26 12:51   ` Jan Beulich
  2017-01-26 13:33   ` Jan Beulich
  2 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 12:51 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -3077,15 +3077,16 @@ void hvm_task_switch(
>  #define HVMCOPY_linear     (1u<<2)
>  static enum hvm_copy_result __hvm_copy(
>      void *buf, paddr_t addr, int size, unsigned int flags, uint32_t pfec,
> -    pagefault_info_t *pfinfo)
> +    pagefault_info_t *pfinfo, struct vcpu *vcpu)

Please stick to v as the name here, as done almost everywhere else.
It also looks a little odd to me to have this parameter last, when
commonly we put it first. For a copy function I can see the buffer
info coming first, but please consider moving it ahead of flags.
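
I.e. something along these lines (just a sketch of the suggested ordering):

    static enum hvm_copy_result __hvm_copy(
        void *buf, paddr_t addr, int size, struct vcpu *v,
        unsigned int flags, uint32_t pfec, pagefault_info_t *pfinfo)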

> @@ -3137,12 +3138,12 @@ static enum hvm_copy_result __hvm_copy(
>           * - 32-bit WinXP (& older Windows) on AMD CPUs for LAPIC accesses,
>           * - newer Windows (like Server 2012) for HPET accesses.
>           */
> -        if ( !nestedhvm_vcpu_in_guestmode(curr)
> -             && is_hvm_vcpu(curr)
> +        if ( !nestedhvm_vcpu_in_guestmode(vcpu)
> +             && is_hvm_vcpu(vcpu)
>               && hvm_mmio_internal(gpa) )

This must not be called for v != current.
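
If the callers can't guarantee that, it would be better to make the
restriction explicit, e.g. (sketch only, not the committed fix):

        if ( v == current
             && !nestedhvm_vcpu_in_guestmode(v)
             && is_hvm_vcpu(v)
             && hvm_mmio_internal(gpa) )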

Jan



* Re: [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function
  2017-01-19 17:29 ` [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function Roger Pau Monne
  2017-01-20 19:45   ` Andrew Cooper
  2017-01-26 12:51   ` Jan Beulich
@ 2017-01-26 13:33   ` Jan Beulich
  2017-01-27 14:55     ` Roger Pau Monne
  2 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 13:33 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> --- a/xen/include/asm-x86/hvm/support.h
> +++ b/xen/include/asm-x86/hvm/support.h
> @@ -71,6 +71,8 @@ enum hvm_copy_result hvm_copy_to_guest_phys(
>      paddr_t paddr, void *buf, int size);
>  enum hvm_copy_result hvm_copy_from_guest_phys(
>      void *buf, paddr_t paddr, int size);
> +enum hvm_copy_result hvm_copy_to_guest_phys_vcpu(
> +    paddr_t paddr, void *buf, size_t size, struct vcpu *vcpu);

Btw., there being just five existing callers, I think we should rather
change the existing function instead of adding yet another wrapper
of __hvm_copy().
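
I.e. (hypothetical shape, mirroring the wrapper above):

    enum hvm_copy_result hvm_copy_to_guest_phys(
        paddr_t paddr, void *buf, size_t size, struct vcpu *v);

with the five existing callers adjusted to pass current explicitly.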

Jan



* Re: [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
  2017-01-19 17:29 ` [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2017-01-26 13:37   ` Jan Beulich
  2017-01-27 17:22     ` Roger Pau Monne
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 13:37 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> @@ -1959,12 +1960,146 @@ static int __init pvh_setup_p2m(struct domain *d)
>  #undef MB1_PAGES
>  }
>  
> +static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> +                                  unsigned long image_headroom,
> +                                  module_t *initrd, char *image_base,
> +                                  char *cmdline, paddr_t *entry,
> +                                  paddr_t *start_info_addr)
> +{
> +    char *image_start = image_base + image_headroom;

While for cmdline plain char is certainly fine, I think we should stop
abusing this type for image and image_base, even if this means
some further adjustments elsewhere. This really ought to be u8 or
unsigned char.

> +    unsigned long image_len = image->mod_end;
> +    struct elf_binary elf;
> +    struct elf_dom_parms parms;
> +    paddr_t last_addr;
> +    struct hvm_start_info start_info;
> +    struct hvm_modlist_entry mod = { 0 };
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    int rc;
> +
> +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> +    {
> +        printk("Error trying to detect bz compressed kernel\n");
> +        return rc;
> +    }
> +
> +    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> +    {
> +        printk("Unable to init ELF\n");
> +        return rc;
> +    }
> +#ifdef VERBOSE
> +    elf_set_verbose(&elf);
> +#endif
> +    elf_parse_binary(&elf);
> +    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
> +    {
> +        printk("Unable to parse kernel for ELFNOTES\n");
> +        return rc;
> +    }
> +
> +    if ( parms.phys_entry == UNSET_ADDR32 ) {

Please move the brace to its own line.

> +        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
> +        return -EINVAL;
> +    }
> +
> +    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
> +           parms.guest_ver, parms.loader,
> +           elf_64bit(&elf) ? "64-bit" : "32-bit");
> +
> +    /* Copy the OS image and free temporary buffer. */
> +    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
> +    elf.dest_size = parms.virt_kend - parms.virt_kstart;
> +
> +    /*
> +     * NB: we need to set vCPU 0 of Dom0 as the current vCPU (instead of the
> +     * idle one) because elf_load_binary calls raw_{copy_to/clear}_guest, and
> +     * the target domain is not passed anywhere. This is very similar to what
> +     * is done during classic PV Dom0 creation, where Xen switches to the Dom0
> +     * page tables. We also cannot pass a struct domain or vcpu to
> +     * elf_load_binary, since the interface is shared with the toolstack, and
> +     * it doesn't have any notion of a domain or vcpu struct.
> +     */
> +    saved_current = current;
> +    set_current(v);
> +    rc = elf_load_binary(&elf);
> +    set_current(saved_current);

While with the comment it's not as bad anymore, I'm still not happy
with this code. The tool stack portion of the explanation in particular
does not really hold: We would add an opaque void * parameter to
the functions involved, which the tool stack could pass NULL for, and
you'd use it to convey the vcpu.
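
Roughly (hypothetical shape, parameter name made up):

    elf_errorstatus elf_load_binary(struct elf_binary *elf, void *opaque);

    rc = elf_load_binary(&elf, NULL); /* tool stack */
    rc = elf_load_binary(&elf, v);    /* hypervisor */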

> +    if ( rc < 0 )
> +    {
> +        printk("Failed to load kernel: %d\n", rc);
> +        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
> +        return rc;
> +    }
> +
> +    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> +
> +    if ( initrd != NULL )
> +    {
> +        rc = hvm_copy_to_guest_phys_vcpu(last_addr,
> +                                         mfn_to_virt(initrd->mod_start),
> +                                         initrd->mod_end, v);
> +        if ( rc )
> +        {
> +            printk("Unable to copy initrd to guest\n");
> +            return rc;
> +        }
> +
> +        mod.paddr = last_addr;
> +        mod.size = initrd->mod_end;
> +        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
> +    }
> +
> +    /* Free temporary buffers. */
> +    discard_initial_images();
> +
> +    memset(&start_info, 0, sizeof(start_info));

You clear "mod" with an initializer (which btw could be just the two
braces), but you memset() start_info - please be consistent.
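
E.g. (sketch):

    struct hvm_modlist_entry mod = {};
    struct hvm_start_info start_info = {};

with the memset() then dropped.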

> +    if ( cmdline != NULL )
> +    {
> +        rc = hvm_copy_to_guest_phys_vcpu(last_addr, cmdline,
> +                                         strlen(cmdline) + 1, v);
> +        if ( rc )
> +        {
> +            printk("Unable to copy guest command line\n");
> +            return rc;
> +        }
> +        start_info.cmdline_paddr = last_addr;
> +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);

Where is this 8 coming from?

Jan


* Re: [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0
  2017-01-19 17:29 ` [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
  2017-01-19 17:32   ` Andrew Cooper
@ 2017-01-26 13:39   ` Jan Beulich
  1 sibling, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 13:39 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> @@ -176,6 +177,8 @@ unsigned int __init dom0_max_vcpus(void)
>          max_vcpus = opt_dom0_max_vcpus_max;
>      if ( max_vcpus > MAX_VIRT_CPUS )
>          max_vcpus = MAX_VIRT_CPUS;
> +    if ( dom0_pvh && max_vcpus > HVM_MAX_VCPUS )
> +        max_vcpus = HVM_MAX_VCPUS;

Thinking about it again, we shouldn't apply two limits here, when
those limits can change independently. I think you want

    limit = dom0_pvh ? HVM_MAX_VCPUS : MAX_VIRT_CPUS;
    if ( max_vcpus > limit )
         max_vcpus = limit;

Jan



* Re: [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs
  2017-01-19 17:29 ` [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs Roger Pau Monne
@ 2017-01-26 13:46   ` Jan Beulich
  2017-02-08 12:48     ` Roger Pau Monne
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 13:46 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> @@ -2096,6 +2097,56 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>      return 0;
>  }
>  
> +static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
> +                                 paddr_t start_info)
> +{
> +    vcpu_hvm_context_t cpu_ctx;
> +    struct vcpu *v = d->vcpu[0];
> +    int cpu, i, rc;

i and cpu want to be unsigned int.

> +    cpu = v->processor;
> +    for ( i = 1; i < d->max_vcpus; i++ )
> +    {
> +        cpu = cpumask_cycle(cpu, &dom0_cpus);
> +        setup_dom0_vcpu(d, i, cpu);
> +    }
> +
> +    memset(&cpu_ctx, 0, sizeof(cpu_ctx));

Perhaps better (and shorter) to use an initializer again?

> +    cpu_ctx.mode = VCPU_HVM_MODE_32B;
> +
> +    cpu_ctx.cpu_regs.x86_32.ebx = start_info;
> +    cpu_ctx.cpu_regs.x86_32.eip = entry;
> +    cpu_ctx.cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET;
> +
> +    cpu_ctx.cpu_regs.x86_32.cs_limit = ~0u;
> +    cpu_ctx.cpu_regs.x86_32.ds_limit = ~0u;
> +    cpu_ctx.cpu_regs.x86_32.ss_limit = ~0u;
> +    cpu_ctx.cpu_regs.x86_32.tr_limit = 0x67;
> +    cpu_ctx.cpu_regs.x86_32.cs_ar = 0xc9b;
> +    cpu_ctx.cpu_regs.x86_32.ds_ar = 0xc93;
> +    cpu_ctx.cpu_regs.x86_32.ss_ar = 0xc93;
> +    cpu_ctx.cpu_regs.x86_32.tr_ar = 0x8b;

In fact, all of this could become part of the initializer too as it looks.
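
E.g. (sketch only, folding the invariant state into the declaration):

    vcpu_hvm_context_t cpu_ctx = {
        .mode = VCPU_HVM_MODE_32B,
        .cpu_regs.x86_32 = {
            .ebx = start_info,
            .eip = entry,
            .cr0 = X86_CR0_PE | X86_CR0_ET,
            .cs_limit = ~0u,
            .ds_limit = ~0u,
            .ss_limit = ~0u,
            .tr_limit = 0x67,
            .cs_ar = 0xc9b,
            .ds_ar = 0xc93,
            .ss_ar = 0xc93,
            .tr_ar = 0x8b,
        },
    };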

> @@ -2124,6 +2175,15 @@ static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
>          return rc;
>      }
>  
> +    rc = pvh_setup_cpus(d, entry, start_info);
> +    if ( rc )
> +    {
> +        printk("Failed to setup Dom0 CPUs: %d\n", rc);
> +        return rc;
> +    }
> +
> +    clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);

Would you mind moving this into the function (where you then can
use just v)?

Jan



* Re: [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-01-19 17:29 ` [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
@ 2017-01-26 14:16   ` Jan Beulich
  2017-02-08 15:10     ` Roger Pau Monne
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-26 14:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> +static int __init pvh_setup_acpi_madt(struct domain *d, paddr_t *addr)
> +{
> +    struct acpi_table_madt *madt;
> +    struct acpi_table_header *table;
> +    struct acpi_madt_io_apic *io_apic;
> +    struct acpi_madt_local_apic *local_apic;
> +    struct acpi_madt_local_x2apic *x2apic;
> +    acpi_status status;
> +    unsigned long size;
> +    unsigned int i, max_vcpus;
> +    int rc;
> +
> +    /* Count number of interrupt overrides in the MADT. */
> +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> +                          acpi_count_intr_ovr, UINT_MAX);
> +
> +    /* Count number of NMI sources in the MADT. */
> +    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_count_nmi_src,
> +                          UINT_MAX);
> +
> +    /* Check if there are x2APIC entries. */
> +    acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC, acpi_check_x2apic, 1);
> +
> +    max_vcpus = dom0_max_vcpus();
> +    /* Calculate the size of the crafted MADT. */
> +    size = sizeof(*madt);
> +    size += sizeof(*io_apic);

Still just a single IO-APIC and no fixme note. I see you have one
below, but this line will need to remain in lock step. Or perhaps you
could multiply by nr_ioapics here, accepting the transient waste of
space.

> +    /*
> +     * The APIC ID field of the local APIC struct is only 1byte, so we need
> +     * to limit the number of local APIC entries to 128 because we only use
> +     * even numbers as APIC IDs.
> +     */
> +    size += sizeof(*local_apic) *
> +            min(max_vcpus, 1U << (sizeof(local_apic->id) * 8));

This caps at 256 now. Perhaps it would anyway be better to use
HVM_MAX_VCPUS here, or maybe the minimum of both?
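
E.g. (sketch, taking the minimum of both):

    size += sizeof(*local_apic) *
            min_t(unsigned int, max_vcpus, HVM_MAX_VCPUS);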

> +    size += sizeof(*intsrcovr) * acpi_intr_overrides;
> +    size += sizeof(*nmisrc) * acpi_nmi_sources;
> +    size += acpi_x2apic ? sizeof(*x2apic) * dom0_max_vcpus() : 0;

You have the function call result already latched in a local variable.
Plus, when you cap LAPIC entries, you should also provide x2APIC
ones (to be able to represent all vCPU-s).

> +    madt = xzalloc_bytes(size);
> +    if ( !madt )
> +    {
> +        printk("Unable to allocate memory for MADT table\n");
> +        return -ENOMEM;
> +    }
> +
> +    /* Copy the native MADT table header. */
> +    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
> +    if ( !ACPI_SUCCESS(status) )
> +    {
> +        printk("Failed to get MADT ACPI table, aborting.\n");
> +        return -EINVAL;
> +    }
> +    madt->header = *table;
> +    madt->address = APIC_DEFAULT_PHYS_BASE;
> +    /*
> +     * NB: this is currently set to 4, which is the revision in the ACPI
> +     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
> +     * tables described in the headers.
> +     */
> +    madt->header.revision = min_t(unsigned char, table->revision, 4);
> +
> +    /*
> +     * Setup the IO APIC entry.
> +     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
> +     * per domain. This must be fixed in order to provide the same amount of
> +     * IO APICs as available on bare metal, and with the same IDs as found in
> +     * the native IO APIC MADT entries.
> +     */
> +    if ( nr_ioapics > 1 )
> +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> +               nr_ioapics);
> +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);

To aid readability you may want to use void in all such casts.

> +static bool __init pvh_acpi_table_allowed(const char *sig)
> +{
> +    static const char __init banned_tables[][ACPI_NAME_SIZE] = {

__initconst

> +        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
> +        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
> +    int i;

unsigned int

> +static int __init pvh_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
> +                                      paddr_t *addr)
> +{
> +    struct acpi_table_xsdt *xsdt;
> +    struct acpi_table_header *table;
> +    struct acpi_table_rsdp *rsdp;
> +    unsigned long size = sizeof(*xsdt);
> +    unsigned int i, j, num_tables = 0;
> +    paddr_t xsdt_paddr;
> +    int rc;
> +
> +    /*
> +     * Restore original DMAR table signature, we are going to filter it
> +     * from the new XSDT that is presented to the guest, so it no longer
> +     * makes sense to have it's signature zapped.

"... so it is no longer necessary to ..."?

> +static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
> +{
> +    struct acpi_table_rsdp rsdp, *native_rsdp;
> +    unsigned long pfn, nr_pages;
> +    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
> +    unsigned int i;
> +    int rc;
> +
> +    /* Scan top-level tables and add their regions to the guest memory map. */
> +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> +    {
> +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> +        unsigned long addr = acpi_gbl_root_table_list.tables[i].address;
> +        unsigned long size = acpi_gbl_root_table_list.tables[i].length;
> +
> +        /*
> +         * Make sure the original MADT is also mapped, so that Dom0 can
> +         * properly access the data returned by _MAT methods in case it's
> +         * re-using MADT memory.
> +         */
> +        if ( pvh_acpi_table_allowed(sig) ||
> +             (strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE) == 0 &&
> +             !acpi_memory_banned(addr, size)) )

Indentation. But to me this would be clearer as

        if ( strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE)
             ? pvh_acpi_table_allowed(sig)
             : !acpi_memory_banned(addr, size) )

anyway.

> +             pvh_add_mem_range(d, addr, addr + size, E820_ACPI);
> +    }
> +
> +    /* Identity map ACPI e820 regions. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_ACPI &&
> +             d->arch.e820[i].type != E820_NVS )
> +            continue;
> +
> +        pfn = PFN_DOWN(d->arch.e820[i].addr);
> +        nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
> +                          d->arch.e820[i].size);
> +
> +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> +        if ( rc )
> +        {
> +            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> +                   pfn, pfn + nr_pages);
> +            return rc;
> +        }
> +    }
> +
> +    rc = pvh_setup_acpi_madt(d, &madt_paddr);
> +    if ( rc )
> +        return rc;
> +
> +    rc = pvh_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
> +    if ( rc )
> +        return rc;
> +
> +    /* Craft a custom RSDP. */
> +    memset(&rsdp, 0, sizeof(rsdp));

Perhaps worth using an initializer again (with once again as many as
possible of the items below folded there)?
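
E.g. (sketch; only the fields known at declaration time can move there):

    struct acpi_table_rsdp rsdp = {
        .signature = ACPI_SIG_RSDP,
        .revision = 2,
        .length = sizeof(rsdp),
    };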

> +    memcpy(rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
> +    native_rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(rsdp));
> +    memcpy(rsdp.oem_id, native_rsdp->oem_id, sizeof(rsdp.oem_id));
> +    acpi_os_unmap_memory(native_rsdp, sizeof(rsdp));
> +    rsdp.revision = 2;
> +    rsdp.xsdt_physical_address = xsdt_paddr;
> +    rsdp.rsdt_physical_address = 0;

This is redundant in any event.

> +    rsdp.length = sizeof(rsdp);
> +    /*
> +     * Calling acpi_tb_checksum here is a layering violation, but
> +     * introducing a wrapper for such simple usage seems overkill.
> +     */
> +    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
> +                                      ACPI_RSDP_REV0_SIZE);
> +    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
> +                                               sizeof(rsdp));
> +
> +    /*
> +     * Place the new RSDP in guest memory space.
> +     *
> +     * NB: this RSDP is not going to replace the original RSDP, which
> +     * should still be accessible to the guest. However that RSDP is
> +     * going to point to the native RSDT, and should not be used.

... for the Dom0 kernel's boot purposes (we keep it visible after all
for post boot access).

Jan


* Re: [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2
  2017-01-26 11:43   ` Jan Beulich
@ 2017-01-26 16:36     ` Roger Pau Monne
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-26 16:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Jan 26, 2017 at 04:43:37AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/domain_build.c
> > +++ b/xen/arch/x86/domain_build.c
> > @@ -191,11 +191,40 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
> >  }
> >  
> >  #ifdef CONFIG_SHADOW_PAGING
> > -static bool_t __initdata opt_dom0_shadow;
> > +bool __initdata opt_dom0_shadow;
> >  boolean_param("dom0_shadow", opt_dom0_shadow);
> >  #else
> > -#define opt_dom0_shadow 0
> > +#define opt_dom0_shadow false
> 
> No need to duplicate the #define from the header here.
> 
> > @@ -1008,12 +1037,8 @@ int __init construct_dom0(
> >      paddr_t mpt_alloc;
> >  
> >      /* Sanity! */
> > -    BUG_ON(d->domain_id != 0);
> > -    BUG_ON(d->vcpu[0] == NULL);
> >      BUG_ON(v->is_initialised);
> 
> What about this last one?

Hm, not sure why I didn't move that last time; IIRC it triggered for the PVHv2
vCPU, but that no longer seems to be the case, so I've moved it. Both fixed,
thanks.

Roger.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-26 12:41   ` Jan Beulich
@ 2017-01-27 11:14     ` Tim Deegan
  2017-01-27 12:37       ` Roger Pau Monne
  2017-01-27 12:51       ` Andrew Cooper
  2017-01-27 12:23     ` Roger Pau Monne
  1 sibling, 2 replies; 68+ messages in thread
From: Tim Deegan @ 2017-01-27 11:14 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel, Roger Pau Monne

At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE   128
> 
> I continue to be puzzled by this value. Why 128? I think this really
> needs to be clarified in the comment.

I was asked on IRC to do some archaeology / explain myself about this,
so here goes.

First, the _intended_ mechanism for "real mode" guests on older VMX
hardware is to run them in virtual 8086 mode inside the guest as much
as possible, and emulate whenever we can't do that.

This is managed with some state in v->arch.hvm_vmx:
 - vmx_realmode, set when the guest thinks it's in real mode. 
 - vmx_emulate, to force emulation rather than VMENTER
   We set this when we have exceptions to inject, as the VMX hardware
   would try to inject them in 32-bit protected mode.
 - vm86_segment_mask, a bitmask of segments that can't be fudged
   to run in virtual 8086 mode.

When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
the guest in virtual 8086 mode.

The reason we need a TSS at all is for handling software interrupts.
Virtual 8086 mode has two ways to handle software interrupts: stay in
virtual 8086 mode and vector via the table @0x0, or raise #GP in 32-bit
protected mode.  We want the first of those, so that a guest in 'real mode'
can make BIOS calls.

The CPU uses a bitmap in the TSS to decide which method to use; we
need all the bits in that bitmap to be clear.  In my SDM (April 2016)
this is section 20.3.3 "Class 3 -- Software Interrupt Handling in
Virtual-8086 Mode", table 20-2, method 5.

---

So far so good, and AIUI the system works -- or at least it did in
December 2008 when it was put in (8d4638d1), because emulating every
instruction made Windows boot times so slow that we would definitely
have noticed.

But looking at it now, I'm not convinced of exactly how.  The magic
bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
base address itself lives at offset 102.  A zero'd TSS should mean an
I/O map at 0, and an interrupt redirection bitmap at -32, which would
plausibly work if the TSS were 256 bytes (matching the limit set in
Xen).  Perhaps it's only working because the 128 bytes following the
TSS in hvmloader happen to be zeros too?
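
(To spell out the arithmetic: per the SDM the 16-bit I/O map base field sits
at offset 102 of the 32-bit TSS, so a zeroed TSS gives

    io_map_base      = 0
    I/O bitmap       @ TSS + io_map_base      = TSS + 0
    intredir bitmap  @ TSS + io_map_base - 32 = TSS - 32

with the open question being where that -32 actually lands.)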

I also don't remember why the TSS is 128 rather than 104 bytes.  The
SDM claims that the TSS must be larger than 104 bytes "when accessing
the I/O permission bit map or interrupt redirection bit map."
(7.2.2. "TSS Descriptor") but I suspect that just means that the
generated address of the bitmap must lie inside the limit.

In any case, the limit set in vmx_set_segment_register() should surely
match the size of the actual TSS!

I haven't got the time or hardware to test this right now, but could
maybe look at it next week unless anyone else wants to play with it.

Cheers,

Tim.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-26 12:41   ` Jan Beulich
  2017-01-27 11:14     ` Tim Deegan
@ 2017-01-27 12:23     ` Roger Pau Monne
  2017-01-27 15:11       ` Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 12:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
> >  static long __initdata dom0_min_nrpages;
> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> >  
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE   128
> 
> I continue to be puzzled by this value. Why 128? I think this really
> needs to be clarified in the comment.

Given the recent comments by Tim, and that this is starting to look like a can
of worms, I would like to leave this as-is for the moment, on the grounds that
it's what hvmloader does (I'm not saying it's right), and that this issue
should be treated independently of this patch series.

Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
IIRC I've tried that before (without unrestricted mode support) and it was
working fine.

> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
> >              avail -= max_pdx >> s;
> >      }
> >  
> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> > +    need_paging = opt_dom0_shadow ||
> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> > +                                                   !paging_mode_hap(d)));
> 
> What is the !paging_mode_hap() part good for? It's being taken care
> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
> make the distinction more obvious, I'd suggest
> 
>     need_paging = has_hvm_container_domain(d)
>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
>                   : opt_dom0_shadow;

AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no EPT, but
with an IOMMU? Does that exist? In that case opt_dom0_shadow won't be set, but
paging_mode_hap would be false. Maybe that's just an impossible combination in
any case...

> > @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >              continue;
> >          }
> >  
> > -        *entry_guest = *entry;
> > -        pages = PFN_UP(entry_guest->size);
> > +        /*
> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
> > +         * that's the minimum granularity of the 2nd stage translation. Since
> > +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
> > +         * order to prevent this code from getting out of sync.
> > +         */
> > +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
> 
> You definitely don't need to use _AC() in C code. But the whole thing
> can anyway simply be
> 
>         start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
> 
> (albeit I'd like to note that if anything we'd have to be prepared
> for page sizes > 4k, not smaller ones, and the whole idea of
> PAGE_ORDER_4K breaks in that case).

Thanks, I will change as per your recommendation above, although I'm not sure
what to do with the PAGE_ORDER_4K thing. Are you fine with leaving it as you
suggest?

> > +        end = (entry->addr + entry->size) &
> > +              ~((_AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT) - 1 );
> 
> On top of the above, please remove the stray blank from near
> the end of this statement.

I've changed that to:

        end = (entry->addr + entry->size) &
              ~((PAGE_SIZE << PAGE_ORDER_4K) - 1);

In order to match with the above.

> > +static int __init pvh_steal_ram(struct domain *d, unsigned long size,
> > +                                paddr_t limit, paddr_t *addr)
> > +{
> > +    unsigned int i = d->arch.nr_e820;
> > +
> > +    while ( i-- )
> > +    {
> > +        struct e820entry *entry = &d->arch.e820[i];
> > +
> > +        if ( entry->type != E820_RAM || entry->size < size )
> > +            continue;
> > +
> > +        /* Subtract from the beginning. */
> > +        if ( entry->addr + size <= limit && entry->addr >= MB(1) )
> > +        {
> > +            *addr = entry->addr;
> > +            entry->addr += size;
> > +            entry->size -= size;
> 
> The comment says so, but why from the beginning? Wouldn't it be
> better to steal from the end of the highest range below 4Gb, to
> keep an overall more conventional layout?

That sounds sensible, let me change it to:

        /* Subtract from the end. */
        if ( entry->addr + entry->size + size <= limit &&
             entry->addr >= MB(1) )
        {
            entry->size -= size;
            *addr = entry->addr + entry->size;
            return 0;
        }

This is going to involve some changes in pvh_setup_vmx_realmode_helpers, see
below.

> > +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> > +{
> > +    p2m_type_t p2mt;
> > +    uint32_t rc, *ident_pt;
> > +    uint8_t *tss;
> > +    mfn_t mfn;
> > +    paddr_t gaddr;
> > +    unsigned int i;
> > +
> > +    /*
> > +     * Steal some space from the last found RAM region. One page will be
> > +     * used for the identity page tables, and the remaining space for the
> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> > +     * to PAGE_SIZE.
> > +     */
> > +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
> > +    {
> > +        printk("Unable to find memory to stash the identity map and TSS\n");
> > +        return -ENOMEM;
> > +    }
> > +
> > +    /*
> > +     * Identity-map page table is required for running with CR0.PG=0
> > +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> > +     * superpages.
> > +     */
> > +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> > +                              &mfn, &p2mt, 0, &rc);
> > +    if ( ident_pt == NULL )
> > +    {
> > +        printk("Unable to map identity page tables\n");
> > +        return -ENOMEM;
> > +    }
> > +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
> > +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
> > +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
> > +    unmap_domain_page(ident_pt);
> > +    put_page(mfn_to_page(mfn_x(mfn)));
> > +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
> > +    gaddr += PAGE_SIZE;
> > +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
> 
> This comes too late - the page table setup above also requires
> page alignment (and with that, adding PAGE_SIZE would not break
> the alignment requirement). Even more, the code below doesn't
> strictly require page alignment, it only requires for the range to
> not cross a page boundary.

Given the change that you requested in pvh_steal_ram, the start of the
memory area returned by it is not going to be page-aligned, so I will have to
set up the TSS first, and then the identity page tables.

> > +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> > +                         &mfn, &p2mt, 0, &rc);
> > +    if ( tss == NULL )
> > +    {
> > +        printk("Unable to map VM86 TSS area\n");
> > +        return 0;
> > +    }
> > +
> > +    memset(tss, 0, HVM_VM86_TSS_SIZE);
> > +    unmap_domain_page(tss);
> > +    put_page(mfn_to_page(mfn_x(mfn)));
> > +    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
> > +
> > +    return 0;
> 
> While I've seen the code a number of times by now, I still can't
> help disliking the early success return (accompanied by an error
> message). I think this not being a mistake would be more obvious
> with
> 
>     if ( tss )
>     {
>     }
>     else
>         printk();
>     return 0;

That's not a problem, I will change it given that I will also have to move this
before the setup of the identity page tables.

> > +static int __init pvh_setup_p2m(struct domain *d)
> > +{
> > +    struct vcpu *v = d->vcpu[0];
> > +    unsigned long nr_pages;
> > +    unsigned int i;
> > +    int rc;
> > +    bool preempted;
> > +#define MB1_PAGES PFN_DOWN(MB(1))
> > +
> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> > +
> > +    pvh_setup_e820(d, nr_pages);
> > +    do {
> > +        preempted = false;
> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> > +                              &preempted);
> > +        process_pending_softirqs();
> > +    } while ( preempted );
> > +
> > +    /*
> > +     * Memory below 1MB is identity mapped.
> > +     * NB: this only makes sense when booted from legacy BIOS.
> > +     */
> > +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
> 
> MB1_PAGES
> 
> > +    if ( rc )
> > +    {
> > +        printk("Failed to identity map low 1MB: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    /* Populate memory map. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        unsigned long addr, size;
> > +
> > +        if ( d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        addr = PFN_DOWN(d->arch.e820[i].addr);
> > +        size = PFN_DOWN(d->arch.e820[i].size);
> > +
> > +        ASSERT(addr >= MB1_PAGES || addr + size < MB1_PAGES);
> > +
> > +        if ( addr >= MB1_PAGES )
> > +            rc = pvh_populate_memory_range(d, addr, size);
> > +        else
> > +            pvh_steal_low_ram(d, addr, size);
> 
> Would you mind shortening the ASSERT() expression above by
> moving it into the else branch here?

Fixed both of the above, thanks.

Roger.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 11:14     ` Tim Deegan
@ 2017-01-27 12:37       ` Roger Pau Monne
  2017-01-27 12:51       ` Andrew Cooper
  1 sibling, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 12:37 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Andrew Cooper, boris.ostrovsky, Jan Beulich, xen-devel

On Fri, Jan 27, 2017 at 11:14:10AM +0000, Tim Deegan wrote:
> At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
> > >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > > +#define HVM_VM86_TSS_SIZE   128
> > 
> > I continue to be puzzled by this value. Why 128? I think this really
> > needs to be clarified in the comment.
> 
> I was asked on IRC to do some archaeology / explain myself about this,
> so here goes.
> 
> First, the _intended_ mechanism for "real mode" guests on older VMX
> hardware is to run them in virtual 8086 mode inside the guest as much
> as possible, and emulate whenever we can't do that.
> 
> This is managed with some state in v->arch.hvm_vmx:
>  - vmx_realmode, set when the guest thinks it's in real mode. 
>  - vmx_emulate, to force emulation rather than VMENTER
>    We set this when we have exceptions to inject, as the VMX hardware
>    would try to inject them in 32-bit protected mode.
>  - vm86_segment_mask, a bitmask of segments that can't be fudged
>    to run in virtual 8086 mode.
> 
> When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
> emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
> otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
> the guest in virtual 8086 mode.
> 
> The reason we need a TSS at all is for handling software interrupts.
> Virtual 8086 mode has two ways to handle software interrupts: stay in
> virtual 8086 mode and vector via the table @0x0, or raise #GP in 32-bit
> protected mode.  We want the first of those, so that a guest in 'real mode'
> can make BIOS calls.
> 
> The CPU uses a bitmap in the TSS to decide which method to use; we
> need all the bits in that bitmap to be clear.  In my SDM (April 2016)
> this is section 20.3.3 "Class 3 -- Software Interrupt Handling in
> Virtual-8086 Mode", table 20-2, method 5.
> 
> ---
> 
> So far so good, and AIUI the system works -- or at least it did in
> December 2008 when it was put in (8d4638d1), because emulating every
> instruction made Windows boot times so slow that we would definitely
> have noticed.
> 
> But looking at it now, I'm not convinced of exactly how.  The magic
> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> base address itself lives at offset 100.  A zero'd TSS should mean an
> I/O map at 0, and an interrupt redirection bitmap at -32, which would
> plausibly work if the TSS were 256 bytes (matching the limit set in
> Xen).  Perhaps it's only working because the 128 bytes following the
> TSS in hvmloader happen to be zeros too?

Right, so *if* this was working as intended, the interrupt bitmap would be at
HVM_PARAM_VM86_TSS - 32, which we don't guarantee to zero at all.

I've also looked at the manual, and it states that the last byte of the IO
bitmap should be filled with 1s[0], which we don't do at all. Also, what's the
expected size of the IO bitmap - 8KB, to cover all 64K ports?

Roger.

[0] Vol3, section 20.3.3 "Class 3-Software Interrupt Handling in Virtual-8086
Mode", Figure 20-5.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 11:14     ` Tim Deegan
  2017-01-27 12:37       ` Roger Pau Monne
@ 2017-01-27 12:51       ` Andrew Cooper
  2017-01-27 13:20         ` Tim Deegan
  1 sibling, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-27 12:51 UTC (permalink / raw)
  To: Tim Deegan, Jan Beulich; +Cc: xen-devel, boris.ostrovsky, Roger Pau Monne

On 27/01/17 11:14, Tim Deegan wrote:
> At 05:41 -0700 on 26 Jan (1485409318), Jan Beulich wrote:
>>>>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>>> +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>>> +#define HVM_VM86_TSS_SIZE   128
>> I continue to be puzzled by this value. Why 128? I think this really
>> needs to be clarified in the comment.
> I was asked on IRC to do some archaeology / explain myself about this,
> so here goes.
>
> First, the _intended_ mechanism for "real mode" guests on older VMX
> hardware is to run them in virtual 8086 mode inside the guest as much
> as possible, and emulate whenever we can't do that.
>
> This is managed with some state in v->arch.hvm_vmx:
>  - vmx_realmode, set when the guest thinks it's in real mode. 
>  - vmx_emulate, to force emulation rather than VMENTER
>    We set this when we have exceptions to inject, as the VMX hardware
>    would try to inject them in 32-bit protected mode.
>  - vm86_segment_mask, a bitmask of segments that can't be fudged
>    to run in virtual 8086 mode.
>
> When vmx_realmode is set, vmx_do_vmentry() DTRT: it bails out into the
> emulator if either vmx_emulate or any bit in vm86_segment_mask is set;
> otherwise it calls vmx_enter_realmode() to adjust %rflags and enters
> the guest in virtual 8086 mode.

Ah - this is where I went wrong.  I'd logically combined
vmx_enter_realmode and vmx_realmode when reading the assembly.

>
> The reason we need a TSS at all is for handling software interrupts.
> Virtual 8086 mode has two ways to handle software interrupts: stay in
> virtual 8086 mode and vector via the table @0x0, or raise #GP in 32-bit
> protected mode.  We want the first of those, so that a guest in 'real mode'
> can make BIOS calls.
>
> The CPU uses a bitmap in the TSS to decide which method to use; we
> need all the bits in that bitmap to be clear.  In my SDM (April 2016)
> this is section 20.3.3 "Class 3 -- Software Interrupt Handling in
> Virtual-8086 Mode", table 20-2, method 5.
>
> ---
>
> So far so good, and AIUI the system works -- or at least it did in
> December 2008 when it was put in (8d4638d1), because emulating every
> instruction made Windows boot times so slow that we would definitely
> have noticed.
>
> But looking at it now, I'm not convinced of exactly how.  The magic
> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> > base address itself lives at offset 102.  A zero'd TSS should mean an
> I/O map at 0, and an interrupt redirection bitmap at -32, which would
> plausibly work if the TSS were 256 bytes (matching the limit set in
> Xen).  Perhaps it's only working because the 128 bytes following the
> TSS in hvmloader happen to be zeros too?

With an IO_base_map of 0, the software interrupt bitmap will end up
being ahead of the TSS, not after it.

I would not be surprised if it turns out that microcode doesn't
range-check against TSS.base.

> I also don't remember why the TSS is 128 rather than 104 bytes.  The
> SDM claims that the TSS must be larger than 104 bytes "when accessing
> the I/O permission bit map or interrupt redirection bit map."
> (7.2.2. "TSS Descriptor") but I suspect that just means that the
> generated address of the bitmap must lie inside the limit.

The documented way of expressing "no IO bitmap" is to set the map base
to a value which exceeds the TSS limit.  All this means (I think) is
that you must make a larger than default TSS if you want to use a IO or
software interrupt bitmap.

> In any case, the limit set in vmx_set_segment_register() should surely
> match the size of the actual TSS.
> I haven't got the time or hardware to test this right now, but could
> maybe look at it next week unless anyone else wants to play with it.

I have hardware.  I will look into it when I have a moment, unless
anyone beats me to it.

~Andrew


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 12:51       ` Andrew Cooper
@ 2017-01-27 13:20         ` Tim Deegan
  2017-01-27 13:46           ` Andrew Cooper
  2017-01-27 16:40           ` Jan Beulich
  0 siblings, 2 replies; 68+ messages in thread
From: Tim Deegan @ 2017-01-27 13:20 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Roger Pau Monne

Hi,

At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
> On 27/01/17 11:14, Tim Deegan wrote:
> > But looking at it now, I'm not convinced of exactly how.  The magic
> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> > base address itself lives at offset 100.  A zero'd TSS should mean an
> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
> > plausibly work if the TSS were 256 bytes (matching the limit set in
> > Xen).  Perhaps it's only working because the 128 bytes following the
> > TSS in hvmloader happen to be zeros too?
> 
> With an IO_base_map of 0, the software interrupt bitmap will end up
> being ahead of the TSS, not after it.

I should have thought that the segmented address calculation would
wrap and leave us at TSS + 224.
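That is, assuming the offset is reduced modulo the 256-byte limit:
(0 - 32) mod 256 = 224, putting the 32-byte redirection bitmap at
TSS + 224 .. TSS + 255, just inside the limit.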

> > I also don't remember why the TSS is 128 rather than 104 bytes.  The
> > SDM claims that the TSS must be larger than 104 bytes "when accessing
> > the I/O permission bit map or interrupt redirection bit map."
> > (7.2.2. "TSS Descriptor") but I suspect that just means that the
> > generated address of the bitmap must lie inside the limit.
> 
> The documented way of expressing "no IO bitmap" is to set the map base
> to a value which exceeds the TSS limit.  All this means (I think) is
> that you must make a larger than default TSS if you want to use a IO or
> software interrupt bitmap.

Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
enough space for a full one, but the current SDM is pretty clear that
the CPU will try to check it in virtual 8086 mode.

It may be that all the ports actually used happen to fall in the 128
bytes of zeros that we provide.

Or possibly (both for this and the interrupt bitmap) we are causing
#GP and somehow ending up exiting-and-emulating.  But I don't see
quite what the path is for that.

Cheers,

Tim.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 13:20         ` Tim Deegan
@ 2017-01-27 13:46           ` Andrew Cooper
  2017-01-27 14:01             ` Tim Deegan
  2017-01-27 16:40           ` Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-27 13:46 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Roger Pau Monne

On 27/01/17 13:20, Tim Deegan wrote:
> Hi,
>
> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>> On 27/01/17 11:14, Tim Deegan wrote:
>>> But looking at it now, I'm not convinced of exactly how.  The magic
>>> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>>> base address itself lives at offset 102.  A zero'd TSS should mean an
>>> I/O map at 0, and an interrupt redirection bitmap at -32, which would
>>> plausibly work if the TSS were 256 bytes (matching the limit set in
>>> Xen).  Perhaps it's only working because the 128 bytes following the
>>> TSS in hvmloader happen to be zeros too?
>> With an IO_base_map of 0, the software interrupt bitmap will end up
>> being ahead of the TSS, not after it.
> I should have thought that the segmented address calculation would
> wrap and leave us at TSS + 224.

As far as I am aware, this is the only case of a system descriptor
access which could end up negative (relative to base).  All IDT/GDT/LDT
accesses are sensibly bounded by the validity of their trigger conditions.

I'd expect microcode to calculate TSS.base + I/O base - 32 +
bit_of(vector) on the expectation that an OS actually wanting this to
work would have set it up properly.

The actual behaviour can be determined by putting the TSS on a page
boundary, making the previous frame non-readable via EPT, and seeing
whether an EPT violation occurs.  (I haven't yet got far enough in my
nested virt work for this to be an easy thing to configure, but it is
possible by manually clobbering unrestricted mode on a newer processor
and using HAP.)

>
>>> I also don't remember why the TSS is 128 rather than 104 bytes.  The
>>> SDM claims that the TSS must be larger than 104 bytes "when accessing
>>> the I/O permission bit map or interrupt redirection bit map."
>>> (7.2.2. "TSS Descriptor") but I suspect that just means that the
>>> generated address of the bitmap must lie inside the limit.
>> The documented way of expressing "no IO bitmap" is to set the map base
>> to a value which exceeds the TSS limit.  All this means (I think) is
>> that you must make a larger than default TSS if you want to use a IO or
>> software interrupt bitmap.
> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> enough space for a full one, but the current SDM is pretty clear that
> the CPU will try to check it in virtual 8086 mode.
>
> It may be that all the ports actually used happen to fall in the 128
> bytes of zeros that we provide.

With an offset of 0, we actually provide 256 bytes of zeros in the
bitmap within the TSS limit.

> Or possibly (both for this and the interrupt bitmap) we are causing
> #GP and somehow ending up exiting-and-emulating.  But I don't see
> quite what the path is for that.

We also set IOPL to 3 when entering vm86 to fake up real mode.
This bypasses all I/O bitmap checks (a property common to ring 3
protected tasks as well - see specifically 20.2.7 "Sensitive
Instructions"), which means the IN/OUT instructions end up directly at
the relevant vmexit case.

~Andrew


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 13:46           ` Andrew Cooper
@ 2017-01-27 14:01             ` Tim Deegan
  2017-01-27 14:35               ` Andrew Cooper
  0 siblings, 1 reply; 68+ messages in thread
From: Tim Deegan @ 2017-01-27 14:01 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Roger Pau Monne

Hi,

At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
> The actual behaviour can be determined by putting the TSS on a page
> boundary, making the previous frame non-readable via EPT, and seeing
> whether an EPT violation occurs.

Indeed.  Or likewise with normal pagetables. 

> > Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> > enough space for a full one, but the current SDM is pretty clear that
> > the CPU will try to check it in virtual 8086 mode.
> >
> > It may be that all the ports actually used happen to fall in the 128
> > bytes of zeros that we provide.
> 
> With an offset of 0, we actually provide 256 bytes of zeros in the
> bitmap within the TSS limit.

Sure, or at least 128 bytes of zeros and another 128 bytes of something.

> > Or possibly (both for this and the interrupt bitmap) we are causing
> > #GP and somehow ending up exiting-and-emulating.  But I don't see
> > quite what the path is for that.
> 
> We set IOPL to 3 as well when entering vm86 to fake up real mode.
> This bypasses all I/O bitmap checks (a property common to ring 3
> protected tasks as well - see specifically 20.2.7 "Sensitive
> Instructions"), which means the IN/OUT instructions end up directly at
> the relevant vmexit case.

20.2.8.1 makes it clear that this is not the case -- in virtual 8086
mode all IN/OUT ops check the bitmap event with IOPL == CPL.

Tim.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 14:01             ` Tim Deegan
@ 2017-01-27 14:35               ` Andrew Cooper
  2017-01-27 19:43                 ` Tim Deegan
  0 siblings, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-27 14:35 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Roger Pau Monne

On 27/01/17 14:01, Tim Deegan wrote:
> Hi,
>
> At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>> The actual behaviour can be determined by putting the TSS on a page
>> boundary, making the previous frame non-readable via EPT, and seeing
>> whether an EPT violation occurs.
> Indeed.  Or likewise with normal pagetables. 
>
>>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>>> enough space for a full one, but the current SDM is pretty clear that
>>> the CPU will try to check it in virtual 8086 mode.
>>>
>>> It may be that all the ports actually used happen to fall in the 128
>>> bytes of zeros that we provide.
>> With an offset of 0, we actually provide 256 bytes of zeros in the
>> bitmap within the TSS limit.
> Sure, or at least 128 bytes of zeros and another 128 bytes of something.

That is a good point.  Nothing prevents a guest exiting vm86 mode, and
using a task switch to move to a new tss, which will cause Xen to write
state back into the vm86_tss, making it no longer a zeroed block of memory.

Despite being owned by the guest, this TSS is actually managed by Xen. 
It should be initialised to defaults each time Xen needs to use it on
behalf of the guest.

>>> Or possibly (both for this and the interrupt bitmap) we are causing
>>> #GP and somehow ending up exiting-and-emulating.  But I don't see
>>> quite what the path is for that.
>> We set IOPL to 3 as well when entering vm86 to fake up real mode.
>> This bypasses all I/O bitmap checks (a property common to ring 3
>> protected tasks as well - see specifically 20.2.7 "Sensitive
>> Instructions"), which means the IN/OUT instructions end up directly at
>> the relevant vmexit case.
> 20.2.8.1 makes it clear that this is not the case -- in virtual 8086
> mode all IN/OUT ops check the bitmap event with IOPL == CPL.

Hmm.  Right you are, which explains why the TSS limit is greater than
0x67.

If the emulation code were working correctly, the emulator should come
to the same conclusion as hardware and inject a #GP fault.  I suspect it
is more likely that RomBIOS doesn't use a port higher than we have
bitmap space for.

~Andrew


* Re: [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function
  2017-01-26 13:33   ` Jan Beulich
@ 2017-01-27 14:55     ` Roger Pau Monne
  0 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 14:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, boris.ostrovsky

On Thu, Jan 26, 2017 at 06:33:11AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > --- a/xen/include/asm-x86/hvm/support.h
> > +++ b/xen/include/asm-x86/hvm/support.h
> > @@ -71,6 +71,8 @@ enum hvm_copy_result hvm_copy_to_guest_phys(
> >      paddr_t paddr, void *buf, int size);
> >  enum hvm_copy_result hvm_copy_from_guest_phys(
> >      void *buf, paddr_t paddr, int size);
> > +enum hvm_copy_result hvm_copy_to_guest_phys_vcpu(
> > +    paddr_t paddr, void *buf, size_t size, struct vcpu *vcpu);
> 
> Btw., there being just five existing callers, I think we should rather
> change the existing function instead of adding yet another wrapper
> of __hvm_copy().

Done, together with the changes requested to this patch in the other email.

Thanks, Roger.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 12:23     ` Roger Pau Monne
@ 2017-01-27 15:11       ` Jan Beulich
  2017-01-27 16:04         ` Roger Pau Monne
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-27 15:11 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
> On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
>> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>> >  static long __initdata dom0_min_nrpages;
>> >  static long __initdata dom0_max_nrpages = LONG_MAX;
>> >  
>> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>> > +#define HVM_VM86_TSS_SIZE   128
>> 
>> I continue to be puzzled by this value. Why 128? I think this really
>> needs to be clarified in the comment.
> 
> Given the recent comments by Tim, and that this is starting to look like a can
> of worms, I would like to leave this as-is for the moment, on the grounds that
> it's what hvmloader does (I'm not saying it's right), and that this issue
> should be treated independently from this patch series.

Well, for the purpose of this patch it would be sufficient if the
comment referred to hvmloader. But then I think I saw you set the
TSS limit to 0x67, which is neither in line with the value above nor
- according to what Tim said (but I didn't check myself yet) - the
255 used in hvmloader. I.e. if you clone hvmloader code, all
aspects of it should match.

> Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
> IIRC I've tried that before (without unrestricted mode support) and it was
> working fine.

Now if that's the case, then why bother with the TSS?

>> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
>> >              avail -= max_pdx >> s;
>> >      }
>> >  
>> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
>> > +    need_paging = opt_dom0_shadow ||
>> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
>> > +                                                   !paging_mode_hap(d)));
>> 
>> What is the !paging_mode_hap() part good for? It's being taken care
>> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
>> make the distinction more obvious, I'd suggest
>> 
>>     need_paging = has_hvm_container_domain(d)
>>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
>>                   : opt_dom0_shadow;
> 
> AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no EPT, but
> with an IOMMU? Does that exist? In that case opt_dom0_shadow won't be set, but
> paging_mode_hap would be false. Maybe that's just an impossible combination in
> any case...

At least when running Xen itself virtualized, I wouldn't dare to assume
this is an impossible combination. However, I can't see how that case
would be handled any differently by the original or the suggested
replacement expressions: need_paging would get set either way afaict.

>> > @@ -608,8 +617,22 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>> >              continue;
>> >          }
>> >  
>> > -        *entry_guest = *entry;
>> > -        pages = PFN_UP(entry_guest->size);
>> > +        /*
>> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
>> > +         * that's the minimum granularity of the 2nd stage translation. Since
>> > +         * the p2m code uses PAGE_ORDER_4K internally, also use it here in
>> > +         * order to prevent this code from getting out of sync.
>> > +         */
>> > +        start = ROUNDUP(entry->addr, _AC(1,L) << PAGE_ORDER_4K << PAGE_SHIFT);
>> 
>> You definitely don't need to use _AC() in C code. But the whole thing
>> can anyway simply be
>> 
>>         start = ROUNDUP(entry->addr, PAGE_SIZE << PAGE_ORDER_4K);
>> 
>> (albeit I'd like to note that if anything we'd have to be prepared
>> for page sizes > 4k, not smaller ones, and the whole idea of
>> PAGE_ORDER_4K breaks in that case).
> 
> Thanks, I will change as per your recommendation above, although I'm not sure
> what to do with the PAGE_ORDER_4K thing. Are you fine with leaving it like you
> suggest?

Yes, there's far more broken code in that case, and hence the remark
was in parentheses in an attempt to make clear it's really just a remark.

>> > +static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
>> > +{
>> > +    p2m_type_t p2mt;
>> > +    uint32_t rc, *ident_pt;
>> > +    uint8_t *tss;
>> > +    mfn_t mfn;
>> > +    paddr_t gaddr;
>> > +    unsigned int i;
>> > +
>> > +    /*
>> > +     * Steal some space from the last found RAM region. One page will be
>> > +     * used for the identity page tables, and the remaining space for the
>> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
>> > +     * to PAGE_SIZE.
>> > +     */
>> > +    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
>> > +    {
>> > +        printk("Unable to find memory to stash the identity map and TSS\n");
>> > +        return -ENOMEM;
>> > +    }
>> > +
>> > +    /*
>> > +     * Identity-map page table is required for running with CR0.PG=0
>> > +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
>> > +     * superpages.
>> > +     */
>> > +    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>> > +                              &mfn, &p2mt, 0, &rc);
>> > +    if ( ident_pt == NULL )
>> > +    {
>> > +        printk("Unable to map identity page tables\n");
>> > +        return -ENOMEM;
>> > +    }
>> > +    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
>> > +        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
>> > +                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
>> > +    unmap_domain_page(ident_pt);
>> > +    put_page(mfn_to_page(mfn_x(mfn)));
>> > +    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
>> > +    gaddr += PAGE_SIZE;
>> > +    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
>> 
>> This comes too late - the page table setup above also requires
>> page alignment (and with that, adding PAGE_SIZE would not break
>> the alignment requirement). Even more, the code below doesn't
>> strictly require page alignment, it only requires for the range to
>> not cross a page boundary.
> 
> Given the change that you requested in pvh_steal_ram, now the start of the
> memory area returned by it it's not going to be page-aligned, so I will have to
> perform the TSS setup first, and then the identity page tables.

Or simply pass the required alignment.
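
Along the lines of (prototype only; the extra parameter name is
invented):

static int __init pvh_steal_ram(struct domain *d, unsigned long size,
                                unsigned long align, paddr_t limit,
                                paddr_t *addr);

so that this one caller can ask for PAGE_SIZE alignment while all the
other callers pass 1.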

Jan


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 15:11       ` Jan Beulich
@ 2017-01-27 16:04         ` Roger Pau Monne
  2017-01-27 16:29           ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 16:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Jan 27, 2017 at 08:11:56AM -0700, Jan Beulich wrote:
> >>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
> > On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
> >> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> >> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
> >> >  static long __initdata dom0_min_nrpages;
> >> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> >> >  
> >> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> >> > +#define HVM_VM86_TSS_SIZE   128
> >> 
> >> I continue to be puzzled by this value. Why 128? I think this really
> >> needs to be clarified in the comment.
> > 
> > Given the recent comments by Tim, and that this is starting to look like a can
> > of worms, I would like to leave this as-is for the moment, on the grounds that
> > it's what hvmloader does (I'm not saying it's right), and that this issue
> > should be treated independently from this patch series.
> 
> Well, for the purpose of this patch it would be sufficient if the
> comment referred to hvmloader. But then I think I saw you set the
> TSS limit to 0x67, which is neither in line with the value above nor

Hm, no, I'm not setting the limit anywhere here; that's done in
vmx_set_segment_register, and it's indeed set to 0xff, which is wrong for
hvmloader too, according to the ongoing conversation about this
HVM_VM86_TSS_SIZE param.

> - according to what Tim said (but I didn't check myself yet) - the
> 255 used in hvmloader. I.e. if you clone hvmloader code, all
> aspects of it should match.
> 
> > Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
> > IIRC I've tried that before (without unrestricted mode support) and it was
> > working fine.
> 
> Now if that's the case, then why bother with the TSS?

It seems that it was working by pure luck, but I don't know all the details.
Maybe the emulator is somehow fixing this up when the TSS is corrupted/incorrect?

> >> > @@ -333,7 +338,9 @@ static unsigned long __init compute_dom0_nr_pages(
> >> >              avail -= max_pdx >> s;
> >> >      }
> >> >  
> >> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> >> > +    need_paging = opt_dom0_shadow ||
> >> > +                  (has_hvm_container_domain(d) && (!iommu_hap_pt_share ||
> >> > +                                                   !paging_mode_hap(d)));
> >> 
> >> What is the !paging_mode_hap() part good for? It's being taken care
> >> of by checking opt_dom0_shadow already, isn't it? Alternatively, to
> >> make the distinction more obvious, I'd suggest
> >> 
> >>     need_paging = has_hvm_container_domain(d)
> >>                   ? !iommu_hap_pt_share || !paging_mode_hap(d)
> >>                   : opt_dom0_shadow;
> > 
> > AFAICT it *might* be possible to run a PVHv2 Dom0 on a box with no EPT, but
> > with an IOMMU? Does that exist? In that case opt_dom0_shadow won't be set, but
> > paging_mode_hap would be false. Maybe that's just an impossible combination in
> > any case...
> 
> At least when running Xen itself virtualized, I wouldn't dare to assume
> this is an impossible combination. However, I can't see how that case
> would be handled any differently by the original or the suggested
> replacement expressions: need_paging would get set either way afaict.

Oh yes, sorry, my reply was to the "What is the !paging_mode_hap() part good
for?" question. I've changed setting need_paging as you suggested.

> > Given the change that you requested in pvh_steal_ram, now the start of the
> > memory area returned by it it's not going to be page-aligned, so I will have to
> > perform the TSS setup first, and then the identity page tables.
> 
> Or simply pass the required alignment.

Passing an alignment here would mean that pvh_steal_ram would have to return 2
pages in order to meet this alignment, and we would end up wasting memory.
Also, this is the only caller of pvh_steal_ram that requires alignment. This is
what I have after changing pvh_steal_ram to remove RAM from the end of the
region:

static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
{
    p2m_type_t p2mt;
    uint32_t rc, *ident_pt;
    uint8_t *tss;
    mfn_t mfn;
    paddr_t gaddr;

    /*
     * Steal some space from the last found RAM region. One page will be
     * used for the identity page tables, and the remaining space for the
     * VM86 TSS. Note that after this not all e820 regions will be aligned
     * to PAGE_SIZE.
     */
    if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, GB(4), &gaddr) )
    {
        printk("Unable to find memory to stash the identity map and TSS\n");
        return -ENOMEM;
    }

    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
                         &mfn, &p2mt, 0, &rc);
    if ( tss )
    {
        memset(tss, 0, HVM_VM86_TSS_SIZE);
        unmap_domain_page(tss);
        put_page(mfn_to_page(mfn_x(mfn)));
        d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
    }
    else
        printk("Unable to map VM86 TSS area\n");

    gaddr += HVM_VM86_TSS_SIZE;
    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

    /*
     * Identity-map page table is required for running with CR0.PG=0
     * when using Intel EPT. Create a 32-bit non-PAE page directory of
     * superpages.
     */
    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
                              &mfn, &p2mt, 0, &rc);
    if ( ident_pt == NULL )
    {
        printk("Unable to map identity page tables\n");
        return -ENOMEM;
    }
    write_32bit_pse_identmap(ident_pt);
    unmap_domain_page(ident_pt);
    put_page(mfn_to_page(mfn_x(mfn)));
    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;

    return 0;
}

Roger.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 16:04         ` Roger Pau Monne
@ 2017-01-27 16:29           ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-27 16:29 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.01.17 at 17:04, <roger.pau@citrix.com> wrote:
> On Fri, Jan 27, 2017 at 08:11:56AM -0700, Jan Beulich wrote:
>> >>> On 27.01.17 at 13:23, <roger.pau@citrix.com> wrote:
>> > On Thu, Jan 26, 2017 at 05:41:58AM -0700, Jan Beulich wrote:
>> >> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> >> > @@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
>> >> >  static long __initdata dom0_min_nrpages;
>> >> >  static long __initdata dom0_max_nrpages = LONG_MAX;
>> >> >  
>> >> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>> >> > +#define HVM_VM86_TSS_SIZE   128
>> >> 
>> >> I continue to be puzzled by this value. Why 128? I think this really
>> >> needs to be clarified in the comment.
>> > 
>> > Given the recent comments by Tim, and that this is starting to look like a can
>> > of worms, I would like to leave this as-is for the moment, on the grounds that
>> > it's what hvmloader does (I'm not saying it's right), and that this issue
>> > should be treated independently from this patch series.
>> 
>> Well, for the purpose of this patch it would be sufficient if the
>> comment referred to hvmloader. But then I think I saw you set the
>> TSS limit to 0x67, which is neither in line with the value above nor
> 
> Hm, no, I'm not setting the limit anywhere here; that's done in
> vmx_set_segment_register,

Well, you do, in patch 8 (in pvh_setup_cpus()). But that's a different
TSS, so the limits are independent. It's just what I had in mind here.

> and it's indeed set to 0xff, which is wrong for
> hvmloader too, according to the ongoing conversation about this
> HVM_VM86_TSS_SIZE param.

Right.

>> - according to what Tim said (but I didn't check myself yet) - the
>> 255 used in hvmloader. I.e. if you clone hvmloader code, all
>> aspects of it should match.
>> 
>> > Alternatively, I can just remove setting HVM_PARAM_VM86_TSS for a PVHv2 Dom0.
>> > IIRC I've tried that before (without unrestricted mode support) and it was
>> > working fine.
>> 
>> Now if that's the case, then why bother with the TSS?
> 
> It seems that it was working by pure luck, but I don't know all the details.
> Maybe the emulator is somehow fixing this up when the TSS is corrupted/incorrect?

I don't think so. Btw, why is the kernel dropping back into real mode
anyway? It's being started in protected mode after all.

>> > Given the change that you requested in pvh_steal_ram, now the start of the
>> > memory area returned by it it's not going to be page-aligned, so I will have to
>> > perform the TSS setup first, and then the identity page tables.
>> 
>> Or simply pass the required alignment.
> 
> Passing an alignment here would mean that pvh_steal_ram would have to return 2
> pages in order to meet this alignment, and we would end up wasting memory.
> Also, this is the only caller of pvh_steal_ram that requires alignment. This is
> what I have after changing pvh_steal_ram to remove RAM from the end of the
> region:
> 
> static int __init pvh_setup_vmx_realmode_helpers(struct domain *d)
> {
>     p2m_type_t p2mt;
>     uint32_t rc, *ident_pt;
>     uint8_t *tss;
>     mfn_t mfn;
>     paddr_t gaddr;
> 
>     /*
>      * Steal some space from the last found RAM region. One page will be
>      * used for the identity page tables, and the remaining space for the
>      * VM86 TSS. Note that after this not all e820 regions will be aligned
>      * to PAGE_SIZE.
>      */
>     if ( pvh_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, GB(4), &gaddr) )
>     {
>         printk("Unable to find memory to stash the identity map and TSS\n");
>         return -ENOMEM;
>     }
> 
>     tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
>                          &mfn, &p2mt, 0, &rc);
>     if ( tss )
>     {
>         memset(tss, 0, HVM_VM86_TSS_SIZE);
>         unmap_domain_page(tss);
>         put_page(mfn_to_page(mfn_x(mfn)));
>         d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
>     }
>     else
>         printk("Unable to map VM86 TSS area\n");
> 
>     gaddr += HVM_VM86_TSS_SIZE;
>     ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));

And this assert holds merely because, prior to this function running,
all E820 entries are page aligned? That's rather fragile then.
Considering that getting into here is going to be increasingly unlikely
going forward, I don't think we should be afraid of wasting a little
bit of memory here.

Jan


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 13:20         ` Tim Deegan
  2017-01-27 13:46           ` Andrew Cooper
@ 2017-01-27 16:40           ` Jan Beulich
  2017-01-27 18:06             ` Andrew Cooper
  2017-01-27 19:48             ` Tim Deegan
  1 sibling, 2 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-27 16:40 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel, Roger Pau Monne

>>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>> On 27/01/17 11:14, Tim Deegan wrote:
>> > But looking at it now, I'm not convinced of exactly how.  The magic
>> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>> > base address itself lives at offset 100.  A zero'd TSS should mean an
>> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
>> > plausibly work if the TSS were 256 bytes (matching the limit set in
>> > Xen).  Perhaps it's only working because the 128 bytes following the
>> > TSS in hvmloader happen to be zeros too?
>> 
>> With an IO_base_map of 0, the software interrupt bitmap will end up
>> being ahead of the TSS, not after it.
> 
> I should have thought that the segmented address calculation would
> wrap and leave us at TSS + 224.

I don't think wrapping takes the limit value into account. It's all
linear address calculations, and as Andrew says the assumption
in microcode likely is that things will be set up properly by any
OS interested in using the interrupt bitmap.

>> > I also don't remember why the TSS is 128 rather than 104 bytes.  The
>> > SDM claims that the TSS must be larger than 104 bytes "when accessing
>> > the I/O permission bit map or interrupt redirection bit map."
>> > (7.2.2. "TSS Descriptor") but I suspect that just means that the
>> > generated address of the bitmap must lie inside the limit.
>> 
>> The documented way of expressing "no IO bitmap" is to set the map base
>> to a value which exceeds the TSS limit.  All this means (I think) is
>> that you must make a larger than default TSS if you want to use a IO or
>> software interrupt bitmap.
> 
> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> enough space for a full one, but the current SDM is pretty clear that
> the CPU will try to check it in virtual 8086 mode.
> 
> It may be that all the ports actually used happen to fall in the 128
> bytes of zeros that we provide.

I suppose so: 128 bytes cover 128 * 8 = 1024 ports, which is precisely
the ISA port range (0x000-0x3ff).

So what we'll need to do then, as I understand it from the
discussion so far:

- vmx_set_segment_register() will need to set a correct limit
- vmx_set_segment_register() should initialize the TSS every
  time (including setting the I/O bitmap address to no lower
  than 32)
- hvmloader's init_vm86_tss() will need to allocate 160 bytes
  rather than 128 (and we should expose this number, so that
  Roger can also use it)

Perhaps we should even introduce a hypercall for hvmloader
to query the needed value, rather than exposing a hardcoded
number?

Jan



* Re: [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
  2017-01-26 13:37   ` Jan Beulich
@ 2017-01-27 17:22     ` Roger Pau Monne
  2017-01-27 17:28       ` Roger Pau Monne
                         ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 17:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Jan 26, 2017 at 06:37:00AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > @@ -1959,12 +1960,146 @@ static int __init pvh_setup_p2m(struct domain *d)
> >  #undef MB1_PAGES
> >  }
> >  
> > +static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> > +                                  unsigned long image_headroom,
> > +                                  module_t *initrd, char *image_base,
> > +                                  char *cmdline, paddr_t *entry,
> > +                                  paddr_t *start_info_addr)
> > +{
> > +    char *image_start = image_base + image_headroom;
> 
> While for cmdline plain char is certainly fine, I think we should stop
> abusing this type for image and image_base, even if this means
> some further adjustments elsewhere. This really ought to be u8 or
> unsigned char.

Done.

> > +    unsigned long image_len = image->mod_end;
> > +    struct elf_binary elf;
> > +    struct elf_dom_parms parms;
> > +    paddr_t last_addr;
> > +    struct hvm_start_info start_info;
> > +    struct hvm_modlist_entry mod = { 0 };
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    int rc;
> > +
> > +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> > +    {
> > +        printk("Error trying to detect bz compressed kernel\n");
> > +        return rc;
> > +    }
> > +
> > +    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> > +    {
> > +        printk("Unable to init ELF\n");
> > +        return rc;
> > +    }
> > +#ifdef VERBOSE
> > +    elf_set_verbose(&elf);
> > +#endif
> > +    elf_parse_binary(&elf);
> > +    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
> > +    {
> > +        printk("Unable to parse kernel for ELFNOTES\n");
> > +        return rc;
> > +    }
> > +
> > +    if ( parms.phys_entry == UNSET_ADDR32 ) {
> 
> Please move the brace to its own line.

Done.

> > +        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
> > +           parms.guest_ver, parms.loader,
> > +           elf_64bit(&elf) ? "64-bit" : "32-bit");
> > +
> > +    /* Copy the OS image and free temporary buffer. */
> > +    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
> > +    elf.dest_size = parms.virt_kend - parms.virt_kstart;
> > +
> > +    /*
> > +     * NB: we need to set vCPU 0 of Dom0 as the current vCPU (instead of the
> > +     * idle one) because elf_load_binary calls raw_{copy_to/clear}_guest, and
> > +     * the target domain is not passed anywhere. This is very similar to what
> > +     * is done during classic PV Dom0 creation, where Xen switches to the Dom0
> > +     * page tables. We also cannot pass a struct domain or vcpu to
> > +     * elf_load_binary, since the interface is shared with the toolstack, and
> > +     * it doesn't have any notion of a domain or vcpu struct.
> > +     */
> > +    saved_current = current;
> > +    set_current(v);
> > +    rc = elf_load_binary(&elf);
> > +    set_current(saved_current);
> 
> While with the comment it's not as bad anymore, I'm still not happy
> with this code. The tool stack portion of the explanation in particular
> does not really hold: We would add an opaque void * parameter to
> the functions involved, which the tool stack could pass NULL for, and
> you'd use it to convey the vcpu.

As spoken on IRC, I will add a new field to elf_binary only visible to the
hypervisor and that will be used to pass the destination vcpu down to
elf_load_image.

> > +    if ( rc < 0 )
> > +    {
> > +        printk("Failed to load kernel: %d\n", rc);
> > +        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
> > +        return rc;
> > +    }
> > +
> > +    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> > +
> > +    if ( initrd != NULL )
> > +    {
> > +        rc = hvm_copy_to_guest_phys_vcpu(last_addr,
> > +                                         mfn_to_virt(initrd->mod_start),
> > +                                         initrd->mod_end, v);
> > +        if ( rc )
> > +        {
> > +            printk("Unable to copy initrd to guest\n");
> > +            return rc;
> > +        }
> > +
> > +        mod.paddr = last_addr;
> > +        mod.size = initrd->mod_end;
> > +        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
> > +    }
> > +
> > +    /* Free temporary buffers. */
> > +    discard_initial_images();
> > +
> > +    memset(&start_info, 0, sizeof(start_info));
> 
> You clear "mod" with an initializer (which btw could be just the two
> braces), but you memset() start_info - please be consistent.

Done, I've added an initializer to it also.

> > +    if ( cmdline != NULL )
> > +    {
> > +        rc = hvm_copy_to_guest_phys_vcpu(last_addr, cmdline,
> > +                                         strlen(cmdline) + 1, v);
> > +        if ( rc )
> > +        {
> > +            printk("Unable to copy guest command line\n");
> > +            return rc;
> > +        }
> > +        start_info.cmdline_paddr = last_addr;
> > +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
> 
> Where is this 8 coming from?

So that the next struct is aligned to the cache line size? It's also done in
xc_dom_x86.c.

Roger.


* Re: [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
  2017-01-27 17:22     ` Roger Pau Monne
@ 2017-01-27 17:28       ` Roger Pau Monne
  2017-01-30 10:20         ` Jan Beulich
  2017-01-27 18:03       ` Roger Pau Monne
  2017-01-30 10:14       ` Jan Beulich
  2 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 17:28 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Jan 27, 2017 at 05:22:03PM +0000, Roger Pau Monne wrote:
> On Thu, Jan 26, 2017 at 06:37:00AM -0700, Jan Beulich wrote:
> > >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > > +    if ( cmdline != NULL )
> > > +    {
> > > +        rc = hvm_copy_to_guest_phys_vcpu(last_addr, cmdline,
> > > +                                         strlen(cmdline) + 1, v);
> > > +        if ( rc )
> > > +        {
> > > +            printk("Unable to copy guest command line\n");
> > > +            return rc;
> > > +        }
> > > +        start_info.cmdline_paddr = last_addr;
> > > +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
> > 
> > Where is this 8 coming from?
> 
> So that the next struct is aligned to the cache line size? It's also done in
> xc_dom_x86.c.

Sorry, not to the cache, this is just so it's aligned:

https://lists.xen.org/archives/html/xen-devel/2015-08/msg01883.html

Roger.



* Re: [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
  2017-01-27 17:22     ` Roger Pau Monne
  2017-01-27 17:28       ` Roger Pau Monne
@ 2017-01-27 18:03       ` Roger Pau Monne
  2017-01-30 10:14       ` Jan Beulich
  2 siblings, 0 replies; 68+ messages in thread
From: Roger Pau Monne @ 2017-01-27 18:03 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Jan 27, 2017 at 05:22:03PM +0000, Roger Pau Monne wrote:
> On Thu, Jan 26, 2017 at 06:37:00AM -0700, Jan Beulich wrote:
> > >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > > @@ -1959,12 +1960,146 @@ static int __init pvh_setup_p2m(struct domain *d)
> > >  #undef MB1_PAGES
> > >  }
> > >  
> > > +static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> > > +                                  unsigned long image_headroom,
> > > +                                  module_t *initrd, char *image_base,
> > > +                                  char *cmdline, paddr_t *entry,
> > > +                                  paddr_t *start_info_addr)
> > > +{
> > > +    char *image_start = image_base + image_headroom;
> > 
> > While for cmdline plain char is certainly fine, I think we should stop
> > abusing this type for image and image_base, even if this means
> > some further adjustments elsewhere. This really ought to be u8 or
> > unsigned char.
> 
> Done.

I've used void * because that's also used by the toolstack and involves
fewer changes.

Roger.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 16:40           ` Jan Beulich
@ 2017-01-27 18:06             ` Andrew Cooper
  2017-02-03 13:57               ` Jan Beulich
  2017-01-27 19:48             ` Tim Deegan
  1 sibling, 1 reply; 68+ messages in thread
From: Andrew Cooper @ 2017-01-27 18:06 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan; +Cc: xen-devel, boris.ostrovsky, Roger Pau Monne

On 27/01/17 16:40, Jan Beulich wrote:
>>>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
>> At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
>>> On 27/01/17 11:14, Tim Deegan wrote:
>>>> But looking at it now, I'm not convinced of exactly how.  The magic
>>>> bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
>>>> base address itself lives at offset 100.  A zero'd TSS should mean an
>>>> I/O map at 0, and an interrupt redirection bitmap at -32, which would
>>>> plausibly work if the TSS were 256 bytes (matching the limit set in
>>>> Xen).  Perhaps it's only working because the 128 bytes following the
>>>> TSS in hvmloader happen to be zeros too?
>>> With an IO_base_map of 0, the software interrupt bitmap will end up
>>> being ahead of the TSS, not after it.
>> I should have thought that the segmented address calculation would
>> wrap and leave us at TSS + 224.
> I don't think wrapping takes the limit value into account. It's all
> linear address calculations, and as Andrew says the assumption
> in microcode likely is that things will be set up properly by any
> OS interested in using the interrupt bitmap.
>
>>>> I also don't remember why the TSS is 128 rather than 104 bytes.  The
>>>> SDM claims that the TSS must be larger than 104 bytes "when accessing
>>>> the I/O permission bit map or interrupt redirection bit map."
>>>> (7.2.2. "TSS Descriptor") but I suspect that just means that the
>>>> generated address of the bitmap must lie inside the limit.
>>> The documented way of expressing "no IO bitmap" is to set the map base
>>> to a value which exceeds the TSS limit.  All this means (I think) is
>>> that you must make a larger than default TSS if you want to use a IO or
>>> software interrupt bitmap.
>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>> enough space for a full one, but the current SDM is pretty clear that
>> the CPU will try to check it in virtual 8086 mode.
>>
>> It may be that all the ports actually used happen to fall in the 128
>> bytes of zeros that we provide.
> I suppose so: 128 bytes cover 128 * 8 = 1024 ports, which is precisely
> the ISA port range (0x000-0x3ff).
>
> So what we'll need to do then, as I understand it from the
> discussion so far:
>
> - vmx_set_segment_register() will need to set a correct limit
> - vmx_set_segment_register() should initialize the TSS every
>   time (including setting the I/O bitmap address to no lower
>   than 32)
> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>   rather than 128 (and we should expose this number, so that
>   Roger can also use it)
>
> Perhaps we should even introduce a hypercall for hvmloader
> to query the needed value, rather than exposing a hardcoded
> number?

I suggest we remove all responsibility of managing this from hvmloader. 
The only thing hvmloader does is allocate space for it, and reserve it
in the E820.

It is conceptually related to IDENT_PT, although the IDENT_PT must be
allocated and filled in by the domain builder for the HVM guest to
function.  It would be cleaner for the domain builder to also allocate
an adjacent page for the VM86_TSS when it constructs the IDENT_PT.

All HVMLoader needs to do is read the two hvmparams and adjust the E820
table suitably.

Finally, the IO bitmap needs to be a fraction larger than 160 bytes.

From tools/firmware/rombios/rombios.h:

#define PANIC_PORT  0x400
#define PANIC_PORT2 0x401
#define INFO_PORT   0x402
#define DEBUG_PORT  0x403

which are just above the ISA range.  I'd also just allocate a full page
for it; no OS is going to bother trying to use fractions of a page
around an E820 reserved region.
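
For reference, the arithmetic behind "a fraction larger" (my sums, not
from the rombios sources): port 0x403 falls into I/O bitmap byte
0x403 / 8 = 128, i.e. the 129th byte - one beyond the 128 bytes of
bitmap discussed above.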

~Andrew


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 14:35               ` Andrew Cooper
@ 2017-01-27 19:43                 ` Tim Deegan
  2017-01-30 10:43                   ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Tim Deegan @ 2017-01-27 19:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Roger Pau Monne



At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
> On 27/01/17 14:01, Tim Deegan wrote:
> > Hi,
> >
> > At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
> >> The actual behaviour can be determined by putting the TSS on a page
> >> boundary, making the previous frame non-readable via EPT, and seeing
> >> whether an EPT violation occurs.
> > Indeed.  Or likewise with normal pagetables. 
> >
> >>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
> >>> enough space for a full one, but the current SDM is pretty clear that
> >>> the CPU will try to check it in virtual 8086 mode.
> >>>
> >>> It may be that all the ports actually used happen to fall in the 128
> >>> bytes of zeros that we provide.
> >> With an offset of 0, we actually provide 256 bytes of zeros in the
> >> bitmap within the TSS limit.
> > Sure, or at least 128 bytes of zeros and another 128 bytes of something.
> 
> That is a good point.  Nothing prevents a guest exiting vm86 mode, and
> using a task switch to move to a new tss, which will cause Xen to write
> state back into the vm86_tss, making it no longer a zeroed block of memory.
> 
> Despite being owned by the guest, this TSS is actually managed by Xen. 
> It should be initialised to defaults each time Xen needs to use it on
> behalf of the guest.

But it's already in an E820 reserved block - if the guest overwrites
it (with a task switch or otherwise) it will break real-mode support,
but this is no worse than nobbling any other part of the BIOS state.

If we're making it non-zero, I can see an argument for having Xen init
the contents once (maybe when the HVM param is written?) so that it
matches what Xen expects of it.  But resetting it every time we use it
would be overkill.

> >> We set IOPL to 3 as well when entering vm86 to fake up real mode.
> >> This bypasses all I/O bitmap checks (a property common to ring 3
> >> protected tasks as well - see specifically 20.2.7 "Sensitive
> >> Instructions"), which means the IN/OUT instructions end up directly at
> >> the relevant vmexit case.
> > 20.2.8.1 makes it clear that this is not the case -- in virtual 8086
> > mode all IN/OUT ops check the bitmap event with IOPL == CPL.
> 
> Hmm.  Right you are, which explains why the TSS limit is greater than
> 0x67.
> 
> If the emulation code were working correctly, the emulator should come
> to the same conclusion as hardware and inject a #GP fault.

I don't think so -- the emulator is emulating actual real-mode, not
virtual 8086 mode, so it shouldn't fault on any IO port accesses.

Cheers,

Tim.


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 16:40           ` Jan Beulich
  2017-01-27 18:06             ` Andrew Cooper
@ 2017-01-27 19:48             ` Tim Deegan
  2017-02-02 15:38               ` Jan Beulich
  1 sibling, 1 reply; 68+ messages in thread
From: Tim Deegan @ 2017-01-27 19:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel, Roger Pau Monne

At 09:40 -0700 on 27 Jan (1485510008), Jan Beulich wrote:
> >>> On 27.01.17 at 14:20, <tim@xen.org> wrote:
> > At 12:51 +0000 on 27 Jan (1485521470), Andrew Cooper wrote:
> >> On 27/01/17 11:14, Tim Deegan wrote:
> >> > But looking at it now, I'm not convinced of exactly how.  The magic
> >> > bitmap in the TSS is at [I/O Map Base Address] - 32, and the I/O map
> >> > base address itself lives at offset 100.  A zero'd TSS should mean an
> >> > I/O map at 0, and an interrupt redirection bitmap at -32, which would
> >> > plausibly work if the TSS were 256 bytes (matching the limit set in
> >> > Xen).  Perhaps it's only working because the 128 bytes following the
> >> > TSS in hvmloader happen to be zeros too?
> >> 
> >> With an IO_base_map of 0, the software interrupt bitmap will end up
> >> being ahead of the TSS, not after it.
> > 
> > I should have thought that the segmented address calculation would
> > wrap and leave us at TSS + 224.
> 
> I don't think wrapping takes the limit value into account.

Quite right, I'm talking nonsense.

> - vmx_set_segment_register() will need to set a correct limit

Yep.

> - vmx_set_segment_register() should initialize the TSS every
>   time (including setting the I/O bitmap address to no lower
>   than 32)

Probably to no lower than 136, to avoid having the bits of that field
itself appearing in either the IO or interrupt bitmap.
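
As a minimal sketch of that initialisation (the offsets follow the
SDM's 32-bit TSS layout; the function and constant names are made up
here, this isn't code from the series):

#include <stdint.h>
#include <string.h>

/* The I/O map base field sits at offset 100 (0x64) of a 32-bit TSS. */
#define TSS_IOMAP_BASE_OFFSET 100

static void vm86_tss_init(uint8_t *tss, unsigned int size)
{
    memset(tss, 0, size);        /* zeroed bitmaps: nothing intercepted */

    /*
     * 104 bytes of TSS fields plus the 32-byte interrupt redirection
     * bitmap put the I/O bitmap at offset 136, keeping both bitmaps
     * clear of the TSS fields themselves.
     */
    *(uint16_t *)(tss + TSS_IOMAP_BASE_OFFSET) = 104 + 32;
}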

> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>   rather than 128 (and we should expose this number, so that
>   Roger can also use it)
> 
> Perhaps we should even introduce a hypercall for hvmloader
> to query the needed value, rather than exposing a hardcoded
> number?

I think Andrew's suggestion of just using a whole page is a good
one.  The TSS is a 32-bit one, after all, and doesn't need to live in
BIOS space.

Cheers,

Tim.


* Re: [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
  2017-01-27 17:22     ` Roger Pau Monne
  2017-01-27 17:28       ` Roger Pau Monne
  2017-01-27 18:03       ` Roger Pau Monne
@ 2017-01-30 10:14       ` Jan Beulich
  2 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-30 10:14 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.01.17 at 18:22, <roger.pau@citrix.com> wrote:
> On Thu, Jan 26, 2017 at 06:37:00AM -0700, Jan Beulich wrote:
>> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > +    if ( cmdline != NULL )
>> > +    {
>> > +        rc = hvm_copy_to_guest_phys_vcpu(last_addr, cmdline,
>> > +                                         strlen(cmdline) + 1, v);
>> > +        if ( rc )
>> > +        {
>> > +            printk("Unable to copy guest command line\n");
>> > +            return rc;
>> > +        }
>> > +        start_info.cmdline_paddr = last_addr;
>> > +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
>> 
>> Where is this 8 coming from?
> 
> So that the next struct is aligned to the cache line size? It's also done in
> xc_dom_x86.c.

How does 8 match up with cache line size?

Jan



* Re: [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2
  2017-01-27 17:28       ` Roger Pau Monne
@ 2017-01-30 10:20         ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-01-30 10:20 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.01.17 at 18:28, <roger.pau@citrix.com> wrote:
> On Fri, Jan 27, 2017 at 05:22:03PM +0000, Roger Pau Monne wrote:
>> On Thu, Jan 26, 2017 at 06:37:00AM -0700, Jan Beulich wrote:
>> > >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > > +    if ( cmdline != NULL )
>> > > +    {
>> > > +        rc = hvm_copy_to_guest_phys_vcpu(last_addr, cmdline,
>> > > +                                         strlen(cmdline) + 1, v);
>> > > +        if ( rc )
>> > > +        {
>> > > +            printk("Unable to copy guest command line\n");
>> > > +            return rc;
>> > > +        }
>> > > +        start_info.cmdline_paddr = last_addr;
>> > > +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
>> > 
>> > Where is this 8 coming from?
>> 
>> So that the next struct is aligned to the cache line size? It's also done in
>> xc_dom_x86.c.
> 
> Sorry, not to the cache, this is just so it's aligned:
> 
> https://lists.xen.org/archives/html/xen-devel/2015-08/msg01883.html 

Well, that still doesn't explain the 8 to the reader of the code, all the
more so as it remains unclear why it needs to be 8 even for the 32-bit
case. Deriving it from guest properties would likely make the code
self-explanatory; otherwise I'd rather like to ask for a brief comment.
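
E.g. (illustrative only):

    /*
     * struct hvm_modlist_entry contains 64-bit fields regardless of
     * guest bitness, so keep what follows 8-byte aligned.
     */
    last_addr += ROUNDUP(strlen(cmdline) + 1, sizeof(uint64_t));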

Jan



* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 19:43                 ` Tim Deegan
@ 2017-01-30 10:43                   ` Jan Beulich
  2017-01-30 11:06                     ` Andrew Cooper
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-01-30 10:43 UTC (permalink / raw)
  To: Tim Deegan; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel, Roger Pau Monne

>>> On 27.01.17 at 20:43, <tim@xen.org> wrote:

> 
> At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
>> On 27/01/17 14:01, Tim Deegan wrote:
>> > Hi,
>> >
>> > At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>> >> The actual behaviour can be determined by putting the TSS on a page
>> >> boundary, making the previous frame non-readable via EPT, and seeing
>> >> whether an EPT violation occurs.
>> > Indeed.  Or likewise with normal pagetables. 
>> >
>> >>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>> >>> enough space for a full one, but the current SDM is pretty clear that
>> >>> the CPU will try to check it in virtual 8086 mode.
>> >>>
>> >>> It may be that all the ports actually used happen to fall in the 128
>> >>> bytes of zeros that we provide.
>> >> With an offset of 0, we actually provide 256 bytes of zeros in the
>> >> bitmap within the TSS limit.
>> > Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>> 
>> That is a good point.  Nothing prevents a guest exiting vm86 mode, and
>> using a task switch to move to a new tss, which will cause Xen to write
>> state back into the vm86_tss, making it no longer a zeroed block of memory.
>> 
>> Despite being owned by the guest, this TSS is actually managed by Xen. 
>> It should be initialised to defaults each time Xen needs to use it on
>> behalf of the guest.
> 
> But it's already in an E820 reserved block - if the guest overwrites
> it (with a task switch or otherwise) it will break real-mode support,
> but this is no worse than nobbling any other part of the BIOS state.
> 
> If we're making it non-zero, I can see an argument for having Xen init
> the contents once (maybe when the HVM param is written?) so that it
> matches what Xen expects of it.  But resetting it every time we use it
> would be overkill.

That wasn't the point Andrew was making, I think. A task switch
initiated by the guest would make the hypervisor write into that
TSS (as the outgoing one). Of course any sane guest would do an
LTR first (or else it would risk memory near address zero to get
clobbered on real hardware).

Jan



* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-30 10:43                   ` Jan Beulich
@ 2017-01-30 11:06                     ` Andrew Cooper
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Cooper @ 2017-01-30 11:06 UTC (permalink / raw)
  To: Jan Beulich, Tim Deegan; +Cc: xen-devel, boris.ostrovsky, Roger Pau Monne

On 30/01/17 10:43, Jan Beulich wrote:
>>>> On 27.01.17 at 20:43, <tim@xen.org> wrote:
>> At 14:35 +0000 on 27 Jan (1485527708), Andrew Cooper wrote:
>>> On 27/01/17 14:01, Tim Deegan wrote:
>>>> Hi,
>>>>
>>>> At 13:46 +0000 on 27 Jan (1485524765), Andrew Cooper wrote:
>>>>> The actual behaviour can be determined by putting the TSS on a page
>>>>> boundary, making the previous frame non-readable via EPT, and seeing
>>>>> whether an EPT violation occurs.
>>>> Indeed.  Or likewise with normal pagetables. 
>>>>
>>>>>> Yes, I wonder about the I/O bitmap too.  We don't provide one, or even
>>>>>> enough space for a full one, but the current SDM is pretty clear that
>>>>>> the CPU will try to check it in virtual 8086 mode.
>>>>>>
>>>>>> It may be that all the ports actually used happen to fall in the 128
>>>>>> bytes of zeros that we provide.
>>>>> With an offset of 0, we actually provide 256 bytes of zeros in the
>>>>> bitmap within the TSS limit.
>>>> Sure, or at least 128 bytes of zeros and another 128 bytes of something.
>>> That is a good point.  Nothing prevents a guest exiting vm86 mode, and
>>> using a task switch to move to a new tss, which will cause Xen to write
>>> state back into the vm86_tss, making it no longer a zeroed block of memory.
>>>
>>> Despite being owned by the guest, this TSS is actually managed by Xen. 
>>> It should be initialised to defaults each time Xen needs to use it on
>>> behalf of the guest.
>> But it's already in an E820 reserved block - if the guest overwrites
>> it (with a task switch or otherwise) it will break real-mode support,
>> but this is no worse than nobbling any other part of the BIOS state.
>>
>> If we're making it non-zero, I can see an argument for having Xen init
>> the contents once (maybe when the HVM param is written?) so that it
>> matches what Xen expects of it.  But resetting it every time we use it
>> would be overkill.
> That wasn't the point Andrew was making, I think. A task switch
> initiated by the guest would make the hypervisor write into that
> TSS (as the outgoing one). Of course any sane guest would do an
> LTR first (or else it would risk memory near address zero to get
> clobbered on real hardware).

Thinking about it, this depends on whether we properly save and restore
the protected mode %tr around entering and exiting faked-up real mode.

If the saving and restoring is already done properly, then I think my
concern is unfounded.

~Andrew


* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 19:48             ` Tim Deegan
@ 2017-02-02 15:38               ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-02-02 15:38 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan; +Cc: xen-devel, boris.ostrovsky, Roger Pau Monne

>>> On 27.01.17 at 20:48, <tim@xen.org> wrote:
> At 09:40 -0700 on 27 Jan (1485510008), Jan Beulich wrote:
>> - vmx_set_segment_register() should initialize the TSS every
>>   time (including setting the I/O bitmap address to no lower
>>   than 32)
> 
> Probably to no lower than 136, to avoid having the bits of that field
> itself appearing in either the IO or interrupt bitmap.

Indeed.

>> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>>   rather than 128 (and we should expose this number, so that
>>   Roger can also use it)
>> 
>> Perhaps we should even introduce a hypercall for hvmloader
>> to query the needed value, rather than exposing a hardcoded
>> number?
> 
> I think Andrew's suggestion of just using a whole page is a good
> one.  The TSS is a 32-bit one, after all, and doesn't need to live in
> BIOS space.

Hmm, any size increase will need to come with further changes,
as it looks, including the use of a new HVM param: The VM86_TSS
param is being migrated, and hence for an incoming VM we need
to be able to tell whether the guest has set aside 128 bytes or a
full page. This of course implies that we need to keep Xen handling
the 128-byte case, too.

And if we somehow expect that a single page may not suffice in
the future, it may even be advisable to store an (address,size)
pair as param.
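
For illustration, one way such an (address, size) pair could be packed into
a single 64-bit param (a sketch only; the 48-bit/16-bit split and the helper
names are assumptions, not anything that has been agreed on):

#include <stdint.h>

#define VM86_TSS_ADDR_BITS 48
#define VM86_TSS_ADDR_MASK ((1ULL << VM86_TSS_ADDR_BITS) - 1)

/* Pack the TSS physical address into the low 48 bits, its size into
 * the top 16 bits. */
static inline uint64_t vm86_tss_pack(uint64_t addr, uint16_t size)
{
    return (addr & VM86_TSS_ADDR_MASK) |
           ((uint64_t)size << VM86_TSS_ADDR_BITS);
}

static inline uint64_t vm86_tss_addr(uint64_t param)
{
    return param & VM86_TSS_ADDR_MASK;
}

static inline uint16_t vm86_tss_size(uint64_t param)
{
    return param >> VM86_TSS_ADDR_BITS;
}

A size of 0 could then denote the legacy 128-byte layout, which would keep
incoming migration streams from older hypervisors working.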

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map
  2017-01-27 18:06             ` Andrew Cooper
@ 2017-02-03 13:57               ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-02-03 13:57 UTC (permalink / raw)
  To: Andrew Cooper, Tim Deegan; +Cc: xen-devel, boris.ostrovsky, Roger Pau Monne

>>> On 27.01.17 at 19:06, <andrew.cooper3@citrix.com> wrote:
> On 27/01/17 16:40, Jan Beulich wrote:
>> So what we'll need to do then, as I understand it from the
>> discussion so far:
>>
>> - vmx_set_segment_register() will need to set a correct limit
>> - vmx_set_segment_register() should initialize the TSS every
>>   time (including setting the I/O bitmap address to no lower
>>   than 32)
>> - hvmloader's init_vm86_tss() will need to allocate 160 bytes
>>   rather than 128 (and we should expose this number, so that
>>   Roger can also use it)
>>
>> Perhaps we should even introduce a hypercall for hvmloader
>> to query the needed value, rather than exposing a hardcoded
>> number?
> 
> I suggest we remove all responsibility of managing this from hvmloader. 
> The only thing hvmloader does is allocate space for it, and reserve it
> in the E820.

While I did it that way for now, I'm no longer convinced this is
useful. With multiple vCPU-s, a guest can do whatever it wants to
this TSS anyway, regardless of whether Xen currently thinks it's
using a suitably initialized memory block. And whatever the guest
does, any non-zero bit in that area will only slow it down (due to
the VM exits resulting from the #GP faults caused by those 1 bits,
with the respective I/O or INTnn insns then being carried out by
the emulator).

> It is conceptually related to IDENT_PT, although the IDENT_PT must be
> allocated and filled in by the domain builder for the HVM guest to
> function.  It would be cleaner for the domain builder to also allocate
> an adjacent page for the VM86_TSS when it constructs the IDENT_PT.

I'll leave that for someone else to carry out; for now allocation
will remain in hvmloader.

> Finally, the IO bitmap needs to be a fraction larger than 160 bytes.
> 
> From tools/firmware/rombios/rombios.h:
> 
> #define PANIC_PORT  0x400
> #define PANIC_PORT2 0x401
> #define INFO_PORT   0x402
> #define DEBUG_PORT  0x403
> 
> which are just above the ISA range.

Which causes only slowness (due to needing the emulator to carry
out the instruction), but no lack of functionality.

>  I'd also just allocate a full page
> for it; no OS is going to bother trying to use fractions of a page
> around an E820 reserved region.

But the smaller range may well be part of an already partially used
page. Given that any port accesses not covered by the bitmap would
still be handled correctly, I'd prefer to make
the TSS 0x68 + 0x20 + 0x80 + 1 bytes large (base structure plus
interrupt redirection bitmap plus I/O bitmap plus trailing byte),
which, due to the goal of avoiding page boundaries in the middle,
would mean a 512 byte block aligned to a 512-byte boundary.
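
For concreteness, the arithmetic above works out as follows (a sketch; the
rounding helper is just one way to arrive at the 512-byte figure):

/* vm86 TSS sizing, using the numbers from the discussion:
 *   0x68 bytes  32-bit TSS base structure
 *   0x20 bytes  interrupt redirection bitmap (256 bits)
 *   0x80 bytes  I/O bitmap covering ports 0-0x3ff
 *      1 byte   trailing all-ones byte required at the end of the
 *               I/O bitmap
 * Total 0x109 bytes; rounding up to the next power of two lets the block
 * be aligned to its own size, so it can never straddle a page boundary.
 */
#define VM86_TSS_RAW_SIZE (0x68 + 0x20 + 0x80 + 1)   /* 0x109 */

static unsigned int round_up_pow2(unsigned int n)
{
    unsigned int r = 1;

    while ( r < n )
        r <<= 1;
    return r;   /* 0x200 (512) for an input of 0x109 */
}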

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs
  2017-01-26 13:46   ` Jan Beulich
@ 2017-02-08 12:48     ` Roger Pau Monne
  2017-02-08 13:02       ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-02-08 12:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Jan 26, 2017 at 06:46:17AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > @@ -2124,6 +2175,15 @@ static int __init construct_dom0_pvh(struct domain *d, const module_t *image,
> >          return rc;
> >      }
> >  
> > +    rc = pvh_setup_cpus(d, entry, start_info);
> > +    if ( rc )
> > +    {
> > +        printk("Failed to setup Dom0 CPUs: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
> 
> Would you mind moving this into the function (where you then can
> use just v)?

Fixed all the above, but I'm not sure about this. In v2 you asked me to
move the clear_bit outside of pvh_setup_cpus:

https://www.mail-archive.com/xen-devel@lists.xen.org/msg83714.html

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs
  2017-02-08 12:48     ` Roger Pau Monne
@ 2017-02-08 13:02       ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-02-08 13:02 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 08.02.17 at 13:48, <roger.pau@citrix.com> wrote:
> On Thu, Jan 26, 2017 at 06:46:17AM -0700, Jan Beulich wrote:
>> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > @@ -2124,6 +2175,15 @@ static int __init construct_dom0_pvh(struct domain 
> *d, const module_t *image,
>> >          return rc;
>> >      }
>> >  
>> > +    rc = pvh_setup_cpus(d, entry, start_info);
>> > +    if ( rc )
>> > +    {
>> > +        printk("Failed to setup Dom0 CPUs: %d\n", rc);
>> > +        return rc;
>> > +    }
>> > +
>> > +    clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
>> 
>> Would you mind moving this into the function (where you then can
>> use just v)?
> 
> Fixed all the above, but I'm not sure about this. In v2 you asked me to
> move the clear_bit outside of pvh_setup_cpus:
> 
> https://www.mail-archive.com/xen-devel@lists.xen.org/msg83714.html 

I asked you to move it later, but I did not ask for it to be moved
outside.
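
That is, something of this shape (a sketch only; the context setup is
elided and the names simply follow the quoted patch):

static int __init pvh_setup_cpus(struct domain *d, paddr_t entry,
                                 paddr_t start_info)
{
    struct vcpu *v = d->vcpu[0];
    int rc;

    /* ... vCPU context setup as in the quoted patch, consuming entry
     * and start_info ... */
    rc = 0;   /* stands in for the result of that setup */
    if ( rc )
        return rc;

    /* Moved here as the last step, using just v: */
    clear_bit(_VPF_down, &v->pause_flags);

    return 0;
}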

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-01-26 14:16   ` Jan Beulich
@ 2017-02-08 15:10     ` Roger Pau Monne
  2017-02-08 15:50       ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-02-08 15:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Jan 26, 2017 at 07:16:32AM -0700, Jan Beulich wrote:
> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> > +static int __init pvh_setup_acpi_madt(struct domain *d, paddr_t *addr)
> > +{
> > +    struct acpi_table_madt *madt;
> > +    struct acpi_table_header *table;
> > +    struct acpi_madt_io_apic *io_apic;
> > +    struct acpi_madt_local_apic *local_apic;
> > +    struct acpi_madt_local_x2apic *x2apic;
> > +    acpi_status status;
> > +    unsigned long size;
> > +    unsigned int i, max_vcpus;
> > +    int rc;
> > +
> > +    /* Count number of interrupt overrides in the MADT. */
> > +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> > +                          acpi_count_intr_ovr, UINT_MAX);
> > +
> > +    /* Count number of NMI sources in the MADT. */
> > +    acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_count_nmi_src,
> > +                          UINT_MAX);
> > +
> > +    /* Check if there are x2APIC entries. */
> > +    acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC, acpi_check_x2apic, 1);
> > +
> > +    max_vcpus = dom0_max_vcpus();
> > +    /* Calculate the size of the crafted MADT. */
> > +    size = sizeof(*madt);
> > +    size += sizeof(*io_apic);
> 
> Still just a single IO-APIC and no fixme note. I see you have one
> below, but this line will need to remain in lock step. Or perhaps you
> could multiply by nr_ioapics here, accepting the transient waste of
> space.

Multiplying by nr_ioapics would make the ASSERT(((void *)x2apic - (void *)madt)
== size); at the end trigger. I'd rather add a comment here.

> > +    /*
> > +     * The APIC ID field of the local APIC struct is only 1 byte, so we need
> > +     * to limit the number of local APIC entries to 128 because we only use
> > +     * even numbers as APIC IDs.
> > +     */
> > +    size += sizeof(*local_apic) *
> > +            min(max_vcpus, 1U << (sizeof(local_apic->id) * 8));
> 
> This caps at 256 now. Perhaps it would anyway be better to use
> HVM_MAX_VCPUS here, or maybe the minimum of both?

max_vcpus is already capped to HVM_MAX_VCPUS. IMHO it's not right to use
HVM_MAX_VCPUS because we will increase it at some point, and then we would add
more local APIC entries than possible, thus overflowing the id field.
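
(A tiny standalone check of the arithmetic in question - with only even
APIC IDs handed out, an 8-bit id field covers 128 vCPUs, so the cap the
comment describes would want a "- 1" in the shift; names here are
illustrative only:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* uint8_t stands in for the 1-byte local_apic->id field. */
    unsigned int cap = 1U << (sizeof(uint8_t) * 8 - 1);

    printf("vCPUs representable with even APIC IDs: %u\n", cap);   /* 128 */
    return 0;
}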

> > +    size += sizeof(*intsrcovr) * acpi_intr_overrides;
> > +    size += sizeof(*nmisrc) * acpi_nmi_sources;
> > +    size += acpi_x2apic ? sizeof(*x2apic) * dom0_max_vcpus() : 0;
> 
> You have the function call result already latched in a local variable.

Right, fixed.

> Plus, when you cap LAPIC entries, you should also provide x2APIC
> ones (to be able to represent all vCPU-s).

Would it make sense to then provide x2APIC entries regardless of whether they
are present on the native tables or not?

The emulated APIC provided by Xen supports x2APIC mode anyway.

> > +    madt = xzalloc_bytes(size);
> > +    if ( !madt )
> > +    {
> > +        printk("Unable to allocate memory for MADT table\n");
> > +        return -ENOMEM;
> > +    }
> > +
> > +    /* Copy the native MADT table header. */
> > +    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
> > +    if ( !ACPI_SUCCESS(status) )
> > +    {
> > +        printk("Failed to get MADT ACPI table, aborting.\n");
> > +        return -EINVAL;
> > +    }
> > +    madt->header = *table;
> > +    madt->address = APIC_DEFAULT_PHYS_BASE;
> > +    /*
> > +     * NB: this is currently set to 4, which is the revision in the ACPI
> > +     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
> > +     * tables described in the headers.
> > +     */
> > +    madt->header.revision = min_t(unsigned char, table->revision, 4);
> > +
> > +    /*
> > +     * Setup the IO APIC entry.
> > +     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
> > +     * per domain. This must be fixed in order to provide the same amount of
> > +     * IO APICs as available on bare metal, and with the same IDs as found in
> > +     * the native IO APIC MADT entries.
> > +     */
> > +    if ( nr_ioapics > 1 )
> > +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> > +               nr_ioapics);
> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> 
> To aid readability you may want to use void in all such casts.

Thanks.

> > +static int __init pvh_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
> > +                                      paddr_t *addr)
> > +{
> > +    struct acpi_table_xsdt *xsdt;
> > +    struct acpi_table_header *table;
> > +    struct acpi_table_rsdp *rsdp;
> > +    unsigned long size = sizeof(*xsdt);
> > +    unsigned int i, j, num_tables = 0;
> > +    paddr_t xsdt_paddr;
> > +    int rc;
> > +
> > +    /*
> > +     * Restore original DMAR table signature, we are going to filter it
> > +     * from the new XSDT that is presented to the guest, so it no longer
> > +     * makes sense to have its signature zapped.
> 
> "... so it is no longer necessary to ..."?

Fixed.

> > +static int __init pvh_setup_acpi(struct domain *d, paddr_t start_info)
> > +{
> > +    struct acpi_table_rsdp rsdp, *native_rsdp;
> > +    unsigned long pfn, nr_pages;
> > +    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    /* Scan top-level tables and add their regions to the guest memory map. */
> > +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> > +    {
> > +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> > +        unsigned long addr = acpi_gbl_root_table_list.tables[i].address;
> > +        unsigned long size = acpi_gbl_root_table_list.tables[i].length;
> > +
> > +        /*
> > +         * Make sure the original MADT is also mapped, so that Dom0 can
> > +         * properly access the data returned by _MAT methods in case it's
> > +         * re-using MADT memory.
> > +         */
> > +        if ( pvh_acpi_table_allowed(sig) ||
> > +             (strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE) == 0 &&
> > +             !acpi_memory_banned(addr, size)) )
> 
> Indentation. But to me this would be clearer as
> 
>         if ( strncmp(sig, ACPI_SIG_MADT, ACPI_NAME_SIZE)
>              ? pvh_acpi_table_allowed(sig)
>              : !acpi_memory_banned(addr, size) )
> 
> anyway.

Right, that's probably easier to read.

> > +             pvh_add_mem_range(d, addr, addr + size, E820_ACPI);
> > +    }
> > +
> > +    /* Identity map ACPI e820 regions. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        if ( d->arch.e820[i].type != E820_ACPI &&
> > +             d->arch.e820[i].type != E820_NVS )
> > +            continue;
> > +
> > +        pfn = PFN_DOWN(d->arch.e820[i].addr);
> > +        nr_pages = PFN_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
> > +                          d->arch.e820[i].size);
> > +
> > +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> > +        if ( rc )
> > +        {
> > +            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> > +                   pfn, pfn + nr_pages);
> > +            return rc;
> > +        }
> > +    }
> > +
> > +    rc = pvh_setup_acpi_madt(d, &madt_paddr);
> > +    if ( rc )
> > +        return rc;
> > +
> > +    rc = pvh_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
> > +    if ( rc )
> > +        return rc;
> > +
> > +    /* Craft a custom RSDP. */
> > +    memset(&rsdp, 0, sizeof(rsdp));
> 
> Perhaps worth using an initializer again (with once again as many as
> possible of the items below folded there)?

Done, I've initialized signature, revision and length.
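
(For reference, the kind of initializer being suggested might look like
this - field names as in ACPICA's struct acpi_table_rsdp, shown only as a
sketch, not the final patch:)

struct acpi_table_rsdp rsdp = {
    .signature = ACPI_SIG_RSDP,   /* "RSD PTR " */
    .revision = 2,                /* ACPI 2.0+ layout, with an XSDT pointer */
    .length = sizeof(rsdp),
};
/* The remaining fields (xsdt_physical_address, both checksums) are then
 * filled in afterwards, as in the hunk below. */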

> > +    rsdp.length = sizeof(rsdp);
> > +    /*
> > +     * Calling acpi_tb_checksum here is a layering violation, but
> > +     * introducing a wrapper for such simple usage seems overkill.
> > +     */
> > +    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
> > +                                      ACPI_RSDP_REV0_SIZE);
> > +    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
> > +                                               sizeof(rsdp));
> > +
> > +    /*
> > +     * Place the new RSDP in guest memory space.
> > +     *
> > +     * NB: this RSDP is not going to replace the original RSDP, which
> > +     * should still be accessible to the guest. However that RSDP is
> > +     * going to point to the native RSDT, and should not be used.
> 
> ... for the Dom0 kernel's boot purposes (we keep it visible after all
> for post boot access).

Thanks, Roger.
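
(As an aside on the checksum fixup quoted above: an ACPI checksum field is
chosen so that the byte sum of the covered region is zero. A minimal
standalone illustration, with ACPI_RSDP_REV0_SIZE assumed to be the
classic 20 bytes:)

#include <stdint.h>
#include <stddef.h>

/* Sum bytes modulo 256, mirroring what acpi_tb_checksum computes. */
static uint8_t acpi_sum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;

    while ( len-- )
        sum += *buf++;
    return sum;
}

/* Usage mirroring the quoted hunk:
 *   rsdp.checksum -= acpi_sum((uint8_t *)&rsdp, 20);
 *   rsdp.extended_checksum -= acpi_sum((uint8_t *)&rsdp, sizeof(rsdp));
 * after which each covered region sums to zero, as the spec requires. */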

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-02-08 15:10     ` Roger Pau Monne
@ 2017-02-08 15:50       ` Jan Beulich
  2017-02-08 15:58         ` Roger Pau Monne
  0 siblings, 1 reply; 68+ messages in thread
From: Jan Beulich @ 2017-02-08 15:50 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 08.02.17 at 16:10, <roger.pau@citrix.com> wrote:
> On Thu, Jan 26, 2017 at 07:16:32AM -0700, Jan Beulich wrote:
>> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> > +    /*
>> > +     * The APIC ID field of the local APIC struct is only 1 byte, so we need
>> > +     * to limit the number of local APIC entries to 128 because we only use
>> > +     * even numbers as APIC IDs.
>> > +     */
>> > +    size += sizeof(*local_apic) *
>> > +            min(max_vcpus, 1U << (sizeof(local_apic->id) * 8));
>> 
>> This caps at 256 now. Perhaps it would anyway be better to use
>> HVM_MAX_VCPUS here, or maybe the minimum of both?
> 
> max_vcpus is already capped to HVM_MAX_VCPUS.

Is it? I see a use of that symbol only in mk_dsdt.c.

> IMHO it's not right to use
> HVM_MAX_VCPUS because we will increase it at some point, and then we would add
> more local APIC entries than possible, thus overflowing the id field.

Hence the "or maybe the minimum of both".

>> Plus, when you cap LAPIC entries, you should also provide x2APIC
>> ones (to be able to represent all vCPU-s).
> 
> Would it make sense to then provide x2APIC entries regardless of whether they
> are present on the native tables or not?
> 
> The emulated APIC provided by Xen supports x2APIC mode anyway.

Yes, I think so.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-02-08 15:50       ` Jan Beulich
@ 2017-02-08 15:58         ` Roger Pau Monne
  2017-02-08 16:03           ` Jan Beulich
  0 siblings, 1 reply; 68+ messages in thread
From: Roger Pau Monne @ 2017-02-08 15:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Feb 08, 2017 at 08:50:54AM -0700, Jan Beulich wrote:
> >>> On 08.02.17 at 16:10, <roger.pau@citrix.com> wrote:
> > On Thu, Jan 26, 2017 at 07:16:32AM -0700, Jan Beulich wrote:
> >> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
> >> > +    /*
> >> > +     * The APIC ID field of the local APIC struct is only 1 byte, so we need
> >> > +     * to limit the number of local APIC entries to 128 because we only use
> >> > +     * even numbers as APIC IDs.
> >> > +     */
> >> > +    size += sizeof(*local_apic) *
> >> > +            min(max_vcpus, 1U << (sizeof(local_apic->id) * 8));
> >> 
> >> This caps at 256 now. Perhaps it would anyway be better to use
> >> HVM_MAX_VCPUS here, or maybe the minimum of both?
> > 
> > max_vcpus is already capped to HVM_MAX_VCPUS.
> 
> Is it? I see a use of that symbol only in mk_dsdt.c.

It's introduced in a patch of this series, see "x86/PVHv2: fix dom0_max_vcpus
so it's capped to 128 for PVHv2 Dom0"; that's why you don't find it in your
tree :).

> > IMHO it's not right to use
> > HVM_MAX_VCPUS because we will increase it at some point, and then we would add
>> > more local APIC entries than possible, thus overflowing the id field.
> 
> Hence the "or maybe the minimum of both".
> 
> >> Plus, when you cap LAPIC entries, you should also provide x2APIC
> >> ones (to be able to represent all vCPU-s).
> > 
> > Would it make sense to then provide x2APIC entries regardless of whether they
> > are present on the native tables or not?
> > 
> > The emulated APIC provided by Xen supports x2APIC mode anyway.
> 
> Yes, I think so.

As a reply to both comments above, I will switch to simply using x2APIC
entries for all vCPUs and get rid of all this. Then if HVM_MAX_VCPUS is
increased we will have no issue.
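
A sketch of what that could look like (field names per ACPICA's
struct acpi_madt_local_x2apic; the even APIC IDs follow the earlier
comment - treat this as an assumption about the eventual patch, not its
actual content):

static void fill_x2apic_entries(struct acpi_madt_local_x2apic *x2apic,
                                unsigned int max_vcpus)
{
    unsigned int i;

    for ( i = 0; i < max_vcpus; i++, x2apic++ )
    {
        x2apic->header.type = ACPI_MADT_TYPE_LOCAL_X2APIC;
        x2apic->header.length = sizeof(*x2apic);
        x2apic->local_apic_id = i * 2;   /* even APIC IDs, as Xen assigns */
        x2apic->uid = i;
        x2apic->lapic_flags = ACPI_MADT_ENABLED;
    }
}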

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-02-08 15:58         ` Roger Pau Monne
@ 2017-02-08 16:03           ` Jan Beulich
  0 siblings, 0 replies; 68+ messages in thread
From: Jan Beulich @ 2017-02-08 16:03 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 08.02.17 at 16:58, <roger.pau@citrix.com> wrote:
> On Wed, Feb 08, 2017 at 08:50:54AM -0700, Jan Beulich wrote:
>> >>> On 08.02.17 at 16:10, <roger.pau@citrix.com> wrote:
>> > On Thu, Jan 26, 2017 at 07:16:32AM -0700, Jan Beulich wrote:
>> >> >>> On 19.01.17 at 18:29, <roger.pau@citrix.com> wrote:
>> >> > +    /*
>> >> > +     * The APIC ID field of the local APIC struct is only 1 byte, so we need
>> >> > +     * to limit the number of local APIC entries to 128 because we only use
>> >> > +     * even numbers as APIC IDs.
>> >> > +     */
>> >> > +    size += sizeof(*local_apic) *
>> >> > +            min(max_vcpus, 1U << (sizeof(local_apic->id) * 8));
>> >> 
>> >> This caps at 256 now. Perhaps it would anyway be better to use
>> >> HVM_MAX_VCPUS here, or maybe the minimum of both?
>> > 
>> > max_vcpus is already capped to HVM_MAX_VCPUS.
>> 
>> Is it? I see a use of that symbol only in mk_dsdt.c.
> 
> It's introduced in a patch of this series, see "x86/PVHv2: fix
> dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0"; that's why you
> don't find it in your tree :).

Okay. A little bit disconnected, but anyway.

>> > IMHO it's not right to use
>> > HVM_MAX_VCPUS because we will increase it at some point, and then we would add
>> > more local APIC entries than possible, thus overflowing the id field.
>> 
>> Hence the "or maybe the minimum of both".
>> 
>> >> Plus, when you cap LAPIC entries, you should also provide x2APIC
>> >> ones (to be able to represent all vCPU-s).
>> > 
>> > Would it make sense to then provide x2APIC entries regardless of whether they
>> > are present on the native tables or not?
>> > 
>> > The emulated APIC provided by Xen supports x2APIC mode anyway.
>> 
>> Yes, I think so.
> 
>> > As a reply to both comments above, I will switch to simply using x2APIC
>> > entries for all vCPUs and get rid of all this. Then if HVM_MAX_VCPUS is
>> > increased we will have no issue.

If all guests we care about are fine with this - why not.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2017-02-08 16:03 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-19 17:29 [PATCH v5 0/9] Initial PVHv2 Dom0 support Roger Pau Monne
2017-01-19 17:29 ` [PATCH v5 1/9] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
2017-01-20 18:41   ` Andrew Cooper
2017-01-23 12:28     ` Roger Pau Monne
2017-01-19 17:29 ` [PATCH v5 2/9] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
2017-01-20  6:41   ` Tian, Kevin
2017-01-20 10:28     ` Roger Pau Monne
2017-01-20 18:44   ` Andrew Cooper
2017-01-22  4:45     ` Tian, Kevin
2017-01-19 17:29 ` [PATCH v5 3/9] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
2017-01-20 19:03   ` Andrew Cooper
2017-01-23 12:58     ` Roger Pau Monne
2017-01-23 12:59       ` Andrew Cooper
2017-01-20 19:13   ` Boris Ostrovsky
2017-01-20 19:27     ` Andrew Cooper
2017-01-26 11:43   ` Jan Beulich
2017-01-26 16:36     ` Roger Pau Monne
2017-01-19 17:29 ` [PATCH v5 4/9] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
2017-01-20 19:41   ` Andrew Cooper
2017-01-23 11:23     ` Jan Beulich
2017-01-23 14:11   ` Boris Ostrovsky
2017-01-23 14:43     ` Roger Pau Monne
2017-01-26 12:41   ` Jan Beulich
2017-01-27 11:14     ` Tim Deegan
2017-01-27 12:37       ` Roger Pau Monne
2017-01-27 12:51       ` Andrew Cooper
2017-01-27 13:20         ` Tim Deegan
2017-01-27 13:46           ` Andrew Cooper
2017-01-27 14:01             ` Tim Deegan
2017-01-27 14:35               ` Andrew Cooper
2017-01-27 19:43                 ` Tim Deegan
2017-01-30 10:43                   ` Jan Beulich
2017-01-30 11:06                     ` Andrew Cooper
2017-01-27 16:40           ` Jan Beulich
2017-01-27 18:06             ` Andrew Cooper
2017-02-03 13:57               ` Jan Beulich
2017-01-27 19:48             ` Tim Deegan
2017-02-02 15:38               ` Jan Beulich
2017-01-27 12:23     ` Roger Pau Monne
2017-01-27 15:11       ` Jan Beulich
2017-01-27 16:04         ` Roger Pau Monne
2017-01-27 16:29           ` Jan Beulich
2017-01-19 17:29 ` [PATCH v5 5/9] x86/hvm: add vcpu parameter to guest memory copy function Roger Pau Monne
2017-01-20 19:45   ` Andrew Cooper
2017-01-23 13:50     ` Roger Pau Monne
2017-01-26 12:51   ` Jan Beulich
2017-01-26 13:33   ` Jan Beulich
2017-01-27 14:55     ` Roger Pau Monne
2017-01-19 17:29 ` [PATCH v5 6/9] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
2017-01-26 13:37   ` Jan Beulich
2017-01-27 17:22     ` Roger Pau Monne
2017-01-27 17:28       ` Roger Pau Monne
2017-01-30 10:20         ` Jan Beulich
2017-01-27 18:03       ` Roger Pau Monne
2017-01-30 10:14       ` Jan Beulich
2017-01-19 17:29 ` [PATCH v5 7/9] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
2017-01-19 17:32   ` Andrew Cooper
2017-01-26 13:39   ` Jan Beulich
2017-01-19 17:29 ` [PATCH v5 8/9] xen/x86: Setup PVHv2 Dom0 CPUs Roger Pau Monne
2017-01-26 13:46   ` Jan Beulich
2017-02-08 12:48     ` Roger Pau Monne
2017-02-08 13:02       ` Jan Beulich
2017-01-19 17:29 ` [PATCH v5 9/9] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
2017-01-26 14:16   ` Jan Beulich
2017-02-08 15:10     ` Roger Pau Monne
2017-02-08 15:50       ` Jan Beulich
2017-02-08 15:58         ` Roger Pau Monne
2017-02-08 16:03           ` Jan Beulich
