All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4 00/14] Initial PVHv2 Dom0 support
@ 2016-11-30 16:49 Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
                   ` (13 more replies)
  0 siblings, 14 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk

Hello,

This is the first batch of the PVHv2 Dom0 support series, that includes
everything up to the point where ACPI tables for Dom0 are crafted. I've
decided to left the last part of the series (the one that contains the PCI
config space handlers, and other emulation/trapping related code) separated,
in order to focus and ease the review. This is of course not functional, one
might be able to partially boot a Dom0 kernel if it doesn't try to access
any physical devices, and the panic in setup.c is removed.

The full series can also be found on a git branch in my personal git repo:

git://xenbits.xen.org/people/royger/xen.git dom0_hvm_v4

Each patch contains the changelog between versions.

Thanks, Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-08 16:21   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 02/14] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
from physical or emulated devices over event channels using PIRQs. This
applies to both DomU and Dom0 PVHv2 guests.

Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
route physical interrupts (even from emulated devices) over event channels,
and is thus allowed to use some of the PHYSDEV ops.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Update docs.

Changes since v2:
 - Change local variable name to currd instead of d.
 - Use currd where it makes sense.
---
 docs/misc/hvmlite.markdown        | 20 ++++++++++++++++++++
 xen/arch/x86/hvm/hvm.c            | 25 ++++++++++++++++---------
 xen/arch/x86/physdev.c            |  5 +++--
 xen/common/kernel.c               |  3 ++-
 xen/include/public/arch-x86/xen.h |  4 +++-
 5 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/docs/misc/hvmlite.markdown b/docs/misc/hvmlite.markdown
index 898b8ee..b2557f7 100644
--- a/docs/misc/hvmlite.markdown
+++ b/docs/misc/hvmlite.markdown
@@ -75,3 +75,23 @@ info structure that's passed at boot time (field rsdp_paddr).
 
 Description of paravirtualized devices will come from XenStore, just as it's
 done for HVM guests.
+
+## Interrupts ##
+
+### Interrupts from physical devices ###
+
+Interrupts from physical devices are delivered using native methods, this is
+done in order to take advantage of new hardware assisted virtualization
+functions, like posted interrupts. This implies that PVHv2 guests with physical
+devices will also have the necessary interrupt controllers in order to manage
+the delivery of interrupts from those devices, using the same interfaces that
+are available on native hardware.
+
+### Interrupts from paravirtualized devices ###
+
+Interrupts from paravirtualized devices are delivered using event channels, see
+[Event Channel Internals][event_channels] for more detailed information about
+event channels. Delivery of those interrupts can be configured in the same way
+as HVM guests, check xen/include/public/hvm/params.h and
+xen/include/public/hvm/hvm_op.h for more information about available delivery
+methods.
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 25dc759..306d6b0 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4186,10 +4186,12 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
 static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *currd = current->domain;
+
     switch ( cmd )
     {
     default:
-        if ( !is_pvh_vcpu(current) || !is_hardware_domain(current->domain) )
+        if ( !is_pvh_domain(currd) || !is_hardware_domain(currd) )
             return -ENOSYS;
         /* fall through */
     case PHYSDEVOP_map_pirq:
@@ -4197,7 +4199,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
-        return do_physdev_op(cmd, arg);
+        return ((currd->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
+               is_pvh_domain(currd)) ?
+                    do_physdev_op(cmd, arg) : -ENOSYS;
     }
 }
 
@@ -4230,17 +4234,20 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 static long hvm_physdev_op_compat32(
     int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *d = current->domain;
+
     switch ( cmd )
     {
-        case PHYSDEVOP_map_pirq:
-        case PHYSDEVOP_unmap_pirq:
-        case PHYSDEVOP_eoi:
-        case PHYSDEVOP_irq_status_query:
-        case PHYSDEVOP_get_free_pirq:
-            return compat_physdev_op(cmd, arg);
+    case PHYSDEVOP_map_pirq:
+    case PHYSDEVOP_unmap_pirq:
+    case PHYSDEVOP_eoi:
+    case PHYSDEVOP_irq_status_query:
+    case PHYSDEVOP_get_free_pirq:
+        return (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                    compat_physdev_op(cmd, arg) : -ENOSYS;
         break;
     default:
-            return -ENOSYS;
+        return -ENOSYS;
         break;
     }
 }
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 5a49796..0bea6e1 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -94,7 +94,8 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
     int pirq, irq, ret = 0;
     void *map_data = NULL;
 
-    if ( domid == DOMID_SELF && is_hvm_domain(d) )
+    if ( domid == DOMID_SELF && is_hvm_domain(d) &&
+         (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         /*
          * Only makes sense for vector-based callback, else HVM-IRQ logic
@@ -265,7 +266,7 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
     if ( ret )
         goto free_domain;
 
-    if ( is_hvm_domain(d) )
+    if ( is_hvm_domain(d) && (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         spin_lock(&d->event_lock);
         if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index d0edb13..a82f55f 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -332,7 +332,8 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             case guest_type_hvm:
                 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
                              (1U << XENFEAT_hvm_callback_vector) |
-                             (1U << XENFEAT_hvm_pirqs);
+                             ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                                 (1U << XENFEAT_hvm_pirqs) : 0);
                 break;
             }
 #endif
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index cdd93c1..da6f4f2 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -283,12 +283,14 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_IOMMU           (1U<<_XEN_X86_EMU_IOMMU)
 #define _XEN_X86_EMU_PIT            8
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
+#define _XEN_X86_EMU_USE_PIRQ       9
+#define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
     uint32_t emulation_flags;
 };
 #endif
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 02/14] xen/x86: fix return value of *_set_allocation functions
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 03/14] xen/x86: allow calling {shadow/hap}_set_allocation with the idle domain Roger Pau Monne
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Roger Pau Monne

Return should be an int.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
Changes since v2:
 - Also fix the callers to treat the return value as an int.
 - Don't convert the pages parameter to unsigned long.
---
 xen/arch/x86/mm/hap/hap.c       |  8 +++-----
 xen/arch/x86/mm/shadow/common.c | 12 +++++-------
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 3218fa2..f099e94 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -334,7 +334,7 @@ hap_get_allocation(struct domain *d)
 
 /* Set the pool of pages to the required number of pages.
  * Returns 0 for success, non-zero for failure. */
-static unsigned int
+static int
 hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
 {
     struct page_info *pg;
@@ -468,14 +468,12 @@ int hap_enable(struct domain *d, u32 mode)
     old_pages = d->arch.paging.hap.total_pages;
     if ( old_pages == 0 )
     {
-        unsigned int r;
         paging_lock(d);
-        r = hap_set_allocation(d, 256, NULL);
-        if ( r != 0 )
+        rv = hap_set_allocation(d, 256, NULL);
+        if ( rv != 0 )
         {
             hap_set_allocation(d, 0, NULL);
             paging_unlock(d);
-            rv = -ENOMEM;
             goto out;
         }
         paging_unlock(d);
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index ced2313..756c276 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1615,9 +1615,9 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
  * Input will be rounded up to at least shadow_min_acceptable_pages(),
  * plus space for the p2m table.
  * Returns 0 for success, non-zero for failure. */
-static unsigned int sh_set_allocation(struct domain *d,
-                                      unsigned int pages,
-                                      int *preempted)
+static int sh_set_allocation(struct domain *d,
+                             unsigned int pages,
+                             int *preempted)
 {
     struct page_info *sp;
     unsigned int lower_bound;
@@ -3153,13 +3153,11 @@ int shadow_enable(struct domain *d, u32 mode)
     old_pages = d->arch.paging.shadow.total_pages;
     if ( old_pages == 0 )
     {
-        unsigned int r;
         paging_lock(d);
-        r = sh_set_allocation(d, 1024, NULL); /* Use at least 4MB */
-        if ( r != 0 )
+        rv = sh_set_allocation(d, 1024, NULL); /* Use at least 4MB */
+        if ( rv != 0 )
         {
             sh_set_allocation(d, 0, NULL);
-            rv = -ENOMEM;
             goto out_locked;
         }
         paging_unlock(d);
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 03/14] xen/x86: allow calling {shadow/hap}_set_allocation with the idle domain
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 02/14] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-01 11:25   ` Tim Deegan
  2016-11-30 16:49 ` [PATCH v4 04/14] x86/paging: introduce paging_set_allocation Roger Pau Monne
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Roger Pau Monne

... and using the "preempted" parameter. Introduce a new helper that can
be used from both hypercall or idle vcpu context (ie: during Dom0
creation) in order to check if preemption is needed. If such preemption
happens, the caller should then call process_pending_softirqs in order to
drain the pending softirqs, and then call *_set_allocation again to continue
with it's execution.

This allows us to call *_set_allocation() when building domain 0.

While there also document hypercall_preempt_check and add an assert to
local_events_need_delivery in order to be sure it's not called by the idle
domain, which doesn't receive any events (and that in turn
hypercall_preempt_check is also not called by the idle domain).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
Changes since v3:
 - Add general_preempt_check as a macro that can be called from both
   hypercall context and idle vcpu context.
 - Document hypercall_preempt_check.
 - Add assert to local_events_need_delivery in order to be sure it's not
   called by the idle domain.

Changes since v2:
 - Fix commit message.
---
 xen/arch/x86/mm/hap/hap.c       |  2 +-
 xen/arch/x86/mm/shadow/common.c |  2 +-
 xen/include/asm-x86/event.h     |  3 +++
 xen/include/xen/sched.h         | 15 +++++++++++++++
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index f099e94..b9faba6 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -379,7 +379,7 @@ hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
             break;
 
         /* Check to see if we need to yield and try again */
-        if ( preempted && hypercall_preempt_check() )
+        if ( preempted && general_preempt_check() )
         {
             *preempted = 1;
             return 0;
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 756c276..ddbdb73 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1681,7 +1681,7 @@ static int sh_set_allocation(struct domain *d,
             break;
 
         /* Check to see if we need to yield and try again */
-        if ( preempted && hypercall_preempt_check() )
+        if ( preempted && general_preempt_check() )
         {
             *preempted = 1;
             return 0;
diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index a82062e..d589d6f 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -23,6 +23,9 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
+
+    ASSERT(!is_idle_vcpu(v));
+
     return (has_hvm_container_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index 1fbda87..063efe6 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -708,11 +708,26 @@ unsigned long hypercall_create_continuation(
     unsigned int op, const char *format, ...);
 void hypercall_cancel_continuation(void);
 
+/*
+ * For long-running operations that must be in hypercall context, check
+ * if there is background work to be done that should interrupt this
+ * operation.
+ */
 #define hypercall_preempt_check() (unlikely(    \
         softirq_pending(smp_processor_id()) |   \
         local_events_need_delivery()            \
     ))
 
+/*
+ * For long-running operations that may be in hypercall context or on
+ * the idle vcpu (e.g. during dom0 construction), check if there is
+ * background work to be done that should interrupt this operation.
+ */
+#define general_preempt_check() (unlikely(                          \
+        softirq_pending(smp_processor_id()) ||                      \
+        (!is_idle_vcpu(current) && local_events_need_delivery())    \
+    ))
+
 extern struct domain *domain_list;
 
 /* Caller must hold the domlist_read_lock or domlist_update_lock. */
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 04/14] x86/paging: introduce paging_set_allocation
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (2 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 03/14] xen/x86: allow calling {shadow/hap}_set_allocation with the idle domain Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 05/14] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Roger Pau Monne

... and remove hap_set_alloc_for_pvh_dom0. While there also change the last
parameter of the {hap/shadow}_set_allocation functions to be a boolean.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
Changes since v3:
 - Rename sh_set_allocation to shadow_set_allocation (public shadow
   functions use the shadow prefix instead of sh).

Changes since v2:
 - Convert the preempt parameter into a bool.
 - Fix Dom0 builder comment to reflect that paging.mode should be correct
   before calling paging_set_allocation.

Changes since RFC:
 - Make paging_set_allocation preemtable.
 - Move comments.
---
 xen/arch/x86/domain_build.c     | 21 +++++++++++++++------
 xen/arch/x86/mm/hap/hap.c       | 22 +++++-----------------
 xen/arch/x86/mm/paging.c        | 19 ++++++++++++++++++-
 xen/arch/x86/mm/shadow/common.c | 31 +++++++++++++------------------
 xen/include/asm-x86/hap.h       |  4 ++--
 xen/include/asm-x86/paging.h    |  7 +++++++
 xen/include/asm-x86/shadow.h    | 11 ++++++++++-
 7 files changed, 70 insertions(+), 45 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 0a02d65..17f8e91 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -35,7 +35,6 @@
 #include <asm/setup.h>
 #include <asm/bzimage.h> /* for bzimage_parse */
 #include <asm/io_apic.h>
-#include <asm/hap.h>
 #include <asm/hpet.h>
 
 #include <public/version.h>
@@ -1383,15 +1382,25 @@ int __init construct_dom0(
                          nr_pages);
     }
 
-    if ( is_pvh_domain(d) )
-        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
-
     /*
-     * We enable paging mode again so guest_physmap_add_page will do the
-     * right thing for us.
+     * We enable paging mode again so guest_physmap_add_page and
+     * paging_set_allocation will do the right thing for us.
      */
     d->arch.paging.mode = save_pvh_pg_mode;
 
+    if ( is_pvh_domain(d) )
+    {
+        bool preempted;
+
+        do {
+            preempted = false;
+            paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                                  &preempted);
+            process_pending_softirqs();
+        } while ( preempted );
+    }
+
+
     /* Write the phys->machine and machine->phys table entries. */
     for ( pfn = 0; pfn < count; pfn++ )
     {
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index b9faba6..e6dc088 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -334,8 +334,7 @@ hap_get_allocation(struct domain *d)
 
 /* Set the pool of pages to the required number of pages.
  * Returns 0 for success, non-zero for failure. */
-static int
-hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
+int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
 {
     struct page_info *pg;
 
@@ -381,7 +380,7 @@ hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
         /* Check to see if we need to yield and try again */
         if ( preempted && general_preempt_check() )
         {
-            *preempted = 1;
+            *preempted = true;
             return 0;
         }
     }
@@ -561,7 +560,7 @@ void hap_final_teardown(struct domain *d)
     paging_unlock(d);
 }
 
-void hap_teardown(struct domain *d, int *preempted)
+void hap_teardown(struct domain *d, bool *preempted)
 {
     struct vcpu *v;
     mfn_t mfn;
@@ -609,7 +608,8 @@ out:
 int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
                XEN_GUEST_HANDLE_PARAM(void) u_domctl)
 {
-    int rc, preempted = 0;
+    int rc;
+    bool preempted = false;
 
     switch ( sc->op )
     {
@@ -636,18 +636,6 @@ int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
     }
 }
 
-void __init hap_set_alloc_for_pvh_dom0(struct domain *d,
-                                       unsigned long hap_pages)
-{
-    int rc;
-
-    paging_lock(d);
-    rc = hap_set_allocation(d, hap_pages, NULL);
-    paging_unlock(d);
-
-    BUG_ON(rc);
-}
-
 static const struct paging_mode hap_paging_real_mode;
 static const struct paging_mode hap_paging_protected_mode;
 static const struct paging_mode hap_paging_pae_mode;
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index cc44682..853a035 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -809,7 +809,8 @@ long paging_domctl_continuation(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 /* Call when destroying a domain */
 int paging_teardown(struct domain *d)
 {
-    int rc, preempted = 0;
+    int rc;
+    bool preempted = false;
 
     if ( hap_enabled(d) )
         hap_teardown(d, &preempted);
@@ -954,6 +955,22 @@ void paging_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
         safe_write_pte(p, new);
 }
 
+int paging_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
+{
+    int rc;
+
+    ASSERT(paging_mode_enabled(d));
+
+    paging_lock(d);
+    if ( hap_enabled(d) )
+        rc = hap_set_allocation(d, pages, preempted);
+    else
+        rc = shadow_set_allocation(d, pages, preempted);
+    paging_unlock(d);
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index ddbdb73..9f3bed9 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1611,13 +1611,7 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
     paging_unlock(d);
 }
 
-/* Set the pool of shadow pages to the required number of pages.
- * Input will be rounded up to at least shadow_min_acceptable_pages(),
- * plus space for the p2m table.
- * Returns 0 for success, non-zero for failure. */
-static int sh_set_allocation(struct domain *d,
-                             unsigned int pages,
-                             int *preempted)
+int shadow_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
 {
     struct page_info *sp;
     unsigned int lower_bound;
@@ -1683,7 +1677,7 @@ static int sh_set_allocation(struct domain *d,
         /* Check to see if we need to yield and try again */
         if ( preempted && general_preempt_check() )
         {
-            *preempted = 1;
+            *preempted = true;
             return 0;
         }
     }
@@ -3154,10 +3148,10 @@ int shadow_enable(struct domain *d, u32 mode)
     if ( old_pages == 0 )
     {
         paging_lock(d);
-        rv = sh_set_allocation(d, 1024, NULL); /* Use at least 4MB */
+        rv = shadow_set_allocation(d, 1024, NULL); /* Use at least 4MB */
         if ( rv != 0 )
         {
-            sh_set_allocation(d, 0, NULL);
+            shadow_set_allocation(d, 0, NULL);
             goto out_locked;
         }
         paging_unlock(d);
@@ -3239,7 +3233,7 @@ int shadow_enable(struct domain *d, u32 mode)
     return rv;
 }
 
-void shadow_teardown(struct domain *d, int *preempted)
+void shadow_teardown(struct domain *d, bool *preempted)
 /* Destroy the shadow pagetables of this domain and free its shadow memory.
  * Should only be called for dying domains. */
 {
@@ -3301,7 +3295,7 @@ void shadow_teardown(struct domain *d, int *preempted)
     if ( d->arch.paging.shadow.total_pages != 0 )
     {
         /* Destroy all the shadows and release memory to domheap */
-        sh_set_allocation(d, 0, preempted);
+        shadow_set_allocation(d, 0, preempted);
 
         if ( preempted && *preempted )
             goto out;
@@ -3366,7 +3360,7 @@ void shadow_final_teardown(struct domain *d)
     p2m_teardown(p2m_get_hostp2m(d));
     /* Free any shadow memory that the p2m teardown released */
     paging_lock(d);
-    sh_set_allocation(d, 0, NULL);
+    shadow_set_allocation(d, 0, NULL);
     SHADOW_PRINTK("dom %u final teardown done."
                    "  Shadow pages total = %u, free = %u, p2m=%u\n",
                    d->domain_id,
@@ -3392,9 +3386,9 @@ static int shadow_one_bit_enable(struct domain *d, u32 mode)
     if ( d->arch.paging.shadow.total_pages == 0 )
     {
         /* Init the shadow memory allocation if the user hasn't done so */
-        if ( sh_set_allocation(d, 1, NULL) != 0 )
+        if ( shadow_set_allocation(d, 1, NULL) != 0 )
         {
-            sh_set_allocation(d, 0, NULL);
+            shadow_set_allocation(d, 0, NULL);
             return -ENOMEM;
         }
     }
@@ -3463,7 +3457,7 @@ static int shadow_one_bit_disable(struct domain *d, u32 mode)
         }
 
         /* Pull down the memory allocation */
-        if ( sh_set_allocation(d, 0, NULL) != 0 )
+        if ( shadow_set_allocation(d, 0, NULL) != 0 )
             BUG(); /* In fact, we will have BUG()ed already */
         shadow_hash_teardown(d);
         SHADOW_PRINTK("un-shadowing of domain %u done."
@@ -3876,7 +3870,8 @@ int shadow_domctl(struct domain *d,
                   xen_domctl_shadow_op_t *sc,
                   XEN_GUEST_HANDLE_PARAM(void) u_domctl)
 {
-    int rc, preempted = 0;
+    int rc;
+    bool preempted = false;
 
     switch ( sc->op )
     {
@@ -3907,7 +3902,7 @@ int shadow_domctl(struct domain *d,
             paging_unlock(d);
             return -EINVAL;
         }
-        rc = sh_set_allocation(d, sc->mb << (20 - PAGE_SHIFT), &preempted);
+        rc = shadow_set_allocation(d, sc->mb << (20 - PAGE_SHIFT), &preempted);
         paging_unlock(d);
         if ( preempted )
             /* Not finished.  Set up to re-run the call. */
diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
index c613836..dedb4b1 100644
--- a/xen/include/asm-x86/hap.h
+++ b/xen/include/asm-x86/hap.h
@@ -38,7 +38,7 @@ int   hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
                  XEN_GUEST_HANDLE_PARAM(void) u_domctl);
 int   hap_enable(struct domain *d, u32 mode);
 void  hap_final_teardown(struct domain *d);
-void  hap_teardown(struct domain *d, int *preempted);
+void  hap_teardown(struct domain *d, bool *preempted);
 void  hap_vcpu_init(struct vcpu *v);
 int   hap_track_dirty_vram(struct domain *d,
                            unsigned long begin_pfn,
@@ -46,7 +46,7 @@ int   hap_track_dirty_vram(struct domain *d,
                            XEN_GUEST_HANDLE_64(uint8) dirty_bitmap);
 
 extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
-void hap_set_alloc_for_pvh_dom0(struct domain *d, unsigned long num_pages);
+int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted);
 
 #endif /* XEN_HAP_H */
 
diff --git a/xen/include/asm-x86/paging.h b/xen/include/asm-x86/paging.h
index 56eef6b..f83ed8b 100644
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -347,6 +347,13 @@ void pagetable_dying(struct domain *d, paddr_t gpa);
 void paging_dump_domain_info(struct domain *d);
 void paging_dump_vcpu_info(struct vcpu *v);
 
+/* Set the pool of shadow pages to the required number of pages.
+ * Input might be rounded up to at minimum amount of pages, plus
+ * space for the p2m table.
+ * Returns 0 for success, non-zero for failure. */
+int paging_set_allocation(struct domain *d, unsigned int pages,
+                          bool *preempted);
+
 #endif /* XEN_PAGING_H */
 
 /*
diff --git a/xen/include/asm-x86/shadow.h b/xen/include/asm-x86/shadow.h
index 6d0aefb..bac952f 100644
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -73,7 +73,7 @@ int shadow_domctl(struct domain *d,
                   XEN_GUEST_HANDLE_PARAM(void) u_domctl);
 
 /* Call when destroying a domain */
-void shadow_teardown(struct domain *d, int *preempted);
+void shadow_teardown(struct domain *d, bool *preempted);
 
 /* Call once all of the references to the domain have gone away */
 void shadow_final_teardown(struct domain *d);
@@ -83,6 +83,13 @@ void sh_remove_shadows(struct domain *d, mfn_t gmfn, int fast, int all);
 /* Discard _all_ mappings from the domain's shadows. */
 void shadow_blow_tables_per_domain(struct domain *d);
 
+/* Set the pool of shadow pages to the required number of pages.
+ * Input will be rounded up to at least shadow_min_acceptable_pages(),
+ * plus space for the p2m table.
+ * Returns 0 for success, non-zero for failure. */
+int shadow_set_allocation(struct domain *d, unsigned int pages,
+                          bool *preempted);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_teardown(d, p) ASSERT(is_pv_domain(d))
@@ -91,6 +98,8 @@ void shadow_blow_tables_per_domain(struct domain *d);
     ({ ASSERT(is_pv_domain(d)); -EOPNOTSUPP; })
 #define shadow_track_dirty_vram(d, begin_pfn, nr, bitmap) \
     ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
+#define shadow_set_allocation(d, pages, preempted) \
+    ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
 
 static inline void sh_remove_shadows(struct domain *d, mfn_t gmfn,
                                      bool_t fast, bool_t all) {}
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 05/14] xen/x86: split the setup of Dom0 permissions to a function
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (3 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 04/14] x86/paging: introduce paging_set_allocation Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 06/14] x86/vtd: refuse to enable IOMMU if the PCI scan fails Roger Pau Monne
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

So that it can also be used by the PVH-specific domain builder. This is just
code motion, it should not introduce any functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes since v2:
 - Fix comment style.
 - Convert i to unsigned int.
 - Restore previous BUG_ON in case of failure (instead of panic).
 - Remove unneeded rc initializer.
---
 xen/arch/x86/domain_build.c | 160 +++++++++++++++++++++++---------------------
 1 file changed, 83 insertions(+), 77 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 17f8e91..1e557b9 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -869,6 +869,88 @@ static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
     unmap_domain_page(l4start);
 }
 
+static int __init setup_permissions(struct domain *d)
+{
+    unsigned long mfn;
+    unsigned int i;
+    int rc;
+
+    /* The hardware domain is initially permitted full I/O capabilities. */
+    rc = ioports_permit_access(d, 0, 0xFFFF);
+    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
+    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
+
+    /* Modify I/O port access permissions. */
+
+    /* Master Interrupt Controller (PIC). */
+    rc |= ioports_deny_access(d, 0x20, 0x21);
+    /* Slave Interrupt Controller (PIC). */
+    rc |= ioports_deny_access(d, 0xA0, 0xA1);
+    /* Interval Timer (PIT). */
+    rc |= ioports_deny_access(d, 0x40, 0x43);
+    /* PIT Channel 2 / PC Speaker Control. */
+    rc |= ioports_deny_access(d, 0x61, 0x61);
+    /* ACPI PM Timer. */
+    if ( pmtmr_ioport )
+        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
+    /* PCI configuration space (NB. 0xcf8 has special treatment). */
+    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
+    /* Command-line I/O ranges. */
+    process_dom0_ioports_disable(d);
+
+    /* Modify I/O memory access permissions. */
+
+    /* Local APIC. */
+    if ( mp_lapic_addr != 0 )
+    {
+        mfn = paddr_to_pfn(mp_lapic_addr);
+        rc |= iomem_deny_access(d, mfn, mfn);
+    }
+    /* I/O APICs. */
+    for ( i = 0; i < nr_ioapics; i++ )
+    {
+        mfn = paddr_to_pfn(mp_ioapics[i].mpc_apicaddr);
+        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
+            rc |= iomem_deny_access(d, mfn, mfn);
+    }
+    /* MSI range. */
+    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
+                            paddr_to_pfn(MSI_ADDR_BASE_LO +
+                                         MSI_ADDR_DEST_ID_MASK));
+    /* HyperTransport range. */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
+                                paddr_to_pfn((1ULL << 40) - 1));
+
+    /* Remove access to E820_UNUSABLE I/O regions above 1MB. */
+    for ( i = 0; i < e820.nr_map; i++ )
+    {
+        unsigned long sfn, efn;
+        sfn = max_t(unsigned long, paddr_to_pfn(e820.map[i].addr), 0x100ul);
+        efn = paddr_to_pfn(e820.map[i].addr + e820.map[i].size - 1);
+        if ( (e820.map[i].type == E820_UNUSABLE) &&
+             (e820.map[i].size != 0) &&
+             (sfn <= efn) )
+            rc |= iomem_deny_access(d, sfn, efn);
+    }
+
+    /* Prevent access to HPET */
+    if ( hpet_address )
+    {
+        u8 prot_flags = hpet_flags & ACPI_HPET_PAGE_PROTECT_MASK;
+
+        mfn = paddr_to_pfn(hpet_address);
+        if ( prot_flags == ACPI_HPET_PAGE_PROTECT4 )
+            rc |= iomem_deny_access(d, mfn, mfn);
+        else if ( prot_flags == ACPI_HPET_PAGE_PROTECT64 )
+            rc |= iomem_deny_access(d, mfn, mfn + 15);
+        else if ( ro_hpet )
+            rc |= rangeset_add_singleton(mmio_ro_ranges, mfn);
+    }
+
+    return rc;
+}
+
 int __init construct_dom0(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
@@ -1539,83 +1621,7 @@ int __init construct_dom0(
     if ( test_bit(XENFEAT_supervisor_mode_kernel, parms.f_required) )
         panic("Dom0 requires supervisor-mode execution");
 
-    rc = 0;
-
-    /* The hardware domain is initially permitted full I/O capabilities. */
-    rc |= ioports_permit_access(d, 0, 0xFFFF);
-    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
-    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
-
-    /*
-     * Modify I/O port access permissions.
-     */
-    /* Master Interrupt Controller (PIC). */
-    rc |= ioports_deny_access(d, 0x20, 0x21);
-    /* Slave Interrupt Controller (PIC). */
-    rc |= ioports_deny_access(d, 0xA0, 0xA1);
-    /* Interval Timer (PIT). */
-    rc |= ioports_deny_access(d, 0x40, 0x43);
-    /* PIT Channel 2 / PC Speaker Control. */
-    rc |= ioports_deny_access(d, 0x61, 0x61);
-    /* ACPI PM Timer. */
-    if ( pmtmr_ioport )
-        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
-    /* PCI configuration space (NB. 0xcf8 has special treatment). */
-    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
-    /* Command-line I/O ranges. */
-    process_dom0_ioports_disable(d);
-
-    /*
-     * Modify I/O memory access permissions.
-     */
-    /* Local APIC. */
-    if ( mp_lapic_addr != 0 )
-    {
-        mfn = paddr_to_pfn(mp_lapic_addr);
-        rc |= iomem_deny_access(d, mfn, mfn);
-    }
-    /* I/O APICs. */
-    for ( i = 0; i < nr_ioapics; i++ )
-    {
-        mfn = paddr_to_pfn(mp_ioapics[i].mpc_apicaddr);
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
-            rc |= iomem_deny_access(d, mfn, mfn);
-    }
-    /* MSI range. */
-    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
-                            paddr_to_pfn(MSI_ADDR_BASE_LO +
-                                         MSI_ADDR_DEST_ID_MASK));
-    /* HyperTransport range. */
-    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
-                                paddr_to_pfn((1ULL << 40) - 1));
-
-    /* Remove access to E820_UNUSABLE I/O regions above 1MB. */
-    for ( i = 0; i < e820.nr_map; i++ )
-    {
-        unsigned long sfn, efn;
-        sfn = max_t(unsigned long, paddr_to_pfn(e820.map[i].addr), 0x100ul);
-        efn = paddr_to_pfn(e820.map[i].addr + e820.map[i].size - 1);
-        if ( (e820.map[i].type == E820_UNUSABLE) &&
-             (e820.map[i].size != 0) &&
-             (sfn <= efn) )
-            rc |= iomem_deny_access(d, sfn, efn);
-    }
-
-    /* Prevent access to HPET */
-    if ( hpet_address )
-    {
-        u8 prot_flags = hpet_flags & ACPI_HPET_PAGE_PROTECT_MASK;
-
-        mfn = paddr_to_pfn(hpet_address);
-        if ( prot_flags == ACPI_HPET_PAGE_PROTECT4 )
-            rc |= iomem_deny_access(d, mfn, mfn);
-        else if ( prot_flags == ACPI_HPET_PAGE_PROTECT64 )
-            rc |= iomem_deny_access(d, mfn, mfn + 15);
-        else if ( ro_hpet )
-            rc |= rangeset_add_singleton(mmio_ro_ranges, mfn);
-    }
-
+    rc = setup_permissions(d);
     BUG_ON(rc != 0);
 
     if ( elf_check_broken(&elf) )
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 06/14] x86/vtd: refuse to enable IOMMU if the PCI scan fails
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (4 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 05/14] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-01  0:29   ` Tian, Kevin
  2016-11-30 16:49 ` [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Kevin Tian, Feng Wu, Jan Beulich, Roger Pau Monne

This provides uniform behavior between Intel and AMD IOMMU initialization, and
is a requirement for PVHv2 Dom0, that depends on a working IOMMU plus the PCI
bus being scanned for devices.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Feng Wu <feng.wu@intel.com>
---
Changes since v3:
 - New in this revision.
---
 xen/drivers/passthrough/vtd/iommu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 48f120b..78b5a6a 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2299,7 +2299,9 @@ int __init intel_vtd_setup(void)
     P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
-    scan_pci_devices();
+    ret = scan_pci_devices();
+    if ( ret )
+        goto error;
 
     ret = init_vtd_hw();
     if ( ret )
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (5 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 06/14] x86/vtd: refuse to enable IOMMU if the PCI scan fails Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-08 16:24   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 08/14] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk; +Cc: Roger Pau Monne

There's nothing wrong with allowing the domain to perform DMA transfers to
MMIO areas that it already can access from the CPU, and this allows us to
remove the hack in set_identity_p2m_entry for PVH Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/mm/p2m.c     | 9 ---------
 xen/include/asm-x86/p2m.h | 1 +
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 6a45185..7e33ab6 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1053,16 +1053,7 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
         ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
                             p2m_mmio_direct, p2ma);
     else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
-    {
         ret = 0;
-        /*
-         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
-         * but iomem regions are not mapped with IOMMU. This makes sure that
-         * RMRRs are correctly mapped with IOMMU.
-         */
-        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
-            ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
-    }
     else
     {
         if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 7035860..b562da3 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
     case p2m_grant_map_rw:
     case p2m_ram_logdirty:
     case p2m_map_foreign:
+    case p2m_mmio_direct:
         flags =  IOMMUF_readable | IOMMUF_writable;
         break;
     case p2m_ram_ro:
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 08/14] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (6 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-08 16:29   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Allow the use of both the emulated local APIC and IO APIC for the hardware
domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Don't enable the emulated PIT for PVHv2 Dom0.

Changes since v2:
 - Allow all PV guests to use the emulated PIT.

Changes since RFC:
 - Move the emulation flags check to a separate helper.
---
 xen/arch/x86/domain.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index eae643f..2c7a528 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
         xfree(v->arch.pv_vcpu.trap_ctxt);
 }
 
+static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
+{
+
+    if ( is_hvm_domain(d) )
+    {
+        if ( is_hardware_domain(d) &&
+             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
+            return false;
+        if ( !is_hardware_domain(d) && emflags &&
+             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+            return false;
+    }
+    else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
+    {
+        /* PV or classic PVH. */
+        return false;
+    }
+
+    return true;
+}
+
 int arch_domain_create(struct domain *d, unsigned int domcr_flags,
                        struct xen_arch_domainconfig *config)
 {
@@ -547,7 +568,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags,
     {
         uint32_t emflags;
 
-        if ( is_hardware_domain(d) )
+        if ( is_hardware_domain(d) && is_pv_domain(d) )
             config->emulation_flags |= XEN_X86_EMU_PIT;
 
         emflags = config->emulation_flags;
@@ -558,11 +579,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags,
             return -EINVAL;
         }
 
-        /* PVHv2 guests can request emulated APIC. */
-        if ( emflags &&
-            (is_hvm_domain(d) ? ((emflags != XEN_X86_EMU_ALL) &&
-                                 (emflags != XEN_X86_EMU_LAPIC)) :
-                                (emflags != XEN_X86_EMU_PIT)) )
+        if ( !emulation_flags_ok(d, emflags) )
         {
             printk(XENLOG_G_ERR "d%d: Xen does not allow %s domain creation "
                    "with the current selection of emulators: %#x\n",
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (7 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 08/14] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-09 16:07   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Split the Dom0 builder into two different functions, one for PV (and classic
PVH), and another one for PVHv2. Introduce a new command line parameter
called 'dom0' that can be used to request the creation of a PVHv2 Dom0 by
setting the 'hvm' sub-option. A panic has also been added if a user tries
to use dom0=hvm until all the code is in place, then the panic will be
removed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Correctly declare the parameter list.
 - Add a panic if dom0=hvm is used. This will be removed once all the code is in
   place.

Changes since v2:
 - Fix coding style.
 - Introduce a new dom0 option that allows passing several parameters.
   Currently supported ones are hvm and shadow.

Changes since RFC:
 - Add documentation for the new command line option.
 - Simplify the logic in construct_dom0.
---
 docs/misc/xen-command-line.markdown | 17 +++++++++++++++++
 xen/arch/x86/domain_build.c         | 28 +++++++++++++++++++++++----
 xen/arch/x86/setup.c                | 38 +++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/setup.h         |  6 ++++++
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 0138978..c9729be 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -656,6 +656,23 @@ affinities to prefer but be not limited to the specified node(s).
 
 Pin dom0 vcpus to their respective pcpus
 
+### dom0
+> `= List of [ hvm | shadow ]`
+
+> Sub-options:
+
+> `hvm`
+
+> Default: `false`
+
+Flag that makes a dom0 boot in PVHv2 mode.
+
+> `shadow`
+
+> Default: `false`
+
+Flag that makes a dom0 use shadow paging.
+
 ### dom0pvh
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 1e557b9..2c9ebf2 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -191,10 +191,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
 }
 
 #ifdef CONFIG_SHADOW_PAGING
-static bool_t __initdata opt_dom0_shadow;
+bool __initdata opt_dom0_shadow;
 boolean_param("dom0_shadow", opt_dom0_shadow);
-#else
-#define opt_dom0_shadow 0
 #endif
 
 static char __initdata opt_dom0_ioports_disable[200] = "";
@@ -951,7 +949,7 @@ static int __init setup_permissions(struct domain *d)
     return rc;
 }
 
-int __init construct_dom0(
+static int __init construct_dom0_pv(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
     module_t *initrd,
@@ -1655,6 +1653,28 @@ out:
     return rc;
 }
 
+static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
+                                     unsigned long image_headroom,
+                                     module_t *initrd,
+                                     void *(*bootstrap_map)(const module_t *),
+                                     char *cmdline)
+{
+
+    printk("** Building a PVH Dom0 **\n");
+
+    return 0;
+}
+
+int __init construct_dom0(struct domain *d, const module_t *image,
+                          unsigned long image_headroom, module_t *initrd,
+                          void *(*bootstrap_map)(const module_t *),
+                          char *cmdline)
+{
+
+    return (is_hvm_domain(d) ? construct_dom0_hvm : construct_dom0_pv)
+           (d, image, image_headroom, initrd,bootstrap_map, cmdline);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b130671..255e20c 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -187,6 +187,35 @@ static void __init parse_acpi_param(char *s)
     }
 }
 
+/*
+ * List of parameters that affect Dom0 creation:
+ *
+ *  - hvm               Create a PVHv2 Dom0.
+ *  - shadow            Use shadow paging for Dom0.
+ */
+static bool __initdata dom0_hvm;
+static void __init parse_dom0_param(char *s)
+{
+    char *ss;
+
+    do {
+
+        ss = strchr(s, ',');
+        if ( ss )
+            *ss = '\0';
+
+        if ( !strcmp(s, "hvm") )
+            dom0_hvm = true;
+#ifdef CONFIG_SHADOW_PAGING
+        else if ( !strcmp(s, "shadow") )
+            opt_dom0_shadow = true;
+#endif
+
+        s = ss + 1;
+    } while ( ss );
+}
+custom_param("dom0", parse_dom0_param);
+
 static const module_t *__initdata initial_images;
 static unsigned int __initdata nr_initial_images;
 
@@ -1541,6 +1570,15 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( opt_dom0pvh )
         domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
 
+    if ( dom0_hvm )
+    {
+        panic("Building a PVHv2 Dom0 is not yet supported.");
+        domcr_flags |= DOMCRF_hvm |
+                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
+                         DOMCRF_hap : 0);
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+    }
+
     /* Create initial domain 0. */
     dom0 = domain_create(0, domcr_flags, 0, &config);
     if ( IS_ERR(dom0) || (alloc_dom0_vcpu0(dom0) == NULL) )
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index c65b79c..c4179d1 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -57,4 +57,10 @@ extern uint8_t kbd_shift_flags;
 extern unsigned long highmem_start;
 #endif
 
+#ifdef CONFIG_SHADOW_PAGING
+extern bool opt_dom0_shadow;
+#else
+#define opt_dom0_shadow 0
+#endif
+
 #endif
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (8 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-09 16:48   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Craft the Dom0 e820 memory map and populate it. Introduce a helper to remove
memory pages that are shared between Xen and a domain, and use it in order to
remove low 1MB RAM regions from dom_io in order to assign them to a PVHv2 Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Drop get_order_from_bytes_floor, it was only used by
   hvm_populate_memory_range.
 - Switch hvm_populate_memory_range to use frame numbers instead of full memory
   addresses.
 - Add a helper to steal the low 1MB RAM areas from dom_io and add them to Dom0
   as normal RAM.
 - Introduce unshare_xen_page_with_guest in order to remove pages from dom_io,
   so they can be assigned to other domains. This is needed in order to remove
   the low 1MB RAM regions from dom_io and assign them to the hardware_domain.
 - Simplify the loop in hvm_steal_ram.
 - Move definition of map_identity_mmio into this patch.

Changes since v2:
 - Introduce get_order_from_bytes_floor as a local function to
   domain_build.c.
 - Remove extra asserts.
 - Make hvm_populate_memory_range return an error code instead of panicking.
 - Fix comments and printks.
 - Use ULL sufix instead of casting to uint64_t.
 - Rename hvm_setup_vmx_unrestricted_guest to
   hvm_setup_vmx_realmode_helpers.
 - Only substract two pages from the memory calculation, that will be used
   by the MADT replacement.
 - Remove some comments.
 - Remove printing allocation information.
 - Don't stash any pages for the MADT, TSS or ident PT, those will be
   subtracted directly from RAM regions of the memory map.
 - Count the number of iterations before calling process_pending_softirqs
   when populating the memory map.
 - Move the initial call to process_pending_softirqs into construct_dom0,
   and remove the ones from construct_dom0_hvm and construct_dom0_pv.
 - Make memflags global so it can be shared between alloc_chunk and
   hvm_populate_memory_range.

Changes since RFC:
 - Use IS_ALIGNED instead of checking with PAGE_MASK.
 - Use the new %pB specifier in order to print sizes in human readable form.
 - Create a VM86 TSS for hardware that doesn't support unrestricted mode.
 - Subtract guest RAM for the identity page table and the VM86 TSS.
 - Split the creation of the unrestricted mode helper structures to a
   separate function.
 - Use preemption with paging_set_allocation.
 - Use get_order_from_bytes_floor.
---
 xen/arch/x86/domain_build.c | 310 ++++++++++++++++++++++++++++++++++++++++++--
 xen/arch/x86/mm.c           |  37 ++++++
 xen/include/asm-x86/mm.h    |   2 +
 3 files changed, 340 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 2c9ebf2..8602566 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -22,6 +22,7 @@
 #include <xen/compat.h>
 #include <xen/libelf.h>
 #include <xen/pfn.h>
+#include <xen/guest_access.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
 static long __initdata dom0_max_nrpages = LONG_MAX;
 
+/* Size of the VM86 TSS for virtual 8086 mode to use. */
+#define HVM_VM86_TSS_SIZE   128
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -213,11 +217,12 @@ boolean_param("ro-hpet", ro_hpet);
 #define round_pgup(_p)    (((_p)+(PAGE_SIZE-1))&PAGE_MASK)
 #define round_pgdown(_p)  ((_p)&PAGE_MASK)
 
+static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
+
 static struct page_info * __init alloc_chunk(
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -302,7 +307,8 @@ static unsigned long __init compute_dom0_nr_pages(
             avail -= max_pdx >> s;
     }
 
-    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
+    need_paging = opt_dom0_shadow || (has_hvm_container_domain(d) &&
+                  (!iommu_hap_pt_share || !paging_mode_hap(d)));
     for ( ; ; need_paging = 0 )
     {
         nr_pages = dom0_nrpages;
@@ -334,7 +340,8 @@ static unsigned long __init compute_dom0_nr_pages(
         avail -= dom0_paging_pages(d, nr_pages);
     }
 
-    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
+    if ( is_pv_domain(d) &&
+         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
          ((dom0_min_nrpages <= 0) || (nr_pages > min_pages)) )
     {
         /*
@@ -545,11 +552,12 @@ static __init void pvh_map_all_iomem(struct domain *d, unsigned long nr_pages)
     ASSERT(nr_holes == 0);
 }
 
-static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
+static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
 {
     struct e820entry *entry, *entry_guest;
     unsigned int i;
     unsigned long pages, cur_pages = 0;
+    uint64_t start, end;
 
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
@@ -577,8 +585,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
             continue;
         }
 
-        *entry_guest = *entry;
-        pages = PFN_UP(entry_guest->size);
+        /*
+         * Make sure the start and length are aligned to PAGE_SIZE, because
+         * that's the minimum granularity of the 2nd stage translation.
+         */
+        start = ROUNDUP(entry->addr, PAGE_SIZE);
+        end = (entry->addr + entry->size) & PAGE_MASK;
+        if ( start >= end )
+            continue;
+
+        entry_guest->type = E820_RAM;
+        entry_guest->addr = start;
+        entry_guest->size = end - start;
+        pages = PFN_DOWN(entry_guest->size);
         if ( (cur_pages + pages) > nr_pages )
         {
             /* Truncate region */
@@ -1010,8 +1029,6 @@ static int __init construct_dom0_pv(
     BUG_ON(d->vcpu[0] == NULL);
     BUG_ON(v->is_initialised);
 
-    process_pending_softirqs();
-
     printk("*** LOADING DOMAIN 0 ***\n");
 
     d->max_pages = ~0U;
@@ -1637,7 +1654,7 @@ static int __init construct_dom0_pv(
         dom0_update_physmap(d, pfn, mfn, 0);
 
         pvh_map_all_iomem(d, nr_pages);
-        pvh_setup_e820(d, nr_pages);
+        hvm_setup_e820(d, nr_pages);
     }
 
     if ( d->domain_id == hardware_domid )
@@ -1653,15 +1670,289 @@ out:
     return rc;
 }
 
+static int __init modify_identity_mmio(struct domain *d, unsigned long pfn,
+                                       unsigned long nr_pages, bool map)
+{
+    int rc;
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, _gfn(pfn), nr_pages, _mfn(pfn));
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_WARNING
+                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
+                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
+            break;
+        }
+        nr_pages -= rc;
+        pfn += rc;
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
+/* Populate an HVM memory range using the biggest possible order. */
+static int __init hvm_populate_memory_range(struct domain *d,
+                                            unsigned long start,
+                                            unsigned long nr_pages)
+{
+    unsigned int order, i = 0;
+    struct page_info *page;
+    int rc;
+#define MAP_MAX_ITER 64
+
+    order = MAX_ORDER;
+    while ( nr_pages != 0 )
+    {
+        unsigned int range_order = get_order_from_pages(nr_pages + 1);
+
+        order = min(range_order ? range_order - 1 : 0, order);
+        page = alloc_domheap_pages(d, order, memflags);
+        if ( page == NULL )
+        {
+            if ( order == 0 && memflags )
+            {
+                /* Try again without any memflags. */
+                memflags = 0;
+                order = MAX_ORDER;
+                continue;
+            }
+            if ( order == 0 )
+            {
+                printk("Unable to allocate memory with order 0!\n");
+                return -ENOMEM;
+            }
+            order--;
+            continue;
+        }
+
+        rc = guest_physmap_add_page(d, _gfn(start), _mfn(page_to_mfn(page)),
+                                    order);
+        if ( rc != 0 )
+        {
+            printk("Failed to populate memory: [%#lx,%lx): %d\n",
+                   start, start + (1UL << order), rc);
+            return -ENOMEM;
+        }
+        start += 1UL << order;
+        nr_pages -= 1UL << order;
+        if ( (++i % MAP_MAX_ITER) == 0 )
+            process_pending_softirqs();
+    }
+
+    return 0;
+#undef MAP_MAX_ITER
+}
+
+static int __init hvm_steal_ram(struct domain *d, unsigned long size,
+                                paddr_t limit, paddr_t *addr)
+{
+    unsigned int i = d->arch.nr_e820;
+
+    while ( i-- )
+    {
+        struct e820entry *entry = &d->arch.e820[i];
+
+        if ( entry->type != E820_RAM || entry->size < size )
+            continue;
+
+        /* Subtract from the beginning. */
+        if ( entry->addr + size < limit && entry->addr >= MB(1) )
+        {
+            *addr = entry->addr;
+            entry->addr += size;
+            entry->size -= size;
+            return 0;
+        }
+    }
+
+    return -ENOMEM;
+}
+
+static int __init hvm_setup_vmx_realmode_helpers(struct domain *d)
+{
+    p2m_type_t p2mt;
+    uint32_t rc, *ident_pt;
+    uint8_t *tss;
+    mfn_t mfn;
+    paddr_t gaddr;
+    unsigned int i;
+
+    /*
+     * Steal some space from the last found RAM region. One page will be
+     * used for the identity page tables, and the remaining space for the
+     * VM86 TSS. Note that after this not all e820 regions will be aligned
+     * to PAGE_SIZE.
+     */
+    if ( hvm_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
+    {
+        printk("Unable to find memory to stash the identity map and TSS\n");
+        return -ENOMEM;
+    }
+
+    /*
+     * Identity-map page table is required for running with CR0.PG=0
+     * when using Intel EPT. Create a 32-bit non-PAE page directory of
+     * superpages.
+     */
+    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                              &mfn, &p2mt, 0, &rc);
+    if ( ident_pt == NULL )
+    {
+        printk("Unable to map identity page tables\n");
+        return -ENOMEM;
+    }
+    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
+        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
+                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+    unmap_domain_page(ident_pt);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
+    gaddr += PAGE_SIZE;
+    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
+
+    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                         &mfn, &p2mt, 0, &rc);
+    if ( tss == NULL )
+    {
+        printk("Unable to map VM86 TSS area\n");
+        return 0;
+    }
+
+    memset(tss, 0, HVM_VM86_TSS_SIZE);
+    unmap_domain_page(tss);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
+
+    return 0;
+}
+
+static void __init hvm_steal_low_ram(struct domain *d, unsigned long start,
+                                     unsigned long nr_pages)
+{
+    unsigned long mfn;
+
+    ASSERT(start + nr_pages < PFN_DOWN(MB(1)));
+
+    for ( mfn = start; mfn < start + nr_pages; mfn++ )
+    {
+        struct page_info *pg = mfn_to_page(mfn);
+        int rc;
+
+        rc = unshare_xen_page_with_guest(pg, dom_io);
+        if ( rc )
+        {
+            printk("Unable to unshare Xen mfn %#lx: %d\n", mfn, rc);
+            continue;
+        }
+
+        share_xen_page_with_guest(pg, d, XENSHARE_writable);
+        rc = guest_physmap_add_entry(d, _gfn(mfn), _mfn(mfn), 0, p2m_ram_rw);
+        if ( rc )
+            printk("Unable to add mfn %#lx to p2m: %d\n", mfn, rc);
+    }
+}
+
+static int __init hvm_setup_p2m(struct domain *d)
+{
+    struct vcpu *v = d->vcpu[0];
+    unsigned long nr_pages;
+    unsigned int i;
+    int rc;
+    bool preempted;
+#define MB1_PAGES PFN_DOWN(MB(1))
+
+    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
+
+    hvm_setup_e820(d, nr_pages);
+    do {
+        preempted = false;
+        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                              &preempted);
+        process_pending_softirqs();
+    } while ( preempted );
+
+    /*
+     * Memory below 1MB is identity mapped.
+     * NB: this only makes sense when booted from legacy BIOS.
+     */
+    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
+    if ( rc )
+    {
+        printk("Failed to identity map low 1MB: %d\n", rc);
+        return rc;
+    }
+
+    /* Populate memory map. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        unsigned long addr, size;
+
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        addr = PFN_DOWN(d->arch.e820[i].addr);
+        size = PFN_DOWN(d->arch.e820[i].size);
+
+        if ( addr >= MB1_PAGES )
+            rc = hvm_populate_memory_range(d, addr, size);
+        else if ( addr + size > MB1_PAGES )
+        {
+            hvm_steal_low_ram(d, addr, MB1_PAGES - addr);
+            rc = hvm_populate_memory_range(d, MB1_PAGES,
+                                           size - (MB1_PAGES - addr));
+        }
+        else
+            hvm_steal_low_ram(d, addr, size);
+
+        if ( rc )
+            return rc;
+    }
+
+    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
+    {
+        /*
+         * Since Dom0 cannot be migrated, we will only setup the
+         * unrestricted guest helpers if they are needed by the current
+         * hardware we are running on.
+         */
+        rc = hvm_setup_vmx_realmode_helpers(d);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+#undef MB1_PAGES
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    int rc;
 
     printk("** Building a PVH Dom0 **\n");
 
+    /* Sanity! */
+    BUG_ON(d->domain_id);
+    BUG_ON(!d->vcpu[0]);
+
+    iommu_hwdom_init(d);
+
+    rc = hvm_setup_p2m(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 physical memory map\n");
+        return rc;
+    }
+
     return 0;
 }
 
@@ -1670,6 +1961,7 @@ int __init construct_dom0(struct domain *d, const module_t *image,
                           void *(*bootstrap_map)(const module_t *),
                           char *cmdline)
 {
+    process_pending_softirqs();
 
     return (is_hvm_domain(d) ? construct_dom0_hvm : construct_dom0_pv)
            (d, image, image_headroom, initrd,bootstrap_map, cmdline);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 03dcd71..a31106c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -475,6 +475,43 @@ void share_xen_page_with_guest(
     spin_unlock(&d->page_alloc_lock);
 }
 
+int unshare_xen_page_with_guest(struct page_info *page, struct domain *d)
+{
+    unsigned long y, x;
+    bool drop_dom_ref;
+
+    if ( page_get_owner(page) != d || !(page->count_info & PGC_xen_heap) )
+        return -EINVAL;
+
+    spin_lock(&d->page_alloc_lock);
+
+    /* Drop the page reference so we can chanfge the owner. */
+    y = page->count_info;
+    do {
+        x = y;
+        if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
+        {
+            spin_unlock(&d->page_alloc_lock);
+            return -EINVAL;
+        }
+        y = cmpxchg(&page->count_info, x, PGC_xen_heap);
+    } while ( y != x );
+
+    /* Remove the page form the list of domain pages. */
+    page_list_del(page, &d->xenpage_list);
+    drop_dom_ref = (--d->xenheap_pages == 0);
+
+    /* Remove the owner and clear the flags. */
+    page_set_owner(page, NULL);
+    page->u.inuse.type_info = 0;
+    spin_unlock(&d->page_alloc_lock);
+
+    if ( drop_dom_ref )
+        put_domain(d);
+
+    return 0;
+}
+
 void share_xen_page_with_privileged_guests(
     struct page_info *page, int readonly)
 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 1b4d1c3..041692b 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -276,6 +276,8 @@ struct spage_info
 #define XENSHARE_readonly 1
 extern void share_xen_page_with_guest(
     struct page_info *page, struct domain *d, int readonly);
+extern int unshare_xen_page_with_guest(struct page_info *page,
+                                       struct domain *d);
 extern void share_xen_page_with_privileged_guests(
     struct page_info *page, int readonly);
 extern void free_shared_domheap_page(struct page_info *page);
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (9 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-09 17:05   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 12/14] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Introduce a helper to parse the Dom0 kernel.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Change one error message.
 - Indent "out" label by one space.
 - Introduce hvm_copy_to_phys and slightly simplify the code in hvm_load_kernel.

Changes since v2:
 - Remove debug messages.
 - Don't hardcode the number of modules to 1.
---
 xen/arch/x86/domain_build.c | 145 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 8602566..e40fb94 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -39,6 +39,7 @@
 #include <asm/hpet.h>
 
 #include <public/version.h>
+#include <public/arch-x86/hvm/start_info.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -1930,12 +1931,148 @@ static int __init hvm_setup_p2m(struct domain *d)
 #undef MB1_PAGES
 }
 
+static int __init hvm_copy_to_phys(struct domain *d, paddr_t paddr, void *buf,
+                                   int size)
+{
+    struct vcpu *saved_current;
+    int rc;
+
+    saved_current = current;
+    set_current(d->vcpu[0]);
+    rc = hvm_copy_to_guest_phys(paddr, buf, size);
+    set_current(saved_current);
+
+    return rc != HVMCOPY_okay ? -EFAULT : 0;
+}
+
+static int __init hvm_load_kernel(struct domain *d, const module_t *image,
+                                  unsigned long image_headroom,
+                                  module_t *initrd, char *image_base,
+                                  char *cmdline, paddr_t *entry,
+                                  paddr_t *start_info_addr)
+{
+    char *image_start = image_base + image_headroom;
+    unsigned long image_len = image->mod_end;
+    struct elf_binary elf;
+    struct elf_dom_parms parms;
+    paddr_t last_addr;
+    struct hvm_start_info start_info;
+    struct hvm_modlist_entry mod;
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    int rc;
+
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    {
+        printk("Error trying to detect bz compressed kernel\n");
+        return rc;
+    }
+
+    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
+    {
+        printk("Unable to init ELF\n");
+        return rc;
+    }
+#ifdef VERBOSE
+    elf_set_verbose(&elf);
+#endif
+    elf_parse_binary(&elf);
+    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
+    {
+        printk("Unable to parse kernel for ELFNOTES\n");
+        return rc;
+    }
+
+    if ( parms.phys_entry == UNSET_ADDR32 ) {
+        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
+        return -EINVAL;
+    }
+
+    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
+           parms.guest_ver, parms.loader,
+           elf_64bit(&elf) ? "64-bit" : "32-bit");
+
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
+    elf.dest_size = parms.virt_kend - parms.virt_kstart;
+
+    saved_current = current;
+    set_current(v);
+    rc = elf_load_binary(&elf);
+    set_current(saved_current);
+    if ( rc < 0 )
+    {
+        printk("Failed to load kernel: %d\n", rc);
+        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
+        return rc;
+    }
+
+    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_phys(d, last_addr, mfn_to_virt(initrd->mod_start),
+                              initrd->mod_end);
+        if ( rc )
+        {
+            printk("Unable to copy initrd to guest\n");
+            return rc;
+        }
+
+        mod.paddr = last_addr;
+        mod.size = initrd->mod_end;
+        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
+    }
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
+    memset(&start_info, 0, sizeof(start_info));
+    if ( cmdline != NULL )
+    {
+        rc = hvm_copy_to_phys(d, last_addr, cmdline, strlen(cmdline) + 1);
+        if ( rc )
+        {
+            printk("Unable to copy guest command line\n");
+            return rc;
+        }
+        start_info.cmdline_paddr = last_addr;
+        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
+    }
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_phys(d, last_addr, &mod, sizeof(mod));
+        if ( rc )
+        {
+            printk("Unable to copy guest modules\n");
+            return rc;
+        }
+        start_info.modlist_paddr = last_addr;
+        start_info.nr_modules = 1;
+        last_addr += sizeof(mod);
+    }
+
+    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
+    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    rc = hvm_copy_to_phys(d, last_addr, &start_info, sizeof(start_info));
+    if ( rc )
+    {
+        printk("Unable to copy start info to guest\n");
+        return rc;
+    }
+
+    *entry = parms.phys_entry;
+    *start_info_addr = last_addr;
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    paddr_t entry, start_info;
     int rc;
 
     printk("** Building a PVH Dom0 **\n");
@@ -1953,6 +2090,14 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_load_kernel(d, image, image_headroom, initrd, bootstrap_map(image),
+                         cmdline, &entry, &start_info);
+    if ( rc )
+    {
+        printk("Failed to load Dom0 kernel\n");
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 12/14] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (10 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-09 17:09   ` Jan Beulich
  2016-11-30 16:49 ` [PATCH v4 13/14] xen/x86: hack to setup PVHv2 Dom0 CPUs Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

PVHv2 Dom0 is limited to 128 vCPUs, as are all HVM guests at the moment. Fix
dom0_max_vcpus so it takes this limitation into account by poking at the
dom0_hvm variable.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - New in the series.
---
 xen/arch/x86/domain_build.c | 3 +++
 xen/arch/x86/setup.c        | 2 +-
 xen/include/asm-x86/setup.h | 2 ++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index e40fb94..7e22ba3 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -40,6 +40,7 @@
 
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
+#include <public/hvm/hvm_info_table.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -176,6 +177,8 @@ unsigned int __init dom0_max_vcpus(void)
         max_vcpus = opt_dom0_max_vcpus_max;
     if ( max_vcpus > MAX_VIRT_CPUS )
         max_vcpus = MAX_VIRT_CPUS;
+    if ( dom0_hvm )
+        max_vcpus = min_t(typeof(max_vcpus), max_vcpus, HVM_MAX_VCPUS);
 
     return max_vcpus;
 }
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 255e20c..737f2ca 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -193,7 +193,7 @@ static void __init parse_acpi_param(char *s)
  *  - hvm               Create a PVHv2 Dom0.
  *  - shadow            Use shadow paging for Dom0.
  */
-static bool __initdata dom0_hvm;
+bool __initdata dom0_hvm;
 static void __init parse_dom0_param(char *s)
 {
     char *ss;
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index c4179d1..3c9389e 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -63,4 +63,6 @@ extern bool opt_dom0_shadow;
 #define opt_dom0_shadow 0
 #endif
 
+extern bool dom0_hvm;
+
 #endif
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 13/14] xen/x86: hack to setup PVHv2 Dom0 CPUs
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (11 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 12/14] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-11-30 16:49 ` [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
  13 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Initialize Dom0 BSP/APs and setup the memory and IO permissions.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
DO NOT APPLY.

The logic used to setup the CPUID leaves is clearly lacking. This patch will
be rebased on top of Andrew's CPUID work, that will move CPUID setup from
libxc into Xen. For the time being this is needed in order to be able to
boot a PVHv2 Dom0, in order to test the rest of the patches.
---
 xen/arch/x86/domain_build.c | 97 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 7e22ba3..5c7592b 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -41,6 +41,7 @@
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
 #include <public/hvm/hvm_info_table.h>
+#include <public/hvm/hvm_vcpu.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -2069,6 +2070,93 @@ static int __init hvm_load_kernel(struct domain *d, const module_t *image,
     return 0;
 }
 
+static int __init hvm_setup_cpus(struct domain *d, paddr_t entry,
+                                 paddr_t start_info)
+{
+    vcpu_hvm_context_t cpu_ctx;
+    struct vcpu *v = d->vcpu[0];
+    int cpu, i, rc;
+    struct {
+        uint32_t index;
+        uint32_t count;
+    } cpuid_leaves[] = {
+        {0, XEN_CPUID_INPUT_UNUSED},
+        {1, XEN_CPUID_INPUT_UNUSED},
+        {2, XEN_CPUID_INPUT_UNUSED},
+        {4, 0},
+        {4, 1},
+        {4, 2},
+        {4, 3},
+        {4, 4},
+        {7, 0},
+        {0xa, XEN_CPUID_INPUT_UNUSED},
+        {0xd, 0},
+        {0x80000000, XEN_CPUID_INPUT_UNUSED},
+        {0x80000001, XEN_CPUID_INPUT_UNUSED},
+        {0x80000002, XEN_CPUID_INPUT_UNUSED},
+        {0x80000003, XEN_CPUID_INPUT_UNUSED},
+        {0x80000004, XEN_CPUID_INPUT_UNUSED},
+        {0x80000005, XEN_CPUID_INPUT_UNUSED},
+        {0x80000006, XEN_CPUID_INPUT_UNUSED},
+        {0x80000007, XEN_CPUID_INPUT_UNUSED},
+        {0x80000008, XEN_CPUID_INPUT_UNUSED},
+    };
+
+    cpu = v->processor;
+    for ( i = 1; i < d->max_vcpus; i++ )
+    {
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_dom0_vcpu(d, i, cpu);
+    }
+
+    memset(&cpu_ctx, 0, sizeof(cpu_ctx));
+
+    cpu_ctx.mode = VCPU_HVM_MODE_32B;
+
+    cpu_ctx.cpu_regs.x86_32.ebx = start_info;
+    cpu_ctx.cpu_regs.x86_32.eip = entry;
+    cpu_ctx.cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET;
+
+    cpu_ctx.cpu_regs.x86_32.cs_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ds_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ss_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.tr_limit = 0x67;
+    cpu_ctx.cpu_regs.x86_32.cs_ar = 0xc9b;
+    cpu_ctx.cpu_regs.x86_32.ds_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.ss_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.tr_ar = 0x8b;
+
+    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
+    if ( rc )
+    {
+        printk("Unable to setup Dom0 BSP context: %d\n", rc);
+        return rc;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(cpuid_leaves); i++ )
+    {
+        d->arch.cpuids[i].input[0] = cpuid_leaves[i].index;
+        d->arch.cpuids[i].input[1] = cpuid_leaves[i].count;
+        cpuid_count(d->arch.cpuids[i].input[0], d->arch.cpuids[i].input[1],
+                    &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
+                    &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);
+        /* XXX: we need to do much more filtering here. */
+        if ( d->arch.cpuids[i].input[0] == 1 )
+            d->arch.cpuids[i].ecx &= ~X86_FEATURE_VMX;
+    }
+
+    rc = setup_permissions(d);
+    if ( rc )
+    {
+        panic("Unable to setup Dom0 permissions: %d\n", rc);
+        return rc;
+    }
+
+    update_domain_wallclock_time(d);
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2101,6 +2189,15 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_cpus(d, entry, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 CPUs: %d\n", rc);
+        return rc;
+    }
+
+    clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
+
     return 0;
 }
 
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (12 preceding siblings ...)
  2016-11-30 16:49 ` [PATCH v4 13/14] xen/x86: hack to setup PVHv2 Dom0 CPUs Roger Pau Monne
@ 2016-11-30 16:49 ` Roger Pau Monne
  2016-12-12 13:56   ` Jan Beulich
  13 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-11-30 16:49 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Create a new MADT table that contains the topology exposed to the guest. A
new XSDT table is also created, in order to filter the tables that we want
to expose to the guest, plus the Xen crafted MADT. This in turn requires Xen
to also create a new RSDP in order to make it point to the custom XSDT.

Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
reside in reserved regions. This is needed because some memory maps don't
properly account for all the memory used by ACPI, so it's common to find ACPI
tables in reserved regions.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v3:
 - Use hvm_copy_to_phys in order to copy the tables to Dom0 memory.
 - Return EEXIST for overlaping ranges in hvm_add_mem_range.
 - s/ov/ovr/ for interrupt override parsing functions.
 - Constify intr local variable in acpi_set_intr_ovr.
 - Use structure asignement for type safety.
 - Perform sizeof using local variables in hvm_setup_acpi_madt.
 - Manually set revision of crafted/modified tables.
 - Only map tables to guest that reside in reserved or ACPI memory regions.
 - Copy the RSDP OEM signature to the crafted RSDP.
 - Pair calls to acpi_os_map_memory/acpi_os_unmap_memory.
 - Add memory regions for allowed ACPI tables to the memory map and then
   perform the identity mappings. This avoids having to call modify_identity_mmio
   multiple times.
 - Add a FIXME comment regarding the lack of multiple vIO-APICs.

Changes since v2:
 - Completely reworked.
---
 xen/arch/x86/domain_build.c | 429 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 428 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 5c7592b..fc778b2 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -23,6 +23,7 @@
 #include <xen/libelf.h>
 #include <xen/pfn.h>
 #include <xen/guest_access.h>
+#include <xen/acpi.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -38,6 +39,8 @@
 #include <asm/io_apic.h>
 #include <asm/hpet.h>
 
+#include <acpi/actables.h>
+
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
 #include <public/hvm/hvm_info_table.h>
@@ -50,6 +53,9 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
 /* Size of the VM86 TSS for virtual 8086 mode to use. */
 #define HVM_VM86_TSS_SIZE   128
 
+static unsigned int __initdata acpi_intr_overrrides;
+static struct acpi_madt_interrupt_override __initdata *intsrcovr;
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -567,7 +573,7 @@ static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
      */
-    d->arch.e820 = xzalloc_array(struct e820entry, e820.nr_map);
+    d->arch.e820 = xzalloc_array(struct e820entry, E820MAX);
     if ( !d->arch.e820 )
         panic("Unable to allocate memory for Dom0 e820 map");
     entry_guest = d->arch.e820;
@@ -1779,6 +1785,55 @@ static int __init hvm_steal_ram(struct domain *d, unsigned long size,
     return -ENOMEM;
 }
 
+/* NB: memory map must be sorted at all times for this to work correctly. */
+static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
+                                    unsigned int type)
+{
+    unsigned int i;
+
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        uint64_t rs = d->arch.e820[i].addr;
+        uint64_t re = rs + d->arch.e820[i].size;
+
+        if ( rs == e && d->arch.e820[i].type == type )
+        {
+            d->arch.e820[i].addr = s;
+            return 0;
+        }
+
+        if ( re == s && d->arch.e820[i].type == type &&
+             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )
+        {
+            d->arch.e820[i].size += e - s;
+            return 0;
+        }
+
+        if ( rs >= e )
+            break;
+
+        if ( re > s )
+            return -EEXIST;
+    }
+
+    if ( d->arch.nr_e820 >= E820MAX )
+    {
+        printk(XENLOG_WARNING "E820: overflow while adding region"
+               "[%"PRIx64", %"PRIx64")\n", s, e);
+        return -ENOMEM;
+    }
+
+    memmove(d->arch.e820 + i + 1, d->arch.e820 + i,
+            (d->arch.nr_e820 - i) * sizeof(*d->arch.e820));
+
+    d->arch.nr_e820++;
+    d->arch.e820[i].addr = s;
+    d->arch.e820[i].size = e - s;
+    d->arch.e820[i].type = type;
+
+    return 0;
+}
+
 static int __init hvm_setup_vmx_realmode_helpers(struct domain *d)
 {
     p2m_type_t p2mt;
@@ -2157,6 +2212,371 @@ static int __init hvm_setup_cpus(struct domain *d, paddr_t entry,
     return 0;
 }
 
+static int __init acpi_count_intr_ovr(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+
+    acpi_intr_overrrides++;
+    return 0;
+}
+
+static int __init acpi_set_intr_ovr(struct acpi_subtable_header *header,
+                                    const unsigned long end)
+{
+    const struct acpi_madt_interrupt_override *intr =
+        container_of(header, struct acpi_madt_interrupt_override, header);
+
+    *intsrcovr = *intr;
+    intsrcovr++;
+
+    return 0;
+}
+
+static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
+{
+    struct acpi_table_madt *madt;
+    struct acpi_table_header *table;
+    struct acpi_madt_io_apic *io_apic;
+    struct acpi_madt_local_apic *local_apic;
+    acpi_status status;
+    unsigned long size;
+    unsigned int i;
+    int rc;
+
+    /* Count number of interrupt overrides in the MADT. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
+                          acpi_count_intr_ovr, MAX_IRQ_SOURCES);
+
+    /* Calculate the size of the crafted MADT. */
+    size = sizeof(*madt);
+    size += sizeof(*io_apic);
+    size += sizeof(*local_apic) * dom0_max_vcpus();
+    size += sizeof(struct acpi_madt_interrupt_override) * acpi_intr_overrrides;
+
+    madt = xzalloc_bytes(size);
+    if ( !madt )
+    {
+        printk("Unable to allocate memory for MADT table\n");
+        return -ENOMEM;
+    }
+
+    /* Copy the native MADT table header. */
+    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
+    if ( !ACPI_SUCCESS(status) )
+    {
+        printk("Failed to get MADT ACPI table, aborting.\n");
+        return -EINVAL;
+    }
+    *((struct acpi_table_header *)madt) = *table;
+    madt->address = APIC_DEFAULT_PHYS_BASE;
+    /*
+     * NB: this is currently set to 4, which is the revision in the ACPI
+     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
+     * tables described in the headers.
+     */
+    madt->header.revision = 4;
+
+    /*
+     * Setup the IO APIC entry.
+     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
+     * per domain. This must be fixed in order to provide the same amount of
+     * IO APICs as available on bare metal.
+     */
+    if ( nr_ioapics > 1 )
+        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
+               nr_ioapics);
+    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
+    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
+    io_apic->header.length = sizeof(*io_apic);
+    io_apic->id = 1;
+    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
+
+    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
+    for ( i = 0; i < dom0_max_vcpus(); i++ )
+    {
+        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
+        local_apic->header.length = sizeof(*local_apic);
+        local_apic->processor_id = i;
+        local_apic->id = i * 2;
+        local_apic->lapic_flags = ACPI_MADT_ENABLED;
+        local_apic++;
+    }
+
+    /* Setup interrupt overrides. */
+    intsrcovr = (struct acpi_madt_interrupt_override *)local_apic;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ovr,
+                          MAX_IRQ_SOURCES);
+    ASSERT(((unsigned char *)intsrcovr - (unsigned char *)madt) == size);
+    madt->header.length = size;
+    madt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, madt), size);
+
+    /* Place the new MADT in guest memory space. */
+    if ( hvm_steal_ram(d, size, GB(4), addr) )
+    {
+        printk("Unable to find allocate guest RAM for MADT\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( hvm_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add MADT region to memory map\n");
+
+    rc = hvm_copy_to_phys(d, *addr, madt, size);
+    if ( rc )
+    {
+        printk("Unable to copy MADT into guest memory\n");
+        return rc;
+    }
+    xfree(madt);
+
+    return 0;
+}
+
+static bool __init acpi_memory_banned(unsigned long mfn, unsigned long nr_pages)
+{
+    unsigned long i;
+
+    for ( i = 0 ; i < nr_pages; i++ )
+        if ( !page_is_ram_type(mfn + i, RAM_TYPE_RESERVED) &&
+             !page_is_ram_type(mfn + i, RAM_TYPE_ACPI) )
+            return true;
+
+    return false;
+}
+
+static bool __init hvm_acpi_table_allowed(const char *sig)
+{
+    static const char __init banned_tables[][ACPI_NAME_SIZE] = {
+        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
+        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
+    unsigned long pfn, nr_pages;
+    int i;
+
+    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
+        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
+            return false;
+
+    /* Make sure table doesn't reside in a RAM region. */
+    pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
+    nr_pages = DIV_ROUND_UP(
+                    (acpi_gbl_root_table_list.tables[i].address & ~PAGE_MASK) +
+                    acpi_gbl_root_table_list.tables[i].length, PAGE_SIZE);
+    if ( acpi_memory_banned(pfn, nr_pages) )
+    {
+        printk("Skipping table %.4s because resides in a non ACPI or reserved region\n",
+               sig);
+        return false;
+    }
+
+    return true;
+}
+
+static int __init hvm_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
+                                      paddr_t *addr)
+{
+    struct acpi_table_xsdt *xsdt;
+    struct acpi_table_header *table;
+    struct acpi_table_rsdp *rsdp;
+    unsigned long size;
+    unsigned int i, num_tables;
+    paddr_t xsdt_paddr;
+    int j, rc;
+
+    /*
+     * Restore original DMAR table signature, we are going to filter it
+     * from the new XSDT that is presented to the guest, so it no longer
+     * makes sense to have it's signature zapped.
+     */
+    acpi_dmar_reinstate();
+
+    /* Account for the space needed by the XSDT. */
+    size = sizeof(*xsdt);
+    num_tables = 0;
+
+    /* Count the number of tables that will be added to the XSDT. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( hvm_acpi_table_allowed(sig) )
+            num_tables++;
+    }
+
+    /*
+     * No need to subtract one because we will be adding a custom MADT (and
+     * the native one is not accounted for).
+     */
+    size += num_tables * sizeof(xsdt->table_offset_entry[0]);
+
+    xsdt = xzalloc_bytes(size);
+    if ( !xsdt )
+    {
+        printk("Unable to allocate memory for XSDT table\n");
+        return -ENOMEM;
+    }
+
+    /* Copy the native XSDT table header. */
+    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
+    if ( !rsdp )
+    {
+        printk("Unable to map RSDP\n");
+        return -EINVAL;
+    }
+    xsdt_paddr = rsdp->xsdt_physical_address;
+    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
+    table = acpi_os_map_memory(xsdt_paddr, sizeof(*table));
+    if ( !table )
+    {
+        printk("Unable to map XSDT\n");
+        return -EINVAL;
+    }
+    *((struct acpi_table_header *)xsdt) = *table;
+    acpi_os_unmap_memory(table, sizeof(*table));
+
+    /* Add the custom MADT. */
+    j = 0;
+    xsdt->table_offset_entry[j++] = madt_addr;
+
+    /* Copy the addresses of the rest of the allowed tables. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( hvm_acpi_table_allowed(sig) )
+            xsdt->table_offset_entry[j++] =
+                                acpi_gbl_root_table_list.tables[i].address;
+    }
+
+    xsdt->header.revision = 1;
+    xsdt->header.length = size;
+    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
+
+    /* Place the new XSDT in guest memory space. */
+    if ( hvm_steal_ram(d, size, GB(4), addr) )
+    {
+        printk("Unable to find guest RAM for XSDT\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( hvm_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add XSDT region to memory map\n");
+
+    rc = hvm_copy_to_phys(d, *addr, xsdt, size);
+    if ( rc )
+    {
+        printk("Unable to copy XSDT into guest memory\n");
+        return rc;
+    }
+    xfree(xsdt);
+
+    return 0;
+}
+
+static int __init hvm_setup_acpi(struct domain *d, paddr_t start_info)
+{
+    struct acpi_table_rsdp rsdp, *native_rsdp;
+    unsigned long pfn, nr_pages;
+    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
+    unsigned int i;
+    int rc;
+
+    /* Scan top-level tables and add their regions to the guest memory map. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( hvm_acpi_table_allowed(sig) )
+            hvm_add_mem_range(d, acpi_gbl_root_table_list.tables[i].address,
+                              acpi_gbl_root_table_list.tables[i].address +
+                              acpi_gbl_root_table_list.tables[i].length,
+                              E820_ACPI);
+    }
+
+    /* Identity map ACPI e820 regions. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_ACPI &&
+             d->arch.e820[i].type != E820_NVS )
+            continue;
+
+        pfn = PFN_DOWN(d->arch.e820[i].addr);
+        nr_pages = DIV_ROUND_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
+                                d->arch.e820[i].size, PAGE_SIZE);
+
+        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        if ( rc )
+        {
+            printk(
+                "Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+            return rc;
+        }
+    }
+
+    rc = hvm_setup_acpi_madt(d, &madt_paddr);
+    if ( rc )
+        return rc;
+
+    rc = hvm_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
+    if ( rc )
+        return rc;
+
+    /* Craft a custom RSDP. */
+    memset(&rsdp, 0, sizeof(rsdp));
+    memcpy(rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
+    native_rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(rsdp));
+    memcpy(rsdp.oem_id, native_rsdp->oem_id, sizeof(rsdp.oem_id));
+    acpi_os_unmap_memory(native_rsdp, sizeof(rsdp));
+    rsdp.revision = 2;
+    rsdp.xsdt_physical_address = xsdt_paddr;
+    rsdp.rsdt_physical_address = 0;
+    rsdp.length = sizeof(rsdp);
+    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                      ACPI_RSDP_REV0_SIZE);
+    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                               sizeof(rsdp));
+
+    /*
+     * Place the new RSDP in guest memory space.
+     *
+     * NB: this RSDP is not going to replace the original RSDP, which
+     * should still be accessible to the guest. However that RSDP is
+     * going to point to the native RSDT, and should not be used.
+     */
+    if ( hvm_steal_ram(d, sizeof(rsdp), GB(4), &rsdp_paddr) )
+    {
+        printk("Unable to allocate guest RAM for RSDP\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( hvm_add_mem_range(d, rsdp_paddr, rsdp_paddr + sizeof(rsdp),
+                           E820_ACPI) )
+        printk("Unable to add RSDP region to memory map\n");
+
+    /* Copy RSDP into guest memory. */
+    rc = hvm_copy_to_phys(d, rsdp_paddr, &rsdp, sizeof(rsdp));
+    if ( rc )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return rc;
+    }
+
+    /* Copy RSDP address to start_info. */
+    rc = hvm_copy_to_phys(d, start_info +
+                          offsetof(struct hvm_start_info, rsdp_paddr),
+                          &rsdp_paddr,
+                          sizeof(((struct hvm_start_info *)0)->rsdp_paddr));
+    if ( rc )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return rc;
+    }
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2196,6 +2616,13 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_acpi(d, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 ACPI tables: %d\n", rc);
+        return rc;
+    }
+
     clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
 
     return 0;
-- 
2.9.3 (Apple Git-75)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 06/14] x86/vtd: refuse to enable IOMMU if the PCI scan fails
  2016-11-30 16:49 ` [PATCH v4 06/14] x86/vtd: refuse to enable IOMMU if the PCI scan fails Roger Pau Monne
@ 2016-12-01  0:29   ` Tian, Kevin
  0 siblings, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2016-12-01  0:29 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Wu, Feng, Jan Beulich

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: Thursday, December 01, 2016 12:50 AM
> 
> This provides uniform behavior between Intel and AMD IOMMU initialization, and
> is a requirement for PVHv2 Dom0, that depends on a working IOMMU plus the PCI
> bus being scanned for devices.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 03/14] xen/x86: allow calling {shadow/hap}_set_allocation with the idle domain
  2016-11-30 16:49 ` [PATCH v4 03/14] xen/x86: allow calling {shadow/hap}_set_allocation with the idle domain Roger Pau Monne
@ 2016-12-01 11:25   ` Tim Deegan
  0 siblings, 0 replies; 59+ messages in thread
From: Tim Deegan @ 2016-12-01 11:25 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, xen-devel, boris.ostrovsky

At 16:49 +0000 on 30 Nov (1480524579), Roger Pau Monne wrote:
> ... and using the "preempted" parameter. Introduce a new helper that can
> be used from both hypercall or idle vcpu context (ie: during Dom0
> creation) in order to check if preemption is needed. If such preemption
> happens, the caller should then call process_pending_softirqs in order to
> drain the pending softirqs, and then call *_set_allocation again to continue
> with it's execution.
> 
> This allows us to call *_set_allocation() when building domain 0.
> 
> While there also document hypercall_preempt_check and add an assert to
> local_events_need_delivery in order to be sure it's not called by the idle
> domain, which doesn't receive any events (and that in turn
> hypercall_preempt_check is also not called by the idle domain).
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Acked-by: George Dunlap <george.dunlap@citrix.com>

Acked-by: Tim Deegan <tim@xen.org>

Thanks,

Tim.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-30 16:49 ` [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
@ 2016-12-08 16:21   ` Jan Beulich
  2016-12-16 11:24     ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-08 16:21 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> --- a/docs/misc/hvmlite.markdown
> +++ b/docs/misc/hvmlite.markdown
> @@ -75,3 +75,23 @@ info structure that's passed at boot time (field rsdp_paddr).
>  
>  Description of paravirtualized devices will come from XenStore, just as it's
>  done for HVM guests.
> +
> +## Interrupts ##
> +
> +### Interrupts from physical devices ###
> +
> +Interrupts from physical devices are delivered using native methods, this is
> +done in order to take advantage of new hardware assisted virtualization
> +functions, like posted interrupts. This implies that PVHv2 guests with physical
> +devices will also have the necessary interrupt controllers in order to manage
> +the delivery of interrupts from those devices, using the same interfaces that
> +are available on native hardware.

I continue to not really agree with this. For a while you can't assume
posted interrupts to be universally available. And iirc we still don't
enable their use by default in Xen. Hence I could see this to be the
preferred mechanism, but the event channel based mechanism may
want to remain an alternative, the more that ...

> +### Interrupts from paravirtualized devices ###
> +
> +Interrupts from paravirtualized devices are delivered using event channels, see
> +[Event Channel Internals][event_channels] for more detailed information about
> +event channels. Delivery of those interrupts can be configured in the same way
> +as HVM guests, check xen/include/public/hvm/params.h and
> +xen/include/public/hvm/hvm_op.h for more information about available delivery
> +methods.

... you still need them anyway.

Similarly I continue to be in disagreement with the blanket disabling
of all physdev ops.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2016-11-30 16:49 ` [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
@ 2016-12-08 16:24   ` Jan Beulich
  2016-12-16 14:01     ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-08 16:24 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
>      case p2m_grant_map_rw:
>      case p2m_ram_logdirty:
>      case p2m_map_foreign:
> +    case p2m_mmio_direct:
>          flags =  IOMMUF_readable | IOMMUF_writable;
>          break;

I think you need to take mmio_ro_ranges into account here.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 08/14] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-30 16:49 ` [PATCH v4 08/14] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
@ 2016-12-08 16:29   ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2016-12-08 16:29 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> Allow the use of both the emulated local APIC and IO APIC for the hardware
> domain.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>
albeit I'm not 100% convinced of ...

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
>          xfree(v->arch.pv_vcpu.trap_ctxt);
>  }
>  
> +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
> +{
> +
> +    if ( is_hvm_domain(d) )
> +    {
> +        if ( is_hardware_domain(d) &&
> +             emflags != (XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )

... requiring IO-APIC emulation here. But we can adjust that once
we find a legacy free system to test on.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-11-30 16:49 ` [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2016-12-09 16:07   ` Jan Beulich
  2016-12-16 14:28     ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-09 16:07 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -656,6 +656,23 @@ affinities to prefer but be not limited to the specified node(s).
>  
>  Pin dom0 vcpus to their respective pcpus
>  
> +### dom0
> +> `= List of [ hvm | shadow ]`
> +
> +> Sub-options:
> +
> +> `hvm`
> +
> +> Default: `false`
> +
> +Flag that makes a dom0 boot in PVHv2 mode.
> +
> +> `shadow`
> +
> +> Default: `false`
> +
> +Flag that makes a dom0 use shadow paging.

Would you mind marking dom0_shadow deprecated at once? In fact
I wouldn't mind if it was removed from the documentation altogether,
the more that it still has no description at all.

> @@ -1655,6 +1653,28 @@ out:
>      return rc;
>  }
>  
> +static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
> +                                     unsigned long image_headroom,
> +                                     module_t *initrd,
> +                                     void *(*bootstrap_map)(const module_t *),
> +                                     char *cmdline)
> +{
> +
> +    printk("** Building a PVH Dom0 **\n");

Why again is it that you call the function "hvm" but mean "pvh"?

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -187,6 +187,35 @@ static void __init parse_acpi_param(char *s)
>      }
>  }
>  
> +/*
> + * List of parameters that affect Dom0 creation:
> + *
> + *  - hvm               Create a PVHv2 Dom0.
> + *  - shadow            Use shadow paging for Dom0.
> + */
> +static bool __initdata dom0_hvm;
> +static void __init parse_dom0_param(char *s)
> +{
> +    char *ss;
> +
> +    do {
> +
> +        ss = strchr(s, ',');
> +        if ( ss )
> +            *ss = '\0';
> +
> +        if ( !strcmp(s, "hvm") )
> +            dom0_hvm = true;
> +#ifdef CONFIG_SHADOW_PAGING
> +        else if ( !strcmp(s, "shadow") )
> +            opt_dom0_shadow = true;
> +#endif
> +
> +        s = ss + 1;
> +    } while ( ss );
> +}
> +custom_param("dom0", parse_dom0_param);

I continue to think that this should live in domain_build.c, and
dom0_hvm be the one off variable which needs to be global. After
all we intend to extend the "dom0=" quite a bit (presumably to
subsume everything which the various "dom0..." options now do),
and all that stuff lives there anyway.

> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -57,4 +57,10 @@ extern uint8_t kbd_shift_flags;
>  extern unsigned long highmem_start;
>  #endif
>  
> +#ifdef CONFIG_SHADOW_PAGING
> +extern bool opt_dom0_shadow;
> +#else
> +#define opt_dom0_shadow 0

"false" please, to match up with "bool".

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-30 16:49 ` [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2016-12-09 16:48   ` Jan Beulich
  2016-12-16 17:38     ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-09 16:48 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> @@ -302,7 +307,8 @@ static unsigned long __init compute_dom0_nr_pages(
>              avail -= max_pdx >> s;
>      }
>  
> -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> +    need_paging = opt_dom0_shadow || (has_hvm_container_domain(d) &&
> +                  (!iommu_hap_pt_share || !paging_mode_hap(d)));

Indentation.

> @@ -545,11 +552,12 @@ static __init void pvh_map_all_iomem(struct domain *d, unsigned long nr_pages)
>      ASSERT(nr_holes == 0);
>  }
>  
> -static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> +static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)

Why?

> @@ -577,8 +585,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>              continue;
>          }
>  
> -        *entry_guest = *entry;
> -        pages = PFN_UP(entry_guest->size);
> +        /*
> +         * Make sure the start and length are aligned to PAGE_SIZE, because
> +         * that's the minimum granularity of the 2nd stage translation.
> +         */
> +        start = ROUNDUP(entry->addr, PAGE_SIZE);
> +        end = (entry->addr + entry->size) & PAGE_MASK;

Taking the comment into consideration, I wonder whether you
wouldn't better use PAGE_ORDER_4K here, as that's what the
p2m code uses.

> @@ -1010,8 +1029,6 @@ static int __init construct_dom0_pv(
>      BUG_ON(d->vcpu[0] == NULL);
>      BUG_ON(v->is_initialised);
>  
> -    process_pending_softirqs();

Wouldn't this adjustment better fit into the previous patch, together
with its companion below?

> +static int __init hvm_steal_ram(struct domain *d, unsigned long size,
> +                                paddr_t limit, paddr_t *addr)
> +{
> +    unsigned int i = d->arch.nr_e820;
> +
> +    while ( i-- )
> +    {
> +        struct e820entry *entry = &d->arch.e820[i];
> +
> +        if ( entry->type != E820_RAM || entry->size < size )
> +            continue;
> +
> +        /* Subtract from the beginning. */
> +        if ( entry->addr + size < limit && entry->addr >= MB(1) )

<= I think (for the left comparison)?

> +static void __init hvm_steal_low_ram(struct domain *d, unsigned long start,
> +                                     unsigned long nr_pages)
> +{
> +    unsigned long mfn;
> +
> +    ASSERT(start + nr_pages < PFN_DOWN(MB(1)));

<= again I think.

> +static int __init hvm_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    unsigned int i;
> +    int rc;
> +    bool preempted;
> +#define MB1_PAGES PFN_DOWN(MB(1))
> +
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> +
> +    hvm_setup_e820(d, nr_pages);
> +    do {
> +        preempted = false;
> +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                              &preempted);
> +        process_pending_softirqs();
> +    } while ( preempted );
> +
> +    /*
> +     * Memory below 1MB is identity mapped.
> +     * NB: this only makes sense when booted from legacy BIOS.
> +     */
> +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
> +    if ( rc )
> +    {
> +        printk("Failed to identity map low 1MB: %d\n", rc);
> +        return rc;
> +    }
> +
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        unsigned long addr, size;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        addr = PFN_DOWN(d->arch.e820[i].addr);
> +        size = PFN_DOWN(d->arch.e820[i].size);
> +
> +        if ( addr >= MB1_PAGES )
> +            rc = hvm_populate_memory_range(d, addr, size);
> +        else if ( addr + size > MB1_PAGES )
> +        {
> +            hvm_steal_low_ram(d, addr, MB1_PAGES - addr);
> +            rc = hvm_populate_memory_range(d, MB1_PAGES,
> +                                           size - (MB1_PAGES - addr));

Is this case possible at all? All x86 systems have some form of
BIOS right below the 1Mb boundary, and the E820 map for
Dom0 is being derived from the host one.

> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -475,6 +475,43 @@ void share_xen_page_with_guest(
>      spin_unlock(&d->page_alloc_lock);
>  }
>  
> +int unshare_xen_page_with_guest(struct page_info *page, struct domain *d)

__init

And once its __init, it may be possible to simplify it, as you don't need
to fear races anymore. E.g. you wouldn't need a loop over cmpxchg().

> +{
> +    unsigned long y, x;
> +    bool drop_dom_ref;
> +
> +    if ( page_get_owner(page) != d || !(page->count_info & PGC_xen_heap) )

Please don't open code is_xen_heap_page().

> +        return -EINVAL;
> +
> +    spin_lock(&d->page_alloc_lock);
> +
> +    /* Drop the page reference so we can chanfge the owner. */
> +    y = page->count_info;
> +    do {
> +        x = y;
> +        if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
> +        {
> +            spin_unlock(&d->page_alloc_lock);
> +            return -EINVAL;
> +        }
> +        y = cmpxchg(&page->count_info, x, PGC_xen_heap);
> +    } while ( y != x );
> +
> +    /* Remove the page form the list of domain pages. */
> +    page_list_del(page, &d->xenpage_list);
> +    drop_dom_ref = (--d->xenheap_pages == 0);

Aren't you open coding

    if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
        put_page(page);

here (except for the check on the initial value, which could be
moved up)?

> +    /* Remove the owner and clear the flags. */
> +    page_set_owner(page, NULL);
> +    page->u.inuse.type_info = 0;

I think you'd better bail early if this is non-zero. Or else please use
the order used elsewhere (clearing type info, then owner) - while
it's benign, it avoids someone later wondering whether the order
is wrong in either place.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2
  2016-11-30 16:49 ` [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2016-12-09 17:05   ` Jan Beulich
  2016-12-20 17:34     ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-09 17:05 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> @@ -1930,12 +1931,148 @@ static int __init hvm_setup_p2m(struct domain *d)
>  #undef MB1_PAGES
>  }
>  
> +static int __init hvm_copy_to_phys(struct domain *d, paddr_t paddr, void *buf,
> +                                   int size)

I guess you made size plain int because hvm_copy_to_guest_phys()
has it that way, but please let's not spread such bogus things - sizes
can't possibly be negative.

> +{
> +    struct vcpu *saved_current;
> +    int rc;
> +
> +    saved_current = current;
> +    set_current(d->vcpu[0]);
> +    rc = hvm_copy_to_guest_phys(paddr, buf, size);
> +    set_current(saved_current);

I continue to be uncertain about the behavior of this if something
inside hvm_copy_to_guest_phys() goes wrong: Did you either
statically analyze the code or try in practice out whether the
playing with current makes understanding the crash output any
harder?

While there's going to be some work involved with it, I do think
that the use here might be a reason for the whole hvm_copy()
machinery to gain a struct vcpu* parameter.

> +static int __init hvm_load_kernel(struct domain *d, const module_t *image,
> +                                  unsigned long image_headroom,
> +                                  module_t *initrd, char *image_base,
> +                                  char *cmdline, paddr_t *entry,
> +                                  paddr_t *start_info_addr)
> +{
> +    char *image_start = image_base + image_headroom;
> +    unsigned long image_len = image->mod_end;
> +    struct elf_binary elf;
> +    struct elf_dom_parms parms;
> +    paddr_t last_addr;
> +    struct hvm_start_info start_info;
> +    struct hvm_modlist_entry mod;
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    int rc;
> +
> +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> +    {
> +        printk("Error trying to detect bz compressed kernel\n");
> +        return rc;
> +    }
> +
> +    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> +    {
> +        printk("Unable to init ELF\n");
> +        return rc;
> +    }
> +#ifdef VERBOSE
> +    elf_set_verbose(&elf);
> +#endif
> +    elf_parse_binary(&elf);
> +    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
> +    {
> +        printk("Unable to parse kernel for ELFNOTES\n");
> +        return rc;
> +    }
> +
> +    if ( parms.phys_entry == UNSET_ADDR32 ) {
> +        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
> +        return -EINVAL;
> +    }
> +
> +    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
> +           parms.guest_ver, parms.loader,
> +           elf_64bit(&elf) ? "64-bit" : "32-bit");
> +
> +    /* Copy the OS image and free temporary buffer. */
> +    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
> +    elf.dest_size = parms.virt_kend - parms.virt_kstart;
> +
> +    saved_current = current;
> +    set_current(v);
> +    rc = elf_load_binary(&elf);
> +    set_current(saved_current);

Same reservations as above.

> +    if ( rc < 0 )
> +    {
> +        printk("Failed to load kernel: %d\n", rc);
> +        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
> +        return rc;
> +    }
> +
> +    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> +
> +    if ( initrd != NULL )
> +    {
> +        rc = hvm_copy_to_phys(d, last_addr, mfn_to_virt(initrd->mod_start),
> +                              initrd->mod_end);
> +        if ( rc )
> +        {
> +            printk("Unable to copy initrd to guest\n");
> +            return rc;
> +        }
> +
> +        mod.paddr = last_addr;
> +        mod.size = initrd->mod_end;
> +        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
> +    }

mod is left uninitialized in the else case afaict - I don't think all
compilers we support (plus Coverity) can spot the common
dependency on initrd != NULL.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 12/14] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0
  2016-11-30 16:49 ` [PATCH v4 12/14] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
@ 2016-12-09 17:09   ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2016-12-09 17:09 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> @@ -176,6 +177,8 @@ unsigned int __init dom0_max_vcpus(void)
>          max_vcpus = opt_dom0_max_vcpus_max;
>      if ( max_vcpus > MAX_VIRT_CPUS )
>          max_vcpus = MAX_VIRT_CPUS;
> +    if ( dom0_hvm )
> +        max_vcpus = min_t(typeof(max_vcpus), max_vcpus, HVM_MAX_VCPUS);

I don't see the need for min_t() here - just follow the code right
before your addition:

    if ( dom0_hvm && max_vcpus > HVM_MAX_VCPUS )
        max_vcpus = HVM_MAX_VCPUS;

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -193,7 +193,7 @@ static void __init parse_acpi_param(char *s)
>   *  - hvm               Create a PVHv2 Dom0.
>   *  - shadow            Use shadow paging for Dom0.
>   */
> -static bool __initdata dom0_hvm;
> +bool __initdata dom0_hvm;
>  static void __init parse_dom0_param(char *s)
>  {
>      char *ss;
> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -63,4 +63,6 @@ extern bool opt_dom0_shadow;
>  #define opt_dom0_shadow 0
>  #endif
>  
> +extern bool dom0_hvm;

One more argument to move the command ling option parsing to
domain_build.c.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-11-30 16:49 ` [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
@ 2016-12-12 13:56   ` Jan Beulich
  2016-12-21 16:32     ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-12 13:56 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> @@ -567,7 +573,7 @@ static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
>      /*
>       * Craft the e820 memory map for Dom0 based on the hardware e820 map.
>       */
> -    d->arch.e820 = xzalloc_array(struct e820entry, e820.nr_map);
> +    d->arch.e820 = xzalloc_array(struct e820entry, E820MAX);

I have to admit that I consider this both wasteful and inflexible.
While commonly you'll need far less entries, what if in fact you
need more?

> +static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
> +{
> +    struct acpi_table_madt *madt;
> +    struct acpi_table_header *table;
> +    struct acpi_madt_io_apic *io_apic;
> +    struct acpi_madt_local_apic *local_apic;

I think I had asked about the absence of struct acpi_madt_local_x2apic
here before, but now that I look again I also wonder how you get away
without struct acpi_madt_nmi_source and struct acpi_madt_local_apic_nmi;
I can kind of see struct acpi_madt_local_apic_override not being
needed here (and I'd really hope anyway that no BIOS actually makes
use of it). The question mainly is what possible damage there may be
if what Dom0 gets to see disagrees with what the firmware acts on
(remember that AML code may alter MADT contents).

> +    acpi_status status;
> +    unsigned long size;
> +    unsigned int i;
> +    int rc;
> +
> +    /* Count number of interrupt overrides in the MADT. */
> +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> +                          acpi_count_intr_ovr, MAX_IRQ_SOURCES);

What's the reason for using MAX_IRQ_SOURCES here? There's no
static array involved here limiting the supported count.

> +    /* Calculate the size of the crafted MADT. */
> +    size = sizeof(*madt);
> +    size += sizeof(*io_apic);
> +    size += sizeof(*local_apic) * dom0_max_vcpus();
> +    size += sizeof(struct acpi_madt_interrupt_override) * acpi_intr_overrrides;

sizeof(*intsrcovr)

> +    madt = xzalloc_bytes(size);
> +    if ( !madt )
> +    {
> +        printk("Unable to allocate memory for MADT table\n");
> +        return -ENOMEM;
> +    }
> +
> +    /* Copy the native MADT table header. */
> +    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
> +    if ( !ACPI_SUCCESS(status) )
> +    {
> +        printk("Failed to get MADT ACPI table, aborting.\n");
> +        return -EINVAL;
> +    }
> +    *((struct acpi_table_header *)madt) = *table;

madt.header = *table;

> +    madt->address = APIC_DEFAULT_PHYS_BASE;
> +    /*
> +     * NB: this is currently set to 4, which is the revision in the ACPI
> +     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
> +     * tables described in the headers.
> +     */
> +    madt->header.revision = 4;

I can see you wanting to cap the revision here, but is it safe to
possibly bump it from what firmware has?

> +    /*
> +     * Setup the IO APIC entry.
> +     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
> +     * per domain. This must be fixed in order to provide the same amount of
> +     * IO APICs as available on bare metal.
> +     */
> +    if ( nr_ioapics > 1 )
> +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> +               nr_ioapics);
> +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> +    io_apic->header.length = sizeof(*io_apic);
> +    io_apic->id = 1;

Is it safe to use an ID other than what firmware did assign? Where
is this 1 coming from? And how does this get sync-ed up with the
respective struct hvm_hw_vioapic field?

> +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
> +
> +    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
> +    for ( i = 0; i < dom0_max_vcpus(); i++ )
> +    {
> +        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
> +        local_apic->header.length = sizeof(*local_apic);
> +        local_apic->processor_id = i;
> +        local_apic->id = i * 2;
> +        local_apic->lapic_flags = ACPI_MADT_ENABLED;
> +        local_apic++;
> +    }
> +
> +    /* Setup interrupt overrides. */
> +    intsrcovr = (struct acpi_madt_interrupt_override *)local_apic;
> +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ovr,
> +                          MAX_IRQ_SOURCES);
> +    ASSERT(((unsigned char *)intsrcovr - (unsigned char *)madt) == size);

Likely easier to read if you used void * or long for the casts here.

> +static bool __init hvm_acpi_table_allowed(const char *sig)
> +{
> +    static const char __init banned_tables[][ACPI_NAME_SIZE] = {
> +        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
> +        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
> +    unsigned long pfn, nr_pages;
> +    int i;
> +
> +    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
> +        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
> +            return false;
> +
> +    /* Make sure table doesn't reside in a RAM region. */
> +    pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> +    nr_pages = DIV_ROUND_UP(
> +                    (acpi_gbl_root_table_list.tables[i].address & ~PAGE_MASK) +
> +                    acpi_gbl_root_table_list.tables[i].length, PAGE_SIZE);

How did you choose indentation here? I think there are a number
of ways compatible with how we commonly do it, but this is not one
of them. Preferably

    nr_pages =
        DIV_ROUND_UP((acpi_gbl_root_table_list.tables[i].address & ~PAGE_MASK) +
                     acpi_gbl_root_table_list.tables[i].length,
                     PAGE_SIZE);

or even

    nr_pages =
        PFN_UP((acpi_gbl_root_table_list.tables[i].address & ~PAGE_MASK) +
               acpi_gbl_root_table_list.tables[i].length);

> +    if ( acpi_memory_banned(pfn, nr_pages) )
> +    {
> +        printk("Skipping table %.4s because resides in a non ACPI or reserved region\n",

I think this wording can be mis-read. How about "... because resides
in a non-ACPI, non-reserved region"?

> +static int __init hvm_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
> +                                      paddr_t *addr)
> +{
> +    struct acpi_table_xsdt *xsdt;
> +    struct acpi_table_header *table;
> +    struct acpi_table_rsdp *rsdp;
> +    unsigned long size;
> +    unsigned int i, num_tables;
> +    paddr_t xsdt_paddr;
> +    int j, rc;

How come i is properly unsigned, but j is plain int?

> +    /*
> +     * Restore original DMAR table signature, we are going to filter it
> +     * from the new XSDT that is presented to the guest, so it no longer
> +     * makes sense to have it's signature zapped.
> +     */
> +    acpi_dmar_reinstate();
> +
> +    /* Account for the space needed by the XSDT. */
> +    size = sizeof(*xsdt);
> +    num_tables = 0;

Mind making these the initializers of the variables?

> +    /* Count the number of tables that will be added to the XSDT. */
> +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> +    {
> +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> +
> +        if ( hvm_acpi_table_allowed(sig) )
> +            num_tables++;
> +    }
> +
> +    /*
> +     * No need to subtract one because we will be adding a custom MADT (and
> +     * the native one is not accounted for).
> +     */
> +    size += num_tables * sizeof(xsdt->table_offset_entry[0]);

The comment has been confusing me each time I've come here so
far. The reason you don't need to add or subtract anything is
because struct acpi_table_xsdt includes one array slot already
(which isn't visible here, i.e. one would need to look at the header),
_and_ you've skipped native MADT _and_ you're going to add a
made up MADT, so I think it would be better to explain it in those
terms.

> +    xsdt = xzalloc_bytes(size);
> +    if ( !xsdt )
> +    {
> +        printk("Unable to allocate memory for XSDT table\n");
> +        return -ENOMEM;
> +    }
> +
> +    /* Copy the native XSDT table header. */
> +    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
> +    if ( !rsdp )
> +    {
> +        printk("Unable to map RSDP\n");
> +        return -EINVAL;
> +    }
> +    xsdt_paddr = rsdp->xsdt_physical_address;
> +    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
> +    table = acpi_os_map_memory(xsdt_paddr, sizeof(*table));
> +    if ( !table )
> +    {
> +        printk("Unable to map XSDT\n");
> +        return -EINVAL;
> +    }
> +    *((struct acpi_table_header *)xsdt) = *table;

xsdt.header

> +    acpi_os_unmap_memory(table, sizeof(*table));
> +
> +    /* Add the custom MADT. */
> +    j = 0;
> +    xsdt->table_offset_entry[j++] = madt_addr;

Please re-use num_tables for the counting here. It also would
likely be a little more clear if you used plain 0 as array index
and assigned 1 right away.

> +static int __init hvm_setup_acpi(struct domain *d, paddr_t start_info)
> +{
> +    struct acpi_table_rsdp rsdp, *native_rsdp;
> +    unsigned long pfn, nr_pages;
> +    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
> +    unsigned int i;
> +    int rc;
> +
> +    /* Scan top-level tables and add their regions to the guest memory map. 
> */
> +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> +    {
> +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> +
> +        if ( hvm_acpi_table_allowed(sig) )
> +            hvm_add_mem_range(d, acpi_gbl_root_table_list.tables[i].address,
> +                              acpi_gbl_root_table_list.tables[i].address +
> +                              acpi_gbl_root_table_list.tables[i].length,
> +                              E820_ACPI);
> +    }
> +
> +    /* Identity map ACPI e820 regions. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_ACPI &&
> +             d->arch.e820[i].type != E820_NVS )
> +            continue;
> +
> +        pfn = PFN_DOWN(d->arch.e820[i].addr);
> +        nr_pages = DIV_ROUND_UP((d->arch.e820[i].addr & ~PAGE_MASK) +
> +                                d->arch.e820[i].size, PAGE_SIZE);

Perhaps PFN_UP() again?

> +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> +        if ( rc )
> +        {
> +            printk(
> +                "Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",

Please don't split lines like this.

> +                   pfn, pfn + nr_pages);
> +            return rc;
> +        }
> +    }
> +
> +    rc = hvm_setup_acpi_madt(d, &madt_paddr);
> +    if ( rc )
> +        return rc;
> +
> +    rc = hvm_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
> +    if ( rc )
> +        return rc;
> +
> +    /* Craft a custom RSDP. */
> +    memset(&rsdp, 0, sizeof(rsdp));
> +    memcpy(rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
> +    native_rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(rsdp));

While this one is fine to call from here, ...

> +    memcpy(rsdp.oem_id, native_rsdp->oem_id, sizeof(rsdp.oem_id));
> +    acpi_os_unmap_memory(native_rsdp, sizeof(rsdp));
> +    rsdp.revision = 2;
> +    rsdp.xsdt_physical_address = xsdt_paddr;
> +    rsdp.rsdt_physical_address = 0;
> +    rsdp.length = sizeof(rsdp);
> +    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
> +                                      ACPI_RSDP_REV0_SIZE);
> +    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
> +                                               sizeof(rsdp));

... these two represent a layering violation: Code outside of
xen/drivers/acpi/*.c is not supposed to call anything in
xen/drivers/acpi/{tables,utilities}/. Since introducing a wrapper
is likely overkill for the purpose here I think we can tolerate it if
suitably annotated in a comment. (Granted, ARM has a similar
issue.)

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-12-08 16:21   ` Jan Beulich
@ 2016-12-16 11:24     ` Roger Pau Monne
  0 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 11:24 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Dec 08, 2016 at 09:21:23AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > --- a/docs/misc/hvmlite.markdown
> > +++ b/docs/misc/hvmlite.markdown
> > @@ -75,3 +75,23 @@ info structure that's passed at boot time (field rsdp_paddr).
> >  
> >  Description of paravirtualized devices will come from XenStore, just as it's
> >  done for HVM guests.
> > +
> > +## Interrupts ##
> > +
> > +### Interrupts from physical devices ###
> > +
> > +Interrupts from physical devices are delivered using native methods, this is
> > +done in order to take advantage of new hardware assisted virtualization
> > +functions, like posted interrupts. This implies that PVHv2 guests with physical
> > +devices will also have the necessary interrupt controllers in order to manage
> > +the delivery of interrupts from those devices, using the same interfaces that
> > +are available on native hardware.
> 
> I continue to not really agree with this. For a while you can't assume
> posted interrupts to be universally available. And iirc we still don't
> enable their use by default in Xen. Hence I could see this to be the
> preferred mechanism, but the event channel based mechanism may
> want to remain an alternative, the more that ...

Maybe they are not universally available now, but AFAICT the trend in the
hardware industry is to follow that route. Keep in mind that when PVHv2 Dom0
becomes production ready most of this technologies are going to be widely
available.

> > +### Interrupts from paravirtualized devices ###
> > +
> > +Interrupts from paravirtualized devices are delivered using event channels, see
> > +[Event Channel Internals][event_channels] for more detailed information about
> > +event channels. Delivery of those interrupts can be configured in the same way
> > +as HVM guests, check xen/include/public/hvm/params.h and
> > +xen/include/public/hvm/hvm_op.h for more information about available delivery
> > +methods.
> 
> ... you still need them anyway.

Well, I think that event channels from virt devices and event channels from
physical ones are quite different. IIRC Linux even uses a different IRQ chip in
order to handle them.

> Similarly I continue to be in disagreement with the blanket disabling
> of all physdev ops.

I'm not opposed to enabling some of them. I think we already agreed that at
least PHYSDEVOP_pci_mmcfg_reserved is required on some hardware.

I'm just opposed to wholesale enabling them all, like we did for PVHv1. Also,
if we enable them now, we will be tied to this ABI, and then the only solution
if we ever want to get rid of them is to bump the ABI.

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages
  2016-12-08 16:24   ` Jan Beulich
@ 2016-12-16 14:01     ` Roger Pau Monne
  0 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 14:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, boris.ostrovsky

On Thu, Dec 08, 2016 at 09:24:08AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > --- a/xen/include/asm-x86/p2m.h
> > +++ b/xen/include/asm-x86/p2m.h
> > @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
> >      case p2m_grant_map_rw:
> >      case p2m_ram_logdirty:
> >      case p2m_map_foreign:
> > +    case p2m_mmio_direct:
> >          flags =  IOMMUF_readable | IOMMUF_writable;
> >          break;
> 
> I think you need to take mmio_ro_ranges into account here.

Right, I've split p2m_mmio_direct into a separate case and added a check for
mmio_ro_ranges, this however forces all the callers of p2m_get_iommu_flags to
pass a second mfn parameter (it isn't that bad since the mfn is always
available in the call sites).

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-09 16:07   ` Jan Beulich
@ 2016-12-16 14:28     ` Roger Pau Monne
  2016-12-16 14:45       ` Roger Pau Monne
                         ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 14:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -656,6 +656,23 @@ affinities to prefer but be not limited to the specified node(s).
> >  
> >  Pin dom0 vcpus to their respective pcpus
> >  
> > +### dom0
> > +> `= List of [ hvm | shadow ]`
> > +
> > +> Sub-options:
> > +
> > +> `hvm`
> > +
> > +> Default: `false`
> > +
> > +Flag that makes a dom0 boot in PVHv2 mode.
> > +
> > +> `shadow`
> > +
> > +> Default: `false`
> > +
> > +Flag that makes a dom0 use shadow paging.
> 
> Would you mind marking dom0_shadow deprecated at once? In fact
> I wouldn't mind if it was removed from the documentation altogether,
> the more that it still has no description at all.

Sure, AFAICT it's just removing it from the documentation and a single usage
in construct_dom0_pv (the one in compute_dom0_nr_pages needs to stay for PVHv2
Dom0).

> > @@ -1655,6 +1653,28 @@ out:
> >      return rc;
> >  }
> >  
> > +static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
> > +                                     unsigned long image_headroom,
> > +                                     module_t *initrd,
> > +                                     void *(*bootstrap_map)(const module_t *),
> > +                                     char *cmdline)
> > +{
> > +
> > +    printk("** Building a PVH Dom0 **\n");
> 
> Why again is it that you call the function "hvm" but mean "pvh"?

This was to differentiate between the current "pvh" functions in this file that
refer to PVHv1. I could name them pvhv2, but IMHO hvm seems clearer and
shorter.

> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -187,6 +187,35 @@ static void __init parse_acpi_param(char *s)
> >      }
> >  }
> >  
> > +/*
> > + * List of parameters that affect Dom0 creation:
> > + *
> > + *  - hvm               Create a PVHv2 Dom0.
> > + *  - shadow            Use shadow paging for Dom0.
> > + */
> > +static bool __initdata dom0_hvm;
> > +static void __init parse_dom0_param(char *s)
> > +{
> > +    char *ss;
> > +
> > +    do {
> > +
> > +        ss = strchr(s, ',');
> > +        if ( ss )
> > +            *ss = '\0';
> > +
> > +        if ( !strcmp(s, "hvm") )
> > +            dom0_hvm = true;
> > +#ifdef CONFIG_SHADOW_PAGING
> > +        else if ( !strcmp(s, "shadow") )
> > +            opt_dom0_shadow = true;
> > +#endif
> > +
> > +        s = ss + 1;
> > +    } while ( ss );
> > +}
> > +custom_param("dom0", parse_dom0_param);
> 
> I continue to think that this should live in domain_build.c, and
> dom0_hvm be the one off variable which needs to be global. After
> all we intend to extend the "dom0=" quite a bit (presumably to
> subsume everything which the various "dom0..." options now do),
> and all that stuff lives there anyway.

opt_dom0_shadow is also needed by setup.c in order to pass the right DOMCRF
flags. I don't really mind putting it here or in setup.c, but ATM both options
will need to be exported one way or another.

> > --- a/xen/include/asm-x86/setup.h
> > +++ b/xen/include/asm-x86/setup.h
> > @@ -57,4 +57,10 @@ extern uint8_t kbd_shift_flags;
> >  extern unsigned long highmem_start;
> >  #endif
> >  
> > +#ifdef CONFIG_SHADOW_PAGING
> > +extern bool opt_dom0_shadow;
> > +#else
> > +#define opt_dom0_shadow 0
> 
> "false" please, to match up with "bool".

Fixed, thanks.

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 14:28     ` Roger Pau Monne
@ 2016-12-16 14:45       ` Roger Pau Monne
  2016-12-16 16:17         ` Jan Beulich
  2016-12-16 15:28       ` Roger Pau Monne
  2016-12-16 16:18       ` Jan Beulich
  2 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 14:45 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 16, 2016 at 02:28:54PM +0000, Roger Pau Monne wrote:
> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
> > >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > > @@ -1655,6 +1653,28 @@ out:
> > >      return rc;
> > >  }
> > >  
> > > +static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
> > > +                                     unsigned long image_headroom,
> > > +                                     module_t *initrd,
> > > +                                     void *(*bootstrap_map)(const module_t *),
> > > +                                     char *cmdline)
> > > +{
> > > +
> > > +    printk("** Building a PVH Dom0 **\n");
> > 
> > Why again is it that you call the function "hvm" but mean "pvh"?
> 
> This was to differentiate between the current "pvh" functions in this file that
> refer to PVHv1. I could name them pvhv2, but IMHO hvm seems clearer and
> shorter.

Oh, and the other reason was that Xen doesn't really know the difference
between a HVM guest and a PVHv2 guest, hence hvm felt more natural.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 14:28     ` Roger Pau Monne
  2016-12-16 14:45       ` Roger Pau Monne
@ 2016-12-16 15:28       ` Roger Pau Monne
  2016-12-16 16:15         ` Jan Beulich
  2016-12-16 16:18       ` Jan Beulich
  2 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 15:28 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 16, 2016 at 02:28:54PM +0000, Roger Pau Monne wrote:
> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
> > >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > > --- a/xen/arch/x86/setup.c
> > > +++ b/xen/arch/x86/setup.c
> > > @@ -187,6 +187,35 @@ static void __init parse_acpi_param(char *s)
> > >      }
> > >  }
> > >  
> > > +/*
> > > + * List of parameters that affect Dom0 creation:
> > > + *
> > > + *  - hvm               Create a PVHv2 Dom0.
> > > + *  - shadow            Use shadow paging for Dom0.
> > > + */
> > > +static bool __initdata dom0_hvm;
> > > +static void __init parse_dom0_param(char *s)
> > > +{
> > > +    char *ss;
> > > +
> > > +    do {
> > > +
> > > +        ss = strchr(s, ',');
> > > +        if ( ss )
> > > +            *ss = '\0';
> > > +
> > > +        if ( !strcmp(s, "hvm") )
> > > +            dom0_hvm = true;
> > > +#ifdef CONFIG_SHADOW_PAGING
> > > +        else if ( !strcmp(s, "shadow") )
> > > +            opt_dom0_shadow = true;
> > > +#endif
> > > +
> > > +        s = ss + 1;
> > > +    } while ( ss );
> > > +}
> > > +custom_param("dom0", parse_dom0_param);
> > 
> > I continue to think that this should live in domain_build.c, and
> > dom0_hvm be the one off variable which needs to be global. After
> > all we intend to extend the "dom0=" quite a bit (presumably to
> > subsume everything which the various "dom0..." options now do),
> > and all that stuff lives there anyway.

In fact opt_dom0_shadow is only going to be needed by setup.c after the removal
of it's usage by PV Dom0, so I think it's better to keep parse_dom0_param in
setup.c.

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 15:28       ` Roger Pau Monne
@ 2016-12-16 16:15         ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2016-12-16 16:15 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.12.16 at 16:28, <roger.pau@citrix.com> wrote:
> On Fri, Dec 16, 2016 at 02:28:54PM +0000, Roger Pau Monne wrote:
>> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
>> > >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> > > --- a/xen/arch/x86/setup.c
>> > > +++ b/xen/arch/x86/setup.c
>> > > @@ -187,6 +187,35 @@ static void __init parse_acpi_param(char *s)
>> > >      }
>> > >  }
>> > >  
>> > > +/*
>> > > + * List of parameters that affect Dom0 creation:
>> > > + *
>> > > + *  - hvm               Create a PVHv2 Dom0.
>> > > + *  - shadow            Use shadow paging for Dom0.
>> > > + */
>> > > +static bool __initdata dom0_hvm;
>> > > +static void __init parse_dom0_param(char *s)
>> > > +{
>> > > +    char *ss;
>> > > +
>> > > +    do {
>> > > +
>> > > +        ss = strchr(s, ',');
>> > > +        if ( ss )
>> > > +            *ss = '\0';
>> > > +
>> > > +        if ( !strcmp(s, "hvm") )
>> > > +            dom0_hvm = true;
>> > > +#ifdef CONFIG_SHADOW_PAGING
>> > > +        else if ( !strcmp(s, "shadow") )
>> > > +            opt_dom0_shadow = true;
>> > > +#endif
>> > > +
>> > > +        s = ss + 1;
>> > > +    } while ( ss );
>> > > +}
>> > > +custom_param("dom0", parse_dom0_param);
>> > 
>> > I continue to think that this should live in domain_build.c, and
>> > dom0_hvm be the one off variable which needs to be global. After
>> > all we intend to extend the "dom0=" quite a bit (presumably to
>> > subsume everything which the various "dom0..." options now do),
>> > and all that stuff lives there anyway.
> 
> In fact opt_dom0_shadow is only going to be needed by setup.c after the removal
> of it's usage by PV Dom0, so I think it's better to keep parse_dom0_param in
> setup.c.

No. I thought I had made this clear by a comment to a later patch
(where dom0_hvm gets made non-static).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 14:45       ` Roger Pau Monne
@ 2016-12-16 16:17         ` Jan Beulich
  2016-12-16 17:57           ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-16 16:17 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.12.16 at 15:45, <roger.pau@citrix.com> wrote:
> On Fri, Dec 16, 2016 at 02:28:54PM +0000, Roger Pau Monne wrote:
>> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
>> > >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> > > @@ -1655,6 +1653,28 @@ out:
>> > >      return rc;
>> > >  }
>> > >  
>> > > +static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
>> > > +                                     unsigned long image_headroom,
>> > > +                                     module_t *initrd,
>> > > +                                     void *(*bootstrap_map)(const module_t *),
>> > > +                                     char *cmdline)
>> > > +{
>> > > +
>> > > +    printk("** Building a PVH Dom0 **\n");
>> > 
>> > Why again is it that you call the function "hvm" but mean "pvh"?
>> 
>> This was to differentiate between the current "pvh" functions in this file that
>> refer to PVHv1. I could name them pvhv2, but IMHO hvm seems clearer and
>> shorter.
> 
> Oh, and the other reason was that Xen doesn't really know the difference
> between a HVM guest and a PVHv2 guest, hence hvm felt more natural.

Xen certainly can tell the difference for Dom0, since a true HVM
Dom0 can't exist without a lot of work towards getting a device
model run somewhere to service it. I continue to think that "hvm"
in any of the names involved in this series is misleading.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 14:28     ` Roger Pau Monne
  2016-12-16 14:45       ` Roger Pau Monne
  2016-12-16 15:28       ` Roger Pau Monne
@ 2016-12-16 16:18       ` Jan Beulich
  2016-12-16 17:42         ` Roger Pau Monne
  2 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-16 16:18 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.12.16 at 15:28, <roger.pau@citrix.com> wrote:
> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
>> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> > --- a/docs/misc/xen-command-line.markdown
>> > +++ b/docs/misc/xen-command-line.markdown
>> > @@ -656,6 +656,23 @@ affinities to prefer but be not limited to the specified node(s).
>> >  
>> >  Pin dom0 vcpus to their respective pcpus
>> >  
>> > +### dom0
>> > +> `= List of [ hvm | shadow ]`
>> > +
>> > +> Sub-options:
>> > +
>> > +> `hvm`
>> > +
>> > +> Default: `false`
>> > +
>> > +Flag that makes a dom0 boot in PVHv2 mode.
>> > +
>> > +> `shadow`
>> > +
>> > +> Default: `false`
>> > +
>> > +Flag that makes a dom0 use shadow paging.
>> 
>> Would you mind marking dom0_shadow deprecated at once? In fact
>> I wouldn't mind if it was removed from the documentation altogether,
>> the more that it still has no description at all.
> 
> Sure, AFAICT it's just removing it from the documentation and a single usage
> in construct_dom0_pv (the one in compute_dom0_nr_pages needs to stay for PVHv2
> Dom0).

Just to avoid any misunderstanding: I talked about documentation
only, not about removing anything from code.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-12-09 16:48   ` Jan Beulich
@ 2016-12-16 17:38     ` Roger Pau Monne
  2016-12-19  7:48       ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 17:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 09, 2016 at 09:48:32AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > @@ -302,7 +307,8 @@ static unsigned long __init compute_dom0_nr_pages(
> >              avail -= max_pdx >> s;
> >      }
> >  
> > -    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
> > +    need_paging = opt_dom0_shadow || (has_hvm_container_domain(d) &&
> > +                  (!iommu_hap_pt_share || !paging_mode_hap(d)));
> 
> Indentation.

Fixed.

> > @@ -545,11 +552,12 @@ static __init void pvh_map_all_iomem(struct domain *d, unsigned long nr_pages)
> >      ASSERT(nr_holes == 0);
> >  }
> >  
> > -static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> > +static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
> 
> Why?

So that afterwards I can remove all the pvh_ functions and leave the hvm_ ones
only. But seeing your response to the other patch, would you prefer me to just
use pvh_ for the new functions also?

> > @@ -577,8 +585,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >              continue;
> >          }
> >  
> > -        *entry_guest = *entry;
> > -        pages = PFN_UP(entry_guest->size);
> > +        /*
> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
> > +         * that's the minimum granularity of the 2nd stage translation.
> > +         */
> > +        start = ROUNDUP(entry->addr, PAGE_SIZE);
> > +        end = (entry->addr + entry->size) & PAGE_MASK;
> 
> Taking the comment into consideration, I wonder whether you
> wouldn't better use PAGE_ORDER_4K here, as that's what the
> p2m code uses.

That's going to be more cumbersome, since PAGE_SIZE would become 1UL <<
PAGE_ORDER_4K << PAGE_SHIFT, and PAGE_MASK is going to be -1 and ~ of the
previous construct. But I see your point, maybe I should define PAGE_SIZE_4K
and PAGE_MASK_4K in xen/include/asm-x86/page.h?

> > @@ -1010,8 +1029,6 @@ static int __init construct_dom0_pv(
> >      BUG_ON(d->vcpu[0] == NULL);
> >      BUG_ON(v->is_initialised);
> >  
> > -    process_pending_softirqs();
> 
> Wouldn't this adjustment better fit into the previous patch, together
> with its companion below?

Yes, I guess I must have forgotten to move this.

> > +static int __init hvm_steal_ram(struct domain *d, unsigned long size,
> > +                                paddr_t limit, paddr_t *addr)
> > +{
> > +    unsigned int i = d->arch.nr_e820;
> > +
> > +    while ( i-- )
> > +    {
> > +        struct e820entry *entry = &d->arch.e820[i];
> > +
> > +        if ( entry->type != E820_RAM || entry->size < size )
> > +            continue;
> > +
> > +        /* Subtract from the beginning. */
> > +        if ( entry->addr + size < limit && entry->addr >= MB(1) )
> 
> <= I think (for the left comparison)?

Yes.

> > +static void __init hvm_steal_low_ram(struct domain *d, unsigned long start,
> > +                                     unsigned long nr_pages)
> > +{
> > +    unsigned long mfn;
> > +
> > +    ASSERT(start + nr_pages < PFN_DOWN(MB(1)));
> 
> <= again I think.

Right.

> > +static int __init hvm_setup_p2m(struct domain *d)
> > +{
> > +    struct vcpu *v = d->vcpu[0];
> > +    unsigned long nr_pages;
> > +    unsigned int i;
> > +    int rc;
> > +    bool preempted;
> > +#define MB1_PAGES PFN_DOWN(MB(1))
> > +
> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> > +
> > +    hvm_setup_e820(d, nr_pages);
> > +    do {
> > +        preempted = false;
> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> > +                              &preempted);
> > +        process_pending_softirqs();
> > +    } while ( preempted );
> > +
> > +    /*
> > +     * Memory below 1MB is identity mapped.
> > +     * NB: this only makes sense when booted from legacy BIOS.
> > +     */
> > +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
> > +    if ( rc )
> > +    {
> > +        printk("Failed to identity map low 1MB: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    /* Populate memory map. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        unsigned long addr, size;
> > +
> > +        if ( d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        addr = PFN_DOWN(d->arch.e820[i].addr);
> > +        size = PFN_DOWN(d->arch.e820[i].size);
> > +
> > +        if ( addr >= MB1_PAGES )
> > +            rc = hvm_populate_memory_range(d, addr, size);
> > +        else if ( addr + size > MB1_PAGES )
> > +        {
> > +            hvm_steal_low_ram(d, addr, MB1_PAGES - addr);
> > +            rc = hvm_populate_memory_range(d, MB1_PAGES,
> > +                                           size - (MB1_PAGES - addr));
> 
> Is this case possible at all? All x86 systems have some form of
> BIOS right below the 1Mb boundary, and the E820 map for
> Dom0 is being derived from the host one.

Heh, I don't think so but I wanted to cover all possible inputs. TBH I have no
idea how broken e820 memory maps can really be.

Would you be fine with removing this case and adding an ASSERT instead?

> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -475,6 +475,43 @@ void share_xen_page_with_guest(
> >      spin_unlock(&d->page_alloc_lock);
> >  }
> >  
> > +int unshare_xen_page_with_guest(struct page_info *page, struct domain *d)
> 
> __init
> 
> And once its __init, it may be possible to simplify it, as you don't need
> to fear races anymore. E.g. you wouldn't need a loop over cmpxchg().

Indeed.

> > +{
> > +    unsigned long y, x;
> > +    bool drop_dom_ref;
> > +
> > +    if ( page_get_owner(page) != d || !(page->count_info & PGC_xen_heap) )
> 
> Please don't open code is_xen_heap_page().

Right, I'm not very knowledgeable of the mm functions yet.

> > +        return -EINVAL;
> > +
> > +    spin_lock(&d->page_alloc_lock);
> > +
> > +    /* Drop the page reference so we can chanfge the owner. */
> > +    y = page->count_info;
> > +    do {
> > +        x = y;
> > +        if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
> > +        {
> > +            spin_unlock(&d->page_alloc_lock);
> > +            return -EINVAL;
> > +        }
> > +        y = cmpxchg(&page->count_info, x, PGC_xen_heap);
> > +    } while ( y != x );
> > +
> > +    /* Remove the page form the list of domain pages. */
> > +    page_list_del(page, &d->xenpage_list);
> > +    drop_dom_ref = (--d->xenheap_pages == 0);
> 
> Aren't you open coding
> 
>     if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
>         put_page(page);
> 
> here (except for the check on the initial value, which could be
> moved up)?

Yes, that's right.

> > +    /* Remove the owner and clear the flags. */
> > +    page_set_owner(page, NULL);
> > +    page->u.inuse.type_info = 0;
> 
> I think you'd better bail early if this is non-zero. Or else please use
> the order used elsewhere (clearing type info, then owner) - while
> it's benign, it avoids someone later wondering whether the order
> is wrong in either place.

It's certainly going to be non-zero, because share_xen_page_with_guests sets it
to:

page->u.inuse.type_info  = (readonly ? PGT_none : PGT_writable_page);
page->u.inuse.type_info |= PGT_validated | 1;

I've ended up coding it as:

int __init unshare_xen_page_with_guest(struct page_info *page,
                                       struct domain *d)
{
    if ( page_get_owner(page) != d || !is_xen_heap_page(page) )
        return -EINVAL;

    if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
        put_page(page);

    /* Remove the owner and clear the flags. */
    page->u.inuse.type_info = 0;
    page_set_owner(page, NULL);

    return 0;
}

(note that put_page does indeed use a cmpxchg, but the benefits of not open
coding it far outweighs the penalty of using cmpxchg IMHO).

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 16:18       ` Jan Beulich
@ 2016-12-16 17:42         ` Roger Pau Monne
  2016-12-19  7:41           ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 17:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 16, 2016 at 09:18:13AM -0700, Jan Beulich wrote:
> >>> On 16.12.16 at 15:28, <roger.pau@citrix.com> wrote:
> > On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
> >> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> >> > --- a/docs/misc/xen-command-line.markdown
> >> > +++ b/docs/misc/xen-command-line.markdown
> >> > @@ -656,6 +656,23 @@ affinities to prefer but be not limited to the specified node(s).
> >> >  
> >> >  Pin dom0 vcpus to their respective pcpus
> >> >  
> >> > +### dom0
> >> > +> `= List of [ hvm | shadow ]`
> >> > +
> >> > +> Sub-options:
> >> > +
> >> > +> `hvm`
> >> > +
> >> > +> Default: `false`
> >> > +
> >> > +Flag that makes a dom0 boot in PVHv2 mode.
> >> > +
> >> > +> `shadow`
> >> > +
> >> > +> Default: `false`
> >> > +
> >> > +Flag that makes a dom0 use shadow paging.
> >> 
> >> Would you mind marking dom0_shadow deprecated at once? In fact
> >> I wouldn't mind if it was removed from the documentation altogether,
> >> the more that it still has no description at all.
> > 
> > Sure, AFAICT it's just removing it from the documentation and a single usage
> > in construct_dom0_pv (the one in compute_dom0_nr_pages needs to stay for PVHv2
> > Dom0).
> 
> Just to avoid any misunderstanding: I talked about documentation
> only, not about removing anything from code.

So, you just want to remove the documentation line about dom0_shadow and leave
the option there?

It seems kind of pointless to me, the more that a) it's not going to be
documented and b) AFAIK it's not working.

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 16:17         ` Jan Beulich
@ 2016-12-16 17:57           ` Roger Pau Monne
  2016-12-19  7:42             ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-16 17:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 16, 2016 at 09:17:01AM -0700, Jan Beulich wrote:
> >>> On 16.12.16 at 15:45, <roger.pau@citrix.com> wrote:
> > On Fri, Dec 16, 2016 at 02:28:54PM +0000, Roger Pau Monne wrote:
> >> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
> >> > >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> >> > > @@ -1655,6 +1653,28 @@ out:
> >> > >      return rc;
> >> > >  }
> >> > >  
> >> > > +static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
> >> > > +                                     unsigned long image_headroom,
> >> > > +                                     module_t *initrd,
> >> > > +                                     void *(*bootstrap_map)(const module_t *),
> >> > > +                                     char *cmdline)
> >> > > +{
> >> > > +
> >> > > +    printk("** Building a PVH Dom0 **\n");
> >> > 
> >> > Why again is it that you call the function "hvm" but mean "pvh"?
> >> 
> >> This was to differentiate between the current "pvh" functions in this file that
> >> refer to PVHv1. I could name them pvhv2, but IMHO hvm seems clearer and
> >> shorter.
> > 
> > Oh, and the other reason was that Xen doesn't really know the difference
> > between a HVM guest and a PVHv2 guest, hence hvm felt more natural.
> 
> Xen certainly can tell the difference for Dom0, since a true HVM
> Dom0 can't exist without a lot of work towards getting a device
> model run somewhere to service it. I continue to think that "hvm"
> in any of the names involved in this series is misleading.

Yes, but that device model isn't a Xen-kernel component, and would be attached
using the ioreq server machinery anyway, so IMHO I still think this is more
like HVM rather than anything else from a Xen PoV. And if we ever got something
like a complete HVM Dom0 with a device model it would certainly use this
machinery.

In any case, I don't want to start a bikeshed over it, so please tell me your
preference for either pvh_ or pvh2 or pvhv2_ prefix for these new functions.

Thanks, Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 17:42         ` Roger Pau Monne
@ 2016-12-19  7:41           ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2016-12-19  7:41 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.12.16 at 18:42, <roger.pau@citrix.com> wrote:
> On Fri, Dec 16, 2016 at 09:18:13AM -0700, Jan Beulich wrote:
>> >>> On 16.12.16 at 15:28, <roger.pau@citrix.com> wrote:
>> > On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
>> >> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> >> > --- a/docs/misc/xen-command-line.markdown
>> >> > +++ b/docs/misc/xen-command-line.markdown
>> >> > @@ -656,6 +656,23 @@ affinities to prefer but be not limited to the 
> specified node(s).
>> >> >  
>> >> >  Pin dom0 vcpus to their respective pcpus
>> >> >  
>> >> > +### dom0
>> >> > +> `= List of [ hvm | shadow ]`
>> >> > +
>> >> > +> Sub-options:
>> >> > +
>> >> > +> `hvm`
>> >> > +
>> >> > +> Default: `false`
>> >> > +
>> >> > +Flag that makes a dom0 boot in PVHv2 mode.
>> >> > +
>> >> > +> `shadow`
>> >> > +
>> >> > +> Default: `false`
>> >> > +
>> >> > +Flag that makes a dom0 use shadow paging.
>> >> 
>> >> Would you mind marking dom0_shadow deprecated at once? In fact
>> >> I wouldn't mind if it was removed from the documentation altogether,
>> >> the more that it still has no description at all.
>> > 
>> > Sure, AFAICT it's just removing it from the documentation and a single usage
>> > in construct_dom0_pv (the one in compute_dom0_nr_pages needs to stay for PVHv2
>> > Dom0).
>> 
>> Just to avoid any misunderstanding: I talked about documentation
>> only, not about removing anything from code.
> 
> So, you just want to remove the documentation line about dom0_shadow and leave
> the option there?
> 
> It seems kind of pointless to me, the more that a) it's not going to be
> documented and b) AFAIK it's not working.

Well, if it's _provably_ not working, then of course it could be
removed altogether.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2
  2016-12-16 17:57           ` Roger Pau Monne
@ 2016-12-19  7:42             ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2016-12-19  7:42 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.12.16 at 18:57, <roger.pau@citrix.com> wrote:
> On Fri, Dec 16, 2016 at 09:17:01AM -0700, Jan Beulich wrote:
>> >>> On 16.12.16 at 15:45, <roger.pau@citrix.com> wrote:
>> > On Fri, Dec 16, 2016 at 02:28:54PM +0000, Roger Pau Monne wrote:
>> >> On Fri, Dec 09, 2016 at 09:07:16AM -0700, Jan Beulich wrote:
>> >> > >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> >> > > @@ -1655,6 +1653,28 @@ out:
>> >> > >      return rc;
>> >> > >  }
>> >> > >  
>> >> > > +static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
>> >> > > +                                     unsigned long image_headroom,
>> >> > > +                                     module_t *initrd,
>> >> > > +                                     void *(*bootstrap_map)(const module_t *),
>> >> > > +                                     char *cmdline)
>> >> > > +{
>> >> > > +
>> >> > > +    printk("** Building a PVH Dom0 **\n");
>> >> > 
>> >> > Why again is it that you call the function "hvm" but mean "pvh"?
>> >> 
>> >> This was to differentiate between the current "pvh" functions in this file that
>> >> refer to PVHv1. I could name them pvhv2, but IMHO hvm seems clearer and
>> >> shorter.
>> > 
>> > Oh, and the other reason was that Xen doesn't really know the difference
>> > between a HVM guest and a PVHv2 guest, hence hvm felt more natural.
>> 
>> Xen certainly can tell the difference for Dom0, since a true HVM
>> Dom0 can't exist without a lot of work towards getting a device
>> model run somewhere to service it. I continue to think that "hvm"
>> in any of the names involved in this series is misleading.
> 
> Yes, but that device model isn't a Xen-kernel component, and would be attached
> using the ioreq server machinery anyway, so IMHO I still think this is more
> like HVM rather than anything else from a Xen PoV. And if we ever got something
> like a complete HVM Dom0 with a device model it would certainly use this
> machinery.
> 
> In any case, I don't want to start a bikeshed over it, so please tell me your
> preference for either pvh_ or pvh2 or pvhv2_ prefix for these new functions.

pvh please.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-12-16 17:38     ` Roger Pau Monne
@ 2016-12-19  7:48       ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2016-12-19  7:48 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.12.16 at 18:38, <roger.pau@citrix.com> wrote:
> On Fri, Dec 09, 2016 at 09:48:32AM -0700, Jan Beulich wrote:
>> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> > @@ -545,11 +552,12 @@ static __init void pvh_map_all_iomem(struct domain *d, unsigned long nr_pages)
>> >      ASSERT(nr_holes == 0);
>> >  }
>> >  
>> > -static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>> > +static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
>> 
>> Why?
> 
> So that afterwards I can remove all the pvh_ functions and leave the hvm_ ones
> only. But seeing your response to the other patch, would you prefer me to just
> use pvh_ for the new functions also?

Yes. After all the intention is to rip out all PVHv1 stuff.

>> > @@ -577,8 +585,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>> >              continue;
>> >          }
>> >  
>> > -        *entry_guest = *entry;
>> > -        pages = PFN_UP(entry_guest->size);
>> > +        /*
>> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
>> > +         * that's the minimum granularity of the 2nd stage translation.
>> > +         */
>> > +        start = ROUNDUP(entry->addr, PAGE_SIZE);
>> > +        end = (entry->addr + entry->size) & PAGE_MASK;
>> 
>> Taking the comment into consideration, I wonder whether you
>> wouldn't better use PAGE_ORDER_4K here, as that's what the
>> p2m code uses.
> 
> That's going to be more cumbersome, since PAGE_SIZE would become 1UL <<
> PAGE_ORDER_4K << PAGE_SHIFT, and PAGE_MASK is going to be -1 and ~ of the
> previous construct. But I see your point, maybe I should define PAGE_SIZE_4K
> and PAGE_MASK_4K in xen/include/asm-x86/page.h?

That's an option, but considering the p2m code has got along
without it so far, I'm not fully convinced we need it. Perhaps
get an opinion from George (as the x86/mm maintainer).

>> > +static int __init hvm_setup_p2m(struct domain *d)
>> > +{
>> > +    struct vcpu *v = d->vcpu[0];
>> > +    unsigned long nr_pages;
>> > +    unsigned int i;
>> > +    int rc;
>> > +    bool preempted;
>> > +#define MB1_PAGES PFN_DOWN(MB(1))
>> > +
>> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
>> > +
>> > +    hvm_setup_e820(d, nr_pages);
>> > +    do {
>> > +        preempted = false;
>> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
>> > +                              &preempted);
>> > +        process_pending_softirqs();
>> > +    } while ( preempted );
>> > +
>> > +    /*
>> > +     * Memory below 1MB is identity mapped.
>> > +     * NB: this only makes sense when booted from legacy BIOS.
>> > +     */
>> > +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
>> > +    if ( rc )
>> > +    {
>> > +        printk("Failed to identity map low 1MB: %d\n", rc);
>> > +        return rc;
>> > +    }
>> > +
>> > +    /* Populate memory map. */
>> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> > +    {
>> > +        unsigned long addr, size;
>> > +
>> > +        if ( d->arch.e820[i].type != E820_RAM )
>> > +            continue;
>> > +
>> > +        addr = PFN_DOWN(d->arch.e820[i].addr);
>> > +        size = PFN_DOWN(d->arch.e820[i].size);
>> > +
>> > +        if ( addr >= MB1_PAGES )
>> > +            rc = hvm_populate_memory_range(d, addr, size);
>> > +        else if ( addr + size > MB1_PAGES )
>> > +        {
>> > +            hvm_steal_low_ram(d, addr, MB1_PAGES - addr);
>> > +            rc = hvm_populate_memory_range(d, MB1_PAGES,
>> > +                                           size - (MB1_PAGES - addr));
>> 
>> Is this case possible at all? All x86 systems have some form of
>> BIOS right below the 1Mb boundary, and the E820 map for
>> Dom0 is being derived from the host one.
> 
> Heh, I don't think so but I wanted to cover all possible inputs. TBH I have no
> idea how broken e820 memory maps can really be.
> 
> Would you be fine with removing this case and adding an ASSERT instead?

Yes; in fact that would be my preference.

>> > +    /* Remove the owner and clear the flags. */
>> > +    page_set_owner(page, NULL);
>> > +    page->u.inuse.type_info = 0;
>> 
>> I think you'd better bail early if this is non-zero. Or else please use
>> the order used elsewhere (clearing type info, then owner) - while
>> it's benign, it avoids someone later wondering whether the order
>> is wrong in either place.
> 
> It's certainly going to be non-zero, because share_xen_page_with_guests sets it
> to:
> 
> page->u.inuse.type_info  = (readonly ? PGT_none : PGT_writable_page);
> page->u.inuse.type_info |= PGT_validated | 1;
> 
> I've ended up coding it as:
> 
> int __init unshare_xen_page_with_guest(struct page_info *page,
>                                        struct domain *d)
> {
>     if ( page_get_owner(page) != d || !is_xen_heap_page(page) )
>         return -EINVAL;
> 
>     if ( test_and_clear_bit(_PGC_allocated, &page->count_info) )
>         put_page(page);
> 
>     /* Remove the owner and clear the flags. */
>     page->u.inuse.type_info = 0;
>     page_set_owner(page, NULL);
> 
>     return 0;
> }

This looks good, thanks.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2
  2016-12-09 17:05   ` Jan Beulich
@ 2016-12-20 17:34     ` Roger Pau Monne
  0 siblings, 0 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-20 17:34 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Dec 09, 2016 at 10:05:18AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > @@ -1930,12 +1931,148 @@ static int __init hvm_setup_p2m(struct domain *d)
> >  #undef MB1_PAGES
> >  }
> >  
> > +static int __init hvm_copy_to_phys(struct domain *d, paddr_t paddr, void *buf,
> > +                                   int size)
> 
> I guess you made size plain int because hvm_copy_to_guest_phys()
> has it that way, but please let's not spread such bogus things - sizes
> can't possibly be negative.
> 
> > +{
> > +    struct vcpu *saved_current;
> > +    int rc;
> > +
> > +    saved_current = current;
> > +    set_current(d->vcpu[0]);
> > +    rc = hvm_copy_to_guest_phys(paddr, buf, size);
> > +    set_current(saved_current);
> 
> I continue to be uncertain about the behavior of this if something
> inside hvm_copy_to_guest_phys() goes wrong: Did you either
> statically analyze the code or try in practice out whether the
> playing with current makes understanding the crash output any
> harder?

If you managed to somehow call hvm_copy_to_guest_phys with the idle vcpu as
current you would get this kind of error, which I admin is maybe not that
obvious (apart from the IDLEv0 prefix).

(XEN) IDLEv0 Error pfn 21bd: rd=32767 od=32756 caf=180000000000000 taf=0000000000000000

See below.

> While there's going to be some work involved with it, I do think
> that the use here might be a reason for the whole hvm_copy()
> machinery to gain a struct vcpu* parameter.

I've gone that route and added a new param to __hvm_copy, and also introduced
hvm_copy_to_guest_phys_vcpu which takes an additional vcpu parameter. While
there I've also added an assert to __hvm_copy in order to make sure the
vcpu parameter is always a hvm/pvh vcpu.

> > +static int __init hvm_load_kernel(struct domain *d, const module_t *image,
> > +                                  unsigned long image_headroom,
> > +                                  module_t *initrd, char *image_base,
> > +                                  char *cmdline, paddr_t *entry,
> > +                                  paddr_t *start_info_addr)
> > +{
> > +    char *image_start = image_base + image_headroom;
> > +    unsigned long image_len = image->mod_end;
> > +    struct elf_binary elf;
> > +    struct elf_dom_parms parms;
> > +    paddr_t last_addr;
> > +    struct hvm_start_info start_info;
> > +    struct hvm_modlist_entry mod;
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    int rc;
> > +
> > +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> > +    {
> > +        printk("Error trying to detect bz compressed kernel\n");
> > +        return rc;
> > +    }
> > +
> > +    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> > +    {
> > +        printk("Unable to init ELF\n");
> > +        return rc;
> > +    }
> > +#ifdef VERBOSE
> > +    elf_set_verbose(&elf);
> > +#endif
> > +    elf_parse_binary(&elf);
> > +    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
> > +    {
> > +        printk("Unable to parse kernel for ELFNOTES\n");
> > +        return rc;
> > +    }
> > +
> > +    if ( parms.phys_entry == UNSET_ADDR32 ) {
> > +        printk("Unable to find XEN_ELFNOTE_PHYS32_ENTRY address\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
> > +           parms.guest_ver, parms.loader,
> > +           elf_64bit(&elf) ? "64-bit" : "32-bit");
> > +
> > +    /* Copy the OS image and free temporary buffer. */
> > +    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
> > +    elf.dest_size = parms.virt_kend - parms.virt_kstart;
> > +
> > +    saved_current = current;
> > +    set_current(v);
> > +    rc = elf_load_binary(&elf);
> > +    set_current(saved_current);
> 
> Same reservations as above.

Right, this one however is more tricky to fix since elf_load_binary is shared
with libxc, so adding a vcpu/domain parameter here is problematic for the
toolstack side. That's quite similar to what happens on classic PV Dom0
creation, we need to switch to Dom0 page tables. I'm not trying to use that to
justify that this is the best way, but everything else seems quite convoluted
(either adding a new param to elf_load_binary or a new field to struct
elf_binary in order to store the domain/vcpu).

> > +    if ( rc < 0 )
> > +    {
> > +        printk("Failed to load kernel: %d\n", rc);
> > +        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
> > +        return rc;
> > +    }
> > +
> > +    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> > +
> > +    if ( initrd != NULL )
> > +    {
> > +        rc = hvm_copy_to_phys(d, last_addr, mfn_to_virt(initrd->mod_start),
> > +                              initrd->mod_end);
> > +        if ( rc )
> > +        {
> > +            printk("Unable to copy initrd to guest\n");
> > +            return rc;
> > +        }
> > +
> > +        mod.paddr = last_addr;
> > +        mod.size = initrd->mod_end;
> > +        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
> > +    }
> 
> mod is left uninitialized in the else case afaict - I don't think all
> compilers we support (plus Coverity) can spot the common
> dependency on initrd != NULL.

Clang doesn't seem to complain, but I will add an initialized to be sure.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-12 13:56   ` Jan Beulich
@ 2016-12-21 16:32     ` Roger Pau Monne
  2016-12-21 17:04       ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-21 16:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Mon, Dec 12, 2016 at 06:56:36AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> > @@ -567,7 +573,7 @@ static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
> >      /*
> >       * Craft the e820 memory map for Dom0 based on the hardware e820 map.
> >       */
> > -    d->arch.e820 = xzalloc_array(struct e820entry, e820.nr_map);
> > +    d->arch.e820 = xzalloc_array(struct e820entry, E820MAX);
> 
> I have to admit that I consider this both wasteful and inflexible.
> While commonly you'll need far less entries, what if in fact you
> need more?

I've updated pvh_add_mem_range so that now it expands the memory map when
needed, in order to add new regions. This solves both the waste and the
inflexibility:

map = xzalloc_array(struct e820entry, d->arch.nr_e820 + 1);
if ( !map )
{
    printk(XENLOG_WARNING "E820: out of memory to add region\n");
    return -ENOMEM;
}

memcpy(map, d->arch.e820, i * sizeof(*d->arch.e820));
memcpy(map + i + 1, d->arch.e820 + i,
       (d->arch.nr_e820 - i) * sizeof(*d->arch.e820));
map[i].addr = s;
map[i].size = e - s;
map[i].type = type;
xfree(d->arch.e820);
d->arch.e820 = map;
d->arch.nr_e820++;

> > +static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
> > +{
> > +    struct acpi_table_madt *madt;
> > +    struct acpi_table_header *table;
> > +    struct acpi_madt_io_apic *io_apic;
> > +    struct acpi_madt_local_apic *local_apic;
> 
> I think I had asked about the absence of struct acpi_madt_local_x2apic
> here before, but now that I look again I also wonder how you get away
> without struct acpi_madt_nmi_source and struct acpi_madt_local_apic_nmi;

Since we are currently limited to 128 vCPUs (max APIC ID 255), I don't think
acpi_madt_local_x2apic is required in this scenario:

https://lists.xen.org/archives/html/xen-devel/2016-11/msg02815.html

I will work on adding those entries once that limit is lifted, would you be
fine with me adding a comment here regarding the no-need of
acpi_madt_local_x2apic until support for > 128vCPUs is added?

Regarding the local/IO APIC NMI structures, won't those NMI's be delivered to
Xen, or are those then re-injected to the guest?

> I can kind of see struct acpi_madt_local_apic_override not being

Since Xen is the one that sets the local APIC address in the MADT, and it
always matches the position of the emulated local APIC I don't see why we would
want to use acpi_madt_local_apic_override. I see that your worries are related
to what would happen if AML tries to modify the MADT, but wouldn't in that case
modify the native MADT, instead of the crafted one?

> needed here (and I'd really hope anyway that no BIOS actually makes
> use of it). The question mainly is what possible damage there may be
> if what Dom0 gets to see disagrees with what the firmware acts on
> (remember that AML code may alter MADT contents).
> 
> > +    acpi_status status;
> > +    unsigned long size;
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    /* Count number of interrupt overrides in the MADT. */
> > +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE,
> > +                          acpi_count_intr_ovr, MAX_IRQ_SOURCES);
> 
> What's the reason for using MAX_IRQ_SOURCES here? There's no
> static array involved here limiting the supported count.

Right, this can be INT_MAX, or UINT_MAX if I change the type of
acpi_intr_overrrides.

> > +    madt->address = APIC_DEFAULT_PHYS_BASE;
> > +    /*
> > +     * NB: this is currently set to 4, which is the revision in the ACPI
> > +     * spec 6.1. Sadly ACPICA doesn't provide revision numbers for the
> > +     * tables described in the headers.
> > +     */
> > +    madt->header.revision = 4;
> 
> I can see you wanting to cap the revision here, but is it safe to
> possibly bump it from what firmware has?

Hm, we only use the headers and the interrupt overrides as-is from the original
table, the rest comes from the acpica structs. So I assume it's better to be
safe than sorry, and a min should be used here together with the original table
version.

> > +    /*
> > +     * Setup the IO APIC entry.
> > +     * FIXME: the current vIO-APIC code just supports one IO-APIC instance
> > +     * per domain. This must be fixed in order to provide the same amount of
> > +     * IO APICs as available on bare metal.
> > +     */
> > +    if ( nr_ioapics > 1 )
> > +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> > +               nr_ioapics);
> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> > +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> > +    io_apic->header.length = sizeof(*io_apic);
> > +    io_apic->id = 1;
> 
> Is it safe to use an ID other than what firmware did assign? Where
> is this 1 coming from? And how does this get sync-ed up with the
> respective struct hvm_hw_vioapic field?

Not sure why I've set this to 1, AFAICT the vioapic code sets this to 0 on
reset, so it should be 0 here (until we have proper support in order to
completely reproduce the native IO APIC topology).

[ ... fixed all the comments below ..]

Thanks for the detailed review!

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-21 16:32     ` Roger Pau Monne
@ 2016-12-21 17:04       ` Jan Beulich
  2016-12-22 10:43         ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-21 17:04 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 21.12.16 at 17:32, <roger.pau@citrix.com> wrote:
> On Mon, Dec 12, 2016 at 06:56:36AM -0700, Jan Beulich wrote:
>> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
>> > +static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
>> > +{
>> > +    struct acpi_table_madt *madt;
>> > +    struct acpi_table_header *table;
>> > +    struct acpi_madt_io_apic *io_apic;
>> > +    struct acpi_madt_local_apic *local_apic;
>> 
>> I think I had asked about the absence of struct acpi_madt_local_x2apic
>> here before, but now that I look again I also wonder how you get away
>> without struct acpi_madt_nmi_source and struct acpi_madt_local_apic_nmi;
> 
> Since we are currently limited to 128 vCPUs (max APIC ID 255), I don't think
> acpi_madt_local_x2apic is required in this scenario:
> 
> https://lists.xen.org/archives/html/xen-devel/2016-11/msg02815.html 
> 
> I will work on adding those entries once that limit is lifted, would you be
> fine with me adding a comment here regarding the no-need of
> acpi_madt_local_x2apic until support for > 128vCPUs is added?

I'm not convinced these table entries are tied to >255 CPUs - I'm
seeing them on systems with far less. Hence I simply wonder what
functionality we may miss to offer to OSes with these tables
absent.

> Regarding the local/IO APIC NMI structures, won't those NMI's be delivered to
> Xen, or are those then re-injected to the guest?

PV Dom0 may get to see NMIs, so I'd expect PVH to behave
similarly.

>> I can kind of see struct acpi_madt_local_apic_override not being
> 
> Since Xen is the one that sets the local APIC address in the MADT, and it
> always matches the position of the emulated local APIC I don't see why we would
> want to use acpi_madt_local_apic_override. I see that your worries are related
> to what would happen if AML tries to modify the MADT, but wouldn't in that case
> modify the native MADT, instead of the crafted one?

Exactly, so how would you see the modification to get
propagated?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-21 17:04       ` Jan Beulich
@ 2016-12-22 10:43         ` Roger Pau Monne
  2016-12-22 11:03           ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-22 10:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Dec 21, 2016 at 10:04:20AM -0700, Jan Beulich wrote:
> >>> On 21.12.16 at 17:32, <roger.pau@citrix.com> wrote:
> > On Mon, Dec 12, 2016 at 06:56:36AM -0700, Jan Beulich wrote:
> >> >>> On 30.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> >> > +static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
> >> > +{
> >> > +    struct acpi_table_madt *madt;
> >> > +    struct acpi_table_header *table;
> >> > +    struct acpi_madt_io_apic *io_apic;
> >> > +    struct acpi_madt_local_apic *local_apic;
> >> 
> >> I think I had asked about the absence of struct acpi_madt_local_x2apic
> >> here before, but now that I look again I also wonder how you get away
> >> without struct acpi_madt_nmi_source and struct acpi_madt_local_apic_nmi;
> > 
> > Since we are currently limited to 128 vCPUs (max APIC ID 255), I don't think
> > acpi_madt_local_x2apic is required in this scenario:
> > 
> > https://lists.xen.org/archives/html/xen-devel/2016-11/msg02815.html 
> > 
> > I will work on adding those entries once that limit is lifted, would you be
> > fine with me adding a comment here regarding the no-need of
> > acpi_madt_local_x2apic until support for > 128vCPUs is added?
> 
> I'm not convinced these table entries are tied to >255 CPUs - I'm
> seeing them on systems with far less. Hence I simply wonder what
> functionality we may miss to offer to OSes with these tables
> absent.

From the ACPI spec, both entries looks very similar, with the exception that
the x2APIC entry has a bigger size (from 1byte to 4bytes) for both the
processor and the APIC ID.

I don't have a system with x2APIC entries at hand, but I guess the easiest way
to solve this would be to add x2APIC and x2APIC NMI entries if they are also
present on the native MADT?

Could it be that your systems provide those entries because the firmware is
shared with other high-end boxes or prepared to handle configurations with >255
APIC IDs?

> >> I can kind of see struct acpi_madt_local_apic_override not being
> > 
> > Since Xen is the one that sets the local APIC address in the MADT, and it
> > always matches the position of the emulated local APIC I don't see why we would
> > want to use acpi_madt_local_apic_override. I see that your worries are related
> > to what would happen if AML tries to modify the MADT, but wouldn't in that case
> > modify the native MADT, instead of the crafted one?
> 
> Exactly, so how would you see the modification to get
> propagated?

Please bear with me, by modifications here do you mean what the _MAT method
returns from processor objects?

The ACPI 6.1 spec has something that worries me quite a lot (page 145):

"
a) When _MAT (see Section 6.2.10) appears under a Processor Device object (see
Section 8.4), OSPM processes the Interrupt Controller Structures returned by
_MAT with the types labeled "yes" and ignores other types.

b) When _MAT appears under an I/O APIC Device (see Section 9.17), OSPM
processes the Interrupt Controller Structures returned by _MAT with the types
labeled "yes" and ignores other types.
"

Which from my understanding means that almost everything from the original MADT
can be overwritten by the returns of _MAT methods. The same seems to be true
for ARM, and I don't see them dealing with this in any way. Is this something
that has yet to be implemented by any vendor?

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 10:43         ` Roger Pau Monne
@ 2016-12-22 11:03           ` Jan Beulich
  2016-12-22 12:17             ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-22 11:03 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 22.12.16 at 11:43, <roger.pau@citrix.com> wrote:
> On Wed, Dec 21, 2016 at 10:04:20AM -0700, Jan Beulich wrote:
>> I'm not convinced these table entries are tied to >255 CPUs - I'm
>> seeing them on systems with far less. Hence I simply wonder what
>> functionality we may miss to offer to OSes with these tables
>> absent.
> 
> From the ACPI spec, both entries looks very similar, with the exception that
> the x2APIC entry has a bigger size (from 1byte to 4bytes) for both the
> processor and the APIC ID.
> 
> I don't have a system with x2APIC entries at hand, but I guess the easiest way
> to solve this would be to add x2APIC and x2APIC NMI entries if they are also
> present on the native MADT?

Yes, that might be a good basis for determining, at least for now.
Otoh we always emulate an x2APIC, so it may not be unreasonable
to always provide these entries in parallel with the legacy LAPIC
ones.

> Could it be that your systems provide those entries because the firmware is
> shared with other high-end boxes or prepared to handle configurations with >255
> APIC IDs?

I obviously don't know (and hence can't exclude) that.

>> >> I can kind of see struct acpi_madt_local_apic_override not being
>> > 
>> > Since Xen is the one that sets the local APIC address in the MADT, and it
>> > always matches the position of the emulated local APIC I don't see why we would
>> > want to use acpi_madt_local_apic_override. I see that your worries are related
>> > to what would happen if AML tries to modify the MADT, but wouldn't in that case
>> > modify the native MADT, instead of the crafted one?
>> 
>> Exactly, so how would you see the modification to get
>> propagated?
> 
> Please bear with me, by modifications here do you mean what the _MAT method
> returns from processor objects?
> 
> The ACPI 6.1 spec has something that worries me quite a lot (page 145):
> 
> "
> a) When _MAT (see Section 6.2.10) appears under a Processor Device object (see
> Section 8.4), OSPM processes the Interrupt Controller Structures returned by
> _MAT with the types labeled "yes" and ignores other types.
> 
> b) When _MAT appears under an I/O APIC Device (see Section 9.17), OSPM
> processes the Interrupt Controller Structures returned by _MAT with the types
> labeled "yes" and ignores other types.
> "
> 
> Which from my understanding means that almost everything from the original MADT
> can be overwritten by the returns of _MAT methods. The same seems to be true
> for ARM, and I don't see them dealing with this in any way. Is this something
> that has yet to be implemented by any vendor?

I don't understand. For one, I can't see the spec requiring or excluding
the in place modification of MADT entries for the purpose of implementing
_MAT. Which route is taken is imo an implementation choice. In hvmloader
we use the in place modification approach at present. And then looking
at the original MADT is a boot time thing, so subsequent modifications
should be of no interest to the OS (they'd get those as _MAT return
values, not by looking at the static table).

The thing that I'm worried about is a physical machine's firmware using
the same approach as we do in hvmloader, thus returning from _MAT
values which are out of sync with the MADT we present to Dom0.
Although - thinking about it a second time now, we'd have the same
inconsistency issue if the firmware constructed the _MAT return
values from scratch, so this will need dealing with in any case for
CPU hotplug to work.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 11:03           ` Jan Beulich
@ 2016-12-22 12:17             ` Roger Pau Monne
  2016-12-22 16:17               ` Boris Ostrovsky
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-22 12:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Dec 22, 2016 at 04:03:12AM -0700, Jan Beulich wrote:
> >>> On 22.12.16 at 11:43, <roger.pau@citrix.com> wrote:
> > On Wed, Dec 21, 2016 at 10:04:20AM -0700, Jan Beulich wrote:
> >> > Since Xen is the one that sets the local APIC address in the MADT, and it
> >> > always matches the position of the emulated local APIC I don't see why we would
> >> > want to use acpi_madt_local_apic_override. I see that your worries are related
> >> > to what would happen if AML tries to modify the MADT, but wouldn't in that case
> >> > modify the native MADT, instead of the crafted one?
> >> 
> >> Exactly, so how would you see the modification to get
> >> propagated?
> > 
> > Please bear with me, by modifications here do you mean what the _MAT method
> > returns from processor objects?
> > 
> > The ACPI 6.1 spec has something that worries me quite a lot (page 145):
> > 
> > "
> > a) When _MAT (see Section 6.2.10) appears under a Processor Device object (see
> > Section 8.4), OSPM processes the Interrupt Controller Structures returned by
> > _MAT with the types labeled "yes" and ignores other types.
> > 
> > b) When _MAT appears under an I/O APIC Device (see Section 9.17), OSPM
> > processes the Interrupt Controller Structures returned by _MAT with the types
> > labeled "yes" and ignores other types.
> > "
> > 
> > Which from my understanding means that almost everything from the original MADT
> > can be overwritten by the returns of _MAT methods. The same seems to be true
> > for ARM, and I don't see them dealing with this in any way. Is this something
> > that has yet to be implemented by any vendor?
> 
> I don't understand. For one, I can't see the spec requiring or excluding
> the in place modification of MADT entries for the purpose of implementing
> _MAT. Which route is taken is imo an implementation choice. In hvmloader
> we use the in place modification approach at present. And then looking
> at the original MADT is a boot time thing, so subsequent modifications
> should be of no interest to the OS (they'd get those as _MAT return
> values, not by looking at the static table).
> 
> The thing that I'm worried about is a physical machine's firmware using
> the same approach as we do in hvmloader, thus returning from _MAT
> values which are out of sync with the MADT we present to Dom0.
> Although - thinking about it a second time now, we'd have the same
> inconsistency issue if the firmware constructed the _MAT return
> values from scratch, so this will need dealing with in any case for
> CPU hotplug to work.

I don't see an obvious solution to this, those _MAT methods reside in AML, and
ATM Xen has no way to parse, much even less modify this code. Xen could add the
Processor device path to the STAO so that Dom0 ignores them, but that would
also prevent ACPI CPU hotplug from working in Dom0.

Maybe Boris has some ideas about how to do CPU hotplug for Dom0?

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 12:17             ` Roger Pau Monne
@ 2016-12-22 16:17               ` Boris Ostrovsky
  2016-12-22 16:24                 ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Boris Ostrovsky @ 2016-12-22 16:17 UTC (permalink / raw)
  To: Roger Pau Monne, Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
> On Thu, Dec 22, 2016 at 04:03:12AM -0700, Jan Beulich wrote:
>>>>> On 22.12.16 at 11:43, <roger.pau@citrix.com> wrote:
>>> On Wed, Dec 21, 2016 at 10:04:20AM -0700, Jan Beulich wrote:
>>>>> Since Xen is the one that sets the local APIC address in the MADT, and it
>>>>> always matches the position of the emulated local APIC I don't see why we would
>>>>> want to use acpi_madt_local_apic_override. I see that your worries are related
>>>>> to what would happen if AML tries to modify the MADT, but wouldn't in that case
>>>>> modify the native MADT, instead of the crafted one?
>>>> Exactly, so how would you see the modification to get
>>>> propagated?
>>> Please bear with me, by modifications here do you mean what the _MAT method
>>> returns from processor objects?
>>>
>>> The ACPI 6.1 spec has something that worries me quite a lot (page 145):
>>>
>>> "
>>> a) When _MAT (see Section 6.2.10) appears under a Processor Device object (see
>>> Section 8.4), OSPM processes the Interrupt Controller Structures returned by
>>> _MAT with the types labeled "yes" and ignores other types.
>>>
>>> b) When _MAT appears under an I/O APIC Device (see Section 9.17), OSPM
>>> processes the Interrupt Controller Structures returned by _MAT with the types
>>> labeled "yes" and ignores other types.
>>> "
>>>
>>> Which from my understanding means that almost everything from the original MADT
>>> can be overwritten by the returns of _MAT methods. The same seems to be true
>>> for ARM, and I don't see them dealing with this in any way. Is this something
>>> that has yet to be implemented by any vendor?
>> I don't understand. For one, I can't see the spec requiring or excluding
>> the in place modification of MADT entries for the purpose of implementing
>> _MAT. Which route is taken is imo an implementation choice. In hvmloader
>> we use the in place modification approach at present. And then looking
>> at the original MADT is a boot time thing, so subsequent modifications
>> should be of no interest to the OS (they'd get those as _MAT return
>> values, not by looking at the static table).
>>
>> The thing that I'm worried about is a physical machine's firmware using
>> the same approach as we do in hvmloader, thus returning from _MAT
>> values which are out of sync with the MADT we present to Dom0.
>> Although - thinking about it a second time now, we'd have the same
>> inconsistency issue if the firmware constructed the _MAT return
>> values from scratch, so this will need dealing with in any case for
>> CPU hotplug to work.
> I don't see an obvious solution to this, those _MAT methods reside in AML, and
> ATM Xen has no way to parse, much even less modify this code. Xen could add the
> Processor device path to the STAO so that Dom0 ignores them, but that would
> also prevent ACPI CPU hotplug from working in Dom0.
>
> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?


Why would Xen need to be able to parse the AML that is intended to be
executed by dom0? I'd think that all the hypervisor would need to do is
to load it into memory, not any different from how it's done for regular
guests.

-boris


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 16:17               ` Boris Ostrovsky
@ 2016-12-22 16:24                 ` Jan Beulich
  2016-12-22 16:44                   ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2016-12-22 16:24 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel, Roger Pau Monne

>>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
> On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
>> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
> 
> Why would Xen need to be able to parse the AML that is intended to be
> executed by dom0? I'd think that all the hypervisor would need to do is
> to load it into memory, not any different from how it's done for regular
> guests.

Well, if Dom0 executed the unmodified _MAT, it would get back
data relating to the physical CPU. Roger is overriding MADT to
present virtual CPU information to Dom0, and this would likewise
need to happen for the _MAT return value.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 16:24                 ` Jan Beulich
@ 2016-12-22 16:44                   ` Roger Pau Monne
  2016-12-22 18:19                     ` Boris Ostrovsky
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-22 16:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Boris Ostrovsky, xen-devel

On Thu, Dec 22, 2016 at 09:24:02AM -0700, Jan Beulich wrote:
> >>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
> > On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
> >> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
> > 
> > Why would Xen need to be able to parse the AML that is intended to be
> > executed by dom0? I'd think that all the hypervisor would need to do is
> > to load it into memory, not any different from how it's done for regular
> > guests.
> 
> Well, if Dom0 executed the unmodified _MAT, it would get back
> data relating to the physical CPU. Roger is overriding MADT to
> present virtual CPU information to Dom0, and this would likewise
> need to happen for the _MAT return value.

This is one of the problems with this Dom/Xen0 split brain problem that we
have, and cannot get away from.

To clarify a little bit, I'm not overwriting the original MADT in memory, so
Dom0 should still be able to access it if the implementation of _MAT returns
data from that area. AFAICT when plugging in a physical CPU (pCPU) into the
hardware, Dom0 should see "correct" data returned by the _MAT method. However
(as represented by the " I've used), this data will not match Dom0 vCPU
topology, and should not be used by Dom0 (only reported to Xen in order to
bring up the new pCPU).

Then the problem arises because we have no way to perform vCPU hotplug for
Dom0, not at least as it is done for DomU (using ACPI), so we would have to use
an out-of-band method in order to do vCPU hotplug for Dom0, which is a PITA.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 16:44                   ` Roger Pau Monne
@ 2016-12-22 18:19                     ` Boris Ostrovsky
  2016-12-23 10:27                       ` Roger Pau Monne
  2017-01-03  7:53                       ` Jan Beulich
  0 siblings, 2 replies; 59+ messages in thread
From: Boris Ostrovsky @ 2016-12-22 18:19 UTC (permalink / raw)
  To: Roger Pau Monne, Jan Beulich; +Cc: Andrew Cooper, xen-devel

On 12/22/2016 11:44 AM, Roger Pau Monne wrote:
> On Thu, Dec 22, 2016 at 09:24:02AM -0700, Jan Beulich wrote:
>>>>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
>>> On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
>>>> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
>>> Why would Xen need to be able to parse the AML that is intended to be
>>> executed by dom0? I'd think that all the hypervisor would need to do is
>>> to load it into memory, not any different from how it's done for regular
>>> guests.
>> Well, if Dom0 executed the unmodified _MAT, it would get back
>> data relating to the physical CPU. Roger is overriding MADT to
>> present virtual CPU information to Dom0, and this would likewise
>> need to happen for the _MAT return value.

By "unmodified _MAT" you mean system's _MAT?  Why can't we provide our
own that will return _MAT object properly adjusted for dom0? We are
going to provide our own (i.e. not system's) DSDT, aren't we?

> This is one of the problems with this Dom/Xen0 split brain problem that we
> have, and cannot get away from.
>
> To clarify a little bit, I'm not overwriting the original MADT in memory, so
> Dom0 should still be able to access it if the implementation of _MAT returns
> data from that area. AFAICT when plugging in a physical CPU (pCPU) into the
> hardware, Dom0 should see "correct" data returned by the _MAT method. However
> (as represented by the " I've used), this data will not match Dom0 vCPU
> topology, and should not be used by Dom0 (only reported to Xen in order to
> bring up the new pCPU).

So the problem seems to be that we need to run both _MATs --- system's
and dom0's.


> Then the problem arises because we have no way to perform vCPU hotplug for
> Dom0, not at least as it is done for DomU (using ACPI), so we would have to use
> an out-of-band method in order to do vCPU hotplug for Dom0, which is a PITA.


I would very much like to avoid this.

-boris



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 18:19                     ` Boris Ostrovsky
@ 2016-12-23 10:27                       ` Roger Pau Monne
  2016-12-23 15:13                         ` Boris Ostrovsky
  2016-12-23 15:21                         ` Konrad Rzeszutek Wilk
  2017-01-03  7:53                       ` Jan Beulich
  1 sibling, 2 replies; 59+ messages in thread
From: Roger Pau Monne @ 2016-12-23 10:27 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, Jan Beulich, xen-devel

On Thu, Dec 22, 2016 at 01:19:36PM -0500, Boris Ostrovsky wrote:
> On 12/22/2016 11:44 AM, Roger Pau Monne wrote:
> > On Thu, Dec 22, 2016 at 09:24:02AM -0700, Jan Beulich wrote:
> >>>>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
> >>> On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
> >>>> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
> >>> Why would Xen need to be able to parse the AML that is intended to be
> >>> executed by dom0? I'd think that all the hypervisor would need to do is
> >>> to load it into memory, not any different from how it's done for regular
> >>> guests.
> >> Well, if Dom0 executed the unmodified _MAT, it would get back
> >> data relating to the physical CPU. Roger is overriding MADT to
> >> present virtual CPU information to Dom0, and this would likewise
> >> need to happen for the _MAT return value.
> 
> By "unmodified _MAT" you mean system's _MAT?  Why can't we provide our
> own that will return _MAT object properly adjusted for dom0? We are
> going to provide our own (i.e. not system's) DSDT, aren't we?

Providing Dom0 with a different DSDT is almost impossible from a Xen PoV, for
once Xen cannot parse the original DSDT (because it's a dynamic table), and
then if we would be to provide a modified DSDT, we would also need an asl
assembler, so that we could parse the DSDT, modify it, and then compile it
again in order to provide it to Dom0.

Although all this code could be put under an init section that would be freed
after Dom0 creation it seems overkill and very far from trivial, not to mention
that I'm not even sure what side-effects there would be if Xen parsed the DSDT
itself without having any drivers.

> > This is one of the problems with this Dom/Xen0 split brain problem that we
> > have, and cannot get away from.
> >
> > To clarify a little bit, I'm not overwriting the original MADT in memory, so
> > Dom0 should still be able to access it if the implementation of _MAT returns
> > data from that area. AFAICT when plugging in a physical CPU (pCPU) into the
> > hardware, Dom0 should see "correct" data returned by the _MAT method. However
> > (as represented by the " I've used), this data will not match Dom0 vCPU
> > topology, and should not be used by Dom0 (only reported to Xen in order to
> > bring up the new pCPU).
> 
> So the problem seems to be that we need to run both _MATs --- system's
> and dom0's.

Exactly, we need _MAT for pCPUs and _MAT for _vCPUs.

> > Then the problem arises because we have no way to perform vCPU hotplug for
> > Dom0, not at least as it is done for DomU (using ACPI), so we would have to use
> > an out-of-band method in order to do vCPU hotplug for Dom0, which is a PITA.
> 
> 
> I would very much like to avoid this.

Maybe we can provide an extra SSDT for Dom0 that basically overwrites the CPU
objects (_PR.CPUX), but I'm not sure if ACPI allows this kind of objects
overwrites?

After reading the spec, I came across the following note in the SSDT section:

"Additional tables can only add data; they cannot overwrite data from previous
tables."

So I guess this is a no-go.

I only see the following options:

 - Prevent Dom0 from using the original _MAT methods (or even the full _PR.CPU
   objects) using the STAO, and then provide Dom0 with an out-of-band method
   (ie: not ACPI) in order to do CPU hotplug.

 - Expand the STAO so that it can be used to override ACPI namespace objects,
   possibly by adding a payload field that contains aml code. It seems that
   Linux already supports overwriting part of the ACPI namespace from
   user-space[0], so part of the needed machinery seem to be already in place
   (hopefully in acpica code?).

 - Disable the native CPU objects in the DSDT/SSDT using the STAO. Then pick up
   unused ACPI CPU IDs and use those for vCPUs. Provide an additional SSDT that
   contains ACPI objects for those vCPUs (as is done for DomU). This means we
   would probably have to start using x2APIC entries in the MADT, since the
   CPUs IDs might easily expand past 255 (AFAICT we could still keep the APIC
   IDs low however, since those two are disjoint).

I don't really fancy any of these two options, probably the last one seems like
the best IMHO, but I would like to hear some feedback about them, and of course
I'm open to suggestions :).

Roger.

[0] https://www.kernel.org/doc/Documentation/acpi/method-customizing.txt


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 10:27                       ` Roger Pau Monne
@ 2016-12-23 15:13                         ` Boris Ostrovsky
  2016-12-23 15:31                           ` Konrad Rzeszutek Wilk
  2016-12-23 15:21                         ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 59+ messages in thread
From: Boris Ostrovsky @ 2016-12-23 15:13 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, Jan Beulich, xen-devel

On 12/23/2016 05:27 AM, Roger Pau Monne wrote:
> On Thu, Dec 22, 2016 at 01:19:36PM -0500, Boris Ostrovsky wrote:
>> On 12/22/2016 11:44 AM, Roger Pau Monne wrote:
>>> On Thu, Dec 22, 2016 at 09:24:02AM -0700, Jan Beulich wrote:
>>>>>>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
>>>>> On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
>>>>>> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
>>>>> Why would Xen need to be able to parse the AML that is intended to be
>>>>> executed by dom0? I'd think that all the hypervisor would need to do is
>>>>> to load it into memory, not any different from how it's done for regular
>>>>> guests.
>>>> Well, if Dom0 executed the unmodified _MAT, it would get back
>>>> data relating to the physical CPU. Roger is overriding MADT to
>>>> present virtual CPU information to Dom0, and this would likewise
>>>> need to happen for the _MAT return value.
>> By "unmodified _MAT" you mean system's _MAT?  Why can't we provide our
>> own that will return _MAT object properly adjusted for dom0? We are
>> going to provide our own (i.e. not system's) DSDT, aren't we?
> Providing Dom0 with a different DSDT is almost impossible from a Xen PoV, for
> once Xen cannot parse the original DSDT (because it's a dynamic table), and
> then if we would be to provide a modified DSDT, we would also need an asl
> assembler, so that we could parse the DSDT, modify it, and then compile it
> again in order to provide it to Dom0.
>
> Although all this code could be put under an init section that would be freed
> after Dom0 creation it seems overkill and very far from trivial, not to mention
> that I'm not even sure what side-effects there would be if Xen parsed the DSDT
> itself without having any drivers.
>
>>> This is one of the problems with this Dom/Xen0 split brain problem that we
>>> have, and cannot get away from.
>>>
>>> To clarify a little bit, I'm not overwriting the original MADT in memory, so
>>> Dom0 should still be able to access it if the implementation of _MAT returns
>>> data from that area. AFAICT when plugging in a physical CPU (pCPU) into the
>>> hardware, Dom0 should see "correct" data returned by the _MAT method. However
>>> (as represented by the " I've used), this data will not match Dom0 vCPU
>>> topology, and should not be used by Dom0 (only reported to Xen in order to
>>> bring up the new pCPU).
>> So the problem seems to be that we need to run both _MATs --- system's
>> and dom0's.
> Exactly, we need _MAT for pCPUs and _MAT for _vCPUs.
>
>>> Then the problem arises because we have no way to perform vCPU hotplug for
>>> Dom0, not at least as it is done for DomU (using ACPI), so we would have to use
>>> an out-of-band method in order to do vCPU hotplug for Dom0, which is a PITA.
>>
>> I would very much like to avoid this.
> Maybe we can provide an extra SSDT for Dom0 that basically overwrites the CPU
> objects (_PR.CPUX), but I'm not sure if ACPI allows this kind of objects
> overwrites?
>
> After reading the spec, I came across the following note in the SSDT section:
>
> "Additional tables can only add data; they cannot overwrite data from previous
> tables."
>
> So I guess this is a no-go.
>
> I only see the following options:
>
>  - Prevent Dom0 from using the original _MAT methods (or even the full _PR.CPU
>    objects) using the STAO, and then provide Dom0 with an out-of-band method
>    (ie: not ACPI) in order to do CPU hotplug.

I don't see how we can have dom0 execute system's _MAT (or, indeed,
anything in _PR_CPU) and not try to add the processor to its own (vCPU)
list.

>
>  - Expand the STAO so that it can be used to override ACPI namespace objects,
>    possibly by adding a payload field that contains aml code. It seems that
>    Linux already supports overwriting part of the ACPI namespace from
>    user-space[0], so part of the needed machinery seem to be already in place
>    (hopefully in acpica code?).

At least judging by the pathname in sysfs this may be a debugging
feature. Which doesn't mean it can't be used, but we'll need to confirm
this with (Linux) APCI people.

>
>  - Disable the native CPU objects in the DSDT/SSDT using the STAO. Then pick up
>    unused ACPI CPU IDs and use those for vCPUs. Provide an additional SSDT that
>    contains ACPI objects for those vCPUs (as is done for DomU). This means we
>    would probably have to start using x2APIC entries in the MADT, since the
>    CPUs IDs might easily expand past 255 (AFAICT we could still keep the APIC
>    IDs low however, since those two are disjoint).

But this still assumes that dom0 handles ACPI event for a pCPUs as well,
right? And I am not sure this can work.

Actually, how do we hotplug pCPUs now?

-boris

>
> I don't really fancy any of these two options, probably the last one seems like
> the best IMHO, but I would like to hear some feedback about them, and of course
> I'm open to suggestions :).
>
> Roger.
>
> [0] https://www.kernel.org/doc/Documentation/acpi/method-customizing.txt
>



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 10:27                       ` Roger Pau Monne
  2016-12-23 15:13                         ` Boris Ostrovsky
@ 2016-12-23 15:21                         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 59+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-12-23 15:21 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, Boris Ostrovsky, Jan Beulich, xen-devel

On Fri, Dec 23, 2016 at 10:27:46AM +0000, Roger Pau Monne wrote:
> On Thu, Dec 22, 2016 at 01:19:36PM -0500, Boris Ostrovsky wrote:
> > On 12/22/2016 11:44 AM, Roger Pau Monne wrote:
> > > On Thu, Dec 22, 2016 at 09:24:02AM -0700, Jan Beulich wrote:
> > >>>>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
> > >>> On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
> > >>>> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
> > >>> Why would Xen need to be able to parse the AML that is intended to be
> > >>> executed by dom0? I'd think that all the hypervisor would need to do is
> > >>> to load it into memory, not any different from how it's done for regular
> > >>> guests.
> > >> Well, if Dom0 executed the unmodified _MAT, it would get back
> > >> data relating to the physical CPU. Roger is overriding MADT to
> > >> present virtual CPU information to Dom0, and this would likewise
> > >> need to happen for the _MAT return value.
> > 
> > By "unmodified _MAT" you mean system's _MAT?  Why can't we provide our
> > own that will return _MAT object properly adjusted for dom0? We are
> > going to provide our own (i.e. not system's) DSDT, aren't we?
> 
> Providing Dom0 with a different DSDT is almost impossible from a Xen PoV, for
> once Xen cannot parse the original DSDT (because it's a dynamic table), and
> then if we would be to provide a modified DSDT, we would also need an asl
> assembler, so that we could parse the DSDT, modify it, and then compile it
> again in order to provide it to Dom0.
> 
> Although all this code could be put under an init section that would be freed
> after Dom0 creation it seems overkill and very far from trivial, not to mention
> that I'm not even sure what side-effects there would be if Xen parsed the DSDT
> itself without having any drivers.
> 
> > > This is one of the problems with this Dom/Xen0 split brain problem that we
> > > have, and cannot get away from.
> > >
> > > To clarify a little bit, I'm not overwriting the original MADT in memory, so
> > > Dom0 should still be able to access it if the implementation of _MAT returns
> > > data from that area. AFAICT when plugging in a physical CPU (pCPU) into the
> > > hardware, Dom0 should see "correct" data returned by the _MAT method. However
> > > (as represented by the " I've used), this data will not match Dom0 vCPU
> > > topology, and should not be used by Dom0 (only reported to Xen in order to
> > > bring up the new pCPU).
> > 
> > So the problem seems to be that we need to run both _MATs --- system's
> > and dom0's.
> 
> Exactly, we need _MAT for pCPUs and _MAT for _vCPUs.
> 
> > > Then the problem arises because we have no way to perform vCPU hotplug for
> > > Dom0, not at least as it is done for DomU (using ACPI), so we would have to use
> > > an out-of-band method in order to do vCPU hotplug for Dom0, which is a PITA.
> > 
> > 
> > I would very much like to avoid this.
> 
> Maybe we can provide an extra SSDT for Dom0 that basically overwrites the CPU
> objects (_PR.CPUX), but I'm not sure if ACPI allows this kind of objects
> overwrites?
> 
> After reading the spec, I came across the following note in the SSDT section:
> 
> "Additional tables can only add data; they cannot overwrite data from previous
> tables."
> 
> So I guess this is a no-go.
> 
> I only see the following options:
> 
>  - Prevent Dom0 from using the original _MAT methods (or even the full _PR.CPU
>    objects) using the STAO, and then provide Dom0 with an out-of-band method
>    (ie: not ACPI) in order to do CPU hotplug.
> 
>  - Expand the STAO so that it can be used to override ACPI namespace objects,
>    possibly by adding a payload field that contains aml code. It seems that
>    Linux already supports overwriting part of the ACPI namespace from
>    user-space[0], so part of the needed machinery seem to be already in place
>    (hopefully in acpica code?).
> 
>  - Disable the native CPU objects in the DSDT/SSDT using the STAO. Then pick up
>    unused ACPI CPU IDs and use those for vCPUs. Provide an additional SSDT that

You need those so that the C and P state parser can actually parse the other
CPU states which it will upload to Xen.

>    contains ACPI objects for those vCPUs (as is done for DomU). This means we
>    would probably have to start using x2APIC entries in the MADT, since the
>    CPUs IDs might easily expand past 255 (AFAICT we could still keep the APIC
>    IDs low however, since those two are disjoint).
>

- Import ACPI AML parser in Xen. <mad laughter>

 
> I don't really fancy any of these two options, probably the last one seems like
> the best IMHO, but I would like to hear some feedback about them, and of course
> I'm open to suggestions :).

- Do the same thing we had been doing with Xen and Linux PV. Expose the full
  machine MADT and then massage it. Yuck.

I am full of bad ideas today.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 15:13                         ` Boris Ostrovsky
@ 2016-12-23 15:31                           ` Konrad Rzeszutek Wilk
  2016-12-23 15:35                             ` Boris Ostrovsky
  0 siblings, 1 reply; 59+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-12-23 15:31 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel, Jan Beulich, Roger Pau Monne

> But this still assumes that dom0 handles ACPI event for a pCPUs as well,
> right? And I am not sure this can work.
> 
> Actually, how do we hotplug pCPUs now?

xen-hptool

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 15:31                           ` Konrad Rzeszutek Wilk
@ 2016-12-23 15:35                             ` Boris Ostrovsky
  2016-12-23 16:02                               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 59+ messages in thread
From: Boris Ostrovsky @ 2016-12-23 15:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, xen-devel, Jan Beulich, Roger Pau Monne

On 12/23/2016 10:31 AM, Konrad Rzeszutek Wilk wrote:
>> But this still assumes that dom0 handles ACPI event for a pCPUs as well,
>> right? And I am not sure this can work.
>>
>> Actually, how do we hotplug pCPUs now?
> xen-hptool

Yes, but this has nothing to do with an actual pCPU being hot-plugged
into the system. It's similar to doing 'echo 1 > /sys/.../cpuX/online in
Lnux.

My question is --- how do we (hypervisor/dom0/toolstack) become aware of
appearance of a new processor.

-boris

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 15:35                             ` Boris Ostrovsky
@ 2016-12-23 16:02                               ` Konrad Rzeszutek Wilk
  2016-12-23 17:13                                 ` Boris Ostrovsky
  0 siblings, 1 reply; 59+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-12-23 16:02 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel, Jan Beulich, Roger Pau Monne

On Fri, Dec 23, 2016 at 10:35:10AM -0500, Boris Ostrovsky wrote:
> On 12/23/2016 10:31 AM, Konrad Rzeszutek Wilk wrote:
> >> But this still assumes that dom0 handles ACPI event for a pCPUs as well,
> >> right? And I am not sure this can work.
> >>
> >> Actually, how do we hotplug pCPUs now?
> > xen-hptool
> 
> Yes, but this has nothing to do with an actual pCPU being hot-plugged
> into the system. It's similar to doing 'echo 1 > /sys/.../cpuX/online in
> Lnux.
> 
> My question is --- how do we (hypervisor/dom0/toolstack) become aware of
> appearance of a new processor.

Ooooh. drivers/xen/xen-acpi-cpuhotplug.c


But
213 config XEN_STUB                                                                 
214     bool "Xen stub drivers"                                                     
215     depends on XEN && X86_64 && BROKEN                                          
216     default n                                          


235 config XEN_ACPI_HOTPLUG_CPU                                                     
236     tristate "Xen ACPI cpu hotplug"                                             
237     depends on XEN_DOM0 && XEN_STUB && ACPI       

So it is probably very very badly not working.

> 
> -boris

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 16:02                               ` Konrad Rzeszutek Wilk
@ 2016-12-23 17:13                                 ` Boris Ostrovsky
  2017-01-03 15:55                                   ` Roger Pau Monne
  0 siblings, 1 reply; 59+ messages in thread
From: Boris Ostrovsky @ 2016-12-23 17:13 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, xen-devel, Jan Beulich, Roger Pau Monne

On 12/23/2016 11:02 AM, Konrad Rzeszutek Wilk wrote:
> On Fri, Dec 23, 2016 at 10:35:10AM -0500, Boris Ostrovsky wrote:
>> On 12/23/2016 10:31 AM, Konrad Rzeszutek Wilk wrote:
>>>> But this still assumes that dom0 handles ACPI event for a pCPUs as well,
>>>> right? And I am not sure this can work.
>>>>
>>>> Actually, how do we hotplug pCPUs now?
>>> xen-hptool
>> Yes, but this has nothing to do with an actual pCPU being hot-plugged
>> into the system. It's similar to doing 'echo 1 > /sys/.../cpuX/online in
>> Lnux.
>>
>> My question is --- how do we (hypervisor/dom0/toolstack) become aware of
>> appearance of a new processor.
> Ooooh. drivers/xen/xen-acpi-cpuhotplug.c

Looking at this code I think we should be able to at least differentiate
between pCPU and vCPU and not add pCPU to dom0.

-boris

>
>
> But
> 213 config XEN_STUB                                                                 
> 214     bool "Xen stub drivers"                                                     
> 215     depends on XEN && X86_64 && BROKEN                                          
> 216     default n                                          
>
>
> 235 config XEN_ACPI_HOTPLUG_CPU                                                     
> 236     tristate "Xen ACPI cpu hotplug"                                             
> 237     depends on XEN_DOM0 && XEN_STUB && ACPI       
>
> So it is probably very very badly not working.
>
>> -boris



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-22 18:19                     ` Boris Ostrovsky
  2016-12-23 10:27                       ` Roger Pau Monne
@ 2017-01-03  7:53                       ` Jan Beulich
  1 sibling, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2017-01-03  7:53 UTC (permalink / raw)
  To: Roger Pau Monne, Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel

>>> On 22.12.16 at 19:19, <boris.ostrovsky@oracle.com> wrote:
> On 12/22/2016 11:44 AM, Roger Pau Monne wrote:
>> On Thu, Dec 22, 2016 at 09:24:02AM -0700, Jan Beulich wrote:
>>>>>> On 22.12.16 at 17:17, <boris.ostrovsky@oracle.com> wrote:
>>>> On 12/22/2016 07:17 AM, Roger Pau Monne wrote:
>>>>> Maybe Boris has some ideas about how to do CPU hotplug for Dom0?
>>>> Why would Xen need to be able to parse the AML that is intended to be
>>>> executed by dom0? I'd think that all the hypervisor would need to do is
>>>> to load it into memory, not any different from how it's done for regular
>>>> guests.
>>> Well, if Dom0 executed the unmodified _MAT, it would get back
>>> data relating to the physical CPU. Roger is overriding MADT to
>>> present virtual CPU information to Dom0, and this would likewise
>>> need to happen for the _MAT return value.
> 
> By "unmodified _MAT" you mean system's _MAT?  Why can't we provide our
> own that will return _MAT object properly adjusted for dom0? We are
> going to provide our own (i.e. not system's) DSDT, aren't we?

For Dom0? No, I didn't think so. We can't reasonably edit the host's,
and we also can't craft our own (that works only for DomU).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-12-23 17:13                                 ` Boris Ostrovsky
@ 2017-01-03 15:55                                   ` Roger Pau Monne
  2017-01-03 19:13                                     ` Boris Ostrovsky
  0 siblings, 1 reply; 59+ messages in thread
From: Roger Pau Monne @ 2017-01-03 15:55 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel, Jan Beulich

On Fri, Dec 23, 2016 at 12:13:47PM -0500, Boris Ostrovsky wrote:
> On 12/23/2016 11:02 AM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Dec 23, 2016 at 10:35:10AM -0500, Boris Ostrovsky wrote:
> >> On 12/23/2016 10:31 AM, Konrad Rzeszutek Wilk wrote:
> >>>> But this still assumes that dom0 handles ACPI event for a pCPUs as well,
> >>>> right? And I am not sure this can work.
> >>>>
> >>>> Actually, how do we hotplug pCPUs now?
> >>> xen-hptool
> >> Yes, but this has nothing to do with an actual pCPU being hot-plugged
> >> into the system. It's similar to doing 'echo 1 > /sys/.../cpuX/online in
> >> Lnux.
> >>
> >> My question is --- how do we (hypervisor/dom0/toolstack) become aware of
> >> appearance of a new processor.
> > Ooooh. drivers/xen/xen-acpi-cpuhotplug.c
> 
> Looking at this code I think we should be able to at least differentiate
> between pCPU and vCPU and not add pCPU to dom0.

Right, but I don't really see any way to do this with the current ACPI
interface, and still be able to use ACPI CPU hotplug for both pCPUs and
vCPUs.

We can provide a crafted MADT that reflects the number of vCPUs available to
Dom0 at boot time, let Dom0 find the Processor objects in the ACPI namespace
and perform pCPU hotplug as it's been done until now. Then for Dom0 vCPU
hotplug we would need to use the hypercall interface as it's used by classic PV
guests, because AFAICT there's no way we can provide Dom0 with a PRST/PRSC
that's valid for both vCPUs and pCPUs.

Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables
  2017-01-03 15:55                                   ` Roger Pau Monne
@ 2017-01-03 19:13                                     ` Boris Ostrovsky
  0 siblings, 0 replies; 59+ messages in thread
From: Boris Ostrovsky @ 2017-01-03 19:13 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, xen-devel, Jan Beulich


> We can provide a crafted MADT that reflects the number of vCPUs available to
> Dom0 at boot time, let Dom0 find the Processor objects in the ACPI namespace
> and perform pCPU hotplug as it's been done until now. Then for Dom0 vCPU
> hotplug we would need to use the hypercall interface as it's used by classic PV
> guests, because AFAICT there's no way we can provide Dom0 with a PRST/PRSC
> that's valid for both vCPUs and pCPUs.

If we have a PV-like interface for bringing up vCPUs for dom0 then
what's the point of doing it differently for domUs?

-boris



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2017-01-03 19:13 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-30 16:49 [PATCH v4 00/14] Initial PVHv2 Dom0 support Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 01/14] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
2016-12-08 16:21   ` Jan Beulich
2016-12-16 11:24     ` Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 02/14] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 03/14] xen/x86: allow calling {shadow/hap}_set_allocation with the idle domain Roger Pau Monne
2016-12-01 11:25   ` Tim Deegan
2016-11-30 16:49 ` [PATCH v4 04/14] x86/paging: introduce paging_set_allocation Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 05/14] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 06/14] x86/vtd: refuse to enable IOMMU if the PCI scan fails Roger Pau Monne
2016-12-01  0:29   ` Tian, Kevin
2016-11-30 16:49 ` [PATCH v4 07/14] x86/iommu: add IOMMU entries for p2m_mmio_direct pages Roger Pau Monne
2016-12-08 16:24   ` Jan Beulich
2016-12-16 14:01     ` Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 08/14] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
2016-12-08 16:29   ` Jan Beulich
2016-11-30 16:49 ` [PATCH v4 09/14] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
2016-12-09 16:07   ` Jan Beulich
2016-12-16 14:28     ` Roger Pau Monne
2016-12-16 14:45       ` Roger Pau Monne
2016-12-16 16:17         ` Jan Beulich
2016-12-16 17:57           ` Roger Pau Monne
2016-12-19  7:42             ` Jan Beulich
2016-12-16 15:28       ` Roger Pau Monne
2016-12-16 16:15         ` Jan Beulich
2016-12-16 16:18       ` Jan Beulich
2016-12-16 17:42         ` Roger Pau Monne
2016-12-19  7:41           ` Jan Beulich
2016-11-30 16:49 ` [PATCH v4 10/14] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
2016-12-09 16:48   ` Jan Beulich
2016-12-16 17:38     ` Roger Pau Monne
2016-12-19  7:48       ` Jan Beulich
2016-11-30 16:49 ` [PATCH v4 11/14] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
2016-12-09 17:05   ` Jan Beulich
2016-12-20 17:34     ` Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 12/14] x86/PVHv2: fix dom0_max_vcpus so it's capped to 128 for PVHv2 Dom0 Roger Pau Monne
2016-12-09 17:09   ` Jan Beulich
2016-11-30 16:49 ` [PATCH v4 13/14] xen/x86: hack to setup PVHv2 Dom0 CPUs Roger Pau Monne
2016-11-30 16:49 ` [PATCH v4 14/14] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
2016-12-12 13:56   ` Jan Beulich
2016-12-21 16:32     ` Roger Pau Monne
2016-12-21 17:04       ` Jan Beulich
2016-12-22 10:43         ` Roger Pau Monne
2016-12-22 11:03           ` Jan Beulich
2016-12-22 12:17             ` Roger Pau Monne
2016-12-22 16:17               ` Boris Ostrovsky
2016-12-22 16:24                 ` Jan Beulich
2016-12-22 16:44                   ` Roger Pau Monne
2016-12-22 18:19                     ` Boris Ostrovsky
2016-12-23 10:27                       ` Roger Pau Monne
2016-12-23 15:13                         ` Boris Ostrovsky
2016-12-23 15:31                           ` Konrad Rzeszutek Wilk
2016-12-23 15:35                             ` Boris Ostrovsky
2016-12-23 16:02                               ` Konrad Rzeszutek Wilk
2016-12-23 17:13                                 ` Boris Ostrovsky
2017-01-03 15:55                                   ` Roger Pau Monne
2017-01-03 19:13                                     ` Boris Ostrovsky
2016-12-23 15:21                         ` Konrad Rzeszutek Wilk
2017-01-03  7:53                       ` Jan Beulich

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.