All of lore.kernel.org
* [PATCH v3.1 00/15] Initial PVHv2 Dom0 support
@ 2016-10-29  8:59 Roger Pau Monne
  2016-10-29  8:59 ` [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
                   ` (15 more replies)
  0 siblings, 16 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk

(resending as v3.1, it seems like I need to figure out how to properly use 
msmtp with git send-email because on the last try only the cover letter 
was actually sent).

Hello,

This is the first batch of the PVH Dom0 support series, which includes
everything up to the point where ACPI tables for the Dom0 are crafted. I've
decided to leave the last part of the series (the one that contains the PCI
config space handlers, and other emulation/trapping related code) separate,
in order to focus and ease the review. This is of course not functional: one
might be able to partially boot a Dom0 kernel if it doesn't try to access
any physical device.

Another reason for splitting this series is so that I can write a proper
design document about how this trapping is going to work, and what it is
supposed to do, because during the last review round I got the feeling that
some comments were not really related to the code itself, but to what I was
trying to achieve, so it's best to discuss them in a design document rather
than mixed up with code.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 89+ messages in thread

* [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-31 16:32   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 02/15] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
from physical or emulated devices over event channels using PIRQs. This
applies to both DomU and Dom0 PVHv2 guests.

Introduce a new XEN_X86_EMU_USE_PIRQ flag to notify Xen whether an HVM guest
can route physical interrupts (even from emulated devices) over event
channels, and is thus allowed to use some of the PHYSDEV ops.
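The gating this introduces in hvm_physdev_op can be sketched in a standalone
simplification (struct fake_domain and may_use_pirq_ops are illustrative
stand-ins for the hypervisor structures, not actual Xen code):

```c
#include <stdbool.h>

/* Bit position taken from the new public-header definition below. */
#define XEN_X86_EMU_USE_PIRQ (1U << 9)

/* Simplified stand-in for struct domain. */
struct fake_domain {
    unsigned int emulation_flags;
    bool is_pvh;
};

/* A guest may use the PIRQ-related PHYSDEV ops if the new emulation
 * flag is set (classic HVM) or if it is a PVH domain. */
static bool may_use_pirq_ops(const struct fake_domain *d)
{
    return (d->emulation_flags & XEN_X86_EMU_USE_PIRQ) || d->is_pvh;
}
```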

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Change local variable name to currd instead of d.
 - Use currd where it makes sense.
---
 xen/arch/x86/hvm/hvm.c            | 25 ++++++++++++++++---------
 xen/arch/x86/physdev.c            |  5 +++--
 xen/common/kernel.c               |  3 ++-
 xen/include/public/arch-x86/xen.h |  4 +++-
 4 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 11e2b82..e516b20 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4164,10 +4164,12 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
 static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *currd = current->domain;
+
     switch ( cmd )
     {
     default:
-        if ( !is_pvh_vcpu(current) || !is_hardware_domain(current->domain) )
+        if ( !is_pvh_domain(currd) || !is_hardware_domain(currd) )
             return -ENOSYS;
         /* fall through */
     case PHYSDEVOP_map_pirq:
@@ -4175,7 +4177,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
-        return do_physdev_op(cmd, arg);
+        return ((currd->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
+               is_pvh_domain(currd)) ?
+                    do_physdev_op(cmd, arg) : -ENOSYS;
     }
 }
 
@@ -4208,17 +4212,20 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 static long hvm_physdev_op_compat32(
     int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *d = current->domain;
+
     switch ( cmd )
     {
-        case PHYSDEVOP_map_pirq:
-        case PHYSDEVOP_unmap_pirq:
-        case PHYSDEVOP_eoi:
-        case PHYSDEVOP_irq_status_query:
-        case PHYSDEVOP_get_free_pirq:
-            return compat_physdev_op(cmd, arg);
+    case PHYSDEVOP_map_pirq:
+    case PHYSDEVOP_unmap_pirq:
+    case PHYSDEVOP_eoi:
+    case PHYSDEVOP_irq_status_query:
+    case PHYSDEVOP_get_free_pirq:
+        return (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                    compat_physdev_op(cmd, arg) : -ENOSYS;
         break;
     default:
-            return -ENOSYS;
+        return -ENOSYS;
         break;
     }
 }
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 5a49796..0bea6e1 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -94,7 +94,8 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
     int pirq, irq, ret = 0;
     void *map_data = NULL;
 
-    if ( domid == DOMID_SELF && is_hvm_domain(d) )
+    if ( domid == DOMID_SELF && is_hvm_domain(d) &&
+         (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         /*
          * Only makes sense for vector-based callback, else HVM-IRQ logic
@@ -265,7 +266,7 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
     if ( ret )
         goto free_domain;
 
-    if ( is_hvm_domain(d) )
+    if ( is_hvm_domain(d) && (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         spin_lock(&d->event_lock);
         if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index d0edb13..a82f55f 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -332,7 +332,8 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             case guest_type_hvm:
                 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
                              (1U << XENFEAT_hvm_callback_vector) |
-                             (1U << XENFEAT_hvm_pirqs);
+                             ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                                 (1U << XENFEAT_hvm_pirqs) : 0);
                 break;
             }
 #endif
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index cdd93c1..da6f4f2 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -283,12 +283,14 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_IOMMU           (1U<<_XEN_X86_EMU_IOMMU)
 #define _XEN_X86_EMU_PIT            8
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
+#define _XEN_X86_EMU_USE_PIRQ       9
+#define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
     uint32_t emulation_flags;
 };
 #endif
-- 
2.7.4 (Apple Git-66)




* [PATCH v3.1 02/15] xen/x86: fix return value of *_set_allocation functions
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
  2016-10-29  8:59 ` [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-29 22:11   ` Tim Deegan
  2016-10-29  8:59 ` [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Roger Pau Monne

Return should be an int.
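A minimal standalone illustration of the bug class being fixed (the function
names here are hypothetical, not the actual Xen functions): a negative errno
value stored in an unsigned return type survives a `!= 0` check but loses its
sign, so it cannot be propagated as an error code.

```c
#include <errno.h>

/* Stand-in for the *_set_allocation signature before this patch. */
static unsigned int set_allocation_unsigned(int fail)
{
    return fail ? -ENOMEM : 0;   /* -12 wraps to a large unsigned value */
}

/* Stand-in for the signature after this patch. */
static int set_allocation_int(int fail)
{
    return fail ? -ENOMEM : 0;   /* callers can propagate the errno */
}
```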

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
Changes since v2:
 - Also fix the callers to treat the return value as an int.
 - Don't convert the pages parameter to unsigned long.
---
 xen/arch/x86/mm/hap/hap.c       |  8 +++-----
 xen/arch/x86/mm/shadow/common.c | 12 +++++-------
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 3218fa2..f099e94 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -334,7 +334,7 @@ hap_get_allocation(struct domain *d)
 
 /* Set the pool of pages to the required number of pages.
  * Returns 0 for success, non-zero for failure. */
-static unsigned int
+static int
 hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
 {
     struct page_info *pg;
@@ -468,14 +468,12 @@ int hap_enable(struct domain *d, u32 mode)
     old_pages = d->arch.paging.hap.total_pages;
     if ( old_pages == 0 )
     {
-        unsigned int r;
         paging_lock(d);
-        r = hap_set_allocation(d, 256, NULL);
-        if ( r != 0 )
+        rv = hap_set_allocation(d, 256, NULL);
+        if ( rv != 0 )
         {
             hap_set_allocation(d, 0, NULL);
             paging_unlock(d);
-            rv = -ENOMEM;
             goto out;
         }
         paging_unlock(d);
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 21607bf..065bdc7 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1613,9 +1613,9 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
  * Input will be rounded up to at least shadow_min_acceptable_pages(),
  * plus space for the p2m table.
  * Returns 0 for success, non-zero for failure. */
-static unsigned int sh_set_allocation(struct domain *d,
-                                      unsigned int pages,
-                                      int *preempted)
+static int sh_set_allocation(struct domain *d,
+                             unsigned int pages,
+                             int *preempted)
 {
     struct page_info *sp;
     unsigned int lower_bound;
@@ -3151,13 +3151,11 @@ int shadow_enable(struct domain *d, u32 mode)
     old_pages = d->arch.paging.shadow.total_pages;
     if ( old_pages == 0 )
     {
-        unsigned int r;
         paging_lock(d);
-        r = sh_set_allocation(d, 1024, NULL); /* Use at least 4MB */
-        if ( r != 0 )
+        rv = sh_set_allocation(d, 1024, NULL); /* Use at least 4MB */
+        if ( rv != 0 )
         {
             sh_set_allocation(d, 0, NULL);
-            rv = -ENOMEM;
             goto out_locked;
         }
         paging_unlock(d);
-- 
2.7.4 (Apple Git-66)




* [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
  2016-10-29  8:59 ` [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
  2016-10-29  8:59 ` [PATCH v3.1 02/15] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-31 16:34   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 04/15] xen/x86: assert that local_events_need_delivery is not called by " Roger Pau Monne
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, Roger Pau Monne

... and using the "preempted" parameter. The solution relies on checking
softirq_pending when the current vcpu belongs to the idle domain. If such
preemption happens, the caller should then call process_pending_softirqs in
order to drain the pending softirqs, and call {sh/hap}_set_allocation again
to continue with its execution.

This allows us to call *_set_allocation() when building domain 0.
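The caller pattern described above can be sketched as a standalone simulation
(set_allocation, process_pending_softirqs and the work counter are stand-ins
mirroring the Xen names, not hypervisor code):

```c
#include <stdbool.h>

/* Pretend the allocation needs three chunks of work to complete. */
static int work_left = 3;

/* Stand-in for {sh,hap}_set_allocation: performs one chunk and reports
 * preemption (a pending softirq) while work remains. */
static int set_allocation(bool *preempted)
{
    if ( --work_left > 0 )
        *preempted = true;
    return 0;
}

/* Stand-in: the real function drains any pending softirqs. */
static void process_pending_softirqs(void) { }

/* The caller retries until the allocation completes un-preempted. */
static int build_dom0_paging(void)
{
    bool preempted;

    do {
        preempted = false;
        set_allocation(&preempted);
        process_pending_softirqs();
    } while ( preempted );

    return work_left;   /* 0 once all chunks are done */
}
```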

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Fix commit message.
---
 xen/arch/x86/mm/hap/hap.c       | 4 +++-
 xen/arch/x86/mm/shadow/common.c | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index f099e94..0645521 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
             break;
 
         /* Check to see if we need to yield and try again */
-        if ( preempted && hypercall_preempt_check() )
+        if ( preempted &&
+             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
+                                      hypercall_preempt_check()) )
         {
             *preempted = 1;
             return 0;
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 065bdc7..b2e99c2 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1679,7 +1679,9 @@ static int sh_set_allocation(struct domain *d,
             break;
 
         /* Check to see if we need to yield and try again */
-        if ( preempted && hypercall_preempt_check() )
+        if ( preempted &&
+             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
+                                      hypercall_preempt_check()) )
         {
             *preempted = 1;
             return 0;
-- 
2.7.4 (Apple Git-66)




* [PATCH v3.1 04/15] xen/x86: assert that local_events_need_delivery is not called by the idle domain
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (2 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-31 16:37   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation Roger Pau Monne
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

It doesn't make sense since the idle domain doesn't receive any events. This
is relevant in order to be sure that hypercall_preempt_check is not called
by the idle domain, which would happen previously when calling
{hap/sh}_set_allocation during domain 0 creation.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Expand commit message.
---
 xen/include/asm-x86/event.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index a82062e..d589d6f 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -23,6 +23,9 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
+
+    ASSERT(!is_idle_vcpu(v));
+
     return (has_hvm_container_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
-- 
2.7.4 (Apple Git-66)




* [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (3 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 04/15] xen/x86: assert that local_events_need_delivery is not called by " Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-31 16:42   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 06/15] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, Roger Pau Monne

... and remove hap_set_alloc_for_pvh_dom0. While there also change the last
parameter of the {hap/sh}_set_allocation functions to be a boolean.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: George Dunlap <george.dunlap@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
Changes since v2:
 - Convert the preempt parameter into a bool.
 - Fix Dom0 builder comment to reflect that paging.mode should be correct
   before calling paging_set_allocation.

Changes since RFC:
 - Make paging_set_allocation preemptable.
 - Move comments.
---
 xen/arch/x86/domain_build.c     | 21 +++++++++++++++------
 xen/arch/x86/mm/hap/hap.c       | 22 +++++-----------------
 xen/arch/x86/mm/paging.c        | 19 ++++++++++++++++++-
 xen/arch/x86/mm/shadow/common.c | 15 +++++----------
 xen/include/asm-x86/hap.h       |  4 ++--
 xen/include/asm-x86/paging.h    |  7 +++++++
 xen/include/asm-x86/shadow.h    | 10 +++++++++-
 7 files changed, 61 insertions(+), 37 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 0a02d65..17f8e91 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -35,7 +35,6 @@
 #include <asm/setup.h>
 #include <asm/bzimage.h> /* for bzimage_parse */
 #include <asm/io_apic.h>
-#include <asm/hap.h>
 #include <asm/hpet.h>
 
 #include <public/version.h>
@@ -1383,15 +1382,25 @@ int __init construct_dom0(
                          nr_pages);
     }
 
-    if ( is_pvh_domain(d) )
-        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
-
     /*
-     * We enable paging mode again so guest_physmap_add_page will do the
-     * right thing for us.
+     * We enable paging mode again so guest_physmap_add_page and
+     * paging_set_allocation will do the right thing for us.
      */
     d->arch.paging.mode = save_pvh_pg_mode;
 
+    if ( is_pvh_domain(d) )
+    {
+        bool preempted;
+
+        do {
+            preempted = false;
+            paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                                  &preempted);
+            process_pending_softirqs();
+        } while ( preempted );
+    }
+
+
     /* Write the phys->machine and machine->phys table entries. */
     for ( pfn = 0; pfn < count; pfn++ )
     {
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 0645521..b930619 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -334,8 +334,7 @@ hap_get_allocation(struct domain *d)
 
 /* Set the pool of pages to the required number of pages.
  * Returns 0 for success, non-zero for failure. */
-static int
-hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
+int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
 {
     struct page_info *pg;
 
@@ -383,7 +382,7 @@ hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
              (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
                                       hypercall_preempt_check()) )
         {
-            *preempted = 1;
+            *preempted = true;
             return 0;
         }
     }
@@ -563,7 +562,7 @@ void hap_final_teardown(struct domain *d)
     paging_unlock(d);
 }
 
-void hap_teardown(struct domain *d, int *preempted)
+void hap_teardown(struct domain *d, bool *preempted)
 {
     struct vcpu *v;
     mfn_t mfn;
@@ -611,7 +610,8 @@ out:
 int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
                XEN_GUEST_HANDLE_PARAM(void) u_domctl)
 {
-    int rc, preempted = 0;
+    int rc;
+    bool preempted = false;
 
     switch ( sc->op )
     {
@@ -638,18 +638,6 @@ int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
     }
 }
 
-void __init hap_set_alloc_for_pvh_dom0(struct domain *d,
-                                       unsigned long hap_pages)
-{
-    int rc;
-
-    paging_lock(d);
-    rc = hap_set_allocation(d, hap_pages, NULL);
-    paging_unlock(d);
-
-    BUG_ON(rc);
-}
-
 static const struct paging_mode hap_paging_real_mode;
 static const struct paging_mode hap_paging_protected_mode;
 static const struct paging_mode hap_paging_pae_mode;
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index cc44682..5d80b03 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -809,7 +809,8 @@ long paging_domctl_continuation(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 /* Call when destroying a domain */
 int paging_teardown(struct domain *d)
 {
-    int rc, preempted = 0;
+    int rc;
+    bool preempted = false;
 
     if ( hap_enabled(d) )
         hap_teardown(d, &preempted);
@@ -954,6 +955,22 @@ void paging_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
         safe_write_pte(p, new);
 }
 
+int paging_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
+{
+    int rc;
+
+    ASSERT(paging_mode_enabled(d));
+
+    paging_lock(d);
+    if ( hap_enabled(d) )
+        rc = hap_set_allocation(d, pages, preempted);
+    else
+        rc = sh_set_allocation(d, pages, preempted);
+    paging_unlock(d);
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index b2e99c2..4933651 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1609,13 +1609,7 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
     paging_unlock(d);
 }
 
-/* Set the pool of shadow pages to the required number of pages.
- * Input will be rounded up to at least shadow_min_acceptable_pages(),
- * plus space for the p2m table.
- * Returns 0 for success, non-zero for failure. */
-static int sh_set_allocation(struct domain *d,
-                             unsigned int pages,
-                             int *preempted)
+int sh_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
 {
     struct page_info *sp;
     unsigned int lower_bound;
@@ -1683,7 +1677,7 @@ static int sh_set_allocation(struct domain *d,
              (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
                                       hypercall_preempt_check()) )
         {
-            *preempted = 1;
+            *preempted = true;
             return 0;
         }
     }
@@ -3239,7 +3233,7 @@ int shadow_enable(struct domain *d, u32 mode)
     return rv;
 }
 
-void shadow_teardown(struct domain *d, int *preempted)
+void shadow_teardown(struct domain *d, bool *preempted)
 /* Destroy the shadow pagetables of this domain and free its shadow memory.
  * Should only be called for dying domains. */
 {
@@ -3876,7 +3870,8 @@ int shadow_domctl(struct domain *d,
                   xen_domctl_shadow_op_t *sc,
                   XEN_GUEST_HANDLE_PARAM(void) u_domctl)
 {
-    int rc, preempted = 0;
+    int rc;
+    bool preempted = false;
 
     switch ( sc->op )
     {
diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
index c613836..dedb4b1 100644
--- a/xen/include/asm-x86/hap.h
+++ b/xen/include/asm-x86/hap.h
@@ -38,7 +38,7 @@ int   hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
                  XEN_GUEST_HANDLE_PARAM(void) u_domctl);
 int   hap_enable(struct domain *d, u32 mode);
 void  hap_final_teardown(struct domain *d);
-void  hap_teardown(struct domain *d, int *preempted);
+void  hap_teardown(struct domain *d, bool *preempted);
 void  hap_vcpu_init(struct vcpu *v);
 int   hap_track_dirty_vram(struct domain *d,
                            unsigned long begin_pfn,
@@ -46,7 +46,7 @@ int   hap_track_dirty_vram(struct domain *d,
                            XEN_GUEST_HANDLE_64(uint8) dirty_bitmap);
 
 extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
-void hap_set_alloc_for_pvh_dom0(struct domain *d, unsigned long num_pages);
+int hap_set_allocation(struct domain *d, unsigned int pages, bool *preempted);
 
 #endif /* XEN_HAP_H */
 
diff --git a/xen/include/asm-x86/paging.h b/xen/include/asm-x86/paging.h
index 56eef6b..f83ed8b 100644
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -347,6 +347,13 @@ void pagetable_dying(struct domain *d, paddr_t gpa);
 void paging_dump_domain_info(struct domain *d);
 void paging_dump_vcpu_info(struct vcpu *v);
 
+/* Set the pool of shadow pages to the required number of pages.
+ * Input might be rounded up to at minimum amount of pages, plus
+ * space for the p2m table.
+ * Returns 0 for success, non-zero for failure. */
+int paging_set_allocation(struct domain *d, unsigned int pages,
+                          bool *preempted);
+
 #endif /* XEN_PAGING_H */
 
 /*
diff --git a/xen/include/asm-x86/shadow.h b/xen/include/asm-x86/shadow.h
index 6d0aefb..4822f89 100644
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -73,7 +73,7 @@ int shadow_domctl(struct domain *d,
                   XEN_GUEST_HANDLE_PARAM(void) u_domctl);
 
 /* Call when destroying a domain */
-void shadow_teardown(struct domain *d, int *preempted);
+void shadow_teardown(struct domain *d, bool *preempted);
 
 /* Call once all of the references to the domain have gone away */
 void shadow_final_teardown(struct domain *d);
@@ -83,6 +83,12 @@ void sh_remove_shadows(struct domain *d, mfn_t gmfn, int fast, int all);
 /* Discard _all_ mappings from the domain's shadows. */
 void shadow_blow_tables_per_domain(struct domain *d);
 
+/* Set the pool of shadow pages to the required number of pages.
+ * Input will be rounded up to at least shadow_min_acceptable_pages(),
+ * plus space for the p2m table.
+ * Returns 0 for success, non-zero for failure. */
+int sh_set_allocation(struct domain *d, unsigned int pages, bool *preempted);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_teardown(d, p) ASSERT(is_pv_domain(d))
@@ -91,6 +97,8 @@ void shadow_blow_tables_per_domain(struct domain *d);
     ({ ASSERT(is_pv_domain(d)); -EOPNOTSUPP; })
 #define shadow_track_dirty_vram(d, begin_pfn, nr, bitmap) \
     ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
+#define sh_set_allocation(d, pages, preempted) \
+    ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
 
 static inline void sh_remove_shadows(struct domain *d, mfn_t gmfn,
                                      bool_t fast, bool_t all) {}
-- 
2.7.4 (Apple Git-66)




* [PATCH v3.1 06/15] xen/x86: split the setup of Dom0 permissions to a function
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (4 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-31 16:44   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally Roger Pau Monne
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

So that it can also be used by the PVH-specific domain builder. This is just
code motion; it should not introduce any functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Fix comment style.
 - Convert i to unsigned int.
 - Restore previous BUG_ON in case of failure (instead of panic).
 - Remove unneeded rc initializer.
---
 xen/arch/x86/domain_build.c | 160 +++++++++++++++++++++++---------------------
 1 file changed, 83 insertions(+), 77 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 17f8e91..1e557b9 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -869,6 +869,88 @@ static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
     unmap_domain_page(l4start);
 }
 
+static int __init setup_permissions(struct domain *d)
+{
+    unsigned long mfn;
+    unsigned int i;
+    int rc;
+
+    /* The hardware domain is initially permitted full I/O capabilities. */
+    rc = ioports_permit_access(d, 0, 0xFFFF);
+    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
+    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
+
+    /* Modify I/O port access permissions. */
+
+    /* Master Interrupt Controller (PIC). */
+    rc |= ioports_deny_access(d, 0x20, 0x21);
+    /* Slave Interrupt Controller (PIC). */
+    rc |= ioports_deny_access(d, 0xA0, 0xA1);
+    /* Interval Timer (PIT). */
+    rc |= ioports_deny_access(d, 0x40, 0x43);
+    /* PIT Channel 2 / PC Speaker Control. */
+    rc |= ioports_deny_access(d, 0x61, 0x61);
+    /* ACPI PM Timer. */
+    if ( pmtmr_ioport )
+        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
+    /* PCI configuration space (NB. 0xcf8 has special treatment). */
+    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
+    /* Command-line I/O ranges. */
+    process_dom0_ioports_disable(d);
+
+    /* Modify I/O memory access permissions. */
+
+    /* Local APIC. */
+    if ( mp_lapic_addr != 0 )
+    {
+        mfn = paddr_to_pfn(mp_lapic_addr);
+        rc |= iomem_deny_access(d, mfn, mfn);
+    }
+    /* I/O APICs. */
+    for ( i = 0; i < nr_ioapics; i++ )
+    {
+        mfn = paddr_to_pfn(mp_ioapics[i].mpc_apicaddr);
+        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
+            rc |= iomem_deny_access(d, mfn, mfn);
+    }
+    /* MSI range. */
+    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
+                            paddr_to_pfn(MSI_ADDR_BASE_LO +
+                                         MSI_ADDR_DEST_ID_MASK));
+    /* HyperTransport range. */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
+                                paddr_to_pfn((1ULL << 40) - 1));
+
+    /* Remove access to E820_UNUSABLE I/O regions above 1MB. */
+    for ( i = 0; i < e820.nr_map; i++ )
+    {
+        unsigned long sfn, efn;
+        sfn = max_t(unsigned long, paddr_to_pfn(e820.map[i].addr), 0x100ul);
+        efn = paddr_to_pfn(e820.map[i].addr + e820.map[i].size - 1);
+        if ( (e820.map[i].type == E820_UNUSABLE) &&
+             (e820.map[i].size != 0) &&
+             (sfn <= efn) )
+            rc |= iomem_deny_access(d, sfn, efn);
+    }
+
+    /* Prevent access to HPET */
+    if ( hpet_address )
+    {
+        u8 prot_flags = hpet_flags & ACPI_HPET_PAGE_PROTECT_MASK;
+
+        mfn = paddr_to_pfn(hpet_address);
+        if ( prot_flags == ACPI_HPET_PAGE_PROTECT4 )
+            rc |= iomem_deny_access(d, mfn, mfn);
+        else if ( prot_flags == ACPI_HPET_PAGE_PROTECT64 )
+            rc |= iomem_deny_access(d, mfn, mfn + 15);
+        else if ( ro_hpet )
+            rc |= rangeset_add_singleton(mmio_ro_ranges, mfn);
+    }
+
+    return rc;
+}
+
 int __init construct_dom0(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
@@ -1539,83 +1621,7 @@ int __init construct_dom0(
     if ( test_bit(XENFEAT_supervisor_mode_kernel, parms.f_required) )
         panic("Dom0 requires supervisor-mode execution");
 
-    rc = 0;
-
-    /* The hardware domain is initially permitted full I/O capabilities. */
-    rc |= ioports_permit_access(d, 0, 0xFFFF);
-    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
-    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
-
-    /*
-     * Modify I/O port access permissions.
-     */
-    /* Master Interrupt Controller (PIC). */
-    rc |= ioports_deny_access(d, 0x20, 0x21);
-    /* Slave Interrupt Controller (PIC). */
-    rc |= ioports_deny_access(d, 0xA0, 0xA1);
-    /* Interval Timer (PIT). */
-    rc |= ioports_deny_access(d, 0x40, 0x43);
-    /* PIT Channel 2 / PC Speaker Control. */
-    rc |= ioports_deny_access(d, 0x61, 0x61);
-    /* ACPI PM Timer. */
-    if ( pmtmr_ioport )
-        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
-    /* PCI configuration space (NB. 0xcf8 has special treatment). */
-    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
-    /* Command-line I/O ranges. */
-    process_dom0_ioports_disable(d);
-
-    /*
-     * Modify I/O memory access permissions.
-     */
-    /* Local APIC. */
-    if ( mp_lapic_addr != 0 )
-    {
-        mfn = paddr_to_pfn(mp_lapic_addr);
-        rc |= iomem_deny_access(d, mfn, mfn);
-    }
-    /* I/O APICs. */
-    for ( i = 0; i < nr_ioapics; i++ )
-    {
-        mfn = paddr_to_pfn(mp_ioapics[i].mpc_apicaddr);
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
-            rc |= iomem_deny_access(d, mfn, mfn);
-    }
-    /* MSI range. */
-    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
-                            paddr_to_pfn(MSI_ADDR_BASE_LO +
-                                         MSI_ADDR_DEST_ID_MASK));
-    /* HyperTransport range. */
-    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
-                                paddr_to_pfn((1ULL << 40) - 1));
-
-    /* Remove access to E820_UNUSABLE I/O regions above 1MB. */
-    for ( i = 0; i < e820.nr_map; i++ )
-    {
-        unsigned long sfn, efn;
-        sfn = max_t(unsigned long, paddr_to_pfn(e820.map[i].addr), 0x100ul);
-        efn = paddr_to_pfn(e820.map[i].addr + e820.map[i].size - 1);
-        if ( (e820.map[i].type == E820_UNUSABLE) &&
-             (e820.map[i].size != 0) &&
-             (sfn <= efn) )
-            rc |= iomem_deny_access(d, sfn, efn);
-    }
-
-    /* Prevent access to HPET */
-    if ( hpet_address )
-    {
-        u8 prot_flags = hpet_flags & ACPI_HPET_PAGE_PROTECT_MASK;
-
-        mfn = paddr_to_pfn(hpet_address);
-        if ( prot_flags == ACPI_HPET_PAGE_PROTECT4 )
-            rc |= iomem_deny_access(d, mfn, mfn);
-        else if ( prot_flags == ACPI_HPET_PAGE_PROTECT64 )
-            rc |= iomem_deny_access(d, mfn, mfn + 15);
-        else if ( ro_hpet )
-            rc |= rangeset_add_singleton(mmio_ro_ranges, mfn);
-    }
-
+    rc = setup_permissions(d);
     BUG_ON(rc != 0);
 
     if ( elf_check_broken(&elf) )
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (5 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 06/15] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-10-31 16:47   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions Roger Pau Monne
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper,
	Suravee Suthikulpanit, Roger Pau Monne

Do the PCI device scan unconditionally instead of tying it to the presence
of an IOMMU. This avoids doing the scan in two different places, and
although it's only required for PVHv2 guests (which also require an IOMMU),
it makes the code slightly easier to follow.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Feng Wu <feng.wu@intel.com>
---
Changes since v2:
 - Expand the commit message.
---
 xen/arch/x86/setup.c                        | 2 ++
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 3 ++-
 xen/drivers/passthrough/vtd/iommu.c         | 2 --
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index b130671..72e7f24 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     early_msi_init();
 
+    scan_pci_devices();
+
     iommu_setup();    /* setup iommu if available */
 
     smp_prepare_cpus(max_cpus);
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 94a25a4..d12575d 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
 
     if ( !amd_iommu_perdev_intremap )
         printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
-    return scan_pci_devices();
+
+    return 0;
 }
 
 static int allocate_domain_resources(struct domain_iommu *hd)
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 48f120b..919993e 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2299,8 +2299,6 @@ int __init intel_vtd_setup(void)
     P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
-    scan_pci_devices();
-
     ret = init_vtd_hw();
     if ( ret )
         goto error;
-- 
2.7.4 (Apple Git-66)



* [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (6 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-11-04  9:16   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, Roger Pau Monne

RMRR regions are currently only mapped for the hardware domain or for
non-translated domains that use an IOMMU. Fix this by making sure
set_identity_p2m_entry establishes the appropriate IOMMU mappings, and that
clear_identity_p2m_entry also removes them.
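
For clarity, the decision logic of the reworked set_identity_p2m_entry can
be modelled stand-alone as follows (the enum values and names are
illustrative placeholders, not the real Xen p2m API):

```c
#include <assert.h>
#include <stdbool.h>

enum p2mt { P2M_INVALID, P2M_MMIO_DM, P2M_MMIO_DIRECT, P2M_RAM };

/*
 * Returns 0 on success, -1 (EBUSY) on conflict; sets *iommu_map when the
 * caller must also install an IOMMU identity mapping (the real code only
 * does so when !iommu_use_hap_pt(d)).
 */
static int identity_map_decision(enum p2mt cur, bool access_matches,
                                 bool relaxed, bool *iommu_map)
{
    *iommu_map = false;
    switch ( cur )
    {
    case P2M_INVALID:
    case P2M_MMIO_DM:
        /* New entry: the real code sets the p2m entry here first. */
        /* fallthrough */
    case P2M_MMIO_DIRECT:
        if ( cur == P2M_MMIO_DIRECT && !access_matches )
            /* Already mapped with a different access type. */
            return relaxed ? 0 : -1;
        *iommu_map = true;
        return 0;
    default:
        /* gfn already mapped to something else (e.g. RAM). */
        return relaxed ? 0 : -1;
    }
}
```

With this rework, the IOMMU mapping is no longer restricted to the
hardware-domain special case that the removed "PVH fixme" comment covered.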

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/mm/p2m.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 6a45185..da3e937 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1049,22 +1049,29 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
 
     mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
 
-    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
+    switch ( p2mt )
+    {
+    case p2m_invalid:
+    case p2m_mmio_dm:
         ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
                             p2m_mmio_direct, p2ma);
-    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
-    {
-        ret = 0;
-        /*
-         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
-         * but iomem regions are not mapped with IOMMU. This makes sure that
-         * RMRRs are correctly mapped with IOMMU.
-         */
-        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
+        if ( ret )
+            break;
+        /* fallthrough */
+    case p2m_mmio_direct:
+        if ( p2mt == p2m_mmio_direct && a != p2ma )
+        {
+            printk(XENLOG_G_WARNING
+                   "Cannot setup identity map d%d:%lx, already mapped with "
+                   "different access type (current: %d, requested: %d).\n",
+                   d->domain_id, gfn, a, p2ma);
+            ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
+            break;
+        }
+        if ( !iommu_use_hap_pt(d) )
             ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
-    }
-    else
-    {
+        break;
+    default:
         if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
             ret = 0;
         else
@@ -1073,6 +1080,7 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
                "Cannot setup identity map d%d:%lx,"
                " gfn already mapped to %lx.\n",
                d->domain_id, gfn, mfn_x(mfn));
+        break;
     }
 
     gfn_unlock(p2m, gfn, 0);
@@ -1149,6 +1157,9 @@ int clear_identity_p2m_entry(struct domain *d, unsigned long gfn)
     {
         ret = p2m_set_entry(p2m, gfn, INVALID_MFN, PAGE_ORDER_4K,
                             p2m_invalid, p2m->default_access);
+        if ( !iommu_use_hap_pt(d) )
+            ret = iommu_unmap_page(d, gfn) ? : ret;
+
         gfn_unlock(p2m, gfn, 0);
     }
     else
-- 
2.7.4 (Apple Git-66)



* [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (7 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-11-04  9:19   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Allow the use of both the emulated local APIC and IO APIC for the hardware
domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Allow all PV guests to use the emulated PIT.

Changes since RFC:
 - Move the emulation flags check to a separate helper.
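
The helper's policy can be sketched stand-alone as below (the EMU_* flag
values here are illustrative placeholders, not the real XEN_X86_EMU_* ABI
constants):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMU_LAPIC  (1u << 0)
#define EMU_IOAPIC (1u << 1)
#define EMU_PIT    (1u << 2)
#define EMU_ALL    0xffffffffu

static bool emulation_flags_ok(bool hvm, bool hwdom, uint32_t emflags)
{
    if ( hvm )
    {
        /* A PVHv2 hardware domain must use exactly PIT+LAPIC+IOAPIC. */
        if ( hwdom && emflags != (EMU_PIT | EMU_LAPIC | EMU_IOAPIC) )
            return false;
        /* Other HVM guests: no emulation, everything, or just the LAPIC. */
        if ( !hwdom && emflags &&
             emflags != EMU_ALL && emflags != EMU_LAPIC )
            return false;
    }
    else if ( emflags != 0 && emflags != EMU_PIT )
        return false;   /* PV or classic PVH may only add the PIT. */

    return true;
}
```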
---
 xen/arch/x86/domain.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 1bd5eb6..42a4923 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
         xfree(v->arch.pv_vcpu.trap_ctxt);
 }
 
+static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
+{
+
+    if ( is_hvm_domain(d) )
+    {
+        if ( is_hardware_domain(d) &&
+             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
+            return false;
+        if ( !is_hardware_domain(d) && emflags &&
+             emflags != XEN_X86_EMU_ALL && emflags != XEN_X86_EMU_LAPIC )
+            return false;
+    }
+    else if ( emflags != 0 && emflags != XEN_X86_EMU_PIT )
+    {
+        /* PV or classic PVH. */
+        return false;
+    }
+
+    return true;
+}
+
 int arch_domain_create(struct domain *d, unsigned int domcr_flags,
                        struct xen_arch_domainconfig *config)
 {
@@ -558,11 +579,7 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags,
             return -EINVAL;
         }
 
-        /* PVHv2 guests can request emulated APIC. */
-        if ( emflags &&
-            (is_hvm_domain(d) ? ((emflags != XEN_X86_EMU_ALL) &&
-                                 (emflags != XEN_X86_EMU_LAPIC)) :
-                                (emflags != XEN_X86_EMU_PIT)) )
+        if ( !emulation_flags_ok(d, emflags) )
         {
             printk(XENLOG_G_ERR "d%d: Xen does not allow %s domain creation "
                    "with the current selection of emulators: %#x\n",
-- 
2.7.4 (Apple Git-66)



* [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (8 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-11-11 16:53   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO Roger Pau Monne
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Split the Dom0 builder into two different functions, one for PV (and classic
PVH), and another one for PVHv2. Introduce a new command line parameter
called 'dom0' that can be used to request the creation of a PVHv2 Dom0 by
setting the 'hvm' sub-option.
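
The sub-option parsing mirrors the parse_dom0_param helper added below; as
a stand-alone sketch (with illustrative global flags in place of the real
__initdata variables):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool dom0_hvm, dom0_shadow;

/* Split a comma-separated option string such as "hvm,shadow" in place. */
static void parse_dom0_param(char *s)
{
    char *ss;

    do {
        ss = strchr(s, ',');
        if ( ss )
            *ss = '\0';

        if ( !strcmp(s, "hvm") )
            dom0_hvm = true;
        else if ( !strcmp(s, "shadow") )
            dom0_shadow = true;

        s = ss + 1;
    } while ( ss );
}
```

Booting Xen with `dom0=hvm,shadow` would then select a PVHv2 Dom0 using
shadow paging.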

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Fix coding style.
 - Introduce a new dom0 option that allows passing several parameters.
   Currently supported ones are hvm and shadow.

Changes since RFC:
 - Add documentation for the new command line option.
 - Simplify the logic in construct_dom0.
---
 docs/misc/xen-command-line.markdown | 17 ++++++++++++++++
 xen/arch/x86/domain_build.c         | 28 ++++++++++++++++++++++----
 xen/arch/x86/setup.c                | 39 +++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/setup.h         |  6 ++++++
 4 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 87c3023..006e90c 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -656,6 +656,23 @@ affinities to prefer but be not limited to the specified node(s).
 
 Pin dom0 vcpus to their respective pcpus
 
+### dom0
+> `= List of [ hvm | shadow ]`
+
+> Sub-options:
+
+> `hvm`
+
+> Default: `false`
+
+Flag that makes a dom0 boot in PVHv2 mode.
+
+> `shadow`
+
+> Default: `false`
+
+Flag that makes a dom0 use shadow paging.
+
 ### dom0pvh
 > `= <boolean>`
 
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 1e557b9..2c9ebf2 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -191,10 +191,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
 }
 
 #ifdef CONFIG_SHADOW_PAGING
-static bool_t __initdata opt_dom0_shadow;
+bool __initdata opt_dom0_shadow;
 boolean_param("dom0_shadow", opt_dom0_shadow);
-#else
-#define opt_dom0_shadow 0
 #endif
 
 static char __initdata opt_dom0_ioports_disable[200] = "";
@@ -951,7 +949,7 @@ static int __init setup_permissions(struct domain *d)
     return rc;
 }
 
-int __init construct_dom0(
+static int __init construct_dom0_pv(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
     module_t *initrd,
@@ -1655,6 +1653,28 @@ out:
     return rc;
 }
 
+static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
+                                     unsigned long image_headroom,
+                                     module_t *initrd,
+                                     void *(*bootstrap_map)(const module_t *),
+                                     char *cmdline)
+{
+
+    printk("** Building a PVH Dom0 **\n");
+
+    return 0;
+}
+
+int __init construct_dom0(struct domain *d, const module_t *image,
+                          unsigned long image_headroom, module_t *initrd,
+                          void *(*bootstrap_map)(const module_t *),
+                          char *cmdline)
+{
+
+    return (is_hvm_domain(d) ? construct_dom0_hvm : construct_dom0_pv)
+           (d, image, image_headroom, initrd, bootstrap_map, cmdline);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 72e7f24..64d4c89 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -67,6 +67,16 @@ unsigned long __read_mostly cr4_pv32_mask;
 static bool_t __initdata opt_dom0pvh;
 boolean_param("dom0pvh", opt_dom0pvh);
 
+/*
+ * List of parameters that affect Dom0 creation:
+ *
+ *  - hvm               Create a PVHv2 Dom0.
+ *  - shadow            Use shadow paging for Dom0.
+ */
+static void parse_dom0_param(char *s);
+custom_param("dom0", parse_dom0_param);
+static bool __initdata dom0_hvm;
+
 /* **** Linux config option: propagated to domain0. */
 /* "acpi=off":    Disables both ACPI table parsing and interpreter. */
 /* "acpi=force":  Override the disable blacklist.                   */
@@ -187,6 +197,27 @@ static void __init parse_acpi_param(char *s)
     }
 }
 
+static void __init parse_dom0_param(char *s)
+{
+    char *ss;
+
+    do {
+
+        ss = strchr(s, ',');
+        if ( ss )
+            *ss = '\0';
+
+        if ( !strcmp(s, "hvm") )
+            dom0_hvm = true;
+#ifdef CONFIG_SHADOW_PAGING
+        else if ( !strcmp(s, "shadow") )
+            opt_dom0_shadow = true;
+#endif
+
+        s = ss + 1;
+    } while ( ss );
+}
+
 static const module_t *__initdata initial_images;
 static unsigned int __initdata nr_initial_images;
 
@@ -1543,6 +1574,14 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( opt_dom0pvh )
         domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
 
+    if ( dom0_hvm )
+    {
+        domcr_flags |= DOMCRF_hvm |
+                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
+                         DOMCRF_hap : 0);
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+    }
+
     /* Create initial domain 0. */
     dom0 = domain_create(0, domcr_flags, 0, &config);
     if ( IS_ERR(dom0) || (alloc_dom0_vcpu0(dom0) == NULL) )
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index c65b79c..c4179d1 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -57,4 +57,10 @@ extern uint8_t kbd_shift_flags;
 extern unsigned long highmem_start;
 #endif
 
+#ifdef CONFIG_SHADOW_PAGING
+extern bool opt_dom0_shadow;
+#else
+#define opt_dom0_shadow 0
+#endif
+
 #endif
-- 
2.7.4 (Apple Git-66)



* [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (9 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-11-11 16:58   ` Jan Beulich
  2016-11-11 20:17   ` Konrad Rzeszutek Wilk
  2016-10-29  8:59 ` [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
                   ` (4 subsequent siblings)
  15 siblings, 2 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, Roger Pau Monne

Current {un}map_mmio_regions implementation has a maximum number of loops to
perform before giving up and returning to the caller. This is an issue when
mapping large MMIO regions when building the hardware domain. In order to
solve it, introduce a wrapper around {un}map_mmio_regions that takes care of
calling process_pending_softirqs between consecutive {un}map_mmio_regions
calls.
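
The retry pattern can be sketched stand-alone as below, with a mock mapper
standing in for {un}map_mmio_regions (which return 0 when done, >0 for the
number of pages processed before hitting their internal iteration limit,
or <0 on error):

```c
#include <assert.h>

#define CHUNK 4   /* illustrative per-call limit of the mock mapper */

static int mock_map_regions(unsigned long pfn, unsigned long nr_pages)
{
    /* Map at most CHUNK pages per call; 0 means fully mapped. */
    return nr_pages > CHUNK ? CHUNK : 0;
}

static int modify_identity_mmio(unsigned long pfn, unsigned long nr_pages)
{
    int rc;

    for ( ; ; )
    {
        rc = mock_map_regions(pfn, nr_pages);
        if ( rc == 0 )
            break;
        if ( rc < 0 )
            break;              /* the real code logs a warning here */
        nr_pages -= rc;
        pfn += rc;
        /* the real code calls process_pending_softirqs() here */
    }

    return rc;
}
```

This lets arbitrarily large regions be processed without monopolising the
CPU, since softirqs get a chance to run between chunks.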

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
Changes since v2:
 - Pull the code into a separate patch.
 - Use an unbounded for loop with break conditions.
---
 xen/common/memory.c          | 26 ++++++++++++++++++++++++++
 xen/include/xen/p2m-common.h |  7 +++++++
 2 files changed, 33 insertions(+)

diff --git a/xen/common/memory.c b/xen/common/memory.c
index 21797ca..66c0484 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -1418,6 +1418,32 @@ int prepare_ring_for_helper(
     return 0;
 }
 
+int modify_identity_mmio(struct domain *d, unsigned long pfn,
+                         unsigned long nr_pages, bool map)
+{
+    int rc;
+
+    for ( ; ; )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, _gfn(pfn), nr_pages, _mfn(pfn));
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_WARNING
+                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
+                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
+            break;
+        }
+        nr_pages -= rc;
+        pfn += rc;
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/include/xen/p2m-common.h b/xen/include/xen/p2m-common.h
index 3be1e91..2ade0e7 100644
--- a/xen/include/xen/p2m-common.h
+++ b/xen/include/xen/p2m-common.h
@@ -65,4 +65,11 @@ long p2m_set_mem_access_multi(struct domain *d,
  */
 int p2m_get_mem_access(struct domain *d, gfn_t gfn, xenmem_access_t *access);
 
+/*
+ * Helper for {un}mapping large MMIO regions, it will take care of calling
+ * process_pending_softirqs between consecutive {un}map_mmio_regions calls.
+ */
+int modify_identity_mmio(struct domain *d, unsigned long pfn,
+                         unsigned long nr_pages, bool map);
+
 #endif /* _XEN_P2M_COMMON_H */
-- 
2.7.4 (Apple Git-66)



* [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (10 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-11-11 17:16   ` Jan Beulich
  2016-10-29  8:59 ` [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Craft the Dom0 e820 memory map and populate it.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Introduce get_order_from_bytes_floor as a local function to
   domain_build.c.
 - Remove extra asserts.
 - Make hvm_populate_memory_range return an error code instead of panicking.
 - Fix comments and printks.
 - Use ULL suffix instead of casting to uint64_t.
 - Rename hvm_setup_vmx_unrestricted_guest to
   hvm_setup_vmx_realmode_helpers.
 - Only subtract two pages from the memory calculation; they will be used
   by the MADT replacement.
 - Remove some comments.
 - Remove printing allocation information.
 - Don't stash any pages for the MADT, TSS or ident PT, those will be
   subtracted directly from RAM regions of the memory map.
 - Count the number of iterations before calling process_pending_softirqs
   when populating the memory map.
 - Move the initial call to process_pending_softirqs into construct_dom0,
   and remove the ones from construct_dom0_hvm and construct_dom0_pv.
 - Make memflags global so it can be shared between alloc_chunk and
   hvm_populate_memory_range.

Changes since RFC:
 - Use IS_ALIGNED instead of checking with PAGE_MASK.
 - Use the new %pB specifier in order to print sizes in human readable form.
 - Create a VM86 TSS for hardware that doesn't support unrestricted mode.
 - Subtract guest RAM for the identity page table and the VM86 TSS.
 - Split the creation of the unrestricted mode helper structures to a
   separate function.
 - Use preemption with paging_set_allocation.
 - Use get_order_from_bytes_floor.
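
The floor helper exploits that get_order_from_bytes rounds *up* to the
smallest order covering a size, so computing it on size+1 and subtracting
one yields the largest order that still fits within the size. A stand-alone
sketch (with a simplified get_order_from_bytes modelling the real
round-up semantics):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Smallest order such that (PAGE_SIZE << order) >= size (round up). */
static unsigned int get_order_from_bytes(uint64_t size)
{
    unsigned int order = 0;

    while ( ((uint64_t)1 << (order + PAGE_SHIFT)) < size )
        order++;
    return order;
}

/* Largest order such that (PAGE_SIZE << order) <= size (round down). */
static unsigned int get_order_from_bytes_floor(uint64_t size)
{
    unsigned int order = get_order_from_bytes(size + 1);

    return order > 0 ? order - 1 : order;
}
```

hvm_populate_memory_range uses this to pick the biggest allocation order
that does not overshoot the remaining size of the region being populated.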
---
 xen/arch/x86/domain_build.c | 275 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 266 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 2c9ebf2..ec1ac89 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -22,6 +22,7 @@
 #include <xen/compat.h>
 #include <xen/libelf.h>
 #include <xen/pfn.h>
+#include <xen/guest_access.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -43,6 +44,9 @@ static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
 static long __initdata dom0_max_nrpages = LONG_MAX;
 
+/* Size of the VM86 TSS for virtual 8086 mode to use. */
+#define HVM_VM86_TSS_SIZE   128
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -190,6 +194,14 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
     return setup_dom0_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
+static unsigned int __init get_order_from_bytes_floor(paddr_t size)
+{
+    unsigned int order;
+
+    order = get_order_from_bytes(size + 1);
+    return order > 0 ? order - 1 : order;
+}
+
 #ifdef CONFIG_SHADOW_PAGING
 bool __initdata opt_dom0_shadow;
 boolean_param("dom0_shadow", opt_dom0_shadow);
@@ -213,11 +225,12 @@ boolean_param("ro-hpet", ro_hpet);
 #define round_pgup(_p)    (((_p)+(PAGE_SIZE-1))&PAGE_MASK)
 #define round_pgdown(_p)  ((_p)&PAGE_MASK)
 
+static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
+
 static struct page_info * __init alloc_chunk(
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -302,7 +315,8 @@ static unsigned long __init compute_dom0_nr_pages(
             avail -= max_pdx >> s;
     }
 
-    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
+    need_paging = opt_dom0_shadow || (has_hvm_container_domain(d) &&
+                  (!iommu_hap_pt_share || !paging_mode_hap(d)));
     for ( ; ; need_paging = 0 )
     {
         nr_pages = dom0_nrpages;
@@ -334,7 +348,8 @@ static unsigned long __init compute_dom0_nr_pages(
         avail -= dom0_paging_pages(d, nr_pages);
     }
 
-    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
+    if ( is_pv_domain(d) &&
+         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
          ((dom0_min_nrpages <= 0) || (nr_pages > min_pages)) )
     {
         /*
@@ -545,11 +560,12 @@ static __init void pvh_map_all_iomem(struct domain *d, unsigned long nr_pages)
     ASSERT(nr_holes == 0);
 }
 
-static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
+static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
 {
     struct e820entry *entry, *entry_guest;
     unsigned int i;
     unsigned long pages, cur_pages = 0;
+    uint64_t start, end;
 
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
@@ -577,8 +593,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
             continue;
         }
 
-        *entry_guest = *entry;
-        pages = PFN_UP(entry_guest->size);
+        /*
+         * Make sure the start and length are aligned to PAGE_SIZE, because
+         * that's the minimum granularity of the 2nd stage translation.
+         */
+        start = ROUNDUP(entry->addr, PAGE_SIZE);
+        end = (entry->addr + entry->size) & PAGE_MASK;
+        if ( start >= end )
+            continue;
+
+        entry_guest->type = E820_RAM;
+        entry_guest->addr = start;
+        entry_guest->size = end - start;
+        pages = PFN_DOWN(entry_guest->size);
         if ( (cur_pages + pages) > nr_pages )
         {
             /* Truncate region */
@@ -1010,8 +1037,6 @@ static int __init construct_dom0_pv(
     BUG_ON(d->vcpu[0] == NULL);
     BUG_ON(v->is_initialised);
 
-    process_pending_softirqs();
-
     printk("*** LOADING DOMAIN 0 ***\n");
 
     d->max_pages = ~0U;
@@ -1637,7 +1662,7 @@ static int __init construct_dom0_pv(
         dom0_update_physmap(d, pfn, mfn, 0);
 
         pvh_map_all_iomem(d, nr_pages);
-        pvh_setup_e820(d, nr_pages);
+        hvm_setup_e820(d, nr_pages);
     }
 
     if ( d->domain_id == hardware_domid )
@@ -1653,15 +1678,246 @@ out:
     return rc;
 }
 
+/* Populate an HVM memory range using the biggest possible order. */
+static int __init hvm_populate_memory_range(struct domain *d, uint64_t start,
+                                             uint64_t size)
+{
+    unsigned int order, i = 0;
+    struct page_info *page;
+    int rc;
+#define MAP_MAX_ITER 64
+
+    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
+
+    order = MAX_ORDER;
+    while ( size != 0 )
+    {
+        order = min(get_order_from_bytes_floor(size), order);
+        page = alloc_domheap_pages(d, order, memflags);
+        if ( page == NULL )
+        {
+            if ( order == 0 && memflags )
+            {
+                /* Try again without any memflags. */
+                memflags = 0;
+                order = MAX_ORDER;
+                continue;
+            }
+            if ( order == 0 )
+            {
+                printk("Unable to allocate memory with order 0!\n");
+                return -ENOMEM;
+            }
+            order--;
+            continue;
+        }
+
+        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
+                                    _mfn(page_to_mfn(page)), order);
+        if ( rc != 0 )
+        {
+            printk("Failed to populate memory: [%" PRIx64 ",%" PRIx64 ") %d\n",
+                   start, start + ((1ULL) << (order + PAGE_SHIFT)), rc);
+            return -ENOMEM;
+        }
+        start += 1ULL << (order + PAGE_SHIFT);
+        size -= 1ULL << (order + PAGE_SHIFT);
+        if ( (++i % MAP_MAX_ITER) == 0 )
+            process_pending_softirqs();
+    }
+
+    return 0;
+#undef MAP_MAX_ITER
+}
+
+static int __init hvm_steal_ram(struct domain *d, unsigned long size,
+                                paddr_t limit, paddr_t *addr)
+{
+    unsigned int i;
+
+    for ( i = 1; i <= d->arch.nr_e820; i++ )
+    {
+        struct e820entry *entry = &d->arch.e820[d->arch.nr_e820 - i];
+
+        if ( entry->type != E820_RAM || entry->size < size )
+            continue;
+
+        /* Subtract from the beginning. */
+        if ( entry->addr + size < limit && entry->addr >= MB(1) )
+        {
+            *addr = entry->addr;
+            entry->addr += size;
+            entry->size -= size;
+            return 0;
+        }
+    }
+
+    return -ENOMEM;
+}
+
+static int __init hvm_setup_vmx_realmode_helpers(struct domain *d)
+{
+    p2m_type_t p2mt;
+    uint32_t rc, *ident_pt;
+    uint8_t *tss;
+    mfn_t mfn;
+    paddr_t gaddr;
+    unsigned int i;
+
+    /*
+     * Steal some space from the last found RAM region. One page will be
+     * used for the identity page tables, and the remaining space for the
+     * VM86 TSS. Note that after this not all e820 regions will be aligned
+     * to PAGE_SIZE.
+     */
+    if ( hvm_steal_ram(d, PAGE_SIZE + HVM_VM86_TSS_SIZE, ULONG_MAX, &gaddr) )
+    {
+        printk("Unable to find memory to stash the identity map and TSS\n");
+        return -ENOMEM;
+    }
+
+    /*
+     * Identity-map page table is required for running with CR0.PG=0
+     * when using Intel EPT. Create a 32-bit non-PAE page directory of
+     * superpages.
+     */
+    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                              &mfn, &p2mt, 0, &rc);
+    if ( ident_pt == NULL )
+    {
+        printk("Unable to map identity page tables\n");
+        return -ENOMEM;
+    }
+    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
+        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
+                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+    unmap_domain_page(ident_pt);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
+    gaddr += PAGE_SIZE;
+    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
+
+    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                         &mfn, &p2mt, 0, &rc);
+    if ( tss == NULL )
+    {
+        printk("Unable to map VM86 TSS area\n");
+        /* Non-fatal: the guest can still boot without VM86 TSS support. */
+        return 0;
+    }
+
+    memset(tss, 0, HVM_VM86_TSS_SIZE);
+    unmap_domain_page(tss);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
+
+    return 0;
+}
+
+static int __init hvm_setup_p2m(struct domain *d)
+{
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    unsigned long nr_pages;
+    int i, rc;
+    bool preempted;
+
+    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
+
+    hvm_setup_e820(d, nr_pages);
+    do {
+        preempted = false;
+        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                              &preempted);
+        process_pending_softirqs();
+    } while ( preempted );
+
+    /*
+     * Special treatment for memory < 1MB:
+     *  - Copy the data in e820 regions marked as RAM (BDA, BootSector...).
+     *  - Identity map everything else.
+     * NB: all this only makes sense if booted from legacy BIOSes.
+     * NB2: regions marked as RAM in the memory map are backed by RAM pages
+     * in the p2m, and the original data is copied over. This is done because
+     * at least FreeBSD places the AP boot trampoline in a RAM region found
+     * below the first MB, and the real-mode emulator found in Xen cannot
+     * deal with code that resides in guest pages marked as MMIO. This can
+     * cause problems if the memory map is not correct, and for example the
+     * EBDA or the video ROM region is marked as RAM.
+     */
+    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
+    if ( rc )
+    {
+        printk("Failed to identity map low 1MB: %d\n", rc);
+        return rc;
+    }
+
+    /* Populate memory map. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        rc = hvm_populate_memory_range(d, d->arch.e820[i].addr,
+                                       d->arch.e820[i].size);
+        if ( rc )
+            return rc;
+        if ( d->arch.e820[i].addr < MB(1) )
+        {
+            unsigned long end = min_t(unsigned long,
+                            d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
+
+            saved_current = current;
+            set_current(v);
+            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
+                                        maddr_to_virt(d->arch.e820[i].addr),
+                                        end - d->arch.e820[i].addr);
+            set_current(saved_current);
+            if ( rc != HVMCOPY_okay )
+            {
+                printk("Unable to copy RAM region %#lx - %#lx\n",
+                       d->arch.e820[i].addr, end);
+                return -EFAULT;
+            }
+        }
+    }
+
+    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
+    {
+        /*
+         * Since Dom0 cannot be migrated, we will only setup the
+         * unrestricted guest helpers if they are needed by the current
+         * hardware we are running on.
+         */
+        rc = hvm_setup_vmx_realmode_helpers(d);
+        if ( rc )
+            return rc;
+    }
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    int rc;
 
     printk("** Building a PVH Dom0 **\n");
 
+    /* Sanity! */
+    BUG_ON(d->domain_id);
+    BUG_ON(!d->vcpu[0]);
+
+    iommu_hwdom_init(d);
+
+    rc = hvm_setup_p2m(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 physical memory map\n");
+        return rc;
+    }
+
     return 0;
 }
 
@@ -1670,6 +1926,7 @@ int __init construct_dom0(struct domain *d, const module_t *image,
                           void *(*bootstrap_map)(const module_t *),
                           char *cmdline)
 {
+    process_pending_softirqs();
 
     return (is_hvm_domain(d) ? construct_dom0_hvm : construct_dom0_pv)
            (d, image, image_headroom, initrd,bootstrap_map, cmdline);
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (11 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2016-10-29  8:59 ` Roger Pau Monne
  2016-11-11 20:30   ` Konrad Rzeszutek Wilk
  2016-10-29  9:00 ` [PATCH v3.1 14/15] xen/x86: hack to setup PVHv2 Dom0 CPUs Roger Pau Monne
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  8:59 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Introduce a helper to parse the Dom0 kernel.
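
As a reminder of what this helper produces: the guest entry point named by the
phys_entry ELF note receives the physical address of a struct hvm_start_info
in %ebx. The sketch below is a consumer-side illustration only; the field set
mirrors public/arch-x86/hvm/start_info.h as of this series, and that header
remains authoritative.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the start-of-day structure layout (the public header rules). */
struct hvm_start_info {
    uint32_t magic;          /* Must contain XEN_HVM_START_MAGIC_VALUE.   */
    uint32_t version;
    uint32_t flags;          /* SIF_xxx flags.                            */
    uint32_t nr_modules;     /* Number of hvm_modlist_entry structs.      */
    uint64_t modlist_paddr;  /* Physical address of the module list.      */
    uint64_t cmdline_paddr;  /* Physical address of the command line.     */
    uint64_t rsdp_paddr;     /* Physical address of the RSDP, if any.     */
};

/* "xEn3" with the 0x80 bit of the 'E' set. */
#define XEN_HVM_START_MAGIC_VALUE 0x336ec578U

/* A PVH entry point would validate the structure handed over in %ebx. */
static int start_info_valid(const struct hvm_start_info *si)
{
    return si->magic == XEN_HVM_START_MAGIC_VALUE;
}
```

A guest that finds a mismatching magic value has not been started through
this ABI and should refuse to continue.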

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Remove debug messages.
 - Don't hardcode the number of modules to 1.
---
 xen/arch/x86/domain_build.c | 138 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index ec1ac89..168be62 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -39,6 +39,7 @@
 #include <asm/hpet.h>
 
 #include <public/version.h>
+#include <public/arch-x86/hvm/start_info.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -1895,12 +1896,141 @@ static int __init hvm_setup_p2m(struct domain *d)
     return 0;
 }
 
+static int __init hvm_load_kernel(struct domain *d, const module_t *image,
+                                  unsigned long image_headroom,
+                                  module_t *initrd, char *image_base,
+                                  char *cmdline, paddr_t *entry,
+                                  paddr_t *start_info_addr)
+{
+    char *image_start = image_base + image_headroom;
+    unsigned long image_len = image->mod_end;
+    struct elf_binary elf;
+    struct elf_dom_parms parms;
+    paddr_t last_addr;
+    struct hvm_start_info start_info;
+    struct hvm_modlist_entry mod;
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    int rc;
+
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    {
+        printk("Error trying to detect bz compressed kernel\n");
+        return rc;
+    }
+
+    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
+    {
+        printk("Unable to init ELF\n");
+        return rc;
+    }
+#ifdef VERBOSE
+    elf_set_verbose(&elf);
+#endif
+    elf_parse_binary(&elf);
+    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
+    {
+        printk("Unable to parse kernel for ELFNOTES\n");
+        return rc;
+    }
+
+    if ( parms.phys_entry == UNSET_ADDR32 )
+    {
+        printk("Unable to find kernel entry point, aborting\n");
+        return -EINVAL;
+    }
+
+    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
+           parms.guest_ver, parms.loader,
+           elf_64bit(&elf) ? "64-bit" : "32-bit");
+
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
+    elf.dest_size = parms.virt_kend - parms.virt_kstart;
+
+    saved_current = current;
+    set_current(v);
+
+    rc = elf_load_binary(&elf);
+    if ( rc < 0 )
+    {
+        printk("Failed to load kernel: %d\n", rc);
+        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
+        goto out;
+    }
+
+    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
+                                    initrd->mod_end);
+        if ( rc != HVMCOPY_okay )
+        {
+            printk("Unable to copy initrd to guest\n");
+            rc = -EFAULT;
+            goto out;
+        }
+
+        /* Zero the entry so padding fields don't leak Xen stack contents. */
+        memset(&mod, 0, sizeof(mod));
+        mod.paddr = last_addr;
+        mod.size = initrd->mod_end;
+        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
+    }
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
+    memset(&start_info, 0, sizeof(start_info));
+    if ( cmdline != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1);
+        if ( rc != HVMCOPY_okay )
+        {
+            printk("Unable to copy guest command line\n");
+            rc = -EFAULT;
+            goto out;
+        }
+        start_info.cmdline_paddr = last_addr;
+        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
+    }
+    if ( initrd != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, &mod, sizeof(mod));
+        if ( rc != HVMCOPY_okay )
+        {
+            printk("Unable to copy guest modules\n");
+            rc = -EFAULT;
+            goto out;
+        }
+        start_info.modlist_paddr = last_addr;
+        start_info.nr_modules = 1;
+        last_addr += sizeof(mod);
+    }
+
+    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
+    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info));
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy start info to guest\n");
+        rc = -EFAULT;
+        goto out;
+    }
+
+    *entry = parms.phys_entry;
+    *start_info_addr = last_addr;
+    rc = 0;
+
+out:
+    set_current(saved_current);
+    return rc;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    paddr_t entry, start_info;
     int rc;
 
     printk("** Building a PVH Dom0 **\n");
@@ -1918,6 +2048,14 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_load_kernel(d, image, image_headroom, initrd, bootstrap_map(image),
+                         cmdline, &entry, &start_info);
+    if ( rc )
+    {
+        printk("Failed to load Dom0 kernel\n");
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)



^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3.1 14/15] xen/x86: hack to setup PVHv2 Dom0 CPUs
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (12 preceding siblings ...)
  2016-10-29  8:59 ` [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2016-10-29  9:00 ` Roger Pau Monne
  2016-10-29  9:00 ` [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
  2016-10-31 14:35 ` [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Boris Ostrovsky
  15 siblings, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  9:00 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Initialize Dom0 BSP/APs and setup the memory and IO permissions.
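
The BSP is started in flat 32-bit protected mode with %ebx pointing at the
start_info structure, using packed segment attribute ("_ar") values such as
0xc9b/0xc93/0x8b. The helpers below decode that packed format under the
assumption (taken from the x86 descriptor layout, see public/hvm/hvm_vcpu.h)
that descriptor bits 40-47 (type, S, DPL, P) land in bits 0-7 and bits 52-55
(AVL, L, D/B, G) in bits 8-11; they are illustration only, not Xen code.

```c
#include <stdint.h>

/* Decode the packed segment attribute format used by VCPU_HVM_MODE_32B. */
static unsigned int seg_type(uint32_t ar)    { return ar & 0xf; }         /* Descriptor type.   */
static unsigned int seg_s(uint32_t ar)       { return (ar >> 4) & 1; }    /* Non-system flag.   */
static unsigned int seg_dpl(uint32_t ar)     { return (ar >> 5) & 3; }    /* Privilege level.   */
static unsigned int seg_present(uint32_t ar) { return (ar >> 7) & 1; }    /* Present bit.       */
static unsigned int seg_db(uint32_t ar)      { return (ar >> 10) & 1; }   /* Default op size.   */
static unsigned int seg_gran(uint32_t ar)    { return (ar >> 11) & 1; }   /* 4K granularity.    */
```

Decoding the constants used for the Dom0 BSP shows 0xc9b is a present DPL0
32-bit code segment, 0xc93 the matching data segment, and 0x8b a busy 32-bit
TSS (a system descriptor, so S is clear).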

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
DO NOT APPLY.

The logic used to setup the CPUID leaves is clearly lacking. This patch will
be rebased on top of Andrew's CPUID work, which will move CPUID setup from
libxc into Xen. For the time being this is needed in order to be able to
boot a PVHv2 Dom0 and test the rest of the patches.
---
 xen/arch/x86/domain_build.c | 97 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 168be62..1ebc21f 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -40,6 +40,7 @@
 
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
+#include <public/hvm/hvm_vcpu.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -2024,6 +2025,93 @@ out:
     return rc;
 }
 
+static int __init hvm_setup_cpus(struct domain *d, paddr_t entry,
+                                 paddr_t start_info)
+{
+    vcpu_hvm_context_t cpu_ctx;
+    struct vcpu *v = d->vcpu[0];
+    int cpu, i, rc;
+    struct {
+        uint32_t index;
+        uint32_t count;
+    } cpuid_leaves[] = {
+        {0, XEN_CPUID_INPUT_UNUSED},
+        {1, XEN_CPUID_INPUT_UNUSED},
+        {2, XEN_CPUID_INPUT_UNUSED},
+        {4, 0},
+        {4, 1},
+        {4, 2},
+        {4, 3},
+        {4, 4},
+        {7, 0},
+        {0xa, XEN_CPUID_INPUT_UNUSED},
+        {0xd, 0},
+        {0x80000000, XEN_CPUID_INPUT_UNUSED},
+        {0x80000001, XEN_CPUID_INPUT_UNUSED},
+        {0x80000002, XEN_CPUID_INPUT_UNUSED},
+        {0x80000003, XEN_CPUID_INPUT_UNUSED},
+        {0x80000004, XEN_CPUID_INPUT_UNUSED},
+        {0x80000005, XEN_CPUID_INPUT_UNUSED},
+        {0x80000006, XEN_CPUID_INPUT_UNUSED},
+        {0x80000007, XEN_CPUID_INPUT_UNUSED},
+        {0x80000008, XEN_CPUID_INPUT_UNUSED},
+    };
+
+    cpu = v->processor;
+    for ( i = 1; i < d->max_vcpus; i++ )
+    {
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_dom0_vcpu(d, i, cpu);
+    }
+
+    memset(&cpu_ctx, 0, sizeof(cpu_ctx));
+
+    cpu_ctx.mode = VCPU_HVM_MODE_32B;
+
+    cpu_ctx.cpu_regs.x86_32.ebx = start_info;
+    cpu_ctx.cpu_regs.x86_32.eip = entry;
+    cpu_ctx.cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET;
+
+    cpu_ctx.cpu_regs.x86_32.cs_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ds_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ss_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.tr_limit = 0x67;
+    cpu_ctx.cpu_regs.x86_32.cs_ar = 0xc9b;
+    cpu_ctx.cpu_regs.x86_32.ds_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.ss_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.tr_ar = 0x8b;
+
+    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
+    if ( rc )
+    {
+        printk("Unable to setup Dom0 BSP context: %d\n", rc);
+        return rc;
+    }
+
+    for ( i = 0; i < ARRAY_SIZE(cpuid_leaves); i++ )
+    {
+        d->arch.cpuids[i].input[0] = cpuid_leaves[i].index;
+        d->arch.cpuids[i].input[1] = cpuid_leaves[i].count;
+        cpuid_count(d->arch.cpuids[i].input[0], d->arch.cpuids[i].input[1],
+                    &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
+                    &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);
+        /* XXX: we need to do much more filtering here. */
+        if ( d->arch.cpuids[i].input[0] == 1 )
+            d->arch.cpuids[i].ecx &= ~cpufeat_mask(X86_FEATURE_VMX);
+    }
+
+    rc = setup_permissions(d);
+    if ( rc )
+        panic("Unable to setup Dom0 permissions: %d\n", rc);
+
+    update_domain_wallclock_time(d);
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2056,6 +2144,15 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_cpus(d, entry, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 CPUs: %d\n", rc);
+        return rc;
+    }
+
+    clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)



^ permalink raw reply related	[flat|nested] 89+ messages in thread

* [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (13 preceding siblings ...)
  2016-10-29  9:00 ` [PATCH v3.1 14/15] xen/x86: hack to setup PVHv2 Dom0 CPUs Roger Pau Monne
@ 2016-10-29  9:00 ` Roger Pau Monne
  2016-11-14 16:15   ` Jan Beulich
  2016-10-31 14:35 ` [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Boris Ostrovsky
  15 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-29  9:00 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky, konrad.wilk
  Cc: Andrew Cooper, Jan Beulich, Roger Pau Monne

Create a new MADT table that contains the topology exposed to the guest. A
new XSDT table is also created, in order to filter the tables that we want
to expose to the guest, plus the Xen crafted MADT. This in turn requires Xen
to also create a new RSDP in order to make it point to the custom XSDT.

Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
that don't reside in RAM regions. This is needed because some memory maps
don't properly account for all the memory used by ACPI, so it's common to
find ACPI tables in holes.
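
Every table crafted or modified here (MADT, XSDT, RSDP) must keep the
standard ACPI invariant that the bytes of a table sum to zero mod 256, which
is why the patch adjusts header.checksum by subtracting acpi_tb_checksum()
over the new contents. A self-contained sketch of that fix-up (the helper
names are mine, not Xen's; the checksum byte sits at offset 9 of the table
header):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum all bytes of a table; a valid ACPI table sums to 0 (mod 256). */
static uint8_t acpi_byte_sum(const void *table, size_t len)
{
    const uint8_t *p = table;
    uint8_t sum = 0;

    while ( len-- )
        sum += *p++;
    return sum;
}

/*
 * Fix up the checksum byte (offset 9 of the header), mirroring the
 * "header.checksum -= acpi_tb_checksum(...)" pattern in the patch.
 */
static void acpi_fixup_checksum(uint8_t *table, size_t len)
{
    table[9] -= acpi_byte_sum(table, len);
}
```

Because the subtraction includes the current checksum byte, the fix-up is
idempotent with respect to whatever stale value the byte held before.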

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v2:
 - Completely reworked.
---
 xen/arch/x86/domain_build.c | 428 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 427 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 1ebc21f..d7b54d9 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -23,6 +23,7 @@
 #include <xen/libelf.h>
 #include <xen/pfn.h>
 #include <xen/guest_access.h>
+#include <xen/acpi.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -38,6 +39,8 @@
 #include <asm/io_apic.h>
 #include <asm/hpet.h>
 
+#include <acpi/actables.h>
+
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
 #include <public/hvm/hvm_vcpu.h>
@@ -49,6 +52,9 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
 /* Size of the VM86 TSS for virtual 8086 mode to use. */
 #define HVM_VM86_TSS_SIZE   128
 
+static unsigned int __initdata acpi_intr_overrrides;
+static struct acpi_madt_interrupt_override __initdata *intsrcovr;
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -572,7 +578,7 @@ static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
      */
-    d->arch.e820 = xzalloc_array(struct e820entry, e820.nr_map);
+    d->arch.e820 = xzalloc_array(struct e820entry, E820MAX);
     if ( !d->arch.e820 )
         panic("Unable to allocate memory for Dom0 e820 map");
     entry_guest = d->arch.e820;
@@ -1757,6 +1763,54 @@ static int __init hvm_steal_ram(struct domain *d, unsigned long size,
     return -ENOMEM;
 }
 
+static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
+                                    uint32_t type)
+{
+    unsigned int i;
+
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        uint64_t rs = d->arch.e820[i].addr;
+        uint64_t re = rs + d->arch.e820[i].size;
+
+        if ( rs == e && d->arch.e820[i].type == type )
+        {
+            d->arch.e820[i].addr = s;
+            return 0;
+        }
+
+        if ( re == s && d->arch.e820[i].type == type &&
+             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )
+        {
+            d->arch.e820[i].size += e - s;
+            return 0;
+        }
+
+        if ( rs >= e )
+            break;
+
+        if ( re > s )
+            return -ENOMEM;
+    }
+
+    if ( d->arch.nr_e820 >= E820MAX )
+    {
+        printk(XENLOG_WARNING "E820: overflow while adding region "
+               "[%"PRIx64", %"PRIx64")\n", s, e);
+        return -ENOMEM;
+    }
+
+    memmove(d->arch.e820 + i + 1, d->arch.e820 + i,
+            (d->arch.nr_e820 - i) * sizeof(*d->arch.e820));
+
+    d->arch.nr_e820++;
+    d->arch.e820[i].addr = s;
+    d->arch.e820[i].size = e - s;
+    d->arch.e820[i].type = type;
+
+    return 0;
+}
+
 static int __init hvm_setup_vmx_realmode_helpers(struct domain *d)
 {
     p2m_type_t p2mt;
@@ -2112,6 +2166,371 @@ static int __init hvm_setup_cpus(struct domain *d, paddr_t entry,
     return 0;
 }
 
+static int __init acpi_count_intr_ov(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+    acpi_intr_overrrides++;
+    return 0;
+}
+
+static int __init acpi_set_intr_ov(struct acpi_subtable_header *header,
+                                   const unsigned long end)
+{
+    struct acpi_madt_interrupt_override *intr =
+        container_of(header, struct acpi_madt_interrupt_override, header);
+
+    ACPI_MEMCPY(intsrcovr, intr, sizeof(*intr));
+    intsrcovr++;
+
+    return 0;
+}
+
+static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
+{
+    struct acpi_table_madt *madt;
+    struct acpi_table_header *table;
+    struct acpi_madt_io_apic *io_apic;
+    struct acpi_madt_local_apic *local_apic;
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    acpi_status status;
+    unsigned long size;
+    unsigned int i;
+    int rc;
+
+    /* Count number of interrupt overrides in the MADT. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_count_intr_ov,
+                          MAX_IRQ_SOURCES);
+
+    /* Calculate the size of the crafted MADT. */
+    size = sizeof(struct acpi_table_madt);
+    size += sizeof(struct acpi_madt_interrupt_override) * acpi_intr_overrrides;
+    size += sizeof(struct acpi_madt_io_apic);
+    size += sizeof(struct acpi_madt_local_apic) * dom0_max_vcpus();
+
+    madt = xzalloc_bytes(size);
+    if ( !madt )
+    {
+        printk("Unable to allocate memory for MADT table\n");
+        return -ENOMEM;
+    }
+
+    /* Copy the native MADT table header. */
+    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
+    if ( !ACPI_SUCCESS(status) )
+    {
+        printk("Failed to get MADT ACPI table, aborting.\n");
+        xfree(madt);
+        return -EINVAL;
+    }
+    ACPI_MEMCPY(madt, table, sizeof(*table));
+    madt->address = APIC_DEFAULT_PHYS_BASE;
+
+    /* Setup the IO APIC entry. */
+    if ( nr_ioapics > 1 )
+        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
+               nr_ioapics);
+    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
+    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
+    io_apic->header.length = sizeof(*io_apic);
+    io_apic->id = 1;
+    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
+
+    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
+    for ( i = 0; i < dom0_max_vcpus(); i++ )
+    {
+        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
+        local_apic->header.length = sizeof(*local_apic);
+        local_apic->processor_id = i;
+        local_apic->id = i * 2;
+        local_apic->lapic_flags = ACPI_MADT_ENABLED;
+        local_apic++;
+    }
+
+    /* Setup interrupt overrides. */
+    intsrcovr = (struct acpi_madt_interrupt_override *)local_apic;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ov,
+                          MAX_IRQ_SOURCES);
+    ASSERT(((unsigned char *)intsrcovr - (unsigned char *)madt) == size);
+    madt->header.length = size;
+    madt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, madt), size);
+
+    /* Place the new MADT in guest memory space. */
+    if ( hvm_steal_ram(d, size, GB(4), addr) )
+    {
+        printk("Unable to allocate guest RAM for MADT\n");
+        xfree(madt);
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( hvm_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add MADT region to memory map\n");
+
+    saved_current = current;
+    set_current(v);
+    rc = hvm_copy_to_guest_phys(*addr, madt, size);
+    set_current(saved_current);
+    xfree(madt);
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy MADT into guest memory\n");
+        return -EFAULT;
+    }
+
+    return 0;
+}
+
+static bool __init range_is_ram(unsigned long mfn, unsigned long nr_pages)
+{
+    unsigned long i;
+
+    for ( i = 0 ; i < nr_pages; i++ )
+        if ( page_is_ram_type(mfn + i, RAM_TYPE_CONVENTIONAL) )
+            return true;
+
+    return false;
+}
+
+static bool __init hvm_acpi_table_allowed(const char *sig)
+{
+    static const char __init banned_tables[][ACPI_NAME_SIZE] = {
+        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
+        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
+    unsigned long pfn, nr_pages;
+    unsigned int i;
+
+    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
+        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
+            return false;
+
+    /*
+     * Make sure the table doesn't reside in a RAM region. The loop above
+     * exhausts i, so the matching entry in the root table list has to be
+     * looked up by signature.
+     */
+    for ( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        if ( strncmp(sig, acpi_gbl_root_table_list.tables[i].signature.ascii,
+                     ACPI_NAME_SIZE) != 0 )
+            continue;
+
+        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
+        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
+                                PAGE_SIZE);
+        if ( range_is_ram(pfn, nr_pages) )
+        {
+            printk("Skipping table %.4s because it resides in a RAM region\n",
+                   sig);
+            return false;
+        }
+    }
+
+    return true;
+}
+
+static int __init hvm_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
+                                      paddr_t *addr)
+{
+    struct acpi_table_xsdt *xsdt;
+    struct acpi_table_header *table;
+    struct acpi_table_rsdp *rsdp;
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    unsigned long size;
+    unsigned int i, num_tables;
+    int j, rc;
+
+    /*
+     * Restore the original DMAR table signature: we are going to filter it
+     * from the new XSDT presented to the guest, so it no longer makes sense
+     * to have its signature zapped.
+     */
+    acpi_dmar_reinstate();
+
+    /* Account for the space needed by the XSDT. */
+    size = sizeof(*xsdt);
+    num_tables = 0;
+
+    /* Count the number of tables that will be added to the XSDT. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+
+        if ( !hvm_acpi_table_allowed(sig) )
+            continue;
+
+        num_tables++;
+    }
+
+    /*
+     * No need to subtract one because we will be adding a custom MADT (and
+     * the native one is not accounted for).
+     */
+    size += num_tables * sizeof(u64);
+
+    xsdt = xzalloc_bytes(size);
+    if ( !xsdt )
+    {
+        printk("Unable to allocate memory for XSDT table\n");
+        return -ENOMEM;
+    }
+
+    /* Copy the native XSDT table header. */
+    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
+    if ( !rsdp )
+    {
+        printk("Unable to map RSDP\n");
+        xfree(xsdt);
+        return -EINVAL;
+    }
+    table = acpi_os_map_memory(rsdp->xsdt_physical_address, sizeof(*table));
+    if ( !table )
+    {
+        printk("Unable to map XSDT\n");
+        acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
+        xfree(xsdt);
+        return -EINVAL;
+    }
+    ACPI_MEMCPY(xsdt, table, sizeof(*table));
+    acpi_os_unmap_memory(table, sizeof(*table));
+    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
+
+    /* Add the custom MADT. */
+    j = 0;
+    xsdt->table_offset_entry[j++] = madt_addr;
+
+    /* Copy the address of the rest of the allowed tables. */
+    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
+        unsigned long pfn, nr_pages;
+
+        if ( !hvm_acpi_table_allowed(sig) )
+            continue;
+
+        /* Make sure table doesn't reside in a RAM region. */
+        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
+        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
+                                PAGE_SIZE);
+
+        /* Make sure table is mapped. */
+        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        if ( rc )
+            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+
+        xsdt->table_offset_entry[j++] =
+                            acpi_gbl_root_table_list.tables[i].address;
+    }
+
+    xsdt->header.length = size;
+    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
+
+    /* Place the new XSDT in guest memory space. */
+    if ( hvm_steal_ram(d, size, GB(4), addr) )
+    {
+        printk("Unable to allocate guest RAM for XSDT\n");
+        xfree(xsdt);
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( hvm_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
+        printk("Unable to add XSDT region to memory map\n");
+
+    saved_current = current;
+    set_current(v);
+    rc = hvm_copy_to_guest_phys(*addr, xsdt, size);
+    set_current(saved_current);
+    xfree(xsdt);
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy XSDT into guest memory\n");
+        return -EFAULT;
+    }
+
+    return 0;
+}
+
+static int __init hvm_setup_acpi(struct domain *d, paddr_t start_info)
+{
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    struct acpi_table_rsdp rsdp;
+    unsigned long pfn, nr_pages;
+    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
+    unsigned int i;
+    int rc;
+
+    /* Identity map ACPI e820 regions. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_ACPI &&
+             d->arch.e820[i].type != E820_NVS )
+            continue;
+
+        pfn = PFN_DOWN(d->arch.e820[i].addr);
+        nr_pages = DIV_ROUND_UP(d->arch.e820[i].size, PAGE_SIZE);
+
+        rc = modify_identity_mmio(d, pfn, nr_pages, true);
+        if ( rc )
+        {
+            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+            return rc;
+        }
+    }
+
+    rc = hvm_setup_acpi_madt(d, &madt_paddr);
+    if ( rc )
+        return rc;
+
+    rc = hvm_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
+    if ( rc )
+        return rc;
+
+    /* Craft a custom RSDP. */
+    memset(&rsdp, 0, sizeof(rsdp));
+    memcpy(&rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
+    memcpy(&rsdp.oem_id, "XenVMM", sizeof(rsdp.oem_id));
+    rsdp.revision = 2;
+    rsdp.xsdt_physical_address = xsdt_paddr;
+    rsdp.rsdt_physical_address = 0;
+    rsdp.length = sizeof(rsdp);
+    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                      ACPI_RSDP_REV0_SIZE);
+    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
+                                               sizeof(rsdp));
+
+    /*
+     * Place the new RSDP in guest memory space.
+     *
+     * NB: this RSDP is not going to replace the original RSDP, which
+     * should still be accessible to the guest. However that RSDP is
+     * going to point to the native XSDT/RSDT, and should not be used.
+     */
+    if ( hvm_steal_ram(d, sizeof(rsdp), GB(4), &rsdp_paddr) )
+    {
+        printk("Unable to allocate guest RAM for RSDP\n");
+        return -ENOMEM;
+    }
+
+    /* Mark this region as E820_ACPI. */
+    if ( hvm_add_mem_range(d, rsdp_paddr, rsdp_paddr + sizeof(rsdp),
+                           E820_ACPI) )
+        printk("Unable to add RSDP region to memory map\n");
+
+    /* Copy RSDP into guest memory. */
+    saved_current = current;
+    set_current(v);
+    rc = hvm_copy_to_guest_phys(rsdp_paddr, &rsdp, sizeof(rsdp));
+    if ( rc != HVMCOPY_okay )
+    {
+        set_current(saved_current);
+        printk("Unable to copy RSDP into guest memory\n");
+        return -EFAULT;
+    }
+
+    /* Copy RSDP address to start_info. */
+    rc = hvm_copy_to_guest_phys(start_info +
+                                offsetof(struct hvm_start_info, rsdp_paddr),
+                                &rsdp_paddr,
+                                sizeof(
+                                    ((struct hvm_start_info *)0)->rsdp_paddr));
+    set_current(saved_current);
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy RSDP into guest memory\n");
+        return -EFAULT;
+    }
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2151,6 +2570,13 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_acpi(d, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 ACPI tables: %d\n", rc);
+        return rc;
+    }
+
     clear_bit(_VPF_down, &d->vcpu[0]->pause_flags);
 
     return 0;
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: [PATCH v3.1 02/15] xen/x86: fix return value of *_set_allocation functions
  2016-10-29  8:59 ` [PATCH v3.1 02/15] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
@ 2016-10-29 22:11   ` Tim Deegan
  0 siblings, 0 replies; 89+ messages in thread
From: Tim Deegan @ 2016-10-29 22:11 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, xen-devel, boris.ostrovsky

At 10:59 +0200 on 29 Oct (1477738788), Roger Pau Monne wrote:
> Return should be an int.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Tim Deegan <tim@xen.org>


* Re: [PATCH v3.1 00/15] Initial PVHv2 Dom0 support
  2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
                   ` (14 preceding siblings ...)
  2016-10-29  9:00 ` [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
@ 2016-10-31 14:35 ` Boris Ostrovsky
  2016-10-31 14:43   ` Andrew Cooper
  15 siblings, 1 reply; 89+ messages in thread
From: Boris Ostrovsky @ 2016-10-31 14:35 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel, konrad.wilk



On 10/29/2016 04:59 AM, Roger Pau Monne wrote:
> (resending as v3.1, it seems like I need to figure out how to properly use
> msmtp with git send-email because on the last try only the cover letter
> was actually sent).
>
> Hello,
>
> This is the first batch of the PVH Dom0 support series, that includes
> everything up to the point where ACPI tables for the Dom0 are crafted. I've
> decided to leave the last part of the series (the one that contains the PCI
> config space handlers, and other emulation/trapping related code) separate,
> in order to focus and ease the review. This is of course not functional; one
> might be able to partially boot a Dom0 kernel if it doesn't try to access
> any physical device.
>
> Another reason for splitting this series is so that I can write a proper
> design document about how this trapping is going to work, and what it is
> supposed to do, because during the last review round I got the feeling that
> some comments were not really related to the code itself, but to what I was
> trying to achieve, so it's best to discuss them in a design document rather
> than mixed up with code.


Given that we are dropping PVHv1 support from Linux and the fact that v1 
has always been a tech preview (or some such) should we drop it now?

The is_pvh_domain() is getting more and more confusing.

-boris


* Re: [PATCH v3.1 00/15] Initial PVHv2 Dom0 support
  2016-10-31 14:35 ` [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Boris Ostrovsky
@ 2016-10-31 14:43   ` Andrew Cooper
  2016-10-31 16:35     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Andrew Cooper @ 2016-10-31 14:43 UTC (permalink / raw)
  To: Boris Ostrovsky, Roger Pau Monne, xen-devel, konrad.wilk

On 31/10/16 14:35, Boris Ostrovsky wrote:
>
>
> On 10/29/2016 04:59 AM, Roger Pau Monne wrote:
>> (resending as v3.1, it seems like I need to figure out how to
>> properly use
>> msmtp with git send-email because on the last try only the cover letter
>> was actually sent).
>>
>> Hello,
>>
>> This is the first batch of the PVH Dom0 support series, that includes
>> everything up to the point where ACPI tables for the Dom0 are crafted.
>> I've decided to leave the last part of the series (the one that
>> contains the PCI config space handlers, and other emulation/trapping
>> related code) separate, in order to focus and ease the review. This is
>> of course not functional; one might be able to partially boot a Dom0
>> kernel if it doesn't try to access any physical device.
>>
>> Another reason for splitting this series is so that I can write a
>> proper design document about how this trapping is going to work, and
>> what it is supposed to do, because during the last review round I got
>> the feeling that some comments were not really related to the code
>> itself, but to what I was trying to achieve, so it's best to discuss
>> them in a design document rather than mixed up with code.
>
>
> Given that we are dropping PVHv1 support from Linux and the fact that
> v1 has always been a tech preview (or some such) should we drop it now?
>
> The is_pvh_domain() is getting more and more confusing.

+1 for dropping all the PVHv1 remnants from Xen.

Perhaps the start of 4.9 is the best time to flip this switch.

~Andrew


* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-10-29  8:59 ` [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
@ 2016-10-31 16:32   ` Jan Beulich
  2016-11-03 12:35     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-10-31 16:32 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
> from physical or emulated devices over event channels using PIRQs. This
> applies to both DomU and Dom0 PVHv2 guests.
> 
> Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> route physical interrupts (even from emulated devices) over event channels,
> and is thus allowed to use some of the PHYSDEV ops.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

The patch looks fine now for its purpose, but I'm hesitant to ack it
without us having settled on whether we indeed mean to hide all
those physdev ops from Dom0. In particular I don't recall this (and
the reasoning behind it) having got written down somewhere.

Jan


* Re: [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-10-29  8:59 ` [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
@ 2016-10-31 16:34   ` Jan Beulich
  2016-11-01 10:45     ` Tim Deegan
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-10-31 16:34 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> ... and using the "preempted" parameter. The solution relies on just calling
> softirq_pending if the current domain is the idle domain. If such preemption
> happens, the caller should then call process_pending_softirqs in order to
> drain the pending softirqs, and then call {sh/hap}_set_allocation again to
> continue with its execution.
> 
> This allows us to call *_set_allocation() when building domain 0.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Acked-by: George Dunlap <george.dunlap@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>

Cc: Tim

> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> Changes since v2:
>  - Fix commit message.
> ---
>  xen/arch/x86/mm/hap/hap.c       | 4 +++-
>  xen/arch/x86/mm/shadow/common.c | 4 +++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index f099e94..0645521 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
>              break;
>  
>          /* Check to see if we need to yield and try again */
> -        if ( preempted && hypercall_preempt_check() )
> +        if ( preempted &&
> +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> +                                      hypercall_preempt_check()) )
>          {
>              *preempted = 1;
>              return 0;
> diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> index 065bdc7..b2e99c2 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1679,7 +1679,9 @@ static int sh_set_allocation(struct domain *d,
>              break;
>  
>          /* Check to see if we need to yield and try again */
> -        if ( preempted && hypercall_preempt_check() )
> +        if ( preempted &&
> +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> +                                      hypercall_preempt_check()) )
>          {
>              *preempted = 1;
>              return 0;
> -- 
> 2.7.4 (Apple Git-66)




* Re: [PATCH v3.1 00/15] Initial PVHv2 Dom0 support
  2016-10-31 14:43   ` Andrew Cooper
@ 2016-10-31 16:35     ` Roger Pau Monne
  0 siblings, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-10-31 16:35 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Boris Ostrovsky

On Mon, Oct 31, 2016 at 02:43:08PM +0000, Andrew Cooper wrote:
> On 31/10/16 14:35, Boris Ostrovsky wrote:
> >
> >
> > On 10/29/2016 04:59 AM, Roger Pau Monne wrote:
> >> (resending as v3.1, it seems like I need to figure out how to
> >> properly use
> >> msmtp with git send-email because on the last try only the cover letter
> >> was actually sent).
> >>
> >> Hello,
> >>
> >> This is the first batch of the PVH Dom0 support series, that includes
> >> everything up to the point where ACPI tables for the Dom0 are crafted.
> >> I've decided to leave the last part of the series (the one that
> >> contains the PCI config space handlers, and other emulation/trapping
> >> related code) separate, in order to focus and ease the review. This is
> >> of course not functional; one might be able to partially boot a Dom0
> >> kernel if it doesn't try to access any physical device.
> >>
> >> Another reason for splitting this series is so that I can write a
> >> proper design document about how this trapping is going to work, and
> >> what it is supposed to do, because during the last review round I got
> >> the feeling that some comments were not really related to the code
> >> itself, but to what I was trying to achieve, so it's best to discuss
> >> them in a design document rather than mixed up with code.
> >
> >
> > Given that we are dropping PVHv1 support from Linux and the fact that
> > v1 has always been a tech preview (or some such) should we drop it now?
> >
> > The is_pvh_domain() is getting more and more confusing.
> 
> +1 for dropping all the PVHv1 remnants from Xen.
> 
> Perhaps the start of 4.9 is the best time to flip this switch.

Yes, I don't have any objections to that. Perhaps it should be done after 
this series is in, since here I'm reusing some of the PVH code in 
domain_build.c (so removing PVH before committing this would force me to 
reintroduce those functions later on).

Roger.


* Re: [PATCH v3.1 04/15] xen/x86: assert that local_events_need_delivery is not called by the idle domain
  2016-10-29  8:59 ` [PATCH v3.1 04/15] xen/x86: assert that local_events_need_delivery is not called by " Roger Pau Monne
@ 2016-10-31 16:37   ` Jan Beulich
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-10-31 16:37 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> It doesn't make sense since the idle domain doesn't receive any events. This
> is relevant in order to be sure that hypercall_preempt_check is not called
> by the idle domain, which would happen previously when calling
> {hap/sh}_set_allocation during domain 0 creation.

AIUI this describes the state of things before this series, not before
this patch. I wonder whether this wouldn't better be folded into the
previous patch, with the commit message slightly adjusted.

Jan


* Re: [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation
  2016-10-29  8:59 ` [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation Roger Pau Monne
@ 2016-10-31 16:42   ` Jan Beulich
  2016-11-01 10:29     ` Tim Deegan
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-10-31 16:42 UTC (permalink / raw)
  To: Roger Pau Monne, Tim Deegan
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1609,13 +1609,7 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
>      paging_unlock(d);
>  }
>  
> -/* Set the pool of shadow pages to the required number of pages.
> - * Input will be rounded up to at least shadow_min_acceptable_pages(),
> - * plus space for the p2m table.
> - * Returns 0 for success, non-zero for failure. */
> -static int sh_set_allocation(struct domain *d,
> -                             unsigned int pages,
> -                             int *preempted)
> +int sh_set_allocation(struct domain *d, unsigned int pages, bool *preempted)

Iirc functions with a name starting with sh_ are shadow code internal,
so you may better switch to shadow_set_allocation(). Tim?

With that taken care of (unless Tim indicates there's no need),
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan


* Re: [PATCH v3.1 06/15] xen/x86: split the setup of Dom0 permissions to a function
  2016-10-29  8:59 ` [PATCH v3.1 06/15] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
@ 2016-10-31 16:44   ` Jan Beulich
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-10-31 16:44 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> So that it can also be used by the PVH-specific domain builder. This is just
> code motion, it should not introduce any functional change.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>


* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-10-29  8:59 ` [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally Roger Pau Monne
@ 2016-10-31 16:47   ` Jan Beulich
  2016-11-03 10:58     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-10-31 16:47 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, boris.ostrovsky

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>  
>      early_msi_init();
>  
> +    scan_pci_devices();
> +
>      iommu_setup();    /* setup iommu if available */
>  
>      smp_prepare_cpus(max_cpus);
> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
>  
>      if ( !amd_iommu_perdev_intremap )
>          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
> -    return scan_pci_devices();
> +
> +    return 0;
>  }

I'm relatively certain that I did point out on a prior version that the
error handling here gets lost. At the very least the commit message
should provide a reason for doing so; even better would be if there
was no behavioral change (other than the point in time where this
happens slightly changing).

Jan


* Re: [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation
  2016-10-31 16:42   ` Jan Beulich
@ 2016-11-01 10:29     ` Tim Deegan
  0 siblings, 0 replies; 89+ messages in thread
From: Tim Deegan @ 2016-11-01 10:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, xen-devel, boris.ostrovsky,
	Roger Pau Monne

At 10:42 -0600 on 31 Oct (1477910532), Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/mm/shadow/common.c
> > +++ b/xen/arch/x86/mm/shadow/common.c
> > @@ -1609,13 +1609,7 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
> >      paging_unlock(d);
> >  }
> >  
> > -/* Set the pool of shadow pages to the required number of pages.
> > - * Input will be rounded up to at least shadow_min_acceptable_pages(),
> > - * plus space for the p2m table.
> > - * Returns 0 for success, non-zero for failure. */
> > -static int sh_set_allocation(struct domain *d,
> > -                             unsigned int pages,
> > -                             int *preempted)
> > +int sh_set_allocation(struct domain *d, unsigned int pages, bool *preempted)
> 
> Iirc functions with a name starting with sh_ are shadow code internal,
> so you may better switch to shadow_set_allocation(). Tim?

Yep.  That naming convention is not very faithfully followed,
especially since the introduction of hap* and paging*, but it would be
nice.

Cheers,

Tim.

> With that taken care of (unless Tim indicates there's no need),
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> 
> Jan


* Re: [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-10-31 16:34   ` Jan Beulich
@ 2016-11-01 10:45     ` Tim Deegan
  2016-11-02 17:14       ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Tim Deegan @ 2016-11-01 10:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, xen-devel, boris.ostrovsky,
	Roger Pau Monne

At 10:34 -0600 on 31 Oct (1477910088), Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > ... and using the "preempted" parameter. The solution relies on just calling
> > softirq_pending if the current domain is the idle domain. If such preemption
> > happens, the caller should then call process_pending_softirqs in order to
> > drain the pending softirqs, and then call {sh/hap}_set_allocation again to
> > continue with its execution.
> > 
> > This allows us to call *_set_allocation() when building domain 0.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > Acked-by: George Dunlap <george.dunlap@citrix.com>
> > ---
> > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> 
> Cc: Tim
> 
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> > Changes since v2:
> >  - Fix commit message.
> > ---
> >  xen/arch/x86/mm/hap/hap.c       | 4 +++-
> >  xen/arch/x86/mm/shadow/common.c | 4 +++-
> >  2 files changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> > index f099e94..0645521 100644
> > --- a/xen/arch/x86/mm/hap/hap.c
> > +++ b/xen/arch/x86/mm/hap/hap.c
> > @@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned int pages, 
> > int *preempted)
> >              break;
> >  
> >          /* Check to see if we need to yield and try again */
> > -        if ( preempted && hypercall_preempt_check() )
> > +        if ( preempted &&
> > +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> > +                                      hypercall_preempt_check()) )

This is a bit clunky, and is open-coded twice.  Can the existing checks in
hypercall_preempt_check() be made safe to run on the idle vcpu?
If not, please make a wrapper to call here that DTRT on idle and
non-idle, e.g. something like:

 /*
  * For long-running operations that may be in hypercall context or on
  * the idle vcpu (e.g. during dom0 construction), check if there is
  * background work to be done that should interrupt this operation.
  */
 static inline bool general_preempt_check(void)
 {
     return unlikely(softirq_pending(smp_processor_id()) ||
                     (!is_idle_vcpu(current) && local_events_need_delivery()));
 }

If you're feeling keen, you could convert hypercall_preempt_check() to
an inline function and comment it too. :)

Apart from that, ack.

Cheers,

Tim.

> >          {
> >              *preempted = 1;
> >              return 0;
> > > diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> > index 065bdc7..b2e99c2 100644
> > --- a/xen/arch/x86/mm/shadow/common.c
> > +++ b/xen/arch/x86/mm/shadow/common.c
> > @@ -1679,7 +1679,9 @@ static int sh_set_allocation(struct domain *d,
> >              break;
> >  
> >          /* Check to see if we need to yield and try again */
> > -        if ( preempted && hypercall_preempt_check() )
> > +        if ( preempted &&
> > +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> > +                                      hypercall_preempt_check()) )
> >          {
> >              *preempted = 1;
> >              return 0;
> > -- 
> > 2.7.4 (Apple Git-66)
> 
> 


* Re: [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-11-01 10:45     ` Tim Deegan
@ 2016-11-02 17:14       ` Roger Pau Monne
  2016-11-03 10:20         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-02 17:14 UTC (permalink / raw)
  To: Tim Deegan
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, xen-devel, boris.ostrovsky

On Tue, Nov 01, 2016 at 10:45:05AM +0000, Tim Deegan wrote:
> At 10:34 -0600 on 31 Oct (1477910088), Jan Beulich wrote:
> > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > > ... and using the "preempted" parameter. The solution relies on just calling
> > > softirq_pending if the current domain is the idle domain. If such preemption
> > > happens, the caller should then call process_pending_softirqs in order to
> > > drain the pending softirqs, and then call {sh/hap}_set_allocation again to
> > > continue with its execution.
> > > 
> > > This allows us to call *_set_allocation() when building domain 0.
> > > 
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > Acked-by: George Dunlap <george.dunlap@citrix.com>
> > > ---
> > > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> > 
> > Cc: Tim
> > 
> > > Cc: Jan Beulich <jbeulich@suse.com>
> > > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > > ---
> > > Changes since v2:
> > >  - Fix commit message.
> > > ---
> > >  xen/arch/x86/mm/hap/hap.c       | 4 +++-
> > >  xen/arch/x86/mm/shadow/common.c | 4 +++-
> > >  2 files changed, 6 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> > > index f099e94..0645521 100644
> > > --- a/xen/arch/x86/mm/hap/hap.c
> > > +++ b/xen/arch/x86/mm/hap/hap.c
> > > @@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
> > >              break;
> > >  
> > >          /* Check to see if we need to yield and try again */
> > > -        if ( preempted && hypercall_preempt_check() )
> > > +        if ( preempted &&
> > > +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> > > +                                      hypercall_preempt_check()) )
> 
> This is a bit clunky, and is open-coded twice.  Can the existing checks in
> hypercall_preempt_check() be made safe to run on the idle vcpu?
> If not, please make a wrapper to call here that DTRT on idle and
> non-idle, e.g. something like:

Yes, hypercall_preempt_check can be made safe to run on the idle vcpu, just 
by guarding the call to local_events_need_delivery with a !is_idle_vcpu 
check. I'm not really in favor of doing that, because then the name of the 
function would be misleading: hypercall_preempt_check could be used even in 
non-hypercall contexts.

>  /*
>   * For long-running operations that may be in hypercall context or on
>   * the idle vcpu (e.g. during dom0 construction), check if there is
>   * background work to be done that should interrupt this operation.
>   */
>  static inline bool general_preempt_check(void)
>  {
>      return unlikely(softirq_pending(smp_processor_id()) ||
>                      (!is_idle_vcpu(current) && local_events_need_delivery()));
>  }
>
> If you're feeling keen, you could convert hypercall_preempt_check() to
> an inline function and comment it too. :)

IMHO this is better, and I don't mind changing hypercall_preempt_check to an 
inline function :).

Roger.


* Re: [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-11-02 17:14       ` Roger Pau Monne
@ 2016-11-03 10:20         ` Roger Pau Monne
  2016-11-03 10:33           ` Tim Deegan
  2016-11-03 11:31           ` Jan Beulich
  0 siblings, 2 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-03 10:20 UTC (permalink / raw)
  To: Tim Deegan, George Dunlap, Andrew Cooper, Jan Beulich, xen-devel,
	boris.ostrovsky

On Wed, Nov 02, 2016 at 06:14:13PM +0100, Roger Pau Monne wrote:
> On Tue, Nov 01, 2016 at 10:45:05AM +0000, Tim Deegan wrote:
> > At 10:34 -0600 on 31 Oct (1477910088), Jan Beulich wrote:
> > > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >   * the idle vcpu (e.g. during dom0 construction), check if there is
> >   * background work to be done that should interrupt this operation.
> >   */
> >  static inline bool general_preempt_check(void)
> >  {
> >      return unlikely(softirq_pending(smp_processor_id()) ||
> >                      (!is_idle_vcpu(current) && local_events_need_delivery()));
> >  }
> >
> > If you're feeling keen, you could convert hypercall_preempt_check() to
> > an inline function and comment it too. :)
> 
> IMHO this is better, and I don't mind changing hypercall_preempt_check to an 
> inline function :).

So it turns out this is not trivial at all. Converting 
hypercall_preempt_check and also adding general_preempt_check as inline 
functions in sched.h causes trouble, because they depend on 
local_events_need_delivery, which in turn depends on struct vcpu being 
defined. I could possibly move {general/hypercall}_preempt_check into 
xen/event.h and fix up the callers, but I would prefer to leave this 
as-is for the time being and add general_preempt_check as a macro.

Roger.


* Re: [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-11-03 10:20         ` Roger Pau Monne
@ 2016-11-03 10:33           ` Tim Deegan
  2016-11-03 11:31           ` Jan Beulich
  1 sibling, 0 replies; 89+ messages in thread
From: Tim Deegan @ 2016-11-03 10:33 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, Jan Beulich, xen-devel

At 11:20 +0100 on 03 Nov (1478172025), Roger Pau Monne wrote:
> On Wed, Nov 02, 2016 at 06:14:13PM +0100, Roger Pau Monne wrote:
> > On Tue, Nov 01, 2016 at 10:45:05AM +0000, Tim Deegan wrote:
> > > At 10:34 -0600 on 31 Oct (1477910088), Jan Beulich wrote:
> > > > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > >   * the idle vcpu (e.g. during dom0 construction), check if there is
> > >   * background work to be done that should interrupt this operation.
> > >   */
> > >  static inline bool general_preempt_check(void)
> > >  {
> > >      return unlikely(softirq_pending(smp_processor_id()) ||
> > >                      (!is_idle_vcpu(current) && local_events_need_delivery()));
> > >  }
> > >
> > > If you're feeling keen, you could convert hypercall_preempt_check() to
> > > an inline function and comment it too. :)
> > 
> > IMHO this is better, and I don't mind changing hypercall_preempt_check to an 
> > inline function :).
> 
> So it turns out this is not trivial at all. Converting 
> hypercall_preempt_check and also adding general_preempt_check as inline 
> functions into sched.h causes trouble because they depend on 
> local_events_need_delivery which in turn depends on the struct vcpu being 
> defined. I could possibly move {general/hypercall}_preempt_check into 
> xen/event.h and fixup the callers, but I would maybe prefer to leave this 
> as-is for the time being, and add general_preempt_check as a macro.

Righto - that sounds fine to me.

Cheers,

Tim.


* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-10-31 16:47   ` Jan Beulich
@ 2016-11-03 10:58     ` Roger Pau Monne
  2016-11-03 11:35       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-03 10:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, boris.ostrovsky

On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >  
> >      early_msi_init();
> >  
> > +    scan_pci_devices();
> > +
> >      iommu_setup();    /* setup iommu if available */
> >  
> >      smp_prepare_cpus(max_cpus);
> > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
> >  
> >      if ( !amd_iommu_perdev_intremap )
> >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
> > -    return scan_pci_devices();
> > +
> > +    return 0;
> >  }
> 
> I'm relatively certain that I did point out on a prior version that the
> error handling here gets lost. At the very least the commit message
> should provide a reason for doing so; even better would be if there
> was no behavioral change (other than the point in time where this
> happens slightly changing).

Behaviour here differs between Intel and AMD hardware: on Intel, failure to 
scan the PCI bus is not fatal and the IOMMU will be enabled anyway, while on 
AMD, OTOH, failure to scan the PCI bus causes the IOMMU to be disabled. 
I expect we should behave the same on both Intel and AMD, so which one 
should be used?

Roger.


* Re: [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-11-03 10:20         ` Roger Pau Monne
  2016-11-03 10:33           ` Tim Deegan
@ 2016-11-03 11:31           ` Jan Beulich
  1 sibling, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-11-03 11:31 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, Tim Deegan, xen-devel

>>> On 03.11.16 at 11:20, <roger.pau@citrix.com> wrote:
> On Wed, Nov 02, 2016 at 06:14:13PM +0100, Roger Pau Monne wrote:
>> On Tue, Nov 01, 2016 at 10:45:05AM +0000, Tim Deegan wrote:
>> > At 10:34 -0600 on 31 Oct (1477910088), Jan Beulich wrote:
>> > > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> >   * the idle vcpu (e.g. during dom0 construction), check if there is
>> >   * background work to be done that should interrupt this operation.
>> >   */
>> >  static inline bool general_preempt_check(void)
>> >  {
>> >      return unlikely(softirq_pending(smp_processor_id()) ||
>> >                      (!is_idle_vcpu(current) && local_events_need_delivery()));
>> >  }
>> >
>> > If you're feeling keen, you could convert hypercall_preempt_check() to
>> > an inline function and comment it too. :)
>> 
>> IMHO this is better, and I don't mind changing hypercall_preempt_check to an 
>> inline function :).
> 
> So it turns out this is not trivial at all. Converting 
> hypercall_preempt_check and also adding general_preempt_check as inline 
> functions into sched.h causes trouble because they depend on 
> local_events_need_delivery which in turn depends on the struct vcpu being 
> defined. I could possibly move {general/hypercall}_preempt_check into 
> xen/event.h and fixup the callers, but I would maybe prefer to leave this 
> as-is for the time being, and add general_preempt_check as a macro.

Fine with me.

Jan



* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-03 10:58     ` Roger Pau Monne
@ 2016-11-03 11:35       ` Jan Beulich
  2016-11-03 11:54         ` Boris Ostrovsky
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-03 11:35 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, boris.ostrovsky

>>> On 03.11.16 at 11:58, <roger.pau@citrix.com> wrote:
> On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
>> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > --- a/xen/arch/x86/setup.c
>> > +++ b/xen/arch/x86/setup.c
>> > @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>> >  
>> >      early_msi_init();
>> >  
>> > +    scan_pci_devices();
>> > +
>> >      iommu_setup();    /* setup iommu if available */
>> >  
>> >      smp_prepare_cpus(max_cpus);
>> > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
>> >  
>> >      if ( !amd_iommu_perdev_intremap )
>> >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
>> > -    return scan_pci_devices();
>> > +
>> > +    return 0;
>> >  }
>> 
>> I'm relatively certain that I did point out on a prior version that the
>> error handling here gets lost. At the very least the commit message
>> should provide a reason for doing so; even better would be if there
>> was no behavioral change (other than the point in time where this
>> happens slightly changing).
> 
> Behaviour here is different on Intel or AMD hardware, on Intel failure to 
> scan the PCI bus will not be fatal, and the IOMMU will be enabled anyway. On 
> AMD OTOH failure to scan the PCI bus will cause the IOMMU to be disabled. 
> I expect we should be able to behave equally for both Intel and AMD, so 
> which one should be used?

I'm afraid I have to defer to the vendor IOMMU maintainers for
that one, as I don't know the reason for the difference in behavior.
An aspect that may play into here is that for AMD the IOMMU is
represented by a PCI device, while for Intel it's just a part of one
of the core chipset devices.

Jan



* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-03 11:35       ` Jan Beulich
@ 2016-11-03 11:54         ` Boris Ostrovsky
  2016-11-29 12:33           ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Boris Ostrovsky @ 2016-11-03 11:54 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit, xen-devel



On 11/03/2016 07:35 AM, Jan Beulich wrote:
>>>> On 03.11.16 at 11:58, <roger.pau@citrix.com> wrote:
>> On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
>>>>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>>>> --- a/xen/arch/x86/setup.c
>>>> +++ b/xen/arch/x86/setup.c
>>>> @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>>>>
>>>>      early_msi_init();
>>>>
>>>> +    scan_pci_devices();
>>>> +
>>>>      iommu_setup();    /* setup iommu if available */
>>>>
>>>>      smp_prepare_cpus(max_cpus);
>>>> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>>> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>>>> @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
>>>>
>>>>      if ( !amd_iommu_perdev_intremap )
>>>>          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
>>>> -    return scan_pci_devices();
>>>> +
>>>> +    return 0;
>>>>  }
>>>
>>> I'm relatively certain that I did point out on a prior version that the
>>> error handling here gets lost. At the very least the commit message
>>> should provide a reason for doing so; even better would be if there
>>> was no behavioral change (other than the point in time where this
>>> happens slightly changing).
>>
>> Behaviour here is different on Intel or AMD hardware, on Intel failure to
>> scan the PCI bus will not be fatal, and the IOMMU will be enabled anyway. On
>> AMD OTOH failure to scan the PCI bus will cause the IOMMU to be disabled.
>> I expect we should be able to behave equally for both Intel and AMD, so
>> which one should be used?
>
> I'm afraid I have to defer to the vendor IOMMU maintainers for
> that one, as I don't know the reason for the difference in behavior.
> An aspect that may play into here is that for AMD the IOMMU is
> represented by a PCI device, while for Intel it's just a part of one
> of the core chipset devices.

That's probably the reason, although it looks like the only failure that 
scan_pci_devices() can return is -ENOMEM, in which case disabling the IOMMU 
may not be the biggest problem.

-boris


* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-10-31 16:32   ` Jan Beulich
@ 2016-11-03 12:35     ` Roger Pau Monne
  2016-11-03 12:52       ` Jan Beulich
  2016-11-03 14:22       ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-03 12:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Mon, Oct 31, 2016 at 10:32:47AM -0600, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
> > from physical or emulated devices over event channels using PIRQs. This
> > applies to both DomU and Dom0 PVHv2 guests.
> > 
> > Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> > route physical interrupts (even from emulated devices) over event channels,
> > and is thus allowed to use some of the PHYSDEV ops.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> The patch looks fine now for its purpose, but I'm hesitant to ack it
> without us having settled on whether we indeed mean to hide all
> those physdev ops from Dom0. In particular I don't recall this (and
> the reasoning behind it) having got written down somewhere.

I'm planning to add the following doc update together with this commit:

diff --git a/docs/misc/hvmlite.markdown b/docs/misc/hvmlite.markdown
index 946908e..4fc757f 100644
--- a/docs/misc/hvmlite.markdown
+++ b/docs/misc/hvmlite.markdown
@@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
 
 Description of paravirtualized devices will come from XenStore, just as it's
 done for HVM guests.
+
+## Interrupts ##
+
+### Interrupts from physical devices ###
+
+Interrupts from physical devices are delivered using native methods; this is
+done in order to take advantage of new hardware assisted virtualization
+functions, like posted interrupts. This implies that PVHv2 guests with physical
+devices will also have the necessary interrupt controllers in order to manage
+the delivery of interrupts from those devices, using the same interfaces that
+are available on native hardware.
+
+### Interrupts from paravirtualized devices ###
+
+Interrupts from paravirtualized devices are delivered using event channels, see
+[Event Channel Internals][event_channels] for more detailed information about
+event channels. In order to inject interrupts into the guest an IDT vector is
+used. This is the same mechanism used on PVHVM guests, and allows having
+per-cpu interrupts that can be also used to deliver timers or IPIs if desired.
+
+In order to register the callback IDT vector the `HVMOP_set_param` hypercall
+is used with the following values:
+
+    domid = DOMID_SELF
+    index = HVM_PARAM_CALLBACK_IRQ
+    value = (0x2 << 56) | vector_value
+
+The OS has to program the IDT for the `vector_value` using the baremetal
+mechanism.
+
+In order to know which event channel has fired, we need to look into the
+information provided in the `shared_info` structure. The `evtchn_pending`
+array is used as a bitmap in order to find out which event channel has
+fired. Event channels can also be masked by setting its port value in the
+`shared_info->evtchn_mask` bitmap.



* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-03 12:35     ` Roger Pau Monne
@ 2016-11-03 12:52       ` Jan Beulich
  2016-11-03 14:25         ` Konrad Rzeszutek Wilk
  2016-11-03 15:05         ` Roger Pau Monne
  2016-11-03 14:22       ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 89+ messages in thread
From: Jan Beulich @ 2016-11-03 12:52 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 03.11.16 at 13:35, <roger.pau@citrix.com> wrote:
> --- a/docs/misc/hvmlite.markdown
> +++ b/docs/misc/hvmlite.markdown
> @@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
>  
>  Description of paravirtualized devices will come from XenStore, just as it's
>  done for HVM guests.
> +
> +## Interrupts ##
> +
> +### Interrupts from physical devices ###
> +
> +Interrupts from physical devices are delivered using native methods, this is
> +done in order to take advantage of new hardware assisted virtualization
> +functions, like posted interrupts.

Okay, that's a reason for this to be optional (iirc AMD doesn't
have posted interrupts so far), not for all the physdev ops being
made not work at all. The more that I think I did point out before
that there is at least one case where interrupt delivery info
needs to be made available to Xen despite Dom0 not setting up
an IO-APIC entry, and hence a physdev op is the only way to
communicate that information.

Jan



* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-03 12:35     ` Roger Pau Monne
  2016-11-03 12:52       ` Jan Beulich
@ 2016-11-03 14:22       ` Konrad Rzeszutek Wilk
  2016-11-03 15:01         ` Roger Pau Monne
  2016-11-03 15:43         ` Roger Pau Monne
  1 sibling, 2 replies; 89+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-03 14:22 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Jan Beulich, xen-devel

On Thu, Nov 03, 2016 at 01:35:37PM +0100, Roger Pau Monne wrote:
> On Mon, Oct 31, 2016 at 10:32:47AM -0600, Jan Beulich wrote:
> > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > > PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
> > > from physical or emulated devices over event channels using PIRQs. This
> > > applies to both DomU and Dom0 PVHv2 guests.
> > > 
> > > Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> > > route physical interrupts (even from emulated devices) over event channels,
> > > and is thus allowed to use some of the PHYSDEV ops.
> > > 
> > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > 
> > The patch looks fine now for its purpose, but I'm hesitant to ack it
> > without us having settled on whether we indeed mean to hide all
> > those physdev ops from Dom0. In particular I don't recall this (and
> > the reasoning behind it) having got written down somewhere.
> 
> I'm planning to add the following doc update together with this commit:
> 
> diff --git a/docs/misc/hvmlite.markdown b/docs/misc/hvmlite.markdown
> index 946908e..4fc757f 100644
> --- a/docs/misc/hvmlite.markdown
> +++ b/docs/misc/hvmlite.markdown
> @@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
>  
>  Description of paravirtualized devices will come from XenStore, just as it's
>  done for HVM guests.
> +
> +## Interrupts ##
> +
> +### Interrupts from physical devices ###
> +
> +Interrupts from physical devices are delivered using native methods, this is
> +done in order to take advantage of new hardware assisted virtualization
> +functions, like posted interrupts. This implies that PVHv2 guests with physical
> +devices will also have the necessary interrupt controllers in order to manage
> +the delivery of interrupts from those devices, using the same interfaces that
> +are available on native hardware.
> +
> +### Interrupts from paravirtualized devices ###
> +
> +Interrupts from paravirtualized devices are delivered using event channels, see
> +[Event Channel Internals][event_channels] for more detailed information about

Is this a must? This mechanism was designed before vAPIC was present -
and it has some inherent disadvantages:

 1) It can't use vAPIC (it actually has to disable it, as it needs to
    turn on the VMX interrupt window to do this).

 2) It is hackish. It completely bypasses the APIC and uses
    VM_ENTRY_INTR_INFO (supposed to be used for traps).

 3) It is also racy for events that are more than 64 values apart (with
    the old 2-level one). That is, you can have this callback vector
    injected a couple of times, as the OS interrupt handler does not
    mask the events.

 4) It causes the guest a VMEXIT (to stop it so that we can tweak
    VM_ENTRY_INTR_INFO).

If we really want to use it, could we instead use the per-vector
that Paul added? HVMOP_set_evtchn_upcall_vector?

Or perhaps just add events -> MSI-X mechanism and then we can also
do this under normal HVM guests?

Either option would require changes in Linux/FreeBSD to deal with this.
> +event channels. In order to inject interrupts into the guest an IDT vector is
> +used. This is the same mechanism used on PVHVM guests, and allows having
> +per-cpu interrupts that can be also used to deliver timers or IPIs if desired.
> +
> +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> +is used with the following values:
> +
> +    domid = DOMID_SELF
> +    index = HVM_PARAM_CALLBACK_IRQ
> +    value = (0x2 << 56) | vector_value
> +
> +The OS has to program the IDT for the `vector_value` using the baremetal
> +mechanism.
> +
> +In order to know which event channel has fired, we need to look into the
> +information provided in the `shared_info` structure. The `evtchn_pending`
> +array is used as a bitmap in order to find out which event channel has
> +fired. Event channels can also be masked by setting its port value in the
> +`shared_info->evtchn_mask` bitmap.

.. Well that is for the 2-level, but the FIFO is a bit different.
> 


* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-03 12:52       ` Jan Beulich
@ 2016-11-03 14:25         ` Konrad Rzeszutek Wilk
  2016-11-03 15:05         ` Roger Pau Monne
  1 sibling, 0 replies; 89+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-03 14:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel, Roger Pau Monne

On Thu, Nov 03, 2016 at 06:52:16AM -0600, Jan Beulich wrote:
> >>> On 03.11.16 at 13:35, <roger.pau@citrix.com> wrote:
> > --- a/docs/misc/hvmlite.markdown
> > +++ b/docs/misc/hvmlite.markdown
> > @@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
> >  
> >  Description of paravirtualized devices will come from XenStore, just as it's
> >  done for HVM guests.
> > +
> > +## Interrupts ##
> > +
> > +### Interrupts from physical devices ###
> > +
> > +Interrupts from physical devices are delivered using native methods, this is
> > +done in order to take advantage of new hardware assisted virtualization
> > +functions, like posted interrupts.
> 
> Okay, that's a reason for this to be optional (iirc AMD doesn't
> have posted interrupts so far), not for all the physdev ops being

They do, it is called AVIC. AMD posted an RFC patch to implement that:

https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg01815.html

And it is mentioned at page 505 in the 24593 (Rev 3.26) manual.

> made not work at all. The more that I think I did point out before
> that there is at least one case where interrupt delivery info
> needs to be made available to Xen despite Dom0 not setting up
> an IO-APIC entry, and hence a physdev op is the only way to
> communicate that information.
> 
> Jan
> 


* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-03 14:22       ` Konrad Rzeszutek Wilk
@ 2016-11-03 15:01         ` Roger Pau Monne
  2016-11-03 15:43         ` Roger Pau Monne
  1 sibling, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-03 15:01 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, boris.ostrovsky, Jan Beulich, xen-devel

On Thu, Nov 03, 2016 at 10:22:37AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 03, 2016 at 01:35:37PM +0100, Roger Pau Monne wrote:
> > On Mon, Oct 31, 2016 at 10:32:47AM -0600, Jan Beulich wrote:
> > > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > > > PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
> > > > from physical or emulated devices over event channels using PIRQs. This
> > > > applies to both DomU and Dom0 PVHv2 guests.
> > > > 
> > > > Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> > > > route physical interrupts (even from emulated devices) over event channels,
> > > > and is thus allowed to use some of the PHYSDEV ops.
> > > > 
> > > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > 
> > > The patch looks fine now for its purpose, but I'm hesitant to ack it
> > > without us having settled on whether we indeed mean to hide all
> > > those physdev ops from Dom0. In particular I don't recall this (and
> > > the reasoning behind it) having got written down somewhere.
> > 
> > I'm planning to add the following doc update together with this commit:
> > 
> > diff --git a/docs/misc/hvmlite.markdown b/docs/misc/hvmlite.markdown
> > index 946908e..4fc757f 100644
> > --- a/docs/misc/hvmlite.markdown
> > +++ b/docs/misc/hvmlite.markdown
> > @@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
> >  
> >  Description of paravirtualized devices will come from XenStore, just as it's
> >  done for HVM guests.
> > +
> > +## Interrupts ##
> > +
> > +### Interrupts from physical devices ###
> > +
> > +Interrupts from physical devices are delivered using native methods, this is
> > +done in order to take advantage of new hardware assisted virtualization
> > +functions, like posted interrupts. This implies that PVHv2 guests with physical
> > +devices will also have the necessary interrupt controllers in order to manage
> > +the delivery of interrupts from those devices, using the same interfaces that
> > +are available on native hardware.
> > +
> > +### Interrupts from paravirtualized devices ###
> > +
> > +Interrupts from paravirtualized devices are delivered using event channels, see
> > +[Event Channel Internals][event_channels] for more detailed information about
> 
> Is this a must? This mechanism was designed before vAPIC was present -
> and has the inherent disadvantage that:
> 
>  1) It can't use vAPIC (it actually has to disable this as it needs to
>     turn on VMX interrupt window to do this).
> 
>  2) It is hackish. It completly bypasses the APIC and it uses the
>     VM_ENTRY_INTR_INFO (suppose to be used for traps).
> 
>  3) It is also racy for events that are more than 64 values apart (with
>     the old 2 level one). That is you can have this callback vector
>     being injected couple of times - as the OS interrupt handler does not
>     mask the events.
> 
>  4) It causes the guest an VMEXIT (to stop it so that we can tweak
>     the VM_ENTRY_INTR_INFO).
> 
> If we really want to use it, could we instead use the per-vector
> that Paul added? HVMOP_set_evtchn_upcall_vector?

PVHv2 guests should be able to use the same mechanism as HVM guests for 
event channel delivery. I'm not familiar with 
HVMOP_set_evtchn_upcall_vector, but AFAICT it's very similar to 
HVM_PARAM_CALLBACK_IRQ, with the difference that each vCPU can specify a 
different vector, right?

> Or perhaps just add events -> MSI-X mechanism and then we can also
> do this under normal HVM guests?

That would be OK, but I think it's out of scope here. If this is ever 
implemented for HVM guests, PVHv2 guests should also be able to use it, 
provided they have a local APIC.

> Either option would require changes in Linux/FreeBSD to deal with this.
> > +event channels. In order to inject interrupts into the guest an IDT vector is
> > +used. This is the same mechanism used on PVHVM guests, and allows having
> > +per-cpu interrupts that can be also used to deliver timers or IPIs if desired.
> > +
> > +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> > +is used with the following values:
> > +
> > +    domid = DOMID_SELF
> > +    index = HVM_PARAM_CALLBACK_IRQ
> > +    value = (0x2 << 56) | vector_value
> > +
> > +The OS has to program the IDT for the `vector_value` using the baremetal
> > +mechanism.
> > +
> > +In order to know which event channel has fired, we need to look into the
> > +information provided in the `shared_info` structure. The `evtchn_pending`
> > +array is used as a bitmap in order to find out which event channel has
> > +fired. Event channels can also be masked by setting its port value in the
> > +`shared_info->evtchn_mask` bitmap.
> 
> .. Well that is for the 2-level, but the FIFO is a bit different.

Right, I've just copy-pasted this from the classic PVH document, which is 
clearly outdated now. We should have a document that describes how event 
channels should be used, for both the 2-level and FIFO implementations.

Roger.


* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-03 12:52       ` Jan Beulich
  2016-11-03 14:25         ` Konrad Rzeszutek Wilk
@ 2016-11-03 15:05         ` Roger Pau Monne
  1 sibling, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-03 15:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Nov 03, 2016 at 06:52:16AM -0600, Jan Beulich wrote:
> >>> On 03.11.16 at 13:35, <roger.pau@citrix.com> wrote:
> > --- a/docs/misc/hvmlite.markdown
> > +++ b/docs/misc/hvmlite.markdown
> > @@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
> >  
> >  Description of paravirtualized devices will come from XenStore, just as it's
> >  done for HVM guests.
> > +
> > +## Interrupts ##
> > +
> > +### Interrupts from physical devices ###
> > +
> > +Interrupts from physical devices are delivered using native methods, this is
> > +done in order to take advantage of new hardware assisted virtualization
> > +functions, like posted interrupts.
> 
> Okay, that's a reason for this to be optional (iirc AMD doesn't
> have posted interrupts so far), not for all the physdev ops being
> made not work at all. The more that I think I did point out before
> that there is at least one case where interrupt delivery info
> needs to be made available to Xen despite Dom0 not setting up
> an IO-APIC entry, and hence a physdev op is the only way to
> communicate that information.

IIRC we already had this discussion earlier, and the only thing that could 
need such information is a serial console that uses a non-ISA IRQ, which we 
can also drive in polling mode. IMHO we can always enable some PHYSDEV ops 
if they are strictly necessary, but I would like to start with as few (i.e. 
none) as possible.

Roger.


* Re: [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-11-03 14:22       ` Konrad Rzeszutek Wilk
  2016-11-03 15:01         ` Roger Pau Monne
@ 2016-11-03 15:43         ` Roger Pau Monne
  1 sibling, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-03 15:43 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, boris.ostrovsky, Jan Beulich, xen-devel

On Thu, Nov 03, 2016 at 10:22:37AM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 03, 2016 at 01:35:37PM +0100, Roger Pau Monne wrote:
> > On Mon, Oct 31, 2016 at 10:32:47AM -0600, Jan Beulich wrote:
> > > >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > > > PVHv2 guests, unlike HVM guests, won't have the option to route interrupts
> > > > from physical or emulated devices over event channels using PIRQs. This
> > > > applies to both DomU and Dom0 PVHv2 guests.
> > > > 
> > > > Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> > > > route physical interrupts (even from emulated devices) over event channels,
> > > > and is thus allowed to use some of the PHYSDEV ops.
> > > > 
> > > > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > > 
> > > The patch looks fine now for its purpose, but I'm hesitant to ack it
> > > without us having settled on whether we indeed mean to hide all
> > > those physdev ops from Dom0. In particular I don't recall this (and
> > > the reasoning behind it) having got written down somewhere.
> > 
> > I'm planning to add the following doc update together with this commit:
> > 
> > diff --git a/docs/misc/hvmlite.markdown b/docs/misc/hvmlite.markdown
> > index 946908e..4fc757f 100644
> > --- a/docs/misc/hvmlite.markdown
> > +++ b/docs/misc/hvmlite.markdown
> > @@ -75,3 +75,38 @@ info structure that's passed at boot time (field rsdp_paddr).
> >  
> >  Description of paravirtualized devices will come from XenStore, just as it's
> >  done for HVM guests.
> > +
> > +## Interrupts ##
> > +
> > +### Interrupts from physical devices ###
> > +
> > +Interrupts from physical devices are delivered using native methods; this is
> > +done in order to take advantage of new hardware-assisted virtualization
> > +functions, like posted interrupts. This implies that PVHv2 guests with physical
> > +devices will also have the necessary interrupt controllers in order to manage
> > +the delivery of interrupts from those devices, using the same interfaces that
> > +are available on native hardware.
> > +
> > +### Interrupts from paravirtualized devices ###
> > +
> > +Interrupts from paravirtualized devices are delivered using event channels, see
> > +[Event Channel Internals][event_channels] for more detailed information about
> 
> Is this a must? This mechanism was designed before vAPIC was present -
> and has the inherent disadvantage that:
> 
>  1) It can't use vAPIC (it actually has to disable this as it needs to
>     turn on VMX interrupt window to do this).
> 
>  2) It is hackish. It completly bypasses the APIC and it uses the
>     VM_ENTRY_INTR_INFO (suppose to be used for traps).
> 
>  3) It is also racy for events that are more than 64 values apart (with
>     the old 2 level one). That is you can have this callback vector
>     being injected couple of times - as the OS interrupt handler does not
>     mask the events.
> 
>  4) It causes the guest an VMEXIT (to stop it so that we can tweak
>     the VM_ENTRY_INTR_INFO).
> 
> If we really want to use it, could we instead use the per-vector
> that Paul added? HVMOP_set_evtchn_upcall_vector?
> 
> Or perhaps just add events -> MSI-X mechanism and then we can also
> do this under normal HVM guests?
> 
> Either option would require changes in Linux/FreeBSD to deal with this.
> > +event channels. In order to inject interrupts into the guest an IDT vector is
> > +used. This is the same mechanism used on PVHVM guests, and allows having
> > +per-cpu interrupts that can also be used to deliver timers or IPIs if desired.
> > +
> > +In order to register the callback IDT vector the `HVMOP_set_param` hypercall
> > +is used with the following values:
> > +
> > +    domid = DOMID_SELF
> > +    index = HVM_PARAM_CALLBACK_IRQ
> > +    value = (0x2 << 56) | vector_value
> > +
> > +The OS has to program the IDT for the `vector_value` using the baremetal
> > +mechanism.
> > +
> > +In order to know which event channel has fired, we need to look into the
> > +information provided in the `shared_info` structure. The `evtchn_pending`
> > +array is used as a bitmap in order to find out which event channel has
> > +fired. Event channels can also be masked by setting its port value in the
> > +`shared_info->evtchn_mask` bitmap.
> 
> .. Well that is for the 2-level, but the FIFO is a bit different.

I've modified the last paragraph, so now it's less specific about how event 
channels work (it doesn't mention l2 internals anymore):

### Interrupts from paravirtualized devices ###

Interrupts from paravirtualized devices are delivered using event channels, see
[Event Channel Internals][event_channels] for more detailed information about
event channels. Delivery of those interrupts can be configured in the same way
as HVM guests, check xen/include/public/hvm/params.h and
xen/include/public/hvm/hvm_op.h for more information about available delivery
methods.

Roger.
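For illustration only, the delivery-method encoding from the quoted `HVMOP_set_param` snippet, together with the 2-level pending-bitmap lookup that the earlier draft described, can be reduced to a small self-contained sketch. The helper names here are illustrative stand-ins, not Xen API, and the scan models only the simple 2-level ABI, not FIFO:

```c
#include <stdint.h>

/* Encode HVM_PARAM_CALLBACK_IRQ as described above: delivery type 0x2
 * (IDT vector) in bits 56-63, the vector value in the low bits. */
static uint64_t callback_via_vector(uint8_t vector)
{
    return (UINT64_C(0x2) << 56) | vector;
}

/* Toy 2-level lookup: return the first port that is pending and not
 * masked, or -1 if none.  Mirrors scanning shared_info->evtchn_pending
 * against shared_info->evtchn_mask. */
static int first_pending_port(const uint64_t *pending,
                              const uint64_t *mask, unsigned int words)
{
    for ( unsigned int w = 0; w < words; w++ )
    {
        uint64_t ready = pending[w] & ~mask[w];

        if ( ready )
            return w * 64 + __builtin_ctzll(ready);
    }
    return -1;
}
```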


* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-10-29  8:59 ` [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions Roger Pau Monne
@ 2016-11-04  9:16   ` Jan Beulich
  2016-11-04  9:45     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04  9:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -1049,22 +1049,29 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
>  
>      mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
>  
> -    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
> +    switch ( p2mt )
> +    {
> +    case p2m_invalid:
> +    case p2m_mmio_dm:
>          ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
>                              p2m_mmio_direct, p2ma);
> -    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
> -    {
> -        ret = 0;
> -        /*
> -         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
> -         * but iomem regions are not mapped with IOMMU. This makes sure that
> -         * RMRRs are correctly mapped with IOMMU.
> -         */
> -        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
> +        if ( ret )
> +            break;
> +        /* fallthrough */
> +    case p2m_mmio_direct:
> +        if ( p2mt == p2m_mmio_direct && a != p2ma )

I don't understand the removal of the MFN == GFN check, and it
also isn't being explained in the commit message.

And then following a case label with a comparison of the respective
switch expression against the very value from the case label is
certainly odd. I'm pretty sure a better structure of the code could be
found.

> +        {
> +            printk(XENLOG_G_WARNING
> +                   "Cannot setup identity map d%d:%lx, already mapped with "
> +                   "different access type (current: %d, requested: %d).\n",

Please avoid full stops at the end of log messages.

Jan



* Re: [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-10-29  8:59 ` [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
@ 2016-11-04  9:19   ` Jan Beulich
  2016-11-04  9:47     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04  9:19 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
>          xfree(v->arch.pv_vcpu.trap_ctxt);
>  }
>  
> +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
> +{
> +
> +    if ( is_hvm_domain(d) )
> +    {
> +        if ( is_hardware_domain(d) &&
> +             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
> +            return false;

Why are hardware domains required to get all three?

Jan



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04  9:16   ` Jan Beulich
@ 2016-11-04  9:45     ` Roger Pau Monne
  2016-11-04 10:34       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04  9:45 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 03:16:47AM -0600, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/mm/p2m.c
> > +++ b/xen/arch/x86/mm/p2m.c
> > @@ -1049,22 +1049,29 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
> >  
> >      mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
> >  
> > -    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
> > +    switch ( p2mt )
> > +    {
> > +    case p2m_invalid:
> > +    case p2m_mmio_dm:
> >          ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> >                              p2m_mmio_direct, p2ma);
> > -    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
> > -    {
> > -        ret = 0;
> > -        /*
> > -         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
> > -         * but iomem regions are not mapped with IOMMU. This makes sure that
> > -         * RMRRs are correctly mapped with IOMMU.
> > -         */
> > -        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
> > +        if ( ret )
> > +            break;
> > +        /* fallthrough */
> > +    case p2m_mmio_direct:
> > +        if ( p2mt == p2m_mmio_direct && a != p2ma )
> 
> I don't understand the removal of the MFN == GFN check, and it
> also isn't being explained in the commit message.

Maybe I'm not understanding the logic of this function correctly, but it 
seems extremely bogus, and behaves quite differently depending on whether 
gfn == mfn and whether the domain is the hardware domain.

If gfn == mfn (so the page is already mapped in the p2m) and the domain is 
the hardware domain, an IOMMU mapping would be established. If the gfn is not 
yet mapped, we will just set the p2m entry, but the IOMMU is not going to be 
properly configured, unless it shares the page tables with the p2m.

This patch fixes the behavior of the function so it's consistent, and we 
can guarantee that after calling it a proper mapping in the p2m and the IOMMU 
will exist, and that it's going to be gfn == mfn, or else an error will be 
returned.

I agree with you that the mfn == gfn check should be kept, so the condition 
above should be:

	if ( p2mt == p2m_mmio_direct && (a != p2ma || gfn != mfn) )

But please see below.

> And then following a case label with a comparison of the respective
> switch expression against the very value from the case label is
> certainly odd. I'm pretty sure a better structure of the code could be
> found.

Hm, the comparison is there because of the fallthrough in the above case. I 
could remove it by also setting the IOMMU entry in the above case, if that's 
better, so it would look like:

case p2m_invalid:
case p2m_mmio_dm:
    ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
                        p2m_mmio_direct, p2ma);
    if ( ret )
        break;
    if ( !iommu_use_hap_pt(d) )
        ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
    break;
case p2m_mmio_direct:
    if ( a != p2ma || gfn != mfn )
    {
        printk(XENLOG_G_WARNING
               "Cannot setup identity map d%d:%lx, already mapped with "
               "different access type or mfn\n", d->domain_id, gfn);
        ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
        break;
    }
    if ( !iommu_use_hap_pt(d) )
        ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
    break;

Or I could add an if before entering the switch case that checks if type is
p2m_mmio_direct and if the access type and mfn matches what we expect, but I 
think that's not going to make the code easier.

> > +        {
> > +            printk(XENLOG_G_WARNING
> > +                   "Cannot setup identity map d%d:%lx, already mapped with "
> > +                   "different access type (current: %d, requested: %d).\n",
> 
> Please avoid full stops at the end of log messages.

Done.

Thanks, Roger.
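As an aside, the decision flow being proposed in this exchange (leaving out the actual p2m/IOMMU plumbing) can be modelled as a tiny testable sketch. The enum and helper are illustrative stand-ins for the Xen types, not the real `set_identity_p2m_entry`:

```c
#include <stdbool.h>

typedef enum { p2m_invalid, p2m_mmio_dm, p2m_mmio_direct, p2m_ram } p2m_type_t;

/* Model of the proposed switch: 0 on success, -1 (EBUSY-like) on a
 * conflicting existing mapping.  "relaxed" stands in for the
 * XEN_DOMCTL_DEV_RDM_RELAXED flag. */
static int identity_map_decision(p2m_type_t p2mt, unsigned long gfn,
                                 unsigned long mfn, int a, int p2ma,
                                 bool relaxed)
{
    switch ( p2mt )
    {
    case p2m_invalid:
    case p2m_mmio_dm:
        return 0;                     /* set p2m entry + IOMMU mapping */
    case p2m_mmio_direct:
        if ( a != p2ma || gfn != mfn )/* conflicting existing mapping */
            return relaxed ? 0 : -1;
        return 0;                     /* ensure the IOMMU mapping exists */
    default:
        return -1;                    /* unexpected type */
    }
}
```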


* Re: [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-04  9:19   ` Jan Beulich
@ 2016-11-04  9:47     ` Roger Pau Monne
  2016-11-04 10:21       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04  9:47 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 03:19:11AM -0600, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
> >          xfree(v->arch.pv_vcpu.trap_ctxt);
> >  }
> >  
> > +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
> > +{
> > +
> > +    if ( is_hvm_domain(d) )
> > +    {
> > +        if ( is_hardware_domain(d) &&
> > +             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
> > +            return false;
> 
> Why are hardware domains required to get all three?

The PIT is always enabled for hardware domains, although we might consider 
disabling it for PVHv2 Dom0? TBH, I don't have a strong opinion here.

The local APIC and IO APIC are required in order to deliver interrupts from 
physical devices (this is only for PVHv2 hardware domains).

Roger.


* Re: [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-04  9:47     ` Roger Pau Monne
@ 2016-11-04 10:21       ` Jan Beulich
  2016-11-04 12:09         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 10:21 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 10:47, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 03:19:11AM -0600, Jan Beulich wrote:
>> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > --- a/xen/arch/x86/domain.c
>> > +++ b/xen/arch/x86/domain.c
>> > @@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
>> >          xfree(v->arch.pv_vcpu.trap_ctxt);
>> >  }
>> >  
>> > +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
>> > +{
>> > +
>> > +    if ( is_hvm_domain(d) )
>> > +    {
>> > +        if ( is_hardware_domain(d) &&
>> > +             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
>> > +            return false;
>> 
>> Why are hardware domains required to get all three?
> 
> The PIT is always enabled for hardware domains, although we might consider 
> disabling it for PVHv2 Dom0? TBH, I don't have a strong opinion here.

I think unless there's a reason to require it, it should be optional.

> The local APIC and IO APIC are required in order to deliver interrupts from 
> physical devices (this is only for PVHv2 hardware domains).

I can see the need for an LAPIC, but I don't think IO-APICs are
strictly necessary. Therefore I'd like the option of it being optional
at least to be considered (perhaps by way of a brief note in the
commit message).

Jan



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04  9:45     ` Roger Pau Monne
@ 2016-11-04 10:34       ` Jan Beulich
  2016-11-04 12:25         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 10:34 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 10:45, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 03:16:47AM -0600, Jan Beulich wrote:
>> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > --- a/xen/arch/x86/mm/p2m.c
>> > +++ b/xen/arch/x86/mm/p2m.c
>> > @@ -1049,22 +1049,29 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
>> >  
>> >      mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
>> >  
>> > -    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
>> > +    switch ( p2mt )
>> > +    {
>> > +    case p2m_invalid:
>> > +    case p2m_mmio_dm:
>> >          ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
>> >                              p2m_mmio_direct, p2ma);
>> > -    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
>> > -    {
>> > -        ret = 0;
>> > -        /*
>> > -         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
>> > -         * but iomem regions are not mapped with IOMMU. This makes sure that
>> > -         * RMRRs are correctly mapped with IOMMU.
>> > -         */
>> > -        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
>> > +        if ( ret )
>> > +            break;
>> > +        /* fallthrough */
>> > +    case p2m_mmio_direct:
>> > +        if ( p2mt == p2m_mmio_direct && a != p2ma )
>> 
>> I don't understand the removal of the MFN == GFN check, and it
>> also isn't being explained in the commit message.
> 
> Maybe I'm not understanding the logic of this function correctly, but it 
> seems extremely bogus, and behaves quite differently depending on whether 
> gfn == mfn and whether the domain is the hardware domain.

I can't exclude there's something wrong here, but you're removing
a safety belt. Before touching this, did you go back in history to
find out why things are the way they are? I remember it having
taken quite a bit of discussion to reach a mostly acceptable flow
here.

> If gfn == mfn (so the page is already mapped in the p2m) and the domain is 
> the hardware domain, an IOMMU mapping would be established. If gfn is not 
> set, we will just set the p2m entry, but the IOMMU is not going to be 
> properly configured, unless it shares the pt with p2m.

Well, that's why the comment says "PVH fixme". The issue is not
the code here, but the code which established the mapping we
found here. That code fails to also do the IOMMU mapping when
needed. The only correct course of action, afaict, would be to
fix that other code (wherever that is) and remove the comment
together with the bogus code here (which would lead to just
"ret = 0" remaining).

> This patch fixes the behavior of the function so it's consistent, and we 
> can guarantee that after calling it a proper mapping in the p2m and the IOMMU 
> will exist, and that it's going to be gfn == mfn, or else an error will be returned.
> 
> I agree with you that the mfn == gfn check should be kept, so the condition 
> above should be:
> 
> 	if ( p2mt == p2m_mmio_direct && (a != p2ma || gfn != mfn) )
> 
> But please see below.
> 
>> And then following a case label with a comparison of the respective
>> switch expression against the very value from the case label is
>> certainly odd. I'm pretty sure a better structure of the code could be
>> found.
> 
> Hm, the comparison is there because of the fallthrough in the above case. I 
> could remove it by also setting the IOMMU entry in the above case, if that's 
> better, so it would look like:
> 
> case p2m_invalid:
> case p2m_mmio_dm:
>     ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
>                         p2m_mmio_direct, p2ma);
>     if ( ret )
>         break;
>     if ( !iommu_use_hap_pt(d) )
>         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
>     break;
> case p2m_mmio_direct:
>     if ( a != p2ma || gfn != mfn )
>     {
>         printk(XENLOG_G_WARNING
>                "Cannot setup identity map d%d:%lx, already mapped with "
>                "different access type or mfn\n", d->domain_id, gfn);
>         ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
>         break;
>     }
>     if ( !iommu_use_hap_pt(d) )
>         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);

Well, since according to what I've said above this code should
really not be here, I think the code structuring question is moot
now. The conditional call to iommu_map_page() really just needs
adding alongside the p2m_set_entry() call.

Jan


* Re: [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-04 10:21       ` Jan Beulich
@ 2016-11-04 12:09         ` Roger Pau Monne
  2016-11-04 12:50           ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 12:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 04:21:02AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 10:47, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 04, 2016 at 03:19:11AM -0600, Jan Beulich wrote:
> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> > --- a/xen/arch/x86/domain.c
> >> > +++ b/xen/arch/x86/domain.c
> >> > @@ -509,6 +509,27 @@ void vcpu_destroy(struct vcpu *v)
> >> >          xfree(v->arch.pv_vcpu.trap_ctxt);
> >> >  }
> >> >  
> >> > +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
> >> > +{
> >> > +
> >> > +    if ( is_hvm_domain(d) )
> >> > +    {
> >> > +        if ( is_hardware_domain(d) &&
> >> > +             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC) )
> >> > +            return false;
> >> 
> >> Why are hardware domains required to get all three?
> > 
> > The PIT is always enabled for hardware domains, although we might consider 
> > disabling it for PVHv2 Dom0? TBH, I don't have a strong opinion here.
> 
> I think unless there's a reason to require it, it should be optional.

Ack.
 
> > The local APIC and IO APIC are required in order to deliver interrupts from 
> > physical devices (this is only for PVHv2 hardware domains).
> 
> I can see the need for an LAPIC, but I don't think IO-APICs are
> strictly necessary. Therefore I'd like the option of it being optional
> at least to be considered (perhaps by way of a brief note in the
> commit message).

While it should be possible to run without an IO APIC, AFAICT most USB 
controllers still only support legacy PCI interrupts, and then the SCI ACPI 
interrupt is also delivered from an ISA IRQ. I could make the IO APIC 
optional, but it's going to hinder the functionality of a Dom0 IMHO if 
disabled.

What about adding a no-ioapic option to the dom0= list of options?

Roger.


* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 10:34       ` Jan Beulich
@ 2016-11-04 12:25         ` Roger Pau Monne
  2016-11-04 12:53           ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 12:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 04:34:58AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 10:45, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 04, 2016 at 03:16:47AM -0600, Jan Beulich wrote:
> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> > --- a/xen/arch/x86/mm/p2m.c
> >> > +++ b/xen/arch/x86/mm/p2m.c
> >> > @@ -1049,22 +1049,29 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
> >> >  
> >> >      mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
> >> >  
> >> > -    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
> >> > +    switch ( p2mt )
> >> > +    {
> >> > +    case p2m_invalid:
> >> > +    case p2m_mmio_dm:
> >> >          ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> >> >                              p2m_mmio_direct, p2ma);
> >> > -    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
> >> > -    {
> >> > -        ret = 0;
> >> > -        /*
> >> > -         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
> >> > -         * but iomem regions are not mapped with IOMMU. This makes sure that
> >> > -         * RMRRs are correctly mapped with IOMMU.
> >> > -         */
> >> > -        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
> >> > +        if ( ret )
> >> > +            break;
> >> > +        /* fallthrough */
> >> > +    case p2m_mmio_direct:
> >> > +        if ( p2mt == p2m_mmio_direct && a != p2ma )
> >> 
> >> I don't understand the removal of the MFN == GFN check, and it
> >> also isn't being explained in the commit message.
> > 
> > Maybe I'm not understanding the logic of this function correctly, but it 
> > seems extremely bogus, and behaves quite differently depending on whether 
> > gfn == mfn and whether the domain is the hardware domain.
> 
> I can't exclude there's something wrong here, but you're removing
> a safety belt. Before touching this, did you go back in history to
> find out why things are the way they are? I remember it having
> taken quite a bit of discussion to reach a mostly acceptable flow
> here.

As said, I agree that the gfn == mfn check should be kept.

I've looked at 0e9e09 and 5ae039, but I cannot really understand how 5ae039 
was supposed to work in the first place and create the proper IOMMU 
mappings for RMRR regions. It replaced a call to intel_iommu_map_page with a 
call to set_identity_p2m_entry, and this newly introduced function 
(set_identity_p2m_entry) will only set up the p2m mappings for the required 
page, while completely skipping the IOMMU mappings if the pt is 
not shared between HAP and the IOMMU.

Then 0e9e09 is a fixup for PVH guests, which really require RMRR regions 
properly mapped in the IOMMU in order to run. Since on PVH guests holes and 
reserved regions are identity mapped in the p2m, RMRR regions should already 
be mapped in the p2m, so 0e9e09 just added the IOMMU mappings if the pt was 
not shared.

Yet I think that 0e9e09 is wrong: it fixed RMRR mappings for 
hardware that shares the pt between HAP and the IOMMU while breaking them for 
hardware that doesn't.
 
> > If gfn == mfn (so the page is already mapped in the p2m) and the domain is 
> > the hardware domain, an IOMMU mapping would be established. If gfn is not 
> > set, we will just set the p2m entry, but the IOMMU is not going to be 
> > properly configured, unless it shares the pt with p2m.
> 
> Well, that's why the comment says "PVH fixme". The issue is not
> the code here, but the code which established the mapping we
> found here. That code fails to also do the IOMMU mapping when
> needed. The only correct course of action, afaict, would be to
> fix that other code (wherever that is) and remove the comment
> together with the bogus code here (which would lead to just
> "ret = 0" remaining.

On classic PVH all holes or reserved regions in the memory map are identity 
mapped into the p2m; this is why RMRR regions were expected to be already 
mapped in the p2m. This is no longer true for PVHv2 domains, where holes or 
reserved regions are no longer mapped by default into the p2m.

> > This patch fixes the behavior of the function so it's consistent, and we 
> > can guarantee that after calling it a proper mapping in the p2m and the IOMMU 
> > will exist, and that it's going to be gfn == mfn, or else an error will be returned.
> > 
> > I agree with you that the mfn == gfn check should be kept, so the condition 
> > above should be:
> > 
> > 	if ( p2mt == p2m_mmio_direct && (a != p2ma || gfn != mfn) )
> > 
> > But please see below.
> > 
> >> And then following a case label with a comparison of the respective
> >> switch expression against the very value from the case label is
> >> certainly odd. I'm pretty sure a better structure of the code could be
> >> found.
> > 
> > Hm, the comparison is there because of the fallthrough in the above case. I 
> > could remove it by also setting the IOMMU entry in the above case, if that's 
> > better, so it would look like:
> > 
> > case p2m_invalid:
> > case p2m_mmio_dm:
> >     ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> >                         p2m_mmio_direct, p2ma);
> >     if ( ret )
> >         break;
> >     if ( !iommu_use_hap_pt(d) )
> >         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> >     break;
> > case p2m_mmio_direct:
> >     if ( a != p2ma || gfn != mfn )
> >     {
> >         printk(XENLOG_G_WARNING
> >                "Cannot setup identity map d%d:%lx, already mapped with "
> >                "different access type or mfn\n", d->domain_id, gfn);
> >         ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
> >         break;
> >     }
> >     if ( !iommu_use_hap_pt(d) )
> >         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> 
> Well, since according to what I've said above this code should
> really not be here, I think the code structuring question is moot
> now. The conditional call to iommu_map_page() really just needs
> adding alongside the p2m_set_entry() call.

OK, so if the gfn is already mapped into the p2m we don't care whether it 
has a valid IOMMU mapping or not?

Roger.


* Re: [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-04 12:09         ` Roger Pau Monne
@ 2016-11-04 12:50           ` Jan Beulich
  2016-11-04 13:06             ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 12:50 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 13:09, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 04:21:02AM -0600, Jan Beulich wrote:
>> >>> On 04.11.16 at 10:47, <roger.pau@citrix.com> wrote:
>> > The local APIC and IO APIC are required in order to deliver interrupts from 
>> > physical devices (this is only for PVHv2 hardware domains).
>> 
>> I can see the need for an LAPIC, but I don't think IO-APICs are
>> strictly necessary. Therefore I'd like the option of it being optional
>> at least to be considered (perhaps by way of a brief note in the
>> commit message).
> 
> While it should be possible to run without an IO APIC, AFAICT most USB 
> controllers still only support legacy PCI interrupts, and then the SCI ACPI 
> interrupt is also delivered from an ISA IRQ. I could make the IO APIC 
> optional, but it's going to hinder the functionality of a Dom0 IMHO if 
> disabled.
> 
> What about adding a no-ioapic option to the dom0= list of options?

That's a possible route to go, but mostly orthogonal to the question
here (you talk about mechanism to effect this being optional,
whereas here the question is whether it should be optional in the
first place).

Jan



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 12:25         ` Roger Pau Monne
@ 2016-11-04 12:53           ` Jan Beulich
  2016-11-04 13:03             ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 12:53 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 13:25, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 04:34:58AM -0600, Jan Beulich wrote:
>> >>> On 04.11.16 at 10:45, <roger.pau@citrix.com> wrote:
>> > case p2m_invalid:
>> > case p2m_mmio_dm:
>> >     ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
>> >                         p2m_mmio_direct, p2ma);
>> >     if ( ret )
>> >         break;
>> >     if ( !iommu_use_hap_pt(d) )
>> >         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
>> >     break;
>> > case p2m_mmio_direct:
>> >     if ( a != p2ma || gfn != mfn )
>> >     {
>> >         printk(XENLOG_G_WARNING
>> >                "Cannot setup identity map d%d:%lx, already mapped with "
>> >                "different access type or mfn\n", d->domain_id, gfn);
>> >         ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
>> >         break;
>> >     }
>> >     if ( !iommu_use_hap_pt(d) )
>> >         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
>> 
>> Well, since according to what I've said above this code should
>> really not be here, I think the code structuring question is moot
>> now. The conditional call to iommu_map_page() really just needs
>> adding alongside the p2m_set_entry() call.
> 
> OK, so if the gfn is already mapped into the p2m we don't care whether it 
> has a valid IOMMU mapping or not?

We do care, but it is the responsibility of whoever established the
first mapping to make sure it's present in both P2M and IOMMU.
IOW if the GFN is already mapped, we should be able to imply that
it's mapped in both places.

Jan



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 12:53           ` Jan Beulich
@ 2016-11-04 13:03             ` Roger Pau Monne
  2016-11-04 13:16               ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 13:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 06:53:09AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 13:25, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 04, 2016 at 04:34:58AM -0600, Jan Beulich wrote:
> >> >>> On 04.11.16 at 10:45, <roger.pau@citrix.com> wrote:
> >> > case p2m_invalid:
> >> > case p2m_mmio_dm:
> >> >     ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> >> >                         p2m_mmio_direct, p2ma);
> >> >     if ( ret )
> >> >         break;
> >> >     if ( !iommu_use_hap_pt(d) )
> >> >         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> >> >     break;
> >> > case p2m_mmio_direct:
> >> >     if ( a != p2ma || gfn != mfn )
> >> >     {
> >> >         printk(XENLOG_G_WARNING
> >> >                "Cannot setup identity map d%d:%lx, already mapped with "
> >> >                "different access type or mfn\n", d->domain_id, gfn);
> >> >         ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
> >> >         break;
> >> >     }
> >> >     if ( !iommu_use_hap_pt(d) )
> >> >         ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> >> 
> >> Well, since according to what I've said above this code should
> >> really not be here, I think the code structuring question is moot
> >> now. The conditional call to iommu_map_page() really just needs
> >> adding alongside the p2m_set_entry() call.
> > 
> > OK, so if the gfn is already mapped into the p2m we don't care whether it 
> > has a valid IOMMU mapping or not?
> 
> We do care, but it is the responsibility of whoever established the
> first mapping to make sure it's present in both P2M and IOMMU.
> IOW if the GFN is already mapped, we should be able to imply that
> it's mapped in both places.

But how is the first caller that established the mapping supposed to know if 
it needs an IOMMU entry or not? (p2m_mmio_direct types don't get an IOMMU 
mapping at all)

Are we expecting the first caller that setup the mapping to also know about 
RMRR regions and add the IOMMU entry if needed?

Roger.


* Re: [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-11-04 12:50           ` Jan Beulich
@ 2016-11-04 13:06             ` Roger Pau Monne
  0 siblings, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 13:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 06:50:10AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 13:09, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 04, 2016 at 04:21:02AM -0600, Jan Beulich wrote:
> >> >>> On 04.11.16 at 10:47, <roger.pau@citrix.com> wrote:
> >> > The local APIC and IO APIC are required in order to deliver interrupts from 
> >> > physical devices (this is only for PVHv2 hardware domains).
> >> 
> >> I can see the need for an LAPIC, but I don't think IO-APICs are
> >> strictly necessary. Therefore I'd like the option of it being optional
> >> at least to be considered (perhaps by way of a brief note in the
> >> commit message).
> > 
> > While it should be possible to run without an IO APIC, AFAICT most USB 
> > controllers still only support legacy PCI interrupts, and then the SCI ACPI 
> > interrupt is also delivered from an ISA IRQ. I could make the IO APIC 
> > optional, but it's going to hinder the functionality of a Dom0 IMHO if 
> > disabled.
> > 
> > What about adding a no-ioapic option to the dom0= list of options?
> 
> That's a possible route to go, but mostly orthogonal to the question
> here (you talk about mechanism to effect this being optional,
> whereas here the question is whether it should be optional in the
> first place).

As I've stated in the first paragraph, I think an IO APIC is needed 
for Dom0 to work properly.

Then I don't mind adding an option (disabled by default) if someone believes 
it can run a Dom0 without an IO APIC. In any case, this can always be 
modified later, because IO APIC presence is reported in the MADT, and it's 
not set in stone. 

Roger.


* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 13:03             ` Roger Pau Monne
@ 2016-11-04 13:16               ` Jan Beulich
  2016-11-04 15:33                 ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 13:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 14:03, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 06:53:09AM -0600, Jan Beulich wrote:
>> >>> On 04.11.16 at 13:25, <roger.pau@citrix.com> wrote:
>> > On Fri, Nov 04, 2016 at 04:34:58AM -0600, Jan Beulich wrote:
>> >> >>> On 04.11.16 at 10:45, <roger.pau@citrix.com> wrote:
>> >> > case p2m_invalid:
>> >> > case p2m_mmio_dm:
>> >> >     ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
>> >> >                         p2m_mmio_direct, p2ma);
>> >> >     if ( ret )
>> >> >         break;
>> >> >     if ( !iommu_use_hap_pt(d) )
>> >> >     ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
>> >> >     break;
>> >> > case p2m_mmio_direct:
>> >> >     if ( a != p2ma || gfn != mfn )
>> >> >     {
>> >> >         printk(XENLOG_G_WARNING
>> >> >                "Cannot setup identity map d%d:%lx, already mapped with "
>> >> >                "different access type or mfn\n", d->domain_id, gfn);
>> >> >         ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
>> >> >         break;
>> >> >     }
>> >> >     if ( !iommu_use_hap_pt(d) )
>> >> >     ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
>> >> 
>> >> Well, since according to what I've said above this code should
>> >> really not be here, I think the code structuring question is moot
>> >> now. The conditional call to iommu_map_page() really just needs
>> >> adding alongside the p2m_set_entry() call.
>> > 
>> > OK, so if the gfn is already mapped into the p2m we don't care whether it 
>> > has a valid IOMMU mapping or not?
>> 
>> We do care, but it is the responsibility of whoever established the
>> first mapping to make sure it's present in both P2M and IOMMU.
>> IOW if the GFN is already mapped, we should be able to imply that
>> it's mapped in both places.
> 
> But how is the first caller that established the mapping supposed to know if 
> it needs an IOMMU entry or not? (p2m_mmio_direct types don't get an IOMMU 
> mapping at all)

And it's that fact stated in parentheses which I'd like to question.
I don't see what's wrong with e.g. DMAing right into / out of a
video frame buffer.

> Are we expecting the first caller that setup the mapping to also know about 
> RMRR regions and add the IOMMU entry if needed?

This has nothing to do with RMRR regions: Whoever establishes
_some_ P2M mapping ought to also establish an IOMMU one, if
the tables aren't shared (unless there is an explicit reason not to do
so).

Jan
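The pairing rule stated above — whoever establishes a P2M mapping also establishes the matching IOMMU entry whenever the page tables aren't shared — can be illustrated with a standalone toy model. Every structure and helper below is a simplified stand-in invented for illustration, not the real Xen interface:

```c
#include <stdbool.h>

#define IOMMUF_READABLE  (1u << 0)
#define IOMMUF_WRITABLE  (1u << 1)

/* Toy domain: two bitmaps stand in for the P2M and IOMMU page tables. */
struct sim_domain {
    bool shared_pt;          /* IOMMU shares the CPU page tables */
    unsigned long p2m_map;   /* bitmap of gfns mapped in the P2M */
    unsigned long iommu_map; /* bitmap of gfns mapped in the IOMMU */
};

/* Stand-in for p2m_set_entry(). */
static int sim_p2m_set_entry(struct sim_domain *d, unsigned int gfn)
{
    d->p2m_map |= 1ul << gfn;
    return 0;
}

/* Stand-in for iommu_map_page(). */
static int sim_iommu_map_page(struct sim_domain *d, unsigned int gfn,
                              unsigned int flags)
{
    (void)flags;
    d->iommu_map |= 1ul << gfn;
    return 0;
}

/* The invariant: one function establishes both mappings, so a gfn
 * present in the P2M can be assumed present in the IOMMU as well. */
static int sim_map_identity(struct sim_domain *d, unsigned int gfn)
{
    int ret = sim_p2m_set_entry(d, gfn);

    if ( ret )
        return ret;
    if ( !d->shared_pt )
        ret = sim_iommu_map_page(d, gfn,
                                 IOMMUF_READABLE | IOMMUF_WRITABLE);
    return ret;
}
```

With `shared_pt` set, the IOMMU step is skipped, mirroring the `iommu_use_hap_pt(d)` checks in the quoted code.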



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 13:16               ` Jan Beulich
@ 2016-11-04 15:33                 ` Roger Pau Monne
  2016-11-04 16:13                   ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 15:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 07:16:08AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 14:03, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 04, 2016 at 06:53:09AM -0600, Jan Beulich wrote:
> >> >>> On 04.11.16 at 13:25, <roger.pau@citrix.com> wrote:
> >> > On Fri, Nov 04, 2016 at 04:34:58AM -0600, Jan Beulich wrote:
> >> >> >>> On 04.11.16 at 10:45, <roger.pau@citrix.com> wrote:
> >> >> > case p2m_invalid:
> >> >> > case p2m_mmio_dm:
> >> >> >     ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> >> >> >                         p2m_mmio_direct, p2ma);
> >> >> >     if ( ret )
> >> >> >         break;
> >> >> >     if ( !iommu_use_hap_pt(d) )
> >> >> >     ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> >> >> >     break;
> >> >> > case p2m_mmio_direct:
> >> >> >     if ( a != p2ma || gfn != mfn )
> >> >> >     {
> >> >> >         printk(XENLOG_G_WARNING
> >> >> >                "Cannot setup identity map d%d:%lx, already mapped with "
> >> >> >                "different access type or mfn\n", d->domain_id, gfn);
> >> >> >         ret = (flag & XEN_DOMCTL_DEV_RDM_RELAXED) ? 0 : -EBUSY;
> >> >> >         break;
> >> >> >     }
> >> >> >     if ( !iommu_use_hap_pt(d) )
> >> >> >     ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> >> >> 
> >> >> Well, since according to what I've said above this code should
> >> >> really not be here, I think the code structuring question is moot
> >> >> now. The conditional call to iommu_map_page() really just needs
> >> >> adding alongside the p2m_set_entry() call.
> >> > 
> >> > OK, so if the gfn is already mapped into the p2m we don't care whether it 
> >> > has a valid IOMMU mapping or not?
> >> 
> >> We do care, but it is the responsibility of whoever established the
> >> first mapping to make sure it's present in both P2M and IOMMU.
> >> IOW if the GFN is already mapped, we should be able to imply that
> >> it's mapped in both places.
> > 
> > But how is the first caller that established the mapping supposed to know if 
> > it needs an IOMMU entry or not? (p2m_mmio_direct types don't get an IOMMU 
> > mapping at all)
> 
> And it's that fact stated in parentheses which I'd like to question.
> I don't see what's wrong with e.g. DMAing right into / out of a
> video frame buffer.

Right, so what about the following patch? It would fix my issues, and also 
remove the PVH hack in set_identity_p2m_entry:

---
x86/iommu: add IOMMU entries for p2m_mmio_direct pages

There's nothing wrong with allowing the domain to perform DMA transfers to 
MMIO areas that it already can access from the CPU, and this allows us to 
remove the hack in set_identity_p2m_entry for PVH Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
 xen/arch/x86/mm/p2m.c     |    9 ---------
 xen/include/asm-x86/p2m.h |    1 +
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 6a45185..7e33ab6 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1053,16 +1053,7 @@ int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
         ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
                             p2m_mmio_direct, p2ma);
     else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
-    {
         ret = 0;
-        /*
-         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
-         * but iomem regions are not mapped with IOMMU. This makes sure that
-         * RMRRs are correctly mapped with IOMMU.
-         */
-        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
-            ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
-    }
     else
     {
         if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 7035860..b562da3 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
     case p2m_grant_map_rw:
     case p2m_ram_logdirty:
     case p2m_map_foreign:
+    case p2m_mmio_direct:
         flags =  IOMMUF_readable | IOMMUF_writable;
         break;
     case p2m_ram_ro:
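The effect of the one-line hunk above can be seen in a standalone re-creation of the flag-selection switch. Enum values and flag names here are simplified stand-ins for the real Xen definitions: with the mmio_direct case added, identity-mapped MMIO pages get read/write IOMMU permissions just like ordinary RAM, while device-model MMIO still gets none.

```c
#define IOMMUF_READABLE (1u << 0)
#define IOMMUF_WRITABLE (1u << 1)

/* Simplified stand-in for Xen's p2m_type_t. */
enum sim_p2m_type {
    SIM_p2m_ram_rw,
    SIM_p2m_ram_ro,
    SIM_p2m_grant_map_rw,
    SIM_p2m_map_foreign,
    SIM_p2m_mmio_direct,
    SIM_p2m_mmio_dm,
};

/* Toy version of p2m_get_iommu_flags() after the patch. */
static unsigned int sim_get_iommu_flags(enum sim_p2m_type t)
{
    switch ( t )
    {
    case SIM_p2m_ram_rw:
    case SIM_p2m_grant_map_rw:
    case SIM_p2m_map_foreign:
    case SIM_p2m_mmio_direct:   /* the case the patch adds */
        return IOMMUF_READABLE | IOMMUF_WRITABLE;
    case SIM_p2m_ram_ro:
        return IOMMUF_READABLE;
    default:
        return 0;               /* no IOMMU mapping for this type */
    }
}
```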



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 15:33                 ` Roger Pau Monne
@ 2016-11-04 16:13                   ` Jan Beulich
  2016-11-04 16:19                     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 16:13 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 16:33, <roger.pau@citrix.com> wrote:
> --- a/xen/include/asm-x86/p2m.h
> +++ b/xen/include/asm-x86/p2m.h
> @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
>      case p2m_grant_map_rw:
>      case p2m_ram_logdirty:
>      case p2m_map_foreign:
> +    case p2m_mmio_direct:
>          flags =  IOMMUF_readable | IOMMUF_writable;
>          break;
>      case p2m_ram_ro:

Generally this may be the route to go. But if we want to do so, we
need to thoroughly understand why this type wasn't included here
before (and I don't know myself).

Jan



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 16:13                   ` Jan Beulich
@ 2016-11-04 16:19                     ` Roger Pau Monne
  2016-11-04 17:08                       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 16:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 10:13:08AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 16:33, <roger.pau@citrix.com> wrote:
> > --- a/xen/include/asm-x86/p2m.h
> > +++ b/xen/include/asm-x86/p2m.h
> > @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
> >      case p2m_grant_map_rw:
> >      case p2m_ram_logdirty:
> >      case p2m_map_foreign:
> > +    case p2m_mmio_direct:
> >          flags =  IOMMUF_readable | IOMMUF_writable;
> >          break;
> >      case p2m_ram_ro:
> 
> Generally this may be the route to go. But if we want to do so, we
> need to thoroughly understand why this type wasn't included here
> before (and I don't know myself).

It was me that introduced p2m_get_iommu_flags, and I just didn't think it 
would be useful at that point; that's why it wasn't included.

The patch that introduced p2m_get_iommu_flags was focused on fixing DMA'ing 
from grant pages on HVM guests with passed-through hardware; this was needed 
for driver domains and later on by the classic PVH Dom0.

Roger.


* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 16:19                     ` Roger Pau Monne
@ 2016-11-04 17:08                       ` Jan Beulich
  2016-11-04 17:25                         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-04 17:08 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 17:19, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 10:13:08AM -0600, Jan Beulich wrote:
>> >>> On 04.11.16 at 16:33, <roger.pau@citrix.com> wrote:
>> > --- a/xen/include/asm-x86/p2m.h
>> > +++ b/xen/include/asm-x86/p2m.h
>> > @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
>> >      case p2m_grant_map_rw:
>> >      case p2m_ram_logdirty:
>> >      case p2m_map_foreign:
>> > +    case p2m_mmio_direct:
>> >          flags =  IOMMUF_readable | IOMMUF_writable;
>> >          break;
>> >      case p2m_ram_ro:
>> 
>> Generally this may be the route to go. But if we want to do so, we
>> need to thoroughly understand why this type wasn't included here
>> before (and I don't know myself).
> 
> It was me that introduced p2m_get_iommu_flags, and I just didn't think it 
> > would be useful at that point; that's why it wasn't included.

But there must have been logic to set the permissions before that?

Jan



* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 17:08                       ` Jan Beulich
@ 2016-11-04 17:25                         ` Roger Pau Monne
  2016-11-07  8:36                           ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-04 17:25 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 04, 2016 at 11:08:21AM -0600, Jan Beulich wrote:
> >>> On 04.11.16 at 17:19, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 04, 2016 at 10:13:08AM -0600, Jan Beulich wrote:
> >> >>> On 04.11.16 at 16:33, <roger.pau@citrix.com> wrote:
> >> > --- a/xen/include/asm-x86/p2m.h
> >> > +++ b/xen/include/asm-x86/p2m.h
> >> > @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
> >> >      case p2m_grant_map_rw:
> >> >      case p2m_ram_logdirty:
> >> >      case p2m_map_foreign:
> >> > +    case p2m_mmio_direct:
> >> >          flags =  IOMMUF_readable | IOMMUF_writable;
> >> >          break;
> >> >      case p2m_ram_ro:
> >> 
> >> Generally this may be the route to go. But if we want to do so, we
> >> need to thoroughly understand why this type wasn't included here
> >> before (and I don't know myself).
> > 
> > It was me that introduced p2m_get_iommu_flags, and I just didn't think it 
> > would be useful at that point; that's why it wasn't included.
> 
> But there must have been logic to set the permissions before that?

Prior to that, only p2m_ram_rw would get IOMMU mappings. I've tracked this 
back to commit ff635e12, but there's no mention there of why only p2m_ram_rw 
would get IOMMU mappings.

Considering that on hw with shared page-tables this is already available, I 
don't see an issue with also doing it for the non-shared pt case.

Roger.


* Re: [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions
  2016-11-04 17:25                         ` Roger Pau Monne
@ 2016-11-07  8:36                           ` Jan Beulich
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-11-07  8:36 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.11.16 at 18:25, <roger.pau@citrix.com> wrote:
> On Fri, Nov 04, 2016 at 11:08:21AM -0600, Jan Beulich wrote:
>> >>> On 04.11.16 at 17:19, <roger.pau@citrix.com> wrote:
>> > On Fri, Nov 04, 2016 at 10:13:08AM -0600, Jan Beulich wrote:
>> >> >>> On 04.11.16 at 16:33, <roger.pau@citrix.com> wrote:
>> >> > --- a/xen/include/asm-x86/p2m.h
>> >> > +++ b/xen/include/asm-x86/p2m.h
>> >> > @@ -834,6 +834,7 @@ static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt)
>> >> >      case p2m_grant_map_rw:
>> >> >      case p2m_ram_logdirty:
>> >> >      case p2m_map_foreign:
>> >> > +    case p2m_mmio_direct:
>> >> >          flags =  IOMMUF_readable | IOMMUF_writable;
>> >> >          break;
>> >> >      case p2m_ram_ro:
>> >> 
>> >> Generally this may be the route to go. But if we want to do so, we
>> >> need to thoroughly understand why this type wasn't included here
>> >> before (and I don't know myself).
>> > 
>> > It was me that introduced p2m_get_iommu_flags, and I just didn't think it 
>> > would be useful at that point; that's why it wasn't included.
>> 
>> But there must have been logic to set the permissions before that?
> 
> Prior to that, only p2m_ram_rw would get IOMMU mappings. I've tracked this 
> back to commit ff635e12, but there's no mention there of why only p2m_ram_rw 
> would get IOMMU mappings.
> 
> Considering that on hw with shared page-tables this is already available, I 
> don't see an issue with also doing it for the non-shared pt case.

True.

Jan



* Re: [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2
  2016-10-29  8:59 ` [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2016-11-11 16:53   ` Jan Beulich
  2016-11-16 18:02     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-11 16:53 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -191,10 +191,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
>  }
>  
>  #ifdef CONFIG_SHADOW_PAGING
> -static bool_t __initdata opt_dom0_shadow;
> +bool __initdata opt_dom0_shadow;
>  boolean_param("dom0_shadow", opt_dom0_shadow);
> -#else
> -#define opt_dom0_shadow 0
>  #endif

I think the new option parsing would better go here, avoiding the need
for this change. Making dom0_hvm visible globally is the less intrusive
variant.

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -67,6 +67,16 @@ unsigned long __read_mostly cr4_pv32_mask;
>  static bool_t __initdata opt_dom0pvh;
>  boolean_param("dom0pvh", opt_dom0pvh);
>  
> +/*
> + * List of parameters that affect Dom0 creation:
> + *
> + *  - hvm               Create a PVHv2 Dom0.
> + *  - shadow            Use shadow paging for Dom0.
> + */
> +static void parse_dom0_param(char *s);

Please try to avoid such forward declarations.

> @@ -1543,6 +1574,14 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( opt_dom0pvh )
>          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>  
> +    if ( dom0_hvm )
> +    {
> +        domcr_flags |= DOMCRF_hvm |
> +                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
> +                         DOMCRF_hap : 0);
> +        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
> +    }

If you wire this up here already, instead of later in the series, what's
the effect of someone using this option? Crash?

Jan



* Re: [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO
  2016-10-29  8:59 ` [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO Roger Pau Monne
@ 2016-11-11 16:58   ` Jan Beulich
  2016-11-29 12:41     ` Roger Pau Monne
  2016-11-11 20:17   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-11 16:58 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> Current {un}map_mmio_regions implementation has a maximum number of loops to
> perform before giving up and returning to the caller. This is an issue when
> mapping large MMIO regions when building the hardware domain. In order to
> solve it, introduce a wrapper around {un}map_mmio_regions that takes care of
> calling process_pending_softirqs between consecutive {un}map_mmio_regions
> calls.

So is this something that's going to be needed for other than
hwdom building? Because if not ...

> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1418,6 +1418,32 @@ int prepare_ring_for_helper(
>      return 0;
>  }
>  
> +int modify_identity_mmio(struct domain *d, unsigned long pfn,
> +                         unsigned long nr_pages, bool map)

... I don't think the function belongs here, and it should be
marked __hwdom_init.

Jan
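The wrapper pattern described in the quoted commit message — mapping a large region in bounded chunks and processing pending softirqs between chunks so a huge MMIO range cannot monopolise the CPU — can be sketched in standalone C. The chunk size and all helper names below are illustrative stand-ins, not the real Xen code:

```c
#define CHUNK_PFNS 64ul   /* max pages handled per iteration (toy value) */

static unsigned int softirq_runs;

/* Stand-in for process_pending_softirqs(): just count invocations. */
static void sim_process_pending_softirqs(void)
{
    softirq_runs++;
}

/* Stand-in for {un}map_mmio_regions(): handles at most CHUNK_PFNS
 * pages per call and reports how many it actually processed. */
static unsigned long sim_map_mmio_chunk(unsigned long pfn,
                                        unsigned long nr)
{
    (void)pfn;
    return nr < CHUNK_PFNS ? nr : CHUNK_PFNS;
}

/* Toy modify_identity_mmio(): loop until the whole range is mapped,
 * yielding to softirq processing after every chunk. */
static int sim_modify_identity_mmio(unsigned long pfn, unsigned long nr)
{
    while ( nr )
    {
        unsigned long done = sim_map_mmio_chunk(pfn, nr);

        if ( !done )
            return -1;      /* mapping failure */
        pfn += done;
        nr -= done;
        sim_process_pending_softirqs();
    }
    return 0;
}
```

Mapping 200 pages in this model takes four iterations (64 + 64 + 64 + 8), with a softirq-processing point after each.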



* Re: [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-10-29  8:59 ` [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2016-11-11 17:16   ` Jan Beulich
  2016-11-28 11:26     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-11 17:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> +static int __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> +                                             uint64_t size)
> +{
> +    unsigned int order, i = 0;
> +    struct page_info *page;
> +    int rc;
> +#define MAP_MAX_ITER 64
> +
> +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> +
> +    order = MAX_ORDER;
> +    while ( size != 0 )
> +    {
> +        order = min(get_order_from_bytes_floor(size), order);

This being the only caller, I don't see the point of the helper, the
more that the logic to prevent underflow is unnecessary for the
use here.

> +        page = alloc_domheap_pages(d, order, memflags);
> +        if ( page == NULL )
> +        {
> +            if ( order == 0 && memflags )
> +            {
> +                /* Try again without any memflags. */
> +                memflags = 0;
> +                order = MAX_ORDER;
> +                continue;
> +            }
> +            if ( order == 0 )
> +            {
> +                printk("Unable to allocate memory with order 0!\n");
> +                return -ENOMEM;
> +            }
> +            order--;
> +            continue;
> +        }
> +
> +        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
> +                                    _mfn(page_to_mfn(page)), order);
> +        if ( rc != 0 )
> +        {
> +            printk("Failed to populate memory: [%" PRIx64 ",%" PRIx64 ") %d\n",
> +                   start, start + ((1ULL) << (order + PAGE_SHIFT)), rc);
> +            return -ENOMEM;
> +        }
> +        start += 1ULL << (order + PAGE_SHIFT);
> +        size -= 1ULL << (order + PAGE_SHIFT);

With all of these PAGE_SHIFT uses I wonder whether you wouldn't
be better off doing everything with (number of) page frames.
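The fallback strategy in the allocation loop quoted above — try the largest order first, shrink the order on failure, and once an order-0 request with restrictive memflags fails, retry from the top without them — can be modelled in a standalone sketch. `alloc_sim()` is an invented stand-in for alloc_domheap_pages(), with an artificial failure pattern chosen to exercise both fallback paths:

```c
#include <stdbool.h>

#define SIM_MAX_ORDER 4

/* Pretend allocations with restrictive memflags always fail, while the
 * unrestricted pool can satisfy anything up to order 2. */
static bool alloc_sim(unsigned int order, unsigned int memflags)
{
    return memflags ? false : order <= 2;
}

/* One allocation attempt mirroring the quoted loop: returns the order
 * actually used, or -1 once even an unrestricted order-0 request fails. */
static int sim_alloc_one(unsigned int *memflags)
{
    unsigned int order = SIM_MAX_ORDER;

    for ( ;; )
    {
        if ( alloc_sim(order, *memflags) )
            return order;
        if ( order == 0 )
        {
            if ( !*memflags )
                return -1;      /* genuinely out of memory */
            *memflags = 0;      /* retry once without restrictions */
            order = SIM_MAX_ORDER;
            continue;
        }
        order--;
    }
}
```

Starting from a restricted request, the model walks down to order 0, drops the memflags, restarts at the top, and succeeds at order 2 — the same shape as the loop in hvm_populate_memory_range().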

> +static int __init hvm_steal_ram(struct domain *d, unsigned long size,
> +                                paddr_t limit, paddr_t *addr)
> +{
> +    unsigned int i;
> +
> +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> +    {
> +        struct e820entry *entry = &d->arch.e820[d->arch.nr_e820 - i];

Why don't you simply make the loop count downwards?

> +static int __init hvm_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    int i, rc;

The use of i below calls for it to be unsigned int.

> +    bool preempted;
> +
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> +
> +    hvm_setup_e820(d, nr_pages);
> +    do {
> +        preempted = false;
> +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                              &preempted);
> +        process_pending_softirqs();
> +    } while ( preempted );
> +
> +    /*
> +     * Special treatment for memory < 1MB:
> +     *  - Copy the data in e820 regions marked as RAM (BDA, BootSector...).
> +     *  - Identity map everything else.
> +     * NB: all this only makes sense if booted from legacy BIOSes.
> +     * NB2: regions marked as RAM in the memory map are backed by RAM pages
> +     * in the p2m, and the original data is copied over. This is done because
> +     * at least FreeBSD places the AP boot trampoline in a RAM region found
> +     * below the first MB, and the real-mode emulator found in Xen cannot
> +     * deal with code that resides in guest pages marked as MMIO. This can
> +     * cause problems if the memory map is not correct, and for example the
> +     * EBDA or the video ROM region is marked as RAM.
> +     */

Perhaps it's the real mode emulator which needs adjustment?

> +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
> +    if ( rc )
> +    {
> +        printk("Failed to identity map low 1MB: %d\n", rc);
> +        return rc;
> +    }
> +
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        rc = hvm_populate_memory_range(d, d->arch.e820[i].addr,
> +                                       d->arch.e820[i].size);
> +        if ( rc )
> +            return rc;
> +        if ( d->arch.e820[i].addr < MB(1) )
> +        {
> +            unsigned long end = min_t(unsigned long,
> +                            d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
> +
> +            saved_current = current;
> +            set_current(v);
> +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> +                                        maddr_to_virt(d->arch.e820[i].addr),
> +                                        end - d->arch.e820[i].addr);
> +            set_current(saved_current);

If anything goes wrong here, how much confusion will result from
current being wrong? In particular, will this complicate debugging
of possible issues?

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO
  2016-10-29  8:59 ` [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO Roger Pau Monne
  2016-11-11 16:58   ` Jan Beulich
@ 2016-11-11 20:17   ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 89+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-11 20:17 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, xen-devel, boris.ostrovsky

On Sat, Oct 29, 2016 at 10:59:57AM +0200, Roger Pau Monne wrote:
> Current {un}map_mmio_regions implementation has a maximum number of loops to
> perform before giving up and returning to the caller. This is an issue when
> mapping large MMIO regions when building the hardware domain. In order to
> solve it, introduce a wrapper around {un}map_mmio_regions that takes care of
> calling process_pending_softirqs between consecutive {un}map_mmio_regions
> calls.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
> Changes since v2:
>  - Pull the code into a separate patch.
>  - Use an unbounded for loop with break conditions.
> ---
>  xen/common/memory.c          | 26 ++++++++++++++++++++++++++
>  xen/include/xen/p2m-common.h |  7 +++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 21797ca..66c0484 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -1418,6 +1418,32 @@ int prepare_ring_for_helper(
>      return 0;
>  }
>  
> +int modify_identity_mmio(struct domain *d, unsigned long pfn,
> +                         unsigned long nr_pages, bool map)
> +{
> +    int rc;
> +
> +    for ( ; ; )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_WARNING
> +                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
> +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);

If we fail should we call this again but with the unmap operation?


* Re: [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2
  2016-10-29  8:59 ` [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2016-11-11 20:30   ` Konrad Rzeszutek Wilk
  2016-11-28 12:14     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-11 20:30 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky, Jan Beulich, Andrew Cooper

On Sat, Oct 29, 2016 at 10:59:59AM +0200, Roger Pau Monne wrote:
> Introduce a helper to parse the Dom0 kernel.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> Changes since v2:
>  - Remove debug messages.
>  - Don't hardcode the number of modules to 1.
> ---
>  xen/arch/x86/domain_build.c | 138 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 138 insertions(+)
> 
> diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
> index ec1ac89..168be62 100644
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -39,6 +39,7 @@
>  #include <asm/hpet.h>
>  
>  #include <public/version.h>
> +#include <public/arch-x86/hvm/start_info.h>
>  
>  static long __initdata dom0_nrpages;
>  static long __initdata dom0_min_nrpages;
> @@ -1895,12 +1896,141 @@ static int __init hvm_setup_p2m(struct domain *d)
>      return 0;
>  }
>  
> +static int __init hvm_load_kernel(struct domain *d, const module_t *image,
> +                                  unsigned long image_headroom,
> +                                  module_t *initrd, char *image_base,
> +                                  char *cmdline, paddr_t *entry,
> +                                  paddr_t *start_info_addr)
> +{
> +    char *image_start = image_base + image_headroom;
> +    unsigned long image_len = image->mod_end;
> +    struct elf_binary elf;
> +    struct elf_dom_parms parms;
> +    paddr_t last_addr;
> +    struct hvm_start_info start_info;
> +    struct hvm_modlist_entry mod;
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    int rc;
> +
> +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> +    {
> +        printk("Error trying to detect bz compressed kernel\n");
> +        return rc;
> +    }
> +
> +    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> +    {
> +        printk("Unable to init ELF\n");
> +        return rc;
> +    }
> +#ifdef VERBOSE
> +    elf_set_verbose(&elf);
> +#endif
> +    elf_parse_binary(&elf);
> +    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
> +    {
> +        printk("Unable to parse kernel for ELFNOTES\n");

Perhaps s/ELFNOTES/PT_NOTEs/?
> +        return rc;
> +    }
> +
> +    if ( parms.phys_entry == UNSET_ADDR32 ) {
> +        printk("Unable to find kernel entry point, aborting\n");

Perhaps: Unable to find XEN_ELFNOTE_PHYS32_ENTRY point.

> +        return -EINVAL;
> +    }
> +
> +    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
> +           parms.guest_ver, parms.loader,

Hm, I don't know if XEN_ELFNOTE_GUEST_VERSION or XEN_ELFNOTE_GUEST_OS
are mandated.

Perhaps you should do memset(&parms) before you pass it to elf_xen_parse?

> +           elf_64bit(&elf) ? "64-bit" : "32-bit");
> +
> +    /* Copy the OS image and free temporary buffer. */
> +    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
> +    elf.dest_size = parms.virt_kend - parms.virt_kstart;
> +
> +    saved_current = current;
> +    set_current(v);
> +
> +    rc = elf_load_binary(&elf);
> +    if ( rc < 0 )
> +    {
> +        printk("Failed to load kernel: %d\n", rc);
> +        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
> +        goto out;
> +    }
> +
> +    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> +
> +    if ( initrd != NULL )
> +    {
> +        rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
> +                                    initrd->mod_end);
> +        if ( rc != HVMCOPY_okay )
> +        {
> +            printk("Unable to copy initrd to guest\n");
> +            rc = -EFAULT;
> +            goto out;
> +        }
> +
> +        mod.paddr = last_addr;
> +        mod.size = initrd->mod_end;
> +        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
> +    }
> +
> +    /* Free temporary buffers. */
> +    discard_initial_images();
> +
> +    memset(&start_info, 0, sizeof(start_info));
> +    if ( cmdline != NULL )
> +    {
> +        rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1);
> +        if ( rc != HVMCOPY_okay )
> +        {
> +            printk("Unable to copy guest command line\n");
> +            rc = -EFAULT;
> +            goto out;
> +        }
> +        start_info.cmdline_paddr = last_addr;
> +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
> +    }
> +    if ( initrd != NULL )

It may be better if this is an array. You can have multiple initrds.

> +    {
> +        rc = hvm_copy_to_guest_phys(last_addr, &mod, sizeof(mod));
> +        if ( rc != HVMCOPY_okay )
> +        {
> +            printk("Unable to copy guest modules\n");
> +            rc = -EFAULT;
> +            goto out;
> +        }
> +        start_info.modlist_paddr = last_addr;
> +        start_info.nr_modules = 1;
> +        last_addr += sizeof(mod);
> +    }
> +
> +    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
> +    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
> +    rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info));
> +    if ( rc != HVMCOPY_okay )
> +    {
> +        printk("Unable to copy start info to guest\n");
> +        rc = -EFAULT;
> +        goto out;
> +    }
> +
> +    *entry = parms.phys_entry;
> +    *start_info_addr = last_addr;
> +    rc = 0;
> +
> +out:

Extra space in front of the label.
> +    set_current(saved_current);
> +    return rc;
> +}
> +


* Re: [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-29  9:00 ` [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
@ 2016-11-14 16:15   ` Jan Beulich
  2016-11-30 12:40     ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-14 16:15 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.10.16 at 11:00, <roger.pau@citrix.com> wrote:
> Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
> p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
> that don't reside in RAM regions. This is needed because some memory maps
> don't properly account for all the memory used by ACPI, so it's common to
> find ACPI tables in holes.

I question whether this behavior should be enabled by default. Not
having seen the code yet I also wonder whether these regions
shouldn't simply be added to the guest's E820 as E820_ACPI, which
should then result in them getting mapped without further special
casing.

> +static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
> +                                    uint32_t type)

I see s and e being uint64_t, but I don't see why type can't be plain
unsigned int.

> +{
> +    unsigned int i;
> +
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        uint64_t rs = d->arch.e820[i].addr;
> +        uint64_t re = rs + d->arch.e820[i].size;
> +
> +        if ( rs == e && d->arch.e820[i].type == type )
> +        {
> +            d->arch.e820[i].addr = s;
> +            return 0;
> +        }
> +
> +        if ( re == s && d->arch.e820[i].type == type &&
> +             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )

I understand this to be overlap prevention, but there's no equivalent
in the earlier if(). Are you relying on the table being strictly sorted at
all times? If so, a comment should say so.

> +        {
> +            d->arch.e820[i].size += e - s;
> +            return 0;
> +        }
> +
> +        if ( rs >= e )
> +            break;
> +
> +        if ( re > s )
> +            return -ENOMEM;

I don't think ENOMEM is appropriate to signal an overlap. And don't
you need to reverse these last two if()s?

> @@ -2112,6 +2166,371 @@ static int __init hvm_setup_cpus(struct domain *d, 
> paddr_t entry,
>      return 0;
>  }
>  
> +static int __init acpi_count_intr_ov(struct acpi_subtable_header *header,
> +                                     const unsigned long end)
> +{
> +
> +    acpi_intr_overrrides++;
> +    return 0;
> +}
> +
> +static int __init acpi_set_intr_ov(struct acpi_subtable_header *header,
> +                                   const unsigned long end)

May I ask for "ov" to become at least "ovr" in all cases? Also stray
const.

> +{
> +    struct acpi_madt_interrupt_override *intr =
> +        container_of(header, struct acpi_madt_interrupt_override, header);

Yet missing const here.

> +    ACPI_MEMCPY(intsrcovr, intr, sizeof(*intr));

Structure assignment (for type safety; also elsewhere)?

> +static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
> +{
> +    struct acpi_table_madt *madt;
> +    struct acpi_table_header *table;
> +    struct acpi_madt_io_apic *io_apic;
> +    struct acpi_madt_local_apic *local_apic;
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    acpi_status status;
> +    unsigned long size;
> +    unsigned int i;
> +    int rc;
> +
> +    /* Count number of interrupt overrides in the MADT. */
> +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_count_intr_ov,
> +                          MAX_IRQ_SOURCES);
> +
> +    /* Calculate the size of the crafted MADT. */
> +    size = sizeof(struct acpi_table_madt);
> +    size += sizeof(struct acpi_madt_interrupt_override) * acpi_intr_overrrides;
> +    size += sizeof(struct acpi_madt_io_apic);
> +    size += sizeof(struct acpi_madt_local_apic) * dom0_max_vcpus();

All the sizeof()s would better use the variables declared above.

> +    madt = xzalloc_bytes(size);
> +    if ( !madt )
> +    {
> +        printk("Unable to allocate memory for MADT table\n");
> +        return -ENOMEM;
> +    }
> +
> +    /* Copy the native MADT table header. */
> +    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
> +    if ( !ACPI_SUCCESS(status) )
> +    {
> +        printk("Failed to get MADT ACPI table, aborting.\n");
> +        return -EINVAL;
> +    }
> +    ACPI_MEMCPY(madt, table, sizeof(*table));
> +    madt->address = APIC_DEFAULT_PHYS_BASE;

You may also need to override table revision (at least it shouldn't end
up larger than what we know about).

> +    /* Setup the IO APIC entry. */
> +    if ( nr_ioapics > 1 )
> +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> +               nr_ioapics);

I've said elsewhere already that I think we should provide 1 vIO-APIC
per physical one.

> +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> +    io_apic->header.length = sizeof(*io_apic);
> +    io_apic->id = 1;
> +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
> +
> +    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
> +    for ( i = 0; i < dom0_max_vcpus(); i++ )
> +    {
> +        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
> +        local_apic->header.length = sizeof(*local_apic);
> +        local_apic->processor_id = i;
> +        local_apic->id = i * 2;
> +        local_apic->lapic_flags = ACPI_MADT_ENABLED;
> +        local_apic++;
> +    }

What about x2apic? And for lapic, do you limit vCPU count anywhere?

> +    /* Setup interrupt overwrites. */

overrides

> +static bool __init hvm_acpi_table_allowed(const char *sig)
> +{
> +    static const char __init banned_tables[][ACPI_NAME_SIZE] = {
> +        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
> +        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
> +    unsigned long pfn, nr_pages;
> +    int i;
> +
> +    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
> +        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
> +            return false;
> +
> +    /* Make sure table doesn't reside in a RAM region. */
> +    pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> +    nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> +                            PAGE_SIZE);

You also need to add in the offset-into-page from the base address.

> +    if ( range_is_ram(pfn, nr_pages) )
> +    {
> +        printk("Skipping table %.4s because resides in a RAM region\n",
> +               sig);
> +        return false;

I think this should be more strict, at least to start with: Require the
table to be in an E820_ACPI region (or maybe an E820_RESERVED
one), but nothing else.

> +static int __init hvm_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
> +                                      paddr_t *addr)
> +{
> +    struct acpi_table_xsdt *xsdt;
> +    struct acpi_table_header *table;
> +    struct acpi_table_rsdp *rsdp;
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    unsigned long size;
> +    unsigned int i, num_tables;
> +    int j, rc;
> +
> +    /*
> +     * Restore original DMAR table signature, we are going to filter it
> +     * from the new XSDT that is presented to the guest, so it no longer
> +     * makes sense to have it's signature zapped.
> +     */
> +    acpi_dmar_reinstate();
> +
> +    /* Account for the space needed by the XSDT. */
> +    size = sizeof(*xsdt);
> +    num_tables = 0;
> +
> +    /* Count the number of tables that will be added to the XSDT. */
> +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> +    {
> +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> +
> +        if ( !hvm_acpi_table_allowed(sig) )
> +            continue;
> +
> +        num_tables++;
> +    }

Unless you expect something to be added to this loop, please
simplify it by inverting the condition and dropping the continue.

> +    /*
> +     * No need to subtract one because we will be adding a custom MADT (and
> +     * the native one is not accounted for).
> +     */
> +    size += num_tables * sizeof(u64);

sizeof(xsdt->table_offset_entry[0])

> +    xsdt = xzalloc_bytes(size);
> +    if ( !xsdt )
> +    {
> +        printk("Unable to allocate memory for XSDT table\n");
> +        return -ENOMEM;
> +    }
> +
> +    /* Copy the native XSDT table header. */
> +    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
> +    if ( !rsdp )
> +    {
> +        printk("Unable to map RSDP\n");
> +        return -EINVAL;
> +    }
> +    table = acpi_os_map_memory(rsdp->xsdt_physical_address, sizeof(*table));
> +    if ( !table )
> +    {
> +        printk("Unable to map XSDT\n");
> +        return -EINVAL;
> +    }
> +    ACPI_MEMCPY(xsdt, table, sizeof(*table));
> +    acpi_os_unmap_memory(table, sizeof(*table));
> +    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));

At this point we're not in SYS_STATE_active yet, and hence there
can only be one mapping at a time. The way it's written right now
does not represent an active problem, but to prevent someone
falling into this trap you should unmap the first mapping before
establishing the second one.

> +    /* Add the custom MADT. */
> +    j = 0;
> +    xsdt->table_offset_entry[j++] = madt_addr;
> +
> +    /* Copy the address of the rest of the allowed tables. */

addresses?

> +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> +    {
> +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> +        unsigned long pfn, nr_pages;
> +
> +        if ( !hvm_acpi_table_allowed(sig) )
> +            continue;
> +
> +        /* Make sure table doesn't reside in a RAM region. */
> +        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> +        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> +                                PAGE_SIZE);

See above (and there appears to be at least one more further down).

> +        /* Make sure table is mapped. */
> +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> +        if ( rc )
> +            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> +                   pfn, pfn + nr_pages);

Isn't the comment for this code block meant to go ahead of the earlier
one, in place of the comment that's there (and looks wrong)?

> +        xsdt->table_offset_entry[j++] =
> +                            acpi_gbl_root_table_list.tables[i].address;
> +    }
> +
> +    xsdt->header.length = size;
> +    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
> +
> +    /* Place the new XSDT in guest memory space. */
> +    if ( hvm_steal_ram(d, size, GB(4), addr) )
> +    {
> +        printk("Unable to find allocate guest RAM for XSDT\n");

"find" or "allocate"?

> +        return -ENOMEM;
> +    }
> +
> +    /* Mark this region as E820_ACPI. */
> +    if ( hvm_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
> +        printk("Unable to add XSDT region to memory map\n");
> +
> +    saved_current = current;
> +    set_current(v);
> +    rc = hvm_copy_to_guest_phys(*addr, xsdt, size);
> +    set_current(saved_current);

This pattern appears to be recurring - please make a helper function
(which then also eases possibly addressing my earlier remark
regarding that playing with current).

> +    if ( rc != HVMCOPY_okay )
> +    {
> +        printk("Unable to copy XSDT into guest memory\n");
> +        return -EFAULT;
> +    }
> +    xfree(xsdt);
> +
> +    return 0;
> +}
> +
> +
> +static int __init hvm_setup_acpi(struct domain *d, paddr_t start_info)

Only one blank line between functions please.

> +{
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    struct acpi_table_rsdp rsdp;
> +    unsigned long pfn, nr_pages;
> +    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
> +    unsigned int i;
> +    int rc;
> +
> +    /* Identity map ACPI e820 regions. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_ACPI &&
> +             d->arch.e820[i].type != E820_NVS )
> +            continue;
> +
> +        pfn = PFN_DOWN(d->arch.e820[i].addr);
> +        nr_pages = DIV_ROUND_UP(d->arch.e820[i].size, PAGE_SIZE);
> +
> +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> +        if ( rc )
> +        {
> +            printk(
> +                "Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> +                   pfn, pfn + nr_pages);
> +            return rc;
> +        }
> +    }
> +
> +    rc = hvm_setup_acpi_madt(d, &madt_paddr);
> +    if ( rc )
> +        return rc;
> +
> +    rc = hvm_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
> +    if ( rc )
> +        return rc;

Coming back to the initial comment: If you did the 1:1 mapping last
and if you added problematic ranges to the E820 map, you wouldn't
need to call modify_identity_mmio() in two places.

> +    /* Craft a custom RSDP. */
> +    memset(&rsdp, 0, sizeof(rsdp));
> +    memcpy(&rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
> +    memcpy(&rsdp.oem_id, "XenVMM", sizeof(rsdp.oem_id));

Is that a good idea? I think Dom0 should get to see the real OEM.

Jan


* Re: [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2
  2016-11-11 16:53   ` Jan Beulich
@ 2016-11-16 18:02     ` Roger Pau Monne
  2016-11-17 10:49       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-16 18:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 11, 2016 at 09:53:49AM -0700, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/domain_build.c
> > +++ b/xen/arch/x86/domain_build.c
> > @@ -191,10 +191,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
> >  }
> >  
> >  #ifdef CONFIG_SHADOW_PAGING
> > -static bool_t __initdata opt_dom0_shadow;
> > +bool __initdata opt_dom0_shadow;
> >  boolean_param("dom0_shadow", opt_dom0_shadow);
> > -#else
> > -#define opt_dom0_shadow 0
> >  #endif
> 
> I think the new option parsing would better go here, avoiding the need
> for this change. Making dom0_hvm visible globally is the less intrusive
> variant.

I'm not sure I follow your point. Even if dom0_hvm is defined here together 
with the parsing, opt_dom0_shadow still needs to be made global so that it can 
be accessed from setup.c, which is where the domain_create call happens.
 
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -67,6 +67,16 @@ unsigned long __read_mostly cr4_pv32_mask;
> >  static bool_t __initdata opt_dom0pvh;
> >  boolean_param("dom0pvh", opt_dom0pvh);
> >  
> > +/*
> > + * List of parameters that affect Dom0 creation:
> > + *
> > + *  - hvm               Create a PVHv2 Dom0.
> > + *  - shadow            Use shadow paging for Dom0.
> > + */
> > +static void parse_dom0_param(char *s);
> 
> Please try to avoid such forward declarations.
> 
> > @@ -1543,6 +1574,14 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >      if ( opt_dom0pvh )
> >          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
> >  
> > +    if ( dom0_hvm )
> > +    {
> > +        domcr_flags |= DOMCRF_hvm |
> > +                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
> > +                         DOMCRF_hap : 0);
> > +        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
> > +    }
> 
> If you wire this up here already, instead of later in the series, what's
> the effect of someone using this option? Crash?

Most certainly. The BSP IP points to 0 at this point. I can wire this up later, 
but it's going to be strange IMHO.

Roger.


* Re: [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2
  2016-11-16 18:02     ` Roger Pau Monne
@ 2016-11-17 10:49       ` Jan Beulich
  2016-11-28 17:49         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-17 10:49 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 16.11.16 at 19:02, <roger.pau@citrix.com> wrote:
> On Fri, Nov 11, 2016 at 09:53:49AM -0700, Jan Beulich wrote:
>> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > --- a/xen/arch/x86/domain_build.c
>> > +++ b/xen/arch/x86/domain_build.c
>> > @@ -191,10 +191,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain 
> *dom0)
>> >  }
>> >  
>> >  #ifdef CONFIG_SHADOW_PAGING
>> > -static bool_t __initdata opt_dom0_shadow;
>> > +bool __initdata opt_dom0_shadow;
>> >  boolean_param("dom0_shadow", opt_dom0_shadow);
>> > -#else
>> > -#define opt_dom0_shadow 0
>> >  #endif
>> 
>> I think the new option parsing would better go here, avoiding the need
>> for this change. Making dom0_hvm visible globally is the less intrusive
>> variant.
> 
> I'm not sure I follow your point, even if dom0_hvm is defined here together 
> with the parsing, opt_dom0_shadow still needs to be made global, so it can be 
> accessed from setup.c which is where the domain_create call happens.

Oh, I had overlooked that use.

>> > --- a/xen/arch/x86/setup.c
>> > +++ b/xen/arch/x86/setup.c
>> > @@ -67,6 +67,16 @@ unsigned long __read_mostly cr4_pv32_mask;
>> >  static bool_t __initdata opt_dom0pvh;
>> >  boolean_param("dom0pvh", opt_dom0pvh);
>> >  
>> > +/*
>> > + * List of parameters that affect Dom0 creation:
>> > + *
>> > + *  - hvm               Create a PVHv2 Dom0.
>> > + *  - shadow            Use shadow paging for Dom0.
>> > + */
>> > +static void parse_dom0_param(char *s);
>> 
>> Please try to avoid such forward declarations.
>> 
>> > @@ -1543,6 +1574,14 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>> >      if ( opt_dom0pvh )
>> >          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>> >  
>> > +    if ( dom0_hvm )
>> > +    {
>> > +        domcr_flags |= DOMCRF_hvm |
>> > +                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
>> > +                         DOMCRF_hap : 0);
>> > +        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
>> > +    }
>> 
>> If you wire this up here already, instead of later in the series, what's
>> the effect of someone using this option? Crash?
> 
> Most certainly. The BSP IP points to 0 at this point. I can wire this up later, 
> but it's going to be strange IMHO.

Not any more "strange" than someone trying the option and getting
some random and perhaps not immediately understandable crash.

Jan



* Re: [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-11 17:16   ` Jan Beulich
@ 2016-11-28 11:26     ` Roger Pau Monne
  2016-11-28 11:41       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-28 11:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Nov 11, 2016 at 10:16:43AM -0700, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > +static int __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> > +                                             uint64_t size)
> > +{
> > +    unsigned int order, i = 0;
> > +    struct page_info *page;
> > +    int rc;
> > +#define MAP_MAX_ITER 64
> > +
> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> > +
> > +    order = MAX_ORDER;
> > +    while ( size != 0 )
> > +    {
> > +        order = min(get_order_from_bytes_floor(size), order);
> 
> This being the only caller, I don't see the point of the helper, the
> more that the logic to prevent underflow is unnecessary for the
> use here.

Right, the more that now get_order_from_bytes_floor is mostly a wrapper around 
get_order_from_bytes.
 
> > +        page = alloc_domheap_pages(d, order, memflags);
> > +        if ( page == NULL )
> > +        {
> > +            if ( order == 0 && memflags )
> > +            {
> > +                /* Try again without any memflags. */
> > +                memflags = 0;
> > +                order = MAX_ORDER;
> > +                continue;
> > +            }
> > +            if ( order == 0 )
> > +            {
> > +                printk("Unable to allocate memory with order 0!\n");
> > +                return -ENOMEM;
> > +            }
> > +            order--;
> > +            continue;
> > +        }
> > +
> > +        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
> > +                                    _mfn(page_to_mfn(page)), order);
> > +        if ( rc != 0 )
> > +        {
> > +            printk("Failed to populate memory: [%" PRIx64 ",%" PRIx64 ") %d\n",
> > +                   start, start + ((1ULL) << (order + PAGE_SHIFT)), rc);
> > +            return -ENOMEM;
> > +        }
> > +        start += 1ULL << (order + PAGE_SHIFT);
> > +        size -= 1ULL << (order + PAGE_SHIFT);
> 
> With all of these PAGE_SHIFT uses I wonder whether you wouldn't
> be better of doing everything with (number of) page frames.

Probably, this will avoid a couple of shifts here and there.
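As a standalone illustration of that point (PAGE_SHIFT, the helper names and the loop shape here are invented for illustration, not the actual Xen code), the same allocation loop can be accounted in bytes or in page frames; the frame-based variant needs no shift arithmetic at all:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* illustrative; matches x86's 4KiB pages */

/* Byte-based accounting: every step shifts the order into bytes. */
static unsigned int steps_in_bytes(uint64_t size, unsigned int order)
{
    unsigned int steps = 0;

    while ( size != 0 )
    {
        size -= 1ULL << (order + PAGE_SHIFT);
        steps++;
    }
    return steps;
}

/* Frame-based accounting: the same loop in numbers of page frames,
 * where an order-N allocation is simply (1UL << N) frames, so no
 * PAGE_SHIFT arithmetic is needed anywhere. */
static unsigned int steps_in_frames(unsigned long nr_pages,
                                    unsigned int order)
{
    unsigned int steps = 0;

    while ( nr_pages != 0 )
    {
        nr_pages -= 1UL << order;
        steps++;
    }
    return steps;
}
```

(Both variants assume the range is a multiple of the chunk, as the alignment ASSERT in the quoted code guarantees.)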

> > +static int __init hvm_steal_ram(struct domain *d, unsigned long size,
> > +                                paddr_t limit, paddr_t *addr)
> > +{
> > +    unsigned int i;
> > +
> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> > +    {
> > +        struct e820entry *entry = &d->arch.e820[d->arch.nr_e820 - i];
> 
> Why don't you simply make the loop count downwards?

With i being an unsigned int, this would make the condition look quite awkward, 
because i >= 0 cannot be used. I would have to use i <= d->arch.nr_e820, so I 
think it's best to leave it as-is for readability.

> > +static int __init hvm_setup_p2m(struct domain *d)
> > +{
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    unsigned long nr_pages;
> > +    int i, rc;
> 
> The use of i below calls for it to be unsigned int.

Sure.

> > +    bool preempted;
> > +
> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0);
> > +
> > +    hvm_setup_e820(d, nr_pages);
> > +    do {
> > +        preempted = false;
> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> > +                              &preempted);
> > +        process_pending_softirqs();
> > +    } while ( preempted );
> > +
> > +    /*
> > +     * Special treatment for memory < 1MB:
> > +     *  - Copy the data in e820 regions marked as RAM (BDA, BootSector...).
> > +     *  - Identity map everything else.
> > +     * NB: all this only makes sense if booted from legacy BIOSes.
> > +     * NB2: regions marked as RAM in the memory map are backed by RAM pages
> > +     * in the p2m, and the original data is copied over. This is done because
> > +     * at least FreeBSD places the AP boot trampoline in a RAM region found
> > +     * below the first MB, and the real-mode emulator found in Xen cannot
> > +     * deal with code that resides in guest pages marked as MMIO. This can
> > +     * cause problems if the memory map is not correct, and for example the
> > +     * EBDA or the video ROM region is marked as RAM.
> > +     */
> 
> Perhaps it's the real mode emulator which needs adjustment?

After the talk we had on IRC, I've decided that the best way to deal with this 
is to map the RAM regions below 1MB as RAM instead of MMIO as it's done here. 
I've added a helper to steal those pages from dom_io, assign them to Dom0, and 
map them into the p2m as p2m_ram_rw. This works fine and the emulator no 
longer complains.

> > +    rc = modify_identity_mmio(d, 0, PFN_DOWN(MB(1)), true);
> > +    if ( rc )
> > +    {
> > +        printk("Failed to identity map low 1MB: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    /* Populate memory map. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        if ( d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        rc = hvm_populate_memory_range(d, d->arch.e820[i].addr,
> > +                                       d->arch.e820[i].size);
> > +        if ( rc )
> > +            return rc;
> > +        if ( d->arch.e820[i].addr < MB(1) )
> > +        {
> > +            unsigned long end = min_t(unsigned long,
> > +                            d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
> > +
> > +            saved_current = current;
> > +            set_current(v);
> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> > +                                        maddr_to_virt(d->arch.e820[i].addr),
> > +                                        end - d->arch.e820[i].addr);
> > +            set_current(saved_current);
> 
> If anything goes wrong here, how much confusion will result from
> current being wrong? In particular, will this complicate debugging
> of possible issues?

TBH, I'm not sure; current in this case is the idle domain, so trying to execute 
an hvm_copy_to_guest_phys with current being the idle domain (which from a Xen 
PoV is a PV vCPU) would probably result in some assert triggering in the 
hvm_copy_to_guest_phys path (or at least I would expect so). Note that this 
chunk of code is removed, since RAM regions below 1MB are now mapped as 
p2m_ram_rw.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-28 11:26     ` Roger Pau Monne
@ 2016-11-28 11:41       ` Jan Beulich
  2016-11-28 13:30         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-28 11:41 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 28.11.16 at 12:26, <roger.pau@citrix.com> wrote:
> On Fri, Nov 11, 2016 at 10:16:43AM -0700, Jan Beulich wrote:
>> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > +static int __init hvm_steal_ram(struct domain *d, unsigned long size,
>> > +                                paddr_t limit, paddr_t *addr)
>> > +{
>> > +    unsigned int i;
>> > +
>> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
>> > +    {
>> > +        struct e820entry *entry = &d->arch.e820[d->arch.nr_e820 - i];
>> 
>> Why don't you simply make the loop count downwards?
> 
> With i being an unsigned int, this would make the condition look quite awkward, 
> because i >= 0 cannot be used. I would have to use i <= d->arch.nr_e820, so I 
> think it's best to leave it as-is for readability.

What's wrong with

    i = d->arch.nr_e820;
    while ( i-- )
    {
        ...

(or its for() equivalent)?
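For reference, that idiom works because the post-decrement tests the old value before wrapping; here is a minimal standalone sketch (the array search itself is invented for illustration, it is not the e820 code under review):

```c
#include <assert.h>

/*
 * Count-down iteration with an unsigned index: "while ( i-- )" tests
 * the old value of i, so the loop body sees n-1, n-2, ..., 0 and the
 * loop exits cleanly once i was 0 (the wrap to UINT_MAX happens only
 * after the final test).  This also handles n == 0 safely.
 */
static int last_matching(const int *arr, unsigned int n, int needle)
{
    unsigned int i = n;

    while ( i-- )   /* equivalently: for ( i = n; i--; ) */
        if ( arr[i] == needle )
            return (int)i;   /* index of the last match */

    return -1;
}
```

With arr = {1, 2, 1} and needle 1, the loop visits indices 2, 1, 0 in that order and stops at 2, i.e. it naturally finds the last match, which is exactly what scanning the e820 from the top wants.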

>> > +            saved_current = current;
>> > +            set_current(v);
>> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
>> > +                                        maddr_to_virt(d->arch.e820[i].addr),
>> > +                                        end - d->arch.e820[i].addr);
>> > +            set_current(saved_current);
>> 
>> If anything goes wrong here, how much confusion will result from
>> current being wrong? In particular, will this complicate debugging
>> of possible issues?
> 
> TBH, I'm not sure, current in this case is the idle domain, so trying to execute 
> a hvm_copy_to_guest_phys with current being the idle domain, which from a Xen 
> PoV is a PV vCPU, would probably result in some assert triggering in the 
> hvm_copy_to_guest_phys path (or at least I would expect so). Note that this 
> chunk of code is removed, since RAM regions below 1MB are now mapped as 
> p2m_ram_rw.

Even if this chunk of code no longer exists, IIRC there were a few
more instances of this current overriding, so unless they're all gone
now I still think this needs considering (and ideally finding a better
solution, maybe along the lines of mapcache_override_current()).

Jan



* Re: [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2
  2016-11-11 20:30   ` Konrad Rzeszutek Wilk
@ 2016-11-28 12:14     ` Roger Pau Monne
  0 siblings, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-28 12:14 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: xen-devel, boris.ostrovsky, Jan Beulich, Andrew Cooper

On Fri, Nov 11, 2016 at 03:30:17PM -0500, Konrad Rzeszutek Wilk wrote:
> On Sat, Oct 29, 2016 at 10:59:59AM +0200, Roger Pau Monne wrote:
> > Introduce a helper to parse the Dom0 kernel.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> > Changes since v2:
> >  - Remove debug messages.
> >  - Don't hardcode the number of modules to 1.
> > ---
> >  xen/arch/x86/domain_build.c | 138 ++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 138 insertions(+)
> > 
> > diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
> > index ec1ac89..168be62 100644
> > --- a/xen/arch/x86/domain_build.c
> > +++ b/xen/arch/x86/domain_build.c
> > @@ -39,6 +39,7 @@
> >  #include <asm/hpet.h>
> >  
> >  #include <public/version.h>
> > +#include <public/arch-x86/hvm/start_info.h>
> >  
> >  static long __initdata dom0_nrpages;
> >  static long __initdata dom0_min_nrpages;
> > @@ -1895,12 +1896,141 @@ static int __init hvm_setup_p2m(struct domain *d)
> >      return 0;
> >  }
> >  
> > +static int __init hvm_load_kernel(struct domain *d, const module_t *image,
> > +                                  unsigned long image_headroom,
> > +                                  module_t *initrd, char *image_base,
> > +                                  char *cmdline, paddr_t *entry,
> > +                                  paddr_t *start_info_addr)
> > +{
> > +    char *image_start = image_base + image_headroom;
> > +    unsigned long image_len = image->mod_end;
> > +    struct elf_binary elf;
> > +    struct elf_dom_parms parms;
> > +    paddr_t last_addr;
> > +    struct hvm_start_info start_info;
> > +    struct hvm_modlist_entry mod;
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    int rc;
> > +
> > +    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
> > +    {
> > +        printk("Error trying to detect bz compressed kernel\n");
> > +        return rc;
> > +    }
> > +
> > +    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
> > +    {
> > +        printk("Unable to init ELF\n");
> > +        return rc;
> > +    }
> > +#ifdef VERBOSE
> > +    elf_set_verbose(&elf);
> > +#endif
> > +    elf_parse_binary(&elf);
> > +    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
> > +    {
> > +        printk("Unable to parse kernel for ELFNOTES\n");
> 
> Perhaps s/ELFNOTES/PT_NOTEs/?

They can also be in an SHT_NOTE section; FreeBSD, for example, doesn't have a 
PT_NOTE program header and just places the notes inside an SHT_NOTE section.

> > +        return rc;
> > +    }
> > +
> > +    if ( parms.phys_entry == UNSET_ADDR32 ) {
> > +        printk("Unable to find kernel entry point, aborting\n");
> 
> Perhaps: Unable to find XEN_ELFNOTE_PHYS32_ENTRY point.

Right, I've changed "point" to "address" because I think it's easier to 
understand.

> > +        return -EINVAL;
> > +    }
> > +
> > +    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
> > +           parms.guest_ver, parms.loader,
> 
> Hm, I don't know if XEN_ELFNOTE_GUEST_VERSION or XEN_ELFNOTE_GUEST_OS
> are mandated.
> 
> Perhaps you should do memset(&params) before you pass it to elf_xen_parse?

elf_xen_parse already does a memset of params.
 
> > +           elf_64bit(&elf) ? "64-bit" : "32-bit");
> > +
> > +    /* Copy the OS image and free temporary buffer. */
> > +    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
> > +    elf.dest_size = parms.virt_kend - parms.virt_kstart;
> > +
> > +    saved_current = current;
> > +    set_current(v);
> > +
> > +    rc = elf_load_binary(&elf);
> > +    if ( rc < 0 )
> > +    {
> > +        printk("Failed to load kernel: %d\n", rc);
> > +        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
> > +        goto out;
> > +    }
> > +
> > +    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> > +
> > +    if ( initrd != NULL )
> > +    {
> > +        rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
> > +                                    initrd->mod_end);
> > +        if ( rc != HVMCOPY_okay )
> > +        {
> > +            printk("Unable to copy initrd to guest\n");
> > +            rc = -EFAULT;
> > +            goto out;
> > +        }
> > +
> > +        mod.paddr = last_addr;
> > +        mod.size = initrd->mod_end;
> > +        last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
> > +    }
> > +
> > +    /* Free temporary buffers. */
> > +    discard_initial_images();
> > +
> > +    memset(&start_info, 0, sizeof(start_info));
> > +    if ( cmdline != NULL )
> > +    {
> > +        rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1);
> > +        if ( rc != HVMCOPY_okay )
> > +        {
> > +            printk("Unable to copy guest command line\n");
> > +            rc = -EFAULT;
> > +            goto out;
> > +        }
> > +        start_info.cmdline_paddr = last_addr;
> > +        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
> > +    }
> > +    if ( initrd != NULL )
> 
> It may be better if this is an array. You can have multiple initrd's.

Hm, I'm not really sure I can do anything about this here. ATM it seems like 
Dom0 build is limited to a single initrd blob, and that's how it's done in the 
classic PV path.

I'm also very interested in being able to pass more than one module to Dom0; 
that would make booting FreeBSD much easier, since there's no initrd there and 
I basically have to create a fake one. Being able to pass more than one module 
would simplify the code there. In any case I think this is out of the scope of 
this series.

> > +    {
> > +        rc = hvm_copy_to_guest_phys(last_addr, &mod, sizeof(mod));
> > +        if ( rc != HVMCOPY_okay )
> > +        {
> > +            printk("Unable to copy guest modules\n");
> > +            rc = -EFAULT;
> > +            goto out;
> > +        }
> > +        start_info.modlist_paddr = last_addr;
> > +        start_info.nr_modules = 1;
> > +        last_addr += sizeof(mod);
> > +    }
> > +
> > +    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
> > +    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
> > +    rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info));
> > +    if ( rc != HVMCOPY_okay )
> > +    {
> > +        printk("Unable to copy start info to guest\n");
> > +        rc = -EFAULT;
> > +        goto out;
> > +    }
> > +
> > +    *entry = parms.phys_entry;
> > +    *start_info_addr = last_addr;
> > +    rc = 0;
> > +
> > +out:
> 
> Extra space in front of the label.

Fixed, thanks!

Roger.


* Re: [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-28 11:41       ` Jan Beulich
@ 2016-11-28 13:30         ` Roger Pau Monne
  2016-11-28 13:49           ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-28 13:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Mon, Nov 28, 2016 at 04:41:22AM -0700, Jan Beulich wrote:
> >>> On 28.11.16 at 12:26, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 11, 2016 at 10:16:43AM -0700, Jan Beulich wrote:
> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> > +static int __init hvm_steal_ram(struct domain *d, unsigned long size,
> >> > +                                paddr_t limit, paddr_t *addr)
> >> > +{
> >> > +    unsigned int i;
> >> > +
> >> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> >> > +    {
> >> > +        struct e820entry *entry = &d->arch.e820[d->arch.nr_e820 - i];
> >> 
> >> Why don't you simply make the loop count downwards?
> > 
> > With i being an unsigned int, this would make the condition look quite awkward, 
> > because i >= 0 cannot be used. I would have to use i <= d->arch.nr_e820, so I 
> > think it's best to leave it as-is for readability.
> 
> What's wrong with
> 
>     i = d->arch.nr_e820;
>     while ( i-- )
>     {
>         ...
> 
> (or its for() equivalent)?

Nothing, I guess it's Monday...

> >> > +            saved_current = current;
> >> > +            set_current(v);
> >> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> >> > +                                        maddr_to_virt(d->arch.e820[i].addr),
> >> > +                                        end - d->arch.e820[i].addr);
> >> > +            set_current(saved_current);
> >> 
> >> If anything goes wrong here, how much confusion will result from
> >> current being wrong? In particular, will this complicate debugging
> >> of possible issues?
> > 
> > TBH, I'm not sure, current in this case is the idle domain, so trying to execute 
> > a hvm_copy_to_guest_phys with current being the idle domain, which from a Xen 
> > PoV is a PV vCPU, would probably result in some assert triggering in the 
> > hvm_copy_to_guest_phys path (or at least I would expect so). Note that this 
> > chunk of code is removed, since RAM regions below 1MB are now mapped as 
> > p2m_ram_rw.
> 
> Even if this chunk of code no longer exists, iirc there  were a few
> more instances of this current overriding, so unless they're all gone
> now I still think this need considering (and ideally finding a better
> solution, maybe along the lines of mapcache_override_current()).

This could be solved by making __hvm_copy take a struct domain param, but this 
is a very big change. I could also try to fix __hvm_copy so that we can set an 
override vcpu, much like mapcache_override_current (hvm_override_current?).

Roger.


* Re: [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-28 13:30         ` Roger Pau Monne
@ 2016-11-28 13:49           ` Jan Beulich
  2016-11-28 16:02             ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-28 13:49 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 28.11.16 at 14:30, <roger.pau@citrix.com> wrote:
> On Mon, Nov 28, 2016 at 04:41:22AM -0700, Jan Beulich wrote:
>> >>> On 28.11.16 at 12:26, <roger.pau@citrix.com> wrote:
>> > On Fri, Nov 11, 2016 at 10:16:43AM -0700, Jan Beulich wrote:
>> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> >> > +            saved_current = current;
>> >> > +            set_current(v);
>> >> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
>> >> > +                                        maddr_to_virt(d->arch.e820[i].addr),
>> >> > +                                        end - d->arch.e820[i].addr);
>> >> > +            set_current(saved_current);
>> >> 
>> >> If anything goes wrong here, how much confusion will result from
>> >> current being wrong? In particular, will this complicate debugging
>> >> of possible issues?
>> > 
>> > TBH, I'm not sure, current in this case is the idle domain, so trying to execute 
>> > a hvm_copy_to_guest_phys with current being the idle domain, which from a Xen 
>> > PoV is a PV vCPU, would probably result in some assert triggering in the 
>> > hvm_copy_to_guest_phys path (or at least I would expect so). Note that this 
> 
>> > chunk of code is removed, since RAM regions below 1MB are now mapped as 
>> > p2m_ram_rw.
>> 
>> Even if this chunk of code no longer exists, iirc there  were a few
>> more instances of this current overriding, so unless they're all gone
>> now I still think this need considering (and ideally finding a better
>> solution, maybe along the lines of mapcache_override_current()).
> 
> This could be solved by making __hvm_copy take a struct domain param, but this 
> is a very big change. I could also try to fix __hvm_copy so that we can set an 
> override vcpu, much like mapcache_override_current (hvm_override_current?).

Well, as implied before: If there's provably no harm to debuggability,
then perhaps there's no real need for you to change your code. If,
however, there remains any doubt, then I specifically suggested
that override variant, knowing that handing a proper struct vcpu *
or struct domain * to the function would likely end up touching a lot
of code.

Jan



* Re: [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-11-28 13:49           ` Jan Beulich
@ 2016-11-28 16:02             ` Roger Pau Monne
  0 siblings, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-28 16:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Mon, Nov 28, 2016 at 06:49:42AM -0700, Jan Beulich wrote:
> >>> On 28.11.16 at 14:30, <roger.pau@citrix.com> wrote:
> > On Mon, Nov 28, 2016 at 04:41:22AM -0700, Jan Beulich wrote:
> >> >>> On 28.11.16 at 12:26, <roger.pau@citrix.com> wrote:
> >> > On Fri, Nov 11, 2016 at 10:16:43AM -0700, Jan Beulich wrote:
> >> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> >> > +            saved_current = current;
> >> >> > +            set_current(v);
> >> >> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> >> >> > +                                        maddr_to_virt(d->arch.e820[i].addr),
> >> >> > +                                        end - d->arch.e820[i].addr);
> >> >> > +            set_current(saved_current);
> >> >> 
> >> >> If anything goes wrong here, how much confusion will result from
> >> >> current being wrong? In particular, will this complicate debugging
> >> >> of possible issues?
> >> > 
> >> > TBH, I'm not sure, current in this case is the idle domain, so trying to execute 
> >> > a hvm_copy_to_guest_phys with current being the idle domain, which from a Xen 
> >> > PoV is a PV vCPU, would probably result in some assert triggering in the 
> >> > hvm_copy_to_guest_phys path (or at least I would expect so). Note that this 
> > 
> >> > chunk of code is removed, since RAM regions below 1MB are now mapped as 
> >> > p2m_ram_rw.
> >> 
> >> Even if this chunk of code no longer exists, iirc there  were a few
> >> more instances of this current overriding, so unless they're all gone
> >> now I still think this need considering (and ideally finding a better
> >> solution, maybe along the lines of mapcache_override_current()).
> > 
> > This could be solved by making __hvm_copy take a struct domain param, but this 
> > is a very big change. I could also try to fix __hvm_copy so that we can set an 
> > override vcpu, much like mapcache_override_current (hvm_override_current?).
> 
> Well, as implied before: If there's provably no harm to debuggability,
> then perhaps there's no real need for you to change your code. If,
> however, there remains any doubt, then I specifically suggested
> that override variant, knowing that handing a proper struct vcpu *
> or struct domain * to the function would likely end up touching a lot
> of code.

Hm, I don't really see any harm ATM, but I'm also wondering whether I should add 
a helper that does all this: there are multiple instances of the 
set_current/hvm_copy/set_current construct throughout the series, so adding an 
hvm_copy_to_phys helper seems quite sensible.
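A rough sketch of such a wrapper, with stub stand-ins for Xen's vcpu, current pointer and copy primitive (hvm_copy_to_phys is the hypothetical name from above, not an existing Xen function):

```c
#include <assert.h>
#include <string.h>

/* Stubs standing in for Xen's types and primitives. */
struct vcpu { int id; };

static struct vcpu idle_vcpu = { -1 };
static struct vcpu *current_v = &idle_vcpu; /* plays the role of current */
static int copier_id;                       /* who was "current" during the copy */

static int copy_to_guest_phys_stub(void *dst, const void *src, size_t len)
{
    copier_id = current_v->id;
    memcpy(dst, src, len);
    return 0;
}

/*
 * The helper proposed above: fold the recurring
 * set_current(v) / copy / set_current(saved_current) dance into one
 * place, so no caller can forget the restore on an error path.
 */
static int hvm_copy_to_phys(struct vcpu *v, void *dst, const void *src,
                            size_t len)
{
    struct vcpu *saved = current_v;
    int rc;

    current_v = v;
    rc = copy_to_guest_phys_stub(dst, src, len);
    current_v = saved;
    return rc;
}
```

Centralising the save/restore means the window where current is "wrong" is confined to one audited function rather than scattered across the series.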

Roger.


* Re: [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2
  2016-11-17 10:49       ` Jan Beulich
@ 2016-11-28 17:49         ` Roger Pau Monne
  2016-11-29  9:34           ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-28 17:49 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Nov 17, 2016 at 03:49:22AM -0700, Jan Beulich wrote:
> >>> On 16.11.16 at 19:02, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 11, 2016 at 09:53:49AM -0700, Jan Beulich wrote:
> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> > --- a/xen/arch/x86/setup.c
> >> > +++ b/xen/arch/x86/setup.c
> >> > @@ -67,6 +67,16 @@ unsigned long __read_mostly cr4_pv32_mask;
> >> >  static bool_t __initdata opt_dom0pvh;
> >> >  boolean_param("dom0pvh", opt_dom0pvh);
> >> >  
> >> > +/*
> >> > + * List of parameters that affect Dom0 creation:
> >> > + *
> >> > + *  - hvm               Create a PVHv2 Dom0.
> >> > + *  - shadow            Use shadow paging for Dom0.
> >> > + */
> >> > +static void parse_dom0_param(char *s);
> >> 
> >> Please try to avoid such forward declarations.

OK, so would you prefer to place the custom_param call after the function 
definition? I've done it that way (with the forward declaration) because it's 
the way other options are already implemented.
 
> >> > @@ -1543,6 +1574,14 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >> >      if ( opt_dom0pvh )
> >> >          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
> >> >  
> >> > +    if ( dom0_hvm )
> >> > +    {
> >> > +        domcr_flags |= DOMCRF_hvm |
> >> > +                       ((hvm_funcs.hap_supported && !opt_dom0_shadow) ?
> >> > +                         DOMCRF_hap : 0);
> >> > +        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
> >> > +    }
> >> 
> >> If you wire this up here already, instead of later in the series, what's
> >> the effect of someone using this option? Crash?
> > 
> > Most certainly. The BSP IP points to 0 at this point. I can wire this up later, 
> > but it's going to be strange IMHO.
> 
> Not any more "strange" than someone trying the option and getting
> some random and perhaps not immediately understandable crash.

I will add a panic here then, and remove it once this is finished.

Roger.


* Re: [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2
  2016-11-28 17:49         ` Roger Pau Monne
@ 2016-11-29  9:34           ` Jan Beulich
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-11-29  9:34 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 28.11.16 at 18:49, <roger.pau@citrix.com> wrote:
> On Thu, Nov 17, 2016 at 03:49:22AM -0700, Jan Beulich wrote:
>> >>> On 16.11.16 at 19:02, <roger.pau@citrix.com> wrote:
>> > On Fri, Nov 11, 2016 at 09:53:49AM -0700, Jan Beulich wrote:
>> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> >> > --- a/xen/arch/x86/setup.c
>> >> > +++ b/xen/arch/x86/setup.c
>> >> > @@ -67,6 +67,16 @@ unsigned long __read_mostly cr4_pv32_mask;
>> >> >  static bool_t __initdata opt_dom0pvh;
>> >> >  boolean_param("dom0pvh", opt_dom0pvh);
>> >> >  
>> >> > +/*
>> >> > + * List of parameters that affect Dom0 creation:
>> >> > + *
>> >> > + *  - hvm               Create a PVHv2 Dom0.
>> >> > + *  - shadow            Use shadow paging for Dom0.
>> >> > + */
>> >> > +static void parse_dom0_param(char *s);
>> >> 
>> >> Please try to avoid such forward declarations.
> 
> OK, so would you prefer to place the custom_param call after the function 
> definition? I've done it that way (with the forward declaration) because it's 
> the way other options are already implemented.

I don't know where you've looked, but out of the 8 or so examples
I've looked at just now only one used a forward declaration, and
we've been asking others introducing new custom options the same
thing I'm asking you now. IOW - yes, the function should come first,
immediately followed by the custom_param().
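In outline, the requested layout looks like this (custom_param here is a simplified stand-in for Xen's real registration macro, which places the handler in a dedicated setup section):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for Xen's custom_param() registration macro. */
typedef void (*param_fn)(char *);
#define custom_param(name, fn) static param_fn reg_##fn = fn

static int dom0_hvm, dom0_shadow;

/* The parsing function is defined first... */
static void parse_dom0_param(char *s)
{
    if ( !strcmp(s, "hvm") )
        dom0_hvm = 1;
    else if ( !strcmp(s, "shadow") )
        dom0_shadow = 1;
}
/* ...and registered immediately after its definition, so no forward
 * declaration is needed anywhere. */
custom_param("dom0", parse_dom0_param);
```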

Jan



* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-03 11:54         ` Boris Ostrovsky
@ 2016-11-29 12:33           ` Roger Pau Monne
  2016-11-29 12:47             ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-29 12:33 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper,
	Suravee Suthikulpanit, xen-devel

On Thu, Nov 03, 2016 at 07:54:24AM -0400, Boris Ostrovsky wrote:
> 
> 
> On 11/03/2016 07:35 AM, Jan Beulich wrote:
> > > > > On 03.11.16 at 11:58, <roger.pau@citrix.com> wrote:
> > > On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
> > > > > > > On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > > > > --- a/xen/arch/x86/setup.c
> > > > > +++ b/xen/arch/x86/setup.c
> > > > > @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> > > > > 
> > > > >      early_msi_init();
> > > > > 
> > > > > +    scan_pci_devices();
> > > > > +
> > > > >      iommu_setup();    /* setup iommu if available */
> > > > > 
> > > > >      smp_prepare_cpus(max_cpus);
> > > > > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > > > > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > > > > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
> > > > > 
> > > > >      if ( !amd_iommu_perdev_intremap )
> > > > >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
> > > > > -    return scan_pci_devices();
> > > > > +
> > > > > +    return 0;
> > > > >  }
> > > > 
> > > > I'm relatively certain that I did point out on a prior version that the
> > > > error handling here gets lost. At the very least the commit message
> > > > should provide a reason for doing so; even better would be if there
> > > > was no behavioral change (other than the point in time where this
> > > > happens slightly changing).
> > > 
> > > Behaviour here is different on Intel or AMD hardware, on Intel failure to
> > > scan the PCI bus will not be fatal, and the IOMMU will be enabled anyway. On
> > > AMD OTOH failure to scan the PCI bus will cause the IOMMU to be disabled.
> > > I expect we should be able to behave equally for both Intel and AMD, so
> > > which one should be used?
> > 
> > I'm afraid I have to defer to the vendor IOMMU maintainers for
> > that one, as I don't know the reason for the difference in behavior.
> > An aspect that may play into here is that for AMD the IOMMU is
> > represented by a PCI device, while for Intel it's just a part of one
> > of the core chipset devices.
> 
> That's probably the reason although it looks like the only failure that
> scan_pci_devices() can return is -ENOMEM, in which case disabling IOMMU may
> not be the biggest problem.

I don't think we have reached consensus regarding what to do here. IMHO, if we 
have to keep the same behavior it makes no sense to move the call, in which 
case I will just remove this patch. OTOH, I think that as Boris says, if 
scan_pci_devices fails there's something very wrong, in which case we should 
just panic.

Roger.


* Re: [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO
  2016-11-11 16:58   ` Jan Beulich
@ 2016-11-29 12:41     ` Roger Pau Monne
  2016-11-29 13:00       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-29 12:41 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, WeiLiu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

On Fri, Nov 11, 2016 at 09:58:44AM -0700, Jan Beulich wrote:
> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > Current {un}map_mmio_regions implementation has a maximum number of loops to
> > perform before giving up and returning to the caller. This is an issue when
> > mapping large MMIO regions when building the hardware domain. In order to
> > solve it, introduce a wrapper around {un}map_mmio_regions that takes care of
> > calling process_pending_softirqs between consecutive {un}map_mmio_regions
> > calls.
> 
> So is this something that's going to be needed for other than
> hwdom building? Because if not ...

Yes, something similar will also be used by PHYSDEVOP_pci_mmcfg_reserved, but 
that would require hypercall continuations instead of processing pending 
softirqs.
 
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -1418,6 +1418,32 @@ int prepare_ring_for_helper(
> >      return 0;
> >  }
> >  
> > +int modify_identity_mmio(struct domain *d, unsigned long pfn,
> > +                         unsigned long nr_pages, bool map)
> 
> ... I don't think the function belongs here, and it should be
> marked __hwdom_init.

Where would you recommend adding it? Take into account that it's also going to be 
used by other code apart from the ACPI Dom0 builder, like the PCI BAR mapping.

Thanks, Roger.


* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-29 12:33           ` Roger Pau Monne
@ 2016-11-29 12:47             ` Jan Beulich
  2016-11-29 12:57               ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-29 12:47 UTC (permalink / raw)
  To: Roger Pau Monne, Boris Ostrovsky
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit, xen-devel

>>> On 29.11.16 at 13:33, <roger.pau@citrix.com> wrote:
> On Thu, Nov 03, 2016 at 07:54:24AM -0400, Boris Ostrovsky wrote:
>> 
>> 
>> On 11/03/2016 07:35 AM, Jan Beulich wrote:
>> > > > > On 03.11.16 at 11:58, <roger.pau@citrix.com> wrote:
>> > > On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
>> > > > > > > On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > > > > --- a/xen/arch/x86/setup.c
>> > > > > +++ b/xen/arch/x86/setup.c
>> > > > > @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long 
> mbi_p)
>> > > > > 
>> > > > >      early_msi_init();
>> > > > > 
>> > > > > +    scan_pci_devices();
>> > > > > +
>> > > > >      iommu_setup();    /* setup iommu if available */
>> > > > > 
>> > > > >      smp_prepare_cpus(max_cpus);
>> > > > > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> > > > > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
>> > > > > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
>> > > > > 
>> > > > >      if ( !amd_iommu_perdev_intremap )
>> > > > >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap 
> table is not recommended (see XSA-36)!\n");
>> > > > > -    return scan_pci_devices();
>> > > > > +
>> > > > > +    return 0;
>> > > > >  }
>> > > > 
>> > > > I'm relatively certain that I did point out on a prior version that the
>> > > > error handling here gets lost. At the very least the commit message
>> > > > should provide a reason for doing so; even better would be if there
>> > > > was no behavioral change (other than the point in time where this
>> > > > happens slightly changing).
>> > > 
>> > > Behaviour here is different on Intel or AMD hardware, on Intel failure to
>> > > scan the PCI bus will not be fatal, and the IOMMU will be enabled anyway. On
>> > > AMD OTOH failure to scan the PCI bus will cause the IOMMU to be disabled.
>> > > I expect we should be able to behave equally for both Intel and AMD, so
>> > > which one should be used?
>> > 
>> > I'm afraid I have to defer to the vendor IOMMU maintainers for
>> > that one, as I don't know the reason for the difference in behavior.
>> > An aspect that may play into here is that for AMD the IOMMU is
>> > represented by a PCI device, while for Intel it's just a part of one
>> > of the core chipset devices.
>> 
>> That's probably the reason although it looks like the only failure that
>> scan_pci_devices() can return is -ENOMEM, in which case disabling IOMMU may
>> not be the biggest problem.
> 
> I don't think we have reached consensus regarding what to do here. IMHO, if we 
> have to keep the same behavior it makes no sense to move the call, in which 
> case I will just remove this patch. OTOH, I think that as Boris says, if 
> scan_pci_devices fails there's something very wrong, in which case we should 
> just panic.

While I can see your point, I think we should get away from both
assuming only certain kinds of failures can occur in the callers of
functions as well as panic()ing for initialization failure of optional
functionality. Anything depending on such optional stuff should
simply get disabled in turn.

As to the specific case here - I think rather than ditching error
handling, it would better be added uniformly (i.e. disabling the
IOMMU regardless of vendor). Otoh, if leaving the patch out is
an option, I wouldn't mind that route; I had got the impression
though that you were of the opinion that it's a requirement.

Jan



* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-29 12:47             ` Jan Beulich
@ 2016-11-29 12:57               ` Roger Pau Monne
  2016-11-30  5:53                 ` Tian, Kevin
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-29 12:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, Boris Ostrovsky

On Tue, Nov 29, 2016 at 05:47:42AM -0700, Jan Beulich wrote:
> >>> On 29.11.16 at 13:33, <roger.pau@citrix.com> wrote:
> > On Thu, Nov 03, 2016 at 07:54:24AM -0400, Boris Ostrovsky wrote:
> >> 
> >> 
> >> On 11/03/2016 07:35 AM, Jan Beulich wrote:
> >> > > > > On 03.11.16 at 11:58, <roger.pau@citrix.com> wrote:
> >> > > On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
> >> > > > > > > On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> > > > > --- a/xen/arch/x86/setup.c
> >> > > > > +++ b/xen/arch/x86/setup.c
> >> > > > > @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long 
> > mbi_p)
> >> > > > > 
> >> > > > >      early_msi_init();
> >> > > > > 
> >> > > > > +    scan_pci_devices();
> >> > > > > +
> >> > > > >      iommu_setup();    /* setup iommu if available */
> >> > > > > 
> >> > > > >      smp_prepare_cpus(max_cpus);
> >> > > > > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> > > > > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> >> > > > > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
> >> > > > > 
> >> > > > >      if ( !amd_iommu_perdev_intremap )
> >> > > > >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap 
> > table is not recommended (see XSA-36)!\n");
> >> > > > > -    return scan_pci_devices();
> >> > > > > +
> >> > > > > +    return 0;
> >> > > > >  }
> >> > > > 
> >> > > > I'm relatively certain that I did point out on a prior version that the
> >> > > > error handling here gets lost. At the very least the commit message
> >> > > > should provide a reason for doing so; even better would be if there
> >> > > > was no behavioral change (other than the point in time where this
> >> > > > happens slightly changing).
> >> > > 
> >> > > Behaviour here is different on Intel or AMD hardware, on Intel failure to
> >> > > scan the PCI bus will not be fatal, and the IOMMU will be enabled anyway. On
> >> > > AMD OTOH failure to scan the PCI bus will cause the IOMMU to be disabled.
> >> > > I expect we should be able to behave equally for both Intel and AMD, so
> >> > > which one should be used?
> >> > 
> >> > I'm afraid I have to defer to the vendor IOMMU maintainers for
> >> > that one, as I don't know the reason for the difference in behavior.
> >> > An aspect that may play into here is that for AMD the IOMMU is
> >> > represented by a PCI device, while for Intel it's just a part of one
> >> > of the core chipset devices.
> >> 
> >> That's probably the reason although it looks like the only failure that
> >> scan_pci_devices() can return is -ENOMEM, in which case disabling IOMMU may
> >> not be the biggest problem.
> > 
> > I don't think we have reached consensus regarding what to do here. IMHO, if we 
> > have to keep the same behavior it makes no sense to move the call, in which 
> > case I will just remove this patch. OTOH, I think that as Boris says, if 
> > scan_pci_devices fails there's something very wrong, in which case we should 
> > just panic.
> 
> While I can see your point, I think we should get away from both
> assuming only certain kinds of failures can occur in the callers of
> functions as well as panic()ing for initialization failure of optional
> functionality. Anything depending on such optional stuff should
> simply get disabled in turn.
> 
> As to the specific case here - I think rather than ditching error
> handling, it would better be added uniformly (i.e. disabling the
> IOMMU regardless of vendor). Otoh, if leaving the patch out is
> an option, I wouldn't mind that route; I had got the impression
> though that you were of the opinion that it's a requirement.

OK, for PVHv2 Dom0 scanning the PCI devices is a requirement, so I think the 
best way to solve this is to also fail IOMMU initialization for Intel if the PCI 
scan fails, and this will also prevent a PVHv2 Dom0 from booting, since the 
IOMMU is a requirement.

Roger. 


* Re: [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO
  2016-11-29 12:41     ` Roger Pau Monne
@ 2016-11-29 13:00       ` Jan Beulich
  2016-11-29 15:32         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-29 13:00 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, WeiLiu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 29.11.16 at 13:41, <roger.pau@citrix.com> wrote:
> On Fri, Nov 11, 2016 at 09:58:44AM -0700, Jan Beulich wrote:
>> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
>> > Current {un}map_mmio_regions implementation has a maximum number of loops 
> to
>> > perform before giving up and returning to the caller. This is an issue when
>> > mapping large MMIO regions when building the hardware domain. In order to
>> > solve it, introduce a wrapper around {un}map_mmio_regions that takes care 
> of
>> > calling process_pending_softirqs between consecutive {un}map_mmio_regions
>> > calls.
>> 
>> So is this something that's going to be needed for other than
>> hwdom building? Because if not ...
> 
> Yes, something similar will also be used by PHYSDEVOP_pci_mmcfg_reserved, but 
> that would require hypercall continuations instead of processing pending softirqs.
>  
>> > --- a/xen/common/memory.c
>> > +++ b/xen/common/memory.c
>> > @@ -1418,6 +1418,32 @@ int prepare_ring_for_helper(
>> >      return 0;
>> >  }
>> >  
>> > +int modify_identity_mmio(struct domain *d, unsigned long pfn,
>> > +                         unsigned long nr_pages, bool map)
>> 
>> ... I don't think the function belongs here, and it should be
>> marked __hwdom_init.
> 
> Were would you recommend adding it? Take into account that's also going to be 
> used by other code apart from the ACPI Dom0 builder, like the PCI BAR 
> mapping.

Hmm, if it's to be used by non-init code, then it staying here would
make sense if it's also potentially useful to ARM. If it isn't, then
moving it to e.g. xen/arch/x86/mm.c might be better. If it was init
only, I would have wanted it to go into one of the more dedicated
init files ...

Jan



* Re: [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO
  2016-11-29 13:00       ` Jan Beulich
@ 2016-11-29 15:32         ` Roger Pau Monne
  0 siblings, 0 replies; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-29 15:32 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, WeiLiu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, xen-devel, boris.ostrovsky

On Tue, Nov 29, 2016 at 06:00:47AM -0700, Jan Beulich wrote:
> >>> On 29.11.16 at 13:41, <roger.pau@citrix.com> wrote:
> > On Fri, Nov 11, 2016 at 09:58:44AM -0700, Jan Beulich wrote:
> >> >>> On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> >> > Current {un}map_mmio_regions implementation has a maximum number of loops 
> > to
> >> > perform before giving up and returning to the caller. This is an issue when
> >> > mapping large MMIO regions when building the hardware domain. In order to
> >> > solve it, introduce a wrapper around {un}map_mmio_regions that takes care 
> > of
> >> > calling process_pending_softirqs between consecutive {un}map_mmio_regions
> >> > calls.
> >> 
> >> So is this something that's going to be needed for other than
> >> hwdom building? Because if not ...
> > 
> > Yes, something similar will also be used by PHYSDEVOP_pci_mmcfg_reserved, but 
> > that would require hypercall continuations instead of processing pending softirqs.
> >  
> >> > --- a/xen/common/memory.c
> >> > +++ b/xen/common/memory.c
> >> > @@ -1418,6 +1418,32 @@ int prepare_ring_for_helper(
> >> >      return 0;
> >> >  }
> >> >  
> >> > +int modify_identity_mmio(struct domain *d, unsigned long pfn,
> >> > +                         unsigned long nr_pages, bool map)
> >> 
> >> ... I don't think the function belongs here, and it should be
> >> marked __hwdom_init.
> > 
> > Were would you recommend adding it? Take into account that's also going to be 
> > used by other code apart from the ACPI Dom0 builder, like the PCI BAR 
> > mapping.
> 
> Hmm, if it's to be used by non-init code, then it staying here would
> make sense if it's also potentially useful to ARM. If it isn't, then
> moving it to e.g. xen/arch/x86/mm.c might be better. If it was init
> only, I would have wanted it to go into one of the more dedicated
> init files ...

Right, for the scope of this series this function is only used by domain build, 
so it makes sense to place it in domain_build.c under __init. If later on I need 
it for other purposes I will move it.

Roger.


* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-29 12:57               ` Roger Pau Monne
@ 2016-11-30  5:53                 ` Tian, Kevin
  2016-11-30  9:02                   ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Tian, Kevin @ 2016-11-30  5:53 UTC (permalink / raw)
  To: Roger Pau Monne, Jan Beulich
  Cc: Wu, Feng, Andrew Cooper, Suravee Suthikulpanit, xen-devel,
	Boris Ostrovsky

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: Tuesday, November 29, 2016 8:58 PM
> 
> On Tue, Nov 29, 2016 at 05:47:42AM -0700, Jan Beulich wrote:
> > >>> On 29.11.16 at 13:33, <roger.pau@citrix.com> wrote:
> > > On Thu, Nov 03, 2016 at 07:54:24AM -0400, Boris Ostrovsky wrote:
> > >>
> > >>
> > >> On 11/03/2016 07:35 AM, Jan Beulich wrote:
> > >> > > > > On 03.11.16 at 11:58, <roger.pau@citrix.com> wrote:
> > >> > > On Mon, Oct 31, 2016 at 10:47:15AM -0600, Jan Beulich wrote:
> > >> > > > > > > On 29.10.16 at 10:59, <roger.pau@citrix.com> wrote:
> > >> > > > > --- a/xen/arch/x86/setup.c
> > >> > > > > +++ b/xen/arch/x86/setup.c
> > >> > > > > @@ -1491,6 +1491,8 @@ void __init noreturn __start_xen(unsigned long
> > > mbi_p)
> > >> > > > >
> > >> > > > >      early_msi_init();
> > >> > > > >
> > >> > > > > +    scan_pci_devices();
> > >> > > > > +
> > >> > > > >      iommu_setup();    /* setup iommu if available */
> > >> > > > >
> > >> > > > >      smp_prepare_cpus(max_cpus);
> > >> > > > > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > >> > > > > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > >> > > > > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
> > >> > > > >
> > >> > > > >      if ( !amd_iommu_perdev_intremap )
> > >> > > > >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap
> > > table is not recommended (see XSA-36)!\n");
> > >> > > > > -    return scan_pci_devices();
> > >> > > > > +
> > >> > > > > +    return 0;
> > >> > > > >  }
> > >> > > >
> > >> > > > I'm relatively certain that I did point out on a prior version that the
> > >> > > > error handling here gets lost. At the very least the commit message
> > >> > > > should provide a reason for doing so; even better would be if there
> > >> > > > was no behavioral change (other than the point in time where this
> > >> > > > happens slightly changing).
> > >> > >
> > >> > > Behaviour here is different on Intel or AMD hardware, on Intel failure to
> > >> > > scan the PCI bus will not be fatal, and the IOMMU will be enabled anyway. On
> > >> > > AMD OTOH failure to scan the PCI bus will cause the IOMMU to be disabled.
> > >> > > I expect we should be able to behave equally for both Intel and AMD, so
> > >> > > which one should be used?
> > >> >
> > >> > I'm afraid I have to defer to the vendor IOMMU maintainers for
> > >> > that one, as I don't know the reason for the difference in behavior.
> > >> > An aspect that may play into here is that for AMD the IOMMU is
> > >> > represented by a PCI device, while for Intel it's just a part of one
> > >> > of the core chipset devices.
> > >>
> > >> That's probably the reason although it looks like the only failure that
> > >> scan_pci_devices() can return is -ENOMEM, in which case disabling IOMMU may
> > >> not be the biggest problem.
> > >
> > > I don't think we have reached consensus regarding what to do here. IMHO, if we
> > > have to keep the same behavior it makes no sense to move the call, in which
> > > case I will just remove this patch. OTOH, I think that as Boris says, if
> > > scan_pci_devices fails there's something very wrong, in which case we should
> > > just panic.
> >
> > While I can see your point, I think we should get away from both
> > assuming only certain kinds of failures can occur in the callers of
> > functions as well as panic()ing for initialization failure of optional
> > functionality. Anything depending on such optional stuff should
> > simply get disabled in turn.
> >
> > As to the specific case here - I think rather than ditching error
> > handling, it would better be added uniformly (i.e. disabling the
> > IOMMU regardless of vendor). Otoh, if leaving the patch out is
> > an option, I wouldn't mind that route; I had got the impression
> > though that you were of the opinion that it's a requirement.
> 
> OK, for PVHv2 Dom0 scanning the PCI devices is a requirement, so I think the
> best way to solve this is to also fail IOMMU initialization for Intel if the PCI
> scan fails, and this will also prevent a PVHv2 Dom0 from booting, since the
> IOMMU is a requirement.
> 

I'm OK with this policy change. Although there is no strict dependency
between VT-d and PCI scan, the purpose of VT-d is for PCI-based 
device assignment.

Thanks
Kevin


* Re: [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally
  2016-11-30  5:53                 ` Tian, Kevin
@ 2016-11-30  9:02                   ` Jan Beulich
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-11-30  9:02 UTC (permalink / raw)
  To: Kevin Tian
  Cc: Feng Wu, Andrew Cooper, Suravee Suthikulpanit, xen-devel,
	Boris Ostrovsky, Roger Pau Monne

>>> On 30.11.16 at 06:53, <kevin.tian@intel.com> wrote:
> I'm OK with this policy change. Although there is no strict dependency
> between VT-d and PCI scan, the purpose of VT-d is for PCI-based 
> device assignment.

How is there not, considering the bus to bridge mapping? Within
IOMMU code, VT-d is the only user of find_upstream_bridge().

Jan



* Re: [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-11-14 16:15   ` Jan Beulich
@ 2016-11-30 12:40     ` Roger Pau Monne
  2016-11-30 14:09       ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-30 12:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Mon, Nov 14, 2016 at 09:15:37AM -0700, Jan Beulich wrote:
> >>> On 29.10.16 at 11:00, <roger.pau@citrix.com> wrote:
> > Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
> > p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
> > that don't reside in RAM regions. This is needed because some memory maps
> > don't properly account for all the memory used by ACPI, so it's common to
> > find ACPI tables in holes.
> 
> I question whether this behavior should be enabled by default. Not
> having seen the code yet I also wonder whether these regions
> shouldn't simply be added to the guest's E820 as E820_ACPI, which
> should then result in them getting mapped without further special
> casing.
> 
> > +static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
> > +                                    uint32_t type)
> 
> I see s and e being uint64_t, but I don't see why type can't be plain
> unsigned int.

Well, that's the type for "type" as defined in e820.h. I'm just using uint32_t 
for consistency with that.

> > +{
> > +    unsigned int i;
> > +
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        uint64_t rs = d->arch.e820[i].addr;
> > +        uint64_t re = rs + d->arch.e820[i].size;
> > +
> > +        if ( rs == e && d->arch.e820[i].type == type )
> > +        {
> > +            d->arch.e820[i].addr = s;
> > +            return 0;
> > +        }
> > +
> > +        if ( re == s && d->arch.e820[i].type == type &&
> > +             (i + 1 == d->arch.nr_e820 || d->arch.e820[i + 1].addr >= e) )
> 
> I understand this to be overlap prevention, but there's no equivalent
> in the earlier if(). Are you relying on the table being strictly sorted at
> all times? If so, a comment should say so.

I've added such a comment at the top of the function.
 
> > +        {
> > +            d->arch.e820[i].size += e - s;
> > +            return 0;
> > +        }
> > +
> > +        if ( rs >= e )
> > +            break;
> > +
> > +        if ( re > s )
> > +            return -ENOMEM;
> 
> I don't think ENOMEM is appropriate to signal an overlap. And don't
> you need to reverse these last two if()s?

I've changed ENOMEM to EEXIST. Hm, I don't think so; if I reversed those we 
would get an error when trying to add a non-contiguous region to fill a hole 
between two existing regions, right?

> > @@ -2112,6 +2166,371 @@ static int __init hvm_setup_cpus(struct domain *d, 
> > paddr_t entry,
> >      return 0;
> >  }
> >  
> > +static int __init acpi_count_intr_ov(struct acpi_subtable_header *header,
> > +                                     const unsigned long end)
> > +{
> > +
> > +    acpi_intr_overrrides++;
> > +    return 0;
> > +}
> > +
> > +static int __init acpi_set_intr_ov(struct acpi_subtable_header *header,
> > +                                   const unsigned long end)
> 
> May I ask for "ov" to become at least "ovr" in all cases? Also stray
> const.

Sure, changed ov to ovr.

That const comes from the definition of the handler expected by 
acpi_table_parse_madt (acpi_madt_entry_handler).

> > +{
> > +    struct acpi_madt_interrupt_override *intr =
> > +        container_of(header, struct acpi_madt_interrupt_override, header);
> 
> Yet missing const here.

Done.
 
> > +    ACPI_MEMCPY(intsrcovr, intr, sizeof(*intr));
> 
> Structure assignment (for type safety; also elsewhere)?

I wasn't sure what to do here, since there's a specific ACPI_MEMCPY function, 
but I guess this is designed to be used by acpica code itself, and ACPI_MEMCPY 
is just an OS-agnostic wrapper around memcpy.

> > +static int __init hvm_setup_acpi_madt(struct domain *d, paddr_t *addr)
> > +{
> > +    struct acpi_table_madt *madt;
> > +    struct acpi_table_header *table;
> > +    struct acpi_madt_io_apic *io_apic;
> > +    struct acpi_madt_local_apic *local_apic;
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    acpi_status status;
> > +    unsigned long size;
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    /* Count number of interrupt overrides in the MADT. */
> > +    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_count_intr_ov,
> > +                          MAX_IRQ_SOURCES);
> > +
> > +    /* Calculate the size of the crafted MADT. */
> > +    size = sizeof(struct acpi_table_madt);
> > +    size += sizeof(struct acpi_madt_interrupt_override) * acpi_intr_overrrides;
> > +    size += sizeof(struct acpi_madt_io_apic);
> > +    size += sizeof(struct acpi_madt_local_apic) * dom0_max_vcpus();
> 
> All the sizeof()s would better use the variables declared above.

OK, I can do that for all of them except for acpi_madt_interrupt_override, which 
doesn't have a matching local variable.

> > +    madt = xzalloc_bytes(size);
> > +    if ( !madt )
> > +    {
> > +        printk("Unable to allocate memory for MADT table\n");
> > +        return -ENOMEM;
> > +    }
> > +
> > +    /* Copy the native MADT table header. */
> > +    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
> > +    if ( !ACPI_SUCCESS(status) )
> > +    {
> > +        printk("Failed to get MADT ACPI table, aborting.\n");
> > +        return -EINVAL;
> > +    }
> > +    ACPI_MEMCPY(madt, table, sizeof(*table));
> > +    madt->address = APIC_DEFAULT_PHYS_BASE;
> 
> You may also need to override table revision (at least it shouldn't end
> up larger than what we know about).
> 
> > +    /* Setup the IO APIC entry. */
> > +    if ( nr_ioapics > 1 )
> > +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> > +               nr_ioapics);
> 
> I've said elsewhere already that I think we should provide 1 vIO-APIC
> per physical one.

Agree, but the current vIO-APIC is not really up to it. I will work on getting 
it to support multiple instances.

> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> > +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> > +    io_apic->header.length = sizeof(*io_apic);
> > +    io_apic->id = 1;
> > +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
> > +
> > +    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
> > +    for ( i = 0; i < dom0_max_vcpus(); i++ )
> > +    {
> > +        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
> > +        local_apic->header.length = sizeof(*local_apic);
> > +        local_apic->processor_id = i;
> > +        local_apic->id = i * 2;
> > +        local_apic->lapic_flags = ACPI_MADT_ENABLED;
> > +        local_apic++;
> > +    }
> 
> What about x2apic? And for lapic, do you limit vCPU count anywhere?

Yes, there's no x2apic information. I'm currently looking at libacpi in tools, 
and there doesn't seem to be any local x2apic structure there either. Am I 
missing something?

Regarding vCPU count, I will limit it to 128.

> > +    /* Setup interrupt overwrites. */
> 
> overrides
> 
> > +static bool __init hvm_acpi_table_allowed(const char *sig)
> > +{
> > +    static const char __init banned_tables[][ACPI_NAME_SIZE] = {
> > +        ACPI_SIG_HPET, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_MPST,
> > +        ACPI_SIG_PMTT, ACPI_SIG_MADT, ACPI_SIG_DMAR};
> > +    unsigned long pfn, nr_pages;
> > +    int i;
> > +
> > +    for ( i = 0 ; i < ARRAY_SIZE(banned_tables); i++ )
> > +        if ( strncmp(sig, banned_tables[i], ACPI_NAME_SIZE) == 0 )
> > +            return false;
> > +
> > +    /* Make sure table doesn't reside in a RAM region. */
> > +    pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> > +    nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> > +                            PAGE_SIZE);
> 
> You also need to add in the offset-into-page from the base address.

Done, thanks!

> > +    if ( range_is_ram(pfn, nr_pages) )
> > +    {
> > +        printk("Skipping table %.4s because resides in a RAM region\n",
> > +               sig);
> > +        return false;
> 
> I think this should be more strict, at least to start with: Require the
> table to be in an E820_ACPI region (or maybe an E820_RESERVED
> one), but nothing else.

Done, tables are now only allowed if they reside in ACPI or reserved regions.

> > +static int __init hvm_setup_acpi_xsdt(struct domain *d, paddr_t madt_addr,
> > +                                      paddr_t *addr)
> > +{
> > +    struct acpi_table_xsdt *xsdt;
> > +    struct acpi_table_header *table;
> > +    struct acpi_table_rsdp *rsdp;
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    unsigned long size;
> > +    unsigned int i, num_tables;
> > +    int j, rc;
> > +
> > +    /*
> > +     * Restore original DMAR table signature, we are going to filter it
> > +     * from the new XSDT that is presented to the guest, so it no longer
> > +     * makes sense to have it's signature zapped.
> > +     */
> > +    acpi_dmar_reinstate();
> > +
> > +    /* Account for the space needed by the XSDT. */
> > +    size = sizeof(*xsdt);
> > +    num_tables = 0;
> > +
> > +    /* Count the number of tables that will be added to the XSDT. */
> > +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> > +    {
> > +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> > +
> > +        if ( !hvm_acpi_table_allowed(sig) )
> > +            continue;
> > +
> > +        num_tables++;
> > +    }
> 
> Unless you expect something to be added to this loop, please
> simplify it by inverting the condition and dropping the continue.

Done, thanks.
 
> > +    /*
> > +     * No need to subtract one because we will be adding a custom MADT (and
> > +     * the native one is not accounted for).
> > +     */
> > +    size += num_tables * sizeof(u64);
> 
> sizeof(xsdt->table_offset_entry[0])
> 
> > +    xsdt = xzalloc_bytes(size);
> > +    if ( !xsdt )
> > +    {
> > +        printk("Unable to allocate memory for XSDT table\n");
> > +        return -ENOMEM;
> > +    }
> > +
> > +    /* Copy the native XSDT table header. */
> > +    rsdp = acpi_os_map_memory(acpi_os_get_root_pointer(), sizeof(*rsdp));
> > +    if ( !rsdp )
> > +    {
> > +        printk("Unable to map RSDP\n");
> > +        return -EINVAL;
> > +    }
> > +    table = acpi_os_map_memory(rsdp->xsdt_physical_address, sizeof(*table));
> > +    if ( !table )
> > +    {
> > +        printk("Unable to map XSDT\n");
> > +        return -EINVAL;
> > +    }
> > +    ACPI_MEMCPY(xsdt, table, sizeof(*table));
> > +    acpi_os_unmap_memory(table, sizeof(*table));
> > +    acpi_os_unmap_memory(rsdp, sizeof(*rsdp));
> 
> At this point we're not in SYS_STATE_active yet, and hence there
> can only be one mapping at a time. The way it's written right now
> does not represent an active problem, but to prevent someone
> falling into this trap you should unmap the first mapping before
> establishing the second one.

Done.

> > +    /* Add the custom MADT. */
> > +    j = 0;
> > +    xsdt->table_offset_entry[j++] = madt_addr;
> > +
> > +    /* Copy the address of the rest of the allowed tables. */
> 
> addresses?
> 
> > +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> > +    {
> > +        const char *sig = acpi_gbl_root_table_list.tables[i].signature.ascii;
> > +        unsigned long pfn, nr_pages;
> > +
> > +        if ( !hvm_acpi_table_allowed(sig) )
> > +            continue;
> > +
> > +        /* Make sure table doesn't reside in a RAM region. */
> > +        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> > +        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> > +                                PAGE_SIZE);
> 
> See above (and there appears to be at least one more further down).

Both fixed (and the one further below).

> > +        /* Make sure table is mapped. */
> > +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> > +        if ( rc )
> > +            printk("Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> > +                   pfn, pfn + nr_pages);
> 
> Isn't the comment for this code block meant to go ahead of the earlier
> one, in place of the comment that's there (and looks wrong)?

Yes, thanks for noticing.

> > +        xsdt->table_offset_entry[j++] =
> > +                            acpi_gbl_root_table_list.tables[i].address;
> > +    }
> > +
> > +    xsdt->header.length = size;
> > +    xsdt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, xsdt), size);
> > +
> > +    /* Place the new XSDT in guest memory space. */
> > +    if ( hvm_steal_ram(d, size, GB(4), addr) )
> > +    {
> > +        printk("Unable to find allocate guest RAM for XSDT\n");
> 
> "find" or "allocate"?

Yes, find is what I wanted to write.

> > +        return -ENOMEM;
> > +    }
> > +
> > +    /* Mark this region as E820_ACPI. */
> > +    if ( hvm_add_mem_range(d, *addr, *addr + size, E820_ACPI) )
> > +        printk("Unable to add XSDT region to memory map\n");
> > +
> > +    saved_current = current;
> > +    set_current(v);
> > +    rc = hvm_copy_to_guest_phys(*addr, xsdt, size);
> > +    set_current(saved_current);
> 
> This pattern appears to be recurring - please make a helper function
> (which then also eases possibly addressing my earlier remark
> regarding that playing with current).

Done (in an earlier patch).

> > +    if ( rc != HVMCOPY_okay )
> > +    {
> > +        printk("Unable to copy XSDT into guest memory\n");
> > +        return -EFAULT;
> > +    }
> > +    xfree(xsdt);
> > +
> > +    return 0;
> > +}
> > +
> > +
> > +static int __init hvm_setup_acpi(struct domain *d, paddr_t start_info)
> 
> Only one blank line between functions please.

Done.

> > +{
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    struct acpi_table_rsdp rsdp;
> > +    unsigned long pfn, nr_pages;
> > +    paddr_t madt_paddr, xsdt_paddr, rsdp_paddr;
> > +    unsigned int i;
> > +    int rc;
> > +
> > +    /* Identity map ACPI e820 regions. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        if ( d->arch.e820[i].type != E820_ACPI &&
> > +             d->arch.e820[i].type != E820_NVS )
> > +            continue;
> > +
> > +        pfn = PFN_DOWN(d->arch.e820[i].addr);
> > +        nr_pages = DIV_ROUND_UP(d->arch.e820[i].size, PAGE_SIZE);
> > +
> > +        rc = modify_identity_mmio(d, pfn, nr_pages, true);
> > +        if ( rc )
> > +        {
> > +            printk(
> > +                "Failed to map ACPI region [%#lx, %#lx) into Dom0 memory map\n",
> > +                   pfn, pfn + nr_pages);
> > +            return rc;
> > +        }
> > +    }
> > +
> > +    rc = hvm_setup_acpi_madt(d, &madt_paddr);
> > +    if ( rc )
> > +        return rc;
> > +
> > +    rc = hvm_setup_acpi_xsdt(d, madt_paddr, &xsdt_paddr);
> > +    if ( rc )
> > +        return rc;
> 
> Coming back to the initial comment: If you did the 1:1 mapping last
> and if you added problematic ranges to the E820 map, you wouldn't
> need to call modify_identity_mmio() in two places.

Hm, right. I guess I can change this slightly.

> > +    /* Craft a custom RSDP. */
> > +    memset(&rsdp, 0, sizeof(rsdp));
> > +    memcpy(&rsdp.signature, ACPI_SIG_RSDP, sizeof(rsdp.signature));
> > +    memcpy(&rsdp.oem_id, "XenVMM", sizeof(rsdp.oem_id));
> 
> Is that a good idea? I think Dom0 should get to see the real OEM.

Now that you mention it, there are probably OS-specific hacks to deal with broken 
OEM tables, so I will leave the native OEM in place.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 89+ messages in thread

* Re: [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-11-30 12:40     ` Roger Pau Monne
@ 2016-11-30 14:09       ` Jan Beulich
  2016-11-30 14:23         ` Roger Pau Monne
  0 siblings, 1 reply; 89+ messages in thread
From: Jan Beulich @ 2016-11-30 14:09 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 13:40, <roger.pau@citrix.com> wrote:
> On Mon, Nov 14, 2016 at 09:15:37AM -0700, Jan Beulich wrote:
>> >>> On 29.10.16 at 11:00, <roger.pau@citrix.com> wrote:
>> > Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
>> > p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
>> > that don't reside in RAM regions. This is needed because some memory maps
>> > don't properly account for all the memory used by ACPI, so it's common to
>> > find ACPI tables in holes.
>> 
>> I question whether this behavior should be enabled by default. Not
>> having seen the code yet I also wonder whether these regions
>> shouldn't simply be added to the guest's E820 as E820_ACPI, which
>> should then result in them getting mapped without further special
>> casing.
>> 
>> > +static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
>> > +                                    uint32_t type)
>> 
>> I see s and e being uint64_t, but I don't see why type can't be plain
>> unsigned int.
> 
> Well, that's the type for "type" as defined in e820.h. I'm just using uint32_t 
> for consistency with that.

As said a number of times in various contexts: We should try to
get away from using fixed-width types where we don't really need
them.

>> > +        {
>> > +            d->arch.e820[i].size += e - s;
>> > +            return 0;
>> > +        }
>> > +
>> > +        if ( rs >= e )
>> > +            break;
>> > +
>> > +        if ( re > s )
>> > +            return -ENOMEM;
>> 
>> I don't think ENOMEM is appropriate to signal an overlap. And don't
>> you need to reverse these last two if()s?
> 
> I've changed ENOMEM to EEXIST. Hm, I don't think so; if I reversed those we would 
> get an error when trying to add a non-contiguous region to fill a hole between two 
> existing regions, right?

Looks like I've managed to write something else than I meant. I was
really thinking of

        if ( re > s )
        {
            if ( rs >= e )
                break;
            return -ENOMEM;
        }

But then again I think with things being sorted it may not matter at all.

>> > +    ACPI_MEMCPY(intsrcovr, intr, sizeof(*intr));
>> 
>> Structure assignment (for type safety; also elsewhere)?
> 
> I wasn't sure what to do here, since there's a specific ACPI_MEMCPY function, 
> but I guess this is designed to be used by ACPICA code itself, and ACPI_MEMCPY 
> is just an OS-agnostic wrapper to memcpy.

Indeed.

>> > +    /* Setup the IO APIC entry. */
>> > +    if ( nr_ioapics > 1 )
>> > +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
>> > +               nr_ioapics);
>> 
>> I've said elsewhere already that I think we should provide 1 vIO-APIC
>> per physical one.
> 
> Agree, but the current vIO-APIC is not really up to it. I will work on getting 
> it to support multiple instances.

Until then this should obtain a grep-able "fixme" annotation.

>> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
>> > +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
>> > +    io_apic->header.length = sizeof(*io_apic);
>> > +    io_apic->id = 1;
>> > +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
>> > +
>> > +    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
>> > +    for ( i = 0; i < dom0_max_vcpus(); i++ )
>> > +    {
>> > +        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
>> > +        local_apic->header.length = sizeof(*local_apic);
>> > +        local_apic->processor_id = i;
>> > +        local_apic->id = i * 2;
>> > +        local_apic->lapic_flags = ACPI_MADT_ENABLED;
>> > +        local_apic++;
>> > +    }
>> 
>> What about x2apic? And for lapic, do you limit vCPU count anywhere?
> 
> Yes, there's no x2apic information, I'm currently looking at libacpi in tools, 
> and there doesn't seem to be any local x2apic structure there either. Am I 
> missing something?

I don't think you are.

> Regarding vCPU count, I will limit it to 128.

With it limited there'll be no strict need for x2apic structures. Still
we should get them added eventually.

Jan


* Re: [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-11-30 14:09       ` Jan Beulich
@ 2016-11-30 14:23         ` Roger Pau Monne
  2016-11-30 16:38           ` Jan Beulich
  0 siblings, 1 reply; 89+ messages in thread
From: Roger Pau Monne @ 2016-11-30 14:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Nov 30, 2016 at 07:09:47AM -0700, Jan Beulich wrote:
> >>> On 30.11.16 at 13:40, <roger.pau@citrix.com> wrote:
> > On Mon, Nov 14, 2016 at 09:15:37AM -0700, Jan Beulich wrote:
> >> >>> On 29.10.16 at 11:00, <roger.pau@citrix.com> wrote:
> >> > Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
> >> > p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
> >> > that don't reside in RAM regions. This is needed because some memory maps
> >> > don't properly account for all the memory used by ACPI, so it's common to
> >> > find ACPI tables in holes.
> >> 
> >> I question whether this behavior should be enabled by default. Not
> >> having seen the code yet I also wonder whether these regions
> >> shouldn't simply be added to the guest's E820 as E820_ACPI, which
> >> should then result in them getting mapped without further special
> >> casing.
> >> 
> >> > +static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
> >> > +                                    uint32_t type)
> >> 
> >> I see s and e being uint64_t, but I don't see why type can't be plain
> >> unsigned int.
> > 
> > Well, that's the type for "type" as defined in e820.h. I'm just using uint32_t 
> > for consistency with that.
> 
> As said a number of times in various contexts: We should try to
> get away from using fixed width types where we don't really need
> them.

Done, I've changed it. Would you like me to also change the uint64_t's to 
paddr_t?

> >> > +        {
> >> > +            d->arch.e820[i].size += e - s;
> >> > +            return 0;
> >> > +        }
> >> > +
> >> > +        if ( rs >= e )
> >> > +            break;
> >> > +
> >> > +        if ( re > s )
> >> > +            return -ENOMEM;
> >> 
> >> I don't think ENOMEM is appropriate to signal an overlap. And don't
> >> you need to reverse these last two if()s?
> > 
> > I've changed ENOMEM to EEXIST. Hm, I don't think so; if I reversed those we would 
> > get an error when trying to add a non-contiguous region to fill a hole between two 
> > existing regions, right?
> 
> Looks like I've managed to write something else than I meant. I was
> really thinking of
> 
>         if ( re > s )
>         {
>             if ( rs >= e )
>                 break;
>             return -ENOMEM;
>         }
> 
> But then again I think with things being sorted it may not matter at all.

I slightly prefer the current one since it has fewer nested ifs, but if you have 
a strong preference for the latter I don't really mind changing it.

> >> > +    if ( nr_ioapics > 1 )
> >> > +        printk("WARNING: found %d IO APICs, Dom0 will only have access to 1 emulated IO APIC\n",
> >> > +               nr_ioapics);
> >> 
> >> I've said elsewhere already that I think we should provide 1 vIO-APIC
> >> per physical one.
> > 
> > Agree, but the current vIO-APIC is not really up to it. I will work on getting 
> > it to support multiple instances.
> 
> Until then this should obtain a grep-able "fixme" annotation.

Oh, right (you said that several times, sorry).
 
Roger.


* Re: [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-11-30 14:23         ` Roger Pau Monne
@ 2016-11-30 16:38           ` Jan Beulich
  0 siblings, 0 replies; 89+ messages in thread
From: Jan Beulich @ 2016-11-30 16:38 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.11.16 at 15:23, <roger.pau@citrix.com> wrote:
> On Wed, Nov 30, 2016 at 07:09:47AM -0700, Jan Beulich wrote:
>> >>> On 30.11.16 at 13:40, <roger.pau@citrix.com> wrote:
>> > On Mon, Nov 14, 2016 at 09:15:37AM -0700, Jan Beulich wrote:
>> >> >>> On 29.10.16 at 11:00, <roger.pau@citrix.com> wrote:
>> >> > Also, regions marked as E820_ACPI or E820_NVS are identity mapped into Dom0
>> >> > p2m, plus any top-level ACPI tables that should be accessible to Dom0 and
>> >> > that don't reside in RAM regions. This is needed because some memory maps
>> >> > don't properly account for all the memory used by ACPI, so it's common to
>> >> > find ACPI tables in holes.
>> >> 
>> >> I question whether this behavior should be enabled by default. Not
>> >> having seen the code yet I also wonder whether these regions
>> >> shouldn't simply be added to the guest's E820 as E820_ACPI, which
>> >> should then result in them getting mapped without further special
>> >> casing.
>> >> 
>> >> > +static int __init hvm_add_mem_range(struct domain *d, uint64_t s, uint64_t e,
>> >> > +                                    uint32_t type)
>> >> 
>> >> I see s and e being uint64_t, but I don't see why type can't be plain
>> >> unsigned int.
>> > 
>> > Well, that's the type for "type" as defined in e820.h. I'm just using uint32_t 
>> > for consistency with that.
>> 
>> As said a number of times in various contexts: We should try to
>> get away from using fixed width types where we don't really need
>> them.
> 
> Done, I've changed it. Would you like me to also change the uint64_t's to 
> paddr_t?

To me paddr_t is not better or worse than uint64_t, perhaps with
the slight exception that in a (very old) non-PAE 32-bit build
paddr_t would have been actively wrong.

Jan



end of thread, other threads:[~2016-11-30 16:38 UTC | newest]

Thread overview: 89+ messages
2016-10-29  8:59 [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Roger Pau Monne
2016-10-29  8:59 ` [PATCH v3.1 01/15] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
2016-10-31 16:32   ` Jan Beulich
2016-11-03 12:35     ` Roger Pau Monne
2016-11-03 12:52       ` Jan Beulich
2016-11-03 14:25         ` Konrad Rzeszutek Wilk
2016-11-03 15:05         ` Roger Pau Monne
2016-11-03 14:22       ` Konrad Rzeszutek Wilk
2016-11-03 15:01         ` Roger Pau Monne
2016-11-03 15:43         ` Roger Pau Monne
2016-10-29  8:59 ` [PATCH v3.1 02/15] xen/x86: fix return value of *_set_allocation functions Roger Pau Monne
2016-10-29 22:11   ` Tim Deegan
2016-10-29  8:59 ` [PATCH v3.1 03/15] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
2016-10-31 16:34   ` Jan Beulich
2016-11-01 10:45     ` Tim Deegan
2016-11-02 17:14       ` Roger Pau Monne
2016-11-03 10:20         ` Roger Pau Monne
2016-11-03 10:33           ` Tim Deegan
2016-11-03 11:31           ` Jan Beulich
2016-10-29  8:59 ` [PATCH v3.1 04/15] xen/x86: assert that local_events_need_delivery is not called by " Roger Pau Monne
2016-10-31 16:37   ` Jan Beulich
2016-10-29  8:59 ` [PATCH v3.1 05/15] x86/paging: introduce paging_set_allocation Roger Pau Monne
2016-10-31 16:42   ` Jan Beulich
2016-11-01 10:29     ` Tim Deegan
2016-10-29  8:59 ` [PATCH v3.1 06/15] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
2016-10-31 16:44   ` Jan Beulich
2016-10-29  8:59 ` [PATCH v3.1 07/15] xen/x86: do the PCI scan unconditionally Roger Pau Monne
2016-10-31 16:47   ` Jan Beulich
2016-11-03 10:58     ` Roger Pau Monne
2016-11-03 11:35       ` Jan Beulich
2016-11-03 11:54         ` Boris Ostrovsky
2016-11-29 12:33           ` Roger Pau Monne
2016-11-29 12:47             ` Jan Beulich
2016-11-29 12:57               ` Roger Pau Monne
2016-11-30  5:53                 ` Tian, Kevin
2016-11-30  9:02                   ` Jan Beulich
2016-10-29  8:59 ` [PATCH v3.1 08/15] x86/vtd: fix mapping of RMRR regions Roger Pau Monne
2016-11-04  9:16   ` Jan Beulich
2016-11-04  9:45     ` Roger Pau Monne
2016-11-04 10:34       ` Jan Beulich
2016-11-04 12:25         ` Roger Pau Monne
2016-11-04 12:53           ` Jan Beulich
2016-11-04 13:03             ` Roger Pau Monne
2016-11-04 13:16               ` Jan Beulich
2016-11-04 15:33                 ` Roger Pau Monne
2016-11-04 16:13                   ` Jan Beulich
2016-11-04 16:19                     ` Roger Pau Monne
2016-11-04 17:08                       ` Jan Beulich
2016-11-04 17:25                         ` Roger Pau Monne
2016-11-07  8:36                           ` Jan Beulich
2016-10-29  8:59 ` [PATCH v3.1 09/15] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
2016-11-04  9:19   ` Jan Beulich
2016-11-04  9:47     ` Roger Pau Monne
2016-11-04 10:21       ` Jan Beulich
2016-11-04 12:09         ` Roger Pau Monne
2016-11-04 12:50           ` Jan Beulich
2016-11-04 13:06             ` Roger Pau Monne
2016-10-29  8:59 ` [PATCH v3.1 10/15] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
2016-11-11 16:53   ` Jan Beulich
2016-11-16 18:02     ` Roger Pau Monne
2016-11-17 10:49       ` Jan Beulich
2016-11-28 17:49         ` Roger Pau Monne
2016-11-29  9:34           ` Jan Beulich
2016-10-29  8:59 ` [PATCH v3.1 11/15] xen/mm: introduce a function to map large chunks of MMIO Roger Pau Monne
2016-11-11 16:58   ` Jan Beulich
2016-11-29 12:41     ` Roger Pau Monne
2016-11-29 13:00       ` Jan Beulich
2016-11-29 15:32         ` Roger Pau Monne
2016-11-11 20:17   ` Konrad Rzeszutek Wilk
2016-10-29  8:59 ` [PATCH v3.1 12/15] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
2016-11-11 17:16   ` Jan Beulich
2016-11-28 11:26     ` Roger Pau Monne
2016-11-28 11:41       ` Jan Beulich
2016-11-28 13:30         ` Roger Pau Monne
2016-11-28 13:49           ` Jan Beulich
2016-11-28 16:02             ` Roger Pau Monne
2016-10-29  8:59 ` [PATCH v3.1 13/15] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
2016-11-11 20:30   ` Konrad Rzeszutek Wilk
2016-11-28 12:14     ` Roger Pau Monne
2016-10-29  9:00 ` [PATCH v3.1 14/15] xen/x86: hack to setup PVHv2 Dom0 CPUs Roger Pau Monne
2016-10-29  9:00 ` [PATCH v3.1 15/15] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
2016-11-14 16:15   ` Jan Beulich
2016-11-30 12:40     ` Roger Pau Monne
2016-11-30 14:09       ` Jan Beulich
2016-11-30 14:23         ` Roger Pau Monne
2016-11-30 16:38           ` Jan Beulich
2016-10-31 14:35 ` [PATCH v3.1 00/15] Initial PVHv2 Dom0 support Boris Ostrovsky
2016-10-31 14:43   ` Andrew Cooper
2016-10-31 16:35     ` Roger Pau Monne
