* [PATCH v2 00/30] PVHv2 Dom0
From: Roger Pau Monne @ 2016-09-27 15:56 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky

Hello,

This is the first "complete" PVHv2 implementation, in the sense that it has
feature parity with the classic PVH Dom0. It is still very experimental, but
I've managed to boot it on all the Intel boxes I've tried (OK, only 3 so
far). I've also tried it on an AMD box, but sadly the ACPI tables there
didn't contain any RMRR regions, and the IOMMU started spitting out a bunch
of page faults caused by devices trying to access memory regions marked as
reserved in the e820.

Here is the list of most relevant changes compared to classic PVH or PV:

 - An emulated local APIC (one per vCPU) and an emulated IO APIC are provided
   to Dom0.
 - The MADT has been replaced in order to reflect the real topology seen by
   the guest. This needs a little more work so that the new MADT is not
   placed over the old one.
 - BARs of PCI devices are automatically mapped by Xen into Dom0's memory
   space. There is currently no support for changing the position of the
   BARs, and Xen will just crash the domain if this is attempted.
 - Interrupts from physical devices are configured and routed using the
   native mechanisms; PIRQs are not available to this PVH Dom0 implementation
   at all. Xen will automatically detect the features of the physical devices
   available to Dom0, and will set up traps/handlers in order to intercept
   all relevant configuration accesses.
 - Access to the PCIe memory-mapped configuration areas is also trapped by
   Xen, so configuration accesses can be done using either the IO ports or
   the memory-mapped configuration areas, as supported by each device.
 - Some ACPI tables are zapped (their signatures are inverted) to prevent
   Dom0 from poking at them; those are: HPET, SLIT, SRAT, MPST and PMTT. A
   minimal sketch of the signature inversion follows this list.
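
The signature zapping is mechanical; the following is a minimal sketch
(illustrative only, not code from this series — the helper name and the
assumption that Xen holds a writable mapping of the table header are mine):

    /* Hide an ACPI table from Dom0 by inverting its signature bytes.
     * The checksum is left stale, which is harmless because the mangled
     * signature no longer matches anything Dom0's ACPI parser looks for. */
    static void zap_acpi_table(struct acpi_table_header *header)
    {
        unsigned int i;

        for ( i = 0; i < ACPI_NAME_SIZE; i++ )
            header->signature[i] = ~header->signature[i];
    }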

I know this series is quite big, but I think some of the initial changes can
go in as bugfixes, which would help me reduce the size.

Thanks, Roger.

Changes in this series:
 docs/misc/printk-formats.txt                |    5 +
 docs/misc/xen-command-line.markdown         |   15 +
 tools/firmware/hvmloader/hvmloader.c        |   17 -
 tools/libxc/include/xc_dom.h                |    2 +-
 tools/libxc/xc_dom_x86.c                    |   16 +
 xen/arch/arm/domain.c                       |    2 +-
 xen/arch/arm/domain_build.c                 |   16 +-
 xen/arch/arm/kernel.c                       |    4 +-
 xen/arch/arm/percpu.c                       |    3 +-
 xen/arch/x86/domain.c                       |   30 +-
 xen/arch/x86/domain_build.c                 | 1003 +++++++++++++++++++++++---
 xen/arch/x86/e820.c                         |    2 +-
 xen/arch/x86/hvm/hvm.c                      |   26 +-
 xen/arch/x86/hvm/io.c                       |  921 +++++++++++++++++++++++-
 xen/arch/x86/hvm/ioreq.c                    |   11 +
 xen/arch/x86/hvm/irq.c                      |    9 +
 xen/arch/x86/hvm/svm/nestedsvm.c            |    8 +-
 xen/arch/x86/hvm/svm/vmcb.c                 |    5 +-
 xen/arch/x86/hvm/vioapic.c                  |   28 +-
 xen/arch/x86/hvm/vmsi.c                     | 1036 +++++++++++++++++++++++++++
 xen/arch/x86/mm/hap/hap.c                   |   22 +-
 xen/arch/x86/mm/p2m.c                       |   88 +--
 xen/arch/x86/mm/paging.c                    |   16 +
 xen/arch/x86/mm/shadow/common.c             |   10 +-
 xen/arch/x86/percpu.c                       |    3 +-
 xen/arch/x86/physdev.c                      |    9 +-
 xen/arch/x86/setup.c                        |   11 +
 xen/arch/x86/smpboot.c                      |    4 +-
 xen/arch/x86/traps.c                        |   39 -
 xen/common/kernel.c                         |    3 +-
 xen/common/kexec.c                          |    2 +-
 xen/common/page_alloc.c                     |    2 +-
 xen/common/tmem_xen.c                       |    2 +-
 xen/common/vsprintf.c                       |   15 +
 xen/common/xmalloc_tlsf.c                   |    6 +-
 xen/drivers/char/console.c                  |    6 +-
 xen/drivers/char/serial.c                   |    2 +-
 xen/drivers/passthrough/amd/iommu_init.c    |   17 +-
 xen/drivers/passthrough/amd/pci_amd_iommu.c |    3 +-
 xen/drivers/passthrough/io.c                |  148 +++-
 xen/drivers/passthrough/pci.c               |  438 ++++++++++-
 xen/drivers/passthrough/vtd/iommu.c         |   23 +-
 xen/include/asm-x86/domain.h                |   12 +-
 xen/include/asm-x86/e820.h                  |    1 +
 xen/include/asm-x86/event.h                 |    3 +
 xen/include/asm-x86/flushtlb.h              |    2 +-
 xen/include/asm-x86/hap.h                   |    2 +-
 xen/include/asm-x86/hvm/domain.h            |    4 +
 xen/include/asm-x86/hvm/io.h                |  251 +++++++
 xen/include/asm-x86/hvm/ioreq.h             |    1 +
 xen/include/asm-x86/irq.h                   |    5 +
 xen/include/asm-x86/msi.h                   |   34 +
 xen/include/asm-x86/p2m.h                   |    5 -
 xen/include/asm-x86/paging.h                |    7 +
 xen/include/asm-x86/shadow.h                |    8 +
 xen/include/public/arch-x86/xen.h           |    4 +-
 xen/include/xen/hvm/irq.h                   |    4 +
 xen/include/xen/iommu.h                     |    1 +
 xen/include/xen/mm.h                        |   12 +-
 xen/include/xen/p2m-common.h                |   30 +
 xen/include/xen/pci.h                       |   10 +
 xen/include/xen/pci_regs.h                  |    4 +
 62 files changed, 4031 insertions(+), 397 deletions(-)


* [PATCH v2 01/30] xen/x86: move setup of the VM86 TSS to the domain builder
From: Roger Pau Monne @ 2016-09-27 15:56 UTC (permalink / raw)
  To: xen-devel
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

This is also required for PVHv2 guests if they want to use real mode, and
hvmloader is not executed for that kind of guest.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 tools/firmware/hvmloader/hvmloader.c | 17 -----------------
 tools/libxc/include/xc_dom.h         |  2 +-
 tools/libxc/xc_dom_x86.c             | 16 ++++++++++++++++
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/tools/firmware/hvmloader/hvmloader.c b/tools/firmware/hvmloader/hvmloader.c
index bbd4e34..9eabbd8 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -176,21 +176,6 @@ static void cmos_write_memory_size(void)
     cmos_outb(0x35, (uint8_t)( alt_mem >> 8));
 }
 
-/*
- * Set up an empty TSS area for virtual 8086 mode to use. 
- * The only important thing is that it musn't have any bits set 
- * in the interrupt redirection bitmap, so all zeros will do.
- */
-static void init_vm86_tss(void)
-{
-    void *tss;
-
-    tss = mem_alloc(128, 128);
-    memset(tss, 0, 128);
-    hvm_param_set(HVM_PARAM_VM86_TSS, virt_to_phys(tss));
-    printf("vm86 TSS at %08lx\n", virt_to_phys(tss));
-}
-
 static void apic_setup(void)
 {
     /*
@@ -398,8 +383,6 @@ int main(void)
         hvm_param_set(HVM_PARAM_ACPI_IOPORTS_LOCATION, 1);
     }
 
-    init_vm86_tss();
-
     cmos_write_memory_size();
 
     printf("BIOS map:\n");
diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index de7dca9..e1cfaad 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -20,7 +20,7 @@
 #include <xenguest.h>
 
 #define INVALID_PFN ((xen_pfn_t)-1)
-#define X86_HVM_NR_SPECIAL_PAGES    8
+#define X86_HVM_NR_SPECIAL_PAGES    9
 #define X86_HVM_END_SPECIAL_REGION  0xff000u
 
 /* --- typedefs and structs ---------------------------------------- */
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index 0eab8a7..1676a3c 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -59,6 +59,7 @@
 #define SPECIALPAGE_IOREQ    5
 #define SPECIALPAGE_IDENT_PT 6
 #define SPECIALPAGE_CONSOLE  7
+#define SPECIALPAGE_VM86TSS  8
 #define special_pfn(x) \
     (X86_HVM_END_SPECIAL_REGION - X86_HVM_NR_SPECIAL_PAGES + (x))
 
@@ -590,6 +591,7 @@ static int alloc_magic_pages_hvm(struct xc_dom_image *dom)
 {
     unsigned long i;
     uint32_t *ident_pt, domid = dom->guest_domid;
+    void *tss;
     int rc;
     xen_pfn_t special_array[X86_HVM_NR_SPECIAL_PAGES];
     xen_pfn_t ioreq_server_array[NR_IOREQ_SERVER_PAGES];
@@ -699,6 +701,20 @@ static int alloc_magic_pages_hvm(struct xc_dom_image *dom)
     xc_hvm_param_set(xch, domid, HVM_PARAM_IDENT_PT,
                      special_pfn(SPECIALPAGE_IDENT_PT) << PAGE_SHIFT);
 
+    /*
+     * Set up an empty TSS area for virtual 8086 mode to use.
+     * The only important thing is that it mustn't have any bits set
+     * in the interrupt redirection bitmap, so all zeros will do.
+     */
+    if ( (tss = xc_map_foreign_range(
+              xch, domid, PAGE_SIZE, PROT_READ | PROT_WRITE,
+              special_pfn(SPECIALPAGE_VM86TSS))) == NULL )
+        goto error_out;
+    memset(tss, 0, 128);
+    munmap(tss, PAGE_SIZE);
+    xc_hvm_param_set(xch, domid, HVM_PARAM_VM86_TSS,
+                     special_pfn(SPECIALPAGE_VM86TSS) << PAGE_SHIFT);
+
     dom->console_pfn = special_pfn(SPECIALPAGE_CONSOLE);
     dom->xenstore_pfn = special_pfn(SPECIALPAGE_XENSTORE);
     dom->parms.virt_hypercall = -1;
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 02/30] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
From: Roger Pau Monne @ 2016-09-27 15:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

On PVHv2 guests we explicitly want to use native methods for routing
interrupts.

Introduce a new XEN_X86_EMU_USE_PIRQ flag to notify Xen whether an HVM guest
can route physical interrupts (even from emulated devices) over event
channels.
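
For illustration, a toolstack that wants an HVM guest to route interrupts
natively would clear the new flag in the domain config (a hedged sketch; the
surrounding domain-creation plumbing is elided and not part of this patch):

    struct xen_arch_domainconfig config = {
        /* Keep all emulated devices, but route physical interrupts
         * natively rather than over event channels. */
        .emulation_flags = XEN_X86_EMU_ALL & ~XEN_X86_EMU_USE_PIRQ,
    };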

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/hvm.c            | 23 +++++++++++++++--------
 xen/arch/x86/physdev.c            |  5 +++--
 xen/common/kernel.c               |  3 ++-
 xen/include/public/arch-x86/xen.h |  4 +++-
 4 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 7bad845..a291f82 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4117,6 +4117,8 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 
 static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *d = current->domain;
+
     switch ( cmd )
     {
     default:
@@ -4128,7 +4130,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
     case PHYSDEVOP_eoi:
     case PHYSDEVOP_irq_status_query:
     case PHYSDEVOP_get_free_pirq:
-        return do_physdev_op(cmd, arg);
+        return ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
+               is_pvh_vcpu(current)) ?
+                    do_physdev_op(cmd, arg) : -ENOSYS;
     }
 }
 
@@ -4161,17 +4165,20 @@ static long hvm_memory_op_compat32(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 static long hvm_physdev_op_compat32(
     int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
+    struct domain *d = current->domain;
+
     switch ( cmd )
     {
-        case PHYSDEVOP_map_pirq:
-        case PHYSDEVOP_unmap_pirq:
-        case PHYSDEVOP_eoi:
-        case PHYSDEVOP_irq_status_query:
-        case PHYSDEVOP_get_free_pirq:
-            return compat_physdev_op(cmd, arg);
+    case PHYSDEVOP_map_pirq:
+    case PHYSDEVOP_unmap_pirq:
+    case PHYSDEVOP_eoi:
+    case PHYSDEVOP_irq_status_query:
+    case PHYSDEVOP_get_free_pirq:
+        return (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                    compat_physdev_op(cmd, arg) : -ENOSYS;
         break;
     default:
-            return -ENOSYS;
+        return -ENOSYS;
         break;
     }
 }
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 5a49796..0bea6e1 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -94,7 +94,8 @@ int physdev_map_pirq(domid_t domid, int type, int *index, int *pirq_p,
     int pirq, irq, ret = 0;
     void *map_data = NULL;
 
-    if ( domid == DOMID_SELF && is_hvm_domain(d) )
+    if ( domid == DOMID_SELF && is_hvm_domain(d) &&
+         (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         /*
          * Only makes sense for vector-based callback, else HVM-IRQ logic
@@ -265,7 +266,7 @@ int physdev_unmap_pirq(domid_t domid, int pirq)
     if ( ret )
         goto free_domain;
 
-    if ( is_hvm_domain(d) )
+    if ( is_hvm_domain(d) && (d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) )
     {
         spin_lock(&d->event_lock);
         if ( domain_pirq_to_emuirq(d, pirq) != IRQ_UNBOUND )
diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index d0edb13..a82f55f 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -332,7 +332,8 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             case guest_type_hvm:
                 fi.submap |= (1U << XENFEAT_hvm_safe_pvclock) |
                              (1U << XENFEAT_hvm_callback_vector) |
-                             (1U << XENFEAT_hvm_pirqs);
+                             ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ?
+                                 (1U << XENFEAT_hvm_pirqs) : 0);
                 break;
             }
 #endif
diff --git a/xen/include/public/arch-x86/xen.h b/xen/include/public/arch-x86/xen.h
index cdd93c1..da6f4f2 100644
--- a/xen/include/public/arch-x86/xen.h
+++ b/xen/include/public/arch-x86/xen.h
@@ -283,12 +283,14 @@ struct xen_arch_domainconfig {
 #define XEN_X86_EMU_IOMMU           (1U<<_XEN_X86_EMU_IOMMU)
 #define _XEN_X86_EMU_PIT            8
 #define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
+#define _XEN_X86_EMU_USE_PIRQ       9
+#define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
 
 #define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
                                      XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
                                      XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
                                      XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
-                                     XEN_X86_EMU_PIT)
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ)
     uint32_t emulation_flags;
 };
 #endif
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
From: Roger Pau Monne @ 2016-09-27 15:56 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

The return value should be an int, and the number of pages should be an
unsigned long.
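
To illustrate the truncation hazard with the previous unsigned int
parameter (example values only, not code from the tree):

    unsigned long pages = 1UL << 32; /* 16 TiB worth of 4 KiB pages */
    unsigned int narrow = pages;     /* silently truncates to 0 */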

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
 xen/arch/x86/mm/hap/hap.c       |  6 +++---
 xen/arch/x86/mm/shadow/common.c |  7 +++----
 xen/include/asm-x86/domain.h    | 12 ++++++------
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 3218fa2..b6d2c61 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -325,7 +325,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
 static unsigned int
 hap_get_allocation(struct domain *d)
 {
-    unsigned int pg = d->arch.paging.hap.total_pages
+    unsigned long pg = d->arch.paging.hap.total_pages
         + d->arch.paging.hap.p2m_pages;
 
     return ((pg >> (20 - PAGE_SHIFT))
@@ -334,8 +334,8 @@ hap_get_allocation(struct domain *d)
 
 /* Set the pool of pages to the required number of pages.
  * Returns 0 for success, non-zero for failure. */
-static unsigned int
-hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
+static int
+hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
 {
     struct page_info *pg;
 
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index 21607bf..d3cc2cc 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1613,9 +1613,8 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
  * Input will be rounded up to at least shadow_min_acceptable_pages(),
  * plus space for the p2m table.
  * Returns 0 for success, non-zero for failure. */
-static unsigned int sh_set_allocation(struct domain *d,
-                                      unsigned int pages,
-                                      int *preempted)
+static int sh_set_allocation(struct domain *d, unsigned long pages,
+                             int *preempted)
 {
     struct page_info *sp;
     unsigned int lower_bound;
@@ -1692,7 +1691,7 @@ static unsigned int sh_set_allocation(struct domain *d,
 /* Return the size of the shadow pool, rounded up to the nearest MB */
 static unsigned int shadow_get_allocation(struct domain *d)
 {
-    unsigned int pg = d->arch.paging.shadow.total_pages
+    unsigned long pg = d->arch.paging.shadow.total_pages
         + d->arch.paging.shadow.p2m_pages;
     return ((pg >> (20 - PAGE_SHIFT))
             + ((pg & ((1 << (20 - PAGE_SHIFT)) - 1)) ? 1 : 0));
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 5807a1f..11ac2a5 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -93,9 +93,9 @@ struct shadow_domain {
 
     /* Memory allocation */
     struct page_list_head freelist;
-    unsigned int      total_pages;  /* number of pages allocated */
-    unsigned int      free_pages;   /* number of pages on freelists */
-    unsigned int      p2m_pages;    /* number of pages allocates to p2m */
+    unsigned long      total_pages;  /* number of pages allocated */
+    unsigned long      free_pages;   /* number of pages on freelists */
+    unsigned long      p2m_pages;    /* number of pages allocated to p2m */
 
     /* 1-to-1 map for use when HVM vcpus have paging disabled */
     pagetable_t unpaged_pagetable;
@@ -155,9 +155,9 @@ struct shadow_vcpu {
 /************************************************/
 struct hap_domain {
     struct page_list_head freelist;
-    unsigned int      total_pages;  /* number of pages allocated */
-    unsigned int      free_pages;   /* number of pages on freelists */
-    unsigned int      p2m_pages;    /* number of pages allocates to p2m */
+    unsigned long      total_pages;  /* number of pages allocated */
+    unsigned long      free_pages;   /* number of pages on freelists */
+    unsigned long      p2m_pages;    /* number of pages allocated to p2m */
 };
 
 /************************************************/
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
From: Roger Pau Monne @ 2016-09-27 15:56 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

... and using the "preempted" parameter there. Since hypercall_preempt_check
ends up calling local_events_need_delivery, which is meaningless for the idle
domain (see the next patch), the solution relies on checking softirq_pending
instead when the current vCPU belongs to the idle domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/mm/hap/hap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index b6d2c61..2dc82f5 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
             break;
 
         /* Check to see if we need to yield and try again */
-        if ( preempted && hypercall_preempt_check() )
+        if ( preempted &&
+             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
+                                      hypercall_preempt_check()) )
         {
             *preempted = 1;
             return 0;
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 05/30] xen/x86: assert that local_events_need_delivery is not called by the idle domain
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Calling it for the idle domain doesn't make sense, since the idle domain
doesn't receive any events.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/include/asm-x86/event.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/include/asm-x86/event.h b/xen/include/asm-x86/event.h
index a82062e..d589d6f 100644
--- a/xen/include/asm-x86/event.h
+++ b/xen/include/asm-x86/event.h
@@ -23,6 +23,9 @@ int hvm_local_events_need_delivery(struct vcpu *v);
 static inline int local_events_need_delivery(void)
 {
     struct vcpu *v = current;
+
+    ASSERT(!is_idle_vcpu(v));
+
     return (has_hvm_container_vcpu(v) ? hvm_local_events_need_delivery(v) :
             (vcpu_info(v, evtchn_upcall_pending) &&
              !vcpu_info(v, evtchn_upcall_mask)));
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 06/30] x86/paging: introduce paging_set_allocation
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

... and remove hap_set_alloc_for_pvh_dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Tim Deegan <tim@xen.org>
---
Changes since RFC:
 - Make paging_set_allocation preemptible.
 - Move comments.
---
 xen/arch/x86/domain_build.c     | 17 +++++++++++++----
 xen/arch/x86/mm/hap/hap.c       | 14 +-------------
 xen/arch/x86/mm/paging.c        | 16 ++++++++++++++++
 xen/arch/x86/mm/shadow/common.c |  7 +------
 xen/include/asm-x86/hap.h       |  2 +-
 xen/include/asm-x86/paging.h    |  7 +++++++
 xen/include/asm-x86/shadow.h    |  8 ++++++++
 7 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 0a02d65..04d6cb0 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -35,7 +35,6 @@
 #include <asm/setup.h>
 #include <asm/bzimage.h> /* for bzimage_parse */
 #include <asm/io_apic.h>
-#include <asm/hap.h>
 #include <asm/hpet.h>
 
 #include <public/version.h>
@@ -1383,15 +1382,25 @@ int __init construct_dom0(
                          nr_pages);
     }
 
-    if ( is_pvh_domain(d) )
-        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
-
     /*
      * We enable paging mode again so guest_physmap_add_page will do the
      * right thing for us.
      */
     d->arch.paging.mode = save_pvh_pg_mode;
 
+    if ( is_pvh_domain(d) )
+    {
+        int preempted;
+
+        do {
+            preempted = 0;
+            paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                                  &preempted);
+            process_pending_softirqs();
+        } while ( preempted );
+    }
+
     /* Write the phys->machine and machine->phys table entries. */
     for ( pfn = 0; pfn < count; pfn++ )
     {
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index 2dc82f5..4420e4e 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -334,7 +334,7 @@ hap_get_allocation(struct domain *d)
 
 /* Set the pool of pages to the required number of pages.
  * Returns 0 for success, non-zero for failure. */
-static int
+int
 hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
 {
     struct page_info *pg;
@@ -640,18 +640,6 @@ int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
     }
 }
 
-void __init hap_set_alloc_for_pvh_dom0(struct domain *d,
-                                       unsigned long hap_pages)
-{
-    int rc;
-
-    paging_lock(d);
-    rc = hap_set_allocation(d, hap_pages, NULL);
-    paging_unlock(d);
-
-    BUG_ON(rc);
-}
-
 static const struct paging_mode hap_paging_real_mode;
 static const struct paging_mode hap_paging_protected_mode;
 static const struct paging_mode hap_paging_pae_mode;
diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
index cc44682..2717bd3 100644
--- a/xen/arch/x86/mm/paging.c
+++ b/xen/arch/x86/mm/paging.c
@@ -954,6 +954,22 @@ void paging_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
         safe_write_pte(p, new);
 }
 
+int paging_set_allocation(struct domain *d, unsigned long pages, int *preempted)
+{
+    int rc;
+
+    ASSERT(paging_mode_enabled(d));
+
+    paging_lock(d);
+    if ( hap_enabled(d) )
+        rc = hap_set_allocation(d, pages, preempted);
+    else
+        rc = sh_set_allocation(d, pages, preempted);
+    paging_unlock(d);
+
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
index d3cc2cc..53ffe1a 100644
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1609,12 +1609,7 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
     paging_unlock(d);
 }
 
-/* Set the pool of shadow pages to the required number of pages.
- * Input will be rounded up to at least shadow_min_acceptable_pages(),
- * plus space for the p2m table.
- * Returns 0 for success, non-zero for failure. */
-static int sh_set_allocation(struct domain *d, unsigned long pages,
-                             int *preempted)
+int sh_set_allocation(struct domain *d, unsigned long pages, int *preempted)
 {
     struct page_info *sp;
     unsigned int lower_bound;
diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
index c613836..9d59430 100644
--- a/xen/include/asm-x86/hap.h
+++ b/xen/include/asm-x86/hap.h
@@ -46,7 +46,7 @@ int   hap_track_dirty_vram(struct domain *d,
                            XEN_GUEST_HANDLE_64(uint8) dirty_bitmap);
 
 extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
-void hap_set_alloc_for_pvh_dom0(struct domain *d, unsigned long num_pages);
+int hap_set_allocation(struct domain *d, unsigned long pages, int *preempted);
 
 #endif /* XEN_HAP_H */
 
diff --git a/xen/include/asm-x86/paging.h b/xen/include/asm-x86/paging.h
index 56eef6b..c2d60d3 100644
--- a/xen/include/asm-x86/paging.h
+++ b/xen/include/asm-x86/paging.h
@@ -347,6 +347,13 @@ void pagetable_dying(struct domain *d, paddr_t gpa);
 void paging_dump_domain_info(struct domain *d);
 void paging_dump_vcpu_info(struct vcpu *v);
 
+/* Set the pool of shadow pages to the required number of pages.
+ * Input might be rounded up to a minimum number of pages, plus
+ * space for the p2m table.
+ * Returns 0 for success, non-zero for failure. */
+int paging_set_allocation(struct domain *d, unsigned long pages,
+                          int *preempted);
+
 #endif /* XEN_PAGING_H */
 
 /*
diff --git a/xen/include/asm-x86/shadow.h b/xen/include/asm-x86/shadow.h
index 6d0aefb..f0e2227 100644
--- a/xen/include/asm-x86/shadow.h
+++ b/xen/include/asm-x86/shadow.h
@@ -83,6 +83,12 @@ void sh_remove_shadows(struct domain *d, mfn_t gmfn, int fast, int all);
 /* Discard _all_ mappings from the domain's shadows. */
 void shadow_blow_tables_per_domain(struct domain *d);
 
+/* Set the pool of shadow pages to the required number of pages.
+ * Input will be rounded up to at least shadow_min_acceptable_pages(),
+ * plus space for the p2m table.
+ * Returns 0 for success, non-zero for failure. */
+int sh_set_allocation(struct domain *d, unsigned long pages, int *preempted);
+
 #else /* !CONFIG_SHADOW_PAGING */
 
 #define shadow_teardown(d, p) ASSERT(is_pv_domain(d))
@@ -91,6 +97,8 @@ void shadow_blow_tables_per_domain(struct domain *d);
     ({ ASSERT(is_pv_domain(d)); -EOPNOTSUPP; })
 #define shadow_track_dirty_vram(d, begin_pfn, nr, bitmap) \
     ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
+#define sh_set_allocation(d, pages, preempted) \
+    ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
 
 static inline void sh_remove_shadows(struct domain *d, mfn_t gmfn,
                                      bool_t fast, bool_t all) {}
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 07/30] xen/x86: split the setup of Dom0 permissions to a function
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

So that it can also be used by the PVH-specific domain builder. This is just
code motion; it should not introduce any functional change.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_build.c | 164 +++++++++++++++++++++++---------------------
 1 file changed, 86 insertions(+), 78 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 04d6cb0..ffd0521 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -869,6 +869,89 @@ static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
     unmap_domain_page(l4start);
 }
 
+static int __init setup_permissions(struct domain *d)
+{
+    unsigned long mfn;
+    int i, rc = 0;
+
+    /* The hardware domain is initially permitted full I/O capabilities. */
+    rc |= ioports_permit_access(d, 0, 0xFFFF);
+    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
+    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
+
+    /*
+     * Modify I/O port access permissions.
+     */
+    /* Master Interrupt Controller (PIC). */
+    rc |= ioports_deny_access(d, 0x20, 0x21);
+    /* Slave Interrupt Controller (PIC). */
+    rc |= ioports_deny_access(d, 0xA0, 0xA1);
+    /* Interval Timer (PIT). */
+    rc |= ioports_deny_access(d, 0x40, 0x43);
+    /* PIT Channel 2 / PC Speaker Control. */
+    rc |= ioports_deny_access(d, 0x61, 0x61);
+    /* ACPI PM Timer. */
+    if ( pmtmr_ioport )
+        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
+    /* PCI configuration space (NB. 0xcf8 has special treatment). */
+    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
+    /* Command-line I/O ranges. */
+    process_dom0_ioports_disable(d);
+
+    /*
+     * Modify I/O memory access permissions.
+     */
+    /* Local APIC. */
+    if ( mp_lapic_addr != 0 )
+    {
+        mfn = paddr_to_pfn(mp_lapic_addr);
+        rc |= iomem_deny_access(d, mfn, mfn);
+    }
+    /* I/O APICs. */
+    for ( i = 0; i < nr_ioapics; i++ )
+    {
+        mfn = paddr_to_pfn(mp_ioapics[i].mpc_apicaddr);
+        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
+            rc |= iomem_deny_access(d, mfn, mfn);
+    }
+    /* MSI range. */
+    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
+                            paddr_to_pfn(MSI_ADDR_BASE_LO +
+                                         MSI_ADDR_DEST_ID_MASK));
+    /* HyperTransport range. */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
+                                paddr_to_pfn((1ULL << 40) - 1));
+
+    /* Remove access to E820_UNUSABLE I/O regions above 1MB. */
+    for ( i = 0; i < e820.nr_map; i++ )
+    {
+        unsigned long sfn, efn;
+        sfn = max_t(unsigned long, paddr_to_pfn(e820.map[i].addr), 0x100ul);
+        efn = paddr_to_pfn(e820.map[i].addr + e820.map[i].size - 1);
+        if ( (e820.map[i].type == E820_UNUSABLE) &&
+             (e820.map[i].size != 0) &&
+             (sfn <= efn) )
+            rc |= iomem_deny_access(d, sfn, efn);
+    }
+
+    /* Prevent access to HPET */
+    if ( hpet_address )
+    {
+        u8 prot_flags = hpet_flags & ACPI_HPET_PAGE_PROTECT_MASK;
+
+        mfn = paddr_to_pfn(hpet_address);
+        if ( prot_flags == ACPI_HPET_PAGE_PROTECT4 )
+            rc |= iomem_deny_access(d, mfn, mfn);
+        else if ( prot_flags == ACPI_HPET_PAGE_PROTECT64 )
+            rc |= iomem_deny_access(d, mfn, mfn + 15);
+        else if ( ro_hpet )
+            rc |= rangeset_add_singleton(mmio_ro_ranges, mfn);
+    }
+
+    return rc;
+}
+
 int __init construct_dom0(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
@@ -1539,84 +1622,9 @@ int __init construct_dom0(
     if ( test_bit(XENFEAT_supervisor_mode_kernel, parms.f_required) )
         panic("Dom0 requires supervisor-mode execution");
 
-    rc = 0;
-
-    /* The hardware domain is initially permitted full I/O capabilities. */
-    rc |= ioports_permit_access(d, 0, 0xFFFF);
-    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
-    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
-
-    /*
-     * Modify I/O port access permissions.
-     */
-    /* Master Interrupt Controller (PIC). */
-    rc |= ioports_deny_access(d, 0x20, 0x21);
-    /* Slave Interrupt Controller (PIC). */
-    rc |= ioports_deny_access(d, 0xA0, 0xA1);
-    /* Interval Timer (PIT). */
-    rc |= ioports_deny_access(d, 0x40, 0x43);
-    /* PIT Channel 2 / PC Speaker Control. */
-    rc |= ioports_deny_access(d, 0x61, 0x61);
-    /* ACPI PM Timer. */
-    if ( pmtmr_ioport )
-        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
-    /* PCI configuration space (NB. 0xcf8 has special treatment). */
-    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
-    /* Command-line I/O ranges. */
-    process_dom0_ioports_disable(d);
-
-    /*
-     * Modify I/O memory access permissions.
-     */
-    /* Local APIC. */
-    if ( mp_lapic_addr != 0 )
-    {
-        mfn = paddr_to_pfn(mp_lapic_addr);
-        rc |= iomem_deny_access(d, mfn, mfn);
-    }
-    /* I/O APICs. */
-    for ( i = 0; i < nr_ioapics; i++ )
-    {
-        mfn = paddr_to_pfn(mp_ioapics[i].mpc_apicaddr);
-        if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn) )
-            rc |= iomem_deny_access(d, mfn, mfn);
-    }
-    /* MSI range. */
-    rc |= iomem_deny_access(d, paddr_to_pfn(MSI_ADDR_BASE_LO),
-                            paddr_to_pfn(MSI_ADDR_BASE_LO +
-                                         MSI_ADDR_DEST_ID_MASK));
-    /* HyperTransport range. */
-    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
-        rc |= iomem_deny_access(d, paddr_to_pfn(0xfdULL << 32),
-                                paddr_to_pfn((1ULL << 40) - 1));
-
-    /* Remove access to E820_UNUSABLE I/O regions above 1MB. */
-    for ( i = 0; i < e820.nr_map; i++ )
-    {
-        unsigned long sfn, efn;
-        sfn = max_t(unsigned long, paddr_to_pfn(e820.map[i].addr), 0x100ul);
-        efn = paddr_to_pfn(e820.map[i].addr + e820.map[i].size - 1);
-        if ( (e820.map[i].type == E820_UNUSABLE) &&
-             (e820.map[i].size != 0) &&
-             (sfn <= efn) )
-            rc |= iomem_deny_access(d, sfn, efn);
-    }
-
-    /* Prevent access to HPET */
-    if ( hpet_address )
-    {
-        u8 prot_flags = hpet_flags & ACPI_HPET_PAGE_PROTECT_MASK;
-
-        mfn = paddr_to_pfn(hpet_address);
-        if ( prot_flags == ACPI_HPET_PAGE_PROTECT4 )
-            rc |= iomem_deny_access(d, mfn, mfn);
-        else if ( prot_flags == ACPI_HPET_PAGE_PROTECT64 )
-            rc |= iomem_deny_access(d, mfn, mfn + 15);
-        else if ( ro_hpet )
-            rc |= rangeset_add_singleton(mmio_ro_ranges, mfn);
-    }
-
-    BUG_ON(rc != 0);
+    rc = setup_permissions(d);
+    if ( rc != 0 )
+        panic("Failed to setup Dom0 permissions");
 
     if ( elf_check_broken(&elf) )
         printk(" Xen warning: dom0 kernel broken ELF: %s\n",
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 08/30] xen/x86: do the PCI scan unconditionally
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, Jan Beulich, Andrew Cooper,
	Suravee Suthikulpanit, boris.ostrovsky, Roger Pau Monne

Instead of being tied to the presence of an IOMMU.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/setup.c                        | 2 ++
 xen/drivers/passthrough/amd/pci_amd_iommu.c | 3 ++-
 xen/drivers/passthrough/vtd/iommu.c         | 2 --
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 8ae897a..1d27a6f 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1443,6 +1443,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     early_msi_init();
 
+    scan_pci_devices();
+
     iommu_setup();    /* setup iommu if available */
 
     smp_prepare_cpus(max_cpus);
diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c
index 94a25a4..d12575d 100644
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
 
     if ( !amd_iommu_perdev_intremap )
         printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
-    return scan_pci_devices();
+
+    return 0;
 }
 
 static int allocate_domain_resources(struct domain_iommu *hd)
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 48f120b..919993e 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2299,8 +2299,6 @@ int __init intel_vtd_setup(void)
     P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
-    scan_pci_devices();
-
     ret = init_vtd_hw();
     if ( ret )
         goto error;
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper, Jan Beulich,
	boris.ostrovsky, Roger Pau Monne

The current code used by Intel VT-d will only map RMRR regions for the
hardware domain, and will fail to map RMRR regions for unprivileged domains
unless the page tables are shared between EPT and the IOMMU. Fix this and
simplify the code, removing the {set/clear}_identity_p2m_entry helpers and
just using the normal MMIO mapping functions. Introduce a new MMIO
mapping/unmapping helper that takes care of processing pending softirqs if
the mapped region is big enough that it cannot be done in one shot.
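
The intended call pattern for the new helper is roughly the following
(sketch only; the real caller below additionally keeps the IOMMU page
tables in sync when they are not shared with EPT):

    unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K;
    unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;

    /* Identity-map the whole RMRR, processing softirqs whenever the
     * underlying {map,unmap}_mmio_regions call makes partial progress. */
    int rc = modify_mmio_11(d, base_pfn, end_pfn - base_pfn, true);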

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Feng Wu <feng.wu@intel.com>
---
 xen/arch/x86/mm/p2m.c               | 86 -------------------------------------
 xen/drivers/passthrough/vtd/iommu.c | 21 +++++----
 xen/include/asm-x86/p2m.h           |  5 ---
 xen/include/xen/p2m-common.h        | 30 +++++++++++++
 4 files changed, 42 insertions(+), 100 deletions(-)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 9526fff..44492ae 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1029,56 +1029,6 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
     return set_typed_p2m_entry(d, gfn, mfn, order, p2m_mmio_direct, access);
 }
 
-int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
-                           p2m_access_t p2ma, unsigned int flag)
-{
-    p2m_type_t p2mt;
-    p2m_access_t a;
-    mfn_t mfn;
-    struct p2m_domain *p2m = p2m_get_hostp2m(d);
-    int ret;
-
-    if ( !paging_mode_translate(p2m->domain) )
-    {
-        if ( !need_iommu(d) )
-            return 0;
-        return iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
-    }
-
-    gfn_lock(p2m, gfn, 0);
-
-    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
-
-    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
-        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
-                            p2m_mmio_direct, p2ma);
-    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
-    {
-        ret = 0;
-        /*
-         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
-         * but iomem regions are not mapped with IOMMU. This makes sure that
-         * RMRRs are correctly mapped with IOMMU.
-         */
-        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
-            ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
-    }
-    else
-    {
-        if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
-            ret = 0;
-        else
-            ret = -EBUSY;
-        printk(XENLOG_G_WARNING
-               "Cannot setup identity map d%d:%lx,"
-               " gfn already mapped to %lx.\n",
-               d->domain_id, gfn, mfn_x(mfn));
-    }
-
-    gfn_unlock(p2m, gfn, 0);
-    return ret;
-}
-
 /*
  * Returns:
  *    0        for success
@@ -1127,42 +1077,6 @@ int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
     return rc;
 }
 
-int clear_identity_p2m_entry(struct domain *d, unsigned long gfn)
-{
-    p2m_type_t p2mt;
-    p2m_access_t a;
-    mfn_t mfn;
-    struct p2m_domain *p2m = p2m_get_hostp2m(d);
-    int ret;
-
-    if ( !paging_mode_translate(d) )
-    {
-        if ( !need_iommu(d) )
-            return 0;
-        return iommu_unmap_page(d, gfn);
-    }
-
-    gfn_lock(p2m, gfn, 0);
-
-    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
-    if ( p2mt == p2m_mmio_direct && mfn_x(mfn) == gfn )
-    {
-        ret = p2m_set_entry(p2m, gfn, INVALID_MFN, PAGE_ORDER_4K,
-                            p2m_invalid, p2m->default_access);
-        gfn_unlock(p2m, gfn, 0);
-    }
-    else
-    {
-        gfn_unlock(p2m, gfn, 0);
-        printk(XENLOG_G_WARNING
-               "non-identity map d%d:%lx not cleared (mapped to %lx)\n",
-               d->domain_id, gfn, mfn_x(mfn));
-        ret = 0;
-    }
-
-    return ret;
-}
-
 /* Returns: 0 for success, -errno for failure */
 int set_shared_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn)
 {
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 919993e..714a19e 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1896,6 +1896,7 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
     unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K;
     struct mapped_rmrr *mrmrr;
     struct domain_iommu *hd = dom_iommu(d);
+    int ret = 0;
 
     ASSERT(pcidevs_locked());
     ASSERT(rmrr->base_address < rmrr->end_address);
@@ -1909,8 +1910,6 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
         if ( mrmrr->base == rmrr->base_address &&
              mrmrr->end == rmrr->end_address )
         {
-            int ret = 0;
-
             if ( map )
             {
                 ++mrmrr->count;
@@ -1920,9 +1919,10 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
             if ( --mrmrr->count )
                 return 0;
 
-            while ( base_pfn < end_pfn )
+            ret = modify_mmio_11(d, base_pfn, end_pfn - base_pfn, false);
+            while ( !iommu_use_hap_pt(d) && base_pfn < end_pfn )
             {
-                if ( clear_identity_p2m_entry(d, base_pfn) )
+                if ( iommu_unmap_page(d, base_pfn) )
                     ret = -ENXIO;
                 base_pfn++;
             }
@@ -1936,12 +1936,15 @@ static int rmrr_identity_mapping(struct domain *d, bool_t map,
     if ( !map )
         return -ENOENT;
 
-    while ( base_pfn < end_pfn )
+    ret = modify_mmio_11(d, base_pfn, end_pfn - base_pfn, true);
+    if ( ret )
+        return ret;
+    while ( !iommu_use_hap_pt(d) && base_pfn < end_pfn )
     {
-        int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag);
-
-        if ( err )
-            return err;
+        ret = iommu_map_page(d, base_pfn, base_pfn,
+                             IOMMUF_readable|IOMMUF_writable);
+        if ( ret )
+            return ret;
         base_pfn++;
     }
 
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 7035860..ccf19e5 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -602,11 +602,6 @@ int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
 int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn,
                          unsigned int order);
 
-/* Set identity addresses in the p2m table (for pass-through) */
-int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
-                           p2m_access_t p2ma, unsigned int flag);
-int clear_identity_p2m_entry(struct domain *d, unsigned long gfn);
-
 /* Add foreign mapping to the guest's p2m table. */
 int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
                     unsigned long gpfn, domid_t foreign_domid);
diff --git a/xen/include/xen/p2m-common.h b/xen/include/xen/p2m-common.h
index 3be1e91..5f6b4ef 100644
--- a/xen/include/xen/p2m-common.h
+++ b/xen/include/xen/p2m-common.h
@@ -2,6 +2,7 @@
 #define _XEN_P2M_COMMON_H
 
 #include <public/vm_event.h>
+#include <xen/softirq.h>
 
 /*
  * Additional access types, which are used to further restrict
@@ -46,6 +47,35 @@ int unmap_mmio_regions(struct domain *d,
                        mfn_t mfn);
 
 /*
+ * Preemptive Helper for mapping/unmapping MMIO regions.
+ */
+static inline int modify_mmio_11(struct domain *d, unsigned long pfn,
+                                 unsigned long nr_pages, bool map)
+{
+    int rc;
+
+    while ( nr_pages > 0 )
+    {
+        rc = (map ? map_mmio_regions : unmap_mmio_regions)
+             (d, _gfn(pfn), nr_pages, _mfn(pfn));
+        if ( rc == 0 )
+            break;
+        if ( rc < 0 )
+        {
+            printk(XENLOG_ERR
+                "Failed to %smap %#lx - %#lx into domain %d memory map: %d\n",
+                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
+            return rc;
+        }
+        nr_pages -= rc;
+        pfn += rc;
+        process_pending_softirqs();
+    }
+
+    return rc;
+}
+
+/*
  * Set access type for a region of gfns.
  * If gfn == INVALID_GFN, sets the default access type.
  */
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 10/30] xen/x86: allow the emulated APICs to be enabled for the hardware domain
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Allow the use of both the emulated local APIC and IO APIC for the hardware
domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since RFC:
 - Move the emulation flags check to a separate helper.
---
 xen/arch/x86/domain.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 3c4b094..332e7f0 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -509,6 +509,29 @@ void vcpu_destroy(struct vcpu *v)
         xfree(v->arch.pv_vcpu.trap_ctxt);
 }
 
+static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
+{
+    if ( is_hvm_domain(d) )
+    {
+        if ( is_hardware_domain(d) &&
+             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC))
+            return false;
+        if ( !is_hardware_domain(d) &&
+             emflags != XEN_X86_EMU_ALL && emflags != 0 )
+            return false;
+    }
+    else
+    {
+        /* PV or classic PVH. */
+        if ( is_hardware_domain(d) && emflags != XEN_X86_EMU_PIT )
+            return false;
+        if ( !is_hardware_domain(d) && emflags != 0 )
+            return false;
+    }
+    return true;
+}
+
 int arch_domain_create(struct domain *d, unsigned int domcr_flags,
                        struct xen_arch_domainconfig *config)
 {
@@ -553,9 +576,8 @@ int arch_domain_create(struct domain *d, unsigned int domcr_flags,
         }
         if ( is_hardware_domain(d) )
             config->emulation_flags |= XEN_X86_EMU_PIT;
-        if ( config->emulation_flags != 0 &&
-             (config->emulation_flags !=
-              (is_hvm_domain(d) ? XEN_X86_EMU_ALL : XEN_X86_EMU_PIT)) )
+
+        if ( !emulation_flags_ok(d, config->emulation_flags) )
         {
             printk(XENLOG_G_ERR "d%d: Xen does not allow %s domain creation "
                    "with the current selection of emulators: %#x\n",
-- 
2.7.4 (Apple Git-66)



* [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Split the Dom0 builder into two functions, one for PV (and classic PVH) and
another for PVHv2. Introduce a new command line parameter, dom0hvm, to
request the creation of a PVHv2 Dom0.
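
As an illustrative example (paths and sizes made up), a PVHv2 Dom0 could
then be requested from a GRUB2 entry along the lines of:

    multiboot2 /boot/xen.gz dom0hvm=true dom0_mem=4096M
    module2    /boot/vmlinuz root=/dev/sda1 ro
    module2    /boot/initrd.img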

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since RFC:
 - Add documentation for the new command line option.
 - Simplify the logic in construct_dom0.
---
 docs/misc/xen-command-line.markdown |  7 +++++++
 xen/arch/x86/domain_build.c         | 24 +++++++++++++++++++++++-
 xen/arch/x86/setup.c                |  9 +++++++++
 3 files changed, 39 insertions(+), 1 deletion(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 8ff57fa..59d7210 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -663,6 +663,13 @@ Pin dom0 vcpus to their respective pcpus
 
 Flag that makes a 64bit dom0 boot in PVH mode. No 32bit support at present.
 
+### dom0hvm
+> `= <boolean>`
+
+> Default: `false`
+
+Flag that makes a dom0 boot in PVHv2 mode.
+
 ### dtuart (ARM)
 > `= path [:options]`
 
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index ffd0521..78980ae 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -952,7 +952,7 @@ static int __init setup_permissions(struct domain *d)
     return rc;
 }
 
-int __init construct_dom0(
+static int __init construct_dom0_pv(
     struct domain *d,
     const module_t *image, unsigned long image_headroom,
     module_t *initrd,
@@ -1657,6 +1657,28 @@ out:
     return rc;
 }
 
+static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
+                                     unsigned long image_headroom,
+                                     module_t *initrd,
+                                     void *(*bootstrap_map)(const module_t *),
+                                     char *cmdline)
+{
+
+    printk("** Building a PVH Dom0 **\n");
+
+    return 0;
+}
+
+int __init construct_dom0(struct domain *d, const module_t *image,
+                          unsigned long image_headroom, module_t *initrd,
+                          void *(*bootstrap_map)(const module_t *),
+                          char *cmdline)
+{
+
+    return (is_hvm_domain(d) ? construct_dom0_hvm : construct_dom0_pv)
+           (d, image, image_headroom, initrd, bootstrap_map, cmdline);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 1d27a6f..9272318 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -75,6 +75,10 @@ unsigned long __read_mostly cr4_pv32_mask;
 static bool_t __initdata opt_dom0pvh;
 boolean_param("dom0pvh", opt_dom0pvh);
 
+/* Boot dom0 in HVM mode */
+static bool_t __initdata opt_dom0hvm;
+boolean_param("dom0hvm", opt_dom0hvm);
+
 /* **** Linux config option: propagated to domain0. */
 /* "acpi=off":    Sisables both ACPI table parsing and interpreter. */
 /* "acpi=force":  Override the disable blacklist.                   */
@@ -1495,6 +1499,11 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     if ( opt_dom0pvh )
         domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
 
+    if ( opt_dom0hvm ) {
+        domcr_flags |= DOMCRF_hvm | (hvm_funcs.hap_supported ? DOMCRF_hap : 0);
+        config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
+    }
+
     /*
      * Create initial domain 0.
      * x86 doesn't support arch-configuration. So it's fine to pass
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 12/30] xen/x86: make print_e820_memory_map global
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (10 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-09-30 15:04   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form Roger Pau Monne
                   ` (18 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

So that it can be called from the Dom0 builder.
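
For instance, the PVHv2 Dom0 builder later in this series prints the guest
memory map with:

    printk("Dom0 memory map:\n");
    print_e820_memory_map(d->arch.e820, d->arch.nr_e820);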

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/e820.c        | 2 +-
 xen/include/asm-x86/e820.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c
index ef077a5..48e35f9 100644
--- a/xen/arch/x86/e820.c
+++ b/xen/arch/x86/e820.c
@@ -87,7 +87,7 @@ static void __init add_memory_region(unsigned long long start,
     e820.nr_map++;
 }
 
-static void __init print_e820_memory_map(struct e820entry *map, unsigned int entries)
+void __init print_e820_memory_map(struct e820entry *map, unsigned int entries)
 {
     unsigned int i;
 
diff --git a/xen/include/asm-x86/e820.h b/xen/include/asm-x86/e820.h
index d9ff4eb..9dad76a 100644
--- a/xen/include/asm-x86/e820.h
+++ b/xen/include/asm-x86/e820.h
@@ -31,6 +31,7 @@ extern int e820_change_range_type(
 extern int e820_add_range(
     struct e820map *, uint64_t s, uint64_t e, uint32_t type);
 extern unsigned long init_e820(const char *, struct e820entry *, unsigned int *);
+extern void print_e820_memory_map(struct e820entry *map, unsigned int entries);
 extern struct e820map e820;
 
 /* These symbols live in the boot trampoline. */
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (11 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 12/30] xen/x86: make print_e820_memory_map global Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-09-28  8:24   ` Juergen Gross
                     ` (2 more replies)
  2016-09-27 15:57 ` [PATCH v2 14/30] xen/mm: add a ceil suffix to the current page calculation routine Roger Pau Monne
                   ` (17 subsequent siblings)
  30 siblings, 3 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Introduce a new %pZ format specifier to print sizes (in bytes) in
human-readable form.
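
Like the other %p extensions the value travels through the generic pointer
path, so callers cast the size via _p(); e.g. (illustrative):

    printk("conring size: %pZ\n", _p(opt_conring_size));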

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
---
 docs/misc/printk-formats.txt |  5 +++++
 xen/common/vsprintf.c        | 15 +++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/docs/misc/printk-formats.txt b/docs/misc/printk-formats.txt
index 525108f..0ee3504 100644
--- a/docs/misc/printk-formats.txt
+++ b/docs/misc/printk-formats.txt
@@ -30,3 +30,8 @@ Domain and vCPU information:
 
        %pv     Domain and vCPU ID from a 'struct vcpu *' (printed as
                "d<domid>v<vcpuid>")
+
+Size in human readable form:
+
+       %pZ     Size in human-readable form (input size must be in bytes).
+                 e.g.  24MB
diff --git a/xen/common/vsprintf.c b/xen/common/vsprintf.c
index f92fb67..2dadaec 100644
--- a/xen/common/vsprintf.c
+++ b/xen/common/vsprintf.c
@@ -386,6 +386,21 @@ static char *pointer(char *str, char *end, const char **fmt_ptr,
             *str = 'v';
         return number(str + 1, end, v->vcpu_id, 10, -1, -1, 0);
     }
+    case 'Z':
+    {
+        static const char units[][3] = {"B", "KB", "MB", "GB", "TB"};
+        size_t size = (size_t)arg;
+        int i = 0;
+
+        /* Advance the parent's fmt string, as we have consumed 'Z' */
+        ++*fmt_ptr;
+
+        while ( ++i < ARRAY_SIZE(units) && size >= 1024 )
+            size >>= 10; /* size /= 1024 */
+
+        str = number(str, end, size, 10, -1, -1, 0);
+        return string(str, end, units[i-1], -1, -1, 0);
+    }
     }
 
     if ( field_width == -1 )
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 14/30] xen/mm: add a ceil suffix to the current page calculation routine
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (12 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-09-30 15:20   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
                   ` (16 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Suravee Suthikulpanit, Andrew Cooper,
	Julien Grall, Jan Beulich, boris.ostrovsky, Roger Pau Monne

... and introduce a floor variant.
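
To illustrate the difference between the two variants (assuming the usual
4KiB PAGE_SIZE):

    get_order_from_bytes_ceil(PAGE_SIZE + 1);  /* == 1, rounds up   */
    get_order_from_bytes_floor(PAGE_SIZE + 1); /* == 0, rounds down */
    /* For sizes that are an exact power-of-two number of pages both
     * variants return the same order. */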

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/arm/domain.c                    |  2 +-
 xen/arch/arm/domain_build.c              | 16 +++++-----------
 xen/arch/arm/kernel.c                    |  4 ++--
 xen/arch/arm/percpu.c                    |  3 ++-
 xen/arch/x86/domain.c                    |  2 +-
 xen/arch/x86/domain_build.c              |  4 ++--
 xen/arch/x86/hvm/svm/nestedsvm.c         |  8 ++++----
 xen/arch/x86/hvm/svm/vmcb.c              |  5 +++--
 xen/arch/x86/percpu.c                    |  3 ++-
 xen/arch/x86/smpboot.c                   |  4 ++--
 xen/common/kexec.c                       |  2 +-
 xen/common/page_alloc.c                  |  2 +-
 xen/common/tmem_xen.c                    |  2 +-
 xen/common/xmalloc_tlsf.c                |  6 +++---
 xen/drivers/char/console.c               |  6 +++---
 xen/drivers/char/serial.c                |  2 +-
 xen/drivers/passthrough/amd/iommu_init.c | 17 +++++++++--------
 xen/drivers/passthrough/pci.c            |  2 +-
 xen/include/asm-x86/flushtlb.h           |  2 +-
 xen/include/xen/mm.h                     | 12 +++++++++++-
 20 files changed, 56 insertions(+), 48 deletions(-)

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index 20bb2ba..1f6b0a4 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -661,7 +661,7 @@ void arch_domain_destroy(struct domain *d)
     free_xenheap_page(d->shared_info);
 #ifdef CONFIG_ACPI
     free_xenheap_pages(d->arch.efi_acpi_table,
-                       get_order_from_bytes(d->arch.efi_acpi_len));
+                       get_order_from_bytes_ceil(d->arch.efi_acpi_len));
 #endif
     domain_io_free(d);
 }
diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c
index 35ab08d..cabe030 100644
--- a/xen/arch/arm/domain_build.c
+++ b/xen/arch/arm/domain_build.c
@@ -73,14 +73,8 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
 
 static unsigned int get_11_allocation_size(paddr_t size)
 {
-    /*
-     * get_order_from_bytes returns the order greater than or equal to
-     * the given size, but we need less than or equal. Adding one to
-     * the size pushes an evenly aligned size into the next order, so
-     * we can then unconditionally subtract 1 from the order which is
-     * returned.
-     */
-    return get_order_from_bytes(size + 1) - 1;
+
+    return get_order_from_bytes_floor(size);
 }
 
 /*
@@ -238,8 +232,8 @@ fail:
 static void allocate_memory(struct domain *d, struct kernel_info *kinfo)
 {
     const unsigned int min_low_order =
-        get_order_from_bytes(min_t(paddr_t, dom0_mem, MB(128)));
-    const unsigned int min_order = get_order_from_bytes(MB(4));
+        get_order_from_bytes_ceil(min_t(paddr_t, dom0_mem, MB(128)));
+    const unsigned int min_order = get_order_from_bytes_ceil(MB(4));
     struct page_info *pg;
     unsigned int order = get_11_allocation_size(kinfo->unassigned_mem);
     int i;
@@ -1828,7 +1822,7 @@ static int prepare_acpi(struct domain *d, struct kernel_info *kinfo)
     if ( rc != 0 )
         return rc;
 
-    order = get_order_from_bytes(d->arch.efi_acpi_len);
+    order = get_order_from_bytes_ceil(d->arch.efi_acpi_len);
     d->arch.efi_acpi_table = alloc_xenheap_pages(order, 0);
     if ( d->arch.efi_acpi_table == NULL )
     {
diff --git a/xen/arch/arm/kernel.c b/xen/arch/arm/kernel.c
index 3f6cce3..0d9986b 100644
--- a/xen/arch/arm/kernel.c
+++ b/xen/arch/arm/kernel.c
@@ -291,7 +291,7 @@ static __init int kernel_decompress(struct bootmodule *mod)
         return -EFAULT;
 
     output_size = output_length(input, size);
-    kernel_order_out = get_order_from_bytes(output_size);
+    kernel_order_out = get_order_from_bytes_ceil(output_size);
     pages = alloc_domheap_pages(NULL, kernel_order_out, 0);
     if ( pages == NULL )
     {
@@ -463,7 +463,7 @@ static int kernel_elf_probe(struct kernel_info *info,
 
     memset(&info->elf.elf, 0, sizeof(info->elf.elf));
 
-    info->elf.kernel_order = get_order_from_bytes(size);
+    info->elf.kernel_order = get_order_from_bytes_ceil(size);
     info->elf.kernel_img = alloc_xenheap_pages(info->elf.kernel_order, 0);
     if ( info->elf.kernel_img == NULL )
         panic("Cannot allocate temporary buffer for kernel");
diff --git a/xen/arch/arm/percpu.c b/xen/arch/arm/percpu.c
index e545024..954e92f 100644
--- a/xen/arch/arm/percpu.c
+++ b/xen/arch/arm/percpu.c
@@ -7,7 +7,8 @@
 
 unsigned long __per_cpu_offset[NR_CPUS];
 #define INVALID_PERCPU_AREA (-(long)__per_cpu_start)
-#define PERCPU_ORDER (get_order_from_bytes(__per_cpu_data_end-__per_cpu_start))
+#define PERCPU_ORDER \
+    (get_order_from_bytes_ceil(__per_cpu_data_end-__per_cpu_start))
 
 void __init percpu_init_areas(void)
 {
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 332e7f0..3d70720 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -236,7 +236,7 @@ static unsigned int __init noinline _domain_struct_bits(void)
 struct domain *alloc_domain_struct(void)
 {
     struct domain *d;
-    unsigned int order = get_order_from_bytes(sizeof(*d));
+    unsigned int order = get_order_from_bytes_ceil(sizeof(*d));
 #ifdef CONFIG_BIGMEM
     const unsigned int bits = 0;
 #else
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 78980ae..982bb5f 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -290,7 +290,7 @@ static unsigned long __init compute_dom0_nr_pages(
 
     /* Reserve memory for further dom0 vcpu-struct allocations... */
     avail -= (d->max_vcpus - 1UL)
-             << get_order_from_bytes(sizeof(struct vcpu));
+             << get_order_from_bytes_ceil(sizeof(struct vcpu));
     /* ...and compat_l4's, if needed. */
     if ( is_pv_32bit_domain(d) )
         avail -= d->max_vcpus - 1;
@@ -1172,7 +1172,7 @@ static int __init construct_dom0_pv(
     count = v_end - v_start;
     if ( vinitrd_start )
         count -= PAGE_ALIGN(initrd_len);
-    order = get_order_from_bytes(count);
+    order = get_order_from_bytes_ceil(count);
     if ( (1UL << order) + PFN_UP(initrd_len) > nr_pages )
         panic("Domain 0 allocation is too small for kernel image");
 
diff --git a/xen/arch/x86/hvm/svm/nestedsvm.c b/xen/arch/x86/hvm/svm/nestedsvm.c
index f9b38ab..7b3af39 100644
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -101,13 +101,13 @@ int nsvm_vcpu_initialise(struct vcpu *v)
     struct nestedvcpu *nv = &vcpu_nestedhvm(v);
     struct nestedsvm *svm = &vcpu_nestedsvm(v);
 
-    msrpm = alloc_xenheap_pages(get_order_from_bytes(MSRPM_SIZE), 0);
+    msrpm = alloc_xenheap_pages(get_order_from_bytes_ceil(MSRPM_SIZE), 0);
     svm->ns_cached_msrpm = msrpm;
     if (msrpm == NULL)
         goto err;
     memset(msrpm, 0x0, MSRPM_SIZE);
 
-    msrpm = alloc_xenheap_pages(get_order_from_bytes(MSRPM_SIZE), 0);
+    msrpm = alloc_xenheap_pages(get_order_from_bytes_ceil(MSRPM_SIZE), 0);
     svm->ns_merged_msrpm = msrpm;
     if (msrpm == NULL)
         goto err;
@@ -141,12 +141,12 @@ void nsvm_vcpu_destroy(struct vcpu *v)
 
     if (svm->ns_cached_msrpm) {
         free_xenheap_pages(svm->ns_cached_msrpm,
-                           get_order_from_bytes(MSRPM_SIZE));
+                           get_order_from_bytes_ceil(MSRPM_SIZE));
         svm->ns_cached_msrpm = NULL;
     }
     if (svm->ns_merged_msrpm) {
         free_xenheap_pages(svm->ns_merged_msrpm,
-                           get_order_from_bytes(MSRPM_SIZE));
+                           get_order_from_bytes_ceil(MSRPM_SIZE));
         svm->ns_merged_msrpm = NULL;
     }
     hvm_unmap_guest_frame(nv->nv_vvmcx, 1);
diff --git a/xen/arch/x86/hvm/svm/vmcb.c b/xen/arch/x86/hvm/svm/vmcb.c
index 9ea014f..c763b75 100644
--- a/xen/arch/x86/hvm/svm/vmcb.c
+++ b/xen/arch/x86/hvm/svm/vmcb.c
@@ -98,7 +98,8 @@ static int construct_vmcb(struct vcpu *v)
                              CR_INTERCEPT_CR8_WRITE);
 
     /* I/O and MSR permission bitmaps. */
-    arch_svm->msrpm = alloc_xenheap_pages(get_order_from_bytes(MSRPM_SIZE), 0);
+    arch_svm->msrpm = alloc_xenheap_pages(
+                        get_order_from_bytes_ceil(MSRPM_SIZE), 0);
     if ( arch_svm->msrpm == NULL )
         return -ENOMEM;
     memset(arch_svm->msrpm, 0xff, MSRPM_SIZE);
@@ -268,7 +269,7 @@ void svm_destroy_vmcb(struct vcpu *v)
     if ( arch_svm->msrpm != NULL )
     {
         free_xenheap_pages(
-            arch_svm->msrpm, get_order_from_bytes(MSRPM_SIZE));
+            arch_svm->msrpm, get_order_from_bytes_ceil(MSRPM_SIZE));
         arch_svm->msrpm = NULL;
     }
 
diff --git a/xen/arch/x86/percpu.c b/xen/arch/x86/percpu.c
index 1c1dad9..d44e7e2 100644
--- a/xen/arch/x86/percpu.c
+++ b/xen/arch/x86/percpu.c
@@ -14,7 +14,8 @@ unsigned long __per_cpu_offset[NR_CPUS];
  * context of PV guests.
  */
 #define INVALID_PERCPU_AREA (0x8000000000000000L - (long)__per_cpu_start)
-#define PERCPU_ORDER (get_order_from_bytes(__per_cpu_data_end-__per_cpu_start))
+#define PERCPU_ORDER \
+    (get_order_from_bytes_ceil(__per_cpu_data_end-__per_cpu_start))
 
 void __init percpu_init_areas(void)
 {
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 3a9dd3e..5597675 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -669,7 +669,7 @@ static void cpu_smpboot_free(unsigned int cpu)
 
     free_xenheap_pages(per_cpu(compat_gdt_table, cpu), order);
 
-    order = get_order_from_bytes(IDT_ENTRIES * sizeof(idt_entry_t));
+    order = get_order_from_bytes_ceil(IDT_ENTRIES * sizeof(idt_entry_t));
     free_xenheap_pages(idt_tables[cpu], order);
     idt_tables[cpu] = NULL;
 
@@ -710,7 +710,7 @@ static int cpu_smpboot_alloc(unsigned int cpu)
     memcpy(gdt, boot_cpu_compat_gdt_table, NR_RESERVED_GDT_PAGES * PAGE_SIZE);
     gdt[PER_CPU_GDT_ENTRY - FIRST_RESERVED_GDT_ENTRY].a = cpu;
 
-    order = get_order_from_bytes(IDT_ENTRIES * sizeof(idt_entry_t));
+    order = get_order_from_bytes_ceil(IDT_ENTRIES * sizeof(idt_entry_t));
     idt_tables[cpu] = alloc_xenheap_pages(order, memflags);
     if ( idt_tables[cpu] == NULL )
         goto oom;
diff --git a/xen/common/kexec.c b/xen/common/kexec.c
index c83d48f..f557475 100644
--- a/xen/common/kexec.c
+++ b/xen/common/kexec.c
@@ -556,7 +556,7 @@ static int __init kexec_init(void)
         crash_heap_size = PAGE_ALIGN(crash_heap_size);
 
         crash_heap_current = alloc_xenheap_pages(
-            get_order_from_bytes(crash_heap_size),
+            get_order_from_bytes_ceil(crash_heap_size),
             MEMF_bits(crashinfo_maxaddr_bits) );
 
         if ( ! crash_heap_current )
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index ae2476d..7f0381e 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -553,7 +553,7 @@ static unsigned long init_node_heap(int node, unsigned long mfn,
         *use_tail = 0;
     }
 #endif
-    else if ( get_order_from_bytes(sizeof(**_heap)) ==
+    else if ( get_order_from_bytes_ceil(sizeof(**_heap)) ==
               get_order_from_pages(needed) )
     {
         _heap[node] = alloc_xenheap_pages(get_order_from_pages(needed), 0);
diff --git a/xen/common/tmem_xen.c b/xen/common/tmem_xen.c
index 71cb7d5..6c630b6 100644
--- a/xen/common/tmem_xen.c
+++ b/xen/common/tmem_xen.c
@@ -292,7 +292,7 @@ int __init tmem_init(void)
     unsigned int cpu;
 
     dstmem_order = get_order_from_pages(LZO_DSTMEM_PAGES);
-    workmem_order = get_order_from_bytes(LZO1X_1_MEM_COMPRESS);
+    workmem_order = get_order_from_bytes_ceil(LZO1X_1_MEM_COMPRESS);
 
     for_each_online_cpu ( cpu )
     {
diff --git a/xen/common/xmalloc_tlsf.c b/xen/common/xmalloc_tlsf.c
index 6c1b882..32800e1 100644
--- a/xen/common/xmalloc_tlsf.c
+++ b/xen/common/xmalloc_tlsf.c
@@ -298,7 +298,7 @@ struct xmem_pool *xmem_pool_create(
     BUG_ON(max_size && (max_size < init_size));
 
     pool_bytes = ROUNDUP_SIZE(sizeof(*pool));
-    pool_order = get_order_from_bytes(pool_bytes);
+    pool_order = get_order_from_bytes_ceil(pool_bytes);
 
     pool = (void *)alloc_xenheap_pages(pool_order, 0);
     if ( pool == NULL )
@@ -371,7 +371,7 @@ void xmem_pool_destroy(struct xmem_pool *pool)
     spin_unlock(&pool_list_lock);
 
     pool_bytes = ROUNDUP_SIZE(sizeof(*pool));
-    pool_order = get_order_from_bytes(pool_bytes);
+    pool_order = get_order_from_bytes_ceil(pool_bytes);
     free_xenheap_pages(pool,pool_order);
 }
 
@@ -530,7 +530,7 @@ static void *xmalloc_whole_pages(unsigned long size, unsigned long align)
     unsigned int i, order;
     void *res, *p;
 
-    order = get_order_from_bytes(max(align, size));
+    order = get_order_from_bytes_ceil(max(align, size));
 
     res = alloc_xenheap_pages(order, 0);
     if ( res == NULL )
diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c
index 55ae31a..605639e 100644
--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -301,7 +301,7 @@ static void dump_console_ring_key(unsigned char key)
 
     /* create a buffer in which we'll copy the ring in the correct
        order and NUL terminate */
-    order = get_order_from_bytes(conring_size + 1);
+    order = get_order_from_bytes_ceil(conring_size + 1);
     buf = alloc_xenheap_pages(order, 0);
     if ( buf == NULL )
     {
@@ -759,7 +759,7 @@ void __init console_init_ring(void)
     if ( !opt_conring_size )
         return;
 
-    order = get_order_from_bytes(max(opt_conring_size, conring_size));
+    order = get_order_from_bytes_ceil(max(opt_conring_size, conring_size));
     memflags = MEMF_bits(crashinfo_maxaddr_bits);
     while ( (ring = alloc_xenheap_pages(order, memflags)) == NULL )
     {
@@ -1080,7 +1080,7 @@ static int __init debugtrace_init(void)
     if ( bytes == 0 )
         return 0;
 
-    order = get_order_from_bytes(bytes);
+    order = get_order_from_bytes_ceil(bytes);
     debugtrace_buf = alloc_xenheap_pages(order, 0);
     ASSERT(debugtrace_buf != NULL);
 
diff --git a/xen/drivers/char/serial.c b/xen/drivers/char/serial.c
index 0fc5ced..5ac75bb 100644
--- a/xen/drivers/char/serial.c
+++ b/xen/drivers/char/serial.c
@@ -577,7 +577,7 @@ void __init serial_async_transmit(struct serial_port *port)
     while ( serial_txbufsz & (serial_txbufsz - 1) )
         serial_txbufsz &= serial_txbufsz - 1;
     port->txbuf = alloc_xenheap_pages(
-        get_order_from_bytes(serial_txbufsz), 0);
+        get_order_from_bytes_ceil(serial_txbufsz), 0);
 }
 
 /*
diff --git a/xen/drivers/passthrough/amd/iommu_init.c b/xen/drivers/passthrough/amd/iommu_init.c
index ea9f7e7..696ff1a 100644
--- a/xen/drivers/passthrough/amd/iommu_init.c
+++ b/xen/drivers/passthrough/amd/iommu_init.c
@@ -136,7 +136,8 @@ static void register_iommu_cmd_buffer_in_mmio_space(struct amd_iommu *iommu)
     iommu_set_addr_lo_to_reg(&entry, addr_lo >> PAGE_SHIFT);
     writel(entry, iommu->mmio_base + IOMMU_CMD_BUFFER_BASE_LOW_OFFSET);
 
-    power_of2_entries = get_order_from_bytes(iommu->cmd_buffer.alloc_size) +
+    power_of2_entries =
+        get_order_from_bytes_ceil(iommu->cmd_buffer.alloc_size) +
         IOMMU_CMD_BUFFER_POWER_OF2_ENTRIES_PER_PAGE;
 
     entry = 0;
@@ -164,7 +165,7 @@ static void register_iommu_event_log_in_mmio_space(struct amd_iommu *iommu)
     iommu_set_addr_lo_to_reg(&entry, addr_lo >> PAGE_SHIFT);
     writel(entry, iommu->mmio_base + IOMMU_EVENT_LOG_BASE_LOW_OFFSET);
 
-    power_of2_entries = get_order_from_bytes(iommu->event_log.alloc_size) +
+    power_of2_entries = get_order_from_bytes_ceil(iommu->event_log.alloc_size) +
                         IOMMU_EVENT_LOG_POWER_OF2_ENTRIES_PER_PAGE;
 
     entry = 0;
@@ -192,7 +193,7 @@ static void register_iommu_ppr_log_in_mmio_space(struct amd_iommu *iommu)
     iommu_set_addr_lo_to_reg(&entry, addr_lo >> PAGE_SHIFT);
     writel(entry, iommu->mmio_base + IOMMU_PPR_LOG_BASE_LOW_OFFSET);
 
-    power_of2_entries = get_order_from_bytes(iommu->ppr_log.alloc_size) +
+    power_of2_entries = get_order_from_bytes_ceil(iommu->ppr_log.alloc_size) +
                         IOMMU_PPR_LOG_POWER_OF2_ENTRIES_PER_PAGE;
 
     entry = 0;
@@ -918,7 +919,7 @@ static void __init deallocate_buffer(void *buf, uint32_t sz)
     int order = 0;
     if ( buf )
     {
-        order = get_order_from_bytes(sz);
+        order = get_order_from_bytes_ceil(sz);
         __free_amd_iommu_tables(buf, order);
     }
 }
@@ -940,7 +941,7 @@ static void __init deallocate_ring_buffer(struct ring_buffer *ring_buf)
 static void * __init allocate_buffer(uint32_t alloc_size, const char *name)
 {
     void * buffer;
-    int order = get_order_from_bytes(alloc_size);
+    int order = get_order_from_bytes_ceil(alloc_size);
 
     buffer = __alloc_amd_iommu_tables(order);
 
@@ -963,8 +964,8 @@ static void * __init allocate_ring_buffer(struct ring_buffer *ring_buf,
 
     spin_lock_init(&ring_buf->lock);
     
-    ring_buf->alloc_size = PAGE_SIZE << get_order_from_bytes(entries *
-                                                             entry_size);
+    ring_buf->alloc_size = PAGE_SIZE << get_order_from_bytes_ceil(entries *
+                                                                  entry_size);
     ring_buf->entries = ring_buf->alloc_size / entry_size;
     ring_buf->buffer = allocate_buffer(ring_buf->alloc_size, name);
     return ring_buf->buffer;
@@ -1163,7 +1164,7 @@ static int __init amd_iommu_setup_device_table(
 
     /* allocate 'device table' on a 4K boundary */
     device_table.alloc_size = PAGE_SIZE <<
-                              get_order_from_bytes(
+                              get_order_from_bytes_ceil(
                               PAGE_ALIGN(ivrs_bdf_entries *
                               IOMMU_DEV_TABLE_ENTRY_SIZE));
     device_table.entries = device_table.alloc_size /
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 338d6b4..dd291a2 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -460,7 +460,7 @@ int __init pci_ro_device(int seg, int bus, int devfn)
     {
         size_t sz = BITS_TO_LONGS(PCI_BDF(-1, -1, -1) + 1) * sizeof(long);
 
-        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes(sz), 0);
+        pseg->ro_map = alloc_xenheap_pages(get_order_from_bytes_ceil(sz), 0);
         if ( !pseg->ro_map )
             return -ENOMEM;
         memset(pseg->ro_map, 0, sz);
diff --git a/xen/include/asm-x86/flushtlb.h b/xen/include/asm-x86/flushtlb.h
index 2e7ed6b..45d6b0a 100644
--- a/xen/include/asm-x86/flushtlb.h
+++ b/xen/include/asm-x86/flushtlb.h
@@ -125,7 +125,7 @@ static inline int invalidate_dcache_va_range(const void *p,
 static inline int clean_and_invalidate_dcache_va_range(const void *p,
                                                        unsigned long size)
 {
-    unsigned int order = get_order_from_bytes(size);
+    unsigned int order = get_order_from_bytes_ceil(size);
     /* sub-page granularity support needs to be added if necessary */
     flush_area_local(p, FLUSH_CACHE|FLUSH_ORDER(order));
     return 0;
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 76fbb82..5357a08 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -519,7 +519,7 @@ page_list_splice(struct page_list_head *list, struct page_list_head *head)
     list_for_each_entry_safe_reverse(pos, tmp, head, list)
 #endif
 
-static inline unsigned int get_order_from_bytes(paddr_t size)
+static inline unsigned int get_order_from_bytes_ceil(paddr_t size)
 {
     unsigned int order;
 
@@ -530,6 +530,16 @@ static inline unsigned int get_order_from_bytes(paddr_t size)
     return order;
 }
 
+static inline unsigned int get_order_from_bytes_floor(paddr_t size)
+{
+    unsigned int order;
+
+    size >>= PAGE_SHIFT;
+    for ( order = 0; size >= (1 << (order + 1)); order++ );
+
+    return order;
+}
+
 static inline unsigned int get_order_from_pages(unsigned long nr_pages)
 {
     unsigned int order;
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (13 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 14/30] xen/mm: add a ceil suffix to the current page calculation routine Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-09-30 15:52   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
                   ` (15 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Craft the Dom0 e820 memory map and populate it.
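
As a worked example of the alignment done in hvm_setup_e820() below
(illustrative addresses): a host RAM region [0x100a00, 0x1fff00) is trimmed
to page granularity before being accounted as guest RAM:

    start = ROUNDUP(0x100a00, PAGE_SIZE);   /* 0x101000 */
    end   = 0x1fff00 & PAGE_MASK;           /* 0x1ff000 */
    /* start < end, so [0x101000, 0x1ff000) becomes guest E820_RAM. */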

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since RFC:
 - Use IS_ALIGNED instead of checking with PAGE_MASK.
 - Use the new %pZ specifier to print sizes in human-readable form.
 - Create a VM86 TSS for hardware that doesn't support unrestricted mode.
 - Subtract guest RAM for the identity page table and the VM86 TSS.
 - Split the creation of the unrestricted mode helper structures to a
   separate function.
 - Use preemption with paging_set_allocation.
 - Use get_order_from_bytes_floor.
---
 xen/arch/x86/domain_build.c | 257 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 251 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 982bb5f..c590c58 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -22,6 +22,7 @@
 #include <xen/compat.h>
 #include <xen/libelf.h>
 #include <xen/pfn.h>
+#include <xen/guest_access.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -43,6 +44,11 @@ static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
 static long __initdata dom0_max_nrpages = LONG_MAX;
 
+/* Size of the VM86 TSS for virtual 8086 mode to use. */
+#define HVM_VM86_TSS_SIZE   128
+
+static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
+
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
  * 
@@ -304,7 +310,8 @@ static unsigned long __init compute_dom0_nr_pages(
             avail -= max_pdx >> s;
     }
 
-    need_paging = opt_dom0_shadow || (is_pvh_domain(d) && !iommu_hap_pt_share);
+    need_paging = opt_dom0_shadow || (has_hvm_container_domain(d) &&
+                  (!iommu_hap_pt_share || !paging_mode_hap(d)));
     for ( ; ; need_paging = 0 )
     {
         nr_pages = dom0_nrpages;
@@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
         avail -= dom0_paging_pages(d, nr_pages);
     }
 
-    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
+    if ( is_pv_domain(d) &&
+         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
          ((dom0_min_nrpages <= 0) || (nr_pages > min_pages)) )
     {
         /*
@@ -547,11 +555,12 @@ static __init void pvh_map_all_iomem(struct domain *d, unsigned long nr_pages)
     ASSERT(nr_holes == 0);
 }
 
-static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
+static __init void hvm_setup_e820(struct domain *d, unsigned long nr_pages)
 {
     struct e820entry *entry, *entry_guest;
     unsigned int i;
     unsigned long pages, cur_pages = 0;
+    uint64_t start, end;
 
     /*
      * Craft the e820 memory map for Dom0 based on the hardware e820 map.
@@ -579,8 +588,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
             continue;
         }
 
-        *entry_guest = *entry;
-        pages = PFN_UP(entry_guest->size);
+        /*
+         * Make sure the start and length are aligned to PAGE_SIZE, because
+         * that's the minimum granularity of the 2nd stage translation.
+         */
+        start = ROUNDUP(entry->addr, PAGE_SIZE);
+        end = (entry->addr + entry->size) & PAGE_MASK;
+        if ( start >= end )
+            continue;
+
+        entry_guest->type = E820_RAM;
+        entry_guest->addr = start;
+        entry_guest->size = end - start;
+        pages = PFN_DOWN(entry_guest->size);
         if ( (cur_pages + pages) > nr_pages )
         {
             /* Truncate region */
@@ -591,6 +611,8 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
         {
             cur_pages += pages;
         }
+        ASSERT(IS_ALIGNED(entry_guest->addr, PAGE_SIZE) &&
+               IS_ALIGNED(entry_guest->size, PAGE_SIZE));
  next:
         d->arch.nr_e820++;
         entry_guest++;
@@ -1641,7 +1663,7 @@ static int __init construct_dom0_pv(
         dom0_update_physmap(d, pfn, mfn, 0);
 
         pvh_map_all_iomem(d, nr_pages);
-        pvh_setup_e820(d, nr_pages);
+        hvm_setup_e820(d, nr_pages);
     }
 
     if ( d->domain_id == hardware_domid )
@@ -1657,15 +1679,238 @@ out:
     return rc;
 }
 
+/* Populate an HVM memory range using the biggest possible order. */
+static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
+                                             uint64_t size)
+{
+    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
+    unsigned int order;
+    struct page_info *page;
+    int rc;
+
+    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
+
+    order = MAX_ORDER;
+    while ( size != 0 )
+    {
+        order = min(get_order_from_bytes_floor(size), order);
+        page = alloc_domheap_pages(d, order, memflags);
+        if ( page == NULL )
+        {
+            if ( order == 0 && memflags )
+            {
+                /* Try again without any memflags. */
+                memflags = 0;
+                order = MAX_ORDER;
+                continue;
+            }
+            if ( order == 0 )
+                panic("Unable to allocate memory with order 0!\n");
+            order--;
+            continue;
+        }
+
+        hvm_mem_stats[order]++;
+        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
+                                    _mfn(page_to_mfn(page)), order);
+        if ( rc != 0 )
+            panic("Failed to populate memory: [%" PRIx64 " - %" PRIx64 "] %d\n",
+                  start, start + (((uint64_t)1) << (order + PAGE_SHIFT)), rc);
+        start += ((uint64_t)1) << (order + PAGE_SHIFT);
+        size -= ((uint64_t)1) << (order + PAGE_SHIFT);
+        if ( (size & 0xffffffff) == 0 )
+            process_pending_softirqs();
+    }
+
+}
+
+static int __init hvm_setup_vmx_unrestricted_guest(struct domain *d)
+{
+    struct e820entry *entry;
+    p2m_type_t p2mt;
+    uint32_t rc, *ident_pt;
+    uint8_t *tss;
+    mfn_t mfn;
+    paddr_t gaddr = 0;
+    int i;
+
+    /*
+     * Steal some space from the last RAM region found. One page will be
+     * used for the identity page table, and the remaining space for the
+     * VM86 TSS. Note that after this not all e820 regions will be aligned
+     * to PAGE_SIZE.
+     */
+    for ( i = 1; i <= d->arch.nr_e820; i++ )
+    {
+        entry = &d->arch.e820[d->arch.nr_e820 - i];
+        if ( entry->type != E820_RAM ||
+             entry->size < PAGE_SIZE + HVM_VM86_TSS_SIZE )
+            continue;
+
+        entry->size -= PAGE_SIZE + HVM_VM86_TSS_SIZE;
+        gaddr = entry->addr + entry->size;
+        break;
+    }
+
+    if ( gaddr < MB(1) )
+    {
+        printk("Unable to find memory to stash the identity map and TSS\n");
+        return -ENOMEM;
+    }
+
+    /*
+     * Identity-map page table is required for running with CR0.PG=0
+     * when using Intel EPT. Create a 32-bit non-PAE page directory of
+     * superpages.
+     */
+    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                         &mfn, &p2mt, 0, &rc);
+    if ( tss == NULL )
+    {
+        printk("Unable to map VM86 TSS area\n");
+        return -ENOMEM;
+    }
+    tss += (gaddr & ~PAGE_MASK);
+    memset(tss, 0, HVM_VM86_TSS_SIZE);
+    unmap_domain_page(tss);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_VM86_TSS] = gaddr;
+    gaddr += HVM_VM86_TSS_SIZE;
+    ASSERT(IS_ALIGNED(gaddr, PAGE_SIZE));
+
+    ident_pt = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
+                              &mfn, &p2mt, 0, &rc);
+    if ( ident_pt == NULL )
+    {
+        printk("Unable to map identity page tables\n");
+        return -ENOMEM;
+    }
+    for ( i = 0; i < PAGE_SIZE / sizeof(*ident_pt); i++ )
+        ident_pt[i] = ((i << 22) | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER |
+                       _PAGE_ACCESSED | _PAGE_DIRTY | _PAGE_PSE);
+    unmap_domain_page(ident_pt);
+    put_page(mfn_to_page(mfn_x(mfn)));
+    d->arch.hvm_domain.params[HVM_PARAM_IDENT_PT] = gaddr;
+
+    return 0;
+}
+
+static int __init hvm_setup_p2m(struct domain *d)
+{
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    unsigned long nr_pages;
+    int i, rc, preempted;
+
+    printk("** Preparing memory map **\n");
+
+    /*
+     * Subtract one page for the EPT identity page table and two pages
+     * for the MADT replacement.
+     */
+    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
+
+    hvm_setup_e820(d, nr_pages);
+    do {
+        preempted = 0;
+        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
+                              &preempted);
+        process_pending_softirqs();
+    } while ( preempted );
+
+    /*
+     * Special treatment for memory < 1MB:
+     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
+     *  - Map everything else as 1:1.
+     * NB: all this only makes sense if booted from legacy BIOSes.
+     */
+    rc = modify_mmio_11(d, 0, PFN_DOWN(MB(1)), true);
+    if ( rc )
+    {
+        printk("Failed to map low 1MB 1:1: %d\n", rc);
+        return rc;
+    }
+
+    printk("** Populating memory map **\n");
+    /* Populate memory map. */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        hvm_populate_memory_range(d, d->arch.e820[i].addr,
+                                  d->arch.e820[i].size);
+        if ( d->arch.e820[i].addr < MB(1) )
+        {
+            unsigned long end = min_t(unsigned long,
+                            d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
+
+            saved_current = current;
+            set_current(v);
+            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
+                                        maddr_to_virt(d->arch.e820[i].addr),
+                                        end - d->arch.e820[i].addr);
+            set_current(saved_current);
+            if ( rc != HVMCOPY_okay )
+            {
+                printk("Unable to copy RAM region %#lx - %#lx\n",
+                       d->arch.e820[i].addr, end);
+                return -EFAULT;
+            }
+        }
+    }
+
+    printk("Memory allocation stats:\n");
+    for ( i = 0; i <= MAX_ORDER; i++ )
+    {
+        if ( hvm_mem_stats[MAX_ORDER - i] != 0 )
+            printk("Order %2u: %pZ\n", MAX_ORDER - i,
+                   _p(((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
+                      hvm_mem_stats[MAX_ORDER - i]));
+    }
+
+    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
+    {
+        /*
+         * Since Dom0 cannot be migrated, we will only setup the
+         * unrestricted guest helpers if they are needed by the current
+         * hardware we are running on.
+         */
+        rc = hvm_setup_vmx_unrestricted_guest(d);
+        if ( rc )
+            return rc;
+    }
+
+    printk("Dom0 memory map:\n");
+    print_e820_memory_map(d->arch.e820, d->arch.nr_e820);
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    int rc;
 
     printk("** Building a PVH Dom0 **\n");
 
+    /* Sanity! */
+    BUG_ON(d->domain_id != 0);
+    BUG_ON(d->vcpu[0] == NULL);
+
+    process_pending_softirqs();
+
+    iommu_hwdom_init(d);
+
+    rc = hvm_setup_p2m(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 physical memory map\n");
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (14 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-06 15:14   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs Roger Pau Monne
                   ` (14 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Introduce a helper to parse the Dom0 kernel.
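
The helper places everything above the loaded kernel image; the resulting
guest-physical layout is, roughly (illustrative sketch):

    last_addr -> initrd image                (mod.paddr)
              -> command line                (start_info.cmdline_paddr)
              -> struct hvm_modlist_entry    (start_info.modlist_paddr)
              -> struct hvm_start_info       (returned in *start_info_addr)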

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_build.c | 138 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index c590c58..effebf1 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -39,6 +39,7 @@
 #include <asm/hpet.h>
 
 #include <public/version.h>
+#include <public/arch-x86/hvm/start_info.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -1886,12 +1887,141 @@ static int __init hvm_setup_p2m(struct domain *d)
     return 0;
 }
 
+static int __init hvm_load_kernel(struct domain *d, const module_t *image,
+                                  unsigned long image_headroom,
+                                  module_t *initrd, char *image_base,
+                                  char *cmdline, paddr_t *entry,
+                                  paddr_t *start_info_addr)
+{
+    char *image_start = image_base + image_headroom;
+    unsigned long image_len = image->mod_end;
+    struct elf_binary elf;
+    struct elf_dom_parms parms;
+    paddr_t last_addr;
+    struct hvm_start_info start_info;
+    struct hvm_modlist_entry mod;
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    int rc;
+
+    printk("** Parsing Dom0 kernel **\n");
+
+    if ( (rc = bzimage_parse(image_base, &image_start, &image_len)) != 0 )
+    {
+        printk("Error trying to detect bz compressed kernel\n");
+        return rc;
+    }
+
+    if ( (rc = elf_init(&elf, image_start, image_len)) != 0 )
+    {
+        printk("Unable to init ELF\n");
+        return rc;
+    }
+#ifdef VERBOSE
+    elf_set_verbose(&elf);
+#endif
+    elf_parse_binary(&elf);
+    if ( (rc = elf_xen_parse(&elf, &parms)) != 0 )
+    {
+        printk("Unable to parse kernel for ELFNOTES\n");
+        return rc;
+    }
+
+    if ( parms.phys_entry == UNSET_ADDR32 ) {
+        printk("Unable to find kernel entry point, aborting\n");
+        return -EINVAL;
+    }
+
+    printk("OS: %s version: %s loader: %s bitness: %s\n", parms.guest_os,
+           parms.guest_ver, parms.loader,
+           elf_64bit(&elf) ? "64-bit" : "32-bit");
+
+    printk("** Loading Dom0 kernel **\n");
+    /* Copy the OS image and free temporary buffer. */
+    elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
+    elf.dest_size = parms.virt_kend - parms.virt_kstart;
+
+    saved_current = current;
+    set_current(v);
+
+    rc = elf_load_binary(&elf);
+    if ( rc < 0 )
+    {
+        printk("Failed to load kernel: %d\n", rc);
+        printk("Xen dom0 kernel broken ELF: %s\n", elf_check_broken(&elf));
+        goto out;
+    }
+
+    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+    printk("** Copying Dom0 modules **\n");
+
+    rc = hvm_copy_to_guest_phys(last_addr, mfn_to_virt(initrd->mod_start),
+                                initrd->mod_end);
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy initrd to guest\n");
+        rc = -EFAULT;
+        goto out;
+    }
+
+    mod.paddr = last_addr;
+    mod.size = initrd->mod_end;
+    last_addr += ROUNDUP(initrd->mod_end, PAGE_SIZE);
+
+    /* Free temporary buffers. */
+    discard_initial_images();
+
+    printk("** Setting up start-of-day info **\n");
+
+    memset(&start_info, 0, sizeof(start_info));
+    if ( cmdline != NULL )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr, cmdline, strlen(cmdline) + 1);
+        if ( rc != HVMCOPY_okay )
+        {
+            printk("Unable to copy guest command line\n");
+            rc = -EFAULT;
+            goto out;
+        }
+        start_info.cmdline_paddr = last_addr;
+        last_addr += ROUNDUP(strlen(cmdline) + 1, 8);
+    }
+    rc = hvm_copy_to_guest_phys(last_addr, &mod, sizeof(mod));
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy guest modules\n");
+        rc = -EFAULT;
+        goto out;
+    }
+
+    start_info.modlist_paddr = last_addr;
+    start_info.nr_modules = 1;
+    last_addr += sizeof(mod);
+    start_info.magic = XEN_HVM_START_MAGIC_VALUE;
+    start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info));
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy start info to guest\n");
+        rc = -EFAULT;
+        goto out;
+    }
+
+    *entry = parms.phys_entry;
+    *start_info_addr = last_addr;
+    rc = 0;
+
+out:
+    set_current(saved_current);
+    return rc;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
                                      void *(*bootstrap_map)(const module_t *),
                                      char *cmdline)
 {
+    paddr_t entry, start_info;
     int rc;
 
     printk("** Building a PVH Dom0 **\n");
@@ -1911,6 +2041,14 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_load_kernel(d, image, image_headroom, initrd, bootstrap_map(image),
+                         cmdline, &entry, &start_info);
+    if ( rc )
+    {
+        printk("Failed to load Dom0 kernel\n");
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (15 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-06 15:20   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
                   ` (13 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Initialize Dom0 BSP/APs and setup the memory and IO permissions.
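
The BSP state set up below follows the PVH start ABI: entry in 32-bit flat
protected mode with paging disabled and %ebx holding the guest-physical
address of the start info structure. A minimal guest-side sketch, assuming
ebx_at_entry holds the value of %ebx at the entry point:

    const struct hvm_start_info *si = (void *)(uintptr_t)ebx_at_entry;

    if ( si->magic == XEN_HVM_START_MAGIC_VALUE )
        /* parse_cmdline() is a hypothetical guest helper. */
        parse_cmdline((const char *)(uintptr_t)si->cmdline_paddr);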

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
The logic used to set up the CPUID leaves is extremely simplistic (and
probably wrong for hardware different from mine). I'm not sure what the
best way to deal with this is: the code that currently sets the CPUID
leaves for HVM guests lives in libxc, so maybe moving it to xen/common
would be better?
---
 xen/arch/x86/domain_build.c | 103 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index effebf1..8ea54ae 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -40,6 +40,7 @@
 
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
+#include <public/hvm/hvm_vcpu.h>
 
 static long __initdata dom0_nrpages;
 static long __initdata dom0_min_nrpages;
@@ -2015,6 +2016,101 @@ out:
     return rc;
 }
 
+static int __init hvm_setup_cpus(struct domain *d, paddr_t entry,
+                                 paddr_t start_info)
+{
+    vcpu_hvm_context_t cpu_ctx;
+    struct vcpu *v = d->vcpu[0];
+    int cpu, i, rc;
+    struct {
+        uint32_t index;
+        uint32_t count;
+    } cpuid_leaves[] = {
+        {0, XEN_CPUID_INPUT_UNUSED},
+        {1, XEN_CPUID_INPUT_UNUSED},
+        {2, XEN_CPUID_INPUT_UNUSED},
+        {4, 0},
+        {4, 1},
+        {4, 2},
+        {4, 3},
+        {4, 4},
+        {7, 0},
+        {0xa, XEN_CPUID_INPUT_UNUSED},
+        {0xd, 0},
+        {0x80000000, XEN_CPUID_INPUT_UNUSED},
+        {0x80000001, XEN_CPUID_INPUT_UNUSED},
+        {0x80000002, XEN_CPUID_INPUT_UNUSED},
+        {0x80000003, XEN_CPUID_INPUT_UNUSED},
+        {0x80000004, XEN_CPUID_INPUT_UNUSED},
+        {0x80000005, XEN_CPUID_INPUT_UNUSED},
+        {0x80000006, XEN_CPUID_INPUT_UNUSED},
+        {0x80000007, XEN_CPUID_INPUT_UNUSED},
+        {0x80000008, XEN_CPUID_INPUT_UNUSED},
+    };
+
+    printk("** Setting up BSP/APs **\n");
+
+    cpu = v->processor;
+    for ( i = 1; i < d->max_vcpus; i++ )
+    {
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_dom0_vcpu(d, i, cpu);
+    }
+
+    memset(&cpu_ctx, 0, sizeof(cpu_ctx));
+
+    cpu_ctx.mode = VCPU_HVM_MODE_32B;
+
+    cpu_ctx.cpu_regs.x86_32.ebx = start_info;
+    cpu_ctx.cpu_regs.x86_32.eip = entry;
+    cpu_ctx.cpu_regs.x86_32.cr0 = X86_CR0_PE | X86_CR0_ET;
+
+    cpu_ctx.cpu_regs.x86_32.cs_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ds_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.ss_limit = ~0u;
+    cpu_ctx.cpu_regs.x86_32.tr_limit = 0x67;
+    cpu_ctx.cpu_regs.x86_32.cs_ar = 0xc9b;
+    cpu_ctx.cpu_regs.x86_32.ds_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.ss_ar = 0xc93;
+    cpu_ctx.cpu_regs.x86_32.tr_ar = 0x8b;
+
+    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
+    if ( rc )
+    {
+        printk("Unable to setup Dom0 BSP context: %d\n", rc);
+        return rc;
+    }
+    clear_bit(_VPF_down, &v->pause_flags);
+
+    for ( i = 0; i < ARRAY_SIZE(cpuid_leaves); i++ )
+    {
+        d->arch.cpuids[i].input[0] = cpuid_leaves[i].index;
+        d->arch.cpuids[i].input[1] = cpuid_leaves[i].count;
+        if ( d->arch.cpuids[i].input[1] == XEN_CPUID_INPUT_UNUSED )
+            cpuid(d->arch.cpuids[i].input[0], &d->arch.cpuids[i].eax,
+                  &d->arch.cpuids[i].ebx, &d->arch.cpuids[i].ecx,
+                  &d->arch.cpuids[i].edx);
+        else
+            cpuid_count(d->arch.cpuids[i].input[0], d->arch.cpuids[i].input[1],
+                        &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
+                        &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);
+        /* XXX: we need to do much more filtering here. */
+        if ( d->arch.cpuids[i].input[0] == 1 )
+            d->arch.cpuids[i].ecx &= ~cpufeat_mask(X86_FEATURE_VMX);
+    }
+
+    rc = setup_permissions(d);
+    if ( rc )
+    {
+        panic("Unable to setup Dom0 permissions: %d\n", rc);
+        return rc;
+    }
+
+    update_domain_wallclock_time(d);
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2049,6 +2145,13 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_cpus(d, entry, start_info);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 CPUs: %d\n", rc);
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (16 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-06 15:40   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for hardware domain Roger Pau Monne
                   ` (12 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

This maps all the regions in the e820 marked as E820_ACPI or E820_NVS and
the top-level ACPI tables discovered by Xen to Dom0 1:1. It also shadows the
page(s) where the native MADT is placed by mapping a RAM page over it,
copying the original data and modifying it afterwards in order to represent
the real CPU topology exposed to Dom0.
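
The "zapping" used here simply byte-reverses a table's 4-character
signature in place, so the table data stays put but no longer matches any
OS lookup; for example:

    /* acpi_zap_table_signature(ACPI_SIG_HPET) turns the header's   */
    /* signature { 'H', 'P', 'E', 'T' } into { 'T', 'E', 'P', 'H' } */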

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
FWIW, I think the current approach used to craft the MADT is not the best
one. IMHO it would be better to place the MADT at the end of the E820_ACPI
region (expanding its size by one page) and modify the XSDT/RSDT to point
to it; that way we avoid shadowing any other ACPI data that might live in
the same page as the native MADT (and that needs to be modified by Dom0).
---
 xen/arch/x86/domain_build.c | 274 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 274 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 8ea54ae..407f742 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -23,6 +23,7 @@
 #include <xen/libelf.h>
 #include <xen/pfn.h>
 #include <xen/guest_access.h>
+#include <xen/acpi.h>
 #include <asm/regs.h>
 #include <asm/system.h>
 #include <asm/io.h>
@@ -38,6 +39,8 @@
 #include <asm/io_apic.h>
 #include <asm/hpet.h>
 
+#include <acpi/actables.h>
+
 #include <public/version.h>
 #include <public/arch-x86/hvm/start_info.h>
 #include <public/hvm/hvm_vcpu.h>
@@ -50,6 +53,8 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
 #define HVM_VM86_TSS_SIZE   128
 
 static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
+static unsigned int __initdata acpi_intr_overrides;
+static struct acpi_madt_interrupt_override *__initdata intsrcovr;
 
 /*
  * dom0_mem=[min:<min_amt>,][max:<max_amt>,][<amt>]
@@ -1999,6 +2004,7 @@ static int __init hvm_load_kernel(struct domain *d, const module_t *image,
     last_addr += sizeof(mod);
     start_info.magic = XEN_HVM_START_MAGIC_VALUE;
     start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
+    start_info.rsdp_paddr = acpi_os_get_root_pointer();
     rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info));
     if ( rc != HVMCOPY_okay )
     {
@@ -2111,6 +2117,267 @@ static int __init hvm_setup_cpus(struct domain *d, paddr_t entry,
     return 0;
 }
 
+static int __init acpi_count_intr_ov(struct acpi_subtable_header *header,
+                                     const unsigned long end)
+{
+    acpi_intr_overrides++;
+    return 0;
+}
+
+static int __init acpi_set_intr_ov(struct acpi_subtable_header *header,
+                                   const unsigned long end)
+{
+    struct acpi_madt_interrupt_override *intr =
+        container_of(header, struct acpi_madt_interrupt_override, header);
+
+    ACPI_MEMCPY(intsrcovr, intr, sizeof(*intr));
+    intsrcovr++;
+
+    return 0;
+}
+
+static void __init acpi_zap_table_signature(char *name)
+{
+    struct acpi_table_header *table;
+    acpi_status status;
+    union {
+        char str[ACPI_NAME_SIZE];
+        uint32_t bits;
+    } signature;
+    char tmp;
+    int i;
+
+    status = acpi_get_table(name, 0, &table);
+    if ( ACPI_SUCCESS(status) )
+    {
+        memcpy(&signature.str[0], &table->signature[0], ACPI_NAME_SIZE);
+        for ( i = 0; i < ACPI_NAME_SIZE / 2; i++ )
+        {
+            tmp = signature.str[ACPI_NAME_SIZE - i - 1];
+            signature.str[ACPI_NAME_SIZE - i - 1] = signature.str[i];
+            signature.str[i] = tmp;
+        }
+        write_atomic((uint32_t*)&table->signature[0], signature.bits);
+    }
+}
+
+static int __init hvm_setup_acpi(struct domain *d)
+{
+    struct vcpu *saved_current, *v = d->vcpu[0];
+    unsigned long pfn, nr_pages;
+    uint64_t size, start_addr, end_addr;
+    uint64_t madt_addr[2] = { 0, 0 };
+    struct acpi_table_header *table;
+    struct acpi_table_madt *madt;
+    struct acpi_madt_io_apic *io_apic;
+    struct acpi_madt_local_apic *local_apic;
+    acpi_status status;
+    int rc, i;
+
+    printk("** Setup ACPI tables **\n");
+
+    /* ZAP the HPET, SLIT, SRAT, MPST and PMTT tables. */
+    acpi_zap_table_signature(ACPI_SIG_HPET);
+    acpi_zap_table_signature(ACPI_SIG_SLIT);
+    acpi_zap_table_signature(ACPI_SIG_SRAT);
+    acpi_zap_table_signature(ACPI_SIG_MPST);
+    acpi_zap_table_signature(ACPI_SIG_PMTT);
+
+    /* Map ACPI tables 1:1 */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        if ( d->arch.e820[i].type != E820_ACPI &&
+             d->arch.e820[i].type != E820_NVS )
+            continue;
+
+        pfn = PFN_DOWN(d->arch.e820[i].addr);
+        nr_pages = DIV_ROUND_UP(d->arch.e820[i].size, PAGE_SIZE);
+
+        rc = modify_mmio_11(d, pfn, nr_pages, true);
+        if ( rc )
+        {
+            printk("Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+            return rc;
+        }
+    }
+    /*
+     * The E820 memory map provided by the firmware doesn't always cover
+     * every ACPI table, so also make sure each top-level table is properly
+     * mapped into Dom0.
+     */
+    for ( i = 0; i < acpi_gbl_root_table_list.count; i++ )
+    {
+        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
+        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
+                                PAGE_SIZE);
+        rc = modify_mmio_11(d, pfn, nr_pages, true);
+        if ( rc )
+        {
+            printk("Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",
+                   pfn, pfn + nr_pages);
+            return rc;
+        }
+    }
+
+    /*
+     * Special treatment for memory < 1MB:
+     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
+     *  - Map any reserved regions as 1:1.
+     * NB: all this only makes sense if booted from legacy BIOSes.
+     */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        unsigned long end = d->arch.e820[i].addr + d->arch.e820[i].size;
+        end = end > MB(1) ? MB(1) : end;
+
+        if ( d->arch.e820[i].type == E820_RAM )
+        {
+            saved_current = current;
+            set_current(v);
+            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
+                                        maddr_to_virt(d->arch.e820[i].addr),
+                                        end - d->arch.e820[i].addr);
+            set_current(saved_current);
+            if ( rc != HVMCOPY_okay )
+            {
+                printk("Unable to copy RAM region %#lx - %#lx\n",
+                       d->arch.e820[i].addr, end);
+                return -EFAULT;
+            }
+        }
+        else if ( d->arch.e820[i].type == E820_RESERVED )
+        {
+            pfn = PFN_DOWN(d->arch.e820[i].addr);
+            nr_pages = DIV_ROUND_UP(end - d->arch.e820[i].addr, PAGE_SIZE);
+            rc = modify_mmio_11(d, pfn, nr_pages, true);
+            if ( rc )
+            {
+                printk("Unable to map reserved region at %#lx - %#lx: %d\n",
+                       d->arch.e820[i].addr, end, rc);
+                return rc;
+            }
+        }
+        if ( end == MB(1) )
+            break;
+    }
+
+    acpi_get_table_phys(ACPI_SIG_MADT, 0, &madt_addr[0], &size);
+    if ( !madt_addr[0] )
+    {
+        printk("Unable to find ACPI MADT table\n");
+        return -EINVAL;
+    }
+    if ( size > PAGE_SIZE )
+    {
+        printk("MADT table is bigger than PAGE_SIZE, aborting\n");
+        return -EINVAL;
+    }
+
+    acpi_get_table_phys(ACPI_SIG_MADT, 2, &madt_addr[1], &size);
+    if ( madt_addr[1] != 0 && madt_addr[1] != madt_addr[0] )
+    {
+        printk("Multiple MADT tables found, aborting\n");
+        return -EINVAL;
+    }
+
+    /*
+     * Populate the guest physical memory where the MADT resides with empty RAM
+     * pages. This will remove the 1:1 mapping in this area, so that Xen
+     * can modify it without any side-effects.
+     */
+    start_addr = madt_addr[0] & PAGE_MASK;
+    end_addr = PAGE_ALIGN(madt_addr[0] + size);
+    hvm_populate_memory_range(d, start_addr, end_addr - start_addr);
+
+    /* Get the address where the MADT is currently mapped. */
+    status = acpi_get_table(ACPI_SIG_MADT, 0, &table);
+    if ( !ACPI_SUCCESS(status) )
+    {
+        printk("Failed to get MADT ACPI table, aborting.\n");
+        return -EINVAL;
+    }
+
+    /*
+     * Copy the original MADT table (and whatever is around it) to the
+     * guest physmap.
+     */
+    saved_current = current;
+    set_current(v);
+    rc = hvm_copy_to_guest_phys(start_addr,
+                                (void *)((uintptr_t)table & PAGE_MASK),
+                                end_addr - start_addr);
+    set_current(saved_current);
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy original MADT page(s)\n");
+        return -EFAULT;
+    }
+
+    /* Craft a new MADT for the guest */
+
+    /* Count number of interrupt overrides. */
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_count_intr_ov,
+                          MAX_IRQ_SOURCES);
+    size = sizeof(struct acpi_table_madt);
+    size += sizeof(struct acpi_madt_interrupt_override) * acpi_intr_overrides;
+    size += sizeof(struct acpi_madt_io_apic);
+    size += sizeof(struct acpi_madt_local_apic) * dom0_max_vcpus();
+
+    madt = xzalloc_bytes(size);
+    if ( !madt )
+        return -ENOMEM;
+
+    ACPI_MEMCPY(madt, table, sizeof(*madt));
+    madt->address = APIC_DEFAULT_PHYS_BASE;
+    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
+    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
+    io_apic->header.length = sizeof(*io_apic);
+    io_apic->id = 1;
+    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
+
+    if ( dom0_max_vcpus() > num_online_cpus() )
+    {
+        printk("CPU overcommit is not supported for Dom0\n");
+        xfree(madt);
+        return -EINVAL;
+    }
+
+    local_apic = (struct acpi_madt_local_apic *)(io_apic + 1);
+    for ( i = 0; i < dom0_max_vcpus(); i++ )
+    {
+        local_apic->header.type = ACPI_MADT_TYPE_LOCAL_APIC;
+        local_apic->header.length = sizeof(*local_apic);
+        local_apic->processor_id = i;
+        local_apic->id = i * 2;
+        local_apic->lapic_flags = ACPI_MADT_ENABLED;
+        local_apic++;
+    }
+
+    intsrcovr = (struct acpi_madt_interrupt_override *)local_apic;
+    acpi_table_parse_madt(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_set_intr_ov,
+                          MAX_IRQ_SOURCES);
+    ASSERT(((unsigned char *)intsrcovr - (unsigned char *)madt) == size);
+    madt->header.length = size;
+    madt->header.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, madt),
+                                              madt->header.length);
+
+    /* Copy the new MADT table to the guest physmap. */
+    saved_current = current;
+    set_current(v);
+    rc = hvm_copy_to_guest_phys(madt_addr[0], madt, size);
+    set_current(saved_current);
+    if ( rc != HVMCOPY_okay )
+    {
+        printk("Unable to copy modified MADT page(s)\n");
+        xfree(madt);
+        return -EFAULT;
+    }
+
+    xfree(madt);
+
+    return 0;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2152,6 +2419,13 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_acpi(d);
+    if ( rc )
+    {
+        printk("Failed to setup Dom0 ACPI tables: %d\n", rc);
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 19/30] xen/dpci: add a dpci passthrough handler for hardware domain
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (17 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-03  9:02   ` Paul Durrant
  2016-10-06 15:44   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code Roger Pau Monne
                   ` (11 subsequent siblings)
  30 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

This is very similar to the PCI trap used for the traditional PV(H) Dom0.
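
For context, the trapped mechanism is the legacy 0xcf8/0xcfc port pair:
Dom0 writes a 32-bit config address to 0xcf8 and then accesses the data
window at 0xcfc-0xcff. A minimal sketch of the address layout (hypothetical
helper names; Xen's CF8_* macros used by pci_cfg_ok encode the same thing):

    /* 31: enable | 23-16: bus | 15-11: device | 10-8: function | 7-2: reg */
    #define CF8_TO_BUS(cf8)   (((cf8) >> 16) & 0xff)
    #define CF8_TO_DEV(cf8)   (((cf8) >> 11) & 0x1f)
    #define CF8_TO_FUNC(cf8)  (((cf8) >>  8) & 0x07)
    #define CF8_TO_REG(cf8)   ((cf8) & 0xfc)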

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/io.c         | 72 ++++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/traps.c          | 39 -----------------------
 xen/drivers/passthrough/pci.c | 39 +++++++++++++++++++++++
 xen/include/xen/pci.h         |  2 ++
 4 files changed, 112 insertions(+), 40 deletions(-)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 1e7a5f9..31d54dc 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -247,12 +247,79 @@ static int dpci_portio_write(const struct hvm_io_handler *handler,
     return X86EMUL_OKAY;
 }
 
+static bool_t hw_dpci_portio_accept(const struct hvm_io_handler *handler,
+                                    const ioreq_t *p)
+{
+    if ( (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc )
+        return 1;
+
+    return 0;
+}
+
+static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
+                            uint64_t addr,
+                            uint32_t size,
+                            uint64_t *data)
+{
+    struct domain *currd = current->domain;
+
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        *data = currd->arch.pci_cf8;
+        return X86EMUL_OKAY;
+    }
+
+    ASSERT((addr & 0xfffc) == 0xcfc);
+    size = min(size, 4 - ((uint32_t)addr & 3));
+    if ( size == 3 )
+        size = 2;
+    if ( pci_cfg_ok(currd, addr & 3, size, NULL) )
+        *data = pci_conf_read(currd->arch.pci_cf8, addr & 3, size);
+
+    return X86EMUL_OKAY;
+}
+
+static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
+                                uint64_t addr,
+                                uint32_t size,
+                                uint64_t data)
+{
+    struct domain *currd = current->domain;
+    uint32_t data32;
+
+    if ( addr == 0xcf8 )
+    {
+        ASSERT(size == 4);
+        currd->arch.pci_cf8 = data;
+        return X86EMUL_OKAY;
+    }
+
+    ASSERT((addr & 0xfffc) == 0xcfc);
+    size = min(size, 4 - ((uint32_t)addr & 3));
+    if ( size == 3 )
+        size = 2;
+    data32 = data;
+    if ( pci_cfg_ok(currd, addr & 3, size, &data32) )
+        pci_conf_write(currd->arch.pci_cf8, addr & 3, size, data32);
+
+    return X86EMUL_OKAY;
+}
+
 static const struct hvm_io_ops dpci_portio_ops = {
     .accept = dpci_portio_accept,
     .read = dpci_portio_read,
     .write = dpci_portio_write
 };
 
+static const struct hvm_io_ops hw_dpci_portio_ops = {
+    .accept = hw_dpci_portio_accept,
+    .read = hw_dpci_portio_read,
+    .write = hw_dpci_portio_write
+};
+
 void register_dpci_portio_handler(struct domain *d)
 {
     struct hvm_io_handler *handler = hvm_next_io_handler(d);
@@ -261,7 +328,10 @@ void register_dpci_portio_handler(struct domain *d)
         return;
 
     handler->type = IOREQ_TYPE_PIO;
-    handler->ops = &dpci_portio_ops;
+    if ( is_hardware_domain(d) )
+        handler->ops = &hw_dpci_portio_ops;
+    else
+        handler->ops = &dpci_portio_ops;
 }
 
 /*
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 24d173f..f3c5c9e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2076,45 +2076,6 @@ static bool_t admin_io_okay(unsigned int port, unsigned int bytes,
     return ioports_access_permitted(d, port, port + bytes - 1);
 }
 
-static bool_t pci_cfg_ok(struct domain *currd, unsigned int start,
-                         unsigned int size, uint32_t *write)
-{
-    uint32_t machine_bdf;
-
-    if ( !is_hardware_domain(currd) )
-        return 0;
-
-    if ( !CF8_ENABLED(currd->arch.pci_cf8) )
-        return 1;
-
-    machine_bdf = CF8_BDF(currd->arch.pci_cf8);
-    if ( write )
-    {
-        const unsigned long *ro_map = pci_get_ro_map(0);
-
-        if ( ro_map && test_bit(machine_bdf, ro_map) )
-            return 0;
-    }
-    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
-    /* AMD extended configuration space access? */
-    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
-         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
-         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 <= 0x17 )
-    {
-        uint64_t msr_val;
-
-        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
-            return 0;
-        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
-            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
-    }
-
-    return !write ?
-           xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
-                                     start, start + size - 1, 0) == 0 :
-           pci_conf_write_intercept(0, machine_bdf, start, size, write) >= 0;
-}
-
 uint32_t guest_io_read(unsigned int port, unsigned int bytes,
                        struct domain *currd)
 {
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index dd291a2..a53b4c8 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -966,6 +966,45 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
                      PCI_COMMAND, cword & ~PCI_COMMAND_MASTER);
 }
 
+bool_t pci_cfg_ok(struct domain *currd, unsigned int start,
+                  unsigned int size, uint32_t *write)
+{
+    uint32_t machine_bdf;
+
+    if ( !is_hardware_domain(currd) )
+        return 0;
+
+    if ( !CF8_ENABLED(currd->arch.pci_cf8) )
+        return 1;
+
+    machine_bdf = CF8_BDF(currd->arch.pci_cf8);
+    if ( write )
+    {
+        const unsigned long *ro_map = pci_get_ro_map(0);
+
+        if ( ro_map && test_bit(machine_bdf, ro_map) )
+            return 0;
+    }
+    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
+    /* AMD extended configuration space access? */
+    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
+         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 <= 0x17 )
+    {
+        uint64_t msr_val;
+
+        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
+            return 0;
+        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
+            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
+    }
+
+    return !write ?
+           xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
+                                     start, start + size - 1, 0) == 0 :
+           pci_conf_write_intercept(0, machine_bdf, start, size, write) >= 0;
+}
+
 /*
  * scan pci devices to add all existed PCI devices to alldevs_list,
  * and setup pci hierarchy in array bus2bridge.
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 0872401..f191773 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -162,6 +162,8 @@ const char *parse_pci(const char *, unsigned int *seg, unsigned int *bus,
 
 bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
 
+bool_t pci_cfg_ok(struct domain *, unsigned int, unsigned int, uint32_t *);
+
 struct pirq;
 int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
 void msixtbl_pt_unregister(struct domain *, struct pirq *);
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (18 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 19/30] xen/dpci: add a dpci passthrough handler for hardware domain Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-03  9:54   ` Paul Durrant
                     ` (2 more replies)
  2016-09-27 15:57 ` [PATCH v2 21/30] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
                   ` (10 subsequent siblings)
  30 siblings, 3 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Most of this code has been picked up from QEMU and modified so it can be
plugged into the internal Xen IO handlers. The structure of the handlers has
been kept quite similar to QEMU's, so existing handlers can be imported
without a lot of effort.
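
As an illustration (field values made up; the real tables arrive in later
patches of this series), a consumer describes a register group with an
array of hvm_pt_reg_handler entries terminated by a zero-sized sentinel,
wrapped in a hvm_pt_handler_init:

    static struct hvm_pt_reg_handler example_handlers[] = {
        /* PCI command register: emulate INTx disable, pass the rest. */
        {
            .offset   = PCI_COMMAND,
            .size     = 2,
            .init_val = 0x0000,
            .emu_mask = 0x0400,
        },
        /* size == 0 terminates the table (see hwdom_add_device). */
        { .size = 0 },
    };

    static struct hvm_pt_handler_init example_init = {
        .handlers = example_handlers,
        .init     = example_group_init, /* hypothetical group setup hook */
    };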

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
 docs/misc/xen-command-line.markdown |   8 +
 xen/arch/x86/hvm/hvm.c              |   2 +
 xen/arch/x86/hvm/io.c               | 621 ++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/domain.h    |   4 +
 xen/include/asm-x86/hvm/io.h        | 176 ++++++++++
 xen/include/xen/pci.h               |   5 +
 6 files changed, 816 insertions(+)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 59d7210..78130c8 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -670,6 +670,14 @@ Flag that makes a 64bit dom0 boot in PVH mode. No 32bit support at present.
 
 Flag that makes a dom0 boot in PVHv2 mode.
 
+### dom0permissive
+> `= <boolean>`
+
+> Default: `true`
+
+Select whether PCI pass-through for a PVHv2 Dom0 is permissive: in permissive
+mode, writes to configuration space fields not emulated by Xen are passed
+through to the hardware.
+
 ### dtuart (ARM)
 > `= path [:options]`
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index a291f82..bc4f7bc 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -632,6 +632,8 @@ int hvm_domain_initialise(struct domain *d)
             goto fail1;
         }
         memset(d->arch.hvm_domain.io_bitmap, ~0, HVM_IOBITMAP_SIZE);
+        INIT_LIST_HEAD(&d->arch.hvm_domain.pt_devices);
+        rwlock_init(&d->arch.hvm_domain.pt_lock);
     }
     else
         d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 31d54dc..7de1de3 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -46,6 +46,10 @@
 #include <xen/iocap.h>
 #include <public/hvm/ioreq.h>
 
+/* Set permissive mode for HVM Dom0 PCI pass-through by default */
+static bool_t opt_dom0permissive = 1;
+boolean_param("dom0permissive", opt_dom0permissive);
+
 void send_timeoffset_req(unsigned long timeoff)
 {
     ioreq_t p = {
@@ -258,12 +262,403 @@ static bool_t hw_dpci_portio_accept(const struct hvm_io_handler *handler,
     return 0;
 }
 
+static struct hvm_pt_device *hw_dpci_get_device(struct domain *d)
+{
+    uint8_t bus, slot, func;
+    uint32_t addr;
+    struct hvm_pt_device *dev;
+
+    /* Decode bus, slot and func. */
+    addr = CF8_BDF(d->arch.pci_cf8);
+    bus = PCI_BUS(addr);
+    slot = PCI_SLOT(addr);
+    func = PCI_FUNC(addr);
+
+    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
+    {
+        if ( dev->pdev->seg != 0 || dev->pdev->bus != bus ||
+             dev->pdev->devfn != PCI_DEVFN(slot, func) )
+            continue;
+
+        return dev;
+    }
+
+    return NULL;
+}
+
+/* Dispatchers */
+
+/* Find emulate register group entry */
+struct hvm_pt_reg_group *hvm_pt_find_reg_grp(struct hvm_pt_device *d,
+                                             uint32_t address)
+{
+    struct hvm_pt_reg_group *entry = NULL;
+
+    /* Find register group entry */
+    list_for_each_entry( entry, &d->register_groups, entries )
+    {
+        /* check address */
+        if ( (entry->base_offset <= address)
+             && ((entry->base_offset + entry->size) > address) )
+            return entry;
+    }
+
+    /* Group entry not found */
+    return NULL;
+}
+
+/* Find emulate register entry */
+struct hvm_pt_reg *hvm_pt_find_reg(struct hvm_pt_reg_group *reg_grp,
+                                   uint32_t address)
+{
+    struct hvm_pt_reg *reg_entry = NULL;
+    struct hvm_pt_reg_handler *handler = NULL;
+    uint32_t real_offset = 0;
+
+    /* Find register entry */
+    list_for_each_entry( reg_entry, &reg_grp->registers, entries )
+    {
+        handler = reg_entry->handler;
+        real_offset = reg_grp->base_offset + handler->offset;
+        /* Check address */
+        if ( (real_offset <= address)
+             && ((real_offset + handler->size) > address) )
+            return reg_entry;
+    }
+
+    return NULL;
+}
+
+static int hvm_pt_pci_config_access_check(struct hvm_pt_device *d,
+                                          uint32_t addr, int len)
+{
+    /* Check offset range */
+    if ( addr >= 0xFF )
+    {
+        printk_pdev(d->pdev, XENLOG_DEBUG,
+            "failed to access register with offset exceeding 0xFF. "
+            "(addr: 0x%02x, len: %d)\n", addr, len);
+        return -EDOM;
+    }
+
+    /* Check read size */
+    if ( (len != 1) && (len != 2) && (len != 4) )
+    {
+        printk_pdev(d->pdev, XENLOG_DEBUG,
+            "failed to access register with invalid access length. "
+            "(addr: 0x%02x, len: %d)\n", addr, len);
+        return -EINVAL;
+    }
+
+    /* Check offset alignment */
+    if ( addr & (len - 1) )
+    {
+        printk_pdev(d->pdev, XENLOG_DEBUG,
+            "failed to access register with invalid access size "
+            "alignment. (addr: 0x%02x, len: %d)\n", addr, len);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
+                                  uint32_t *data, int len)
+{
+    uint32_t val = 0;
+    struct hvm_pt_reg_group *reg_grp_entry = NULL;
+    struct hvm_pt_reg *reg_entry = NULL;
+    int rc = 0;
+    int emul_len = 0;
+    uint32_t find_addr = addr;
+    unsigned int seg = d->pdev->seg;
+    unsigned int bus = d->pdev->bus;
+    unsigned int slot = PCI_SLOT(d->pdev->devfn);
+    unsigned int func = PCI_FUNC(d->pdev->devfn);
+
+    /* Sanity checks. */
+    if ( hvm_pt_pci_config_access_check(d, addr, len) )
+        return X86EMUL_UNHANDLEABLE;
+
+    /* Find register group entry. */
+    reg_grp_entry = hvm_pt_find_reg_grp(d, addr);
+    if ( reg_grp_entry == NULL )
+        return X86EMUL_UNHANDLEABLE;
+
+    /* Read I/O device register value. */
+    switch ( len )
+    {
+    case 1:
+        val = pci_conf_read8(seg, bus, slot, func, addr);
+        break;
+    case 2:
+        val = pci_conf_read16(seg, bus, slot, func, addr);
+        break;
+    case 4:
+        val = pci_conf_read32(seg, bus, slot, func, addr);
+        break;
+    default:
+        BUG();
+    }
+
+    /* Adjust the read value to appropriate CFC-CFF window. */
+    val <<= (addr & 3) << 3;
+    emul_len = len;
+
+    /* Loop around the guest requested size. */
+    while ( emul_len > 0 )
+    {
+        /* Find register entry to be emulated. */
+        reg_entry = hvm_pt_find_reg(reg_grp_entry, find_addr);
+        if ( reg_entry )
+        {
+            struct hvm_pt_reg_handler *handler = reg_entry->handler;
+            uint32_t real_offset = reg_grp_entry->base_offset + handler->offset;
+            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
+            uint8_t *ptr_val = NULL;
+
+            valid_mask <<= (find_addr - real_offset) << 3;
+            ptr_val = (uint8_t *)&val + (real_offset & 3);
+
+            /* Do emulation based on register size. */
+            switch ( handler->size )
+            {
+            case 1:
+                if ( handler->u.b.read )
+                    rc = handler->u.b.read(d, reg_entry, ptr_val, valid_mask);
+                break;
+            case 2:
+                if ( handler->u.w.read )
+                    rc = handler->u.w.read(d, reg_entry, (uint16_t *)ptr_val,
+                                           valid_mask);
+                break;
+            case 4:
+                if ( handler->u.dw.read )
+                    rc = handler->u.dw.read(d, reg_entry, (uint32_t *)ptr_val,
+                                            valid_mask);
+                break;
+            }
+
+            if ( rc < 0 )
+            {
+                gdprintk(XENLOG_WARNING,
+                         "Invalid read emulation, shutting down domain\n");
+                domain_crash(current->domain);
+                return X86EMUL_UNHANDLEABLE;
+            }
+
+            /* Calculate next address to find. */
+            emul_len -= handler->size;
+            if ( emul_len > 0 )
+                find_addr = real_offset + handler->size;
+        }
+        else
+        {
+            /* Nothing to do with passthrough type register */
+            emul_len--;
+            find_addr++;
+        }
+    }
+
+    /* Need to shift back before returning them to pci bus emulator */
+    val >>= ((addr & 3) << 3);
+    *data = val;
+
+    return X86EMUL_OKAY;
+}
+
+static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
+                                    uint32_t val, int len)
+{
+    int index = 0;
+    struct hvm_pt_reg_group *reg_grp_entry = NULL;
+    int rc = 0;
+    uint32_t read_val = 0, wb_mask;
+    int emul_len = 0;
+    struct hvm_pt_reg *reg_entry = NULL;
+    uint32_t find_addr = addr;
+    struct hvm_pt_reg_handler *handler = NULL;
+    bool wp_flag = false;
+    unsigned int seg = d->pdev->seg;
+    unsigned int bus = d->pdev->bus;
+    unsigned int slot = PCI_SLOT(d->pdev->devfn);
+    unsigned int func = PCI_FUNC(d->pdev->devfn);
+
+    /* Sanity checks. */
+    if ( hvm_pt_pci_config_access_check(d, addr, len) )
+        return X86EMUL_UNHANDLEABLE;
+
+    /* Find register group entry. */
+    reg_grp_entry = hvm_pt_find_reg_grp(d, addr);
+    if ( reg_grp_entry == NULL )
+        return X86EMUL_UNHANDLEABLE;
+
+    /* Read I/O device register value. */
+    switch ( len )
+    {
+    case 1:
+        read_val = pci_conf_read8(seg, bus, slot, func, addr);
+        break;
+    case 2:
+        read_val = pci_conf_read16(seg, bus, slot, func, addr);
+        break;
+    case 4:
+        read_val = pci_conf_read32(seg, bus, slot, func, addr);
+        break;
+    default:
+        BUG();
+    }
+    wb_mask = 0xFFFFFFFF >> ((4 - len) << 3);
+
+    /* Adjust the read and write value to appropriate CFC-CFF window */
+    read_val <<= (addr & 3) << 3;
+    val <<= (addr & 3) << 3;
+    emul_len = len;
+
+    /* Loop around the guest requested size */
+    while ( emul_len > 0 )
+    {
+        /* Find register entry to be emulated */
+        reg_entry = hvm_pt_find_reg(reg_grp_entry, find_addr);
+        if ( reg_entry )
+        {
+            uint32_t real_offset, valid_mask, wp_mask;
+            uint8_t *ptr_val = NULL;
+
+            handler = reg_entry->handler;
+            real_offset = reg_grp_entry->base_offset + handler->offset;
+            valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
+            wp_mask = handler->emu_mask | handler->ro_mask;
+
+            valid_mask <<= (find_addr - real_offset) << 3;
+            ptr_val = (uint8_t *)&val + (real_offset & 3);
+            if ( !d->permissive )
+                wp_mask |= handler->res_mask;
+            if ( wp_mask == (0xFFFFFFFF >> ((4 - handler->size) << 3)) )
+                wb_mask &= ~((wp_mask >> ((find_addr - real_offset) << 3))
+                             << ((len - emul_len) << 3));
+
+            /* Do emulation based on register size */
+            switch ( handler->size )
+            {
+            case 1:
+                if ( handler->u.b.write )
+                    rc = handler->u.b.write(d, reg_entry, ptr_val,
+                                            read_val >> ((real_offset & 3) << 3),
+                                            valid_mask);
+                break;
+            case 2:
+                if ( handler->u.w.write )
+                    rc = handler->u.w.write(d, reg_entry, (uint16_t *)ptr_val,
+                                            (read_val >> ((real_offset & 3) << 3)),
+                                            valid_mask);
+                break;
+            case 4:
+                if ( handler->u.dw.write )
+                    rc = handler->u.dw.write(d, reg_entry, (uint32_t *)ptr_val,
+                                             (read_val >> ((real_offset & 3) << 3)),
+                                             valid_mask);
+                break;
+            }
+
+            if ( rc < 0 )
+            {
+                gdprintk(XENLOG_WARNING,
+                         "Invalid write emulation, shutting down domain\n");
+                domain_crash(current->domain);
+                return X86EMUL_UNHANDLEABLE;
+            }
+
+            /* Calculate next address to find */
+            emul_len -= handler->size;
+            if ( emul_len > 0 )
+                find_addr = real_offset + handler->size;
+        }
+        else
+        {
+            /* Nothing to do with passthrough type register */
+            if ( !d->permissive )
+            {
+                wb_mask &= ~(0xff << ((len - emul_len) << 3));
+                /*
+                 * Unused BARs will make it here, but we don't want to issue
+                 * warnings for writes to them (bogus writes get dealt with
+                 * above).
+                 */
+                if ( index < 0 )
+                    wp_flag = true;
+            }
+            emul_len--;
+            find_addr++;
+        }
+    }
+
+    /* Need to shift back before passing them to xen_host_pci_set_block */
+    val >>= (addr & 3) << 3;
+
+    if ( wp_flag && !d->permissive_warned )
+    {
+        d->permissive_warned = true;
+        gdprintk(XENLOG_WARNING,
+          "Write-back to unknown field 0x%02x (partially) inhibited (0x%0*x)\n",
+          addr, len * 2, wb_mask);
+        gdprintk(XENLOG_WARNING,
+          "If the device doesn't work, try enabling permissive mode\n");
+        gdprintk(XENLOG_WARNING,
+          "(unsafe) and if it helps report the problem to xen-devel\n");
+    }
+    for ( index = 0; wb_mask; index += len )
+    {
+        /* Unknown regs are passed through */
+        while ( !(wb_mask & 0xff) )
+        {
+            index++;
+            wb_mask >>= 8;
+        }
+        len = 0;
+        do {
+            len++;
+            wb_mask >>= 8;
+        } while ( wb_mask & 0xff );
+
+        switch ( len )
+        {
+        case 1:
+        {
+            uint8_t value;
+            memcpy(&value, (uint8_t *)&val + index, 1);
+            pci_conf_write8(seg, bus, slot, func, addr + index, value);
+            break;
+        }
+        case 2:
+        {
+            uint16_t value;
+            memcpy(&value, (uint8_t *)&val + index, 2);
+            pci_conf_write16(seg, bus, slot, func, addr + index, value);
+            break;
+        }
+        case 4:
+        {
+            uint32_t value;
+            memcpy(&value, (uint8_t *)&val + index, 4);
+            pci_conf_write32(seg, bus, slot, func, addr + index, value);
+            break;
+        }
+        default:
+            BUG();
+        }
+    }
+    return X86EMUL_OKAY;
+}
+
 static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
                             uint64_t addr,
                             uint32_t size,
                             uint64_t *data)
 {
     struct domain *currd = current->domain;
+    struct hvm_pt_device *dev;
+    uint32_t data32;
+    uint8_t reg;
+    int rc;
 
     if ( addr == 0xcf8 )
     {
@@ -276,6 +671,22 @@ static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
     size = min(size, 4 - ((uint32_t)addr & 3));
     if ( size == 3 )
         size = 2;
+
+    read_lock(&currd->arch.hvm_domain.pt_lock);
+    dev = hw_dpci_get_device(currd);
+    if ( dev != NULL )
+    {
+        reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
+        rc = hvm_pt_pci_read_config(dev, reg, &data32, size);
+        if ( rc == X86EMUL_OKAY )
+        {
+            read_unlock(&currd->arch.hvm_domain.pt_lock);
+            *data = data32;
+            return rc;
+        }
+    }
+    read_unlock(&currd->arch.hvm_domain.pt_lock);
+
     if ( pci_cfg_ok(currd, addr & 3, size, NULL) )
         *data = pci_conf_read(currd->arch.pci_cf8, addr & 3, size);
 
@@ -288,7 +699,10 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
                                 uint64_t data)
 {
     struct domain *currd = current->domain;
+    struct hvm_pt_device *dev;
     uint32_t data32;
+    uint8_t reg;
+    int rc;
 
     if ( addr == 0xcf8 )
     {
@@ -302,12 +716,219 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
     if ( size == 3 )
         size = 2;
     data32 = data;
+
+    read_lock(&currd->arch.hvm_domain.pt_lock);
+    dev = hw_dpci_get_device(currd);
+    if ( dev != NULL )
+    {
+        reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
+        rc = hvm_pt_pci_write_config(dev, reg, data32, size);
+        if ( rc == X86EMUL_OKAY )
+        {
+            read_unlock(&currd->arch.hvm_domain.pt_lock);
+            return rc;
+        }
+    }
+    read_unlock(&currd->arch.hvm_domain.pt_lock);
+
     if ( pci_cfg_ok(currd, addr & 3, size, &data32) )
         pci_conf_write(currd->arch.pci_cf8, addr & 3, size, data32);
 
     return X86EMUL_OKAY;
 }
 
+static void hvm_pt_free_device(struct hvm_pt_device *dev)
+{
+    struct hvm_pt_reg_group *group, *g;
+
+    list_for_each_entry_safe( group, g, &dev->register_groups, entries )
+    {
+        struct hvm_pt_reg *reg, *r;
+
+        list_for_each_entry_safe( reg, r, &group->registers, entries )
+        {
+            list_del(&reg->entries);
+            xfree(reg);
+        }
+
+        list_del(&group->entries);
+        xfree(group);
+    }
+
+    xfree(dev);
+}
+
+static int hvm_pt_add_register(struct hvm_pt_device *dev,
+                               struct hvm_pt_reg_group *group,
+                               struct hvm_pt_reg_handler *handler)
+{
+    struct pci_dev *pdev = dev->pdev;
+    struct hvm_pt_reg *reg;
+
+    reg = xzalloc(struct hvm_pt_reg);
+    if ( reg == NULL )
+        return -ENOMEM;
+
+    reg->handler = handler;
+    if ( handler->init != NULL )
+    {
+        uint32_t host_mask, size_mask, data = 0;
+        uint8_t seg, bus, slot, func;
+        unsigned int offset;
+        uint32_t val;
+        int rc;
+
+        /* Initialize emulate register */
+        rc = handler->init(dev, reg->handler,
+                           group->base_offset + reg->handler->offset, &data);
+        if ( rc < 0 )
+        {
+            xfree(reg);
+            return rc;
+        }
+
+        if ( data == HVM_PT_INVALID_REG )
+        {
+            xfree(reg);
+            return 0;
+        }
+
+        /* Sync up the data to val */
+        offset = group->base_offset + reg->handler->offset;
+        size_mask = 0xFFFFFFFF >> ((4 - reg->handler->size) << 3);
+
+        seg = pdev->seg;
+        bus = pdev->bus;
+        slot = PCI_SLOT(pdev->devfn);
+        func = PCI_FUNC(pdev->devfn);
+
+        switch ( reg->handler->size )
+        {
+        case 1:
+            val = pci_conf_read8(seg, bus, slot, func, offset);
+            break;
+        case 2:
+            val = pci_conf_read16(seg, bus, slot, func, offset);
+            break;
+        case 4:
+            val = pci_conf_read32(seg, bus, slot, func, offset);
+            break;
+        default:
+            BUG();
+        }
+
+        /*
+         * Set bits in emu_mask are the ones we emulate. The reg shall
+         * contain the emulated view of the guest - therefore we flip
+         * the mask to mask out the host values (which reg initially
+         * has).
+         */
+        host_mask = size_mask & ~reg->handler->emu_mask;
+
+        if ( (data & host_mask) != (val & host_mask) )
+        {
+            uint32_t new_val;
+
+            /* Mask out host (including past size). */
+            new_val = val & host_mask;
+            /* Merge emulated ones (excluding the non-emulated ones). */
+            new_val |= data & host_mask;
+            /*
+             * Leave intact host and emulated values past the size -
+             * even though we do not care as we write per reg->size
+             * granularity, but for the logging below lets have the
+             * proper value.
+             */
+            new_val |= ((val | data)) & ~size_mask;
+            printk_pdev(pdev, XENLOG_ERR,
+"offset 0x%04x mismatch! Emulated=0x%04x, host=0x%04x, syncing to 0x%04x.\n",
+                        offset, data, val, new_val);
+            val = new_val;
+        }
+        else
+            val = data;
+
+        if ( val & ~size_mask )
+        {
+            printk_pdev(pdev, XENLOG_ERR,
+                    "Offset 0x%04x:0x%04x expands past register size(%d)!\n",
+                        offset, val, reg->handler->size);
+            xfree(reg);
+            return -EINVAL;
+        }
+
+        reg->val.dword = val;
+    }
+    list_add_tail(&reg->entries, &group->registers);
+
+    return 0;
+}
+
+static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
+};
+
+int hwdom_add_device(struct pci_dev *pdev)
+{
+    struct domain *d = pdev->domain;
+    struct hvm_pt_device *dev;
+    int j, i, rc;
+
+    ASSERT(is_hardware_domain(d));
+    ASSERT(pcidevs_locked());
+
+    dev = xzalloc(struct hvm_pt_device);
+    if ( dev == NULL )
+        return -ENOMEM;
+
+    dev->pdev = pdev;
+    INIT_LIST_HEAD(&dev->register_groups);
+
+    dev->permissive = opt_dom0permissive;
+
+    for ( j = 0; j < ARRAY_SIZE(hwdom_pt_handlers); j++ )
+    {
+        struct hvm_pt_handler_init *handler_init = hwdom_pt_handlers[j];
+        struct hvm_pt_reg_group *group;
+
+        group = xmalloc(struct hvm_pt_reg_group);
+        if ( group == NULL )
+        {
+            hvm_pt_free_device(dev);
+            return -ENOMEM;
+        }
+        INIT_LIST_HEAD(&group->registers);
+
+        rc = handler_init->init(dev, group);
+        if ( rc == 0 )
+        {
+            for ( i = 0; handler_init->handlers[i].size != 0; i++ )
+            {
+                rc = hvm_pt_add_register(dev, group,
+                                         &handler_init->handlers[i]);
+                if ( rc )
+                {
+                    printk_pdev(pdev, XENLOG_ERR, "error adding register: %d\n",
+                                rc);
+                    hvm_pt_free_device(dev);
+                    return rc;
+                }
+            }
+
+            list_add_tail(&group->entries, &dev->register_groups);
+        }
+        else
+            xfree(group);
+    }
+
+    write_lock(&d->arch.hvm_domain.pt_lock);
+    list_add_tail(&dev->entries, &d->arch.hvm_domain.pt_devices);
+    write_unlock(&d->arch.hvm_domain.pt_lock);
+    printk_pdev(pdev, XENLOG_DEBUG, "added for pass-through\n");
+
+    return 0;
+}
+
 static const struct hvm_io_ops dpci_portio_ops = {
     .accept = dpci_portio_accept,
     .read = dpci_portio_read,
diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
index f34d784..1b1a52f 100644
--- a/xen/include/asm-x86/hvm/domain.h
+++ b/xen/include/asm-x86/hvm/domain.h
@@ -152,6 +152,10 @@ struct hvm_domain {
         struct vmx_domain vmx;
         struct svm_domain svm;
     };
+
+    /* List of passed-through devices (hw domain only). */
+    struct list_head pt_devices;
+    rwlock_t pt_lock;
 };
 
 #define hap_enabled(d)  ((d)->arch.hvm_domain.hap_enabled)
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index e9b3f83..80f830d 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -153,6 +153,182 @@ extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
 
 void register_dpci_portio_handler(struct domain *d);
 
+/* Structures for pci-passthrough state and handlers. */
+struct hvm_pt_device;
+struct hvm_pt_reg_handler;
+struct hvm_pt_reg;
+struct hvm_pt_reg_group;
+
+/* Return code when register should be ignored. */
+#define HVM_PT_INVALID_REG 0xFFFFFFFF
+
+/* function type for config reg */
+typedef int (*hvm_pt_conf_reg_init)
+    (struct hvm_pt_device *, struct hvm_pt_reg_handler *, uint32_t real_offset,
+     uint32_t *data);
+
+typedef int (*hvm_pt_conf_dword_write)
+    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
+     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
+typedef int (*hvm_pt_conf_word_write)
+    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
+     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
+typedef int (*hvm_pt_conf_byte_write)
+    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
+     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
+typedef int (*hvm_pt_conf_dword_read)
+    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
+     uint32_t *val, uint32_t valid_mask);
+typedef int (*hvm_pt_conf_word_read)
+    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
+     uint16_t *val, uint16_t valid_mask);
+typedef int (*hvm_pt_conf_byte_read)
+    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
+     uint8_t *val, uint8_t valid_mask);
+
+typedef int (*hvm_pt_group_init)
+    (struct hvm_pt_device *, struct hvm_pt_reg_group *);
+
+/*
+ * Emulated register information.
+ *
+ * This should be shared between all the consumers that trap on accesses
+ * to certain PCI registers.
+ */
+struct hvm_pt_reg_handler {
+    uint32_t offset;
+    uint32_t size;
+    uint32_t init_val;
+    /* reg reserved field mask (ON:reserved, OFF:defined) */
+    uint32_t res_mask;
+    /* reg read only field mask (ON:RO/ROS, OFF:other) */
+    uint32_t ro_mask;
+    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
+    uint32_t rw1c_mask;
+    /* reg emulate field mask (ON:emu, OFF:passthrough) */
+    uint32_t emu_mask;
+    hvm_pt_conf_reg_init init;
+    /* read/write function pointer
+     * for double_word/word/byte size */
+    union {
+        struct {
+            hvm_pt_conf_dword_write write;
+            hvm_pt_conf_dword_read read;
+        } dw;
+        struct {
+            hvm_pt_conf_word_write write;
+            hvm_pt_conf_word_read read;
+        } w;
+        struct {
+            hvm_pt_conf_byte_write write;
+            hvm_pt_conf_byte_read read;
+        } b;
+    } u;
+};
+
+struct hvm_pt_handler_init {
+    struct hvm_pt_reg_handler *handlers;
+    hvm_pt_group_init init;
+};
+
+/*
+ * Emulated register value.
+ *
+ * This is the representation of each specific emulated register.
+ */
+struct hvm_pt_reg {
+    struct list_head entries;
+    struct hvm_pt_reg_handler *handler;
+    union {
+        uint8_t   byte;
+        uint16_t  word;
+        uint32_t  dword;
+    } val;
+};
+
+/*
+ * Emulated register group.
+ *
+ * In order to speed up (and logically group) emulated registers search,
+ * groups are used that represent specific emulated features, like MSI.
+ */
+struct hvm_pt_reg_group {
+    struct list_head entries;
+    uint32_t base_offset;
+    uint8_t size;
+    struct list_head registers;
+};
+
+/*
+ * Guest MSI information.
+ *
+ * MSI values set by the guest.
+ */
+struct hvm_pt_msi {
+    uint16_t flags;
+    uint32_t addr_lo;  /* guest message address */
+    uint32_t addr_hi;  /* guest message upper address */
+    uint16_t data;     /* guest message data */
+    uint32_t ctrl_offset; /* saved control offset */
+    int pirq;          /* guest pirq corresponding */
+    bool_t initialized;  /* when guest MSI is initialized */
+    bool_t mapped;       /* when pirq is mapped */
+};
+
+/*
+ * Guest passed-through PCI device.
+ */
+struct hvm_pt_device {
+    struct list_head entries;
+
+    struct pci_dev *pdev;
+
+    bool_t permissive;
+    bool_t permissive_warned;
+
+    /* MSI status. */
+    struct hvm_pt_msi msi;
+
+    struct list_head register_groups;
+};
+
+/*
+ * The hierarchy of the above structures is the following:
+ *
+ * +---------------+         +---------------+
+ * |               | entries |               | ...
+ * | hvm_pt_device +---------+ hvm_pt_device +----+
+ * |               |         |               |
+ * +-+-------------+         +---------------+
+ *   |
+ *   | register_groups
+ *   |
+ * +-v----------------+          +------------------+
+ * |                  | entries  |                  | ...
+ * | hvm_pt_reg_group +----------+ hvm_pt_reg_group +----+
+ * |                  |          |                  |
+ * +-+----------------+          +------------------+
+ *   |
+ *   | registers
+ *   |
+ * +-v----------+            +------------+
+ * |            | entries    |            | ...
+ * | hvm_pt_reg +------------+ hvm_pt_reg +----+
+ * |            |            |            |
+ * +-+----------+            +-+----------+
+ *   |                         |
+ *   | handler                 | handler
+ *   |                         |
+ * +-v------------------+    +-v------------------+
+ * |                    |    |                    |
+ * | hvm_pt_reg_handler |    | hvm_pt_reg_handler |
+ * |                    |    |                    |
+ * +--------------------+    +--------------------+
+ */
+
+/* Helper to add passed-through devices to the hardware domain. */
+int hwdom_add_device(struct pci_dev *pdev);
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index f191773..b21a891 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -90,6 +90,11 @@ struct pci_dev {
     u64 vf_rlen[6];
 };
 
+/* Helper for printing pci_dev related messages. */
+#define printk_pdev(pdev, lvl, fmt, ...)                                  \
+    printk(lvl "PCI %04x:%02x:%02x.%u: " fmt, (pdev)->seg, (pdev)->bus,  \
+           PCI_SLOT((pdev)->devfn), PCI_FUNC((pdev)->devfn), ##__VA_ARGS__)
+
 #define for_each_pdev(domain, pdev) \
     list_for_each_entry(pdev, &(domain->arch.pdev_list), domain_list)
 
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 21/30] xen/pci: split code to size BARs from pci_add_device
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (19 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-06 16:00   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping Roger Pau Monne
                   ` (9 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne, Jan Beulich

Split the BAR sizing code out of pci_add_device into a separate helper,
because it's also going to be used by other code.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
---
 xen/drivers/passthrough/pci.c | 86 ++++++++++++++++++++++++++-----------------
 1 file changed, 53 insertions(+), 33 deletions(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index a53b4c8..6d831dd 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -587,6 +587,52 @@ static void pci_enable_acs(struct pci_dev *pdev)
     pci_conf_write16(seg, bus, dev, func, pos + PCI_ACS_CTRL, ctrl);
 }
 
+static int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
+                        unsigned int func, unsigned int base,
+                        unsigned int max_bars, unsigned int *index,
+                        uint64_t *addr, uint64_t *size)
+{
+    unsigned int idx = base + *index * 4;
+    u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
+    u32 hi = 0;
+
+    *addr = *size = 0;
+
+    ASSERT((bar & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_MEMORY);
+    pci_conf_write32(seg, bus, slot, func, idx, ~0);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        if ( *index >= max_bars )
+        {
+            printk(XENLOG_WARNING
+                   "device %04x:%02x:%02x.%u with 64-bit BAR in last slot\n",
+                   seg, bus, slot, func);
+            return -EINVAL;
+        }
+        hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
+        pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
+    }
+    *size = pci_conf_read32(seg, bus, slot, func, idx) &
+            PCI_BASE_ADDRESS_MEM_MASK;
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+    {
+        *size |= (u64)pci_conf_read32(seg, bus, slot, func, idx + 4) << 32;
+        pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
+    }
+    else if ( *size )
+        *size |= (u64)~0 << 32;
+    pci_conf_write32(seg, bus, slot, func, idx, bar);
+    *size = -*size;
+    *addr = (bar & PCI_BASE_ADDRESS_MEM_MASK) | ((u64)hi << 32);
+    if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+         PCI_BASE_ADDRESS_MEM_TYPE_64 )
+        ++*index;
+
+    return 0;
+}
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
@@ -651,7 +697,7 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
             {
                 unsigned int idx = pos + PCI_SRIOV_BAR + i * 4;
                 u32 bar = pci_conf_read32(seg, bus, slot, func, idx);
-                u32 hi = 0;
+                uint64_t addr;
 
                 if ( (bar & PCI_BASE_ADDRESS_SPACE) ==
                      PCI_BASE_ADDRESS_SPACE_IO )
@@ -662,38 +708,12 @@ int pci_add_device(u16 seg, u8 bus, u8 devfn,
                            seg, bus, slot, func, i);
                     continue;
                 }
-                pci_conf_write32(seg, bus, slot, func, idx, ~0);
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    if ( i >= PCI_SRIOV_NUM_BARS )
-                    {
-                        printk(XENLOG_WARNING
-                               "SR-IOV device %04x:%02x:%02x.%u with 64-bit"
-                               " vf BAR in last slot\n",
-                               seg, bus, slot, func);
-                        break;
-                    }
-                    hi = pci_conf_read32(seg, bus, slot, func, idx + 4);
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, ~0);
-                }
-                pdev->vf_rlen[i] = pci_conf_read32(seg, bus, slot, func, idx) &
-                                   PCI_BASE_ADDRESS_MEM_MASK;
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                {
-                    pdev->vf_rlen[i] |= (u64)pci_conf_read32(seg, bus,
-                                                             slot, func,
-                                                             idx + 4) << 32;
-                    pci_conf_write32(seg, bus, slot, func, idx + 4, hi);
-                }
-                else if ( pdev->vf_rlen[i] )
-                    pdev->vf_rlen[i] |= (u64)~0 << 32;
-                pci_conf_write32(seg, bus, slot, func, idx, bar);
-                pdev->vf_rlen[i] = -pdev->vf_rlen[i];
-                if ( (bar & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
-                     PCI_BASE_ADDRESS_MEM_TYPE_64 )
-                    ++i;
+                ret = pci_size_bar(seg, bus, slot, func, pos + PCI_SRIOV_BAR,
+                                   PCI_SRIOV_NUM_BARS, &i, &addr,
+                                   &pdev->vf_rlen[i]);
+                if ( ret )
+                    printk_pdev(pdev, XENLOG_WARNING,
+                                "failed to size SR-IOV BAR%u\n", i);
             }
         }
         else
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (20 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 21/30] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-03 10:10   ` Paul Durrant
  2016-09-27 15:57 ` [PATCH v2 23/30] xen/x86: route legacy PCI interrupts to Dom0 Roger Pau Monne
                   ` (8 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add handlers to detect attempts from a PVHv2 Dom0 to change the position of
the PCI BARs and properly remap them.
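
Conceptually the remap on a BAR write boils down to the following
(simplified sketch; old_addr/new_addr are illustrative names, and it
assumes the old window is torn down before the new one is mapped):

    /* Remove the 1:1 mapping at the old BAR position... */
    modify_mmio_11(d, PFN_DOWN(old_addr), DIV_ROUND_UP(size, PAGE_SIZE), false);
    /* ...and establish it at the address just written by Dom0. */
    modify_mmio_11(d, PFN_DOWN(new_addr), DIV_ROUND_UP(size, PAGE_SIZE), true);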

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/io.c         |   2 +
 xen/drivers/passthrough/pci.c | 307 ++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/io.h  |  19 +++
 xen/include/xen/pci.h         |   3 +
 4 files changed, 331 insertions(+)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 7de1de3..4db0266 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -862,6 +862,8 @@ static int hvm_pt_add_register(struct hvm_pt_device *dev,
 }
 
 static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
+    &hvm_pt_bar_init,
+    &hvm_pt_vf_bar_init,
 };
 
 int hwdom_add_device(struct pci_dev *pdev)
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 6d831dd..60c9e74 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -633,6 +633,313 @@ static int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
     return 0;
 }
 
+static bool bar_reg_is_vf(uint32_t real_offset, uint32_t handler_offset)
+{
+    return real_offset - handler_offset == PCI_SRIOV_BAR;
+}
+
+static int bar_reg_init(struct hvm_pt_device *s,
+                        struct hvm_pt_reg_handler *handler,
+                        uint32_t real_offset, uint32_t *data)
+{
+    uint8_t seg, bus, slot, func;
+    uint64_t addr, size;
+    uint32_t val;
+    unsigned int index = handler->offset / 4;
+    bool vf = bar_reg_is_vf(real_offset, handler->offset);
+    struct hvm_pt_bar *bars = (vf ? s->vf_bars : s->bars);
+    int num_bars = (vf ? PCI_SRIOV_NUM_BARS : s->num_bars);
+    int rc;
+
+    if ( index >= num_bars )
+    {
+        *data = HVM_PT_INVALID_REG;
+        return 0;
+    }
+
+    seg = s->pdev->seg;
+    bus = s->pdev->bus;
+    slot = PCI_SLOT(s->pdev->devfn);
+    func = PCI_FUNC(s->pdev->devfn);
+    val = pci_conf_read32(seg, bus, slot, func, real_offset);
+
+    if ( index > 0 && bars[index - 1].type == HVM_PT_BAR_MEM64_LO )
+        bars[index].type = HVM_PT_BAR_MEM64_HI;
+    else if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
+        bars[index].type = HVM_PT_BAR_UNUSED;
+    else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
+              PCI_BASE_ADDRESS_MEM_TYPE_64 )
+        bars[index].type = HVM_PT_BAR_MEM64_LO;
+    else
+        bars[index].type = HVM_PT_BAR_MEM32;
+
+    if ( bars[index].type == HVM_PT_BAR_MEM32 ||
+         bars[index].type == HVM_PT_BAR_MEM64_LO )
+    {
+        /* Size the BAR and map it. */
+        rc = pci_size_bar(seg, bus, slot, func, real_offset - handler->offset,
+                          num_bars, &index, &addr, &size);
+        if ( rc )
+        {
+            printk_pdev(s->pdev, XENLOG_ERR, "unable to size BAR#%u\n",
+                        index);
+            return rc;
+        }
+
+        if ( size == 0 )
+            bars[index].type = HVM_PT_BAR_UNUSED;
+        else
+        {
+            printk_pdev(s->pdev, XENLOG_DEBUG,
+                        "Mapping BAR#%u: %#lx size: %#lx\n", index, addr, size);
+            rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(addr),
+                                DIV_ROUND_UP(size, PAGE_SIZE), true);
+            if ( rc )
+            {
+                printk_pdev(s->pdev, XENLOG_ERR,
+                            "failed to map BAR#%d into memory map: %d\n",
+                            index, rc);
+                return rc;
+            }
+        }
+    }
+
+    *data = bars[index].type == HVM_PT_BAR_UNUSED ? HVM_PT_INVALID_REG : val;
+    return 0;
+}
+
+/* Only allow writes to check the size of the BARs */
+static int allow_bar_write(struct hvm_pt_bar *bar, struct hvm_pt_reg *reg,
+                           struct pci_dev *pdev, uint32_t val)
+{
+    uint32_t mask;
+
+    if ( bar->type == HVM_PT_BAR_MEM64_HI )
+        mask = ~0;
+    else
+        mask = (uint32_t)PCI_BASE_ADDRESS_MEM_MASK;
+
+    if ( val != ~0 && (val & mask) != (reg->val.dword & mask) )
+    {
+        printk_pdev(pdev, XENLOG_ERR,
+                "changing the position of the BARs is not yet supported: %#x\n",
+                    val);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int bar_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
+                         uint32_t *val, uint32_t dev_value, uint32_t valid_mask)
+{
+    int index = reg->handler->offset / 4;
+
+    return allow_bar_write(&s->bars[index], reg, s->pdev, *val);
+}
+
+static int vf_bar_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
+                            uint32_t *val, uint32_t dev_value,
+                            uint32_t valid_mask)
+{
+    int index = reg->handler->offset / 4;
+
+    return allow_bar_write(&s->vf_bars[index], reg, s->pdev, *val);
+}
+
+/* BAR regs static information table */
+static struct hvm_pt_reg_handler bar_handler[] = {
+    /* BAR 0 reg */
+    /* mask of BAR need to be decided later, depends on IO/MEM type */
+    {
+        .offset     = 0,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = bar_reg_write,
+    },
+    /* BAR 1 reg */
+    {
+        .offset     = 4,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = bar_reg_write,
+    },
+    /* BAR 2 reg */
+    {
+        .offset     = 8,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = bar_reg_write,
+    },
+    /* BAR 3 reg */
+    {
+        .offset     = 12,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = bar_reg_write,
+    },
+    /* BAR 4 reg */
+    {
+        .offset     = 16,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = bar_reg_write,
+    },
+    /* BAR 5 reg */
+    {
+        .offset     = 20,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = bar_reg_write,
+    },
+    /* TODO: we should also trap accesses to the expansion ROM base address. */
+    /* End. */
+    {
+        .size = 0,
+    },
+};
+
+static int bar_group_init(struct hvm_pt_device *dev,
+                          struct hvm_pt_reg_group *group)
+{
+    uint8_t seg, bus, slot, func;
+
+    seg = dev->pdev->seg;
+    bus = dev->pdev->bus;
+    slot = PCI_SLOT(dev->pdev->devfn);
+    func = PCI_FUNC(dev->pdev->devfn);
+    dev->htype = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) & 0x7f;
+    switch ( dev->htype )
+    {
+    case PCI_HEADER_TYPE_NORMAL:
+        group->size = 36;
+        dev->num_bars = 6;
+        break;
+    case PCI_HEADER_TYPE_BRIDGE:
+        group->size = 44;
+        dev->num_bars = 2;
+        break;
+    default:
+        printk_pdev(dev->pdev, XENLOG_ERR, "device type %#x not supported\n",
+                    dev->htype);
+        return -EINVAL;
+    }
+    group->base_offset = PCI_BASE_ADDRESS_0;
+
+    return 0;
+}
+
+struct hvm_pt_handler_init hvm_pt_bar_init = {
+    .handlers = bar_handler,
+    .init = bar_group_init,
+};
+
+/* BAR regs static information table */
+static struct hvm_pt_reg_handler vf_bar_handler[] = {
+    /* BAR 0 reg */
+    /* mask of BAR need to be decided later, depends on IO/MEM type */
+    {
+        .offset     = 0,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = vf_bar_reg_write,
+    },
+    /* BAR 1 reg */
+    {
+        .offset     = 4,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = vf_bar_reg_write,
+    },
+    /* BAR 2 reg */
+    {
+        .offset     = 8,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = vf_bar_reg_write,
+    },
+    /* BAR 3 reg */
+    {
+        .offset     = 12,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = vf_bar_reg_write,
+    },
+    /* BAR 4 reg */
+    {
+        .offset     = 16,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = vf_bar_reg_write,
+    },
+    /* BAR 5 reg */
+    {
+        .offset     = 20,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .init       = bar_reg_init,
+        .u.dw.write = vf_bar_reg_write,
+    },
+    /* End. */
+    {
+        .size = 0,
+    },
+};
+
+static int vf_bar_group_init(struct hvm_pt_device *dev,
+                             struct hvm_pt_reg_group *group)
+{
+    uint8_t seg, bus, slot, func;
+    struct pci_dev *pdev = dev->pdev;
+    int sr_offset;
+    uint16_t ctrl;
+
+    seg = dev->pdev->seg;
+    bus = dev->pdev->bus;
+    slot = PCI_SLOT(dev->pdev->devfn);
+    func = PCI_FUNC(dev->pdev->devfn);
+
+    sr_offset = pci_find_ext_capability(seg, bus, pdev->devfn,
+                                        PCI_EXT_CAP_ID_SRIOV);
+    if ( sr_offset == 0 )
+        return -EINVAL;
+
+    printk_pdev(pdev, XENLOG_DEBUG, "found SR-IOV capabilities\n");
+    ctrl = pci_conf_read16(seg, bus, slot, func, sr_offset + PCI_SRIOV_CTRL);
+    if ( (ctrl & (PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE)) )
+    {
+        printk_pdev(pdev, XENLOG_ERR,
+                    "SR-IOV functions already enabled (%#04x)\n", ctrl);
+        return -EINVAL;
+    }
+
+    group->base_offset = sr_offset + PCI_SRIOV_BAR;
+    group->size = PCI_SRIOV_NUM_BARS * 4;
+
+    return 0;
+}
+
+struct hvm_pt_handler_init hvm_pt_vf_bar_init = {
+    .handlers = vf_bar_handler,
+    .init = vf_bar_group_init,
+};
+
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
                    const struct pci_dev_info *info, nodeid_t node)
 {
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 80f830d..25af036 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -19,6 +19,7 @@
 #ifndef __ASM_X86_HVM_IO_H__
 #define __ASM_X86_HVM_IO_H__
 
+#include <xen/pci_regs.h>
 #include <asm/hvm/vpic.h>
 #include <asm/hvm/vioapic.h>
 #include <public/hvm/ioreq.h>
@@ -275,6 +276,16 @@ struct hvm_pt_msi {
     bool_t mapped;       /* when pirq is mapped */
 };
 
+struct hvm_pt_bar {
+    uint32_t val;
+    enum bar_type {
+        HVM_PT_BAR_UNUSED,
+        HVM_PT_BAR_MEM32,
+        HVM_PT_BAR_MEM64_LO,
+        HVM_PT_BAR_MEM64_HI,
+    } type;
+};
+
 /*
  * Guest passed-through PCI device.
  */
@@ -289,6 +300,14 @@ struct hvm_pt_device {
     /* MSI status. */
     struct hvm_pt_msi msi;
 
+    /* PCI header type. */
+    uint8_t htype;
+
+    /* BAR tracking. */
+    int num_bars;
+    struct hvm_pt_bar bars[6];
+    struct hvm_pt_bar vf_bars[PCI_SRIOV_NUM_BARS];
+
     struct list_head register_groups;
 };
 
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index b21a891..51e0255 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -174,4 +174,7 @@ int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
 void msixtbl_pt_unregister(struct domain *, struct pirq *);
 void msixtbl_pt_cleanup(struct domain *d);
 
+extern struct hvm_pt_handler_init hvm_pt_bar_init;
+extern struct hvm_pt_handler_init hvm_pt_vf_bar_init;
+
 #endif /* __XEN_PCI_H__ */
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 23/30] xen/x86: route legacy PCI interrupts to Dom0
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (21 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-10 13:37   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 24/30] x86/vmsi: add MSI emulation for hardware domain Roger Pau Monne
                   ` (7 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

This is done by adding some Dom0-specific logic to the IO APIC emulation
inside of Xen, so that writes to the IO APIC registers that unmask an
interrupt also take care of setting up that interrupt with Xen. A
Dom0-specific EOI handler also has to be used, since Xen doesn't know the
topology of the PCI devices and simply has to pass through what Dom0 does.
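
In short, the unmask path added to vioapic_write_redirent performs the
following sequence. A simplified sketch, with error handling elided and
hwdom_gsi_unmasked being an illustrative wrapper around the helpers this
patch introduces (the real code lives inline in the IO APIC emulation):

/*
 * When the hardware domain unmasks an IO APIC redirection entry:
 * register the GSI with the trigger mode and polarity written by Dom0,
 * map it as a PIRQ owned by Dom0 itself, and bind it so that
 * hvm_dirq_assist can re-inject it through the emulated IO APIC.
 */
static void hwdom_gsi_unmasked(int gsi, int trig_mode, int polarity)
{
    if ( mp_register_gsi(gsi, trig_mode, polarity) )
        return; /* already registered (-EEXIST) or failed */

    physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_GSI, &gsi, &gsi, NULL);
    pt_irq_bind_hw_domain(gsi);
}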

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
 xen/arch/x86/hvm/irq.c       |   9 +++
 xen/arch/x86/hvm/vioapic.c   |  28 ++++++++-
 xen/arch/x86/physdev.c       |   4 --
 xen/drivers/passthrough/io.c | 144 ++++++++++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/hvm/io.h |   2 +
 xen/include/asm-x86/irq.h    |   5 ++
 xen/include/xen/hvm/irq.h    |   3 +
 xen/include/xen/iommu.h      |   1 +
 8 files changed, 177 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/hvm/irq.c b/xen/arch/x86/hvm/irq.c
index 5323d7c..be9b648 100644
--- a/xen/arch/x86/hvm/irq.c
+++ b/xen/arch/x86/hvm/irq.c
@@ -88,6 +88,15 @@ void hvm_pci_intx_assert(
     spin_unlock(&d->arch.hvm_domain.irq_lock);
 }
 
+void hvm_hw_gsi_assert(struct domain *d, unsigned int gsi)
+{
+
+    ASSERT(is_hardware_domain(d));
+    spin_lock(&d->arch.hvm_domain.irq_lock);
+    assert_gsi(d, gsi);
+    spin_unlock(&d->arch.hvm_domain.irq_lock);
+}
+
 static void __hvm_pci_intx_deassert(
     struct domain *d, unsigned int device, unsigned int intx)
 {
diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
index 611be87..18305be 100644
--- a/xen/arch/x86/hvm/vioapic.c
+++ b/xen/arch/x86/hvm/vioapic.c
@@ -148,6 +148,29 @@ static void vioapic_write_redirent(
         unmasked = unmasked && !ent.fields.mask;
     }
 
+    if ( is_hardware_domain(d) && unmasked )
+    {
+        int ret, gsi;
+
+        /* Interrupt has been unmasked */
+        gsi = idx;
+        ret = mp_register_gsi(gsi, ent.fields.trig_mode, ent.fields.polarity);
+        if ( ret && ret != -EEXIST )
+        {
+            gdprintk(XENLOG_WARNING,
+                     "%s: error %d registering GSI %d\n", __func__, ret, gsi);
+        }
+        if ( !ret )
+        {
+            ret = physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_GSI, &gsi, &gsi,
+                                   NULL);
+            BUG_ON(ret);
+
+            ret = pt_irq_bind_hw_domain(gsi);
+            BUG_ON(ret);
+        }
+    }
+
     *pent = ent;
 
     if ( idx == 0 )
@@ -409,7 +432,10 @@ void vioapic_update_EOI(struct domain *d, u8 vector)
         if ( iommu_enabled )
         {
             spin_unlock(&d->arch.hvm_domain.irq_lock);
-            hvm_dpci_eoi(d, gsi, ent);
+            if ( is_hardware_domain(d) )
+                hvm_hw_dpci_eoi(d, gsi, ent);
+            else
+                hvm_dpci_eoi(d, gsi, ent);
             spin_lock(&d->arch.hvm_domain.irq_lock);
         }
 
diff --git a/xen/arch/x86/physdev.c b/xen/arch/x86/physdev.c
index 0bea6e1..27dcbf4 100644
--- a/xen/arch/x86/physdev.c
+++ b/xen/arch/x86/physdev.c
@@ -19,10 +19,6 @@
 #include <xsm/xsm.h>
 #include <asm/p2m.h>
 
-int physdev_map_pirq(domid_t, int type, int *index, int *pirq_p,
-                     struct msi_info *);
-int physdev_unmap_pirq(domid_t, int pirq);
-
 #include "x86_64/mmconfig.h"
 
 #ifndef COMPAT
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 66577b6..edd8dbd 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -159,26 +159,29 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
 static void pt_irq_time_out(void *data)
 {
     struct hvm_pirq_dpci *irq_map = data;
-    const struct hvm_irq_dpci *dpci;
     const struct dev_intx_gsi_link *digl;
 
     spin_lock(&irq_map->dom->event_lock);
 
-    dpci = domain_get_irq_dpci(irq_map->dom);
-    ASSERT(dpci);
-    list_for_each_entry ( digl, &irq_map->digl_list, list )
+    if ( !is_hardware_domain(irq_map->dom) )
     {
-        unsigned int guest_gsi = hvm_pci_intx_gsi(digl->device, digl->intx);
-        const struct hvm_girq_dpci_mapping *girq;
-
-        list_for_each_entry ( girq, &dpci->girq[guest_gsi], list )
+        const struct hvm_irq_dpci *dpci = domain_get_irq_dpci(irq_map->dom);
+        ASSERT(dpci);
+        list_for_each_entry ( digl, &irq_map->digl_list, list )
         {
-            struct pirq *pirq = pirq_info(irq_map->dom, girq->machine_gsi);
+            unsigned int guest_gsi = hvm_pci_intx_gsi(digl->device, digl->intx);
+            const struct hvm_girq_dpci_mapping *girq;
+
+            list_for_each_entry ( girq, &dpci->girq[guest_gsi], list )
+            {
+                struct pirq *pirq = pirq_info(irq_map->dom, girq->machine_gsi);
 
-            pirq_dpci(pirq)->flags |= HVM_IRQ_DPCI_EOI_LATCH;
+                pirq_dpci(pirq)->flags |= HVM_IRQ_DPCI_EOI_LATCH;
+            }
+            hvm_pci_intx_deassert(irq_map->dom, digl->device, digl->intx);
         }
-        hvm_pci_intx_deassert(irq_map->dom, digl->device, digl->intx);
-    }
+    }
+    else
+        irq_map->flags |= HVM_IRQ_DPCI_EOI_LATCH;
 
     pt_pirq_iterate(irq_map->dom, pt_irq_guest_eoi, NULL);
 
@@ -557,6 +560,85 @@ int pt_irq_create_bind(
     return 0;
 }
 
+int pt_irq_bind_hw_domain(int gsi)
+{
+    struct domain *d = hardware_domain;
+    struct hvm_pirq_dpci *pirq_dpci;
+    struct hvm_irq_dpci *hvm_irq_dpci;
+    struct pirq *info;
+    int rc;
+
+    if ( gsi < 0 || gsi >= d->nr_pirqs )
+        return -EINVAL;
+
+restart:
+    spin_lock(&d->event_lock);
+
+    hvm_irq_dpci = domain_get_irq_dpci(d);
+    if ( hvm_irq_dpci == NULL )
+    {
+        unsigned int i;
+
+        hvm_irq_dpci = xzalloc(struct hvm_irq_dpci);
+        if ( hvm_irq_dpci == NULL )
+        {
+            spin_unlock(&d->event_lock);
+            return -ENOMEM;
+        }
+        for ( i = 0; i < NR_HVM_IRQS; i++ )
+            INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
+
+        d->arch.hvm_domain.irq.dpci = hvm_irq_dpci;
+    }
+
+    info = pirq_get_info(d, gsi);
+    if ( !info )
+    {
+        spin_unlock(&d->event_lock);
+        return -ENOMEM;
+    }
+    pirq_dpci = pirq_dpci(info);
+
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the softirq_dpci a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaced tasklet_kill, which
+     * would have spun forever while doing the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        cpu_relax();
+        goto restart;
+    }
+
+    pirq_dpci->dom = d;
+    pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
+                       HVM_IRQ_DPCI_MACH_PCI |
+                       HVM_IRQ_DPCI_GUEST_PCI;
+
+    /* Init timer before binding */
+    if ( pt_irq_need_timer(pirq_dpci->flags) )
+        init_timer(&pirq_dpci->timer, pt_irq_time_out, pirq_dpci, 0);
+
+    rc = pirq_guest_bind(d->vcpu[0], info, gsi > 15 ? BIND_PIRQ__WILL_SHARE :
+                                                      0);
+    if ( unlikely(rc) )
+    {
+        if ( pt_irq_need_timer(pirq_dpci->flags) )
+            kill_timer(&pirq_dpci->timer);
+        pirq_dpci->dom = NULL;
+        pirq_cleanup_check(info, d);
+        spin_unlock(&d->event_lock);
+        return rc;
+    }
+
+    spin_unlock(&d->event_lock);
+    return 0;
+}
+
 int pt_irq_destroy_bind(
     struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
 {
@@ -819,11 +901,19 @@ static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
             return;
         }
 
-        list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
+        if ( is_hardware_domain(d) )
         {
-            hvm_pci_intx_assert(d, digl->device, digl->intx);
+            hvm_hw_gsi_assert(d, pirq->pirq);
             pirq_dpci->pending++;
         }
+        else
+        {
+            list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
+            {
+                hvm_pci_intx_assert(d, digl->device, digl->intx);
+                pirq_dpci->pending++;
+            }
+        }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_TRANSLATE )
         {
@@ -899,6 +989,32 @@ unlock:
     spin_unlock(&d->event_lock);
 }
 
+void hvm_hw_dpci_eoi(struct domain *d, unsigned int gsi,
+                     const union vioapic_redir_entry *ent)
+{
+    struct pirq *pirq = pirq_info(d, gsi);
+    struct hvm_pirq_dpci *pirq_dpci;
+
+    ASSERT(is_hardware_domain(d) && iommu_enabled);
+
+    if ( pirq == NULL )
+        return;
+
+    pirq_dpci = pirq_dpci(pirq);
+    ASSERT(pirq_dpci != NULL);
+
+    spin_lock(&d->event_lock);
+    if ( --pirq_dpci->pending || (ent && ent->fields.mask) ||
+         !pt_irq_need_timer(pirq_dpci->flags) )
+        goto unlock;
+
+    stop_timer(&pirq_dpci->timer);
+    pirq_guest_eoi(pirq);
+
+unlock:
+    spin_unlock(&d->event_lock);
+}
+
 /*
  * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to
  * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting.
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 25af036..bfd76ff 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -126,6 +126,8 @@ int handle_pio(uint16_t port, unsigned int size, int dir);
 void hvm_interrupt_post(struct vcpu *v, int vector, int type);
 void hvm_dpci_eoi(struct domain *d, unsigned int guest_irq,
                   const union vioapic_redir_entry *ent);
+void hvm_hw_dpci_eoi(struct domain *d, unsigned int gsi,
+                     const union vioapic_redir_entry *ent);
 void msix_write_completion(struct vcpu *);
 void msixtbl_init(struct domain *d);
 
diff --git a/xen/include/asm-x86/irq.h b/xen/include/asm-x86/irq.h
index 7efdd37..07f21ab 100644
--- a/xen/include/asm-x86/irq.h
+++ b/xen/include/asm-x86/irq.h
@@ -201,4 +201,9 @@ bool_t cpu_has_pending_apic_eoi(void);
 
 static inline void arch_move_irqs(struct vcpu *v) { }
 
+struct msi_info;
+int physdev_map_pirq(domid_t, int type, int *index, int *pirq_p,
+                     struct msi_info *);
+int physdev_unmap_pirq(domid_t, int pirq);
+
 #endif /* _ASM_HW_IRQ_H */
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 4c9cb20..2ffaf35 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -122,6 +122,9 @@ void hvm_isa_irq_assert(
 void hvm_isa_irq_deassert(
     struct domain *d, unsigned int isa_irq);
 
+/* Modify state of a hardware domain GSI */
+void hvm_hw_gsi_assert(struct domain *d, unsigned int gsi);
+
 void hvm_set_pci_link_route(struct domain *d, u8 link, u8 isa_irq);
 
 int hvm_inject_msi(struct domain *d, uint64_t addr, uint32_t data);
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index 5803e3f..07c6c40 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -114,6 +114,7 @@ struct pirq;
 int hvm_do_IRQ_dpci(struct domain *, struct pirq *);
 int pt_irq_create_bind(struct domain *, xen_domctl_bind_pt_irq_t *);
 int pt_irq_destroy_bind(struct domain *, xen_domctl_bind_pt_irq_t *);
+int pt_irq_bind_hw_domain(int gsi);
 
 void hvm_dpci_isairq_eoi(struct domain *d, unsigned int isairq);
 struct hvm_irq_dpci *domain_get_irq_dpci(const struct domain *);
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 24/30] x86/vmsi: add MSI emulation for hardware domain
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (22 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 23/30] xen/x86: route legacy PCI interrupts to Dom0 Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-09-27 15:57 ` [PATCH v2 25/30] xen/x86: add all PCI devices to PVHv2 Dom0 Roger Pau Monne
                   ` (6 subsequent siblings)
  30 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Import the MSI handlers from QEMU into Xen. This allows Xen to detect
accesses to the MSI registers and correctly setup PIRQs for physical devices
that are then bound to the hardware domain.

The current logic only allows a single MSI interrupt per device, so the
maximum queue size announced to the guest is unconditionally set to 0
(1 vector only).
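
As an illustration of the single-vector policy, the Message Control init
handler below clears the queue size capability before exposing the
register to the guest. A standalone sketch (vmsi_visible_msgctrl is an
illustrative name; the real logic lives in vmsi_msgctrl_reg_init, and
PCI_MSI_FLAGS_QMASK is the Multiple Message Capable mask from pci_regs.h):

#include <stdint.h>

#define PCI_MSI_FLAGS_QMASK 0x000e /* Multiple Message Capable field */

/*
 * The MSI Message Control value Dom0 reads is the hardware value with
 * the queue size capability cleared, so the guest can only ever enable
 * a single vector.
 */
static uint16_t vmsi_visible_msgctrl(uint16_t hw_flags)
{
    return hw_flags & ~PCI_MSI_FLAGS_QMASK;
}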

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/io.c        |  59 +++++
 xen/arch/x86/hvm/vmsi.c      | 538 +++++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/io.h |  28 +++
 xen/include/asm-x86/msi.h    |  32 +++
 xen/include/xen/hvm/irq.h    |   1 +
 xen/include/xen/pci_regs.h   |   4 +
 6 files changed, 662 insertions(+)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 4db0266..779babb 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -864,6 +864,7 @@ static int hvm_pt_add_register(struct hvm_pt_device *dev,
 static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
     &hvm_pt_bar_init,
     &hvm_pt_vf_bar_init,
+    &hvm_pt_msi_init,
 };
 
 int hwdom_add_device(struct pci_dev *pdev)
@@ -931,6 +932,64 @@ int hwdom_add_device(struct pci_dev *pdev)
     return 0;
 }
 
+/* Generic handlers for HVM PCI pass-through. */
+int hvm_pt_common_reg_init(struct hvm_pt_device *s,
+                           struct hvm_pt_reg_handler *handler,
+                           uint32_t real_offset, uint32_t *data)
+{
+    *data = handler->init_val;
+    return 0;
+}
+
+int hvm_pt_word_reg_read(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
+                         uint16_t *value, uint16_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    uint16_t valid_emu_mask = 0;
+    uint16_t *data = &reg->val.word;
+
+    /* emulate word register */
+    valid_emu_mask = handler->emu_mask & valid_mask;
+    *value = HVM_PT_MERGE_VALUE(*value, *data, ~valid_emu_mask);
+
+    return 0;
+}
+
+int hvm_pt_long_reg_read(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
+                         uint32_t *value, uint32_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    uint32_t valid_emu_mask = 0;
+    uint32_t *data = &reg->val.dword;
+
+    /* emulate long register */
+    valid_emu_mask = handler->emu_mask & valid_mask;
+    *value = HVM_PT_MERGE_VALUE(*value, *data, ~valid_emu_mask);
+
+    return 0;
+}
+
+int hvm_pt_long_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
+                          uint32_t *val, uint32_t dev_value,
+                          uint32_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    uint32_t writable_mask = 0;
+    uint32_t throughable_mask = hvm_pt_get_throughable_mask(s, handler,
+                                                            valid_mask);
+    uint32_t *data = &reg->val.dword;
+
+    /* modify emulate register */
+    writable_mask = handler->emu_mask & ~handler->ro_mask & valid_mask;
+    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    *val = HVM_PT_MERGE_VALUE(*val, dev_value & ~handler->rw1c_mask,
+                              throughable_mask);
+
+    return 0;
+}
+
 static const struct hvm_io_ops dpci_portio_ops = {
     .accept = dpci_portio_accept,
     .read = dpci_portio_read,
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index d81c5d4..75ba429 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -624,3 +624,541 @@ void msix_write_completion(struct vcpu *v)
     if ( msixtbl_write(v, ctrl_address, 4, 0) != X86EMUL_OKAY )
         gdprintk(XENLOG_WARNING, "MSI-X write completion failure\n");
 }
+
+/* MSI emulation. */
+
+/* Helper to check supported MSI features. */
+#define vmsi_check_type(offset, flags, what) \
+        ((offset) == ((flags) & PCI_MSI_FLAGS_64BIT ? \
+                      PCI_MSI_##what##_64 : PCI_MSI_##what##_32))
+
+static inline uint64_t msi_addr64(struct hvm_pt_msi *msi)
+{
+    return (uint64_t)msi->addr_hi << 32 | msi->addr_lo;
+}
+
+/* Helper for updating a PIRQ-vMSI bind. */
+static int vmsi_update_bind(struct hvm_pt_msi *msi)
+{
+    xen_domctl_bind_pt_irq_t bind;
+    struct hvm_pt_device *s = container_of(msi, struct hvm_pt_device, msi);
+    int rc;
+
+    ASSERT(msi->pirq != -1);
+
+    bind.hvm_domid = DOMID_SELF;
+    bind.machine_irq = msi->pirq;
+    bind.irq_type = PT_IRQ_TYPE_MSI;
+    bind.u.msi.gvec = msi_vector(msi->data);
+    bind.u.msi.gflags = msi_gflags(msi->data, msi_addr64(msi));
+    bind.u.msi.gtable = 0;
+
+    pcidevs_lock();
+    rc = pt_irq_create_bind(current->domain, &bind);
+    pcidevs_unlock();
+    if ( rc )
+    {
+        printk_pdev(s->pdev, XENLOG_ERR,
+                      "updating of MSI failed. (err: %d)\n", rc);
+        rc = physdev_unmap_pirq(DOMID_SELF, msi->pirq);
+        if ( rc )
+            printk_pdev(s->pdev, XENLOG_ERR,
+                          "unmapping of MSI pirq %d failed. (err: %i)\n",
+                          msi->pirq, rc);
+        msi->pirq = -1;
+        msi->mapped = false;
+        msi->initialized = false;
+        return rc;
+    }
+
+    return 0;
+}
+
+/* Handlers. */
+
+/* Message Control register */
+static int vmsi_msgctrl_reg_init(struct hvm_pt_device *s,
+                                 struct hvm_pt_reg_handler *handler,
+                                 uint32_t real_offset, uint32_t *data)
+{
+    struct hvm_pt_msi *msi = &s->msi;
+    struct pci_dev *pdev = s->pdev;
+    uint16_t reg_field;
+    uint8_t seg, bus, slot, func;
+
+    seg = pdev->seg;
+    bus = pdev->bus;
+    slot = PCI_SLOT(pdev->devfn);
+    func = PCI_FUNC(pdev->devfn);
+
+    /* Use I/O device register's value as initial value */
+    reg_field = pci_conf_read16(seg, bus, slot, func, real_offset);
+    if ( reg_field & PCI_MSI_FLAGS_ENABLE )
+    {
+        printk_pdev(pdev, XENLOG_INFO,
+                      "MSI already enabled, disabling it first\n");
+        reg_field &= ~PCI_MSI_FLAGS_ENABLE;
+        pci_conf_write16(seg, bus, slot, func, real_offset, reg_field);
+    }
+    msi->flags |= reg_field;
+    msi->ctrl_offset = real_offset;
+    msi->initialized = false;
+    msi->mapped = false;
+
+    *data = handler->init_val | (reg_field & ~PCI_MSI_FLAGS_QMASK);
+    return 0;
+}
+
+static int vmsi_msgctrl_reg_write(struct hvm_pt_device *s,
+                                  struct hvm_pt_reg *reg, uint16_t *val,
+                                  uint16_t dev_value, uint16_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    struct hvm_pt_msi *msi = &s->msi;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = hvm_pt_get_throughable_mask(s, handler,
+                                                            valid_mask);
+    uint16_t *data = &reg->val.word;
+    int rc;
+
+    /* Currently no support for multi-vector */
+    if ( *val & PCI_MSI_FLAGS_QSIZE )
+        printk_pdev(s->pdev, XENLOG_WARNING,
+                      "attempt to enable more than 1 vector (ctrl: %#x)\n", *val);
+
+    /* Modify emulate register */
+    writable_mask = handler->emu_mask & ~handler->ro_mask & valid_mask;
+    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
+    msi->flags |= *data & ~PCI_MSI_FLAGS_ENABLE;
+
+    /* Create value for writing to I/O device register */
+    *val = HVM_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+
+    /* update MSI */
+    if ( *val & PCI_MSI_FLAGS_ENABLE )
+    {
+        /* Setup MSI pirq for the first time */
+        if ( !msi->initialized )
+        {
+            struct msi_info msi_info;
+            int index = -1;
+
+            /* Init physical one */
+            printk_pdev(s->pdev, XENLOG_DEBUG, "setup MSI (register: %x).\n",
+                          *val);
+
+            memset(&msi_info, 0, sizeof(msi_info));
+            msi_info.seg = s->pdev->seg;
+            msi_info.bus = s->pdev->bus;
+            msi_info.devfn = s->pdev->devfn;
+
+            rc = physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_MSI, &index,
+                                  &msi->pirq, &msi_info);
+            if ( rc )
+            {
+                /*
+                 * Do not broadcast this error, since there's nothing else
+                 * that can be done (MSI setup should have been successful).
+                 * Guest MSI would be actually not working.
+                 */
+                *val &= ~PCI_MSI_FLAGS_ENABLE;
+
+                printk_pdev(s->pdev, XENLOG_ERR,
+                              "can not map MSI (register: %x)!\n", *val);
+                return 0;
+            }
+
+            rc = vmsi_update_bind(msi);
+            if ( rc )
+            {
+                *val &= ~PCI_MSI_FLAGS_ENABLE;
+                printk_pdev(s->pdev, XENLOG_ERR,
+                              "can not bind MSI (register: %x)!\n", *val);
+                return 0;
+            }
+            msi->initialized = true;
+            msi->mapped = true;
+        }
+        msi->flags |= PCI_MSI_FLAGS_ENABLE;
+    }
+    else if ( msi->mapped )
+    {
+        uint8_t seg, bus, slot, func;
+        uint8_t gvec = msi_vector(msi->data);
+        uint32_t gflags = msi_gflags(msi->data, msi_addr64(msi));
+        uint16_t flags;
+
+        seg = s->pdev->seg;
+        bus = s->pdev->bus;
+        slot = PCI_SLOT(s->pdev->devfn);
+        func = PCI_FUNC(s->pdev->devfn);
+
+        flags = pci_conf_read16(seg, bus, slot, func, s->msi.ctrl_offset);
+        pci_conf_write16(seg, bus, slot, func, s->msi.ctrl_offset,
+                         flags & ~PCI_MSI_FLAGS_ENABLE);
+
+        if ( msi->pirq == -1 )
+            return 0;
+
+        if ( msi->initialized )
+        {
+            xen_domctl_bind_pt_irq_t bind;
+
+            printk_pdev(s->pdev, XENLOG_DEBUG,
+                          "Unbind MSI with pirq %d, gvec %#x\n", msi->pirq,
+                          gvec);
+
+            bind.hvm_domid = DOMID_SELF;
+            bind.irq_type = PT_IRQ_TYPE_MSI;
+            bind.machine_irq = msi->pirq;
+            bind.u.msi.gvec = gvec;
+            bind.u.msi.gflags = gflags;
+            bind.u.msi.gtable = 0;
+
+            pcidevs_lock();
+            rc = pt_irq_destroy_bind(current->domain, &bind);
+            pcidevs_unlock();
+            if ( rc )
+                printk_pdev(s->pdev, XENLOG_ERR,
+                              "can not unbind MSI (register: %x)!\n", *val);
+
+            rc = physdev_unmap_pirq(DOMID_SELF, msi->pirq);
+            if ( rc )
+                printk_pdev(s->pdev, XENLOG_ERR,
+                              "unmapping of MSI pirq %d failed. (err: %i)\n",
+                              msi->pirq, rc);
+            msi->flags &= ~PCI_MSI_FLAGS_ENABLE;
+            msi->initialized = false;
+            msi->mapped = false;
+            msi->pirq = -1;
+        }
+    }
+
+    return 0;
+}
+
+/* Initialize Message Upper Address register */
+static int vmsi_msgaddr64_reg_init(struct hvm_pt_device *s,
+                                   struct hvm_pt_reg_handler *handler,
+                                   uint32_t real_offset,
+                                   uint32_t *data)
+{
+    /* No need to initialize in case of 32 bit type */
+    if ( !(s->msi.flags & PCI_MSI_FLAGS_64BIT) )
+        *data = HVM_PT_INVALID_REG;
+    else
+        *data = handler->init_val;
+
+    return 0;
+}
+
+/* Write Message Address register */
+static int vmsi_msgaddr32_reg_write(struct hvm_pt_device *s,
+                                    struct hvm_pt_reg *reg, uint32_t *val,
+                                    uint32_t dev_value, uint32_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    uint32_t writable_mask = 0;
+    uint32_t old_addr = reg->val.dword;
+    uint32_t *data = &reg->val.dword;
+
+    /* Modify emulate register */
+    writable_mask = handler->emu_mask & ~handler->ro_mask & valid_mask;
+    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
+    s->msi.addr_lo = *data;
+
+    /* Create value for writing to I/O device register */
+    *val = HVM_PT_MERGE_VALUE(*val, dev_value, 0);
+
+    /* Update MSI */
+    if ( *data != old_addr && s->msi.mapped )
+        vmsi_update_bind(&s->msi);
+
+    return 0;
+}
+
+/* Write Message Upper Address register */
+static int vmsi_msgaddr64_reg_write(struct hvm_pt_device *s,
+                                    struct hvm_pt_reg *reg, uint32_t *val,
+                                    uint32_t dev_value, uint32_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    uint32_t writable_mask = 0;
+    uint32_t old_addr = reg->val.dword;
+    uint32_t *data = &reg->val.dword;
+
+    /* Check whether the type is 64 bit or not */
+    if ( !(s->msi.flags & PCI_MSI_FLAGS_64BIT) )
+    {
+        printk_pdev(s->pdev, XENLOG_ERR,
+                   "Can't write to the upper address without 64 bit support\n");
+        return -EOPNOTSUPP;
+    }
+
+    /* Modify emulate register */
+    writable_mask = handler->emu_mask & ~handler->ro_mask & valid_mask;
+    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
+    /* update the msi_info too */
+    s->msi.addr_hi = *data;
+
+    /* Create value for writing to I/O device register */
+    *val = HVM_PT_MERGE_VALUE(*val, dev_value, 0);
+
+    /* Update MSI */
+    if ( *data != old_addr && s->msi.mapped )
+        vmsi_update_bind(&s->msi);
+
+    return 0;
+}
+
+/*
+ * This function is shared between 32 and 64 bits MSI implementations
+ * Initialize Message Data register
+ */
+static int vmsi_msgdata_reg_init(struct hvm_pt_device *s,
+                                 struct hvm_pt_reg_handler *handler,
+                                 uint32_t real_offset,
+                                 uint32_t *data)
+{
+    uint32_t flags = s->msi.flags;
+    uint32_t offset = handler->offset;
+
+    /* Check the offset whether matches the type or not */
+    if ( vmsi_check_type(offset, flags, DATA) )
+        *data = handler->init_val;
+    else
+        *data = HVM_PT_INVALID_REG;
+
+    return 0;
+}
+
+/*
+ * This function is shared between 32 and 64 bits MSI implementations
+ * Write Message Data register
+ */
+static int vmsi_msgdata_reg_write(struct hvm_pt_device *s,
+                                  struct hvm_pt_reg *reg, uint16_t *val,
+                                  uint16_t dev_value, uint16_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    struct hvm_pt_msi *msi = &s->msi;
+    uint16_t writable_mask = 0;
+    uint16_t old_data = reg->val.word;
+    uint32_t offset = handler->offset;
+    uint16_t *data = &reg->val.word;
+
+    /* Check the offset whether matches the type or not */
+    if ( !vmsi_check_type(offset, msi->flags, DATA) )
+    {
+        /* Exit I/O emulator */
+        printk_pdev(s->pdev, XENLOG_ERR,
+                      "the offset does not match the 32/64 bit type!\n");
+        return -EOPNOTSUPP;
+    }
+
+    /* Modify emulate register */
+    writable_mask = handler->emu_mask & ~handler->ro_mask & valid_mask;
+    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
+    /* Update the msi_info too */
+    msi->data = *data;
+
+    /* Create value for writing to I/O device register */
+    *val = HVM_PT_MERGE_VALUE(*val, dev_value, 0);
+
+    /* Update MSI */
+    if ( *data != old_data && msi->mapped )
+        vmsi_update_bind(msi);
+
+    return 0;
+}
+
+/*
+ * This function is shared between 32 and 64 bits MSI implementations
+ * Initialize Mask register
+ */
+static int vmsi_mask_reg_init(struct hvm_pt_device *s,
+                              struct hvm_pt_reg_handler *handler,
+                              uint32_t real_offset,
+                              uint32_t *data)
+{
+    uint32_t flags = s->msi.flags;
+
+    /* Check the offset whether matches the type or not */
+    if ( !(flags & PCI_MSI_FLAGS_MASKBIT) )
+        *data = HVM_PT_INVALID_REG;
+    else if ( vmsi_check_type(handler->offset, flags, MASK) )
+        *data = handler->init_val;
+    else
+        *data = HVM_PT_INVALID_REG;
+
+    return 0;
+}
+
+/*
+ * This function is shared between 32 and 64 bits MSI implementations
+ * Initialize Pending register
+ */
+static int vmsi_pending_reg_init(struct hvm_pt_device *s,
+                                 struct hvm_pt_reg_handler *handler,
+                                 uint32_t real_offset,
+                                 uint32_t *data)
+{
+    uint32_t flags = s->msi.flags;
+
+    /* check the offset whether matches the type or not */
+    if ( !(flags & PCI_MSI_FLAGS_MASKBIT) )
+        *data = HVM_PT_INVALID_REG;
+    else if ( vmsi_check_type(handler->offset, flags, PENDING) )
+        *data = handler->init_val;
+    else
+        *data = HVM_PT_INVALID_REG;
+
+    return 0;
+}
+
+/* MSI Capability Structure reg static information table */
+static struct hvm_pt_reg_handler vmsi_handler[] = {
+    /* Message Control reg */
+    {
+        .offset     = PCI_MSI_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .res_mask   = 0xFE00,
+        .ro_mask    = 0x018E,
+        .emu_mask   = 0x017E,
+        .init       = vmsi_msgctrl_reg_init,
+        .u.w.read   = hvm_pt_word_reg_read,
+        .u.w.write  = vmsi_msgctrl_reg_write,
+    },
+    /* Message Address reg */
+    {
+        .offset     = PCI_MSI_ADDRESS_LO,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x00000003,
+        .emu_mask   = 0xFFFFFFFF,
+        .init       = hvm_pt_common_reg_init,
+        .u.dw.read  = hvm_pt_long_reg_read,
+        .u.dw.write = vmsi_msgaddr32_reg_write,
+    },
+    /* Message Upper Address reg (if PCI_MSI_FLAGS_64BIT set) */
+    {
+        .offset     = PCI_MSI_ADDRESS_HI,
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0x00000000,
+        .emu_mask   = 0xFFFFFFFF,
+        .init       = vmsi_msgaddr64_reg_init,
+        .u.dw.read  = hvm_pt_long_reg_read,
+        .u.dw.write = vmsi_msgaddr64_reg_write,
+    },
+    /* Message Data reg (16 bits of data for 32-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_32,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x0000,
+        .emu_mask   = 0xFFFF,
+        .init       = vmsi_msgdata_reg_init,
+        .u.w.read   = hvm_pt_word_reg_read,
+        .u.w.write  = vmsi_msgdata_reg_write,
+    },
+    /* Message Data reg (16 bits of data for 64-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_64,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .ro_mask    = 0x0000,
+        .emu_mask   = 0xFFFF,
+        .init       = vmsi_msgdata_reg_init,
+        .u.w.read   = hvm_pt_word_reg_read,
+        .u.w.write  = vmsi_msgdata_reg_write,
+    },
+    /* Mask reg (if PCI_MSI_FLAGS_MASKBIT set, for 32-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_64, /* PCI_MSI_MASK_32 */
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0xFFFFFFFF,
+        .emu_mask   = 0xFFFFFFFF,
+        .init       = vmsi_mask_reg_init,
+        .u.dw.read  = hvm_pt_long_reg_read,
+        .u.dw.write = hvm_pt_long_reg_write,
+    },
+    /* Mask reg (if PCI_MSI_FLAGS_MASKBIT set, for 64-bit devices) */
+    {
+        .offset     = PCI_MSI_MASK_BIT, /* PCI_MSI_MASK_64 */
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0xFFFFFFFF,
+        .emu_mask   = 0xFFFFFFFF,
+        .init       = vmsi_mask_reg_init,
+        .u.dw.read  = hvm_pt_long_reg_read,
+        .u.dw.write = hvm_pt_long_reg_write,
+    },
+    /* Pending reg (if PCI_MSI_FLAGS_MASKBIT set, for 32-bit devices) */
+    {
+        .offset     = PCI_MSI_DATA_64 + 4, /* PCI_MSI_PENDING_32 */
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0xFFFFFFFF,
+        .emu_mask   = 0x00000000,
+        .init       = vmsi_pending_reg_init,
+        .u.dw.read  = hvm_pt_long_reg_read,
+        .u.dw.write = hvm_pt_long_reg_write,
+    },
+    /* Pending reg (if PCI_MSI_FLAGS_MASKBIT set, for 64-bit devices) */
+    {
+        .offset     = PCI_MSI_MASK_BIT + 4, /* PCI_MSI_PENDING_64 */
+        .size       = 4,
+        .init_val   = 0x00000000,
+        .ro_mask    = 0xFFFFFFFF,
+        .emu_mask   = 0x00000000,
+        .init       = vmsi_pending_reg_init,
+        .u.dw.read  = hvm_pt_long_reg_read,
+        .u.dw.write = hvm_pt_long_reg_write,
+    },
+    /* End */
+    {
+        .size = 0,
+    },
+};
+
+static int vmsi_group_init(struct hvm_pt_device *dev,
+                           struct hvm_pt_reg_group *group)
+{
+    uint8_t seg, bus, slot, func;
+    struct pci_dev *pdev = dev->pdev;
+    int msi_offset;
+    uint8_t msi_size = 0xa;
+    uint16_t flags;
+
+    dev->msi.pirq = -1;
+    seg = pdev->seg;
+    bus = pdev->bus;
+    slot = PCI_SLOT(pdev->devfn);
+    func = PCI_FUNC(pdev->devfn);
+
+    msi_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSI);
+    if ( msi_offset == 0 )
+        return -ENODEV;
+
+    group->base_offset = msi_offset;
+    flags = pci_conf_read16(seg, bus, slot, func,
+                            msi_offset + PCI_MSI_FLAGS);
+
+    if ( flags & PCI_MSI_FLAGS_64BIT )
+        msi_size += 4;
+    if ( flags & PCI_MSI_FLAGS_MASKBIT )
+        msi_size += 10;
+
+    dev->msi.flags = flags;
+    group->size = msi_size;
+
+    return 0;
+}
+
+struct hvm_pt_handler_init hvm_pt_msi_init = {
+    .handlers = vmsi_handler,
+    .init = vmsi_group_init,
+};
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index bfd76ff..0f8726a 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -165,6 +165,9 @@ struct hvm_pt_reg_group;
 /* Return code when register should be ignored. */
 #define HVM_PT_INVALID_REG 0xFFFFFFFF
 
+#define HVM_PT_MERGE_VALUE(value, data, val_mask) \
+    (((value) & (val_mask)) | ((data) & ~(val_mask)))
+
 /* function type for config reg */
 typedef int (*hvm_pt_conf_reg_init)
     (struct hvm_pt_device *, struct hvm_pt_reg_handler *, uint32_t real_offset,
@@ -350,6 +353,31 @@ struct hvm_pt_device {
 /* Helper to add passed-through devices to the hardware domain. */
 int hwdom_add_device(struct pci_dev *pdev);
 
+/* Generic handlers for HVM PCI pass-through. */
+int hvm_pt_long_reg_read(struct hvm_pt_device *, struct hvm_pt_reg *,
+                         uint32_t *, uint32_t);
+int hvm_pt_long_reg_write(struct hvm_pt_device *, struct hvm_pt_reg *,
+                          uint32_t *, uint32_t, uint32_t);
+int hvm_pt_word_reg_read(struct hvm_pt_device *, struct hvm_pt_reg *,
+                         uint16_t *, uint16_t);
+
+int hvm_pt_common_reg_init(struct hvm_pt_device *, struct hvm_pt_reg_handler *,
+                           uint32_t real_offset, uint32_t *data);
+
+static inline uint32_t hvm_pt_get_throughable_mask(
+                                    struct hvm_pt_device *s,
+                                    struct hvm_pt_reg_handler *handler,
+                                    uint32_t valid_mask)
+{
+    uint32_t throughable_mask = ~(handler->emu_mask | handler->ro_mask);
+
+    if ( !s->permissive )
+        throughable_mask &= ~handler->res_mask;
+
+    return throughable_mask & valid_mask;
+}
+
 #endif /* __ASM_X86_HVM_IO_H__ */
 
 
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index 9c02945..8c7fb27 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -246,4 +246,36 @@ void ack_nonmaskable_msi_irq(struct irq_desc *);
 void end_nonmaskable_msi_irq(struct irq_desc *, u8 vector);
 void set_msi_affinity(struct irq_desc *, const cpumask_t *);
 
+static inline uint8_t msi_vector(uint32_t data)
+{
+    return (data & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT;
+}
+
+static inline uint8_t msi_dest_id(uint32_t addr)
+{
+    return (addr & MSI_ADDR_DEST_ID_MASK) >> MSI_ADDR_DEST_ID_SHIFT;
+}
+
+static inline uint32_t msi_gflags(uint32_t data, uint64_t addr)
+{
+    uint32_t result = 0;
+    int rh, dm, dest_id, deliv_mode, trig_mode;
+
+    rh = (addr >> MSI_ADDR_REDIRECTION_SHIFT) & 0x1;
+    dm = (addr >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+    dest_id = msi_dest_id(addr);
+    deliv_mode = (data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x7;
+    trig_mode = (data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+
+    result = dest_id | (rh << GFLAGS_SHIFT_RH)
+        | (dm << GFLAGS_SHIFT_DM)
+        | (deliv_mode << GFLAGS_SHIFT_DELIV_MODE)
+        | (trig_mode << GFLAGS_SHIFT_TRG_MODE);
+
+    return result;
+}
+
+/* MSI HVM pass-through handlers. */
+extern struct hvm_pt_handler_init hvm_pt_msi_init;
+
 #endif /* __ASM_MSI_H */
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 2ffaf35..4d24bf0 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -56,6 +56,7 @@ struct dev_intx_gsi_link {
 #define VMSI_TRIG_MODE    0x8000
 
 #define GFLAGS_SHIFT_RH             8
+#define GFLAGS_SHIFT_DM             9
 #define GFLAGS_SHIFT_DELIV_MODE     12
 #define GFLAGS_SHIFT_TRG_MODE       15
 
diff --git a/xen/include/xen/pci_regs.h b/xen/include/xen/pci_regs.h
index ecd6124..8db4e0e 100644
--- a/xen/include/xen/pci_regs.h
+++ b/xen/include/xen/pci_regs.h
@@ -296,6 +296,10 @@
 #define PCI_MSI_DATA_32		8	/* 16 bits of data for 32-bit devices */
 #define PCI_MSI_DATA_64		12	/* 16 bits of data for 64-bit devices */
 #define PCI_MSI_MASK_BIT	16	/* Mask bits register */
+#define PCI_MSI_MASK_64		PCI_MSI_MASK_BIT
+#define PCI_MSI_MASK_32		PCI_MSI_DATA_64
+#define PCI_MSI_PENDING_32	PCI_MSI_MASK_BIT
+#define PCI_MSI_PENDING_64	20
 
 /* MSI-X registers (these are at offset PCI_MSIX_FLAGS) */
 #define PCI_MSIX_FLAGS		2
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 25/30] xen/x86: add all PCI devices to PVHv2 Dom0
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (23 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 24/30] x86/vmsi: add MSI emulation for hardware domain Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-10 13:44   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 26/30] xen/x86: add PCIe emulation Roger Pau Monne
                   ` (5 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/domain_build.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 407f742..b4a14a3 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -2378,6 +2378,25 @@ static int __init hvm_setup_acpi(struct domain *d)
     return 0;
 }
 
+static int __init hvm_setup_pci(struct domain *d)
+{
+    struct pci_dev *pdev;
+    int rc = 0;
+
+    printk("** Adding PCI devices **\n");
+
+    pcidevs_lock();
+    list_for_each_entry( pdev, &d->arch.pdev_list, domain_list )
+    {
+        rc = hwdom_add_device(pdev);
+        if ( rc )
+            break; /* don't return with pcidevs_lock still held */
+    }
+    pcidevs_unlock();
+
+    return rc;
+}
+
 static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
                                      unsigned long image_headroom,
                                      module_t *initrd,
@@ -2426,6 +2445,13 @@ static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
         return rc;
     }
 
+    rc = hvm_setup_pci(d);
+    if ( rc )
+    {
+        printk("Failed to add PCI devices: %d\n", rc);
+        return rc;
+    }
+
     return 0;
 }
 
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 26/30] xen/x86: add PCIe emulation
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (24 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 25/30] xen/x86: add all PCI devices to PVHv2 Dom0 Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-03 10:46   ` Paul Durrant
  2016-10-10 13:57   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 27/30] x86/msixtbl: disable MSI-X intercepts for domains without an ioreq server Roger Pau Monne
                   ` (4 subsequent siblings)
  30 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Add a new MMIO handler that traps accesses to the PCIe memory-mapped
configuration regions discovered by Xen from the ACPI MCFG table. The
handler logic is shared with the IO port PCI configuration interface.
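
For reference, each PCI function gets a 4KiB window inside an MMCFG
region, so an offset into such a region decodes as in the standalone
sketch below (ecam_decode is an illustrative twin of the patch's
pcie_decode_addr). For example, offset 0xa53010 yields bus 0x0a,
device 0x0a, function 3, register 0x10:

#include <stdint.h>

/*
 * ECAM offset layout: bus[27:20], device[19:15], function[14:12],
 * register[11:0].
 */
static void ecam_decode(uint64_t off, unsigned int *bus, unsigned int *dev,
                        unsigned int *func, unsigned int *reg)
{
    *bus  = (off >> 20) & 0xff;
    *dev  = (off >> 15) & 0x1f;
    *func = (off >> 12) & 0x7;
    *reg  = off & 0xfff;
}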

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/io.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 171 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 779babb..088e3ec 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -46,6 +46,8 @@
 #include <xen/iocap.h>
 #include <public/hvm/ioreq.h>
 
+#include "../x86_64/mmconfig.h"
+
 /* Set permissive mode for HVM Dom0 PCI pass-through by default */
 static bool_t opt_dom0permissive = 1;
 boolean_param("dom0permissive", opt_dom0permissive);
@@ -363,7 +365,7 @@ static int hvm_pt_pci_config_access_check(struct hvm_pt_device *d,
 }
 
 static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
-                                  uint32_t *data, int len)
+                                  uint32_t *data, int len, bool pcie)
 {
     uint32_t val = 0;
     struct hvm_pt_reg_group *reg_grp_entry = NULL;
@@ -377,7 +379,7 @@ static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
     unsigned int func = PCI_FUNC(d->pdev->devfn);
 
     /* Sanity checks. */
-    if ( hvm_pt_pci_config_access_check(d, addr, len) )
+    if ( !pcie && hvm_pt_pci_config_access_check(d, addr, len) )
         return X86EMUL_UNHANDLEABLE;
 
     /* Find register group entry. */
@@ -468,7 +470,7 @@ static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
 }
 
 static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
-                                    uint32_t val, int len)
+                                    uint32_t val, int len, bool pcie)
 {
     int index = 0;
     struct hvm_pt_reg_group *reg_grp_entry = NULL;
@@ -485,7 +487,7 @@ static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
     unsigned int func = PCI_FUNC(d->pdev->devfn);
 
     /* Sanity checks. */
-    if ( hvm_pt_pci_config_access_check(d, addr, len) )
+    if ( !pcie && hvm_pt_pci_config_access_check(d, addr, len) )
         return X86EMUL_UNHANDLEABLE;
 
     /* Find register group entry. */
@@ -677,7 +679,7 @@ static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
     if ( dev != NULL )
     {
         reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
-        rc = hvm_pt_pci_read_config(dev, reg, &data32, size);
+        rc = hvm_pt_pci_read_config(dev, reg, &data32, size, false);
         if ( rc == X86EMUL_OKAY )
         {
             read_unlock(&currd->arch.hvm_domain.pt_lock);
@@ -722,7 +724,7 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
     if ( dev != NULL )
     {
         reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
-        rc = hvm_pt_pci_write_config(dev, reg, data32, size);
+        rc = hvm_pt_pci_write_config(dev, reg, data32, size, false);
         if ( rc == X86EMUL_OKAY )
         {
             read_unlock(&currd->arch.hvm_domain.pt_lock);
@@ -1002,6 +1004,166 @@ static const struct hvm_io_ops hw_dpci_portio_ops = {
     .write = hw_dpci_portio_write
 };
 
+static struct acpi_mcfg_allocation *pcie_find_mmcfg(unsigned long addr)
+{
+    int i;
+
+    for ( i = 0; i < pci_mmcfg_config_num; i++ )
+    {
+        unsigned long start, end;
+
+        start = pci_mmcfg_config[i].address;
+        end = pci_mmcfg_config[i].address +
+              ((pci_mmcfg_config[i].end_bus_number + 1) << 20);
+        if ( addr >= start && addr < end )
+            return &pci_mmcfg_config[i];
+    }
+
+    return NULL;
+}
+
+static struct hvm_pt_device *hw_pcie_get_device(unsigned int seg,
+                                                unsigned int bus,
+                                                unsigned int slot,
+                                                unsigned int func)
+{
+    struct hvm_pt_device *dev;
+    struct domain *d = current->domain;
+
+    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
+    {
+        if ( dev->pdev->seg != seg || dev->pdev->bus != bus ||
+             dev->pdev->devfn != PCI_DEVFN(slot, func) )
+            continue;
+
+        return dev;
+    }
+
+    return NULL;
+}
+
+static void pcie_decode_addr(unsigned long addr, unsigned int *bus,
+                             unsigned int *slot, unsigned int *func,
+                             unsigned int *reg)
+{
+    *bus = (addr >> 20) & 0xff;
+    *slot = (addr >> 15) & 0x1f;
+    *func = (addr >> 12) & 0x7;
+    *reg = addr & 0xfff;
+}
+
+static int pcie_range(struct vcpu *v, unsigned long addr)
+{
+    return pcie_find_mmcfg(addr) != NULL;
+}
+
+static int pcie_read(struct vcpu *v, unsigned long addr,
+                     unsigned int len, unsigned long *pval)
+{
+    struct acpi_mcfg_allocation *mmcfg = pcie_find_mmcfg(addr);
+    struct domain *d = v->domain;
+    unsigned int seg, bus, slot, func, reg;
+    struct hvm_pt_device *dev;
+    uint32_t val;
+    int rc;
+
+    ASSERT(mmcfg != NULL);
+
+    if ( len > 4 || len == 3 )
+        return X86EMUL_UNHANDLEABLE;
+
+    addr -= mmcfg->address;
+    seg = mmcfg->pci_segment;
+    pcie_decode_addr(addr, &bus, &slot, &func, &reg);
+
+    read_lock(&d->arch.hvm_domain.pt_lock);
+    dev = hw_pcie_get_device(seg, bus, slot, func);
+    if ( dev != NULL )
+    {
+        rc = hvm_pt_pci_read_config(dev, reg, &val, len, true);
+        if ( rc == X86EMUL_OKAY )
+        {
+            read_unlock(&d->arch.hvm_domain.pt_lock);
+            goto out;
+        }
+    }
+    read_unlock(&d->arch.hvm_domain.pt_lock);
+
+    /* Pass-through */
+    switch ( len )
+    {
+    case 1:
+        val = pci_conf_read8(seg, bus, slot, func, reg);
+        break;
+    case 2:
+        val = pci_conf_read16(seg, bus, slot, func, reg);
+        break;
+    case 4:
+        val = pci_conf_read32(seg, bus, slot, func, reg);
+        break;
+    }
+
+ out:
+    *pval = val;
+    return X86EMUL_OKAY;
+}
+
+static int pcie_write(struct vcpu *v, unsigned long addr,
+                      unsigned int len, unsigned long val)
+{
+    struct acpi_mcfg_allocation *mmcfg = pcie_find_mmcfg(addr);
+    struct domain *d = v->domain;
+    unsigned int seg, bus, slot, func, reg;
+    struct hvm_pt_device *dev;
+    int rc;
+
+    ASSERT(mmcfg != NULL);
+
+    if ( len > 4 || len == 3 )
+        return X86EMUL_UNHANDLEABLE;
+
+    addr -= mmcfg->address;
+    seg = mmcfg->pci_segment;
+    pcie_decode_addr(addr, &bus, &slot, &func, &reg);
+
+    read_lock(&d->arch.hvm_domain.pt_lock);
+    dev = hw_pcie_get_device(seg, bus, slot, func);
+    if ( dev != NULL )
+    {
+        rc = hvm_pt_pci_write_config(dev, reg, val, len, true);
+        if ( rc == X86EMUL_OKAY )
+        {
+            read_unlock(&d->arch.hvm_domain.pt_lock);
+            return rc;
+        }
+    }
+    read_unlock(&d->arch.hvm_domain.pt_lock);
+
+    /* Pass-through */
+    switch ( len )
+    {
+    case 1:
+        pci_conf_write8(seg, bus, slot, func, reg, val);
+        break;
+    case 2:
+        pci_conf_write16(seg, bus, slot, func, reg, val);
+        break;
+    case 4:
+        pci_conf_write32(seg, bus, slot, func, reg, val);
+        break;
+    }
+
+    return X86EMUL_OKAY;
+}
+
+static const struct hvm_mmio_ops pcie_mmio_ops = {
+    .check = pcie_range,
+    .read = pcie_read,
+    .write = pcie_write
+};
+
 void register_dpci_portio_handler(struct domain *d)
 {
     struct hvm_io_handler *handler = hvm_next_io_handler(d);
@@ -1011,7 +1173,10 @@ void register_dpci_portio_handler(struct domain *d)
 
     handler->type = IOREQ_TYPE_PIO;
     if ( is_hardware_domain(d) )
+    {
         handler->ops = &hw_dpci_portio_ops;
+        register_mmio_handler(d, &pcie_mmio_ops);
+    }
     else
         handler->ops = &dpci_portio_ops;
 }
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 27/30] x86/msixtbl: disable MSI-X intercepts for domains without an ioreq server
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (25 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 26/30] xen/x86: add PCIe emulation Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-10 14:18   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0 Roger Pau Monne
                   ` (3 subsequent siblings)
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Paul Durrant, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

The current msixtbl intercepts only partially trap MSI-X accesses: they are
incomplete, as the logic to set up PIRQs and bind them to domains is missing.
Disable them for domains without at least one ioreq server (PVH).

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
NB: this is a preparatory patch for introducing a complete MSI-X emulation
layer into Xen. Long term the current msixtbl code should be replaced with
the complete MSI-X emulation introduced in later patches.
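
In code-flow terms the change boils down to the following guard when binding
an MSI, mirroring the hunk below (hvm_has_ioreq_server simply checks, under
the ioreq server lock, whether the domain's list of ioreq servers is
non-empty):

    if ( rc == 0 && pt_irq_bind->u.msi.gtable && hvm_has_ioreq_server(d) )
        rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);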
---
 xen/arch/x86/hvm/ioreq.c        | 11 +++++++++++
 xen/drivers/passthrough/io.c    |  4 +++-
 xen/include/asm-x86/hvm/ioreq.h |  1 +
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/ioreq.c b/xen/arch/x86/hvm/ioreq.c
index d2245e2..b09fa96 100644
--- a/xen/arch/x86/hvm/ioreq.c
+++ b/xen/arch/x86/hvm/ioreq.c
@@ -772,6 +772,17 @@ int hvm_destroy_ioreq_server(struct domain *d, ioservid_t id)
     return rc;
 }
 
+int hvm_has_ioreq_server(struct domain *d)
+{
+    int empty;
+
+    spin_lock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
+    empty = list_empty(&d->arch.hvm_domain.ioreq_server.list);
+    spin_unlock_recursive(&d->arch.hvm_domain.ioreq_server.lock);
+
+    return !empty;
+}
+
 int hvm_get_ioreq_server_info(struct domain *d, ioservid_t id,
                               unsigned long *ioreq_pfn,
                               unsigned long *bufioreq_pfn,
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index edd8dbd..1e5e365 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -24,6 +24,7 @@
 #include <asm/hvm/irq.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
+#include <asm/hvm/ioreq.h>
 #include <asm/io_apic.h>
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
@@ -384,7 +385,8 @@ int pt_irq_create_bind(
             pirq_dpci->dom = d;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], info, 0);
-            if ( rc == 0 && pt_irq_bind->u.msi.gtable )
+            if ( rc == 0 && pt_irq_bind->u.msi.gtable &&
+                 hvm_has_ioreq_server(d) )
             {
                 rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
                 if ( unlikely(rc) )
diff --git a/xen/include/asm-x86/hvm/ioreq.h b/xen/include/asm-x86/hvm/ioreq.h
index fbf2c74..6456cd2 100644
--- a/xen/include/asm-x86/hvm/ioreq.h
+++ b/xen/include/asm-x86/hvm/ioreq.h
@@ -31,6 +31,7 @@ int hvm_get_ioreq_server_info(struct domain *d, ioservid_t id,
                               unsigned long *ioreq_pfn,
                               unsigned long *bufioreq_pfn,
                               evtchn_port_t *bufioreq_port);
+int hvm_has_ioreq_server(struct domain *d);
 int hvm_map_io_range_to_ioreq_server(struct domain *d, ioservid_t id,
                                      uint32_t type, uint64_t start,
                                      uint64_t end);
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (26 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 27/30] x86/msixtbl: disable MSI-X intercepts for domains without an ioreq server Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-03 10:57   ` Paul Durrant
  2016-10-10 16:15   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings Roger Pau Monne
                   ` (2 subsequent siblings)
  30 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne

This requires adding handlers to the PCI configuration space, plus an MMIO
handler for the MSI-X table; the PBA is left mapped directly into the guest.
The implementation is based on the one already found in the passthrough
code from QEMU.
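
As background (this is the layout mandated by the PCI spec, not something
introduced here): each MSI-X table entry is four dwords, which is what the
latch[4] array in hvm_pt_msix_entry below mirrors. A sketch:

    /* Layout of one 16-byte MSI-X table entry, per the PCI spec. */
    struct msix_table_entry {
        uint32_t lower_addr;  /* message address, low 32 bits */
        uint32_t upper_addr;  /* message address, high 32 bits */
        uint32_t data;        /* message data */
        uint32_t vector_ctrl; /* bit 0 is the per-vector mask */
    };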

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/io.c         |   2 +
 xen/arch/x86/hvm/vmsi.c       | 498 ++++++++++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/pci.c |   6 +-
 xen/include/asm-x86/hvm/io.h  |  26 +++
 xen/include/asm-x86/msi.h     |   4 +-
 5 files changed, 534 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
index 088e3ec..11b7313 100644
--- a/xen/arch/x86/hvm/io.c
+++ b/xen/arch/x86/hvm/io.c
@@ -867,6 +867,7 @@ static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
     &hvm_pt_bar_init,
     &hvm_pt_vf_bar_init,
     &hvm_pt_msi_init,
+    &hvm_pt_msix_init,
 };
 
 int hwdom_add_device(struct pci_dev *pdev)
@@ -1176,6 +1177,7 @@ void register_dpci_portio_handler(struct domain *d)
     {
         handler->ops = &hw_dpci_portio_ops;
         register_mmio_handler(d, &pcie_mmio_ops);
+        register_mmio_handler(d, &vmsix_mmio_ops);
     }
     else
         handler->ops = &dpci_portio_ops;
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 75ba429..92c3b50 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -40,6 +40,7 @@
 #include <asm/current.h>
 #include <asm/event.h>
 #include <asm/io_apic.h>
+#include <asm/p2m.h>
 
 static void vmsi_inj_irq(
     struct vlapic *target,
@@ -1162,3 +1163,500 @@ struct hvm_pt_handler_init hvm_pt_msi_init = {
     .handlers = vmsi_handler,
     .init = vmsi_group_init,
 };
+
+/* MSI-X */
+#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)]
+
+static int vmsix_update_one(struct hvm_pt_device *s, int entry_nr,
+                            uint32_t vec_ctrl)
+{
+    struct hvm_pt_msix_entry *entry = NULL;
+    xen_domctl_bind_pt_irq_t bind;
+    bool bound = true;
+    struct irq_desc *desc;
+    unsigned long flags;
+    int irq;
+    int pirq;
+    int rc;
+
+    if ( entry_nr < 0 || entry_nr >= s->msix->total_entries )
+        return -EINVAL;
+
+    entry = &s->msix->msix_entry[entry_nr];
+
+    if ( !entry->updated )
+        goto mask;
+
+    pirq = entry->pirq;
+
+    /*
+     * Update the entry addr and data to the latest values only when the
+     * entry is masked or they are all masked, as required by the spec.
+     * Addr and data changes while the MSI-X entry is unmasked get deferred
+     * until the next masked -> unmasked transition.
+     */
+    if ( s->msix->maskall ||
+         (entry->latch(VECTOR_CTRL_OFFSET) & PCI_MSIX_VECTOR_BITMASK) )
+    {
+        entry->addr = entry->latch(LOWER_ADDR_OFFSET) |
+                      ((uint64_t)entry->latch(UPPER_ADDR_OFFSET) << 32);
+        entry->data = entry->latch(DATA_OFFSET);
+    }
+
+    if ( pirq == -1 )
+    {
+        struct msi_info msi_info;
+        int index = -1;
+
+        /* Init physical one */
+        printk_pdev(s->pdev, XENLOG_DEBUG, "setup MSI-X (entry: %d).\n",
+                    entry_nr);
+
+        memset(&msi_info, 0, sizeof(msi_info));
+        msi_info.seg = s->pdev->seg;
+        msi_info.bus = s->pdev->bus;
+        msi_info.devfn = s->pdev->devfn;
+        msi_info.table_base = s->msix->table_base;
+        msi_info.entry_nr = entry_nr;
+
+        rc = physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_MSI, &index,
+                              &pirq, &msi_info);
+        if ( rc )
+        {
+            /*
+             * Do not broadcast this error, since there's nothing else
+             * that can be done (MSI-X setup should have been successful).
+             * Guest MSI would be actually not working.
+             */
+
+            printk_pdev(s->pdev, XENLOG_ERR,
+                          "can not map MSI-X (entry: %d)!\n", entry_nr);
+            return rc;
+        }
+        entry->pirq = pirq;
+        bound = false;
+    }
+
+    ASSERT(entry->pirq != -1);
+
+    if ( bound )
+    {
+        printk_pdev(s->pdev, XENLOG_DEBUG, "destroy bind MSI-X entry %d\n",
+                    entry_nr);
+        bind.hvm_domid = DOMID_SELF;
+        bind.machine_irq = entry->pirq;
+        bind.irq_type = PT_IRQ_TYPE_MSI;
+        bind.u.msi.gvec = msi_vector(entry->data);
+        bind.u.msi.gflags = msi_gflags(entry->data, entry->addr);
+        bind.u.msi.gtable = s->msix->table_base;
+
+        pcidevs_lock();
+        rc = pt_irq_destroy_bind(current->domain, &bind);
+        pcidevs_unlock();
+        if ( rc )
+        {
+            printk_pdev(s->pdev, XENLOG_ERR, "updating of MSI-X failed: %d\n",
+                        rc);
+            rc = physdev_unmap_pirq(DOMID_SELF, entry->pirq);
+            if ( rc )
+                printk_pdev(s->pdev, XENLOG_ERR,
+                            "unmapping of MSI pirq %d failed: %d\n",
+                            entry->pirq, rc);
+            entry->pirq = -1;
+            return rc;
+        }
+    }
+
+    printk_pdev(s->pdev, XENLOG_DEBUG, "bind MSI-X entry %d\n", entry_nr);
+    bind.hvm_domid = DOMID_SELF;
+    bind.machine_irq = entry->pirq;
+    bind.irq_type = PT_IRQ_TYPE_MSI;
+    bind.u.msi.gvec = msi_vector(entry->data);
+    bind.u.msi.gflags = msi_gflags(entry->data, entry->addr);
+    bind.u.msi.gtable = s->msix->table_base;
+
+    pcidevs_lock();
+    rc = pt_irq_create_bind(current->domain, &bind);
+    pcidevs_unlock();
+    if ( rc )
+    {
+        printk_pdev(s->pdev, XENLOG_ERR, "updating of MSI-X failed: %d\n", rc);
+        rc = physdev_unmap_pirq(DOMID_SELF, entry->pirq);
+        if ( rc )
+            printk_pdev(s->pdev, XENLOG_ERR,
+                        "unmapping of MSI pirq %d failed: %d\n",
+                        entry->pirq, rc);
+        entry->pirq = -1;
+        return rc;
+    }
+
+    entry->updated = false;
+
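+    /*
+     * Apply mask bit changes to the physical interrupt even when the
+     * entry data itself didn't need (re)binding above.
+     */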
+ mask:
+    if ( entry->pirq != -1 &&
+         ((vec_ctrl ^ entry->latch(VECTOR_CTRL_OFFSET)) &
+          PCI_MSIX_VECTOR_BITMASK) )
+    {
+        printk_pdev(s->pdev, XENLOG_DEBUG, "%smasking MSI-X entry %d\n",
+                    (vec_ctrl & PCI_MSIX_VECTOR_BITMASK) ? "" : "un", entry_nr);
+        irq = domain_pirq_to_irq(s->pdev->domain, entry->pirq);
+        desc = irq_to_desc(irq);
+        spin_lock_irqsave(&desc->lock, flags);
+        guest_mask_msi_irq(desc, !!(vec_ctrl & PCI_MSIX_VECTOR_BITMASK));
+        spin_unlock_irqrestore(&desc->lock, flags);
+    }
+
+    return 0;
+}
+
+static int vmsix_update(struct hvm_pt_device *s)
+{
+    struct hvm_pt_msix *msix = s->msix;
+    int i, rc;
+
+    for ( i = 0; i < msix->total_entries; i++ )
+    {
+        rc = vmsix_update_one(s, i,
+                              msix->msix_entry[i].latch(VECTOR_CTRL_OFFSET));
+        if ( rc )
+            printk_pdev(s->pdev, XENLOG_ERR, "failed to update MSI-X %d\n", i);
+    }
+
+    return 0;
+}
+
+static int vmsix_disable(struct hvm_pt_device *s)
+{
+    struct hvm_pt_msix *msix = s->msix;
+    int i, rc;
+
+    for ( i = 0; i < msix->total_entries; i++ )
+    {
+        struct hvm_pt_msix_entry *entry =  &s->msix->msix_entry[i];
+        xen_domctl_bind_pt_irq_t bind;
+
+        if ( entry->pirq == -1 )
+            continue;
+
+        bind.hvm_domid = DOMID_SELF;
+        bind.irq_type = PT_IRQ_TYPE_MSI;
+        bind.machine_irq = entry->pirq;
+        bind.u.msi.gvec = msi_vector(entry->data);
+        bind.u.msi.gflags = msi_gflags(entry->data, entry->addr);
+        bind.u.msi.gtable = msix->table_base;
+        pcidevs_lock();
+        rc = pt_irq_destroy_bind(current->domain, &bind);
+        pcidevs_unlock();
+        if ( rc )
+        {
+            printk_pdev(s->pdev, XENLOG_ERR,
+                        "failed to destroy MSI-X PIRQ bind entry %d: %d\n",
+                        i, rc);
+            return rc;
+        }
+
+        rc = physdev_unmap_pirq(DOMID_SELF, entry->pirq);
+        if ( rc )
+        {
+            printk_pdev(s->pdev, XENLOG_ERR,
+                        "failed to unmap PIRQ %d MSI-X entry %d: %d\n",
+                        entry->pirq, i, rc);
+            return rc;
+        }
+
+        entry->pirq = -1;
+        entry->updated = false;
+    }
+
+    return 0;
+}
+
+/* Message Control register for MSI-X */
+static int vmsix_ctrl_reg_init(struct hvm_pt_device *s,
+                               struct hvm_pt_reg_handler *handler,
+                               uint32_t real_offset, uint32_t *data)
+{
+    struct pci_dev *pdev = s->pdev;
+    struct hvm_pt_msix *msix = s->msix;
+    uint8_t seg, bus, slot, func;
+    uint16_t reg_field;
+
+    seg = pdev->seg;
+    bus = pdev->bus;
+    slot = PCI_SLOT(pdev->devfn);
+    func = PCI_FUNC(pdev->devfn);
+
+    /* use I/O device register's value as initial value */
+    reg_field = pci_conf_read16(seg, bus, slot, func, real_offset);
+    if ( reg_field & PCI_MSIX_FLAGS_ENABLE )
+    {
+        printk_pdev(pdev, XENLOG_INFO,
+                    "MSI-X already enabled, disabling it first\n");
+        reg_field &= ~PCI_MSIX_FLAGS_ENABLE;
+        pci_conf_write16(seg, bus, slot, func, real_offset, reg_field);
+    }
+
+    msix->ctrl_offset = real_offset;
+
+    *data = handler->init_val | reg_field;
+    return 0;
+}
+static int vmsix_ctrl_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
+                                uint16_t *val, uint16_t dev_value,
+                                uint16_t valid_mask)
+{
+    struct hvm_pt_reg_handler *handler = reg->handler;
+    uint16_t writable_mask = 0;
+    uint16_t throughable_mask = hvm_pt_get_throughable_mask(s, handler,
+                                                            valid_mask);
+    int debug_msix_enabled_old;
+    uint16_t *data = &reg->val.word;
+
+    /* modify emulate register */
+    writable_mask = handler->emu_mask & ~handler->ro_mask & valid_mask;
+    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
+
+    /* create value for writing to I/O device register */
+    *val = HVM_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
+
+    /* update MSI-X */
+    if ( (*val & PCI_MSIX_FLAGS_ENABLE)
+         && !(*val & PCI_MSIX_FLAGS_MASKALL) )
+        vmsix_update(s);
+    else if ( !(*val & PCI_MSIX_FLAGS_ENABLE) && s->msix->enabled )
+        vmsix_disable(s);
+
+    s->msix->maskall = *val & PCI_MSIX_FLAGS_MASKALL;
+
+    debug_msix_enabled_old = s->msix->enabled;
+    s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE);
+    if ( s->msix->enabled != debug_msix_enabled_old )
+        printk_pdev(s->pdev, XENLOG_DEBUG, "%s MSI-X\n",
+                    s->msix->enabled ? "enable" : "disable");
+
+    return 0;
+}
+
+/* MSI Capability Structure reg static information table */
+static struct hvm_pt_reg_handler vmsix_handler[] = {
+    /* Message Control reg */
+    {
+        .offset     = PCI_MSIX_FLAGS,
+        .size       = 2,
+        .init_val   = 0x0000,
+        .res_mask   = 0x3800,
+        .ro_mask    = 0x07FF,
+        .emu_mask   = 0x0000,
+        .init       = vmsix_ctrl_reg_init,
+        .u.w.read   = hvm_pt_word_reg_read,
+        .u.w.write  = vmsix_ctrl_reg_write,
+    },
+    /* End */
+    {
+        .size = 0,
+    },
+};
+
+static int vmsix_group_init(struct hvm_pt_device *s,
+                            struct hvm_pt_reg_group *group)
+{
+    uint8_t seg, bus, slot, func;
+    struct pci_dev *pdev = s->pdev;
+    int msix_offset, total_entries, i, bar_index, rc;
+    uint32_t table_off;
+    uint16_t flags;
+
+    seg = pdev->seg;
+    bus = pdev->bus;
+    slot = PCI_SLOT(pdev->devfn);
+    func = PCI_FUNC(pdev->devfn);
+
+    msix_offset = pci_find_cap_offset(seg, bus, slot, func, PCI_CAP_ID_MSIX);
+    if ( msix_offset == 0 )
+        return -ENODEV;
+
+    group->base_offset = msix_offset;
+    flags = pci_conf_read16(seg, bus, slot, func,
+                            msix_offset + PCI_MSIX_FLAGS);
+    total_entries = flags & PCI_MSIX_FLAGS_QSIZE;
+    total_entries += 1;
+
+    s->msix = xmalloc_bytes(sizeof(struct hvm_pt_msix) +
+                            total_entries * sizeof(struct hvm_pt_msix_entry));
+    if ( s->msix == NULL )
+    {
+        printk_pdev(pdev, XENLOG_ERR, "unable to allocate memory for MSI-X\n");
+        return -ENOMEM;
+    }
+    memset(s->msix, 0, sizeof(struct hvm_pt_msix) +
+           total_entries * sizeof(struct hvm_pt_msix_entry));
+
+    s->msix->total_entries = total_entries;
+    for ( i = 0; i < total_entries; i++ )
+    {
+        struct hvm_pt_msix_entry *entry = &s->msix->msix_entry[i];
+
+        entry->pirq = -1;
+        entry->latch(VECTOR_CTRL_OFFSET) = PCI_MSIX_VECTOR_BITMASK;
+    }
+
+    table_off = pci_conf_read32(seg, bus, slot, func,
+                                msix_offset + PCI_MSIX_TABLE);
+    bar_index = s->msix->bar_index = table_off & PCI_MSIX_BIRMASK;
+    table_off &= ~PCI_MSIX_BIRMASK;
+    s->msix->table_base = s->bars[bar_index].addr;
+    s->msix->table_offset = table_off;
+    s->msix->mmio_base_addr = s->bars[bar_index].addr + table_off;
+    printk_pdev(pdev, XENLOG_DEBUG,
+                "MSI-X table at BAR#%d address: %#lx size: %d\n",
+                bar_index, s->msix->mmio_base_addr,
+                total_entries * PCI_MSIX_ENTRY_SIZE);
+
+    /* Unmap the BAR so that the guest cannot directly write to it. */
+    rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(s->msix->mmio_base_addr),
+                        DIV_ROUND_UP(total_entries * PCI_MSIX_ENTRY_SIZE,
+                                     PAGE_SIZE),
+                        false);
+    if ( rc )
+    {
+        printk_pdev(pdev, XENLOG_ERR,
+                    "Unable to unmap address %#lx from BAR#%d\n",
+                    s->bars[bar_index].addr, bar_index);
+        xfree(s->msix);
+        return rc;
+    }
+
+    return 0;
+}
+
+struct hvm_pt_handler_init hvm_pt_msix_init = {
+    .handlers = vmsix_handler,
+    .init = vmsix_group_init,
+};
+
+/* MMIO handlers for MSI-X */
+static struct hvm_pt_device *vmsix_find_dev_mmio(struct domain *d,
+                                                 unsigned long addr)
+{
+    struct hvm_pt_device *dev;
+
+    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
+    {
+        unsigned long table_addr, table_size;
+
+        if ( dev->msix == NULL )
+            continue;
+
+        table_addr = dev->msix->mmio_base_addr;
+        table_size = dev->msix->total_entries * PCI_MSIX_ENTRY_SIZE;
+        if ( addr < table_addr || addr >= table_addr + table_size )
+            continue;
+
+        return dev;
+    }
+
+    return NULL;
+}
+
+
+static uint32_t vmsix_get_entry_value(struct hvm_pt_msix_entry *e, int offset)
+{
+    ASSERT(!(offset % sizeof(*e->latch)));
+    return e->latch[offset / sizeof(*e->latch)];
+}
+
+static void vmsix_set_entry_value(struct hvm_pt_msix_entry *e, int offset,
+                                  uint32_t val)
+{
+    ASSERT(!(offset % sizeof(*e->latch)));
+    e->latch[offset / sizeof(*e->latch)] = val;
+}
+
+static int vmsix_mem_write(struct vcpu *v, unsigned long addr,
+                           unsigned int size, unsigned long val)
+{
+    struct hvm_pt_device *s;
+    struct hvm_pt_msix *msix;
+    struct hvm_pt_msix_entry *entry;
+    unsigned int entry_nr, offset;
+    unsigned long raddr;
+    int rc = 0;
+
+    read_lock(&v->domain->arch.hvm_domain.pt_lock);
+    s = vmsix_find_dev_mmio(v->domain, addr);
+    msix = s->msix;
+    raddr = addr - msix->mmio_base_addr;
+    entry_nr = raddr / PCI_MSIX_ENTRY_SIZE;
+    if ( entry_nr >= msix->total_entries )
+    {
+        printk_pdev(s->pdev, XENLOG_ERR, "asked MSI-X entry %d out of range!\n",
+                    entry_nr);
+        rc = -EINVAL;
+        goto out;
+    }
+
+    entry = &msix->msix_entry[entry_nr];
+    offset = raddr % PCI_MSIX_ENTRY_SIZE;
+
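+    /*
+     * Address/data writes are only latched here and get applied on the
+     * next vector control write; vector control writes are applied
+     * immediately via vmsix_update_one().
+     */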
+    if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
+    {
+        if ( vmsix_get_entry_value(entry, offset) == val && entry->pirq != -1 )
+            goto out;
+
+        entry->updated = true;
+    }
+    else
+        vmsix_update_one(s, entry_nr, val);
+
+    vmsix_set_entry_value(entry, offset, val);
+
+ out:
+    read_unlock(&v->domain->arch.hvm_domain.pt_lock);
+    return rc;
+}
+
+static int vmsix_mem_read(struct vcpu *v, unsigned long addr,
+                          unsigned int size, unsigned long *val)
+{
+    struct hvm_pt_device *s;
+    struct hvm_pt_msix *msix;
+    unsigned long raddr;
+    int entry_nr, offset;
+
+    read_lock(&v->domain->arch.hvm_domain.pt_lock);
+    s = vmsix_find_dev_mmio(v->domain, addr);
+    msix = s->msix;
+    raddr = addr - msix->mmio_base_addr;
+    entry_nr = raddr / PCI_MSIX_ENTRY_SIZE;
+    if ( entry_nr >= msix->total_entries )
+    {
+        printk_pdev(s->pdev, XENLOG_ERR, "asked MSI-X entry %d out of range!\n",
+                    entry_nr);
+        read_unlock(&v->domain->arch.hvm_domain.pt_lock);
+        return -EINVAL;
+    }
+
+    offset = raddr % PCI_MSIX_ENTRY_SIZE;
+    *val = vmsix_get_entry_value(&msix->msix_entry[entry_nr], offset);
+    read_unlock(&v->domain->arch.hvm_domain.pt_lock);
+
+    return 0;
+}
+
+static int vmsix_mem_accepts(struct vcpu *v, unsigned long addr)
+{
+    int accept;
+
+    read_lock(&v->domain->arch.hvm_domain.pt_lock);
+    accept = (vmsix_find_dev_mmio(v->domain, addr) != NULL);
+    read_unlock(&v->domain->arch.hvm_domain.pt_lock);
+
+    return accept;
+}
+
+const struct hvm_mmio_ops vmsix_mmio_ops = {
+    .check = vmsix_mem_accepts,
+    .read = vmsix_mem_read,
+    .write = vmsix_mem_write,
+};
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 60c9e74..f143029 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -681,9 +681,11 @@ static int bar_reg_init(struct hvm_pt_device *s,
     if ( bars[index].type == HVM_PT_BAR_MEM32 ||
          bars[index].type == HVM_PT_BAR_MEM64_LO )
     {
+        unsigned int next_index = index;
+
         /* Size the BAR and map it. */
         rc = pci_size_bar(seg, bus, slot, func, real_offset - handler->offset,
-                          num_bars, &index, &addr, &size);
+                          num_bars, &next_index, &addr, &size);
         if ( rc )
         {
             printk_pdev(s->pdev, XENLOG_ERR, "unable to size BAR#%d\n",
@@ -706,6 +708,8 @@ static int bar_reg_init(struct hvm_pt_device *s,
                             index, rc);
                 return rc;
             }
+            bars[index].addr = addr;
+            bars[index].size = size;
         }
     }
 
diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
index 0f8726a..9d9c439 100644
--- a/xen/include/asm-x86/hvm/io.h
+++ b/xen/include/asm-x86/hvm/io.h
@@ -281,7 +281,30 @@ struct hvm_pt_msi {
     bool_t mapped;       /* when pirq is mapped */
 };
 
+/* Guest MSI-X information. */
+struct hvm_pt_msix_entry {
+    int pirq;
+    uint64_t addr;
+    uint32_t data;
+    uint32_t latch[4];
+    bool updated; /* indicate whether MSI ADDR or DATA is updated */
+};
+
+struct hvm_pt_msix {
+    uint32_t ctrl_offset;
+    bool enabled;
+    bool maskall;
+    int total_entries;
+    int bar_index;
+    uint64_t table_base;
+    uint32_t table_offset; /* page align mmap */
+    uint64_t mmio_base_addr;
+    struct hvm_pt_msix_entry msix_entry[0];
+};
+
 struct hvm_pt_bar {
+    uint64_t addr;
+    uint64_t size;
     uint32_t val;
     enum bar_type {
         HVM_PT_BAR_UNUSED,
@@ -305,6 +328,9 @@ struct hvm_pt_device {
     /* MSI status. */
     struct hvm_pt_msi msi;
 
+    /* MSI-X status. */
+    struct hvm_pt_msix *msix;
+
     /* PCI header type. */
     uint8_t htype;
 
diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
index 8c7fb27..981f9ef 100644
--- a/xen/include/asm-x86/msi.h
+++ b/xen/include/asm-x86/msi.h
@@ -275,7 +275,9 @@ static inline uint32_t msi_gflags(uint32_t data, uint64_t addr)
     return result;
 }
 
-/* MSI HVM pass-through handlers. */
+/* MSI(-X) HVM pass-through handlers. */
 extern struct hvm_pt_handler_init hvm_pt_msi_init;
+extern struct hvm_pt_handler_init hvm_pt_msix_init;
+extern const struct hvm_mmio_ops vmsix_mmio_ops;
 
 #endif /* __ASM_MSI_H */
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (27 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0 Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-09-30 17:36   ` George Dunlap
  2016-10-10 14:21   ` Jan Beulich
  2016-09-27 15:57 ` [PATCH v2 30/30] xen: allow setting the store pfn HVM parameter Roger Pau Monne
  2016-09-28 12:22 ` [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
  30 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, boris.ostrovsky,
	Roger Pau Monne

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: George Dunlap <george.dunlap@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/mm/p2m.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 44492ae..c989b60 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -2793,7 +2793,7 @@ int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
     struct domain *fdom;
 
     ASSERT(tdom);
-    if ( foreigndom == DOMID_SELF || !is_pvh_domain(tdom) )
+    if ( foreigndom == DOMID_SELF || !has_hvm_container_domain(tdom) )
         return -EINVAL;
     /*
      * pvh fixme: until support is added to p2m teardown code to cleanup any
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* [PATCH v2 30/30] xen: allow setting the store pfn HVM parameter
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (28 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings Roger Pau Monne
@ 2016-09-27 15:57 ` Roger Pau Monne
  2016-10-03 11:01   ` Paul Durrant
  2016-09-28 12:22 ` [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
  30 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-27 15:57 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

Xen already allows setting the store event channel, and this parameter is
not used by Xen at all.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
 xen/arch/x86/hvm/hvm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index bc4f7bc..5c3aa2a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -4982,6 +4982,7 @@ static int hvm_allow_set_param(struct domain *d,
     case HVM_PARAM_STORE_EVTCHN:
     case HVM_PARAM_CONSOLE_EVTCHN:
     case HVM_PARAM_X87_FIP_WIDTH:
+    case HVM_PARAM_STORE_PFN:
         break;
     /*
      * The following parameters must not be set by the guest
-- 
2.7.4 (Apple Git-66)


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form
  2016-09-27 15:57 ` [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form Roger Pau Monne
@ 2016-09-28  8:24   ` Juergen Gross
  2016-09-28 11:56     ` Roger Pau Monne
  2016-10-03  8:36   ` Paul Durrant
  2016-10-11 10:27   ` Roger Pau Monne
  2 siblings, 1 reply; 146+ messages in thread
From: Juergen Gross @ 2016-09-28  8:24 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, boris.ostrovsky

On 27/09/16 17:57, Roger Pau Monne wrote:
> Introduce a new %pB format specifier to print sizes (in bytes) in a

Code suggests it is pZ.

> human-readable form.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
>  docs/misc/printk-formats.txt |  5 +++++
>  xen/common/vsprintf.c        | 15 +++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/docs/misc/printk-formats.txt b/docs/misc/printk-formats.txt
> index 525108f..0ee3504 100644
> --- a/docs/misc/printk-formats.txt
> +++ b/docs/misc/printk-formats.txt
> @@ -30,3 +30,8 @@ Domain and vCPU information:
>  
>         %pv     Domain and vCPU ID from a 'struct vcpu *' (printed as
>                 "d<domid>v<vcpuid>")
> +
> +Size in human readable form:
> +
> +       %pZ     Size in human-readable form (input size must be in bytes).
> +                 e.g.  24MB
> diff --git a/xen/common/vsprintf.c b/xen/common/vsprintf.c
> index f92fb67..2dadaec 100644
> --- a/xen/common/vsprintf.c
> +++ b/xen/common/vsprintf.c
> @@ -386,6 +386,21 @@ static char *pointer(char *str, char *end, const char **fmt_ptr,
>              *str = 'v';
>          return number(str + 1, end, v->vcpu_id, 10, -1, -1, 0);
>      }
> +    case 'Z':
> +    {
> +        static const char units[][3] = {"B", "KB", "MB", "GB", "TB"};
> +        size_t size = (size_t)arg;
> +        int i = 0;
> +
> +        /* Advance parents fmt string, as we have consumed 'B' */
> +        ++*fmt_ptr;
> +
> +        while ( ++i < sizeof(units) && size >= 1024 )

sizeof(units) is 15. That's not what you want to check here.


Juergen

> +            size >>= 10; /* size /= 1024 */
> +
> +        str = number(str, end, size, 10, -1, -1, 0);
> +        return string(str, end, units[i-1], -1, -1, 0);
> +    }
>      }
>  
>      if ( field_width == -1 )
> 


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-09-27 15:56 ` [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions Roger Pau Monne
@ 2016-09-28  9:34   ` Tim Deegan
  2016-09-29 10:39   ` Jan Beulich
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 146+ messages in thread
From: Tim Deegan @ 2016-09-28  9:34 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Jan Beulich, xen-devel, boris.ostrovsky

At 17:56 +0200 on 27 Sep (1474999018), Roger Pau Monne wrote:
> Return should be an int, and the number of pages should be an unsigned long.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Both those changes seem fine to me.  Since you're changing the return
types to int, can you please also change the two callers (hap_enable()
and shadow_enable()) that assign it to an unsigned variable?  AFAICS
both should just pass the result through instead of forcing all errors
to -ENOMEM.
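
E.g. a hypothetical sketch of that shape (locals and context are
illustrative, not the actual hap_enable() code):

    int rc;

    paging_lock(d);
    rc = hap_set_allocation(d, pages, NULL);
    paging_unlock(d);
    if ( rc )
        return rc; /* propagate, instead of forcing -ENOMEM */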

Cheers,

Tim.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form
  2016-09-28  8:24   ` Juergen Gross
@ 2016-09-28 11:56     ` Roger Pau Monne
  2016-09-28 12:01       ` Andrew Cooper
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-28 11:56 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Andrew Cooper,
	Ian Jackson, Tim Deegan, Jan Beulich, xen-devel, boris.ostrovsky

On Wed, Sep 28, 2016 at 10:24:07AM +0200, Juergen Gross wrote:
> On 27/09/16 17:57, Roger Pau Monne wrote:
> > Introduce a new %pB format specifier to print sizes (in bytes) in a
> 
> Code suggests it is pZ.

Right, the first implementation used pB, but then I realized this was
already used by Linux to print something else.
 
> > human-readable form.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > Cc: Stefano Stabellini <sstabellini@kernel.org>
> > Cc: Tim Deegan <tim@xen.org>
> > Cc: Wei Liu <wei.liu2@citrix.com>
> > ---
> >  docs/misc/printk-formats.txt |  5 +++++
> >  xen/common/vsprintf.c        | 15 +++++++++++++++
> >  2 files changed, 20 insertions(+)
> > 
> > diff --git a/docs/misc/printk-formats.txt b/docs/misc/printk-formats.txt
> > index 525108f..0ee3504 100644
> > --- a/docs/misc/printk-formats.txt
> > +++ b/docs/misc/printk-formats.txt
> > @@ -30,3 +30,8 @@ Domain and vCPU information:
> >  
> >         %pv     Domain and vCPU ID from a 'struct vcpu *' (printed as
> >                 "d<domid>v<vcpuid>")
> > +
> > +Size in human readable form:
> > +
> > +       %pZ     Size in human-readable form (input size must be in bytes).
> > +                 e.g.  24MB
> > diff --git a/xen/common/vsprintf.c b/xen/common/vsprintf.c
> > index f92fb67..2dadaec 100644
> > --- a/xen/common/vsprintf.c
> > +++ b/xen/common/vsprintf.c
> > @@ -386,6 +386,21 @@ static char *pointer(char *str, char *end, const char **fmt_ptr,
> >              *str = 'v';
> >          return number(str + 1, end, v->vcpu_id, 10, -1, -1, 0);
> >      }
> > +    case 'Z':
> > +    {
> > +        static const char units[][3] = {"B", "KB", "MB", "GB", "TB"};
> > +        size_t size = (size_t)arg;
> > +        int i = 0;
> > +
> > +        /* Advance parents fmt string, as we have consumed 'B' */
> > +        ++*fmt_ptr;
> > +
> > +        while ( ++i < sizeof(units) && size >= 1024 )
> 
> sizeof(units) is 15. That's not what you want to check here.

Indeed, it's ARRAY_SIZE, thanks for noticing!
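
With that the loop would presumably become:

    while ( ++i < ARRAY_SIZE(units) && size >= 1024 )
        size >>= 10; /* size /= 1024 */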

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form
  2016-09-28 11:56     ` Roger Pau Monne
@ 2016-09-28 12:01       ` Andrew Cooper
  0 siblings, 0 replies; 146+ messages in thread
From: Andrew Cooper @ 2016-09-28 12:01 UTC (permalink / raw)
  To: Roger Pau Monne, Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, George Dunlap, Tim Deegan,
	Ian Jackson, Jan Beulich, xen-devel, boris.ostrovsky

On 28/09/16 12:56, Roger Pau Monne wrote:
> On Wed, Sep 28, 2016 at 10:24:07AM +0200, Juergen Gross wrote:
>> On 27/09/16 17:57, Roger Pau Monne wrote:
>>> Introduce a new %pB format specifier to print sizes (in bytes) in a
>> Code suggests it is pZ.
> Right, first implementation used pB, but then I realized this was already 
> used by Linux to print something else.
>  
>>> diff --git a/xen/common/vsprintf.c b/xen/common/vsprintf.c
>>> index f92fb67..2dadaec 100644
>>> --- a/xen/common/vsprintf.c
>>> +++ b/xen/common/vsprintf.c
>>> @@ -386,6 +386,21 @@ static char *pointer(char *str, char *end, const char **fmt_ptr,
>>>              *str = 'v';
>>>          return number(str + 1, end, v->vcpu_id, 10, -1, -1, 0);
>>>      }
>>> +    case 'Z':
>>> +    {
>>> +        static const char units[][3] = {"B", "KB", "MB", "GB", "TB"};
>>> +        size_t size = (size_t)arg;
>>> +        int i = 0;
>>> +
>>> +        /* Advance parents fmt string, as we have consumed 'B' */

And this comment.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 00/30] PVHv2 Dom0
  2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
                   ` (29 preceding siblings ...)
  2016-09-27 15:57 ` [PATCH v2 30/30] xen: allow setting the store pfn HVM parameter Roger Pau Monne
@ 2016-09-28 12:22 ` Roger Pau Monne
  30 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-28 12:22 UTC (permalink / raw)
  To: xen-devel, boris.ostrovsky

On Tue, Sep 27, 2016 at 05:56:55PM +0200, Roger Pau Monne wrote:
> Hello,
> 
> This is the first "complete" PVHv2 implementation in the sense that
> it has feature parity with classic PVH Dom0. It is still very experimental,
> but I've managed to boot it on all Intel boxes I've tried (OK, only 3 so far).
> I've also tried on an AMD box, but sadly the ACPI tables there didn't contain
> any RMRR regions and the IOMMU started spitting a bunch of page faults caused
> by devices trying to access memory regions marked as reserved in the e820.
> 
> Here is the list of most relevant changes compared to classic PVH or PV:
> 
>  - An emulated local APIC (for each vCPU) and IO APIC is provided to Dom0.
>  - The MADT has been replaced in order to reflect the real topology seen by
>    the guest. This needs a little bit more of work so that the new MADT is
>    not placed over the old one.
>  - BARs of PCI devices are automatically mapped by Xen into Dom0 memory space.
>    Currently there's no support for changing the position of the BARs, and Xen
>    will just crash the domain if this is attempted.
>  - Interrupts from physical devices are configured and routed using native
>    mechanism, PIRQs are not available to this PVH Dom0 implementation at all.
>    Xen will automatically detect the features of the physical devices available
>    to Dom0, and will setup traps/handlers in order to intercept all relevant
>    configuration accesses.
>  - Access to PCIe regions is also trapped by Xen, so configuration can be done
>    using the IO ports or the memory mapped configuration areas, as supported
>    by each device.
>  - Some ACPI tables are zapped (their signature is inverted) to prevent Dom0
>    from poking at them, those are: HPET, SLIT, SRAT, MPST and PMTT.
> 
> I know this series is quite big, but I think some of the initial changes can
> go in as bugfixes, which would help me reduce the size.
> 

I've forgotten to push this to my personal git repo; the series can be found
at:

git://xenbits.xen.org/people/royger/xen.git dom0_hvm_v2

(with the recent comments from this morning addressed)

I've also forgotten to mention that on one Intel box (a two-socket machine
with Xeon E5520 CPUs and a 5520/5500 chipset with two IO APICs) I need to
forcefully set the IO APIC ack mode to 'old', or else legacy PCI interrupts
get stuck with irr=1. And even when doing that, after some uptime Dom0
starts losing MSI and MSI-X interrupts. I haven't been able to reproduce
this on any other box.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 01/30] xen/x86: move setup of the VM86 TSS to the domain builder
  2016-09-27 15:56 ` [PATCH v2 01/30] xen/x86: move setup of the VM86 TSS to the domain builder Roger Pau Monne
@ 2016-09-28 15:35   ` Jan Beulich
  2016-09-29 12:57     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-28 15:35 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> This is also required for PVHv2 guests if they want to use real-mode, and
> hvmloader is not executed for those kind of guests.

While the intention is fine, I'm not convinced of consuming yet another
special page here: unlike the way hvmloader's allocation works,
here you permanently take away a page from the guest unconditionally,
which (a) is used only on VMX, (b) only on old hardware, and (c) VMX
code appears to even be able to help itself without this TSS (at the
price of doing more emulation).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 02/30] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-09-27 15:56 ` [PATCH v2 02/30] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
@ 2016-09-28 16:03   ` Jan Beulich
  2016-09-29 14:17     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-28 16:03 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> On PVHv2 guests we explicitly want to use native methods for routing
> interrupts.
> 
> Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> route physical interrupts (even from emulated devices) over event channels.

So you specifically want this new flag off for PVHv2 guests? Based on
just this description I did get the opposite impression...

> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4117,6 +4117,8 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  
>  static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>  {
> +    struct domain *d = current->domain;

currd please.

> @@ -4128,7 +4130,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>      case PHYSDEVOP_eoi:
>      case PHYSDEVOP_irq_status_query:
>      case PHYSDEVOP_get_free_pirq:
> -        return do_physdev_op(cmd, arg);
> +        return ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
> +               is_pvh_vcpu(current)) ?

is_pvh_domain(currd)

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-09-27 15:56 ` [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions Roger Pau Monne
  2016-09-28  9:34   ` Tim Deegan
@ 2016-09-29 10:39   ` Jan Beulich
  2016-09-29 14:33     ` Roger Pau Monne
  2016-09-30 16:48   ` George Dunlap
  2016-10-03  8:05   ` Paul Durrant
  3 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 10:39 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> Return should be an int, and the number of pages should be an unsigned long.

I can see the former, but why the latter? Acting on 32-bit quantities
is a little cheaper after all.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-09-27 15:56 ` [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
@ 2016-09-29 10:43   ` Jan Beulich
  2016-09-29 14:37     ` Roger Pau Monne
  2016-09-30 16:56   ` George Dunlap
  1 sibling, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 10:43 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
>              break;
>  
>          /* Check to see if we need to yield and try again */
> -        if ( preempted && hypercall_preempt_check() )
> +        if ( preempted &&
> +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> +                                      hypercall_preempt_check()) )

So what is the supposed action for the caller to take in this case?
I think this should at least be spelled out in the commit message.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 05/30] xen/x86: assert that local_events_need_delivery is not called by the idle domain
  2016-09-27 15:57 ` [PATCH v2 05/30] xen/x86: assert that local_events_need_delivery is not called by " Roger Pau Monne
@ 2016-09-29 10:45   ` Jan Beulich
  2016-09-30  8:32     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 10:45 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> It doesn't make sense since the idle domain doesn't receive any events.

The change itself is fine, but I think it would help if the commit
message made explicit why this is becoming relevant.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 06/30] x86/paging: introduce paging_set_allocation
  2016-09-27 15:57 ` [PATCH v2 06/30] x86/paging: introduce paging_set_allocation Roger Pau Monne
@ 2016-09-29 10:51   ` Jan Beulich
  2016-09-29 14:51     ` Roger Pau Monne
  2016-09-30 17:00   ` George Dunlap
  1 sibling, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 10:51 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> @@ -1383,15 +1382,25 @@ int __init construct_dom0(
>                           nr_pages);
>      }
>  
> -    if ( is_pvh_domain(d) )
> -        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
> -
>      /*
>       * We enable paging mode again so guest_physmap_add_page will do the
>       * right thing for us.
>       */

I'm afraid you render this comment stale - please adjust it accordingly.

> --- a/xen/arch/x86/mm/paging.c
> +++ b/xen/arch/x86/mm/paging.c
> @@ -954,6 +954,22 @@ void paging_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
>          safe_write_pte(p, new);
>  }
>  
> +int paging_set_allocation(struct domain *d, unsigned long pages, int *preempted)

Since you need to touch the other two functions anyway, please
make the last parameter bool * here and there.
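
I.e. presumably something like:

    int paging_set_allocation(struct domain *d, unsigned long pages,
                              bool *preempted);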

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 01/30] xen/x86: move setup of the VM86 TSS to the domain builder
  2016-09-28 15:35   ` Jan Beulich
@ 2016-09-29 12:57     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 12:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Wei Liu, Andrew Cooper, Ian Jackson, xen-devel, boris.ostrovsky

On Wed, Sep 28, 2016 at 09:35:21AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> > This is also required for PVHv2 guests if they want to use real-mode, and
> > hvmloader is not executed for those kind of guests.
> 
> While the intention is fine, I'm not convinced of consuming yet another
> special page here: Other than the way hvmloader's allocation works,
> here you permanently take away a page from the guest unconditionally
> which (a) is used only on VMX, (b) only on old hardware, and (c) VMX
> code appears to even be able to help itself without this TSS (at the
> price of doing more emulation).

Yes, real mode should also work without this. Given that I don't think we
expect real mode to be used for anything but early AP initialization,
I guess we could just leave PVHv2 guests without the TSS.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 07/30] xen/x86: split the setup of Dom0 permissions to a function
  2016-09-27 15:57 ` [PATCH v2 07/30] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
@ 2016-09-29 13:47   ` Jan Beulich
  2016-09-29 15:53     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 13:47 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> So that it can also be used by the PVH-specific domain builder. This is just
> code motion; it should not introduce any functional change.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Looks generally okay, but please do minor style corrections as you
move code:

> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -869,6 +869,89 @@ static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
>      unmap_domain_page(l4start);
>  }
>  
> +static int __init setup_permissions(struct domain *d)
> +{
> +    unsigned long mfn;
> +    int i, rc = 0;

i should be unsigned int, and the initializer of rc could be avoided.

> +    /* The hardware domain is initially permitted full I/O capabilities. */
> +    rc |= ioports_permit_access(d, 0, 0xFFFF);
> +    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
> +    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
> +
> +    /*
> +     * Modify I/O port access permissions.
> +     */

This is a single line comment - I understand it's trying to be more of a
separator than the others, but I'd prefer for it to do so by being
followed by a blank line.

> +    /* Master Interrupt Controller (PIC). */
> +    rc |= ioports_deny_access(d, 0x20, 0x21);
> +    /* Slave Interrupt Controller (PIC). */
> +    rc |= ioports_deny_access(d, 0xA0, 0xA1);
> +    /* Interval Timer (PIT). */
> +    rc |= ioports_deny_access(d, 0x40, 0x43);
> +    /* PIT Channel 2 / PC Speaker Control. */
> +    rc |= ioports_deny_access(d, 0x61, 0x61);
> +    /* ACPI PM Timer. */
> +    if ( pmtmr_ioport )
> +        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
> +    /* PCI configuration space (NB. 0xcf8 has special treatment). */
> +    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
> +    /* Command-line I/O ranges. */
> +    process_dom0_ioports_disable(d);
> +
> +    /*
> +     * Modify I/O memory access permissions.
> +     */

Ditto.

> -    BUG_ON(rc != 0);
> +    rc = setup_permissions(d);
> +    if ( rc != 0 )
> +        panic("Failed to setup Dom0 permissions");

To be honest, I'm not sure of this BUG_ON() -> panic() conversion.
I think I'd prefer it to stay the way it was. We're not really expecting
for any of this to fail anyway.

Jan


* Re: [PATCH v2 08/30] xen/x86: do the PCI scan unconditionally
  2016-09-27 15:57 ` [PATCH v2 08/30] xen/x86: do the PCI scan unconditionally Roger Pau Monne
@ 2016-09-29 13:55   ` Jan Beulich
  2016-09-29 15:11     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 13:55 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> Instead of being tied to the presence of an IOMMU

At the very least I'd expect the "why" aspect to get mentioned
here.

> --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
>  
>      if ( !amd_iommu_perdev_intremap )
>          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
> -    return scan_pci_devices();
> +
> +    return 0;
>  }

Note how an error return from the function at least here does not
get ignored, leading to the IOMMU not getting enabled. This behavior
should be preserved, and I think it actually should extend to VT-d.
Furthermore iiuc you make PVHv2 Dom0 setup depend on this
succeeding, which may then require further propagation of the
success status here (or maybe, just like PVHv1, you require a
functional IOMMU, in which case failing IOMMU setup would be
sufficient).
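
I.e. wherever the scan ends up being done, its failure would want
handling along the lines of (sketch; the exact error handling is of
course up to you):

    rc = scan_pci_devices();
    if ( rc )
        /* Propagate the error, so the IOMMU doesn't get enabled. */
        return rc;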

Jan



* Re: [PATCH v2 02/30] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-09-28 16:03   ` Jan Beulich
@ 2016-09-29 14:17     ` Roger Pau Monne
  2016-09-29 16:07       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 14:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Sep 28, 2016 at 10:03:21AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> > On PVHv2 guests we explicitly want to use native methods for routing
> > interrupts.
> > 
> > Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> > route physical interrupts (even from emulated devices) over event channels.
> 
> So you specifically want this new flag off for PVHv2 guests? Based on
> just this description I did get the opposite impression...

Yes, that's right, I don't want PVHv2 guests to know anything about PIRQs. I 
don't really know how to reword this, what about:

"PVHv2 guests, unlike HVM guests, won't have the option to route interrupts 
from physical or emulated devices over event channels using PIRQs. This 
applies to both DomU and Dom0 PVHv2 guests.

Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
route physical interrupts (even from emulated devices) over event channels, 
and is thus allowed to use some of the PHYSDEV ops."

> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -4117,6 +4117,8 @@ static long hvm_memory_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >  
> >  static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >  {
> > +    struct domain *d = current->domain;
> 
> currd please.
> 
> > @@ -4128,7 +4130,9 @@ static long hvm_physdev_op(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >      case PHYSDEVOP_eoi:
> >      case PHYSDEVOP_irq_status_query:
> >      case PHYSDEVOP_get_free_pirq:
> > -        return do_physdev_op(cmd, arg);
> > +        return ((d->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
> > +               is_pvh_vcpu(current)) ?
> 
> is_pvh_domain(currd)

Thanks, it's fixed now. I've also taken the opportunity to fix two other
instances of current and current->domain.
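
I.e. the check now looks roughly like this (sketch; the error value
used in the fall-back case is an assumption here):

    struct domain *currd = current->domain;
    ...
        return ((currd->arch.emulation_flags & XEN_X86_EMU_USE_PIRQ) ||
                is_pvh_domain(currd)) ?
               do_physdev_op(cmd, arg) : -ENOSYS;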

Roger.


* Re: [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions
  2016-09-27 15:57 ` [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions Roger Pau Monne
@ 2016-09-29 14:18   ` Jan Beulich
  2016-09-30 11:27     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 14:18 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper, xen-devel,
	boris.ostrovsky

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> The current code used by Intel VTd will only map RMRR regions for the
> hardware domain, but will fail to map RMRR regions for unprivileged domains
> unless the page tables are shared between EPT and IOMMU.

Okay, if that's the case it surely should get fixed.

> Fix this and
> simplify the code, removing the {set/clear}_identity_p2m_entry helpers and
> just using the normal MMIO mapping functions.

This simplification, however, goes too far. Namely ...

> -int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
> -                           p2m_access_t p2ma, unsigned int flag)
> -{
> -    p2m_type_t p2mt;
> -    p2m_access_t a;
> -    mfn_t mfn;
> -    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> -    int ret;
> -
> -    if ( !paging_mode_translate(p2m->domain) )
> -    {
> -        if ( !need_iommu(d) )
> -            return 0;
> -        return iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> -    }
> -
> -    gfn_lock(p2m, gfn, 0);
> -
> -    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
> -
> -    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
> -        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> -                            p2m_mmio_direct, p2ma);
> -    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
> -    {
> -        ret = 0;
> -        /*
> -         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
> -         * but iomem regions are not mapped with IOMMU. This makes sure that
> -         * RMRRs are correctly mapped with IOMMU.
> -         */
> -        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
> -            ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> -    }
> -    else
> -    {
> -        if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
> -            ret = 0;
> -        else
> -            ret = -EBUSY;
> -        printk(XENLOG_G_WARNING
> -               "Cannot setup identity map d%d:%lx,"
> -               " gfn already mapped to %lx.\n",
> -               d->domain_id, gfn, mfn_x(mfn));

... this logic (and its clear side counterpart) should not be removed
without replacement. Note in this context how you render "flag" an
unused parameter of rmrr_identity_mapping().

> --- a/xen/include/xen/p2m-common.h
> +++ b/xen/include/xen/p2m-common.h
> @@ -2,6 +2,7 @@
>  #define _XEN_P2M_COMMON_H
>  
>  #include <public/vm_event.h>
> +#include <xen/softirq.h>
>  
>  /*
>   * Additional access types, which are used to further restrict
> @@ -46,6 +47,35 @@ int unmap_mmio_regions(struct domain *d,
>                         mfn_t mfn);
>  
>  /*
> + * Preemptive Helper for mapping/unmapping MMIO regions.
> + */

Single line comment.

> +static inline int modify_mmio_11(struct domain *d, unsigned long pfn,
> +                                 unsigned long nr_pages, bool map)

Why do you make this an inline function? And I have to admit that I
dislike this strange use of number 11 - what's wrong with continuing
to use the term "direct map" in one way or another?

> +{
> +    int rc;
> +
> +    while ( nr_pages > 0 )
> +    {
> +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> +        if ( rc == 0 )
> +            break;
> +        if ( rc < 0 )
> +        {
> +            printk(XENLOG_ERR
> +                "Failed to %smap %#lx - %#lx into domain %d memory map: %d\n",

"Failed to identity %smap [%#lx,%#lx) for d%d: %d\n"

And I think XENLOG_WARNING would do - whether this actually is
a problem depends on further factors.

> +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> +            return rc;
> +        }
> +        nr_pages -= rc;
> +        pfn += rc;
> +        process_pending_softirqs();

Is this what you call "preemptive"? 

> +    }
> +
> +    return rc;

The way this is coded it appears to possibly return non-zero even in
success case. I think this would therefore better be a for ( ; ; ) loop.

Jan


* Re: [PATCH v2 10/30] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-09-27 15:57 ` [PATCH v2 10/30] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
@ 2016-09-29 14:26   ` Jan Beulich
  2016-09-30 15:44     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 14:26 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
> +{
> +
> +    if ( is_hvm_domain(d) )
> +    {
> +        if ( is_hardware_domain(d) &&
> +             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC))
> +            return false;
> +        if ( !is_hardware_domain(d) &&
> +             emflags != XEN_X86_EMU_ALL && emflags != 0 )
> +            return false;
> +    }
> +    else
> +    {
> +        /* PV or classic PVH. */
> +        if ( is_hardware_domain(d) && emflags != XEN_X86_EMU_PIT )
> +            return false;
> +        if ( !is_hardware_domain(d) && emflags != 0 )
> +            return false;

Previous code permitted XEN_X86_EMU_PIT also for the non-hardware
domains afaict. You shouldn't change behavior without saying so and
explaining why.

Jan



* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-09-29 10:39   ` Jan Beulich
@ 2016-09-29 14:33     ` Roger Pau Monne
  2016-09-29 16:09       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 14:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

On Thu, Sep 29, 2016 at 04:39:02AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> > Return should be an int, and the number of pages should be an unsigned long.
> 
> I can see the former, but why the latter? Acting on 32-bit quantities
> is a little cheaper after all.

This was requested by Andrew in the previous version:

https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg03126.html

But yes, an unsigned int is enough to hold the maximum number of pages for 
a domain given that we support up to 1TB for HVM guests. Or maybe there are 
plans to increase the maximum supported memory per guest?

Roger.


* Re: [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-09-29 10:43   ` Jan Beulich
@ 2016-09-29 14:37     ` Roger Pau Monne
  2016-09-29 16:10       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 14:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Sep 29, 2016 at 04:43:07AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
> > --- a/xen/arch/x86/mm/hap/hap.c
> > +++ b/xen/arch/x86/mm/hap/hap.c
> > @@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
> >              break;
> >  
> >          /* Check to see if we need to yield and try again */
> > -        if ( preempted && hypercall_preempt_check() )
> > +        if ( preempted &&
> > +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
> > +                                      hypercall_preempt_check()) )
> 
> So what is the supposed action for the caller to take in this case?
> I think this should at least be spelled out in the commit message.

I'm not sure I follow, but I assume you mean in the idle vcpu case?

In that case the caller should call process_pending_softirqs. I will modify 
the commit message to include it.
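
I.e. a caller running on the idle vCPU would look roughly like
(sketch, mirroring what the PVH Dom0 builder does later in this
series):

    do {
        preempted = 0;
        hap_set_allocation(d, pages, &preempted);
        process_pending_softirqs();
    } while ( preempted );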

Roger.


* Re: [PATCH v2 06/30] x86/paging: introduce paging_set_allocation
  2016-09-29 10:51   ` Jan Beulich
@ 2016-09-29 14:51     ` Roger Pau Monne
  2016-09-29 16:12       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 14:51 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

On Thu, Sep 29, 2016 at 04:51:42AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > @@ -1383,15 +1382,25 @@ int __init construct_dom0(
> >                           nr_pages);
> >      }
> >  
> > -    if ( is_pvh_domain(d) )
> > -        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
> > -
> >      /*
> >       * We enable paging mode again so guest_physmap_add_page will do the
> >       * right thing for us.
> >       */
> 
> I'm afraid you render this comment stale - please adjust it accordingly.

Not AFAICT, this comment is referring to the next line, which is:

d->arch.paging.mode = save_pvh_pg_mode;

The classic PVH domain builder contains quite a lot of craziness and 
disables paging modes at certain points by playing with d->arch.paging.mode.

> > --- a/xen/arch/x86/mm/paging.c
> > +++ b/xen/arch/x86/mm/paging.c
> > @@ -954,6 +954,22 @@ void paging_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
> >          safe_write_pte(p, new);
> >  }
> >  
> > +int paging_set_allocation(struct domain *d, unsigned long pages, int *preempted)
> 
> Since you need to touch the other two functions anyway, please
> make the last parameter bool * here and there.

OK, that's fine.

Thanks, Roger.


* Re: [PATCH v2 08/30] xen/x86: do the PCI scan unconditionally
  2016-09-29 13:55   ` Jan Beulich
@ 2016-09-29 15:11     ` Roger Pau Monne
  2016-09-29 16:14       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 15:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, boris.ostrovsky

On Thu, Sep 29, 2016 at 07:55:00AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > Instead of being tied to the presence of an IOMMU
> 
> At the very least I'd expect the "why" aspect to get mentioned
> here.

TBH, it seems simpler to have it there rather than conditional on the 
presence of an IOMMU. Also, I think scan_pci_devices failing should be 
fatal. Would you be OK with me adding a panic if the scan fails?

> > --- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > +++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
> > @@ -219,7 +219,8 @@ int __init amd_iov_detect(void)
> >  
> >      if ( !amd_iommu_perdev_intremap )
> >          printk(XENLOG_WARNING "AMD-Vi: Using global interrupt remap table is not recommended (see XSA-36)!\n");
> > -    return scan_pci_devices();
> > +
> > +    return 0;
> >  }
> 
> Not how an error return from the function at least here does not
> get ignored, leading to the IOMMU not getting enabled. This behavior
> should be preserved, and I think it actually should extend to VT-d.
> Furthermore iiuc you make PVHv2 Dom0 setup depend on this
> succeeding, which may then require further propagation of the
> success status here (or maybe, just like PVHv1, you require a
> functional IOMMU, in which case failing IOMMU setup would be
> sufficient).

Yes, PVHv2, just like PVHv1, requires a functional IOMMU; see 
check_hwdom_reqs in passthrough/iommu.c (if the hardware domain is using a 
translated paging mode an IOMMU is required).

Roger.


* Re: [PATCH v2 07/30] xen/x86: split the setup of Dom0 permissions to a function
  2016-09-29 13:47   ` Jan Beulich
@ 2016-09-29 15:53     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 15:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Sep 29, 2016 at 07:47:22AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > So that it can also be used by the PVH-specific domain builder. This is just
> > code motion; it should not introduce any functional change.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Looks generally okay, but please do minor style corrections as you
> move code:
> 
> > --- a/xen/arch/x86/domain_build.c
> > +++ b/xen/arch/x86/domain_build.c
> > @@ -869,6 +869,89 @@ static __init void setup_pv_physmap(struct domain *d, unsigned long pgtbl_pfn,
> >      unmap_domain_page(l4start);
> >  }
> >  
> > +static int __init setup_permissions(struct domain *d)
> > +{
> > +    unsigned long mfn;
> > +    int i, rc = 0;
> 
> i should be unsigned int, and the initializer of rc could be avoided.

Done (I've also converted the first assignment to rc below from |= to =).

> > +    /* The hardware domain is initially permitted full I/O capabilities. */
> > +    rc |= ioports_permit_access(d, 0, 0xFFFF);
> > +    rc |= iomem_permit_access(d, 0UL, (1UL << (paddr_bits - PAGE_SHIFT)) - 1);
> > +    rc |= irqs_permit_access(d, 1, nr_irqs_gsi - 1);
> > +
> > +    /*
> > +     * Modify I/O port access permissions.
> > +     */
> 
> This is a single line comment - I understand it's trying to be more of a
> separator than the others, but I'd prefer for it to do so by being
> followed by a blank line.
> 
> > +    /* Master Interrupt Controller (PIC). */
> > +    rc |= ioports_deny_access(d, 0x20, 0x21);
> > +    /* Slave Interrupt Controller (PIC). */
> > +    rc |= ioports_deny_access(d, 0xA0, 0xA1);
> > +    /* Interval Timer (PIT). */
> > +    rc |= ioports_deny_access(d, 0x40, 0x43);
> > +    /* PIT Channel 2 / PC Speaker Control. */
> > +    rc |= ioports_deny_access(d, 0x61, 0x61);
> > +    /* ACPI PM Timer. */
> > +    if ( pmtmr_ioport )
> > +        rc |= ioports_deny_access(d, pmtmr_ioport, pmtmr_ioport + 3);
> > +    /* PCI configuration space (NB. 0xcf8 has special treatment). */
> > +    rc |= ioports_deny_access(d, 0xcfc, 0xcff);
> > +    /* Command-line I/O ranges. */
> > +    process_dom0_ioports_disable(d);
> > +
> > +    /*
> > +     * Modify I/O memory access permissions.
> > +     */
> 
> Ditto.
> 
> > -    BUG_ON(rc != 0);
> > +    rc = setup_permissions(d);
> > +    if ( rc != 0 )
> > +        panic("Failed to setup Dom0 permissions");
> 
> To be honest, I'm not sure of this BUG_ON() -> panic() conversion.
> I think I'd prefer it to stay the way it was. We're not really expecting
> for any of this to fail anyway.
> 
> Jan

Done, fixed all the above, thanks.

Roger.


* Re: [PATCH v2 02/30] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests
  2016-09-29 14:17     ` Roger Pau Monne
@ 2016-09-29 16:07       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 16:07 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.09.16 at 16:17, <roger.pau@citrix.com> wrote:
> On Wed, Sep 28, 2016 at 10:03:21AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
>> > On PVHv2 guests we explicitly want to use native methods for routing
>> > interrupts.
>> > 
>> > Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
>> > route physical interrupts (even from emulated devices) over event channels.
>> 
>> So you specifically want this new flag off for PVHv2 guests? Based on
>> just this description I did get the opposite impression...
> 
> Yes, that's right, I don't want PVHv2 guests to know anything about PIRQs. I
> don't really know how to reword this, what about:
> 
> "PVHv2 guests, unlike HVM guests, won't have the option to route interrupts 
> from physical or emulated devices over event channels using PIRQs. This 
> applies to both DomU and Dom0 PVHv2 guests.
> 
> Introduce a new XEN_X86_EMU_USE_PIRQ to notify Xen whether a HVM guest can
> route physical interrupts (even from emulated devices) over event channels, 
> and is thus allowed to use some of the PHYSDEV ops."

SGTM

Jan



* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-09-29 14:33     ` Roger Pau Monne
@ 2016-09-29 16:09       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 16:09 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 29.09.16 at 16:33, <roger.pau@citrix.com> wrote:
> On Thu, Sep 29, 2016 at 04:39:02AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
>> > Return should be an int, and the number of pages should be an unsigned long.
>> 
>> I can see the former, but why the latter? Acting on 32-bit quantities
>> is a little cheaper after all.
> 
> This was requested by Andrew in the previous version:
> 
> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg03126.html 
> 
> But yes, an unsigned int is enough to hold the maximum number of pages for 
> a domain given that we support up to 1TB for HVM guests. Or maybe there are 
> plans to increase the maximum supported memory per guest?

Larger guests or not - here we're talking about number of pages
used internally for page tables, not numbers of pages assigned
to the guest, aren't we? In any event I'm not sure which truncation
issue Andrew had spotted back then.

Jan



* Re: [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-09-29 14:37     ` Roger Pau Monne
@ 2016-09-29 16:10       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 16:10 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 29.09.16 at 16:37, <roger.pau@citrix.com> wrote:
> On Thu, Sep 29, 2016 at 04:43:07AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:56, <roger.pau@citrix.com> wrote:
>> > --- a/xen/arch/x86/mm/hap/hap.c
>> > +++ b/xen/arch/x86/mm/hap/hap.c
>> > @@ -379,7 +379,9 @@ hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
>> >              break;
>> >  
>> >          /* Check to see if we need to yield and try again */
>> > -        if ( preempted && hypercall_preempt_check() )
>> > +        if ( preempted &&
>> > +             (is_idle_vcpu(current) ? softirq_pending(smp_processor_id()) :
>> > +                                      hypercall_preempt_check()) )
>> 
>> So what is the supposed action for the caller to take in this case?
>> I think this should at least be spelled out in the commit message.
> 
> I'm not sure I follow, but I assume you mean in the idle vcpu case?

Yes. Right now it is expected to schedule a hypercall continuation.

> In that case the caller should call process_pending_softirqs. I will modify 
> the commit message to include it.

Thanks.

Jan



* Re: [PATCH v2 06/30] x86/paging: introduce paging_set_allocation
  2016-09-29 14:51     ` Roger Pau Monne
@ 2016-09-29 16:12       ` Jan Beulich
  2016-09-29 16:57         ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 16:12 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

>>> On 29.09.16 at 16:51, <roger.pau@citrix.com> wrote:
> On Thu, Sep 29, 2016 at 04:51:42AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > @@ -1383,15 +1382,25 @@ int __init construct_dom0(
>> >                           nr_pages);
>> >      }
>> >  
>> > -    if ( is_pvh_domain(d) )
>> > -        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
>> > -
>> >      /*
>> >       * We enable paging mode again so guest_physmap_add_page will do the
>> >       * right thing for us.
>> >       */
>> 
>> I'm afraid you render this comment stale - please adjust it accordingly.
> 
> Not AFAICT, this comment is referring to the next line, which is:
> 
> d->arch.paging.mode = save_pvh_pg_mode;
> 
> The classic PVH domain builder contains quite a lot of craziness and 
> disables paging modes at certain points by playing with d->arch.paging.mode.

Right, but your addition below that line now also depends on that
restore, afaict.

Jan



* Re: [PATCH v2 08/30] xen/x86: do the PCI scan unconditionally
  2016-09-29 15:11     ` Roger Pau Monne
@ 2016-09-29 16:14       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-29 16:14 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, Andrew Cooper, Suravee Suthikulpanit,
	xen-devel, boris.ostrovsky

>>> On 29.09.16 at 17:11, <roger.pau@citrix.com> wrote:
> On Thu, Sep 29, 2016 at 07:55:00AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > Instead of being tied to the presence of an IOMMU
>> 
>> At the very least I'd expect the "why" aspect to get mentioned
>> here.
> 
> TBH, it seems simpler to have it there rather than conditional on the 
> presence of an IOMMU. Also, I think scan_pci_devices failing should be 
> fatal. Would you be OK with me adding a panic if the scan fails?

Hmm, no, not really. We can do without knowing of any PCI
devices right now, so I don't see why this should become a
panic. The more that we don't do a complete scan anyway, so
we know information is going to be at best partial until Dom0
gives us a complete picture.

Jan



* Re: [PATCH v2 06/30] x86/paging: introduce paging_set_allocation
  2016-09-29 16:12       ` Jan Beulich
@ 2016-09-29 16:57         ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-29 16:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, xen-devel, boris.ostrovsky

On Thu, Sep 29, 2016 at 10:12:12AM -0600, Jan Beulich wrote:
> >>> On 29.09.16 at 16:51, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 29, 2016 at 04:51:42AM -0600, Jan Beulich wrote:
> >> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> >> > @@ -1383,15 +1382,25 @@ int __init construct_dom0(
> >> >                           nr_pages);
> >> >      }
> >> >  
> >> > -    if ( is_pvh_domain(d) )
> >> > -        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
> >> > -
> >> >      /*
> >> >       * We enable paging mode again so guest_physmap_add_page will do the
> >> >       * right thing for us.
> >> >       */
> >> 
> >> I'm afraid you render this comment stale - please adjust it accordingly.
> > 
> > Not AFAICT, this comment is referring to the next line, which is:
> > 
> > d->arch.paging.mode = save_pvh_pg_mode;
> > 
> > The classic PVH domain builder contains quite a lot of craziness and 
> > disables paging modes at certain points by playing with d->arch.paging.mode.
> 
> Right, but your addition below that line now also depends on that
> restore, afaict.

Yes, that's completely right, sorry for overlooking it. I've expanded the 
comment to also reference paging_set_allocation (or else we would hit an 
assert).
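
I.e. something like (sketch):

    /*
     * We enable paging mode again so that guest_physmap_add_page and
     * paging_set_allocation will do the right thing for us (the latter
     * would otherwise hit an ASSERT).
     */
    d->arch.paging.mode = save_pvh_pg_mode;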

Roger.


* Re: [PATCH v2 05/30] xen/x86: assert that local_events_need_delivery is not called by the idle domain
  2016-09-29 10:45   ` Jan Beulich
@ 2016-09-30  8:32     ` Roger Pau Monne
  2016-09-30  8:59       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-30  8:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Sep 29, 2016 at 04:45:57AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > It doesn't make sense since the idle domain doesn't receive any events.
> 
> The change itself is fine, but I think it would help if the commit
> message made explicit why this is becoming relevant.

Done. I've added:

"This is relevant in order to be sure that hypercall_preempt_check is not 
called by the idle domain."

To the commit message.

Thanks, Roger.


* Re: [PATCH v2 05/30] xen/x86: assert that local_events_need_delivery is not called by the idle domain
  2016-09-30  8:32     ` Roger Pau Monne
@ 2016-09-30  8:59       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-30  8:59 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 30.09.16 at 10:32, <roger.pau@citrix.com> wrote:
> On Thu, Sep 29, 2016 at 04:45:57AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > It doesn't make sense since the idle domain doesn't receive any events.
>> 
>> The change itself is fine, but I think it would help if the commit
>> message made explicit why this is becoming relevant.
> 
> Done. I've added:
> 
> "This is relevant in order to be sure that hypercall_preempt_check is not 
> called by the idle domain."
> 
> To the commit message.

This is only part of it imo - I think it would help more if you clarified
how you got there in the first place all of a sudden.

Jan



* Re: [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions
  2016-09-29 14:18   ` Jan Beulich
@ 2016-09-30 11:27     ` Roger Pau Monne
  2016-09-30 13:21       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-30 11:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper, xen-devel,
	boris.ostrovsky

On Thu, Sep 29, 2016 at 08:18:36AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > The current code used by Intel VTd will only map RMRR regions for the
> > hardware domain, but will fail to map RMRR regions for unprivileged domains
> > unless the page tables are shared between EPT and IOMMU.
> 
> Okay, if that's the case it surely should get fixed.
> 
> > Fix this and
> > simplify the code, removing the {set/clear}_identity_p2m_entry helpers and
> > just using the normal MMIO mapping functions.
> 
> This simplification, however, goes too far. Namely ...
> 
> > -int set_identity_p2m_entry(struct domain *d, unsigned long gfn,
> > -                           p2m_access_t p2ma, unsigned int flag)
> > -{
> > -    p2m_type_t p2mt;
> > -    p2m_access_t a;
> > -    mfn_t mfn;
> > -    struct p2m_domain *p2m = p2m_get_hostp2m(d);
> > -    int ret;
> > -
> > -    if ( !paging_mode_translate(p2m->domain) )
> > -    {
> > -        if ( !need_iommu(d) )
> > -            return 0;
> > -        return iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> > -    }
> > -
> > -    gfn_lock(p2m, gfn, 0);
> > -
> > -    mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, 0, NULL, NULL);
> > -
> > -    if ( p2mt == p2m_invalid || p2mt == p2m_mmio_dm )
> > -        ret = p2m_set_entry(p2m, gfn, _mfn(gfn), PAGE_ORDER_4K,
> > -                            p2m_mmio_direct, p2ma);
> > -    else if ( mfn_x(mfn) == gfn && p2mt == p2m_mmio_direct && a == p2ma )
> > -    {
> > -        ret = 0;
> > -        /*
> > -         * PVH fixme: during Dom0 PVH construction, p2m entries are being set
> > -         * but iomem regions are not mapped with IOMMU. This makes sure that
> > -         * RMRRs are correctly mapped with IOMMU.
> > -         */
> > -        if ( is_hardware_domain(d) && !iommu_use_hap_pt(d) )
> > -            ret = iommu_map_page(d, gfn, gfn, IOMMUF_readable|IOMMUF_writable);
> > -    }
> > -    else
> > -    {
> > -        if ( flag & XEN_DOMCTL_DEV_RDM_RELAXED )
> > -            ret = 0;
> > -        else
> > -            ret = -EBUSY;
> > -        printk(XENLOG_G_WARNING
> > -               "Cannot setup identity map d%d:%lx,"
> > -               " gfn already mapped to %lx.\n",
> > -               d->domain_id, gfn, mfn_x(mfn));
> 
> ... this logic (and its clear side counterpart) should not be removed
> without replacement. Note in this context how you render "flag" an
> unused parameter of rmrr_identity_mapping().

OK, so I'm just going to fix {set/clear}_identity_p2m_entry, because leaving 
the current logic while using something like modify_mmio_11 (or whatever we 
end up calling it) is too complex IMHO.

> > --- a/xen/include/xen/p2m-common.h
> > +++ b/xen/include/xen/p2m-common.h
> > @@ -2,6 +2,7 @@
> >  #define _XEN_P2M_COMMON_H
> >  
> >  #include <public/vm_event.h>
> > +#include <xen/softirq.h>
> >  
> >  /*
> >   * Additional access types, which are used to further restrict
> > @@ -46,6 +47,35 @@ int unmap_mmio_regions(struct domain *d,
> >                         mfn_t mfn);
> >  
> >  /*
> > + * Preemptive Helper for mapping/unmapping MMIO regions.
> > + */
> 
> Single line comment.
> 
> > +static inline int modify_mmio_11(struct domain *d, unsigned long pfn,
> > +                                 unsigned long nr_pages, bool map)
> 
> Why do you make this an inline function? And I have to admit that I
> dislike this strange use of number 11 - what's wrong with continuing
> to use the term "direct map" in one way or another?

I've renamed it to modify_mmio_direct and moved it to common/memory.c, since 
none of the files in passthrough/ seemed like a good place to put it.

> > +{
> > +    int rc;
> > +
> > +    while ( nr_pages > 0 )
> > +    {
> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> > +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> > +        if ( rc == 0 )
> > +            break;
> > +        if ( rc < 0 )
> > +        {
> > +            printk(XENLOG_ERR
> > +                "Failed to %smap %#lx - %#lx into domain %d memory map: %d\n",
> 
> "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n"
> 
> And I think XENLOG_WARNING would do - whether this actually is
> a problem depends on further factors.

Done.

> > +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> > +            return rc;
> > +        }
> > +        nr_pages -= rc;
> > +        pfn += rc;
> > +        process_pending_softirqs();
> 
> Is this what you call "preemptive"? 

Right, I've removed preemptive from the comment since it makes no sense.

> > +    }
> > +
> > +    return rc;
> 
> The way this is coded it appears to possibly return non-zero even in
> success case. I think this would therefore better be a for ( ; ; ) loop.

I don't think this is possible, {un}map_mmio_regions will return < 0 on 
error, > 0 if there are pending pages to map, and 0 when all the requested 
pages have been mapped successfully.

Thanks, Roger. 


* Re: [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions
  2016-09-30 11:27     ` Roger Pau Monne
@ 2016-09-30 13:21       ` Jan Beulich
  2016-09-30 15:02         ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-30 13:21 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper, xen-devel,
	boris.ostrovsky

>>> On 30.09.16 at 13:27, <roger.pau@citrix.com> wrote:
> On Thu, Sep 29, 2016 at 08:18:36AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > +{
>> > +    int rc;
>> > +
>> > +    while ( nr_pages > 0 )
>> > +    {
>> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>> > +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
>> > +        if ( rc == 0 )
>> > +            break;
>> > +        if ( rc < 0 )
>> > +        {
>> > +            printk(XENLOG_ERR
>> > +                "Failed to %smap %#lx - %#lx into domain %d memory map: %d\n",
>> > +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
>> > +            return rc;
>> > +        }
>> > +        nr_pages -= rc;
>> > +        pfn += rc;
>> > +        process_pending_softirqs();
>> > +    }
>> > +
>> > +    return rc;
>> 
>> The way this is coded it appears to possibly return non-zero even in
>> success case. I think this would therefore better be a for ( ; ; ) loop.
> 
> I don't think this is possible, {un}map_mmio_regions will return < 0 on 
> error, > 0 if there are pending pages to map, and 0 when all the requested 
> pages have been mapped successfully.

Right - hence the "appears to" in my reply; it took me a while to
figure it's not actually possible, and hence my desire to make this
more obvious to the reader.

Jan



* Re: [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions
  2016-09-30 13:21       ` Jan Beulich
@ 2016-09-30 15:02         ` Roger Pau Monne
  2016-09-30 15:09           ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-30 15:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper, xen-devel,
	boris.ostrovsky

On Fri, Sep 30, 2016 at 07:21:41AM -0600, Jan Beulich wrote:
> >>> On 30.09.16 at 13:27, <roger.pau@citrix.com> wrote:
> > On Thu, Sep 29, 2016 at 08:18:36AM -0600, Jan Beulich wrote:
> >> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> >> > +{
> >> > +    int rc;
> >> > +
> >> > +    while ( nr_pages > 0 )
> >> > +    {
> >> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
> >> > +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
> >> > +        if ( rc == 0 )
> >> > +            break;
> >> > +        if ( rc < 0 )
> >> > +        {
> >> > +            printk(XENLOG_ERR
> >> > +                "Failed to %smap %#lx - %#lx into domain %d memory map: %d\n",
> >> > +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
> >> > +            return rc;
> >> > +        }
> >> > +        nr_pages -= rc;
> >> > +        pfn += rc;
> >> > +        process_pending_softirqs();
> >> > +    }
> >> > +
> >> > +    return rc;
> >> 
> >> The way this is coded it appears to possibly return non-zero even in
> >> success case. I think this would therefore better be a for ( ; ; ) loop.
> > 
> > I don't think this is possible, {un}map_mmio_regions will return < 0 on 
> > error, > 0 if there are pending pages to map, and 0 when all the requested 
> > pages have been mapped successfully.
> 
> Right - hence the "appears to" in my reply; it took me a while to
> figure it's not actually possible, and hence my desire to make this
> more obvious to the reader.

Ah, OK, I misunderstood you then. What about changing the last return rc
to return 0? This would make it more obvious, because I'm not really sure a 
for loop would change much (IMHO the problem is the return semantics used by 
{un}map_mmio_regions).

Roger.


* Re: [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2
  2016-09-27 15:57 ` [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
@ 2016-09-30 15:03   ` Jan Beulich
  2016-10-03 10:09     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-30 15:03 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -663,6 +663,13 @@ Pin dom0 vcpus to their respective pcpus
>  
>  Flag that makes a 64bit dom0 boot in PVH mode. No 32bit support at present.
>  
> +### dom0hvm
> +> `= <boolean>`
> +
> +> Default: `false`
> +
> +Flag that makes a dom0 boot in PVHv2 mode.

Considering sorting aspects this clearly wants to go at least ahead of
dom0pvh.

> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -75,6 +75,10 @@ unsigned long __read_mostly cr4_pv32_mask;
>  static bool_t __initdata opt_dom0pvh;
>  boolean_param("dom0pvh", opt_dom0pvh);
>  
> +/* Boot dom0 in HVM mode */
> +static bool_t __initdata opt_dom0hvm;

Plain bool please.
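
I.e. (sketch):

    /* Boot dom0 in HVM mode */
    static bool __initdata opt_dom0hvm;
    boolean_param("dom0hvm", opt_dom0hvm);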

> @@ -1495,6 +1499,11 @@ void __init noreturn __start_xen(unsigned long mbi_p)
>      if ( opt_dom0pvh )
>          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
>  
> +    if ( opt_dom0hvm ) {

Coding style.
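
I.e. the brace wants to go on its own line (sketch):

    if ( opt_dom0hvm )
    {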

> +        domcr_flags |= DOMCRF_hvm | (hvm_funcs.hap_supported ? DOMCRF_hap : 0);

So you mean to support PVHv2 on shadow (including for Dom0)
right away. Are you also testing that?

Jan



* Re: [PATCH v2 12/30] xen/x86: make print_e820_memory_map global
  2016-09-27 15:57 ` [PATCH v2 12/30] xen/x86: make print_e820_memory_map global Roger Pau Monne
@ 2016-09-30 15:04   ` Jan Beulich
  2016-10-03 16:23     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-09-30 15:04 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> So that it can be called from the Dom0 builder.

Why would the Dom0 builder need to call it, when it doesn't so far?

Jan



* Re: [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions
  2016-09-30 15:02         ` Roger Pau Monne
@ 2016-09-30 15:09           ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-30 15:09 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Kevin Tian, Feng Wu, George Dunlap, Andrew Cooper, xen-devel,
	boris.ostrovsky

>>> On 30.09.16 at 17:02, <roger.pau@citrix.com> wrote:
> On Fri, Sep 30, 2016 at 07:21:41AM -0600, Jan Beulich wrote:
>> >>> On 30.09.16 at 13:27, <roger.pau@citrix.com> wrote:
>> > On Thu, Sep 29, 2016 at 08:18:36AM -0600, Jan Beulich wrote:
>> >> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> >> > +{
>> >> > +    int rc;
>> >> > +
>> >> > +    while ( nr_pages > 0 )
>> >> > +    {
>> >> > +        rc = (map ? map_mmio_regions : unmap_mmio_regions)
>> >> > +             (d, _gfn(pfn), nr_pages, _mfn(pfn));
>> >> > +        if ( rc == 0 )
>> >> > +            break;
>> >> > +        if ( rc < 0 )
>> >> > +        {
>> >> > +            printk(XENLOG_ERR
>> >> > +                "Failed to %smap %#lx - %#lx into domain %d memory map: 
> %d\n",
>> >> > +                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
>> >> > +            return rc;
>> >> > +        }
>> >> > +        nr_pages -= rc;
>> >> > +        pfn += rc;
>> >> > +        process_pending_softirqs();
>> >> > +    }
>> >> > +
>> >> > +    return rc;
>> >> 
>> >> The way this is coded it appears to possibly return non-zero even in
>> >> success case. I think this would therefore better be a for ( ; ; ) loop.
>> > 
>> > I don't think this is possible, {un}map_mmio_regions will return < 0 on 
>> > error, > 0 if there are pending pages to map, and 0 when all the requested 
>> > pages have been mapped successfully.
>> 
>> Right - hence the "appears to" in my reply; it took me a while to
>> figure it's not actually possible, and hence my desire to make this
>> more obvious to the reader.
> 
> Ah, OK, I misunderstood you then. What about changing the last return rc
> to return 0? This would make it more obvious, because I'm not really sure a 
> for loop would change much (IMHO the problem is the return semantics used by 
> {un}map_mmio_regions).

Well, my suggestion was not to use "a for loop" but specifically
"for ( ; ; )" to clarify the only loop exit condition is the single break
statement. That break statement could of course also become a
"return 0". I'd rather not see the return at the end of the function
become "return 0"; if anything (i.e. if you follow my suggestion) it
could be deleted altogether.
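
Putting the earlier remarks together, the loop would then end up
roughly as (sketch, also folding in the message and severity
adjustments suggested before):

    for ( ; ; )
    {
        rc = (map ? map_mmio_regions : unmap_mmio_regions)
             (d, _gfn(pfn), nr_pages, _mfn(pfn));
        if ( rc == 0 )
            break;
        if ( rc < 0 )
        {
            printk(XENLOG_WARNING
                   "Failed to identity %smap [%#lx,%#lx) for d%d: %d\n",
                   map ? "" : "un", pfn, pfn + nr_pages, d->domain_id, rc);
            return rc;
        }
        nr_pages -= rc;
        pfn += rc;
        process_pending_softirqs();
    }

    return 0;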

Jan



* Re: [PATCH v2 14/30] xen/mm: add a ceil suffix to current page calculation routine
  2016-09-27 15:57 ` [PATCH v2 14/30] xen/mm: add a ceil suffix to current page calculation routine Roger Pau Monne
@ 2016-09-30 15:20   ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-30 15:20 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Stefano Stabellini, Andrew Cooper, Julien Grall,
	Suravee Suthikulpanit, xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> ... and introduce a floor variant.

I dislike this, not least because of the many places you touch
just to tack that odd suffix on. Unless you plan on adding half a
dozen or more callers to that floor variant, I think it would be
prudent to not introduce such a variant, but instead make those
callers simply obtain what they need by calling the existing one.
After all gofb_floor(x) = gofb(x + 1) - 1 if I'm not mistaken.
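
E.g. (sketch; with get_order_from_bytes() returning the smallest order
covering the given size, this yields the largest order fully contained
in it):

    /* Largest order such that (PAGE_SIZE << order) <= size. */
    order = get_order_from_bytes(size + 1) - 1;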

Jan



* Re: [PATCH v2 10/30] xen/x86: allow the emulated APICs to be enabled for the hardware domain
  2016-09-29 14:26   ` Jan Beulich
@ 2016-09-30 15:44     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-09-30 15:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Sep 29, 2016 at 08:26:04AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > +static bool emulation_flags_ok(const struct domain *d, uint32_t emflags)
> > +{
> > +
> > +    if ( is_hvm_domain(d) )
> > +    {
> > +        if ( is_hardware_domain(d) &&
> > +             emflags != (XEN_X86_EMU_PIT|XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC))
> > +            return false;
> > +        if ( !is_hardware_domain(d) &&
> > +             emflags != XEN_X86_EMU_ALL && emflags != 0 )
> > +            return false;
> > +    }
> > +    else
> > +    {
> > +        /* PV or classic PVH. */
> > +        if ( is_hardware_domain(d) && emflags != XEN_X86_EMU_PIT )
> > +            return false;
> > +        if ( !is_hardware_domain(d) && emflags != 0 )
> > +            return false;
> 
> Previous code permitted XEN_X86_EMU_PIT also for the non-hardware
> domains afaict. You shouldn't change behavior without saying so and
> explaining why.

Done. I'm not really sure the emulated PIT makes much sense for a PV 
domain, especially since I don't think libxl ever enabled the PIT for a PV 
domain, but in any case this should be discussed in a separate patch and 
not batched into this commit.

Roger.


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-09-27 15:57 ` [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
@ 2016-09-30 15:52   ` Jan Beulich
  2016-10-04  9:12     ` Roger Pau Monne
  2016-10-11 14:06     ` Roger Pau Monne
  0 siblings, 2 replies; 146+ messages in thread
From: Jan Beulich @ 2016-09-30 15:52 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> @@ -43,6 +44,11 @@ static long __initdata dom0_nrpages;
>  static long __initdata dom0_min_nrpages;
>  static long __initdata dom0_max_nrpages = LONG_MAX;
>  
> +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> +#define HVM_VM86_TSS_SIZE   128
> +
> +static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];

This is for your debugging only I suppose?

> @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
>          avail -= dom0_paging_pages(d, nr_pages);
>      }
>  
> -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> +    if ( is_pv_domain(d) &&
> +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&

Perhaps better to simply force parms->p2m_base to UNSET_ADDR
earlier on?

> @@ -579,8 +588,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>              continue;
>          }
>  
> -        *entry_guest = *entry;
> -        pages = PFN_UP(entry_guest->size);
> +        /*
> +         * Make sure the start and length are aligned to PAGE_SIZE, because
> +         * that's the minimum granularity of the 2nd stage translation.
> +         */
> +        start = ROUNDUP(entry->addr, PAGE_SIZE);
> +        end = (entry->addr + entry->size) & PAGE_MASK;
> +        if ( start >= end )
> +            continue;
> +
> +        entry_guest->type = E820_RAM;
> +        entry_guest->addr = start;
> +        entry_guest->size = end - start;
> +        pages = PFN_DOWN(entry_guest->size);
>          if ( (cur_pages + pages) > nr_pages )
>          {
>              /* Truncate region */
> @@ -591,6 +611,8 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
>          {
>              cur_pages += pages;
>          }
> +        ASSERT(IS_ALIGNED(entry_guest->addr, PAGE_SIZE) &&
> +               IS_ALIGNED(entry_guest->size, PAGE_SIZE));

What does this guard against? Your addition arranges for things to
be page aligned, and the only adjustment done until we get here is
one that obviously also doesn't violate that requirement. I'm all for
assertions when they check state which is not obviously right, but
here I don't see the need.

> @@ -1657,15 +1679,238 @@ out:
>      return rc;
>  }
>  
> +/* Populate an HVM memory range using the biggest possible order. */
> +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> +                                             uint64_t size)
> +{
> +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
> +    unsigned int order;
> +    struct page_info *page;
> +    int rc;
> +
> +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> +
> +    order = MAX_ORDER;
> +    while ( size != 0 )
> +    {
> +        order = min(get_order_from_bytes_floor(size), order);
> +        page = alloc_domheap_pages(d, order, memflags);
> +        if ( page == NULL )
> +        {
> +            if ( order == 0 && memflags )
> +            {
> +                /* Try again without any memflags. */
> +                memflags = 0;
> +                order = MAX_ORDER;
> +                continue;
> +            }
> +            if ( order == 0 )
> +                panic("Unable to allocate memory with order 0!\n");
> +            order--;
> +            continue;
> +        }

Is it not possible to utilize alloc_chunk() here?

> +        hvm_mem_stats[order]++;
> +        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
> +                                    _mfn(page_to_mfn(page)), order);
> +        if ( rc != 0 )
> +            panic("Failed to populate memory: [%" PRIx64 " - %" PRIx64 "] %d\n",

[<start>,<end>) please.

> +                  start, start + (((uint64_t)1) << (order + PAGE_SHIFT)), rc);
> +        start += ((uint64_t)1) << (order + PAGE_SHIFT);
> +        size -= ((uint64_t)1) << (order + PAGE_SHIFT);

Please prefer 1ULL over (uint64_t)1.

> +        if ( (size & 0xffffffff) == 0 )
> +            process_pending_softirqs();

That's 4Gb at a time - isn't that a little too much?

> +    }
> +
> +}

Stray blank line.

> +static int __init hvm_setup_vmx_unrestricted_guest(struct domain *d)
> +{
> +    struct e820entry *entry;
> +    p2m_type_t p2mt;
> +    uint32_t rc, *ident_pt;
> +    uint8_t *tss;
> +    mfn_t mfn;
> +    paddr_t gaddr = 0;
> +    int i;

unsigned int

> +    /*
> +     * Stole some space from the last found RAM region. One page will be

Steal

> +     * used for the identify page tables, and the remaining space for the

identity

> +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> +     * to PAGE_SIZE.
> +     */
> +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> +    {
> +        entry = &d->arch.e820[d->arch.nr_e820 - i];
> +        if ( entry->type != E820_RAM ||
> +             entry->size < PAGE_SIZE + HVM_VM86_TSS_SIZE )
> +            continue;
> +
> +        entry->size -= PAGE_SIZE + HVM_VM86_TSS_SIZE;
> +        gaddr = entry->addr + entry->size;
> +        break;
> +    }
> +
> +    if ( gaddr == 0 || gaddr < MB(1) )
> +    {
> +        printk("Unable to find memory to stash the identity map and TSS\n");
> +        return -ENOMEM;

One function up you panic() on error - please be consistent. Also for
one of the other patches I think we figured that the TSS isn't really
required, so please only warn in that case.

> +    }
> +
> +    /*
> +     * Identity-map page table is required for running with CR0.PG=0
> +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> +     * superpages.
> +     */
> +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> +                         &mfn, &p2mt, 0, &rc);

Comment and operation don't really fit together.

> +static int __init hvm_setup_p2m(struct domain *d)
> +{
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    unsigned long nr_pages;
> +    int i, rc, preempted;
> +
> +    printk("** Preparing memory map **\n");

Debugging leftover again?

> +    /*
> +     * Subtract one page for the EPT identity page table and two pages
> +     * for the MADT replacement.
> +     */
> +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;

How do you know the MADT replacement requires two pages? Isn't
that CPU-count dependent? And doesn't the partial page used for
the TSS also need accounting for here?
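
Roughly speaking (a hypothetical sizing sketch using the ACPICA
structure names, not anything from this patch):

    madt_size = sizeof(struct acpi_table_madt) +
                sizeof(struct acpi_madt_local_apic) * vcpus +
                sizeof(struct acpi_madt_io_apic);
    madt_pages = DIV_ROUND_UP(madt_size, PAGE_SIZE);

which only stays at two pages for a bounded vCPU count.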

> +    hvm_setup_e820(d, nr_pages);
> +    do {
> +        preempted = 0;
> +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                              &preempted);
> +        process_pending_softirqs();
> +    } while ( preempted );
> +
> +    /*
> +     * Special treatment for memory < 1MB:
> +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
> +     *  - Map everything else as 1:1.
> +     * NB: all this only makes sense if booted from legacy BIOSes.
> +     */
> +    rc = modify_mmio_11(d, 0, PFN_DOWN(MB(1)), true);
> +    if ( rc )
> +    {
> +        printk("Failed to map low 1MB 1:1: %d\n", rc);
> +        return rc;
> +    }
> +
> +    printk("** Populating memory map **\n");
> +    /* Populate memory map. */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        hvm_populate_memory_range(d, d->arch.e820[i].addr,
> +                                  d->arch.e820[i].size);
> +        if ( d->arch.e820[i].addr < MB(1) )
> +        {
> +            unsigned long end = min_t(unsigned long,
> +                            d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
> +
> +            saved_current = current;
> +            set_current(v);
> +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> +                                        maddr_to_virt(d->arch.e820[i].addr),
> +                                        end - d->arch.e820[i].addr);
> +            set_current(saved_current);
> +            if ( rc != HVMCOPY_okay )
> +            {
> +                printk("Unable to copy RAM region %#lx - %#lx\n",
> +                       d->arch.e820[i].addr, end);
> +                return -EFAULT;
> +            }
> +        }
> +    }
> +
> +    printk("Memory allocation stats:\n");
> +    for ( i = 0; i <= MAX_ORDER; i++ )
> +    {
> +        if ( hvm_mem_stats[MAX_ORDER - i] != 0 )
> +            printk("Order %2u: %pZ\n", MAX_ORDER - i,
> +                   _p(((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
> +                      hvm_mem_stats[MAX_ORDER - i]));
> +    }
> +
> +    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
> +    {
> +        /*
> +         * Since Dom0 cannot be migrated, we will only setup the
> +         * unrestricted guest helpers if they are needed by the current
> +         * hardware we are running on.
> +         */
> +        rc = hvm_setup_vmx_unrestricted_guest(d);

Calling a function of this name inside an if() checking for
!vmx_unrestricted_guest() is, well, odd.

>  static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
>                                       unsigned long image_headroom,
>                                       module_t *initrd,
>                                       void *(*bootstrap_map)(const module_t *),
>                                       char *cmdline)
>  {
> +    int rc;
>  
>      printk("** Building a PVH Dom0 **\n");
>  
> +    /* Sanity! */
> +    BUG_ON(d->domain_id != 0);
> +    BUG_ON(d->vcpu[0] == NULL);

May I suggest

    BUG_ON(d->domain_id);
    BUG_ON(!d->vcpu[0]);

in cases like this?

> +    process_pending_softirqs();

Why, outside of any loop?

Jan


* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-09-27 15:56 ` [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions Roger Pau Monne
  2016-09-28  9:34   ` Tim Deegan
  2016-09-29 10:39   ` Jan Beulich
@ 2016-09-30 16:48   ` George Dunlap
  2016-10-03  8:05   ` Paul Durrant
  3 siblings, 0 replies; 146+ messages in thread
From: George Dunlap @ 2016-09-30 16:48 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, boris.ostrovsky

On 27/09/16 16:56, Roger Pau Monne wrote:
> Return should be an int, and the number of pages should be an unsigned long.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Non-shadow mm bits:

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/mm/hap/hap.c       |  6 +++---
>  xen/arch/x86/mm/shadow/common.c |  7 +++----
>  xen/include/asm-x86/domain.h    | 12 ++++++------
>  3 files changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index 3218fa2..b6d2c61 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -325,7 +325,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
>  static unsigned int
>  hap_get_allocation(struct domain *d)
>  {
> -    unsigned int pg = d->arch.paging.hap.total_pages
> +    unsigned long pg = d->arch.paging.hap.total_pages
>          + d->arch.paging.hap.p2m_pages;
>  
>      return ((pg >> (20 - PAGE_SHIFT))
> @@ -334,8 +334,8 @@ hap_get_allocation(struct domain *d)
>  
>  /* Set the pool of pages to the required number of pages.
>   * Returns 0 for success, non-zero for failure. */
> -static unsigned int
> -hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
> +static int
> +hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
>  {
>      struct page_info *pg;
>  
> diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> index 21607bf..d3cc2cc 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1613,9 +1613,8 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
>   * Input will be rounded up to at least shadow_min_acceptable_pages(),
>   * plus space for the p2m table.
>   * Returns 0 for success, non-zero for failure. */
> -static unsigned int sh_set_allocation(struct domain *d,
> -                                      unsigned int pages,
> -                                      int *preempted)
> +static int sh_set_allocation(struct domain *d, unsigned long pages,
> +                             int *preempted)
>  {
>      struct page_info *sp;
>      unsigned int lower_bound;
> @@ -1692,7 +1691,7 @@ static unsigned int sh_set_allocation(struct domain *d,
>  /* Return the size of the shadow pool, rounded up to the nearest MB */
>  static unsigned int shadow_get_allocation(struct domain *d)
>  {
> -    unsigned int pg = d->arch.paging.shadow.total_pages
> +    unsigned long pg = d->arch.paging.shadow.total_pages
>          + d->arch.paging.shadow.p2m_pages;
>      return ((pg >> (20 - PAGE_SHIFT))
>              + ((pg & ((1 << (20 - PAGE_SHIFT)) - 1)) ? 1 : 0));
> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index 5807a1f..11ac2a5 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -93,9 +93,9 @@ struct shadow_domain {
>  
>      /* Memory allocation */
>      struct page_list_head freelist;
> -    unsigned int      total_pages;  /* number of pages allocated */
> -    unsigned int      free_pages;   /* number of pages on freelists */
> -    unsigned int      p2m_pages;    /* number of pages allocates to p2m */
> +    unsigned long      total_pages;  /* number of pages allocated */
> +    unsigned long      free_pages;   /* number of pages on freelists */
> +    unsigned long      p2m_pages;    /* number of pages allocates to p2m */
>  
>      /* 1-to-1 map for use when HVM vcpus have paging disabled */
>      pagetable_t unpaged_pagetable;
> @@ -155,9 +155,9 @@ struct shadow_vcpu {
>  /************************************************/
>  struct hap_domain {
>      struct page_list_head freelist;
> -    unsigned int      total_pages;  /* number of pages allocated */
> -    unsigned int      free_pages;   /* number of pages on freelists */
> -    unsigned int      p2m_pages;    /* number of pages allocates to p2m */
> +    unsigned long      total_pages;  /* number of pages allocated */
> +    unsigned long      free_pages;   /* number of pages on freelists */
> +    unsigned long      p2m_pages;    /* number of pages allocates to p2m */
>  };
>  
>  /************************************************/
> 



* Re: [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-09-27 15:56 ` [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
  2016-09-29 10:43   ` Jan Beulich
@ 2016-09-30 16:56   ` George Dunlap
  2016-09-30 16:56     ` George Dunlap
  1 sibling, 1 reply; 146+ messages in thread
From: George Dunlap @ 2016-09-30 16:56 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, Jan Beulich

On 27/09/16 16:56, Roger Pau Monne wrote:
> ... and using the "preempted" parameter. The solution relies on just calling
> softirq_pending if the current domain is the idle domain.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

You probably also want to add something to the effect of:

"This allows us to call *_set_allocation() when building domain 0."

Someone doing archeology doesn't want to dig out this series to figure
out how it would be possible to call hap_set_allocation() while idle. :-)

 -George



* Re: [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain
  2016-09-30 16:56   ` George Dunlap
@ 2016-09-30 16:56     ` George Dunlap
  0 siblings, 0 replies; 146+ messages in thread
From: George Dunlap @ 2016-09-30 16:56 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, Jan Beulich

On 30/09/16 17:56, George Dunlap wrote:
> On 27/09/16 16:56, Roger Pau Monne wrote:
>> ... and using the "preempted" parameter. The solution relies on just calling
>> softirq_pending if the current domain is the idle domain.
>>
>> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> You probably also want to add something to the effect of:
> 
> "This allows us to call *_set_allocation() when building domain 0."
> 
> Someone doing archeology doesn't want to dig out this series to figure
> out how it would be possible to call hap_set_allocation() while idle. :-)

Oh, but when the comments have been addressed, you can add:

Acked-by: George Dunlap <george.dunlap@citrix.com>


* Re: [PATCH v2 06/30] x86/paging: introduce paging_set_allocation
  2016-09-27 15:57 ` [PATCH v2 06/30] x86/paging: introduce paging_set_allocation Roger Pau Monne
  2016-09-29 10:51   ` Jan Beulich
@ 2016-09-30 17:00   ` George Dunlap
  1 sibling, 0 replies; 146+ messages in thread
From: George Dunlap @ 2016-09-30 17:00 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: George Dunlap, Andrew Cooper, Tim Deegan, Jan Beulich, boris.ostrovsky

On 27/09/16 16:57, Roger Pau Monne wrote:
> ... and remove hap_set_alloc_for_pvh_dom0.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> Acked-by: Tim Deegan <tim@xen.org>

Acked-by: George Dunlap <george.dunlap@citrix.com>

> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Tim Deegan <tim@xen.org>
> ---
> Changes since RFC:
>  - Make paging_set_allocation preemtable.
>  - Move comments.
> ---
>  xen/arch/x86/domain_build.c     | 17 +++++++++++++----
>  xen/arch/x86/mm/hap/hap.c       | 14 +-------------
>  xen/arch/x86/mm/paging.c        | 16 ++++++++++++++++
>  xen/arch/x86/mm/shadow/common.c |  7 +------
>  xen/include/asm-x86/hap.h       |  2 +-
>  xen/include/asm-x86/paging.h    |  7 +++++++
>  xen/include/asm-x86/shadow.h    |  8 ++++++++
>  7 files changed, 47 insertions(+), 24 deletions(-)
> 
> diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
> index 0a02d65..04d6cb0 100644
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -35,7 +35,6 @@
>  #include <asm/setup.h>
>  #include <asm/bzimage.h> /* for bzimage_parse */
>  #include <asm/io_apic.h>
> -#include <asm/hap.h>
>  #include <asm/hpet.h>
>  
>  #include <public/version.h>
> @@ -1383,15 +1382,25 @@ int __init construct_dom0(
>                           nr_pages);
>      }
>  
> -    if ( is_pvh_domain(d) )
> -        hap_set_alloc_for_pvh_dom0(d, dom0_paging_pages(d, nr_pages));
> -
>      /*
>       * We enable paging mode again so guest_physmap_add_page will do the
>       * right thing for us.
>       */
>      d->arch.paging.mode = save_pvh_pg_mode;
>  
> +    if ( is_pvh_domain(d) )
> +    {
> +        int preempted;
> +
> +        do {
> +            preempted = 0;
> +            paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> +                                  &preempted);
> +            process_pending_softirqs();
> +        } while ( preempted );
> +    }
> +
> +
>      /* Write the phys->machine and machine->phys table entries. */
>      for ( pfn = 0; pfn < count; pfn++ )
>      {
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index 2dc82f5..4420e4e 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -334,7 +334,7 @@ hap_get_allocation(struct domain *d)
>  
>  /* Set the pool of pages to the required number of pages.
>   * Returns 0 for success, non-zero for failure. */
> -static int
> +int
>  hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
>  {
>      struct page_info *pg;
> @@ -640,18 +640,6 @@ int hap_domctl(struct domain *d, xen_domctl_shadow_op_t *sc,
>      }
>  }
>  
> -void __init hap_set_alloc_for_pvh_dom0(struct domain *d,
> -                                       unsigned long hap_pages)
> -{
> -    int rc;
> -
> -    paging_lock(d);
> -    rc = hap_set_allocation(d, hap_pages, NULL);
> -    paging_unlock(d);
> -
> -    BUG_ON(rc);
> -}
> -
>  static const struct paging_mode hap_paging_real_mode;
>  static const struct paging_mode hap_paging_protected_mode;
>  static const struct paging_mode hap_paging_pae_mode;
> diff --git a/xen/arch/x86/mm/paging.c b/xen/arch/x86/mm/paging.c
> index cc44682..2717bd3 100644
> --- a/xen/arch/x86/mm/paging.c
> +++ b/xen/arch/x86/mm/paging.c
> @@ -954,6 +954,22 @@ void paging_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
>          safe_write_pte(p, new);
>  }
>  
> +int paging_set_allocation(struct domain *d, unsigned long pages, int *preempted)
> +{
> +    int rc;
> +
> +    ASSERT(paging_mode_enabled(d));
> +
> +    paging_lock(d);
> +    if ( hap_enabled(d) )
> +        rc = hap_set_allocation(d, pages, preempted);
> +    else
> +        rc = sh_set_allocation(d, pages, preempted);
> +    paging_unlock(d);
> +
> +    return rc;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> index d3cc2cc..53ffe1a 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1609,12 +1609,7 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
>      paging_unlock(d);
>  }
>  
> -/* Set the pool of shadow pages to the required number of pages.
> - * Input will be rounded up to at least shadow_min_acceptable_pages(),
> - * plus space for the p2m table.
> - * Returns 0 for success, non-zero for failure. */
> -static int sh_set_allocation(struct domain *d, unsigned long pages,
> -                             int *preempted)
> +int sh_set_allocation(struct domain *d, unsigned long pages, int *preempted)
>  {
>      struct page_info *sp;
>      unsigned int lower_bound;
> diff --git a/xen/include/asm-x86/hap.h b/xen/include/asm-x86/hap.h
> index c613836..9d59430 100644
> --- a/xen/include/asm-x86/hap.h
> +++ b/xen/include/asm-x86/hap.h
> @@ -46,7 +46,7 @@ int   hap_track_dirty_vram(struct domain *d,
>                             XEN_GUEST_HANDLE_64(uint8) dirty_bitmap);
>  
>  extern const struct paging_mode *hap_paging_get_mode(struct vcpu *);
> -void hap_set_alloc_for_pvh_dom0(struct domain *d, unsigned long num_pages);
> +int hap_set_allocation(struct domain *d, unsigned long pages, int *preempted);
>  
>  #endif /* XEN_HAP_H */
>  
> diff --git a/xen/include/asm-x86/paging.h b/xen/include/asm-x86/paging.h
> index 56eef6b..c2d60d3 100644
> --- a/xen/include/asm-x86/paging.h
> +++ b/xen/include/asm-x86/paging.h
> @@ -347,6 +347,13 @@ void pagetable_dying(struct domain *d, paddr_t gpa);
>  void paging_dump_domain_info(struct domain *d);
>  void paging_dump_vcpu_info(struct vcpu *v);
>  
> +/* Set the pool of shadow pages to the required number of pages.
> + * Input might be rounded up to at minimum amount of pages, plus
> + * space for the p2m table.
> + * Returns 0 for success, non-zero for failure. */
> +int paging_set_allocation(struct domain *d, unsigned long pages,
> +                          int *preempted);
> +
>  #endif /* XEN_PAGING_H */
>  
>  /*
> diff --git a/xen/include/asm-x86/shadow.h b/xen/include/asm-x86/shadow.h
> index 6d0aefb..f0e2227 100644
> --- a/xen/include/asm-x86/shadow.h
> +++ b/xen/include/asm-x86/shadow.h
> @@ -83,6 +83,12 @@ void sh_remove_shadows(struct domain *d, mfn_t gmfn, int fast, int all);
>  /* Discard _all_ mappings from the domain's shadows. */
>  void shadow_blow_tables_per_domain(struct domain *d);
>  
> +/* Set the pool of shadow pages to the required number of pages.
> + * Input will be rounded up to at least shadow_min_acceptable_pages(),
> + * plus space for the p2m table.
> + * Returns 0 for success, non-zero for failure. */
> +int sh_set_allocation(struct domain *d, unsigned long pages, int *preempted);
> +
>  #else /* !CONFIG_SHADOW_PAGING */
>  
>  #define shadow_teardown(d, p) ASSERT(is_pv_domain(d))
> @@ -91,6 +97,8 @@ void shadow_blow_tables_per_domain(struct domain *d);
>      ({ ASSERT(is_pv_domain(d)); -EOPNOTSUPP; })
>  #define shadow_track_dirty_vram(d, begin_pfn, nr, bitmap) \
>      ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
> +#define sh_set_allocation(d, pages, preempted) \
> +    ({ ASSERT_UNREACHABLE(); -EOPNOTSUPP; })
>  
>  static inline void sh_remove_shadows(struct domain *d, mfn_t gmfn,
>                                       bool_t fast, bool_t all) {}
> 



* Re: [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings
  2016-09-27 15:57 ` [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings Roger Pau Monne
@ 2016-09-30 17:36   ` George Dunlap
  2016-10-10 14:21   ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: George Dunlap @ 2016-09-30 17:36 UTC (permalink / raw)
  To: Roger Pau Monne, xen-devel
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, Jan Beulich

On 27/09/16 16:57, Roger Pau Monne wrote:
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

Acked-by: George Dunlap <george.dunlap@citrix.com>

With one note: You might consider changing the summary to "Allow HVM
hardware domains (PVHv2 dom0) to perform..."

That makes it clear to someone casually scanning that you're really
enabling all HVM domains to get to the next step.  If we ever get rid of
the hardware domain check, this might be important to people using XSM,
for instance.

 -George



* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-09-27 15:56 ` [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions Roger Pau Monne
                     ` (2 preceding siblings ...)
  2016-09-30 16:48   ` George Dunlap
@ 2016-10-03  8:05   ` Paul Durrant
  2016-10-06 11:33     ` Roger Pau Monne
  3 siblings, 1 reply; 146+ messages in thread
From: Paul Durrant @ 2016-10-03  8:05 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Tim (Xen.org),
	George Dunlap, Jan Beulich, boris.ostrovsky, Roger Pau Monne

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Roger Pau Monne
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: George Dunlap <George.Dunlap@citrix.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Tim (Xen.org) <tim@xen.org>; Jan Beulich
> <jbeulich@suse.com>; boris.ostrovsky@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>
> Subject: [Xen-devel] [PATCH v2 03/30] xen/x86: fix parameters and return
> value of *_set_allocation functions
> 
> Return should be an int, and the number of pages should be an unsigned
> long.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: George Dunlap <george.dunlap@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Tim Deegan <tim@xen.org>
> ---
>  xen/arch/x86/mm/hap/hap.c       |  6 +++---
>  xen/arch/x86/mm/shadow/common.c |  7 +++----
>  xen/include/asm-x86/domain.h    | 12 ++++++------
>  3 files changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> index 3218fa2..b6d2c61 100644
> --- a/xen/arch/x86/mm/hap/hap.c
> +++ b/xen/arch/x86/mm/hap/hap.c
> @@ -325,7 +325,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
>  static unsigned int
>  hap_get_allocation(struct domain *d)
>  {
> -    unsigned int pg = d->arch.paging.hap.total_pages
> +    unsigned long pg = d->arch.paging.hap.total_pages
>          + d->arch.paging.hap.p2m_pages;

You are modifying this calculation (by including hap.p2m_pages as well as hap.total_pages) so the comment should probably mention this.

> 
>      return ((pg >> (20 - PAGE_SHIFT))
> @@ -334,8 +334,8 @@ hap_get_allocation(struct domain *d)
> 
>  /* Set the pool of pages to the required number of pages.
>   * Returns 0 for success, non-zero for failure. */
> -static unsigned int
> -hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
> +static int
> +hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
>  {
>      struct page_info *pg;
> 
> diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> index 21607bf..d3cc2cc 100644
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1613,9 +1613,8 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
>   * Input will be rounded up to at least shadow_min_acceptable_pages(),
>   * plus space for the p2m table.
>   * Returns 0 for success, non-zero for failure. */
> -static unsigned int sh_set_allocation(struct domain *d,
> -                                      unsigned int pages,
> -                                      int *preempted)
> +static int sh_set_allocation(struct domain *d, unsigned long pages,
> +                             int *preempted)
>  {
>      struct page_info *sp;
>      unsigned int lower_bound;
> @@ -1692,7 +1691,7 @@ static unsigned int sh_set_allocation(struct domain *d,
>  /* Return the size of the shadow pool, rounded up to the nearest MB */
>  static unsigned int shadow_get_allocation(struct domain *d)
>  {
> -    unsigned int pg = d->arch.paging.shadow.total_pages
> +    unsigned long pg = d->arch.paging.shadow.total_pages
>          + d->arch.paging.shadow.p2m_pages;

Same here.

>      return ((pg >> (20 - PAGE_SHIFT))
>              + ((pg & ((1 << (20 - PAGE_SHIFT)) - 1)) ? 1 : 0));

OMG. Is there no rounding macro you can use for this?
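
E.g. (assuming Xen's DIV_ROUND_UP()):

    return DIV_ROUND_UP(pg, 1UL << (20 - PAGE_SHIFT));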

  Paul

> diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
> index 5807a1f..11ac2a5 100644
> --- a/xen/include/asm-x86/domain.h
> +++ b/xen/include/asm-x86/domain.h
> @@ -93,9 +93,9 @@ struct shadow_domain {
> 
>      /* Memory allocation */
>      struct page_list_head freelist;
> -    unsigned int      total_pages;  /* number of pages allocated */
> -    unsigned int      free_pages;   /* number of pages on freelists */
> -    unsigned int      p2m_pages;    /* number of pages allocates to p2m */
> +    unsigned long      total_pages;  /* number of pages allocated */
> +    unsigned long      free_pages;   /* number of pages on freelists */
> +    unsigned long      p2m_pages;    /* number of pages allocates to p2m */
> 
>      /* 1-to-1 map for use when HVM vcpus have paging disabled */
>      pagetable_t unpaged_pagetable;
> @@ -155,9 +155,9 @@ struct shadow_vcpu {
>  /************************************************/
>  struct hap_domain {
>      struct page_list_head freelist;
> -    unsigned int      total_pages;  /* number of pages allocated */
> -    unsigned int      free_pages;   /* number of pages on freelists */
> -    unsigned int      p2m_pages;    /* number of pages allocates to p2m */
> +    unsigned long      total_pages;  /* number of pages allocated */
> +    unsigned long      free_pages;   /* number of pages on freelists */
> +    unsigned long      p2m_pages;    /* number of pages allocates to p2m */
>  };
> 
>  /************************************************/
> --
> 2.7.4 (Apple Git-66)

* Re: [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form
  2016-09-27 15:57 ` [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form Roger Pau Monne
  2016-09-28  8:24   ` Juergen Gross
@ 2016-10-03  8:36   ` Paul Durrant
  2016-10-11 10:27   ` Roger Pau Monne
  2 siblings, 0 replies; 146+ messages in thread
From: Paul Durrant @ 2016-10-03  8:36 UTC (permalink / raw)
  To: xen-devel
  Cc: Stefano Stabellini, Wei Liu, Andrew Cooper, Tim (Xen.org),
	George Dunlap, Jan Beulich, Ian Jackson, boris.ostrovsky,
	Roger Pau Monne

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Roger Pau Monne
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: Stefano Stabellini <sstabellini@kernel.org>; Wei Liu
> <wei.liu2@citrix.com>; George Dunlap <George.Dunlap@citrix.com>;
> Andrew Cooper <Andrew.Cooper3@citrix.com>; Ian Jackson
> <Ian.Jackson@citrix.com>; Tim (Xen.org) <tim@xen.org>; Jan Beulich
> <jbeulich@suse.com>; boris.ostrovsky@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>
> Subject: [Xen-devel] [PATCH v2 13/30] xen: introduce a new format specifier
> to print sizes in human-readable form
> 
> Introduce a new %pB format specifier to print sizes (in bytes) in a human-
> readable form.
> 

This comment does not seem to match the code. You use 'Z' below...

  Paul

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Cc: Tim Deegan <tim@xen.org>
> Cc: Wei Liu <wei.liu2@citrix.com>
> ---
>  docs/misc/printk-formats.txt |  5 +++++
>  xen/common/vsprintf.c        | 15 +++++++++++++++
>  2 files changed, 20 insertions(+)
> 
> diff --git a/docs/misc/printk-formats.txt b/docs/misc/printk-formats.txt
> index 525108f..0ee3504 100644
> --- a/docs/misc/printk-formats.txt
> +++ b/docs/misc/printk-formats.txt
> @@ -30,3 +30,8 @@ Domain and vCPU information:
> 
>         %pv     Domain and vCPU ID from a 'struct vcpu *' (printed as
>                 "d<domid>v<vcpuid>")
> +
> +Size in human readable form:
> +
> +       %pZ     Size in human-readable form (input size must be in bytes).
> +                 e.g.  24MB
> diff --git a/xen/common/vsprintf.c b/xen/common/vsprintf.c
> index f92fb67..2dadaec 100644
> --- a/xen/common/vsprintf.c
> +++ b/xen/common/vsprintf.c
> @@ -386,6 +386,21 @@ static char *pointer(char *str, char *end, const char **fmt_ptr,
>              *str = 'v';
>          return number(str + 1, end, v->vcpu_id, 10, -1, -1, 0);
>      }
> +    case 'Z':
> +    {
> +        static const char units[][3] = {"B", "KB", "MB", "GB", "TB"};
> +        size_t size = (size_t)arg;
> +        int i = 0;
> +
> +        /* Advance parents fmt string, as we have consumed 'B' */
> +        ++*fmt_ptr;
> +
> +        while ( ++i < sizeof(units) && size >= 1024 )
> +            size >>= 10; /* size /= 1024 */
> +
> +        str = number(str, end, size, 10, -1, -1, 0);
> +        return string(str, end, units[i-1], -1, -1, 0);
> +    }
>      }
> 
>      if ( field_width == -1 )
> --
> 2.7.4 (Apple Git-66)

* Re: [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for hardware domain
  2016-09-27 15:57 ` [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for hardware domain Roger Pau Monne
@ 2016-10-03  9:02   ` Paul Durrant
  2016-10-06 14:31     ` Roger Pau Monne
  2016-10-06 15:44   ` Jan Beulich
  1 sibling, 1 reply; 146+ messages in thread
From: Paul Durrant @ 2016-10-03  9:02 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for
> hardware domain
> 
> This is very similar to the PCI trap used for the traditional PV(H) Dom0.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/hvm/io.c         | 72 ++++++++++++++++++++++++++++++++++++++++++-
>  xen/arch/x86/traps.c          | 39 -----------------------
>  xen/drivers/passthrough/pci.c | 39 +++++++++++++++++++++++
>  xen/include/xen/pci.h         |  2 ++
>  4 files changed, 112 insertions(+), 40 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 1e7a5f9..31d54dc 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -247,12 +247,79 @@ static int dpci_portio_write(const struct hvm_io_handler *handler,
>      return X86EMUL_OKAY;
>  }
> 
> +static bool_t hw_dpci_portio_accept(const struct hvm_io_handler *handler,
> +                                    const ioreq_t *p)
> +{
> +    if ( (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc)
> +    {
> +        return 1;
> +    }
> +
> +    return 0;

Why not just return the value of the boolean expression?
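
I.e. simply:

    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;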

> +}
> +
> +static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
> +                            uint64_t addr,
> +                            uint32_t size,
> +                            uint64_t *data)
> +{
> +    struct domain *currd = current->domain;
> +
> +    if ( addr == 0xcf8 )
> +    {
> +        ASSERT(size == 4);
> +        *data = currd->arch.pci_cf8;
> +        return X86EMUL_OKAY;
> +    }
> +
> +    ASSERT((addr & 0xfffc) == 0xcfc);

You could do 'addr &= 3' and simplify the expressions below.
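
I.e. (sketch):

    addr &= 3;
    size = min(size, 4 - (uint32_t)addr);
    if ( size == 3 )
        size = 2;
    if ( pci_cfg_ok(currd, addr, size, NULL) )
        *data = pci_conf_read(currd->arch.pci_cf8, addr, size);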

> +    size = min(size, 4 - ((uint32_t)addr & 3));
> +    if ( size == 3 )
> +        size = 2;
> +    if ( pci_cfg_ok(currd, addr & 3, size, NULL) )

Is this the right place to do the check? Can this not be done in the accept op?

> +        *data = pci_conf_read(currd->arch.pci_cf8, addr & 3, size);
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
> +                                uint64_t addr,
> +                                uint32_t size,
> +                                uint64_t data)
> +{
> +    struct domain *currd = current->domain;
> +    uint32_t data32;
> +
> +    if ( addr == 0xcf8 )
> +    {
> +            ASSERT(size == 4);
> +            currd->arch.pci_cf8 = data;
> +            return X86EMUL_OKAY;
> +    }
> +
> +    ASSERT((addr & 0xfffc) == 0xcfc);

'addr &= 3' again here.

  Paul

> +    size = min(size, 4 - ((uint32_t)addr & 3));
> +    if ( size == 3 )
> +        size = 2;
> +    data32 = data;
> +    if ( pci_cfg_ok(currd, addr & 3, size, &data32) )
> +        pci_conf_write(currd->arch.pci_cf8, addr & 3, size, data);
> +
> +    return X86EMUL_OKAY;
> +}
> +
>  static const struct hvm_io_ops dpci_portio_ops = {
>      .accept = dpci_portio_accept,
>      .read = dpci_portio_read,
>      .write = dpci_portio_write
>  };
> 
> +static const struct hvm_io_ops hw_dpci_portio_ops = {
> +    .accept = hw_dpci_portio_accept,
> +    .read = hw_dpci_portio_read,
> +    .write = hw_dpci_portio_write
> +};
> +
>  void register_dpci_portio_handler(struct domain *d)
>  {
>      struct hvm_io_handler *handler = hvm_next_io_handler(d);
> @@ -261,7 +328,10 @@ void register_dpci_portio_handler(struct domain *d)
>          return;
> 
>      handler->type = IOREQ_TYPE_PIO;
> -    handler->ops = &dpci_portio_ops;
> +    if ( is_hardware_domain(d) )
> +        handler->ops = &hw_dpci_portio_ops;
> +    else
> +        handler->ops = &dpci_portio_ops;
>  }
> 
>  /*
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 24d173f..f3c5c9e 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2076,45 +2076,6 @@ static bool_t admin_io_okay(unsigned int port,
> unsigned int bytes,
>      return ioports_access_permitted(d, port, port + bytes - 1);
>  }
> 
> -static bool_t pci_cfg_ok(struct domain *currd, unsigned int start,
> -                         unsigned int size, uint32_t *write)
> -{
> -    uint32_t machine_bdf;
> -
> -    if ( !is_hardware_domain(currd) )
> -        return 0;
> -
> -    if ( !CF8_ENABLED(currd->arch.pci_cf8) )
> -        return 1;
> -
> -    machine_bdf = CF8_BDF(currd->arch.pci_cf8);
> -    if ( write )
> -    {
> -        const unsigned long *ro_map = pci_get_ro_map(0);
> -
> -        if ( ro_map && test_bit(machine_bdf, ro_map) )
> -            return 0;
> -    }
> -    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
> -    /* AMD extended configuration space access? */
> -    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
> -         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> -         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 <= 0x17 )
> -    {
> -        uint64_t msr_val;
> -
> -        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
> -            return 0;
> -        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
> -            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
> -    }
> -
> -    return !write ?
> -           xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
> -                                     start, start + size - 1, 0) == 0 :
> -           pci_conf_write_intercept(0, machine_bdf, start, size, write) >= 0;
> -}
> -
>  uint32_t guest_io_read(unsigned int port, unsigned int bytes,
>                         struct domain *currd)
>  {
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index dd291a2..a53b4c8 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -966,6 +966,45 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>                       PCI_COMMAND, cword & ~PCI_COMMAND_MASTER);
>  }
> 
> +bool_t pci_cfg_ok(struct domain *currd, unsigned int start,
> +                         unsigned int size, uint32_t *write)
> +{
> +    uint32_t machine_bdf;
> +
> +    if ( !is_hardware_domain(currd) )
> +        return 0;
> +
> +    if ( !CF8_ENABLED(currd->arch.pci_cf8) )
> +        return 1;
> +
> +    machine_bdf = CF8_BDF(currd->arch.pci_cf8);
> +    if ( write )
> +    {
> +        const unsigned long *ro_map = pci_get_ro_map(0);
> +
> +        if ( ro_map && test_bit(machine_bdf, ro_map) )
> +            return 0;
> +    }
> +    start |= CF8_ADDR_LO(currd->arch.pci_cf8);
> +    /* AMD extended configuration space access? */
> +    if ( CF8_ADDR_HI(currd->arch.pci_cf8) &&
> +         boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> +         boot_cpu_data.x86 >= 0x10 && boot_cpu_data.x86 <= 0x17 )
> +    {
> +        uint64_t msr_val;
> +
> +        if ( rdmsr_safe(MSR_AMD64_NB_CFG, msr_val) )
> +            return 0;
> +        if ( msr_val & (1ULL << AMD64_NB_CFG_CF8_EXT_ENABLE_BIT) )
> +            start |= CF8_ADDR_HI(currd->arch.pci_cf8);
> +    }
> +
> +    return !write ?
> +           xsm_pci_config_permission(XSM_HOOK, currd, machine_bdf,
> +                                     start, start + size - 1, 0) == 0 :
> +           pci_conf_write_intercept(0, machine_bdf, start, size, write) >= 0;
> +}
> +
>  /*
>   * scan pci devices to add all existed PCI devices to alldevs_list,
>   * and setup pci hierarchy in array bus2bridge.
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index 0872401..f191773 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -162,6 +162,8 @@ const char *parse_pci(const char *, unsigned int *seg, unsigned int *bus,
> 
>  bool_t pcie_aer_get_firmware_first(const struct pci_dev *);
> 
> +bool_t pci_cfg_ok(struct domain *, unsigned int, unsigned int, uint32_t *);
> +
>  struct pirq;
>  int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
>  void msixtbl_pt_unregister(struct domain *, struct pirq *);
> --
> 2.7.4 (Apple Git-66)


* Re: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-09-27 15:57 ` [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code Roger Pau Monne
@ 2016-10-03  9:54   ` Paul Durrant
  2016-10-06 15:08     ` Roger Pau Monne
  2016-10-06 15:47   ` Jan Beulich
  2016-10-10 12:41   ` Jan Beulich
  2 siblings, 1 reply; 146+ messages in thread
From: Paul Durrant @ 2016-10-03  9:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> Subject: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import
> QEMU passthrough code
> 
> Most of this code has been picked up from QEMU and modified so it can be
> plugged into the internal Xen IO handlers. The structure of the handlers has
> been keep quite similar to QEMU, so existing handlers can be imported
> without a lot of effort.
> 

If you lifted code from QEMU then one assumes there is no problem with license, but do you need to amend copyrights for any of the files where you put the code?

> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Paul Durrant <paul.durrant@citrix.com>
> ---
>  docs/misc/xen-command-line.markdown |   8 +
>  xen/arch/x86/hvm/hvm.c              |   2 +
>  xen/arch/x86/hvm/io.c               | 621 ++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/domain.h    |   4 +
>  xen/include/asm-x86/hvm/io.h        | 176 ++++++++++
>  xen/include/xen/pci.h               |   5 +
>  6 files changed, 816 insertions(+)
> 
> diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> index 59d7210..78130c8 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -670,6 +670,14 @@ Flag that makes a 64bit dom0 boot in PVH mode. No 32bit support at present.
> 
>  Flag that makes a dom0 boot in PVHv2 mode.
> 
> +### dom0permissive
> +> `= <boolean>`
> +
> +> Default: `true`
> +
> +Select mode of PCI pass-through when using a PVHv2 Dom0, either permissive or
> +not.
> +
>  ### dtuart (ARM)
>  > `= path [:options]`
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index a291f82..bc4f7bc 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -632,6 +632,8 @@ int hvm_domain_initialise(struct domain *d)
>              goto fail1;
>          }
>          memset(d->arch.hvm_domain.io_bitmap, ~0, HVM_IOBITMAP_SIZE);
> +        INIT_LIST_HEAD(&d->arch.hvm_domain.pt_devices);
> +        rwlock_init(&d->arch.hvm_domain.pt_lock);
>      }
>      else
>          d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 31d54dc..7de1de3 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -46,6 +46,10 @@
>  #include <xen/iocap.h>
>  #include <public/hvm/ioreq.h>
> 
> +/* Set permissive mode for HVM Dom0 PCI pass-through by default */
> +static bool_t opt_dom0permissive = 1;
> +boolean_param("dom0permissive", opt_dom0permissive);
> +
>  void send_timeoffset_req(unsigned long timeoff)
>  {
>      ioreq_t p = {
> @@ -258,12 +262,403 @@ static bool_t hw_dpci_portio_accept(const struct hvm_io_handler *handler,
>      return 0;
>  }
> 
> +static struct hvm_pt_device *hw_dpci_get_device(struct domain *d)
> +{
> +    uint8_t bus, slot, func;
> +    uint32_t addr;
> +    struct hvm_pt_device *dev;
> +
> +    /* Decode bus, slot and func. */
> +    addr = CF8_BDF(d->arch.pci_cf8);
> +    bus = PCI_BUS(addr);
> +    slot = PCI_SLOT(addr);
> +    func = PCI_FUNC(addr);
> +
> +    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
> +    {
> +        if ( dev->pdev->seg != 0 || dev->pdev->bus != bus ||
> +             dev->pdev->devfn != PCI_DEVFN(slot,func) )
> +            continue;
> +
> +        return dev;
> +    }
> +
> +    return NULL;
> +}
> +
> +/* Dispatchers */
> +
> +/* Find emulate register group entry */
> +struct hvm_pt_reg_group *hvm_pt_find_reg_grp(struct hvm_pt_device *d,
> +                                             uint32_t address)
> +{
> +    struct hvm_pt_reg_group *entry = NULL;
> +
> +    /* Find register group entry */
> +    list_for_each_entry( entry, &d->register_groups, entries )
> +    {
> +        /* check address */
> +        if ( (entry->base_offset <= address)
> +             && ((entry->base_offset + entry->size) > address) )
> +            return entry;
> +    }
> +
> +    /* Group entry not found */
> +    return NULL;
> +}
> +
> +/* Find emulate register entry */
> +struct hvm_pt_reg *hvm_pt_find_reg(struct hvm_pt_reg_group *reg_grp,
> +                                   uint32_t address)
> +{
> +    struct hvm_pt_reg *reg_entry = NULL;
> +    struct hvm_pt_reg_handler *handler = NULL;
> +    uint32_t real_offset = 0;
> +
> +    /* Find register entry */
> +    list_for_each_entry( reg_entry, &reg_grp->registers, entries )
> +    {
> +        handler = reg_entry->handler;
> +        real_offset = reg_grp->base_offset + handler->offset;
> +        /* Check address */
> +        if ( (real_offset <= address)
> +             && ((real_offset + handler->size) > address) )
> +            return reg_entry;
> +    }
> +
> +    return NULL;
> +}
> +
> +static int hvm_pt_pci_config_access_check(struct hvm_pt_device *d,
> +                                          uint32_t addr, int len)
> +{
> +    /* Check offset range */
> +    if ( addr >= 0xFF )
> +    {
> +        printk_pdev(d->pdev, XENLOG_DEBUG,
> +            "failed to access register with offset exceeding 0xFF. "
> +            "(addr: 0x%02x, len: %d)\n", addr, len);
> +        return -EDOM;
> +    }
> +
> +    /* Check read size */
> +    if ( (len != 1) && (len != 2) && (len != 4) )
> +    {
> +        printk_pdev(d->pdev, XENLOG_DEBUG,
> +            "failed to access register with invalid access length. "
> +            "(addr: 0x%02x, len: %d)\n", addr, len);
> +        return -EINVAL;
> +    }
> +
> +    /* Check offset alignment */
> +    if ( addr & (len - 1) )
> +    {
> +        printk_pdev(d->pdev, XENLOG_DEBUG,
> +            "failed to access register with invalid access size "
> +            "alignment. (addr: 0x%02x, len: %d)\n", addr, len);
> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
> +                                  uint32_t *data, int len)
> +{
> +    uint32_t val = 0;
> +    struct hvm_pt_reg_group *reg_grp_entry = NULL;
> +    struct hvm_pt_reg *reg_entry = NULL;
> +    int rc = 0;
> +    int emul_len = 0;
> +    uint32_t find_addr = addr;
> +    unsigned int seg = d->pdev->seg;
> +    unsigned int bus = d->pdev->bus;
> +    unsigned int slot = PCI_SLOT(d->pdev->devfn);
> +    unsigned int func = PCI_FUNC(d->pdev->devfn);
> +
> +    /* Sanity checks. */
> +    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    /* Find register group entry. */
> +    reg_grp_entry = hvm_pt_find_reg_grp(d, addr);
> +    if ( reg_grp_entry == NULL )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    /* Read I/O device register value. */
> +    switch( len )
> +    {
> +    case 1:
> +        val = pci_conf_read8(seg, bus, slot, func, addr);
> +        break;
> +    case 2:
> +        val = pci_conf_read16(seg, bus, slot, func, addr);
> +        break;
> +    case 4:
> +        val = pci_conf_read32(seg, bus, slot, func, addr);
> +        break;
> +    default:
> +        BUG();
> +    }
> +
> +    /* Adjust the read value to appropriate CFC-CFF window. */
> +    val <<= (addr & 3) << 3;
> +    emul_len = len;
> +
> +    /* Loop around the guest requested size. */
> +    while ( emul_len > 0 )
> +    {
> +        /* Find register entry to be emulated. */
> +        reg_entry = hvm_pt_find_reg(reg_grp_entry, find_addr);
> +        if ( reg_entry )
> +        {
> +            struct hvm_pt_reg_handler *handler = reg_entry->handler;
> +            uint32_t real_offset = reg_grp_entry->base_offset + handler->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);

Figuring out whether this makes sense makes my brain hurt. Any chance of some macro or at least comments about this?
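
Perhaps something along these lines (untested sketch; the macro name is
made up):

    /* Mask covering the low 'bytes' bytes of a 32-bit window. */
    #define BYTES_VALID_MASK(bytes) (0xFFFFFFFFU >> ((4 - (bytes)) << 3))

    uint32_t valid_mask = BYTES_VALID_MASK(emul_len);

plus a comment where the mask is subsequently shifted up by the byte
offset of the access within the register.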

> +            uint8_t *ptr_val = NULL;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +
> +            /* Do emulation based on register size. */
> +            switch ( handler->size )
> +            {
> +            case 1:
> +                if ( handler->u.b.read )
> +                    rc = handler->u.b.read(d, reg_entry, ptr_val, valid_mask);
> +                break;
> +            case 2:
> +                if ( handler->u.w.read )
> +                    rc = handler->u.w.read(d, reg_entry, (uint16_t *)ptr_val,
> +                                           valid_mask);
> +                break;
> +            case 4:
> +                if ( handler->u.dw.read )
> +                    rc = handler->u.dw.read(d, reg_entry, (uint32_t *)ptr_val,
> +                                            valid_mask);
> +                break;
> +            }
> +
> +            if ( rc < 0 )
> +            {
> +                gdprintk(XENLOG_WARNING,
> +                         "Invalid read emulation, shutting down domain\n");
> +                domain_crash(current->domain);
> +                return X86EMUL_UNHANDLEABLE;
> +            }
> +
> +            /* Calculate next address to find. */
> +            emul_len -= handler->size;
> +            if ( emul_len > 0 )
> +                find_addr = real_offset + handler->size;
> +        }
> +        else
> +        {
> +            /* Nothing to do with passthrough type register */
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* Need to shift back before returning them to pci bus emulator */
> +    val >>= ((addr & 3) << 3);
> +    *data = val;
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
> +                                    uint32_t val, int len)
> +{
> +    int index = 0;
> +    struct hvm_pt_reg_group *reg_grp_entry = NULL;
> +    int rc = 0;
> +    uint32_t read_val = 0, wb_mask;
> +    int emul_len = 0;
> +    struct hvm_pt_reg *reg_entry = NULL;
> +    uint32_t find_addr = addr;
> +    struct hvm_pt_reg_handler *handler = NULL;
> +    bool wp_flag = false;
> +    unsigned int seg = d->pdev->seg;
> +    unsigned int bus = d->pdev->bus;
> +    unsigned int slot = PCI_SLOT(d->pdev->devfn);
> +    unsigned int func = PCI_FUNC(d->pdev->devfn);
> +
> +    /* Sanity checks. */
> +    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    /* Find register group entry. */
> +    reg_grp_entry = hvm_pt_find_reg_grp(d, addr);
> +    if ( reg_grp_entry == NULL )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    /* Read I/O device register value. */
> +    switch( len )
> +    {
> +    case 1:
> +        read_val = pci_conf_read8(seg, bus, slot, func, addr);
> +        break;
> +    case 2:
> +        read_val = pci_conf_read16(seg, bus, slot, func, addr);
> +        break;
> +    case 4:
> +        read_val = pci_conf_read32(seg, bus, slot, func, addr);
> +        break;
> +    default:
> +        BUG();
> +    }
> +    wb_mask = 0xFFFFFFFF >> ((4 - len) << 3);
> +
> +    /* Adjust the read and write value to appropriate CFC-CFF window */
> +    read_val <<= (addr & 3) << 3;
> +    val <<= (addr & 3) << 3;
> +    emul_len = len;
> +
> +    /* Loop around the guest requested size */
> +    while ( emul_len > 0 )
> +    {
> +        /* Find register entry to be emulated */
> +        reg_entry = hvm_pt_find_reg(reg_grp_entry, find_addr);
> +        if ( reg_entry )
> +        {
> +            handler = reg_entry->handler;
> +            uint32_t real_offset = reg_grp_entry->base_offset + handler->offset;
> +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> +            uint8_t *ptr_val = NULL;
> +            uint32_t wp_mask = handler->emu_mask | handler->ro_mask;
> +
> +            valid_mask <<= (find_addr - real_offset) << 3;
> +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> +            if ( !d->permissive )
> +                wp_mask |= handler->res_mask;
> +            if ( wp_mask == (0xFFFFFFFF >> ((4 - handler->size) << 3)) )
> +                wb_mask &= ~((wp_mask >> ((find_addr - real_offset) << 3))
> +                             << ((len - emul_len) << 3));
> +
> +            /* Do emulation based on register size */
> +            switch ( handler->size )
> +            {
> +            case 1:
> +                if ( handler->u.b.write )
> +                    rc = handler->u.b.write(d, reg_entry, ptr_val,
> +                                            read_val >> ((real_offset & 3) << 3),
> +                                            valid_mask);
> +                break;
> +            case 2:
> +                if ( handler->u.w.write )
> +                    rc = handler->u.w.write(d, reg_entry, (uint16_t *)ptr_val,
> +                                            (read_val >> ((real_offset & 3) << 3)),
> +                                            valid_mask);
> +                break;
> +            case 4:
> +                if ( handler->u.dw.write )
> +                    rc = handler->u.dw.write(d, reg_entry, (uint32_t *)ptr_val,
> +                                             (read_val >> ((real_offset & 3) << 3)),
> +                                             valid_mask);
> +                break;
> +            }
> +
> +            if ( rc < 0 )
> +            {
> +                gdprintk(XENLOG_WARNING,
> +                         "Invalid write emulation, shutting down domain\n");
> +                domain_crash(current->domain);
> +                return X86EMUL_UNHANDLEABLE;
> +            }
> +
> +            /* Calculate next address to find */
> +            emul_len -= handler->size;
> +            if ( emul_len > 0 )
> +                find_addr = real_offset + handler->size;
> +        }
> +        else
> +        {
> +            /* Nothing to do with passthrough type register */
> +            if ( !d->permissive )
> +            {
> +                wb_mask &= ~(0xff << ((len - emul_len) << 3));
> +                /*
> +                 * Unused BARs will make it here, but we don't want to issue
> +                 * warnings for writes to them (bogus writes get dealt with
> +                 * above).
> +                 */
> +                if ( index < 0 )
> +                    wp_flag = true;
> +            }
> +            emul_len--;
> +            find_addr++;
> +        }
> +    }
> +
> +    /* Need to shift back before passing them to xen_host_pci_set_block */
> +    val >>= (addr & 3) << 3;
> +
> +    if ( wp_flag && !d->permissive_warned )
> +    {
> +        d->permissive_warned = true;
> +        gdprintk(XENLOG_WARNING,
> +          "Write-back to unknown field 0x%02x (partially) inhibited (0x%0*x)\n",
> +          addr, len * 2, wb_mask);
> +        gdprintk(XENLOG_WARNING,
> +          "If the device doesn't work, try enabling permissive mode\n");
> +        gdprintk(XENLOG_WARNING,
> +          "(unsafe) and if it helps report the problem to xen-devel\n");
> +    }
> +    for ( index = 0; wb_mask; index += len )
> +    {
> +        /* Unknown regs are passed through */
> +        while ( !(wb_mask & 0xff) )
> +        {
> +            index++;
> +            wb_mask >>= 8;
> +        }
> +        len = 0;
> +        do {
> +            len++;
> +            wb_mask >>= 8;
> +        } while ( wb_mask & 0xff );
> +
> +        switch( len )
> +        {
> +        case 1:
> +        {
> +            uint8_t value;
> +            memcpy(&value, (uint8_t *)&val + index, 1);
> +            pci_conf_write8(seg, bus, slot, func, addr + index, value);
> +            break;
> +        }
> +        case 2:
> +        {
> +            uint16_t value;
> +            memcpy(&value, (uint8_t *)&val + index, 2);
> +            pci_conf_write16(seg, bus, slot, func, addr + index, value);
> +            break;
> +        }
> +        case 4:
> +        {
> +            uint32_t value;
> +            memcpy(&value, (uint8_t *)&val + index, 4);
> +            pci_conf_write32(seg, bus, slot, func, addr + index, value);
> +            break;
> +        }
> +        default:
> +            BUG();
> +        }
> +    }
> +    return X86EMUL_OKAY;
> +}
> +
>  static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
>                              uint64_t addr,
>                              uint32_t size,
>                              uint64_t *data)
>  {
>      struct domain *currd = current->domain;
> +    struct hvm_pt_device *dev;
> +    uint32_t data32;
> +    uint8_t reg;
> +    int rc;
> 
>      if ( addr == 0xcf8 )
>      {
> @@ -276,6 +671,22 @@ static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
>      size = min(size, 4 - ((uint32_t)addr & 3));
>      if ( size == 3 )
>          size = 2;
> +
> +    read_lock(&currd->arch.hvm_domain.pt_lock);
> +    dev = hw_dpci_get_device(currd);
> +    if ( dev != NULL )
> +    {
> +        reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> +        rc = hvm_pt_pci_read_config(dev, reg, &data32, size);
> +        if ( rc == X86EMUL_OKAY )
> +        {
> +            read_unlock(&currd->arch.hvm_domain.pt_lock);
> +            *data = data32;
> +            return rc;
> +        }
> +    }
> +    read_unlock(&currd->arch.hvm_domain.pt_lock);
> +
>      if ( pci_cfg_ok(currd, addr & 3, size, NULL) )
>          *data = pci_conf_read(currd->arch.pci_cf8, addr & 3, size);
> 
> @@ -288,7 +699,10 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
>                                  uint64_t data)
>  {
>      struct domain *currd = current->domain;
> +    struct hvm_pt_device *dev;
>      uint32_t data32;
> +    uint8_t reg;
> +    int rc;
> 
>      if ( addr == 0xcf8 )
>      {
> @@ -302,12 +716,219 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
>      if ( size == 3 )
>          size = 2;
>      data32 = data;
> +
> +    read_lock(&currd->arch.hvm_domain.pt_lock);
> +    dev = hw_dpci_get_device(currd);
> +    if ( dev != NULL )
> +    {
> +        reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> +        rc = hvm_pt_pci_write_config(dev, reg, data32, size);
> +        if ( rc == X86EMUL_OKAY )
> +        {
> +            read_unlock(&currd->arch.hvm_domain.pt_lock);
> +            return rc;
> +        }
> +    }
> +    read_unlock(&currd->arch.hvm_domain.pt_lock);
> +

I must be missing something here. Why are you adding passthrough code to the hardware domain's handlers? Surely it sees all devices anyway?

>      if ( pci_cfg_ok(currd, addr & 3, size, &data32) )
>          pci_conf_write(currd->arch.pci_cf8, addr & 3, size, data);
> 
>      return X86EMUL_OKAY;
>  }
> 
> +static void hvm_pt_free_device(struct hvm_pt_device *dev)
> +{
> +    struct hvm_pt_reg_group *group, *g;
> +
> +    list_for_each_entry_safe( group, g, &dev->register_groups, entries )
> +    {
> +        struct hvm_pt_reg *reg, *r;
> +
> +        list_for_each_entry_safe( reg, r, &group->registers, entries )
> +        {
> +            list_del(&reg->entries);
> +            xfree(reg);
> +        }
> +
> +        list_del(&group->entries);
> +        xfree(group);
> +    }
> +
> +    xfree(dev);
> +}
> +
> +static int hvm_pt_add_register(struct hvm_pt_device *dev,
> +                               struct hvm_pt_reg_group *group,
> +                               struct hvm_pt_reg_handler *handler)
> +{
> +    struct pci_dev *pdev = dev->pdev;
> +    struct hvm_pt_reg *reg;
> +
> +    reg = xmalloc(struct hvm_pt_reg);
> +    if ( reg == NULL )
> +        return -ENOMEM;
> +
> +    memset(reg, 0, sizeof(*reg));

xzalloc()?
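
I.e. something along these lines (just a sketch of the suggested change,
using the xzalloc() type macro):

    /* Allocate and zero in one go instead of xmalloc() + memset(). */
    reg = xzalloc(struct hvm_pt_reg);
    if ( reg == NULL )
        return -ENOMEM;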

> +    reg->handler = handler;
> +    if ( handler->init != NULL )
> +    {
> +        uint32_t host_mask, size_mask, data = 0;
> +        uint8_t seg, bus, slot, func;
> +        unsigned int offset;
> +        uint32_t val;
> +        int rc;
> +
> +        /* Initialize emulate register */
> +        rc = handler->init(dev, reg->handler,
> +                           group->base_offset + reg->handler->offset, &data);
> +        if ( rc < 0 )
> +            return rc;
> +
> +        if ( data == HVM_PT_INVALID_REG )
> +        {
> +            xfree(reg);
> +            return 0;
> +        }
> +
> +        /* Sync up the data to val */
> +        offset = group->base_offset + reg->handler->offset;
> +        size_mask = 0xFFFFFFFF >> ((4 - reg->handler->size) << 3);
> +
> +        seg = pdev->seg;
> +        bus = pdev->bus;
> +        slot = PCI_SLOT(pdev->devfn);
> +        func = PCI_FUNC(pdev->devfn);
> +
> +        switch ( reg->handler->size )
> +        {
> +        case 1:
> +            val = pci_conf_read8(seg, bus, slot, func, offset);
> +            break;
> +        case 2:
> +            val = pci_conf_read16(seg, bus, slot, func, offset);
> +            break;
> +        case 4:
> +            val = pci_conf_read32(seg, bus, slot, func, offset);
> +            break;
> +        default:
> +            BUG();
> +        }
> +
> +        /*
> +         * Set bits in emu_mask are the ones we emulate. The reg shall
> +         * contain the emulated view of the guest - therefore we flip
> +         * the mask to mask out the host values (which reg initially
> +         * has).
> +         */
> +        host_mask = size_mask & ~reg->handler->emu_mask;
> +
> +        if ( (data & host_mask) != (val & host_mask) )
> +        {
> +            uint32_t new_val;
> +
> +            /* Mask out host (including past size). */
> +            new_val = val & host_mask;
> +            /* Merge emulated ones (excluding the non-emulated ones). */
> +            new_val |= data & host_mask;
> +            /*
> +             * Leave intact host and emulated values past the size -
> +             * even though we do not care as we write per reg->size
> +             * granularity, but for the logging below lets have the
> +             * proper value.
> +             */
> +            new_val |= ((val | data)) & ~size_mask;
> +            printk_pdev(pdev, XENLOG_ERR,
> +"offset 0x%04x mismatch! Emulated=0x%04x, host=0x%04x, syncing to
> 0x%04x.\n",
> +                        offset, data, val, new_val);
> +            val = new_val;
> +        }
> +        else
> +            val = data;
> +
> +        if ( val & ~size_mask )
> +        {
> +            printk_pdev(pdev, XENLOG_ERR,
> +                    "Offset 0x%04x:0x%04x expands past register size(%d)!\n",
> +                        offset, val, reg->handler->size);
> +            return -EINVAL;
> +        }
> +
> +        reg->val.dword = val;
> +    }
> +    list_add_tail(&reg->entries, &group->registers);
> +
> +    return 0;
> +}
> +
> +static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
> +};
> +
> +int hwdom_add_device(struct pci_dev *pdev)
> +{
> +    struct domain *d = pdev->domain;
> +    struct hvm_pt_device *dev;
> +    int j, i, rc;
> +
> +    ASSERT( is_hardware_domain(d) );
> +    ASSERT( pcidevs_locked() );
> +
> +    dev = xmalloc(struct hvm_pt_device);
> +    if ( dev == NULL )
> +        return -ENOMEM;
> +
> +    memset(dev, 0 , sizeof(*dev));

xzalloc()?
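
As above, just a sketch of the same simplification (assuming xzalloc() is
fine for this struct too):

    /* Allocate and zero in one go instead of xmalloc() + memset(). */
    dev = xzalloc(struct hvm_pt_device);
    if ( dev == NULL )
        return -ENOMEM;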

> +
> +    dev->pdev = pdev;
> +    INIT_LIST_HEAD(&dev->register_groups);
> +
> +    dev->permissive = opt_dom0permissive;
> +
> +    for ( j = 0; j < ARRAY_SIZE(hwdom_pt_handlers); j++ )
> +    {
> +        struct hvm_pt_handler_init *handler_init = hwdom_pt_handlers[j];
> +        struct hvm_pt_reg_group *group;
> +
> +        group = xmalloc(struct hvm_pt_reg_group);
> +        if ( group == NULL )
> +        {
> +            xfree(dev);
> +            return -ENOMEM;
> +        }
> +        INIT_LIST_HEAD(&group->registers);
> +
> +        rc = handler_init->init(dev, group);
> +        if ( rc == 0 )
> +        {
> +            for ( i = 0; handler_init->handlers[i].size != 0; i++ )
> +            {
> +                int rc;
> +
> +                rc = hvm_pt_add_register(dev, group,
> +                                         &handler_init->handlers[i]);
> +                if ( rc )
> +                {
> +                    printk_pdev(pdev, XENLOG_ERR, "error adding register: %d\n",
> +                                rc);
> +                    hvm_pt_free_device(dev);
> +                    return rc;
> +                }
> +            }
> +
> +            list_add_tail(&group->entries, &dev->register_groups);
> +        }
> +        else
> +            xfree(group);
> +    }
> +
> +    write_lock(&d->arch.hvm_domain.pt_lock);
> +    list_add_tail(&dev->entries, &d->arch.hvm_domain.pt_devices);
> +    write_unlock(&d->arch.hvm_domain.pt_lock);
> +    printk_pdev(pdev, XENLOG_DEBUG, "added for pass-through\n");
> +
> +    return 0;
> +}
> +
>  static const struct hvm_io_ops dpci_portio_ops = {
>      .accept = dpci_portio_accept,
>      .read = dpci_portio_read,
> diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
> index f34d784..1b1a52f 100644
> --- a/xen/include/asm-x86/hvm/domain.h
> +++ b/xen/include/asm-x86/hvm/domain.h
> @@ -152,6 +152,10 @@ struct hvm_domain {
>          struct vmx_domain vmx;
>          struct svm_domain svm;
>      };
> +
> +    /* List of passed-through devices (hw domain only). */
> +    struct list_head pt_devices;
> +    rwlock_t pt_lock;
>  };
> 
>  #define hap_enabled(d)  ((d)->arch.hvm_domain.hap_enabled)
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index e9b3f83..80f830d 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -153,6 +153,182 @@ extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
> 
>  void register_dpci_portio_handler(struct domain *d);
> 
> +/* Structures for pci-passthrough state and handlers. */
> +struct hvm_pt_device;
> +struct hvm_pt_reg_handler;
> +struct hvm_pt_reg;
> +struct hvm_pt_reg_group;
> +
> +/* Return code when register should be ignored. */
> +#define HVM_PT_INVALID_REG 0xFFFFFFFF
> +
> +/* function type for config reg */
> +typedef int (*hvm_pt_conf_reg_init)
> +    (struct hvm_pt_device *, struct hvm_pt_reg_handler *, uint32_t real_offset,
> +     uint32_t *data);
> +
> +typedef int (*hvm_pt_conf_dword_write)
> +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> +     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
> +typedef int (*hvm_pt_conf_word_write)
> +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> +     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
> +typedef int (*hvm_pt_conf_byte_write)
> +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> +     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
> +typedef int (*hvm_pt_conf_dword_read)
> +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> +     uint32_t *val, uint32_t valid_mask);
> +typedef int (*hvm_pt_conf_word_read)
> +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> +     uint16_t *val, uint16_t valid_mask);
> +typedef int (*hvm_pt_conf_byte_read)
> +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> +     uint8_t *val, uint8_t valid_mask);
> +
> +typedef int (*hvm_pt_group_init)
> +    (struct hvm_pt_device *, struct hvm_pt_reg_group *);
> +
> +/*
> + * Emulated register information.
> + *
> + * This should be shared between all the consumers that trap on accesses
> + * to certain PCI registers.
> + */
> +struct hvm_pt_reg_handler {
> +    uint32_t offset;
> +    uint32_t size;
> +    uint32_t init_val;
> +    /* reg reserved field mask (ON:reserved, OFF:defined) */
> +    uint32_t res_mask;
> +    /* reg read only field mask (ON:RO/ROS, OFF:other) */
> +    uint32_t ro_mask;
> +    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
> +    uint32_t rw1c_mask;
> +    /* reg emulate field mask (ON:emu, OFF:passthrough) */
> +    uint32_t emu_mask;
> +    hvm_pt_conf_reg_init init;
> +    /* read/write function pointer
> +     * for double_word/word/byte size */
> +    union {
> +        struct {
> +            hvm_pt_conf_dword_write write;
> +            hvm_pt_conf_dword_read read;
> +        } dw;
> +        struct {
> +            hvm_pt_conf_word_write write;
> +            hvm_pt_conf_word_read read;
> +        } w;
> +        struct {
> +            hvm_pt_conf_byte_write write;
> +            hvm_pt_conf_byte_read read;
> +        } b;
> +    } u;
> +};
> +
> +struct hvm_pt_handler_init {
> +    struct hvm_pt_reg_handler *handlers;
> +    hvm_pt_group_init init;
> +};
> +
> +/*
> + * Emulated register value.
> + *
> + * This is the representation of each specific emulated register.
> + */
> +struct hvm_pt_reg {
> +    struct list_head entries;
> +    struct hvm_pt_reg_handler *handler;
> +    union {
> +        uint8_t   byte;
> +        uint16_t  word;
> +        uint32_t  dword;
> +    } val;
> +};
> +
> +/*
> + * Emulated register group.
> + *
> + * In order to speed up (and logically group) emulated registers search,
> + * groups are used that represent specific emulated features, like MSI.
> + */
> +struct hvm_pt_reg_group {
> +    struct list_head entries;
> +    uint32_t base_offset;
> +    uint8_t size;
> +    struct list_head registers;
> +};
> +
> +/*
> + * Guest MSI information.
> + *
> + * MSI values set by the guest.
> + */
> +struct hvm_pt_msi {
> +    uint16_t flags;
> +    uint32_t addr_lo;  /* guest message address */
> +    uint32_t addr_hi;  /* guest message upper address */
> +    uint16_t data;     /* guest message data */
> +    uint32_t ctrl_offset; /* saved control offset */
> +    int pirq;          /* guest pirq corresponding */
> +    bool_t initialized;  /* when guest MSI is initialized */
> +    bool_t mapped;       /* when pirq is mapped */
> +};
> +
> +/*
> + * Guest passed-through PCI device.
> + */
> +struct hvm_pt_device {
> +    struct list_head entries;
> +
> +    struct pci_dev *pdev;
> +
> +    bool_t permissive;
> +    bool_t permissive_warned;
> +
> +    /* MSI status. */
> +    struct hvm_pt_msi msi;
> +
> +    struct list_head register_groups;
> +};
> +
> +/*
> + * The hierarchy of the above structures is the following:
> + *
> + * +---------------+         +---------------+
> + * |               | entries |               | ...
> + * | hvm_pt_device +---------+ hvm_pt_device +----+
> + * |               |         |               |
> + * +-+-------------+         +---------------+
> + *   |
> + *   | register_groups
> + *   |
> + * +-v----------------+          +------------------+
> + * |                  | entries  |                  | ...
> + * | hvm_pt_reg_group +----------+ hvm_pt_reg_group +----+
> + * |                  |          |                  |
> + * +-+----------------+          +------------------+
> + *   |
> + *   | registers
> + *   |
> + * +-v----------+            +------------+
> + * |            | entries    |            | ...
> + * | hvm_pt_reg +------------+ hvm_pt_reg +----+
> + * |            |            |            |
> + * +-+----------+            +-+----------+
> + *   |                         |
> + *   | handler                 | handler
> + *   |                         |
> + * +-v------------------+    +-v------------------+
> + * |                    |    |                    |
> + * | hvm_pt_reg_handler |    | hvm_pt_reg_handler |
> + * |                    |    |                    |
> + * +--------------------+    +--------------------+
> + */
> +
> +/* Helper to add passed-through devices to the hardware domain. */
> +int hwdom_add_device(struct pci_dev *pdev);
> +
>  #endif /* __ASM_X86_HVM_IO_H__ */
> 
> 
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index f191773..b21a891 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -90,6 +90,11 @@ struct pci_dev {
>      u64 vf_rlen[6];
>  };
> 
> +/* Helper for printing pci_dev related messages. */
> +#define printk_pdev(pdev, lvl, fmt, ...)                                  \
> +    printk(lvl "PCI %04x:%02x:%02x.%u: " fmt, pdev->seg, pdev->bus,      \
> +           PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), ##__VA_ARGS__)
> +
>  #define for_each_pdev(domain, pdev) \
>      list_for_each_entry(pdev, &(domain->arch.pdev_list), domain_list)
> 
> --
> 2.7.4 (Apple Git-66)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2
  2016-09-30 15:03   ` Jan Beulich
@ 2016-10-03 10:09     ` Roger Pau Monne
  2016-10-04  6:54       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-03 10:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Sep 30, 2016 at 09:03:34AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -663,6 +663,13 @@ Pin dom0 vcpus to their respective pcpus
> >  
> >  Flag that makes a 64bit dom0 boot in PVH mode. No 32bit support at present.
> >  
> > +### dom0hvm
> > +> `= <boolean>`
> > +
> > +> Default: `false`
> > +
> > +Flag that makes a dom0 boot in PVHv2 mode.
> 
> Considering sorting aspects this clearly wants to go at least ahead of
> dom0pvh.
>
> > --- a/xen/arch/x86/setup.c
> > +++ b/xen/arch/x86/setup.c
> > @@ -75,6 +75,10 @@ unsigned long __read_mostly cr4_pv32_mask;
> >  static bool_t __initdata opt_dom0pvh;
> >  boolean_param("dom0pvh", opt_dom0pvh);
> >  
> > +/* Boot dom0 in HVM mode */
> > +static bool_t __initdata opt_dom0hvm;
> 
> Plain bool please.
> 
> > @@ -1495,6 +1499,11 @@ void __init noreturn __start_xen(unsigned long mbi_p)
> >      if ( opt_dom0pvh )
> >          domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
> >  
> > +    if ( opt_dom0hvm ) {
> 
> Coding style.

Fixed all the above.
 
> > +        domcr_flags |= DOMCRF_hvm | (hvm_funcs.hap_supported ? DOMCRF_hap : 0);
> 
> So you mean to support PVHv2 on shadow (including for Dom0)
> right away. Are you also testing that?

I've added the following patch to my queue in order to allow the user to
select whether they want to use HAP or shadow. I've tested it locally, and
there seem to be no issues building a PVHv2 Dom0 using shadow.

---
diff --git a/xen/arch/x86/domain_build.c b/xen/arch/x86/domain_build.c
index 80e20fa..17956e2 100644
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -203,13 +203,6 @@ struct vcpu *__init alloc_dom0_vcpu0(struct domain *dom0)
     return setup_dom0_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
-#ifdef CONFIG_SHADOW_PAGING
-static bool_t __initdata opt_dom0_shadow;
-boolean_param("dom0_shadow", opt_dom0_shadow);
-#else
-#define opt_dom0_shadow 0
-#endif
-
 static char __initdata opt_dom0_ioports_disable[200] = "";
 string_param("dom0_ioports_disable", opt_dom0_ioports_disable);
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 9272318..252125d 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -79,6 +79,11 @@ boolean_param("dom0pvh", opt_dom0pvh);
 static bool_t __initdata opt_dom0hvm;
 boolean_param("dom0hvm", opt_dom0hvm);
 
+#ifdef CONFIG_SHADOW_PAGING
+bool __initdata opt_dom0_shadow;
+boolean_param("dom0_shadow", opt_dom0_shadow);
+#endif
+
 /* **** Linux config option: propagated to domain0. */
 /* "acpi=off":    Sisables both ACPI table parsing and interpreter. */
 /* "acpi=force":  Override the disable blacklist.                   */
@@ -1500,7 +1505,9 @@ void __init noreturn __start_xen(unsigned long mbi_p)
         domcr_flags |= DOMCRF_pvh | DOMCRF_hap;
 
     if ( opt_dom0hvm ) {
-        domcr_flags |= DOMCRF_hvm | (hvm_funcs.hap_supported ? DOMCRF_hap : 0);
+        domcr_flags |= DOMCRF_hvm |
+                       (hvm_funcs.hap_supported && !opt_dom0_shadow ?
+                       DOMCRF_hap : 0);
         config.emulation_flags = XEN_X86_EMU_LAPIC|XEN_X86_EMU_IOAPIC;
     }
 
diff --git a/xen/include/asm-x86/setup.h b/xen/include/asm-x86/setup.h
index c65b79c..888d952 100644
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -51,6 +51,12 @@ void microcode_grab_module(
 
 extern uint8_t kbd_shift_flags;
 
+#ifdef CONFIG_SHADOW_PAGING
+extern bool opt_dom0_shadow;
+#else
+#define opt_dom0_shadow 0
+#endif
+
 #ifdef NDEBUG
 # define highmem_start 0
 #else
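
For reference, with the above applied, a PVHv2 Dom0 using shadow paging
would be requested with something like the following Xen command line
options (a sketch; both are boolean params in this series):

    dom0hvm dom0_shadow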


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping
  2016-09-27 15:57 ` [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping Roger Pau Monne
@ 2016-10-03 10:10   ` Paul Durrant
  2016-10-06 15:25     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Paul Durrant @ 2016-10-03 10:10 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping
> 
> Add handlers to detect attempts from a PVHv2 Dom0 to change the position
> of the PCI BARs and properly remap them.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/hvm/io.c         |   2 +
>  xen/drivers/passthrough/pci.c | 307 ++++++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/io.h  |  19 +++
>  xen/include/xen/pci.h         |   3 +
>  4 files changed, 331 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 7de1de3..4db0266 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -862,6 +862,8 @@ static int hvm_pt_add_register(struct hvm_pt_device *dev,
>  }
> 
>  static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
> +    &hvm_pt_bar_init,
> +    &hvm_pt_vf_bar_init,
>  };
> 
>  int hwdom_add_device(struct pci_dev *pdev)
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 6d831dd..60c9e74 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -633,6 +633,313 @@ static int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
>      return 0;
>  }
> 
> +static bool bar_reg_is_vf(uint32_t real_offset, uint32_t handler_offset)
> +{
> +    if ( real_offset - handler_offset == PCI_SRIOV_BAR )
> +        return true;
> +    else
> +        return false;
> +}
> +

Return the bool expression rather than the if-then-else?
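
I.e. something like this (a sketch only):

    static bool bar_reg_is_vf(uint32_t real_offset, uint32_t handler_offset)
    {
        /* The comparison already yields a bool; return it directly. */
        return real_offset - handler_offset == PCI_SRIOV_BAR;
    }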

> +static int bar_reg_init(struct hvm_pt_device *s,
> +                        struct hvm_pt_reg_handler *handler,
> +                        uint32_t real_offset, uint32_t *data)
> +{
> +    uint8_t seg, bus, slot, func;
> +    uint64_t addr, size;
> +    uint32_t val;
> +    unsigned int index = handler->offset / 4;
> +    bool vf = bar_reg_is_vf(real_offset, handler->offset);
> +    struct hvm_pt_bar *bars = (vf ? s->vf_bars : s->bars);
> +    int num_bars = (vf ? PCI_SRIOV_NUM_BARS : s->num_bars);
> +    int rc;
> +
> +    if ( index >= num_bars )
> +    {
> +        *data = HVM_PT_INVALID_REG;
> +        return 0;
> +    }
> +
> +    seg = s->pdev->seg;
> +    bus = s->pdev->bus;
> +    slot = PCI_SLOT(s->pdev->devfn);
> +    func = PCI_FUNC(s->pdev->devfn);
> +    val = pci_conf_read32(seg, bus, slot, func, real_offset);
> +
> +    if ( index > 0 && bars[index - 1].type == HVM_PT_BAR_MEM64_LO )
> +        bars[index].type = HVM_PT_BAR_MEM64_HI;
> +    else if ( (val & PCI_BASE_ADDRESS_SPACE) ==
> PCI_BASE_ADDRESS_SPACE_IO )
> +    {
> +        bars[index].type = HVM_PT_BAR_UNUSED;
> +    }
> +    else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> +              PCI_BASE_ADDRESS_MEM_TYPE_64 )
> +        bars[index].type = HVM_PT_BAR_MEM64_LO;
> +    else
> +        bars[index].type = HVM_PT_BAR_MEM32;
> +
> +    if ( bars[index].type == HVM_PT_BAR_MEM32 ||
> +         bars[index].type == HVM_PT_BAR_MEM64_LO )
> +    {
> +        /* Size the BAR and map it. */
> +        rc = pci_size_bar(seg, bus, slot, func, real_offset - handler->offset,
> +                          num_bars, &index, &addr, &size);
> +        if ( rc )
> +        {
> +            printk_pdev(s->pdev, XENLOG_ERR, "unable to size BAR#%d\n",
> +                        index);
> +            return rc;
> +        }
> +
> +        if ( size == 0 )
> +            bars[index].type = HVM_PT_BAR_UNUSED;
> +        else
> +        {
> +            printk_pdev(s->pdev, XENLOG_DEBUG,
> +                        "Mapping BAR#%u: %#lx size: %u\n", index, addr, size);
> +            rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(addr),
> +                                DIV_ROUND_UP(size, PAGE_SIZE), true);
> +            if ( rc )
> +            {
> +                printk_pdev(s->pdev, XENLOG_ERR,
> +                            "failed to map BAR#%d into memory map: %d\n",
> +                            index, rc);
> +                return rc;
> +            }
> +        }
> +    }
> +
> +    *data = bars[index].type == HVM_PT_BAR_UNUSED ? HVM_PT_INVALID_REG : val;
> +    return 0;
> +}
> +
> +/* Only allow writes to check the size of the BARs */
> +static int allow_bar_write(struct hvm_pt_bar *bar, struct hvm_pt_reg *reg,
> +                           struct pci_dev *pdev, uint32_t val)
> +{
> +    uint32_t mask;
> +
> +    if ( bar->type == HVM_PT_BAR_MEM64_HI )
> +        mask = ~0;
> +    else
> +        mask = (uint32_t) PCI_BASE_ADDRESS_MEM_MASK;
> +
> +    if ( val != ~0 && (val & mask) != (reg->val.dword & mask) )
> +    {
> +        printk_pdev(pdev, XENLOG_ERR,
> +                "changing the position of the BARs is not yet supported: %#x\n",
> +                    val);

This doesn't seem to quite tally with the commit comment. Can BARs be re-programmed or not?

> +        return -EINVAL;
> +    }
> +
> +    return 0;
> +}
> +
> +static int bar_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
> +                         uint32_t *val, uint32_t dev_value, uint32_t valid_mask)
> +{
> +    int index = reg->handler->offset / 4;
> +
> +    return allow_bar_write(&s->bars[index], reg, s->pdev, *val);
> +}
> +
> +static int vf_bar_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
> +                            uint32_t *val, uint32_t dev_value,
> +                            uint32_t valid_mask)
> +{
> +    int index = reg->handler->offset / 4;
> +
> +    return allow_bar_write(&s->vf_bars[index], reg, s->pdev, *val);
> +}
> +
> +/* BAR regs static information table */
> +static struct hvm_pt_reg_handler bar_handler[] = {
> +    /* BAR 0 reg */
> +    /* mask of BAR need to be decided later, depends on IO/MEM type */
> +    {
> +        .offset     = 0,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = bar_reg_write,
> +    },
> +    /* BAR 1 reg */
> +    {
> +        .offset     = 4,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = bar_reg_write,
> +    },
> +    /* BAR 2 reg */
> +    {
> +        .offset     = 8,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = bar_reg_write,
> +    },
> +    /* BAR 3 reg */
> +    {
> +        .offset     = 12,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = bar_reg_write,
> +    },
> +    /* BAR 4 reg */
> +    {
> +        .offset     = 16,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = bar_reg_write,
> +    },
> +    /* BAR 5 reg */
> +    {
> +        .offset     = 20,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = bar_reg_write,
> +    },
> +    /* TODO: we should also trap accesses to the expansion ROM base address. */
> +    /* End. */
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +static int bar_group_init(struct hvm_pt_device *dev,
> +                          struct hvm_pt_reg_group *group)
> +{
> +    uint8_t seg, bus, slot, func;
> +
> +    seg = dev->pdev->seg;
> +    bus = dev->pdev->bus;
> +    slot = PCI_SLOT(dev->pdev->devfn);
> +    func = PCI_FUNC(dev->pdev->devfn);
> +    dev->htype = pci_conf_read8(seg, bus, slot, func, PCI_HEADER_TYPE) &
> 0x7f;
> +    switch ( dev->htype )
> +    {
> +    case PCI_HEADER_TYPE_NORMAL:
> +        group->size = 36;
> +        dev->num_bars = 6;
> +        break;
> +    case PCI_HEADER_TYPE_BRIDGE:
> +        group->size = 44;
> +        dev->num_bars = 2;
> +        break;
> +    default:
> +        printk_pdev(dev->pdev, XENLOG_ERR, "device type %#x not
> supported\n",
> +                    dev->htype);
> +        return -EINVAL;
> +    }
> +    group->base_offset = PCI_BASE_ADDRESS_0;
> +
> +    return 0;
> +}
> +
> +struct hvm_pt_handler_init hvm_pt_bar_init = {
> +    .handlers = bar_handler,
> +    .init = bar_group_init,
> +};
> +
> +/* BAR regs static information table */
> +static struct hvm_pt_reg_handler vf_bar_handler[] = {
> +    /* BAR 0 reg */
> +    /* mask of BAR need to be decided later, depends on IO/MEM type */
> +    {
> +        .offset     = 0,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = vf_bar_reg_write,
> +    },
> +    /* BAR 1 reg */
> +    {
> +        .offset     = 4,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = vf_bar_reg_write,
> +    },
> +    /* BAR 2 reg */
> +    {
> +        .offset     = 8,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = vf_bar_reg_write,
> +    },
> +    /* BAR 3 reg */
> +    {
> +        .offset     = 12,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = vf_bar_reg_write,
> +    },
> +    /* BAR 4 reg */
> +    {
> +        .offset     = 16,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = vf_bar_reg_write,
> +    },
> +    /* BAR 5 reg */
> +    {
> +        .offset     = 20,
> +        .size       = 4,
> +        .init_val   = 0x00000000,
> +        .init       = bar_reg_init,
> +        .u.dw.write = vf_bar_reg_write,
> +    },
> +    /* End. */
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +static int vf_bar_group_init(struct hvm_pt_device *dev,
> +                          struct hvm_pt_reg_group *group)
> +{
> +    uint8_t seg, bus, slot, func;
> +    struct pci_dev *pdev = dev->pdev;
> +    int sr_offset;
> +    uint16_t ctrl;
> +
> +    seg = dev->pdev->seg;
> +    bus = dev->pdev->bus;
> +    slot = PCI_SLOT(dev->pdev->devfn);
> +    func = PCI_FUNC(dev->pdev->devfn);
> +
> +    sr_offset = pci_find_ext_capability(seg, bus, pdev->devfn,
> +                                        PCI_EXT_CAP_ID_SRIOV);
> +    if ( sr_offset == 0 )
> +        return -EINVAL;
> +
> +    printk_pdev(pdev, XENLOG_DEBUG, "found SR-IOV capabilities\n");
> +    ctrl = pci_conf_read16(seg, bus, slot, func, sr_offset + PCI_SRIOV_CTRL);
> +    if ( (ctrl & (PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE)) )
> +    {
> +        printk_pdev(pdev, XENLOG_ERR,
> +                    "SR-IOV functions already enabled (%#04x)\n", ctrl);
> +        return -EINVAL;
> +    }
> +
> +    group->base_offset = sr_offset + PCI_SRIOV_BAR;
> +    group->size = PCI_SRIOV_NUM_BARS * 4;
> +
> +    return 0;
> +}
> +
> +struct hvm_pt_handler_init hvm_pt_vf_bar_init = {
> +    .handlers = vf_bar_handler,
> +    .init = vf_bar_group_init,
> +};
> +
>  int pci_add_device(u16 seg, u8 bus, u8 devfn,
>                     const struct pci_dev_info *info, nodeid_t node)
>  {
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 80f830d..25af036 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -19,6 +19,7 @@
>  #ifndef __ASM_X86_HVM_IO_H__
>  #define __ASM_X86_HVM_IO_H__
> 
> +#include <xen/pci_regs.h>
>  #include <asm/hvm/vpic.h>
>  #include <asm/hvm/vioapic.h>
>  #include <public/hvm/ioreq.h>
> @@ -275,6 +276,16 @@ struct hvm_pt_msi {
>      bool_t mapped;       /* when pirq is mapped */
>  };
> 
> +struct hvm_pt_bar {
> +    uint32_t val;
> +    enum bar_type {
> +        HVM_PT_BAR_UNUSED,
> +        HVM_PT_BAR_MEM32,
> +        HVM_PT_BAR_MEM64_LO,
> +        HVM_PT_BAR_MEM64_HI,
> +    } type;
> +};
> +
>  /*
>   * Guest passed-through PCI device.
>   */
> @@ -289,6 +300,14 @@ struct hvm_pt_device {
>      /* MSI status. */
>      struct hvm_pt_msi msi;
> 
> +    /* PCI header type. */
> +    uint8_t htype;
> +
> +    /* BAR tracking. */
> +    int num_bars;
> +    struct hvm_pt_bar bars[6];
> +    struct hvm_pt_bar vf_bars[PCI_SRIOV_NUM_BARS];
> +
>      struct list_head register_groups;
>  };
> 
> diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> index b21a891..51e0255 100644
> --- a/xen/include/xen/pci.h
> +++ b/xen/include/xen/pci.h
> @@ -174,4 +174,7 @@ int msixtbl_pt_register(struct domain *, struct pirq *, uint64_t gtable);
>  void msixtbl_pt_unregister(struct domain *, struct pirq *);
>  void msixtbl_pt_cleanup(struct domain *d);
> 
> +extern struct hvm_pt_handler_init hvm_pt_bar_init;
> +extern struct hvm_pt_handler_init hvm_pt_vf_bar_init;
> +
>  #endif /* __XEN_PCI_H__ */
> --
> 2.7.4 (Apple Git-66)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 26/30] xen/x86: add PCIe emulation
  2016-09-27 15:57 ` [PATCH v2 26/30] xen/x86: add PCIe emulation Roger Pau Monne
@ 2016-10-03 10:46   ` Paul Durrant
  2016-10-06 15:53     ` Roger Pau Monne
  2016-10-10 13:57   ` Jan Beulich
  1 sibling, 1 reply; 146+ messages in thread
From: Paul Durrant @ 2016-10-03 10:46 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Roger Pau Monne, Jan Beulich

> -----Original Message-----
> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> Beulich <jbeulich@suse.com>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>
> Subject: [PATCH v2 26/30] xen/x86: add PCIe emulation
> 
> Add a new MMIO handler that traps accesses to PCIe regions, as discovered by
> Xen from the MCFG ACPI table. The handler used is the same as the one used
> for accesses to the IO PCI configuration space.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Paul Durrant <paul.durrant@citrix.com>
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/hvm/io.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 171 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 779babb..088e3ec 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -46,6 +46,8 @@
>  #include <xen/iocap.h>
>  #include <public/hvm/ioreq.h>
> 
> +#include "../x86_64/mmconfig.h"
> +
>  /* Set permissive mode for HVM Dom0 PCI pass-through by default */
>  static bool_t opt_dom0permissive = 1;
>  boolean_param("dom0permissive", opt_dom0permissive);
> @@ -363,7 +365,7 @@ static int hvm_pt_pci_config_access_check(struct hvm_pt_device *d,
>  }
> 
>  static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
> -                                  uint32_t *data, int len)
> +                                  uint32_t *data, int len, bool pcie)
>  {
>      uint32_t val = 0;
>      struct hvm_pt_reg_group *reg_grp_entry = NULL;
> @@ -377,7 +379,7 @@ static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
>      unsigned int func = PCI_FUNC(d->pdev->devfn);
> 
>      /* Sanity checks. */
> -    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> +    if ( !pcie && hvm_pt_pci_config_access_check(d, addr, len) )
>          return X86EMUL_UNHANDLEABLE;
> 
>      /* Find register group entry. */
> @@ -468,7 +470,7 @@ static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
>  }
> 
>  static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
> -                                    uint32_t val, int len)
> +                                    uint32_t val, int len, bool pcie)
>  {
>      int index = 0;
>      struct hvm_pt_reg_group *reg_grp_entry = NULL;
> @@ -485,7 +487,7 @@ static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
>      unsigned int func = PCI_FUNC(d->pdev->devfn);
> 
>      /* Sanity checks. */
> -    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> +    if ( !pcie && hvm_pt_pci_config_access_check(d, addr, len) )
>          return X86EMUL_UNHANDLEABLE;
> 
>      /* Find register group entry. */
> @@ -677,7 +679,7 @@ static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
>      if ( dev != NULL )
>      {
>          reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> -        rc = hvm_pt_pci_read_config(dev, reg, &data32, size);
> +        rc = hvm_pt_pci_read_config(dev, reg, &data32, size, false);
>          if ( rc == X86EMUL_OKAY )
>          {
>              read_unlock(&currd->arch.hvm_domain.pt_lock);
> @@ -722,7 +724,7 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
>      if ( dev != NULL )
>      {
>          reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> -        rc = hvm_pt_pci_write_config(dev, reg, data32, size);
> +        rc = hvm_pt_pci_write_config(dev, reg, data32, size, false);
>          if ( rc == X86EMUL_OKAY )
>          {
>              read_unlock(&currd->arch.hvm_domain.pt_lock);
> @@ -1002,6 +1004,166 @@ static const struct hvm_io_ops hw_dpci_portio_ops = {
>      .write = hw_dpci_portio_write
>  };
> 
> +static struct acpi_mcfg_allocation *pcie_find_mmcfg(unsigned long addr)
> +{
> +    int i;
> +
> +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> +    {
> +        unsigned long start, end;
> +
> +        start = pci_mmcfg_config[i].address;
> +        end = pci_mmcfg_config[i].address +
> +              ((pci_mmcfg_config[i].end_bus_number + 1) << 20);
> +        if ( addr >= start && addr < end )
> +            return &pci_mmcfg_config[i];
> +    }
> +
> +    return NULL;
> +}
> +
> +static struct hvm_pt_device *hw_pcie_get_device(unsigned int seg,
> +                                                unsigned int bus,
> +                                                unsigned int slot,
> +                                                unsigned int func)
> +{
> +    struct hvm_pt_device *dev;
> +    struct domain *d = current->domain;
> +
> +    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
> +    {
> +        if ( dev->pdev->seg != seg || dev->pdev->bus != bus ||
> +             dev->pdev->devfn != PCI_DEVFN(slot, func) )
> +            continue;
> +
> +        return dev;
> +    }
> +
> +    return NULL;
> +}
> +
> +static void pcie_decode_addr(unsigned long addr, unsigned int *bus,
> +                             unsigned int *slot, unsigned int *func,
> +                             unsigned int *reg)
> +{
> +
> +    *bus = (addr >> 20) & 0xff;
> +    *slot = (addr >> 15) & 0x1f;
> +    *func = (addr >> 12) & 0x7;
> +    *reg = addr & 0xfff;
> +}
> +
> +static int pcie_range(struct vcpu *v, unsigned long addr)
> +{
> +
> +    return pcie_find_mmcfg(addr) != NULL ? 1 : 0;
> +}
> +
> +static int pcie_read(struct vcpu *v, unsigned long addr,
> +                     unsigned int len, unsigned long *pval)
> +{
> +    struct acpi_mcfg_allocation *mmcfg = pcie_find_mmcfg(addr);
> +    struct domain *d = v->domain;
> +    unsigned int seg, bus, slot, func, reg;
> +    struct hvm_pt_device *dev;
> +    uint32_t val;
> +    int rc;
> +
> +    ASSERT(mmcfg != NULL);
> +
> +    if ( len > 4 || len == 3 )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    addr -= mmcfg->address;
> +    seg = mmcfg->pci_segment;
> +    pcie_decode_addr(addr, &bus, &slot, &func, &reg);
> +
> +    read_lock(&d->arch.hvm_domain.pt_lock);
> +    dev = hw_pcie_get_device(seg, bus, slot, func);
> +    if ( dev != NULL )
> +    {
> +        rc = hvm_pt_pci_read_config(dev, reg, &val, len, true);
> +        if ( rc == X86EMUL_OKAY )
> +        {
> +            read_unlock(&d->arch.hvm_domain.pt_lock);
> +            goto out;
> +        }
> +    }
> +    read_unlock(&d->arch.hvm_domain.pt_lock);
> +
> +    /* Pass-through */
> +    switch ( len )
> +    {
> +    case 1:
> +        val = pci_conf_read8(seg, bus, slot, func, reg);
> +        break;
> +    case 2:
> +        val = pci_conf_read16(seg, bus, slot, func, reg);
> +        break;
> +    case 4:
> +        val = pci_conf_read32(seg, bus, slot, func, reg);
> +        break;
> +    }
> +
> + out:
> +    *pval = val;
> +    return X86EMUL_OKAY;
> +}
> +
> +static int pcie_write(struct vcpu *v, unsigned long addr,
> +                      unsigned int len, unsigned long val)
> +{
> +    struct acpi_mcfg_allocation *mmcfg = pcie_find_mmcfg(addr);
> +    struct domain *d = v->domain;
> +    unsigned int seg, bus, slot, func, reg;
> +    struct hvm_pt_device *dev;
> +    int rc;
> +
> +    ASSERT(mmcfg != NULL);
> +
> +    if ( len > 4 || len == 3 )
> +        return X86EMUL_UNHANDLEABLE;
> +
> +    addr -= mmcfg->address;
> +    seg = mmcfg->pci_segment;
> +    pcie_decode_addr(addr, &bus, &slot, &func, &reg);
> +
> +    read_lock(&d->arch.hvm_domain.pt_lock);
> +    dev = hw_pcie_get_device(seg, bus, slot, func);
> +    if ( dev != NULL )
> +    {
> +        rc = hvm_pt_pci_write_config(dev, reg, val, len, true);
> +        if ( rc == X86EMUL_OKAY )
> +        {
> +            read_unlock(&d->arch.hvm_domain.pt_lock);
> +            return rc;
> +        }
> +    }
> +    read_unlock(&d->arch.hvm_domain.pt_lock);
> +
> +    /* Pass-through */
> +    switch ( len )
> +    {
> +    case 1:
> +        pci_conf_write8(seg, bus, slot, func, reg, val);
> +        break;
> +    case 2:
> +        pci_conf_write16(seg, bus, slot, func, reg, val);
> +        break;
> +    case 4:
> +        pci_conf_write32(seg, bus, slot, func, reg, val);
> +        break;
> +    }
> +
> +    return X86EMUL_OKAY;
> +}
> +
> +static const struct hvm_mmio_ops pcie_mmio_ops = {
> +    .check = pcie_range,
> +    .read = pcie_read,
> +    .write = pcie_write
> +};
> +
>  void register_dpci_portio_handler(struct domain *d)
>  {
>      struct hvm_io_handler *handler = hvm_next_io_handler(d);
> @@ -1011,7 +1173,10 @@ void register_dpci_portio_handler(struct domain *d)
> 
>      handler->type = IOREQ_TYPE_PIO;
>      if ( is_hardware_domain(d) )
> +    {
>          handler->ops = &hw_dpci_portio_ops;
> +        register_mmio_handler(d, &pcie_mmio_ops);

This is a somewhat counterintuitive place to be registering an MMIO handler? Would this not be better done directly by the caller?
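
E.g. something along these lines (a sketch only, assuming the caller here is
hvm_domain_initialise(); the hw_dpci_portio_ops assignment would stay where
it is):

    /* In the caller, next to the portio handler registration: */
    register_dpci_portio_handler(d);
    if ( is_hardware_domain(d) )
        register_mmio_handler(d, &pcie_mmio_ops);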

  Paul

> +    }
>      else
>          handler->ops = &dpci_portio_ops;
>  }
> --
> 2.7.4 (Apple Git-66)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0
  2016-09-27 15:57 ` [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0 Roger Pau Monne
@ 2016-10-03 10:57   ` Paul Durrant
  2016-10-06 15:58     ` Roger Pau Monne
  2016-10-10 16:15   ` Jan Beulich
  1 sibling, 1 reply; 146+ messages in thread
From: Paul Durrant @ 2016-10-03 10:57 UTC (permalink / raw)
  To: xen-devel; +Cc: boris.ostrovsky, Roger Pau Monne

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Roger Pau Monne
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: boris.ostrovsky@oracle.com; Roger Pau Monne <roger.pau@citrix.com>
> Subject: [Xen-devel] [PATCH v2 28/30] xen/x86: add MSI-X emulation to
> PVHv2 Dom0
> 
> This requires adding handlers to the PCI configuration space, plus an MMIO
> handler for the MSI-X table; the PBA is left mapped directly into the guest.
> The implementation is based on the one already found in the passthrough
> code from QEMU.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Paul Durrant <paul.durrant@citrix.com>
> Jan Beulich <jbeulich@suse.com>
> Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/hvm/io.c         |   2 +
>  xen/arch/x86/hvm/vmsi.c       | 498 ++++++++++++++++++++++++++++++++++++++++++
>  xen/drivers/passthrough/pci.c |   6 +-
>  xen/include/asm-x86/hvm/io.h  |  26 +++
>  xen/include/asm-x86/msi.h     |   4 +-
>  5 files changed, 534 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> index 088e3ec..11b7313 100644
> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -867,6 +867,7 @@ static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
>      &hvm_pt_bar_init,
>      &hvm_pt_vf_bar_init,
>      &hvm_pt_msi_init,
> +    &hvm_pt_msix_init,
>  };
> 
>  int hwdom_add_device(struct pci_dev *pdev)
> @@ -1176,6 +1177,7 @@ void register_dpci_portio_handler(struct domain *d)
>      {
>          handler->ops = &hw_dpci_portio_ops;
>          register_mmio_handler(d, &pcie_mmio_ops);
> +        register_mmio_handler(d, &vmsix_mmio_ops);

Again, this is a somewhat counterintuitive place to make this call.

  Paul

>      }
>      else
>          handler->ops = &dpci_portio_ops;
> diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
> index 75ba429..92c3b50 100644
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -40,6 +40,7 @@
>  #include <asm/current.h>
>  #include <asm/event.h>
>  #include <asm/io_apic.h>
> +#include <asm/p2m.h>
> 
>  static void vmsi_inj_irq(
>      struct vlapic *target,
> @@ -1162,3 +1163,500 @@ struct hvm_pt_handler_init hvm_pt_msi_init = {
>      .handlers = vmsi_handler,
>      .init = vmsi_group_init,
>  };
> +
> +/* MSI-X */
> +#define latch(fld) latch[PCI_MSIX_ENTRY_##fld / sizeof(uint32_t)]
> +
> +static int vmsix_update_one(struct hvm_pt_device *s, int entry_nr,
> +                            uint32_t vec_ctrl)
> +{
> +    struct hvm_pt_msix_entry *entry = NULL;
> +    xen_domctl_bind_pt_irq_t bind;
> +    bool bound = true;
> +    struct irq_desc *desc;
> +    unsigned long flags;
> +    int irq;
> +    int pirq;
> +    int rc;
> +
> +    if ( entry_nr < 0 || entry_nr >= s->msix->total_entries )
> +        return -EINVAL;
> +
> +    entry = &s->msix->msix_entry[entry_nr];
> +
> +    if ( !entry->updated )
> +        goto mask;
> +
> +    pirq = entry->pirq;
> +
> +    /*
> +     * Update the entry addr and data to the latest values only when the
> +     * entry is masked or they are all masked, as required by the spec.
> +     * Addr and data changes while the MSI-X entry is unmasked get deferred
> +     * until the next masked -> unmasked transition.
> +     */
> +    if ( s->msix->maskall ||
> +         (entry->latch(VECTOR_CTRL_OFFSET) & PCI_MSIX_VECTOR_BITMASK) )
> +    {
> +        entry->addr = entry->latch(LOWER_ADDR_OFFSET) |
> +                      ((uint64_t)entry->latch(UPPER_ADDR_OFFSET) << 32);
> +        entry->data = entry->latch(DATA_OFFSET);
> +    }
> +
> +    if ( pirq == -1 )
> +    {
> +        struct msi_info msi_info;
> +        //struct irq_desc *desc;
> +        int index = -1;
> +
> +        /* Init physical one */
> +        printk_pdev(s->pdev, XENLOG_DEBUG, "setup MSI-X (entry: %d).\n",
> +                    entry_nr);
> +
> +        memset(&msi_info, 0, sizeof(msi_info));
> +        msi_info.seg = s->pdev->seg;
> +        msi_info.bus = s->pdev->bus;
> +        msi_info.devfn = s->pdev->devfn;
> +        msi_info.table_base = s->msix->table_base;
> +        msi_info.entry_nr = entry_nr;
> +
> +        rc = physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_MSI, &index,
> +                              &pirq, &msi_info);
> +        if ( rc )
> +        {
> +            /*
> +             * Do not broadcast this error, since there's nothing else
> +             * that can be done (MSI-X setup should have been successful).
> +             * Guest MSI would be actually not working.
> +             */
> +
> +            printk_pdev(s->pdev, XENLOG_ERR,
> +                          "can not map MSI-X (entry: %d)!\n", entry_nr);
> +            return rc;
> +        }
> +        entry->pirq = pirq;
> +        bound = false;
> +    }
> +
> +    ASSERT(entry->pirq != -1);
> +
> +    if ( bound )
> +    {
> +        printk_pdev(s->pdev, XENLOG_DEBUG, "destroy bind MSI-X entry
> %d\n",
> +                    entry_nr);
> +        bind.hvm_domid = DOMID_SELF;
> +        bind.machine_irq = entry->pirq;
> +        bind.irq_type = PT_IRQ_TYPE_MSI;
> +        bind.u.msi.gvec = msi_vector(entry->data);
> +        bind.u.msi.gflags = msi_gflags(entry->data, entry->addr);
> +        bind.u.msi.gtable = s->msix->table_base;
> +
> +        pcidevs_lock();
> +        rc = pt_irq_destroy_bind(current->domain, &bind);
> +        pcidevs_unlock();
> +        if ( rc )
> +        {
> +            printk_pdev(s->pdev, XENLOG_ERR, "updating of MSI-X failed:
> %d\n",
> +                        rc);
> +            rc = physdev_unmap_pirq(DOMID_SELF, entry->pirq);
> +            if ( rc )
> +                printk_pdev(s->pdev, XENLOG_ERR,
> +                            "unmapping of MSI pirq %d failed: %d\n",
> +                            entry->pirq, rc);
> +            entry->pirq = -1;
> +            return rc;
> +        }
> +    }
> +
> +    printk_pdev(s->pdev, XENLOG_DEBUG, "bind MSI-X entry %d\n",
> entry_nr);
> +    bind.hvm_domid = DOMID_SELF;
> +    bind.machine_irq = entry->pirq;
> +    bind.irq_type = PT_IRQ_TYPE_MSI;
> +    bind.u.msi.gvec = msi_vector(entry->data);
> +    bind.u.msi.gflags = msi_gflags(entry->data, entry->addr);
> +    bind.u.msi.gtable = s->msix->table_base;
> +
> +    pcidevs_lock();
> +    rc = pt_irq_create_bind(current->domain, &bind);
> +    pcidevs_unlock();
> +    if ( rc )
> +    {
> +        printk_pdev(s->pdev, XENLOG_ERR, "updating of MSI-X failed: %d\n",
> rc);
> +        rc = physdev_unmap_pirq(DOMID_SELF, entry->pirq);
> +        if ( rc )
> +            printk_pdev(s->pdev, XENLOG_ERR,
> +                        "unmapping of MSI pirq %d failed: %d\n",
> +                        entry->pirq, rc);
> +        entry->pirq = -1;
> +        return rc;
> +    }
> +
> +    entry->updated = false;
> +
> + mask:
> +    if ( entry->pirq != -1 &&
> +         ((vec_ctrl ^ entry->latch(VECTOR_CTRL_OFFSET)) &
> +          PCI_MSIX_VECTOR_BITMASK) )
> +    {
> +        printk_pdev(s->pdev, XENLOG_DEBUG, "%smasking MSI-X entry
> %d\n",
> +                    (vec_ctrl & PCI_MSIX_VECTOR_BITMASK) ? "" : "un", entry_nr);
> +        irq = domain_pirq_to_irq(s->pdev->domain, entry->pirq);
> +        desc = irq_to_desc(irq);
> +        spin_lock_irqsave(&desc->lock, flags);
> +        guest_mask_msi_irq(desc, !!(vec_ctrl &
> PCI_MSIX_VECTOR_BITMASK));
> +        spin_unlock_irqrestore(&desc->lock, flags);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmsix_update(struct hvm_pt_device *s)
> +{
> +    struct hvm_pt_msix *msix = s->msix;
> +    int i, rc;
> +
> +    for ( i = 0; i < msix->total_entries; i++ )
> +    {
> +        rc = vmsix_update_one(s, i,
> +                              msix->msix_entry[i].latch(VECTOR_CTRL_OFFSET));
> +        if ( rc )
> +            printk_pdev(s->pdev, XENLOG_ERR, "failed to update MSI-X %d\n",
> i);
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmsix_disable(struct hvm_pt_device *s)
> +{
> +    struct hvm_pt_msix *msix = s->msix;
> +    int i, rc;
> +
> +    for ( i = 0; i < msix->total_entries; i++ )
> +    {
> +        struct hvm_pt_msix_entry *entry =  &s->msix->msix_entry[i];
> +        xen_domctl_bind_pt_irq_t bind;
> +
> +        if ( entry->pirq == -1 )
> +            continue;
> +
> +        bind.hvm_domid = DOMID_SELF;
> +        bind.irq_type = PT_IRQ_TYPE_MSI;
> +        bind.machine_irq = entry->pirq;
> +        bind.u.msi.gvec = msi_vector(entry->data);
> +        bind.u.msi.gflags = msi_gflags(entry->data, entry->addr);
> +        bind.u.msi.gtable = msix->table_base;
> +        pcidevs_lock();
> +        rc = pt_irq_destroy_bind(current->domain, &bind);
> +        pcidevs_unlock();
> +        if ( rc )
> +        {
> +            printk_pdev(s->pdev, XENLOG_ERR,
> +                        "failed to destroy MSI-X PIRQ bind entry %d: %d\n",
> +                        i, rc);
> +            return rc;
> +        }
> +
> +        rc = physdev_unmap_pirq(DOMID_SELF, entry->pirq);
> +        if ( rc )
> +        {
> +            printk_pdev(s->pdev, XENLOG_ERR,
> +                        "failed to unmap PIRQ %d MSI-X entry %d: %d\n",
> +                        entry->pirq, i, rc);
> +            return rc;
> +        }
> +
> +        entry->pirq = -1;
> +        entry->updated = false;
> +    }
> +
> +    return 0;
> +}
> +
> +/* Message Control register for MSI-X */
> +static int vmsix_ctrl_reg_init(struct hvm_pt_device *s,
> +                               struct hvm_pt_reg_handler *handler,
> +                               uint32_t real_offset, uint32_t *data)
> +{
> +    struct pci_dev *pdev = s->pdev;
> +    struct hvm_pt_msix *msix = s->msix;
> +    uint8_t seg, bus, slot, func;
> +    uint16_t reg_field;
> +
> +    seg = pdev->seg;
> +    bus = pdev->bus;
> +    slot = PCI_SLOT(pdev->devfn);
> +    func = PCI_FUNC(pdev->devfn);
> +
> +    /* use I/O device register's value as initial value */
> +    reg_field = pci_conf_read16(seg, bus, slot, func, real_offset);
> +    if ( reg_field & PCI_MSIX_FLAGS_ENABLE )
> +    {
> +        printk_pdev(pdev, XENLOG_INFO,
> +                    "MSI-X already enabled, disabling it first\n");
> +        reg_field &= ~PCI_MSIX_FLAGS_ENABLE;
> +        pci_conf_write16(seg, bus, slot, func, real_offset, reg_field);
> +    }
> +
> +    msix->ctrl_offset = real_offset;
> +
> +    *data = handler->init_val | reg_field;
> +    return 0;
> +}
> +static int vmsix_ctrl_reg_write(struct hvm_pt_device *s, struct hvm_pt_reg *reg,
> +                                uint16_t *val, uint16_t dev_value,
> +                                uint16_t valid_mask)
> +{
> +    struct hvm_pt_reg_handler *handler = reg->handler;
> +    uint16_t writable_mask = 0;
> +    uint16_t throughable_mask = hvm_pt_get_throughable_mask(s, handler,
> +                                                            valid_mask);
> +    int debug_msix_enabled_old;
> +    uint16_t *data = &reg->val.word;
> +
> +    /* modify emulate register */
> +    writable_mask = handler->emu_mask & ~handler->ro_mask &
> valid_mask;
> +    *data = HVM_PT_MERGE_VALUE(*val, *data, writable_mask);
> +
> +    /* create value for writing to I/O device register */
> +    *val = HVM_PT_MERGE_VALUE(*val, dev_value, throughable_mask);
> +
> +    /* update MSI-X */
> +    if ( (*val & PCI_MSIX_FLAGS_ENABLE)
> +         && !(*val & PCI_MSIX_FLAGS_MASKALL) )
> +        vmsix_update(s);
> +    else if ( !(*val & PCI_MSIX_FLAGS_ENABLE) && s->msix->enabled )
> +        vmsix_disable(s);
> +
> +    s->msix->maskall = *val & PCI_MSIX_FLAGS_MASKALL;
> +
> +    debug_msix_enabled_old = s->msix->enabled;
> +    s->msix->enabled = !!(*val & PCI_MSIX_FLAGS_ENABLE);
> +    if ( s->msix->enabled != debug_msix_enabled_old )
> +        printk_pdev(s->pdev, XENLOG_DEBUG, "%s MSI-X\n",
> +                    s->msix->enabled ? "enable" : "disable");
> +
> +    return 0;
> +}
> +
> +/* MSI-X Capability Structure reg static information table */
> +static struct hvm_pt_reg_handler vmsix_handler[] = {
> +    /* Message Control reg */
> +    {
> +        .offset     = PCI_MSIX_FLAGS,
> +        .size       = 2,
> +        .init_val   = 0x0000,
> +        .res_mask   = 0x3800,
> +        .ro_mask    = 0x07FF,
> +        .emu_mask   = 0x0000,
> +        .init       = vmsix_ctrl_reg_init,
> +        .u.w.read   = hvm_pt_word_reg_read,
> +        .u.w.write  = vmsix_ctrl_reg_write,
> +    },
> +    /* End */
> +    {
> +        .size = 0,
> +    },
> +};
> +
> +static int vmsix_group_init(struct hvm_pt_device *s,
> +                            struct hvm_pt_reg_group *group)
> +{
> +    uint8_t seg, bus, slot, func;
> +    struct pci_dev *pdev = s->pdev;
> +    int msix_offset, total_entries, i, bar_index, rc;
> +    uint32_t table_off;
> +    uint16_t flags;
> +
> +    seg = pdev->seg;
> +    bus = pdev->bus;
> +    slot = PCI_SLOT(pdev->devfn);
> +    func = PCI_FUNC(pdev->devfn);
> +
> +    msix_offset = pci_find_cap_offset(seg, bus, slot, func,
> +                                      PCI_CAP_ID_MSIX);
> +    if ( msix_offset == 0 )
> +        return -ENODEV;
> +
> +    group->base_offset = msix_offset;
> +    flags = pci_conf_read16(seg, bus, slot, func,
> +                            msix_offset + PCI_MSIX_FLAGS);
> +    total_entries = flags & PCI_MSIX_FLAGS_QSIZE;
> +    total_entries += 1;
> +
> +    s->msix = xmalloc_bytes(sizeof(struct hvm_pt_msix) +
> +                            total_entries * sizeof(struct hvm_pt_msix_entry));
> +    if ( s->msix == NULL )
> +    {
> +        printk_pdev(pdev, XENLOG_ERR,
> +                    "unable to allocate memory for MSI-X\n");
> +        return -ENOMEM;
> +    }
> +    memset(s->msix, 0, sizeof(struct hvm_pt_msix) +
> +           total_entries * sizeof(struct hvm_pt_msix_entry));
> +
> +    s->msix->total_entries = total_entries;
> +    for ( i = 0; i < total_entries; i++ )
> +    {
> +        struct hvm_pt_msix_entry *entry = &s->msix->msix_entry[i];
> +
> +        entry->pirq = -1;
> +        entry->latch(VECTOR_CTRL_OFFSET) = PCI_MSIX_VECTOR_BITMASK;
> +    }
> +
> +    table_off = pci_conf_read32(seg, bus, slot, func,
> +                                msix_offset + PCI_MSIX_TABLE);
> +    bar_index = s->msix->bar_index = table_off & PCI_MSIX_BIRMASK;
> +    table_off &= ~PCI_MSIX_BIRMASK;
> +    s->msix->table_base = s->bars[bar_index].addr;
> +    s->msix->table_offset = table_off;
> +    s->msix->mmio_base_addr = s->bars[bar_index].addr + table_off;
> +    printk_pdev(pdev, XENLOG_DEBUG,
> +                "MSI-X table at BAR#%d address: %#lx size: %d\n",
> +                bar_index, s->msix->mmio_base_addr,
> +                total_entries * PCI_MSIX_ENTRY_SIZE);
> +
> +    /* Unmap the BAR so that the guest cannot directly write to it. */
> +    rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(s->msix->mmio_base_addr),
> +                        DIV_ROUND_UP(total_entries * PCI_MSIX_ENTRY_SIZE,
> +                                     PAGE_SIZE),
> +                        false);
> +    if ( rc )
> +    {
> +        printk_pdev(pdev, XENLOG_ERR,
> +                    "Unable to unmap address %#lx from BAR#%d\n",
> +                    s->bars[bar_index].addr, bar_index);
> +        xfree(s->msix);
> +        return rc;
> +    }
> +
> +    return 0;
> +}
> +
> +struct hvm_pt_handler_init hvm_pt_msix_init = {
> +    .handlers = vmsix_handler,
> +    .init = vmsix_group_init,
> +};
> +
> +/* MMIO handlers for MSI-X */
> +static struct hvm_pt_device *vmsix_find_dev_mmio(struct domain *d,
> +                                                 unsigned long addr)
> +{
> +    struct hvm_pt_device *dev;
> +
> +    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
> +    {
> +        unsigned long table_addr, table_size;
> +
> +        if ( dev->msix == NULL )
> +            continue;
> +
> +        table_addr = dev->msix->mmio_base_addr;
> +        table_size = dev->msix->total_entries * PCI_MSIX_ENTRY_SIZE;
> +        if ( addr < table_addr || addr >= table_addr + table_size )
> +            continue;
> +
> +        return dev;
> +    }
> +
> +    return NULL;
> +}
> +
> +static uint32_t vmsix_get_entry_value(struct hvm_pt_msix_entry *e, int offset)
> +{
> +    ASSERT(!(offset % sizeof(*e->latch)));
> +    return e->latch[offset / sizeof(*e->latch)];
> +}
> +
> +static void vmsix_set_entry_value(struct hvm_pt_msix_entry *e, int offset,
> +                                  uint32_t val)
> +{
> +    ASSERT(!(offset % sizeof(*e->latch)));
> +    e->latch[offset / sizeof(*e->latch)] = val;
> +}
> +
> +static int vmsix_mem_write(struct vcpu *v, unsigned long addr,
> +                           unsigned int size, unsigned long val)
> +{
> +    struct hvm_pt_device *s;
> +    struct hvm_pt_msix *msix;
> +    struct hvm_pt_msix_entry *entry;
> +    unsigned int entry_nr, offset;
> +    unsigned long raddr;
> +    int rc = 0;
> +
> +    read_lock(&v->domain->arch.hvm_domain.pt_lock);
> +    s = vmsix_find_dev_mmio(v->domain, addr);
> +    msix = s->msix;
> +    raddr = addr - msix->mmio_base_addr;
> +    entry_nr = raddr / PCI_MSIX_ENTRY_SIZE;
> +    if ( entry_nr >= msix->total_entries )
> +    {
> +        printk_pdev(s->pdev, XENLOG_ERR,
> +                    "asked MSI-X entry %d out of range!\n", entry_nr);
> +        rc = -EINVAL;
> +        goto out;
> +    }
> +
> +    entry = &msix->msix_entry[entry_nr];
> +    offset = raddr % PCI_MSIX_ENTRY_SIZE;
> +
> +    if ( offset != PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET )
> +    {
> +        if ( vmsix_get_entry_value(entry, offset) == val && entry->pirq != -1 )
> +            goto out;
> +
> +        entry->updated = true;
> +    }
> +    else
> +        vmsix_update_one(s, entry_nr, val);
> +
> +    vmsix_set_entry_value(entry, offset, val);
> +
> + out:
> +    read_unlock(&v->domain->arch.hvm_domain.pt_lock);
> +    return rc;
> +}
> +
> +static int vmsix_mem_read(struct vcpu *v, unsigned long addr,
> +                          unsigned int size, unsigned long *val)
> +{
> +    struct hvm_pt_device *s;
> +    struct hvm_pt_msix *msix;
> +    unsigned long raddr;
> +    int entry_nr, offset;
> +
> +    read_lock(&v->domain->arch.hvm_domain.pt_lock);
> +    s = vmsix_find_dev_mmio(v->domain, addr);
> +    msix = s->msix;
> +    raddr = addr - msix->mmio_base_addr;
> +    entry_nr = raddr / PCI_MSIX_ENTRY_SIZE;
> +    if ( entry_nr >= msix->total_entries )
> +    {
> +        printk_pdev(s->pdev, XENLOG_ERR,
> +                    "asked MSI-X entry %d out of range!\n", entry_nr);
> +        read_unlock(&v->domain->arch.hvm_domain.pt_lock);
> +        return -EINVAL;
> +    }
> +
> +    offset = raddr % PCI_MSIX_ENTRY_SIZE;
> +    *val = vmsix_get_entry_value(&msix->msix_entry[entry_nr], offset);
> +    read_unlock(&v->domain->arch.hvm_domain.pt_lock);
> +
> +    return 0;
> +}
> +
> +static int vmsix_mem_accepts(struct vcpu *v, unsigned long addr)
> +{
> +    int accept;
> +
> +    read_lock(&v->domain->arch.hvm_domain.pt_lock);
> +    accept = (vmsix_find_dev_mmio(v->domain, addr) != NULL);
> +    read_unlock(&v->domain->arch.hvm_domain.pt_lock);
> +
> +    return accept;
> +}
> +
> +const struct hvm_mmio_ops vmsix_mmio_ops = {
> +    .check = vmsix_mem_accepts,
> +    .read = vmsix_mem_read,
> +    .write = vmsix_mem_write,
> +};
> diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> index 60c9e74..f143029 100644
> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -681,9 +681,11 @@ static int bar_reg_init(struct hvm_pt_device *s,
>      if ( bars[index].type == HVM_PT_BAR_MEM32 ||
>           bars[index].type == HVM_PT_BAR_MEM64_LO )
>      {
> +        unsigned int next_index = index;
> +
>          /* Size the BAR and map it. */
>          rc = pci_size_bar(seg, bus, slot, func, real_offset - handler->offset,
> -                          num_bars, &index, &addr, &size);
> +                          num_bars, &next_index, &addr, &size);
>          if ( rc )
>          {
>              printk_pdev(s->pdev, XENLOG_ERR, "unable to size BAR#%d\n",
> @@ -706,6 +708,8 @@ static int bar_reg_init(struct hvm_pt_device *s,
>                              index, rc);
>                  return rc;
>              }
> +            bars[index].addr = addr;
> +            bars[index].size = size;
>          }
>      }
> 
> diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> index 0f8726a..9d9c439 100644
> --- a/xen/include/asm-x86/hvm/io.h
> +++ b/xen/include/asm-x86/hvm/io.h
> @@ -281,7 +281,30 @@ struct hvm_pt_msi {
>      bool_t mapped;       /* when pirq is mapped */
>  };
> 
> +/* Guest MSI-X information. */
> +struct hvm_pt_msix_entry {
> +    int pirq;
> +    uint64_t addr;
> +    uint32_t data;
> +    uint32_t latch[4];
> +    bool updated; /* indicate whether MSI ADDR or DATA is updated */
> +};
> +
> +struct hvm_pt_msix {
> +    uint32_t ctrl_offset;
> +    bool enabled;
> +    bool maskall;
> +    int total_entries;
> +    int bar_index;
> +    uint64_t table_base;
> +    uint32_t table_offset; /* page align mmap */
> +    uint64_t mmio_base_addr;
> +    struct hvm_pt_msix_entry msix_entry[0];
> +};
> +
>  struct hvm_pt_bar {
> +    uint64_t addr;
> +    uint64_t size;
>      uint32_t val;
>      enum bar_type {
>          HVM_PT_BAR_UNUSED,
> @@ -305,6 +328,9 @@ struct hvm_pt_device {
>      /* MSI status. */
>      struct hvm_pt_msi msi;
> 
> +    /* MSI-X status. */
> +    struct hvm_pt_msix *msix;
> +
>      /* PCI header type. */
>      uint8_t htype;
> 
> diff --git a/xen/include/asm-x86/msi.h b/xen/include/asm-x86/msi.h
> index 8c7fb27..981f9ef 100644
> --- a/xen/include/asm-x86/msi.h
> +++ b/xen/include/asm-x86/msi.h
> @@ -275,7 +275,9 @@ static inline uint32_t msi_gflags(uint32_t data, uint64_t addr)
>      return result;
>  }
> 
> -/* MSI HVM pass-through handlers. */
> +/* MSI(-X) HVM pass-through handlers. */
>  extern struct hvm_pt_handler_init hvm_pt_msi_init;
> +extern struct hvm_pt_handler_init hvm_pt_msix_init;
> +extern const struct hvm_mmio_ops vmsix_mmio_ops;
> 
>  #endif /* __ASM_MSI_H */
> --
> 2.7.4 (Apple Git-66)
> 
> 

* Re: [PATCH v2 30/30] xen: allow setting the store pfn HVM parameter
  2016-09-27 15:57 ` [PATCH v2 30/30] xen: allow setting the store pfn HVM parameter Roger Pau Monne
@ 2016-10-03 11:01   ` Paul Durrant
  0 siblings, 0 replies; 146+ messages in thread
From: Paul Durrant @ 2016-10-03 11:01 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, boris.ostrovsky, Jan Beulich, Roger Pau Monne

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Roger Pau Monne
> Sent: 27 September 2016 16:57
> To: xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>;
> boris.ostrovsky@oracle.com; Roger Pau Monne <roger.pau@citrix.com>; Jan
> Beulich <jbeulich@suse.com>
> Subject: [Xen-devel] [PATCH v2 30/30] xen: allow setting the store pfn HVM
> parameter
> 
> Xen already allows setting the store event channel, and this parameter is not
> used by Xen at all.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> Cc: Jan Beulich <jbeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
>  xen/arch/x86/hvm/hvm.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index bc4f7bc..5c3aa2a 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -4982,6 +4982,7 @@ static int hvm_allow_set_param(struct domain *d,
>      case HVM_PARAM_STORE_EVTCHN:
>      case HVM_PARAM_CONSOLE_EVTCHN:
>      case HVM_PARAM_X87_FIP_WIDTH:
> +    case HVM_PARAM_STORE_PFN:

Why does a guest need to be able to set this? It needs to be able to reset the store evtchn because of the need to handle an evtchn reset, which blows away the guest's local port, but what is going to move the store ring such that the guest needs to play with it?

  Paul

>          break;
>      /*
>       * The following parameters must not be set by the guest
> --
> 2.7.4 (Apple Git-66)
> 
> 

* Re: [PATCH v2 12/30] xen/x86: make print_e820_memory_map global
  2016-09-30 15:04   ` Jan Beulich
@ 2016-10-03 16:23     ` Roger Pau Monne
  2016-10-04  6:47       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-03 16:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Sep 30, 2016 at 09:04:24AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > So that it can be called from the Dom0 builder.
> 
> Why would the Dom0 builder need to call it, when it doesn't so far?

IMHO, I find it useful to print the domain 0 memory map during creation. It 
wasn't useful for a PV Dom0, because the PV Dom0 memory map is just the 
native memory map, which is already printed earlier during the boot process. 
If it doesn't seem useful to you or others I can leave it out.

Roger.


* Re: [PATCH v2 12/30] xen/x86: make print_e820_memory_map global
  2016-10-03 16:23     ` Roger Pau Monne
@ 2016-10-04  6:47       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-04  6:47 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 03.10.16 at 18:23, <roger.pau@citrix.com> wrote:
> On Fri, Sep 30, 2016 at 09:04:24AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > So that it can be called from the Dom0 builder.
>> 
>> Why would the Dom0 builder need to call it, when it doesn't so far?
> 
> IMHO, I find it useful to print the domain 0 memory map during creation. It 
> wasn't useful for a PV Dom0, because the PV Dom0 memory map is just the 
> native memory map, which is already printed earlier during the boot process. 
> If it doesn't seem useful to you or others I can leave it out.

Well, I don't mean to nack the change if others agree with you. If this
is to be made more widely available, it would want its first parameter
constified though.
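
I.e. something like:

    void print_e820_memory_map(const struct e820entry *map,
                               unsigned int entries);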

Jan



* Re: [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2
  2016-10-03 10:09     ` Roger Pau Monne
@ 2016-10-04  6:54       ` Jan Beulich
  2016-10-04  7:09         ` Andrew Cooper
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-04  6:54 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 03.10.16 at 12:09, <roger.pau@citrix.com> wrote:
> I've added the following patch to my queue, in order to allow the user to 
> select whether they want to use HAP or shadow. I've tested it locally and 
> there seem to be no issues in building a PVHv2 Dom0 using shadow.

Hmm, two remarks: For one, I'm not convinced of the need to move
the definition. It being where it is now allows the string literal to be
discarded post boot. And considering that the option has presumably
been broken for PV for a long time and was never working for PVHv1,
I'm also unconvinced using it (and hence retaining its existence) is a
good idea - I'd much rather see "dom0hvm=hap" and
"dom0hvm=shadow" supported along with the booleans which can
be given to it right now.

> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -51,6 +51,12 @@ void microcode_grab_module(
>  
>  extern uint8_t kbd_shift_flags;
>  
> +#ifdef CONFIG_SHADOW_PAGING
> +extern bool opt_dom0_shadow;
> +#else
> +#define opt_dom0_shadow 0

Please use "false" here.

Jan



* Re: [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2
  2016-10-04  6:54       ` Jan Beulich
@ 2016-10-04  7:09         ` Andrew Cooper
  0 siblings, 0 replies; 146+ messages in thread
From: Andrew Cooper @ 2016-10-04  7:09 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

On 04/10/2016 07:54, Jan Beulich wrote:
>>>> On 03.10.16 at 12:09, <roger.pau@citrix.com> wrote:
>> I've added the following patch to my queue, in order to allow the user to 
>> select whether they want to use HAP or shadow. I've tested it locally and 
>> there seem to be no issues in building a PVHv2 Dom0 using shadow.
> Hmm, two remarks: For one, I'm not convinced of the need to move
> the definition. It being where it is now allows the string literal to be
> discarded post boot. And considering that the option has presumably
> been broken for PV for a long time and was never working for PVHv1,
> I'm also unconvinced using it (and hence retaining its existence) is a
> good idea - I'd much rather see "dom0hvm=hap" and
> "dom0hvm=shadow" supported along with the booleans which can
> be given to it right now.

We already have a large number of dom0$foo options which are
inconsistent in their use of underscores or word breaks.

Could we introduce a dom0= instead and run it like the existing iommu=
to avoid gaining any new top level dom0$foo options?
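
Something along the lines of the iommu= parsing would do, e.g. (the
option names are purely illustrative):

    static bool_t __initdata dom0_hvm, dom0_shadow;

    static void __init parse_dom0_param(char *s)
    {
        char *ss;

        do {
            ss = strchr(s, ',');
            if ( ss )
                *ss = '\0';

            if ( !strcmp(s, "hvm") )
                dom0_hvm = 1;
            else if ( !strcmp(s, "shadow") )
                dom0_shadow = 1;

            s = ss + 1;
        } while ( ss );
    }
    custom_param("dom0", parse_dom0_param);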

~Andrew


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-09-30 15:52   ` Jan Beulich
@ 2016-10-04  9:12     ` Roger Pau Monne
  2016-10-04 11:16       ` Jan Beulich
  2016-10-11 14:06     ` Roger Pau Monne
  1 sibling, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-04  9:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > @@ -43,6 +44,11 @@ static long __initdata dom0_nrpages;
> >  static long __initdata dom0_min_nrpages;
> >  static long __initdata dom0_max_nrpages = LONG_MAX;
> >  
> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
> > +#define HVM_VM86_TSS_SIZE   128
> > +
> > +static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
> 
> This is for your debugging only I suppose?

Probably, I wasn't sure if it was relevant so I left it here. Would it make 
sense to only print this for debug builds of the hypervisor? Or better to 
just remove it?

> > @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
> >          avail -= dom0_paging_pages(d, nr_pages);
> >      }
> >  
> > -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> > +    if ( is_pv_domain(d) &&
> > +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> 
> Perhaps better to simply force parms->p2m_base to UNSET_ADDR
> earlier on?

p2m_base is already unconditionally set to UNSET_ADDR for PVHv2 domains, 
hence the added is_pv_domain check in order to make sure that PVHv2 guests 
don't get into that branch, which AFAICT is only relevant to PV guests.

> > @@ -579,8 +588,19 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >              continue;
> >          }
> >  
> > -        *entry_guest = *entry;
> > -        pages = PFN_UP(entry_guest->size);
> > +        /*
> > +         * Make sure the start and length are aligned to PAGE_SIZE, because
> > +         * that's the minimum granularity of the 2nd stage translation.
> > +         */
> > +        start = ROUNDUP(entry->addr, PAGE_SIZE);
> > +        end = (entry->addr + entry->size) & PAGE_MASK;
> > +        if ( start >= end )
> > +            continue;
> > +
> > +        entry_guest->type = E820_RAM;
> > +        entry_guest->addr = start;
> > +        entry_guest->size = end - start;
> > +        pages = PFN_DOWN(entry_guest->size);
> >          if ( (cur_pages + pages) > nr_pages )
> >          {
> >              /* Truncate region */
> > @@ -591,6 +611,8 @@ static __init void pvh_setup_e820(struct domain *d, unsigned long nr_pages)
> >          {
> >              cur_pages += pages;
> >          }
> > +        ASSERT(IS_ALIGNED(entry_guest->addr, PAGE_SIZE) &&
> > +               IS_ALIGNED(entry_guest->size, PAGE_SIZE));
> 
> What does this guard against? Your addition arranges for things to
> be page aligned, and the only adjustment done until we get here is
> one that obviously also doesn't violate that requirement. I'm all for
> assertions when they check state which is not obviously right, but
> here I don't see the need.

Right, I'm going to remove it. I've added more seat belts than needed when 
testing this and forgot to remove them.

> > @@ -1657,15 +1679,238 @@ out:
> >      return rc;
> >  }
> >  
> > +/* Populate an HVM memory range using the biggest possible order. */
> > +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> > +                                             uint64_t size)
> > +{
> > +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
> > +    unsigned int order;
> > +    struct page_info *page;
> > +    int rc;
> > +
> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> > +
> > +    order = MAX_ORDER;
> > +    while ( size != 0 )
> > +    {
> > +        order = min(get_order_from_bytes_floor(size), order);
> > +        page = alloc_domheap_pages(d, order, memflags);
> > +        if ( page == NULL )
> > +        {
> > +            if ( order == 0 && memflags )
> > +            {
> > +                /* Try again without any memflags. */
> > +                memflags = 0;
> > +                order = MAX_ORDER;
> > +                continue;
> > +            }
> > +            if ( order == 0 )
> > +                panic("Unable to allocate memory with order 0!\n");
> > +            order--;
> > +            continue;
> > +        }
> 
> Is it not possible to utilize alloc_chunk() here?

Not really, alloc_chunk will do a ceil calculation of the number of needed 
pages, which means we end up allocating bigger chunks than needed and then 
freeing them. I prefer this approach, since we always request the exact memory 
that's required, so there's no need to free leftover pages.

> > +        hvm_mem_stats[order]++;
> > +        rc = guest_physmap_add_page(d, _gfn(PFN_DOWN(start)),
> > +                                    _mfn(page_to_mfn(page)), order);
> > +        if ( rc != 0 )
> > +            panic("Failed to populate memory: [%" PRIx64 " - %" PRIx64 "] %d\n",
> 
> [<start>,<end>) please.
> 
> > +                  start, start + (((uint64_t)1) << (order + PAGE_SHIFT)), rc);
> > +        start += ((uint64_t)1) << (order + PAGE_SHIFT);
> > +        size -= ((uint64_t)1) << (order + PAGE_SHIFT);
> 
> Please prefer 1ULL over (uint64_t)1.
> 
> > +        if ( (size & 0xffffffff) == 0 )
> > +            process_pending_softirqs();
> 
> That's 4Gb at a time - isn't that a little too much?

Hm, it's the same that's used in pvh_add_mem_mapping AFAICT. I could reduce 
it to 0xfffffff, but I'm also wondering if it makes sense to just call it on 
each iteration, since we are possibly mapping regions with big orders here.

> > +    }
> > +
> > +}
> 
> Stray blank line.
> 
> > +static int __init hvm_setup_vmx_unrestricted_guest(struct domain *d)
> > +{
> > +    struct e820entry *entry;
> > +    p2m_type_t p2mt;
> > +    uint32_t rc, *ident_pt;
> > +    uint8_t *tss;
> > +    mfn_t mfn;
> > +    paddr_t gaddr = 0;
> > +    int i;
> 
> unsigned int
> 
> > +    /*
> > +     * Stole some space from the last found RAM region. One page will be
> 
> Steal
> 
> > +     * used for the identify page tables, and the remaining space for the
> 
> identity
> 
> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> > +     * to PAGE_SIZE.
> > +     */
> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> > +    {
> > +        entry = &d->arch.e820[d->arch.nr_e820 - i];
> > +        if ( entry->type != E820_RAM ||
> > +             entry->size < PAGE_SIZE + HVM_VM86_TSS_SIZE )
> > +            continue;
> > +
> > +        entry->size -= PAGE_SIZE + HVM_VM86_TSS_SIZE;
> > +        gaddr = entry->addr + entry->size;
> > +        break;
> > +    }
> > +
> > +    if ( gaddr == 0 || gaddr < MB(1) )
> > +    {
> > +        printk("Unable to find memory to stash the identity map and TSS\n");
> > +        return -ENOMEM;
> 
> One function up you panic() on error - please be consistent. Also for
> one of the other patches I think we figured that the TSS isn't really
> required, so please only warn in that case.
> 
> > +    }
> > +
> > +    /*
> > +     * Identity-map page table is required for running with CR0.PG=0
> > +     * when using Intel EPT. Create a 32-bit non-PAE page directory of
> > +     * superpages.
> > +     */
> > +    tss = map_domain_gfn(p2m_get_hostp2m(d), _gfn(PFN_DOWN(gaddr)),
> > +                         &mfn, &p2mt, 0, &rc);
> 
> Comment and operation don't really fit together.
> 
> > +static int __init hvm_setup_p2m(struct domain *d)
> > +{
> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> > +    unsigned long nr_pages;
> > +    int i, rc, preempted;
> > +
> > +    printk("** Preparing memory map **\n");
> 
> Debugging leftover again?

Should this be merged with the next message, so that it reads as "Preparing 
and populating the memory map"?

> > +    /*
> > +     * Subtract one page for the EPT identity page table and two pages
> > +     * for the MADT replacement.
> > +     */
> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
> 
> How do you know the MADT replacement requires two pages? Isn't
> that CPU-count dependent? And doesn't the partial page used for
> the TSS also need accounting for here?

Yes, it's CPU-count dependent. This is just an approximation, since we only 
support up to 256 CPUs on HVM guests, and each Processor Local APIC entry is 
8 bytes, this means that the CPU-related data is going to use up to 2048 
bytes of data, which still leaves plenty of space for the IO APIC and the 
Interrupt Source Override entries. We request two pages in case the 
original MADT crosses a page boundary. FWIW, I could also fetch the original 
MADT size earlier and use that as the upper bound here for memory 
reservation.
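
Roughly the kind of upper bound I have in mind (ACPI struct names as in
Xen's headers; the interrupt override count is just a guess):

    madt_size = sizeof(struct acpi_table_madt) +
                sizeof(struct acpi_madt_local_apic) * HVM_MAX_VCPUS +
                sizeof(struct acpi_madt_io_apic) +
                sizeof(struct acpi_madt_interrupt_override) * 16;
    pages = DIV_ROUND_UP(madt_size, PAGE_SIZE) + 1; /* may cross a page */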

In the RFC series we also spoke about placing the MADT in a different 
position than the native one, which would mean that we would end up stealing 
some space from a RAM region in order to place it, so that we wouldn't have 
to do this accounting.

In fact this is slightly wrong, and it doesn't need to account for the 
identity page table or the VM86 TSS, since we end up stealing this space 
from populated RAM regions. At the moment we only need to account for the 2 
pages that could be used by the MADT.

> > +    hvm_setup_e820(d, nr_pages);
> > +    do {
> > +        preempted = 0;
> > +        paging_set_allocation(d, dom0_paging_pages(d, nr_pages),
> > +                              &preempted);
> > +        process_pending_softirqs();
> > +    } while ( preempted );
> > +
> > +    /*
> > +     * Special treatment for memory < 1MB:
> > +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
> > +     *  - Map everything else as 1:1.
> > +     * NB: all this only makes sense if booted from legacy BIOSes.
> > +     */
> > +    rc = modify_mmio_11(d, 0, PFN_DOWN(MB(1)), true);
> > +    if ( rc )
> > +    {
> > +        printk("Failed to map low 1MB 1:1: %d\n", rc);
> > +        return rc;
> > +    }
> > +
> > +    printk("** Populating memory map **\n");
> > +    /* Populate memory map. */
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        if ( d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        hvm_populate_memory_range(d, d->arch.e820[i].addr,
> > +                                  d->arch.e820[i].size);
> > +        if ( d->arch.e820[i].addr < MB(1) )
> > +        {
> > +            unsigned long end = min_t(unsigned long,
> > +                            d->arch.e820[i].addr + d->arch.e820[i].size, MB(1));
> > +
> > +            saved_current = current;
> > +            set_current(v);
> > +            rc = hvm_copy_to_guest_phys(d->arch.e820[i].addr,
> > +                                        maddr_to_virt(d->arch.e820[i].addr),
> > +                                        end - d->arch.e820[i].addr);
> > +            set_current(saved_current);
> > +            if ( rc != HVMCOPY_okay )
> > +            {
> > +                printk("Unable to copy RAM region %#lx - %#lx\n",
> > +                       d->arch.e820[i].addr, end);
> > +                return -EFAULT;
> > +            }
> > +        }
> > +    }
> > +
> > +    printk("Memory allocation stats:\n");
> > +    for ( i = 0; i <= MAX_ORDER; i++ )
> > +    {
> > +        if ( hvm_mem_stats[MAX_ORDER - i] != 0 )
> > +            printk("Order %2u: %pZ\n", MAX_ORDER - i,
> > +                   _p(((uint64_t)1 << (MAX_ORDER - i + PAGE_SHIFT)) *
> > +                      hvm_mem_stats[MAX_ORDER - i]));
> > +    }
> > +
> > +    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
> > +    {
> > +        /*
> > +         * Since Dom0 cannot be migrated, we will only setup the
> > +         * unrestricted guest helpers if they are needed by the current
> > +         * hardware we are running on.
> > +         */
> > +        rc = hvm_setup_vmx_unrestricted_guest(d);
> 
> Calling a function of this name inside an if() checking for
> !vmx_unrestricted_guest() is, well, odd.

Yes, that's right. What about calling it hvm_setup_vmx_realmode_helpers?

> >  static int __init construct_dom0_hvm(struct domain *d, const module_t *image,
> >                                       unsigned long image_headroom,
> >                                       module_t *initrd,
> >                                       void *(*bootstrap_map)(const module_t *),
> >                                       char *cmdline)
> >  {
> > +    int rc;
> >  
> >      printk("** Building a PVH Dom0 **\n");
> >  
> > +    /* Sanity! */
> > +    BUG_ON(d->domain_id != 0);
> > +    BUG_ON(d->vcpu[0] == NULL);
> 
> May I suggest
> 
>     BUG_ON(d->domain_id);
>     BUG_ON(!d->vcpu[0]);
> 
> in cases like this?

Yes, I have the tendency to not use '!' or perform direct checks unless it's 
a boolean type.

> > +    process_pending_softirqs();
> 
> Why, outside of any loop?

It's the same that's done in construct_dom0_pv, so I thought that it was a 
good idea to drain any pending softirqs before starting domain build.

Roger.


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-10-04  9:12     ` Roger Pau Monne
@ 2016-10-04 11:16       ` Jan Beulich
  2016-10-11 14:01         ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-04 11:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 04.10.16 at 11:12, <roger.pau@citrix.com> wrote:
> On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > @@ -43,6 +44,11 @@ static long __initdata dom0_nrpages;
>> >  static long __initdata dom0_min_nrpages;
>> >  static long __initdata dom0_max_nrpages = LONG_MAX;
>> >  
>> > +/* Size of the VM86 TSS for virtual 8086 mode to use. */
>> > +#define HVM_VM86_TSS_SIZE   128
>> > +
>> > +static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
>> 
>> This is for your debugging only I suppose?
> 
> Probably, I wasn't sure if it was relevant so I left it here. Would it make 
> sense to only print this for debug builds of the hypervisor? Or better to 
> just remove it?

Statistics in debug builds often aren't really meaningful, so my order
of preference would be to remove it, then to keep it for all cases but
default it to be silent in non-debug builds (with a command line option
to enable).

>> > @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
>> >          avail -= dom0_paging_pages(d, nr_pages);
>> >      }
>> >  
>> > -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
>> > +    if ( is_pv_domain(d) &&
>> > +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
>> 
>> Perhaps better to simply force parms->p2m_base to UNSET_ADDR
>> earlier on?
> 
> p2m_base is already unconditionally set to UNSET_ADDR for PVHv2 domains, 
> hence the added is_pv_domain check in order to make sure that PVHv2 guests 
> don't get into that branch, which AFAICT is only relevant to PV guests.

This reads as contradictory: if it's set to UNSET_ADDR, why the extra
check?

>> > @@ -1657,15 +1679,238 @@ out:
>> >      return rc;
>> >  }
>> >  
>> > +/* Populate an HVM memory range using the biggest possible order. */
>> > +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
>> > +                                             uint64_t size)
>> > +{
>> > +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
>> > +    unsigned int order;
>> > +    struct page_info *page;
>> > +    int rc;
>> > +
>> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
>> > +
>> > +    order = MAX_ORDER;
>> > +    while ( size != 0 )
>> > +    {
>> > +        order = min(get_order_from_bytes_floor(size), order);
>> > +        page = alloc_domheap_pages(d, order, memflags);
>> > +        if ( page == NULL )
>> > +        {
>> > +            if ( order == 0 && memflags )
>> > +            {
>> > +                /* Try again without any memflags. */
>> > +                memflags = 0;
>> > +                order = MAX_ORDER;
>> > +                continue;
>> > +            }
>> > +            if ( order == 0 )
>> > +                panic("Unable to allocate memory with order 0!\n");
>> > +            order--;
>> > +            continue;
>> > +        }
>> 
>> Is it not possible to utilize alloc_chunk() here?
> 
> Not really, alloc_chunk will do a ceil calculation of the number of needed 
> pages, which means we end up allocating bigger chunks than needed and then 
> freeing them. I prefer this approach, since we always request the exact memory 
> that's required, so there's no need to free leftover pages.

Hmm, in that case at least some of the shared logic would be nice to
get abstracted out.

>> > +        if ( (size & 0xffffffff) == 0 )
>> > +            process_pending_softirqs();
>> 
>> That's 4Gb at a time - isn't that a little too much?
> 
> Hm, it's the same that's used in pvh_add_mem_mapping AFAICT. I could reduce 
> it to 0xfffffff, but I'm also wondering if it makes sense to just call it on 
> each iteration, since we are possibly mapping regions with big orders here.

Iteration count is all that matters here really; the size of the mapping
wouldn't normally matter (as long as it's one of the hardware supported sizes).
Doing the check on every iteration may be a little much (you may want
to check whether there's noticeable extra overhead), but doing the
check on like every 64 iterations may limit overhead enough to be
acceptable without making this more sophisticated.
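
E.g. (the counter is of course only illustrative):

    if ( !(++iter & 63) )
        process_pending_softirqs();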

>> > +static int __init hvm_setup_p2m(struct domain *d)
>> > +{
>> > +    struct vcpu *saved_current, *v = d->vcpu[0];
>> > +    unsigned long nr_pages;
>> > +    int i, rc, preempted;
>> > +
>> > +    printk("** Preparing memory map **\n");
>> 
>> Debugging leftover again?
> 
> Should this be merged with the next message, so that it reads as "Preparing 
> and populating the memory map"?

No, I'd rather see the other(s) gone too.

>> > +    /*
>> > +     * Subtract one page for the EPT identity page table and two pages
>> > +     * for the MADT replacement.
>> > +     */
>> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
>> 
>> How do you know the MADT replacement requires two pages? Isn't
>> that CPU-count dependent? And doesn't the partial page used for
>> the TSS also need accounting for here?
> 
> Yes, it's CPU-count dependent. This is just an approximation, since we only 
> support up to 256 CPUs on HVM guests, and each Processor Local APIC entry is 
> 
> 8 bytes, this means that the CPU-related data is going to use up to 2048 
> bytes of data, which still leaves plenty of space for the IO APIC and the 
> Interrupt Source Override entries. We request two pages in case the 
> original MADT crosses a page boundary. FWIW, I could also fetch the original 
> MADT size earlier and use that as the upper bound here for memory 
> reservation.

That wouldn't help in case someone wants more vCPU-s than there
are pCPU-s. And baking in another assumption of there being <=
128 vCPU-s when there's already work being done to eliminate that
limit is likely not too good an idea.

> In the RFC series we also spoke about placing the MADT in a different 
> position than the native one, which would mean that we would end up stealing 
> some space from a RAM region in order to place it, so that we wouldn't have 
> to do this accounting.

Putting the new MADT at the same address as the old one won't work
anyway, again because possibly vCPU-s > pCPU-s.

>> > +    if ( cpu_has_vmx && paging_mode_hap(d) && !vmx_unrestricted_guest(v) )
>> > +    {
>> > +        /*
>> > +         * Since Dom0 cannot be migrated, we will only setup the
>> > +         * unrestricted guest helpers if they are needed by the current
>> > +         * hardware we are running on.
>> > +         */
>> > +        rc = hvm_setup_vmx_unrestricted_guest(d);
>> 
>> Calling a function of this name inside an if() checking for
>> !vmx_unrestricted_guest() is, well, odd.
> 
> Yes, that's right. What about calling it hvm_setup_vmx_realmode_helpers?

Sounds quite a bit more reasonable.

>> > +    process_pending_softirqs();
>> 
>> Why, outside of any loop?
> 
> It's the same that's done in construct_dom0_pv, so I thought that it was a 
> good idea to drain any pending softirqs before starting domain build.

Perhaps in that case it should be pulled out of there into the
wrapper?
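
I.e. something like (using the signatures from this series):

    static int __init construct_dom0(struct domain *d, const module_t *image,
                                     unsigned long image_headroom,
                                     module_t *initrd,
                                     void *(*bootstrap_map)(const module_t *),
                                     char *cmdline)
    {
        /* Drain pending softirqs once, before either builder runs. */
        process_pending_softirqs();

        return is_hvm_domain(d)
               ? construct_dom0_hvm(d, image, image_headroom, initrd,
                                    bootstrap_map, cmdline)
               : construct_dom0_pv(d, image, image_headroom, initrd,
                                   bootstrap_map, cmdline);
    }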

Jan


* Re: [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions
  2016-10-03  8:05   ` Paul Durrant
@ 2016-10-06 11:33     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-06 11:33 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Andrew Cooper, Tim (Xen.org),
	George Dunlap, Jan Beulich, xen-devel, boris.ostrovsky

On Mon, Oct 03, 2016 at 09:05:43AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> > Roger Pau Monne
> > Sent: 27 September 2016 16:57
> > To: xen-devel@lists.xenproject.org
> > Cc: George Dunlap <George.Dunlap@citrix.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>; Tim (Xen.org) <tim@xen.org>; Jan Beulich
> > <jbeulich@suse.com>; boris.ostrovsky@oracle.com; Roger Pau Monne
> > <roger.pau@citrix.com>
> > Subject: [Xen-devel] [PATCH v2 03/30] xen/x86: fix parameters and return
> > value of *_set_allocation functions
> > 
> > Return should be an int, and the number of pages should be an unsigned
> > long.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: George Dunlap <george.dunlap@eu.citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: Tim Deegan <tim@xen.org>
> > ---
> >  xen/arch/x86/mm/hap/hap.c       |  6 +++---
> >  xen/arch/x86/mm/shadow/common.c |  7 +++----
> >  xen/include/asm-x86/domain.h    | 12 ++++++------
> >  3 files changed, 12 insertions(+), 13 deletions(-)
> > 
> > diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
> > index 3218fa2..b6d2c61 100644
> > --- a/xen/arch/x86/mm/hap/hap.c
> > +++ b/xen/arch/x86/mm/hap/hap.c
> > @@ -325,7 +325,7 @@ static void hap_free_p2m_page(struct domain *d, struct page_info *pg)
> >  static unsigned int
> >  hap_get_allocation(struct domain *d)
> >  {
> > -    unsigned int pg = d->arch.paging.hap.total_pages
> > +    unsigned long pg = d->arch.paging.hap.total_pages
> >          + d->arch.paging.hap.p2m_pages;
> 
> You are modifying this calculation (by including hap.p2m_pages as well as hap.total_pages) so the comment should probably mention this.

I think your mail reader is fooling you, the '+' was already there before my 
change (and in fact I haven't even touched that line).
 
> > 
> >      return ((pg >> (20 - PAGE_SHIFT))
> > @@ -334,8 +334,8 @@ hap_get_allocation(struct domain *d)
> > 
> >  /* Set the pool of pages to the required number of pages.
> >   * Returns 0 for success, non-zero for failure. */
> > -static unsigned int
> > -hap_set_allocation(struct domain *d, unsigned int pages, int *preempted)
> > +static int
> > +hap_set_allocation(struct domain *d, unsigned long pages, int *preempted)
> >  {
> >      struct page_info *pg;
> > 
> > diff --git a/xen/arch/x86/mm/shadow/common.c b/xen/arch/x86/mm/shadow/common.c
> > index 21607bf..d3cc2cc 100644
> > --- a/xen/arch/x86/mm/shadow/common.c
> > +++ b/xen/arch/x86/mm/shadow/common.c
> > @@ -1613,9 +1613,8 @@ shadow_free_p2m_page(struct domain *d, struct page_info *pg)
> >   * Input will be rounded up to at least shadow_min_acceptable_pages(),
> >   * plus space for the p2m table.
> >   * Returns 0 for success, non-zero for failure. */
> > -static unsigned int sh_set_allocation(struct domain *d,
> > -                                      unsigned int pages,
> > -                                      int *preempted)
> > +static int sh_set_allocation(struct domain *d, unsigned long pages,
> > +                             int *preempted)
> >  {
> >      struct page_info *sp;
> >      unsigned int lower_bound;
> > @@ -1692,7 +1691,7 @@ static unsigned int sh_set_allocation(struct domain *d,
> >  /* Return the size of the shadow pool, rounded up to the nearest MB */
> >  static unsigned int shadow_get_allocation(struct domain *d)
> >  {
> > -    unsigned int pg = d->arch.paging.shadow.total_pages
> > +    unsigned long pg = d->arch.paging.shadow.total_pages
> >          + d->arch.paging.shadow.p2m_pages;
> 
> Same here.
> 
> >      return ((pg >> (20 - PAGE_SHIFT))
> >              + ((pg & ((1 << (20 - PAGE_SHIFT)) - 1)) ? 1 : 0));
> 
> OMG. Is there no rounding macro you can use for this?

Hm, I don't think there's any, and the code was already there (I haven't 
added this), so I will just leave it as-is.
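
(For the record, the whole expression is just a ceiling division, so with
DIV_ROUND_UP from xen/lib.h it could be written as:

    return DIV_ROUND_UP(pg, MB(1) >> PAGE_SHIFT);

but as said, it's pre-existing code.)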

Thanks, Roger.


* Re: [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for hardware domain
  2016-10-03  9:02   ` Paul Durrant
@ 2016-10-06 14:31     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-06 14:31 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, boris.ostrovsky, Andrew Cooper, Jan Beulich

On Mon, Oct 03, 2016 at 10:02:27AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: 27 September 2016 16:57
> > To: xen-devel@lists.xenproject.org
> > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> > <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> > Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for
> > hardware domain
> > 
> > This is very similar to the PCI trap used for the traditional PV(H) Dom0.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Paul Durrant <paul.durrant@citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> >  xen/arch/x86/hvm/io.c         | 72 ++++++++++++++++++++++++++++++++++++++++++-
> >  xen/arch/x86/traps.c          | 39 -----------------------
> >  xen/drivers/passthrough/pci.c | 39 +++++++++++++++++++++++
> >  xen/include/xen/pci.h         |  2 ++
> >  4 files changed, 112 insertions(+), 40 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> > index 1e7a5f9..31d54dc 100644
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -247,12 +247,79 @@ static int dpci_portio_write(const struct hvm_io_handler *handler,
> >      return X86EMUL_OKAY;
> >  }
> > 
> > +static bool_t hw_dpci_portio_accept(const struct hvm_io_handler *handler,
> > +                                    const ioreq_t *p)
> > +{
> > +    if ( (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc)
> > +    {
> > +        return 1;
> > +    }
> > +
> > +    return 0;
> 
> Why not just return the value of the boolean expression?

Thanks, fixed.
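
I.e. it now simply does:

    return (p->addr == 0xcf8 && p->size == 4) || (p->addr & 0xfffc) == 0xcfc;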
 
> > +}
> > +
> > +static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
> > +                            uint64_t addr,
> > +                            uint32_t size,
> > +                            uint64_t *data)
> > +{
> > +    struct domain *currd = current->domain;
> > +
> > +    if ( addr == 0xcf8 )
> > +    {
> > +        ASSERT(size == 4);
> > +        *data = currd->arch.pci_cf8;
> > +        return X86EMUL_OKAY;
> > +    }
> > +
> > +    ASSERT((addr & 0xfffc) == 0xcfc);
> 
> You could do 'addr &= 3' and simplify the expressions below.

Fixed.
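
The read path now looks like this (sketch):

    addr &= 3;
    size = min(size, 4 - (uint32_t)addr);
    if ( size == 3 )
        size = 2;
    if ( pci_cfg_ok(currd, addr, size, NULL) )
        *data = pci_conf_read(currd->arch.pci_cf8, addr, size);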
 
> > +    size = min(size, 4 - ((uint32_t)addr & 3));
> > +    if ( size == 3 )
> > +        size = 2;
> > +    if ( pci_cfg_ok(currd, addr & 3, size, NULL) )
> 
> Is this the right place to do the check? Can this not be done in the accept op?

In the accept op we don't know whether the operation is a read or a write, and 
pci_cfg_ok seems to do more than a check: it calls pci_conf_write_intercept. 

> > +        *data = pci_conf_read(currd->arch.pci_cf8, addr & 3, size);
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> > +static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
> > +                                uint64_t addr,
> > +                                uint32_t size,
> > +                                uint64_t data)
> > +{
> > +    struct domain *currd = current->domain;
> > +    uint32_t data32;
> > +
> > +    if ( addr == 0xcf8 )
> > +    {
> > +            ASSERT(size == 4);
> > +            currd->arch.pci_cf8 = data;
> > +            return X86EMUL_OKAY;
> > +    }
> > +
> > +    ASSERT((addr & 0xfffc) == 0xcfc);
> 
> 'addr &= 3' again here.

Thanks.

Roger.


* Re: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-10-03  9:54   ` Paul Durrant
@ 2016-10-06 15:08     ` Roger Pau Monne
  2016-10-06 15:52       ` Lars Kurth
  2016-10-07  9:13       ` Jan Beulich
  0 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-06 15:08 UTC (permalink / raw)
  To: Paul Durrant
  Cc: lars.kurth, George.Dunlap, Andrew Cooper, Jan Beulich,
	ian.jackson, xen-devel, boris.ostrovsky

On Mon, Oct 03, 2016 at 10:54:54AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: 27 September 2016 16:57
> > To: xen-devel@lists.xenproject.org
> > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> > <roger.pau@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
> > Subject: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import
> > QEMU passthrough code
> > 
> > Most of this code has been picked up from QEMU and modified so it can be
> > plugged into the internal Xen IO handlers. The structure of the handlers has
> > been keep quite similar to QEMU, so existing handlers can be imported
> > without a lot of effort.
> > 
> 
> If you lifted code from QEMU then one assumes there is no problem with license, but do you need to amend copyrights for any of the files where you put the code?

License is GPL 2, same as Xen. For copyrights I have to admit I have no 
idea. The code is not imported as-is for obvious reasons, but the logic is 
mostly the same. I don't mind adding the copyright holders for all the code 
I've imported, they are:

Copyright (c) 2007, Neocleus Corporation.
Copyright (c) 2007, Intel Corporation.

With different authors depending on the file. Adding Lars, Ian and George 
since they have more experience with copyrights.

> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > Cc: Paul Durrant <paul.durrant@citrix.com>
> > ---
> >  docs/misc/xen-command-line.markdown |   8 +
> >  xen/arch/x86/hvm/hvm.c              |   2 +
> >  xen/arch/x86/hvm/io.c               | 621 ++++++++++++++++++++++++
> >  xen/include/asm-x86/hvm/domain.h    |   4 +
> >  xen/include/asm-x86/hvm/io.h        | 176 ++++++++++
> >  xen/include/xen/pci.h               |   5 +
> >  6 files changed, 816 insertions(+)
> > 
> > diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
> > index 59d7210..78130c8 100644
> > --- a/docs/misc/xen-command-line.markdown
> > +++ b/docs/misc/xen-command-line.markdown
> > @@ -670,6 +670,14 @@ Flag that makes a 64bit dom0 boot in PVH mode. No 32bit support at present.
> > 
> >  Flag that makes a dom0 boot in PVHv2 mode.
> > 
> > +### dom0permissive
> > +> `= <boolean>`
> > +
> > +> Default: `true`
> > +
> > +Select mode of PCI pass-through when using a PVHv2 Dom0, either permissive or
> > +not.
> > +
> >  ### dtuart (ARM)
> >  > `= path [:options]`
> > 
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index a291f82..bc4f7bc 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -632,6 +632,8 @@ int hvm_domain_initialise(struct domain *d)
> >              goto fail1;
> >          }
> >          memset(d->arch.hvm_domain.io_bitmap, ~0, HVM_IOBITMAP_SIZE);
> > +        INIT_LIST_HEAD(&d->arch.hvm_domain.pt_devices);
> > +        rwlock_init(&d->arch.hvm_domain.pt_lock);
> >      }
> >      else
> >          d->arch.hvm_domain.io_bitmap = hvm_io_bitmap;
> > diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> > index 31d54dc..7de1de3 100644
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -46,6 +46,10 @@
> >  #include <xen/iocap.h>
> >  #include <public/hvm/ioreq.h>
> > 
> > +/* Set permissive mode for HVM Dom0 PCI pass-through by default */
> > +static bool_t opt_dom0permissive = 1;
> > +boolean_param("dom0permissive", opt_dom0permissive);
> > +
> >  void send_timeoffset_req(unsigned long timeoff)
> >  {
> >      ioreq_t p = {
> > @@ -258,12 +262,403 @@ static bool_t hw_dpci_portio_accept(const struct
> > hvm_io_handler *handler,
> >      return 0;
> >  }
> > 
> > +static struct hvm_pt_device *hw_dpci_get_device(struct domain *d)
> > +{
> > +    uint8_t bus, slot, func;
> > +    uint32_t addr;
> > +    struct hvm_pt_device *dev;
> > +
> > +    /* Decode bus, slot and func. */
> > +    addr = CF8_BDF(d->arch.pci_cf8);
> > +    bus = PCI_BUS(addr);
> > +    slot = PCI_SLOT(addr);
> > +    func = PCI_FUNC(addr);
> > +
> > +    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
> > +    {
> > +        if ( dev->pdev->seg != 0 || dev->pdev->bus != bus ||
> > +             dev->pdev->devfn != PCI_DEVFN(slot,func) )
> > +            continue;
> > +
> > +        return dev;
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +/* Dispatchers */
> > +
> > +/* Find emulate register group entry */
> > +struct hvm_pt_reg_group *hvm_pt_find_reg_grp(struct hvm_pt_device *d,
> > +                                             uint32_t address)
> > +{
> > +    struct hvm_pt_reg_group *entry = NULL;
> > +
> > +    /* Find register group entry */
> > +    list_for_each_entry( entry, &d->register_groups, entries )
> > +    {
> > +        /* check address */
> > +        if ( (entry->base_offset <= address)
> > +             && ((entry->base_offset + entry->size) > address) )
> > +            return entry;
> > +    }
> > +
> > +    /* Group entry not found */
> > +    return NULL;
> > +}
> > +
> > +/* Find emulate register entry */
> > +struct hvm_pt_reg *hvm_pt_find_reg(struct hvm_pt_reg_group *reg_grp,
> > +                                   uint32_t address)
> > +{
> > +    struct hvm_pt_reg *reg_entry = NULL;
> > +    struct hvm_pt_reg_handler *handler = NULL;
> > +    uint32_t real_offset = 0;
> > +
> > +    /* Find register entry */
> > +    list_for_each_entry( reg_entry, &reg_grp->registers, entries )
> > +    {
> > +        handler = reg_entry->handler;
> > +        real_offset = reg_grp->base_offset + handler->offset;
> > +        /* Check address */
> > +        if ( (real_offset <= address)
> > +             && ((real_offset + handler->size) > address) )
> > +            return reg_entry;
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static int hvm_pt_pci_config_access_check(struct hvm_pt_device *d,
> > +                                          uint32_t addr, int len)
> > +{
> > +    /* Check offset range */
> > +    if ( addr >= 0xFF )
> > +    {
> > +        printk_pdev(d->pdev, XENLOG_DEBUG,
> > +            "failed to access register with offset exceeding 0xFF. "
> > +            "(addr: 0x%02x, len: %d)\n", addr, len);
> > +        return -EDOM;
> > +    }
> > +
> > +    /* Check read size */
> > +    if ( (len != 1) && (len != 2) && (len != 4) )
> > +    {
> > +        printk_pdev(d->pdev, XENLOG_DEBUG,
> > +            "failed to access register with invalid access length. "
> > +            "(addr: 0x%02x, len: %d)\n", addr, len);
> > +        return -EINVAL;
> > +    }
> > +
> > +    /* Check offset alignment */
> > +    if ( addr & (len - 1) )
> > +    {
> > +        printk_pdev(d->pdev, XENLOG_DEBUG,
> > +            "failed to access register with invalid access size "
> > +            "alignment. (addr: 0x%02x, len: %d)\n", addr, len);
> > +        return -EINVAL;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
> > +                                  uint32_t *data, int len)
> > +{
> > +    uint32_t val = 0;
> > +    struct hvm_pt_reg_group *reg_grp_entry = NULL;
> > +    struct hvm_pt_reg *reg_entry = NULL;
> > +    int rc = 0;
> > +    int emul_len = 0;
> > +    uint32_t find_addr = addr;
> > +    unsigned int seg = d->pdev->seg;
> > +    unsigned int bus = d->pdev->bus;
> > +    unsigned int slot = PCI_SLOT(d->pdev->devfn);
> > +    unsigned int func = PCI_FUNC(d->pdev->devfn);
> > +
> > +    /* Sanity checks. */
> > +    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> > +        return X86EMUL_UNHANDLEABLE;
> > +
> > +    /* Find register group entry. */
> > +    reg_grp_entry = hvm_pt_find_reg_grp(d, addr);
> > +    if ( reg_grp_entry == NULL )
> > +        return X86EMUL_UNHANDLEABLE;
> > +
> > +    /* Read I/O device register value. */
> > +    switch( len )
> > +    {
> > +    case 1:
> > +        val = pci_conf_read8(seg, bus, slot, func, addr);
> > +        break;
> > +    case 2:
> > +        val = pci_conf_read16(seg, bus, slot, func, addr);
> > +        break;
> > +    case 4:
> > +        val = pci_conf_read32(seg, bus, slot, func, addr);
> > +        break;
> > +    default:
> > +        BUG();
> > +    }
> > +
> > +    /* Adjust the read value to appropriate CFC-CFF window. */
> > +    val <<= (addr & 3) << 3;
> > +    emul_len = len;
> > +
> > +    /* Loop around the guest requested size. */
> > +    while ( emul_len > 0 )
> > +    {
> > +        /* Find register entry to be emulated. */
> > +        reg_entry = hvm_pt_find_reg(reg_grp_entry, find_addr);
> > +        if ( reg_entry )
> > +        {
> > +            struct hvm_pt_reg_handler *handler = reg_entry->handler;
> > +            uint32_t real_offset = reg_grp_entry->base_offset + handler->offset;
> > +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> 
> Figuring out whether this makes sense makes my brain hurt. Any chance of some macro or at least comments about this?

Right. What about:

/* Create a bitmask from a given size (in bytes). */
#define HVM_PT_SIZE_TO_MASK(size) (0xFFFFFFFF >> ((4 - (size)) << 3))
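
which would give e.g. HVM_PT_SIZE_TO_MASK(1) == 0xFF and
HVM_PT_SIZE_TO_MASK(2) == 0xFFFF, so that the line above becomes:

uint32_t valid_mask = HVM_PT_SIZE_TO_MASK(emul_len);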

> > +            uint8_t *ptr_val = NULL;
> > +
> > +            valid_mask <<= (find_addr - real_offset) << 3;
> > +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> > +
> > +            /* Do emulation based on register size. */
> > +            switch ( handler->size )
> > +            {
> > +            case 1:
> > +                if ( handler->u.b.read )
> > +                    rc = handler->u.b.read(d, reg_entry, ptr_val, valid_mask);
> > +                break;
> > +            case 2:
> > +                if ( handler->u.w.read )
> > +                    rc = handler->u.w.read(d, reg_entry, (uint16_t *)ptr_val,
> > +                                           valid_mask);
> > +                break;
> > +            case 4:
> > +                if ( handler->u.dw.read )
> > +                    rc = handler->u.dw.read(d, reg_entry, (uint32_t *)ptr_val,
> > +                                            valid_mask);
> > +                break;
> > +            }
> > +
> > +            if ( rc < 0 )
> > +            {
> > +                gdprintk(XENLOG_WARNING,
> > +                         "Invalid read emulation, shutting down domain\n");
> > +                domain_crash(current->domain);
> > +                return X86EMUL_UNHANDLEABLE;
> > +            }
> > +
> > +            /* Calculate next address to find. */
> > +            emul_len -= handler->size;
> > +            if ( emul_len > 0 )
> > +                find_addr = real_offset + handler->size;
> > +        }
> > +        else
> > +        {
> > +            /* Nothing to do with passthrough type register */
> > +            emul_len--;
> > +            find_addr++;
> > +        }
> > +    }
> > +
> > +    /* Need to shift back before returning them to pci bus emulator */
> > +    val >>= ((addr & 3) << 3);
> > +    *data = val;
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> > +static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
> > +                                    uint32_t val, int len)
> > +{
> > +    int index = 0;
> > +    struct hvm_pt_reg_group *reg_grp_entry = NULL;
> > +    int rc = 0;
> > +    uint32_t read_val = 0, wb_mask;
> > +    int emul_len = 0;
> > +    struct hvm_pt_reg *reg_entry = NULL;
> > +    uint32_t find_addr = addr;
> > +    struct hvm_pt_reg_handler *handler = NULL;
> > +    bool wp_flag = false;
> > +    unsigned int seg = d->pdev->seg;
> > +    unsigned int bus = d->pdev->bus;
> > +    unsigned int slot = PCI_SLOT(d->pdev->devfn);
> > +    unsigned int func = PCI_FUNC(d->pdev->devfn);
> > +
> > +    /* Sanity checks. */
> > +    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> > +        return X86EMUL_UNHANDLEABLE;
> > +
> > +    /* Find register group entry. */
> > +    reg_grp_entry = hvm_pt_find_reg_grp(d, addr);
> > +    if ( reg_grp_entry == NULL )
> > +        return X86EMUL_UNHANDLEABLE;
> > +
> > +    /* Read I/O device register value. */
> > +    switch( len )
> > +    {
> > +    case 1:
> > +        read_val = pci_conf_read8(seg, bus, slot, func, addr);
> > +        break;
> > +    case 2:
> > +        read_val = pci_conf_read16(seg, bus, slot, func, addr);
> > +        break;
> > +    case 4:
> > +        read_val = pci_conf_read32(seg, bus, slot, func, addr);
> > +        break;
> > +    default:
> > +        BUG();
> > +    }
> > +    wb_mask = 0xFFFFFFFF >> ((4 - len) << 3);
> > +
> > +    /* Adjust the read and write value to appropriate CFC-CFF window */
> > +    read_val <<= (addr & 3) << 3;
> > +    val <<= (addr & 3) << 3;
> > +    emul_len = len;
> > +
> > +    /* Loop around the guest requested size */
> > +    while ( emul_len > 0 )
> > +    {
> > +        /* Find register entry to be emulated */
> > +        reg_entry = hvm_pt_find_reg(reg_grp_entry, find_addr);
> > +        if ( reg_entry )
> > +        {
> > +            handler = reg_entry->handler;
> > +            uint32_t real_offset = reg_grp_entry->base_offset + handler->offset;
> > +            uint32_t valid_mask = 0xFFFFFFFF >> ((4 - emul_len) << 3);
> > +            uint8_t *ptr_val = NULL;
> > +            uint32_t wp_mask = handler->emu_mask | handler->ro_mask;
> > +
> > +            valid_mask <<= (find_addr - real_offset) << 3;
> > +            ptr_val = (uint8_t *)&val + (real_offset & 3);
> > +            if ( !d->permissive )
> > +                wp_mask |= handler->res_mask;
> > +            if ( wp_mask == (0xFFFFFFFF >> ((4 - handler->size) << 3)) )
> > +                wb_mask &= ~((wp_mask >> ((find_addr - real_offset) << 3))
> > +                             << ((len - emul_len) << 3));
> > +
> > +            /* Do emulation based on register size */
> > +            switch ( handler->size )
> > +            {
> > +            case 1:
> > +                if ( handler->u.b.write )
> > +                    rc = handler->u.b.write(d, reg_entry, ptr_val,
> > +                                            read_val >> ((real_offset & 3) << 3),
> > +                                            valid_mask);
> > +                break;
> > +            case 2:
> > +                if ( handler->u.w.write )
> > +                    rc = handler->u.w.write(d, reg_entry, (uint16_t *)ptr_val,
> > +                                            (read_val >> ((real_offset & 3) << 3)),
> > +                                            valid_mask);
> > +                break;
> > +            case 4:
> > +                if ( handler->u.dw.write )
> > +                    rc = handler->u.dw.write(d, reg_entry, (uint32_t *)ptr_val,
> > +                                             (read_val >> ((real_offset & 3) << 3)),
> > +                                             valid_mask);
> > +                break;
> > +            }
> > +
> > +            if ( rc < 0 )
> > +            {
> > +                gdprintk(XENLOG_WARNING,
> > +                         "Invalid write emulation, shutting down domain\n");
> > +                domain_crash(current->domain);
> > +                return X86EMUL_UNHANDLEABLE;
> > +            }
> > +
> > +            /* Calculate next address to find */
> > +            emul_len -= handler->size;
> > +            if ( emul_len > 0 )
> > +                find_addr = real_offset + handler->size;
> > +        }
> > +        else
> > +        {
> > +            /* Nothing to do with passthrough type register */
> > +            if ( !d->permissive )
> > +            {
> > +                wb_mask &= ~(0xff << ((len - emul_len) << 3));
> > +                /*
> > +                 * Unused BARs will make it here, but we don't want to issue
> > +                 * warnings for writes to them (bogus writes get dealt with
> > +                 * above).
> > +                 */
> > +                if ( index < 0 )
> > +                    wp_flag = true;
> > +            }
> > +            emul_len--;
> > +            find_addr++;
> > +        }
> > +    }
> > +
> > +    /* Need to shift back before passing them to xen_host_pci_set_block */
> > +    val >>= (addr & 3) << 3;
> > +
> > +    if ( wp_flag && !d->permissive_warned )
> > +    {
> > +        d->permissive_warned = true;
> > +        gdprintk(XENLOG_WARNING,
> > +          "Write-back to unknown field 0x%02x (partially) inhibited (0x%0*x)\n",
> > +          addr, len * 2, wb_mask);
> > +        gdprintk(XENLOG_WARNING,
> > +          "If the device doesn't work, try enabling permissive mode\n");
> > +        gdprintk(XENLOG_WARNING,
> > +          "(unsafe) and if it helps report the problem to xen-devel\n");
> > +    }
> > +    for ( index = 0; wb_mask; index += len )
> > +    {
> > +        /* Unknown regs are passed through */
> > +        while ( !(wb_mask & 0xff) )
> > +        {
> > +            index++;
> > +            wb_mask >>= 8;
> > +        }
> > +        len = 0;
> > +        do {
> > +            len++;
> > +            wb_mask >>= 8;
> > +        } while ( wb_mask & 0xff );
> > +
> > +        switch( len )
> > +        {
> > +        case 1:
> > +        {
> > +            uint8_t value;
> > +            memcpy(&value, (uint8_t *)&val + index, 1);
> > +            pci_conf_write8(seg, bus, slot, func, addr + index, value);
> > +            break;
> > +        }
> > +        case 2:
> > +        {
> > +            uint16_t value;
> > +            memcpy(&value, (uint8_t *)&val + index, 2);
> > +            pci_conf_write16(seg, bus, slot, func, addr + index, value);
> > +            break;
> > +        }
> > +        case 4:
> > +        {
> > +            uint32_t value;
> > +            memcpy(&value, (uint8_t *)&val + index, 4);
> > +            pci_conf_write32(seg, bus, slot, func, addr + index, value);
> > +            break;
> > +        }
> > +        default:
> > +            BUG();
> > +        }
> > +    }
> > +    return X86EMUL_OKAY;
> > +}
> > +
> >  static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
> >                              uint64_t addr,
> >                              uint32_t size,
> >                              uint64_t *data)
> >  {
> >      struct domain *currd = current->domain;
> > +    struct hvm_pt_device *dev;
> > +    uint32_t data32;
> > +    uint8_t reg;
> > +    int rc;
> > 
> >      if ( addr == 0xcf8 )
> >      {
> > @@ -276,6 +671,22 @@ static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
> >      size = min(size, 4 - ((uint32_t)addr & 3));
> >      if ( size == 3 )
> >          size = 2;
> > +
> > +    read_lock(&currd->arch.hvm_domain.pt_lock);
> > +    dev = hw_dpci_get_device(currd);
> > +    if ( dev != NULL )
> > +    {
> > +        reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> > +        rc = hvm_pt_pci_read_config(dev, reg, &data32, size);
> > +        if ( rc == X86EMUL_OKAY )
> > +        {
> > +            read_unlock(&currd->arch.hvm_domain.pt_lock);
> > +            *data = data32;
> > +            return rc;
> > +        }
> > +    }
> > +    read_unlock(&currd->arch.hvm_domain.pt_lock);
> > +
> >      if ( pci_cfg_ok(currd, addr & 3, size, NULL) )
> >          *data = pci_conf_read(currd->arch.pci_cf8, addr & 3, size);
> > 
> > @@ -288,7 +699,10 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
> >                                  uint64_t data)
> >  {
> >      struct domain *currd = current->domain;
> > +    struct hvm_pt_device *dev;
> >      uint32_t data32;
> > +    uint8_t reg;
> > +    int rc;
> > 
> >      if ( addr == 0xcf8 )
> >      {
> > @@ -302,12 +716,219 @@ static int hw_dpci_portio_write(const struct hvm_io_handler *handler,
> >      if ( size == 3 )
> >          size = 2;
> >      data32 = data;
> > +
> > +    read_lock(&currd->arch.hvm_domain.pt_lock);
> > +    dev = hw_dpci_get_device(currd);
> > +    if ( dev != NULL )
> > +    {
> > +        reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> > +        rc = hvm_pt_pci_write_config(dev, reg, data32, size);
> > +        if ( rc == X86EMUL_OKAY )
> > +        {
> > +            read_unlock(&currd->arch.hvm_domain.pt_lock);
> > +            return rc;
> > +        }
> > +    }
> > +    read_unlock(&currd->arch.hvm_domain.pt_lock);
> > +
> 
> I must be missing something here. Why are you adding passthrough code to the hardware domain's handlers? Surely it sees all devices anyway?

Yes, but it must not access some of the registers directly; for example, Dom0 
cannot configure the MSI registers itself, or else Xen would start receiving 
interrupts from vectors it never set up.

All this is done so that Xen can detect accesses to sensitive registers and 
perform appropriate actions. For example, following the MSI case, Xen will 
detect these accesses and set up and bind the proper PIRQs for the guest.
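
To give a rough idea (purely illustrative, the helper name is made up), a
handler for the MSI control word plugged into this infrastructure could look
like:

/* Sketch only: trap writes to the MSI control word. */
static int msi_ctrl_reg_write(struct hvm_pt_device *s,
                              struct hvm_pt_reg *reg, uint16_t *val,
                              uint16_t dev_value, uint16_t valid_mask)
{
    /* Keep the guest's view in the emulated register. */
    reg->val.word = *val;

    /* Don't let the enable bit reach hardware directly; instead have
     * Xen set up and bind a PIRQ (msi_setup_and_bind_pirq is a made-up
     * placeholder for that logic). */
    if ( *val & PCI_MSI_FLAGS_ENABLE )
        return msi_setup_and_bind_pirq(s);

    return 0;
}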

> >      if ( pci_cfg_ok(currd, addr & 3, size, &data32) )
> >          pci_conf_write(currd->arch.pci_cf8, addr & 3, size, data);
> > 
> >      return X86EMUL_OKAY;
> >  }
> > 
> > +static void hvm_pt_free_device(struct hvm_pt_device *dev)
> > +{
> > +    struct hvm_pt_reg_group *group, *g;
> > +
> > +    list_for_each_entry_safe( group, g, &dev->register_groups, entries )
> > +    {
> > +        struct hvm_pt_reg *reg, *r;
> > +
> > +        list_for_each_entry_safe( reg, r, &group->registers, entries )
> > +        {
> > +            list_del(&reg->entries);
> > +            xfree(reg);
> > +        }
> > +
> > +        list_del(&group->entries);
> > +        xfree(group);
> > +    }
> > +
> > +    xfree(dev);
> > +}
> > +
> > +static int hvm_pt_add_register(struct hvm_pt_device *dev,
> > +                               struct hvm_pt_reg_group *group,
> > +                               struct hvm_pt_reg_handler *handler)
> > +{
> > +    struct pci_dev *pdev = dev->pdev;
> > +    struct hvm_pt_reg *reg;
> > +
> > +    reg = xmalloc(struct hvm_pt_reg);
> > +    if ( reg == NULL )
> > +        return -ENOMEM;
> > +
> > +    memset(reg, 0, sizeof(*reg));
> 
> xzalloc()?

Thanks.

> > +    reg->handler = handler;
> > +    if ( handler->init != NULL )
> > +    {
> > +        uint32_t host_mask, size_mask, data = 0;
> > +        uint8_t seg, bus, slot, func;
> > +        unsigned int offset;
> > +        uint32_t val;
> > +        int rc;
> > +
> > +        /* Initialize emulate register */
> > +        rc = handler->init(dev, reg->handler,
> > +                           group->base_offset + reg->handler->offset, &data);
> > +        if ( rc < 0 )
> > +            return rc;
> > +
> > +        if ( data == HVM_PT_INVALID_REG )
> > +        {
> > +            xfree(reg);
> > +            return 0;
> > +        }
> > +
> > +        /* Sync up the data to val */
> > +        offset = group->base_offset + reg->handler->offset;
> > +        size_mask = 0xFFFFFFFF >> ((4 - reg->handler->size) << 3);
> > +
> > +        seg = pdev->seg;
> > +        bus = pdev->bus;
> > +        slot = PCI_SLOT(pdev->devfn);
> > +        func = PCI_FUNC(pdev->devfn);
> > +
> > +        switch ( reg->handler->size )
> > +        {
> > +        case 1:
> > +            val = pci_conf_read8(seg, bus, slot, func, offset);
> > +            break;
> > +        case 2:
> > +            val = pci_conf_read16(seg, bus, slot, func, offset);
> > +            break;
> > +        case 4:
> > +            val = pci_conf_read32(seg, bus, slot, func, offset);
> > +            break;
> > +        default:
> > +            BUG();
> > +        }
> > +
> > +        /*
> > +         * Set bits in emu_mask are the ones we emulate. The reg shall
> > +         * contain the emulated view of the guest - therefore we flip
> > +         * the mask to mask out the host values (which reg initially
> > +         * has).
> > +         */
> > +        host_mask = size_mask & ~reg->handler->emu_mask;
> > +
> > +        if ( (data & host_mask) != (val & host_mask) )
> > +        {
> > +            uint32_t new_val;
> > +
> > +            /* Mask out host (including past size). */
> > +            new_val = val & host_mask;
> > +            /* Merge emulated ones (excluding the non-emulated ones). */
> > +            new_val |= data & host_mask;
> > +            /*
> > +             * Leave intact host and emulated values past the size -
> > +             * even though we do not care as we write per reg->size
> > +             * granularity, but for the logging below lets have the
> > +             * proper value.
> > +             */
> > +            new_val |= ((val | data)) & ~size_mask;
> > +            printk_pdev(pdev, XENLOG_ERR,
> > +"offset 0x%04x mismatch! Emulated=0x%04x, host=0x%04x, syncing to
> > 0x%04x.\n",
> > +                        offset, data, val, new_val);
> > +            val = new_val;
> > +        }
> > +        else
> > +            val = data;
> > +
> > +        if ( val & ~size_mask )
> > +        {
> > +            printk_pdev(pdev, XENLOG_ERR,
> > +                    "Offset 0x%04x:0x%04x expands past register size(%d)!\n",
> > +                        offset, val, reg->handler->size);
> > +            return -EINVAL;
> > +        }
> > +
> > +        reg->val.dword = val;
> > +    }
> > +    list_add_tail(&reg->entries, &group->registers);
> > +
> > +    return 0;
> > +}
> > +
> > +static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
> > +};
> > +
> > +int hwdom_add_device(struct pci_dev *pdev)
> > +{
> > +    struct domain *d = pdev->domain;
> > +    struct hvm_pt_device *dev;
> > +    int j, i, rc;
> > +
> > +    ASSERT( is_hardware_domain(d) );
> > +    ASSERT( pcidevs_locked() );
> > +
> > +    dev = xmalloc(struct hvm_pt_device);
> > +    if ( dev == NULL )
> > +        return -ENOMEM;
> > +
> > +    memset(dev, 0 , sizeof(*dev));
> 
> xzalloc()?

Fixed.
 
> > +
> > +    dev->pdev = pdev;
> > +    INIT_LIST_HEAD(&dev->register_groups);
> > +
> > +    dev->permissive = opt_dom0permissive;
> > +
> > +    for ( j = 0; j < ARRAY_SIZE(hwdom_pt_handlers); j++ )
> > +    {
> > +        struct hvm_pt_handler_init *handler_init = hwdom_pt_handlers[j];
> > +        struct hvm_pt_reg_group *group;
> > +
> > +        group = xmalloc(struct hvm_pt_reg_group);
> > +        if ( group == NULL )
> > +        {
> > +            xfree(dev);
> > +            return -ENOMEM;
> > +        }
> > +        INIT_LIST_HEAD(&group->registers);
> > +
> > +        rc = handler_init->init(dev, group);
> > +        if ( rc == 0 )
> > +        {
> > +            for ( i = 0; handler_init->handlers[i].size != 0; i++ )
> > +            {
> > +                int rc;
> > +
> > +                rc = hvm_pt_add_register(dev, group,
> > +                                         &handler_init->handlers[i]);
> > +                if ( rc )
> > +                {
> > +                    printk_pdev(pdev, XENLOG_ERR, "error adding register: %d\n",
> > +                                rc);
> > +                    hvm_pt_free_device(dev);
> > +                    return rc;
> > +                }
> > +            }
> > +
> > +            list_add_tail(&group->entries, &dev->register_groups);
> > +        }
> > +        else
> > +            xfree(group);
> > +    }
> > +
> > +    write_lock(&d->arch.hvm_domain.pt_lock);
> > +    list_add_tail(&dev->entries, &d->arch.hvm_domain.pt_devices);
> > +    write_unlock(&d->arch.hvm_domain.pt_lock);
> > +    printk_pdev(pdev, XENLOG_DEBUG, "added for pass-through\n");
> > +
> > +    return 0;
> > +}
> > +
> >  static const struct hvm_io_ops dpci_portio_ops = {
> >      .accept = dpci_portio_accept,
> >      .read = dpci_portio_read,
> > diff --git a/xen/include/asm-x86/hvm/domain.h b/xen/include/asm-x86/hvm/domain.h
> > index f34d784..1b1a52f 100644
> > --- a/xen/include/asm-x86/hvm/domain.h
> > +++ b/xen/include/asm-x86/hvm/domain.h
> > @@ -152,6 +152,10 @@ struct hvm_domain {
> >          struct vmx_domain vmx;
> >          struct svm_domain svm;
> >      };
> > +
> > +    /* List of passed-through devices (hw domain only). */
> > +    struct list_head pt_devices;
> > +    rwlock_t pt_lock;
> >  };
> > 
> >  #define hap_enabled(d)  ((d)->arch.hvm_domain.hap_enabled)
> > diff --git a/xen/include/asm-x86/hvm/io.h b/xen/include/asm-x86/hvm/io.h
> > index e9b3f83..80f830d 100644
> > --- a/xen/include/asm-x86/hvm/io.h
> > +++ b/xen/include/asm-x86/hvm/io.h
> > @@ -153,6 +153,182 @@ extern void hvm_dpci_msi_eoi(struct domain *d, int vector);
> > 
> >  void register_dpci_portio_handler(struct domain *d);
> > 
> > +/* Structures for pci-passthrough state and handlers. */
> > +struct hvm_pt_device;
> > +struct hvm_pt_reg_handler;
> > +struct hvm_pt_reg;
> > +struct hvm_pt_reg_group;
> > +
> > +/* Return code when register should be ignored. */
> > +#define HVM_PT_INVALID_REG 0xFFFFFFFF
> > +
> > +/* function type for config reg */
> > +typedef int (*hvm_pt_conf_reg_init)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg_handler *, uint32_t real_offset,
> > +     uint32_t *data);
> > +
> > +typedef int (*hvm_pt_conf_dword_write)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> > +     uint32_t *val, uint32_t dev_value, uint32_t valid_mask);
> > +typedef int (*hvm_pt_conf_word_write)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> > +     uint16_t *val, uint16_t dev_value, uint16_t valid_mask);
> > +typedef int (*hvm_pt_conf_byte_write)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> > +     uint8_t *val, uint8_t dev_value, uint8_t valid_mask);
> > +typedef int (*hvm_pt_conf_dword_read)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> > +     uint32_t *val, uint32_t valid_mask);
> > +typedef int (*hvm_pt_conf_word_read)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> > +     uint16_t *val, uint16_t valid_mask);
> > +typedef int (*hvm_pt_conf_byte_read)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg *cfg_entry,
> > +     uint8_t *val, uint8_t valid_mask);
> > +
> > +typedef int (*hvm_pt_group_init)
> > +    (struct hvm_pt_device *, struct hvm_pt_reg_group *);
> > +
> > +/*
> > + * Emulated register information.
> > + *
> > + * This should be shared between all the consumers that trap on accesses
> > + * to certain PCI registers.
> > + */
> > +struct hvm_pt_reg_handler {
> > +    uint32_t offset;
> > +    uint32_t size;
> > +    uint32_t init_val;
> > +    /* reg reserved field mask (ON:reserved, OFF:defined) */
> > +    uint32_t res_mask;
> > +    /* reg read only field mask (ON:RO/ROS, OFF:other) */
> > +    uint32_t ro_mask;
> > +    /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
> > +    uint32_t rw1c_mask;
> > +    /* reg emulate field mask (ON:emu, OFF:passthrough) */
> > +    uint32_t emu_mask;
> > +    hvm_pt_conf_reg_init init;
> > +    /* read/write function pointer
> > +     * for double_word/word/byte size */
> > +    union {
> > +        struct {
> > +            hvm_pt_conf_dword_write write;
> > +            hvm_pt_conf_dword_read read;
> > +        } dw;
> > +        struct {
> > +            hvm_pt_conf_word_write write;
> > +            hvm_pt_conf_word_read read;
> > +        } w;
> > +        struct {
> > +            hvm_pt_conf_byte_write write;
> > +            hvm_pt_conf_byte_read read;
> > +        } b;
> > +    } u;
> > +};
> > +
> > +struct hvm_pt_handler_init {
> > +    struct hvm_pt_reg_handler *handlers;
> > +    hvm_pt_group_init init;
> > +};
> > +
> > +/*
> > + * Emulated register value.
> > + *
> > + * This is the representation of each specific emulated register.
> > + */
> > +struct hvm_pt_reg {
> > +    struct list_head entries;
> > +    struct hvm_pt_reg_handler *handler;
> > +    union {
> > +        uint8_t   byte;
> > +        uint16_t  word;
> > +        uint32_t  dword;
> > +    } val;
> > +};
> > +
> > +/*
> > + * Emulated register group.
> > + *
> > + * In order to speed up (and logically group) emulated registers search,
> > + * groups are used that represent specific emulated features, like MSI.
> > + */
> > +struct hvm_pt_reg_group {
> > +    struct list_head entries;
> > +    uint32_t base_offset;
> > +    uint8_t size;
> > +    struct list_head registers;
> > +};
> > +
> > +/*
> > + * Guest MSI information.
> > + *
> > + * MSI values set by the guest.
> > + */
> > +struct hvm_pt_msi {
> > +    uint16_t flags;
> > +    uint32_t addr_lo;  /* guest message address */
> > +    uint32_t addr_hi;  /* guest message upper address */
> > +    uint16_t data;     /* guest message data */
> > +    uint32_t ctrl_offset; /* saved control offset */
> > +    int pirq;          /* guest pirq corresponding */
> > +    bool_t initialized;  /* when guest MSI is initialized */
> > +    bool_t mapped;       /* when pirq is mapped */
> > +};
> > +
> > +/*
> > + * Guest passed-through PCI device.
> > + */
> > +struct hvm_pt_device {
> > +    struct list_head entries;
> > +
> > +    struct pci_dev *pdev;
> > +
> > +    bool_t permissive;
> > +    bool_t permissive_warned;
> > +
> > +    /* MSI status. */
> > +    struct hvm_pt_msi msi;
> > +
> > +    struct list_head register_groups;
> > +};
> > +
> > +/*
> > + * The hierarchy of the above structures is the following:
> > + *
> > + * +---------------+         +---------------+
> > + * |               | entries |               | ...
> > + * | hvm_pt_device +---------+ hvm_pt_device +----+
> > + * |               |         |               |
> > + * +-+-------------+         +---------------+
> > + *   |
> > + *   | register_groups
> > + *   |
> > + * +-v----------------+          +------------------+
> > + * |                  | entries  |                  | ...
> > + * | hvm_pt_reg_group +----------+ hvm_pt_reg_group +----+
> > + * |                  |          |                  |
> > + * +-+----------------+          +------------------+
> > + *   |
> > + *   | registers
> > + *   |
> > + * +-v----------+            +------------+
> > + * |            | entries    |            | ...
> > + * | hvm_pt_reg +------------+ hvm_pt_reg +----+
> > + * |            |            |            |
> > + * +-+----------+            +-+----------+
> > + *   |                         |
> > + *   | handler                 | handler
> > + *   |                         |
> > + * +-v------------------+    +-v------------------+
> > + * |                    |    |                    |
> > + * | hvm_pt_reg_handler |    | hvm_pt_reg_handler |
> > + * |                    |    |                    |
> > + * +--------------------+    +--------------------+
> > + */
> > +
> > +/* Helper to add passed-through devices to the hardware domain. */
> > +int hwdom_add_device(struct pci_dev *pdev);
> > +
> >  #endif /* __ASM_X86_HVM_IO_H__ */
> > 
> > 
> > diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
> > index f191773..b21a891 100644
> > --- a/xen/include/xen/pci.h
> > +++ b/xen/include/xen/pci.h
> > @@ -90,6 +90,11 @@ struct pci_dev {
> >      u64 vf_rlen[6];
> >  };
> > 
> > +/* Helper for printing pci_dev related messages. */
> > +#define printk_pdev(pdev, lvl, fmt, ...)                                  \
> > +    printk(lvl "PCI %04x:%02x:%02x.%u: " fmt, pdev->seg, pdev->bus,      \
> > +           PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), ##__VA_ARGS__)
> > +
> >  #define for_each_pdev(domain, pdev) \
> >      list_for_each_entry(pdev, &(domain->arch.pdev_list), domain_list)
> > 
> > --
> > 2.7.4 (Apple Git-66)
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2
  2016-09-27 15:57 ` [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
@ 2016-10-06 15:14   ` Jan Beulich
  2016-10-11 15:02     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-06 15:14 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> +    start_info.modlist_paddr = last_addr;
> +    start_info.nr_modules = 1;
> +    last_addr += sizeof(mod);

How can this be unconditionally 1? It is certainly possible to boot
without initrd.
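
I'd expect something along these lines (sketch only, assuming the function
has the initrd module at hand in a local variable):

start_info.nr_modules = initrd ? 1 : 0;
if ( initrd )
{
    start_info.modlist_paddr = last_addr;
    last_addr += sizeof(mod);
}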

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs
  2016-09-27 15:57 ` [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs Roger Pau Monne
@ 2016-10-06 15:20   ` Jan Beulich
  2016-10-12 11:06     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-06 15:20 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> The logic used to set up the CPUID leaves is extremely simplistic (and
> probably wrong for hardware different from mine). I'm not sure what the
> best way to deal with this is; the code that currently sets the CPUID leaves
> for HVM guests lives in libxc, so maybe moving it to xen/common would be better?

Yeah, a pre-populated array of leaves certainly won't do.

> +    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
> +    if ( rc )
> +    {
> +        printk("Unable to setup Dom0 BSP context: %d\n", rc);
> +        return rc;
> +    }
> +    clear_bit(_VPF_down, &v->pause_flags);

Even if it may not matter right away, I think you want to clear this
flag later, after having completed all setup.

> +    for ( i = 0; i < ARRAY_SIZE(cpuid_leaves); i++ )
> +    {
> +        d->arch.cpuids[i].input[0] = cpuid_leaves[i].index;
> +        d->arch.cpuids[i].input[1] = cpuid_leaves[i].count;
> +        if ( d->arch.cpuids[i].input[1] == XEN_CPUID_INPUT_UNUSED )
> +            cpuid(d->arch.cpuids[i].input[0], &d->arch.cpuids[i].eax,
> +                  &d->arch.cpuids[i].ebx, &d->arch.cpuids[i].ecx,
> +                  &d->arch.cpuids[i].edx);
> +        else
> +            cpuid_count(d->arch.cpuids[i].input[0], d->arch.cpuids[i].input[1],
> +                        &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
> +                        &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);

Why this if/else? It is always fine to use cpuid_count().
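
I.e. the whole if/else can collapse to just (leaves which take no sub-leaf
simply ignore the count input):

cpuid_count(d->arch.cpuids[i].input[0], d->arch.cpuids[i].input[1],
            &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
            &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);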

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping
  2016-10-03 10:10   ` Paul Durrant
@ 2016-10-06 15:25     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-06 15:25 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, boris.ostrovsky, Andrew Cooper, Jan Beulich

On Mon, Oct 03, 2016 at 11:10:44AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: 27 September 2016 16:57
> > To: xen-devel@lists.xenproject.org
> > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> > <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> > Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping
> > 
> > Add handlers to detect attempts from a PVHv2 Dom0 to change the position
> > of the PCI BARs and properly remap them.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Paul Durrant <paul.durrant@citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> >  xen/arch/x86/hvm/io.c         |   2 +
> >  xen/drivers/passthrough/pci.c | 307 ++++++++++++++++++++++++++++++++++++++++++
> >  xen/include/asm-x86/hvm/io.h  |  19 +++
> >  xen/include/xen/pci.h         |   3 +
> >  4 files changed, 331 insertions(+)
> > 
> > diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> > index 7de1de3..4db0266 100644
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -862,6 +862,8 @@ static int hvm_pt_add_register(struct hvm_pt_device *dev,
> >  }
> > 
> >  static struct hvm_pt_handler_init *hwdom_pt_handlers[] = {
> > +    &hvm_pt_bar_init,
> > +    &hvm_pt_vf_bar_init,
> >  };
> > 
> >  int hwdom_add_device(struct pci_dev *pdev)
> > diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
> > index 6d831dd..60c9e74 100644
> > --- a/xen/drivers/passthrough/pci.c
> > +++ b/xen/drivers/passthrough/pci.c
> > @@ -633,6 +633,313 @@ static int pci_size_bar(unsigned int seg, unsigned int bus, unsigned int slot,
> >      return 0;
> >  }
> > 
> > +static bool bar_reg_is_vf(uint32_t real_offset, uint32_t handler_offset)
> > +{
> > +    if ( real_offset - handler_offset == PCI_SRIOV_BAR )
> > +        return true;
> > +    else
> > +        return false;
> > +}
> > +
> 
> Return the bool expression rather than the if-then-else?

Done.
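
I.e. it now simply reads:

    return real_offset - handler_offset == PCI_SRIOV_BAR;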
 
> > +static int bar_reg_init(struct hvm_pt_device *s,
> > +                        struct hvm_pt_reg_handler *handler,
> > +                        uint32_t real_offset, uint32_t *data)
> > +{
> > +    uint8_t seg, bus, slot, func;
> > +    uint64_t addr, size;
> > +    uint32_t val;
> > +    unsigned int index = handler->offset / 4;
> > +    bool vf = bar_reg_is_vf(real_offset, handler->offset);
> > +    struct hvm_pt_bar *bars = (vf ? s->vf_bars : s->bars);
> > +    int num_bars = (vf ? PCI_SRIOV_NUM_BARS : s->num_bars);
> > +    int rc;
> > +
> > +    if ( index >= num_bars )
> > +    {
> > +        *data = HVM_PT_INVALID_REG;
> > +        return 0;
> > +    }
> > +
> > +    seg = s->pdev->seg;
> > +    bus = s->pdev->bus;
> > +    slot = PCI_SLOT(s->pdev->devfn);
> > +    func = PCI_FUNC(s->pdev->devfn);
> > +    val = pci_conf_read32(seg, bus, slot, func, real_offset);
> > +
> > +    if ( index > 0 && bars[index - 1].type == HVM_PT_BAR_MEM64_LO )
> > +        bars[index].type = HVM_PT_BAR_MEM64_HI;
> > +    else if ( (val & PCI_BASE_ADDRESS_SPACE) == PCI_BASE_ADDRESS_SPACE_IO )
> > +    {
> > +        bars[index].type = HVM_PT_BAR_UNUSED;
> > +    }
> > +    else if ( (val & PCI_BASE_ADDRESS_MEM_TYPE_MASK) ==
> > +              PCI_BASE_ADDRESS_MEM_TYPE_64 )
> > +        bars[index].type = HVM_PT_BAR_MEM64_LO;
> > +    else
> > +        bars[index].type = HVM_PT_BAR_MEM32;
> > +
> > +    if ( bars[index].type == HVM_PT_BAR_MEM32 ||
> > +         bars[index].type == HVM_PT_BAR_MEM64_LO )
> > +    {
> > +        /* Size the BAR and map it. */
> > +        rc = pci_size_bar(seg, bus, slot, func, real_offset - handler->offset,
> > +                          num_bars, &index, &addr, &size);
> > +        if ( rc )
> > +        {
> > +            printk_pdev(s->pdev, XENLOG_ERR, "unable to size BAR#%d\n",
> > +                        index);
> > +            return rc;
> > +        }
> > +
> > +        if ( size == 0 )
> > +            bars[index].type = HVM_PT_BAR_UNUSED;
> > +        else
> > +        {
> > +            printk_pdev(s->pdev, XENLOG_DEBUG,
> > +                        "Mapping BAR#%u: %#lx size: %u\n", index, addr, size);
> > +            rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(addr),
> > +                                DIV_ROUND_UP(size, PAGE_SIZE), true);
> > +            if ( rc )
> > +            {
> > +                printk_pdev(s->pdev, XENLOG_ERR,
> > +                            "failed to map BAR#%d into memory map: %d\n",
> > +                            index, rc);
> > +                return rc;
> > +            }
> > +        }
> > +    }
> > +
> > +    *data = bars[index].type == HVM_PT_BAR_UNUSED ? HVM_PT_INVALID_REG : val;
> > +    return 0;
> > +}
> > +
> > +/* Only allow writes to check the size of the BARs */
> > +static int allow_bar_write(struct hvm_pt_bar *bar, struct hvm_pt_reg *reg,
> > +                           struct pci_dev *pdev, uint32_t val)
> > +{
> > +    uint32_t mask;
> > +
> > +    if ( bar->type == HVM_PT_BAR_MEM64_HI )
> > +        mask = ~0;
> > +    else
> > +        mask = (uint32_t) PCI_BASE_ADDRESS_MEM_MASK;
> > +
> > +    if ( val != ~0 && (val & mask) != (reg->val.dword & mask) )
> > +    {
> > +        printk_pdev(pdev, XENLOG_ERR,
> > +                "changing the position of the BARs is not yet supported: %#x\n",
> > +                    val);
> 
> This doesn't seem to quite tally with commit comment. Can BARs be re-programmed or not?

Right, I got mixed up in the commit message. This _prevents_ remapping BARs 
(which would also not work anyway). Further code to allow remapping BARs can 
be based on top of these handlers.
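
A future remap path could then be built on top of them, roughly like this
(sketch only, assuming modify_mmio_11() can also unmap by passing false):

/* Unmap the BAR from its old position... */
rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(old_addr),
                    DIV_ROUND_UP(size, PAGE_SIZE), false);
/* ...and map it at the address written by Dom0. */
if ( !rc )
    rc = modify_mmio_11(s->pdev->domain, PFN_DOWN(new_addr),
                        DIV_ROUND_UP(size, PAGE_SIZE), true);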

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-09-27 15:57 ` [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
@ 2016-10-06 15:40   ` Jan Beulich
  2016-10-06 15:48     ` Andrew Cooper
  2016-10-12 15:35     ` Roger Pau Monne
  0 siblings, 2 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-06 15:40 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> FWIW, I think that the current approach that I've used in order to craft the
> MADT is not the best one; IMHO it would be better to place the MADT at the
> end of the E820_ACPI region (expanding its size by one page), and modify the
> XSDT/RSDT in order to point to it, that way we avoid shadowing any other
> ACPI data that might be at the same page as the native MADT (and that needs
> to be modified by Dom0).

I agree with the idea of placing MADT elsewhere, but I don't think you
can rely on E820_ACPI to have room to grow into right after its end.

> @@ -50,6 +53,8 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
>  #define HVM_VM86_TSS_SIZE   128
>  
>  static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
> +static unsigned int __initdata acpi_intr_overrrides = 0;
> +static struct acpi_madt_interrupt_override __initdata *intsrcovr = NULL;

Pointless initializers.

> +static void __init acpi_zap_table_signature(char *name)
> +{
> +    struct acpi_table_header *table;
> +    acpi_status status;
> +    union {
> +        char str[ACPI_NAME_SIZE];
> +        uint32_t bits;
> +    } signature;
> +    char tmp;
> +    int i;
> +
> +    status = acpi_get_table(name, 0, &table);
> +    if ( ACPI_SUCCESS(status) )
> +    {
> +        memcpy(&signature.str[0], &table->signature[0], ACPI_NAME_SIZE);
> +        for ( i = 0; i < ACPI_NAME_SIZE / 2; i++ )
> +        {
> +            tmp = signature.str[ACPI_NAME_SIZE - i - 1];
> +            signature.str[ACPI_NAME_SIZE - i - 1] = signature.str[i];
> +            signature.str[i] = tmp;
> +        }
> +        write_atomic((uint32_t*)&table->signature[0], signature.bits);
> +    }
> +}

Together with moving MADT elsewhere we should also move
XSDT/RSDT, at which point we can simply avoid copying the
pointers for tables we don't want Dom0 to see (and down the
road we also have the option of adding tables).

The downside to both approaches is that this once again is a
black-listing model, i.e. new structure types we may need to
also suppress will remain visible to Dom0 until a patched
hypervisor would become available.

> +static int __init hvm_setup_acpi(struct domain *d)
> +{
> +    struct vcpu *saved_current, *v = d->vcpu[0];
> +    unsigned long pfn, nr_pages;
> +    uint64_t size, start_addr, end_addr;
> +    uint64_t madt_addr[2] = { 0, 0 };
> +    struct acpi_table_header *table;
> +    struct acpi_table_madt *madt;
> +    struct acpi_madt_io_apic *io_apic;
> +    struct acpi_madt_local_apic *local_apic;
> +    acpi_status status;
> +    int rc, i;
> +
> +    printk("** Setup ACPI tables **\n");
> +
> +    /* ZAP the HPET, SLIT, SRAT, MPST and PMTT tables. */
> +    acpi_zap_table_signature(ACPI_SIG_HPET);
> +    acpi_zap_table_signature(ACPI_SIG_SLIT);
> +    acpi_zap_table_signature(ACPI_SIG_SRAT);
> +    acpi_zap_table_signature(ACPI_SIG_MPST);
> +    acpi_zap_table_signature(ACPI_SIG_PMTT);
> +
> +    /* Map ACPI tables 1:1 */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        if ( d->arch.e820[i].type != E820_ACPI &&
> +             d->arch.e820[i].type != E820_NVS )
> +            continue;
> +
> +        pfn = PFN_DOWN(d->arch.e820[i].addr);
> +        nr_pages = DIV_ROUND_UP(d->arch.e820[i].size, PAGE_SIZE);
> +
> +        rc = modify_mmio_11(d, pfn, nr_pages, true);
> +        if ( rc )
> +        {
> +            printk(
> +                "Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",

Please keep the format string on the printk line.

> +                   pfn, pfn + nr_pages);
> +            return rc;
> +        }
> +    }
> +    /*
> +     * Since most memory maps provided by hardware are wrong, make sure each
> +     * top-level table is properly mapped into Dom0.
> +     */

Please be more specific here - wrong in which way? Not all ACPI tables
living in one of the two E820 type regions? But see also below.

In any event you need to be more careful here: Mapping ordinary RAM
1:1 into Dom0 can't end well. Also, once working with a cloned XSDT you
won't need to cover tables here which have got hidden from Dom0.

> +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> +    {
> +        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> +        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> +                                PAGE_SIZE);
> +        rc = modify_mmio_11(d, pfn, nr_pages, true);
> +        if ( rc )
> +        {
> +            printk(
> +                "Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",
> +                   pfn, pfn + nr_pages);
> +            return rc;
> +        }
> +    }
> +
> +    /*
> +     * Special treatment for memory < 1MB:
> +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).

Copy? What if some of it needs to get modified to interact with
firmware? I think you'd be better off mapping everything Xen
doesn't use into Dom0, and only mapping fresh RAM pages
over regions Xen does use (namely the trampoline).

> +     *  - Map any reserved regions as 1:1.

Why would reserved regions need such treatment only when below
1Mb?

> +    acpi_get_table_phys(ACPI_SIG_MADT, 0, &madt_addr[0], &size);
> +    if ( !madt_addr[0] )
> +    {
> +        printk("Unable to find ACPI MADT table\n");
> +        return -EINVAL;
> +    }
> +    if ( size > PAGE_SIZE )
> +    {
> +        printk("MADT table is bigger than PAGE_SIZE, aborting\n");
> +        return -EINVAL;
> +    }

This restriction for sure needs to go away.

> +    acpi_get_table_phys(ACPI_SIG_MADT, 2, &madt_addr[1], &size);

Why 2 (and not 1) when the first invocation used 0? But this is not
going to be needed anyway with the alternative model.

> +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> +    io_apic->header.length = sizeof(*io_apic);
> +    io_apic->id = 1;
> +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;

I think you need to make as many IO-APICs available to Dom0 as
there are hardware ones.
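
Something along these lines perhaps (sketch only; I didn't check that the
field/helper names match exactly what we have):

    for ( i = 0; i < nr_ioapics; i++ )
    {
        io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
        io_apic->header.length = sizeof(*io_apic);
        io_apic->id = mp_ioapics[i].mpc_apicid;
        io_apic->address = mp_ioapics[i].mpc_apicaddr;
        io_apic->global_irq_base = io_apic_gsi_base(i);
        io_apic++;
    }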

> +    if ( dom0_max_vcpus() > num_online_cpus() )
> +    {
> +        printk("CPU overcommit is not supported for Dom0\n");

???

> +        xfree(madt);

I don't think such cleanup is needed here - boot won't succeed
anyway.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for hardware domain
  2016-09-27 15:57 ` [PATCH v2 19/30] xen/dcpi: add a dpci passthrough handler for hardware domain Roger Pau Monne
  2016-10-03  9:02   ` Paul Durrant
@ 2016-10-06 15:44   ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-06 15:44 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -2076,45 +2076,6 @@ static bool_t admin_io_okay(unsigned int port, unsigned int bytes,
>      return ioports_access_permitted(d, port, port + bytes - 1);
>  }
>  
> -static bool_t pci_cfg_ok(struct domain *currd, unsigned int start,
> -                         unsigned int size, uint32_t *write)

I don't follow why you move this ...

> --- a/xen/drivers/passthrough/pci.c
> +++ b/xen/drivers/passthrough/pci.c
> @@ -966,6 +966,45 @@ void pci_check_disable_device(u16 seg, u8 bus, u8 devfn)
>                       PCI_COMMAND, cword & ~PCI_COMMAND_MASTER);
>  }
>  
> +bool_t pci_cfg_ok(struct domain *currd, unsigned int start,
> +                         unsigned int size, uint32_t *write)

here. After all ...

> +{
> +    uint32_t machine_bdf;
> +
> +    if ( !is_hardware_domain(currd) )
> +        return 0;
> +
> +    if ( !CF8_ENABLED(currd->arch.pci_cf8) )
> +        return 1;
> +
> +    machine_bdf = CF8_BDF(currd->arch.pci_cf8);

... concepts like I/O ports in general and ports CF8 / CFC in particular
are x86-specific. If this really needs moving at all, it should stay in an
x86-specific file.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-09-27 15:57 ` [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code Roger Pau Monne
  2016-10-03  9:54   ` Paul Durrant
@ 2016-10-06 15:47   ` Jan Beulich
  2016-10-10 12:41   ` Jan Beulich
  2 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-06 15:47 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> Most of this code has been picked up from QEMU and modified so it can be
> plugged into the internal Xen IO handlers. The structure of the handlers has
> been keep quite similar to QEMU, so existing handlers can be imported
> without a lot of effort.

Without looking at the code in any detail, the question that Paul has
raised really needs to be answered first: Why pass-through code
when Dom0 has access to (almost) all devices by default anyway?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-06 15:40   ` Jan Beulich
@ 2016-10-06 15:48     ` Andrew Cooper
  2016-10-12 15:35     ` Roger Pau Monne
  1 sibling, 0 replies; 146+ messages in thread
From: Andrew Cooper @ 2016-10-06 15:48 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

On 06/10/16 16:40, Jan Beulich wrote:
>>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> FWIW, I think that the current approach that I've used in order to craft the
>> MADT is not the best one; IMHO it would be better to place the MADT at the
>> end of the E820_ACPI region (expanding its size by one page), and modify the
>> XSDT/RSDT in order to point to it, that way we avoid shadowing any other
>> ACPI data that might be at the same page as the native MADT (and that needs
>> to be modified by Dom0).
> I agree with the idea of placing MADT elsewhere, but I don't think you
> can rely on E820_ACPI to have room to grow into right after its end.
>
>> @@ -50,6 +53,8 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
>>  #define HVM_VM86_TSS_SIZE   128
>>  
>>  static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
>> +static unsigned int __initdata acpi_intr_overrrides = 0;
>> +static struct acpi_madt_interrupt_override __initdata *intsrcovr = NULL;
> Pointless initializers.
>
>> +static void __init acpi_zap_table_signature(char *name)
>> +{
>> +    struct acpi_table_header *table;
>> +    acpi_status status;
>> +    union {
>> +        char str[ACPI_NAME_SIZE];
>> +        uint32_t bits;
>> +    } signature;
>> +    char tmp;
>> +    int i;
>> +
>> +    status = acpi_get_table(name, 0, &table);
>> +    if ( ACPI_SUCCESS(status) )
>> +    {
>> +        memcpy(&signature.str[0], &table->signature[0], ACPI_NAME_SIZE);
>> +        for ( i = 0; i < ACPI_NAME_SIZE / 2; i++ )
>> +        {
>> +            tmp = signature.str[ACPI_NAME_SIZE - i - 1];
>> +            signature.str[ACPI_NAME_SIZE - i - 1] = signature.str[i];
>> +            signature.str[i] = tmp;
>> +        }
>> +        write_atomic((uint32_t*)&table->signature[0], signature.bits);
>> +    }
>> +}
> Together with moving MADT elsewhere we should also move
> XSDT/RSDT, at which point we can simply avoid copying the
> pointers for tables we don't want Dom0 to see (and down the
> road we also have the option of adding tables).
>
> The downside to both approaches is that this once again is a
> black-listing model, i.e. new structure types we may need to
> also suppress will remain visible to Dom0 until a patched
> hypervisor would become available.

If we are providing a new XSDT/RSDT, we can have full control over the
tables dom0 sees.  Pointers to existing tables should only be entered
into the new XSDT/RSDT if Xen explicitly knows the table & version.

We will have to be diligent about keeping on top of new versions of the
ACPI spec, but everything like this should be whitelist based.  (This is
also the same model I'm trying to move the CPUID/MSR infrastructure towards.)
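
As a sketch of what I mean (names invented; the real list would need
careful auditing against the spec):

    static const char *const hwdom_table_whitelist[] = {
        ACPI_SIG_FADT, ACPI_SIG_MCFG, ACPI_SIG_SSDT,
    };

    /* Only tables Xen explicitly knows about get their pointers copied
     * into the new XSDT/RSDT handed to dom0. */
    for ( i = 0; i < acpi_gbl_root_table_list.count; i++ )
        for ( j = 0; j < ARRAY_SIZE(hwdom_table_whitelist); j++ )
            if ( !strncmp(acpi_gbl_root_table_list.tables[i].signature.ascii,
                          hwdom_table_whitelist[j], ACPI_NAME_SIZE) )
                /* append tables[i].address to the new XSDT */;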

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-10-06 15:08     ` Roger Pau Monne
@ 2016-10-06 15:52       ` Lars Kurth
  2016-10-07  9:13       ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Lars Kurth @ 2016-10-06 15:52 UTC (permalink / raw)
  To: Roger Pau Monne, Paul Durrant
  Cc: Andrew Cooper, George Dunlap, Jan Beulich, Ian Jackson,
	xen-devel, boris.ostrovsky



On 06/10/2016 17:08, "Roger Pau Monne" <roger.pau@citrix.com> wrote:

>On Mon, Oct 03, 2016 at 10:54:54AM +0100, Paul Durrant wrote:
>> > -----Original Message-----
>> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
>> > Sent: 27 September 2016 16:57
>> > To: xen-devel@lists.xenproject.org
>> > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
>> > <roger.pau@citrix.com>; Jan Beulich <jbeulich@suse.com>; Andrew Cooper
>> > <Andrew.Cooper3@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>
>> > Subject: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import
>> > QEMU passthrough code
>> > 
>> > Most of this code has been picked up from QEMU and modified so it can be
>> > plugged into the internal Xen IO handlers. The structure of the handlers
>> > has been kept quite similar to QEMU, so existing handlers can be imported
>> > without a lot of effort.
>> > 
>> 
>> If you lifted code from QEMU then one assumes there is no problem with
>>license, but do you need to amend copyrights for any of the files where
>>you put the code?
>
>License is GPL 2, same as Xen. For copyrights I have to admit I have no
>idea. The code is not imported as-is for obvious reasons, but the logic is
>mostly the same. I don't mind adding the copyright holders for all the code
>I've imported; they are:
>
>Copyright (c) 2007, Neocleus Corporation.
>Copyright (c) 2007, Intel Corporation.

For imported code, you should keep the (c) header as is, adapt the coding
style, and then add a "Copyright (c) 2016, ..." line if you are making
significant modifications.


You should also create a README.source file (or add to one in that part of
the tree) which tracks where the code came from (e.g. QEMU in this case,
referring to the source file), so that it is easier for someone who needs
to trace the code back to its origin at some point. The commit message
should also contain that information.
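
An entry could look like e.g. (illustrative only):

    File: xen/arch/x86/hvm/io.c (hvm_pt_* config space handlers)
    Origin: QEMU, hw/xen/xen_pt_config_init.c
    License: GPLv2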

Lars

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 146+ messages in thread

* Re: [PATCH v2 26/30] xen/x86: add PCIe emulation
  2016-10-03 10:46   ` Paul Durrant
@ 2016-10-06 15:53     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-06 15:53 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, boris.ostrovsky, Andrew Cooper, Jan Beulich

On Mon, Oct 03, 2016 at 11:46:25AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: 27 September 2016 16:57
> > To: xen-devel@lists.xenproject.org
> > Cc: konrad.wilk@oracle.com; boris.ostrovsky@oracle.com; Roger Pau Monne
> > <roger.pau@citrix.com>; Paul Durrant <Paul.Durrant@citrix.com>; Jan
> > Beulich <jbeulich@suse.com>; Andrew Cooper
> > <Andrew.Cooper3@citrix.com>
> > Subject: [PATCH v2 26/30] xen/x86: add PCIe emulation
> > 
> > Add a new MMIO handler that traps accesses to PCIe regions, as discovered
> > by Xen from the MCFG ACPI table. The handler used is the same as the one
> > used for accesses to the IO PCI configuration space.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Paul Durrant <paul.durrant@citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> >  xen/arch/x86/hvm/io.c | 177 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 171 insertions(+), 6 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> > index 779babb..088e3ec 100644
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -46,6 +46,8 @@
> >  #include <xen/iocap.h>
> >  #include <public/hvm/ioreq.h>
> > 
> > +#include "../x86_64/mmconfig.h"
> > +
> >  /* Set permissive mode for HVM Dom0 PCI pass-through by default */
> >  static bool_t opt_dom0permissive = 1;
> >  boolean_param("dom0permissive", opt_dom0permissive);
> > @@ -363,7 +365,7 @@ static int hvm_pt_pci_config_access_check(struct hvm_pt_device *d,
> >  }
> > 
> >  static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
> > -                                  uint32_t *data, int len)
> > +                                  uint32_t *data, int len, bool pcie)
> >  {
> >      uint32_t val = 0;
> >      struct hvm_pt_reg_group *reg_grp_entry = NULL;
> > @@ -377,7 +379,7 @@ static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
> >      unsigned int func = PCI_FUNC(d->pdev->devfn);
> > 
> >      /* Sanity checks. */
> > -    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> > +    if ( !pcie && hvm_pt_pci_config_access_check(d, addr, len) )
> >          return X86EMUL_UNHANDLEABLE;
> > 
> >      /* Find register group entry. */
> > @@ -468,7 +470,7 @@ static int hvm_pt_pci_read_config(struct hvm_pt_device *d, uint32_t addr,
> >  }
> > 
> >  static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
> > -                                    uint32_t val, int len)
> > +                                    uint32_t val, int len, bool pcie)
> >  {
> >      int index = 0;
> >      struct hvm_pt_reg_group *reg_grp_entry = NULL;
> > @@ -485,7 +487,7 @@ static int hvm_pt_pci_write_config(struct hvm_pt_device *d, uint32_t addr,
> >      unsigned int func = PCI_FUNC(d->pdev->devfn);
> > 
> >      /* Sanity checks. */
> > -    if ( hvm_pt_pci_config_access_check(d, addr, len) )
> > +    if ( !pcie && hvm_pt_pci_config_access_check(d, addr, len) )
> >          return X86EMUL_UNHANDLEABLE;
> > 
> >      /* Find register group entry. */
> > @@ -677,7 +679,7 @@ static int hw_dpci_portio_read(const struct hvm_io_handler *handler,
> >      if ( dev != NULL )
> >      {
> >          reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> > -        rc = hvm_pt_pci_read_config(dev, reg, &data32, size);
> > +        rc = hvm_pt_pci_read_config(dev, reg, &data32, size, false);
> >          if ( rc == X86EMUL_OKAY )
> >          {
> >              read_unlock(&currd->arch.hvm_domain.pt_lock);
> > @@ -722,7 +724,7 @@ static int hw_dpci_portio_write(const struct
> > hvm_io_handler *handler,
> >      if ( dev != NULL )
> >      {
> >          reg = (currd->arch.pci_cf8 & 0xfc) | (addr & 0x3);
> > -        rc = hvm_pt_pci_write_config(dev, reg, data32, size);
> > +        rc = hvm_pt_pci_write_config(dev, reg, data32, size, false);
> >          if ( rc == X86EMUL_OKAY )
> >          {
> >              read_unlock(&currd->arch.hvm_domain.pt_lock);
> > @@ -1002,6 +1004,166 @@ static const struct hvm_io_ops
> > hw_dpci_portio_ops = {
> >      .write = hw_dpci_portio_write
> >  };
> > 
> > +static struct acpi_mcfg_allocation *pcie_find_mmcfg(unsigned long addr)
> > +{
> > +    int i;
> > +
> > +    for ( i = 0; i < pci_mmcfg_config_num; i++ )
> > +    {
> > +        unsigned long start, end;
> > +
> > +        start = pci_mmcfg_config[i].address;
> > +        end = pci_mmcfg_config[i].address +
> > +              ((pci_mmcfg_config[i].end_bus_number + 1) << 20);
> > +        if ( addr >= start && addr < end )
> > +            return &pci_mmcfg_config[i];
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static struct hvm_pt_device *hw_pcie_get_device(unsigned int seg,
> > +                                                unsigned int bus,
> > +                                                unsigned int slot,
> > +                                                unsigned int func)
> > +{
> > +    struct hvm_pt_device *dev;
> > +    struct domain *d = current->domain;
> > +
> > +    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
> > +    {
> > +        if ( dev->pdev->seg != seg || dev->pdev->bus != bus ||
> > +             dev->pdev->devfn != PCI_DEVFN(slot, func) )
> > +            continue;
> > +
> > +        return dev;
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> > +static void pcie_decode_addr(unsigned long addr, unsigned int *bus,
> > +                             unsigned int *slot, unsigned int *func,
> > +                             unsigned int *reg)
> > +{
> > +
> > +    *bus = (addr >> 20) & 0xff;
> > +    *slot = (addr >> 15) & 0x1f;
> > +    *func = (addr >> 12) & 0x7;
> > +    *reg = addr & 0xfff;
> > +}
> > +
> > +static int pcie_range(struct vcpu *v, unsigned long addr)
> > +{
> > +
> > +    return pcie_find_mmcfg(addr) != NULL ? 1 : 0;
> > +}
> > +
> > +static int pcie_read(struct vcpu *v, unsigned long addr,
> > +                     unsigned int len, unsigned long *pval)
> > +{
> > +    struct acpi_mcfg_allocation *mmcfg = pcie_find_mmcfg(addr);
> > +    struct domain *d = v->domain;
> > +    unsigned int seg, bus, slot, func, reg;
> > +    struct hvm_pt_device *dev;
> > +    uint32_t val;
> > +    int rc;
> > +
> > +    ASSERT(mmcfg != NULL);
> > +
> > +    if ( len > 4 || len == 3 )
> > +        return X86EMUL_UNHANDLEABLE;
> > +
> > +    addr -= mmcfg->address;
> > +    seg = mmcfg->pci_segment;
> > +    pcie_decode_addr(addr, &bus, &slot, &func, &reg);
> > +
> > +    read_lock(&d->arch.hvm_domain.pt_lock);
> > +    dev = hw_pcie_get_device(seg, bus, slot, func);
> > +    if ( dev != NULL )
> > +    {
> > +        rc = hvm_pt_pci_read_config(dev, reg, &val, len, true);
> > +        if ( rc == X86EMUL_OKAY )
> > +        {
> > +            read_unlock(&d->arch.hvm_domain.pt_lock);
> > +            goto out;
> > +        }
> > +    }
> > +    read_unlock(&d->arch.hvm_domain.pt_lock);
> > +
> > +    /* Pass-through */
> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        val = pci_conf_read8(seg, bus, slot, func, reg);
> > +        break;
> > +    case 2:
> > +        val = pci_conf_read16(seg, bus, slot, func, reg);
> > +        break;
> > +    case 4:
> > +        val = pci_conf_read32(seg, bus, slot, func, reg);
> > +        break;
> > +    }
> > +
> > + out:
> > +    *pval = val;
> > +    return X86EMUL_OKAY;
> > +}
> > +
> > +static int pcie_write(struct vcpu *v, unsigned long addr,
> > +                      unsigned int len, unsigned long val)
> > +{
> > +    struct acpi_mcfg_allocation *mmcfg = pcie_find_mmcfg(addr);
> > +    struct domain *d = v->domain;
> > +    unsigned int seg, bus, slot, func, reg;
> > +    struct hvm_pt_device *dev;
> > +    int rc;
> > +
> > +    ASSERT(mmcfg != NULL);
> > +
> > +    if ( len > 4 || len == 3 )
> > +        return X86EMUL_UNHANDLEABLE;
> > +
> > +    addr -= mmcfg->address;
> > +    seg = mmcfg->pci_segment;
> > +    pcie_decode_addr(addr, &bus, &slot, &func, &reg);
> > +
> > +    read_lock(&d->arch.hvm_domain.pt_lock);
> > +    dev = hw_pcie_get_device(seg, bus, slot, func);
> > +    if ( dev != NULL )
> > +    {
> > +        rc = hvm_pt_pci_write_config(dev, reg, val, len, true);
> > +        if ( rc == X86EMUL_OKAY )
> > +        {
> > +            read_unlock(&d->arch.hvm_domain.pt_lock);
> > +            return rc;
> > +        }
> > +    }
> > +    read_unlock(&d->arch.hvm_domain.pt_lock);
> > +
> > +    /* Pass-through */
> > +    switch ( len )
> > +    {
> > +    case 1:
> > +        pci_conf_write8(seg, bus, slot, func, reg, val);
> > +        break;
> > +    case 2:
> > +        pci_conf_write16(seg, bus, slot, func, reg, val);
> > +        break;
> > +    case 4:
> > +        pci_conf_write32(seg, bus, slot, func, reg, val);
> > +        break;
> > +    }
> > +
> > +    return X86EMUL_OKAY;
> > +}
> > +
> > +static const struct hvm_mmio_ops pcie_mmio_ops = {
> > +    .check = pcie_range,
> > +    .read = pcie_read,
> > +    .write = pcie_write
> > +};
> > +
> >  void register_dpci_portio_handler(struct domain *d)
> >  {
> >      struct hvm_io_handler *handler = hvm_next_io_handler(d);
> > @@ -1011,7 +1173,10 @@ void register_dpci_portio_handler(struct domain
> > *d)
> > 
> >      handler->type = IOREQ_TYPE_PIO;
> >      if ( is_hardware_domain(d) )
> > +    {
> >          handler->ops = &hw_dpci_portio_ops;
> > +        register_mmio_handler(d, &pcie_mmio_ops);
> 
> This is a somewhat counterintuitive place to be registering an MMIO handler? Would this not be better done directly by the caller?

Right, I've moved the handlers to arch/x86/pci.c, and directly registered the 
handler in hvm_domain_initialise. TBH, I think we should talk a little bit 
about how this code should be structured. Should it go into some common 
subdirectory? Should it be sprinkled around like it's done in this series?
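
For reference, a minimal sketch of where the registration lives now (the
surroundings of hvm_domain_initialise are elided, so this is illustrative
rather than the actual patch):

    int hvm_domain_initialise(struct domain *d)
    {
        /* ... existing HVM domain setup ... */

        /* Register the hardware domain handlers at domain creation time,
         * instead of from register_dpci_portio_handler(): */
        if ( is_hardware_domain(d) )
            register_mmio_handler(d, &pcie_mmio_ops);

        /* ... */
        return 0;
    }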

Roger.


* Re: [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0
  2016-10-03 10:57   ` Paul Durrant
@ 2016-10-06 15:58     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-06 15:58 UTC (permalink / raw)
  To: Paul Durrant; +Cc: xen-devel, boris.ostrovsky

On Mon, Oct 03, 2016 at 11:57:13AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> > Roger Pau Monne
> > Sent: 27 September 2016 16:57
> > To: xen-devel@lists.xenproject.org
> > Cc: boris.ostrovsky@oracle.com; Roger Pau Monne <roger.pau@citrix.com>
> > Subject: [Xen-devel] [PATCH v2 28/30] xen/x86: add MSI-X emulation to
> > PVHv2 Dom0
> > 
> > This requires adding handlers to the PCI configuration space, plus an MMIO
> > handler for the MSI-X table; the PBA is left mapped directly into the guest.
> > The implementation is based on the one already found in the passthrough
> > code from QEMU.
> > 
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> > ---
> > Cc: Paul Durrant <paul.durrant@citrix.com>
> > Cc: Jan Beulich <jbeulich@suse.com>
> > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> > ---
> >  xen/arch/x86/hvm/io.c         |   2 +
> >  xen/arch/x86/hvm/vmsi.c       | 498
> > ++++++++++++++++++++++++++++++++++++++++++
> >  xen/drivers/passthrough/pci.c |   6 +-
> >  xen/include/asm-x86/hvm/io.h  |  26 +++
> >  xen/include/asm-x86/msi.h     |   4 +-
> >  5 files changed, 534 insertions(+), 2 deletions(-)
> > 
> > diff --git a/xen/arch/x86/hvm/io.c b/xen/arch/x86/hvm/io.c
> > index 088e3ec..11b7313 100644
> > --- a/xen/arch/x86/hvm/io.c
> > +++ b/xen/arch/x86/hvm/io.c
> > @@ -867,6 +867,7 @@ static struct hvm_pt_handler_init
> > *hwdom_pt_handlers[] = {
> >      &hvm_pt_bar_init,
> >      &hvm_pt_vf_bar_init,
> >      &hvm_pt_msi_init,
> > +    &hvm_pt_msix_init,
> >  };
> > 
> >  int hwdom_add_device(struct pci_dev *pdev)
> > @@ -1176,6 +1177,7 @@ void register_dpci_portio_handler(struct domain
> > *d)
> >      {
> >          handler->ops = &hw_dpci_portio_ops;
> >          register_mmio_handler(d, &pcie_mmio_ops);
> > +        register_mmio_handler(d, &vmsix_mmio_ops);
> 
> Again, this is a somewhat counterintuitive place to make this call.

Done, I've moved it to hvm_domain_initialise together with the PCIe MMIO 
handlers.


* Re: [PATCH v2 21/30] xen/pci: split code to size BARs from pci_add_device
  2016-09-27 15:57 ` [PATCH v2 21/30] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
@ 2016-10-06 16:00   ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-06 16:00 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> Because it's also going to be used by other code.

If you want to make use of this for general-purpose sizing, simply
moving the existing code is not enough. For one, I/O port BARs
don't get handled (as SR-IOV devices aren't allowed to have such).
And then I'm afraid there are a number of quirks to work around,
so some code may need to be lifted from Linux. While it may be
legitimate to not do this right here, it should be done before this
code gets used for other than SR-IOV, and having peeked into the
next patch I didn't find you doing any adjustments.

> +                ret = pci_size_bar(seg, bus, slot, func, pos + PCI_SRIOV_BAR,
> +                                   PCI_SRIOV_NUM_BARS, &i, &addr,
> +                                   &pdev->vf_rlen[i]);
> +                if ( ret )
> +                    printk_pdev(pdev, XENLOG_WARNING,
> +                                "failed to size SR-IOV BAR%u\n", i);
>              }
>          }

Just one remark on the code itself: With "addr" being of no interest
to this caller (afaics), I think it would be prudent to make this an
optional parameter (and pass NULL here).
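
I.e. something like the below (sketch only - the real pci_size_bar()
argument list differs, this just shows the NULL-check idiom):

    static int pci_size_bar(unsigned int seg, unsigned int bus,
                            unsigned int slot, unsigned int func,
                            unsigned int pos,
                            uint64_t *paddr /* optional */,
                            uint64_t *psize)
    {
        uint64_t addr = 0;

        /* ... BAR sizing logic computing addr and *psize ... */

        if ( paddr != NULL )
            *paddr = addr;

        return 0;
    }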

Jan



* Re: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-10-06 15:08     ` Roger Pau Monne
  2016-10-06 15:52       ` Lars Kurth
@ 2016-10-07  9:13       ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-07  9:13 UTC (permalink / raw)
  To: Paul Durrant, Roger Pau Monne
  Cc: lars.kurth, George.Dunlap, Andrew Cooper, ian.jackson, xen-devel,
	boris.ostrovsky

>>> On 06.10.16 at 17:08, <roger.pau@citrix.com> wrote:
> On Mon, Oct 03, 2016 at 10:54:54AM +0100, Paul Durrant wrote:

To both of you: Please limit the quoting in your replies.

Thanks, Jan



* Re: [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code
  2016-09-27 15:57 ` [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code Roger Pau Monne
  2016-10-03  9:54   ` Paul Durrant
  2016-10-06 15:47   ` Jan Beulich
@ 2016-10-10 12:41   ` Jan Beulich
  2 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 12:41 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -632,6 +632,8 @@ int hvm_domain_initialise(struct domain *d)
>              goto fail1;
>          }
>          memset(d->arch.hvm_domain.io_bitmap, ~0, HVM_IOBITMAP_SIZE);
> +        INIT_LIST_HEAD(&d->arch.hvm_domain.pt_devices);

This field appears to be redundant with arch.pdev_list.

> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -46,6 +46,10 @@
>  #include <xen/iocap.h>
>  #include <public/hvm/ioreq.h>
>  
> +/* Set permissive mode for HVM Dom0 PCI pass-through by default */
> +static bool_t opt_dom0permissive = 1;

Plain bool / true / false please. And as mentioned by Andrew, we
should stop adding more dom0xyz options, and use a consolidated
dom0= one instead.

> @@ -258,12 +262,403 @@ static bool_t hw_dpci_portio_accept(const struct 
> hvm_io_handler *handler,
>      return 0;
>  }
>  
> +static struct hvm_pt_device *hw_dpci_get_device(struct domain *d)
> +{
> +    uint8_t bus, slot, func;
> +    uint32_t addr;
> +    struct hvm_pt_device *dev;
> +
> +    /* Decode bus, slot and func. */
> +    addr = CF8_BDF(d->arch.pci_cf8);
> +    bus = PCI_BUS(addr);
> +    slot = PCI_SLOT(addr);
> +    func = PCI_FUNC(addr);
> +
> +    list_for_each_entry( dev, &d->arch.hvm_domain.pt_devices, entries )
> +    {
> +        if ( dev->pdev->seg != 0 || dev->pdev->bus != bus ||

Okay, there's no way segments other than 0 can be handled here.
But having glanced over the titles of the rest of the series - where
are those going to be handled (read: Where is the MCFG code,
which qemu doesn't have)?

Also I think the function name is not well chosen: Its prefix suggests
some kind of "official" interface, yet it really just is an internal helper
which doesn't even "get" a device in the general sense of needing to
"put" it later on.

And it looks like the parameter could be constified (but this appears
to be a wider problem).

> +/* Dispatchers */
> +
> +/* Find emulate register group entry */
> +struct hvm_pt_reg_group *hvm_pt_find_reg_grp(struct hvm_pt_device *d,
> +                                             uint32_t address)

Please don't needlessly use fixed width types.

> +{
> +    struct hvm_pt_reg_group *entry = NULL;
> +
> +    /* Find register group entry */
> +    list_for_each_entry( entry, &d->register_groups, entries )
> +    {
> +        /* check address */
> +        if ( (entry->base_offset <= address)
> +             && ((entry->base_offset + entry->size) > address) )

Coding style (&& belongs on the previous line).

And actually I guess I'll stop here, realizing that I'm completely
unconvinced by the intentions, which aren't even spelled out. The
lifting of code from qemu alone is problematic imo: that code has
proven to have many issues, only the most severe of which have got
fixed over time. I'm therefore of the opinion that a clean re-write
from scratch should at least be considered, once what the behavior
actually ought to be has been written down somewhere
(docs/misc/hvmlite.markdown?) and agreed upon.

Jan



* Re: [PATCH v2 23/30] xen/x86: route legacy PCI interrupts to Dom0
  2016-09-27 15:57 ` [PATCH v2 23/30] xen/x86: route legacy PCI interrupts to Dom0 Roger Pau Monne
@ 2016-10-10 13:37   ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 13:37 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> This is done by adding some Dom0-specific logic to the IO APIC emulation
> inside of Xen, so that writes to IO APIC registers that unmask an interrupt
> will take care of setting up that interrupt with Xen. A Dom0-specific EOI
> handler also has to be used, since Xen doesn't know the topology of the PCI
> devices and just has to pass through what Dom0 does.

Without having looked at the patch (yet) I have a hard time seeing
the connection between EOI and PCI topology. I therefore think the
description needs improvement.

> --- a/xen/arch/x86/hvm/vioapic.c
> +++ b/xen/arch/x86/hvm/vioapic.c
> @@ -148,6 +148,29 @@ static void vioapic_write_redirent(
>          unmasked = unmasked && !ent.fields.mask;
>      }
>  
> +    if ( is_hardware_domain(d) && unmasked )
> +    {
> +        int ret, gsi;
> +
> +        /* Interrupt has been unmasked */
> +        gsi = idx;
> +        ret = mp_register_gsi(gsi, ent.fields.trig_mode, ent.fields.polarity);
> +        if ( ret && ret != -EEXIST )
> +        {
> +            gdprintk(XENLOG_WARNING,
> +                     "%s: error registering GSI %d\n", __func__, ret);

The message text suggests the number is the GSI, whereas it really
looks to be an error code (and I guess you really mean to log both).
Also please no unnecessary new uses of __func__.

> +        }
> +        if ( !ret )
> +        {
> +            ret = physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_GSI, &gsi, &gsi,
> +                                   NULL);
> +            BUG_ON(ret);
> +
> +            ret = pt_irq_bind_hw_domain(gsi);
> +            BUG_ON(ret);

Why BUG_ON() (in both cases)? I don't think we're necessarily hosed
just because of one IRQ setup failure.
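
I.e. something along these lines (sketch; the message wording is mine,
and whether simply bailing out of the unmask path is enough would need
checking against the surrounding code):

    ret = physdev_map_pirq(DOMID_SELF, MAP_PIRQ_TYPE_GSI, &gsi, &gsi,
                           NULL);
    if ( ret )
    {
        gdprintk(XENLOG_WARNING, "unable to map GSI %d: %d\n", gsi, ret);
        return;
    }

    ret = pt_irq_bind_hw_domain(gsi);
    if ( ret )
    {
        gdprintk(XENLOG_WARNING, "unable to bind GSI %d: %d\n", gsi, ret);
        return;
    }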

> @@ -409,7 +432,10 @@ void vioapic_update_EOI(struct domain *d, u8 vector)
>          if ( iommu_enabled )
>          {
>              spin_unlock(&d->arch.hvm_domain.irq_lock);
> -            hvm_dpci_eoi(d, gsi, ent);
> +            if ( is_hardware_domain(d) )
> +                hvm_hw_dpci_eoi(d, gsi, ent);
> +            else
> +                hvm_dpci_eoi(d, gsi, ent);

This looks like you rather want to make the distinction inside the
called function.

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -159,26 +159,29 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
>  static void pt_irq_time_out(void *data)
>  {
>      struct hvm_pirq_dpci *irq_map = data;
> -    const struct hvm_irq_dpci *dpci;
>      const struct dev_intx_gsi_link *digl;
>  
>      spin_lock(&irq_map->dom->event_lock);
>  
> -    dpci = domain_get_irq_dpci(irq_map->dom);
> -    ASSERT(dpci);
> -    list_for_each_entry ( digl, &irq_map->digl_list, list )
> +    if ( !is_hardware_domain(irq_map->dom) )
>      {
> -        unsigned int guest_gsi = hvm_pci_intx_gsi(digl->device, digl->intx);
> -        const struct hvm_girq_dpci_mapping *girq;
> -
> -        list_for_each_entry ( girq, &dpci->girq[guest_gsi], list )
> +        const struct hvm_irq_dpci *dpci = domain_get_irq_dpci(irq_map->dom);
> +        ASSERT(dpci);

Blank line between declarations and statements please.

> +        list_for_each_entry ( digl, &irq_map->digl_list, list )
>          {
> -            struct pirq *pirq = pirq_info(irq_map->dom, girq->machine_gsi);
> +            unsigned int guest_gsi = hvm_pci_intx_gsi(digl->device, digl->intx);
> +            const struct hvm_girq_dpci_mapping *girq;
> +
> +            list_for_each_entry ( girq, &dpci->girq[guest_gsi], list )
> +            {
> +                struct pirq *pirq = pirq_info(irq_map->dom, girq->machine_gsi);
>  
> -            pirq_dpci(pirq)->flags |= HVM_IRQ_DPCI_EOI_LATCH;
> +                pirq_dpci(pirq)->flags |= HVM_IRQ_DPCI_EOI_LATCH;
> +            }
> +            hvm_pci_intx_deassert(irq_map->dom, digl->device, digl->intx);
>          }
> -        hvm_pci_intx_deassert(irq_map->dom, digl->device, digl->intx);
> -    }
> +    } else

Coding style.

> +        irq_map->flags |= HVM_IRQ_DPCI_EOI_LATCH;

And I'm afraid I can't conclude anyway why you do what you do
here, as you don't really describe your the changes in any detail.

> @@ -557,6 +560,85 @@ int pt_irq_create_bind(
>      return 0;
>  }
>  
> +int pt_irq_bind_hw_domain(int gsi)
> +{
> +    struct domain *d = hardware_domain;
> +    struct hvm_pirq_dpci *pirq_dpci;
> +    struct hvm_irq_dpci *hvm_irq_dpci;
> +    struct pirq *info;
> +    int rc;
> +
> +    if ( gsi < 0 || gsi >= d->nr_pirqs )
> +        return -EINVAL;
> +
> +restart:

Labels (if they're needed at all) indented by at least one blank
please.

And I'm afraid I'm giving up again.

Jan


* Re: [PATCH v2 25/30] xen/x86: add all PCI devices to PVHv2 Dom0
  2016-09-27 15:57 ` [PATCH v2 25/30] xen/x86: add all PCI devices to PVHv2 Dom0 Roger Pau Monne
@ 2016-10-10 13:44   ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 13:44 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -2378,6 +2378,25 @@ static int __init hvm_setup_acpi(struct domain *d)
>      return 0;
>  }
>  
> +static int __init hvm_setup_pci(struct domain *d)
> +{
> +    struct pci_dev *pdev;
> +    int rc;
> +
> +    printk("** Adding PCI devices **\n");
> +
> +    pcidevs_lock();
> +    list_for_each_entry( pdev, &d->arch.pdev_list, domain_list )
> +    {
> +        rc = hwdom_add_device(pdev);

Considering the loop construct it is clear that the devices must
already have been added to the domain by the time you get here.
Hence the title (together with there not being any description) is
at least misleading.

Jan



* Re: [PATCH v2 26/30] xen/x86: add PCIe emulation
  2016-09-27 15:57 ` [PATCH v2 26/30] xen/x86: add PCIe emulation Roger Pau Monne
  2016-10-03 10:46   ` Paul Durrant
@ 2016-10-10 13:57   ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 13:57 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> Add a new MMIO handler that traps accesses to PCIe regions, as discovered by
> Xen from the MCFG ACPI table. The handler used is the same as the one used
> for accesses to the IO PCI configuration space.

Both in the title and in the code you use PCIe when you really mean
MCFG / MMCFG. Please don't mislead people like this: PCI-X uses
extended config space too, afaik.

> --- a/xen/arch/x86/hvm/io.c
> +++ b/xen/arch/x86/hvm/io.c
> @@ -46,6 +46,8 @@
>  #include <xen/iocap.h>
>  #include <public/hvm/ioreq.h>
>  
> +#include "../x86_64/mmconfig.h"

Please don't. If declaration there are needed here, move them into
xen/include/asm-x86/.

> +static struct acpi_mcfg_allocation *pcie_find_mmcfg(unsigned long addr)
> +{
> +    int i;

unsigned int

> +static void pcie_decode_addr(unsigned long addr, unsigned int *bus,

By the time you get here, what gets passed in is not an address
anymore, so please don't name the parameter this way.

> +                             unsigned int *slot, unsigned int *func,
> +                             unsigned int *reg)
> +{
> +
> +    *bus = (addr >> 20) & 0xff;
> +    *slot = (addr >> 15) & 0x1f;
> +    *func = (addr >> 12) & 0x7;

Any reason you can't use pci_mmcfg_decode() here instead of
these various open coded numbers, or at least some derived
logic?

> +static int pcie_range(struct vcpu *v, unsigned long addr)
> +{
> +
> +    return pcie_find_mmcfg(addr) != NULL ? 1 : 0;

No need for the conditional expression.
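
Taken together, the adjustments could look like this (sketch only;
whether pci_mmcfg_decode() can be re-used here directly I haven't
checked):

    static struct acpi_mcfg_allocation *pcie_find_mmcfg(unsigned long addr)
    {
        unsigned int i;   /* the index can never be negative */
        ...
    }

    /* The value passed in is an offset into the MMCFG window, not an
     * address, hence the parameter name: */
    static void pcie_decode_offset(unsigned long offset, unsigned int *bus,
                                   unsigned int *slot, unsigned int *func,
                                   unsigned int *reg)
    {
        /* ECAM layout: bus[27:20] device[19:15] function[14:12] reg[11:0] */
        *bus = (offset >> 20) & 0xff;
        *slot = (offset >> 15) & 0x1f;
        *func = (offset >> 12) & 0x7;
        *reg = offset & 0xfff;
    }

    static int pcie_range(struct vcpu *v, unsigned long addr)
    {
        return pcie_find_mmcfg(addr) != NULL;
    }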

Jan



* Re: [PATCH v2 27/30] x86/msixtbl: disable MSI-X intercepts for domains without an ioreq server
  2016-09-27 15:57 ` [PATCH v2 27/30] x86/msixtbl: disable MSI-X intercepts for domains without an ioreq server Roger Pau Monne
@ 2016-10-10 14:18   ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 14:18 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, Paul Durrant, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> The current msixtbl intercepts only partially trap MSI-X accesses; they are
> incomplete, as there's missing logic to set up PIRQs and bind them to
> domains. Disable them for domains without at least an ioreq server (PVH).

So what if a server registers later on?

> --- a/xen/arch/x86/hvm/ioreq.c
> +++ b/xen/arch/x86/hvm/ioreq.c
> @@ -772,6 +772,17 @@ int hvm_destroy_ioreq_server(struct domain *d, ioservid_t id)
>      return rc;
>  }
>  
> +int hvm_has_ioreq_server(struct domain *d)

bool

> +{
> +    int empty;

bool

> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -24,6 +24,7 @@
>  #include <asm/hvm/irq.h>
>  #include <asm/hvm/support.h>
>  #include <xen/hvm/irq.h>
> +#include <asm/hvm/ioreq.h>
>  #include <asm/io_apic.h>

Please group with the other asm/hvm/ ones.

Jan



* Re: [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings
  2016-09-27 15:57 ` [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings Roger Pau Monne
  2016-09-30 17:36   ` George Dunlap
@ 2016-10-10 14:21   ` Jan Beulich
  2016-10-10 14:27     ` George Dunlap
  1 sibling, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 14:21 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -2793,7 +2793,7 @@ int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
>      struct domain *fdom;
>  
>      ASSERT(tdom);
> -    if ( foreigndom == DOMID_SELF || !is_pvh_domain(tdom) )
> +    if ( foreigndom == DOMID_SELF || !has_hvm_container_domain(tdom) )
>          return -EINVAL;

Can PV domains make it here? If not, dropping the predicate would
seem the better adjustment.

Jan



* Re: [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings
  2016-10-10 14:21   ` Jan Beulich
@ 2016-10-10 14:27     ` George Dunlap
  2016-10-10 14:50       ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: George Dunlap @ 2016-10-10 14:27 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne
  Cc: George Dunlap, Andrew Cooper, boris.ostrovsky, xen-devel

On 10/10/16 15:21, Jan Beulich wrote:
>>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -2793,7 +2793,7 @@ int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
>>      struct domain *fdom;
>>  
>>      ASSERT(tdom);
>> -    if ( foreigndom == DOMID_SELF || !is_pvh_domain(tdom) )
>> +    if ( foreigndom == DOMID_SELF || !has_hvm_container_domain(tdom) )
>>          return -EINVAL;
> 
> Can PV domains make it here? If not, dropping the predicate would
> seem the better adjustment.

Is there any chance that PV domains might accidentally get here because
of some other changes in the future?  If so, leaving the predicate seems
like a sensible precaution to reduce the risk of XSAs down the road,
since we're doing a load of checking already anyway. ;-)

At the moment, nobody's going to get past the "is_hardware_domain()"
check except dom0, but presumably that will change once we get driver
domains implemented.

 -George


* Re: [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings
  2016-10-10 14:27     ` George Dunlap
@ 2016-10-10 14:50       ` Jan Beulich
  2016-10-10 14:58         ` George Dunlap
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 14:50 UTC (permalink / raw)
  To: George Dunlap
  Cc: George Dunlap, Andrew Cooper, xen-devel, boris.ostrovsky,
	Roger Pau Monne

>>> On 10.10.16 at 16:27, <george.dunlap@citrix.com> wrote:
> On 10/10/16 15:21, Jan Beulich wrote:
>>>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>>> --- a/xen/arch/x86/mm/p2m.c
>>> +++ b/xen/arch/x86/mm/p2m.c
>>> @@ -2793,7 +2793,7 @@ int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
>>>      struct domain *fdom;
>>>  
>>>      ASSERT(tdom);
>>> -    if ( foreigndom == DOMID_SELF || !is_pvh_domain(tdom) )
>>> +    if ( foreigndom == DOMID_SELF || !has_hvm_container_domain(tdom) )
>>>          return -EINVAL;
>> 
>> Can PV domains make it here? If not, dropping the predicate would
>> seem the better adjustment.
> 
> Is there any chance that in the future PV domains might accidentally get
> here because of some other changes in the future?  If so, leaving the
> predicate seems like a sensible precaution to reduce the risk of XSAs
> down the road, since we're doing a load of checking already anyway. ;-)

In which case I'd still prefer to extend the ASSERT() right ahead of
the if(). But you're the maintainer of this code, so you know best.

Jan



* Re: [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings
  2016-10-10 14:50       ` Jan Beulich
@ 2016-10-10 14:58         ` George Dunlap
  0 siblings, 0 replies; 146+ messages in thread
From: George Dunlap @ 2016-10-10 14:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: George Dunlap, Andrew Cooper, xen-devel, boris.ostrovsky,
	Roger Pau Monne

On 10/10/16 15:50, Jan Beulich wrote:
>>>> On 10.10.16 at 16:27, <george.dunlap@citrix.com> wrote:
>> On 10/10/16 15:21, Jan Beulich wrote:
>>>>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>>>> --- a/xen/arch/x86/mm/p2m.c
>>>> +++ b/xen/arch/x86/mm/p2m.c
>>>> @@ -2793,7 +2793,7 @@ int p2m_add_foreign(struct domain *tdom, unsigned long fgfn,
>>>>      struct domain *fdom;
>>>>  
>>>>      ASSERT(tdom);
>>>> -    if ( foreigndom == DOMID_SELF || !is_pvh_domain(tdom) )
>>>> +    if ( foreigndom == DOMID_SELF || !has_hvm_container_domain(tdom) )
>>>>          return -EINVAL;
>>>
>>> Can PV domains make it here? If not, dropping the predicate would
>>> seem the better adjustment.
>>
>> Is there any chance that in the future PV domains might accidentally get
>> here because of some other changes in the future?  If so, leaving the
>> predicate seems like a sensible precaution to reduce the risk of XSAs
>> down the road, since we're doing a load of checking already anyway. ;-)
> 
> In which case I'd still prefer to extend the ASSERT() right ahead of
> the if(). But you're the maintainer of this code, so you know best.

ASSERT()s and BUG_ON()s are not suitable for checks with security
implications.  Unless we specifically test for PV guests making this
hypercall, both are very likely to slip entirely through any testing we
do; in which case the former would become a full security issue, and the
latter becomes a DoS.
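
To spell the distinction out, using the predicate from this patch:

    /* Debugging aid only - compiled out of release builds, so it is
     * not a security check at all: */
    ASSERT(has_hvm_container_domain(tdom));

    /* Brings the whole host down - turns a bad call into a DoS: */
    BUG_ON(!has_hvm_container_domain(tdom));

    /* A real check - fails the hypercall gracefully in all builds: */
    if ( !has_hvm_container_domain(tdom) )
        return -EINVAL;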

 -George



* Re: [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0
  2016-09-27 15:57 ` [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0 Roger Pau Monne
  2016-10-03 10:57   ` Paul Durrant
@ 2016-10-10 16:15   ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-10 16:15 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: xen-devel, boris.ostrovsky

>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> --- a/xen/arch/x86/hvm/vmsi.c
> +++ b/xen/arch/x86/hvm/vmsi.c
> @@ -40,6 +40,7 @@
>  #include <asm/current.h>
>  #include <asm/event.h>
>  #include <asm/io_apic.h>
> +#include <asm/p2m.h>
>  
>  static void vmsi_inj_irq(
>      struct vlapic *target,
> @@ -1162,3 +1163,500 @@ struct hvm_pt_handler_init hvm_pt_msi_init = {
>      .handlers = vmsi_handler,
>      .init = vmsi_group_init,
>  };
> +
> +/* MSI-X */

This comment is misleading - the entire file deals with MSI-X only.

Other than that I'm as frightened here of the code lifted from qemu
as I have expressed already on the earlier patches, and I'd really
like to understand better why all this is needed when it wasn't
needed for PVHv1, when so far I had understood that it's mainly
boot and AP bringup (as documented in hvmlite.markdown) which
makes up the difference between them. And just to avoid
misunderstandings - I'm all for replacing hacked up bits of the v1
implementation, but I don't think transplanting the other half of
the split brain qemu/Xen model into Xen will result in a single,
well functioning brain.

Jan



* Re: [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form
  2016-09-27 15:57 ` [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form Roger Pau Monne
  2016-09-28  8:24   ` Juergen Gross
  2016-10-03  8:36   ` Paul Durrant
@ 2016-10-11 10:27   ` Roger Pau Monne
  2 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-11 10:27 UTC (permalink / raw)
  To: xen-devel, konrad.wilk, boris.ostrovsky, Andrew Cooper,
	George Dunlap, Ian Jackson, Jan Beulich, Stefano Stabellini,
	Tim Deegan, Wei Liu

On Tue, Sep 27, 2016 at 05:57:08PM +0200, Roger Pau Monne wrote:
> Introduce a new %pB format specifier to print sizes (in bytes) in a
> human-readable form.

Since this was intended to be used to print the sizes of the memory 
allocations, and it has been requested that this information be removed, 
I will drop this patch as well. If someone is later interested in using 
it, feel free to pick it up.

Roger.


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-10-04 11:16       ` Jan Beulich
@ 2016-10-11 14:01         ` Roger Pau Monne
  2016-10-12 11:51           ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-11 14:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Tue, Oct 04, 2016 at 05:16:09AM -0600, Jan Beulich wrote:
> >>> On 04.10.16 at 11:12, <roger.pau@citrix.com> wrote:
> > On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
> >> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> >> > @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
> >> >          avail -= dom0_paging_pages(d, nr_pages);
> >> >      }
> >> >  
> >> > -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> >> > +    if ( is_pv_domain(d) &&
> >> > +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
> >> 
> >> Perhaps better to simply force parms->p2m_base to UNSET_ADDR
> >> earlier on?
> > 
> > p2m_base is already unconditionally set to UNSET_ADDR for PVHv2 domains, 
> > hence the added is_pv_domain check in order to make sure that PVHv2 guests 
> > don't get into that branch, which AFAICT is only relevant to PV guests.
> 
> This reads contradictory: If it's set to UNSET_ADDR, why the extra
> check?

On PVHv2 p2m_base == UNSET_ADDR already, so the extra check is to make sure 
PVHv2 guests don't execute the code inside of the if condition, which is 
for classic PV guests (note that the check is not against != UNSET_ADDR). Or 
maybe I'm missing something?

> >> > @@ -1657,15 +1679,238 @@ out:
> >> >      return rc;
> >> >  }
> >> >  
> >> > +/* Populate an HVM memory range using the biggest possible order. */
> >> > +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
> >> > +                                             uint64_t size)
> >> > +{
> >> > +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
> >> > +    unsigned int order;
> >> > +    struct page_info *page;
> >> > +    int rc;
> >> > +
> >> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
> >> > +
> >> > +    order = MAX_ORDER;
> >> > +    while ( size != 0 )
> >> > +    {
> >> > +        order = min(get_order_from_bytes_floor(size), order);
> >> > +        page = alloc_domheap_pages(d, order, memflags);
> >> > +        if ( page == NULL )
> >> > +        {
> >> > +            if ( order == 0 && memflags )
> >> > +            {
> >> > +                /* Try again without any memflags. */
> >> > +                memflags = 0;
> >> > +                order = MAX_ORDER;
> >> > +                continue;
> >> > +            }
> >> > +            if ( order == 0 )
> >> > +                panic("Unable to allocate memory with order 0!\n");
> >> > +            order--;
> >> > +            continue;
> >> > +        }
> >> 
> >> Is it not possible to utilize alloc_chunk() here?
> > 
> > Not really, alloc_chunk will do a ceil calculation of the number of needed 
> > pages, which means we end up allocating bigger chunks than needed and then 
> > freeing the excess. I prefer this approach, since we always request the 
> > exact memory that's required, so there's no need to free leftover pages.
> 
> Hmm, in that case at least some of the shared logic would be nice to
> get abstracted out.

TBH, I don't see much benefit in that. alloc_chunk works fine for PV guests 
because Xen ends up walking the list of returned pages and mapping them one 
by one. This is IMHO not what should be done for PVH guests; instead the 
caller needs to know the actual order of the allocated chunk, so it can pass 
it to guest_physmap_add_page. ATM it's a very simple function that's clear; 
if I start mixing in bits of alloc_chunk it's going to get more complex, and 
I would like to avoid that, so that the PVH Dom0 build doesn't end up like 
the current PV Dom0 build.
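
As a standalone illustration of the difference (this is deliberately not
Xen code, just the arithmetic of the loop quoted above - a 9MB region is
carved into an 8MB chunk plus a 1MB chunk, with nothing left over to
free):

    #include <stdio.h>

    /* Largest order such that (4096 << order) <= bytes. */
    static unsigned int order_floor(unsigned long long bytes)
    {
        unsigned int order = 0;

        while ( (4096ULL << (order + 1)) <= bytes )
            order++;

        return order;
    }

    int main(void)
    {
        unsigned long long size = 9ULL << 20;   /* 9MB region */
        unsigned int order = 20;                /* stand-in for MAX_ORDER */

        while ( size != 0 )
        {
            if ( order_floor(size) < order )
                order = order_floor(size);
            printf("allocate order %u (%llu bytes)\n",
                   order, 4096ULL << order);
            size -= 4096ULL << order;
        }

        return 0;
    }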

> >> > +        if ( (size & 0xffffffff) == 0 )
> >> > +            process_pending_softirqs();
> >> 
> >> That's 4Gb at a time - isn't that a little too much?
> > 
> > Hm, it's the same that's used in pvh_add_mem_mapping AFAICT. I could reduce 
> > it to 0xfffffff, but I'm also wondering if it makes sense to just call it on 
> > each iteration, since we are possibly mapping regions with big orders here.
> 
> Iteration count is all that matters here really; the size of the mapping
> wouldn't normally matter (as long as it's one of the hardware-supported
> sizes). Doing the check on every iteration may be a little much (you may
> want to check whether there's noticeable extra overhead), but doing the
> check on, say, every 64 iterations may limit overhead enough to be
> acceptable without making this more sophisticated.

Ack, I've done it using a define, so we can always change that 64 to 
something else if the need arises.
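
I.e. roughly (the name and value of the define are of course mine):

    /* Process pending softirqs every MAP_SOFTIRQ_PERIOD iterations. */
    #define MAP_SOFTIRQ_PERIOD 64

    for ( i = 0; size != 0; i++ )
    {
        /* ... allocate and map one chunk ... */
        if ( (i % MAP_SOFTIRQ_PERIOD) == 0 )
            process_pending_softirqs();
    }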

> >> > +static int __init hvm_setup_p2m(struct domain *d)
> >> > +{
> >> > +    struct vcpu *saved_current, *v = d->vcpu[0];
> >> > +    unsigned long nr_pages;
> >> > +    int i, rc, preempted;
> >> > +
> >> > +    printk("** Preparing memory map **\n");
> >> 
> >> Debugging leftover again?
> > 
> > Should this be merged with the next label, so that it reads as "Preparing 
> > and populating the memory map"?
> 
> No, I'd rather see the other(s) gone too.

Done, I've just left the initial "Building a PVH Dom0" message.
 
> >> > +    /*
> >> > +     * Subtract one page for the EPT identity page table and two pages
> >> > +     * for the MADT replacement.
> >> > +     */
> >> > +    nr_pages = compute_dom0_nr_pages(d, NULL, 0) - 3;
> >> 
> >> How do you know the MADT replacement requires two pages? Isn't
> >> that CPU-count dependent? And doesn't the partial page used for
> >> the TSS also need accounting for here?
> > 
> > Yes, it's CPU-count dependent. This is just an approximation: since we only 
> > support up to 256 CPUs on HVM guests, and each Processor Local APIC entry is 
> > 8 bytes, the CPU-related data is going to use up to 2048 bytes, which still 
> > leaves plenty of space for the IO APIC and the Interrupt Source Override 
> > entries. We request two pages in case the original MADT crosses a page 
> > boundary. FWIW, I could also fetch the original MADT size earlier and use 
> > that as the upper bound here for the memory reservation.
> 
> That wouldn't help in case someone wants more vCPU-s than there
> are pCPU-s. And baking in another assumption of there being <=
> 128 vCPU-s when there's already work being done to eliminate that
> limit is likely not too good an idea.

Right, I've now removed all this, since for the MADT we are going to steal 
RAM pages.
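
(For reference, the arithmetic behind the dropped two-page estimate:
256 vCPUs x 8 bytes per Processor Local APIC entry = 2048 bytes, which
left the rest of the first 4096-byte page for the MADT header, IO APIC
and Interrupt Source Override entries, with the second page only there
to cover a possible page-boundary crossing.)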

> > In the RFC series we also spoke about placing the MADT in a different 
> > position that the native one, which would mean that we would end up stealing 
> > some space from a RAM region in order to place it, so that we wouldn't have 
> > to do this accounting.
> 
> Putting the new MADT at the same address as the old one won't work
> anyway, again because possibly vCPU-s > pCPU-s.

Yes, although from my tests doing CPU over-allocation on HVM guests works 
very badly.

> >> > +    process_pending_softirqs();
> >> 
> >> Why, outside of any loop?
> > 
> > It's the same that's done in construct_dom0_pv, so I thought it was a 
> > good idea to drain any pending softirqs before starting the domain build.
> 
> Perhaps in that case it should be pulled out of there into the
> wrapper?

Yes, good point.

Thanks, Roger.


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-09-30 15:52   ` Jan Beulich
  2016-10-04  9:12     ` Roger Pau Monne
@ 2016-10-11 14:06     ` Roger Pau Monne
  2016-10-12 11:58       ` Jan Beulich
  1 sibling, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-11 14:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

It seems like I've forgotten to answer one of your comments in a previous 
email, sorry.

On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
> > +     * to PAGE_SIZE.
> > +     */
> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
> > +    {
> > +        entry = &d->arch.e820[d->arch.nr_e820 - i];
> > +        if ( entry->type != E820_RAM ||
> > +             entry->size < PAGE_SIZE + HVM_VM86_TSS_SIZE )
> > +            continue;
> > +
> > +        entry->size -= PAGE_SIZE + HVM_VM86_TSS_SIZE;
> > +        gaddr = entry->addr + entry->size;
> > +        break;
> > +    }
> > +
> > +    if ( gaddr == 0 || gaddr < MB(1) )
> > +    {
> > +        printk("Unable to find memory to stash the identity map and TSS\n");
> > +        return -ENOMEM;
> 
> One function up you panic() on error - please be consistent. Also for
> one of the other patches I think we figured that the TSS isn't really
> required, so please only warn in that case.

The allocation is done together for the ident PT and the TSS, and while 
the TSS is not mandatory, the identity page-table is (or else the Dom0 
kernel won't boot at all on this kind of system). In any case, it's almost 
impossible for this allocation to fail (because Xen is not actually 
allocating memory, just stealing a part of a RAM region that has already 
been populated).

Roger.


* Re: [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2
  2016-10-06 15:14   ` Jan Beulich
@ 2016-10-11 15:02     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-11 15:02 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Oct 06, 2016 at 09:14:18AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > +    start_info.modlist_paddr = last_addr;
> > +    start_info.nr_modules = 1;
> > +    last_addr += sizeof(mod);
> 
> How can this be unconditionally 1? It is certainly possible to boot
> without initrd.

Yes, it seems like I left this here and forgot to finish it, so an initrd 
was assumed to be present (here and in the code further up). It should now 
be able to handle booting Dom0 without any initrd.
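
In essence (sketch; the field names follow the quoted fragment, the rest
is assumed):

    if ( initrd != NULL )
    {
        /* ... copy the initrd and fill in "mod" ... */
        start_info.modlist_paddr = last_addr;
        start_info.nr_modules = 1;
        last_addr += sizeof(mod);
    }
    else
        start_info.nr_modules = 0;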

Roger.


* Re: [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs
  2016-10-06 15:20   ` Jan Beulich
@ 2016-10-12 11:06     ` Roger Pau Monne
  2016-10-12 11:32       ` Andrew Cooper
  2016-10-12 12:02       ` Jan Beulich
  0 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-12 11:06 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Oct 06, 2016 at 09:20:07AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > The logic used to setup the CPUID leaves is extremely simplistic (and
> > probably wrong for hardware different than mine). I'm not sure what's the
> > best way to deal with this, the code that currently sets the CPUID leaves
> > for HVM guests lives in libxc, maybe moving it xen/common would be better?
> 
> Yeah, a pre-populated array of leaves certainly won't do.

This is what current HVM guests use, and TBH, I would prefer not to 
diverge from HVM. Would it make sense to leave this as-is until all this 
cpuid stuff is fixed? (IIRC Andrew is still working on this.)

> > +    rc = arch_set_info_hvm_guest(v, &cpu_ctx);
> > +    if ( rc )
> > +    {
> > +        printk("Unable to setup Dom0 BSP context: %d\n", rc);
> > +        return rc;
> > +    }
> > +    clear_bit(_VPF_down, &v->pause_flags);
> 
> Even if it may not matter right away, I think you want to clear this
> flag later, after having completed all setup.

Right, I've now moved the clear_bit to the end of construct_dom0_hvm.

> > +    for ( i = 0; i < ARRAY_SIZE(cpuid_leaves); i++ )
> > +    {
> > +        d->arch.cpuids[i].input[0] = cpuid_leaves[i].index;
> > +        d->arch.cpuids[i].input[1] = cpuid_leaves[i].count;
> > +        if ( d->arch.cpuids[i].input[1] == XEN_CPUID_INPUT_UNUSED )
> > +            cpuid(d->arch.cpuids[i].input[0], &d->arch.cpuids[i].eax,
> > +                  &d->arch.cpuids[i].ebx, &d->arch.cpuids[i].ecx,
> > +                  &d->arch.cpuids[i].edx);
> > +        else
> > +            cpuid_count(d->arch.cpuids[i].input[0], d->arch.cpuids[i].input[1],
> > +                        &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
> > +                        &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);
> 
> Why this if/else? It is always fine to use cpuid_count().

Done, now it's cpuid_count.
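
For leaves that don't consume %ecx the subleaf input doesn't change the
result, so the if/else collapses into a single call (sketch):

    cpuid_count(d->arch.cpuids[i].input[0],
                d->arch.cpuids[i].input[1] == XEN_CPUID_INPUT_UNUSED
                ? 0 : d->arch.cpuids[i].input[1],
                &d->arch.cpuids[i].eax, &d->arch.cpuids[i].ebx,
                &d->arch.cpuids[i].ecx, &d->arch.cpuids[i].edx);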

Roger.


* Re: [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs
  2016-10-12 11:06     ` Roger Pau Monne
@ 2016-10-12 11:32       ` Andrew Cooper
  2016-10-12 12:02       ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Andrew Cooper @ 2016-10-12 11:32 UTC (permalink / raw)
  To: Roger Pau Monne, Jan Beulich; +Cc: xen-devel, boris.ostrovsky

On 12/10/16 12:06, Roger Pau Monne wrote:
> On Thu, Oct 06, 2016 at 09:20:07AM -0600, Jan Beulich wrote:
>>>>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>>> The logic used to setup the CPUID leaves is extremely simplistic (and
>>> probably wrong for hardware different than mine). I'm not sure what's the
>>> best way to deal with this, the code that currently sets the CPUID leaves
>>> for HVM guests lives in libxc, maybe moving it xen/common would be better?
>> Yeah, a pre-populated array of leaves certainly won't do.
> This is what current HVM guests use, and TBH, I would prefer not to 
> diverge from HVM. Would it make sense to leave this as-is until all this 
> cpuid stuff is fixed? (IIRC Andrew is still working on this.)

My CPUID work will remove the need for any of this (and indeed, it is a
prerequisite for an HVM Control domain to build further HVM domains).

At boot, where we currently calculate the featuresets, we will also
calculate maximum full policies.  domain_create() will clone the
appropriate default policy (pv or hvm) as a starting point.

A regular domU will have the toolstack optionally reduce the policy via
the domctl interface, but in the absence of any changes, the domain will
get the maximum supported featureset available on the hardware.
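
In pseudo-code, the intended flow is roughly this (none of the names
below are final):

    /* Computed once at boot, alongside the featuresets: */
    struct cpuid_policy pv_max_policy, hvm_max_policy;

    /* domain_create() starts every domain from the appropriate maximum: */
    d->cpuid_policy = is_hvm_domain(d) ? hvm_max_policy : pv_max_policy;

    /* The toolstack may then shrink the policy via a domctl; with no
     * changes the domain keeps the hardware's maximum featureset. */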

~Andrew


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-10-11 14:01         ` Roger Pau Monne
@ 2016-10-12 11:51           ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-12 11:51 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 11.10.16 at 16:01, <roger.pau@citrix.com> wrote:
> On Tue, Oct 04, 2016 at 05:16:09AM -0600, Jan Beulich wrote:
>> >>> On 04.10.16 at 11:12, <roger.pau@citrix.com> wrote:
>> > On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
>> >> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> >> > @@ -336,7 +343,8 @@ static unsigned long __init compute_dom0_nr_pages(
>> >> >          avail -= dom0_paging_pages(d, nr_pages);
>> >> >      }
>> >> >  
>> >> > -    if ( (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
>> >> > +    if ( is_pv_domain(d) &&
>> >> > +         (parms->p2m_base == UNSET_ADDR) && (dom0_nrpages <= 0) &&
>> >> 
>> >> Perhaps better to simply force parms->p2m_base to UNSET_ADDR
>> >> earlier on?
>> > 
>> > p2m_base is already unconditionally set to UNSET_ADDR for PVHv2 domains, 
>> > hence the added is_pv_domain check in order to make sure that PVHv2 guests 
>> > don't get into that branch, which AFAICT is only relevant to PV guests.
>> 
>> This reads contradictory: If it's set to UNSET_ADDR, why the extra
>> check?
> 
> On PVHv2 p2m_base == UNSET_ADDR already, so the extra check is to make sure 
> PVHv2 guests don't execute the code inside of the if condition, which is 
> for classic PV guests (note that the check is not against != UNSET_ADDR). Or 
> maybe I'm missing something?

No, I have to apologize - I read things the wrong way round.
Thanks for bearing with me.

>> >> > @@ -1657,15 +1679,238 @@ out:
>> >> >      return rc;
>> >> >  }
>> >> >  
>> >> > +/* Populate an HVM memory range using the biggest possible order. */
>> >> > +static void __init hvm_populate_memory_range(struct domain *d, uint64_t start,
>> >> > +                                             uint64_t size)
>> >> > +{
>> >> > +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
>> >> > +    unsigned int order;
>> >> > +    struct page_info *page;
>> >> > +    int rc;
>> >> > +
>> >> > +    ASSERT(IS_ALIGNED(size, PAGE_SIZE) && IS_ALIGNED(start, PAGE_SIZE));
>> >> > +
>> >> > +    order = MAX_ORDER;
>> >> > +    while ( size != 0 )
>> >> > +    {
>> >> > +        order = min(get_order_from_bytes_floor(size), order);
>> >> > +        page = alloc_domheap_pages(d, order, memflags);
>> >> > +        if ( page == NULL )
>> >> > +        {
>> >> > +            if ( order == 0 && memflags )
>> >> > +            {
>> >> > +                /* Try again without any memflags. */
>> >> > +                memflags = 0;
>> >> > +                order = MAX_ORDER;
>> >> > +                continue;
>> >> > +            }
>> >> > +            if ( order == 0 )
>> >> > +                panic("Unable to allocate memory with order 0!\n");
>> >> > +            order--;
>> >> > +            continue;
>> >> > +        }
>> >> 
>> >> Is it not possible to utilize alloc_chunk() here?
>> > 
>> > Not really, alloc_chunk will do a ceil calculation of the number of needed 
>> > pages, which means we end up allocating bigger chunks than needed and then 
>> > freeing the excess. I prefer this approach, since we always request the 
>> > exact memory that's required, so there's no need to free leftover pages.
>> 
>> Hmm, in that case at least some of the shared logic would be nice to
>> get abstracted out.
> 
> TBH, I don't see much benefit in that. alloc_chunk works fine for PV guests 
> because Xen ends up walking the list of returned pages and mapping them one 
> by one. This is IMHO not what should be done for PVH guests; instead the 
> caller needs to know the actual order of the allocated chunk, so it can pass 
> it to guest_physmap_add_page. ATM it's a very simple function that's clear; 
> if I start mixing in bits of alloc_chunk it's going to get more complex, and 
> I would like to avoid that, so that the PVH Dom0 build doesn't end up like 
> the current PV Dom0 build.

Hmm - I did say "abstract out", not "re-use". The way memflags
get set and decreasing orders get tried in your code looks suspiciously
similar to what alloc_chunk() does.

>> > In the RFC series we also spoke about placing the MADT in a different 
>> > position that the native one, which would mean that we would end up stealing 
>> > some space from a RAM region in order to place it, so that we wouldn't have 
>> > to do this accounting.
>> 
>> Putting the new MADT at the same address as the old one won't work
>> anyway, again because possibly vCPU-s > pCPU-s.
> 
> Yes, although from my tests doing CPU over-allocation on HVM guests works 
> very badly.

Interesting. I didn't try recently, but I recall having tried a longer
while ago without seeing severe issues.

Jan


* Re: [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map
  2016-10-11 14:06     ` Roger Pau Monne
@ 2016-10-12 11:58       ` Jan Beulich
  0 siblings, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-12 11:58 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 11.10.16 at 16:06, <roger.pau@citrix.com> wrote:
> On Fri, Sep 30, 2016 at 09:52:56AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > +     * VM86 TSS. Note that after this not all e820 regions will be aligned
>> > +     * to PAGE_SIZE.
>> > +     */
>> > +    for ( i = 1; i <= d->arch.nr_e820; i++ )
>> > +    {
>> > +        entry = &d->arch.e820[d->arch.nr_e820 - i];
>> > +        if ( entry->type != E820_RAM ||
>> > +             entry->size < PAGE_SIZE + HVM_VM86_TSS_SIZE )
>> > +            continue;
>> > +
>> > +        entry->size -= PAGE_SIZE + HVM_VM86_TSS_SIZE;
>> > +        gaddr = entry->addr + entry->size;
>> > +        break;
>> > +    }
>> > +
>> > +    if ( gaddr == 0 || gaddr < MB(1) )
>> > +    {
>> > +        printk("Unable to find memory to stash the identity map and TSS\n");
>> > +        return -ENOMEM;
>> 
>> One function up you panic() on error - please be consistent. Also for
>> one of the other patches I think we figured that the TSS isn't really
>> required, so please only warn in that case.
> 
> The allocation is done together for the ident PT and the TSS, and while
> the TSS is not mandatory, the identity page-table is (or else the Dom0
> kernel won't boot at all on this kind of system). In any case, it's almost
> impossible for this allocation to fail (because Xen is not actually
> allocating memory, just stealing a part of a RAM region that has already
> been populated).

All fine, but I continue to think errors should be dealt with in a
consistent manner (no matter how [un]likely they are), and
warnings better get issued when there's any meaningful impact
due to whatever triggers the warning. Albeit - having looked at
the full patch again - it looks like I was wrong with naming this
"warning": Both here and further down you return an error if any
of the steps failed. The allocation being done in one go is fine to
be an error; failure of the mapping of the non-required TSS, otoh,
shouldn't cause the loading of Dom0 to fail (and I think that is
where I've been unsure the printk() is warranted).
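
Roughly, the split being suggested would look like this (hypothetical
sketch; the helper names are illustrative, not taken from the patch):

    /* The identity page table is mandatory: fail Dom0 construction. */
    rc = pvh_setup_identity_pt(d, gaddr);
    if ( rc )
    {
        printk("Unable to set up Dom0 identity page table: %d\n", rc);
        return rc;
    }

    /* The VM86 TSS is optional: warn, but keep loading Dom0. */
    rc = pvh_setup_vm86_tss(d, gaddr + PAGE_SIZE);
    if ( rc )
        printk(XENLOG_WARNING
               "Unable to set up VM86 TSS (%d), continuing without it\n",
               rc);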

Jan



* Re: [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs
  2016-10-12 11:06     ` Roger Pau Monne
  2016-10-12 11:32       ` Andrew Cooper
@ 2016-10-12 12:02       ` Jan Beulich
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-12 12:02 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 12.10.16 at 13:06, <roger.pau@citrix.com> wrote:
> On Thu, Oct 06, 2016 at 09:20:07AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > The logic used to setup the CPUID leaves is extremely simplistic (and
>> > probably wrong for hardware different than mine). I'm not sure what's the
>> > best way to deal with this, the code that currently sets the CPUID leaves
>> > for HVM guests lives in libxc, maybe moving it xen/common would be better?
>> 
>> Yeah, a pre-populated array of leaves certainly won't do.
> 
> This is what current HVM guests use, and TBH, I would prefer not to
> diverge from HVM. Would it make sense to leave this as-is, until all this
> cpuid stuff is fixed? (IIRC Andrew is still working on this).

Leaving it as is makes little sense to me. What would make sense
is to make Andrew's work a prereq for this patch.

Jan



* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-06 15:40   ` Jan Beulich
  2016-10-06 15:48     ` Andrew Cooper
@ 2016-10-12 15:35     ` Roger Pau Monne
  2016-10-12 15:55       ` Jan Beulich
  1 sibling, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-12 15:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Oct 06, 2016 at 09:40:50AM -0600, Jan Beulich wrote:
> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> > FWIW, I think that the current approach that I've used in order to craft the
> > MADT is not the best one, IMHO it would be better to place the MADT at the
> > end of the E820_ACPI region (expanding it's size one page), and modify the
> > XSDT/RSDT in order to point to it, that way we avoid shadowing any other
> > ACPI data that might be at the same page as the native MADT (and that needs
> > to be modified by Dom0).
> 
> I agree with the idea of placing MADT elsewhere, but I don't think you
> can rely on E820_ACPI to have room to grow into right after its end.

Right, I think picking some memory from a RAM region and marking it as 
E820_ACPI is the best approach.
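
Stealing that memory could reuse the same e820 walk the patch already does
for the VM86 TSS; a rough sketch (paraphrased from that hunk, with the
insertion of the new E820_ACPI entry left out):

    /* Shrink a RAM region from the top and reuse the freed range. */
    for ( i = 1; i <= d->arch.nr_e820; i++ )
    {
        entry = &d->arch.e820[d->arch.nr_e820 - i];
        if ( entry->type != E820_RAM || entry->size < acpi_size )
            continue;

        entry->size -= acpi_size;
        acpi_addr = entry->addr + entry->size;
        break;
    }
    /* A new E820_ACPI entry covering [acpi_addr, acpi_addr + acpi_size)
     * would then be inserted into d->arch.e820 (not shown). */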

> > @@ -50,6 +53,8 @@ static long __initdata dom0_max_nrpages = LONG_MAX;
> >  #define HVM_VM86_TSS_SIZE   128
> >  
> >  static unsigned int __initdata hvm_mem_stats[MAX_ORDER + 1];
> > +static unsigned int __initdata acpi_intr_overrrides = 0;
> > +static struct acpi_madt_interrupt_override __initdata *intsrcovr = NULL;
> 
> Pointless initializers.

Removed.

> > +static void __init acpi_zap_table_signature(char *name)
> > +{
> > +    struct acpi_table_header *table;
> > +    acpi_status status;
> > +    union {
> > +        char str[ACPI_NAME_SIZE];
> > +        uint32_t bits;
> > +    } signature;
> > +    char tmp;
> > +    int i;
> > +
> > +    status = acpi_get_table(name, 0, &table);
> > +    if ( ACPI_SUCCESS(status) )
> > +    {
> > +        memcpy(&signature.str[0], &table->signature[0], ACPI_NAME_SIZE);
> > +        for ( i = 0; i < ACPI_NAME_SIZE / 2; i++ )
> > +        {
> > +            tmp = signature.str[ACPI_NAME_SIZE - i - 1];
> > +            signature.str[ACPI_NAME_SIZE - i - 1] = signature.str[i];
> > +            signature.str[i] = tmp;
> > +        }
> > +        write_atomic((uint32_t*)&table->signature[0], signature.bits);
> > +    }
> > +}
> 
> Together with moving MADT elsewhere we should also move
> XSDT/RSDT, at which point we can simply avoid copying the
> pointers for tables we don't want Dom0 to see (and down the
> road we also have the option of adding tables).
> 
> The downside to both approaches is that this once again is a
> black-listing model, i.e. new structure types we may need to
> also suppress will remain visible to Dom0 until a patched
> hypervisor would become available.

Maybe we could do whitelisting instead? I can see that at least the 
following tables should be available to Dom0 if present, but TBH, it's hard 
to tell:

MADT, DSDT, FADT, SSDT, FACS, SBST, NFIT, MCFG, TCPA. Then ECDT, CPEP and
RASF also seem fine to make available to Dom0, but I'm less sure about those.

But I agree that crafting a custom XSDT/RSDT is the best option here.
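
As a rough illustration of what that crafting could look like (hypothetical
sketch; the whitelist contents and helper are illustrative only, while the
ACPICA names are the real ones):

    static const char *const dom0_table_whitelist[] = {
        ACPI_SIG_FADT, ACPI_SIG_SSDT, ACPI_SIG_MCFG,
        /* ... the crafted MADT would be appended separately ... */
    };

    static bool __init dom0_table_allowed(const char *sig)
    {
        unsigned int i;

        for ( i = 0; i < ARRAY_SIZE(dom0_table_whitelist); i++ )
            if ( !strncmp(sig, dom0_table_whitelist[i], ACPI_NAME_SIZE) )
                return true;

        return false;
    }

    /* When filling the crafted XSDT, only copy whitelisted pointers. */
    for ( i = 0; i < acpi_gbl_root_table_list.count; i++ )
        if ( dom0_table_allowed(
                 acpi_gbl_root_table_list.tables[i].signature.ascii) )
            xsdt->table_offset_entry[j++] =
                acpi_gbl_root_table_list.tables[i].address;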
 
> > +                   pfn, pfn + nr_pages);
> > +            return rc;
> > +        }
> > +    }
> > +    /*
> > +     * Since most memory maps provided by hardware are wrong, make sure each
> > +     * top-level table is properly mapped into Dom0.
> > +     */
> 
> Please be more specific here - wrong in which way? Not all ACPI tables
> living in one of the two E820 type regions? But see also below.
> 
> In any event you need to be more careful here: Mapping ordinary RAM
> 1:1 into Dom0 can't end well. Also, once working with a cloned XSDT you
> won't need to cover tables here which have got hidden from Dom0.

I've found systems where some ACPI tables reside in memory holes or reserved 
regions. I don't think I've found a system where an ACPI table would reside 
in a RAM region, but I agree that it's worth adding a check here to make 
sure.
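
Such a check could be as simple as (hypothetical sketch; page_is_ram_type()
is the existing Xen helper):

    /* Refuse to identity map table pages that overlap conventional RAM. */
    for ( j = pfn; j < pfn + nr_pages; j++ )
        if ( page_is_ram_type(j, RAM_TYPE_CONVENTIONAL) )
            return -EADDRNOTAVAIL;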

> > +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> > +    {
> > +        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> > +        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> > +                                PAGE_SIZE);
> > +        rc = modify_mmio_11(d, pfn, nr_pages, true);
> > +        if ( rc )
> > +        {
> > +            printk(
> > +                "Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",
> > +                   pfn, pfn + nr_pages);
> > +            return rc;
> > +        }
> > +    }
> > +
> > +    /*
> > +     * Special treatment for memory < 1MB:
> > +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
> 
> Copy? What if some of it needs to get modified to interact with
> firmware? I think you'd be better off mapping everything Xen
> doesn't use into Dom0, and only mapping fresh RAM pages
> over regions Xen does use (namely the trampoline).

Hm, that was my first approach (mapping the whole first MB into Dom0), but
sadly it doesn't seem to work because FreeBSD at least places the AP boot
trampoline there, and since APs are launched in 16-bit mode, the emulator
cannot handle executing memory from MMIO regions and crashes the domain.
That's why I had to map and copy data from RAM regions below 1MB. The BDA
is all static data AFAICT, and the EBDA should reside in a reserved
region (or at least it does on my systems).

Might it be possible to solve this by identity mapping the first 1MB, and 
marking the RAM regions there as p2m_ram_rw? Or would that create even 
further problems?

> > +     *  - Map any reserved regions as 1:1.
> 
> Why would reserved regions need such treatment only when below
> 1Mb?

The video RAM for the VGA display needs to be mapped into Dom0, and that's 
in the reserved region below 1MB (usually starting at 0xA0000).

I would like to avoid mapping all the reserved regions into Dom0 by default
(and into the IOMMU), but this might be needed on some systems as a workaround
(especially those with missing or wrong RMRR tables).

> > +    acpi_get_table_phys(ACPI_SIG_MADT, 0, &madt_addr[0], &size);
> > +    if ( !madt_addr[0] )
> > +    {
> > +        printk("Unable to find ACPI MADT table\n");
> > +        return -EINVAL;
> > +    }
> > +    if ( size > PAGE_SIZE )
> > +    {
> > +        printk("MADT table is bigger than PAGE_SIZE, aborting\n");
> > +        return -EINVAL;
> > +    }
> 
> This restriction for sure needs to go away.

Sure.

> > +    acpi_get_table_phys(ACPI_SIG_MADT, 2, &madt_addr[1], &size);
> 
> Why 2 (and not 1) when the first invocation used 0? But this is not
> going to be needed anyway with the alternative model.
> 
> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> > +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> > +    io_apic->header.length = sizeof(*io_apic);
> > +    io_apic->id = 1;
> > +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
> 
> I think you need to make as many IO-APICs available to Dom0 as
> there are hardware ones.

Right, I've been wondering about this, and in that case I need to expand the
IO APIC emulation code so that Xen is able to emulate more than one IO APIC.

Can I ask why you think this is needed? If the number of pins in the
multiple IO APIC case doesn't exceed 48 (which is what the emulated IO APIC
provides), shouldn't that be enough?

> > +    if ( dom0_max_vcpus() > num_online_cpus() )
> > +    {
> > +        printk("CPU overcommit is not supported for Dom0\n");
> 
> ???

This is going away with the new MADT approach.

Thanks, Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-12 15:35     ` Roger Pau Monne
@ 2016-10-12 15:55       ` Jan Beulich
  2016-10-26 11:35         ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-12 15:55 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 12.10.16 at 17:35, <roger.pau@citrix.com> wrote:
> On Thu, Oct 06, 2016 at 09:40:50AM -0600, Jan Beulich wrote:
>> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
>> > +static void __init acpi_zap_table_signature(char *name)
>> > +{
>> > +    struct acpi_table_header *table;
>> > +    acpi_status status;
>> > +    union {
>> > +        char str[ACPI_NAME_SIZE];
>> > +        uint32_t bits;
>> > +    } signature;
>> > +    char tmp;
>> > +    int i;
>> > +
>> > +    status = acpi_get_table(name, 0, &table);
>> > +    if ( ACPI_SUCCESS(status) )
>> > +    {
>> > +        memcpy(&signature.str[0], &table->signature[0], ACPI_NAME_SIZE);
>> > +        for ( i = 0; i < ACPI_NAME_SIZE / 2; i++ )
>> > +        {
>> > +            tmp = signature.str[ACPI_NAME_SIZE - i - 1];
>> > +            signature.str[ACPI_NAME_SIZE - i - 1] = signature.str[i];
>> > +            signature.str[i] = tmp;
>> > +        }
>> > +        write_atomic((uint32_t*)&table->signature[0], signature.bits);
>> > +    }
>> > +}
>> 
>> Together with moving MADT elsewhere we should also move
>> XSDT/RSDT, at which point we can simply avoid copying the
>> pointers for tables we don't want Dom0 to see (and down the
>> road we also have the option of adding tables).
>> 
>> The downside to both approaches is that this once again is a
>> black-listing model, i.e. new structure types we may need to
>> also suppress will remain visible to Dom0 until a patched
>> hypervisor would become available.
> 
> Maybe we could do whitelisting instead? I can see that at least the 
> following tables should be available to Dom0 if present, but TBH, it's hard 
> to tell:

Taking an abstract perspective I agree with Andrew that we should
be whitelisting here. However, as you already see from the list you
provided (which afaict is far from complete wrt ACPI 6.1), this may
become cumbersome already initially, not to speak of down the road.

>> > +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
>> > +    {
>> > +        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
>> > +        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
>> > +                                PAGE_SIZE);
>> > +        rc = modify_mmio_11(d, pfn, nr_pages, true);
>> > +        if ( rc )
>> > +        {
>> > +            printk(
>> > +                "Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",
>> > +                   pfn, pfn + nr_pages);
>> > +            return rc;
>> > +        }
>> > +    }
>> > +
>> > +    /*
>> > +     * Special treatment for memory < 1MB:
>> > +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
>> 
>> Copy? What if some of it needs to get modified to interact with
>> firmware? I think you'd be better off mapping everything Xen
>> doesn't use into Dom0, and only mapping fresh RAM pages
>> over regions Xen does use (namely the trampoline).
> 
> Hm, that was my first approach (mapping the whole first MB into Dom0), but
> sadly it doesn't seem to work because FreeBSD at least places the AP boot
> trampoline there, and since APs are launched in 16-bit mode, the emulator
> cannot handle executing memory from MMIO regions and crashes the domain.
> That's why I had to map and copy data from RAM regions below 1MB. The BDA
> is all static data AFAICT, and the EBDA should reside in a reserved
> region (or at least it does on my systems).

I'm afraid there are systems where the EBDA is not marked reserved.
But maybe there are no current (64-bit capable) ones of that sort.

> Might it be possible to solve this by identity mapping the first 1MB, and 
> marking the RAM regions there as p2m_ram_rw? Or would that create even 
> further problems?

Hmm, not sure - the range below 1Mb is marked as MMIO in
frame_table[], so there would be a (benign?) conflict at least there.

>> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
>> > +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
>> > +    io_apic->header.length = sizeof(*io_apic);
>> > +    io_apic->id = 1;
>> > +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
>> 
>> I think you need to make as many IO-APICs available to Dom0 as
>> there are hardware ones.
> 
> Right, I've been wondering about this, and in that case I need to expand the
> IO APIC emulation code so that Xen is able to emulate more than one IO APIC.
> 
> Can I ask why you think this is needed? If the number of pins in the
> multiple IO APIC case doesn't exceed 48 (which is what the emulated IO APIC
> provides), shouldn't that be enough?

The number of pins can easily be larger. And I think the mapping
code would end up simpler if there was a 1:1 relationship between
physical and virtual IO-APICs. I've just now checked one of my
larger (but older) systems - it has 5 IO-APICs with a total of 120 pins.

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-12 15:55       ` Jan Beulich
@ 2016-10-26 11:35         ` Roger Pau Monne
  2016-10-26 14:10           ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-26 11:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
> >>> On 12.10.16 at 17:35, <roger.pau@citrix.com> wrote:
> > On Thu, Oct 06, 2016 at 09:40:50AM -0600, Jan Beulich wrote:
> >> >>> On 27.09.16 at 17:57, <roger.pau@citrix.com> wrote:
> >> > +static void __init acpi_zap_table_signature(char *name)
> >> > +{
> >> > +    struct acpi_table_header *table;
> >> > +    acpi_status status;
> >> > +    union {
> >> > +        char str[ACPI_NAME_SIZE];
> >> > +        uint32_t bits;
> >> > +    } signature;
> >> > +    char tmp;
> >> > +    int i;
> >> > +
> >> > +    status = acpi_get_table(name, 0, &table);
> >> > +    if ( ACPI_SUCCESS(status) )
> >> > +    {
> >> > +        memcpy(&signature.str[0], &table->signature[0], ACPI_NAME_SIZE);
> >> > +        for ( i = 0; i < ACPI_NAME_SIZE / 2; i++ )
> >> > +        {
> >> > +            tmp = signature.str[ACPI_NAME_SIZE - i - 1];
> >> > +            signature.str[ACPI_NAME_SIZE - i - 1] = signature.str[i];
> >> > +            signature.str[i] = tmp;
> >> > +        }
> >> > +        write_atomic((uint32_t*)&table->signature[0], signature.bits);
> >> > +    }
> >> > +}
> >> 
> >> Together with moving MADT elsewhere we should also move
> >> XSDT/RSDT, at which point we can simply avoid copying the
> >> pointers for tables we don't want Dom0 to see (and down the
> >> road we also have the option of adding tables).
> >> 
> >> The downside to both approaches is that this once again is a
> >> black-listing model, i.e. new structure types we may need to
> >> also suppress will remain visible to Dom0 until a patched
> >> hypervisor would become available.
> > 
> > Maybe we could do whitelisting instead? I can see that at least the 
> > following tables should be available to Dom0 if present, but TBH, it's hard 
> > to tell:
> 
> Taking an abstract perspective I agree with Andrew that we should
> be whitelisting here. However, as you already see from the list you
> provided (which afaict is far from complete wrt ACPI 6.1), this may
> become cumbersome already initially, not to speak of down the road.

I've initially used a black-listing approach. We can always change this later 
on.

So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
from the guest memory map and marked as E820_ACPI, which means that the new 
RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
rsdp_paddr provided in the start info, or else it's going to access the 
native RSDP.
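
For the crafted RSDP itself, that amounts to something like this
(hypothetical sketch using the ACPICA structure/field names; xsdt_paddr
would be the address of the crafted XSDT inside the stolen E820_ACPI range):

    struct acpi_table_rsdp rsdp = {
        .signature = ACPI_SIG_RSDP,          /* "RSD PTR " */
        .revision = 2,                       /* ACPI 2.0+ layout */
        .rsdt_physical_address = 0,          /* no RSDT provided */
        .xsdt_physical_address = xsdt_paddr, /* crafted XSDT */
        .length = sizeof(rsdp),
    };

    rsdp.checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
                                      ACPI_RSDP_REV0_SIZE);
    rsdp.extended_checksum -= acpi_tb_checksum(ACPI_CAST_PTR(u8, &rsdp),
                                               sizeof(rsdp));
    /* Then copy it into the E820_ACPI range and expose that address as
     * rsdp_paddr in the start info. */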

> >> > +    for( i = 0; i < acpi_gbl_root_table_list.count; i++ )
> >> > +    {
> >> > +        pfn = PFN_DOWN(acpi_gbl_root_table_list.tables[i].address);
> >> > +        nr_pages = DIV_ROUND_UP(acpi_gbl_root_table_list.tables[i].length,
> >> > +                                PAGE_SIZE);
> >> > +        rc = modify_mmio_11(d, pfn, nr_pages, true);
> >> > +        if ( rc )
> >> > +        {
> >> > +            printk(
> >> > +                "Failed to map ACPI region %#lx - %#lx into Dom0 memory map\n",
> >> > +                   pfn, pfn + nr_pages);
> >> > +            return rc;
> >> > +        }
> >> > +    }
> >> > +
> >> > +    /*
> >> > +     * Special treatment for memory < 1MB:
> >> > +     *  - Copy the data in e820 regions marked as RAM (BDA, EBDA...).
> >> 
> >> Copy? What if some of it needs to get modified to interact with
> >> firmware? I think you'd be better off mapping everything Xen
> >> doesn't use into Dom0, and only mapping fresh RAM pages
> >> over regions Xen does use (namely the trampoline).
> > 
> > Hm, that was my first approach (mapping the whole first MB into Dom0), but
> > sadly it doesn't seem to work because FreeBSD at least places the AP boot
> > trampoline there, and since APs are launched in 16-bit mode, the emulator
> > cannot handle executing memory from MMIO regions and crashes the domain.
> > That's why I had to map and copy data from RAM regions below 1MB. The BDA
> > is all static data AFAICT, and the EBDA should reside in a reserved
> > region (or at least it does on my systems).
> 
> I'm afraid there are systems where the EBDA is not marked reserved.
> But maybe there are no current (64-bit capable) ones of that sort.

Hm, I would say that we leave this as it is currently, and then we can 
always play more tricks later on if we found any of such systems.
 
> > Might it be possible to solve this by identity mapping the first 1MB, and 
> > marking the RAM regions there as p2m_ram_rw? Or would that create even 
> > further problems?
> 
> Hmm, not sure - the range below 1Mb is marked as MMIO in
> frame_table[], so there would be a (benign?) conflict at least there.

As said before, I would leave the current implementation and look into that 
option if needed.

> >> > +    io_apic = (struct acpi_madt_io_apic *)(madt + 1);
> >> > +    io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
> >> > +    io_apic->header.length = sizeof(*io_apic);
> >> > +    io_apic->id = 1;
> >> > +    io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS;
> >> 
> >> I think you need to make as many IO-APICs available to Dom0 as
> >> there are hardware ones.
> > 
> > Right, I've been wondering about this, and in that case I need to expand the
> > IO APIC emulation code so that Xen is able to emulate more than one IO APIC.
> > 
> > Can I ask why you think this is needed? If the number of pins in the
> > multiple IO APIC case doesn't exceed 48 (which is what the emulated IO APIC
> > provides), shouldn't that be enough?
> 
> The number of pins can easily be larger. And I think the mapping
> code would end up simpler if there was a 1:1 relationship between
> physical and virtual IO-APICs. I've just now checked one of my
> larger (but older) systems - it has 5 IO-APICs with a total of 120 pins.

Yes, I agree that having a 1:1 relation between physical and virtual IO 
APICs is the best option. I've added a warning printk if the native hardware 
has more than one IO APIC, and I plan to expand the current IO APIC 
emulation code when I'm done with this initial PVHv2 Dom0 implementation, so 
that it can support emulating multiple IO APICs with varying numbers of pins.
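
Roughly, the MADT construction would then mirror the hardware like this
(hypothetical sketch; the spacing of the emulated MMIO bases is an
assumption, while nr_ioapics/mp_ioapics/io_apic_gsi_base are the existing
Xen accessors):

    /* One MADT IO-APIC entry per physical IO-APIC, keeping id/GSI base. */
    for ( i = 0; i < nr_ioapics; i++ )
    {
        io_apic->header.type = ACPI_MADT_TYPE_IO_APIC;
        io_apic->header.length = sizeof(*io_apic);
        io_apic->id = mp_ioapics[i].mpc_apicid;
        io_apic->address = VIOAPIC_DEFAULT_BASE_ADDRESS + i * PAGE_SIZE;
        io_apic->global_irq_base = io_apic_gsi_base(i);
        io_apic++;
    }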

Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 11:35         ` Roger Pau Monne
@ 2016-10-26 14:10           ` Jan Beulich
  2016-10-26 15:08             ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-26 14:10 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 26.10.16 at 13:35, <roger.pau@citrix.com> wrote:
> On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
>> Taking an abstract perspective I agree with Andrew that we should
>> be whitelisting here. However, as you already see from the list you
>> provided (which afaict is far from complete wrt ACPI 6.1), this may
>> become cumbersome already initially, not to speak of down the road.
> 
> I've initially used a black-listing approach. We can always change this later 
> on.
> 
> So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
> crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
> 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
> from the guest memory map and marked as E820_ACPI, which means that the new 
> RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
> rsdp_paddr provided in the start info, or else it's going to access the 
> native RSDP.

Hmm, for the RSDP I'm not sure. It might be better if we put it at the
same spot where the host one is, mapping a RAM page there with a
copy of the respective host page data. Otoh your approach allows
Dom0 to still find the real tables if need be, which has both up and
down sides.

>> I'm afraid there are systems where the EBDA is not marked reserved.
>> But maybe there are no current (64-bit capable) ones of that sort.
> 
> Hm, I would say that we leave this as it is currently, and then we can 
> always play more tricks later on if we found any of such systems.

As long as the code is experimental, and there's a note to that effect
which can be easily found (and has to be gone for the code to become
non-experimental), I'm fine with that.

>> > Might it be possible to solve this by identity mapping the first 1MB, and 
>> > marking the RAM regions there as p2m_ram_rw? Or would that create even 
>> > further problems?
>> 
>> Hmm, not sure - the range below 1Mb is marked as MMIO in
>> frame_table[], so there would be a (benign?) conflict at least there.
> 
> As said before, I would leave the current implementation and look into that 
> option if needed.

Same thing here.

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 14:10           ` Jan Beulich
@ 2016-10-26 15:08             ` Roger Pau Monne
  2016-10-26 15:16               ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-26 15:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Oct 26, 2016 at 08:10:50AM -0600, Jan Beulich wrote:
> >>> On 26.10.16 at 13:35, <roger.pau@citrix.com> wrote:
> > On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
> >> Taking an abstract perspective I agree with Andrew that we should
> >> be whitelisting here. However, as you already see from the list you
> >> provided (which afaict is far from complete wrt ACPI 6.1), this may
> >> become cumbersome already initially, not to speak of down the road.
> > 
> > I've initially used a black-listing approach. We can always change this later 
> > on.
> > 
> > So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
> > crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
> > 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
> > from the guest memory map and marked as E820_ACPI, which means that the new 
> > RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
> > rsdp_paddr provided in the start info, or else it's going to access the 
> > native RSDP.
> 
> Hmm, for the RSDP I'm not sure. It might be better if we put it at the
> same spot where the host one is, mapping a RAM page there with a
> copy of the respective host page data. Otoh your approach allows
> Dom0 to still find the real tables if need be, which has both up and
> down sides.

The problem with putting it at the same page is that AFAICT there's a big
chance of other things (like the EBDA or ROM) being in the same page, and
we would be shadowing them by mapping a RAM page over it, even if the
original data is copied. This whole area below 1MB is just a mess to deal
with TBH.

> >> I'm afraid there are systems where the EBDA is not marked reserved.
> >> But maybe there are no current (64-bit capable) ones of that sort.
> > 
> > Hm, I would say that we leave this as it is currently, and then we can 
> > always play more tricks later on if we found any of such systems.
> 
> As long as the code is experimental, and there's a note to that effect
> which can be easily found (and has to be gone for the code to become
> non-experimental), I'm fine with that.
> 
> >> > Might it be possible to solve this by identity mapping the first 1MB, and 
> >> > marking the RAM regions there as p2m_ram_rw? Or would that create even 
> >> > further problems?
> >> 
> >> Hmm, not sure - the range below 1Mb is marked as MMIO in
> >> frame_table[], so there would be a (benign?) conflict at least there.
> > 
> > As said before, I would leave the current implementation and look into that 
> > option if needed.

Right, I've added the following note as a comment:

NB2: regions marked as RAM in the memory map are backed by RAM pages
in the p2m, and the original data is copied over. This is done because
at least FreeBSD places the AP boot trampoline in a RAM region found
below the first MB, and the real-mode emulator found in Xen cannot
deal with code that resides in guest pages marked as MMIO. This can
cause problems if the memory map is not correct, and for example the
EBDA or the video ROM region is marked as RAM.
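
A condensed sketch of that logic (hypothetical and simplified; the copy
helper copy_to_dom0_phys() is illustrative, modify_mmio_11() is the helper
from this patch, and error handling is omitted):

    /* Walk the Dom0 e820 entries that start below the first MB. */
    for ( i = 0; i < d->arch.nr_e820 && d->arch.e820[i].addr < MB(1); i++ )
    {
        struct e820entry *entry = &d->arch.e820[i];
        unsigned long pfn = PFN_DOWN(entry->addr);
        unsigned long nr = PFN_UP(entry->addr + entry->size) - pfn;

        if ( entry->type == E820_RAM )
            /* Backed by RAM in the p2m: copy the original contents. */
            copy_to_dom0_phys(d, entry->addr,
                              maddr_to_virt(entry->addr), entry->size);
        else
            /* Anything else (reserved regions, holes...) is mapped 1:1. */
            modify_mmio_11(d, pfn, nr, true);
    }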

Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 15:08             ` Roger Pau Monne
@ 2016-10-26 15:16               ` Jan Beulich
  2016-10-26 16:03                 ` Roger Pau Monne
  2016-10-26 17:14                 ` Boris Ostrovsky
  0 siblings, 2 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-26 15:16 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 26.10.16 at 17:08, <roger.pau@citrix.com> wrote:
> On Wed, Oct 26, 2016 at 08:10:50AM -0600, Jan Beulich wrote:
>> >>> On 26.10.16 at 13:35, <roger.pau@citrix.com> wrote:
>> > On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
>> >> Taking an abstract perspective I agree with Andrew that we should
>> >> be whitelisting here. However, as you already see from the list you
>> >> provided (which afaict is far from complete wrt ACPI 6.1), this may
>> >> become cumbersome already initially, not to speak of down the road.
>> > 
>> > I've initially used a black-listing approach. We can always change this later 
>> > on.
>> > 
>> > So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
>> > crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
>> > 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
>> > from the guest memory map and marked as E820_ACPI, which means that the new 
>> > RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
>> > rsdp_paddr provided in the start info, or else it's going to access the 
>> > native RSDP.
>> 
>> Hmm, for the RSDP I'm not sure. It might be better if we put it at the
>> same spot where the host one is, mapping a RAM page there with a
>> copy of the respective host page data. Otoh your approach allows
>> Dom0 to still find the real tables if need be, which has both up and
>> down sides.
> 
> The problem with putting it at the same page is that AFAICT there's a big
> chance of other things (like the EBDA or ROM) being in the same page, and
> we would be shadowing them by mapping a RAM page over it, even if the
> original data is copied. This whole area below 1MB is just a mess to deal
> with TBH.

Unless it's page zero, what bad could such shadowing do?

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 15:16               ` Jan Beulich
@ 2016-10-26 16:03                 ` Roger Pau Monne
  2016-10-27  7:25                   ` Jan Beulich
  2016-10-26 17:14                 ` Boris Ostrovsky
  1 sibling, 1 reply; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-26 16:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Wed, Oct 26, 2016 at 09:16:55AM -0600, Jan Beulich wrote:
> >>> On 26.10.16 at 17:08, <roger.pau@citrix.com> wrote:
> > On Wed, Oct 26, 2016 at 08:10:50AM -0600, Jan Beulich wrote:
> >> >>> On 26.10.16 at 13:35, <roger.pau@citrix.com> wrote:
> >> > On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
> >> >> Taking an abstract perspective I agree with Andrew that we should
> >> >> be whitelisting here. However, as you already see from the list you
> >> >> provided (which afaict is far from complete wrt ACPI 6.1), this may
> >> >> become cumbersome already initially, not to speak of down the road.
> >> > 
> >> > I've initially used a black-listing approach. We can always change this later 
> >> > on.
> >> > 
> >> > So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
> >> > crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
> >> > 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
> >> > from the guest memory map and marked as E820_ACPI, which means that the new 
> >> > RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
> >> > rsdp_paddr provided in the start info, or else it's going to access the 
> >> > native RSDP.
> >> 
> >> Hmm, for the RSDP I'm not sure. It might be better if we put it at the
> >> same spot where the host one is, mapping a RAM page there with a
> >> copy of the respective host page data. Otoh your approach allows
> >> Dom0 to still find the real tables if need be, which has both up and
> >> down sides.
> > 
> > The problem with putting it at the same page is that AFAICT there's a big
> > chance of other things (like the EBDA or ROM) being in the same page, and
> > we would be shadowing them by mapping a RAM page over it, even if the
> > original data is copied. This whole area below 1MB is just a mess to deal
> > with TBH.
> 
> Unless it's page zero, what bad could such shadowing do?

You were the one who warned me in
http://marc.info/?l=xen-devel&m=147576851115855 that shadowing regions
below 1MB could prevent the OS from properly interacting with the firmware,
or at least that was my understanding. The RSDP is usually located in the
same page as the EBDA, so we would be shadowing the EBDA area.

Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 15:16               ` Jan Beulich
  2016-10-26 16:03                 ` Roger Pau Monne
@ 2016-10-26 17:14                 ` Boris Ostrovsky
  2016-10-27  7:27                   ` Jan Beulich
  2016-10-27 11:13                   ` Roger Pau Monne
  1 sibling, 2 replies; 146+ messages in thread
From: Boris Ostrovsky @ 2016-10-26 17:14 UTC (permalink / raw)
  To: Jan Beulich, Roger Pau Monne; +Cc: Andrew Cooper, xen-devel


>>>> I've initially used a black-listing approach. We can always change this later 
>>>> on.
>>>>
>>>> So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
>>>> crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
>>>> 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
>>>> from the guest memory map and marked as E820_ACPI, which means that the new 
>>>> RSDP no longer resides below 1MB, 


As I mentioned in the other thread I am not sure this would be in
compliance with the ACPI spec.

-boris



>>>> and that the Dom0 kernel _must_ use the 
>>>> rsdp_paddr provided in the start info, or else it's going to access the 
>>>> native RSDP.
>>>



* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 16:03                 ` Roger Pau Monne
@ 2016-10-27  7:25                   ` Jan Beulich
  2016-10-27 11:08                     ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-27  7:25 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

>>> On 26.10.16 at 18:03, <roger.pau@citrix.com> wrote:
> On Wed, Oct 26, 2016 at 09:16:55AM -0600, Jan Beulich wrote:
>> >>> On 26.10.16 at 17:08, <roger.pau@citrix.com> wrote:
>> > On Wed, Oct 26, 2016 at 08:10:50AM -0600, Jan Beulich wrote:
>> >> >>> On 26.10.16 at 13:35, <roger.pau@citrix.com> wrote:
>> >> > On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
>> >> >> Taking an abstract perspective I agree with Andrew that we should
>> >> >> be whitelisting here. However, as you already see from the list you
>> >> >> provided (which afaict is far from complete wrt ACPI 6.1), this may
>> >> >> become cumbersome already initially, not to speak of down the road.
>> >> > 
>> >> > I've initially used a black-listing approach. We can always change this later 
>> >> > on.
>> >> > 
>> >> > So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
>> >> > crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
>> >> > 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
>> >> > from the guest memory map and marked as E820_ACPI, which means that the new 
>> >> > RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
>> >> > rsdp_paddr provided in the start info, or else it's going to access the 
>> >> > native RSDP.
>> >> 
>> >> Hmm, for the RSDP I'm not sure. It might be better if we put it at the
>> >> same spot where the host one is, mapping a RAM page there with a
>> >> copy of the respective host page data. Otoh your approach allows
>> >> Dom0 to still find the real tables if need be, which has both up and
>> >> down sides.
>> > 
>> > The problem with putting it at the same page is that AFAICT there's a big
>> > chance of other things (like the EBDA or ROM) being in the same page, and
>> > we would be shadowing them by mapping a RAM page over it, even if the
>> > original data is copied. This whole area below 1MB is just a mess to deal
>> > with TBH.
>> 
>> Unless it's page zero, what bad could such shadowing do?
> 
> You were the one who warned me in
> http://marc.info/?l=xen-devel&m=147576851115855 that shadowing regions
> below 1MB could prevent the OS from properly interacting with the firmware,
> or at least that was my understanding. The RSDP is usually located in the
> same page as the EBDA, so we would be shadowing the EBDA area.

Oh, right, for a moment I did forget that the EBDA is a permissible
place for the RSDPTR to live in. Only mapping a page over a ROM
one (E0000...FFFFF) would be reasonable. So as said - there are
positive aspects to keeping the original pointer visible; the main
issue I foresee is that once again user mode tools will (unless made
Xen-aware) have a different view of the system than the kernel.

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 17:14                 ` Boris Ostrovsky
@ 2016-10-27  7:27                   ` Jan Beulich
  2016-10-27 11:13                   ` Roger Pau Monne
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-27  7:27 UTC (permalink / raw)
  To: Roger Pau Monne, Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel

>>> On 26.10.16 at 19:14, <boris.ostrovsky@oracle.com> wrote:
>>>>> I've initially used a black-listing approach. We can always change this later 
>>>>> on.
>>>>>
>>>>> So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
>>>>> crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
>>>>> 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
>>>>> from the guest memory map and marked as E820_ACPI, which means that the new 
>>>>> RSDP no longer resides below 1MB, 
> 
> As I mentioned in the other thread I am not sure this would be in
> compliance with the ACPI spec.

While that's true, Roger had already pointed out that the normal
way of finding RSDPTR needs to be avoided anyway.

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27  7:25                   ` Jan Beulich
@ 2016-10-27 11:08                     ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-27 11:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, boris.ostrovsky, xen-devel

On Thu, Oct 27, 2016 at 01:25:40AM -0600, Jan Beulich wrote:
> >>> On 26.10.16 at 18:03, <roger.pau@citrix.com> wrote:
> > On Wed, Oct 26, 2016 at 09:16:55AM -0600, Jan Beulich wrote:
> >> >>> On 26.10.16 at 17:08, <roger.pau@citrix.com> wrote:
> >> > On Wed, Oct 26, 2016 at 08:10:50AM -0600, Jan Beulich wrote:
> >> >> >>> On 26.10.16 at 13:35, <roger.pau@citrix.com> wrote:
> >> >> > On Wed, Oct 12, 2016 at 09:55:44AM -0600, Jan Beulich wrote:
> >> >> >> Taking an abstract perspective I agree with Andrew that we should
> >> >> >> be whitelisting here. However, as you already see from the list you
> >> >> >> provided (which afaict is far from complete wrt ACPI 6.1), this may
> >> >> >> become cumbersome already initially, not to speak of down the road.
> >> >> > 
> >> >> > I've initially used a black-listing approach. We can always change this later 
> >> >> > on.
> >> >> > 
> >> >> > So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
> >> >> > crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
> >> >> > 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
> >> >> > from the guest memory map and marked as E820_ACPI, which means that the new 
> >> >> > RSDP no longer resides below 1MB, and that the Dom0 kernel _must_ use the 
> >> >> > rsdp_paddr provided in the start info, or else it's going to access the 
> >> >> > native RSDP.
> >> >> 
> >> >> Hmm, for the RSDP I'm not sure. It might be better if we put it at the
> >> >> same spot where the host one is, mapping a RAM page there with a
> >> >> copy of the respective host page data. Otoh your approach allows
> >> >> Dom0 to still find the real tables if need be, which has both up and
> >> >> down sides.
> >> > 
> >> > The problem with putting it at the same page is that AFAICT there's a big
> >> > chance of other things (like the EBDA or ROM) being in the same page, and
> >> > we would be shadowing them by mapping a RAM page over it, even if the
> >> > original data is copied. This whole area below 1MB is just a mess to deal
> >> > with TBH.
> >> 
> >> Unless it's page zero, what bad could such shadowing do?
> > 
> > You where the one that warmed me of this in 
> > http://marc.info/?l=xen-devel&m=147576851115855 that shadowing regions 
> > below 1MB could prevent the OS from properly interacting with the firmware, 
> > or at least this was my understanding. The RSDP is usually located in the 
> > same page as the EBDA, so we would be shadowing the EBDA area.
> 
> Oh, right, for a moment I did forget that the EBDA is a permissible
> place for the RSDPTR to live in. Only mapping a page over a ROM
> one (E0000...FFFFF) would be reasonable. So as said - there are
> positive aspects to keeping the original pointer visible; the main
> issue I foresee is that once again user mode tools will (unless made
> Xen-aware) have a different view of the system than the kernel.

Yes, that's certainly possible. On FreeBSD acpidump will try to fetch the
address of the RSDP from a sysctl, and that's going to be right (because
it's set by the kernel based on the RSDP that it's using). I guess Linux
has some similar functionality, or else tools like acpidump wouldn't work
in UEFI environments (where AFAIK the RSDP can be anywhere in memory).

Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-26 17:14                 ` Boris Ostrovsky
  2016-10-27  7:27                   ` Jan Beulich
@ 2016-10-27 11:13                   ` Roger Pau Monne
  2016-10-27 11:25                     ` Jan Beulich
  2016-10-27 13:51                     ` Boris Ostrovsky
  1 sibling, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-27 11:13 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, Jan Beulich, xen-devel

On Wed, Oct 26, 2016 at 01:14:16PM -0400, Boris Ostrovsky wrote:
> 
> >>>> I've initially used a black-listing approach. We can always change this later 
> >>>> on.
> >>>>
> >>>> So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
> >>>> crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
> >>>> 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
> >>>> from the guest memory map and marked as E820_ACPI, which means that the new 
> >>>> RSDP no longer resides below 1MB, 
> 
> 
> As I mentioned in the other thread I am not sure this would be in
> compliance with the ACPI spec.

Right, the ACPI spec mandates that the RSDP must reside in the first 1 KB of
the EBDA, or in the ROM region between 0xE0000 and 0xFFFFF, but that's only
when booted from a legacy BIOS; when booted from UEFI the RSDP can reside
anywhere in memory, and the pointer must be fetched from a UEFI-specific
table. I would consider Xen to be similar to a UEFI boot in that regard.

Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 11:13                   ` Roger Pau Monne
@ 2016-10-27 11:25                     ` Jan Beulich
  2016-10-27 13:51                     ` Boris Ostrovsky
  1 sibling, 0 replies; 146+ messages in thread
From: Jan Beulich @ 2016-10-27 11:25 UTC (permalink / raw)
  To: Roger Pau Monne, Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel

>>> On 27.10.16 at 13:13, <roger.pau@citrix.com> wrote:
> On Wed, Oct 26, 2016 at 01:14:16PM -0400, Boris Ostrovsky wrote:
>> >>>> I've initially used a black-listing approach. We can always change this later 
>> >>>> on.
>> >>>>
>> >>>> So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not 
>> >>>> crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address = 
>> >>>> 0 in the RSDP together with revision = 2). This is all placed in RAM stolen 
>> >>>> from the guest memory map and marked as E820_ACPI, which means that the new 
>> >>>> RSDP no longer resides below 1MB, 
>> 
>> As I mentioned in the other thread I am not sure this would be in
>> compliance with the ACPI spec.
> 
> Right, the ACPI spec mandates that the RSDP must reside in the first 1 KB of 
> the EBDA, or in the ROM regions between 0xE0000 0xFFFFF, but that's only 
> when booted from a legacy BIOS, when booted from UEFI the RSDP can reside 
> anywhere in memory, and the pointer must be fetched from a UEFI specific 
> table. I would consider Xen to be similar to UEFI boot in that regard.

+1

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 11:13                   ` Roger Pau Monne
  2016-10-27 11:25                     ` Jan Beulich
@ 2016-10-27 13:51                     ` Boris Ostrovsky
  2016-10-27 14:02                       ` Jan Beulich
  1 sibling, 1 reply; 146+ messages in thread
From: Boris Ostrovsky @ 2016-10-27 13:51 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, Jan Beulich, xen-devel



On 10/27/2016 07:13 AM, Roger Pau Monne wrote:
> On Wed, Oct 26, 2016 at 01:14:16PM -0400, Boris Ostrovsky wrote:
>>
>>>>>> I've initially used a black-listing approach. We can always change this later
>>>>>> on.
>>>>>>
>>>>>> So I've ended up crafting a new MADT, XSDT and RSDP. Note that I'm not
>>>>>> crafting a new custom RSDT (and in fact I'm setting rsdt_physical_address =
>>>>>> 0 in the RSDP together with revision = 2). This is all placed in RAM stolen
>>>>>> from the guest memory map and marked as E820_ACPI, which means that the new
>>>>>> RSDP no longer resides below 1MB,
>>
>>
>> As I mentioned in the other thread I am not sure this would be in
>> compliance with the ACPI spec.
>
> Right, the ACPI spec mandates that the RSDP must reside in the first 1 KB of
> the EBDA, or in the ROM region between 0xE0000 and 0xFFFFF, but that's only
> when booted from a legacy BIOS; when booted from UEFI the RSDP can reside
> anywhere in memory, and the pointer must be fetched from a UEFI-specific
> table. I would consider Xen to be similar to a UEFI boot in that regard.


It is similar, but it is not a true UEFI boot, so we are violating the
spec, which, for example, means that standard Linux ACPI root discovery
won't work.

I re-read this thread and I am not sure I understand why we can't keep 
host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so 
we can place our versions of RSDT/XSDT at the address that the 
descriptor points to.

Unless that address is beyond dom0 memory allocation so that could be a 
problem I guess.

-boris


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 13:51                     ` Boris Ostrovsky
@ 2016-10-27 14:02                       ` Jan Beulich
  2016-10-27 14:15                         ` Boris Ostrovsky
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-27 14:02 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel, Roger Pau Monne

>>> On 27.10.16 at 15:51, <boris.ostrovsky@oracle.com> wrote:
> I re-read this thread and I am not sure I understand why we can't keep 
> host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so 
> we can place our versions of RSDT/XSDT at the address that the 
> descriptor points to.
> 
> Unless that address is beyond dom0 memory allocation so that could be a 
> problem I guess.

Also eventually we may want to add tables, not just remove some,
and then there may not be enough space at the original location.
Plus I think nothing precludes the XSDT living below 1Mb, and then
we're back into the same problems discussed in another branch of
this thread.

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 14:02                       ` Jan Beulich
@ 2016-10-27 14:15                         ` Boris Ostrovsky
  2016-10-27 14:30                           ` Jan Beulich
  0 siblings, 1 reply; 146+ messages in thread
From: Boris Ostrovsky @ 2016-10-27 14:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel, Roger Pau Monne



On 10/27/2016 10:02 AM, Jan Beulich wrote:
>>>> On 27.10.16 at 15:51, <boris.ostrovsky@oracle.com> wrote:
>> I re-read this thread and I am not sure I understand why we can't keep
>> host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so
>> we can place our versions of RSDT/XSDT at the address that the
>> descriptor points to.
>>
>> Unless that address is beyond dom0 memory allocation so that could be a
>> problem I guess.
>
> Also eventually we may want to add tables, not just remove some,
> and then there may not be enough space at the original location.
> Plus I think nothing precludes the XSDT living below 1Mb, and then
> we're back into the same problems discussed in another branch of
> this thread.

Yes, that is a problem.

I guess we will need to do something in Linux to override descriptor 
search. There is already some cruft for this but it appears to be 
kexec-specific.


-boris


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 14:15                         ` Boris Ostrovsky
@ 2016-10-27 14:30                           ` Jan Beulich
  2016-10-27 14:40                             ` Boris Ostrovsky
  0 siblings, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-27 14:30 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, xen-devel, Roger Pau Monne

>>> On 27.10.16 at 16:15, <boris.ostrovsky@oracle.com> wrote:
> On 10/27/2016 10:02 AM, Jan Beulich wrote:
>>>>> On 27.10.16 at 15:51, <boris.ostrovsky@oracle.com> wrote:
>>> I re-read this thread and I am not sure I understand why we can't keep
>>> host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so
>>> we can place our versions of RSDT/XSDT at the address that the
>>> descriptor points to.
>>>
>>> Unless that address is beyond dom0 memory allocation so that could be a
>>> problem I guess.
>>
>> Also eventually we may want to add tables, not just remove some,
>> and then there may not be enough space at the original location.
>> Plus I think nothing precludes the XSDT living below 1Mb, and then
>> we're back into the same problems discussed in another branch of
>> this thread.
> 
> Yes, that is a problem.
> 
> I guess we will need to do something in Linux to override descriptor 
> search. There is already some cruft for this but it appears to be 
> kexec-specific.

As Roger pointed out - there should be kexec-independent logic
to avoid that lookup on EFI systems (by finding the pointer earlier
another way).

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 14:30                           ` Jan Beulich
@ 2016-10-27 14:40                             ` Boris Ostrovsky
  2016-10-27 15:04                               ` Roger Pau Monne
  0 siblings, 1 reply; 146+ messages in thread
From: Boris Ostrovsky @ 2016-10-27 14:40 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel, Roger Pau Monne



On 10/27/2016 10:30 AM, Jan Beulich wrote:
>>>> On 27.10.16 at 16:15, <boris.ostrovsky@oracle.com> wrote:
>> On 10/27/2016 10:02 AM, Jan Beulich wrote:
>>>>>> On 27.10.16 at 15:51, <boris.ostrovsky@oracle.com> wrote:
>>>> I re-read this thread and I am not sure I understand why we can't keep
>>>> host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so
>>>> we can place our versions of RSDT/XSDT at the address that the
>>>> descriptor points to.
>>>>
>>>> Unless that address is beyond dom0 memory allocation so that could be a
>>>> problem I guess.
>>>
>>> Also eventually we may want to add tables, not just remove some,
>>> and then there may not be enough space at the original location.
>>> Plus I think nothing precludes the XSDT living below 1Mb, and then
>>> we're back into the same problems discussed in another branch of
>>> this thread.
>>
>> Yes, that is a problem.
>>
>> I guess we will need to do something in Linux to override descriptor
>> search. There is already some cruft for this but it appears to be
>> kexec-specific.
>
> As Roger pointed out - there should be kexec-independent logic
> to avoid that lookup on EFI systems (by finding the pointer earlier
> another way).


Yes, that's exactly what Linux does (acpi_os_get_root_pointer()) but 
that would imply that we are running on a UEFI system, which we are not. 
And trying to fake just this feature may cause other components to get 
confused.
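
For illustration, the Linux-side hook being discussed looks roughly like
this (hypothetical sketch; xen_pvh_rsdp_paddr is an assumed variable saved
from the PVH start info, while acpi_find_root_pointer() is the ACPICA
legacy scan):

	acpi_physical_address __init acpi_os_get_root_pointer(void)
	{
		acpi_physical_address pa = 0;

		/* Prefer an RSDP address handed over by the hypervisor. */
		if (xen_pvh_rsdp_paddr)
			return xen_pvh_rsdp_paddr;

		/* Fall back to the legacy EBDA/ROM scan. */
		acpi_find_root_pointer(&pa);
		return pa;
	}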


-boris


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 14:40                             ` Boris Ostrovsky
@ 2016-10-27 15:04                               ` Roger Pau Monne
  2016-10-27 15:20                                 ` Jan Beulich
  2016-10-28 13:51                                 ` Boris Ostrovsky
  0 siblings, 2 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-27 15:04 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Andrew Cooper, Jan Beulich, xen-devel

On Thu, Oct 27, 2016 at 10:40:52AM -0400, Boris Ostrovsky wrote:
> 
> 
> On 10/27/2016 10:30 AM, Jan Beulich wrote:
> > > > > On 27.10.16 at 16:15, <boris.ostrovsky@oracle.com> wrote:
> > > On 10/27/2016 10:02 AM, Jan Beulich wrote:
> > > > > > > On 27.10.16 at 15:51, <boris.ostrovsky@oracle.com> wrote:
> > > > > I re-read this thread and I am not sure I understand why we can't keep
> > > > > host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so
> > > > > we can place our versions of RSDT/XSDT at the address that the
> > > > > descriptor points to.
> > > > > 
> > > > > Unless that address is beyond dom0 memory allocation so that could be a
> > > > > problem I guess.
> > > > 
> > > > Also eventually we may want to add tables, not just remove some,
> > > > and then there may not be enough space at the original location.
> > > > Plus I think nothing precludes the XSDT living below 1Mb, and then
> > > > we're back into the same problems discussed in another branch of
> > > > this thread.
> > > 
> > > Yes, that is a problem.
> > > 
> > > I guess we will need to do something in Linux to override descriptor
> > > search. There is already some cruft for this but it appears to be
> > > kexec-specific.
> > 
> > As Roger pointed out - there should be kexec-independent logic
> > to avoid that lookup on EFI systems (by finding the pointer earlier
> > another way).
> 
> 
> Yes, that's exactly what Linux does (acpi_os_get_root_pointer()) but that
> would imply that we are running on a UEFI system, which we are not. And
> trying to fake just this feature may cause other components to get confused.

TBH, I would just make acpi_rsdp not dependent on KEXEC, but is this going 
to be propagated to user-space tools somehow (i.e. so that acpidump will 
report the right tables)?

I prefer this solution because it seems simpler and less likely to cause 
future issues, but if this seems like too much fuss I can always try to 
place the RSDP on top of the original one and shadow that page. From the 
information I've been able to find on the Internet, the EBDA just seems to 
contain the position of the pointing device (PS/2 mouse) [0], which I doubt 
is used by any modern OS.
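
Whichever location ends up being used, the replacement RSDP needs both of
its checksums to be valid or the guest will discard it: the ACPI 1.0
checksum covers the first 20 bytes, and the extended one covers the whole
36-byte v2 structure. A self-contained sketch of what building it involves
(field layout per the ACPI spec; the OEM ID is an arbitrary placeholder,
and none of this is the code from this series):

#include <stdint.h>
#include <string.h>

struct acpi_rsdp {
    char     signature[8];       /* "RSD PTR " */
    uint8_t  checksum;           /* makes the first 20 bytes sum to 0 */
    char     oem_id[6];
    uint8_t  revision;           /* 2 for ACPI 2.0 and later */
    uint32_t rsdt_address;
    uint32_t length;             /* 36 for the v2 layout */
    uint64_t xsdt_address;
    uint8_t  extended_checksum;  /* makes all 'length' bytes sum to 0 */
    uint8_t  reserved[3];
} __attribute__((packed));

static uint8_t byte_sum(const void *p, size_t len)
{
    const uint8_t *b = p;
    uint8_t sum = 0;

    while ( len-- )
        sum += *b++;
    return sum;
}

void build_rsdp(struct acpi_rsdp *rsdp, uint32_t rsdt_pa, uint64_t xsdt_pa)
{
    memset(rsdp, 0, sizeof(*rsdp));
    memcpy(rsdp->signature, "RSD PTR ", 8);
    memcpy(rsdp->oem_id, "XenVMM", 6);    /* placeholder OEM ID */
    rsdp->revision = 2;
    rsdp->rsdt_address = rsdt_pa;
    rsdp->length = sizeof(*rsdp);
    rsdp->xsdt_address = xsdt_pa;
    /* Compute each checksum while its own field is still zero. */
    rsdp->checksum = -byte_sum(rsdp, 20);
    rsdp->extended_checksum = -byte_sum(rsdp, sizeof(*rsdp));
}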

Roger.

[0] http://www.msc-technologies.eu/fileadmin/documentpool/Support-Center/Archive/70-PISA/PISA-P/02-Manual/BIOS%20Programmer's%20Guide%20v10.pdf?file=&ref=


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 15:04                               ` Roger Pau Monne
@ 2016-10-27 15:20                                 ` Jan Beulich
  2016-10-27 15:37                                   ` Roger Pau Monne
  2016-10-28 13:51                                 ` Boris Ostrovsky
  1 sibling, 1 reply; 146+ messages in thread
From: Jan Beulich @ 2016-10-27 15:20 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, Boris Ostrovsky, xen-devel

>>> On 27.10.16 at 17:04, <roger.pau@citrix.com> wrote:
> I prefer this solution because it seems simpler and less likely to cause 
> future issues, but if this seems like too much fuss I can always try to 
> place the RSDP on top of the original one and shadow that page. From the 
> information I've been able to find on the Internet, the EBDA just seems to 
> contain the position of the pointing device (PS/2 mouse) [0], which I doubt 
> is used by any modern OS.

Well, that's the (halfway) documented aspect of what it gets used
for. As far as I recall from my DOS support / engineering days,
there's quite a bit more undocumented use of it (among other things
for remote boot purposes), so I'm surprised you found only that part
documented. My primary source of information back in those days
(Ralf Brown's "Interrupt List") tells me:

Format of Extended BIOS Data Area (IBM):
Offset	Size	Description	(Table M0001)
 00h	BYTE	length of EBDA in kilobytes
 01h 15 BYTEs	reserved
 17h	BYTE	number of entries in POST error log (0-5)
 18h  5 WORDs	POST error log (each word is a POST error number)
 22h	DWORD	Pointing Device Driver entry point
 26h	BYTE	Pointing Device Flags 1 (see #M0002)
 27h	BYTE	Pointing Device Flags 2 (see #M0003)
 28h  8 BYTEs	Pointing Device Auxiliary Device Data
 30h	DWORD	Vector for INT 07h stored here during 80387 interrupt
 34h	DWORD	Vector for INT 01h stored here during INT 07h emulation
 38h	BYTE	Scratchpad for 80287/80387 interrupt code
 39h	WORD	Timer3: Watchdog timer initial count
 3Bh	BYTE	??? seen non-zero on Model 30
 3Ch	BYTE	???
 3Dh 16 BYTEs	Fixed Disk parameter table for drive 0 (for older machines
		  which don't directly support the installed drive)
 4Dh 16 BYTEs	Fixed Disk parameter table for drive 1 (for older machines
		  which don't directly support the installed drive)
 5Dh-67h	???
 68h	BYTE	cache control
		bits 7-2 unused (0)
		bit 1: CPU cache failed test
		bit 0: CPU cache disabled
 69h-6Bh	???
 6Ch	BYTE	Fixed disk: (=FFh on ESDI systems)
		    bits 7-4: Channel number 00-0Fh
		    bits 3-0: DMA arbitration level 00-0Eh
 6Dh	BYTE	???
 6Eh	WORD	current typematic setting (see INT 16/AH=03h)
 70h	BYTE	number of attached hard drives
 71h	BYTE	hard disk 16-bit DMA channel
 72h	BYTE	interrupt status for hard disk controller (1Fh on timeout)
 73h	BYTE	hard disk operation flags
		bit 7: controller issued operation-complete INT 76h
		bit 6: controller has been reset
		bits 5-0: unused (0)
 74h	DWORD	old INT 76h vector
 78h	BYTE	hard disk DMA type
		typically 44h for reads and 4Ch for writes
 79h	BYTE	status of last hard disk operation
 7Ah	BYTE	hard disk timeout counter
 7Bh-7Dh
 7Eh  8 WORDs	storage for hard disk controller status
 8Eh-E6h
 E7h	BYTE	floppy drive type
		bit 7: drive(s) present
		bits 6-2: unused (0)
		bit 1: drive 1 is 5.25" instead of 3.5"
		bit 0: drive 0 is 5.25"
 E8h  4 BYTEs	???
 ECh	BYTE	hard disk parameters flag
		bit 7: parameters loaded into EBDA
		bits 6-0: unused (0)
 EDh	BYTE	???
 EEh	BYTE	CPU family ID (03h = 386, 04h = 486, etc.) (see INT 15/AH=C9h)
 EFh	BYTE	CPU stepping (see INT 15/AH=C9h)
 F0h 39 BYTEs	???
117h	WORD	keyboard ID (see INT 16/AH=0Ah)
		(most commonly 41ABh)
119h	BYTE	???
11Ah	BYTE	non-BIOS INT 18h flag
		bits 7-1: unused (0)
		bit 0: set by BIOS before calling user INT 18h at offset 11Dh
11Bh  2 BYTE	???
11Dh	DWORD	user INT 18h vector if BIOS has re-hooked INT 18h
121h and up:	??? seen non-zero on Model 60
3F0h	BYTE	Fixed disk buffer (???)

Many question marks there ... There's a second layout documented
in there, too, but in the end the actual layout is of no interest to us.
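
As a rough illustration of why nothing else can safely be overlaid there,
here is the documented prefix of that table rendered as a packed C struct,
with offsets as in the listing above; everything past the watchdog field
is omitted as model-specific or unknown:

#include <stdint.h>

struct ebda_ibm {
    uint8_t  length_kb;            /* 00h: EBDA length in KiB */
    uint8_t  reserved[15];         /* 01h */
    uint8_t  unlisted[7];          /* 10h: gap, not in the table above */
    uint8_t  post_error_count;     /* 17h: entries in POST error log */
    uint16_t post_error_log[5];    /* 18h */
    uint32_t pointing_dev_entry;   /* 22h: pointing device driver entry */
    uint8_t  pointing_dev_flags1;  /* 26h */
    uint8_t  pointing_dev_flags2;  /* 27h */
    uint8_t  pointing_dev_aux[8];  /* 28h: auxiliary device data */
    uint32_t int07_vector;         /* 30h: INT 07h vector during 80387 int */
    uint32_t int01_vector;         /* 34h: INT 01h vector during INT 07h emu */
    uint8_t  fpu_scratch;          /* 38h: 80287/80387 scratchpad */
    uint16_t watchdog_count;       /* 39h: Timer3 watchdog initial count */
    /* 3Bh onwards: model-specific and undocumented ("???") fields. */
} __attribute__((packed));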

Jan


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 15:20                                 ` Jan Beulich
@ 2016-10-27 15:37                                   ` Roger Pau Monne
  0 siblings, 0 replies; 146+ messages in thread
From: Roger Pau Monne @ 2016-10-27 15:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Boris Ostrovsky, xen-devel

On Thu, Oct 27, 2016 at 09:20:00AM -0600, Jan Beulich wrote:
> >>> On 27.10.16 at 17:04, <roger.pau@citrix.com> wrote:
> > I prefer this solution because it seems simpler and less likely to cause 
> > future issues, but if this seems like too much fuss I can always try to 
> > place the RSDP on top of the original one and shadow that page. From the 
> > information I've been able to find on the Internet, the EBDA just seems to 
> > contain the position of the pointing device (PS/2 mouse) [0], which I doubt 
> > is used by any modern OS.
> 
> Well, that's the (halfway) documented aspect of what it gets used
> for. As far as I recall from my DOS support / engineering days,
> there's quite a bit more undocumented use of it (among other things
> for remote boot purposes), so I'm surprised you found only that part
> documented. My primary source of information back in those days
> (Ralf Brown's "Interrupt List") tells me:
> 
> [EBDA layout table snipped; see Jan's message above]
> 
> Many question marks there ... There's a second layout documented
> in there, too, but in the end the actual layout is of no interest to us.

Thanks for the info; seeing that, I now think we have no other option but 
to place the RSDP in another area.

Roger.


* Re: [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables
  2016-10-27 15:04                               ` Roger Pau Monne
  2016-10-27 15:20                                 ` Jan Beulich
@ 2016-10-28 13:51                                 ` Boris Ostrovsky
  1 sibling, 0 replies; 146+ messages in thread
From: Boris Ostrovsky @ 2016-10-28 13:51 UTC (permalink / raw)
  To: Roger Pau Monne; +Cc: Andrew Cooper, Jan Beulich, xen-devel

On 10/27/2016 11:04 AM, Roger Pau Monne wrote:
> On Thu, Oct 27, 2016 at 10:40:52AM -0400, Boris Ostrovsky wrote:
>>
>> On 10/27/2016 10:30 AM, Jan Beulich wrote:
>>>>>> On 27.10.16 at 16:15, <boris.ostrovsky@oracle.com> wrote:
>>>> On 10/27/2016 10:02 AM, Jan Beulich wrote:
>>>>>>>> On 27.10.16 at 15:51, <boris.ostrovsky@oracle.com> wrote:
>>>>>> I re-read this thread and I am not sure I understand why we can't keep
>>>>>> host's RSDP descriptor. We are not mapping dom0 memory 1:1 (right?) so
>>>>>> we can place our versions of RSDT/XSDT at the address that the
>>>>>> descriptor points to.
>>>>>>
>>>>>> Unless that address is beyond dom0 memory allocation so that could be a
>>>>>> problem I guess.
>>>>> Also eventually we may want to add tables, not just remove some,
>>>>> and then there may not be enough space at the original location.
>>>>> Plus I think nothing precludes the XSDT living below 1Mb, and then
>>>>> we're back into the same problems discussed in another branch of
>>>>> this thread.
>>>> Yes, that is a problem.
>>>>
>>>> I guess we will need to do something in Linux to override descriptor
>>>> search. There is already some cruft for this but it appears to be
>>>> kexec-specific.
>>> As Roger pointed out - there should be kexec-independent logic
>>> to avoid that lookup on EFI systems (by finding the pointer earlier
>>> another way).
>>
>> Yes, that's exactly what Linux does (acpi_os_get_root_pointer()) but that
>> would imply that we are running on a UEFI system, which we are not. And
>> trying to fake just this feature may cause other components to get confused.
> TBH, I would just make acpi_rsdp not dependent on KEXEC, but is this going 
> to be propagated to user-space tools somehow (i.e. so that acpidump will 
> report the right tables)?

(Sorry, I forgot to respond)

If a tool is reading /sys/firmware/acpi/tables then I think it will see
what the kernel is using. acpidump, for example, *is* reading this
directory.
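
For completeness, listing that directory shows exactly the view user space
gets; a trivial sketch, assuming a kernel that exposes
/sys/firmware/acpi/tables:

#include <dirent.h>
#include <stdio.h>

/* List the ACPI tables the kernel exposes to user space; this is the
 * same set a tool such as acpidump reads. */
int main(void)
{
    DIR *d = opendir("/sys/firmware/acpi/tables");
    struct dirent *e;

    if ( !d )
    {
        perror("/sys/firmware/acpi/tables");
        return 1;
    }
    while ( (e = readdir(d)) != NULL )
        if ( e->d_name[0] != '.' )  /* skip "." and ".." */
            printf("%s\n", e->d_name);
    closedir(d);
    return 0;
}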

-boris




Thread overview: 146+ messages
2016-09-27 15:56 [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
2016-09-27 15:56 ` [PATCH v2 01/30] xen/x86: move setup of the VM86 TSS to the domain builder Roger Pau Monne
2016-09-28 15:35   ` Jan Beulich
2016-09-29 12:57     ` Roger Pau Monne
2016-09-27 15:56 ` [PATCH v2 02/30] xen/x86: remove XENFEAT_hvm_pirqs for PVHv2 guests Roger Pau Monne
2016-09-28 16:03   ` Jan Beulich
2016-09-29 14:17     ` Roger Pau Monne
2016-09-29 16:07       ` Jan Beulich
2016-09-27 15:56 ` [PATCH v2 03/30] xen/x86: fix parameters and return value of *_set_allocation functions Roger Pau Monne
2016-09-28  9:34   ` Tim Deegan
2016-09-29 10:39   ` Jan Beulich
2016-09-29 14:33     ` Roger Pau Monne
2016-09-29 16:09       ` Jan Beulich
2016-09-30 16:48   ` George Dunlap
2016-10-03  8:05   ` Paul Durrant
2016-10-06 11:33     ` Roger Pau Monne
2016-09-27 15:56 ` [PATCH v2 04/30] xen/x86: allow calling {sh/hap}_set_allocation with the idle domain Roger Pau Monne
2016-09-29 10:43   ` Jan Beulich
2016-09-29 14:37     ` Roger Pau Monne
2016-09-29 16:10       ` Jan Beulich
2016-09-30 16:56   ` George Dunlap
2016-09-30 16:56     ` George Dunlap
2016-09-27 15:57 ` [PATCH v2 05/30] xen/x86: assert that local_events_need_delivery is not called by " Roger Pau Monne
2016-09-29 10:45   ` Jan Beulich
2016-09-30  8:32     ` Roger Pau Monne
2016-09-30  8:59       ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 06/30] x86/paging: introduce paging_set_allocation Roger Pau Monne
2016-09-29 10:51   ` Jan Beulich
2016-09-29 14:51     ` Roger Pau Monne
2016-09-29 16:12       ` Jan Beulich
2016-09-29 16:57         ` Roger Pau Monne
2016-09-30 17:00   ` George Dunlap
2016-09-27 15:57 ` [PATCH v2 07/30] xen/x86: split the setup of Dom0 permissions to a function Roger Pau Monne
2016-09-29 13:47   ` Jan Beulich
2016-09-29 15:53     ` Roger Pau Monne
2016-09-27 15:57 ` [PATCH v2 08/30] xen/x86: do the PCI scan unconditionally Roger Pau Monne
2016-09-29 13:55   ` Jan Beulich
2016-09-29 15:11     ` Roger Pau Monne
2016-09-29 16:14       ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 09/30] x86/vtd: fix and simplify mapping RMRR regions Roger Pau Monne
2016-09-29 14:18   ` Jan Beulich
2016-09-30 11:27     ` Roger Pau Monne
2016-09-30 13:21       ` Jan Beulich
2016-09-30 15:02         ` Roger Pau Monne
2016-09-30 15:09           ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 10/30] xen/x86: allow the emulated APICs to be enabled for the hardware domain Roger Pau Monne
2016-09-29 14:26   ` Jan Beulich
2016-09-30 15:44     ` Roger Pau Monne
2016-09-27 15:57 ` [PATCH v2 11/30] xen/x86: split Dom0 build into PV and PVHv2 Roger Pau Monne
2016-09-30 15:03   ` Jan Beulich
2016-10-03 10:09     ` Roger Pau Monne
2016-10-04  6:54       ` Jan Beulich
2016-10-04  7:09         ` Andrew Cooper
2016-09-27 15:57 ` [PATCH v2 12/30] xen/x86: make print_e820_memory_map global Roger Pau Monne
2016-09-30 15:04   ` Jan Beulich
2016-10-03 16:23     ` Roger Pau Monne
2016-10-04  6:47       ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 13/30] xen: introduce a new format specifier to print sizes in human-readable form Roger Pau Monne
2016-09-28  8:24   ` Juergen Gross
2016-09-28 11:56     ` Roger Pau Monne
2016-09-28 12:01       ` Andrew Cooper
2016-10-03  8:36   ` Paul Durrant
2016-10-11 10:27   ` Roger Pau Monne
2016-09-27 15:57 ` [PATCH v2 14/30] xen/mm: add a ceil suffix to current page calculation routine Roger Pau Monne
2016-09-30 15:20   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 15/30] xen/x86: populate PVHv2 Dom0 physical memory map Roger Pau Monne
2016-09-30 15:52   ` Jan Beulich
2016-10-04  9:12     ` Roger Pau Monne
2016-10-04 11:16       ` Jan Beulich
2016-10-11 14:01         ` Roger Pau Monne
2016-10-12 11:51           ` Jan Beulich
2016-10-11 14:06     ` Roger Pau Monne
2016-10-12 11:58       ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 16/30] xen/x86: parse Dom0 kernel for PVHv2 Roger Pau Monne
2016-10-06 15:14   ` Jan Beulich
2016-10-11 15:02     ` Roger Pau Monne
2016-09-27 15:57 ` [PATCH v2 17/30] xen/x86: setup PVHv2 Dom0 CPUs Roger Pau Monne
2016-10-06 15:20   ` Jan Beulich
2016-10-12 11:06     ` Roger Pau Monne
2016-10-12 11:32       ` Andrew Cooper
2016-10-12 12:02       ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 18/30] xen/x86: setup PVHv2 Dom0 ACPI tables Roger Pau Monne
2016-10-06 15:40   ` Jan Beulich
2016-10-06 15:48     ` Andrew Cooper
2016-10-12 15:35     ` Roger Pau Monne
2016-10-12 15:55       ` Jan Beulich
2016-10-26 11:35         ` Roger Pau Monne
2016-10-26 14:10           ` Jan Beulich
2016-10-26 15:08             ` Roger Pau Monne
2016-10-26 15:16               ` Jan Beulich
2016-10-26 16:03                 ` Roger Pau Monne
2016-10-27  7:25                   ` Jan Beulich
2016-10-27 11:08                     ` Roger Pau Monne
2016-10-26 17:14                 ` Boris Ostrovsky
2016-10-27  7:27                   ` Jan Beulich
2016-10-27 11:13                   ` Roger Pau Monne
2016-10-27 11:25                     ` Jan Beulich
2016-10-27 13:51                     ` Boris Ostrovsky
2016-10-27 14:02                       ` Jan Beulich
2016-10-27 14:15                         ` Boris Ostrovsky
2016-10-27 14:30                           ` Jan Beulich
2016-10-27 14:40                             ` Boris Ostrovsky
2016-10-27 15:04                               ` Roger Pau Monne
2016-10-27 15:20                                 ` Jan Beulich
2016-10-27 15:37                                   ` Roger Pau Monne
2016-10-28 13:51                                 ` Boris Ostrovsky
2016-09-27 15:57 ` [PATCH v2 19/30] xen/dpci: add a dpci passthrough handler for hardware domain Roger Pau Monne
2016-10-03  9:02   ` Paul Durrant
2016-10-06 14:31     ` Roger Pau Monne
2016-10-06 15:44   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 20/30] xen/x86: add the basic infrastructure to import QEMU passthrough code Roger Pau Monne
2016-10-03  9:54   ` Paul Durrant
2016-10-06 15:08     ` Roger Pau Monne
2016-10-06 15:52       ` Lars Kurth
2016-10-07  9:13       ` Jan Beulich
2016-10-06 15:47   ` Jan Beulich
2016-10-10 12:41   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 21/30] xen/pci: split code to size BARs from pci_add_device Roger Pau Monne
2016-10-06 16:00   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 22/30] xen/x86: support PVHv2 Dom0 BAR remapping Roger Pau Monne
2016-10-03 10:10   ` Paul Durrant
2016-10-06 15:25     ` Roger Pau Monne
2016-09-27 15:57 ` [PATCH v2 23/30] xen/x86: route legacy PCI interrupts to Dom0 Roger Pau Monne
2016-10-10 13:37   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 24/30] x86/vmsi: add MSI emulation for hardware domain Roger Pau Monne
2016-09-27 15:57 ` [PATCH v2 25/30] xen/x86: add all PCI devices to PVHv2 Dom0 Roger Pau Monne
2016-10-10 13:44   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 26/30] xen/x86: add PCIe emulation Roger Pau Monne
2016-10-03 10:46   ` Paul Durrant
2016-10-06 15:53     ` Roger Pau Monne
2016-10-10 13:57   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 27/30] x86/msixtbl: disable MSI-X intercepts for domains without an ioreq server Roger Pau Monne
2016-10-10 14:18   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 28/30] xen/x86: add MSI-X emulation to PVHv2 Dom0 Roger Pau Monne
2016-10-03 10:57   ` Paul Durrant
2016-10-06 15:58     ` Roger Pau Monne
2016-10-10 16:15   ` Jan Beulich
2016-09-27 15:57 ` [PATCH v2 29/30] xen/x86: allow PVHv2 to perform foreign memory mappings Roger Pau Monne
2016-09-30 17:36   ` George Dunlap
2016-10-10 14:21   ` Jan Beulich
2016-10-10 14:27     ` George Dunlap
2016-10-10 14:50       ` Jan Beulich
2016-10-10 14:58         ` George Dunlap
2016-09-27 15:57 ` [PATCH v2 30/30] xen: allow setting the store pfn HVM parameter Roger Pau Monne
2016-10-03 11:01   ` Paul Durrant
2016-09-28 12:22 ` [PATCH v2 00/30] PVHv2 Dom0 Roger Pau Monne
