* [PATCH v3 0/9] x86/PVH: Dom0 building adjustments
@ 2021-09-21  7:15 Jan Beulich
  2021-09-21  7:16 ` [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation Jan Beulich
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:15 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

In the course of fixing XSA-378 fallout I ran into a number of issues,
which this series tries to deal with. The majority of the changes are
largely independent of one another.

There was another rather basic issue to address first (the patch was
submitted separately as an RFC): vPCI wasn't aware of hidden PCI
devices, hitting an ASSERT(). Obviously I couldn't afford not having a
functioning serial console.

Compared to v2 there are a number of new changes here, and a
controversial one has been moved to the end. See individual
patches for details.

1: PVH: improve Dom0 memory size calculation
2: PV: properly set shadow allocation for Dom0
3: PVH: permit more physdevop-s to be used by Dom0
4: PVH: provide VGA console info to Dom0
5: PVH: actually show Dom0's register state from debug key '0'
6: HVM: convert hvm_virtual_to_linear_addr() to be remote-capable
7: PVH: actually show Dom0's stacks from debug key '0'
8: HVM: skip offline vCPU-s when dumping VMCBs/VMCSes
9: P2M: relax permissions of PVH Dom0's MMIO entries

Jan



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
@ 2021-09-21  7:16 ` Jan Beulich
  2021-09-22 11:59   ` Roger Pau Monné
  2021-09-21  7:17 ` [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0 Jan Beulich
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:16 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Assuming that the accounting for IOMMU page tables would also take care
of the P2M needs was wrong: dom0_paging_pages() can determine a far
higher value, high enough for the system to run out of memory while
setting up Dom0. Hence in the case of shared page tables the larger of
the two values needs to be used (without shared page tables, the sum of
both continues to apply).

While there also account for two further aspects in the PV case: With
"iommu=dom0-passthrough" no IOMMU page tables would get allocated, so
none need accounting for. And if shadow mode is to be enabled, setting
aside a suitable amount for the P2M pool to get populated is also
necessary (i.e. similar to the non-shared-page-tables case of PVH).
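
The resulting reservation rule can be condensed into a small standalone
sketch (plain C, not Xen code; the function name and parameters are
illustrative only):

```c
#include <assert.h>

/*
 * Sketch of the memory-reservation rule described above: with shared
 * page tables the P2M and IOMMU needs overlap, so only the larger of
 * the two estimates must be set aside; without sharing, both pools are
 * populated separately and hence their sum applies.
 */
static unsigned long pages_to_reserve(unsigned long p2m_pages,
                                      unsigned long iommu_pages,
                                      int shared_pt)
{
    if ( shared_pt )
        return p2m_pages > iommu_pages ? p2m_pages : iommu_pages;

    return p2m_pages + iommu_pages;
}
```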

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/dom0_build.c
+++ b/xen/arch/x86/dom0_build.c
@@ -318,7 +318,7 @@ unsigned long __init dom0_compute_nr_pag
     struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
 {
     nodeid_t node;
-    unsigned long avail = 0, nr_pages, min_pages, max_pages;
+    unsigned long avail = 0, nr_pages, min_pages, max_pages, iommu_pages = 0;
     bool need_paging;
 
     /* The ordering of operands is to work around a clang5 issue. */
@@ -337,18 +337,23 @@ unsigned long __init dom0_compute_nr_pag
         avail -= d->max_vcpus - 1;
 
     /* Reserve memory for iommu_dom0_init() (rough estimate). */
-    if ( is_iommu_enabled(d) )
+    if ( is_iommu_enabled(d) && !iommu_hwdom_passthrough )
     {
         unsigned int s;
 
         for ( s = 9; s < BITS_PER_LONG; s += 9 )
-            avail -= max_pdx >> s;
+            iommu_pages += max_pdx >> s;
+
+        avail -= iommu_pages;
     }
 
-    need_paging = is_hvm_domain(d) &&
-        (!iommu_use_hap_pt(d) || !paging_mode_hap(d));
+    need_paging = is_hvm_domain(d)
+                  ? !iommu_use_hap_pt(d) || !paging_mode_hap(d)
+                  : opt_dom0_shadow;
     for ( ; ; need_paging = false )
     {
+        unsigned long paging_pages;
+
         nr_pages = get_memsize(&dom0_size, avail);
         min_pages = get_memsize(&dom0_min_size, avail);
         max_pages = get_memsize(&dom0_max_size, avail);
@@ -377,11 +382,20 @@ unsigned long __init dom0_compute_nr_pag
         nr_pages = min(nr_pages, max_pages);
         nr_pages = min(nr_pages, avail);
 
-        if ( !need_paging )
-            break;
+        paging_pages = paging_mode_enabled(d) || need_paging
+                       ? dom0_paging_pages(d, nr_pages) : 0;
 
         /* Reserve memory for shadow or HAP. */
-        avail -= dom0_paging_pages(d, nr_pages);
+        if ( !need_paging )
+        {
+            if ( paging_pages <= iommu_pages )
+                break;
+
+            avail -= paging_pages - iommu_pages;
+        }
+        else
+            avail -= paging_pages;
+        iommu_pages = paging_pages;
     }
 
     if ( is_pv_domain(d) &&
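
The rough IOMMU page-table estimate in the hunk above is a geometric
series: a 4k page-table page holds 512 (2^9) entries, so successive
paging levels need about max_pdx >> 9, max_pdx >> 18, etc. pages. A
minimal standalone illustration (plain C; the helper name is invented,
and BITS_PER_LONG is fixed at 64 here for simplicity):

```c
#include <assert.h>

#define BITS_PER_LONG 64

/*
 * Estimate the number of page-table pages needed to map max_pdx
 * frames: each level holds 512 entries per page, so every further
 * level shrinks the requirement by another factor of 2^9.
 */
static unsigned long est_pt_pages(unsigned long max_pdx)
{
    unsigned long pages = 0;
    unsigned int s;

    for ( s = 9; s < BITS_PER_LONG; s += 9 )
        pages += max_pdx >> s;

    return pages;
}
```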




* [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
  2021-09-21  7:16 ` [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation Jan Beulich
@ 2021-09-21  7:17 ` Jan Beulich
  2021-09-22 13:01   ` Roger Pau Monné
  2021-09-22 13:31   ` Andrew Cooper
  2021-09-21  7:17 ` [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0 Jan Beulich
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:17 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, Tim Deegan

Leaving shadow setup just to the L1TF tasklet means running Dom0 on a
minimally acceptable shadow memory pool, rather than what normally
would be used (also, for example, for PVH). Populate the pool before
triggering the tasklet, on a best effort basis (again like done for
PVH).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
---
v2: Latch dom0_paging_pages() result.

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1298,7 +1298,7 @@ int shadow_set_allocation(struct domain
 {
     struct page_info *sp;
 
-    ASSERT(paging_locked_by_me(d));
+    ASSERT(paging_locked_by_me(d) || system_state < SYS_STATE_active);
 
     if ( pages > 0 )
     {
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -21,6 +21,7 @@
 #include <asm/page.h>
 #include <asm/pv/mm.h>
 #include <asm/setup.h>
+#include <asm/shadow.h>
 
 /* Allow ring-3 access in long mode as guest cannot use ring 1 ... */
 #define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER)
@@ -933,7 +934,18 @@ int __init dom0_construct_pv(struct doma
 #ifdef CONFIG_SHADOW_PAGING
     if ( opt_dom0_shadow )
     {
+        bool preempted;
+
         printk("Switching dom0 to using shadow paging\n");
+
+        nr_pt_pages = dom0_paging_pages(d, nr_pages);
+
+        do {
+            preempted = false;
+            shadow_set_allocation(d, nr_pt_pages, &preempted);
+            process_pending_softirqs();
+        } while ( preempted );
+
         tasklet_schedule(&d->arch.paging.shadow.pv_l1tf_tasklet);
     }
 #endif
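
The retry loop added above follows a common Xen pattern: attempt the
allocation, let pending softirqs run, and repeat for as long as the
operation reports preemption. A simplified standalone model of that
control flow (plain C; all names here are invented for illustration):

```c
#include <assert.h>

/*
 * Model of a preemptible operation: it does at most 'chunk' units of
 * work per call and reports whether it had to stop early.
 */
static void set_allocation(unsigned long *pool, unsigned long target,
                           unsigned long chunk, int *preempted)
{
    unsigned long todo = target - *pool;

    if ( todo > chunk )
    {
        *pool += chunk;
        *preempted = 1;   /* more work remains */
    }
    else
        *pool += todo;
}

/* Best-effort population loop, mirroring the do/while in the patch. */
static unsigned long populate(unsigned long target)
{
    unsigned long pool = 0;
    int preempted;

    do {
        preempted = 0;
        set_allocation(&pool, target, 128, &preempted);
        /* in Xen: process_pending_softirqs(); */
    } while ( preempted );

    return pool;
}
```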




* [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
  2021-09-21  7:16 ` [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation Jan Beulich
  2021-09-21  7:17 ` [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0 Jan Beulich
@ 2021-09-21  7:17 ` Jan Beulich
  2021-09-22 14:22   ` Roger Pau Monné
  2021-09-21  7:18 ` [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0 Jan Beulich
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:17 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Certain notifications from Dom0 to Xen are independent of the mode Dom0
is running in. Permit further PCI-related ones (only their modern
forms). Also include the USB2 debug port operation on this occasion.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm uncertain about the has_vpci() part of the check: I would think
is_hardware_domain() is both sufficient and concise. Without vPCI a PVH
Dom0 won't see any PCI devices in the first place (and hence would
effectively be non-functioning). Dropping this would in particular make
PHYSDEVOP_dbgp_op better fit in the mix.
---
v3: New.

--- a/xen/arch/x86/hvm/hypercall.c
+++ b/xen/arch/x86/hvm/hypercall.c
@@ -94,6 +94,12 @@ static long hvm_physdev_op(int cmd, XEN_
         break;
 
     case PHYSDEVOP_pci_mmcfg_reserved:
+    case PHYSDEVOP_pci_device_add:
+    case PHYSDEVOP_pci_device_remove:
+    case PHYSDEVOP_restore_msi_ext:
+    case PHYSDEVOP_dbgp_op:
+    case PHYSDEVOP_prepare_msix:
+    case PHYSDEVOP_release_msix:
         if ( !has_vpci(currd) || !is_hardware_domain(currd) )
             return -ENOSYS;
         break;




* [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
                   ` (2 preceding siblings ...)
  2021-09-21  7:17 ` [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0 Jan Beulich
@ 2021-09-21  7:18 ` Jan Beulich
  2021-09-22 15:01   ` Roger Pau Monné
  2021-09-21  7:19 ` [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0' Jan Beulich
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:18 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

Like for PV Dom0, in order to use the console in a mode other than
80x25 text, the kernel needs to be provided with information about that
mode. Bump HVM start info's "current" version to 2 and use a previously
reserved 32-bit field to provide struct dom0_vga_console_info's
position and size.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: New.

--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -19,6 +19,7 @@
  */
 
 #include <xen/acpi.h>
+#include <xen/console.h>
 #include <xen/init.h>
 #include <xen/libelf.h>
 #include <xen/multiboot.h>
@@ -549,6 +550,11 @@ static int __init pvh_load_kernel(struct
     paddr_t last_addr;
     struct hvm_start_info start_info = { 0 };
     struct hvm_modlist_entry mod = { 0 };
+#ifdef CONFIG_VIDEO
+    struct dom0_vga_console_info vga_info = { 0 };
+#else
+    struct {} __maybe_unused vga_info;
+#endif
     struct vcpu *v = d->vcpu[0];
     int rc;
 
@@ -598,7 +604,7 @@ static int __init pvh_load_kernel(struct
      * split into smaller allocations, done as a single region in order to
      * simplify it.
      */
-    last_addr = find_memory(d, &elf, sizeof(start_info) +
+    last_addr = find_memory(d, &elf, sizeof(start_info) + sizeof(vga_info) +
                             (initrd ? ROUNDUP(initrd->mod_end, PAGE_SIZE) +
                                       sizeof(mod)
                                     : 0) +
@@ -672,6 +678,22 @@ static int __init pvh_load_kernel(struct
         last_addr += sizeof(mod);
     }
 
+#ifdef CONFIG_VIDEO
+    if ( fill_console_start_info(&vga_info) )
+    {
+        rc = hvm_copy_to_guest_phys(last_addr + sizeof(start_info),
+                                    &vga_info, sizeof(vga_info), v);
+        if ( !rc )
+        {
+            start_info.version = 2;
+            start_info.vga_info.offset = sizeof(start_info);
+            start_info.vga_info.size = sizeof(vga_info);
+        }
+        else
+            printk("Unable to copy VGA info to guest\n");
+    }
+#endif
+
     start_info.magic = XEN_HVM_START_MAGIC_VALUE;
     start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
     rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info), v);
--- a/xen/include/public/arch-x86/hvm/start_info.h
+++ b/xen/include/public/arch-x86/hvm/start_info.h
@@ -33,7 +33,7 @@
  *    | magic          | Contains the magic value XEN_HVM_START_MAGIC_VALUE
  *    |                | ("xEn3" with the 0x80 bit of the "E" set).
  *  4 +----------------+
- *    | version        | Version of this structure. Current version is 1. New
+ *    | version        | Version of this structure. Current version is 2. New
  *    |                | versions are guaranteed to be backwards-compatible.
  *  8 +----------------+
  *    | flags          | SIF_xxx flags.
@@ -55,7 +55,15 @@
  *    |                | if there is no memory map being provided. Only
  *    |                | present in version 1 and newer of the structure.
  * 52 +----------------+
- *    | reserved       | Version 1 and newer only.
+ *    | vga_info.offset| Offset of struct dom0_vga_console_info from base of
+ *    |                | struct hvm_start_info. Optional and only present in
+ *    |                | version 2 and newer of the structure when
+ *    |                | SIF_INITDOMAIN is set; zero if absent.
+ * 54 +----------------+
+ *    | vga_info.size  | Size of present parts of struct dom0_vga_console_info.
+ *    |                | Optional and only present in version 2 and newer of
+ *    |                | the structure when SIF_INITDOMAIN is set; zero if
+ *    |                | absent.
  * 56 +----------------+
  *
  * The layout of each entry in the module structure is the following:
@@ -139,7 +147,15 @@ struct hvm_start_info {
     uint32_t memmap_entries;    /* Number of entries in the memmap table.    */
                                 /* Value will be zero if there is no memory  */
                                 /* map being provided.                       */
-    uint32_t reserved;          /* Must be zero.                             */
+    /*
+     * The following sub-structure is only present in version 2 and newer
+     * when SIF_INITDOMAIN is set. It is reserved in version 1 or when
+     * SIF_INITDOMAIN is clear, and absent in version 0.
+     */
+    struct {                    /* Coord-s of struct dom0_vga_console_info.  */
+        uint16_t offset;        /* ... from base of struct hvm_start_info.   */
+        uint16_t size;          /* ... of present parts of the struct.       */
+    } vga_info;
 };
 
 struct hvm_modlist_entry {
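
A mirrored layout check (hypothetical consumer-side C, not part of the
patch) confirms that the vga_info sub-structure reuses the former
reserved field at byte offset 52 and keeps the overall structure at 56
bytes; the lookup helper models how a version-2-aware guest might
locate the console info:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct hvm_start_info as laid out in the hunk above. */
struct hvm_start_info_v2 {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;
    uint32_t nr_modules;
    uint64_t modlist_paddr;
    uint64_t cmdline_paddr;
    uint64_t rsdp_paddr;
    uint64_t memmap_paddr;
    uint32_t memmap_entries;
    struct {                /* formerly "reserved" */
        uint16_t offset;
        uint16_t size;
    } vga_info;
};

/* A guest would locate struct dom0_vga_console_info like this. */
static const void *locate_vga_info(const struct hvm_start_info_v2 *si)
{
    if ( si->version < 2 || !si->vga_info.offset )
        return NULL;

    return (const uint8_t *)si + si->vga_info.offset;
}
```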




* [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0'
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
                   ` (3 preceding siblings ...)
  2021-09-21  7:18 ` [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0 Jan Beulich
@ 2021-09-21  7:19 ` Jan Beulich
  2021-09-22 15:48   ` Roger Pau Monné
  2021-09-21  7:19 ` [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable Jan Beulich
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

vcpu_show_registers() didn't do anything for HVM so far. Note though
that some extra hackery is needed for VMX - see the code comment.

Note further that the show_guest_stack() invocation is left alone here:
While strictly speaking guest_kernel_mode() should be predicated by a
PV / !HVM check, show_guest_stack() itself will bail immediately for
HVM.

While there, and despite this not being PVH-specific, take the
opportunity to filter out offline vCPU-s: There's not really any
register state associated with them, so avoid spamming the log with
useless information while still leaving an indication of the fact.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I was pondering whether to also have the VMCS/VMCB dumped for every
vCPU, to present full state. The downside is that for larger systems
this would be a lot of output.
---
v2: New.

--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -631,6 +631,12 @@ void vcpu_show_execution_state(struct vc
 {
     unsigned long flags;
 
+    if ( test_bit(_VPF_down, &v->pause_flags) )
+    {
+        printk("*** %pv is offline ***\n", v);
+        return;
+    }
+
     printk("*** Dumping Dom%d vcpu#%d state: ***\n",
            v->domain->domain_id, v->vcpu_id);
 
@@ -642,6 +648,21 @@ void vcpu_show_execution_state(struct vc
 
     vcpu_pause(v); /* acceptably dangerous */
 
+#ifdef CONFIG_HVM
+    /*
+     * For VMX special care is needed: Reading some of the register state will
+     * require VMCS accesses. Engaging foreign VMCSes involves acquiring of a
+     * lock, which check_lock() would object to when done from an IRQs-disabled
+     * region. Despite this being a layering violation, engage the VMCS right
+     * here. This then also avoids doing so several times in close succession.
+     */
+    if ( cpu_has_vmx && is_hvm_vcpu(v) )
+    {
+        ASSERT(!in_irq());
+        vmx_vmcs_enter(v);
+    }
+#endif
+
     /* Prevent interleaving of output. */
     flags = console_lock_recursive_irqsave();
 
@@ -651,6 +672,11 @@ void vcpu_show_execution_state(struct vc
 
     console_unlock_recursive_irqrestore(flags);
 
+#ifdef CONFIG_HVM
+    if ( cpu_has_vmx && is_hvm_vcpu(v) )
+        vmx_vmcs_exit(v);
+#endif
+
     vcpu_unpause(v);
 }
 
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -49,6 +49,39 @@ static void read_registers(struct cpu_us
     crs[7] = read_gs_shadow();
 }
 
+static void get_hvm_registers(struct vcpu *v, struct cpu_user_regs *regs,
+                              unsigned long crs[8])
+{
+    struct segment_register sreg;
+
+    crs[0] = v->arch.hvm.guest_cr[0];
+    crs[2] = v->arch.hvm.guest_cr[2];
+    crs[3] = v->arch.hvm.guest_cr[3];
+    crs[4] = v->arch.hvm.guest_cr[4];
+
+    hvm_get_segment_register(v, x86_seg_cs, &sreg);
+    regs->cs = sreg.sel;
+
+    hvm_get_segment_register(v, x86_seg_ds, &sreg);
+    regs->ds = sreg.sel;
+
+    hvm_get_segment_register(v, x86_seg_es, &sreg);
+    regs->es = sreg.sel;
+
+    hvm_get_segment_register(v, x86_seg_fs, &sreg);
+    regs->fs = sreg.sel;
+    crs[5] = sreg.base;
+
+    hvm_get_segment_register(v, x86_seg_gs, &sreg);
+    regs->gs = sreg.sel;
+    crs[6] = sreg.base;
+
+    hvm_get_segment_register(v, x86_seg_ss, &sreg);
+    regs->ss = sreg.sel;
+
+    crs[7] = hvm_get_shadow_gs_base(v);
+}
+
 static void _show_registers(
     const struct cpu_user_regs *regs, unsigned long crs[8],
     enum context context, const struct vcpu *v)
@@ -99,27 +132,8 @@ void show_registers(const struct cpu_use
 
     if ( guest_mode(regs) && is_hvm_vcpu(v) )
     {
-        struct segment_register sreg;
+        get_hvm_registers(v, &fault_regs, fault_crs);
         context = CTXT_hvm_guest;
-        fault_crs[0] = v->arch.hvm.guest_cr[0];
-        fault_crs[2] = v->arch.hvm.guest_cr[2];
-        fault_crs[3] = v->arch.hvm.guest_cr[3];
-        fault_crs[4] = v->arch.hvm.guest_cr[4];
-        hvm_get_segment_register(v, x86_seg_cs, &sreg);
-        fault_regs.cs = sreg.sel;
-        hvm_get_segment_register(v, x86_seg_ds, &sreg);
-        fault_regs.ds = sreg.sel;
-        hvm_get_segment_register(v, x86_seg_es, &sreg);
-        fault_regs.es = sreg.sel;
-        hvm_get_segment_register(v, x86_seg_fs, &sreg);
-        fault_regs.fs = sreg.sel;
-        fault_crs[5] = sreg.base;
-        hvm_get_segment_register(v, x86_seg_gs, &sreg);
-        fault_regs.gs = sreg.sel;
-        fault_crs[6] = sreg.base;
-        hvm_get_segment_register(v, x86_seg_ss, &sreg);
-        fault_regs.ss = sreg.sel;
-        fault_crs[7] = hvm_get_shadow_gs_base(v);
     }
     else
     {
@@ -159,24 +173,35 @@ void show_registers(const struct cpu_use
 void vcpu_show_registers(const struct vcpu *v)
 {
     const struct cpu_user_regs *regs = &v->arch.user_regs;
-    bool kernel = guest_kernel_mode(v, regs);
+    struct cpu_user_regs aux_regs;
+    enum context context;
     unsigned long crs[8];
 
-    /* Only handle PV guests for now */
-    if ( !is_pv_vcpu(v) )
-        return;
-
-    crs[0] = v->arch.pv.ctrlreg[0];
-    crs[2] = arch_get_cr2(v);
-    crs[3] = pagetable_get_paddr(kernel ?
-                                 v->arch.guest_table :
-                                 v->arch.guest_table_user);
-    crs[4] = v->arch.pv.ctrlreg[4];
-    crs[5] = v->arch.pv.fs_base;
-    crs[6 + !kernel] = v->arch.pv.gs_base_kernel;
-    crs[7 - !kernel] = v->arch.pv.gs_base_user;
+    if ( is_hvm_vcpu(v) )
+    {
+        aux_regs = *regs;
+        get_hvm_registers(v->domain->vcpu[v->vcpu_id], &aux_regs, crs);
+        regs = &aux_regs;
+        context = CTXT_hvm_guest;
+    }
+    else
+    {
+        bool kernel = guest_kernel_mode(v, regs);
+
+        crs[0] = v->arch.pv.ctrlreg[0];
+        crs[2] = arch_get_cr2(v);
+        crs[3] = pagetable_get_paddr(kernel ?
+                                     v->arch.guest_table :
+                                     v->arch.guest_table_user);
+        crs[4] = v->arch.pv.ctrlreg[4];
+        crs[5] = v->arch.pv.fs_base;
+        crs[6 + !kernel] = v->arch.pv.gs_base_kernel;
+        crs[7 - !kernel] = v->arch.pv.gs_base_user;
+
+        context = CTXT_pv_guest;
+    }
 
-    _show_registers(regs, crs, CTXT_pv_guest, v);
+    _show_registers(regs, crs, context, v);
 }
 
 void show_page_walk(unsigned long addr)




* [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
                   ` (4 preceding siblings ...)
  2021-09-21  7:19 ` [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0' Jan Beulich
@ 2021-09-21  7:19 ` Jan Beulich
  2021-09-23  8:09   ` Roger Pau Monné
  2021-09-21  7:20 ` [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0' Jan Beulich
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:19 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

While all present callers want to act on "current", stack dumping for
HVM vCPU-s will require the function to be able to act on a remote
vCPU. To avoid touching all present callers, convert the existing
function to an inline wrapper around the extended new one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
Alternatively the actual dumping patch could avoid using this more
elaborate function and, ignoring access checks, simply add in the SS
segment base itself (if needed in the first place).
---
v3: New.

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2526,7 +2526,8 @@ int hvm_set_cr4(unsigned long value, boo
     return X86EMUL_OKAY;
 }
 
-bool_t hvm_virtual_to_linear_addr(
+bool hvm_vcpu_virtual_to_linear(
+    struct vcpu *v,
     enum x86_segment seg,
     const struct segment_register *reg,
     unsigned long offset,
@@ -2535,8 +2536,9 @@ bool_t hvm_virtual_to_linear_addr(
     const struct segment_register *active_cs,
     unsigned long *linear_addr)
 {
-    const struct vcpu *curr = current;
     unsigned long addr = offset, last_byte;
+    const struct cpu_user_regs *regs = v == current ? guest_cpu_user_regs()
+                                                    : &v->arch.user_regs;
     bool_t okay = 0;
 
     /*
@@ -2547,7 +2549,7 @@ bool_t hvm_virtual_to_linear_addr(
      */
     ASSERT(seg < x86_seg_none);
 
-    if ( !(curr->arch.hvm.guest_cr[0] & X86_CR0_PE) )
+    if ( !(v->arch.hvm.guest_cr[0] & X86_CR0_PE) )
     {
         /*
          * REAL MODE: Don't bother with segment access checks.
@@ -2555,7 +2557,7 @@ bool_t hvm_virtual_to_linear_addr(
          */
         addr = (uint32_t)(addr + reg->base);
     }
-    else if ( (guest_cpu_user_regs()->eflags & X86_EFLAGS_VM) &&
+    else if ( (regs->eflags & X86_EFLAGS_VM) &&
               is_x86_user_segment(seg) )
     {
         /* VM86 MODE: Fixed 64k limits on all user segments. */
@@ -2564,7 +2566,7 @@ bool_t hvm_virtual_to_linear_addr(
         if ( max(offset, last_byte) >> 16 )
             goto out;
     }
-    else if ( hvm_long_mode_active(curr) &&
+    else if ( hvm_long_mode_active(v) &&
               (is_x86_system_segment(seg) || active_cs->l) )
     {
         /*
@@ -2636,7 +2638,7 @@ bool_t hvm_virtual_to_linear_addr(
         else if ( last_byte > reg->limit )
             goto out; /* last byte is beyond limit */
         else if ( last_byte < offset &&
-                  curr->domain->arch.cpuid->x86_vendor == X86_VENDOR_AMD )
+                  v->domain->arch.cpuid->x86_vendor == X86_VENDOR_AMD )
             goto out; /* access wraps */
     }
 
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -314,7 +314,9 @@ enum hvm_access_type {
     hvm_access_read,
     hvm_access_write
 };
-bool_t hvm_virtual_to_linear_addr(
+
+bool hvm_vcpu_virtual_to_linear(
+    struct vcpu *v,
     enum x86_segment seg,
     const struct segment_register *reg,
     unsigned long offset,
@@ -323,6 +325,19 @@ bool_t hvm_virtual_to_linear_addr(
     const struct segment_register *active_cs,
     unsigned long *linear_addr);
 
+static inline bool hvm_virtual_to_linear_addr(
+    enum x86_segment seg,
+    const struct segment_register *reg,
+    unsigned long offset,
+    unsigned int bytes,
+    enum hvm_access_type access_type,
+    const struct segment_register *active_cs,
+    unsigned long *linear)
+{
+    return hvm_vcpu_virtual_to_linear(current, seg, reg, offset, bytes,
+                                      access_type, active_cs, linear);
+}
+
 void *hvm_map_guest_frame_rw(unsigned long gfn, bool_t permanent,
                              bool_t *writable);
 void *hvm_map_guest_frame_ro(unsigned long gfn, bool_t permanent);
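
The key behavioral change is where the register context comes from: the
running vCPU's registers live in the stack frame guest_cpu_user_regs()
returns, while a paused remote vCPU's registers were saved into
v->arch.user_regs. A condensed model of that selection (plain C; all
types and globals are stand-ins for illustration only):

```c
#include <assert.h>

struct regs { unsigned long eflags; };
struct vcpu { struct regs saved; };        /* models v->arch.user_regs */

static struct vcpu *current_vcpu;          /* models "current" */
static struct regs stack_regs;             /* models guest_cpu_user_regs() */

/* Pick the live stack frame for "current", the saved copy otherwise. */
static const struct regs *vcpu_regs(const struct vcpu *v)
{
    return v == current_vcpu ? &stack_regs : &v->saved;
}
```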




* [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0'
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
                   ` (5 preceding siblings ...)
  2021-09-21  7:19 ` [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable Jan Beulich
@ 2021-09-21  7:20 ` Jan Beulich
  2021-09-23 10:31   ` Roger Pau Monné
  2021-09-21  7:20 ` [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes Jan Beulich
  2021-09-21  7:21 ` [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries Jan Beulich
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné

show_guest_stack() does nothing for HVM. Introduce a HVM-specific
dumping function, paralleling the 64- and 32-bit PV ones. We don't know
the real stack size, so only dump up to the next page boundary.

Rather than adding a vcpu parameter to hvm_copy_from_guest_linear(),
introduce hvm_copy_from_vcpu_linear() which - for now at least - in
return won't need a "pfinfo" parameter.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
TBD: The bypassing of the output interleaving avoidance isn't nice, but
     I've not been able to think of an alternative. Avoiding the call to
     hvm_vcpu_virtual_to_linear() would be in principle possible (adding
     in the SS base directly), but one way or another we need to access
     guest memory and hence can't sensibly avoid using the P2M layer.
     However, commit 0996e0f38540 ("x86/traps: prevent interleaving of
     concurrent cpu state dumps") introduced this logic here while
     really only talking about show_execution_state().
     vcpu_show_execution_state() is imo much less prone to interleaving
     of its output: Its uses from the keyhandler are sequential already
     anyway, and the only other use is from hvm_triple_fault(). Instead
     of making the locking conditional, it may therefore be an option to
     drop it again altogether.
TBD: For now this dumps also user mode stacks. We may want to restrict
     this.
TBD: An alternative to putting this next to {,compat_}show_guest_stack()
     is to put it in hvm.c, eliminating the need to introduce
     hvm_copy_from_vcpu_linear(), but then requiring extra parameters to
     be passed.
TBD: Technically this makes unnecessary the earlier added entering/
     leaving of the VMCS. Yet, to avoid a series of non-trivial
     enter/exit pairs, I think leaving that in is still beneficial. In
     which case perhaps merely the associated comment here may want
     tweaking.
---
v3: New.

--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -3408,6 +3408,15 @@ enum hvm_translation_result hvm_copy_fro
                       PFEC_page_present | pfec, pfinfo);
 }
 
+enum hvm_translation_result hvm_copy_from_vcpu_linear(
+    void *buf, unsigned long addr, unsigned int size, struct vcpu *v,
+    unsigned int pfec)
+{
+    return __hvm_copy(buf, addr, size, v,
+                      HVMCOPY_from_guest | HVMCOPY_linear,
+                      PFEC_page_present | pfec, NULL);
+}
+
 unsigned int copy_to_user_hvm(void *to, const void *from, unsigned int len)
 {
     int rc;
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -364,6 +364,71 @@ static void show_guest_stack(struct vcpu
     printk("\n");
 }
 
+static void show_hvm_stack(struct vcpu *v, const struct cpu_user_regs *regs)
+{
+#ifdef CONFIG_HVM
+    unsigned long sp = regs->rsp, addr;
+    unsigned int i, bytes, words_per_line, pfec = PFEC_page_present;
+    struct segment_register ss, cs;
+
+    hvm_get_segment_register(v, x86_seg_ss, &ss);
+    hvm_get_segment_register(v, x86_seg_cs, &cs);
+
+    if ( hvm_long_mode_active(v) && cs.l )
+        i = 16, bytes = 8;
+    else
+    {
+        sp = ss.db ? (uint32_t)sp : (uint16_t)sp;
+        i = ss.db ? 8 : 4;
+        bytes = cs.db ? 4 : 2;
+    }
+
+    if ( bytes == 8 || (ss.db && !ss.base) )
+        printk("Guest stack trace from sp=%0*lx:", i, sp);
+    else
+        printk("Guest stack trace from ss:sp=%04x:%0*lx:", ss.sel, i, sp);
+
+    if ( !hvm_vcpu_virtual_to_linear(v, x86_seg_ss, &ss, sp, bytes,
+                                     hvm_access_read, &cs, &addr) )
+    {
+        printk(" Guest-inaccessible memory\n");
+        return;
+    }
+
+    if ( ss.dpl == 3 )
+        pfec |= PFEC_user_mode;
+
+    words_per_line = stack_words_per_line * (sizeof(void *) / bytes);
+    for ( i = 0; i < debug_stack_lines * words_per_line; )
+    {
+        unsigned long val = 0;
+
+        if ( (addr ^ (addr + bytes - 1)) & PAGE_SIZE )
+            break;
+
+        if ( !(i++ % words_per_line) )
+            printk("\n  ");
+
+        if ( hvm_copy_from_vcpu_linear(&val, addr, bytes, v,
+                                       pfec) != HVMTRANS_okay )
+        {
+            printk(" Fault while accessing guest memory.");
+            break;
+        }
+
+        printk(" %0*lx", 2 * bytes, val);
+
+        addr += bytes;
+        if ( !(addr & (PAGE_SIZE - 1)) )
+            break;
+    }
+
+    if ( !i )
+        printk(" Stack empty.");
+    printk("\n");
+#endif
+}
+
 /*
  * Notes for get_{stack,shstk}*_bottom() helpers
  *
@@ -629,7 +694,7 @@ void show_execution_state(const struct c
 
 void vcpu_show_execution_state(struct vcpu *v)
 {
-    unsigned long flags;
+    unsigned long flags = 0;
 
     if ( test_bit(_VPF_down, &v->pause_flags) )
     {
@@ -663,14 +728,22 @@ void vcpu_show_execution_state(struct vc
     }
 #endif
 
-    /* Prevent interleaving of output. */
-    flags = console_lock_recursive_irqsave();
+    /*
+     * Prevent interleaving of output if possible. For HVM we can't do so, as
+     * the necessary P2M lookups involve locking, which has to occur with IRQs
+     * enabled.
+     */
+    if ( !is_hvm_vcpu(v) )
+        flags = console_lock_recursive_irqsave();
 
     vcpu_show_registers(v);
-    if ( guest_kernel_mode(v, &v->arch.user_regs) )
+    if ( is_hvm_vcpu(v) )
+        show_hvm_stack(v, &v->arch.user_regs);
+    else if ( guest_kernel_mode(v, &v->arch.user_regs) )
         show_guest_stack(v, &v->arch.user_regs);
 
-    console_unlock_recursive_irqrestore(flags);
+    if ( !is_hvm_vcpu(v) )
+        console_unlock_recursive_irqrestore(flags);
 
 #ifdef CONFIG_HVM
     if ( cpu_has_vmx && is_hvm_vcpu(v) )
--- a/xen/include/asm-x86/hvm/support.h
+++ b/xen/include/asm-x86/hvm/support.h
@@ -101,6 +101,9 @@ enum hvm_translation_result hvm_copy_to_
 enum hvm_translation_result hvm_copy_from_guest_linear(
     void *buf, unsigned long addr, unsigned int size, uint32_t pfec,
     pagefault_info_t *pfinfo);
+enum hvm_translation_result hvm_copy_from_vcpu_linear(
+    void *buf, unsigned long addr, unsigned int size, struct vcpu *v,
+    unsigned int pfec);
 
 /*
  * Get a reference on the page under an HVM physical or linear address.  If



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
                   ` (6 preceding siblings ...)
  2021-09-21  7:20 ` [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0' Jan Beulich
@ 2021-09-21  7:20 ` Jan Beulich
  2021-09-23  8:23   ` Roger Pau Monné
  2021-09-21  7:21 ` [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries Jan Beulich
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:20 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima

There's not really any register state associated with offline vCPU-s, so
avoid spamming the log with largely useless information while still
leaving an indication of the fact.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/hvm/svm/vmcb.c
+++ b/xen/arch/x86/hvm/svm/vmcb.c
@@ -242,6 +242,11 @@ static void vmcb_dump(unsigned char ch)
         printk("\n>>> Domain %d <<<\n", d->domain_id);
         for_each_vcpu ( d, v )
         {
+            if ( test_bit(_VPF_down, &v->pause_flags) )
+            {
+                printk("\tVCPU %u: offline\n", v->vcpu_id);
+                continue;
+            }
             printk("\tVCPU %d\n", v->vcpu_id);
             svm_vmcb_dump("key_handler", v->arch.hvm.svm.vmcb);
         }
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -2133,6 +2133,11 @@ static void vmcs_dump(unsigned char ch)
         printk("\n>>> Domain %d <<<\n", d->domain_id);
         for_each_vcpu ( d, v )
         {
+            if ( test_bit(_VPF_down, &v->pause_flags) )
+            {
+                printk("\tVCPU %u: offline\n", v->vcpu_id);
+                continue;
+            }
             printk("\tVCPU %d\n", v->vcpu_id);
             vmcs_dump_vcpu(v);
         }




* [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
                   ` (7 preceding siblings ...)
  2021-09-21  7:20 ` [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes Jan Beulich
@ 2021-09-21  7:21 ` Jan Beulich
  2021-09-23 11:10   ` Roger Pau Monné
  8 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-21  7:21 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap

To become independent of the sequence of mapping operations, permit
"access" to accumulate for Dom0, noting that there's not going to be an
introspection agent for it which this might interfere with. While e.g.
ideally only ROM regions would get mapped with X set, getting there is
quite a bit of work. Plus the use of p2m_access_* here is abusive in the
first place.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v3: Move last in series, for being controversial.
v2: Split off from original patch. Accumulate all of R, W, and X.

--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
             return -EPERM;
         }
 
+        /*
+         * Gross bodge, to go away again rather sooner than later:
+         *
+         * For MMIO allow access permissions to accumulate, but only for Dom0.
+         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
+         * the way they specify "access", this will allow the ultimate result
+         * to be independent of the sequence of operations.
+         */
+        if ( is_hardware_domain(d) && gfn_p2mt == p2m_mmio_direct &&
+             access <= p2m_access_rwx && a <= p2m_access_rwx )
+            access |= a;
+
         if ( access == a )
         {
             gfn_unlock(p2m, gfn, order);




* Re: [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation
  2021-09-21  7:16 ` [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation Jan Beulich
@ 2021-09-22 11:59   ` Roger Pau Monné
  2021-09-29 10:53     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-22 11:59 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Sep 21, 2021 at 09:16:44AM +0200, Jan Beulich wrote:
> Assuming that the accounting for IOMMU page tables will also take care
> of the P2M needs was wrong: dom0_paging_pages() can determine a far
> higher value, high enough for the system to run out of memory while
> setting up Dom0. Hence in the case of shared page tables the larger of
> the two values needs to be used (without shared page tables the sum of
> both continues to be applicable).
> 
> While there also account for two further aspects in the PV case: With
> "iommu=dom0-passthrough" no IOMMU page tables would get allocated, so
> none need accounting for. And if shadow mode is to be enabled, setting
> aside a suitable amount for the P2M pool to get populated is also
> necessary (i.e. similar to the non-shared-page-tables case of PVH).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> --- a/xen/arch/x86/dom0_build.c
> +++ b/xen/arch/x86/dom0_build.c
> @@ -318,7 +318,7 @@ unsigned long __init dom0_compute_nr_pag
>      struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
>  {
>      nodeid_t node;
> -    unsigned long avail = 0, nr_pages, min_pages, max_pages;
> +    unsigned long avail = 0, nr_pages, min_pages, max_pages, iommu_pages = 0;
>      bool need_paging;
>  
>      /* The ordering of operands is to work around a clang5 issue. */
> @@ -337,18 +337,23 @@ unsigned long __init dom0_compute_nr_pag
>          avail -= d->max_vcpus - 1;
>  
>      /* Reserve memory for iommu_dom0_init() (rough estimate). */
> -    if ( is_iommu_enabled(d) )
> +    if ( is_iommu_enabled(d) && !iommu_hwdom_passthrough )
>      {
>          unsigned int s;
>  
>          for ( s = 9; s < BITS_PER_LONG; s += 9 )
> -            avail -= max_pdx >> s;
> +            iommu_pages += max_pdx >> s;
> +
> +        avail -= iommu_pages;
>      }
>  
> -    need_paging = is_hvm_domain(d) &&
> -        (!iommu_use_hap_pt(d) || !paging_mode_hap(d));
> +    need_paging = is_hvm_domain(d)
> +                  ? !iommu_use_hap_pt(d) || !paging_mode_hap(d)
> +                  : opt_dom0_shadow;
>      for ( ; ; need_paging = false )
>      {
> +        unsigned long paging_pages;
> +
>          nr_pages = get_memsize(&dom0_size, avail);
>          min_pages = get_memsize(&dom0_min_size, avail);
>          max_pages = get_memsize(&dom0_max_size, avail);
> @@ -377,11 +382,20 @@ unsigned long __init dom0_compute_nr_pag
>          nr_pages = min(nr_pages, max_pages);
>          nr_pages = min(nr_pages, avail);
>  
> -        if ( !need_paging )
> -            break;
> +        paging_pages = paging_mode_enabled(d) || need_paging
> +                       ? dom0_paging_pages(d, nr_pages) : 0;
>  
>          /* Reserve memory for shadow or HAP. */
> -        avail -= dom0_paging_pages(d, nr_pages);
> +        if ( !need_paging )
> +        {
> +            if ( paging_pages <= iommu_pages )
> +                break;
> +
> +            avail -= paging_pages - iommu_pages;
> +        }
> +        else
> +            avail -= paging_pages;
> +        iommu_pages = paging_pages;
>      }

I always found this loop extremely confusing to reason about. Now that
we account for the iommu page tables using separate logic, do we
really need a loop here?

In fact I would suggest something like:

unsigned long cpu_pages = 0;

if ( is_iommu_enabled(d) && !iommu_hwdom_passthrough )
{
    unsigned int s;

    for ( s = 9; s < BITS_PER_LONG; s += 9 )
        iommu_pages += max_pdx >> s;
}

[perform all the nr_pages adjustments]

if ( paging_mode_enabled(d) ||
     opt_dom0_shadow /* shadow paging gets enabled later for PV dom0. */ )
    cpu_pages = dom0_paging_pages(d, nr_pages);

if ( is_hvm_domain(d) && iommu_use_hap_pt(d) && paging_mode_hap(d) )
    avail -= max(iommu_pages, cpu_pages);
else
    avail -= cpu_pages + iommu_pages;

There will be a slight overestimation of cpu_pages, as the value
passed in doesn't account for the iommu pages in case they are used,
but still it's better to overestimate than to underestimate.

Roger.



* Re: [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0
  2021-09-21  7:17 ` [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0 Jan Beulich
@ 2021-09-22 13:01   ` Roger Pau Monné
  2021-09-22 13:31   ` Andrew Cooper
  1 sibling, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-22 13:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, Tim Deegan

On Tue, Sep 21, 2021 at 09:17:15AM +0200, Jan Beulich wrote:
> Leaving shadow setup just to the L1TF tasklet means running Dom0 on a
> minimally acceptable shadow memory pool, rather than what normally
> would be used (also, for example, for PVH). Populate the pool before
> triggering the tasklet, on a best effort basis (again like done for
> PVH).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0
  2021-09-21  7:17 ` [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0 Jan Beulich
  2021-09-22 13:01   ` Roger Pau Monné
@ 2021-09-22 13:31   ` Andrew Cooper
  2021-09-22 13:50     ` Jan Beulich
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Cooper @ 2021-09-22 13:31 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Wei Liu, Roger Pau Monné, Tim Deegan

On 21/09/2021 08:17, Jan Beulich wrote:
> Leaving shadow setup just to the L1TF tasklet means running Dom0 on a
> minimally acceptable shadow memory pool, rather than what normally
> would be used (also, for example, for PVH). Populate the pool before
> triggering the tasklet, on a best effort basis (again like done for
> PVH).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Tim Deegan <tim@xen.org>
> ---
> v2: Latch dom0_paging_pages() result.
>
> --- a/xen/arch/x86/mm/shadow/common.c
> +++ b/xen/arch/x86/mm/shadow/common.c
> @@ -1298,7 +1298,7 @@ int shadow_set_allocation(struct domain
>  {
>      struct page_info *sp;
>  
> -    ASSERT(paging_locked_by_me(d));
> +    ASSERT(paging_locked_by_me(d) || system_state < SYS_STATE_active);
>  
>      if ( pages > 0 )
>      {
> --- a/xen/arch/x86/pv/dom0_build.c
> +++ b/xen/arch/x86/pv/dom0_build.c
> @@ -21,6 +21,7 @@
>  #include <asm/page.h>
>  #include <asm/pv/mm.h>
>  #include <asm/setup.h>
> +#include <asm/shadow.h>
>  
>  /* Allow ring-3 access in long mode as guest cannot use ring 1 ... */
>  #define BASE_PROT (_PAGE_PRESENT|_PAGE_RW|_PAGE_ACCESSED|_PAGE_USER)
> @@ -933,7 +934,18 @@ int __init dom0_construct_pv(struct doma
>  #ifdef CONFIG_SHADOW_PAGING
>      if ( opt_dom0_shadow )
>      {
> +        bool preempted;
> +
>          printk("Switching dom0 to using shadow paging\n");
> +
> +        nr_pt_pages = dom0_paging_pages(d, nr_pages);
> +
> +        do {
> +            preempted = false;
> +            shadow_set_allocation(d, nr_pt_pages, &preempted);
> +            process_pending_softirqs();
> +        } while ( preempted );

This is still broken.

The loop setting the shadow allocation needs to be outside of this
conditional, because it is not related to early activation of the l1tf
tasklet.

~Andrew




* Re: [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0
  2021-09-22 13:31   ` Andrew Cooper
@ 2021-09-22 13:50     ` Jan Beulich
  2021-09-22 14:25       ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-22 13:50 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Wei Liu, Roger Pau Monné, Tim Deegan, xen-devel

On 22.09.2021 15:31, Andrew Cooper wrote:
> On 21/09/2021 08:17, Jan Beulich wrote:
>> @@ -933,7 +934,18 @@ int __init dom0_construct_pv(struct doma
>>  #ifdef CONFIG_SHADOW_PAGING
>>      if ( opt_dom0_shadow )
>>      {
>> +        bool preempted;
>> +
>>          printk("Switching dom0 to using shadow paging\n");
>> +
>> +        nr_pt_pages = dom0_paging_pages(d, nr_pages);
>> +
>> +        do {
>> +            preempted = false;
>> +            shadow_set_allocation(d, nr_pt_pages, &preempted);
>> +            process_pending_softirqs();
>> +        } while ( preempted );
> 
> This is still broken.
> 
> The loop setting the shadow allocation needs to be outside of this
> conditional, because it is not related to early activation of the l1tf
> tasklet.

Well, I'm not sure what to say. On v1 you already said so. But then you
didn't care to reply to me responding:

"Are you suggesting to set up a (perhaps large) shadow pool just in
 case we need to enable shadow mode on Dom0? And all of this memory
 to then remain unused in the majority of cases?

 Plus even if so, I'd view this as a 2nd, independent step, largely
 orthogonal to the handling of "dom0=shadow". If somebody really
 wanted that, I think this should be driven by an explicit setting
 of the shadow pool size, indicating the admin is willing to waste
 the memory.

 I'm further puzzled by "not to retain upstream's security
 vulnerability" - are you saying upstream is vulnerable in some way,
 while perhaps you (XenServer) are not? In general I don't think I
 view downstream decisions as a driving factor for what upstream
 does, when the result is deliberately different behavior from
 upstream."

Which has left me with no justification to make the change you're
requesting. I've now got an ack by Tim and an R-b by Roger. I also
view the change as is being an improvement on its own (i.e. I
question you saying "This is still broken."), even if (later) we
were to follow what you request. For this reason I'll give it a day
or two for you to reply, but otherwise I'll commit the patch as is,
leaving further adjustments for a future change (by you, me, or
anyone else).

Jan




* Re: [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0
  2021-09-21  7:17 ` [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0 Jan Beulich
@ 2021-09-22 14:22   ` Roger Pau Monné
  2021-09-24 12:18     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-22 14:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Sep 21, 2021 at 09:17:37AM +0200, Jan Beulich wrote:
> Certain notifications of Dom0 to Xen are independent of the mode Dom0 is
> running in. Permit further PCI related ones (only their modern forms).
> Also include the USB2 debug port operation at this occasion.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I'm uncertain about the has_vpci() part of the check: I would think
> is_hardware_domain() is both sufficient and concise. Without vPCI a PVH
> Dom0 won't see any PCI devices in the first place (and hence would
> effectively be non-functioning). Dropping this would in particular make
> PHYSDEVOP_dbgp_op better fit in the mix.
> ---
> v3: New.
> 
> --- a/xen/arch/x86/hvm/hypercall.c
> +++ b/xen/arch/x86/hvm/hypercall.c
> @@ -94,6 +94,12 @@ static long hvm_physdev_op(int cmd, XEN_
>          break;
>  
>      case PHYSDEVOP_pci_mmcfg_reserved:
> +    case PHYSDEVOP_pci_device_add:
> +    case PHYSDEVOP_pci_device_remove:
> +    case PHYSDEVOP_restore_msi_ext:

Hm, I'm slightly unsure we need the restore operation. Wouldn't it be
better to just reset all device state on suspend and then let dom0
restore its state as it does on native?

Maybe there's some wrinkle that prevents that from working properly.

> +    case PHYSDEVOP_dbgp_op:
> +    case PHYSDEVOP_prepare_msix:
> +    case PHYSDEVOP_release_msix:

Albeit I think those two operations won't strictly conflict with vPCI
usage (as they require no MSIX entries to be active) I still wonder
whether we will end up needing them on a PVH dom0. They are used by
pciback and it's not yet clear how we will end up using pciback on a
PVH dom0, hence I would prefer if we could leave them out until
strictly required.

Thanks, Roger.



* Re: [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0
  2021-09-22 13:50     ` Jan Beulich
@ 2021-09-22 14:25       ` Roger Pau Monné
  2021-09-22 14:28         ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-22 14:25 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: Wei Liu, Tim Deegan, xen-devel

On Wed, Sep 22, 2021 at 03:50:25PM +0200, Jan Beulich wrote:
> On 22.09.2021 15:31, Andrew Cooper wrote:
> > On 21/09/2021 08:17, Jan Beulich wrote:
> >> @@ -933,7 +934,18 @@ int __init dom0_construct_pv(struct doma
> >>  #ifdef CONFIG_SHADOW_PAGING
> >>      if ( opt_dom0_shadow )
> >>      {
> >> +        bool preempted;
> >> +
> >>          printk("Switching dom0 to using shadow paging\n");
> >> +
> >> +        nr_pt_pages = dom0_paging_pages(d, nr_pages);
> >> +
> >> +        do {
> >> +            preempted = false;
> >> +            shadow_set_allocation(d, nr_pt_pages, &preempted);
> >> +            process_pending_softirqs();
> >> +        } while ( preempted );
> > 
> > This is still broken.
> > 
> > The loop setting the shadow allocation needs to be outside of this
> > conditional, because it is not related to early activation of the l1tf
> > tasklet.
> 
> Well, I'm not sure what to say. On v1 you already said so. But then you
> didn't care to reply to me responding:
> 
> "Are you suggesting to set up a (perhaps large) shadow pool just in
>  case we need to enable shadow mode on Dom0? And all of this memory
>  to then remain unused in the majority of cases?
> 
>  Plus even if so, I'd view this as a 2nd, independent step, largely
>  orthogonal to the handling of "dom0=shadow". If somebody really
>  wanted that, I think this should be driven by an explicit setting
>  of the shadow pool size, indicating the admin is willing to waste
>  the memory.

Maybe an acceptable compromise would be to allocate the pool if
opt_dom0_shadow || opt_pv_l1tf_hwdom?

opt_pv_l1tf_hwdom is not enabled by default, so an admin opting to
enable it should also be willing to reserve the memory it requires in
case it needs activating during runtime.

Roger.



* Re: [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0
  2021-09-22 14:25       ` Roger Pau Monné
@ 2021-09-22 14:28         ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-09-22 14:28 UTC (permalink / raw)
  To: Roger Pau Monné, Andrew Cooper; +Cc: Wei Liu, Tim Deegan, xen-devel

On 22.09.2021 16:25, Roger Pau Monné wrote:
> On Wed, Sep 22, 2021 at 03:50:25PM +0200, Jan Beulich wrote:
>> On 22.09.2021 15:31, Andrew Cooper wrote:
>>> On 21/09/2021 08:17, Jan Beulich wrote:
>>>> @@ -933,7 +934,18 @@ int __init dom0_construct_pv(struct doma
>>>>  #ifdef CONFIG_SHADOW_PAGING
>>>>      if ( opt_dom0_shadow )
>>>>      {
>>>> +        bool preempted;
>>>> +
>>>>          printk("Switching dom0 to using shadow paging\n");
>>>> +
>>>> +        nr_pt_pages = dom0_paging_pages(d, nr_pages);
>>>> +
>>>> +        do {
>>>> +            preempted = false;
>>>> +            shadow_set_allocation(d, nr_pt_pages, &preempted);
>>>> +            process_pending_softirqs();
>>>> +        } while ( preempted );
>>>
>>> This is still broken.
>>>
>>> The loop setting the shadow allocation needs to be outside of this
>>> conditional, because it is not related to early activation of the l1tf
>>> tasklet.
>>
>> Well, I'm not sure what to say. On v1 you already said so. But then you
>> didn't care to reply to me responding:
>>
>> "Are you suggesting to set up a (perhaps large) shadow pool just in
>>  case we need to enable shadow mode on Dom0? And all of this memory
>>  to then remain unused in the majority of cases?
>>
>>  Plus even if so, I'd view this as a 2nd, independent step, largely
>>  orthogonal to the handling of "dom0=shadow". If somebody really
>>  wanted that, I think this should be driven by an explicit setting
>>  of the shadow pool size, indicating the admin is willing to waste
>>  the memory.
> 
> Maybe an acceptable compromise would be to allocate the pool if
> opt_dom0_shadow || opt_pv_l1tf_hwdom?
> 
> opt_pv_l1tf_hwdom is not enabled by default, so an admin opting to
> enable it should also be willing to reserve the memory it requires in
> case it needs activating during runtime.

I'd be fine making that change. Andrew - would that be sufficient to
address your concern?

Jan




* Re: [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0
  2021-09-21  7:18 ` [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0 Jan Beulich
@ 2021-09-22 15:01   ` Roger Pau Monné
  2021-09-22 17:03     ` Andrew Cooper
  2021-09-23  9:46     ` Jan Beulich
  0 siblings, 2 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-22 15:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Sep 21, 2021 at 09:18:05AM +0200, Jan Beulich wrote:
> Like PV Dom0 in order to use the console if in a mode other than text
> 80x25 the kernel needs to be provided information about this mode. Bump
> HVM start info's "current" version to 2 and use a previously reserved
> 32-bit field to provide struct dom0_vga_console_info's position and
> size.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: New.
> 
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -19,6 +19,7 @@
>   */
>  
>  #include <xen/acpi.h>
> +#include <xen/console.h>
>  #include <xen/init.h>
>  #include <xen/libelf.h>
>  #include <xen/multiboot.h>
> @@ -549,6 +550,11 @@ static int __init pvh_load_kernel(struct
>      paddr_t last_addr;
>      struct hvm_start_info start_info = { 0 };
>      struct hvm_modlist_entry mod = { 0 };
> +#ifdef CONFIG_VIDEO
> +    struct dom0_vga_console_info vga_info = { 0 };
> +#else
> +    struct {} __maybe_unused vga_info;
> +#endif
>      struct vcpu *v = d->vcpu[0];
>      int rc;
>  
> @@ -598,7 +604,7 @@ static int __init pvh_load_kernel(struct
>       * split into smaller allocations, done as a single region in order to
>       * simplify it.
>       */
> -    last_addr = find_memory(d, &elf, sizeof(start_info) +
> +    last_addr = find_memory(d, &elf, sizeof(start_info) + sizeof(vga_info) +
>                              (initrd ? ROUNDUP(initrd->mod_end, PAGE_SIZE) +
>                                        sizeof(mod)
>                                      : 0) +
> @@ -672,6 +678,22 @@ static int __init pvh_load_kernel(struct
>          last_addr += sizeof(mod);
>      }
>  
> +#ifdef CONFIG_VIDEO
> +    if ( fill_console_start_info(&vga_info) )
> +    {
> +        rc = hvm_copy_to_guest_phys(last_addr + sizeof(start_info),
> +                                    &vga_info, sizeof(vga_info), v);
> +        if ( !rc )
> +        {
> +            start_info.version = 2;
> +            start_info.vga_info.offset = sizeof(start_info);
> +            start_info.vga_info.size = sizeof(vga_info);
> +        }
> +        else
> +            printk("Unable to copy VGA info to guest\n");
> +    }
> +#endif
> +
>      start_info.magic = XEN_HVM_START_MAGIC_VALUE;
>      start_info.flags = SIF_PRIVILEGED | SIF_INITDOMAIN;
>      rc = hvm_copy_to_guest_phys(last_addr, &start_info, sizeof(start_info), v);
> --- a/xen/include/public/arch-x86/hvm/start_info.h
> +++ b/xen/include/public/arch-x86/hvm/start_info.h
> @@ -33,7 +33,7 @@
>   *    | magic          | Contains the magic value XEN_HVM_START_MAGIC_VALUE
>   *    |                | ("xEn3" with the 0x80 bit of the "E" set).
>   *  4 +----------------+
> - *    | version        | Version of this structure. Current version is 1. New
> + *    | version        | Version of this structure. Current version is 2. New
>   *    |                | versions are guaranteed to be backwards-compatible.
>   *  8 +----------------+
>   *    | flags          | SIF_xxx flags.
> @@ -55,7 +55,15 @@
>   *    |                | if there is no memory map being provided. Only
>   *    |                | present in version 1 and newer of the structure.
>   * 52 +----------------+
> - *    | reserved       | Version 1 and newer only.
> + *    | vga_info.offset| Offset of struct dom0_vga_console_info from base of

I'm not sure we are supposed to reference external structures like
that. We took a lot of care to not depend on other headers, and to
make this as agnostic as possible (IIRC KVM is also capable of using
the PVH entry point natively, and hence depends on this header).

IF we want to add support for dom0_vga_console_info I think we need to
at least provide a binary layout for it, like all the other pieces
that are part of the HVM start info.

> + *    |                | struct hvm_start_info. Optional and only present in
> + *    |                | version 2 and newer of the structure when
> + *    |                | SIF_INITDOMAIN is set; zero if absent.

We have usually used an absolute physical address to reference other
data, and I think we should likely keep it that way for coherency.

> + * 54 +----------------+
> + *    | vga_info.size  | Size of present parts of struct dom0_vga_console_info.
> + *    |                | Optional and only present in version 2 and newer of
> + *    |                | the structure when SIF_INITDOMAIN is set; zero if
> + *    |                | absent.
>   * 56 +----------------+
>   *
>   * The layout of each entry in the module structure is the following:
> @@ -139,7 +147,15 @@ struct hvm_start_info {
>      uint32_t memmap_entries;    /* Number of entries in the memmap table.    */
>                                  /* Value will be zero if there is no memory  */
>                                  /* map being provided.                       */
> -    uint32_t reserved;          /* Must be zero.                             */

This 'Must be zero' comment troubles me a bit; I'm not convinced we
can just place data here (ie: it would no longer be 0). It's even
worse because the must be zero comment is only present in the C
representation of the struct, the layout above just denotes the field
is reserved (which would imply it's fine to use for other means in
v2).

Thanks, Roger.



* Re: [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0'
  2021-09-21  7:19 ` [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0' Jan Beulich
@ 2021-09-22 15:48   ` Roger Pau Monné
  2021-09-23 10:21     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-22 15:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Sep 21, 2021 at 09:19:06AM +0200, Jan Beulich wrote:
> vcpu_show_registers() didn't do anything for HVM so far. Note though
> that some extra hackery is needed for VMX - see the code comment.
> 
> Note further that the show_guest_stack() invocation is left alone here:
> While strictly speaking guest_kernel_mode() should be predicated by a
> PV / !HVM check, show_guest_stack() itself will bail immediately for
> HVM.
> 
> While there and despite not being PVH-specific, take the opportunity and
> filter offline vCPU-s: There's not really any register state associated
> with them, so avoid spamming the log with useless information while
> still leaving an indication of the fact.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> I was pondering whether to also have the VMCS/VMCB dumped for every
> vCPU, to present full state. The downside is that for larger systems
> this would be a lot of output.

At least for Intel there's already a debug key to dump VMCS, so I'm
unsure it's worth dumping it here also, as a user can get the
information elsewhere (that's what I've always used to debug PVH
TBH).

> ---
> v2: New.
> 
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -631,6 +631,12 @@ void vcpu_show_execution_state(struct vc
>  {
>      unsigned long flags;
>  
> +    if ( test_bit(_VPF_down, &v->pause_flags) )
> +    {
> +        printk("*** %pv is offline ***\n", v);
> +        return;
> +    }
> +
>      printk("*** Dumping Dom%d vcpu#%d state: ***\n",
>             v->domain->domain_id, v->vcpu_id);
>  
> @@ -642,6 +648,21 @@ void vcpu_show_execution_state(struct vc
>  
>      vcpu_pause(v); /* acceptably dangerous */
>  
> +#ifdef CONFIG_HVM
> +    /*
> +     * For VMX special care is needed: Reading some of the register state will
> +     * require VMCS accesses. Engaging foreign VMCSes involves acquiring of a
> +     * lock, which check_lock() would object to when done from an IRQs-disabled
> +     * region. Despite this being a layering violation, engage the VMCS right
> +     * here. This then also avoids doing so several times in close succession.
> +     */
> +    if ( cpu_has_vmx && is_hvm_vcpu(v) )
> +    {
> +        ASSERT(!in_irq());
> +        vmx_vmcs_enter(v);
> +    }
> +#endif
> +
>      /* Prevent interleaving of output. */
>      flags = console_lock_recursive_irqsave();
>  
> @@ -651,6 +672,11 @@ void vcpu_show_execution_state(struct vc
>  
>      console_unlock_recursive_irqrestore(flags);
>  
> +#ifdef CONFIG_HVM
> +    if ( cpu_has_vmx && is_hvm_vcpu(v) )
> +        vmx_vmcs_exit(v);
> +#endif
> +
>      vcpu_unpause(v);
>  }
>  
> --- a/xen/arch/x86/x86_64/traps.c
> +++ b/xen/arch/x86/x86_64/traps.c
> @@ -49,6 +49,39 @@ static void read_registers(struct cpu_us
>      crs[7] = read_gs_shadow();
>  }
>  
> +static void get_hvm_registers(struct vcpu *v, struct cpu_user_regs *regs,
> +                              unsigned long crs[8])

Would this better be placed in hvm.c now that it's an HVM-only
function?

> +{
> +    struct segment_register sreg;
> +
> +    crs[0] = v->arch.hvm.guest_cr[0];
> +    crs[2] = v->arch.hvm.guest_cr[2];
> +    crs[3] = v->arch.hvm.guest_cr[3];
> +    crs[4] = v->arch.hvm.guest_cr[4];
> +
> +    hvm_get_segment_register(v, x86_seg_cs, &sreg);
> +    regs->cs = sreg.sel;
> +
> +    hvm_get_segment_register(v, x86_seg_ds, &sreg);
> +    regs->ds = sreg.sel;
> +
> +    hvm_get_segment_register(v, x86_seg_es, &sreg);
> +    regs->es = sreg.sel;
> +
> +    hvm_get_segment_register(v, x86_seg_fs, &sreg);
> +    regs->fs = sreg.sel;
> +    crs[5] = sreg.base;
> +
> +    hvm_get_segment_register(v, x86_seg_gs, &sreg);
> +    regs->gs = sreg.sel;
> +    crs[6] = sreg.base;
> +
> +    hvm_get_segment_register(v, x86_seg_ss, &sreg);
> +    regs->ss = sreg.sel;
> +
> +    crs[7] = hvm_get_shadow_gs_base(v);
> +}
> +
>  static void _show_registers(
>      const struct cpu_user_regs *regs, unsigned long crs[8],
>      enum context context, const struct vcpu *v)
> @@ -99,27 +132,8 @@ void show_registers(const struct cpu_use
>  
>      if ( guest_mode(regs) && is_hvm_vcpu(v) )
>      {
> -        struct segment_register sreg;
> +        get_hvm_registers(v, &fault_regs, fault_crs);
>          context = CTXT_hvm_guest;
> -        fault_crs[0] = v->arch.hvm.guest_cr[0];
> -        fault_crs[2] = v->arch.hvm.guest_cr[2];
> -        fault_crs[3] = v->arch.hvm.guest_cr[3];
> -        fault_crs[4] = v->arch.hvm.guest_cr[4];
> -        hvm_get_segment_register(v, x86_seg_cs, &sreg);
> -        fault_regs.cs = sreg.sel;
> -        hvm_get_segment_register(v, x86_seg_ds, &sreg);
> -        fault_regs.ds = sreg.sel;
> -        hvm_get_segment_register(v, x86_seg_es, &sreg);
> -        fault_regs.es = sreg.sel;
> -        hvm_get_segment_register(v, x86_seg_fs, &sreg);
> -        fault_regs.fs = sreg.sel;
> -        fault_crs[5] = sreg.base;
> -        hvm_get_segment_register(v, x86_seg_gs, &sreg);
> -        fault_regs.gs = sreg.sel;
> -        fault_crs[6] = sreg.base;
> -        hvm_get_segment_register(v, x86_seg_ss, &sreg);
> -        fault_regs.ss = sreg.sel;
> -        fault_crs[7] = hvm_get_shadow_gs_base(v);
>      }
>      else
>      {
> @@ -159,24 +173,35 @@ void show_registers(const struct cpu_use
>  void vcpu_show_registers(const struct vcpu *v)
>  {
>      const struct cpu_user_regs *regs = &v->arch.user_regs;
> -    bool kernel = guest_kernel_mode(v, regs);
> +    struct cpu_user_regs aux_regs;
> +    enum context context;
>      unsigned long crs[8];
>  
> -    /* Only handle PV guests for now */
> -    if ( !is_pv_vcpu(v) )
> -        return;
> -
> -    crs[0] = v->arch.pv.ctrlreg[0];
> -    crs[2] = arch_get_cr2(v);
> -    crs[3] = pagetable_get_paddr(kernel ?
> -                                 v->arch.guest_table :
> -                                 v->arch.guest_table_user);
> -    crs[4] = v->arch.pv.ctrlreg[4];
> -    crs[5] = v->arch.pv.fs_base;
> -    crs[6 + !kernel] = v->arch.pv.gs_base_kernel;
> -    crs[7 - !kernel] = v->arch.pv.gs_base_user;
> +    if ( is_hvm_vcpu(v) )
> +    {
> +        aux_regs = *regs;
> +        get_hvm_registers(v->domain->vcpu[v->vcpu_id], &aux_regs, crs);

I wonder if you could load the values directly into v->arch.user_regs,
but maybe that would taint some other info already there. I certainly
haven't looked closely.

Thanks, Roger.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0
  2021-09-22 15:01   ` Roger Pau Monné
@ 2021-09-22 17:03     ` Andrew Cooper
  2021-09-23  9:58       ` Jan Beulich
  2021-09-23  9:46     ` Jan Beulich
  1 sibling, 1 reply; 44+ messages in thread
From: Andrew Cooper @ 2021-09-22 17:03 UTC (permalink / raw)
  To: Roger Pau Monné, Jan Beulich; +Cc: xen-devel, Wei Liu

On 22/09/2021 16:01, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:18:05AM +0200, Jan Beulich wrote:
>> --- a/xen/include/public/arch-x86/hvm/start_info.h
>> +++ b/xen/include/public/arch-x86/hvm/start_info.h
>> @@ -33,7 +33,7 @@
>>   *    | magic          | Contains the magic value XEN_HVM_START_MAGIC_VALUE
>>   *    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>   *  4 +----------------+
>> - *    | version        | Version of this structure. Current version is 1. New
>> + *    | version        | Version of this structure. Current version is 2. New
>>   *    |                | versions are guaranteed to be backwards-compatible.
>>   *  8 +----------------+
>>   *    | flags          | SIF_xxx flags.
>> @@ -55,7 +55,15 @@
>>   *    |                | if there is no memory map being provided. Only
>>   *    |                | present in version 1 and newer of the structure.
>>   * 52 +----------------+
>> - *    | reserved       | Version 1 and newer only.
>> + *    | vga_info.offset| Offset of struct dom0_vga_console_info from base of
> I'm not sure we are supposed to reference external structures like
> that. We took a lot of care to not depend on other headers, and to
> make this as agnostic as possible (IIRC KVM is also capable of using
> the PVH entry point natively, and hence depends on this header).

Absolutely correct.  C is not an acceptable ABI description.

Furthermore, dom0_vga_console_info is a bad ABI to start with, as
demonstrated by the multiple problems we've had extending it in the past.

The MB1/2 framebuffer information would be a rather better example to
follow, but we'll surely need to pass the EDID string too (at least in
the case that there aren't EFI runtime services to use).

~Andrew



* Re: [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable
  2021-09-21  7:19 ` [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable Jan Beulich
@ 2021-09-23  8:09   ` Roger Pau Monné
  2021-09-23 10:34     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23  8:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Sep 21, 2021 at 09:19:37AM +0200, Jan Beulich wrote:
> While all present callers want to act on "current", stack dumping for
> HVM vCPU-s will require the function to be able to act on a remote vCPU.
> To avoid touching all present callers, convert the existing function to
> an inline wrapper around the extended new one.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> Alternatively the actual dumping patch could avoid using this more
> elaborate function and, ignoring access checks, simply add in the SS
> segment base itself (if needed in the first place).
> ---
> v3: New.
> 
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2526,7 +2526,8 @@ int hvm_set_cr4(unsigned long value, boo
>      return X86EMUL_OKAY;
>  }
>  
> -bool_t hvm_virtual_to_linear_addr(
> +bool hvm_vcpu_virtual_to_linear(
> +    struct vcpu *v,
>      enum x86_segment seg,
>      const struct segment_register *reg,
>      unsigned long offset,
> @@ -2535,8 +2536,9 @@ bool_t hvm_virtual_to_linear_addr(
>      const struct segment_register *active_cs,
>      unsigned long *linear_addr)
>  {
> -    const struct vcpu *curr = current;
>      unsigned long addr = offset, last_byte;
> +    const struct cpu_user_regs *regs = v == current ? guest_cpu_user_regs()
> +                                                    : &v->arch.user_regs;
>      bool_t okay = 0;

Since you change the function return type to bool, you should also
change the type of the returned variable IMO. It's just a two line
change.

Also is it worth adding some check that the remote vCPU is paused? Or
else you might get inconsistent results by using data that's stale by
the time Xen acts on it.

Thanks, Roger.
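In shape, the conversion the patch describes — a new explicit-vcpu function with the old name kept as an inline wrapper over "current" — looks like this (all types and names here are illustrative stand-ins, not the real Xen definitions):

```c
#include <assert.h>

/* Mock vcpu type, purely for illustration. */
struct vcpu { int id; };

static struct vcpu vcpu0 = { .id = 0 };
#define current (&vcpu0)

/* The new, remote-capable variant takes an explicit vcpu argument. */
static int vcpu_to_linear(struct vcpu *v, int offset)
{
    return v->id * 100 + offset;  /* placeholder for the real translation */
}

/* The old entry point survives as an inline wrapper acting on "current",
 * so none of the existing call sites need touching. */
static inline int virtual_to_linear(int offset)
{
    return vcpu_to_linear(current, offset);
}
```

The wrapper keeps the existing-caller churn at zero while the remote path gains the parameter it needs.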



* Re: [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes
  2021-09-21  7:20 ` [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes Jan Beulich
@ 2021-09-23  8:23   ` Roger Pau Monné
  2021-09-23 11:27     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23  8:23 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Kevin Tian, Jun Nakajima

On Tue, Sep 21, 2021 at 09:20:32AM +0200, Jan Beulich wrote:
> There's not really any register state associated with offline vCPU-s, so
> avoid spamming the log with largely useless information while still
> leaving an indication of the fact.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Is the state cleared when the vCPU is brought down? Or else it might
be interesting to print such state for debug purposes.

Thanks, Roger.



* Re: [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0
  2021-09-22 15:01   ` Roger Pau Monné
  2021-09-22 17:03     ` Andrew Cooper
@ 2021-09-23  9:46     ` Jan Beulich
  2021-09-23 13:22       ` Roger Pau Monné
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23  9:46 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 22.09.2021 17:01, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:18:05AM +0200, Jan Beulich wrote:
>> --- a/xen/include/public/arch-x86/hvm/start_info.h
>> +++ b/xen/include/public/arch-x86/hvm/start_info.h
>> @@ -33,7 +33,7 @@
>>   *    | magic          | Contains the magic value XEN_HVM_START_MAGIC_VALUE
>>   *    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>   *  4 +----------------+
>> - *    | version        | Version of this structure. Current version is 1. New
>> + *    | version        | Version of this structure. Current version is 2. New
>>   *    |                | versions are guaranteed to be backwards-compatible.
>>   *  8 +----------------+
>>   *    | flags          | SIF_xxx flags.
>> @@ -55,7 +55,15 @@
>>   *    |                | if there is no memory map being provided. Only
>>   *    |                | present in version 1 and newer of the structure.
>>   * 52 +----------------+
>> - *    | reserved       | Version 1 and newer only.
>> + *    | vga_info.offset| Offset of struct dom0_vga_console_info from base of
> 
> I'm not sure we are supposed to reference external structures like
> that. We took a lot of care to not depend on other headers, and to
> make this as agnostic as possible (IIRC KVM is also capable of using
> the PVH entry point natively, and hence depends on this header).

But KVM wouldn't be using a Dom0-only part of the interface, would
it? (I'm aware of the possible re-using of the entry point.)

> IF we want to add support for dom0_vga_console_info I think we need to
> at least provide a binary layout for it, like all the other pieces
> that are part of the HVM start info.

Which then means we can't sensibly re-use the existing structure,
as that doesn't have as strict rules as the hvm_start_info one.
Which in turn means Linux can't re-use the code converting
dom0_vga_console_info, resulting in two places needing updating
whenever information gets added to (then) both structures (what
information they carry will, after all, want to remain in sync).

>> + *    |                | struct hvm_start_info. Optional and only present in
>> + *    |                | version 2 and newer of the structure when
>> + *    |                | SIF_INITDOMAIN is set; zero if absent.
> 
> We have usually used an absolute physical address to reference other
> data, and I think we should likely keep in that way for coherency.

Hmm. (See below.)

>> + * 54 +----------------+
>> + *    | vga_info.size  | Size of present parts of struct dom0_vga_console_info.
>> + *    |                | Optional and only present in version 2 and newer of
>> + *    |                | the structure when SIF_INITDOMAIN is set; zero if
>> + *    |                | absent.
>>   * 56 +----------------+
>>   *
>>   * The layout of each entry in the module structure is the following:
>> @@ -139,7 +147,15 @@ struct hvm_start_info {
>>      uint32_t memmap_entries;    /* Number of entries in the memmap table.    */
>>                                  /* Value will be zero if there is no memory  */
>>                                  /* map being provided.                       */
>> -    uint32_t reserved;          /* Must be zero.                             */
> 
> This 'Must be zero' comment troubles me a bit, I'm not convinced we
> can just place data here (ie: it would no longer be 0). It's even
> worse because the must be zero comment is only present in the C
> representation of the struct, the layout above just denotes the field
> is reserved (which would imply it's fine to use for other means in
> v2).

I thought the textual description was meant to be the ABI spec. The C
comment should therefore be viewed as if missing "in version 1" or
"presently".

Taking into account also Andrew's reply, I have to admit that I'm
inclined to request that one of the two of you fix this obvious
shortcoming in both Xen and Linux. I'm not really willing to be the one
to introduce a 2nd layout for the same set of data just for the purpose
of "playing nice" in an area where that, affecting Dom0 only, doesn't
seem to matter all this much. My goal was rather to keep the impact on
hvm_start_info as low as possible (and in particular avoid changing its
size, as strictly speaking Linux's consumer implementation is buggy:
It would always copy as much data as it knows _may_ be present, not as
little data as may have been _actually_ provided; whoever implemented
this considered only one half of the compatibility requirements,
quite likely simply because in the design this aspect was also missed,
or else the structure would have had a length field right from its
introduction).

IOW I'm afraid I may not be seeing the "big picture" here ...

Jan
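The compatibility hazard described above — a consumer copying sizeof(its own, possibly newer, struct) instead of the size the producer actually provided — can be shown in a standalone sketch (the layouts below are hypothetical, not the real hvm_start_info):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical v1/v2 layouts of a versioned boot-info structure;
 * v2 only ever appends fields, so the v1 prefix is shared. */
struct info_v1 { uint32_t version; uint32_t a; };
struct info_v2 { uint32_t version; uint32_t a; uint32_t b; };

/* A well-behaved consumer built against v2: copy only what the
 * producer actually provided, and leave later fields at a known
 * default instead of reading past the provided data. */
static void consume(const void *src, size_t provided, struct info_v2 *dst)
{
    size_t n = provided < sizeof(*dst) ? provided : sizeof(*dst);

    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, n);
}
```

A consumer that instead always copies sizeof(struct info_v2) reads past the end of a v1 producer's data — the bug alluded to above.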




* Re: [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0
  2021-09-22 17:03     ` Andrew Cooper
@ 2021-09-23  9:58       ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-09-23  9:58 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Wei Liu, Roger Pau Monné

On 22.09.2021 19:03, Andrew Cooper wrote:
> On 22/09/2021 16:01, Roger Pau Monné wrote:
>> On Tue, Sep 21, 2021 at 09:18:05AM +0200, Jan Beulich wrote:
>>> --- a/xen/include/public/arch-x86/hvm/start_info.h
>>> +++ b/xen/include/public/arch-x86/hvm/start_info.h
>>> @@ -33,7 +33,7 @@
>>>   *    | magic          | Contains the magic value XEN_HVM_START_MAGIC_VALUE
>>>   *    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>>   *  4 +----------------+
>>> - *    | version        | Version of this structure. Current version is 1. New
>>> + *    | version        | Version of this structure. Current version is 2. New
>>>   *    |                | versions are guaranteed to be backwards-compatible.
>>>   *  8 +----------------+
>>>   *    | flags          | SIF_xxx flags.
>>> @@ -55,7 +55,15 @@
>>>   *    |                | if there is no memory map being provided. Only
>>>   *    |                | present in version 1 and newer of the structure.
>>>   * 52 +----------------+
>>> - *    | reserved       | Version 1 and newer only.
>>> + *    | vga_info.offset| Offset of struct dom0_vga_console_info from base of
>> I'm not sure we are supposed to reference external structures like
>> that. We took a lot of care to not depend on other headers, and to
>> make this as agnostic as possible (IIRC KVM is also capable of using
>> the PVH entry point natively, and hence depends on this header).
> 
> Absolutely correct.  C is not an acceptable ABI description.

See my reply to Roger's earlier mail.

> Furthermore, dom0_vga_console_info is a bad ABI to start with, as
> demonstrated by the multiple problems we've had extending it in the past.

I don't view this as "problems", nor do I think we couldn't extend it
further that same way, if need be.

> The MB1/2 framebuffer information would be a rather better example to
> follow,

Maybe, but I'm not sure - it doesn't look any better extensibility-wise
than dom0_vga_console_info. Also MB1 doesn't really have separate
structures, so if anything it would need to be MB2.

> but we'll surely need to pass the EDID string too (at least in
> the case that there aren't EFI runtime services to use).

According to the understanding I've gained while putting together the
patch to retrieve EDID info when running under EFI, there's no way to
obtain these via runtime services. Yet I also don't see why we would
need to pass this here - we've got XENPF_firmware_info to retrieve
this data, and I didn't think we need PVH to behave differently from
PV in such regards.

Jan




* Re: [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0'
  2021-09-22 15:48   ` Roger Pau Monné
@ 2021-09-23 10:21     ` Jan Beulich
  2021-09-23 14:27       ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 10:21 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 22.09.2021 17:48, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:19:06AM +0200, Jan Beulich wrote:
>> vcpu_show_registers() didn't do anything for HVM so far. Note though
>> that some extra hackery is needed for VMX - see the code comment.
>>
>> Note further that the show_guest_stack() invocation is left alone here:
>> While strictly speaking guest_kernel_mode() should be predicated by a
>> PV / !HVM check, show_guest_stack() itself will bail immediately for
>> HVM.
>>
>> While there and despite not being PVH-specific, take the opportunity and
>> filter offline vCPU-s: There's not really any register state associated
>> with them, so avoid spamming the log with useless information while
>> still leaving an indication of the fact.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I was pondering whether to also have the VMCS/VMCB dumped for every
>> vCPU, to present full state. The downside is that for larger systems
>> this would be a lot of output.
> 
> At least for Intel there's already a debug key to dump VMCS, so I'm
> unsure it's worth dumping it here also, as a user can get the
> information elsewhere (that's what I've always used to debug PVH
> TBH).

I know there is a respective debug key. That dumps _all_ VMCSes, though,
so might be quite verbose on a big system (where Dom0's output alone
may already be quite verbose).

>> --- a/xen/arch/x86/x86_64/traps.c
>> +++ b/xen/arch/x86/x86_64/traps.c
>> @@ -49,6 +49,39 @@ static void read_registers(struct cpu_us
>>      crs[7] = read_gs_shadow();
>>  }
>>  
>> +static void get_hvm_registers(struct vcpu *v, struct cpu_user_regs *regs,
>> +                              unsigned long crs[8])
> 
> Would this better be placed in hvm.c now that it's a HVM only
> function?

I was asking myself this question, but decided that the placement here
is perhaps at least no bigger of a problem than putting it there.
Factors played into this:
- the specifics of the usage of the crs[8] array,
- the fact that the PV function also lives here, not under pv/,
- the desire to keep the function static.

I can certainly be talked into moving the code, but I will want to see
convincing arguments that none of the three items above (and possibly
other ones I may have missed) are really a problem then.

>> @@ -159,24 +173,35 @@ void show_registers(const struct cpu_use
>>  void vcpu_show_registers(const struct vcpu *v)
>>  {
>>      const struct cpu_user_regs *regs = &v->arch.user_regs;

Please note this in addition to my response below.

>> -    bool kernel = guest_kernel_mode(v, regs);
>> +    struct cpu_user_regs aux_regs;
>> +    enum context context;
>>      unsigned long crs[8];
>>  
>> -    /* Only handle PV guests for now */
>> -    if ( !is_pv_vcpu(v) )
>> -        return;
>> -
>> -    crs[0] = v->arch.pv.ctrlreg[0];
>> -    crs[2] = arch_get_cr2(v);
>> -    crs[3] = pagetable_get_paddr(kernel ?
>> -                                 v->arch.guest_table :
>> -                                 v->arch.guest_table_user);
>> -    crs[4] = v->arch.pv.ctrlreg[4];
>> -    crs[5] = v->arch.pv.fs_base;
>> -    crs[6 + !kernel] = v->arch.pv.gs_base_kernel;
>> -    crs[7 - !kernel] = v->arch.pv.gs_base_user;
>> +    if ( is_hvm_vcpu(v) )
>> +    {
>> +        aux_regs = *regs;
>> +        get_hvm_registers(v->domain->vcpu[v->vcpu_id], &aux_regs, crs);
> 
> I wonder if you could load the values directly into v->arch.user_regs,
> but maybe that would taint some other info already there. I certainly
> haven't looked closely.

I had it that other way first, wondering whether altering the structure
there might be safe. It felt wrong to fiddle with the live registers,
and the "const" above then was the final bit that convinced me I should
go the chosen route. Yet again - I can be talked into going the route
you outline via convincing arguments. Don't forget that we e.g.
deliberately poison the selector values in debug builds (see
hvm_invalidate_regs_fields()) - that poisoning would get undermined if
we wrote directly into the structure.

Jan




* Re: [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0'
  2021-09-21  7:20 ` [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0' Jan Beulich
@ 2021-09-23 10:31   ` Roger Pau Monné
  2021-09-23 10:38     ` Roger Pau Monné
  2021-09-23 10:47     ` Jan Beulich
  0 siblings, 2 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 10:31 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Tue, Sep 21, 2021 at 09:20:00AM +0200, Jan Beulich wrote:
> show_guest_stack() does nothing for HVM. Introduce a HVM-specific
> dumping function, paralleling the 64- and 32-bit PV ones. We don't know
> the real stack size, so only dump up to the next page boundary.
> 
> Rather than adding a vcpu parameter to hvm_copy_from_guest_linear(),
> introduce hvm_copy_from_vcpu_linear() which - for now at least - in
> return won't need a "pfinfo" parameter.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> TBD: The bypassing of the output interleaving avoidance isn't nice, but
>      I've not been able to think of an alternative. Avoiding the call to
>      hvm_vcpu_virtual_to_linear() would be in principle possible (adding
>      in the SS base directly), but one way or another we need to access
>      guest memory and hence can't sensibly avoid using the P2M layer.
>      However, commit 0996e0f38540 ("x86/traps: prevent interleaving of
>      concurrent cpu state dumps") introduced this logic here while
>      really only talking about show_execution_state().
>      vcpu_show_execution_state() is imo much less prone to interleaving
>      of its output: Its uses from the keyhandler are sequential already
>      anyway, and the only other use is from hvm_triple_fault(). Instead
>      of making the locking conditional, it may therefore be an option to
>      drop it again altogether.
> TBD: For now this dumps also user mode stacks. We may want to restrict
>      this.
> TBD: An alternative to putting this next to {,compat_}show_guest_stack()
>      is to put it in hvm.c, eliminating the need to introduce
>      hvm_copy_from_vcpu_linear(), but then requiring extra parameters to
>      be passed.
> TBD: Technically this makes unnecessary the earlier added entering/
>      leaving of the VMCS. Yet to avoid a series of non-trivial
>      enter/exit pairs, I think leaving that in is still beneficial. In
>      which case here perhaps merely the associated comment may want
>      tweaking.
> ---
> v3: New.
> 
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -3408,6 +3408,15 @@ enum hvm_translation_result hvm_copy_fro
>                        PFEC_page_present | pfec, pfinfo);
>  }
>  
> +enum hvm_translation_result hvm_copy_from_vcpu_linear(
> +    void *buf, unsigned long addr, unsigned int size, struct vcpu *v,
> +    unsigned int pfec)

Even if your current use case doesn't need it, would it be worth
adding a pagefault_info_t parameter?

> +{
> +    return __hvm_copy(buf, addr, size, v,
> +                      HVMCOPY_from_guest | HVMCOPY_linear,
> +                      PFEC_page_present | pfec, NULL);
> +}
> +
>  unsigned int copy_to_user_hvm(void *to, const void *from, unsigned int len)
>  {
>      int rc;
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -364,6 +364,71 @@ static void show_guest_stack(struct vcpu
>      printk("\n");
>  }
>  
> +static void show_hvm_stack(struct vcpu *v, const struct cpu_user_regs *regs)
> +{
> +#ifdef CONFIG_HVM
> +    unsigned long sp = regs->rsp, addr;
> +    unsigned int i, bytes, words_per_line, pfec = PFEC_page_present;
> +    struct segment_register ss, cs;
> +
> +    hvm_get_segment_register(v, x86_seg_ss, &ss);
> +    hvm_get_segment_register(v, x86_seg_cs, &cs);
> +
> +    if ( hvm_long_mode_active(v) && cs.l )
> +        i = 16, bytes = 8;
> +    else
> +    {
> +        sp = ss.db ? (uint32_t)sp : (uint16_t)sp;
> +        i = ss.db ? 8 : 4;
> +        bytes = cs.db ? 4 : 2;
> +    }
> +
> +    if ( bytes == 8 || (ss.db && !ss.base) )
> +        printk("Guest stack trace from sp=%0*lx:", i, sp);
> +    else
> +        printk("Guest stack trace from ss:sp=%04x:%0*lx:", ss.sel, i, sp);
> +
> +    if ( !hvm_vcpu_virtual_to_linear(v, x86_seg_ss, &ss, sp, bytes,
> +                                     hvm_access_read, &cs, &addr) )
> +    {
> +        printk(" Guest-inaccessible memory\n");
> +        return;
> +    }
> +
> +    if ( ss.dpl == 3 )
> +        pfec |= PFEC_user_mode;
> +
> +    words_per_line = stack_words_per_line * (sizeof(void *) / bytes);
> +    for ( i = 0; i < debug_stack_lines * words_per_line; )
> +    {
> +        unsigned long val = 0;
> +
> +        if ( (addr ^ (addr + bytes - 1)) & PAGE_SIZE )
> +            break;
> +
> +        if ( !(i++ % words_per_line) )
> +            printk("\n  ");
> +
> +        if ( hvm_copy_from_vcpu_linear(&val, addr, bytes, v,
> +                                       pfec) != HVMTRANS_okay )

I think I'm confused, but what about guests without paging enabled?
Don't you need to use hvm_copy_from_guest_phys (likely transformed
into hvm_copy_from_vcpu_phys)?

> +        {
> +            printk(" Fault while accessing guest memory.");
> +            break;
> +        }
> +
> +        printk(" %0*lx", 2 * bytes, val);
> +
> +        addr += bytes;
> +        if ( !(addr & (PAGE_SIZE - 1)) )
> +            break;
> +    }
> +
> +    if ( !i )
> +        printk(" Stack empty.");
> +    printk("\n");
> +#endif
> +}
> +
>  /*
>   * Notes for get_{stack,shstk}*_bottom() helpers
>   *
> @@ -629,7 +694,7 @@ void show_execution_state(const struct c
>  
>  void vcpu_show_execution_state(struct vcpu *v)
>  {
> -    unsigned long flags;
> +    unsigned long flags = 0;
>  
>      if ( test_bit(_VPF_down, &v->pause_flags) )
>      {
> @@ -663,14 +728,22 @@ void vcpu_show_execution_state(struct vc
>      }
>  #endif
>  
> -    /* Prevent interleaving of output. */
> -    flags = console_lock_recursive_irqsave();
> +    /*
> +     * Prevent interleaving of output if possible. For HVM we can't do so, as
> +     * the necessary P2M lookups involve locking, which has to occur with IRQs
> +     * enabled.
> +     */
> +    if ( !is_hvm_vcpu(v) )
> +        flags = console_lock_recursive_irqsave();
>  
>      vcpu_show_registers(v);
> -    if ( guest_kernel_mode(v, &v->arch.user_regs) )
> +    if ( is_hvm_vcpu(v) )
> +        show_hvm_stack(v, &v->arch.user_regs);

Would it make sense to unlock in show_hvm_stack, and thus keep the
printing of vcpu_show_registers locked even when in HVM context?

TBH I've never found the guest stack dump to be helpful for debugging
purposes, but maybe others do.

Thanks, Roger.
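As an aside on the loop in the quoted hunk: the `(addr ^ (addr + bytes - 1)) & PAGE_SIZE` test works because crossing any 4 KiB boundary flips the PAGE_SIZE bit of the address, so that bit differs between the first and last byte of the access exactly when they fall into different pages. In standalone form (4 KiB pages assumed):

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* True when a "bytes"-wide access starting at addr straddles a page
 * boundary: the XOR exposes the PAGE_SIZE bit differing between the
 * first and the last byte of the access. Valid for bytes <= PAGE_SIZE. */
static bool crosses_page(unsigned long addr, unsigned int bytes)
{
    return ((addr ^ (addr + bytes - 1)) & PAGE_SIZE) != 0;
}
```

This is why the dump loop stops before attempting a read whose last byte would land in the next (possibly unmapped) page.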



* Re: [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable
  2021-09-23  8:09   ` Roger Pau Monné
@ 2021-09-23 10:34     ` Jan Beulich
  2021-09-23 14:28       ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 10:34 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 23.09.2021 10:09, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:19:37AM +0200, Jan Beulich wrote:
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -2526,7 +2526,8 @@ int hvm_set_cr4(unsigned long value, boo
>>      return X86EMUL_OKAY;
>>  }
>>  
>> -bool_t hvm_virtual_to_linear_addr(
>> +bool hvm_vcpu_virtual_to_linear(
>> +    struct vcpu *v,
>>      enum x86_segment seg,
>>      const struct segment_register *reg,
>>      unsigned long offset,
>> @@ -2535,8 +2536,9 @@ bool_t hvm_virtual_to_linear_addr(
>>      const struct segment_register *active_cs,
>>      unsigned long *linear_addr)
>>  {
>> -    const struct vcpu *curr = current;
>>      unsigned long addr = offset, last_byte;
>> +    const struct cpu_user_regs *regs = v == current ? guest_cpu_user_regs()
>> +                                                    : &v->arch.user_regs;
>>      bool_t okay = 0;
> 
> Since you change the function return type to bool, you should also
> change the type of the returned variable IMO. It's just a two line
> change.

Can do; I would have viewed this as an unrelated change: While the
first change needed is indeed on an adjacent line (above), the other
one isn't.

> Also is it worth adding some check that the remote vCPU is paused? Or
> else you might get inconsistent results by using data that's stale by
> the time Xen acts on it.

I did ask myself the same question. I did conclude that even if the
remote vCPU is paused at the time of the check, it may be unpaused again
immediately after, so the information returned might be stale anyway.
I further looked at {hvm,vmx}_get_segment_register() to see what they
did - nothing. It is only now that I've also checked
svm_get_segment_register(), which - interestingly - has such a check.
I will copy the ASSERT() used there.
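In shape, such a check amounts to the following (mock types and names, purely illustrative — the real code uses Xen's vcpu/ASSERT machinery, and the actual predicate lives in svm_get_segment_register()):

```c
#include <assert.h>
#include <stdbool.h>

/* Mock stand-ins, illustrative only. */
struct vcpu { bool running; };

static struct vcpu cpu0 = { .running = true };
#define current (&cpu0)

static bool vcpu_runnable(const struct vcpu *v)
{
    return v->running;
}

/* Reading a remote vCPU's register state is only safe when that vCPU
 * cannot be (re)scheduled underneath us: either it is "current", or it
 * is not runnable at this moment. */
static void read_remote_state(struct vcpu *v)
{
    assert(v == current || !vcpu_runnable(v));
    /* ... actual state access would follow ... */
}
```

As noted, even a check like this only captures the state at the instant of the assertion; it documents the expectation rather than guaranteeing freshness.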

Jan




* Re: [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0'
  2021-09-23 10:31   ` Roger Pau Monné
@ 2021-09-23 10:38     ` Roger Pau Monné
  2021-09-23 10:47     ` Jan Beulich
  1 sibling, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 10:38 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Sep 23, 2021 at 12:31:14PM +0200, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:20:00AM +0200, Jan Beulich wrote:
> > -    /* Prevent interleaving of output. */
> > -    flags = console_lock_recursive_irqsave();
> > +    /*
> > +     * Prevent interleaving of output if possible. For HVM we can't do so, as
> > +     * the necessary P2M lookups involve locking, which has to occur with IRQs
> > +     * enabled.
> > +     */
> > +    if ( !is_hvm_vcpu(v) )
> > +        flags = console_lock_recursive_irqsave();
> >  
> >      vcpu_show_registers(v);
> > -    if ( guest_kernel_mode(v, &v->arch.user_regs) )
> > +    if ( is_hvm_vcpu(v) )
> > +        show_hvm_stack(v, &v->arch.user_regs);
> 
> Would it make sense to unlock in show_hvm_stack, and thus keep the
> printing of vcpu_show_registers locked even when in HVM context?

To clarify, the unlock should be limited around the call to
show_hvm_stack, not to be performed inside of the function itself
(since we would have to pass flags around then).

Thanks, Roger.



* Re: [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0'
  2021-09-23 10:31   ` Roger Pau Monné
  2021-09-23 10:38     ` Roger Pau Monné
@ 2021-09-23 10:47     ` Jan Beulich
  2021-09-23 14:43       ` Roger Pau Monné
  1 sibling, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 10:47 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 23.09.2021 12:31, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:20:00AM +0200, Jan Beulich wrote:
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -3408,6 +3408,15 @@ enum hvm_translation_result hvm_copy_fro
>>                        PFEC_page_present | pfec, pfinfo);
>>  }
>>  
>> +enum hvm_translation_result hvm_copy_from_vcpu_linear(
>> +    void *buf, unsigned long addr, unsigned int size, struct vcpu *v,
>> +    unsigned int pfec)
> 
> Even if your current use case doesn't need it, would it be worth
> adding a pagefault_info_t parameter?

I'd prefer to add such parameters only once they become necessary.

>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -364,6 +364,71 @@ static void show_guest_stack(struct vcpu
>>      printk("\n");
>>  }
>>  
>> +static void show_hvm_stack(struct vcpu *v, const struct cpu_user_regs *regs)
>> +{
>> +#ifdef CONFIG_HVM
>> +    unsigned long sp = regs->rsp, addr;
>> +    unsigned int i, bytes, words_per_line, pfec = PFEC_page_present;
>> +    struct segment_register ss, cs;
>> +
>> +    hvm_get_segment_register(v, x86_seg_ss, &ss);
>> +    hvm_get_segment_register(v, x86_seg_cs, &cs);
>> +
>> +    if ( hvm_long_mode_active(v) && cs.l )
>> +        i = 16, bytes = 8;
>> +    else
>> +    {
>> +        sp = ss.db ? (uint32_t)sp : (uint16_t)sp;
>> +        i = ss.db ? 8 : 4;
>> +        bytes = cs.db ? 4 : 2;
>> +    }
>> +
>> +    if ( bytes == 8 || (ss.db && !ss.base) )
>> +        printk("Guest stack trace from sp=%0*lx:", i, sp);
>> +    else
>> +        printk("Guest stack trace from ss:sp=%04x:%0*lx:", ss.sel, i, sp);
>> +
>> +    if ( !hvm_vcpu_virtual_to_linear(v, x86_seg_ss, &ss, sp, bytes,
>> +                                     hvm_access_read, &cs, &addr) )
>> +    {
>> +        printk(" Guest-inaccessible memory\n");
>> +        return;
>> +    }
>> +
>> +    if ( ss.dpl == 3 )
>> +        pfec |= PFEC_user_mode;
>> +
>> +    words_per_line = stack_words_per_line * (sizeof(void *) / bytes);
>> +    for ( i = 0; i < debug_stack_lines * words_per_line; )
>> +    {
>> +        unsigned long val = 0;
>> +
>> +        if ( (addr ^ (addr + bytes - 1)) & PAGE_SIZE )
>> +            break;
>> +
>> +        if ( !(i++ % words_per_line) )
>> +            printk("\n  ");
>> +
>> +        if ( hvm_copy_from_vcpu_linear(&val, addr, bytes, v,
>> +                                       pfec) != HVMTRANS_okay )
> 
> I think I'm confused, but what about guests without paging enabled?
> Don't you need to use hvm_copy_from_guest_phys (likely transformed
> into hvm_copy_from_vcpu_phys)?

__hvm_copy() calls hvm_translate_get_page() telling it whether the
input is a linear or physical address. hvm_translate_get_page() will
use paging_gva_to_gfn() in this case. The HAP backing function
changes when the guest {en,dis}ables paging, while shadow code deals
with paging disabled by installing an identity mapping
(d->arch.paging.shadow.unpaged_pagetable) which it would then end up
(needlessly) walking.

It really is - afaict - intentional for callers to not have to deal
with the special case.

>> @@ -663,14 +728,22 @@ void vcpu_show_execution_state(struct vc
>>      }
>>  #endif
>>  
>> -    /* Prevent interleaving of output. */
>> -    flags = console_lock_recursive_irqsave();
>> +    /*
>> +     * Prevent interleaving of output if possible. For HVM we can't do so, as
>> +     * the necessary P2M lookups involve locking, which has to occur with IRQs
>> +     * enabled.
>> +     */
>> +    if ( !is_hvm_vcpu(v) )
>> +        flags = console_lock_recursive_irqsave();
>>  
>>      vcpu_show_registers(v);
>> -    if ( guest_kernel_mode(v, &v->arch.user_regs) )
>> +    if ( is_hvm_vcpu(v) )
>> +        show_hvm_stack(v, &v->arch.user_regs);
> 
> Would it make sense to unlock in show_hvm_stack, and thus keep the
> printing of vcpu_show_registers locked even when in HVM context?

Perhaps not _in_, but before calling it, yet - why not.

> TBH I've never found the guest stack dump to be helpful for debugging
> purposes, but maybe others do.

I can't count how many times I did use the stack dumps of Dom0 to
actually pinpoint problems.

Jan




* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-21  7:21 ` [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries Jan Beulich
@ 2021-09-23 11:10   ` Roger Pau Monné
  2021-09-23 11:32     ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 11:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
> To become independent of the sequence of mapping operations, permit
> "access" to accumulate for Dom0, noting that there's not going to be an
> introspection agent for it which this might interfere with. While e.g.
> ideally only ROM regions would get mapped with X set, getting there is
> quite a bit of work. Plus the use of p2m_access_* here is abusive in the
> first place.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v3: Move last in series, for being controversial.
> v2: Split off from original patch. Accumulate all of R, W, and X.
> 
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
>              return -EPERM;
>          }
>  
> +        /*
> +         * Gross bodge, to go away again rather sooner than later:
> +         *
> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
> +         * the way they specify "access", this will allow the ultimate result
> +         * to be independent of the sequence of operations.

Wouldn't it be better to 'fix' those operations so that they can work
together?

It's my understanding that set_identity_p2m_entry is the one that has
strong requirements regarding the access permissions, as on AMD ACPI
tables can specify how should regions be mapped.

A possible solution might be to make set_mmio_p2m_entry more tolerant
to how present mappings are handled. For once that function doesn't
let callers specify access permissions, so I would consider that if a
mapping is present on the gfn and it already points to the requested
mfn no error should be returned to the caller. At the end the 'default
access' for that gfn -> mfn relation is the one established by
set_identity_p2m_entry and shouldn't be subject to the p2m default
access.

Thanks, Roger.



* Re: [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes
  2021-09-23  8:23   ` Roger Pau Monné
@ 2021-09-23 11:27     ` Jan Beulich
  2021-09-23 14:46       ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 11:27 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Kevin Tian, Jun Nakajima

On 23.09.2021 10:23, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:20:32AM +0200, Jan Beulich wrote:
>> There's not really any register state associated with offline vCPU-s, so
>> avoid spamming the log with largely useless information while still
>> leaving an indication of the fact.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks.

> Is the state cleared when the vCPU is brought down? Or else it might
> be interesting to print such state for debug purposes.

Hmm, if that's considered to be potentially useful, then we'd need a
key different from VPF_down. HLT would leave state in place, and the
INIT/SIPI sequence would clobber prior state also only on the 2nd
step. v->is_initialised may be an option, but gets cleared by INIT,
not SIPI. Any other ideas?

Jan




* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-23 11:10   ` Roger Pau Monné
@ 2021-09-23 11:32     ` Jan Beulich
  2021-09-23 11:54       ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 11:32 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On 23.09.2021 13:10, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
>> To become independent of the sequence of mapping operations, permit
>> "access" to accumulate for Dom0, noting that there's not going to be an
>> introspection agent for it which this might interfere with. While e.g.
>> ideally only ROM regions would get mapped with X set, getting there is
>> quite a bit of work. Plus the use of p2m_access_* here is abusive in the
>> first place.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v3: Move last in series, for being controversial.
>> v2: Split off from original patch. Accumulate all of R, W, and X.
>>
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
>>              return -EPERM;
>>          }
>>  
>> +        /*
>> +         * Gross bodge, to go away again rather sooner than later:
>> +         *
>> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
>> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
>> +         * the way they specify "access", this will allow the ultimate result
>> +         * to be independent of the sequence of operations.
> 
> Wouldn't it be better to 'fix' those operations so that they can work
> together?

Yes, but then we should do this properly by removing all abuse of
p2m_access_t.

> It's my understanding that set_identity_p2m_entry is the one that has
> strong requirements regarding the access permissions, as on AMD ACPI
> tables can specify how should regions be mapped.
> 
> A possible solution might be to make set_mmio_p2m_entry more tolerant
> to how present mappings are handled. For once that function doesn't
> let callers specify access permissions, so I would consider that if a
> mapping is present on the gfn and it already points to the requested
> mfn no error should be returned to the caller. At the end the 'default
> access' for that gfn -> mfn relation is the one established by
> set_identity_p2m_entry and shouldn't be subject to the p2m default
> access.

I think further reducing access is in theory supposed to be possible.
For Dom0 all of this (including the potential of default access not
being RWX) is a questionable thing though, as pointed out in earlier
discussions. After all there's no introspection (or alike) agent
supposed to be controlling Dom0.

Jan




* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-23 11:32     ` Jan Beulich
@ 2021-09-23 11:54       ` Roger Pau Monné
  2021-09-23 12:15         ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 11:54 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On Thu, Sep 23, 2021 at 01:32:52PM +0200, Jan Beulich wrote:
> On 23.09.2021 13:10, Roger Pau Monné wrote:
> > On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
> >> To become independent of the sequence of mapping operations, permit
> >> "access" to accumulate for Dom0, noting that there's not going to be an
> >> introspection agent for it which this might interfere with. While e.g.
> >> ideally only ROM regions would get mapped with X set, getting there is
> >> quite a bit of work. Plus the use of p2m_access_* here is abusive in the
> >> first place.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> v3: Move last in series, for being controversial.
> >> v2: Split off from original patch. Accumulate all of R, W, and X.
> >>
> >> --- a/xen/arch/x86/mm/p2m.c
> >> +++ b/xen/arch/x86/mm/p2m.c
> >> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
> >>              return -EPERM;
> >>          }
> >>  
> >> +        /*
> >> +         * Gross bodge, to go away again rather sooner than later:
> >> +         *
> >> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
> >> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
> >> +         * the way they specify "access", this will allow the ultimate result
> >> +         * to be independent of the sequence of operations.
> > 
> > Wouldn't it be better to 'fix' those operations so that they can work
> > together?
> 
> Yes, but then we should do this properly by removing all abuse of
> p2m_access_t.

I'm not sure I follow what that abuse is.

> > It's my understanding that set_identity_p2m_entry is the one that has
> > strong requirements regarding the access permissions, as on AMD ACPI
> > tables can specify how should regions be mapped.
> > 
> > A possible solution might be to make set_mmio_p2m_entry more tolerant
> > to how present mappings are handled. For once that function doesn't
> > let callers specify access permissions, so I would consider that if a
> > mapping is present on the gfn and it already points to the requested
> > mfn no error should be returned to the caller. At the end the 'default
> > access' for that gfn -> mfn relation is the one established by
> > set_identity_p2m_entry and shouldn't be subject to the p2m default
> > access.
> 
> I think further reducing access is in theory supposed to be possible.

Couldn't an access reduction introduce issues, kind of similar to what
would happen if the regions are unmapped? (and that XSA-378 addressed)

I think whatever permissions set_identity_p2m_entry sets should be
mandatory ones, and no changes should be allowed. That would maybe
require introducing a new p2m type (p2m_mmio_mandatory) in order to
differentiate firmware mandatory MMIO mappings from the rest.

> For Dom0 all of this (including the potential of default access not
> being RWX) is a questionable thing though, as pointed out in earlier
> discussions. After all there's no introspection (or alike) agent
> supposed to be controlling Dom0.

Ideally I would prefer a solution that could be applied to both dom0
and domU, if that's possible.

Thanks, Roger.



* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-23 11:54       ` Roger Pau Monné
@ 2021-09-23 12:15         ` Jan Beulich
  2021-09-23 15:15           ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 12:15 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On 23.09.2021 13:54, Roger Pau Monné wrote:
> On Thu, Sep 23, 2021 at 01:32:52PM +0200, Jan Beulich wrote:
>> On 23.09.2021 13:10, Roger Pau Monné wrote:
>>> On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
>>>> --- a/xen/arch/x86/mm/p2m.c
>>>> +++ b/xen/arch/x86/mm/p2m.c
>>>> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
>>>>              return -EPERM;
>>>>          }
>>>>  
>>>> +        /*
>>>> +         * Gross bodge, to go away again rather sooner than later:
>>>> +         *
>>>> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
>>>> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
>>>> +         * the way they specify "access", this will allow the ultimate result
>>>> +         * to be independent of the sequence of operations.
>>>
>>> Wouldn't it be better to 'fix' those operations so that they can work
>>> together?
>>
>> Yes, but then we should do this properly by removing all abuse of
>> p2m_access_t.
> 
> I'm not sure I follow what that abuse is.

As was clarified, p2m_access_t should be solely used by e.g.
introspection agents, who are then the entity responsible for
resolving the resulting faults. Any other uses (to e.g. merely
restrict permissions for other reasons) are really abuses. That
is, if e.g. a r/o direct-MMIO mapping is needed, this should not
be expressed as (p2m_mmio_direct, p2m_access_r) tuple, but would
require a distinct p2m_mmio_direct_ro type.

>>> It's my understanding that set_identity_p2m_entry is the one that has
>>> strong requirements regarding the access permissions, as on AMD ACPI
>>> tables can specify how should regions be mapped.
>>>
>>> A possible solution might be to make set_mmio_p2m_entry more tolerant
>>> to how present mappings are handled. For once that function doesn't
>>> let callers specify access permissions, so I would consider that if a
>>> mapping is present on the gfn and it already points to the requested
>>> mfn no error should be returned to the caller. At the end the 'default
>>> access' for that gfn -> mfn relation is the one established by
>>> set_identity_p2m_entry and shouldn't be subject to the p2m default
>>> access.
>>
>> I think further reducing access is in theory supposed to be possible.
> 
> Couldn't an access reduction introduce issues, kind of similar to what
> would happen if the regions are unmapped? (and that XSA-378 addressed)
> 
> I think whatever permissions set_identity_p2m_entry sets should be
> mandatory ones, and no changes should be allowed. That would maybe
> require introducing a new p2m type (p2m_mmio_mandatory) in order to
> differentiate firmware mandatory MMIO mappings from the rest.

Hmm, indeed. No deviation in either direction should be permitted.

Jan




* Re: [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0
  2021-09-23  9:46     ` Jan Beulich
@ 2021-09-23 13:22       ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 13:22 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Sep 23, 2021 at 11:46:48AM +0200, Jan Beulich wrote:
> On 22.09.2021 17:01, Roger Pau Monné wrote:
> > On Tue, Sep 21, 2021 at 09:18:05AM +0200, Jan Beulich wrote:
> >> --- a/xen/include/public/arch-x86/hvm/start_info.h
> >> +++ b/xen/include/public/arch-x86/hvm/start_info.h
> >> @@ -33,7 +33,7 @@
> >>   *    | magic          | Contains the magic value XEN_HVM_START_MAGIC_VALUE
> >>   *    |                | ("xEn3" with the 0x80 bit of the "E" set).
> >>   *  4 +----------------+
> >> - *    | version        | Version of this structure. Current version is 1. New
> >> + *    | version        | Version of this structure. Current version is 2. New
> >>   *    |                | versions are guaranteed to be backwards-compatible.
> >>   *  8 +----------------+
> >>   *    | flags          | SIF_xxx flags.
> >> @@ -55,7 +55,15 @@
> >>   *    |                | if there is no memory map being provided. Only
> >>   *    |                | present in version 1 and newer of the structure.
> >>   * 52 +----------------+
> >> - *    | reserved       | Version 1 and newer only.
> >> + *    | vga_info.offset| Offset of struct dom0_vga_console_info from base of
> > 
> > I'm not sure we are supposed to reference external structures like
> > that. We took a lot of care to not depend on other headers, and to
> > make this as agnostic as possible (IIRC KVM is also capable of using
> > the PVH entry point natively, and hence depends on this header).
> 
> But KVM wouldn't be using a Dom0-only part of the interface, would
> it? (I'm aware of the possible re-using of the entry point.)

I think naming it as dom0-only is part of the problem. In principle I
don't see why you couldn't use such structure to report VGA console
information when inside of a VM that has such device.

> > IF we want to add support for dom0_vga_console_info I think we need to
> > at least provide a binary layout for it, like all the other pieces
> > that are part of the HVM start info.
> 
> Which then means we can't sensibly re-use the existing structure,
> as that doesn't have as strict rules as the hvm_start_info one.
> Which in turn means Linux can't re-use the code converting
> dom0_vga_console_info, resulting in two places needing updating
> whenever information gets add to (then) both structures (what
> information they carry will, after all, want to remain in sync).

Indeed. I don't think there's a way for us to cut corners here. As
said above, I don't think we should explicitly limit the usage of this
information to Xen dom0, and hence should be a first class addition to
the start info contents.

> >> + *    |                | struct hvm_start_info. Optional and only present in
> >> + *    |                | version 2 and newer of the structure when
> >> + *    |                | SIF_INITDOMAIN is set; zero if absent.
> > 
> > We have usually used an absolute physical address to reference other
> > data, and I think we should likely keep in that way for coherency.
> 
> Hmm. (See below.)
> 
> >> + * 54 +----------------+
> >> + *    | vga_info.size  | Size of present parts of struct dom0_vga_console_info.
> >> + *    |                | Optional and only present in version 2 and newer of
> >> + *    |                | the structure when SIF_INITDOMAIN is set; zero if
> >> + *    |                | absent.
> >>   * 56 +----------------+
> >>   *
> >>   * The layout of each entry in the module structure is the following:
> >> @@ -139,7 +147,15 @@ struct hvm_start_info {
> >>      uint32_t memmap_entries;    /* Number of entries in the memmap table.    */
> >>                                  /* Value will be zero if there is no memory  */
> >>                                  /* map being provided.                       */
> >> -    uint32_t reserved;          /* Must be zero.                             */
> > 
> > This 'Must be zero' comment troubles me a bit, I'm not convinced we
> > can just place data here (ie: it would no longer be 0). It's even
> > worse because the must be zero comment is only present in the C
> > representation of the struct, the layout above just denotes the field
> > is reserved (which would imply it's fine to use for other means in
> > v2).
> 
> I thought the textual description was meant to be the ABI spec. The C
> comment should therefore be viewed as if missing "in version 1" or
> "presently".
> 
> Taking into account also Andrew's reply, I have to admit that I'm
> inclined to request that one of the two of you fix this obvious
> shortcoming in both Xen and Linux. I'm not really willing to be the one
> to introduce a 2nd layout for the same set of data just for the purpose
> of "playing nice" in an area where that, affecting Dom0 only, doesn't
> seem to matter all this much. My goal was rather to keep the impact on
> hvm_start_info as low as possible (and in particular avoid changing its
size, as strictly speaking Linux's consumer implementation is buggy:
> It would always copy as much data as it knows _may_ be present, not as
> little data as may have been _actually_ provided; whoever implemented
> this did only consider one half of the compatibility requirements,
> quite likely simply because in the design this aspect was also missed,
> or else the structure would have had a length field right from its
> introduction).

There isn't a strict need to have several different layouts on the
consumer, as you could easily use offsetof to copy up to the boundary
you know it's populated given a version number. But yes, a size field
would be useful to prevent the consumer from having to know the
version boundaries.

I can try to take a stab at implementing this at a later point, I
certainly don't want to force you into implementing something you
believe it's not the correct solution.

Thanks, Roger.



* Re: [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0'
  2021-09-23 10:21     ` Jan Beulich
@ 2021-09-23 14:27       ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 14:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Sep 23, 2021 at 12:21:42PM +0200, Jan Beulich wrote:
> On 22.09.2021 17:48, Roger Pau Monné wrote:
> > On Tue, Sep 21, 2021 at 09:19:06AM +0200, Jan Beulich wrote:
> >> vcpu_show_registers() didn't do anything for HVM so far. Note though
> >> that some extra hackery is needed for VMX - see the code comment.
> >>
> >> Note further that the show_guest_stack() invocation is left alone here:
> >> While strictly speaking guest_kernel_mode() should be predicated by a
> >> PV / !HVM check, show_guest_stack() itself will bail immediately for
> >> HVM.
> >>
> >> While there and despite not being PVH-specific, take the opportunity and
> >> filter offline vCPU-s: There's not really any register state associated
> >> with them, so avoid spamming the log with useless information while
> >> still leaving an indication of the fact.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> I was pondering whether to also have the VMCS/VMCB dumped for every
> >> vCPU, to present full state. The downside is that for larger systems
> >> this would be a lot of output.
> > 
> > At least for Intel there's already a debug key to dump VMCS, so I'm
> > unsure it's worth dumping it here also, as a user can get the
> > information elsewhere (that's what I've always used to debug PVH
> > TBH).
> 
> I know there is a respective debug key. That dumps _all_ VMCSes, though,
> so might be quite verbose on a big system (where Dom0's output alone
> may already be quite verbose).
> 
> >> --- a/xen/arch/x86/x86_64/traps.c
> >> +++ b/xen/arch/x86/x86_64/traps.c
> >> @@ -49,6 +49,39 @@ static void read_registers(struct cpu_us
> >>      crs[7] = read_gs_shadow();
> >>  }
> >>  
> >> +static void get_hvm_registers(struct vcpu *v, struct cpu_user_regs *regs,
> >> +                              unsigned long crs[8])
> > 
> > Would this better be placed in hvm.c now that it's a HVM only
> > function?
> 
> I was asking myself this question, but decided that the placement here
> is perhaps at least no bigger of a problem than putting it there.
> Factors played into this:
> - the specifics of the usage of the crs[8] array,
> - the fact that the PV function also lives here, not under pv/,

I think both functions should live under hvm/ and pv/ respectively.
There's nothing x86_64-specific about them that would warrant their
placement here.

> - the desire to keep the function static.

Well, that's obviously not possible if it's moved to a different file.

> I can certainly be talked into moving the code, but I will want to see
> convincing arguments that none of the three items above (and possible
> other ones I may have missed) are really a problem then.

As said above, my preference would be to move those to pv/ and hvm/,
but I can also live with the current arrangement.

> >> @@ -159,24 +173,35 @@ void show_registers(const struct cpu_use
> >>  void vcpu_show_registers(const struct vcpu *v)
> >>  {
> >>      const struct cpu_user_regs *regs = &v->arch.user_regs;
> 
> Please note this in addition to my response below.
> 
> >> -    bool kernel = guest_kernel_mode(v, regs);
> >> +    struct cpu_user_regs aux_regs;
> >> +    enum context context;
> >>      unsigned long crs[8];
> >>  
> >> -    /* Only handle PV guests for now */
> >> -    if ( !is_pv_vcpu(v) )
> >> -        return;
> >> -
> >> -    crs[0] = v->arch.pv.ctrlreg[0];
> >> -    crs[2] = arch_get_cr2(v);
> >> -    crs[3] = pagetable_get_paddr(kernel ?
> >> -                                 v->arch.guest_table :
> >> -                                 v->arch.guest_table_user);
> >> -    crs[4] = v->arch.pv.ctrlreg[4];
> >> -    crs[5] = v->arch.pv.fs_base;
> >> -    crs[6 + !kernel] = v->arch.pv.gs_base_kernel;
> >> -    crs[7 - !kernel] = v->arch.pv.gs_base_user;
> >> +    if ( is_hvm_vcpu(v) )
> >> +    {
> >> +        aux_regs = *regs;
> >> +        get_hvm_registers(v->domain->vcpu[v->vcpu_id], &aux_regs, crs);
> > 
> > I wonder if you could load the values directly into v->arch.user_regs,
> > but maybe that would taint some other info already there. I certainly
> > haven't looked closely.
> 
> I had it that other way first, wondering whether altering the structure
> there might be safe. It felt wrong to fiddle with the live registers,
> and the "const" above than was the final bit that convinced me I should
> go the chosen route. Yet again - I can be talked into going the route
> you outline via convincing arguments. Don't forget that we e.g.
> deliberately poison the selector values in debug builds (see
> hvm_invalidate_regs_fields()) - that poisoning would get undermined if
> we wrote directly into the structure.

The vcpu is paused at this point, but I agree it should not undermine
the poisoning. I assume those fields don't get filled because it's an
expensive operation and they don't get used that often.

Long term it might be helpful to have something akin to
guest_cpu_user_regs that could be used on paused remote vCPUs.

Anyway, I don't really have much other comments, so:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable
  2021-09-23 10:34     ` Jan Beulich
@ 2021-09-23 14:28       ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 14:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Sep 23, 2021 at 12:34:48PM +0200, Jan Beulich wrote:
> On 23.09.2021 10:09, Roger Pau Monné wrote:
> > On Tue, Sep 21, 2021 at 09:19:37AM +0200, Jan Beulich wrote:
> >> --- a/xen/arch/x86/hvm/hvm.c
> >> +++ b/xen/arch/x86/hvm/hvm.c
> >> @@ -2526,7 +2526,8 @@ int hvm_set_cr4(unsigned long value, boo
> >>      return X86EMUL_OKAY;
> >>  }
> >>  
> >> -bool_t hvm_virtual_to_linear_addr(
> >> +bool hvm_vcpu_virtual_to_linear(
> >> +    struct vcpu *v,
> >>      enum x86_segment seg,
> >>      const struct segment_register *reg,
> >>      unsigned long offset,
> >> @@ -2535,8 +2536,9 @@ bool_t hvm_virtual_to_linear_addr(
> >>      const struct segment_register *active_cs,
> >>      unsigned long *linear_addr)
> >>  {
> >> -    const struct vcpu *curr = current;
> >>      unsigned long addr = offset, last_byte;
> >> +    const struct cpu_user_regs *regs = v == current ? guest_cpu_user_regs()
> >> +                                                    : &v->arch.user_regs;
> >>      bool_t okay = 0;
> > 
> > Since you change the function return type to bool, you should also
> > change the type of the returned variable IMO. It's just a two line
> > change.
> 
> Can do; I would have viewed this as an unrelated change: While the
> first change needed indeed is on an adjacent line (above), the other
> one isn't.
> 
> > Also is it worth adding some check that the remote vCPU is paused? Or
> else you might get inconsistent results by using data that's stale by
> > the time Xen acts on it.
> 
> I did ask myself the same question. I did conclude that even if the
> remote vCPU is paused at the time of the check, it may not be anymore
> immediately after, so the information returned might be stale anyway.
> I further looked at {hvm,vmx}_get_segment_register() to see what they
> did - nothing. It is only now that I've also checked
> svm_get_segment_register(), which - interestingly - has such a check.
> I will copy the ASSERT() used there.

Thanks, with that:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Roger.



* Re: [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0'
  2021-09-23 10:47     ` Jan Beulich
@ 2021-09-23 14:43       ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 14:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu

On Thu, Sep 23, 2021 at 12:47:26PM +0200, Jan Beulich wrote:
> On 23.09.2021 12:31, Roger Pau Monné wrote:
> > On Tue, Sep 21, 2021 at 09:20:00AM +0200, Jan Beulich wrote:
> >> --- a/xen/arch/x86/hvm/hvm.c
> >> +++ b/xen/arch/x86/hvm/hvm.c
> >> @@ -3408,6 +3408,15 @@ enum hvm_translation_result hvm_copy_fro
> >>                        PFEC_page_present | pfec, pfinfo);
> >>  }
> >>  
> >> +enum hvm_translation_result hvm_copy_from_vcpu_linear(
> >> +    void *buf, unsigned long addr, unsigned int size, struct vcpu *v,
> >> +    unsigned int pfec)
> > 
> > Even if your current use case doesn't need it, would it be worth
> > adding a pagefault_info_t parameter?
> 
> I'd prefer to add such parameters only once they become necessary.
> 
> >> --- a/xen/arch/x86/traps.c
> >> +++ b/xen/arch/x86/traps.c
> >> @@ -364,6 +364,71 @@ static void show_guest_stack(struct vcpu
> >>      printk("\n");
> >>  }
> >>  
> >> +static void show_hvm_stack(struct vcpu *v, const struct cpu_user_regs *regs)
> >> +{
> >> +#ifdef CONFIG_HVM
> >> +    unsigned long sp = regs->rsp, addr;
> >> +    unsigned int i, bytes, words_per_line, pfec = PFEC_page_present;
> >> +    struct segment_register ss, cs;
> >> +
> >> +    hvm_get_segment_register(v, x86_seg_ss, &ss);
> >> +    hvm_get_segment_register(v, x86_seg_cs, &cs);
> >> +
> >> +    if ( hvm_long_mode_active(v) && cs.l )
> >> +        i = 16, bytes = 8;
> >> +    else
> >> +    {
> >> +        sp = ss.db ? (uint32_t)sp : (uint16_t)sp;
> >> +        i = ss.db ? 8 : 4;
> >> +        bytes = cs.db ? 4 : 2;
> >> +    }
> >> +
> >> +    if ( bytes == 8 || (ss.db && !ss.base) )
> >> +        printk("Guest stack trace from sp=%0*lx:", i, sp);
> >> +    else
> >> +        printk("Guest stack trace from ss:sp=%04x:%0*lx:", ss.sel, i, sp);
> >> +
> >> +    if ( !hvm_vcpu_virtual_to_linear(v, x86_seg_ss, &ss, sp, bytes,
> >> +                                     hvm_access_read, &cs, &addr) )
> >> +    {
> >> +        printk(" Guest-inaccessible memory\n");
> >> +        return;
> >> +    }
> >> +
> >> +    if ( ss.dpl == 3 )
> >> +        pfec |= PFEC_user_mode;
> >> +
> >> +    words_per_line = stack_words_per_line * (sizeof(void *) / bytes);
> >> +    for ( i = 0; i < debug_stack_lines * words_per_line; )
> >> +    {
> >> +        unsigned long val = 0;
> >> +
> >> +        if ( (addr ^ (addr + bytes - 1)) & PAGE_SIZE )
> >> +            break;
> >> +
> >> +        if ( !(i++ % words_per_line) )
> >> +            printk("\n  ");
> >> +
> >> +        if ( hvm_copy_from_vcpu_linear(&val, addr, bytes, v,
> >> +                                       pfec) != HVMTRANS_okay )
> > 
> > I think I'm confused, but what about guests without paging enabled?
> > Don't you need to use hvm_copy_from_guest_phys (likely transformed
> > into hvm_copy_from_vcpu_phys)?
> 
> __hvm_copy() calls hvm_translate_get_page() telling it whether the
> input is a linear or physical address. hvm_translate_get_page() will
> use paging_gva_to_gfn() in this case. The HAP backing function
> changes when the guest {en,dis}ables paging, while shadow code deals
> with paging disabled by installing an identity mapping
> (d->arch.paging.shadow.unpaged_pagetable) which it would then end up
> (needlessly) walking.
> 
> It really is - afaict - intentional for callers to not have to deal
> with the special case.

I always forget that we change the paging_mode handlers when switching
between modes.

> >> @@ -663,14 +728,22 @@ void vcpu_show_execution_state(struct vc
> >>      }
> >>  #endif
> >>  
> >> -    /* Prevent interleaving of output. */
> >> -    flags = console_lock_recursive_irqsave();
> >> +    /*
> >> +     * Prevent interleaving of output if possible. For HVM we can't do so, as
> >> +     * the necessary P2M lookups involve locking, which has to occur with IRQs
> >> +     * enabled.
> >> +     */
> >> +    if ( !is_hvm_vcpu(v) )
> >> +        flags = console_lock_recursive_irqsave();
> >>  
> >>      vcpu_show_registers(v);
> >> -    if ( guest_kernel_mode(v, &v->arch.user_regs) )
> >> +    if ( is_hvm_vcpu(v) )
> >> +        show_hvm_stack(v, &v->arch.user_regs);
> > 
> > Would it make sense to unlock in show_hvm_stack, and thus keep the
> > printing of vcpu_show_registers locked even when in HVM context?
> 
> Perhaps not _in_, but before calling it, yet - why not.

Indeed, with that:

Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

Thanks, Roger.



* Re: [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes
  2021-09-23 11:27     ` Jan Beulich
@ 2021-09-23 14:46       ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 14:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Kevin Tian, Jun Nakajima

On Thu, Sep 23, 2021 at 01:27:30PM +0200, Jan Beulich wrote:
> On 23.09.2021 10:23, Roger Pau Monné wrote:
> > On Tue, Sep 21, 2021 at 09:20:32AM +0200, Jan Beulich wrote:
> >> There's not really any register state associated with offline vCPU-s, so
> >> avoid spamming the log with largely useless information while still
> >> leaving an indication of the fact.
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> > Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> Thanks.
> 
> > Is the state cleared when the vCPU is brought down? Or else it might
> > be interesting to print such state for debug purposes.
> 
> Hmm, if that's considered to be potentially useful, then we'd need a
> key different from VPF_down. HLT would leave state in place, and the
> INIT/SIPI sequence would clobber prior state also only on the 2nd
> step. v->is_initialised may be an option, but gets cleared by INIT,
> not SIPI. Any other ideas?

Well, I guess it's close enough. I also wonder whether we should do
something similar for PV, where it might even make more sense as you
can upload a vCPU state in that case when the vCPU is down.

Thanks, Roger.



* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-23 12:15         ` Jan Beulich
@ 2021-09-23 15:15           ` Roger Pau Monné
  2021-09-23 15:22             ` Jan Beulich
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 15:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On Thu, Sep 23, 2021 at 02:15:25PM +0200, Jan Beulich wrote:
> On 23.09.2021 13:54, Roger Pau Monné wrote:
> > On Thu, Sep 23, 2021 at 01:32:52PM +0200, Jan Beulich wrote:
> >> On 23.09.2021 13:10, Roger Pau Monné wrote:
> >>> On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
> >>>> --- a/xen/arch/x86/mm/p2m.c
> >>>> +++ b/xen/arch/x86/mm/p2m.c
> >>>> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
> >>>>              return -EPERM;
> >>>>          }
> >>>>  
> >>>> +        /*
> >>>> +         * Gross bodge, to go away again rather sooner than later:
> >>>> +         *
> >>>> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
> >>>> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
> >>>> +         * the way they specify "access", this will allow the ultimate result
> >>>> +         * to be independent of the sequence of operations.
> >>>
> >>> Wouldn't it be better to 'fix' those operations so that they can work
> >>> together?
> >>
> >> Yes, but then we should do this properly by removing all abuse of
> >> p2m_access_t.
> > 
> > I'm not sure I follow what that abuse is.
> 
> As was clarified, p2m_access_t should be solely used by e.g.
> introspection agents, who are then the entity responsible for
> resolving the resulting faults. Any other uses (to e.g. merely
> restrict permissions for other reasons) are really abuses.

But some p2m types don't really have a fixed access tied to them, for
example MMIO regions just inherit whatever is the default access for
the domain, which makes all this look slightly weird. If the access
should solely be used by introspection, then each type should have a
fixed access tied to it?

> That
> is, if e.g. a r/o direct-MMIO mapping is needed, this should not
> be expressed as (p2m_mmio_direct, p2m_access_r) tuple, but would
> require a distinct p2m_mmio_direct_ro type.

I guess we would then require a p2m_mmio_direct_ro,
p2m_mmio_direct_rwx and a p2m_mmio_direct_n at least, and ideally we
would need to differentiate the mandatory regions as present in ACPI
tables using yet different types.

Thanks, Roger.



* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-23 15:15           ` Roger Pau Monné
@ 2021-09-23 15:22             ` Jan Beulich
  2021-09-23 15:32               ` Roger Pau Monné
  0 siblings, 1 reply; 44+ messages in thread
From: Jan Beulich @ 2021-09-23 15:22 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On 23.09.2021 17:15, Roger Pau Monné wrote:
> On Thu, Sep 23, 2021 at 02:15:25PM +0200, Jan Beulich wrote:
>> On 23.09.2021 13:54, Roger Pau Monné wrote:
>>> On Thu, Sep 23, 2021 at 01:32:52PM +0200, Jan Beulich wrote:
>>>> On 23.09.2021 13:10, Roger Pau Monné wrote:
>>>>> On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
>>>>>> --- a/xen/arch/x86/mm/p2m.c
>>>>>> +++ b/xen/arch/x86/mm/p2m.c
>>>>>> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
>>>>>>              return -EPERM;
>>>>>>          }
>>>>>>  
>>>>>> +        /*
>>>>>> +         * Gross bodge, to go away again rather sooner than later:
>>>>>> +         *
>>>>>> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
>>>>>> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
>>>>>> +         * the way they specify "access", this will allow the ultimate result
>>>>>> +         * to be independent of the sequence of operations.
>>>>>
>>>>> Wouldn't it be better to 'fix' those operations so that they can work
>>>>> together?
>>>>
>>>> Yes, but then we should do this properly by removing all abuse of
>>>> p2m_access_t.
>>>
>>> I'm not sure I follow what that abuse is.
>>
>> As was clarified, p2m_access_t should be solely used by e.g.
>> introspection agents, who are then the entity responsible for
>> resolving the resulting faults. Any other uses (to e.g. merely
>> restrict permissions for other reasons) are really abuses.
> 
> But some p2m types don't really have a fixed access tied to them, for
> example MMIO regions just inherit whatever is the default access for
> the domain, which makes all this look slightly weird. If the access
> should solely be used by introspection, then each type should have a
> fixed access tied to it?

I think so, yes. Hence e.g. p2m_ram_ro and p2m_grant_map_r{w,o}.

>> That
>> is, if e.g. a r/o direct-MMIO mapping is needed, this should not
>> be expressed as (p2m_mmio_direct, p2m_access_r) tuple, but would
>> require a distinct p2m_mmio_direct_ro type.
> 
> I guess we would then require a p2m_mmio_direct_ro,
> p2m_mmio_direct_rwx and a p2m_mmio_direct_n at least, and ideally we
> would need to differentiate the mandatory regions as present in ACPI
> tables using yet different types.

What would we need p2m_mmio_direct_n for? And what's the (present,
not future) reason for the x in p2m_mmio_direct_rwx?

Jan




* Re: [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries
  2021-09-23 15:22             ` Jan Beulich
@ 2021-09-23 15:32               ` Roger Pau Monné
  0 siblings, 0 replies; 44+ messages in thread
From: Roger Pau Monné @ 2021-09-23 15:32 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Andrew Cooper, Wei Liu, George Dunlap

On Thu, Sep 23, 2021 at 05:22:08PM +0200, Jan Beulich wrote:
> On 23.09.2021 17:15, Roger Pau Monné wrote:
> > On Thu, Sep 23, 2021 at 02:15:25PM +0200, Jan Beulich wrote:
> >> On 23.09.2021 13:54, Roger Pau Monné wrote:
> >>> On Thu, Sep 23, 2021 at 01:32:52PM +0200, Jan Beulich wrote:
> >>>> On 23.09.2021 13:10, Roger Pau Monné wrote:
> >>>>> On Tue, Sep 21, 2021 at 09:21:11AM +0200, Jan Beulich wrote:
> >>>>>> --- a/xen/arch/x86/mm/p2m.c
> >>>>>> +++ b/xen/arch/x86/mm/p2m.c
> >>>>>> @@ -1319,6 +1319,18 @@ static int set_typed_p2m_entry(struct do
> >>>>>>              return -EPERM;
> >>>>>>          }
> >>>>>>  
> >>>>>> +        /*
> >>>>>> +         * Gross bodge, to go away again rather sooner than later:
> >>>>>> +         *
> >>>>>> +         * For MMIO allow access permissions to accumulate, but only for Dom0.
> >>>>>> +         * Since set_identity_p2m_entry() and set_mmio_p2m_entry() differ in
> >>>>>> +         * the way they specify "access", this will allow the ultimate result
> >>>>>> +         * to be independent of the sequence of operations.
> >>>>>
> >>>>> Wouldn't it be better to 'fix' those operations so that they can work
> >>>>> together?
> >>>>
> >>>> Yes, but then we should do this properly by removing all abuse of
> >>>> p2m_access_t.
> >>>
> >>> I'm not sure I follow what that abuse is.
> >>
> >> As was clarified, p2m_access_t should be solely used by e.g.
> >> introspection agents, who are then the entity responsible for
> >> resolving the resulting faults. Any other uses (to e.g. merely
> >> restrict permissions for other reasons) are really abuses.
> > 
> > But some p2m types don't really have a fixed access tied to them, for
> > example MMIO regions just inherit whatever is the default access for
> > the domain, which makes all this look slightly weird. If the access
> > should solely be used by introspection, then each type should have a
> > fixed access tied to it?
> 
> I think so, yes. Hence e.g. p2m_ram_ro and p2m_grant_map_r{w,o}.
> 
> >> That
> >> is, if e.g. a r/o direct-MMIO mapping is needed, this should not
> >> be expressed as (p2m_mmio_direct, p2m_access_r) tuple, but would
> >> require a distinct p2m_mmio_direct_ro type.
> > 
> > I guess we would then require a p2m_mmio_direct_ro,
> > p2m_mmio_direct_rwx and a p2m_mmio_direct_n at least, and ideally we
> > would need to differentiate the mandatory regions as present in ACPI
> > tables using yet different types.
> 
> What would we need p2m_mmio_direct_n for?

AMD can specify no access at all for certain regions in the ACPI
tables, from what I've read in the manual (IW = IR = 0 in IVMD Flags).
AFAICT amd_iommu_reserve_domain_unity_map can already call
iommu_identity_mapping with access p2m_access_n and that would get
propagated into set_identity_p2m_entry.

> And what's the (present,
> not future) reason for the x in p2m_mmio_direct_rwx?

Mapped ROM BARs, but I'm also unsure we shouldn't just map MMIO with
execute permissions by default unless stated otherwise.

Thanks, Roger.



* Re: [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0
  2021-09-22 14:22   ` Roger Pau Monné
@ 2021-09-24 12:18     ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-09-24 12:18 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 22.09.2021 16:22, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:17:37AM +0200, Jan Beulich wrote:
>> Certain notifications of Dom0 to Xen are independent of the mode Dom0 is
>> running in. Permit further PCI related ones (only their modern forms).
>> Also include the USB2 debug port operation at this occasion.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> I'm uncertain about the has_vpci() part of the check: I would think
>> is_hardware_domain() is both sufficient and concise. Without vPCI a PVH
>> Dom0 won't see any PCI devices in the first place (and hence would
>> effectively be non-functioning). Dropping this would in particular make
>> PHYSDEVOP_dbgp_op better fit in the mix.
>> ---
>> v3: New.
>>
>> --- a/xen/arch/x86/hvm/hypercall.c
>> +++ b/xen/arch/x86/hvm/hypercall.c
>> @@ -94,6 +94,12 @@ static long hvm_physdev_op(int cmd, XEN_
>>          break;
>>  
>>      case PHYSDEVOP_pci_mmcfg_reserved:
>> +    case PHYSDEVOP_pci_device_add:
>> +    case PHYSDEVOP_pci_device_remove:
>> +    case PHYSDEVOP_restore_msi_ext:
> 
> Hm, I'm slightly unsure we need the restore operation. Wouldn't it be
> better to just reset all device state on suspend and then let dom0
> restore its state as it does on native?

Hmm - Linux (even after my patch separating XEN_DOM0 from XEN_PV)
only issues this call when running in PV mode, so from that
perspective leaving it out would be okay. (Otherwise, i.e. if we
decide to permit its use, I guess we would better also permit
PHYSDEVOP_restore_msi. Somehow I had managed to not spot that.)
However, ...

> Maybe there's some wrinkle that prevents that from working properly.

... Xen might be using MSI for the serial console, and I'm not sure
this interrupt would get properly re-setup.

>> +    case PHYSDEVOP_dbgp_op:
>> +    case PHYSDEVOP_prepare_msix:
>> +    case PHYSDEVOP_release_msix:
> 
> Although I think those two operations won't strictly conflict with vPCI
> usage (as they require no MSIX entries to be active) I still wonder
> whether we will end up needing them on a PVH dom0. They are used by
> pciback and it's not yet clear how we will end up using pciback on a
> PVH dom0, hence I would prefer if we could leave them out until
> strictly required.

Even without a clear plan towards pciback, do you have any idea how
their function could sensibly be replaced in the PVH case? If there
is at least a rough idea, I'd be fine leaving them out here.

Jan




* Re: [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation
  2021-09-22 11:59   ` Roger Pau Monné
@ 2021-09-29 10:53     ` Jan Beulich
  0 siblings, 0 replies; 44+ messages in thread
From: Jan Beulich @ 2021-09-29 10:53 UTC (permalink / raw)
  To: Roger Pau Monné; +Cc: xen-devel, Andrew Cooper, Wei Liu

On 22.09.2021 13:59, Roger Pau Monné wrote:
> On Tue, Sep 21, 2021 at 09:16:44AM +0200, Jan Beulich wrote:
>> @@ -337,18 +337,23 @@ unsigned long __init dom0_compute_nr_pag
>>          avail -= d->max_vcpus - 1;
>>  
>>      /* Reserve memory for iommu_dom0_init() (rough estimate). */
>> -    if ( is_iommu_enabled(d) )
>> +    if ( is_iommu_enabled(d) && !iommu_hwdom_passthrough )
>>      {
>>          unsigned int s;
>>  
>>          for ( s = 9; s < BITS_PER_LONG; s += 9 )
>> -            avail -= max_pdx >> s;
>> +            iommu_pages += max_pdx >> s;
>> +
>> +        avail -= iommu_pages;
>>      }
>>  
>> -    need_paging = is_hvm_domain(d) &&
>> -        (!iommu_use_hap_pt(d) || !paging_mode_hap(d));
>> +    need_paging = is_hvm_domain(d)
>> +                  ? !iommu_use_hap_pt(d) || !paging_mode_hap(d)
>> +                  : opt_dom0_shadow;
>>      for ( ; ; need_paging = false )
>>      {
>> +        unsigned long paging_pages;
>> +
>>          nr_pages = get_memsize(&dom0_size, avail);
>>          min_pages = get_memsize(&dom0_min_size, avail);
>>          max_pages = get_memsize(&dom0_max_size, avail);
>> @@ -377,11 +382,20 @@ unsigned long __init dom0_compute_nr_pag
>>          nr_pages = min(nr_pages, max_pages);
>>          nr_pages = min(nr_pages, avail);
>>  
>> -        if ( !need_paging )
>> -            break;
>> +        paging_pages = paging_mode_enabled(d) || need_paging
>> +                       ? dom0_paging_pages(d, nr_pages) : 0;
>>  
>>          /* Reserve memory for shadow or HAP. */
>> -        avail -= dom0_paging_pages(d, nr_pages);
>> +        if ( !need_paging )
>> +        {
>> +            if ( paging_pages <= iommu_pages )
>> +                break;
>> +
>> +            avail -= paging_pages - iommu_pages;
>> +        }
>> +        else
>> +            avail -= paging_pages;
>> +        iommu_pages = paging_pages;
>>      }
> 
> I always found this loop extremely confusing to reason about. Now that
> we account for the iommu page tables using separate logic, do we
> really need a loop here?
> 
> In fact I would suggest something like:
> 
> unsigned long cpu_pages = 0;
> 
> if ( is_iommu_enabled(d) && !iommu_hwdom_passthrough )
> {
>     unsigned int s;
> 
>     for ( s = 9; s < BITS_PER_LONG; s += 9 )
>         iommu_pages += max_pdx >> s;
> }
> 
> [perform all the nr_pages adjustments]
> 
> if ( paging_mode_enabled(d) ||
>      opt_dom0_shadow /* shadow paging gets enabled later for PV dom0. */ )
>     cpu_pages = dom0_paging_pages(d, nr_pages);
> 
> if ( is_hvm_domain(d) && iommu_use_hap_pt(d) && paging_mode_hap(d) )
>     avail -= max(iommu_pages, cpu_pages);
> else
>     avail -= cpu_pages + iommu_pages;
> 
There will be a slight overestimation of cpu_pages, as the value
passed in doesn't account for the iommu pages in case they are used,
but still it's better to overestimate than to underestimate.

Overestimating cpu_pages isn't a primary concern, I agree. But we
want to get the min/max clamping reasonably close. Therefore, while
dropping the loop as you suggest I've introduced two instances of
that logic, before and after calculating cpu_pages. I'll see to
send out v4 soonish (provided the result actually works for, in
particular, the case that I needed the fix for in the first place).

Jan




end of thread, other threads:[~2021-09-29 10:53 UTC | newest]

Thread overview: 44+ messages
2021-09-21  7:15 [PATCH v3 0/9] x86/PVH: Dom0 building adjustments Jan Beulich
2021-09-21  7:16 ` [PATCH v3 1/9] x86/PVH: improve Dom0 memory size calculation Jan Beulich
2021-09-22 11:59   ` Roger Pau Monné
2021-09-29 10:53     ` Jan Beulich
2021-09-21  7:17 ` [PATCH v3 2/9] x86/PV: properly set shadow allocation for Dom0 Jan Beulich
2021-09-22 13:01   ` Roger Pau Monné
2021-09-22 13:31   ` Andrew Cooper
2021-09-22 13:50     ` Jan Beulich
2021-09-22 14:25       ` Roger Pau Monné
2021-09-22 14:28         ` Jan Beulich
2021-09-21  7:17 ` [PATCH v3 3/9] x86/PVH: permit more physdevop-s to be used by Dom0 Jan Beulich
2021-09-22 14:22   ` Roger Pau Monné
2021-09-24 12:18     ` Jan Beulich
2021-09-21  7:18 ` [PATCH v3 4/9] x86/PVH: provide VGA console info to Dom0 Jan Beulich
2021-09-22 15:01   ` Roger Pau Monné
2021-09-22 17:03     ` Andrew Cooper
2021-09-23  9:58       ` Jan Beulich
2021-09-23  9:46     ` Jan Beulich
2021-09-23 13:22       ` Roger Pau Monné
2021-09-21  7:19 ` [PATCH v3 5/9] x86/PVH: actually show Dom0's register state from debug key '0' Jan Beulich
2021-09-22 15:48   ` Roger Pau Monné
2021-09-23 10:21     ` Jan Beulich
2021-09-23 14:27       ` Roger Pau Monné
2021-09-21  7:19 ` [PATCH v3 6/9] x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable Jan Beulich
2021-09-23  8:09   ` Roger Pau Monné
2021-09-23 10:34     ` Jan Beulich
2021-09-23 14:28       ` Roger Pau Monné
2021-09-21  7:20 ` [PATCH v3 7/9] x86/PVH: actually show Dom0's stacks from debug key '0' Jan Beulich
2021-09-23 10:31   ` Roger Pau Monné
2021-09-23 10:38     ` Roger Pau Monné
2021-09-23 10:47     ` Jan Beulich
2021-09-23 14:43       ` Roger Pau Monné
2021-09-21  7:20 ` [PATCH v3 8/9] x86/HVM: skip offline vCPU-s when dumping VMCBs/VMCSes Jan Beulich
2021-09-23  8:23   ` Roger Pau Monné
2021-09-23 11:27     ` Jan Beulich
2021-09-23 14:46       ` Roger Pau Monné
2021-09-21  7:21 ` [PATCH v3 9/9] x86/P2M: relax permissions of PVH Dom0's MMIO entries Jan Beulich
2021-09-23 11:10   ` Roger Pau Monné
2021-09-23 11:32     ` Jan Beulich
2021-09-23 11:54       ` Roger Pau Monné
2021-09-23 12:15         ` Jan Beulich
2021-09-23 15:15           ` Roger Pau Monné
2021-09-23 15:22             ` Jan Beulich
2021-09-23 15:32               ` Roger Pau Monné
