All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 00/22] Xen HVM support under KVM
@ 2022-12-09  9:55 David Woodhouse
  2022-12-09  9:55 ` [RFC PATCH v2 01/22] include: import xen public headers David Woodhouse
                   ` (21 more replies)
  0 siblings, 22 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

Continuing the revival of Oracle's work at
https://github.com/jpemartins/qemu/commits/xen-shim-rfc to work against
the Xen guest support as it was finally merged into the kernel, and
updated to today's QEMU. When complete, this will allow us to run native
Xen guests on top of Linux/KVM without them noticing that it's not Xen.

Thanks for the useful feedback in response to v1. Hopefully I've taken
it all on board correctly.

The main question I have right now is about the way the target KVM
code calls directly into things like xen_overlay_map_page() which
live in hw/i386/kvm/ — and the way that function uses a singleton
object. That can't be right, but I figured I should just keep typing
and get the actual code working to show what I'm trying to do...

v2:
 • Attempt to implement migration support; every Xen enlightenment is
   now recorded either from vmstate_x86_cpu or from a new sysdev device
   created for that purpose. And — I believe — correctly restored, in
   the right order, on vmload.

 • The shared_info page is created as a proper overlay instead of abusing
   the underlying guest page. This is important because Windows doesn't
   even select a GPA which had RAM behind it beforehand. This will be
   extended to handle the grant frames too, in the fullness of time.

 • Set vCPU attributes from the correct vCPU thread to avoid deadlocks.

 • Carefully copy the entire hypercall argument structure from userspace
   instead of assuming that it's contiguous in HVA space.

 • Distinguish between "handled but intentionally returns -ENOSYS" and
   "no idea what that was" in hypercalls, allowing us to emit a
   GUEST_ERROR (actually, shouldn't that change to UNIMP?) on the
   latter. Experience shows that to we'll end up having to intentionally
   return -ENOSYS to a bunch of weird crap that ancient guests still
   attempt to use, including XenServer local hacks that nobody even
   remembers what they were (hvmop 0x101, anyone? Some old Windows
   PV driver appears to be trying to use it...).

 * Drop the '+xen' CPU property and present Xen CPUID instead of KVM
   unconditionally when running in Xen mode. Make the Xen CPUID coexist
   with Hyper-V CPUID as it should, though.

 • Add XEN_EMU and XENFV_MACHINE (the latter to be XEN_EMU||XEN) config
   options. Some more work on this, and the incestuous relationships
   between the KVM target code and the 'platform' code, is going to be
   required but it's probably better to get on with implementing the
   real code so we can see those interactions in all their glory,
   before losing too much sleep over the details here.

 • Drop the GSI-2 hack, and also the patch which made the PCI platform
   device have real RAM (which isn't needed now we have overlays, qv).

 • Drop the XenState and XenVcpuState from KVMState and CPUArchState
   respectively. The Xen-specific fields are natively included in
   CPUArchState now though, for migration purposes. And we don't
   keep a host pointer to the shared_info or vcpu_info at all any
   more. With the kernel doing everything for us, we don't actually
   need them.
 
The guest boots as far as panicking when it can't register the timer
VIRQ because we haven't implemented event channel hypercalls yet. 

The xen-platform-pci and pc_piix patches still need a little cleaning
up but I'll rework them when the dust settles on the config options and
how the target/machine components interact rather than bikeshedding them
too much early on. For now, we just need to be able to use the xenfv
machine in order to instantiate the shinfo and evtchn objects.

  qemu-system-x86_64 -serial mon:stdio -machine xenfv,xen-version=0x4000a \
         -cpu host,+xen-vapic  -display none --trace "kvm_xen*" \
         -kernel /boot/vmlinuz-5.17.8-200.fc35.x86_64 \
         -append "console=ttyS0,115200 earlyprintk=ttyS0,115200"

Ankur Arora (2):
      i386/xen: implement HVMOP_set_evtchn_upcall_vector
      i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ

David Woodhouse (3):
      xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
      i386/xen: Add xen-version machine property and init KVM Xen support
      hw/xen: Add xen_overlay device for emulating shared xenheap pages

Joao Martins (17):
      include: import xen public headers
      i386/kvm: handle Xen HVM cpuid leaves
      xen-platform-pci: allow its creation with XEN_EMULATE mode
      hw/xen_backend: refactor xen_be_init()
      pc_piix: handle XEN_EMULATE backend init
      xen_platform: exclude vfio-pci from the PCI platform unplug
      pc_piix: allow xenfv machine with XEN_EMULATE
      i386/xen: handle guest hypercalls
      i386/xen: implement HYPERCALL_xen_version
      i386/xen: implement HYPERVISOR_memory_op
      i386/xen: implement HYPERVISOR_hvm_op
      i386/xen: implement HYPERVISOR_vcpu_op
      i386/xen: handle VCPUOP_register_vcpu_info
      i386/xen: handle VCPUOP_register_vcpu_time_info
      i386/xen: handle VCPUOP_register_runstate_memory_area
      i386/xen: implement HYPERVISOR_event_channel_op
      i386/xen: implement HYPERVISOR_sched_op

 accel/Kconfig                                      |    1 +
 hw/Kconfig                                         |    1 +
 hw/i386/kvm/meson.build                            |    4 +
 hw/i386/kvm/xen_evtchn.c                           |  117 +++
 hw/i386/kvm/xen_evtchn.h                           |   13 +
 hw/i386/kvm/xen_overlay.c                          |  198 ++++
 hw/i386/kvm/xen_overlay.h                          |   14 +
 hw/i386/pc.c                                       |   32 +
 hw/i386/pc_piix.c                                  |   29 +-
 hw/i386/xen/xen_platform.c                         |   29 +-
 hw/xen/Kconfig                                     |    3 +
 hw/xen/xen-legacy-backend.c                        |   62 +-
 include/hw/i386/pc.h                               |    3 +
 include/hw/xen/xen-legacy-backend.h                |    5 +
 include/standard-headers/xen/arch-x86/cpuid.h      |  118 +++
 include/standard-headers/xen/arch-x86/xen-x86_32.h |  194 ++++
 include/standard-headers/xen/arch-x86/xen-x86_64.h |  241 +++++
 include/standard-headers/xen/arch-x86/xen.h        |  398 ++++++++
 include/standard-headers/xen/event_channel.h       |  388 ++++++++
 include/standard-headers/xen/features.h            |  143 +++
 include/standard-headers/xen/grant_table.h         |  686 +++++++++++++
 include/standard-headers/xen/hvm/hvm_op.h          |  395 ++++++++
 include/standard-headers/xen/hvm/params.h          |  318 ++++++
 include/standard-headers/xen/memory.h              |  754 ++++++++++++++
 include/standard-headers/xen/physdev.h             |  383 +++++++
 include/standard-headers/xen/sched.h               |  202 ++++
 include/standard-headers/xen/trace.h               |  341 +++++++
 include/standard-headers/xen/vcpu.h                |  248 +++++
 include/standard-headers/xen/version.h             |  113 +++
 include/standard-headers/xen/xen-compat.h          |   46 +
 include/standard-headers/xen/xen.h                 | 1049 ++++++++++++++++++++
 meson.build                                        |    1 +
 target/Kconfig                                     |    4 +
 target/i386/cpu.c                                  |    1 +
 target/i386/cpu.h                                  |    7 +
 target/i386/kvm/kvm.c                              |  147 ++-
 target/i386/machine.c                              |   27 +
 target/i386/meson.build                            |    1 +
 target/i386/trace-events                           |    6 +
 target/i386/xen.c                                  |  594 +++++++++++
 target/i386/xen.h                                  |   30 +
 41 files changed, 7312 insertions(+), 34 deletions(-)





^ permalink raw reply	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 01/22] include: import xen public headers
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12  9:17   ` Paul Durrant
  2022-12-09  9:55 ` [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation David Woodhouse
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Update to Xen public headers from 4.16.2 release]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 include/standard-headers/xen/arch-x86/cpuid.h |  118 ++
 .../xen/arch-x86/xen-x86_32.h                 |  194 +++
 .../xen/arch-x86/xen-x86_64.h                 |  241 ++++
 include/standard-headers/xen/arch-x86/xen.h   |  398 +++++++
 include/standard-headers/xen/event_channel.h  |  388 ++++++
 include/standard-headers/xen/features.h       |  143 +++
 include/standard-headers/xen/grant_table.h    |  686 +++++++++++
 include/standard-headers/xen/hvm/hvm_op.h     |  395 +++++++
 include/standard-headers/xen/hvm/params.h     |  318 +++++
 include/standard-headers/xen/memory.h         |  754 ++++++++++++
 include/standard-headers/xen/physdev.h        |  383 ++++++
 include/standard-headers/xen/sched.h          |  202 ++++
 include/standard-headers/xen/trace.h          |  341 ++++++
 include/standard-headers/xen/vcpu.h           |  248 ++++
 include/standard-headers/xen/version.h        |  113 ++
 include/standard-headers/xen/xen-compat.h     |   46 +
 include/standard-headers/xen/xen.h            | 1049 +++++++++++++++++
 17 files changed, 6017 insertions(+)
 create mode 100644 include/standard-headers/xen/arch-x86/cpuid.h
 create mode 100644 include/standard-headers/xen/arch-x86/xen-x86_32.h
 create mode 100644 include/standard-headers/xen/arch-x86/xen-x86_64.h
 create mode 100644 include/standard-headers/xen/arch-x86/xen.h
 create mode 100644 include/standard-headers/xen/event_channel.h
 create mode 100644 include/standard-headers/xen/features.h
 create mode 100644 include/standard-headers/xen/grant_table.h
 create mode 100644 include/standard-headers/xen/hvm/hvm_op.h
 create mode 100644 include/standard-headers/xen/hvm/params.h
 create mode 100644 include/standard-headers/xen/memory.h
 create mode 100644 include/standard-headers/xen/physdev.h
 create mode 100644 include/standard-headers/xen/sched.h
 create mode 100644 include/standard-headers/xen/trace.h
 create mode 100644 include/standard-headers/xen/vcpu.h
 create mode 100644 include/standard-headers/xen/version.h
 create mode 100644 include/standard-headers/xen/xen-compat.h
 create mode 100644 include/standard-headers/xen/xen.h

diff --git a/include/standard-headers/xen/arch-x86/cpuid.h b/include/standard-headers/xen/arch-x86/cpuid.h
new file mode 100644
index 0000000000..ce46305bee
--- /dev/null
+++ b/include/standard-headers/xen/arch-x86/cpuid.h
@@ -0,0 +1,118 @@
+/******************************************************************************
+ * arch-x86/cpuid.h
+ *
+ * CPUID interface to Xen.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2007 Citrix Systems, Inc.
+ *
+ * Authors:
+ *    Keir Fraser <keir@xen.org>
+ */
+
+#ifndef __XEN_PUBLIC_ARCH_X86_CPUID_H__
+#define __XEN_PUBLIC_ARCH_X86_CPUID_H__
+
+/*
+ * For compatibility with other hypervisor interfaces, the Xen cpuid leaves
+ * can be found at the first otherwise unused 0x100 aligned boundary starting
+ * from 0x40000000.
+ *
+ * e.g If viridian extensions are enabled for an HVM domain, the Xen cpuid
+ * leaves will start at 0x40000100
+ */
+
+#define XEN_CPUID_FIRST_LEAF 0x40000000
+#define XEN_CPUID_LEAF(i)    (XEN_CPUID_FIRST_LEAF + (i))
+
+/*
+ * Leaf 1 (0x40000x00)
+ * EAX: Largest Xen-information leaf. All leaves up to an including @EAX
+ *      are supported by the Xen host.
+ * EBX-EDX: "XenVMMXenVMM" signature, allowing positive identification
+ *      of a Xen host.
+ */
+#define XEN_CPUID_SIGNATURE_EBX 0x566e6558 /* "XenV" */
+#define XEN_CPUID_SIGNATURE_ECX 0x65584d4d /* "MMXe" */
+#define XEN_CPUID_SIGNATURE_EDX 0x4d4d566e /* "nVMM" */
+
+/*
+ * Leaf 2 (0x40000x01)
+ * EAX[31:16]: Xen major version.
+ * EAX[15: 0]: Xen minor version.
+ * EBX-EDX: Reserved (currently all zeroes).
+ */
+
+/*
+ * Leaf 3 (0x40000x02)
+ * EAX: Number of hypercall transfer pages. This register is always guaranteed
+ *      to specify one hypercall page.
+ * EBX: Base address of Xen-specific MSRs.
+ * ECX: Features 1. Unused bits are set to zero.
+ * EDX: Features 2. Unused bits are set to zero.
+ */
+
+/* Does the host support MMU_PT_UPDATE_PRESERVE_AD for this guest? */
+#define _XEN_CPUID_FEAT1_MMU_PT_UPDATE_PRESERVE_AD 0
+#define XEN_CPUID_FEAT1_MMU_PT_UPDATE_PRESERVE_AD  (1u<<0)
+
+/*
+ * Leaf 4 (0x40000x03)
+ * Sub-leaf 0: EAX: bit 0: emulated tsc
+ *                  bit 1: host tsc is known to be reliable
+ *                  bit 2: RDTSCP instruction available
+ *             EBX: tsc_mode: 0=default (emulate if necessary), 1=emulate,
+ *                            2=no emulation, 3=no emulation + TSC_AUX support
+ *             ECX: guest tsc frequency in kHz
+ *             EDX: guest tsc incarnation (migration count)
+ * Sub-leaf 1: EAX: tsc offset low part
+ *             EBX: tsc offset high part
+ *             ECX: multiplicator for tsc->ns conversion
+ *             EDX: shift amount for tsc->ns conversion
+ * Sub-leaf 2: EAX: host tsc frequency in kHz
+ */
+
+/*
+ * Leaf 5 (0x40000x04)
+ * HVM-specific features
+ * Sub-leaf 0: EAX: Features
+ * Sub-leaf 0: EBX: vcpu id (iff EAX has XEN_HVM_CPUID_VCPU_ID_PRESENT flag)
+ * Sub-leaf 0: ECX: domain id (iff EAX has XEN_HVM_CPUID_DOMID_PRESENT flag)
+ */
+#define XEN_HVM_CPUID_APIC_ACCESS_VIRT (1u << 0) /* Virtualized APIC registers */
+#define XEN_HVM_CPUID_X2APIC_VIRT      (1u << 1) /* Virtualized x2APIC accesses */
+/* Memory mapped from other domains has valid IOMMU entries */
+#define XEN_HVM_CPUID_IOMMU_MAPPINGS   (1u << 2)
+#define XEN_HVM_CPUID_VCPU_ID_PRESENT  (1u << 3) /* vcpu id is present in EBX */
+#define XEN_HVM_CPUID_DOMID_PRESENT    (1u << 4) /* domid is present in ECX */
+
+/*
+ * Leaf 6 (0x40000x05)
+ * PV-specific parameters
+ * Sub-leaf 0: EAX: max available sub-leaf
+ * Sub-leaf 0: EBX: bits 0-7: max machine address width
+ */
+
+/* Max. address width in bits taking memory hotplug into account. */
+#define XEN_CPUID_MACHINE_ADDRESS_WIDTH_MASK (0xffu << 0)
+
+#define XEN_CPUID_MAX_NUM_LEAVES 5
+
+#endif /* __XEN_PUBLIC_ARCH_X86_CPUID_H__ */
diff --git a/include/standard-headers/xen/arch-x86/xen-x86_32.h b/include/standard-headers/xen/arch-x86/xen-x86_32.h
new file mode 100644
index 0000000000..19d7388633
--- /dev/null
+++ b/include/standard-headers/xen/arch-x86/xen-x86_32.h
@@ -0,0 +1,194 @@
+/******************************************************************************
+ * xen-x86_32.h
+ *
+ * Guest OS interface to x86 32-bit Xen.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2007, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_ARCH_X86_XEN_X86_32_H__
+#define __XEN_PUBLIC_ARCH_X86_XEN_X86_32_H__
+
+/*
+ * Hypercall interface:
+ *  Input:  %ebx, %ecx, %edx, %esi, %edi, %ebp (arguments 1-6)
+ *  Output: %eax
+ * Access is via hypercall page (set up by guest loader or via a Xen MSR):
+ *  call hypercall_page + hypercall-number * 32
+ * Clobbered: Argument registers (e.g., 2-arg hypercall clobbers %ebx,%ecx)
+ */
+
+/*
+ * These flat segments are in the Xen-private section of every GDT. Since these
+ * are also present in the initial GDT, many OSes will be able to avoid
+ * installing their own GDT.
+ */
+#define FLAT_RING1_CS 0xe019    /* GDT index 259 */
+#define FLAT_RING1_DS 0xe021    /* GDT index 260 */
+#define FLAT_RING1_SS 0xe021    /* GDT index 260 */
+#define FLAT_RING3_CS 0xe02b    /* GDT index 261 */
+#define FLAT_RING3_DS 0xe033    /* GDT index 262 */
+#define FLAT_RING3_SS 0xe033    /* GDT index 262 */
+
+#define FLAT_KERNEL_CS FLAT_RING1_CS
+#define FLAT_KERNEL_DS FLAT_RING1_DS
+#define FLAT_KERNEL_SS FLAT_RING1_SS
+#define FLAT_USER_CS    FLAT_RING3_CS
+#define FLAT_USER_DS    FLAT_RING3_DS
+#define FLAT_USER_SS    FLAT_RING3_SS
+
+#define __HYPERVISOR_VIRT_START_PAE    0xF5800000
+#define __MACH2PHYS_VIRT_START_PAE     0xF5800000
+#define __MACH2PHYS_VIRT_END_PAE       0xF6800000
+#define HYPERVISOR_VIRT_START_PAE      xen_mk_ulong(__HYPERVISOR_VIRT_START_PAE)
+#define MACH2PHYS_VIRT_START_PAE       xen_mk_ulong(__MACH2PHYS_VIRT_START_PAE)
+#define MACH2PHYS_VIRT_END_PAE         xen_mk_ulong(__MACH2PHYS_VIRT_END_PAE)
+
+/* Non-PAE bounds are obsolete. */
+#define __HYPERVISOR_VIRT_START_NONPAE 0xFC000000
+#define __MACH2PHYS_VIRT_START_NONPAE  0xFC000000
+#define __MACH2PHYS_VIRT_END_NONPAE    0xFC400000
+#define HYPERVISOR_VIRT_START_NONPAE   \
+    xen_mk_ulong(__HYPERVISOR_VIRT_START_NONPAE)
+#define MACH2PHYS_VIRT_START_NONPAE    \
+    xen_mk_ulong(__MACH2PHYS_VIRT_START_NONPAE)
+#define MACH2PHYS_VIRT_END_NONPAE      \
+    xen_mk_ulong(__MACH2PHYS_VIRT_END_NONPAE)
+
+#define __HYPERVISOR_VIRT_START __HYPERVISOR_VIRT_START_PAE
+#define __MACH2PHYS_VIRT_START  __MACH2PHYS_VIRT_START_PAE
+#define __MACH2PHYS_VIRT_END    __MACH2PHYS_VIRT_END_PAE
+
+#ifndef HYPERVISOR_VIRT_START
+#define HYPERVISOR_VIRT_START xen_mk_ulong(__HYPERVISOR_VIRT_START)
+#endif
+
+#define MACH2PHYS_VIRT_START  xen_mk_ulong(__MACH2PHYS_VIRT_START)
+#define MACH2PHYS_VIRT_END    xen_mk_ulong(__MACH2PHYS_VIRT_END)
+#define MACH2PHYS_NR_ENTRIES  ((MACH2PHYS_VIRT_END-MACH2PHYS_VIRT_START)>>2)
+#ifndef machine_to_phys_mapping
+#define machine_to_phys_mapping ((unsigned long *)MACH2PHYS_VIRT_START)
+#endif
+
+/* 32-/64-bit invariability for control interfaces (domctl/sysctl). */
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+#undef ___DEFINE_XEN_GUEST_HANDLE
+#define ___DEFINE_XEN_GUEST_HANDLE(name, type)                  \
+    typedef struct { type *p; }                                 \
+        __guest_handle_ ## name;                                \
+    typedef struct { union { type *p; uint64_aligned_t q; }; }  \
+        __guest_handle_64_ ## name
+#undef set_xen_guest_handle_raw
+#define set_xen_guest_handle_raw(hnd, val)                  \
+    do { if ( sizeof(hnd) == 8 ) *(uint64_t *)&(hnd) = 0;   \
+         (hnd).p = val;                                     \
+    } while ( 0 )
+#define  int64_aligned_t  int64_t __attribute__((aligned(8)))
+#define uint64_aligned_t uint64_t __attribute__((aligned(8)))
+#define __XEN_GUEST_HANDLE_64(name) __guest_handle_64_ ## name
+#define XEN_GUEST_HANDLE_64(name) __XEN_GUEST_HANDLE_64(name)
+#endif
+
+#ifndef __ASSEMBLY__
+
+#if defined(XEN_GENERATING_COMPAT_HEADERS)
+/* nothing */
+#elif defined(__XEN__) || defined(__XEN_TOOLS__)
+/* Anonymous unions include all permissible names (e.g., al/ah/ax/eax). */
+#define __DECL_REG_LO8(which) union { \
+    uint32_t e ## which ## x; \
+    uint16_t which ## x; \
+    struct { \
+        uint8_t which ## l; \
+        uint8_t which ## h; \
+    }; \
+}
+#define __DECL_REG_LO16(name) union { \
+    uint32_t e ## name, _e ## name; \
+    uint16_t name; \
+}
+#else
+/* Other sources must always use the proper 32-bit name (e.g., eax). */
+#define __DECL_REG_LO8(which) uint32_t e ## which ## x
+#define __DECL_REG_LO16(name) uint32_t e ## name
+#endif
+
+struct cpu_user_regs {
+    __DECL_REG_LO8(b);
+    __DECL_REG_LO8(c);
+    __DECL_REG_LO8(d);
+    __DECL_REG_LO16(si);
+    __DECL_REG_LO16(di);
+    __DECL_REG_LO16(bp);
+    __DECL_REG_LO8(a);
+    uint16_t error_code;    /* private */
+    uint16_t entry_vector;  /* private */
+    __DECL_REG_LO16(ip);
+    uint16_t cs;
+    uint8_t  saved_upcall_mask;
+    uint8_t  _pad0;
+    __DECL_REG_LO16(flags); /* eflags.IF == !saved_upcall_mask */
+    __DECL_REG_LO16(sp);
+    uint16_t ss, _pad1;
+    uint16_t es, _pad2;
+    uint16_t ds, _pad3;
+    uint16_t fs, _pad4;
+    uint16_t gs, _pad5;
+};
+typedef struct cpu_user_regs cpu_user_regs_t;
+DEFINE_XEN_GUEST_HANDLE(cpu_user_regs_t);
+
+#undef __DECL_REG_LO8
+#undef __DECL_REG_LO16
+
+/*
+ * Page-directory addresses above 4GB do not fit into architectural %cr3.
+ * When accessing %cr3, or equivalent field in vcpu_guest_context, guests
+ * must use the following accessor macros to pack/unpack valid MFNs.
+ */
+#define xen_pfn_to_cr3(pfn) (((unsigned)(pfn) << 12) | ((unsigned)(pfn) >> 20))
+#define xen_cr3_to_pfn(cr3) (((unsigned)(cr3) >> 12) | ((unsigned)(cr3) << 20))
+
+struct arch_vcpu_info {
+    unsigned long cr2;
+    unsigned long pad[5]; /* sizeof(vcpu_info_t) == 64 */
+};
+typedef struct arch_vcpu_info arch_vcpu_info_t;
+
+struct xen_callback {
+    unsigned long cs;
+    unsigned long eip;
+};
+typedef struct xen_callback xen_callback_t;
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __XEN_PUBLIC_ARCH_X86_XEN_X86_32_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/arch-x86/xen-x86_64.h b/include/standard-headers/xen/arch-x86/xen-x86_64.h
new file mode 100644
index 0000000000..40aed14366
--- /dev/null
+++ b/include/standard-headers/xen/arch-x86/xen-x86_64.h
@@ -0,0 +1,241 @@
+/******************************************************************************
+ * xen-x86_64.h
+ *
+ * Guest OS interface to x86 64-bit Xen.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2006, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_ARCH_X86_XEN_X86_64_H__
+#define __XEN_PUBLIC_ARCH_X86_XEN_X86_64_H__
+
+/*
+ * Hypercall interface:
+ *  Input:  %rdi, %rsi, %rdx, %r10, %r8, %r9 (arguments 1-6)
+ *  Output: %rax
+ * Access is via hypercall page (set up by guest loader or via a Xen MSR):
+ *  call hypercall_page + hypercall-number * 32
+ * Clobbered: argument registers (e.g., 2-arg hypercall clobbers %rdi,%rsi)
+ */
+
+/*
+ * 64-bit segment selectors
+ * These flat segments are in the Xen-private section of every GDT. Since these
+ * are also present in the initial GDT, many OSes will be able to avoid
+ * installing their own GDT.
+ */
+
+#define FLAT_RING3_CS32 0xe023  /* GDT index 260 */
+#define FLAT_RING3_CS64 0xe033  /* GDT index 262 */
+#define FLAT_RING3_DS32 0xe02b  /* GDT index 261 */
+#define FLAT_RING3_DS64 0x0000  /* NULL selector */
+#define FLAT_RING3_SS32 0xe02b  /* GDT index 261 */
+#define FLAT_RING3_SS64 0xe02b  /* GDT index 261 */
+
+#define FLAT_KERNEL_DS64 FLAT_RING3_DS64
+#define FLAT_KERNEL_DS32 FLAT_RING3_DS32
+#define FLAT_KERNEL_DS   FLAT_KERNEL_DS64
+#define FLAT_KERNEL_CS64 FLAT_RING3_CS64
+#define FLAT_KERNEL_CS32 FLAT_RING3_CS32
+#define FLAT_KERNEL_CS   FLAT_KERNEL_CS64
+#define FLAT_KERNEL_SS64 FLAT_RING3_SS64
+#define FLAT_KERNEL_SS32 FLAT_RING3_SS32
+#define FLAT_KERNEL_SS   FLAT_KERNEL_SS64
+
+#define FLAT_USER_DS64 FLAT_RING3_DS64
+#define FLAT_USER_DS32 FLAT_RING3_DS32
+#define FLAT_USER_DS   FLAT_USER_DS64
+#define FLAT_USER_CS64 FLAT_RING3_CS64
+#define FLAT_USER_CS32 FLAT_RING3_CS32
+#define FLAT_USER_CS   FLAT_USER_CS64
+#define FLAT_USER_SS64 FLAT_RING3_SS64
+#define FLAT_USER_SS32 FLAT_RING3_SS32
+#define FLAT_USER_SS   FLAT_USER_SS64
+
+#define __HYPERVISOR_VIRT_START 0xFFFF800000000000
+#define __HYPERVISOR_VIRT_END   0xFFFF880000000000
+#define __MACH2PHYS_VIRT_START  0xFFFF800000000000
+#define __MACH2PHYS_VIRT_END    0xFFFF804000000000
+
+#ifndef HYPERVISOR_VIRT_START
+#define HYPERVISOR_VIRT_START xen_mk_ulong(__HYPERVISOR_VIRT_START)
+#define HYPERVISOR_VIRT_END   xen_mk_ulong(__HYPERVISOR_VIRT_END)
+#endif
+
+#define MACH2PHYS_VIRT_START  xen_mk_ulong(__MACH2PHYS_VIRT_START)
+#define MACH2PHYS_VIRT_END    xen_mk_ulong(__MACH2PHYS_VIRT_END)
+#define MACH2PHYS_NR_ENTRIES  ((MACH2PHYS_VIRT_END-MACH2PHYS_VIRT_START)>>3)
+#ifndef machine_to_phys_mapping
+#define machine_to_phys_mapping ((unsigned long *)HYPERVISOR_VIRT_START)
+#endif
+
+/*
+ * int HYPERVISOR_set_segment_base(unsigned int which, unsigned long base)
+ *  @which == SEGBASE_*  ;  @base == 64-bit base address
+ * Returns 0 on success.
+ */
+#define SEGBASE_FS          0
+#define SEGBASE_GS_USER     1
+#define SEGBASE_GS_KERNEL   2
+#define SEGBASE_GS_USER_SEL 3 /* Set user %gs specified in base[15:0] */
+
+/*
+ * int HYPERVISOR_iret(void)
+ * All arguments are on the kernel stack, in the following format.
+ * Never returns if successful. Current kernel context is lost.
+ * The saved CS is mapped as follows:
+ *   RING0 -> RING3 kernel mode.
+ *   RING1 -> RING3 kernel mode.
+ *   RING2 -> RING3 kernel mode.
+ *   RING3 -> RING3 user mode.
+ * However RING0 indicates that the guest kernel should return to iteself
+ * directly with
+ *      orb   $3,1*8(%rsp)
+ *      iretq
+ * If flags contains VGCF_in_syscall:
+ *   Restore RAX, RIP, RFLAGS, RSP.
+ *   Discard R11, RCX, CS, SS.
+ * Otherwise:
+ *   Restore RAX, R11, RCX, CS:RIP, RFLAGS, SS:RSP.
+ * All other registers are saved on hypercall entry and restored to user.
+ */
+/* Guest exited in SYSCALL context? Return to guest with SYSRET? */
+#define _VGCF_in_syscall 8
+#define VGCF_in_syscall  (1<<_VGCF_in_syscall)
+#define VGCF_IN_SYSCALL  VGCF_in_syscall
+
+#ifndef __ASSEMBLY__
+
+struct iret_context {
+    /* Top of stack (%rsp at point of hypercall). */
+    uint64_t rax, r11, rcx, flags, rip, cs, rflags, rsp, ss;
+    /* Bottom of iret stack frame. */
+};
+
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+/* Anonymous unions include all permissible names (e.g., al/ah/ax/eax/rax). */
+#define __DECL_REG_LOHI(which) union { \
+    uint64_t r ## which ## x; \
+    uint32_t e ## which ## x; \
+    uint16_t which ## x; \
+    struct { \
+        uint8_t which ## l; \
+        uint8_t which ## h; \
+    }; \
+}
+#define __DECL_REG_LO8(name) union { \
+    uint64_t r ## name; \
+    uint32_t e ## name; \
+    uint16_t name; \
+    uint8_t name ## l; \
+}
+#define __DECL_REG_LO16(name) union { \
+    uint64_t r ## name; \
+    uint32_t e ## name; \
+    uint16_t name; \
+}
+#define __DECL_REG_HI(num) union { \
+    uint64_t r ## num; \
+    uint32_t r ## num ## d; \
+    uint16_t r ## num ## w; \
+    uint8_t r ## num ## b; \
+}
+#elif defined(__GNUC__) && !defined(__STRICT_ANSI__)
+/* Anonymous union includes both 32- and 64-bit names (e.g., eax/rax). */
+#define __DECL_REG(name) union { \
+    uint64_t r ## name, e ## name; \
+    uint32_t _e ## name; \
+}
+#else
+/* Non-gcc sources must always use the proper 64-bit name (e.g., rax). */
+#define __DECL_REG(name) uint64_t r ## name
+#endif
+
+#ifndef __DECL_REG_LOHI
+#define __DECL_REG_LOHI(name) __DECL_REG(name ## x)
+#define __DECL_REG_LO8        __DECL_REG
+#define __DECL_REG_LO16       __DECL_REG
+#define __DECL_REG_HI(num)    uint64_t r ## num
+#endif
+
+struct cpu_user_regs {
+    __DECL_REG_HI(15);
+    __DECL_REG_HI(14);
+    __DECL_REG_HI(13);
+    __DECL_REG_HI(12);
+    __DECL_REG_LO8(bp);
+    __DECL_REG_LOHI(b);
+    __DECL_REG_HI(11);
+    __DECL_REG_HI(10);
+    __DECL_REG_HI(9);
+    __DECL_REG_HI(8);
+    __DECL_REG_LOHI(a);
+    __DECL_REG_LOHI(c);
+    __DECL_REG_LOHI(d);
+    __DECL_REG_LO8(si);
+    __DECL_REG_LO8(di);
+    uint32_t error_code;    /* private */
+    uint32_t entry_vector;  /* private */
+    __DECL_REG_LO16(ip);
+    uint16_t cs, _pad0[1];
+    uint8_t  saved_upcall_mask;
+    uint8_t  _pad1[3];
+    __DECL_REG_LO16(flags); /* rflags.IF == !saved_upcall_mask */
+    __DECL_REG_LO8(sp);
+    uint16_t ss, _pad2[3];
+    uint16_t es, _pad3[3];
+    uint16_t ds, _pad4[3];
+    uint16_t fs, _pad5[3];
+    uint16_t gs, _pad6[3];
+};
+typedef struct cpu_user_regs cpu_user_regs_t;
+DEFINE_XEN_GUEST_HANDLE(cpu_user_regs_t);
+
+#undef __DECL_REG
+#undef __DECL_REG_LOHI
+#undef __DECL_REG_LO8
+#undef __DECL_REG_LO16
+#undef __DECL_REG_HI
+
+#define xen_pfn_to_cr3(pfn) ((unsigned long)(pfn) << 12)
+#define xen_cr3_to_pfn(cr3) ((unsigned long)(cr3) >> 12)
+
+struct arch_vcpu_info {
+    unsigned long cr2;
+    unsigned long pad; /* sizeof(vcpu_info_t) == 64 */
+};
+typedef struct arch_vcpu_info arch_vcpu_info_t;
+
+typedef unsigned long xen_callback_t;
+
+#endif /* !__ASSEMBLY__ */
+
+#endif /* __XEN_PUBLIC_ARCH_X86_XEN_X86_64_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/arch-x86/xen.h b/include/standard-headers/xen/arch-x86/xen.h
new file mode 100644
index 0000000000..7acd94c8eb
--- /dev/null
+++ b/include/standard-headers/xen/arch-x86/xen.h
@@ -0,0 +1,398 @@
+/******************************************************************************
+ * arch-x86/xen.h
+ *
+ * Guest OS interface to x86 Xen.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004-2006, K A Fraser
+ */
+
+#include "../xen.h"
+
+#ifndef __XEN_PUBLIC_ARCH_X86_XEN_H__
+#define __XEN_PUBLIC_ARCH_X86_XEN_H__
+
+/* Structural guest handles introduced in 0x00030201. */
+#if __XEN_INTERFACE_VERSION__ >= 0x00030201
+#define ___DEFINE_XEN_GUEST_HANDLE(name, type) \
+    typedef struct { type *p; } __guest_handle_ ## name
+#else
+#define ___DEFINE_XEN_GUEST_HANDLE(name, type) \
+    typedef type * __guest_handle_ ## name
+#endif
+
+/*
+ * XEN_GUEST_HANDLE represents a guest pointer, when passed as a field
+ * in a struct in memory.
+ * XEN_GUEST_HANDLE_PARAM represent a guest pointer, when passed as an
+ * hypercall argument.
+ * XEN_GUEST_HANDLE_PARAM and XEN_GUEST_HANDLE are the same on X86 but
+ * they might not be on other architectures.
+ */
+#define __DEFINE_XEN_GUEST_HANDLE(name, type) \
+    ___DEFINE_XEN_GUEST_HANDLE(name, type);   \
+    ___DEFINE_XEN_GUEST_HANDLE(const_##name, const type)
+#define DEFINE_XEN_GUEST_HANDLE(name)   __DEFINE_XEN_GUEST_HANDLE(name, name)
+#define __XEN_GUEST_HANDLE(name)        __guest_handle_ ## name
+#define XEN_GUEST_HANDLE(name)          __XEN_GUEST_HANDLE(name)
+#define XEN_GUEST_HANDLE_PARAM(name)    XEN_GUEST_HANDLE(name)
+#define set_xen_guest_handle_raw(hnd, val)  do { (hnd).p = val; } while (0)
+#define set_xen_guest_handle(hnd, val) set_xen_guest_handle_raw(hnd, val)
+
+#if defined(__i386__)
+# ifdef __XEN__
+__DeFiNe__ __DECL_REG_LO8(which) uint32_t e ## which ## x
+__DeFiNe__ __DECL_REG_LO16(name) union { uint32_t e ## name; }
+# endif
+#include "xen-x86_32.h"
+# ifdef __XEN__
+__UnDeF__ __DECL_REG_LO8
+__UnDeF__ __DECL_REG_LO16
+__DeFiNe__ __DECL_REG_LO8(which) e ## which ## x
+__DeFiNe__ __DECL_REG_LO16(name) e ## name
+# endif
+#elif defined(__x86_64__)
+#include "xen-x86_64.h"
+#endif
+
+#ifndef __ASSEMBLY__
+typedef unsigned long xen_pfn_t;
+#define PRI_xen_pfn "lx"
+#define PRIu_xen_pfn "lu"
+#endif
+
+#define XEN_HAVE_PV_GUEST_ENTRY 1
+
+#define XEN_HAVE_PV_UPCALL_MASK 1
+
+/*
+ * `incontents 200 segdesc Segment Descriptor Tables
+ */
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_set_gdt(const xen_pfn_t frames[], unsigned int entries);
+ * `
+ */
+/*
+ * A number of GDT entries are reserved by Xen. These are not situated at the
+ * start of the GDT because some stupid OSes export hard-coded selector values
+ * in their ABI. These hard-coded values are always near the start of the GDT,
+ * so Xen places itself out of the way, at the far end of the GDT.
+ *
+ * NB The LDT is set using the MMUEXT_SET_LDT op of HYPERVISOR_mmuext_op
+ */
+#define FIRST_RESERVED_GDT_PAGE  14
+#define FIRST_RESERVED_GDT_BYTE  (FIRST_RESERVED_GDT_PAGE * 4096)
+#define FIRST_RESERVED_GDT_ENTRY (FIRST_RESERVED_GDT_BYTE / 8)
+
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_update_descriptor(u64 pa, u64 desc);
+ * `
+ * ` @pa   The machine physical address of the descriptor to
+ * `       update. Must be either a descriptor page or writable.
+ * ` @desc The descriptor value to update, in the same format as a
+ * `       native descriptor table entry.
+ */
+
+/* Maximum number of virtual CPUs in legacy multi-processor guests. */
+#define XEN_LEGACY_MAX_VCPUS 32
+
+#ifndef __ASSEMBLY__
+
+typedef unsigned long xen_ulong_t;
+#define PRI_xen_ulong "lx"
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_stack_switch(unsigned long ss, unsigned long esp);
+ * `
+ * Sets the stack segment and pointer for the current vcpu.
+ */
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_set_trap_table(const struct trap_info traps[]);
+ * `
+ */
+/*
+ * Send an array of these to HYPERVISOR_set_trap_table().
+ * Terminate the array with a sentinel entry, with traps[].address==0.
+ * The privilege level specifies which modes may enter a trap via a software
+ * interrupt. On x86/64, since rings 1 and 2 are unavailable, we allocate
+ * privilege levels as follows:
+ *  Level == 0: Noone may enter
+ *  Level == 1: Kernel may enter
+ *  Level == 2: Kernel may enter
+ *  Level == 3: Everyone may enter
+ *
+ * Note: For compatibility with kernels not setting up exception handlers
+ *       early enough, Xen will avoid trying to inject #GP (and hence crash
+ *       the domain) when an RDMSR would require this, but no handler was
+ *       set yet. The precise conditions are implementation specific, and
+ *       new code may not rely on such behavior anyway.
+ */
+#define TI_GET_DPL(_ti)      ((_ti)->flags & 3)
+#define TI_GET_IF(_ti)       ((_ti)->flags & 4)
+#define TI_SET_DPL(_ti,_dpl) ((_ti)->flags |= (_dpl))
+#define TI_SET_IF(_ti,_if)   ((_ti)->flags |= ((!!(_if))<<2))
+struct trap_info {
+    uint8_t       vector;  /* exception vector                              */
+    uint8_t       flags;   /* 0-3: privilege level; 4: clear event enable?  */
+    uint16_t      cs;      /* code selector                                 */
+    unsigned long address; /* code offset                                   */
+};
+typedef struct trap_info trap_info_t;
+DEFINE_XEN_GUEST_HANDLE(trap_info_t);
+
+typedef uint64_t tsc_timestamp_t; /* RDTSC timestamp */
+
+/*
+ * The following is all CPU context. Note that the fpu_ctxt block is filled
+ * in by FXSAVE if the CPU has feature FXSR; otherwise FSAVE is used.
+ *
+ * Also note that when calling DOMCTL_setvcpucontext for HVM guests, not all
+ * information in this structure is updated, the fields read include: fpu_ctxt
+ * (if VGCT_I387_VALID is set), flags, user_regs and debugreg[*].
+ *
+ * Note: VCPUOP_initialise for HVM guests is non-symetric with
+ * DOMCTL_setvcpucontext, and uses struct vcpu_hvm_context from hvm/hvm_vcpu.h
+ */
+struct vcpu_guest_context {
+    /* FPU registers come first so they can be aligned for FXSAVE/FXRSTOR. */
+    struct { char x[512]; } fpu_ctxt;       /* User-level FPU registers     */
+#define VGCF_I387_VALID                (1<<0)
+#define VGCF_IN_KERNEL                 (1<<2)
+#define _VGCF_i387_valid               0
+#define VGCF_i387_valid                (1<<_VGCF_i387_valid)
+#define _VGCF_in_kernel                2
+#define VGCF_in_kernel                 (1<<_VGCF_in_kernel)
+#define _VGCF_failsafe_disables_events 3
+#define VGCF_failsafe_disables_events  (1<<_VGCF_failsafe_disables_events)
+#define _VGCF_syscall_disables_events  4
+#define VGCF_syscall_disables_events   (1<<_VGCF_syscall_disables_events)
+#define _VGCF_online                   5
+#define VGCF_online                    (1<<_VGCF_online)
+    unsigned long flags;                    /* VGCF_* flags                 */
+    struct cpu_user_regs user_regs;         /* User-level CPU registers     */
+    struct trap_info trap_ctxt[256];        /* Virtual IDT                  */
+    unsigned long ldt_base, ldt_ents;       /* LDT (linear address, # ents) */
+    unsigned long gdt_frames[16], gdt_ents; /* GDT (machine frames, # ents) */
+    unsigned long kernel_ss, kernel_sp;     /* Virtual TSS (only SS1/SP1)   */
+    /* NB. User pagetable on x86/64 is placed in ctrlreg[1]. */
+    unsigned long ctrlreg[8];               /* CR0-CR7 (control registers)  */
+    unsigned long debugreg[8];              /* DB0-DB7 (debug registers)    */
+#ifdef __i386__
+    unsigned long event_callback_cs;        /* CS:EIP of event callback     */
+    unsigned long event_callback_eip;
+    unsigned long failsafe_callback_cs;     /* CS:EIP of failsafe callback  */
+    unsigned long failsafe_callback_eip;
+#else
+    unsigned long event_callback_eip;
+    unsigned long failsafe_callback_eip;
+#ifdef __XEN__
+    union {
+        unsigned long syscall_callback_eip;
+        struct {
+            unsigned int event_callback_cs;    /* compat CS of event cb     */
+            unsigned int failsafe_callback_cs; /* compat CS of failsafe cb  */
+        };
+    };
+#else
+    unsigned long syscall_callback_eip;
+#endif
+#endif
+    unsigned long vm_assist;                /* VMASST_TYPE_* bitmap */
+#ifdef __x86_64__
+    /* Segment base addresses. */
+    uint64_t      fs_base;
+    uint64_t      gs_base_kernel;
+    uint64_t      gs_base_user;
+#endif
+};
+typedef struct vcpu_guest_context vcpu_guest_context_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_guest_context_t);
+
+struct arch_shared_info {
+    /*
+     * Number of valid entries in the p2m table(s) anchored at
+     * pfn_to_mfn_frame_list_list and/or p2m_vaddr.
+     */
+    unsigned long max_pfn;
+    /*
+     * Frame containing list of mfns containing list of mfns containing p2m.
+     * A value of 0 indicates it has not yet been set up, ~0 indicates it has
+     * been set to invalid e.g. due to the p2m being too large for the 3-level
+     * p2m tree. In this case the linear mapper p2m list anchored at p2m_vaddr
+     * is to be used.
+     */
+    xen_pfn_t     pfn_to_mfn_frame_list_list;
+    unsigned long nmi_reason;
+    /*
+     * Following three fields are valid if p2m_cr3 contains a value different
+     * from 0.
+     * p2m_cr3 is the root of the address space where p2m_vaddr is valid.
+     * p2m_cr3 is in the same format as a cr3 value in the vcpu register state
+     * and holds the folded machine frame number (via xen_pfn_to_cr3) of a
+     * L3 or L4 page table.
+     * p2m_vaddr holds the virtual address of the linear p2m list. All entries
+     * in the range [0...max_pfn[ are accessible via this pointer.
+     * p2m_generation will be incremented by the guest before and after each
+     * change of the mappings of the p2m list. p2m_generation starts at 0 and
+     * a value with the least significant bit set indicates that a mapping
+     * update is in progress. This allows guest external software (e.g. in Dom0)
+     * to verify that read mappings are consistent and whether they have changed
+     * since the last check.
+     * Modifying a p2m element in the linear p2m list is allowed via an atomic
+     * write only.
+     */
+    unsigned long p2m_cr3;         /* cr3 value of the p2m address space */
+    unsigned long p2m_vaddr;       /* virtual address of the p2m list */
+    unsigned long p2m_generation;  /* generation count of p2m mapping */
+#ifdef __i386__
+    /* There's no room for this field in the generic structure. */
+    uint32_t wc_sec_hi;
+#endif
+};
+typedef struct arch_shared_info arch_shared_info_t;
+
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+/*
+ * struct xen_arch_domainconfig's ABI is covered by
+ * XEN_DOMCTL_INTERFACE_VERSION.
+ */
+struct xen_arch_domainconfig {
+#define _XEN_X86_EMU_LAPIC          0
+#define XEN_X86_EMU_LAPIC           (1U<<_XEN_X86_EMU_LAPIC)
+#define _XEN_X86_EMU_HPET           1
+#define XEN_X86_EMU_HPET            (1U<<_XEN_X86_EMU_HPET)
+#define _XEN_X86_EMU_PM             2
+#define XEN_X86_EMU_PM              (1U<<_XEN_X86_EMU_PM)
+#define _XEN_X86_EMU_RTC            3
+#define XEN_X86_EMU_RTC             (1U<<_XEN_X86_EMU_RTC)
+#define _XEN_X86_EMU_IOAPIC         4
+#define XEN_X86_EMU_IOAPIC          (1U<<_XEN_X86_EMU_IOAPIC)
+#define _XEN_X86_EMU_PIC            5
+#define XEN_X86_EMU_PIC             (1U<<_XEN_X86_EMU_PIC)
+#define _XEN_X86_EMU_VGA            6
+#define XEN_X86_EMU_VGA             (1U<<_XEN_X86_EMU_VGA)
+#define _XEN_X86_EMU_IOMMU          7
+#define XEN_X86_EMU_IOMMU           (1U<<_XEN_X86_EMU_IOMMU)
+#define _XEN_X86_EMU_PIT            8
+#define XEN_X86_EMU_PIT             (1U<<_XEN_X86_EMU_PIT)
+#define _XEN_X86_EMU_USE_PIRQ       9
+#define XEN_X86_EMU_USE_PIRQ        (1U<<_XEN_X86_EMU_USE_PIRQ)
+#define _XEN_X86_EMU_VPCI           10
+#define XEN_X86_EMU_VPCI            (1U<<_XEN_X86_EMU_VPCI)
+
+#define XEN_X86_EMU_ALL             (XEN_X86_EMU_LAPIC | XEN_X86_EMU_HPET |  \
+                                     XEN_X86_EMU_PM | XEN_X86_EMU_RTC |      \
+                                     XEN_X86_EMU_IOAPIC | XEN_X86_EMU_PIC |  \
+                                     XEN_X86_EMU_VGA | XEN_X86_EMU_IOMMU |   \
+                                     XEN_X86_EMU_PIT | XEN_X86_EMU_USE_PIRQ |\
+                                     XEN_X86_EMU_VPCI)
+    uint32_t emulation_flags;
+
+/*
+ * Select whether to use a relaxed behavior for accesses to MSRs not explicitly
+ * handled by Xen instead of injecting a #GP to the guest. Note this option
+ * doesn't allow the guest to read or write to the underlying MSR.
+ */
+#define XEN_X86_MSR_RELAXED (1u << 0)
+    uint32_t misc_flags;
+};
+
+/* Location of online VCPU bitmap. */
+#define XEN_ACPI_CPU_MAP             0xaf00
+#define XEN_ACPI_CPU_MAP_LEN         ((HVM_MAX_VCPUS + 7) / 8)
+
+/* GPE0 bit set during CPU hotplug */
+#define XEN_ACPI_GPE0_CPUHP_BIT      2
+#endif
+
+/*
+ * Representations of architectural CPUID and MSR information.  Used as the
+ * serialised version of Xen's internal representation.
+ */
+typedef struct xen_cpuid_leaf {
+#define XEN_CPUID_NO_SUBLEAF 0xffffffffu
+    uint32_t leaf, subleaf;
+    uint32_t a, b, c, d;
+} xen_cpuid_leaf_t;
+DEFINE_XEN_GUEST_HANDLE(xen_cpuid_leaf_t);
+
+typedef struct xen_msr_entry {
+    uint32_t idx;
+    uint32_t flags; /* Reserved MBZ. */
+    uint64_t val;
+} xen_msr_entry_t;
+DEFINE_XEN_GUEST_HANDLE(xen_msr_entry_t);
+
+#endif /* !__ASSEMBLY__ */
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_fpu_taskswitch(int set);
+ * `
+ * Sets (if set!=0) or clears (if set==0) CR0.TS.
+ */
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_set_debugreg(int regno, unsigned long value);
+ *
+ * ` unsigned long
+ * ` HYPERVISOR_get_debugreg(int regno);
+ * For 0<=reg<=7, returns the debug register value.
+ * For other values of reg, returns ((unsigned long)-EINVAL).
+ * (Unfortunately, this interface is defective.)
+ */
+
+/*
+ * Prefix forces emulation of some non-trapping instructions.
+ * Currently only CPUID.
+ */
+#ifdef __ASSEMBLY__
+#define XEN_EMULATE_PREFIX .byte 0x0f,0x0b,0x78,0x65,0x6e ;
+#define XEN_CPUID          XEN_EMULATE_PREFIX cpuid
+#else
+#define XEN_EMULATE_PREFIX ".byte 0x0f,0x0b,0x78,0x65,0x6e ; "
+#define XEN_CPUID          XEN_EMULATE_PREFIX "cpuid"
+#endif
+
+/*
+ * Debug console IO port, also called "port E9 hack". Each character written
+ * to this IO port will be printed on the hypervisor console, subject to log
+ * level restrictions.
+ */
+#define XEN_HVM_DEBUGCONS_IOPORT 0xe9
+
+#endif /* __XEN_PUBLIC_ARCH_X86_XEN_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/event_channel.h b/include/standard-headers/xen/event_channel.h
new file mode 100644
index 0000000000..73c9f38ce1
--- /dev/null
+++ b/include/standard-headers/xen/event_channel.h
@@ -0,0 +1,388 @@
+/******************************************************************************
+ * event_channel.h
+ *
+ * Event channels between domains.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2003-2004, K A Fraser.
+ */
+
+#ifndef __XEN_PUBLIC_EVENT_CHANNEL_H__
+#define __XEN_PUBLIC_EVENT_CHANNEL_H__
+
+#include "xen.h"
+
+/*
+ * `incontents 150 evtchn Event Channels
+ *
+ * Event channels are the basic primitive provided by Xen for event
+ * notifications. An event is the Xen equivalent of a hardware
+ * interrupt. They essentially store one bit of information, the event
+ * of interest is signalled by transitioning this bit from 0 to 1.
+ *
+ * Notifications are received by a guest via an upcall from Xen,
+ * indicating when an event arrives (setting the bit). Further
+ * notifications are masked until the bit is cleared again (therefore,
+ * guests must check the value of the bit after re-enabling event
+ * delivery to ensure no missed notifications).
+ *
+ * Event notifications can be masked by setting a flag; this is
+ * equivalent to disabling interrupts and can be used to ensure
+ * atomicity of certain operations in the guest kernel.
+ *
+ * Event channels are represented by the evtchn_* fields in
+ * struct shared_info and struct vcpu_info.
+ */
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_event_channel_op(enum event_channel_op cmd, void *args)
+ * `
+ * @cmd  == EVTCHNOP_* (event-channel operation).
+ * @args == struct evtchn_* Operation-specific extra arguments (NULL if none).
+ */
+
+/* ` enum event_channel_op { // EVTCHNOP_* => struct evtchn_* */
+#define EVTCHNOP_bind_interdomain 0
+#define EVTCHNOP_bind_virq        1
+#define EVTCHNOP_bind_pirq        2
+#define EVTCHNOP_close            3
+#define EVTCHNOP_send             4
+#define EVTCHNOP_status           5
+#define EVTCHNOP_alloc_unbound    6
+#define EVTCHNOP_bind_ipi         7
+#define EVTCHNOP_bind_vcpu        8
+#define EVTCHNOP_unmask           9
+#define EVTCHNOP_reset           10
+#define EVTCHNOP_init_control    11
+#define EVTCHNOP_expand_array    12
+#define EVTCHNOP_set_priority    13
+#ifdef __XEN__
+#define EVTCHNOP_reset_cont      14
+#endif
+/* ` } */
+
+typedef uint32_t evtchn_port_t;
+DEFINE_XEN_GUEST_HANDLE(evtchn_port_t);
+
+/*
+ * EVTCHNOP_alloc_unbound: Allocate a port in domain <dom> and mark as
+ * accepting interdomain bindings from domain <remote_dom>. A fresh port
+ * is allocated in <dom> and returned as <port>.
+ * NOTES:
+ *  1. If the caller is unprivileged then <dom> must be DOMID_SELF.
+ *  2. <remote_dom> may be DOMID_SELF, allowing loopback connections.
+ */
+struct evtchn_alloc_unbound {
+    /* IN parameters */
+    domid_t dom, remote_dom;
+    /* OUT parameters */
+    evtchn_port_t port;
+};
+typedef struct evtchn_alloc_unbound evtchn_alloc_unbound_t;
+
+/*
+ * EVTCHNOP_bind_interdomain: Construct an interdomain event channel between
+ * the calling domain and <remote_dom>. <remote_dom,remote_port> must identify
+ * a port that is unbound and marked as accepting bindings from the calling
+ * domain. A fresh port is allocated in the calling domain and returned as
+ * <local_port>.
+ *
+ * In case the peer domain has already tried to set our event channel
+ * pending, before it was bound, EVTCHNOP_bind_interdomain always sets
+ * the local event channel pending.
+ *
+ * The usual pattern of use, in the guest's upcall (or subsequent
+ * handler) is as follows: (Re-enable the event channel for subsequent
+ * signalling and then) check for the existence of whatever condition
+ * is being waited for by other means, and take whatever action is
+ * needed (if any).
+ *
+ * NOTES:
+ *  1. <remote_dom> may be DOMID_SELF, allowing loopback connections.
+ */
+struct evtchn_bind_interdomain {
+    /* IN parameters. */
+    domid_t remote_dom;
+    evtchn_port_t remote_port;
+    /* OUT parameters. */
+    evtchn_port_t local_port;
+};
+typedef struct evtchn_bind_interdomain evtchn_bind_interdomain_t;
+
+/*
+ * EVTCHNOP_bind_virq: Bind a local event channel to VIRQ <irq> on specified
+ * vcpu.
+ * NOTES:
+ *  1. Virtual IRQs are classified as per-vcpu or global. See the VIRQ list
+ *     in xen.h for the classification of each VIRQ.
+ *  2. Global VIRQs must be allocated on VCPU0 but can subsequently be
+ *     re-bound via EVTCHNOP_bind_vcpu.
+ *  3. Per-vcpu VIRQs may be bound to at most one event channel per vcpu.
+ *     The allocated event channel is bound to the specified vcpu and the
+ *     binding cannot be changed.
+ */
+struct evtchn_bind_virq {
+    /* IN parameters. */
+    uint32_t virq; /* enum virq */
+    uint32_t vcpu;
+    /* OUT parameters. */
+    evtchn_port_t port;
+};
+typedef struct evtchn_bind_virq evtchn_bind_virq_t;
+
+/*
+ * EVTCHNOP_bind_pirq: Bind a local event channel to a real IRQ (PIRQ <irq>).
+ * NOTES:
+ *  1. A physical IRQ may be bound to at most one event channel per domain.
+ *  2. Only a sufficiently-privileged domain may bind to a physical IRQ.
+ */
+struct evtchn_bind_pirq {
+    /* IN parameters. */
+    uint32_t pirq;
+#define BIND_PIRQ__WILL_SHARE 1
+    uint32_t flags; /* BIND_PIRQ__* */
+    /* OUT parameters. */
+    evtchn_port_t port;
+};
+typedef struct evtchn_bind_pirq evtchn_bind_pirq_t;
+
+/*
+ * EVTCHNOP_bind_ipi: Bind a local event channel to receive events.
+ * NOTES:
+ *  1. The allocated event channel is bound to the specified vcpu. The binding
+ *     may not be changed.
+ */
+struct evtchn_bind_ipi {
+    uint32_t vcpu;
+    /* OUT parameters. */
+    evtchn_port_t port;
+};
+typedef struct evtchn_bind_ipi evtchn_bind_ipi_t;
+
+/*
+ * EVTCHNOP_close: Close a local event channel <port>. If the channel is
+ * interdomain then the remote end is placed in the unbound state
+ * (EVTCHNSTAT_unbound), awaiting a new connection.
+ */
+struct evtchn_close {
+    /* IN parameters. */
+    evtchn_port_t port;
+};
+typedef struct evtchn_close evtchn_close_t;
+
+/*
+ * EVTCHNOP_send: Send an event to the remote end of the channel whose local
+ * endpoint is <port>.
+ */
+struct evtchn_send {
+    /* IN parameters. */
+    evtchn_port_t port;
+};
+typedef struct evtchn_send evtchn_send_t;
+
+/*
+ * EVTCHNOP_status: Get the current status of the communication channel which
+ * has an endpoint at <dom, port>.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may obtain the status of an event
+ *     channel for which <dom> is not DOMID_SELF.
+ */
+struct evtchn_status {
+    /* IN parameters */
+    domid_t  dom;
+    evtchn_port_t port;
+    /* OUT parameters */
+#define EVTCHNSTAT_closed       0  /* Channel is not in use.                 */
+#define EVTCHNSTAT_unbound      1  /* Channel is waiting interdom connection.*/
+#define EVTCHNSTAT_interdomain  2  /* Channel is connected to remote domain. */
+#define EVTCHNSTAT_pirq         3  /* Channel is bound to a phys IRQ line.   */
+#define EVTCHNSTAT_virq         4  /* Channel is bound to a virtual IRQ line */
+#define EVTCHNSTAT_ipi          5  /* Channel is bound to a virtual IPI line */
+    uint32_t status;
+    uint32_t vcpu;                 /* VCPU to which this channel is bound.   */
+    union {
+        struct {
+            domid_t dom;
+        } unbound;                 /* EVTCHNSTAT_unbound */
+        struct {
+            domid_t dom;
+            evtchn_port_t port;
+        } interdomain;             /* EVTCHNSTAT_interdomain */
+        uint32_t pirq;             /* EVTCHNSTAT_pirq        */
+        uint32_t virq;             /* EVTCHNSTAT_virq        */
+    } u;
+};
+typedef struct evtchn_status evtchn_status_t;
+
+/*
+ * EVTCHNOP_bind_vcpu: Specify which vcpu a channel should notify when an
+ * event is pending.
+ * NOTES:
+ *  1. IPI-bound channels always notify the vcpu specified at bind time.
+ *     This binding cannot be changed.
+ *  2. Per-VCPU VIRQ channels always notify the vcpu specified at bind time.
+ *     This binding cannot be changed.
+ *  3. All other channels notify vcpu0 by default. This default is set when
+ *     the channel is allocated (a port that is freed and subsequently reused
+ *     has its binding reset to vcpu0).
+ */
+struct evtchn_bind_vcpu {
+    /* IN parameters. */
+    evtchn_port_t port;
+    uint32_t vcpu;
+};
+typedef struct evtchn_bind_vcpu evtchn_bind_vcpu_t;
+
+/*
+ * EVTCHNOP_unmask: Unmask the specified local event-channel port and deliver
+ * a notification to the appropriate VCPU if an event is pending.
+ */
+struct evtchn_unmask {
+    /* IN parameters. */
+    evtchn_port_t port;
+};
+typedef struct evtchn_unmask evtchn_unmask_t;
+
+/*
+ * EVTCHNOP_reset: Close all event channels associated with specified domain.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may specify other than DOMID_SELF.
+ *  3. Destroys all control blocks and event array, resets event channel
+ *     operations to 2-level ABI if called with <dom> == DOMID_SELF and FIFO
+ *     ABI was used. Guests should not bind events during EVTCHNOP_reset call
+ *     as these events are likely to be lost.
+ */
+struct evtchn_reset {
+    /* IN parameters. */
+    domid_t dom;
+};
+typedef struct evtchn_reset evtchn_reset_t;
+
+/*
+ * EVTCHNOP_init_control: initialize the control block for the FIFO ABI.
+ *
+ * Note: any events that are currently pending will not be resent and
+ * will be lost.  Guests should call this before binding any event to
+ * avoid losing any events.
+ */
+struct evtchn_init_control {
+    /* IN parameters. */
+    uint64_t control_gfn;
+    uint32_t offset;
+    uint32_t vcpu;
+    /* OUT parameters. */
+    uint8_t link_bits;
+    uint8_t _pad[7];
+};
+typedef struct evtchn_init_control evtchn_init_control_t;
+
+/*
+ * EVTCHNOP_expand_array: add an additional page to the event array.
+ */
+struct evtchn_expand_array {
+    /* IN parameters. */
+    uint64_t array_gfn;
+};
+typedef struct evtchn_expand_array evtchn_expand_array_t;
+
+/*
+ * EVTCHNOP_set_priority: set the priority for an event channel.
+ */
+struct evtchn_set_priority {
+    /* IN parameters. */
+    evtchn_port_t port;
+    uint32_t priority;
+};
+typedef struct evtchn_set_priority evtchn_set_priority_t;
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_event_channel_op_compat(struct evtchn_op *op)
+ * `
+ * Superceded by new event_channel_op() hypercall since 0x00030202.
+ */
+struct evtchn_op {
+    uint32_t cmd; /* enum event_channel_op */
+    union {
+        evtchn_alloc_unbound_t    alloc_unbound;
+        evtchn_bind_interdomain_t bind_interdomain;
+        evtchn_bind_virq_t        bind_virq;
+        evtchn_bind_pirq_t        bind_pirq;
+        evtchn_bind_ipi_t         bind_ipi;
+        evtchn_close_t            close;
+        evtchn_send_t             send;
+        evtchn_status_t           status;
+        evtchn_bind_vcpu_t        bind_vcpu;
+        evtchn_unmask_t           unmask;
+    } u;
+};
+typedef struct evtchn_op evtchn_op_t;
+DEFINE_XEN_GUEST_HANDLE(evtchn_op_t);
+
+/*
+ * 2-level ABI
+ */
+
+#define EVTCHN_2L_NR_CHANNELS (sizeof(xen_ulong_t) * sizeof(xen_ulong_t) * 64)
+
+/*
+ * FIFO ABI
+ */
+
+/* Events may have priorities from 0 (highest) to 15 (lowest). */
+#define EVTCHN_FIFO_PRIORITY_MAX     0
+#define EVTCHN_FIFO_PRIORITY_DEFAULT 7
+#define EVTCHN_FIFO_PRIORITY_MIN     15
+
+#define EVTCHN_FIFO_MAX_QUEUES (EVTCHN_FIFO_PRIORITY_MIN + 1)
+
+typedef uint32_t event_word_t;
+
+#define EVTCHN_FIFO_PENDING 31
+#define EVTCHN_FIFO_MASKED  30
+#define EVTCHN_FIFO_LINKED  29
+#define EVTCHN_FIFO_BUSY    28
+
+#define EVTCHN_FIFO_LINK_BITS 17
+#define EVTCHN_FIFO_LINK_MASK ((1 << EVTCHN_FIFO_LINK_BITS) - 1)
+
+#define EVTCHN_FIFO_NR_CHANNELS (1 << EVTCHN_FIFO_LINK_BITS)
+
+struct evtchn_fifo_control_block {
+    uint32_t ready;
+    uint32_t _rsvd;
+    uint32_t head[EVTCHN_FIFO_MAX_QUEUES];
+};
+typedef struct evtchn_fifo_control_block evtchn_fifo_control_block_t;
+
+#endif /* __XEN_PUBLIC_EVENT_CHANNEL_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/features.h b/include/standard-headers/xen/features.h
new file mode 100644
index 0000000000..9ee2f760ef
--- /dev/null
+++ b/include/standard-headers/xen/features.h
@@ -0,0 +1,143 @@
+/******************************************************************************
+ * features.h
+ *
+ * Feature flags, reported by XENVER_get_features.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2006, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_FEATURES_H__
+#define __XEN_PUBLIC_FEATURES_H__
+
+/*
+ * `incontents 200 elfnotes_features XEN_ELFNOTE_FEATURES
+ *
+ * The list of all the features the guest supports. They are set by
+ * parsing the XEN_ELFNOTE_FEATURES and XEN_ELFNOTE_SUPPORTED_FEATURES
+ * string. The format is the  feature names (as given here without the
+ * "XENFEAT_" prefix) separated by '|' characters.
+ * If a feature is required for the kernel to function then the feature name
+ * must be preceded by a '!' character.
+ *
+ * Note that if XEN_ELFNOTE_SUPPORTED_FEATURES is used, then in the
+ * XENFEAT_dom0 MUST be set if the guest is to be booted as dom0,
+ */
+
+/*
+ * If set, the guest does not need to write-protect its pagetables, and can
+ * update them via direct writes.
+ */
+#define XENFEAT_writable_page_tables       0
+
+/*
+ * If set, the guest does not need to write-protect its segment descriptor
+ * tables, and can update them via direct writes.
+ */
+#define XENFEAT_writable_descriptor_tables 1
+
+/*
+ * If set, translation between the guest's 'pseudo-physical' address space
+ * and the host's machine address space are handled by the hypervisor. In this
+ * mode the guest does not need to perform phys-to/from-machine translations
+ * when performing page table operations.
+ */
+#define XENFEAT_auto_translated_physmap    2
+
+/* If set, the guest is running in supervisor mode (e.g., x86 ring 0). */
+#define XENFEAT_supervisor_mode_kernel     3
+
+/*
+ * If set, the guest does not need to allocate x86 PAE page directories
+ * below 4GB. This flag is usually implied by auto_translated_physmap.
+ */
+#define XENFEAT_pae_pgdir_above_4gb        4
+
+/* x86: Does this Xen host support the MMU_PT_UPDATE_PRESERVE_AD hypercall? */
+#define XENFEAT_mmu_pt_update_preserve_ad  5
+
+/* x86: Does this Xen host support the MMU_{CLEAR,COPY}_PAGE hypercall? */
+#define XENFEAT_highmem_assist             6
+
+/*
+ * If set, GNTTABOP_map_grant_ref honors flags to be placed into guest kernel
+ * available pte bits.
+ */
+#define XENFEAT_gnttab_map_avail_bits      7
+
+/* x86: Does this Xen host support the HVM callback vector type? */
+#define XENFEAT_hvm_callback_vector        8
+
+/* x86: pvclock algorithm is safe to use on HVM */
+#define XENFEAT_hvm_safe_pvclock           9
+
+/* x86: pirq can be used by HVM guests */
+#define XENFEAT_hvm_pirqs                 10
+
+/* operation as Dom0 is supported */
+#define XENFEAT_dom0                      11
+
+/* Xen also maps grant references at pfn = mfn.
+ * This feature flag is deprecated and should not be used.
+#define XENFEAT_grant_map_identity        12
+ */
+
+/* Guest can use XENMEMF_vnode to specify virtual node for memory op. */
+#define XENFEAT_memory_op_vnode_supported 13
+
+/* arm: Hypervisor supports ARM SMC calling convention. */
+#define XENFEAT_ARM_SMCCC_supported       14
+
+/*
+ * x86/PVH: If set, ACPI RSDP can be placed at any address. Otherwise RSDP
+ * must be located in lower 1MB, as required by ACPI Specification for IA-PC
+ * systems.
+ * This feature flag is only consulted if XEN_ELFNOTE_GUEST_OS contains
+ * the "linux" string.
+ */
+#define XENFEAT_linux_rsdp_unrestricted   15
+
+/*
+ * A direct-mapped (or 1:1 mapped) domain is a domain for which its
+ * local pages have gfn == mfn. If a domain is direct-mapped,
+ * XENFEAT_direct_mapped is set; otherwise XENFEAT_not_direct_mapped
+ * is set.
+ *
+ * If neither flag is set (e.g. older Xen releases) the assumptions are:
+ * - not auto_translated domains (x86 only) are always direct-mapped
+ * - on x86, auto_translated domains are not direct-mapped
+ * - on ARM, Dom0 is direct-mapped, DomUs are not
+ */
+#define XENFEAT_not_direct_mapped         16
+#define XENFEAT_direct_mapped             17
+
+#define XENFEAT_NR_SUBMAPS 1
+
+#endif /* __XEN_PUBLIC_FEATURES_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/grant_table.h b/include/standard-headers/xen/grant_table.h
new file mode 100644
index 0000000000..7934d7b718
--- /dev/null
+++ b/include/standard-headers/xen/grant_table.h
@@ -0,0 +1,686 @@
+/******************************************************************************
+ * grant_table.h
+ *
+ * Interface for granting foreign access to page frames, and receiving
+ * page-ownership transfers.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_GRANT_TABLE_H__
+#define __XEN_PUBLIC_GRANT_TABLE_H__
+
+#include "xen.h"
+
+/*
+ * `incontents 150 gnttab Grant Tables
+ *
+ * Xen's grant tables provide a generic mechanism to memory sharing
+ * between domains. This shared memory interface underpins the split
+ * device drivers for block and network IO.
+ *
+ * Each domain has its own grant table. This is a data structure that
+ * is shared with Xen; it allows the domain to tell Xen what kind of
+ * permissions other domains have on its pages. Entries in the grant
+ * table are identified by grant references. A grant reference is an
+ * integer, which indexes into the grant table. It acts as a
+ * capability which the grantee can use to perform operations on the
+ * granter's memory.
+ *
+ * This capability-based system allows shared-memory communications
+ * between unprivileged domains. A grant reference also encapsulates
+ * the details of a shared page, removing the need for a domain to
+ * know the real machine address of a page it is sharing. This makes
+ * it possible to share memory correctly with domains running in
+ * fully virtualised memory.
+ */
+
+/***********************************
+ * GRANT TABLE REPRESENTATION
+ */
+
+/* Some rough guidelines on accessing and updating grant-table entries
+ * in a concurrency-safe manner. For more information, Linux contains a
+ * reference implementation for guest OSes (drivers/xen/grant_table.c, see
+ * http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=drivers/xen/grant-table.c;hb=HEAD
+ *
+ * NB. WMB is a no-op on current-generation x86 processors. However, a
+ *     compiler barrier will still be required.
+ *
+ * Introducing a valid entry into the grant table:
+ *  1. Write ent->domid.
+ *  2. Write ent->frame:
+ *      GTF_permit_access:   Frame to which access is permitted.
+ *      GTF_accept_transfer: Pseudo-phys frame slot being filled by new
+ *                           frame, or zero if none.
+ *  3. Write memory barrier (WMB).
+ *  4. Write ent->flags, inc. valid type.
+ *
+ * Invalidating an unused GTF_permit_access entry:
+ *  1. flags = ent->flags.
+ *  2. Observe that !(flags & (GTF_reading|GTF_writing)).
+ *  3. Check result of SMP-safe CMPXCHG(&ent->flags, flags, 0).
+ *  NB. No need for WMB as reuse of entry is control-dependent on success of
+ *      step 3, and all architectures guarantee ordering of ctrl-dep writes.
+ *
+ * Invalidating an in-use GTF_permit_access entry:
+ *  This cannot be done directly. Request assistance from the domain controller
+ *  which can set a timeout on the use of a grant entry and take necessary
+ *  action. (NB. This is not yet implemented!).
+ *
+ * Invalidating an unused GTF_accept_transfer entry:
+ *  1. flags = ent->flags.
+ *  2. Observe that !(flags & GTF_transfer_committed). [*]
+ *  3. Check result of SMP-safe CMPXCHG(&ent->flags, flags, 0).
+ *  NB. No need for WMB as reuse of entry is control-dependent on success of
+ *      step 3, and all architectures guarantee ordering of ctrl-dep writes.
+ *  [*] If GTF_transfer_committed is set then the grant entry is 'committed'.
+ *      The guest must /not/ modify the grant entry until the address of the
+ *      transferred frame is written. It is safe for the guest to spin waiting
+ *      for this to occur (detect by observing GTF_transfer_completed in
+ *      ent->flags).
+ *
+ * Invalidating a committed GTF_accept_transfer entry:
+ *  1. Wait for (ent->flags & GTF_transfer_completed).
+ *
+ * Changing a GTF_permit_access from writable to read-only:
+ *  Use SMP-safe CMPXCHG to set GTF_readonly, while checking !GTF_writing.
+ *
+ * Changing a GTF_permit_access from read-only to writable:
+ *  Use SMP-safe bit-setting instruction.
+ */
+
+/*
+ * Reference to a grant entry in a specified domain's grant table.
+ */
+typedef uint32_t grant_ref_t;
+
+/*
+ * A grant table comprises a packed array of grant entries in one or more
+ * page frames shared between Xen and a guest.
+ * [XEN]: This field is written by Xen and read by the sharing guest.
+ * [GST]: This field is written by the guest and read by Xen.
+ */
+
+/*
+ * Version 1 of the grant table entry structure is maintained largely for
+ * backwards compatibility.  New guests are recommended to support using
+ * version 2 to overcome version 1 limitations, but to default to version 1.
+ */
+#if __XEN_INTERFACE_VERSION__ < 0x0003020a
+#define grant_entry_v1 grant_entry
+#define grant_entry_v1_t grant_entry_t
+#endif
+struct grant_entry_v1 {
+    /* GTF_xxx: various type and flag information.  [XEN,GST] */
+    uint16_t flags;
+    /* The domain being granted foreign privileges. [GST] */
+    domid_t  domid;
+    /*
+     * GTF_permit_access: GFN that @domid is allowed to map and access. [GST]
+     * GTF_accept_transfer: GFN that @domid is allowed to transfer into. [GST]
+     * GTF_transfer_completed: MFN whose ownership transferred by @domid
+     *                         (non-translated guests only). [XEN]
+     */
+    uint32_t frame;
+};
+typedef struct grant_entry_v1 grant_entry_v1_t;
+
+/* The first few grant table entries will be preserved across grant table
+ * version changes and may be pre-populated at domain creation by tools.
+ */
+#define GNTTAB_NR_RESERVED_ENTRIES     8
+#define GNTTAB_RESERVED_CONSOLE        0
+#define GNTTAB_RESERVED_XENSTORE       1
+
+/*
+ * Type of grant entry.
+ *  GTF_invalid: This grant entry grants no privileges.
+ *  GTF_permit_access: Allow @domid to map/access @frame.
+ *  GTF_accept_transfer: Allow @domid to transfer ownership of one page frame
+ *                       to this guest. Xen writes the page number to @frame.
+ *  GTF_transitive: Allow @domid to transitively access a subrange of
+ *                  @trans_grant in @trans_domid.  No mappings are allowed.
+ */
+#define GTF_invalid         (0U<<0)
+#define GTF_permit_access   (1U<<0)
+#define GTF_accept_transfer (2U<<0)
+#define GTF_transitive      (3U<<0)
+#define GTF_type_mask       (3U<<0)
+
+/*
+ * Subflags for GTF_permit_access and GTF_transitive.
+ *  GTF_readonly: Restrict @domid to read-only mappings and accesses. [GST]
+ *  GTF_reading: Grant entry is currently mapped for reading by @domid. [XEN]
+ *  GTF_writing: Grant entry is currently mapped for writing by @domid. [XEN]
+ * Further subflags for GTF_permit_access only.
+ *  GTF_PAT, GTF_PWT, GTF_PCD: (x86) cache attribute flags to be used for
+ *                             mappings of the grant [GST]
+ *  GTF_sub_page: Grant access to only a subrange of the page.  @domid
+ *                will only be allowed to copy from the grant, and not
+ *                map it. [GST]
+ */
+#define _GTF_readonly       (2)
+#define GTF_readonly        (1U<<_GTF_readonly)
+#define _GTF_reading        (3)
+#define GTF_reading         (1U<<_GTF_reading)
+#define _GTF_writing        (4)
+#define GTF_writing         (1U<<_GTF_writing)
+#define _GTF_PWT            (5)
+#define GTF_PWT             (1U<<_GTF_PWT)
+#define _GTF_PCD            (6)
+#define GTF_PCD             (1U<<_GTF_PCD)
+#define _GTF_PAT            (7)
+#define GTF_PAT             (1U<<_GTF_PAT)
+#define _GTF_sub_page       (8)
+#define GTF_sub_page        (1U<<_GTF_sub_page)
+
+/*
+ * Subflags for GTF_accept_transfer:
+ *  GTF_transfer_committed: Xen sets this flag to indicate that it is committed
+ *      to transferring ownership of a page frame. When a guest sees this flag
+ *      it must /not/ modify the grant entry until GTF_transfer_completed is
+ *      set by Xen.
+ *  GTF_transfer_completed: It is safe for the guest to spin-wait on this flag
+ *      after reading GTF_transfer_committed. Xen will always write the frame
+ *      address, followed by ORing this flag, in a timely manner.
+ */
+#define _GTF_transfer_committed (2)
+#define GTF_transfer_committed  (1U<<_GTF_transfer_committed)
+#define _GTF_transfer_completed (3)
+#define GTF_transfer_completed  (1U<<_GTF_transfer_completed)
+
+/*
+ * Version 2 grant table entries.  These fulfil the same role as
+ * version 1 entries, but can represent more complicated operations.
+ * Any given domain will have either a version 1 or a version 2 table,
+ * and every entry in the table will be the same version.
+ *
+ * The interface by which domains use grant references does not depend
+ * on the grant table version in use by the other domain.
+ */
+#if __XEN_INTERFACE_VERSION__ >= 0x0003020a
+/*
+ * Version 1 and version 2 grant entries share a common prefix.  The
+ * fields of the prefix are documented as part of struct
+ * grant_entry_v1.
+ */
+struct grant_entry_header {
+    uint16_t flags;
+    domid_t  domid;
+};
+typedef struct grant_entry_header grant_entry_header_t;
+
+/*
+ * Version 2 of the grant entry structure.
+ */
+union grant_entry_v2 {
+    grant_entry_header_t hdr;
+
+    /*
+     * This member is used for V1-style full page grants, where either:
+     *
+     * -- hdr.type is GTF_accept_transfer, or
+     * -- hdr.type is GTF_permit_access and GTF_sub_page is not set.
+     *
+     * In that case, the frame field has the same semantics as the
+     * field of the same name in the V1 entry structure.
+     */
+    struct {
+        grant_entry_header_t hdr;
+        uint32_t pad0;
+        uint64_t frame;
+    } full_page;
+
+    /*
+     * If the grant type is GTF_grant_access and GTF_sub_page is set,
+     * @domid is allowed to access bytes [@page_off,@page_off+@length)
+     * in frame @frame.
+     */
+    struct {
+        grant_entry_header_t hdr;
+        uint16_t page_off;
+        uint16_t length;
+        uint64_t frame;
+    } sub_page;
+
+    /*
+     * If the grant is GTF_transitive, @domid is allowed to use the
+     * grant @gref in domain @trans_domid, as if it was the local
+     * domain.  Obviously, the transitive access must be compatible
+     * with the original grant.
+     *
+     * The current version of Xen does not allow transitive grants
+     * to be mapped.
+     */
+    struct {
+        grant_entry_header_t hdr;
+        domid_t trans_domid;
+        uint16_t pad0;
+        grant_ref_t gref;
+    } transitive;
+
+    uint32_t __spacer[4]; /* Pad to a power of two */
+};
+typedef union grant_entry_v2 grant_entry_v2_t;
+
+typedef uint16_t grant_status_t;
+
+#endif /* __XEN_INTERFACE_VERSION__ */
+
+/***********************************
+ * GRANT TABLE QUERIES AND USES
+ */
+
+/* ` enum neg_errnoval
+ * ` HYPERVISOR_grant_table_op(enum grant_table_op cmd,
+ * `                           void *args,
+ * `                           unsigned int count)
+ * `
+ *
+ * @args points to an array of a per-command data structure. The array
+ * has @count members
+ */
+
+/* ` enum grant_table_op { // GNTTABOP_* => struct gnttab_* */
+#define GNTTABOP_map_grant_ref        0
+#define GNTTABOP_unmap_grant_ref      1
+#define GNTTABOP_setup_table          2
+#define GNTTABOP_dump_table           3
+#define GNTTABOP_transfer             4
+#define GNTTABOP_copy                 5
+#define GNTTABOP_query_size           6
+#define GNTTABOP_unmap_and_replace    7
+#if __XEN_INTERFACE_VERSION__ >= 0x0003020a
+#define GNTTABOP_set_version          8
+#define GNTTABOP_get_status_frames    9
+#define GNTTABOP_get_version          10
+#define GNTTABOP_swap_grant_ref	      11
+#define GNTTABOP_cache_flush	      12
+#endif /* __XEN_INTERFACE_VERSION__ */
+/* ` } */
+
+/*
+ * Handle to track a mapping created via a grant reference.
+ */
+typedef uint32_t grant_handle_t;
+
+/*
+ * GNTTABOP_map_grant_ref: Map the grant entry (<dom>,<ref>) for access
+ * by devices and/or host CPUs. If successful, <handle> is a tracking number
+ * that must be presented later to destroy the mapping(s). On error, <status>
+ * is a negative status code.
+ * NOTES:
+ *  1. If GNTMAP_device_map is specified then <dev_bus_addr> is the address
+ *     via which I/O devices may access the granted frame.
+ *  2. If GNTMAP_host_map is specified then a mapping will be added at
+ *     either a host virtual address in the current address space, or at
+ *     a PTE at the specified machine address.  The type of mapping to
+ *     perform is selected through the GNTMAP_contains_pte flag, and the
+ *     address is specified in <host_addr>.
+ *  3. Mappings should only be destroyed via GNTTABOP_unmap_grant_ref. If a
+ *     host mapping is destroyed by other means then it is *NOT* guaranteed
+ *     to be accounted to the correct grant reference!
+ */
+struct gnttab_map_grant_ref {
+    /* IN parameters. */
+    uint64_t host_addr;
+    uint32_t flags;               /* GNTMAP_* */
+    grant_ref_t ref;
+    domid_t  dom;
+    /* OUT parameters. */
+    int16_t  status;              /* => enum grant_status */
+    grant_handle_t handle;
+    uint64_t dev_bus_addr;
+};
+typedef struct gnttab_map_grant_ref gnttab_map_grant_ref_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_map_grant_ref_t);
+
+/*
+ * GNTTABOP_unmap_grant_ref: Destroy one or more grant-reference mappings
+ * tracked by <handle>. If <host_addr> or <dev_bus_addr> is zero, that
+ * field is ignored. If non-zero, they must refer to a device/host mapping
+ * that is tracked by <handle>
+ * NOTES:
+ *  1. The call may fail in an undefined manner if either mapping is not
+ *     tracked by <handle>.
+ *  3. After executing a batch of unmaps, it is guaranteed that no stale
+ *     mappings will remain in the device or host TLBs.
+ */
+struct gnttab_unmap_grant_ref {
+    /* IN parameters. */
+    uint64_t host_addr;
+    uint64_t dev_bus_addr;
+    grant_handle_t handle;
+    /* OUT parameters. */
+    int16_t  status;              /* => enum grant_status */
+};
+typedef struct gnttab_unmap_grant_ref gnttab_unmap_grant_ref_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_unmap_grant_ref_t);
+
+/*
+ * GNTTABOP_setup_table: Set up a grant table for <dom> comprising at least
+ * <nr_frames> pages. The frame addresses are written to the <frame_list>.
+ * Only <nr_frames> addresses are written, even if the table is larger.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may specify <dom> != DOMID_SELF.
+ *  3. Xen may not support more than a single grant-table page per domain.
+ */
+struct gnttab_setup_table {
+    /* IN parameters. */
+    domid_t  dom;
+    uint32_t nr_frames;
+    /* OUT parameters. */
+    int16_t  status;              /* => enum grant_status */
+#if __XEN_INTERFACE_VERSION__ < 0x00040300
+    XEN_GUEST_HANDLE(ulong) frame_list;
+#else
+    XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
+#endif
+};
+typedef struct gnttab_setup_table gnttab_setup_table_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_setup_table_t);
+
+/*
+ * GNTTABOP_dump_table: Dump the contents of the grant table to the
+ * xen console. Debugging use only.
+ */
+struct gnttab_dump_table {
+    /* IN parameters. */
+    domid_t dom;
+    /* OUT parameters. */
+    int16_t status;               /* => enum grant_status */
+};
+typedef struct gnttab_dump_table gnttab_dump_table_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_dump_table_t);
+
+/*
+ * GNTTABOP_transfer: Transfer <frame> to a foreign domain. The foreign domain
+ * has previously registered its interest in the transfer via <domid, ref>.
+ *
+ * Note that, even if the transfer fails, the specified page no longer belongs
+ * to the calling domain *unless* the error is GNTST_bad_page.
+ *
+ * Note further that only PV guests can use this operation.
+ */
+struct gnttab_transfer {
+    /* IN parameters. */
+    xen_pfn_t     mfn;
+    domid_t       domid;
+    grant_ref_t   ref;
+    /* OUT parameters. */
+    int16_t       status;
+};
+typedef struct gnttab_transfer gnttab_transfer_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_transfer_t);
+
+
+/*
+ * GNTTABOP_copy: Hypervisor based copy
+ * source and destinations can be eithers MFNs or, for foreign domains,
+ * grant references. the foreign domain has to grant read/write access
+ * in its grant table.
+ *
+ * The flags specify what type source and destinations are (either MFN
+ * or grant reference).
+ *
+ * Note that this can also be used to copy data between two domains
+ * via a third party if the source and destination domains had previously
+ * grant appropriate access to their pages to the third party.
+ *
+ * source_offset specifies an offset in the source frame, dest_offset
+ * the offset in the target frame and  len specifies the number of
+ * bytes to be copied.
+ */
+
+#define _GNTCOPY_source_gref      (0)
+#define GNTCOPY_source_gref       (1<<_GNTCOPY_source_gref)
+#define _GNTCOPY_dest_gref        (1)
+#define GNTCOPY_dest_gref         (1<<_GNTCOPY_dest_gref)
+
+struct gnttab_copy {
+    /* IN parameters. */
+    struct gnttab_copy_ptr {
+        union {
+            grant_ref_t ref;
+            xen_pfn_t   gmfn;
+        } u;
+        domid_t  domid;
+        uint16_t offset;
+    } source, dest;
+    uint16_t      len;
+    uint16_t      flags;          /* GNTCOPY_* */
+    /* OUT parameters. */
+    int16_t       status;
+};
+typedef struct gnttab_copy  gnttab_copy_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_copy_t);
+
+/*
+ * GNTTABOP_query_size: Query the current and maximum sizes of the shared
+ * grant table.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may specify <dom> != DOMID_SELF.
+ */
+struct gnttab_query_size {
+    /* IN parameters. */
+    domid_t  dom;
+    /* OUT parameters. */
+    uint32_t nr_frames;
+    uint32_t max_nr_frames;
+    int16_t  status;              /* => enum grant_status */
+};
+typedef struct gnttab_query_size gnttab_query_size_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_query_size_t);
+
+/*
+ * GNTTABOP_unmap_and_replace: Destroy one or more grant-reference mappings
+ * tracked by <handle> but atomically replace the page table entry with one
+ * pointing to the machine address under <new_addr>.  <new_addr> will be
+ * redirected to the null entry.
+ * NOTES:
+ *  1. The call may fail in an undefined manner if either mapping is not
+ *     tracked by <handle>.
+ *  2. After executing a batch of unmaps, it is guaranteed that no stale
+ *     mappings will remain in the device or host TLBs.
+ */
+struct gnttab_unmap_and_replace {
+    /* IN parameters. */
+    uint64_t host_addr;
+    uint64_t new_addr;
+    grant_handle_t handle;
+    /* OUT parameters. */
+    int16_t  status;              /* => enum grant_status */
+};
+typedef struct gnttab_unmap_and_replace gnttab_unmap_and_replace_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_unmap_and_replace_t);
+
+#if __XEN_INTERFACE_VERSION__ >= 0x0003020a
+/*
+ * GNTTABOP_set_version: Request a particular version of the grant
+ * table shared table structure.  This operation may be used to toggle
+ * between different versions, but must be performed while no grants
+ * are active.  The only defined versions are 1 and 2.
+ */
+struct gnttab_set_version {
+    /* IN/OUT parameters */
+    uint32_t version;
+};
+typedef struct gnttab_set_version gnttab_set_version_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_set_version_t);
+
+
+/*
+ * GNTTABOP_get_status_frames: Get the list of frames used to store grant
+ * status for <dom>. In grant format version 2, the status is separated
+ * from the other shared grant fields to allow more efficient synchronization
+ * using barriers instead of atomic cmpexch operations.
+ * <nr_frames> specify the size of vector <frame_list>.
+ * The frame addresses are returned in the <frame_list>.
+ * Only <nr_frames> addresses are returned, even if the table is larger.
+ * NOTES:
+ *  1. <dom> may be specified as DOMID_SELF.
+ *  2. Only a sufficiently-privileged domain may specify <dom> != DOMID_SELF.
+ */
+struct gnttab_get_status_frames {
+    /* IN parameters. */
+    uint32_t nr_frames;
+    domid_t  dom;
+    /* OUT parameters. */
+    int16_t  status;              /* => enum grant_status */
+    XEN_GUEST_HANDLE(uint64_t) frame_list;
+};
+typedef struct gnttab_get_status_frames gnttab_get_status_frames_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_get_status_frames_t);
+
+/*
+ * GNTTABOP_get_version: Get the grant table version which is in
+ * effect for domain <dom>.
+ */
+struct gnttab_get_version {
+    /* IN parameters */
+    domid_t dom;
+    uint16_t pad;
+    /* OUT parameters */
+    uint32_t version;
+};
+typedef struct gnttab_get_version gnttab_get_version_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_get_version_t);
+
+/*
+ * GNTTABOP_swap_grant_ref: Swap the contents of two grant entries.
+ */
+struct gnttab_swap_grant_ref {
+    /* IN parameters */
+    grant_ref_t ref_a;
+    grant_ref_t ref_b;
+    /* OUT parameters */
+    int16_t status;             /* => enum grant_status */
+};
+typedef struct gnttab_swap_grant_ref gnttab_swap_grant_ref_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_swap_grant_ref_t);
+
+/*
+ * Issue one or more cache maintenance operations on a portion of a
+ * page granted to the calling domain by a foreign domain.
+ */
+struct gnttab_cache_flush {
+    union {
+        uint64_t dev_bus_addr;
+        grant_ref_t ref;
+    } a;
+    uint16_t offset; /* offset from start of grant */
+    uint16_t length; /* size within the grant */
+#define GNTTAB_CACHE_CLEAN          (1u<<0)
+#define GNTTAB_CACHE_INVAL          (1u<<1)
+#define GNTTAB_CACHE_SOURCE_GREF    (1u<<31)
+    uint32_t op;
+};
+typedef struct gnttab_cache_flush gnttab_cache_flush_t;
+DEFINE_XEN_GUEST_HANDLE(gnttab_cache_flush_t);
+
+#endif /* __XEN_INTERFACE_VERSION__ */
+
+/*
+ * Bitfield values for gnttab_map_grant_ref.flags.
+ */
+ /* Map the grant entry for access by I/O devices. */
+#define _GNTMAP_device_map      (0)
+#define GNTMAP_device_map       (1<<_GNTMAP_device_map)
+ /* Map the grant entry for access by host CPUs. */
+#define _GNTMAP_host_map        (1)
+#define GNTMAP_host_map         (1<<_GNTMAP_host_map)
+ /* Accesses to the granted frame will be restricted to read-only access. */
+#define _GNTMAP_readonly        (2)
+#define GNTMAP_readonly         (1<<_GNTMAP_readonly)
+ /*
+  * GNTMAP_host_map subflag:
+  *  0 => The host mapping is usable only by the guest OS.
+  *  1 => The host mapping is usable by guest OS + current application.
+  */
+#define _GNTMAP_application_map (3)
+#define GNTMAP_application_map  (1<<_GNTMAP_application_map)
+
+ /*
+  * GNTMAP_contains_pte subflag:
+  *  0 => This map request contains a host virtual address.
+  *  1 => This map request contains the machine addess of the PTE to update.
+  */
+#define _GNTMAP_contains_pte    (4)
+#define GNTMAP_contains_pte     (1<<_GNTMAP_contains_pte)
+
+/*
+ * Bits to be placed in guest kernel available PTE bits (architecture
+ * dependent; only supported when XENFEAT_gnttab_map_avail_bits is set).
+ */
+#define _GNTMAP_guest_avail0    (16)
+#define GNTMAP_guest_avail_mask ((uint32_t)~0 << _GNTMAP_guest_avail0)
+
+/*
+ * Values for error status returns. All errors are -ve.
+ */
+/* ` enum grant_status { */
+#define GNTST_okay             (0)  /* Normal return.                        */
+#define GNTST_general_error    (-1) /* General undefined error.              */
+#define GNTST_bad_domain       (-2) /* Unrecognsed domain id.                */
+#define GNTST_bad_gntref       (-3) /* Unrecognised or inappropriate gntref. */
+#define GNTST_bad_handle       (-4) /* Unrecognised or inappropriate handle. */
+#define GNTST_bad_virt_addr    (-5) /* Inappropriate virtual address to map. */
+#define GNTST_bad_dev_addr     (-6) /* Inappropriate device address to unmap.*/
+#define GNTST_no_device_space  (-7) /* Out of space in I/O MMU.              */
+#define GNTST_permission_denied (-8) /* Not enough privilege for operation.  */
+#define GNTST_bad_page         (-9) /* Specified page was invalid for op.    */
+#define GNTST_bad_copy_arg    (-10) /* copy arguments cross page boundary.   */
+#define GNTST_address_too_big (-11) /* transfer page address too large.      */
+#define GNTST_eagain          (-12) /* Operation not done; try again.        */
+#define GNTST_no_space        (-13) /* Out of space (handles etc).           */
+/* ` } */
+
+#define GNTTABOP_error_msgs {                   \
+    "okay",                                     \
+    "undefined error",                          \
+    "unrecognised domain id",                   \
+    "invalid grant reference",                  \
+    "invalid mapping handle",                   \
+    "invalid virtual address",                  \
+    "invalid device address",                   \
+    "no spare translation slot in the I/O MMU", \
+    "permission denied",                        \
+    "bad page",                                 \
+    "copy arguments cross page boundary",       \
+    "page address size too large",              \
+    "operation not done; try again",            \
+    "out of space",                             \
+}
+
+#endif /* __XEN_PUBLIC_GRANT_TABLE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/hvm/hvm_op.h b/include/standard-headers/xen/hvm/hvm_op.h
new file mode 100644
index 0000000000..870ec52060
--- /dev/null
+++ b/include/standard-headers/xen/hvm/hvm_op.h
@@ -0,0 +1,395 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2007, Keir Fraser
+ */
+
+#ifndef __XEN_PUBLIC_HVM_HVM_OP_H__
+#define __XEN_PUBLIC_HVM_HVM_OP_H__
+
+#include "../xen.h"
+#include "../trace.h"
+#include "../event_channel.h"
+
+/* Get/set subcommands: extra argument == pointer to xen_hvm_param struct. */
+#define HVMOP_set_param           0
+#define HVMOP_get_param           1
+struct xen_hvm_param {
+    domid_t  domid;    /* IN */
+    uint16_t pad;
+    uint32_t index;    /* IN */
+    uint64_t value;    /* IN/OUT */
+};
+typedef struct xen_hvm_param xen_hvm_param_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_param_t);
+
+struct xen_hvm_altp2m_suppress_ve {
+    uint16_t view;
+    uint8_t suppress_ve; /* Boolean type. */
+    uint8_t pad1;
+    uint32_t pad2;
+    uint64_t gfn;
+};
+
+struct xen_hvm_altp2m_suppress_ve_multi {
+    uint16_t view;
+    uint8_t suppress_ve; /* Boolean type. */
+    uint8_t pad1;
+    int32_t first_error; /* Should be set to 0. */
+    uint64_t first_gfn; /* Value may be updated. */
+    uint64_t last_gfn;
+    uint64_t first_error_gfn; /* Gfn of the first error. */
+};
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040900
+
+/* Set the logical level of one of a domain's PCI INTx wires. */
+#define HVMOP_set_pci_intx_level  2
+struct xen_hvm_set_pci_intx_level {
+    /* Domain to be updated. */
+    domid_t  domid;
+    /* PCI INTx identification in PCI topology (domain:bus:device:intx). */
+    uint8_t  domain, bus, device, intx;
+    /* Assertion level (0 = unasserted, 1 = asserted). */
+    uint8_t  level;
+};
+typedef struct xen_hvm_set_pci_intx_level xen_hvm_set_pci_intx_level_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_set_pci_intx_level_t);
+
+/* Set the logical level of one of a domain's ISA IRQ wires. */
+#define HVMOP_set_isa_irq_level   3
+struct xen_hvm_set_isa_irq_level {
+    /* Domain to be updated. */
+    domid_t  domid;
+    /* ISA device identification, by ISA IRQ (0-15). */
+    uint8_t  isa_irq;
+    /* Assertion level (0 = unasserted, 1 = asserted). */
+    uint8_t  level;
+};
+typedef struct xen_hvm_set_isa_irq_level xen_hvm_set_isa_irq_level_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_set_isa_irq_level_t);
+
+#define HVMOP_set_pci_link_route  4
+struct xen_hvm_set_pci_link_route {
+    /* Domain to be updated. */
+    domid_t  domid;
+    /* PCI link identifier (0-3). */
+    uint8_t  link;
+    /* ISA IRQ (1-15), or 0 (disable link). */
+    uint8_t  isa_irq;
+};
+typedef struct xen_hvm_set_pci_link_route xen_hvm_set_pci_link_route_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_set_pci_link_route_t);
+
+#endif /* __XEN_INTERFACE_VERSION__ < 0x00040900 */
+
+/* Flushes all VCPU TLBs: @arg must be NULL. */
+#define HVMOP_flush_tlbs          5
+
+/*
+ * hvmmem_type_t should not be defined when generating the corresponding
+ * compat header. This will ensure that the improperly named HVMMEM_(*)
+ * values are defined only once.
+ */
+#ifndef XEN_GENERATING_COMPAT_HEADERS
+
+typedef enum {
+    HVMMEM_ram_rw,             /* Normal read/write guest RAM */
+    HVMMEM_ram_ro,             /* Read-only; writes are discarded */
+    HVMMEM_mmio_dm,            /* Reads and write go to the device model */
+#if __XEN_INTERFACE_VERSION__ < 0x00040700
+    HVMMEM_mmio_write_dm,      /* Read-only; writes go to the device model */
+#else
+    HVMMEM_unused,             /* Placeholder; setting memory to this type
+                                  will fail for code after 4.7.0 */
+#endif
+    HVMMEM_ioreq_server        /* Memory type claimed by an ioreq server; type
+                                  changes to this value are only allowed after
+                                  an ioreq server has claimed its ownership.
+                                  Only pages with HVMMEM_ram_rw are allowed to
+                                  change to this type; conversely, pages with
+                                  this type are only allowed to be changed back
+                                  to HVMMEM_ram_rw. */
+} hvmmem_type_t;
+
+#endif /* XEN_GENERATING_COMPAT_HEADERS */
+
+/* Hint from PV drivers for pagetable destruction. */
+#define HVMOP_pagetable_dying        9
+struct xen_hvm_pagetable_dying {
+    /* Domain with a pagetable about to be destroyed. */
+    domid_t  domid;
+    uint16_t pad[3]; /* align next field on 8-byte boundary */
+    /* guest physical address of the toplevel pagetable dying */
+    uint64_t gpa;
+};
+typedef struct xen_hvm_pagetable_dying xen_hvm_pagetable_dying_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_pagetable_dying_t);
+
+/* Get the current Xen time, in nanoseconds since system boot. */
+#define HVMOP_get_time              10
+struct xen_hvm_get_time {
+    uint64_t now;      /* OUT */
+};
+typedef struct xen_hvm_get_time xen_hvm_get_time_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_get_time_t);
+
+#define HVMOP_xentrace              11
+struct xen_hvm_xentrace {
+    uint16_t event, extra_bytes;
+    uint8_t extra[TRACE_EXTRA_MAX * sizeof(uint32_t)];
+};
+typedef struct xen_hvm_xentrace xen_hvm_xentrace_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_xentrace_t);
+
+/* Following tools-only interfaces may change in future. */
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+
+/* Deprecated by XENMEM_access_op_set_access */
+#define HVMOP_set_mem_access        12
+
+/* Deprecated by XENMEM_access_op_get_access */
+#define HVMOP_get_mem_access        13
+
+#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
+
+#define HVMOP_get_mem_type    15
+/* Return hvmmem_type_t for the specified pfn. */
+struct xen_hvm_get_mem_type {
+    /* Domain to be queried. */
+    domid_t domid;
+    /* OUT variable. */
+    uint16_t mem_type;
+    uint16_t pad[2]; /* align next field on 8-byte boundary */
+    /* IN variable. */
+    uint64_t pfn;
+};
+typedef struct xen_hvm_get_mem_type xen_hvm_get_mem_type_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_get_mem_type_t);
+
+/* Following tools-only interfaces may change in future. */
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+
+/*
+ * Definitions relating to DMOP_create_ioreq_server. (Defined here for
+ * backwards compatibility).
+ */
+
+#define HVM_IOREQSRV_BUFIOREQ_OFF    0
+#define HVM_IOREQSRV_BUFIOREQ_LEGACY 1
+/*
+ * Use this when read_pointer gets updated atomically and
+ * the pointer pair gets read atomically:
+ */
+#define HVM_IOREQSRV_BUFIOREQ_ATOMIC 2
+
+#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
+
+#if defined(__i386__) || defined(__x86_64__)
+
+/*
+ * HVMOP_set_evtchn_upcall_vector: Set a <vector> that should be used for event
+ *                                 channel upcalls on the specified <vcpu>. If set,
+ *                                 this vector will be used in preference to the
+ *                                 domain global callback via (see
+ *                                 HVM_PARAM_CALLBACK_IRQ).
+ */
+#define HVMOP_set_evtchn_upcall_vector 23
+struct xen_hvm_evtchn_upcall_vector {
+    uint32_t vcpu;
+    uint8_t vector;
+};
+typedef struct xen_hvm_evtchn_upcall_vector xen_hvm_evtchn_upcall_vector_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_evtchn_upcall_vector_t);
+
+#endif /* defined(__i386__) || defined(__x86_64__) */
+
+#define HVMOP_guest_request_vm_event 24
+
+/* HVMOP_altp2m: perform altp2m state operations */
+#define HVMOP_altp2m 25
+
+#define HVMOP_ALTP2M_INTERFACE_VERSION 0x00000001
+
+struct xen_hvm_altp2m_domain_state {
+    /* IN or OUT variable on/off */
+    uint8_t state;
+};
+typedef struct xen_hvm_altp2m_domain_state xen_hvm_altp2m_domain_state_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_domain_state_t);
+
+struct xen_hvm_altp2m_vcpu_enable_notify {
+    uint32_t vcpu_id;
+    uint32_t pad;
+    /* #VE info area gfn */
+    uint64_t gfn;
+};
+typedef struct xen_hvm_altp2m_vcpu_enable_notify xen_hvm_altp2m_vcpu_enable_notify_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_vcpu_enable_notify_t);
+
+struct xen_hvm_altp2m_vcpu_disable_notify {
+    uint32_t vcpu_id;
+};
+typedef struct xen_hvm_altp2m_vcpu_disable_notify xen_hvm_altp2m_vcpu_disable_notify_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_vcpu_disable_notify_t);
+
+struct xen_hvm_altp2m_view {
+    /* IN/OUT variable */
+    uint16_t view;
+    uint16_t hvmmem_default_access; /* xenmem_access_t */
+};
+typedef struct xen_hvm_altp2m_view xen_hvm_altp2m_view_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_view_t);
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040a00
+struct xen_hvm_altp2m_set_mem_access {
+    /* view */
+    uint16_t view;
+    /* Memory type */
+    uint16_t access; /* xenmem_access_t */
+    uint32_t pad;
+    /* gfn */
+    uint64_t gfn;
+};
+typedef struct xen_hvm_altp2m_set_mem_access xen_hvm_altp2m_set_mem_access_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_set_mem_access_t);
+#endif /* __XEN_INTERFACE_VERSION__ < 0x00040a00 */
+
+struct xen_hvm_altp2m_mem_access {
+    /* view */
+    uint16_t view;
+    /* Memory type */
+    uint16_t access; /* xenmem_access_t */
+    uint32_t pad;
+    /* gfn */
+    uint64_t gfn;
+};
+typedef struct xen_hvm_altp2m_mem_access xen_hvm_altp2m_mem_access_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_mem_access_t);
+
+struct xen_hvm_altp2m_set_mem_access_multi {
+    /* view */
+    uint16_t view;
+    uint16_t pad;
+    /* Number of pages */
+    uint32_t nr;
+    /*
+     * Used for continuation purposes.
+     * Must be set to zero upon initial invocation.
+     */
+    uint64_t opaque;
+    /* List of pfns to set access for */
+    XEN_GUEST_HANDLE(const_uint64) pfn_list;
+    /* Corresponding list of access settings for pfn_list */
+    XEN_GUEST_HANDLE(const_uint8) access_list;
+};
+
+struct xen_hvm_altp2m_change_gfn {
+    /* view */
+    uint16_t view;
+    uint16_t pad1;
+    uint32_t pad2;
+    /* old gfn */
+    uint64_t old_gfn;
+    /* new gfn, INVALID_GFN (~0UL) means revert */
+    uint64_t new_gfn;
+};
+typedef struct xen_hvm_altp2m_change_gfn xen_hvm_altp2m_change_gfn_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_change_gfn_t);
+
+struct xen_hvm_altp2m_get_vcpu_p2m_idx {
+    uint32_t vcpu_id;
+    uint16_t altp2m_idx;
+};
+
+struct xen_hvm_altp2m_set_visibility {
+    uint16_t altp2m_idx;
+    uint8_t visible;
+    uint8_t pad;
+};
+
+struct xen_hvm_altp2m_op {
+    uint32_t version;   /* HVMOP_ALTP2M_INTERFACE_VERSION */
+    uint32_t cmd;
+/* Get/set the altp2m state for a domain */
+#define HVMOP_altp2m_get_domain_state     1
+#define HVMOP_altp2m_set_domain_state     2
+/* Set a given VCPU to receive altp2m event notifications */
+#define HVMOP_altp2m_vcpu_enable_notify   3
+/* Create a new view */
+#define HVMOP_altp2m_create_p2m           4
+/* Destroy a view */
+#define HVMOP_altp2m_destroy_p2m          5
+/* Switch view for an entire domain */
+#define HVMOP_altp2m_switch_p2m           6
+/* Notify that a page of memory is to have specific access types */
+#define HVMOP_altp2m_set_mem_access       7
+/* Change a p2m entry to have a different gfn->mfn mapping */
+#define HVMOP_altp2m_change_gfn           8
+/* Set access for an array of pages */
+#define HVMOP_altp2m_set_mem_access_multi 9
+/* Set the "Suppress #VE" bit on a page */
+#define HVMOP_altp2m_set_suppress_ve      10
+/* Get the "Suppress #VE" bit of a page */
+#define HVMOP_altp2m_get_suppress_ve      11
+/* Get the access of a page of memory from a certain view */
+#define HVMOP_altp2m_get_mem_access       12
+/* Disable altp2m event notifications for a given VCPU */
+#define HVMOP_altp2m_vcpu_disable_notify  13
+/* Get the active vcpu p2m index */
+#define HVMOP_altp2m_get_p2m_idx          14
+/* Set the "Supress #VE" bit for a range of pages */
+#define HVMOP_altp2m_set_suppress_ve_multi 15
+/* Set visibility for a given altp2m view */
+#define HVMOP_altp2m_set_visibility       16
+    domid_t domain;
+    uint16_t pad1;
+    uint32_t pad2;
+    union {
+        struct xen_hvm_altp2m_domain_state         domain_state;
+        struct xen_hvm_altp2m_vcpu_enable_notify   enable_notify;
+        struct xen_hvm_altp2m_view                 view;
+#if __XEN_INTERFACE_VERSION__ < 0x00040a00
+        struct xen_hvm_altp2m_set_mem_access       set_mem_access;
+#endif /* __XEN_INTERFACE_VERSION__ < 0x00040a00 */
+        struct xen_hvm_altp2m_mem_access           mem_access;
+        struct xen_hvm_altp2m_change_gfn           change_gfn;
+        struct xen_hvm_altp2m_set_mem_access_multi set_mem_access_multi;
+        struct xen_hvm_altp2m_suppress_ve          suppress_ve;
+        struct xen_hvm_altp2m_suppress_ve_multi    suppress_ve_multi;
+        struct xen_hvm_altp2m_vcpu_disable_notify  disable_notify;
+        struct xen_hvm_altp2m_get_vcpu_p2m_idx     get_vcpu_p2m_idx;
+        struct xen_hvm_altp2m_set_visibility       set_visibility;
+        uint8_t pad[64];
+    } u;
+};
+typedef struct xen_hvm_altp2m_op xen_hvm_altp2m_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_hvm_altp2m_op_t);
+
+#endif /* __XEN_PUBLIC_HVM_HVM_OP_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/hvm/params.h b/include/standard-headers/xen/hvm/params.h
new file mode 100644
index 0000000000..c9d6e70d7b
--- /dev/null
+++ b/include/standard-headers/xen/hvm/params.h
@@ -0,0 +1,318 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2007, Keir Fraser
+ */
+
+#ifndef __XEN_PUBLIC_HVM_PARAMS_H__
+#define __XEN_PUBLIC_HVM_PARAMS_H__
+
+#include "hvm_op.h"
+
+/* These parameters are deprecated and their meaning is undefined. */
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+
+#define HVM_PARAM_PAE_ENABLED                4
+#define HVM_PARAM_DM_DOMAIN                 13
+#define HVM_PARAM_MEMORY_EVENT_CR0          20
+#define HVM_PARAM_MEMORY_EVENT_CR3          21
+#define HVM_PARAM_MEMORY_EVENT_CR4          22
+#define HVM_PARAM_MEMORY_EVENT_INT3         23
+#define HVM_PARAM_NESTEDHVM                 24
+#define HVM_PARAM_MEMORY_EVENT_SINGLE_STEP  25
+#define HVM_PARAM_BUFIOREQ_EVTCHN           26
+#define HVM_PARAM_MEMORY_EVENT_MSR          30
+
+#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
+
+/*
+ * Parameter space for HVMOP_{set,get}_param.
+ */
+
+#define HVM_PARAM_CALLBACK_IRQ 0
+#define HVM_PARAM_CALLBACK_IRQ_TYPE_MASK xen_mk_ullong(0xFF00000000000000)
+/*
+ * How should CPU0 event-channel notifications be delivered?
+ *
+ * If val == 0 then CPU0 event-channel notifications are not delivered.
+ * If val != 0, val[63:56] encodes the type, as follows:
+ */
+
+#define HVM_PARAM_CALLBACK_TYPE_GSI      0
+/*
+ * val[55:0] is a delivery GSI.  GSI 0 cannot be used, as it aliases val == 0,
+ * and disables all notifications.
+ */
+
+#define HVM_PARAM_CALLBACK_TYPE_PCI_INTX 1
+/*
+ * val[55:0] is a delivery PCI INTx line:
+ * Domain = val[47:32], Bus = val[31:16] DevFn = val[15:8], IntX = val[1:0]
+ */
+
+#if defined(__i386__) || defined(__x86_64__)
+#define HVM_PARAM_CALLBACK_TYPE_VECTOR   2
+/*
+ * val[7:0] is a vector number.  Check for XENFEAT_hvm_callback_vector to know
+ * if this delivery method is available.
+ */
+#elif defined(__arm__) || defined(__aarch64__)
+#define HVM_PARAM_CALLBACK_TYPE_PPI      2
+/*
+ * val[55:16] needs to be zero.
+ * val[15:8] is interrupt flag of the PPI used by event-channel:
+ *  bit 8: the PPI is edge(1) or level(0) triggered
+ *  bit 9: the PPI is active low(1) or high(0)
+ * val[7:0] is a PPI number used by event-channel.
+ * This is only used by ARM/ARM64 and masking/eoi the interrupt associated to
+ * the notification is handled by the interrupt controller.
+ */
+#define HVM_PARAM_CALLBACK_TYPE_PPI_FLAG_MASK      0xFF00
+#define HVM_PARAM_CALLBACK_TYPE_PPI_FLAG_LOW_LEVEL 2
+#endif
+
+/*
+ * These are not used by Xen. They are here for convenience of HVM-guest
+ * xenbus implementations.
+ */
+#define HVM_PARAM_STORE_PFN    1
+#define HVM_PARAM_STORE_EVTCHN 2
+
+#define HVM_PARAM_IOREQ_PFN    5
+
+#define HVM_PARAM_BUFIOREQ_PFN 6
+
+#if defined(__i386__) || defined(__x86_64__)
+
+/*
+ * Viridian enlightenments
+ *
+ * (See http://download.microsoft.com/download/A/B/4/AB43A34E-BDD0-4FA6-BDEF-79EEF16E880B/Hypervisor%20Top%20Level%20Functional%20Specification%20v4.0.docx)
+ *
+ * To expose viridian enlightenments to the guest set this parameter
+ * to the desired feature mask. The base feature set must be present
+ * in any valid feature mask.
+ */
+#define HVM_PARAM_VIRIDIAN     9
+
+/* Base+Freq viridian feature sets:
+ *
+ * - Hypercall MSRs (HV_X64_MSR_GUEST_OS_ID and HV_X64_MSR_HYPERCALL)
+ * - APIC access MSRs (HV_X64_MSR_EOI, HV_X64_MSR_ICR and HV_X64_MSR_TPR)
+ * - Virtual Processor index MSR (HV_X64_MSR_VP_INDEX)
+ * - Timer frequency MSRs (HV_X64_MSR_TSC_FREQUENCY and
+ *   HV_X64_MSR_APIC_FREQUENCY)
+ */
+#define _HVMPV_base_freq 0
+#define HVMPV_base_freq  (1 << _HVMPV_base_freq)
+
+/* Feature set modifications */
+
+/* Disable timer frequency MSRs (HV_X64_MSR_TSC_FREQUENCY and
+ * HV_X64_MSR_APIC_FREQUENCY).
+ * This modification restores the viridian feature set to the
+ * original 'base' set exposed in releases prior to Xen 4.4.
+ */
+#define _HVMPV_no_freq 1
+#define HVMPV_no_freq  (1 << _HVMPV_no_freq)
+
+/* Enable Partition Time Reference Counter (HV_X64_MSR_TIME_REF_COUNT) */
+#define _HVMPV_time_ref_count 2
+#define HVMPV_time_ref_count  (1 << _HVMPV_time_ref_count)
+
+/* Enable Reference TSC Page (HV_X64_MSR_REFERENCE_TSC) */
+#define _HVMPV_reference_tsc 3
+#define HVMPV_reference_tsc  (1 << _HVMPV_reference_tsc)
+
+/* Use Hypercall for remote TLB flush */
+#define _HVMPV_hcall_remote_tlb_flush 4
+#define HVMPV_hcall_remote_tlb_flush (1 << _HVMPV_hcall_remote_tlb_flush)
+
+/* Use APIC assist */
+#define _HVMPV_apic_assist 5
+#define HVMPV_apic_assist (1 << _HVMPV_apic_assist)
+
+/* Enable crash MSRs */
+#define _HVMPV_crash_ctl 6
+#define HVMPV_crash_ctl (1 << _HVMPV_crash_ctl)
+
+/* Enable SYNIC MSRs */
+#define _HVMPV_synic 7
+#define HVMPV_synic (1 << _HVMPV_synic)
+
+/* Enable STIMER MSRs */
+#define _HVMPV_stimer 8
+#define HVMPV_stimer (1 << _HVMPV_stimer)
+
+/* Use Synthetic Cluster IPI Hypercall */
+#define _HVMPV_hcall_ipi 9
+#define HVMPV_hcall_ipi (1 << _HVMPV_hcall_ipi)
+
+/* Enable ExProcessorMasks */
+#define _HVMPV_ex_processor_masks 10
+#define HVMPV_ex_processor_masks (1 << _HVMPV_ex_processor_masks)
+
+/* Allow more than 64 VPs */
+#define _HVMPV_no_vp_limit 11
+#define HVMPV_no_vp_limit (1 << _HVMPV_no_vp_limit)
+
+/* Enable vCPU hotplug */
+#define _HVMPV_cpu_hotplug 12
+#define HVMPV_cpu_hotplug (1 << _HVMPV_cpu_hotplug)
+
+#define HVMPV_feature_mask \
+        (HVMPV_base_freq | \
+         HVMPV_no_freq | \
+         HVMPV_time_ref_count | \
+         HVMPV_reference_tsc | \
+         HVMPV_hcall_remote_tlb_flush | \
+         HVMPV_apic_assist | \
+         HVMPV_crash_ctl | \
+         HVMPV_synic | \
+         HVMPV_stimer | \
+         HVMPV_hcall_ipi | \
+         HVMPV_ex_processor_masks | \
+         HVMPV_no_vp_limit | \
+         HVMPV_cpu_hotplug)
+
+#endif
+
+/*
+ * Set mode for virtual timers (currently x86 only):
+ *  delay_for_missed_ticks (default):
+ *   Do not advance a vcpu's time beyond the correct delivery time for
+ *   interrupts that have been missed due to preemption. Deliver missed
+ *   interrupts when the vcpu is rescheduled and advance the vcpu's virtual
+ *   time stepwise for each one.
+ *  no_delay_for_missed_ticks:
+ *   As above, missed interrupts are delivered, but guest time always tracks
+ *   wallclock (i.e., real) time while doing so.
+ *  no_missed_ticks_pending:
+ *   No missed interrupts are held pending. Instead, to ensure ticks are
+ *   delivered at some non-zero rate, if we detect missed ticks then the
+ *   internal tick alarm is not disabled if the VCPU is preempted during the
+ *   next tick period.
+ *  one_missed_tick_pending:
+ *   Missed interrupts are collapsed together and delivered as one 'late tick'.
+ *   Guest time always tracks wallclock (i.e., real) time.
+ */
+#define HVM_PARAM_TIMER_MODE   10
+#define HVMPTM_delay_for_missed_ticks    0
+#define HVMPTM_no_delay_for_missed_ticks 1
+#define HVMPTM_no_missed_ticks_pending   2
+#define HVMPTM_one_missed_tick_pending   3
+
+/* Boolean: Enable virtual HPET (high-precision event timer)? (x86-only) */
+#define HVM_PARAM_HPET_ENABLED 11
+
+/* Identity-map page directory used by Intel EPT when CR0.PG=0. */
+#define HVM_PARAM_IDENT_PT     12
+
+/* ACPI S state: currently support S0 and S3 on x86. */
+#define HVM_PARAM_ACPI_S_STATE 14
+
+/* TSS used on Intel when CR0.PE=0. */
+#define HVM_PARAM_VM86_TSS     15
+
+/* Boolean: Enable aligning all periodic vpts to reduce interrupts */
+#define HVM_PARAM_VPT_ALIGN    16
+
+/* Console debug shared memory ring and event channel */
+#define HVM_PARAM_CONSOLE_PFN    17
+#define HVM_PARAM_CONSOLE_EVTCHN 18
+
+/*
+ * Select location of ACPI PM1a and TMR control blocks. Currently two locations
+ * are supported, specified by version 0 or 1 in this parameter:
+ *   - 0: default, use the old addresses
+ *        PM1A_EVT == 0x1f40; PM1A_CNT == 0x1f44; PM_TMR == 0x1f48
+ *   - 1: use the new default qemu addresses
+ *        PM1A_EVT == 0xb000; PM1A_CNT == 0xb004; PM_TMR == 0xb008
+ * You can find these address definitions in <hvm/ioreq.h>
+ */
+#define HVM_PARAM_ACPI_IOPORTS_LOCATION 19
+
+/* Params for the mem event rings */
+#define HVM_PARAM_PAGING_RING_PFN   27
+#define HVM_PARAM_MONITOR_RING_PFN  28
+#define HVM_PARAM_SHARING_RING_PFN  29
+
+/* SHUTDOWN_* action in case of a triple fault */
+#define HVM_PARAM_TRIPLE_FAULT_REASON 31
+
+#define HVM_PARAM_IOREQ_SERVER_PFN 32
+#define HVM_PARAM_NR_IOREQ_SERVER_PAGES 33
+
+/* Location of the VM Generation ID in guest physical address space. */
+#define HVM_PARAM_VM_GENERATION_ID_ADDR 34
+
+/*
+ * Set mode for altp2m:
+ *  disabled: don't activate altp2m (default)
+ *  mixed: allow access to all altp2m ops for both in-guest and external tools
+ *  external: allow access to external privileged tools only
+ *  limited: guest only has limited access (ie. control VMFUNC and #VE)
+ *
+ * Note that 'mixed' mode has not been evaluated for safety from a
+ * security perspective.  Before using this mode in a
+ * security-critical environment, each subop should be evaluated for
+ * safety, with unsafe subops blacklisted in XSM.
+ */
+#define HVM_PARAM_ALTP2M       35
+#define XEN_ALTP2M_disabled      0
+#define XEN_ALTP2M_mixed         1
+#define XEN_ALTP2M_external      2
+#define XEN_ALTP2M_limited       3
+
+/*
+ * Size of the x87 FPU FIP/FDP registers that the hypervisor needs to
+ * save/restore.  This is a workaround for a hardware limitation that
+ * does not allow the full FIP/FDP and FCS/FDS to be restored.
+ *
+ * Valid values are:
+ *
+ * 8: save/restore 64-bit FIP/FDP and clear FCS/FDS (default if CPU
+ *    has FPCSDS feature).
+ *
+ * 4: save/restore 32-bit FIP/FDP, FCS/FDS, and clear upper 32-bits of
+ *    FIP/FDP.
+ *
+ * 0: allow hypervisor to choose based on the value of FIP/FDP
+ *    (default if CPU does not have FPCSDS).
+ *
+ * If FPCSDS (bit 13 in CPUID leaf 0x7, subleaf 0x0) is set, the CPU
+ * never saves FCS/FDS and this parameter should be left at the
+ * default of 8.
+ */
+#define HVM_PARAM_X87_FIP_WIDTH 36
+
+/*
+ * TSS (and its size) used on Intel when CR0.PE=0. The address occupies
+ * the low 32 bits, while the size is in the high 32 ones.
+ */
+#define HVM_PARAM_VM86_TSS_SIZED 37
+
+/* Enable MCA capabilities. */
+#define HVM_PARAM_MCA_CAP 38
+#define XEN_HVM_MCA_CAP_LMCE   (xen_mk_ullong(1) << 0)
+#define XEN_HVM_MCA_CAP_MASK   XEN_HVM_MCA_CAP_LMCE
+
+#define HVM_NR_PARAMS 39
+
+#endif /* __XEN_PUBLIC_HVM_PARAMS_H__ */
diff --git a/include/standard-headers/xen/memory.h b/include/standard-headers/xen/memory.h
new file mode 100644
index 0000000000..383a9468c3
--- /dev/null
+++ b/include/standard-headers/xen/memory.h
@@ -0,0 +1,754 @@
+/******************************************************************************
+ * memory.h
+ *
+ * Memory reservation and information.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_MEMORY_H__
+#define __XEN_PUBLIC_MEMORY_H__
+
+#include "xen.h"
+#include "physdev.h"
+
+/*
+ * Increase or decrease the specified domain's memory reservation. Returns the
+ * number of extents successfully allocated or freed.
+ * arg == addr of struct xen_memory_reservation.
+ */
+#define XENMEM_increase_reservation 0
+#define XENMEM_decrease_reservation 1
+#define XENMEM_populate_physmap     6
+
+#if __XEN_INTERFACE_VERSION__ >= 0x00030209
+/*
+ * Maximum # bits addressable by the user of the allocated region (e.g., I/O
+ * devices often have a 32-bit limitation even in 64-bit systems). If zero
+ * then the user has no addressing restriction. This field is not used by
+ * XENMEM_decrease_reservation.
+ */
+#define XENMEMF_address_bits(x)     (x)
+#define XENMEMF_get_address_bits(x) ((x) & 0xffu)
+/* NUMA node to allocate from. */
+#define XENMEMF_node(x)     (((x) + 1) << 8)
+#define XENMEMF_get_node(x) ((((x) >> 8) - 1) & 0xffu)
+/* Flag to populate physmap with populate-on-demand entries */
+#define XENMEMF_populate_on_demand (1<<16)
+/* Flag to request allocation only from the node specified */
+#define XENMEMF_exact_node_request  (1<<17)
+#define XENMEMF_exact_node(n) (XENMEMF_node(n) | XENMEMF_exact_node_request)
+/* Flag to indicate the node specified is virtual node */
+#define XENMEMF_vnode  (1<<18)
+#endif
+
+struct xen_memory_reservation {
+
+    /*
+     * XENMEM_increase_reservation:
+     *   OUT: MFN (*not* GMFN) bases of extents that were allocated
+     * XENMEM_decrease_reservation:
+     *   IN:  GMFN bases of extents to free
+     * XENMEM_populate_physmap:
+     *   IN:  GPFN bases of extents to populate with memory
+     *   OUT: GMFN bases of extents that were allocated
+     *   (NB. This command also updates the mach_to_phys translation table)
+     * XENMEM_claim_pages:
+     *   IN: must be zero
+     */
+    XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
+
+    /* Number of extents, and size/alignment of each (2^extent_order pages). */
+    xen_ulong_t    nr_extents;
+    unsigned int   extent_order;
+
+#if __XEN_INTERFACE_VERSION__ >= 0x00030209
+    /* XENMEMF flags. */
+    unsigned int   mem_flags;
+#else
+    unsigned int   address_bits;
+#endif
+
+    /*
+     * Domain whose reservation is being changed.
+     * Unprivileged domains can specify only DOMID_SELF.
+     */
+    domid_t        domid;
+};
+typedef struct xen_memory_reservation xen_memory_reservation_t;
+DEFINE_XEN_GUEST_HANDLE(xen_memory_reservation_t);
+
+/*
+ * An atomic exchange of memory pages. If return code is zero then
+ * @out.extent_list provides GMFNs of the newly-allocated memory.
+ * Returns zero on complete success, otherwise a negative error code.
+ * On complete success then always @nr_exchanged == @in.nr_extents.
+ * On partial success @nr_exchanged indicates how much work was done.
+ *
+ * Note that only PV guests can use this operation.
+ */
+#define XENMEM_exchange             11
+struct xen_memory_exchange {
+    /*
+     * [IN] Details of memory extents to be exchanged (GMFN bases).
+     * Note that @in.address_bits is ignored and unused.
+     */
+    struct xen_memory_reservation in;
+
+    /*
+     * [IN/OUT] Details of new memory extents.
+     * We require that:
+     *  1. @in.domid == @out.domid
+     *  2. @in.nr_extents  << @in.extent_order ==
+     *     @out.nr_extents << @out.extent_order
+     *  3. @in.extent_start and @out.extent_start lists must not overlap
+     *  4. @out.extent_start lists GPFN bases to be populated
+     *  5. @out.extent_start is overwritten with allocated GMFN bases
+     */
+    struct xen_memory_reservation out;
+
+    /*
+     * [OUT] Number of input extents that were successfully exchanged:
+     *  1. The first @nr_exchanged input extents were successfully
+     *     deallocated.
+     *  2. The corresponding first entries in the output extent list correctly
+     *     indicate the GMFNs that were successfully exchanged.
+     *  3. All other input and output extents are untouched.
+     *  4. If not all input exents are exchanged then the return code of this
+     *     command will be non-zero.
+     *  5. THIS FIELD MUST BE INITIALISED TO ZERO BY THE CALLER!
+     */
+    xen_ulong_t nr_exchanged;
+};
+typedef struct xen_memory_exchange xen_memory_exchange_t;
+DEFINE_XEN_GUEST_HANDLE(xen_memory_exchange_t);
+
+/*
+ * Returns the maximum machine frame number of mapped RAM in this system.
+ * This command always succeeds (it never returns an error code).
+ * arg == NULL.
+ */
+#define XENMEM_maximum_ram_page     2
+
+struct xen_memory_domain {
+    /* [IN] Domain information is being queried for. */
+    domid_t domid;
+};
+
+/*
+ * Returns the current or maximum memory reservation, in pages, of the
+ * specified domain (may be DOMID_SELF). Returns -ve errcode on failure.
+ * arg == addr of struct xen_memory_domain.
+ */
+#define XENMEM_current_reservation  3
+#define XENMEM_maximum_reservation  4
+
+/*
+ * Returns the maximum GFN in use by the specified domain (may be DOMID_SELF).
+ * Returns -ve errcode on failure.
+ * arg == addr of struct xen_memory_domain.
+ */
+#define XENMEM_maximum_gpfn         14
+
+/*
+ * Returns a list of MFN bases of 2MB extents comprising the machine_to_phys
+ * mapping table. Architectures which do not have a m2p table do not implement
+ * this command.
+ * arg == addr of xen_machphys_mfn_list_t.
+ */
+#define XENMEM_machphys_mfn_list    5
+struct xen_machphys_mfn_list {
+    /*
+     * Size of the 'extent_start' array. Fewer entries will be filled if the
+     * machphys table is smaller than max_extents * 2MB.
+     */
+    unsigned int max_extents;
+
+    /*
+     * Pointer to buffer to fill with list of extent starts. If there are
+     * any large discontiguities in the machine address space, 2MB gaps in
+     * the machphys table will be represented by an MFN base of zero.
+     */
+    XEN_GUEST_HANDLE(xen_pfn_t) extent_start;
+
+    /*
+     * Number of extents written to the above array. This will be smaller
+     * than 'max_extents' if the machphys table is smaller than max_e * 2MB.
+     */
+    unsigned int nr_extents;
+};
+typedef struct xen_machphys_mfn_list xen_machphys_mfn_list_t;
+DEFINE_XEN_GUEST_HANDLE(xen_machphys_mfn_list_t);
+
+/*
+ * For a compat caller, this is identical to XENMEM_machphys_mfn_list.
+ *
+ * For a non compat caller, this functions similarly to
+ * XENMEM_machphys_mfn_list, but returns the mfns making up the compatibility
+ * m2p table.
+ */
+#define XENMEM_machphys_compat_mfn_list     25
+
+/*
+ * Returns the location in virtual address space of the machine_to_phys
+ * mapping table. Architectures which do not have a m2p table, or which do not
+ * map it by default into guest address space, do not implement this command.
+ * arg == addr of xen_machphys_mapping_t.
+ */
+#define XENMEM_machphys_mapping     12
+struct xen_machphys_mapping {
+    xen_ulong_t v_start, v_end; /* Start and end virtual addresses.   */
+    xen_ulong_t max_mfn;        /* Maximum MFN that can be looked up. */
+};
+typedef struct xen_machphys_mapping xen_machphys_mapping_t;
+DEFINE_XEN_GUEST_HANDLE(xen_machphys_mapping_t);
+
+/* Source mapping space. */
+/* ` enum phys_map_space { */
+#define XENMAPSPACE_shared_info  0 /* shared info page */
+#define XENMAPSPACE_grant_table  1 /* grant table page */
+#define XENMAPSPACE_gmfn         2 /* GMFN */
+#define XENMAPSPACE_gmfn_range   3 /* GMFN range, XENMEM_add_to_physmap only. */
+#define XENMAPSPACE_gmfn_foreign 4 /* GMFN from another dom,
+                                    * XENMEM_add_to_physmap_batch only. */
+#define XENMAPSPACE_dev_mmio     5 /* device mmio region
+                                      ARM only; the region is mapped in
+                                      Stage-2 using the Normal Memory
+                                      Inner/Outer Write-Back Cacheable
+                                      memory attribute. */
+/* ` } */
+
+/*
+ * Sets the GPFN at which a particular page appears in the specified guest's
+ * physical address space (translated guests only).
+ * arg == addr of xen_add_to_physmap_t.
+ */
+#define XENMEM_add_to_physmap      7
+struct xen_add_to_physmap {
+    /* Which domain to change the mapping for. */
+    domid_t domid;
+
+    /* Number of pages to go through for gmfn_range */
+    uint16_t    size;
+
+    unsigned int space; /* => enum phys_map_space */
+
+#define XENMAPIDX_grant_table_status 0x80000000
+
+    /* Index into space being mapped. */
+    xen_ulong_t idx;
+
+    /* GPFN in domid where the source mapping page should appear. */
+    xen_pfn_t     gpfn;
+};
+typedef struct xen_add_to_physmap xen_add_to_physmap_t;
+DEFINE_XEN_GUEST_HANDLE(xen_add_to_physmap_t);
+
+/* A batched version of add_to_physmap. */
+#define XENMEM_add_to_physmap_batch 23
+struct xen_add_to_physmap_batch {
+    /* IN */
+    /* Which domain to change the mapping for. */
+    domid_t domid;
+    uint16_t space; /* => enum phys_map_space */
+
+    /* Number of pages to go through */
+    uint16_t size;
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040700
+    domid_t foreign_domid; /* IFF gmfn_foreign. Should be 0 for other spaces. */
+#else
+    union xen_add_to_physmap_batch_extra {
+        domid_t foreign_domid; /* gmfn_foreign */
+        uint16_t res0;  /* All the other spaces. Should be 0 */
+    } u;
+#endif
+
+    /* Indexes into space being mapped. */
+    XEN_GUEST_HANDLE(xen_ulong_t) idxs;
+
+    /* GPFN in domid where the source mapping page should appear. */
+    XEN_GUEST_HANDLE(xen_pfn_t) gpfns;
+
+    /* OUT */
+
+    /* Per index error code. */
+    XEN_GUEST_HANDLE(int) errs;
+};
+typedef struct xen_add_to_physmap_batch xen_add_to_physmap_batch_t;
+DEFINE_XEN_GUEST_HANDLE(xen_add_to_physmap_batch_t);
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040400
+#define XENMEM_add_to_physmap_range XENMEM_add_to_physmap_batch
+#define xen_add_to_physmap_range xen_add_to_physmap_batch
+typedef struct xen_add_to_physmap_batch xen_add_to_physmap_range_t;
+DEFINE_XEN_GUEST_HANDLE(xen_add_to_physmap_range_t);
+#endif
+
+/*
+ * Unmaps the page appearing at a particular GPFN from the specified guest's
+ * physical address space (translated guests only).
+ * arg == addr of xen_remove_from_physmap_t.
+ */
+#define XENMEM_remove_from_physmap      15
+struct xen_remove_from_physmap {
+    /* Which domain to change the mapping for. */
+    domid_t domid;
+
+    /* GPFN of the current mapping of the page. */
+    xen_pfn_t     gpfn;
+};
+typedef struct xen_remove_from_physmap xen_remove_from_physmap_t;
+DEFINE_XEN_GUEST_HANDLE(xen_remove_from_physmap_t);
+
+/*** REMOVED ***/
+/*#define XENMEM_translate_gpfn_list  8*/
+
+/*
+ * Returns the pseudo-physical memory map as it was when the domain
+ * was started (specified by XENMEM_set_memory_map).
+ * arg == addr of xen_memory_map_t.
+ */
+#define XENMEM_memory_map           9
+struct xen_memory_map {
+    /*
+     * On call the number of entries which can be stored in buffer. On
+     * return the number of entries which have been stored in
+     * buffer.
+     */
+    unsigned int nr_entries;
+
+    /*
+     * Entries in the buffer are in the same format as returned by the
+     * BIOS INT 0x15 EAX=0xE820 call.
+     */
+    XEN_GUEST_HANDLE(void) buffer;
+};
+typedef struct xen_memory_map xen_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_memory_map_t);
+
+/*
+ * Returns the real physical memory map. Passes the same structure as
+ * XENMEM_memory_map.
+ * Specifying buffer as NULL will return the number of entries required
+ * to store the complete memory map.
+ * arg == addr of xen_memory_map_t.
+ */
+#define XENMEM_machine_memory_map   10
+
+/*
+ * Set the pseudo-physical memory map of a domain, as returned by
+ * XENMEM_memory_map.
+ * arg == addr of xen_foreign_memory_map_t.
+ */
+#define XENMEM_set_memory_map       13
+struct xen_foreign_memory_map {
+    domid_t domid;
+    struct xen_memory_map map;
+};
+typedef struct xen_foreign_memory_map xen_foreign_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_foreign_memory_map_t);
+
+#define XENMEM_set_pod_target       16
+#define XENMEM_get_pod_target       17
+struct xen_pod_target {
+    /* IN */
+    uint64_t target_pages;
+    /* OUT */
+    uint64_t tot_pages;
+    uint64_t pod_cache_pages;
+    uint64_t pod_entries;
+    /* IN */
+    domid_t domid;
+};
+typedef struct xen_pod_target xen_pod_target_t;
+
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+
+#ifndef uint64_aligned_t
+#define uint64_aligned_t uint64_t
+#endif
+
+/*
+ * Get the number of MFNs saved through memory sharing.
+ * The call never fails.
+ */
+#define XENMEM_get_sharing_freed_pages    18
+#define XENMEM_get_sharing_shared_pages   19
+
+#define XENMEM_paging_op                    20
+#define XENMEM_paging_op_nominate           0
+#define XENMEM_paging_op_evict              1
+#define XENMEM_paging_op_prep               2
+
+struct xen_mem_paging_op {
+    uint8_t     op;         /* XENMEM_paging_op_* */
+    domid_t     domain;
+
+    /* IN: (XENMEM_paging_op_prep) buffer to immediately fill page from */
+    XEN_GUEST_HANDLE_64(const_uint8) buffer;
+    /* IN:  gfn of page being operated on */
+    uint64_aligned_t    gfn;
+};
+typedef struct xen_mem_paging_op xen_mem_paging_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_mem_paging_op_t);
+
+#define XENMEM_access_op                    21
+#define XENMEM_access_op_set_access         0
+#define XENMEM_access_op_get_access         1
+/*
+ * XENMEM_access_op_enable_emulate and XENMEM_access_op_disable_emulate are
+ * currently unused, but since they have been in use please do not reuse them.
+ *
+ * #define XENMEM_access_op_enable_emulate     2
+ * #define XENMEM_access_op_disable_emulate    3
+ */
+#define XENMEM_access_op_set_access_multi   4
+
+typedef enum {
+    XENMEM_access_n,
+    XENMEM_access_r,
+    XENMEM_access_w,
+    XENMEM_access_rw,
+    XENMEM_access_x,
+    XENMEM_access_rx,
+    XENMEM_access_wx,
+    XENMEM_access_rwx,
+    /*
+     * Page starts off as r-x, but automatically
+     * change to r-w on a write
+     */
+    XENMEM_access_rx2rw,
+    /*
+     * Log access: starts off as n, automatically
+     * goes to rwx, generating an event without
+     * pausing the vcpu
+     */
+    XENMEM_access_n2rwx,
+    /* Take the domain default */
+    XENMEM_access_default
+} xenmem_access_t;
+
+struct xen_mem_access_op {
+    /* XENMEM_access_op_* */
+    uint8_t op;
+    /* xenmem_access_t */
+    uint8_t access;
+    domid_t domid;
+    /*
+     * Number of pages for set op (or size of pfn_list for
+     * XENMEM_access_op_set_access_multi)
+     * Ignored on setting default access and other ops
+     */
+    uint32_t nr;
+    /*
+     * First pfn for set op
+     * pfn for get op
+     * ~0ull is used to set and get the default access for pages
+     */
+    uint64_aligned_t pfn;
+    /*
+     * List of pfns to set access for
+     * Used only with XENMEM_access_op_set_access_multi
+     */
+    XEN_GUEST_HANDLE(const_uint64) pfn_list;
+    /*
+     * Corresponding list of access settings for pfn_list
+     * Used only with XENMEM_access_op_set_access_multi
+     */
+    XEN_GUEST_HANDLE(const_uint8) access_list;
+};
+typedef struct xen_mem_access_op xen_mem_access_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_mem_access_op_t);
+
+#define XENMEM_sharing_op                   22
+#define XENMEM_sharing_op_nominate_gfn      0
+#define XENMEM_sharing_op_nominate_gref     1
+#define XENMEM_sharing_op_share             2
+#define XENMEM_sharing_op_debug_gfn         3
+#define XENMEM_sharing_op_debug_mfn         4
+#define XENMEM_sharing_op_debug_gref        5
+#define XENMEM_sharing_op_add_physmap       6
+#define XENMEM_sharing_op_audit             7
+#define XENMEM_sharing_op_range_share       8
+#define XENMEM_sharing_op_fork              9
+#define XENMEM_sharing_op_fork_reset        10
+
+#define XENMEM_SHARING_OP_S_HANDLE_INVALID  (-10)
+#define XENMEM_SHARING_OP_C_HANDLE_INVALID  (-9)
+
+/* The following allows sharing of grant refs. This is useful
+ * for sharing utilities sitting as "filters" in IO backends
+ * (e.g. memshr + blktap(2)). The IO backend is only exposed
+ * to grant references, and this allows sharing of the grefs */
+#define XENMEM_SHARING_OP_FIELD_IS_GREF_FLAG   (xen_mk_ullong(1) << 62)
+
+#define XENMEM_SHARING_OP_FIELD_MAKE_GREF(field, val)  \
+    (field) = (XENMEM_SHARING_OP_FIELD_IS_GREF_FLAG | val)
+#define XENMEM_SHARING_OP_FIELD_IS_GREF(field)         \
+    ((field) & XENMEM_SHARING_OP_FIELD_IS_GREF_FLAG)
+#define XENMEM_SHARING_OP_FIELD_GET_GREF(field)        \
+    ((field) & (~XENMEM_SHARING_OP_FIELD_IS_GREF_FLAG))
+
+struct xen_mem_sharing_op {
+    uint8_t     op;     /* XENMEM_sharing_op_* */
+    domid_t     domain;
+
+    union {
+        struct mem_sharing_op_nominate {  /* OP_NOMINATE_xxx           */
+            union {
+                uint64_aligned_t gfn;     /* IN: gfn to nominate       */
+                uint32_t      grant_ref;  /* IN: grant ref to nominate */
+            } u;
+            uint64_aligned_t  handle;     /* OUT: the handle           */
+        } nominate;
+        struct mem_sharing_op_share {     /* OP_SHARE/ADD_PHYSMAP */
+            uint64_aligned_t source_gfn;    /* IN: the gfn of the source page */
+            uint64_aligned_t source_handle; /* IN: handle to the source page */
+            uint64_aligned_t client_gfn;    /* IN: the client gfn */
+            uint64_aligned_t client_handle; /* IN: handle to the client page */
+            domid_t  client_domain; /* IN: the client domain id */
+        } share;
+        struct mem_sharing_op_range {         /* OP_RANGE_SHARE */
+            uint64_aligned_t first_gfn;      /* IN: the first gfn */
+            uint64_aligned_t last_gfn;       /* IN: the last gfn */
+            uint64_aligned_t opaque;         /* Must be set to 0 */
+            domid_t client_domain;           /* IN: the client domain id */
+            uint16_t _pad[3];                /* Must be set to 0 */
+        } range;
+        struct mem_sharing_op_debug {     /* OP_DEBUG_xxx */
+            union {
+                uint64_aligned_t gfn;      /* IN: gfn to debug          */
+                uint64_aligned_t mfn;      /* IN: mfn to debug          */
+                uint32_t gref;     /* IN: gref to debug         */
+            } u;
+        } debug;
+        struct mem_sharing_op_fork {      /* OP_FORK */
+            domid_t parent_domain;        /* IN: parent's domain id */
+/* Only makes sense for short-lived forks */
+#define XENMEM_FORK_WITH_IOMMU_ALLOWED (1u << 0)
+/* Only makes sense for short-lived forks */
+#define XENMEM_FORK_BLOCK_INTERRUPTS   (1u << 1)
+            uint16_t flags;               /* IN: optional settings */
+            uint32_t pad;                 /* Must be set to 0 */
+        } fork;
+    } u;
+};
+typedef struct xen_mem_sharing_op xen_mem_sharing_op_t;
+DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
+
+/*
+ * Attempt to stake a claim for a domain on a quantity of pages
+ * of system RAM, but _not_ assign specific pageframes.  Only
+ * arithmetic is performed so the hypercall is very fast and need
+ * not be preemptible, thus sidestepping time-of-check-time-of-use
+ * races for memory allocation.  Returns 0 if the hypervisor page
+ * allocator has atomically and successfully claimed the requested
+ * number of pages, else non-zero.
+ *
+ * Any domain may have only one active claim.  When sufficient memory
+ * has been allocated to resolve the claim, the claim silently expires.
+ * Claiming zero pages effectively resets any outstanding claim and
+ * is always successful.
+ *
+ * Note that a valid claim may be staked even after memory has been
+ * allocated for a domain.  In this case, the claim is not incremental,
+ * i.e. if the domain's total page count is 3, and a claim is staked
+ * for 10, only 7 additional pages are claimed.
+ *
+ * Caller must be privileged or the hypercall fails.
+ */
+#define XENMEM_claim_pages                  24
+
+/*
+ * XENMEM_claim_pages flags - the are no flags at this time.
+ * The zero value is appropriate.
+ */
+
+/*
+ * With some legacy devices, certain guest-physical addresses cannot safely
+ * be used for other purposes, e.g. to map guest RAM.  This hypercall
+ * enumerates those regions so the toolstack can avoid using them.
+ */
+#define XENMEM_reserved_device_memory_map   27
+struct xen_reserved_device_memory {
+    xen_pfn_t start_pfn;
+    xen_ulong_t nr_pages;
+};
+typedef struct xen_reserved_device_memory xen_reserved_device_memory_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_t);
+
+struct xen_reserved_device_memory_map {
+#define XENMEM_RDM_ALL 1 /* Request all regions (ignore dev union). */
+    /* IN */
+    uint32_t flags;
+    /*
+     * IN/OUT
+     *
+     * Gets set to the required number of entries when too low,
+     * signaled by error code -ERANGE.
+     */
+    unsigned int nr_entries;
+    /* OUT */
+    XEN_GUEST_HANDLE(xen_reserved_device_memory_t) buffer;
+    /* IN */
+    union {
+        physdev_pci_device_t pci;
+    } dev;
+};
+typedef struct xen_reserved_device_memory_map xen_reserved_device_memory_map_t;
+DEFINE_XEN_GUEST_HANDLE(xen_reserved_device_memory_map_t);
+
+#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
+
+/*
+ * Get the pages for a particular guest resource, so that they can be
+ * mapped directly by a tools domain.
+ */
+#define XENMEM_acquire_resource 28
+struct xen_mem_acquire_resource {
+    /* IN - The domain whose resource is to be mapped */
+    domid_t domid;
+    /* IN - the type of resource */
+    uint16_t type;
+
+#define XENMEM_resource_ioreq_server 0
+#define XENMEM_resource_grant_table 1
+#define XENMEM_resource_vmtrace_buf 2
+
+    /*
+     * IN - a type-specific resource identifier, which must be zero
+     *      unless stated otherwise.
+     *
+     * type == XENMEM_resource_ioreq_server -> id == ioreq server id
+     * type == XENMEM_resource_grant_table -> id defined below
+     */
+    uint32_t id;
+
+#define XENMEM_resource_grant_table_id_shared 0
+#define XENMEM_resource_grant_table_id_status 1
+
+    /*
+     * IN/OUT
+     *
+     * As an IN parameter number of frames of the resource to be mapped.
+     * This value may be updated over the course of the operation.
+     *
+     * When frame_list is NULL and nr_frames is 0, this is interpreted as a
+     * request for the size of the resource, which shall be returned in the
+     * nr_frames field.
+     *
+     * The size of a resource will never be zero, but a nonzero result doesn't
+     * guarantee that a subsequent mapping request will be successful.  There
+     * are further type/id specific constraints which may change between the
+     * two calls.
+     */
+    uint32_t nr_frames;
+    uint32_t pad;
+    /*
+     * IN - the index of the initial frame to be mapped. This parameter
+     *      is ignored if nr_frames is 0.  This value may be updated
+     *      over the course of the operation.
+     */
+    uint64_t frame;
+
+#define XENMEM_resource_ioreq_server_frame_bufioreq 0
+#define XENMEM_resource_ioreq_server_frame_ioreq(n) (1 + (n))
+
+    /*
+     * IN/OUT - If the tools domain is PV then, upon return, frame_list
+     *          will be populated with the MFNs of the resource.
+     *          If the tools domain is HVM then it is expected that, on
+     *          entry, frame_list will be populated with a list of GFNs
+     *          that will be mapped to the MFNs of the resource.
+     *          If -EIO is returned then the frame_list has only been
+     *          partially mapped and it is up to the caller to unmap all
+     *          the GFNs.
+     *          This parameter may be NULL if nr_frames is 0.  This
+     *          value may be updated over the course of the operation.
+     */
+    XEN_GUEST_HANDLE(xen_pfn_t) frame_list;
+};
+typedef struct xen_mem_acquire_resource xen_mem_acquire_resource_t;
+DEFINE_XEN_GUEST_HANDLE(xen_mem_acquire_resource_t);
+
+/*
+ * XENMEM_get_vnumainfo used by guest to get
+ * vNUMA topology from hypervisor.
+ */
+#define XENMEM_get_vnumainfo                26
+
+/* vNUMA node memory ranges */
+struct xen_vmemrange {
+    uint64_t start, end;
+    unsigned int flags;
+    unsigned int nid;
+};
+typedef struct xen_vmemrange xen_vmemrange_t;
+DEFINE_XEN_GUEST_HANDLE(xen_vmemrange_t);
+
+/*
+ * vNUMA topology specifies vNUMA node number, distance table,
+ * memory ranges and vcpu mapping provided for guests.
+ * XENMEM_get_vnumainfo hypercall expects to see from guest
+ * nr_vnodes, nr_vmemranges and nr_vcpus to indicate available memory.
+ * After filling guests structures, nr_vnodes, nr_vmemranges and nr_vcpus
+ * copied back to guest. Domain returns expected values of nr_vnodes,
+ * nr_vmemranges and nr_vcpus to guest if the values where incorrect.
+ */
+struct xen_vnuma_topology_info {
+    /* IN */
+    domid_t domid;
+    uint16_t pad;
+    /* IN/OUT */
+    unsigned int nr_vnodes;
+    unsigned int nr_vcpus;
+    unsigned int nr_vmemranges;
+    /* OUT */
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t pad;
+    } vdistance;
+    union {
+        XEN_GUEST_HANDLE(uint) h;
+        uint64_t pad;
+    } vcpu_to_vnode;
+    union {
+        XEN_GUEST_HANDLE(xen_vmemrange_t) h;
+        uint64_t pad;
+    } vmemrange;
+};
+typedef struct xen_vnuma_topology_info xen_vnuma_topology_info_t;
+DEFINE_XEN_GUEST_HANDLE(xen_vnuma_topology_info_t);
+
+/* Next available subop number is 29 */
+
+#endif /* __XEN_PUBLIC_MEMORY_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/physdev.h b/include/standard-headers/xen/physdev.h
new file mode 100644
index 0000000000..d271766ad0
--- /dev/null
+++ b/include/standard-headers/xen/physdev.h
@@ -0,0 +1,383 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2006, Keir Fraser
+ */
+
+#ifndef __XEN_PUBLIC_PHYSDEV_H__
+#define __XEN_PUBLIC_PHYSDEV_H__
+
+#include "xen.h"
+
+/*
+ * Prototype for this hypercall is:
+ *  int physdev_op(int cmd, void *args)
+ * @cmd  == PHYSDEVOP_??? (physdev operation).
+ * @args == Operation-specific extra arguments (NULL if none).
+ */
+
+/*
+ * Notify end-of-interrupt (EOI) for the specified IRQ.
+ * @arg == pointer to physdev_eoi structure.
+ */
+#define PHYSDEVOP_eoi                   12
+struct physdev_eoi {
+    /* IN */
+    uint32_t irq;
+};
+typedef struct physdev_eoi physdev_eoi_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_eoi_t);
+
+/*
+ * Register a shared page for the hypervisor to indicate whether the guest
+ * must issue PHYSDEVOP_eoi. The semantics of PHYSDEVOP_eoi change slightly
+ * once the guest used this function in that the associated event channel
+ * will automatically get unmasked. The page registered is used as a bit
+ * array indexed by Xen's PIRQ value.
+ */
+#define PHYSDEVOP_pirq_eoi_gmfn_v1       17
+/*
+ * Register a shared page for the hypervisor to indicate whether the
+ * guest must issue PHYSDEVOP_eoi. This hypercall is very similar to
+ * PHYSDEVOP_pirq_eoi_gmfn_v1 but it doesn't change the semantics of
+ * PHYSDEVOP_eoi. The page registered is used as a bit array indexed by
+ * Xen's PIRQ value.
+ */
+#define PHYSDEVOP_pirq_eoi_gmfn_v2       28
+struct physdev_pirq_eoi_gmfn {
+    /* IN */
+    xen_pfn_t gmfn;
+};
+typedef struct physdev_pirq_eoi_gmfn physdev_pirq_eoi_gmfn_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_pirq_eoi_gmfn_t);
+
+/*
+ * Query the status of an IRQ line.
+ * @arg == pointer to physdev_irq_status_query structure.
+ */
+#define PHYSDEVOP_irq_status_query       5
+struct physdev_irq_status_query {
+    /* IN */
+    uint32_t irq;
+    /* OUT */
+    uint32_t flags; /* XENIRQSTAT_* */
+};
+typedef struct physdev_irq_status_query physdev_irq_status_query_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_irq_status_query_t);
+
+/* Need to call PHYSDEVOP_eoi when the IRQ has been serviced? */
+#define _XENIRQSTAT_needs_eoi   (0)
+#define  XENIRQSTAT_needs_eoi   (1U<<_XENIRQSTAT_needs_eoi)
+
+/* IRQ shared by multiple guests? */
+#define _XENIRQSTAT_shared      (1)
+#define  XENIRQSTAT_shared      (1U<<_XENIRQSTAT_shared)
+
+/*
+ * Set the current VCPU's I/O privilege level.
+ * @arg == pointer to physdev_set_iopl structure.
+ */
+#define PHYSDEVOP_set_iopl               6
+struct physdev_set_iopl {
+    /* IN */
+    uint32_t iopl;
+};
+typedef struct physdev_set_iopl physdev_set_iopl_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_set_iopl_t);
+
+/*
+ * Set the current VCPU's I/O-port permissions bitmap.
+ * @arg == pointer to physdev_set_iobitmap structure.
+ */
+#define PHYSDEVOP_set_iobitmap           7
+struct physdev_set_iobitmap {
+    /* IN */
+#if __XEN_INTERFACE_VERSION__ >= 0x00030205
+    XEN_GUEST_HANDLE(uint8) bitmap;
+#else
+    uint8_t *bitmap;
+#endif
+    uint32_t nr_ports;
+};
+typedef struct physdev_set_iobitmap physdev_set_iobitmap_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_set_iobitmap_t);
+
+/*
+ * Read or write an IO-APIC register.
+ * @arg == pointer to physdev_apic structure.
+ */
+#define PHYSDEVOP_apic_read              8
+#define PHYSDEVOP_apic_write             9
+struct physdev_apic {
+    /* IN */
+    unsigned long apic_physbase;
+    uint32_t reg;
+    /* IN or OUT */
+    uint32_t value;
+};
+typedef struct physdev_apic physdev_apic_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_apic_t);
+
+/*
+ * Allocate or free a physical upcall vector for the specified IRQ line.
+ * @arg == pointer to physdev_irq structure.
+ */
+#define PHYSDEVOP_alloc_irq_vector      10
+#define PHYSDEVOP_free_irq_vector       11
+struct physdev_irq {
+    /* IN */
+    uint32_t irq;
+    /* IN or OUT */
+    uint32_t vector;
+};
+typedef struct physdev_irq physdev_irq_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_irq_t);
+
+#define MAP_PIRQ_TYPE_MSI               0x0
+#define MAP_PIRQ_TYPE_GSI               0x1
+#define MAP_PIRQ_TYPE_UNKNOWN           0x2
+#define MAP_PIRQ_TYPE_MSI_SEG           0x3
+#define MAP_PIRQ_TYPE_MULTI_MSI         0x4
+
+#define PHYSDEVOP_map_pirq               13
+struct physdev_map_pirq {
+    domid_t domid;
+    /* IN */
+    int type;
+    /* IN (ignored for ..._MULTI_MSI) */
+    int index;
+    /* IN or OUT */
+    int pirq;
+    /* IN - high 16 bits hold segment for ..._MSI_SEG and ..._MULTI_MSI */
+    int bus;
+    /* IN */
+    int devfn;
+    /* IN (also OUT for ..._MULTI_MSI) */
+    int entry_nr;
+    /* IN */
+    uint64_t table_base;
+};
+typedef struct physdev_map_pirq physdev_map_pirq_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_map_pirq_t);
+
+#define PHYSDEVOP_unmap_pirq             14
+struct physdev_unmap_pirq {
+    domid_t domid;
+    /* IN */
+    int pirq;
+};
+
+typedef struct physdev_unmap_pirq physdev_unmap_pirq_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_unmap_pirq_t);
+
+#define PHYSDEVOP_manage_pci_add         15
+#define PHYSDEVOP_manage_pci_remove      16
+struct physdev_manage_pci {
+    /* IN */
+    uint8_t bus;
+    uint8_t devfn;
+};
+
+typedef struct physdev_manage_pci physdev_manage_pci_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_manage_pci_t);
+
+#define PHYSDEVOP_restore_msi            19
+struct physdev_restore_msi {
+    /* IN */
+    uint8_t bus;
+    uint8_t devfn;
+};
+typedef struct physdev_restore_msi physdev_restore_msi_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_restore_msi_t);
+
+#define PHYSDEVOP_manage_pci_add_ext     20
+struct physdev_manage_pci_ext {
+    /* IN */
+    uint8_t bus;
+    uint8_t devfn;
+    unsigned is_extfn;
+    unsigned is_virtfn;
+    struct {
+        uint8_t bus;
+        uint8_t devfn;
+    } physfn;
+};
+
+typedef struct physdev_manage_pci_ext physdev_manage_pci_ext_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_manage_pci_ext_t);
+
+/*
+ * Argument to physdev_op_compat() hypercall. Superceded by new physdev_op()
+ * hypercall since 0x00030202.
+ */
+struct physdev_op {
+    uint32_t cmd;
+    union {
+        physdev_irq_status_query_t irq_status_query;
+        physdev_set_iopl_t         set_iopl;
+        physdev_set_iobitmap_t     set_iobitmap;
+        physdev_apic_t             apic_op;
+        physdev_irq_t              irq_op;
+    } u;
+};
+typedef struct physdev_op physdev_op_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_op_t);
+
+#define PHYSDEVOP_setup_gsi    21
+struct physdev_setup_gsi {
+    int gsi;
+    /* IN */
+    uint8_t triggering;
+    /* IN */
+    uint8_t polarity;
+    /* IN */
+};
+
+typedef struct physdev_setup_gsi physdev_setup_gsi_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_setup_gsi_t);
+
+/* leave PHYSDEVOP 22 free */
+
+/* type is MAP_PIRQ_TYPE_GSI or MAP_PIRQ_TYPE_MSI
+ * the hypercall returns a free pirq */
+#define PHYSDEVOP_get_free_pirq    23
+struct physdev_get_free_pirq {
+    /* IN */
+    int type;
+    /* OUT */
+    uint32_t pirq;
+};
+
+typedef struct physdev_get_free_pirq physdev_get_free_pirq_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_get_free_pirq_t);
+
+#define XEN_PCI_MMCFG_RESERVED         0x1
+
+#define PHYSDEVOP_pci_mmcfg_reserved    24
+struct physdev_pci_mmcfg_reserved {
+    uint64_t address;
+    uint16_t segment;
+    uint8_t start_bus;
+    uint8_t end_bus;
+    uint32_t flags;
+};
+typedef struct physdev_pci_mmcfg_reserved physdev_pci_mmcfg_reserved_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_pci_mmcfg_reserved_t);
+
+#define XEN_PCI_DEV_EXTFN              0x1
+#define XEN_PCI_DEV_VIRTFN             0x2
+#define XEN_PCI_DEV_PXM                0x4
+
+#define PHYSDEVOP_pci_device_add        25
+struct physdev_pci_device_add {
+    /* IN */
+    uint16_t seg;
+    uint8_t bus;
+    uint8_t devfn;
+    uint32_t flags;
+    struct {
+        uint8_t bus;
+        uint8_t devfn;
+    } physfn;
+    /*
+     * Optional parameters array.
+     * First element ([0]) is PXM domain associated with the device (if
+     * XEN_PCI_DEV_PXM is set)
+     */
+    uint32_t optarr[XEN_FLEX_ARRAY_DIM];
+};
+typedef struct physdev_pci_device_add physdev_pci_device_add_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_pci_device_add_t);
+
+#define PHYSDEVOP_pci_device_remove     26
+#define PHYSDEVOP_restore_msi_ext       27
+/*
+ * Dom0 should use these two to announce MMIO resources assigned to
+ * MSI-X capable devices won't (prepare) or may (release) change.
+ */
+#define PHYSDEVOP_prepare_msix          30
+#define PHYSDEVOP_release_msix          31
+struct physdev_pci_device {
+    /* IN */
+    uint16_t seg;
+    uint8_t bus;
+    uint8_t devfn;
+};
+typedef struct physdev_pci_device physdev_pci_device_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_pci_device_t);
+
+#define PHYSDEVOP_DBGP_RESET_PREPARE    1
+#define PHYSDEVOP_DBGP_RESET_DONE       2
+
+#define PHYSDEVOP_DBGP_BUS_UNKNOWN      0
+#define PHYSDEVOP_DBGP_BUS_PCI          1
+
+#define PHYSDEVOP_dbgp_op               29
+struct physdev_dbgp_op {
+    /* IN */
+    uint8_t op;
+    uint8_t bus;
+    union {
+        physdev_pci_device_t pci;
+    } u;
+};
+typedef struct physdev_dbgp_op physdev_dbgp_op_t;
+DEFINE_XEN_GUEST_HANDLE(physdev_dbgp_op_t);
+
+/*
+ * Notify that some PIRQ-bound event channels have been unmasked.
+ * ** This command is obsolete since interface version 0x00030202 and is **
+ * ** unsupported by newer versions of Xen.                              **
+ */
+#define PHYSDEVOP_IRQ_UNMASK_NOTIFY      4
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040600
+/*
+ * These all-capitals physdev operation names are superceded by the new names
+ * (defined above) since interface version 0x00030202. The guard above was
+ * added post-4.5 only though and hence shouldn't check for 0x00030202.
+ */
+#define PHYSDEVOP_IRQ_STATUS_QUERY       PHYSDEVOP_irq_status_query
+#define PHYSDEVOP_SET_IOPL               PHYSDEVOP_set_iopl
+#define PHYSDEVOP_SET_IOBITMAP           PHYSDEVOP_set_iobitmap
+#define PHYSDEVOP_APIC_READ              PHYSDEVOP_apic_read
+#define PHYSDEVOP_APIC_WRITE             PHYSDEVOP_apic_write
+#define PHYSDEVOP_ASSIGN_VECTOR          PHYSDEVOP_alloc_irq_vector
+#define PHYSDEVOP_FREE_VECTOR            PHYSDEVOP_free_irq_vector
+#define PHYSDEVOP_IRQ_NEEDS_UNMASK_NOTIFY XENIRQSTAT_needs_eoi
+#define PHYSDEVOP_IRQ_SHARED             XENIRQSTAT_shared
+#endif
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040200
+#define PHYSDEVOP_pirq_eoi_gmfn PHYSDEVOP_pirq_eoi_gmfn_v1
+#else
+#define PHYSDEVOP_pirq_eoi_gmfn PHYSDEVOP_pirq_eoi_gmfn_v2
+#endif
+
+#endif /* __XEN_PUBLIC_PHYSDEV_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/sched.h b/include/standard-headers/xen/sched.h
new file mode 100644
index 0000000000..811bd87c82
--- /dev/null
+++ b/include/standard-headers/xen/sched.h
@@ -0,0 +1,202 @@
+/******************************************************************************
+ * sched.h
+ *
+ * Scheduler state interactions
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_SCHED_H__
+#define __XEN_PUBLIC_SCHED_H__
+
+#include "event_channel.h"
+
+/*
+ * `incontents 150 sched Guest Scheduler Operations
+ *
+ * The SCHEDOP interface provides mechanisms for a guest to interact
+ * with the scheduler, including yield, blocking and shutting itself
+ * down.
+ */
+
+/*
+ * The prototype for this hypercall is:
+ * ` long HYPERVISOR_sched_op(enum sched_op cmd, void *arg, ...)
+ *
+ * @cmd == SCHEDOP_??? (scheduler operation).
+ * @arg == Operation-specific extra argument(s), as described below.
+ * ...  == Additional Operation-specific extra arguments, described below.
+ *
+ * Versions of Xen prior to 3.0.2 provided only the following legacy version
+ * of this hypercall, supporting only the commands yield, block and shutdown:
+ *  long sched_op(int cmd, unsigned long arg)
+ * @cmd == SCHEDOP_??? (scheduler operation).
+ * @arg == 0               (SCHEDOP_yield and SCHEDOP_block)
+ *      == SHUTDOWN_* code (SCHEDOP_shutdown)
+ *
+ * This legacy version is available to new guests as:
+ * ` long HYPERVISOR_sched_op_compat(enum sched_op cmd, unsigned long arg)
+ */
+
+/* ` enum sched_op { // SCHEDOP_* => struct sched_* */
+/*
+ * Voluntarily yield the CPU.
+ * @arg == NULL.
+ */
+#define SCHEDOP_yield       0
+
+/*
+ * Block execution of this VCPU until an event is received for processing.
+ * If called with event upcalls masked, this operation will atomically
+ * reenable event delivery and check for pending events before blocking the
+ * VCPU. This avoids a "wakeup waiting" race.
+ * @arg == NULL.
+ */
+#define SCHEDOP_block       1
+
+/*
+ * Halt execution of this domain (all VCPUs) and notify the system controller.
+ * @arg == pointer to sched_shutdown_t structure.
+ *
+ * If the sched_shutdown_t reason is SHUTDOWN_suspend then
+ * x86 PV guests must also set RDX (EDX for 32-bit guests) to the MFN
+ * of the guest's start info page.  RDX/EDX is the third hypercall
+ * argument.
+ *
+ * In addition, which reason is SHUTDOWN_suspend this hypercall
+ * returns 1 if suspend was cancelled or the domain was merely
+ * checkpointed, and 0 if it is resuming in a new domain.
+ */
+#define SCHEDOP_shutdown    2
+
+/*
+ * Poll a set of event-channel ports. Return when one or more are pending. An
+ * optional timeout may be specified.
+ * @arg == pointer to sched_poll_t structure.
+ */
+#define SCHEDOP_poll        3
+
+/*
+ * Declare a shutdown for another domain. The main use of this function is
+ * in interpreting shutdown requests and reasons for fully-virtualized
+ * domains.  A para-virtualized domain may use SCHEDOP_shutdown directly.
+ * @arg == pointer to sched_remote_shutdown_t structure.
+ */
+#define SCHEDOP_remote_shutdown        4
+
+/*
+ * Latch a shutdown code, so that when the domain later shuts down it
+ * reports this code to the control tools.
+ * @arg == sched_shutdown_t, as for SCHEDOP_shutdown.
+ */
+#define SCHEDOP_shutdown_code 5
+
+/*
+ * Setup, poke and destroy a domain watchdog timer.
+ * @arg == pointer to sched_watchdog_t structure.
+ * With id == 0, setup a domain watchdog timer to cause domain shutdown
+ *               after timeout, returns watchdog id.
+ * With id != 0 and timeout == 0, destroy domain watchdog timer.
+ * With id != 0 and timeout != 0, poke watchdog timer and set new timeout.
+ */
+#define SCHEDOP_watchdog    6
+
+/*
+ * Override the current vcpu affinity by pinning it to one physical cpu or
+ * undo this override restoring the previous affinity.
+ * @arg == pointer to sched_pin_override_t structure.
+ *
+ * A negative pcpu value will undo a previous pin override and restore the
+ * previous cpu affinity.
+ * This call is allowed for the hardware domain only and requires the cpu
+ * to be part of the domain's cpupool.
+ */
+#define SCHEDOP_pin_override 7
+/* ` } */
+
+struct sched_shutdown {
+    unsigned int reason; /* SHUTDOWN_* => enum sched_shutdown_reason */
+};
+typedef struct sched_shutdown sched_shutdown_t;
+DEFINE_XEN_GUEST_HANDLE(sched_shutdown_t);
+
+struct sched_poll {
+    XEN_GUEST_HANDLE(evtchn_port_t) ports;
+    unsigned int nr_ports;
+    uint64_t timeout;
+};
+typedef struct sched_poll sched_poll_t;
+DEFINE_XEN_GUEST_HANDLE(sched_poll_t);
+
+struct sched_remote_shutdown {
+    domid_t domain_id;         /* Remote domain ID */
+    unsigned int reason;       /* SHUTDOWN_* => enum sched_shutdown_reason */
+};
+typedef struct sched_remote_shutdown sched_remote_shutdown_t;
+DEFINE_XEN_GUEST_HANDLE(sched_remote_shutdown_t);
+
+struct sched_watchdog {
+    uint32_t id;                /* watchdog ID */
+    uint32_t timeout;           /* timeout */
+};
+typedef struct sched_watchdog sched_watchdog_t;
+DEFINE_XEN_GUEST_HANDLE(sched_watchdog_t);
+
+struct sched_pin_override {
+    int32_t pcpu;
+};
+typedef struct sched_pin_override sched_pin_override_t;
+DEFINE_XEN_GUEST_HANDLE(sched_pin_override_t);
+
+/*
+ * Reason codes for SCHEDOP_shutdown. These may be interpreted by control
+ * software to determine the appropriate action. For the most part, Xen does
+ * not care about the shutdown code.
+ */
+/* ` enum sched_shutdown_reason { */
+#define SHUTDOWN_poweroff   0  /* Domain exited normally. Clean up and kill. */
+#define SHUTDOWN_reboot     1  /* Clean up, kill, and then restart.          */
+#define SHUTDOWN_suspend    2  /* Clean up, save suspend info, kill.         */
+#define SHUTDOWN_crash      3  /* Tell controller we've crashed.             */
+#define SHUTDOWN_watchdog   4  /* Restart because watchdog time expired.     */
+
+/*
+ * Domain asked to perform 'soft reset' for it. The expected behavior is to
+ * reset internal Xen state for the domain returning it to the point where it
+ * was created but leaving the domain's memory contents and vCPU contexts
+ * intact. This will allow the domain to start over and set up all Xen specific
+ * interfaces again.
+ */
+#define SHUTDOWN_soft_reset 5
+#define SHUTDOWN_MAX        5  /* Maximum valid shutdown reason.             */
+/* ` } */
+
+#endif /* __XEN_PUBLIC_SCHED_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/trace.h b/include/standard-headers/xen/trace.h
new file mode 100644
index 0000000000..d5fa4aea8d
--- /dev/null
+++ b/include/standard-headers/xen/trace.h
@@ -0,0 +1,341 @@
+/******************************************************************************
+ * include/public/trace.h
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Mark Williamson, (C) 2004 Intel Research Cambridge
+ * Copyright (C) 2005 Bin Ren
+ */
+
+#ifndef __XEN_PUBLIC_TRACE_H__
+#define __XEN_PUBLIC_TRACE_H__
+
+#define TRACE_EXTRA_MAX    7
+#define TRACE_EXTRA_SHIFT 28
+
+/* Trace classes */
+#define TRC_CLS_SHIFT 16
+#define TRC_GEN      0x0001f000    /* General trace            */
+#define TRC_SCHED    0x0002f000    /* Xen Scheduler trace      */
+#define TRC_DOM0OP   0x0004f000    /* Xen DOM0 operation trace */
+#define TRC_HVM      0x0008f000    /* Xen HVM trace            */
+#define TRC_MEM      0x0010f000    /* Xen memory trace         */
+#define TRC_PV       0x0020f000    /* Xen PV traces            */
+#define TRC_SHADOW   0x0040f000    /* Xen shadow tracing       */
+#define TRC_HW       0x0080f000    /* Xen hardware-related traces */
+#define TRC_GUEST    0x0800f000    /* Guest-generated traces   */
+#define TRC_ALL      0x0ffff000
+#define TRC_HD_TO_EVENT(x) ((x)&0x0fffffff)
+#define TRC_HD_CYCLE_FLAG (1UL<<31)
+#define TRC_HD_INCLUDES_CYCLE_COUNT(x) ( !!( (x) & TRC_HD_CYCLE_FLAG ) )
+#define TRC_HD_EXTRA(x)    (((x)>>TRACE_EXTRA_SHIFT)&TRACE_EXTRA_MAX)
+
+/* Trace subclasses */
+#define TRC_SUBCLS_SHIFT 12
+
+/* trace subclasses for SVM */
+#define TRC_HVM_ENTRYEXIT   0x00081000   /* VMENTRY and #VMEXIT       */
+#define TRC_HVM_HANDLER     0x00082000   /* various HVM handlers      */
+#define TRC_HVM_EMUL        0x00084000   /* emulated devices */
+
+#define TRC_SCHED_MIN       0x00021000   /* Just runstate changes */
+#define TRC_SCHED_CLASS     0x00022000   /* Scheduler-specific    */
+#define TRC_SCHED_VERBOSE   0x00028000   /* More inclusive scheduling */
+
+/*
+ * The highest 3 bits of the last 12 bits of TRC_SCHED_CLASS above are
+ * reserved for encoding what scheduler produced the information. The
+ * actual event is encoded in the last 9 bits.
+ *
+ * This means we have 8 scheduling IDs available (which means at most 8
+ * schedulers generating events) and, in each scheduler, up to 512
+ * different events.
+ */
+#define TRC_SCHED_ID_BITS 3
+#define TRC_SCHED_ID_SHIFT (TRC_SUBCLS_SHIFT - TRC_SCHED_ID_BITS)
+#define TRC_SCHED_ID_MASK (((1UL<<TRC_SCHED_ID_BITS) - 1) << TRC_SCHED_ID_SHIFT)
+#define TRC_SCHED_EVT_MASK (~(TRC_SCHED_ID_MASK))
+
+/* Per-scheduler IDs, to identify scheduler specific events */
+#define TRC_SCHED_CSCHED   0
+#define TRC_SCHED_CSCHED2  1
+/* #define XEN_SCHEDULER_SEDF 2 (Removed) */
+#define TRC_SCHED_ARINC653 3
+#define TRC_SCHED_RTDS     4
+#define TRC_SCHED_SNULL    5
+
+/* Per-scheduler tracing */
+#define TRC_SCHED_CLASS_EVT(_c, _e) \
+  ( ( TRC_SCHED_CLASS | \
+      ((TRC_SCHED_##_c << TRC_SCHED_ID_SHIFT) & TRC_SCHED_ID_MASK) ) + \
+    (_e & TRC_SCHED_EVT_MASK) )
+
+/* Trace classes for DOM0 operations */
+#define TRC_DOM0_DOMOPS     0x00041000   /* Domains manipulations */
+
+/* Trace classes for Hardware */
+#define TRC_HW_PM           0x00801000   /* Power management traces */
+#define TRC_HW_IRQ          0x00802000   /* Traces relating to the handling of IRQs */
+
+/* Trace events per class */
+#define TRC_LOST_RECORDS        (TRC_GEN + 1)
+#define TRC_TRACE_WRAP_BUFFER  (TRC_GEN + 2)
+#define TRC_TRACE_CPU_CHANGE    (TRC_GEN + 3)
+
+#define TRC_SCHED_RUNSTATE_CHANGE   (TRC_SCHED_MIN + 1)
+#define TRC_SCHED_CONTINUE_RUNNING  (TRC_SCHED_MIN + 2)
+#define TRC_SCHED_DOM_ADD        (TRC_SCHED_VERBOSE +  1)
+#define TRC_SCHED_DOM_REM        (TRC_SCHED_VERBOSE +  2)
+#define TRC_SCHED_SLEEP          (TRC_SCHED_VERBOSE +  3)
+#define TRC_SCHED_WAKE           (TRC_SCHED_VERBOSE +  4)
+#define TRC_SCHED_YIELD          (TRC_SCHED_VERBOSE +  5)
+#define TRC_SCHED_BLOCK          (TRC_SCHED_VERBOSE +  6)
+#define TRC_SCHED_SHUTDOWN       (TRC_SCHED_VERBOSE +  7)
+#define TRC_SCHED_CTL            (TRC_SCHED_VERBOSE +  8)
+#define TRC_SCHED_ADJDOM         (TRC_SCHED_VERBOSE +  9)
+#define TRC_SCHED_SWITCH         (TRC_SCHED_VERBOSE + 10)
+#define TRC_SCHED_S_TIMER_FN     (TRC_SCHED_VERBOSE + 11)
+#define TRC_SCHED_T_TIMER_FN     (TRC_SCHED_VERBOSE + 12)
+#define TRC_SCHED_DOM_TIMER_FN   (TRC_SCHED_VERBOSE + 13)
+#define TRC_SCHED_SWITCH_INFPREV (TRC_SCHED_VERBOSE + 14)
+#define TRC_SCHED_SWITCH_INFNEXT (TRC_SCHED_VERBOSE + 15)
+#define TRC_SCHED_SHUTDOWN_CODE  (TRC_SCHED_VERBOSE + 16)
+#define TRC_SCHED_SWITCH_INFCONT (TRC_SCHED_VERBOSE + 17)
+
+#define TRC_DOM0_DOM_ADD         (TRC_DOM0_DOMOPS + 1)
+#define TRC_DOM0_DOM_REM         (TRC_DOM0_DOMOPS + 2)
+
+#define TRC_MEM_PAGE_GRANT_MAP      (TRC_MEM + 1)
+#define TRC_MEM_PAGE_GRANT_UNMAP    (TRC_MEM + 2)
+#define TRC_MEM_PAGE_GRANT_TRANSFER (TRC_MEM + 3)
+#define TRC_MEM_SET_P2M_ENTRY       (TRC_MEM + 4)
+#define TRC_MEM_DECREASE_RESERVATION (TRC_MEM + 5)
+#define TRC_MEM_POD_POPULATE        (TRC_MEM + 16)
+#define TRC_MEM_POD_ZERO_RECLAIM    (TRC_MEM + 17)
+#define TRC_MEM_POD_SUPERPAGE_SPLINTER (TRC_MEM + 18)
+
+#define TRC_PV_ENTRY   0x00201000 /* Hypervisor entry points for PV guests. */
+#define TRC_PV_SUBCALL 0x00202000 /* Sub-call in a multicall hypercall */
+
+#define TRC_PV_HYPERCALL             (TRC_PV_ENTRY +  1)
+#define TRC_PV_TRAP                  (TRC_PV_ENTRY +  3)
+#define TRC_PV_PAGE_FAULT            (TRC_PV_ENTRY +  4)
+#define TRC_PV_FORCED_INVALID_OP     (TRC_PV_ENTRY +  5)
+#define TRC_PV_EMULATE_PRIVOP        (TRC_PV_ENTRY +  6)
+#define TRC_PV_EMULATE_4GB           (TRC_PV_ENTRY +  7)
+#define TRC_PV_MATH_STATE_RESTORE    (TRC_PV_ENTRY +  8)
+#define TRC_PV_PAGING_FIXUP          (TRC_PV_ENTRY +  9)
+#define TRC_PV_GDT_LDT_MAPPING_FAULT (TRC_PV_ENTRY + 10)
+#define TRC_PV_PTWR_EMULATION        (TRC_PV_ENTRY + 11)
+#define TRC_PV_PTWR_EMULATION_PAE    (TRC_PV_ENTRY + 12)
+#define TRC_PV_HYPERCALL_V2          (TRC_PV_ENTRY + 13)
+#define TRC_PV_HYPERCALL_SUBCALL     (TRC_PV_SUBCALL + 14)
+
+/*
+ * TRC_PV_HYPERCALL_V2 format
+ *
+ * Only some of the hypercall argument are recorded. Bit fields A0 to
+ * A5 in the first extra word are set if the argument is present and
+ * the arguments themselves are packed sequentially in the following
+ * words.
+ *
+ * The TRC_64_FLAG bit is not set for these events (even if there are
+ * 64-bit arguments in the record).
+ *
+ * Word
+ * 0    bit 31 30|29 28|27 26|25 24|23 22|21 20|19 ... 0
+ *          A5   |A4   |A3   |A2   |A1   |A0   |Hypercall op
+ * 1    First 32 bit (or low word of first 64 bit) arg in record
+ * 2    Second 32 bit (or high word of first 64 bit) arg in record
+ * ...
+ *
+ * A0-A5 bitfield values:
+ *
+ *   00b  Argument not present
+ *   01b  32-bit argument present
+ *   10b  64-bit argument present
+ *   11b  Reserved
+ */
+#define TRC_PV_HYPERCALL_V2_ARG_32(i) (0x1 << (20 + 2*(i)))
+#define TRC_PV_HYPERCALL_V2_ARG_64(i) (0x2 << (20 + 2*(i)))
+#define TRC_PV_HYPERCALL_V2_ARG_MASK  (0xfff00000)
+
+#define TRC_SHADOW_NOT_SHADOW                 (TRC_SHADOW +  1)
+#define TRC_SHADOW_FAST_PROPAGATE             (TRC_SHADOW +  2)
+#define TRC_SHADOW_FAST_MMIO                  (TRC_SHADOW +  3)
+#define TRC_SHADOW_FALSE_FAST_PATH            (TRC_SHADOW +  4)
+#define TRC_SHADOW_MMIO                       (TRC_SHADOW +  5)
+#define TRC_SHADOW_FIXUP                      (TRC_SHADOW +  6)
+#define TRC_SHADOW_DOMF_DYING                 (TRC_SHADOW +  7)
+#define TRC_SHADOW_EMULATE                    (TRC_SHADOW +  8)
+#define TRC_SHADOW_EMULATE_UNSHADOW_USER      (TRC_SHADOW +  9)
+#define TRC_SHADOW_EMULATE_UNSHADOW_EVTINJ    (TRC_SHADOW + 10)
+#define TRC_SHADOW_EMULATE_UNSHADOW_UNHANDLED (TRC_SHADOW + 11)
+#define TRC_SHADOW_WRMAP_BF                   (TRC_SHADOW + 12)
+#define TRC_SHADOW_PREALLOC_UNPIN             (TRC_SHADOW + 13)
+#define TRC_SHADOW_RESYNC_FULL                (TRC_SHADOW + 14)
+#define TRC_SHADOW_RESYNC_ONLY                (TRC_SHADOW + 15)
+
+/* trace events per subclass */
+#define TRC_HVM_NESTEDFLAG      (0x400)
+#define TRC_HVM_VMENTRY         (TRC_HVM_ENTRYEXIT + 0x01)
+#define TRC_HVM_VMEXIT          (TRC_HVM_ENTRYEXIT + 0x02)
+#define TRC_HVM_VMEXIT64        (TRC_HVM_ENTRYEXIT + TRC_64_FLAG + 0x02)
+#define TRC_HVM_PF_XEN          (TRC_HVM_HANDLER + 0x01)
+#define TRC_HVM_PF_XEN64        (TRC_HVM_HANDLER + TRC_64_FLAG + 0x01)
+#define TRC_HVM_PF_INJECT       (TRC_HVM_HANDLER + 0x02)
+#define TRC_HVM_PF_INJECT64     (TRC_HVM_HANDLER + TRC_64_FLAG + 0x02)
+#define TRC_HVM_INJ_EXC         (TRC_HVM_HANDLER + 0x03)
+#define TRC_HVM_INJ_VIRQ        (TRC_HVM_HANDLER + 0x04)
+#define TRC_HVM_REINJ_VIRQ      (TRC_HVM_HANDLER + 0x05)
+#define TRC_HVM_IO_READ         (TRC_HVM_HANDLER + 0x06)
+#define TRC_HVM_IO_WRITE        (TRC_HVM_HANDLER + 0x07)
+#define TRC_HVM_CR_READ         (TRC_HVM_HANDLER + 0x08)
+#define TRC_HVM_CR_READ64       (TRC_HVM_HANDLER + TRC_64_FLAG + 0x08)
+#define TRC_HVM_CR_WRITE        (TRC_HVM_HANDLER + 0x09)
+#define TRC_HVM_CR_WRITE64      (TRC_HVM_HANDLER + TRC_64_FLAG + 0x09)
+#define TRC_HVM_DR_READ         (TRC_HVM_HANDLER + 0x0A)
+#define TRC_HVM_DR_WRITE        (TRC_HVM_HANDLER + 0x0B)
+#define TRC_HVM_MSR_READ        (TRC_HVM_HANDLER + 0x0C)
+#define TRC_HVM_MSR_WRITE       (TRC_HVM_HANDLER + 0x0D)
+#define TRC_HVM_CPUID           (TRC_HVM_HANDLER + 0x0E)
+#define TRC_HVM_INTR            (TRC_HVM_HANDLER + 0x0F)
+#define TRC_HVM_NMI             (TRC_HVM_HANDLER + 0x10)
+#define TRC_HVM_SMI             (TRC_HVM_HANDLER + 0x11)
+#define TRC_HVM_VMMCALL         (TRC_HVM_HANDLER + 0x12)
+#define TRC_HVM_HLT             (TRC_HVM_HANDLER + 0x13)
+#define TRC_HVM_INVLPG          (TRC_HVM_HANDLER + 0x14)
+#define TRC_HVM_INVLPG64        (TRC_HVM_HANDLER + TRC_64_FLAG + 0x14)
+#define TRC_HVM_MCE             (TRC_HVM_HANDLER + 0x15)
+#define TRC_HVM_IOPORT_READ     (TRC_HVM_HANDLER + 0x16)
+#define TRC_HVM_IOMEM_READ      (TRC_HVM_HANDLER + 0x17)
+#define TRC_HVM_CLTS            (TRC_HVM_HANDLER + 0x18)
+#define TRC_HVM_LMSW            (TRC_HVM_HANDLER + 0x19)
+#define TRC_HVM_LMSW64          (TRC_HVM_HANDLER + TRC_64_FLAG + 0x19)
+#define TRC_HVM_RDTSC           (TRC_HVM_HANDLER + 0x1a)
+#define TRC_HVM_INTR_WINDOW     (TRC_HVM_HANDLER + 0x20)
+#define TRC_HVM_NPF             (TRC_HVM_HANDLER + 0x21)
+#define TRC_HVM_REALMODE_EMULATE (TRC_HVM_HANDLER + 0x22)
+#define TRC_HVM_TRAP             (TRC_HVM_HANDLER + 0x23)
+#define TRC_HVM_TRAP_DEBUG       (TRC_HVM_HANDLER + 0x24)
+#define TRC_HVM_VLAPIC           (TRC_HVM_HANDLER + 0x25)
+#define TRC_HVM_XCR_READ64      (TRC_HVM_HANDLER + TRC_64_FLAG + 0x26)
+#define TRC_HVM_XCR_WRITE64     (TRC_HVM_HANDLER + TRC_64_FLAG + 0x27)
+
+#define TRC_HVM_IOPORT_WRITE    (TRC_HVM_HANDLER + 0x216)
+#define TRC_HVM_IOMEM_WRITE     (TRC_HVM_HANDLER + 0x217)
+
+/* Trace events for emulated devices */
+#define TRC_HVM_EMUL_HPET_START_TIMER  (TRC_HVM_EMUL + 0x1)
+#define TRC_HVM_EMUL_PIT_START_TIMER   (TRC_HVM_EMUL + 0x2)
+#define TRC_HVM_EMUL_RTC_START_TIMER   (TRC_HVM_EMUL + 0x3)
+#define TRC_HVM_EMUL_LAPIC_START_TIMER (TRC_HVM_EMUL + 0x4)
+#define TRC_HVM_EMUL_HPET_STOP_TIMER   (TRC_HVM_EMUL + 0x5)
+#define TRC_HVM_EMUL_PIT_STOP_TIMER    (TRC_HVM_EMUL + 0x6)
+#define TRC_HVM_EMUL_RTC_STOP_TIMER    (TRC_HVM_EMUL + 0x7)
+#define TRC_HVM_EMUL_LAPIC_STOP_TIMER  (TRC_HVM_EMUL + 0x8)
+#define TRC_HVM_EMUL_PIT_TIMER_CB      (TRC_HVM_EMUL + 0x9)
+#define TRC_HVM_EMUL_LAPIC_TIMER_CB    (TRC_HVM_EMUL + 0xA)
+#define TRC_HVM_EMUL_PIC_INT_OUTPUT    (TRC_HVM_EMUL + 0xB)
+#define TRC_HVM_EMUL_PIC_KICK          (TRC_HVM_EMUL + 0xC)
+#define TRC_HVM_EMUL_PIC_INTACK        (TRC_HVM_EMUL + 0xD)
+#define TRC_HVM_EMUL_PIC_POSEDGE       (TRC_HVM_EMUL + 0xE)
+#define TRC_HVM_EMUL_PIC_NEGEDGE       (TRC_HVM_EMUL + 0xF)
+#define TRC_HVM_EMUL_PIC_PEND_IRQ_CALL (TRC_HVM_EMUL + 0x10)
+#define TRC_HVM_EMUL_LAPIC_PIC_INTR    (TRC_HVM_EMUL + 0x11)
+
+/* trace events for per class */
+#define TRC_PM_FREQ_CHANGE      (TRC_HW_PM + 0x01)
+#define TRC_PM_IDLE_ENTRY       (TRC_HW_PM + 0x02)
+#define TRC_PM_IDLE_EXIT        (TRC_HW_PM + 0x03)
+
+/* Trace events for IRQs */
+#define TRC_HW_IRQ_MOVE_CLEANUP_DELAY (TRC_HW_IRQ + 0x1)
+#define TRC_HW_IRQ_MOVE_CLEANUP       (TRC_HW_IRQ + 0x2)
+#define TRC_HW_IRQ_BIND_VECTOR        (TRC_HW_IRQ + 0x3)
+#define TRC_HW_IRQ_CLEAR_VECTOR       (TRC_HW_IRQ + 0x4)
+#define TRC_HW_IRQ_MOVE_FINISH        (TRC_HW_IRQ + 0x5)
+#define TRC_HW_IRQ_ASSIGN_VECTOR      (TRC_HW_IRQ + 0x6)
+#define TRC_HW_IRQ_UNMAPPED_VECTOR    (TRC_HW_IRQ + 0x7)
+#define TRC_HW_IRQ_HANDLED            (TRC_HW_IRQ + 0x8)
+
+/*
+ * Event Flags
+ *
+ * Some events (e.g, TRC_PV_TRAP and TRC_HVM_IOMEM_READ) have multiple
+ * record formats.  These event flags distinguish between the
+ * different formats.
+ */
+#define TRC_64_FLAG 0x100 /* Addresses are 64 bits (instead of 32 bits) */
+
+/* This structure represents a single trace buffer record. */
+struct t_rec {
+    uint32_t event:28;
+    uint32_t extra_u32:3;         /* # entries in trailing extra_u32[] array */
+    uint32_t cycles_included:1;   /* u.cycles or u.no_cycles? */
+    union {
+        struct {
+            uint32_t cycles_lo, cycles_hi; /* cycle counter timestamp */
+            uint32_t extra_u32[7];         /* event data items */
+        } cycles;
+        struct {
+            uint32_t extra_u32[7];         /* event data items */
+        } nocycles;
+    } u;
+};
+
+/*
+ * This structure contains the metadata for a single trace buffer.  The head
+ * field, indexes into an array of struct t_rec's.
+ */
+struct t_buf {
+    /* Assume the data buffer size is X.  X is generally not a power of 2.
+     * CONS and PROD are incremented modulo (2*X):
+     *     0 <= cons < 2*X
+     *     0 <= prod < 2*X
+     * This is done because addition modulo X breaks at 2^32 when X is not a
+     * power of 2:
+     *     (((2^32 - 1) % X) + 1) % X != (2^32) % X
+     */
+    uint32_t cons;   /* Offset of next item to be consumed by control tools. */
+    uint32_t prod;   /* Offset of next item to be produced by Xen.           */
+    /*  Records follow immediately after the meta-data header.    */
+};
+
+/* Structure used to pass MFNs to the trace buffers back to trace consumers.
+ * Offset is an offset into the mapped structure where the mfn list will be held.
+ * MFNs will be at ((unsigned long *)(t_info))+(t_info->cpu_offset[cpu]).
+ */
+struct t_info {
+    uint16_t tbuf_size; /* Size in pages of each trace buffer */
+    uint16_t mfn_offset[];  /* Offset within t_info structure of the page list per cpu */
+    /* MFN lists immediately after the header */
+};
+
+#endif /* __XEN_PUBLIC_TRACE_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/vcpu.h b/include/standard-headers/xen/vcpu.h
new file mode 100644
index 0000000000..3623af932f
--- /dev/null
+++ b/include/standard-headers/xen/vcpu.h
@@ -0,0 +1,248 @@
+/******************************************************************************
+ * vcpu.h
+ *
+ * VCPU initialisation, query, and hotplug.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_VCPU_H__
+#define __XEN_PUBLIC_VCPU_H__
+
+#include "xen.h"
+
+/*
+ * Prototype for this hypercall is:
+ *  long vcpu_op(int cmd, unsigned int vcpuid, void *extra_args)
+ * @cmd        == VCPUOP_??? (VCPU operation).
+ * @vcpuid     == VCPU to operate on.
+ * @extra_args == Operation-specific extra arguments (NULL if none).
+ */
+
+/*
+ * Initialise a VCPU. Each VCPU can be initialised only once. A
+ * newly-initialised VCPU will not run until it is brought up by VCPUOP_up.
+ *
+ * @extra_arg == For PV or ARM guests this is a pointer to a vcpu_guest_context
+ *               structure containing the initial state for the VCPU. For x86
+ *               HVM based guests this is a pointer to a vcpu_hvm_context
+ *               structure.
+ */
+#define VCPUOP_initialise            0
+
+/*
+ * Bring up a VCPU. This makes the VCPU runnable. This operation will fail
+ * if the VCPU has not been initialised (VCPUOP_initialise).
+ */
+#define VCPUOP_up                    1
+
+/*
+ * Bring down a VCPU (i.e., make it non-runnable).
+ * There are a few caveats that callers should observe:
+ *  1. This operation may return, and VCPU_is_up may return false, before the
+ *     VCPU stops running (i.e., the command is asynchronous). It is a good
+ *     idea to ensure that the VCPU has entered a non-critical loop before
+ *     bringing it down. Alternatively, this operation is guaranteed
+ *     synchronous if invoked by the VCPU itself.
+ *  2. After a VCPU is initialised, there is currently no way to drop all its
+ *     references to domain memory. Even a VCPU that is down still holds
+ *     memory references via its pagetable base pointer and GDT. It is good
+ *     practise to move a VCPU onto an 'idle' or default page table, LDT and
+ *     GDT before bringing it down.
+ */
+#define VCPUOP_down                  2
+
+/* Returns 1 if the given VCPU is up. */
+#define VCPUOP_is_up                 3
+
+/*
+ * Return information about the state and running time of a VCPU.
+ * @extra_arg == pointer to vcpu_runstate_info structure.
+ */
+#define VCPUOP_get_runstate_info     4
+struct vcpu_runstate_info {
+    /* VCPU's current state (RUNSTATE_*). */
+    int      state;
+    /* When was current state entered (system time, ns)? */
+    uint64_t state_entry_time;
+    /*
+     * Update indicator set in state_entry_time:
+     * When activated via VMASST_TYPE_runstate_update_flag, set during
+     * updates in guest memory mapped copy of vcpu_runstate_info.
+     */
+#define XEN_RUNSTATE_UPDATE          (xen_mk_ullong(1) << 63)
+    /*
+     * Time spent in each RUNSTATE_* (ns). The sum of these times is
+     * guaranteed not to drift from system time.
+     */
+    uint64_t time[4];
+};
+typedef struct vcpu_runstate_info vcpu_runstate_info_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_runstate_info_t);
+
+/* VCPU is currently running on a physical CPU. */
+#define RUNSTATE_running  0
+
+/* VCPU is runnable, but not currently scheduled on any physical CPU. */
+#define RUNSTATE_runnable 1
+
+/* VCPU is blocked (a.k.a. idle). It is therefore not runnable. */
+#define RUNSTATE_blocked  2
+
+/*
+ * VCPU is not runnable, but it is not blocked.
+ * This is a 'catch all' state for things like hotplug and pauses by the
+ * system administrator (or for critical sections in the hypervisor).
+ * RUNSTATE_blocked dominates this state (it is the preferred state).
+ */
+#define RUNSTATE_offline  3
+
+/*
+ * Register a shared memory area from which the guest may obtain its own
+ * runstate information without needing to execute a hypercall.
+ * Notes:
+ *  1. The registered address may be virtual or physical or guest handle,
+ *     depending on the platform. Virtual address or guest handle should be
+ *     registered on x86 systems.
+ *  2. Only one shared area may be registered per VCPU. The shared area is
+ *     updated by the hypervisor each time the VCPU is scheduled. Thus
+ *     runstate.state will always be RUNSTATE_running and
+ *     runstate.state_entry_time will indicate the system time at which the
+ *     VCPU was last scheduled to run.
+ * @extra_arg == pointer to vcpu_register_runstate_memory_area structure.
+ */
+#define VCPUOP_register_runstate_memory_area 5
+struct vcpu_register_runstate_memory_area {
+    union {
+        XEN_GUEST_HANDLE(vcpu_runstate_info_t) h;
+        struct vcpu_runstate_info *v;
+        uint64_t p;
+    } addr;
+};
+typedef struct vcpu_register_runstate_memory_area vcpu_register_runstate_memory_area_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_register_runstate_memory_area_t);
+
+/*
+ * Set or stop a VCPU's periodic timer. Every VCPU has one periodic timer
+ * which can be set via these commands. Periods smaller than one millisecond
+ * may not be supported.
+ */
+#define VCPUOP_set_periodic_timer    6 /* arg == vcpu_set_periodic_timer_t */
+#define VCPUOP_stop_periodic_timer   7 /* arg == NULL */
+struct vcpu_set_periodic_timer {
+    uint64_t period_ns;
+};
+typedef struct vcpu_set_periodic_timer vcpu_set_periodic_timer_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_set_periodic_timer_t);
+
+/*
+ * Set or stop a VCPU's single-shot timer. Every VCPU has one single-shot
+ * timer which can be set via these commands.
+ */
+#define VCPUOP_set_singleshot_timer  8 /* arg == vcpu_set_singleshot_timer_t */
+#define VCPUOP_stop_singleshot_timer 9 /* arg == NULL */
+struct vcpu_set_singleshot_timer {
+    uint64_t timeout_abs_ns;   /* Absolute system time value in nanoseconds. */
+    uint32_t flags;            /* VCPU_SSHOTTMR_??? */
+};
+typedef struct vcpu_set_singleshot_timer vcpu_set_singleshot_timer_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_set_singleshot_timer_t);
+
+/* Flags to VCPUOP_set_singleshot_timer. */
+ /* Require the timeout to be in the future (return -ETIME if it's passed). */
+#define _VCPU_SSHOTTMR_future (0)
+#define VCPU_SSHOTTMR_future  (1U << _VCPU_SSHOTTMR_future)
+
+/*
+ * Register a memory location in the guest address space for the
+ * vcpu_info structure.  This allows the guest to place the vcpu_info
+ * structure in a convenient place, such as in a per-cpu data area.
+ * The pointer need not be page aligned, but the structure must not
+ * cross a page boundary.
+ *
+ * This may be called only once per vcpu.
+ */
+#define VCPUOP_register_vcpu_info   10  /* arg == vcpu_register_vcpu_info_t */
+struct vcpu_register_vcpu_info {
+    uint64_t mfn;    /* mfn of page to place vcpu_info */
+    uint32_t offset; /* offset within page */
+    uint32_t rsvd;   /* unused */
+};
+typedef struct vcpu_register_vcpu_info vcpu_register_vcpu_info_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_register_vcpu_info_t);
+
+/* Send an NMI to the specified VCPU. @extra_arg == NULL. */
+#define VCPUOP_send_nmi             11
+
+/*
+ * Get the physical ID information for a pinned vcpu's underlying physical
+ * processor.  The physical ID informmation is architecture-specific.
+ * On x86: id[31:0]=apic_id, id[63:32]=acpi_id.
+ * This command returns -EINVAL if it is not a valid operation for this VCPU.
+ */
+#define VCPUOP_get_physid           12 /* arg == vcpu_get_physid_t */
+struct vcpu_get_physid {
+    uint64_t phys_id;
+};
+typedef struct vcpu_get_physid vcpu_get_physid_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_get_physid_t);
+#define xen_vcpu_physid_to_x86_apicid(physid) ((uint32_t)(physid))
+#define xen_vcpu_physid_to_x86_acpiid(physid) ((uint32_t)((physid) >> 32))
+
+/*
+ * Register a memory location to get a secondary copy of the vcpu time
+ * parameters.  The master copy still exists as part of the vcpu shared
+ * memory area, and this secondary copy is updated whenever the master copy
+ * is updated (and using the same versioning scheme for synchronisation).
+ *
+ * The intent is that this copy may be mapped (RO) into userspace so
+ * that usermode can compute system time using the time info and the
+ * tsc.  Usermode will see an array of vcpu_time_info structures, one
+ * for each vcpu, and choose the right one by an existing mechanism
+ * which allows it to get the current vcpu number (such as via a
+ * segment limit).  It can then apply the normal algorithm to compute
+ * system time from the tsc.
+ *
+ * @extra_arg == pointer to vcpu_register_time_info_memory_area structure.
+ */
+#define VCPUOP_register_vcpu_time_memory_area   13
+DEFINE_XEN_GUEST_HANDLE(vcpu_time_info_t);
+struct vcpu_register_time_memory_area {
+    union {
+        XEN_GUEST_HANDLE(vcpu_time_info_t) h;
+        struct vcpu_time_info *v;
+        uint64_t p;
+    } addr;
+};
+typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
+DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
+
+#endif /* __XEN_PUBLIC_VCPU_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/version.h b/include/standard-headers/xen/version.h
new file mode 100644
index 0000000000..17a81e23cd
--- /dev/null
+++ b/include/standard-headers/xen/version.h
@@ -0,0 +1,113 @@
+/******************************************************************************
+ * version.h
+ *
+ * Xen version, type, and compile information.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2005, Nguyen Anh Quynh <aquynh@gmail.com>
+ * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
+ */
+
+#ifndef __XEN_PUBLIC_VERSION_H__
+#define __XEN_PUBLIC_VERSION_H__
+
+#include "xen.h"
+
+/* NB. All ops return zero on success, except XENVER_{version,pagesize}
+ * XENVER_{version,pagesize,build_id} */
+
+/* arg == NULL; returns major:minor (16:16). */
+#define XENVER_version      0
+
+/* arg == xen_extraversion_t. */
+#define XENVER_extraversion 1
+typedef char xen_extraversion_t[16];
+#define XEN_EXTRAVERSION_LEN (sizeof(xen_extraversion_t))
+
+/* arg == xen_compile_info_t. */
+#define XENVER_compile_info 2
+struct xen_compile_info {
+    char compiler[64];
+    char compile_by[16];
+    char compile_domain[32];
+    char compile_date[32];
+};
+typedef struct xen_compile_info xen_compile_info_t;
+
+#define XENVER_capabilities 3
+typedef char xen_capabilities_info_t[1024];
+#define XEN_CAPABILITIES_INFO_LEN (sizeof(xen_capabilities_info_t))
+
+#define XENVER_changeset 4
+typedef char xen_changeset_info_t[64];
+#define XEN_CHANGESET_INFO_LEN (sizeof(xen_changeset_info_t))
+
+#define XENVER_platform_parameters 5
+struct xen_platform_parameters {
+    xen_ulong_t virt_start;
+};
+typedef struct xen_platform_parameters xen_platform_parameters_t;
+
+#define XENVER_get_features 6
+struct xen_feature_info {
+    unsigned int submap_idx;    /* IN: which 32-bit submap to return */
+    uint32_t     submap;        /* OUT: 32-bit submap */
+};
+typedef struct xen_feature_info xen_feature_info_t;
+
+/* Declares the features reported by XENVER_get_features. */
+#include "features.h"
+
+/* arg == NULL; returns host memory page size. */
+#define XENVER_pagesize 7
+
+/* arg == xen_domain_handle_t.
+ *
+ * The toolstack fills it out for guest consumption. It is intended to hold
+ * the UUID of the guest.
+ */
+#define XENVER_guest_handle 8
+
+#define XENVER_commandline 9
+typedef char xen_commandline_t[1024];
+
+/*
+ * Return value is the number of bytes written, or XEN_Exx on error.
+ * Calling with empty parameter returns the size of build_id.
+ */
+#define XENVER_build_id 10
+struct xen_build_id {
+        uint32_t        len; /* IN: size of buf[]. */
+        unsigned char   buf[XEN_FLEX_ARRAY_DIM];
+                             /* OUT: Variable length buffer with build_id. */
+};
+typedef struct xen_build_id xen_build_id_t;
+
+#endif /* __XEN_PUBLIC_VERSION_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/standard-headers/xen/xen-compat.h b/include/standard-headers/xen/xen-compat.h
new file mode 100644
index 0000000000..e1c027a95c
--- /dev/null
+++ b/include/standard-headers/xen/xen-compat.h
@@ -0,0 +1,46 @@
+/******************************************************************************
+ * xen-compat.h
+ *
+ * Guest OS interface to Xen.  Compatibility layer.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2006, Christian Limpach
+ */
+
+#ifndef __XEN_PUBLIC_XEN_COMPAT_H__
+#define __XEN_PUBLIC_XEN_COMPAT_H__
+
+#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040e00
+
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+/* Xen is built with matching headers and implements the latest interface. */
+#define __XEN_INTERFACE_VERSION__ __XEN_LATEST_INTERFACE_VERSION__
+#elif !defined(__XEN_INTERFACE_VERSION__)
+/* Guests which do not specify a version get the legacy interface. */
+#define __XEN_INTERFACE_VERSION__ 0x00000000
+#endif
+
+#if __XEN_INTERFACE_VERSION__ > __XEN_LATEST_INTERFACE_VERSION__
+#error "These header files do not support the requested interface version."
+#endif
+
+#define COMPAT_FLEX_ARRAY_DIM XEN_FLEX_ARRAY_DIM
+
+#endif /* __XEN_PUBLIC_XEN_COMPAT_H__ */
diff --git a/include/standard-headers/xen/xen.h b/include/standard-headers/xen/xen.h
new file mode 100644
index 0000000000..e373592c33
--- /dev/null
+++ b/include/standard-headers/xen/xen.h
@@ -0,0 +1,1049 @@
+/******************************************************************************
+ * xen.h
+ *
+ * Guest OS interface to Xen.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2004, K A Fraser
+ */
+
+#ifndef __XEN_PUBLIC_XEN_H__
+#define __XEN_PUBLIC_XEN_H__
+
+#include "xen-compat.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+#include "arch-x86/xen.h"
+#elif defined(__arm__) || defined (__aarch64__)
+#include "arch-arm.h"
+#else
+#error "Unsupported architecture"
+#endif
+
+#ifndef __ASSEMBLY__
+/* Guest handles for primitive C types. */
+DEFINE_XEN_GUEST_HANDLE(char);
+__DEFINE_XEN_GUEST_HANDLE(uchar, unsigned char);
+DEFINE_XEN_GUEST_HANDLE(int);
+__DEFINE_XEN_GUEST_HANDLE(uint,  unsigned int);
+#if __XEN_INTERFACE_VERSION__ < 0x00040300
+DEFINE_XEN_GUEST_HANDLE(long);
+__DEFINE_XEN_GUEST_HANDLE(ulong, unsigned long);
+#endif
+DEFINE_XEN_GUEST_HANDLE(void);
+
+DEFINE_XEN_GUEST_HANDLE(uint64_t);
+DEFINE_XEN_GUEST_HANDLE(xen_pfn_t);
+DEFINE_XEN_GUEST_HANDLE(xen_ulong_t);
+
+/* Define a variable length array (depends on compiler). */
+#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
+#define XEN_FLEX_ARRAY_DIM
+#elif defined(__GNUC__)
+#define XEN_FLEX_ARRAY_DIM  0
+#else
+#define XEN_FLEX_ARRAY_DIM  1 /* variable size */
+#endif
+
+/* Turn a plain number into a C unsigned (long (long)) constant. */
+#define __xen_mk_uint(x)  x ## U
+#define __xen_mk_ulong(x) x ## UL
+#ifndef __xen_mk_ullong
+# define __xen_mk_ullong(x) x ## ULL
+#endif
+#define xen_mk_uint(x)    __xen_mk_uint(x)
+#define xen_mk_ulong(x)   __xen_mk_ulong(x)
+#define xen_mk_ullong(x)  __xen_mk_ullong(x)
+
+#else
+
+/* In assembly code we cannot use C numeric constant suffixes. */
+#define xen_mk_uint(x)   x
+#define xen_mk_ulong(x)  x
+#define xen_mk_ullong(x) x
+
+#endif
+
+/*
+ * HYPERCALLS
+ */
+
+/* `incontents 100 hcalls List of hypercalls
+ * ` enum hypercall_num { // __HYPERVISOR_* => HYPERVISOR_*()
+ */
+
+#define __HYPERVISOR_set_trap_table        0
+#define __HYPERVISOR_mmu_update            1
+#define __HYPERVISOR_set_gdt               2
+#define __HYPERVISOR_stack_switch          3
+#define __HYPERVISOR_set_callbacks         4
+#define __HYPERVISOR_fpu_taskswitch        5
+#define __HYPERVISOR_sched_op_compat       6 /* compat since 0x00030101 */
+#define __HYPERVISOR_platform_op           7
+#define __HYPERVISOR_set_debugreg          8
+#define __HYPERVISOR_get_debugreg          9
+#define __HYPERVISOR_update_descriptor    10
+#define __HYPERVISOR_memory_op            12
+#define __HYPERVISOR_multicall            13
+#define __HYPERVISOR_update_va_mapping    14
+#define __HYPERVISOR_set_timer_op         15
+#define __HYPERVISOR_event_channel_op_compat 16 /* compat since 0x00030202 */
+#define __HYPERVISOR_xen_version          17
+#define __HYPERVISOR_console_io           18
+#define __HYPERVISOR_physdev_op_compat    19 /* compat since 0x00030202 */
+#define __HYPERVISOR_grant_table_op       20
+#define __HYPERVISOR_vm_assist            21
+#define __HYPERVISOR_update_va_mapping_otherdomain 22
+#define __HYPERVISOR_iret                 23 /* x86 only */
+#define __HYPERVISOR_vcpu_op              24
+#define __HYPERVISOR_set_segment_base     25 /* x86/64 only */
+#define __HYPERVISOR_mmuext_op            26
+#define __HYPERVISOR_xsm_op               27
+#define __HYPERVISOR_nmi_op               28
+#define __HYPERVISOR_sched_op             29
+#define __HYPERVISOR_callback_op          30
+#define __HYPERVISOR_xenoprof_op          31
+#define __HYPERVISOR_event_channel_op     32
+#define __HYPERVISOR_physdev_op           33
+#define __HYPERVISOR_hvm_op               34
+#define __HYPERVISOR_sysctl               35
+#define __HYPERVISOR_domctl               36
+#define __HYPERVISOR_kexec_op             37
+#define __HYPERVISOR_tmem_op              38
+#define __HYPERVISOR_argo_op              39
+#define __HYPERVISOR_xenpmu_op            40
+#define __HYPERVISOR_dm_op                41
+#define __HYPERVISOR_hypfs_op             42
+
+/* Architecture-specific hypercall definitions. */
+#define __HYPERVISOR_arch_0               48
+#define __HYPERVISOR_arch_1               49
+#define __HYPERVISOR_arch_2               50
+#define __HYPERVISOR_arch_3               51
+#define __HYPERVISOR_arch_4               52
+#define __HYPERVISOR_arch_5               53
+#define __HYPERVISOR_arch_6               54
+#define __HYPERVISOR_arch_7               55
+
+/* ` } */
+
+/*
+ * HYPERCALL COMPATIBILITY.
+ */
+
+/* New sched_op hypercall introduced in 0x00030101. */
+#if __XEN_INTERFACE_VERSION__ < 0x00030101
+#undef __HYPERVISOR_sched_op
+#define __HYPERVISOR_sched_op __HYPERVISOR_sched_op_compat
+#endif
+
+/* New event-channel and physdev hypercalls introduced in 0x00030202. */
+#if __XEN_INTERFACE_VERSION__ < 0x00030202
+#undef __HYPERVISOR_event_channel_op
+#define __HYPERVISOR_event_channel_op __HYPERVISOR_event_channel_op_compat
+#undef __HYPERVISOR_physdev_op
+#define __HYPERVISOR_physdev_op __HYPERVISOR_physdev_op_compat
+#endif
+
+/* New platform_op hypercall introduced in 0x00030204. */
+#if __XEN_INTERFACE_VERSION__ < 0x00030204
+#define __HYPERVISOR_dom0_op __HYPERVISOR_platform_op
+#endif
+
+/*
+ * VIRTUAL INTERRUPTS
+ *
+ * Virtual interrupts that a guest OS may receive from Xen.
+ *
+ * In the side comments, 'V.' denotes a per-VCPU VIRQ while 'G.' denotes a
+ * global VIRQ. The former can be bound once per VCPU and cannot be re-bound.
+ * The latter can be allocated only once per guest: they must initially be
+ * allocated to VCPU0 but can subsequently be re-bound.
+ */
+/* ` enum virq { */
+#define VIRQ_TIMER      0  /* V. Timebase update, and/or requested timeout.  */
+#define VIRQ_DEBUG      1  /* V. Request guest to dump debug info.           */
+#define VIRQ_CONSOLE    2  /* G. (DOM0) Bytes received on emergency console. */
+#define VIRQ_DOM_EXC    3  /* G. (DOM0) Exceptional event for some domain.   */
+#define VIRQ_TBUF       4  /* G. (DOM0) Trace buffer has records available.  */
+#define VIRQ_DEBUGGER   6  /* G. (DOM0) A domain has paused for debugging.   */
+#define VIRQ_XENOPROF   7  /* V. XenOprofile interrupt: new sample available */
+#define VIRQ_CON_RING   8  /* G. (DOM0) Bytes received on console            */
+#define VIRQ_PCPU_STATE 9  /* G. (DOM0) PCPU state changed                   */
+#define VIRQ_MEM_EVENT  10 /* G. (DOM0) A memory event has occurred          */
+#define VIRQ_ARGO       11 /* G. Argo interdomain message notification       */
+#define VIRQ_ENOMEM     12 /* G. (DOM0) Low on heap memory       */
+#define VIRQ_XENPMU     13 /* V.  PMC interrupt                              */
+
+/* Architecture-specific VIRQ definitions. */
+#define VIRQ_ARCH_0    16
+#define VIRQ_ARCH_1    17
+#define VIRQ_ARCH_2    18
+#define VIRQ_ARCH_3    19
+#define VIRQ_ARCH_4    20
+#define VIRQ_ARCH_5    21
+#define VIRQ_ARCH_6    22
+#define VIRQ_ARCH_7    23
+/* ` } */
+
+#define NR_VIRQS       24
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_mmu_update(const struct mmu_update reqs[],
+ * `                       unsigned count, unsigned *done_out,
+ * `                       unsigned foreigndom)
+ * `
+ * @reqs is an array of mmu_update_t structures ((ptr, val) pairs).
+ * @count is the length of the above array.
+ * @pdone is an output parameter indicating number of completed operations
+ * @foreigndom[15:0]: FD, the expected owner of data pages referenced in this
+ *                    hypercall invocation. Can be DOMID_SELF.
+ * @foreigndom[31:16]: PFD, the expected owner of pagetable pages referenced
+ *                     in this hypercall invocation. The value of this field
+ *                     (x) encodes the PFD as follows:
+ *                     x == 0 => PFD == DOMID_SELF
+ *                     x != 0 => PFD == x - 1
+ *
+ * Sub-commands: ptr[1:0] specifies the appropriate MMU_* command.
+ * -------------
+ * ptr[1:0] == MMU_NORMAL_PT_UPDATE:
+ * Updates an entry in a page table belonging to PFD. If updating an L1 table,
+ * and the new table entry is valid/present, the mapped frame must belong to
+ * FD. If attempting to map an I/O page then the caller assumes the privilege
+ * of the FD.
+ * FD == DOMID_IO: Permit /only/ I/O mappings, at the priv level of the caller.
+ * FD == DOMID_XEN: Map restricted areas of Xen's heap space.
+ * ptr[:2]  -- Machine address of the page-table entry to modify.
+ * val      -- Value to write.
+ *
+ * There also certain implicit requirements when using this hypercall. The
+ * pages that make up a pagetable must be mapped read-only in the guest.
+ * This prevents uncontrolled guest updates to the pagetable. Xen strictly
+ * enforces this, and will disallow any pagetable update which will end up
+ * mapping pagetable page RW, and will disallow using any writable page as a
+ * pagetable. In practice it means that when constructing a page table for a
+ * process, thread, etc, we MUST be very dilligient in following these rules:
+ *  1). Start with top-level page (PGD or in Xen language: L4). Fill out
+ *      the entries.
+ *  2). Keep on going, filling out the upper (PUD or L3), and middle (PMD
+ *      or L2).
+ *  3). Start filling out the PTE table (L1) with the PTE entries. Once
+ *  	done, make sure to set each of those entries to RO (so writeable bit
+ *  	is unset). Once that has been completed, set the PMD (L2) for this
+ *  	PTE table as RO.
+ *  4). When completed with all of the PMD (L2) entries, and all of them have
+ *  	been set to RO, make sure to set RO the PUD (L3). Do the same
+ *  	operation on PGD (L4) pagetable entries that have a PUD (L3) entry.
+ *  5). Now before you can use those pages (so setting the cr3), you MUST also
+ *      pin them so that the hypervisor can verify the entries. This is done
+ *      via the HYPERVISOR_mmuext_op(MMUEXT_PIN_L4_TABLE, guest physical frame
+ *      number of the PGD (L4)). And this point the HYPERVISOR_mmuext_op(
+ *      MMUEXT_NEW_BASEPTR, guest physical frame number of the PGD (L4)) can be
+ *      issued.
+ * For 32-bit guests, the L4 is not used (as there is less pagetables), so
+ * instead use L3.
+ * At this point the pagetables can be modified using the MMU_NORMAL_PT_UPDATE
+ * hypercall. Also if so desired the OS can also try to write to the PTE
+ * and be trapped by the hypervisor (as the PTE entry is RO).
+ *
+ * To deallocate the pages, the operations are the reverse of the steps
+ * mentioned above. The argument is MMUEXT_UNPIN_TABLE for all levels and the
+ * pagetable MUST not be in use (meaning that the cr3 is not set to it).
+ *
+ * ptr[1:0] == MMU_MACHPHYS_UPDATE:
+ * Updates an entry in the machine->pseudo-physical mapping table.
+ * ptr[:2]  -- Machine address within the frame whose mapping to modify.
+ *             The frame must belong to the FD, if one is specified.
+ * val      -- Value to write into the mapping entry.
+ *
+ * ptr[1:0] == MMU_PT_UPDATE_PRESERVE_AD:
+ * As MMU_NORMAL_PT_UPDATE above, but A/D bits currently in the PTE are ORed
+ * with those in @val.
+ *
+ * ptr[1:0] == MMU_PT_UPDATE_NO_TRANSLATE:
+ * As MMU_NORMAL_PT_UPDATE above, but @val is not translated though FD
+ * page tables.
+ *
+ * @val is usually the machine frame number along with some attributes.
+ * The attributes by default follow the architecture defined bits. Meaning that
+ * if this is a X86_64 machine and four page table layout is used, the layout
+ * of val is:
+ *  - 63 if set means No execute (NX)
+ *  - 46-13 the machine frame number
+ *  - 12 available for guest
+ *  - 11 available for guest
+ *  - 10 available for guest
+ *  - 9 available for guest
+ *  - 8 global
+ *  - 7 PAT (PSE is disabled, must use hypercall to make 4MB or 2MB pages)
+ *  - 6 dirty
+ *  - 5 accessed
+ *  - 4 page cached disabled
+ *  - 3 page write through
+ *  - 2 userspace accessible
+ *  - 1 writeable
+ *  - 0 present
+ *
+ *  The one bits that does not fit with the default layout is the PAGE_PSE
+ *  also called PAGE_PAT). The MMUEXT_[UN]MARK_SUPER arguments to the
+ *  HYPERVISOR_mmuext_op serve as mechanism to set a pagetable to be 4MB
+ *  (or 2MB) instead of using the PAGE_PSE bit.
+ *
+ *  The reason that the PAGE_PSE (bit 7) is not being utilized is due to Xen
+ *  using it as the Page Attribute Table (PAT) bit - for details on it please
+ *  refer to Intel SDM 10.12. The PAT allows to set the caching attributes of
+ *  pages instead of using MTRRs.
+ *
+ *  The PAT MSR is as follows (it is a 64-bit value, each entry is 8 bits):
+ *                    PAT4                 PAT0
+ *  +-----+-----+----+----+----+-----+----+----+
+ *  | UC  | UC- | WC | WB | UC | UC- | WC | WB |  <= Linux
+ *  +-----+-----+----+----+----+-----+----+----+
+ *  | UC  | UC- | WT | WB | UC | UC- | WT | WB |  <= BIOS (default when machine boots)
+ *  +-----+-----+----+----+----+-----+----+----+
+ *  | rsv | rsv | WP | WC | UC | UC- | WT | WB |  <= Xen
+ *  +-----+-----+----+----+----+-----+----+----+
+ *
+ *  The lookup of this index table translates to looking up
+ *  Bit 7, Bit 4, and Bit 3 of val entry:
+ *
+ *  PAT/PSE (bit 7) ... PCD (bit 4) .. PWT (bit 3).
+ *
+ *  If all bits are off, then we are using PAT0. If bit 3 turned on,
+ *  then we are using PAT1, if bit 3 and bit 4, then PAT2..
+ *
+ *  As you can see, the Linux PAT1 translates to PAT4 under Xen. Which means
+ *  that if a guest that follows Linux's PAT setup and would like to set Write
+ *  Combined on pages it MUST use PAT4 entry. Meaning that Bit 7 (PAGE_PAT) is
+ *  set. For example, under Linux it only uses PAT0, PAT1, and PAT2 for the
+ *  caching as:
+ *
+ *   WB = none (so PAT0)
+ *   WC = PWT (bit 3 on)
+ *   UC = PWT | PCD (bit 3 and 4 are on).
+ *
+ * To make it work with Xen, it needs to translate the WC bit as so:
+ *
+ *  PWT (so bit 3 on) --> PAT (so bit 7 is on) and clear bit 3
+ *
+ * And to translate back it would:
+ *
+ * PAT (bit 7 on) --> PWT (bit 3 on) and clear bit 7.
+ */
+#define MMU_NORMAL_PT_UPDATE       0 /* checked '*ptr = val'. ptr is MA.      */
+#define MMU_MACHPHYS_UPDATE        1 /* ptr = MA of frame to modify entry for */
+#define MMU_PT_UPDATE_PRESERVE_AD  2 /* atomically: *ptr = val | (*ptr&(A|D)) */
+#define MMU_PT_UPDATE_NO_TRANSLATE 3 /* checked '*ptr = val'. ptr is MA.      */
+                                     /* val never translated.                 */
+
+/*
+ * MMU EXTENDED OPERATIONS
+ *
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_mmuext_op(mmuext_op_t uops[],
+ * `                      unsigned int count,
+ * `                      unsigned int *pdone,
+ * `                      unsigned int foreigndom)
+ */
+/* HYPERVISOR_mmuext_op() accepts a list of mmuext_op structures.
+ * A foreigndom (FD) can be specified (or DOMID_SELF for none).
+ * Where the FD has some effect, it is described below.
+ *
+ * cmd: MMUEXT_(UN)PIN_*_TABLE
+ * mfn: Machine frame number to be (un)pinned as a p.t. page.
+ *      The frame must belong to the FD, if one is specified.
+ *
+ * cmd: MMUEXT_NEW_BASEPTR
+ * mfn: Machine frame number of new page-table base to install in MMU.
+ *
+ * cmd: MMUEXT_NEW_USER_BASEPTR [x86/64 only]
+ * mfn: Machine frame number of new page-table base to install in MMU
+ *      when in user space.
+ *
+ * cmd: MMUEXT_TLB_FLUSH_LOCAL
+ * No additional arguments. Flushes local TLB.
+ *
+ * cmd: MMUEXT_INVLPG_LOCAL
+ * linear_addr: Linear address to be flushed from the local TLB.
+ *
+ * cmd: MMUEXT_TLB_FLUSH_MULTI
+ * vcpumask: Pointer to bitmap of VCPUs to be flushed.
+ *
+ * cmd: MMUEXT_INVLPG_MULTI
+ * linear_addr: Linear address to be flushed.
+ * vcpumask: Pointer to bitmap of VCPUs to be flushed.
+ *
+ * cmd: MMUEXT_TLB_FLUSH_ALL
+ * No additional arguments. Flushes all VCPUs' TLBs.
+ *
+ * cmd: MMUEXT_INVLPG_ALL
+ * linear_addr: Linear address to be flushed from all VCPUs' TLBs.
+ *
+ * cmd: MMUEXT_FLUSH_CACHE
+ * No additional arguments. Writes back and flushes cache contents.
+ *
+ * cmd: MMUEXT_FLUSH_CACHE_GLOBAL
+ * No additional arguments. Writes back and flushes cache contents
+ * on all CPUs in the system.
+ *
+ * cmd: MMUEXT_SET_LDT
+ * linear_addr: Linear address of LDT base (NB. must be page-aligned).
+ * nr_ents: Number of entries in LDT.
+ *
+ * cmd: MMUEXT_CLEAR_PAGE
+ * mfn: Machine frame number to be cleared.
+ *
+ * cmd: MMUEXT_COPY_PAGE
+ * mfn: Machine frame number of the destination page.
+ * src_mfn: Machine frame number of the source page.
+ *
+ * cmd: MMUEXT_[UN]MARK_SUPER
+ * mfn: Machine frame number of head of superpage to be [un]marked.
+ */
+/* ` enum mmuext_cmd { */
+#define MMUEXT_PIN_L1_TABLE      0
+#define MMUEXT_PIN_L2_TABLE      1
+#define MMUEXT_PIN_L3_TABLE      2
+#define MMUEXT_PIN_L4_TABLE      3
+#define MMUEXT_UNPIN_TABLE       4
+#define MMUEXT_NEW_BASEPTR       5
+#define MMUEXT_TLB_FLUSH_LOCAL   6
+#define MMUEXT_INVLPG_LOCAL      7
+#define MMUEXT_TLB_FLUSH_MULTI   8
+#define MMUEXT_INVLPG_MULTI      9
+#define MMUEXT_TLB_FLUSH_ALL    10
+#define MMUEXT_INVLPG_ALL       11
+#define MMUEXT_FLUSH_CACHE      12
+#define MMUEXT_SET_LDT          13
+#define MMUEXT_NEW_USER_BASEPTR 15
+#define MMUEXT_CLEAR_PAGE       16
+#define MMUEXT_COPY_PAGE        17
+#define MMUEXT_FLUSH_CACHE_GLOBAL 18
+#define MMUEXT_MARK_SUPER       19
+#define MMUEXT_UNMARK_SUPER     20
+/* ` } */
+
+#ifndef __ASSEMBLY__
+struct mmuext_op {
+    unsigned int cmd; /* => enum mmuext_cmd */
+    union {
+        /* [UN]PIN_TABLE, NEW_BASEPTR, NEW_USER_BASEPTR
+         * CLEAR_PAGE, COPY_PAGE, [UN]MARK_SUPER */
+        xen_pfn_t     mfn;
+        /* INVLPG_LOCAL, INVLPG_ALL, SET_LDT */
+        unsigned long linear_addr;
+    } arg1;
+    union {
+        /* SET_LDT */
+        unsigned int nr_ents;
+        /* TLB_FLUSH_MULTI, INVLPG_MULTI */
+#if __XEN_INTERFACE_VERSION__ >= 0x00030205
+        XEN_GUEST_HANDLE(const_void) vcpumask;
+#else
+        const void *vcpumask;
+#endif
+        /* COPY_PAGE */
+        xen_pfn_t src_mfn;
+    } arg2;
+};
+typedef struct mmuext_op mmuext_op_t;
+DEFINE_XEN_GUEST_HANDLE(mmuext_op_t);
+#endif
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_update_va_mapping(unsigned long va, u64 val,
+ * `                              enum uvm_flags flags)
+ * `
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_update_va_mapping_otherdomain(unsigned long va, u64 val,
+ * `                                          enum uvm_flags flags,
+ * `                                          domid_t domid)
+ * `
+ * ` @va: The virtual address whose mapping we want to change
+ * ` @val: The new page table entry, must contain a machine address
+ * ` @flags: Control TLB flushes
+ */
+/* These are passed as 'flags' to update_va_mapping. They can be ORed. */
+/* When specifying UVMF_MULTI, also OR in a pointer to a CPU bitmap.   */
+/* UVMF_LOCAL is merely UVMF_MULTI with a NULL bitmap pointer.         */
+/* ` enum uvm_flags { */
+#define UVMF_NONE           (xen_mk_ulong(0)<<0) /* No flushing at all.   */
+#define UVMF_TLB_FLUSH      (xen_mk_ulong(1)<<0) /* Flush entire TLB(s).  */
+#define UVMF_INVLPG         (xen_mk_ulong(2)<<0) /* Flush only one entry. */
+#define UVMF_FLUSHTYPE_MASK (xen_mk_ulong(3)<<0)
+#define UVMF_MULTI          (xen_mk_ulong(0)<<2) /* Flush subset of TLBs. */
+#define UVMF_LOCAL          (xen_mk_ulong(0)<<2) /* Flush local TLB.      */
+#define UVMF_ALL            (xen_mk_ulong(1)<<2) /* Flush all TLBs.       */
+/* ` } */
+
+/*
+ * ` int
+ * ` HYPERVISOR_console_io(unsigned int cmd,
+ * `                       unsigned int count,
+ * `                       char buffer[]);
+ *
+ * @cmd: Command (see below)
+ * @count: Size of the buffer to read/write
+ * @buffer: Pointer in the guest memory
+ *
+ * List of commands:
+ *
+ *  * CONSOLEIO_write: Write the buffer to Xen console.
+ *      For the hardware domain, all the characters in the buffer will
+ *      be written. Characters will be printed directly to the console.
+ *      For all the other domains, only the printable characters will be
+ *      written. Characters may be buffered until a newline (i.e '\n') is
+ *      found.
+ *      @return 0 on success, otherwise return an error code.
+ *  * CONSOLEIO_read: Attempts to read up to @count characters from Xen
+ *      console. The maximum buffer size (i.e. @count) supported is 2GB.
+ *      @return the number of characters read on success, otherwise return
+ *      an error code.
+ */
+#define CONSOLEIO_write         0
+#define CONSOLEIO_read          1
+
+/*
+ * Commands to HYPERVISOR_vm_assist().
+ */
+#define VMASST_CMD_enable                0
+#define VMASST_CMD_disable               1
+
+/* x86/32 guests: simulate full 4GB segment limits. */
+#define VMASST_TYPE_4gb_segments         0
+
+/* x86/32 guests: trap (vector 15) whenever above vmassist is used. */
+#define VMASST_TYPE_4gb_segments_notify  1
+
+/*
+ * x86 guests: support writes to bottom-level PTEs.
+ * NB1. Page-directory entries cannot be written.
+ * NB2. Guest must continue to remove all writable mappings of PTEs.
+ */
+#define VMASST_TYPE_writable_pagetables  2
+
+/* x86/PAE guests: support PDPTs above 4GB. */
+#define VMASST_TYPE_pae_extended_cr3     3
+
+/*
+ * x86 guests: Sane behaviour for virtual iopl
+ *  - virtual iopl updated from do_iret() hypercalls.
+ *  - virtual iopl reported in bounce frames.
+ *  - guest kernels assumed to be level 0 for the purpose of iopl checks.
+ */
+#define VMASST_TYPE_architectural_iopl   4
+
+/*
+ * All guests: activate update indicator in vcpu_runstate_info
+ * Enable setting the XEN_RUNSTATE_UPDATE flag in guest memory mapped
+ * vcpu_runstate_info during updates of the runstate information.
+ */
+#define VMASST_TYPE_runstate_update_flag 5
+
+/*
+ * x86/64 guests: strictly hide M2P from user mode.
+ * This allows the guest to control respective hypervisor behavior:
+ * - when not set, L4 tables get created with the respective slot blank,
+ *   and whenever the L4 table gets used as a kernel one the missing
+ *   mapping gets inserted,
+ * - when set, L4 tables get created with the respective slot initialized
+ *   as before, and whenever the L4 table gets used as a user one the
+ *   mapping gets zapped.
+ */
+#define VMASST_TYPE_m2p_strict           32
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040600
+#define MAX_VMASST_TYPE                  3
+#endif
+
+/* Domain ids >= DOMID_FIRST_RESERVED cannot be used for ordinary domains. */
+#define DOMID_FIRST_RESERVED xen_mk_uint(0x7FF0)
+
+/* DOMID_SELF is used in certain contexts to refer to oneself. */
+#define DOMID_SELF           xen_mk_uint(0x7FF0)
+
+/*
+ * DOMID_IO is used to restrict page-table updates to mapping I/O memory.
+ * Although no Foreign Domain need be specified to map I/O pages, DOMID_IO
+ * is useful to ensure that no mappings to the OS's own heap are accidentally
+ * installed. (e.g., in Linux this could cause havoc as reference counts
+ * aren't adjusted on the I/O-mapping code path).
+ * This only makes sense as HYPERVISOR_mmu_update()'s and
+ * HYPERVISOR_update_va_mapping_otherdomain()'s "foreigndom" argument. For
+ * HYPERVISOR_mmu_update() context it can be specified by any calling domain,
+ * otherwise it's only permitted if the caller is privileged.
+ */
+#define DOMID_IO             xen_mk_uint(0x7FF1)
+
+/*
+ * DOMID_XEN is used to allow privileged domains to map restricted parts of
+ * Xen's heap space (e.g., the machine_to_phys table).
+ * This only makes sense as
+ * - HYPERVISOR_mmu_update()'s, HYPERVISOR_mmuext_op()'s, or
+ *   HYPERVISOR_update_va_mapping_otherdomain()'s "foreigndom" argument,
+ * - with XENMAPSPACE_gmfn_foreign,
+ * and is only permitted if the caller is privileged.
+ */
+#define DOMID_XEN            xen_mk_uint(0x7FF2)
+
+/*
+ * DOMID_COW is used as the owner of sharable pages */
+#define DOMID_COW            xen_mk_uint(0x7FF3)
+
+/* DOMID_INVALID is used to identify pages with unknown owner. */
+#define DOMID_INVALID        xen_mk_uint(0x7FF4)
+
+/* Idle domain. */
+#define DOMID_IDLE           xen_mk_uint(0x7FFF)
+
+/* Mask for valid domain id values */
+#define DOMID_MASK           xen_mk_uint(0x7FFF)
+
+#ifndef __ASSEMBLY__
+
+typedef uint16_t domid_t;
+
+/*
+ * Send an array of these to HYPERVISOR_mmu_update().
+ * NB. The fields are natural pointer/address size for this architecture.
+ */
+struct mmu_update {
+    uint64_t ptr;       /* Machine address of PTE. */
+    uint64_t val;       /* New contents of PTE.    */
+};
+typedef struct mmu_update mmu_update_t;
+DEFINE_XEN_GUEST_HANDLE(mmu_update_t);
+
+/*
+ * ` enum neg_errnoval
+ * ` HYPERVISOR_multicall(multicall_entry_t call_list[],
+ * `                      uint32_t nr_calls);
+ *
+ * NB. The fields are logically the natural register size for this
+ * architecture. In cases where xen_ulong_t is larger than this then
+ * any unused bits in the upper portion must be zero.
+ */
+struct multicall_entry {
+    xen_ulong_t op, result;
+    xen_ulong_t args[6];
+};
+typedef struct multicall_entry multicall_entry_t;
+DEFINE_XEN_GUEST_HANDLE(multicall_entry_t);
+
+#if __XEN_INTERFACE_VERSION__ < 0x00040400
+/*
+ * Event channel endpoints per domain (when using the 2-level ABI):
+ *  1024 if a long is 32 bits; 4096 if a long is 64 bits.
+ */
+#define NR_EVENT_CHANNELS EVTCHN_2L_NR_CHANNELS
+#endif
+
+struct vcpu_time_info {
+    /*
+     * Updates to the following values are preceded and followed by an
+     * increment of 'version'. The guest can therefore detect updates by
+     * looking for changes to 'version'. If the least-significant bit of
+     * the version number is set then an update is in progress and the guest
+     * must wait to read a consistent set of values.
+     * The correct way to interact with the version number is similar to
+     * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
+     */
+    uint32_t version;
+    uint32_t pad0;
+    uint64_t tsc_timestamp;   /* TSC at last update of time vals.  */
+    uint64_t system_time;     /* Time, in nanosecs, since boot.    */
+    /*
+     * Current system time:
+     *   system_time +
+     *   ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) >> 32)
+     * CPU frequency (Hz):
+     *   ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
+     */
+    uint32_t tsc_to_system_mul;
+    int8_t   tsc_shift;
+#if __XEN_INTERFACE_VERSION__ > 0x040600
+    uint8_t  flags;
+    uint8_t  pad1[2];
+#else
+    int8_t   pad1[3];
+#endif
+}; /* 32 bytes */
+typedef struct vcpu_time_info vcpu_time_info_t;
+
+#define XEN_PVCLOCK_TSC_STABLE_BIT     (1 << 0)
+#define XEN_PVCLOCK_GUEST_STOPPED      (1 << 1)
+
+struct vcpu_info {
+    /*
+     * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
+     * a pending notification for a particular VCPU. It is then cleared
+     * by the guest OS /before/ checking for pending work, thus avoiding
+     * a set-and-check race. Note that the mask is only accessed by Xen
+     * on the CPU that is currently hosting the VCPU. This means that the
+     * pending and mask flags can be updated by the guest without special
+     * synchronisation (i.e., no need for the x86 LOCK prefix).
+     * This may seem suboptimal because if the pending flag is set by
+     * a different CPU then an IPI may be scheduled even when the mask
+     * is set. However, note:
+     *  1. The task of 'interrupt holdoff' is covered by the per-event-
+     *     channel mask bits. A 'noisy' event that is continually being
+     *     triggered can be masked at source at this very precise
+     *     granularity.
+     *  2. The main purpose of the per-VCPU mask is therefore to restrict
+     *     reentrant execution: whether for concurrency control, or to
+     *     prevent unbounded stack usage. Whatever the purpose, we expect
+     *     that the mask will be asserted only for short periods at a time,
+     *     and so the likelihood of a 'spurious' IPI is suitably small.
+     * The mask is read before making an event upcall to the guest: a
+     * non-zero mask therefore guarantees that the VCPU will not receive
+     * an upcall activation. The mask is cleared when the VCPU requests
+     * to block: this avoids wakeup-waiting races.
+     */
+    uint8_t evtchn_upcall_pending;
+#ifdef XEN_HAVE_PV_UPCALL_MASK
+    uint8_t evtchn_upcall_mask;
+#else /* XEN_HAVE_PV_UPCALL_MASK */
+    uint8_t pad0;
+#endif /* XEN_HAVE_PV_UPCALL_MASK */
+    xen_ulong_t evtchn_pending_sel;
+    struct arch_vcpu_info arch;
+    vcpu_time_info_t time;
+}; /* 64 bytes (x86) */
+#ifndef __XEN__
+typedef struct vcpu_info vcpu_info_t;
+#endif
+
+/*
+ * `incontents 200 startofday_shared Start-of-day shared data structure
+ * Xen/kernel shared data -- pointer provided in start_info.
+ *
+ * This structure is defined to be both smaller than a page, and the
+ * only data on the shared page, but may vary in actual size even within
+ * compatible Xen versions; guests should not rely on the size
+ * of this structure remaining constant.
+ */
+struct shared_info {
+    struct vcpu_info vcpu_info[XEN_LEGACY_MAX_VCPUS];
+
+    /*
+     * A domain can create "event channels" on which it can send and receive
+     * asynchronous event notifications. There are three classes of event that
+     * are delivered by this mechanism:
+     *  1. Bi-directional inter- and intra-domain connections. Domains must
+     *     arrange out-of-band to set up a connection (usually by allocating
+     *     an unbound 'listener' port and avertising that via a storage service
+     *     such as xenstore).
+     *  2. Physical interrupts. A domain with suitable hardware-access
+     *     privileges can bind an event-channel port to a physical interrupt
+     *     source.
+     *  3. Virtual interrupts ('events'). A domain can bind an event-channel
+     *     port to a virtual interrupt source, such as the virtual-timer
+     *     device or the emergency console.
+     *
+     * Event channels are addressed by a "port index". Each channel is
+     * associated with two bits of information:
+     *  1. PENDING -- notifies the domain that there is a pending notification
+     *     to be processed. This bit is cleared by the guest.
+     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
+     *     will cause an asynchronous upcall to be scheduled. This bit is only
+     *     updated by the guest. It is read-only within Xen. If a channel
+     *     becomes pending while the channel is masked then the 'edge' is lost
+     *     (i.e., when the channel is unmasked, the guest must manually handle
+     *     pending notifications as no upcall will be scheduled by Xen).
+     *
+     * To expedite scanning of pending notifications, any 0->1 pending
+     * transition on an unmasked channel causes a corresponding bit in a
+     * per-vcpu selector word to be set. Each bit in the selector covers a
+     * 'C long' in the PENDING bitfield array.
+     */
+    xen_ulong_t evtchn_pending[sizeof(xen_ulong_t) * 8];
+    xen_ulong_t evtchn_mask[sizeof(xen_ulong_t) * 8];
+
+    /*
+     * Wallclock time: updated by control software or RTC emulation.
+     * Guests should base their gettimeofday() syscall on this
+     * wallclock-base value.
+     * The values of wc_sec and wc_nsec are offsets from the Unix epoch
+     * adjusted by the domain's 'time offset' (in seconds) as set either
+     * by XEN_DOMCTL_settimeoffset, or adjusted via a guest write to the
+     * emulated RTC.
+     */
+    uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
+    uint32_t wc_sec;
+    uint32_t wc_nsec;
+#if !defined(__i386__)
+    uint32_t wc_sec_hi;
+# define xen_wc_sec_hi wc_sec_hi
+#elif !defined(__XEN__) && !defined(__XEN_TOOLS__)
+# define xen_wc_sec_hi arch.wc_sec_hi
+#endif
+
+    struct arch_shared_info arch;
+
+};
+#ifndef __XEN__
+typedef struct shared_info shared_info_t;
+#endif
+
+/*
+ * `incontents 200 startofday Start-of-day memory layout
+ *
+ *  1. The domain is started within contiguous virtual-memory region.
+ *  2. The contiguous region ends on an aligned 4MB boundary.
+ *  3. This the order of bootstrap elements in the initial virtual region:
+ *      a. relocated kernel image
+ *      b. initial ram disk              [mod_start, mod_len]
+ *         (may be omitted)
+ *      c. list of allocated page frames [mfn_list, nr_pages]
+ *         (unless relocated due to XEN_ELFNOTE_INIT_P2M)
+ *      d. start_info_t structure        [register rSI (x86)]
+ *         in case of dom0 this page contains the console info, too
+ *      e. unless dom0: xenstore ring page
+ *      f. unless dom0: console ring page
+ *      g. bootstrap page tables         [pt_base and CR3 (x86)]
+ *      h. bootstrap stack               [register ESP (x86)]
+ *  4. Bootstrap elements are packed together, but each is 4kB-aligned.
+ *  5. The list of page frames forms a contiguous 'pseudo-physical' memory
+ *     layout for the domain. In particular, the bootstrap virtual-memory
+ *     region is a 1:1 mapping to the first section of the pseudo-physical map.
+ *  6. All bootstrap elements are mapped read-writable for the guest OS. The
+ *     only exception is the bootstrap page table, which is mapped read-only.
+ *  7. There is guaranteed to be at least 512kB padding after the final
+ *     bootstrap element. If necessary, the bootstrap virtual region is
+ *     extended by an extra 4MB to ensure this.
+ *
+ * Note: Prior to 25833:bb85bbccb1c9. ("x86/32-on-64 adjust Dom0 initial page
+ * table layout") a bug caused the pt_base (3.g above) and cr3 to not point
+ * to the start of the guest page tables (it was offset by two pages).
+ * This only manifested itself on 32-on-64 dom0 kernels and not 32-on-64 domU
+ * or 64-bit kernels of any colour. The page tables for a 32-on-64 dom0 got
+ * allocated in the order: 'first L1','first L2', 'first L3', so the offset
+ * to the page table base is by two pages back. The initial domain if it is
+ * 32-bit and runs under a 64-bit hypervisor should _NOT_ use two of the
+ * pages preceding pt_base and mark them as reserved/unused.
+ */
+#ifdef XEN_HAVE_PV_GUEST_ENTRY
+struct start_info {
+    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME.    */
+    char magic[32];             /* "xen-<version>-<platform>".            */
+    unsigned long nr_pages;     /* Total pages allocated to this domain.  */
+    unsigned long shared_info;  /* MACHINE address of shared info struct. */
+    uint32_t flags;             /* SIF_xxx flags.                         */
+    xen_pfn_t store_mfn;        /* MACHINE page number of shared page.    */
+    uint32_t store_evtchn;      /* Event channel for store communication. */
+    union {
+        struct {
+            xen_pfn_t mfn;      /* MACHINE page number of console page.   */
+            uint32_t  evtchn;   /* Event channel for console page.        */
+        } domU;
+        struct {
+            uint32_t info_off;  /* Offset of console_info struct.         */
+            uint32_t info_size; /* Size of console_info struct from start.*/
+        } dom0;
+    } console;
+    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME).     */
+    unsigned long pt_base;      /* VIRTUAL address of page directory.     */
+    unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames.       */
+    unsigned long mfn_list;     /* VIRTUAL address of page-frame list.    */
+    unsigned long mod_start;    /* VIRTUAL address of pre-loaded module   */
+                                /* (PFN of pre-loaded module if           */
+                                /*  SIF_MOD_START_PFN set in flags).      */
+    unsigned long mod_len;      /* Size (bytes) of pre-loaded module.     */
+#define MAX_GUEST_CMDLINE 1024
+    int8_t cmd_line[MAX_GUEST_CMDLINE];
+    /* The pfn range here covers both page table and p->m table frames.   */
+    unsigned long first_p2m_pfn;/* 1st pfn forming initial P->M table.    */
+    unsigned long nr_p2m_frames;/* # of pfns forming initial P->M table.  */
+};
+typedef struct start_info start_info_t;
+
+/* New console union for dom0 introduced in 0x00030203. */
+#if __XEN_INTERFACE_VERSION__ < 0x00030203
+#define console_mfn    console.domU.mfn
+#define console_evtchn console.domU.evtchn
+#endif
+#endif /* XEN_HAVE_PV_GUEST_ENTRY */
+
+/* These flags are passed in the 'flags' field of start_info_t. */
+#define SIF_PRIVILEGED    (1<<0)  /* Is the domain privileged? */
+#define SIF_INITDOMAIN    (1<<1)  /* Is this the initial control domain? */
+#define SIF_MULTIBOOT_MOD (1<<2)  /* Is mod_start a multiboot module? */
+#define SIF_MOD_START_PFN (1<<3)  /* Is mod_start a PFN? */
+#define SIF_VIRT_P2M_4TOOLS (1<<4) /* Do Xen tools understand a virt. mapped */
+                                   /* P->M making the 3 level tree obsolete? */
+#define SIF_PM_MASK       (0xFF<<8) /* reserve 1 byte for xen-pm options */
+
+/*
+ * A multiboot module is a package containing modules very similar to a
+ * multiboot module array. The only differences are:
+ * - the array of module descriptors is by convention simply at the beginning
+ *   of the multiboot module,
+ * - addresses in the module descriptors are based on the beginning of the
+ *   multiboot module,
+ * - the number of modules is determined by a termination descriptor that has
+ *   mod_start == 0.
+ *
+ * This permits to both build it statically and reference it in a configuration
+ * file, and let the PV guest easily rebase the addresses to virtual addresses
+ * and at the same time count the number of modules.
+ */
+struct xen_multiboot_mod_list
+{
+    /* Address of first byte of the module */
+    uint32_t mod_start;
+    /* Address of last byte of the module (inclusive) */
+    uint32_t mod_end;
+    /* Address of zero-terminated command line */
+    uint32_t cmdline;
+    /* Unused, must be zero */
+    uint32_t pad;
+};
+/*
+ * `incontents 200 startofday_dom0_console Dom0_console
+ *
+ * The console structure in start_info.console.dom0
+ *
+ * This structure includes a variety of information required to
+ * have a working VGA/VESA console.
+ */
+typedef struct dom0_vga_console_info {
+    uint8_t video_type; /* DOM0_VGA_CONSOLE_??? */
+#define XEN_VGATYPE_TEXT_MODE_3 0x03
+#define XEN_VGATYPE_VESA_LFB    0x23
+#define XEN_VGATYPE_EFI_LFB     0x70
+
+    union {
+        struct {
+            /* Font height, in pixels. */
+            uint16_t font_height;
+            /* Cursor location (column, row). */
+            uint16_t cursor_x, cursor_y;
+            /* Number of rows and columns (dimensions in characters). */
+            uint16_t rows, columns;
+        } text_mode_3;
+
+        struct {
+            /* Width and height, in pixels. */
+            uint16_t width, height;
+            /* Bytes per scan line. */
+            uint16_t bytes_per_line;
+            /* Bits per pixel. */
+            uint16_t bits_per_pixel;
+            /* LFB physical address, and size (in units of 64kB). */
+            uint32_t lfb_base;
+            uint32_t lfb_size;
+            /* RGB mask offsets and sizes, as defined by VBE 1.2+ */
+            uint8_t  red_pos, red_size;
+            uint8_t  green_pos, green_size;
+            uint8_t  blue_pos, blue_size;
+            uint8_t  rsvd_pos, rsvd_size;
+#if __XEN_INTERFACE_VERSION__ >= 0x00030206
+            /* VESA capabilities (offset 0xa, VESA command 0x4f00). */
+            uint32_t gbl_caps;
+            /* Mode attributes (offset 0x0, VESA command 0x4f01). */
+            uint16_t mode_attrs;
+            uint16_t pad;
+#endif
+#if __XEN_INTERFACE_VERSION__ >= 0x00040d00
+            /* high 32 bits of lfb_base */
+            uint32_t ext_lfb_base;
+#endif
+        } vesa_lfb;
+    } u;
+} dom0_vga_console_info_t;
+#define xen_vga_console_info dom0_vga_console_info
+#define xen_vga_console_info_t dom0_vga_console_info_t
+
+typedef uint8_t xen_domain_handle_t[16];
+
+__DEFINE_XEN_GUEST_HANDLE(uint8,  uint8_t);
+__DEFINE_XEN_GUEST_HANDLE(uint16, uint16_t);
+__DEFINE_XEN_GUEST_HANDLE(uint32, uint32_t);
+__DEFINE_XEN_GUEST_HANDLE(uint64, uint64_t);
+
+typedef struct {
+    uint8_t a[16];
+} xen_uuid_t;
+
+/*
+ * XEN_DEFINE_UUID(0x00112233, 0x4455, 0x6677, 0x8899,
+ *                 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff)
+ * will construct UUID 00112233-4455-6677-8899-aabbccddeeff presented as
+ * {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88,
+ * 0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff};
+ *
+ * NB: This is compatible with Linux kernel and with libuuid, but it is not
+ * compatible with Microsoft, as they use mixed-endian encoding (some
+ * components are little-endian, some are big-endian).
+ */
+#define XEN_DEFINE_UUID_(a, b, c, d, e1, e2, e3, e4, e5, e6)            \
+    {{((a) >> 24) & 0xFF, ((a) >> 16) & 0xFF,                           \
+      ((a) >>  8) & 0xFF, ((a) >>  0) & 0xFF,                           \
+      ((b) >>  8) & 0xFF, ((b) >>  0) & 0xFF,                           \
+      ((c) >>  8) & 0xFF, ((c) >>  0) & 0xFF,                           \
+      ((d) >>  8) & 0xFF, ((d) >>  0) & 0xFF,                           \
+                e1, e2, e3, e4, e5, e6}}
+
+#if defined(__STDC_VERSION__) ? __STDC_VERSION__ >= 199901L : defined(__GNUC__)
+#define XEN_DEFINE_UUID(a, b, c, d, e1, e2, e3, e4, e5, e6)             \
+    ((xen_uuid_t)XEN_DEFINE_UUID_(a, b, c, d, e1, e2, e3, e4, e5, e6))
+#else
+#define XEN_DEFINE_UUID(a, b, c, d, e1, e2, e3, e4, e5, e6)             \
+    XEN_DEFINE_UUID_(a, b, c, d, e1, e2, e3, e4, e5, e6)
+#endif /* __STDC_VERSION__ / __GNUC__ */
+
+#endif /* !__ASSEMBLY__ */
+
+/* Default definitions for macros used by domctl/sysctl. */
+#if defined(__XEN__) || defined(__XEN_TOOLS__)
+
+#ifndef int64_aligned_t
+#define int64_aligned_t int64_t
+#endif
+#ifndef uint64_aligned_t
+#define uint64_aligned_t uint64_t
+#endif
+#ifndef XEN_GUEST_HANDLE_64
+#define XEN_GUEST_HANDLE_64(name) XEN_GUEST_HANDLE(name)
+#endif
+
+#ifndef __ASSEMBLY__
+struct xenctl_bitmap {
+    XEN_GUEST_HANDLE_64(uint8) bitmap;
+    uint32_t nr_bits;
+};
+typedef struct xenctl_bitmap xenctl_bitmap_t;
+#endif
+
+#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
+
+#endif /* __XEN_PUBLIC_XEN_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
  2022-12-09  9:55 ` [RFC PATCH v2 01/22] include: import xen public headers David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12  9:19   ` Paul Durrant
  2022-12-12 17:07   ` Paolo Bonzini
  2022-12-09  9:55 ` [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support David Woodhouse
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: David Woodhouse <dwmw@amazon.co.uk>

The XEN_EMU option will cover core Xen support in target/, which exists
only for x86 with KVM today but could theoretically also be implemented
on Arm/Aarch64 and with TCG or other accelerators. It will also cover
the support for architecture-independent grant table and event channel
support which will be added in hw/xen/.

The XENFV_MACHINE option is for the xenfv platform support, which will
now be used both by XEN_EMU and by real Xen.

The XEN option remains dependent on the Xen runtime libraries, and covers
support for real Xen. Some code which currently resides under CONFIG_XEN
will be moving to CONFIG_XENFV_MACHINE over time.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 accel/Kconfig  | 1 +
 hw/Kconfig     | 1 +
 hw/xen/Kconfig | 3 +++
 meson.build    | 1 +
 target/Kconfig | 4 ++++
 5 files changed, 10 insertions(+)
 create mode 100644 hw/xen/Kconfig

diff --git a/accel/Kconfig b/accel/Kconfig
index 8bdedb7d15..41e089e610 100644
--- a/accel/Kconfig
+++ b/accel/Kconfig
@@ -15,6 +15,7 @@ config TCG
 
 config KVM
     bool
+    imply XEN_EMU if (I386 || X86_64)
 
 config XEN
     bool
diff --git a/hw/Kconfig b/hw/Kconfig
index 38233bbb0f..ba62ff6417 100644
--- a/hw/Kconfig
+++ b/hw/Kconfig
@@ -41,6 +41,7 @@ source tpm/Kconfig
 source usb/Kconfig
 source virtio/Kconfig
 source vfio/Kconfig
+source xen/Kconfig
 source watchdog/Kconfig
 
 # arch Kconfig
diff --git a/hw/xen/Kconfig b/hw/xen/Kconfig
new file mode 100644
index 0000000000..755c8b1faf
--- /dev/null
+++ b/hw/xen/Kconfig
@@ -0,0 +1,3 @@
+config XENFV_MACHINE
+    bool
+    default y if (XEN || XEN_EMU)
diff --git a/meson.build b/meson.build
index 5c6b5a1c75..9348cf572c 100644
--- a/meson.build
+++ b/meson.build
@@ -3828,6 +3828,7 @@ if have_system
   if xen.found()
     summary_info += {'xen ctrl version':  xen.version()}
   endif
+  summary_info += {'Xen emulation':     config_all.has_key('CONFIG_XEN_EMU')}
 endif
 summary_info += {'TCG support':       config_all.has_key('CONFIG_TCG')}
 if config_all.has_key('CONFIG_TCG')
diff --git a/target/Kconfig b/target/Kconfig
index 83da0bd293..e19c9d77b5 100644
--- a/target/Kconfig
+++ b/target/Kconfig
@@ -18,3 +18,7 @@ source sh4/Kconfig
 source sparc/Kconfig
 source tricore/Kconfig
 source xtensa/Kconfig
+
+config XEN_EMU
+    bool
+    depends on KVM && (I386 || X86_64)
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
  2022-12-09  9:55 ` [RFC PATCH v2 01/22] include: import xen public headers David Woodhouse
  2022-12-09  9:55 ` [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 12:48   ` Paul Durrant
  2022-12-12 17:30   ` Paolo Bonzini
  2022-12-09  9:55 ` [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves David Woodhouse
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: David Woodhouse <dwmw@amazon.co.uk>

This is a machine property for two main reasons. One is that it allows
us to set it in default_machine_opts for the xenfv platform when not
running on actual Xen. The other is that theoretically we *could* do
this with TCG too; we'd just have to implement a bunch of the stuff that
KVM already does for us.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/pc.c            | 32 +++++++++++++++++++++++++++
 hw/i386/pc_piix.c       | 10 +++++++--
 include/hw/i386/pc.h    |  3 +++
 target/i386/kvm/kvm.c   | 26 ++++++++++++++++++++++
 target/i386/meson.build |  1 +
 target/i386/xen.c       | 49 +++++++++++++++++++++++++++++++++++++++++
 target/i386/xen.h       | 19 ++++++++++++++++
 7 files changed, 138 insertions(+), 2 deletions(-)
 create mode 100644 target/i386/xen.c
 create mode 100644 target/i386/xen.h

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 546b703cb4..9bada1a8ff 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1811,6 +1811,32 @@ static void pc_machine_set_max_fw_size(Object *obj, Visitor *v,
     pcms->max_fw_size = value;
 }
 
+static void pc_machine_get_xen_version(Object *obj, Visitor *v,
+                                       const char *name, void *opaque,
+                                       Error **errp)
+{
+    PCMachineState *pcms = PC_MACHINE(obj);
+    uint32_t value = pcms->xen_version;
+
+    visit_type_uint32(v, name, &value, errp);
+}
+
+static void pc_machine_set_xen_version(Object *obj, Visitor *v,
+                                       const char *name, void *opaque,
+                                       Error **errp)
+{
+    PCMachineState *pcms = PC_MACHINE(obj);
+    Error *error = NULL;
+    uint32_t value;
+
+    visit_type_uint32(v, name, &value, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
+    pcms->xen_version = value;
+}
 
 static void pc_machine_initfn(Object *obj)
 {
@@ -1978,6 +2004,12 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
         NULL, NULL);
     object_class_property_set_description(oc, PC_MACHINE_SMBIOS_EP,
         "SMBIOS Entry Point type [32, 64]");
+
+    object_class_property_add(oc, "xen-version", "uint32",
+        pc_machine_get_xen_version, pc_machine_set_xen_version,
+        NULL, NULL);
+    object_class_property_set_description(oc, "xen-version",
+        "Xen version to be emulated (in XENVER_version form e.g. 0x4000a for 4.10)");
 }
 
 static const TypeInfo pc_machine_info = {
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 0ad0ed1603..13286d0739 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -876,7 +876,10 @@ static void xenfv_4_2_machine_options(MachineClass *m)
     pc_i440fx_4_2_machine_options(m);
     m->desc = "Xen Fully-virtualized PC";
     m->max_cpus = HVM_MAX_VCPUS;
-    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
+    if (xen_enabled())
+            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
+    else
+            m->default_machine_opts = "accel=kvm,xen-version=0x40002";
 }
 
 DEFINE_PC_MACHINE(xenfv_4_2, "xenfv-4.2", pc_xen_hvm_init,
@@ -888,7 +891,10 @@ static void xenfv_3_1_machine_options(MachineClass *m)
     m->desc = "Xen Fully-virtualized PC";
     m->alias = "xenfv";
     m->max_cpus = HVM_MAX_VCPUS;
-    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
+    if (xen_enabled())
+            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
+    else
+            m->default_machine_opts = "accel=kvm,xen-version=0x30001";
 }
 
 DEFINE_PC_MACHINE(xenfv, "xenfv-3.1", pc_xen_hvm_init,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index c95333514e..9b14b18836 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -52,6 +52,9 @@ typedef struct PCMachineState {
     bool default_bus_bypass_iommu;
     uint64_t max_fw_size;
 
+    /* Xen HVM emulation */
+    uint32_t xen_version;
+
     /* ACPI Memory hotplug IO base address */
     hwaddr memhp_io_base;
 
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a213209379..0a2069b117 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -31,6 +31,7 @@
 #include "sysemu/runstate.h"
 #include "kvm_i386.h"
 #include "sev.h"
+#include "xen.h"
 #include "hyperv.h"
 #include "hyperv-proto.h"
 
@@ -774,6 +775,17 @@ static inline bool freq_within_bounds(int freq, int target_freq)
         return false;
 }
 
+static uint32_t kvm_arch_xen_version(MachineState *ms)
+{
+    uint32_t v = object_property_get_int(OBJECT(ms), "xen-version", NULL);
+
+    /* If it was unset, return zero */
+    if (v == (uint32_t) -1)
+            return 0;
+
+    return v;
+}
+
 static int kvm_arch_set_tsc_khz(CPUState *cs)
 {
     X86CPU *cpu = X86_CPU(cs);
@@ -2459,6 +2471,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
 {
     uint64_t identity_base = 0xfffbc000;
     uint64_t shadow_mem;
+    uint32_t xen_version;
     int ret;
     struct utsname utsname;
     Error *local_err = NULL;
@@ -2513,6 +2526,19 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    xen_version = kvm_arch_xen_version(ms);
+    if (xen_version) {
+#ifdef CONFIG_XEN_EMU
+            ret = kvm_xen_init(s, xen_version);
+            if (ret < 0) {
+                    return ret;
+            }
+#else
+            error_report("kvm: Xen support not enabled in qemu");
+            return -ENOTSUP;
+#endif
+    }
+
     ret = kvm_get_supported_msrs(s);
     if (ret < 0) {
         return ret;
diff --git a/target/i386/meson.build b/target/i386/meson.build
index ae38dc9563..9f3ef246b8 100644
--- a/target/i386/meson.build
+++ b/target/i386/meson.build
@@ -7,6 +7,7 @@ i386_ss.add(files(
   'cpu-dump.c',
 ))
 i386_ss.add(when: 'CONFIG_SEV', if_true: files('host-cpu.c'))
+i386_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen.c'))
 
 # x86 cpu type
 i386_ss.add(when: 'CONFIG_KVM', if_true: files('host-cpu.c'))
diff --git a/target/i386/xen.c b/target/i386/xen.c
new file mode 100644
index 0000000000..bc183dce4e
--- /dev/null
+++ b/target/i386/xen.c
@@ -0,0 +1,49 @@
+/*
+ * Xen HVM emulation support in KVM
+ *
+ * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
+ * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "kvm/kvm_i386.h"
+#include "xen.h"
+
+int kvm_xen_init(KVMState *s, uint32_t xen_version)
+{
+    const int required_caps = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR |
+        KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL | KVM_XEN_HVM_CONFIG_SHARED_INFO;
+    struct kvm_xen_hvm_config cfg = {
+        .msr = XEN_HYPERCALL_MSR,
+        .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
+    };
+    int xen_caps, ret;
+
+    xen_caps = kvm_check_extension(s, KVM_CAP_XEN_HVM);
+    if (required_caps & ~xen_caps) {
+        error_report("kvm: Xen HVM guest support not present or insufficient");
+        return -ENOSYS;
+    }
+
+    if (xen_caps & KVM_XEN_HVM_CONFIG_EVTCHN_SEND) {
+        struct kvm_xen_hvm_attr ha = {
+            .type = KVM_XEN_ATTR_TYPE_XEN_VERSION,
+            .u.xen_version = xen_version,
+        };
+        (void)kvm_vm_ioctl(s, KVM_XEN_HVM_SET_ATTR, &ha);
+
+        cfg.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
+    }
+
+    ret = kvm_vm_ioctl(s, KVM_XEN_HVM_CONFIG, &cfg);
+    if (ret < 0) {
+        error_report("kvm: Failed to enable Xen HVM support: %s", strerror(-ret));
+        return ret;
+    }
+
+    return 0;
+}
diff --git a/target/i386/xen.h b/target/i386/xen.h
new file mode 100644
index 0000000000..6c4f3b7822
--- /dev/null
+++ b/target/i386/xen.h
@@ -0,0 +1,19 @@
+/*
+ * Xen HVM emulation support in KVM
+ *
+ * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
+ * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_I386_XEN_H
+#define QEMU_I386_XEN_H
+
+#define XEN_HYPERCALL_MSR 0x40000000
+
+int kvm_xen_init(KVMState *s, uint32_t xen_version);
+
+#endif /* QEMU_I386_XEN_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (2 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 13:13   ` Paul Durrant
  2022-12-09  9:55 ` [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode David Woodhouse
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Introduce support for emulating CPUID for Xen HVM guests. It doesn't make
sense to advertise the KVM leaves to a Xen guest, so do it unconditionally
when the xen-version machine property is set.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Obtain xen_version from machine property, make it automatic]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/cpu.c     |  1 +
 target/i386/cpu.h     |  2 ++
 target/i386/kvm/kvm.c | 72 +++++++++++++++++++++++++++++++++++++++++--
 target/i386/xen.h     |  8 +++++
 4 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 22b681ca37..50aa95f134 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -7069,6 +7069,7 @@ static Property x86_cpu_properties[] = {
      * own cache information (see x86_cpu_load_def()).
      */
     DEFINE_PROP_BOOL("legacy-cache", X86CPU, legacy_cache, true),
+    DEFINE_PROP_BOOL("xen-vapic", X86CPU, xen_vapic, false),
 
     /*
      * From "Requirements for Implementing the Microsoft
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d4bc19577a..c6c57baed5 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1964,6 +1964,8 @@ struct ArchCPU {
     int32_t thread_id;
 
     int32_t hv_max_vps;
+
+    bool xen_vapic;
 };
 
 
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0a2069b117..0d3eddf9de 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -22,6 +22,7 @@
 
 #include <linux/kvm.h>
 #include "standard-headers/asm-x86/kvm_para.h"
+#include "standard-headers/xen/arch-x86/cpuid.h"
 
 #include "cpu.h"
 #include "host-cpu.h"
@@ -1752,14 +1753,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
     uint32_t limit, i, j, cpuid_i;
-    uint32_t unused;
+    uint32_t unused, xen_version = 0;
     struct kvm_cpuid_entry2 *c;
     uint32_t signature[3];
     int kvm_base = KVM_CPUID_SIGNATURE;
     int max_nested_state_len;
     int r;
     Error *local_err = NULL;
-
     memset(&cpuid_data, 0, sizeof(cpuid_data));
 
     cpuid_i = 0;
@@ -1811,7 +1811,73 @@ int kvm_arch_init_vcpu(CPUState *cs)
         has_msr_hv_hypercall = true;
     }
 
-    if (cpu->expose_kvm) {
+    xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
+    if (xen_version) {
+#ifdef CONFIG_XEN_EMU
+        struct kvm_cpuid_entry2 *xen_max_leaf;
+
+        memcpy(signature, "XenVMMXenVMM", 12);
+
+        xen_max_leaf = c = &cpuid_data.entries[cpuid_i++];
+        c->function = kvm_base + XEN_CPUID_SIGNATURE;
+        c->eax = kvm_base + XEN_CPUID_TIME;
+        c->ebx = signature[0];
+        c->ecx = signature[1];
+        c->edx = signature[2];
+
+        c = &cpuid_data.entries[cpuid_i++];
+        c->function = kvm_base + XEN_CPUID_VENDOR;
+        c->eax = xen_version;
+        c->ebx = 0;
+        c->ecx = 0;
+        c->edx = 0;
+
+        c = &cpuid_data.entries[cpuid_i++];
+        c->function = kvm_base + XEN_CPUID_HVM_MSR;
+        /* Number of hypercall-transfer pages */
+        c->eax = 1;
+        /* Hypercall MSR base address */
+        c->ebx = XEN_HYPERCALL_MSR;
+        c->ecx = 0;
+        c->edx = 0;
+
+        c = &cpuid_data.entries[cpuid_i++];
+        c->function = kvm_base + XEN_CPUID_TIME;
+        c->eax = ((!!tsc_is_stable_and_known(env) << 1) |
+            (!!(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP) << 2));
+        /* default=0 (emulate if necessary) */
+        c->ebx = 0;
+        /* guest tsc frequency */
+        c->ecx = env->user_tsc_khz;
+        /* guest tsc incarnation (migration count) */
+        c->edx = 0;
+
+        c = &cpuid_data.entries[cpuid_i++];
+        c->function = kvm_base + XEN_CPUID_HVM;
+        xen_max_leaf->eax = kvm_base + XEN_CPUID_HVM;
+        if (xen_version >= XEN_VERSION(4,5)) {
+            c->function = kvm_base + XEN_CPUID_HVM;
+
+            if (cpu->xen_vapic) {
+                c->eax |= XEN_HVM_CPUID_APIC_ACCESS_VIRT;
+                c->eax |= XEN_HVM_CPUID_X2APIC_VIRT;
+            }
+
+            c->eax |= XEN_HVM_CPUID_IOMMU_MAPPINGS;
+
+            if (xen_version >= XEN_VERSION(4,6)) {
+                c->eax |= XEN_HVM_CPUID_VCPU_ID_PRESENT;
+                c->ebx = cs->cpu_index;
+            }
+        }
+
+        kvm_base += 0x100;
+#else /* CONFIG_XEN_EMU */
+        /* This should never happen as kvm_arch_init() would have died first. */
+        fprintf(stderr, "Cannot enable Xen CPUID without Xen support\n");
+        abort();
+#endif
+    } else if (cpu->expose_kvm) {
         memcpy(signature, "KVMKVMKVM\0\0\0", 12);
         c = &cpuid_data.entries[cpuid_i++];
         c->function = KVM_CPUID_SIGNATURE | kvm_base;
diff --git a/target/i386/xen.h b/target/i386/xen.h
index 6c4f3b7822..ae880c47bc 100644
--- a/target/i386/xen.h
+++ b/target/i386/xen.h
@@ -14,6 +14,14 @@
 
 #define XEN_HYPERCALL_MSR 0x40000000
 
+#define XEN_CPUID_SIGNATURE        0
+#define XEN_CPUID_VENDOR           1
+#define XEN_CPUID_HVM_MSR          2
+#define XEN_CPUID_TIME             3
+#define XEN_CPUID_HVM              4
+
+#define XEN_VERSION(maj, min) ((maj) << 16 | (min))
+
 int kvm_xen_init(KVMState *s, uint32_t xen_version);
 
 #endif /* QEMU_I386_XEN_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (3 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 13:24   ` Paul Durrant
  2022-12-09  9:55 ` [RFC PATCH v2 06/22] hw/xen_backend: refactor xen_be_init() David Woodhouse
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

The only thing we need to handle on KVM side is to change the
pfn from R/W to R/O.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/xen/xen_platform.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/hw/i386/xen/xen_platform.c b/hw/i386/xen/xen_platform.c
index a64265cca0..914619d140 100644
--- a/hw/i386/xen/xen_platform.c
+++ b/hw/i386/xen/xen_platform.c
@@ -271,7 +271,10 @@ static void platform_fixed_ioport_writeb(void *opaque, uint32_t addr, uint32_t v
     case 0: /* Platform flags */ {
         hvmmem_type_t mem_type = (val & PFFLAG_ROM_LOCK) ?
             HVMMEM_ram_ro : HVMMEM_ram_rw;
-        if (xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40)) {
+        if (xen_mode == XEN_EMULATE) {
+            /* XXX */
+            s->flags = val & PFFLAG_ROM_LOCK;
+        } else if (xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40)) {
             DPRINTF("unable to change ro/rw state of ROM memory area!\n");
         } else {
             s->flags = val & PFFLAG_ROM_LOCK;
@@ -496,12 +499,6 @@ static void xen_platform_realize(PCIDevice *dev, Error **errp)
     PCIXenPlatformState *d = XEN_PLATFORM(dev);
     uint8_t *pci_conf;
 
-    /* Device will crash on reset if xen is not initialized */
-    if (!xen_enabled()) {
-        error_setg(errp, "xen-platform device requires the Xen accelerator");
-        return;
-    }
-
     pci_conf = dev->config;
 
     pci_set_word(pci_conf + PCI_COMMAND, PCI_COMMAND_IO | PCI_COMMAND_MEMORY);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 06/22] hw/xen_backend: refactor xen_be_init()
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (4 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 13:27   ` Paul Durrant
  2022-12-09  9:55 ` [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init David Woodhouse
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/xen/xen-legacy-backend.c         | 40 +++++++++++++++++++++--------
 include/hw/xen/xen-legacy-backend.h |  3 +++
 2 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
index 085fd31ef7..694e7bbc54 100644
--- a/hw/xen/xen-legacy-backend.c
+++ b/hw/xen/xen-legacy-backend.c
@@ -676,17 +676,40 @@ void xenstore_update_fe(char *watch, struct XenLegacyDevice *xendev)
 }
 /* -------------------------------------------------------------------- */
 
-int xen_be_init(void)
+int xen_be_xenstore_open(void)
 {
-    xengnttab_handle *gnttabdev;
-
     xenstore = xs_daemon_open();
     if (!xenstore) {
-        xen_pv_printf(NULL, 0, "can't connect to xenstored\n");
         return -1;
     }
 
     qemu_set_fd_handler(xs_fileno(xenstore), xenstore_update, NULL, NULL);
+    return 0;
+}
+
+void xen_be_xenstore_close(void)
+{
+    qemu_set_fd_handler(xs_fileno(xenstore), NULL, NULL, NULL);
+    xs_daemon_close(xenstore);
+    xenstore = NULL;
+}
+
+void xen_be_sysdev_init(void)
+{
+    xen_sysdev = qdev_new(TYPE_XENSYSDEV);
+    sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), &error_fatal);
+    xen_sysbus = qbus_new(TYPE_XENSYSBUS, xen_sysdev, "xen-sysbus");
+    qbus_set_bus_hotplug_handler(xen_sysbus);
+}
+
+int xen_be_init(void)
+{
+    xengnttab_handle *gnttabdev;
+
+    if (xen_be_xenstore_open()) {
+        xen_pv_printf(NULL, 0, "can't connect to xenstored\n");
+        return -1;
+    }
 
     if (xen_xc == NULL || xen_fmem == NULL) {
         /* Check if xen_init() have been called */
@@ -701,17 +724,12 @@ int xen_be_init(void)
         xengnttab_close(gnttabdev);
     }
 
-    xen_sysdev = qdev_new(TYPE_XENSYSDEV);
-    sysbus_realize_and_unref(SYS_BUS_DEVICE(xen_sysdev), &error_fatal);
-    xen_sysbus = qbus_new(TYPE_XENSYSBUS, xen_sysdev, "xen-sysbus");
-    qbus_set_bus_hotplug_handler(xen_sysbus);
+    xen_be_sysdev_init();
 
     return 0;
 
 err:
-    qemu_set_fd_handler(xs_fileno(xenstore), NULL, NULL, NULL);
-    xs_daemon_close(xenstore);
-    xenstore = NULL;
+    xen_be_xenstore_close();
 
     return -1;
 }
diff --git a/include/hw/xen/xen-legacy-backend.h b/include/hw/xen/xen-legacy-backend.h
index be281e1f38..0aa171f6c2 100644
--- a/include/hw/xen/xen-legacy-backend.h
+++ b/include/hw/xen/xen-legacy-backend.h
@@ -42,6 +42,9 @@ int xenstore_read_fe_uint64(struct XenLegacyDevice *xendev, const char *node,
 void xen_be_check_state(struct XenLegacyDevice *xendev);
 
 /* xen backend driver bits */
+int xen_be_xenstore_open(void);
+void xen_be_xenstore_close(void);
+void xen_be_sysdev_init(void);
 int xen_be_init(void);
 void xen_be_register_common(void);
 int xen_be_register(const char *type, struct XenDevOps *ops);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (5 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 06/22] hw/xen_backend: refactor xen_be_init() David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 13:47   ` Paul Durrant
  2022-12-09  9:55 ` [RFC PATCH v2 08/22] xen_platform: exclude vfio-pci from the PCI platform unplug David Woodhouse
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

And use newly added xen_emulated_machine_init() to iniitalize
the xenstore and the sysdev bus for future emulated devices.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Move it to xen-legacy-backend.c]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/pc_piix.c                   |  5 +++++
 hw/xen/xen-legacy-backend.c         | 22 ++++++++++++++++------
 include/hw/xen/xen-legacy-backend.h |  2 ++
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 13286d0739..3dcac2f4b6 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -47,6 +47,7 @@
 #include "hw/sysbus.h"
 #include "hw/i2c/smbus_eeprom.h"
 #include "hw/xen/xen-x86.h"
+#include "hw/xen/xen-legacy-backend.h"
 #include "exec/memory.h"
 #include "hw/acpi/acpi.h"
 #include "hw/acpi/piix4.h"
@@ -155,6 +156,10 @@ static void pc_init1(MachineState *machine,
             x86ms->above_4g_mem_size = 0;
             x86ms->below_4g_mem_size = machine->ram_size;
         }
+
+        if (pcms->xen_version && !xen_be_xenstore_open()) {
+            xen_emulated_machine_init();
+        }
     }
 
     pc_machine_init_sgx_epc(pcms);
diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
index 694e7bbc54..60a7bc7ab6 100644
--- a/hw/xen/xen-legacy-backend.c
+++ b/hw/xen/xen-legacy-backend.c
@@ -31,6 +31,7 @@
 #include "qapi/error.h"
 #include "hw/xen/xen-legacy-backend.h"
 #include "hw/xen/xen_pvdev.h"
+#include "hw/xen/xen-bus.h"
 #include "monitor/qdev.h"
 
 DeviceState *xen_sysdev;
@@ -294,13 +295,15 @@ static struct XenLegacyDevice *xen_be_get_xendev(const char *type, int dom,
     xendev->debug      = debug;
     xendev->local_port = -1;
 
-    xendev->evtchndev = xenevtchn_open(NULL, 0);
-    if (xendev->evtchndev == NULL) {
-        xen_pv_printf(NULL, 0, "can't open evtchn device\n");
-        qdev_unplug(DEVICE(xendev), NULL);
-        return NULL;
+    if (xen_mode != XEN_EMULATE) {
+        xendev->evtchndev = xenevtchn_open(NULL, 0);
+        if (xendev->evtchndev == NULL) {
+            xen_pv_printf(NULL, 0, "can't open evtchn device\n");
+            qdev_unplug(DEVICE(xendev), NULL);
+            return NULL;
+        }
+        qemu_set_cloexec(xenevtchn_fd(xendev->evtchndev));
     }
-    qemu_set_cloexec(xenevtchn_fd(xendev->evtchndev));
 
     xen_pv_insert_xendev(xendev);
 
@@ -859,3 +862,10 @@ static void xenbe_register_types(void)
 }
 
 type_init(xenbe_register_types)
+
+void xen_emulated_machine_init(void)
+{
+    xen_bus_init();
+    xen_be_sysdev_init();
+    xen_be_register_common();
+}
diff --git a/include/hw/xen/xen-legacy-backend.h b/include/hw/xen/xen-legacy-backend.h
index 0aa171f6c2..aa09015662 100644
--- a/include/hw/xen/xen-legacy-backend.h
+++ b/include/hw/xen/xen-legacy-backend.h
@@ -105,4 +105,6 @@ int xen_config_dev_vfb(int vdev, const char *type);
 int xen_config_dev_vkbd(int vdev);
 int xen_config_dev_console(int vdev);
 
+void xen_emulated_machine_init(void);
+
 #endif /* HW_XEN_LEGACY_BACKEND_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 08/22] xen_platform: exclude vfio-pci from the PCI platform unplug
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (6 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 13:52   ` Paul Durrant
  2022-12-09  9:55 ` [RFC PATCH v2 09/22] pc_piix: allow xenfv machine with XEN_EMULATE David Woodhouse
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Such that PCI passthrough devices work for Xen emulated guests.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/xen/xen_platform.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/hw/i386/xen/xen_platform.c b/hw/i386/xen/xen_platform.c
index 914619d140..15d5ae7c69 100644
--- a/hw/i386/xen/xen_platform.c
+++ b/hw/i386/xen/xen_platform.c
@@ -109,12 +109,25 @@ static void log_writeb(PCIXenPlatformState *s, char val)
 #define _UNPLUG_NVME_DISKS 3
 #define UNPLUG_NVME_DISKS (1u << _UNPLUG_NVME_DISKS)
 
+static bool pci_device_is_passthrough(PCIDevice *d)
+{
+    if (!strcmp(d->name, "xen-pci-passthrough")) {
+        return true;
+    }
+
+    if (xen_mode == XEN_EMULATE && !strcmp(d->name, "vfio-pci")) {
+        return true;
+    }
+
+    return false;
+}
+
 static void unplug_nic(PCIBus *b, PCIDevice *d, void *o)
 {
     /* We have to ignore passthrough devices */
     if (pci_get_word(d->config + PCI_CLASS_DEVICE) ==
             PCI_CLASS_NETWORK_ETHERNET
-            && strcmp(d->name, "xen-pci-passthrough") != 0) {
+            && !pci_device_is_passthrough(d)) {
         object_unparent(OBJECT(d));
     }
 }
@@ -187,9 +200,8 @@ static void unplug_disks(PCIBus *b, PCIDevice *d, void *opaque)
         !(flags & UNPLUG_IDE_SCSI_DISKS);
 
     /* We have to ignore passthrough devices */
-    if (!strcmp(d->name, "xen-pci-passthrough")) {
+    if (pci_device_is_passthrough(d))
         return;
-    }
 
     switch (pci_get_word(d->config + PCI_CLASS_DEVICE)) {
     case PCI_CLASS_STORAGE_IDE:
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 09/22] pc_piix: allow xenfv machine with XEN_EMULATE
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (7 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 08/22] xen_platform: exclude vfio-pci from the PCI platform unplug David Woodhouse
@ 2022-12-09  9:55 ` David Woodhouse
  2022-12-12 14:05   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls David Woodhouse
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

This allows -machine xenfv to work with Xen emulated guests.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/pc_piix.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 3dcac2f4b6..d1127adde0 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -404,8 +404,8 @@ static void pc_xen_hvm_init(MachineState *machine)
 {
     PCMachineState *pcms = PC_MACHINE(machine);
 
-    if (!xen_enabled()) {
-        error_report("xenfv machine requires the xen accelerator");
+    if (!xen_enabled() && (xen_mode != XEN_EMULATE)) {
+        error_report("xenfv machine requires the xen or kvm accelerator");
         exit(1);
     }
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (8 preceding siblings ...)
  2022-12-09  9:55 ` [RFC PATCH v2 09/22] pc_piix: allow xenfv machine with XEN_EMULATE David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:11   ` Paul Durrant
  2022-12-12 17:07   ` Paolo Bonzini
  2022-12-09  9:56 ` [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version David Woodhouse
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

This means handling the new exit reason for Xen but still
crashing on purpose. As we implement each of the hypercalls
we will then return the right return code.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Add CPL to hypercall tracing, disallow hypercalls from CPL > 0]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/kvm/kvm.c    |  5 +++++
 target/i386/trace-events |  3 +++
 target/i386/xen.c        | 39 +++++++++++++++++++++++++++++++++++++++
 target/i386/xen.h        |  1 +
 4 files changed, 48 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0d3eddf9de..ebde6bc204 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5471,6 +5471,11 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
         assert(run->msr.reason == KVM_MSR_EXIT_REASON_FILTER);
         ret = kvm_handle_wrmsr(cpu, run);
         break;
+#ifdef CONFIG_XEN_EMU
+    case KVM_EXIT_XEN:
+        ret = kvm_xen_handle_exit(cpu, &run->xen);
+        break;
+#endif
     default:
         fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
         ret = -1;
diff --git a/target/i386/trace-events b/target/i386/trace-events
index 2cd8726eeb..1bf9558811 100644
--- a/target/i386/trace-events
+++ b/target/i386/trace-events
@@ -11,3 +11,6 @@ kvm_sev_launch_measurement(const char *value) "data %s"
 kvm_sev_launch_finish(void) ""
 kvm_sev_launch_secret(uint64_t hpa, uint64_t hva, uint64_t secret, int len) "hpa 0x%" PRIx64 " hva 0x%" PRIx64 " data 0x%" PRIx64 " len %d"
 kvm_sev_attestation_report(const char *mnonce, const char *data) "mnonce %s data %s"
+
+# target/i386/xen.c
+kvm_xen_hypercall(int cpu, uint8_t cpl, uint64_t input, uint64_t a0, uint64_t a1, uint64_t a2, uint64_t ret) "xen_hypercall: cpu %d cpl %d input %" PRIu64 " a0 0x%" PRIx64 " a1 0x%" PRIx64 " a2 0x%" PRIx64" ret 0x%" PRIx64
diff --git a/target/i386/xen.c b/target/i386/xen.c
index bc183dce4e..708ab908a0 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -10,8 +10,10 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/log.h"
 #include "kvm/kvm_i386.h"
 #include "xen.h"
+#include "trace.h"
 
 int kvm_xen_init(KVMState *s, uint32_t xen_version)
 {
@@ -47,3 +49,40 @@ int kvm_xen_init(KVMState *s, uint32_t xen_version)
 
     return 0;
 }
+
+static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
+{
+    uint16_t code = exit->u.hcall.input;
+
+    if (exit->u.hcall.cpl > 0) {
+        exit->u.hcall.result = -EPERM;
+        return true;
+    }
+
+    switch (code) {
+    default:
+        return false;
+    }
+}
+
+int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
+{
+    if (exit->type != KVM_EXIT_XEN_HCALL)
+        return -1;
+
+    if (!__kvm_xen_handle_exit(cpu, exit)) {
+        /* Some hypercalls will be deliberately "implemented" by returning
+         * -ENOSYS. This case is for hypercalls which are unexpected. */
+        exit->u.hcall.result = -ENOSYS;
+        qemu_log_mask(LOG_GUEST_ERROR, "Unimplemented Xen hypercall %"
+                      PRId64 " (0x%" PRIx64 " 0x%" PRIx64 " 0x%" PRIx64 ")\n",
+                      (uint64_t)exit->u.hcall.input, (uint64_t)exit->u.hcall.params[0],
+                      (uint64_t)exit->u.hcall.params[1], (uint64_t)exit->u.hcall.params[1]);
+    }
+
+    trace_kvm_xen_hypercall(CPU(cpu)->cpu_index, exit->u.hcall.cpl,
+                            exit->u.hcall.input, exit->u.hcall.params[0],
+                            exit->u.hcall.params[1], exit->u.hcall.params[2],
+                            exit->u.hcall.result);
+    return 0;
+}
diff --git a/target/i386/xen.h b/target/i386/xen.h
index ae880c47bc..9134d78685 100644
--- a/target/i386/xen.h
+++ b/target/i386/xen.h
@@ -23,5 +23,6 @@
 #define XEN_VERSION(maj, min) ((maj) << 16 | (min))
 
 int kvm_xen_init(KVMState *s, uint32_t xen_version);
+int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit);
 
 #endif /* QEMU_I386_XEN_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (9 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:17   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages David Woodhouse
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

This is just meant to serve as an example on how we can implement
hypercalls. xen_version specifically since Qemu does all kind of
feature controllability. So handling that here seems appropriate.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Implement kvm_gva_rw() safely]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/xen.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/target/i386/xen.c b/target/i386/xen.c
index 708ab908a0..55beed1913 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -12,9 +12,51 @@
 #include "qemu/osdep.h"
 #include "qemu/log.h"
 #include "kvm/kvm_i386.h"
+#include "exec/address-spaces.h"
 #include "xen.h"
 #include "trace.h"
 
+#include "standard-headers/xen/version.h"
+
+static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
+                      bool is_write)
+{
+    uint8_t *buf = (uint8_t *)_buf;
+    size_t i = 0, len = 0;
+    int ret;
+
+    for (i = 0; i < sz; i+= len) {
+        struct kvm_translation tr = {
+            .linear_address = gva + i,
+        };
+
+        len = TARGET_PAGE_SIZE - (tr.linear_address & ~TARGET_PAGE_MASK);
+        if (len > sz)
+            len = sz;
+
+        ret = kvm_vcpu_ioctl(cs, KVM_TRANSLATE, &tr);
+        if (ret || !tr.valid || (is_write && !tr.writeable)) {
+            return -EFAULT;
+        }
+
+        cpu_physical_memory_rw(tr.physical_address, buf + i, len, is_write);
+    }
+
+    return 0;
+}
+
+static inline int kvm_copy_from_gva(CPUState *cs, uint64_t gva, void *buf,
+                                    size_t sz)
+{
+    return kvm_gva_rw(cs, gva, buf, sz, false);
+}
+
+static inline int kvm_copy_to_gva(CPUState *cs, uint64_t gva, void *buf,
+                                  size_t sz)
+{
+    return kvm_gva_rw(cs, gva, buf, sz, false);
+}
+
 int kvm_xen_init(KVMState *s, uint32_t xen_version)
 {
     const int required_caps = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR |
@@ -50,6 +92,40 @@ int kvm_xen_init(KVMState *s, uint32_t xen_version)
     return 0;
 }
 
+static bool kvm_xen_hcall_xen_version(struct kvm_xen_exit *exit, X86CPU *cpu,
+                                     int cmd, uint64_t arg)
+{
+    int err = 0;
+
+    switch (cmd) {
+    case XENVER_get_features: {
+        struct xen_feature_info fi;
+
+        err = kvm_copy_from_gva(CPU(cpu), arg, &fi, sizeof(fi));
+        if (err) {
+            break;
+        }
+
+        fi.submap = 0;
+        if (fi.submap_idx == 0) {
+            fi.submap |= 1 << XENFEAT_writable_page_tables |
+                         1 << XENFEAT_writable_descriptor_tables |
+                         1 << XENFEAT_auto_translated_physmap |
+                         1 << XENFEAT_supervisor_mode_kernel;
+        }
+
+        err = kvm_copy_to_gva(CPU(cpu), arg, &fi, sizeof(fi));
+        break;
+    }
+
+    default:
+            return false;
+    }
+
+    exit->u.hcall.result = err;
+    return true;
+}
+
 static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
 {
     uint16_t code = exit->u.hcall.input;
@@ -60,6 +136,9 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
     }
 
     switch (code) {
+    case __HYPERVISOR_xen_version:
+        return kvm_xen_hcall_xen_version(exit, cpu, exit->u.hcall.params[0],
+                                         exit->u.hcall.params[1]);
     default:
         return false;
     }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (10 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:29   ` Paul Durrant
  2022-12-12 17:14   ` Paolo Bonzini
  2022-12-09  9:56 ` [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op David Woodhouse
                   ` (9 subsequent siblings)
  21 siblings, 2 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: David Woodhouse <dwmw@amazon.co.uk>

For the shared info page and for grant tables, Xen shares its own pages
from the "Xen heap" to the guest. The guest requests that a given page
from a certain address space (XENMAPSPACE_shared_info, etc.) be mapped
to a given GPA using the XENMEM_add_to_physmap hypercall.

To support that in qemu when *emulating* Xen, create a memory region
(migratable) and allow it to be mapped as an overlay when requested.

Xen theoretically allows the same page to be mapped multiple times
into the guest, but that's hard to track and reinstate over migration,
so we automatically *unmap* any previous mapping when creating a new
one. This approach has been used in production with.... a non-trivial
number of guests expecting true Xen, without any problems yet being
noticed.

This adds just the shared info page for now. The grant tables will be
a larger region, and will need to be overlaid one page at a time. I
think that means I need to create separate aliases for each page of
the overall grant_frames region, so that they can be mapped individually.

Expecting some heckling at the use of xen_overlay_singleton. What is
the best way to do that? Using qemu_find_recursive() every time seemed
a bit wrong. But I suppose mapping it into the *guest* isn't a fast
path, and if the actual grant table code is allowed to just stash the
pointer it gets from xen_overlay_page_ptr() for later use then that
isn't a fast path for device I/O either.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/kvm/meson.build   |   1 +
 hw/i386/kvm/xen_overlay.c | 198 ++++++++++++++++++++++++++++++++++++++
 hw/i386/kvm/xen_overlay.h |  14 +++
 hw/i386/pc_piix.c         |   8 ++
 4 files changed, 221 insertions(+)
 create mode 100644 hw/i386/kvm/xen_overlay.c
 create mode 100644 hw/i386/kvm/xen_overlay.h

diff --git a/hw/i386/kvm/meson.build b/hw/i386/kvm/meson.build
index 95467f1ded..6165cbf019 100644
--- a/hw/i386/kvm/meson.build
+++ b/hw/i386/kvm/meson.build
@@ -4,5 +4,6 @@ i386_kvm_ss.add(when: 'CONFIG_APIC', if_true: files('apic.c'))
 i386_kvm_ss.add(when: 'CONFIG_I8254', if_true: files('i8254.c'))
 i386_kvm_ss.add(when: 'CONFIG_I8259', if_true: files('i8259.c'))
 i386_kvm_ss.add(when: 'CONFIG_IOAPIC', if_true: files('ioapic.c'))
+i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen_overlay.c'))
 
 i386_ss.add_all(when: 'CONFIG_KVM', if_true: i386_kvm_ss)
diff --git a/hw/i386/kvm/xen_overlay.c b/hw/i386/kvm/xen_overlay.c
new file mode 100644
index 0000000000..c3eeb8dae8
--- /dev/null
+++ b/hw/i386/kvm/xen_overlay.c
@@ -0,0 +1,198 @@
+/*
+ * QEMU Xen emulation: Shared/overlay pages support
+ *
+ * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Authors: David Woodhouse <dwmw2@infradead.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+#include "qemu/module.h"
+#include "qemu/main-loop.h"
+#include "qapi/error.h"
+#include "qom/object.h"
+#include "exec/target_page.h"
+#include "exec/address-spaces.h"
+#include "migration/vmstate.h"
+
+#include "hw/sysbus.h"
+#include "hw/xen/xen.h"
+#include "xen_overlay.h"
+
+#include "sysemu/kvm.h"
+#include <linux/kvm.h>
+
+#include "standard-headers/xen/memory.h"
+
+static int xen_overlay_map_page_locked(uint32_t space, uint64_t idx, uint64_t gpa);
+
+#define INVALID_GPA UINT64_MAX
+#define INVALID_GFN UINT64_MAX
+
+#define TYPE_XEN_OVERLAY "xenoverlay"
+OBJECT_DECLARE_SIMPLE_TYPE(XenOverlayState, XEN_OVERLAY)
+
+#define XEN_PAGE_SHIFT 12
+#define XEN_PAGE_SIZE (1ULL << XEN_PAGE_SHIFT)
+
+struct XenOverlayState {
+    /*< private >*/
+    SysBusDevice busdev;
+    /*< public >*/
+
+    MemoryRegion shinfo_mem;
+    void *shinfo_ptr;
+    uint64_t shinfo_gpa;
+};
+
+struct XenOverlayState *xen_overlay_singleton;
+
+static void xen_overlay_realize(DeviceState *dev, Error **errp)
+{
+    XenOverlayState *s = XEN_OVERLAY(dev);
+
+    if (xen_mode != XEN_EMULATE) {
+        error_setg(errp, "Xen overlay page support is for Xen emulation");
+        return;
+    }
+
+    memory_region_init_ram(&s->shinfo_mem, OBJECT(dev), "xen:shared_info", XEN_PAGE_SIZE, &error_abort);
+    memory_region_set_enabled(&s->shinfo_mem, true);
+    s->shinfo_ptr = memory_region_get_ram_ptr(&s->shinfo_mem);
+    s->shinfo_gpa = INVALID_GPA;
+    memset(s->shinfo_ptr, 0, XEN_PAGE_SIZE);
+}
+
+static int xen_overlay_post_load(void *opaque, int version_id)
+{
+    XenOverlayState *s = opaque;
+
+    if (s->shinfo_gpa != INVALID_GPA) {
+            xen_overlay_map_page_locked(XENMAPSPACE_shared_info, 0, s->shinfo_gpa);
+    }
+
+    return 0;
+}
+
+static bool xen_overlay_is_needed(void *opaque)
+{
+    return xen_mode == XEN_EMULATE;
+}
+
+static const VMStateDescription xen_overlay_vmstate = {
+    .name = "xen_overlay",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = xen_overlay_is_needed,
+    .post_load = xen_overlay_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(shinfo_gpa, XenOverlayState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void xen_overlay_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = xen_overlay_realize;
+    dc->vmsd = &xen_overlay_vmstate;
+}
+
+static const TypeInfo xen_overlay_info = {
+    .name          = TYPE_XEN_OVERLAY,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(XenOverlayState),
+    .class_init    = xen_overlay_class_init,
+};
+
+void xen_overlay_create(void)
+{
+    xen_overlay_singleton = XEN_OVERLAY(sysbus_create_simple(TYPE_XEN_OVERLAY, -1, NULL));
+}
+
+static void xen_overlay_register_types(void)
+{
+    type_register_static(&xen_overlay_info);
+}
+
+type_init(xen_overlay_register_types)
+
+int xen_overlay_map_page(uint32_t space, uint64_t idx, uint64_t gpa)
+{
+    int ret;
+
+    qemu_mutex_lock_iothread();
+    ret = xen_overlay_map_page_locked(space, idx, gpa);
+    qemu_mutex_unlock_iothread();
+
+    return ret;
+}
+
+/* KVM is the only existing back end for now. Let's not overengineer it yet. */
+static int xen_overlay_set_be_shinfo(uint64_t gfn)
+{
+    struct kvm_xen_hvm_attr xa = {
+        .type = KVM_XEN_ATTR_TYPE_SHARED_INFO,
+        .u.shared_info.gfn = gfn,
+    };
+
+    return kvm_vm_ioctl(kvm_state, KVM_XEN_HVM_SET_ATTR, &xa);
+}
+
+static int xen_overlay_map_page_locked(uint32_t space, uint64_t idx, uint64_t gpa)
+{
+    MemoryRegion *ovl_page;
+    int err;
+
+    if (space != XENMAPSPACE_shared_info || idx != 0)
+        return -EINVAL;
+
+    if (!xen_overlay_singleton)
+        return -ENOENT;
+
+    ovl_page = &xen_overlay_singleton->shinfo_mem;
+
+    /* Xen allows guests to map the same page as many times as it likes
+     * into guest physical frames. We don't, because it would be hard
+     * to track and restore them all. One mapping of each page is
+     * perfectly sufficient for all known guests... and we've tested
+     * that theory on a few now in other implementations. dwmw2. */
+    if (memory_region_is_mapped(ovl_page)) {
+        if (gpa == INVALID_GPA) {
+            /* If removing shinfo page, turn the kernel magic off first */
+            if (space == XENMAPSPACE_shared_info) {
+                err = xen_overlay_set_be_shinfo(INVALID_GFN);
+                if (err)
+                    return err;
+            }
+            memory_region_del_subregion(get_system_memory(), ovl_page);
+            goto done;
+        } else {
+            /* Just move it */
+            memory_region_set_address(ovl_page, gpa);
+        }
+    } else if (gpa != INVALID_GPA) {
+        memory_region_add_subregion_overlap(get_system_memory(), gpa, ovl_page, 0);
+    }
+
+    xen_overlay_set_be_shinfo(gpa >> XEN_PAGE_SHIFT);
+ done:
+    xen_overlay_singleton->shinfo_gpa = gpa;
+    return 0;
+}
+
+void *xen_overlay_page_ptr(uint32_t space, uint64_t idx)
+{
+    if (space != XENMAPSPACE_shared_info || idx != 0)
+        return NULL;
+
+    if (!xen_overlay_singleton)
+        return NULL;
+
+    return xen_overlay_singleton->shinfo_ptr;
+}
diff --git a/hw/i386/kvm/xen_overlay.h b/hw/i386/kvm/xen_overlay.h
new file mode 100644
index 0000000000..afc63991ea
--- /dev/null
+++ b/hw/i386/kvm/xen_overlay.h
@@ -0,0 +1,14 @@
+/*
+ * QEMU Xen emulation: Shared/overlay pages support
+ *
+ * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Authors: David Woodhouse <dwmw2@infradead.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+void xen_overlay_create(void);
+int xen_overlay_map_page(uint32_t space, uint64_t idx, uint64_t gpa);
+void *xen_overlay_page_ptr(uint32_t space, uint64_t idx);
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index d1127adde0..c3c61eedde 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -58,6 +58,9 @@
 #include <xen/hvm/hvm_info_table.h>
 #include "hw/xen/xen_pt.h"
 #endif
+#ifdef CONFIG_XEN_EMU
+#include "hw/i386/kvm/xen_overlay.h"
+#endif
 #include "migration/global_state.h"
 #include "migration/misc.h"
 #include "sysemu/numa.h"
@@ -411,6 +414,11 @@ static void pc_xen_hvm_init(MachineState *machine)
 
     pc_xen_hvm_init_pci(machine);
     pci_create_simple(pcms->bus, -1, "xen-platform");
+#ifdef CONFIG_XEN_EMU
+    if (xen_mode == XEN_EMULATE) {
+            xen_overlay_create();
+    }
+#endif
 }
 #endif
 
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (11 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:38   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 14/22] i386/xen: implement HYPERVISOR_hvm_op David Woodhouse
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Specifically XENMEM_add_to_physmap with space XENMAPSPACE_shared_info to
allow the guest to set its shared_info page.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Use the xen_overlay device]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/trace-events |  1 +
 target/i386/xen.c        | 59 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/target/i386/trace-events b/target/i386/trace-events
index 1bf9558811..fb999d0052 100644
--- a/target/i386/trace-events
+++ b/target/i386/trace-events
@@ -14,3 +14,4 @@ kvm_sev_attestation_report(const char *mnonce, const char *data) "mnonce %s data
 
 # target/i386/xen.c
 kvm_xen_hypercall(int cpu, uint8_t cpl, uint64_t input, uint64_t a0, uint64_t a1, uint64_t a2, uint64_t ret) "xen_hypercall: cpu %d cpl %d input %" PRIu64 " a0 0x%" PRIx64 " a1 0x%" PRIx64 " a2 0x%" PRIx64" ret 0x%" PRIx64
+kvm_xen_set_shared_info(uint64_t gfn) "shared info at gfn 0x%" PRIx64
diff --git a/target/i386/xen.c b/target/i386/xen.c
index 55beed1913..ddd144039a 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -15,8 +15,9 @@
 #include "exec/address-spaces.h"
 #include "xen.h"
 #include "trace.h"
-
+#include "hw/i386/kvm/xen_overlay.h"
 #include "standard-headers/xen/version.h"
+#include "standard-headers/xen/memory.h"
 
 static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
                       bool is_write)
@@ -126,6 +127,59 @@ static bool kvm_xen_hcall_xen_version(struct kvm_xen_exit *exit, X86CPU *cpu,
     return true;
 }
 
+static int xen_set_shared_info(CPUState *cs, uint64_t gfn)
+{
+    uint64_t gpa = gfn << TARGET_PAGE_BITS;
+    int err;
+
+    /* The xen_overlay device tells KVM about it too, since it had to
+     * do that on migration load anyway (unless we're going to jump
+     * through lots of hoops to maintain the fiction that this isn't
+     * KVM-specific */
+    err = xen_overlay_map_page(XENMAPSPACE_shared_info, 0, gpa);
+    if (err)
+            return err;
+
+    trace_kvm_xen_set_shared_info(gfn);
+
+    return err;
+}
+
+static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
+                                   int cmd, uint64_t arg, X86CPU *cpu)
+{
+    CPUState *cs = CPU(cpu);
+    int err = 0;
+
+    switch (cmd) {
+    case XENMEM_add_to_physmap: {
+            struct xen_add_to_physmap xatp;
+
+            err = kvm_copy_from_gva(cs, arg, &xatp, sizeof(xatp));
+            if (err) {
+                break;
+            }
+
+            switch (xatp.space) {
+            case XENMAPSPACE_shared_info:
+                break;
+            default:
+                err = -ENOSYS;
+                break;
+            }
+
+            err = xen_set_shared_info(cs, xatp.gpfn);
+            break;
+         }
+
+    default:
+            return false;
+    }
+
+    exit->u.hcall.result = err;
+    return true;
+}
+
 static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
 {
     uint16_t code = exit->u.hcall.input;
@@ -136,6 +190,9 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
     }
 
     switch (code) {
+    case __HYPERVISOR_memory_op:
+        return kvm_xen_hcall_memory_op(exit, exit->u.hcall.params[0],
+                                       exit->u.hcall.params[1], cpu);
     case __HYPERVISOR_xen_version:
         return kvm_xen_hcall_xen_version(exit, cpu, exit->u.hcall.params[0],
                                          exit->u.hcall.params[1]);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 14/22] i386/xen: implement HYPERVISOR_hvm_op
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (12 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:41   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op David Woodhouse
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

This is when guest queries for support for HVMOP_pagetable_dying.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/xen.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/target/i386/xen.c b/target/i386/xen.c
index ddd144039a..2847b4f864 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -18,6 +18,7 @@
 #include "hw/i386/kvm/xen_overlay.h"
 #include "standard-headers/xen/version.h"
 #include "standard-headers/xen/memory.h"
+#include "standard-headers/xen/hvm/hvm_op.h"
 
 static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
                       bool is_write)
@@ -180,6 +181,19 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
     return true;
 }
 
+static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
+                                 int cmd, uint64_t arg)
+{
+    switch (cmd) {
+    case HVMOP_pagetable_dying:
+            exit->u.hcall.result = -ENOSYS;
+            return true;
+
+    default:
+            return false;
+    }
+}
+
 static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
 {
     uint16_t code = exit->u.hcall.input;
@@ -190,6 +204,9 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
     }
 
     switch (code) {
+    case __HYPERVISOR_hvm_op:
+        return kvm_xen_hcall_hvm_op(exit, exit->u.hcall.params[0],
+                                    exit->u.hcall.params[1]);
     case __HYPERVISOR_memory_op:
         return kvm_xen_hcall_memory_op(exit, exit->u.hcall.params[0],
                                        exit->u.hcall.params[1], cpu);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (13 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 14/22] i386/xen: implement HYPERVISOR_hvm_op David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:51   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info David Woodhouse
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

This is simply when guest tries to register a vcpu_info
and since vcpu_info placement is optional in the minimum ABI
therefore we can just fail with -ENOSYS

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/xen.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/target/i386/xen.c b/target/i386/xen.c
index 2847b4f864..9d1daadee1 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -19,6 +19,7 @@
 #include "standard-headers/xen/version.h"
 #include "standard-headers/xen/memory.h"
 #include "standard-headers/xen/hvm/hvm_op.h"
+#include "standard-headers/xen/vcpu.h"
 
 static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
                       bool is_write)
@@ -194,6 +195,25 @@ static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
     }
 }
 
+static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
+                                  int cmd, int vcpu_id, uint64_t arg)
+{
+    int err;
+
+    switch (cmd) {
+    case VCPUOP_register_vcpu_info:
+        /* no vcpu info placement for now */
+        err = -ENOSYS;
+        break;
+
+    default:
+        return false;
+    }
+
+    exit->u.hcall.result = err;
+    return true;
+}
+
 static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
 {
     uint16_t code = exit->u.hcall.input;
@@ -204,6 +224,11 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
     }
 
     switch (code) {
+    case __HYPERVISOR_vcpu_op:
+        return kvm_xen_hcall_vcpu_op(exit, cpu,
+                                     exit->u.hcall.params[0],
+                                     exit->u.hcall.params[1],
+                                     exit->u.hcall.params[2]);
     case __HYPERVISOR_hvm_op:
         return kvm_xen_hcall_hvm_op(exit, exit->u.hcall.params[0],
                                     exit->u.hcall.params[1]);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (14 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 14:58   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 17/22] i386/xen: handle VCPUOP_register_vcpu_time_info David Woodhouse
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Handle the hypercall to set a per vcpu info, and also wire up the default
vcpu_info in the shared_info page for the first 32 vCPUs.

To avoid deadlock within KVM a vCPU thread must set its *own* vcpu_info
rather than it being set from the context in which the hypercall is
invoked.

Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for migration,
and restore it in kvm_arch_put_registers() appropriately.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/cpu.h        |  2 ++
 target/i386/kvm/kvm.c    | 19 +++++++++++
 target/i386/machine.c    | 21 ++++++++++++
 target/i386/trace-events |  1 +
 target/i386/xen.c        | 74 +++++++++++++++++++++++++++++++++++++---
 target/i386/xen.h        |  1 +
 6 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index c6c57baed5..109b2e5669 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1788,6 +1788,8 @@ typedef struct CPUArchState {
 #endif
 #if defined(CONFIG_KVM)
     struct kvm_nested_state *nested_state;
+    uint64_t xen_vcpu_info_gpa;
+    uint64_t xen_vcpu_info_default_gpa;
 #endif
 #if defined(CONFIG_HVF)
     HVFX86LazyFlags hvf_lflags;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index ebde6bc204..fa45e2f99a 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1811,6 +1811,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
         has_msr_hv_hypercall = true;
     }
 
+    env->xen_vcpu_info_gpa = UINT64_MAX;
+    env->xen_vcpu_info_default_gpa = UINT64_MAX;
+
     xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
     if (xen_version) {
 #ifdef CONFIG_XEN_EMU
@@ -4728,6 +4731,22 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
         kvm_arch_set_tsc_khz(cpu);
     }
 
+#ifdef CONFIG_XEN_EMU
+    if (level == KVM_PUT_FULL_STATE) {
+        uint64_t gpa = x86_cpu->env.xen_vcpu_info_gpa;
+        if (gpa == UINT64_MAX) {
+            gpa = x86_cpu->env.xen_vcpu_info_default_gpa;
+        }
+
+        if (gpa != UINT64_MAX) {
+            ret = kvm_xen_set_vcpu_attr(cpu, KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO, gpa);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+    }
+#endif
+
     ret = kvm_getput_regs(x86_cpu, 1);
     if (ret < 0) {
         return ret;
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 310b125235..104cd6047c 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1257,6 +1257,26 @@ static const VMStateDescription vmstate_nested_state = {
     }
 };
 
+static bool xen_vcpu_needed(void *opaque)
+{
+    X86CPU *cpu = opaque;
+    CPUX86State *env = &cpu->env;
+
+    return (env->xen_vcpu_info_gpa != UINT64_MAX ||
+            env->xen_vcpu_info_default_gpa != UINT64_MAX);
+}
+
+static const VMStateDescription vmstate_xen_vcpu = {
+    .name = "cpu/xen_vcpu",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = xen_vcpu_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(env.xen_vcpu_info_gpa, X86CPU),
+        VMSTATE_UINT64(env.xen_vcpu_info_default_gpa, X86CPU),
+        VMSTATE_END_OF_LIST()
+    }
+};
 #endif
 
 static bool mcg_ext_ctl_needed(void *opaque)
@@ -1716,6 +1736,7 @@ const VMStateDescription vmstate_x86_cpu = {
 #endif
 #ifdef CONFIG_KVM
         &vmstate_nested_state,
+        &vmstate_xen_vcpu,
 #endif
         &vmstate_msr_tsx_ctrl,
         &vmstate_msr_intel_sgx,
diff --git a/target/i386/trace-events b/target/i386/trace-events
index fb999d0052..7118640697 100644
--- a/target/i386/trace-events
+++ b/target/i386/trace-events
@@ -15,3 +15,4 @@ kvm_sev_attestation_report(const char *mnonce, const char *data) "mnonce %s data
 # target/i386/xen.c
 kvm_xen_hypercall(int cpu, uint8_t cpl, uint64_t input, uint64_t a0, uint64_t a1, uint64_t a2, uint64_t ret) "xen_hypercall: cpu %d cpl %d input %" PRIu64 " a0 0x%" PRIx64 " a1 0x%" PRIx64 " a2 0x%" PRIx64" ret 0x%" PRIx64
 kvm_xen_set_shared_info(uint64_t gfn) "shared info at gfn 0x%" PRIx64
+kvm_xen_set_vcpu_attr(int cpu, int type, uint64_t gpa) "vcpu attr cpu %d type %d gpa 0x%" PRIx64
diff --git a/target/i386/xen.c b/target/i386/xen.c
index 9d1daadee1..cd816bb711 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -129,10 +129,47 @@ static bool kvm_xen_hcall_xen_version(struct kvm_xen_exit *exit, X86CPU *cpu,
     return true;
 }
 
+int kvm_xen_set_vcpu_attr(CPUState *cs, uint16_t type, uint64_t gpa)
+{
+    struct kvm_xen_vcpu_attr xhsi;
+
+    xhsi.type = type;
+    xhsi.u.gpa = gpa;
+
+    trace_kvm_xen_set_vcpu_attr(cs->cpu_index, type, gpa);
+
+    return kvm_vcpu_ioctl(cs, KVM_XEN_VCPU_SET_ATTR, &xhsi);
+}
+
+static void do_set_vcpu_info_default_gpa(CPUState *cs, run_on_cpu_data data)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+
+    env->xen_vcpu_info_default_gpa = data.host_ulong;
+
+    /* Changing the default does nothing if a vcpu_info was explicitly set. */
+    if (env->xen_vcpu_info_gpa == UINT64_MAX) {
+            kvm_xen_set_vcpu_attr(cs, KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO,
+                                  env->xen_vcpu_info_default_gpa);
+    }
+}
+
+static void do_set_vcpu_info_gpa(CPUState *cs, run_on_cpu_data data)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+
+    env->xen_vcpu_info_gpa = data.host_ulong;
+
+    kvm_xen_set_vcpu_attr(cs, KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO,
+                          env->xen_vcpu_info_gpa);
+}
+
 static int xen_set_shared_info(CPUState *cs, uint64_t gfn)
 {
     uint64_t gpa = gfn << TARGET_PAGE_BITS;
-    int err;
+    int i, err;
 
     /* The xen_overlay device tells KVM about it too, since it had to
      * do that on migration load anyway (unless we're going to jump
@@ -144,6 +181,14 @@ static int xen_set_shared_info(CPUState *cs, uint64_t gfn)
 
     trace_kvm_xen_set_shared_info(gfn);
 
+    for (i = 0; i < XEN_LEGACY_MAX_VCPUS; i++) {
+        CPUState *cpu = qemu_get_cpu(i);
+        if (cpu) {
+                async_run_on_cpu(cpu, do_set_vcpu_info_default_gpa, RUN_ON_CPU_HOST_ULONG(gpa));
+        }
+        gpa += sizeof(vcpu_info_t);
+    }
+
     return err;
 }
 
@@ -195,19 +240,38 @@ static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
     }
 }
 
+static int vcpuop_register_vcpu_info(CPUState *cs, CPUState *target,
+                                     uint64_t arg)
+{
+    struct vcpu_register_vcpu_info rvi;
+    uint64_t gpa;
+
+    if (!target)
+            return -ENOENT;
+
+    if (kvm_copy_from_gva(cs, arg, &rvi, sizeof(rvi))) {
+        return -EFAULT;
+    }
+
+    gpa = ((rvi.mfn << TARGET_PAGE_BITS) + rvi.offset);
+    async_run_on_cpu(target, do_set_vcpu_info_gpa, RUN_ON_CPU_HOST_ULONG(gpa));
+    return 0;
+}
+
 static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
                                   int cmd, int vcpu_id, uint64_t arg)
 {
+    CPUState *dest = qemu_get_cpu(vcpu_id);
+    CPUState *cs = CPU(cpu);
     int err;
 
     switch (cmd) {
     case VCPUOP_register_vcpu_info:
-        /* no vcpu info placement for now */
-        err = -ENOSYS;
-        break;
+            err = vcpuop_register_vcpu_info(cs, dest, arg);
+            break;
 
     default:
-        return false;
+            return false;
     }
 
     exit->u.hcall.result = err;
diff --git a/target/i386/xen.h b/target/i386/xen.h
index 9134d78685..53573e07f8 100644
--- a/target/i386/xen.h
+++ b/target/i386/xen.h
@@ -24,5 +24,6 @@
 
 int kvm_xen_init(KVMState *s, uint32_t xen_version);
 int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit);
+int kvm_xen_set_vcpu_attr(CPUState *cs, uint16_t type, uint64_t gpa);
 
 #endif /* QEMU_I386_XEN_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 17/22] i386/xen: handle VCPUOP_register_vcpu_time_info
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (15 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 15:34   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 18/22] i386/xen: handle VCPUOP_register_runstate_memory_area David Woodhouse
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

In order to support Linux vdso in Xen.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/cpu.h     |  1 +
 target/i386/kvm/kvm.c |  9 ++++++
 target/i386/machine.c |  4 ++-
 target/i386/xen.c     | 70 ++++++++++++++++++++++++++++++++++++-------
 4 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 109b2e5669..96c2d0d5cb 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1790,6 +1790,7 @@ typedef struct CPUArchState {
     struct kvm_nested_state *nested_state;
     uint64_t xen_vcpu_info_gpa;
     uint64_t xen_vcpu_info_default_gpa;
+    uint64_t xen_vcpu_time_info_gpa;
 #endif
 #if defined(CONFIG_HVF)
     HVFX86LazyFlags hvf_lflags;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fa45e2f99a..3f19fff21f 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1813,6 +1813,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 
     env->xen_vcpu_info_gpa = UINT64_MAX;
     env->xen_vcpu_info_default_gpa = UINT64_MAX;
+    env->xen_vcpu_time_info_gpa = UINT64_MAX;
 
     xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
     if (xen_version) {
@@ -4744,6 +4745,14 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
                 return ret;
             }
         }
+
+        gpa = x86_cpu->env.xen_vcpu_time_info_gpa;
+        if (gpa != UINT64_MAX) {
+            ret = kvm_xen_set_vcpu_attr(cpu, KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO, gpa);
+            if (ret < 0) {
+                return ret;
+            }
+        }
     }
 #endif
 
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 104cd6047c..9acef102a3 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1263,7 +1263,8 @@ static bool xen_vcpu_needed(void *opaque)
     CPUX86State *env = &cpu->env;
 
     return (env->xen_vcpu_info_gpa != UINT64_MAX ||
-            env->xen_vcpu_info_default_gpa != UINT64_MAX);
+            env->xen_vcpu_info_default_gpa != UINT64_MAX ||
+            env->xen_vcpu_time_info_gpa != UINT64_MAX);
 }
 
 static const VMStateDescription vmstate_xen_vcpu = {
@@ -1274,6 +1275,7 @@ static const VMStateDescription vmstate_xen_vcpu = {
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(env.xen_vcpu_info_gpa, X86CPU),
         VMSTATE_UINT64(env.xen_vcpu_info_default_gpa, X86CPU),
+        VMSTATE_UINT64(env.xen_vcpu_time_info_gpa, X86CPU),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/target/i386/xen.c b/target/i386/xen.c
index cd816bb711..427729ab4d 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -21,28 +21,41 @@
 #include "standard-headers/xen/hvm/hvm_op.h"
 #include "standard-headers/xen/vcpu.h"
 
+static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
+                           size_t *len, bool is_write)
+{
+        struct kvm_translation tr = {
+            .linear_address = gva,
+        };
+
+        if (len) {
+                *len = TARGET_PAGE_SIZE - (gva & ~TARGET_PAGE_MASK);
+        }
+
+        if (kvm_vcpu_ioctl(cs, KVM_TRANSLATE, &tr) || !tr.valid ||
+            (is_write && !tr.writeable)) {
+            return false;
+        }
+        *gpa = tr.physical_address;
+        return true;
+}
+
 static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
                       bool is_write)
 {
     uint8_t *buf = (uint8_t *)_buf;
     size_t i = 0, len = 0;
-    int ret;
 
     for (i = 0; i < sz; i+= len) {
-        struct kvm_translation tr = {
-            .linear_address = gva + i,
-        };
+        uint64_t gpa;
 
-        len = TARGET_PAGE_SIZE - (tr.linear_address & ~TARGET_PAGE_MASK);
+        if (!kvm_gva_to_gpa(cs, gva + i, &gpa, &len, is_write)) {
+                return -EFAULT;
+        }
         if (len > sz)
             len = sz;
 
-        ret = kvm_vcpu_ioctl(cs, KVM_TRANSLATE, &tr);
-        if (ret || !tr.valid || (is_write && !tr.writeable)) {
-            return -EFAULT;
-        }
-
-        cpu_physical_memory_rw(tr.physical_address, buf + i, len, is_write);
+        cpu_physical_memory_rw(gpa, buf + i, len, is_write);
     }
 
     return 0;
@@ -166,6 +179,17 @@ static void do_set_vcpu_info_gpa(CPUState *cs, run_on_cpu_data data)
                           env->xen_vcpu_info_gpa);
 }
 
+static void do_set_vcpu_time_info_gpa(CPUState *cs, run_on_cpu_data data)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+
+    env->xen_vcpu_time_info_gpa = data.host_ulong;
+
+    kvm_xen_set_vcpu_attr(cs, KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO,
+                          env->xen_vcpu_time_info_gpa);
+}
+
 static int xen_set_shared_info(CPUState *cs, uint64_t gfn)
 {
     uint64_t gpa = gfn << TARGET_PAGE_BITS;
@@ -258,6 +282,27 @@ static int vcpuop_register_vcpu_info(CPUState *cs, CPUState *target,
     return 0;
 }
 
+static int vcpuop_register_vcpu_time_info(CPUState *cs, CPUState *target,
+                                          uint64_t arg)
+{
+    struct vcpu_register_time_memory_area tma;
+    uint64_t gpa;
+    size_t len;
+
+    if (kvm_copy_from_gva(cs, arg, &tma, sizeof(*tma.addr.v))) {
+        return -EFAULT;
+    }
+
+    if (!kvm_gva_to_gpa(cs, tma.addr.p, &gpa, &len, false) ||
+        len < sizeof(tma)) {
+        return -EFAULT;
+    }
+
+    async_run_on_cpu(target, do_set_vcpu_time_info_gpa,
+                     RUN_ON_CPU_HOST_ULONG(gpa));
+    return 0;
+}
+
 static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
                                   int cmd, int vcpu_id, uint64_t arg)
 {
@@ -266,6 +311,9 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
     int err;
 
     switch (cmd) {
+    case VCPUOP_register_vcpu_time_memory_area:
+            err = vcpuop_register_vcpu_time_info(cs, dest, arg);
+            break;
     case VCPUOP_register_vcpu_info:
             err = vcpuop_register_vcpu_info(cs, dest, arg);
             break;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 18/22] i386/xen: handle VCPUOP_register_runstate_memory_area
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (16 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 17/22] i386/xen: handle VCPUOP_register_vcpu_time_info David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 15:38   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 19/22] i386/xen: implement HVMOP_set_evtchn_upcall_vector David Woodhouse
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Allow guest to setup the vcpu runstates which is used as
steal clock.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/cpu.h     |  1 +
 target/i386/kvm/kvm.c |  9 +++++++++
 target/i386/machine.c |  4 +++-
 target/i386/xen.c     | 35 +++++++++++++++++++++++++++++++++++
 4 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 96c2d0d5cb..bf44a87ddb 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1791,6 +1791,7 @@ typedef struct CPUArchState {
     uint64_t xen_vcpu_info_gpa;
     uint64_t xen_vcpu_info_default_gpa;
     uint64_t xen_vcpu_time_info_gpa;
+    uint64_t xen_vcpu_runstate_gpa;
 #endif
 #if defined(CONFIG_HVF)
     HVFX86LazyFlags hvf_lflags;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 3f19fff21f..a5e67a3119 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1814,6 +1814,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
     env->xen_vcpu_info_gpa = UINT64_MAX;
     env->xen_vcpu_info_default_gpa = UINT64_MAX;
     env->xen_vcpu_time_info_gpa = UINT64_MAX;
+    env->xen_vcpu_runstate_gpa = UINT64_MAX;
 
     xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
     if (xen_version) {
@@ -4753,6 +4754,14 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
                 return ret;
             }
         }
+
+        gpa = x86_cpu->env.xen_vcpu_runstate_gpa;
+        if (gpa != UINT64_MAX) {
+            ret = kvm_xen_set_vcpu_attr(cpu, KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR, gpa);
+            if (ret < 0) {
+                return ret;
+            }
+        }
     }
 #endif
 
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 9acef102a3..6a510e5cbd 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1264,7 +1264,8 @@ static bool xen_vcpu_needed(void *opaque)
 
     return (env->xen_vcpu_info_gpa != UINT64_MAX ||
             env->xen_vcpu_info_default_gpa != UINT64_MAX ||
-            env->xen_vcpu_time_info_gpa != UINT64_MAX);
+            env->xen_vcpu_time_info_gpa != UINT64_MAX ||
+            env->xen_vcpu_runstate_gpa != UINT64_MAX);
 }
 
 static const VMStateDescription vmstate_xen_vcpu = {
@@ -1276,6 +1277,7 @@ static const VMStateDescription vmstate_xen_vcpu = {
         VMSTATE_UINT64(env.xen_vcpu_info_gpa, X86CPU),
         VMSTATE_UINT64(env.xen_vcpu_info_default_gpa, X86CPU),
         VMSTATE_UINT64(env.xen_vcpu_time_info_gpa, X86CPU),
+        VMSTATE_UINT64(env.xen_vcpu_runstate_gpa, X86CPU),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/target/i386/xen.c b/target/i386/xen.c
index 427729ab4d..97032049e6 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -190,6 +190,17 @@ static void do_set_vcpu_time_info_gpa(CPUState *cs, run_on_cpu_data data)
                           env->xen_vcpu_time_info_gpa);
 }
 
+static void do_set_vcpu_runstate_gpa(CPUState *cs, run_on_cpu_data data)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+
+    env->xen_vcpu_runstate_gpa = data.host_ulong;
+
+    kvm_xen_set_vcpu_attr(cs, KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR,
+                          env->xen_vcpu_runstate_gpa);
+}
+
 static int xen_set_shared_info(CPUState *cs, uint64_t gfn)
 {
     uint64_t gpa = gfn << TARGET_PAGE_BITS;
@@ -303,6 +314,27 @@ static int vcpuop_register_vcpu_time_info(CPUState *cs, CPUState *target,
     return 0;
 }
 
+static int vcpuop_register_runstate_info(CPUState *cs, CPUState *target,
+                                         uint64_t arg)
+{
+    struct vcpu_register_runstate_memory_area rma;
+    uint64_t gpa;
+    size_t len;
+
+    if (kvm_copy_from_gva(cs, arg, &rma, sizeof(*rma.addr.v))) {
+        return -EFAULT;
+    }
+
+    if (!kvm_gva_to_gpa(cs, rma.addr.p, &gpa, &len, false) ||
+        len < sizeof(struct vcpu_time_info)) {
+        return -EFAULT;
+    }
+
+    async_run_on_cpu(target, do_set_vcpu_runstate_gpa,
+                     RUN_ON_CPU_HOST_ULONG(gpa));
+    return 0;
+}
+
 static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
                                   int cmd, int vcpu_id, uint64_t arg)
 {
@@ -311,6 +343,9 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
     int err;
 
     switch (cmd) {
+    case VCPUOP_register_runstate_memory_area:
+            err = vcpuop_register_runstate_info(cs, dest, arg);
+            break;
     case VCPUOP_register_vcpu_time_memory_area:
             err = vcpuop_register_vcpu_time_info(cs, dest, arg);
             break;
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 19/22] i386/xen: implement HVMOP_set_evtchn_upcall_vector
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (17 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 18/22] i386/xen: handle VCPUOP_register_runstate_memory_area David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 15:52   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ David Woodhouse
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Ankur Arora <ankur.a.arora@oracle.com>

The HVMOP_set_evtchn_upcall_vector hypercall sets the per-vCPU upcall
vector, to be delivered to the local APIC just like an MSI (with an EOI).

This takes precedence over the system-wide delivery method set by the
HVMOP_set_param hypercall with HVM_PARAM_CALLBACK_IRQ. It's used by
Windows and Xen (PV shim) guests but normally not by Linux.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Rework for upstream kernel changes and split from HVMOP_set_param]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/cpu.h        |  1 +
 target/i386/kvm/kvm.c    |  7 ++++
 target/i386/machine.c    |  4 ++-
 target/i386/trace-events |  1 +
 target/i386/xen.c        | 69 +++++++++++++++++++++++++++++++++++++---
 target/i386/xen.h        |  1 +
 6 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index bf44a87ddb..938a1b9c8b 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1792,6 +1792,7 @@ typedef struct CPUArchState {
     uint64_t xen_vcpu_info_default_gpa;
     uint64_t xen_vcpu_time_info_gpa;
     uint64_t xen_vcpu_runstate_gpa;
+    uint8_t xen_vcpu_callback_vector;
 #endif
 #if defined(CONFIG_HVF)
     HVFX86LazyFlags hvf_lflags;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a5e67a3119..dc1b3fc502 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -4762,6 +4762,13 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
                 return ret;
             }
         }
+
+        if (x86_cpu->env.xen_vcpu_callback_vector) {
+            ret = kvm_xen_set_vcpu_callback_vector(cpu);
+            if (ret < 0) {
+                return ret;
+            }
+        }
     }
 #endif
 
diff --git a/target/i386/machine.c b/target/i386/machine.c
index 6a510e5cbd..09e13cf716 100644
--- a/target/i386/machine.c
+++ b/target/i386/machine.c
@@ -1265,7 +1265,8 @@ static bool xen_vcpu_needed(void *opaque)
     return (env->xen_vcpu_info_gpa != UINT64_MAX ||
             env->xen_vcpu_info_default_gpa != UINT64_MAX ||
             env->xen_vcpu_time_info_gpa != UINT64_MAX ||
-            env->xen_vcpu_runstate_gpa != UINT64_MAX);
+            env->xen_vcpu_runstate_gpa != UINT64_MAX ||
+            env->xen_vcpu_callback_vector != 0);
 }
 
 static const VMStateDescription vmstate_xen_vcpu = {
@@ -1278,6 +1279,7 @@ static const VMStateDescription vmstate_xen_vcpu = {
         VMSTATE_UINT64(env.xen_vcpu_info_default_gpa, X86CPU),
         VMSTATE_UINT64(env.xen_vcpu_time_info_gpa, X86CPU),
         VMSTATE_UINT64(env.xen_vcpu_runstate_gpa, X86CPU),
+        VMSTATE_UINT8(env.xen_vcpu_callback_vector, X86CPU),
         VMSTATE_END_OF_LIST()
     }
 };
diff --git a/target/i386/trace-events b/target/i386/trace-events
index 7118640697..58e28f1a19 100644
--- a/target/i386/trace-events
+++ b/target/i386/trace-events
@@ -16,3 +16,4 @@ kvm_sev_attestation_report(const char *mnonce, const char *data) "mnonce %s data
 kvm_xen_hypercall(int cpu, uint8_t cpl, uint64_t input, uint64_t a0, uint64_t a1, uint64_t a2, uint64_t ret) "xen_hypercall: cpu %d cpl %d input %" PRIu64 " a0 0x%" PRIx64 " a1 0x%" PRIx64 " a2 0x%" PRIx64" ret 0x%" PRIx64
 kvm_xen_set_shared_info(uint64_t gfn) "shared info at gfn 0x%" PRIx64
 kvm_xen_set_vcpu_attr(int cpu, int type, uint64_t gpa) "vcpu attr cpu %d type %d gpa 0x%" PRIx64
+kvm_xen_set_vcpu_callback(int cpu, int vector) "callback vcpu %d vector %d"
diff --git a/target/i386/xen.c b/target/i386/xen.c
index 97032049e6..2583c00a6b 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -19,6 +19,7 @@
 #include "standard-headers/xen/version.h"
 #include "standard-headers/xen/memory.h"
 #include "standard-headers/xen/hvm/hvm_op.h"
+#include "standard-headers/xen/hvm/params.h"
 #include "standard-headers/xen/vcpu.h"
 
 static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
@@ -127,7 +128,8 @@ static bool kvm_xen_hcall_xen_version(struct kvm_xen_exit *exit, X86CPU *cpu,
             fi.submap |= 1 << XENFEAT_writable_page_tables |
                          1 << XENFEAT_writable_descriptor_tables |
                          1 << XENFEAT_auto_translated_physmap |
-                         1 << XENFEAT_supervisor_mode_kernel;
+                         1 << XENFEAT_supervisor_mode_kernel |
+                         1 << XENFEAT_hvm_callback_vector;
         }
 
         err = kvm_copy_to_gva(CPU(cpu), arg, &fi, sizeof(fi));
@@ -154,6 +156,29 @@ int kvm_xen_set_vcpu_attr(CPUState *cs, uint16_t type, uint64_t gpa)
     return kvm_vcpu_ioctl(cs, KVM_XEN_VCPU_SET_ATTR, &xhsi);
 }
 
+int kvm_xen_set_vcpu_callback_vector(CPUState *cs)
+{
+    uint8_t vector = X86_CPU(cs)->env.xen_vcpu_callback_vector;
+    struct kvm_xen_vcpu_attr xva;
+
+    xva.type = KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR;
+    xva.u.vector = vector;
+
+    trace_kvm_xen_set_vcpu_callback(cs->cpu_index, vector);
+
+    return kvm_vcpu_ioctl(cs, KVM_XEN_HVM_SET_ATTR, &xva);
+}
+
+static void do_set_vcpu_callback_vector(CPUState *cs, run_on_cpu_data data)
+{
+    X86CPU *cpu = X86_CPU(cs);
+    CPUX86State *env = &cpu->env;
+
+    env->xen_vcpu_callback_vector = data.host_int;
+
+    kvm_xen_set_vcpu_callback_vector(cs);
+}
+
 static void do_set_vcpu_info_default_gpa(CPUState *cs, run_on_cpu_data data)
 {
     X86CPU *cpu = X86_CPU(cs);
@@ -262,17 +287,51 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
     return true;
 }
 
-static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
+static int kvm_xen_hcall_evtchn_upcall_vector(struct kvm_xen_exit *exit,
+                                              X86CPU *cpu, uint64_t arg)
+{
+    struct xen_hvm_evtchn_upcall_vector *up;
+    CPUState *target_cs;
+    int vector;
+
+    up = gva_to_hva(CPU(cpu), arg);
+    if (!up) {
+        return -EFAULT;
+    }
+
+    vector = up->vector;
+    if (vector < 0x10) {
+        return -EINVAL;
+    }
+
+    target_cs = qemu_get_cpu(up->vcpu);
+    if (!target_cs) {
+        return -EINVAL;
+    }
+
+    async_run_on_cpu(target_cs, do_set_vcpu_callback_vector, RUN_ON_CPU_HOST_INT(vector));
+    return 0;
+}
+
+static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit, X86CPU *cpu,
                                  int cmd, uint64_t arg)
 {
+    int ret = -ENOSYS;
     switch (cmd) {
+    case HVMOP_set_evtchn_upcall_vector:
+            ret = kvm_xen_hcall_evtchn_upcall_vector(exit, cpu,
+                                                     exit->u.hcall.params[0]);
+            break;
     case HVMOP_pagetable_dying:
-            exit->u.hcall.result = -ENOSYS;
-            return true;
+            ret = -ENOSYS;
+            break;
 
     default:
             return false;
     }
+
+    exit->u.hcall.result = ret;
+    return true;
 }
 
 static int vcpuop_register_vcpu_info(CPUState *cs, CPUState *target,
@@ -377,7 +436,7 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
                                      exit->u.hcall.params[1],
                                      exit->u.hcall.params[2]);
     case __HYPERVISOR_hvm_op:
-        return kvm_xen_hcall_hvm_op(exit, exit->u.hcall.params[0],
+        return kvm_xen_hcall_hvm_op(exit, cpu, exit->u.hcall.params[0],
                                     exit->u.hcall.params[1]);
     case __HYPERVISOR_memory_op:
         return kvm_xen_hcall_memory_op(exit, exit->u.hcall.params[0],
diff --git a/target/i386/xen.h b/target/i386/xen.h
index 53573e07f8..07b133ae29 100644
--- a/target/i386/xen.h
+++ b/target/i386/xen.h
@@ -25,5 +25,6 @@
 int kvm_xen_init(KVMState *s, uint32_t xen_version);
 int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit);
 int kvm_xen_set_vcpu_attr(CPUState *cs, uint16_t type, uint64_t gpa);
+int kvm_xen_set_vcpu_callback_vector(CPUState *cs);
 
 #endif /* QEMU_I386_XEN_H */
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (18 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 19/22] i386/xen: implement HVMOP_set_evtchn_upcall_vector David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 16:16   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 21/22] i386/xen: implement HYPERVISOR_event_channel_op David Woodhouse
  2022-12-09  9:56 ` [RFC PATCH v2 22/22] i386/xen: implement HYPERVISOR_sched_op David Woodhouse
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Ankur Arora <ankur.a.arora@oracle.com>

The HVM_PARAM_CALLBACK_IRQ parameter controls the system-wide event
channel upcall method.  The vector support is handled by KVM internally,
when the evtchn_upcall_pending field in the vcpu_info is set.

The GSI and PCI_INTX delivery methods are not supported. yet; those
need to simulate a level-triggered event on the I/OAPIC.

Add a 'xen_evtchn' device to host the migration state, as we'll shortly
be adding a full event channel table there too.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
[dwmw2: Rework for upstream kernel changes, split from per-VCPU vector]
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/kvm/meson.build  |   5 +-
 hw/i386/kvm/xen_evtchn.c | 117 +++++++++++++++++++++++++++++++++++++++
 hw/i386/kvm/xen_evtchn.h |  13 +++++
 hw/i386/pc_piix.c        |   2 +
 target/i386/xen.c        |  44 +++++++++++++--
 5 files changed, 174 insertions(+), 7 deletions(-)
 create mode 100644 hw/i386/kvm/xen_evtchn.c
 create mode 100644 hw/i386/kvm/xen_evtchn.h

diff --git a/hw/i386/kvm/meson.build b/hw/i386/kvm/meson.build
index 6165cbf019..cab64df339 100644
--- a/hw/i386/kvm/meson.build
+++ b/hw/i386/kvm/meson.build
@@ -4,6 +4,9 @@ i386_kvm_ss.add(when: 'CONFIG_APIC', if_true: files('apic.c'))
 i386_kvm_ss.add(when: 'CONFIG_I8254', if_true: files('i8254.c'))
 i386_kvm_ss.add(when: 'CONFIG_I8259', if_true: files('i8259.c'))
 i386_kvm_ss.add(when: 'CONFIG_IOAPIC', if_true: files('ioapic.c'))
-i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen_overlay.c'))
+i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files(
+  'xen_overlay.c',
+  'xen_evtchn.c',
+  ))
 
 i386_ss.add_all(when: 'CONFIG_KVM', if_true: i386_kvm_ss)
diff --git a/hw/i386/kvm/xen_evtchn.c b/hw/i386/kvm/xen_evtchn.c
new file mode 100644
index 0000000000..1ca0c034e7
--- /dev/null
+++ b/hw/i386/kvm/xen_evtchn.c
@@ -0,0 +1,117 @@
+/*
+ * QEMU Xen emulation: Shared/overlay pages support
+ *
+ * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Authors: David Woodhouse <dwmw2@infradead.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/host-utils.h"
+#include "qemu/module.h"
+#include "qemu/main-loop.h"
+#include "qapi/error.h"
+#include "qom/object.h"
+#include "exec/target_page.h"
+#include "exec/address-spaces.h"
+#include "migration/vmstate.h"
+
+#include "hw/sysbus.h"
+#include "hw/xen/xen.h"
+#include "xen_evtchn.h"
+
+#include "sysemu/kvm.h"
+#include <linux/kvm.h>
+
+#include "standard-headers/xen/memory.h"
+#include "standard-headers/xen/hvm/params.h"
+
+#define TYPE_XEN_EVTCHN "xenevtchn"
+OBJECT_DECLARE_SIMPLE_TYPE(XenEvtchnState, XEN_EVTCHN)
+
+struct XenEvtchnState {
+    /*< private >*/
+    SysBusDevice busdev;
+    /*< public >*/
+
+    uint64_t callback_param;
+};
+
+struct XenEvtchnState *xen_evtchn_singleton;
+
+static int xen_evtchn_post_load(void *opaque, int version_id)
+{
+    XenEvtchnState *s = opaque;
+
+    if (s->callback_param) {
+        xen_evtchn_set_callback_param(s->callback_param);
+    }
+
+    return 0;
+}
+
+static bool xen_evtchn_is_needed(void *opaque)
+{
+    return xen_mode == XEN_EMULATE;
+}
+
+static const VMStateDescription xen_evtchn_vmstate = {
+    .name = "xen_evtchn",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = xen_evtchn_is_needed,
+    .post_load = xen_evtchn_post_load,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT64(callback_param, XenEvtchnState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void xen_evtchn_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->vmsd = &xen_evtchn_vmstate;
+}
+
+static const TypeInfo xen_evtchn_info = {
+    .name          = TYPE_XEN_EVTCHN,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(XenEvtchnState),
+    .class_init    = xen_evtchn_class_init,
+};
+
+void xen_evtchn_create(void)
+{
+    xen_evtchn_singleton = XEN_EVTCHN(sysbus_create_simple(TYPE_XEN_EVTCHN, -1, NULL));
+}
+
+static void xen_evtchn_register_types(void)
+{
+    type_register_static(&xen_evtchn_info);
+}
+
+type_init(xen_evtchn_register_types)
+
+
+#define CALLBACK_VIA_TYPE_SHIFT       56
+
+int xen_evtchn_set_callback_param(uint64_t param)
+{
+    int ret = -ENOSYS;
+
+    if (param >> CALLBACK_VIA_TYPE_SHIFT == HVM_PARAM_CALLBACK_TYPE_VECTOR) {
+        struct kvm_xen_hvm_attr xa = {
+            .type = KVM_XEN_ATTR_TYPE_UPCALL_VECTOR,
+            .u.vector = (uint8_t)param,
+        };
+
+        ret = kvm_vm_ioctl(kvm_state, KVM_XEN_HVM_SET_ATTR, &xa);
+        if (!ret && xen_evtchn_singleton)
+            xen_evtchn_singleton->callback_param = param;
+    }
+    return ret;
+}
diff --git a/hw/i386/kvm/xen_evtchn.h b/hw/i386/kvm/xen_evtchn.h
new file mode 100644
index 0000000000..11c6ed22a0
--- /dev/null
+++ b/hw/i386/kvm/xen_evtchn.h
@@ -0,0 +1,13 @@
+/*
+ * QEMU Xen emulation: Event channel support
+ *
+ * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
+ *
+ * Authors: David Woodhouse <dwmw2@infradead.org>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+void xen_evtchn_create(void);
+int xen_evtchn_set_callback_param(uint64_t param);
diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index c3c61eedde..18540084a0 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -60,6 +60,7 @@
 #endif
 #ifdef CONFIG_XEN_EMU
 #include "hw/i386/kvm/xen_overlay.h"
+#include "hw/i386/kvm/xen_evtchn.h"
 #endif
 #include "migration/global_state.h"
 #include "migration/misc.h"
@@ -417,6 +418,7 @@ static void pc_xen_hvm_init(MachineState *machine)
 #ifdef CONFIG_XEN_EMU
     if (xen_mode == XEN_EMULATE) {
             xen_overlay_create();
+            xen_evtchn_create();
     }
 #endif
 }
diff --git a/target/i386/xen.c b/target/i386/xen.c
index 2583c00a6b..1af336d9e5 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -16,6 +16,8 @@
 #include "xen.h"
 #include "trace.h"
 #include "hw/i386/kvm/xen_overlay.h"
+#include "hw/i386/kvm/xen_evtchn.h"
+
 #include "standard-headers/xen/version.h"
 #include "standard-headers/xen/memory.h"
 #include "standard-headers/xen/hvm/hvm_op.h"
@@ -287,24 +289,53 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
     return true;
 }
 
+static int handle_set_param(struct kvm_xen_exit *exit, X86CPU *cpu,
+                            uint64_t arg)
+{
+    CPUState *cs = CPU(cpu);
+    struct xen_hvm_param hp;
+    int err = 0;
+
+    if (kvm_copy_from_gva(cs, arg, &hp, sizeof(hp))) {
+        err = -EFAULT;
+        goto out;
+    }
+
+    if (hp.domid != DOMID_SELF) {
+        err = -EINVAL;
+        goto out;
+    }
+
+    switch (hp.index) {
+    case HVM_PARAM_CALLBACK_IRQ:
+        err = xen_evtchn_set_callback_param(hp.value);
+        break;
+    default:
+        return false;
+    }
+
+out:
+    exit->u.hcall.result = err;
+    return true;
+}
+
 static int kvm_xen_hcall_evtchn_upcall_vector(struct kvm_xen_exit *exit,
                                               X86CPU *cpu, uint64_t arg)
 {
-    struct xen_hvm_evtchn_upcall_vector *up;
+    struct xen_hvm_evtchn_upcall_vector up;
     CPUState *target_cs;
     int vector;
 
-    up = gva_to_hva(CPU(cpu), arg);
-    if (!up) {
+    if (kvm_copy_from_gva(CPU(cpu), arg, &up, sizeof(up))) {
         return -EFAULT;
     }
 
-    vector = up->vector;
+    vector = up.vector;
     if (vector < 0x10) {
         return -EINVAL;
     }
 
-    target_cs = qemu_get_cpu(up->vcpu);
+    target_cs = qemu_get_cpu(up.vcpu);
     if (!target_cs) {
         return -EINVAL;
     }
@@ -325,7 +356,8 @@ static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit, X86CPU *cpu,
     case HVMOP_pagetable_dying:
             ret = -ENOSYS;
             break;
-
+    case HVMOP_set_param:
+            return handle_set_param(exit, cpu, arg);
     default:
             return false;
     }
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 21/22] i386/xen: implement HYPERVISOR_event_channel_op
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (19 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 16:23   ` Paul Durrant
  2022-12-09  9:56 ` [RFC PATCH v2 22/22] i386/xen: implement HYPERVISOR_sched_op David Woodhouse
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

Additionally set XEN_INTERFACE_VERSION to most recent in order to
exercise both event_channel_op and event_channel_op_compat.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/xen.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/target/i386/xen.c b/target/i386/xen.c
index 1af336d9e5..f102c40f04 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -18,11 +18,14 @@
 #include "hw/i386/kvm/xen_overlay.h"
 #include "hw/i386/kvm/xen_evtchn.h"
 
+#define __XEN_INTERFACE_VERSION__ 0x00040400
+
 #include "standard-headers/xen/version.h"
 #include "standard-headers/xen/memory.h"
 #include "standard-headers/xen/hvm/hvm_op.h"
 #include "standard-headers/xen/hvm/params.h"
 #include "standard-headers/xen/vcpu.h"
+#include "standard-headers/xen/event_channel.h"
 
 static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
                            size_t *len, bool is_write)
@@ -452,6 +455,42 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
     return true;
 }
 
+static bool kvm_xen_hcall_evtchn_op_compat(struct kvm_xen_exit *exit,
+                                          X86CPU *cpu, uint64_t arg)
+{
+    struct evtchn_op op;
+    int err = -EFAULT;
+
+    if (kvm_copy_from_gva(CPU(cpu), arg, &op, sizeof(op))) {
+        goto err;
+    }
+
+    switch (op.cmd) {
+    default:
+        return false;
+    }
+err:
+    exit->u.hcall.result = err;
+    return true;
+}
+
+static bool kvm_xen_hcall_evtchn_op(struct kvm_xen_exit *exit,
+                                    int cmd, uint64_t arg)
+{
+    int err = -ENOSYS;
+
+    switch (cmd) {
+    case EVTCHNOP_init_control:
+        err = -ENOSYS;
+        break;
+    default:
+        return false;
+    }
+
+    exit->u.hcall.result = err;
+    return true;
+}
+
 static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
 {
     uint16_t code = exit->u.hcall.input;
@@ -462,6 +501,12 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
     }
 
     switch (code) {
+    case __HYPERVISOR_event_channel_op_compat:
+        return kvm_xen_hcall_evtchn_op_compat(exit, cpu,
+                                              exit->u.hcall.params[0]);
+    case __HYPERVISOR_event_channel_op:
+        return kvm_xen_hcall_evtchn_op(exit, exit->u.hcall.params[0],
+                                       exit->u.hcall.params[1]);
     case __HYPERVISOR_vcpu_op:
         return kvm_xen_hcall_vcpu_op(exit, cpu,
                                      exit->u.hcall.params[0],
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [RFC PATCH v2 22/22] i386/xen: implement HYPERVISOR_sched_op
  2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
                   ` (20 preceding siblings ...)
  2022-12-09  9:56 ` [RFC PATCH v2 21/22] i386/xen: implement HYPERVISOR_event_channel_op David Woodhouse
@ 2022-12-09  9:56 ` David Woodhouse
  2022-12-12 16:37   ` Paul Durrant
  21 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-09  9:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

From: Joao Martins <joao.m.martins@oracle.com>

It allows to shutdown itself via hypercall with any of the 3 reasons:
  1) self-reboot
  2) shutdown
  3) crash

Implementing SCHEDOP_shutdown sub op let us handle crashes gracefully rather
than leading to triple faults if it remains unimplemented.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 target/i386/xen.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/target/i386/xen.c b/target/i386/xen.c
index f102c40f04..5f3b91450d 100644
--- a/target/i386/xen.c
+++ b/target/i386/xen.c
@@ -17,6 +17,7 @@
 #include "trace.h"
 #include "hw/i386/kvm/xen_overlay.h"
 #include "hw/i386/kvm/xen_evtchn.h"
+#include "sysemu/runstate.h"
 
 #define __XEN_INTERFACE_VERSION__ 0x00040400
 
@@ -25,6 +26,7 @@
 #include "standard-headers/xen/hvm/hvm_op.h"
 #include "standard-headers/xen/hvm/params.h"
 #include "standard-headers/xen/vcpu.h"
+#include "standard-headers/xen/sched.h"
 #include "standard-headers/xen/event_channel.h"
 
 static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
@@ -491,6 +493,45 @@ static bool kvm_xen_hcall_evtchn_op(struct kvm_xen_exit *exit,
     return true;
 }
 
+static int schedop_shutdown(CPUState *cs, uint64_t arg)
+{
+    struct sched_shutdown shutdown;
+
+    if (kvm_copy_from_gva(cs, arg, &shutdown, sizeof(shutdown))) {
+        return -EFAULT;
+    }
+
+    if (shutdown.reason == SHUTDOWN_crash) {
+        cpu_dump_state(cs, stderr, CPU_DUMP_CODE);
+        qemu_system_guest_panicked(NULL);
+    } else if (shutdown.reason == SHUTDOWN_reboot) {
+        qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
+    } else if (shutdown.reason == SHUTDOWN_poweroff) {
+        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
+    }
+
+    return 0;
+}
+
+static bool kvm_xen_hcall_sched_op(struct kvm_xen_exit *exit, X86CPU *cpu,
+                                   int cmd, uint64_t arg)
+{
+    CPUState *cs = CPU(cpu);
+    int err = -ENOSYS;
+
+    switch (cmd) {
+    case SCHEDOP_shutdown: {
+          err = schedop_shutdown(cs, arg);
+          break;
+       }
+    default:
+            return false;
+    }
+
+    exit->u.hcall.result = err;
+    return true;
+}
+
 static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
 {
     uint16_t code = exit->u.hcall.input;
@@ -501,6 +542,10 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
     }
 
     switch (code) {
+    case __HYPERVISOR_sched_op_compat:
+    case __HYPERVISOR_sched_op:
+        return kvm_xen_hcall_sched_op(exit, cpu, exit->u.hcall.params[0],
+                                      exit->u.hcall.params[1]);
     case __HYPERVISOR_event_channel_op_compat:
         return kvm_xen_hcall_evtchn_op_compat(exit, cpu,
                                               exit->u.hcall.params[0]);
-- 
2.35.3



^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 01/22] include: import xen public headers
  2022-12-09  9:55 ` [RFC PATCH v2 01/22] include: import xen public headers David Woodhouse
@ 2022-12-12  9:17   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12  9:17 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Update to Xen public headers from 4.16.2 release]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   include/standard-headers/xen/arch-x86/cpuid.h |  118 ++
>   .../xen/arch-x86/xen-x86_32.h                 |  194 +++
>   .../xen/arch-x86/xen-x86_64.h                 |  241 ++++
>   include/standard-headers/xen/arch-x86/xen.h   |  398 +++++++
>   include/standard-headers/xen/event_channel.h  |  388 ++++++
>   include/standard-headers/xen/features.h       |  143 +++
>   include/standard-headers/xen/grant_table.h    |  686 +++++++++++
>   include/standard-headers/xen/hvm/hvm_op.h     |  395 +++++++
>   include/standard-headers/xen/hvm/params.h     |  318 +++++
>   include/standard-headers/xen/memory.h         |  754 ++++++++++++
>   include/standard-headers/xen/physdev.h        |  383 ++++++
>   include/standard-headers/xen/sched.h          |  202 ++++
>   include/standard-headers/xen/trace.h          |  341 ++++++
>   include/standard-headers/xen/vcpu.h           |  248 ++++
>   include/standard-headers/xen/version.h        |  113 ++
>   include/standard-headers/xen/xen-compat.h     |   46 +
>   include/standard-headers/xen/xen.h            | 1049 +++++++++++++++++
>   17 files changed, 6017 insertions(+)
>   create mode 100644 include/standard-headers/xen/arch-x86/cpuid.h
>   create mode 100644 include/standard-headers/xen/arch-x86/xen-x86_32.h
>   create mode 100644 include/standard-headers/xen/arch-x86/xen-x86_64.h
>   create mode 100644 include/standard-headers/xen/arch-x86/xen.h
>   create mode 100644 include/standard-headers/xen/event_channel.h
>   create mode 100644 include/standard-headers/xen/features.h
>   create mode 100644 include/standard-headers/xen/grant_table.h
>   create mode 100644 include/standard-headers/xen/hvm/hvm_op.h
>   create mode 100644 include/standard-headers/xen/hvm/params.h
>   create mode 100644 include/standard-headers/xen/memory.h
>   create mode 100644 include/standard-headers/xen/physdev.h
>   create mode 100644 include/standard-headers/xen/sched.h
>   create mode 100644 include/standard-headers/xen/trace.h
>   create mode 100644 include/standard-headers/xen/vcpu.h
>   create mode 100644 include/standard-headers/xen/version.h
>   create mode 100644 include/standard-headers/xen/xen-compat.h
>   create mode 100644 include/standard-headers/xen/xen.h
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-09  9:55 ` [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation David Woodhouse
@ 2022-12-12  9:19   ` Paul Durrant
  2022-12-12 17:07   ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12  9:19 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The XEN_EMU option will cover core Xen support in target/, which exists
> only for x86 with KVM today but could theoretically also be implemented
> on Arm/Aarch64 and with TCG or other accelerators. It will also cover
> the support for architecture-independent grant table and event channel
> support which will be added in hw/xen/.
> 
> The XENFV_MACHINE option is for the xenfv platform support, which will
> now be used both by XEN_EMU and by real Xen.
> 
> The XEN option remains dependent on the Xen runtime libraries, and covers
> support for real Xen. Some code which currently resides under CONFIG_XEN
> will be moving to CONFIG_XENFV_MACHINE over time.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   accel/Kconfig  | 1 +
>   hw/Kconfig     | 1 +
>   hw/xen/Kconfig | 3 +++
>   meson.build    | 1 +
>   target/Kconfig | 4 ++++
>   5 files changed, 10 insertions(+)
>   create mode 100644 hw/xen/Kconfig
> 

Reviewed-by: Paul Durrant <paul@xen.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support
  2022-12-09  9:55 ` [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support David Woodhouse
@ 2022-12-12 12:48   ` Paul Durrant
  2022-12-12 17:30   ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 12:48 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This is a machine property for two main reasons. One is that it allows
> us to set it in default_machine_opts for the xenfv platform when not
> running on actual Xen. The other is that theoretically we *could* do
> this with TCG too; we'd just have to implement a bunch of the stuff that
> KVM already does for us.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/pc.c            | 32 +++++++++++++++++++++++++++
>   hw/i386/pc_piix.c       | 10 +++++++--
>   include/hw/i386/pc.h    |  3 +++
>   target/i386/kvm/kvm.c   | 26 ++++++++++++++++++++++
>   target/i386/meson.build |  1 +
>   target/i386/xen.c       | 49 +++++++++++++++++++++++++++++++++++++++++
>   target/i386/xen.h       | 19 ++++++++++++++++
>   7 files changed, 138 insertions(+), 2 deletions(-)
>   create mode 100644 target/i386/xen.c
>   create mode 100644 target/i386/xen.h
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 546b703cb4..9bada1a8ff 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -1811,6 +1811,32 @@ static void pc_machine_set_max_fw_size(Object *obj, Visitor *v,
>       pcms->max_fw_size = value;
>   }
>   
> +static void pc_machine_get_xen_version(Object *obj, Visitor *v,
> +                                       const char *name, void *opaque,
> +                                       Error **errp)
> +{
> +    PCMachineState *pcms = PC_MACHINE(obj);
> +    uint32_t value = pcms->xen_version;
> +
> +    visit_type_uint32(v, name, &value, errp);
> +}
> +
> +static void pc_machine_set_xen_version(Object *obj, Visitor *v,
> +                                       const char *name, void *opaque,
> +                                       Error **errp)
> +{
> +    PCMachineState *pcms = PC_MACHINE(obj);
> +    Error *error = NULL;
> +    uint32_t value;
> +
> +    visit_type_uint32(v, name, &value, &error);
> +    if (error) {
> +        error_propagate(errp, error);
> +        return;
> +    }
> +
> +    pcms->xen_version = value;
> +}
>   
>   static void pc_machine_initfn(Object *obj)
>   {
> @@ -1978,6 +2004,12 @@ static void pc_machine_class_init(ObjectClass *oc, void *data)
>           NULL, NULL);
>       object_class_property_set_description(oc, PC_MACHINE_SMBIOS_EP,
>           "SMBIOS Entry Point type [32, 64]");
> +
> +    object_class_property_add(oc, "xen-version", "uint32",
> +        pc_machine_get_xen_version, pc_machine_set_xen_version,
> +        NULL, NULL);
> +    object_class_property_set_description(oc, "xen-version",
> +        "Xen version to be emulated (in XENVER_version form e.g. 0x4000a for 4.10)");
>   }

Since this is e properly of the general pc machine class, could it be 
made to report the actual version if running on real Xen and be 
read-only? AFAICT I could specify "accel=xen,xen-version=<blah>" and the 
feels like it should be an error.

>   
>   static const TypeInfo pc_machine_info = {
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index 0ad0ed1603..13286d0739 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -876,7 +876,10 @@ static void xenfv_4_2_machine_options(MachineClass *m)
>       pc_i440fx_4_2_machine_options(m);
>       m->desc = "Xen Fully-virtualized PC";
>       m->max_cpus = HVM_MAX_VCPUS;
> -    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> +    if (xen_enabled())
> +            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> +    else
> +            m->default_machine_opts = "accel=kvm,xen-version=0x40002";
>   }
>   
>   DEFINE_PC_MACHINE(xenfv_4_2, "xenfv-4.2", pc_xen_hvm_init,
> @@ -888,7 +891,10 @@ static void xenfv_3_1_machine_options(MachineClass *m)
>       m->desc = "Xen Fully-virtualized PC";
>       m->alias = "xenfv";
>       m->max_cpus = HVM_MAX_VCPUS;
> -    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> +    if (xen_enabled())
> +            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> +    else
> +            m->default_machine_opts = "accel=kvm,xen-version=0x30001";
>   }
>   
>   DEFINE_PC_MACHINE(xenfv, "xenfv-3.1", pc_xen_hvm_init,
> diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
> index c95333514e..9b14b18836 100644
> --- a/include/hw/i386/pc.h
> +++ b/include/hw/i386/pc.h
> @@ -52,6 +52,9 @@ typedef struct PCMachineState {
>       bool default_bus_bypass_iommu;
>       uint64_t max_fw_size;
>   
> +    /* Xen HVM emulation */
> +    uint32_t xen_version;
> +
>       /* ACPI Memory hotplug IO base address */
>       hwaddr memhp_io_base;
>   
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index a213209379..0a2069b117 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -31,6 +31,7 @@
>   #include "sysemu/runstate.h"
>   #include "kvm_i386.h"
>   #include "sev.h"
> +#include "xen.h"
>   #include "hyperv.h"
>   #include "hyperv-proto.h"
>   
> @@ -774,6 +775,17 @@ static inline bool freq_within_bounds(int freq, int target_freq)
>           return false;
>   }
>   
> +static uint32_t kvm_arch_xen_version(MachineState *ms)
> +{
> +    uint32_t v = object_property_get_int(OBJECT(ms), "xen-version", NULL);
> +
> +    /* If it was unset, return zero */
> +    if (v == (uint32_t) -1)
> +            return 0;
> +
> +    return v;
> +}
> +
>   static int kvm_arch_set_tsc_khz(CPUState *cs)
>   {
>       X86CPU *cpu = X86_CPU(cs);
> @@ -2459,6 +2471,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>   {
>       uint64_t identity_base = 0xfffbc000;
>       uint64_t shadow_mem;
> +    uint32_t xen_version;
>       int ret;
>       struct utsname utsname;
>       Error *local_err = NULL;
> @@ -2513,6 +2526,19 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>           }
>       }
>   
> +    xen_version = kvm_arch_xen_version(ms);
> +    if (xen_version) {
> +#ifdef CONFIG_XEN_EMU
> +            ret = kvm_xen_init(s, xen_version);
> +            if (ret < 0) {
> +                    return ret;
> +            }
> +#else
> +            error_report("kvm: Xen support not enabled in qemu");
> +            return -ENOTSUP;
> +#endif
> +    }
> +
>       ret = kvm_get_supported_msrs(s);
>       if (ret < 0) {
>           return ret;
> diff --git a/target/i386/meson.build b/target/i386/meson.build
> index ae38dc9563..9f3ef246b8 100644
> --- a/target/i386/meson.build
> +++ b/target/i386/meson.build
> @@ -7,6 +7,7 @@ i386_ss.add(files(
>     'cpu-dump.c',
>   ))
>   i386_ss.add(when: 'CONFIG_SEV', if_true: files('host-cpu.c'))
> +i386_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen.c'))
>   
>   # x86 cpu type
>   i386_ss.add(when: 'CONFIG_KVM', if_true: files('host-cpu.c'))
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> new file mode 100644
> index 0000000000..bc183dce4e
> --- /dev/null
> +++ b/target/i386/xen.c
> @@ -0,0 +1,49 @@
> +/*
> + * Xen HVM emulation support in KVM
> + *
> + * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
> + * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qemu/osdep.h"
> +#include "kvm/kvm_i386.h"
> +#include "xen.h"
> +
> +int kvm_xen_init(KVMState *s, uint32_t xen_version)
> +{
> +    const int required_caps = KVM_XEN_HVM_CONFIG_HYPERCALL_MSR |
> +        KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL | KVM_XEN_HVM_CONFIG_SHARED_INFO;
> +    struct kvm_xen_hvm_config cfg = {
> +        .msr = XEN_HYPERCALL_MSR,
> +        .flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
> +    };
> +    int xen_caps, ret;
> +
> +    xen_caps = kvm_check_extension(s, KVM_CAP_XEN_HVM);
> +    if (required_caps & ~xen_caps) {
> +        error_report("kvm: Xen HVM guest support not present or insufficient");
> +        return -ENOSYS;
> +    }
> +
> +    if (xen_caps & KVM_XEN_HVM_CONFIG_EVTCHN_SEND) {
> +        struct kvm_xen_hvm_attr ha = {
> +            .type = KVM_XEN_ATTR_TYPE_XEN_VERSION,
> +            .u.xen_version = xen_version,
> +        };
> +        (void)kvm_vm_ioctl(s, KVM_XEN_HVM_SET_ATTR, &ha);

Should you not handle the error here? If the cap is present then surely 
it ought to work.

> +
> +        cfg.flags |= KVM_XEN_HVM_CONFIG_EVTCHN_SEND;
> +    }
> +
> +    ret = kvm_vm_ioctl(s, KVM_XEN_HVM_CONFIG, &cfg);
> +    if (ret < 0) {
> +        error_report("kvm: Failed to enable Xen HVM support: %s", strerror(-ret));
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> diff --git a/target/i386/xen.h b/target/i386/xen.h
> new file mode 100644
> index 0000000000..6c4f3b7822
> --- /dev/null
> +++ b/target/i386/xen.h
> @@ -0,0 +1,19 @@
> +/*
> + * Xen HVM emulation support in KVM
> + *
> + * Copyright © 2019 Oracle and/or its affiliates. All rights reserved.
> + * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef QEMU_I386_XEN_H
> +#define QEMU_I386_XEN_H
> +
> +#define XEN_HYPERCALL_MSR 0x40000000

This is a moveable MSR if Hyper-V is also enabled. Is that configuration 
being explicitly denied?

   Paul

> +
> +int kvm_xen_init(KVMState *s, uint32_t xen_version);
> +
> +#endif /* QEMU_I386_XEN_H */



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves
  2022-12-09  9:55 ` [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves David Woodhouse
@ 2022-12-12 13:13   ` Paul Durrant
  2022-12-13  9:47     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 13:13 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Introduce support for emulating CPUID for Xen HVM guests. It doesn't make
> sense to advertise the KVM leaves to a Xen guest, so do it unconditionally
> when the xen-version machine property is set.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Obtain xen_version from machine property, make it automatic]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[snip]
> -    if (cpu->expose_kvm) {
> +    xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
> +    if (xen_version) {
> +#ifdef CONFIG_XEN_EMU
> +        struct kvm_cpuid_entry2 *xen_max_leaf;
> +
> +        memcpy(signature, "XenVMMXenVMM", 12);
> +
> +        xen_max_leaf = c = &cpuid_data.entries[cpuid_i++];
> +        c->function = kvm_base + XEN_CPUID_SIGNATURE;
> +        c->eax = kvm_base + XEN_CPUID_TIME;
> +        c->ebx = signature[0];
> +        c->ecx = signature[1];
> +        c->edx = signature[2];
> +
> +        c = &cpuid_data.entries[cpuid_i++];
> +        c->function = kvm_base + XEN_CPUID_VENDOR;
> +        c->eax = xen_version;
> +        c->ebx = 0;
> +        c->ecx = 0;
> +        c->edx = 0;
> +
> +        c = &cpuid_data.entries[cpuid_i++];
> +        c->function = kvm_base + XEN_CPUID_HVM_MSR;
> +        /* Number of hypercall-transfer pages */
> +        c->eax = 1;
> +        /* Hypercall MSR base address */
> +        c->ebx = XEN_HYPERCALL_MSR;
> +        c->ecx = 0;
> +        c->edx = 0;
> +
> +        c = &cpuid_data.entries[cpuid_i++];
> +        c->function = kvm_base + XEN_CPUID_TIME;
> +        c->eax = ((!!tsc_is_stable_and_known(env) << 1) |
> +            (!!(env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_RDTSCP) << 2));
> +        /* default=0 (emulate if necessary) */
> +        c->ebx = 0;
> +        /* guest tsc frequency */
> +        c->ecx = env->user_tsc_khz;
> +        /* guest tsc incarnation (migration count) */
> +        c->edx = 0;
> +
> +        c = &cpuid_data.entries[cpuid_i++];
> +        c->function = kvm_base + XEN_CPUID_HVM;
> +        xen_max_leaf->eax = kvm_base + XEN_CPUID_HVM;
> +        if (xen_version >= XEN_VERSION(4,5)) {
> +            c->function = kvm_base + XEN_CPUID_HVM;
> +
> +            if (cpu->xen_vapic) {
> +                c->eax |= XEN_HVM_CPUID_APIC_ACCESS_VIRT;
> +                c->eax |= XEN_HVM_CPUID_X2APIC_VIRT;
> +            }
> +
> +            c->eax |= XEN_HVM_CPUID_IOMMU_MAPPINGS;
> +
> +            if (xen_version >= XEN_VERSION(4,6)) {
> +                c->eax |= XEN_HVM_CPUID_VCPU_ID_PRESENT;
> +                c->ebx = cs->cpu_index;
> +            }
> +        }
> +
> +        kvm_base += 0x100;

Ok, this tells me that we are intending to handle Hyper-V enlightenments 
being simultaneously enabled... in which case that MSR above needs to 
move, along with the cpuid leaves. It should be 0x40000200 in this case.

   Paul




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode
  2022-12-09  9:55 ` [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode David Woodhouse
@ 2022-12-12 13:24   ` Paul Durrant
  2022-12-12 22:07     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 13:24 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> The only thing we need to handle on KVM side is to change the
> pfn from R/W to R/O.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/xen/xen_platform.c | 11 ++++-------
>   1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/i386/xen/xen_platform.c b/hw/i386/xen/xen_platform.c
> index a64265cca0..914619d140 100644
> --- a/hw/i386/xen/xen_platform.c
> +++ b/hw/i386/xen/xen_platform.c
> @@ -271,7 +271,10 @@ static void platform_fixed_ioport_writeb(void *opaque, uint32_t addr, uint32_t v
>       case 0: /* Platform flags */ {
>           hvmmem_type_t mem_type = (val & PFFLAG_ROM_LOCK) ?
>               HVMMEM_ram_ro : HVMMEM_ram_rw;
> -        if (xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40)) {
> +        if (xen_mode == XEN_EMULATE) {
> +            /* XXX */
> +            s->flags = val & PFFLAG_ROM_LOCK;
> +        } else if (xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40)) {
>               DPRINTF("unable to change ro/rw state of ROM memory area!\n");
>           } else {
>               s->flags = val & PFFLAG_ROM_LOCK;

Surely this would cleaner as:

if (xen_mode != XEN_EMULATE && xen_set_mem_type(xen_domid, mem_type, 
0xc0, 0x40))
     DPRINTF("unable to change ro/rw state of ROM memory area!\n");
else
     s->flags = val & PFFLAG_ROM_LOCK;

?

   Paul

> @@ -496,12 +499,6 @@ static void xen_platform_realize(PCIDevice *dev, Error **errp)
>       PCIXenPlatformState *d = XEN_PLATFORM(dev);
>       uint8_t *pci_conf;
>   
> -    /* Device will crash on reset if xen is not initialized */
> -    if (!xen_enabled()) {
> -        error_setg(errp, "xen-platform device requires the Xen accelerator");
> -        return;
> -    }
> -
>       pci_conf = dev->config;
>   
>       pci_set_word(pci_conf + PCI_COMMAND, PCI_COMMAND_IO | PCI_COMMAND_MEMORY);



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 06/22] hw/xen_backend: refactor xen_be_init()
  2022-12-09  9:55 ` [RFC PATCH v2 06/22] hw/xen_backend: refactor xen_be_init() David Woodhouse
@ 2022-12-12 13:27   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 13:27 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/xen/xen-legacy-backend.c         | 40 +++++++++++++++++++++--------
>   include/hw/xen/xen-legacy-backend.h |  3 +++
>   2 files changed, 32 insertions(+), 11 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init
  2022-12-09  9:55 ` [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init David Woodhouse
@ 2022-12-12 13:47   ` Paul Durrant
  2022-12-12 14:50     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 13:47 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> And use newly added xen_emulated_machine_init() to iniitalize
> the xenstore and the sysdev bus for future emulated devices.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Move it to xen-legacy-backend.c]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/pc_piix.c                   |  5 +++++
>   hw/xen/xen-legacy-backend.c         | 22 ++++++++++++++++------
>   include/hw/xen/xen-legacy-backend.h |  2 ++
>   3 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index 13286d0739..3dcac2f4b6 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -47,6 +47,7 @@
>   #include "hw/sysbus.h"
>   #include "hw/i2c/smbus_eeprom.h"
>   #include "hw/xen/xen-x86.h"
> +#include "hw/xen/xen-legacy-backend.h"
>   #include "exec/memory.h"
>   #include "hw/acpi/acpi.h"
>   #include "hw/acpi/piix4.h"
> @@ -155,6 +156,10 @@ static void pc_init1(MachineState *machine,
>               x86ms->above_4g_mem_size = 0;
>               x86ms->below_4g_mem_size = machine->ram_size;
>           }
> +
> +        if (pcms->xen_version && !xen_be_xenstore_open()) {

So, this is a bit subtle... it's only *because* using real Xen results 
in xen_version being 0 that this is sane? Also does this not mean that 
we are now relying on libxenstore? Shouldn't that be called out in the 
config?

> +            xen_emulated_machine_init();
> +        }
>       }
>   
>       pc_machine_init_sgx_epc(pcms);
> diff --git a/hw/xen/xen-legacy-backend.c b/hw/xen/xen-legacy-backend.c
> index 694e7bbc54..60a7bc7ab6 100644
> --- a/hw/xen/xen-legacy-backend.c
> +++ b/hw/xen/xen-legacy-backend.c
> @@ -31,6 +31,7 @@
>   #include "qapi/error.h"
>   #include "hw/xen/xen-legacy-backend.h"
>   #include "hw/xen/xen_pvdev.h"
> +#include "hw/xen/xen-bus.h"
>   #include "monitor/qdev.h"
>   
>   DeviceState *xen_sysdev;
> @@ -294,13 +295,15 @@ static struct XenLegacyDevice *xen_be_get_xendev(const char *type, int dom,
>       xendev->debug      = debug;
>       xendev->local_port = -1;
>   
> -    xendev->evtchndev = xenevtchn_open(NULL, 0);
> -    if (xendev->evtchndev == NULL) {
> -        xen_pv_printf(NULL, 0, "can't open evtchn device\n");
> -        qdev_unplug(DEVICE(xendev), NULL);
> -        return NULL;
> +    if (xen_mode != XEN_EMULATE) {
> +        xendev->evtchndev = xenevtchn_open(NULL, 0);

Doesn't this need stubbing out so that we can build without libxenevtchn?

   Paul

> +        if (xendev->evtchndev == NULL) {
> +            xen_pv_printf(NULL, 0, "can't open evtchn device\n");
> +            qdev_unplug(DEVICE(xendev), NULL);
> +            return NULL;
> +        }
> +        qemu_set_cloexec(xenevtchn_fd(xendev->evtchndev));
>       }
> -    qemu_set_cloexec(xenevtchn_fd(xendev->evtchndev));
>   
>       xen_pv_insert_xendev(xendev);
>   
> @@ -859,3 +862,10 @@ static void xenbe_register_types(void)
>   }
>   
>   type_init(xenbe_register_types)
> +
> +void xen_emulated_machine_init(void)
> +{
> +    xen_bus_init();
> +    xen_be_sysdev_init();
> +    xen_be_register_common();
> +}
> diff --git a/include/hw/xen/xen-legacy-backend.h b/include/hw/xen/xen-legacy-backend.h
> index 0aa171f6c2..aa09015662 100644
> --- a/include/hw/xen/xen-legacy-backend.h
> +++ b/include/hw/xen/xen-legacy-backend.h
> @@ -105,4 +105,6 @@ int xen_config_dev_vfb(int vdev, const char *type);
>   int xen_config_dev_vkbd(int vdev);
>   int xen_config_dev_console(int vdev);
>   
> +void xen_emulated_machine_init(void);
> +
>   #endif /* HW_XEN_LEGACY_BACKEND_H */



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 08/22] xen_platform: exclude vfio-pci from the PCI platform unplug
  2022-12-09  9:55 ` [RFC PATCH v2 08/22] xen_platform: exclude vfio-pci from the PCI platform unplug David Woodhouse
@ 2022-12-12 13:52   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 13:52 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Such that PCI passthrough devices work for Xen emulated guests.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/xen/xen_platform.c | 18 +++++++++++++++---
>   1 file changed, 15 insertions(+), 3 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 09/22] pc_piix: allow xenfv machine with XEN_EMULATE
  2022-12-09  9:55 ` [RFC PATCH v2 09/22] pc_piix: allow xenfv machine with XEN_EMULATE David Woodhouse
@ 2022-12-12 14:05   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:05 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:55, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> This allows -machine xenfv to work with Xen emulated guests.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/pc_piix.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls
  2022-12-09  9:56 ` [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls David Woodhouse
@ 2022-12-12 14:11   ` Paul Durrant
  2022-12-12 14:17     ` David Woodhouse
  2022-12-12 17:07   ` Paolo Bonzini
  1 sibling, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:11 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> This means handling the new exit reason for Xen but still
> crashing on purpose. As we implement each of the hypercalls
> we will then return the right return code.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Add CPL to hypercall tracing, disallow hypercalls from CPL > 0]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/kvm/kvm.c    |  5 +++++
>   target/i386/trace-events |  3 +++
>   target/i386/xen.c        | 39 +++++++++++++++++++++++++++++++++++++++
>   target/i386/xen.h        |  1 +
>   4 files changed, 48 insertions(+)
> 
[snip]
> +int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
> +{
> +    if (exit->type != KVM_EXIT_XEN_HCALL)
> +        return -1;
> +
> +    if (!__kvm_xen_handle_exit(cpu, exit)) {
> +        /* Some hypercalls will be deliberately "implemented" by returning
> +         * -ENOSYS. This case is for hypercalls which are unexpected. */
> +        exit->u.hcall.result = -ENOSYS;
> +        qemu_log_mask(LOG_GUEST_ERROR, "Unimplemented Xen hypercall %"
> +                      PRId64 " (0x%" PRIx64 " 0x%" PRIx64 " 0x%" PRIx64 ")\n",
> +                      (uint64_t)exit->u.hcall.input, (uint64_t)exit->u.hcall.params[0],
> +                      (uint64_t)exit->u.hcall.params[1], (uint64_t)exit->u.hcall.params[1]);

This could get a little noisy; would a trace not be better? Then only 
those interested in it need be bothered by it. As we know, some ancient 
guests attempt to use some hypercalls which have long been consigned to 
the waste-bin of history.

   Paul

> +    }
> +
> +    trace_kvm_xen_hypercall(CPU(cpu)->cpu_index, exit->u.hcall.cpl,
> +                            exit->u.hcall.input, exit->u.hcall.params[0],
> +                            exit->u.hcall.params[1], exit->u.hcall.params[2],
> +                            exit->u.hcall.result);
> +    return 0;
> +}



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls
  2022-12-12 14:11   ` Paul Durrant
@ 2022-12-12 14:17     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-12 14:17 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 2273 bytes --]

On Mon, 2022-12-12 at 14:11 +0000, Paul Durrant wrote:
> On 09/12/2022 09:56, David Woodhouse wrote:
> > From: Joao Martins <
> > joao.m.martins@oracle.com
> > >
> > 
> > This means handling the new exit reason for Xen but still
> > crashing on purpose. As we implement each of the hypercalls
> > we will then return the right return code.
> > 
> > Signed-off-by: Joao Martins <
> > joao.m.martins@oracle.com
> > >
> > [dwmw2: Add CPL to hypercall tracing, disallow hypercalls from CPL > 0]
> > Signed-off-by: David Woodhouse <
> > dwmw@amazon.co.uk
> > >
> > ---
> >   target/i386/kvm/kvm.c    |  5 +++++
> >   target/i386/trace-events |  3 +++
> >   target/i386/xen.c        | 39 +++++++++++++++++++++++++++++++++++++++
> >   target/i386/xen.h        |  1 +
> >   4 files changed, 48 insertions(+)
> > 
> 
> [snip]
> > +int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
> > +{
> > +    if (exit->type != KVM_EXIT_XEN_HCALL)
> > +        return -1;
> > +
> > +    if (!__kvm_xen_handle_exit(cpu, exit)) {
> > +        /* Some hypercalls will be deliberately "implemented" by returning
> > +         * -ENOSYS. This case is for hypercalls which are unexpected. */
> > +        exit->u.hcall.result = -ENOSYS;
> > +        qemu_log_mask(LOG_GUEST_ERROR, "Unimplemented Xen hypercall %"
> > +                      PRId64 " (0x%" PRIx64 " 0x%" PRIx64 " 0x%" PRIx64 ")\n",
> > +                      (uint64_t)exit->u.hcall.input, (uint64_t)exit->u.hcall.params[0],
> > +                      (uint64_t)exit->u.hcall.params[1], (uint64_t)exit->u.hcall.params[1]);
> 
> This could get a little noisy; would a trace not be better? Then only 
> those interested in it need be bothered by it. As we know, some ancient 
> guests attempt to use some hypercalls which have long been consigned to 
> the waste-bin of history.

By the time I'm done here, qemu is going to get the benefit of all
those things that "we know", and is going to have implementations of
those ancient hypercalls which intentionally return -ENOSYS and don't
trigger this warning.

So this is worth reporting in *addition* to the trace (which does also
exist). I'll change it to LOG_UNIMP though, as I think I mentioned in
the cover message.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version
  2022-12-09  9:56 ` [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version David Woodhouse
@ 2022-12-12 14:17   ` Paul Durrant
  2022-12-13  0:06     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:17 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> This is just meant to serve as an example on how we can implement
> hypercalls. xen_version specifically since Qemu does all kind of
> feature controllability. So handling that here seems appropriate.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Implement kvm_gva_rw() safely]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/xen.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 79 insertions(+)
> 
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index 708ab908a0..55beed1913 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -12,9 +12,51 @@
>   #include "qemu/osdep.h"
>   #include "qemu/log.h"
>   #include "kvm/kvm_i386.h"
> +#include "exec/address-spaces.h"
>   #include "xen.h"
>   #include "trace.h"
>   
> +#include "standard-headers/xen/version.h"
> +
> +static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
> +                      bool is_write)
> +{
> +    uint8_t *buf = (uint8_t *)_buf;
> +    size_t i = 0, len = 0;
> +    int ret;
> +
> +    for (i = 0; i < sz; i+= len) {
> +        struct kvm_translation tr = {
> +            .linear_address = gva + i,
> +        };
> +
> +        len = TARGET_PAGE_SIZE - (tr.linear_address & ~TARGET_PAGE_MASK);
> +        if (len > sz)

Shouldn't this be (sz - i)?

   Paul



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages
  2022-12-09  9:56 ` [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages David Woodhouse
@ 2022-12-12 14:29   ` Paul Durrant
  2022-12-12 17:14   ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:29 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> For the shared info page and for grant tables, Xen shares its own pages
> from the "Xen heap" to the guest. The guest requests that a given page
> from a certain address space (XENMAPSPACE_shared_info, etc.) be mapped
> to a given GPA using the XENMEM_add_to_physmap hypercall.
> 
> To support that in qemu when *emulating* Xen, create a memory region
> (migratable) and allow it to be mapped as an overlay when requested.
> 
> Xen theoretically allows the same page to be mapped multiple times
> into the guest, but that's hard to track and reinstate over migration,
> so we automatically *unmap* any previous mapping when creating a new
> one. This approach has been used in production with.... a non-trivial
> number of guests expecting true Xen, without any problems yet being
> noticed.
> 
> This adds just the shared info page for now. The grant tables will be
> a larger region, and will need to be overlaid one page at a time. I
> think that means I need to create separate aliases for each page of
> the overall grant_frames region, so that they can be mapped individually.
> 

Is the following something you want in the commit log?

> Expecting some heckling at the use of xen_overlay_singleton. What is
> the best way to do that? Using qemu_find_recursive() every time seemed
> a bit wrong. But I suppose mapping it into the *guest* isn't a fast
> path, and if the actual grant table code is allowed to just stash the
> pointer it gets from xen_overlay_page_ptr() for later use then that
> isn't a fast path for device I/O either.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/kvm/meson.build   |   1 +
>   hw/i386/kvm/xen_overlay.c | 198 ++++++++++++++++++++++++++++++++++++++
>   hw/i386/kvm/xen_overlay.h |  14 +++
>   hw/i386/pc_piix.c         |   8 ++
>   4 files changed, 221 insertions(+)
>   create mode 100644 hw/i386/kvm/xen_overlay.c
>   create mode 100644 hw/i386/kvm/xen_overlay.h
> 
[snip]

> +static int xen_overlay_map_page_locked(uint32_t space, uint64_t idx, uint64_t gpa)
> +{
> +    MemoryRegion *ovl_page;
> +    int err;
> +
> +    if (space != XENMAPSPACE_shared_info || idx != 0)
> +        return -EINVAL;
> +
> +    if (!xen_overlay_singleton)
> +        return -ENOENT;
> +
> +    ovl_page = &xen_overlay_singleton->shinfo_mem;
> +
> +    /* Xen allows guests to map the same page as many times as it likes
> +     * into guest physical frames. We don't, because it would be hard
> +     * to track and restore them all. One mapping of each page is
> +     * perfectly sufficient for all known guests... and we've tested
> +     * that theory on a few now in other implementations. dwmw2. */
> +    if (memory_region_is_mapped(ovl_page)) {
> +        if (gpa == INVALID_GPA) {
> +            /* If removing shinfo page, turn the kernel magic off first */
> +            if (space == XENMAPSPACE_shared_info) {
> +                err = xen_overlay_set_be_shinfo(INVALID_GFN);
> +                if (err)
> +                    return err;
> +            }
> +            memory_region_del_subregion(get_system_memory(), ovl_page);
> +            goto done;

This seems a little ugly when you could...

> +        } else {
> +            /* Just move it */
> +            memory_region_set_address(ovl_page, gpa);
> +        }
> +    } else if (gpa != INVALID_GPA) {
> +        memory_region_add_subregion_overlap(get_system_memory(), gpa, ovl_page, 0);
> +    }
> +

... just wrap the following line in 'if (gpa != INVALID_GPA)'

Paul

> +    xen_overlay_set_be_shinfo(gpa >> XEN_PAGE_SHIFT);
> + done:
> +    xen_overlay_singleton->shinfo_gpa = gpa;
> +    return 0;
> +}
> +



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op
  2022-12-09  9:56 ` [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op David Woodhouse
@ 2022-12-12 14:38   ` Paul Durrant
  2022-12-13  0:08     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:38 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Specifically XENMEM_add_to_physmap with space XENMAPSPACE_shared_info to
> allow the guest to set its shared_info page.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Use the xen_overlay device]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/trace-events |  1 +
>   target/i386/xen.c        | 59 +++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 59 insertions(+), 1 deletion(-)
>
[snip]
> +static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
> +                                   int cmd, uint64_t arg, X86CPU *cpu)
> +{
> +    CPUState *cs = CPU(cpu);
> +    int err = 0;
> +
> +    switch (cmd) {
> +    case XENMEM_add_to_physmap: {
> +            struct xen_add_to_physmap xatp;
> +
> +            err = kvm_copy_from_gva(cs, arg, &xatp, sizeof(xatp));
> +            if (err) {
> +                break;
> +            }
> +
> +            switch (xatp.space) {
> +            case XENMAPSPACE_shared_info:
> +                break;
> +            default:
> +                err = -ENOSYS;
> +                break;

Don't you want to return false here?

   Paul

> +            }
> +
> +            err = xen_set_shared_info(cs, xatp.gpfn);
> +            break;
> +         }
> +
> +    default:
> +            return false;
> +    }
> +
> +    exit->u.hcall.result = err;
> +    return true;
> +}
> +



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 14/22] i386/xen: implement HYPERVISOR_hvm_op
  2022-12-09  9:56 ` [RFC PATCH v2 14/22] i386/xen: implement HYPERVISOR_hvm_op David Woodhouse
@ 2022-12-12 14:41   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:41 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> This is when guest queries for support for HVMOP_pagetable_dying.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/xen.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)
> 
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index ddd144039a..2847b4f864 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -18,6 +18,7 @@
>   #include "hw/i386/kvm/xen_overlay.h"
>   #include "standard-headers/xen/version.h"
>   #include "standard-headers/xen/memory.h"
> +#include "standard-headers/xen/hvm/hvm_op.h"
>   
>   static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
>                         bool is_write)
> @@ -180,6 +181,19 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
>       return true;
>   }
>   
> +static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
> +                                 int cmd, uint64_t arg)
> +{
> +    switch (cmd) {
> +    case HVMOP_pagetable_dying:
> +            exit->u.hcall.result = -ENOSYS;
> +            return true;
> +
> +    default:
> +            return false;
> +    }
> +}
> +
>   static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
>   {
>       uint16_t code = exit->u.hcall.input;
> @@ -190,6 +204,9 @@ static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)
>       }
>   
>       switch (code) {
> +    case __HYPERVISOR_hvm_op:
> +        return kvm_xen_hcall_hvm_op(exit, exit->u.hcall.params[0],
> +                                    exit->u.hcall.params[1]);
>       case __HYPERVISOR_memory_op:
>           return kvm_xen_hcall_memory_op(exit, exit->u.hcall.params[0],
>                                          exit->u.hcall.params[1], cpu);

This all seems rather pointless since the only sub-op caught results in 
-ENOSYS. AFAICT the only real change to avoid the emission of a log 
message... which I suggested might become a trace anyway.

   Paul



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init
  2022-12-12 13:47   ` Paul Durrant
@ 2022-12-12 14:50     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-12 14:50 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

On Mon, 2022-12-12 at 13:47 +0000, Paul Durrant wrote:
> 
> > @@ -155,6 +156,10 @@ static void pc_init1(MachineState *machine,
> >                x86ms->above_4g_mem_size = 0;
> >                x86ms->below_4g_mem_size = machine->ram_size;
> >            }
> > +
> > +        if (pcms->xen_version && !xen_be_xenstore_open()) {
> 
> So, this is a bit subtle... it's only *because* using real Xen results 
> in xen_version being 0 that this is sane? Also does this not mean that 
> we are now relying on libxenstore? Shouldn't that be called out in the 
> config?

None of the CONFIG_XENFV_MACHINE code builds right now unless
CONFIG_XEN is set anyway. We can move code around and use #ifdef
appropriately once the dust has settled on how the config options are
going to relate to one another; doing that too soon seemed like
pointless churn. I know I didn't *quite* do the config options the way
that Philippe said, so figured it was better to wait until we have
consensus.

As noted in the cover letter, "For now, we just need to be able to use
the xenfv machine in order to instantiate the shinfo and evtchn
objects."

So for now I've basically just stuck with what was in the original
patchset, and this is going to change.

Ideally, I'd like to avoid the external xenstore completely. We could
have a completely internal implementation which is private to the
guest. Since this isn't true Xen, the guest has no way of talking to
anything other than qemu itself, which will play the rôle of dom0.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op
  2022-12-09  9:56 ` [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op David Woodhouse
@ 2022-12-12 14:51   ` Paul Durrant
  2022-12-13  0:10     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:51 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> This is simply when guest tries to register a vcpu_info
> and since vcpu_info placement is optional in the minimum ABI
> therefore we can just fail with -ENOSYS
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/xen.c | 25 +++++++++++++++++++++++++
>   1 file changed, 25 insertions(+)
> 
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index 2847b4f864..9d1daadee1 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -19,6 +19,7 @@
>   #include "standard-headers/xen/version.h"
>   #include "standard-headers/xen/memory.h"
>   #include "standard-headers/xen/hvm/hvm_op.h"
> +#include "standard-headers/xen/vcpu.h"
>   
>   static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
>                         bool is_write)
> @@ -194,6 +195,25 @@ static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
>       }
>   }
>   
> +static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
> +                                  int cmd, int vcpu_id, uint64_t arg)
> +{
> +    int err;
> +
> +    switch (cmd) {
> +    case VCPUOP_register_vcpu_info:
> +        /* no vcpu info placement for now */
> +        err = -ENOSYS;

Again, should this patch be deferred until we actually implement 
something useful here? I.e. folding it into the subsequent patch? It's 
not like the boilerplate is massive.

   Paul



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info
  2022-12-09  9:56 ` [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info David Woodhouse
@ 2022-12-12 14:58   ` Paul Durrant
  2022-12-13  0:13     ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 14:58 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Handle the hypercall to set a per vcpu info, and also wire up the default
> vcpu_info in the shared_info page for the first 32 vCPUs.
> 
> To avoid deadlock within KVM a vCPU thread must set its *own* vcpu_info
> rather than it being set from the context in which the hypercall is
> invoked.
> 
> Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for migration,
> and restore it in kvm_arch_put_registers() appropriately.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/cpu.h        |  2 ++
>   target/i386/kvm/kvm.c    | 19 +++++++++++
>   target/i386/machine.c    | 21 ++++++++++++
>   target/i386/trace-events |  1 +
>   target/i386/xen.c        | 74 +++++++++++++++++++++++++++++++++++++---
>   target/i386/xen.h        |  1 +
>   6 files changed, 113 insertions(+), 5 deletions(-)
> 
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index c6c57baed5..109b2e5669 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1788,6 +1788,8 @@ typedef struct CPUArchState {
>   #endif
>   #if defined(CONFIG_KVM)
>       struct kvm_nested_state *nested_state;
> +    uint64_t xen_vcpu_info_gpa;
> +    uint64_t xen_vcpu_info_default_gpa;
>   #endif
>   #if defined(CONFIG_HVF)
>       HVFX86LazyFlags hvf_lflags;
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index ebde6bc204..fa45e2f99a 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -1811,6 +1811,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
>           has_msr_hv_hypercall = true;
>       }
>   
> +    env->xen_vcpu_info_gpa = UINT64_MAX;
> +    env->xen_vcpu_info_default_gpa = UINT64_MAX;


There was an INVALID_GPA definition for shared info. Looks like we could 
use it here too.

> +
>       xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
>       if (xen_version) {
>   #ifdef CONFIG_XEN_EMU
> @@ -4728,6 +4731,22 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
>           kvm_arch_set_tsc_khz(cpu);
>       }
>   
> +#ifdef CONFIG_XEN_EMU
> +    if (level == KVM_PUT_FULL_STATE) {
> +        uint64_t gpa = x86_cpu->env.xen_vcpu_info_gpa;
> +        if (gpa == UINT64_MAX) {
> +            gpa = x86_cpu->env.xen_vcpu_info_default_gpa;
> +        }
> +
> +        if (gpa != UINT64_MAX) {
> +            ret = kvm_xen_set_vcpu_attr(cpu, KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO, gpa);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +    }
> +#endif
> +
>       ret = kvm_getput_regs(x86_cpu, 1);
>       if (ret < 0) {
>           return ret;
[snip]
> @@ -195,19 +240,38 @@ static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit,
>       }
>   }
>   
> +static int vcpuop_register_vcpu_info(CPUState *cs, CPUState *target,
> +                                     uint64_t arg)
> +{
> +    struct vcpu_register_vcpu_info rvi;
> +    uint64_t gpa;
> +
> +    if (!target)
> +            return -ENOENT;
> +
> +    if (kvm_copy_from_gva(cs, arg, &rvi, sizeof(rvi))) {
> +        return -EFAULT;
> +    }
> +
> +    gpa = ((rvi.mfn << TARGET_PAGE_BITS) + rvi.offset);

Some sanity checks wouldn't go a miss here...

rvi.offset should:
a) be < TARGET_PAGE_SIZE, and
b) ba aligned to vcpu_info_t size

   Paul

> +    async_run_on_cpu(target, do_set_vcpu_info_gpa, RUN_ON_CPU_HOST_ULONG(gpa));
> +    return 0;
> +}
> +



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 17/22] i386/xen: handle VCPUOP_register_vcpu_time_info
  2022-12-09  9:56 ` [RFC PATCH v2 17/22] i386/xen: handle VCPUOP_register_vcpu_time_info David Woodhouse
@ 2022-12-12 15:34   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 15:34 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> In order to support Linux vdso in Xen.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/cpu.h     |  1 +
>   target/i386/kvm/kvm.c |  9 ++++++
>   target/i386/machine.c |  4 ++-
>   target/i386/xen.c     | 70 ++++++++++++++++++++++++++++++++++++-------
>   4 files changed, 72 insertions(+), 12 deletions(-)
> 
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index 109b2e5669..96c2d0d5cb 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -1790,6 +1790,7 @@ typedef struct CPUArchState {
>       struct kvm_nested_state *nested_state;
>       uint64_t xen_vcpu_info_gpa;
>       uint64_t xen_vcpu_info_default_gpa;
> +    uint64_t xen_vcpu_time_info_gpa;
>   #endif
>   #if defined(CONFIG_HVF)
>       HVFX86LazyFlags hvf_lflags;
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index fa45e2f99a..3f19fff21f 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -1813,6 +1813,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
>   
>       env->xen_vcpu_info_gpa = UINT64_MAX;
>       env->xen_vcpu_info_default_gpa = UINT64_MAX;
> +    env->xen_vcpu_time_info_gpa = UINT64_MAX;

Another few candidates for INVALID_GPA.

>   
>       xen_version = kvm_arch_xen_version(MACHINE(qdev_get_machine()));
>       if (xen_version) {
> @@ -4744,6 +4745,14 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
>                   return ret;
>               }
>           }
> +
> +        gpa = x86_cpu->env.xen_vcpu_time_info_gpa;
> +        if (gpa != UINT64_MAX) {
> +            ret = kvm_xen_set_vcpu_attr(cpu, KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO, gpa);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
>       }
>   #endif
>   
> diff --git a/target/i386/machine.c b/target/i386/machine.c
> index 104cd6047c..9acef102a3 100644
> --- a/target/i386/machine.c
> +++ b/target/i386/machine.c
> @@ -1263,7 +1263,8 @@ static bool xen_vcpu_needed(void *opaque)
>       CPUX86State *env = &cpu->env;
>   
>       return (env->xen_vcpu_info_gpa != UINT64_MAX ||
> -            env->xen_vcpu_info_default_gpa != UINT64_MAX);
> +            env->xen_vcpu_info_default_gpa != UINT64_MAX ||
> +            env->xen_vcpu_time_info_gpa != UINT64_MAX);
>   }
>   
>   static const VMStateDescription vmstate_xen_vcpu = {
> @@ -1274,6 +1275,7 @@ static const VMStateDescription vmstate_xen_vcpu = {
>       .fields = (VMStateField[]) {
>           VMSTATE_UINT64(env.xen_vcpu_info_gpa, X86CPU),
>           VMSTATE_UINT64(env.xen_vcpu_info_default_gpa, X86CPU),
> +        VMSTATE_UINT64(env.xen_vcpu_time_info_gpa, X86CPU),
>           VMSTATE_END_OF_LIST()
>       }
>   };
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index cd816bb711..427729ab4d 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -21,28 +21,41 @@
>   #include "standard-headers/xen/hvm/hvm_op.h"
>   #include "standard-headers/xen/vcpu.h"
>   
> +static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
> +                           size_t *len, bool is_write)
> +{
> +        struct kvm_translation tr = {
> +            .linear_address = gva,
> +        };
> +
> +        if (len) {
> +                *len = TARGET_PAGE_SIZE - (gva & ~TARGET_PAGE_MASK);
> +        }
> +
> +        if (kvm_vcpu_ioctl(cs, KVM_TRANSLATE, &tr) || !tr.valid ||
> +            (is_write && !tr.writeable)) {
> +            return false;
> +        }
> +        *gpa = tr.physical_address;
> +        return true;
> +}
> +
>   static int kvm_gva_rw(CPUState *cs, uint64_t gva, void *_buf, size_t sz,
>                         bool is_write)
>   {
>       uint8_t *buf = (uint8_t *)_buf;
>       size_t i = 0, len = 0;
> -    int ret;
>   
>       for (i = 0; i < sz; i+= len) {
> -        struct kvm_translation tr = {
> -            .linear_address = gva + i,
> -        };
> +        uint64_t gpa;
>   
> -        len = TARGET_PAGE_SIZE - (tr.linear_address & ~TARGET_PAGE_MASK);
> +        if (!kvm_gva_to_gpa(cs, gva + i, &gpa, &len, is_write)) {
> +                return -EFAULT;
> +        }
>           if (len > sz)
>               len = sz;
>   
> -        ret = kvm_vcpu_ioctl(cs, KVM_TRANSLATE, &tr);
> -        if (ret || !tr.valid || (is_write && !tr.writeable)) {
> -            return -EFAULT;
> -        }
> -
> -        cpu_physical_memory_rw(tr.physical_address, buf + i, len, is_write);
> +        cpu_physical_memory_rw(gpa, buf + i, len, is_write);
>       }
>   
>       return 0;
> @@ -166,6 +179,17 @@ static void do_set_vcpu_info_gpa(CPUState *cs, run_on_cpu_data data)
>                             env->xen_vcpu_info_gpa);
>   }
>   
> +static void do_set_vcpu_time_info_gpa(CPUState *cs, run_on_cpu_data data)
> +{
> +    X86CPU *cpu = X86_CPU(cs);
> +    CPUX86State *env = &cpu->env;
> +
> +    env->xen_vcpu_time_info_gpa = data.host_ulong;
> +
> +    kvm_xen_set_vcpu_attr(cs, KVM_XEN_VCPU_ATTR_TYPE_VCPU_TIME_INFO,
> +                          env->xen_vcpu_time_info_gpa);
> +}
> +
>   static int xen_set_shared_info(CPUState *cs, uint64_t gfn)
>   {
>       uint64_t gpa = gfn << TARGET_PAGE_BITS;
> @@ -258,6 +282,27 @@ static int vcpuop_register_vcpu_info(CPUState *cs, CPUState *target,
>       return 0;
>   }
>   
> +static int vcpuop_register_vcpu_time_info(CPUState *cs, CPUState *target,
> +                                          uint64_t arg)
> +{
> +    struct vcpu_register_time_memory_area tma;
> +    uint64_t gpa;
> +    size_t len;
> +
> +    if (kvm_copy_from_gva(cs, arg, &tma, sizeof(*tma.addr.v))) {
> +        return -EFAULT;
> +    }
> +
> +    if (!kvm_gva_to_gpa(cs, tma.addr.p, &gpa, &len, false) ||
> +        len < sizeof(tma)) {
> +        return -EFAULT;
> +    }

Xen stashes the GVA, not the GPA, and so it would be possible to 
register the same GVA on different vcpus to point at different areas of 
memory.

   Paul

> +
> +    async_run_on_cpu(target, do_set_vcpu_time_info_gpa,
> +                     RUN_ON_CPU_HOST_ULONG(gpa));
> +    return 0;
> +}
> +
>   static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
>                                     int cmd, int vcpu_id, uint64_t arg)
>   {
> @@ -266,6 +311,9 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
>       int err;
>   
>       switch (cmd) {
> +    case VCPUOP_register_vcpu_time_memory_area:
> +            err = vcpuop_register_vcpu_time_info(cs, dest, arg);
> +            break;
>       case VCPUOP_register_vcpu_info:
>               err = vcpuop_register_vcpu_info(cs, dest, arg);
>               break;



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 18/22] i386/xen: handle VCPUOP_register_runstate_memory_area
  2022-12-09  9:56 ` [RFC PATCH v2 18/22] i386/xen: handle VCPUOP_register_runstate_memory_area David Woodhouse
@ 2022-12-12 15:38   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 15:38 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Allow guest to setup the vcpu runstates which is used as
> steal clock.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/cpu.h     |  1 +
>   target/i386/kvm/kvm.c |  9 +++++++++
>   target/i386/machine.c |  4 +++-
>   target/i386/xen.c     | 35 +++++++++++++++++++++++++++++++++++
>   4 files changed, 48 insertions(+), 1 deletion(-)
> 
[snip]
> +static int vcpuop_register_runstate_info(CPUState *cs, CPUState *target,
> +                                         uint64_t arg)
> +{
> +    struct vcpu_register_runstate_memory_area rma;
> +    uint64_t gpa;
> +    size_t len;
> +
> +    if (kvm_copy_from_gva(cs, arg, &rma, sizeof(*rma.addr.v))) {
> +        return -EFAULT;
> +    }
> +
> +    if (!kvm_gva_to_gpa(cs, rma.addr.p, &gpa, &len, false) ||
> +        len < sizeof(struct vcpu_time_info)) {
> +        return -EFAULT;
> +    }

Again, Xen stashes the GVA for this and not the GPA.

   Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 19/22] i386/xen: implement HVMOP_set_evtchn_upcall_vector
  2022-12-09  9:56 ` [RFC PATCH v2 19/22] i386/xen: implement HVMOP_set_evtchn_upcall_vector David Woodhouse
@ 2022-12-12 15:52   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 15:52 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Ankur Arora <ankur.a.arora@oracle.com>
> 
> The HVMOP_set_evtchn_upcall_vector hypercall sets the per-vCPU upcall
> vector, to be delivered to the local APIC just like an MSI (with an EOI).
> 
> This takes precedence over the system-wide delivery method set by the
> HVMOP_set_param hypercall with HVM_PARAM_CALLBACK_IRQ. It's used by
> Windows and Xen (PV shim) guests but normally not by Linux.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Rework for upstream kernel changes and split from HVMOP_set_param]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/cpu.h        |  1 +
>   target/i386/kvm/kvm.c    |  7 ++++
>   target/i386/machine.c    |  4 ++-
>   target/i386/trace-events |  1 +
>   target/i386/xen.c        | 69 +++++++++++++++++++++++++++++++++++++---
>   target/i386/xen.h        |  1 +
>   6 files changed, 77 insertions(+), 6 deletions(-)
> 
[snip]
> +static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit, X86CPU *cpu,
>                                    int cmd, uint64_t arg)
>   {
> +    int ret = -ENOSYS;
>       switch (cmd) {
> +    case HVMOP_set_evtchn_upcall_vector:
> +            ret = kvm_xen_hcall_evtchn_upcall_vector(exit, cpu,
> +                                                     exit->u.hcall.params[0]);
> +            break;
>       case HVMOP_pagetable_dying:
> -            exit->u.hcall.result = -ENOSYS;
> -            return true;
> +            ret = -ENOSYS;
> +            break;

Handling HVMOP_pagetable_dying could now be folded in as a foot-note 
since it's a rather uninteresting relic of shadow paging. But otherwise...

Reviewed-by: Paul Durrant <paul@xen.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-09  9:56 ` [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ David Woodhouse
@ 2022-12-12 16:16   ` Paul Durrant
  2022-12-12 16:26     ` David Woodhouse
  2022-12-21  1:41     ` David Woodhouse
  0 siblings, 2 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 16:16 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Ankur Arora <ankur.a.arora@oracle.com>
> 
> The HVM_PARAM_CALLBACK_IRQ parameter controls the system-wide event
> channel upcall method.  The vector support is handled by KVM internally,
> when the evtchn_upcall_pending field in the vcpu_info is set.
> 
> The GSI and PCI_INTX delivery methods are not supported. yet; those
> need to simulate a level-triggered event on the I/OAPIC.

That's gonna be somewhat limiting if anyone runs a Windows guest with 
upcall vector support turned off... which is an option at:

https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenbus.git;a=blob;f=src/xenbus/evtchn.c;;hb=HEAD#l1928

> 
> Add a 'xen_evtchn' device to host the migration state, as we'll shortly
> be adding a full event channel table there too.
> 
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> [dwmw2: Rework for upstream kernel changes, split from per-VCPU vector]
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   hw/i386/kvm/meson.build  |   5 +-
>   hw/i386/kvm/xen_evtchn.c | 117 +++++++++++++++++++++++++++++++++++++++
>   hw/i386/kvm/xen_evtchn.h |  13 +++++
>   hw/i386/pc_piix.c        |   2 +
>   target/i386/xen.c        |  44 +++++++++++++--
>   5 files changed, 174 insertions(+), 7 deletions(-)
>   create mode 100644 hw/i386/kvm/xen_evtchn.c
>   create mode 100644 hw/i386/kvm/xen_evtchn.h
> 
> diff --git a/hw/i386/kvm/meson.build b/hw/i386/kvm/meson.build
> index 6165cbf019..cab64df339 100644
> --- a/hw/i386/kvm/meson.build
> +++ b/hw/i386/kvm/meson.build
> @@ -4,6 +4,9 @@ i386_kvm_ss.add(when: 'CONFIG_APIC', if_true: files('apic.c'))
>   i386_kvm_ss.add(when: 'CONFIG_I8254', if_true: files('i8254.c'))
>   i386_kvm_ss.add(when: 'CONFIG_I8259', if_true: files('i8259.c'))
>   i386_kvm_ss.add(when: 'CONFIG_IOAPIC', if_true: files('ioapic.c'))
> -i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen_overlay.c'))
> +i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files(
> +  'xen_overlay.c',
> +  'xen_evtchn.c',
> +  ))
>   
>   i386_ss.add_all(when: 'CONFIG_KVM', if_true: i386_kvm_ss)
> diff --git a/hw/i386/kvm/xen_evtchn.c b/hw/i386/kvm/xen_evtchn.c
> new file mode 100644
> index 0000000000..1ca0c034e7
> --- /dev/null
> +++ b/hw/i386/kvm/xen_evtchn.c
> @@ -0,0 +1,117 @@
> +/*
> + * QEMU Xen emulation: Shared/overlay pages support
> + *
> + * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + *
> + * Authors: David Woodhouse <dwmw2@infradead.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/host-utils.h"
> +#include "qemu/module.h"
> +#include "qemu/main-loop.h"
> +#include "qapi/error.h"
> +#include "qom/object.h"
> +#include "exec/target_page.h"
> +#include "exec/address-spaces.h"
> +#include "migration/vmstate.h"
> +
> +#include "hw/sysbus.h"
> +#include "hw/xen/xen.h"
> +#include "xen_evtchn.h"
> +
> +#include "sysemu/kvm.h"
> +#include <linux/kvm.h>
> +
> +#include "standard-headers/xen/memory.h"
> +#include "standard-headers/xen/hvm/params.h"
> +
> +#define TYPE_XEN_EVTCHN "xenevtchn"
> +OBJECT_DECLARE_SIMPLE_TYPE(XenEvtchnState, XEN_EVTCHN)
> +
> +struct XenEvtchnState {
> +    /*< private >*/
> +    SysBusDevice busdev;
> +    /*< public >*/
> +
> +    uint64_t callback_param;
> +};
> +
> +struct XenEvtchnState *xen_evtchn_singleton;
> +
> +static int xen_evtchn_post_load(void *opaque, int version_id)
> +{
> +    XenEvtchnState *s = opaque;
> +
> +    if (s->callback_param) {
> +        xen_evtchn_set_callback_param(s->callback_param);
> +    }
> +
> +    return 0;
> +}
> +
> +static bool xen_evtchn_is_needed(void *opaque)
> +{
> +    return xen_mode == XEN_EMULATE;
> +}
> +
> +static const VMStateDescription xen_evtchn_vmstate = {
> +    .name = "xen_evtchn",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = xen_evtchn_is_needed,
> +    .post_load = xen_evtchn_post_load,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT64(callback_param, XenEvtchnState),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static void xen_evtchn_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->vmsd = &xen_evtchn_vmstate;
> +}
> +
> +static const TypeInfo xen_evtchn_info = {
> +    .name          = TYPE_XEN_EVTCHN,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(XenEvtchnState),
> +    .class_init    = xen_evtchn_class_init,
> +};
> +
> +void xen_evtchn_create(void)
> +{
> +    xen_evtchn_singleton = XEN_EVTCHN(sysbus_create_simple(TYPE_XEN_EVTCHN, -1, NULL));
> +}
> +
> +static void xen_evtchn_register_types(void)
> +{
> +    type_register_static(&xen_evtchn_info);
> +}
> +
> +type_init(xen_evtchn_register_types)
> +
> +
> +#define CALLBACK_VIA_TYPE_SHIFT       56
> +
> +int xen_evtchn_set_callback_param(uint64_t param)
> +{
> +    int ret = -ENOSYS;
> +
> +    if (param >> CALLBACK_VIA_TYPE_SHIFT == HVM_PARAM_CALLBACK_TYPE_VECTOR) {
> +        struct kvm_xen_hvm_attr xa = {
> +            .type = KVM_XEN_ATTR_TYPE_UPCALL_VECTOR,
> +            .u.vector = (uint8_t)param,
> +        };
> +
> +        ret = kvm_vm_ioctl(kvm_state, KVM_XEN_HVM_SET_ATTR, &xa);
> +        if (!ret && xen_evtchn_singleton)
> +            xen_evtchn_singleton->callback_param = param;
> +    }
> +    return ret;
> +}
> diff --git a/hw/i386/kvm/xen_evtchn.h b/hw/i386/kvm/xen_evtchn.h
> new file mode 100644
> index 0000000000..11c6ed22a0
> --- /dev/null
> +++ b/hw/i386/kvm/xen_evtchn.h
> @@ -0,0 +1,13 @@
> +/*
> + * QEMU Xen emulation: Event channel support
> + *
> + * Copyright © 2022 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + *
> + * Authors: David Woodhouse <dwmw2@infradead.org>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +void xen_evtchn_create(void);
> +int xen_evtchn_set_callback_param(uint64_t param);
> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
> index c3c61eedde..18540084a0 100644
> --- a/hw/i386/pc_piix.c
> +++ b/hw/i386/pc_piix.c
> @@ -60,6 +60,7 @@
>   #endif
>   #ifdef CONFIG_XEN_EMU
>   #include "hw/i386/kvm/xen_overlay.h"
> +#include "hw/i386/kvm/xen_evtchn.h"
>   #endif
>   #include "migration/global_state.h"
>   #include "migration/misc.h"
> @@ -417,6 +418,7 @@ static void pc_xen_hvm_init(MachineState *machine)
>   #ifdef CONFIG_XEN_EMU
>       if (xen_mode == XEN_EMULATE) {
>               xen_overlay_create();
> +            xen_evtchn_create();
>       }
>   #endif
>   }
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index 2583c00a6b..1af336d9e5 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -16,6 +16,8 @@
>   #include "xen.h"
>   #include "trace.h"
>   #include "hw/i386/kvm/xen_overlay.h"
> +#include "hw/i386/kvm/xen_evtchn.h"
> +
>   #include "standard-headers/xen/version.h"
>   #include "standard-headers/xen/memory.h"
>   #include "standard-headers/xen/hvm/hvm_op.h"
> @@ -287,24 +289,53 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
>       return true;
>   }
>   
> +static int handle_set_param(struct kvm_xen_exit *exit, X86CPU *cpu,
> +                            uint64_t arg)
> +{
> +    CPUState *cs = CPU(cpu);
> +    struct xen_hvm_param hp;
> +    int err = 0;
> +
> +    if (kvm_copy_from_gva(cs, arg, &hp, sizeof(hp))) {
> +        err = -EFAULT;
> +        goto out;
> +    }
> +
> +    if (hp.domid != DOMID_SELF) {

Xen actually allows the domain's own id to be specified as well as the 
magic DOMID_SELF.

> +        err = -EINVAL;

And this should be -ESRCH.

> +        goto out;
> +    }
> +
> +    switch (hp.index) {
> +    case HVM_PARAM_CALLBACK_IRQ:
> +        err = xen_evtchn_set_callback_param(hp.value);
> +        break;
> +    default:
> +        return false;
> +    }
> +
> +out:
> +    exit->u.hcall.result = err;

This is a bit on the ugly side isn't it? Why not return the err and have 
kvm_xen_hcall_hvm_op() deal with passing it back?

> +    return true;
> +}
> +
>   static int kvm_xen_hcall_evtchn_upcall_vector(struct kvm_xen_exit *exit,
>                                                 X86CPU *cpu, uint64_t arg)
>   {
> -    struct xen_hvm_evtchn_upcall_vector *up;
> +    struct xen_hvm_evtchn_upcall_vector up;
>       CPUState *target_cs;
>       int vector;
>   
> -    up = gva_to_hva(CPU(cpu), arg);
> -    if (!up) {
> +    if (kvm_copy_from_gva(CPU(cpu), arg, &up, sizeof(up))) {
>           return -EFAULT;
>       }
>   
> -    vector = up->vector;
> +    vector = up.vector;
>       if (vector < 0x10) {
>           return -EINVAL;
>       }
>   
> -    target_cs = qemu_get_cpu(up->vcpu);
> +    target_cs = qemu_get_cpu(up.vcpu);
>       if (!target_cs) {
>           return -EINVAL;
>       }

These changes to kvm_xen_hcall_evtchn_upcall_vector() seem to have 
nothing to do with the rest of the patch. Am I missing something?

   Paul

> @@ -325,7 +356,8 @@ static bool kvm_xen_hcall_hvm_op(struct kvm_xen_exit *exit, X86CPU *cpu,
>       case HVMOP_pagetable_dying:
>               ret = -ENOSYS;
>               break;
> -
> +    case HVMOP_set_param:
> +            return handle_set_param(exit, cpu, arg);
>       default:
>               return false;
>       }



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 21/22] i386/xen: implement HYPERVISOR_event_channel_op
  2022-12-09  9:56 ` [RFC PATCH v2 21/22] i386/xen: implement HYPERVISOR_event_channel_op David Woodhouse
@ 2022-12-12 16:23   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 16:23 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> Additionally set XEN_INTERFACE_VERSION to most recent in order to
> exercise both event_channel_op and event_channel_op_compat.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/xen.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 45 insertions(+)
> 
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index 1af336d9e5..f102c40f04 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -18,11 +18,14 @@
>   #include "hw/i386/kvm/xen_overlay.h"
>   #include "hw/i386/kvm/xen_evtchn.h"
>   
> +#define __XEN_INTERFACE_VERSION__ 0x00040400
> +

Xen is actually at:

#define __XEN_LATEST_INTERFACE_VERSION__ 0x00040e00

now.

>   #include "standard-headers/xen/version.h"
>   #include "standard-headers/xen/memory.h"
>   #include "standard-headers/xen/hvm/hvm_op.h"
>   #include "standard-headers/xen/hvm/params.h"
>   #include "standard-headers/xen/vcpu.h"
> +#include "standard-headers/xen/event_channel.h"
>   
>   static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
>                              size_t *len, bool is_write)
> @@ -452,6 +455,42 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_xen_exit *exit, X86CPU *cpu,
>       return true;
>   }
>   
> +static bool kvm_xen_hcall_evtchn_op_compat(struct kvm_xen_exit *exit,
> +                                          X86CPU *cpu, uint64_t arg)
> +{
> +    struct evtchn_op op;
> +    int err = -EFAULT;
> +
> +    if (kvm_copy_from_gva(CPU(cpu), arg, &op, sizeof(op))) {
> +        goto err;
> +    }
> +
> +    switch (op.cmd) {
> +    default:
> +        return false;
> +    }
> +err:
> +    exit->u.hcall.result = err;
> +    return true;
> +}
> +
> +static bool kvm_xen_hcall_evtchn_op(struct kvm_xen_exit *exit,
> +                                    int cmd, uint64_t arg)
> +{
> +    int err = -ENOSYS;
> +
> +    switch (cmd) {
> +    case EVTCHNOP_init_control:
> +        err = -ENOSYS;

Why? It already has that value.

   Paul

> +        break;
> +    default:
> +        return false;
> +    }
> +
> +    exit->u.hcall.result = err;
> +    return true;
> +}
> +



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-12 16:16   ` Paul Durrant
@ 2022-12-12 16:26     ` David Woodhouse
  2022-12-12 16:39       ` Paul Durrant
  2022-12-21  1:41     ` David Woodhouse
  1 sibling, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-12 16:26 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 1023 bytes --]

On Mon, 2022-12-12 at 16:16 +0000, Paul Durrant wrote:
> On 09/12/2022 09:56, David Woodhouse wrote:
> > From: Ankur Arora <ankur.a.arora@oracle.com>
> > The HVM_PARAM_CALLBACK_IRQ parameter controls the system-wide event
> > channel upcall method.  The vector support is handled by KVM internally,
> > when the evtchn_upcall_pending field in the vcpu_info is set.
> > The GSI and PCI_INTX delivery methods are not supported. yet; those
> > need to simulate a level-triggered event on the I/OAPIC.
> 
> That's gonna be somewhat limiting if anyone runs a Windows guest with 
> upcall vector support turned off... which is an option at:
> 
> https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenbus.git;a=blob;f=src/xenbus/evtchn.c;;hb=HEAD#l1928

Sure. And as you know, I also added the 'xen_no_vector_callback' option
to the Linux command line to allow for that mode to be tested with
Linux too: https://git.kernel.org/torvalds/c/b36b0fe96a 

The GSI and PCI_INTX modes will be added in time, but not yet.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 22/22] i386/xen: implement HYPERVISOR_sched_op
  2022-12-09  9:56 ` [RFC PATCH v2 22/22] i386/xen: implement HYPERVISOR_sched_op David Woodhouse
@ 2022-12-12 16:37   ` Paul Durrant
  0 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 16:37 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 09/12/2022 09:56, David Woodhouse wrote:
> From: Joao Martins <joao.m.martins@oracle.com>
> 
> It allows to shutdown itself via hypercall with any of the 3 reasons:
>    1) self-reboot
>    2) shutdown
>    3) crash
> 
> Implementing SCHEDOP_shutdown sub op let us handle crashes gracefully rather
> than leading to triple faults if it remains unimplemented.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   target/i386/xen.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 45 insertions(+)
> 
> diff --git a/target/i386/xen.c b/target/i386/xen.c
> index f102c40f04..5f3b91450d 100644
> --- a/target/i386/xen.c
> +++ b/target/i386/xen.c
> @@ -17,6 +17,7 @@
>   #include "trace.h"
>   #include "hw/i386/kvm/xen_overlay.h"
>   #include "hw/i386/kvm/xen_evtchn.h"
> +#include "sysemu/runstate.h"
>   
>   #define __XEN_INTERFACE_VERSION__ 0x00040400
>   
> @@ -25,6 +26,7 @@
>   #include "standard-headers/xen/hvm/hvm_op.h"
>   #include "standard-headers/xen/hvm/params.h"
>   #include "standard-headers/xen/vcpu.h"
> +#include "standard-headers/xen/sched.h"
>   #include "standard-headers/xen/event_channel.h"
>   
>   static bool kvm_gva_to_gpa(CPUState *cs, uint64_t gva, uint64_t *gpa,
> @@ -491,6 +493,45 @@ static bool kvm_xen_hcall_evtchn_op(struct kvm_xen_exit *exit,
>       return true;
>   }
>   
> +static int schedop_shutdown(CPUState *cs, uint64_t arg)
> +{
> +    struct sched_shutdown shutdown;
> +
> +    if (kvm_copy_from_gva(cs, arg, &shutdown, sizeof(shutdown))) {
> +        return -EFAULT;
> +    }
> +
> +    if (shutdown.reason == SHUTDOWN_crash) {
> +        cpu_dump_state(cs, stderr, CPU_DUMP_CODE);
> +        qemu_system_guest_panicked(NULL);
> +    } else if (shutdown.reason == SHUTDOWN_reboot) {
> +        qemu_system_reset_request(SHUTDOWN_CAUSE_GUEST_RESET);
> +    } else if (shutdown.reason == SHUTDOWN_poweroff) {
> +        qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
> +    }

Why not a switch() statement? I think it would look neater and be more 
consistent with other code.
With that change...

Reviewed-by: Paul Durrant <paul@xen.org>



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-12 16:26     ` David Woodhouse
@ 2022-12-12 16:39       ` Paul Durrant
  2022-12-15 20:54         ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 16:39 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 12/12/2022 16:26, David Woodhouse wrote:
> On Mon, 2022-12-12 at 16:16 +0000, Paul Durrant wrote:
>> On 09/12/2022 09:56, David Woodhouse wrote:
>>> From: Ankur Arora <ankur.a.arora@oracle.com>
>>> The HVM_PARAM_CALLBACK_IRQ parameter controls the system-wide event
>>> channel upcall method.  The vector support is handled by KVM internally,
>>> when the evtchn_upcall_pending field in the vcpu_info is set.
>>> The GSI and PCI_INTX delivery methods are not supported. yet; those
>>> need to simulate a level-triggered event on the I/OAPIC.
>>
>> That's gonna be somewhat limiting if anyone runs a Windows guest with
>> upcall vector support turned off... which is an option at:
>>
>> https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenbus.git;a=blob;f=src/xenbus/evtchn.c;;hb=HEAD#l1928
> 
> Sure. And as you know, I also added the 'xen_no_vector_callback' option
> to the Linux command line to allow for that mode to be tested with
> Linux too: https://git.kernel.org/torvalds/c/b36b0fe96a
> 
> The GSI and PCI_INTX modes will be added in time, but not yet.

Ok, but maybe worth calling out the limitation in the commit comment for 
those wishing to kick the tyres.

   Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-09  9:55 ` [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation David Woodhouse
  2022-12-12  9:19   ` Paul Durrant
@ 2022-12-12 17:07   ` Paolo Bonzini
  2022-12-12 22:22     ` David Woodhouse
  1 sibling, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2022-12-12 17:07 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 12/9/22 10:55, David Woodhouse wrote:
>   config KVM
>       bool
> +    imply XEN_EMU if (I386 || X86_64)

No need for the "imply", just make it "default y" below and it will have 
the same effect.

> 
> diff --git a/target/Kconfig b/target/Kconfig
> index 83da0bd293..e19c9d77b5 100644
> --- a/target/Kconfig
> +++ b/target/Kconfig
> @@ -18,3 +18,7 @@ source sh4/Kconfig
>  source sparc/Kconfig
>  source tricore/Kconfig
>  source xtensa/Kconfig
> +
> +config XEN_EMU
> +    bool
> +    depends on KVM && (I386 || X86_64)

Please place it in hw/xen/Kconfig.

Paolo



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls
  2022-12-09  9:56 ` [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls David Woodhouse
  2022-12-12 14:11   ` Paul Durrant
@ 2022-12-12 17:07   ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2022-12-12 17:07 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 12/9/22 10:56, David Woodhouse wrote:
> +static bool __kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)

No double underscores in userspace.

> +{
> +    uint16_t code = exit->u.hcall.input;
> +
> +    if (exit->u.hcall.cpl > 0) {
> +        exit->u.hcall.result = -EPERM;
> +        return true;
> +    }
> +
> +    switch (code) {
> +    default:
> +        return false;
> +    }
> +}
> +
> +int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit)

BTW, please name this file either target/i386/xen-emu.c or 
target/i386/kvm/xen-emu.c (probably the latter is better, since XEN_EMU 
depends on KVM.)

Paolo



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages
  2022-12-09  9:56 ` [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages David Woodhouse
  2022-12-12 14:29   ` Paul Durrant
@ 2022-12-12 17:14   ` Paolo Bonzini
  1 sibling, 0 replies; 78+ messages in thread
From: Paolo Bonzini @ 2022-12-12 17:14 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 12/9/22 10:56, David Woodhouse wrote:
> Expecting some heckling at the use of xen_overlay_singleton. What is
> the best way to do that? Using qemu_find_recursive() every time seemed
> a bit wrong. But I suppose mapping it into the*guest*  isn't a fast
> path, and if the actual grant table code is allowed to just stash the
> pointer it gets from xen_overlay_page_ptr() for later use then that
> isn't a fast path for device I/O either.

Nah, it's fine.  However I'll reply to patch 2 as to how to handle Xen 
vs. KVM mode.

Paolo



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support
  2022-12-09  9:55 ` [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support David Woodhouse
  2022-12-12 12:48   ` Paul Durrant
@ 2022-12-12 17:30   ` Paolo Bonzini
  2022-12-12 17:55     ` Paul Durrant
                       ` (2 more replies)
  1 sibling, 3 replies; 78+ messages in thread
From: Paolo Bonzini @ 2022-12-12 17:30 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 12/9/22 10:55, David Woodhouse wrote:
> -    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> +    if (xen_enabled())
> +            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> +    else
> +            m->default_machine_opts = "accel=kvm,xen-version=0x30001";

Please do not modify pc_xen_hvm_init().

"-M xenfv" should be the same as "-M pc-i440fx-...,suppress-vmdesc=on 
-accel xen -device xen-platform".  It must work *without* "-accel xen", 
while here you're basically requiring it.  For now, please make 
KVM-emulated Xen use "-M pc -device xen-platform".  We can figure out 
"-M xenfv" later.

You can instead have:

- a check in xen_init() that checks that xen_mode is XEN_ATTACH.  If 
not, fail.

- an extra enum value for xen_mode, XEN_DISABLED, which is the default 
instead of XEN_EMULATE;

- an accelerator property "-accel kvm,xen-version=...", added in 
kvm_accel_class_init() instead of the machine property.  The property, 
when set to a nonzero value, flips xen_mode from XEN_DISABLED to 
XEN_EMULATE.

The Xen overlay device can be created using the mc->kvm_type function 
(which you can set in DEFINE_PC_MACHINE); at that point, xen_mode has 
switched from XEN_DISABLED to XEN_EMULATE.  Those xen_enabled() checks 
that apply to KVM then become xen_mode != XEN_DISABLED, as long as they 
run during mc->kvm_type or afterwards.

The platform device can be created either in mc->kvm_type or manually 
(not sure if it makes sense to have a "XenVMMXenVMM" CPUID + emulated 
hypercalls but no platform device---would it still use pvclock for 
example?).

Paolo



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support
  2022-12-12 17:30   ` Paolo Bonzini
@ 2022-12-12 17:55     ` Paul Durrant
  2022-12-13  0:13     ` David Woodhouse
  2023-01-17 13:49     ` David Woodhouse
  2 siblings, 0 replies; 78+ messages in thread
From: Paul Durrant @ 2022-12-12 17:55 UTC (permalink / raw)
  To: Paolo Bonzini, David Woodhouse, qemu-devel
  Cc: Joao Martins, Ankur Arora, Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 12/12/2022 17:30, Paolo Bonzini wrote:
[snip]
> 
> The platform device can be created either in mc->kvm_type or manually 
> (not sure if it makes sense to have a "XenVMMXenVMM" CPUID + emulated 
> hypercalls but no platform device---would it still use pvclock for 
> example?).
> 

Not sure it's wise but the platform device is certainly optional in 
xl.cfg so you can easily bring up a VM without it.

   Paul





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode
  2022-12-12 13:24   ` Paul Durrant
@ 2022-12-12 22:07     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-12 22:07 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 1521 bytes --]

On Mon, 2022-12-12 at 13:24 +0000, Paul Durrant wrote:
> On 09/12/2022 09:55, David Woodhouse wrote:
> > --- a/hw/i386/xen/xen_platform.c
> > +++ b/hw/i386/xen/xen_platform.c
> > @@ -271,7 +271,10 @@ static void platform_fixed_ioport_writeb(void *opaque, uint32_t addr, uint32_t v
> >        case 0: /* Platform flags */ {
> >            hvmmem_type_t mem_type = (val & PFFLAG_ROM_LOCK) ?
> >                HVMMEM_ram_ro : HVMMEM_ram_rw;
> > -        if (xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40)) {
> > +        if (xen_mode == XEN_EMULATE) {
> > +            /* XXX */
> > +            s->flags = val & PFFLAG_ROM_LOCK;
> > +        } else if (xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40)) {
> >                DPRINTF("unable to change ro/rw state of ROM memory area!\n");
> >            } else {
> >                s->flags = val & PFFLAG_ROM_LOCK;
> 
> 
> 
> Surely this would cleaner as:
> 
> 
> 
> if (xen_mode != XEN_EMULATE && xen_set_mem_type(xen_domid, mem_type, 0xc0, 0x40))
>      DPRINTF("unable to change ro/rw state of ROM memory area!\n");
> else
>      s->flags = val & PFFLAG_ROM_LOCK;

Or maybe it should actually call into the PIIX code for frobbing the
read-only state of the UMBs? Do we even implement that in qemu? But
again, this part is just the necessary evil to make the thing testable
with -M xenfv for now.

I'm going to take a closer look at Paolo's suggestion which should
reduce the amount of such noise before we get to the real parts.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-12 17:07   ` Paolo Bonzini
@ 2022-12-12 22:22     ` David Woodhouse
  2022-12-13  0:39       ` Paolo Bonzini
  0 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-12 22:22 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 1620 bytes --]

On Mon, 2022-12-12 at 18:07 +0100, Paolo Bonzini wrote:
> On 12/9/22 10:55, David Woodhouse wrote:
> >   config KVM
> >       bool
> > +    imply XEN_EMU if (I386 || X86_64)
> 
> No need for the "imply", just make it "default y" below and it will have 
> the same effect.
> 
> > 
> > diff --git a/target/Kconfig b/target/Kconfig
> > index 83da0bd293..e19c9d77b5 100644
> > --- a/target/Kconfig
> > +++ b/target/Kconfig
> > @@ -18,3 +18,7 @@ source sh4/Kconfig
> >  source sparc/Kconfig
> >  source tricore/Kconfig
> >  source xtensa/Kconfig
> > +
> > +config XEN_EMU
> > +    bool
> > +    depends on KVM && (I386 || X86_64)
> 
> Please place it in hw/xen/Kconfig.

Perhaps I misunderstand, but I'm not sure that is consistent with what
Philippe was asking for in  
https://lore.kernel.org/qemu-devel/d203e13d-e2f9-5816-030d-c1449bde364d@linaro.org/
specifically:

>> I rather have hw/ and target/ features disentangled, so I'd use
>> CONFIG_XEN_EMU under target/ and CONFIG_XENFV_MACHINE under hw/,
>> eventually having CONFIG_XEN_EMU depending on CONFIG_XENFV_MACHINE
>> and -- for now -- CONFIG_KVM.

The idea there seems to be that XEN_EMU is a *target* feature since it
covers the support in target/i386/kvm.

But yes, it *also* covers the devices I'm adding to hw/i386/kvm. Do I
want a *separate* config symbol for that? Or just make those depend on
XENFV_MACHINE && XEN_EMU ? 

I'll move XEN_EMU to hw/i386/Kconfig for now, thereby doing what
*neither* of you said (I don't think hw/xen/Kconfig is the best choice
when the *code* it enables is under hw/i386/kvm?)



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version
  2022-12-12 14:17   ` Paul Durrant
@ 2022-12-13  0:06     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  0:06 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 192 bytes --]

On Mon, 2022-12-12 at 14:17 +0000, Paul Durrant wrote:
> Shouldn't this be (sz - i)?

Turns out I hate that loop. Have refactored it to be
while (sz) { ... sz -= len; ... }.

Thanks.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op
  2022-12-12 14:38   ` Paul Durrant
@ 2022-12-13  0:08     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  0:08 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 533 bytes --]

On Mon, 2022-12-12 at 14:38 +0000, Paul Durrant wrote:
> 
> > +            switch (xatp.space) {
> > +            case XENMAPSPACE_shared_info:
> > +                break;
> > +            default:
> > +                err = -ENOSYS;
> > +                break;
> 
> Don't you want to return false here?

It doesn't make a lot of difference right now. Once we believe we've
implemented everything a sane guest would use, then the distinction
between {true, -ENOSYS} and just returning false actually starts to
matter.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op
  2022-12-12 14:51   ` Paul Durrant
@ 2022-12-13  0:10     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  0:10 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 459 bytes --]

On Mon, 2022-12-12 at 14:51 +0000, Paul Durrant wrote:
> Again, should this patch be deferred until we actually implement 
> something useful here? I.e. folding it into the subsequent patch? It's 
> not like the boilerplate is massive.

That's how Joao did it; it seems sane enough to do bite-sized pieces so
I didn't change it. As I've been through and rebased/reworked it all
about ten times so far, it's been useful for it to be in small chunks.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support
  2022-12-12 17:30   ` Paolo Bonzini
  2022-12-12 17:55     ` Paul Durrant
@ 2022-12-13  0:13     ` David Woodhouse
  2023-01-17 13:49     ` David Woodhouse
  2 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  0:13 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 19744 bytes --]

On Mon, 2022-12-12 at 18:30 +0100, Paolo Bonzini wrote:
> On 12/9/22 10:55, David Woodhouse wrote:
> > -    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> > +    if (xen_enabled())
> > +            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> > +    else
> > +            m->default_machine_opts = "accel=kvm,xen-version=0x30001";
> 
> Please do not modify pc_xen_hvm_init().
> 
> "-M xenfv" should be the same as "-M pc-i440fx-...,suppress-vmdesc=on 
> -accel xen -device xen-platform".  It must work *without* "-accel xen", 
> while here you're basically requiring it.  For now, please make 
> KVM-emulated Xen use "-M pc -device xen-platform".  We can figure out 
> "-M xenfv" later.
> 
> You can instead have:
> 
> - a check in xen_init() that checks that xen_mode is XEN_ATTACH.  If 
> not, fail.
> 
> - an extra enum value for xen_mode, XEN_DISABLED, which is the default 
> instead of XEN_EMULATE;
> 
> - an accelerator property "-accel kvm,xen-version=...", added in 
> kvm_accel_class_init() instead of the machine property.  The property, 
> when set to a nonzero value, flips xen_mode from XEN_DISABLED to 
> XEN_EMULATE.
> 
> The Xen overlay device can be created using the mc->kvm_type function 
> (which you can set in DEFINE_PC_MACHINE); at that point, xen_mode has 
> switched from XEN_DISABLED to XEN_EMULATE.  Those xen_enabled() checks 
> that apply to KVM then become xen_mode != XEN_DISABLED, as long as they 
> run during mc->kvm_type or afterwards.
> 
> The platform device can be created either in mc->kvm_type or manually 
> (not sure if it makes sense to have a "XenVMMXenVMM" CPUID + emulated 
> hypercalls but no platform device---would it still use pvclock for 
> example?).

That works; thanks. I won't spam the list with another round just yet,
but have pushed it to
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv

The guest now correctly panics because I haven't implemented event
channel hypercalls yet (got to fix up a bit more of the 32-bit compat
first, and some other parts of Paul's feedback I haven't yet got to).

$ ./build/qemu-system-x86_64 -serial mon:stdio -M pc -accel kvm,xen-version=0x4000a  -cpu host  -display none -kernel vmlinuz-5.17.8-200.fc35.x86_64 -append "console=ttyS0 earlyprintk=ttyS0 panic=10000" --trace "kvm_xen*" -d unimp -m 1G -smp 4 -device xen-platform
Probing EDD (edd=off to disable)... ok
[    0.000000] Linux version 5.17.8-200.fc35.x86_64 (mockbuild@bkernel02.iad2.fedoraproject.org) (gcc (GCC) 11.3.1 20220421 (Red Hat 11.3.1-2), GNU ld version 2.37-17.fc35) #1 SMP PREEMPT Mon May 16 01:01:02 UTC 2022
[    0.000000] Command line: console=ttyS0 earlyprintk=ttyS0 panic=10000
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[    0.000000] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[    0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
[    0.000000] signal: max sigframe size: 2032
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] printk: bootconsole [earlyser0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] extended physical RAM map:
[    0.000000] reserve setup_data: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] reserve setup_data: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] reserve setup_data: [mem 0x0000000000100000-0x0000000000bf98ef] usable
[    0.000000] reserve setup_data: [mem 0x0000000000bf98f0-0x0000000000bf991f] usable
[    0.000000] reserve setup_data: [mem 0x0000000000bf9920-0x000000003ffdffff] usable
[    0.000000] reserve setup_data: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] reserve setup_data: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[    0.000000] Hypervisor detected: Xen HVM
[    0.000000] Xen version 4.10.
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 17 a0 0x6 a1 0xffffffffb8e03e70 a2 0x0 ret 0x0
kvm_xen_set_shared_info shared info at gfn 0x10
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 12 a0 0x7 a1 0xffffffffb8e03e60 a2 0x8000000000000163 ret 0x0
kvm_xen_set_vcpu_attr vcpu attr cpu 1 type 0 gpa 0x10040
kvm_xen_set_vcpu_attr vcpu attr cpu 2 type 0 gpa 0x10080
kvm_xen_set_vcpu_attr vcpu attr cpu 0 type 0 gpa 0x10000
kvm_xen_set_vcpu_attr vcpu attr cpu 3 type 0 gpa 0x100c0
[    0.000000] platform_pci_unplug: Netfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated NICs.
[    0.000000] platform_pci_unplug: Blkfront and the Xen platform PCI driver have been compiled for this kernel: unplug emulated disks.
[    0.000000] You might have to change the root device
[    0.000000] from /dev/hd[a-d] to /dev/xvd[a-d]
[    0.000000] in your root= kernel command line option
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 34 a0 0x9 a1 0xffffffffb8e03e68 a2 0x2 ret 0xffffffffffffffda
[    0.021222] tsc: Fast TSC calibration using PIT
[    0.022715] tsc: Detected 2112.208 MHz processor
[    0.024194] tsc: Detected 2112.000 MHz TSC
[    0.027112] last_pfn = 0x3ffe0 max_arch_pfn = 0x400000000
[    0.028983] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
Memory KASLR using RDRAND RDTSC...
[    0.040593] found SMP MP-table at [mem 0x000f5bf0-0x000f5bff]
[    0.042372] Using GB pages for direct mapping
[    0.044274] ACPI: Early table checksum verification disabled
[    0.045818] ACPI: RSDP 0x00000000000F5A10 000014 (v00 BOCHS )
[    0.047436] ACPI: RSDT 0x000000003FFE1BDD 000034 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.049981] ACPI: FACP 0x000000003FFE1A79 000074 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.052664] ACPI: DSDT 0x000000003FFE0040 001A39 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.055603] ACPI: FACS 0x000000003FFE0000 000040
[    0.057391] ACPI: APIC 0x000000003FFE1AED 000090 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.060233] ACPI: HPET 0x000000003FFE1B7D 000038 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.063058] ACPI: WAET 0x000000003FFE1BB5 000028 (v01 BOCHS  BXPC     00000001 BXPC 00000001)
[    0.065812] ACPI: Reserving FACP table memory at [mem 0x3ffe1a79-0x3ffe1aec]
[    0.068290] ACPI: Reserving DSDT table memory at [mem 0x3ffe0040-0x3ffe1a78]
[    0.070623] ACPI: Reserving FACS table memory at [mem 0x3ffe0000-0x3ffe003f]
[    0.072976] ACPI: Reserving APIC table memory at [mem 0x3ffe1aed-0x3ffe1b7c]
[    0.075241] ACPI: Reserving HPET table memory at [mem 0x3ffe1b7d-0x3ffe1bb4]
[    0.077313] ACPI: Reserving WAET table memory at [mem 0x3ffe1bb5-0x3ffe1bdc]
[    0.080178] No NUMA configuration found
[    0.081189] Faking a node at [mem 0x0000000000000000-0x000000003ffdffff]
[    0.083037] NODE_DATA(0) allocated [mem 0x3ffb5000-0x3ffdffff]
[    0.158885] Zone ranges:
[    0.159679]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.161442]   DMA32    [mem 0x0000000001000000-0x000000003ffdffff]
[    0.163104]   Normal   empty
[    0.163924]   Device   empty
[    0.164732] Movable zone start for each node
[    0.165899] Early memory node ranges
[    0.166843]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.168562]   node   0: [mem 0x0000000000100000-0x000000003ffdffff]
[    0.170032] Initmem setup node 0 [mem 0x0000000000001000-0x000000003ffdffff]
[    0.171611] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.171668] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.175214] On node 0, zone DMA32: 32 pages in unavailable ranges
[    0.177220] ACPI: PM-Timer IO Port: 0x608
[    0.179928] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.181520] IOAPIC[0]: apic_id 0, version 17, address 0xfec00000, GSI 0-23
[    0.183448] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.185083] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.186633] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.188303] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.190365] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.192528] ACPI: Using ACPI (MADT) for SMP configuration information
[    0.194438] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.196053] TSC deadline timer available
[    0.197268] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.198877] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.201200] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[    0.203531] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff]
[    0.205839] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fffff]
[    0.208122] PM: hibernation: Registered nosave memory: [mem 0x00bf9000-0x00bf9fff]
[    0.210303] PM: hibernation: Registered nosave memory: [mem 0x00bf9000-0x00bf9fff]
[    0.212595] [mem 0x40000000-0xfeffbfff] available for PCI devices
[    0.214447] Booting paravirtualized kernel on Xen HVM
[    0.216050] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.226546] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:4 nr_cpu_ids:4 nr_node_ids:1
[    0.232545] percpu: Embedded 61 pages/cpu s212992 r8192 d28672 u524288
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 24 a0 0xa a1 0x0 a2 0xffffffffb8e03ed0 ret 0x0
kvm_xen_set_vcpu_attr vcpu attr cpu 0 type 0 gpa 0x3ec1e0e0
[    0.234134] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
[    0.235651] Fallback order for Node 0: 0 
[    0.236618] Built 1 zonelists, mobility grouping on.  Total pages: 257759
[    0.238432] Policy zone: DMA32
[    0.239373] Kernel command line: console=ttyS0 earlyprintk=ttyS0 panic=10000
[    0.242530] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[    0.245280] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.247676] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.253165] Memory: 978800K/1048056K available (16393K kernel code, 3607K rwdata, 10784K rodata, 2704K init, 6288K bss, 68996K reserved, 0K cma-reserved)
[    0.256519] random: get_random_u64 called from __kmem_cache_create+0x2a/0x530 with crng_init=0
[    0.256838] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
[    0.260445] Kernel/User page tables isolation: enabled
[    0.261691] ftrace: allocating 48342 entries in 189 pages
[    0.278046] ftrace: allocated 189 pages with 6 groups
[    0.280372] Dynamic Preempt: voluntary
[    0.281783] rcu: Preemptible hierarchical RCU implementation.
[    0.282905] rcu: 	RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=4.
[    0.284217] 	Trampoline variant of Tasks RCU enabled.
[    0.285323] 	Rude variant of Tasks RCU enabled.
[    0.286304] 	Tracing variant of Tasks RCU enabled.
[    0.287385] rcu: RCU calculated value of scheduler-enlistment delay is 100 jiffies.
[    0.289049] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.297553] NR_IRQS: 524544, nr_irqs: 456, preallocated irqs: 16
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 32 a0 0xb a1 0xffffffffb8e03e98 a2 0xffff9316bec00000 ret 0xffffffffffffffda
[    0.298817] xen:events: Using 2-level ABI
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 34 a0 0x0 a1 0xffffffffb8e03ec8 a2 0x100 ret 0x0
[    0.299811] xen:events: Xen HVM callback vector for event delivery is enabled
[    0.301653] random: crng init done (trusting CPU's manufacturer)
[    0.318080] Console: colour VGA+ 80x25
Unimplemented Xen hypercall 34 (0x1 0xffffffffb8e03eb8 0xffffffffb8e03eb8)
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 34 a0 0x1 a1 0xffffffffb8e03eb8 a2 0x7ff0 ret 0xffffffffffffffda
[    0.319300] Cannot get hvm parameter CONSOLE_EVTCHN (18): -38!
[    0.321118] printk: console [ttyS0] enabled
[    0.321118] printk: console [ttyS0] enabled
[    0.323780] printk: bootconsole [earlyser0] disabled
[    0.323780] printk: bootconsole [earlyser0] disabled
[    0.326482] ACPI: Core revision 20211217
[    0.328222] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.330092] APIC: Switch to symmetric I/O mode setup
[    0.331324] x2apic enabled
[    0.332145] Switched APIC routing to physical x2apic.
[    0.334323] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.340113] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e71785e5dd, max_idle_ns: 440795244814 ns
[    0.341973] Calibrating delay loop (skipped), value calculated using timer frequency.. 4224.00 BogoMIPS (lpj=2112000)
[    0.342962] pid_max: default: 32768 minimum: 301
[    0.344027] LSM: Security Framework initializing
[    0.344995] Yama: becoming mindful.
[    0.345971] SELinux:  Initializing.
[    0.347003] LSM support for eBPF active
[    0.347710] landlock: Up and running.
[    0.348070] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
[    0.348982] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
Poking KASLR using RDRAND RDTSC...
[    0.352097] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.353243] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.353963] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.355022] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.355969] Spectre V2 : Mitigation: Retpolines
[    0.356962] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.357965] Spectre V2 : Enabling Restricted Speculation for firmware calls
[    0.358972] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.359962] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl
[    0.360984] SRBDS: Unknown: Dependent on hypervisor status
[    0.361966] MDS: Mitigation: Clear CPU buffers
[    0.370697] Freeing SMP alternatives memory: 44K
[    0.371365] clocksource: xen: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
Unimplemented Xen hypercall 24 (0x7 0x0 0x0)
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 24 a0 0x7 a1 0x0 a2 0x0 ret 0xffffffffffffffda
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 24 a0 0xd a1 0x0 a2 0xffffa734c0013e58 ret 0x0
kvm_xen_set_vcpu_attr vcpu attr cpu 0 type 1 gpa 0x123e000
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 24 a0 0x5 a1 0x0 a2 0xffffa734c0013e48 ret 0x0
kvm_xen_set_vcpu_attr vcpu attr cpu 0 type 2 gpa 0x3ec2ec40
[    0.373086] installing Xen timer for CPU 0
Unimplemented Xen hypercall 32 (0x1 0xffffa734c0013da4 0xffffa734c0013da4)
kvm_xen_hypercall xen_hypercall: cpu 0 cpl 0 input 32 a0 0x1 a1 0xffffa734c0013da4 a2 0x1 ret 0xffffffffffffffda
[    0.374108] ------------[ cut here ]------------
[    0.374962] kernel BUG at drivers/xen/events/events_base.c:1397!
[    0.375979] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[    0.376960] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.8-200.fc35.x86_64 #1
[    0.376960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[    0.376960] RIP: 0010:bind_virq_to_irq+0x1b7/0x2d0
[    0.376960] Code: c7 c2 e0 f1 15 b7 48 c7 c6 20 05 18 b9 44 89 ff e8 ee 5a 98 ff e9 50 ff ff ff 45 31 db 83 f8 ef 74 41 85 d2 0f 89 77 ff ff ff <0f> 0b 44 89 ff 44 89 5c 24 08 e8 fa 4a 98 ff 44 8b 5c 24 08 48 85
[    0.376960] RSP: 0000:ffffa734c0013d80 EFLAGS: 00010282
[    0.376960] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000001e120
[    0.376960] RDX: 00000000ffffffda RSI: ffffa734c0013da4 RDI: 0000000000000001
[    0.376960] RBP: 0000000000000000 R08: ffff9316811e4200 R09: 0000000000000000
[    0.376960] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000002ed20
[    0.376960] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000018
[    0.376960] FS:  0000000000000000(0000) GS:ffff9316bec00000(0000) knlGS:0000000000000000
[    0.376960] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.376960] CR2: ffff93168f001000 CR3: 000000000de10001 CR4: 0000000000370ef0
[    0.376960] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.376960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.376960] Call Trace:
[    0.376960]  <TASK>
[    0.376960]  ? xen_timerop_shutdown+0x10/0x10
[    0.376960]  bind_virq_to_irqhandler+0x28/0x90
[    0.376960]  xen_setup_timer.cold+0x47/0xc5
[    0.376960]  xen_time_init+0x173/0x1a4
[    0.376960]  native_smp_prepare_cpus+0xdd/0x17e
[    0.376960]  xen_hvm_smp_prepare_cpus+0xc/0x67
[    0.376960]  kernel_init_freeable+0xe8/0x251
[    0.376960]  ? rest_init+0xd0/0xd0
[    0.376960]  kernel_init+0x16/0x130
[    0.376960]  ret_from_fork+0x22/0x30
[    0.376960]  </TASK>
[    0.376960] Modules linked in:
[    0.376965] ---[ end trace 0000000000000000 ]---
[    0.377964] RIP: 0010:bind_virq_to_irq+0x1b7/0x2d0
[    0.378874] Code: c7 c2 e0 f1 15 b7 48 c7 c6 20 05 18 b9 44 89 ff e8 ee 5a 98 ff e9 50 ff ff ff 45 31 db 83 f8 ef 74 41 85 d2 0f 89 77 ff ff ff <0f> 0b 44 89 ff 44 89 5c 24 08 e8 fa 4a 98 ff 44 8b 5c 24 08 48 85
[    0.378965] RSP: 0000:ffffa734c0013d80 EFLAGS: 00010282
[    0.379962] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000001e120
[    0.380965] RDX: 00000000ffffffda RSI: ffffa734c0013da4 RDI: 0000000000000001
[    0.381962] RBP: 0000000000000000 R08: ffff9316811e4200 R09: 0000000000000000
[    0.382964] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000002ed20
[    0.383963] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000018
[    0.384964] FS:  0000000000000000(0000) GS:ffff9316bec00000(0000) knlGS:0000000000000000
[    0.385963] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.386962] CR2: ffff93168f001000 CR3: 000000000de10001 CR4: 0000000000370ef0
[    0.387964] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.388962] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    0.389975] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    0.390960] Rebooting in 10000 seconds..

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info
  2022-12-12 14:58   ` Paul Durrant
@ 2022-12-13  0:13     ` David Woodhouse
  2022-12-14 10:28       ` Paul Durrant
  0 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  0:13 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 2561 bytes --]

On Mon, 2022-12-12 at 14:58 +0000, Paul Durrant wrote:
> On 09/12/2022 09:56, David Woodhouse wrote:
> > From: Joao Martins <
> > joao.m.martins@oracle.com
> > >
> > 
> > Handle the hypercall to set a per vcpu info, and also wire up the
> > default
> > vcpu_info in the shared_info page for the first 32 vCPUs.
> > 
> > To avoid deadlock within KVM a vCPU thread must set its *own*
> > vcpu_info
> > rather than it being set from the context in which the hypercall is
> > invoked.
> > 
> > Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for
> > migration,
> > and restore it in kvm_arch_put_registers() appropriately.
> > 
> > Signed-off-by: Joao Martins <
> > joao.m.martins@oracle.com
> > >
> > Signed-off-by: David Woodhouse <
> > dwmw@amazon.co.uk
> > >
> > ---
> >   target/i386/cpu.h        |  2 ++
> >   target/i386/kvm/kvm.c    | 19 +++++++++++
> >   target/i386/machine.c    | 21 ++++++++++++
> >   target/i386/trace-events |  1 +
> >   target/i386/xen.c        | 74
> > +++++++++++++++++++++++++++++++++++++---
> >   target/i386/xen.h        |  1 +
> >   6 files changed, 113 insertions(+), 5 deletions(-)
> > 
> > diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> > index c6c57baed5..109b2e5669 100644
> > --- a/target/i386/cpu.h
> > +++ b/target/i386/cpu.h
> > @@ -1788,6 +1788,8 @@ typedef struct CPUArchState {
> >   #endif
> >   #if defined(CONFIG_KVM)
> >       struct kvm_nested_state *nested_state;
> > +    uint64_t xen_vcpu_info_gpa;
> > +    uint64_t xen_vcpu_info_default_gpa;
> >   #endif
> >   #if defined(CONFIG_HVF)
> >       HVFX86LazyFlags hvf_lflags;
> > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > index ebde6bc204..fa45e2f99a 100644
> > --- a/target/i386/kvm/kvm.c
> > +++ b/target/i386/kvm/kvm.c
> > @@ -1811,6 +1811,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >           has_msr_hv_hypercall = true;
> >       }
> >   
> > +    env->xen_vcpu_info_gpa = UINT64_MAX;
> > +    env->xen_vcpu_info_default_gpa = UINT64_MAX;
> 
> 
> There was an INVALID_GPA definition for shared info. Looks like we
> could use it here too.

There was, and I started trying to use it, but it fell foul of the "is
this going to live in target/ or hw/ and who can include what from
where?" and I decided to just use UINT64_MAX for now and keep typing.

That will work out in the end, I'm sure.

> Some sanity checks wouldn't go a miss here...
> 
> rvi.offset should:
> a) be < TARGET_PAGE_SIZE, and
> b) ba aligned to vcpu_info_t size

Ack.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-12 22:22     ` David Woodhouse
@ 2022-12-13  0:39       ` Paolo Bonzini
  2022-12-13  0:59         ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2022-12-13  0:39 UTC (permalink / raw)
  To: David Woodhouse
  Cc: qemu-devel, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 2142 bytes --]

Il lun 12 dic 2022, 23:23 David Woodhouse <dwmw2@infradead.org> ha scritto:

> On Mon, 2022-12-12 at 18:07 +0100, Paolo Bonzini wrote:
> > On 12/9/22 10:55, David Woodhouse wrote:
> > >   config KVM
> > >       bool
> > > +    imply XEN_EMU if (I386 || X86_64)
> >
> > No need for the "imply", just make it "default y" below and it will have
> > the same effect.
> >
> > >
> > > diff --git a/target/Kconfig b/target/Kconfig
> > > index 83da0bd293..e19c9d77b5 100644
> > > --- a/target/Kconfig
> > > +++ b/target/Kconfig
> > > @@ -18,3 +18,7 @@ source sh4/Kconfig
> > >  source sparc/Kconfig
> > >  source tricore/Kconfig
> > >  source xtensa/Kconfig
> > > +
> > > +config XEN_EMU
> > > +    bool
> > > +    depends on KVM && (I386 || X86_64)
> >
> > Please place it in hw/xen/Kconfig.
>
> Perhaps I misunderstand, but I'm not sure that is consistent with what
> Philippe was asking for in
>
> https://lore.kernel.org/qemu-devel/d203e13d-e2f9-5816-030d-c1449bde364d@linaro.org/
> specifically:
>
> >> I rather have hw/ and target/ features disentangled, so I'd use
> >> CONFIG_XEN_EMU under target/ and CONFIG_XENFV_MACHINE under hw/,
> >> eventually having CONFIG_XEN_EMU depending on CONFIG_XENFV_MACHINE
> >> and -- for now -- CONFIG_KVM.
>

However the dependency of the xenfv machine is misguided. In principle
there is no reason to depend on KVM as opposed to TCG, too, which is why I
didn't suggest hw/i386/kvm for either the devices or the Kconfig file.

The idea there seems to be that XEN_EMU is a *target* feature since it
> covers the support in target/i386/kvm.
>
> But yes, it *also* covers the devices I'm adding to hw/i386/kvm. Do I
> want a *separate* config symbol for that? Or just make those depend on
> XENFV_MACHINE && XEN_EMU ?
>
> I'll move XEN_EMU to hw/i386/Kconfig for now, thereby doing what
> *neither* of you said (I don't think hw/xen/Kconfig is the best choice
> when the *code* it enables is under hw/i386/kvm?)
>

Yeah there is no especially better match. I guess hw/i386 is fine, since
there will be code in mc->kvm_type. It would be either there or in
target/i386/Kconfig, but not target/Kconfig.

Paolo

>

[-- Attachment #2: Type: text/html, Size: 3418 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-13  0:39       ` Paolo Bonzini
@ 2022-12-13  0:59         ` David Woodhouse
  2022-12-13 22:32           ` Paolo Bonzini
  0 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  0:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 3779 bytes --]

On Tue, 2022-12-13 at 01:39 +0100, Paolo Bonzini wrote:
> 
> 
> Il lun 12 dic 2022, 23:23 David Woodhouse <dwmw2@infradead.org> ha
> scritto:
> > On Mon, 2022-12-12 at 18:07 +0100, Paolo Bonzini wrote:
> > > On 12/9/22 10:55, David Woodhouse wrote:
> > > >   config KVM
> > > >       bool
> > > > +    imply XEN_EMU if (I386 || X86_64)
> > > 
> > > No need for the "imply", just make it "default y" below and it
> > will have 
> > > the same effect.
> > > 
> > > > 
> > > > diff --git a/target/Kconfig b/target/Kconfig
> > > > index 83da0bd293..e19c9d77b5 100644
> > > > --- a/target/Kconfig
> > > > +++ b/target/Kconfig
> > > > @@ -18,3 +18,7 @@ source sh4/Kconfig
> > > >  source sparc/Kconfig
> > > >  source tricore/Kconfig
> > > >  source xtensa/Kconfig
> > > > +
> > > > +config XEN_EMU
> > > > +    bool
> > > > +    depends on KVM && (I386 || X86_64)
> > > 
> > > Please place it in hw/xen/Kconfig.
> > 
> > Perhaps I misunderstand, but I'm not sure that is consistent with
> > what
> > Philippe was asking for in  
> > https://lore.kernel.org/qemu-devel/d203e13d-e2f9-5816-030d-c1449bde364d@linaro.org/
> > specifically:
> > 
> > >> I rather have hw/ and target/ features disentangled, so I'd use
> > >> CONFIG_XEN_EMU under target/ and CONFIG_XENFV_MACHINE under hw/,
> > >> eventually having CONFIG_XEN_EMU depending on
> > CONFIG_XENFV_MACHINE
> > >> and -- for now -- CONFIG_KVM.
> 
> 
> However the dependency of the xenfv machine is misguided. In
> principle there is no reason to depend on KVM as opposed to TCG, too,
> which is why I didn't suggest hw/i386/kvm for either the devices or
> the Kconfig file.
> 

That was my initial thought too.

But those devices are there primarily as hooks to save/restore state,
and that means they want to actually program that state back into KVM
on restore; they *have* ended up using KVM directly, which is why I
moved them into hw/i386/kvm.

Contriving some pretence at a "generic" way for the target to set those
features seemed a bit like overengineering.

Supporting TCG (at least if we want to run on non-x86 hosts) isn't just
a case of reimplementing the bits that are already done in-kernel,
either. The structure layouts may differ too (it's bad enough between
i686 and x86_64 with the alignment of uint64_t). So I just don't see
TCG support happening any time soon.

> > The idea there seems to be that XEN_EMU is a *target* feature since
> > it covers the support in target/i386/kvm.
> > 
> > But yes, it *also* covers the devices I'm adding to hw/i386/kvm. Do
> > I want a *separate* config symbol for that? Or just make those
> > depend on XENFV_MACHINE && XEN_EMU ? 
> > 
> > I'll move XEN_EMU to hw/i386/Kconfig for now, thereby doing what
> > *neither* of you said (I don't think hw/xen/Kconfig is the best
> > choice when the *code* it enables is under hw/i386/kvm?)
> 
> 
> Yeah there is no especially better match. I guess hw/i386 is fine,
> since there will be code in mc->kvm_type. It would be either there or
> in target/i386/Kconfig, but not target/Kconfig.

I put pc_machine_kvm_type into hw/i386/pc.c and made DEFINE_PC_MACHINE
add it unconditionally. Then it registers those devices if
xen_mode==XEN_EMULATE. Which is *almost* pretending to be generic,
apart from the hook being KVM-specific.

There was *also* a call to xen_emulated_machine_init() added to
pc_init1() by the 'pc_piix: handle XEN_EMULATE backend init' patch.
I've dropped that for now; once we are ready to hook up the xenbus and
PV drivers, that seems like it can go into mc->kvm_type too. Or maybe I
should have kept that, and initialised the overlay and evtchn devices
from xen_emulated_machine_init() instead of mc->kvm_type() ?



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves
  2022-12-12 13:13   ` Paul Durrant
@ 2022-12-13  9:47     ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-13  9:47 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 582 bytes --]

On Mon, 2022-12-12 at 13:13 +0000, Paul Durrant wrote:
> Ok, this tells me that we are intending to handle Hyper-V enlightenments 
> being simultaneously enabled... in which case that MSR above needs to 
> move, along with the cpuid leaves. It should be 0x40000200 in this case.

Ah, yes. That gets entertaining because hyperv_enabled() doesn't work
at the time kvm_arch_init() runs, because kvm_hyperv_expand_features()
hasn't run yet.

But kvm_arch_init() is idempotent so for now I'll just call it *again*
from kvm_arch_init_vcpu() to move the MSR in the hyperv case.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-13  0:59         ` David Woodhouse
@ 2022-12-13 22:32           ` Paolo Bonzini
  2022-12-16  8:40             ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paolo Bonzini @ 2022-12-13 22:32 UTC (permalink / raw)
  To: David Woodhouse
  Cc: qemu-devel, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 662 bytes --]

Il mar 13 dic 2022, 01:59 David Woodhouse <dwmw2@infradead.org> ha scritto:

> There was *also* a call to xen_emulated_machine_init() added to
> pc_init1() by the 'pc_piix: handle XEN_EMULATE backend init' patch.
> I've dropped that for now; once we are ready to hook up the xenbus and
> PV drivers, that seems like it can go into mc->kvm_type too. Or maybe I
> should have kept that, and initialised the overlay and evtchn devices
> from xen_emulated_machine_init() instead of mc->kvm_type() ?
>

I think that would be too early. mc->kvm_type, which really should be
called mc->accel_created, is the first place where you can look at xen_mode.

Paolo

Paolo

>

[-- Attachment #2: Type: text/html, Size: 1285 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info
  2022-12-13  0:13     ` David Woodhouse
@ 2022-12-14 10:28       ` Paul Durrant
  2022-12-14 11:04         ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-14 10:28 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 13/12/2022 00:13, David Woodhouse wrote:
> On Mon, 2022-12-12 at 14:58 +0000, Paul Durrant wrote:
>> On 09/12/2022 09:56, David Woodhouse wrote:
>>> From: Joao Martins <
>>> joao.m.martins@oracle.com
>>>>
>>>
>>> Handle the hypercall to set a per vcpu info, and also wire up the
>>> default
>>> vcpu_info in the shared_info page for the first 32 vCPUs.
>>>
>>> To avoid deadlock within KVM a vCPU thread must set its *own*
>>> vcpu_info
>>> rather than it being set from the context in which the hypercall is
>>> invoked.
>>>
>>> Add the vcpu_info (and default) GPA to the vmstate_x86_cpu for
>>> migration,
>>> and restore it in kvm_arch_put_registers() appropriately.
>>>
>>> Signed-off-by: Joao Martins <
>>> joao.m.martins@oracle.com
>>>>
>>> Signed-off-by: David Woodhouse <
>>> dwmw@amazon.co.uk
>>>>
>>> ---
>>>    target/i386/cpu.h        |  2 ++
>>>    target/i386/kvm/kvm.c    | 19 +++++++++++
>>>    target/i386/machine.c    | 21 ++++++++++++
>>>    target/i386/trace-events |  1 +
>>>    target/i386/xen.c        | 74
>>> +++++++++++++++++++++++++++++++++++++---
>>>    target/i386/xen.h        |  1 +
>>>    6 files changed, 113 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>>> index c6c57baed5..109b2e5669 100644
>>> --- a/target/i386/cpu.h
>>> +++ b/target/i386/cpu.h
>>> @@ -1788,6 +1788,8 @@ typedef struct CPUArchState {
>>>    #endif
>>>    #if defined(CONFIG_KVM)
>>>        struct kvm_nested_state *nested_state;
>>> +    uint64_t xen_vcpu_info_gpa;
>>> +    uint64_t xen_vcpu_info_default_gpa;
>>>    #endif
>>>    #if defined(CONFIG_HVF)
>>>        HVFX86LazyFlags hvf_lflags;
>>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>>> index ebde6bc204..fa45e2f99a 100644
>>> --- a/target/i386/kvm/kvm.c
>>> +++ b/target/i386/kvm/kvm.c
>>> @@ -1811,6 +1811,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>>            has_msr_hv_hypercall = true;
>>>        }
>>>    
>>> +    env->xen_vcpu_info_gpa = UINT64_MAX;
>>> +    env->xen_vcpu_info_default_gpa = UINT64_MAX;
>>
>>
>> There was an INVALID_GPA definition for shared info. Looks like we
>> could use it here too.
> 
> There was, and I started trying to use it, but it fell foul of the "is
> this going to live in target/ or hw/ and who can include what from
> where?" and I decided to just use UINT64_MAX for now and keep typing.
> 
> That will work out in the end, I'm sure.

Hopefully 
https://lore.kernel.org/lkml/20221209023622.274715-1-yu.c.zhang@linux.intel.com/ 
will help.

> 
>> Some sanity checks wouldn't go a miss here...
>>
>> rvi.offset should:
>> a) be < TARGET_PAGE_SIZE, and
>> b) ba aligned to vcpu_info_t size
> 
> Ack.



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info
  2022-12-14 10:28       ` Paul Durrant
@ 2022-12-14 11:04         ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-14 11:04 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 1693 bytes --]

On Wed, 2022-12-14 at 10:28 +0000, Paul Durrant wrote:
> On 13/12/2022 00:13, David Woodhouse wrote:
> > On Mon, 2022-12-12 at 14:58 +0000, Paul Durrant wrote:
> > > On 09/12/2022 09:56, David Woodhouse wrote:
> > > > 
> > > > @@ -1811,6 +1811,9 @@ int kvm_arch_init_vcpu(CPUState *cs)
> > > >            has_msr_hv_hypercall = true;
> > > >        }
> > > >    
> > > > +    env->xen_vcpu_info_gpa = UINT64_MAX;
> > > > +    env->xen_vcpu_info_default_gpa = UINT64_MAX;
> > > 
> > > 
> > > There was an INVALID_GPA definition for shared info. Looks like we
> > > could use it here too.
> > 
> > There was, and I started trying to use it, but it fell foul of the "is
> > this going to live in target/ or hw/ and who can include what from
> > where?" and I decided to just use UINT64_MAX for now and keep typing.
> > 
> > That will work out in the end, I'm sure.
> 
> Hopefully 
> https://lore.kernel.org/lkml/20221209023622.274715-1-yu.c.zhang@linux.intel.com/
>  
> will help.

Those are kernel-internal; not in uapi headers. Although maybe they
*should* be uapi, at least for the KVM/Xen support because they are
actually part of the userspace ABI.

The kernel returns GFN_INVALID when queried about the shared info page
if it isn't set, or GPA_INVALID when queried about vcpu_info etc.

(Those are the same numerically but semantically subtly different, and
it hurts my brain that GFN_INVALID != GPA_INVALID >> PAGE_SHIFT.)

Userspace can also *set* those fields to Gxx_INVALID. Unlike the Xen
APIs which don't allow them to be turned off, we implement
SHUTDOWN_soft_reset in the userspace VMM so it needs to be able to turn
the shinfo areas off.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-12 16:39       ` Paul Durrant
@ 2022-12-15 20:54         ` David Woodhouse
  2022-12-20 13:56           ` Paul Durrant
  0 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-15 20:54 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 16700 bytes --]

On Mon, 2022-12-12 at 16:39 +0000, Paul Durrant wrote:
> On 12/12/2022 16:26, David Woodhouse wrote:
> > On Mon, 2022-12-12 at 16:16 +0000, Paul Durrant wrote:
> > > On 09/12/2022 09:56, David Woodhouse wrote:
> > > > From: Ankur Arora <ankur.a.arora@oracle.com>
> > > > The HVM_PARAM_CALLBACK_IRQ parameter controls the system-wide event
> > > > channel upcall method.  The vector support is handled by KVM internally,
> > > > when the evtchn_upcall_pending field in the vcpu_info is set.
> > > > The GSI and PCI_INTX delivery methods are not supported. yet; those
> > > > need to simulate a level-triggered event on the I/OAPIC.
> > > 
> > > That's gonna be somewhat limiting if anyone runs a Windows guest with
> > > upcall vector support turned off... which is an option at:
> > > 
> > > https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenbus.git;a=blob;f=src/xenbus/evtchn.c;;hb=HEAD#l1928
> > > 
> > 
> > Sure. And as you know, I also added the 'xen_no_vector_callback' option
> > to the Linux command line to allow for that mode to be tested with
> > Linux too: 
> > https://git.kernel.org/torvalds/c/b36b0fe96a
> > 
> > 
> > The GSI and PCI_INTX modes will be added in time, but not yet.
> 
> Ok, but maybe worth calling out the limitation in the commit comment for 
> those wishing to kick the tyres.

Hm... this isn't as simple in QEMU as I hoped.

The way I want to handle it is like the way that VFIO eventfd pairs
work for level-triggered interrupts: the first eventfd is triggered on
a rising edge, and the other one is a 'resampler' which is triggered on
EOI, and causes the first one to be retriggered if the level is still
actually high.

However... unlike the kernel and another VMM that you and I are
familiar with, QEMU doesn't actually hook that up to the EOI in the
APIC/IOAPIC at all.

Instead, when VFIO devices assert a level-triggered interrupt, QEMU
*unmaps* that device's BARs from the guest so it can trap-and-emulate
them, and each MMIO read or write will also trigger the resampler
(whether that line is currently masked in the APIC or not).

I suppose we could try making the page with the vcpu_info as read-only, 
and trapping access to that so we spot when the guest clears its own
->evtchn_upcall_pending flag? That seems overly complex though.

So I've resorted to doing what Xen itself does: just poll the flag on
every vmexit. Patch is at the tip of my tree at 
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv
and below.

However, in the case of the in-kernel irqchip we might not even *get* a
vmexit all the way to userspace; can't a guest just get stuck in an
interrupt storm with it being handled entirely in-kernel? I might
decree that this works *only* with the split irqchip.

Then again, it'll work *nicely* in the kernel where the EOI
notification exists, so I might teach the kernel's evtchn code to
create those eventfd pairs like VFIO's, which can be hooked in as IRQFD
routing to the in-kernel {IOA,}PIC.

--------------------
From 15a91ff4833d07910abba1dec093e48580c2b4c4 Mon Sep 17 00:00:00 2001
From: David Woodhouse <dwmw@amazon.co.uk>
Subject: [PATCH] hw/xen: Support HVM_PARAM_CALLBACK_TYPE_GSI callback

The GSI callback (and later PCI_INTX) is a level triggered interrupt. It
is asserted when an event channel is delivered to vCPU0, and is supposed
to be cleared when the vcpu_info->evtchn_upcall_pending field for vCPU0
is cleared again.

Thankfully, Xen does *not* assert the GSI if the guest sets its own
evtchn_upcall_pending field; we only need to assert the GSI when we
have delivered an event for ourselves. So that's the easy part.

However, we *do* need to poll for the evtchn_upcall_pending flag being
cleared. In an ideal world we would poll that when the EOI happens on
the PIC/IOAPIC. That's how it works in the kernel with the VFIO eventfd
pairs — one is used to trigger the interrupt, and the other works in the
other direction to 'resample' on EOI, and trigger the first eventfd
again if the line is still active.

However, QEMU doesn't seem to do that. Even VFIO level interrupts seem
to be supported by temporarily unmapping the device's BARs from the
guest when an interrupt happens, then trapping *all* MMIO to the device
and sending the 'resample' event on *every* MMIO access until the IRQ
is cleared! Maybe in future we'll plumb the 'resample' concept through
QEMU's irq framework but for now we'll do what Xen itself does: just
check the flag on every vmexit if the upcall GSI is known to be
asserted.

This is barely tested; I did make it waggle IRQ1 up and down on *every*
delivery even through the vector method, and observed that it is indeed
visible to the guest (as lots of spurious keyboard interrupts). But if
the vector callback isn't in use, Linux guests won't actually use event
channels for anything except PV devices... which I haven't implemented
here yet, so it's hard to test delivery :)

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 hw/i386/kvm/xen_evtchn.c  | 88 +++++++++++++++++++++++++++++++++++----
 hw/i386/kvm/xen_evtchn.h  |  3 ++
 hw/i386/pc.c              | 10 ++++-
 include/sysemu/kvm_xen.h  |  2 +-
 target/i386/cpu.h         |  1 +
 target/i386/kvm/kvm.c     | 13 ++++++
 target/i386/kvm/xen-emu.c | 64 ++++++++++++++++++++--------
 target/i386/kvm/xen-emu.h |  1 +
 8 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/hw/i386/kvm/xen_evtchn.c b/hw/i386/kvm/xen_evtchn.c
index 9292602c09..8ea8cf550e 100644
--- a/hw/i386/kvm/xen_evtchn.c
+++ b/hw/i386/kvm/xen_evtchn.c
@@ -24,6 +24,8 @@
 
 #include "hw/sysbus.h"
 #include "hw/xen/xen.h"
+#include "hw/i386/x86.h"
+#include "hw/irq.h"
 
 #include "xen_evtchn.h"
 #include "xen_overlay.h"
@@ -102,6 +104,7 @@ struct XenEvtchnState {
     QemuMutex port_lock;
     uint32_t nr_ports;
     XenEvtchnPort port_table[EVTCHN_2L_NR_CHANNELS];
+    qemu_irq gsis[GSI_NUM_PINS];
 };
 
 struct XenEvtchnState *xen_evtchn_singleton;
@@ -166,9 +169,29 @@ static const TypeInfo xen_evtchn_info = {
 void xen_evtchn_create(void)
 {
     XenEvtchnState *s = XEN_EVTCHN(sysbus_create_simple(TYPE_XEN_EVTCHN, -1, NULL));
+    int i;
+
     xen_evtchn_singleton = s;
 
     qemu_mutex_init(&s->port_lock);
+
+    for (i = 0; i < GSI_NUM_PINS; i++) {
+        sysbus_init_irq(SYS_BUS_DEVICE(s), &s->gsis[i]);
+    }
+}
+
+void xen_evtchn_connect_gsis(qemu_irq *system_gsis)
+{
+    XenEvtchnState *s = xen_evtchn_singleton;
+    int i;
+
+    if (!s) {
+        return;
+    }
+
+    for (i = 0; i < GSI_NUM_PINS; i++) {
+        sysbus_connect_irq(SYS_BUS_DEVICE(s), i, system_gsis[i]);
+    }
 }
 
 static void xen_evtchn_register_types(void)
@@ -178,26 +201,75 @@ static void xen_evtchn_register_types(void)
 
 type_init(xen_evtchn_register_types)
 
-
 #define CALLBACK_VIA_TYPE_SHIFT       56
 
 int xen_evtchn_set_callback_param(uint64_t param)
 {
+    XenEvtchnState *s = xen_evtchn_singleton;
     int ret = -ENOSYS;
 
-    if (param >> CALLBACK_VIA_TYPE_SHIFT == HVM_PARAM_CALLBACK_TYPE_VECTOR) {
+    if (!s) {
+        return -ENOTSUP;
+    }
+
+    switch (param >> CALLBACK_VIA_TYPE_SHIFT) {
+    case HVM_PARAM_CALLBACK_TYPE_VECTOR: {
         struct kvm_xen_hvm_attr xa = {
             .type = KVM_XEN_ATTR_TYPE_UPCALL_VECTOR,
             .u.vector = (uint8_t)param,
         };
 
         ret = kvm_vm_ioctl(kvm_state, KVM_XEN_HVM_SET_ATTR, &xa);
-        if (!ret && xen_evtchn_singleton)
-            xen_evtchn_singleton->callback_param = param;
+        break;
+    }
+    case HVM_PARAM_CALLBACK_TYPE_GSI:
+        ret = 0;
+        break;
     }
+
+    if (!ret) {
+        s->callback_param = param;
+    }
+
     return ret;
 }
 
+static void xen_evtchn_set_callback_level(XenEvtchnState *s, int level)
+{
+    uint32_t param = (uint32_t)s->callback_param;
+
+    switch (s->callback_param >> CALLBACK_VIA_TYPE_SHIFT) {
+    case HVM_PARAM_CALLBACK_TYPE_GSI:
+        if (param < GSI_NUM_PINS) {
+            qemu_set_irq(s->gsis[param], level);
+        }
+        break;
+    }
+}
+
+static void inject_callback(XenEvtchnState *s, uint32_t vcpu)
+{
+    if (kvm_xen_inject_vcpu_callback_vector(vcpu, s->callback_param)) {
+        return;
+    }
+
+    /* GSI or PCI_INTX delivery is only for events on vCPU 0 */
+    if (vcpu) {
+        return;
+    }
+
+    xen_evtchn_set_callback_level(s, 1);
+}
+
+void xen_evtchn_deassert_callback(void)
+{
+    XenEvtchnState *s = xen_evtchn_singleton;
+
+    if (s) {
+        xen_evtchn_set_callback_level(s, 0);
+    }
+}
+
 static void deassign_kernel_port(evtchn_port_t port)
 {
     struct kvm_xen_hvm_attr ha;
@@ -359,7 +431,7 @@ static int do_unmask_port_lm(XenEvtchnState *s, evtchn_port_t port,
         return 0;
     }
 
-    kvm_xen_inject_vcpu_callback_vector(s->port_table[port].vcpu);
+    inject_callback(s, s->port_table[port].vcpu);
 
     return 0;
 }
@@ -413,7 +485,7 @@ static int do_unmask_port_compat(XenEvtchnState *s, evtchn_port_t port,
         return 0;
     }
 
-    kvm_xen_inject_vcpu_callback_vector(s->port_table[port].vcpu);
+    inject_callback(s, s->port_table[port].vcpu);
 
     return 0;
 }
@@ -481,7 +553,7 @@ static int do_set_port_lm(XenEvtchnState *s, evtchn_port_t port,
         return 0;
     }
 
-    kvm_xen_inject_vcpu_callback_vector(s->port_table[port].vcpu);
+    inject_callback(s, s->port_table[port].vcpu);
 
     return 0;
 }
@@ -524,7 +596,7 @@ static int do_set_port_compat(XenEvtchnState *s, evtchn_port_t port,
         return 0;
     }
 
-    kvm_xen_inject_vcpu_callback_vector(s->port_table[port].vcpu);
+    inject_callback(s, s->port_table[port].vcpu);
 
     return 0;
 }
diff --git a/hw/i386/kvm/xen_evtchn.h b/hw/i386/kvm/xen_evtchn.h
index 2acbaeabaa..1176b67b91 100644
--- a/hw/i386/kvm/xen_evtchn.h
+++ b/hw/i386/kvm/xen_evtchn.h
@@ -9,9 +9,12 @@
  * See the COPYING file in the top-level directory.
  */
 
+#include "hw/sysbus.h"
 
 void xen_evtchn_create(void);
 int xen_evtchn_set_callback_param(uint64_t param);
+void xen_evtchn_connect_gsis(qemu_irq *system_gsis);
+void xen_evtchn_deassert_callback(void);
 
 void hmp_xen_event_list(Monitor *mon, const QDict *qdict);
 void hmp_xen_event_inject(Monitor *mon, const QDict *qdict);
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 427f79e6a8..1c4941de8f 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1303,6 +1303,12 @@ void pc_basic_device_init(struct PCMachineState *pcms,
     }
     *rtc_state = mc146818_rtc_init(isa_bus, 2000, rtc_irq);
 
+#ifdef CONFIG_XEN_EMU
+    if (xen_mode == XEN_EMULATE) {
+        xen_evtchn_connect_gsis(gsi);
+    }
+#endif
+
     qemu_register_boot_set(pc_boot_set, *rtc_state);
 
     if (!xen_enabled() &&
@@ -1848,8 +1854,8 @@ int pc_machine_kvm_type(MachineState *machine, const char *kvm_type)
 {
 #ifdef CONFIG_XEN_EMU
     if (xen_mode == XEN_EMULATE) {
-            xen_overlay_create();
-            xen_evtchn_create();
+        xen_overlay_create();
+        xen_evtchn_create();
     }
 #endif
     return 0;
diff --git a/include/sysemu/kvm_xen.h b/include/sysemu/kvm_xen.h
index e5b14ffe8d..73fe5969b8 100644
--- a/include/sysemu/kvm_xen.h
+++ b/include/sysemu/kvm_xen.h
@@ -13,7 +13,7 @@
 #define QEMU_SYSEMU_KVM_XEN_H
 
 void *kvm_xen_get_vcpu_info_hva(uint32_t vcpu_id);
-void kvm_xen_inject_vcpu_callback_vector(uint32_t vcpu_id);
+bool kvm_xen_inject_vcpu_callback_vector(uint32_t vcpu_id, uint64_t callback_param);
 int kvm_xen_set_vcpu_virq(uint32_t vcpu_id, uint16_t virq, uint16_t port);
 
 #endif /* QEMU_SYSEMU_KVM_XEN_H */
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 846c738fd7..9330eb83fd 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1795,6 +1795,7 @@ typedef struct CPUArchState {
     uint64_t xen_vcpu_time_info_gpa;
     uint64_t xen_vcpu_runstate_gpa;
     uint8_t xen_vcpu_callback_vector;
+    bool xen_callback_asserted;
     uint16_t xen_virq[XEN_NR_VIRQS];
 #endif
 #if defined(CONFIG_HVF)
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index cbf41d6f81..32d808da37 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5431,6 +5431,19 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
     char str[256];
     KVMState *state;
 
+#ifdef CONFIG_XEN_EMU
+    /*
+     * If the callback is asserted as a GSI (or PCI INTx) then check if
+     * vcpu_info->evtchn_upcall_pending has been cleared, and deassert
+     * the callback IRQ if so. Ideally we could hook into the PIC/IOAPIC
+     * EOI and only resample then, exactly how the VFIO eventfd pairs
+     * are designed to work for level triggered interrupts.
+     */
+    if (cpu->env.xen_callback_asserted) {
+        kvm_xen_maybe_deassert_callback(cs);
+    }
+#endif
+
     switch (run->exit_reason) {
     case KVM_EXIT_HLT:
         DPRINTF("handle_hlt\n");
diff --git a/target/i386/kvm/xen-emu.c b/target/i386/kvm/xen-emu.c
index a8c953e3ca..48ae47809a 100644
--- a/target/i386/kvm/xen-emu.c
+++ b/target/i386/kvm/xen-emu.c
@@ -240,18 +240,11 @@ static void *gpa_to_hva(uint64_t gpa)
                                              mrs.offset_within_region);
 }
 
-void *kvm_xen_get_vcpu_info_hva(uint32_t vcpu_id)
+static void *vcpu_info_hva_from_cs(CPUState *cs)
 {
-    CPUState *cs = qemu_get_cpu(vcpu_id);
-    CPUX86State *env;
-    uint64_t gpa;
-
-    if (!cs) {
-        return NULL;
-    }
-    env = &X86_CPU(cs)->env;
+    CPUX86State *env = &X86_CPU(cs)->env;
+    uint64_t gpa = env->xen_vcpu_info_gpa;
 
-    gpa = env->xen_vcpu_info_gpa;
     if (gpa == UINT64_MAX)
         gpa = env->xen_vcpu_info_default_gpa;
     if (gpa == UINT64_MAX)
@@ -260,13 +253,38 @@ void *kvm_xen_get_vcpu_info_hva(uint32_t vcpu_id)
     return gpa_to_hva(gpa);
 }
 
-void kvm_xen_inject_vcpu_callback_vector(uint32_t vcpu_id)
+void *kvm_xen_get_vcpu_info_hva(uint32_t vcpu_id)
+{
+    CPUState *cs = qemu_get_cpu(vcpu_id);
+
+    if (!cs) {
+            return NULL;
+    }
+
+    return vcpu_info_hva_from_cs(cs);
+}
+
+void kvm_xen_maybe_deassert_callback(CPUState *cs)
+{
+    struct vcpu_info *vi = vcpu_info_hva_from_cs(cs);
+    if (!vi) {
+            return;
+    }
+
+    /* If the evtchn_upcall_pending flag is cleared, turn the GSI off. */
+    if (!vi->evtchn_upcall_pending) {
+        X86_CPU(cs)->env.xen_callback_asserted = false;
+        xen_evtchn_deassert_callback();
+    }
+}
+
+bool kvm_xen_inject_vcpu_callback_vector(uint32_t vcpu_id, uint64_t callback_param)
 {
     CPUState *cs = qemu_get_cpu(vcpu_id);
     uint8_t vector;
 
     if (!cs) {
-        return;
+        return false;
     }
     vector = X86_CPU(cs)->env.xen_vcpu_callback_vector;
 
@@ -278,13 +296,25 @@ void kvm_xen_inject_vcpu_callback_vector(uint32_t vcpu_id)
             .data = vector | (1UL << MSI_DATA_LEVEL_SHIFT),
         };
         kvm_irqchip_send_msi(kvm_state, msg);
-        return;
+        return true;
     }
 
-    /* If the evtchn_upcall_pending field in the vcpu_info is set, then
-     * KVM will automatically deliver the vector on entering the vCPU
-     * so all we have to do is kick it out. */
-    qemu_cpu_kick(cs);
+    switch(callback_param >> 56) {
+    case HVM_PARAM_CALLBACK_TYPE_VECTOR:
+            /* If the evtchn_upcall_pending field in the vcpu_info is set, then
+             * KVM will automatically deliver the vector on entering the vCPU
+             * so all we have to do is kick it out. */
+            qemu_cpu_kick(cs);
+            return true;
+
+    case HVM_PARAM_CALLBACK_TYPE_GSI:
+    case HVM_PARAM_CALLBACK_TYPE_PCI_INTX:
+            if (vcpu_id == 0) {
+                X86_CPU(cs)->env.xen_callback_asserted = true;
+            }
+            return false;
+    }
+    return false;
 }
 
 static int kvm_xen_set_vcpu_timer(CPUState *cs)
diff --git a/target/i386/kvm/xen-emu.h b/target/i386/kvm/xen-emu.h
index 58e4748d80..0ff8bed350 100644
--- a/target/i386/kvm/xen-emu.h
+++ b/target/i386/kvm/xen-emu.h
@@ -27,5 +27,6 @@ int kvm_xen_init(KVMState *s, uint32_t hypercall_msr);
 int kvm_xen_handle_exit(X86CPU *cpu, struct kvm_xen_exit *exit);
 int kvm_xen_set_vcpu_attr(CPUState *cs, uint16_t type, uint64_t gpa);
 int kvm_xen_set_vcpu_callback_vector(CPUState *cs);
+void kvm_xen_maybe_deassert_callback(CPUState *cs);
 
 #endif /* QEMU_I386_KVM_XEN_EMU_H */
-- 
2.25.1


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation
  2022-12-13 22:32           ` Paolo Bonzini
@ 2022-12-16  8:40             ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-16  8:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: qemu-devel, Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 2207 bytes --]

On Tue, 2022-12-13 at 23:32 +0100, Paolo Bonzini wrote:
> Il mar 13 dic 2022, 01:59 David Woodhouse <dwmw2@infradead.org> ha scritto:
> > There was *also* a call to xen_emulated_machine_init() added to
> > pc_init1() by the 'pc_piix: handle XEN_EMULATE backend init' patch.
> > I've dropped that for now; once we are ready to hook up the xenbus and
> > PV drivers, that seems like it can go into mc->kvm_type too. Or maybe I
> > should have kept that, and initialised the overlay and evtchn devices
> > from xen_emulated_machine_init() instead of mc->kvm_type() ?
> 
> 
> I think that would be too early. mc->kvm_type, which really should be
> called mc->accel_created, is the first place where you can look at
> xen_mode.

Ack. So now I instantiate the overlay and evtchn devices from
pc_machine_kvm_type() when needed. Thanks.

I've followed the HPET example and wired up a full set of NUM_GSI_PINS
outbound IRQs, each connected to the corresponding system GSI, and fire
the one I want to according to the guest's CALLBACK_PARAM setup.

But *connecting* those had to be done later than kvm_type(), so I've
done it from pc_basic_device_init() right alongside the HPET one. It
couldn't work out how to qdev_find_recursive() the evtchn device, since
it lacks an ID. So I just called a function in xen_evtchn.c which has
access to that singleton you already told me not to worry about. 

The IRQs still concern me a little. I could have sworn that QEMU coped
with shared level IRQs, but I can't see how. It looks like each
'output' connected to a qemu_irq input gets to waggle it up and down
independently, overriding the others?

Maybe it was Xen itself that keeps a *count* of how many outputs have
asserted a given input, and each output is responsible for ensuring it
only ever contributes a count of 0 or 1 to that total?

Part of me wants to go off at a tangent and fix that in QEMU, allowing
shared level IRQs as well as the EOI-driven resampling. But December
only lasts so long before the US wakes up and remembers that you exist
in January, and I *really* want to get the Xen support to the point
where I can run XTF tests under it, at the very least...



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-15 20:54         ` David Woodhouse
@ 2022-12-20 13:56           ` Paul Durrant
  2022-12-20 16:27             ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-20 13:56 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 15/12/2022 20:54, David Woodhouse wrote:
> On Mon, 2022-12-12 at 16:39 +0000, Paul Durrant wrote:
>> On 12/12/2022 16:26, David Woodhouse wrote:
>>> On Mon, 2022-12-12 at 16:16 +0000, Paul Durrant wrote:
>>>> On 09/12/2022 09:56, David Woodhouse wrote:
>>>>> From: Ankur Arora <ankur.a.arora@oracle.com>
>>>>> The HVM_PARAM_CALLBACK_IRQ parameter controls the system-wide event
>>>>> channel upcall method.  The vector support is handled by KVM internally,
>>>>> when the evtchn_upcall_pending field in the vcpu_info is set.
>>>>> The GSI and PCI_INTX delivery methods are not supported. yet; those
>>>>> need to simulate a level-triggered event on the I/OAPIC.
>>>>
>>>> That's gonna be somewhat limiting if anyone runs a Windows guest with
>>>> upcall vector support turned off... which is an option at:
>>>>
>>>> https://xenbits.xen.org/gitweb/?p=pvdrivers/win/xenbus.git;a=blob;f=src/xenbus/evtchn.c;;hb=HEAD#l1928
>>>>
>>>
>>> Sure. And as you know, I also added the 'xen_no_vector_callback' option
>>> to the Linux command line to allow for that mode to be tested with
>>> Linux too:
>>> https://git.kernel.org/torvalds/c/b36b0fe96a
>>>
>>>
>>> The GSI and PCI_INTX modes will be added in time, but not yet.
>>
>> Ok, but maybe worth calling out the limitation in the commit comment for
>> those wishing to kick the tyres.
> 
> Hm... this isn't as simple in QEMU as I hoped.
> 
> The way I want to handle it is like the way that VFIO eventfd pairs
> work for level-triggered interrupts: the first eventfd is triggered on
> a rising edge, and the other one is a 'resampler' which is triggered on
> EOI, and causes the first one to be retriggered if the level is still
> actually high.
> 
> However... unlike the kernel and another VMM that you and I are
> familiar with, QEMU doesn't actually hook that up to the EOI in the
> APIC/IOAPIC at all.
> 
> Instead, when VFIO devices assert a level-triggered interrupt, QEMU
> *unmaps* that device's BARs from the guest so it can trap-and-emulate
> them, and each MMIO read or write will also trigger the resampler
> (whether that line is currently masked in the APIC or not).
> 
> I suppose we could try making the page with the vcpu_info as read-only,
> and trapping access to that so we spot when the guest clears its own
> ->evtchn_upcall_pending flag? That seems overly complex though.
> 
> So I've resorted to doing what Xen itself does: just poll the flag on
> every vmexit. Patch is at the tip of my tree at
> https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv
> and below.
> 
> However, in the case of the in-kernel irqchip we might not even *get* a
> vmexit all the way to userspace; can't a guest just get stuck in an
> interrupt storm with it being handled entirely in-kernel? I might
> decree that this works *only* with the split irqchip.
> 
> Then again, it'll work *nicely* in the kernel where the EOI
> notification exists, so I might teach the kernel's evtchn code to
> create those eventfd pairs like VFIO's, which can be hooked in as IRQFD
> routing to the in-kernel {IOA,}PIC.

The callback via is essentially just another line interrupt but its 
assertion is closely coupled with the vcpu upcall_pending bit. Because 
that is being dealt with in-kernel then the via should be dealt with 
in-kernel, right?

   Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-20 13:56           ` Paul Durrant
@ 2022-12-20 16:27             ` David Woodhouse
  2022-12-20 17:25               ` Paul Durrant
  0 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-20 16:27 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana



On 20 December 2022 13:56:39 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>The callback via is essentially just another line interrupt but its assertion is closely coupled with the vcpu upcall_pending bit. Because that is being dealt with in-kernel then the via should be dealt with in-kernel, right?

Not right now, because there's not a lot of point in kernel acceleration in the case that it's delivered as GSI. There's no per-vCPU delivery, so it doesn't get used for IPI, for timers. None of the stuff we want to accelerate in-kernel. Only the actual PV drivers.

If we do FIFO event channels in the future then the kernel will probably need to own those; they need spinlocks and you can have *both* userspace and kernel delivering with test-and-set-bit sequences like you can with 2level.

Even so, I was tempted to add a VFIO-like eventfd pair for the vCPU0 evtchn_upcall_pending and let the kernel drive it... but qemu doesn't even do the EOI side of that as $DEITY intended, so I didn't.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-20 16:27             ` David Woodhouse
@ 2022-12-20 17:25               ` Paul Durrant
  2022-12-20 17:29                 ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-20 17:25 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 20/12/2022 16:27, David Woodhouse wrote:
> 
> 
> On 20 December 2022 13:56:39 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>> The callback via is essentially just another line interrupt but its assertion is closely coupled with the vcpu upcall_pending bit. Because that is being dealt with in-kernel then the via should be dealt with in-kernel, right?
> 
> Not right now, because there's not a lot of point in kernel acceleration in the case that it's delivered as GSI. There's no per-vCPU delivery, so it doesn't get used for IPI, for timers. None of the stuff we want to accelerate in-kernel. Only the actual PV drivers.
> 
> If we do FIFO event channels in the future then the kernel will probably need to own those; they need spinlocks and you can have *both* userspace and kernel delivering with test-and-set-bit sequences like you can with 2level.
> 
> Even so, I was tempted to add a VFIO-like eventfd pair for the vCPU0 evtchn_upcall_pending and let the kernel drive it... but qemu doesn't even do the EOI side of that as $DEITY intended, so I didn't.

My point was that clearing upcall_pending is essentially the equivalent 
of EOI-at-device i.e. it's the thing that drops the line. So, without 
some form of interception, we need some way to check it at an 
appropriate time... and as you say, there may be no exit to VMM for EOI 
of the APIC. So when?

   Paul


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-20 17:25               ` Paul Durrant
@ 2022-12-20 17:29                 ` David Woodhouse
  2022-12-28 10:45                   ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-20 17:29 UTC (permalink / raw)
  To: paul, Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana



On 20 December 2022 17:25:31 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>On 20/12/2022 16:27, David Woodhouse wrote:
>> 
>> 
>> On 20 December 2022 13:56:39 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>>> The callback via is essentially just another line interrupt but its assertion is closely coupled with the vcpu upcall_pending bit. Because that is being dealt with in-kernel then the via should be dealt with in-kernel, right?
>> 
>> Not right now, because there's not a lot of point in kernel acceleration in the case that it's delivered as GSI. There's no per-vCPU delivery, so it doesn't get used for IPI, for timers. None of the stuff we want to accelerate in-kernel. Only the actual PV drivers.
>> 
>> If we do FIFO event channels in the future then the kernel will probably need to own those; they need spinlocks and you can have *both* userspace and kernel delivering with test-and-set-bit sequences like you can with 2level.
>> 
>> Even so, I was tempted to add a VFIO-like eventfd pair for the vCPU0 evtchn_upcall_pending and let the kernel drive it... but qemu doesn't even do the EOI side of that as $DEITY intended, so I didn't.
>
>My point was that clearing upcall_pending is essentially the equivalent of EOI-at-device i.e. it's the thing that drops the line. So, without some form of interception, we need some way to check it at an appropriate time... and as you say, there may be no exit to VMM for EOI of the APIC. So when?

If we have an in-kernel APIC then it has a notifier callback on EOI and we can poll evtchn_upcall_pending then. That's another point in favour of handling it in kernel.

And for userspace APIC we *do* get an exit of course... but QEMU lacks that notifier mechanism for EOI of pseudo-level interrupts for that "should it still be pending?" check. 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-12 16:16   ` Paul Durrant
  2022-12-12 16:26     ` David Woodhouse
@ 2022-12-21  1:41     ` David Woodhouse
  2022-12-21  9:37       ` Paul Durrant
  1 sibling, 1 reply; 78+ messages in thread
From: David Woodhouse @ 2022-12-21  1:41 UTC (permalink / raw)
  To: Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 1598 bytes --]

On Mon, 2022-12-12 at 16:16 +0000, Paul Durrant wrote:
> 
> > @@ -287,24 +289,53 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
> >       return true;
> >   }
> >   
> > +static int handle_set_param(struct kvm_xen_exit *exit, X86CPU *cpu,
> > +                            uint64_t arg)
> > +{
> > +    CPUState *cs = CPU(cpu);
> > +    struct xen_hvm_param hp;
> > +    int err = 0;
> > +
> > +    if (kvm_copy_from_gva(cs, arg, &hp, sizeof(hp))) {
> > +        err = -EFAULT;
> > +        goto out;
> > +    }
> > +
> > +    if (hp.domid != DOMID_SELF) {
> 
> Xen actually allows the domain's own id to be specified as well as the 
> magic DOMID_SELF.
> 
> > +        err = -EINVAL;
> 
> And this should be -ESRCH.
> 

Oops, fixed that after posting v4 series. Fixed in:

https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv

I fixed the similar -EPERM in evtchn_status_op() too.

> > +        goto out;
> > +    }
> > +
> > +    switch (hp.index) {
> > +    case HVM_PARAM_CALLBACK_IRQ:
> > +        err = xen_evtchn_set_callback_param(hp.value);
> > +        break;
> > +    default:
> > +        return false;
> > +    }
> > +
> > +out:
> > +    exit->u.hcall.result = err;
> 
> This is a bit on the ugly side isn't it? Why not return the err and have 
> kvm_xen_hcall_hvm_op() deal with passing it back?

Because 'return false' means qemu will whine about it being
unimplemented.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-21  1:41     ` David Woodhouse
@ 2022-12-21  9:37       ` Paul Durrant
  2022-12-21 12:16         ` David Woodhouse
  0 siblings, 1 reply; 78+ messages in thread
From: Paul Durrant @ 2022-12-21  9:37 UTC (permalink / raw)
  To: David Woodhouse, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

On 21/12/2022 01:41, David Woodhouse wrote:
> On Mon, 2022-12-12 at 16:16 +0000, Paul Durrant wrote:
>>
>>> @@ -287,24 +289,53 @@ static bool kvm_xen_hcall_memory_op(struct kvm_xen_exit *exit,
>>>        return true;
>>>    }
>>>    
>>> +static int handle_set_param(struct kvm_xen_exit *exit, X86CPU *cpu,
>>> +                            uint64_t arg)
>>> +{
>>> +    CPUState *cs = CPU(cpu);
>>> +    struct xen_hvm_param hp;
>>> +    int err = 0;
>>> +
>>> +    if (kvm_copy_from_gva(cs, arg, &hp, sizeof(hp))) {
>>> +        err = -EFAULT;
>>> +        goto out;
>>> +    }
>>> +
>>> +    if (hp.domid != DOMID_SELF) {
>>
>> Xen actually allows the domain's own id to be specified as well as the
>> magic DOMID_SELF.
>>
>>> +        err = -EINVAL;
>>
>> And this should be -ESRCH.
>>
> 
> Oops, fixed that after posting v4 series. Fixed in:
> 
> https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv
> 
> I fixed the similar -EPERM in evtchn_status_op() too.
> 
>>> +        goto out;
>>> +    }
>>> +
>>> +    switch (hp.index) {
>>> +    case HVM_PARAM_CALLBACK_IRQ:
>>> +        err = xen_evtchn_set_callback_param(hp.value);
>>> +        break;
>>> +    default:
>>> +        return false;
>>> +    }
>>> +
>>> +out:
>>> +    exit->u.hcall.result = err;
>>
>> This is a bit on the ugly side isn't it? Why not return the err and have
>> kvm_xen_hcall_hvm_op() deal with passing it back?
> 
> Because 'return false' means qemu will whine about it being
> unimplemented.
> 

Ah, ok. Yes, I did suggest turning that into a trace, which would mean 
that only those who cared would see such a whine.

   Paul




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-21  9:37       ` Paul Durrant
@ 2022-12-21 12:16         ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-21 12:16 UTC (permalink / raw)
  To: paul, Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana



On 21 December 2022 09:37:36 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
>On 21/12/2022 01:41, David Woodhouse wrote:
>>>> +    exit->u.hcall.result = err;
>>> 
>>> This is a bit on the ugly side isn't it? Why not return the err and have
>>> kvm_xen_hcall_hvm_op() deal with passing it back?
>> 
>> Because 'return false' means qemu will whine about it being
>> unimplemented.
>> 
>
>Ah, ok. Yes, I did suggest turning that into a trace, which would mean that only those who cared would see such a whine.

We have both a trace *and* the UNIMPL warning. Neither are enabled by default.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ
  2022-12-20 17:29                 ` David Woodhouse
@ 2022-12-28 10:45                   ` David Woodhouse
  0 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2022-12-28 10:45 UTC (permalink / raw)
  To: paul, Paul Durrant, qemu-devel
  Cc: Paolo Bonzini, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 3110 bytes --]

On Tue, 2022-12-20 at 17:29 +0000, David Woodhouse wrote:
> On 20 December 2022 17:25:31 GMT, Paul Durrant <xadimgnik@gmail.com> wrote:
> > On 20/12/2022 16:27, David Woodhouse wrote:
> > > 
> > > 
> > > On 20 December 2022 13:56:39 GMT, Paul Durrant
> > > <xadimgnik@gmail.com> wrote:
> > > > The callback via is essentially just another line interrupt but
> > > > its assertion is closely coupled with the vcpu upcall_pending
> > > > bit. Because that is being dealt with in-kernel then the via
> > > > should be dealt with in-kernel, right?
> > > 
> > > Not right now, because there's not a lot of point in kernel
> > > acceleration in the case that it's delivered as GSI. There's no
> > > per-vCPU delivery, so it doesn't get used for IPI, for timers.
> > > None of the stuff we want to accelerate in-kernel. Only the
> > > actual PV drivers.
> > > 
> > > If we do FIFO event channels in the future then the kernel will
> > > probably need to own those; they need spinlocks and you can have
> > > *both* userspace and kernel delivering with test-and-set-bit
> > > sequences like you can with 2level.
> > > 
> > > Even so, I was tempted to add a VFIO-like eventfd pair for the
> > > vCPU0 evtchn_upcall_pending and let the kernel drive it... but
> > > qemu doesn't even do the EOI side of that as $DEITY intended, so
> > > I didn't.
> > 
> > My point was that clearing upcall_pending is essentially the
> > equivalent of EOI-at-device i.e. it's the thing that drops the
> > line. So, without some form of interception, we need some way to
> > check it at an appropriate time... and as you say, there may be no
> > exit to VMM for EOI of the APIC. So when?
> 
> If we have an in-kernel APIC then it has a notifier callback on EOI
> and we can poll evtchn_upcall_pending then. That's another point in
> favour of handling it in kernel.
> 
> And for userspace APIC we *do* get an exit of course... but QEMU
> lacks that notifier mechanism for EOI of pseudo-level interrupts for
> that "should it still be pending?" check. 

So, I now have the qemu/backend side of event channels hooked up, and a
basic xenstore implementation that just returns ENOSYS to everything
for now. As before, at
https://git.infradead.org/users/dwmw2/qemu.git/shortlog/refs/heads/xenfv

I played with booting the guest with 'xen_no_vector_callback' on its
command line.

With kernel-irqchip=split (i.e. PIC/APIC in userspace) that seems to be
working OK. We *do* take that exit to qemu, and qemu spots that it
needs to deassert the GSI.

With the *kernel* PIC/APIC, we get a brief IRQ storm on the PCI INTX
every time:

[root@localhost ~]# grep xen /proc/interrupts 
 10:      80001          0   IO-APIC  10-fasteoi   xen-platform-pci
 24:          8          0   xen-dyn    -event     xenbus

In fact, unless CPU0 takes an exit, that storm might not be very brief
at all.

I *could* fix this up in the kernel via kvm_notify_acked_irq(), but I
suspect the better answer is just to declare that the Xen support (or
at least GSI evtchn delivery) requires split irqchip.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support
  2022-12-12 17:30   ` Paolo Bonzini
  2022-12-12 17:55     ` Paul Durrant
  2022-12-13  0:13     ` David Woodhouse
@ 2023-01-17 13:49     ` David Woodhouse
  2 siblings, 0 replies; 78+ messages in thread
From: David Woodhouse @ 2023-01-17 13:49 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: Paul Durrant, Joao Martins, Ankur Arora,
	Philippe Mathieu-Daudé,
	Thomas Huth, Alex Bennée, Juan Quintela,
	Dr . David Alan Gilbert, Claudio Fontana

[-- Attachment #1: Type: text/plain, Size: 3553 bytes --]

On Mon, 2022-12-12 at 18:30 +0100, Paolo Bonzini wrote:
> On 12/9/22 10:55, David Woodhouse wrote:
> > -    m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> > +    if (xen_enabled())
> > +            m->default_machine_opts = "accel=xen,suppress-vmdesc=on";
> > +    else
> > +            m->default_machine_opts = "accel=kvm,xen-version=0x30001";
> 
> Please do not modify pc_xen_hvm_init().
> 
> "-M xenfv" should be the same as "-M pc-i440fx-...,suppress-vmdesc=on
> -accel xen -device xen-platform".  It must work *without* "-accel xen", 
> while here you're basically requiring it.  For now, please make 
> KVM-emulated Xen use "-M pc -device xen-platform". 

I did that (well, you also need -accel kvm,xen_version=xxx).

>  We can figure out "-M xenfv" later.

I don't know that we care about doing this. Sure, I could change the if
(xen_enabled()) to something which isn't recursive, and works out which
of Xen or KVM mode is *possible* right now.

But having moved the xen-version from a machine property to a kvm-accel
property, using "accel=kvm,xen-version=xxx" doesn't even work; not
without the hackery to 'forward' that from the machine to the
accelerator, like we do in qemu_apply_legacy_machine_options() for
properties like kvm-shadow-mem. I don't think we want that.

So I think I'm most inclined to rename the CONFIG_XENFV_MACHINE option
to CONFIG_XEN_BUS, since that's basically what it ended up being once
the rest of the patches took shape on top of the basic platform
support. And drop the idea of making '-M xenfv' work altogether.

I'll add some documentation on how to launch it using
-accel kvm,xen-version=.

> You can instead have:
> 
> - a check in xen_init() that checks that xen_mode is XEN_ATTACH.  If 
> not, fail.
> 
> - an extra enum value for xen_mode, XEN_DISABLED, which is the default 
> instead of XEN_EMULATE;
> 
> - an accelerator property "-accel kvm,xen-version=...", added in 
> kvm_accel_class_init() instead of the machine property.  The property, 
> when set to a nonzero value, flips xen_mode from XEN_DISABLED to 
> XEN_EMULATE.
> 
> The Xen overlay device can be created using the mc->kvm_type function
> (which you can set in DEFINE_PC_MACHINE); at that point, xen_mode has
> switched from XEN_DISABLED to XEN_EMULATE.  Those xen_enabled() checks 
> that apply to KVM then become xen_mode != XEN_DISABLED, as long as they 
> run during mc->kvm_type or afterwards.
> 
> The platform device can be created either in mc->kvm_type or manually
> (not sure if it makes sense to have a "XenVMMXenVMM" CPUID + emulated
> hypercalls but no platform device---would it still use pvclock for 
> example?).

Yeah, I think fairly much everything *can* work without the platform
device. It's only really the hook for the legacy event channel
interrupt, and the unplug protocol. Linux does use its BAR to map grant
tables over, but that's just a crutch to find a free bit of guest
physical address space.

But still, I think it makes sense to add it unconditionally as you
suggest. In mc->kvm_type, pcms->bus isn't set yet but I can do it like
this:

--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -1313,6 +1313,9 @@ void pc_basic_device_init(struct PCMachineState
*pcms,
 #ifdef CONFIG_XEN_EMU
     if (xen_mode == XEN_EMULATE) {
         xen_evtchn_connect_gsis(gsi);
+        if (pcms->bus) {
+            pci_create_simple(pcms->bus, -1, "xen-platform");
+        }
     }
 #endif
 



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2023-01-17 13:50 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-09  9:55 [RFC PATCH v2 00/22] Xen HVM support under KVM David Woodhouse
2022-12-09  9:55 ` [RFC PATCH v2 01/22] include: import xen public headers David Woodhouse
2022-12-12  9:17   ` Paul Durrant
2022-12-09  9:55 ` [RFC PATCH v2 02/22] xen: add CONFIG_XENFV_MACHINE and CONFIG_XEN_EMU options for Xen emulation David Woodhouse
2022-12-12  9:19   ` Paul Durrant
2022-12-12 17:07   ` Paolo Bonzini
2022-12-12 22:22     ` David Woodhouse
2022-12-13  0:39       ` Paolo Bonzini
2022-12-13  0:59         ` David Woodhouse
2022-12-13 22:32           ` Paolo Bonzini
2022-12-16  8:40             ` David Woodhouse
2022-12-09  9:55 ` [RFC PATCH v2 03/22] i386/xen: Add xen-version machine property and init KVM Xen support David Woodhouse
2022-12-12 12:48   ` Paul Durrant
2022-12-12 17:30   ` Paolo Bonzini
2022-12-12 17:55     ` Paul Durrant
2022-12-13  0:13     ` David Woodhouse
2023-01-17 13:49     ` David Woodhouse
2022-12-09  9:55 ` [RFC PATCH v2 04/22] i386/kvm: handle Xen HVM cpuid leaves David Woodhouse
2022-12-12 13:13   ` Paul Durrant
2022-12-13  9:47     ` David Woodhouse
2022-12-09  9:55 ` [RFC PATCH v2 05/22] xen-platform-pci: allow its creation with XEN_EMULATE mode David Woodhouse
2022-12-12 13:24   ` Paul Durrant
2022-12-12 22:07     ` David Woodhouse
2022-12-09  9:55 ` [RFC PATCH v2 06/22] hw/xen_backend: refactor xen_be_init() David Woodhouse
2022-12-12 13:27   ` Paul Durrant
2022-12-09  9:55 ` [RFC PATCH v2 07/22] pc_piix: handle XEN_EMULATE backend init David Woodhouse
2022-12-12 13:47   ` Paul Durrant
2022-12-12 14:50     ` David Woodhouse
2022-12-09  9:55 ` [RFC PATCH v2 08/22] xen_platform: exclude vfio-pci from the PCI platform unplug David Woodhouse
2022-12-12 13:52   ` Paul Durrant
2022-12-09  9:55 ` [RFC PATCH v2 09/22] pc_piix: allow xenfv machine with XEN_EMULATE David Woodhouse
2022-12-12 14:05   ` Paul Durrant
2022-12-09  9:56 ` [RFC PATCH v2 10/22] i386/xen: handle guest hypercalls David Woodhouse
2022-12-12 14:11   ` Paul Durrant
2022-12-12 14:17     ` David Woodhouse
2022-12-12 17:07   ` Paolo Bonzini
2022-12-09  9:56 ` [RFC PATCH v2 11/22] i386/xen: implement HYPERCALL_xen_version David Woodhouse
2022-12-12 14:17   ` Paul Durrant
2022-12-13  0:06     ` David Woodhouse
2022-12-09  9:56 ` [RFC PATCH v2 12/22] hw/xen: Add xen_overlay device for emulating shared xenheap pages David Woodhouse
2022-12-12 14:29   ` Paul Durrant
2022-12-12 17:14   ` Paolo Bonzini
2022-12-09  9:56 ` [RFC PATCH v2 13/22] i386/xen: implement HYPERVISOR_memory_op David Woodhouse
2022-12-12 14:38   ` Paul Durrant
2022-12-13  0:08     ` David Woodhouse
2022-12-09  9:56 ` [RFC PATCH v2 14/22] i386/xen: implement HYPERVISOR_hvm_op David Woodhouse
2022-12-12 14:41   ` Paul Durrant
2022-12-09  9:56 ` [RFC PATCH v2 15/22] i386/xen: implement HYPERVISOR_vcpu_op David Woodhouse
2022-12-12 14:51   ` Paul Durrant
2022-12-13  0:10     ` David Woodhouse
2022-12-09  9:56 ` [RFC PATCH v2 16/22] i386/xen: handle VCPUOP_register_vcpu_info David Woodhouse
2022-12-12 14:58   ` Paul Durrant
2022-12-13  0:13     ` David Woodhouse
2022-12-14 10:28       ` Paul Durrant
2022-12-14 11:04         ` David Woodhouse
2022-12-09  9:56 ` [RFC PATCH v2 17/22] i386/xen: handle VCPUOP_register_vcpu_time_info David Woodhouse
2022-12-12 15:34   ` Paul Durrant
2022-12-09  9:56 ` [RFC PATCH v2 18/22] i386/xen: handle VCPUOP_register_runstate_memory_area David Woodhouse
2022-12-12 15:38   ` Paul Durrant
2022-12-09  9:56 ` [RFC PATCH v2 19/22] i386/xen: implement HVMOP_set_evtchn_upcall_vector David Woodhouse
2022-12-12 15:52   ` Paul Durrant
2022-12-09  9:56 ` [RFC PATCH v2 20/22] i386/xen: HVMOP_set_param / HVM_PARAM_CALLBACK_IRQ David Woodhouse
2022-12-12 16:16   ` Paul Durrant
2022-12-12 16:26     ` David Woodhouse
2022-12-12 16:39       ` Paul Durrant
2022-12-15 20:54         ` David Woodhouse
2022-12-20 13:56           ` Paul Durrant
2022-12-20 16:27             ` David Woodhouse
2022-12-20 17:25               ` Paul Durrant
2022-12-20 17:29                 ` David Woodhouse
2022-12-28 10:45                   ` David Woodhouse
2022-12-21  1:41     ` David Woodhouse
2022-12-21  9:37       ` Paul Durrant
2022-12-21 12:16         ` David Woodhouse
2022-12-09  9:56 ` [RFC PATCH v2 21/22] i386/xen: implement HYPERVISOR_event_channel_op David Woodhouse
2022-12-12 16:23   ` Paul Durrant
2022-12-09  9:56 ` [RFC PATCH v2 22/22] i386/xen: implement HYPERVISOR_sched_op David Woodhouse
2022-12-12 16:37   ` Paul Durrant

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.