* [PATCH 0/8] hvf: Implement Apple Silicon Support
@ 2020-11-26 21:50 Alexander Graf
  2020-11-26 21:50 ` [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries Alexander Graf
                   ` (8 more replies)
  0 siblings, 9 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

Now that Apple Silicon is widely available, people are obviously excited
to try running virtualized workloads on it, such as Linux and Windows.

This patch set implements a rudimentary first version to get the ball
rolling. With this applied, I can successfully run both Linux and
Windows as guests, albeit with a few caveats:

  * no WFI emulation, so a vCPU always consumes 100% of a host core
  * vtimer handling is a bit hacky
  * we handle most sysregs by flying blind, simply returning 0
  * XHCI is broken in OVMF, but works in Linux and Windows

Despite those drawbacks, it's still an exciting place to start playing
with the power of Apple Silicon.
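For anyone who wants to try it, an illustrative invocation would look
roughly like this (the disk image name and memory size are placeholders,
not values from this series; patch 8 takes care of the highmem
restriction on the virt machine):

```shell
# Illustrative only: boot an aarch64 guest using the new HVF accelerator.
# guest.img is a placeholder disk image, 2048 MiB an arbitrary RAM size.
qemu-system-aarch64 \
    -M virt \
    -accel hvf \
    -m 2048 \
    -drive file=guest.img,if=virtio
```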

Enjoy!

Alex

Alexander Graf (8):
  hvf: Add hypervisor entitlement to output binaries
  hvf: Move common code out
  arm: Set PSCI to 0.2 for HVF
  arm: Synchronize CPU on PSCI on
  hvf: Add Apple Silicon support
  hvf: Use OS provided vcpu kick function
  arm: Add Hypervisor.framework build target
  hw/arm/virt: Disable highmem when on hypervisor.framework

 MAINTAINERS                  |  14 +-
 accel/hvf/entitlements.plist |   8 +
 accel/hvf/hvf-all.c          |  56 ++++
 accel/hvf/hvf-cpus.c         | 484 +++++++++++++++++++++++++++++++++++
 accel/hvf/meson.build        |   7 +
 accel/meson.build            |   1 +
 hw/arm/virt.c                |   9 +
 include/hw/core/cpu.h        |   3 +-
 include/sysemu/hvf_int.h     |  69 +++++
 meson.build                  |  39 ++-
 scripts/entitlement.sh       |  11 +
 target/arm/arm-powerctl.c    |   3 +
 target/arm/cpu.c             |   4 +
 target/arm/hvf/hvf.c         | 345 +++++++++++++++++++++++++
 target/arm/hvf/meson.build   |   3 +
 target/arm/meson.build       |   2 +
 target/i386/hvf/hvf-cpus.c   | 131 ----------
 target/i386/hvf/hvf-cpus.h   |  25 --
 target/i386/hvf/hvf-i386.h   |  48 +---
 target/i386/hvf/hvf.c        | 360 +-------------------------
 target/i386/hvf/meson.build  |   1 -
 target/i386/hvf/x86hvf.c     |  11 +-
 target/i386/hvf/x86hvf.h     |   2 -
 23 files changed, 1061 insertions(+), 575 deletions(-)
 create mode 100644 accel/hvf/entitlements.plist
 create mode 100644 accel/hvf/hvf-all.c
 create mode 100644 accel/hvf/hvf-cpus.c
 create mode 100644 accel/hvf/meson.build
 create mode 100644 include/sysemu/hvf_int.h
 create mode 100755 scripts/entitlement.sh
 create mode 100644 target/arm/hvf/hvf.c
 create mode 100644 target/arm/hvf/meson.build
 delete mode 100644 target/i386/hvf/hvf-cpus.c
 delete mode 100644 target/i386/hvf/hvf-cpus.h

-- 
2.24.3 (Apple Git-128)




* [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-27  4:54   ` Paolo Bonzini
  2020-11-27 19:44   ` Roman Bolshakov
  2020-11-26 21:50 ` [PATCH 2/8] hvf: Move common code out Alexander Graf
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

In macOS 11, QEMU only gets access to Hypervisor.framework if it carries
the respective entitlement. Add an entitlement template, and automatically
self-sign and apply the entitlement during the build.
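In effect, emulator binaries are now built as '<name>-unsigned' and the
install step runs scripts/entitlement.sh to copy and re-sign them. Running
the equivalent by hand on macOS (the binary path is just an example)
amounts to:

```shell
# Sign in place with an ad-hoc identity (-s -) and attach the entitlement.
# The result can be inspected afterwards with:
#   codesign -d --entitlements :- ./qemu-system-aarch64
codesign --entitlements accel/hvf/entitlements.plist --force -s - \
    ./qemu-system-aarch64
```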

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 accel/hvf/entitlements.plist |  8 ++++++++
 meson.build                  | 30 ++++++++++++++++++++++++++----
 scripts/entitlement.sh       | 11 +++++++++++
 3 files changed, 45 insertions(+), 4 deletions(-)
 create mode 100644 accel/hvf/entitlements.plist
 create mode 100755 scripts/entitlement.sh

diff --git a/accel/hvf/entitlements.plist b/accel/hvf/entitlements.plist
new file mode 100644
index 0000000000..154f3308ef
--- /dev/null
+++ b/accel/hvf/entitlements.plist
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+    <key>com.apple.security.hypervisor</key>
+    <true/>
+</dict>
+</plist>
diff --git a/meson.build b/meson.build
index 5062407c70..2a7ff5560c 100644
--- a/meson.build
+++ b/meson.build
@@ -1844,9 +1844,14 @@ foreach target : target_dirs
     }]
   endif
   foreach exe: execs
-    emulators += {exe['name']:
-         executable(exe['name'], exe['sources'],
-               install: true,
+    exe_name = exe['name']
+    exe_sign = 'CONFIG_HVF' in config_target
+    if exe_sign
+      exe_name += '-unsigned'
+    endif
+
+    emulator = executable(exe_name, exe['sources'],
+               install: not exe_sign,
                c_args: c_args,
                dependencies: arch_deps + deps + exe['dependencies'],
                objects: lib.extract_all_objects(recursive: true),
@@ -1854,7 +1859,24 @@ foreach target : target_dirs
                link_depends: [block_syms, qemu_syms] + exe.get('link_depends', []),
                link_args: link_args,
                gui_app: exe['gui'])
-    }
+
+    if exe_sign
+      exe_full = meson.current_build_dir() / exe['name']
+      emulators += {exe['name'] : custom_target(exe['name'],
+                   install: true,
+                   install_dir: get_option('bindir'),
+                   depends: emulator,
+                   output: exe['name'],
+                   command: [
+                     meson.current_source_dir() / 'scripts/entitlement.sh',
+                     meson.current_build_dir() / exe['name'] + '-unsigned',
+                     meson.current_build_dir() / exe['name'],
+                     meson.current_source_dir() / 'accel/hvf/entitlements.plist'
+                   ])
+      }
+    else
+      emulators += {exe['name']: emulator}
+    endif
 
     if 'CONFIG_TRACE_SYSTEMTAP' in config_host
       foreach stp: [
diff --git a/scripts/entitlement.sh b/scripts/entitlement.sh
new file mode 100755
index 0000000000..7ed9590bf9
--- /dev/null
+++ b/scripts/entitlement.sh
@@ -0,0 +1,11 @@
+#!/bin/sh -e
+#
+# Helper script for the build process to apply entitlements
+
+SRC="$1"
+DST="$2"
+ENTITLEMENT="$3"
+
+rm -f "$2"
+cp -a "$SRC" "$DST"
+codesign --entitlements "$ENTITLEMENT" --force -s - "$DST"
-- 
2.24.3 (Apple Git-128)




* [PATCH 2/8] hvf: Move common code out
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
  2020-11-26 21:50 ` [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-27 20:00   ` Roman Bolshakov
  2020-11-26 21:50 ` [PATCH 3/8] arm: Set PSCI to 0.2 for HVF Alexander Graf
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

Until now, Hypervisor.framework has only been available on x86_64 systems.
With Apple Silicon shipping, it extends its reach to aarch64. To prepare
for multi-architecture support, let's move the common code out into its
own accel directory.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 MAINTAINERS                 |   9 +-
 accel/hvf/hvf-all.c         |  56 +++++
 accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
 accel/hvf/meson.build       |   7 +
 accel/meson.build           |   1 +
 include/sysemu/hvf_int.h    |  69 ++++++
 target/i386/hvf/hvf-cpus.c  | 131 ----------
 target/i386/hvf/hvf-cpus.h  |  25 --
 target/i386/hvf/hvf-i386.h  |  48 +---
 target/i386/hvf/hvf.c       | 360 +--------------------------
 target/i386/hvf/meson.build |   1 -
 target/i386/hvf/x86hvf.c    |  11 +-
 target/i386/hvf/x86hvf.h    |   2 -
 13 files changed, 619 insertions(+), 569 deletions(-)
 create mode 100644 accel/hvf/hvf-all.c
 create mode 100644 accel/hvf/hvf-cpus.c
 create mode 100644 accel/hvf/meson.build
 create mode 100644 include/sysemu/hvf_int.h
 delete mode 100644 target/i386/hvf/hvf-cpus.c
 delete mode 100644 target/i386/hvf/hvf-cpus.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 68bc160f41..ca4b6d9279 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
 M: Roman Bolshakov <r.bolshakov@yadro.com>
 W: https://wiki.qemu.org/Features/HVF
 S: Maintained
-F: accel/stubs/hvf-stub.c
 F: target/i386/hvf/
+
+HVF
+M: Cameron Esfahani <dirty@apple.com>
+M: Roman Bolshakov <r.bolshakov@yadro.com>
+W: https://wiki.qemu.org/Features/HVF
+S: Maintained
+F: accel/hvf/
 F: include/sysemu/hvf.h
+F: include/sysemu/hvf_int.h
 
 WHPX CPUs
 M: Sunil Muthuswamy <sunilmut@microsoft.com>
diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
new file mode 100644
index 0000000000..47d77a472a
--- /dev/null
+++ b/accel/hvf/hvf-all.c
@@ -0,0 +1,56 @@
+/*
+ * QEMU Hypervisor.framework support
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Contributions after 2012-01-13 are licensed under the terms of the
+ * GNU GPL, version 2 or (at your option) any later version.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/runstate.h"
+
+#include "qemu/main-loop.h"
+#include "sysemu/accel.h"
+
+#include <Hypervisor/Hypervisor.h>
+
+bool hvf_allowed;
+HVFState *hvf_state;
+
+void assert_hvf_ok(hv_return_t ret)
+{
+    if (ret == HV_SUCCESS) {
+        return;
+    }
+
+    switch (ret) {
+    case HV_ERROR:
+        error_report("Error: HV_ERROR");
+        break;
+    case HV_BUSY:
+        error_report("Error: HV_BUSY");
+        break;
+    case HV_BAD_ARGUMENT:
+        error_report("Error: HV_BAD_ARGUMENT");
+        break;
+    case HV_NO_RESOURCES:
+        error_report("Error: HV_NO_RESOURCES");
+        break;
+    case HV_NO_DEVICE:
+        error_report("Error: HV_NO_DEVICE");
+        break;
+    case HV_UNSUPPORTED:
+        error_report("Error: HV_UNSUPPORTED");
+        break;
+    default:
+        error_report("Unknown Error");
+    }
+
+    abort();
+}
diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
new file mode 100644
index 0000000000..f9bb5502b7
--- /dev/null
+++ b/accel/hvf/hvf-cpus.c
@@ -0,0 +1,468 @@
+/*
+ * Copyright 2008 IBM Corporation
+ *           2008 Red Hat, Inc.
+ * Copyright 2011 Intel Corporation
+ * Copyright 2016 Veertu, Inc.
+ * Copyright 2017 The Android Open Source Project
+ *
+ * QEMU Hypervisor.framework support
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ * This file contains code under public domain from the hvdos project:
+ * https://github.com/mist64/hvdos
+ *
+ * Parts Copyright (c) 2011 NetApp, Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+#include "qemu/main-loop.h"
+#include "exec/address-spaces.h"
+#include "exec/exec-all.h"
+#include "sysemu/cpus.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/runstate.h"
+#include "qemu/guest-random.h"
+
+#include <Hypervisor/Hypervisor.h>
+
+/* Memory slots */
+
+struct mac_slot {
+    int present;
+    uint64_t size;
+    uint64_t gpa_start;
+    uint64_t gva;
+};
+
+hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
+{
+    hvf_slot *slot;
+    int x;
+    for (x = 0; x < hvf_state->num_slots; ++x) {
+        slot = &hvf_state->slots[x];
+        if (slot->size && start < (slot->start + slot->size) &&
+            (start + size) > slot->start) {
+            return slot;
+        }
+    }
+    return NULL;
+}
+
+struct mac_slot mac_slots[32];
+
+static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
+{
+    struct mac_slot *macslot;
+    hv_return_t ret;
+
+    macslot = &mac_slots[slot->slot_id];
+
+    if (macslot->present) {
+        if (macslot->size != slot->size) {
+            macslot->present = 0;
+            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
+            assert_hvf_ok(ret);
+        }
+    }
+
+    if (!slot->size) {
+        return 0;
+    }
+
+    macslot->present = 1;
+    macslot->gpa_start = slot->start;
+    macslot->size = slot->size;
+    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
+    assert_hvf_ok(ret);
+    return 0;
+}
+
+static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
+{
+    hvf_slot *mem;
+    MemoryRegion *area = section->mr;
+    bool writeable = !area->readonly && !area->rom_device;
+    hv_memory_flags_t flags;
+
+    if (!memory_region_is_ram(area)) {
+        if (writeable) {
+            return;
+        } else if (!memory_region_is_romd(area)) {
+            /*
+             * If the memory device is not in romd_mode, then we actually want
+             * to remove the hvf memory slot so all accesses will trap.
+             */
+             add = false;
+        }
+    }
+
+    mem = hvf_find_overlap_slot(
+            section->offset_within_address_space,
+            int128_get64(section->size));
+
+    if (mem && add) {
+        if (mem->size == int128_get64(section->size) &&
+            mem->start == section->offset_within_address_space &&
+            mem->mem == (memory_region_get_ram_ptr(area) +
+            section->offset_within_region)) {
+            return; /* Same region was attempted to register, go away. */
+        }
+    }
+
+    /* Region needs to be reset. set the size to 0 and remap it. */
+    if (mem) {
+        mem->size = 0;
+        if (do_hvf_set_memory(mem, 0)) {
+            error_report("Failed to reset overlapping slot");
+            abort();
+        }
+    }
+
+    if (!add) {
+        return;
+    }
+
+    if (area->readonly ||
+        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
+        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
+    } else {
+        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
+    }
+
+    /* Now make a new slot. */
+    int x;
+
+    for (x = 0; x < hvf_state->num_slots; ++x) {
+        mem = &hvf_state->slots[x];
+        if (!mem->size) {
+            break;
+        }
+    }
+
+    if (x == hvf_state->num_slots) {
+        error_report("No free slots");
+        abort();
+    }
+
+    mem->size = int128_get64(section->size);
+    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
+    mem->start = section->offset_within_address_space;
+    mem->region = area;
+
+    if (do_hvf_set_memory(mem, flags)) {
+        error_report("Error registering new memory slot");
+        abort();
+    }
+}
+
+static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
+{
+    hvf_slot *slot;
+
+    slot = hvf_find_overlap_slot(
+            section->offset_within_address_space,
+            int128_get64(section->size));
+
+    /* protect region against writes; begin tracking it */
+    if (on) {
+        slot->flags |= HVF_SLOT_LOG;
+        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
+                      HV_MEMORY_READ);
+    /* stop tracking region */
+    } else {
+        slot->flags &= ~HVF_SLOT_LOG;
+        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
+                      HV_MEMORY_READ | HV_MEMORY_WRITE);
+    }
+}
+
+static void hvf_log_start(MemoryListener *listener,
+                          MemoryRegionSection *section, int old, int new)
+{
+    if (old != 0) {
+        return;
+    }
+
+    hvf_set_dirty_tracking(section, 1);
+}
+
+static void hvf_log_stop(MemoryListener *listener,
+                         MemoryRegionSection *section, int old, int new)
+{
+    if (new != 0) {
+        return;
+    }
+
+    hvf_set_dirty_tracking(section, 0);
+}
+
+static void hvf_log_sync(MemoryListener *listener,
+                         MemoryRegionSection *section)
+{
+    /*
+     * sync of dirty pages is handled elsewhere; just make sure we keep
+     * tracking the region.
+     */
+    hvf_set_dirty_tracking(section, 1);
+}
+
+static void hvf_region_add(MemoryListener *listener,
+                           MemoryRegionSection *section)
+{
+    hvf_set_phys_mem(section, true);
+}
+
+static void hvf_region_del(MemoryListener *listener,
+                           MemoryRegionSection *section)
+{
+    hvf_set_phys_mem(section, false);
+}
+
+static MemoryListener hvf_memory_listener = {
+    .priority = 10,
+    .region_add = hvf_region_add,
+    .region_del = hvf_region_del,
+    .log_start = hvf_log_start,
+    .log_stop = hvf_log_stop,
+    .log_sync = hvf_log_sync,
+};
+
+static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
+{
+    if (!cpu->vcpu_dirty) {
+        hvf_get_registers(cpu);
+        cpu->vcpu_dirty = true;
+    }
+}
+
+static void hvf_cpu_synchronize_state(CPUState *cpu)
+{
+    if (!cpu->vcpu_dirty) {
+        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
+    }
+}
+
+static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
+                                              run_on_cpu_data arg)
+{
+    hvf_put_registers(cpu);
+    cpu->vcpu_dirty = false;
+}
+
+static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
+{
+    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
+}
+
+static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
+                                             run_on_cpu_data arg)
+{
+    hvf_put_registers(cpu);
+    cpu->vcpu_dirty = false;
+}
+
+static void hvf_cpu_synchronize_post_init(CPUState *cpu)
+{
+    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
+}
+
+static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
+                                              run_on_cpu_data arg)
+{
+    cpu->vcpu_dirty = true;
+}
+
+static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
+{
+    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
+}
+
+static void hvf_vcpu_destroy(CPUState *cpu)
+{
+    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
+    assert_hvf_ok(ret);
+
+    hvf_arch_vcpu_destroy(cpu);
+}
+
+static void dummy_signal(int sig)
+{
+}
+
+static int hvf_init_vcpu(CPUState *cpu)
+{
+    int r;
+
+    /* init cpu signals */
+    sigset_t set;
+    struct sigaction sigact;
+
+    memset(&sigact, 0, sizeof(sigact));
+    sigact.sa_handler = dummy_signal;
+    sigaction(SIG_IPI, &sigact, NULL);
+
+    pthread_sigmask(SIG_BLOCK, NULL, &set);
+    sigdelset(&set, SIG_IPI);
+
+#ifdef __aarch64__
+    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
+#else
+    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
+#endif
+    cpu->vcpu_dirty = 1;
+    assert_hvf_ok(r);
+
+    return hvf_arch_init_vcpu(cpu);
+}
+
+/*
+ * The HVF-specific vCPU thread function. This one should only run when the host
+ * CPU supports the VMX "unrestricted guest" feature.
+ */
+static void *hvf_cpu_thread_fn(void *arg)
+{
+    CPUState *cpu = arg;
+
+    int r;
+
+    assert(hvf_enabled());
+
+    rcu_register_thread();
+
+    qemu_mutex_lock_iothread();
+    qemu_thread_get_self(cpu->thread);
+
+    cpu->thread_id = qemu_get_thread_id();
+    cpu->can_do_io = 1;
+    current_cpu = cpu;
+
+    hvf_init_vcpu(cpu);
+
+    /* signal CPU creation */
+    cpu_thread_signal_created(cpu);
+    qemu_guest_random_seed_thread_part2(cpu->random_seed);
+
+    do {
+        if (cpu_can_run(cpu)) {
+            r = hvf_vcpu_exec(cpu);
+            if (r == EXCP_DEBUG) {
+                cpu_handle_guest_debug(cpu);
+            }
+        }
+        qemu_wait_io_event(cpu);
+    } while (!cpu->unplug || cpu_can_run(cpu));
+
+    hvf_vcpu_destroy(cpu);
+    cpu_thread_signal_destroyed(cpu);
+    qemu_mutex_unlock_iothread();
+    rcu_unregister_thread();
+    return NULL;
+}
+
+static void hvf_start_vcpu_thread(CPUState *cpu)
+{
+    char thread_name[VCPU_THREAD_NAME_SIZE];
+
+    /*
+     * HVF currently does not support TCG, and only runs in
+     * unrestricted-guest mode.
+     */
+    assert(hvf_enabled());
+
+    cpu->thread = g_malloc0(sizeof(QemuThread));
+    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
+    qemu_cond_init(cpu->halt_cond);
+
+    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
+             cpu->cpu_index);
+    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
+                       cpu, QEMU_THREAD_JOINABLE);
+}
+
+static const CpusAccel hvf_cpus = {
+    .create_vcpu_thread = hvf_start_vcpu_thread,
+
+    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
+    .synchronize_post_init = hvf_cpu_synchronize_post_init,
+    .synchronize_state = hvf_cpu_synchronize_state,
+    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
+};
+
+static int hvf_accel_init(MachineState *ms)
+{
+    int x;
+    hv_return_t ret;
+    HVFState *s;
+
+    ret = hv_vm_create(HV_VM_DEFAULT);
+    assert_hvf_ok(ret);
+
+    s = g_new0(HVFState, 1);
+
+    s->num_slots = 32;
+    for (x = 0; x < s->num_slots; ++x) {
+        s->slots[x].size = 0;
+        s->slots[x].slot_id = x;
+    }
+
+    hvf_state = s;
+    memory_listener_register(&hvf_memory_listener, &address_space_memory);
+    cpus_register_accel(&hvf_cpus);
+    return 0;
+}
+
+static void hvf_accel_class_init(ObjectClass *oc, void *data)
+{
+    AccelClass *ac = ACCEL_CLASS(oc);
+    ac->name = "HVF";
+    ac->init_machine = hvf_accel_init;
+    ac->allowed = &hvf_allowed;
+}
+
+static const TypeInfo hvf_accel_type = {
+    .name = TYPE_HVF_ACCEL,
+    .parent = TYPE_ACCEL,
+    .class_init = hvf_accel_class_init,
+};
+
+static void hvf_type_init(void)
+{
+    type_register_static(&hvf_accel_type);
+}
+
+type_init(hvf_type_init);
diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
new file mode 100644
index 0000000000..dfd6b68dc7
--- /dev/null
+++ b/accel/hvf/meson.build
@@ -0,0 +1,7 @@
+hvf_ss = ss.source_set()
+hvf_ss.add(files(
+  'hvf-all.c',
+  'hvf-cpus.c',
+))
+
+specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
diff --git a/accel/meson.build b/accel/meson.build
index b26cca227a..6de12ce5d5 100644
--- a/accel/meson.build
+++ b/accel/meson.build
@@ -1,5 +1,6 @@
 softmmu_ss.add(files('accel.c'))
 
+subdir('hvf')
 subdir('qtest')
 subdir('kvm')
 subdir('tcg')
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
new file mode 100644
index 0000000000..de9bad23a8
--- /dev/null
+++ b/include/sysemu/hvf_int.h
@@ -0,0 +1,69 @@
+/*
+ * QEMU Hypervisor.framework (HVF) support
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/* header to be included in HVF-specific code */
+
+#ifndef HVF_INT_H
+#define HVF_INT_H
+
+#include <Hypervisor/Hypervisor.h>
+
+#define HVF_MAX_VCPU 0x10
+
+extern struct hvf_state hvf_global;
+
+struct hvf_vm {
+    int id;
+    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
+};
+
+struct hvf_state {
+    uint32_t version;
+    struct hvf_vm *vm;
+    uint64_t mem_quota;
+};
+
+/* hvf_slot flags */
+#define HVF_SLOT_LOG (1 << 0)
+
+typedef struct hvf_slot {
+    uint64_t start;
+    uint64_t size;
+    uint8_t *mem;
+    int slot_id;
+    uint32_t flags;
+    MemoryRegion *region;
+} hvf_slot;
+
+typedef struct hvf_vcpu_caps {
+    uint64_t vmx_cap_pinbased;
+    uint64_t vmx_cap_procbased;
+    uint64_t vmx_cap_procbased2;
+    uint64_t vmx_cap_entry;
+    uint64_t vmx_cap_exit;
+    uint64_t vmx_cap_preemption_timer;
+} hvf_vcpu_caps;
+
+struct HVFState {
+    AccelState parent;
+    hvf_slot slots[32];
+    int num_slots;
+
+    hvf_vcpu_caps *hvf_caps;
+};
+extern HVFState *hvf_state;
+
+void assert_hvf_ok(hv_return_t ret);
+int hvf_get_registers(CPUState *cpu);
+int hvf_put_registers(CPUState *cpu);
+int hvf_arch_init_vcpu(CPUState *cpu);
+void hvf_arch_vcpu_destroy(CPUState *cpu);
+int hvf_vcpu_exec(CPUState *cpu);
+hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
+
+#endif
diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
deleted file mode 100644
index 817b3d7452..0000000000
--- a/target/i386/hvf/hvf-cpus.c
+++ /dev/null
@@ -1,131 +0,0 @@
-/*
- * Copyright 2008 IBM Corporation
- *           2008 Red Hat, Inc.
- * Copyright 2011 Intel Corporation
- * Copyright 2016 Veertu, Inc.
- * Copyright 2017 The Android Open Source Project
- *
- * QEMU Hypervisor.framework support
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of version 2 of the GNU General Public
- * License as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, see <http://www.gnu.org/licenses/>.
- *
- * This file contain code under public domain from the hvdos project:
- * https://github.com/mist64/hvdos
- *
- * Parts Copyright (c) 2011 NetApp, Inc.
- * All rights reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in the
- *    documentation and/or other materials provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
- * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
- * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
- * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
- * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
- * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
- * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#include "qemu/osdep.h"
-#include "qemu/error-report.h"
-#include "qemu/main-loop.h"
-#include "sysemu/hvf.h"
-#include "sysemu/runstate.h"
-#include "target/i386/cpu.h"
-#include "qemu/guest-random.h"
-
-#include "hvf-cpus.h"
-
-/*
- * The HVF-specific vCPU thread function. This one should only run when the host
- * CPU supports the VMX "unrestricted guest" feature.
- */
-static void *hvf_cpu_thread_fn(void *arg)
-{
-    CPUState *cpu = arg;
-
-    int r;
-
-    assert(hvf_enabled());
-
-    rcu_register_thread();
-
-    qemu_mutex_lock_iothread();
-    qemu_thread_get_self(cpu->thread);
-
-    cpu->thread_id = qemu_get_thread_id();
-    cpu->can_do_io = 1;
-    current_cpu = cpu;
-
-    hvf_init_vcpu(cpu);
-
-    /* signal CPU creation */
-    cpu_thread_signal_created(cpu);
-    qemu_guest_random_seed_thread_part2(cpu->random_seed);
-
-    do {
-        if (cpu_can_run(cpu)) {
-            r = hvf_vcpu_exec(cpu);
-            if (r == EXCP_DEBUG) {
-                cpu_handle_guest_debug(cpu);
-            }
-        }
-        qemu_wait_io_event(cpu);
-    } while (!cpu->unplug || cpu_can_run(cpu));
-
-    hvf_vcpu_destroy(cpu);
-    cpu_thread_signal_destroyed(cpu);
-    qemu_mutex_unlock_iothread();
-    rcu_unregister_thread();
-    return NULL;
-}
-
-static void hvf_start_vcpu_thread(CPUState *cpu)
-{
-    char thread_name[VCPU_THREAD_NAME_SIZE];
-
-    /*
-     * HVF currently does not support TCG, and only runs in
-     * unrestricted-guest mode.
-     */
-    assert(hvf_enabled());
-
-    cpu->thread = g_malloc0(sizeof(QemuThread));
-    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
-    qemu_cond_init(cpu->halt_cond);
-
-    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
-             cpu->cpu_index);
-    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
-                       cpu, QEMU_THREAD_JOINABLE);
-}
-
-const CpusAccel hvf_cpus = {
-    .create_vcpu_thread = hvf_start_vcpu_thread,
-
-    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
-    .synchronize_post_init = hvf_cpu_synchronize_post_init,
-    .synchronize_state = hvf_cpu_synchronize_state,
-    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
-};
diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
deleted file mode 100644
index ced31b82c0..0000000000
--- a/target/i386/hvf/hvf-cpus.h
+++ /dev/null
@@ -1,25 +0,0 @@
-/*
- * Accelerator CPUS Interface
- *
- * Copyright 2020 SUSE LLC
- *
- * This work is licensed under the terms of the GNU GPL, version 2 or later.
- * See the COPYING file in the top-level directory.
- */
-
-#ifndef HVF_CPUS_H
-#define HVF_CPUS_H
-
-#include "sysemu/cpus.h"
-
-extern const CpusAccel hvf_cpus;
-
-int hvf_init_vcpu(CPUState *);
-int hvf_vcpu_exec(CPUState *);
-void hvf_cpu_synchronize_state(CPUState *);
-void hvf_cpu_synchronize_post_reset(CPUState *);
-void hvf_cpu_synchronize_post_init(CPUState *);
-void hvf_cpu_synchronize_pre_loadvm(CPUState *);
-void hvf_vcpu_destroy(CPUState *);
-
-#endif /* HVF_CPUS_H */
diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
index e0edffd077..6d56f8f6bb 100644
--- a/target/i386/hvf/hvf-i386.h
+++ b/target/i386/hvf/hvf-i386.h
@@ -18,57 +18,11 @@
 
 #include "sysemu/accel.h"
 #include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
 #include "cpu.h"
 #include "x86.h"
 
-#define HVF_MAX_VCPU 0x10
-
-extern struct hvf_state hvf_global;
-
-struct hvf_vm {
-    int id;
-    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
-};
-
-struct hvf_state {
-    uint32_t version;
-    struct hvf_vm *vm;
-    uint64_t mem_quota;
-};
-
-/* hvf_slot flags */
-#define HVF_SLOT_LOG (1 << 0)
-
-typedef struct hvf_slot {
-    uint64_t start;
-    uint64_t size;
-    uint8_t *mem;
-    int slot_id;
-    uint32_t flags;
-    MemoryRegion *region;
-} hvf_slot;
-
-typedef struct hvf_vcpu_caps {
-    uint64_t vmx_cap_pinbased;
-    uint64_t vmx_cap_procbased;
-    uint64_t vmx_cap_procbased2;
-    uint64_t vmx_cap_entry;
-    uint64_t vmx_cap_exit;
-    uint64_t vmx_cap_preemption_timer;
-} hvf_vcpu_caps;
-
-struct HVFState {
-    AccelState parent;
-    hvf_slot slots[32];
-    int num_slots;
-
-    hvf_vcpu_caps *hvf_caps;
-};
-extern HVFState *hvf_state;
-
-void hvf_set_phys_mem(MemoryRegionSection *, bool);
 void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
-hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
 
 #ifdef NEED_CPU_H
 /* Functions exported to host specific mode */
diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
index ed9356565c..8b96ecd619 100644
--- a/target/i386/hvf/hvf.c
+++ b/target/i386/hvf/hvf.c
@@ -51,6 +51,7 @@
 #include "qemu/error-report.h"
 
 #include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
 #include "sysemu/runstate.h"
 #include "hvf-i386.h"
 #include "vmcs.h"
@@ -72,171 +73,6 @@
 #include "sysemu/accel.h"
 #include "target/i386/cpu.h"
 
-#include "hvf-cpus.h"
-
-HVFState *hvf_state;
-
-static void assert_hvf_ok(hv_return_t ret)
-{
-    if (ret == HV_SUCCESS) {
-        return;
-    }
-
-    switch (ret) {
-    case HV_ERROR:
-        error_report("Error: HV_ERROR");
-        break;
-    case HV_BUSY:
-        error_report("Error: HV_BUSY");
-        break;
-    case HV_BAD_ARGUMENT:
-        error_report("Error: HV_BAD_ARGUMENT");
-        break;
-    case HV_NO_RESOURCES:
-        error_report("Error: HV_NO_RESOURCES");
-        break;
-    case HV_NO_DEVICE:
-        error_report("Error: HV_NO_DEVICE");
-        break;
-    case HV_UNSUPPORTED:
-        error_report("Error: HV_UNSUPPORTED");
-        break;
-    default:
-        error_report("Unknown Error");
-    }
-
-    abort();
-}
-
-/* Memory slots */
-hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
-{
-    hvf_slot *slot;
-    int x;
-    for (x = 0; x < hvf_state->num_slots; ++x) {
-        slot = &hvf_state->slots[x];
-        if (slot->size && start < (slot->start + slot->size) &&
-            (start + size) > slot->start) {
-            return slot;
-        }
-    }
-    return NULL;
-}
-
-struct mac_slot {
-    int present;
-    uint64_t size;
-    uint64_t gpa_start;
-    uint64_t gva;
-};
-
-struct mac_slot mac_slots[32];
-
-static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
-{
-    struct mac_slot *macslot;
-    hv_return_t ret;
-
-    macslot = &mac_slots[slot->slot_id];
-
-    if (macslot->present) {
-        if (macslot->size != slot->size) {
-            macslot->present = 0;
-            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
-            assert_hvf_ok(ret);
-        }
-    }
-
-    if (!slot->size) {
-        return 0;
-    }
-
-    macslot->present = 1;
-    macslot->gpa_start = slot->start;
-    macslot->size = slot->size;
-    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
-    assert_hvf_ok(ret);
-    return 0;
-}
-
-void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
-{
-    hvf_slot *mem;
-    MemoryRegion *area = section->mr;
-    bool writeable = !area->readonly && !area->rom_device;
-    hv_memory_flags_t flags;
-
-    if (!memory_region_is_ram(area)) {
-        if (writeable) {
-            return;
-        } else if (!memory_region_is_romd(area)) {
-            /*
-             * If the memory device is not in romd_mode, then we actually want
-             * to remove the hvf memory slot so all accesses will trap.
-             */
-             add = false;
-        }
-    }
-
-    mem = hvf_find_overlap_slot(
-            section->offset_within_address_space,
-            int128_get64(section->size));
-
-    if (mem && add) {
-        if (mem->size == int128_get64(section->size) &&
-            mem->start == section->offset_within_address_space &&
-            mem->mem == (memory_region_get_ram_ptr(area) +
-            section->offset_within_region)) {
-            return; /* Same region was attempted to register, go away. */
-        }
-    }
-
-    /* Region needs to be reset. set the size to 0 and remap it. */
-    if (mem) {
-        mem->size = 0;
-        if (do_hvf_set_memory(mem, 0)) {
-            error_report("Failed to reset overlapping slot");
-            abort();
-        }
-    }
-
-    if (!add) {
-        return;
-    }
-
-    if (area->readonly ||
-        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
-        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
-    } else {
-        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
-    }
-
-    /* Now make a new slot. */
-    int x;
-
-    for (x = 0; x < hvf_state->num_slots; ++x) {
-        mem = &hvf_state->slots[x];
-        if (!mem->size) {
-            break;
-        }
-    }
-
-    if (x == hvf_state->num_slots) {
-        error_report("No free slots");
-        abort();
-    }
-
-    mem->size = int128_get64(section->size);
-    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
-    mem->start = section->offset_within_address_space;
-    mem->region = area;
-
-    if (do_hvf_set_memory(mem, flags)) {
-        error_report("Error registering new memory slot");
-        abort();
-    }
-}
-
 void vmx_update_tpr(CPUState *cpu)
 {
     /* TODO: need integrate APIC handling */
@@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
     }
 }
 
-static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
-{
-    if (!cpu->vcpu_dirty) {
-        hvf_get_registers(cpu);
-        cpu->vcpu_dirty = true;
-    }
-}
-
-void hvf_cpu_synchronize_state(CPUState *cpu)
-{
-    if (!cpu->vcpu_dirty) {
-        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
-    }
-}
-
-static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
-                                              run_on_cpu_data arg)
-{
-    hvf_put_registers(cpu);
-    cpu->vcpu_dirty = false;
-}
-
-void hvf_cpu_synchronize_post_reset(CPUState *cpu)
-{
-    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
-}
-
-static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
-                                             run_on_cpu_data arg)
-{
-    hvf_put_registers(cpu);
-    cpu->vcpu_dirty = false;
-}
-
-void hvf_cpu_synchronize_post_init(CPUState *cpu)
-{
-    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
-}
-
-static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
-                                              run_on_cpu_data arg)
-{
-    cpu->vcpu_dirty = true;
-}
-
-void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
-{
-    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
-}
-
 static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
 {
     int read, write;
@@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
     return false;
 }
 
-static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
-{
-    hvf_slot *slot;
-
-    slot = hvf_find_overlap_slot(
-            section->offset_within_address_space,
-            int128_get64(section->size));
-
-    /* protect region against writes; begin tracking it */
-    if (on) {
-        slot->flags |= HVF_SLOT_LOG;
-        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
-                      HV_MEMORY_READ);
-    /* stop tracking region*/
-    } else {
-        slot->flags &= ~HVF_SLOT_LOG;
-        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
-                      HV_MEMORY_READ | HV_MEMORY_WRITE);
-    }
-}
-
-static void hvf_log_start(MemoryListener *listener,
-                          MemoryRegionSection *section, int old, int new)
-{
-    if (old != 0) {
-        return;
-    }
-
-    hvf_set_dirty_tracking(section, 1);
-}
-
-static void hvf_log_stop(MemoryListener *listener,
-                         MemoryRegionSection *section, int old, int new)
-{
-    if (new != 0) {
-        return;
-    }
-
-    hvf_set_dirty_tracking(section, 0);
-}
-
-static void hvf_log_sync(MemoryListener *listener,
-                         MemoryRegionSection *section)
-{
-    /*
-     * sync of dirty pages is handled elsewhere; just make sure we keep
-     * tracking the region.
-     */
-    hvf_set_dirty_tracking(section, 1);
-}
-
-static void hvf_region_add(MemoryListener *listener,
-                           MemoryRegionSection *section)
-{
-    hvf_set_phys_mem(section, true);
-}
-
-static void hvf_region_del(MemoryListener *listener,
-                           MemoryRegionSection *section)
-{
-    hvf_set_phys_mem(section, false);
-}
-
-static MemoryListener hvf_memory_listener = {
-    .priority = 10,
-    .region_add = hvf_region_add,
-    .region_del = hvf_region_del,
-    .log_start = hvf_log_start,
-    .log_stop = hvf_log_stop,
-    .log_sync = hvf_log_sync,
-};
-
-void hvf_vcpu_destroy(CPUState *cpu)
+void hvf_arch_vcpu_destroy(CPUState *cpu)
 {
     X86CPU *x86_cpu = X86_CPU(cpu);
     CPUX86State *env = &x86_cpu->env;
 
-    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
     g_free(env->hvf_mmio_buf);
-    assert_hvf_ok(ret);
-}
-
-static void dummy_signal(int sig)
-{
 }
 
-int hvf_init_vcpu(CPUState *cpu)
+int hvf_arch_init_vcpu(CPUState *cpu)
 {
 
     X86CPU *x86cpu = X86_CPU(cpu);
     CPUX86State *env = &x86cpu->env;
-    int r;
-
-    /* init cpu signals */
-    sigset_t set;
-    struct sigaction sigact;
-
-    memset(&sigact, 0, sizeof(sigact));
-    sigact.sa_handler = dummy_signal;
-    sigaction(SIG_IPI, &sigact, NULL);
-
-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
 
     init_emu();
     init_decoder();
@@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
     hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
     env->hvf_mmio_buf = g_new(char, 4096);
 
-    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
-    cpu->vcpu_dirty = 1;
-    assert_hvf_ok(r);
-
     if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
         &hvf_state->hvf_caps->vmx_cap_pinbased)) {
         abort();
@@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
 
     return ret;
 }
-
-bool hvf_allowed;
-
-static int hvf_accel_init(MachineState *ms)
-{
-    int x;
-    hv_return_t ret;
-    HVFState *s;
-
-    ret = hv_vm_create(HV_VM_DEFAULT);
-    assert_hvf_ok(ret);
-
-    s = g_new0(HVFState, 1);
- 
-    s->num_slots = 32;
-    for (x = 0; x < s->num_slots; ++x) {
-        s->slots[x].size = 0;
-        s->slots[x].slot_id = x;
-    }
-  
-    hvf_state = s;
-    memory_listener_register(&hvf_memory_listener, &address_space_memory);
-    cpus_register_accel(&hvf_cpus);
-    return 0;
-}
-
-static void hvf_accel_class_init(ObjectClass *oc, void *data)
-{
-    AccelClass *ac = ACCEL_CLASS(oc);
-    ac->name = "HVF";
-    ac->init_machine = hvf_accel_init;
-    ac->allowed = &hvf_allowed;
-}
-
-static const TypeInfo hvf_accel_type = {
-    .name = TYPE_HVF_ACCEL,
-    .parent = TYPE_ACCEL,
-    .class_init = hvf_accel_class_init,
-};
-
-static void hvf_type_init(void)
-{
-    type_register_static(&hvf_accel_type);
-}
-
-type_init(hvf_type_init);
diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
index 409c9a3f14..c8a43717ee 100644
--- a/target/i386/hvf/meson.build
+++ b/target/i386/hvf/meson.build
@@ -1,6 +1,5 @@
 i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
   'hvf.c',
-  'hvf-cpus.c',
   'x86.c',
   'x86_cpuid.c',
   'x86_decode.c',
diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
index bbec412b6c..89b8e9d87a 100644
--- a/target/i386/hvf/x86hvf.c
+++ b/target/i386/hvf/x86hvf.c
@@ -20,6 +20,9 @@
 #include "qemu/osdep.h"
 
 #include "qemu-common.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/hw_accel.h"
 #include "x86hvf.h"
 #include "vmx.h"
 #include "vmcs.h"
@@ -32,8 +35,6 @@
 #include <Hypervisor/hv.h>
 #include <Hypervisor/hv_vmx.h>
 
-#include "hvf-cpus.h"
-
 void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
                      SegmentCache *qseg, bool is_tr)
 {
@@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
     env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
 
     if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
-        hvf_cpu_synchronize_state(cpu_state);
+        cpu_synchronize_state(cpu_state);
         do_cpu_init(cpu);
     }
 
@@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
         cpu_state->halted = 0;
     }
     if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
-        hvf_cpu_synchronize_state(cpu_state);
+        cpu_synchronize_state(cpu_state);
         do_cpu_sipi(cpu);
     }
     if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
         cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
-        hvf_cpu_synchronize_state(cpu_state);
+        cpu_synchronize_state(cpu_state);
         apic_handle_tpr_access_report(cpu->apic_state, env->eip,
                                       env->tpr_access_type);
     }
diff --git a/target/i386/hvf/x86hvf.h b/target/i386/hvf/x86hvf.h
index 635ab0f34e..99ed8d608d 100644
--- a/target/i386/hvf/x86hvf.h
+++ b/target/i386/hvf/x86hvf.h
@@ -21,8 +21,6 @@
 #include "x86_descr.h"
 
 int hvf_process_events(CPUState *);
-int hvf_put_registers(CPUState *);
-int hvf_get_registers(CPUState *);
 bool hvf_inject_interrupts(CPUState *);
 void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
                      SegmentCache *qseg, bool is_tr);
-- 
2.24.3 (Apple Git-128)



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 3/8] arm: Set PSCI to 0.2 for HVF
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
  2020-11-26 21:50 ` [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries Alexander Graf
  2020-11-26 21:50 ` [PATCH 2/8] hvf: Move common code out Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-26 21:50 ` [PATCH 4/8] arm: Synchronize CPU on PSCI on Alexander Graf
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

In Hypervisor.framework, we just pass PSCI calls straight on to QEMU's emulation
of them. That means that if TCG is compatible with PSCI 0.2, so are we. Let's
reflect that fact in code too.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 target/arm/cpu.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 07492e9f9a..db6f7c34ed 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1062,6 +1062,10 @@ static void arm_cpu_initfn(Object *obj)
     if (tcg_enabled()) {
         cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
     }
+
+    if (hvf_enabled()) {
+        cpu->psci_version = 2; /* HVF uses TCG's PSCI */
+    }
 }
 
 static Property arm_cpu_gt_cntfrq_property =
-- 
2.24.3 (Apple Git-128)




* [PATCH 4/8] arm: Synchronize CPU on PSCI on
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
                   ` (2 preceding siblings ...)
  2020-11-26 21:50 ` [PATCH 3/8] arm: Set PSCI to 0.2 for HVF Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-26 21:50 ` [PATCH 5/8] hvf: Add Apple Silicon support Alexander Graf
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

We are going to reuse the TCG PSCI code for HVF. This, however, means that we
need to ensure that CPU register state is synchronized properly between the
two worlds.

So let's make sure that, at least on the PSCI CPU_ON call, the secondary core
synchronizes its registers after reset, so that the changes also propagate.
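
The fix relies on QEMU's vcpu-dirty register synchronization protocol:
registers are fetched from the hypervisor only when the cached copy is clean,
and written back before the next guest entry once the copy has been marked
dirty. A minimal stand-alone sketch of that protocol follows; the struct and
function names here are illustrative stand-ins, not QEMU's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative vCPU with a cached register file (not QEMU's real layout) */
typedef struct {
    uint64_t hw_regs;     /* stand-in for state held by the hypervisor */
    uint64_t cached_regs; /* QEMU-side cached copy */
    bool vcpu_dirty;      /* cached copy modified; must be written back */
} SketchVCPU;

/* Analogous to cpu_synchronize_state(): pull registers unless already dirty */
static void sketch_synchronize_state(SketchVCPU *cpu)
{
    if (!cpu->vcpu_dirty) {
        cpu->cached_regs = cpu->hw_regs; /* like hvf_get_registers() */
        cpu->vcpu_dirty = true;          /* QEMU may now modify the copy */
    }
}

/* Before entering the guest: push any modified registers back */
static void sketch_pre_run(SketchVCPU *cpu)
{
    if (cpu->vcpu_dirty) {
        cpu->hw_regs = cpu->cached_regs; /* like hvf_put_registers() */
        cpu->vcpu_dirty = false;
    }
}
```

The point of synchronizing on CPU_ON is that the register writes done during
power-on then operate on a synchronized, dirty-marked copy, so they propagate
back to the hypervisor on the next run instead of being silently dropped.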

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 target/arm/arm-powerctl.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/target/arm/arm-powerctl.c b/target/arm/arm-powerctl.c
index b75f813b40..256f7cfdcd 100644
--- a/target/arm/arm-powerctl.c
+++ b/target/arm/arm-powerctl.c
@@ -15,6 +15,7 @@
 #include "arm-powerctl.h"
 #include "qemu/log.h"
 #include "qemu/main-loop.h"
+#include "sysemu/hw_accel.h"
 
 #ifndef DEBUG_ARM_POWERCTL
 #define DEBUG_ARM_POWERCTL 0
@@ -66,6 +67,8 @@ static void arm_set_cpu_on_async_work(CPUState *target_cpu_state,
     cpu_reset(target_cpu_state);
     target_cpu_state->halted = 0;
 
+    cpu_synchronize_state(target_cpu_state);
+
     if (info->target_aa64) {
         if ((info->target_el < 3) && arm_feature(&target_cpu->env,
                                                  ARM_FEATURE_EL3)) {
-- 
2.24.3 (Apple Git-128)




* [PATCH 5/8] hvf: Add Apple Silicon support
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
                   ` (3 preceding siblings ...)
  2020-11-26 21:50 ` [PATCH 4/8] arm: Synchronize CPU on PSCI on Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-26 21:50 ` [PATCH 6/8] hvf: Use OS provided vcpu kick function Alexander Graf
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

With Apple Silicon available to the masses, it's a good time to add support
for driving its virtualization extensions from QEMU.

This patch adds all necessary architecture specific code to get basic VMs
working. It's still pretty raw, but definitely functional.

Known limitations:

  - Vtimer acknowledgement is hacky
  - We should implement more sysregs, and fault on the invalid ones
  - WFI handling is missing; it needs to be married with the vtimer
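
The sysreg handling in this patch packs the trap syndrome's op0/op1/op2/CRn/CRm
fields into a single comparable value. That packing can be exercised
stand-alone; the SYSREG() macro below mirrors the one in target/arm/hvf/hvf.c,
while the decode helpers are illustrative additions, not part of QEMU:

```c
#include <assert.h>
#include <stdint.h>

/* Same packing as the SYSREG() macro introduced in target/arm/hvf/hvf.c */
#define SYSREG(op0, op1, op2, crn, crm) \
    ((op0 << 20) | (op2 << 17) | (op1 << 14) | (crn << 10) | (crm << 1))
#define SYSREG_CNTPCT_EL0 SYSREG(3, 3, 1, 14, 0)

/* Illustrative decode helpers: the inverse of the packing above */
static inline uint32_t sysreg_op0(uint32_t r) { return (r >> 20) & 0x3; }
static inline uint32_t sysreg_op1(uint32_t r) { return (r >> 14) & 0x7; }
static inline uint32_t sysreg_op2(uint32_t r) { return (r >> 17) & 0x7; }
static inline uint32_t sysreg_crn(uint32_t r) { return (r >> 10) & 0xf; }
static inline uint32_t sysreg_crm(uint32_t r) { return (r >> 1) & 0xf; }
```

Faulting on unhandled registers would then amount to injecting an exception in
the default branch of the switch over these packed values, rather than the
current fly-blind "return 0" behaviour.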

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 MAINTAINERS           |   5 +
 accel/hvf/hvf-cpus.c  |   4 +
 include/hw/core/cpu.h |   3 +-
 target/arm/hvf/hvf.c  | 345 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 356 insertions(+), 1 deletion(-)
 create mode 100644 target/arm/hvf/hvf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ca4b6d9279..9cd1d9d448 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -439,6 +439,11 @@ F: accel/accel.c
 F: accel/Makefile.objs
 F: accel/stubs/Makefile.objs
 
+Apple Silicon HVF CPUs
+M: Alexander Graf <agraf@csgraf.de>
+S: Maintained
+F: target/arm/hvf/
+
 X86 HVF CPUs
 M: Cameron Esfahani <dirty@apple.com>
 M: Roman Bolshakov <r.bolshakov@yadro.com>
diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index f9bb5502b7..b9f674478d 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -60,6 +60,10 @@
 
 #include <Hypervisor/Hypervisor.h>
 
+#ifdef __aarch64__
+#define HV_VM_DEFAULT NULL
+#endif
+
 /* Memory slots */
 
 struct mac_slot {
diff --git a/include/hw/core/cpu.h b/include/hw/core/cpu.h
index 3d92c967ff..a711eb04e4 100644
--- a/include/hw/core/cpu.h
+++ b/include/hw/core/cpu.h
@@ -463,7 +463,8 @@ struct CPUState {
 
     struct hax_vcpu_state *hax_vcpu;
 
-    int hvf_fd;
+    uint64_t hvf_fd;
+    void *hvf_exit;
 
     /* track IOMMUs whose translations we've cached in the TCG TLB */
     GArray *iommu_notifiers;
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
new file mode 100644
index 0000000000..6b9a02e21c
--- /dev/null
+++ b/target/arm/hvf/hvf.c
@@ -0,0 +1,345 @@
+/*
+ * QEMU Hypervisor.framework support for Apple Silicon
+ *
+ * Copyright 2020 Alexander Graf <agraf@csgraf.de>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+
+#include "sysemu/runstate.h"
+#include "sysemu/hvf.h"
+#include "sysemu/hvf_int.h"
+#include "sysemu/hw_accel.h"
+
+#include <Hypervisor/Hypervisor.h>
+
+#include "exec/address-spaces.h"
+#include "hw/irq.h"
+#include "qemu/main-loop.h"
+#include "sysemu/accel.h"
+#include "target/arm/cpu.h"
+#include "target/arm/internals.h"
+
+#define HVF_DEBUG 0
+#define DPRINTF(...)                                        \
+    if (HVF_DEBUG) {                                        \
+        fprintf(stderr, "HVF %s:%d ", __func__, __LINE__);  \
+        fprintf(stderr, __VA_ARGS__);                       \
+        fprintf(stderr, "\n");                              \
+    }
+
+#define SYSREG(op0, op1, op2, crn, crm) \
+    ((op0 << 20) | (op2 << 17) | (op1 << 14) | (crn << 10) | (crm << 1))
+#define SYSREG_MASK           SYSREG(0x3, 0x7, 0x7, 0xf, 0xf)
+#define SYSREG_CNTPCT_EL0     SYSREG(3, 3, 1, 14, 0)
+#define SYSREG_PMCCNTR_EL0    SYSREG(3, 3, 0, 9, 13)
+
+struct hvf_reg_match {
+    int reg;
+    uint64_t offset;
+};
+
+static const struct hvf_reg_match hvf_reg_match[] = {
+    { HV_REG_X0,   offsetof(CPUARMState, xregs[0]) },
+    { HV_REG_X1,   offsetof(CPUARMState, xregs[1]) },
+    { HV_REG_X2,   offsetof(CPUARMState, xregs[2]) },
+    { HV_REG_X3,   offsetof(CPUARMState, xregs[3]) },
+    { HV_REG_X4,   offsetof(CPUARMState, xregs[4]) },
+    { HV_REG_X5,   offsetof(CPUARMState, xregs[5]) },
+    { HV_REG_X6,   offsetof(CPUARMState, xregs[6]) },
+    { HV_REG_X7,   offsetof(CPUARMState, xregs[7]) },
+    { HV_REG_X8,   offsetof(CPUARMState, xregs[8]) },
+    { HV_REG_X9,   offsetof(CPUARMState, xregs[9]) },
+    { HV_REG_X10,  offsetof(CPUARMState, xregs[10]) },
+    { HV_REG_X11,  offsetof(CPUARMState, xregs[11]) },
+    { HV_REG_X12,  offsetof(CPUARMState, xregs[12]) },
+    { HV_REG_X13,  offsetof(CPUARMState, xregs[13]) },
+    { HV_REG_X14,  offsetof(CPUARMState, xregs[14]) },
+    { HV_REG_X15,  offsetof(CPUARMState, xregs[15]) },
+    { HV_REG_X16,  offsetof(CPUARMState, xregs[16]) },
+    { HV_REG_X17,  offsetof(CPUARMState, xregs[17]) },
+    { HV_REG_X18,  offsetof(CPUARMState, xregs[18]) },
+    { HV_REG_X19,  offsetof(CPUARMState, xregs[19]) },
+    { HV_REG_X20,  offsetof(CPUARMState, xregs[20]) },
+    { HV_REG_X21,  offsetof(CPUARMState, xregs[21]) },
+    { HV_REG_X22,  offsetof(CPUARMState, xregs[22]) },
+    { HV_REG_X23,  offsetof(CPUARMState, xregs[23]) },
+    { HV_REG_X24,  offsetof(CPUARMState, xregs[24]) },
+    { HV_REG_X25,  offsetof(CPUARMState, xregs[25]) },
+    { HV_REG_X26,  offsetof(CPUARMState, xregs[26]) },
+    { HV_REG_X27,  offsetof(CPUARMState, xregs[27]) },
+    { HV_REG_X28,  offsetof(CPUARMState, xregs[28]) },
+    { HV_REG_X29,  offsetof(CPUARMState, xregs[29]) },
+    { HV_REG_X30,  offsetof(CPUARMState, xregs[30]) },
+    { HV_REG_PC,   offsetof(CPUARMState, pc) },
+};
+
+int hvf_get_registers(CPUState *cpu)
+{
+    ARMCPU *arm_cpu = ARM_CPU(cpu);
+    CPUARMState *env = &arm_cpu->env;
+    hv_return_t ret;
+    uint64_t val;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(hvf_reg_match); i++) {
+        ret = hv_vcpu_get_reg(cpu->hvf_fd, hvf_reg_match[i].reg, &val);
+        *(uint64_t *)((void *)env + hvf_reg_match[i].offset) = val;
+        assert_hvf_ok(ret);
+    }
+
+    val = 0;
+    ret = hv_vcpu_get_reg(cpu->hvf_fd, HV_REG_FPCR, &val);
+    assert_hvf_ok(ret);
+    vfp_set_fpcr(env, val);
+
+    val = 0;
+    ret = hv_vcpu_get_reg(cpu->hvf_fd, HV_REG_FPSR, &val);
+    assert_hvf_ok(ret);
+    vfp_set_fpsr(env, val);
+
+    ret = hv_vcpu_get_reg(cpu->hvf_fd, HV_REG_CPSR, &val);
+    assert_hvf_ok(ret);
+    pstate_write(env, val);
+
+    return 0;
+}
+
+int hvf_put_registers(CPUState *cpu)
+{
+    ARMCPU *arm_cpu = ARM_CPU(cpu);
+    CPUARMState *env = &arm_cpu->env;
+    hv_return_t ret;
+    uint64_t val;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(hvf_reg_match); i++) {
+        val = *(uint64_t *)((void *)env + hvf_reg_match[i].offset);
+        ret = hv_vcpu_set_reg(cpu->hvf_fd, hvf_reg_match[i].reg, val);
+
+        assert_hvf_ok(ret);
+    }
+
+    ret = hv_vcpu_set_reg(cpu->hvf_fd, HV_REG_FPCR, vfp_get_fpcr(env));
+    assert_hvf_ok(ret);
+
+    ret = hv_vcpu_set_reg(cpu->hvf_fd, HV_REG_FPSR, vfp_get_fpsr(env));
+    assert_hvf_ok(ret);
+
+    ret = hv_vcpu_set_reg(cpu->hvf_fd, HV_REG_CPSR, pstate_read(env));
+    assert_hvf_ok(ret);
+
+    ret = hv_vcpu_set_sys_reg(cpu->hvf_fd, HV_SYS_REG_MPIDR_EL1,
+                              arm_cpu->mp_affinity);
+    assert_hvf_ok(ret);
+
+    return 0;
+}
+
+void hvf_arch_vcpu_destroy(CPUState *cpu)
+{
+}
+
+int hvf_arch_init_vcpu(CPUState *cpu)
+{
+    ARMCPU *arm_cpu = ARM_CPU(cpu);
+    CPUARMState *env = &arm_cpu->env;
+
+    env->aarch64 = 1;
+
+    return 0;
+}
+
+static int hvf_process_events(CPUState *cpu)
+{
+    DPRINTF("");
+    return 0;
+}
+
+static int hvf_inject_interrupts(CPUState *cpu)
+{
+    if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
+        DPRINTF("injecting FIQ");
+        hv_vcpu_set_pending_interrupt(cpu->hvf_fd, HV_INTERRUPT_TYPE_FIQ, true);
+    }
+
+    if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
+        DPRINTF("injecting IRQ");
+        hv_vcpu_set_pending_interrupt(cpu->hvf_fd, HV_INTERRUPT_TYPE_IRQ, true);
+    }
+
+    return 0;
+}
+
+int hvf_vcpu_exec(CPUState *cpu)
+{
+    ARMCPU *arm_cpu = ARM_CPU(cpu);
+    CPUARMState *env = &arm_cpu->env;
+    hv_vcpu_exit_t *hvf_exit = cpu->hvf_exit;
+    int ret = 0;
+
+    if (hvf_process_events(cpu)) {
+        return EXCP_HLT;
+    }
+
+    do {
+        process_queued_cpu_work(cpu);
+
+        if (cpu->vcpu_dirty) {
+            hvf_put_registers(cpu);
+            cpu->vcpu_dirty = false;
+        }
+
+        if (hvf_inject_interrupts(cpu)) {
+            return EXCP_INTERRUPT;
+        }
+
+        qemu_mutex_unlock_iothread();
+        if (cpu->cpu_index && cpu->halted) {
+            qemu_mutex_lock_iothread();
+            return EXCP_HLT;
+        }
+
+        assert_hvf_ok(hv_vcpu_run(cpu->hvf_fd));
+
+        /* handle VMEXIT */
+        uint64_t exit_reason = hvf_exit->reason;
+        uint64_t syndrome = hvf_exit->exception.syndrome;
+        uint32_t ec = syn_get_ec(syndrome);
+
+        cpu_synchronize_state(cpu);
+
+        qemu_mutex_lock_iothread();
+
+        current_cpu = cpu;
+
+        switch (exit_reason) {
+        case HV_EXIT_REASON_EXCEPTION:
+            /* This is the main one, handle below. */
+            break;
+        case HV_EXIT_REASON_VTIMER_ACTIVATED:
+            qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
+            continue;
+        case HV_EXIT_REASON_CANCELED:
+            /* we got kicked, no exit to process */
+            continue;
+        default:
+            assert(0);
+        }
+
+        ret = 0;
+        switch (ec) {
+        case EC_DATAABORT: {
+            bool isv = syndrome & ARM_EL_ISV;
+            bool iswrite = (syndrome >> 6) & 1;
+            bool s1ptw = (syndrome >> 7) & 1;
+            uint32_t sas = (syndrome >> 22) & 3;
+            uint32_t len = 1 << sas;
+            uint32_t srt = (syndrome >> 16) & 0x1f;
+            uint64_t val = 0;
+
+            DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
+                    "iswrite=%x s1ptw=%x len=%d srt=%d]\n",
+                    env->pc, hvf_exit->exception.virtual_address,
+                    hvf_exit->exception.physical_address, isv, iswrite,
+                    s1ptw, len, srt);
+
+            assert(isv);
+
+            if (iswrite) {
+                val = env->xregs[srt];
+                address_space_write(&address_space_memory,
+                                    hvf_exit->exception.physical_address,
+                                    MEMTXATTRS_UNSPECIFIED, &val, len);
+
+                /*
+                 * We do not have a callback to tell us when the vtimer state
+                 * changes. That means every MMIO write could potentially be
+                 * an EOI that ends the vtimer. Until we get an actual
+                 * callback, let's just check whether the timer is still
+                 * pending on every possible toggle point.
+                 */
+                qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 0);
+                hv_vcpu_set_vtimer_mask(cpu->hvf_fd, false);
+            } else {
+                address_space_read(&address_space_memory,
+                                   hvf_exit->exception.physical_address,
+                                   MEMTXATTRS_UNSPECIFIED, &val, len);
+                env->xregs[srt] = val;
+            }
+
+            env->pc += 4;
+            break;
+        }
+        case EC_SYSTEMREGISTERTRAP: {
+            bool isread = (syndrome >> 21) & 1;
+            uint32_t rt = (syndrome >> 5) & 0x1f;
+            uint32_t reg = syndrome & SYSREG_MASK;
+            uint64_t val = 0;
+
+            if (isread) {
+                switch (reg) {
+                case SYSREG_CNTPCT_EL0:
+                    val = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
+                          gt_cntfrq_period_ns(arm_cpu);
+                    break;
+                case SYSREG_PMCCNTR_EL0:
+                    val = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
+                    break;
+                default:
+                    DPRINTF("unhandled sysreg read %08x (op0=%d op1=%d op2=%d "
+                            "crn=%d crm=%d)", reg, (reg >> 20) & 0x3,
+                            (reg >> 14) & 0x7, (reg >> 17) & 0x7,
+                            (reg >> 10) & 0xf, (reg >> 1) & 0xf);
+                    break;
+                }
+
+                env->xregs[rt] = val;
+            } else {
+                val = env->xregs[rt];
+                switch (reg) {
+                case SYSREG_CNTPCT_EL0:
+                    break;
+                default:
+                    DPRINTF("unhandled sysreg write %08x", reg);
+                    break;
+                }
+            }
+
+            env->pc += 4;
+            break;
+        }
+        case EC_WFX_TRAP:
+            /* No halting yet */
+            break;
+        case EC_AA64_HVC:
+            if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
+                arm_handle_psci_call(arm_cpu);
+            } else {
+                DPRINTF("unknown HVC! %016llx", env->xregs[0]);
+                env->xregs[0] = -1;
+            }
+            break;
+        case EC_AA64_SMC:
+            if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
+                arm_handle_psci_call(arm_cpu);
+            } else {
+                DPRINTF("unknown SMC! %016llx", env->xregs[0]);
+                env->xregs[0] = -1;
+                env->pc += 4;
+            }
+            break;
+        default:
+            DPRINTF("exit: %llx [ec=0x%x pc=0x%llx]", syndrome, ec, env->pc);
+            error_report("%llx: unhandled exit %llx", env->pc, exit_reason);
+        }
+    } while (ret == 0);
+
+    return ret;
+}
-- 
2.24.3 (Apple Git-128)


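The EC_DATAABORT handler in this patch unpacks the ISV, WnR, S1PTW, SAS and SRT
fields from the exception syndrome before emulating the MMIO access. That
unpacking can be checked stand-alone; the bit positions below are copied from
the patch, the struct and helper are illustrative, and the test syndrome is a
made-up example rather than real hardware output:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ARM_EL_ISV (1u << 24) /* instruction syndrome valid, as in QEMU */

/* Field extraction as done in hvf_vcpu_exec()'s EC_DATAABORT case */
typedef struct {
    bool isv;     /* syndrome describes the access; required for emulation */
    bool iswrite; /* write (true) or read (false) */
    bool s1ptw;   /* fault on a stage-1 page table walk */
    uint32_t len; /* access size in bytes: 1 << SAS */
    uint32_t srt; /* number of the transfer register */
} DataAbort;

static DataAbort decode_data_abort(uint64_t syndrome)
{
    DataAbort da;
    da.isv = syndrome & ARM_EL_ISV;
    da.iswrite = (syndrome >> 6) & 1;
    da.s1ptw = (syndrome >> 7) & 1;
    da.len = 1u << ((syndrome >> 22) & 3);
    da.srt = (syndrome >> 16) & 0x1f;
    return da;
}
```

The assert(isv) in the patch matters here: without a valid instruction
syndrome, none of these fields are meaningful and the access cannot be
emulated this way.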


* [PATCH 6/8] hvf: Use OS provided vcpu kick function
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
                   ` (4 preceding siblings ...)
  2020-11-26 21:50 ` [PATCH 5/8] hvf: Add Apple Silicon support Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-26 22:18   ` Eduardo Habkost
  2020-11-26 21:50 ` [PATCH 7/8] arm: Add Hypervisor.framework build target Alexander Graf
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

On Apple Silicon, the OS provides a function that explicitly kicks another
vCPU out of guest execution for us. That works better than the current
signaling logic, so let's make use of it there.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 accel/hvf/hvf-cpus.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index b9f674478d..74a272d2e8 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -418,8 +418,20 @@ static void hvf_start_vcpu_thread(CPUState *cpu)
                        cpu, QEMU_THREAD_JOINABLE);
 }
 
+#ifdef __aarch64__
+static void hvf_kick_vcpu_thread(CPUState *cpu)
+{
+    if (!qemu_cpu_is_self(cpu)) {
+        hv_vcpus_exit(&cpu->hvf_fd, 1);
+    }
+}
+#endif
+
 static const CpusAccel hvf_cpus = {
     .create_vcpu_thread = hvf_start_vcpu_thread,
+#ifdef __aarch64__
+    .kick_vcpu_thread = hvf_kick_vcpu_thread,
+#endif
 
     .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
     .synchronize_post_init = hvf_cpu_synchronize_post_init,
-- 
2.24.3 (Apple Git-128)




* [PATCH 7/8] arm: Add Hypervisor.framework build target
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
                   ` (5 preceding siblings ...)
  2020-11-26 21:50 ` [PATCH 6/8] hvf: Use OS provided vcpu kick function Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-27  4:59   ` Paolo Bonzini
  2020-11-26 21:50 ` [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework Alexander Graf
  2020-11-26 22:10 ` [PATCH 0/8] hvf: Implement Apple Silicon Support Eduardo Habkost
  8 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

Now that we have all the logic in place that we need to handle
Hypervisor.framework on Apple Silicon systems, let's add CONFIG_HVF for
aarch64 as well so that we can build it.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 meson.build                | 9 ++++++++-
 target/arm/hvf/meson.build | 3 +++
 target/arm/meson.build     | 2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)
 create mode 100644 target/arm/hvf/meson.build

diff --git a/meson.build b/meson.build
index 2a7ff5560c..21565f5787 100644
--- a/meson.build
+++ b/meson.build
@@ -74,16 +74,23 @@ else
 endif
 
 accelerator_targets = { 'CONFIG_KVM': kvm_targets }
+
+if cpu in ['x86', 'x86_64']
+  hvf_targets = ['i386-softmmu', 'x86_64-softmmu']
+elif cpu in ['aarch64']
+  hvf_targets = ['aarch64-softmmu']
+endif
+
 if cpu in ['x86', 'x86_64', 'arm', 'aarch64']
   # i368 emulator provides xenpv machine type for multiple architectures
   accelerator_targets += {
     'CONFIG_XEN': ['i386-softmmu', 'x86_64-softmmu'],
+    'CONFIG_HVF': hvf_targets,
   }
 endif
 if cpu in ['x86', 'x86_64']
   accelerator_targets += {
     'CONFIG_HAX': ['i386-softmmu', 'x86_64-softmmu'],
-    'CONFIG_HVF': ['x86_64-softmmu'],
     'CONFIG_WHPX': ['i386-softmmu', 'x86_64-softmmu'],
   }
 endif
diff --git a/target/arm/hvf/meson.build b/target/arm/hvf/meson.build
new file mode 100644
index 0000000000..855e6cce5a
--- /dev/null
+++ b/target/arm/hvf/meson.build
@@ -0,0 +1,3 @@
+arm_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
+  'hvf.c',
+))
diff --git a/target/arm/meson.build b/target/arm/meson.build
index f5de2a77b8..95bebae216 100644
--- a/target/arm/meson.build
+++ b/target/arm/meson.build
@@ -56,5 +56,7 @@ arm_softmmu_ss.add(files(
   'psci.c',
 ))
 
+subdir('hvf')
+
 target_arch += {'arm': arm_ss}
 target_softmmu_arch += {'arm': arm_softmmu_ss}
-- 
2.24.3 (Apple Git-128)




* [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
                   ` (6 preceding siblings ...)
  2020-11-26 21:50 ` [PATCH 7/8] arm: Add Hypervisor.framework build target Alexander Graf
@ 2020-11-26 21:50 ` Alexander Graf
  2020-11-26 22:14   ` Eduardo Habkost
  2020-11-26 22:10 ` [PATCH 0/8] hvf: Implement Apple Silicon Support Eduardo Habkost
  8 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-26 21:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Paolo Bonzini

The Apple M1 only supports up to 36 bits of physical address space. That
means we cannot fit the 64bit MMIO BAR region into our address space.

To fix this, let's not expose a 64bit MMIO BAR region when running on
Apple Silicon.

I have not been able to find a way to enumerate that limit easily, so
let's just assume we always have such a small PA space on
hypervisor.framework systems.

Signed-off-by: Alexander Graf <agraf@csgraf.de>
---
 hw/arm/virt.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 27dbeb549e..d74053ecd4 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -45,6 +45,7 @@
 #include "hw/display/ramfb.h"
 #include "net/net.h"
 #include "sysemu/device_tree.h"
+#include "sysemu/hvf.h"
 #include "sysemu/numa.h"
 #include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
@@ -1746,6 +1747,14 @@ static void machvirt_init(MachineState *machine)
     unsigned int smp_cpus = machine->smp.cpus;
     unsigned int max_cpus = machine->smp.max_cpus;
 
+    /*
+     * On Hypervisor.framework capable systems, we only have 36 bits of PA
+     * space, which is not enough to fit a 64bit BAR space
+     */
+    if (hvf_enabled()) {
+        vms->highmem = false;
+    }
+
     /*
      * In accelerated mode, the memory map is computed earlier in kvm_type()
      * to create a VM with the right number of IPA bits.
-- 
2.24.3 (Apple Git-128)




* Re: [PATCH 0/8] hvf: Implement Apple Silicon Support
  2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
                   ` (7 preceding siblings ...)
  2020-11-26 21:50 ` [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework Alexander Graf
@ 2020-11-26 22:10 ` Eduardo Habkost
  2020-11-27 17:48   ` Philippe Mathieu-Daudé
  8 siblings, 1 reply; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-26 22:10 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Claudio Fontana, Paolo Bonzini

On Thu, Nov 26, 2020 at 10:50:09PM +0100, Alexander Graf wrote:
> Now that Apple Silicon is widely available, people are obviously excited
> to try and run virtualized workloads on them, such as Linux and Windows.
> 
> This patch set implements a rudimentary, first version to get the ball
> going on that. With this applied, I can successfully run both Linux and
> Windows as guests, albeit with a few caveats:
> 
>   * no WFI emulation, a vCPU always uses 100%
>   * vtimer handling is a bit hacky
>   * we handle most sysregs flying blindly, just returning 0
>   * XHCI breaks in OVMF, works in Linux+Windows
> 
> Despite those drawbacks, it's still an exciting place to start playing
> with the power of Apple Silicon.
> 
> Enjoy!
> 
> Alex
> 
> Alexander Graf (8):
>   hvf: Add hypervisor entitlement to output binaries
>   hvf: Move common code out
>   arm: Set PSCI to 0.2 for HVF
>   arm: Synchronize CPU on PSCI on
>   hvf: Add Apple Silicon support
>   hvf: Use OS provided vcpu kick function
>   arm: Add Hypervisor.framework build target
>   hw/arm/virt: Disable highmem when on hypervisor.framework
> 
>  MAINTAINERS                  |  14 +-
>  accel/hvf/entitlements.plist |   8 +
>  accel/hvf/hvf-all.c          |  56 ++++
>  accel/hvf/hvf-cpus.c         | 484 +++++++++++++++++++++++++++++++++++
>  accel/hvf/meson.build        |   7 +
>  accel/meson.build            |   1 +

This seems to conflict with the accel cleanup work being done by
Claudio[1].  Maybe Claudio could cherry-pick some of the code
movement patches from this series, or this series could be
rebased on top of his.

[1] https://lore.kernel.org/qemu-devel/20201124162210.8796-1-cfontana@suse.de

>  hw/arm/virt.c                |   9 +
>  include/hw/core/cpu.h        |   3 +-
>  include/sysemu/hvf_int.h     |  69 +++++
>  meson.build                  |  39 ++-
>  scripts/entitlement.sh       |  11 +
>  target/arm/arm-powerctl.c    |   3 +
>  target/arm/cpu.c             |   4 +
>  target/arm/hvf/hvf.c         | 345 +++++++++++++++++++++++++
>  target/arm/hvf/meson.build   |   3 +
>  target/arm/meson.build       |   2 +
>  target/i386/hvf/hvf-cpus.c   | 131 ----------
>  target/i386/hvf/hvf-cpus.h   |  25 --
>  target/i386/hvf/hvf-i386.h   |  48 +---
>  target/i386/hvf/hvf.c        | 360 +-------------------------
>  target/i386/hvf/meson.build  |   1 -
>  target/i386/hvf/x86hvf.c     |  11 +-
>  target/i386/hvf/x86hvf.h     |   2 -
>  23 files changed, 1061 insertions(+), 575 deletions(-)
>  create mode 100644 accel/hvf/entitlements.plist
>  create mode 100644 accel/hvf/hvf-all.c
>  create mode 100644 accel/hvf/hvf-cpus.c
>  create mode 100644 accel/hvf/meson.build
>  create mode 100644 include/sysemu/hvf_int.h
>  create mode 100755 scripts/entitlement.sh
>  create mode 100644 target/arm/hvf/hvf.c
>  create mode 100644 target/arm/hvf/meson.build
>  delete mode 100644 target/i386/hvf/hvf-cpus.c
>  delete mode 100644 target/i386/hvf/hvf-cpus.h
> 
> -- 
> 2.24.3 (Apple Git-128)
> 

-- 
Eduardo




* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-26 21:50 ` [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework Alexander Graf
@ 2020-11-26 22:14   ` Eduardo Habkost
  2020-11-26 22:29     ` Peter Maydell
  0 siblings, 1 reply; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-26 22:14 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Paolo Bonzini

On Thu, Nov 26, 2020 at 10:50:17PM +0100, Alexander Graf wrote:
> The Apple M1 only supports up to 36 bits of physical address space. That
> means we can not fit the 64bit MMIO BAR region into our address space.
> 
> To fix this, let's not expose a 64bit MMIO BAR region when running on
> Apple Silicon.
> 
> I have not been able to find a way to enumerate that easily, so let's
> just assume we always have that little PA space on hypervisor.framework
> systems.
> 
> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> ---
>  hw/arm/virt.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 27dbeb549e..d74053ecd4 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -45,6 +45,7 @@
>  #include "hw/display/ramfb.h"
>  #include "net/net.h"
>  #include "sysemu/device_tree.h"
> +#include "sysemu/hvf.h"
>  #include "sysemu/numa.h"
>  #include "sysemu/runstate.h"
>  #include "sysemu/sysemu.h"
> @@ -1746,6 +1747,14 @@ static void machvirt_init(MachineState *machine)
>      unsigned int smp_cpus = machine->smp.cpus;
>      unsigned int max_cpus = machine->smp.max_cpus;
>  
> +    /*
> +     * On Hypervisor.framework capable systems, we only have 36 bits of PA
> +     * space, which is not enough to fit a 64bit BAR space
> +     */
> +    if (hvf_enabled()) {
> +        vms->highmem = false;
> +    }

Direct checks for *_enabled() are a pain to clean up later when
we add support to new accelerators.  Can't this be implemented as
(e.g.) a AccelClass::max_physical_address_bits field?

> +
>      /*
>       * In accelerated mode, the memory map is computed earlier in kvm_type()
>       * to create a VM with the right number of IPA bits.
> -- 
> 2.24.3 (Apple Git-128)
> 

-- 
Eduardo




* Re: [PATCH 6/8] hvf: Use OS provided vcpu kick function
  2020-11-26 21:50 ` [PATCH 6/8] hvf: Use OS provided vcpu kick function Alexander Graf
@ 2020-11-26 22:18   ` Eduardo Habkost
  2020-11-30  2:42     ` Alexander Graf
  0 siblings, 1 reply; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-26 22:18 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Claudio Fontana, Paolo Bonzini

On Thu, Nov 26, 2020 at 10:50:15PM +0100, Alexander Graf wrote:
> When kicking another vCPU, we get an OS function that explicitly does that for us
> on Apple Silicon. That works better than the current signaling logic, let's make
> use of it there.
> 
> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> ---
>  accel/hvf/hvf-cpus.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> index b9f674478d..74a272d2e8 100644
> --- a/accel/hvf/hvf-cpus.c
> +++ b/accel/hvf/hvf-cpus.c
> @@ -418,8 +418,20 @@ static void hvf_start_vcpu_thread(CPUState *cpu)
>                         cpu, QEMU_THREAD_JOINABLE);
>  }
>  
> +#ifdef __aarch64__
> +static void hvf_kick_vcpu_thread(CPUState *cpu)
> +{
> +    if (!qemu_cpu_is_self(cpu)) {
> +        hv_vcpus_exit(&cpu->hvf_fd, 1);
> +    }
> +}
> +#endif
> +
>  static const CpusAccel hvf_cpus = {
>      .create_vcpu_thread = hvf_start_vcpu_thread,
> +#ifdef __aarch64__
> +    .kick_vcpu_thread = hvf_kick_vcpu_thread,
> +#endif

Interesting.  We have considered the possibility of adding
arch-specific TYPE_ACCEL subclasses when discussing Claudio's
series.  Here we have another arch-specific hack that could be
avoided if we had a TYPE_ARM_HVF_ACCEL QOM class.

>  
>      .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>      .synchronize_post_init = hvf_cpu_synchronize_post_init,
> -- 
> 2.24.3 (Apple Git-128)
> 

-- 
Eduardo




* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-26 22:14   ` Eduardo Habkost
@ 2020-11-26 22:29     ` Peter Maydell
  2020-11-27 16:26       ` Eduardo Habkost
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Maydell @ 2020-11-26 22:29 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, qemu-arm, Paolo Bonzini

On Thu, 26 Nov 2020 at 22:14, Eduardo Habkost <ehabkost@redhat.com> wrote:
>
> On Thu, Nov 26, 2020 at 10:50:17PM +0100, Alexander Graf wrote:
> > The Apple M1 only supports up to 36 bits of physical address space. That
> > means we can not fit the 64bit MMIO BAR region into our address space.
> >
> > To fix this, let's not expose a 64bit MMIO BAR region when running on
> > Apple Silicon.
> >
> > I have not been able to find a way to enumerate that easily, so let's
> > just assume we always have that little PA space on hypervisor.framework
> > systems.
> >
> > Signed-off-by: Alexander Graf <agraf@csgraf.de>
> > ---
> >  hw/arm/virt.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 27dbeb549e..d74053ecd4 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -45,6 +45,7 @@
> >  #include "hw/display/ramfb.h"
> >  #include "net/net.h"
> >  #include "sysemu/device_tree.h"
> > +#include "sysemu/hvf.h"
> >  #include "sysemu/numa.h"
> >  #include "sysemu/runstate.h"
> >  #include "sysemu/sysemu.h"
> > @@ -1746,6 +1747,14 @@ static void machvirt_init(MachineState *machine)
> >      unsigned int smp_cpus = machine->smp.cpus;
> >      unsigned int max_cpus = machine->smp.max_cpus;
> >
> > +    /*
> > +     * On Hypervisor.framework capable systems, we only have 36 bits of PA
> > +     * space, which is not enough to fit a 64bit BAR space
> > +     */
> > +    if (hvf_enabled()) {
> > +        vms->highmem = false;
> > +    }
>
> Direct checks for *_enabled() are a pain to clean up later when
> we add support to new accelerators.  Can't this be implemented as
> (e.g.) a AccelClass::max_physical_address_bits field?

It's a property of the CPU (eg our emulated TCG CPUs may have
varying supported numbers of physical address bits). So the
virt board ought to look at the CPU, and the CPU should be
set up with the right information for all of KVM, TCG, HVF
(either a specific max_phys_addr_bits value or just ensure
its ID_AA64MMFR0_EL1.PARange is right, not sure which would
be easier/nicer).

thanks
-- PMM



* Re: [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries
  2020-11-26 21:50 ` [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries Alexander Graf
@ 2020-11-27  4:54   ` Paolo Bonzini
  2020-11-27 19:44   ` Roman Bolshakov
  1 sibling, 0 replies; 64+ messages in thread
From: Paolo Bonzini @ 2020-11-27  4:54 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm

On 26/11/20 22:50, Alexander Graf wrote:
> +rm -f "$2"
> +cp -a "$SRC" "$DST"
> +codesign --entitlements "$ENTITLEMENT" --force -s - "$DST"

Slight improvement to avoid races between ^C and this script:

set -e
trap 'rm "$DST.tmp"' exit
cp -a "$SRC" "$DST.tmp"
codesign --entitlements "$ENTITLEMENT" --force -s - "$DST.tmp"
mv "$DST.tmp" "$DST"
trap '' exit

Paolo




* Re: [PATCH 7/8] arm: Add Hypervisor.framework build target
  2020-11-26 21:50 ` [PATCH 7/8] arm: Add Hypervisor.framework build target Alexander Graf
@ 2020-11-27  4:59   ` Paolo Bonzini
  0 siblings, 0 replies; 64+ messages in thread
From: Paolo Bonzini @ 2020-11-27  4:59 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson,
	Cameron Esfahani, Roman Bolshakov, qemu-arm

On 26/11/20 22:50, Alexander Graf wrote:
> Now that we have all logic in place that we need to handle Hypervisor.framework
> on Apple Silicon systems, let's add CONFIG_HVF for aarch64 as well so that we
> can build it.
> 
> Signed-off-by: Alexander Graf <agraf@csgraf.de>

Between patch 1 and this one, this series is a nice showcase for the 
good, the bad and the ugly of Meson... :)

> diff --git a/meson.build b/meson.build
> index 2a7ff5560c..21565f5787 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -74,16 +74,23 @@ else
>   endif
>   
>   accelerator_targets = { 'CONFIG_KVM': kvm_targets }
> +
> +if cpu in ['x86', 'x86_64']
> +  hvf_targets = ['i386-softmmu', 'x86_64-softmmu']
> +elif cpu in ['aarch64']
> +  hvf_targets = ['aarch64-softmmu']
> +endif
> +
>   if cpu in ['x86', 'x86_64', 'arm', 'aarch64']

This would fail to compile on 32-bit ARM.  Simpler to add an 
"hvf_targets = []" else branch above, and add "'CONFIG_HVF': 
hvf_targets" unconditionally.  That is, copy even more of what it is 
doing for KVM.
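
Concretely, Paolo's suggested fix might look like this sketch (mirroring how kvm_targets is handled earlier in meson.build):

```meson
if cpu in ['x86', 'x86_64']
  hvf_targets = ['i386-softmmu', 'x86_64-softmmu']
elif cpu in ['aarch64']
  hvf_targets = ['aarch64-softmmu']
else
  hvf_targets = []
endif

# Unconditional, like CONFIG_KVM above; empty lists are harmless
accelerator_targets += { 'CONFIG_HVF': hvf_targets }
```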

Paolo

>     # i368 emulator provides xenpv machine type for multiple architectures
>     accelerator_targets += {
>       'CONFIG_XEN': ['i386-softmmu', 'x86_64-softmmu'],
> +    'CONFIG_HVF': hvf_targets,
>     }
>   endif
>   if cpu in ['x86', 'x86_64']
>     accelerator_targets += {
>       'CONFIG_HAX': ['i386-softmmu', 'x86_64-softmmu'],
> -    'CONFIG_HVF': ['x86_64-softmmu'],
>       'CONFIG_WHPX': ['i386-softmmu', 'x86_64-softmmu'],
>     }
>   endif
> diff --git a/target/arm/hvf/meson.build b/target/arm/hvf/meson.build
> new file mode 100644
> index 0000000000..855e6cce5a
> --- /dev/null
> +++ b/target/arm/hvf/meson.build
> @@ -0,0 +1,3 @@
> +arm_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
> +  'hvf.c',
> +))
> diff --git a/target/arm/meson.build b/target/arm/meson.build
> index f5de2a77b8..95bebae216 100644
> --- a/target/arm/meson.build
> +++ b/target/arm/meson.build
> @@ -56,5 +56,7 @@ arm_softmmu_ss.add(files(
>     'psci.c',
>   ))
>   
> +subdir('hvf')
> +
>   target_arch += {'arm': arm_ss}
>   target_softmmu_arch += {'arm': arm_softmmu_ss}
> 




* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-26 22:29     ` Peter Maydell
@ 2020-11-27 16:26       ` Eduardo Habkost
  2020-11-27 16:38         ` Peter Maydell
  0 siblings, 1 reply; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-27 16:26 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Thu, Nov 26, 2020 at 10:29:01PM +0000, Peter Maydell wrote:
> On Thu, 26 Nov 2020 at 22:14, Eduardo Habkost <ehabkost@redhat.com> wrote:
> >
> > On Thu, Nov 26, 2020 at 10:50:17PM +0100, Alexander Graf wrote:
> > > The Apple M1 only supports up to 36 bits of physical address space. That
> > > means we can not fit the 64bit MMIO BAR region into our address space.
> > >
> > > To fix this, let's not expose a 64bit MMIO BAR region when running on
> > > Apple Silicon.
> > >
> > > I have not been able to find a way to enumerate that easily, so let's
> > > just assume we always have that little PA space on hypervisor.framework
> > > systems.
> > >
> > > Signed-off-by: Alexander Graf <agraf@csgraf.de>
> > > ---
> > >  hw/arm/virt.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > > index 27dbeb549e..d74053ecd4 100644
> > > --- a/hw/arm/virt.c
> > > +++ b/hw/arm/virt.c
> > > @@ -45,6 +45,7 @@
> > >  #include "hw/display/ramfb.h"
> > >  #include "net/net.h"
> > >  #include "sysemu/device_tree.h"
> > > +#include "sysemu/hvf.h"
> > >  #include "sysemu/numa.h"
> > >  #include "sysemu/runstate.h"
> > >  #include "sysemu/sysemu.h"
> > > @@ -1746,6 +1747,14 @@ static void machvirt_init(MachineState *machine)
> > >      unsigned int smp_cpus = machine->smp.cpus;
> > >      unsigned int max_cpus = machine->smp.max_cpus;
> > >
> > > +    /*
> > > +     * On Hypervisor.framework capable systems, we only have 36 bits of PA
> > > +     * space, which is not enough to fit a 64bit BAR space
> > > +     */
> > > +    if (hvf_enabled()) {
> > > +        vms->highmem = false;
> > > +    }
> >
> > Direct checks for *_enabled() are a pain to clean up later when
> > we add support to new accelerators.  Can't this be implemented as
> > (e.g.) a AccelClass::max_physical_address_bits field?
> 
> It's a property of the CPU (eg our emulated TCG CPUs may have
> varying supported numbers of physical address bits). So the
> virt board ought to look at the CPU, and the CPU should be
> set up with the right information for all of KVM, TCG, HVF
> (either a specific max_phys_addr_bits value or just ensure
> its ID_AA64MMFR0_EL1.PARange is right, not sure which would
> be easier/nicer).

Agreed.

My suggestion would still apply to the CPU code that will pick
the address size; ideally, accel-specific behaviour should be
represented as meaningful fields in AccelClass (either data or
virtual methods) instead of direct *_enabled() checks.

-- 
Eduardo




* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 16:26       ` Eduardo Habkost
@ 2020-11-27 16:38         ` Peter Maydell
  2020-11-27 16:47           ` Eduardo Habkost
  2020-11-27 16:47           ` Peter Maydell
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Maydell @ 2020-11-27 16:38 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, 27 Nov 2020 at 16:26, Eduardo Habkost <ehabkost@redhat.com> wrote:
>
> On Thu, Nov 26, 2020 at 10:29:01PM +0000, Peter Maydell wrote:
> > On Thu, 26 Nov 2020 at 22:14, Eduardo Habkost <ehabkost@redhat.com> wrote:
> > > Direct checks for *_enabled() are a pain to clean up later when
> > > we add support to new accelerators.  Can't this be implemented as
> > > (e.g.) a AccelClass::max_physical_address_bits field?
> >
> > It's a property of the CPU (eg our emulated TCG CPUs may have
> > varying supported numbers of physical address bits). So the
> > virt board ought to look at the CPU, and the CPU should be
> > set up with the right information for all of KVM, TCG, HVF
> > (either a specific max_phys_addr_bits value or just ensure
> > its ID_AA64MMFR0_EL1.PARange is right, not sure which would
> > be easier/nicer).
>
> Agreed.
>
> My suggestion would still apply to the CPU code that will pick
> the address size; ideally, accel-specific behaviour should be
> represented as meaningful fields in AccelClass (either data or
> virtual methods) instead of direct *_enabled() checks.

Having looked a bit more closely at some of the relevant target/arm
code, I think the best approach is going to be that in virt.c
we just check the PARange ID register field (probably via
a convenience function that does the conversion of that to
a nice number-of-bits return value; we might even have one
already). KVM and TCG both already set that ID register field
in the CPU struct correctly in their existing
implicitly-accelerator-specific code; HVF needs to do the same.

thanks
-- PMM



* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 16:38         ` Peter Maydell
@ 2020-11-27 16:47           ` Eduardo Habkost
  2020-11-27 16:53             ` Peter Maydell
  2020-11-27 16:47           ` Peter Maydell
  1 sibling, 1 reply; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-27 16:47 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, Nov 27, 2020 at 04:38:18PM +0000, Peter Maydell wrote:
> On Fri, 27 Nov 2020 at 16:26, Eduardo Habkost <ehabkost@redhat.com> wrote:
> >
> > On Thu, Nov 26, 2020 at 10:29:01PM +0000, Peter Maydell wrote:
> > > On Thu, 26 Nov 2020 at 22:14, Eduardo Habkost <ehabkost@redhat.com> wrote:
> > > > Direct checks for *_enabled() are a pain to clean up later when
> > > > we add support to new accelerators.  Can't this be implemented as
> > > > (e.g.) a AccelClass::max_physical_address_bits field?
> > >
> > > It's a property of the CPU (eg our emulated TCG CPUs may have
> > > varying supported numbers of physical address bits). So the
> > > virt board ought to look at the CPU, and the CPU should be
> > > set up with the right information for all of KVM, TCG, HVF
> > > (either a specific max_phys_addr_bits value or just ensure
> > > its ID_AA64MMFR0_EL1.PARange is right, not sure which would
> > > be easier/nicer).
> >
> > Agreed.
> >
> > My suggestion would still apply to the CPU code that will pick
> > the address size; ideally, accel-specific behaviour should be
> > represented as meaningful fields in AccelClass (either data or
> > virtual methods) instead of direct *_enabled() checks.
> 
> Having looked a bit more closely at some of the relevant target/arm
> code, I think the best approach is going to be that in virt.c
> we just check the PARange ID register field (probably via
> a convenience function that does the conversion of that to
> a nice number-of-bits return value; we might even have one
> already). KVM and TCG both already set that ID register field
> in the CPU struct correctly in their existing
> implicitly-accelerator-specific code; HVF needs to do the same.

Do you know how the implicitly-accelerator-specific code is
implemented?  PARange is in id_aa64mmfr0, correct?  I don't see
any accel-specific code for initializing id_aa64mmfr0.

-- 
Eduardo




* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 16:38         ` Peter Maydell
  2020-11-27 16:47           ` Eduardo Habkost
@ 2020-11-27 16:47           ` Peter Maydell
  2020-11-30  2:40             ` Alexander Graf
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Maydell @ 2020-11-27 16:47 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, 27 Nov 2020 at 16:38, Peter Maydell <peter.maydell@linaro.org> wrote:
> Having looked a bit more closely at some of the relevant target/arm
> code, I think the best approach is going to be that in virt.c
> we just check the PARange ID register field (probably via
> a convenience function that does the conversion of that to
> a nice number-of-bits return value; we might even have one
> already).

Ha, in fact we're already doing something quite close to this,
though instead of saying "decide whether to use highmem based
on the CPU's PA range" we go for "report error to user if PA
range is insufficient" and let the user pick some command line
options that disable highmem if they want:

        if (aarch64 && vms->highmem) {
            int requested_pa_size = 64 - clz64(vms->highest_gpa);
            int pamax = arm_pamax(ARM_CPU(first_cpu));

            if (pamax < requested_pa_size) {
                error_report("VCPU supports less PA bits (%d) than "
                             "requested by the memory map (%d)",
                             pamax, requested_pa_size);
                exit(1);
            }
        }

thanks
-- PMM



* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 16:47           ` Eduardo Habkost
@ 2020-11-27 16:53             ` Peter Maydell
  2020-11-27 17:17               ` Eduardo Habkost
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Maydell @ 2020-11-27 16:53 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, 27 Nov 2020 at 16:47, Eduardo Habkost <ehabkost@redhat.com> wrote:
> Do you know how the implicitly-accelerator-specific code is
> implemented?  PARange is in id_aa64mmfr0, correct?  I don't see
> any accel-specific code for initializing id_aa64mmfr0.

For TCG, the value of id_aa64mmfr0 is set by the per-cpu
init functions aarch64_a57_initfn(), aarch64_a72_initfn(), etc.

For KVM, if we're using "-cpu cortex-a53" or "-cpu cortex-a57"
these only work if the host CPU really is an A53 or A57, in
which case the reset value set by the initfn is correct.
In the more usual case of "-cpu host", we ask the kernel for
the ID register values in kvm_arm_get_host_cpu_features(),
which is part of the implementation of
kvm_arm_set_cpu_features_from_host(), which gets called
in aarch64_max_initfn() (inside a kvm_enabled() conditional).

So there is a *_enabled() check involved, which I hadn't
realised until I worked back up to where this stuff is called.

thanks
-- PMM



* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 16:53             ` Peter Maydell
@ 2020-11-27 17:17               ` Eduardo Habkost
  2020-11-27 18:16                 ` Peter Maydell
  0 siblings, 1 reply; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-27 17:17 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, Nov 27, 2020 at 04:53:59PM +0000, Peter Maydell wrote:
> On Fri, 27 Nov 2020 at 16:47, Eduardo Habkost <ehabkost@redhat.com> wrote:
> > Do you know how the implicitly-accelerator-specific code is
> > implemented?  PARange is in id_aa64mmfr0, correct?  I don't see
> > any accel-specific code for initializing id_aa64mmfr0.
> 
> For TCG, the value of id_aa64mmfr0 is set by the per-cpu
> init functions aarch64_a57_initfn(), aarch64_a72_initfn(), etc.
> 
> For KVM, if we're using "-cpu cortex-a53" or "-cpu cortex-a57"
> these only work if the host CPU really is an A53 or A57, in
> which case the reset value set by the initfn is correct.
> In the more usual case of "-cpu host", we ask the kernel for
> the ID register values in kvm_arm_get_host_cpu_features(),
> which is part of the implementation of
> kvm_arm_set_cpu_features_from_host(), which gets called
> in aarch64_max_initfn() (inside a kvm_enabled() conditional).
> 
> So there is a *_enabled() check involved, which I hadn't
> realised until I worked back up to where this stuff is called.
> 

Thanks!  Is the data returned by kvm_arm_get_host_cpu_features()
supposed to eventually affect the value of id_aa64mmfr0?  I don't
see how that could happen.

-- 
Eduardo



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/8] hvf: Implement Apple Silicon Support
  2020-11-26 22:10 ` [PATCH 0/8] hvf: Implement Apple Silicon Support Eduardo Habkost
@ 2020-11-27 17:48   ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 64+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-11-27 17:48 UTC (permalink / raw)
  To: Eduardo Habkost, Alexander Graf, Claudio Fontana
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Paolo Bonzini

On 11/26/20 11:10 PM, Eduardo Habkost wrote:
> On Thu, Nov 26, 2020 at 10:50:09PM +0100, Alexander Graf wrote:
>> Now that Apple Silicon is widely available, people are obviously excited
>> to try and run virtualized workloads on them, such as Linux and Windows.
>>
>> This patch set implements a rudimentary, first version to get the ball
>> going on that. With this applied, I can successfully run both Linux and
>> Windows as guests, albeit with a few caveats:
>>
>>   * no WFI emulation, a vCPU always uses 100%
>>   * vtimer handling is a bit hacky
>>   * we handle most sysregs flying blindly, just returning 0
>>   * XHCI breaks in OVMF, works in Linux+Windows
>>
>> Despite those drawbacks, it's still an exciting place to start playing
>> with the power of Apple Silicon.
>>
>> Enjoy!
>>
>> Alex
>>
>> Alexander Graf (8):
>>   hvf: Add hypervisor entitlement to output binaries
>>   hvf: Move common code out
>>   arm: Set PSCI to 0.2 for HVF
>>   arm: Synchronize CPU on PSCI on
>>   hvf: Add Apple Silicon support
>>   hvf: Use OS provided vcpu kick function
>>   arm: Add Hypervisor.framework build target
>>   hw/arm/virt: Disable highmem when on hypervisor.framework
>>
>>  MAINTAINERS                  |  14 +-
>>  accel/hvf/entitlements.plist |   8 +
>>  accel/hvf/hvf-all.c          |  56 ++++
>>  accel/hvf/hvf-cpus.c         | 484 +++++++++++++++++++++++++++++++++++
>>  accel/hvf/meson.build        |   7 +
>>  accel/meson.build            |   1 +
> 
> This seems to conflict with the accel cleanup work being done by
> Claudio[1].  Maybe Claudio could cherry-pick some of the code
> movement patches from this series, or this series could be
> rebased on top of his.

It seems easier for Claudio to cherry-pick patch 2/8
of this series ("hvf: Move common code out") and rebase
on top.

Claudio's series is still tagged RFC, but if you were
planing to queue it, you could take patch 2/8 out of
this series, as it is generic, and let the HVF/AA64
specific bits still being discussed.

> 
> [1] https://lore.kernel.org/qemu-devel/20201124162210.8796-1-cfontana@suse.de



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 17:17               ` Eduardo Habkost
@ 2020-11-27 18:16                 ` Peter Maydell
  2020-11-27 18:20                   ` Eduardo Habkost
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Maydell @ 2020-11-27 18:16 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, 27 Nov 2020 at 17:18, Eduardo Habkost <ehabkost@redhat.com> wrote:
> Thanks!  Is the data returned by kvm_arm_get_host_cpu_features()
> supposed to eventually affect the value of id_aa64mmfr0?  I don't
> see how that could happen.

kvm_arm_get_host_cpu_features() does:
        err |= read_sys_reg64(fdarray[2], &ahcf->isar.id_aa64mmfr0,
                              ARM64_SYS_REG(3, 0, 0, 7, 0));

which is filling in data in the ARMHostCPUFeatures* that it is
passed as an argument. The caller is kvm_arm_set_cpu_features_from_host(),
which does
 kvm_arm_get_host_cpu_features(&arm_host_cpu_features)
(assuming it hasn't already done it once and cached the results;
arm_host_cpu_features is a global) and then
 cpu->isar = arm_host_cpu_features.isar;
thus copying the ID values into the "struct ARMISARegisters isar"
that is part of the ARMCPU struct. (It also copies across the
'features' word which gets set up with ARM_FEATURE_* flags
for the benefit of the parts of the target code which key off
those rather than ID register fields.)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 18:16                 ` Peter Maydell
@ 2020-11-27 18:20                   ` Eduardo Habkost
  0 siblings, 0 replies; 64+ messages in thread
From: Eduardo Habkost @ 2020-11-27 18:20 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini

On Fri, Nov 27, 2020 at 06:16:27PM +0000, Peter Maydell wrote:
> On Fri, 27 Nov 2020 at 17:18, Eduardo Habkost <ehabkost@redhat.com> wrote:
> > Thanks!  Is the data returned by kvm_arm_get_host_cpu_features()
> > supposed to eventually affect the value of id_aa64mmfr0?  I don't
> > see how that could happen.
> 
> kvm_arm_get_host_cpu_features() does:
>         err |= read_sys_reg64(fdarray[2], &ahcf->isar.id_aa64mmfr0,
>                               ARM64_SYS_REG(3, 0, 0, 7, 0));
> 
> which is filling in data in the ARMHostCPUFeatures* that it is
> passed as an argument. The caller is kvm_arm_set_cpu_features_from_host(),
> which does
>  kvm_arm_get_host_cpu_features(&arm_host_cpu_features)
> (assuming it hasn't already done it once and cached the results;
> arm_host_cpu_features is a global) and then
>  cpu->isar = arm_host_cpu_features.isar;
> thus copying the ID values into the "struct ARMISARegisters isar"
> that is part of the ARMCPU struct. (It also copies across the
> 'features' word which gets set up with ARM_FEATURE_* flags
> for the benefit of the parts of the target code which key off
> those rather than ID register fields.)

Thanks!  For some reason I missed the line above when grepping
for id_aa64mmfr0.

-- 
Eduardo



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries
  2020-11-26 21:50 ` [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries Alexander Graf
  2020-11-27  4:54   ` Paolo Bonzini
@ 2020-11-27 19:44   ` Roman Bolshakov
  2020-11-27 21:17     ` Paolo Bonzini
  2020-11-27 21:51     ` Alexander Graf
  1 sibling, 2 replies; 64+ messages in thread
From: Roman Bolshakov @ 2020-11-27 19:44 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Paolo Bonzini

On Thu, Nov 26, 2020 at 10:50:10PM +0100, Alexander Graf wrote:
> In macOS 11, QEMU only gets access to Hypervisor.framework if it has the
> respective entitlement. Add an entitlement template and automatically self
> sign and apply the entitlement in the build.
> 
> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> ---
>  accel/hvf/entitlements.plist |  8 ++++++++
>  meson.build                  | 30 ++++++++++++++++++++++++++----
>  scripts/entitlement.sh       | 11 +++++++++++
>  3 files changed, 45 insertions(+), 4 deletions(-)
>  create mode 100644 accel/hvf/entitlements.plist
>  create mode 100755 scripts/entitlement.sh

Hi,

I think the patch should go ahead of other changes (with Paolo's fix for
^C) and land into 5.2 because entitlements are needed for x86_64 hvf too
since Big Sur Beta 3. Ad-hoc signing is very convenient for development.

Also, It might be good to have configure/meson option to disable signing
at all. Primarily for homebrew:

https://discourse.brew.sh/t/code-signing-installed-executables/2131/10

There's no established process how to deal with it, e.g. GDB in homebrew
has caveats section for now:

  ==> Caveats
  gdb requires special privileges to access Mach ports.
  You will need to codesign the binary. For instructions, see:

    https://sourceware.org/gdb/wiki/BuildingOnDarwin

The discussion on discourse mentions some plans to do signing in
homebrew CI (with real Developer ID) but none of them are implemented
now.

For now it'd be helpful to provide a way to disable signing and install
the entitlements (if one wants to sign after installation). Similar
issue was raised to fish-shell a while ago:

https://github.com/fish-shell/fish-shell/issues/6952
https://github.com/fish-shell/fish-shell/issues/7467

> 
> diff --git a/accel/hvf/entitlements.plist b/accel/hvf/entitlements.plist
> new file mode 100644
> index 0000000000..154f3308ef
> --- /dev/null
> +++ b/accel/hvf/entitlements.plist
> @@ -0,0 +1,8 @@
> +<?xml version="1.0" encoding="UTF-8"?>
> +<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
> +<plist version="1.0">
> +<dict>
> +    <key>com.apple.security.hypervisor</key>
> +    <true/>
> +</dict>
> +</plist>
> diff --git a/meson.build b/meson.build
> index 5062407c70..2a7ff5560c 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1844,9 +1844,14 @@ foreach target : target_dirs
>      }]
>    endif
>    foreach exe: execs
> -    emulators += {exe['name']:
> -         executable(exe['name'], exe['sources'],
> -               install: true,
> +    exe_name = exe['name']
> +    exe_sign = 'CONFIG_HVF' in config_target

I don't have Apple Silicon HW but it may require different kind of
entitlements for CONFIG_TCG:

https://developer.apple.com/documentation/apple_silicon/porting_just-in-time_compilers_to_apple_silicon

Thanks,
Roman

> +    if exe_sign
> +      exe_name += '-unsigned'
> +    endif
> +
> +    emulator = executable(exe_name, exe['sources'],
> +               install: not exe_sign,
>                 c_args: c_args,
>                 dependencies: arch_deps + deps + exe['dependencies'],
>                 objects: lib.extract_all_objects(recursive: true),
> @@ -1854,7 +1859,24 @@ foreach target : target_dirs
>                 link_depends: [block_syms, qemu_syms] + exe.get('link_depends', []),
>                 link_args: link_args,
>                 gui_app: exe['gui'])
> -    }
> +
> +    if exe_sign
> +      exe_full = meson.current_build_dir() / exe['name']
> +      emulators += {exe['name'] : custom_target(exe['name'],
> +                   install: true,
> +                   install_dir: get_option('bindir'),
> +                   depends: emulator,
> +                   output: exe['name'],
> +                   command: [
> +                     meson.current_source_dir() / 'scripts/entitlement.sh',
> +                     meson.current_build_dir() / exe['name'] + '-unsigned',
> +                     meson.current_build_dir() / exe['name'],
> +                     meson.current_source_dir() / 'accel/hvf/entitlements.plist'
> +                   ])
> +      }
> +    else
> +      emulators += {exe['name']: emulator}
> +    endif
>  
>      if 'CONFIG_TRACE_SYSTEMTAP' in config_host
>        foreach stp: [
> diff --git a/scripts/entitlement.sh b/scripts/entitlement.sh
> new file mode 100755
> index 0000000000..7ed9590bf9
> --- /dev/null
> +++ b/scripts/entitlement.sh
> @@ -0,0 +1,11 @@
> +#!/bin/sh -e
> +#
> +# Helper script for the build process to apply entitlements
> +
> +SRC="$1"
> +DST="$2"
> +ENTITLEMENT="$3"
> +
> +rm -f "$2"
> +cp -a "$SRC" "$DST"
> +codesign --entitlements "$ENTITLEMENT" --force -s - "$DST"
> -- 
> 2.24.3 (Apple Git-128)
> 
> 



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-26 21:50 ` [PATCH 2/8] hvf: Move common code out Alexander Graf
@ 2020-11-27 20:00   ` Roman Bolshakov
  2020-11-27 21:55     ` Alexander Graf
  0 siblings, 1 reply; 64+ messages in thread
From: Roman Bolshakov @ 2020-11-27 20:00 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Claudio Fontana, Paolo Bonzini

On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
> Until now, Hypervisor.framework has only been available on x86_64 systems.
> With Apple Silicon shipping now, it extends its reach to aarch64. To
> prepare for support for multiple architectures, let's move common code out
> into its own accel directory.
> 
> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> ---
>  MAINTAINERS                 |   9 +-
>  accel/hvf/hvf-all.c         |  56 +++++
>  accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>  accel/hvf/meson.build       |   7 +
>  accel/meson.build           |   1 +
>  include/sysemu/hvf_int.h    |  69 ++++++
>  target/i386/hvf/hvf-cpus.c  | 131 ----------
>  target/i386/hvf/hvf-cpus.h  |  25 --
>  target/i386/hvf/hvf-i386.h  |  48 +---
>  target/i386/hvf/hvf.c       | 360 +--------------------------
>  target/i386/hvf/meson.build |   1 -
>  target/i386/hvf/x86hvf.c    |  11 +-
>  target/i386/hvf/x86hvf.h    |   2 -
>  13 files changed, 619 insertions(+), 569 deletions(-)
>  create mode 100644 accel/hvf/hvf-all.c
>  create mode 100644 accel/hvf/hvf-cpus.c
>  create mode 100644 accel/hvf/meson.build
>  create mode 100644 include/sysemu/hvf_int.h
>  delete mode 100644 target/i386/hvf/hvf-cpus.c
>  delete mode 100644 target/i386/hvf/hvf-cpus.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 68bc160f41..ca4b6d9279 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>  M: Roman Bolshakov <r.bolshakov@yadro.com>
>  W: https://wiki.qemu.org/Features/HVF
>  S: Maintained
> -F: accel/stubs/hvf-stub.c

There was a patch for that in the RFC series from Claudio.

>  F: target/i386/hvf/
> +
> +HVF
> +M: Cameron Esfahani <dirty@apple.com>
> +M: Roman Bolshakov <r.bolshakov@yadro.com>
> +W: https://wiki.qemu.org/Features/HVF
> +S: Maintained
> +F: accel/hvf/
>  F: include/sysemu/hvf.h
> +F: include/sysemu/hvf_int.h
>  
>  WHPX CPUs
>  M: Sunil Muthuswamy <sunilmut@microsoft.com>
> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
> new file mode 100644
> index 0000000000..47d77a472a
> --- /dev/null
> +++ b/accel/hvf/hvf-all.c
> @@ -0,0 +1,56 @@
> +/*
> + * QEMU Hypervisor.framework support
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + * Contributions after 2012-01-13 are licensed under the terms of the
> + * GNU GPL, version 2 or (at your option) any later version.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qemu/error-report.h"
> +#include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
> +#include "sysemu/runstate.h"
> +
> +#include "qemu/main-loop.h"
> +#include "sysemu/accel.h"
> +
> +#include <Hypervisor/Hypervisor.h>
> +
> +bool hvf_allowed;
> +HVFState *hvf_state;
> +
> +void assert_hvf_ok(hv_return_t ret)
> +{
> +    if (ret == HV_SUCCESS) {
> +        return;
> +    }
> +
> +    switch (ret) {
> +    case HV_ERROR:
> +        error_report("Error: HV_ERROR");
> +        break;
> +    case HV_BUSY:
> +        error_report("Error: HV_BUSY");
> +        break;
> +    case HV_BAD_ARGUMENT:
> +        error_report("Error: HV_BAD_ARGUMENT");
> +        break;
> +    case HV_NO_RESOURCES:
> +        error_report("Error: HV_NO_RESOURCES");
> +        break;
> +    case HV_NO_DEVICE:
> +        error_report("Error: HV_NO_DEVICE");
> +        break;
> +    case HV_UNSUPPORTED:
> +        error_report("Error: HV_UNSUPPORTED");
> +        break;
> +    default:
> +        error_report("Unknown Error");
> +    }
> +
> +    abort();
> +}
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> new file mode 100644
> index 0000000000..f9bb5502b7
> --- /dev/null
> +++ b/accel/hvf/hvf-cpus.c
> @@ -0,0 +1,468 @@
> +/*
> + * Copyright 2008 IBM Corporation
> + *           2008 Red Hat, Inc.
> + * Copyright 2011 Intel Corporation
> + * Copyright 2016 Veertu, Inc.
> + * Copyright 2017 The Android Open Source Project
> + *
> + * QEMU Hypervisor.framework support
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> + *
> + * This file contain code under public domain from the hvdos project:
> + * https://github.com/mist64/hvdos
> + *
> + * Parts Copyright (c) 2011 NetApp, Inc.
> + * All rights reserved.
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + *
> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/error-report.h"
> +#include "qemu/main-loop.h"
> +#include "exec/address-spaces.h"
> +#include "exec/exec-all.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
> +#include "sysemu/runstate.h"
> +#include "qemu/guest-random.h"
> +
> +#include <Hypervisor/Hypervisor.h>
> +
> +/* Memory slots */
> +
> +struct mac_slot {
> +    int present;
> +    uint64_t size;
> +    uint64_t gpa_start;
> +    uint64_t gva;
> +};
> +
> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> +{
> +    hvf_slot *slot;
> +    int x;
> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> +        slot = &hvf_state->slots[x];
> +        if (slot->size && start < (slot->start + slot->size) &&
> +            (start + size) > slot->start) {
> +            return slot;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +struct mac_slot mac_slots[32];
> +
> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> +{
> +    struct mac_slot *macslot;
> +    hv_return_t ret;
> +
> +    macslot = &mac_slots[slot->slot_id];
> +
> +    if (macslot->present) {
> +        if (macslot->size != slot->size) {
> +            macslot->present = 0;
> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> +            assert_hvf_ok(ret);
> +        }
> +    }
> +
> +    if (!slot->size) {
> +        return 0;
> +    }
> +
> +    macslot->present = 1;
> +    macslot->gpa_start = slot->start;
> +    macslot->size = slot->size;
> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
> +    assert_hvf_ok(ret);
> +    return 0;
> +}
> +
> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> +{
> +    hvf_slot *mem;
> +    MemoryRegion *area = section->mr;
> +    bool writeable = !area->readonly && !area->rom_device;
> +    hv_memory_flags_t flags;
> +
> +    if (!memory_region_is_ram(area)) {
> +        if (writeable) {
> +            return;
> +        } else if (!memory_region_is_romd(area)) {
> +            /*
> +             * If the memory device is not in romd_mode, then we actually want
> +             * to remove the hvf memory slot so all accesses will trap.
> +             */
> +             add = false;
> +        }
> +    }
> +
> +    mem = hvf_find_overlap_slot(
> +            section->offset_within_address_space,
> +            int128_get64(section->size));
> +
> +    if (mem && add) {
> +        if (mem->size == int128_get64(section->size) &&
> +            mem->start == section->offset_within_address_space &&
> +            mem->mem == (memory_region_get_ram_ptr(area) +
> +            section->offset_within_region)) {
> +            return; /* Same region was attempted to register, go away. */
> +        }
> +    }
> +
> +    /* Region needs to be reset. set the size to 0 and remap it. */
> +    if (mem) {
> +        mem->size = 0;
> +        if (do_hvf_set_memory(mem, 0)) {
> +            error_report("Failed to reset overlapping slot");
> +            abort();
> +        }
> +    }
> +
> +    if (!add) {
> +        return;
> +    }
> +
> +    if (area->readonly ||
> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> +    } else {
> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> +    }
> +
> +    /* Now make a new slot. */
> +    int x;
> +
> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> +        mem = &hvf_state->slots[x];
> +        if (!mem->size) {
> +            break;
> +        }
> +    }
> +
> +    if (x == hvf_state->num_slots) {
> +        error_report("No free slots");
> +        abort();
> +    }
> +
> +    mem->size = int128_get64(section->size);
> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
> +    mem->start = section->offset_within_address_space;
> +    mem->region = area;
> +
> +    if (do_hvf_set_memory(mem, flags)) {
> +        error_report("Error registering new memory slot");
> +        abort();
> +    }
> +}
> +
> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
> +{
> +    hvf_slot *slot;
> +
> +    slot = hvf_find_overlap_slot(
> +            section->offset_within_address_space,
> +            int128_get64(section->size));
> +
> +    /* protect region against writes; begin tracking it */
> +    if (on) {
> +        slot->flags |= HVF_SLOT_LOG;
> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> +                      HV_MEMORY_READ);
> +    /* stop tracking region*/
> +    } else {
> +        slot->flags &= ~HVF_SLOT_LOG;
> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> +    }
> +}
> +
> +static void hvf_log_start(MemoryListener *listener,
> +                          MemoryRegionSection *section, int old, int new)
> +{
> +    if (old != 0) {
> +        return;
> +    }
> +
> +    hvf_set_dirty_tracking(section, 1);
> +}
> +
> +static void hvf_log_stop(MemoryListener *listener,
> +                         MemoryRegionSection *section, int old, int new)
> +{
> +    if (new != 0) {
> +        return;
> +    }
> +
> +    hvf_set_dirty_tracking(section, 0);
> +}
> +
> +static void hvf_log_sync(MemoryListener *listener,
> +                         MemoryRegionSection *section)
> +{
> +    /*
> +     * sync of dirty pages is handled elsewhere; just make sure we keep
> +     * tracking the region.
> +     */
> +    hvf_set_dirty_tracking(section, 1);
> +}
> +
> +static void hvf_region_add(MemoryListener *listener,
> +                           MemoryRegionSection *section)
> +{
> +    hvf_set_phys_mem(section, true);
> +}
> +
> +static void hvf_region_del(MemoryListener *listener,
> +                           MemoryRegionSection *section)
> +{
> +    hvf_set_phys_mem(section, false);
> +}
> +
> +static MemoryListener hvf_memory_listener = {
> +    .priority = 10,
> +    .region_add = hvf_region_add,
> +    .region_del = hvf_region_del,
> +    .log_start = hvf_log_start,
> +    .log_stop = hvf_log_stop,
> +    .log_sync = hvf_log_sync,
> +};
> +
> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
> +{
> +    if (!cpu->vcpu_dirty) {
> +        hvf_get_registers(cpu);
> +        cpu->vcpu_dirty = true;
> +    }
> +}
> +
> +static void hvf_cpu_synchronize_state(CPUState *cpu)
> +{
> +    if (!cpu->vcpu_dirty) {
> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> +    }
> +}
> +
> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> +                                              run_on_cpu_data arg)
> +{
> +    hvf_put_registers(cpu);
> +    cpu->vcpu_dirty = false;
> +}
> +
> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> +{
> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
> +}
> +
> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> +                                             run_on_cpu_data arg)
> +{
> +    hvf_put_registers(cpu);
> +    cpu->vcpu_dirty = false;
> +}
> +
> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
> +{
> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> +}
> +
> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> +                                              run_on_cpu_data arg)
> +{
> +    cpu->vcpu_dirty = true;
> +}
> +
> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> +{
> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
> +}
> +
> +static void hvf_vcpu_destroy(CPUState *cpu)
> +{
> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
> +    assert_hvf_ok(ret);
> +
> +    hvf_arch_vcpu_destroy(cpu);
> +}
> +
> +static void dummy_signal(int sig)
> +{
> +}
> +
> +static int hvf_init_vcpu(CPUState *cpu)
> +{
> +    int r;
> +
> +    /* init cpu signals */
> +    sigset_t set;
> +    struct sigaction sigact;
> +
> +    memset(&sigact, 0, sizeof(sigact));
> +    sigact.sa_handler = dummy_signal;
> +    sigaction(SIG_IPI, &sigact, NULL);
> +
> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
> +    sigdelset(&set, SIG_IPI);
> +
> +#ifdef __aarch64__
> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
> +#else
> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> +#endif

I think the first __aarch64__ bit fits better to arm part of the series.

> +    cpu->vcpu_dirty = 1;
> +    assert_hvf_ok(r);
> +
> +    return hvf_arch_init_vcpu(cpu);
> +}
> +
> +/*
> + * The HVF-specific vCPU thread function. This one should only run when the host
> + * CPU supports the VMX "unrestricted guest" feature.
> + */
> +static void *hvf_cpu_thread_fn(void *arg)
> +{
> +    CPUState *cpu = arg;
> +
> +    int r;
> +
> +    assert(hvf_enabled());
> +
> +    rcu_register_thread();
> +
> +    qemu_mutex_lock_iothread();
> +    qemu_thread_get_self(cpu->thread);
> +
> +    cpu->thread_id = qemu_get_thread_id();
> +    cpu->can_do_io = 1;
> +    current_cpu = cpu;
> +
> +    hvf_init_vcpu(cpu);
> +
> +    /* signal CPU creation */
> +    cpu_thread_signal_created(cpu);
> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> +
> +    do {
> +        if (cpu_can_run(cpu)) {
> +            r = hvf_vcpu_exec(cpu);
> +            if (r == EXCP_DEBUG) {
> +                cpu_handle_guest_debug(cpu);
> +            }
> +        }
> +        qemu_wait_io_event(cpu);
> +    } while (!cpu->unplug || cpu_can_run(cpu));
> +
> +    hvf_vcpu_destroy(cpu);
> +    cpu_thread_signal_destroyed(cpu);
> +    qemu_mutex_unlock_iothread();
> +    rcu_unregister_thread();
> +    return NULL;
> +}
> +
> +static void hvf_start_vcpu_thread(CPUState *cpu)
> +{
> +    char thread_name[VCPU_THREAD_NAME_SIZE];
> +
> +    /*
> +     * HVF currently does not support TCG, and only runs in
> +     * unrestricted-guest mode.
> +     */
> +    assert(hvf_enabled());
> +
> +    cpu->thread = g_malloc0(sizeof(QemuThread));
> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> +    qemu_cond_init(cpu->halt_cond);
> +
> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> +             cpu->cpu_index);
> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> +                       cpu, QEMU_THREAD_JOINABLE);
> +}
> +
> +static const CpusAccel hvf_cpus = {
> +    .create_vcpu_thread = hvf_start_vcpu_thread,
> +
> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> +    .synchronize_state = hvf_cpu_synchronize_state,
> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> +};
> +
> +static int hvf_accel_init(MachineState *ms)
> +{
> +    int x;
> +    hv_return_t ret;
> +    HVFState *s;
> +
> +    ret = hv_vm_create(HV_VM_DEFAULT);
> +    assert_hvf_ok(ret);
> +
> +    s = g_new0(HVFState, 1);
> +
> +    s->num_slots = 32;
> +    for (x = 0; x < s->num_slots; ++x) {
> +        s->slots[x].size = 0;
> +        s->slots[x].slot_id = x;
> +    }
> +
> +    hvf_state = s;
> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
> +    cpus_register_accel(&hvf_cpus);
> +    return 0;
> +}
> +
> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
> +{
> +    AccelClass *ac = ACCEL_CLASS(oc);
> +    ac->name = "HVF";
> +    ac->init_machine = hvf_accel_init;
> +    ac->allowed = &hvf_allowed;
> +}
> +
> +static const TypeInfo hvf_accel_type = {
> +    .name = TYPE_HVF_ACCEL,
> +    .parent = TYPE_ACCEL,
> +    .class_init = hvf_accel_class_init,
> +};
> +
> +static void hvf_type_init(void)
> +{
> +    type_register_static(&hvf_accel_type);
> +}
> +
> +type_init(hvf_type_init);
> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
> new file mode 100644
> index 0000000000..dfd6b68dc7
> --- /dev/null
> +++ b/accel/hvf/meson.build
> @@ -0,0 +1,7 @@
> +hvf_ss = ss.source_set()
> +hvf_ss.add(files(
> +  'hvf-all.c',
> +  'hvf-cpus.c',
> +))
> +
> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
> diff --git a/accel/meson.build b/accel/meson.build
> index b26cca227a..6de12ce5d5 100644
> --- a/accel/meson.build
> +++ b/accel/meson.build
> @@ -1,5 +1,6 @@
>  softmmu_ss.add(files('accel.c'))
>  
> +subdir('hvf')
>  subdir('qtest')
>  subdir('kvm')
>  subdir('tcg')
> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> new file mode 100644
> index 0000000000..de9bad23a8
> --- /dev/null
> +++ b/include/sysemu/hvf_int.h
> @@ -0,0 +1,69 @@
> +/*
> + * QEMU Hypervisor.framework (HVF) support
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +/* header to be included in HVF-specific code */
> +
> +#ifndef HVF_INT_H
> +#define HVF_INT_H
> +
> +#include <Hypervisor/Hypervisor.h>
> +
> +#define HVF_MAX_VCPU 0x10
> +
> +extern struct hvf_state hvf_global;
> +
> +struct hvf_vm {
> +    int id;
> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> +};
> +
> +struct hvf_state {
> +    uint32_t version;
> +    struct hvf_vm *vm;
> +    uint64_t mem_quota;
> +};
> +
> +/* hvf_slot flags */
> +#define HVF_SLOT_LOG (1 << 0)
> +
> +typedef struct hvf_slot {
> +    uint64_t start;
> +    uint64_t size;
> +    uint8_t *mem;
> +    int slot_id;
> +    uint32_t flags;
> +    MemoryRegion *region;
> +} hvf_slot;
> +
> +typedef struct hvf_vcpu_caps {
> +    uint64_t vmx_cap_pinbased;
> +    uint64_t vmx_cap_procbased;
> +    uint64_t vmx_cap_procbased2;
> +    uint64_t vmx_cap_entry;
> +    uint64_t vmx_cap_exit;
> +    uint64_t vmx_cap_preemption_timer;
> +} hvf_vcpu_caps;
> +
> +struct HVFState {
> +    AccelState parent;
> +    hvf_slot slots[32];
> +    int num_slots;
> +
> +    hvf_vcpu_caps *hvf_caps;
> +};
> +extern HVFState *hvf_state;
> +
> +void assert_hvf_ok(hv_return_t ret);
> +int hvf_get_registers(CPUState *cpu);
> +int hvf_put_registers(CPUState *cpu);
> +int hvf_arch_init_vcpu(CPUState *cpu);
> +void hvf_arch_vcpu_destroy(CPUState *cpu);
> +int hvf_vcpu_exec(CPUState *cpu);
> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
> +
> +#endif
> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
> deleted file mode 100644
> index 817b3d7452..0000000000
> --- a/target/i386/hvf/hvf-cpus.c
> +++ /dev/null
> @@ -1,131 +0,0 @@
> -/*
> - * Copyright 2008 IBM Corporation
> - *           2008 Red Hat, Inc.
> - * Copyright 2011 Intel Corporation
> - * Copyright 2016 Veertu, Inc.
> - * Copyright 2017 The Android Open Source Project
> - *
> - * QEMU Hypervisor.framework support
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of version 2 of the GNU General Public
> - * License as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> - * General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
> - *
> - * This file contain code under public domain from the hvdos project:
> - * https://github.com/mist64/hvdos
> - *
> - * Parts Copyright (c) 2011 NetApp, Inc.
> - * All rights reserved.
> - *
> - * Redistribution and use in source and binary forms, with or without
> - * modification, are permitted provided that the following conditions
> - * are met:
> - * 1. Redistributions of source code must retain the above copyright
> - *    notice, this list of conditions and the following disclaimer.
> - * 2. Redistributions in binary form must reproduce the above copyright
> - *    notice, this list of conditions and the following disclaimer in the
> - *    documentation and/or other materials provided with the distribution.
> - *
> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> - * SUCH DAMAGE.
> - */
> -
> -#include "qemu/osdep.h"
> -#include "qemu/error-report.h"
> -#include "qemu/main-loop.h"
> -#include "sysemu/hvf.h"
> -#include "sysemu/runstate.h"
> -#include "target/i386/cpu.h"
> -#include "qemu/guest-random.h"
> -
> -#include "hvf-cpus.h"
> -
> -/*
> - * The HVF-specific vCPU thread function. This one should only run when the host
> - * CPU supports the VMX "unrestricted guest" feature.
> - */
> -static void *hvf_cpu_thread_fn(void *arg)
> -{
> -    CPUState *cpu = arg;
> -
> -    int r;
> -
> -    assert(hvf_enabled());
> -
> -    rcu_register_thread();
> -
> -    qemu_mutex_lock_iothread();
> -    qemu_thread_get_self(cpu->thread);
> -
> -    cpu->thread_id = qemu_get_thread_id();
> -    cpu->can_do_io = 1;
> -    current_cpu = cpu;
> -
> -    hvf_init_vcpu(cpu);
> -
> -    /* signal CPU creation */
> -    cpu_thread_signal_created(cpu);
> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> -
> -    do {
> -        if (cpu_can_run(cpu)) {
> -            r = hvf_vcpu_exec(cpu);
> -            if (r == EXCP_DEBUG) {
> -                cpu_handle_guest_debug(cpu);
> -            }
> -        }
> -        qemu_wait_io_event(cpu);
> -    } while (!cpu->unplug || cpu_can_run(cpu));
> -
> -    hvf_vcpu_destroy(cpu);
> -    cpu_thread_signal_destroyed(cpu);
> -    qemu_mutex_unlock_iothread();
> -    rcu_unregister_thread();
> -    return NULL;
> -}
> -
> -static void hvf_start_vcpu_thread(CPUState *cpu)
> -{
> -    char thread_name[VCPU_THREAD_NAME_SIZE];
> -
> -    /*
> -     * HVF currently does not support TCG, and only runs in
> -     * unrestricted-guest mode.
> -     */
> -    assert(hvf_enabled());
> -
> -    cpu->thread = g_malloc0(sizeof(QemuThread));
> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> -    qemu_cond_init(cpu->halt_cond);
> -
> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> -             cpu->cpu_index);
> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> -                       cpu, QEMU_THREAD_JOINABLE);
> -}
> -
> -const CpusAccel hvf_cpus = {
> -    .create_vcpu_thread = hvf_start_vcpu_thread,
> -
> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> -    .synchronize_state = hvf_cpu_synchronize_state,
> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> -};
> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
> deleted file mode 100644
> index ced31b82c0..0000000000
> --- a/target/i386/hvf/hvf-cpus.h
> +++ /dev/null
> @@ -1,25 +0,0 @@
> -/*
> - * Accelerator CPUS Interface
> - *
> - * Copyright 2020 SUSE LLC
> - *
> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
> - * See the COPYING file in the top-level directory.
> - */
> -
> -#ifndef HVF_CPUS_H
> -#define HVF_CPUS_H
> -
> -#include "sysemu/cpus.h"
> -
> -extern const CpusAccel hvf_cpus;
> -
> -int hvf_init_vcpu(CPUState *);
> -int hvf_vcpu_exec(CPUState *);
> -void hvf_cpu_synchronize_state(CPUState *);
> -void hvf_cpu_synchronize_post_reset(CPUState *);
> -void hvf_cpu_synchronize_post_init(CPUState *);
> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
> -void hvf_vcpu_destroy(CPUState *);
> -
> -#endif /* HVF_CPUS_H */
> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
> index e0edffd077..6d56f8f6bb 100644
> --- a/target/i386/hvf/hvf-i386.h
> +++ b/target/i386/hvf/hvf-i386.h
> @@ -18,57 +18,11 @@
>  
>  #include "sysemu/accel.h"
>  #include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
>  #include "cpu.h"
>  #include "x86.h"
>  
> -#define HVF_MAX_VCPU 0x10
> -
> -extern struct hvf_state hvf_global;
> -
> -struct hvf_vm {
> -    int id;
> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> -};
> -
> -struct hvf_state {
> -    uint32_t version;
> -    struct hvf_vm *vm;
> -    uint64_t mem_quota;
> -};
> -
> -/* hvf_slot flags */
> -#define HVF_SLOT_LOG (1 << 0)
> -
> -typedef struct hvf_slot {
> -    uint64_t start;
> -    uint64_t size;
> -    uint8_t *mem;
> -    int slot_id;
> -    uint32_t flags;
> -    MemoryRegion *region;
> -} hvf_slot;
> -
> -typedef struct hvf_vcpu_caps {
> -    uint64_t vmx_cap_pinbased;
> -    uint64_t vmx_cap_procbased;
> -    uint64_t vmx_cap_procbased2;
> -    uint64_t vmx_cap_entry;
> -    uint64_t vmx_cap_exit;
> -    uint64_t vmx_cap_preemption_timer;
> -} hvf_vcpu_caps;
> -
> -struct HVFState {
> -    AccelState parent;
> -    hvf_slot slots[32];
> -    int num_slots;
> -
> -    hvf_vcpu_caps *hvf_caps;
> -};
> -extern HVFState *hvf_state;
> -
> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>  void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>  
>  #ifdef NEED_CPU_H
>  /* Functions exported to host specific mode */
> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
> index ed9356565c..8b96ecd619 100644
> --- a/target/i386/hvf/hvf.c
> +++ b/target/i386/hvf/hvf.c
> @@ -51,6 +51,7 @@
>  #include "qemu/error-report.h"
>  
>  #include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
>  #include "sysemu/runstate.h"
>  #include "hvf-i386.h"
>  #include "vmcs.h"
> @@ -72,171 +73,6 @@
>  #include "sysemu/accel.h"
>  #include "target/i386/cpu.h"
>  
> -#include "hvf-cpus.h"
> -
> -HVFState *hvf_state;
> -
> -static void assert_hvf_ok(hv_return_t ret)
> -{
> -    if (ret == HV_SUCCESS) {
> -        return;
> -    }
> -
> -    switch (ret) {
> -    case HV_ERROR:
> -        error_report("Error: HV_ERROR");
> -        break;
> -    case HV_BUSY:
> -        error_report("Error: HV_BUSY");
> -        break;
> -    case HV_BAD_ARGUMENT:
> -        error_report("Error: HV_BAD_ARGUMENT");
> -        break;
> -    case HV_NO_RESOURCES:
> -        error_report("Error: HV_NO_RESOURCES");
> -        break;
> -    case HV_NO_DEVICE:
> -        error_report("Error: HV_NO_DEVICE");
> -        break;
> -    case HV_UNSUPPORTED:
> -        error_report("Error: HV_UNSUPPORTED");
> -        break;
> -    default:
> -        error_report("Unknown Error");
> -    }
> -
> -    abort();
> -}
> -
> -/* Memory slots */
> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> -{
> -    hvf_slot *slot;
> -    int x;
> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> -        slot = &hvf_state->slots[x];
> -        if (slot->size && start < (slot->start + slot->size) &&
> -            (start + size) > slot->start) {
> -            return slot;
> -        }
> -    }
> -    return NULL;
> -}
> -
> -struct mac_slot {
> -    int present;
> -    uint64_t size;
> -    uint64_t gpa_start;
> -    uint64_t gva;
> -};
> -
> -struct mac_slot mac_slots[32];
> -
> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> -{
> -    struct mac_slot *macslot;
> -    hv_return_t ret;
> -
> -    macslot = &mac_slots[slot->slot_id];
> -
> -    if (macslot->present) {
> -        if (macslot->size != slot->size) {
> -            macslot->present = 0;
> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> -            assert_hvf_ok(ret);
> -        }
> -    }
> -
> -    if (!slot->size) {
> -        return 0;
> -    }
> -
> -    macslot->present = 1;
> -    macslot->gpa_start = slot->start;
> -    macslot->size = slot->size;
> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
> -    assert_hvf_ok(ret);
> -    return 0;
> -}
> -
> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> -{
> -    hvf_slot *mem;
> -    MemoryRegion *area = section->mr;
> -    bool writeable = !area->readonly && !area->rom_device;
> -    hv_memory_flags_t flags;
> -
> -    if (!memory_region_is_ram(area)) {
> -        if (writeable) {
> -            return;
> -        } else if (!memory_region_is_romd(area)) {
> -            /*
> -             * If the memory device is not in romd_mode, then we actually want
> -             * to remove the hvf memory slot so all accesses will trap.
> -             */
> -             add = false;
> -        }
> -    }
> -
> -    mem = hvf_find_overlap_slot(
> -            section->offset_within_address_space,
> -            int128_get64(section->size));
> -
> -    if (mem && add) {
> -        if (mem->size == int128_get64(section->size) &&
> -            mem->start == section->offset_within_address_space &&
> -            mem->mem == (memory_region_get_ram_ptr(area) +
> -            section->offset_within_region)) {
> -            return; /* Same region was attempted to register, go away. */
> -        }
> -    }
> -
> -    /* Region needs to be reset. set the size to 0 and remap it. */
> -    if (mem) {
> -        mem->size = 0;
> -        if (do_hvf_set_memory(mem, 0)) {
> -            error_report("Failed to reset overlapping slot");
> -            abort();
> -        }
> -    }
> -
> -    if (!add) {
> -        return;
> -    }
> -
> -    if (area->readonly ||
> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> -    } else {
> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> -    }
> -
> -    /* Now make a new slot. */
> -    int x;
> -
> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> -        mem = &hvf_state->slots[x];
> -        if (!mem->size) {
> -            break;
> -        }
> -    }
> -
> -    if (x == hvf_state->num_slots) {
> -        error_report("No free slots");
> -        abort();
> -    }
> -
> -    mem->size = int128_get64(section->size);
> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
> -    mem->start = section->offset_within_address_space;
> -    mem->region = area;
> -
> -    if (do_hvf_set_memory(mem, flags)) {
> -        error_report("Error registering new memory slot");
> -        abort();
> -    }
> -}
> -
>  void vmx_update_tpr(CPUState *cpu)
>  {
>      /* TODO: need integrate APIC handling */
> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>      }
>  }
>  
> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
> -{
> -    if (!cpu->vcpu_dirty) {
> -        hvf_get_registers(cpu);
> -        cpu->vcpu_dirty = true;
> -    }
> -}
> -
> -void hvf_cpu_synchronize_state(CPUState *cpu)
> -{
> -    if (!cpu->vcpu_dirty) {
> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> -    }
> -}
> -
> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> -                                              run_on_cpu_data arg)
> -{
> -    hvf_put_registers(cpu);
> -    cpu->vcpu_dirty = false;
> -}
> -
> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> -{
> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
> -}
> -
> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> -                                             run_on_cpu_data arg)
> -{
> -    hvf_put_registers(cpu);
> -    cpu->vcpu_dirty = false;
> -}
> -
> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
> -{
> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> -}
> -
> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> -                                              run_on_cpu_data arg)
> -{
> -    cpu->vcpu_dirty = true;
> -}
> -
> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> -{
> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
> -}
> -
>  static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>  {
>      int read, write;
> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>      return false;
>  }
>  
> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
> -{
> -    hvf_slot *slot;
> -
> -    slot = hvf_find_overlap_slot(
> -            section->offset_within_address_space,
> -            int128_get64(section->size));
> -
> -    /* protect region against writes; begin tracking it */
> -    if (on) {
> -        slot->flags |= HVF_SLOT_LOG;
> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> -                      HV_MEMORY_READ);
> -    /* stop tracking region*/
> -    } else {
> -        slot->flags &= ~HVF_SLOT_LOG;
> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> -    }
> -}
> -
> -static void hvf_log_start(MemoryListener *listener,
> -                          MemoryRegionSection *section, int old, int new)
> -{
> -    if (old != 0) {
> -        return;
> -    }
> -
> -    hvf_set_dirty_tracking(section, 1);
> -}
> -
> -static void hvf_log_stop(MemoryListener *listener,
> -                         MemoryRegionSection *section, int old, int new)
> -{
> -    if (new != 0) {
> -        return;
> -    }
> -
> -    hvf_set_dirty_tracking(section, 0);
> -}
> -
> -static void hvf_log_sync(MemoryListener *listener,
> -                         MemoryRegionSection *section)
> -{
> -    /*
> -     * sync of dirty pages is handled elsewhere; just make sure we keep
> -     * tracking the region.
> -     */
> -    hvf_set_dirty_tracking(section, 1);
> -}
> -
> -static void hvf_region_add(MemoryListener *listener,
> -                           MemoryRegionSection *section)
> -{
> -    hvf_set_phys_mem(section, true);
> -}
> -
> -static void hvf_region_del(MemoryListener *listener,
> -                           MemoryRegionSection *section)
> -{
> -    hvf_set_phys_mem(section, false);
> -}
> -
> -static MemoryListener hvf_memory_listener = {
> -    .priority = 10,
> -    .region_add = hvf_region_add,
> -    .region_del = hvf_region_del,
> -    .log_start = hvf_log_start,
> -    .log_stop = hvf_log_stop,
> -    .log_sync = hvf_log_sync,
> -};
> -
> -void hvf_vcpu_destroy(CPUState *cpu)
> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>  {
>      X86CPU *x86_cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86_cpu->env;
>  
> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>      g_free(env->hvf_mmio_buf);
> -    assert_hvf_ok(ret);
> -}
> -
> -static void dummy_signal(int sig)
> -{
>  }
>  
> -int hvf_init_vcpu(CPUState *cpu)
> +int hvf_arch_init_vcpu(CPUState *cpu)
>  {
>  
>      X86CPU *x86cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86cpu->env;
> -    int r;
> -
> -    /* init cpu signals */
> -    sigset_t set;
> -    struct sigaction sigact;
> -
> -    memset(&sigact, 0, sizeof(sigact));
> -    sigact.sa_handler = dummy_signal;
> -    sigaction(SIG_IPI, &sigact, NULL);
> -
> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> -    sigdelset(&set, SIG_IPI);
>  
>      init_emu();
>      init_decoder();
> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>      hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>      env->hvf_mmio_buf = g_new(char, 4096);
>  
> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> -    cpu->vcpu_dirty = 1;
> -    assert_hvf_ok(r);
> -
>      if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>          &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>          abort();
> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>  
>      return ret;
>  }
> -
> -bool hvf_allowed;
> -
> -static int hvf_accel_init(MachineState *ms)
> -{
> -    int x;
> -    hv_return_t ret;
> -    HVFState *s;
> -
> -    ret = hv_vm_create(HV_VM_DEFAULT);
> -    assert_hvf_ok(ret);
> -
> -    s = g_new0(HVFState, 1);
> - 
> -    s->num_slots = 32;
> -    for (x = 0; x < s->num_slots; ++x) {
> -        s->slots[x].size = 0;
> -        s->slots[x].slot_id = x;
> -    }
> -  
> -    hvf_state = s;
> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
> -    cpus_register_accel(&hvf_cpus);
> -    return 0;
> -}
> -
> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
> -{
> -    AccelClass *ac = ACCEL_CLASS(oc);
> -    ac->name = "HVF";
> -    ac->init_machine = hvf_accel_init;
> -    ac->allowed = &hvf_allowed;
> -}
> -
> -static const TypeInfo hvf_accel_type = {
> -    .name = TYPE_HVF_ACCEL,
> -    .parent = TYPE_ACCEL,
> -    .class_init = hvf_accel_class_init,
> -};
> -
> -static void hvf_type_init(void)
> -{
> -    type_register_static(&hvf_accel_type);
> -}
> -
> -type_init(hvf_type_init);
> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
> index 409c9a3f14..c8a43717ee 100644
> --- a/target/i386/hvf/meson.build
> +++ b/target/i386/hvf/meson.build
> @@ -1,6 +1,5 @@
>  i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>    'hvf.c',
> -  'hvf-cpus.c',
>    'x86.c',
>    'x86_cpuid.c',
>    'x86_decode.c',
> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
> index bbec412b6c..89b8e9d87a 100644
> --- a/target/i386/hvf/x86hvf.c
> +++ b/target/i386/hvf/x86hvf.c
> @@ -20,6 +20,9 @@
>  #include "qemu/osdep.h"
>  
>  #include "qemu-common.h"
> +#include "sysemu/hvf.h"
> +#include "sysemu/hvf_int.h"
> +#include "sysemu/hw_accel.h"
>  #include "x86hvf.h"
>  #include "vmx.h"
>  #include "vmcs.h"
> @@ -32,8 +35,6 @@
>  #include <Hypervisor/hv.h>
>  #include <Hypervisor/hv_vmx.h>
>  
> -#include "hvf-cpus.h"
> -
>  void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>                       SegmentCache *qseg, bool is_tr)
>  {
> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>      env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>  
>      if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
> -        hvf_cpu_synchronize_state(cpu_state);
> +        cpu_synchronize_state(cpu_state);
>          do_cpu_init(cpu);
>      }
>  
> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>          cpu_state->halted = 0;
>      }
>      if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
> -        hvf_cpu_synchronize_state(cpu_state);
> +        cpu_synchronize_state(cpu_state);
>          do_cpu_sipi(cpu);
>      }
>      if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>          cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
> -        hvf_cpu_synchronize_state(cpu_state);
> +        cpu_synchronize_state(cpu_state);

The changes from hvf_cpu_*() to cpu_*() are cleanup and should perhaps
be a separate patch. It follows the cpu/accel cleanups Claudio was
doing this summer.

Philippe raised the idea that this patch might go in ahead of the
ARM-specific part (which might involve some discussion), and I agree
with that.

Some synchronization between Claudio's series (he's CC'd) and this
patch might be needed.

Thanks,
Roman

>          apic_handle_tpr_access_report(cpu->apic_state, env->eip,
>                                        env->tpr_access_type);
>      }
> diff --git a/target/i386/hvf/x86hvf.h b/target/i386/hvf/x86hvf.h
> index 635ab0f34e..99ed8d608d 100644
> --- a/target/i386/hvf/x86hvf.h
> +++ b/target/i386/hvf/x86hvf.h
> @@ -21,8 +21,6 @@
>  #include "x86_descr.h"
>  
>  int hvf_process_events(CPUState *);
> -int hvf_put_registers(CPUState *);
> -int hvf_get_registers(CPUState *);
>  bool hvf_inject_interrupts(CPUState *);
>  void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>                       SegmentCache *qseg, bool is_tr);
> -- 
> 2.24.3 (Apple Git-128)
> 
> 
> 



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries
  2020-11-27 19:44   ` Roman Bolshakov
@ 2020-11-27 21:17     ` Paolo Bonzini
  2020-11-27 21:51     ` Alexander Graf
  1 sibling, 0 replies; 64+ messages in thread
From: Paolo Bonzini @ 2020-11-27 21:17 UTC (permalink / raw)
  To: Roman Bolshakov
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Alexander Graf, qemu-arm


On Fri, Nov 27, 2020 at 20:44 Roman Bolshakov <r.bolshakov@yadro.com>
wrote:

> On Thu, Nov 26, 2020 at 10:50:10PM +0100, Alexander Graf wrote:
> > In macOS 11, QEMU only gets access to Hypervisor.framework if it has the
> > respective entitlement. Add an entitlement template and automatically
> self
> > sign and apply the entitlement in the build.
> >
> > Signed-off-by: Alexander Graf <agraf@csgraf.de>
> > ---
> >  accel/hvf/entitlements.plist |  8 ++++++++
> >  meson.build                  | 30 ++++++++++++++++++++++++++----
> >  scripts/entitlement.sh       | 11 +++++++++++
> >  3 files changed, 45 insertions(+), 4 deletions(-)
> >  create mode 100644 accel/hvf/entitlements.plist
> >  create mode 100755 scripts/entitlement.sh
>
> Hi,
>
> I think the patch should go ahead of other changes (with Paolo's fix for
> ^C) and land into 5.2 because entitlements are needed for x86_64 hvf too
> since Big Sur Beta 3. Ad-hoc signing is very convenient for development.
>

It's certainly too late for 5.2, but we could include the patch in the
release notes and in 5.2.1.

Paolo

Also, It might be good to have configure/meson option to disable signing
> at all. Primarily for homebrew:
>
> https://discourse.brew.sh/t/code-signing-installed-executables/2131/10
>
> There's no established process how to deal with it, e.g. GDB in homebrew
> has caveats section for now:
>
>   ==> Caveats
>   gdb requires special privileges to access Mach ports.
>   You will need to codesign the binary. For instructions, see:
>
>     https://sourceware.org/gdb/wiki/BuildingOnDarwin
>
> The discussion on discourse mentions some plans to do signing in
> homebrew CI (with real Developer ID) but none of them are implemented
> now.
>
> For now it'd be helpful to provide a way to disable signing and install
> the entitlements (if one wants to sign after installation). Similar
> issue was raised to fish-shell a while ago:
>
> https://github.com/fish-shell/fish-shell/issues/6952
> https://github.com/fish-shell/fish-shell/issues/7467
>
> >
> > diff --git a/accel/hvf/entitlements.plist b/accel/hvf/entitlements.plist
> > new file mode 100644
> > index 0000000000..154f3308ef
> > --- /dev/null
> > +++ b/accel/hvf/entitlements.plist
> > @@ -0,0 +1,8 @@
> > +<?xml version="1.0" encoding="UTF-8"?>
> > +<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "
> http://www.apple.com/DTDs/PropertyList-1.0.dtd">
> > +<plist version="1.0">
> > +<dict>
> > +    <key>com.apple.security.hypervisor</key>
> > +    <true/>
> > +</dict>
> > +</plist>
> > diff --git a/meson.build b/meson.build
> > index 5062407c70..2a7ff5560c 100644
> > --- a/meson.build
> > +++ b/meson.build
> > @@ -1844,9 +1844,14 @@ foreach target : target_dirs
> >      }]
> >    endif
> >    foreach exe: execs
> > -    emulators += {exe['name']:
> > -         executable(exe['name'], exe['sources'],
> > -               install: true,
> > +    exe_name = exe['name']
> > +    exe_sign = 'CONFIG_HVF' in config_target
>
> I don't have Apple Silicon HW but it may require different kind of
> entitlements for CONFIG_TCG:
>
>
> https://developer.apple.com/documentation/apple_silicon/porting_just-in-time_compilers_to_apple_silicon
>
> Thanks,
> Roman
>
> > +    if exe_sign
> > +      exe_name += '-unsigned'
> > +    endif
> > +
> > +    emulator = executable(exe_name, exe['sources'],
> > +               install: not exe_sign,
> >                 c_args: c_args,
> >                 dependencies: arch_deps + deps + exe['dependencies'],
> >                 objects: lib.extract_all_objects(recursive: true),
> > @@ -1854,7 +1859,24 @@ foreach target : target_dirs
> >                 link_depends: [block_syms, qemu_syms] +
> exe.get('link_depends', []),
> >                 link_args: link_args,
> >                 gui_app: exe['gui'])
> > -    }
> > +
> > +    if exe_sign
> > +      exe_full = meson.current_build_dir() / exe['name']
> > +      emulators += {exe['name'] : custom_target(exe['name'],
> > +                   install: true,
> > +                   install_dir: get_option('bindir'),
> > +                   depends: emulator,
> > +                   output: exe['name'],
> > +                   command: [
> > +                     meson.current_source_dir() /
> 'scripts/entitlement.sh',
> > +                     meson.current_build_dir() / exe['name'] +
> '-unsigned',
> > +                     meson.current_build_dir() / exe['name'],
> > +                     meson.current_source_dir() /
> 'accel/hvf/entitlements.plist'
> > +                   ])
> > +      }
> > +    else
> > +      emulators += {exe['name']: emulator}
> > +    endif
> >
> >      if 'CONFIG_TRACE_SYSTEMTAP' in config_host
> >        foreach stp: [
> > diff --git a/scripts/entitlement.sh b/scripts/entitlement.sh
> > new file mode 100755
> > index 0000000000..7ed9590bf9
> > --- /dev/null
> > +++ b/scripts/entitlement.sh
> > @@ -0,0 +1,11 @@
> > +#!/bin/sh -e
> > +#
> > +# Helper script for the build process to apply entitlements
> > +
> > +SRC="$1"
> > +DST="$2"
> > +ENTITLEMENT="$3"
> > +
> > +rm -f "$2"
> > +cp -a "$SRC" "$DST"
> > +codesign --entitlements "$ENTITLEMENT" --force -s - "$DST"
> > --
> > 2.24.3 (Apple Git-128)
> >
> >
>
>



* Re: [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries
  2020-11-27 19:44   ` Roman Bolshakov
  2020-11-27 21:17     ` Paolo Bonzini
@ 2020-11-27 21:51     ` Alexander Graf
  1 sibling, 0 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-27 21:51 UTC (permalink / raw)
  To: Roman Bolshakov
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Paolo Bonzini


On 27.11.20 20:44, Roman Bolshakov wrote:
> On Thu, Nov 26, 2020 at 10:50:10PM +0100, Alexander Graf wrote:
>> In macOS 11, QEMU only gets access to Hypervisor.framework if it has the
>> respective entitlement. Add an entitlement template and automatically self
>> sign and apply the entitlement in the build.
>>
>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> ---
>>   accel/hvf/entitlements.plist |  8 ++++++++
>>   meson.build                  | 30 ++++++++++++++++++++++++++----
>>   scripts/entitlement.sh       | 11 +++++++++++
>>   3 files changed, 45 insertions(+), 4 deletions(-)
>>   create mode 100644 accel/hvf/entitlements.plist
>>   create mode 100755 scripts/entitlement.sh
> Hi,
>
> I think the patch should go ahead of other changes (with Paolo's fix for
> ^C) and land into 5.2 because entitlements are needed for x86_64 hvf too
> since Big Sur Beta 3. Ad-hoc signing is very convenient for development.
>
> Also, It might be good to have configure/meson option to disable signing
> at all. Primarily for homebrew:
>
> https://discourse.brew.sh/t/code-signing-installed-executables/2131/10
>
> There's no established process how to deal with it, e.g. GDB in homebrew
> has caveats section for now:
>
>    ==> Caveats
>    gdb requires special privileges to access Mach ports.
>    You will need to codesign the binary. For instructions, see:
>
>      https://sourceware.org/gdb/wiki/BuildingOnDarwin
>
> The discussion on discourse mentions some plans to do signing in
> homebrew CI (with real Developer ID) but none of them are implemented
> now.
>
> For now it'd be helpful to provide a way to disable signing and install
> the entitlements (if one wants to sign after installation). Similar
> issue was raised to fish-shell a while ago:
>
> https://github.com/fish-shell/fish-shell/issues/6952
> https://github.com/fish-shell/fish-shell/issues/7467


All binaries are signed in Big Sur by the linker as far as I understand, 
so I don't quite see the point in not signing :). If the build system 
doesn't have access to codesign, it sounds to me like one should fix the 
build system instead? Worst case, one could inject a fake codesign binary 
that just calls /bin/true.
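
That worst-case workaround might look like this (a sketch under the stated
assumption that the build only needs codesign to exit successfully; the
/tmp path is made up):

```shell
# Sketch: inject a no-op "codesign" early in PATH so build steps that
# unconditionally invoke codesign still succeed on such a build system.
mkdir -p /tmp/fake-bin
cat > /tmp/fake-bin/codesign <<'EOF'
#!/bin/sh
# Accept any arguments and report success, like /bin/true would.
exit 0
EOF
chmod +x /tmp/fake-bin/codesign
PATH="/tmp/fake-bin:$PATH"
export PATH
```

Any subsequent `codesign …` invocation in the build now resolves to the stub
and returns success without signing anything.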


>
>> diff --git a/accel/hvf/entitlements.plist b/accel/hvf/entitlements.plist
>> new file mode 100644
>> index 0000000000..154f3308ef
>> --- /dev/null
>> +++ b/accel/hvf/entitlements.plist
>> @@ -0,0 +1,8 @@
>> +<?xml version="1.0" encoding="UTF-8"?>
>> +<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
>> +<plist version="1.0">
>> +<dict>
>> +    <key>com.apple.security.hypervisor</key>
>> +    <true/>
>> +</dict>
>> +</plist>
>> diff --git a/meson.build b/meson.build
>> index 5062407c70..2a7ff5560c 100644
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -1844,9 +1844,14 @@ foreach target : target_dirs
>>       }]
>>     endif
>>     foreach exe: execs
>> -    emulators += {exe['name']:
>> -         executable(exe['name'], exe['sources'],
>> -               install: true,
>> +    exe_name = exe['name']
>> +    exe_sign = 'CONFIG_HVF' in config_target
> I don't have Apple Silicon HW, but it may require a different kind of
> entitlement for CONFIG_TCG:
>
> https://developer.apple.com/documentation/apple_silicon/porting_just-in-time_compilers_to_apple_silicon


You only need the JIT entitlement for the App Store. Locally signed 
applications work just fine without it. I don't know about binaries you 
download from the internet that were signed with a developer key, though.

Keep in mind that for this to work you also need the MAP_JIT and RWX 
toggle changes from another patch set on the ML.


Alex





* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-27 20:00   ` Roman Bolshakov
@ 2020-11-27 21:55     ` Alexander Graf
  2020-11-27 23:30       ` Frank Yang
  0 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-27 21:55 UTC (permalink / raw)
  To: Roman Bolshakov
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Claudio Fontana, Paolo Bonzini


On 27.11.20 21:00, Roman Bolshakov wrote:
> On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>> Until now, Hypervisor.framework has only been available on x86_64 systems.
>> With Apple Silicon shipping now, it extends its reach to aarch64. To
>> prepare for support for multiple architectures, let's move common code out
>> into its own accel directory.
>>
>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> ---
>>   MAINTAINERS                 |   9 +-
>>   accel/hvf/hvf-all.c         |  56 +++++
>>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>>   accel/hvf/meson.build       |   7 +
>>   accel/meson.build           |   1 +
>>   include/sysemu/hvf_int.h    |  69 ++++++
>>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>   target/i386/hvf/hvf-cpus.h  |  25 --
>>   target/i386/hvf/hvf-i386.h  |  48 +---
>>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>   target/i386/hvf/meson.build |   1 -
>>   target/i386/hvf/x86hvf.c    |  11 +-
>>   target/i386/hvf/x86hvf.h    |   2 -
>>   13 files changed, 619 insertions(+), 569 deletions(-)
>>   create mode 100644 accel/hvf/hvf-all.c
>>   create mode 100644 accel/hvf/hvf-cpus.c
>>   create mode 100644 accel/hvf/meson.build
>>   create mode 100644 include/sysemu/hvf_int.h
>>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 68bc160f41..ca4b6d9279 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>   W: https://wiki.qemu.org/Features/HVF
>>   S: Maintained
>> -F: accel/stubs/hvf-stub.c
> There was a patch for that in the RFC series from Claudio.


Yeah, I'm not worried about this hunk :).


>
>>   F: target/i386/hvf/
>> +
>> +HVF
>> +M: Cameron Esfahani <dirty@apple.com>
>> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>> +W: https://wiki.qemu.org/Features/HVF
>> +S: Maintained
>> +F: accel/hvf/
>>   F: include/sysemu/hvf.h
>> +F: include/sysemu/hvf_int.h
>>   
>>   WHPX CPUs
>>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>> new file mode 100644
>> index 0000000000..47d77a472a
>> --- /dev/null
>> +++ b/accel/hvf/hvf-all.c
>> @@ -0,0 +1,56 @@
>> +/*
>> + * QEMU Hypervisor.framework support
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>> + * the COPYING file in the top-level directory.
>> + *
>> + * Contributions after 2012-01-13 are licensed under the terms of the
>> + * GNU GPL, version 2 or (at your option) any later version.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qemu/error-report.h"
>> +#include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>> +#include "sysemu/runstate.h"
>> +
>> +#include "qemu/main-loop.h"
>> +#include "sysemu/accel.h"
>> +
>> +#include <Hypervisor/Hypervisor.h>
>> +
>> +bool hvf_allowed;
>> +HVFState *hvf_state;
>> +
>> +void assert_hvf_ok(hv_return_t ret)
>> +{
>> +    if (ret == HV_SUCCESS) {
>> +        return;
>> +    }
>> +
>> +    switch (ret) {
>> +    case HV_ERROR:
>> +        error_report("Error: HV_ERROR");
>> +        break;
>> +    case HV_BUSY:
>> +        error_report("Error: HV_BUSY");
>> +        break;
>> +    case HV_BAD_ARGUMENT:
>> +        error_report("Error: HV_BAD_ARGUMENT");
>> +        break;
>> +    case HV_NO_RESOURCES:
>> +        error_report("Error: HV_NO_RESOURCES");
>> +        break;
>> +    case HV_NO_DEVICE:
>> +        error_report("Error: HV_NO_DEVICE");
>> +        break;
>> +    case HV_UNSUPPORTED:
>> +        error_report("Error: HV_UNSUPPORTED");
>> +        break;
>> +    default:
>> +        error_report("Unknown Error");
>> +    }
>> +
>> +    abort();
>> +}
>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> new file mode 100644
>> index 0000000000..f9bb5502b7
>> --- /dev/null
>> +++ b/accel/hvf/hvf-cpus.c
>> @@ -0,0 +1,468 @@
>> +/*
>> + * Copyright 2008 IBM Corporation
>> + *           2008 Red Hat, Inc.
>> + * Copyright 2011 Intel Corporation
>> + * Copyright 2016 Veertu, Inc.
>> + * Copyright 2017 The Android Open Source Project
>> + *
>> + * QEMU Hypervisor.framework support
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of version 2 of the GNU General Public
>> + * License as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + *
>> + * This file contain code under public domain from the hvdos project:
>> + * https://github.com/mist64/hvdos
>> + *
>> + * Parts Copyright (c) 2011 NetApp, Inc.
>> + * All rights reserved.
>> + *
>> + * Redistribution and use in source and binary forms, with or without
>> + * modification, are permitted provided that the following conditions
>> + * are met:
>> + * 1. Redistributions of source code must retain the above copyright
>> + *    notice, this list of conditions and the following disclaimer.
>> + * 2. Redistributions in binary form must reproduce the above copyright
>> + *    notice, this list of conditions and the following disclaimer in the
>> + *    documentation and/or other materials provided with the distribution.
>> + *
>> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> + * SUCH DAMAGE.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/main-loop.h"
>> +#include "exec/address-spaces.h"
>> +#include "exec/exec-all.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>> +#include "sysemu/runstate.h"
>> +#include "qemu/guest-random.h"
>> +
>> +#include <Hypervisor/Hypervisor.h>
>> +
>> +/* Memory slots */
>> +
>> +struct mac_slot {
>> +    int present;
>> +    uint64_t size;
>> +    uint64_t gpa_start;
>> +    uint64_t gva;
>> +};
>> +
>> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> +{
>> +    hvf_slot *slot;
>> +    int x;
>> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> +        slot = &hvf_state->slots[x];
>> +        if (slot->size && start < (slot->start + slot->size) &&
>> +            (start + size) > slot->start) {
>> +            return slot;
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>> +struct mac_slot mac_slots[32];
>> +
>> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> +{
>> +    struct mac_slot *macslot;
>> +    hv_return_t ret;
>> +
>> +    macslot = &mac_slots[slot->slot_id];
>> +
>> +    if (macslot->present) {
>> +        if (macslot->size != slot->size) {
>> +            macslot->present = 0;
>> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> +            assert_hvf_ok(ret);
>> +        }
>> +    }
>> +
>> +    if (!slot->size) {
>> +        return 0;
>> +    }
>> +
>> +    macslot->present = 1;
>> +    macslot->gpa_start = slot->start;
>> +    macslot->size = slot->size;
>> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>> +    assert_hvf_ok(ret);
>> +    return 0;
>> +}
>> +
>> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> +{
>> +    hvf_slot *mem;
>> +    MemoryRegion *area = section->mr;
>> +    bool writeable = !area->readonly && !area->rom_device;
>> +    hv_memory_flags_t flags;
>> +
>> +    if (!memory_region_is_ram(area)) {
>> +        if (writeable) {
>> +            return;
>> +        } else if (!memory_region_is_romd(area)) {
>> +            /*
>> +             * If the memory device is not in romd_mode, then we actually want
>> +             * to remove the hvf memory slot so all accesses will trap.
>> +             */
>> +             add = false;
>> +        }
>> +    }
>> +
>> +    mem = hvf_find_overlap_slot(
>> +            section->offset_within_address_space,
>> +            int128_get64(section->size));
>> +
>> +    if (mem && add) {
>> +        if (mem->size == int128_get64(section->size) &&
>> +            mem->start == section->offset_within_address_space &&
>> +            mem->mem == (memory_region_get_ram_ptr(area) +
>> +            section->offset_within_region)) {
>> +            return; /* Same region was attempted to register, go away. */
>> +        }
>> +    }
>> +
>> +    /* Region needs to be reset. set the size to 0 and remap it. */
>> +    if (mem) {
>> +        mem->size = 0;
>> +        if (do_hvf_set_memory(mem, 0)) {
>> +            error_report("Failed to reset overlapping slot");
>> +            abort();
>> +        }
>> +    }
>> +
>> +    if (!add) {
>> +        return;
>> +    }
>> +
>> +    if (area->readonly ||
>> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> +    } else {
>> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> +    }
>> +
>> +    /* Now make a new slot. */
>> +    int x;
>> +
>> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> +        mem = &hvf_state->slots[x];
>> +        if (!mem->size) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (x == hvf_state->num_slots) {
>> +        error_report("No free slots");
>> +        abort();
>> +    }
>> +
>> +    mem->size = int128_get64(section->size);
>> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>> +    mem->start = section->offset_within_address_space;
>> +    mem->region = area;
>> +
>> +    if (do_hvf_set_memory(mem, flags)) {
>> +        error_report("Error registering new memory slot");
>> +        abort();
>> +    }
>> +}
>> +
>> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>> +{
>> +    hvf_slot *slot;
>> +
>> +    slot = hvf_find_overlap_slot(
>> +            section->offset_within_address_space,
>> +            int128_get64(section->size));
>> +
>> +    /* protect region against writes; begin tracking it */
>> +    if (on) {
>> +        slot->flags |= HVF_SLOT_LOG;
>> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> +                      HV_MEMORY_READ);
>> +    /* stop tracking region*/
>> +    } else {
>> +        slot->flags &= ~HVF_SLOT_LOG;
>> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> +    }
>> +}
>> +
>> +static void hvf_log_start(MemoryListener *listener,
>> +                          MemoryRegionSection *section, int old, int new)
>> +{
>> +    if (old != 0) {
>> +        return;
>> +    }
>> +
>> +    hvf_set_dirty_tracking(section, 1);
>> +}
>> +
>> +static void hvf_log_stop(MemoryListener *listener,
>> +                         MemoryRegionSection *section, int old, int new)
>> +{
>> +    if (new != 0) {
>> +        return;
>> +    }
>> +
>> +    hvf_set_dirty_tracking(section, 0);
>> +}
>> +
>> +static void hvf_log_sync(MemoryListener *listener,
>> +                         MemoryRegionSection *section)
>> +{
>> +    /*
>> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>> +     * tracking the region.
>> +     */
>> +    hvf_set_dirty_tracking(section, 1);
>> +}
>> +
>> +static void hvf_region_add(MemoryListener *listener,
>> +                           MemoryRegionSection *section)
>> +{
>> +    hvf_set_phys_mem(section, true);
>> +}
>> +
>> +static void hvf_region_del(MemoryListener *listener,
>> +                           MemoryRegionSection *section)
>> +{
>> +    hvf_set_phys_mem(section, false);
>> +}
>> +
>> +static MemoryListener hvf_memory_listener = {
>> +    .priority = 10,
>> +    .region_add = hvf_region_add,
>> +    .region_del = hvf_region_del,
>> +    .log_start = hvf_log_start,
>> +    .log_stop = hvf_log_stop,
>> +    .log_sync = hvf_log_sync,
>> +};
>> +
>> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>> +{
>> +    if (!cpu->vcpu_dirty) {
>> +        hvf_get_registers(cpu);
>> +        cpu->vcpu_dirty = true;
>> +    }
>> +}
>> +
>> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>> +{
>> +    if (!cpu->vcpu_dirty) {
>> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>> +    }
>> +}
>> +
>> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> +                                              run_on_cpu_data arg)
>> +{
>> +    hvf_put_registers(cpu);
>> +    cpu->vcpu_dirty = false;
>> +}
>> +
>> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> +{
>> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>> +}
>> +
>> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> +                                             run_on_cpu_data arg)
>> +{
>> +    hvf_put_registers(cpu);
>> +    cpu->vcpu_dirty = false;
>> +}
>> +
>> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> +{
>> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>> +}
>> +
>> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> +                                              run_on_cpu_data arg)
>> +{
>> +    cpu->vcpu_dirty = true;
>> +}
>> +
>> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> +{
>> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>> +}
>> +
>> +static void hvf_vcpu_destroy(CPUState *cpu)
>> +{
>> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>> +    assert_hvf_ok(ret);
>> +
>> +    hvf_arch_vcpu_destroy(cpu);
>> +}
>> +
>> +static void dummy_signal(int sig)
>> +{
>> +}
>> +
>> +static int hvf_init_vcpu(CPUState *cpu)
>> +{
>> +    int r;
>> +
>> +    /* init cpu signals */
>> +    sigset_t set;
>> +    struct sigaction sigact;
>> +
>> +    memset(&sigact, 0, sizeof(sigact));
>> +    sigact.sa_handler = dummy_signal;
>> +    sigaction(SIG_IPI, &sigact, NULL);
>> +
>> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> +    sigdelset(&set, SIG_IPI);
>> +
>> +#ifdef __aarch64__
>> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>> +#else
>> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> +#endif
> I think the first __aarch64__ bit fits better in the arm part of the series.


Oops. Thanks for catching it! Yes, absolutely. It should be part of the 
ARM enablement.


>
>> +    cpu->vcpu_dirty = 1;
>> +    assert_hvf_ok(r);
>> +
>> +    return hvf_arch_init_vcpu(cpu);
>> +}
>> +
>> +/*
>> + * The HVF-specific vCPU thread function. This one should only run when the host
>> + * CPU supports the VMX "unrestricted guest" feature.
>> + */
>> +static void *hvf_cpu_thread_fn(void *arg)
>> +{
>> +    CPUState *cpu = arg;
>> +
>> +    int r;
>> +
>> +    assert(hvf_enabled());
>> +
>> +    rcu_register_thread();
>> +
>> +    qemu_mutex_lock_iothread();
>> +    qemu_thread_get_self(cpu->thread);
>> +
>> +    cpu->thread_id = qemu_get_thread_id();
>> +    cpu->can_do_io = 1;
>> +    current_cpu = cpu;
>> +
>> +    hvf_init_vcpu(cpu);
>> +
>> +    /* signal CPU creation */
>> +    cpu_thread_signal_created(cpu);
>> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> +
>> +    do {
>> +        if (cpu_can_run(cpu)) {
>> +            r = hvf_vcpu_exec(cpu);
>> +            if (r == EXCP_DEBUG) {
>> +                cpu_handle_guest_debug(cpu);
>> +            }
>> +        }
>> +        qemu_wait_io_event(cpu);
>> +    } while (!cpu->unplug || cpu_can_run(cpu));
>> +
>> +    hvf_vcpu_destroy(cpu);
>> +    cpu_thread_signal_destroyed(cpu);
>> +    qemu_mutex_unlock_iothread();
>> +    rcu_unregister_thread();
>> +    return NULL;
>> +}
>> +
>> +static void hvf_start_vcpu_thread(CPUState *cpu)
>> +{
>> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>> +
>> +    /*
>> +     * HVF currently does not support TCG, and only runs in
>> +     * unrestricted-guest mode.
>> +     */
>> +    assert(hvf_enabled());
>> +
>> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> +    qemu_cond_init(cpu->halt_cond);
>> +
>> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> +             cpu->cpu_index);
>> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> +                       cpu, QEMU_THREAD_JOINABLE);
>> +}
>> +
>> +static const CpusAccel hvf_cpus = {
>> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>> +
>> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> +    .synchronize_state = hvf_cpu_synchronize_state,
>> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> +};
>> +
>> +static int hvf_accel_init(MachineState *ms)
>> +{
>> +    int x;
>> +    hv_return_t ret;
>> +    HVFState *s;
>> +
>> +    ret = hv_vm_create(HV_VM_DEFAULT);
>> +    assert_hvf_ok(ret);
>> +
>> +    s = g_new0(HVFState, 1);
>> +
>> +    s->num_slots = 32;
>> +    for (x = 0; x < s->num_slots; ++x) {
>> +        s->slots[x].size = 0;
>> +        s->slots[x].slot_id = x;
>> +    }
>> +
>> +    hvf_state = s;
>> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>> +    cpus_register_accel(&hvf_cpus);
>> +    return 0;
>> +}
>> +
>> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> +{
>> +    AccelClass *ac = ACCEL_CLASS(oc);
>> +    ac->name = "HVF";
>> +    ac->init_machine = hvf_accel_init;
>> +    ac->allowed = &hvf_allowed;
>> +}
>> +
>> +static const TypeInfo hvf_accel_type = {
>> +    .name = TYPE_HVF_ACCEL,
>> +    .parent = TYPE_ACCEL,
>> +    .class_init = hvf_accel_class_init,
>> +};
>> +
>> +static void hvf_type_init(void)
>> +{
>> +    type_register_static(&hvf_accel_type);
>> +}
>> +
>> +type_init(hvf_type_init);
>> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>> new file mode 100644
>> index 0000000000..dfd6b68dc7
>> --- /dev/null
>> +++ b/accel/hvf/meson.build
>> @@ -0,0 +1,7 @@
>> +hvf_ss = ss.source_set()
>> +hvf_ss.add(files(
>> +  'hvf-all.c',
>> +  'hvf-cpus.c',
>> +))
>> +
>> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>> diff --git a/accel/meson.build b/accel/meson.build
>> index b26cca227a..6de12ce5d5 100644
>> --- a/accel/meson.build
>> +++ b/accel/meson.build
>> @@ -1,5 +1,6 @@
>>   softmmu_ss.add(files('accel.c'))
>>   
>> +subdir('hvf')
>>   subdir('qtest')
>>   subdir('kvm')
>>   subdir('tcg')
>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>> new file mode 100644
>> index 0000000000..de9bad23a8
>> --- /dev/null
>> +++ b/include/sysemu/hvf_int.h
>> @@ -0,0 +1,69 @@
>> +/*
>> + * QEMU Hypervisor.framework (HVF) support
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +/* header to be included in HVF-specific code */
>> +
>> +#ifndef HVF_INT_H
>> +#define HVF_INT_H
>> +
>> +#include <Hypervisor/Hypervisor.h>
>> +
>> +#define HVF_MAX_VCPU 0x10
>> +
>> +extern struct hvf_state hvf_global;
>> +
>> +struct hvf_vm {
>> +    int id;
>> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> +};
>> +
>> +struct hvf_state {
>> +    uint32_t version;
>> +    struct hvf_vm *vm;
>> +    uint64_t mem_quota;
>> +};
>> +
>> +/* hvf_slot flags */
>> +#define HVF_SLOT_LOG (1 << 0)
>> +
>> +typedef struct hvf_slot {
>> +    uint64_t start;
>> +    uint64_t size;
>> +    uint8_t *mem;
>> +    int slot_id;
>> +    uint32_t flags;
>> +    MemoryRegion *region;
>> +} hvf_slot;
>> +
>> +typedef struct hvf_vcpu_caps {
>> +    uint64_t vmx_cap_pinbased;
>> +    uint64_t vmx_cap_procbased;
>> +    uint64_t vmx_cap_procbased2;
>> +    uint64_t vmx_cap_entry;
>> +    uint64_t vmx_cap_exit;
>> +    uint64_t vmx_cap_preemption_timer;
>> +} hvf_vcpu_caps;
>> +
>> +struct HVFState {
>> +    AccelState parent;
>> +    hvf_slot slots[32];
>> +    int num_slots;
>> +
>> +    hvf_vcpu_caps *hvf_caps;
>> +};
>> +extern HVFState *hvf_state;
>> +
>> +void assert_hvf_ok(hv_return_t ret);
>> +int hvf_get_registers(CPUState *cpu);
>> +int hvf_put_registers(CPUState *cpu);
>> +int hvf_arch_init_vcpu(CPUState *cpu);
>> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>> +int hvf_vcpu_exec(CPUState *cpu);
>> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>> +
>> +#endif
>> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>> deleted file mode 100644
>> index 817b3d7452..0000000000
>> --- a/target/i386/hvf/hvf-cpus.c
>> +++ /dev/null
>> @@ -1,131 +0,0 @@
>> -/*
>> - * Copyright 2008 IBM Corporation
>> - *           2008 Red Hat, Inc.
>> - * Copyright 2011 Intel Corporation
>> - * Copyright 2016 Veertu, Inc.
>> - * Copyright 2017 The Android Open Source Project
>> - *
>> - * QEMU Hypervisor.framework support
>> - *
>> - * This program is free software; you can redistribute it and/or
>> - * modify it under the terms of version 2 of the GNU General Public
>> - * License as published by the Free Software Foundation.
>> - *
>> - * This program is distributed in the hope that it will be useful,
>> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> - * General Public License for more details.
>> - *
>> - * You should have received a copy of the GNU General Public License
>> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> - *
>> - * This file contain code under public domain from the hvdos project:
>> - * https://github.com/mist64/hvdos
>> - *
>> - * Parts Copyright (c) 2011 NetApp, Inc.
>> - * All rights reserved.
>> - *
>> - * Redistribution and use in source and binary forms, with or without
>> - * modification, are permitted provided that the following conditions
>> - * are met:
>> - * 1. Redistributions of source code must retain the above copyright
>> - *    notice, this list of conditions and the following disclaimer.
>> - * 2. Redistributions in binary form must reproduce the above copyright
>> - *    notice, this list of conditions and the following disclaimer in the
>> - *    documentation and/or other materials provided with the distribution.
>> - *
>> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>> - * SUCH DAMAGE.
>> - */
>> -
>> -#include "qemu/osdep.h"
>> -#include "qemu/error-report.h"
>> -#include "qemu/main-loop.h"
>> -#include "sysemu/hvf.h"
>> -#include "sysemu/runstate.h"
>> -#include "target/i386/cpu.h"
>> -#include "qemu/guest-random.h"
>> -
>> -#include "hvf-cpus.h"
>> -
>> -/*
>> - * The HVF-specific vCPU thread function. This one should only run when the host
>> - * CPU supports the VMX "unrestricted guest" feature.
>> - */
>> -static void *hvf_cpu_thread_fn(void *arg)
>> -{
>> -    CPUState *cpu = arg;
>> -
>> -    int r;
>> -
>> -    assert(hvf_enabled());
>> -
>> -    rcu_register_thread();
>> -
>> -    qemu_mutex_lock_iothread();
>> -    qemu_thread_get_self(cpu->thread);
>> -
>> -    cpu->thread_id = qemu_get_thread_id();
>> -    cpu->can_do_io = 1;
>> -    current_cpu = cpu;
>> -
>> -    hvf_init_vcpu(cpu);
>> -
>> -    /* signal CPU creation */
>> -    cpu_thread_signal_created(cpu);
>> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> -
>> -    do {
>> -        if (cpu_can_run(cpu)) {
>> -            r = hvf_vcpu_exec(cpu);
>> -            if (r == EXCP_DEBUG) {
>> -                cpu_handle_guest_debug(cpu);
>> -            }
>> -        }
>> -        qemu_wait_io_event(cpu);
>> -    } while (!cpu->unplug || cpu_can_run(cpu));
>> -
>> -    hvf_vcpu_destroy(cpu);
>> -    cpu_thread_signal_destroyed(cpu);
>> -    qemu_mutex_unlock_iothread();
>> -    rcu_unregister_thread();
>> -    return NULL;
>> -}
>> -
>> -static void hvf_start_vcpu_thread(CPUState *cpu)
>> -{
>> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>> -
>> -    /*
>> -     * HVF currently does not support TCG, and only runs in
>> -     * unrestricted-guest mode.
>> -     */
>> -    assert(hvf_enabled());
>> -
>> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> -    qemu_cond_init(cpu->halt_cond);
>> -
>> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> -             cpu->cpu_index);
>> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> -                       cpu, QEMU_THREAD_JOINABLE);
>> -}
>> -
>> -const CpusAccel hvf_cpus = {
>> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>> -
>> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> -    .synchronize_state = hvf_cpu_synchronize_state,
>> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> -};
>> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>> deleted file mode 100644
>> index ced31b82c0..0000000000
>> --- a/target/i386/hvf/hvf-cpus.h
>> +++ /dev/null
>> @@ -1,25 +0,0 @@
>> -/*
>> - * Accelerator CPUS Interface
>> - *
>> - * Copyright 2020 SUSE LLC
>> - *
>> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> - * See the COPYING file in the top-level directory.
>> - */
>> -
>> -#ifndef HVF_CPUS_H
>> -#define HVF_CPUS_H
>> -
>> -#include "sysemu/cpus.h"
>> -
>> -extern const CpusAccel hvf_cpus;
>> -
>> -int hvf_init_vcpu(CPUState *);
>> -int hvf_vcpu_exec(CPUState *);
>> -void hvf_cpu_synchronize_state(CPUState *);
>> -void hvf_cpu_synchronize_post_reset(CPUState *);
>> -void hvf_cpu_synchronize_post_init(CPUState *);
>> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>> -void hvf_vcpu_destroy(CPUState *);
>> -
>> -#endif /* HVF_CPUS_H */
>> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>> index e0edffd077..6d56f8f6bb 100644
>> --- a/target/i386/hvf/hvf-i386.h
>> +++ b/target/i386/hvf/hvf-i386.h
>> @@ -18,57 +18,11 @@
>>   
>>   #include "sysemu/accel.h"
>>   #include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>>   #include "cpu.h"
>>   #include "x86.h"
>>   
>> -#define HVF_MAX_VCPU 0x10
>> -
>> -extern struct hvf_state hvf_global;
>> -
>> -struct hvf_vm {
>> -    int id;
>> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> -};
>> -
>> -struct hvf_state {
>> -    uint32_t version;
>> -    struct hvf_vm *vm;
>> -    uint64_t mem_quota;
>> -};
>> -
>> -/* hvf_slot flags */
>> -#define HVF_SLOT_LOG (1 << 0)
>> -
>> -typedef struct hvf_slot {
>> -    uint64_t start;
>> -    uint64_t size;
>> -    uint8_t *mem;
>> -    int slot_id;
>> -    uint32_t flags;
>> -    MemoryRegion *region;
>> -} hvf_slot;
>> -
>> -typedef struct hvf_vcpu_caps {
>> -    uint64_t vmx_cap_pinbased;
>> -    uint64_t vmx_cap_procbased;
>> -    uint64_t vmx_cap_procbased2;
>> -    uint64_t vmx_cap_entry;
>> -    uint64_t vmx_cap_exit;
>> -    uint64_t vmx_cap_preemption_timer;
>> -} hvf_vcpu_caps;
>> -
>> -struct HVFState {
>> -    AccelState parent;
>> -    hvf_slot slots[32];
>> -    int num_slots;
>> -
>> -    hvf_vcpu_caps *hvf_caps;
>> -};
>> -extern HVFState *hvf_state;
>> -
>> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>   
>>   #ifdef NEED_CPU_H
>>   /* Functions exported to host specific mode */
>> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>> index ed9356565c..8b96ecd619 100644
>> --- a/target/i386/hvf/hvf.c
>> +++ b/target/i386/hvf/hvf.c
>> @@ -51,6 +51,7 @@
>>   #include "qemu/error-report.h"
>>   
>>   #include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>>   #include "sysemu/runstate.h"
>>   #include "hvf-i386.h"
>>   #include "vmcs.h"
>> @@ -72,171 +73,6 @@
>>   #include "sysemu/accel.h"
>>   #include "target/i386/cpu.h"
>>   
>> -#include "hvf-cpus.h"
>> -
>> -HVFState *hvf_state;
>> -
>> -static void assert_hvf_ok(hv_return_t ret)
>> -{
>> -    if (ret == HV_SUCCESS) {
>> -        return;
>> -    }
>> -
>> -    switch (ret) {
>> -    case HV_ERROR:
>> -        error_report("Error: HV_ERROR");
>> -        break;
>> -    case HV_BUSY:
>> -        error_report("Error: HV_BUSY");
>> -        break;
>> -    case HV_BAD_ARGUMENT:
>> -        error_report("Error: HV_BAD_ARGUMENT");
>> -        break;
>> -    case HV_NO_RESOURCES:
>> -        error_report("Error: HV_NO_RESOURCES");
>> -        break;
>> -    case HV_NO_DEVICE:
>> -        error_report("Error: HV_NO_DEVICE");
>> -        break;
>> -    case HV_UNSUPPORTED:
>> -        error_report("Error: HV_UNSUPPORTED");
>> -        break;
>> -    default:
>> -        error_report("Unknown Error");
>> -    }
>> -
>> -    abort();
>> -}
>> -
>> -/* Memory slots */
>> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> -{
>> -    hvf_slot *slot;
>> -    int x;
>> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> -        slot = &hvf_state->slots[x];
>> -        if (slot->size && start < (slot->start + slot->size) &&
>> -            (start + size) > slot->start) {
>> -            return slot;
>> -        }
>> -    }
>> -    return NULL;
>> -}
>> -
>> -struct mac_slot {
>> -    int present;
>> -    uint64_t size;
>> -    uint64_t gpa_start;
>> -    uint64_t gva;
>> -};
>> -
>> -struct mac_slot mac_slots[32];
>> -
>> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> -{
>> -    struct mac_slot *macslot;
>> -    hv_return_t ret;
>> -
>> -    macslot = &mac_slots[slot->slot_id];
>> -
>> -    if (macslot->present) {
>> -        if (macslot->size != slot->size) {
>> -            macslot->present = 0;
>> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> -            assert_hvf_ok(ret);
>> -        }
>> -    }
>> -
>> -    if (!slot->size) {
>> -        return 0;
>> -    }
>> -
>> -    macslot->present = 1;
>> -    macslot->gpa_start = slot->start;
>> -    macslot->size = slot->size;
>> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>> -    assert_hvf_ok(ret);
>> -    return 0;
>> -}
>> -
>> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> -{
>> -    hvf_slot *mem;
>> -    MemoryRegion *area = section->mr;
>> -    bool writeable = !area->readonly && !area->rom_device;
>> -    hv_memory_flags_t flags;
>> -
>> -    if (!memory_region_is_ram(area)) {
>> -        if (writeable) {
>> -            return;
>> -        } else if (!memory_region_is_romd(area)) {
>> -            /*
>> -             * If the memory device is not in romd_mode, then we actually want
>> -             * to remove the hvf memory slot so all accesses will trap.
>> -             */
>> -             add = false;
>> -        }
>> -    }
>> -
>> -    mem = hvf_find_overlap_slot(
>> -            section->offset_within_address_space,
>> -            int128_get64(section->size));
>> -
>> -    if (mem && add) {
>> -        if (mem->size == int128_get64(section->size) &&
>> -            mem->start == section->offset_within_address_space &&
>> -            mem->mem == (memory_region_get_ram_ptr(area) +
>> -            section->offset_within_region)) {
>> -            return; /* Same region was attempted to register, go away. */
>> -        }
>> -    }
>> -
>> -    /* Region needs to be reset. set the size to 0 and remap it. */
>> -    if (mem) {
>> -        mem->size = 0;
>> -        if (do_hvf_set_memory(mem, 0)) {
>> -            error_report("Failed to reset overlapping slot");
>> -            abort();
>> -        }
>> -    }
>> -
>> -    if (!add) {
>> -        return;
>> -    }
>> -
>> -    if (area->readonly ||
>> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> -    } else {
>> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> -    }
>> -
>> -    /* Now make a new slot. */
>> -    int x;
>> -
>> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> -        mem = &hvf_state->slots[x];
>> -        if (!mem->size) {
>> -            break;
>> -        }
>> -    }
>> -
>> -    if (x == hvf_state->num_slots) {
>> -        error_report("No free slots");
>> -        abort();
>> -    }
>> -
>> -    mem->size = int128_get64(section->size);
>> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>> -    mem->start = section->offset_within_address_space;
>> -    mem->region = area;
>> -
>> -    if (do_hvf_set_memory(mem, flags)) {
>> -        error_report("Error registering new memory slot");
>> -        abort();
>> -    }
>> -}
>> -
>>   void vmx_update_tpr(CPUState *cpu)
>>   {
>>       /* TODO: need integrate APIC handling */
>> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>       }
>>   }
>>   
>> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>> -{
>> -    if (!cpu->vcpu_dirty) {
>> -        hvf_get_registers(cpu);
>> -        cpu->vcpu_dirty = true;
>> -    }
>> -}
>> -
>> -void hvf_cpu_synchronize_state(CPUState *cpu)
>> -{
>> -    if (!cpu->vcpu_dirty) {
>> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>> -    }
>> -}
>> -
>> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> -                                              run_on_cpu_data arg)
>> -{
>> -    hvf_put_registers(cpu);
>> -    cpu->vcpu_dirty = false;
>> -}
>> -
>> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> -{
>> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>> -}
>> -
>> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> -                                             run_on_cpu_data arg)
>> -{
>> -    hvf_put_registers(cpu);
>> -    cpu->vcpu_dirty = false;
>> -}
>> -
>> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> -{
>> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>> -}
>> -
>> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> -                                              run_on_cpu_data arg)
>> -{
>> -    cpu->vcpu_dirty = true;
>> -}
>> -
>> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> -{
>> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>> -}
>> -
>>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>   {
>>       int read, write;
>> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>       return false;
>>   }
>>   
>> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>> -{
>> -    hvf_slot *slot;
>> -
>> -    slot = hvf_find_overlap_slot(
>> -            section->offset_within_address_space,
>> -            int128_get64(section->size));
>> -
>> -    /* protect region against writes; begin tracking it */
>> -    if (on) {
>> -        slot->flags |= HVF_SLOT_LOG;
>> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> -                      HV_MEMORY_READ);
>> -    /* stop tracking region*/
>> -    } else {
>> -        slot->flags &= ~HVF_SLOT_LOG;
>> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> -    }
>> -}
>> -
>> -static void hvf_log_start(MemoryListener *listener,
>> -                          MemoryRegionSection *section, int old, int new)
>> -{
>> -    if (old != 0) {
>> -        return;
>> -    }
>> -
>> -    hvf_set_dirty_tracking(section, 1);
>> -}
>> -
>> -static void hvf_log_stop(MemoryListener *listener,
>> -                         MemoryRegionSection *section, int old, int new)
>> -{
>> -    if (new != 0) {
>> -        return;
>> -    }
>> -
>> -    hvf_set_dirty_tracking(section, 0);
>> -}
>> -
>> -static void hvf_log_sync(MemoryListener *listener,
>> -                         MemoryRegionSection *section)
>> -{
>> -    /*
>> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>> -     * tracking the region.
>> -     */
>> -    hvf_set_dirty_tracking(section, 1);
>> -}
>> -
>> -static void hvf_region_add(MemoryListener *listener,
>> -                           MemoryRegionSection *section)
>> -{
>> -    hvf_set_phys_mem(section, true);
>> -}
>> -
>> -static void hvf_region_del(MemoryListener *listener,
>> -                           MemoryRegionSection *section)
>> -{
>> -    hvf_set_phys_mem(section, false);
>> -}
>> -
>> -static MemoryListener hvf_memory_listener = {
>> -    .priority = 10,
>> -    .region_add = hvf_region_add,
>> -    .region_del = hvf_region_del,
>> -    .log_start = hvf_log_start,
>> -    .log_stop = hvf_log_stop,
>> -    .log_sync = hvf_log_sync,
>> -};
>> -
>> -void hvf_vcpu_destroy(CPUState *cpu)
>> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>   {
>>       X86CPU *x86_cpu = X86_CPU(cpu);
>>       CPUX86State *env = &x86_cpu->env;
>>   
>> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>       g_free(env->hvf_mmio_buf);
>> -    assert_hvf_ok(ret);
>> -}
>> -
>> -static void dummy_signal(int sig)
>> -{
>>   }
>>   
>> -int hvf_init_vcpu(CPUState *cpu)
>> +int hvf_arch_init_vcpu(CPUState *cpu)
>>   {
>>   
>>       X86CPU *x86cpu = X86_CPU(cpu);
>>       CPUX86State *env = &x86cpu->env;
>> -    int r;
>> -
>> -    /* init cpu signals */
>> -    sigset_t set;
>> -    struct sigaction sigact;
>> -
>> -    memset(&sigact, 0, sizeof(sigact));
>> -    sigact.sa_handler = dummy_signal;
>> -    sigaction(SIG_IPI, &sigact, NULL);
>> -
>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> -    sigdelset(&set, SIG_IPI);
>>   
>>       init_emu();
>>       init_decoder();
>> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>       env->hvf_mmio_buf = g_new(char, 4096);
>>   
>> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> -    cpu->vcpu_dirty = 1;
>> -    assert_hvf_ok(r);
>> -
>>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>           abort();
>> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>   
>>       return ret;
>>   }
>> -
>> -bool hvf_allowed;
>> -
>> -static int hvf_accel_init(MachineState *ms)
>> -{
>> -    int x;
>> -    hv_return_t ret;
>> -    HVFState *s;
>> -
>> -    ret = hv_vm_create(HV_VM_DEFAULT);
>> -    assert_hvf_ok(ret);
>> -
>> -    s = g_new0(HVFState, 1);
>> -
>> -    s->num_slots = 32;
>> -    for (x = 0; x < s->num_slots; ++x) {
>> -        s->slots[x].size = 0;
>> -        s->slots[x].slot_id = x;
>> -    }
>> -
>> -    hvf_state = s;
>> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>> -    cpus_register_accel(&hvf_cpus);
>> -    return 0;
>> -}
>> -
>> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> -{
>> -    AccelClass *ac = ACCEL_CLASS(oc);
>> -    ac->name = "HVF";
>> -    ac->init_machine = hvf_accel_init;
>> -    ac->allowed = &hvf_allowed;
>> -}
>> -
>> -static const TypeInfo hvf_accel_type = {
>> -    .name = TYPE_HVF_ACCEL,
>> -    .parent = TYPE_ACCEL,
>> -    .class_init = hvf_accel_class_init,
>> -};
>> -
>> -static void hvf_type_init(void)
>> -{
>> -    type_register_static(&hvf_accel_type);
>> -}
>> -
>> -type_init(hvf_type_init);
>> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>> index 409c9a3f14..c8a43717ee 100644
>> --- a/target/i386/hvf/meson.build
>> +++ b/target/i386/hvf/meson.build
>> @@ -1,6 +1,5 @@
>>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>     'hvf.c',
>> -  'hvf-cpus.c',
>>     'x86.c',
>>     'x86_cpuid.c',
>>     'x86_decode.c',
>> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>> index bbec412b6c..89b8e9d87a 100644
>> --- a/target/i386/hvf/x86hvf.c
>> +++ b/target/i386/hvf/x86hvf.c
>> @@ -20,6 +20,9 @@
>>   #include "qemu/osdep.h"
>>   
>>   #include "qemu-common.h"
>> +#include "sysemu/hvf.h"
>> +#include "sysemu/hvf_int.h"
>> +#include "sysemu/hw_accel.h"
>>   #include "x86hvf.h"
>>   #include "vmx.h"
>>   #include "vmcs.h"
>> @@ -32,8 +35,6 @@
>>   #include <Hypervisor/hv.h>
>>   #include <Hypervisor/hv_vmx.h>
>>   
>> -#include "hvf-cpus.h"
>> -
>>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>                        SegmentCache *qseg, bool is_tr)
>>   {
>> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>   
>>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>> -        hvf_cpu_synchronize_state(cpu_state);
>> +        cpu_synchronize_state(cpu_state);
>>           do_cpu_init(cpu);
>>       }
>>   
>> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>           cpu_state->halted = 0;
>>       }
>>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>> -        hvf_cpu_synchronize_state(cpu_state);
>> +        cpu_synchronize_state(cpu_state);
>>           do_cpu_sipi(cpu);
>>       }
>>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>> -        hvf_cpu_synchronize_state(cpu_state);
>> +        cpu_synchronize_state(cpu_state);
> The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
> be a separate patch. It follows the cpu/accel cleanups Claudio was
> doing over the summer.


The only reason they're in here is because we no longer have access to 
the hvf_ functions from the file. I am perfectly happy to rebase the 
patch on top of Claudio's if his goes in first. I'm sure it'll be 
trivial for him to rebase on top of this too if my series goes in first.


>
> Philippe raised the idea that the patch might go ahead of the
> ARM-specific part (which might involve some discussion) and I agree
> with that.
>
> Some sync between Claudio's series (CC'd him) and the patch might be
> needed.


I would prefer not to hold back because of the sync. Claudio's cleanup 
is trivial enough to adjust for if it gets merged ahead of this.


Alex





* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-27 21:55     ` Alexander Graf
@ 2020-11-27 23:30       ` Frank Yang
  2020-11-30 20:15         ` Frank Yang
  0 siblings, 1 reply; 64+ messages in thread
From: Frank Yang @ 2020-11-27 23:30 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini, Peter Collingbourne


Hi all,

+Peter Collingbourne <pcc@google.com>

I'm a developer on the Android Emulator, which is in a fork of QEMU.

Peter and I have been working on an HVF Apple Silicon backend with an eye
toward Android guests.

We have already gotten things to boot into Android userspace (at least
logcat/shell and graphics are available).

Our strategy so far has been to import logic from the KVM implementation
and hook into QEMU's software devices that previously were assumed to
work only with TCG, or that had KVM-specific paths.

Thanks to Alexander for the tip on the 36-bit address space limitation,
by the way; our way of addressing it is to still allow highmem, but not
to place the PCI high MMIO window so high.

Also, note that we have a sleep/signal-based mechanism to deal with WFx,
which might be worth looking into for Alexander's implementation as
well:
https://android-review.googlesource.com/c/platform/external/qemu/+/1512551

Patches so far, FYI:

https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3

https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6

Peter has also noticed that extra steps are needed on M1s to allow TCG
to work, since it involves JIT:

https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9

We'd appreciate any feedback/comments :)

Best,

Frank

On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:

>
> On 27.11.20 21:00, Roman Bolshakov wrote:
> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
> >> Until now, Hypervisor.framework has only been available on x86_64
> systems.
> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
> >> prepare for support for multiple architectures, let's move common code
> out
> >> into its own accel directory.
> >>
> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
> >> ---
> >>   MAINTAINERS                 |   9 +-
> >>   accel/hvf/hvf-all.c         |  56 +++++
> >>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
> >>   accel/hvf/meson.build       |   7 +
> >>   accel/meson.build           |   1 +
> >>   include/sysemu/hvf_int.h    |  69 ++++++
> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
> >>   target/i386/hvf/hvf-cpus.h  |  25 --
> >>   target/i386/hvf/hvf-i386.h  |  48 +---
> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
> >>   target/i386/hvf/meson.build |   1 -
> >>   target/i386/hvf/x86hvf.c    |  11 +-
> >>   target/i386/hvf/x86hvf.h    |   2 -
> >>   13 files changed, 619 insertions(+), 569 deletions(-)
> >>   create mode 100644 accel/hvf/hvf-all.c
> >>   create mode 100644 accel/hvf/hvf-cpus.c
> >>   create mode 100644 accel/hvf/meson.build
> >>   create mode 100644 include/sysemu/hvf_int.h
> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 68bc160f41..ca4b6d9279 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
> >>   W: https://wiki.qemu.org/Features/HVF
> >>   S: Maintained
> >> -F: accel/stubs/hvf-stub.c
> > There was a patch for that in the RFC series from Claudio.
>
>
> Yeah, I'm not worried about this hunk :).
>
>
> >
> >>   F: target/i386/hvf/
> >> +
> >> +HVF
> >> +M: Cameron Esfahani <dirty@apple.com>
> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
> >> +W: https://wiki.qemu.org/Features/HVF
> >> +S: Maintained
> >> +F: accel/hvf/
> >>   F: include/sysemu/hvf.h
> >> +F: include/sysemu/hvf_int.h
> >>
> >>   WHPX CPUs
> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
> >> new file mode 100644
> >> index 0000000000..47d77a472a
> >> --- /dev/null
> >> +++ b/accel/hvf/hvf-all.c
> >> @@ -0,0 +1,56 @@
> >> +/*
> >> + * QEMU Hypervisor.framework support
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2.
> See
> >> + * the COPYING file in the top-level directory.
> >> + *
> >> + * Contributions after 2012-01-13 are licensed under the terms of the
> >> + * GNU GPL, version 2 or (at your option) any later version.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu-common.h"
> >> +#include "qemu/error-report.h"
> >> +#include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >> +#include "sysemu/runstate.h"
> >> +
> >> +#include "qemu/main-loop.h"
> >> +#include "sysemu/accel.h"
> >> +
> >> +#include <Hypervisor/Hypervisor.h>
> >> +
> >> +bool hvf_allowed;
> >> +HVFState *hvf_state;
> >> +
> >> +void assert_hvf_ok(hv_return_t ret)
> >> +{
> >> +    if (ret == HV_SUCCESS) {
> >> +        return;
> >> +    }
> >> +
> >> +    switch (ret) {
> >> +    case HV_ERROR:
> >> +        error_report("Error: HV_ERROR");
> >> +        break;
> >> +    case HV_BUSY:
> >> +        error_report("Error: HV_BUSY");
> >> +        break;
> >> +    case HV_BAD_ARGUMENT:
> >> +        error_report("Error: HV_BAD_ARGUMENT");
> >> +        break;
> >> +    case HV_NO_RESOURCES:
> >> +        error_report("Error: HV_NO_RESOURCES");
> >> +        break;
> >> +    case HV_NO_DEVICE:
> >> +        error_report("Error: HV_NO_DEVICE");
> >> +        break;
> >> +    case HV_UNSUPPORTED:
> >> +        error_report("Error: HV_UNSUPPORTED");
> >> +        break;
> >> +    default:
> >> +        error_report("Unknown Error");
> >> +    }
> >> +
> >> +    abort();
> >> +}
> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >> new file mode 100644
> >> index 0000000000..f9bb5502b7
> >> --- /dev/null
> >> +++ b/accel/hvf/hvf-cpus.c
> >> @@ -0,0 +1,468 @@
> >> +/*
> >> + * Copyright 2008 IBM Corporation
> >> + *           2008 Red Hat, Inc.
> >> + * Copyright 2011 Intel Corporation
> >> + * Copyright 2016 Veertu, Inc.
> >> + * Copyright 2017 The Android Open Source Project
> >> + *
> >> + * QEMU Hypervisor.framework support
> >> + *
> >> + * This program is free software; you can redistribute it and/or
> >> + * modify it under the terms of version 2 of the GNU General Public
> >> + * License as published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> + * General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU General Public License
> >> + * along with this program; if not, see <http://www.gnu.org/licenses/
> >.
> >> + *
> >> + * This file contain code under public domain from the hvdos project:
> >> + * https://github.com/mist64/hvdos
> >> + *
> >> + * Parts Copyright (c) 2011 NetApp, Inc.
> >> + * All rights reserved.
> >> + *
> >> + * Redistribution and use in source and binary forms, with or without
> >> + * modification, are permitted provided that the following conditions
> >> + * are met:
> >> + * 1. Redistributions of source code must retain the above copyright
> >> + *    notice, this list of conditions and the following disclaimer.
> >> + * 2. Redistributions in binary form must reproduce the above copyright
> >> + *    notice, this list of conditions and the following disclaimer in
> the
> >> + *    documentation and/or other materials provided with the
> distribution.
> >> + *
> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
> THE
> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
> PURPOSE
> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
> LIABLE
> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
> CONSEQUENTIAL
> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
> GOODS
> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
> INTERRUPTION)
> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
> STRICT
> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
> ANY WAY
> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
> OF
> >> + * SUCH DAMAGE.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/error-report.h"
> >> +#include "qemu/main-loop.h"
> >> +#include "exec/address-spaces.h"
> >> +#include "exec/exec-all.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >> +#include "sysemu/runstate.h"
> >> +#include "qemu/guest-random.h"
> >> +
> >> +#include <Hypervisor/Hypervisor.h>
> >> +
> >> +/* Memory slots */
> >> +
> >> +struct mac_slot {
> >> +    int present;
> >> +    uint64_t size;
> >> +    uint64_t gpa_start;
> >> +    uint64_t gva;
> >> +};
> >> +
> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> >> +{
> >> +    hvf_slot *slot;
> >> +    int x;
> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> +        slot = &hvf_state->slots[x];
> >> +        if (slot->size && start < (slot->start + slot->size) &&
> >> +            (start + size) > slot->start) {
> >> +            return slot;
> >> +        }
> >> +    }
> >> +    return NULL;
> >> +}
> >> +
> >> +struct mac_slot mac_slots[32];
> >> +
> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> >> +{
> >> +    struct mac_slot *macslot;
> >> +    hv_return_t ret;
> >> +
> >> +    macslot = &mac_slots[slot->slot_id];
> >> +
> >> +    if (macslot->present) {
> >> +        if (macslot->size != slot->size) {
> >> +            macslot->present = 0;
> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> >> +            assert_hvf_ok(ret);
> >> +        }
> >> +    }
> >> +
> >> +    if (!slot->size) {
> >> +        return 0;
> >> +    }
> >> +
> >> +    macslot->present = 1;
> >> +    macslot->gpa_start = slot->start;
> >> +    macslot->size = slot->size;
> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
> >> +    assert_hvf_ok(ret);
> >> +    return 0;
> >> +}
> >> +
> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> >> +{
> >> +    hvf_slot *mem;
> >> +    MemoryRegion *area = section->mr;
> >> +    bool writeable = !area->readonly && !area->rom_device;
> >> +    hv_memory_flags_t flags;
> >> +
> >> +    if (!memory_region_is_ram(area)) {
> >> +        if (writeable) {
> >> +            return;
> >> +        } else if (!memory_region_is_romd(area)) {
> >> +            /*
> >> +             * If the memory device is not in romd_mode, then we
> actually want
> >> +             * to remove the hvf memory slot so all accesses will trap.
> >> +             */
> >> +             add = false;
> >> +        }
> >> +    }
> >> +
> >> +    mem = hvf_find_overlap_slot(
> >> +            section->offset_within_address_space,
> >> +            int128_get64(section->size));
> >> +
> >> +    if (mem && add) {
> >> +        if (mem->size == int128_get64(section->size) &&
> >> +            mem->start == section->offset_within_address_space &&
> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
> >> +            section->offset_within_region)) {
> >> +            return; /* Same region was attempted to register, go away.
> */
> >> +        }
> >> +    }
> >> +
> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
> >> +    if (mem) {
> >> +        mem->size = 0;
> >> +        if (do_hvf_set_memory(mem, 0)) {
> >> +            error_report("Failed to reset overlapping slot");
> >> +            abort();
> >> +        }
> >> +    }
> >> +
> >> +    if (!add) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (area->readonly ||
> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> >> +    } else {
> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> >> +    }
> >> +
> >> +    /* Now make a new slot. */
> >> +    int x;
> >> +
> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> +        mem = &hvf_state->slots[x];
> >> +        if (!mem->size) {
> >> +            break;
> >> +        }
> >> +    }
> >> +
> >> +    if (x == hvf_state->num_slots) {
> >> +        error_report("No free slots");
> >> +        abort();
> >> +    }
> >> +
> >> +    mem->size = int128_get64(section->size);
> >> +    mem->mem = memory_region_get_ram_ptr(area) +
> section->offset_within_region;
> >> +    mem->start = section->offset_within_address_space;
> >> +    mem->region = area;
> >> +
> >> +    if (do_hvf_set_memory(mem, flags)) {
> >> +        error_report("Error registering new memory slot");
> >> +        abort();
> >> +    }
> >> +}
> >> +
> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
> on)
> >> +{
> >> +    hvf_slot *slot;
> >> +
> >> +    slot = hvf_find_overlap_slot(
> >> +            section->offset_within_address_space,
> >> +            int128_get64(section->size));
> >> +
> >> +    /* protect region against writes; begin tracking it */
> >> +    if (on) {
> >> +        slot->flags |= HVF_SLOT_LOG;
> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> >> +                      HV_MEMORY_READ);
> >> +    /* stop tracking region*/
> >> +    } else {
> >> +        slot->flags &= ~HVF_SLOT_LOG;
> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> >> +    }
> >> +}
> >> +
> >> +static void hvf_log_start(MemoryListener *listener,
> >> +                          MemoryRegionSection *section, int old, int
> new)
> >> +{
> >> +    if (old != 0) {
> >> +        return;
> >> +    }
> >> +
> >> +    hvf_set_dirty_tracking(section, 1);
> >> +}
> >> +
> >> +static void hvf_log_stop(MemoryListener *listener,
> >> +                         MemoryRegionSection *section, int old, int
> new)
> >> +{
> >> +    if (new != 0) {
> >> +        return;
> >> +    }
> >> +
> >> +    hvf_set_dirty_tracking(section, 0);
> >> +}
> >> +
> >> +static void hvf_log_sync(MemoryListener *listener,
> >> +                         MemoryRegionSection *section)
> >> +{
> >> +    /*
> >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
> >> +     * tracking the region.
> >> +     */
> >> +    hvf_set_dirty_tracking(section, 1);
> >> +}
> >> +
> >> +static void hvf_region_add(MemoryListener *listener,
> >> +                           MemoryRegionSection *section)
> >> +{
> >> +    hvf_set_phys_mem(section, true);
> >> +}
> >> +
> >> +static void hvf_region_del(MemoryListener *listener,
> >> +                           MemoryRegionSection *section)
> >> +{
> >> +    hvf_set_phys_mem(section, false);
> >> +}
> >> +
> >> +static MemoryListener hvf_memory_listener = {
> >> +    .priority = 10,
> >> +    .region_add = hvf_region_add,
> >> +    .region_del = hvf_region_del,
> >> +    .log_start = hvf_log_start,
> >> +    .log_stop = hvf_log_stop,
> >> +    .log_sync = hvf_log_sync,
> >> +};
> >> +
> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
> run_on_cpu_data arg)
> >> +{
> >> +    if (!cpu->vcpu_dirty) {
> >> +        hvf_get_registers(cpu);
> >> +        cpu->vcpu_dirty = true;
> >> +    }
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
> >> +{
> >> +    if (!cpu->vcpu_dirty) {
> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> >> +    }
> >> +}
> >> +
> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> >> +                                              run_on_cpu_data arg)
> >> +{
> >> +    hvf_put_registers(cpu);
> >> +    cpu->vcpu_dirty = false;
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> >> +{
> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
> RUN_ON_CPU_NULL);
> >> +}
> >> +
> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> >> +                                             run_on_cpu_data arg)
> >> +{
> >> +    hvf_put_registers(cpu);
> >> +    cpu->vcpu_dirty = false;
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
> >> +{
> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> >> +}
> >> +
> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> >> +                                              run_on_cpu_data arg)
> >> +{
> >> +    cpu->vcpu_dirty = true;
> >> +}
> >> +
> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> >> +{
> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
> RUN_ON_CPU_NULL);
> >> +}
> >> +
> >> +static void hvf_vcpu_destroy(CPUState *cpu)
> >> +{
> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
> >> +    assert_hvf_ok(ret);
> >> +
> >> +    hvf_arch_vcpu_destroy(cpu);
> >> +}
> >> +
> >> +static void dummy_signal(int sig)
> >> +{
> >> +}
> >> +
> >> +static int hvf_init_vcpu(CPUState *cpu)
> >> +{
> >> +    int r;
> >> +
> >> +    /* init cpu signals */
> >> +    sigset_t set;
> >> +    struct sigaction sigact;
> >> +
> >> +    memset(&sigact, 0, sizeof(sigact));
> >> +    sigact.sa_handler = dummy_signal;
> >> +    sigaction(SIG_IPI, &sigact, NULL);
> >> +
> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >> +    sigdelset(&set, SIG_IPI);
> >> +
> >> +#ifdef __aarch64__
> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
> >> +#else
> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> >> +#endif
> > I think the first __aarch64__ bit fits better in the arm part of the series.
>
>
> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
> ARM enablement.
>
>
> >
> >> +    cpu->vcpu_dirty = 1;
> >> +    assert_hvf_ok(r);
> >> +
> >> +    return hvf_arch_init_vcpu(cpu);
> >> +}
> >> +
> >> +/*
> >> + * The HVF-specific vCPU thread function. This one should only run when the host
> >> + * CPU supports the VMX "unrestricted guest" feature.
> >> + */
> >> +static void *hvf_cpu_thread_fn(void *arg)
> >> +{
> >> +    CPUState *cpu = arg;
> >> +
> >> +    int r;
> >> +
> >> +    assert(hvf_enabled());
> >> +
> >> +    rcu_register_thread();
> >> +
> >> +    qemu_mutex_lock_iothread();
> >> +    qemu_thread_get_self(cpu->thread);
> >> +
> >> +    cpu->thread_id = qemu_get_thread_id();
> >> +    cpu->can_do_io = 1;
> >> +    current_cpu = cpu;
> >> +
> >> +    hvf_init_vcpu(cpu);
> >> +
> >> +    /* signal CPU creation */
> >> +    cpu_thread_signal_created(cpu);
> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> >> +
> >> +    do {
> >> +        if (cpu_can_run(cpu)) {
> >> +            r = hvf_vcpu_exec(cpu);
> >> +            if (r == EXCP_DEBUG) {
> >> +                cpu_handle_guest_debug(cpu);
> >> +            }
> >> +        }
> >> +        qemu_wait_io_event(cpu);
> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
> >> +
> >> +    hvf_vcpu_destroy(cpu);
> >> +    cpu_thread_signal_destroyed(cpu);
> >> +    qemu_mutex_unlock_iothread();
> >> +    rcu_unregister_thread();
> >> +    return NULL;
> >> +}
> >> +
> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
> >> +{
> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
> >> +
> >> +    /*
> >> +     * HVF currently does not support TCG, and only runs in
> >> +     * unrestricted-guest mode.
> >> +     */
> >> +    assert(hvf_enabled());
> >> +
> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> >> +    qemu_cond_init(cpu->halt_cond);
> >> +
> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> >> +             cpu->cpu_index);
> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> >> +                       cpu, QEMU_THREAD_JOINABLE);
> >> +}
> >> +
> >> +static const CpusAccel hvf_cpus = {
> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
> >> +
> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> >> +    .synchronize_state = hvf_cpu_synchronize_state,
> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> >> +};
> >> +
> >> +static int hvf_accel_init(MachineState *ms)
> >> +{
> >> +    int x;
> >> +    hv_return_t ret;
> >> +    HVFState *s;
> >> +
> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
> >> +    assert_hvf_ok(ret);
> >> +
> >> +    s = g_new0(HVFState, 1);
> >> +
> >> +    s->num_slots = 32;
> >> +    for (x = 0; x < s->num_slots; ++x) {
> >> +        s->slots[x].size = 0;
> >> +        s->slots[x].slot_id = x;
> >> +    }
> >> +
> >> +    hvf_state = s;
> >> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
> >> +    cpus_register_accel(&hvf_cpus);
> >> +    return 0;
> >> +}
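The 32 slots zeroed in hvf_accel_init() are consumed by a linear overlap search, hvf_find_overlap_slot(), declared in hvf_int.h below; its body appears in the code being deleted from target/i386/hvf/hvf.c further down. The interval test it uses can be sketched in isolation; ToySlot is a simplified stand-in for hvf_slot:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define TOY_NUM_SLOTS 32

typedef struct {
    uint64_t start;
    uint64_t size; /* size == 0 means "slot unused" */
} ToySlot;

static ToySlot toy_slots[TOY_NUM_SLOTS];

/* Return the first used slot whose [start, start + size) range intersects the
 * query range, mirroring the logic of hvf_find_overlap_slot(). */
static ToySlot *toy_find_overlap_slot(uint64_t start, uint64_t size)
{
    for (int x = 0; x < TOY_NUM_SLOTS; ++x) {
        ToySlot *slot = &toy_slots[x];
        if (slot->size && start < slot->start + slot->size &&
            start + size > slot->start) {
            return slot;
        }
    }
    return NULL;
}
```

Note that both interval ends are exclusive on one side: a query that ends exactly where a slot begins, or begins exactly where it ends, does not count as an overlap.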
> >> +
> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
> >> +{
> >> +    AccelClass *ac = ACCEL_CLASS(oc);
> >> +    ac->name = "HVF";
> >> +    ac->init_machine = hvf_accel_init;
> >> +    ac->allowed = &hvf_allowed;
> >> +}
> >> +
> >> +static const TypeInfo hvf_accel_type = {
> >> +    .name = TYPE_HVF_ACCEL,
> >> +    .parent = TYPE_ACCEL,
> >> +    .class_init = hvf_accel_class_init,
> >> +};
> >> +
> >> +static void hvf_type_init(void)
> >> +{
> >> +    type_register_static(&hvf_accel_type);
> >> +}
> >> +
> >> +type_init(hvf_type_init);
> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
> >> new file mode 100644
> >> index 0000000000..dfd6b68dc7
> >> --- /dev/null
> >> +++ b/accel/hvf/meson.build
> >> @@ -0,0 +1,7 @@
> >> +hvf_ss = ss.source_set()
> >> +hvf_ss.add(files(
> >> +  'hvf-all.c',
> >> +  'hvf-cpus.c',
> >> +))
> >> +
> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
> >> diff --git a/accel/meson.build b/accel/meson.build
> >> index b26cca227a..6de12ce5d5 100644
> >> --- a/accel/meson.build
> >> +++ b/accel/meson.build
> >> @@ -1,5 +1,6 @@
> >>   softmmu_ss.add(files('accel.c'))
> >>
> >> +subdir('hvf')
> >>   subdir('qtest')
> >>   subdir('kvm')
> >>   subdir('tcg')
> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >> new file mode 100644
> >> index 0000000000..de9bad23a8
> >> --- /dev/null
> >> +++ b/include/sysemu/hvf_int.h
> >> @@ -0,0 +1,69 @@
> >> +/*
> >> + * QEMU Hypervisor.framework (HVF) support
> >> + *
> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> + * See the COPYING file in the top-level directory.
> >> + *
> >> + */
> >> +
> >> +/* header to be included in HVF-specific code */
> >> +
> >> +#ifndef HVF_INT_H
> >> +#define HVF_INT_H
> >> +
> >> +#include <Hypervisor/Hypervisor.h>
> >> +
> >> +#define HVF_MAX_VCPU 0x10
> >> +
> >> +extern struct hvf_state hvf_global;
> >> +
> >> +struct hvf_vm {
> >> +    int id;
> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> >> +};
> >> +
> >> +struct hvf_state {
> >> +    uint32_t version;
> >> +    struct hvf_vm *vm;
> >> +    uint64_t mem_quota;
> >> +};
> >> +
> >> +/* hvf_slot flags */
> >> +#define HVF_SLOT_LOG (1 << 0)
> >> +
> >> +typedef struct hvf_slot {
> >> +    uint64_t start;
> >> +    uint64_t size;
> >> +    uint8_t *mem;
> >> +    int slot_id;
> >> +    uint32_t flags;
> >> +    MemoryRegion *region;
> >> +} hvf_slot;
> >> +
> >> +typedef struct hvf_vcpu_caps {
> >> +    uint64_t vmx_cap_pinbased;
> >> +    uint64_t vmx_cap_procbased;
> >> +    uint64_t vmx_cap_procbased2;
> >> +    uint64_t vmx_cap_entry;
> >> +    uint64_t vmx_cap_exit;
> >> +    uint64_t vmx_cap_preemption_timer;
> >> +} hvf_vcpu_caps;
> >> +
> >> +struct HVFState {
> >> +    AccelState parent;
> >> +    hvf_slot slots[32];
> >> +    int num_slots;
> >> +
> >> +    hvf_vcpu_caps *hvf_caps;
> >> +};
> >> +extern HVFState *hvf_state;
> >> +
> >> +void assert_hvf_ok(hv_return_t ret);
> >> +int hvf_get_registers(CPUState *cpu);
> >> +int hvf_put_registers(CPUState *cpu);
> >> +int hvf_arch_init_vcpu(CPUState *cpu);
> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
> >> +int hvf_vcpu_exec(CPUState *cpu);
> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
> >> +
> >> +#endif
> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
> >> deleted file mode 100644
> >> index 817b3d7452..0000000000
> >> --- a/target/i386/hvf/hvf-cpus.c
> >> +++ /dev/null
> >> @@ -1,131 +0,0 @@
> >> -/*
> >> - * Copyright 2008 IBM Corporation
> >> - *           2008 Red Hat, Inc.
> >> - * Copyright 2011 Intel Corporation
> >> - * Copyright 2016 Veertu, Inc.
> >> - * Copyright 2017 The Android Open Source Project
> >> - *
> >> - * QEMU Hypervisor.framework support
> >> - *
> >> - * This program is free software; you can redistribute it and/or
> >> - * modify it under the terms of version 2 of the GNU General Public
> >> - * License as published by the Free Software Foundation.
> >> - *
> >> - * This program is distributed in the hope that it will be useful,
> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >> - * General Public License for more details.
> >> - *
> >> - * You should have received a copy of the GNU General Public License
> >> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
> >> - *
> >> - * This file contain code under public domain from the hvdos project:
> >> - * https://github.com/mist64/hvdos
> >> - *
> >> - * Parts Copyright (c) 2011 NetApp, Inc.
> >> - * All rights reserved.
> >> - *
> >> - * Redistribution and use in source and binary forms, with or without
> >> - * modification, are permitted provided that the following conditions
> >> - * are met:
> >> - * 1. Redistributions of source code must retain the above copyright
> >> - *    notice, this list of conditions and the following disclaimer.
> >> - * 2. Redistributions in binary form must reproduce the above copyright
> >> - *    notice, this list of conditions and the following disclaimer in the
> >> - *    documentation and/or other materials provided with the distribution.
> >> - *
> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> >> - * SUCH DAMAGE.
> >> - */
> >> -
> >> -#include "qemu/osdep.h"
> >> -#include "qemu/error-report.h"
> >> -#include "qemu/main-loop.h"
> >> -#include "sysemu/hvf.h"
> >> -#include "sysemu/runstate.h"
> >> -#include "target/i386/cpu.h"
> >> -#include "qemu/guest-random.h"
> >> -
> >> -#include "hvf-cpus.h"
> >> -
> >> -/*
> >> - * The HVF-specific vCPU thread function. This one should only run when the host
> >> - * CPU supports the VMX "unrestricted guest" feature.
> >> - */
> >> -static void *hvf_cpu_thread_fn(void *arg)
> >> -{
> >> -    CPUState *cpu = arg;
> >> -
> >> -    int r;
> >> -
> >> -    assert(hvf_enabled());
> >> -
> >> -    rcu_register_thread();
> >> -
> >> -    qemu_mutex_lock_iothread();
> >> -    qemu_thread_get_self(cpu->thread);
> >> -
> >> -    cpu->thread_id = qemu_get_thread_id();
> >> -    cpu->can_do_io = 1;
> >> -    current_cpu = cpu;
> >> -
> >> -    hvf_init_vcpu(cpu);
> >> -
> >> -    /* signal CPU creation */
> >> -    cpu_thread_signal_created(cpu);
> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
> >> -
> >> -    do {
> >> -        if (cpu_can_run(cpu)) {
> >> -            r = hvf_vcpu_exec(cpu);
> >> -            if (r == EXCP_DEBUG) {
> >> -                cpu_handle_guest_debug(cpu);
> >> -            }
> >> -        }
> >> -        qemu_wait_io_event(cpu);
> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
> >> -
> >> -    hvf_vcpu_destroy(cpu);
> >> -    cpu_thread_signal_destroyed(cpu);
> >> -    qemu_mutex_unlock_iothread();
> >> -    rcu_unregister_thread();
> >> -    return NULL;
> >> -}
> >> -
> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
> >> -{
> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
> >> -
> >> -    /*
> >> -     * HVF currently does not support TCG, and only runs in
> >> -     * unrestricted-guest mode.
> >> -     */
> >> -    assert(hvf_enabled());
> >> -
> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
> >> -    qemu_cond_init(cpu->halt_cond);
> >> -
> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
> >> -             cpu->cpu_index);
> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
> >> -                       cpu, QEMU_THREAD_JOINABLE);
> >> -}
> >> -
> >> -const CpusAccel hvf_cpus = {
> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
> >> -
> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
> >> -    .synchronize_state = hvf_cpu_synchronize_state,
> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
> >> -};
> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
> >> deleted file mode 100644
> >> index ced31b82c0..0000000000
> >> --- a/target/i386/hvf/hvf-cpus.h
> >> +++ /dev/null
> >> @@ -1,25 +0,0 @@
> >> -/*
> >> - * Accelerator CPUS Interface
> >> - *
> >> - * Copyright 2020 SUSE LLC
> >> - *
> >> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >> - * See the COPYING file in the top-level directory.
> >> - */
> >> -
> >> -#ifndef HVF_CPUS_H
> >> -#define HVF_CPUS_H
> >> -
> >> -#include "sysemu/cpus.h"
> >> -
> >> -extern const CpusAccel hvf_cpus;
> >> -
> >> -int hvf_init_vcpu(CPUState *);
> >> -int hvf_vcpu_exec(CPUState *);
> >> -void hvf_cpu_synchronize_state(CPUState *);
> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
> >> -void hvf_cpu_synchronize_post_init(CPUState *);
> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
> >> -void hvf_vcpu_destroy(CPUState *);
> >> -
> >> -#endif /* HVF_CPUS_H */
> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
> >> index e0edffd077..6d56f8f6bb 100644
> >> --- a/target/i386/hvf/hvf-i386.h
> >> +++ b/target/i386/hvf/hvf-i386.h
> >> @@ -18,57 +18,11 @@
> >>
> >>   #include "sysemu/accel.h"
> >>   #include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >>   #include "cpu.h"
> >>   #include "x86.h"
> >>
> >> -#define HVF_MAX_VCPU 0x10
> >> -
> >> -extern struct hvf_state hvf_global;
> >> -
> >> -struct hvf_vm {
> >> -    int id;
> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
> >> -};
> >> -
> >> -struct hvf_state {
> >> -    uint32_t version;
> >> -    struct hvf_vm *vm;
> >> -    uint64_t mem_quota;
> >> -};
> >> -
> >> -/* hvf_slot flags */
> >> -#define HVF_SLOT_LOG (1 << 0)
> >> -
> >> -typedef struct hvf_slot {
> >> -    uint64_t start;
> >> -    uint64_t size;
> >> -    uint8_t *mem;
> >> -    int slot_id;
> >> -    uint32_t flags;
> >> -    MemoryRegion *region;
> >> -} hvf_slot;
> >> -
> >> -typedef struct hvf_vcpu_caps {
> >> -    uint64_t vmx_cap_pinbased;
> >> -    uint64_t vmx_cap_procbased;
> >> -    uint64_t vmx_cap_procbased2;
> >> -    uint64_t vmx_cap_entry;
> >> -    uint64_t vmx_cap_exit;
> >> -    uint64_t vmx_cap_preemption_timer;
> >> -} hvf_vcpu_caps;
> >> -
> >> -struct HVFState {
> >> -    AccelState parent;
> >> -    hvf_slot slots[32];
> >> -    int num_slots;
> >> -
> >> -    hvf_vcpu_caps *hvf_caps;
> >> -};
> >> -extern HVFState *hvf_state;
> >> -
> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
> >>
> >>   #ifdef NEED_CPU_H
> >>   /* Functions exported to host specific mode */
> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
> >> index ed9356565c..8b96ecd619 100644
> >> --- a/target/i386/hvf/hvf.c
> >> +++ b/target/i386/hvf/hvf.c
> >> @@ -51,6 +51,7 @@
> >>   #include "qemu/error-report.h"
> >>
> >>   #include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >>   #include "sysemu/runstate.h"
> >>   #include "hvf-i386.h"
> >>   #include "vmcs.h"
> >> @@ -72,171 +73,6 @@
> >>   #include "sysemu/accel.h"
> >>   #include "target/i386/cpu.h"
> >>
> >> -#include "hvf-cpus.h"
> >> -
> >> -HVFState *hvf_state;
> >> -
> >> -static void assert_hvf_ok(hv_return_t ret)
> >> -{
> >> -    if (ret == HV_SUCCESS) {
> >> -        return;
> >> -    }
> >> -
> >> -    switch (ret) {
> >> -    case HV_ERROR:
> >> -        error_report("Error: HV_ERROR");
> >> -        break;
> >> -    case HV_BUSY:
> >> -        error_report("Error: HV_BUSY");
> >> -        break;
> >> -    case HV_BAD_ARGUMENT:
> >> -        error_report("Error: HV_BAD_ARGUMENT");
> >> -        break;
> >> -    case HV_NO_RESOURCES:
> >> -        error_report("Error: HV_NO_RESOURCES");
> >> -        break;
> >> -    case HV_NO_DEVICE:
> >> -        error_report("Error: HV_NO_DEVICE");
> >> -        break;
> >> -    case HV_UNSUPPORTED:
> >> -        error_report("Error: HV_UNSUPPORTED");
> >> -        break;
> >> -    default:
> >> -        error_report("Unknown Error");
> >> -    }
> >> -
> >> -    abort();
> >> -}
> >> -
> >> -/* Memory slots */
> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
> >> -{
> >> -    hvf_slot *slot;
> >> -    int x;
> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> -        slot = &hvf_state->slots[x];
> >> -        if (slot->size && start < (slot->start + slot->size) &&
> >> -            (start + size) > slot->start) {
> >> -            return slot;
> >> -        }
> >> -    }
> >> -    return NULL;
> >> -}
> >> -
> >> -struct mac_slot {
> >> -    int present;
> >> -    uint64_t size;
> >> -    uint64_t gpa_start;
> >> -    uint64_t gva;
> >> -};
> >> -
> >> -struct mac_slot mac_slots[32];
> >> -
> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
> >> -{
> >> -    struct mac_slot *macslot;
> >> -    hv_return_t ret;
> >> -
> >> -    macslot = &mac_slots[slot->slot_id];
> >> -
> >> -    if (macslot->present) {
> >> -        if (macslot->size != slot->size) {
> >> -            macslot->present = 0;
> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
> >> -            assert_hvf_ok(ret);
> >> -        }
> >> -    }
> >> -
> >> -    if (!slot->size) {
> >> -        return 0;
> >> -    }
> >> -
> >> -    macslot->present = 1;
> >> -    macslot->gpa_start = slot->start;
> >> -    macslot->size = slot->size;
> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
> >> -    assert_hvf_ok(ret);
> >> -    return 0;
> >> -}
> >> -
> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
> >> -{
> >> -    hvf_slot *mem;
> >> -    MemoryRegion *area = section->mr;
> >> -    bool writeable = !area->readonly && !area->rom_device;
> >> -    hv_memory_flags_t flags;
> >> -
> >> -    if (!memory_region_is_ram(area)) {
> >> -        if (writeable) {
> >> -            return;
> >> -        } else if (!memory_region_is_romd(area)) {
> >> -            /*
> >> -             * If the memory device is not in romd_mode, then we actually want
> >> -             * to remove the hvf memory slot so all accesses will trap.
> >> -             */
> >> -             add = false;
> >> -        }
> >> -    }
> >> -
> >> -    mem = hvf_find_overlap_slot(
> >> -            section->offset_within_address_space,
> >> -            int128_get64(section->size));
> >> -
> >> -    if (mem && add) {
> >> -        if (mem->size == int128_get64(section->size) &&
> >> -            mem->start == section->offset_within_address_space &&
> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
> >> -            section->offset_within_region)) {
> >> -            return; /* Same region was attempted to register, go away. */
> >> -        }
> >> -    }
> >> -
> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
> >> -    if (mem) {
> >> -        mem->size = 0;
> >> -        if (do_hvf_set_memory(mem, 0)) {
> >> -            error_report("Failed to reset overlapping slot");
> >> -            abort();
> >> -        }
> >> -    }
> >> -
> >> -    if (!add) {
> >> -        return;
> >> -    }
> >> -
> >> -    if (area->readonly ||
> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
> >> -    } else {
> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
> >> -    }
> >> -
> >> -    /* Now make a new slot. */
> >> -    int x;
> >> -
> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
> >> -        mem = &hvf_state->slots[x];
> >> -        if (!mem->size) {
> >> -            break;
> >> -        }
> >> -    }
> >> -
> >> -    if (x == hvf_state->num_slots) {
> >> -        error_report("No free slots");
> >> -        abort();
> >> -    }
> >> -
> >> -    mem->size = int128_get64(section->size);
> >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
> >> -    mem->start = section->offset_within_address_space;
> >> -    mem->region = area;
> >> -
> >> -    if (do_hvf_set_memory(mem, flags)) {
> >> -        error_report("Error registering new memory slot");
> >> -        abort();
> >> -    }
> >> -}
> >> -
> >>   void vmx_update_tpr(CPUState *cpu)
> >>   {
> >>       /* TODO: need integrate APIC handling */
> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
> >>       }
> >>   }
> >>
> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
> >> -{
> >> -    if (!cpu->vcpu_dirty) {
> >> -        hvf_get_registers(cpu);
> >> -        cpu->vcpu_dirty = true;
> >> -    }
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
> >> -{
> >> -    if (!cpu->vcpu_dirty) {
> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
> >> -    }
> >> -}
> >> -
> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
> >> -                                              run_on_cpu_data arg)
> >> -{
> >> -    hvf_put_registers(cpu);
> >> -    cpu->vcpu_dirty = false;
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
> >> -{
> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
> >> -}
> >> -
> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
> >> -                                             run_on_cpu_data arg)
> >> -{
> >> -    hvf_put_registers(cpu);
> >> -    cpu->vcpu_dirty = false;
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
> >> -{
> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
> >> -}
> >> -
> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
> >> -                                              run_on_cpu_data arg)
> >> -{
> >> -    cpu->vcpu_dirty = true;
> >> -}
> >> -
> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
> >> -{
> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
> >> -}
> >> -
> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
> >>   {
> >>       int read, write;
> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
> >>       return false;
> >>   }
> >>
> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
> >> -{
> >> -    hvf_slot *slot;
> >> -
> >> -    slot = hvf_find_overlap_slot(
> >> -            section->offset_within_address_space,
> >> -            int128_get64(section->size));
> >> -
> >> -    /* protect region against writes; begin tracking it */
> >> -    if (on) {
> >> -        slot->flags |= HVF_SLOT_LOG;
> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> >> -                      HV_MEMORY_READ);
> >> -    /* stop tracking region*/
> >> -    } else {
> >> -        slot->flags &= ~HVF_SLOT_LOG;
> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
> >> -    }
> >> -}
> >> -
> >> -static void hvf_log_start(MemoryListener *listener,
> >> -                          MemoryRegionSection *section, int old, int new)
> >> -{
> >> -    if (old != 0) {
> >> -        return;
> >> -    }
> >> -
> >> -    hvf_set_dirty_tracking(section, 1);
> >> -}
> >> -
> >> -static void hvf_log_stop(MemoryListener *listener,
> >> -                         MemoryRegionSection *section, int old, int new)
> >> -{
> >> -    if (new != 0) {
> >> -        return;
> >> -    }
> >> -
> >> -    hvf_set_dirty_tracking(section, 0);
> >> -}
> >> -
> >> -static void hvf_log_sync(MemoryListener *listener,
> >> -                         MemoryRegionSection *section)
> >> -{
> >> -    /*
> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
> >> -     * tracking the region.
> >> -     */
> >> -    hvf_set_dirty_tracking(section, 1);
> >> -}
> >> -
> >> -static void hvf_region_add(MemoryListener *listener,
> >> -                           MemoryRegionSection *section)
> >> -{
> >> -    hvf_set_phys_mem(section, true);
> >> -}
> >> -
> >> -static void hvf_region_del(MemoryListener *listener,
> >> -                           MemoryRegionSection *section)
> >> -{
> >> -    hvf_set_phys_mem(section, false);
> >> -}
> >> -
> >> -static MemoryListener hvf_memory_listener = {
> >> -    .priority = 10,
> >> -    .region_add = hvf_region_add,
> >> -    .region_del = hvf_region_del,
> >> -    .log_start = hvf_log_start,
> >> -    .log_stop = hvf_log_stop,
> >> -    .log_sync = hvf_log_sync,
> >> -};
> >> -
> >> -void hvf_vcpu_destroy(CPUState *cpu)
> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
> >>   {
> >>       X86CPU *x86_cpu = X86_CPU(cpu);
> >>       CPUX86State *env = &x86_cpu->env;
> >>
> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
> >>       g_free(env->hvf_mmio_buf);
> >> -    assert_hvf_ok(ret);
> >> -}
> >> -
> >> -static void dummy_signal(int sig)
> >> -{
> >>   }
> >>
> >> -int hvf_init_vcpu(CPUState *cpu)
> >> +int hvf_arch_init_vcpu(CPUState *cpu)
> >>   {
> >>
> >>       X86CPU *x86cpu = X86_CPU(cpu);
> >>       CPUX86State *env = &x86cpu->env;
> >> -    int r;
> >> -
> >> -    /* init cpu signals */
> >> -    sigset_t set;
> >> -    struct sigaction sigact;
> >> -
> >> -    memset(&sigact, 0, sizeof(sigact));
> >> -    sigact.sa_handler = dummy_signal;
> >> -    sigaction(SIG_IPI, &sigact, NULL);
> >> -
> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >> -    sigdelset(&set, SIG_IPI);
> >>
> >>       init_emu();
> >>       init_decoder();
> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
> >>       env->hvf_mmio_buf = g_new(char, 4096);
> >>
> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
> >> -    cpu->vcpu_dirty = 1;
> >> -    assert_hvf_ok(r);
> >> -
> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
> >>           abort();
> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>
> >>       return ret;
> >>   }
> >> -
> >> -bool hvf_allowed;
> >> -
> >> -static int hvf_accel_init(MachineState *ms)
> >> -{
> >> -    int x;
> >> -    hv_return_t ret;
> >> -    HVFState *s;
> >> -
> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
> >> -    assert_hvf_ok(ret);
> >> -
> >> -    s = g_new0(HVFState, 1);
> >> -
> >> -    s->num_slots = 32;
> >> -    for (x = 0; x < s->num_slots; ++x) {
> >> -        s->slots[x].size = 0;
> >> -        s->slots[x].slot_id = x;
> >> -    }
> >> -
> >> -    hvf_state = s;
> >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
> >> -    cpus_register_accel(&hvf_cpus);
> >> -    return 0;
> >> -}
> >> -
> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
> >> -{
> >> -    AccelClass *ac = ACCEL_CLASS(oc);
> >> -    ac->name = "HVF";
> >> -    ac->init_machine = hvf_accel_init;
> >> -    ac->allowed = &hvf_allowed;
> >> -}
> >> -
> >> -static const TypeInfo hvf_accel_type = {
> >> -    .name = TYPE_HVF_ACCEL,
> >> -    .parent = TYPE_ACCEL,
> >> -    .class_init = hvf_accel_class_init,
> >> -};
> >> -
> >> -static void hvf_type_init(void)
> >> -{
> >> -    type_register_static(&hvf_accel_type);
> >> -}
> >> -
> >> -type_init(hvf_type_init);
> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
> >> index 409c9a3f14..c8a43717ee 100644
> >> --- a/target/i386/hvf/meson.build
> >> +++ b/target/i386/hvf/meson.build
> >> @@ -1,6 +1,5 @@
> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
> >>     'hvf.c',
> >> -  'hvf-cpus.c',
> >>     'x86.c',
> >>     'x86_cpuid.c',
> >>     'x86_decode.c',
> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
> >> index bbec412b6c..89b8e9d87a 100644
> >> --- a/target/i386/hvf/x86hvf.c
> >> +++ b/target/i386/hvf/x86hvf.c
> >> @@ -20,6 +20,9 @@
> >>   #include "qemu/osdep.h"
> >>
> >>   #include "qemu-common.h"
> >> +#include "sysemu/hvf.h"
> >> +#include "sysemu/hvf_int.h"
> >> +#include "sysemu/hw_accel.h"
> >>   #include "x86hvf.h"
> >>   #include "vmx.h"
> >>   #include "vmcs.h"
> >> @@ -32,8 +35,6 @@
> >>   #include <Hypervisor/hv.h>
> >>   #include <Hypervisor/hv_vmx.h>
> >>
> >> -#include "hvf-cpus.h"
> >> -
> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
> >>                        SegmentCache *qseg, bool is_tr)
> >>   {
> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
> >>
> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
> >> -        hvf_cpu_synchronize_state(cpu_state);
> >> +        cpu_synchronize_state(cpu_state);
> >>           do_cpu_init(cpu);
> >>       }
> >>
> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
> >>           cpu_state->halted = 0;
> >>       }
> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
> >> -        hvf_cpu_synchronize_state(cpu_state);
> >> +        cpu_synchronize_state(cpu_state);
> >>           do_cpu_sipi(cpu);
> >>       }
> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
> >> -        hvf_cpu_synchronize_state(cpu_state);
> >> +        cpu_synchronize_state(cpu_state);
> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
> > be a separate patch. It follows the cpu/accel cleanups Claudio was doing
> > over the summer.
>
>
> The only reason they're in here is that we no longer have access to
> the hvf_ functions from the file. I am perfectly happy to rebase the
> patch on top of Claudio's if his goes in first. I'm sure it'll be
> trivial for him to rebase on top of this too if my series goes in first.
>
>
> >
> > Phillipe raised the idea that the patch might go in ahead of the
> > ARM-specific part (which might involve some discussions), and I agree with that.
> >
> > Some sync between Claudio's series (CC'd him) and this patch might be needed.
>
>
> I would prefer not to hold back because of the sync. Claudio's cleanup
> is trivial enough to adjust for if it gets merged ahead of this.
>
>
> Alex
>
>
>
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework
  2020-11-27 16:47           ` Peter Maydell
@ 2020-11-30  2:40             ` Alexander Graf
  0 siblings, 0 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-30  2:40 UTC (permalink / raw)
  To: Peter Maydell, Eduardo Habkost
  Cc: Richard Henderson, QEMU Developers, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Claudio Fontana, Paolo Bonzini


On 27.11.20 17:47, Peter Maydell wrote:
> On Fri, 27 Nov 2020 at 16:38, Peter Maydell <peter.maydell@linaro.org> wrote:
>> Having looked a bit more closely at some of the relevant target/arm
>> code, I think the best approach is going to be that in virt.c
>> we just check the PARange ID register field (probably via
>> a convenience function that does the conversion of that to
>> a nice number-of-bits return value; we might even have one
>> already).
> Ha, in fact we're already doing something quite close to this,
> though instead of saying "decide whether to use highmem based
> on the CPU's PA range" we go for "report error to user if PA
> range is insufficient" and let the user pick some command line
> options that disable highmem if they want:
>
>          if (aarch64 && vms->highmem) {
>              int requested_pa_size = 64 - clz64(vms->highest_gpa);
>              int pamax = arm_pamax(ARM_CPU(first_cpu));
>
>              if (pamax < requested_pa_size) {
>                  error_report("VCPU supports less PA bits (%d) than "
>                               "requested by the memory map (%d)",
>                               pamax, requested_pa_size);
>                  exit(1);
>              }
>          }


Turns out I can sync aa64mfr0 just fine as well. So I'll just do that 
and remove this patch.
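The convenience helper Peter suggests, converting the ID_AA64MMFR0_EL1.PARange field into a number of PA bits, could look roughly like the sketch below. The encodings come from the ARM ARM; `pamax_from_mmfr0` is an illustrative name (QEMU's existing helper is `arm_pamax()`), not the actual function in the tree:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: map ID_AA64MMFR0_EL1.PARange (bits [3:0]) to a physical address
 * width in bits, per the architectural encodings:
 *   0b0000 = 32, 0b0001 = 36, 0b0010 = 40, 0b0011 = 42,
 *   0b0100 = 44, 0b0101 = 48, 0b0110 = 52.
 */
static unsigned pamax_from_mmfr0(uint64_t id_aa64mmfr0)
{
    static const unsigned pamax_map[] = { 32, 36, 40, 42, 44, 48, 52 };
    unsigned parange = id_aa64mmfr0 & 0xf;

    if (parange >= sizeof(pamax_map) / sizeof(pamax_map[0])) {
        return 52; /* reserved encodings: clamp to the architectural max */
    }
    return pamax_map[parange];
}
```

With that in hand, the virt.c check quoted above just compares this value against `64 - clz64(vms->highest_gpa)`.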


Alex





* Re: [PATCH 6/8] hvf: Use OS provided vcpu kick function
  2020-11-26 22:18   ` Eduardo Habkost
@ 2020-11-30  2:42     ` Alexander Graf
  2020-11-30  7:45       ` Claudio Fontana
  0 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-30  2:42 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Claudio Fontana, Paolo Bonzini


On 26.11.20 23:18, Eduardo Habkost wrote:
> On Thu, Nov 26, 2020 at 10:50:15PM +0100, Alexander Graf wrote:
>> When kicking another vCPU, we get an OS function that explicitly does that for us
>> on Apple Silicon. That works better than the current signaling logic, let's make
>> use of it there.
>>
>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> ---
>>   accel/hvf/hvf-cpus.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> index b9f674478d..74a272d2e8 100644
>> --- a/accel/hvf/hvf-cpus.c
>> +++ b/accel/hvf/hvf-cpus.c
>> @@ -418,8 +418,20 @@ static void hvf_start_vcpu_thread(CPUState *cpu)
>>                          cpu, QEMU_THREAD_JOINABLE);
>>   }
>>   
>> +#ifdef __aarch64__
>> +static void hvf_kick_vcpu_thread(CPUState *cpu)
>> +{
>> +    if (!qemu_cpu_is_self(cpu)) {
>> +        hv_vcpus_exit(&cpu->hvf_fd, 1);
>> +    }
>> +}
>> +#endif
>> +
>>   static const CpusAccel hvf_cpus = {
>>       .create_vcpu_thread = hvf_start_vcpu_thread,
>> +#ifdef __aarch64__
>> +    .kick_vcpu_thread = hvf_kick_vcpu_thread,
>> +#endif
> Interesting.  We have considered the possibility of adding
> arch-specific TYPE_ACCEL subclasses when discussing Claudio's
> series.  Here we have another arch-specific hack that could be
> avoided if we had a TYPE_ARM_HVF_ACCEL QOM class.


I don't think that's necessary in this case. I don't see how you could 
ever have aarch64 and x86 HVF backends compiled into the same binary. 
The header files even have a lot of #ifdef's.

Either way, I've changed it to a weak function in v2. That way it's a 
bit easier to read.
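The weak-function approach mentioned for v2 can be sketched as below. Everything here is a stand-in (the opaque `CPUState`, the counter side effect); in the real series the common code would carry the weak signal-based default and the aarch64-only file would provide a strong definition calling `hv_vcpus_exit()`:

```c
#include <assert.h>
#include <stddef.h>

typedef struct CPUState CPUState; /* opaque stand-in for QEMU's CPUState */

static int default_kicks;

/*
 * Weak default in common code: the linker uses it unless an arch-specific
 * object file provides a strong definition with the same name. The real
 * default would raise SIG_IPI on the vcpu thread; here we just count calls.
 */
__attribute__((weak)) void hvf_kick_vcpu_thread(CPUState *cpu)
{
    (void)cpu;
    default_kicks++;
}
```

The nice property versus the `#ifdef` pairs in the patch is that the common `CpusAccel` table can reference `hvf_kick_vcpu_thread` unconditionally.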


Alex





* Re: [PATCH 6/8] hvf: Use OS provided vcpu kick function
  2020-11-30  2:42     ` Alexander Graf
@ 2020-11-30  7:45       ` Claudio Fontana
  0 siblings, 0 replies; 64+ messages in thread
From: Claudio Fontana @ 2020-11-30  7:45 UTC (permalink / raw)
  To: Alexander Graf, Eduardo Habkost
  Cc: Peter Maydell, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, qemu-arm, Paolo Bonzini

On 11/30/20 3:42 AM, Alexander Graf wrote:
> 
> On 26.11.20 23:18, Eduardo Habkost wrote:
>> On Thu, Nov 26, 2020 at 10:50:15PM +0100, Alexander Graf wrote:
>>> When kicking another vCPU, we get an OS function that explicitly does that for us
>>> on Apple Silicon. That works better than the current signaling logic, let's make
>>> use of it there.
>>>
>>> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>> ---
>>>   accel/hvf/hvf-cpus.c | 12 ++++++++++++
>>>   1 file changed, 12 insertions(+)
>>>
>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> index b9f674478d..74a272d2e8 100644
>>> --- a/accel/hvf/hvf-cpus.c
>>> +++ b/accel/hvf/hvf-cpus.c
>>> @@ -418,8 +418,20 @@ static void hvf_start_vcpu_thread(CPUState *cpu)
>>>                          cpu, QEMU_THREAD_JOINABLE);
>>>   }
>>>   
>>> +#ifdef __aarch64__
>>> +static void hvf_kick_vcpu_thread(CPUState *cpu)
>>> +{
>>> +    if (!qemu_cpu_is_self(cpu)) {
>>> +        hv_vcpus_exit(&cpu->hvf_fd, 1);
>>> +    }
>>> +}
>>> +#endif
>>> +
>>>   static const CpusAccel hvf_cpus = {
>>>       .create_vcpu_thread = hvf_start_vcpu_thread,
>>> +#ifdef __aarch64__
>>> +    .kick_vcpu_thread = hvf_kick_vcpu_thread,
>>> +#endif
>> Interesting.  We have considered the possibility of adding
>> arch-specific TYPE_ACCEL subclasses when discussing Claudio's
>> series.  Here we have another arch-specific hack that could be
>> avoided if we had a TYPE_ARM_HVF_ACCEL QOM class.
> 
> 
> I don't think that's necessary in this case. I don't see how you could 
> ever have aarch64 and x86 HVF backends compiled into the same binary. 
> The header files even have a lot of #ifdef's.
> 
> Either way, I've changed it to a weak function in v2. That way it's a 
> bit easier to read.
> 
> 
> Alex
> 
> 

Ciao Alex!

you're in the news, congrats on your hack!

Ciao,

Claudio



* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-27 23:30       ` Frank Yang
@ 2020-11-30 20:15         ` Frank Yang
  2020-11-30 20:33           ` Alexander Graf
  0 siblings, 1 reply; 64+ messages in thread
From: Frank Yang @ 2020-11-30 20:15 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini, Peter Collingbourne


Update: We're not quite sure how to compare CNTV_CVAL and CNTVCT. But
the high CPU usage seems to be mitigated by having a poll interval (like
KVM does) in handling WFI:

https://android-review.googlesource.com/c/platform/external/qemu/+/1512501

This is loosely inspired by
https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766
which does seem to specify a poll interval.

It would be cool if we could have a lightweight way to enter sleep and
restart the vcpus precisely when CVAL passes, though.
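The bounded-poll idea above can be sketched as follows. The counter read is stubbed with a fake incrementing value so the logic is self-contained; on hardware it would be a CNTVCT_EL0 read, and all names here are illustrative rather than taken from the linked patches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t fake_cntvct; /* stand-in for the CNTVCT_EL0 counter */

static uint64_t read_cntvct(void)
{
    return ++fake_cntvct; /* pretend the counter ticks on every read */
}

/*
 * Poll for a bounded budget, comparing the virtual counter against
 * CNTV_CVAL. Returns true if the timer deadline passed while polling
 * (deliver the vtimer interrupt); false means the budget is exhausted
 * and the caller should block (condvar/signal) until kicked.
 */
static bool wfi_poll(uint64_t cval, uint64_t max_polls)
{
    for (uint64_t i = 0; i < max_polls; i++) {
        /* signed difference makes the comparison counter-wrap safe */
        if ((int64_t)(read_cntvct() - cval) >= 0) {
            return true;
        }
    }
    return false;
}
```

A real implementation would also cap the budget adaptively the way KVM's halt-polling does, growing it when polling succeeds and shrinking it when it doesn't.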

Frank


On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:

> Hi all,
>
> +Peter Collingbourne <pcc@google.com>
>
> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>
> Peter and I have been working on an HVF Apple Silicon backend with an eye
> toward Android guests.
>
> We have gotten things to boot into Android userspace already
> (logcat/shell and graphics available, at least).
>
> Our strategy so far has been to import logic from the KVM implementation
> and hook into QEMU's software devices that previously were assumed to work
> only with TCG, or to have KVM-specific paths.
>
> Thanks to Alexander for the tip on the 36-bit address space limitation,
> btw; our way of addressing this is to still allow highmem but to place
> the PCI high MMIO window lower.
>
> Also, note we have a sleep/signal based mechanism to deal with WFx, which
> might be worth looking into in Alexander's implementation as well:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>
> Patches so far, FYI:
>
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>
>
> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>
> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>
> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>
> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>
> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>
> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>
> Peter's also noticed that there are extra steps needed on M1s to allow
> TCG to work, as it involves JIT:
>
>
> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>
> We'd appreciate any feedback/comments :)
>
> Best,
>
> Frank
>
> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>>
>> On 27.11.20 21:00, Roman Bolshakov wrote:
>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>> >> Until now, Hypervisor.framework has only been available on x86_64
>> systems.
>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>> >> prepare for support for multiple architectures, let's move common code
>> out
>> >> into its own accel directory.
>> >>
>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>> >> ---
>> >>   MAINTAINERS                 |   9 +-
>> >>   accel/hvf/hvf-all.c         |  56 +++++
>> >>   accel/hvf/hvf-cpus.c        | 468
>> ++++++++++++++++++++++++++++++++++++
>> >>   accel/hvf/meson.build       |   7 +
>> >>   accel/meson.build           |   1 +
>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>> >>   target/i386/hvf/meson.build |   1 -
>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>> >>   target/i386/hvf/x86hvf.h    |   2 -
>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>> >>   create mode 100644 accel/hvf/hvf-all.c
>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>> >>   create mode 100644 accel/hvf/meson.build
>> >>   create mode 100644 include/sysemu/hvf_int.h
>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>> >>
>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>> >> index 68bc160f41..ca4b6d9279 100644
>> >> --- a/MAINTAINERS
>> >> +++ b/MAINTAINERS
>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>> >>   W: https://wiki.qemu.org/Features/HVF
>> >>   S: Maintained
>> >> -F: accel/stubs/hvf-stub.c
>> > There was a patch for that in the RFC series from Claudio.
>>
>>
>> Yeah, I'm not worried about this hunk :).
>>
>>
>> >
>> >>   F: target/i386/hvf/
>> >> +
>> >> +HVF
>> >> +M: Cameron Esfahani <dirty@apple.com>
>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>> >> +W: https://wiki.qemu.org/Features/HVF
>> >> +S: Maintained
>> >> +F: accel/hvf/
>> >>   F: include/sysemu/hvf.h
>> >> +F: include/sysemu/hvf_int.h
>> >>
>> >>   WHPX CPUs
>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>> >> new file mode 100644
>> >> index 0000000000..47d77a472a
>> >> --- /dev/null
>> >> +++ b/accel/hvf/hvf-all.c
>> >> @@ -0,0 +1,56 @@
>> >> +/*
>> >> + * QEMU Hypervisor.framework support
>> >> + *
>> >> + * This work is licensed under the terms of the GNU GPL, version 2.
>> See
>> >> + * the COPYING file in the top-level directory.
>> >> + *
>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>> >> + * GNU GPL, version 2 or (at your option) any later version.
>> >> + */
>> >> +
>> >> +#include "qemu/osdep.h"
>> >> +#include "qemu-common.h"
>> >> +#include "qemu/error-report.h"
>> >> +#include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >> +#include "sysemu/runstate.h"
>> >> +
>> >> +#include "qemu/main-loop.h"
>> >> +#include "sysemu/accel.h"
>> >> +
>> >> +#include <Hypervisor/Hypervisor.h>
>> >> +
>> >> +bool hvf_allowed;
>> >> +HVFState *hvf_state;
>> >> +
>> >> +void assert_hvf_ok(hv_return_t ret)
>> >> +{
>> >> +    if (ret == HV_SUCCESS) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    switch (ret) {
>> >> +    case HV_ERROR:
>> >> +        error_report("Error: HV_ERROR");
>> >> +        break;
>> >> +    case HV_BUSY:
>> >> +        error_report("Error: HV_BUSY");
>> >> +        break;
>> >> +    case HV_BAD_ARGUMENT:
>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>> >> +        break;
>> >> +    case HV_NO_RESOURCES:
>> >> +        error_report("Error: HV_NO_RESOURCES");
>> >> +        break;
>> >> +    case HV_NO_DEVICE:
>> >> +        error_report("Error: HV_NO_DEVICE");
>> >> +        break;
>> >> +    case HV_UNSUPPORTED:
>> >> +        error_report("Error: HV_UNSUPPORTED");
>> >> +        break;
>> >> +    default:
>> >> +        error_report("Unknown Error");
>> >> +    }
>> >> +
>> >> +    abort();
>> >> +}
>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> >> new file mode 100644
>> >> index 0000000000..f9bb5502b7
>> >> --- /dev/null
>> >> +++ b/accel/hvf/hvf-cpus.c
>> >> @@ -0,0 +1,468 @@
>> >> +/*
>> >> + * Copyright 2008 IBM Corporation
>> >> + *           2008 Red Hat, Inc.
>> >> + * Copyright 2011 Intel Corporation
>> >> + * Copyright 2016 Veertu, Inc.
>> >> + * Copyright 2017 The Android Open Source Project
>> >> + *
>> >> + * QEMU Hypervisor.framework support
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or
>> >> + * modify it under the terms of version 2 of the GNU General Public
>> >> + * License as published by the Free Software Foundation.
>> >> + *
>> >> + * This program is distributed in the hope that it will be useful,
>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> >> + * General Public License for more details.
>> >> + *
>> >> + * You should have received a copy of the GNU General Public License
>> >> + * along with this program; if not, see <http://www.gnu.org/licenses/
>> >.
>> >> + *
>> >> + * This file contain code under public domain from the hvdos project:
>> >> + * https://github.com/mist64/hvdos
>> >> + *
>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>> >> + * All rights reserved.
>> >> + *
>> >> + * Redistribution and use in source and binary forms, with or without
>> >> + * modification, are permitted provided that the following conditions
>> >> + * are met:
>> >> + * 1. Redistributions of source code must retain the above copyright
>> >> + *    notice, this list of conditions and the following disclaimer.
>> >> + * 2. Redistributions in binary form must reproduce the above
>> copyright
>> >> + *    notice, this list of conditions and the following disclaimer in
>> the
>> >> + *    documentation and/or other materials provided with the
>> distribution.
>> >> + *
>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>> THE
>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
>> PURPOSE
>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>> LIABLE
>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>> CONSEQUENTIAL
>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>> GOODS
>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>> INTERRUPTION)
>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>> CONTRACT, STRICT
>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>> ANY WAY
>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> POSSIBILITY OF
>> >> + * SUCH DAMAGE.
>> >> + */
>> >> +
>> >> +#include "qemu/osdep.h"
>> >> +#include "qemu/error-report.h"
>> >> +#include "qemu/main-loop.h"
>> >> +#include "exec/address-spaces.h"
>> >> +#include "exec/exec-all.h"
>> >> +#include "sysemu/cpus.h"
>> >> +#include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >> +#include "sysemu/runstate.h"
>> >> +#include "qemu/guest-random.h"
>> >> +
>> >> +#include <Hypervisor/Hypervisor.h>
>> >> +
>> >> +/* Memory slots */
>> >> +
>> >> +struct mac_slot {
>> >> +    int present;
>> >> +    uint64_t size;
>> >> +    uint64_t gpa_start;
>> >> +    uint64_t gva;
>> >> +};
>> >> +
>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> >> +{
>> >> +    hvf_slot *slot;
>> >> +    int x;
>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> +        slot = &hvf_state->slots[x];
>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>> >> +            (start + size) > slot->start) {
>> >> +            return slot;
>> >> +        }
>> >> +    }
>> >> +    return NULL;
>> >> +}
>> >> +
>> >> +struct mac_slot mac_slots[32];
>> >> +
>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> >> +{
>> >> +    struct mac_slot *macslot;
>> >> +    hv_return_t ret;
>> >> +
>> >> +    macslot = &mac_slots[slot->slot_id];
>> >> +
>> >> +    if (macslot->present) {
>> >> +        if (macslot->size != slot->size) {
>> >> +            macslot->present = 0;
>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> >> +            assert_hvf_ok(ret);
>> >> +        }
>> >> +    }
>> >> +
>> >> +    if (!slot->size) {
>> >> +        return 0;
>> >> +    }
>> >> +
>> >> +    macslot->present = 1;
>> >> +    macslot->gpa_start = slot->start;
>> >> +    macslot->size = slot->size;
>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>> >> +    assert_hvf_ok(ret);
>> >> +    return 0;
>> >> +}
>> >> +
>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> >> +{
>> >> +    hvf_slot *mem;
>> >> +    MemoryRegion *area = section->mr;
>> >> +    bool writeable = !area->readonly && !area->rom_device;
>> >> +    hv_memory_flags_t flags;
>> >> +
>> >> +    if (!memory_region_is_ram(area)) {
>> >> +        if (writeable) {
>> >> +            return;
>> >> +        } else if (!memory_region_is_romd(area)) {
>> >> +            /*
>> >> +             * If the memory device is not in romd_mode, then we
>> actually want
>> >> +             * to remove the hvf memory slot so all accesses will
>> trap.
>> >> +             */
>> >> +             add = false;
>> >> +        }
>> >> +    }
>> >> +
>> >> +    mem = hvf_find_overlap_slot(
>> >> +            section->offset_within_address_space,
>> >> +            int128_get64(section->size));
>> >> +
>> >> +    if (mem && add) {
>> >> +        if (mem->size == int128_get64(section->size) &&
>> >> +            mem->start == section->offset_within_address_space &&
>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>> >> +            section->offset_within_region)) {
>> >> +            return; /* Same region was attempted to register, go
>> away. */
>> >> +        }
>> >> +    }
>> >> +
>> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>> >> +    if (mem) {
>> >> +        mem->size = 0;
>> >> +        if (do_hvf_set_memory(mem, 0)) {
>> >> +            error_report("Failed to reset overlapping slot");
>> >> +            abort();
>> >> +        }
>> >> +    }
>> >> +
>> >> +    if (!add) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    if (area->readonly ||
>> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area)))
>> {
>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> >> +    } else {
>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> >> +    }
>> >> +
>> >> +    /* Now make a new slot. */
>> >> +    int x;
>> >> +
>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> +        mem = &hvf_state->slots[x];
>> >> +        if (!mem->size) {
>> >> +            break;
>> >> +        }
>> >> +    }
>> >> +
>> >> +    if (x == hvf_state->num_slots) {
>> >> +        error_report("No free slots");
>> >> +        abort();
>> >> +    }
>> >> +
>> >> +    mem->size = int128_get64(section->size);
>> >> +    mem->mem = memory_region_get_ram_ptr(area) +
>> section->offset_within_region;
>> >> +    mem->start = section->offset_within_address_space;
>> >> +    mem->region = area;
>> >> +
>> >> +    if (do_hvf_set_memory(mem, flags)) {
>> >> +        error_report("Error registering new memory slot");
>> >> +        abort();
>> >> +    }
>> >> +}
>> >> +
>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
>> on)
>> >> +{
>> >> +    hvf_slot *slot;
>> >> +
>> >> +    slot = hvf_find_overlap_slot(
>> >> +            section->offset_within_address_space,
>> >> +            int128_get64(section->size));
>> >> +
>> >> +    /* protect region against writes; begin tracking it */
>> >> +    if (on) {
>> >> +        slot->flags |= HVF_SLOT_LOG;
>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> >> +                      HV_MEMORY_READ);
>> >> +    /* stop tracking region*/
>> >> +    } else {
>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> >> +    }
>> >> +}
>> >> +
>> >> +static void hvf_log_start(MemoryListener *listener,
>> >> +                          MemoryRegionSection *section, int old, int
>> new)
>> >> +{
>> >> +    if (old != 0) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    hvf_set_dirty_tracking(section, 1);
>> >> +}
>> >> +
>> >> +static void hvf_log_stop(MemoryListener *listener,
>> >> +                         MemoryRegionSection *section, int old, int
>> new)
>> >> +{
>> >> +    if (new != 0) {
>> >> +        return;
>> >> +    }
>> >> +
>> >> +    hvf_set_dirty_tracking(section, 0);
>> >> +}
>> >> +
>> >> +static void hvf_log_sync(MemoryListener *listener,
>> >> +                         MemoryRegionSection *section)
>> >> +{
>> >> +    /*
>> >> +     * sync of dirty pages is handled elsewhere; just make sure we
>> keep
>> >> +     * tracking the region.
>> >> +     */
>> >> +    hvf_set_dirty_tracking(section, 1);
>> >> +}
>> >> +
>> >> +static void hvf_region_add(MemoryListener *listener,
>> >> +                           MemoryRegionSection *section)
>> >> +{
>> >> +    hvf_set_phys_mem(section, true);
>> >> +}
>> >> +
>> >> +static void hvf_region_del(MemoryListener *listener,
>> >> +                           MemoryRegionSection *section)
>> >> +{
>> >> +    hvf_set_phys_mem(section, false);
>> >> +}
>> >> +
>> >> +static MemoryListener hvf_memory_listener = {
>> >> +    .priority = 10,
>> >> +    .region_add = hvf_region_add,
>> >> +    .region_del = hvf_region_del,
>> >> +    .log_start = hvf_log_start,
>> >> +    .log_stop = hvf_log_stop,
>> >> +    .log_sync = hvf_log_sync,
>> >> +};
>> >> +
>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>> run_on_cpu_data arg)
>> >> +{
>> >> +    if (!cpu->vcpu_dirty) {
>> >> +        hvf_get_registers(cpu);
>> >> +        cpu->vcpu_dirty = true;
>> >> +    }
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>> >> +{
>> >> +    if (!cpu->vcpu_dirty) {
>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>> RUN_ON_CPU_NULL);
>> >> +    }
>> >> +}
>> >> +
>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> >> +                                              run_on_cpu_data arg)
>> >> +{
>> >> +    hvf_put_registers(cpu);
>> >> +    cpu->vcpu_dirty = false;
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> >> +{
>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>> RUN_ON_CPU_NULL);
>> >> +}
>> >> +
>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> >> +                                             run_on_cpu_data arg)
>> >> +{
>> >> +    hvf_put_registers(cpu);
>> >> +    cpu->vcpu_dirty = false;
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> >> +{
>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>> RUN_ON_CPU_NULL);
>> >> +}
>> >> +
>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> >> +                                              run_on_cpu_data arg)
>> >> +{
>> >> +    cpu->vcpu_dirty = true;
>> >> +}
>> >> +
>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> >> +{
>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>> RUN_ON_CPU_NULL);
>> >> +}
>> >> +
>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>> >> +{
>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>> >> +    assert_hvf_ok(ret);
>> >> +
>> >> +    hvf_arch_vcpu_destroy(cpu);
>> >> +}
>> >> +
>> >> +static void dummy_signal(int sig)
>> >> +{
>> >> +}
>> >> +
>> >> +static int hvf_init_vcpu(CPUState *cpu)
>> >> +{
>> >> +    int r;
>> >> +
>> >> +    /* init cpu signals */
>> >> +    sigset_t set;
>> >> +    struct sigaction sigact;
>> >> +
>> >> +    memset(&sigact, 0, sizeof(sigact));
>> >> +    sigact.sa_handler = dummy_signal;
>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>> >> +
>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> >> +    sigdelset(&set, SIG_IPI);
>> >> +
>> >> +#ifdef __aarch64__
>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t
>> **)&cpu->hvf_exit, NULL);
>> >> +#else
>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> >> +#endif
>> > I think the first __aarch64__ bit fits better in the ARM part of the series.
>>
>>
>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>> ARM enablement.
>>
>>
>> >
>> >> +    cpu->vcpu_dirty = 1;
>> >> +    assert_hvf_ok(r);
>> >> +
>> >> +    return hvf_arch_init_vcpu(cpu);
>> >> +}
>> >> +
>> >> +/*
>> >> + * The HVF-specific vCPU thread function. This one should only run
>> when the host
>> >> + * CPU supports the VMX "unrestricted guest" feature.
>> >> + */
>> >> +static void *hvf_cpu_thread_fn(void *arg)
>> >> +{
>> >> +    CPUState *cpu = arg;
>> >> +
>> >> +    int r;
>> >> +
>> >> +    assert(hvf_enabled());
>> >> +
>> >> +    rcu_register_thread();
>> >> +
>> >> +    qemu_mutex_lock_iothread();
>> >> +    qemu_thread_get_self(cpu->thread);
>> >> +
>> >> +    cpu->thread_id = qemu_get_thread_id();
>> >> +    cpu->can_do_io = 1;
>> >> +    current_cpu = cpu;
>> >> +
>> >> +    hvf_init_vcpu(cpu);
>> >> +
>> >> +    /* signal CPU creation */
>> >> +    cpu_thread_signal_created(cpu);
>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> >> +
>> >> +    do {
>> >> +        if (cpu_can_run(cpu)) {
>> >> +            r = hvf_vcpu_exec(cpu);
>> >> +            if (r == EXCP_DEBUG) {
>> >> +                cpu_handle_guest_debug(cpu);
>> >> +            }
>> >> +        }
>> >> +        qemu_wait_io_event(cpu);
>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>> >> +
>> >> +    hvf_vcpu_destroy(cpu);
>> >> +    cpu_thread_signal_destroyed(cpu);
>> >> +    qemu_mutex_unlock_iothread();
>> >> +    rcu_unregister_thread();
>> >> +    return NULL;
>> >> +}
>> >> +
>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>> >> +{
>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>> >> +
>> >> +    /*
>> >> +     * HVF currently does not support TCG, and only runs in
>> >> +     * unrestricted-guest mode.
>> >> +     */
>> >> +    assert(hvf_enabled());
>> >> +
>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> >> +    qemu_cond_init(cpu->halt_cond);
>> >> +
>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> >> +             cpu->cpu_index);
>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>> >> +}
>> >> +
>> >> +static const CpusAccel hvf_cpus = {
>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>> >> +
>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> >> +};
>> >> +
>> >> +static int hvf_accel_init(MachineState *ms)
>> >> +{
>> >> +    int x;
>> >> +    hv_return_t ret;
>> >> +    HVFState *s;
>> >> +
>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>> >> +    assert_hvf_ok(ret);
>> >> +
>> >> +    s = g_new0(HVFState, 1);
>> >> +
>> >> +    s->num_slots = 32;
>> >> +    for (x = 0; x < s->num_slots; ++x) {
>> >> +        s->slots[x].size = 0;
>> >> +        s->slots[x].slot_id = x;
>> >> +    }
>> >> +
>> >> +    hvf_state = s;
>> >> +    memory_listener_register(&hvf_memory_listener,
>> &address_space_memory);
>> >> +    cpus_register_accel(&hvf_cpus);
>> >> +    return 0;
>> >> +}
>> >> +
>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> >> +{
>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>> >> +    ac->name = "HVF";
>> >> +    ac->init_machine = hvf_accel_init;
>> >> +    ac->allowed = &hvf_allowed;
>> >> +}
>> >> +
>> >> +static const TypeInfo hvf_accel_type = {
>> >> +    .name = TYPE_HVF_ACCEL,
>> >> +    .parent = TYPE_ACCEL,
>> >> +    .class_init = hvf_accel_class_init,
>> >> +};
>> >> +
>> >> +static void hvf_type_init(void)
>> >> +{
>> >> +    type_register_static(&hvf_accel_type);
>> >> +}
>> >> +
>> >> +type_init(hvf_type_init);
>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>> >> new file mode 100644
>> >> index 0000000000..dfd6b68dc7
>> >> --- /dev/null
>> >> +++ b/accel/hvf/meson.build
>> >> @@ -0,0 +1,7 @@
>> >> +hvf_ss = ss.source_set()
>> >> +hvf_ss.add(files(
>> >> +  'hvf-all.c',
>> >> +  'hvf-cpus.c',
>> >> +))
>> >> +
>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>> >> diff --git a/accel/meson.build b/accel/meson.build
>> >> index b26cca227a..6de12ce5d5 100644
>> >> --- a/accel/meson.build
>> >> +++ b/accel/meson.build
>> >> @@ -1,5 +1,6 @@
>> >>   softmmu_ss.add(files('accel.c'))
>> >>
>> >> +subdir('hvf')
>> >>   subdir('qtest')
>> >>   subdir('kvm')
>> >>   subdir('tcg')
>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>> >> new file mode 100644
>> >> index 0000000000..de9bad23a8
>> >> --- /dev/null
>> >> +++ b/include/sysemu/hvf_int.h
>> >> @@ -0,0 +1,69 @@
>> >> +/*
>> >> + * QEMU Hypervisor.framework (HVF) support
>> >> + *
>> >> + * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> >> + * See the COPYING file in the top-level directory.
>> >> + *
>> >> + */
>> >> +
>> >> +/* header to be included in HVF-specific code */
>> >> +
>> >> +#ifndef HVF_INT_H
>> >> +#define HVF_INT_H
>> >> +
>> >> +#include <Hypervisor/Hypervisor.h>
>> >> +
>> >> +#define HVF_MAX_VCPU 0x10
>> >> +
>> >> +extern struct hvf_state hvf_global;
>> >> +
>> >> +struct hvf_vm {
>> >> +    int id;
>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> >> +};
>> >> +
>> >> +struct hvf_state {
>> >> +    uint32_t version;
>> >> +    struct hvf_vm *vm;
>> >> +    uint64_t mem_quota;
>> >> +};
>> >> +
>> >> +/* hvf_slot flags */
>> >> +#define HVF_SLOT_LOG (1 << 0)
>> >> +
>> >> +typedef struct hvf_slot {
>> >> +    uint64_t start;
>> >> +    uint64_t size;
>> >> +    uint8_t *mem;
>> >> +    int slot_id;
>> >> +    uint32_t flags;
>> >> +    MemoryRegion *region;
>> >> +} hvf_slot;
>> >> +
>> >> +typedef struct hvf_vcpu_caps {
>> >> +    uint64_t vmx_cap_pinbased;
>> >> +    uint64_t vmx_cap_procbased;
>> >> +    uint64_t vmx_cap_procbased2;
>> >> +    uint64_t vmx_cap_entry;
>> >> +    uint64_t vmx_cap_exit;
>> >> +    uint64_t vmx_cap_preemption_timer;
>> >> +} hvf_vcpu_caps;
>> >> +
>> >> +struct HVFState {
>> >> +    AccelState parent;
>> >> +    hvf_slot slots[32];
>> >> +    int num_slots;
>> >> +
>> >> +    hvf_vcpu_caps *hvf_caps;
>> >> +};
>> >> +extern HVFState *hvf_state;
>> >> +
>> >> +void assert_hvf_ok(hv_return_t ret);
>> >> +int hvf_get_registers(CPUState *cpu);
>> >> +int hvf_put_registers(CPUState *cpu);
>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>> >> +int hvf_vcpu_exec(CPUState *cpu);
>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>> >> +
>> >> +#endif
>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>> >> deleted file mode 100644
>> >> index 817b3d7452..0000000000
>> >> --- a/target/i386/hvf/hvf-cpus.c
>> >> +++ /dev/null
>> >> @@ -1,131 +0,0 @@
>> >> -/*
>> >> - * Copyright 2008 IBM Corporation
>> >> - *           2008 Red Hat, Inc.
>> >> - * Copyright 2011 Intel Corporation
>> >> - * Copyright 2016 Veertu, Inc.
>> >> - * Copyright 2017 The Android Open Source Project
>> >> - *
>> >> - * QEMU Hypervisor.framework support
>> >> - *
>> >> - * This program is free software; you can redistribute it and/or
>> >> - * modify it under the terms of version 2 of the GNU General Public
>> >> - * License as published by the Free Software Foundation.
>> >> - *
>> >> - * This program is distributed in the hope that it will be useful,
>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> >> - * General Public License for more details.
>> >> - *
>> >> - * You should have received a copy of the GNU General Public License
>> >> - * along with this program; if not, see <http://www.gnu.org/licenses/
>> >.
>> >> - *
>> >> - * This file contain code under public domain from the hvdos project:
>> >> - * https://github.com/mist64/hvdos
>> >> - *
>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>> >> - * All rights reserved.
>> >> - *
>> >> - * Redistribution and use in source and binary forms, with or without
>> >> - * modification, are permitted provided that the following conditions
>> >> - * are met:
>> >> - * 1. Redistributions of source code must retain the above copyright
>> >> - *    notice, this list of conditions and the following disclaimer.
>> >> - * 2. Redistributions in binary form must reproduce the above
>> copyright
>> >> - *    notice, this list of conditions and the following disclaimer in
>> the
>> >> - *    documentation and/or other materials provided with the
>> distribution.
>> >> - *
>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>> THE
>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
>> PURPOSE
>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>> LIABLE
>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>> CONSEQUENTIAL
>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>> GOODS
>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>> INTERRUPTION)
>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>> CONTRACT, STRICT
>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>> ANY WAY
>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>> POSSIBILITY OF
>> >> - * SUCH DAMAGE.
>> >> - */
>> >> -
>> >> -#include "qemu/osdep.h"
>> >> -#include "qemu/error-report.h"
>> >> -#include "qemu/main-loop.h"
>> >> -#include "sysemu/hvf.h"
>> >> -#include "sysemu/runstate.h"
>> >> -#include "target/i386/cpu.h"
>> >> -#include "qemu/guest-random.h"
>> >> -
>> >> -#include "hvf-cpus.h"
>> >> -
>> >> -/*
>> >> - * The HVF-specific vCPU thread function. This one should only run
>> when the host
>> >> - * CPU supports the VMX "unrestricted guest" feature.
>> >> - */
>> >> -static void *hvf_cpu_thread_fn(void *arg)
>> >> -{
>> >> -    CPUState *cpu = arg;
>> >> -
>> >> -    int r;
>> >> -
>> >> -    assert(hvf_enabled());
>> >> -
>> >> -    rcu_register_thread();
>> >> -
>> >> -    qemu_mutex_lock_iothread();
>> >> -    qemu_thread_get_self(cpu->thread);
>> >> -
>> >> -    cpu->thread_id = qemu_get_thread_id();
>> >> -    cpu->can_do_io = 1;
>> >> -    current_cpu = cpu;
>> >> -
>> >> -    hvf_init_vcpu(cpu);
>> >> -
>> >> -    /* signal CPU creation */
>> >> -    cpu_thread_signal_created(cpu);
>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>> >> -
>> >> -    do {
>> >> -        if (cpu_can_run(cpu)) {
>> >> -            r = hvf_vcpu_exec(cpu);
>> >> -            if (r == EXCP_DEBUG) {
>> >> -                cpu_handle_guest_debug(cpu);
>> >> -            }
>> >> -        }
>> >> -        qemu_wait_io_event(cpu);
>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>> >> -
>> >> -    hvf_vcpu_destroy(cpu);
>> >> -    cpu_thread_signal_destroyed(cpu);
>> >> -    qemu_mutex_unlock_iothread();
>> >> -    rcu_unregister_thread();
>> >> -    return NULL;
>> >> -}
>> >> -
>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>> >> -{
>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>> >> -
>> >> -    /*
>> >> -     * HVF currently does not support TCG, and only runs in
>> >> -     * unrestricted-guest mode.
>> >> -     */
>> >> -    assert(hvf_enabled());
>> >> -
>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>> >> -    qemu_cond_init(cpu->halt_cond);
>> >> -
>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>> >> -             cpu->cpu_index);
>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>> >> -}
>> >> -
>> >> -const CpusAccel hvf_cpus = {
>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>> >> -
>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>> >> -};
>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>> >> deleted file mode 100644
>> >> index ced31b82c0..0000000000
>> >> --- a/target/i386/hvf/hvf-cpus.h
>> >> +++ /dev/null
>> >> @@ -1,25 +0,0 @@
>> >> -/*
>> >> - * Accelerator CPUS Interface
>> >> - *
>> >> - * Copyright 2020 SUSE LLC
>> >> - *
>> >> - * This work is licensed under the terms of the GNU GPL, version 2 or
>> later.
>> >> - * See the COPYING file in the top-level directory.
>> >> - */
>> >> -
>> >> -#ifndef HVF_CPUS_H
>> >> -#define HVF_CPUS_H
>> >> -
>> >> -#include "sysemu/cpus.h"
>> >> -
>> >> -extern const CpusAccel hvf_cpus;
>> >> -
>> >> -int hvf_init_vcpu(CPUState *);
>> >> -int hvf_vcpu_exec(CPUState *);
>> >> -void hvf_cpu_synchronize_state(CPUState *);
>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>> >> -void hvf_vcpu_destroy(CPUState *);
>> >> -
>> >> -#endif /* HVF_CPUS_H */
>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>> >> index e0edffd077..6d56f8f6bb 100644
>> >> --- a/target/i386/hvf/hvf-i386.h
>> >> +++ b/target/i386/hvf/hvf-i386.h
>> >> @@ -18,57 +18,11 @@
>> >>
>> >>   #include "sysemu/accel.h"
>> >>   #include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >>   #include "cpu.h"
>> >>   #include "x86.h"
>> >>
>> >> -#define HVF_MAX_VCPU 0x10
>> >> -
>> >> -extern struct hvf_state hvf_global;
>> >> -
>> >> -struct hvf_vm {
>> >> -    int id;
>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>> >> -};
>> >> -
>> >> -struct hvf_state {
>> >> -    uint32_t version;
>> >> -    struct hvf_vm *vm;
>> >> -    uint64_t mem_quota;
>> >> -};
>> >> -
>> >> -/* hvf_slot flags */
>> >> -#define HVF_SLOT_LOG (1 << 0)
>> >> -
>> >> -typedef struct hvf_slot {
>> >> -    uint64_t start;
>> >> -    uint64_t size;
>> >> -    uint8_t *mem;
>> >> -    int slot_id;
>> >> -    uint32_t flags;
>> >> -    MemoryRegion *region;
>> >> -} hvf_slot;
>> >> -
>> >> -typedef struct hvf_vcpu_caps {
>> >> -    uint64_t vmx_cap_pinbased;
>> >> -    uint64_t vmx_cap_procbased;
>> >> -    uint64_t vmx_cap_procbased2;
>> >> -    uint64_t vmx_cap_entry;
>> >> -    uint64_t vmx_cap_exit;
>> >> -    uint64_t vmx_cap_preemption_timer;
>> >> -} hvf_vcpu_caps;
>> >> -
>> >> -struct HVFState {
>> >> -    AccelState parent;
>> >> -    hvf_slot slots[32];
>> >> -    int num_slots;
>> >> -
>> >> -    hvf_vcpu_caps *hvf_caps;
>> >> -};
>> >> -extern HVFState *hvf_state;
>> >> -
>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>> >>
>> >>   #ifdef NEED_CPU_H
>> >>   /* Functions exported to host specific mode */
>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>> >> index ed9356565c..8b96ecd619 100644
>> >> --- a/target/i386/hvf/hvf.c
>> >> +++ b/target/i386/hvf/hvf.c
>> >> @@ -51,6 +51,7 @@
>> >>   #include "qemu/error-report.h"
>> >>
>> >>   #include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >>   #include "sysemu/runstate.h"
>> >>   #include "hvf-i386.h"
>> >>   #include "vmcs.h"
>> >> @@ -72,171 +73,6 @@
>> >>   #include "sysemu/accel.h"
>> >>   #include "target/i386/cpu.h"
>> >>
>> >> -#include "hvf-cpus.h"
>> >> -
>> >> -HVFState *hvf_state;
>> >> -
>> >> -static void assert_hvf_ok(hv_return_t ret)
>> >> -{
>> >> -    if (ret == HV_SUCCESS) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    switch (ret) {
>> >> -    case HV_ERROR:
>> >> -        error_report("Error: HV_ERROR");
>> >> -        break;
>> >> -    case HV_BUSY:
>> >> -        error_report("Error: HV_BUSY");
>> >> -        break;
>> >> -    case HV_BAD_ARGUMENT:
>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>> >> -        break;
>> >> -    case HV_NO_RESOURCES:
>> >> -        error_report("Error: HV_NO_RESOURCES");
>> >> -        break;
>> >> -    case HV_NO_DEVICE:
>> >> -        error_report("Error: HV_NO_DEVICE");
>> >> -        break;
>> >> -    case HV_UNSUPPORTED:
>> >> -        error_report("Error: HV_UNSUPPORTED");
>> >> -        break;
>> >> -    default:
>> >> -        error_report("Unknown Error");
>> >> -    }
>> >> -
>> >> -    abort();
>> >> -}
>> >> -
>> >> -/* Memory slots */
>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>> >> -{
>> >> -    hvf_slot *slot;
>> >> -    int x;
>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> -        slot = &hvf_state->slots[x];
>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>> >> -            (start + size) > slot->start) {
>> >> -            return slot;
>> >> -        }
>> >> -    }
>> >> -    return NULL;
>> >> -}
>> >> -
>> >> -struct mac_slot {
>> >> -    int present;
>> >> -    uint64_t size;
>> >> -    uint64_t gpa_start;
>> >> -    uint64_t gva;
>> >> -};
>> >> -
>> >> -struct mac_slot mac_slots[32];
>> >> -
>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>> >> -{
>> >> -    struct mac_slot *macslot;
>> >> -    hv_return_t ret;
>> >> -
>> >> -    macslot = &mac_slots[slot->slot_id];
>> >> -
>> >> -    if (macslot->present) {
>> >> -        if (macslot->size != slot->size) {
>> >> -            macslot->present = 0;
>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>> >> -            assert_hvf_ok(ret);
>> >> -        }
>> >> -    }
>> >> -
>> >> -    if (!slot->size) {
>> >> -        return 0;
>> >> -    }
>> >> -
>> >> -    macslot->present = 1;
>> >> -    macslot->gpa_start = slot->start;
>> >> -    macslot->size = slot->size;
>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size,
>> flags);
>> >> -    assert_hvf_ok(ret);
>> >> -    return 0;
>> >> -}
>> >> -
>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>> >> -{
>> >> -    hvf_slot *mem;
>> >> -    MemoryRegion *area = section->mr;
>> >> -    bool writeable = !area->readonly && !area->rom_device;
>> >> -    hv_memory_flags_t flags;
>> >> -
>> >> -    if (!memory_region_is_ram(area)) {
>> >> -        if (writeable) {
>> >> -            return;
>> >> -        } else if (!memory_region_is_romd(area)) {
>> >> -            /*
>> >> -             * If the memory device is not in romd_mode, then we
>> actually want
>> >> -             * to remove the hvf memory slot so all accesses will
>> trap.
>> >> -             */
>> >> -             add = false;
>> >> -        }
>> >> -    }
>> >> -
>> >> -    mem = hvf_find_overlap_slot(
>> >> -            section->offset_within_address_space,
>> >> -            int128_get64(section->size));
>> >> -
>> >> -    if (mem && add) {
>> >> -        if (mem->size == int128_get64(section->size) &&
>> >> -            mem->start == section->offset_within_address_space &&
>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>> >> -            section->offset_within_region)) {
>> >> -            return; /* Same region was attempted to register, go
>> away. */
>> >> -        }
>> >> -    }
>> >> -
>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>> >> -    if (mem) {
>> >> -        mem->size = 0;
>> >> -        if (do_hvf_set_memory(mem, 0)) {
>> >> -            error_report("Failed to reset overlapping slot");
>> >> -            abort();
>> >> -        }
>> >> -    }
>> >> -
>> >> -    if (!add) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    if (area->readonly ||
>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area)))
>> {
>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>> >> -    } else {
>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>> >> -    }
>> >> -
>> >> -    /* Now make a new slot. */
>> >> -    int x;
>> >> -
>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>> >> -        mem = &hvf_state->slots[x];
>> >> -        if (!mem->size) {
>> >> -            break;
>> >> -        }
>> >> -    }
>> >> -
>> >> -    if (x == hvf_state->num_slots) {
>> >> -        error_report("No free slots");
>> >> -        abort();
>> >> -    }
>> >> -
>> >> -    mem->size = int128_get64(section->size);
>> >> -    mem->mem = memory_region_get_ram_ptr(area) +
>> section->offset_within_region;
>> >> -    mem->start = section->offset_within_address_space;
>> >> -    mem->region = area;
>> >> -
>> >> -    if (do_hvf_set_memory(mem, flags)) {
>> >> -        error_report("Error registering new memory slot");
>> >> -        abort();
>> >> -    }
>> >> -}
>> >> -
>> >>   void vmx_update_tpr(CPUState *cpu)
>> >>   {
>> >>       /* TODO: need integrate APIC handling */
>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t
>> port, void *buffer,
>> >>       }
>> >>   }
>> >>
>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>> run_on_cpu_data arg)
>> >> -{
>> >> -    if (!cpu->vcpu_dirty) {
>> >> -        hvf_get_registers(cpu);
>> >> -        cpu->vcpu_dirty = true;
>> >> -    }
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>> >> -{
>> >> -    if (!cpu->vcpu_dirty) {
>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>> RUN_ON_CPU_NULL);
>> >> -    }
>> >> -}
>> >> -
>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>> >> -                                              run_on_cpu_data arg)
>> >> -{
>> >> -    hvf_put_registers(cpu);
>> >> -    cpu->vcpu_dirty = false;
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>> >> -{
>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>> RUN_ON_CPU_NULL);
>> >> -}
>> >> -
>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>> >> -                                             run_on_cpu_data arg)
>> >> -{
>> >> -    hvf_put_registers(cpu);
>> >> -    cpu->vcpu_dirty = false;
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>> >> -{
>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>> RUN_ON_CPU_NULL);
>> >> -}
>> >> -
>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>> >> -                                              run_on_cpu_data arg)
>> >> -{
>> >> -    cpu->vcpu_dirty = true;
>> >> -}
>> >> -
>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>> >> -{
>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>> RUN_ON_CPU_NULL);
>> >> -}
>> >> -
>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa,
>> uint64_t ept_qual)
>> >>   {
>> >>       int read, write;
>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot,
>> uint64_t gpa, uint64_t ept_qual)
>> >>       return false;
>> >>   }
>> >>
>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool
>> on)
>> >> -{
>> >> -    hvf_slot *slot;
>> >> -
>> >> -    slot = hvf_find_overlap_slot(
>> >> -            section->offset_within_address_space,
>> >> -            int128_get64(section->size));
>> >> -
>> >> -    /* protect region against writes; begin tracking it */
>> >> -    if (on) {
>> >> -        slot->flags |= HVF_SLOT_LOG;
>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> >> -                      HV_MEMORY_READ);
>> >> -    /* stop tracking region*/
>> >> -    } else {
>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>> >> -    }
>> >> -}
>> >> -
>> >> -static void hvf_log_start(MemoryListener *listener,
>> >> -                          MemoryRegionSection *section, int old, int
>> new)
>> >> -{
>> >> -    if (old != 0) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    hvf_set_dirty_tracking(section, 1);
>> >> -}
>> >> -
>> >> -static void hvf_log_stop(MemoryListener *listener,
>> >> -                         MemoryRegionSection *section, int old, int
>> new)
>> >> -{
>> >> -    if (new != 0) {
>> >> -        return;
>> >> -    }
>> >> -
>> >> -    hvf_set_dirty_tracking(section, 0);
>> >> -}
>> >> -
>> >> -static void hvf_log_sync(MemoryListener *listener,
>> >> -                         MemoryRegionSection *section)
>> >> -{
>> >> -    /*
>> >> -     * sync of dirty pages is handled elsewhere; just make sure we
>> keep
>> >> -     * tracking the region.
>> >> -     */
>> >> -    hvf_set_dirty_tracking(section, 1);
>> >> -}
>> >> -
>> >> -static void hvf_region_add(MemoryListener *listener,
>> >> -                           MemoryRegionSection *section)
>> >> -{
>> >> -    hvf_set_phys_mem(section, true);
>> >> -}
>> >> -
>> >> -static void hvf_region_del(MemoryListener *listener,
>> >> -                           MemoryRegionSection *section)
>> >> -{
>> >> -    hvf_set_phys_mem(section, false);
>> >> -}
>> >> -
>> >> -static MemoryListener hvf_memory_listener = {
>> >> -    .priority = 10,
>> >> -    .region_add = hvf_region_add,
>> >> -    .region_del = hvf_region_del,
>> >> -    .log_start = hvf_log_start,
>> >> -    .log_stop = hvf_log_stop,
>> >> -    .log_sync = hvf_log_sync,
>> >> -};
>> >> -
>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>> >>   {
>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>> >>       CPUX86State *env = &x86_cpu->env;
>> >>
>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>> >>       g_free(env->hvf_mmio_buf);
>> >> -    assert_hvf_ok(ret);
>> >> -}
>> >> -
>> >> -static void dummy_signal(int sig)
>> >> -{
>> >>   }
>> >>
>> >> -int hvf_init_vcpu(CPUState *cpu)
>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>> >>   {
>> >>
>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>> >>       CPUX86State *env = &x86cpu->env;
>> >> -    int r;
>> >> -
>> >> -    /* init cpu signals */
>> >> -    sigset_t set;
>> >> -    struct sigaction sigact;
>> >> -
>> >> -    memset(&sigact, 0, sizeof(sigact));
>> >> -    sigact.sa_handler = dummy_signal;
>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>> >> -
>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> >> -    sigdelset(&set, SIG_IPI);
>> >>
>> >>       init_emu();
>> >>       init_decoder();
>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>> >>
>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>> >> -    cpu->vcpu_dirty = 1;
>> >> -    assert_hvf_ok(r);
>> >> -
>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>> >>           abort();
>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>> >>
>> >>       return ret;
>> >>   }
>> >> -
>> >> -bool hvf_allowed;
>> >> -
>> >> -static int hvf_accel_init(MachineState *ms)
>> >> -{
>> >> -    int x;
>> >> -    hv_return_t ret;
>> >> -    HVFState *s;
>> >> -
>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>> >> -    assert_hvf_ok(ret);
>> >> -
>> >> -    s = g_new0(HVFState, 1);
>> >> -
>> >> -    s->num_slots = 32;
>> >> -    for (x = 0; x < s->num_slots; ++x) {
>> >> -        s->slots[x].size = 0;
>> >> -        s->slots[x].slot_id = x;
>> >> -    }
>> >> -
>> >> -    hvf_state = s;
>> >> -    memory_listener_register(&hvf_memory_listener,
>> &address_space_memory);
>> >> -    cpus_register_accel(&hvf_cpus);
>> >> -    return 0;
>> >> -}
>> >> -
>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>> >> -{
>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>> >> -    ac->name = "HVF";
>> >> -    ac->init_machine = hvf_accel_init;
>> >> -    ac->allowed = &hvf_allowed;
>> >> -}
>> >> -
>> >> -static const TypeInfo hvf_accel_type = {
>> >> -    .name = TYPE_HVF_ACCEL,
>> >> -    .parent = TYPE_ACCEL,
>> >> -    .class_init = hvf_accel_class_init,
>> >> -};
>> >> -
>> >> -static void hvf_type_init(void)
>> >> -{
>> >> -    type_register_static(&hvf_accel_type);
>> >> -}
>> >> -
>> >> -type_init(hvf_type_init);
>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>> >> index 409c9a3f14..c8a43717ee 100644
>> >> --- a/target/i386/hvf/meson.build
>> >> +++ b/target/i386/hvf/meson.build
>> >> @@ -1,6 +1,5 @@
>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>> >>     'hvf.c',
>> >> -  'hvf-cpus.c',
>> >>     'x86.c',
>> >>     'x86_cpuid.c',
>> >>     'x86_decode.c',
>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>> >> index bbec412b6c..89b8e9d87a 100644
>> >> --- a/target/i386/hvf/x86hvf.c
>> >> +++ b/target/i386/hvf/x86hvf.c
>> >> @@ -20,6 +20,9 @@
>> >>   #include "qemu/osdep.h"
>> >>
>> >>   #include "qemu-common.h"
>> >> +#include "sysemu/hvf.h"
>> >> +#include "sysemu/hvf_int.h"
>> >> +#include "sysemu/hw_accel.h"
>> >>   #include "x86hvf.h"
>> >>   #include "vmx.h"
>> >>   #include "vmcs.h"
>> >> @@ -32,8 +35,6 @@
>> >>   #include <Hypervisor/hv.h>
>> >>   #include <Hypervisor/hv_vmx.h>
>> >>
>> >> -#include "hvf-cpus.h"
>> >> -
>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment
>> *vmx_seg,
>> >>                        SegmentCache *qseg, bool is_tr)
>> >>   {
>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>> >>
>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>> >> -        hvf_cpu_synchronize_state(cpu_state);
>> >> +        cpu_synchronize_state(cpu_state);
>> >>           do_cpu_init(cpu);
>> >>       }
>> >>
>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>> >>           cpu_state->halted = 0;
>> >>       }
>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>> >> -        hvf_cpu_synchronize_state(cpu_state);
>> >> +        cpu_synchronize_state(cpu_state);
>> >>           do_cpu_sipi(cpu);
>> >>       }
>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>> >> -        hvf_cpu_synchronize_state(cpu_state);
>> >> +        cpu_synchronize_state(cpu_state);
>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>> > be a separate patch. It follows the cpu/accel cleanups Claudio was doing
>> > this summer.
>>
>>
>> The only reason they're in here is because we no longer have access to
>> the hvf_ functions from the file. I am perfectly happy to rebase the
>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>> trivial for him to rebase on top of this too if my series goes in first.
>>
>>
>> >
>> > Philippe raised the idea that the patch might go ahead of the ARM-specific
>> > part (which might involve some discussions) and I agree with that.
>> >
>> > Some sync between Claudio's series (he's CC'd) and this patch might be needed.
>>
>>
>> I would prefer not to hold back because of the sync. Claudio's cleanup
>> is trivial enough to adjust for if it gets merged ahead of this.
>>
>>
>> Alex
>>
>>
>>
>>



* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 20:15         ` Frank Yang
@ 2020-11-30 20:33           ` Alexander Graf
  2020-11-30 20:55             ` Frank Yang
  0 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-30 20:33 UTC (permalink / raw)
  To: Frank Yang
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Paolo Bonzini, Peter Collingbourne


Hi Frank,

Thanks for the update :). Your previous email nudged me in the right 
direction. I had previously implemented WFI through the internal timer 
framework, which performed way worse.

Along the way, I stumbled over a few issues though. For starters, the 
signal mask for SIG_IPI was not set correctly, so while pselect() would 
exit, the signal would never get delivered to the thread! For a fix, 
check out

https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/

Please also have a look at my latest stab at WFI emulation. It doesn't 
handle WFE (that's only relevant in overcommitted scenarios). But it 
does handle WFI and even does something similar to hlt polling, albeit 
not with an adaptive threshold.

Also, is there a particular reason you're working on this super 
interesting and useful code in a random downstream fork of QEMU? 
Wouldn't it be more helpful to contribute to the upstream code base instead?


Alex


On 30.11.20 21:15, Frank Yang wrote:
> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. 
> But the high CPU usage seems to be mitigated by having a poll interval 
> (like KVM does) in handling WFI:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501 
>
> This is loosely inspired by 
> https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 
> which does seem to specify a poll interval.
>
> It would be cool if we could have a lightweight way to enter sleep and 
> restart the vcpus precisely when CVAL passes, though.
>
> Frank
>
>
> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>
>     Hi all,
>
>     +Peter Collingbourne <pcc@google.com>
>
>     I'm a developer on the Android Emulator, which is in a fork of QEMU.
>
>     Peter and I have been working on an HVF Apple Silicon backend with
>     an eye toward Android guests.
>
>     We have gotten things to basically switch to Android userspace
>     already (logcat/shell and graphics available at least)
>
>     Our strategy so far has been to import logic from the KVM
>     implementation and hook into QEMU's software devices
>     that were previously assumed to work only with TCG, or that
>     had KVM-specific paths.
>
>     Thanks to Alexander for the tip on the 36-bit address space
>     limitation btw; our way of addressing this is to still allow
>     highmem but not place the PCI high MMIO region so high.
>
>     Also, note we have a sleep/signal based mechanism to deal with
>     WFx, which might be worth looking into in Alexander's
>     implementation as well:
>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>
>     Patches so far, FYI:
>
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>     https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>
>     https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>     https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>     https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>     https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>     https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>     https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>
>     Peter's also noticed that there are extra steps needed on M1s to
>     allow TCG to work, as it involves JIT:
>
>     https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>
>     We'd appreciate any feedback/comments :)
>
>     Best,
>
>     Frank
>
>     On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
>         On 27.11.20 21:00, Roman Bolshakov wrote:
>         > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>         >> Until now, Hypervisor.framework has only been available on
>         x86_64 systems.
>         >> With Apple Silicon shipping now, it extends its reach to
>         aarch64. To
>         >> prepare for support for multiple architectures, let's move
>         common code out
>         >> into its own accel directory.
>         >>
>         >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>         >> ---
>         >>   MAINTAINERS                 |   9 +-
>         >>   accel/hvf/hvf-all.c         |  56 +++++
>         >>   accel/hvf/hvf-cpus.c        | 468
>         ++++++++++++++++++++++++++++++++++++
>         >>   accel/hvf/meson.build       |   7 +
>         >>   accel/meson.build           |   1 +
>         >>   include/sysemu/hvf_int.h    |  69 ++++++
>         >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>         >>   target/i386/hvf/hvf-cpus.h  |  25 --
>         >>   target/i386/hvf/hvf-i386.h  |  48 +---
>         >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>         >>   target/i386/hvf/meson.build |   1 -
>         >>   target/i386/hvf/x86hvf.c    |  11 +-
>         >>   target/i386/hvf/x86hvf.h    |   2 -
>         >>   13 files changed, 619 insertions(+), 569 deletions(-)
>         >>   create mode 100644 accel/hvf/hvf-all.c
>         >>   create mode 100644 accel/hvf/hvf-cpus.c
>         >>   create mode 100644 accel/hvf/meson.build
>         >>   create mode 100644 include/sysemu/hvf_int.h
>         >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>         >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>         >>
>         >> diff --git a/MAINTAINERS b/MAINTAINERS
>         >> index 68bc160f41..ca4b6d9279 100644
>         >> --- a/MAINTAINERS
>         >> +++ b/MAINTAINERS
>         >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>         >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>         >>   W: https://wiki.qemu.org/Features/HVF
>         >>   S: Maintained
>         >> -F: accel/stubs/hvf-stub.c
>         > There was a patch for that in the RFC series from Claudio.
>
>
>         Yeah, I'm not worried about this hunk :).
>
>
>         >
>         >>   F: target/i386/hvf/
>         >> +
>         >> +HVF
>         >> +M: Cameron Esfahani <dirty@apple.com>
>         >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>         >> +W: https://wiki.qemu.org/Features/HVF
>         >> +S: Maintained
>         >> +F: accel/hvf/
>         >>   F: include/sysemu/hvf.h
>         >> +F: include/sysemu/hvf_int.h
>         >>
>         >>   WHPX CPUs
>         >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>         >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>         >> new file mode 100644
>         >> index 0000000000..47d77a472a
>         >> --- /dev/null
>         >> +++ b/accel/hvf/hvf-all.c
>         >> @@ -0,0 +1,56 @@
>         >> +/*
>         >> + * QEMU Hypervisor.framework support
>         >> + *
>         >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>         >> + * the COPYING file in the top-level directory.
>         >> + *
>         >> + * Contributions after 2012-01-13 are licensed under the terms of the
>         >> + * GNU GPL, version 2 or (at your option) any later version.
>         >> + */
>         >> +
>         >> +#include "qemu/osdep.h"
>         >> +#include "qemu-common.h"
>         >> +#include "qemu/error-report.h"
>         >> +#include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >> +#include "sysemu/runstate.h"
>         >> +
>         >> +#include "qemu/main-loop.h"
>         >> +#include "sysemu/accel.h"
>         >> +
>         >> +#include <Hypervisor/Hypervisor.h>
>         >> +
>         >> +bool hvf_allowed;
>         >> +HVFState *hvf_state;
>         >> +
>         >> +void assert_hvf_ok(hv_return_t ret)
>         >> +{
>         >> +    if (ret == HV_SUCCESS) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    switch (ret) {
>         >> +    case HV_ERROR:
>         >> +        error_report("Error: HV_ERROR");
>         >> +        break;
>         >> +    case HV_BUSY:
>         >> +        error_report("Error: HV_BUSY");
>         >> +        break;
>         >> +    case HV_BAD_ARGUMENT:
>         >> +        error_report("Error: HV_BAD_ARGUMENT");
>         >> +        break;
>         >> +    case HV_NO_RESOURCES:
>         >> +        error_report("Error: HV_NO_RESOURCES");
>         >> +        break;
>         >> +    case HV_NO_DEVICE:
>         >> +        error_report("Error: HV_NO_DEVICE");
>         >> +        break;
>         >> +    case HV_UNSUPPORTED:
>         >> +        error_report("Error: HV_UNSUPPORTED");
>         >> +        break;
>         >> +    default:
>         >> +        error_report("Unknown Error");
>         >> +    }
>         >> +
>         >> +    abort();
>         >> +}
>         >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>         >> new file mode 100644
>         >> index 0000000000..f9bb5502b7
>         >> --- /dev/null
>         >> +++ b/accel/hvf/hvf-cpus.c
>         >> @@ -0,0 +1,468 @@
>         >> +/*
>         >> + * Copyright 2008 IBM Corporation
>         >> + *           2008 Red Hat, Inc.
>         >> + * Copyright 2011 Intel Corporation
>         >> + * Copyright 2016 Veertu, Inc.
>         >> + * Copyright 2017 The Android Open Source Project
>         >> + *
>         >> + * QEMU Hypervisor.framework support
>         >> + *
>         >> + * This program is free software; you can redistribute it
>         and/or
>         >> + * modify it under the terms of version 2 of the GNU
>         General Public
>         >> + * License as published by the Free Software Foundation.
>         >> + *
>         >> + * This program is distributed in the hope that it will be
>         useful,
>         >> + * but WITHOUT ANY WARRANTY; without even the implied
>         warranty of
>         >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
>         See the GNU
>         >> + * General Public License for more details.
>         >> + *
>         >> + * You should have received a copy of the GNU General
>         Public License
>         >> + * along with this program; if not, see
>         <http://www.gnu.org/licenses/>.
>         >> + *
>         >> + * This file contain code under public domain from the
>         hvdos project:
>         >> + * https://github.com/mist64/hvdos
>         >> + *
>         >> + * Parts Copyright (c) 2011 NetApp, Inc.
>         >> + * All rights reserved.
>         >> + *
>         >> + * Redistribution and use in source and binary forms, with
>         or without
>         >> + * modification, are permitted provided that the following
>         conditions
>         >> + * are met:
>         >> + * 1. Redistributions of source code must retain the above
>         copyright
>         >> + *    notice, this list of conditions and the following
>         disclaimer.
>         >> + * 2. Redistributions in binary form must reproduce the
>         above copyright
>         >> + *    notice, this list of conditions and the following
>         disclaimer in the
>         >> + *    documentation and/or other materials provided with
>         the distribution.
>         >> + *
>         >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>         >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>         LIMITED TO, THE
>         >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>         PARTICULAR PURPOSE
>         >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR
>         CONTRIBUTORS BE LIABLE
>         >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
>         EXEMPLARY, OR CONSEQUENTIAL
>         >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
>         SUBSTITUTE GOODS
>         >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>         INTERRUPTION)
>         >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
>         IN CONTRACT, STRICT
>         >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>         ARISING IN ANY WAY
>         >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>         POSSIBILITY OF
>         >> + * SUCH DAMAGE.
>         >> + */
>         >> +
>         >> +#include "qemu/osdep.h"
>         >> +#include "qemu/error-report.h"
>         >> +#include "qemu/main-loop.h"
>         >> +#include "exec/address-spaces.h"
>         >> +#include "exec/exec-all.h"
>         >> +#include "sysemu/cpus.h"
>         >> +#include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >> +#include "sysemu/runstate.h"
>         >> +#include "qemu/guest-random.h"
>         >> +
>         >> +#include <Hypervisor/Hypervisor.h>
>         >> +
>         >> +/* Memory slots */
>         >> +
>         >> +struct mac_slot {
>         >> +    int present;
>         >> +    uint64_t size;
>         >> +    uint64_t gpa_start;
>         >> +    uint64_t gva;
>         >> +};
>         >> +
>         >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>         >> +{
>         >> +    hvf_slot *slot;
>         >> +    int x;
>         >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> +        slot = &hvf_state->slots[x];
>         >> +        if (slot->size && start < (slot->start + slot->size) &&
>         >> +            (start + size) > slot->start) {
>         >> +            return slot;
>         >> +        }
>         >> +    }
>         >> +    return NULL;
>         >> +}
>         >> +
>         >> +struct mac_slot mac_slots[32];
>         >> +
>         >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>         >> +{
>         >> +    struct mac_slot *macslot;
>         >> +    hv_return_t ret;
>         >> +
>         >> +    macslot = &mac_slots[slot->slot_id];
>         >> +
>         >> +    if (macslot->present) {
>         >> +        if (macslot->size != slot->size) {
>         >> +            macslot->present = 0;
>         >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>         >> +            assert_hvf_ok(ret);
>         >> +        }
>         >> +    }
>         >> +
>         >> +    if (!slot->size) {
>         >> +        return 0;
>         >> +    }
>         >> +
>         >> +    macslot->present = 1;
>         >> +    macslot->gpa_start = slot->start;
>         >> +    macslot->size = slot->size;
>         >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>         >> +    assert_hvf_ok(ret);
>         >> +    return 0;
>         >> +}
>         >> +
>         >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>         >> +{
>         >> +    hvf_slot *mem;
>         >> +    MemoryRegion *area = section->mr;
>         >> +    bool writeable = !area->readonly && !area->rom_device;
>         >> +    hv_memory_flags_t flags;
>         >> +
>         >> +    if (!memory_region_is_ram(area)) {
>         >> +        if (writeable) {
>         >> +            return;
>         >> +        } else if (!memory_region_is_romd(area)) {
>         >> +            /*
>         >> +             * If the memory device is not in romd_mode, then we actually want
>         >> +             * to remove the hvf memory slot so all accesses will trap.
>         >> +             */
>         >> +             add = false;
>         >> +        }
>         >> +    }
>         >> +
>         >> +    mem = hvf_find_overlap_slot(
>         >> +            section->offset_within_address_space,
>         >> +            int128_get64(section->size));
>         >> +
>         >> +    if (mem && add) {
>         >> +        if (mem->size == int128_get64(section->size) &&
>         >> +            mem->start == section->offset_within_address_space &&
>         >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>         >> +            section->offset_within_region)) {
>         >> +            return; /* Same region was attempted to register, go away. */
>         >> +        }
>         >> +    }
>         >> +
>         >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>         >> +    if (mem) {
>         >> +        mem->size = 0;
>         >> +        if (do_hvf_set_memory(mem, 0)) {
>         >> +            error_report("Failed to reset overlapping slot");
>         >> +            abort();
>         >> +        }
>         >> +    }
>         >> +
>         >> +    if (!add) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    if (area->readonly ||
>         >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>         >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>         >> +    } else {
>         >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>         >> +    }
>         >> +
>         >> +    /* Now make a new slot. */
>         >> +    int x;
>         >> +
>         >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> +        mem = &hvf_state->slots[x];
>         >> +        if (!mem->size) {
>         >> +            break;
>         >> +        }
>         >> +    }
>         >> +
>         >> +    if (x == hvf_state->num_slots) {
>         >> +        error_report("No free slots");
>         >> +        abort();
>         >> +    }
>         >> +
>         >> +    mem->size = int128_get64(section->size);
>         >> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>         >> +    mem->start = section->offset_within_address_space;
>         >> +    mem->region = area;
>         >> +
>         >> +    if (do_hvf_set_memory(mem, flags)) {
>         >> +        error_report("Error registering new memory slot");
>         >> +        abort();
>         >> +    }
>         >> +}
>         >> +
>         >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>         >> +{
>         >> +    hvf_slot *slot;
>         >> +
>         >> +    slot = hvf_find_overlap_slot(
>         >> +            section->offset_within_address_space,
>         >> +            int128_get64(section->size));
>         >> +
>         >> +    /* protect region against writes; begin tracking it */
>         >> +    if (on) {
>         >> +        slot->flags |= HVF_SLOT_LOG;
>         >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>         >> +                      HV_MEMORY_READ);
>         >> +    /* stop tracking region*/
>         >> +    } else {
>         >> +        slot->flags &= ~HVF_SLOT_LOG;
>         >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>         >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>         >> +    }
>         >> +}
>         >> +
>         >> +static void hvf_log_start(MemoryListener *listener,
>         >> +                          MemoryRegionSection *section,
>         int old, int new)
>         >> +{
>         >> +    if (old != 0) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    hvf_set_dirty_tracking(section, 1);
>         >> +}
>         >> +
>         >> +static void hvf_log_stop(MemoryListener *listener,
>         >> +                         MemoryRegionSection *section, int
>         old, int new)
>         >> +{
>         >> +    if (new != 0) {
>         >> +        return;
>         >> +    }
>         >> +
>         >> +    hvf_set_dirty_tracking(section, 0);
>         >> +}
>         >> +
>         >> +static void hvf_log_sync(MemoryListener *listener,
>         >> +                         MemoryRegionSection *section)
>         >> +{
>         >> +    /*
>         >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>         >> +     * tracking the region.
>         >> +     */
>         >> +    hvf_set_dirty_tracking(section, 1);
>         >> +}
>         >> +
>         >> +static void hvf_region_add(MemoryListener *listener,
>         >> +                           MemoryRegionSection *section)
>         >> +{
>         >> +    hvf_set_phys_mem(section, true);
>         >> +}
>         >> +
>         >> +static void hvf_region_del(MemoryListener *listener,
>         >> +                           MemoryRegionSection *section)
>         >> +{
>         >> +    hvf_set_phys_mem(section, false);
>         >> +}
>         >> +
>         >> +static MemoryListener hvf_memory_listener = {
>         >> +    .priority = 10,
>         >> +    .region_add = hvf_region_add,
>         >> +    .region_del = hvf_region_del,
>         >> +    .log_start = hvf_log_start,
>         >> +    .log_stop = hvf_log_stop,
>         >> +    .log_sync = hvf_log_sync,
>         >> +};
>         >> +
>         >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>         >> +{
>         >> +    if (!cpu->vcpu_dirty) {
>         >> +        hvf_get_registers(cpu);
>         >> +        cpu->vcpu_dirty = true;
>         >> +    }
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>         >> +{
>         >> +    if (!cpu->vcpu_dirty) {
>         >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>         >> +    }
>         >> +}
>         >> +
>         >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>         >> +                                              run_on_cpu_data arg)
>         >> +{
>         >> +    hvf_put_registers(cpu);
>         >> +    cpu->vcpu_dirty = false;
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>         >> +{
>         >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>         >> +}
>         >> +
>         >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>         >> +                                             run_on_cpu_data arg)
>         >> +{
>         >> +    hvf_put_registers(cpu);
>         >> +    cpu->vcpu_dirty = false;
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>         >> +{
>         >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>         >> +}
>         >> +
>         >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>         >> +                                              run_on_cpu_data arg)
>         >> +{
>         >> +    cpu->vcpu_dirty = true;
>         >> +}
>         >> +
>         >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>         >> +{
>         >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>         >> +}
>         >> +
>         >> +static void hvf_vcpu_destroy(CPUState *cpu)
>         >> +{
>         >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>         >> +    assert_hvf_ok(ret);
>         >> +
>         >> +    hvf_arch_vcpu_destroy(cpu);
>         >> +}
>         >> +
>         >> +static void dummy_signal(int sig)
>         >> +{
>         >> +}
>         >> +
>         >> +static int hvf_init_vcpu(CPUState *cpu)
>         >> +{
>         >> +    int r;
>         >> +
>         >> +    /* init cpu signals */
>         >> +    sigset_t set;
>         >> +    struct sigaction sigact;
>         >> +
>         >> +    memset(&sigact, 0, sizeof(sigact));
>         >> +    sigact.sa_handler = dummy_signal;
>         >> +    sigaction(SIG_IPI, &sigact, NULL);
>         >> +
>         >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>         >> +    sigdelset(&set, SIG_IPI);
>         >> +
>         >> +#ifdef __aarch64__
>         >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>         >> +#else
>         >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>         >> +#endif
>         > I think the first __aarch64__ bit fits better in the ARM part
>         > of the series.
>
>
>         Oops. Thanks for catching it! Yes, absolutely. It should be
>         part of the ARM enablement.
>
>
>         >
>         >> +    cpu->vcpu_dirty = 1;
>         >> +    assert_hvf_ok(r);
>         >> +
>         >> +    return hvf_arch_init_vcpu(cpu);
>         >> +}
>         >> +
>         >> +/*
>         >> + * The HVF-specific vCPU thread function. This one should only run when the host
>         >> + * CPU supports the VMX "unrestricted guest" feature.
>         >> + */
>         >> +static void *hvf_cpu_thread_fn(void *arg)
>         >> +{
>         >> +    CPUState *cpu = arg;
>         >> +
>         >> +    int r;
>         >> +
>         >> +    assert(hvf_enabled());
>         >> +
>         >> +    rcu_register_thread();
>         >> +
>         >> +    qemu_mutex_lock_iothread();
>         >> +    qemu_thread_get_self(cpu->thread);
>         >> +
>         >> +    cpu->thread_id = qemu_get_thread_id();
>         >> +    cpu->can_do_io = 1;
>         >> +    current_cpu = cpu;
>         >> +
>         >> +    hvf_init_vcpu(cpu);
>         >> +
>         >> +    /* signal CPU creation */
>         >> +    cpu_thread_signal_created(cpu);
>         >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>         >> +
>         >> +    do {
>         >> +        if (cpu_can_run(cpu)) {
>         >> +            r = hvf_vcpu_exec(cpu);
>         >> +            if (r == EXCP_DEBUG) {
>         >> +                cpu_handle_guest_debug(cpu);
>         >> +            }
>         >> +        }
>         >> +        qemu_wait_io_event(cpu);
>         >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>         >> +
>         >> +    hvf_vcpu_destroy(cpu);
>         >> +    cpu_thread_signal_destroyed(cpu);
>         >> +    qemu_mutex_unlock_iothread();
>         >> +    rcu_unregister_thread();
>         >> +    return NULL;
>         >> +}
>         >> +
>         >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>         >> +{
>         >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>         >> +
>         >> +    /*
>         >> +     * HVF currently does not support TCG, and only runs in
>         >> +     * unrestricted-guest mode.
>         >> +     */
>         >> +    assert(hvf_enabled());
>         >> +
>         >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>         >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>         >> +    qemu_cond_init(cpu->halt_cond);
>         >> +
>         >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>         >> +             cpu->cpu_index);
>         >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>         >> +                       cpu, QEMU_THREAD_JOINABLE);
>         >> +}
>         >> +
>         >> +static const CpusAccel hvf_cpus = {
>         >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>         >> +
>         >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>         >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>         >> +    .synchronize_state = hvf_cpu_synchronize_state,
>         >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>         >> +};
>         >> +
>         >> +static int hvf_accel_init(MachineState *ms)
>         >> +{
>         >> +    int x;
>         >> +    hv_return_t ret;
>         >> +    HVFState *s;
>         >> +
>         >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>         >> +    assert_hvf_ok(ret);
>         >> +
>         >> +    s = g_new0(HVFState, 1);
>         >> +
>         >> +    s->num_slots = 32;
>         >> +    for (x = 0; x < s->num_slots; ++x) {
>         >> +        s->slots[x].size = 0;
>         >> +        s->slots[x].slot_id = x;
>         >> +    }
>         >> +
>         >> +    hvf_state = s;
>         >> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>         >> +    cpus_register_accel(&hvf_cpus);
>         >> +    return 0;
>         >> +}
>         >> +
>         >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>         >> +{
>         >> +    AccelClass *ac = ACCEL_CLASS(oc);
>         >> +    ac->name = "HVF";
>         >> +    ac->init_machine = hvf_accel_init;
>         >> +    ac->allowed = &hvf_allowed;
>         >> +}
>         >> +
>         >> +static const TypeInfo hvf_accel_type = {
>         >> +    .name = TYPE_HVF_ACCEL,
>         >> +    .parent = TYPE_ACCEL,
>         >> +    .class_init = hvf_accel_class_init,
>         >> +};
>         >> +
>         >> +static void hvf_type_init(void)
>         >> +{
>         >> +    type_register_static(&hvf_accel_type);
>         >> +}
>         >> +
>         >> +type_init(hvf_type_init);
>         >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>         >> new file mode 100644
>         >> index 0000000000..dfd6b68dc7
>         >> --- /dev/null
>         >> +++ b/accel/hvf/meson.build
>         >> @@ -0,0 +1,7 @@
>         >> +hvf_ss = ss.source_set()
>         >> +hvf_ss.add(files(
>         >> +  'hvf-all.c',
>         >> +  'hvf-cpus.c',
>         >> +))
>         >> +
>         >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>         >> diff --git a/accel/meson.build b/accel/meson.build
>         >> index b26cca227a..6de12ce5d5 100644
>         >> --- a/accel/meson.build
>         >> +++ b/accel/meson.build
>         >> @@ -1,5 +1,6 @@
>         >>   softmmu_ss.add(files('accel.c'))
>         >>
>         >> +subdir('hvf')
>         >>   subdir('qtest')
>         >>   subdir('kvm')
>         >>   subdir('tcg')
>         >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>         >> new file mode 100644
>         >> index 0000000000..de9bad23a8
>         >> --- /dev/null
>         >> +++ b/include/sysemu/hvf_int.h
>         >> @@ -0,0 +1,69 @@
>         >> +/*
>         >> + * QEMU Hypervisor.framework (HVF) support
>         >> + *
>         >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>         >> + * See the COPYING file in the top-level directory.
>         >> + *
>         >> + */
>         >> +
>         >> +/* header to be included in HVF-specific code */
>         >> +
>         >> +#ifndef HVF_INT_H
>         >> +#define HVF_INT_H
>         >> +
>         >> +#include <Hypervisor/Hypervisor.h>
>         >> +
>         >> +#define HVF_MAX_VCPU 0x10
>         >> +
>         >> +extern struct hvf_state hvf_global;
>         >> +
>         >> +struct hvf_vm {
>         >> +    int id;
>         >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>         >> +};
>         >> +
>         >> +struct hvf_state {
>         >> +    uint32_t version;
>         >> +    struct hvf_vm *vm;
>         >> +    uint64_t mem_quota;
>         >> +};
>         >> +
>         >> +/* hvf_slot flags */
>         >> +#define HVF_SLOT_LOG (1 << 0)
>         >> +
>         >> +typedef struct hvf_slot {
>         >> +    uint64_t start;
>         >> +    uint64_t size;
>         >> +    uint8_t *mem;
>         >> +    int slot_id;
>         >> +    uint32_t flags;
>         >> +    MemoryRegion *region;
>         >> +} hvf_slot;
>         >> +
>         >> +typedef struct hvf_vcpu_caps {
>         >> +    uint64_t vmx_cap_pinbased;
>         >> +    uint64_t vmx_cap_procbased;
>         >> +    uint64_t vmx_cap_procbased2;
>         >> +    uint64_t vmx_cap_entry;
>         >> +    uint64_t vmx_cap_exit;
>         >> +    uint64_t vmx_cap_preemption_timer;
>         >> +} hvf_vcpu_caps;
>         >> +
>         >> +struct HVFState {
>         >> +    AccelState parent;
>         >> +    hvf_slot slots[32];
>         >> +    int num_slots;
>         >> +
>         >> +    hvf_vcpu_caps *hvf_caps;
>         >> +};
>         >> +extern HVFState *hvf_state;
>         >> +
>         >> +void assert_hvf_ok(hv_return_t ret);
>         >> +int hvf_get_registers(CPUState *cpu);
>         >> +int hvf_put_registers(CPUState *cpu);
>         >> +int hvf_arch_init_vcpu(CPUState *cpu);
>         >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>         >> +int hvf_vcpu_exec(CPUState *cpu);
>         >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>         >> +
>         >> +#endif
>         >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>         >> deleted file mode 100644
>         >> index 817b3d7452..0000000000
>         >> --- a/target/i386/hvf/hvf-cpus.c
>         >> +++ /dev/null
>         >> @@ -1,131 +0,0 @@
>         >> -/*
>         >> - * Copyright 2008 IBM Corporation
>         >> - *           2008 Red Hat, Inc.
>         >> - * Copyright 2011 Intel Corporation
>         >> - * Copyright 2016 Veertu, Inc.
>         >> - * Copyright 2017 The Android Open Source Project
>         >> - *
>         >> - * QEMU Hypervisor.framework support
>         >> - *
>         >> - * This program is free software; you can redistribute it
>         and/or
>         >> - * modify it under the terms of version 2 of the GNU
>         General Public
>         >> - * License as published by the Free Software Foundation.
>         >> - *
>         >> - * This program is distributed in the hope that it will be
>         useful,
>         >> - * but WITHOUT ANY WARRANTY; without even the implied
>         warranty of
>         >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
>         See the GNU
>         >> - * General Public License for more details.
>         >> - *
>         >> - * You should have received a copy of the GNU General
>         Public License
>         >> - * along with this program; if not, see
>         <http://www.gnu.org/licenses/>.
>         >> - *
>         >> - * This file contain code under public domain from the
>         hvdos project:
>         >> - * https://github.com/mist64/hvdos
>         >> - *
>         >> - * Parts Copyright (c) 2011 NetApp, Inc.
>         >> - * All rights reserved.
>         >> - *
>         >> - * Redistribution and use in source and binary forms, with
>         or without
>         >> - * modification, are permitted provided that the following
>         conditions
>         >> - * are met:
>         >> - * 1. Redistributions of source code must retain the above
>         copyright
>         >> - *    notice, this list of conditions and the following
>         disclaimer.
>         >> - * 2. Redistributions in binary form must reproduce the
>         above copyright
>         >> - *    notice, this list of conditions and the following
>         disclaimer in the
>         >> - *    documentation and/or other materials provided with
>         the distribution.
>         >> - *
>         >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>         >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
>         LIMITED TO, THE
>         >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>         PARTICULAR PURPOSE
>         >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR
>         CONTRIBUTORS BE LIABLE
>         >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
>         EXEMPLARY, OR CONSEQUENTIAL
>         >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
>         SUBSTITUTE GOODS
>         >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>         INTERRUPTION)
>         >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
>         IN CONTRACT, STRICT
>         >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
>         ARISING IN ANY WAY
>         >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>         POSSIBILITY OF
>         >> - * SUCH DAMAGE.
>         >> - */
>         >> -
>         >> -#include "qemu/osdep.h"
>         >> -#include "qemu/error-report.h"
>         >> -#include "qemu/main-loop.h"
>         >> -#include "sysemu/hvf.h"
>         >> -#include "sysemu/runstate.h"
>         >> -#include "target/i386/cpu.h"
>         >> -#include "qemu/guest-random.h"
>         >> -
>         >> -#include "hvf-cpus.h"
>         >> -
>         >> -/*
>         >> - * The HVF-specific vCPU thread function. This one should
>         only run when the host
>         >> - * CPU supports the VMX "unrestricted guest" feature.
>         >> - */
>         >> -static void *hvf_cpu_thread_fn(void *arg)
>         >> -{
>         >> -    CPUState *cpu = arg;
>         >> -
>         >> -    int r;
>         >> -
>         >> -    assert(hvf_enabled());
>         >> -
>         >> -    rcu_register_thread();
>         >> -
>         >> -    qemu_mutex_lock_iothread();
>         >> -    qemu_thread_get_self(cpu->thread);
>         >> -
>         >> -    cpu->thread_id = qemu_get_thread_id();
>         >> -    cpu->can_do_io = 1;
>         >> -    current_cpu = cpu;
>         >> -
>         >> -    hvf_init_vcpu(cpu);
>         >> -
>         >> -    /* signal CPU creation */
>         >> -    cpu_thread_signal_created(cpu);
>         >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>         >> -
>         >> -    do {
>         >> -        if (cpu_can_run(cpu)) {
>         >> -            r = hvf_vcpu_exec(cpu);
>         >> -            if (r == EXCP_DEBUG) {
>         >> -                cpu_handle_guest_debug(cpu);
>         >> -            }
>         >> -        }
>         >> -        qemu_wait_io_event(cpu);
>         >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>         >> -
>         >> -    hvf_vcpu_destroy(cpu);
>         >> -    cpu_thread_signal_destroyed(cpu);
>         >> -    qemu_mutex_unlock_iothread();
>         >> -    rcu_unregister_thread();
>         >> -    return NULL;
>         >> -}
>         >> -
>         >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>         >> -{
>         >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>         >> -
>         >> -    /*
>         >> -     * HVF currently does not support TCG, and only runs in
>         >> -     * unrestricted-guest mode.
>         >> -     */
>         >> -    assert(hvf_enabled());
>         >> -
>         >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>         >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>         >> -    qemu_cond_init(cpu->halt_cond);
>         >> -
>         >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>         >> -             cpu->cpu_index);
>         >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>         >> -                       cpu, QEMU_THREAD_JOINABLE);
>         >> -}
>         >> -
>         >> -const CpusAccel hvf_cpus = {
>         >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>         >> -
>         >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>         >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>         >> -    .synchronize_state = hvf_cpu_synchronize_state,
>         >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>         >> -};
>         >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>         >> deleted file mode 100644
>         >> index ced31b82c0..0000000000
>         >> --- a/target/i386/hvf/hvf-cpus.h
>         >> +++ /dev/null
>         >> @@ -1,25 +0,0 @@
>         >> -/*
>         >> - * Accelerator CPUS Interface
>         >> - *
>         >> - * Copyright 2020 SUSE LLC
>         >> - *
>         >> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>         >> - * See the COPYING file in the top-level directory.
>         >> - */
>         >> -
>         >> -#ifndef HVF_CPUS_H
>         >> -#define HVF_CPUS_H
>         >> -
>         >> -#include "sysemu/cpus.h"
>         >> -
>         >> -extern const CpusAccel hvf_cpus;
>         >> -
>         >> -int hvf_init_vcpu(CPUState *);
>         >> -int hvf_vcpu_exec(CPUState *);
>         >> -void hvf_cpu_synchronize_state(CPUState *);
>         >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>         >> -void hvf_cpu_synchronize_post_init(CPUState *);
>         >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>         >> -void hvf_vcpu_destroy(CPUState *);
>         >> -
>         >> -#endif /* HVF_CPUS_H */
>         >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>         >> index e0edffd077..6d56f8f6bb 100644
>         >> --- a/target/i386/hvf/hvf-i386.h
>         >> +++ b/target/i386/hvf/hvf-i386.h
>         >> @@ -18,57 +18,11 @@
>         >>
>         >>   #include "sysemu/accel.h"
>         >>   #include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >>   #include "cpu.h"
>         >>   #include "x86.h"
>         >>
>         >> -#define HVF_MAX_VCPU 0x10
>         >> -
>         >> -extern struct hvf_state hvf_global;
>         >> -
>         >> -struct hvf_vm {
>         >> -    int id;
>         >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>         >> -};
>         >> -
>         >> -struct hvf_state {
>         >> -    uint32_t version;
>         >> -    struct hvf_vm *vm;
>         >> -    uint64_t mem_quota;
>         >> -};
>         >> -
>         >> -/* hvf_slot flags */
>         >> -#define HVF_SLOT_LOG (1 << 0)
>         >> -
>         >> -typedef struct hvf_slot {
>         >> -    uint64_t start;
>         >> -    uint64_t size;
>         >> -    uint8_t *mem;
>         >> -    int slot_id;
>         >> -    uint32_t flags;
>         >> -    MemoryRegion *region;
>         >> -} hvf_slot;
>         >> -
>         >> -typedef struct hvf_vcpu_caps {
>         >> -    uint64_t vmx_cap_pinbased;
>         >> -    uint64_t vmx_cap_procbased;
>         >> -    uint64_t vmx_cap_procbased2;
>         >> -    uint64_t vmx_cap_entry;
>         >> -    uint64_t vmx_cap_exit;
>         >> -    uint64_t vmx_cap_preemption_timer;
>         >> -} hvf_vcpu_caps;
>         >> -
>         >> -struct HVFState {
>         >> -    AccelState parent;
>         >> -    hvf_slot slots[32];
>         >> -    int num_slots;
>         >> -
>         >> -    hvf_vcpu_caps *hvf_caps;
>         >> -};
>         >> -extern HVFState *hvf_state;
>         >> -
>         >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>         >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>         >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>         >>
>         >>   #ifdef NEED_CPU_H
>         >>   /* Functions exported to host specific mode */
>         >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>         >> index ed9356565c..8b96ecd619 100644
>         >> --- a/target/i386/hvf/hvf.c
>         >> +++ b/target/i386/hvf/hvf.c
>         >> @@ -51,6 +51,7 @@
>         >>   #include "qemu/error-report.h"
>         >>
>         >>   #include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >>   #include "sysemu/runstate.h"
>         >>   #include "hvf-i386.h"
>         >>   #include "vmcs.h"
>         >> @@ -72,171 +73,6 @@
>         >>   #include "sysemu/accel.h"
>         >>   #include "target/i386/cpu.h"
>         >>
>         >> -#include "hvf-cpus.h"
>         >> -
>         >> -HVFState *hvf_state;
>         >> -
>         >> -static void assert_hvf_ok(hv_return_t ret)
>         >> -{
>         >> -    if (ret == HV_SUCCESS) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    switch (ret) {
>         >> -    case HV_ERROR:
>         >> -        error_report("Error: HV_ERROR");
>         >> -        break;
>         >> -    case HV_BUSY:
>         >> -        error_report("Error: HV_BUSY");
>         >> -        break;
>         >> -    case HV_BAD_ARGUMENT:
>         >> -        error_report("Error: HV_BAD_ARGUMENT");
>         >> -        break;
>         >> -    case HV_NO_RESOURCES:
>         >> -        error_report("Error: HV_NO_RESOURCES");
>         >> -        break;
>         >> -    case HV_NO_DEVICE:
>         >> -        error_report("Error: HV_NO_DEVICE");
>         >> -        break;
>         >> -    case HV_UNSUPPORTED:
>         >> -        error_report("Error: HV_UNSUPPORTED");
>         >> -        break;
>         >> -    default:
>         >> -        error_report("Unknown Error");
>         >> -    }
>         >> -
>         >> -    abort();
>         >> -}
>         >> -
>         >> -/* Memory slots */
>         >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>         >> -{
>         >> -    hvf_slot *slot;
>         >> -    int x;
>         >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> -        slot = &hvf_state->slots[x];
>         >> -        if (slot->size && start < (slot->start + slot->size) &&
>         >> -            (start + size) > slot->start) {
>         >> -            return slot;
>         >> -        }
>         >> -    }
>         >> -    return NULL;
>         >> -}
>         >> -
>         >> -struct mac_slot {
>         >> -    int present;
>         >> -    uint64_t size;
>         >> -    uint64_t gpa_start;
>         >> -    uint64_t gva;
>         >> -};
>         >> -
>         >> -struct mac_slot mac_slots[32];
>         >> -
>         >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>         >> -{
>         >> -    struct mac_slot *macslot;
>         >> -    hv_return_t ret;
>         >> -
>         >> -    macslot = &mac_slots[slot->slot_id];
>         >> -
>         >> -    if (macslot->present) {
>         >> -        if (macslot->size != slot->size) {
>         >> -            macslot->present = 0;
>         >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>         >> -            assert_hvf_ok(ret);
>         >> -        }
>         >> -    }
>         >> -
>         >> -    if (!slot->size) {
>         >> -        return 0;
>         >> -    }
>         >> -
>         >> -    macslot->present = 1;
>         >> -    macslot->gpa_start = slot->start;
>         >> -    macslot->size = slot->size;
>         >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>         >> -    assert_hvf_ok(ret);
>         >> -    return 0;
>         >> -}
>         >> -
>         >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>         >> -{
>         >> -    hvf_slot *mem;
>         >> -    MemoryRegion *area = section->mr;
>         >> -    bool writeable = !area->readonly && !area->rom_device;
>         >> -    hv_memory_flags_t flags;
>         >> -
>         >> -    if (!memory_region_is_ram(area)) {
>         >> -        if (writeable) {
>         >> -            return;
>         >> -        } else if (!memory_region_is_romd(area)) {
>         >> -            /*
>         >> -             * If the memory device is not in romd_mode, then we actually want
>         >> -             * to remove the hvf memory slot so all accesses will trap.
>         >> -             */
>         >> -             add = false;
>         >> -        }
>         >> -    }
>         >> -
>         >> -    mem = hvf_find_overlap_slot(
>         >> -            section->offset_within_address_space,
>         >> -            int128_get64(section->size));
>         >> -
>         >> -    if (mem && add) {
>         >> -        if (mem->size == int128_get64(section->size) &&
>         >> -            mem->start == section->offset_within_address_space &&
>         >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>         >> -            section->offset_within_region)) {
>         >> -            return; /* Same region was attempted to register, go away. */
>         >> -        }
>         >> -    }
>         >> -
>         >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>         >> -    if (mem) {
>         >> -        mem->size = 0;
>         >> -        if (do_hvf_set_memory(mem, 0)) {
>         >> -            error_report("Failed to reset overlapping slot");
>         >> -            abort();
>         >> -        }
>         >> -    }
>         >> -
>         >> -    if (!add) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    if (area->readonly ||
>         >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>         >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>         >> -    } else {
>         >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>         >> -    }
>         >> -
>         >> -    /* Now make a new slot. */
>         >> -    int x;
>         >> -
>         >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>         >> -        mem = &hvf_state->slots[x];
>         >> -        if (!mem->size) {
>         >> -            break;
>         >> -        }
>         >> -    }
>         >> -
>         >> -    if (x == hvf_state->num_slots) {
>         >> -        error_report("No free slots");
>         >> -        abort();
>         >> -    }
>         >> -
>         >> -    mem->size = int128_get64(section->size);
>         >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>         >> -    mem->start = section->offset_within_address_space;
>         >> -    mem->region = area;
>         >> -
>         >> -    if (do_hvf_set_memory(mem, flags)) {
>         >> -        error_report("Error registering new memory slot");
>         >> -        abort();
>         >> -    }
>         >> -}
>         >> -
>         >>   void vmx_update_tpr(CPUState *cpu)
>         >>   {
>         >>       /* TODO: need integrate APIC handling */
>         >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>         >>       }
>         >>   }
>         >>
>         >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>         >> -{
>         >> -    if (!cpu->vcpu_dirty) {
>         >> -        hvf_get_registers(cpu);
>         >> -        cpu->vcpu_dirty = true;
>         >> -    }
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>         >> -{
>         >> -    if (!cpu->vcpu_dirty) {
>         >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>         >> -    }
>         >> -}
>         >> -
>         >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>         >> -                                              run_on_cpu_data arg)
>         >> -{
>         >> -    hvf_put_registers(cpu);
>         >> -    cpu->vcpu_dirty = false;
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>         >> -{
>         >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>         >> -}
>         >> -
>         >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>         >> -                                             run_on_cpu_data arg)
>         >> -{
>         >> -    hvf_put_registers(cpu);
>         >> -    cpu->vcpu_dirty = false;
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>         >> -{
>         >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>         >> -}
>         >> -
>         >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>         >> -                                              run_on_cpu_data arg)
>         >> -{
>         >> -    cpu->vcpu_dirty = true;
>         >> -}
>         >> -
>         >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>         >> -{
>         >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>         >> -}
>         >> -
>         >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>         >>   {
>         >>       int read, write;
>         >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>         >>       return false;
>         >>   }
>         >>
>         >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>         >> -{
>         >> -    hvf_slot *slot;
>         >> -
>         >> -    slot = hvf_find_overlap_slot(
>         >> -            section->offset_within_address_space,
>         >> -            int128_get64(section->size));
>         >> -
>         >> -    /* protect region against writes; begin tracking it */
>         >> -    if (on) {
>         >> -        slot->flags |= HVF_SLOT_LOG;
>         >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>         >> -                      HV_MEMORY_READ);
>         >> -    /* stop tracking region*/
>         >> -    } else {
>         >> -        slot->flags &= ~HVF_SLOT_LOG;
>         >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>         >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>         >> -    }
>         >> -}
>         >> -
>         >> -static void hvf_log_start(MemoryListener *listener,
>         >> -                          MemoryRegionSection *section, int old, int new)
>         >> -{
>         >> -    if (old != 0) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    hvf_set_dirty_tracking(section, 1);
>         >> -}
>         >> -
>         >> -static void hvf_log_stop(MemoryListener *listener,
>         >> -                         MemoryRegionSection *section, int old, int new)
>         >> -{
>         >> -    if (new != 0) {
>         >> -        return;
>         >> -    }
>         >> -
>         >> -    hvf_set_dirty_tracking(section, 0);
>         >> -}
>         >> -
>         >> -static void hvf_log_sync(MemoryListener *listener,
>         >> -                         MemoryRegionSection *section)
>         >> -{
>         >> -    /*
>         >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>         >> -     * tracking the region.
>         >> -     */
>         >> -    hvf_set_dirty_tracking(section, 1);
>         >> -}
>         >> -
>         >> -static void hvf_region_add(MemoryListener *listener,
>         >> -                           MemoryRegionSection *section)
>         >> -{
>         >> -    hvf_set_phys_mem(section, true);
>         >> -}
>         >> -
>         >> -static void hvf_region_del(MemoryListener *listener,
>         >> -                           MemoryRegionSection *section)
>         >> -{
>         >> -    hvf_set_phys_mem(section, false);
>         >> -}
>         >> -
>         >> -static MemoryListener hvf_memory_listener = {
>         >> -    .priority = 10,
>         >> -    .region_add = hvf_region_add,
>         >> -    .region_del = hvf_region_del,
>         >> -    .log_start = hvf_log_start,
>         >> -    .log_stop = hvf_log_stop,
>         >> -    .log_sync = hvf_log_sync,
>         >> -};
>         >> -
>         >> -void hvf_vcpu_destroy(CPUState *cpu)
>         >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>         >>   {
>         >>       X86CPU *x86_cpu = X86_CPU(cpu);
>         >>       CPUX86State *env = &x86_cpu->env;
>         >>
>         >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>         >>       g_free(env->hvf_mmio_buf);
>         >> -    assert_hvf_ok(ret);
>         >> -}
>         >> -
>         >> -static void dummy_signal(int sig)
>         >> -{
>         >>   }
>         >>
>         >> -int hvf_init_vcpu(CPUState *cpu)
>         >> +int hvf_arch_init_vcpu(CPUState *cpu)
>         >>   {
>         >>
>         >>       X86CPU *x86cpu = X86_CPU(cpu);
>         >>       CPUX86State *env = &x86cpu->env;
>         >> -    int r;
>         >> -
>         >> -    /* init cpu signals */
>         >> -    sigset_t set;
>         >> -    struct sigaction sigact;
>         >> -
>         >> -    memset(&sigact, 0, sizeof(sigact));
>         >> -    sigact.sa_handler = dummy_signal;
>         >> -    sigaction(SIG_IPI, &sigact, NULL);
>         >> -
>         >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>         >> -    sigdelset(&set, SIG_IPI);
>         >>
>         >>       init_emu();
>         >>       init_decoder();
>         >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>         >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>         >>       env->hvf_mmio_buf = g_new(char, 4096);
>         >>
>         >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>         >> -    cpu->vcpu_dirty = 1;
>         >> -    assert_hvf_ok(r);
>         >> -
>         >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>         >>                                  &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>         >>           abort();
>         >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>         >>
>         >>       return ret;
>         >>   }
>         >> -
>         >> -bool hvf_allowed;
>         >> -
>         >> -static int hvf_accel_init(MachineState *ms)
>         >> -{
>         >> -    int x;
>         >> -    hv_return_t ret;
>         >> -    HVFState *s;
>         >> -
>         >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>         >> -    assert_hvf_ok(ret);
>         >> -
>         >> -    s = g_new0(HVFState, 1);
>         >> -
>         >> -    s->num_slots = 32;
>         >> -    for (x = 0; x < s->num_slots; ++x) {
>         >> -        s->slots[x].size = 0;
>         >> -        s->slots[x].slot_id = x;
>         >> -    }
>         >> -
>         >> -    hvf_state = s;
>         >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>         >> -    cpus_register_accel(&hvf_cpus);
>         >> -    return 0;
>         >> -}
>         >> -
>         >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>         >> -{
>         >> -    AccelClass *ac = ACCEL_CLASS(oc);
>         >> -    ac->name = "HVF";
>         >> -    ac->init_machine = hvf_accel_init;
>         >> -    ac->allowed = &hvf_allowed;
>         >> -}
>         >> -
>         >> -static const TypeInfo hvf_accel_type = {
>         >> -    .name = TYPE_HVF_ACCEL,
>         >> -    .parent = TYPE_ACCEL,
>         >> -    .class_init = hvf_accel_class_init,
>         >> -};
>         >> -
>         >> -static void hvf_type_init(void)
>         >> -{
>         >> -    type_register_static(&hvf_accel_type);
>         >> -}
>         >> -
>         >> -type_init(hvf_type_init);
>         >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>         >> index 409c9a3f14..c8a43717ee 100644
>         >> --- a/target/i386/hvf/meson.build
>         >> +++ b/target/i386/hvf/meson.build
>         >> @@ -1,6 +1,5 @@
>         >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>         >>     'hvf.c',
>         >> -  'hvf-cpus.c',
>         >>     'x86.c',
>         >>     'x86_cpuid.c',
>         >>     'x86_decode.c',
>         >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>         >> index bbec412b6c..89b8e9d87a 100644
>         >> --- a/target/i386/hvf/x86hvf.c
>         >> +++ b/target/i386/hvf/x86hvf.c
>         >> @@ -20,6 +20,9 @@
>         >>   #include "qemu/osdep.h"
>         >>
>         >>   #include "qemu-common.h"
>         >> +#include "sysemu/hvf.h"
>         >> +#include "sysemu/hvf_int.h"
>         >> +#include "sysemu/hw_accel.h"
>         >>   #include "x86hvf.h"
>         >>   #include "vmx.h"
>         >>   #include "vmcs.h"
>         >> @@ -32,8 +35,6 @@
>         >>   #include <Hypervisor/hv.h>
>         >>   #include <Hypervisor/hv_vmx.h>
>         >>
>         >> -#include "hvf-cpus.h"
>         >> -
>         >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>         >>                        SegmentCache *qseg, bool is_tr)
>         >>   {
>         >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>         >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>         >>
>         >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>         >> -        hvf_cpu_synchronize_state(cpu_state);
>         >> +        cpu_synchronize_state(cpu_state);
>         >>           do_cpu_init(cpu);
>         >>       }
>         >>
>         >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>         >>           cpu_state->halted = 0;
>         >>       }
>         >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>         >> -        hvf_cpu_synchronize_state(cpu_state);
>         >> +        cpu_synchronize_state(cpu_state);
>         >>           do_cpu_sipi(cpu);
>         >>       }
>         >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>         >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>         >> -        hvf_cpu_synchronize_state(cpu_state);
>         >> +        cpu_synchronize_state(cpu_state);
>         > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>         > be a separate patch. It follows the cpu/accel cleanups Claudio was
>         > doing this summer.
>
>
>         The only reason they're in here is because we no longer have access to
>         the hvf_ functions from the file. I am perfectly happy to rebase the
>         patch on top of Claudio's if his goes in first. I'm sure it'll be
>         trivial for him to rebase on top of this too if my series goes in first.
>
>
>         >
>         > Philippe raised the idea that the patch might go ahead of the
>         > ARM-specific part (which might involve some discussions) and I agree
>         > with that.
>         >
>         > Some sync between Claudio's series (CC'd him) and the patch might be
>         > needed.
>
>
>         I would prefer not to hold back because of the sync. Claudio's cleanup
>         is trivial enough to adjust for if it gets merged ahead of this.
>
>
>         Alex
>
>
>

[-- Attachment #2: Type: text/html, Size: 96106 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 20:33           ` Alexander Graf
@ 2020-11-30 20:55             ` Frank Yang
  2020-11-30 21:08               ` Peter Collingbourne
                                 ` (2 more replies)
  0 siblings, 3 replies; 64+ messages in thread
From: Frank Yang @ 2020-11-30 20:55 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini, Peter Collingbourne

[-- Attachment #1: Type: text/plain, Size: 54697 bytes --]

On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:

> Hi Frank,
>
> Thanks for the update :). Your previous email nudged me in the right
> direction. I previously had implemented WFI through the internal timer
> framework, which performed way worse.
>
Cool, glad it's helping. Also, Peter found out that the main thing keeping
us from just reading cntpct_el0 on the host directly and comparing it with
cval is that if we sleep, cval is going to be much less than cntpct_el0 by
the sleep time. If we can get either the architecture or macOS to report
the sleep time, then we might be able to avoid the poll interval entirely!

> Along the way, I stumbled over a few issues though. For starters, the
> signal mask for SIG_IPI was not set correctly, so while pselect() would
> exit, the signal would never get delivered to the thread! For a fix, check
> out
>
>
> https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>
>
Thanks, we'll take a look :)


> Please also have a look at my latest stab at WFI emulation. It doesn't
> handle WFE (that's only relevant in overcommitted scenarios). But it does
> handle WFI and even does something similar to hlt polling, albeit not with
> an adaptive threshold.
>
> Also, is there a particular reason you're working on this super
> interesting and useful code in a random downstream fork of QEMU? Wouldn't
> it be more helpful to contribute to the upstream code base instead?
>
We'd actually like to contribute upstream too :) We do want to maintain our
own downstream though; the Android Emulator codebase needs to work solidly
on macOS and Windows, which has made keeping up with upstream difficult and
staying on a previous version (2.12) with known quirks easier. (There's also
some Android-related customization: the Qt UI, a different set of virtual
devices, and snapshot support (incl. snapshots of graphics devices with
OpenGLES state tracking), which we hope to separate into other
libraries/processes, but it's not insignificant.)

>
> Alex
>
> On 30.11.20 21:15, Frank Yang wrote:
>
> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But
> the high CPU usage seems to be mitigated by having a poll interval (like
> KVM does) in handling WFI:
>
> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>
> This is loosely inspired by
> https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766
> which does seem to specify a poll interval.
>
> It would be cool if we could have a lightweight way to enter sleep and
> restart the vcpus precisely when CVAL passes, though.
>
> Frank
>
>
> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>
>> Hi all,
>>
>> +Peter Collingbourne <pcc@google.com>
>>
>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>
>> Peter and I have been working on an HVF Apple Silicon backend with an eye
>> toward Android guests.
>>
>> We have gotten things to basically switch to Android userspace already
>> (logcat/shell and graphics available at least)
>>
>> Our strategy so far has been to import logic from the KVM implementation
>> and hook into QEMU's software devices that previously assumed to only work
>> with TCG, or have KVM-specific paths.
>>
>> Thanks to Alexander for the tip on the 36-bit address space limitation,
>> btw; our way of addressing this is to still allow highmem but not to
>> place the PCI high MMIO region so high.
>>
>> Also, note we have a sleep/signal based mechanism to deal with WFx, which
>> might be worth looking into in Alexander's implementation as well:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>
>> Patches so far, FYI:
>>
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>
>>
>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>>
>> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>>
>> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>>
>> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>>
>> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>>
>> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>>
>> Peter has also noticed that there are extra steps needed on M1s to allow
>> TCG to work, as it involves JIT:
>>
>>
>> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>>
>> We'd appreciate any feedback/comments :)
>>
>> Best,
>>
>> Frank
>>
>> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>>>
>>> On 27.11.20 21:00, Roman Bolshakov wrote:
>>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>>> >> Until now, Hypervisor.framework has only been available on x86_64
>>> systems.
>>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>>> >> prepare for support for multiple architectures, let's move common
>>> code out
>>> >> into its own accel directory.
>>> >>
>>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>> >> ---
>>> >>   MAINTAINERS                 |   9 +-
>>> >>   accel/hvf/hvf-all.c         |  56 +++++
>>> >>   accel/hvf/hvf-cpus.c        | 468
>>> ++++++++++++++++++++++++++++++++++++
>>> >>   accel/hvf/meson.build       |   7 +
>>> >>   accel/meson.build           |   1 +
>>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>> >>   target/i386/hvf/meson.build |   1 -
>>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>>> >>   target/i386/hvf/x86hvf.h    |   2 -
>>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>>> >>   create mode 100644 accel/hvf/hvf-all.c
>>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>>> >>   create mode 100644 accel/hvf/meson.build
>>> >>   create mode 100644 include/sysemu/hvf_int.h
>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>> >>
>>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>>> >> index 68bc160f41..ca4b6d9279 100644
>>> >> --- a/MAINTAINERS
>>> >> +++ b/MAINTAINERS
>>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>> >>   W: https://wiki.qemu.org/Features/HVF
>>> >>   S: Maintained
>>> >> -F: accel/stubs/hvf-stub.c
>>> > There was a patch for that in the RFC series from Claudio.
>>>
>>>
>>> Yeah, I'm not worried about this hunk :).
>>>
>>>
>>> >
>>> >>   F: target/i386/hvf/
>>> >> +
>>> >> +HVF
>>> >> +M: Cameron Esfahani <dirty@apple.com>
>>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>>> >> +W: https://wiki.qemu.org/Features/HVF
>>> >> +S: Maintained
>>> >> +F: accel/hvf/
>>> >>   F: include/sysemu/hvf.h
>>> >> +F: include/sysemu/hvf_int.h
>>> >>
>>> >>   WHPX CPUs
>>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>>> >> new file mode 100644
>>> >> index 0000000000..47d77a472a
>>> >> --- /dev/null
>>> >> +++ b/accel/hvf/hvf-all.c
>>> >> @@ -0,0 +1,56 @@
>>> >> +/*
>>> >> + * QEMU Hypervisor.framework support
>>> >> + *
>>> >> + * This work is licensed under the terms of the GNU GPL, version 2.
>>> See
>>> >> + * the COPYING file in the top-level directory.
>>> >> + *
>>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>>> >> + * GNU GPL, version 2 or (at your option) any later version.
>>> >> + */
>>> >> +
>>> >> +#include "qemu/osdep.h"
>>> >> +#include "qemu-common.h"
>>> >> +#include "qemu/error-report.h"
>>> >> +#include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >> +#include "sysemu/runstate.h"
>>> >> +
>>> >> +#include "qemu/main-loop.h"
>>> >> +#include "sysemu/accel.h"
>>> >> +
>>> >> +#include <Hypervisor/Hypervisor.h>
>>> >> +
>>> >> +bool hvf_allowed;
>>> >> +HVFState *hvf_state;
>>> >> +
>>> >> +void assert_hvf_ok(hv_return_t ret)
>>> >> +{
>>> >> +    if (ret == HV_SUCCESS) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    switch (ret) {
>>> >> +    case HV_ERROR:
>>> >> +        error_report("Error: HV_ERROR");
>>> >> +        break;
>>> >> +    case HV_BUSY:
>>> >> +        error_report("Error: HV_BUSY");
>>> >> +        break;
>>> >> +    case HV_BAD_ARGUMENT:
>>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>>> >> +        break;
>>> >> +    case HV_NO_RESOURCES:
>>> >> +        error_report("Error: HV_NO_RESOURCES");
>>> >> +        break;
>>> >> +    case HV_NO_DEVICE:
>>> >> +        error_report("Error: HV_NO_DEVICE");
>>> >> +        break;
>>> >> +    case HV_UNSUPPORTED:
>>> >> +        error_report("Error: HV_UNSUPPORTED");
>>> >> +        break;
>>> >> +    default:
>>> >> +        error_report("Unknown Error");
>>> >> +    }
>>> >> +
>>> >> +    abort();
>>> >> +}
>>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> >> new file mode 100644
>>> >> index 0000000000..f9bb5502b7
>>> >> --- /dev/null
>>> >> +++ b/accel/hvf/hvf-cpus.c
>>> >> @@ -0,0 +1,468 @@
>>> >> +/*
>>> >> + * Copyright 2008 IBM Corporation
>>> >> + *           2008 Red Hat, Inc.
>>> >> + * Copyright 2011 Intel Corporation
>>> >> + * Copyright 2016 Veertu, Inc.
>>> >> + * Copyright 2017 The Android Open Source Project
>>> >> + *
>>> >> + * QEMU Hypervisor.framework support
>>> >> + *
>>> >> + * This program is free software; you can redistribute it and/or
>>> >> + * modify it under the terms of version 2 of the GNU General Public
>>> >> + * License as published by the Free Software Foundation.
>>> >> + *
>>> >> + * This program is distributed in the hope that it will be useful,
>>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> >> + * General Public License for more details.
>>> >> + *
>>> >> + * You should have received a copy of the GNU General Public License
>>> >> + * along with this program; if not, see <
>>> http://www.gnu.org/licenses/>.
>>> >> + *
>>> >> + * This file contains code under public domain from the hvdos project:
>>> >> + * https://github.com/mist64/hvdos
>>> >> + *
>>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>>> >> + * All rights reserved.
>>> >> + *
>>> >> + * Redistribution and use in source and binary forms, with or without
>>> >> + * modification, are permitted provided that the following conditions
>>> >> + * are met:
>>> >> + * 1. Redistributions of source code must retain the above copyright
>>> >> + *    notice, this list of conditions and the following disclaimer.
>>> >> + * 2. Redistributions in binary form must reproduce the above
>>> copyright
>>> >> + *    notice, this list of conditions and the following disclaimer
>>> in the
>>> >> + *    documentation and/or other materials provided with the
>>> distribution.
>>> >> + *
>>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>>> THE
>>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>>> PARTICULAR PURPOSE
>>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>>> LIABLE
>>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>>> CONSEQUENTIAL
>>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>>> GOODS
>>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>>> INTERRUPTION)
>>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>>> CONTRACT, STRICT
>>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>>> ANY WAY
>>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>>> POSSIBILITY OF
>>> >> + * SUCH DAMAGE.
>>> >> + */
>>> >> +
>>> >> +#include "qemu/osdep.h"
>>> >> +#include "qemu/error-report.h"
>>> >> +#include "qemu/main-loop.h"
>>> >> +#include "exec/address-spaces.h"
>>> >> +#include "exec/exec-all.h"
>>> >> +#include "sysemu/cpus.h"
>>> >> +#include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >> +#include "sysemu/runstate.h"
>>> >> +#include "qemu/guest-random.h"
>>> >> +
>>> >> +#include <Hypervisor/Hypervisor.h>
>>> >> +
>>> >> +/* Memory slots */
>>> >> +
>>> >> +struct mac_slot {
>>> >> +    int present;
>>> >> +    uint64_t size;
>>> >> +    uint64_t gpa_start;
>>> >> +    uint64_t gva;
>>> >> +};
>>> >> +
>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>> >> +{
>>> >> +    hvf_slot *slot;
>>> >> +    int x;
>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> +        slot = &hvf_state->slots[x];
>>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>>> >> +            (start + size) > slot->start) {
>>> >> +            return slot;
>>> >> +        }
>>> >> +    }
>>> >> +    return NULL;
>>> >> +}
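[Editor's note: `hvf_find_overlap_slot()` above uses the standard half-open interval intersection test. The standalone predicate below restates that condition outside the patch context; `ranges_overlap` is an illustrative name, not a QEMU symbol.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Two half-open ranges [a, a + as) and [b, b + bs) intersect iff each
 * one starts strictly before the other ends -- the same test that
 * hvf_find_overlap_slot() applies against every non-empty slot.
 */
static bool ranges_overlap(uint64_t a, uint64_t as, uint64_t b, uint64_t bs)
{
    return a < b + bs && a + as > b;
}
```

Because the ranges are half-open, two slots that merely touch end-to-start do not count as overlapping, which is what lets adjacent memory slots coexist.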
>>> >> +
>>> >> +struct mac_slot mac_slots[32];
>>> >> +
>>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>> >> +{
>>> >> +    struct mac_slot *macslot;
>>> >> +    hv_return_t ret;
>>> >> +
>>> >> +    macslot = &mac_slots[slot->slot_id];
>>> >> +
>>> >> +    if (macslot->present) {
>>> >> +        if (macslot->size != slot->size) {
>>> >> +            macslot->present = 0;
>>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>> >> +            assert_hvf_ok(ret);
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    if (!slot->size) {
>>> >> +        return 0;
>>> >> +    }
>>> >> +
>>> >> +    macslot->present = 1;
>>> >> +    macslot->gpa_start = slot->start;
>>> >> +    macslot->size = slot->size;
>>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>>> >> +    assert_hvf_ok(ret);
>>> >> +    return 0;
>>> >> +}
>>> >> +
>>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>> >> +{
>>> >> +    hvf_slot *mem;
>>> >> +    MemoryRegion *area = section->mr;
>>> >> +    bool writeable = !area->readonly && !area->rom_device;
>>> >> +    hv_memory_flags_t flags;
>>> >> +
>>> >> +    if (!memory_region_is_ram(area)) {
>>> >> +        if (writeable) {
>>> >> +            return;
>>> >> +        } else if (!memory_region_is_romd(area)) {
>>> >> +            /*
>>> >> +             * If the memory device is not in romd_mode, then we
>>> actually want
>>> >> +             * to remove the hvf memory slot so all accesses will
>>> trap.
>>> >> +             */
>>> >> +             add = false;
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    mem = hvf_find_overlap_slot(
>>> >> +            section->offset_within_address_space,
>>> >> +            int128_get64(section->size));
>>> >> +
>>> >> +    if (mem && add) {
>>> >> +        if (mem->size == int128_get64(section->size) &&
>>> >> +            mem->start == section->offset_within_address_space &&
>>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>>> >> +            section->offset_within_region)) {
>>> >> +            return; /* Same region was attempted to register, go
>>> away. */
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    /* Region needs to be reset. set the size to 0 and remap it. */
>>> >> +    if (mem) {
>>> >> +        mem->size = 0;
>>> >> +        if (do_hvf_set_memory(mem, 0)) {
>>> >> +            error_report("Failed to reset overlapping slot");
>>> >> +            abort();
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    if (!add) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    if (area->readonly ||
>>> >> +        (!memory_region_is_ram(area) &&
>>> memory_region_is_romd(area))) {
>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>> >> +    } else {
>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>> >> +    }
>>> >> +
>>> >> +    /* Now make a new slot. */
>>> >> +    int x;
>>> >> +
>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> +        mem = &hvf_state->slots[x];
>>> >> +        if (!mem->size) {
>>> >> +            break;
>>> >> +        }
>>> >> +    }
>>> >> +
>>> >> +    if (x == hvf_state->num_slots) {
>>> >> +        error_report("No free slots");
>>> >> +        abort();
>>> >> +    }
>>> >> +
>>> >> +    mem->size = int128_get64(section->size);
>>> >> +    mem->mem = memory_region_get_ram_ptr(area) +
>>> section->offset_within_region;
>>> >> +    mem->start = section->offset_within_address_space;
>>> >> +    mem->region = area;
>>> >> +
>>> >> +    if (do_hvf_set_memory(mem, flags)) {
>>> >> +        error_report("Error registering new memory slot");
>>> >> +        abort();
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section,
>>> bool on)
>>> >> +{
>>> >> +    hvf_slot *slot;
>>> >> +
>>> >> +    slot = hvf_find_overlap_slot(
>>> >> +            section->offset_within_address_space,
>>> >> +            int128_get64(section->size));
>>> >> +
>>> >> +    /* protect region against writes; begin tracking it */
>>> >> +    if (on) {
>>> >> +        slot->flags |= HVF_SLOT_LOG;
>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>> >> +                      HV_MEMORY_READ);
>>> >> +    /* stop tracking region */
>>> >> +    } else {
>>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>> >> +    }
>>> >> +}
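[Editor's note: `hvf_set_dirty_tracking()` above implements dirty logging by write-protecting the slot with `hv_vm_protect()`. The other half of the scheme lives in the fault handler: the first guest write to a protected page traps, gets recorded as dirty, and write access is restored so later writes run at full speed. The toy fragment below sketches that fault side under assumed names (`ToySlot`, `handle_write_fault`, a byte-per-page bitmap); it is not the actual patch code.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOT_LOG  (1u << 0)  /* mirrors HVF_SLOT_LOG for this sketch */
#define PAGE_SZ   4096u

typedef struct {
    uint64_t start;          /* guest-physical base of the slot */
    uint64_t size;
    uint32_t flags;
    uint8_t  dirty[16];      /* toy bitmap: one byte per page */
} ToySlot;

/*
 * Fault-side counterpart to hvf_set_dirty_tracking(): on a write fault
 * inside a logged slot, mark the page dirty and report the fault as
 * handled. The real code would then hv_vm_protect() the page back to
 * HV_MEMORY_READ | HV_MEMORY_WRITE so subsequent writes do not trap.
 */
static bool handle_write_fault(ToySlot *slot, uint64_t gpa)
{
    if (!(slot->flags & SLOT_LOG)) {
        return false;        /* not tracking: a genuine access fault */
    }
    slot->dirty[(gpa - slot->start) / PAGE_SZ] = 1;
    return true;
}
```

This is why `hvf_log_sync()` above only has to re-arm the protection: the dirty information itself is gathered lazily, one trap per page per logging round.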
>>> >> +
>>> >> +static void hvf_log_start(MemoryListener *listener,
>>> >> +                          MemoryRegionSection *section, int old, int
>>> new)
>>> >> +{
>>> >> +    if (old != 0) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    hvf_set_dirty_tracking(section, 1);
>>> >> +}
>>> >> +
>>> >> +static void hvf_log_stop(MemoryListener *listener,
>>> >> +                         MemoryRegionSection *section, int old, int
>>> new)
>>> >> +{
>>> >> +    if (new != 0) {
>>> >> +        return;
>>> >> +    }
>>> >> +
>>> >> +    hvf_set_dirty_tracking(section, 0);
>>> >> +}
>>> >> +
>>> >> +static void hvf_log_sync(MemoryListener *listener,
>>> >> +                         MemoryRegionSection *section)
>>> >> +{
>>> >> +    /*
>>> >> +     * sync of dirty pages is handled elsewhere; just make sure we
>>> keep
>>> >> +     * tracking the region.
>>> >> +     */
>>> >> +    hvf_set_dirty_tracking(section, 1);
>>> >> +}
>>> >> +
>>> >> +static void hvf_region_add(MemoryListener *listener,
>>> >> +                           MemoryRegionSection *section)
>>> >> +{
>>> >> +    hvf_set_phys_mem(section, true);
>>> >> +}
>>> >> +
>>> >> +static void hvf_region_del(MemoryListener *listener,
>>> >> +                           MemoryRegionSection *section)
>>> >> +{
>>> >> +    hvf_set_phys_mem(section, false);
>>> >> +}
>>> >> +
>>> >> +static MemoryListener hvf_memory_listener = {
>>> >> +    .priority = 10,
>>> >> +    .region_add = hvf_region_add,
>>> >> +    .region_del = hvf_region_del,
>>> >> +    .log_start = hvf_log_start,
>>> >> +    .log_stop = hvf_log_stop,
>>> >> +    .log_sync = hvf_log_sync,
>>> >> +};
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu,
>>> run_on_cpu_data arg)
>>> >> +{
>>> >> +    if (!cpu->vcpu_dirty) {
>>> >> +        hvf_get_registers(cpu);
>>> >> +        cpu->vcpu_dirty = true;
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>>> >> +{
>>> >> +    if (!cpu->vcpu_dirty) {
>>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state,
>>> RUN_ON_CPU_NULL);
>>> >> +    }
>>> >> +}
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>> >> +                                              run_on_cpu_data arg)
>>> >> +{
>>> >> +    hvf_put_registers(cpu);
>>> >> +    cpu->vcpu_dirty = false;
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>> >> +{
>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset,
>>> RUN_ON_CPU_NULL);
>>> >> +}
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>> >> +                                             run_on_cpu_data arg)
>>> >> +{
>>> >> +    hvf_put_registers(cpu);
>>> >> +    cpu->vcpu_dirty = false;
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>> >> +{
>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init,
>>> RUN_ON_CPU_NULL);
>>> >> +}
>>> >> +
>>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>> >> +                                              run_on_cpu_data arg)
>>> >> +{
>>> >> +    cpu->vcpu_dirty = true;
>>> >> +}
>>> >> +
>>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>> >> +{
>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm,
>>> RUN_ON_CPU_NULL);
>>> >> +}
>>> >> +
>>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>>> >> +{
>>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>>> >> +    assert_hvf_ok(ret);
>>> >> +
>>> >> +    hvf_arch_vcpu_destroy(cpu);
>>> >> +}
>>> >> +
>>> >> +static void dummy_signal(int sig)
>>> >> +{
>>> >> +}
>>> >> +
>>> >> +static int hvf_init_vcpu(CPUState *cpu)
>>> >> +{
>>> >> +    int r;
>>> >> +
>>> >> +    /* init cpu signals */
>>> >> +    sigset_t set;
>>> >> +    struct sigaction sigact;
>>> >> +
>>> >> +    memset(&sigact, 0, sizeof(sigact));
>>> >> +    sigact.sa_handler = dummy_signal;
>>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>>> >> +
>>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> >> +    sigdelset(&set, SIG_IPI);
>>> >> +
>>> >> +#ifdef __aarch64__
>>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t
>>> **)&cpu->hvf_exit, NULL);
>>> >> +#else
>>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>> >> +#endif
>>> > I think the first __aarch64__ bit fits better to arm part of the
>>> series.
>>>
>>>
>>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>>> ARM enablement.
>>>
>>>
>>> >
>>> >> +    cpu->vcpu_dirty = 1;
>>> >> +    assert_hvf_ok(r);
>>> >> +
>>> >> +    return hvf_arch_init_vcpu(cpu);
>>> >> +}
>>> >> +
>>> >> +/*
>>> >> + * The HVF-specific vCPU thread function. This one should only run
>>> when the host
>>> >> + * CPU supports the VMX "unrestricted guest" feature.
>>> >> + */
>>> >> +static void *hvf_cpu_thread_fn(void *arg)
>>> >> +{
>>> >> +    CPUState *cpu = arg;
>>> >> +
>>> >> +    int r;
>>> >> +
>>> >> +    assert(hvf_enabled());
>>> >> +
>>> >> +    rcu_register_thread();
>>> >> +
>>> >> +    qemu_mutex_lock_iothread();
>>> >> +    qemu_thread_get_self(cpu->thread);
>>> >> +
>>> >> +    cpu->thread_id = qemu_get_thread_id();
>>> >> +    cpu->can_do_io = 1;
>>> >> +    current_cpu = cpu;
>>> >> +
>>> >> +    hvf_init_vcpu(cpu);
>>> >> +
>>> >> +    /* signal CPU creation */
>>> >> +    cpu_thread_signal_created(cpu);
>>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>> >> +
>>> >> +    do {
>>> >> +        if (cpu_can_run(cpu)) {
>>> >> +            r = hvf_vcpu_exec(cpu);
>>> >> +            if (r == EXCP_DEBUG) {
>>> >> +                cpu_handle_guest_debug(cpu);
>>> >> +            }
>>> >> +        }
>>> >> +        qemu_wait_io_event(cpu);
>>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>>> >> +
>>> >> +    hvf_vcpu_destroy(cpu);
>>> >> +    cpu_thread_signal_destroyed(cpu);
>>> >> +    qemu_mutex_unlock_iothread();
>>> >> +    rcu_unregister_thread();
>>> >> +    return NULL;
>>> >> +}
>>> >> +
>>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>>> >> +{
>>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>>> >> +
>>> >> +    /*
>>> >> +     * HVF currently does not support TCG, and only runs in
>>> >> +     * unrestricted-guest mode.
>>> >> +     */
>>> >> +    assert(hvf_enabled());
>>> >> +
>>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>> >> +    qemu_cond_init(cpu->halt_cond);
>>> >> +
>>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>> >> +             cpu->cpu_index);
>>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>>> >> +}
>>> >> +
>>> >> +static const CpusAccel hvf_cpus = {
>>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>>> >> +
>>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>> >> +};
>>> >> +
>>> >> +static int hvf_accel_init(MachineState *ms)
>>> >> +{
>>> >> +    int x;
>>> >> +    hv_return_t ret;
>>> >> +    HVFState *s;
>>> >> +
>>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>>> >> +    assert_hvf_ok(ret);
>>> >> +
>>> >> +    s = g_new0(HVFState, 1);
>>> >> +
>>> >> +    s->num_slots = 32;
>>> >> +    for (x = 0; x < s->num_slots; ++x) {
>>> >> +        s->slots[x].size = 0;
>>> >> +        s->slots[x].slot_id = x;
>>> >> +    }
>>> >> +
>>> >> +    hvf_state = s;
>>> >> +    memory_listener_register(&hvf_memory_listener,
>>> &address_space_memory);
>>> >> +    cpus_register_accel(&hvf_cpus);
>>> >> +    return 0;
>>> >> +}
>>> >> +
>>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>> >> +{
>>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>>> >> +    ac->name = "HVF";
>>> >> +    ac->init_machine = hvf_accel_init;
>>> >> +    ac->allowed = &hvf_allowed;
>>> >> +}
>>> >> +
>>> >> +static const TypeInfo hvf_accel_type = {
>>> >> +    .name = TYPE_HVF_ACCEL,
>>> >> +    .parent = TYPE_ACCEL,
>>> >> +    .class_init = hvf_accel_class_init,
>>> >> +};
>>> >> +
>>> >> +static void hvf_type_init(void)
>>> >> +{
>>> >> +    type_register_static(&hvf_accel_type);
>>> >> +}
>>> >> +
>>> >> +type_init(hvf_type_init);
>>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>>> >> new file mode 100644
>>> >> index 0000000000..dfd6b68dc7
>>> >> --- /dev/null
>>> >> +++ b/accel/hvf/meson.build
>>> >> @@ -0,0 +1,7 @@
>>> >> +hvf_ss = ss.source_set()
>>> >> +hvf_ss.add(files(
>>> >> +  'hvf-all.c',
>>> >> +  'hvf-cpus.c',
>>> >> +))
>>> >> +
>>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>>> >> diff --git a/accel/meson.build b/accel/meson.build
>>> >> index b26cca227a..6de12ce5d5 100644
>>> >> --- a/accel/meson.build
>>> >> +++ b/accel/meson.build
>>> >> @@ -1,5 +1,6 @@
>>> >>   softmmu_ss.add(files('accel.c'))
>>> >>
>>> >> +subdir('hvf')
>>> >>   subdir('qtest')
>>> >>   subdir('kvm')
>>> >>   subdir('tcg')
>>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>> >> new file mode 100644
>>> >> index 0000000000..de9bad23a8
>>> >> --- /dev/null
>>> >> +++ b/include/sysemu/hvf_int.h
>>> >> @@ -0,0 +1,69 @@
>>> >> +/*
>>> >> + * QEMU Hypervisor.framework (HVF) support
>>> >> + *
>>> >> + * This work is licensed under the terms of the GNU GPL, version 2
>>> or later.
>>> >> + * See the COPYING file in the top-level directory.
>>> >> + *
>>> >> + */
>>> >> +
>>> >> +/* header to be included in HVF-specific code */
>>> >> +
>>> >> +#ifndef HVF_INT_H
>>> >> +#define HVF_INT_H
>>> >> +
>>> >> +#include <Hypervisor/Hypervisor.h>
>>> >> +
>>> >> +#define HVF_MAX_VCPU 0x10
>>> >> +
>>> >> +extern struct hvf_state hvf_global;
>>> >> +
>>> >> +struct hvf_vm {
>>> >> +    int id;
>>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>> >> +};
>>> >> +
>>> >> +struct hvf_state {
>>> >> +    uint32_t version;
>>> >> +    struct hvf_vm *vm;
>>> >> +    uint64_t mem_quota;
>>> >> +};
>>> >> +
>>> >> +/* hvf_slot flags */
>>> >> +#define HVF_SLOT_LOG (1 << 0)
>>> >> +
>>> >> +typedef struct hvf_slot {
>>> >> +    uint64_t start;
>>> >> +    uint64_t size;
>>> >> +    uint8_t *mem;
>>> >> +    int slot_id;
>>> >> +    uint32_t flags;
>>> >> +    MemoryRegion *region;
>>> >> +} hvf_slot;
>>> >> +
>>> >> +typedef struct hvf_vcpu_caps {
>>> >> +    uint64_t vmx_cap_pinbased;
>>> >> +    uint64_t vmx_cap_procbased;
>>> >> +    uint64_t vmx_cap_procbased2;
>>> >> +    uint64_t vmx_cap_entry;
>>> >> +    uint64_t vmx_cap_exit;
>>> >> +    uint64_t vmx_cap_preemption_timer;
>>> >> +} hvf_vcpu_caps;
>>> >> +
>>> >> +struct HVFState {
>>> >> +    AccelState parent;
>>> >> +    hvf_slot slots[32];
>>> >> +    int num_slots;
>>> >> +
>>> >> +    hvf_vcpu_caps *hvf_caps;
>>> >> +};
>>> >> +extern HVFState *hvf_state;
>>> >> +
>>> >> +void assert_hvf_ok(hv_return_t ret);
>>> >> +int hvf_get_registers(CPUState *cpu);
>>> >> +int hvf_put_registers(CPUState *cpu);
>>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>>> >> +int hvf_vcpu_exec(CPUState *cpu);
>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>> >> +
>>> >> +#endif
>>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>>> >> deleted file mode 100644
>>> >> index 817b3d7452..0000000000
>>> >> --- a/target/i386/hvf/hvf-cpus.c
>>> >> +++ /dev/null
>>> >> @@ -1,131 +0,0 @@
>>> >> -/*
>>> >> - * Copyright 2008 IBM Corporation
>>> >> - *           2008 Red Hat, Inc.
>>> >> - * Copyright 2011 Intel Corporation
>>> >> - * Copyright 2016 Veertu, Inc.
>>> >> - * Copyright 2017 The Android Open Source Project
>>> >> - *
>>> >> - * QEMU Hypervisor.framework support
>>> >> - *
>>> >> - * This program is free software; you can redistribute it and/or
>>> >> - * modify it under the terms of version 2 of the GNU General Public
>>> >> - * License as published by the Free Software Foundation.
>>> >> - *
>>> >> - * This program is distributed in the hope that it will be useful,
>>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>> >> - * General Public License for more details.
>>> >> - *
>>> >> - * You should have received a copy of the GNU General Public License
>>> >> - * along with this program; if not, see <
>>> http://www.gnu.org/licenses/>.
>>> >> - *
>>> >> - * This file contain code under public domain from the hvdos project:
>>> >> - * https://github.com/mist64/hvdos
>>> >> - *
>>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>>> >> - * All rights reserved.
>>> >> - *
>>> >> - * Redistribution and use in source and binary forms, with or without
>>> >> - * modification, are permitted provided that the following conditions
>>> >> - * are met:
>>> >> - * 1. Redistributions of source code must retain the above copyright
>>> >> - *    notice, this list of conditions and the following disclaimer.
>>> >> - * 2. Redistributions in binary form must reproduce the above
>>> copyright
>>> >> - *    notice, this list of conditions and the following disclaimer
>>> in the
>>> >> - *    documentation and/or other materials provided with the
>>> distribution.
>>> >> - *
>>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
>>> THE
>>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
>>> PARTICULAR PURPOSE
>>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE
>>> LIABLE
>>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
>>> CONSEQUENTIAL
>>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
>>> GOODS
>>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
>>> INTERRUPTION)
>>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
>>> CONTRACT, STRICT
>>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
>>> ANY WAY
>>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
>>> POSSIBILITY OF
>>> >> - * SUCH DAMAGE.
>>> >> - */
>>> >> -
>>> >> -#include "qemu/osdep.h"
>>> >> -#include "qemu/error-report.h"
>>> >> -#include "qemu/main-loop.h"
>>> >> -#include "sysemu/hvf.h"
>>> >> -#include "sysemu/runstate.h"
>>> >> -#include "target/i386/cpu.h"
>>> >> -#include "qemu/guest-random.h"
>>> >> -
>>> >> -#include "hvf-cpus.h"
>>> >> -
>>> >> -/*
>>> >> - * The HVF-specific vCPU thread function. This one should only run
>>> when the host
>>> >> - * CPU supports the VMX "unrestricted guest" feature.
>>> >> - */
>>> >> -static void *hvf_cpu_thread_fn(void *arg)
>>> >> -{
>>> >> -    CPUState *cpu = arg;
>>> >> -
>>> >> -    int r;
>>> >> -
>>> >> -    assert(hvf_enabled());
>>> >> -
>>> >> -    rcu_register_thread();
>>> >> -
>>> >> -    qemu_mutex_lock_iothread();
>>> >> -    qemu_thread_get_self(cpu->thread);
>>> >> -
>>> >> -    cpu->thread_id = qemu_get_thread_id();
>>> >> -    cpu->can_do_io = 1;
>>> >> -    current_cpu = cpu;
>>> >> -
>>> >> -    hvf_init_vcpu(cpu);
>>> >> -
>>> >> -    /* signal CPU creation */
>>> >> -    cpu_thread_signal_created(cpu);
>>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>> >> -
>>> >> -    do {
>>> >> -        if (cpu_can_run(cpu)) {
>>> >> -            r = hvf_vcpu_exec(cpu);
>>> >> -            if (r == EXCP_DEBUG) {
>>> >> -                cpu_handle_guest_debug(cpu);
>>> >> -            }
>>> >> -        }
>>> >> -        qemu_wait_io_event(cpu);
>>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>>> >> -
>>> >> -    hvf_vcpu_destroy(cpu);
>>> >> -    cpu_thread_signal_destroyed(cpu);
>>> >> -    qemu_mutex_unlock_iothread();
>>> >> -    rcu_unregister_thread();
>>> >> -    return NULL;
>>> >> -}
>>> >> -
>>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>>> >> -{
>>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>>> >> -
>>> >> -    /*
>>> >> -     * HVF currently does not support TCG, and only runs in
>>> >> -     * unrestricted-guest mode.
>>> >> -     */
>>> >> -    assert(hvf_enabled());
>>> >> -
>>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>> >> -    qemu_cond_init(cpu->halt_cond);
>>> >> -
>>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>> >> -             cpu->cpu_index);
>>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>>> >> -}
>>> >> -
>>> >> -const CpusAccel hvf_cpus = {
>>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>>> >> -
>>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>> >> -};
>>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>>> >> deleted file mode 100644
>>> >> index ced31b82c0..0000000000
>>> >> --- a/target/i386/hvf/hvf-cpus.h
>>> >> +++ /dev/null
>>> >> @@ -1,25 +0,0 @@
>>> >> -/*
>>> >> - * Accelerator CPUS Interface
>>> >> - *
>>> >> - * Copyright 2020 SUSE LLC
>>> >> - *
>>> >> - * This work is licensed under the terms of the GNU GPL, version 2
>>> or later.
>>> >> - * See the COPYING file in the top-level directory.
>>> >> - */
>>> >> -
>>> >> -#ifndef HVF_CPUS_H
>>> >> -#define HVF_CPUS_H
>>> >> -
>>> >> -#include "sysemu/cpus.h"
>>> >> -
>>> >> -extern const CpusAccel hvf_cpus;
>>> >> -
>>> >> -int hvf_init_vcpu(CPUState *);
>>> >> -int hvf_vcpu_exec(CPUState *);
>>> >> -void hvf_cpu_synchronize_state(CPUState *);
>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>>> >> -void hvf_vcpu_destroy(CPUState *);
>>> >> -
>>> >> -#endif /* HVF_CPUS_H */
>>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>>> >> index e0edffd077..6d56f8f6bb 100644
>>> >> --- a/target/i386/hvf/hvf-i386.h
>>> >> +++ b/target/i386/hvf/hvf-i386.h
>>> >> @@ -18,57 +18,11 @@
>>> >>
>>> >>   #include "sysemu/accel.h"
>>> >>   #include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >>   #include "cpu.h"
>>> >>   #include "x86.h"
>>> >>
>>> >> -#define HVF_MAX_VCPU 0x10
>>> >> -
>>> >> -extern struct hvf_state hvf_global;
>>> >> -
>>> >> -struct hvf_vm {
>>> >> -    int id;
>>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>> >> -};
>>> >> -
>>> >> -struct hvf_state {
>>> >> -    uint32_t version;
>>> >> -    struct hvf_vm *vm;
>>> >> -    uint64_t mem_quota;
>>> >> -};
>>> >> -
>>> >> -/* hvf_slot flags */
>>> >> -#define HVF_SLOT_LOG (1 << 0)
>>> >> -
>>> >> -typedef struct hvf_slot {
>>> >> -    uint64_t start;
>>> >> -    uint64_t size;
>>> >> -    uint8_t *mem;
>>> >> -    int slot_id;
>>> >> -    uint32_t flags;
>>> >> -    MemoryRegion *region;
>>> >> -} hvf_slot;
>>> >> -
>>> >> -typedef struct hvf_vcpu_caps {
>>> >> -    uint64_t vmx_cap_pinbased;
>>> >> -    uint64_t vmx_cap_procbased;
>>> >> -    uint64_t vmx_cap_procbased2;
>>> >> -    uint64_t vmx_cap_entry;
>>> >> -    uint64_t vmx_cap_exit;
>>> >> -    uint64_t vmx_cap_preemption_timer;
>>> >> -} hvf_vcpu_caps;
>>> >> -
>>> >> -struct HVFState {
>>> >> -    AccelState parent;
>>> >> -    hvf_slot slots[32];
>>> >> -    int num_slots;
>>> >> -
>>> >> -    hvf_vcpu_caps *hvf_caps;
>>> >> -};
>>> >> -extern HVFState *hvf_state;
>>> >> -
>>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>> >>
>>> >>   #ifdef NEED_CPU_H
>>> >>   /* Functions exported to host specific mode */
>>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>>> >> index ed9356565c..8b96ecd619 100644
>>> >> --- a/target/i386/hvf/hvf.c
>>> >> +++ b/target/i386/hvf/hvf.c
>>> >> @@ -51,6 +51,7 @@
>>> >>   #include "qemu/error-report.h"
>>> >>
>>> >>   #include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >>   #include "sysemu/runstate.h"
>>> >>   #include "hvf-i386.h"
>>> >>   #include "vmcs.h"
>>> >> @@ -72,171 +73,6 @@
>>> >>   #include "sysemu/accel.h"
>>> >>   #include "target/i386/cpu.h"
>>> >>
>>> >> -#include "hvf-cpus.h"
>>> >> -
>>> >> -HVFState *hvf_state;
>>> >> -
>>> >> -static void assert_hvf_ok(hv_return_t ret)
>>> >> -{
>>> >> -    if (ret == HV_SUCCESS) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    switch (ret) {
>>> >> -    case HV_ERROR:
>>> >> -        error_report("Error: HV_ERROR");
>>> >> -        break;
>>> >> -    case HV_BUSY:
>>> >> -        error_report("Error: HV_BUSY");
>>> >> -        break;
>>> >> -    case HV_BAD_ARGUMENT:
>>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>>> >> -        break;
>>> >> -    case HV_NO_RESOURCES:
>>> >> -        error_report("Error: HV_NO_RESOURCES");
>>> >> -        break;
>>> >> -    case HV_NO_DEVICE:
>>> >> -        error_report("Error: HV_NO_DEVICE");
>>> >> -        break;
>>> >> -    case HV_UNSUPPORTED:
>>> >> -        error_report("Error: HV_UNSUPPORTED");
>>> >> -        break;
>>> >> -    default:
>>> >> -        error_report("Unknown Error");
>>> >> -    }
>>> >> -
>>> >> -    abort();
>>> >> -}
>>> >> -
>>> >> -/* Memory slots */
>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>> >> -{
>>> >> -    hvf_slot *slot;
>>> >> -    int x;
>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> -        slot = &hvf_state->slots[x];
>>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>>> >> -            (start + size) > slot->start) {
>>> >> -            return slot;
>>> >> -        }
>>> >> -    }
>>> >> -    return NULL;
>>> >> -}
>>> >> -
>>> >> -struct mac_slot {
>>> >> -    int present;
>>> >> -    uint64_t size;
>>> >> -    uint64_t gpa_start;
>>> >> -    uint64_t gva;
>>> >> -};
>>> >> -
>>> >> -struct mac_slot mac_slots[32];
>>> >> -
>>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>> >> -{
>>> >> -    struct mac_slot *macslot;
>>> >> -    hv_return_t ret;
>>> >> -
>>> >> -    macslot = &mac_slots[slot->slot_id];
>>> >> -
>>> >> -    if (macslot->present) {
>>> >> -        if (macslot->size != slot->size) {
>>> >> -            macslot->present = 0;
>>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>> >> -            assert_hvf_ok(ret);
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    if (!slot->size) {
>>> >> -        return 0;
>>> >> -    }
>>> >> -
>>> >> -    macslot->present = 1;
>>> >> -    macslot->gpa_start = slot->start;
>>> >> -    macslot->size = slot->size;
>>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>>> >> -    assert_hvf_ok(ret);
>>> >> -    return 0;
>>> >> -}
>>> >> -
>>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>> >> -{
>>> >> -    hvf_slot *mem;
>>> >> -    MemoryRegion *area = section->mr;
>>> >> -    bool writeable = !area->readonly && !area->rom_device;
>>> >> -    hv_memory_flags_t flags;
>>> >> -
>>> >> -    if (!memory_region_is_ram(area)) {
>>> >> -        if (writeable) {
>>> >> -            return;
>>> >> -        } else if (!memory_region_is_romd(area)) {
>>> >> -            /*
>>> >> -             * If the memory device is not in romd_mode, then we actually want
>>> >> -             * to remove the hvf memory slot so all accesses will trap.
>>> >> -             */
>>> >> -             add = false;
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    mem = hvf_find_overlap_slot(
>>> >> -            section->offset_within_address_space,
>>> >> -            int128_get64(section->size));
>>> >> -
>>> >> -    if (mem && add) {
>>> >> -        if (mem->size == int128_get64(section->size) &&
>>> >> -            mem->start == section->offset_within_address_space &&
>>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>>> >> -            section->offset_within_region)) {
>>> >> -            return; /* Same region was attempted to register, go away. */
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>>> >> -    if (mem) {
>>> >> -        mem->size = 0;
>>> >> -        if (do_hvf_set_memory(mem, 0)) {
>>> >> -            error_report("Failed to reset overlapping slot");
>>> >> -            abort();
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    if (!add) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    if (area->readonly ||
>>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>> >> -    } else {
>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>> >> -    }
>>> >> -
>>> >> -    /* Now make a new slot. */
>>> >> -    int x;
>>> >> -
>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>> >> -        mem = &hvf_state->slots[x];
>>> >> -        if (!mem->size) {
>>> >> -            break;
>>> >> -        }
>>> >> -    }
>>> >> -
>>> >> -    if (x == hvf_state->num_slots) {
>>> >> -        error_report("No free slots");
>>> >> -        abort();
>>> >> -    }
>>> >> -
>>> >> -    mem->size = int128_get64(section->size);
>>> >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>> >> -    mem->start = section->offset_within_address_space;
>>> >> -    mem->region = area;
>>> >> -
>>> >> -    if (do_hvf_set_memory(mem, flags)) {
>>> >> -        error_report("Error registering new memory slot");
>>> >> -        abort();
>>> >> -    }
>>> >> -}
>>> >> -
>>> >>   void vmx_update_tpr(CPUState *cpu)
>>> >>   {
>>> >>       /* TODO: need integrate APIC handling */
>>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>> >>       }
>>> >>   }
>>> >>
>>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>> >> -{
>>> >> -    if (!cpu->vcpu_dirty) {
>>> >> -        hvf_get_registers(cpu);
>>> >> -        cpu->vcpu_dirty = true;
>>> >> -    }
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>>> >> -{
>>> >> -    if (!cpu->vcpu_dirty) {
>>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>> >> -    }
>>> >> -}
>>> >> -
>>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>> >> -                                              run_on_cpu_data arg)
>>> >> -{
>>> >> -    hvf_put_registers(cpu);
>>> >> -    cpu->vcpu_dirty = false;
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>> >> -{
>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>> >> -}
>>> >> -
>>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>> >> -                                             run_on_cpu_data arg)
>>> >> -{
>>> >> -    hvf_put_registers(cpu);
>>> >> -    cpu->vcpu_dirty = false;
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>> >> -{
>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>> >> -}
>>> >> -
>>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>> >> -                                              run_on_cpu_data arg)
>>> >> -{
>>> >> -    cpu->vcpu_dirty = true;
>>> >> -}
>>> >> -
>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>> >> -{
>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>> >> -}
>>> >> -
>>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>> >>   {
>>> >>       int read, write;
>>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>> >>       return false;
>>> >>   }
>>> >>
>>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>> >> -{
>>> >> -    hvf_slot *slot;
>>> >> -
>>> >> -    slot = hvf_find_overlap_slot(
>>> >> -            section->offset_within_address_space,
>>> >> -            int128_get64(section->size));
>>> >> -
>>> >> -    /* protect region against writes; begin tracking it */
>>> >> -    if (on) {
>>> >> -        slot->flags |= HVF_SLOT_LOG;
>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>> >> -                      HV_MEMORY_READ);
>>> >> -    /* stop tracking region*/
>>> >> -    } else {
>>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>> >> -    }
>>> >> -}
>>> >> -
>>> >> -static void hvf_log_start(MemoryListener *listener,
>>> >> -                          MemoryRegionSection *section, int old, int new)
>>> >> -{
>>> >> -    if (old != 0) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    hvf_set_dirty_tracking(section, 1);
>>> >> -}
>>> >> -
>>> >> -static void hvf_log_stop(MemoryListener *listener,
>>> >> -                         MemoryRegionSection *section, int old, int new)
>>> >> -{
>>> >> -    if (new != 0) {
>>> >> -        return;
>>> >> -    }
>>> >> -
>>> >> -    hvf_set_dirty_tracking(section, 0);
>>> >> -}
>>> >> -
>>> >> -static void hvf_log_sync(MemoryListener *listener,
>>> >> -                         MemoryRegionSection *section)
>>> >> -{
>>> >> -    /*
>>> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>>> >> -     * tracking the region.
>>> >> -     */
>>> >> -    hvf_set_dirty_tracking(section, 1);
>>> >> -}
>>> >> -
>>> >> -static void hvf_region_add(MemoryListener *listener,
>>> >> -                           MemoryRegionSection *section)
>>> >> -{
>>> >> -    hvf_set_phys_mem(section, true);
>>> >> -}
>>> >> -
>>> >> -static void hvf_region_del(MemoryListener *listener,
>>> >> -                           MemoryRegionSection *section)
>>> >> -{
>>> >> -    hvf_set_phys_mem(section, false);
>>> >> -}
>>> >> -
>>> >> -static MemoryListener hvf_memory_listener = {
>>> >> -    .priority = 10,
>>> >> -    .region_add = hvf_region_add,
>>> >> -    .region_del = hvf_region_del,
>>> >> -    .log_start = hvf_log_start,
>>> >> -    .log_stop = hvf_log_stop,
>>> >> -    .log_sync = hvf_log_sync,
>>> >> -};
>>> >> -
>>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>> >>   {
>>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>>> >>       CPUX86State *env = &x86_cpu->env;
>>> >>
>>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>> >>       g_free(env->hvf_mmio_buf);
>>> >> -    assert_hvf_ok(ret);
>>> >> -}
>>> >> -
>>> >> -static void dummy_signal(int sig)
>>> >> -{
>>> >>   }
>>> >>
>>> >> -int hvf_init_vcpu(CPUState *cpu)
>>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>>> >>   {
>>> >>
>>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>>> >>       CPUX86State *env = &x86cpu->env;
>>> >> -    int r;
>>> >> -
>>> >> -    /* init cpu signals */
>>> >> -    sigset_t set;
>>> >> -    struct sigaction sigact;
>>> >> -
>>> >> -    memset(&sigact, 0, sizeof(sigact));
>>> >> -    sigact.sa_handler = dummy_signal;
>>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>>> >> -
>>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> >> -    sigdelset(&set, SIG_IPI);
>>> >>
>>> >>       init_emu();
>>> >>       init_decoder();
>>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>>> >>
>>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>> >> -    cpu->vcpu_dirty = 1;
>>> >> -    assert_hvf_ok(r);
>>> >> -
>>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>> >>           abort();
>>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>> >>
>>> >>       return ret;
>>> >>   }
>>> >> -
>>> >> -bool hvf_allowed;
>>> >> -
>>> >> -static int hvf_accel_init(MachineState *ms)
>>> >> -{
>>> >> -    int x;
>>> >> -    hv_return_t ret;
>>> >> -    HVFState *s;
>>> >> -
>>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>>> >> -    assert_hvf_ok(ret);
>>> >> -
>>> >> -    s = g_new0(HVFState, 1);
>>> >> -
>>> >> -    s->num_slots = 32;
>>> >> -    for (x = 0; x < s->num_slots; ++x) {
>>> >> -        s->slots[x].size = 0;
>>> >> -        s->slots[x].slot_id = x;
>>> >> -    }
>>> >> -
>>> >> -    hvf_state = s;
>>> >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>> >> -    cpus_register_accel(&hvf_cpus);
>>> >> -    return 0;
>>> >> -}
>>> >> -
>>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>> >> -{
>>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>>> >> -    ac->name = "HVF";
>>> >> -    ac->init_machine = hvf_accel_init;
>>> >> -    ac->allowed = &hvf_allowed;
>>> >> -}
>>> >> -
>>> >> -static const TypeInfo hvf_accel_type = {
>>> >> -    .name = TYPE_HVF_ACCEL,
>>> >> -    .parent = TYPE_ACCEL,
>>> >> -    .class_init = hvf_accel_class_init,
>>> >> -};
>>> >> -
>>> >> -static void hvf_type_init(void)
>>> >> -{
>>> >> -    type_register_static(&hvf_accel_type);
>>> >> -}
>>> >> -
>>> >> -type_init(hvf_type_init);
>>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>>> >> index 409c9a3f14..c8a43717ee 100644
>>> >> --- a/target/i386/hvf/meson.build
>>> >> +++ b/target/i386/hvf/meson.build
>>> >> @@ -1,6 +1,5 @@
>>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>> >>     'hvf.c',
>>> >> -  'hvf-cpus.c',
>>> >>     'x86.c',
>>> >>     'x86_cpuid.c',
>>> >>     'x86_decode.c',
>>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>>> >> index bbec412b6c..89b8e9d87a 100644
>>> >> --- a/target/i386/hvf/x86hvf.c
>>> >> +++ b/target/i386/hvf/x86hvf.c
>>> >> @@ -20,6 +20,9 @@
>>> >>   #include "qemu/osdep.h"
>>> >>
>>> >>   #include "qemu-common.h"
>>> >> +#include "sysemu/hvf.h"
>>> >> +#include "sysemu/hvf_int.h"
>>> >> +#include "sysemu/hw_accel.h"
>>> >>   #include "x86hvf.h"
>>> >>   #include "vmx.h"
>>> >>   #include "vmcs.h"
>>> >> @@ -32,8 +35,6 @@
>>> >>   #include <Hypervisor/hv.h>
>>> >>   #include <Hypervisor/hv_vmx.h>
>>> >>
>>> >> -#include "hvf-cpus.h"
>>> >> -
>>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>> >>                        SegmentCache *qseg, bool is_tr)
>>> >>   {
>>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>> >>
>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>> >> +        cpu_synchronize_state(cpu_state);
>>> >>           do_cpu_init(cpu);
>>> >>       }
>>> >>
>>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>> >>           cpu_state->halted = 0;
>>> >>       }
>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>> >> +        cpu_synchronize_state(cpu_state);
>>> >>           do_cpu_sipi(cpu);
>>> >>       }
>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>> >> +        cpu_synchronize_state(cpu_state);
>>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>>> > be a separate patch. It follows the cpu/accel cleanups Claudio was
>>> > doing this summer.
>>>
>>>
>>> The only reason they're in here is because we no longer have access to
>>> the hvf_ functions from the file. I am perfectly happy to rebase the
>>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>>> trivial for him to rebase on top of this too if my series goes in first.
>>>
>>>
>>> >
>>> > Philippe raised the idea that the patch might go ahead of the ARM-specific
>>> > part (which might involve some discussions) and I agree with that.
>>> >
>>> > Some sync between Claudio's series (CC'd him) and this patch might be
>>> > needed.
>>>
>>>
>>> I would prefer not to hold back because of the sync. Claudio's cleanup
>>> is trivial enough to adjust for if it gets merged ahead of this.
>>>
>>>
>>> Alex
>>>
>>>
>>>
>>>

[-- Attachment #2: Type: text/html, Size: 99239 bytes --]


* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 20:55             ` Frank Yang
@ 2020-11-30 21:08               ` Peter Collingbourne
  2020-11-30 21:40                 ` Alexander Graf
  2020-11-30 22:10               ` Peter Maydell
  2020-11-30 22:46               ` Peter Collingbourne
  2 siblings, 1 reply; 64+ messages in thread
From: Peter Collingbourne @ 2020-11-30 21:08 UTC (permalink / raw)
  To: Frank Yang
  Cc: Alexander Graf, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>
>
>
> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> Hi Frank,
>>
>> Thanks for the update :). Your previous email nudged me in the right direction. I had previously implemented WFI through the internal timer framework, which performed way worse.
>
> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just reading cntpct_el0 on the host directly and comparing it with cval is that if we sleep, cval is going to be much less than cntpct_el0 by the sleep time. If we can get either the architecture or macOS to read out the sleep time, then we might not have to use a poll interval at all!
>>
>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>
>>   https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>
>
> Thanks, we'll take a look :)
>
>>
>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.

Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
I'll reply to your patch here. You have:

+                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
+                    cpu->hvf->sleeping = true;
+                    smp_mb();
+
+                    /* Bail out if we received an IRQ meanwhile */
+                    if (cpu->thread_kicked || (cpu->interrupt_request &
+                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+                        cpu->hvf->sleeping = false;
+                        break;
+                    }
+
+                    /* nanosleep returns on signal, so we wake up on kick. */
+                    nanosleep(ts, NULL);

and then send the signal conditional on whether sleeping is true, but
I think this is racy. If the signal is sent after sleeping is set to
true but before entering nanosleep then I think it will be ignored and
we will miss the wakeup. That's why in my implementation I block IPI
on the CPU thread at startup and then use pselect to atomically
unblock and begin sleeping. The signal is sent unconditionally so
there's no need to worry about races between actually sleeping and the
"we think we're sleeping" state. It may lead to an extra wakeup but
that's better than missing it entirely.

Peter

>>
>> Also, is there a particular reason you're working on this super interesting and useful code in a random downstream fork of QEMU? Wouldn't it be more helpful to contribute to the upstream code base instead?
>
> We'd actually like to contribute upstream too :) We do want to maintain our own downstream though; the Android Emulator codebase needs to work solidly on macOS and Windows, which has made keeping up with upstream difficult, and staying on a previous version (2.12) with known quirks easier. (There's also some Android-related customization relating to the Qt UI + a different set of virtual devices and snapshot support (incl. snapshots of graphics devices with OpenGLES state tracking), which we hope to separate into other libraries/processes, but it's not insignificant.)
>>
>>
>> Alex
>>
>>
>> On 30.11.20 21:15, Frank Yang wrote:
>>
>> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But the high CPU usage seems to be mitigated by having a poll interval (like KVM does) in handling WFI:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>>
>> This is loosely inspired by https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 which does seem to specify a poll interval.
>>
>> It would be cool if we could have a lightweight way to enter sleep and restart the vcpus precisely when CVAL passes, though.
>>
>> Frank
>>
>>
>> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>>>
>>> Hi all,
>>>
>>> +Peter Collingbourne
>>>
>>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>>
>>> Peter and I have been working on an HVF Apple Silicon backend with an eye toward Android guests.
>>>
>>> We have gotten things to basically switch to Android userspace already (logcat/shell and graphics available at least)
>>>
>>> Our strategy so far has been to import logic from the KVM implementation and hook into QEMU's software devices that previously were assumed to work only with TCG, or that have KVM-specific paths.
>>>
>>> Thanks to Alexander for the tip on the 36-bit address space limitation btw; our way of addressing this is to still allow highmem but not put pci high mmio so high.
>>>
>>> Also, note we have a sleep/signal based mechanism to deal with WFx, which might be worth looking into in Alexander's implementation as well:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>>
>>> Patches so far, FYI:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>>> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>>> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>>> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>>> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>>> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>>>
>>> Peter's also noticed that there are extra steps needed for M1s to allow TCG to work, as it involves JIT:
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>>>
>>> We'd appreciate any feedback/comments :)
>>>
>>> Best,
>>>
>>> Frank
>>>
>>> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>>
>>>> On 27.11.20 21:00, Roman Bolshakov wrote:
>>>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>>>> >> Until now, Hypervisor.framework has only been available on x86_64 systems.
>>>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>>>> >> prepare for support for multiple architectures, let's move common code out
>>>> >> into its own accel directory.
>>>> >>
>>>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>> >> ---
>>>> >>   MAINTAINERS                 |   9 +-
>>>> >>   accel/hvf/hvf-all.c         |  56 +++++
>>>> >>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>>>> >>   accel/hvf/meson.build       |   7 +
>>>> >>   accel/meson.build           |   1 +
>>>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>>>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>>>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>>>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>>> >>   target/i386/hvf/meson.build |   1 -
>>>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>>>> >>   target/i386/hvf/x86hvf.h    |   2 -
>>>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>>>> >>   create mode 100644 accel/hvf/hvf-all.c
>>>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>>>> >>   create mode 100644 accel/hvf/meson.build
>>>> >>   create mode 100644 include/sysemu/hvf_int.h
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>>> >>
>>>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> >> index 68bc160f41..ca4b6d9279 100644
>>>> >> --- a/MAINTAINERS
>>>> >> +++ b/MAINTAINERS
>>>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >>   W: https://wiki.qemu.org/Features/HVF
>>>> >>   S: Maintained
>>>> >> -F: accel/stubs/hvf-stub.c
>>>> > There was a patch for that in the RFC series from Claudio.
>>>>
>>>>
>>>> Yeah, I'm not worried about this hunk :).
>>>>
>>>>
>>>> >
>>>> >>   F: target/i386/hvf/
>>>> >> +
>>>> >> +HVF
>>>> >> +M: Cameron Esfahani <dirty@apple.com>
>>>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >> +W: https://wiki.qemu.org/Features/HVF
>>>> >> +S: Maintained
>>>> >> +F: accel/hvf/
>>>> >>   F: include/sysemu/hvf.h
>>>> >> +F: include/sysemu/hvf_int.h
>>>> >>
>>>> >>   WHPX CPUs
>>>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>>>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..47d77a472a
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-all.c
>>>> >> @@ -0,0 +1,56 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>> >> + * the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>>>> >> + * GNU GPL, version 2 or (at your option) any later version.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu-common.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "sysemu/accel.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +bool hvf_allowed;
>>>> >> +HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret)
>>>> >> +{
>>>> >> +    if (ret == HV_SUCCESS) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    switch (ret) {
>>>> >> +    case HV_ERROR:
>>>> >> +        error_report("Error: HV_ERROR");
>>>> >> +        break;
>>>> >> +    case HV_BUSY:
>>>> >> +        error_report("Error: HV_BUSY");
>>>> >> +        break;
>>>> >> +    case HV_BAD_ARGUMENT:
>>>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> +        break;
>>>> >> +    case HV_NO_RESOURCES:
>>>> >> +        error_report("Error: HV_NO_RESOURCES");
>>>> >> +        break;
>>>> >> +    case HV_NO_DEVICE:
>>>> >> +        error_report("Error: HV_NO_DEVICE");
>>>> >> +        break;
>>>> >> +    case HV_UNSUPPORTED:
>>>> >> +        error_report("Error: HV_UNSUPPORTED");
>>>> >> +        break;
>>>> >> +    default:
>>>> >> +        error_report("Unknown Error");
>>>> >> +    }
>>>> >> +
>>>> >> +    abort();
>>>> >> +}
>>>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..f9bb5502b7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-cpus.c
>>>> >> @@ -0,0 +1,468 @@
>>>> >> +/*
>>>> >> + * Copyright 2008 IBM Corporation
>>>> >> + *           2008 Red Hat, Inc.
>>>> >> + * Copyright 2011 Intel Corporation
>>>> >> + * Copyright 2016 Veertu, Inc.
>>>> >> + * Copyright 2017 The Android Open Source Project
>>>> >> + *
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This program is free software; you can redistribute it and/or
>>>> >> + * modify it under the terms of version 2 of the GNU General Public
>>>> >> + * License as published by the Free Software Foundation.
>>>> >> + *
>>>> >> + * This program is distributed in the hope that it will be useful,
>>>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> + * General Public License for more details.
>>>> >> + *
>>>> >> + * You should have received a copy of the GNU General Public License
>>>> >> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> + *
>>>> >> + * This file contain code under public domain from the hvdos project:
>>>> >> + * https://github.com/mist64/hvdos
>>>> >> + *
>>>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> + * All rights reserved.
>>>> >> + *
>>>> >> + * Redistribution and use in source and binary forms, with or without
>>>> >> + * modification, are permitted provided that the following conditions
>>>> >> + * are met:
>>>> >> + * 1. Redistributions of source code must retain the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer.
>>>> >> + * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer in the
>>>> >> + *    documentation and/or other materials provided with the distribution.
>>>> >> + *
>>>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> + * SUCH DAMAGE.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "exec/address-spaces.h"
>>>> >> +#include "exec/exec-all.h"
>>>> >> +#include "sysemu/cpus.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +#include "qemu/guest-random.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +/* Memory slots */
>>>> >> +
>>>> >> +struct mac_slot {
>>>> >> +    int present;
>>>> >> +    uint64_t size;
>>>> >> +    uint64_t gpa_start;
>>>> >> +    uint64_t gva;
>>>> >> +};
>>>> >> +
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +    int x;
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        slot = &hvf_state->slots[x];
>>>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> +            (start + size) > slot->start) {
>>>> >> +            return slot;
>>>> >> +        }
>>>> >> +    }
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +struct mac_slot mac_slots[32];
>>>> >> +
>>>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> +{
>>>> >> +    struct mac_slot *macslot;
>>>> >> +    hv_return_t ret;
>>>> >> +
>>>> >> +    macslot = &mac_slots[slot->slot_id];
>>>> >> +
>>>> >> +    if (macslot->present) {
>>>> >> +        if (macslot->size != slot->size) {
>>>> >> +            macslot->present = 0;
>>>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> +            assert_hvf_ok(ret);
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!slot->size) {
>>>> >> +        return 0;
>>>> >> +    }
>>>> >> +
>>>> >> +    macslot->present = 1;
>>>> >> +    macslot->gpa_start = slot->start;
>>>> >> +    macslot->size = slot->size;
>>>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> +{
>>>> >> +    hvf_slot *mem;
>>>> >> +    MemoryRegion *area = section->mr;
>>>> >> +    bool writeable = !area->readonly && !area->rom_device;
>>>> >> +    hv_memory_flags_t flags;
>>>> >> +
>>>> >> +    if (!memory_region_is_ram(area)) {
>>>> >> +        if (writeable) {
>>>> >> +            return;
>>>> >> +        } else if (!memory_region_is_romd(area)) {
>>>> >> +            /*
>>>> >> +             * If the memory device is not in romd_mode, then we actually want
>>>> >> +             * to remove the hvf memory slot so all accesses will trap.
>>>> >> +             */
>>>> >> +             add = false;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    mem = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    if (mem && add) {
>>>> >> +        if (mem->size == int128_get64(section->size) &&
>>>> >> +            mem->start == section->offset_within_address_space &&
>>>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> +            section->offset_within_region)) {
>>>> >> +            return; /* Same region is being re-registered; nothing to do. */
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Region needs to be reset. Set the size to 0 and remap it. */
>>>> >> +    if (mem) {
>>>> >> +        mem->size = 0;
>>>> >> +        if (do_hvf_set_memory(mem, 0)) {
>>>> >> +            error_report("Failed to reset overlapping slot");
>>>> >> +            abort();
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!add) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    if (area->readonly ||
>>>> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> +    } else {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Now make a new slot. */
>>>> >> +    int x;
>>>> >> +
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        mem = &hvf_state->slots[x];
>>>> >> +        if (!mem->size) {
>>>> >> +            break;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (x == hvf_state->num_slots) {
>>>> >> +        error_report("No free slots");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +
>>>> >> +    mem->size = int128_get64(section->size);
>>>> >> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> +    mem->start = section->offset_within_address_space;
>>>> >> +    mem->region = area;
>>>> >> +
>>>> >> +    if (do_hvf_set_memory(mem, flags)) {
>>>> >> +        error_report("Error registering new memory slot");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +
>>>> >> +    slot = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    /* protect region against writes; begin tracking it */
>>>> >> +    if (on) {
>>>> >> +        slot->flags |= HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ);
>>>> >> +    /* stop tracking region */
>>>> >> +    } else {
>>>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_start(MemoryListener *listener,
>>>> >> +                          MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (old != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_stop(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (new != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 0);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_sync(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    /*
>>>> >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> +     * tracking the region.
>>>> >> +     */
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_add(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, true);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_del(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, false);
>>>> >> +}
>>>> >> +
>>>> >> +static MemoryListener hvf_memory_listener = {
>>>> >> +    .priority = 10,
>>>> >> +    .region_add = hvf_region_add,
>>>> >> +    .region_del = hvf_region_del,
>>>> >> +    .log_start = hvf_log_start,
>>>> >> +    .log_stop = hvf_log_stop,
>>>> >> +    .log_sync = hvf_log_sync,
>>>> >> +};
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        hvf_get_registers(cpu);
>>>> >> +        cpu->vcpu_dirty = true;
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> +                                             run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    cpu->vcpu_dirty = true;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +{
>>>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    hvf_arch_vcpu_destroy(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +static void dummy_signal(int sig)
>>>> >> +{
>>>> >> +}
>>>> >> +
>>>> >> +static int hvf_init_vcpu(CPUState *cpu)
>>>> >> +{
>>>> >> +    int r;
>>>> >> +
>>>> >> +    /* init cpu signals */
>>>> >> +    sigset_t set;
>>>> >> +    struct sigaction sigact;
>>>> >> +
>>>> >> +    memset(&sigact, 0, sizeof(sigact));
>>>> >> +    sigact.sa_handler = dummy_signal;
>>>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> +
>>>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> +    sigdelset(&set, SIG_IPI);
>>>> >> +
>>>> >> +#ifdef __aarch64__
>>>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>>>> >> +#else
>>>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> +#endif
>>>> > I think the first __aarch64__ bit fits better in the arm part of the series.
>>>>
>>>>
>>>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>>>> ARM enablement.
>>>>
>>>>
>>>> >
>>>> >> +    cpu->vcpu_dirty = true;
>>>> >> +    assert_hvf_ok(r);
>>>> >> +
>>>> >> +    return hvf_arch_init_vcpu(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +/*
>>>> >> + * The HVF-specific vCPU thread function. On x86 hosts it should only run
>>>> >> + * when the CPU supports the VMX "unrestricted guest" feature.
>>>> >> + */
>>>> >> +static void *hvf_cpu_thread_fn(void *arg)
>>>> >> +{
>>>> >> +    CPUState *cpu = arg;
>>>> >> +
>>>> >> +    int r;
>>>> >> +
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    rcu_register_thread();
>>>> >> +
>>>> >> +    qemu_mutex_lock_iothread();
>>>> >> +    qemu_thread_get_self(cpu->thread);
>>>> >> +
>>>> >> +    cpu->thread_id = qemu_get_thread_id();
>>>> >> +    cpu->can_do_io = 1;
>>>> >> +    current_cpu = cpu;
>>>> >> +
>>>> >> +    hvf_init_vcpu(cpu);
>>>> >> +
>>>> >> +    /* signal CPU creation */
>>>> >> +    cpu_thread_signal_created(cpu);
>>>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> +
>>>> >> +    do {
>>>> >> +        if (cpu_can_run(cpu)) {
>>>> >> +            r = hvf_vcpu_exec(cpu);
>>>> >> +            if (r == EXCP_DEBUG) {
>>>> >> +                cpu_handle_guest_debug(cpu);
>>>> >> +            }
>>>> >> +        }
>>>> >> +        qemu_wait_io_event(cpu);
>>>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> +
>>>> >> +    hvf_vcpu_destroy(cpu);
>>>> >> +    cpu_thread_signal_destroyed(cpu);
>>>> >> +    qemu_mutex_unlock_iothread();
>>>> >> +    rcu_unregister_thread();
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> +{
>>>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> +
>>>> >> +    /*
>>>> >> +     * HVF currently does not support TCG, and only runs in
>>>> >> +     * unrestricted-guest mode.
>>>> >> +     */
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> +    qemu_cond_init(cpu->halt_cond);
>>>> >> +
>>>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> +             cpu->cpu_index);
>>>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> +}
>>>> >> +
>>>> >> +static const CpusAccel hvf_cpus = {
>>>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> +
>>>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> +};
>>>> >> +
>>>> >> +static int hvf_accel_init(MachineState *ms)
>>>> >> +{
>>>> >> +    int x;
>>>> >> +    hv_return_t ret;
>>>> >> +    HVFState *s;
>>>> >> +
>>>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    s = g_new0(HVFState, 1);
>>>> >> +
>>>> >> +    s->num_slots = 32;
>>>> >> +    for (x = 0; x < s->num_slots; ++x) {
>>>> >> +        s->slots[x].size = 0;
>>>> >> +        s->slots[x].slot_id = x;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_state = s;
>>>> >> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> +    cpus_register_accel(&hvf_cpus);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> +{
>>>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> +    ac->name = "HVF";
>>>> >> +    ac->init_machine = hvf_accel_init;
>>>> >> +    ac->allowed = &hvf_allowed;
>>>> >> +}
>>>> >> +
>>>> >> +static const TypeInfo hvf_accel_type = {
>>>> >> +    .name = TYPE_HVF_ACCEL,
>>>> >> +    .parent = TYPE_ACCEL,
>>>> >> +    .class_init = hvf_accel_class_init,
>>>> >> +};
>>>> >> +
>>>> >> +static void hvf_type_init(void)
>>>> >> +{
>>>> >> +    type_register_static(&hvf_accel_type);
>>>> >> +}
>>>> >> +
>>>> >> +type_init(hvf_type_init);
>>>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>>>> >> new file mode 100644
>>>> >> index 0000000000..dfd6b68dc7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/meson.build
>>>> >> @@ -0,0 +1,7 @@
>>>> >> +hvf_ss = ss.source_set()
>>>> >> +hvf_ss.add(files(
>>>> >> +  'hvf-all.c',
>>>> >> +  'hvf-cpus.c',
>>>> >> +))
>>>> >> +
>>>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>>>> >> diff --git a/accel/meson.build b/accel/meson.build
>>>> >> index b26cca227a..6de12ce5d5 100644
>>>> >> --- a/accel/meson.build
>>>> >> +++ b/accel/meson.build
>>>> >> @@ -1,5 +1,6 @@
>>>> >>   softmmu_ss.add(files('accel.c'))
>>>> >>
>>>> >> +subdir('hvf')
>>>> >>   subdir('qtest')
>>>> >>   subdir('kvm')
>>>> >>   subdir('tcg')
>>>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>>> >> new file mode 100644
>>>> >> index 0000000000..de9bad23a8
>>>> >> --- /dev/null
>>>> >> +++ b/include/sysemu/hvf_int.h
>>>> >> @@ -0,0 +1,69 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework (HVF) support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> + * See the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + */
>>>> >> +
>>>> >> +/* header to be included in HVF-specific code */
>>>> >> +
>>>> >> +#ifndef HVF_INT_H
>>>> >> +#define HVF_INT_H
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +#define HVF_MAX_VCPU 0x10
>>>> >> +
>>>> >> +extern struct hvf_state hvf_global;
>>>> >> +
>>>> >> +struct hvf_vm {
>>>> >> +    int id;
>>>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> +};
>>>> >> +
>>>> >> +struct hvf_state {
>>>> >> +    uint32_t version;
>>>> >> +    struct hvf_vm *vm;
>>>> >> +    uint64_t mem_quota;
>>>> >> +};
>>>> >> +
>>>> >> +/* hvf_slot flags */
>>>> >> +#define HVF_SLOT_LOG (1 << 0)
>>>> >> +
>>>> >> +typedef struct hvf_slot {
>>>> >> +    uint64_t start;
>>>> >> +    uint64_t size;
>>>> >> +    uint8_t *mem;
>>>> >> +    int slot_id;
>>>> >> +    uint32_t flags;
>>>> >> +    MemoryRegion *region;
>>>> >> +} hvf_slot;
>>>> >> +
>>>> >> +typedef struct hvf_vcpu_caps {
>>>> >> +    uint64_t vmx_cap_pinbased;
>>>> >> +    uint64_t vmx_cap_procbased;
>>>> >> +    uint64_t vmx_cap_procbased2;
>>>> >> +    uint64_t vmx_cap_entry;
>>>> >> +    uint64_t vmx_cap_exit;
>>>> >> +    uint64_t vmx_cap_preemption_timer;
>>>> >> +} hvf_vcpu_caps;
>>>> >> +
>>>> >> +struct HVFState {
>>>> >> +    AccelState parent;
>>>> >> +    hvf_slot slots[32];
>>>> >> +    int num_slots;
>>>> >> +
>>>> >> +    hvf_vcpu_caps *hvf_caps;
>>>> >> +};
>>>> >> +extern HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret);
>>>> >> +int hvf_get_registers(CPUState *cpu);
>>>> >> +int hvf_put_registers(CPUState *cpu);
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>>>> >> +int hvf_vcpu_exec(CPUState *cpu);
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >> +
>>>> >> +#endif
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>>>> >> deleted file mode 100644
>>>> >> index 817b3d7452..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.c
>>>> >> +++ /dev/null
>>>> >> @@ -1,131 +0,0 @@
>>>> >> -/*
>>>> >> - * Copyright 2008 IBM Corporation
>>>> >> - *           2008 Red Hat, Inc.
>>>> >> - * Copyright 2011 Intel Corporation
>>>> >> - * Copyright 2016 Veertu, Inc.
>>>> >> - * Copyright 2017 The Android Open Source Project
>>>> >> - *
>>>> >> - * QEMU Hypervisor.framework support
>>>> >> - *
>>>> >> - * This program is free software; you can redistribute it and/or
>>>> >> - * modify it under the terms of version 2 of the GNU General Public
>>>> >> - * License as published by the Free Software Foundation.
>>>> >> - *
>>>> >> - * This program is distributed in the hope that it will be useful,
>>>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> - * General Public License for more details.
>>>> >> - *
>>>> >> - * You should have received a copy of the GNU General Public License
>>>> >> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> - *
>>>> >> - * This file contain code under public domain from the hvdos project:
>>>> >> - * https://github.com/mist64/hvdos
>>>> >> - *
>>>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> - * All rights reserved.
>>>> >> - *
>>>> >> - * Redistribution and use in source and binary forms, with or without
>>>> >> - * modification, are permitted provided that the following conditions
>>>> >> - * are met:
>>>> >> - * 1. Redistributions of source code must retain the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer.
>>>> >> - * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer in the
>>>> >> - *    documentation and/or other materials provided with the distribution.
>>>> >> - *
>>>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> - * SUCH DAMAGE.
>>>> >> - */
>>>> >> -
>>>> >> -#include "qemu/osdep.h"
>>>> >> -#include "qemu/error-report.h"
>>>> >> -#include "qemu/main-loop.h"
>>>> >> -#include "sysemu/hvf.h"
>>>> >> -#include "sysemu/runstate.h"
>>>> >> -#include "target/i386/cpu.h"
>>>> >> -#include "qemu/guest-random.h"
>>>> >> -
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -/*
>>>> >> - * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> - * CPU supports the VMX "unrestricted guest" feature.
>>>> >> - */
>>>> >> -static void *hvf_cpu_thread_fn(void *arg)
>>>> >> -{
>>>> >> -    CPUState *cpu = arg;
>>>> >> -
>>>> >> -    int r;
>>>> >> -
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    rcu_register_thread();
>>>> >> -
>>>> >> -    qemu_mutex_lock_iothread();
>>>> >> -    qemu_thread_get_self(cpu->thread);
>>>> >> -
>>>> >> -    cpu->thread_id = qemu_get_thread_id();
>>>> >> -    cpu->can_do_io = 1;
>>>> >> -    current_cpu = cpu;
>>>> >> -
>>>> >> -    hvf_init_vcpu(cpu);
>>>> >> -
>>>> >> -    /* signal CPU creation */
>>>> >> -    cpu_thread_signal_created(cpu);
>>>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> -
>>>> >> -    do {
>>>> >> -        if (cpu_can_run(cpu)) {
>>>> >> -            r = hvf_vcpu_exec(cpu);
>>>> >> -            if (r == EXCP_DEBUG) {
>>>> >> -                cpu_handle_guest_debug(cpu);
>>>> >> -            }
>>>> >> -        }
>>>> >> -        qemu_wait_io_event(cpu);
>>>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> -
>>>> >> -    hvf_vcpu_destroy(cpu);
>>>> >> -    cpu_thread_signal_destroyed(cpu);
>>>> >> -    qemu_mutex_unlock_iothread();
>>>> >> -    rcu_unregister_thread();
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> -{
>>>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> -
>>>> >> -    /*
>>>> >> -     * HVF currently does not support TCG, and only runs in
>>>> >> -     * unrestricted-guest mode.
>>>> >> -     */
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> -    qemu_cond_init(cpu->halt_cond);
>>>> >> -
>>>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> -             cpu->cpu_index);
>>>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> -}
>>>> >> -
>>>> >> -const CpusAccel hvf_cpus = {
>>>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> -
>>>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> -};
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>>>> >> deleted file mode 100644
>>>> >> index ced31b82c0..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.h
>>>> >> +++ /dev/null
>>>> >> @@ -1,25 +0,0 @@
>>>> >> -/*
>>>> >> - * Accelerator CPUS Interface
>>>> >> - *
>>>> >> - * Copyright 2020 SUSE LLC
>>>> >> - *
>>>> >> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> - * See the COPYING file in the top-level directory.
>>>> >> - */
>>>> >> -
>>>> >> -#ifndef HVF_CPUS_H
>>>> >> -#define HVF_CPUS_H
>>>> >> -
>>>> >> -#include "sysemu/cpus.h"
>>>> >> -
>>>> >> -extern const CpusAccel hvf_cpus;
>>>> >> -
>>>> >> -int hvf_init_vcpu(CPUState *);
>>>> >> -int hvf_vcpu_exec(CPUState *);
>>>> >> -void hvf_cpu_synchronize_state(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>>>> >> -void hvf_vcpu_destroy(CPUState *);
>>>> >> -
>>>> >> -#endif /* HVF_CPUS_H */
>>>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>>>> >> index e0edffd077..6d56f8f6bb 100644
>>>> >> --- a/target/i386/hvf/hvf-i386.h
>>>> >> +++ b/target/i386/hvf/hvf-i386.h
>>>> >> @@ -18,57 +18,11 @@
>>>> >>
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "cpu.h"
>>>> >>   #include "x86.h"
>>>> >>
>>>> >> -#define HVF_MAX_VCPU 0x10
>>>> >> -
>>>> >> -extern struct hvf_state hvf_global;
>>>> >> -
>>>> >> -struct hvf_vm {
>>>> >> -    int id;
>>>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> -};
>>>> >> -
>>>> >> -struct hvf_state {
>>>> >> -    uint32_t version;
>>>> >> -    struct hvf_vm *vm;
>>>> >> -    uint64_t mem_quota;
>>>> >> -};
>>>> >> -
>>>> >> -/* hvf_slot flags */
>>>> >> -#define HVF_SLOT_LOG (1 << 0)
>>>> >> -
>>>> >> -typedef struct hvf_slot {
>>>> >> -    uint64_t start;
>>>> >> -    uint64_t size;
>>>> >> -    uint8_t *mem;
>>>> >> -    int slot_id;
>>>> >> -    uint32_t flags;
>>>> >> -    MemoryRegion *region;
>>>> >> -} hvf_slot;
>>>> >> -
>>>> >> -typedef struct hvf_vcpu_caps {
>>>> >> -    uint64_t vmx_cap_pinbased;
>>>> >> -    uint64_t vmx_cap_procbased;
>>>> >> -    uint64_t vmx_cap_procbased2;
>>>> >> -    uint64_t vmx_cap_entry;
>>>> >> -    uint64_t vmx_cap_exit;
>>>> >> -    uint64_t vmx_cap_preemption_timer;
>>>> >> -} hvf_vcpu_caps;
>>>> >> -
>>>> >> -struct HVFState {
>>>> >> -    AccelState parent;
>>>> >> -    hvf_slot slots[32];
>>>> >> -    int num_slots;
>>>> >> -
>>>> >> -    hvf_vcpu_caps *hvf_caps;
>>>> >> -};
>>>> >> -extern HVFState *hvf_state;
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >>
>>>> >>   #ifdef NEED_CPU_H
>>>> >>   /* Functions exported to host specific mode */
>>>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>>>> >> index ed9356565c..8b96ecd619 100644
>>>> >> --- a/target/i386/hvf/hvf.c
>>>> >> +++ b/target/i386/hvf/hvf.c
>>>> >> @@ -51,6 +51,7 @@
>>>> >>   #include "qemu/error-report.h"
>>>> >>
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "sysemu/runstate.h"
>>>> >>   #include "hvf-i386.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -72,171 +73,6 @@
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "target/i386/cpu.h"
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -HVFState *hvf_state;
>>>> >> -
>>>> >> -static void assert_hvf_ok(hv_return_t ret)
>>>> >> -{
>>>> >> -    if (ret == HV_SUCCESS) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    switch (ret) {
>>>> >> -    case HV_ERROR:
>>>> >> -        error_report("Error: HV_ERROR");
>>>> >> -        break;
>>>> >> -    case HV_BUSY:
>>>> >> -        error_report("Error: HV_BUSY");
>>>> >> -        break;
>>>> >> -    case HV_BAD_ARGUMENT:
>>>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> -        break;
>>>> >> -    case HV_NO_RESOURCES:
>>>> >> -        error_report("Error: HV_NO_RESOURCES");
>>>> >> -        break;
>>>> >> -    case HV_NO_DEVICE:
>>>> >> -        error_report("Error: HV_NO_DEVICE");
>>>> >> -        break;
>>>> >> -    case HV_UNSUPPORTED:
>>>> >> -        error_report("Error: HV_UNSUPPORTED");
>>>> >> -        break;
>>>> >> -    default:
>>>> >> -        error_report("Unknown Error");
>>>> >> -    }
>>>> >> -
>>>> >> -    abort();
>>>> >> -}
>>>> >> -
>>>> >> -/* Memory slots */
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -    int x;
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        slot = &hvf_state->slots[x];
>>>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> -            (start + size) > slot->start) {
>>>> >> -            return slot;
>>>> >> -        }
>>>> >> -    }
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -struct mac_slot {
>>>> >> -    int present;
>>>> >> -    uint64_t size;
>>>> >> -    uint64_t gpa_start;
>>>> >> -    uint64_t gva;
>>>> >> -};
>>>> >> -
>>>> >> -struct mac_slot mac_slots[32];
>>>> >> -
>>>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> -{
>>>> >> -    struct mac_slot *macslot;
>>>> >> -    hv_return_t ret;
>>>> >> -
>>>> >> -    macslot = &mac_slots[slot->slot_id];
>>>> >> -
>>>> >> -    if (macslot->present) {
>>>> >> -        if (macslot->size != slot->size) {
>>>> >> -            macslot->present = 0;
>>>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> -            assert_hvf_ok(ret);
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!slot->size) {
>>>> >> -        return 0;
>>>> >> -    }
>>>> >> -
>>>> >> -    macslot->present = 1;
>>>> >> -    macslot->gpa_start = slot->start;
>>>> >> -    macslot->size = slot->size;
>>>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> -{
>>>> >> -    hvf_slot *mem;
>>>> >> -    MemoryRegion *area = section->mr;
>>>> >> -    bool writeable = !area->readonly && !area->rom_device;
>>>> >> -    hv_memory_flags_t flags;
>>>> >> -
>>>> >> -    if (!memory_region_is_ram(area)) {
>>>> >> -        if (writeable) {
>>>> >> -            return;
>>>> >> -        } else if (!memory_region_is_romd(area)) {
>>>> >> -            /*
>>>> >> -             * If the memory device is not in romd_mode, then we actually want
>>>> >> -             * to remove the hvf memory slot so all accesses will trap.
>>>> >> -             */
>>>> >> -             add = false;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    mem = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    if (mem && add) {
>>>> >> -        if (mem->size == int128_get64(section->size) &&
>>>> >> -            mem->start == section->offset_within_address_space &&
>>>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> -            section->offset_within_region)) {
>>>> >> -            return; /* Same region was attempted to register, go away. */
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>>>> >> -    if (mem) {
>>>> >> -        mem->size = 0;
>>>> >> -        if (do_hvf_set_memory(mem, 0)) {
>>>> >> -            error_report("Failed to reset overlapping slot");
>>>> >> -            abort();
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!add) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    if (area->readonly ||
>>>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> -    } else {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Now make a new slot. */
>>>> >> -    int x;
>>>> >> -
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        mem = &hvf_state->slots[x];
>>>> >> -        if (!mem->size) {
>>>> >> -            break;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (x == hvf_state->num_slots) {
>>>> >> -        error_report("No free slots");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -
>>>> >> -    mem->size = int128_get64(section->size);
>>>> >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> -    mem->start = section->offset_within_address_space;
>>>> >> -    mem->region = area;
>>>> >> -
>>>> >> -    if (do_hvf_set_memory(mem, flags)) {
>>>> >> -        error_report("Error registering new memory slot");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >>   void vmx_update_tpr(CPUState *cpu)
>>>> >>   {
>>>> >>       /* TODO: need integrate APIC handling */
>>>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>>> >>       }
>>>> >>   }
>>>> >>
>>>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        hvf_get_registers(cpu);
>>>> >> -        cpu->vcpu_dirty = true;
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> -                                             run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    cpu->vcpu_dirty = true;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>   {
>>>> >>       int read, write;
>>>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>       return false;
>>>> >>   }
>>>> >>
>>>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -
>>>> >> -    slot = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    /* protect region against writes; begin tracking it */
>>>> >> -    if (on) {
>>>> >> -        slot->flags |= HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ);
>>>> >> -    /* stop tracking region*/
>>>> >> -    } else {
>>>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_start(MemoryListener *listener,
>>>> >> -                          MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (old != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_stop(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (new != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 0);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_sync(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    /*
>>>> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> -     * tracking the region.
>>>> >> -     */
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_add(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, true);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_del(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, false);
>>>> >> -}
>>>> >> -
>>>> >> -static MemoryListener hvf_memory_listener = {
>>>> >> -    .priority = 10,
>>>> >> -    .region_add = hvf_region_add,
>>>> >> -    .region_del = hvf_region_del,
>>>> >> -    .log_start = hvf_log_start,
>>>> >> -    .log_stop = hvf_log_stop,
>>>> >> -    .log_sync = hvf_log_sync,
>>>> >> -};
>>>> >> -
>>>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>>> >>   {
>>>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86_cpu->env;
>>>> >>
>>>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>>> >>       g_free(env->hvf_mmio_buf);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -}
>>>> >> -
>>>> >> -static void dummy_signal(int sig)
>>>> >> -{
>>>> >>   }
>>>> >>
>>>> >> -int hvf_init_vcpu(CPUState *cpu)
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>>>> >>   {
>>>> >>
>>>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86cpu->env;
>>>> >> -    int r;
>>>> >> -
>>>> >> -    /* init cpu signals */
>>>> >> -    sigset_t set;
>>>> >> -    struct sigaction sigact;
>>>> >> -
>>>> >> -    memset(&sigact, 0, sizeof(sigact));
>>>> >> -    sigact.sa_handler = dummy_signal;
>>>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> -
>>>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> -    sigdelset(&set, SIG_IPI);
>>>> >>
>>>> >>       init_emu();
>>>> >>       init_decoder();
>>>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>>>> >>
>>>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> -    cpu->vcpu_dirty = 1;
>>>> >> -    assert_hvf_ok(r);
>>>> >> -
>>>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>>> >>           abort();
>>>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>> >>
>>>> >>       return ret;
>>>> >>   }
>>>> >> -
>>>> >> -bool hvf_allowed;
>>>> >> -
>>>> >> -static int hvf_accel_init(MachineState *ms)
>>>> >> -{
>>>> >> -    int x;
>>>> >> -    hv_return_t ret;
>>>> >> -    HVFState *s;
>>>> >> -
>>>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -
>>>> >> -    s = g_new0(HVFState, 1);
>>>> >> -
>>>> >> -    s->num_slots = 32;
>>>> >> -    for (x = 0; x < s->num_slots; ++x) {
>>>> >> -        s->slots[x].size = 0;
>>>> >> -        s->slots[x].slot_id = x;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_state = s;
>>>> >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> -    cpus_register_accel(&hvf_cpus);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> -{
>>>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> -    ac->name = "HVF";
>>>> >> -    ac->init_machine = hvf_accel_init;
>>>> >> -    ac->allowed = &hvf_allowed;
>>>> >> -}
>>>> >> -
>>>> >> -static const TypeInfo hvf_accel_type = {
>>>> >> -    .name = TYPE_HVF_ACCEL,
>>>> >> -    .parent = TYPE_ACCEL,
>>>> >> -    .class_init = hvf_accel_class_init,
>>>> >> -};
>>>> >> -
>>>> >> -static void hvf_type_init(void)
>>>> >> -{
>>>> >> -    type_register_static(&hvf_accel_type);
>>>> >> -}
>>>> >> -
>>>> >> -type_init(hvf_type_init);
>>>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>>>> >> index 409c9a3f14..c8a43717ee 100644
>>>> >> --- a/target/i386/hvf/meson.build
>>>> >> +++ b/target/i386/hvf/meson.build
>>>> >> @@ -1,6 +1,5 @@
>>>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>>> >>     'hvf.c',
>>>> >> -  'hvf-cpus.c',
>>>> >>     'x86.c',
>>>> >>     'x86_cpuid.c',
>>>> >>     'x86_decode.c',
>>>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>>>> >> index bbec412b6c..89b8e9d87a 100644
>>>> >> --- a/target/i386/hvf/x86hvf.c
>>>> >> +++ b/target/i386/hvf/x86hvf.c
>>>> >> @@ -20,6 +20,9 @@
>>>> >>   #include "qemu/osdep.h"
>>>> >>
>>>> >>   #include "qemu-common.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/hw_accel.h"
>>>> >>   #include "x86hvf.h"
>>>> >>   #include "vmx.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -32,8 +35,6 @@
>>>> >>   #include <Hypervisor/hv.h>
>>>> >>   #include <Hypervisor/hv_vmx.h>
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>>> >>                        SegmentCache *qseg, bool is_tr)
>>>> >>   {
>>>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>>> >>
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_init(cpu);
>>>> >>       }
>>>> >>
>>>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>           cpu_state->halted = 0;
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_sipi(cpu);
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>>>> > be a separate patch. It follows the cpu/accel cleanups Claudio was
>>>> > doing this summer.
>>>>
>>>>
>>>> The only reason they're in here is because we no longer have access to
>>>> the hvf_ functions from the file. I am perfectly happy to rebase the
>>>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>>>> trivial for him to rebase on top of this too if my series goes in first.
>>>>
>>>>
>>>> >
>>>> > Philippe raised the idea that this patch might go ahead of the
>>>> > ARM-specific part (which might involve some discussions), and I agree
>>>> > with that.
>>>> >
>>>> > Some sync between Claudio's series (CC'd him) and this patch might be
>>>> > needed.
>>>>
>>>>
>>>> I would prefer not to hold back because of the sync. Claudio's cleanup
>>>> is trivial enough to adjust for if it gets merged ahead of this.
>>>>
>>>>
>>>> Alex
>>>>
>>>>
>>>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 21:08               ` Peter Collingbourne
@ 2020-11-30 21:40                 ` Alexander Graf
  2020-11-30 23:01                   ` Peter Collingbourne
  2020-12-01  0:37                   ` Roman Bolshakov
  0 siblings, 2 replies; 64+ messages in thread
From: Alexander Graf @ 2020-11-30 21:40 UTC (permalink / raw)
  To: Peter Collingbourne, Frank Yang
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Paolo Bonzini

Hi Peter,

On 30.11.20 22:08, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>
>>
>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>> Hi Frank,
>>>
>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just reading cntpct_el0 on the host directly and comparing it with cval is that if we sleep, cval is going to be much less than cntpct_el0 by the sleep time. If we can get either the architecture or macOS to read out the sleep time, then we might be able to avoid the poll interval entirely!
>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>
>>>    https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>
>> Thanks, we'll take a look :)
>>
>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> I'll reply to your patch here. You have:
>
> +                    /* Set cpu->hvf->sleeping so that we get a
> SIG_IPI signal. */
> +                    cpu->hvf->sleeping = true;
> +                    smp_mb();
> +
> +                    /* Bail out if we received an IRQ meanwhile */
> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> +                        cpu->hvf->sleeping = false;
> +                        break;
> +                    }
> +
> +                    /* nanosleep returns on signal, so we wake up on kick. */
> +                    nanosleep(ts, NULL);
>
> and then send the signal conditional on whether sleeping is true, but
> I think this is racy. If the signal is sent after sleeping is set to
> true but before entering nanosleep then I think it will be ignored and
> we will miss the wakeup. That's why in my implementation I block IPI
> on the CPU thread at startup and then use pselect to atomically
> unblock and begin sleeping. The signal is sent unconditionally so
> there's no need to worry about races between actually sleeping and the
> "we think we're sleeping" state. It may lead to an extra wakeup but
> that's better than missing it entirely.


Thanks a bunch for the comment! So the trick I was using here is to 
modify the timespec from the kick function before sending the IPI 
signal. That way, we know that we are either inside the sleep (where 
the signal wakes us up) or outside the sleep (where timespec={} makes 
nanosleep return immediately).
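
[Editor's note: that trick can be sketched as follows. This is a
hypothetical, single-threaded demo with made-up names; in the real patch
the kick runs on another thread and is followed by pthread_kill() with
SIG_IPI plus a memory barrier.]

```c
#include <stdbool.h>
#include <time.h>

/* Shared between the vCPU thread and the kicker (names invented for
 * this sketch). */
static volatile bool sleeping;
static struct timespec sleep_ts;

/* Kicker side: zero the timespec *before* sending SIG_IPI.  Whichever
 * side of nanosleep() the vCPU is on, it cannot oversleep:
 *   - already inside nanosleep(): the signal interrupts it (EINTR);
 *   - not yet inside: nanosleep() sees {0, 0} and returns at once. */
static void kick_vcpu(void)
{
    sleep_ts = (struct timespec){ 0, 0 };
    /* smp_mb() + pthread_kill(vcpu_thread, SIG_IPI) in the real code */
}

/* vCPU side: emulate WFI by sleeping toward the timer deadline. */
static int vcpu_wfi(void)
{
    sleeping = true;
    int r = nanosleep(&sleep_ts, NULL);
    sleeping = false;
    return r;
}
```

If the kick lands between setting sleeping = true and entering
nanosleep(), the zeroed timespec turns the sleep into a no-op; the
residual race discussed below is nanosleep() reading the timespec just
as it is being rewritten.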

The only race I can think of is if nanosleep does calculations based on 
the timespec and we happen to send the signal right there and then.

The problem with blocking IPIs is basically what Frank was describing 
earlier: How do you unset the IPI signal pending status? If the signal 
is never delivered, how can pselect differentiate "signal from last time 
is still pending" from "new signal because I got an IPI"?


Alex




* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 20:55             ` Frank Yang
  2020-11-30 21:08               ` Peter Collingbourne
@ 2020-11-30 22:10               ` Peter Maydell
  2020-12-01  2:49                 ` Frank Yang
  2020-11-30 22:46               ` Peter Collingbourne
  2 siblings, 1 reply; 64+ messages in thread
From: Peter Maydell @ 2020-11-30 22:10 UTC (permalink / raw)
  To: Frank Yang
  Cc: Eduardo Habkost, Richard Henderson, qemu-devel, Cameron Esfahani,
	Roman Bolshakov, Alexander Graf, Claudio Fontana, qemu-arm,
	Paolo Bonzini, Peter Collingbourne

On Mon, 30 Nov 2020 at 20:56, Frank Yang <lfy@google.com> wrote:
> We'd actually like to contribute upstream too :) We do want to maintain
> our own downstream though; the Android Emulator codebase needs to work
> solidly on macOS and Windows, which has made keeping up with upstream difficult

One of the main reasons why OSX and Windows support upstream is
not so great is because very few people are helping to develop,
test and support it upstream. The way to fix that IMHO is for more
people who do care about those platforms to actively engage
with us upstream to help in making those platforms move closer to
being first class citizens. If you stay on a downstream fork
forever then I don't think you'll ever see things improve.

thanks
-- PMM



* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 20:55             ` Frank Yang
  2020-11-30 21:08               ` Peter Collingbourne
  2020-11-30 22:10               ` Peter Maydell
@ 2020-11-30 22:46               ` Peter Collingbourne
  2 siblings, 0 replies; 64+ messages in thread
From: Peter Collingbourne @ 2020-11-30 22:46 UTC (permalink / raw)
  To: Frank Yang
  Cc: Alexander Graf, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>
>
>
> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> Hi Frank,
>>
>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>
> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!

We tracked down the discrepancy between CNTPCT_EL0 on the guest and
on the host to the fact that CNTPCT_EL0 on the guest does not
increment while the system is asleep, and as such corresponds to
mach_absolute_time() on the host (if you read the XNU sources you will
see that mach_absolute_time() is implemented as CNTPCT_EL0 plus a
constant representing the time spent asleep), whereas CNTPCT_EL0 on
the host does increment while asleep. This patch switches the
implementation over to using mach_absolute_time() instead of reading
CNTPCT_EL0 directly:

https://android-review.googlesource.com/c/platform/external/qemu/+/1514870
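
[Editor's note: Linux draws an analogous distinction between
CLOCK_MONOTONIC and CLOCK_BOOTTIME, which differ by the accumulated
suspend time. A Linux-only sketch of that invariant, for illustration;
the macOS patch above uses mach_absolute_time() instead.]

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <time.h>

/* Returns boottime minus monotonic in nanoseconds: the accumulated
 * suspend time, analogous to the constant XNU folds into
 * mach_absolute_time().  Never negative (the boottime read happens
 * second, and suspend only ever widens the gap). */
static int64_t suspended_ns(void)
{
    struct timespec mono, boot;

    clock_gettime(CLOCK_MONOTONIC, &mono);
    clock_gettime(CLOCK_BOOTTIME, &boot);

    int64_t mono_ns = (int64_t)mono.tv_sec * 1000000000 + mono.tv_nsec;
    int64_t boot_ns = (int64_t)boot.tv_sec * 1000000000 + boot.tv_nsec;
    return boot_ns - mono_ns;
}
```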

Peter

>>
>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>
>>   https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>
>
> Thanks, we'll take a look :)
>
>>
>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>
>> Also, is there a particular reason you're working on this super interesting and useful code in a random downstream fork of QEMU? Wouldn't it be more helpful to contribute to the upstream code base instead?
>
> We'd actually like to contribute upstream too :) We do want to maintain our own downstream though; the Android Emulator codebase needs to work solidly on macOS and Windows, which has made keeping up with upstream difficult, and staying on a previous version (2.12) with known quirks easier. (There's also some Android-related customization relating to the Qt UI plus a different set of virtual devices and snapshot support (incl. snapshots of graphics devices with OpenGLES state tracking), which we hope to separate into other libraries/processes, but it's not insignificant.)
>>
>>
>> Alex
>>
>>
>> On 30.11.20 21:15, Frank Yang wrote:
>>
>> Update: We're not quite sure how to compare the CNTV_CVAL and CNTVCT. But the high CPU usage seems to be mitigated by having a poll interval (like KVM does) in handling WFI:
>>
>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512501
>>
>> This is loosely inspired by https://elixir.bootlin.com/linux/v5.10-rc6/source/virt/kvm/kvm_main.c#L2766 which does seem to specify a poll interval.
>>
>> It would be cool if we could have a lightweight way to enter sleep and restart the vcpus precisely when CVAL passes, though.
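
[Editor's note: the core of such a mechanism would be converting the
remaining guest timer ticks into a host sleep duration. A hedged
sketch; the helper name and the 62.5 MHz example frequency are invented,
and overflow handling for very large deltas is elided.]

```c
#include <stdint.h>
#include <time.h>

/* Convert the distance between the guest's timer compare value
 * (CNTV_CVAL) and the current counter (CNTVCT) into a host timespec,
 * using the counter frequency (CNTFRQ).  A zero timespec means the
 * deadline has already passed. */
static struct timespec cval_to_timespec(uint64_t cval, uint64_t cntvct,
                                        uint64_t cntfrq_hz)
{
    struct timespec ts = { 0, 0 };

    if (cval > cntvct) {
        uint64_t ticks = cval - cntvct;
        /* ticks * 1e9 can overflow for huge deltas; clamp in real code */
        uint64_t ns = ticks * 1000000000ULL / cntfrq_hz;

        ts.tv_sec = ns / 1000000000ULL;
        ts.tv_nsec = ns % 1000000000ULL;
    }
    return ts;
}
```

The result could then feed the nanosleep()/pselect() WFI path discussed
elsewhere in this thread, waking the vCPU close to when CVAL fires
instead of polling.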
>>
>> Frank
>>
>>
>> On Fri, Nov 27, 2020 at 3:30 PM Frank Yang <lfy@google.com> wrote:
>>>
>>> Hi all,
>>>
>>> +Peter Collingbourne
>>>
>>> I'm a developer on the Android Emulator, which is in a fork of QEMU.
>>>
>>> Peter and I have been working on an HVF Apple Silicon backend with an eye toward Android guests.
>>>
>>> We have gotten things to basically switch to Android userspace already (logcat/shell and graphics available at least)
>>>
>>> Our strategy so far has been to import logic from the KVM implementation and hook into QEMU's software devices that previously assumed to only work with TCG, or have KVM-specific paths.
>>>
>>> Thanks to Alexander for the tip on the 36-bit address space limitation, btw; our way of addressing this is to still allow highmem but not put the PCI high MMIO so high.
>>>
>>> Also, note we have a sleep/signal based mechanism to deal with WFx, which might be worth looking into in Alexander's implementation as well:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551
>>>
>>> Patches so far, FYI:
>>>
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1513429/1
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512554/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512553/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512552/3
>>> https://android-review.googlesource.com/c/platform/external/qemu/+/1512551/3
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/c17eb6a3ffd50047e9646aff6640b710cb8ff48a
>>> https://android.googlesource.com/platform/external/qemu/+/74bed16de1afb41b7a7ab8da1d1861226c9db63b
>>> https://android.googlesource.com/platform/external/qemu/+/eccd9e47ab2ccb9003455e3bb721f57f9ebc3c01
>>> https://android.googlesource.com/platform/external/qemu/+/54fe3d67ed4698e85826537a4f49b2b9074b2228
>>> https://android.googlesource.com/platform/external/qemu/+/82ef91a6fede1d1000f36be037ad4d58fbe0d102
>>> https://android.googlesource.com/platform/external/qemu/+/c28147aa7c74d98b858e99623d2fe46e74a379f6
>>>
>>> Peter's also noticed that there are extra steps needed for M1's to allow TCG to work, as it involves JIT:
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/740e3fe47f88926c6bda9abb22ee6eae1bc254a9
>>>
>>> We'd appreciate any feedback/comments :)
>>>
>>> Best,
>>>
>>> Frank
>>>
>>> On Fri, Nov 27, 2020 at 1:57 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>>
>>>> On 27.11.20 21:00, Roman Bolshakov wrote:
>>>> > On Thu, Nov 26, 2020 at 10:50:11PM +0100, Alexander Graf wrote:
>>>> >> Until now, Hypervisor.framework has only been available on x86_64 systems.
>>>> >> With Apple Silicon shipping now, it extends its reach to aarch64. To
>>>> >> prepare for support for multiple architectures, let's move common code out
>>>> >> into its own accel directory.
>>>> >>
>>>> >> Signed-off-by: Alexander Graf <agraf@csgraf.de>
>>>> >> ---
>>>> >>   MAINTAINERS                 |   9 +-
>>>> >>   accel/hvf/hvf-all.c         |  56 +++++
>>>> >>   accel/hvf/hvf-cpus.c        | 468 ++++++++++++++++++++++++++++++++++++
>>>> >>   accel/hvf/meson.build       |   7 +
>>>> >>   accel/meson.build           |   1 +
>>>> >>   include/sysemu/hvf_int.h    |  69 ++++++
>>>> >>   target/i386/hvf/hvf-cpus.c  | 131 ----------
>>>> >>   target/i386/hvf/hvf-cpus.h  |  25 --
>>>> >>   target/i386/hvf/hvf-i386.h  |  48 +---
>>>> >>   target/i386/hvf/hvf.c       | 360 +--------------------------
>>>> >>   target/i386/hvf/meson.build |   1 -
>>>> >>   target/i386/hvf/x86hvf.c    |  11 +-
>>>> >>   target/i386/hvf/x86hvf.h    |   2 -
>>>> >>   13 files changed, 619 insertions(+), 569 deletions(-)
>>>> >>   create mode 100644 accel/hvf/hvf-all.c
>>>> >>   create mode 100644 accel/hvf/hvf-cpus.c
>>>> >>   create mode 100644 accel/hvf/meson.build
>>>> >>   create mode 100644 include/sysemu/hvf_int.h
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.c
>>>> >>   delete mode 100644 target/i386/hvf/hvf-cpus.h
>>>> >>
>>>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> >> index 68bc160f41..ca4b6d9279 100644
>>>> >> --- a/MAINTAINERS
>>>> >> +++ b/MAINTAINERS
>>>> >> @@ -444,9 +444,16 @@ M: Cameron Esfahani <dirty@apple.com>
>>>> >>   M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >>   W: https://wiki.qemu.org/Features/HVF
>>>> >>   S: Maintained
>>>> >> -F: accel/stubs/hvf-stub.c
>>>> > There was a patch for that in the RFC series from Claudio.
>>>>
>>>>
>>>> Yeah, I'm not worried about this hunk :).
>>>>
>>>>
>>>> >
>>>> >>   F: target/i386/hvf/
>>>> >> +
>>>> >> +HVF
>>>> >> +M: Cameron Esfahani <dirty@apple.com>
>>>> >> +M: Roman Bolshakov <r.bolshakov@yadro.com>
>>>> >> +W: https://wiki.qemu.org/Features/HVF
>>>> >> +S: Maintained
>>>> >> +F: accel/hvf/
>>>> >>   F: include/sysemu/hvf.h
>>>> >> +F: include/sysemu/hvf_int.h
>>>> >>
>>>> >>   WHPX CPUs
>>>> >>   M: Sunil Muthuswamy <sunilmut@microsoft.com>
>>>> >> diff --git a/accel/hvf/hvf-all.c b/accel/hvf/hvf-all.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..47d77a472a
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-all.c
>>>> >> @@ -0,0 +1,56 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>>> >> + * the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + * Contributions after 2012-01-13 are licensed under the terms of the
>>>> >> + * GNU GPL, version 2 or (at your option) any later version.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu-common.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "sysemu/accel.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +bool hvf_allowed;
>>>> >> +HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret)
>>>> >> +{
>>>> >> +    if (ret == HV_SUCCESS) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    switch (ret) {
>>>> >> +    case HV_ERROR:
>>>> >> +        error_report("Error: HV_ERROR");
>>>> >> +        break;
>>>> >> +    case HV_BUSY:
>>>> >> +        error_report("Error: HV_BUSY");
>>>> >> +        break;
>>>> >> +    case HV_BAD_ARGUMENT:
>>>> >> +        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> +        break;
>>>> >> +    case HV_NO_RESOURCES:
>>>> >> +        error_report("Error: HV_NO_RESOURCES");
>>>> >> +        break;
>>>> >> +    case HV_NO_DEVICE:
>>>> >> +        error_report("Error: HV_NO_DEVICE");
>>>> >> +        break;
>>>> >> +    case HV_UNSUPPORTED:
>>>> >> +        error_report("Error: HV_UNSUPPORTED");
>>>> >> +        break;
>>>> >> +    default:
>>>> >> +        error_report("Unknown Error");
>>>> >> +    }
>>>> >> +
>>>> >> +    abort();
>>>> >> +}
>>>> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>>> >> new file mode 100644
>>>> >> index 0000000000..f9bb5502b7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/hvf-cpus.c
>>>> >> @@ -0,0 +1,468 @@
>>>> >> +/*
>>>> >> + * Copyright 2008 IBM Corporation
>>>> >> + *           2008 Red Hat, Inc.
>>>> >> + * Copyright 2011 Intel Corporation
>>>> >> + * Copyright 2016 Veertu, Inc.
>>>> >> + * Copyright 2017 The Android Open Source Project
>>>> >> + *
>>>> >> + * QEMU Hypervisor.framework support
>>>> >> + *
>>>> >> + * This program is free software; you can redistribute it and/or
>>>> >> + * modify it under the terms of version 2 of the GNU General Public
>>>> >> + * License as published by the Free Software Foundation.
>>>> >> + *
>>>> >> + * This program is distributed in the hope that it will be useful,
>>>> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> + * General Public License for more details.
>>>> >> + *
>>>> >> + * You should have received a copy of the GNU General Public License
>>>> >> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> + *
>>>> >> + * This file contain code under public domain from the hvdos project:
>>>> >> + * https://github.com/mist64/hvdos
>>>> >> + *
>>>> >> + * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> + * All rights reserved.
>>>> >> + *
>>>> >> + * Redistribution and use in source and binary forms, with or without
>>>> >> + * modification, are permitted provided that the following conditions
>>>> >> + * are met:
>>>> >> + * 1. Redistributions of source code must retain the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer.
>>>> >> + * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> + *    notice, this list of conditions and the following disclaimer in the
>>>> >> + *    documentation and/or other materials provided with the distribution.
>>>> >> + *
>>>> >> + * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> + * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> + * SUCH DAMAGE.
>>>> >> + */
>>>> >> +
>>>> >> +#include "qemu/osdep.h"
>>>> >> +#include "qemu/error-report.h"
>>>> >> +#include "qemu/main-loop.h"
>>>> >> +#include "exec/address-spaces.h"
>>>> >> +#include "exec/exec-all.h"
>>>> >> +#include "sysemu/cpus.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/runstate.h"
>>>> >> +#include "qemu/guest-random.h"
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +/* Memory slots */
>>>> >> +
>>>> >> +struct mac_slot {
>>>> >> +    int present;
>>>> >> +    uint64_t size;
>>>> >> +    uint64_t gpa_start;
>>>> >> +    uint64_t gva;
>>>> >> +};
>>>> >> +
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +    int x;
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        slot = &hvf_state->slots[x];
>>>> >> +        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> +            (start + size) > slot->start) {
>>>> >> +            return slot;
>>>> >> +        }
>>>> >> +    }
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +struct mac_slot mac_slots[32];
>>>> >> +
>>>> >> +static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> +{
>>>> >> +    struct mac_slot *macslot;
>>>> >> +    hv_return_t ret;
>>>> >> +
>>>> >> +    macslot = &mac_slots[slot->slot_id];
>>>> >> +
>>>> >> +    if (macslot->present) {
>>>> >> +        if (macslot->size != slot->size) {
>>>> >> +            macslot->present = 0;
>>>> >> +            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> +            assert_hvf_ok(ret);
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!slot->size) {
>>>> >> +        return 0;
>>>> >> +    }
>>>> >> +
>>>> >> +    macslot->present = 1;
>>>> >> +    macslot->gpa_start = slot->start;
>>>> >> +    macslot->size = slot->size;
>>>> >> +    ret = hv_vm_map(slot->mem, slot->start, slot->size, flags);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> +{
>>>> >> +    hvf_slot *mem;
>>>> >> +    MemoryRegion *area = section->mr;
>>>> >> +    bool writeable = !area->readonly && !area->rom_device;
>>>> >> +    hv_memory_flags_t flags;
>>>> >> +
>>>> >> +    if (!memory_region_is_ram(area)) {
>>>> >> +        if (writeable) {
>>>> >> +            return;
>>>> >> +        } else if (!memory_region_is_romd(area)) {
>>>> >> +            /*
>>>> >> +             * If the memory device is not in romd_mode, then we actually want
>>>> >> +             * to remove the hvf memory slot so all accesses will trap.
>>>> >> +             */
>>>> >> +             add = false;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    mem = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    if (mem && add) {
>>>> >> +        if (mem->size == int128_get64(section->size) &&
>>>> >> +            mem->start == section->offset_within_address_space &&
>>>> >> +            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> +            section->offset_within_region)) {
>>>> >> +            return; /* Same region was already registered; nothing to do. */
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Region needs to be reset: set the size to 0 and remap it. */
>>>> >> +    if (mem) {
>>>> >> +        mem->size = 0;
>>>> >> +        if (do_hvf_set_memory(mem, 0)) {
>>>> >> +            error_report("Failed to reset overlapping slot");
>>>> >> +            abort();
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (!add) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    if (area->readonly ||
>>>> >> +        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> +    } else {
>>>> >> +        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> +    }
>>>> >> +
>>>> >> +    /* Now make a new slot. */
>>>> >> +    int x;
>>>> >> +
>>>> >> +    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> +        mem = &hvf_state->slots[x];
>>>> >> +        if (!mem->size) {
>>>> >> +            break;
>>>> >> +        }
>>>> >> +    }
>>>> >> +
>>>> >> +    if (x == hvf_state->num_slots) {
>>>> >> +        error_report("No free slots");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +
>>>> >> +    mem->size = int128_get64(section->size);
>>>> >> +    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> +    mem->start = section->offset_within_address_space;
>>>> >> +    mem->region = area;
>>>> >> +
>>>> >> +    if (do_hvf_set_memory(mem, flags)) {
>>>> >> +        error_report("Error registering new memory slot");
>>>> >> +        abort();
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> +{
>>>> >> +    hvf_slot *slot;
>>>> >> +
>>>> >> +    slot = hvf_find_overlap_slot(
>>>> >> +            section->offset_within_address_space,
>>>> >> +            int128_get64(section->size));
>>>> >> +
>>>> >> +    /* protect region against writes; begin tracking it */
>>>> >> +    if (on) {
>>>> >> +        slot->flags |= HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ);
>>>> >> +    /* stop tracking region */
>>>> >> +    } else {
>>>> >> +        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> +        hv_vm_protect((uintptr_t)slot->start, (size_t)slot->size,
>>>> >> +                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_start(MemoryListener *listener,
>>>> >> +                          MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (old != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_stop(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section, int old, int new)
>>>> >> +{
>>>> >> +    if (new != 0) {
>>>> >> +        return;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_set_dirty_tracking(section, 0);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_log_sync(MemoryListener *listener,
>>>> >> +                         MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    /*
>>>> >> +     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> +     * tracking the region.
>>>> >> +     */
>>>> >> +    hvf_set_dirty_tracking(section, 1);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_add(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, true);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_region_del(MemoryListener *listener,
>>>> >> +                           MemoryRegionSection *section)
>>>> >> +{
>>>> >> +    hvf_set_phys_mem(section, false);
>>>> >> +}
>>>> >> +
>>>> >> +static MemoryListener hvf_memory_listener = {
>>>> >> +    .priority = 10,
>>>> >> +    .region_add = hvf_region_add,
>>>> >> +    .region_del = hvf_region_del,
>>>> >> +    .log_start = hvf_log_start,
>>>> >> +    .log_stop = hvf_log_stop,
>>>> >> +    .log_sync = hvf_log_sync,
>>>> >> +};
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        hvf_get_registers(cpu);
>>>> >> +        cpu->vcpu_dirty = true;
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> +{
>>>> >> +    if (!cpu->vcpu_dirty) {
>>>> >> +        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> +    }
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> +                                             run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    hvf_put_registers(cpu);
>>>> >> +    cpu->vcpu_dirty = false;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> +                                              run_on_cpu_data arg)
>>>> >> +{
>>>> >> +    cpu->vcpu_dirty = true;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> +{
>>>> >> +    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +{
>>>> >> +    hv_return_t ret = hv_vcpu_destroy(cpu->hvf_fd);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    hvf_arch_vcpu_destroy(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +static void dummy_signal(int sig)
>>>> >> +{
>>>> >> +}
>>>> >> +
>>>> >> +static int hvf_init_vcpu(CPUState *cpu)
>>>> >> +{
>>>> >> +    int r;
>>>> >> +
>>>> >> +    /* init cpu signals */
>>>> >> +    sigset_t set;
>>>> >> +    struct sigaction sigact;
>>>> >> +
>>>> >> +    memset(&sigact, 0, sizeof(sigact));
>>>> >> +    sigact.sa_handler = dummy_signal;
>>>> >> +    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> +
>>>> >> +    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> +    sigdelset(&set, SIG_IPI);
>>>> >> +
>>>> >> +#ifdef __aarch64__
>>>> >> +    r = hv_vcpu_create(&cpu->hvf_fd, (hv_vcpu_exit_t **)&cpu->hvf_exit, NULL);
>>>> >> +#else
>>>> >> +    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> +#endif
>>>> > I think the first __aarch64__ bit fits better in the arm part of the series.
>>>>
>>>>
>>>> Oops. Thanks for catching it! Yes, absolutely. It should be part of the
>>>> ARM enablement.
>>>>
>>>>
>>>> >
>>>> >> +    cpu->vcpu_dirty = 1;
>>>> >> +    assert_hvf_ok(r);
>>>> >> +
>>>> >> +    return hvf_arch_init_vcpu(cpu);
>>>> >> +}
>>>> >> +
>>>> >> +/*
>>>> >> + * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> + * CPU supports the VMX "unrestricted guest" feature.
>>>> >> + */
>>>> >> +static void *hvf_cpu_thread_fn(void *arg)
>>>> >> +{
>>>> >> +    CPUState *cpu = arg;
>>>> >> +
>>>> >> +    int r;
>>>> >> +
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    rcu_register_thread();
>>>> >> +
>>>> >> +    qemu_mutex_lock_iothread();
>>>> >> +    qemu_thread_get_self(cpu->thread);
>>>> >> +
>>>> >> +    cpu->thread_id = qemu_get_thread_id();
>>>> >> +    cpu->can_do_io = 1;
>>>> >> +    current_cpu = cpu;
>>>> >> +
>>>> >> +    hvf_init_vcpu(cpu);
>>>> >> +
>>>> >> +    /* signal CPU creation */
>>>> >> +    cpu_thread_signal_created(cpu);
>>>> >> +    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> +
>>>> >> +    do {
>>>> >> +        if (cpu_can_run(cpu)) {
>>>> >> +            r = hvf_vcpu_exec(cpu);
>>>> >> +            if (r == EXCP_DEBUG) {
>>>> >> +                cpu_handle_guest_debug(cpu);
>>>> >> +            }
>>>> >> +        }
>>>> >> +        qemu_wait_io_event(cpu);
>>>> >> +    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> +
>>>> >> +    hvf_vcpu_destroy(cpu);
>>>> >> +    cpu_thread_signal_destroyed(cpu);
>>>> >> +    qemu_mutex_unlock_iothread();
>>>> >> +    rcu_unregister_thread();
>>>> >> +    return NULL;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> +{
>>>> >> +    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> +
>>>> >> +    /*
>>>> >> +     * HVF currently does not support TCG, and only runs in
>>>> >> +     * unrestricted-guest mode.
>>>> >> +     */
>>>> >> +    assert(hvf_enabled());
>>>> >> +
>>>> >> +    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> +    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> +    qemu_cond_init(cpu->halt_cond);
>>>> >> +
>>>> >> +    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> +             cpu->cpu_index);
>>>> >> +    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> +                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> +}
>>>> >> +
>>>> >> +static const CpusAccel hvf_cpus = {
>>>> >> +    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> +
>>>> >> +    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> +    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> +    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> +    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> +};
>>>> >> +
>>>> >> +static int hvf_accel_init(MachineState *ms)
>>>> >> +{
>>>> >> +    int x;
>>>> >> +    hv_return_t ret;
>>>> >> +    HVFState *s;
>>>> >> +
>>>> >> +    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> +    assert_hvf_ok(ret);
>>>> >> +
>>>> >> +    s = g_new0(HVFState, 1);
>>>> >> +
>>>> >> +    s->num_slots = 32;
>>>> >> +    for (x = 0; x < s->num_slots; ++x) {
>>>> >> +        s->slots[x].size = 0;
>>>> >> +        s->slots[x].slot_id = x;
>>>> >> +    }
>>>> >> +
>>>> >> +    hvf_state = s;
>>>> >> +    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> +    cpus_register_accel(&hvf_cpus);
>>>> >> +    return 0;
>>>> >> +}
>>>> >> +
>>>> >> +static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> +{
>>>> >> +    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> +    ac->name = "HVF";
>>>> >> +    ac->init_machine = hvf_accel_init;
>>>> >> +    ac->allowed = &hvf_allowed;
>>>> >> +}
>>>> >> +
>>>> >> +static const TypeInfo hvf_accel_type = {
>>>> >> +    .name = TYPE_HVF_ACCEL,
>>>> >> +    .parent = TYPE_ACCEL,
>>>> >> +    .class_init = hvf_accel_class_init,
>>>> >> +};
>>>> >> +
>>>> >> +static void hvf_type_init(void)
>>>> >> +{
>>>> >> +    type_register_static(&hvf_accel_type);
>>>> >> +}
>>>> >> +
>>>> >> +type_init(hvf_type_init);
>>>> >> diff --git a/accel/hvf/meson.build b/accel/hvf/meson.build
>>>> >> new file mode 100644
>>>> >> index 0000000000..dfd6b68dc7
>>>> >> --- /dev/null
>>>> >> +++ b/accel/hvf/meson.build
>>>> >> @@ -0,0 +1,7 @@
>>>> >> +hvf_ss = ss.source_set()
>>>> >> +hvf_ss.add(files(
>>>> >> +  'hvf-all.c',
>>>> >> +  'hvf-cpus.c',
>>>> >> +))
>>>> >> +
>>>> >> +specific_ss.add_all(when: 'CONFIG_HVF', if_true: hvf_ss)
>>>> >> diff --git a/accel/meson.build b/accel/meson.build
>>>> >> index b26cca227a..6de12ce5d5 100644
>>>> >> --- a/accel/meson.build
>>>> >> +++ b/accel/meson.build
>>>> >> @@ -1,5 +1,6 @@
>>>> >>   softmmu_ss.add(files('accel.c'))
>>>> >>
>>>> >> +subdir('hvf')
>>>> >>   subdir('qtest')
>>>> >>   subdir('kvm')
>>>> >>   subdir('tcg')
>>>> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>>> >> new file mode 100644
>>>> >> index 0000000000..de9bad23a8
>>>> >> --- /dev/null
>>>> >> +++ b/include/sysemu/hvf_int.h
>>>> >> @@ -0,0 +1,69 @@
>>>> >> +/*
>>>> >> + * QEMU Hypervisor.framework (HVF) support
>>>> >> + *
>>>> >> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> + * See the COPYING file in the top-level directory.
>>>> >> + *
>>>> >> + */
>>>> >> +
>>>> >> +/* header to be included in HVF-specific code */
>>>> >> +
>>>> >> +#ifndef HVF_INT_H
>>>> >> +#define HVF_INT_H
>>>> >> +
>>>> >> +#include <Hypervisor/Hypervisor.h>
>>>> >> +
>>>> >> +#define HVF_MAX_VCPU 0x10
>>>> >> +
>>>> >> +extern struct hvf_state hvf_global;
>>>> >> +
>>>> >> +struct hvf_vm {
>>>> >> +    int id;
>>>> >> +    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> +};
>>>> >> +
>>>> >> +struct hvf_state {
>>>> >> +    uint32_t version;
>>>> >> +    struct hvf_vm *vm;
>>>> >> +    uint64_t mem_quota;
>>>> >> +};
>>>> >> +
>>>> >> +/* hvf_slot flags */
>>>> >> +#define HVF_SLOT_LOG (1 << 0)
>>>> >> +
>>>> >> +typedef struct hvf_slot {
>>>> >> +    uint64_t start;
>>>> >> +    uint64_t size;
>>>> >> +    uint8_t *mem;
>>>> >> +    int slot_id;
>>>> >> +    uint32_t flags;
>>>> >> +    MemoryRegion *region;
>>>> >> +} hvf_slot;
>>>> >> +
>>>> >> +typedef struct hvf_vcpu_caps {
>>>> >> +    uint64_t vmx_cap_pinbased;
>>>> >> +    uint64_t vmx_cap_procbased;
>>>> >> +    uint64_t vmx_cap_procbased2;
>>>> >> +    uint64_t vmx_cap_entry;
>>>> >> +    uint64_t vmx_cap_exit;
>>>> >> +    uint64_t vmx_cap_preemption_timer;
>>>> >> +} hvf_vcpu_caps;
>>>> >> +
>>>> >> +struct HVFState {
>>>> >> +    AccelState parent;
>>>> >> +    hvf_slot slots[32];
>>>> >> +    int num_slots;
>>>> >> +
>>>> >> +    hvf_vcpu_caps *hvf_caps;
>>>> >> +};
>>>> >> +extern HVFState *hvf_state;
>>>> >> +
>>>> >> +void assert_hvf_ok(hv_return_t ret);
>>>> >> +int hvf_get_registers(CPUState *cpu);
>>>> >> +int hvf_put_registers(CPUState *cpu);
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu);
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu);
>>>> >> +int hvf_vcpu_exec(CPUState *cpu);
>>>> >> +hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >> +
>>>> >> +#endif
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.c b/target/i386/hvf/hvf-cpus.c
>>>> >> deleted file mode 100644
>>>> >> index 817b3d7452..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.c
>>>> >> +++ /dev/null
>>>> >> @@ -1,131 +0,0 @@
>>>> >> -/*
>>>> >> - * Copyright 2008 IBM Corporation
>>>> >> - *           2008 Red Hat, Inc.
>>>> >> - * Copyright 2011 Intel Corporation
>>>> >> - * Copyright 2016 Veertu, Inc.
>>>> >> - * Copyright 2017 The Android Open Source Project
>>>> >> - *
>>>> >> - * QEMU Hypervisor.framework support
>>>> >> - *
>>>> >> - * This program is free software; you can redistribute it and/or
>>>> >> - * modify it under the terms of version 2 of the GNU General Public
>>>> >> - * License as published by the Free Software Foundation.
>>>> >> - *
>>>> >> - * This program is distributed in the hope that it will be useful,
>>>> >> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>>>> >> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>>>> >> - * General Public License for more details.
>>>> >> - *
>>>> >> - * You should have received a copy of the GNU General Public License
>>>> >> - * along with this program; if not, see <http://www.gnu.org/licenses/>.
>>>> >> - *
>>>> >> - * This file contain code under public domain from the hvdos project:
>>>> >> - * https://github.com/mist64/hvdos
>>>> >> - *
>>>> >> - * Parts Copyright (c) 2011 NetApp, Inc.
>>>> >> - * All rights reserved.
>>>> >> - *
>>>> >> - * Redistribution and use in source and binary forms, with or without
>>>> >> - * modification, are permitted provided that the following conditions
>>>> >> - * are met:
>>>> >> - * 1. Redistributions of source code must retain the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer.
>>>> >> - * 2. Redistributions in binary form must reproduce the above copyright
>>>> >> - *    notice, this list of conditions and the following disclaimer in the
>>>> >> - *    documentation and/or other materials provided with the distribution.
>>>> >> - *
>>>> >> - * THIS SOFTWARE IS PROVIDED BY NETAPP, INC ``AS IS'' AND
>>>> >> - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
>>>> >> - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
>>>> >> - * ARE DISCLAIMED.  IN NO EVENT SHALL NETAPP, INC OR CONTRIBUTORS BE LIABLE
>>>> >> - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
>>>> >> - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
>>>> >> - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
>>>> >> - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
>>>> >> - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
>>>> >> - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>>>> >> - * SUCH DAMAGE.
>>>> >> - */
>>>> >> -
>>>> >> -#include "qemu/osdep.h"
>>>> >> -#include "qemu/error-report.h"
>>>> >> -#include "qemu/main-loop.h"
>>>> >> -#include "sysemu/hvf.h"
>>>> >> -#include "sysemu/runstate.h"
>>>> >> -#include "target/i386/cpu.h"
>>>> >> -#include "qemu/guest-random.h"
>>>> >> -
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -/*
>>>> >> - * The HVF-specific vCPU thread function. This one should only run when the host
>>>> >> - * CPU supports the VMX "unrestricted guest" feature.
>>>> >> - */
>>>> >> -static void *hvf_cpu_thread_fn(void *arg)
>>>> >> -{
>>>> >> -    CPUState *cpu = arg;
>>>> >> -
>>>> >> -    int r;
>>>> >> -
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    rcu_register_thread();
>>>> >> -
>>>> >> -    qemu_mutex_lock_iothread();
>>>> >> -    qemu_thread_get_self(cpu->thread);
>>>> >> -
>>>> >> -    cpu->thread_id = qemu_get_thread_id();
>>>> >> -    cpu->can_do_io = 1;
>>>> >> -    current_cpu = cpu;
>>>> >> -
>>>> >> -    hvf_init_vcpu(cpu);
>>>> >> -
>>>> >> -    /* signal CPU creation */
>>>> >> -    cpu_thread_signal_created(cpu);
>>>> >> -    qemu_guest_random_seed_thread_part2(cpu->random_seed);
>>>> >> -
>>>> >> -    do {
>>>> >> -        if (cpu_can_run(cpu)) {
>>>> >> -            r = hvf_vcpu_exec(cpu);
>>>> >> -            if (r == EXCP_DEBUG) {
>>>> >> -                cpu_handle_guest_debug(cpu);
>>>> >> -            }
>>>> >> -        }
>>>> >> -        qemu_wait_io_event(cpu);
>>>> >> -    } while (!cpu->unplug || cpu_can_run(cpu));
>>>> >> -
>>>> >> -    hvf_vcpu_destroy(cpu);
>>>> >> -    cpu_thread_signal_destroyed(cpu);
>>>> >> -    qemu_mutex_unlock_iothread();
>>>> >> -    rcu_unregister_thread();
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_start_vcpu_thread(CPUState *cpu)
>>>> >> -{
>>>> >> -    char thread_name[VCPU_THREAD_NAME_SIZE];
>>>> >> -
>>>> >> -    /*
>>>> >> -     * HVF currently does not support TCG, and only runs in
>>>> >> -     * unrestricted-guest mode.
>>>> >> -     */
>>>> >> -    assert(hvf_enabled());
>>>> >> -
>>>> >> -    cpu->thread = g_malloc0(sizeof(QemuThread));
>>>> >> -    cpu->halt_cond = g_malloc0(sizeof(QemuCond));
>>>> >> -    qemu_cond_init(cpu->halt_cond);
>>>> >> -
>>>> >> -    snprintf(thread_name, VCPU_THREAD_NAME_SIZE, "CPU %d/HVF",
>>>> >> -             cpu->cpu_index);
>>>> >> -    qemu_thread_create(cpu->thread, thread_name, hvf_cpu_thread_fn,
>>>> >> -                       cpu, QEMU_THREAD_JOINABLE);
>>>> >> -}
>>>> >> -
>>>> >> -const CpusAccel hvf_cpus = {
>>>> >> -    .create_vcpu_thread = hvf_start_vcpu_thread,
>>>> >> -
>>>> >> -    .synchronize_post_reset = hvf_cpu_synchronize_post_reset,
>>>> >> -    .synchronize_post_init = hvf_cpu_synchronize_post_init,
>>>> >> -    .synchronize_state = hvf_cpu_synchronize_state,
>>>> >> -    .synchronize_pre_loadvm = hvf_cpu_synchronize_pre_loadvm,
>>>> >> -};
>>>> >> diff --git a/target/i386/hvf/hvf-cpus.h b/target/i386/hvf/hvf-cpus.h
>>>> >> deleted file mode 100644
>>>> >> index ced31b82c0..0000000000
>>>> >> --- a/target/i386/hvf/hvf-cpus.h
>>>> >> +++ /dev/null
>>>> >> @@ -1,25 +0,0 @@
>>>> >> -/*
>>>> >> - * Accelerator CPUS Interface
>>>> >> - *
>>>> >> - * Copyright 2020 SUSE LLC
>>>> >> - *
>>>> >> - * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>> >> - * See the COPYING file in the top-level directory.
>>>> >> - */
>>>> >> -
>>>> >> -#ifndef HVF_CPUS_H
>>>> >> -#define HVF_CPUS_H
>>>> >> -
>>>> >> -#include "sysemu/cpus.h"
>>>> >> -
>>>> >> -extern const CpusAccel hvf_cpus;
>>>> >> -
>>>> >> -int hvf_init_vcpu(CPUState *);
>>>> >> -int hvf_vcpu_exec(CPUState *);
>>>> >> -void hvf_cpu_synchronize_state(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *);
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *);
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *);
>>>> >> -void hvf_vcpu_destroy(CPUState *);
>>>> >> -
>>>> >> -#endif /* HVF_CPUS_H */
>>>> >> diff --git a/target/i386/hvf/hvf-i386.h b/target/i386/hvf/hvf-i386.h
>>>> >> index e0edffd077..6d56f8f6bb 100644
>>>> >> --- a/target/i386/hvf/hvf-i386.h
>>>> >> +++ b/target/i386/hvf/hvf-i386.h
>>>> >> @@ -18,57 +18,11 @@
>>>> >>
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "cpu.h"
>>>> >>   #include "x86.h"
>>>> >>
>>>> >> -#define HVF_MAX_VCPU 0x10
>>>> >> -
>>>> >> -extern struct hvf_state hvf_global;
>>>> >> -
>>>> >> -struct hvf_vm {
>>>> >> -    int id;
>>>> >> -    struct hvf_vcpu_state *vcpus[HVF_MAX_VCPU];
>>>> >> -};
>>>> >> -
>>>> >> -struct hvf_state {
>>>> >> -    uint32_t version;
>>>> >> -    struct hvf_vm *vm;
>>>> >> -    uint64_t mem_quota;
>>>> >> -};
>>>> >> -
>>>> >> -/* hvf_slot flags */
>>>> >> -#define HVF_SLOT_LOG (1 << 0)
>>>> >> -
>>>> >> -typedef struct hvf_slot {
>>>> >> -    uint64_t start;
>>>> >> -    uint64_t size;
>>>> >> -    uint8_t *mem;
>>>> >> -    int slot_id;
>>>> >> -    uint32_t flags;
>>>> >> -    MemoryRegion *region;
>>>> >> -} hvf_slot;
>>>> >> -
>>>> >> -typedef struct hvf_vcpu_caps {
>>>> >> -    uint64_t vmx_cap_pinbased;
>>>> >> -    uint64_t vmx_cap_procbased;
>>>> >> -    uint64_t vmx_cap_procbased2;
>>>> >> -    uint64_t vmx_cap_entry;
>>>> >> -    uint64_t vmx_cap_exit;
>>>> >> -    uint64_t vmx_cap_preemption_timer;
>>>> >> -} hvf_vcpu_caps;
>>>> >> -
>>>> >> -struct HVFState {
>>>> >> -    AccelState parent;
>>>> >> -    hvf_slot slots[32];
>>>> >> -    int num_slots;
>>>> >> -
>>>> >> -    hvf_vcpu_caps *hvf_caps;
>>>> >> -};
>>>> >> -extern HVFState *hvf_state;
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *, bool);
>>>> >>   void hvf_handle_io(CPUArchState *, uint16_t, void *, int, int, int);
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t, uint64_t);
>>>> >>
>>>> >>   #ifdef NEED_CPU_H
>>>> >>   /* Functions exported to host specific mode */
>>>> >> diff --git a/target/i386/hvf/hvf.c b/target/i386/hvf/hvf.c
>>>> >> index ed9356565c..8b96ecd619 100644
>>>> >> --- a/target/i386/hvf/hvf.c
>>>> >> +++ b/target/i386/hvf/hvf.c
>>>> >> @@ -51,6 +51,7 @@
>>>> >>   #include "qemu/error-report.h"
>>>> >>
>>>> >>   #include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >>   #include "sysemu/runstate.h"
>>>> >>   #include "hvf-i386.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -72,171 +73,6 @@
>>>> >>   #include "sysemu/accel.h"
>>>> >>   #include "target/i386/cpu.h"
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >> -HVFState *hvf_state;
>>>> >> -
>>>> >> -static void assert_hvf_ok(hv_return_t ret)
>>>> >> -{
>>>> >> -    if (ret == HV_SUCCESS) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    switch (ret) {
>>>> >> -    case HV_ERROR:
>>>> >> -        error_report("Error: HV_ERROR");
>>>> >> -        break;
>>>> >> -    case HV_BUSY:
>>>> >> -        error_report("Error: HV_BUSY");
>>>> >> -        break;
>>>> >> -    case HV_BAD_ARGUMENT:
>>>> >> -        error_report("Error: HV_BAD_ARGUMENT");
>>>> >> -        break;
>>>> >> -    case HV_NO_RESOURCES:
>>>> >> -        error_report("Error: HV_NO_RESOURCES");
>>>> >> -        break;
>>>> >> -    case HV_NO_DEVICE:
>>>> >> -        error_report("Error: HV_NO_DEVICE");
>>>> >> -        break;
>>>> >> -    case HV_UNSUPPORTED:
>>>> >> -        error_report("Error: HV_UNSUPPORTED");
>>>> >> -        break;
>>>> >> -    default:
>>>> >> -        error_report("Unknown Error");
>>>> >> -    }
>>>> >> -
>>>> >> -    abort();
>>>> >> -}
>>>> >> -
>>>> >> -/* Memory slots */
>>>> >> -hvf_slot *hvf_find_overlap_slot(uint64_t start, uint64_t size)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -    int x;
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        slot = &hvf_state->slots[x];
>>>> >> -        if (slot->size && start < (slot->start + slot->size) &&
>>>> >> -            (start + size) > slot->start) {
>>>> >> -            return slot;
>>>> >> -        }
>>>> >> -    }
>>>> >> -    return NULL;
>>>> >> -}
>>>> >> -
>>>> >> -struct mac_slot {
>>>> >> -    int present;
>>>> >> -    uint64_t size;
>>>> >> -    uint64_t gpa_start;
>>>> >> -    uint64_t gva;
>>>> >> -};
>>>> >> -
>>>> >> -struct mac_slot mac_slots[32];
>>>> >> -
>>>> >> -static int do_hvf_set_memory(hvf_slot *slot, hv_memory_flags_t flags)
>>>> >> -{
>>>> >> -    struct mac_slot *macslot;
>>>> >> -    hv_return_t ret;
>>>> >> -
>>>> >> -    macslot = &mac_slots[slot->slot_id];
>>>> >> -
>>>> >> -    if (macslot->present) {
>>>> >> -        if (macslot->size != slot->size) {
>>>> >> -            macslot->present = 0;
>>>> >> -            ret = hv_vm_unmap(macslot->gpa_start, macslot->size);
>>>> >> -            assert_hvf_ok(ret);
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!slot->size) {
>>>> >> -        return 0;
>>>> >> -    }
>>>> >> -
>>>> >> -    macslot->present = 1;
>>>> >> -    macslot->gpa_start = slot->start;
>>>> >> -    macslot->size = slot->size;
>>>> >> -    ret = hv_vm_map((hv_uvaddr_t)slot->mem, slot->start, slot->size, flags);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_set_phys_mem(MemoryRegionSection *section, bool add)
>>>> >> -{
>>>> >> -    hvf_slot *mem;
>>>> >> -    MemoryRegion *area = section->mr;
>>>> >> -    bool writeable = !area->readonly && !area->rom_device;
>>>> >> -    hv_memory_flags_t flags;
>>>> >> -
>>>> >> -    if (!memory_region_is_ram(area)) {
>>>> >> -        if (writeable) {
>>>> >> -            return;
>>>> >> -        } else if (!memory_region_is_romd(area)) {
>>>> >> -            /*
>>>> >> -             * If the memory device is not in romd_mode, then we actually want
>>>> >> -             * to remove the hvf memory slot so all accesses will trap.
>>>> >> -             */
>>>> >> -             add = false;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    mem = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    if (mem && add) {
>>>> >> -        if (mem->size == int128_get64(section->size) &&
>>>> >> -            mem->start == section->offset_within_address_space &&
>>>> >> -            mem->mem == (memory_region_get_ram_ptr(area) +
>>>> >> -            section->offset_within_region)) {
>>>> >> -            return; /* Same region was attempted to register, go away. */
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Region needs to be reset. set the size to 0 and remap it. */
>>>> >> -    if (mem) {
>>>> >> -        mem->size = 0;
>>>> >> -        if (do_hvf_set_memory(mem, 0)) {
>>>> >> -            error_report("Failed to reset overlapping slot");
>>>> >> -            abort();
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (!add) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    if (area->readonly ||
>>>> >> -        (!memory_region_is_ram(area) && memory_region_is_romd(area))) {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_EXEC;
>>>> >> -    } else {
>>>> >> -        flags = HV_MEMORY_READ | HV_MEMORY_WRITE | HV_MEMORY_EXEC;
>>>> >> -    }
>>>> >> -
>>>> >> -    /* Now make a new slot. */
>>>> >> -    int x;
>>>> >> -
>>>> >> -    for (x = 0; x < hvf_state->num_slots; ++x) {
>>>> >> -        mem = &hvf_state->slots[x];
>>>> >> -        if (!mem->size) {
>>>> >> -            break;
>>>> >> -        }
>>>> >> -    }
>>>> >> -
>>>> >> -    if (x == hvf_state->num_slots) {
>>>> >> -        error_report("No free slots");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -
>>>> >> -    mem->size = int128_get64(section->size);
>>>> >> -    mem->mem = memory_region_get_ram_ptr(area) + section->offset_within_region;
>>>> >> -    mem->start = section->offset_within_address_space;
>>>> >> -    mem->region = area;
>>>> >> -
>>>> >> -    if (do_hvf_set_memory(mem, flags)) {
>>>> >> -        error_report("Error registering new memory slot");
>>>> >> -        abort();
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >>   void vmx_update_tpr(CPUState *cpu)
>>>> >>   {
>>>> >>       /* TODO: need integrate APIC handling */
>>>> >> @@ -276,56 +112,6 @@ void hvf_handle_io(CPUArchState *env, uint16_t port, void *buffer,
>>>> >>       }
>>>> >>   }
>>>> >>
>>>> >> -static void do_hvf_cpu_synchronize_state(CPUState *cpu, run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        hvf_get_registers(cpu);
>>>> >> -        cpu->vcpu_dirty = true;
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_state(CPUState *cpu)
>>>> >> -{
>>>> >> -    if (!cpu->vcpu_dirty) {
>>>> >> -        run_on_cpu(cpu, do_hvf_cpu_synchronize_state, RUN_ON_CPU_NULL);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_reset(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_reset(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_reset, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_post_init(CPUState *cpu,
>>>> >> -                                             run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    hvf_put_registers(cpu);
>>>> >> -    cpu->vcpu_dirty = false;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_post_init(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_post_init, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >> -static void do_hvf_cpu_synchronize_pre_loadvm(CPUState *cpu,
>>>> >> -                                              run_on_cpu_data arg)
>>>> >> -{
>>>> >> -    cpu->vcpu_dirty = true;
>>>> >> -}
>>>> >> -
>>>> >> -void hvf_cpu_synchronize_pre_loadvm(CPUState *cpu)
>>>> >> -{
>>>> >> -    run_on_cpu(cpu, do_hvf_cpu_synchronize_pre_loadvm, RUN_ON_CPU_NULL);
>>>> >> -}
>>>> >> -
>>>> >>   static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>   {
>>>> >>       int read, write;
>>>> >> @@ -370,109 +156,19 @@ static bool ept_emulation_fault(hvf_slot *slot, uint64_t gpa, uint64_t ept_qual)
>>>> >>       return false;
>>>> >>   }
>>>> >>
>>>> >> -static void hvf_set_dirty_tracking(MemoryRegionSection *section, bool on)
>>>> >> -{
>>>> >> -    hvf_slot *slot;
>>>> >> -
>>>> >> -    slot = hvf_find_overlap_slot(
>>>> >> -            section->offset_within_address_space,
>>>> >> -            int128_get64(section->size));
>>>> >> -
>>>> >> -    /* protect region against writes; begin tracking it */
>>>> >> -    if (on) {
>>>> >> -        slot->flags |= HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ);
>>>> >> -    /* stop tracking region*/
>>>> >> -    } else {
>>>> >> -        slot->flags &= ~HVF_SLOT_LOG;
>>>> >> -        hv_vm_protect((hv_gpaddr_t)slot->start, (size_t)slot->size,
>>>> >> -                      HV_MEMORY_READ | HV_MEMORY_WRITE);
>>>> >> -    }
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_start(MemoryListener *listener,
>>>> >> -                          MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (old != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_stop(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section, int old, int new)
>>>> >> -{
>>>> >> -    if (new != 0) {
>>>> >> -        return;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_set_dirty_tracking(section, 0);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_log_sync(MemoryListener *listener,
>>>> >> -                         MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    /*
>>>> >> -     * sync of dirty pages is handled elsewhere; just make sure we keep
>>>> >> -     * tracking the region.
>>>> >> -     */
>>>> >> -    hvf_set_dirty_tracking(section, 1);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_add(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, true);
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_region_del(MemoryListener *listener,
>>>> >> -                           MemoryRegionSection *section)
>>>> >> -{
>>>> >> -    hvf_set_phys_mem(section, false);
>>>> >> -}
>>>> >> -
>>>> >> -static MemoryListener hvf_memory_listener = {
>>>> >> -    .priority = 10,
>>>> >> -    .region_add = hvf_region_add,
>>>> >> -    .region_del = hvf_region_del,
>>>> >> -    .log_start = hvf_log_start,
>>>> >> -    .log_stop = hvf_log_stop,
>>>> >> -    .log_sync = hvf_log_sync,
>>>> >> -};
>>>> >> -
>>>> >> -void hvf_vcpu_destroy(CPUState *cpu)
>>>> >> +void hvf_arch_vcpu_destroy(CPUState *cpu)
>>>> >>   {
>>>> >>       X86CPU *x86_cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86_cpu->env;
>>>> >>
>>>> >> -    hv_return_t ret = hv_vcpu_destroy((hv_vcpuid_t)cpu->hvf_fd);
>>>> >>       g_free(env->hvf_mmio_buf);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -}
>>>> >> -
>>>> >> -static void dummy_signal(int sig)
>>>> >> -{
>>>> >>   }
>>>> >>
>>>> >> -int hvf_init_vcpu(CPUState *cpu)
>>>> >> +int hvf_arch_init_vcpu(CPUState *cpu)
>>>> >>   {
>>>> >>
>>>> >>       X86CPU *x86cpu = X86_CPU(cpu);
>>>> >>       CPUX86State *env = &x86cpu->env;
>>>> >> -    int r;
>>>> >> -
>>>> >> -    /* init cpu signals */
>>>> >> -    sigset_t set;
>>>> >> -    struct sigaction sigact;
>>>> >> -
>>>> >> -    memset(&sigact, 0, sizeof(sigact));
>>>> >> -    sigact.sa_handler = dummy_signal;
>>>> >> -    sigaction(SIG_IPI, &sigact, NULL);
>>>> >> -
>>>> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>> >> -    sigdelset(&set, SIG_IPI);
>>>> >>
>>>> >>       init_emu();
>>>> >>       init_decoder();
>>>> >> @@ -480,10 +176,6 @@ int hvf_init_vcpu(CPUState *cpu)
>>>> >>       hvf_state->hvf_caps = g_new0(struct hvf_vcpu_caps, 1);
>>>> >>       env->hvf_mmio_buf = g_new(char, 4096);
>>>> >>
>>>> >> -    r = hv_vcpu_create((hv_vcpuid_t *)&cpu->hvf_fd, HV_VCPU_DEFAULT);
>>>> >> -    cpu->vcpu_dirty = 1;
>>>> >> -    assert_hvf_ok(r);
>>>> >> -
>>>> >>       if (hv_vmx_read_capability(HV_VMX_CAP_PINBASED,
>>>> >>           &hvf_state->hvf_caps->vmx_cap_pinbased)) {
>>>> >>           abort();
>>>> >> @@ -865,49 +557,3 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>> >>
>>>> >>       return ret;
>>>> >>   }
>>>> >> -
>>>> >> -bool hvf_allowed;
>>>> >> -
>>>> >> -static int hvf_accel_init(MachineState *ms)
>>>> >> -{
>>>> >> -    int x;
>>>> >> -    hv_return_t ret;
>>>> >> -    HVFState *s;
>>>> >> -
>>>> >> -    ret = hv_vm_create(HV_VM_DEFAULT);
>>>> >> -    assert_hvf_ok(ret);
>>>> >> -
>>>> >> -    s = g_new0(HVFState, 1);
>>>> >> -
>>>> >> -    s->num_slots = 32;
>>>> >> -    for (x = 0; x < s->num_slots; ++x) {
>>>> >> -        s->slots[x].size = 0;
>>>> >> -        s->slots[x].slot_id = x;
>>>> >> -    }
>>>> >> -
>>>> >> -    hvf_state = s;
>>>> >> -    memory_listener_register(&hvf_memory_listener, &address_space_memory);
>>>> >> -    cpus_register_accel(&hvf_cpus);
>>>> >> -    return 0;
>>>> >> -}
>>>> >> -
>>>> >> -static void hvf_accel_class_init(ObjectClass *oc, void *data)
>>>> >> -{
>>>> >> -    AccelClass *ac = ACCEL_CLASS(oc);
>>>> >> -    ac->name = "HVF";
>>>> >> -    ac->init_machine = hvf_accel_init;
>>>> >> -    ac->allowed = &hvf_allowed;
>>>> >> -}
>>>> >> -
>>>> >> -static const TypeInfo hvf_accel_type = {
>>>> >> -    .name = TYPE_HVF_ACCEL,
>>>> >> -    .parent = TYPE_ACCEL,
>>>> >> -    .class_init = hvf_accel_class_init,
>>>> >> -};
>>>> >> -
>>>> >> -static void hvf_type_init(void)
>>>> >> -{
>>>> >> -    type_register_static(&hvf_accel_type);
>>>> >> -}
>>>> >> -
>>>> >> -type_init(hvf_type_init);
>>>> >> diff --git a/target/i386/hvf/meson.build b/target/i386/hvf/meson.build
>>>> >> index 409c9a3f14..c8a43717ee 100644
>>>> >> --- a/target/i386/hvf/meson.build
>>>> >> +++ b/target/i386/hvf/meson.build
>>>> >> @@ -1,6 +1,5 @@
>>>> >>   i386_softmmu_ss.add(when: [hvf, 'CONFIG_HVF'], if_true: files(
>>>> >>     'hvf.c',
>>>> >> -  'hvf-cpus.c',
>>>> >>     'x86.c',
>>>> >>     'x86_cpuid.c',
>>>> >>     'x86_decode.c',
>>>> >> diff --git a/target/i386/hvf/x86hvf.c b/target/i386/hvf/x86hvf.c
>>>> >> index bbec412b6c..89b8e9d87a 100644
>>>> >> --- a/target/i386/hvf/x86hvf.c
>>>> >> +++ b/target/i386/hvf/x86hvf.c
>>>> >> @@ -20,6 +20,9 @@
>>>> >>   #include "qemu/osdep.h"
>>>> >>
>>>> >>   #include "qemu-common.h"
>>>> >> +#include "sysemu/hvf.h"
>>>> >> +#include "sysemu/hvf_int.h"
>>>> >> +#include "sysemu/hw_accel.h"
>>>> >>   #include "x86hvf.h"
>>>> >>   #include "vmx.h"
>>>> >>   #include "vmcs.h"
>>>> >> @@ -32,8 +35,6 @@
>>>> >>   #include <Hypervisor/hv.h>
>>>> >>   #include <Hypervisor/hv_vmx.h>
>>>> >>
>>>> >> -#include "hvf-cpus.h"
>>>> >> -
>>>> >>   void hvf_set_segment(struct CPUState *cpu, struct vmx_segment *vmx_seg,
>>>> >>                        SegmentCache *qseg, bool is_tr)
>>>> >>   {
>>>> >> @@ -437,7 +438,7 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>       env->eflags = rreg(cpu_state->hvf_fd, HV_X86_RFLAGS);
>>>> >>
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_INIT) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_init(cpu);
>>>> >>       }
>>>> >>
>>>> >> @@ -451,12 +452,12 @@ int hvf_process_events(CPUState *cpu_state)
>>>> >>           cpu_state->halted = 0;
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_SIPI) {
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> >>           do_cpu_sipi(cpu);
>>>> >>       }
>>>> >>       if (cpu_state->interrupt_request & CPU_INTERRUPT_TPR) {
>>>> >>           cpu_state->interrupt_request &= ~CPU_INTERRUPT_TPR;
>>>> >> -        hvf_cpu_synchronize_state(cpu_state);
>>>> >> +        cpu_synchronize_state(cpu_state);
>>>> > The changes from hvf_cpu_*() to cpu_*() are cleanup and perhaps should
>>>> > be a separate patch. It follows the cpu/accel cleanups Claudio was
>>>> > doing this summer.
>>>>
>>>>
>>>> The only reason they're in here is because we no longer have access to
>>>> the hvf_ functions from the file. I am perfectly happy to rebase the
>>>> patch on top of Claudio's if his goes in first. I'm sure it'll be
>>>> trivial for him to rebase on top of this too if my series goes in first.
>>>>
>>>>
>>>> >
>>>> > Philippe raised the idea that the patch might go ahead of the
>>>> > ARM-specific part (which might involve some discussions) and I agree
>>>> > with that.
>>>> >
>>>> > Some sync between Claudio's series (CC'd him) and the patch might be
>>>> > needed.
>>>>
>>>>
>>>> I would prefer not to hold back because of the sync. Claudio's cleanup
>>>> is trivial enough to adjust for if it gets merged ahead of this.
>>>>
>>>>
>>>> Alex
>>>>
>>>>
>>>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 21:40                 ` Alexander Graf
@ 2020-11-30 23:01                   ` Peter Collingbourne
  2020-11-30 23:18                     ` Alexander Graf
  2020-12-01  0:37                   ` Roman Bolshakov
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Collingbourne @ 2020-11-30 23:01 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>
> Hi Peter,
>
> On 30.11.20 22:08, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> >>
> >>
> >> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> >>> Hi Frank,
> >>>
> >>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> >> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> >>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> >>>
> >>>    https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> >>>
> >> Thanks, we'll take a look :)
> >>
> >>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > I'll reply to your patch here. You have:
> >
> > +                    /* Set cpu->hvf->sleeping so that we get a
> > SIG_IPI signal. */
> > +                    cpu->hvf->sleeping = true;
> > +                    smp_mb();
> > +
> > +                    /* Bail out if we received an IRQ meanwhile */
> > +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > +                        cpu->hvf->sleeping = false;
> > +                        break;
> > +                    }
> > +
> > +                    /* nanosleep returns on signal, so we wake up on kick. */
> > +                    nanosleep(ts, NULL);
> >
> > and then send the signal conditional on whether sleeping is true, but
> > I think this is racy. If the signal is sent after sleeping is set to
> > true but before entering nanosleep then I think it will be ignored and
> > we will miss the wakeup. That's why in my implementation I block IPI
> > on the CPU thread at startup and then use pselect to atomically
> > unblock and begin sleeping. The signal is sent unconditionally so
> > there's no need to worry about races between actually sleeping and the
> > "we think we're sleeping" state. It may lead to an extra wakeup but
> > that's better than missing it entirely.
>
>
> Thanks a bunch for the comment! So the trick I was using here is to
> modify the timespec from the kick function before sending the IPI
> signal. That way, we know that either we are inside the sleep (where the
> signal wakes it up) or we are outside the sleep (where timespec={} will
> make it return immediately).
>
> The only race I can think of is if nanosleep does calculations based on
> the timespec and we happen to send the signal right there and then.

Yes that's the race I was thinking of. Admittedly it's a small window
but it's theoretically possible and part of the reason why pselect was
created.

> The problem with blocking IPIs is basically what Frank was describing
> earlier: How do you unset the IPI signal pending status? If the signal
> is never delivered, how can pselect differentiate "signal from last time
> is still pending" from "new signal because I got an IPI"?

In this case we would take the additional wakeup, which should be
harmless since we will take the WFx exit again and end up in the
correct state. But that's a lot better than busy looping.

I reckon that you could improve things a little by unblocking the
signal and then reblocking it before unlocking iothread (e.g. with a
pselect with zero time interval), which would flush any pending
signals. Since any such signal would correspond to a signal from last
time (because we still have the iothread lock) we know that any future
signals should correspond to new IPIs.

Peter



* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 23:01                   ` Peter Collingbourne
@ 2020-11-30 23:18                     ` Alexander Graf
  2020-12-01  0:00                       ` Peter Collingbourne
  0 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-11-30 23:18 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 01.12.20 00:01, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>> Hi Peter,
>>
>> On 30.11.20 22:08, Peter Collingbourne wrote:
>>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>>>
>>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>> Hi Frank,
>>>>>
>>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>>>
>>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>>>
>>>> Thanks, we'll take a look :)
>>>>
>>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
>>> I'll reply to your patch here. You have:
>>>
>>> +                    /* Set cpu->hvf->sleeping so that we get a
>>> SIG_IPI signal. */
>>> +                    cpu->hvf->sleeping = true;
>>> +                    smp_mb();
>>> +
>>> +                    /* Bail out if we received an IRQ meanwhile */
>>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
>>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>> +                        cpu->hvf->sleeping = false;
>>> +                        break;
>>> +                    }
>>> +
>>> +                    /* nanosleep returns on signal, so we wake up on kick. */
>>> +                    nanosleep(ts, NULL);
>>>
>>> and then send the signal conditional on whether sleeping is true, but
>>> I think this is racy. If the signal is sent after sleeping is set to
>>> true but before entering nanosleep then I think it will be ignored and
>>> we will miss the wakeup. That's why in my implementation I block IPI
>>> on the CPU thread at startup and then use pselect to atomically
>>> unblock and begin sleeping. The signal is sent unconditionally so
>>> there's no need to worry about races between actually sleeping and the
>>> "we think we're sleeping" state. It may lead to an extra wakeup but
>>> that's better than missing it entirely.
>>
>> Thanks a bunch for the comment! So the trick I was using here is to
>> modify the timespec from the kick function before sending the IPI
>> signal. That way, we know that either we are inside the sleep (where the
>> signal wakes it up) or we are outside the sleep (where timespec={} will
>> make it return immediately).
>>
>> The only race I can think of is if nanosleep does calculations based on
>> the timespec and we happen to send the signal right there and then.
> Yes that's the race I was thinking of. Admittedly it's a small window
> but it's theoretically possible and part of the reason why pselect was
> created.
>
>> The problem with blocking IPIs is basically what Frank was describing
>> earlier: How do you unset the IPI signal pending status? If the signal
>> is never delivered, how can pselect differentiate "signal from last time
>> is still pending" from "new signal because I got an IPI"?
> In this case we would take the additional wakeup which should be
> harmless since we will take the WFx exit again and put us in the
> correct state. But that's a lot better than busy looping.


I'm not sure I follow. I'm thinking of the following scenario:

   - trap into WFI handler
   - go to sleep with blocked SIG_IPI
   - SIG_IPI arrives, pselect() exits
   - signal is still pending because it's blocked
   - enter guest
   - trap into WFI handler
>    - run pselect(), but it immediately exits because SIG_IPI is still pending

This was the loop I was seeing when running with SIG_IPI blocked. That's 
part of the reason why I switched to a different model.


> I reckon that you could improve things a little by unblocking the
> signal and then reblocking it before unlocking iothread (e.g. with a
> pselect with zero time interval), which would flush any pending
> signals. Since any such signal would correspond to a signal from last
> time (because we still have the iothread lock) we know that any future
> signals should correspond to new IPIs.


Yeah, I think you actually *have* to do exactly that, because otherwise 
pselect() will always return immediately, since the signal is still pending.

And yes, I agree that that starts to sound a bit less racy now. But it 
means we can probably also just do

   - WFI handler
   - block SIG_IPI
   - set hvf->sleeping = true
   - check for pending interrupts
   - pselect()
   - unblock SIG_IPI

which means we run with SIG_IPI unmasked by default. I don't think the 
number of signal mask changes is any different with that compared to 
running with SIG_IPI always masked, right?


Alex




* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 23:18                     ` Alexander Graf
@ 2020-12-01  0:00                       ` Peter Collingbourne
  2020-12-01  0:13                         ` Alexander Graf
  2020-12-03  9:41                         ` [PATCH 2/8] hvf: Move common code out Roman Bolshakov
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-01  0:00 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 00:01, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
> >> Hi Peter,
> >>
> >> On 30.11.20 22:08, Peter Collingbourne wrote:
> >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> >>>>
> >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> >>>>> Hi Frank,
> >>>>>
> >>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> >>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> >>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> >>>>>
> >>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> >>>>>
> >>>> Thanks, we'll take a look :)
> >>>>
> >>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> >>> I'll reply to your patch here. You have:
> >>>
> >>> +                    /* Set cpu->hvf->sleeping so that we get a
> >>> SIG_IPI signal. */
> >>> +                    cpu->hvf->sleeping = true;
> >>> +                    smp_mb();
> >>> +
> >>> +                    /* Bail out if we received an IRQ meanwhile */
> >>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> >>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>> +                        cpu->hvf->sleeping = false;
> >>> +                        break;
> >>> +                    }
> >>> +
> >>> +                    /* nanosleep returns on signal, so we wake up on kick. */
> >>> +                    nanosleep(ts, NULL);
> >>>
> >>> and then send the signal conditional on whether sleeping is true, but
> >>> I think this is racy. If the signal is sent after sleeping is set to
> >>> true but before entering nanosleep then I think it will be ignored and
> >>> we will miss the wakeup. That's why in my implementation I block IPI
> >>> on the CPU thread at startup and then use pselect to atomically
> >>> unblock and begin sleeping. The signal is sent unconditionally so
> >>> there's no need to worry about races between actually sleeping and the
> >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> >>> that's better than missing it entirely.
> >>
> >> Thanks a bunch for the comment! So the trick I was using here is to
> >> modify the timespec from the kick function before sending the IPI
> >> signal. That way, we know that either we are inside the sleep (where the
> >> signal wakes it up) or we are outside the sleep (where timespec={} will
> >> make it return immediately).
> >>
> >> The only race I can think of is if nanosleep does calculations based on
> >> the timespec and we happen to send the signal right there and then.
> > Yes that's the race I was thinking of. Admittedly it's a small window
> > but it's theoretically possible and part of the reason why pselect was
> > created.
> >
> >> The problem with blocking IPIs is basically what Frank was describing
> >> earlier: How do you unset the IPI signal pending status? If the signal
> >> is never delivered, how can pselect differentiate "signal from last time
> >> is still pending" from "new signal because I got an IPI"?
> > In this case we would take the additional wakeup which should be
> > harmless since we will take the WFx exit again and put us in the
> > correct state. But that's a lot better than busy looping.
>
>
> I'm not sure I follow. I'm thinking of the following scenario:
>
>    - trap into WFI handler
>    - go to sleep with blocked SIG_IPI
>    - SIG_IPI arrives, pselect() exits
>    - signal is still pending because it's blocked
>    - enter guest
>    - trap into WFI handler
>    - run pselect(), but it immediate exits because SIG_IPI is still pending
>
> This was the loop I was seeing when running with SIG_IPI blocked. That's
> part of the reason why I switched to a different model.

What I observe is that when pselect() returns because of a pending
signal, it consumes that signal (which is also consistent with my
understanding of what pselect does). That means it doesn't matter if we
take a second WFx exit: once we reach the pselect in the second WFx
exit, the signal will already have been consumed by the pselect in the
first exit, and we will just wait for the next one.

I don't know why things may have been going wrong in your
implementation but it may be related to the issue with
mach_absolute_time() which I posted about separately and was also
causing busy loops for us in some cases. Once that issue was fixed in
our implementation, we started seeing sleep-until-vtimer-due work
properly.

>
>
> > I reckon that you could improve things a little by unblocking the
> > signal and then reblocking it before unlocking iothread (e.g. with a
> > pselect with zero time interval), which would flush any pending
> > signals. Since any such signal would correspond to a signal from last
> > time (because we still have the iothread lock) we know that any future
> > signals should correspond to new IPIs.
>
>
> Yeah, I think you actually *have* to do exactly that, because otherwise
> pselect() will always return after 0ns because the signal is still pending.
>
> And yes, I agree that that starts to sound a bit less racy now. But it
> means we can probably also just do
>
>    - WFI handler
>    - block SIG_IPI
>    - set hvf->sleeping = true
>    - check for pending interrupts
>    - pselect()
>    - unblock SIG_IPI
>
> which means we run with SIG_IPI unmasked by default. I don't think the
> number of signal mask changes is any different with that compared to
> running with SIG_IPI always masked, right?

And unlock/lock iothread around the pselect? I suppose that could work
but as I mentioned it would just be an optimization.

Maybe I can try to make my approach work on top of your series, or if
you already have a patch I can try to debug it. Let me know.

Peter



* Re: [PATCH 2/8] hvf: Move common code out
  2020-12-01  0:00                       ` Peter Collingbourne
@ 2020-12-01  0:13                         ` Alexander Graf
  2020-12-01  8:21                           ` [PATCH] arm/hvf: Optimize and simplify WFI handling Peter Collingbourne via
  2020-12-03  9:41                         ` [PATCH 2/8] hvf: Move common code out Roman Bolshakov
  1 sibling, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-12-01  0:13 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 01.12.20 01:00, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 01.12.20 00:01, Peter Collingbourne wrote:
>>> On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>> Hi Peter,
>>>>
>>>> On 30.11.20 22:08, Peter Collingbourne wrote:
>>>>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>>>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>>>> Hi Frank,
>>>>>>>
>>>>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>>>>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
>>>>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>>>>>
>>>>>>>      https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>>>>>
>>>>>> Thanks, we'll take a look :)
>>>>>>
>>>>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>>>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
>>>>> I'll reply to your patch here. You have:
>>>>>
>>>>> +                    /* Set cpu->hvf->sleeping so that we get a
>>>>> SIG_IPI signal. */
>>>>> +                    cpu->hvf->sleeping = true;
>>>>> +                    smp_mb();
>>>>> +
>>>>> +                    /* Bail out if we received an IRQ meanwhile */
>>>>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
>>>>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>>>> +                        cpu->hvf->sleeping = false;
>>>>> +                        break;
>>>>> +                    }
>>>>> +
>>>>> +                    /* nanosleep returns on signal, so we wake up on kick. */
>>>>> +                    nanosleep(ts, NULL);
>>>>>
>>>>> and then send the signal conditional on whether sleeping is true, but
>>>>> I think this is racy. If the signal is sent after sleeping is set to
>>>>> true but before entering nanosleep then I think it will be ignored and
>>>>> we will miss the wakeup. That's why in my implementation I block IPI
>>>>> on the CPU thread at startup and then use pselect to atomically
>>>>> unblock and begin sleeping. The signal is sent unconditionally so
>>>>> there's no need to worry about races between actually sleeping and the
>>>>> "we think we're sleeping" state. It may lead to an extra wakeup but
>>>>> that's better than missing it entirely.
>>>> Thanks a bunch for the comment! So the trick I was using here is to
>>>> modify the timespec from the kick function before sending the IPI
>>>> signal. That way, we know that either we are inside the sleep (where the
>>>> signal wakes it up) or we are outside the sleep (where timespec={} will
>>>> make it return immediately).
>>>>
>>>> The only race I can think of is if nanosleep does calculations based on
>>>> the timespec and we happen to send the signal right there and then.
>>> Yes that's the race I was thinking of. Admittedly it's a small window
>>> but it's theoretically possible and part of the reason why pselect was
>>> created.
>>>
>>>> The problem with blocking IPIs is basically what Frank was describing
>>>> earlier: How do you unset the IPI signal pending status? If the signal
>>>> is never delivered, how can pselect differentiate "signal from last time
>>>> is still pending" from "new signal because I got an IPI"?
>>> In this case we would take the additional wakeup which should be
>>> harmless since we will take the WFx exit again and put us in the
>>> correct state. But that's a lot better than busy looping.
>>
>> I'm not sure I follow. I'm thinking of the following scenario:
>>
>>     - trap into WFI handler
>>     - go to sleep with blocked SIG_IPI
>>     - SIG_IPI arrives, pselect() exits
>>     - signal is still pending because it's blocked
>>     - enter guest
>>     - trap into WFI handler
>>     - run pselect(), but it immediately exits because SIG_IPI is still pending
>>
>> This was the loop I was seeing when running with SIG_IPI blocked. That's
>> part of the reason why I switched to a different model.
> What I observe is that when returning from a pending signal pselect
> consumes the signal (which is also consistent with my understanding of
> what pselect does). That means that it doesn't matter if we take a
> second WFx exit because once we reach the pselect in the second WFx
> exit the signal will have been consumed by the pselect in the first
> exit and we will just wait for the next one.
>
> I don't know why things may have been going wrong in your
> implementation but it may be related to the issue with
> mach_absolute_time() which I posted about separately and was also
> causing busy loops for us in some cases. Once that issue was fixed in
> our implementation, we started seeing sleep-until-VTIMER-due work
> properly.
>
>>
>>> I reckon that you could improve things a little by unblocking the
>>> signal and then reblocking it before unlocking iothread (e.g. with a
>>> pselect with zero time interval), which would flush any pending
>>> signals. Since any such signal would correspond to a signal from last
>>> time (because we still have the iothread lock) we know that any future
>>> signals should correspond to new IPIs.
>>
>> Yeah, I think you actually *have* to do exactly that, because otherwise
>> pselect() will always return after 0ns because the signal is still pending.
>>
>> And yes, I agree that that starts to sound a bit less racy now. But it
>> means we can probably also just do
>>
>>     - WFI handler
>>     - block SIG_IPI
>>     - set hvf->sleeping = true
>>     - check for pending interrupts
>>     - pselect()
>>     - unblock SIG_IPI
>>
>> which means we run with SIG_IPI unmasked by default. I don't think the
>> number of signal mask changes is any different with that compared to
>> running with SIG_IPI always masked, right?
> And unlock/lock iothread around the pselect? I suppose that could work
> but as I mentioned it would just be an optimization.
>
> Maybe I can try to make my approach work on top of your series, or if
> you already have a patch I can try to debug it. Let me know.


I would love to take a patch from you here :). I'll still be stuck for a 
while with the sysreg sync rework that Peter asked for before I can look 
at WFI again.


Alex




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 21:40                 ` Alexander Graf
  2020-11-30 23:01                   ` Peter Collingbourne
@ 2020-12-01  0:37                   ` Roman Bolshakov
  1 sibling, 0 replies; 64+ messages in thread
From: Roman Bolshakov @ 2020-12-01  0:37 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Claudio Fontana, Frank Yang,
	Paolo Bonzini, Peter Collingbourne

On Mon, Nov 30, 2020 at 10:40:49PM +0100, Alexander Graf wrote:
> Hi Peter,
> 
> On 30.11.20 22:08, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> > > 
> > > 
> > > On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> > > > Hi Frank,
> > > > 
> > > > Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> > > Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and comparing it with cval is that if we sleep, cval is going to be much less than cntpct_el0 by the sleep time. If we can get either the architecture or macOS to read out the sleep time, then we might not need a poll interval at all!
> > > > Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> > > > 
> > > >    https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> > > > 
> > > Thanks, we'll take a look :)
> > > 
> > > > Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > I'll reply to your patch here. You have:
> > 
> > +                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > +                    cpu->hvf->sleeping = true;
> > +                    smp_mb();
> > +
> > +                    /* Bail out if we received an IRQ meanwhile */
> > +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > +                        cpu->hvf->sleeping = false;
> > +                        break;
> > +                    }
> > +
> > +                    /* nanosleep returns on signal, so we wake up on kick. */
> > +                    nanosleep(ts, NULL);
> > 
> > and then send the signal conditional on whether sleeping is true, but
> > I think this is racy. If the signal is sent after sleeping is set to
> > true but before entering nanosleep then I think it will be ignored and
> > we will miss the wakeup. That's why in my implementation I block IPI
> > on the CPU thread at startup and then use pselect to atomically
> > unblock and begin sleeping. The signal is sent unconditionally so
> > there's no need to worry about races between actually sleeping and the
> > "we think we're sleeping" state. It may lead to an extra wakeup but
> > that's better than missing it entirely.
> 
> 
> Thanks a bunch for the comment! So the trick I was using here is to modify
> the timespec from the kick function before sending the IPI signal. That way,
> we know that either we are inside the sleep (where the signal wakes it up)
> or we are outside the sleep (where timespec={} will make it return
> immediately).
> 
> The only race I can think of is if nanosleep does calculations based on the
> timespec and we happen to send the signal right there and then.
> 
> The problem with blocking IPIs is basically what Frank was describing
> earlier: How do you unset the IPI signal pending status? If the signal is
> never delivered, how can pselect differentiate "signal from last time is
> still pending" from "new signal because I got an IPI"?
> 
> 

Hi Alex,

There was a patch for x86 HVF that implements CPU kick and it wasn't
merged (mostly because of my laziness). It has some changes like the ones
you introduced in the series, plus VMX-specific handling of the preemption
timer to guarantee interrupt delivery without kick loss:

https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolshakov@yadro.com/

I wonder if it'd be possible to have common handling of kicks for both
x86 and arm (given that the arch-specific bits are wrapped)?

Thanks,
Roman
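The split Roman suggests might look roughly like this as a toy standalone sketch. The hook name hvf_arch_kick_vcpu is hypothetical; cpus_kick_thread and hvf_kick_vcpu_thread do exist in the series, but here they are stubbed out so the structure stands alone:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct CPUState { int index; } CPUState;

static bool ipi_sent;
static bool arch_kicked;

/* Common part: the signal-based kick every accelerator shares. */
static void cpus_kick_thread(CPUState *cpu)
{
    (void)cpu;
    ipi_sent = true;   /* stand-in for pthread_kill(thread, SIG_IPI) */
}

/*
 * Arch-specific part, chosen at build time: on ARM this would call
 * hv_vcpus_exit(); on x86 it could arm the VMX preemption timer so a
 * kick racing with VM entry still forces a prompt exit.
 */
static void hvf_arch_kick_vcpu(CPUState *cpu)
{
    (void)cpu;
    arch_kicked = true;
}

/* Accel-level entry point called by QEMU's cpus layer. */
static void hvf_kick_vcpu_thread(CPUState *cpu)
{
    cpus_kick_thread(cpu);
    hvf_arch_kick_vcpu(cpu);
}
```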


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-11-30 22:10               ` Peter Maydell
@ 2020-12-01  2:49                 ` Frank Yang
  0 siblings, 0 replies; 64+ messages in thread
From: Frank Yang @ 2020-12-01  2:49 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alexander Graf, Roman Bolshakov, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini, Peter Collingbourne

[-- Attachment #1: Type: text/plain, Size: 975 bytes --]

On Mon, Nov 30, 2020 at 2:10 PM Peter Maydell <peter.maydell@linaro.org>
wrote:

> On Mon, 30 Nov 2020 at 20:56, Frank Yang <lfy@google.com> wrote:
> > We'd actually like to contribute upstream too :) We do want to maintain
> > our own downstream though; the Android Emulator codebase needs to work
> > solidly on macOS and Windows, which has made keeping up with upstream
> > difficult.
>
> One of the main reasons why OSX and Windows support upstream is
> not so great is because very few people are helping to develop,
> test and support it upstream. The way to fix that IMHO is for more
> people who do care about those platforms to actively engage
> with us upstream to help in making those platforms move closer to
> being first class citizens. If you stay on a downstream fork
> forever then I don't think you'll ever see things improve.
>
> thanks
> -- PMM
>

That's a really good point. I'll definitely be more active about sending
comments upstream in the future :)

Frank


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01  0:13                         ` Alexander Graf
@ 2020-12-01  8:21                           ` Peter Collingbourne via
  2020-12-01 11:16                             ` Alexander Graf
  2020-12-01 16:26                             ` Alexander Graf
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Collingbourne via @ 2020-12-01  8:21 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Collingbourne, Frank Yang, Roman Bolshakov, Peter Maydell,
	Eduardo Habkost, Richard Henderson, qemu-devel, Cameron Esfahani,
	qemu-arm, Claudio Fontana, Paolo Bonzini

Sleep on WFx until the VTIMER is due but allow ourselves to be woken
up on IPI.

Signed-off-by: Peter Collingbourne <pcc@google.com>
---
Alexander Graf wrote:
> I would love to take a patch from you here :). I'll still be stuck for a
> while with the sysreg sync rework that Peter asked for before I can look
> at WFI again.

Okay, here's a patch :) It's a relatively straightforward adaptation
of what we have in our fork, which can now boot Android to GUI while
remaining at around 4% CPU when idle.

I'm not set up to boot a full Linux distribution at the moment so I
tested it on upstream QEMU by running a recent mainline Linux kernel
with a rootfs containing an init program that just does sleep(5)
and verified that the qemu process remains at low CPU usage during
the sleep. This was on top of your v2 plus the last patch of your v1
since it doesn't look like you have a replacement for that logic yet.

 accel/hvf/hvf-cpus.c     |  5 +--
 include/sysemu/hvf_int.h |  3 +-
 target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
 3 files changed, 28 insertions(+), 74 deletions(-)

diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index 4360f64671..b2c8fb57f6 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
     sigact.sa_handler = dummy_signal;
     sigaction(SIG_IPI, &sigact, NULL);
 
-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
-    pthread_sigmask(SIG_SETMASK, &set, NULL);
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
+    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
 
 #ifdef __aarch64__
     r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index c56baa3ae8..13adf6ea77 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -62,8 +62,7 @@ extern HVFState *hvf_state;
 struct hvf_vcpu_state {
     uint64_t fd;
     void *exit;
-    struct timespec ts;
-    bool sleeping;
+    sigset_t unblock_ipi_mask;
 };
 
 void assert_hvf_ok(hv_return_t ret);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index 8fe10966d2..60a361ff38 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -2,6 +2,7 @@
  * QEMU Hypervisor.framework support for Apple Silicon
 
  * Copyright 2020 Alexander Graf <agraf@csgraf.de>
+ * Copyright 2020 Google LLC
  *
  * This work is licensed under the terms of the GNU GPL, version 2 or later.
  * See the COPYING file in the top-level directory.
@@ -18,6 +19,7 @@
 #include "sysemu/hw_accel.h"
 
 #include <Hypervisor/Hypervisor.h>
+#include <mach/mach_time.h>
 
 #include "exec/address-spaces.h"
 #include "hw/irq.h"
@@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
 
 void hvf_kick_vcpu_thread(CPUState *cpu)
 {
-    if (cpu->hvf->sleeping) {
-        /*
-         * When sleeping, make sure we always send signals. Also, clear the
-         * timespec, so that an IPI that arrives between setting hvf->sleeping
-         * and the nanosleep syscall still aborts the sleep.
-         */
-        cpu->thread_kicked = false;
-        cpu->hvf->ts = (struct timespec){ };
-        cpus_kick_thread(cpu);
-    } else {
-        hv_vcpus_exit(&cpu->hvf->fd, 1);
-    }
+    cpus_kick_thread(cpu);
+    hv_vcpus_exit(&cpu->hvf->fd, 1);
 }
 
 static int hvf_inject_interrupts(CPUState *cpu)
@@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
         uint64_t syndrome = hvf_exit->exception.syndrome;
         uint32_t ec = syn_get_ec(syndrome);
 
+        qemu_mutex_lock_iothread();
         switch (exit_reason) {
         case HV_EXIT_REASON_EXCEPTION:
             /* This is the main one, handle below. */
             break;
         case HV_EXIT_REASON_VTIMER_ACTIVATED:
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
             qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
             qemu_mutex_unlock_iothread();
             continue;
         case HV_EXIT_REASON_CANCELED:
             /* we got kicked, no exit to process */
+            qemu_mutex_unlock_iothread();
             continue;
         default:
             assert(0);
@@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
             uint32_t srt = (syndrome >> 16) & 0x1f;
             uint64_t val = 0;
 
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
 
             DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
@@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
                 hvf_set_reg(cpu, srt, val);
             }
 
-            qemu_mutex_unlock_iothread();
-
             advance_pc = true;
             break;
         }
@@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
         case EC_WFX_TRAP:
             if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
                 (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                uint64_t cval, ctl, val, diff, now;
+                uint64_t cval;
 
-                /* Set up a local timer for vtimer if necessary ... */
-                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
-                assert_hvf_ok(r);
                 r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
                 assert_hvf_ok(r);
 
-                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
-                diff = cval - val;
-
-                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
-                      gt_cntfrq_period_ns(arm_cpu);
-
-                /* Timer disabled or masked, just wait for long */
-                if (!(ctl & 1) || (ctl & 2)) {
-                    diff = (120 * NANOSECONDS_PER_SECOND) /
-                           gt_cntfrq_period_ns(arm_cpu);
+                int64_t ticks_to_sleep = cval - mach_absolute_time();
+                if (ticks_to_sleep < 0) {
+                    break;
                 }
 
-                if (diff < INT64_MAX) {
-                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
-                    struct timespec *ts = &cpu->hvf->ts;
-
-                    *ts = (struct timespec){
-                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
-                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
-                    };
-
-                    /*
-                     * Waking up easily takes 1ms, don't go to sleep for smaller
-                     * time periods than 2ms.
-                     */
-                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
-                        advance_pc = true;
-                        break;
-                    }
-
-                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
-                    cpu->hvf->sleeping = true;
-                    smp_mb();
-
-                    /* Bail out if we received an IRQ meanwhile */
-                    if (cpu->thread_kicked || (cpu->interrupt_request &
-                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                        cpu->hvf->sleeping = false;
-                        break;
-                    }
-
-                    /* nanosleep returns on signal, so we wake up on kick. */
-                    nanosleep(ts, NULL);
-
-                    /* Out of sleep - either naturally or because of a kick */
-                    cpu->hvf->sleeping = false;
-                }
+                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
+                uint64_t nanos =
+                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
+                    1000000000 / arm_cpu->gt_cntfrq_hz;
+                struct timespec ts = { seconds, nanos };
+
+                /*
+                 * Use pselect to sleep so that other threads can IPI us while
+                 * we're sleeping.
+                 */
+                qatomic_mb_set(&cpu->thread_kicked, false);
+                qemu_mutex_unlock_iothread();
+                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
+                qemu_mutex_lock_iothread();
 
                 advance_pc = true;
             }
             break;
         case EC_AA64_HVC:
             cpu_synchronize_state(cpu);
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
             if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
                 arm_handle_psci_call(arm_cpu);
@@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
                 DPRINTF("unknown HVC! %016llx", env->xregs[0]);
                 env->xregs[0] = -1;
             }
-            qemu_mutex_unlock_iothread();
             break;
         case EC_AA64_SMC:
             cpu_synchronize_state(cpu);
-            qemu_mutex_lock_iothread();
             current_cpu = cpu;
             if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
                 arm_handle_psci_call(arm_cpu);
@@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
                 env->xregs[0] = -1;
                 env->pc += 4;
             }
-            qemu_mutex_unlock_iothread();
             break;
         default:
             cpu_synchronize_state(cpu);
@@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
             r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
             assert_hvf_ok(r);
         }
+        qemu_mutex_unlock_iothread();
     } while (ret == 0);
 
     qemu_mutex_lock_iothread();
-- 
2.29.2.454.gaff20da3a2-goog



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01  8:21                           ` [PATCH] arm/hvf: Optimize and simplify WFI handling Peter Collingbourne via
@ 2020-12-01 11:16                             ` Alexander Graf
  2020-12-01 18:59                               ` Peter Collingbourne
  2020-12-01 16:26                             ` Alexander Graf
  1 sibling, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-12-01 11:16 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini

Hi Peter,

On 01.12.20 09:21, Peter Collingbourne wrote:
> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> up on IPI.
>
> Signed-off-by: Peter Collingbourne <pcc@google.com>


Thanks a bunch!


> ---
> Alexander Graf wrote:
>> I would love to take a patch from you here :). I'll still be stuck for a
>> while with the sysreg sync rework that Peter asked for before I can look
>> at WFI again.
> Okay, here's a patch :) It's a relatively straightforward adaptation
> of what we have in our fork, which can now boot Android to GUI while
> remaining at around 4% CPU when idle.
>
> I'm not set up to boot a full Linux distribution at the moment so I
> tested it on upstream QEMU by running a recent mainline Linux kernel
> with a rootfs containing an init program that just does sleep(5)
> and verified that the qemu process remains at low CPU usage during
> the sleep. This was on top of your v2 plus the last patch of your v1
> since it doesn't look like you have a replacement for that logic yet.
>
>   accel/hvf/hvf-cpus.c     |  5 +--
>   include/sysemu/hvf_int.h |  3 +-
>   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
>   3 files changed, 28 insertions(+), 74 deletions(-)
>
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> index 4360f64671..b2c8fb57f6 100644
> --- a/accel/hvf/hvf-cpus.c
> +++ b/accel/hvf/hvf-cpus.c
> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
>       sigact.sa_handler = dummy_signal;
>       sigaction(SIG_IPI, &sigact, NULL);
>   
> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> -    sigdelset(&set, SIG_IPI);
> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);


What will this do to the x86 hvf implementation? We're now not 
unblocking SIG_IPI again for that, right?


>   
>   #ifdef __aarch64__
>       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> index c56baa3ae8..13adf6ea77 100644
> --- a/include/sysemu/hvf_int.h
> +++ b/include/sysemu/hvf_int.h
> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
>   struct hvf_vcpu_state {
>       uint64_t fd;
>       void *exit;
> -    struct timespec ts;
> -    bool sleeping;
> +    sigset_t unblock_ipi_mask;
>   };
>   
>   void assert_hvf_ok(hv_return_t ret);
> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> index 8fe10966d2..60a361ff38 100644
> --- a/target/arm/hvf/hvf.c
> +++ b/target/arm/hvf/hvf.c
> @@ -2,6 +2,7 @@
>    * QEMU Hypervisor.framework support for Apple Silicon
>   
>    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> + * Copyright 2020 Google LLC
>    *
>    * This work is licensed under the terms of the GNU GPL, version 2 or later.
>    * See the COPYING file in the top-level directory.
> @@ -18,6 +19,7 @@
>   #include "sysemu/hw_accel.h"
>   
>   #include <Hypervisor/Hypervisor.h>
> +#include <mach/mach_time.h>
>   
>   #include "exec/address-spaces.h"
>   #include "hw/irq.h"
> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>   
>   void hvf_kick_vcpu_thread(CPUState *cpu)
>   {
> -    if (cpu->hvf->sleeping) {
> -        /*
> -         * When sleeping, make sure we always send signals. Also, clear the
> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> -         * and the nanosleep syscall still aborts the sleep.
> -         */
> -        cpu->thread_kicked = false;
> -        cpu->hvf->ts = (struct timespec){ };
> -        cpus_kick_thread(cpu);
> -    } else {
> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> -    }
> +    cpus_kick_thread(cpu);
> +    hv_vcpus_exit(&cpu->hvf->fd, 1);


This means your first WFI will almost always return immediately due to a 
pending signal, because there probably was an IRQ pending before on the 
same CPU, no?


>   }
>   
>   static int hvf_inject_interrupts(CPUState *cpu)
> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
>           uint64_t syndrome = hvf_exit->exception.syndrome;
>           uint32_t ec = syn_get_ec(syndrome);
>   
> +        qemu_mutex_lock_iothread();


Is there a particular reason you're moving the iothread lock out again 
from the individual bits? I would really like to keep a notion of fast 
path exits.


>           switch (exit_reason) {
>           case HV_EXIT_REASON_EXCEPTION:
>               /* This is the main one, handle below. */
>               break;
>           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
>               qemu_mutex_unlock_iothread();
>               continue;
>           case HV_EXIT_REASON_CANCELED:
>               /* we got kicked, no exit to process */
> +            qemu_mutex_unlock_iothread();
>               continue;
>           default:
>               assert(0);
> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>               uint32_t srt = (syndrome >> 16) & 0x1f;
>               uint64_t val = 0;
>   
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>   
>               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>                   hvf_set_reg(cpu, srt, val);
>               }
>   
> -            qemu_mutex_unlock_iothread();
> -
>               advance_pc = true;
>               break;
>           }
> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
>           case EC_WFX_TRAP:
>               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> -                uint64_t cval, ctl, val, diff, now;
> +                uint64_t cval;
>   
> -                /* Set up a local timer for vtimer if necessary ... */
> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> -                assert_hvf_ok(r);
>                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
>                   assert_hvf_ok(r);
>   
> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> -                diff = cval - val;
> -
> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> -                      gt_cntfrq_period_ns(arm_cpu);
> -
> -                /* Timer disabled or masked, just wait for long */
> -                if (!(ctl & 1) || (ctl & 2)) {
> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> -                           gt_cntfrq_period_ns(arm_cpu);
> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> +                if (ticks_to_sleep < 0) {
> +                    break;


This will loop at 100% for Windows, which configures the vtimer as
cval=0 ctl=7, i.e. with the IRQ mask bit set.


Alex
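A guard for that case could be sketched as follows, assuming the architectural CNTV_CTL_EL0 encoding (ENABLE = bit 0, IMASK = bit 1). It mirrors the "wait for long" branch of the v1 logic quoted above: if the timer can never fire, sleep for a long bounded period and rely on the IPI to wake us.

```c
#include <assert.h>
#include <stdint.h>

/* CNTV_CTL_EL0 bits (ARMv8: bit 0 = ENABLE, bit 1 = IMASK) */
#define CNTV_CTL_ENABLE (1u << 0)
#define CNTV_CTL_IMASK  (1u << 1)

/*
 * If the vtimer is disabled or its interrupt is masked, it can never
 * wake the vCPU, so "sleep until cval" is meaningless -- Windows programs
 * cval=0 ctl=7, which would make cval - now negative and busy-loop.
 * Fall back to ~2 minutes of timer ticks instead.
 */
static int64_t effective_ticks_to_sleep(uint64_t ctl, uint64_t cval,
                                        uint64_t now, uint64_t gt_cntfrq_hz)
{
    if (!(ctl & CNTV_CTL_ENABLE) || (ctl & CNTV_CTL_IMASK)) {
        return 120 * (int64_t)gt_cntfrq_hz;
    }
    return (int64_t)(cval - now);
}
```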


>                   }
>   
> -                if (diff < INT64_MAX) {
> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> -                    struct timespec *ts = &cpu->hvf->ts;
> -
> -                    *ts = (struct timespec){
> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> -                    };
> -
> -                    /*
> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> -                     * time periods than 2ms.
> -                     */
> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {


I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to 
return. Without logic like this, super short WFIs will hurt performance 
quite badly.


Alex
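The threshold Alex mentions could be carried over to the new tick-based computation roughly like this. A sketch only: SCALE_MS follows QEMU's nanosecond-scaling convention, and gt_cntfrq_hz is the guest timer frequency (e.g. 24 MHz on Apple Silicon).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NANOSECONDS_PER_SECOND 1000000000LL
#define SCALE_MS 1000000LL   /* nanoseconds per millisecond, as in QEMU */

/*
 * Convert a tick count (assumed >= 0) into a timespec and report whether
 * the sleep is too short to be worth a pselect() call: waking from the
 * syscall alone can cost 1-2 ms, so sleeps under 2 ms should just spin.
 */
static bool too_short_to_sleep(int64_t ticks_to_sleep, uint64_t gt_cntfrq_hz,
                               struct timespec *ts)
{
    uint64_t seconds = ticks_to_sleep / gt_cntfrq_hz;
    uint64_t nanos = (ticks_to_sleep - gt_cntfrq_hz * seconds) *
                     NANOSECONDS_PER_SECOND / gt_cntfrq_hz;

    ts->tv_sec = seconds;
    ts->tv_nsec = nanos;

    return !ts->tv_sec && ts->tv_nsec < 2 * SCALE_MS;
}
```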

> -                        advance_pc = true;
> -                        break;
> -                    }
> -
> -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> -                    cpu->hvf->sleeping = true;
> -                    smp_mb();
> -
> -                    /* Bail out if we received an IRQ meanwhile */
> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> -                        cpu->hvf->sleeping = false;
> -                        break;
> -                    }
> -
> -                    /* nanosleep returns on signal, so we wake up on kick. */
> -                    nanosleep(ts, NULL);
> -
> -                    /* Out of sleep - either naturally or because of a kick */
> -                    cpu->hvf->sleeping = false;
> -                }
> +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> +                uint64_t nanos =
> +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> +                struct timespec ts = { seconds, nanos };
> +
> +                /*
> +                 * Use pselect to sleep so that other threads can IPI us while
> +                 * we're sleeping.
> +                 */
> +                qatomic_mb_set(&cpu->thread_kicked, false);
> +                qemu_mutex_unlock_iothread();
> +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> +                qemu_mutex_lock_iothread();
>   
>                   advance_pc = true;
>               }
>               break;
>           case EC_AA64_HVC:
>               cpu_synchronize_state(cpu);
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
>                   arm_handle_psci_call(arm_cpu);
> @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
>                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
>                   env->xregs[0] = -1;
>               }
> -            qemu_mutex_unlock_iothread();
>               break;
>           case EC_AA64_SMC:
>               cpu_synchronize_state(cpu);
> -            qemu_mutex_lock_iothread();
>               current_cpu = cpu;
>               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
>                   arm_handle_psci_call(arm_cpu);
> @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>                   env->xregs[0] = -1;
>                   env->pc += 4;
>               }
> -            qemu_mutex_unlock_iothread();
>               break;
>           default:
>               cpu_synchronize_state(cpu);
> @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
>               assert_hvf_ok(r);
>           }
> +        qemu_mutex_unlock_iothread();
>       } while (ret == 0);
>   
>       qemu_mutex_lock_iothread();


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01  8:21                           ` [PATCH] arm/hvf: Optimize and simplify WFI handling Peter Collingbourne via
  2020-12-01 11:16                             ` Alexander Graf
@ 2020-12-01 16:26                             ` Alexander Graf
  2020-12-01 20:03                               ` Peter Collingbourne
  1 sibling, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-12-01 16:26 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 01.12.20 09:21, Peter Collingbourne wrote:
> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> up on IPI.
>
> Signed-off-by: Peter Collingbourne <pcc@google.com>
> ---
> Alexander Graf wrote:
>> I would love to take a patch from you here :). I'll still be stuck for a
>> while with the sysreg sync rework that Peter asked for before I can look
>> at WFI again.
> Okay, here's a patch :) It's a relatively straightforward adaptation
> of what we have in our fork, which can now boot Android to GUI while
> remaining at around 4% CPU when idle.
>
> I'm not set up to boot a full Linux distribution at the moment so I
> tested it on upstream QEMU by running a recent mainline Linux kernel
> with a rootfs containing an init program that just does sleep(5)
> and verified that the qemu process remains at low CPU usage during
> the sleep. This was on top of your v2 plus the last patch of your v1
> since it doesn't look like you have a replacement for that logic yet.


How about something like this instead?


Alex


diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index 4360f64671..50384013ea 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
      cpu->hvf = g_malloc0(sizeof(*cpu->hvf));

      /* init cpu signals */
-    sigset_t set;
      struct sigaction sigact;

      memset(&sigact, 0, sizeof(sigact));
      sigact.sa_handler = dummy_signal;
      sigaction(SIG_IPI, &sigact, NULL);

-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
-    pthread_sigmask(SIG_SETMASK, &set, NULL);
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
+    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
+
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
+    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);

  #ifdef __aarch64__
      r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index c56baa3ae8..6e237f2db0 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -62,8 +62,9 @@ extern HVFState *hvf_state;
  struct hvf_vcpu_state {
      uint64_t fd;
      void *exit;
-    struct timespec ts;
      bool sleeping;
+    sigset_t sigmask;
+    sigset_t sigmask_ipi;
  };

  void assert_hvf_ok(hv_return_t ret);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index 0c01a03725..350b845e6e 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)

  void hvf_kick_vcpu_thread(CPUState *cpu)
  {
-    if (cpu->hvf->sleeping) {
-        /*
-         * When sleeping, make sure we always send signals. Also, clear the
-         * timespec, so that an IPI that arrives between setting hvf->sleeping
-         * and the nanosleep syscall still aborts the sleep.
-         */
-        cpu->thread_kicked = false;
-        cpu->hvf->ts = (struct timespec){ };
+    if (qatomic_read(&cpu->hvf->sleeping)) {
+        /* When sleeping, send a signal to get out of pselect */
          cpus_kick_thread(cpu);
      } else {
          hv_vcpus_exit(&cpu->hvf->fd, 1);
      }
  }

+static void hvf_block_sig_ipi(CPUState *cpu)
+{
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
+}
+
+static void hvf_unblock_sig_ipi(CPUState *cpu)
+{
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
+}
+
  static int hvf_inject_interrupts(CPUState *cpu)
  {
      if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
@@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
      ARMCPU *arm_cpu = ARM_CPU(cpu);
      CPUARMState *env = &arm_cpu->env;
      hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
+    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
      hv_return_t r;
      int ret = 0;

@@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
              break;
          }
          case EC_WFX_TRAP:
-            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
-                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+            if (!(syndrome & WFX_IS_WFE) &&
+                !(cpu->interrupt_request & irq_mask)) {
                  uint64_t cval, ctl, val, diff, now;

                  /* Set up a local timer for vtimer if necessary ... */
@@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)

                  if (diff < INT64_MAX) {
                      uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
-                    struct timespec *ts = &cpu->hvf->ts;
-
-                    *ts = (struct timespec){
+                    struct timespec ts = {
                          .tv_sec = ns / NANOSECONDS_PER_SECOND,
                          .tv_nsec = ns % NANOSECONDS_PER_SECOND,
                      };
@@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
                     * Waking up easily takes 1ms, don't go to sleep for smaller
                       * time periods than 2ms.
                       */
-                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
+                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
                          advance_pc = true;
                          break;
                      }

+                    /* block SIG_IPI for the sleep */
+                    hvf_block_sig_ipi(cpu);
+                    cpu->thread_kicked = false;
+
                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
-                    cpu->hvf->sleeping = true;
-                    smp_mb();
+                    qatomic_set(&cpu->hvf->sleeping, true);

-                    /* Bail out if we received an IRQ meanwhile */
-                    if (cpu->thread_kicked || (cpu->interrupt_request &
-                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                        cpu->hvf->sleeping = false;
+                    /* Bail out if we received a kick meanwhile */
+                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
+                        qatomic_set(&cpu->hvf->sleeping, false);
+                        hvf_unblock_sig_ipi(cpu);
                          break;
                      }

-                    /* nanosleep returns on signal, so we wake up on kick. */
-                    nanosleep(ts, NULL);
+                    /* pselect returns on kick signal and consumes it */
+                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);

                    /* Out of sleep - either naturally or because of a kick */
-                    cpu->hvf->sleeping = false;
+                    qatomic_set(&cpu->hvf->sleeping, false);
+                    hvf_unblock_sig_ipi(cpu);
                  }

                  advance_pc = true;




* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 11:16                             ` Alexander Graf
@ 2020-12-01 18:59                               ` Peter Collingbourne
  2020-12-01 22:03                                 ` Alexander Graf
  2020-12-03 10:12                                 ` Roman Bolshakov
  0 siblings, 2 replies; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-01 18:59 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
>
> Hi Peter,
>
> On 01.12.20 09:21, Peter Collingbourne wrote:
> > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > up on IPI.
> >
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
>
>
> Thanks a bunch!
>
>
> > ---
> > Alexander Graf wrote:
> >> I would love to take a patch from you here :). I'll still be stuck for a
> >> while with the sysreg sync rework that Peter asked for before I can look
> >> at WFI again.
> > Okay, here's a patch :) It's a relatively straightforward adaptation
> > of what we have in our fork, which can now boot Android to GUI while
> > remaining at around 4% CPU when idle.
> >
> > I'm not set up to boot a full Linux distribution at the moment so I
> > tested it on upstream QEMU by running a recent mainline Linux kernel
> > with a rootfs containing an init program that just does sleep(5)
> > and verified that the qemu process remains at low CPU usage during
> > the sleep. This was on top of your v2 plus the last patch of your v1
> > since it doesn't look like you have a replacement for that logic yet.
> >
> >   accel/hvf/hvf-cpus.c     |  5 +--
> >   include/sysemu/hvf_int.h |  3 +-
> >   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> >   3 files changed, 28 insertions(+), 74 deletions(-)
> >
> > diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> > index 4360f64671..b2c8fb57f6 100644
> > --- a/accel/hvf/hvf-cpus.c
> > +++ b/accel/hvf/hvf-cpus.c
> > @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> >       sigact.sa_handler = dummy_signal;
> >       sigaction(SIG_IPI, &sigact, NULL);
> >
> > -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> > -    sigdelset(&set, SIG_IPI);
> > -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> > +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> > +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
>
>
> What will this do to the x86 hvf implementation? We're now not
> unblocking SIG_IPI again for that, right?

Yes and that was the case before your patch series.

> >
> >   #ifdef __aarch64__
> >       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> > diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> > index c56baa3ae8..13adf6ea77 100644
> > --- a/include/sysemu/hvf_int.h
> > +++ b/include/sysemu/hvf_int.h
> > @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> >   struct hvf_vcpu_state {
> >       uint64_t fd;
> >       void *exit;
> > -    struct timespec ts;
> > -    bool sleeping;
> > +    sigset_t unblock_ipi_mask;
> >   };
> >
> >   void assert_hvf_ok(hv_return_t ret);
> > diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> > index 8fe10966d2..60a361ff38 100644
> > --- a/target/arm/hvf/hvf.c
> > +++ b/target/arm/hvf/hvf.c
> > @@ -2,6 +2,7 @@
> >    * QEMU Hypervisor.framework support for Apple Silicon
> >
> >    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> > + * Copyright 2020 Google LLC
> >    *
> >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >    * See the COPYING file in the top-level directory.
> > @@ -18,6 +19,7 @@
> >   #include "sysemu/hw_accel.h"
> >
> >   #include <Hypervisor/Hypervisor.h>
> > +#include <mach/mach_time.h>
> >
> >   #include "exec/address-spaces.h"
> >   #include "hw/irq.h"
> > @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >
> >   void hvf_kick_vcpu_thread(CPUState *cpu)
> >   {
> > -    if (cpu->hvf->sleeping) {
> > -        /*
> > -         * When sleeping, make sure we always send signals. Also, clear the
> > -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> > -         * and the nanosleep syscall still aborts the sleep.
> > -         */
> > -        cpu->thread_kicked = false;
> > -        cpu->hvf->ts = (struct timespec){ };
> > -        cpus_kick_thread(cpu);
> > -    } else {
> > -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> > -    }
> > +    cpus_kick_thread(cpu);
> > +    hv_vcpus_exit(&cpu->hvf->fd, 1);
>
>
> This means your first WFI will almost always return immediately due to a
> pending signal, because there probably was an IRQ pending before on the
> same CPU, no?

That's right. Any approach involving the "sleeping" field would need
to be implemented carefully to avoid races that may result in missed
wakeups, so for simplicity I just decided to send both kinds of
wakeups. In particular, the approach in the updated patch you sent is
racy, and I'll elaborate more in the reply to that patch.

> >   }
> >
> >   static int hvf_inject_interrupts(CPUState *cpu)
> > @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> >           uint64_t syndrome = hvf_exit->exception.syndrome;
> >           uint32_t ec = syn_get_ec(syndrome);
> >
> > +        qemu_mutex_lock_iothread();
>
>
> Is there a particular reason you're moving the iothread lock out again
> from the individual bits? I would really like to keep a notion of fast
> path exits.

We still need to lock at least once no matter the exit reason in order
to check the interrupts, so I don't think it's worth trying to avoid
locking like this. It also makes the implementation easier to reason
about and therefore more likely to be correct. In our implementation
we just stay locked the whole time unless we're in hv_vcpu_run() or
pselect().
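
Schematically, that locking discipline amounts to something like the
following (a rough sketch of the shape of the loop, not the actual
patch code):

```c
/* Sketch: one lock/unlock pair per exit; everything else (MMIO,
 * HVC/SMC, sysreg handling) runs while holding the iothread lock. */
do {
    qemu_mutex_unlock_iothread();
    r = hv_vcpu_run(cpu->hvf->fd);   /* guest runs without the lock */
    qemu_mutex_lock_iothread();      /* single lock acquisition per exit */
    assert_hvf_ok(r);
    switch (exit_reason) {
        /* every case already holds the lock here */
        ...
    }
} while (ret == 0);
```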

> >           switch (exit_reason) {
> >           case HV_EXIT_REASON_EXCEPTION:
> >               /* This is the main one, handle below. */
> >               break;
> >           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> >               qemu_mutex_unlock_iothread();
> >               continue;
> >           case HV_EXIT_REASON_CANCELED:
> >               /* we got kicked, no exit to process */
> > +            qemu_mutex_unlock_iothread();
> >               continue;
> >           default:
> >               assert(0);
> > @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >               uint32_t srt = (syndrome >> 16) & 0x1f;
> >               uint64_t val = 0;
> >
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >
> >               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> > @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >                   hvf_set_reg(cpu, srt, val);
> >               }
> >
> > -            qemu_mutex_unlock_iothread();
> > -
> >               advance_pc = true;
> >               break;
> >           }
> > @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> >           case EC_WFX_TRAP:
> >               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > -                uint64_t cval, ctl, val, diff, now;
> > +                uint64_t cval;
> >
> > -                /* Set up a local timer for vtimer if necessary ... */
> > -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> > -                assert_hvf_ok(r);
> >                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> >                   assert_hvf_ok(r);
> >
> > -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> > -                diff = cval - val;
> > -
> > -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> > -                      gt_cntfrq_period_ns(arm_cpu);
> > -
> > -                /* Timer disabled or masked, just wait for long */
> > -                if (!(ctl & 1) || (ctl & 2)) {
> > -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> > -                           gt_cntfrq_period_ns(arm_cpu);
> > +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> > +                if (ticks_to_sleep < 0) {
> > +                    break;
>
>
> This will loop at 100% for Windows, which configures the vtimer as
> cval=0 ctl=7, so with IRQ mask bit set.

Okay, but the 120s is kind of arbitrary, so we should just sleep until
we get a signal. That can be done by passing NULL as the timespec
argument to pselect().

>
>
> Alex
>
>
> >                   }
> >
> > -                if (diff < INT64_MAX) {
> > -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> > -                    struct timespec *ts = &cpu->hvf->ts;
> > -
> > -                    *ts = (struct timespec){
> > -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> > -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> > -                    };
> > -
> > -                    /*
> > -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> > -                     * time periods than 2ms.
> > -                     */
> > -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>
>
> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> return. Without logic like this, super short WFIs will hurt performance
> quite badly.

I don't think that's accurate. According to this benchmark it's a few
hundred nanoseconds at most.

pcc@pac-mini /tmp> cat pselect.c
#include <signal.h>
#include <sys/select.h>

int main() {
  sigset_t mask, orig_mask;
  pthread_sigmask(SIG_SETMASK, 0, &mask);
  sigaddset(&mask, SIGUSR1);
  pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);

  for (int i = 0; i != 1000000; ++i) {
    struct timespec ts = { 0, 1 };
    pselect(0, 0, 0, 0, &ts, &orig_mask);
  }
}
pcc@pac-mini /tmp> time ./pselect

________________________________________________________
Executed in  179.87 millis    fish           external
   usr time   77.68 millis   57.00 micros   77.62 millis
   sys time  101.37 millis  852.00 micros  100.52 millis

Besides, all that you're really saving here is the single pselect
call. There are no doubt more expensive syscalls involved in exiting
and entering the VCPU that would dominate here.

Peter

>
>
> Alex
>
> > -                        advance_pc = true;
> > -                        break;
> > -                    }
> > -
> > -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > -                    cpu->hvf->sleeping = true;
> > -                    smp_mb();
> > -
> > -                    /* Bail out if we received an IRQ meanwhile */
> > -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > -                        cpu->hvf->sleeping = false;
> > -                        break;
> > -                    }
> > -
> > -                    /* nanosleep returns on signal, so we wake up on kick. */
> > -                    nanosleep(ts, NULL);
> > -
> > -                    /* Out of sleep - either naturally or because of a kick */
> > -                    cpu->hvf->sleeping = false;
> > -                }
> > +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> > +                uint64_t nanos =
> > +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> > +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> > +                struct timespec ts = { seconds, nanos };
> > +
> > +                /*
> > +                 * Use pselect to sleep so that other threads can IPI us while
> > +                 * we're sleeping.
> > +                 */
> > +                qatomic_mb_set(&cpu->thread_kicked, false);
> > +                qemu_mutex_unlock_iothread();
> > +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> > +                qemu_mutex_lock_iothread();
> >
> >                   advance_pc = true;
> >               }
> >               break;
> >           case EC_AA64_HVC:
> >               cpu_synchronize_state(cpu);
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
> >                   arm_handle_psci_call(arm_cpu);
> > @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
> >                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
> >                   env->xregs[0] = -1;
> >               }
> > -            qemu_mutex_unlock_iothread();
> >               break;
> >           case EC_AA64_SMC:
> >               cpu_synchronize_state(cpu);
> > -            qemu_mutex_lock_iothread();
> >               current_cpu = cpu;
> >               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
> >                   arm_handle_psci_call(arm_cpu);
> > @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >                   env->xregs[0] = -1;
> >                   env->pc += 4;
> >               }
> > -            qemu_mutex_unlock_iothread();
> >               break;
> >           default:
> >               cpu_synchronize_state(cpu);
> > @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> >               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
> >               assert_hvf_ok(r);
> >           }
> > +        qemu_mutex_unlock_iothread();
> >       } while (ret == 0);
> >
> >       qemu_mutex_lock_iothread();



* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 16:26                             ` Alexander Graf
@ 2020-12-01 20:03                               ` Peter Collingbourne
  2020-12-01 22:09                                 ` Alexander Graf
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-01 20:03 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 09:21, Peter Collingbourne wrote:
> > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > up on IPI.
> >
> > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > ---
> > Alexander Graf wrote:
> >> I would love to take a patch from you here :). I'll still be stuck for a
> >> while with the sysreg sync rework that Peter asked for before I can look
> >> at WFI again.
> > Okay, here's a patch :) It's a relatively straightforward adaptation
> > of what we have in our fork, which can now boot Android to GUI while
> > remaining at around 4% CPU when idle.
> >
> > I'm not set up to boot a full Linux distribution at the moment so I
> > tested it on upstream QEMU by running a recent mainline Linux kernel
> > with a rootfs containing an init program that just does sleep(5)
> > and verified that the qemu process remains at low CPU usage during
> > the sleep. This was on top of your v2 plus the last patch of your v1
> > since it doesn't look like you have a replacement for that logic yet.
>
>
> How about something like this instead?
>
>
> Alex
>
>
> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> index 4360f64671..50384013ea 100644
> --- a/accel/hvf/hvf-cpus.c
> +++ b/accel/hvf/hvf-cpus.c
> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
>       cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
>
>       /* init cpu signals */
> -    sigset_t set;
>       struct sigaction sigact;
>
>       memset(&sigact, 0, sizeof(sigact));
>       sigact.sa_handler = dummy_signal;
>       sigaction(SIG_IPI, &sigact, NULL);
>
> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> -    sigdelset(&set, SIG_IPI);
> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> +
> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);

There's no reason to unblock SIG_IPI while not in pselect, and doing
so can easily lead to missed wakeups. The whole point of pselect is
that you can guarantee only one part of your program sees signals,
with no possibility of them being missed.

>
>   #ifdef __aarch64__
>       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> index c56baa3ae8..6e237f2db0 100644
> --- a/include/sysemu/hvf_int.h
> +++ b/include/sysemu/hvf_int.h
> @@ -62,8 +62,9 @@ extern HVFState *hvf_state;
>   struct hvf_vcpu_state {
>       uint64_t fd;
>       void *exit;
> -    struct timespec ts;
>       bool sleeping;
> +    sigset_t sigmask;
> +    sigset_t sigmask_ipi;
>   };
>
>   void assert_hvf_ok(hv_return_t ret);
> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> index 0c01a03725..350b845e6e 100644
> --- a/target/arm/hvf/hvf.c
> +++ b/target/arm/hvf/hvf.c
> @@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>
>   void hvf_kick_vcpu_thread(CPUState *cpu)
>   {
> -    if (cpu->hvf->sleeping) {
> -        /*
> -         * When sleeping, make sure we always send signals. Also, clear the
> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> -         * and the nanosleep syscall still aborts the sleep.
> -         */
> -        cpu->thread_kicked = false;
> -        cpu->hvf->ts = (struct timespec){ };
> +    if (qatomic_read(&cpu->hvf->sleeping)) {
> +        /* When sleeping, send a signal to get out of pselect */
>           cpus_kick_thread(cpu);
>       } else {
>           hv_vcpus_exit(&cpu->hvf->fd, 1);
>       }
>   }
>
> +static void hvf_block_sig_ipi(CPUState *cpu)
> +{
> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
> +}
> +
> +static void hvf_unblock_sig_ipi(CPUState *cpu)
> +{
> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> +}
> +
>   static int hvf_inject_interrupts(CPUState *cpu)
>   {
>       if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
> @@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>       ARMCPU *arm_cpu = ARM_CPU(cpu);
>       CPUARMState *env = &arm_cpu->env;
>       hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
> +    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
>       hv_return_t r;
>       int ret = 0;
>
> @@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
>               break;
>           }
>           case EC_WFX_TRAP:
> -            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> -                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> +            if (!(syndrome & WFX_IS_WFE) &&
> +                !(cpu->interrupt_request & irq_mask)) {
>                   uint64_t cval, ctl, val, diff, now;

I don't think the access to cpu->interrupt_request is safe because it
is done while not holding the iothread lock. That's why, to avoid
these types of issues, I would prefer to hold the lock almost all of
the time.

>                   /* Set up a local timer for vtimer if necessary ... */
> @@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>
>                   if (diff < INT64_MAX) {
>                       uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> -                    struct timespec *ts = &cpu->hvf->ts;
> -
> -                    *ts = (struct timespec){
> +                    struct timespec ts = {
>                           .tv_sec = ns / NANOSECONDS_PER_SECOND,
>                           .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>                       };
> @@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
>                        * Waking up easily takes 1ms, don't go to sleep for smaller
>                        * time periods than 2ms.
>                        */
> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> +                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
>                           advance_pc = true;
>                           break;
>                       }
>
> +                    /* block SIG_IPI for the sleep */
> +                    hvf_block_sig_ipi(cpu);
> +                    cpu->thread_kicked = false;
> +
>                       /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> -                    cpu->hvf->sleeping = true;
> -                    smp_mb();
> +                    qatomic_set(&cpu->hvf->sleeping, true);

This doesn't protect against races because another thread could call
hvf_kick_vcpu_thread() at any time between when we return from
hv_vcpu_run() and when we set sleeping = true, and we would miss the
wakeup (due to hvf_kick_vcpu_thread() seeing sleeping = false and
calling hv_vcpus_exit() instead of pthread_kill()). I don't think it
can be fixed by setting sleeping to true earlier either, because no
matter how early you move it, there will always be a window where we
are about to pselect() but sleeping is false, resulting in a missed
wakeup.
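
To make the window concrete, here is one interleaving in which the
kick is lost (a schematic timeline, not code from the patch):

```
vCPU thread                             kicker thread
-----------                             -------------
hv_vcpu_run() returns (WFx trap)
                                        reads sleeping == false
                                        hv_vcpus_exit()   <- no-op: vCPU is not in hv_vcpu_run()
qatomic_set(&sleeping, true)
pselect(..., &ts, &sigmask)             <- kick already consumed: sleeps the full timeout
```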

Peter

>
> -                    /* Bail out if we received an IRQ meanwhile */
> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> -                        cpu->hvf->sleeping = false;
> +                    /* Bail out if we received a kick meanwhile */
> +                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
> +                        qatomic_set(&cpu->hvf->sleeping, false);
> +                        hvf_unblock_sig_ipi(cpu);
>                           break;
>                       }
>
> -                    /* nanosleep returns on signal, so we wake up on kick. */
> -                    nanosleep(ts, NULL);
> +                    /* pselect returns on kick signal and consumes it */
> +                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);
>
>                       /* Out of sleep - either naturally or because of a kick */
> -                    cpu->hvf->sleeping = false;
> +                    qatomic_set(&cpu->hvf->sleeping, false);
> +                    hvf_unblock_sig_ipi(cpu);
>                   }
>
>                   advance_pc = true;
>



* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 18:59                               ` Peter Collingbourne
@ 2020-12-01 22:03                                 ` Alexander Graf
  2020-12-02  1:19                                   ` Peter Collingbourne
  2020-12-03 10:12                                 ` Roman Bolshakov
  1 sibling, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-12-01 22:03 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 01.12.20 19:59, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
>> Hi Peter,
>>
>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>> up on IPI.
>>>
>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>
>> Thanks a bunch!
>>
>>
>>> ---
>>> Alexander Graf wrote:
>>>> I would love to take a patch from you here :). I'll still be stuck for a
>>>> while with the sysreg sync rework that Peter asked for before I can look
>>>> at WFI again.
>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>> of what we have in our fork, which can now boot Android to GUI while
>>> remaining at around 4% CPU when idle.
>>>
>>> I'm not set up to boot a full Linux distribution at the moment so I
>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>> with a rootfs containing an init program that just does sleep(5)
>>> and verified that the qemu process remains at low CPU usage during
>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>> since it doesn't look like you have a replacement for that logic yet.
>>>
>>>    accel/hvf/hvf-cpus.c     |  5 +--
>>>    include/sysemu/hvf_int.h |  3 +-
>>>    target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
>>>    3 files changed, 28 insertions(+), 74 deletions(-)
>>>
>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> index 4360f64671..b2c8fb57f6 100644
>>> --- a/accel/hvf/hvf-cpus.c
>>> +++ b/accel/hvf/hvf-cpus.c
>>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
>>>        sigact.sa_handler = dummy_signal;
>>>        sigaction(SIG_IPI, &sigact, NULL);
>>>
>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> -    sigdelset(&set, SIG_IPI);
>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
>>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
>>
>> What will this do to the x86 hvf implementation? We're now not
>> unblocking SIG_IPI again for that, right?
> Yes and that was the case before your patch series.


The way I understand Roman, he wanted to unblock the IPI signal on x86:

https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021

I agree that it's not a problem to break it again at this point. I'm not
quite sure how to merge your patches into my patch set though, given that
they basically revert half of my previously introduced code...


>
>>>    #ifdef __aarch64__
>>>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
>>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>> index c56baa3ae8..13adf6ea77 100644
>>> --- a/include/sysemu/hvf_int.h
>>> +++ b/include/sysemu/hvf_int.h
>>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
>>>    struct hvf_vcpu_state {
>>>        uint64_t fd;
>>>        void *exit;
>>> -    struct timespec ts;
>>> -    bool sleeping;
>>> +    sigset_t unblock_ipi_mask;
>>>    };
>>>
>>>    void assert_hvf_ok(hv_return_t ret);
>>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
>>> index 8fe10966d2..60a361ff38 100644
>>> --- a/target/arm/hvf/hvf.c
>>> +++ b/target/arm/hvf/hvf.c
>>> @@ -2,6 +2,7 @@
>>>     * QEMU Hypervisor.framework support for Apple Silicon
>>>
>>>     * Copyright 2020 Alexander Graf <agraf@csgraf.de>
>>> + * Copyright 2020 Google LLC
>>>     *
>>>     * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>     * See the COPYING file in the top-level directory.
>>> @@ -18,6 +19,7 @@
>>>    #include "sysemu/hw_accel.h"
>>>
>>>    #include <Hypervisor/Hypervisor.h>
>>> +#include <mach/mach_time.h>
>>>
>>>    #include "exec/address-spaces.h"
>>>    #include "hw/irq.h"
>>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>>>
>>>    void hvf_kick_vcpu_thread(CPUState *cpu)
>>>    {
>>> -    if (cpu->hvf->sleeping) {
>>> -        /*
>>> -         * When sleeping, make sure we always send signals. Also, clear the
>>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
>>> -         * and the nanosleep syscall still aborts the sleep.
>>> -         */
>>> -        cpu->thread_kicked = false;
>>> -        cpu->hvf->ts = (struct timespec){ };
>>> -        cpus_kick_thread(cpu);
>>> -    } else {
>>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
>>> -    }
>>> +    cpus_kick_thread(cpu);
>>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
>>
>> This means your first WFI will almost always return immediately due to a
>> pending signal, because there probably was an IRQ pending before on the
>> same CPU, no?
> That's right. Any approach involving the "sleeping" field would need
> to be implemented carefully to avoid races that may result in missed
> wakeups so for simplicity I just decided to send both kinds of
> wakeups. In particular the approach in the updated patch you sent is
> racy and I'll elaborate more in the reply to that patch.
>
>>>    }
>>>
>>>    static int hvf_inject_interrupts(CPUState *cpu)
>>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>            uint64_t syndrome = hvf_exit->exception.syndrome;
>>>            uint32_t ec = syn_get_ec(syndrome);
>>>
>>> +        qemu_mutex_lock_iothread();
>>
>> Is there a particular reason you're moving the iothread lock out again
>> from the individual bits? I would really like to keep a notion of fast
>> path exits.
> We still need to lock at least once no matter the exit reason to check
> the interrupts so I don't think it's worth it to try and avoid locking
> like this. It also makes the implementation easier to reason about and
> therefore more likely to be correct. In our implementation we just
> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
>
>>>            switch (exit_reason) {
>>>            case HV_EXIT_REASON_EXCEPTION:
>>>                /* This is the main one, handle below. */
>>>                break;
>>>            case HV_EXIT_REASON_VTIMER_ACTIVATED:
>>> -            qemu_mutex_lock_iothread();
>>>                current_cpu = cpu;
>>>                qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
>>>                qemu_mutex_unlock_iothread();
>>>                continue;
>>>            case HV_EXIT_REASON_CANCELED:
>>>                /* we got kicked, no exit to process */
>>> +            qemu_mutex_unlock_iothread();
>>>                continue;
>>>            default:
>>>                assert(0);
>>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>                uint32_t srt = (syndrome >> 16) & 0x1f;
>>>                uint64_t val = 0;
>>>
>>> -            qemu_mutex_lock_iothread();
>>>                current_cpu = cpu;
>>>
>>>                DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
>>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>                    hvf_set_reg(cpu, srt, val);
>>>                }
>>>
>>> -            qemu_mutex_unlock_iothread();
>>> -
>>>                advance_pc = true;
>>>                break;
>>>            }
>>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>            case EC_WFX_TRAP:
>>>                if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>>>                    (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>> -                uint64_t cval, ctl, val, diff, now;
>>> +                uint64_t cval;
>>>
>>> -                /* Set up a local timer for vtimer if necessary ... */
>>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
>>> -                assert_hvf_ok(r);
>>>                    r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
>>>                    assert_hvf_ok(r);
>>>
>>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
>>> -                diff = cval - val;
>>> -
>>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
>>> -                      gt_cntfrq_period_ns(arm_cpu);
>>> -
>>> -                /* Timer disabled or masked, just wait for long */
>>> -                if (!(ctl & 1) || (ctl & 2)) {
>>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
>>> -                           gt_cntfrq_period_ns(arm_cpu);
>>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
>>> +                if (ticks_to_sleep < 0) {
>>> +                    break;
>>
>> This will loop at 100% for Windows, which configures the vtimer as
>> cval=0 ctl=7, so with IRQ mask bit set.
> Okay, but the 120s is kind of arbitrary so we should just sleep until
> we get a signal. That can be done by passing null as the timespec
> argument to pselect().


The reason I capped it at 120s was so that if I do hit a race, you don't 
break everything forever. Only for 2 minutes :).


>
>>
>> Alex
>>
>>
>>>                    }
>>>
>>> -                if (diff < INT64_MAX) {
>>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
>>> -                    struct timespec *ts = &cpu->hvf->ts;
>>> -
>>> -                    *ts = (struct timespec){
>>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
>>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>>> -                    };
>>> -
>>> -                    /*
>>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
>>> -                     * time periods than 2ms.
>>> -                     */
>>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>>
>> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
>> return. Without logic like this, super short WFIs will hurt performance
>> quite badly.
> I don't think that's accurate. According to this benchmark it's a few
> hundred nanoseconds at most.
>
> pcc@pac-mini /tmp> cat pselect.c
> #include <signal.h>
> #include <sys/select.h>
>
> int main() {
>    sigset_t mask, orig_mask;
>    pthread_sigmask(SIG_SETMASK, 0, &mask);
>    sigaddset(&mask, SIGUSR1);
>    pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>
>    for (int i = 0; i != 1000000; ++i) {
>      struct timespec ts = { 0, 1 };
>      pselect(0, 0, 0, 0, &ts, &orig_mask);
>    }
> }
> pcc@pac-mini /tmp> time ./pselect
>
> ________________________________________________________
> Executed in  179.87 millis    fish           external
>     usr time   77.68 millis   57.00 micros   77.62 millis
>     sys time  101.37 millis  852.00 micros  100.52 millis
>
> Besides, all that you're really saving here is the single pselect
> call. There are no doubt more expensive syscalls involved in exiting
> and entering the VCPU that would dominate here.


I would expect that such a super low ts value has a short-circuit path 
in the kernel as well. Where things start to fall apart is when you're 
at a threshold where rescheduling might be ok, but then you need to take 
all of the additional task switch overhead into account. Try to adapt 
your test code a bit:

#include <signal.h>
#include <sys/select.h>

int main() {
   sigset_t mask, orig_mask;
   pthread_sigmask(SIG_SETMASK, 0, &mask);
   sigaddset(&mask, SIGUSR1);
   pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);

   for (int i = 0; i != 10000; ++i) {
#define SCALE_MS 1000000
     struct timespec ts = { 0, SCALE_MS / 10 };
     pselect(0, 0, 0, 0, &ts, &orig_mask);
   }
}


% time ./pselect
./pselect  0.00s user 0.01s system 1% cpu 1.282 total

You're suddenly seeing ~30µs overhead per pselect call then. When I 
measured actual enter/exit times in QEMU, I saw much bigger differences 
between "time I want to sleep for" and "time I did sleep" even when just 
capturing the virtual time before and after the nanosleep/pselect call.


Alex




^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 20:03                               ` Peter Collingbourne
@ 2020-12-01 22:09                                 ` Alexander Graf
  2020-12-01 23:13                                   ` Alexander Graf
  2020-12-02  0:52                                   ` Peter Collingbourne
  0 siblings, 2 replies; 64+ messages in thread
From: Alexander Graf @ 2020-12-01 22:09 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 01.12.20 21:03, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>> up on IPI.
>>>
>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>> ---
>>> Alexander Graf wrote:
>>>> I would love to take a patch from you here :). I'll still be stuck for a
>>>> while with the sysreg sync rework that Peter asked for before I can look
>>>> at WFI again.
>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>> of what we have in our fork, which can now boot Android to GUI while
>>> remaining at around 4% CPU when idle.
>>>
>>> I'm not set up to boot a full Linux distribution at the moment so I
>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>> with a rootfs containing an init program that just does sleep(5)
>>> and verified that the qemu process remains at low CPU usage during
>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>> since it doesn't look like you have a replacement for that logic yet.
>>
>> How about something like this instead?
>>
>>
>> Alex
>>
>>
>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>> index 4360f64671..50384013ea 100644
>> --- a/accel/hvf/hvf-cpus.c
>> +++ b/accel/hvf/hvf-cpus.c
>> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
>>        cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
>>
>>        /* init cpu signals */
>> -    sigset_t set;
>>        struct sigaction sigact;
>>
>>        memset(&sigact, 0, sizeof(sigact));
>>        sigact.sa_handler = dummy_signal;
>>        sigaction(SIG_IPI, &sigact, NULL);
>>
>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>> -    sigdelset(&set, SIG_IPI);
>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
>> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
>> +
>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
>> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
> There's no reason to unblock SIG_IPI while not in pselect and it can
> easily lead to missed wakeups. The whole point of pselect is so that
> you can guarantee that only one part of your program sees signals
> without a possibility of them being missed.


Hm, I think I start to agree with you here :). We can probably just 
leave SIG_IPI masked at all times and only unmask on pselect. The worst 
thing that will happen is a premature wakeup if we did get an IPI 
incoming while hvf->sleeping is set, but were either not running 
pselect() yet and bailed out or already finished pselect() execution.


>
>>    #ifdef __aarch64__
>>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>> index c56baa3ae8..6e237f2db0 100644
>> --- a/include/sysemu/hvf_int.h
>> +++ b/include/sysemu/hvf_int.h
>> @@ -62,8 +62,9 @@ extern HVFState *hvf_state;
>>    struct hvf_vcpu_state {
>>        uint64_t fd;
>>        void *exit;
>> -    struct timespec ts;
>>        bool sleeping;
>> +    sigset_t sigmask;
>> +    sigset_t sigmask_ipi;
>>    };
>>
>>    void assert_hvf_ok(hv_return_t ret);
>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
>> index 0c01a03725..350b845e6e 100644
>> --- a/target/arm/hvf/hvf.c
>> +++ b/target/arm/hvf/hvf.c
>> @@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>>
>>    void hvf_kick_vcpu_thread(CPUState *cpu)
>>    {
>> -    if (cpu->hvf->sleeping) {
>> -        /*
>> -         * When sleeping, make sure we always send signals. Also, clear the
>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
>> -         * and the nanosleep syscall still aborts the sleep.
>> -         */
>> -        cpu->thread_kicked = false;
>> -        cpu->hvf->ts = (struct timespec){ };
>> +    if (qatomic_read(&cpu->hvf->sleeping)) {
>> +        /* When sleeping, send a signal to get out of pselect */
>>            cpus_kick_thread(cpu);
>>        } else {
>>            hv_vcpus_exit(&cpu->hvf->fd, 1);
>>        }
>>    }
>>
>> +static void hvf_block_sig_ipi(CPUState *cpu)
>> +{
>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
>> +}
>> +
>> +static void hvf_unblock_sig_ipi(CPUState *cpu)
>> +{
>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
>> +}
>> +
>>    static int hvf_inject_interrupts(CPUState *cpu)
>>    {
>>        if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
>> @@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>>        ARMCPU *arm_cpu = ARM_CPU(cpu);
>>        CPUARMState *env = &arm_cpu->env;
>>        hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
>> +    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
>>        hv_return_t r;
>>        int ret = 0;
>>
>> @@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
>>                break;
>>            }
>>            case EC_WFX_TRAP:
>> -            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>> -                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>> +            if (!(syndrome & WFX_IS_WFE) &&
>> +                !(cpu->interrupt_request & irq_mask)) {
>>                    uint64_t cval, ctl, val, diff, now;
> I don't think the access to cpu->interrupt_request is safe because it
> is done while not under the iothread lock. That's why to avoid these
> types of issues I would prefer to hold the lock almost all of the
> time.


In this branch, that's not a problem yet. On stale values, we either 
don't sleep (which is ok), or we go into the sleep path, and reevaluate 
cpu->interrupt_request atomically again after setting hvf->sleeping.


>
>>                    /* Set up a local timer for vtimer if necessary ... */
>> @@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)
>>
>>                    if (diff < INT64_MAX) {
>>                        uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
>> -                    struct timespec *ts = &cpu->hvf->ts;
>> -
>> -                    *ts = (struct timespec){
>> +                    struct timespec ts = {
>>                            .tv_sec = ns / NANOSECONDS_PER_SECOND,
>>                            .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>>                        };
>> @@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
>>                        * Waking up easily takes 1ms, don't go to sleep for smaller
>>                         * time periods than 2ms.
>>                         */
>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>> +                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
>>                            advance_pc = true;
>>                            break;
>>                        }
>>
>> +                    /* block SIG_IPI for the sleep */
>> +                    hvf_block_sig_ipi(cpu);
>> +                    cpu->thread_kicked = false;
>> +
>>                        /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
>> -                    cpu->hvf->sleeping = true;
>> -                    smp_mb();
>> +                    qatomic_set(&cpu->hvf->sleeping, true);
> This doesn't protect against races because another thread could call
> kvf_vcpu_kick_thread() at any time between when we return from
> hv_vcpu_run() and when we set sleeping = true and we would miss the
> wakeup (due to kvf_vcpu_kick_thread() seeing sleeping = false and
> calling hv_vcpus_exit() instead of pthread_kill()). I don't think it
> can be fixed by setting sleeping to true earlier either because no
> matter how early you move it, there will always be a window where we
> are going to pselect() but sleeping is false, resulting in a missed
> wakeup.


I don't follow. If anyone was sending us an IPI, it's because they want 
to notify us about an update to cpu->interrupt_request, right? In that 
case, the atomic read of that field below will catch it and bail out of 
the sleep sequence.


>
> Peter
>
>> -                    /* Bail out if we received an IRQ meanwhile */
>> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
>> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>> -                        cpu->hvf->sleeping = false;
>> +                    /* Bail out if we received a kick meanwhile */
>> +                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
>> +                        qatomic_set(&cpu->hvf->sleeping, false);


^^^


Alex


>> +                        hvf_unblock_sig_ipi(cpu);
>>                            break;
>>                        }
>>
>> -                    /* nanosleep returns on signal, so we wake up on kick. */
>> -                    nanosleep(ts, NULL);
>> +                    /* pselect returns on kick signal and consumes it */
>> +                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);
>>
>>                        /* Out of sleep - either naturally or because of a kick */
>> -                    cpu->hvf->sleeping = false;
>> +                    qatomic_set(&cpu->hvf->sleeping, false);
>> +                    hvf_unblock_sig_ipi(cpu);
>>                    }
>>
>>                    advance_pc = true;
>>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 22:09                                 ` Alexander Graf
@ 2020-12-01 23:13                                   ` Alexander Graf
  2020-12-02  0:52                                   ` Peter Collingbourne
  1 sibling, 0 replies; 64+ messages in thread
From: Alexander Graf @ 2020-12-01 23:13 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 01.12.20 23:09, Alexander Graf wrote:
>
> On 01.12.20 21:03, Peter Collingbourne wrote:
>> On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
>>>
>>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>>> up on IPI.
>>>>
>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>>> ---
>>>> Alexander Graf wrote:
>>>>> I would love to take a patch from you here :). I'll still be stuck 
>>>>> for a
>>>>> while with the sysreg sync rework that Peter asked for before I 
>>>>> can look
>>>>> at WFI again.
>>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>>> of what we have in our fork, which can now boot Android to GUI while
>>>> remaining at around 4% CPU when idle.
>>>>
>>>> I'm not set up to boot a full Linux distribution at the moment so I
>>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>>> with a rootfs containing an init program that just does sleep(5)
>>>> and verified that the qemu process remains at low CPU usage during
>>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>>> since it doesn't look like you have a replacement for that logic yet.
>>>
>>> How about something like this instead?
>>>
>>>
>>> Alex
>>>
>>>
>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>> index 4360f64671..50384013ea 100644
>>> --- a/accel/hvf/hvf-cpus.c
>>> +++ b/accel/hvf/hvf-cpus.c
>>> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
>>>        cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
>>>
>>>        /* init cpu signals */
>>> -    sigset_t set;
>>>        struct sigaction sigact;
>>>
>>>        memset(&sigact, 0, sizeof(sigact));
>>>        sigact.sa_handler = dummy_signal;
>>>        sigaction(SIG_IPI, &sigact, NULL);
>>>
>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>> -    sigdelset(&set, SIG_IPI);
>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
>>> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
>>> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
>>> +
>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
>>> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
>> There's no reason to unblock SIG_IPI while not in pselect and it can
>> easily lead to missed wakeups. The whole point of pselect is so that
>> you can guarantee that only one part of your program sees signals
>> without a possibility of them being missed.
>
>
> Hm, I think I start to agree with you here :). We can probably just 
> leave SIG_IPI masked at all times and only unmask on pselect. The 
> worst thing that will happen is a premature wakeup if we did get an 
> IPI incoming while hvf->sleeping is set, but were either not running 
> pselect() yet and bailed out or already finished pselect() execution.


How about this one? Do you really think it's still racy?


Alex


diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
index 4360f64671..e10fca622d 100644
--- a/accel/hvf/hvf-cpus.c
+++ b/accel/hvf/hvf-cpus.c
@@ -337,16 +337,17 @@ static int hvf_init_vcpu(CPUState *cpu)
      cpu->hvf = g_malloc0(sizeof(*cpu->hvf));

      /* init cpu signals */
-    sigset_t set;
      struct sigaction sigact;

      memset(&sigact, 0, sizeof(sigact));
      sigact.sa_handler = dummy_signal;
      sigaction(SIG_IPI, &sigact, NULL);

-    pthread_sigmask(SIG_BLOCK, NULL, &set);
-    sigdelset(&set, SIG_IPI);
-    pthread_sigmask(SIG_SETMASK, &set, NULL);
+    /* Remember unmasked IPI mask for pselect(), leave masked normally */
+    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
+    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
+    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
+    sigdelset(&cpu->hvf->sigmask_ipi, SIG_IPI);

  #ifdef __aarch64__
      r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
index c56baa3ae8..8d7d4a6226 100644
--- a/include/sysemu/hvf_int.h
+++ b/include/sysemu/hvf_int.h
@@ -62,8 +62,8 @@ extern HVFState *hvf_state;
  struct hvf_vcpu_state {
      uint64_t fd;
      void *exit;
-    struct timespec ts;
      bool sleeping;
+    sigset_t sigmask_ipi;
  };

  void assert_hvf_ok(hv_return_t ret);
diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
index 0c01a03725..a255a1a7d3 100644
--- a/target/arm/hvf/hvf.c
+++ b/target/arm/hvf/hvf.c
@@ -320,14 +320,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)

  void hvf_kick_vcpu_thread(CPUState *cpu)
  {
-    if (cpu->hvf->sleeping) {
-        /*
-         * When sleeping, make sure we always send signals. Also, clear the
-         * timespec, so that an IPI that arrives between setting hvf->sleeping
-         * and the nanosleep syscall still aborts the sleep.
-         */
-        cpu->thread_kicked = false;
-        cpu->hvf->ts = (struct timespec){ };
+    if (qatomic_read(&cpu->hvf->sleeping)) {
+        /* When sleeping, send a signal to get out of pselect */
          cpus_kick_thread(cpu);
      } else {
          hv_vcpus_exit(&cpu->hvf->fd, 1);
@@ -354,6 +348,7 @@ int hvf_vcpu_exec(CPUState *cpu)
      ARMCPU *arm_cpu = ARM_CPU(cpu);
      CPUARMState *env = &arm_cpu->env;
      hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
+    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
      hv_return_t r;
      int ret = 0;

@@ -491,8 +486,8 @@ int hvf_vcpu_exec(CPUState *cpu)
              break;
          }
          case EC_WFX_TRAP:
-            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
-                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
+            if (!(syndrome & WFX_IS_WFE) &&
+                !(cpu->interrupt_request & irq_mask)) {
                  uint64_t cval, ctl, val, diff, now;

                  /* Set up a local timer for vtimer if necessary ... */
@@ -515,9 +510,7 @@ int hvf_vcpu_exec(CPUState *cpu)

                  if (diff < INT64_MAX) {
                      uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
-                    struct timespec *ts = &cpu->hvf->ts;
-
-                    *ts = (struct timespec){
+                    struct timespec ts = {
                          .tv_sec = ns / NANOSECONDS_PER_SECOND,
                          .tv_nsec = ns % NANOSECONDS_PER_SECOND,
                      };
@@ -526,27 +519,27 @@ int hvf_vcpu_exec(CPUState *cpu)
                     * Waking up easily takes 1ms, don't go to sleep for smaller
                       * time periods than 2ms.
                       */
-                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
+                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
                          advance_pc = true;
                          break;
                      }

+                    cpu->thread_kicked = false;
+
                      /* Set cpu->hvf->sleeping so that we get a SIG_IPI 
signal. */
-                    cpu->hvf->sleeping = true;
-                    smp_mb();
+                    qatomic_set(&cpu->hvf->sleeping, true);

-                    /* Bail out if we received an IRQ meanwhile */
-                    if (cpu->thread_kicked || (cpu->interrupt_request &
-                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
-                        cpu->hvf->sleeping = false;
+                    /* Bail out if we received a kick meanwhile */
+                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
+                        qatomic_set(&cpu->hvf->sleeping, false);
                          break;
                      }

-                    /* nanosleep returns on signal, so we wake up on kick. */
-                    nanosleep(ts, NULL);
+                    /* pselect returns on kick signal and consumes it */
+                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask_ipi);

                    /* Out of sleep - either naturally or because of a kick */
-                    cpu->hvf->sleeping = false;
+                    qatomic_set(&cpu->hvf->sleeping, false);
                  }

                  advance_pc = true;



^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 22:09                                 ` Alexander Graf
  2020-12-01 23:13                                   ` Alexander Graf
@ 2020-12-02  0:52                                   ` Peter Collingbourne
  1 sibling, 0 replies; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-02  0:52 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Tue, Dec 1, 2020 at 2:09 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 21:03, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 8:26 AM Alexander Graf <agraf@csgraf.de> wrote:
> >>
> >> On 01.12.20 09:21, Peter Collingbourne wrote:
> >>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> >>> up on IPI.
> >>>
> >>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>> ---
> >>> Alexander Graf wrote:
> >>>> I would love to take a patch from you here :). I'll still be stuck for a
> >>>> while with the sysreg sync rework that Peter asked for before I can look
> >>>> at WFI again.
> >>> Okay, here's a patch :) It's a relatively straightforward adaptation
> >>> of what we have in our fork, which can now boot Android to GUI while
> >>> remaining at around 4% CPU when idle.
> >>>
> >>> I'm not set up to boot a full Linux distribution at the moment so I
> >>> tested it on upstream QEMU by running a recent mainline Linux kernel
> >>> with a rootfs containing an init program that just does sleep(5)
> >>> and verified that the qemu process remains at low CPU usage during
> >>> the sleep. This was on top of your v2 plus the last patch of your v1
> >>> since it doesn't look like you have a replacement for that logic yet.
> >>
> >> How about something like this instead?
> >>
> >>
> >> Alex
> >>
> >>
> >> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >> index 4360f64671..50384013ea 100644
> >> --- a/accel/hvf/hvf-cpus.c
> >> +++ b/accel/hvf/hvf-cpus.c
> >> @@ -337,16 +337,18 @@ static int hvf_init_vcpu(CPUState *cpu)
> >>        cpu->hvf = g_malloc0(sizeof(*cpu->hvf));
> >>
> >>        /* init cpu signals */
> >> -    sigset_t set;
> >>        struct sigaction sigact;
> >>
> >>        memset(&sigact, 0, sizeof(sigact));
> >>        sigact.sa_handler = dummy_signal;
> >>        sigaction(SIG_IPI, &sigact, NULL);
> >>
> >> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >> -    sigdelset(&set, SIG_IPI);
> >> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> >> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask);
> >> +    sigdelset(&cpu->hvf->sigmask, SIG_IPI);
> >> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> >> +
> >> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->sigmask_ipi);
> >> +    sigaddset(&cpu->hvf->sigmask_ipi, SIG_IPI);
> > There's no reason to unblock SIG_IPI while not in pselect and it can
> > easily lead to missed wakeups. The whole point of pselect is so that
> > you can guarantee that only one part of your program sees signals
> > without a possibility of them being missed.
>
>
> Hm, I think I start to agree with you here :). We can probably just
> leave SIG_IPI masked at all times and only unmask on pselect. The worst
> thing that will happen is a premature wakeup if we did get an IPI
> incoming while hvf->sleeping is set, but were either not running
> pselect() yet and bailed out or already finished pselect() execution.

Ack.

> >
> >>    #ifdef __aarch64__
> >>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t
> >> **)&cpu->hvf->exit, NULL);
> >> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >> index c56baa3ae8..6e237f2db0 100644
> >> --- a/include/sysemu/hvf_int.h
> >> +++ b/include/sysemu/hvf_int.h
> >> @@ -62,8 +62,9 @@ extern HVFState *hvf_state;
> >>    struct hvf_vcpu_state {
> >>        uint64_t fd;
> >>        void *exit;
> >> -    struct timespec ts;
> >>        bool sleeping;
> >> +    sigset_t sigmask;
> >> +    sigset_t sigmask_ipi;
> >>    };
> >>
> >>    void assert_hvf_ok(hv_return_t ret);
> >> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> >> index 0c01a03725..350b845e6e 100644
> >> --- a/target/arm/hvf/hvf.c
> >> +++ b/target/arm/hvf/hvf.c
> >> @@ -320,20 +320,24 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >>
> >>    void hvf_kick_vcpu_thread(CPUState *cpu)
> >>    {
> >> -    if (cpu->hvf->sleeping) {
> >> -        /*
> >> -         * When sleeping, make sure we always send signals. Also, clear the
> >> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> >> -         * and the nanosleep syscall still aborts the sleep.
> >> -         */
> >> -        cpu->thread_kicked = false;
> >> -        cpu->hvf->ts = (struct timespec){ };
> >> +    if (qatomic_read(&cpu->hvf->sleeping)) {
> >> +        /* When sleeping, send a signal to get out of pselect */
> >>            cpus_kick_thread(cpu);
> >>        } else {
> >>            hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>        }
> >>    }
> >>
> >> +static void hvf_block_sig_ipi(CPUState *cpu)
> >> +{
> >> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask_ipi, NULL);
> >> +}
> >> +
> >> +static void hvf_unblock_sig_ipi(CPUState *cpu)
> >> +{
> >> +    pthread_sigmask(SIG_SETMASK, &cpu->hvf->sigmask, NULL);
> >> +}
> >> +
> >>    static int hvf_inject_interrupts(CPUState *cpu)
> >>    {
> >>        if (cpu->interrupt_request & CPU_INTERRUPT_FIQ) {
> >> @@ -354,6 +358,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>        ARMCPU *arm_cpu = ARM_CPU(cpu);
> >>        CPUARMState *env = &arm_cpu->env;
> >>        hv_vcpu_exit_t *hvf_exit = cpu->hvf->exit;
> >> +    const uint32_t irq_mask = CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ;
> >>        hv_return_t r;
> >>        int ret = 0;
> >>
> >> @@ -491,8 +496,8 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>                break;
> >>            }
> >>            case EC_WFX_TRAP:
> >> -            if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >> -                (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >> +            if (!(syndrome & WFX_IS_WFE) &&
> >> +                !(cpu->interrupt_request & irq_mask)) {
> >>                    uint64_t cval, ctl, val, diff, now;
> > I don't think the access to cpu->interrupt_request is safe because it
> > is done while not under the iothread lock. That's why to avoid these
> > types of issues I would prefer to hold the lock almost all of the
> > time.
>
>
> In this branch, that's not a problem yet. On stale values, we either
> don't sleep (which is ok), or we go into the sleep path, and reevaluate
> cpu->interrupt_request atomically again after setting hvf->sleeping.

Okay, this may be a "benign race" (and it may be helped a little by
the M1's sequential consistency extension) but this is the sort of
thing that I'd prefer not to rely on. At least it should be an atomic
read.

> >
> >>                    /* Set up a local timer for vtimer if necessary ... */
> >> @@ -515,9 +520,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>
> >>                    if (diff < INT64_MAX) {
> >>                        uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> >> -                    struct timespec *ts = &cpu->hvf->ts;
> >> -
> >> -                    *ts = (struct timespec){
> >> +                    struct timespec ts = {
> >>                            .tv_sec = ns / NANOSECONDS_PER_SECOND,
> >>                            .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> >>                        };
> >> @@ -526,27 +529,31 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>                         * Waking up easily takes 1ms, don't go to sleep for smaller
> >>                         * time periods than 2ms.
> >>                         */
> >> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >> +                    if (!ts.tv_sec && (ts.tv_nsec < (SCALE_MS * 2))) {
> >>                            advance_pc = true;
> >>                            break;
> >>                        }
> >>
> >> +                    /* block SIG_IPI for the sleep */
> >> +                    hvf_block_sig_ipi(cpu);
> >> +                    cpu->thread_kicked = false;
> >> +
> >>                        /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> >> -                    cpu->hvf->sleeping = true;
> >> -                    smp_mb();
> >> +                    qatomic_set(&cpu->hvf->sleeping, true);
> > This doesn't protect against races because another thread could call
> > kvf_vcpu_kick_thread() at any time between when we return from
> > hv_vcpu_run() and when we set sleeping = true and we would miss the
> > wakeup (due to kvf_vcpu_kick_thread() seeing sleeping = false and
> > calling hv_vcpus_exit() instead of pthread_kill()). I don't think it
> > can be fixed by setting sleeping to true earlier either because no
> > matter how early you move it, there will always be a window where we
> > are going to pselect() but sleeping is false, resulting in a missed
> > wakeup.
>
>
> I don't follow. If anyone was sending us an IPI, it's because they want
> to notify us about an update to cpu->interrupt_request, right? In that
> case, the atomic read of that field below will catch it and bail out of
> the sleep sequence.

I think there are other possible IPI reasons, e.g. set halted to 1,
I/O events. Now we could check for halted below and maybe some of the
others but the code will be subtle and it seems like a game of
whack-a-mole to get them all. This is an example of what I was talking
about when I said that an approach that relies on the sleeping field
will be difficult to get right. I would strongly prefer to start with
a simple approach and maybe we can consider a more complicated one
later.

Peter

>
>
> >
> > Peter
> >
> >> -                    /* Bail out if we received an IRQ meanwhile */
> >> -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> >> -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >> -                        cpu->hvf->sleeping = false;
> >> +                    /* Bail out if we received a kick meanwhile */
> >> +                    if (qatomic_read(&cpu->interrupt_request) & irq_mask) {
> >> +                        qatomic_set(&cpu->hvf->sleeping, false);
>
>
> ^^^
>
>
> Alex
>
>
> >> +                        hvf_unblock_sig_ipi(cpu);
> >>                            break;
> >>                        }
> >>
> >> -                    /* nanosleep returns on signal, so we wake up on kick. */
> >> -                    nanosleep(ts, NULL);
> >> +                    /* pselect returns on kick signal and consumes it */
> >> +                    pselect(0, 0, 0, 0, &ts, &cpu->hvf->sigmask);
> >>
> >>                        /* Out of sleep - either naturally or because of a kick */
> >> -                    cpu->hvf->sleeping = false;
> >> +                    qatomic_set(&cpu->hvf->sleeping, false);
> >> +                    hvf_unblock_sig_ipi(cpu);
> >>                    }
> >>
> >>                    advance_pc = true;
> >>



* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 22:03                                 ` Alexander Graf
@ 2020-12-02  1:19                                   ` Peter Collingbourne
  2020-12-02  1:53                                     ` Alexander Graf
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-02  1:19 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Tue, Dec 1, 2020 at 2:04 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 01.12.20 19:59, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> >> Hi Peter,
> >>
> >> On 01.12.20 09:21, Peter Collingbourne wrote:
> >>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> >>> up on IPI.
> >>>
> >>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>
> >> Thanks a bunch!
> >>
> >>
> >>> ---
> >>> Alexander Graf wrote:
> >>>> I would love to take a patch from you here :). I'll still be stuck for a
> >>>> while with the sysreg sync rework that Peter asked for before I can look
> >>>> at WFI again.
> >>> Okay, here's a patch :) It's a relatively straightforward adaptation
> >>> of what we have in our fork, which can now boot Android to GUI while
> >>> remaining at around 4% CPU when idle.
> >>>
> >>> I'm not set up to boot a full Linux distribution at the moment so I
> >>> tested it on upstream QEMU by running a recent mainline Linux kernel
> >>> with a rootfs containing an init program that just does sleep(5)
> >>> and verified that the qemu process remains at low CPU usage during
> >>> the sleep. This was on top of your v2 plus the last patch of your v1
> >>> since it doesn't look like you have a replacement for that logic yet.
> >>>
> >>>    accel/hvf/hvf-cpus.c     |  5 +--
> >>>    include/sysemu/hvf_int.h |  3 +-
> >>>    target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> >>>    3 files changed, 28 insertions(+), 74 deletions(-)
> >>>
> >>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >>> index 4360f64671..b2c8fb57f6 100644
> >>> --- a/accel/hvf/hvf-cpus.c
> >>> +++ b/accel/hvf/hvf-cpus.c
> >>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> >>>        sigact.sa_handler = dummy_signal;
> >>>        sigaction(SIG_IPI, &sigact, NULL);
> >>>
> >>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >>> -    sigdelset(&set, SIG_IPI);
> >>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> >>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> >>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> >>
> >> What will this do to the x86 hvf implementation? We're now not
> >> unblocking SIG_IPI again for that, right?
> > Yes and that was the case before your patch series.
>
>
> The way I understand Roman, he wanted to unblock the IPI signal on x86:
>
> https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021
>
> I agree that at this point it's not a problem though to break it again.
> I'm not quite sure how to merge your patches within my patch set though,
> given they basically revert half of my previously introduced code...
>
>
> >
> >>>    #ifdef __aarch64__
> >>>        r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> >>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >>> index c56baa3ae8..13adf6ea77 100644
> >>> --- a/include/sysemu/hvf_int.h
> >>> +++ b/include/sysemu/hvf_int.h
> >>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> >>>    struct hvf_vcpu_state {
> >>>        uint64_t fd;
> >>>        void *exit;
> >>> -    struct timespec ts;
> >>> -    bool sleeping;
> >>> +    sigset_t unblock_ipi_mask;
> >>>    };
> >>>
> >>>    void assert_hvf_ok(hv_return_t ret);
> >>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> >>> index 8fe10966d2..60a361ff38 100644
> >>> --- a/target/arm/hvf/hvf.c
> >>> +++ b/target/arm/hvf/hvf.c
> >>> @@ -2,6 +2,7 @@
> >>>     * QEMU Hypervisor.framework support for Apple Silicon
> >>>
> >>>     * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> >>> + * Copyright 2020 Google LLC
> >>>     *
> >>>     * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >>>     * See the COPYING file in the top-level directory.
> >>> @@ -18,6 +19,7 @@
> >>>    #include "sysemu/hw_accel.h"
> >>>
> >>>    #include <Hypervisor/Hypervisor.h>
> >>> +#include <mach/mach_time.h>
> >>>
> >>>    #include "exec/address-spaces.h"
> >>>    #include "hw/irq.h"
> >>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >>>
> >>>    void hvf_kick_vcpu_thread(CPUState *cpu)
> >>>    {
> >>> -    if (cpu->hvf->sleeping) {
> >>> -        /*
> >>> -         * When sleeping, make sure we always send signals. Also, clear the
> >>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> >>> -         * and the nanosleep syscall still aborts the sleep.
> >>> -         */
> >>> -        cpu->thread_kicked = false;
> >>> -        cpu->hvf->ts = (struct timespec){ };
> >>> -        cpus_kick_thread(cpu);
> >>> -    } else {
> >>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>> -    }
> >>> +    cpus_kick_thread(cpu);
> >>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>
> >> This means your first WFI will almost always return immediately due to a
> >> pending signal, because there probably was an IRQ pending before on the
> >> same CPU, no?
> > That's right. Any approach involving the "sleeping" field would need
> > to be implemented carefully to avoid races that may result in missed
> > wakeups so for simplicity I just decided to send both kinds of
> > wakeups. In particular the approach in the updated patch you sent is
> > racy and I'll elaborate more in the reply to that patch.
> >
> >>>    }
> >>>
> >>>    static int hvf_inject_interrupts(CPUState *cpu)
> >>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>            uint64_t syndrome = hvf_exit->exception.syndrome;
> >>>            uint32_t ec = syn_get_ec(syndrome);
> >>>
> >>> +        qemu_mutex_lock_iothread();
> >>
> >> Is there a particular reason you're moving the iothread lock out again
> >> from the individual bits? I would really like to keep a notion of fast
> >> path exits.
> > We still need to lock at least once no matter the exit reason to check
> > the interrupts so I don't think it's worth it to try and avoid locking
> > like this. It also makes the implementation easier to reason about and
> > therefore more likely to be correct. In our implementation we just
> > stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> >
> >>>            switch (exit_reason) {
> >>>            case HV_EXIT_REASON_EXCEPTION:
> >>>                /* This is the main one, handle below. */
> >>>                break;
> >>>            case HV_EXIT_REASON_VTIMER_ACTIVATED:
> >>> -            qemu_mutex_lock_iothread();
> >>>                current_cpu = cpu;
> >>>                qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> >>>                qemu_mutex_unlock_iothread();
> >>>                continue;
> >>>            case HV_EXIT_REASON_CANCELED:
> >>>                /* we got kicked, no exit to process */
> >>> +            qemu_mutex_unlock_iothread();
> >>>                continue;
> >>>            default:
> >>>                assert(0);
> >>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>                uint32_t srt = (syndrome >> 16) & 0x1f;
> >>>                uint64_t val = 0;
> >>>
> >>> -            qemu_mutex_lock_iothread();
> >>>                current_cpu = cpu;
> >>>
> >>>                DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> >>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>                    hvf_set_reg(cpu, srt, val);
> >>>                }
> >>>
> >>> -            qemu_mutex_unlock_iothread();
> >>> -
> >>>                advance_pc = true;
> >>>                break;
> >>>            }
> >>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>            case EC_WFX_TRAP:
> >>>                if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >>>                    (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>> -                uint64_t cval, ctl, val, diff, now;
> >>> +                uint64_t cval;
> >>>
> >>> -                /* Set up a local timer for vtimer if necessary ... */
> >>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> >>> -                assert_hvf_ok(r);
> >>>                    r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> >>>                    assert_hvf_ok(r);
> >>>
> >>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> >>> -                diff = cval - val;
> >>> -
> >>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> >>> -                      gt_cntfrq_period_ns(arm_cpu);
> >>> -
> >>> -                /* Timer disabled or masked, just wait for long */
> >>> -                if (!(ctl & 1) || (ctl & 2)) {
> >>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> >>> -                           gt_cntfrq_period_ns(arm_cpu);
> >>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> >>> +                if (ticks_to_sleep < 0) {
> >>> +                    break;
> >>
> >> This will loop at 100% for Windows, which configures the vtimer as
> >> cval=0 ctl=7, so with IRQ mask bit set.
> > Okay, but the 120s is kind of arbitrary so we should just sleep until
> > we get a signal. That can be done by passing null as the timespec
> > argument to pselect().
>
>
> The reason I capped it at 120s was so that if I do hit a race, you don't
> break everything forever. Only for 2 minutes :).

I see. I think at this point we want to notice these types of bugs if
they exist instead of hiding them, so I would mildly be in favor of
not capping at 120s.

> >
> >>
> >> Alex
> >>
> >>
> >>>                    }
> >>>
> >>> -                if (diff < INT64_MAX) {
> >>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> >>> -                    struct timespec *ts = &cpu->hvf->ts;
> >>> -
> >>> -                    *ts = (struct timespec){
> >>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> >>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> >>> -                    };
> >>> -
> >>> -                    /*
> >>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> >>> -                     * time periods than 2ms.
> >>> -                     */
> >>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >>
> >> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> >> return. Without logic like this, super short WFIs will hurt performance
> >> quite badly.
> > I don't think that's accurate. According to this benchmark it's a few
> > hundred nanoseconds at most.
> >
> > pcc@pac-mini /tmp> cat pselect.c
> > #include <signal.h>
> > #include <sys/select.h>
> >
> > int main() {
> >    sigset_t mask, orig_mask;
> >    pthread_sigmask(SIG_SETMASK, 0, &mask);
> >    sigaddset(&mask, SIGUSR1);
> >    pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >
> >    for (int i = 0; i != 1000000; ++i) {
> >      struct timespec ts = { 0, 1 };
> >      pselect(0, 0, 0, 0, &ts, &orig_mask);
> >    }
> > }
> > pcc@pac-mini /tmp> time ./pselect
> >
> > ________________________________________________________
> > Executed in  179.87 millis    fish           external
> >     usr time   77.68 millis   57.00 micros   77.62 millis
> >     sys time  101.37 millis  852.00 micros  100.52 millis
> >
> > Besides, all that you're really saving here is the single pselect
> > call. There are no doubt more expensive syscalls involved in exiting
> > and entering the VCPU that would dominate here.
>
>
> I would expect that such a super low ts value has a short-circuit path
> in the kernel as well. Where things start to fall apart is when you're
> at a threshold where rescheduling might be ok, but then you need to take
> all of the additional task switch overhead into account. Try to adapt
> your test code a bit:
>
> #include <signal.h>
> #include <sys/select.h>
>
> int main() {
>    sigset_t mask, orig_mask;
>    pthread_sigmask(SIG_SETMASK, 0, &mask);
>    sigaddset(&mask, SIGUSR1);
>    pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>
>    for (int i = 0; i != 10000; ++i) {
> #define SCALE_MS 1000000
>      struct timespec ts = { 0, SCALE_MS / 10 };
>      pselect(0, 0, 0, 0, &ts, &orig_mask);
>    }
> }
>
>
> % time ./pselect
> ./pselect  0.00s user 0.01s system 1% cpu 1.282 total
>
> You're suddenly seeing 300µs overhead per pselect call then. When I
> measured actual enter/exit times in QEMU, I saw much bigger differences
> between "time I want to sleep for" and "time I did sleep" even when just
> capturing the virtual time before and after the nanosleep/pselect call.

Okay. So the alternative is that we spin on the CPU, either doing
no-op VCPU entries/exits or something like:

while (mach_absolute_time() < cval);

My intuition is we shouldn't try to subvert the OS scheduler like this
unless it's proven to help with some real world metric since otherwise
we're not being fair to the other processes on the CPU. With CPU
intensive workloads I wouldn't expect these kinds of sleeps to happen
very often if at all so if it's only microbenchmarks and so on that
are affected then my inclination is not to do this for now.

Peter



* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-02  1:19                                   ` Peter Collingbourne
@ 2020-12-02  1:53                                     ` Alexander Graf
  2020-12-02  4:44                                       ` Peter Collingbourne
  0 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-12-02  1:53 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Roman Bolshakov, qemu-arm, Claudio Fontana,
	Frank Yang, Paolo Bonzini


On 02.12.20 02:19, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 2:04 PM Alexander Graf <agraf@csgraf.de> wrote:
>>
>> On 01.12.20 19:59, Peter Collingbourne wrote:
>>> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
>>>> Hi Peter,
>>>>
>>>> On 01.12.20 09:21, Peter Collingbourne wrote:
>>>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
>>>>> up on IPI.
>>>>>
>>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
>>>> Thanks a bunch!
>>>>
>>>>
>>>>> ---
>>>>> Alexander Graf wrote:
>>>>>> I would love to take a patch from you here :). I'll still be stuck for a
>>>>>> while with the sysreg sync rework that Peter asked for before I can look
>>>>>> at WFI again.
>>>>> Okay, here's a patch :) It's a relatively straightforward adaptation
>>>>> of what we have in our fork, which can now boot Android to GUI while
>>>>> remaining at around 4% CPU when idle.
>>>>>
>>>>> I'm not set up to boot a full Linux distribution at the moment so I
>>>>> tested it on upstream QEMU by running a recent mainline Linux kernel
>>>>> with a rootfs containing an init program that just does sleep(5)
>>>>> and verified that the qemu process remains at low CPU usage during
>>>>> the sleep. This was on top of your v2 plus the last patch of your v1
>>>>> since it doesn't look like you have a replacement for that logic yet.
>>>>>
>>>>>     accel/hvf/hvf-cpus.c     |  5 +--
>>>>>     include/sysemu/hvf_int.h |  3 +-
>>>>>     target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
>>>>>     3 files changed, 28 insertions(+), 74 deletions(-)
>>>>>
>>>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
>>>>> index 4360f64671..b2c8fb57f6 100644
>>>>> --- a/accel/hvf/hvf-cpus.c
>>>>> +++ b/accel/hvf/hvf-cpus.c
>>>>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
>>>>>         sigact.sa_handler = dummy_signal;
>>>>>         sigaction(SIG_IPI, &sigact, NULL);
>>>>>
>>>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
>>>>> -    sigdelset(&set, SIG_IPI);
>>>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
>>>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
>>>>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
>>>> What will this do to the x86 hvf implementation? We're now not
>>>> unblocking SIG_IPI again for that, right?
>>> Yes and that was the case before your patch series.
>>
>> The way I understand Roman, he wanted to unblock the IPI signal on x86:
>>
>> https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021
>>
>> I agree that at this point it's not a problem though to break it again.
>> I'm not quite sure how to merge your patches within my patch set though,
>> given they basically revert half of my previously introduced code...
>>
>>
>>>>>     #ifdef __aarch64__
>>>>>         r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
>>>>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
>>>>> index c56baa3ae8..13adf6ea77 100644
>>>>> --- a/include/sysemu/hvf_int.h
>>>>> +++ b/include/sysemu/hvf_int.h
>>>>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
>>>>>     struct hvf_vcpu_state {
>>>>>         uint64_t fd;
>>>>>         void *exit;
>>>>> -    struct timespec ts;
>>>>> -    bool sleeping;
>>>>> +    sigset_t unblock_ipi_mask;
>>>>>     };
>>>>>
>>>>>     void assert_hvf_ok(hv_return_t ret);
>>>>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
>>>>> index 8fe10966d2..60a361ff38 100644
>>>>> --- a/target/arm/hvf/hvf.c
>>>>> +++ b/target/arm/hvf/hvf.c
>>>>> @@ -2,6 +2,7 @@
>>>>>      * QEMU Hypervisor.framework support for Apple Silicon
>>>>>
>>>>>      * Copyright 2020 Alexander Graf <agraf@csgraf.de>
>>>>> + * Copyright 2020 Google LLC
>>>>>      *
>>>>>      * This work is licensed under the terms of the GNU GPL, version 2 or later.
>>>>>      * See the COPYING file in the top-level directory.
>>>>> @@ -18,6 +19,7 @@
>>>>>     #include "sysemu/hw_accel.h"
>>>>>
>>>>>     #include <Hypervisor/Hypervisor.h>
>>>>> +#include <mach/mach_time.h>
>>>>>
>>>>>     #include "exec/address-spaces.h"
>>>>>     #include "hw/irq.h"
>>>>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
>>>>>
>>>>>     void hvf_kick_vcpu_thread(CPUState *cpu)
>>>>>     {
>>>>> -    if (cpu->hvf->sleeping) {
>>>>> -        /*
>>>>> -         * When sleeping, make sure we always send signals. Also, clear the
>>>>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
>>>>> -         * and the nanosleep syscall still aborts the sleep.
>>>>> -         */
>>>>> -        cpu->thread_kicked = false;
>>>>> -        cpu->hvf->ts = (struct timespec){ };
>>>>> -        cpus_kick_thread(cpu);
>>>>> -    } else {
>>>>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
>>>>> -    }
>>>>> +    cpus_kick_thread(cpu);
>>>>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
>>>> This means your first WFI will almost always return immediately due to a
>>>> pending signal, because there probably was an IRQ pending before on the
>>>> same CPU, no?
>>> That's right. Any approach involving the "sleeping" field would need
>>> to be implemented carefully to avoid races that may result in missed
>>> wakeups so for simplicity I just decided to send both kinds of
>>> wakeups. In particular the approach in the updated patch you sent is
>>> racy and I'll elaborate more in the reply to that patch.
>>>
>>>>>     }
>>>>>
>>>>>     static int hvf_inject_interrupts(CPUState *cpu)
>>>>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>             uint64_t syndrome = hvf_exit->exception.syndrome;
>>>>>             uint32_t ec = syn_get_ec(syndrome);
>>>>>
>>>>> +        qemu_mutex_lock_iothread();
>>>> Is there a particular reason you're moving the iothread lock out again
>>>> from the individual bits? I would really like to keep a notion of fast
>>>> path exits.
>>> We still need to lock at least once no matter the exit reason to check
>>> the interrupts so I don't think it's worth it to try and avoid locking
>>> like this. It also makes the implementation easier to reason about and
>>> therefore more likely to be correct. In our implementation we just
>>> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
>>>
>>>>>             switch (exit_reason) {
>>>>>             case HV_EXIT_REASON_EXCEPTION:
>>>>>                 /* This is the main one, handle below. */
>>>>>                 break;
>>>>>             case HV_EXIT_REASON_VTIMER_ACTIVATED:
>>>>> -            qemu_mutex_lock_iothread();
>>>>>                 current_cpu = cpu;
>>>>>                 qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
>>>>>                 qemu_mutex_unlock_iothread();
>>>>>                 continue;
>>>>>             case HV_EXIT_REASON_CANCELED:
>>>>>                 /* we got kicked, no exit to process */
>>>>> +            qemu_mutex_unlock_iothread();
>>>>>                 continue;
>>>>>             default:
>>>>>                 assert(0);
>>>>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>                 uint32_t srt = (syndrome >> 16) & 0x1f;
>>>>>                 uint64_t val = 0;
>>>>>
>>>>> -            qemu_mutex_lock_iothread();
>>>>>                 current_cpu = cpu;
>>>>>
>>>>>                 DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
>>>>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>                     hvf_set_reg(cpu, srt, val);
>>>>>                 }
>>>>>
>>>>> -            qemu_mutex_unlock_iothread();
>>>>> -
>>>>>                 advance_pc = true;
>>>>>                 break;
>>>>>             }
>>>>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
>>>>>             case EC_WFX_TRAP:
>>>>>                 if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
>>>>>                     (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>>>> -                uint64_t cval, ctl, val, diff, now;
>>>>> +                uint64_t cval;
>>>>>
>>>>> -                /* Set up a local timer for vtimer if necessary ... */
>>>>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
>>>>> -                assert_hvf_ok(r);
>>>>>                     r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
>>>>>                     assert_hvf_ok(r);
>>>>>
>>>>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
>>>>> -                diff = cval - val;
>>>>> -
>>>>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
>>>>> -                      gt_cntfrq_period_ns(arm_cpu);
>>>>> -
>>>>> -                /* Timer disabled or masked, just wait for long */
>>>>> -                if (!(ctl & 1) || (ctl & 2)) {
>>>>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
>>>>> -                           gt_cntfrq_period_ns(arm_cpu);
>>>>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
>>>>> +                if (ticks_to_sleep < 0) {
>>>>> +                    break;
>>>> This will loop at 100% for Windows, which configures the vtimer as
>>>> cval=0 ctl=7, so with IRQ mask bit set.
>>> Okay, but the 120s is kind of arbitrary so we should just sleep until
>>> we get a signal. That can be done by passing null as the timespec
>>> argument to pselect().
>>
>> The reason I capped it at 120s was so that if I do hit a race, you don't
>> break everything forever. Only for 2 minutes :).
> I see. I think at this point we want to notice these types of bugs if
> they exist instead of hiding them, so I would mildly be in favor of
> not capping at 120s.


Crossing my fingers that we are at that point already :).


>
>>>> Alex
>>>>
>>>>
>>>>>                     }
>>>>>
>>>>> -                if (diff < INT64_MAX) {
>>>>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
>>>>> -                    struct timespec *ts = &cpu->hvf->ts;
>>>>> -
>>>>> -                    *ts = (struct timespec){
>>>>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
>>>>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
>>>>> -                    };
>>>>> -
>>>>> -                    /*
>>>>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
>>>>> -                     * time periods than 2ms.
>>>>> -                     */
>>>>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
>>>> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
>>>> return. Without logic like this, super short WFIs will hurt performance
>>>> quite badly.
>>> I don't think that's accurate. According to this benchmark it's a few
>>> hundred nanoseconds at most.
>>>
>>> pcc@pac-mini /tmp> cat pselect.c
>>> #include <signal.h>
>>> #include <sys/select.h>
>>>
>>> int main() {
>>>     sigset_t mask, orig_mask;
>>>     pthread_sigmask(SIG_SETMASK, 0, &mask);
>>>     sigaddset(&mask, SIGUSR1);
>>>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>>>
>>>     for (int i = 0; i != 1000000; ++i) {
>>>       struct timespec ts = { 0, 1 };
>>>       pselect(0, 0, 0, 0, &ts, &orig_mask);
>>>     }
>>> }
>>> pcc@pac-mini /tmp> time ./pselect
>>>
>>> ________________________________________________________
>>> Executed in  179.87 millis    fish           external
>>>      usr time   77.68 millis   57.00 micros   77.62 millis
>>>      sys time  101.37 millis  852.00 micros  100.52 millis
>>>
>>> Besides, all that you're really saving here is the single pselect
>>> call. There are no doubt more expensive syscalls involved in exiting
>>> and entering the VCPU that would dominate here.
>>
>> I would expect that such a super low ts value has a short-circuit path
>> in the kernel as well. Where things start to fall apart is when you're
>> at a threshold where rescheduling might be ok, but then you need to take
>> all of the additional task switch overhead into account. Try to adapt
>> your test code a bit:
>>
>> #include <signal.h>
>> #include <sys/select.h>
>>
>> int main() {
>>     sigset_t mask, orig_mask;
>>     pthread_sigmask(SIG_SETMASK, 0, &mask);
>>     sigaddset(&mask, SIGUSR1);
>>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
>>
>>     for (int i = 0; i != 10000; ++i) {
>> #define SCALE_MS 1000000
>>       struct timespec ts = { 0, SCALE_MS / 10 };
>>       pselect(0, 0, 0, 0, &ts, &orig_mask);
>>     }
>> }
>>
>>
>> % time ./pselect
>> ./pselect  0.00s user 0.01s system 1% cpu 1.282 total
>>
>> You're suddenly seeing 300µs overhead per pselect call then. When I
>> measured actual enter/exit times in QEMU, I saw much bigger differences
>> between "time I want to sleep for" and "time I did sleep" even when just
>> capturing the virtual time before and after the nanosleep/pselect call.
> Okay. So the alternative is that we spin on the CPU, either doing
> no-op VCPU entries/exits or something like:
>
> while (mach_absolute_time() < cval);


This won't catch events that arrive during that time, such as 
interrupts, right? I'd just declare the WFI as done and keep looping in 
and out of the guest for now.


> My intuition is we shouldn't try to subvert the OS scheduler like this
> unless it's proven to help with some real world metric since otherwise
> we're not being fair to the other processes on the CPU. With CPU
> intensive workloads I wouldn't expect these kinds of sleeps to happen
> very often if at all so if it's only microbenchmarks and so on that
> are affected then my inclination is not to do this for now.


The problem is that the VM's OS is expecting bare metal timer behavior 
usually. And that gives you much better granularities than what we can 
achieve with a virtualization layer on top. So I do feel strongly about 
leaving this bit in. In the workloads you describe above, you won't ever 
hit that branch anyway.

The workloads that benefit from logic like this are message passing 
ones. Check out this presentation from a KVM colleague of yours for details:

https://www.linux-kvm.org/images/a/ac/02x03-Davit_Matalack-KVM_Message_passing_Performance.pdf
   https://www.youtube.com/watch?v=p85FFrloLFg


Alex



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-02  1:53                                     ` Alexander Graf
@ 2020-12-02  4:44                                       ` Peter Collingbourne
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-02  4:44 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Frank Yang, Roman Bolshakov, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Tue, Dec 1, 2020 at 5:53 PM Alexander Graf <agraf@csgraf.de> wrote:
>
>
> On 02.12.20 02:19, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 2:04 PM Alexander Graf <agraf@csgraf.de> wrote:
> >>
> >> On 01.12.20 19:59, Peter Collingbourne wrote:
> >>> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> >>>> Hi Peter,
> >>>>
> >>>> On 01.12.20 09:21, Peter Collingbourne wrote:
> >>>>> Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> >>>>> up on IPI.
> >>>>>
> >>>>> Signed-off-by: Peter Collingbourne <pcc@google.com>
> >>>> Thanks a bunch!
> >>>>
> >>>>
> >>>>> ---
> >>>>> Alexander Graf wrote:
> >>>>>> I would love to take a patch from you here :). I'll still be stuck for a
> >>>>>> while with the sysreg sync rework that Peter asked for before I can look
> >>>>>> at WFI again.
> >>>>> Okay, here's a patch :) It's a relatively straightforward adaptation
> >>>>> of what we have in our fork, which can now boot Android to GUI while
> >>>>> remaining at around 4% CPU when idle.
> >>>>>
> >>>>> I'm not set up to boot a full Linux distribution at the moment so I
> >>>>> tested it on upstream QEMU by running a recent mainline Linux kernel
> >>>>> with a rootfs containing an init program that just does sleep(5)
> >>>>> and verified that the qemu process remains at low CPU usage during
> >>>>> the sleep. This was on top of your v2 plus the last patch of your v1
> >>>>> since it doesn't look like you have a replacement for that logic yet.
> >>>>>
> >>>>>     accel/hvf/hvf-cpus.c     |  5 +--
> >>>>>     include/sysemu/hvf_int.h |  3 +-
> >>>>>     target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> >>>>>     3 files changed, 28 insertions(+), 74 deletions(-)
> >>>>>
> >>>>> diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> >>>>> index 4360f64671..b2c8fb57f6 100644
> >>>>> --- a/accel/hvf/hvf-cpus.c
> >>>>> +++ b/accel/hvf/hvf-cpus.c
> >>>>> @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> >>>>>         sigact.sa_handler = dummy_signal;
> >>>>>         sigaction(SIG_IPI, &sigact, NULL);
> >>>>>
> >>>>> -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> >>>>> -    sigdelset(&set, SIG_IPI);
> >>>>> -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> >>>>> +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> >>>>> +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> >>>> What will this do to the x86 hvf implementation? We're now not
> >>>> unblocking SIG_IPI again for that, right?
> >>> Yes and that was the case before your patch series.
> >>
> >> The way I understand Roman, he wanted to unblock the IPI signal on x86:
> >>
> >> https://patchwork.kernel.org/project/qemu-devel/patch/20201126215017.41156-3-agraf@csgraf.de/#23807021
> >>
> >> I agree that at this point it's not a problem though to break it again.
> >> I'm not quite sure how to merge your patches within my patch set though,
> >> given they basically revert half of my previously introduced code...
> >>
> >>
> >>>>>     #ifdef __aarch64__
> >>>>>         r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> >>>>> diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> >>>>> index c56baa3ae8..13adf6ea77 100644
> >>>>> --- a/include/sysemu/hvf_int.h
> >>>>> +++ b/include/sysemu/hvf_int.h
> >>>>> @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> >>>>>     struct hvf_vcpu_state {
> >>>>>         uint64_t fd;
> >>>>>         void *exit;
> >>>>> -    struct timespec ts;
> >>>>> -    bool sleeping;
> >>>>> +    sigset_t unblock_ipi_mask;
> >>>>>     };
> >>>>>
> >>>>>     void assert_hvf_ok(hv_return_t ret);
> >>>>> diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> >>>>> index 8fe10966d2..60a361ff38 100644
> >>>>> --- a/target/arm/hvf/hvf.c
> >>>>> +++ b/target/arm/hvf/hvf.c
> >>>>> @@ -2,6 +2,7 @@
> >>>>>      * QEMU Hypervisor.framework support for Apple Silicon
> >>>>>
> >>>>>      * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> >>>>> + * Copyright 2020 Google LLC
> >>>>>      *
> >>>>>      * This work is licensed under the terms of the GNU GPL, version 2 or later.
> >>>>>      * See the COPYING file in the top-level directory.
> >>>>> @@ -18,6 +19,7 @@
> >>>>>     #include "sysemu/hw_accel.h"
> >>>>>
> >>>>>     #include <Hypervisor/Hypervisor.h>
> >>>>> +#include <mach/mach_time.h>
> >>>>>
> >>>>>     #include "exec/address-spaces.h"
> >>>>>     #include "hw/irq.h"
> >>>>> @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> >>>>>
> >>>>>     void hvf_kick_vcpu_thread(CPUState *cpu)
> >>>>>     {
> >>>>> -    if (cpu->hvf->sleeping) {
> >>>>> -        /*
> >>>>> -         * When sleeping, make sure we always send signals. Also, clear the
> >>>>> -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> >>>>> -         * and the nanosleep syscall still aborts the sleep.
> >>>>> -         */
> >>>>> -        cpu->thread_kicked = false;
> >>>>> -        cpu->hvf->ts = (struct timespec){ };
> >>>>> -        cpus_kick_thread(cpu);
> >>>>> -    } else {
> >>>>> -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>>>> -    }
> >>>>> +    cpus_kick_thread(cpu);
> >>>>> +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> >>>> This means your first WFI will almost always return immediately due to a
> >>>> pending signal, because there probably was an IRQ pending before on the
> >>>> same CPU, no?
> >>> That's right. Any approach involving the "sleeping" field would need
> >>> to be implemented carefully to avoid races that may result in missed
> >>> wakeups so for simplicity I just decided to send both kinds of
> >>> wakeups. In particular the approach in the updated patch you sent is
> >>> racy and I'll elaborate more in the reply to that patch.
> >>>
> >>>>>     }
> >>>>>
> >>>>>     static int hvf_inject_interrupts(CPUState *cpu)
> >>>>> @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>             uint64_t syndrome = hvf_exit->exception.syndrome;
> >>>>>             uint32_t ec = syn_get_ec(syndrome);
> >>>>>
> >>>>> +        qemu_mutex_lock_iothread();
> >>>> Is there a particular reason you're moving the iothread lock out again
> >>>> from the individual bits? I would really like to keep a notion of fast
> >>>> path exits.
> >>> We still need to lock at least once no matter the exit reason to check
> >>> the interrupts so I don't think it's worth it to try and avoid locking
> >>> like this. It also makes the implementation easier to reason about and
> >>> therefore more likely to be correct. In our implementation we just
> >>> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> >>>
> >>>>>             switch (exit_reason) {
> >>>>>             case HV_EXIT_REASON_EXCEPTION:
> >>>>>                 /* This is the main one, handle below. */
> >>>>>                 break;
> >>>>>             case HV_EXIT_REASON_VTIMER_ACTIVATED:
> >>>>> -            qemu_mutex_lock_iothread();
> >>>>>                 current_cpu = cpu;
> >>>>>                 qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> >>>>>                 qemu_mutex_unlock_iothread();
> >>>>>                 continue;
> >>>>>             case HV_EXIT_REASON_CANCELED:
> >>>>>                 /* we got kicked, no exit to process */
> >>>>> +            qemu_mutex_unlock_iothread();
> >>>>>                 continue;
> >>>>>             default:
> >>>>>                 assert(0);
> >>>>> @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>                 uint32_t srt = (syndrome >> 16) & 0x1f;
> >>>>>                 uint64_t val = 0;
> >>>>>
> >>>>> -            qemu_mutex_lock_iothread();
> >>>>>                 current_cpu = cpu;
> >>>>>
> >>>>>                 DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> >>>>> @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>                     hvf_set_reg(cpu, srt, val);
> >>>>>                 }
> >>>>>
> >>>>> -            qemu_mutex_unlock_iothread();
> >>>>> -
> >>>>>                 advance_pc = true;
> >>>>>                 break;
> >>>>>             }
> >>>>> @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> >>>>>             case EC_WFX_TRAP:
> >>>>>                 if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> >>>>>                     (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> >>>>> -                uint64_t cval, ctl, val, diff, now;
> >>>>> +                uint64_t cval;
> >>>>>
> >>>>> -                /* Set up a local timer for vtimer if necessary ... */
> >>>>> -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> >>>>> -                assert_hvf_ok(r);
> >>>>>                     r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> >>>>>                     assert_hvf_ok(r);
> >>>>>
> >>>>> -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> >>>>> -                diff = cval - val;
> >>>>> -
> >>>>> -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> >>>>> -                      gt_cntfrq_period_ns(arm_cpu);
> >>>>> -
> >>>>> -                /* Timer disabled or masked, just wait for long */
> >>>>> -                if (!(ctl & 1) || (ctl & 2)) {
> >>>>> -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> >>>>> -                           gt_cntfrq_period_ns(arm_cpu);
> >>>>> +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> >>>>> +                if (ticks_to_sleep < 0) {
> >>>>> +                    break;
> >>>> This will loop at 100% for Windows, which configures the vtimer as
> >>>> cval=0 ctl=7, so with IRQ mask bit set.
> >>> Okay, but the 120s is kind of arbitrary so we should just sleep until
> >>> we get a signal. That can be done by passing null as the timespec
> >>> argument to pselect().
> >>
> >> The reason I capped it at 120s was so that if I do hit a race, you don't
> >> break everything forever. Only for 2 minutes :).
> > I see. I think at this point we want to notice these types of bugs if
> > they exist instead of hiding them, so I would mildly be in favor of
> > not capping at 120s.
>
>
> Crossing my fingers that we are at that point already :).
>
>
> >
> >>>> Alex
> >>>>
> >>>>
> >>>>>                     }
> >>>>>
> >>>>> -                if (diff < INT64_MAX) {
> >>>>> -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> >>>>> -                    struct timespec *ts = &cpu->hvf->ts;
> >>>>> -
> >>>>> -                    *ts = (struct timespec){
> >>>>> -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> >>>>> -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> >>>>> -                    };
> >>>>> -
> >>>>> -                    /*
> >>>>> -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> >>>>> -                     * time periods than 2ms.
> >>>>> -                     */
> >>>>> -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >>>> I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> >>>> return. Without logic like this, super short WFIs will hurt performance
> >>>> quite badly.
> >>> I don't think that's accurate. According to this benchmark it's a few
> >>> hundred nanoseconds at most.
> >>>
> >>> pcc@pac-mini /tmp> cat pselect.c
> >>> #include <signal.h>
> >>> #include <sys/select.h>
> >>>
> >>> int main() {
> >>>     sigset_t mask, orig_mask;
> >>>     pthread_sigmask(SIG_SETMASK, 0, &mask);
> >>>     sigaddset(&mask, SIGUSR1);
> >>>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >>>
> >>>     for (int i = 0; i != 1000000; ++i) {
> >>>       struct timespec ts = { 0, 1 };
> >>>       pselect(0, 0, 0, 0, &ts, &orig_mask);
> >>>     }
> >>> }
> >>> pcc@pac-mini /tmp> time ./pselect
> >>>
> >>> ________________________________________________________
> >>> Executed in  179.87 millis    fish           external
> >>>      usr time   77.68 millis   57.00 micros   77.62 millis
> >>>      sys time  101.37 millis  852.00 micros  100.52 millis
> >>>
> >>> Besides, all that you're really saving here is the single pselect
> >>> call. There are no doubt more expensive syscalls involved in exiting
> >>> and entering the VCPU that would dominate here.
> >>
> >> I would expect that such a super low ts value has a short-circuit path
> >> in the kernel as well. Where things start to fall apart is when you're
> >> at a threshold where rescheduling might be ok, but then you need to take
> >> all of the additional task switch overhead into account. Try to adapt
> >> your test code a bit:
> >>
> >> #include <signal.h>
> >> #include <sys/select.h>
> >>
> >> int main() {
> >>     sigset_t mask, orig_mask;
> >>     pthread_sigmask(SIG_SETMASK, 0, &mask);
> >>     sigaddset(&mask, SIGUSR1);
> >>     pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >>
> >>     for (int i = 0; i != 10000; ++i) {
> >> #define SCALE_MS 1000000
> >>       struct timespec ts = { 0, SCALE_MS / 10 };
> >>       pselect(0, 0, 0, 0, &ts, &orig_mask);
> >>     }
> >> }
> >>
> >>
> >> % time ./pselect
> >> ./pselect  0.00s user 0.01s system 1% cpu 1.282 total
> >>
> >> You're suddenly seeing 300µs overhead per pselect call then. When I
> >> measured actual enter/exit times in QEMU, I saw much bigger differences
> >> between "time I want to sleep for" and "time I did sleep" even when just
> >> capturing the virtual time before and after the nanosleep/pselect call.
> > Okay. So the alternative is that we spin on the CPU, either doing
> > no-op VCPU entries/exits or something like:
> >
> > while (mach_absolute_time() < cval);
>
>
> This won't catch events that arrive during that time, such as
> interrupts, right? I'd just declare the WFI as done and keep looping in
> and out of the guest for now.

Oh, that's a good point.

> > My intuition is we shouldn't try to subvert the OS scheduler like this
> > unless it's proven to help with some real world metric since otherwise
> > we're not being fair to the other processes on the CPU. With CPU
> > intensive workloads I wouldn't expect these kinds of sleeps to happen
> > very often if at all so if it's only microbenchmarks and so on that
> > are affected then my inclination is not to do this for now.
>
>
> The problem is that the VM's OS is expecting bare metal timer behavior
> usually. And that gives you much better granularities than what we can
> achieve with a virtualization layer on top. So I do feel strongly about
> leaving this bit in. In the workloads you describe above, you won't ever
> hit that branch anyway.
>
> The workloads that benefit from logic like this are message passing
> ones. Check out this presentation from a KVM colleague of yours for details:
>
> https://www.linux-kvm.org/images/a/ac/02x03-Davit_Matalack-KVM_Message_passing_Performance.pdf
>    https://www.youtube.com/watch?v=p85FFrloLFg

Mm, okay. I personally would not add anything like that at this point
without real-world data but I don't feel too strongly and I suppose
the implementation can always be adjusted later.

Peter


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-12-01  0:00                       ` Peter Collingbourne
  2020-12-01  0:13                         ` Alexander Graf
@ 2020-12-03  9:41                         ` Roman Bolshakov
  2020-12-03 18:42                           ` Peter Collingbourne
  1 sibling, 1 reply; 64+ messages in thread
From: Roman Bolshakov @ 2020-12-03  9:41 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Alexander Graf, Claudio Fontana, qemu-arm,
	Frank Yang, Paolo Bonzini

On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
> On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
> >
> >
> > On 01.12.20 00:01, Peter Collingbourne wrote:
> > > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
> > >> Hi Peter,
> > >>
> > >> On 30.11.20 22:08, Peter Collingbourne wrote:
> > >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> > >>>>
> > >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> > >>>>> Hi Frank,
> > >>>>>
> > >>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
> > >>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and compare with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macos to read out the sleep time then we might be able to not have to use a poll interval either!
> > >>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> > >>>>>
> > >>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> > >>>>>
> > >>>> Thanks, we'll take a look :)
> > >>>>
> > >>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > >>> I'll reply to your patch here. You have:
> > >>>
> > >>> +                    /* Set cpu->hvf->sleeping so that we get a
> > >>> SIG_IPI signal. */
> > >>> +                    cpu->hvf->sleeping = true;
> > >>> +                    smp_mb();
> > >>> +
> > >>> +                    /* Bail out if we received an IRQ meanwhile */
> > >>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > >>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > >>> +                        cpu->hvf->sleeping = false;
> > >>> +                        break;
> > >>> +                    }
> > >>> +
> > >>> +                    /* nanosleep returns on signal, so we wake up on kick. */
> > >>> +                    nanosleep(ts, NULL);
> > >>>
> > >>> and then send the signal conditional on whether sleeping is true, but
> > >>> I think this is racy. If the signal is sent after sleeping is set to
> > >>> true but before entering nanosleep then I think it will be ignored and
> > >>> we will miss the wakeup. That's why in my implementation I block IPI
> > >>> on the CPU thread at startup and then use pselect to atomically
> > >>> unblock and begin sleeping. The signal is sent unconditionally so
> > >>> there's no need to worry about races between actually sleeping and the
> > >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> > >>> that's better than missing it entirely.
> > >>
> > >> Thanks a bunch for the comment! So the trick I was using here is to
> > >> modify the timespec from the kick function before sending the IPI
> > >> signal. That way, we know that either we are inside the sleep (where the
> > >> signal wakes it up) or we are outside the sleep (where timespec={} will
> > >> make it return immediately).
> > >>
> > >> The only race I can think of is if nanosleep does calculations based on
> > >> the timespec and we happen to send the signal right there and then.
> > > Yes that's the race I was thinking of. Admittedly it's a small window
> > > but it's theoretically possible and part of the reason why pselect was
> > > created.
> > >
> > >> The problem with blocking IPIs is basically what Frank was describing
> > >> earlier: How do you unset the IPI signal pending status? If the signal
> > >> is never delivered, how can pselect differentiate "signal from last time
> > >> is still pending" from "new signal because I got an IPI"?
> > > In this case we would take the additional wakeup which should be
> > > harmless since we will take the WFx exit again and put us in the
> > > correct state. But that's a lot better than busy looping.
> >
> >
> > I'm not sure I follow. I'm thinking of the following scenario:
> >
> >    - trap into WFI handler
> >    - go to sleep with blocked SIG_IPI
> >    - SIG_IPI arrives, pselect() exits
> >    - signal is still pending because it's blocked
> >    - enter guest
> >    - trap into WFI handler
> >    - run pselect(), but it immediate exits because SIG_IPI is still pending
> >
> > This was the loop I was seeing when running with SIG_IPI blocked. That's
> > part of the reason why I switched to a different model.
> 
> What I observe is that when returning from a pending signal pselect
> consumes the signal (which is also consistent with my understanding of
> what pselect does). That means that it doesn't matter if we take a
> second WFx exit because once we reach the pselect in the second WFx
> exit the signal will have been consumed by the pselect in the first
> exit and we will just wait for the next one.
> 

Aha! Thanks for the explanation. So, the first WFI in the series of
guest WFIs will likely wake up immediately? After a period without WFIs
there must be a pending SIG_IPI...

It shouldn't be a critical issue though because (as defined in D1.16.2)
"the architecture permits a PE to leave the low-power state for any
reason, it is permissible for a PE to treat WFI as a NOP, but this is
not recommended for lowest power operation."

BTW. I think a bit from the thread should go into the description of
patch 8, because it's not trivial and it would really be helpful to keep
in repo history. At least something like this (taken from an earlier
reply in the thread):

  In this implementation IPI is blocked on the CPU thread at startup and
  pselect() is used to atomically unblock the signal and begin sleeping.
  The signal is sent unconditionally so there's no need to worry about
  races between actually sleeping and the "we think we're sleeping"
  state. It may lead to an extra wakeup but that's better than missing
  it entirely.


Thanks,
Roman

> I don't know why things may have been going wrong in your
> implementation but it may be related to the issue with
> mach_absolute_time() which I posted about separately and was also
> causing busy loops for us in some cases. Once that issue was fixed in
> our implementation we started seeing sleep until VTIMER due work
> properly.
> 
> >
> >
> > > I reckon that you could improve things a little by unblocking the
> > > signal and then reblocking it before unlocking iothread (e.g. with a
> > > pselect with zero time interval), which would flush any pending
> > > signals. Since any such signal would correspond to a signal from last
> > > time (because we still have the iothread lock) we know that any future
> > > signals should correspond to new IPIs.
> >
> >
> > Yeah, I think you actually *have* to do exactly that, because otherwise
> > pselect() will always return after 0ns because the signal is still pending.
> >
> > And yes, I agree that that starts to sound a bit less racy now. But it
> > means we can probably also just do
> >
> >    - WFI handler
> >    - block SIG_IPI
> >    - set hvf->sleeping = true
> >    - check for pending interrupts
> >    - pselect()
> >    - unblock SIG_IPI
> >
> > which means we run with SIG_IPI unmasked by default. I don't think the
> > number of signal mask changes is any different with that compared to
> > running with SIG_IPI always masked, right?
> 

P.S. Just found that Alex already raised my concern. Pending signals
have to be consumed or there should be no pending signals to start
sleeping on the very first WFI.

> And unlock/lock iothread around the pselect? I suppose that could work
> but as I mentioned it would just be an optimization.
> 
> Maybe I can try to make my approach work on top of your series, or if
> you already have a patch I can try to debug it. Let me know.
> 
> Peter


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-01 18:59                               ` Peter Collingbourne
  2020-12-01 22:03                                 ` Alexander Graf
@ 2020-12-03 10:12                                 ` Roman Bolshakov
  2020-12-03 18:30                                   ` Peter Collingbourne
  1 sibling, 1 reply; 64+ messages in thread
From: Roman Bolshakov @ 2020-12-03 10:12 UTC (permalink / raw)
  To: Peter Collingbourne
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, Alexander Graf, Claudio Fontana, qemu-arm,
	Frank Yang, Paolo Bonzini

On Tue, Dec 01, 2020 at 10:59:50AM -0800, Peter Collingbourne wrote:
> On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> >
> > Hi Peter,
> >
> > On 01.12.20 09:21, Peter Collingbourne wrote:
> > > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > > up on IPI.
> > >
> > > Signed-off-by: Peter Collingbourne <pcc@google.com>
> >
> >
> > Thanks a bunch!
> >
> >
> > > ---
> > > Alexander Graf wrote:
> > >> I would love to take a patch from you here :). I'll still be stuck for a
> > >> while with the sysreg sync rework that Peter asked for before I can look
> > >> at WFI again.
> > > Okay, here's a patch :) It's a relatively straightforward adaptation
> > > of what we have in our fork, which can now boot Android to GUI while
> > > remaining at around 4% CPU when idle.
> > >
> > > I'm not set up to boot a full Linux distribution at the moment so I
> > > tested it on upstream QEMU by running a recent mainline Linux kernel
> > > with a rootfs containing an init program that just does sleep(5)
> > > and verified that the qemu process remains at low CPU usage during
> > > the sleep. This was on top of your v2 plus the last patch of your v1
> > > since it doesn't look like you have a replacement for that logic yet.
> > >
> > >   accel/hvf/hvf-cpus.c     |  5 +--
> > >   include/sysemu/hvf_int.h |  3 +-
> > >   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> > >   3 files changed, 28 insertions(+), 74 deletions(-)
> > >
> > > diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> > > index 4360f64671..b2c8fb57f6 100644
> > > --- a/accel/hvf/hvf-cpus.c
> > > +++ b/accel/hvf/hvf-cpus.c
> > > @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> > >       sigact.sa_handler = dummy_signal;
> > >       sigaction(SIG_IPI, &sigact, NULL);
> > >
> > > -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> > > -    sigdelset(&set, SIG_IPI);
> > > -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> > > +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> > > +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> >
> >
> > What will this do to the x86 hvf implementation? We're now not
> > unblocking SIG_IPI again for that, right?
> 
> Yes and that was the case before your patch series.
> 
> > >
> > >   #ifdef __aarch64__
> > >       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> > > diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> > > index c56baa3ae8..13adf6ea77 100644
> > > --- a/include/sysemu/hvf_int.h
> > > +++ b/include/sysemu/hvf_int.h
> > > @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> > >   struct hvf_vcpu_state {
> > >       uint64_t fd;
> > >       void *exit;
> > > -    struct timespec ts;
> > > -    bool sleeping;
> > > +    sigset_t unblock_ipi_mask;
> > >   };
> > >
> > >   void assert_hvf_ok(hv_return_t ret);
> > > diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> > > index 8fe10966d2..60a361ff38 100644
> > > --- a/target/arm/hvf/hvf.c
> > > +++ b/target/arm/hvf/hvf.c
> > > @@ -2,6 +2,7 @@
> > >    * QEMU Hypervisor.framework support for Apple Silicon
> > >
> > >    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> > > + * Copyright 2020 Google LLC
> > >    *
> > >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > >    * See the COPYING file in the top-level directory.
> > > @@ -18,6 +19,7 @@
> > >   #include "sysemu/hw_accel.h"
> > >
> > >   #include <Hypervisor/Hypervisor.h>
> > > +#include <mach/mach_time.h>
> > >
> > >   #include "exec/address-spaces.h"
> > >   #include "hw/irq.h"
> > > @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> > >
> > >   void hvf_kick_vcpu_thread(CPUState *cpu)
> > >   {
> > > -    if (cpu->hvf->sleeping) {
> > > -        /*
> > > -         * When sleeping, make sure we always send signals. Also, clear the
> > > -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> > > -         * and the nanosleep syscall still aborts the sleep.
> > > -         */
> > > -        cpu->thread_kicked = false;
> > > -        cpu->hvf->ts = (struct timespec){ };
> > > -        cpus_kick_thread(cpu);
> > > -    } else {
> > > -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> > > -    }
> > > +    cpus_kick_thread(cpu);
> > > +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> >
> >
> > This means your first WFI will almost always return immediately due to a
> > pending signal, because there probably was an IRQ pending before on the
> > same CPU, no?
> 
> That's right. Any approach involving the "sleeping" field would need
> to be implemented carefully to avoid races that may result in missed
> wakeups so for simplicity I just decided to send both kinds of
> wakeups. In particular the approach in the updated patch you sent is
> racy and I'll elaborate more in the reply to that patch.
> 
> > >   }
> > >
> > >   static int hvf_inject_interrupts(CPUState *cpu)
> > > @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >           uint64_t syndrome = hvf_exit->exception.syndrome;
> > >           uint32_t ec = syn_get_ec(syndrome);
> > >
> > > +        qemu_mutex_lock_iothread();
> >
> >
> > Is there a particular reason you're moving the iothread lock out again
> > from the individual bits? I would really like to keep a notion of fast
> > path exits.
> 
> We still need to lock at least once no matter the exit reason to check
> the interrupts so I don't think it's worth it to try and avoid locking
> like this. It also makes the implementation easier to reason about and
> therefore more likely to be correct. In our implementation we just
> stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> 

But does it leave a small window for a kick to be lost between
qemu_mutex_unlock_iothread() and hv_vcpu_run()/pselect()?

For x86 it could lose a kick between them. That was the reason for the
sophisticated approach to catching the kick [1] (and related discussions
in v1/v2/v3).  Unfortunately I can't read ARM assembly yet, so I don't
know whether hv_vcpus_exit() suffers from the same issue as the x86
hv_vcpu_interrupt().

1. https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolshakov@yadro.com/

Thanks,
Roman

> > >           switch (exit_reason) {
> > >           case HV_EXIT_REASON_EXCEPTION:
> > >               /* This is the main one, handle below. */
> > >               break;
> > >           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> > >               qemu_mutex_unlock_iothread();
> > >               continue;
> > >           case HV_EXIT_REASON_CANCELED:
> > >               /* we got kicked, no exit to process */
> > > +            qemu_mutex_unlock_iothread();
> > >               continue;
> > >           default:
> > >               assert(0);
> > > @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >               uint32_t srt = (syndrome >> 16) & 0x1f;
> > >               uint64_t val = 0;
> > >
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >
> > >               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> > > @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >                   hvf_set_reg(cpu, srt, val);
> > >               }
> > >
> > > -            qemu_mutex_unlock_iothread();
> > > -
> > >               advance_pc = true;
> > >               break;
> > >           }
> > > @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >           case EC_WFX_TRAP:
> > >               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> > >                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > -                uint64_t cval, ctl, val, diff, now;
> > > +                uint64_t cval;
> > >
> > > -                /* Set up a local timer for vtimer if necessary ... */
> > > -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> > > -                assert_hvf_ok(r);
> > >                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> > >                   assert_hvf_ok(r);
> > >
> > > -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> > > -                diff = cval - val;
> > > -
> > > -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> > > -                      gt_cntfrq_period_ns(arm_cpu);
> > > -
> > > -                /* Timer disabled or masked, just wait for long */
> > > -                if (!(ctl & 1) || (ctl & 2)) {
> > > -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> > > -                           gt_cntfrq_period_ns(arm_cpu);
> > > +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> > > +                if (ticks_to_sleep < 0) {
> > > +                    break;
> >
> >
> > This will loop at 100% for Windows, which configures the vtimer as
> > cval=0 ctl=7, so with IRQ mask bit set.
> 
> Okay, but the 120s is kind of arbitrary so we should just sleep until
> we get a signal. That can be done by passing null as the timespec
> argument to pselect().
> 
> >
> >
> > Alex
> >
> >
> > >                   }
> > >
> > > -                if (diff < INT64_MAX) {
> > > -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> > > -                    struct timespec *ts = &cpu->hvf->ts;
> > > -
> > > -                    *ts = (struct timespec){
> > > -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> > > -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> > > -                    };
> > > -
> > > -                    /*
> > > -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> > > -                     * time periods than 2ms.
> > > -                     */
> > > -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> >
> >
> > I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> > return. Without logic like this, super short WFIs will hurt performance
> > quite badly.
> 
> I don't think that's accurate. According to this benchmark it's a few
> hundred nanoseconds at most.
> 
> pcc@pac-mini /tmp> cat pselect.c
> #include <signal.h>
> #include <sys/select.h>
> 
> int main() {
>   sigset_t mask, orig_mask;
>   pthread_sigmask(SIG_SETMASK, 0, &mask);
>   sigaddset(&mask, SIGUSR1);
>   pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> 
>   for (int i = 0; i != 1000000; ++i) {
>     struct timespec ts = { 0, 1 };
>     pselect(0, 0, 0, 0, &ts, &orig_mask);
>   }
> }
> pcc@pac-mini /tmp> time ./pselect
> 
> ________________________________________________________
> Executed in  179.87 millis    fish           external
>    usr time   77.68 millis   57.00 micros   77.62 millis
>    sys time  101.37 millis  852.00 micros  100.52 millis
> 
> Besides, all that you're really saving here is the single pselect
> call. There are no doubt more expensive syscalls involved in exiting
> and entering the VCPU that would dominate here.
> 
> Peter
> 
> >
> >
> > Alex
> >
> > > -                        advance_pc = true;
> > > -                        break;
> > > -                    }
> > > -
> > > -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > > -                    cpu->hvf->sleeping = true;
> > > -                    smp_mb();
> > > -
> > > -                    /* Bail out if we received an IRQ meanwhile */
> > > -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > > -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > -                        cpu->hvf->sleeping = false;
> > > -                        break;
> > > -                    }
> > > -
> > > -                    /* nanosleep returns on signal, so we wake up on kick. */
> > > -                    nanosleep(ts, NULL);
> > > -
> > > -                    /* Out of sleep - either naturally or because of a kick */
> > > -                    cpu->hvf->sleeping = false;
> > > -                }
> > > +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> > > +                uint64_t nanos =
> > > +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> > > +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> > > +                struct timespec ts = { seconds, nanos };
> > > +
> > > +                /*
> > > +                 * Use pselect to sleep so that other threads can IPI us while
> > > +                 * we're sleeping.
> > > +                 */
> > > +                qatomic_mb_set(&cpu->thread_kicked, false);
> > > +                qemu_mutex_unlock_iothread();
> > > +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> > > +                qemu_mutex_lock_iothread();
> > >
> > >                   advance_pc = true;
> > >               }
> > >               break;
> > >           case EC_AA64_HVC:
> > >               cpu_synchronize_state(cpu);
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
> > >                   arm_handle_psci_call(arm_cpu);
> > > @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
> > >                   env->xregs[0] = -1;
> > >               }
> > > -            qemu_mutex_unlock_iothread();
> > >               break;
> > >           case EC_AA64_SMC:
> > >               cpu_synchronize_state(cpu);
> > > -            qemu_mutex_lock_iothread();
> > >               current_cpu = cpu;
> > >               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
> > >                   arm_handle_psci_call(arm_cpu);
> > > @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >                   env->xregs[0] = -1;
> > >                   env->pc += 4;
> > >               }
> > > -            qemu_mutex_unlock_iothread();
> > >               break;
> > >           default:
> > >               cpu_synchronize_state(cpu);
> > > @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> > >               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
> > >               assert_hvf_ok(r);
> > >           }
> > > +        qemu_mutex_unlock_iothread();
> > >       } while (ret == 0);
> > >
> > >       qemu_mutex_lock_iothread();


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH] arm/hvf: Optimize and simplify WFI handling
  2020-12-03 10:12                                 ` Roman Bolshakov
@ 2020-12-03 18:30                                   ` Peter Collingbourne
  0 siblings, 0 replies; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-03 18:30 UTC (permalink / raw)
  To: Roman Bolshakov
  Cc: Alexander Graf, Frank Yang, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Thu, Dec 3, 2020 at 2:12 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
>
> On Tue, Dec 01, 2020 at 10:59:50AM -0800, Peter Collingbourne wrote:
> > On Tue, Dec 1, 2020 at 3:16 AM Alexander Graf <agraf@csgraf.de> wrote:
> > >
> > > Hi Peter,
> > >
> > > On 01.12.20 09:21, Peter Collingbourne wrote:
> > > > Sleep on WFx until the VTIMER is due but allow ourselves to be woken
> > > > up on IPI.
> > > >
> > > > Signed-off-by: Peter Collingbourne <pcc@google.com>
> > >
> > >
> > > Thanks a bunch!
> > >
> > >
> > > > ---
> > > > Alexander Graf wrote:
> > > >> I would love to take a patch from you here :). I'll still be stuck for a
> > > >> while with the sysreg sync rework that Peter asked for before I can look
> > > >> at WFI again.
> > > > Okay, here's a patch :) It's a relatively straightforward adaptation
> > > > of what we have in our fork, which can now boot Android to GUI while
> > > > remaining at around 4% CPU when idle.
> > > >
> > > > I'm not set up to boot a full Linux distribution at the moment so I
> > > > tested it on upstream QEMU by running a recent mainline Linux kernel
> > > > with a rootfs containing an init program that just does sleep(5)
> > > > and verified that the qemu process remains at low CPU usage during
> > > > the sleep. This was on top of your v2 plus the last patch of your v1
> > > > since it doesn't look like you have a replacement for that logic yet.
> > > >
> > > >   accel/hvf/hvf-cpus.c     |  5 +--
> > > >   include/sysemu/hvf_int.h |  3 +-
> > > >   target/arm/hvf/hvf.c     | 94 +++++++++++-----------------------------
> > > >   3 files changed, 28 insertions(+), 74 deletions(-)
> > > >
> > > > diff --git a/accel/hvf/hvf-cpus.c b/accel/hvf/hvf-cpus.c
> > > > index 4360f64671..b2c8fb57f6 100644
> > > > --- a/accel/hvf/hvf-cpus.c
> > > > +++ b/accel/hvf/hvf-cpus.c
> > > > @@ -344,9 +344,8 @@ static int hvf_init_vcpu(CPUState *cpu)
> > > >       sigact.sa_handler = dummy_signal;
> > > >       sigaction(SIG_IPI, &sigact, NULL);
> > > >
> > > > -    pthread_sigmask(SIG_BLOCK, NULL, &set);
> > > > -    sigdelset(&set, SIG_IPI);
> > > > -    pthread_sigmask(SIG_SETMASK, &set, NULL);
> > > > +    pthread_sigmask(SIG_BLOCK, NULL, &cpu->hvf->unblock_ipi_mask);
> > > > +    sigdelset(&cpu->hvf->unblock_ipi_mask, SIG_IPI);
> > >
> > >
> > > What will this do to the x86 hvf implementation? We're now not
> > > unblocking SIG_IPI again for that, right?
> >
> > Yes and that was the case before your patch series.
> >
> > > >
> > > >   #ifdef __aarch64__
> > > >       r = hv_vcpu_create(&cpu->hvf->fd, (hv_vcpu_exit_t **)&cpu->hvf->exit, NULL);
> > > > diff --git a/include/sysemu/hvf_int.h b/include/sysemu/hvf_int.h
> > > > index c56baa3ae8..13adf6ea77 100644
> > > > --- a/include/sysemu/hvf_int.h
> > > > +++ b/include/sysemu/hvf_int.h
> > > > @@ -62,8 +62,7 @@ extern HVFState *hvf_state;
> > > >   struct hvf_vcpu_state {
> > > >       uint64_t fd;
> > > >       void *exit;
> > > > -    struct timespec ts;
> > > > -    bool sleeping;
> > > > +    sigset_t unblock_ipi_mask;
> > > >   };
> > > >
> > > >   void assert_hvf_ok(hv_return_t ret);
> > > > diff --git a/target/arm/hvf/hvf.c b/target/arm/hvf/hvf.c
> > > > index 8fe10966d2..60a361ff38 100644
> > > > --- a/target/arm/hvf/hvf.c
> > > > +++ b/target/arm/hvf/hvf.c
> > > > @@ -2,6 +2,7 @@
> > > >    * QEMU Hypervisor.framework support for Apple Silicon
> > > >
> > > >    * Copyright 2020 Alexander Graf <agraf@csgraf.de>
> > > > + * Copyright 2020 Google LLC
> > > >    *
> > > >    * This work is licensed under the terms of the GNU GPL, version 2 or later.
> > > >    * See the COPYING file in the top-level directory.
> > > > @@ -18,6 +19,7 @@
> > > >   #include "sysemu/hw_accel.h"
> > > >
> > > >   #include <Hypervisor/Hypervisor.h>
> > > > +#include <mach/mach_time.h>
> > > >
> > > >   #include "exec/address-spaces.h"
> > > >   #include "hw/irq.h"
> > > > @@ -320,18 +322,8 @@ int hvf_arch_init_vcpu(CPUState *cpu)
> > > >
> > > >   void hvf_kick_vcpu_thread(CPUState *cpu)
> > > >   {
> > > > -    if (cpu->hvf->sleeping) {
> > > > -        /*
> > > > -         * When sleeping, make sure we always send signals. Also, clear the
> > > > -         * timespec, so that an IPI that arrives between setting hvf->sleeping
> > > > -         * and the nanosleep syscall still aborts the sleep.
> > > > -         */
> > > > -        cpu->thread_kicked = false;
> > > > -        cpu->hvf->ts = (struct timespec){ };
> > > > -        cpus_kick_thread(cpu);
> > > > -    } else {
> > > > -        hv_vcpus_exit(&cpu->hvf->fd, 1);
> > > > -    }
> > > > +    cpus_kick_thread(cpu);
> > > > +    hv_vcpus_exit(&cpu->hvf->fd, 1);
> > >
> > >
> > > This means your first WFI will almost always return immediately due to a
> > > pending signal, because there probably was an IRQ pending before on the
> > > same CPU, no?
> >
> > That's right. Any approach involving the "sleeping" field would need
> > to be implemented carefully to avoid races that may result in missed
> > wakeups so for simplicity I just decided to send both kinds of
> > wakeups. In particular the approach in the updated patch you sent is
> > racy and I'll elaborate more in the reply to that patch.
> >
> > > >   }
> > > >
> > > >   static int hvf_inject_interrupts(CPUState *cpu)
> > > > @@ -385,18 +377,19 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >           uint64_t syndrome = hvf_exit->exception.syndrome;
> > > >           uint32_t ec = syn_get_ec(syndrome);
> > > >
> > > > +        qemu_mutex_lock_iothread();
> > >
> > >
> > > Is there a particular reason you're moving the iothread lock out again
> > > from the individual bits? I would really like to keep a notion of fast
> > > path exits.
> >
> > We still need to lock at least once no matter the exit reason to check
> > the interrupts so I don't think it's worth it to try and avoid locking
> > like this. It also makes the implementation easier to reason about and
> > therefore more likely to be correct. In our implementation we just
> > stay locked the whole time unless we're in hv_vcpu_run() or pselect().
> >
>
> But does it leave a small window for a kick to be lost between
> qemu_mutex_unlock_iothread() and hv_vcpu_run()/pselect()?
>
> For x86 it could lose a kick between them. That was the reason for the
> sophisticated approach to catching the kick [1] (and related discussions
> in v1/v2/v3).  Unfortunately I can't read ARM assembly yet, so I don't
> know whether hv_vcpus_exit() suffers from the same issue as the x86
> hv_vcpu_interrupt().
>
> 1. https://patchwork.kernel.org/project/qemu-devel/patch/20200729124832.79375-1-r.bolshakov@yadro.com/

I addressed pselect() in my other reply.

It isn't on the website but the hv_vcpu.h header says this about
hv_vcpus_exit():

 * @discussion
 *             If a vcpu is not running, the next time hv_vcpu_run is called
 *             for the corresponding vcpu, it will return immediately without
 *             entering the guest.

So at least as documented I think we are okay.

Peter

>
> Thanks,
> Roman
>
> > > >           switch (exit_reason) {
> > > >           case HV_EXIT_REASON_EXCEPTION:
> > > >               /* This is the main one, handle below. */
> > > >               break;
> > > >           case HV_EXIT_REASON_VTIMER_ACTIVATED:
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >               qemu_set_irq(arm_cpu->gt_timer_outputs[GTIMER_VIRT], 1);
> > > >               qemu_mutex_unlock_iothread();
> > > >               continue;
> > > >           case HV_EXIT_REASON_CANCELED:
> > > >               /* we got kicked, no exit to process */
> > > > +            qemu_mutex_unlock_iothread();
> > > >               continue;
> > > >           default:
> > > >               assert(0);
> > > > @@ -413,7 +406,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >               uint32_t srt = (syndrome >> 16) & 0x1f;
> > > >               uint64_t val = 0;
> > > >
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >
> > > >               DPRINTF("data abort: [pc=0x%llx va=0x%016llx pa=0x%016llx isv=%x "
> > > > @@ -446,8 +438,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >                   hvf_set_reg(cpu, srt, val);
> > > >               }
> > > >
> > > > -            qemu_mutex_unlock_iothread();
> > > > -
> > > >               advance_pc = true;
> > > >               break;
> > > >           }
> > > > @@ -493,68 +483,36 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >           case EC_WFX_TRAP:
> > > >               if (!(syndrome & WFX_IS_WFE) && !(cpu->interrupt_request &
> > > >                   (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > > -                uint64_t cval, ctl, val, diff, now;
> > > > +                uint64_t cval;
> > > >
> > > > -                /* Set up a local timer for vtimer if necessary ... */
> > > > -                r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CTL_EL0, &ctl);
> > > > -                assert_hvf_ok(r);
> > > >                   r = hv_vcpu_get_sys_reg(cpu->hvf->fd, HV_SYS_REG_CNTV_CVAL_EL0, &cval);
> > > >                   assert_hvf_ok(r);
> > > >
> > > > -                asm volatile("mrs %0, cntvct_el0" : "=r"(val));
> > > > -                diff = cval - val;
> > > > -
> > > > -                now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) /
> > > > -                      gt_cntfrq_period_ns(arm_cpu);
> > > > -
> > > > -                /* Timer disabled or masked, just wait for long */
> > > > -                if (!(ctl & 1) || (ctl & 2)) {
> > > > -                    diff = (120 * NANOSECONDS_PER_SECOND) /
> > > > -                           gt_cntfrq_period_ns(arm_cpu);
> > > > +                int64_t ticks_to_sleep = cval - mach_absolute_time();
> > > > +                if (ticks_to_sleep < 0) {
> > > > +                    break;
> > >
> > >
> > > This will loop at 100% for Windows, which configures the vtimer as
> > > cval=0 ctl=7, so with IRQ mask bit set.
> >
> > Okay, but the 120s is kind of arbitrary so we should just sleep until
> > we get a signal. That can be done by passing null as the timespec
> > argument to pselect().
> >
> > >
> > >
> > > Alex
> > >
> > >
> > > >                   }
> > > >
> > > > -                if (diff < INT64_MAX) {
> > > > -                    uint64_t ns = diff * gt_cntfrq_period_ns(arm_cpu);
> > > > -                    struct timespec *ts = &cpu->hvf->ts;
> > > > -
> > > > -                    *ts = (struct timespec){
> > > > -                        .tv_sec = ns / NANOSECONDS_PER_SECOND,
> > > > -                        .tv_nsec = ns % NANOSECONDS_PER_SECOND,
> > > > -                    };
> > > > -
> > > > -                    /*
> > > > -                     * Waking up easily takes 1ms, don't go to sleep for smaller
> > > > -                     * time periods than 2ms.
> > > > -                     */
> > > > -                    if (!ts->tv_sec && (ts->tv_nsec < (SCALE_MS * 2))) {
> > >
> > >
> > > I put this logic here on purpose. A pselect(1 ns) easily takes 1-2ms to
> > > return. Without logic like this, super short WFIs will hurt performance
> > > quite badly.
> >
> > I don't think that's accurate. According to this benchmark it's a few
> > hundred nanoseconds at most.
> >
> > pcc@pac-mini /tmp> cat pselect.c
> > #include <signal.h>
> > #include <sys/select.h>
> >
> > int main() {
> >   sigset_t mask, orig_mask;
> >   pthread_sigmask(SIG_SETMASK, 0, &mask);
> >   sigaddset(&mask, SIGUSR1);
> >   pthread_sigmask(SIG_SETMASK, &mask, &orig_mask);
> >
> >   for (int i = 0; i != 1000000; ++i) {
> >     struct timespec ts = { 0, 1 };
> >     pselect(0, 0, 0, 0, &ts, &orig_mask);
> >   }
> > }
> > pcc@pac-mini /tmp> time ./pselect
> >
> > ________________________________________________________
> > Executed in  179.87 millis    fish           external
> >    usr time   77.68 millis   57.00 micros   77.62 millis
> >    sys time  101.37 millis  852.00 micros  100.52 millis
> >
> > Besides, all that you're really saving here is the single pselect
> > call. There are no doubt more expensive syscalls involved in exiting
> > and entering the VCPU that would dominate here.
> >
> > Peter
> >
> > >
> > >
> > > Alex
> > >
> > > > -                        advance_pc = true;
> > > > -                        break;
> > > > -                    }
> > > > -
> > > > -                    /* Set cpu->hvf->sleeping so that we get a SIG_IPI signal. */
> > > > -                    cpu->hvf->sleeping = true;
> > > > -                    smp_mb();
> > > > -
> > > > -                    /* Bail out if we received an IRQ meanwhile */
> > > > -                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > > > -                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > > -                        cpu->hvf->sleeping = false;
> > > > -                        break;
> > > > -                    }
> > > > -
> > > > -                    /* nanosleep returns on signal, so we wake up on kick. */
> > > > -                    nanosleep(ts, NULL);
> > > > -
> > > > -                    /* Out of sleep - either naturally or because of a kick */
> > > > -                    cpu->hvf->sleeping = false;
> > > > -                }
> > > > +                uint64_t seconds = ticks_to_sleep / arm_cpu->gt_cntfrq_hz;
> > > > +                uint64_t nanos =
> > > > +                    (ticks_to_sleep - arm_cpu->gt_cntfrq_hz * seconds) *
> > > > +                    1000000000 / arm_cpu->gt_cntfrq_hz;
> > > > +                struct timespec ts = { seconds, nanos };
> > > > +
> > > > +                /*
> > > > +                 * Use pselect to sleep so that other threads can IPI us while
> > > > +                 * we're sleeping.
> > > > +                 */
> > > > +                qatomic_mb_set(&cpu->thread_kicked, false);
> > > > +                qemu_mutex_unlock_iothread();
> > > > +                pselect(0, 0, 0, 0, &ts, &cpu->hvf->unblock_ipi_mask);
> > > > +                qemu_mutex_lock_iothread();
> > > >
> > > >                   advance_pc = true;
> > > >               }
> > > >               break;
> > > >           case EC_AA64_HVC:
> > > >               cpu_synchronize_state(cpu);
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >               if (arm_is_psci_call(arm_cpu, EXCP_HVC)) {
> > > >                   arm_handle_psci_call(arm_cpu);
> > > > @@ -562,11 +520,9 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >                   DPRINTF("unknown HVC! %016llx", env->xregs[0]);
> > > >                   env->xregs[0] = -1;
> > > >               }
> > > > -            qemu_mutex_unlock_iothread();
> > > >               break;
> > > >           case EC_AA64_SMC:
> > > >               cpu_synchronize_state(cpu);
> > > > -            qemu_mutex_lock_iothread();
> > > >               current_cpu = cpu;
> > > >               if (arm_is_psci_call(arm_cpu, EXCP_SMC)) {
> > > >                   arm_handle_psci_call(arm_cpu);
> > > > @@ -575,7 +531,6 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >                   env->xregs[0] = -1;
> > > >                   env->pc += 4;
> > > >               }
> > > > -            qemu_mutex_unlock_iothread();
> > > >               break;
> > > >           default:
> > > >               cpu_synchronize_state(cpu);
> > > > @@ -594,6 +549,7 @@ int hvf_vcpu_exec(CPUState *cpu)
> > > >               r = hv_vcpu_set_reg(cpu->hvf->fd, HV_REG_PC, pc);
> > > >               assert_hvf_ok(r);
> > > >           }
> > > > +        qemu_mutex_unlock_iothread();
> > > >       } while (ret == 0);
> > > >
> > > >       qemu_mutex_lock_iothread();


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-12-03  9:41                         ` [PATCH 2/8] hvf: Move common code out Roman Bolshakov
@ 2020-12-03 18:42                           ` Peter Collingbourne
  2020-12-03 22:13                             ` Alexander Graf
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Collingbourne @ 2020-12-03 18:42 UTC (permalink / raw)
  To: Roman Bolshakov
  Cc: Alexander Graf, Frank Yang, Peter Maydell, Eduardo Habkost,
	Richard Henderson, qemu-devel, Cameron Esfahani, qemu-arm,
	Claudio Fontana, Paolo Bonzini

On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
>
> On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
> > On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
> > >
> > >
> > > On 01.12.20 00:01, Peter Collingbourne wrote:
> > > > On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
> > > >> Hi Peter,
> > > >>
> > > >> On 30.11.20 22:08, Peter Collingbourne wrote:
> > > >>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
> > > >>>>
> > > >>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
> > > >>>>> Hi Frank,
> > > >>>>>
> > > >>>>> Thanks for the update :). Your previous email nudged me in the right direction. I previously had implemented WFI through the internal timer framework, which performed way worse.
> > > >>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and comparing it with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macOS to read out the sleep time then we might be able to not have to use a poll interval either!
> > > >>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
> > > >>>>>
> > > >>>>>     https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
> > > >>>>>
> > > >>>> Thanks, we'll take a look :)
> > > >>>>
> > > >>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
> > > >>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
> > > >>> I'll reply to your patch here. You have:
> > > >>>
> > > >>> +                    /* Set cpu->hvf->sleeping so that we get a
> > > >>> SIG_IPI signal. */
> > > >>> +                    cpu->hvf->sleeping = true;
> > > >>> +                    smp_mb();
> > > >>> +
> > > >>> +                    /* Bail out if we received an IRQ meanwhile */
> > > >>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
> > > >>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
> > > >>> +                        cpu->hvf->sleeping = false;
> > > >>> +                        break;
> > > >>> +                    }
> > > >>> +
> > > >>> +                    /* nanosleep returns on signal, so we wake up on kick. */
> > > >>> +                    nanosleep(ts, NULL);
> > > >>>
> > > >>> and then send the signal conditional on whether sleeping is true, but
> > > >>> I think this is racy. If the signal is sent after sleeping is set to
> > > >>> true but before entering nanosleep then I think it will be ignored and
> > > >>> we will miss the wakeup. That's why in my implementation I block IPI
> > > >>> on the CPU thread at startup and then use pselect to atomically
> > > >>> unblock and begin sleeping. The signal is sent unconditionally so
> > > >>> there's no need to worry about races between actually sleeping and the
> > > >>> "we think we're sleeping" state. It may lead to an extra wakeup but
> > > >>> that's better than missing it entirely.
> > > >>
> > > >> Thanks a bunch for the comment! So the trick I was using here is to
> > > >> modify the timespec from the kick function before sending the IPI
> > > >> signal. That way, we know that either we are inside the sleep (where the
> > > >> signal wakes it up) or we are outside the sleep (where timespec={} will
> > > >> make it return immediately).
> > > >>
> > > >> The only race I can think of is if nanosleep does calculations based on
> > > >> the timespec and we happen to send the signal right there and then.
> > > > Yes that's the race I was thinking of. Admittedly it's a small window
> > > > but it's theoretically possible and part of the reason why pselect was
> > > > created.
> > > >
> > > >> The problem with blocking IPIs is basically what Frank was describing
> > > >> earlier: How do you unset the IPI signal pending status? If the signal
> > > >> is never delivered, how can pselect differentiate "signal from last time
> > > >> is still pending" from "new signal because I got an IPI"?
> > > > In this case we would take the additional wakeup which should be
> > > > harmless since we will take the WFx exit again and put us in the
> > > > correct state. But that's a lot better than busy looping.
> > >
> > >
> > > I'm not sure I follow. I'm thinking of the following scenario:
> > >
> > >    - trap into WFI handler
> > >    - go to sleep with blocked SIG_IPI
> > >    - SIG_IPI arrives, pselect() exits
> > >    - signal is still pending because it's blocked
> > >    - enter guest
> > >    - trap into WFI handler
> > >    - run pselect(), but it immediately exits because SIG_IPI is still pending
> > >
> > > This was the loop I was seeing when running with SIG_IPI blocked. That's
> > > part of the reason why I switched to a different model.
> >
> > What I observe is that when returning from a pending signal pselect
> > consumes the signal (which is also consistent with my understanding of
> > what pselect does). That means that it doesn't matter if we take a
> > second WFx exit because once we reach the pselect in the second WFx
> > exit the signal will have been consumed by the pselect in the first
> > exit and we will just wait for the next one.
> >
>
> Aha! Thanks for the explanation. So, the first WFI in the series of
> guest WFIs will likely wake up immediately? After a period without WFIs
> there must be a pending SIG_IPI...
>
> It shouldn't be a critical issue though because (as defined in D1.16.2)
> "the architecture permits a PE to leave the low-power state for any
> reason, it is permissible for a PE to treat WFI as a NOP, but this is
> not recommended for lowest power operation."
>
> BTW. I think a bit from the thread should go into the description of
> patch 8, because it's not trivial and it would really be helpful to keep
> in repo history. At least something like this (taken from an earlier
> reply in the thread):
>
>   In this implementation IPI is blocked on the CPU thread at startup and
>   pselect() is used to atomically unblock the signal and begin sleeping.
>   The signal is sent unconditionally so there's no need to worry about
>   races between actually sleeping and the "we think we're sleeping"
>   state. It may lead to an extra wakeup but that's better than missing
>   it entirely.

Okay, I'll add something like that to the next version of the patch I send out.

Peter

>
>
> Thanks,
> Roman
>
> > I don't know why things may have been going wrong in your
> > implementation but it may be related to the issue with
> > mach_absolute_time() which I posted about separately and was also
> > causing busy loops for us in some cases. Once that issue was fixed in
> > our implementation we started seeing sleep until VTIMER due work
> > properly.
> >
> > >
> > >
> > > > I reckon that you could improve things a little by unblocking the
> > > > signal and then reblocking it before unlocking iothread (e.g. with a
> > > > pselect with zero time interval), which would flush any pending
> > > > signals. Since any such signal would correspond to a signal from last
> > > > time (because we still have the iothread lock) we know that any future
> > > > signals should correspond to new IPIs.
> > >
> > >
> > > Yeah, I think you actually *have* to do exactly that, because otherwise
> > > pselect() will always return after 0ns because the signal is still pending.
> > >
> > > And yes, I agree that that starts to sound a bit less racy now. But it
> > > means we can probably also just do
> > >
> > >    - WFI handler
> > >    - block SIG_IPI
> > >    - set hvf->sleeping = true
> > >    - check for pending interrupts
> > >    - pselect()
> > >    - unblock SIG_IPI
> > >
> > > which means we run with SIG_IPI unmasked by default. I don't think the
> > > number of signal mask changes is any different with that compared to
> > > running with SIG_IPI always masked, right?
> >
>
> P.S. Just found that Alex already raised my concern. Pending signals
> have to be consumed or there should be no pending signals to start
> sleeping on the very first WFI.
>
> > And unlock/lock iothread around the pselect? I suppose that could work
> > but as I mentioned it would just be an optimization.
> >
> > Maybe I can try to make my approach work on top of your series, or if
> > you already have a patch I can try to debug it. Let me know.
> >
> > Peter


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-12-03 18:42                           ` Peter Collingbourne
@ 2020-12-03 22:13                             ` Alexander Graf
  2020-12-03 23:04                               ` Roman Bolshakov
  0 siblings, 1 reply; 64+ messages in thread
From: Alexander Graf @ 2020-12-03 22:13 UTC (permalink / raw)
  To: Peter Collingbourne, Roman Bolshakov
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Claudio Fontana, Frank Yang,
	Paolo Bonzini


On 03.12.20 19:42, Peter Collingbourne wrote:
> On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
>> On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
>>> On Mon, Nov 30, 2020 at 3:18 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>
>>>> On 01.12.20 00:01, Peter Collingbourne wrote:
>>>>> On Mon, Nov 30, 2020 at 1:40 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>>> Hi Peter,
>>>>>>
>>>>>> On 30.11.20 22:08, Peter Collingbourne wrote:
>>>>>>> On Mon, Nov 30, 2020 at 12:56 PM Frank Yang <lfy@google.com> wrote:
>>>>>>>> On Mon, Nov 30, 2020 at 12:34 PM Alexander Graf <agraf@csgraf.de> wrote:
>>>>>>>>> Hi Frank,
>>>>>>>>>
>>>>>>>>> Thanks for the update :). Your previous email nudged me into the right direction. I previously had implemented WFI through the internal timer framework which performed way worse.
>>>>>>>> Cool, glad it's helping. Also, Peter found out that the main thing keeping us from just using cntpct_el0 on the host directly and comparing it with cval is that if we sleep, cval is going to be much < cntpct_el0 by the sleep time. If we can get either the architecture or macOS to read out the sleep time then we might be able to not have to use a poll interval either!
>>>>>>>>> Along the way, I stumbled over a few issues though. For starters, the signal mask for SIG_IPI was not set correctly, so while pselect() would exit, the signal would never get delivered to the thread! For a fix, check out
>>>>>>>>>
>>>>>>>>>      https://patchew.org/QEMU/20201130030723.78326-1-agraf@csgraf.de/20201130030723.78326-4-agraf@csgraf.de/
>>>>>>>>>
>>>>>>>> Thanks, we'll take a look :)
>>>>>>>>
>>>>>>>>> Please also have a look at my latest stab at WFI emulation. It doesn't handle WFE (that's only relevant in overcommitted scenarios). But it does handle WFI and even does something similar to hlt polling, albeit not with an adaptive threshold.
>>>>>>> Sorry I'm not subscribed to qemu-devel (I'll subscribe in a bit) so
>>>>>>> I'll reply to your patch here. You have:
>>>>>>>
>>>>>>> +                    /* Set cpu->hvf->sleeping so that we get a
>>>>>>> SIG_IPI signal. */
>>>>>>> +                    cpu->hvf->sleeping = true;
>>>>>>> +                    smp_mb();
>>>>>>> +
>>>>>>> +                    /* Bail out if we received an IRQ meanwhile */
>>>>>>> +                    if (cpu->thread_kicked || (cpu->interrupt_request &
>>>>>>> +                        (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ))) {
>>>>>>> +                        cpu->hvf->sleeping = false;
>>>>>>> +                        break;
>>>>>>> +                    }
>>>>>>> +
>>>>>>> +                    /* nanosleep returns on signal, so we wake up on kick. */
>>>>>>> +                    nanosleep(ts, NULL);
>>>>>>>
>>>>>>> and then send the signal conditional on whether sleeping is true, but
>>>>>>> I think this is racy. If the signal is sent after sleeping is set to
>>>>>>> true but before entering nanosleep then I think it will be ignored and
>>>>>>> we will miss the wakeup. That's why in my implementation I block IPI
>>>>>>> on the CPU thread at startup and then use pselect to atomically
>>>>>>> unblock and begin sleeping. The signal is sent unconditionally so
>>>>>>> there's no need to worry about races between actually sleeping and the
>>>>>>> "we think we're sleeping" state. It may lead to an extra wakeup but
>>>>>>> that's better than missing it entirely.
>>>>>> Thanks a bunch for the comment! So the trick I was using here is to
>>>>>> modify the timespec from the kick function before sending the IPI
>>>>>> signal. That way, we know that either we are inside the sleep (where the
>>>>>> signal wakes it up) or we are outside the sleep (where timespec={} will
>>>>>> make it return immediately).
>>>>>>
>>>>>> The only race I can think of is if nanosleep does calculations based on
>>>>>> the timespec and we happen to send the signal right there and then.
>>>>> Yes that's the race I was thinking of. Admittedly it's a small window
>>>>> but it's theoretically possible and part of the reason why pselect was
>>>>> created.
>>>>>
>>>>>> The problem with blocking IPIs is basically what Frank was describing
>>>>>> earlier: How do you unset the IPI signal pending status? If the signal
>>>>>> is never delivered, how can pselect differentiate "signal from last time
>>>>>> is still pending" from "new signal because I got an IPI"?
>>>>> In this case we would take the additional wakeup which should be
>>>>> harmless since we will take the WFx exit again and put us in the
>>>>> correct state. But that's a lot better than busy looping.
>>>>
>>>> I'm not sure I follow. I'm thinking of the following scenario:
>>>>
>>>>     - trap into WFI handler
>>>>     - go to sleep with blocked SIG_IPI
>>>>     - SIG_IPI arrives, pselect() exits
>>>>     - signal is still pending because it's blocked
>>>>     - enter guest
>>>>     - trap into WFI handler
>>>>     - run pselect(), but it immediately exits because SIG_IPI is still pending
>>>>
>>>> This was the loop I was seeing when running with SIG_IPI blocked. That's
>>>> part of the reason why I switched to a different model.
>>> What I observe is that when returning from a pending signal pselect
>>> consumes the signal (which is also consistent with my understanding of
>>> what pselect does). That means that it doesn't matter if we take a
>>> second WFx exit because once we reach the pselect in the second WFx
>>> exit the signal will have been consumed by the pselect in the first
>>> exit and we will just wait for the next one.
>>>
>> Aha! Thanks for the explanation. So, the first WFI in the series of
>> guest WFIs will likely wake up immediately? After a period without WFIs
>> there must be a pending SIG_IPI...
>>
>> It shouldn't be a critical issue though because (as defined in D1.16.2)
>> "the architecture permits a PE to leave the low-power state for any
>> reason, it is permissible for a PE to treat WFI as a NOP, but this is
>> not recommended for lowest power operation."
>>
>> BTW. I think a bit from the thread should go into the description of
>> patch 8, because it's not trivial and it would really be helpful to keep
>> in repo history. At least something like this (taken from an earlier
>> reply in the thread):
>>
>>    In this implementation IPI is blocked on the CPU thread at startup and
>>    pselect() is used to atomically unblock the signal and begin sleeping.
>>    The signal is sent unconditionally so there's no need to worry about
>>    races between actually sleeping and the "we think we're sleeping"
>>    state. It may lead to an extra wakeup but that's better than missing
>>    it entirely.
> Okay, I'll add something like that to the next version of the patch I send out.


If this is the only change, I've already added it for v4. If you want me 
to change it further, just let me know what to replace the patch 
description with.


Alex



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/8] hvf: Move common code out
  2020-12-03 22:13                             ` Alexander Graf
@ 2020-12-03 23:04                               ` Roman Bolshakov
  0 siblings, 0 replies; 64+ messages in thread
From: Roman Bolshakov @ 2020-12-03 23:04 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Eduardo Habkost, Richard Henderson, qemu-devel,
	Cameron Esfahani, qemu-arm, Claudio Fontana, Frank Yang,
	Paolo Bonzini, Peter Collingbourne

On Thu, Dec 03, 2020 at 11:13:35PM +0100, Alexander Graf wrote:
> 
> On 03.12.20 19:42, Peter Collingbourne wrote:
> > On Thu, Dec 3, 2020 at 1:41 AM Roman Bolshakov <r.bolshakov@yadro.com> wrote:
> > > On Mon, Nov 30, 2020 at 04:00:11PM -0800, Peter Collingbourne wrote:
> > > > What I observe is that when returning from a pending signal pselect
> > > > consumes the signal (which is also consistent with my understanding of
> > > > what pselect does). That means that it doesn't matter if we take a
> > > > second WFx exit because once we reach the pselect in the second WFx
> > > > exit the signal will have been consumed by the pselect in the first
> > > > exit and we will just wait for the next one.
> > > > 
> > > Aha! Thanks for the explanation. So, the first WFI in the series of
> > > guest WFIs will likely wake up immediately? After a period without WFIs
> > > there must be a pending SIG_IPI...
> > > 
> > > It shouldn't be a critical issue though because (as defined in D1.16.2)
> > > "the architecture permits a PE to leave the low-power state for any
> > > reason, it is permissible for a PE to treat WFI as a NOP, but this is
> > > not recommended for lowest power operation."
> > > 
> > > BTW. I think a bit from the thread should go into the description of
> > > patch 8, because it's not trivial and it would really be helpful to keep
> > > in repo history. At least something like this (taken from an earlier
> > > reply in the thread):
> > > 
> > >    In this implementation IPI is blocked on the CPU thread at startup and
> > >    pselect() is used to atomically unblock the signal and begin sleeping.
> > >    The signal is sent unconditionally so there's no need to worry about
> > >    races between actually sleeping and the "we think we're sleeping"
> > >    state. It may lead to an extra wakeup but that's better than missing
> > >    it entirely.
> > Okay, I'll add something like that to the next version of the patch I send out.
> 
> 
> If this is the only change, I've already added it for v4. If you want me to
> change it further, just let me know what to replace the patch description
> with.
> 
> 

Thanks, Alex.

I'm fine with the description and all set.

-Roman


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2020-12-03 23:06 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-26 21:50 [PATCH 0/8] hvf: Implement Apple Silicon Support Alexander Graf
2020-11-26 21:50 ` [PATCH 1/8] hvf: Add hypervisor entitlement to output binaries Alexander Graf
2020-11-27  4:54   ` Paolo Bonzini
2020-11-27 19:44   ` Roman Bolshakov
2020-11-27 21:17     ` Paolo Bonzini
2020-11-27 21:51     ` Alexander Graf
2020-11-26 21:50 ` [PATCH 2/8] hvf: Move common code out Alexander Graf
2020-11-27 20:00   ` Roman Bolshakov
2020-11-27 21:55     ` Alexander Graf
2020-11-27 23:30       ` Frank Yang
2020-11-30 20:15         ` Frank Yang
2020-11-30 20:33           ` Alexander Graf
2020-11-30 20:55             ` Frank Yang
2020-11-30 21:08               ` Peter Collingbourne
2020-11-30 21:40                 ` Alexander Graf
2020-11-30 23:01                   ` Peter Collingbourne
2020-11-30 23:18                     ` Alexander Graf
2020-12-01  0:00                       ` Peter Collingbourne
2020-12-01  0:13                         ` Alexander Graf
2020-12-01  8:21                           ` [PATCH] arm/hvf: Optimize and simplify WFI handling Peter Collingbourne via
2020-12-01 11:16                             ` Alexander Graf
2020-12-01 18:59                               ` Peter Collingbourne
2020-12-01 22:03                                 ` Alexander Graf
2020-12-02  1:19                                   ` Peter Collingbourne
2020-12-02  1:53                                     ` Alexander Graf
2020-12-02  4:44                                       ` Peter Collingbourne
2020-12-03 10:12                                 ` Roman Bolshakov
2020-12-03 18:30                                   ` Peter Collingbourne
2020-12-01 16:26                             ` Alexander Graf
2020-12-01 20:03                               ` Peter Collingbourne
2020-12-01 22:09                                 ` Alexander Graf
2020-12-01 23:13                                   ` Alexander Graf
2020-12-02  0:52                                   ` Peter Collingbourne
2020-12-03  9:41                         ` [PATCH 2/8] hvf: Move common code out Roman Bolshakov
2020-12-03 18:42                           ` Peter Collingbourne
2020-12-03 22:13                             ` Alexander Graf
2020-12-03 23:04                               ` Roman Bolshakov
2020-12-01  0:37                   ` Roman Bolshakov
2020-11-30 22:10               ` Peter Maydell
2020-12-01  2:49                 ` Frank Yang
2020-11-30 22:46               ` Peter Collingbourne
2020-11-26 21:50 ` [PATCH 3/8] arm: Set PSCI to 0.2 for HVF Alexander Graf
2020-11-26 21:50 ` [PATCH 4/8] arm: Synchronize CPU on PSCI on Alexander Graf
2020-11-26 21:50 ` [PATCH 5/8] hvf: Add Apple Silicon support Alexander Graf
2020-11-26 21:50 ` [PATCH 6/8] hvf: Use OS provided vcpu kick function Alexander Graf
2020-11-26 22:18   ` Eduardo Habkost
2020-11-30  2:42     ` Alexander Graf
2020-11-30  7:45       ` Claudio Fontana
2020-11-26 21:50 ` [PATCH 7/8] arm: Add Hypervisor.framework build target Alexander Graf
2020-11-27  4:59   ` Paolo Bonzini
2020-11-26 21:50 ` [PATCH 8/8] hw/arm/virt: Disable highmem when on hypervisor.framework Alexander Graf
2020-11-26 22:14   ` Eduardo Habkost
2020-11-26 22:29     ` Peter Maydell
2020-11-27 16:26       ` Eduardo Habkost
2020-11-27 16:38         ` Peter Maydell
2020-11-27 16:47           ` Eduardo Habkost
2020-11-27 16:53             ` Peter Maydell
2020-11-27 17:17               ` Eduardo Habkost
2020-11-27 18:16                 ` Peter Maydell
2020-11-27 18:20                   ` Eduardo Habkost
2020-11-27 16:47           ` Peter Maydell
2020-11-30  2:40             ` Alexander Graf
2020-11-26 22:10 ` [PATCH 0/8] hvf: Implement Apple Silicon Support Eduardo Habkost
2020-11-27 17:48   ` Philippe Mathieu-Daudé
