* [PATCH][RESEND v3 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
@ 2023-02-24 21:41 Maciej S. Szmigiero
  2023-02-24 21:41 ` [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols Maciej S. Szmigiero
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-24 21:41 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This is a rebase/resend of v2 patch series located here:
https://lore.kernel.org/qemu-devel/cover.1672878904.git.maciej.szmigiero@oracle.com/

The only changes from v2 are fixes for some conflicts around build files
(and re-testing).

 Kconfig.host                     |    3 +
 configure                        |   36 +
 hw/hyperv/Kconfig                |    5 +
 hw/hyperv/hv-balloon.c           | 2185 ++++++++++++++++++++++++++++++
 hw/hyperv/meson.build            |    1 +
 hw/hyperv/trace-events           |   16 +
 hw/i386/Kconfig                  |    2 +
 hw/i386/pc.c                     |    4 +-
 hw/mem/Kconfig                   |    4 +
 hw/mem/hapvdimm.c                |  221 +++
 hw/mem/meson.build               |    1 +
 include/hw/hyperv/dynmem-proto.h |  423 ++++++
 include/hw/mem/hapvdimm.h        |   27 +
 meson.build                      |    4 +-
 qapi/machine.json                |   68 +
 15 files changed, 2998 insertions(+), 2 deletions(-)
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 hw/mem/hapvdimm.c
 create mode 100644 include/hw/hyperv/dynmem-proto.h
 create mode 100644 include/hw/mem/hapvdimm.h




* [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-24 21:41 [PATCH][RESEND v3 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
@ 2023-02-24 21:41 ` Maciej S. Szmigiero
  2023-02-27 15:25   ` David Hildenbrand
  2023-02-24 21:41 ` [PATCH][RESEND v3 2/3] Add Hyper-V Dynamic Memory Protocol definitions Maciej S. Szmigiero
  2023-02-24 21:41 ` [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2 siblings, 1 reply; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-24 21:41 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This device works like a virtual DIMM stick: it allows inserting extra RAM
into the guest at run time and later removing it without having to
duplicate all of the address space management logic of TYPE_MEMORY_DEVICE
in each memory hot-add protocol driver.

This device is not meant to be instantiated or removed by the QEMU user
directly: rather, the protocol driver is supposed to add and remove it as
required.

In fact, its very existence is supposed to be an implementation detail,
transparent to the QEMU user.

To prevent the user from accidentally creating an instance of this device
by hand, the protocol driver is supposed to place the qdev_device_add*()
call (that it uses to add this device) between hapvdimm_allow_adding() and
hapvdimm_disallow_adding() calls, in order to temporarily authorize the
operation.
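
As an illustration, a minimal sketch of this pattern (hypothetical code,
not part of this patch; it assumes the protocol driver builds a QDict of
device properties and uses the qdev_device_add_from_qdict() variant):

    /* hypothetical helper inside a memory hot-add protocol driver */
    static DeviceState *protocol_add_hapvdimm(const QDict *qdict, Error **errp)
    {
        DeviceState *dev;

        /* temporarily authorize creating a TYPE_HAPVDIMM instance */
        hapvdimm_allow_adding();
        dev = qdev_device_add_from_qdict(qdict, false, errp);
        hapvdimm_disallow_adding();

        return dev;
    }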

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/i386/Kconfig           |   2 +
 hw/i386/pc.c              |   4 +-
 hw/mem/Kconfig            |   4 +
 hw/mem/hapvdimm.c         | 221 ++++++++++++++++++++++++++++++++++++++
 hw/mem/meson.build        |   1 +
 include/hw/mem/hapvdimm.h |  27 +++++
 6 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 hw/mem/hapvdimm.c
 create mode 100644 include/hw/mem/hapvdimm.h

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index 9fbfe748b5..13f70707ed 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -68,6 +68,7 @@ config I440FX
     imply E1000_PCI
     imply VMPORT
     imply VMMOUSE
+    imply HAPVDIMM
     select ACPI_PIIX4
     select PC_PCI
     select PC_ACPI
@@ -95,6 +96,7 @@ config Q35
     imply E1000E_PCI_EXPRESS
     imply VMPORT
     imply VMMOUSE
+    imply HAPVDIMM
     select PC_PCI
     select PC_ACPI
     select PCI_EXPRESS_Q35
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a7a2ededf9..5469d89bcc 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -73,6 +73,7 @@
 #include "hw/acpi/acpi.h"
 #include "hw/acpi/cpu_hotplug.h"
 #include "acpi-build.h"
+#include "hw/mem/hapvdimm.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/mem/nvdimm.h"
 #include "hw/cxl/cxl.h"
@@ -1609,7 +1610,8 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI) ||
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_IOMMU_PCI) ||
-        object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE)) {
+        object_dynamic_cast(OBJECT(dev), TYPE_X86_IOMMU_DEVICE) ||
+        object_dynamic_cast(OBJECT(dev), TYPE_HAPVDIMM)) {
         return HOTPLUG_HANDLER(machine);
     }
 
diff --git a/hw/mem/Kconfig b/hw/mem/Kconfig
index 73c5ae8ad9..d8c1feafed 100644
--- a/hw/mem/Kconfig
+++ b/hw/mem/Kconfig
@@ -16,3 +16,7 @@ config CXL_MEM_DEVICE
     bool
     default y if CXL
     select MEM_DEVICE
+
+config HAPVDIMM
+    bool
+    select MEM_DEVICE
diff --git a/hw/mem/hapvdimm.c b/hw/mem/hapvdimm.c
new file mode 100644
index 0000000000..9ae82edb2c
--- /dev/null
+++ b/hw/mem/hapvdimm.c
@@ -0,0 +1,221 @@
+/*
+ * A memory hot-add protocol vDIMM device
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * Heavily based on pc-dimm.c:
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2014 Red Hat Inc
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "exec/memory.h"
+#include "hw/boards.h"
+#include "hw/mem/hapvdimm.h"
+#include "hw/mem/memory-device.h"
+#include "hw/qdev-core.h"
+#include "hw/qdev-properties.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "qemu/module.h"
+#include "sysemu/hostmem.h"
+#include "trace.h"
+
+typedef struct HAPVDIMMDevice {
+    /* private */
+    DeviceState parent_obj;
+
+    /* public */
+    bool ever_realized;
+    uint64_t addr;
+    uint64_t align;
+    uint32_t node;
+    HostMemoryBackend *hostmem;
+} HAPVDIMMDevice;
+
+typedef struct HAPVDIMMDeviceClass {
+    /* private */
+    DeviceClass parent_class;
+} HAPVDIMMDeviceClass;
+
+static bool hapvdimm_adding_allowed;
+static Property hapvdimm_properties[] = {
+    DEFINE_PROP_UINT64(HAPVDIMM_ADDR_PROP, HAPVDIMMDevice, addr, 0),
+    DEFINE_PROP_UINT64(HAPVDIMM_ALIGN_PROP, HAPVDIMMDevice, align, 0),
+    DEFINE_PROP_LINK(HAPVDIMM_MEMDEV_PROP, HAPVDIMMDevice, hostmem,
+                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+void hapvdimm_allow_adding(void)
+{
+    hapvdimm_adding_allowed = true;
+}
+
+void hapvdimm_disallow_adding(void)
+{
+    hapvdimm_adding_allowed = false;
+}
+
+static void hapvdimm_get_size(Object *obj, Visitor *v, const char *name,
+                            void *opaque, Error **errp)
+{
+    ERRP_GUARD();
+    uint64_t value;
+
+    value = memory_device_get_region_size(MEMORY_DEVICE(obj), errp);
+    if (*errp) {
+        return;
+    }
+
+    visit_type_uint64(v, name, &value, errp);
+}
+
+static void hapvdimm_init(Object *obj)
+{
+    object_property_add(obj, HAPVDIMM_SIZE_PROP, "uint64", hapvdimm_get_size,
+                        NULL, NULL, NULL);
+}
+
+static void hapvdimm_realize(DeviceState *dev, Error **errp)
+{
+    ERRP_GUARD();
+    HAPVDIMMDevice *hapvdimm = HAPVDIMM(dev);
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    if (!hapvdimm->ever_realized) {
+        if (!hapvdimm_adding_allowed) {
+            error_setg(errp, "direct adding not allowed");
+            return;
+        }
+
+        hapvdimm->ever_realized = true;
+    }
+
+    memory_device_pre_plug(MEMORY_DEVICE(hapvdimm), ms,
+                           hapvdimm->align ? &hapvdimm->align : NULL,
+                           errp);
+    if (*errp) {
+        return;
+    }
+
+    if (!hapvdimm->hostmem) {
+        error_setg(errp, "'" HAPVDIMM_MEMDEV_PROP "' property is not set");
+        return;
+    } else if (host_memory_backend_is_mapped(hapvdimm->hostmem)) {
+        const char *path;
+
+        path = object_get_canonical_path_component(OBJECT(hapvdimm->hostmem));
+        error_setg(errp, "can't use already busy memdev: %s", path);
+        return;
+    }
+
+    host_memory_backend_set_mapped(hapvdimm->hostmem, true);
+
+    memory_device_plug(MEMORY_DEVICE(hapvdimm), ms);
+    vmstate_register_ram(host_memory_backend_get_memory(hapvdimm->hostmem),
+                         dev);
+}
+
+static void hapvdimm_unrealize(DeviceState *dev)
+{
+    HAPVDIMMDevice *hapvdimm = HAPVDIMM(dev);
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    memory_device_unplug(MEMORY_DEVICE(hapvdimm), ms);
+    vmstate_unregister_ram(host_memory_backend_get_memory(hapvdimm->hostmem),
+                           dev);
+
+    host_memory_backend_set_mapped(hapvdimm->hostmem, false);
+}
+
+static uint64_t hapvdimm_md_get_addr(const MemoryDeviceState *md)
+{
+    return object_property_get_uint(OBJECT(md), HAPVDIMM_ADDR_PROP,
+                                    &error_abort);
+}
+
+static void hapvdimm_md_set_addr(MemoryDeviceState *md, uint64_t addr,
+                               Error **errp)
+{
+    object_property_set_uint(OBJECT(md), HAPVDIMM_ADDR_PROP, addr, errp);
+}
+
+static MemoryRegion *hapvdimm_md_get_memory_region(MemoryDeviceState *md,
+                                                 Error **errp)
+{
+    HAPVDIMMDevice *hapvdimm = HAPVDIMM(md);
+
+    if (!hapvdimm->hostmem) {
+        error_setg(errp, "'" HAPVDIMM_MEMDEV_PROP "' property must be set");
+        return NULL;
+    }
+
+    return host_memory_backend_get_memory(hapvdimm->hostmem);
+}
+
+static void hapvdimm_md_fill_device_info(const MemoryDeviceState *md,
+                                       MemoryDeviceInfo *info)
+{
+    PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+    const DeviceClass *dc = DEVICE_GET_CLASS(md);
+    const HAPVDIMMDevice *hapvdimm = HAPVDIMM(md);
+    const DeviceState *dev = DEVICE(md);
+
+    if (dev->id) {
+        di->id = g_strdup(dev->id);
+    }
+    di->hotplugged = dev->hotplugged;
+    di->hotpluggable = dc->hotpluggable;
+    di->addr = hapvdimm->addr;
+    di->slot = -1;
+    di->node = 0; /* FIXME: report proper node */
+    di->size = object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_SIZE_PROP,
+                                        NULL);
+    di->memdev = object_get_canonical_path(OBJECT(hapvdimm->hostmem));
+
+    info->u.dimm.data = di;
+    info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+}
+
+static void hapvdimm_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);
+
+    dc->realize = hapvdimm_realize;
+    dc->unrealize = hapvdimm_unrealize;
+    device_class_set_props(dc, hapvdimm_properties);
+    dc->desc = "vDIMM for a hot add protocol";
+
+    mdc->get_addr = hapvdimm_md_get_addr;
+    mdc->set_addr = hapvdimm_md_set_addr;
+    mdc->get_plugged_size = memory_device_get_region_size;
+    mdc->get_memory_region = hapvdimm_md_get_memory_region;
+    mdc->fill_device_info = hapvdimm_md_fill_device_info;
+}
+
+static const TypeInfo hapvdimm_info = {
+    .name          = TYPE_HAPVDIMM,
+    .parent        = TYPE_DEVICE,
+    .instance_size = sizeof(HAPVDIMMDevice),
+    .instance_init = hapvdimm_init,
+    .class_init    = hapvdimm_class_init,
+    .class_size    = sizeof(HAPVDIMMDeviceClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_MEMORY_DEVICE },
+        { }
+    },
+};
+
+static void hapvdimm_register_types(void)
+{
+    type_register_static(&hapvdimm_info);
+}
+
+type_init(hapvdimm_register_types)
diff --git a/hw/mem/meson.build b/hw/mem/meson.build
index 609b2b36fc..5f7a0181d3 100644
--- a/hw/mem/meson.build
+++ b/hw/mem/meson.build
@@ -4,6 +4,7 @@ mem_ss.add(when: 'CONFIG_DIMM', if_true: files('pc-dimm.c'))
 mem_ss.add(when: 'CONFIG_NPCM7XX', if_true: files('npcm7xx_mc.c'))
 mem_ss.add(when: 'CONFIG_NVDIMM', if_true: files('nvdimm.c'))
 mem_ss.add(when: 'CONFIG_CXL_MEM_DEVICE', if_true: files('cxl_type3.c'))
+mem_ss.add(when: 'CONFIG_HAPVDIMM', if_true: files('hapvdimm.c'))
 
 softmmu_ss.add_all(when: 'CONFIG_MEM_DEVICE', if_true: mem_ss)
 
diff --git a/include/hw/mem/hapvdimm.h b/include/hw/mem/hapvdimm.h
new file mode 100644
index 0000000000..bb9a135a52
--- /dev/null
+++ b/include/hw/mem/hapvdimm.h
@@ -0,0 +1,27 @@
+/*
+ * A memory hot-add protocol vDIMM device
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_HAPVDIMM_H
+#define QEMU_HAPVDIMM_H
+
+#include "qom/object.h"
+
+#define TYPE_HAPVDIMM "mem-hapvdimm"
+OBJECT_DECLARE_SIMPLE_TYPE(HAPVDIMMDevice, HAPVDIMM)
+
+#define HAPVDIMM_ADDR_PROP "addr"
+#define HAPVDIMM_ALIGN_PROP "align"
+#define HAPVDIMM_SIZE_PROP "size"
+#define HAPVDIMM_MEMDEV_PROP "memdev"
+
+void hapvdimm_allow_adding(void);
+void hapvdimm_disallow_adding(void);
+
+#endif



* [PATCH][RESEND v3 2/3] Add Hyper-V Dynamic Memory Protocol definitions
  2023-02-24 21:41 [PATCH][RESEND v3 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2023-02-24 21:41 ` [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols Maciej S. Szmigiero
@ 2023-02-24 21:41 ` Maciej S. Szmigiero
  2023-02-24 21:41 ` [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2 siblings, 0 replies; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-24 21:41 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This commit adds the Hyper-V Dynamic Memory Protocol definitions, taken
from the hv_balloon Linux kernel driver and adapted to the QEMU coding
style and definitions.
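
For reference, these definitions pack the major version into the high word
and the minor version into the low word of a 32-bit value; a quick
illustration (not part of the patch itself):

    uint32_t ver = DYNMEM_MAKE_VERSION(2, 0); /* == 0x00020000 */

    DYNMEM_MAJOR_VERSION(ver); /* 2 - taken from the high word */
    DYNMEM_MINOR_VERSION(ver); /* 0 - note: masks only the low byte */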

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/hw/hyperv/dynmem-proto.h | 423 +++++++++++++++++++++++++++++++
 1 file changed, 423 insertions(+)
 create mode 100644 include/hw/hyperv/dynmem-proto.h

diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h
new file mode 100644
index 0000000000..d0f9090ac4
--- /dev/null
+++ b/include/hw/hyperv/dynmem-proto.h
@@ -0,0 +1,423 @@
+#ifndef HW_HYPERV_DYNMEM_PROTO_H
+#define HW_HYPERV_DYNMEM_PROTO_H
+
+/*
+ * Hyper-V Dynamic Memory Protocol definitions
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * Based on drivers/hv/hv_balloon.c from Linux kernel:
+ * Copyright (c) 2012, Microsoft Corporation.
+ *
+ * Author: K. Y. Srinivasan <kys@microsoft.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*
+ * Protocol versions. The low word is the minor version, the high word is
+ * the major version.
+ *
+ * History:
+ * Initial version 1.0
+ * Changed to 0.1 on 2009/03/25
+ * Changed to 0.2 on 2009/05/14
+ * Changed to 0.3 on 2009/12/03
+ * Changed to 1.0 on 2011/04/05
+ * Changed to 2.0 on 2019/12/10
+ */
+
+#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | (Minor)))
+#define DYNMEM_MAJOR_VERSION(Version) ((uint32_t)(Version) >> 16)
+#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff)
+
+enum {
+    DYNMEM_PROTOCOL_VERSION_1 = DYNMEM_MAKE_VERSION(0, 3),
+    DYNMEM_PROTOCOL_VERSION_2 = DYNMEM_MAKE_VERSION(1, 0),
+    DYNMEM_PROTOCOL_VERSION_3 = DYNMEM_MAKE_VERSION(2, 0),
+
+    DYNMEM_PROTOCOL_VERSION_WIN7 = DYNMEM_PROTOCOL_VERSION_1,
+    DYNMEM_PROTOCOL_VERSION_WIN8 = DYNMEM_PROTOCOL_VERSION_2,
+    DYNMEM_PROTOCOL_VERSION_WIN10 = DYNMEM_PROTOCOL_VERSION_3,
+
+    DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10
+};
+
+
+
+/*
+ * Message Types
+ */
+
+enum dm_message_type {
+    /*
+     * Version 0.3
+     */
+    DM_ERROR = 0,
+    DM_VERSION_REQUEST = 1,
+    DM_VERSION_RESPONSE = 2,
+    DM_CAPABILITIES_REPORT = 3,
+    DM_CAPABILITIES_RESPONSE = 4,
+    DM_STATUS_REPORT = 5,
+    DM_BALLOON_REQUEST = 6,
+    DM_BALLOON_RESPONSE = 7,
+    DM_UNBALLOON_REQUEST = 8,
+    DM_UNBALLOON_RESPONSE = 9,
+    DM_MEM_HOT_ADD_REQUEST = 10,
+    DM_MEM_HOT_ADD_RESPONSE = 11,
+    DM_VERSION_03_MAX = 11,
+    /*
+     * Version 1.0.
+     */
+    DM_INFO_MESSAGE = 12,
+    DM_VERSION_1_MAX = 12,
+
+    /*
+     * Version 2.0
+     */
+    DM_MEM_HOT_REMOVE_REQUEST = 13,
+    DM_MEM_HOT_REMOVE_RESPONSE = 14
+};
+
+
+/*
+ * Structures defining the dynamic memory management
+ * protocol.
+ */
+
+union dm_version {
+    struct {
+        uint16_t minor_version;
+        uint16_t major_version;
+    };
+    uint32_t version;
+} QEMU_PACKED;
+
+
+union dm_caps {
+    struct {
+        uint64_t balloon:1;
+        uint64_t hot_add:1;
+        /*
+         * To support guests that may have alignment
+         * limitations on hot-add, the guest can specify
+         * its alignment requirements; a value of n
+         * represents an alignment of 2^n in megabytes.
+         */
+        uint64_t hot_add_alignment:4;
+        uint64_t hot_remove:1;
+        uint64_t reservedz:57;
+    } cap_bits;
+    uint64_t caps;
+} QEMU_PACKED;
+
+union dm_mem_page_range {
+    struct  {
+        /*
+         * The PFN of the first page in the range.
+         * 40 bits is the architectural limit of a PFN
+         * number for AMD64.
+         */
+        uint64_t start_page:40;
+        /*
+         * The number of pages in the range.
+         */
+        uint64_t page_cnt:24;
+    } finfo;
+    uint64_t  page_range;
+} QEMU_PACKED;
+
+
+
+/*
+ * The header for all dynamic memory messages:
+ *
+ * type: Type of the message.
+ * size: Size of the message in bytes, including the header.
+ * trans_id: The guest is responsible for manufacturing this ID.
+ */
+
+struct dm_header {
+    uint16_t type;
+    uint16_t size;
+    uint32_t trans_id;
+} QEMU_PACKED;
+
+/*
+ * A generic message format for dynamic memory.
+ * Specific message formats are defined later in the file.
+ */
+
+struct dm_message {
+    struct dm_header hdr;
+    uint8_t data[]; /* enclosed message */
+} QEMU_PACKED;
+
+
+/*
+ * Specific message types supporting the dynamic memory protocol.
+ */
+
+/*
+ * Version negotiation message. Sent from the guest to the host.
+ * The guest is free to try different versions until the host
+ * accepts the version.
+ *
+ * dm_version: The protocol version requested.
+ * is_last_attempt: If TRUE, this is the last version the guest will request.
+ * reservedz: Reserved field, set to zero.
+ */
+
+struct dm_version_request {
+    struct dm_header hdr;
+    union dm_version version;
+    uint32_t is_last_attempt:1;
+    uint32_t reservedz:31;
+} QEMU_PACKED;
+
+/*
+ * Version response message; sent from the host to the guest, indicating
+ * whether the host has accepted the version sent by the guest.
+ *
+ * is_accepted: If TRUE, host has accepted the version and the guest
+ * should proceed to the next stage of the protocol. FALSE indicates that
+ * the guest should retry with a different version.
+ *
+ * reservedz: Reserved field, set to zero.
+ */
+
+struct dm_version_response {
+    struct dm_header hdr;
+    uint64_t is_accepted:1;
+    uint64_t reservedz:63;
+} QEMU_PACKED;
+
+/*
+ * Message reporting capabilities. This is sent from the guest to the
+ * host.
+ */
+
+struct dm_capabilities {
+    struct dm_header hdr;
+    union dm_caps caps;
+    uint64_t min_page_cnt;
+    uint64_t max_page_number;
+} QEMU_PACKED;
+
+/*
+ * Response to the capabilities message. This is sent from the host to the
+ * guest. This message notifies if the host has accepted the guest's
+ * capabilities. If the host has not accepted, the guest must shut down
+ * the service.
+ *
+ * is_accepted: Indicates whether the host has accepted the guest's capabilities.
+ * reservedz: Must be 0.
+ */
+
+struct dm_capabilities_resp_msg {
+    struct dm_header hdr;
+    uint64_t is_accepted:1;
+    uint64_t hot_remove:1;
+    uint64_t suppress_pressure_reports:1;
+    uint64_t reservedz:61;
+} QEMU_PACKED;
+
+/*
+ * This message is used to report memory pressure from the guest.
+ * This message is not part of any transaction and there is no
+ * response to this message.
+ *
+ * num_avail: Available memory in pages.
+ * num_committed: Committed memory in pages.
+ * page_file_size: The accumulated size of all page files
+ *                 in the system in pages.
+ * zero_free: The number of zero and free pages.
+ * page_file_writes: The writes to the page file in pages.
+ * io_diff: An indicator of file cache efficiency or page file activity,
+ *          calculated as File Cache Page Fault Count - Page Read Count.
+ *          This value is in pages.
+ *
+ * Some of these metrics are Windows specific and fortunately
+ * the algorithm on the host side that computes the guest memory
+ * pressure only uses num_committed value.
+ */
+
+struct dm_status {
+    struct dm_header hdr;
+    uint64_t num_avail;
+    uint64_t num_committed;
+    uint64_t page_file_size;
+    uint64_t zero_free;
+    uint32_t page_file_writes;
+    uint32_t io_diff;
+} QEMU_PACKED;
+
+
+/*
+ * Message to ask the guest to allocate memory - balloon up message.
+ * This message is sent from the host to the guest. The guest may not be
+ * able to allocate as much memory as requested.
+ *
+ * num_pages: number of pages to allocate.
+ */
+
+struct dm_balloon {
+    struct dm_header hdr;
+    uint32_t num_pages;
+    uint32_t reservedz;
+} QEMU_PACKED;
+
+
+/*
+ * Balloon response message; this message is sent from the guest
+ * to the host in response to the balloon message.
+ *
+ * reservedz: Reserved; must be set to zero.
+ * more_pages: If FALSE, this is the last message of the transaction;
+ * if TRUE, there will be at least one more message from the guest.
+ *
+ * range_count: The number of ranges in the range array.
+ *
+ * range_array: An array of page ranges returned to the host.
+ *
+ */
+
+struct dm_balloon_response {
+    struct dm_header hdr;
+    uint32_t reservedz;
+    uint32_t more_pages:1;
+    uint32_t range_count:31;
+    union dm_mem_page_range range_array[];
+} QEMU_PACKED;
+
+/*
+ * Un-balloon message; this message is sent from the host
+ * to the guest to give the guest more memory.
+ *
+ * more_pages: If FALSE, this is the last message of the transaction;
+ * if TRUE, there will be at least one more message from the guest.
+ *
+ * reservedz: Reserved; must be set to zero.
+ *
+ * range_count: The number of ranges in the range array.
+ *
+ * range_array: An array of page ranges returned to the host.
+ *
+ */
+
+struct dm_unballoon_request {
+    struct dm_header hdr;
+    uint32_t more_pages:1;
+    uint32_t reservedz:31;
+    uint32_t range_count;
+    union dm_mem_page_range range_array[];
+} QEMU_PACKED;
+
+/*
+ * Un-balloon response message; this message is sent from the guest
+ * to the host in response to an unballoon request.
+ *
+ */
+
+struct dm_unballoon_response {
+    struct dm_header hdr;
+} QEMU_PACKED;
+
+
+/*
+ * Hot add request message. Message sent from the host to the guest.
+ *
+ * mem_range: Memory range to hot add.
+ *
+ */
+
+struct dm_hot_add {
+    struct dm_header hdr;
+    union dm_mem_page_range range;
+} QEMU_PACKED;
+
+/*
+ * Hot add response message.
+ * This message is sent by the guest to report the status of a hot add request.
+ * If page_count is less than the requested page count, then the host should
+ * assume all further hot add requests will fail, since this indicates that
+ * the guest has hit an upper physical memory barrier.
+ *
+ * Hot adds may also fail due to low resources; in this case, the guest must
+ * not complete this message until the hot add can succeed, and the host must
+ * not send a new hot add request until the response is sent.
+ * If the VSC fails to hot add memory DYNMEM_NUMBER_OF_UNSUCCESSFUL_HOTADD_ATTEMPTS
+ * times, it fails the request.
+ *
+ *
+ * page_count: number of pages that were successfully hot added.
+ *
+ * result: result of the operation 1: success, 0: failure.
+ *
+ */
+
+struct dm_hot_add_response {
+    struct dm_header hdr;
+    uint32_t page_count;
+    uint32_t result;
+} QEMU_PACKED;
+
+struct dm_hot_remove {
+    struct dm_header hdr;
+    uint32_t virtual_node;
+    uint32_t page_count;
+    uint32_t qos_flags;
+    uint32_t reservedZ;
+} QEMU_PACKED;
+
+struct dm_hot_remove_response {
+    struct dm_header hdr;
+    uint32_t result;
+    uint32_t range_count;
+    uint64_t more_pages:1;
+    uint64_t reservedz:63;
+    union dm_mem_page_range range_array[];
+} QEMU_PACKED;
+
+#define DM_REMOVE_QOS_LARGE (1 << 0)
+#define DM_REMOVE_QOS_LOCAL (1 << 1)
+#define DM_REMOVE_QOS_MASK (0x3)
+
+/*
+ * Types of information sent from host to the guest.
+ */
+
+enum dm_info_type {
+    INFO_TYPE_MAX_PAGE_CNT = 0,
+    MAX_INFO_TYPE
+};
+
+
+/*
+ * Header for the information message.
+ */
+
+struct dm_info_header {
+    enum dm_info_type type;
+    uint32_t data_size;
+    uint8_t  data[];
+} QEMU_PACKED;
+
+/*
+ * This message is sent from the host to the guest to pass
+ * some relevant information (win8 addition).
+ *
+ * reserved: not used.
+ * info_size: size of the information blob.
+ * info: information blob.
+ */
+
+struct dm_info_msg {
+    struct dm_header hdr;
+    uint32_t reserved;
+    uint32_t info_size;
+    uint8_t  info[];
+};
+
+#endif



* [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2023-02-24 21:41 [PATCH][RESEND v3 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2023-02-24 21:41 ` [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols Maciej S. Szmigiero
  2023-02-24 21:41 ` [PATCH][RESEND v3 2/3] Add Hyper-V Dynamic Memory Protocol definitions Maciej S. Szmigiero
@ 2023-02-24 21:41 ` Maciej S. Szmigiero
  2023-02-28 16:18   ` Igor Mammedov
  2023-02-28 17:34   ` Daniel P. Berrangé
  2 siblings, 2 replies; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-24 21:41 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This driver is like virtio-balloon on steroids: it allows both changing the
guest memory allocation via ballooning and inserting extra RAM into the
guest by adding the required memory backends and providing them to the
driver.

One advantage of this approach over ACPI-based PC DIMM hotplug is that such
memory can be hotplugged with much finer granularity because the ACPI DIMM
slot limit does not apply.

Hot-adding additional memory is done by creating a new memory backend (for
example by executing HMP command
"object_add memory-backend-ram,id=mem1,size=4G"), then executing a new
"hv-balloon-add-memory" QMP command, providing the id of that memory
backend as the "id" parameter.
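
For illustration, the QMP exchange for the example above could look like
this (a sketch based on the description; the exact reply payload depends
on the QAPI schema added by this patch):

    -> { "execute": "hv-balloon-add-memory", "arguments": { "id": "mem1" } }
    <- { "return": {} }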

In contrast with ACPI DIMM hotplug, where one can only request to unplug a
whole DIMM stick, this driver allows removing memory from the guest in
single-page (4 KiB) units via ballooning.

After a VM reboot each previously hot-added memory backend gets released.
A "HV_BALLOON_MEMORY_BACKEND_UNUSED" QMP event is emitted in this case, so
the software controlling QEMU knows that it either needs to delete that
memory backend (if it is no longer needed) or re-insert it.
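
A sketch of how such an event could look on the QMP stream (illustrative
only; the exact member name is defined by the QAPI schema in this patch,
and the usual "timestamp" member is omitted here):

    <- { "event": "HV_BALLOON_MEMORY_BACKEND_UNUSED",
         "data": { "id": "mem1" } }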

In the future, the guest boot memory size might be changed on reboot
instead, taking into account the effective size that the VM had before that
reboot (much like Hyper-V does).

For performance reasons, the guest-released memory is tracked in a few
range trees, as a series of (start, count) ranges.
Each time a new page range is inserted into such a tree, its neighbors are
checked as candidates for possible merging with it.

Besides performance reasons, the Dynamic Memory protocol itself uses page
ranges as the data structure in its messages, so relevant pages need to be
merged into such ranges anyway.
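
As an illustration of the merge behavior, using this patch's helpers (a
hypothetical standalone snippet, with arbitrary page numbers):

    PageRangeTree tree = page_range_tree_new();

    page_range_tree_insert(tree, 100, 50, NULL); /* tree: (100, 50) */
    /* joinable with its left neighbor, so the two ranges get merged */
    page_range_tree_insert(tree, 150, 25, NULL); /* tree: (100, 75) */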

One has to be careful when tracking the guest-released pages, since the
guest can maliciously report returning pages outside its current address
space, which can later clash with the address range of newly added memory.
Similarly, the guest can report freeing the same page twice.

The above design results in much better ballooning performance than when
using virtio-balloon with the same guest: 230 GB / minute with this driver
versus 70 GB / minute with virtio-balloon.

During a ballooning operation most of the time is spent waiting for the
guest to come up with newly freed page ranges; processing the received
ranges on the host side (in QEMU and KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous:
thanks to the merging of the ballooned-out page ranges, 200 GB of memory
can be returned to the guest in about 1 second.
With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a
Xeon E5-2699, after dirtying the whole memory inside the guest before each
balloon operation.

Using a range tree instead of a bitmap to track the removed memory also
means that the solution scales well with the guest size: even a 1 TB range
takes just a few bytes of memory (for comparison, a bitmap tracking 1 TiB
at a 4 KiB page granularity would need 32 MiB).

Since the required GTree operations aren't present in every Glib version,
a check for them was added to the "configure" script, together with new
"--enable-hv-balloon" and "--disable-hv-balloon" arguments.
If these GTree operations are missing from the system's Glib version this
driver will be skipped during the QEMU build.
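
For example, a build with the driver explicitly enabled (so that a too-old
Glib becomes a hard "configure" error rather than a silent skip) could be
set up like this:

    ./configure --enable-hv-balloon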

An optional "status-report=on" device parameter requests memory status
events from the guest (typically sent every second), which allow the host
to learn both the guest memory available and the guest memory in use
counts.
They are emitted externally as "HV_BALLOON_STATUS_REPORT" QMP events.
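
For example, a sketch of a command line enabling these reports (assuming
VMBus support is provided by a vmbus-bridge device on the chosen machine
type):

    qemu-system-x86_64 ... \
        -device vmbus-bridge \
        -device hv-balloon,status-report=on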

The driver is named hv-balloon since the Linux kernel client driver for
the Dynamic Memory Protocol is named as such, and to follow the naming
pattern established by the virtio-balloon driver.
The whole protocol runs over Hyper-V VMBus.

The driver was tested against Windows Server 2012 R2, Windows Server 2016
and Windows Server 2019 guests and obeys the guest alignment requirements
reported to the host via the DM_CAPABILITIES_REPORT message.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 Kconfig.host           |    3 +
 configure              |   36 +
 hw/hyperv/Kconfig      |    5 +
 hw/hyperv/hv-balloon.c | 2185 ++++++++++++++++++++++++++++++++++++++++
 hw/hyperv/meson.build  |    1 +
 hw/hyperv/trace-events |   16 +
 meson.build            |    4 +-
 qapi/machine.json      |   68 ++
 8 files changed, 2317 insertions(+), 1 deletion(-)
 create mode 100644 hw/hyperv/hv-balloon.c

diff --git a/Kconfig.host b/Kconfig.host
index d763d89269..2ee71578f3 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -46,3 +46,6 @@ config FUZZ
 config VFIO_USER_SERVER_ALLOWED
     bool
     imply VFIO_USER_SERVER
+
+config HV_BALLOON_POSSIBLE
+    bool
diff --git a/configure b/configure
index cf6db3d551..b534955f58 100755
--- a/configure
+++ b/configure
@@ -283,6 +283,7 @@ bsd_user=""
 pie=""
 coroutine=""
 plugins="$default_feature"
+hv_balloon="$default_feature"
 meson=""
 ninja=""
 bindir="bin"
@@ -866,6 +867,10 @@ for opt do
   ;;
   --disable-vfio-user-server) vfio_user_server="disabled"
   ;;
+  --enable-hv-balloon) hv_balloon=yes
+  ;;
+  --disable-hv-balloon) hv_balloon=no
+  ;;
   # everything else has the same name in configure and meson
   --*) meson_option_parse "$opt" "$optarg"
   ;;
@@ -1019,6 +1024,7 @@ cat << EOF
   debug-info      debugging information
   safe-stack      SafeStack Stack Smash Protection. Depends on
                   clang/llvm and requires coroutine backend ucontext.
+  hv-balloon      hv-balloon driver where supported (requires Glib 2.68+ GTree API)
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -1740,6 +1746,32 @@ EOF
   fi
 fi
 
+##########################################
+# check for hv-balloon
+
+if test "$hv_balloon" != "no"; then
+	cat > $TMPC << EOF
+#include <string.h>
+#include <gmodule.h>
+int main(void) {
+    GTree *tree;
+
+    tree = g_tree_new((GCompareFunc)strcmp);
+    (void)g_tree_node_first(tree);
+    g_tree_destroy(tree);
+    return 0;
+}
+EOF
+	if compile_prog "$glib_cflags" "$glib_libs" ; then
+		hv_balloon=yes
+	else
+		if test "$hv_balloon" = "yes" ; then
+			feature_not_found "hv-balloon" "Update Glib"
+		fi
+		hv_balloon="no"
+	fi
+fi
+
 ##########################################
 # functions to probe cross compilers
 
@@ -2336,6 +2368,10 @@ if test "$have_tsan" = "yes" && test "$have_tsan_iface_fiber" = "yes" ; then
     echo "CONFIG_TSAN=y" >> $config_host_mak
 fi
 
+if test "$hv_balloon" = "yes" ; then
+  echo "CONFIG_HV_BALLOON_POSSIBLE=y" >> $config_host_mak
+fi
+
 if test "$plugins" = "yes" ; then
     echo "CONFIG_PLUGIN=y" >> $config_host_mak
 fi
diff --git a/hw/hyperv/Kconfig b/hw/hyperv/Kconfig
index fcf65903bd..8f8be1bcce 100644
--- a/hw/hyperv/Kconfig
+++ b/hw/hyperv/Kconfig
@@ -16,3 +16,8 @@ config SYNDBG
     bool
     default y
     depends on VMBUS
+
+config HV_BALLOON
+    bool
+    default y
+    depends on HV_BALLOON_POSSIBLE && VMBUS && HAPVDIMM
diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
new file mode 100644
index 0000000000..b11f005189
--- /dev/null
+++ b/hw/hyperv/hv-balloon.c
@@ -0,0 +1,2185 @@
+/*
+ * QEMU Hyper-V Dynamic Memory Protocol driver
+ *
+ * Copyright (C) 2020-2023 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "exec/address-spaces.h"
+#include "exec/cpu-common.h"
+#include "exec/memory.h"
+#include "exec/ramblock.h"
+#include "hw/boards.h"
+#include "hw/hyperv/dynmem-proto.h"
+#include "hw/hyperv/vmbus.h"
+#include "hw/mem/hapvdimm.h"
+#include "hw/mem/pc-dimm.h"
+#include "hw/qdev-core.h"
+#include "hw/qdev-properties.h"
+#include "monitor/qdev.h"
+#include "qapi/error.h"
+#include "qapi/qapi-commands-machine.h"
+#include "qapi/qapi-events-machine.h"
+#include "qapi/qmp/qdict.h"
+#include "qemu/error-report.h"
+#include "qemu/module.h"
+#include "qemu/units.h"
+#include "qemu/timer.h"
+#include "sysemu/balloon.h"
+#include "sysemu/reset.h"
+#include "trace.h"
+
+/*
+ * Temporarily suppress warnings about the enhanced GTree API usage
+ * requiring a too-recent Glib version, until GLIB_VERSION_MAX_ALLOWED
+ * finally reaches the Glib version that provides this API.
+ */
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+
+#define TYPE_HV_BALLOON "hv-balloon"
+#define HV_BALLOON_GUID "525074DC-8985-46e2-8057-A307DC18A502"
+#define HV_BALLOON_PFN_SHIFT 12
+#define HV_BALLOON_PAGE_SIZE (1 << HV_BALLOON_PFN_SHIFT)
+
+/*
+ * Some Windows versions (at least Server 2019) will crash with various
+ * error codes when receiving DM protocol requests (at least
+ * DM_MEM_HOT_ADD_REQUEST) immediately after boot.
+ *
+ * It looks like Hyper-V from Server 2016 uses a 50-second after-boot
+ * delay, probably to work around this issue, so we'll use this value, too.
+ */
+#define HV_BALLOON_POST_INIT_WAIT (50 * 1000)
+
+#define HV_BALLOON_HA_CHUNK_SIZE (2 * GiB)
+#define HV_BALLOON_HA_CHUNK_PAGES (HV_BALLOON_HA_CHUNK_SIZE / HV_BALLOON_PAGE_SIZE)
+
+/*
+ * 585728 is the maximum number of pages that Windows returns in one
+ * hot-remove response.
+ *
+ * If the requested page count is too high, Windows will no longer
+ * honor these requests.
+ */
+#define HV_BALLOON_HR_CHUNK_PAGES 585728
+
+typedef enum State {
+    /* not a real state */
+    S_NO_CHANGE = 0,
+
+    S_WAIT_RESET,
+    S_CLOSED,
+    S_VERSION,
+    S_CAPS,
+    S_POST_INIT_WAIT,
+    S_IDLE,
+    S_HOT_ADD_RB_WAIT,
+    S_HOT_ADD_POSTING,
+    S_HOT_ADD_REPLY_WAIT,
+    S_HOT_ADD_SKIP_CURRENT,
+    S_HOT_ADD_PROCESSED_CLEAR_PENDING,
+    S_HOT_ADD_PROCESSED_NEXT,
+    S_HOT_REMOVE,
+    S_BALLOON_POSTING,
+    S_BALLOON_RB_WAIT,
+    S_BALLOON_REPLY_WAIT,
+    S_UNBALLOON_POSTING,
+    S_UNBALLOON_RB_WAIT,
+    S_UNBALLOON_REPLY_WAIT,
+} State;
+
+typedef struct StateDesc {
+    State state;
+    const char *desc;
+} StateDesc;
+
+typedef struct PageRange {
+    uint64_t start;
+    uint64_t count;
+} PageRange;
+
+/* type safety */
+typedef struct PageRangeTree {
+    GTree *t;
+} PageRangeTree;
+
+typedef struct HAPVDIMMRange {
+    HAPVDIMMDevice *hapvdimm;
+
+    PageRange range;
+    uint64_t used;
+
+    /*
+     * Pages not currently usable due to guest alignment requirements or
+     * not hot-added in the first place
+     */
+    uint64_t unused_head, unused_tail;
+
+    /* Memory removed from the guest backed by this HAPVDIMM */
+    PageRangeTree removed_guest, removed_both;
+} HAPVDIMMRange;
+
+/* type safety */
+typedef struct HAPVDIMMRangeTree {
+    GTree *t;
+} HAPVDIMMRangeTree;
+
+typedef struct HvBalloon {
+    VMBusDevice parent;
+    State state;
+    bool status_reports;
+
+    union dm_version version;
+    union dm_caps caps;
+
+    QEMUTimer post_init_timer;
+    guint del_todo_process_timer;
+
+    unsigned int trans_id;
+
+    /* Guest target size */
+    uint64_t target;
+    bool target_changed;
+    uint64_t target_diff;
+
+    /*
+     * All HAPVDIMMs under control of this driver
+     * (but excluding the ones in hapvdimms_del_todo)
+     */
+    HAPVDIMMRangeTree hapvdimms;
+
+    /* Non-HAPVDIMM removed memory */
+    PageRangeTree removed_guest, removed_both;
+
+    /* Grand totals of removed memory (both HAPVDIMM and non-HAPVDIMM) */
+    uint64_t removed_guest_ctr, removed_both_ctr;
+
+    /* HAPVDIMMs waiting to be added during current connection */
+    GSList *ha_todo;
+    uint64_t ha_current_count;
+
+    /* HAPVDIMMs waiting to be deleted, not in any of the above structures */
+    GSList *hapvdimms_del_todo;
+} HvBalloon;
+
+#define HV_BALLOON(obj) OBJECT_CHECK(HvBalloon, (obj), TYPE_HV_BALLOON)
+
+#define HV_BALLOON_SET_STATE(hvb, news)             \
+    do {                                            \
+        assert(news != S_NO_CHANGE);                \
+        hv_balloon_state_set(hvb, news, # news);    \
+    } while (0)
+
+#define HV_BALLOON_STATE_DESC_SET(stdesc, news)         \
+    _hv_balloon_state_desc_set(stdesc, news, # news)
+
+#define HV_BALLOON_STATE_DESC_INIT \
+    {                              \
+        .state = S_NO_CHANGE,      \
+    }
+
+#define SUM_OVERFLOW_U64(in1, in2) ((in1) > UINT64_MAX - (in2))
+#define SUM_SATURATE_U64(in1, in2)              \
+    ({                                          \
+        uint64_t _in1 = (in1), _in2 = (in2);    \
+        uint64_t _result;                       \
+                                                \
+        if (!SUM_OVERFLOW_U64(_in1, _in2)) {    \
+            _result = _in1 + _in2;              \
+        } else {                                \
+            _result = UINT64_MAX;               \
+        }                                       \
+                                                \
+        _result;                                \
+    })
+
+typedef struct HvBalloonReq {
+    VMBusChanReq vmreq;
+} HvBalloonReq;
+
+/* PageRange */
+static void page_range_intersect(const PageRange *range,
+                                 uint64_t start, uint64_t count,
+                                 PageRange *out)
+{
+    uint64_t end1 = range->start + range->count;
+    uint64_t end2 = start + count;
+    uint64_t end = MIN(end1, end2);
+
+    out->start = MAX(range->start, start);
+    out->count = out->start < end ? end - out->start : 0;
+}
+
+static uint64_t page_range_intersection_size(const PageRange *range,
+                                             uint64_t start, uint64_t count)
+{
+    PageRange trange;
+
+    page_range_intersect(range, start, count, &trange);
+    return trange.count;
+}
+
+/* return just the part of range before (start) */
+static void page_range_part_before(const PageRange *range,
+                                   uint64_t start, PageRange *out)
+{
+    uint64_t endr = range->start + range->count;
+    uint64_t end = MIN(endr, start);
+
+    out->start = range->start;
+    if (end > out->start) {
+        out->count = end - out->start;
+    } else {
+        out->count = 0;
+    }
+}
+
+/* return just the part of range after (start, count) */
+static void page_range_part_after(const PageRange *range,
+                                  uint64_t start, uint64_t count,
+                                  PageRange *out)
+{
+    uint64_t end = range->start + range->count;
+    uint64_t ends = start + count;
+
+    out->start = MAX(range->start, ends);
+    if (end > out->start) {
+        out->count = end - out->start;
+    } else {
+        out->count = 0;
+    }
+}
+
+static bool page_range_joinable_left(const PageRange *range,
+                                     uint64_t start, uint64_t count)
+{
+    return start + count == range->start;
+}
+
+static bool page_range_joinable_right(const PageRange *range,
+                                      uint64_t start, uint64_t count)
+{
+    return range->start + range->count == start;
+}
+
+static bool page_range_joinable(const PageRange *range,
+                                uint64_t start, uint64_t count)
+{
+    return page_range_joinable_left(range, start, count) ||
+        page_range_joinable_right(range, start, count);
+}
+
+/* PageRangeTree */
+static gint page_range_tree_key_compare(gconstpointer leftp,
+                                        gconstpointer rightp,
+                                        gpointer user_data)
+{
+    const uint64_t *left = leftp, *right = rightp;
+
+    if (*left < *right) {
+        return -1;
+    } else if (*left > *right) {
+        return 1;
+    } else { /* *left == *right */
+        return 0;
+    }
+}
+
+static GTreeNode *page_range_tree_insert_new(PageRangeTree tree,
+                                             uint64_t start, uint64_t count)
+{
+    uint64_t *key = g_malloc(sizeof(*key));
+    PageRange *range = g_malloc(sizeof(*range));
+
+    assert(count > 0);
+
+    *key = range->start = start;
+    range->count = count;
+
+    return g_tree_insert_node(tree.t, key, range);
+}
+
+static void page_range_tree_insert(PageRangeTree tree,
+                                   uint64_t start, uint64_t count,
+                                   uint64_t *dupcount)
+{
+    GTreeNode *node;
+    bool joinable;
+    uint64_t intersection;
+    PageRange *range;
+
+    assert(!SUM_OVERFLOW_U64(start, count));
+    if (count == 0) {
+        return;
+    }
+
+    node = g_tree_upper_bound(tree.t, &start);
+    if (node) {
+        node = g_tree_node_previous(node);
+    } else {
+        node = g_tree_node_last(tree.t);
+    }
+
+    if (node) {
+        range = g_tree_node_value(node);
+        assert(range);
+        intersection = page_range_intersection_size(range, start, count);
+        joinable = page_range_joinable_right(range, start, count);
+    }
+
+    if (!node ||
+        (!intersection && !joinable)) {
+        /*
+         * !node case: the tree is empty or the very first node in the tree
+         * already has a higher key (the start of its range).
+         * the other case: there is a gap in the tree between the new range
+         * and the previous one.
+         * anyway, let's just insert the new range into the tree.
+         */
+        node = page_range_tree_insert_new(tree, start, count);
+        assert(node);
+        range = g_tree_node_value(node);
+        assert(range);
+    } else {
+        /*
+         * the previous range in the tree either partially covers the new
+         * range or ends just at its beginning - extend it
+         */
+        if (dupcount) {
+            *dupcount += intersection;
+        }
+
+        count += start - range->start;
+        range->count = MAX(range->count, count);
+    }
+
+    /* check next nodes for possible merging */
+    for (node = g_tree_node_next(node); node; ) {
+        PageRange *rangecur;
+
+        rangecur = g_tree_node_value(node);
+        assert(rangecur);
+
+        intersection = page_range_intersection_size(rangecur,
+                                                    range->start, range->count);
+        joinable = page_range_joinable_left(rangecur,
+                                            range->start, range->count);
+        if (!intersection && !joinable) {
+            /* the current node is disjoint */
+            break;
+        }
+
+        if (dupcount) {
+            *dupcount += intersection;
+        }
+
+        count = rangecur->count + (rangecur->start - range->start);
+        range->count = MAX(range->count, count);
+
+        /* the current node was merged in, remove it */
+        start = rangecur->start;
+        node = g_tree_node_next(node);
+        /* no hinted removal in GTree... */
+        g_tree_remove(tree.t, &start);
+    }
+}
+
+static bool page_range_tree_pop(PageRangeTree tree, PageRange *out,
+                                uint64_t maxcount)
+{
+    GTreeNode *node;
+    PageRange *range;
+
+    node = g_tree_node_last(tree.t);
+    if (!node) {
+        return false;
+    }
+
+    range = g_tree_node_value(node);
+    assert(range);
+
+    out->start = range->start;
+
+    /* can't modify range->start as it is the node key */
+    if (range->count > maxcount) {
+        out->start += range->count - maxcount;
+        out->count = maxcount;
+        range->count -= maxcount;
+    } else {
+        out->count = range->count;
+        /* no hinted removal in GTree... */
+        g_tree_remove(tree.t, &out->start);
+    }
+
+    return true;
+}
+
+static bool page_range_tree_intree_any(PageRangeTree tree,
+                                       uint64_t start, uint64_t count)
+{
+    GTreeNode *node;
+
+    if (count == 0) {
+        return false;
+    }
+
+    /* find the first node that can possibly intersect our range */
+    node = g_tree_upper_bound(tree.t, &start);
+    if (node) {
+        /*
+         * a NULL node below means that the very first node in the tree
+         * already has a higher key (the start of its range).
+         */
+        node = g_tree_node_previous(node);
+    } else {
+        /* a NULL node below means that the tree is empty */
+        node = g_tree_node_last(tree.t);
+    }
+    /* node range start <= range start */
+
+    if (!node) {
+        /* node range start > range start */
+        node = g_tree_node_first(tree.t);
+    }
+
+    for ( ; node; node = g_tree_node_next(node)) {
+        PageRange *range = g_tree_node_value(node);
+
+        assert(range);
+        /*
+         * if this node starts beyond or at the end of our range so does
+         * every next one
+         */
+        if (range->start >= start + count) {
+            break;
+        }
+
+        if (page_range_intersection_size(range, start, count) > 0) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
+static PageRangeTree page_range_tree_new(void)
+{
+    PageRangeTree tree;
+
+    tree.t = g_tree_new_full(page_range_tree_key_compare, NULL,
+                             g_free, g_free);
+    return tree;
+}
+
+static void page_range_tree_destroy(PageRangeTree *tree)
+{
+    /* g_tree_destroy() is not NULL-safe */
+    if (!tree->t) {
+        return;
+    }
+
+    g_tree_destroy(tree->t);
+    tree->t = NULL;
+}
+
+/* HAPVDIMMDevice */
+static uint64_t hapvdimm_get_addr(HAPVDIMMDevice *hapvdimm)
+{
+    return object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_ADDR_PROP,
+                                    &error_abort) / HV_BALLOON_PAGE_SIZE;
+}
+
+static uint64_t hapvdimm_get_size(HAPVDIMMDevice *hapvdimm)
+{
+    return object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_SIZE_PROP,
+                                    &error_abort) / HV_BALLOON_PAGE_SIZE;
+}
+
+static void hapvdimm_get_range(HAPVDIMMDevice *hapvdimm, PageRange *out)
+{
+    out->start = hapvdimm_get_addr(hapvdimm);
+    assert(out->start > 0);
+
+    out->count = hapvdimm_get_size(hapvdimm);
+    assert(out->count > 0);
+}
+
+static HostMemoryBackend *hapvdimm_get_memdev(HAPVDIMMDevice *hapvdimm)
+{
+    Object *memdev_obj;
+
+    memdev_obj = object_property_get_link(OBJECT(hapvdimm),
+                                          HAPVDIMM_MEMDEV_PROP,
+                                          &error_abort);
+    return MEMORY_BACKEND(memdev_obj);
+}
+
+/* HAPVDIMMRange */
+static HAPVDIMMRange *hapvdimm_range_new(HAPVDIMMDevice *hapvdimm)
+{
+    HAPVDIMMRange *hpr = g_malloc(sizeof(*hpr));
+
+    hpr->hapvdimm = HAPVDIMM(object_ref(hapvdimm));
+    hapvdimm_get_range(hapvdimm, &hpr->range);
+
+    hpr->removed_guest = page_range_tree_new();
+    hpr->removed_both = page_range_tree_new();
+
+    /* mark the whole range as unused */
+    hpr->used = 0;
+    hpr->unused_head = hpr->range.count;
+    hpr->unused_tail = 0;
+
+    return hpr;
+}
+
+static void hapvdimm_range_free(HAPVDIMMRange *hpr)
+{
+    g_autoptr(HAPVDIMMDevice) hapvdimm = g_steal_pointer(&hpr->hapvdimm);
+
+    page_range_tree_destroy(&hpr->removed_guest);
+    page_range_tree_destroy(&hpr->removed_both);
+
+    g_free(hpr);
+}
+
+/* the hapvdimm range reduced by unused head and tail */
+static void hapvdimm_range_get_effective_range(HAPVDIMMRange *hpr,
+                                               PageRange *out)
+{
+    out->start = hpr->range.start + hpr->unused_head;
+    out->count = hpr->range.count - hpr->unused_head - hpr->unused_tail;
+}
+
+/* HAPVDIMMRangeTree */
+static gint hapvdimm_tree_key_compare(gconstpointer leftp, gconstpointer rightp,
+                                      gpointer user_data)
+{
+    /*
+     * hapvdimm tree is also keyed on page range start, so we can simply reuse
+     * the comparison function from the page range tree
+     */
+    return page_range_tree_key_compare(leftp, rightp, user_data);
+}
+
+static HAPVDIMMRange *hapvdimm_tree_insert_new(HvBalloon *balloon,
+                                               HAPVDIMMDevice *hapvdimm)
+{
+    HAPVDIMMRange *hpr;
+    uint64_t *key;
+
+    hpr = hapvdimm_range_new(hapvdimm);
+
+    key = g_malloc(sizeof(*key));
+    *key = hpr->range.start;
+
+    g_tree_insert(balloon->hapvdimms.t, key, hpr);
+
+    return hpr;
+}
+
+/* The HAPVDIMM must not be on the ha_todo list since it's going to get unref'ed. */
+static void hapvdimm_tree_remove(HvBalloon *balloon, HAPVDIMMDevice *hapvdimm)
+{
+    uint64_t addr;
+
+    addr = hapvdimm_get_addr(hapvdimm);
+    assert(addr > 0);
+
+    g_tree_remove(balloon->hapvdimms.t, &addr);
+}
+
+/* total RAM includes memory currently removed from the guest */
+static gboolean hapvdimm_tree_total_ram_node(gpointer key,
+                                             gpointer value,
+                                             gpointer data)
+{
+    HAPVDIMMRange *hpr = value;
+    uint64_t *size = data;
+    PageRange rangeeff;
+
+    hapvdimm_range_get_effective_range(hpr, &rangeeff);
+    *size += rangeeff.count;
+
+    return false;
+}
+
+static uint64_t hapvdimm_tree_total_ram(HvBalloon *balloon)
+{
+    uint64_t size = 0;
+
+    g_tree_foreach(balloon->hapvdimms.t, hapvdimm_tree_total_ram_node, &size);
+    return size;
+}
+
+static void hapvdimm_tree_value_free(gpointer data)
+{
+    HAPVDIMMRange *hpr = data;
+
+    hapvdimm_range_free(hpr);
+}
+
+static HAPVDIMMRangeTree hapvdimm_tree_new(void)
+{
+    HAPVDIMMRangeTree tree;
+
+    tree.t = g_tree_new_full(hapvdimm_tree_key_compare, NULL, g_free,
+                             hapvdimm_tree_value_free);
+    return tree;
+}
+
+static void hapvdimm_tree_destroy(HAPVDIMMRangeTree *tree)
+{
+    /* g_tree_destroy() is not NULL-safe */
+    if (!tree->t) {
+        return;
+    }
+
+    g_tree_destroy(tree->t);
+    tree->t = NULL;
+}
+
+static gboolean ha_todo_add_all_node(gpointer key,
+                                     gpointer value,
+                                     gpointer data)
+{
+    HAPVDIMMRange *hpr = value;
+    HvBalloon *balloon = data;
+
+    /* assume the hpr is fresh */
+    assert(hpr->used == 0);
+    assert(hpr->unused_head == hpr->range.count);
+    assert(hpr->unused_tail == 0);
+
+    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
+
+    return false;
+}
+
+static void ha_todo_add_all(HvBalloon *balloon)
+{
+    assert(balloon->ha_todo == NULL);
+    g_tree_foreach(balloon->hapvdimms.t, ha_todo_add_all_node, balloon);
+}
+
+static void ha_todo_clear(HvBalloon *balloon)
+{
+    g_slist_free(g_steal_pointer(&balloon->ha_todo));
+}
+
+/* TODO: unify the code below with virtio-balloon and cache the value */
+static int build_dimm_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
+        DeviceState *dev = DEVICE(obj);
+        if (dev->realized) { /* only realized DIMMs matter */
+            *list = g_slist_prepend(*list, dev);
+        }
+    }
+
+    object_child_foreach(obj, build_dimm_list, opaque);
+    return 0;
+}
+
+static ram_addr_t get_current_ram_size(void)
+{
+    GSList *list = NULL, *item;
+    ram_addr_t size = current_machine->ram_size;
+
+    build_dimm_list(qdev_get_machine(), &list);
+    for (item = list; item; item = g_slist_next(item)) {
+        Object *obj = OBJECT(item->data);
+        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
+            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP, &error_abort);
+        }
+    }
+    g_slist_free(list);
+
+    return size;
+}
+
+/* total RAM includes memory currently removed from the guest */
+static uint64_t hv_balloon_total_ram(HvBalloon *balloon)
+{
+    ram_addr_t ram_size = get_current_ram_size();
+    uint64_t ram_size_pages = ram_size >> HV_BALLOON_PFN_SHIFT;
+    uint64_t hapvdimm_size_pages = hapvdimm_tree_total_ram(balloon);
+
+    assert(ram_size_pages > 0);
+
+    return SUM_SATURATE_U64(ram_size_pages, hapvdimm_size_pages);
+}
+
+/*
+ * calculating the total RAM size is a slow operation,
+ * avoid it as much as possible
+ */
+static uint64_t hv_balloon_total_removed_rs(HvBalloon *balloon,
+                                            uint64_t ram_size_pages)
+{
+    uint64_t total_removed;
+
+    total_removed = SUM_SATURATE_U64(balloon->removed_guest_ctr,
+                                     balloon->removed_both_ctr);
+
+    /* possible if guest returns pages outside actual RAM */
+    if (total_removed > ram_size_pages) {
+        total_removed = ram_size_pages;
+    }
+
+    return total_removed;
+}
+
+static bool hv_balloon_state_is_init(HvBalloon *balloon)
+{
+    return balloon->state == S_WAIT_RESET ||
+        balloon->state == S_CLOSED ||
+        balloon->state == S_VERSION ||
+        balloon->state == S_CAPS;
+}
+
+/* Returns whether the state has actually changed */
+static bool hv_balloon_state_set(HvBalloon *balloon,
+                                 State newst, const char *newststr)
+{
+    if (newst == S_NO_CHANGE || balloon->state == newst) {
+        return false;
+    }
+
+    balloon->state = newst;
+    trace_hv_balloon_state_change(newststr);
+    return true;
+}
+
+static void _hv_balloon_state_desc_set(StateDesc *stdesc,
+                                       State newst, const char *newststr)
+{
+    /* state setting is only permitted on a freshly init desc */
+    assert(stdesc->state == S_NO_CHANGE);
+
+    assert(newst != S_NO_CHANGE);
+
+    stdesc->state = newst;
+    stdesc->desc = newststr;
+}
+
+static void del_todo_process(HvBalloon *balloon)
+{
+    while (balloon->hapvdimms_del_todo) {
+        HAPVDIMMDevice *hapvdimm = balloon->hapvdimms_del_todo->data;
+        HostMemoryBackend *backend;
+        const char *backend_id;
+
+        backend = hapvdimm_get_memdev(hapvdimm);
+        backend_id = object_get_canonical_path_component(OBJECT(backend));
+
+        object_unparent(OBJECT(hapvdimm));
+        object_unref(OBJECT(hapvdimm));
+        qapi_event_send_hv_balloon_memory_backend_unused(backend_id);
+
+        balloon->hapvdimms_del_todo =
+            g_slist_remove(balloon->hapvdimms_del_todo, hapvdimm);
+    }
+
+    if (balloon->del_todo_process_timer) {
+        g_source_remove(balloon->del_todo_process_timer);
+        balloon->del_todo_process_timer = 0;
+    }
+}
+
+static gboolean del_todo_process_timer(gpointer user_data)
+{
+    HvBalloon *balloon = user_data;
+
+    balloon->del_todo_process_timer = 0;
+
+    del_todo_process(balloon);
+
+    return G_SOURCE_REMOVE;
+}
+
+static void del_todo_append(HvBalloon *balloon,
+                            HAPVDIMMDevice *hapvdimm)
+{
+    balloon->hapvdimms_del_todo = g_slist_append(balloon->hapvdimms_del_todo,
+                                                 object_ref(hapvdimm));
+}
+
+static void del_todo_add(HvBalloon *balloon,
+                         HAPVDIMMDevice *hapvdimm)
+{
+    hapvdimm_tree_remove(balloon, hapvdimm);
+    del_todo_append(balloon, hapvdimm);
+}
+
+static gboolean del_todo_add_all_node(gpointer key,
+                                      gpointer value,
+                                      gpointer data)
+{
+    HAPVDIMMRange *hpr = value;
+    HvBalloon *balloon = data;
+
+    del_todo_append(balloon, hpr->hapvdimm);
+
+    return false;
+}
+
+static void del_todo_add_all(HvBalloon *balloon)
+{
+    g_tree_foreach(balloon->hapvdimms.t, del_todo_add_all_node, balloon);
+    hapvdimm_tree_destroy(&balloon->hapvdimms);
+
+    balloon->hapvdimms = hapvdimm_tree_new();
+}
+
+static void del_todo_add_all_from_ha_todo(HvBalloon *balloon)
+{
+    while (balloon->ha_todo) {
+        HAPVDIMMRange *hpr = balloon->ha_todo->data;
+
+        del_todo_add(balloon, hpr->hapvdimm);
+        balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr);
+    }
+}
+
+static VMBusChannel *hv_balloon_get_channel_maybe(HvBalloon *balloon)
+{
+    return vmbus_device_channel(&balloon->parent, 0);
+}
+
+static VMBusChannel *hv_balloon_get_channel(HvBalloon *balloon)
+{
+    VMBusChannel *chan;
+
+    chan = hv_balloon_get_channel_maybe(balloon);
+    assert(chan != NULL);
+    return chan;
+}
+
+static ssize_t hv_balloon_send_packet(VMBusChannel *chan,
+                                      struct dm_message *msg)
+{
+    int ret;
+
+    ret = vmbus_channel_reserve(chan, 0, msg->hdr.size);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                              NULL, 0, msg, msg->hdr.size, false,
+                              msg->hdr.trans_id);
+}
+
+static bool hv_balloon_unballoon_get_source(HvBalloon *balloon,
+                                            PageRangeTree *dtree,
+                                            uint64_t **dctr,
+                                            HAPVDIMMRange **hpr)
+{
+    /* Try the boot memory first */
+    if (g_tree_nnodes(balloon->removed_guest.t) > 0) {
+        *dtree = balloon->removed_guest;
+        *dctr = &balloon->removed_guest_ctr;
+        *hpr = NULL;
+    } else if (g_tree_nnodes(balloon->removed_both.t) > 0) {
+        *dtree = balloon->removed_both;
+        *dctr = &balloon->removed_both_ctr;
+        *hpr = NULL;
+    } else {
+        GTreeNode *node;
+
+        for (node = g_tree_node_first(balloon->hapvdimms.t); node;
+             node = g_tree_node_next(node)) {
+            HAPVDIMMRange *hprnode = g_tree_node_value(node);
+
+            assert(hprnode);
+            if (g_tree_nnodes(hprnode->removed_guest.t) > 0) {
+                *dtree = hprnode->removed_guest;
+                *dctr = &balloon->removed_guest_ctr;
+                *hpr = hprnode;
+                break;
+            } else if (g_tree_nnodes(hprnode->removed_both.t) > 0) {
+                *dtree = hprnode->removed_both;
+                *dctr = &balloon->removed_both_ctr;
+                *hpr = hprnode;
+                break;
+            }
+        }
+
+        if (!node) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+static void hv_balloon_balloon_unballoon_start(HvBalloon *balloon,
+                                               uint64_t ram_size_pages,
+                                               StateDesc *stdesc)
+{
+    uint64_t total_removed = hv_balloon_total_removed_rs(balloon,
+                                                         ram_size_pages);
+
+    assert(balloon->state == S_IDLE);
+    assert(ram_size_pages > 0);
+
+    /*
+     * we need to cache the value when starting the (un)balloon procedure
+     * in case somebody changes the balloon target while the procedure is
+     * in progress
+     */
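+    /*
+     * Worked example (hypothetical numbers): with ram_size_pages = 1000
+     * and total_removed = 100 the guest currently sees 900 usable pages.
+     * A target of 800 pages takes the balloon branch below with
+     * target_diff = 1000 - 100 - 800 = 100, while a target of 950 takes
+     * the unballoon branch with target_diff = MIN(950 - 900, 100) = 50.
+     */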
+    if (balloon->target < ram_size_pages - total_removed) {
+        balloon->target_diff = ram_size_pages - total_removed - balloon->target;
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_RB_WAIT);
+    } else {
+        balloon->target_diff = balloon->target -
+            (ram_size_pages - total_removed);
+
+        /*
+         * careful here, the user might have set the balloon target
+         * above the RAM size, so above the total removed count
+         */
+        balloon->target_diff = MIN(balloon->target_diff, total_removed);
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_RB_WAIT);
+    }
+
+    balloon->target_changed = false;
+}
+
+static void hv_balloon_unballoon_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    struct dm_unballoon_request *ur;
+    size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
+
+    assert(balloon->state == S_UNBALLOON_RB_WAIT);
+
+    if (vmbus_channel_reserve(chan, 0, ur_size) < 0) {
+        return;
+    }
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_POSTING);
+}
+
+static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    PageRangeTree dtree;
+    uint64_t *dctr;
+    HAPVDIMMRange *hpr;
+    struct dm_unballoon_request *ur;
+    size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
+    PageRange range;
+    bool bret;
+    ssize_t ret;
+
+    assert(balloon->state == S_UNBALLOON_POSTING);
+    assert(balloon->target_diff > 0);
+
+    if (!hv_balloon_unballoon_get_source(balloon, &dtree, &dctr, &hpr)) {
+        error_report("trying to unballoon but nothing ballooned");
+        /*
+         * there is little we can do as we might have already
+         * sent the guest a partial request we can't cancel
+         */
+        return;
+    }
+
+    assert(dtree.t);
+    assert(dctr);
+
+    ur = alloca(ur_size);
+    memset(ur, 0, ur_size);
+    ur->hdr.type = DM_UNBALLOON_REQUEST;
+    ur->hdr.size = ur_size;
+    ur->hdr.trans_id = balloon->trans_id;
+
+    bret = page_range_tree_pop(dtree, &range, MIN(balloon->target_diff,
+                                                  HV_BALLOON_HA_CHUNK_PAGES));
+    assert(bret);
+    /* TODO: madvise? */
+
+    *dctr -= range.count;
+    balloon->target_diff -= range.count;
+    if (hpr) {
+        hpr->used += range.count;
+    }
+
+    ur->range_count = 1;
+    ur->range_array[0].finfo.start_page = range.start;
+    ur->range_array[0].finfo.page_cnt = range.count;
+    ur->more_pages = balloon->target_diff > 0;
+
+    trace_hv_balloon_outgoing_unballoon(ur->hdr.trans_id,
+                                        range.count, range.start,
+                                        balloon->target_diff);
+
+    if (ur->more_pages) {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_RB_WAIT);
+    } else {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_REPLY_WAIT);
+    }
+
+    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                             NULL, 0, ur, ur_size, false,
+                             ur->hdr.trans_id);
+    if (ret <= 0) {
+        error_report("error %zd when posting unballoon msg, expect problems",
+                     ret);
+    }
+}
+
+static void hv_balloon_hot_add_start(HvBalloon *balloon, StateDesc *stdesc)
+{
+    HAPVDIMMRange *hpr;
+    PageRange range;
+
+    assert(balloon->state == S_IDLE);
+    assert(balloon->ha_todo);
+
+    hpr = balloon->ha_todo->data;
+
+    range.start = QEMU_ALIGN_UP(hpr->range.start,
+                                (1 << balloon->caps.cap_bits.hot_add_alignment)
+                                * (MiB / HV_BALLOON_PAGE_SIZE));
+    hpr->unused_head = range.start - hpr->range.start;
+    if (hpr->unused_head >= hpr->range.count) {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT);
+        return;
+    }
+
+    range.count = hpr->range.count - hpr->unused_head;
+    range.count = QEMU_ALIGN_DOWN(range.count,
+                                  (1 << balloon->caps.cap_bits.hot_add_alignment)
+                                  * (MiB / HV_BALLOON_PAGE_SIZE));
+    if (range.count == 0) {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT);
+        return;
+    }
+    hpr->unused_tail = hpr->range.count - hpr->unused_head - range.count;
+    hpr->used = 0;
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_RB_WAIT);
+}
+
+static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    struct dm_hot_add *ha;
+    size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+
+    assert(balloon->state == S_HOT_ADD_RB_WAIT);
+
+    if (vmbus_channel_reserve(chan, 0, ha_size) < 0) {
+        return;
+    }
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_POSTING);
+}
+
+static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    HAPVDIMMRange *hpr;
+    struct dm_hot_add *ha;
+    size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+    union dm_mem_page_range *ha_region;
+    PageRange range;
+    uint64_t chunk_max_size;
+    ssize_t ret;
+
+    assert(balloon->state == S_HOT_ADD_POSTING);
+    assert(balloon->ha_todo);
+
+    hpr = balloon->ha_todo->data;
+
+    range.start = hpr->range.start + hpr->unused_head + hpr->used;
+    range.count = hpr->range.count;
+    range.count -= hpr->unused_head;
+    range.count -= hpr->used;
+    range.count -= hpr->unused_tail;
+
+    chunk_max_size = MAX((1 << balloon->caps.cap_bits.hot_add_alignment) *
+                         (MiB / HV_BALLOON_PAGE_SIZE),
+                         HV_BALLOON_HA_CHUNK_PAGES);
+    range.count = MIN(range.count, chunk_max_size);
+    balloon->ha_current_count = range.count;
+
+    ha = alloca(ha_size);
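+    /* the hot-add region entry immediately follows ha->range in the buffer */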
+    ha_region = &(&ha->range)[1];
+    memset(ha, 0, ha_size);
+    ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
+    ha->hdr.size = ha_size;
+    ha->hdr.trans_id = balloon->trans_id;
+
+    ha->range.finfo.start_page = range.start;
+    ha->range.finfo.page_cnt = range.count;
+    ha_region->finfo.start_page = range.start;
+    ha_region->finfo.page_cnt = ha->range.finfo.page_cnt;
+
+    trace_hv_balloon_outgoing_hot_add(ha->hdr.trans_id,
+                                      range.count, range.start);
+
+    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                             NULL, 0, ha, ha_size, false,
+                             ha->hdr.trans_id);
+    if (ret <= 0) {
+        error_report("error %zd when posting hot add msg, expect problems",
+                     ret);
+    }
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_REPLY_WAIT);
+}
+
+static void hv_balloon_hot_add_finish(HvBalloon *balloon, StateDesc *stdesc)
+{
+    HAPVDIMMRange *hpr;
+
+    assert(balloon->state == S_HOT_ADD_SKIP_CURRENT ||
+           balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING ||
+           balloon->state == S_HOT_ADD_PROCESSED_NEXT);
+    assert(balloon->ha_todo);
+
+    hpr = balloon->ha_todo->data;
+
+    balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr);
+    if (balloon->state == S_HOT_ADD_SKIP_CURRENT) {
+        del_todo_add(balloon, hpr->hapvdimm);
+    } else if (balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING) {
+        del_todo_add_all_from_ha_todo(balloon);
+    }
+
+    /* let other things happen, too, between hot adds to be done */
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
+}
+
+static void hv_balloon_balloon_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    size_t bl_size = sizeof(struct dm_balloon);
+
+    assert(balloon->state == S_BALLOON_RB_WAIT);
+
+    if (vmbus_channel_reserve(chan, 0, bl_size) < 0) {
+        return;
+    }
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_POSTING);
+}
+
+static void hv_balloon_balloon_posting(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    struct dm_balloon bl;
+    size_t bl_size = sizeof(bl);
+    ssize_t ret;
+
+    assert(balloon->state == S_BALLOON_POSTING);
+    assert(balloon->target_diff > 0);
+
+    memset(&bl, 0, sizeof(bl));
+    bl.hdr.type = DM_BALLOON_REQUEST;
+    bl.hdr.size = bl_size;
+    bl.hdr.trans_id = balloon->trans_id;
+    bl.num_pages = MIN(balloon->target_diff, HV_BALLOON_HR_CHUNK_PAGES);
+
+    trace_hv_balloon_outgoing_balloon(bl.hdr.trans_id, bl.num_pages,
+                                      balloon->target_diff);
+
+    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                             NULL, 0, &bl, bl_size, false,
+                             bl.hdr.trans_id);
+    if (ret <= 0) {
+        error_report("error %zd when posting balloon msg, expect problems",
+                     ret);
+    }
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_REPLY_WAIT);
+}
+
+static void hv_balloon_idle_state(HvBalloon *balloon,
+                                  StateDesc *stdesc)
+{
+    bool can_balloon = balloon->caps.cap_bits.balloon;
+    bool want_unballoon = false;
+    bool want_hot_add = balloon->ha_todo != NULL;
+    bool want_balloon = false;
+    uint64_t ram_size_pages;
+
+    assert(balloon->state == S_IDLE);
+
+    if (can_balloon && balloon->target_changed) {
+        uint64_t total_removed;
+
+        ram_size_pages = hv_balloon_total_ram(balloon);
+        total_removed = hv_balloon_total_removed_rs(balloon,
+                                                    ram_size_pages);
+
+        want_unballoon = total_removed > 0 &&
+            balloon->target > ram_size_pages - total_removed;
+        want_balloon = balloon->target < ram_size_pages - total_removed;
+    }
+
+    /*
+     * the order here is important, first we unballoon, then hot add,
+     * then balloon (or hot remove)
+     */
+    if (want_unballoon) {
+        hv_balloon_balloon_unballoon_start(balloon, ram_size_pages, stdesc);
+    } else if (want_hot_add) {
+        hv_balloon_hot_add_start(balloon, stdesc);
+    } else if (want_balloon) {
+        hv_balloon_balloon_unballoon_start(balloon, ram_size_pages, stdesc);
+    }
+}
+
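+/*
+ * Dispatch table for the internal state machine.
+ *
+ * The basic flows are:
+ *   (un)balloon: S_IDLE -> S_*_RB_WAIT -> S_*_POSTING -> S_*_REPLY_WAIT
+ *                -> S_IDLE (looping back to S_*_RB_WAIT while chunks remain)
+ *   hot add:     S_IDLE -> S_HOT_ADD_RB_WAIT -> S_HOT_ADD_POSTING
+ *                -> S_HOT_ADD_REPLY_WAIT
+ *                -> S_HOT_ADD_{SKIP_CURRENT,PROCESSED_*} -> S_IDLE
+ *
+ * The *_REPLY_WAIT states intentionally have no handler here: they are
+ * advanced by the guest response handlers below.
+ */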
+static const struct {
+    void (*handler)(HvBalloon *balloon, StateDesc *stdesc);
+} state_handlers[] = {
+    [S_IDLE].handler = hv_balloon_idle_state,
+    [S_UNBALLOON_RB_WAIT].handler = hv_balloon_unballoon_rb_wait,
+    [S_UNBALLOON_POSTING].handler = hv_balloon_unballoon_posting,
+    [S_HOT_ADD_RB_WAIT].handler = hv_balloon_hot_add_rb_wait,
+    [S_HOT_ADD_POSTING].handler = hv_balloon_hot_add_posting,
+    [S_HOT_ADD_SKIP_CURRENT].handler = hv_balloon_hot_add_finish,
+    [S_HOT_ADD_PROCESSED_CLEAR_PENDING].handler = hv_balloon_hot_add_finish,
+    [S_HOT_ADD_PROCESSED_NEXT].handler = hv_balloon_hot_add_finish,
+    [S_BALLOON_RB_WAIT].handler = hv_balloon_balloon_rb_wait,
+    [S_BALLOON_POSTING].handler = hv_balloon_balloon_posting,
+};
+
+static void hv_balloon_handle_state(HvBalloon *balloon, StateDesc *stdesc)
+{
+    if (!state_handlers[balloon->state].handler) {
+        return;
+    }
+
+    state_handlers[balloon->state].handler(balloon, stdesc);
+}
+
+static void hv_balloon_remove_response_insert_range(PageRangeTree tree,
+                                                    const PageRange *range,
+                                                    uint64_t *ctr1,
+                                                    uint64_t *ctr2,
+                                                    uint64_t *ctr3)
+{
+    uint64_t dupcount, effcount;
+
+    if (range->count == 0) {
+        return;
+    }
+
+    dupcount = 0;
+    page_range_tree_insert(tree, range->start, range->count, &dupcount);
+
+    assert(dupcount <= range->count);
+    effcount = range->count - dupcount;
+
+    *ctr1 += effcount;
+    *ctr2 += effcount;
+    if (ctr3) {
+        *ctr3 += effcount;
+    }
+}
+
+static void hv_balloon_remove_response_handle_range(HvBalloon *balloon,
+                                                    PageRange *range,
+                                                    bool both,
+                                                    uint64_t *removedctr)
+{
+    GTreeNode *node;
+    PageRangeTree globaltree = both ? balloon->removed_both :
+        balloon->removed_guest;
+    uint64_t *globalctr = both ? &balloon->removed_both_ctr :
+        &balloon->removed_guest_ctr;
+
+    if (range->count == 0) {
+        return;
+    }
+
+    trace_hv_balloon_remove_response(range->count, range->start, both);
+
+    /* find the first node that can possibly intersect our range */
+    node = g_tree_upper_bound(balloon->hapvdimms.t, &range->start);
+    if (node) {
+        /*
+         * a NULL node below means that the very first node in the tree
+         * already has a higher key (the start of its range).
+         */
+        node = g_tree_node_previous(node);
+    } else {
+        /* a NULL node below means that the tree is empty */
+        node = g_tree_node_last(balloon->hapvdimms.t);
+    }
+    /* node range start <= range start */
+
+    if (!node) {
+        /* node range start > range start */
+        node = g_tree_node_first(balloon->hapvdimms.t);
+    }
+
+    for ( ; node && range->count > 0; node = g_tree_node_next(node)) {
+        HAPVDIMMRange *hpr = g_tree_node_value(node);
+        PageRangeTree hprtree;
+        PageRange rangeeff, rangehole, rangecommon;
+        uint64_t hprremoved = 0;
+
+        assert(hpr);
+        hprtree = both ? hpr->removed_both : hpr->removed_guest;
+        hapvdimm_range_get_effective_range(hpr, &rangeeff);
+
+        /*
+         * if this node starts at or beyond the end of the range, so does
+         * every following node
+         */
+        if (rangeeff.start >= range->start + range->count) {
+            break;
+        }
+
+        /* process the hole before the current hpr, if it exists */
+        page_range_part_before(range, rangeeff.start, &rangehole);
+        hv_balloon_remove_response_insert_range(globaltree, &rangehole,
+                                                globalctr, removedctr, NULL);
+        if (rangehole.count > 0) {
+            trace_hv_balloon_remove_response_hole(rangehole.count,
+                                                  rangehole.start,
+                                                  range->count, range->start,
+                                                  rangeeff.start, both);
+        }
+
+        /*
+         * process the hpr part; it can be empty for the very first node
+         * processed, or due to a difference between the nominal and
+         * effective hpr start
+         */
+        page_range_intersect(range, rangeeff.start, rangeeff.count,
+                             &rangecommon);
+        hv_balloon_remove_response_insert_range(hprtree, &rangecommon,
+                                                globalctr, removedctr,
+                                                &hprremoved);
+        hpr->used -= hprremoved;
+        if (rangecommon.count > 0) {
+            trace_hv_balloon_remove_response_common(rangecommon.count,
+                                                    rangecommon.start,
+                                                    range->count, range->start,
+                                                    rangeeff.count,
+                                                    rangeeff.start, hprremoved,
+                                                    both);
+        }
+
+        /* calculate what's left after the current hpr */
+        rangecommon = *range;
+        page_range_part_after(&rangecommon, rangeeff.start, rangeeff.count,
+                              range);
+    }
+
+    /* process the remainder of the range that lies outside of the hpr tree */
+    if (range->count > 0) {
+        hv_balloon_remove_response_insert_range(globaltree, range,
+                                                globalctr, removedctr, NULL);
+        trace_hv_balloon_remove_response_remainder(range->count, range->start,
+                                                   both);
+        range->count = 0;
+    }
+}
+
+static void hv_balloon_remove_response_handle_pages(HvBalloon *balloon,
+                                                    PageRange *range,
+                                                    uint64_t start,
+                                                    uint64_t count,
+                                                    bool both,
+                                                    uint64_t *removedctr)
+{
+    assert(count > 0);
+
+    /*
+     * if there is an existing range that the new range can't be joined to
+     * dump it into tree(s)
+     */
+    if (range->count > 0 && !page_range_joinable(range, start, count)) {
+        hv_balloon_remove_response_handle_range(balloon, range, both,
+                                                removedctr);
+    }
+
+    if (range->count == 0) {
+        range->start = start;
+        range->count = count;
+    } else if (page_range_joinable_left(range, start, count)) {
+        range->start = start;
+        range->count += count;
+    } else { /* page_range_joinable_right() */
+        range->count += count;
+    }
+}
+
+static gboolean hv_balloon_handle_remove_host_addr_node(gpointer key,
+                                                        gpointer value,
+                                                        gpointer data)
+{
+    PageRange *range = value;
+    uint64_t pageoff;
+
+    for (pageoff = 0; pageoff < range->count; ) {
+        void *addr = (void *)((range->start + pageoff) * HV_BALLOON_PAGE_SIZE);
+        RAMBlock *rb;
+        ram_addr_t rb_offset;
+        size_t rb_page_size;
+        size_t discard_size;
+
+        rb = qemu_ram_block_from_host(addr, false, &rb_offset);
+        rb_page_size = qemu_ram_pagesize(rb);
+
+        if (rb_page_size != HV_BALLOON_PAGE_SIZE) {
+            /* TODO: these should end in "removed_guest" */
+            warn_report("guest reported removed page backed by unsupported page size %zu",
+                        rb_page_size);
+            pageoff++;
+            continue;
+        }
+
+        discard_size = MIN(range->count - pageoff,
+                           (rb->max_length - rb_offset) /
+                           HV_BALLOON_PAGE_SIZE);
+        discard_size = MAX(discard_size, 1);
+
+        if (ram_block_discard_range(rb, rb_offset, discard_size *
+                                    HV_BALLOON_PAGE_SIZE) != 0) {
+            warn_report("guest reported removed page failed discard");
+        }
+
+        pageoff += discard_size;
+    }
+
+    return false;
+}
+
+static void hv_balloon_handle_remove_host_addr_tree(PageRangeTree tree)
+{
+    g_tree_foreach(tree.t, hv_balloon_handle_remove_host_addr_node, NULL);
+}
+
+static int hv_balloon_handle_remove_section(PageRangeTree tree,
+                                            const MemoryRegionSection *section,
+                                            uint64_t count)
+{
+    void *addr = memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region;
+    uint64_t addr_page;
+
+    assert(count > 0);
+
+    if ((uintptr_t)addr % HV_BALLOON_PAGE_SIZE) {
+        warn_report("guest reported removed pages at an unaligned host addr %p",
+                    addr);
+        return -EINVAL;
+    }
+
+    addr_page = (uintptr_t)addr / HV_BALLOON_PAGE_SIZE;
+    page_range_tree_insert(tree, addr_page, count, NULL);
+
+    return 0;
+}
+
+static void hv_balloon_handle_remove_ranges(HvBalloon *balloon,
+                                            union dm_mem_page_range ranges[],
+                                            uint32_t count)
+{
+    uint64_t removedcnt;
+    PageRangeTree removed_host_addr;
+    PageRange range_guest, range_both;
+
+    removed_host_addr = page_range_tree_new();
+    range_guest.count = range_both.count = removedcnt = 0;
+    for (unsigned int ctr = 0; ctr < count; ctr++) {
+        union dm_mem_page_range *mr = &ranges[ctr];
+        hwaddr pa;
+        MemoryRegionSection section;
+
+        for (unsigned int offset = 0; offset < mr->finfo.page_cnt; ) {
+            int ret;
+            uint64_t pageno = mr->finfo.start_page + offset;
+            uint64_t pagecnt = 1;
+
+            pa = (hwaddr)pageno << HV_BALLOON_PFN_SHIFT;
+            section = memory_region_find(get_system_memory(), pa,
+                                         (mr->finfo.page_cnt - offset) *
+                                         HV_BALLOON_PAGE_SIZE);
+            if (!section.mr) {
+                warn_report("guest reported removed page %"PRIu64" not found in RAM",
+                            pageno);
+                ret = -EINVAL;
+                goto finish_page;
+            }
+
+            pagecnt = int128_get64(section.size) / HV_BALLOON_PAGE_SIZE;
+            if (pagecnt == 0) {
+                warn_report("guest reported removed page %"PRIu64" in a section smaller than page size",
+                            pageno);
+                pagecnt = 1; /* skip the whole page */
+                ret = -EINVAL;
+                goto finish_page;
+            }
+
+            if (!memory_region_is_ram(section.mr) ||
+                memory_region_is_rom(section.mr) ||
+                memory_region_is_romd(section.mr)) {
+                warn_report("guest reported removed page %"PRIu64" in a section that is not an ordinary RAM",
+                            pageno);
+                ret = -EINVAL;
+                goto finish_page;
+            }
+
+            ret = hv_balloon_handle_remove_section(removed_host_addr, &section,
+                                                   pagecnt);
+
+        finish_page:
+            if (ret == 0) {
+                hv_balloon_remove_response_handle_pages(balloon,
+                                                        &range_both,
+                                                        pageno, pagecnt,
+                                                        true, &removedcnt);
+            } else {
+                hv_balloon_remove_response_handle_pages(balloon,
+                                                        &range_guest,
+                                                        pageno, pagecnt,
+                                                        false, &removedcnt);
+            }
+
+            if (section.mr) {
+                memory_region_unref(section.mr);
+            }
+
+            offset += pagecnt;
+        }
+    }
+
+    hv_balloon_remove_response_handle_range(balloon, &range_both, true,
+                                            &removedcnt);
+    hv_balloon_remove_response_handle_range(balloon, &range_guest, false,
+                                            &removedcnt);
+
+    hv_balloon_handle_remove_host_addr_tree(removed_host_addr);
+    page_range_tree_destroy(&removed_host_addr);
+
+    if (removedcnt > balloon->target_diff) {
+        warn_report("guest reported more pages removed than currently pending (%"PRIu64" vs %"PRIu64")",
+                    removedcnt, balloon->target_diff);
+        balloon->target_diff = 0;
+    } else {
+        balloon->target_diff -= removedcnt;
+    }
+}
+
+static bool hv_balloon_handle_msg_size(HvBalloonReq *req, size_t minsize,
+                                       const char *msgname)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    uint32_t msglen = vmreq->msglen;
+
+    if (msglen >= minsize) {
+        return true;
+    }
+
+    warn_report("%s message too short (%u vs %zu), ignoring", msgname,
+                (unsigned int)msglen, minsize);
+    return false;
+}
+
+static void hv_balloon_handle_version_request(HvBalloon *balloon,
+                                              HvBalloonReq *req,
+                                              StateDesc *stdesc)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_version_request *msgVr = vmreq->msg;
+    struct dm_version_response respVr;
+
+    if (balloon->state != S_VERSION) {
+        warn_report("unexpected DM_VERSION_REQUEST in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgVr),
+                                    "DM_VERSION_REQUEST")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_version(msgVr->version.major_version,
+                                      msgVr->version.minor_version);
+
+    memset(&respVr, 0, sizeof(respVr));
+    respVr.hdr.type = DM_VERSION_RESPONSE;
+    respVr.hdr.size = sizeof(respVr);
+    respVr.hdr.trans_id = msgVr->hdr.trans_id;
+    respVr.is_accepted = msgVr->version.version >= DYNMEM_PROTOCOL_VERSION_1 &&
+        msgVr->version.version <= DYNMEM_PROTOCOL_VERSION_3;
+
+    hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respVr);
+
+    if (respVr.is_accepted) {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_CAPS);
+    }
+}
+
+static void hv_balloon_handle_caps_report(HvBalloon *balloon,
+                                          HvBalloonReq *req,
+                                          StateDesc *stdesc)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_capabilities *msgCap = vmreq->msg;
+    struct dm_capabilities_resp_msg respCap;
+
+    if (balloon->state != S_CAPS) {
+        warn_report("unexpected DM_CAPABILITIES_REPORT in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgCap),
+                                    "DM_CAPABILITIES_REPORT")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_caps(msgCap->caps.caps);
+    balloon->caps = msgCap->caps;
+
+    memset(&respCap, 0, sizeof(respCap));
+    respCap.hdr.type = DM_CAPABILITIES_RESPONSE;
+    respCap.hdr.size = sizeof(respCap);
+    respCap.hdr.trans_id = msgCap->hdr.trans_id;
+    respCap.is_accepted = 1;
+    respCap.hot_remove = 1;
+    respCap.suppress_pressure_reports = !balloon->status_reports;
+    hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respCap);
+
+    if (balloon->caps.cap_bits.hot_add) {
+        ha_todo_add_all(balloon);
+    }
+
+    timer_mod(&balloon->post_init_timer,
+              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+              HV_BALLOON_POST_INIT_WAIT);
+
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_POST_INIT_WAIT);
+}
+
+static void hv_balloon_handle_status_report(HvBalloon *balloon,
+                                            HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_status *msgStatus = vmreq->msg;
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgStatus),
+                                    "DM_STATUS_REPORT")) {
+        return;
+    }
+
+    if (!balloon->status_reports) {
+        return;
+    }
+
+    qapi_event_send_hv_balloon_status_report((uint64_t)msgStatus->num_committed *
+                                             HV_BALLOON_PAGE_SIZE,
+                                             (uint64_t)msgStatus->num_avail *
+                                             HV_BALLOON_PAGE_SIZE);
+}
+
+static void hv_balloon_handle_unballoon_response(HvBalloon *balloon,
+                                                 HvBalloonReq *req,
+                                                 StateDesc *stdesc)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_unballoon_response *msgUrR = vmreq->msg;
+
+    if (balloon->state != S_UNBALLOON_REPLY_WAIT) {
+        warn_report("unexpected DM_UNBALLOON_RESPONSE in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgUrR),
+                                    "DM_UNBALLOON_RESPONSE")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_unballoon(msgUrR->hdr.trans_id);
+
+    balloon->trans_id++;
+    HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
+}
+
+static void hv_balloon_handle_hot_add_response(HvBalloon *balloon,
+                                               HvBalloonReq *req,
+                                               StateDesc *stdesc)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_hot_add_response *msgHaR = vmreq->msg;
+    HAPVDIMMRange *hpr;
+
+    if (balloon->state != S_HOT_ADD_REPLY_WAIT) {
+        warn_report("unexpected DM_HOT_ADD_RESPONSE in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgHaR),
+                                    "DM_HOT_ADD_RESPONSE")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_hot_add(msgHaR->hdr.trans_id, msgHaR->result,
+                                      msgHaR->page_count);
+
+    balloon->trans_id++;
+
+    assert(balloon->ha_todo);
+    hpr = balloon->ha_todo->data;
+
+    if (msgHaR->result) {
+        if (msgHaR->page_count > balloon->ha_current_count) {
+            warn_report("DM_HOT_ADD_RESPONSE page count higher than requested (%"PRIu32" vs %"PRIu64")",
+                        msgHaR->page_count, balloon->ha_current_count);
+            msgHaR->page_count = balloon->ha_current_count;
+        }
+
+        hpr->used += msgHaR->page_count;
+    }
+
+    if (!msgHaR->result || msgHaR->page_count < balloon->ha_current_count) {
+        if (hpr->used == 0) {
+            /*
+             * apparently the guest didn't like the current range at all,
+             * let's try the next one
+             */
+            HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT);
+            return;
+        }
+
+        /*
+         * the current planned range was only partially hot-added, take note
+         * how much of it remains and don't attempt any further hot adds
+         */
+        hpr->unused_tail = hpr->range.count - hpr->unused_head - hpr->used;
+
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_PROCESSED_CLEAR_PENDING);
+        return;
+    }
+
+    /* any pages remaining in this hpr? */
+    if (hpr->range.count - hpr->unused_head - hpr->used -
+        hpr->unused_tail > 0) {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_RB_WAIT);
+    } else {
+        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_PROCESSED_NEXT);
+    }
+}
+
+static void hv_balloon_handle_balloon_response(HvBalloon *balloon,
+                                               HvBalloonReq *req,
+                                               StateDesc *stdesc)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_balloon_response *msgBR = vmreq->msg;
+
+    if (balloon->state != S_BALLOON_REPLY_WAIT) {
+        warn_report("unexpected DM_BALLOON_RESPONSE in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgBR),
+                                    "DM_BALLOON_RESPONSE")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_balloon(msgBR->hdr.trans_id, msgBR->range_count,
+                                      msgBR->more_pages);
+
+    if (vmreq->msglen < sizeof(*msgBR) +
+        (uint64_t)sizeof(msgBR->range_array[0]) * msgBR->range_count) {
+        warn_report("DM_BALLOON_RESPONSE too short for the range count");
+        return;
+    }
+
+    if (msgBR->range_count == 0) {
+        /* The guest is already at its minimum size */
+        msgBR->more_pages = 0;
+        balloon->target_diff = 0;
+    } else {
+        hv_balloon_handle_remove_ranges(balloon,
+                                        msgBR->range_array,
+                                        msgBR->range_count);
+    }
+
+    if (!msgBR->more_pages) {
+        balloon->trans_id++;
+
+        if (balloon->target_diff > 0) {
+            HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_RB_WAIT);
+        } else {
+            HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
+        }
+    }
+}
+
+static void hv_balloon_handle_packet(HvBalloon *balloon, HvBalloonReq *req,
+                                     StateDesc *stdesc)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_message *msg = vmreq->msg;
+
+    if (vmreq->msglen < sizeof(msg->hdr)) {
+        return;
+    }
+
+    switch (msg->hdr.type) {
+    case DM_VERSION_REQUEST:
+        hv_balloon_handle_version_request(balloon, req, stdesc);
+        break;
+
+    case DM_CAPABILITIES_REPORT:
+        hv_balloon_handle_caps_report(balloon, req, stdesc);
+        break;
+
+    case DM_STATUS_REPORT:
+        hv_balloon_handle_status_report(balloon, req);
+        break;
+
+    case DM_MEM_HOT_ADD_RESPONSE:
+        hv_balloon_handle_hot_add_response(balloon, req, stdesc);
+        break;
+
+    case DM_UNBALLOON_RESPONSE:
+        hv_balloon_handle_unballoon_response(balloon, req, stdesc);
+        break;
+
+    case DM_BALLOON_RESPONSE:
+        hv_balloon_handle_balloon_response(balloon, req, stdesc);
+        break;
+
+    default:
+        warn_report("unknown DM message %u", msg->hdr.type);
+        break;
+    }
+}
+
+static bool hv_balloon_recv_channel(HvBalloon *balloon, StateDesc *stdesc)
+{
+    VMBusChannel *chan;
+    HvBalloonReq *req;
+
+    if (balloon->state == S_WAIT_RESET ||
+        balloon->state == S_CLOSED) {
+        return false;
+    }
+
+    chan = hv_balloon_get_channel(balloon);
+    if (vmbus_channel_recv_start(chan)) {
+        return false;
+    }
+
+    while ((req = vmbus_channel_recv_peek(chan, sizeof(*req)))) {
+        hv_balloon_handle_packet(balloon, req, stdesc);
+        vmbus_free_req(req);
+        vmbus_channel_recv_pop(chan);
+
+        if (stdesc->state != S_NO_CHANGE) {
+            break;
+        }
+    }
+
+    return vmbus_channel_recv_done(chan) > 0;
+}
+
+static bool hv_balloon_event_loop_state(HvBalloon *balloon)
+{
+    StateDesc state_new = HV_BALLOON_STATE_DESC_INIT;
+
+    hv_balloon_handle_state(balloon, &state_new);
+    return hv_balloon_state_set(balloon, state_new.state, state_new.desc);
+}
+
+static bool hv_balloon_event_loop_recv(HvBalloon *balloon)
+{
+    StateDesc state_new = HV_BALLOON_STATE_DESC_INIT;
+    bool any_recv, state_changed;
+
+    any_recv = hv_balloon_recv_channel(balloon, &state_new);
+    state_changed = hv_balloon_state_set(balloon,
+                                         state_new.state, state_new.desc);
+
+    return state_changed || any_recv;
+}
+
+static void hv_balloon_event_loop(HvBalloon *balloon)
+{
+    bool state_repeat, recv_repeat;
+
+    do {
+        state_repeat = hv_balloon_event_loop_state(balloon);
+        recv_repeat = hv_balloon_event_loop_recv(balloon);
+    } while (state_repeat || recv_repeat);
+}
+
+void qmp_hv_balloon_add_memory(const char *id, Error **errp)
+{
+    HvBalloon *balloon;
+    uint64_t align;
+    g_autofree gchar *align_str = NULL;
+    g_autoptr(QDict) qdict = NULL;
+    g_autoptr(DeviceState) dev = NULL;
+    HAPVDIMMDevice *hapvdimm;
+    PageRange range;
+    HAPVDIMMRange *hpr;
+
+    balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL));
+    if (!balloon) {
+        error_setg(errp, "no %s device present", TYPE_HV_BALLOON);
+        return;
+    }
+
+    if (hv_balloon_state_is_init(balloon)) {
+        error_setg(errp, "no guest attached to the DM protocol yet");
+        return;
+    }
+
+    if (!balloon->caps.cap_bits.hot_add) {
+        error_setg(errp,
+                   "the current DM protocol guest has no support for memory hot add");
+        return;
+    }
+
+    /* add device */
+    qdict = qdict_new();
+    qdict_put_str(qdict, "driver", TYPE_HAPVDIMM);
+    qdict_put_str(qdict, HAPVDIMM_MEMDEV_PROP, id);
+
+    align = (1 << balloon->caps.cap_bits.hot_add_alignment) * MiB;
+    align_str = g_strdup_printf("%" PRIu64, align);
+    qdict_put_str(qdict, HAPVDIMM_ALIGN_PROP, align_str);
+
+    hapvdimm_allow_adding();
+    dev = qdev_device_add_from_qdict(qdict, false, errp);
+    hapvdimm_disallow_adding();
+    if (!dev) {
+        return;
+    }
+
+    hapvdimm = HAPVDIMM(dev);
+
+    hapvdimm_get_range(hapvdimm, &range);
+    if (page_range_tree_intree_any(balloon->removed_guest,
+                                   range.start, range.count) ||
+        page_range_tree_intree_any(balloon->removed_both,
+                                   range.start, range.count)) {
+        error_setg(errp,
+                   "some of the device new pages were already returned by the guest. this should not happen, please reboot the guest and try again");
+        return;
+    }
+
+    trace_hv_balloon_hapvdimm_range_add(range.count, range.start);
+
+    hpr = hapvdimm_tree_insert_new(balloon, hapvdimm);
+
+    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
+
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_notify_cb(VMBusChannel *chan)
+{
+    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
+
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_stat(void *opaque, BalloonInfo *info)
+{
+    HvBalloon *balloon = opaque;
+    info->actual = (hv_balloon_total_ram(balloon) - balloon->removed_both_ctr)
+        << HV_BALLOON_PFN_SHIFT;
+}
+
+static void hv_balloon_to_target(void *opaque, ram_addr_t target)
+{
+    HvBalloon *balloon = opaque;
+    uint64_t target_pages = target >> HV_BALLOON_PFN_SHIFT;
+
+    if (!target_pages) {
+        return;
+    }
+
+    /*
+     * always set target_changed, even if the target is unchanged, as the
+     * user might be asking us to try reaching it again
+     */
+    balloon->target = target_pages;
+    balloon->target_changed = true;
+
+    hv_balloon_event_loop(balloon);
+}
+
+static int hv_balloon_open_channel(VMBusChannel *chan)
+{
+    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
+
+    if (balloon->state != S_CLOSED) {
+        warn_report("guest trying to open a DM channel in invalid %d state",
+                    balloon->state);
+        return -EINVAL;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_VERSION);
+    hv_balloon_event_loop(balloon);
+
+    return 0;
+}
+
+static void hv_balloon_close_channel(VMBusChannel *chan)
+{
+    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
+
+    timer_del(&balloon->post_init_timer);
+
+    HV_BALLOON_SET_STATE(balloon, S_WAIT_RESET);
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_post_init_timer(void *opaque)
+{
+    HvBalloon *balloon = opaque;
+
+    if (balloon->state != S_POST_INIT_WAIT) {
+        return;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_IDLE);
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_system_reset(void *opaque)
+{
+    HvBalloon *balloon = HV_BALLOON(opaque);
+
+    if (!balloon->hapvdimms_del_todo) {
+        return;
+    }
+
+    if (balloon->del_todo_process_timer) {
+        return;
+    }
+
+    balloon->del_todo_process_timer = g_idle_add(del_todo_process_timer,
+                                                 balloon);
+}
+
+static void hv_balloon_dev_realize(VMBusDevice *vdev, Error **errp)
+{
+    ERRP_GUARD();
+    HvBalloon *balloon = HV_BALLOON(vdev);
+    int ret;
+
+    /* used by hv_balloon_stat() */
+    balloon->hapvdimms = hapvdimm_tree_new();
+    balloon->state = S_WAIT_RESET;
+
+    ret = qemu_add_balloon_handler(hv_balloon_to_target, hv_balloon_stat,
+                                   balloon);
+    if (ret < 0) {
+        /* This also protects against having multiple hv-balloon instances */
+        error_setg(errp, "Only one balloon device is supported");
+        goto ret_tree;
+    }
+
+    timer_init_ms(&balloon->post_init_timer, QEMU_CLOCK_VIRTUAL,
+                  hv_balloon_post_init_timer, balloon);
+
+    qemu_register_reset(hv_balloon_system_reset, balloon);
+
+    return;
+
+ret_tree:
+    hapvdimm_tree_destroy(&balloon->hapvdimms);
+}
+
+static void hv_balloon_reset_destroy_common(HvBalloon *balloon)
+{
+    ha_todo_clear(balloon);
+    del_todo_add_all(balloon);
+}
+
+static void hv_balloon_dev_reset(VMBusDevice *vdev)
+{
+    HvBalloon *balloon = HV_BALLOON(vdev);
+
+    page_range_tree_destroy(&balloon->removed_guest);
+    page_range_tree_destroy(&balloon->removed_both);
+    balloon->removed_guest = page_range_tree_new();
+    balloon->removed_both = page_range_tree_new();
+
+    hv_balloon_reset_destroy_common(balloon);
+
+    balloon->trans_id = 0;
+    balloon->removed_guest_ctr = 0;
+    balloon->removed_both_ctr = 0;
+
+    HV_BALLOON_SET_STATE(balloon, S_CLOSED);
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_dev_unrealize(VMBusDevice *vdev)
+{
+    HvBalloon *balloon = HV_BALLOON(vdev);
+
+    qemu_unregister_reset(hv_balloon_system_reset, balloon);
+
+    hv_balloon_reset_destroy_common(balloon);
+
+    del_todo_process(balloon);
+    assert(!balloon->del_todo_process_timer);
+
+    qemu_remove_balloon_handler(balloon);
+
+    page_range_tree_destroy(&balloon->removed_guest);
+    page_range_tree_destroy(&balloon->removed_both);
+    hapvdimm_tree_destroy(&balloon->hapvdimms);
+}
+
+static Property hv_balloon_properties[] = {
+    DEFINE_PROP_BOOL("status-report", HvBalloon,
+                     status_reports, false),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void hv_balloon_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VMBusDeviceClass *vdc = VMBUS_DEVICE_CLASS(klass);
+
+    device_class_set_props(dc, hv_balloon_properties);
+    qemu_uuid_parse(HV_BALLOON_GUID, &vdc->classid);
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    vdc->vmdev_realize = hv_balloon_dev_realize;
+    vdc->vmdev_unrealize = hv_balloon_dev_unrealize;
+    vdc->vmdev_reset = hv_balloon_dev_reset;
+    vdc->open_channel = hv_balloon_open_channel;
+    vdc->close_channel = hv_balloon_close_channel;
+    vdc->chan_notify_cb = hv_balloon_notify_cb;
+}
+
+static const TypeInfo hv_balloon_type_info = {
+    .name = TYPE_HV_BALLOON,
+    .parent = TYPE_VMBUS_DEVICE,
+    .instance_size = sizeof(HvBalloon),
+    .class_init = hv_balloon_class_init,
+};
+
+static void hv_balloon_register_types(void)
+{
+    type_register_static(&hv_balloon_type_info);
+}
+
+type_init(hv_balloon_register_types)
diff --git a/hw/hyperv/meson.build b/hw/hyperv/meson.build
index b43f119ea5..212e0ce51e 100644
--- a/hw/hyperv/meson.build
+++ b/hw/hyperv/meson.build
@@ -2,3 +2,4 @@ specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'))
 specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c'))
 specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c'))
 specific_ss.add(when: 'CONFIG_SYNDBG', if_true: files('syndbg.c'))
+specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c'))
diff --git a/hw/hyperv/trace-events b/hw/hyperv/trace-events
index b4c35ca8e3..3b98ac3689 100644
--- a/hw/hyperv/trace-events
+++ b/hw/hyperv/trace-events
@@ -16,3 +16,19 @@ vmbus_gpadl_torndown(uint32_t gpadl_id) "gpadl #%d"
 vmbus_open_channel(uint32_t chan_id, uint32_t gpadl_id, uint32_t target_vp) "channel #%d gpadl #%d target vp %d"
 vmbus_channel_open(uint32_t chan_id, uint32_t status) "channel #%d status %d"
 vmbus_close_channel(uint32_t chan_id) "channel #%d"
+
+# hv-balloon
+hv_balloon_state_change(const char *tostr) "-> %s"
+hv_balloon_incoming_version(uint16_t major, uint16_t minor) "incoming proto version %u.%u"
+hv_balloon_incoming_caps(uint32_t caps) "incoming caps 0x%x"
+hv_balloon_outgoing_unballoon(uint32_t trans_id, uint64_t count, uint64_t start, uint64_t rempages) "posting unballoon %"PRIu32" for %"PRIu64" @ 0x%"PRIx64", remaining %"PRIu64
+hv_balloon_incoming_unballoon(uint32_t trans_id) "incoming unballoon response %"PRIu32
+hv_balloon_outgoing_hot_add(uint32_t trans_id, uint64_t count, uint64_t start) "posting hot add %"PRIu32" for %"PRIu64" @ 0x%"PRIx64
+hv_balloon_incoming_hot_add(uint32_t trans_id, uint32_t result, uint32_t count) "incoming hot add response %"PRIu32", result %"PRIu32", count %"PRIu32
+hv_balloon_outgoing_balloon(uint32_t trans_id, uint64_t count, uint64_t rempages) "posting balloon %"PRIu32" for %"PRIu64", remaining %"PRIu64
+hv_balloon_incoming_balloon(uint32_t trans_id, uint32_t range_count, uint32_t more_pages) "incoming balloon response %"PRIu32", ranges %"PRIu32", more %"PRIu32
+hv_balloon_hapvdimm_range_add(uint64_t count, uint64_t start) "adding hapvdimm range %"PRIu64" @ 0x%"PRIx64
+hv_balloon_remove_response(uint64_t count, uint64_t start, unsigned int both) "processing remove response range %"PRIu64" @ 0x%"PRIx64", both %u"
+hv_balloon_remove_response_hole(uint64_t counthole, uint64_t starthole, uint64_t countrange, uint64_t startrange, uint64_t starthpr, unsigned int both) "response range hole %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64", before hpr start 0x%"PRIx64", both %u"
+hv_balloon_remove_response_common(uint64_t countcommon, uint64_t startcommon, uint64_t countrange, uint64_t startrange, uint64_t counthpr, uint64_t starthpr, uint64_t removed, unsigned int both) "response common range %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64" with hpr %"PRIu64" @ 0x%"PRIx64", removed %"PRIu64", both %u"
+hv_balloon_remove_response_remainder(uint64_t count, uint64_t start, unsigned int both) "remove response remaining range %"PRIu64" @ 0x%"PRIx64", both %u"
diff --git a/meson.build b/meson.build
index 6cb2b1a42f..2d9c01b6ec 100644
--- a/meson.build
+++ b/meson.build
@@ -2550,7 +2550,8 @@ host_kconfig = \
   ('CONFIG_LINUX' in config_host ? ['CONFIG_LINUX=y'] : []) + \
   (have_pvrdma ? ['CONFIG_PVRDMA=y'] : []) + \
   (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \
-  (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : [])
+  (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : []) + \
+  ('CONFIG_HV_BALLOON_POSSIBLE' in config_host ? ['CONFIG_HV_BALLOON_POSSIBLE=y'] : [])
 
 ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ]
 
@@ -4027,6 +4028,7 @@ summary_info += {'libudev':           libudev}
 summary_info += {'FUSE lseek':        fuse_lseek.found()}
 summary_info += {'selinux':           selinux}
 summary_info += {'libdw':             libdw}
+summary_info += {'hv-balloon support': config_host.has_key('CONFIG_HV_BALLOON_POSSIBLE')}
 summary(summary_info, bool_yn: true, section: 'Dependencies')
 
 if not supported_cpus.contains(cpu)
diff --git a/qapi/machine.json b/qapi/machine.json
index b9228a5e46..04ff95337a 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -1104,6 +1104,74 @@
 { 'event': 'BALLOON_CHANGE',
   'data': { 'actual': 'int' } }
 
+##
+# @hv-balloon-add-memory:
+#
+# Hot-add a memory backend via the Hyper-V Dynamic Memory Protocol.
+#
+# @id: the name of the memory backend object to hot-add
+#
+# Returns: Nothing on success
+#          Error if there is no guest connected with hot-add capability,
+#          @id is not a valid memory backend, or it is already in use.
+#
+# Since: TBD
+#
+# Example:
+#
+# -> { "execute": "hv-balloon-add-memory", "arguments": { "id": "mb1" } }
+# <- { "return": {} }
+#
+##
+{ 'command': 'hv-balloon-add-memory', 'data': {'id': 'str'} }
+
+##
+# @HV_BALLOON_STATUS_REPORT:
+#
+# Emitted when the hv-balloon driver receives a "STATUS" message from
+# the guest.
+#
+# @committed: the amount of memory in use inside the guest plus the
+#             amount of memory unusable inside the guest (ballooned out,
+#             offline, etc.)
+#
+# @available: the amount of memory inside the guest available for new
+#             allocations ("free")
+#
+# Since: TBD
+#
+# Example:
+#
+# <- { "event": "HV_BALLOON_STATUS_REPORT",
+#      "data": { "commited": 816640000, "available": 3333054464 },
+#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
+#
+##
+{ 'event': 'HV_BALLOON_STATUS_REPORT',
+  'data': { 'committed': 'size', 'available': 'size' } }
+
+##
+# @HV_BALLOON_MEMORY_BACKEND_UNUSED:
+#
+# Emitted when the hv-balloon driver marks a memory backend object
+# unused so it can now be removed, if required.
+#
+# This can happen because the VM was restarted.
+#
+# @id: the memory backend object id
+#
+# Since: TBD
+#
+# Example:
+#
+# <- { "event": "HV_BALLOON_MEMORY_BACKEND_UNUSED",
+#      "data": { "id": "mb1" },
+#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
+#
+##
+{ 'event': 'HV_BALLOON_MEMORY_BACKEND_UNUSED',
+  'data': { 'id': 'str' } }
+
 ##
 # @MemoryInfo:
 #



* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-24 21:41 ` [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols Maciej S. Szmigiero
@ 2023-02-27 15:25   ` David Hildenbrand
  2023-02-28 14:14     ` Maciej S. Szmigiero
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-27 15:25 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel

On 24.02.23 22:41, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This device works like a virtual DIMM stick: it allows inserting extra RAM

All DIMMs in QEMU are virtual. What you want is a piece of memory that
does not get exposed via ACPI or similar (and doesn't follow the
traditional "slots" concept).

> into the guest at run time and later removing it without having to
> duplicate all of the address space management logic of TYPE_MEMORY_DEVICE
> in each memory hot-add protocol driver.

... which are these? virtio-mem and virtio-pmem do their own thing for 
good reasons. You're adding it for HV.

I don't think there is demand for a generic device. In fact, I have no 
idea what "HAPVDIMM" should actually mean.

If you really need such a device after we discussed the alternatives, 
please keep it hv-specific.

> 
> This device is not meant to be instantiated or removed by the QEMU user
> directly: rather, the protocol driver is supposed to add and remove it as
> required.

That sounds like the wrong approach to me. More on that below.

> 
> In fact, its very existence is supposed to be an implementation detail,
> transparent to the QEMU user.
> 
> To prevent the user from accidentally manually creating an instance of this
> device the protocol driver is supposed to place the qdev_device_add*() call
> (that is uses to add this device) between hapvdimm_allow_adding() and
> hapvdimm_disallow_adding() calls in order to temporary authorize the
> operation.
> 

The most important part first: the realize function of a device is not 
supposed to assign itself any resources. Calling memory device (un)plug 
functions from the realize function is wrong.

(Hot)plug handlers are the right approach for that. Please refer to how 
we chain hotplug handlers (machine hotplug handler -> bus hotplug 
handler) to implement virtio-mem and virtio-pmem. These hotplug handlers 
would also be the place to reject a device if created by the user, 
for example.
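
For illustration, a minimal sketch of such chaining with the memory
device API, following the virtio-mem pattern (the machine_hvb_* names
and wiring them up for TYPE_HV_BALLOON are hypothetical, not existing
code; includes and error handling omitted):

static void machine_hvb_pre_plug(HotplugHandler *hotplug_dev,
                                 DeviceState *dev, Error **errp)
{
    /* reserve a contiguous range in the machine's device memory */
    memory_device_pre_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev),
                           NULL, errp);
}

static void machine_hvb_plug(HotplugHandler *hotplug_dev,
                             DeviceState *dev, Error **errp)
{
    /* map the device's memory region at the reserved address */
    memory_device_plug(MEMORY_DEVICE(dev), MACHINE(hotplug_dev));
}

A user-created instance would then be rejected in such a pre_plug
callback instead of via the allow/disallow toggle.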


But before we dive into the details of that, I wonder if you could just 
avoid having a memory device for each block of memory you want to add.


An alternative might be the following:

Have a hv-balloon device be a memory device with a configured maximum 
size and a memory device region container. Let the machine hotplug 
handler assign a contiguous region in the device memory region and map 
the memory device region container (while plugging that hv-balloon 
device), just like we do it for virtio-mem and virtio-pmem.

In essence, you reserve a region in physical address space that way and 
can decide what to (un)map into that memory device region container; 
you do your own placement.

So when instructed to add a new memory backend, you simply assign an 
address in the assigned region yourself, and map the memory backend 
memory region into the device memory region container.
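
(A minimal sketch of that flow, assuming the container region has
already been mapped by the machine at plug time; the offset handling is
simplified and skips alignment and overlap checks:)

    /* map a new backend at a device-chosen offset inside the
     * hv-balloon memory device region container */
    static void hv_balloon_map_backend(MemoryRegion *container,
                                       HostMemoryBackend *backend,
                                       uint64_t *next_offset)
    {
        MemoryRegion *mr = host_memory_backend_get_memory(backend);

        memory_region_add_subregion(container, *next_offset, mr);
        *next_offset += memory_region_size(mr);
    }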

The only catch is that the memory device (hv-balloon) will then consume 
multiple memslots (one for each memory backend); right now we only 
support 1 memslot (e.g., asking if one more slot is free when plugging 
the device).

I'm adding support for that right now to implement a virtio-mem 
extension -- the memory device says how many memslots it requires, and 
these will get reserved for that memory device; the memory device can 
then consume them later without further checks dynamically. That 
approach could be extended to increase/decrease the memslot requirement 
(the device would ask to increase/decrease its limit), if ever required.


-- 
Thanks,

David / dhildenb




* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-27 15:25   ` David Hildenbrand
@ 2023-02-28 14:14     ` Maciej S. Szmigiero
  2023-02-28 15:02       ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-28 14:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 27.02.2023 16:25, David Hildenbrand wrote:
> On 24.02.23 22:41, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This device works like a virtual DIMM stick: it allows inserting extra RAM
> 
> All DIMMs in QEMU are virtual. What you want is a piece of memory that does not get exposed via ACPI or similar (and doesn't follow the traditional "slots" concept).

Right.

>> into the guest at run time and later removing it without having to
>> duplicate all of the address space management logic of TYPE_MEMORY_DEVICE
>> in each memory hot-add protocol driver.
> 
> ... which are these? virtio-mem and virtio-pmem do their own thing for good reasons. You're adding it for HV.
> 
> I don't think there is demand for a generic device. In fact, I have no idea what "HAPVDIMM" should actually mean.
> 
> If you really need such a device after we discussed the alternatives, please keep it hv-specific.

No problem, the device can be made hv-specific - at least until another use
for it is found (if it is found).

>>
>> This device is not meant to be instantiated or removed by the QEMU user
>> directly: rather, the protocol driver is supposed to add and remove it as
>> required.
> 
> That sounds like the wrong approach to me. More on that below.
> 
>>
>> In fact, its very existence is supposed to be an implementation detail,
>> transparent to the QEMU user.
>>
>> To prevent the user from accidentally manually creating an instance of this
>> device the protocol driver is supposed to place the qdev_device_add*() call
>> (that it uses to add this device) between hapvdimm_allow_adding() and
>> hapvdimm_disallow_adding() calls in order to temporarily authorize the
>> operation.
>>
> 
> The most important part first: the realize function of a device is not supposed to assign itself any resources. Calling memory device (un)plug functions from the realize function is wrong.
> 
> (Hot)plug handlers are the right approach for that. Please refer to how we chain hotplug handlers (machine hotplug handler -> bus hotplug handler) to implement virtio-mem and virtio-pmem. These hotplug handlers would also be the place where to reject a device if created by the user, for example.
> 

That was more or less the approach that v1 of this driver took:
The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices,
whatever one calls them) explicitly via the machine hotplug handler
(using the device_add command).

At that time you said [1] that:
> 1) I dislike that an external entity has to do vDIMM adaptions /
> ballooning adaptions when rebooting or when wanting to resize a guest.

because:
> Once you have the current approach upstream (vDIMMs, ballooning),
> there is no easy way to change that later (requires deprecating, etc.).

That's why this version hides these vDIMMs.
Instead, the QEMU manager (user) directly provides the raw memory
backend device (for example, memory-backend-ram) to the driver via a QMP
command.
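
(On the wire that is roughly the following -- using the backend id
"mem1" from the cover letter's example; the second command is the new
one added by this series:)

    { "execute": "object-add",
      "arguments": { "qom-type": "memory-backend-ram",
                     "id": "mem1", "size": 4294967296 } }
    { "execute": "hv-balloon-add-memory",
      "arguments": { "id": "mem1" } }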

Since now the user is not expected to touch these vDIMMs directly in any
way, these become an implementation detail that can be changed or even
removed if needed at some point, without affecting the existing users.

> But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add.
> 
> 
> An alternative might be the following:
> 
> Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem.
> 
> In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container, you do your own placement.
> 
> So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container.
> 
> The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend), right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device).
> 
> 
Technically in this case a "main" hv-balloon device is still needed -
in contrast with virtio-mem (which allows multiple instances) there can
be only one Dynamic Memory protocol provider on the VMBus.

That means these "container" sub-devices would need to register with that
main hv-balloon device.

However, I'm not sure what exactly is gained by this approach.

These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface
so they are accounted for properly (the alternative would be to patch
the relevant QEMU code all over the place - that's probably why
virtio-mem also implements this interface instead).

One still needs some QMP command to add a raw memory backend to
the chosen "container" hv-balloon sub-device.

Since now the QEMU manager (user) is aware of the presence of these
"container" sub-devices, and has to manage them, changing the QEMU
interface in the future is more complex (as you said in [1]).

I understand that virtio-mem uses a similar approach, however that's
because the virtio-mem protocol itself works that way.

> I'm adding support for that right now to implement a virtio-mem
> extension -- the memory device says how many memslots it requires,
> and these will get reserved for that memory device; the memory device
> can then consume them later without further checks dynamically. That
> approach could be extended to increase/decrease the memslot
> requirement (the device would ask to increase/decrease its limit),
> if ever required.

In terms of future virtio-mem things I'm also eagerly waiting for an
ability to set a removed virtio-mem block read-only (or not covered by
any memslot) - this most probably could be reused later for implementing
the same functionality in this driver.

Thanks,
Maciej

[1]: https://lore.kernel.org/qemu-devel/28ab7005-c31c-239e-4659-e5287f4c5468@redhat.com/




* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-28 14:14     ` Maciej S. Szmigiero
@ 2023-02-28 15:02       ` David Hildenbrand
  2023-02-28 21:27         ` Maciej S. Szmigiero
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-28 15:02 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

> 
> That was more or less the approach that v1 of this driver took:
> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices,
> whatever one calls them) explicitly via the machine hotplug handler
> (using the device_add command).
> 
> At that time you said [1] that:
>> 1) I dislike that an external entity has to do vDIMM adaptions /
>> ballooning adaptions when rebooting or when wanting to resize a guest.
> 
> because:
>> Once you have the current approach upstream (vDIMMs, ballooning),
>> there is no easy way to change that later (requires deprecating, etc.).
> 
> That's why this version hides these vDIMMs.

Note that I don't have really strong feelings about letting the user 
hotplug devices. My comment was in general about user interactions when 
adding/removing memory or when rebooting the VM. As soon as you use 
individual memory blocks and/or devices, we end up with a similar user 
experience as we have already with DIMMs+virtio-balloon (bad IMHO).

Hiding the devices internally might make it a little bit easier to use, 
but it's still the same underlying concept: to add more memory you have 
to figure out whether to deflate the balloon or whether to add a new 
memory backend. What memory backends will remain when we reboot? When 
can we remove memory backends?

But that's just about the user interaction in general. My comment here 
was about the hidden devices: they have to go through plug handlers to 
get resources assigned, not self-assign resources in the realize function.


Note that virtio-mem uses a single sparse memory backend to make 
resizing easier (well, and to handle migration and some other things 
easier). But it comes with other things that require optimization. Using 
multiple memslots to expose memory to the VM is one optimization I'm 
working on. Resizable memory backends are another one.

I think you could implement the memory adding part similar to 
virtio-mem, and simply have a large sparse memory backend, from which 
you expose new memory to the VM as you please. And you could even use 
multiple memslots for that. But that's your design decision, and I won't 
argue with that, just pointing that out.
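
(A rough sketch of that variant, under the assumption of one large
sparse backend; the alias object would be allocated per exposed range,
and hv_balloon_expose_range() itself is hypothetical:)

    /* make just [offset, offset + size) of the sparse backend
     * guest-visible by aliasing it into the device's container */
    static void hv_balloon_expose_range(MemoryRegion *container,
                                        MemoryRegion *backend_mr,
                                        MemoryRegion *alias,
                                        uint64_t offset, uint64_t size)
    {
        memory_region_init_alias(alias, NULL, "hv-balloon-slice",
                                 backend_mr, offset, size);
        memory_region_add_subregion(container, offset, alias);
    }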


> Instead, the QEMU manager (user) directly provides the raw memory
> backend device (for example, memory-backend-ram) to the driver via a QMP
> command.

Yes, that's what I understood.

> 
> Since now the user is not expected to touch these vDIMMs directly in any
> way these become an implementation detail than can be changed or even
> removed if needed at some point, without affecting the existing users.
> 
>> But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add.
>>
>>
>> An alternative might be the following:
>>
>> Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem.
>>
>> In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container, you do your own placement.
>>
>> So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container.
>>
>> The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend), right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device).
>>
>>
> Technically in this case a "main" hv-balloon device is still needed -
> in contrast with virtio-mem (which allows multiple instances) there can
> be only one Dynamic Memory protocol provider on the VMBus.

Yes, just like virtio-balloon. There cannot be multiple instances.

> 
> That means these "container" sub-devices would need to register with that
> main hv-balloon device.
> 

My question is whether they really have to be devices. Why wouldn't it 
be sufficient to map the memory backends directly into the container?

> However, I'm not sure what is exactly gained by this approach.
> 
> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface

No, they wouldn't unless I am missing something. Only the hv-balloon 
device would be a TYPE_MEMORY_DEVICE.

> so they are accounted for properly (the alternative would be to patch
> the relevant QEMU code all over the place - that's probably why
> virtio-mem also implements this interface instead).

Please elaborate, I don't understand what you are trying to say here. 
Memory devices provide hooks, and the hooks exist for a reason -- 
because memory devices are no longer simple DIMMs/NVDIMMs. And 
virtio-mem + virtio-pmem were responsible for adding some of these hooks.
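
(For reference, wiring those hooks up looks roughly like this; the hook
names match include/hw/mem/memory-device.h of this era, while the
hv_balloon_md_*() implementations are hypothetical:)

    static void hv_balloon_md_class_init(ObjectClass *oc)
    {
        MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);

        /* where the device sits in guest physical address space */
        mdc->get_addr = hv_balloon_md_get_addr;
        mdc->set_addr = hv_balloon_md_set_addr;
        /* how much of it is currently exposed to the guest */
        mdc->get_plugged_size = hv_balloon_md_get_plugged_size;
        /* the region container the machine maps at plug time */
        mdc->get_memory_region = hv_balloon_md_get_memory_region;
    }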

> 
> One still needs some QMP command to add a raw memory backend to
> the chosen "container" hv-balloon sub-device.

If you go with multiple memory backends, yes.

> 
> Since now the QEMU manager (user) is aware of the presence of these
> "container" sub-devices, and has to manage them, changing the QEMU
> interface in the future is more complex (as you said in [1]).

Can you elaborate? Yes, when you design the feature around "multiple 
memory backends", you'll have to have an interface to add them. Well, 
and to query them during migration. And, maybe also to detect when to 
remove some (migration)?


> 
> I understand that virtio-mem uses a similar approach, however that's
> because the virtio-mem protocol itself works that way.
> 
>> I'm adding support for that right now to implement a virtio-mem
>> extension -- the memory device says how many memslots it requires,
>> and these will get reserved for that memory device; the memory device
>> can then consume them later without further checks dynamically. That
>> approach could be extended to increase/decrease the memslot
>> requirement (the device would ask to increase/decrease its limit),
>> if ever required.
> 
> In terms of future virtio-mem things I'm also eagerly waiting for an
> ability to set a removed virtio-mem block read-only (or not covered by
> any memslot) - this most probably could be reused later for implementing
> the same functionality in this driver.

In contrast to setting them read-only, the memslots that contain no 
plugged blocks anymore will be completely removed. The goal is to not 
consume any metadata overhead in KVM (and also to take one step in the 
direction of protecting unplugged memory from getting reallocated).

-- 
Thanks,

David / dhildenb




* Re: [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2023-02-24 21:41 ` [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
@ 2023-02-28 16:18   ` Igor Mammedov
  2023-02-28 17:12     ` David Hildenbrand
  2023-02-28 17:34   ` Daniel P. Berrangé
  1 sibling, 1 reply; 17+ messages in thread
From: Igor Mammedov @ 2023-02-28 16:18 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

On Fri, 24 Feb 2023 22:41:16 +0100
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> wrote:

> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This driver is like virtio-balloon on steroids: it allows both changing the
> guest memory allocation via ballooning and inserting extra RAM into it by
> adding required memory backends and providing them to the driver.


this sounds pretty much like what virtio-mem does, modulo the protocol used.
Would it be too crazy to ask to reuse virtio-mem by teaching it the new
protocol, avoiding a new device and the mgmt hurdles that virtio-mem has
already solved?


> One of the advantages of these over ACPI-based PC DIMM hotplug is that such
> memory can be hotplugged at a much smaller granularity because the ACPI DIMM
> slot limit does not apply.
> 
> Hot-adding additional memory is done by creating a new memory backend (for
> example by executing HMP command
> "object_add memory-backend-ram,id=mem1,size=4G"), then executing a new
> "hv-balloon-add-memory" QMP command, providing the id of that memory
> backend as the "id" parameter.
> 
> In contrast with ACPI DIMM hotplug, where one can only request to unplug a
> whole DIMM stick, this driver allows removing memory from the guest in
> single page (4k) units via ballooning.
> 
> After a VM reboot each previously hot-added memory backend gets released.
> A "HV_BALLOON_MEMORY_BACKEND_UNUSED" QMP event is emitted in this case so
> the software controlling QEMU knows that it either needs to delete that
> memory backend (if no longer needed) or re-insert it.
> 
> In the future, the guest boot memory size might be changed on reboot
> instead, taking into account the effective size that VM had before that
> reboot (much like Hyper-V does).
> 
> For performance reasons, the guest-released memory is tracked in a few
> range trees, as a series of (start, count) ranges.
> Each time a new page range is inserted into such tree its neighbors are
> checked as candidates for possible merging with it.
> 
> Besides performance reasons, the Dynamic Memory protocol itself uses page
> ranges as the data structure in its messages, so relevant pages need to be
> merged into such ranges anyway.
> 
> One has to be careful when tracking the guest-released pages, since the
> guest can maliciously report returning pages outside its current address
> space, which later clash with the address range of newly added memory.
> Similarly, the guest can report freeing the same page twice.
> 
> The above design results in much better ballooning performance than when
> using virtio-balloon with the same guest: 230 GB / minute with this driver
> versus 70 GB / minute with virtio-balloon.
> 
> During a ballooning operation most of time is spent waiting for the guest
> to come up with newly freed page ranges, processing the received ranges on
> the host side (in QEMU and KVM) is nearly instantaneous.
> 
> The unballoon operation is also pretty much instantaneous:
> thanks to the merging of the ballooned out page ranges 200 GB of memory can
> be returned to the guest in about 1 second.
> With virtio-balloon this operation takes about 2.5 minutes.
> 
> These tests were done against a Windows Server 2019 guest running on a
> Xeon E5-2699, after dirtying the whole memory inside guest before each
> balloon operation.
> 
> Using a range tree instead of a bitmap to track the removed memory also
> means that the solution scales well with the guest size: even a 1 TB range
> takes just a few bytes of memory.
> 
> Since the required GTree operations aren't present in every Glib version
> a check for them was added to "configure" script, together with new
> "--enable-hv-balloon" and "--disable-hv-balloon" arguments.
> If these GTree operations are missing in the system's Glib version this
> driver will be skipped during QEMU build.
> 
> An optional "status-report=on" device parameter requests memory status
> events from the guest (typically sent every second), which allow the host
> to learn both the guest memory available and the guest memory in use
> counts.
> They are emitted externally as "HV_BALLOON_STATUS_REPORT" QMP events.
> 
> The driver is named hv-balloon since the Linux kernel client driver for
> the Dynamic Memory Protocol is named as such and to follow the naming
> pattern established by the virtio-balloon driver.
> The whole protocol runs over Hyper-V VMBus.
> 
> The driver was tested against Windows Server 2012 R2, Windows Server 2016
> and Windows Server 2019 guests and obeys the guest alignment requirements
> reported to the host via DM_CAPABILITIES_REPORT message.
> 
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  Kconfig.host           |    3 +
>  configure              |   36 +
>  hw/hyperv/Kconfig      |    5 +
>  hw/hyperv/hv-balloon.c | 2185 ++++++++++++++++++++++++++++++++++++++++
>  hw/hyperv/meson.build  |    1 +
>  hw/hyperv/trace-events |   16 +
>  meson.build            |    4 +-
>  qapi/machine.json      |   68 ++
>  8 files changed, 2317 insertions(+), 1 deletion(-)
>  create mode 100644 hw/hyperv/hv-balloon.c
> 
> diff --git a/Kconfig.host b/Kconfig.host
> index d763d89269..2ee71578f3 100644
> --- a/Kconfig.host
> +++ b/Kconfig.host
> @@ -46,3 +46,6 @@ config FUZZ
>  config VFIO_USER_SERVER_ALLOWED
>      bool
>      imply VFIO_USER_SERVER
> +
> +config HV_BALLOON_POSSIBLE
> +    bool
> diff --git a/configure b/configure
> index cf6db3d551..b534955f58 100755
> --- a/configure
> +++ b/configure
> @@ -283,6 +283,7 @@ bsd_user=""
>  pie=""
>  coroutine=""
>  plugins="$default_feature"
> +hv_balloon="$default_feature"
>  meson=""
>  ninja=""
>  bindir="bin"
> @@ -866,6 +867,10 @@ for opt do
>    ;;
>    --disable-vfio-user-server) vfio_user_server="disabled"
>    ;;
> +  --enable-hv-balloon) hv_balloon=yes
> +  ;;
> +  --disable-hv-balloon) hv_balloon=no
> +  ;;
>    # everything else has the same name in configure and meson
>    --*) meson_option_parse "$opt" "$optarg"
>    ;;
> @@ -1019,6 +1024,7 @@ cat << EOF
>    debug-info      debugging information
>    safe-stack      SafeStack Stack Smash Protection. Depends on
>                    clang/llvm and requires coroutine backend ucontext.
> +  hv-balloon      hv-balloon driver where supported (requires Glib 2.68+ GTree API)
>  
>  NOTE: The object files are built at the place where configure is launched
>  EOF
> @@ -1740,6 +1746,32 @@ EOF
>    fi
>  fi
>  
> +##########################################
> +# check for hv-balloon
> +
> +if test "$hv_balloon" != "no"; then
> +	cat > $TMPC << EOF
> +#include <string.h>
> +#include <gmodule.h>
> +int main(void) {
> +    GTree *tree;
> +
> +    tree = g_tree_new((GCompareFunc)strcmp);
> +    (void)g_tree_node_first(tree);
> +    g_tree_destroy(tree);
> +    return 0;
> +}
> +EOF
> +	if compile_prog "$glib_cflags" "$glib_libs" ; then
> +		hv_balloon=yes
> +	else
> +		if test "$hv_balloon" = "yes" ; then
> +			feature_not_found "hv-balloon" "Update Glib"
> +		fi
> +		hv_balloon="no"
> +	fi
> +fi
> +
>  ##########################################
>  # functions to probe cross compilers
>  
> @@ -2336,6 +2368,10 @@ if test "$have_tsan" = "yes" && test "$have_tsan_iface_fiber" = "yes" ; then
>      echo "CONFIG_TSAN=y" >> $config_host_mak
>  fi
>  
> +if test "$hv_balloon" = "yes" ; then
> +  echo "CONFIG_HV_BALLOON_POSSIBLE=y" >> $config_host_mak
> +fi
> +
>  if test "$plugins" = "yes" ; then
>      echo "CONFIG_PLUGIN=y" >> $config_host_mak
>  fi
> diff --git a/hw/hyperv/Kconfig b/hw/hyperv/Kconfig
> index fcf65903bd..8f8be1bcce 100644
> --- a/hw/hyperv/Kconfig
> +++ b/hw/hyperv/Kconfig
> @@ -16,3 +16,8 @@ config SYNDBG
>      bool
>      default y
>      depends on VMBUS
> +
> +config HV_BALLOON
> +    bool
> +    default y
> +    depends on HV_BALLOON_POSSIBLE && VMBUS && HAPVDIMM
> diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
> new file mode 100644
> index 0000000000..b11f005189
> --- /dev/null
> +++ b/hw/hyperv/hv-balloon.c
> @@ -0,0 +1,2185 @@
> +/*
> + * QEMU Hyper-V Dynamic Memory Protocol driver
> + *
> + * Copyright (C) 2020-2023 Oracle and/or its affiliates.
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "exec/address-spaces.h"
> +#include "exec/cpu-common.h"
> +#include "exec/memory.h"
> +#include "exec/ramblock.h"
> +#include "hw/boards.h"
> +#include "hw/hyperv/dynmem-proto.h"
> +#include "hw/hyperv/vmbus.h"
> +#include "hw/mem/hapvdimm.h"
> +#include "hw/mem/pc-dimm.h"
> +#include "hw/qdev-core.h"
> +#include "hw/qdev-properties.h"
> +#include "monitor/qdev.h"
> +#include "qapi/error.h"
> +#include "qapi/qapi-commands-machine.h"
> +#include "qapi/qapi-events-machine.h"
> +#include "qapi/qmp/qdict.h"
> +#include "qemu/error-report.h"
> +#include "qemu/module.h"
> +#include "qemu/units.h"
> +#include "qemu/timer.h"
> +#include "sysemu/balloon.h"
> +#include "sysemu/reset.h"
> +#include "trace.h"
> +
> +/*
> + * temporarily avoid warnings about enhanced GTree API usage requiring a
> + * too recent Glib version until GLIB_VERSION_MAX_ALLOWED finally reaches
> + * the Glib version with this API
> + */
> +#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
> +
> +#define TYPE_HV_BALLOON "hv-balloon"
> +#define HV_BALLOON_GUID "525074DC-8985-46e2-8057-A307DC18A502"
> +#define HV_BALLOON_PFN_SHIFT 12
> +#define HV_BALLOON_PAGE_SIZE (1 << HV_BALLOON_PFN_SHIFT)
> +
> +/*
> + * Some Windows versions (at least Server 2019) will crash with various
> + * error codes when receiving DM protocol requests (at least
> + * DM_MEM_HOT_ADD_REQUEST) immediately after boot.
> + *
> + * It looks like Hyper-V from Server 2016 uses a 50-second after-boot
> + * delay, probably to work around this issue, so we'll use this value, too.
> + */
> +#define HV_BALLOON_POST_INIT_WAIT (50 * 1000)
> +
> +#define HV_BALLOON_HA_CHUNK_SIZE (2 * GiB)
> +#define HV_BALLOON_HA_CHUNK_PAGES (HV_BALLOON_HA_CHUNK_SIZE / HV_BALLOON_PAGE_SIZE)
> +
> +#define HV_BALLOON_HR_CHUNK_PAGES 585728
> +/*
> + *                                ^ that's the maximum number of pages
> + * that Windows returns in one hot remove response
> + *
> + * If the number requested is too high Windows will no longer honor
> + * these requests
> + */
> +
> +typedef enum State {
> +    /* not a real state */
> +    S_NO_CHANGE = 0,
> +
> +    S_WAIT_RESET,
> +    S_CLOSED,
> +    S_VERSION,
> +    S_CAPS,
> +    S_POST_INIT_WAIT,
> +    S_IDLE,
> +    S_HOT_ADD_RB_WAIT,
> +    S_HOT_ADD_POSTING,
> +    S_HOT_ADD_REPLY_WAIT,
> +    S_HOT_ADD_SKIP_CURRENT,
> +    S_HOT_ADD_PROCESSED_CLEAR_PENDING,
> +    S_HOT_ADD_PROCESSED_NEXT,
> +    S_HOT_REMOVE,
> +    S_BALLOON_POSTING,
> +    S_BALLOON_RB_WAIT,
> +    S_BALLOON_REPLY_WAIT,
> +    S_UNBALLOON_POSTING,
> +    S_UNBALLOON_RB_WAIT,
> +    S_UNBALLOON_REPLY_WAIT,
> +} State;
> +
> +typedef struct StateDesc {
> +    State state;
> +    const char *desc;
> +} StateDesc;
> +
> +typedef struct PageRange {
> +    uint64_t start;
> +    uint64_t count;
> +} PageRange;
> +
> +/* type safety */
> +typedef struct PageRangeTree {
> +    GTree *t;
> +} PageRangeTree;
> +
> +typedef struct HAPVDIMMRange {
> +    HAPVDIMMDevice *hapvdimm;
> +
> +    PageRange range;
> +    uint64_t used;
> +
> +    /*
> +     * Pages not currently usable due to guest alignment reqs or
> +     * not hot added in the first place
> +     */
> +    uint64_t unused_head, unused_tail;
> +
> +    /* Memory removed from the guest backed by this HAPVDIMM */
> +    PageRangeTree removed_guest, removed_both;
> +} HAPVDIMMRange;
> +
> +/* type safety */
> +typedef struct HAPVDIMMRangeTree {
> +    GTree *t;
> +} HAPVDIMMRangeTree;
> +
> +typedef struct HvBalloon {
> +    VMBusDevice parent;
> +    State state;
> +    bool status_reports;
> +
> +    union dm_version version;
> +    union dm_caps caps;
> +
> +    QEMUTimer post_init_timer;
> +    guint del_todo_process_timer;
> +
> +    unsigned int trans_id;
> +
> +    /* Guest target size */
> +    uint64_t target;
> +    bool target_changed;
> +    uint64_t target_diff;
> +
> +    /*
> +     * All HAPVDIMMs under control of this driver
> +     * (but excluding the ones in hapvdimms_del_todo)
> +     */
> +    HAPVDIMMRangeTree hapvdimms;
> +
> +    /* Non-HAPVDIMM removed memory */
> +    PageRangeTree removed_guest, removed_both;
> +
> +    /* Grand totals of removed memory (both HAPVDIMM and non-HAPVDIMM) */
> +    uint64_t removed_guest_ctr, removed_both_ctr;
> +
> +    /* HAPVDIMMs waiting to be added during current connection */
> +    GSList *ha_todo;
> +    uint64_t ha_current_count;
> +
> +    /* HAPVDIMMs waiting to be deleted, not in any of the above structures */
> +    GSList *hapvdimms_del_todo;
> +} HvBalloon;
> +
> +#define HV_BALLOON(obj) OBJECT_CHECK(HvBalloon, (obj), TYPE_HV_BALLOON)
> +
> +#define HV_BALLOON_SET_STATE(hvb, news)             \
> +    do {                                            \
> +        assert(news != S_NO_CHANGE);                \
> +        hv_balloon_state_set(hvb, news, # news);    \
> +    } while (0)
> +
> +#define HV_BALLOON_STATE_DESC_SET(stdesc, news)         \
> +    _hv_balloon_state_desc_set(stdesc, news, # news)
> +
> +#define HV_BALLOON_STATE_DESC_INIT \
> +    {                              \
> +        .state = S_NO_CHANGE,      \
> +    }
> +
> +#define SUM_OVERFLOW_U64(in1, in2) ((in1) > UINT64_MAX - (in2))
> +#define SUM_SATURATE_U64(in1, in2)              \
> +    ({                                          \
> +        uint64_t _in1 = (in1), _in2 = (in2);    \
> +        uint64_t _result;                       \
> +                                                \
> +        if (!SUM_OVERFLOW_U64(_in1, _in2)) {    \
> +            _result = _in1 + _in2;              \
> +        } else {                                \
> +            _result = UINT64_MAX;               \
> +        }                                       \
> +                                                \
> +        _result;                                \
> +    })
> +
> +typedef struct HvBalloonReq {
> +    VMBusChanReq vmreq;
> +} HvBalloonReq;
> +
> +/* PageRange */
> +static void page_range_intersect(const PageRange *range,
> +                                 uint64_t start, uint64_t count,
> +                                 PageRange *out)
> +{
> +    uint64_t end1 = range->start + range->count;
> +    uint64_t end2 = start + count;
> +    uint64_t end = MIN(end1, end2);
> +
> +    out->start = MAX(range->start, start);
> +    out->count = out->start < end ? end - out->start : 0;
> +}
> +
> +static uint64_t page_range_intersection_size(const PageRange *range,
> +                                             uint64_t start, uint64_t count)
> +{
> +    PageRange trange;
> +
> +    page_range_intersect(range, start, count, &trange);
> +    return trange.count;
> +}
> +
> +/* return just the part of range before (start) */
> +static void page_range_part_before(const PageRange *range,
> +                                   uint64_t start, PageRange *out)
> +{
> +    uint64_t endr = range->start + range->count;
> +    uint64_t end = MIN(endr, start);
> +
> +    out->start = range->start;
> +    if (end > out->start) {
> +        out->count = end - out->start;
> +    } else {
> +        out->count = 0;
> +    }
> +}
> +
> +/* return just the part of range after (start, count) */
> +static void page_range_part_after(const PageRange *range,
> +                                  uint64_t start, uint64_t count,
> +                                  PageRange *out)
> +{
> +    uint64_t end = range->start + range->count;
> +    uint64_t ends = start + count;
> +
> +    out->start = MAX(range->start, ends);
> +    if (end > out->start) {
> +        out->count = end - out->start;
> +    } else {
> +        out->count = 0;
> +    }
> +}
> +
> +static bool page_range_joinable_left(const PageRange *range,
> +                                     uint64_t start, uint64_t count)
> +{
> +    return start + count == range->start;
> +}
> +
> +static bool page_range_joinable_right(const PageRange *range,
> +                                      uint64_t start, uint64_t count)
> +{
> +    return range->start + range->count == start;
> +}
> +
> +static bool page_range_joinable(const PageRange *range,
> +                                uint64_t start, uint64_t count)
> +{
> +    return page_range_joinable_left(range, start, count) ||
> +        page_range_joinable_right(range, start, count);
> +}
> +
> +/* PageRangeTree */
> +static gint page_range_tree_key_compare(gconstpointer leftp,
> +                                        gconstpointer rightp,
> +                                        gpointer user_data)
> +{
> +    const uint64_t *left = leftp, *right = rightp;
> +
> +    if (*left < *right) {
> +        return -1;
> +    } else if (*left > *right) {
> +        return 1;
> +    } else { /* *left == *right */
> +        return 0;
> +    }
> +}
> +
> +static GTreeNode *page_range_tree_insert_new(PageRangeTree tree,
> +                                             uint64_t start, uint64_t count)
> +{
> +    uint64_t *key = g_malloc(sizeof(*key));
> +    PageRange *range = g_malloc(sizeof(*range));
> +
> +    assert(count > 0);
> +
> +    *key = range->start = start;
> +    range->count = count;
> +
> +    return g_tree_insert_node(tree.t, key, range);
> +}
> +
> +static void page_range_tree_insert(PageRangeTree tree,
> +                                   uint64_t start, uint64_t count,
> +                                   uint64_t *dupcount)
> +{
> +    GTreeNode *node;
> +    bool joinable;
> +    uint64_t intersection;
> +    PageRange *range;
> +
> +    assert(!SUM_OVERFLOW_U64(start, count));
> +    if (count == 0) {
> +        return;
> +    }
> +
> +    node = g_tree_upper_bound(tree.t, &start);
> +    if (node) {
> +        node = g_tree_node_previous(node);
> +    } else {
> +        node = g_tree_node_last(tree.t);
> +    }
> +
> +    if (node) {
> +        range = g_tree_node_value(node);
> +        assert(range);
> +        intersection = page_range_intersection_size(range, start, count);
> +        joinable = page_range_joinable_right(range, start, count);
> +    }
> +
> +    if (!node ||
> +        (!intersection && !joinable)) {
> +        /*
> +         * !node case: the tree is empty or the very first node in the tree
> +         * already has a higher key (the start of its range).
> +         * the other case: there is a gap in the tree between the new range
> +         * and the previous one.
> +         * anyway, let's just insert the new range into the tree.
> +         */
> +        node = page_range_tree_insert_new(tree, start, count);
> +        assert(node);
> +        range = g_tree_node_value(node);
> +        assert(range);
> +    } else {
> +        /*
> +         * the previous range in the tree either partially covers the new
> +         * range or ends just at its beginning - extend it
> +         */
> +        if (dupcount) {
> +            *dupcount += intersection;
> +        }
> +
> +        count += start - range->start;
> +        range->count = MAX(range->count, count);
> +    }
> +
> +    /* check next nodes for possible merging */
> +    for (node = g_tree_node_next(node); node; ) {
> +        PageRange *rangecur;
> +
> +        rangecur = g_tree_node_value(node);
> +        assert(rangecur);
> +
> +        intersection = page_range_intersection_size(rangecur,
> +                                                    range->start, range->count);
> +        joinable = page_range_joinable_left(rangecur,
> +                                            range->start, range->count);
> +        if (!intersection && !joinable) {
> +            /* the current node is disjoint */
> +            break;
> +        }
> +
> +        if (dupcount) {
> +            *dupcount += intersection;
> +        }
> +
> +        count = rangecur->count + (rangecur->start - range->start);
> +        range->count = MAX(range->count, count);
> +
> +        /* the current node was merged in, remove it */
> +        start = rangecur->start;
> +        node = g_tree_node_next(node);
> +        /* no hinted removal in GTree... */
> +        g_tree_remove(tree.t, &start);
> +    }
> +}
> +
> +static bool page_range_tree_pop(PageRangeTree tree, PageRange *out,
> +                                uint64_t maxcount)
> +{
> +    GTreeNode *node;
> +    PageRange *range;
> +
> +    node = g_tree_node_last(tree.t);
> +    if (!node) {
> +        return false;
> +    }
> +
> +    range = g_tree_node_value(node);
> +    assert(range);
> +
> +    out->start = range->start;
> +
> +    /* can't modify range->start as it is the node key */
> +    if (range->count > maxcount) {
> +        out->start += range->count - maxcount;
> +        out->count = maxcount;
> +        range->count -= maxcount;
> +    } else {
> +        out->count = range->count;
> +        /* no hinted removal in GTree... */
> +        g_tree_remove(tree.t, &out->start);
> +    }
> +
> +    return true;
> +}
> +
> +static bool page_range_tree_intree_any(PageRangeTree tree,
> +                                       uint64_t start, uint64_t count)
> +{
> +    GTreeNode *node;
> +
> +    if (count == 0) {
> +        return false;
> +    }
> +
> +    /* find the first node that can possibly intersect our range */
> +    node = g_tree_upper_bound(tree.t, &start);
> +    if (node) {
> +        /*
> +         * a NULL node below means that the very first node in the tree
> +         * already has a higher key (the start of its range).
> +         */
> +        node = g_tree_node_previous(node);
> +    } else {
> +        /* a NULL node below means that the tree is empty */
> +        node = g_tree_node_last(tree.t);
> +    }
> +    /* node range start <= range start */
> +
> +    if (!node) {
> +        /* node range start > range start */
> +        node = g_tree_node_first(tree.t);
> +    }
> +
> +    for ( ; node; node = g_tree_node_next(node)) {
> +        PageRange *range = g_tree_node_value(node);
> +
> +        assert(range);
> +        /*
> +         * if this node starts beyond or at the end of our range so does
> +         * every next one
> +         */
> +        if (range->start >= start + count) {
> +            break;
> +        }
> +
> +        if (page_range_intersection_size(range, start, count) > 0) {
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +
> +static PageRangeTree page_range_tree_new(void)
> +{
> +    PageRangeTree tree;
> +
> +    tree.t = g_tree_new_full(page_range_tree_key_compare, NULL,
> +                             g_free, g_free);
> +    return tree;
> +}
> +
> +static void page_range_tree_destroy(PageRangeTree *tree)
> +{
> +    /* g_tree_destroy() is not NULL-safe */
> +    if (!tree->t) {
> +        return;
> +    }
> +
> +    g_tree_destroy(tree->t);
> +    tree->t = NULL;
> +}
> +
> +/* HAPVDIMMDevice */
> +static uint64_t hapvdimm_get_addr(HAPVDIMMDevice *hapvdimm)
> +{
> +    return object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_ADDR_PROP,
> +                                    &error_abort) / HV_BALLOON_PAGE_SIZE;
> +}
> +
> +static uint64_t hapvdimm_get_size(HAPVDIMMDevice *hapvdimm)
> +{
> +    return object_property_get_uint(OBJECT(hapvdimm), HAPVDIMM_SIZE_PROP,
> +                                    &error_abort) / HV_BALLOON_PAGE_SIZE;
> +}
> +
> +static void hapvdimm_get_range(HAPVDIMMDevice *hapvdimm, PageRange *out)
> +{
> +    out->start = hapvdimm_get_addr(hapvdimm);
> +    assert(out->start > 0);
> +
> +    out->count = hapvdimm_get_size(hapvdimm);
> +    assert(out->count > 0);
> +}
> +
> +static HostMemoryBackend *hapvdimm_get_memdev(HAPVDIMMDevice *hapvdimm)
> +{
> +    Object *memdev_obj;
> +
> +    memdev_obj = object_property_get_link(OBJECT(hapvdimm),
> +                                          HAPVDIMM_MEMDEV_PROP,
> +                                          &error_abort);
> +    return MEMORY_BACKEND(memdev_obj);
> +}
> +
> +/* HAPVDIMMRange */
> +static HAPVDIMMRange *hapvdimm_range_new(HAPVDIMMDevice *hapvdimm)
> +{
> +    HAPVDIMMRange *hpr = g_malloc(sizeof(*hpr));
> +
> +    hpr->hapvdimm = HAPVDIMM(object_ref(hapvdimm));
> +    hapvdimm_get_range(hapvdimm, &hpr->range);
> +
> +    hpr->removed_guest = page_range_tree_new();
> +    hpr->removed_both = page_range_tree_new();
> +
> +    /* mark the whole range as unused */
> +    hpr->used = 0;
> +    hpr->unused_head = hpr->range.count;
> +    hpr->unused_tail = 0;
> +
> +    return hpr;
> +}
> +
> +static void hapvdimm_range_free(HAPVDIMMRange *hpr)
> +{
> +    g_autoptr(HAPVDIMMDevice) hapvdimm = g_steal_pointer(&hpr->hapvdimm);
> +
> +    page_range_tree_destroy(&hpr->removed_guest);
> +    page_range_tree_destroy(&hpr->removed_both);
> +
> +    g_free(hpr);
> +}
> +
> +/* the hapvdimm range reduced by unused head and tail */
> +static void hapvdimm_range_get_effective_range(HAPVDIMMRange *hpr,
> +                                               PageRange *out)
> +{
> +    out->start = hpr->range.start + hpr->unused_head;
> +    out->count = hpr->range.count - hpr->unused_head - hpr->unused_tail;
> +}
> +
> +/* HAPVDIMMRangeTree */
> +static gint hapvdimm_tree_key_compare(gconstpointer leftp, gconstpointer rightp,
> +                                      gpointer user_data)
> +{
> +    /*
> +     * hapvdimm tree is also keyed on page range start, so we can simply reuse
> +     * the comparison function from the page range tree
> +     */
> +    return page_range_tree_key_compare(leftp, rightp, user_data);
> +}
> +
> +static HAPVDIMMRange *hapvdimm_tree_insert_new(HvBalloon *balloon,
> +                                               HAPVDIMMDevice *hapvdimm)
> +{
> +    HAPVDIMMRange *hpr;
> +    uint64_t *key;
> +
> +    hpr = hapvdimm_range_new(hapvdimm);
> +
> +    key = g_malloc(sizeof(*key));
> +    *key = hpr->range.start;
> +
> +    g_tree_insert(balloon->hapvdimms.t, key, hpr);
> +
> +    return hpr;
> +}
> +
> +/* The HAPVDIMM must not be on the ha_todo list since it's going to get unref'ed. */
> +static void hapvdimm_tree_remove(HvBalloon *balloon, HAPVDIMMDevice *hapvdimm)
> +{
> +    uint64_t addr;
> +
> +    addr = hapvdimm_get_addr(hapvdimm);
> +    assert(addr > 0);
> +
> +    g_tree_remove(balloon->hapvdimms.t, &addr);
> +}
> +
> +/* total RAM includes memory currently removed from the guest */
> +static gboolean hapvdimm_tree_total_ram_node(gpointer key,
> +                                             gpointer value,
> +                                             gpointer data)
> +{
> +    HAPVDIMMRange *hpr = value;
> +    uint64_t *size = data;
> +    PageRange rangeeff;
> +
> +    hapvdimm_range_get_effective_range(hpr, &rangeeff);
> +    *size += rangeeff.count;
> +
> +    return false;
> +}
> +
> +static uint64_t hapvdimm_tree_total_ram(HvBalloon *balloon)
> +{
> +    uint64_t size = 0;
> +
> +    g_tree_foreach(balloon->hapvdimms.t, hapvdimm_tree_total_ram_node, &size);
> +    return size;
> +}
> +
> +static void hapvdimm_tree_value_free(gpointer data)
> +{
> +    HAPVDIMMRange *hpr = data;
> +
> +    hapvdimm_range_free(hpr);
> +}
> +
> +static HAPVDIMMRangeTree hapvdimm_tree_new(void)
> +{
> +    HAPVDIMMRangeTree tree;
> +
> +    tree.t = g_tree_new_full(hapvdimm_tree_key_compare, NULL, g_free,
> +                             hapvdimm_tree_value_free);
> +    return tree;
> +}
> +
> +static void hapvdimm_tree_destroy(HAPVDIMMRangeTree *tree)
> +{
> +    /* g_tree_destroy() is not NULL-safe */
> +    if (!tree->t) {
> +        return;
> +    }
> +
> +    g_tree_destroy(tree->t);
> +    tree->t = NULL;
> +}
> +
> +static gboolean ha_todo_add_all_node(gpointer key,
> +                                     gpointer value,
> +                                     gpointer data)
> +{
> +    HAPVDIMMRange *hpr = value;
> +    HvBalloon *balloon = data;
> +
> +    /* assume the hpr is fresh */
> +    assert(hpr->used == 0);
> +    assert(hpr->unused_head == hpr->range.count);
> +    assert(hpr->unused_tail == 0);
> +
> +    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
> +
> +    return false;
> +}
> +
> +static void ha_todo_add_all(HvBalloon *balloon)
> +{
> +    assert(balloon->ha_todo == NULL);
> +    g_tree_foreach(balloon->hapvdimms.t, ha_todo_add_all_node, balloon);
> +}
> +
> +static void ha_todo_clear(HvBalloon *balloon)
> +{
> +    g_slist_free(g_steal_pointer(&balloon->ha_todo));
> +}
> +
> +/* TODO: unify the code below with virtio-balloon and cache the value */
> +static int build_dimm_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> +        DeviceState *dev = DEVICE(obj);
> +        if (dev->realized) { /* only realized DIMMs matter */
> +            *list = g_slist_prepend(*list, dev);
> +        }
> +    }
> +
> +    object_child_foreach(obj, build_dimm_list, opaque);
> +    return 0;
> +}
> +
> +static ram_addr_t get_current_ram_size(void)
> +{
> +    GSList *list = NULL, *item;
> +    ram_addr_t size = current_machine->ram_size;
> +
> +    build_dimm_list(qdev_get_machine(), &list);
> +    for (item = list; item; item = g_slist_next(item)) {
> +        Object *obj = OBJECT(item->data);
> +        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM))
> +            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
> +                                            &error_abort);
> +    }
> +    g_slist_free(list);
> +
> +    return size;
> +}
> +
> +/* total RAM includes memory currently removed from the guest */
> +static uint64_t hv_balloon_total_ram(HvBalloon *balloon)
> +{
> +    ram_addr_t ram_size = get_current_ram_size();
> +    uint64_t ram_size_pages = ram_size >> HV_BALLOON_PFN_SHIFT;
> +    uint64_t hapvdimm_size_pages = hapvdimm_tree_total_ram(balloon);
> +
> +    assert(ram_size_pages > 0);
> +
> +    return SUM_SATURATE_U64(ram_size_pages, hapvdimm_size_pages);
> +}
> +
> +/*
> + * calculating the total RAM size is a slow operation,
> + * avoid it as much as possible
> + */
> +static uint64_t hv_balloon_total_removed_rs(HvBalloon *balloon,
> +                                            uint64_t ram_size_pages)
> +{
> +    uint64_t total_removed;
> +
> +    total_removed = SUM_SATURATE_U64(balloon->removed_guest_ctr,
> +                                     balloon->removed_both_ctr);
> +
> +    /* possible if guest returns pages outside actual RAM */
> +    if (total_removed > ram_size_pages) {
> +        total_removed = ram_size_pages;
> +    }
> +
> +    return total_removed;
> +}
> +
> +static bool hv_balloon_state_is_init(HvBalloon *balloon)
> +{
> +    return balloon->state == S_WAIT_RESET ||
> +        balloon->state == S_CLOSED ||
> +        balloon->state == S_VERSION ||
> +        balloon->state == S_CAPS;
> +}
> +
> +/* Returns whether the state has actually changed */
> +static bool hv_balloon_state_set(HvBalloon *balloon,
> +                                 State newst, const char *newststr)
> +{
> +    if (newst == S_NO_CHANGE || balloon->state == newst) {
> +        return false;
> +    }
> +
> +    balloon->state = newst;
> +    trace_hv_balloon_state_change(newststr);
> +    return true;
> +}
> +
> +static void _hv_balloon_state_desc_set(StateDesc *stdesc,
> +                                       State newst, const char *newststr)
> +{
> +    /* state setting is only permitted on a freshly init desc */
> +    assert(stdesc->state == S_NO_CHANGE);
> +
> +    assert(newst != S_NO_CHANGE);
> +
> +    stdesc->state = newst;
> +    stdesc->desc = newststr;
> +}
> +
> +static void del_todo_process(HvBalloon *balloon)
> +{
> +    while (balloon->hapvdimms_del_todo) {
> +        HAPVDIMMDevice *hapvdimm = balloon->hapvdimms_del_todo->data;
> +        HostMemoryBackend *backend;
> +        const char *backend_id;
> +
> +        backend = hapvdimm_get_memdev(hapvdimm);
> +        backend_id = object_get_canonical_path_component(OBJECT(backend));
> +
> +        object_unparent(OBJECT(hapvdimm));
> +        object_unref(OBJECT(hapvdimm));
> +        qapi_event_send_hv_balloon_memory_backend_unused(backend_id);
> +
> +        balloon->hapvdimms_del_todo =
> +            g_slist_remove(balloon->hapvdimms_del_todo, hapvdimm);
> +    }
> +
> +    if (balloon->del_todo_process_timer) {
> +        g_source_remove(balloon->del_todo_process_timer);
> +        balloon->del_todo_process_timer = 0;
> +    }
> +}
> +
> +static gboolean del_todo_process_timer(gpointer user_data)
> +{
> +    HvBalloon *balloon = user_data;
> +
> +    balloon->del_todo_process_timer = 0;
> +
> +    del_todo_process(balloon);
> +
> +    return G_SOURCE_REMOVE;
> +}
> +
> +static void del_todo_append(HvBalloon *balloon,
> +                            HAPVDIMMDevice *hapvdimm)
> +{
> +    balloon->hapvdimms_del_todo = g_slist_append(balloon->hapvdimms_del_todo,
> +                                                 object_ref(hapvdimm));
> +}
> +
> +static void del_todo_add(HvBalloon *balloon,
> +                         HAPVDIMMDevice *hapvdimm)
> +{
> +    hapvdimm_tree_remove(balloon, hapvdimm);
> +    del_todo_append(balloon, hapvdimm);
> +}
> +
> +static gboolean del_todo_add_all_node(gpointer key,
> +                                      gpointer value,
> +                                      gpointer data)
> +{
> +    HAPVDIMMRange *hpr = value;
> +    HvBalloon *balloon = data;
> +
> +    del_todo_append(balloon, hpr->hapvdimm);
> +
> +    return false;
> +}
> +
> +static void del_todo_add_all(HvBalloon *balloon)
> +{
> +    g_tree_foreach(balloon->hapvdimms.t, del_todo_add_all_node, balloon);
> +    hapvdimm_tree_destroy(&balloon->hapvdimms);
> +
> +    balloon->hapvdimms = hapvdimm_tree_new();
> +}
> +
> +static void del_todo_add_all_from_ha_todo(HvBalloon *balloon)
> +{
> +    while (balloon->ha_todo) {
> +        HAPVDIMMRange *hpr = balloon->ha_todo->data;
> +
> +        del_todo_add(balloon, hpr->hapvdimm);
> +        balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr);
> +    }
> +}
> +
> +static VMBusChannel *hv_balloon_get_channel_maybe(HvBalloon *balloon)
> +{
> +    return vmbus_device_channel(&balloon->parent, 0);
> +}
> +
> +static VMBusChannel *hv_balloon_get_channel(HvBalloon *balloon)
> +{
> +    VMBusChannel *chan;
> +
> +    chan = hv_balloon_get_channel_maybe(balloon);
> +    assert(chan != NULL);
> +    return chan;
> +}
> +
> +static ssize_t hv_balloon_send_packet(VMBusChannel *chan,
> +                                      struct dm_message *msg)
> +{
> +    int ret;
> +
> +    ret = vmbus_channel_reserve(chan, 0, msg->hdr.size);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    return vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
> +                              NULL, 0, msg, msg->hdr.size, false,
> +                              msg->hdr.trans_id);
> +}
> +
> +static bool hv_balloon_unballoon_get_source(HvBalloon *balloon,
> +                                            PageRangeTree *dtree,
> +                                            uint64_t **dctr,
> +                                            HAPVDIMMRange **hpr)
> +{
> +    /* Try the boot memory first */
> +    if (g_tree_nnodes(balloon->removed_guest.t) > 0) {
> +        *dtree = balloon->removed_guest;
> +        *dctr = &balloon->removed_guest_ctr;
> +        *hpr = NULL;
> +    } else if (g_tree_nnodes(balloon->removed_both.t) > 0) {
> +        *dtree = balloon->removed_both;
> +        *dctr = &balloon->removed_both_ctr;
> +        *hpr = NULL;
> +    } else {
> +        GTreeNode *node;
> +
> +        for (node = g_tree_node_first(balloon->hapvdimms.t); node;
> +             node = g_tree_node_next(node)) {
> +            HAPVDIMMRange *hprnode = g_tree_node_value(node);
> +
> +            assert(hprnode);
> +            if (g_tree_nnodes(hprnode->removed_guest.t) > 0) {
> +                *dtree = hprnode->removed_guest;
> +                *dctr = &balloon->removed_guest_ctr;
> +                *hpr = hprnode;
> +                break;
> +            } else if (g_tree_nnodes(hprnode->removed_both.t) > 0) {
> +                *dtree = hprnode->removed_both;
> +                *dctr = &balloon->removed_both_ctr;
> +                *hpr = hprnode;
> +                break;
> +            }
> +        }
> +
> +        if (!node) {
> +            return false;
> +        }
> +    }
> +
> +    return true;
> +}
> +
> +static void hv_balloon_balloon_unballoon_start(HvBalloon *balloon,
> +                                               uint64_t ram_size_pages,
> +                                               StateDesc *stdesc)
> +{
> +    uint64_t total_removed = hv_balloon_total_removed_rs(balloon,
> +                                                         ram_size_pages);
> +
> +    assert(balloon->state == S_IDLE);
> +    assert(ram_size_pages > 0);
> +
> +    /*
> +     * we need to cache the value when starting the (un)balloon procedure
> +     * in case somebody changes the balloon target when the procedure is
> +     * in progress
> +     */
> +    if (balloon->target < ram_size_pages - total_removed) {
> +        balloon->target_diff = ram_size_pages - total_removed - balloon->target;
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_RB_WAIT);
> +    } else {
> +        balloon->target_diff = balloon->target -
> +            (ram_size_pages - total_removed);
> +
> +        /*
> +         * careful here, the user might have set the balloon target
> +         * above the RAM size, so above the total removed count
> +         */
> +        balloon->target_diff = MIN(balloon->target_diff, total_removed);
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_RB_WAIT);
> +    }
> +
> +    balloon->target_changed = false;
> +}
> +
> +static void hv_balloon_unballoon_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan = hv_balloon_get_channel(balloon);
> +    struct dm_unballoon_request *ur;
> +    size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
> +
> +    assert(balloon->state == S_UNBALLOON_RB_WAIT);
> +
> +    if (vmbus_channel_reserve(chan, 0, ur_size) < 0) {
> +        return;
> +    }
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_POSTING);
> +}
> +
> +static void hv_balloon_unballoon_posting(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan = hv_balloon_get_channel(balloon);
> +    PageRangeTree dtree;
> +    uint64_t *dctr;
> +    HAPVDIMMRange *hpr;
> +    struct dm_unballoon_request *ur;
> +    size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
> +    PageRange range;
> +    bool bret;
> +    ssize_t ret;
> +
> +    assert(balloon->state == S_UNBALLOON_POSTING);
> +    assert(balloon->target_diff > 0);
> +
> +    if (!hv_balloon_unballoon_get_source(balloon, &dtree, &dctr, &hpr)) {
> +        error_report("trying to unballoon but nothing ballooned");
> +        /*
> +         * there is little we can do as we might have already
> +         * sent the guest a partial request we can't cancel
> +         */
> +        return;
> +    }
> +
> +    assert(dtree.t);
> +    assert(dctr);
> +
> +    ur = alloca(ur_size);
> +    memset(ur, 0, ur_size);
> +    ur->hdr.type = DM_UNBALLOON_REQUEST;
> +    ur->hdr.size = ur_size;
> +    ur->hdr.trans_id = balloon->trans_id;
> +
> +    bret = page_range_tree_pop(dtree, &range, MIN(balloon->target_diff,
> +                                                  HV_BALLOON_HA_CHUNK_PAGES));
> +    assert(bret);
> +    /* TODO: madvise? */
> +
> +    *dctr -= range.count;
> +    balloon->target_diff -= range.count;
> +    if (hpr) {
> +        hpr->used += range.count;
> +    }
> +
> +    ur->range_count = 1;
> +    ur->range_array[0].finfo.start_page = range.start;
> +    ur->range_array[0].finfo.page_cnt = range.count;
> +    ur->more_pages = balloon->target_diff > 0;
> +
> +    trace_hv_balloon_outgoing_unballoon(ur->hdr.trans_id,
> +                                        range.count, range.start,
> +                                        balloon->target_diff);
> +
> +    if (ur->more_pages) {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_RB_WAIT);
> +    } else {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_UNBALLOON_REPLY_WAIT);
> +    }
> +
> +    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
> +                             NULL, 0, ur, ur_size, false,
> +                             ur->hdr.trans_id);
> +    if (ret <= 0) {
> +        error_report("error %zd when posting unballoon msg, expect problems",
> +                     ret);
> +    }
> +}
> +
> +static void hv_balloon_hot_add_start(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    HAPVDIMMRange *hpr;
> +    PageRange range;
> +
> +    assert(balloon->state == S_IDLE);
> +    assert(balloon->ha_todo);
> +
> +    hpr = balloon->ha_todo->data;
> +
> +    range.start = QEMU_ALIGN_UP(hpr->range.start,
> +                                (1 << balloon->caps.cap_bits.hot_add_alignment)
> +                                * (MiB / HV_BALLOON_PAGE_SIZE));
> +    hpr->unused_head = range.start - hpr->range.start;
> +    if (hpr->unused_head >= hpr->range.count) {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT);
> +        return;
> +    }
> +
> +    range.count = hpr->range.count - hpr->unused_head;
> +    range.count = QEMU_ALIGN_DOWN(range.count,
> +                                  (1 << balloon->caps.cap_bits.hot_add_alignment)
> +                                  * (MiB / HV_BALLOON_PAGE_SIZE));
> +    if (range.count == 0) {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT);
> +        return;
> +    }
> +    hpr->unused_tail = hpr->range.count - hpr->unused_head - range.count;
> +    hpr->used = 0;
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_RB_WAIT);
> +}
> +
> +static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan = hv_balloon_get_channel(balloon);
> +    struct dm_hot_add *ha;
> +    size_t ha_size = sizeof(*ha) + sizeof(ha->range);
> +
> +    assert(balloon->state == S_HOT_ADD_RB_WAIT);
> +
> +    if (vmbus_channel_reserve(chan, 0, ha_size) < 0) {
> +        return;
> +    }
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_POSTING);
> +}
> +
> +static void hv_balloon_hot_add_posting(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan = hv_balloon_get_channel(balloon);
> +    HAPVDIMMRange *hpr;
> +    struct dm_hot_add *ha;
> +    size_t ha_size = sizeof(*ha) + sizeof(ha->range);
> +    union dm_mem_page_range *ha_region;
> +    PageRange range;
> +    uint64_t chunk_max_size;
> +    ssize_t ret;
> +
> +    assert(balloon->state == S_HOT_ADD_POSTING);
> +    assert(balloon->ha_todo);
> +
> +    hpr = balloon->ha_todo->data;
> +
> +    range.start = hpr->range.start + hpr->unused_head + hpr->used;
> +    range.count = hpr->range.count;
> +    range.count -= hpr->unused_head;
> +    range.count -= hpr->used;
> +    range.count -= hpr->unused_tail;
> +
> +    chunk_max_size = MAX((1 << balloon->caps.cap_bits.hot_add_alignment) *
> +                         (MiB / HV_BALLOON_PAGE_SIZE),
> +                         HV_BALLOON_HA_CHUNK_PAGES);
> +    range.count = MIN(range.count, chunk_max_size);
> +    balloon->ha_current_count = range.count;
> +
> +    ha = alloca(ha_size);
> +    /* the hot-add region entry is laid out directly after ha->range */
> +    ha_region = &(&ha->range)[1];
> +    memset(ha, 0, ha_size);
> +    ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
> +    ha->hdr.size = ha_size;
> +    ha->hdr.trans_id = balloon->trans_id;
> +
> +    ha->range.finfo.start_page = range.start;
> +    ha->range.finfo.page_cnt = range.count;
> +    ha_region->finfo.start_page = range.start;
> +    ha_region->finfo.page_cnt = ha->range.finfo.page_cnt;
> +
> +    trace_hv_balloon_outgoing_hot_add(ha->hdr.trans_id,
> +                                      range.count, range.start);
> +
> +    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
> +                             NULL, 0, ha, ha_size, false,
> +                             ha->hdr.trans_id);
> +    if (ret <= 0) {
> +        error_report("error %zd when posting hot add msg, expect problems",
> +                     ret);
> +    }
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_REPLY_WAIT);
> +}
> +
> +static void hv_balloon_hot_add_finish(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    HAPVDIMMRange *hpr;
> +
> +    assert(balloon->state == S_HOT_ADD_SKIP_CURRENT ||
> +           balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING ||
> +           balloon->state == S_HOT_ADD_PROCESSED_NEXT);
> +    assert(balloon->ha_todo);
> +
> +    hpr = balloon->ha_todo->data;
> +
> +    balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr);
> +    if (balloon->state == S_HOT_ADD_SKIP_CURRENT) {
> +        del_todo_add(balloon, hpr->hapvdimm);
> +    } else if (balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING) {
> +        del_todo_add_all_from_ha_todo(balloon);
> +    }
> +
> +    /* return to S_IDLE between hot adds so other things can happen too */
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
> +}
> +
> +static void hv_balloon_balloon_rb_wait(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan = hv_balloon_get_channel(balloon);
> +    size_t bl_size = sizeof(struct dm_balloon);
> +
> +    assert(balloon->state == S_BALLOON_RB_WAIT);
> +
> +    if (vmbus_channel_reserve(chan, 0, bl_size) < 0) {
> +        return;
> +    }
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_POSTING);
> +}
> +
> +static void hv_balloon_balloon_posting(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan = hv_balloon_get_channel(balloon);
> +    struct dm_balloon bl;
> +    size_t bl_size = sizeof(bl);
> +    ssize_t ret;
> +
> +    assert(balloon->state == S_BALLOON_POSTING);
> +    assert(balloon->target_diff > 0);
> +
> +    memset(&bl, 0, sizeof(bl));
> +    bl.hdr.type = DM_BALLOON_REQUEST;
> +    bl.hdr.size = bl_size;
> +    bl.hdr.trans_id = balloon->trans_id;
> +    bl.num_pages = MIN(balloon->target_diff, HV_BALLOON_HR_CHUNK_PAGES);
> +
> +    trace_hv_balloon_outgoing_balloon(bl.hdr.trans_id, bl.num_pages,
> +                                      balloon->target_diff);
> +
> +    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
> +                             NULL, 0, &bl, bl_size, false,
> +                             bl.hdr.trans_id);
> +    if (ret <= 0) {
> +        error_report("error %zd when posting balloon msg, expect problems",
> +                     ret);
> +    }
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_REPLY_WAIT);
> +}
> +
> +static void hv_balloon_idle_state(HvBalloon *balloon,
> +                                  StateDesc *stdesc)
> +{
> +    bool can_balloon = balloon->caps.cap_bits.balloon;
> +    bool want_unballoon = false;
> +    bool want_hot_add = balloon->ha_todo != NULL;
> +    bool want_balloon = false;
> +    uint64_t ram_size_pages;
> +
> +    assert(balloon->state == S_IDLE);
> +
> +    if (can_balloon && balloon->target_changed) {
> +        uint64_t total_removed;
> +
> +        ram_size_pages = hv_balloon_total_ram(balloon);
> +        total_removed = hv_balloon_total_removed_rs(balloon,
> +                                                    ram_size_pages);
> +
> +        want_unballoon = total_removed > 0 &&
> +            balloon->target > ram_size_pages - total_removed;
> +        want_balloon = balloon->target < ram_size_pages - total_removed;
> +    }
> +
> +    /*
> +     * the order here is important, first we unballoon, then hot add,
> +     * then balloon (or hot remove)
> +     */
> +    if (want_unballoon) {
> +        hv_balloon_balloon_unballoon_start(balloon, ram_size_pages, stdesc);
> +    } else if (want_hot_add) {
> +        hv_balloon_hot_add_start(balloon, stdesc);
> +    } else if (want_balloon) {
> +        hv_balloon_balloon_unballoon_start(balloon, ram_size_pages, stdesc);
> +    }
> +}
> +
> +static const struct {
> +    void (*handler)(HvBalloon *balloon, StateDesc *stdesc);
> +} state_handlers[] = {
> +    [S_IDLE].handler = hv_balloon_idle_state,
> +    [S_UNBALLOON_RB_WAIT].handler = hv_balloon_unballoon_rb_wait,
> +    [S_UNBALLOON_POSTING].handler = hv_balloon_unballoon_posting,
> +    [S_HOT_ADD_RB_WAIT].handler = hv_balloon_hot_add_rb_wait,
> +    [S_HOT_ADD_POSTING].handler = hv_balloon_hot_add_posting,
> +    [S_HOT_ADD_SKIP_CURRENT].handler = hv_balloon_hot_add_finish,
> +    [S_HOT_ADD_PROCESSED_CLEAR_PENDING].handler = hv_balloon_hot_add_finish,
> +    [S_HOT_ADD_PROCESSED_NEXT].handler = hv_balloon_hot_add_finish,
> +    [S_BALLOON_RB_WAIT].handler = hv_balloon_balloon_rb_wait,
> +    [S_BALLOON_POSTING].handler = hv_balloon_balloon_posting,
> +};
> +
> +static void hv_balloon_handle_state(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    if (!state_handlers[balloon->state].handler) {
> +        return;
> +    }
> +
> +    state_handlers[balloon->state].handler(balloon, stdesc);
> +}
> +
> +static void hv_balloon_remove_response_insert_range(PageRangeTree tree,
> +                                                    const PageRange *range,
> +                                                    uint64_t *ctr1,
> +                                                    uint64_t *ctr2,
> +                                                    uint64_t *ctr3)
> +{
> +    uint64_t dupcount, effcount;
> +
> +    if (range->count == 0) {
> +        return;
> +    }
> +
> +    dupcount = 0;
> +    page_range_tree_insert(tree, range->start, range->count, &dupcount);
> +
> +    assert(dupcount <= range->count);
> +    effcount = range->count - dupcount;
> +
> +    *ctr1 += effcount;
> +    *ctr2 += effcount;
> +    if (ctr3) {
> +        *ctr3 += effcount;
> +    }
> +}
> +
> +static void hv_balloon_remove_response_handle_range(HvBalloon *balloon,
> +                                                    PageRange *range,
> +                                                    bool both,
> +                                                    uint64_t *removedctr)
> +{
> +    GTreeNode *node;
> +    PageRangeTree globaltree = both ? balloon->removed_both :
> +        balloon->removed_guest;
> +    uint64_t *globalctr = both ? &balloon->removed_both_ctr :
> +        &balloon->removed_guest_ctr;
> +
> +    if (range->count == 0) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_remove_response(range->count, range->start, both);
> +
> +    /* find the first node that can possibly intersect our range */
> +    node = g_tree_upper_bound(balloon->hapvdimms.t, &range->start);
> +    if (node) {
> +        /*
> +         * a NULL node below means that the very first node in the tree
> +         * already has a higher key (the start of its range).
> +         */
> +        node = g_tree_node_previous(node);
> +    } else {
> +        /* a NULL node below means that the tree is empty */
> +        node = g_tree_node_last(balloon->hapvdimms.t);
> +    }
> +    /* node range start <= range start */
> +
> +    if (!node) {
> +        /* node range start > range start */
> +        node = g_tree_node_first(balloon->hapvdimms.t);
> +    }
> +
> +    for ( ; node && range->count > 0; node = g_tree_node_next(node)) {
> +        HAPVDIMMRange *hpr = g_tree_node_value(node);
> +        PageRangeTree hprtree;
> +        PageRange rangeeff, rangehole, rangecommon;
> +        uint64_t hprremoved = 0;
> +
> +        assert(hpr);
> +        hprtree = both ? hpr->removed_both : hpr->removed_guest;
> +        hapvdimm_range_get_effective_range(hpr, &rangeeff);
> +
> +        /*
> +         * if this node starts beyond or at the end of the range so does
> +         * every next one
> +         */
> +        if (rangeeff.start >= range->start + range->count) {
> +            break;
> +        }
> +
> +        /* process the hole before the current hpr, if it exists */
> +        page_range_part_before(range, rangeeff.start, &rangehole);
> +        hv_balloon_remove_response_insert_range(globaltree, &rangehole,
> +                                                globalctr, removedctr, NULL);
> +        if (rangehole.count > 0) {
> +            trace_hv_balloon_remove_response_hole(rangehole.count,
> +                                                  rangehole.start,
> +                                                  range->count, range->start,
> +                                                  rangeeff.start, both);
> +        }
> +
> +        /*
> +         * process the hpr part, can be empty for the very first node processed
> +         * or due to difference between the nominal and effective hpr start
> +         */
> +        page_range_intersect(range, rangeeff.start, rangeeff.count,
> +                             &rangecommon);
> +        hv_balloon_remove_response_insert_range(hprtree, &rangecommon,
> +                                                globalctr, removedctr,
> +                                                &hprremoved);
> +        hpr->used -= hprremoved;
> +        if (rangecommon.count > 0) {
> +            trace_hv_balloon_remove_response_common(rangecommon.count,
> +                                                    rangecommon.start,
> +                                                    range->count, range->start,
> +                                                    rangeeff.count,
> +                                                    rangeeff.start, hprremoved,
> +                                                    both);
> +        }
> +
> +        /* calculate what's left after the current hpr */
> +        rangecommon = *range;
> +        page_range_part_after(&rangecommon, rangeeff.start, rangeeff.count,
> +                              range);
> +    }
> +
> +    /* process the remainder of the range that lies outside of the hpr tree */
> +    if (range->count > 0) {
> +        hv_balloon_remove_response_insert_range(globaltree, range,
> +                                                globalctr, removedctr, NULL);
> +        trace_hv_balloon_remove_response_remainder(range->count, range->start,
> +                                                   both);
> +        range->count = 0;
> +    }
> +}
> +
> +static void hv_balloon_remove_response_handle_pages(HvBalloon *balloon,
> +                                                    PageRange *range,
> +                                                    uint64_t start,
> +                                                    uint64_t count,
> +                                                    bool both,
> +                                                    uint64_t *removedctr)
> +{
> +    assert(count > 0);
> +
> +    /*
> +     * if there is an existing range that the new range can't be joined to,
> +     * dump it into the tree(s)
> +     */
> +    if (range->count > 0 && !page_range_joinable(range, start, count)) {
> +        hv_balloon_remove_response_handle_range(balloon, range, both,
> +                                                removedctr);
> +    }
> +
> +    if (range->count == 0) {
> +        range->start = start;
> +        range->count = count;
> +    } else if (page_range_joinable_left(range, start, count)) {
> +        range->start = start;
> +        range->count += count;
> +    } else { /* page_range_joinable_right() */
> +        range->count += count;
> +    }
> +}
> +
> +static gboolean hv_balloon_handle_remove_host_addr_node(gpointer key,
> +                                                        gpointer value,
> +                                                        gpointer data)
> +{
> +    PageRange *range = value;
> +    uint64_t pageoff;
> +
> +    for (pageoff = 0; pageoff < range->count; ) {
> +        void *addr = (void *)((range->start + pageoff) * HV_BALLOON_PAGE_SIZE);
> +        RAMBlock *rb;
> +        ram_addr_t rb_offset;
> +        size_t rb_page_size;
> +        size_t discard_size;
> +
> +        rb = qemu_ram_block_from_host(addr, false, &rb_offset);
> +        rb_page_size = qemu_ram_pagesize(rb);
> +
> +        if (rb_page_size != HV_BALLOON_PAGE_SIZE) {
> +            /* TODO: these should end in "removed_guest" */
> +            warn_report("guest reported removed page backed by unsupported page size %zu",
> +                        rb_page_size);
> +            pageoff++;
> +            continue;
> +        }
> +
> +        discard_size = MIN(range->count - pageoff,
> +                           (rb->max_length - rb_offset) /
> +                           HV_BALLOON_PAGE_SIZE);
> +        discard_size = MAX(discard_size, 1);
> +
> +        if (ram_block_discard_range(rb, rb_offset, discard_size *
> +                                    HV_BALLOON_PAGE_SIZE) != 0) {
> +            warn_report("guest reported removed page failed discard");
> +        }
> +
> +        pageoff += discard_size;
> +    }
> +
> +    return false;
> +}
> +
> +static void hv_balloon_handle_remove_host_addr_tree(PageRangeTree tree)
> +{
> +    g_tree_foreach(tree.t, hv_balloon_handle_remove_host_addr_node, NULL);
> +}
> +
> +static int hv_balloon_handle_remove_section(PageRangeTree tree,
> +                                            const MemoryRegionSection *section,
> +                                            uint64_t count)
> +{
> +    void *addr = memory_region_get_ram_ptr(section->mr) +
> +        section->offset_within_region;
> +    uint64_t addr_page;
> +
> +    assert(count > 0);
> +
> +    if ((uintptr_t)addr % HV_BALLOON_PAGE_SIZE) {
> +        warn_report("guest reported removed pages at an unaligned host addr %p",
> +                    addr);
> +        return -EINVAL;
> +    }
> +
> +    addr_page = (uintptr_t)addr / HV_BALLOON_PAGE_SIZE;
> +    page_range_tree_insert(tree, addr_page, count, NULL);
> +
> +    return 0;
> +}
> +
> +static void hv_balloon_handle_remove_ranges(HvBalloon *balloon,
> +                                            union dm_mem_page_range ranges[],
> +                                            uint32_t count)
> +{
> +    uint64_t removedcnt;
> +    PageRangeTree removed_host_addr;
> +    PageRange range_guest, range_both;
> +
> +    removed_host_addr = page_range_tree_new();
> +    range_guest.count = range_both.count = removedcnt = 0;
> +    for (unsigned int ctr = 0; ctr < count; ctr++) {
> +        union dm_mem_page_range *mr = &ranges[ctr];
> +        hwaddr pa;
> +        MemoryRegionSection section;
> +
> +        for (unsigned int offset = 0; offset < mr->finfo.page_cnt; ) {
> +            int ret;
> +            uint64_t pageno = mr->finfo.start_page + offset;
> +            uint64_t pagecnt = 1;
> +
> +            pa = (hwaddr)pageno << HV_BALLOON_PFN_SHIFT;
> +            section = memory_region_find(get_system_memory(), pa,
> +                                         (mr->finfo.page_cnt - offset) *
> +                                         HV_BALLOON_PAGE_SIZE);
> +            if (!section.mr) {
> +                warn_report("guest reported removed page %"PRIu64" not found in RAM",
> +                            pageno);
> +                ret = -EINVAL;
> +                goto finish_page;
> +            }
> +
> +            pagecnt = section.size / HV_BALLOON_PAGE_SIZE;
> +            if (pagecnt == 0) {
> +                warn_report("guest reported removed page %"PRIu64" in a section smaller than page size",
> +                            pageno);
> +                pagecnt = 1; /* skip the whole page */
> +                ret = -EINVAL;
> +                goto finish_page;
> +            }
> +
> +            if (!memory_region_is_ram(section.mr) ||
> +                memory_region_is_rom(section.mr) ||
> +                memory_region_is_romd(section.mr)) {
> +                warn_report("guest reported removed page %"PRIu64" in a section that is not an ordinary RAM",
> +                            pageno);
> +                ret = -EINVAL;
> +                goto finish_page;
> +            }
> +
> +            ret = hv_balloon_handle_remove_section(removed_host_addr, &section,
> +                                                   pagecnt);
> +
> +        finish_page:
> +            if (ret == 0) {
> +                hv_balloon_remove_response_handle_pages(balloon,
> +                                                        &range_both,
> +                                                        pageno, pagecnt,
> +                                                        true, &removedcnt);
> +            } else {
> +                hv_balloon_remove_response_handle_pages(balloon,
> +                                                        &range_guest,
> +                                                        pageno, pagecnt,
> +                                                        false, &removedcnt);
> +            }
> +
> +            if (section.mr) {
> +                memory_region_unref(section.mr);
> +            }
> +
> +            offset += pagecnt;
> +        }
> +    }
> +
> +    hv_balloon_remove_response_handle_range(balloon, &range_both, true,
> +                                            &removedcnt);
> +    hv_balloon_remove_response_handle_range(balloon, &range_guest, false,
> +                                            &removedcnt);
> +
> +    hv_balloon_handle_remove_host_addr_tree(removed_host_addr);
> +    page_range_tree_destroy(&removed_host_addr);
> +
> +    if (removedcnt > balloon->target_diff) {
> +        warn_report("guest reported more pages removed than currently pending (%"PRIu64" vs %"PRIu64")",
> +                    removedcnt, balloon->target_diff);
> +        balloon->target_diff = 0;
> +    } else {
> +        balloon->target_diff -= removedcnt;
> +    }
> +}
> +
> +static bool hv_balloon_handle_msg_size(HvBalloonReq *req, size_t minsize,
> +                                       const char *msgname)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    uint32_t msglen = vmreq->msglen;
> +
> +    if (msglen >= minsize) {
> +        return true;
> +    }
> +
> +    warn_report("%s message too short (%u vs %zu), ignoring", msgname,
> +                (unsigned int)msglen, minsize);
> +    return false;
> +}
> +
> +static void hv_balloon_handle_version_request(HvBalloon *balloon,
> +                                              HvBalloonReq *req,
> +                                              StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_version_request *msgVr = vmreq->msg;
> +    struct dm_version_response respVr;
> +
> +    if (balloon->state != S_VERSION) {
> +        warn_report("unexpected DM_VERSION_REQUEST in %d state",
> +                    balloon->state);
> +        return;
> +    }
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgVr),
> +                                    "DM_VERSION_REQUEST")) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_incoming_version(msgVr->version.major_version,
> +                                      msgVr->version.minor_version);
> +
> +    memset(&respVr, 0, sizeof(respVr));
> +    respVr.hdr.type = DM_VERSION_RESPONSE;
> +    respVr.hdr.size = sizeof(respVr);
> +    respVr.hdr.trans_id = msgVr->hdr.trans_id;
> +    respVr.is_accepted = msgVr->version.version >= DYNMEM_PROTOCOL_VERSION_1 &&
> +        msgVr->version.version <= DYNMEM_PROTOCOL_VERSION_3;
> +
> +    hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respVr);
> +
> +    if (respVr.is_accepted) {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_CAPS);
> +    }
> +}
> +
> +static void hv_balloon_handle_caps_report(HvBalloon *balloon,
> +                                          HvBalloonReq *req,
> +                                          StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_capabilities *msgCap = vmreq->msg;
> +    struct dm_capabilities_resp_msg respCap;
> +
> +    if (balloon->state != S_CAPS) {
> +        warn_report("unexpected DM_CAPABILITIES_REPORT in %d state",
> +                    balloon->state);
> +        return;
> +    }
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgCap),
> +                                    "DM_CAPABILITIES_REPORT")) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_incoming_caps(msgCap->caps.caps);
> +    balloon->caps = msgCap->caps;
> +
> +    memset(&respCap, 0, sizeof(respCap));
> +    respCap.hdr.type = DM_CAPABILITIES_RESPONSE;
> +    respCap.hdr.size = sizeof(respCap);
> +    respCap.hdr.trans_id = msgCap->hdr.trans_id;
> +    respCap.is_accepted = 1;
> +    respCap.hot_remove = 1;
> +    respCap.suppress_pressure_reports = !balloon->status_reports;
> +    hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respCap);
> +
> +    if (balloon->caps.cap_bits.hot_add) {
> +        ha_todo_add_all(balloon);
> +    }
> +
> +    timer_mod(&balloon->post_init_timer,
> +              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
> +              HV_BALLOON_POST_INIT_WAIT);
> +
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_POST_INIT_WAIT);
> +}
> +
> +static void hv_balloon_handle_status_report(HvBalloon *balloon,
> +                                            HvBalloonReq *req)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_status *msgStatus = vmreq->msg;
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgStatus),
> +                                    "DM_STATUS_REPORT")) {
> +        return;
> +    }
> +
> +    if (!balloon->status_reports) {
> +        return;
> +    }
> +
> +    qapi_event_send_hv_balloon_status_report((uint64_t)msgStatus->num_committed *
> +                                             HV_BALLOON_PAGE_SIZE,
> +                                             (uint64_t)msgStatus->num_avail *
> +                                             HV_BALLOON_PAGE_SIZE);
> +}
> +
> +static void hv_balloon_handle_unballoon_response(HvBalloon *balloon,
> +                                                 HvBalloonReq *req,
> +                                                 StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_unballoon_response *msgUrR = vmreq->msg;
> +
> +    if (balloon->state != S_UNBALLOON_REPLY_WAIT) {
> +        warn_report("unexpected DM_UNBALLOON_RESPONSE in %d state",
> +                    balloon->state);
> +        return;
> +    }
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgUrR),
> +                                    "DM_UNBALLOON_RESPONSE")) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_incoming_unballoon(msgUrR->hdr.trans_id);
> +
> +    balloon->trans_id++;
> +    HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
> +}
> +
> +static void hv_balloon_handle_hot_add_response(HvBalloon *balloon,
> +                                               HvBalloonReq *req,
> +                                               StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_hot_add_response *msgHaR = vmreq->msg;
> +    HAPVDIMMRange *hpr;
> +
> +    if (balloon->state != S_HOT_ADD_REPLY_WAIT) {
> +        warn_report("unexpected DM_HOT_ADD_RESPONSE in %d state",
> +                    balloon->state);
> +        return;
> +    }
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgHaR),
> +                                    "DM_HOT_ADD_RESPONSE")) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_incoming_hot_add(msgHaR->hdr.trans_id, msgHaR->result,
> +                                      msgHaR->page_count);
> +
> +    balloon->trans_id++;
> +
> +    assert(balloon->ha_todo);
> +    hpr = balloon->ha_todo->data;
> +
> +    if (msgHaR->result) {
> +        if (msgHaR->page_count > balloon->ha_current_count) {
> +            warn_report("DM_HOT_ADD_RESPONSE page count higher than requested (%"PRIu32" vs %"PRIu64")",
> +                        msgHaR->page_count, balloon->ha_current_count);
> +            msgHaR->page_count = balloon->ha_current_count;
> +        }
> +
> +        hpr->used += msgHaR->page_count;
> +    }
> +
> +    if (!msgHaR->result || msgHaR->page_count < balloon->ha_current_count) {
> +        if (hpr->used == 0) {
> +            /*
> +             * apparently the guest didn't like the current range at all,
> +             * let's try the next one
> +             */
> +            HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_SKIP_CURRENT);
> +            return;
> +        }
> +
> +        /*
> +         * the current planned range was only partially hot-added, take note
> +         * how much of it remains and don't attempt any further hot adds
> +         */
> +        hpr->unused_tail = hpr->range.count - hpr->unused_head - hpr->used;
> +
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_PROCESSED_CLEAR_PENDING);
> +        return;
> +    }
> +
> +    /* any pages remaining in this hpr? */
> +    if (hpr->range.count - hpr->unused_head - hpr->used -
> +        hpr->unused_tail > 0) {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_RB_WAIT);
> +    } else {
> +        HV_BALLOON_STATE_DESC_SET(stdesc, S_HOT_ADD_PROCESSED_NEXT);
> +    }
> +}
> +
> +static void hv_balloon_handle_balloon_response(HvBalloon *balloon,
> +                                               HvBalloonReq *req,
> +                                               StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_balloon_response *msgBR = vmreq->msg;
> +
> +    if (balloon->state != S_BALLOON_REPLY_WAIT) {
> +        warn_report("unexpected DM_BALLOON_RESPONSE in %d state",
> +                    balloon->state);
> +        return;
> +    }
> +
> +    if (!hv_balloon_handle_msg_size(req, sizeof(*msgBR),
> +                                    "DM_BALLOON_RESPONSE")) {
> +        return;
> +    }
> +
> +    trace_hv_balloon_incoming_balloon(msgBR->hdr.trans_id, msgBR->range_count,
> +                                      msgBR->more_pages);
> +
> +    if (vmreq->msglen < sizeof(*msgBR) +
> +        (uint64_t)sizeof(msgBR->range_array[0]) * msgBR->range_count) {
> +        warn_report("DM_BALLOON_RESPONSE too short for the range count");
> +        return;
> +    }
> +
> +    if (msgBR->range_count == 0) {
> +        /* The guest is already at its minimum size */
> +        msgBR->more_pages = 0;
> +        balloon->target_diff = 0;
> +    } else {
> +        hv_balloon_handle_remove_ranges(balloon,
> +                                        msgBR->range_array,
> +                                        msgBR->range_count);
> +    }
> +
> +    if (!msgBR->more_pages) {
> +        balloon->trans_id++;
> +
> +        if (balloon->target_diff > 0) {
> +            HV_BALLOON_STATE_DESC_SET(stdesc, S_BALLOON_RB_WAIT);
> +        } else {
> +            HV_BALLOON_STATE_DESC_SET(stdesc, S_IDLE);
> +        }
> +    }
> +}
> +
> +static void hv_balloon_handle_packet(HvBalloon *balloon, HvBalloonReq *req,
> +                                     StateDesc *stdesc)
> +{
> +    VMBusChanReq *vmreq = &req->vmreq;
> +    struct dm_message *msg = vmreq->msg;
> +
> +    if (vmreq->msglen < sizeof(msg->hdr)) {
> +        return;
> +    }
> +
> +    switch (msg->hdr.type) {
> +    case DM_VERSION_REQUEST:
> +        hv_balloon_handle_version_request(balloon, req, stdesc);
> +        break;
> +
> +    case DM_CAPABILITIES_REPORT:
> +        hv_balloon_handle_caps_report(balloon, req, stdesc);
> +        break;
> +
> +    case DM_STATUS_REPORT:
> +        hv_balloon_handle_status_report(balloon, req);
> +        break;
> +
> +    case DM_MEM_HOT_ADD_RESPONSE:
> +        hv_balloon_handle_hot_add_response(balloon, req, stdesc);
> +        break;
> +
> +    case DM_UNBALLOON_RESPONSE:
> +        hv_balloon_handle_unballoon_response(balloon, req, stdesc);
> +        break;
> +
> +    case DM_BALLOON_RESPONSE:
> +        hv_balloon_handle_balloon_response(balloon, req, stdesc);
> +        break;
> +
> +    default:
> +        warn_report("unknown DM message %u", msg->hdr.type);
> +        break;
> +    }
> +}
> +
> +static bool hv_balloon_recv_channel(HvBalloon *balloon, StateDesc *stdesc)
> +{
> +    VMBusChannel *chan;
> +    HvBalloonReq *req;
> +
> +    if (balloon->state == S_WAIT_RESET ||
> +        balloon->state == S_CLOSED) {
> +        return false;
> +    }
> +
> +    chan = hv_balloon_get_channel(balloon);
> +    if (vmbus_channel_recv_start(chan)) {
> +        return false;
> +    }
> +
> +    while ((req = vmbus_channel_recv_peek(chan, sizeof(*req)))) {
> +        hv_balloon_handle_packet(balloon, req, stdesc);
> +        vmbus_free_req(req);
> +        vmbus_channel_recv_pop(chan);
> +
> +        if (stdesc->state != S_NO_CHANGE) {
> +            break;
> +        }
> +    }
> +
> +    return vmbus_channel_recv_done(chan) > 0;
> +}
> +
> +static bool hv_balloon_event_loop_state(HvBalloon *balloon)
> +{
> +    StateDesc state_new = HV_BALLOON_STATE_DESC_INIT;
> +
> +    hv_balloon_handle_state(balloon, &state_new);
> +    return hv_balloon_state_set(balloon, state_new.state, state_new.desc);
> +}
> +
> +static bool hv_balloon_event_loop_recv(HvBalloon *balloon)
> +{
> +    StateDesc state_new = HV_BALLOON_STATE_DESC_INIT;
> +    bool any_recv, state_changed;
> +
> +    any_recv = hv_balloon_recv_channel(balloon, &state_new);
> +    state_changed = hv_balloon_state_set(balloon,
> +                                         state_new.state, state_new.desc);
> +
> +    return state_changed || any_recv;
> +}
> +
> +static void hv_balloon_event_loop(HvBalloon *balloon)
> +{
> +    bool state_repeat, recv_repeat;
> +
> +    do {
> +        state_repeat = hv_balloon_event_loop_state(balloon);
> +        recv_repeat = hv_balloon_event_loop_recv(balloon);
> +    } while (state_repeat || recv_repeat);
> +}
> +
> +void qmp_hv_balloon_add_memory(const char *id, Error **errp)
> +{
> +    HvBalloon *balloon;
> +    uint64_t align;
> +    g_autofree gchar *align_str = NULL;
> +    g_autoptr(QDict) qdict = NULL;
> +    g_autoptr(DeviceState) dev = NULL;
> +    HAPVDIMMDevice *hapvdimm;
> +    PageRange range;
> +    HAPVDIMMRange *hpr;
> +
> +    balloon = HV_BALLOON(object_resolve_path_type("", TYPE_HV_BALLOON, NULL));
> +    if (!balloon) {
> +        error_setg(errp, "no %s device present", TYPE_HV_BALLOON);
> +        return;
> +    }
> +
> +    if (hv_balloon_state_is_init(balloon)) {
> +        error_setg(errp, "no guest attached to the DM protocol yet");
> +        return;
> +    }
> +
> +    if (!balloon->caps.cap_bits.hot_add) {
> +        error_setg(errp,
> +                   "the current DM protocol guest has no support for memory hot add");
> +        return;
> +    }
> +
> +    /* add device */
> +    qdict = qdict_new();
> +    qdict_put_str(qdict, "driver", TYPE_HAPVDIMM);
> +    qdict_put_str(qdict, HAPVDIMM_MEMDEV_PROP, id);
> +
> +    align = (1 << balloon->caps.cap_bits.hot_add_alignment) * MiB;
> +    align_str = g_strdup_printf("%" PRIu64, align);
> +    qdict_put_str(qdict, HAPVDIMM_ALIGN_PROP, align_str);
> +
> +    hapvdimm_allow_adding();
> +    dev = qdev_device_add_from_qdict(qdict, false, errp);
> +    hapvdimm_disallow_adding();
> +    if (!dev) {
> +        return;
> +    }
> +
> +    hapvdimm = HAPVDIMM(dev);
> +
> +    hapvdimm_get_range(hapvdimm, &range);
> +    if (page_range_tree_intree_any(balloon->removed_guest,
> +                                   range.start, range.count) ||
> +        page_range_tree_intree_any(balloon->removed_both,
> +                                   range.start, range.count)) {
> +        error_setg(errp,
> +                   "some of the device's new pages were already returned by the guest; this should not happen, please reboot the guest and try again");
> +        return;
> +    }
> +
> +    trace_hv_balloon_hapvdimm_range_add(range.count, range.start);
> +
> +    hpr = hapvdimm_tree_insert_new(balloon, hapvdimm);
> +
> +    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
> +
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_notify_cb(VMBusChannel *chan)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
> +
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_stat(void *opaque, BalloonInfo *info)
> +{
> +    HvBalloon *balloon = opaque;
> +    info->actual = (hv_balloon_total_ram(balloon) - balloon->removed_both_ctr)
> +        << HV_BALLOON_PFN_SHIFT;
> +}
> +
> +static void hv_balloon_to_target(void *opaque, ram_addr_t target)
> +{
> +    HvBalloon *balloon = opaque;
> +    uint64_t target_pages = target >> HV_BALLOON_PFN_SHIFT;
> +
> +    if (!target_pages) {
> +        return;
> +    }
> +
> +    /*
> +     * always set target_changed, even if the target itself is unchanged,
> +     * as the user might be asking us to try reaching it again
> +     */
> +    balloon->target = target_pages;
> +    balloon->target_changed = true;
> +
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static int hv_balloon_open_channel(VMBusChannel *chan)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
> +
> +    if (balloon->state != S_CLOSED) {
> +        warn_report("guest trying to open a DM channel in invalid %d state",
> +                    balloon->state);
> +        return -EINVAL;
> +    }
> +
> +    HV_BALLOON_SET_STATE(balloon, S_VERSION);
> +    hv_balloon_event_loop(balloon);
> +
> +    return 0;
> +}
> +
> +static void hv_balloon_close_channel(VMBusChannel *chan)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
> +
> +    timer_del(&balloon->post_init_timer);
> +
> +    HV_BALLOON_SET_STATE(balloon, S_WAIT_RESET);
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_post_init_timer(void *opaque)
> +{
> +    HvBalloon *balloon = opaque;
> +
> +    if (balloon->state != S_POST_INIT_WAIT) {
> +        return;
> +    }
> +
> +    HV_BALLOON_SET_STATE(balloon, S_IDLE);
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_system_reset(void *opaque)
> +{
> +    HvBalloon *balloon = HV_BALLOON(opaque);
> +
> +    if (!balloon->hapvdimms_del_todo) {
> +        return;
> +    }
> +
> +    if (balloon->del_todo_process_timer) {
> +        return;
> +    }
> +
> +    balloon->del_todo_process_timer = g_idle_add(del_todo_process_timer,
> +                                                 balloon);
> +}
> +
> +static void hv_balloon_dev_realize(VMBusDevice *vdev, Error **errp)
> +{
> +    ERRP_GUARD();
> +    HvBalloon *balloon = HV_BALLOON(vdev);
> +    int ret;
> +
> +    /* used by hv_balloon_stat() */
> +    balloon->hapvdimms = hapvdimm_tree_new();
> +    balloon->state = S_WAIT_RESET;
> +
> +    ret = qemu_add_balloon_handler(hv_balloon_to_target, hv_balloon_stat,
> +                                   balloon);
> +    if (ret < 0) {
> +        /* This also protects against having multiple hv-balloon instances */
> +        error_setg(errp, "Only one balloon device is supported");
> +        goto ret_tree;
> +    }
> +
> +    timer_init_ms(&balloon->post_init_timer, QEMU_CLOCK_VIRTUAL,
> +                  hv_balloon_post_init_timer, balloon);
> +
> +    qemu_register_reset(hv_balloon_system_reset, balloon);
> +
> +    return;
> +
> +ret_tree:
> +    hapvdimm_tree_destroy(&balloon->hapvdimms);
> +}
> +
> +static void hv_balloon_reset_destroy_common(HvBalloon *balloon)
> +{
> +    ha_todo_clear(balloon);
> +    del_todo_add_all(balloon);
> +}
> +
> +static void hv_balloon_dev_reset(VMBusDevice *vdev)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vdev);
> +
> +    page_range_tree_destroy(&balloon->removed_guest);
> +    page_range_tree_destroy(&balloon->removed_both);
> +    balloon->removed_guest = page_range_tree_new();
> +    balloon->removed_both = page_range_tree_new();
> +
> +    hv_balloon_reset_destroy_common(balloon);
> +
> +    balloon->trans_id = 0;
> +    balloon->removed_guest_ctr = 0;
> +    balloon->removed_both_ctr = 0;
> +
> +    HV_BALLOON_SET_STATE(balloon, S_CLOSED);
> +    hv_balloon_event_loop(balloon);
> +}
> +
> +static void hv_balloon_dev_unrealize(VMBusDevice *vdev)
> +{
> +    HvBalloon *balloon = HV_BALLOON(vdev);
> +
> +    qemu_unregister_reset(hv_balloon_system_reset, balloon);
> +
> +    hv_balloon_reset_destroy_common(balloon);
> +
> +    del_todo_process(balloon);
> +    assert(!balloon->del_todo_process_timer);
> +
> +    qemu_remove_balloon_handler(balloon);
> +
> +    page_range_tree_destroy(&balloon->removed_guest);
> +    page_range_tree_destroy(&balloon->removed_both);
> +    hapvdimm_tree_destroy(&balloon->hapvdimms);
> +}
> +
> +static Property hv_balloon_properties[] = {
> +    DEFINE_PROP_BOOL("status-report", HvBalloon,
> +                     status_reports, false),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void hv_balloon_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    VMBusDeviceClass *vdc = VMBUS_DEVICE_CLASS(klass);
> +
> +    device_class_set_props(dc, hv_balloon_properties);
> +    qemu_uuid_parse(HV_BALLOON_GUID, &vdc->classid);
> +    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
> +    vdc->vmdev_realize = hv_balloon_dev_realize;
> +    vdc->vmdev_unrealize = hv_balloon_dev_unrealize;
> +    vdc->vmdev_reset = hv_balloon_dev_reset;
> +    vdc->open_channel = hv_balloon_open_channel;
> +    vdc->close_channel = hv_balloon_close_channel;
> +    vdc->chan_notify_cb = hv_balloon_notify_cb;
> +}
> +
> +static const TypeInfo hv_balloon_type_info = {
> +    .name = TYPE_HV_BALLOON,
> +    .parent = TYPE_VMBUS_DEVICE,
> +    .instance_size = sizeof(HvBalloon),
> +    .class_init = hv_balloon_class_init,
> +};
> +
> +static void hv_balloon_register_types(void)
> +{
> +    type_register_static(&hv_balloon_type_info);
> +}
> +
> +type_init(hv_balloon_register_types)
> diff --git a/hw/hyperv/meson.build b/hw/hyperv/meson.build
> index b43f119ea5..212e0ce51e 100644
> --- a/hw/hyperv/meson.build
> +++ b/hw/hyperv/meson.build
> @@ -2,3 +2,4 @@ specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'))
>  specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c'))
>  specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c'))
>  specific_ss.add(when: 'CONFIG_SYNDBG', if_true: files('syndbg.c'))
> +specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c'))
> diff --git a/hw/hyperv/trace-events b/hw/hyperv/trace-events
> index b4c35ca8e3..3b98ac3689 100644
> --- a/hw/hyperv/trace-events
> +++ b/hw/hyperv/trace-events
> @@ -16,3 +16,19 @@ vmbus_gpadl_torndown(uint32_t gpadl_id) "gpadl #%d"
>  vmbus_open_channel(uint32_t chan_id, uint32_t gpadl_id, uint32_t target_vp) "channel #%d gpadl #%d target vp %d"
>  vmbus_channel_open(uint32_t chan_id, uint32_t status) "channel #%d status %d"
>  vmbus_close_channel(uint32_t chan_id) "channel #%d"
> +
> +# hv-balloon
> +hv_balloon_state_change(const char *tostr) "-> %s"
> +hv_balloon_incoming_version(uint16_t major, uint16_t minor) "incoming proto version %u.%u"
> +hv_balloon_incoming_caps(uint32_t caps) "incoming caps 0x%x"
> +hv_balloon_outgoing_unballoon(uint32_t trans_id, uint64_t count, uint64_t start, uint64_t rempages) "posting unballoon %"PRIu32" for %"PRIu64" @ 0x%"PRIx64", remaining %"PRIu64
> +hv_balloon_incoming_unballoon(uint32_t trans_id) "incoming unballoon response %"PRIu32
> +hv_balloon_outgoing_hot_add(uint32_t trans_id, uint64_t count, uint64_t start) "posting hot add %"PRIu32" for %"PRIu64" @ 0x%"PRIx64
> +hv_balloon_incoming_hot_add(uint32_t trans_id, uint32_t result, uint32_t count) "incoming hot add response %"PRIu32", result %"PRIu32", count %"PRIu32
> +hv_balloon_outgoing_balloon(uint32_t trans_id, uint64_t count, uint64_t rempages) "posting balloon %"PRIu32" for %"PRIu64", remaining %"PRIu64
> +hv_balloon_incoming_balloon(uint32_t trans_id, uint32_t range_count, uint32_t more_pages) "incoming balloon response %"PRIu32", ranges %"PRIu32", more %"PRIu32
> +hv_balloon_hapvdimm_range_add(uint64_t count, uint64_t start) "adding hapvdimm range %"PRIu64" @ 0x%"PRIx64
> +hv_balloon_remove_response(uint64_t count, uint64_t start, unsigned int both) "processing remove response range %"PRIu64" @ 0x%"PRIx64", both %u"
> +hv_balloon_remove_response_hole(uint64_t counthole, uint64_t starthole, uint64_t countrange, uint64_t startrange, uint64_t starthpr, unsigned int both) "response range hole %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64", before hpr start 0x%"PRIx64", both %u"
> +hv_balloon_remove_response_common(uint64_t countcommon, uint64_t startcommon, uint64_t countrange, uint64_t startrange, uint64_t counthpr, uint64_t starthpr, uint64_t removed, unsigned int both) "response common range %"PRIu64" @ 0x%"PRIx64" from range %"PRIu64" @ 0x%"PRIx64" with hpr %"PRIu64" @ 0x%"PRIx64", removed %"PRIu64", both %u"
> +hv_balloon_remove_response_remainder(uint64_t count, uint64_t start, unsigned int both) "remove response remaining range %"PRIu64" @ 0x%"PRIx64", both %u"
> diff --git a/meson.build b/meson.build
> index 6cb2b1a42f..2d9c01b6ec 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2550,7 +2550,8 @@ host_kconfig = \
>    ('CONFIG_LINUX' in config_host ? ['CONFIG_LINUX=y'] : []) + \
>    (have_pvrdma ? ['CONFIG_PVRDMA=y'] : []) + \
>    (multiprocess_allowed ? ['CONFIG_MULTIPROCESS_ALLOWED=y'] : []) + \
> -  (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : [])
> +  (vfio_user_server_allowed ? ['CONFIG_VFIO_USER_SERVER_ALLOWED=y'] : []) + \
> +  ('CONFIG_HV_BALLOON_POSSIBLE' in config_host ? ['CONFIG_HV_BALLOON_POSSIBLE=y'] : [])
>  
>  ignored = [ 'TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_ARCH' ]
>  
> @@ -4027,6 +4028,7 @@ summary_info += {'libudev':           libudev}
>  summary_info += {'FUSE lseek':        fuse_lseek.found()}
>  summary_info += {'selinux':           selinux}
>  summary_info += {'libdw':             libdw}
> +summary_info += {'hv-balloon support': config_host.has_key('CONFIG_HV_BALLOON_POSSIBLE')}
>  summary(summary_info, bool_yn: true, section: 'Dependencies')
>  
>  if not supported_cpus.contains(cpu)
> diff --git a/qapi/machine.json b/qapi/machine.json
> index b9228a5e46..04ff95337a 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1104,6 +1104,74 @@
>  { 'event': 'BALLOON_CHANGE',
>    'data': { 'actual': 'int' } }
>  
> +##
> +# @hv-balloon-add-memory:
> +#
> +# Hot-add memory backend via Hyper-V Dynamic Memory Protocol.
> +#
> +# @id: the name of the memory backend object to hot-add
> +#
> +# Returns: Nothing on success
> +#          Error if there's no guest connected with hot-add capability,
> +#          @id is not a valid memory backend or it's already in use.
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# -> { "execute": "hv-balloon-add-memory", "arguments": { "id": "mb1" } }
> +# <- { "return": {} }
> +#
> +##
> +{ 'command': 'hv-balloon-add-memory', 'data': {'id': 'str'} }
> +
> +##
> +# @HV_BALLOON_STATUS_REPORT:
> +#
> +# Emitted when the hv-balloon driver receives a "STATUS" message from
> +# the guest.
> +#
> +# @committed: the amount of memory in use inside the guest plus the amount
> +#            of the memory unusable inside the guest (ballooned out,
> +#            offline, etc.)
> +#
> +# @available: the amount of the memory inside the guest available for new
> +#             allocations ("free")
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# <- { "event": "HV_BALLOON_STATUS_REPORT",
> +#      "data": { "commited": 816640000, "available": 3333054464 },
> +#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
> +#
> +##
> +{ 'event': 'HV_BALLOON_STATUS_REPORT',
> +  'data': { 'committed': 'size', 'available': 'size' } }
> +
> +##
> +# @HV_BALLOON_MEMORY_BACKEND_UNUSED:
> +#
> +# Emitted when the hv-balloon driver marks a memory backend object
> +# unused so it can now be removed, if required.
> +#
> +# This can happen because the VM was restarted.
> +#
> +# @id: the memory backend object id
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# <- { "event": "HV_BALLOON_MEMORY_BACKEND_UNUSED",
> +#      "data": { "id": "mb1" },
> +#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
> +#
> +##
> +{ 'event': 'HV_BALLOON_MEMORY_BACKEND_UNUSED',
> +  'data': { 'id': 'str' } }
> +
>  ##
>  # @MemoryInfo:
>  #
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2023-02-28 16:18   ` Igor Mammedov
@ 2023-02-28 17:12     ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-02-28 17:12 UTC (permalink / raw)
  To: Igor Mammedov, Maciej S. Szmigiero
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel

On 28.02.23 17:18, Igor Mammedov wrote:
> On Fri, 24 Feb 2023 22:41:16 +0100
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> wrote:
> 
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This driver is like virtio-balloon on steroids: it allows both changing the
>> guest memory allocation via ballooning and inserting extra RAM into it by
>> adding required memory backends and providing them to the driver.
> 
> 
> this sounds pretty much like what virtio-mem does, modulo the protocol used.
> Would it be too crazy to ask to reuse virtio-mem by teaching it the new
> protocol, avoiding a new device with all the mgmt hurdles that virtio-mem
> has already solved?

There are several fundamental differences between the two approaches that
make a 1:1 reuse impossible. As one example, the hv-balloon can operate
(inflate) on the whole VM memory, which is very different from the
virtio-mem model. As another example, the hv-balloon does not support
variable (large) block sizes, and must be able to operate at page
granularity IIRC. This not only restricts which memory backends we can use,
it also means that vfio support is rather problematic (just like with
virtio-balloon).

So there is more to it than a simple protocol difference, and I don't
think we can simply implement a proxy device.

But I do think that we would be able to reuse some of the 
ideas/infrastructure virtio-mem implemented: for example, using a single 
large sparse memory region.
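
Something along these lines, as a minimal sketch of that idea (the
function names, the owner/offset handling and the 128 GiB window size
are made up for illustration; only memory_region_init(),
memory_region_add_subregion() and host_memory_backend_get_memory() are
the real QEMU APIs):

#include "qemu/osdep.h"
#include "qemu/units.h"
#include "exec/memory.h"
#include "sysemu/hostmem.h"

/* once, at realize time: a large container that is pure address space */
static void hot_add_window_init(Object *owner, MemoryRegion *container)
{
    memory_region_init(container, owner, "hot-add-window", 128 * GiB);
}

/* per hot-added backend: back a slice of the window with actual RAM */
static void hot_add_window_plug(MemoryRegion *container, hwaddr offset,
                                HostMemoryBackend *backend)
{
    memory_region_add_subregion(container, offset,
                                host_memory_backend_get_memory(backend));
}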

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2023-02-24 21:41 ` [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2023-02-28 16:18   ` Igor Mammedov
@ 2023-02-28 17:34   ` Daniel P. Berrangé
  2023-02-28 21:24     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel P. Berrangé @ 2023-02-28 17:34 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

On Fri, Feb 24, 2023 at 10:41:16PM +0100, Maciej S. Szmigiero wrote:

> Hot-adding additional memory is done by creating a new memory backend (for
> example by executing the HMP command
> "object_add memory-backend-ram,id=mem1,size=4G"), then executing a new
> "hv-balloon-add-memory" QMP command, providing the id of that memory
> backend as the "id" parameter.

[snip]

> After a VM reboot, each previously hot-added memory backend gets released.
> A "HV_BALLOON_MEMORY_BACKEND_UNUSED" QMP event is emitted in this case so
> the software controlling QEMU knows that it either needs to delete that
> memory backend (if no longer needed) or re-insert it.

IIUC you're saying that the 'hv-balloon-add-memory' command needs
to be re-run after a guest reset? If so, I feel that is a rather
undesirable job to punt over to the mgmt app. The 'reset' event can
be missed if the mgmt app happened to be restarting and reconnecting
to an existing running QMP console.

> In the future, the guest boot memory size might be changed on reboot
> instead, taking into account the effective size that the VM had before that
> reboot (much like Hyper-V does).

Is that difficult to do right now? It isn't too nice to make the
mgmt apps implement the workaround now if we're going to make it
redundant later.

> The above design results in much better ballooning performance than when
> using virtio-balloon with the same guest: 230 GB / minute with this driver
> versus 70 GB / minute with virtio-balloon.

snip

> The unballoon operation is also pretty much instantaneous:
> thanks to the merging of the ballooned-out page ranges, 200 GB of memory
> can be returned to the guest in about 1 second.
> With virtio-balloon this operation takes about 2.5 minutes.

That's pretty impressive!

> These tests were done against a Windows Server 2019 guest running on a
> Xeon E5-2699, after dirtying the whole memory inside the guest before each
> balloon operation.



> Since the required GTree operations aren't present in every GLib version,
> a check for them was added to the "configure" script, together with new
> "--enable-hv-balloon" and "--disable-hv-balloon" arguments.
> If these GTree operations are missing in the system's GLib version, this
> driver will be skipped during the QEMU build.

Funnily enough there's a patch posted recently that imports the GLib
GTree impl into QEMU, calling it QTree. This was to work around a problem
with GSlice not being async-signal-safe, but if we take that patch, then
you wouldn't need to skip the build; you could rely on this in-tree copy
instead.

https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg01225.html


> diff --git a/qapi/machine.json b/qapi/machine.json
> index b9228a5e46..04ff95337a 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -1104,6 +1104,74 @@
>  { 'event': 'BALLOON_CHANGE',
>    'data': { 'actual': 'int' } }
>  
> +##
> +# @hv-balloon-add-memory:
> +#
> +# Hot-add memory backend via Hyper-V Dynamic Memory Protocol.
> +#
> +# @id: the name of the memory backend object to hot-add
> +#
> +# Returns: Nothing on success
> +#          Error if there's no guest connected with hot-add capability,
> +#          @id is not a valid memory backend or it's already in use.
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# -> { "execute": "hv-balloon-add-memory", "arguments": { "id": "mb1" } }
> +# <- { "return": {} }
> +#
> +##
> +{ 'command': 'hv-balloon-add-memory', 'data': {'id': 'str'} }
> +
> +##
> +# @HV_BALLOON_STATUS_REPORT:
> +#
> +# Emitted when the hv-balloon driver receives a "STATUS" message from
> +# the guest.
> +#
> +# @committed: the amount of memory in use inside the guest plus the amount
> +#            of the memory unusable inside the guest (ballooned out,
> +#            offline, etc.)
> +#
> +# @available: the amount of the memory inside the guest available for new
> +#             allocations ("free")
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# <- { "event": "HV_BALLOON_STATUS_REPORT",
> +#      "data": { "commited": 816640000, "available": 3333054464 },
> +#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
> +#
> +##
> +{ 'event': 'HV_BALLOON_STATUS_REPORT',
> +  'data': { 'committed': 'size', 'available': 'size' } }
> +
> +##
> +# @HV_BALLOON_MEMORY_BACKEND_UNUSED:
> +#
> +# Emitted when the hv-balloon driver marks a memory backend object
> +# unused so it can now be removed, if required.
> +#
> +# This can happen because the VM was restarted.
> +#
> +# @id: the memory backend object id
> +#
> +# Since: TBD
> +#
> +# Example:
> +#
> +# <- { "event": "HV_BALLOON_MEMORY_BACKEND_UNUSED",
> +#      "data": { "id": "mb1" },
> +#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
> +#
> +##
> +{ 'event': 'HV_BALLOON_MEMORY_BACKEND_UNUSED',
> +  'data': { 'id': 'str' } }

There is a reply from Igor about possibility of sharing code with
virtio-mem. I also wonder if there's any scope for sharing with
the virtio-balloon driver too, in terms of the QAPI schema.

I've not looked closely enough to say if its possible to not, so
if not practical, no worries.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2023-02-28 17:34   ` Daniel P. Berrangé
@ 2023-02-28 21:24     ` Maciej S. Szmigiero
  0 siblings, 0 replies; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-28 21:24 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, David Hildenbrand, qemu-devel

On 28.02.2023 18:34, Daniel P. Berrangé wrote:
> On Fri, Feb 24, 2023 at 10:41:16PM +0100, Maciej S. Szmigiero wrote:
> 
>> Hot-adding additional memory is done by creating a new memory backend (for
>> example by executing the HMP command
>> "object_add memory-backend-ram,id=mem1,size=4G"), then executing a new
>> "hv-balloon-add-memory" QMP command, providing the id of that memory
>> backend as the "id" parameter.
> 
> [snip]
> 
>> After a VM reboot, each previously hot-added memory backend gets released.
>> A "HV_BALLOON_MEMORY_BACKEND_UNUSED" QMP event is emitted in this case so
>> the software controlling QEMU knows that it either needs to delete that
>> memory backend (if no longer needed) or re-insert it.
> 
> IIUC you're saying that the 'hv-balloon-add-memory' command needs
> to be re-run after a guest reset ?

Yes.

> If so I feel that is a rather undesirable job to punt over to the
> mgmt app. The 'reset' event can be missed if the mgmt app happened
> to be restarting and reconnecting to an existing running QMP console.

See the answer below the next paragraph.

>> In the future, the guest boot memory size might be changed on reboot
>> instead, taking into account the effective size that VM had before that
>> reboot (much like Hyper-V does).
> 
> Is that difficult to do right now ?  It isn't too nice to make the
> mgmt apps implement the workaround now if we're going to make it
> redundant later.

The v1 of this driver did re-add memory backends automatically after
a reboot, so if that's something that is desirable it can be re-introduced
without much difficulty.

The issue here is that the guest might never re-connect to the DM protocol
interface after a reboot (perhaps because the VM was rebooted from
a Windows to a Linux guest).
In this case the driver would wait endlessly, not letting the
underlying memory backends be removed.

virtio-mem also seems to unplug all blocks unconditionally when the VM is
rebooted.

On the other hand, actually resizing the guest boot memory is definitely
not trivial - for sure that's something for future work
(virtio-mem might also benefit from it).

>> The above design results in much better ballooning performance than when
>> using virtio-balloon with the same guest: 230 GB / minute with this driver
>> versus 70 GB / minute with virtio-balloon.
> 
> snip
> 
>> The unballoon operation is also pretty much instantaneous:
>> thanks to the merging of the ballooned out page ranges 200 GB of memory can
>> be returned to the guest in about 1 second.
>> With virtio-balloon this operation takes about 2.5 minutes.
> 
> That's pretty impressive !

Thanks!

>> These tests were done against a Windows Server 2019 guest running on a
>> Xeon E5-2699, after dirtying the whole memory inside guest before each
>> balloon operation.
> 
> 
> 
>> Since the required GTree operations aren't present in every Glib version
>> a check for them was added to "configure" script, together with new
>> "--enable-hv-balloon" and "--disable-hv-balloon" arguments.
>> If these GTree operations are missing in the system's Glib version this
>> driver will be skipped during QEMU build.
> 
> Funnily enough there's a patch posted recently that imports the glib
> GTree impl into QEMU, calling it QTree. This was to work around a problem
> with GSlice not being async-signal-safe, but if we take that patch, then
> you wouldn't need to skip the build; you could rely on this in-tree copy
> instead.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2023-02/msg01225.html

Thanks for the pointer; however, the {G,Q}Tree import explicitly excludes
the tree operations that this driver needs (as they currently don't
have any callers in QEMU).

So in this case either they would have to be imported too or the driver
would need QEMU being built with the upstream Glib (as far as I can see,
[1] says this will still be possible with glib >= 2.76.0).

Thanks,
Maciej

[1]: https://gitlab.com/qemu-project/qemu/-/issues/285



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-28 15:02       ` David Hildenbrand
@ 2023-02-28 21:27         ` Maciej S. Szmigiero
  2023-02-28 22:12           ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-02-28 21:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 28.02.2023 16:02, David Hildenbrand wrote:
>>
>> That was more or less the approach that v1 of this driver took:
>> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices,
>> whatever one calls them) explicitly via the machine hotplug handler
>> (using the device_add command).
>>
>> At that time you said [1] that:
>>> 1) I dislike that an external entity has to do vDIMM adaptions /
>>> ballooning adaptions when rebooting or when wanting to resize a guest.
>>
>> because:
>>> Once you have the current approach upstream (vDIMMs, ballooning),
>>> there is no easy way to change that later (requires deprecating, etc.).
>>
>> That's why this version hides these vDIMMs.
> 
> Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO).
> 
> Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend. 

Well, the logic here is pretty simple: deflate the balloon first
(including deflating it by zero bytes if not inflated), then, if any
memory size remains to add, hot-add the remainder.

We can't get rid of ballooning altogether because otherwise going
below the boot memory size wouldn't be possible.

> What memory backends will remain when we reboot?

In this driver version, none will remain inserted
(virtio-mem also seems to unplug all blocks unconditionally when the
VM is rebooted).

In version 1, all memory backends were re-inserted once the guest
re-connected to the DM protocol after a reboot.

As I wrote in my response to Daniel moments ago, there are some issues
with automatic re-insertion if the guest never re-connects to the DM
protocol - that's why I've removed this functionality from this
driver version.

> When can we remove memory backends?

There's a QMP event generated when a memory backend can be removed:
HV_BALLOON_MEMORY_BACKEND_UNUSED
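
For illustration, the full backend lifecycle in QMP terms might look
like the following sketch (the backend id "mb1" and its size are just
examples):

-> { "execute": "object-add",
     "arguments": { "qom-type": "memory-backend-ram",
                    "id": "mb1", "size": 4294967296 } }
<- { "return": {} }
-> { "execute": "hv-balloon-add-memory", "arguments": { "id": "mb1" } }
<- { "return": {} }
(guest reboots, the backend gets marked unused)
<- { "event": "HV_BALLOON_MEMORY_BACKEND_UNUSED",
     "data": { "id": "mb1" },
     "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
-> { "execute": "object-del", "arguments": { "id": "mb1" } }
<- { "return": {} }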

> But that's just about the user interaction in general. My comment here was about the hidden devices: they have to go through plug handlers to get resources assigned, not self-assign resources in the realize function.
> 
> 
> Note that virtio-mem uses a single sparse memory backend to make resizing easier (well, and to handle migration and some other things easier). But it comes with other things that require optimization. Using multiple memslots to expose memory to the VM is one optimization I'm working on. Resizable memory backends are another one.
> 
> I think you could implement the memory adding part similar to virtio-mem, and simply have a large sparse memory backend, from which you expose new memory to the VM as you please. And you could even use multiple memslots for that. But that's your design decision, and I won't argue with that, just pointing that out.
> 
> 
>> Instead, the QEMU manager (user) directly provides the raw memory
>> backend device (for example, memory-backend-ram) to the driver via a QMP
>> command.
> 
> Yes, that's what I understood.
> 
>>
>> Since now the user is not expected to touch these vDIMMs directly in any
>> way these become an implementation detail than can be changed or even
>> removed if needed at some point, without affecting the existing users.
>>
>>> But before we dive into the details of that, I wonder if you could just avoid having a memory device for each block of memory you want to add.
>>>
>>>
>>> An alternative might be the following:
>>>
>>> Have a hv-balloon device be a memory device with a configured maximum size and a memory device region container. Let the machine hotplug handler assign a contiguous region in the device memory region and map the memory device region container (while plugging that hv-balloon device), just like we do it for virtio-mem and virtio-pmem.
>>>
>>> In essence, you reserve a region in physical address space that way and can decide what to (un)map into that memory device region container, you do your own placement.
>>>
>>> So when instructed to add a new memory backend, you simply assign an address in the assigned region yourself, and map the memory backend memory region into the device memory region container.
>>>
>>> The only catch is that that memory device (hv-balloon) will then consume multiple memslots (one for each memory backend), right now we only support 1 memslot (e.g., asking if one more slot is free when plugging the device).
>>>
>>>
>> Technically in this case a "main" hv-balloon device is still needed -
>> in contrast with virtio-mem (which allows multiple instances) there can
>> be only one Dynamic Memory protocol provider on the VMBus.
> 
> Yes, just like virtio-balloon. There cannot be multiple instances.

Right, this has some important consequences (see below).

>>
>> That means these "container" sub-devices would need to register with that
>> main hv-balloon device.
>>
> 
> My question is if they really have to be devices. Why wouldn't it be sufficient to map the memory backends directly into the container? Why is the

See the answer below the next paragraph.

> 
>> However, I'm not sure what is exactly gained by this approach.
>>
>> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface
> 
> No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE.

In case of virtio-mem if one wants to add even more memory than the
"current" backing memory device allows there's always a possibility of
adding yet another virtio-mem-pci device with an additional backing
memory device.

If there would be just the main hv-balloon device (implementing
TYPE_MEMORY_DEVICE) then this would not be possible, since one can't
have multiple DM VMBus devices.

Hence, intermediate sub-devices are necessary (each one implementing
TYPE_MEMORY_DEVICE), which do not sit on the VMBus, in order to allow
adding new backing memory devices (as virtio-mem allows).

>> so they are accounted for properly (the alternative would be to patch
>> the relevant QEMU code all over the place - that's probably why
>> virtio-mem also implements this interface instead).
> 
> Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-pmem were responsible for adding some of these hooks.

I was referring to the necessity of implementing TYPE_MEMORY_DEVICE at
all in hv-balloon driver - if it didn't implement this interface then it
couldn't benefit from the logic in hw/mem/memory-device.c, so it would
need to be open-coded inside the driver and every call to functions
provided by that file from QEMU would need to be patched to account for
the memory provided by this driver.

> 
>>
>> One still needs some QMP command to add a raw memory backend to
>> the chosen "container" hv-balloon sub-device.
> 
> If you go with multiple memory backends, yes.
> 
>>
>> Since now the QEMU manager (user) is aware of the presence of these
>> "container" sub-devices, and has to manage them, changing the QEMU
>> interface in the future is more complex (as you said in [1]).
> Can you elaborate? Yes, when you design the feature around "multiple memory backends", you'll have to have an interface to add such. Well, and to query them during migration. And, maybe also to detect when to remove some (migration)?
> 

As I wrote above, multiple backing memory devices are necessary so the
guest can be expanded above the initially provided backing memory device,
much like virtio-mem already allows.

And then you have to either:
1) Let the hv-balloon driver transparently manage the lifetime of these
sub-devices, like this version of the patch set does, OR:

2) Make the QEMU manager (user) insert and remove these sub-devices
explicitly, like the version 1 of this driver did.

> 
>>
>> I understand that virtio-mem uses a similar approach, however that's
>> because the virtio-mem protocol itself works that way.
>>
>>> I'm adding support for that right now to implement a virtio-mem
>>> extension -- the memory device says how many memslots it requires,
>>> and these will get reserved for that memory device; the memory device
>>> can then consume them later without further checks dynamically. That
>>> approach could be extended to increase/decrease the memslot
>>> requirement (the device would ask to increase/decrease its limit),
>>> if ever required.
>>
>> In terms of future virtio-mem things I'm also eagerly waiting for an
>> ability to set a removed virtio-mem block read-only (or not covered by
>> any memslot) - this most probably could be reused later for implementing
>> the same functionality in this driver.
> 
> In contrast to setting them read-only, the memslots that contain no plugged blocks anymore will be completely removed. The goal is to not consume any metadata overhead in KVM (well, and also do one step into the direction of protecting unplugged memory from getting reallocated).
> 

Nice, looking forward to having this functionality in QEMU for Linux
guests.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-28 21:27         ` Maciej S. Szmigiero
@ 2023-02-28 22:12           ` David Hildenbrand
  2023-03-01 16:26             ` Maciej S. Szmigiero
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-02-28 22:12 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 28.02.23 22:27, Maciej S. Szmigiero wrote:
> On 28.02.2023 16:02, David Hildenbrand wrote:
>>>
>>> That was more or less the approach that v1 of this driver took:
>>> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices,
>>> whatever one calls them) explicitly via the machine hotplug handler
>>> (using the device_add command).
>>>
>>> At that time you said [1] that:
>>>> 1) I dislike that an external entity has to do vDIMM adaptions /
>>>> ballooning adaptions when rebooting or when wanting to resize a guest.
>>>
>>> because:
>>>> Once you have the current approach upstream (vDIMMs, ballooning),
>>>> there is no easy way to change that later (requires deprecating, etc.).
>>>
>>> That's why this version hides these vDIMMs.
>>
>> Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO).
>>
>> Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend.
> 
> Well, the logic here is pretty simple: deflate the balloon first
> (including deflating it by zero bytes if not inflated), then, if any
> memory size remains to add, hot-add the remainder.
> 

Yes, but if you have 1 GiB deflated and want to add 2 GiB, things are 
already getting more involved if you get what I mean.

I was going through the exact same model back when I was designing 
virtio-mem, and eventually added a way where you can just tell QEMU 
the requested size and be done with it.

> We can't get rid of ballooning altogether because otherwise going
> below the boot memory size wouldn't be possible.

Right, more on that below.

> 
>> What memory backends will remain when we reboot?
> 
> In this driver version, none will remain inserted
> (virtio-mem also seems to unplug all blocks unconditionally when the
> VM is rebooted).
> 

There is a very important difference: virtio-mem only temporarily 
unplugs that memory. As the guest boots up it re-adds the requested 
amount of memory without any user interaction. That was added for two 
main reasons

(a) We can easily defragment the virtio-mem device that way.
(b) If the rebooted guest doesn't load the virtio-mem driver, it
     wouldn't be able to make use of that memory. Like, rebooting into
     Windows right now ;)

So if you hotplugged some memory using virtio-mem and reboot, that 
memory will automatically be re-added.

> In version 1, all memory backends were re-inserted once the guest
> re-connected to the DM protocol after a reboot.
> 
> As I wrote in my response to Daniel moments ago, there are some issues
> with automatic re-insertion if the guest never re-connects to the DM
> protocol - that's why I've removed this functionality from this
> driver version.

I think we might be able to do better, but that's just my idea of how 
it could look. I'll describe it below.

[...]

>>> However, I'm not sure what is exactly gained by this approach.
>>>
>>> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface
>>
>> No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE.
> In case of virtio-mem if one wants to add even more memory than the
> "current" backing memory device allows there's always a possibility of
> adding yet another virtio-mem-pci device with an additional backing
> memory device.

We could, but that's not the way I envision virtio-mem. The thing is, 
already when starting QEMU we have to make decisions about the maximum 
VM size when setting the maxmem option. Consequently, we cannot grow a 
VM indefinitely; we already have to plan ahead to some degree.

So what my goal is with virtio-mem is the following (it already works, 
we just have to work on reduction of metadata and memory overcommit 
handling -- mostly internal optimizations):

qemu-kvm ... \
-m 4G,maxmem=1048G \
-object memory-backend-ram,id=mem0,size=1T, ... \
-device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0

So we can grow the guest up to 1T if we like. There is no way we could 
add more memory to that VM because we're already hitting the limit of 
maxmem.
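
Resizing at runtime is then a single property update; as a sketch
(the target size here is arbitrary):

(qemu) qom-set vmem0 requested-size 16G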

It gets more complicated with multiple NUMA nodes, NVDIMMS, etc, but the 
main goal is to make it possible to have the maximum size be 
ridiculously large (while optimizing it internally!) so that one doesn't 
have to even worry about adding a new device.

I think the same model would work for hv as well, at least with my 
limited knowledge about it ;)

> 
> If there would be just the main hv-balloon device (implementing
> TYPE_MEMORY_DEVICE) then this would not be possible, since one can't
> have multiple DM VMBus devices.
> 
> Hence, intermediate sub-devices are necessary (each one implementing
> TYPE_MEMORY_DEVICE), which do not sit on the VMBus, in order to allow
> adding new backing memory devices (as virtio-mem allows).

Not necessarily, I think, as discussed.

> 
>>> so they are accounted for properly (the alternative would be to patch
>>> the relevant QEMU code all over the place - that's probably why
>>> virtio-mem also implements this interface instead).
>>
>> Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-pmem were responsible for adding some of these hooks.
> 
> I was referring to the necessity of implementing TYPE_MEMORY_DEVICE at
> all in hv-balloon driver - if it didn't implement this interface then it
> couldn't benefit from the logic in hw/mem/memory-device.c, so it would
> need to be open-coded inside the driver and every call to functions
> provided by that file from QEMU would need to be patched to account for
> the memory provided by this driver.

Ah, yes, one device has to be a memory device. I was just asking if you 
really need multiple ones.

> 
>>
>>>
>>> One still needs some QMP command to add a raw memory backend to
>>> the chosen "container" hv-balloon sub-device.
>>
>> If you go with multiple memory backends, yes.
>>
>>>
>>> Since now the QEMU manager (user) is aware of the presence of these
>>> "container" sub-devices, and has to manage them, changing the QEMU
>>> interface in the future is more complex (as you said in [1]).
>> Can you elaborate? Yes, when you design the feature around "multiple memory backends", you'll have to have an interface to add such. Well, and to query them during migration. And, maybe also to detect when to remove some (migration)?
>>
> 
> As I wrote above, multiple backing memory devices are necessary so the
> guest can be expanded above the initially provided backing memory device,
> much like virtio-mem already allows.
> 
> And then you have to either:
> 1) Let the hv-balloon driver transparently manage the lifetime of these
> sub-devices, like this version of the patch set does, OR:
> 
> 2) Make the QEMU manager (user) insert and remove these sub-devices
> explicitly, like the version 1 of this driver did.

Let me raise this idea:

qemu-kvm ... \
-m 4G,maxmem=1048G \
-object memory-backend-ram,id=mem0,size=1T, ... \
-device hv-balloon,id=vmem0,memdev=mem0

We'd do the same internal optimizations as we're doing (and the ones I 
am working on) for virtio-mem.

The above would result in a VM with 4G. With virtio-mem, we resize 
devices; with the balloon, you resize the logical VM size.

So the single (existing?) user interface would be the existing balloon 
cmd. Note that we set the logical VM size here, not the size of the balloon.

info balloon -> 4G
balloon 2G [will inflate]
info balloon -> 2G
balloon 128G [will deflate, then hotplug]
info balloon -> 128G
balloon 8G [will deflate]
info balloon -> 8G
...

How memory is added (deflate first, then expose some new memory via the 
memdev, ...) is left to the hv-balloon device; the user doesn't have to 
bother. We set the logical VM size and hv-balloon will do its thing to 
eventually reach that goal.
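
In QMP terms this would presumably map onto the existing balloon
command and query (values in bytes):

-> { "execute": "balloon", "arguments": { "value": 2147483648 } }
<- { "return": {} }
-> { "execute": "query-balloon" }
<- { "return": { "actual": 2147483648 } }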

Reboot? Logically unplug all memory and re-add it after the guest has 
booted up.

The only thing we can't do is the following: when going below 4G, we 
cannot resize boot memory.


But I recall that that's *exactly* how the HV version I played with ~2 
years ago worked: always start up with some initial memory ("startup 
memory"). After the VM is up for some seconds, we either add more memory 
(requested > startup) or request the VM to inflate memory (requested < 
startup).


Even migration could eventually be fairly simple, because virtio-mem 
already solved it to some degree. The only catch is that for boot 
memory, we'd also have to detect discarded ranges. But that would be 
something to think about in the future.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-02-28 22:12           ` David Hildenbrand
@ 2023-03-01 16:26             ` Maciej S. Szmigiero
  2023-03-01 17:24               ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-03-01 16:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 28.02.2023 23:12, David Hildenbrand wrote:
> On 28.02.23 22:27, Maciej S. Szmigiero wrote:
>> On 28.02.2023 16:02, David Hildenbrand wrote:
>>>>
>>>> That was more or less the approach that v1 of this driver took:
>>>> The QEMU manager inserted virtual DIMMs (Hyper-V DM memory devices,
>>>> whatever one calls them) explicitly via the machine hotplug handler
>>>> (using the device_add command).
>>>>
>>>> At that time you said [1] that:
>>>>> 1) I dislike that an external entity has to do vDIMM adaptions /
>>>>> ballooning adaptions when rebooting or when wanting to resize a guest.
>>>>
>>>> because:
>>>>> Once you have the current approach upstream (vDIMMs, ballooning),
>>>>> there is no easy way to change that later (requires deprecating, etc.).
>>>>
>>>> That's why this version hides these vDIMMs.
>>>
>>> Note that I don't have really strong feelings about letting the user hotplug devices. My comment was in general about user interactions when adding/removing memory or when rebooting the VM. As soon as you use individual memory blocks and/or devices, we end up with a similar user experience as we have already with DIMMS+virtio-balloon (bad IMHO).
>>>
>>> Hiding the devices internally might make it a little bit easier to use, but it's still the same underlying concept: to add more memory you have to figure out whether to deflate the balloon or whether to add a new memory backend.
>>
>> Well, the logic here is pretty simple: deflate the balloon first
>> (including deflating it by zero bytes if not inflated), then, if any
>> memory size remains to add, hot-add the remainder.
>>
> 
> Yes, but if you have 1 GiB deflated and want to add 2 GiB, things are already getting more involved if you get what I mean.
> 
> I was going through the exact same model back when I was designing virtio-mem, and eventually added a way where you can just tell QEMU the requested size and be done with it.

Understood, this interface seems obviously more user-friendly.

>> We can't get rid of ballooning altogether because otherwise going
>> below the boot memory size wouldn't be possible.
> 
> Right, more on that below.
> 
>>
>>> What memory backends will remain when we reboot?
>>
>> In this driver version, none will remain inserted
>> (virtio-mem also seems to unplug all blocks unconditionally when the
>> VM is rebooted).
>>
> 
> There is a very important difference: virtio-mem only temporarily unplugs that memory. As the guest boots up it re-adds the requested amount of memory without any user interaction. That was added for two main reasons
> 
> (a) We can easily defragment the virtio-mem device that way.
> (b) If the rebooted guest doesn't load the virtio-mem driver, it
>      wouldn't be able to make use of that memory. Like, rebooting into
>      Windows right now ;)
> 
> So if you hotplugged some memory using virtio-mem and reboot, that memory will automatically be re-added.
> 
>> In version 1, all memory backends were re-inserted once the guest
>> re-connected to the DM protocol after a reboot.
>>
>> As I wrote in my response to Daniel moments ago, there are some issues
>> with automatic re-insertion if the guest never re-connects to the DM
>> protocol - that's why I've removed this functionality from this
>> driver version.
> 
> I think we might be able to do better, but that's just my idea of how it could look. I'll describe it below.
> 
> [...]
> 
>>>> However, I'm not sure what is exactly gained by this approach.
>>>>
>>>> These sub-devices still need to implement the TYPE_MEMORY_DEVICE interface
>>>
>>> No, they wouldn't unless I am missing something. Only the hv-balloon device would be a TYPE_MEMORY_DEVICE.
>> In case of virtio-mem if one wants to add even more memory than the
>> "current" backing memory device allows there's always a possibility of
>> adding yet another virtio-mem-pci device with an additional backing
>> memory device.
> 
> We could, but that's not the way I envision virtio-mem. The thing is, already when starting QEMU we have to make decisions about the maximum VM size when setting the maxmem option. Consequently, we cannot grow a VM indefinitely; we already have to plan ahead to some degree.
> 
> So what my goal is with virtio-mem is the following (it already works, we just have to work on reduction of metadata and memory overcommit handling -- mostly internal optimizations):
> 
> qemu-kvm ... \
> -m 4G,maxmem=1048G \
> -object memory-backend-ram,id=mem0,size=1T, ... \
> -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0
> 
> So we can grow the guest up to 1T if we like. There is no way we could add more memory to that VM because we're already hitting the limit of maxmem.
> 
> It gets more complicated with multiple NUMA nodes, NVDIMMS, etc, but the main goal is to make it possible to have the maximum size be ridiculously large (while optimizing it internally!) so that one doesn't have to even worry about adding a new device.
> 
> I think the same model would work for hv as well, at least with my limited knowledge about it ;)

I understand your idea - responded below, under the hv-balloon example.

>>
>> If there would be just the main hv-balloon device (implementing
>> TYPE_MEMORY_DEVICE) then this would not be possible, since one can't
>> have multiple DM VMBus devices.
>>
>> Hence, intermediate sub-devices are necessary (each one implementing
>> TYPE_MEMORY_DEVICE), which do not sit on the VMBus, in order to allow
>> adding new backing memory devices (as virtio-mem allows).
> 
> Not necessarily, I think, as discussed.
> 
>>
>>>> so they are accounted for properly (the alternative would be to patch
>>>> the relevant QEMU code all over the place - that's probably why
>>>> virtio-mem also implements this interface instead).
>>>
>>> Please elaborate, I don't understand what you are trying to say here. Memory devices provide hooks, and the hooks exist for a reason -- because memory devices are no longer simple DIMMs/NVDIMMs. And virtio-mem + virtio-pmem were responsible for adding some of these hooks.
>>
>> I was referring to the necessity of implementing TYPE_MEMORY_DEVICE at
>> all in hv-balloon driver - if it didn't implement this interface then it
>> couldn't benefit from the logic in hw/mem/memory-device.c, so it would
>> need to be open-coded inside the driver and every call to functions
>> provided by that file from QEMU would need to be patched to account for
>> the memory provided by this driver.
> 
> Ah, yes, one device has to be a memory device. I was just asking if you really need multiple ones.
> 
>>
>>>
>>>>
>>>> One still needs some QMP command to add a raw memory backend to
>>>> the chosen "container" hv-balloon sub-device.
>>>
>>> If you go with multiple memory backends, yes.
>>>
>>>>
>>>> Since now the QEMU manager (user) is aware of the presence of these
>>>> "container" sub-devices, and has to manage them, changing the QEMU
>>>> interface in the future is more complex (as you said in [1]).
>>> Can you elaborate? Yes, when you design the feature around "multiple memory backends", you'll have to have an interface to add such. Well, and to query them during migration. And, maybe also to detect when to remove some (migration)?
>>>
>>
>> As I wrote above, multiple backing memory devices are necessary so the
>> guest can be expanded above the initially provided backing memory device,
>> much like virtio-mem already allows.
>>
>> And then you have to either:
>> 1) Let the hv-balloon driver transparently manage the lifetime of these
>> sub-devices, like this version of the patch set does, OR:
>>
>> 2) Make the QEMU manager (user) insert and remove these sub-devices
>> explicitly, like the version 1 of this driver did.
> 
> Let me raise this idea:
> 
> qemu-kvm ... \
> -m 4G,maxmem=1048G \
> -object memory-backend-ram,id=mem0,size=1T, ... \
> -device hv-balloon,id=vmem0,memdev=mem0
> 
> We'd do the same internal optimizations as we're doing (and the ones I am working on) for virtio-mem.
> 
> The above would result in a VM with 4G. With virtio-mem, we resize devices; with the balloon, you resize the logical VM size.
> 
> So the single (existing?) user interface would be the existing balloon cmd. Note that we set the logical VM size here, not the size of the balloon.
> 
> info balloon -> 4G
> balloon 2G [will inflate]
> info balloon -> 2G
> balloon 128G [will deflate, then hotplug]
> info balloon -> 128G
> balloon 8G [will deflate]
> info balloon -> 8G
> ...
> 
> How memory is added (deflate first, then expose some new memory via the memdev, ...) is left to the hv-balloon device; the user doesn't have to bother. We set the logical VM size and hv-balloon will do its thing to eventually reach that goal.

The idea would seem reasonable, but: (there's always some "but")
1) Once we implement NUMA support we'd probably need multiple
TYPE_MEMORY_DEVICEs anyway, since it seems one memdev can sit on only
one NUMA node,

With virtio-mem one can simply have per-node virtio-mem devices.

2) I'm not sure what's the overhead of having, let's say, 1 TiB backing
memory device mostly marked madvise(MADV_DONTNEED).
Like, how much memory + swap this setup would actually consume - that's
something I would need to measure.

3) In a public cloud environment malicious guests are a possibility.
Currently (without things like resizable memslots) the best idea I tried
was to place the whole QEMU process into a memory-limited cgroup
(limited to the guest target size).

There are still some issues with it: one needs to reserve swap space up
to the guest maximum size so the QEMU process doesn't get OOM-killed if
the guest touches that memory, and the cgroup memory controller for some
reason seems to start swapping even before reaching its limit (why that
happens is still under investigation).
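
As a rough sketch of that setup with cgroup v2 (the paths and the 16G
limit are illustrative, not from this patch set):

# cap the QEMU process at the guest target size
mkdir /sys/fs/cgroup/vm1
echo 16G > /sys/fs/cgroup/vm1/memory.max
echo "$QEMU_PID" > /sys/fs/cgroup/vm1/cgroup.procs

# check how much of the (mostly MADV_DONTNEED) backing is actually
# resident or swapped
grep -E '^(Rss|Swap):' /proc/"$QEMU_PID"/smaps_rollup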

> Reboot? Logically unplug all memory and re-add it after the guest has booted up.
> 
> The only thing we can't do is the following: when going below 4G, we cannot resize boot memory.
> 
> 
> But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup).

Hyper-V actually "cleans up" the guest memory map on reboot - if the
guest was effectively resized up then on reboot the guest boot memory is
resized up to match that last size.
Similarly, if the guest was ballooned out - that amount of memory is
removed from the boot memory on reboot.

So it's not exactly doing a hot-add after the guest boots.
This approach (of resizing the boot memory) also avoids problems if the
guest loses hot-add / ballooning capability after a reboot - for example,
rebooting into a Linux guest from Windows with hv-balloon.

But unfortunately such resizing the guest boot memory seems not trivial
to implement in QEMU.

> 
> 
> Even migration could eventually be fairly simple, because virtio-mem already solved it to some degree. The only catch is that for boot memory, we'd also have to detect discarded ranges. But that would be something to think about in the future.

Yes, migration support is planned for future versions of the driver,
when its final design is known.

Thanks,
Maciej



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-03-01 16:26             ` Maciej S. Szmigiero
@ 2023-03-01 17:24               ` David Hildenbrand
  2023-03-01 22:08                 ` Maciej S. Szmigiero
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2023-03-01 17:24 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

> 
> The idea would seem reasonable, but: (there's always some "but")
> 1) Once we implement NUMA support we'd probably need multiple
> TYPE_MEMORY_DEVICEs anyway, since it seems one memdev can sit on only
> one NUMA node,
> 

Not necessarily. You could extend the hv-balloon device to have one 
memslot for each NUMA node. Of course, once again, you have to plan 
ahead how to distribute memory across NUMA nodes (same with virtio-mem).

Having said that, last time I checked HV dynamic memory was 
force-disabled when enabling vNUMA under HV, simply because balloon 
inflation is not NUMA-aware.

> With virtio-mem one can simply have per-node virtio-mem devices.
> 
> 2) I'm not sure what's the overhead of having, let's say, 1 TiB backing
> memory device mostly marked madvise(MADV_DONTNEED).
> Like, how much memory + swap this setup would actually consume - that's
> something I would need to measure.

There are some WIP items to improve that (QEMU metadata (e.g., bitmaps), 
KVM metadata (e.g., per-memslot), Linux metadata (e.g., page tables)).
Memory overcommit handling also has to be tackled.

So it would be a "shared" problem with virtio-mem and will be sorted out 
eventually :)

> 
> 3) In a public cloud environment malicious guests are a possibility.
> Currently (without things like resizable memslots) the best idea I tried
> was to place the whole QEMU process into a memory-limited cgroup
> (limited to the guest target size).

Yes. Protection of unplugged memory is on my TODO list for virtio-mem as 
well, to avoid having to rely on cgroups.

> 
> There are still some issues with it: one needs to reserve swap space up
> to the guest maximum size so the QEMU process doesn't get OOM-killed if
> the guest touches that memory, and the cgroup memory controller for some
> reason seems to start swapping even before reaching its limit (why that
> happens is still under investigation).

Yes, putting a memory cap on Linux was always tricky.

> 
>> Reboot? Logically unplug all memory and re-add it after the guest has booted up.
>>
>> The only thing we can't do is the following: when going below 4G, we cannot resize boot memory.
>>
>>
>> But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup).
> 
> Hyper-V actually "cleans up" the guest memory map on reboot - if the
> guest was effectively resized up then on reboot the guest boot memory is
> resized up to match that last size.
> Similarly, if the guest was ballooned out - that amount of memory is
> removed from the boot memory on reboot.

Yes, it cleans up, but as I said last time I checked there was this 
concept of startup vs. minimum vs. maximum, at least for dynamic memory:

https://www.fastvue.co/tmgreporter/blog/understanding-hyper-v-dynamic-memory-dynamic-ram/

Startup RAM would be whatever you specify for "-m xG". If you go below 
min, you remove memory via deflation once the guest is up.

> 
> So it's not exactly doing a hot-add after the guest boots.

I recall BUG reports in Linux, that we got hv-balloon hot-add requests 
~1 minute after Linux booted up, because of the above reason of startup 
memory [in these BUG reports, memory onlining was disabled and the VM 
would run out of memory because we hotplugged too much memory]. That's 
why I remember that this approach once was done.

Maybe there are multiple implementations nowadays. At least in QEMU you 
could choose whatever makes most sense for QEMU.


> This approach (of resizing the boot memory) also avoids problems if the
> guest loses hot-add / ballooning capability after a reboot - for example,
> rebooting into a Linux guest from Windows with hv-balloon.

TBH, I wouldn't be too concerned about that scenario ("hotplugged memory 
to a guest, guest reboots into a weird OS, weird OS isn't able to use 
hotplugged memory"). For virtio-mem, the important part was that you 
always "know" how much memory the VM is aware of. If you always start 
with "Startup memory" and hotadd later (only if you detected guest 
support after a bootup), you can handle that scenario.

> 
> But unfortunately such resizing the guest boot memory seems not trivial
> to implement in QEMU.

Yes, avoiding changing memory layout to keep memory migration feasible 
was another thing I considered when designing virtio-mem.


Anyhow, I'm just throwing out ideas here on how to eventually handle it 
differently.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-03-01 17:24               ` David Hildenbrand
@ 2023-03-01 22:08                 ` Maciej S. Szmigiero
  2023-03-02  9:28                   ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Maciej S. Szmigiero @ 2023-03-01 22:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 1.03.2023 18:24, David Hildenbrand wrote:
(...)
>> With virtio-mem one can simply have per-node virtio-mem devices.
>>
>> 2) I'm not sure what's the overhead of having, let's say, 1 TiB backing
>> memory device mostly marked madvise(MADV_DONTNEED).
>> Like, how much memory + swap this setup would actually consume - that's
>> something I would need to measure.
> 
> There are some WIP items to improve that (QEMU metadata (e.g., bitmaps), KVM metadata (e.g., per-memslot), Linux metadata (e.g., page tables)).
> Memory overcommit handling also has to be tackled.
> 
> So it would be a "shared" problem with virtio-mem and will be sorted out eventually :)
> 

Yes, but this might take a bit of time, especially if kernel-side changes
are involved - that's why I will check how this setup works in practice
in its current shape.

(...)
>>> Reboot? Logically unplug all memory and re-add it after the guest has booted up.
>>>
>>> The only thing we can't do is the following: when going below 4G, we cannot resize boot memory.
>>>
>>>
>>> But I recall that that's *exactly* how the HV version I played with ~2 years ago worked: always start up with some initial memory ("startup memory"). After the VM is up for some seconds, we either add more memory (requested > startup) or request the VM to inflate memory (requested < startup).
>>
>> Hyper-V actually "cleans up" the guest memory map on reboot - if the
>> guest was effectively resized up then on reboot the guest boot memory is
>> resized up to match that last size.
>> Similarly, if the guest was ballooned out - that amount of memory is
>> removed from the boot memory on reboot.
> 
> Yes, it cleans up, but as I said last time I checked there was this concept of startup vs. minimum vs. maximum, at least for dynamic memory:
> 
> https://www.fastvue.co/tmgreporter/blog/understanding-hyper-v-dynamic-memory-dynamic-ram/
> 
> Startup RAM would be whatever you specify for "-m xG". If you go below min, you remove memory via deflation once the guest is up.


That article was from 2014, so I guess it pertained to Windows 2012 R2.

The memory settings page in more recent Hyper-V versions looks like
the screenshot at [1].

It no longer calls that main memory amount value "Startup RAM", now it's
just "RAM".

Despite what one might think, the "Enable Dynamic Memory" checkbox does
*not* control the Dynamic Memory protocol availability or usage - the
protocol is always available/exported to the guest.

What the "Enable Dynamic Memory" checkbox controls is some host-side
heuristics that automatically resize the guest within chosen bounds
based on some metrics.

Even if the "Enable Dynamic Memory" checkbox is *not* enabled the guest
can still be online-resized via Dynamic Memory protocol by simply
changing the value in the "RAM" field and clicking "Apply".

At least that's how it works on Windows 2019 with a Linux guest.

>>
>> So it's not exactly doing a hot-add after the guest boots.
> 
> I recall BUG reports in Linux, that we got hv-balloon hot-add requests ~1 minute after Linux booted up, because of the above reason of startup memory [in these BUG reports, memory onlining was disabled and the VM would run out of memory because we hotplugged too much memory]. That's why I remember that this approach once was done.
> 
> Maybe there are multiple implementations nowadays. At least in QEMU you could choose whatever makes most sense for QEMU.
> 

Right, it seems that the Hyper-V behavior evolved with time, too.

>> This approach (of resizing the boot memory) also avoids problems if the
>> guest loses hot-add / ballooning capability after a reboot - for example,
>> rebooting into a Linux guest from Windows with hv-balloon.
> 
> TBH, I wouldn't be too concerned about that scenario ("hotplugged memory to a guest, guest reboots into a weird OS, weird OS isn't able to use hotplugged memory"). For virtio-mem, the important part was that you always "know" how much memory the VM is aware of. If you always start with "Startup memory" and hotadd later (only if you detected guest support after a bootup), you can handle that scenario.

I'm not *that* concerned with the cross-guest-type scenario either,
but if it can be made smoother then I wouldn't mind.

Thanks,
Maciej

[1]: https://www.tenforums.com/performance-maintenance/38478-windows-10-hyper-v-dynamic-memory.html#post544905




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols
  2023-03-01 22:08                 ` Maciej S. Szmigiero
@ 2023-03-02  9:28                   ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2023-03-02  9:28 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Michael S . Tsirkin, Marcel Apfelbaum, Alex Bennée,
	Thomas Huth, Marc-André Lureau, Daniel P. Berrangé,
	Philippe Mathieu-Daudé,
	Eric Blake, Markus Armbruster, qemu-devel, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On 01.03.23 23:08, Maciej S. Szmigiero wrote:
> On 1.03.2023 18:24, David Hildenbrand wrote:
> (...)
>>> With virtio-mem one can simply have per-node virtio-mem devices.
>>>
>>> 2) I'm not sure what's the overhead of having, let's say, 1 TiB backing
>>> memory device mostly marked madvise(MADV_DONTNEED).
>>> Like, how much memory + swap this setup would actually consume - that's
>>> something I would need to measure.
>>
>> There are some WIP items to improve that (QEMU metadata (e.g., bitmaps), KVM metadata (e.g., per-memslot), Linux metadata (e.g., page tables)).
>> Memory overcommit handling also has to be tackled.
>>
>> So it would be a "shared" problem with virtio-mem and will be sorted out eventually :)
>>
> 
> Yes, but this might take a bit of time, especially if kernel-side changes
> are involved - that's why I will check how this setup works in practice
> in its current shape.

Yes, let me know if you have any questions. I invested a lot of time to 
figure out all of the details and possible workarounds/approaches in the 
past.

>>> Hyper-V actually "cleans up" the guest memory map on reboot - if the
>>> guest was effectively resized up then on reboot the guest boot memory is
>>> resized up to match that last size.
>>> Similarly, if the guest was ballooned out - that amount of memory is
>>> removed from the boot memory on reboot.
>>
>> Yes, it cleans up, but as I said last time I checked there was this concept of startup vs. minimum vs. maximum, at least for dynamic memory:
>>
>> https://www.fastvue.co/tmgreporter/blog/understanding-hyper-v-dynamic-memory-dynamic-ram/
>>
>> Startup RAM would be whatever you specify for "-m xG". If you go below min, you remove memory via deflation once the guest is up.
> 
> 
> That article was from 2014, so I guess it pertained to Windows 2012 R2.

I remember seeing the same interface when I played with that a couple of 
years ago, but I don't recall which Windows version I was using.

> 
> The memory settings page in more recent Hyper-V versions looks like
> the screenshot at [1].
> 
> It no longer calls that main memory amount value "Startup RAM", now it's
> just "RAM".
> 
> Despite what one might think, the "Enable Dynamic Memory" checkbox does
> *not* control the Dynamic Memory protocol availability or usage - the
> protocol is always available/exported to the guest.
> 
> What the "Enable Dynamic Memory" checkbox controls is some host-side
> heuristics that automatically resize the guest within chosen bounds
> based on some metrics.
> 
> Even if the "Enable Dynamic Memory" checkbox is *not* enabled the guest
> can still be online-resized via Dynamic Memory protocol by simply
> changing the value in the "RAM" field and clicking "Apply".
> 
> At least that's how it works on Windows 2019 with a Linux guest.

Right, I recall that that's a feature that was separately announced as 
explicit VM resizing, not HV dynamic memory. It uses the same underlying 
mechanism, yes, which is why the feature is always exposed to the VMs.

That's most probably when they performed the "Startup RAM" -> "RAM" 
rename, to make both features possibly co-exist and easier to configure.

> 
>>>
>>> So it's not exactly doing a hot-add after the guest boots.
>>
>> I recall BUG reports in Linux, that we got hv-balloon hot-add requests ~1 minute after Linux booted up, because of the above reason of startup memory [in these BUG reports, memory onlining was disabled and the VM would run out of memory because we hotplugged too much memory]. That's why I remember that this approach once was done.
>>
>> Maybe there are multiple implementations nowadays. At least in QEMU you could choose whatever makes most sense for QEMU.
>>
> 
> Right, it seems that the Hyper-V behavior evolved with time, too.

Yes. One could think of a split approach, that is, we never resize the 
initial RAM size (-m XG) from inside QEMU. Instead, we could have the 
following models:

(1) Basic "Startup RAM" model: always (re)boot Linux with "-m XG". On
     reboot. Once the VM comes up, we either add memory or request to
     inflate the balloon, to reach the previous guest size. Whenever the
     VM reboots, we first defrag all hv-balloon provided memory ("one
     contiguous chunk") to then "add" that memory to the VM. If the
     logical VM size <= requested, this hv-balloon memory size would be
     "0". Essentially resembling the "old" HV dynamic memory approach.

(2) Extended "Startup RAM" mode: Same as (1), but instead of hot-adding
     the RAM after the guest came up, we simply defrag the
     hv-balloon RAM during reboot ("one contiguous chunk") and expose it
     via e820/SRAT to the guest. Going "below" startup RAM will still
     require inflation once the guest is up.

(3) External "Resize" mode: On reboot, simply shutdown the VM and notify
     libvirt. Libvirt will restart the VM with adjusted "Startup RAM".

It's fairly straightforward to extend (1) to achieve (2). That could be 
a sane default for QEMU. Whoever wants (3) can simply let libvirt handle 
it on top without any special handling.

Internal resize mode is tricky, especially regarding migration. With 
sufficient motivation and problem solving one might be able to turn (1) 
or (2) into such a (4) mode. It would just be an implementation detail.


Note that I never considered the "go below initial RAM" and "resize 
initial RAM" really relevant for virtio-mem. Instead, you chose the 
startup size to be reasonably small (e.g., 4 GiB) and expose memory via 
the virtio-mem devices right at QEMU startup ("requested-size=XG"). The 
same approach could be applied to the hv-balloon model.
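
Concretely, that would look something like this (sizes purely
illustrative):

qemu-kvm ... \
-m 4G,maxmem=1048G \
-object memory-backend-ram,id=mem0,size=1T \
-device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=16G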

One main reason to decide against resizing significantly below 4G was, 
for example, that you'll end up losing valuable DMA/DMA32 memory the 
lower you go -- that no hotplugged memory will provide. So using 
inflation for everything < 4G does not sound too crazy to me, and could 
avoid mode (3) altogether. But again, just my thoughts.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-03-02  9:29 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-24 21:41 [PATCH][RESEND v3 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
2023-02-24 21:41 ` [PATCH][RESEND v3 1/3] hapvdimm: add a virtual DIMM device for memory hot-add protocols Maciej S. Szmigiero
2023-02-27 15:25   ` David Hildenbrand
2023-02-28 14:14     ` Maciej S. Szmigiero
2023-02-28 15:02       ` David Hildenbrand
2023-02-28 21:27         ` Maciej S. Szmigiero
2023-02-28 22:12           ` David Hildenbrand
2023-03-01 16:26             ` Maciej S. Szmigiero
2023-03-01 17:24               ` David Hildenbrand
2023-03-01 22:08                 ` Maciej S. Szmigiero
2023-03-02  9:28                   ` David Hildenbrand
2023-02-24 21:41 ` [PATCH][RESEND v3 2/3] Add Hyper-V Dynamic Memory Protocol definitions Maciej S. Szmigiero
2023-02-24 21:41 ` [PATCH][RESEND v3 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
2023-02-28 16:18   ` Igor Mammedov
2023-02-28 17:12     ` David Hildenbrand
2023-02-28 17:34   ` Daniel P. Berrangé
2023-02-28 21:24     ` Maciej S. Szmigiero
