* [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
@ 2020-09-20 13:25 Maciej S. Szmigiero
  2020-09-20 13:25 ` [PATCH 1/3] haprot: add a device for memory hot-add protocols Maciej S. Szmigiero
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-20 13:25 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S. Tsirkin, Markus Armbruster, qemu-devel, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This series adds a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
and its protocol definitions.
Also included is a driver providing backing devices for memory hot-add
protocols ("haprots").

A haprot device works like a virtual DIMM stick: it allows inserting
extra RAM into the guest at run time.

The main differences from the ACPI-based PC DIMM hotplug are:
* Notifying the guest about the new memory range is not done via ACPI but
via a protocol handler that registers with the haprot framework.
This means that the ACPI DIMM slot limit does not apply.

* A protocol handler can prevent removal of a haprot device when it is
still in use by setting its "busy" field.

* A protocol handler can also register an "unplug" callback so it gets
notified when a user decides to remove the haprot device.
This way the protocol handler can inform the guest about this fact and/or
do its own cleanup.

The hv-balloon driver is like virtio-balloon on steroids: it allows both
changing the guest memory allocation via ballooning and inserting extra
RAM into it by adding haprot virtual DIMM sticks.
One advantage of this approach over ACPI-based PC DIMM hotplug is that such
memory can be hotplugged at a much finer granularity because the ACPI DIMM
slot limit does not apply.

In contrast with ACPI DIMM hotplug, where one can only request to unplug a
whole DIMM stick, this driver allows removing memory from the guest in
single page (4k) units via ballooning.
Then, once the guest has released all the memory backed by a haprot
virtual DIMM stick, the device is marked "unused" and can be removed from
the VM, if desired.
A "HV_BALLOON_HAPROT_UNUSED" QMP event is emitted in this case so the
software controlling QEMU knows that this operation is now possible.

The haprot devices are also marked unused after a VM reboot (with a
corresponding "HV_BALLOON_HAPROT_UNUSED" QMP event).
They are automatically reinserted (if still present) after the guest
reconnects to this protocol (a "HV_BALLOON_HAPROT_INUSE" QMP event is then
emitted).

For performance reasons, the guest-released memory is tracked in a few range
trees, as a series of (start, count) ranges.
Each time a new page range is inserted into such a tree, its neighbors are
checked as candidates for possible merging with it.

Besides performance, the Dynamic Memory protocol itself uses page ranges as
the data structure in its messages, so the relevant pages need to be merged
into such ranges anyway.
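
To illustrate, the merge criterion boils down to checking whether two ranges
are adjacent (this mirrors the page_range_joinable() helper in the hv-balloon
driver added in patch 3):

  typedef struct PageRange {
      uint64_t start;   /* first page frame number of the range */
      uint64_t count;   /* number of pages in the range */
  } PageRange;

  /* true when the (start, count) range directly precedes or follows *range */
  static bool page_range_joinable(const PageRange *range,
                                  uint64_t start, uint64_t count)
  {
      return start + count == range->start ||       /* joins on the left  */
             range->start + range->count == start;  /* joins on the right */
  }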

One has to be careful when tracking the guest-released pages, since the
guest can maliciously report returning pages outside its current address
space, which could later clash with the address range of newly added memory.
Similarly, the guest can report freeing the same page twice.

The above design results in much better ballooning performance than when
using virtio-balloon with the same guest: 230 GB / minute with this driver
versus 70 GB / minute with virtio-balloon.

During a ballooning operation most of the time is spent waiting for the guest
to come up with newly freed page ranges; processing the received ranges on
the host side (in QEMU / KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous:
thanks to the merging of the ballooned-out page ranges, 200 GB of memory can
be returned to the guest in about 1 second.
With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a
Xeon E5-2699, after dirtying the whole memory inside the guest before each
balloon operation.

Using a range tree instead of a bitmap to track the removed memory also
means that the solution scales well with the guest size: even a 1 TB range
takes just a few bytes of memory (a single (start, count) pair, whereas a
bitmap at 4k page granularity would need 32 MiB to cover 1 TB).

The required GTree operations are available at
https://gitlab.gnome.org/maciejsszmigiero/glib/-/tree/gtree-add-iterators
Since they are not yet in upstream Glib, a check for them was added to the
"configure" script, together with new "--enable-hv-balloon" and
"--disable-hv-balloon" arguments.
If these GTree operations are missing from the system Glib this driver will
be skipped during the QEMU build.

An optional "status-report=on" device parameter requests memory status
events from the guest (typically sent every second), which allow the host
to learn both the guest's available-memory and in-use-memory counts.
They are emitted externally as "HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for
the Dynamic Memory Protocol carries that name, and to follow the naming
pattern established by the virtio-balloon driver.
The whole protocol runs over Hyper-V VMBus, whose QEMU implementation was
recently merged in.

The driver was tested against Windows Server 2012 R2, Windows Server 2016
and Windows Server 2019 guests and obeys the guest alignment requirements
reported to the host via the DM_CAPABILITIES_REPORT message.
Extensive event tracing is available under 'hv_balloon_*' prefix.

Example usage:
* Add "-device vmbus-bridge,id=vmbus-bridge -device hv-balloon,id=hvb"
  to the QEMU command line and set the "maxmem" value to something large,
  like 1T.

* Use QEMU monitor commands to add a haprot virtual DIMM stick, together
  with its memory backend:
  object_add memory-backend-ram,id=mem1,size=200G
  device_add mem-haprot,id=ha1,memdev=mem1
  The first command is actually the same as for ACPI-based DIMM hotplug.

* Use the ballooning interface monitor commands to force the guest to give
  out as much memory as possible:
  balloon 1
  The ballooning interface monitor commands can also be used to resize
  the guest up and down appropriately.

* One can check the current guest size by issuing an "info balloon" command.
  This is useful to know what is happening, since large ballooning or
  unballooning operations take some time to complete.

* Once the guest releases the whole memory backed by a haprot device
  (or is restarted) a "HV_BALLOON_HAPROT_UNUSED" QMP event will be
  generated.
  The haprot device then can be removed, together with its memory backend:
  device_del ha1
  object_del mem1

Future directions:
* Allow sharing the ballooning QEMU interface between hv-balloon and
  virtio-balloon drivers.
  Currently, only one of them can be added to the VM at the same time.

* Allow new haprot devices to reuse the same address range as the ones
  that were previously deleted via device_del monitor command without
  having to restart the VM.

* Add vmstate / live migration support to the hv-balloon driver.

* Use the haprot device to also add memory via a virtio interface (this
  requires defining a new operation in the virtio-balloon protocol and
  appropriate support from the client virtio-balloon driver in the Linux
  kernel).

 Kconfig.host                     |    3 +
 configure                        |   35 +
 hw/hyperv/Kconfig                |    5 +
 hw/hyperv/hv-balloon.c           | 2172 ++++++++++++++++++++++++++++++
 hw/hyperv/meson.build            |    1 +
 hw/hyperv/trace-events           |   17 +
 hw/i386/Kconfig                  |    2 +
 hw/i386/pc.c                     |   18 +-
 hw/mem/Kconfig                   |    4 +
 hw/mem/haprot.c                  |  263 ++++
 hw/mem/meson.build               |    1 +
 include/hw/hyperv/dynmem-proto.h |  425 ++++++
 include/hw/mem/haprot.h          |   72 +
 meson.build                      |    2 +
 qapi/misc.json                   |   74 +
 15 files changed, 3093 insertions(+), 1 deletion(-)
 create mode 100644 hw/hyperv/hv-balloon.c
 create mode 100644 hw/mem/haprot.c
 create mode 100644 include/hw/hyperv/dynmem-proto.h
 create mode 100644 include/hw/mem/haprot.h




* [PATCH 1/3] haprot: add a device for memory hot-add protocols
  2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
@ 2020-09-20 13:25 ` Maciej S. Szmigiero
  2020-09-20 13:25 ` [PATCH 2/3] Add Hyper-V Dynamic Memory Protocol definitions Maciej S. Szmigiero
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-20 13:25 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S. Tsirkin, Markus Armbruster, qemu-devel, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This device works like a virtual DIMM stick: it allows inserting extra RAM
into the guest at run time.

The main differences from the ACPI-based PC DIMM hotplug are:
* Notifying the guest about the new memory range is not done via ACPI but
via a protocol handler that registers with the haprot framework.
This means that the ACPI DIMM slot limit does not apply.

* A protocol handler can prevent removal of a haprot device when it is
still in use by setting its "busy" field.

* A protocol handler can also register an "unplug" callback so it gets
notified when a user decides to remove the haprot device.
This way the protocol handler can inform the guest about this fact and/or
do its own cleanup (a minimal registration sketch follows below).
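
For illustration, a protocol driver hooks into this framework roughly as
below (the "my_proto_*" callback names and context are hypothetical; the
real consumer is the hv-balloon driver added later in this series):

  static uint64_t my_proto_get_align(void *ctx, HAProtDevice *haprot)
  {
      return 2 * GiB;   /* protocol-specific hot-add alignment (qemu/units.h) */
  }

  static void my_proto_plug_notify(void *ctx, HAProtDevice *haprot,
                                   Error **errp)
  {
      /* announce the memory range backing @haprot to the guest */
  }

  static void my_proto_unplug_notify(void *ctx, HAProtDevice *haprot)
  {
      /* the user is removing @haprot - drop any references to it */
  }

  /* during protocol driver initialization: */
  haprot_register_protocol(my_proto_get_align, my_proto_plug_notify,
                           my_proto_unplug_notify, my_ctx, errp);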

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 hw/i386/Kconfig         |   2 +
 hw/i386/pc.c            |  18 ++-
 hw/mem/Kconfig          |   4 +
 hw/mem/haprot.c         | 263 ++++++++++++++++++++++++++++++++++++++++
 hw/mem/meson.build      |   1 +
 include/hw/mem/haprot.h |  72 +++++++++++
 6 files changed, 359 insertions(+), 1 deletion(-)
 create mode 100644 hw/mem/haprot.c
 create mode 100644 include/hw/mem/haprot.h

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index d0bd8b537d55..ca143e568de2 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -58,6 +58,7 @@ config I440FX
     imply E1000_PCI
     imply VMPORT
     imply VMMOUSE
+    imply HAPROT
     select PC_PCI
     select PC_ACPI
     select ACPI_SMBUS
@@ -85,6 +86,7 @@ config Q35
     imply E1000E_PCI_EXPRESS
     imply VMPORT
     imply VMMOUSE
+    imply HAPROT
     select PC_PCI
     select PC_ACPI
     select PCI_EXPRESS_Q35
diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index b55369357e5d..b7a0e4ee3ea2 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -77,6 +77,7 @@
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/boards.h"
 #include "acpi-build.h"
+#include "hw/mem/haprot.h"
 #include "hw/mem/pc-dimm.h"
 #include "hw/mem/nvdimm.h"
 #include "qapi/error.h"
@@ -1416,6 +1417,18 @@ static void pc_virtio_md_pci_unplug_request(HotplugHandler *hotplug_dev,
     error_setg(errp, "virtio based memory devices cannot be unplugged.");
 }
 
+static void pc_haprot_unplug_request(DeviceState *dev, Error **errp)
+{
+    HAProtDevice *haprot = HAPROT(dev);
+
+    if (haprot->busy) {
+        error_setg(errp, "the memory is still busy, cannot unplug");
+        return;
+    }
+
+    object_unparent(OBJECT(dev));
+}
+
 static void pc_virtio_md_pci_unplug(HotplugHandler *hotplug_dev,
                                     DeviceState *dev, Error **errp)
 {
@@ -1458,6 +1471,8 @@ static void pc_machine_device_unplug_request_cb(HotplugHandler *hotplug_dev,
     } else if (object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
                object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
         pc_virtio_md_pci_unplug_request(hotplug_dev, dev, errp);
+    } else if (object_dynamic_cast(OBJECT(dev), TYPE_HAPROT)) {
+        pc_haprot_unplug_request(dev, errp);
     } else {
         error_setg(errp, "acpi: device unplug request for not supported device"
                    " type: %s", object_get_typename(OBJECT(dev)));
@@ -1486,7 +1501,8 @@ static HotplugHandler *pc_get_hotplug_handler(MachineState *machine,
     if (object_dynamic_cast(OBJECT(dev), TYPE_PC_DIMM) ||
         object_dynamic_cast(OBJECT(dev), TYPE_CPU) ||
         object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_PMEM_PCI) ||
-        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI)) {
+        object_dynamic_cast(OBJECT(dev), TYPE_VIRTIO_MEM_PCI) ||
+        object_dynamic_cast(OBJECT(dev), TYPE_HAPROT)) {
         return HOTPLUG_HANDLER(machine);
     }
 
diff --git a/hw/mem/Kconfig b/hw/mem/Kconfig
index a0ef2cf648e1..a5d8c8851d1b 100644
--- a/hw/mem/Kconfig
+++ b/hw/mem/Kconfig
@@ -10,3 +10,7 @@ config NVDIMM
     default y
     depends on (PC || PSERIES || ARM_VIRT)
     select MEM_DEVICE
+
+config HAPROT
+    bool
+    select MEM_DEVICE
diff --git a/hw/mem/haprot.c b/hw/mem/haprot.c
new file mode 100644
index 000000000000..38373351b55c
--- /dev/null
+++ b/hw/mem/haprot.c
@@ -0,0 +1,263 @@
+/*
+ * A device for memory hot-add protocols
+ *
+ * Copyright (C) 2020 Oracle and/or its affiliates.
+ *
+ * Author: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
+ *
+ * Heavily based on pc-dimm.c:
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2014 Red Hat Inc
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "hw/boards.h"
+#include "hw/mem/haprot.h"
+#include "hw/mem/memory-device.h"
+#include "hw/qdev-properties.h"
+#include "migration/vmstate.h"
+#include "qapi/error.h"
+#include "qapi/visitor.h"
+#include "qemu/module.h"
+#include "sysemu/hostmem.h"
+#include "trace.h"
+
+static Property haprot_properties[] = {
+    DEFINE_PROP_UINT64(HAPROT_ADDR_PROP, HAProtDevice, addr, 0),
+    DEFINE_PROP_UINT32(HAPROT_NODE_PROP, HAProtDevice, node, 0),
+    DEFINE_PROP_LINK(HAPROT_MEMDEV_PROP, HAProtDevice, hostmem,
+                     TYPE_MEMORY_BACKEND, HostMemoryBackend *),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void haprot_get_size(Object *obj, Visitor *v, const char *name,
+                            void *opaque, Error **errp)
+{
+    Error *local_err = NULL;
+    uint64_t value;
+
+    value = memory_device_get_region_size(MEMORY_DEVICE(obj), &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    visit_type_uint64(v, name, &value, errp);
+}
+
+static void haprot_init(Object *obj)
+{
+    object_property_add(obj, HAPROT_SIZE_PROP, "uint64", haprot_get_size,
+                        NULL, NULL, NULL);
+}
+
+static void haprot_realize(DeviceState *dev, Error **errp)
+{
+    HAProtDevice *haprot = HAPROT(dev);
+    HAProtDeviceClass *hc = HAPROT_GET_CLASS(haprot);
+    uint64_t align;
+    MachineState *ms = MACHINE(qdev_get_machine());
+    Error *local_err = NULL;
+    int nb_numa_nodes = ms->numa_state->num_nodes;
+
+    if (!hc->plug_notify_cb) {
+        error_setg(errp, "no mem hot add protocol registered");
+        return;
+    }
+
+    if (hc->get_align_cb) {
+        align = hc->get_align_cb(hc->notify_cb_ctx, haprot);
+    } else {
+        align = 0;
+    }
+
+    memory_device_pre_plug(MEMORY_DEVICE(haprot), ms,
+                           align ? &align : NULL,
+                           &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    if (!haprot->hostmem) {
+        error_setg(errp, "'" HAPROT_MEMDEV_PROP "' property is not set");
+        return;
+    } else if (host_memory_backend_is_mapped(haprot->hostmem)) {
+        const char *path;
+
+        path = object_get_canonical_path_component(OBJECT(haprot->hostmem));
+        error_setg(errp, "can't use already busy memdev: %s", path);
+        return;
+    }
+    if (((nb_numa_nodes > 0) && (haprot->node >= nb_numa_nodes)) ||
+        (!nb_numa_nodes && haprot->node)) {
+        error_setg(errp,
+                   "Node property value %"PRIu32" exceeds the number of numa nodes %d",
+                   haprot->node, nb_numa_nodes ? nb_numa_nodes : 1);
+        return;
+    }
+
+    host_memory_backend_set_mapped(haprot->hostmem, true);
+
+    memory_device_plug(MEMORY_DEVICE(haprot), ms);
+    vmstate_register_ram(host_memory_backend_get_memory(haprot->hostmem),
+                         dev);
+
+    hc->plug_notify_cb(hc->notify_cb_ctx, haprot, &local_err);
+    if (local_err) {
+        memory_device_unplug(MEMORY_DEVICE(haprot), ms);
+        vmstate_unregister_ram(host_memory_backend_get_memory(haprot->hostmem),
+                               dev);
+        host_memory_backend_set_mapped(haprot->hostmem, false);
+
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static void haprot_unrealize(DeviceState *dev)
+{
+    HAProtDevice *haprot = HAPROT(dev);
+    HAProtDeviceClass *hc = HAPROT_GET_CLASS(haprot);
+    MachineState *ms = MACHINE(qdev_get_machine());
+
+    if (hc->unplug_notify_cb) {
+        hc->unplug_notify_cb(hc->notify_cb_ctx, haprot);
+    }
+
+    memory_device_unplug(MEMORY_DEVICE(haprot), ms);
+    vmstate_unregister_ram(host_memory_backend_get_memory(haprot->hostmem),
+                           dev);
+
+    host_memory_backend_set_mapped(haprot->hostmem, false);
+}
+
+static uint64_t haprot_md_get_addr(const MemoryDeviceState *md)
+{
+    return object_property_get_uint(OBJECT(md), HAPROT_ADDR_PROP,
+                                    &error_abort);
+}
+
+static void haprot_md_set_addr(MemoryDeviceState *md, uint64_t addr,
+                               Error **errp)
+{
+    object_property_set_uint(OBJECT(md), HAPROT_ADDR_PROP, addr, errp);
+}
+
+static MemoryRegion *haprot_md_get_memory_region(MemoryDeviceState *md,
+                                                 Error **errp)
+{
+    HAProtDevice *haprot = HAPROT(md);
+
+    if (!haprot->hostmem) {
+        error_setg(errp, "'" HAPROT_MEMDEV_PROP "' property must be set");
+        return NULL;
+    }
+
+    return host_memory_backend_get_memory(haprot->hostmem);
+}
+
+static void haprot_md_fill_device_info(const MemoryDeviceState *md,
+                                       MemoryDeviceInfo *info)
+{
+    PCDIMMDeviceInfo *di = g_new0(PCDIMMDeviceInfo, 1);
+    const DeviceClass *dc = DEVICE_GET_CLASS(md);
+    const HAProtDevice *haprot = HAPROT(md);
+    const DeviceState *dev = DEVICE(md);
+
+    if (dev->id) {
+        di->has_id = true;
+        di->id = g_strdup(dev->id);
+    }
+    di->hotplugged = dev->hotplugged;
+    di->hotpluggable = dc->hotpluggable;
+    di->addr = haprot->addr;
+    di->slot = -1;
+    di->node = haprot->node;
+    di->size = object_property_get_uint(OBJECT(haprot), HAPROT_SIZE_PROP,
+                                        NULL);
+    di->memdev = object_get_canonical_path(OBJECT(haprot->hostmem));
+
+    info->u.dimm.data = di;
+    info->type = MEMORY_DEVICE_INFO_KIND_DIMM;
+}
+
+static void haprot_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+    MemoryDeviceClass *mdc = MEMORY_DEVICE_CLASS(oc);
+
+    dc->realize = haprot_realize;
+    dc->unrealize = haprot_unrealize;
+    device_class_set_props(dc, haprot_properties);
+    dc->desc = "Memory for a hot add protocol";
+
+    mdc->get_addr = haprot_md_get_addr;
+    mdc->set_addr = haprot_md_set_addr;
+    mdc->get_plugged_size = memory_device_get_region_size;
+    mdc->get_memory_region = haprot_md_get_memory_region;
+    mdc->fill_device_info = haprot_md_fill_device_info;
+}
+
+void haprot_register_protocol(HAProtocolGetAlign get_align_cb,
+                              HAProtocolPlugNotify plug_notify_cb,
+                              HAProtocolUnplugNotify unplug_notify_cb,
+                              void *notify_ctx, Error **errp)
+{
+    HAProtDeviceClass *hc = HAPROT_CLASS(object_class_by_name(TYPE_HAPROT));
+
+    if (hc->plug_notify_cb) {
+        error_setg(errp, "a mem hot add protocol was already registered");
+        return;
+    }
+
+    hc->get_align_cb = get_align_cb;
+    hc->plug_notify_cb = plug_notify_cb;
+    hc->unplug_notify_cb = unplug_notify_cb;
+    hc->notify_cb_ctx = notify_ctx;
+}
+
+void haprot_unregister_protocol(HAProtocolPlugNotify plug_notify_cb,
+                                Error **errp)
+{
+    HAProtDeviceClass *hc = HAPROT_CLASS(object_class_by_name(TYPE_HAPROT));
+
+    if (!hc->plug_notify_cb) {
+        error_setg(errp, "no mem hot add protocol was registered");
+        return;
+    }
+
+    if (hc->plug_notify_cb != plug_notify_cb) {
+        error_setg(errp, "different mem hot add protocol was registered");
+        return;
+    }
+
+    hc->get_align_cb = NULL;
+    hc->plug_notify_cb = NULL;
+    hc->unplug_notify_cb = NULL;
+    hc->notify_cb_ctx = NULL;
+}
+
+static TypeInfo haprot_info = {
+    .name          = TYPE_HAPROT,
+    .parent        = TYPE_DEVICE,
+    .instance_size = sizeof(HAProtDevice),
+    .instance_init = haprot_init,
+    .class_init    = haprot_class_init,
+    .class_size    = sizeof(HAProtDeviceClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_MEMORY_DEVICE },
+        { }
+    },
+};
+
+static void haprot_register_types(void)
+{
+    type_register_static(&haprot_info);
+}
+
+type_init(haprot_register_types)
diff --git a/hw/mem/meson.build b/hw/mem/meson.build
index 0d22f2b5727e..764062077dca 100644
--- a/hw/mem/meson.build
+++ b/hw/mem/meson.build
@@ -3,5 +3,6 @@ mem_ss.add(files('memory-device.c'))
 mem_ss.add(when: 'CONFIG_DIMM', if_true: files('pc-dimm.c'))
 mem_ss.add(when: 'CONFIG_NPCM7XX', if_true: files('npcm7xx_mc.c'))
 mem_ss.add(when: 'CONFIG_NVDIMM', if_true: files('nvdimm.c'))
+mem_ss.add(when: 'CONFIG_HAPROT', if_true: files('haprot.c'))
 
 softmmu_ss.add_all(when: 'CONFIG_MEM_DEVICE', if_true: mem_ss)
diff --git a/include/hw/mem/haprot.h b/include/hw/mem/haprot.h
new file mode 100644
index 000000000000..9d44007b4945
--- /dev/null
+++ b/include/hw/mem/haprot.h
@@ -0,0 +1,72 @@
+/*
+ * A device for memory hot-add protocols
+ *
+ * Copyright (C) 2020 Oracle and/or its affiliates.
+ *
+ * Author: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
+ *
+ * Heavily based on pc-dimm.h:
+ * Copyright ProfitBricks GmbH 2012
+ * Copyright (C) 2013-2014 Red Hat Inc
+ *
+ * Authors:
+ *  Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
+ *  Igor Mammedov <imammedo@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_HAPROT_H
+#define QEMU_HAPROT_H
+
+#include "exec/memory.h"
+#include "hw/qdev-core.h"
+
+#define TYPE_HAPROT "mem-haprot"
+#define HAPROT(obj) \
+    OBJECT_CHECK(HAProtDevice, (obj), TYPE_HAPROT)
+#define HAPROT_CLASS(oc) \
+    OBJECT_CLASS_CHECK(HAProtDeviceClass, (oc), TYPE_HAPROT)
+#define HAPROT_GET_CLASS(obj) \
+    OBJECT_GET_CLASS(HAProtDeviceClass, (obj), TYPE_HAPROT)
+
+#define HAPROT_ADDR_PROP "addr"
+#define HAPROT_NODE_PROP "node"
+#define HAPROT_SIZE_PROP "size"
+#define HAPROT_MEMDEV_PROP "memdev"
+
+typedef struct HAProtDevice {
+    /* private */
+    DeviceState parent_obj;
+
+    /* public */
+    uint64_t addr;
+    uint32_t node;
+    HostMemoryBackend *hostmem;
+    bool busy;
+} HAProtDevice;
+
+typedef uint64_t (*HAProtocolGetAlign)(void *ctx, HAProtDevice *haprot);
+typedef void (*HAProtocolPlugNotify)(void *ctx, HAProtDevice *haprot,
+                                     Error **errp);
+typedef void (*HAProtocolUnplugNotify)(void *ctx, HAProtDevice *haprot);
+
+typedef struct HAProtDeviceClass {
+    /* private */
+    DeviceClass parent_class;
+    HAProtocolGetAlign get_align_cb;
+    HAProtocolPlugNotify plug_notify_cb;
+    HAProtocolUnplugNotify unplug_notify_cb;
+    void *notify_cb_ctx;
+} HAProtDeviceClass;
+
+void haprot_register_protocol(HAProtocolGetAlign get_align_cb,
+                              HAProtocolPlugNotify plug_notify_cb,
+                              HAProtocolUnplugNotify unplug_notify_cb,
+                              void *notify_ctx, Error **errp);
+void haprot_unregister_protocol(HAProtocolPlugNotify plug_notify_cb,
+                                Error **errp);
+
+#endif



* [PATCH 2/3] Add Hyper-V Dynamic Memory Protocol definitions
  2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2020-09-20 13:25 ` [PATCH 1/3] haprot: add a device for memory hot-add protocols Maciej S. Szmigiero
@ 2020-09-20 13:25 ` Maciej S. Szmigiero
  2020-09-20 13:25 ` [PATCH 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-20 13:25 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S. Tsirkin, Markus Armbruster, qemu-devel, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This commit adds the Hyper-V Dynamic Memory Protocol definitions, taken
from the hv_balloon Linux kernel driver and adapted to the QEMU coding style
and type definitions.
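
For reference, a protocol version packs the major number into the high word
and the minor number into the low word of a single 32-bit value, so the
versions defined here encode as:

  DYNMEM_MAKE_VERSION(0, 3) == 0x00000003  /* DYNMEM_PROTOCOL_VERSION_1, WIN7  */
  DYNMEM_MAKE_VERSION(1, 0) == 0x00010000  /* DYNMEM_PROTOCOL_VERSION_2, WIN8  */
  DYNMEM_MAKE_VERSION(2, 0) == 0x00020000  /* DYNMEM_PROTOCOL_VERSION_3, WIN10 */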

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 include/hw/hyperv/dynmem-proto.h | 425 +++++++++++++++++++++++++++++++
 1 file changed, 425 insertions(+)
 create mode 100644 include/hw/hyperv/dynmem-proto.h

diff --git a/include/hw/hyperv/dynmem-proto.h b/include/hw/hyperv/dynmem-proto.h
new file mode 100644
index 000000000000..407cc54b00a8
--- /dev/null
+++ b/include/hw/hyperv/dynmem-proto.h
@@ -0,0 +1,425 @@
+#ifndef HW_HYPERV_DYNMEM_PROTO_H
+#define HW_HYPERV_DYNMEM_PROTO_H
+
+/*
+ * Hyper-V Dynamic Memory Protocol definitions
+ *
+ * Copyright (C) 2020 Oracle and/or its affiliates.
+ *
+ * Author: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
+ *
+ * Based on drivers/hv/hv_balloon.c from Linux kernel:
+ * Copyright (c) 2012, Microsoft Corporation.
+ *
+ * Author: K. Y. Srinivasan <kys@microsoft.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+
+/*
+ * Protocol versions. The low word is the minor version, the high word the major
+ * version.
+ *
+ * History:
+ * Initial version 1.0
+ * Changed to 0.1 on 2009/03/25
+ * Changes to 0.2 on 2009/05/14
+ * Changes to 0.3 on 2009/12/03
+ * Changed to 1.0 on 2011/04/05
+ * Changed to 2.0 on 2019/12/10
+ */
+
+#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | (Minor)))
+#define DYNMEM_MAJOR_VERSION(Version) ((uint32_t)(Version) >> 16)
+#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff)
+
+enum {
+    DYNMEM_PROTOCOL_VERSION_1 = DYNMEM_MAKE_VERSION(0, 3),
+    DYNMEM_PROTOCOL_VERSION_2 = DYNMEM_MAKE_VERSION(1, 0),
+    DYNMEM_PROTOCOL_VERSION_3 = DYNMEM_MAKE_VERSION(2, 0),
+
+    DYNMEM_PROTOCOL_VERSION_WIN7 = DYNMEM_PROTOCOL_VERSION_1,
+    DYNMEM_PROTOCOL_VERSION_WIN8 = DYNMEM_PROTOCOL_VERSION_2,
+    DYNMEM_PROTOCOL_VERSION_WIN10 = DYNMEM_PROTOCOL_VERSION_3,
+
+    DYNMEM_PROTOCOL_VERSION_CURRENT = DYNMEM_PROTOCOL_VERSION_WIN10
+};
+
+
+
+/*
+ * Message Types
+ */
+
+enum dm_message_type {
+    /*
+     * Version 0.3
+     */
+    DM_ERROR = 0,
+    DM_VERSION_REQUEST = 1,
+    DM_VERSION_RESPONSE = 2,
+    DM_CAPABILITIES_REPORT = 3,
+    DM_CAPABILITIES_RESPONSE = 4,
+    DM_STATUS_REPORT = 5,
+    DM_BALLOON_REQUEST = 6,
+    DM_BALLOON_RESPONSE = 7,
+    DM_UNBALLOON_REQUEST = 8,
+    DM_UNBALLOON_RESPONSE = 9,
+    DM_MEM_HOT_ADD_REQUEST = 10,
+    DM_MEM_HOT_ADD_RESPONSE = 11,
+    DM_VERSION_03_MAX = 11,
+    /*
+     * Version 1.0.
+     */
+    DM_INFO_MESSAGE = 12,
+    DM_VERSION_1_MAX = 12,
+
+    /*
+     * Version 2.0
+     */
+    DM_MEM_HOT_REMOVE_REQUEST = 13,
+    DM_MEM_HOT_REMOVE_RESPONSE = 14
+};
+
+
+/*
+ * Structures defining the dynamic memory management
+ * protocol.
+ */
+
+union dm_version {
+    struct {
+        uint16_t minor_version;
+        uint16_t major_version;
+    };
+    uint32_t version;
+} QEMU_PACKED;
+
+
+union dm_caps {
+    struct {
+        uint64_t balloon:1;
+        uint64_t hot_add:1;
+        /*
+         * To support guests that may have alignment
+         * limitations on hot-add, the guest can specify
+         * its alignment requirements; a value of n
+         * represents an alignment of 2^n in megabytes.
+         */
+        uint64_t hot_add_alignment:4;
+        uint64_t hot_remove:1;
+        uint64_t reservedz:57;
+    } cap_bits;
+    uint64_t caps;
+} QEMU_PACKED;
+
+union dm_mem_page_range {
+    struct  {
+        /*
+         * The PFN number of the first page in the range.
+         * 40 bits is the architectural limit of a PFN
+         * number for AMD64.
+         */
+        uint64_t start_page:40;
+        /*
+         * The number of pages in the range.
+         */
+        uint64_t page_cnt:24;
+    } finfo;
+    uint64_t  page_range;
+} QEMU_PACKED;
+
+
+
+/*
+ * The header for all dynamic memory messages:
+ *
+ * type: Type of the message.
+ * size: Size of the message in bytes, including the header.
+ * trans_id: The guest is responsible for manufacturing this ID.
+ */
+
+struct dm_header {
+    uint16_t type;
+    uint16_t size;
+    uint32_t trans_id;
+} QEMU_PACKED;
+
+/*
+ * A generic message format for dynamic memory.
+ * Specific message formats are defined later in the file.
+ */
+
+struct dm_message {
+    struct dm_header hdr;
+    uint8_t data[]; /* enclosed message */
+} QEMU_PACKED;
+
+
+/*
+ * Specific message types supporting the dynamic memory protocol.
+ */
+
+/*
+ * Version negotiation message. Sent from the guest to the host.
+ * The guest is free to try different versions until the host
+ * accepts the version.
+ *
+ * dm_version: The protocol version requested.
+ * is_last_attempt: If TRUE, this is the last version guest will request.
+ * reservedz: Reserved field, set to zero.
+ */
+
+struct dm_version_request {
+    struct dm_header hdr;
+    union dm_version version;
+    uint32_t is_last_attempt:1;
+    uint32_t reservedz:31;
+} QEMU_PACKED;
+
+/*
+ * Version response message; Host to Guest and indicates
+ * if the host has accepted the version sent by the guest.
+ *
+ * is_accepted: If TRUE, host has accepted the version and the guest
+ * should proceed to the next stage of the protocol. FALSE indicates that
+ * guest should re-try with a different version.
+ *
+ * reservedz: Reserved field, set to zero.
+ */
+
+struct dm_version_response {
+    struct dm_header hdr;
+    uint64_t is_accepted:1;
+    uint64_t reservedz:63;
+} QEMU_PACKED;
+
+/*
+ * Message reporting capabilities. This is sent from the guest to the
+ * host.
+ */
+
+struct dm_capabilities {
+    struct dm_header hdr;
+    union dm_caps caps;
+    uint64_t min_page_cnt;
+    uint64_t max_page_number;
+} QEMU_PACKED;
+
+/*
+ * Response to the capabilities message. This is sent from the host to the
+ * guest. This message notifies if the host has accepted the guest's
+ * capabilities. If the host has not accepted, the guest must shutdown
+ * the service.
+ *
+ * is_accepted: Indicates if the host has accepted guest's capabilities.
+ * reservedz: Must be 0.
+ */
+
+struct dm_capabilities_resp_msg {
+    struct dm_header hdr;
+    uint64_t is_accepted:1;
+    uint64_t hot_remove:1;
+    uint64_t suppress_pressure_reports:1;
+    uint64_t reservedz:61;
+} QEMU_PACKED;
+
+/*
+ * This message is used to report memory pressure from the guest.
+ * This message is not part of any transaction and there is no
+ * response to this message.
+ *
+ * num_avail: Available memory in pages.
+ * num_committed: Committed memory in pages.
+ * page_file_size: The accumulated size of all page files
+ *                 in the system in pages.
+ * zero_free: The number of zero and free pages.
+ * page_file_writes: The writes to the page file in pages.
+ * io_diff: An indicator of file cache efficiency or page file activity,
+ *          calculated as File Cache Page Fault Count - Page Read Count.
+ *          This value is in pages.
+ *
+ * Some of these metrics are Windows specific and fortunately
+ * the algorithm on the host side that computes the guest memory
+ * pressure only uses num_committed value.
+ */
+
+struct dm_status {
+    struct dm_header hdr;
+    uint64_t num_avail;
+    uint64_t num_committed;
+    uint64_t page_file_size;
+    uint64_t zero_free;
+    uint32_t page_file_writes;
+    uint32_t io_diff;
+} QEMU_PACKED;
+
+
+/*
+ * Message to ask the guest to allocate memory - balloon up message.
+ * This message is sent from the host to the guest. The guest may not be
+ * able to allocate as much memory as requested.
+ *
+ * num_pages: number of pages to allocate.
+ */
+
+struct dm_balloon {
+    struct dm_header hdr;
+    uint32_t num_pages;
+    uint32_t reservedz;
+} QEMU_PACKED;
+
+
+/*
+ * Balloon response message; this message is sent from the guest
+ * to the host in response to the balloon message.
+ *
+ * reservedz: Reserved; must be set to zero.
+ * more_pages: If FALSE, this is the last message of the transaction.
+ * if TRUE there will be at least one more message from the guest.
+ *
+ * range_count: The number of ranges in the range array.
+ *
+ * range_array: An array of page ranges returned to the host.
+ *
+ */
+
+struct dm_balloon_response {
+    struct dm_header hdr;
+    uint32_t reservedz;
+    uint32_t more_pages:1;
+    uint32_t range_count:31;
+    union dm_mem_page_range range_array[];
+} QEMU_PACKED;
+
+/*
+ * Un-balloon message; this message is sent from the host
+ * to the guest to give guest more memory.
+ *
+ * more_pages: If FALSE, this is the last message of the transaction.
+ * if TRUE there will be at least one more message from the guest.
+ *
+ * reservedz: Reserved; must be set to zero.
+ *
+ * range_count: The number of ranges in the range array.
+ *
+ * range_array: An array of page ranges returned to the host.
+ *
+ */
+
+struct dm_unballoon_request {
+    struct dm_header hdr;
+    uint32_t more_pages:1;
+    uint32_t reservedz:31;
+    uint32_t range_count;
+    union dm_mem_page_range range_array[];
+} QEMU_PACKED;
+
+/*
+ * Un-balloon response message; this message is sent from the guest
+ * to the host in response to an unballoon request.
+ *
+ */
+
+struct dm_unballoon_response {
+    struct dm_header hdr;
+} QEMU_PACKED;
+
+
+/*
+ * Hot add request message. Message sent from the host to the guest.
+ *
+ * mem_range: Memory range to hot add.
+ *
+ */
+
+struct dm_hot_add {
+    struct dm_header hdr;
+    union dm_mem_page_range range;
+} QEMU_PACKED;
+
+/*
+ * Hot add response message.
+ * This message is sent by the guest to report the status of a hot add request.
+ * If page_count is less than the requested page count, then the host should
+ * assume all further hot add requests will fail, since this indicates that
+ * the guest has hit an upper physical memory barrier.
+ *
+ * Hot adds may also fail due to low resources; in this case, the guest must
+ * not complete this message until the hot add can succeed, and the host must
+ * not send a new hot add request until the response is sent.
+ * If VSC fails to hot add memory DYNMEM_NUMBER_OF_UNSUCCESSFUL_HOTADD_ATTEMPTS
+ * times it fails the request.
+ *
+ *
+ * page_count: number of pages that were successfully hot added.
+ *
+ * result: result of the operation 1: success, 0: failure.
+ *
+ */
+
+struct dm_hot_add_response {
+    struct dm_header hdr;
+    uint32_t page_count;
+    uint32_t result;
+} QEMU_PACKED;
+
+struct dm_hot_remove {
+    struct dm_header hdr;
+    uint32_t virtual_node;
+    uint32_t page_count;
+    uint32_t qos_flags;
+    uint32_t reservedZ;
+} QEMU_PACKED;
+
+struct dm_hot_remove_response {
+    struct dm_header hdr;
+    uint32_t result;
+    uint32_t range_count;
+    uint64_t more_pages:1;
+    uint64_t reservedz:63;
+    union dm_mem_page_range range_array[];
+} QEMU_PACKED;
+
+#define DM_REMOVE_QOS_LARGE (1 << 0)
+#define DM_REMOVE_QOS_LOCAL (1 << 1)
+#define DM_REMOVE_QOS_MASK (0x3)
+
+/*
+ * Types of information sent from host to the guest.
+ */
+
+enum dm_info_type {
+    INFO_TYPE_MAX_PAGE_CNT = 0,
+    MAX_INFO_TYPE
+};
+
+
+/*
+ * Header for the information message.
+ */
+
+struct dm_info_header {
+    enum dm_info_type type;
+    uint32_t data_size;
+    uint8_t  data[];
+} QEMU_PACKED;
+
+/*
+ * This message is sent from the host to the guest to pass
+ * some relevant information (win8 addition).
+ *
+ * reserved: not used.
+ * info_size: size of the information blob.
+ * info: information blob.
+ */
+
+struct dm_info_msg {
+    struct dm_header hdr;
+    uint32_t reserved;
+    uint32_t info_size;
+    uint8_t  info[];
+};
+
+#endif



* [PATCH 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
  2020-09-20 13:25 ` [PATCH 1/3] haprot: add a device for memory hot-add protocols Maciej S. Szmigiero
  2020-09-20 13:25 ` [PATCH 2/3] Add Hyper-V Dynamic Memory Protocol definitions Maciej S. Szmigiero
@ 2020-09-20 13:25 ` Maciej S. Szmigiero
  2020-09-20 14:16 ` [PATCH 0/3] " no-reply
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-20 13:25 UTC (permalink / raw)
  To: Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S. Tsirkin, Markus Armbruster, qemu-devel, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

This driver is like virtio-balloon on steroids: it allows both changing the
guest memory allocation via ballooning and inserting extra RAM into it by
adding haprot virtual DIMM sticks.
One advantage of this approach over ACPI-based PC DIMM hotplug is that such
memory can be hotplugged at a much finer granularity because the ACPI DIMM
slot limit does not apply.

In contrast with ACPI DIMM hotplug, where one can only request to unplug a
whole DIMM stick, this driver allows removing memory from the guest in
single page (4k) units via ballooning.
Then, once the guest has released all the memory backed by a haprot
virtual DIMM stick, the device is marked "unused" and can be removed from
the VM, if desired.
A "HV_BALLOON_HAPROT_UNUSED" QMP event is emitted in this case so the
software controlling QEMU knows that this operation is now possible.

The haprot devices are also marked unused after a VM reboot (with a
corresponding "HV_BALLOON_HAPROT_UNUSED" QMP event).
They are automatically reinserted (if still present) after the guest
reconnects to this protocol (a "HV_BALLOON_HAPROT_INUSE" QMP event is then
emitted).

For performance reasons, the guest-released memory is tracked in a few range
trees, as a series of (start, count) ranges.
Each time a new page range is inserted into such a tree, its neighbors are
checked as candidates for possible merging with it.

Besides performance, the Dynamic Memory protocol itself uses page ranges as
the data structure in its messages, so the relevant pages need to be merged
into such ranges anyway.

One has to be careful when tracking the guest-released pages, since the
guest can maliciously report returning pages outside its current address
space, which could later clash with the address range of newly added memory.
Similarly, the guest can report freeing the same page twice.
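
To illustrate the duplicate-report case only (a rough sketch, not the
driver's exact accounting; page_range_tree_insert() and the removed_guest
tree / counter can be found further down in this patch):

  uint64_t dupcount = 0;

  /* record the range the guest reports as released; pages that were
   * already present in the tree are tallied in dupcount instead */
  page_range_tree_insert(balloon->removed_guest, start, count, &dupcount);

  /* credit only the pages that were not reported before */
  balloon->removed_guest_ctr += count - dupcount;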

The above design results in much better ballooning performance than when
using virtio-balloon with the same guest: 230 GB / minute with this driver
versus 70 GB / minute with virtio-balloon.

During a ballooning operation most of the time is spent waiting for the guest
to come up with newly freed page ranges; processing the received ranges on
the host side (in QEMU / KVM) is nearly instantaneous.

The unballoon operation is also pretty much instantaneous:
thanks to the merging of the ballooned-out page ranges, 200 GB of memory can
be returned to the guest in about 1 second.
With virtio-balloon this operation takes about 2.5 minutes.

These tests were done against a Windows Server 2019 guest running on a
Xeon E5-2699, after dirtying the whole memory inside the guest before each
balloon operation.

Using a range tree instead of a bitmap to track the removed memory also
means that the solution scales well with the guest size: even a 1 TB range
takes just a few bytes of memory (a single (start, count) pair, whereas a
bitmap at 4k page granularity would need 32 MiB to cover 1 TB).

Since the required GTree operations aren't present in every Glib version,
a check for them was added to the "configure" script, together with new
"--enable-hv-balloon" and "--disable-hv-balloon" arguments.
If these GTree operations are missing from the system Glib this driver will
be skipped during the QEMU build.

An optional "status-report=on" device parameter requests memory status
events from the guest (typically sent every second), which allow the host
to learn both the guest's available-memory and in-use-memory counts.
They are emitted externally as "HV_BALLOON_STATUS_REPORT" QMP events.

The driver is named hv-balloon since the Linux kernel client driver for
the Dynamic Memory Protocol carries that name, and to follow the naming
pattern established by the virtio-balloon driver.
The whole protocol runs over Hyper-V VMBus, whose QEMU implementation was
recently merged in.

The driver was tested against Windows Server 2012 R2, Windows Server 2016
and Windows Server 2019 guests and obeys the guest alignment requirements
reported to the host via the DM_CAPABILITIES_REPORT message.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 Kconfig.host           |    3 +
 configure              |   35 +
 hw/hyperv/Kconfig      |    5 +
 hw/hyperv/hv-balloon.c | 2172 ++++++++++++++++++++++++++++++++++++++++
 hw/hyperv/meson.build  |    1 +
 hw/hyperv/trace-events |   17 +
 meson.build            |    2 +
 qapi/misc.json         |   74 ++
 8 files changed, 2309 insertions(+)
 create mode 100644 hw/hyperv/hv-balloon.c

diff --git a/Kconfig.host b/Kconfig.host
index 4af19bf70ef9..691de49a6907 100644
--- a/Kconfig.host
+++ b/Kconfig.host
@@ -33,3 +33,6 @@ config VIRTFS
 
 config PVRDMA
     bool
+
+config HV_BALLOON_POSSIBLE
+    bool
diff --git a/configure b/configure
index 756447900855..9d4146286ba1 100755
--- a/configure
+++ b/configure
@@ -543,6 +543,7 @@ fuzzing="no"
 rng_none="no"
 secret_keyring=""
 libdaxctl=""
+hv_balloon=""
 meson=""
 ninja=""
 skip_meson=no
@@ -1628,6 +1629,10 @@ for opt do
   ;;
   --disable-libdaxctl) libdaxctl=no
   ;;
+  --enable-hv-balloon) hv_balloon=yes
+  ;;
+  --disable-hv-balloon) hv_balloon=no
+  ;;
   *)
       echo "ERROR: unknown option $opt"
       echo "Try '$0 --help' for more information"
@@ -1949,6 +1954,7 @@ disabled with --disable-FEATURE, default is enabled if available:
   xkbcommon       xkbcommon support
   rng-none        dummy RNG, avoid using /dev/(u)random and getrandom()
   libdaxctl       libdaxctl support
+  hv-balloon      hv-balloon driver where supported (requires extended GTree API)
 
 NOTE: The object files are built at the place where configure is launched
 EOF
@@ -6119,6 +6125,32 @@ if test "$libdaxctl" != "no"; then
 	fi
 fi
 
+##########################################
+# check for hv-balloon
+
+if test "$hv_balloon" != "no"; then
+	cat > $TMPC << EOF
+#include <string.h>
+#include <gmodule.h>
+int main(void) {
+    GTree *tree;
+
+    tree = g_tree_new((GCompareFunc)strcmp);
+    (void)g_tree_node_first(tree);
+    g_tree_destroy(tree);
+    return 0;
+}
+EOF
+	if compile_prog "$glib_cflags" "$glib_libs" ; then
+		hv_balloon=yes
+	else
+		if test "$hv_balloon" = "yes" ; then
+			feature_not_found "hv-balloon" "Update Glib"
+		fi
+		hv_balloon="no"
+	fi
+fi
+
 ##########################################
 # check for slirp
 
@@ -7352,6 +7384,9 @@ fi
 if test "$sheepdog" = "yes" ; then
   echo "CONFIG_SHEEPDOG=y" >> $config_host_mak
 fi
+if test "$hv_balloon" = "yes" ; then
+  echo "CONFIG_HV_BALLOON_POSSIBLE=y" >> $config_host_mak
+fi
 if test "$pty_h" = "yes" ; then
   echo "HAVE_PTY_H=y" >> $config_host_mak
 fi
diff --git a/hw/hyperv/Kconfig b/hw/hyperv/Kconfig
index 3fbfe41c9e55..3d311378943f 100644
--- a/hw/hyperv/Kconfig
+++ b/hw/hyperv/Kconfig
@@ -11,3 +11,8 @@ config VMBUS
     bool
     default y
     depends on HYPERV
+
+config HV_BALLOON
+    bool
+    default y
+    depends on HV_BALLOON_POSSIBLE && VMBUS && HAPROT
diff --git a/hw/hyperv/hv-balloon.c b/hw/hyperv/hv-balloon.c
new file mode 100644
index 000000000000..4d06843c9102
--- /dev/null
+++ b/hw/hyperv/hv-balloon.c
@@ -0,0 +1,2172 @@
+/*
+ * QEMU Hyper-V Dynamic Memory Protocol driver
+ *
+ * Copyright (C) 2020 Oracle and/or its affiliates.
+ *
+ * Author: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+
+#include "exec/address-spaces.h"
+#include "exec/cpu-common.h"
+#include "exec/memory.h"
+#include "exec/ramblock.h"
+#include "hw/hyperv/dynmem-proto.h"
+#include "hw/hyperv/vmbus.h"
+#include "hw/mem/haprot.h"
+#include "hw/mem/pc-dimm.h"
+#include "hw/qdev-core.h"
+#include "hw/qdev-properties.h"
+#include "qapi/error.h"
+#include "qapi/qapi-events-misc.h"
+#include "qemu/error-report.h"
+#include "qemu/module.h"
+#include "qemu/units.h"
+#include "qemu/timer.h"
+#include "sysemu/balloon.h"
+#include "trace.h"
+
+/*
+ * temporarily avoid warnings about enhanced GTree API usage requiring a
+ * too recent Glib version until GLIB_VERSION_MAX_ALLOWED finally reaches
+ * the Glib version with this API
+ */
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+
+#define TYPE_HV_BALLOON "hv-balloon"
+#define HV_BALLOON_GUID "525074DC-8985-46e2-8057-A307DC18A502"
+#define HV_BALLOON_PFN_SHIFT 12
+#define HV_BALLOON_PAGE_SIZE (1 << HV_BALLOON_PFN_SHIFT)
+
+/*
+ * Some Windows versions (at least Server 2019) will crash with various
+ * error codes when receiving DM protocol requests (at least
+ * DM_MEM_HOT_ADD_REQUEST) immediately after boot.
+ *
+ * It looks like Hyper-V from Server 2016 uses a 50-second after-boot
+ * delay, probably to work around this issue, so we'll use this value, too.
+ */
+#define HV_BALLOON_POST_INIT_WAIT (50 * 1000)
+
+#define HV_BALLOON_HA_CHUNK_SIZE (2 * GiB)
+#define HV_BALLOON_HA_CHUNK_PAGES (HV_BALLOON_HA_CHUNK_SIZE / HV_BALLOON_PAGE_SIZE)
+
+#define HV_BALLOON_HR_CHUNK_PAGES 585728
+/*
+ *                                ^ that's the maximum number of pages
+ * that Windows returns in one hot remove response
+ *
+ * If the number requested is too high Windows will no longer honor
+ * these requests
+ */
+
+typedef enum State {
+    S_WAIT_RESET,
+    S_CLOSED,
+    S_VERSION,
+    S_CAPS,
+    S_POST_INIT_WAIT,
+    S_IDLE,
+    S_HOT_ADD_RB_WAIT,
+    S_HOT_ADD_POSTING,
+    S_HOT_ADD_REPLY_WAIT,
+    S_HOT_ADD_SKIP_CURRENT,
+    S_HOT_ADD_PROCESSED_CLEAR_PENDING,
+    S_HOT_ADD_PROCESSED_NEXT,
+    S_HOT_REMOVE,
+    S_BALLOON_POSTING,
+    S_BALLOON_RB_WAIT,
+    S_BALLOON_REPLY_WAIT,
+    S_UNBALLOON_POSTING,
+    S_UNBALLOON_RB_WAIT,
+    S_UNBALLOON_REPLY_WAIT,
+} State;
+
+typedef struct PageRange {
+    uint64_t start;
+    uint64_t count;
+} PageRange;
+
+/* type safety */
+typedef struct PageRangeTree {
+    GTree *t;
+} PageRangeTree;
+
+typedef struct HAProtRange {
+    HAProtDevice *haprot;
+
+    PageRange range;
+    uint64_t used;
+
+    /*
+     * Pages not currently usable due to guest alignment reqs or
+     * not hot added in the first place
+     */
+    uint64_t unused_head, unused_tail;
+
+    /* Memory removed from the guest backed by this HAProt */
+    PageRangeTree removed_guest, removed_both;
+} HAProtRange;
+
+/* type safety */
+typedef struct HAProtRangeTree {
+    GTree *t;
+} HAProtRangeTree;
+
+typedef struct HvBalloon {
+    VMBusDevice parent;
+    State state;
+    bool state_changed;
+    bool status_reports;
+
+    union dm_version version;
+    union dm_caps caps;
+
+    QEMUTimer post_init_timer;
+
+    unsigned int trans_id;
+
+    /* Guest target size */
+    uint64_t target;
+    bool target_changed;
+    uint64_t target_diff;
+
+    HAProtRangeTree haprots;
+
+    /* Ranges disallowed for hot added memory */
+    PageRangeTree haprot_disallowed;
+
+    /* Non-HAProt removed memory */
+    PageRangeTree removed_guest, removed_both;
+
+    /* Grand totals of removed memory (both HAProt and non-HAProt) */
+    uint64_t removed_guest_ctr, removed_both_ctr;
+
+    GSList *ha_todo;
+    uint64_t ha_current_count;
+} HvBalloon;
+
+#define HV_BALLOON(obj) OBJECT_CHECK(HvBalloon, (obj), TYPE_HV_BALLOON)
+
+#define HV_BALLOON_SET_STATE(hvb, news)             \
+    do {                                            \
+        _hv_balloon_state_set(hvb, news, # news);   \
+    } while (0)
+
+#define SUM_OVERFLOW_U64(in1, in2) ((in1) > UINT64_MAX - (in2))
+#define SUM_SATURATE_U64(in1, in2)              \
+    ({                                          \
+        uint64_t _in1 = (in1), _in2 = (in2);    \
+        uint64_t _result;                       \
+                                                \
+        if (!SUM_OVERFLOW_U64(_in1, _in2)) {    \
+            _result = _in1 + _in2;              \
+        } else {                                \
+            _result = UINT64_MAX;               \
+        }                                       \
+                                                \
+        _result;                                \
+    })
+
+typedef struct HvBalloonReq {
+    VMBusChanReq vmreq;
+} HvBalloonReq;
+
+/* PageRange */
+static void page_range_intersect(const PageRange *range,
+                                 uint64_t start, uint64_t count,
+                                 PageRange *out)
+{
+    uint64_t end1 = range->start + range->count;
+    uint64_t end2 = start + count;
+    uint64_t end = MIN(end1, end2);
+
+    out->start = MAX(range->start, start);
+    out->count = out->start < end ? end - out->start : 0;
+}
+
+static uint64_t page_range_intersection_size(const PageRange *range,
+                                             uint64_t start, uint64_t count)
+{
+    PageRange trange;
+
+    page_range_intersect(range, start, count, &trange);
+    return trange.count;
+}
+
+/* return just the part of range before (start) */
+static void page_range_part_before(const PageRange *range,
+                                   uint64_t start, PageRange *out)
+{
+    uint64_t endr = range->start + range->count;
+    uint64_t end = MIN(endr, start);
+
+    out->start = range->start;
+    if (end > out->start) {
+        out->count = end - out->start;
+    } else {
+        out->count = 0;
+    }
+}
+
+/* return just the part of range after (start, count) */
+static void page_range_part_after(const PageRange *range,
+                                  uint64_t start, uint64_t count,
+                                  PageRange *out)
+{
+    uint64_t end = range->start + range->count;
+    uint64_t ends = start + count;
+
+    out->start = MAX(range->start, ends);
+    if (end > out->start) {
+        out->count = end - out->start;
+    } else {
+        out->count = 0;
+    }
+}
+
+static bool page_range_joinable_left(const PageRange *range,
+                                     uint64_t start, uint64_t count)
+{
+    return start + count == range->start;
+}
+
+static bool page_range_joinable_right(const PageRange *range,
+                                      uint64_t start, uint64_t count)
+{
+    return range->start + range->count == start;
+}
+
+static bool page_range_joinable(const PageRange *range,
+                                uint64_t start, uint64_t count)
+{
+    return page_range_joinable_left(range, start, count) ||
+        page_range_joinable_right(range, start, count);
+}
+
+/* PageRangeTree */
+static gint page_range_tree_key_compare(gconstpointer leftp,
+                                        gconstpointer rightp,
+                                        gpointer user_data)
+{
+    const uint64_t *left = leftp, *right = rightp;
+
+    if (*left < *right) {
+        return -1;
+    } else if (*left > *right) {
+        return 1;
+    } else { /* *left == *right */
+        return 0;
+    }
+}
+
+static GTreeNode *page_range_tree_insert_new(PageRangeTree tree,
+                                             uint64_t start, uint64_t count)
+{
+    uint64_t *key = g_malloc(sizeof(*key));
+    PageRange *range = g_malloc(sizeof(*range));
+
+    assert(count > 0);
+
+    *key = range->start = start;
+    range->count = count;
+
+    return g_tree_insert_node(tree.t, key, range);
+}
+
+static void page_range_tree_insert(PageRangeTree tree,
+                                   uint64_t start, uint64_t count,
+                                   uint64_t *dupcount)
+{
+    GTreeNode *node;
+    bool joinable;
+    uint64_t intersection;
+    PageRange *range;
+
+    assert(!SUM_OVERFLOW_U64(start, count));
+    if (count == 0) {
+        return;
+    }
+
+    node = g_tree_upper_bound(tree.t, &start);
+    if (node) {
+        node = g_tree_node_previous(node);
+    } else {
+        node = g_tree_node_last(tree.t);
+    }
+
+    if (node) {
+        range = g_tree_node_value(node);
+        assert(range);
+        intersection = page_range_intersection_size(range, start, count);
+        joinable = page_range_joinable_right(range, start, count);
+    }
+
+    if (!node ||
+        (!intersection && !joinable)) {
+        /*
+         * !node case: the tree is empty or the very first node in the tree
+         * already has a higher key (the start of its range).
+         * the other case: there is a gap in the tree between the new range
+         * and the previous one.
+         * anyway, let's just insert the new range into the tree.
+         */
+        node = page_range_tree_insert_new(tree, start, count);
+        assert(node);
+        range = g_tree_node_value(node);
+        assert(range);
+    } else {
+        /*
+         * the previous range in the tree either partially covers the new
+         * range or ends just at its beginning - extend it
+         */
+        if (dupcount) {
+            *dupcount += intersection;
+        }
+
+        count += start - range->start;
+        range->count = MAX(range->count, count);
+    }
+
+    /* check next nodes for possible merging */
+    for (node = g_tree_node_next(node); node; ) {
+        PageRange *rangecur;
+
+        rangecur = g_tree_node_value(node);
+        assert(rangecur);
+
+        intersection = page_range_intersection_size(rangecur,
+                                                    range->start, range->count);
+        joinable = page_range_joinable_left(rangecur,
+                                            range->start, range->count);
+        if (!intersection && !joinable) {
+            /* the current node is disjoint */
+            break;
+        }
+
+        if (dupcount) {
+            *dupcount += intersection;
+        }
+
+        count = rangecur->count + (rangecur->start - range->start);
+        range->count = MAX(range->count, count);
+
+        /* the current node was merged in, remove it */
+        start = rangecur->start;
+        node = g_tree_node_next(node);
+        /* no hinted removal in GTree... */
+        g_tree_remove(tree.t, &start);
+    }
+}
+
+static bool page_range_tree_pop(PageRangeTree tree, PageRange *out,
+                                uint64_t maxcount)
+{
+    GTreeNode *node;
+    PageRange *range;
+
+    node = g_tree_node_last(tree.t);
+    if (!node) {
+        return false;
+    }
+
+    range = g_tree_node_value(node);
+    assert(range);
+
+    out->start = range->start;
+
+    /* can't modify range->start as it is the node key */
+    if (range->count > maxcount) {
+        out->start += range->count - maxcount;
+        out->count = maxcount;
+        range->count -= maxcount;
+    } else {
+        out->count = range->count;
+        /* no hinted removal in GTree... */
+        g_tree_remove(tree.t, &out->start);
+    }
+
+    return true;
+}
+
+static bool page_range_tree_intree_any(PageRangeTree tree,
+                                       uint64_t start, uint64_t count)
+{
+    GTreeNode *node;
+
+    if (count == 0) {
+        return false;
+    }
+
+    /* find the first node that can possibly intersect our range */
+    node = g_tree_upper_bound(tree.t, &start);
+    if (node) {
+        /*
+         * a NULL node below means that the very first node in the tree
+         * already has a higher key (the start of its range).
+         */
+        node = g_tree_node_previous(node);
+    } else {
+        /* a NULL node below means that the tree is empty */
+        node = g_tree_node_last(tree.t);
+    }
+    /* node range start <= range start */
+
+    if (!node) {
+        /* node range start > range start */
+        node = g_tree_node_first(tree.t);
+    }
+
+    for ( ; node; node = g_tree_node_next(node)) {
+        PageRange *range = g_tree_node_value(node);
+
+        assert(range);
+        /*
+         * if this node starts beyond or at the end of our range, so does
+         * every next one
+         */
+        if (range->start >= start + count) {
+            break;
+        }
+
+        if (page_range_intersection_size(range, start, count) > 0) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
+static gboolean page_range_tree_npages_node(gpointer key,
+                                            gpointer value,
+                                            gpointer data)
+{
+    PageRange *range = value;
+    uint64_t *npages = data;
+
+    *npages += range->count;
+
+    return false;
+}
+
+static void page_range_tree_npages(PageRangeTree tree, uint64_t *npages)
+{
+    g_tree_foreach(tree.t, page_range_tree_npages_node, npages);
+}
+
+static PageRangeTree page_range_tree_new(void)
+{
+    PageRangeTree tree;
+
+    tree.t = g_tree_new_full(page_range_tree_key_compare, NULL,
+                             g_free, g_free);
+    return tree;
+}
+
+static void page_range_tree_destroy(PageRangeTree *tree)
+{
+    /* g_tree_destroy() is not NULL-safe */
+    if (!tree->t) {
+        return;
+    }
+
+    g_tree_destroy(tree->t);
+    tree->t = NULL;
+}
+
+/* HAProtDevice */
+static uint64_t haprot_get_size(HAProtDevice *haprot)
+{
+    return object_property_get_uint(OBJECT(haprot), HAPROT_SIZE_PROP,
+                                    &error_abort) / HV_BALLOON_PAGE_SIZE;
+}
+
+static void haprot_get_range(HAProtDevice *haprot, PageRange *out)
+{
+    out->start = object_property_get_uint(OBJECT(haprot), HAPROT_ADDR_PROP,
+                                          &error_abort) / HV_BALLOON_PAGE_SIZE;
+    assert(out->start > 0);
+
+    out->count = haprot_get_size(haprot);
+    assert(out->count > 0);
+}
+
+static void haprot_mark_inuse(HAProtDevice *haprot)
+{
+    const DeviceState *dev = DEVICE(haprot);
+
+    haprot->busy = true;
+    qapi_event_send_hv_balloon_haprot_inuse(dev->id ? : "");
+}
+
+static void haprot_mark_unused(HAProtDevice *haprot)
+{
+    const DeviceState *dev = DEVICE(haprot);
+
+    haprot->busy = false;
+    qapi_event_send_hv_balloon_haprot_unused(dev->id ? : "");
+}
+
+/* HAProtRange */
+/* the haprot range reduced by unused head and tail */
+static void haprot_range_get_effective_range(HAProtRange *hpr, PageRange *out)
+{
+    out->start = hpr->range.start + hpr->unused_head;
+    out->count = hpr->range.count - hpr->unused_head - hpr->unused_tail;
+}
+
+/*
+ * reset without triggering unref or notify when reaching zero pages used
+ * and without decrementing the grand total counters of removed memory
+ */
+static void haprot_range_reset_nounref(HAProtRange *hpr)
+{
+    /* mark the whole range as unused */
+    hpr->used = 0;
+    hpr->unused_head = hpr->range.count;
+    hpr->unused_tail = 0;
+
+    page_range_tree_destroy(&hpr->removed_guest);
+    page_range_tree_destroy(&hpr->removed_both);
+    hpr->removed_guest = page_range_tree_new();
+    hpr->removed_both = page_range_tree_new();
+}
+
+static void haprot_range_increment(HAProtRange *hpr, uint64_t diff)
+{
+    if (diff == 0) {
+        return;
+    }
+
+    if (hpr->used == 0) {
+        haprot_mark_inuse(hpr->haprot);
+    }
+
+    hpr->used += diff;
+}
+
+static void haprot_range_decrement(HAProtRange *hpr, uint64_t diff)
+{
+    if (diff == 0) {
+        return;
+    }
+
+    hpr->used -= diff;
+
+    if (hpr->used == 0) {
+        haprot_mark_unused(hpr->haprot);
+    }
+}
+
+static void haprot_range_reset(HAProtRange *hpr)
+{
+    haprot_range_decrement(hpr, hpr->used);
+    haprot_range_reset_nounref(hpr);
+}
+
+/* HAProtRangeTree */
+static gint haprot_tree_key_compare(gconstpointer leftp, gconstpointer rightp,
+                                    gpointer user_data)
+{
+    /*
+     * haprot tree is also keyed on page range start, so we can simply reuse
+     * the comparison function from the page range tree
+     */
+    return page_range_tree_key_compare(leftp, rightp, user_data);
+}
+
+static HAProtRange *haprot_tree_insert_new(HvBalloon *balloon,
+                                           HAProtDevice *haprot)
+{
+    uint64_t *key = g_malloc(sizeof(*key));
+    HAProtRange *hpr = g_malloc(sizeof(*hpr));
+
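+    /* mark the device busy so it cannot be unplugged while tracked here */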
+    haprot->busy = true;
+    hpr->haprot = haprot;
+
+    haprot_get_range(haprot, &hpr->range);
+    *key = hpr->range.start;
+
+    hpr->removed_guest.t = hpr->removed_both.t = NULL;
+    haprot_range_reset_nounref(hpr);
+
+    g_tree_insert(balloon->haprots.t, key, hpr);
+
+    return hpr;
+}
+
+static void haprot_tree_remove(HvBalloon *balloon, HAProtDevice *haprot)
+{
+    uint64_t addr;
+
+    addr = object_property_get_uint(OBJECT(haprot), HAPROT_ADDR_PROP,
+                                    &error_abort) /
+        HV_BALLOON_PAGE_SIZE;
+    assert(addr > 0);
+
+    g_tree_remove(balloon->haprots.t, &addr);
+}
+
+static HAProtRange *haprot_tree_lookup_maybe(HvBalloon *balloon,
+                                             HAProtDevice *haprot)
+{
+    uint64_t addr;
+    GTreeNode *node;
+    HAProtRange *hpr;
+
+    addr = object_property_get_uint(OBJECT(haprot), HAPROT_ADDR_PROP,
+                                    &error_abort) /
+        HV_BALLOON_PAGE_SIZE;
+    assert(addr > 0);
+
+    node = g_tree_lookup_node(balloon->haprots.t, &addr);
+    if (!node) {
+        return NULL;
+    }
+
+    hpr = g_tree_node_value(node);
+    assert(hpr);
+    return hpr;
+}
+
+static HAProtRange *haprot_tree_lookup(HvBalloon *balloon,
+                                       HAProtDevice *haprot)
+{
+    HAProtRange *hpr;
+
+    hpr = haprot_tree_lookup_maybe(balloon, haprot);
+    assert(hpr);
+    return hpr;
+}
+
+/* total RAM includes memory currently removed from the guest */
+static gboolean haprot_tree_total_ram_node(gpointer key,
+                                           gpointer value,
+                                           gpointer data)
+{
+    HAProtRange *hpr = value;
+    uint64_t *size = data;
+    PageRange rangeeff;
+
+    haprot_range_get_effective_range(hpr, &rangeeff);
+    *size += rangeeff.count;
+
+    return false;
+}
+
+static uint64_t haprot_tree_total_ram(HvBalloon *balloon)
+{
+    uint64_t size = 0;
+
+    g_tree_foreach(balloon->haprots.t, haprot_tree_total_ram_node, &size);
+    return size;
+}
+
+static void haprot_tree_value_free(gpointer data)
+{
+    HAProtRange *hpr = data;
+
+    page_range_tree_destroy(&hpr->removed_guest);
+    page_range_tree_destroy(&hpr->removed_both);
+    g_free(hpr);
+}
+
+static gboolean haprot_tree_reset_all_node(gpointer key,
+                                           gpointer value,
+                                           gpointer data)
+{
+    HAProtRange *hpr = value;
+
+    haprot_range_reset(hpr);
+
+    return false;
+}
+
+static void haprot_tree_reset_all(HvBalloon *balloon)
+{
+    g_tree_foreach(balloon->haprots.t, haprot_tree_reset_all_node, NULL);
+}
+
+static HAProtRangeTree haprot_tree_new(void)
+{
+    HAProtRangeTree tree;
+
+    tree.t = g_tree_new_full(haprot_tree_key_compare, NULL, g_free,
+                             haprot_tree_value_free);
+    return tree;
+}
+
+static void haprot_tree_destroy(HAProtRangeTree *tree)
+{
+    /* g_tree_destroy() is not NULL-safe */
+    if (!tree->t) {
+        return;
+    }
+
+    g_tree_destroy(tree->t);
+    tree->t = NULL;
+}
+
+static gboolean ha_todo_add_all_node(gpointer key,
+                                     gpointer value,
+                                     gpointer data)
+{
+    HAProtRange *hpr = value;
+    HvBalloon *balloon = data;
+
+    /* assume the hpr has been reset */
+    assert(hpr->used == 0);
+    assert(hpr->unused_head == hpr->range.count);
+    assert(hpr->unused_tail == 0);
+
+    hpr->haprot->busy = true;
+    haprot_mark_inuse(hpr->haprot);
+    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
+
+    return false;
+}
+
+static void ha_todo_add_all(HvBalloon *balloon)
+{
+    assert(balloon->ha_todo == NULL);
+    g_tree_foreach(balloon->haprots.t, ha_todo_add_all_node, balloon);
+}
+
+static void ha_todo_clear(HvBalloon *balloon)
+{
+    while (balloon->ha_todo) {
+        HAProtRange *hpr = balloon->ha_todo->data;
+
+        haprot_range_reset_nounref(hpr);
+        haprot_mark_unused(hpr->haprot);
+
+        balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr);
+    }
+}
+
+/* TODO: unify the code below with virtio-balloon and cache the value */
+static int build_dimm_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
+        DeviceState *dev = DEVICE(obj);
+        if (dev->realized) { /* only realized DIMMs matter */
+            *list = g_slist_prepend(*list, dev);
+        }
+    }
+
+    object_child_foreach(obj, build_dimm_list, opaque);
+    return 0;
+}
+
+static ram_addr_t get_current_ram_size(void)
+{
+    GSList *list = NULL, *item;
+    ram_addr_t size = ram_size;
+
+    build_dimm_list(qdev_get_machine(), &list);
+    for (item = list; item; item = g_slist_next(item)) {
+        Object *obj = OBJECT(item->data);
+        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
+            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
+                                            &error_abort);
+        }
+    }
+    g_slist_free(list);
+
+    return size;
+}
+
+/* total RAM includes memory currently removed from the guest */
+static uint64_t hv_balloon_total_ram(HvBalloon *balloon)
+{
+    ram_addr_t ram_size = get_current_ram_size();
+    uint64_t ram_size_pages = ram_size >> HV_BALLOON_PFN_SHIFT;
+    uint64_t haprot_size_pages = haprot_tree_total_ram(balloon);
+
+    assert(ram_size_pages > 0);
+
+    return SUM_SATURATE_U64(ram_size_pages, haprot_size_pages);
+}
+
+/*
+ * calculating the total RAM size is a slow operation,
+ * avoid it as much as possible
+ */
+static uint64_t hv_balloon_total_removed_rs(HvBalloon *balloon,
+                                            uint64_t ram_size_pages)
+{
+    uint64_t total_removed;
+
+    total_removed = SUM_SATURATE_U64(balloon->removed_guest_ctr,
+                                     balloon->removed_both_ctr);
+
+    /* possible if guest returns pages outside actual RAM */
+    if (total_removed > ram_size_pages) {
+        total_removed = ram_size_pages;
+    }
+
+    return total_removed;
+}
+
+static bool hv_balloon_state_is_init(HvBalloon *balloon)
+{
+    return balloon->state == S_WAIT_RESET ||
+        balloon->state == S_CLOSED ||
+        balloon->state == S_VERSION ||
+        balloon->state == S_CAPS;
+}
+
+static void _hv_balloon_state_set(HvBalloon *balloon,
+                                  State newst, const char *newststr)
+{
+    if (balloon->state == newst) {
+        return;
+    }
+
+    balloon->state = newst;
+    balloon->state_changed = true;
+    trace_hv_balloon_state_change(newststr);
+}
+
+static VMBusChannel *hv_balloon_get_channel_maybe(HvBalloon *balloon)
+{
+    return vmbus_device_channel(&balloon->parent, 0);
+}
+
+static VMBusChannel *hv_balloon_get_channel(HvBalloon *balloon)
+{
+    VMBusChannel *chan;
+
+    chan = hv_balloon_get_channel_maybe(balloon);
+    assert(chan != NULL);
+    return chan;
+}
+
+static ssize_t hv_balloon_send_packet(VMBusChannel *chan,
+                                      struct dm_message *msg)
+{
+    int ret;
+
+    ret = vmbus_channel_reserve(chan, 0, msg->hdr.size);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                              NULL, 0, msg, msg->hdr.size, false,
+                              msg->hdr.trans_id);
+}
+
+static bool hv_balloon_unballoon_get_source(HvBalloon *balloon,
+                                            PageRangeTree *dtree,
+                                            uint64_t **dctr,
+                                            HAProtRange **hpr)
+{
+    /* Try the boot memory first */
+    if (g_tree_nnodes(balloon->removed_guest.t) > 0) {
+        *dtree = balloon->removed_guest;
+        *dctr = &balloon->removed_guest_ctr;
+        *hpr = NULL;
+    } else if (g_tree_nnodes(balloon->removed_both.t) > 0) {
+        *dtree = balloon->removed_both;
+        *dctr = &balloon->removed_both_ctr;
+        *hpr = NULL;
+    } else {
+        GTreeNode *node;
+
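+        /*
+         * otherwise pick the first haprot range that still has some
+         * removed pages tracked
+         */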
+        for (node = g_tree_node_first(balloon->haprots.t); node;
+             node = g_tree_node_next(node)) {
+            HAProtRange *hprnode = g_tree_node_value(node);
+
+            assert(hprnode);
+            if (g_tree_nnodes(hprnode->removed_guest.t) > 0) {
+                *dtree = hprnode->removed_guest;
+                *dctr = &balloon->removed_guest_ctr;
+                *hpr = hprnode;
+                break;
+            } else if (g_tree_nnodes(hprnode->removed_both.t) > 0) {
+                *dtree = hprnode->removed_both;
+                *dctr = &balloon->removed_both_ctr;
+                *hpr = hprnode;
+                break;
+            }
+        }
+
+        if (!node) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+static void hv_balloon_balloon_unballoon_start(HvBalloon *balloon,
+                                               uint64_t ram_size_pages)
+{
+    uint64_t total_removed = hv_balloon_total_removed_rs(balloon,
+                                                         ram_size_pages);
+
+    assert(balloon->state == S_IDLE);
+    assert(ram_size_pages > 0);
+
+    /*
+     * we need to cache the value when starting the (un)balloon procedure
+     * in case somebody changes the balloon target when the procedure is
+     * in progress
+     */
+    if (balloon->target < ram_size_pages - total_removed) {
+        balloon->target_diff = ram_size_pages - total_removed - balloon->target;
+        HV_BALLOON_SET_STATE(balloon, S_BALLOON_RB_WAIT);
+    } else {
+        balloon->target_diff = balloon->target -
+            (ram_size_pages - total_removed);
+
+        /*
+         * careful here, the user might have set the balloon target above
+         * the RAM size, in which case the difference computed above would
+         * exceed the total removed count
+         */
+        balloon->target_diff = MIN(balloon->target_diff, total_removed);
+        HV_BALLOON_SET_STATE(balloon, S_UNBALLOON_RB_WAIT);
+    }
+
+    balloon->target_changed = false;
+}
+
+static void hv_balloon_unballoon_rb_wait(HvBalloon *balloon)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    struct dm_unballoon_request *ur;
+    size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
+
+    assert(balloon->state == S_UNBALLOON_RB_WAIT);
+
+    if (vmbus_channel_reserve(chan, 0, ur_size) < 0) {
+        return;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_UNBALLOON_POSTING);
+}
+
+static void hv_balloon_unballoon_posting(HvBalloon *balloon)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    PageRangeTree dtree;
+    uint64_t *dctr;
+    HAProtRange *hpr;
+    struct dm_unballoon_request *ur;
+    size_t ur_size = sizeof(*ur) + sizeof(ur->range_array[0]);
+    PageRange range;
+    bool bret;
+    ssize_t ret;
+
+    assert(balloon->state == S_UNBALLOON_POSTING);
+    assert(balloon->target_diff > 0);
+
+    if (!hv_balloon_unballoon_get_source(balloon, &dtree, &dctr, &hpr)) {
+        error_report("trying to unballoon but nothing ballooned");
+        /*
+         * there is little we can do as we might have already
+         * sent the guest a partial request we can't cancel
+         */
+        return;
+    }
+
+    assert(dtree.t);
+    assert(dctr);
+
+    ur = alloca(ur_size);
+    memset(ur, 0, ur_size);
+    ur->hdr.type = DM_UNBALLOON_REQUEST;
+    ur->hdr.size = ur_size;
+    ur->hdr.trans_id = balloon->trans_id;
+
+    bret = page_range_tree_pop(dtree, &range, MIN(balloon->target_diff,
+                                                  HV_BALLOON_HA_CHUNK_PAGES));
+    assert(bret);
+    /* TODO: madvise? */
+
+    *dctr -= range.count;
+    balloon->target_diff -= range.count;
+    if (hpr) {
+        haprot_range_increment(hpr, range.count);
+    }
+
+    ur->range_count = 1;
+    ur->range_array[0].finfo.start_page = range.start;
+    ur->range_array[0].finfo.page_cnt = range.count;
+    ur->more_pages = balloon->target_diff > 0;
+
+    trace_hv_balloon_outgoing_unballoon(ur->hdr.trans_id,
+                                        range.count, range.start,
+                                        balloon->target_diff);
+
+    if (ur->more_pages) {
+        HV_BALLOON_SET_STATE(balloon, S_UNBALLOON_RB_WAIT);
+    } else {
+        HV_BALLOON_SET_STATE(balloon, S_UNBALLOON_REPLY_WAIT);
+    }
+
+    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                             NULL, 0, ur, ur_size, false,
+                             ur->hdr.trans_id);
+    if (ret <= 0) {
+        error_report("error %zd when posting unballoon msg, expect problems",
+                     ret);
+    }
+}
+
+static void hv_balloon_hot_add_start(HvBalloon *balloon)
+{
+    HAProtRange *hpr;
+    PageRange range;
+
+    assert(balloon->state == S_IDLE);
+    assert(balloon->ha_todo);
+
+    hpr = balloon->ha_todo->data;
+
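+    /*
+     * the guest requires hot-added ranges to be aligned to
+     * (1 << hot_add_alignment) MiB, so skip the unaligned head and tail
+     */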
+    range.start = QEMU_ALIGN_UP(hpr->range.start,
+                                (1 << balloon->caps.cap_bits.hot_add_alignment)
+                                * (MiB / HV_BALLOON_PAGE_SIZE));
+    hpr->unused_head = range.start - hpr->range.start;
+    if (hpr->unused_head >= hpr->range.count) {
+        HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_SKIP_CURRENT);
+        return;
+    }
+
+    range.count = hpr->range.count - hpr->unused_head;
+    range.count = QEMU_ALIGN_DOWN(range.count,
+                                  (1 << balloon->caps.cap_bits.hot_add_alignment)
+                                  * (MiB / HV_BALLOON_PAGE_SIZE));
+    if (range.count == 0) {
+        HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_SKIP_CURRENT);
+        return;
+    }
+    hpr->unused_tail = hpr->range.count - hpr->unused_head - range.count;
+    hpr->used = 0;
+
+    HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_RB_WAIT);
+}
+
+static void hv_balloon_hot_add_rb_wait(HvBalloon *balloon)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    struct dm_hot_add *ha;
+    size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+
+    assert(balloon->state == S_HOT_ADD_RB_WAIT);
+
+    if (vmbus_channel_reserve(chan, 0, ha_size) < 0) {
+        return;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_POSTING);
+}
+
+static void hv_balloon_hot_add_posting(HvBalloon *balloon)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    HAProtRange *hpr;
+    struct dm_hot_add *ha;
+    size_t ha_size = sizeof(*ha) + sizeof(ha->range);
+    union dm_mem_page_range *ha_region;
+    PageRange range;
+    uint64_t chunk_max_size;
+    ssize_t ret;
+
+    assert(balloon->state == S_HOT_ADD_POSTING);
+    assert(balloon->ha_todo);
+
+    hpr = balloon->ha_todo->data;
+
+    range.start = hpr->range.start + hpr->unused_head + hpr->used;
+    range.count = hpr->range.count;
+    range.count -= hpr->unused_head;
+    range.count -= hpr->used;
+    range.count -= hpr->unused_tail;
+
+    chunk_max_size = MAX((1 << balloon->caps.cap_bits.hot_add_alignment) *
+                         (MiB / HV_BALLOON_PAGE_SIZE),
+                         HV_BALLOON_HA_CHUNK_PAGES);
+    range.count = MIN(range.count, chunk_max_size);
+    balloon->ha_current_count = range.count;
+
+    ha = alloca(ha_size);
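+    /*
+     * ha_region points at the extra page range entry that follows
+     * ha->range in the allocated message buffer
+     */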
+    ha_region = &(&ha->range)[1];
+    memset(ha, 0, ha_size);
+    ha->hdr.type = DM_MEM_HOT_ADD_REQUEST;
+    ha->hdr.size = ha_size;
+    ha->hdr.trans_id = balloon->trans_id;
+
+    ha->range.finfo.start_page = range.start;
+    ha->range.finfo.page_cnt = range.count;
+    ha_region->finfo.start_page = range.start;
+    ha_region->finfo.page_cnt = ha->range.finfo.page_cnt;
+
+    trace_hv_balloon_outgoing_hot_add(ha->hdr.trans_id,
+                                      range.count, range.start);
+
+    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                             NULL, 0, ha, ha_size, false,
+                             ha->hdr.trans_id);
+    if (ret <= 0) {
+        error_report("error %zd when posting hot add msg, expect problems",
+                     ret);
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_REPLY_WAIT);
+}
+
+static void hv_balloon_hot_add_finish(HvBalloon *balloon)
+{
+    HAProtRange *hpr;
+
+    assert(balloon->state == S_HOT_ADD_SKIP_CURRENT ||
+           balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING ||
+           balloon->state == S_HOT_ADD_PROCESSED_NEXT);
+    assert(balloon->ha_todo);
+
+    hpr = balloon->ha_todo->data;
+
+    if (balloon->state == S_HOT_ADD_SKIP_CURRENT) {
+        haprot_range_reset_nounref(hpr);
+        haprot_mark_unused(hpr->haprot);
+    }
+
+    balloon->ha_todo = g_slist_remove(balloon->ha_todo, hpr);
+    if (balloon->state == S_HOT_ADD_PROCESSED_CLEAR_PENDING) {
+        ha_todo_clear(balloon);
+    }
+
+    /* go back to idle between hot adds so other work can happen, too */
+    HV_BALLOON_SET_STATE(balloon, S_IDLE);
+}
+
+static void hv_balloon_balloon_rb_wait(HvBalloon *balloon)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    size_t bl_size = sizeof(struct dm_balloon);
+
+    assert(balloon->state == S_BALLOON_RB_WAIT);
+
+    if (vmbus_channel_reserve(chan, 0, bl_size) < 0) {
+        return;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_BALLOON_POSTING);
+}
+
+static void hv_balloon_balloon_posting(HvBalloon *balloon)
+{
+    VMBusChannel *chan = hv_balloon_get_channel(balloon);
+    struct dm_balloon bl;
+    size_t bl_size = sizeof(bl);
+    ssize_t ret;
+
+    assert(balloon->state == S_BALLOON_POSTING);
+    assert(balloon->target_diff > 0);
+
+    memset(&bl, 0, sizeof(bl));
+    bl.hdr.type = DM_BALLOON_REQUEST;
+    bl.hdr.size = bl_size;
+    bl.hdr.trans_id = balloon->trans_id;
+    bl.num_pages = MIN(balloon->target_diff, HV_BALLOON_HR_CHUNK_PAGES);
+
+    trace_hv_balloon_outgoing_balloon(bl.hdr.trans_id, bl.num_pages,
+                                      balloon->target_diff);
+
+    ret = vmbus_channel_send(chan, VMBUS_PACKET_DATA_INBAND,
+                             NULL, 0, &bl, bl_size, false,
+                             bl.hdr.trans_id);
+    if (ret <= 0) {
+        error_report("error %zd when posting balloon msg, expect problems",
+                     ret);
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_BALLOON_REPLY_WAIT);
+}
+
+static void hv_balloon_idle_state(HvBalloon *balloon)
+{
+    bool can_balloon = balloon->caps.cap_bits.balloon;
+    bool want_unballoon = false;
+    bool want_hot_add = balloon->ha_todo != NULL;
+    bool want_balloon = false;
+    uint64_t ram_size_pages;
+
+    assert(balloon->state == S_IDLE);
+
+    if (can_balloon && balloon->target_changed) {
+        uint64_t total_removed;
+
+        ram_size_pages = hv_balloon_total_ram(balloon);
+        total_removed = hv_balloon_total_removed_rs(balloon,
+                                                    ram_size_pages);
+
+        want_unballoon = total_removed > 0 &&
+            balloon->target > ram_size_pages - total_removed;
+        want_balloon = balloon->target < ram_size_pages - total_removed;
+    }
+
+    /*
+     * the order here is important, first we unballoon, then hot add,
+     * then balloon (or hot remove)
+     */
+    if (want_unballoon) {
+        hv_balloon_balloon_unballoon_start(balloon, ram_size_pages);
+    } else if (want_hot_add) {
+        hv_balloon_hot_add_start(balloon);
+    } else if (want_balloon) {
+        hv_balloon_balloon_unballoon_start(balloon, ram_size_pages);
+    }
+}
+
+static const struct {
+    void (*handler)(HvBalloon *balloon);
+} state_handlers[] = {
+    [S_IDLE].handler = hv_balloon_idle_state,
+    [S_UNBALLOON_RB_WAIT].handler = hv_balloon_unballoon_rb_wait,
+    [S_UNBALLOON_POSTING].handler = hv_balloon_unballoon_posting,
+    [S_HOT_ADD_RB_WAIT].handler = hv_balloon_hot_add_rb_wait,
+    [S_HOT_ADD_POSTING].handler = hv_balloon_hot_add_posting,
+    [S_HOT_ADD_SKIP_CURRENT].handler = hv_balloon_hot_add_finish,
+    [S_HOT_ADD_PROCESSED_CLEAR_PENDING].handler = hv_balloon_hot_add_finish,
+    [S_HOT_ADD_PROCESSED_NEXT].handler = hv_balloon_hot_add_finish,
+    [S_BALLOON_RB_WAIT].handler = hv_balloon_balloon_rb_wait,
+    [S_BALLOON_POSTING].handler = hv_balloon_balloon_posting,
+};
+
+static void hv_balloon_handle_state(HvBalloon *balloon)
+{
+    if (!state_handlers[balloon->state].handler) {
+        return;
+    }
+
+    state_handlers[balloon->state].handler(balloon);
+}
+
+static void hv_balloon_remove_response_insert_range(PageRangeTree tree,
+                                                    const PageRange *range,
+                                                    uint64_t *ctr1,
+                                                    uint64_t *ctr2,
+                                                    uint64_t *ctr3)
+{
+    uint64_t dupcount, effcount;
+
+    if (range->count == 0) {
+        return;
+    }
+
+    dupcount = 0;
+    page_range_tree_insert(tree, range->start, range->count, &dupcount);
+
+    assert(dupcount <= range->count);
+    effcount = range->count - dupcount;
+
+    *ctr1 += effcount;
+    *ctr2 += effcount;
+    if (ctr3) {
+        *ctr3 += effcount;
+    }
+}
+
+static void hv_balloon_remove_response_handle_range(HvBalloon *balloon,
+                                                    PageRange *range,
+                                                    bool both,
+                                                    uint64_t *removedctr)
+{
+    GTreeNode *node;
+    PageRangeTree globaltree = both ? balloon->removed_both :
+        balloon->removed_guest;
+    uint64_t *globalctr = both ? &balloon->removed_both_ctr :
+        &balloon->removed_guest_ctr;
+
+    if (range->count == 0) {
+        return;
+    }
+
+    trace_hv_balloon_remove_response(range->count, range->start, both);
+
+    /* find the first node that can possibly intersect our range */
+    node = g_tree_upper_bound(balloon->haprots.t, &range->start);
+    if (node) {
+        /*
+         * a NULL node below means that the very first node in the tree
+         * already has a higher key (the start of its range).
+         */
+        node = g_tree_node_previous(node);
+    } else {
+        /* a NULL node below means that the tree is empty */
+        node = g_tree_node_last(balloon->haprots.t);
+    }
+    /* node range start <= range start */
+
+    if (!node) {
+        /* node range start > range start */
+        node = g_tree_node_first(balloon->haprots.t);
+    }
+
+    for ( ; node && range->count > 0; node = g_tree_node_next(node)) {
+        HAProtRange *hpr = g_tree_node_value(node);
+        PageRangeTree hprtree;
+        PageRange rangeeff, rangehole, rangecommon;
+        uint64_t hprremoved = 0;
+
+        assert(hpr);
+        hprtree = both ? hpr->removed_both : hpr->removed_guest;
+        haprot_range_get_effective_range(hpr, &rangeeff);
+
+        /*
+         * if this node starts beyond or at the end of the range, so does
+         * every next one
+         */
+        if (rangeeff.start >= range->start + range->count) {
+            break;
+        }
+
+        /* process the hole before the current hpr, if it exists */
+        page_range_part_before(range, rangeeff.start, &rangehole);
+        hv_balloon_remove_response_insert_range(globaltree, &rangehole,
+                                                globalctr, removedctr, NULL);
+        if (rangehole.count > 0) {
+            trace_hv_balloon_remove_response_hole(rangehole.count,
+                                                  rangehole.start,
+                                                  range->count, range->start,
+                                                  rangeeff.start, both);
+        }
+
+        /*
+         * process the hpr part, which can be empty for the very first node
+         * processed or due to the difference between the nominal and
+         * effective hpr start
+         */
+        page_range_intersect(range, rangeeff.start, rangeeff.count,
+                             &rangecommon);
+        hv_balloon_remove_response_insert_range(hprtree, &rangecommon,
+                                                globalctr, removedctr,
+                                                &hprremoved);
+        haprot_range_decrement(hpr, hprremoved);
+        if (rangecommon.count > 0) {
+            trace_hv_balloon_remove_response_common(rangecommon.count,
+                                                    rangecommon.start,
+                                                    range->count, range->start,
+                                                    rangeeff.count,
+                                                    rangeeff.start, hprremoved,
+                                                    both);
+        }
+
+        /* calculate what's left after the current hpr */
+        rangecommon = *range;
+        page_range_part_after(&rangecommon, rangeeff.start, rangeeff.count,
+                              range);
+    }
+
+    /* process the remainder of the range that lies outside of the hpr tree */
+    if (range->count > 0) {
+        hv_balloon_remove_response_insert_range(globaltree, range,
+                                                globalctr, removedctr, NULL);
+        trace_hv_balloon_remove_response_remainder(range->count, range->start,
+                                                   both);
+        range->count = 0;
+    }
+}
+
+static void hv_balloon_remove_response_handle_pages(HvBalloon *balloon,
+                                                    PageRange *range,
+                                                    uint64_t start,
+                                                    uint64_t count,
+                                                    bool both,
+                                                    uint64_t *removedctr)
+{
+    assert(count > 0);
+
+    /*
+     * if there is an existing range that the new range can't be joined to,
+     * dump it into the tree(s)
+     */
+    if (range->count > 0 && !page_range_joinable(range, start, count)) {
+        hv_balloon_remove_response_handle_range(balloon, range, both,
+                                                removedctr);
+    }
+
+    if (range->count == 0) {
+        range->start = start;
+        range->count = count;
+    } else if (page_range_joinable_left(range, start, count)) {
+        range->start = start;
+        range->count += count;
+    } else { /* page_range_joinable_right() */
+        range->count += count;
+    }
+}
+
+static gboolean hv_balloon_handle_remove_host_addr_node(gpointer key,
+                                                        gpointer value,
+                                                        gpointer data)
+{
+    PageRange *range = value;
+    uint64_t pageoff;
+
+    for (pageoff = 0; pageoff < range->count; ) {
+        void *addr = (void *)((range->start + pageoff) * HV_BALLOON_PAGE_SIZE);
+        RAMBlock *rb;
+        ram_addr_t rb_offset;
+        size_t rb_page_size;
+        size_t discard_size;
+
+        rb = qemu_ram_block_from_host(addr, false, &rb_offset);
+        rb_page_size = qemu_ram_pagesize(rb);
+
+        if (rb_page_size != HV_BALLOON_PAGE_SIZE) {
+            /* TODO: these should end in "removed_guest" */
+            warn_report("guest reported removed page backed by unsupported page size %zu",
+                        rb_page_size);
+            pageoff++;
+            continue;
+        }
+
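+        /* discard as many pages as fit within this RAM block, at least one */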
+        discard_size = MIN(range->count - pageoff,
+                           (rb->max_length - rb_offset) /
+                           HV_BALLOON_PAGE_SIZE);
+        discard_size = MAX(discard_size, 1);
+
+        if (ram_block_discard_range(rb, rb_offset, discard_size *
+                                    HV_BALLOON_PAGE_SIZE) != 0) {
+            warn_report("guest reported removed page failed discard");
+        }
+
+        pageoff += discard_size;
+    }
+
+    return false;
+}
+
+static void hv_balloon_handle_remove_host_addr_tree(PageRangeTree tree)
+{
+    g_tree_foreach(tree.t, hv_balloon_handle_remove_host_addr_node, NULL);
+}
+
+static int hv_balloon_handle_remove_section(PageRangeTree tree,
+                                            const MemoryRegionSection *section,
+                                            uint64_t count)
+{
+    void *addr = memory_region_get_ram_ptr(section->mr) +
+        section->offset_within_region;
+    uint64_t addr_page;
+
+    assert(count > 0);
+
+    if ((uintptr_t)addr % HV_BALLOON_PAGE_SIZE) {
+        warn_report("guest reported removed pages at an unaligned host addr %p",
+                    addr);
+        return -EINVAL;
+    }
+
+    addr_page = (uintptr_t)addr / HV_BALLOON_PAGE_SIZE;
+    page_range_tree_insert(tree, addr_page, count, NULL);
+
+    return 0;
+}
+
+static void hv_balloon_handle_remove_ranges(HvBalloon *balloon,
+                                            union dm_mem_page_range ranges[],
+                                            uint32_t count)
+{
+    uint64_t removedcnt;
+    PageRangeTree removed_host_addr;
+    PageRange range_guest, range_both;
+
+    removed_host_addr = page_range_tree_new();
+    range_guest.count = range_both.count = removedcnt = 0;
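+    /*
+     * range_both accumulates pages that can also be discarded on the host,
+     * range_guest the ones only tracked as removed inside the guest
+     */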
+    for (unsigned int ctr = 0; ctr < count; ctr++) {
+        union dm_mem_page_range *mr = &ranges[ctr];
+        hwaddr pa;
+        MemoryRegionSection section;
+
+        for (unsigned int offset = 0; offset < mr->finfo.page_cnt; ) {
+            int ret;
+            uint64_t pageno = mr->finfo.start_page + offset;
+            uint64_t pagecnt = 1;
+
+            pa = (hwaddr)pageno << HV_BALLOON_PFN_SHIFT;
+            section = memory_region_find(get_system_memory(), pa,
+                                         (mr->finfo.page_cnt - offset) *
+                                         HV_BALLOON_PAGE_SIZE);
+            if (!section.mr) {
+                warn_report("guest reported removed page %"PRIu64" not found in RAM",
+                            pageno);
+                ret = -EINVAL;
+                goto finish_page;
+            }
+
+            pagecnt = section.size / HV_BALLOON_PAGE_SIZE;
+            if (pagecnt <= 0) {
+                warn_report("guest reported removed page %"PRIu64" in a section smaller than page size",
+                            pageno);
+                pagecnt = 1; /* skip the whole page */
+                ret = -EINVAL;
+                goto finish_page;
+            }
+
+            if (!memory_region_is_ram(section.mr) ||
+                memory_region_is_rom(section.mr) ||
+                memory_region_is_romd(section.mr)) {
+                warn_report("guest reported removed page %"PRIu64" in a section that is not an ordinary RAM",
+                            pageno);
+                ret = -EINVAL;
+                goto finish_page;
+            }
+
+            ret = hv_balloon_handle_remove_section(removed_host_addr, &section,
+                                                   pagecnt);
+
+        finish_page:
+            if (ret == 0) {
+                hv_balloon_remove_response_handle_pages(balloon,
+                                                        &range_both,
+                                                        pageno, pagecnt,
+                                                        true, &removedcnt);
+            } else {
+                hv_balloon_remove_response_handle_pages(balloon,
+                                                        &range_guest,
+                                                        pageno, pagecnt,
+                                                        false, &removedcnt);
+            }
+
+            if (section.mr) {
+                memory_region_unref(section.mr);
+            }
+
+            offset += pagecnt;
+        }
+    }
+
+    hv_balloon_remove_response_handle_range(balloon, &range_both, true,
+                                            &removedcnt);
+    hv_balloon_remove_response_handle_range(balloon, &range_guest, false,
+                                            &removedcnt);
+
+    hv_balloon_handle_remove_host_addr_tree(removed_host_addr);
+    page_range_tree_destroy(&removed_host_addr);
+
+    if (removedcnt > balloon->target_diff) {
+        warn_report("guest reported more pages removed than currently pending (%"PRIu64" vs %"PRIu64")",
+                    removedcnt, balloon->target_diff);
+        balloon->target_diff = 0;
+    } else {
+        balloon->target_diff -= removedcnt;
+    }
+}
+
+static bool hv_balloon_handle_msg_size(HvBalloonReq *req, size_t minsize,
+                                       const char *msgname)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    uint32_t msglen = vmreq->msglen;
+
+    if (msglen >= minsize) {
+        return true;
+    }
+
+    warn_report("%s message too short (%u vs %zu), ignoring", msgname,
+                (unsigned int)msglen, minsize);
+    return false;
+}
+
+static void hv_balloon_handle_version_request(HvBalloon *balloon,
+                                              HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_version_request *msgVr = vmreq->msg;
+    struct dm_version_response respVr;
+
+    if (balloon->state != S_VERSION) {
+        warn_report("unexpected DM_VERSION_REQUEST in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgVr),
+                                    "DM_VERSION_REQUEST")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_version(msgVr->version.major_version,
+                                      msgVr->version.minor_version);
+
+    memset(&respVr, 0, sizeof(respVr));
+    respVr.hdr.type = DM_VERSION_RESPONSE;
+    respVr.hdr.size = sizeof(respVr);
+    respVr.hdr.trans_id = msgVr->hdr.trans_id;
+    respVr.is_accepted = msgVr->version.version >= DYNMEM_PROTOCOL_VERSION_1 &&
+        msgVr->version.version <= DYNMEM_PROTOCOL_VERSION_3;
+
+    hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respVr);
+
+    if (respVr.is_accepted) {
+        HV_BALLOON_SET_STATE(balloon, S_CAPS);
+    }
+}
+
+static void hv_balloon_handle_caps_report(HvBalloon *balloon,
+                                          HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_capabilities *msgCap = vmreq->msg;
+    struct dm_capabilities_resp_msg respCap;
+
+    if (balloon->state != S_CAPS) {
+        warn_report("unexpected DM_CAPABILITIES_REPORT in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgCap),
+                                    "DM_CAPABILITIES_REPORT")) {
+        return;
+    }
+
+    trace_hv_balloon_incoming_caps(msgCap->caps.caps);
+    balloon->caps = msgCap->caps;
+
+    memset(&respCap, 0, sizeof(respCap));
+    respCap.hdr.type = DM_CAPABILITIES_RESPONSE;
+    respCap.hdr.size = sizeof(respCap);
+    respCap.hdr.trans_id = msgCap->hdr.trans_id;
+    respCap.is_accepted = 1;
+    respCap.hot_remove = 1;
+    respCap.suppress_pressure_reports = !balloon->status_reports;
+    hv_balloon_send_packet(vmreq->chan, (struct dm_message *)&respCap);
+
+    if (balloon->caps.cap_bits.hot_add) {
+        ha_todo_add_all(balloon);
+    }
+
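+    /*
+     * give the guest some time to settle after the capability negotiation
+     * before the state machine goes idle and starts acting on the target
+     */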
+    timer_mod(&balloon->post_init_timer,
+              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
+              HV_BALLOON_POST_INIT_WAIT);
+
+    HV_BALLOON_SET_STATE(balloon, S_POST_INIT_WAIT);
+}
+
+static void hv_balloon_handle_status_report(HvBalloon *balloon,
+                                            HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_status *msgStatus = vmreq->msg;
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgStatus),
+                                    "DM_STATUS_REPORT")) {
+        return;
+    }
+
+    if (!balloon->status_reports) {
+        return;
+    }
+
+    qapi_event_send_hv_balloon_status_report((uint64_t)msgStatus->num_committed *
+                                             HV_BALLOON_PAGE_SIZE,
+                                             (uint64_t)msgStatus->num_avail *
+                                             HV_BALLOON_PAGE_SIZE);
+}
+
+static void hv_balloon_handle_unballoon_response(HvBalloon *balloon,
+                                                 HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_unballoon_response *msgUrR = vmreq->msg;
+
+    if (balloon->state != S_UNBALLOON_REPLY_WAIT) {
+        warn_report("unexpected DM_UNBALLOON_RESPONSE in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgUrR),
+                                    "DM_UNBALLOON_RESPONSE"))
+        return;
+
+    trace_hv_balloon_incoming_unballoon(msgUrR->hdr.trans_id);
+
+    balloon->trans_id++;
+    HV_BALLOON_SET_STATE(balloon, S_IDLE);
+}
+
+static void hv_balloon_handle_hot_add_response(HvBalloon *balloon,
+                                               HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_hot_add_response *msgHaR = vmreq->msg;
+    HAProtRange *hpr;
+
+    if (balloon->state != S_HOT_ADD_REPLY_WAIT) {
+        warn_report("unexpected DM_HOT_ADD_RESPONSE in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgHaR),
+                                    "DM_HOT_ADD_RESPONSE"))
+        return;
+
+    trace_hv_balloon_incoming_hot_add(msgHaR->hdr.trans_id, msgHaR->result,
+                                      msgHaR->page_count);
+
+    balloon->trans_id++;
+
+    assert(balloon->ha_todo);
+    hpr = balloon->ha_todo->data;
+
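+    /* a non-zero result means the guest accepted at least part of the range */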
+    if (msgHaR->result) {
+        if (msgHaR->page_count > balloon->ha_current_count) {
+            warn_report("DM_HOT_ADD_RESPONSE page count higher than requested (%"PRIu32" vs %"PRIu64")",
+                        msgHaR->page_count, balloon->ha_current_count);
+            msgHaR->page_count = balloon->ha_current_count;
+        }
+
+        hpr->used += msgHaR->page_count;
+    }
+
+    if (!msgHaR->result || msgHaR->page_count < balloon->ha_current_count) {
+        if (hpr->used == 0) {
+            /*
+             * apparently the guest didn't like the current range at all,
+             * let's try the next one
+             */
+            HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_SKIP_CURRENT);
+            return;
+        }
+
+        /*
+         * the current planned range was only partially hot-added, take note
+         * of how much of it remains and don't attempt any further hot adds
+         */
+        hpr->unused_tail = hpr->range.count - hpr->unused_head - hpr->used;
+
+        HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_PROCESSED_CLEAR_PENDING);
+        return;
+    }
+
+    /* any pages remaining in this hpr? */
+    if (hpr->range.count - hpr->unused_head - hpr->used -
+        hpr->unused_tail > 0) {
+        HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_RB_WAIT);
+    } else {
+        HV_BALLOON_SET_STATE(balloon, S_HOT_ADD_PROCESSED_NEXT);
+    }
+}
+
+static void hv_balloon_handle_balloon_response(HvBalloon *balloon,
+                                               HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_balloon_response *msgBR = vmreq->msg;
+
+    if (balloon->state != S_BALLOON_REPLY_WAIT) {
+        warn_report("unexpected DM_BALLOON_RESPONSE in %d state",
+                    balloon->state);
+        return;
+    }
+
+    if (!hv_balloon_handle_msg_size(req, sizeof(*msgBR),
+                                    "DM_BALLOON_RESPONSE"))
+        return;
+
+    trace_hv_balloon_incoming_balloon(msgBR->hdr.trans_id, msgBR->range_count,
+                                      msgBR->more_pages);
+
+    if (vmreq->msglen < sizeof(*msgBR) +
+        (uint64_t)sizeof(msgBR->range_array[0]) * msgBR->range_count) {
+        warn_report("DM_BALLOON_RESPONSE too short for the range count");
+        return;
+    }
+
+    if (msgBR->range_count == 0) {
+        /* The guest is already at its minimum size */
+        msgBR->more_pages = 0;
+        balloon->target_diff = 0;
+    } else {
+        hv_balloon_handle_remove_ranges(balloon,
+                                        msgBR->range_array,
+                                        msgBR->range_count);
+    }
+
+    if (!msgBR->more_pages) {
+        balloon->trans_id++;
+
+        if (balloon->target_diff > 0) {
+            HV_BALLOON_SET_STATE(balloon, S_BALLOON_RB_WAIT);
+        } else {
+            HV_BALLOON_SET_STATE(balloon, S_IDLE);
+        }
+    }
+}
+
+static void hv_balloon_handle_packet(HvBalloon *balloon, HvBalloonReq *req)
+{
+    VMBusChanReq *vmreq = &req->vmreq;
+    struct dm_message *msg = vmreq->msg;
+
+    if (vmreq->msglen < sizeof(msg->hdr)) {
+        return;
+    }
+
+    switch (msg->hdr.type) {
+    case DM_VERSION_REQUEST:
+        hv_balloon_handle_version_request(balloon, req);
+        break;
+
+    case DM_CAPABILITIES_REPORT:
+        hv_balloon_handle_caps_report(balloon, req);
+        break;
+
+    case DM_STATUS_REPORT:
+        hv_balloon_handle_status_report(balloon, req);
+        break;
+
+    case DM_MEM_HOT_ADD_RESPONSE:
+        hv_balloon_handle_hot_add_response(balloon, req);
+        break;
+
+    case DM_UNBALLOON_RESPONSE:
+        hv_balloon_handle_unballoon_response(balloon, req);
+        break;
+
+    case DM_BALLOON_RESPONSE:
+        hv_balloon_handle_balloon_response(balloon, req);
+        break;
+
+    default:
+        warn_report("unknown DM message %u", msg->hdr.type);
+        break;
+    }
+}
+
+static bool hv_balloon_recv_channel(HvBalloon *balloon)
+{
+    VMBusChannel *chan;
+    HvBalloonReq *req;
+
+    if (balloon->state == S_WAIT_RESET ||
+        balloon->state == S_CLOSED) {
+        return false;
+    }
+
+    chan = hv_balloon_get_channel(balloon);
+    if (vmbus_channel_recv_start(chan)) {
+        return false;
+    }
+
+    while ((req = vmbus_channel_recv_peek(chan, sizeof(*req)))) {
+        hv_balloon_handle_packet(balloon, req);
+        vmbus_free_req(req);
+        vmbus_channel_recv_pop(chan);
+    }
+
+    return vmbus_channel_recv_done(chan) > 0;
+}
+
+static void hv_balloon_event_loop(HvBalloon *balloon)
+{
+    bool any_recv;
+
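+    /* loop for as long as the state machine or the channel produces work */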
+    do {
+        balloon->state_changed = false;
+        hv_balloon_handle_state(balloon);
+
+        any_recv = hv_balloon_recv_channel(balloon);
+    } while (balloon->state_changed || any_recv);
+}
+
+static uint64_t hv_balloon_haprot_get_align(void *ctx, HAProtDevice *haprot)
+{
+    HvBalloon *balloon = ctx;
+
+    if (hv_balloon_state_is_init(balloon)) {
+        return 0;
+    }
+
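+    /* the guest reports its hot add alignment as a power-of-two MiB count */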
+    return (1 << balloon->caps.cap_bits.hot_add_alignment) * MiB;
+}
+
+static void hv_balloon_haprot_plug_notify(void *ctx, HAProtDevice *haprot,
+                                          Error **errp)
+{
+    HvBalloon *balloon = ctx;
+    PageRange range;
+    HAProtRange *hpr;
+
+    if (hv_balloon_state_is_init(balloon)) {
+        error_setg(errp, "no guest attached to the DM protocol yet");
+        return;
+    }
+
+    if (!balloon->caps.cap_bits.hot_add) {
+        error_setg(errp,
+                   "the current DM protocol guest has no support for memory hot add");
+        return;
+    }
+
+    haprot_get_range(haprot, &range);
+    if (page_range_tree_intree_any(balloon->haprot_disallowed,
+                                   range.start, range.count)) {
+        error_setg(errp,
+                   "some of the device pages used to be a part of the guest. this is not supported yet, please reboot the guest and try again");
+        return;
+    }
+    if (page_range_tree_intree_any(balloon->removed_guest,
+                                   range.start, range.count) ||
+        page_range_tree_intree_any(balloon->removed_both,
+                                   range.start, range.count)) {
+        error_setg(errp,
+                   "some of the device new pages were already returned by the guest. this should not happen, please reboot the guest and try again");
+        return;
+    }
+
+    trace_hv_balloon_haprot_range_add(range.count, range.start);
+
+    hpr = haprot_tree_insert_new(balloon, haprot);
+
+    balloon->ha_todo = g_slist_append(balloon->ha_todo, hpr);
+
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_haprot_range_remove_process(HvBalloon *balloon,
+                                                   HAProtRange *hpr)
+{
+    PageRange rangeeff;
+    uint64_t dupcount;
+    uint64_t removed_guest, removed_both;
+
+    haprot_range_get_effective_range(hpr, &rangeeff);
+    if (rangeeff.count == 0) {
+        /* not strictly necessary but saves a bit of time */
+        return;
+    }
+
+    dupcount = 0;
+    page_range_tree_insert(balloon->haprot_disallowed,
+                           rangeeff.start, rangeeff.count, &dupcount);
+    assert(dupcount == 0);
+
+    removed_guest = 0;
+    page_range_tree_npages(hpr->removed_guest, &removed_guest);
+    removed_both = 0;
+    page_range_tree_npages(hpr->removed_both, &removed_both);
+
+    trace_hv_balloon_haprot_range_remove(rangeeff.count, rangeeff.start,
+                                         removed_guest, removed_both,
+                                         balloon->removed_guest_ctr,
+                                         balloon->removed_both_ctr);
+
+    assert(removed_guest + removed_both == rangeeff.count);
+    assert(balloon->removed_guest_ctr >= removed_guest);
+    assert(balloon->removed_both_ctr >= removed_both);
+
+    balloon->removed_guest_ctr -= removed_guest;
+    balloon->removed_both_ctr -= removed_both;
+}
+
+static void hv_balloon_haprot_unplug_notify(void *ctx, HAProtDevice *haprot)
+{
+    HvBalloon *balloon = ctx;
+    HAProtRange *hpr;
+
+    hpr = haprot_tree_lookup(balloon, haprot);
+    hv_balloon_haprot_range_remove_process(balloon, hpr);
+    haprot_tree_remove(balloon, haprot);
+
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_notify_cb(VMBusChannel *chan)
+{
+    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
+
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_stat(void *opaque, BalloonInfo *info)
+{
+    HvBalloon *balloon = opaque;
+    info->actual = (hv_balloon_total_ram(balloon) - balloon->removed_both_ctr)
+        << HV_BALLOON_PFN_SHIFT;
+}
+
+static void hv_balloon_to_target(void *opaque, ram_addr_t target)
+{
+    HvBalloon *balloon = opaque;
+    uint64_t target_pages = target >> HV_BALLOON_PFN_SHIFT;
+
+    if (!target_pages) {
+        return;
+    }
+
+    /*
+     * always set target_changed, even with unchanged target, as the user
+     * might be asking us to try reaching it again
+     */
+    balloon->target = target_pages;
+    balloon->target_changed = true;
+
+    hv_balloon_event_loop(balloon);
+}
+
+static int hv_balloon_open_channel(VMBusChannel *chan)
+{
+    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
+
+    if (balloon->state != S_CLOSED) {
+        warn_report("guest trying to open a DM channel in invalid %d state",
+                    balloon->state);
+        return -EINVAL;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_VERSION);
+    hv_balloon_event_loop(balloon);
+
+    return 0;
+}
+
+static void hv_balloon_close_channel(VMBusChannel *chan)
+{
+    HvBalloon *balloon = HV_BALLOON(vmbus_channel_device(chan));
+
+    timer_del(&balloon->post_init_timer);
+
+    HV_BALLOON_SET_STATE(balloon, S_WAIT_RESET);
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_post_init_timer(void *opaque)
+{
+    HvBalloon *balloon = opaque;
+
+    if (balloon->state != S_POST_INIT_WAIT) {
+        return;
+    }
+
+    HV_BALLOON_SET_STATE(balloon, S_IDLE);
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_dev_realize(VMBusDevice *vdev, Error **errp)
+{
+    HvBalloon *balloon = HV_BALLOON(vdev);
+    int ret;
+    Error *local_err = NULL;
+
+    balloon->haprots = haprot_tree_new();
+    balloon->state = S_WAIT_RESET;
+
+    ret = qemu_add_balloon_handler(hv_balloon_to_target, hv_balloon_stat,
+                                   balloon);
+    if (ret < 0) {
+        error_setg(errp, "Only one balloon device is supported");
+        goto ret_tree;
+    }
+
+    haprot_register_protocol(hv_balloon_haprot_get_align,
+                             hv_balloon_haprot_plug_notify,
+                             hv_balloon_haprot_unplug_notify,
+                             balloon, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        goto ret_handler;
+    }
+
+    timer_init_ms(&balloon->post_init_timer, QEMU_CLOCK_VIRTUAL,
+                  hv_balloon_post_init_timer, balloon);
+
+    return;
+
+ret_handler:
+    qemu_remove_balloon_handler(balloon);
+
+ret_tree:
+    haprot_tree_destroy(&balloon->haprots);
+}
+
+static void hv_balloon_reset_destroy_common(HvBalloon *balloon)
+{
+    ha_todo_clear(balloon);
+
+    haprot_tree_reset_all(balloon);
+}
+
+static void hv_balloon_dev_reset(VMBusDevice *vdev)
+{
+    HvBalloon *balloon = HV_BALLOON(vdev);
+
+    page_range_tree_destroy(&balloon->haprot_disallowed);
+    page_range_tree_destroy(&balloon->removed_guest);
+    page_range_tree_destroy(&balloon->removed_both);
+    balloon->haprot_disallowed = page_range_tree_new();
+    balloon->removed_guest = page_range_tree_new();
+    balloon->removed_both = page_range_tree_new();
+
+    hv_balloon_reset_destroy_common(balloon);
+
+    balloon->trans_id = 0;
+    balloon->removed_guest_ctr = 0;
+    balloon->removed_both_ctr = 0;
+
+    HV_BALLOON_SET_STATE(balloon, S_CLOSED);
+    hv_balloon_event_loop(balloon);
+}
+
+static void hv_balloon_dev_unrealize(VMBusDevice *vdev)
+{
+    HvBalloon *balloon = HV_BALLOON(vdev);
+
+    hv_balloon_reset_destroy_common(balloon);
+
+    haprot_unregister_protocol(hv_balloon_haprot_plug_notify, NULL);
+    qemu_remove_balloon_handler(balloon);
+
+    page_range_tree_destroy(&balloon->removed_guest);
+    page_range_tree_destroy(&balloon->removed_both);
+    page_range_tree_destroy(&balloon->haprot_disallowed);
+    haprot_tree_destroy(&balloon->haprots);
+}
+
+static Property hv_balloon_properties[] = {
+    DEFINE_PROP_BOOL("status-report", HvBalloon,
+                     status_reports, false),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void hv_balloon_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    VMBusDeviceClass *vdc = VMBUS_DEVICE_CLASS(klass);
+
+    device_class_set_props(dc, hv_balloon_properties);
+    qemu_uuid_parse(HV_BALLOON_GUID, &vdc->classid);
+    set_bit(DEVICE_CATEGORY_MISC, dc->categories);
+    vdc->vmdev_realize = hv_balloon_dev_realize;
+    vdc->vmdev_unrealize = hv_balloon_dev_unrealize;
+    vdc->vmdev_reset = hv_balloon_dev_reset;
+    vdc->open_channel = hv_balloon_open_channel;
+    vdc->close_channel = hv_balloon_close_channel;
+    vdc->chan_notify_cb = hv_balloon_notify_cb;
+}
+
+static const TypeInfo hv_balloon_type_info = {
+    .name = TYPE_HV_BALLOON,
+    .parent = TYPE_VMBUS_DEVICE,
+    .instance_size = sizeof(HvBalloon),
+    .class_init = hv_balloon_class_init,
+};
+
+static void hv_balloon_register_types(void)
+{
+    type_register_static(&hv_balloon_type_info);
+}
+
+type_init(hv_balloon_register_types)
diff --git a/hw/hyperv/meson.build b/hw/hyperv/meson.build
index 1367e2994f25..1c3df34eeb10 100644
--- a/hw/hyperv/meson.build
+++ b/hw/hyperv/meson.build
@@ -1,3 +1,4 @@
 specific_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'))
 specific_ss.add(when: 'CONFIG_HYPERV_TESTDEV', if_true: files('hyperv_testdev.c'))
 specific_ss.add(when: 'CONFIG_VMBUS', if_true: files('vmbus.c'))
+specific_ss.add(when: 'CONFIG_HV_BALLOON', if_true: files('hv-balloon.c'))
diff --git a/hw/hyperv/trace-events b/hw/hyperv/trace-events
index b4c35ca8e377..8da67ded87f9 100644
--- a/hw/hyperv/trace-events
+++ b/hw/hyperv/trace-events
@@ -16,3 +16,20 @@ vmbus_gpadl_torndown(uint32_t gpadl_id) "gpadl #%d"
 vmbus_open_channel(uint32_t chan_id, uint32_t gpadl_id, uint32_t target_vp) "channel #%d gpadl #%d target vp %d"
 vmbus_channel_open(uint32_t chan_id, uint32_t status) "channel #%d status %d"
 vmbus_close_channel(uint32_t chan_id) "channel #%d"
+
+# hv-balloon
+hv_balloon_state_change(const char *tostr) "-> %s"
+hv_balloon_incoming_version(uint16_t major, uint16_t minor) "incoming proto version %u.%u"
+hv_balloon_incoming_caps(uint32_t caps) "incoming caps 0x%x"
+hv_balloon_outgoing_unballoon(uint32_t trans_id, uint64_t count, uint64_t start, uint64_t rempages) "posting unballoon %"PRIu32" for %"PRIu64" @ %"PRIu64", remaining %"PRIu64
+hv_balloon_incoming_unballoon(uint32_t trans_id) "incoming unballoon response %"PRIu32
+hv_balloon_outgoing_hot_add(uint32_t trans_id, uint64_t count, uint64_t start) "posting hot add %"PRIu32" for %"PRIu64" @ %"PRIu64
+hv_balloon_incoming_hot_add(uint32_t trans_id, uint32_t result, uint32_t count) "incoming hot add response %"PRIu32", result %"PRIu32", count %"PRIu32
+hv_balloon_outgoing_balloon(uint32_t trans_id, uint64_t count, uint64_t rempages) "posting balloon %"PRIu32" for %"PRIu64", remaining %"PRIu64
+hv_balloon_incoming_balloon(uint32_t trans_id, uint32_t range_count, uint32_t more_pages) "incoming balloon response %"PRIu32", ranges %"PRIu32", more %"PRIu32
+hv_balloon_haprot_range_add(uint64_t count, uint64_t start) "adding haprot range %"PRIu64" @ %"PRIu64
+hv_balloon_haprot_range_remove(uint64_t count, uint64_t start, uint64_t removed_guest_range, uint64_t removed_both_range, uint64_t removed_guest, uint64_t removed_both) "removing haprot range %"PRIu64" @ %"PRIu64" counts (g %"PRIu64", b %"PRIu64"), global counts (g %"PRIu64", b %"PRIu64")"
+hv_balloon_remove_response(uint64_t count, uint64_t start, unsigned int both) "processing remove response range %"PRIu64" @ %"PRIu64", both %u"
+hv_balloon_remove_response_hole(uint64_t counthole, uint64_t starthole, uint64_t countrange, uint64_t startrange, uint64_t starthpr, unsigned int both) "response range hole %"PRIu64" @ %"PRIu64" from range %"PRIu64" @ %"PRIu64", before hpr start %"PRIu64", both %u"
+hv_balloon_remove_response_common(uint64_t countcommon, uint64_t startcommon, uint64_t countrange, uint64_t startrange, uint64_t counthpr, uint64_t starthpr, uint64_t removed, unsigned int both) "response common range %"PRIu64" @ %"PRIu64" from range %"PRIu64" @ %"PRIu64" with hpr %"PRIu64" @ %"PRIu64", removed %"PRIu64", both %u"
+hv_balloon_remove_response_remainder(uint64_t count, uint64_t start, unsigned int both) "remove response remaining range %"PRIu64" @ %"PRIu64", both %u"
diff --git a/meson.build b/meson.build
index f4d1ab109680..4f5a50a7a6a9 100644
--- a/meson.build
+++ b/meson.build
@@ -525,6 +525,7 @@ kconfig_external_symbols = [
   'CONFIG_VIRTFS',
   'CONFIG_LINUX',
   'CONFIG_PVRDMA',
+  'CONFIG_HV_BALLOON_POSSIBLE',
 ]
 ignored = ['TARGET_XML_FILES', 'TARGET_ABI_DIR', 'TARGET_DIRS']
 
@@ -1525,6 +1526,7 @@ endif
 summary_info += {'thread sanitizer':  config_host.has_key('CONFIG_TSAN')}
 summary_info += {'rng-none':          config_host.has_key('CONFIG_RNG_NONE')}
 summary_info += {'Linux keyring':     config_host.has_key('CONFIG_SECRET_KEYRING')}
+summary_info += {'hv-balloon support': config_host.has_key('CONFIG_HV_BALLOON_POSSIBLE')}
 summary(summary_info, bool_yn: true)
 
 if not supported_cpus.contains(cpu)
diff --git a/qapi/misc.json b/qapi/misc.json
index 8cf6ebe67cba..0e5ad3d3dffe 100644
--- a/qapi/misc.json
+++ b/qapi/misc.json
@@ -276,6 +276,80 @@
 { 'event': 'BALLOON_CHANGE',
   'data': { 'actual': 'int' } }
 
+##
+# @HV_BALLOON_STATUS_REPORT:
+#
+# Emitted when the hv-balloon driver receives a "STATUS" message from
+# the guest.
+#
+# @committed: the amount of memory in use inside the guest plus the amount
+#             of memory unusable inside the guest (ballooned out,
+#             offline, etc.)
+#
+# @available: the amount of memory inside the guest available for new
+#             allocations ("free")
+#
+# Since: TBD
+#
+# Example:
+#
+# <- { "event": "HV_BALLOON_STATUS_REPORT",
+#      "data": { "committed": 816640000, "available": 3333054464 },
+#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
+#
+##
+{ 'event': 'HV_BALLOON_STATUS_REPORT',
+  'data': { 'committed': 'size', 'available': 'size' } }
+
+##
+# @HV_BALLOON_HAPROT_UNUSED:
+#
+# Emitted when the hv-balloon driver marks a device for memory hot-add
+# protocols (haprot) as unused, so it can now be removed if required.
+#
+# This can happen because the guest returned all the memory contained
+# in it via ballooning or the VM was restarted.
+#
+# @id: the haprot device id
+#
+# Since: TBD
+#
+# Example:
+#
+# <- { "event": "HV_BALLOON_HAPROT_UNUSED",
+#      "data": { "id": "ha1" },
+#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
+#
+##
+{ 'event': 'HV_BALLOON_HAPROT_UNUSED',
+  'data': { 'id': 'str' } }
+
+##
+# @HV_BALLOON_HAPROT_INUSE:
+#
+# Emitted when the hv-balloon driver marks a device for memory hot-add
+# protocols (haprot) as in use once again, so it can no longer be removed.
+#
+# This can happen because the guest was unballooned using its memory range
+# or the memory range was reinserted into the guest after a VM restart.
+#
+# It is NOT emitted when a new haprot device is successfully added,
+# although such a device starts in the "in use" state.
+#
+# @id: the haprot device id
+#
+# Since: TBD
+#
+# Example:
+#
+# <- { "event": "HV_BALLOON_HAPROT_INUSE",
+#      "data": { "id": "ha1" },
+#      "timestamp": { "seconds": 1600295492, "microseconds": 661044 } }
+#
+##
+{ 'event': 'HV_BALLOON_HAPROT_INUSE',
+  'data': { 'id': 'str' } }
+
 ##
 # @PciMemoryRange:
 #


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
                   ` (2 preceding siblings ...)
  2020-09-20 13:25 ` [PATCH 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
@ 2020-09-20 14:16 ` no-reply
  2020-09-21  9:00 ` Igor Mammedov
  2020-09-21  9:10 ` David Hildenbrand
  5 siblings, 0 replies; 14+ messages in thread
From: no-reply @ 2020-09-20 14:16 UTC (permalink / raw)
  To: mail
  Cc: ehabkost, mst, qemu-devel, armbru, imammedo, pbonzini, vkuznets,
	boris.ostrovsky, rth

Patchew URL: https://patchew.org/QEMU/cover.1600556526.git.maciej.szmigiero@oracle.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: cover.1600556526.git.maciej.szmigiero@oracle.com
Subject: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
180104d Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
cb9f9d7 Add Hyper-V Dynamic Memory Protocol definitions
3f22e8f haprot: add a device for memory hot-add protocols

=== OUTPUT BEGIN ===
1/3 Checking commit 3f22e8fb4473 (haprot: add a device for memory hot-add protocols)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#109: 
new file mode 100644

WARNING: line over 80 characters
#212: FILE: hw/mem/haprot.c:99:
+                   "Node property value %"PRIu32" exceeds the number of numa nodes %d",

total: 0 errors, 2 warnings, 404 lines checked

Patch 1/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
2/3 Checking commit cb9f9d7b2fb4 (Add Hyper-V Dynamic Memory Protocol definitions)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#15: 
new file mode 100644

WARNING: line over 80 characters
#52: FILE: include/hw/hyperv/dynmem-proto.h:33:
+#define DYNMEM_MAKE_VERSION(Major, Minor) ((uint32_t)(((Major) << 16) | (Minor)))

ERROR: space prohibited after that '&' (ctx:WxW)
#54: FILE: include/hw/hyperv/dynmem-proto.h:35:
+#define DYNMEM_MINOR_VERSION(Version) ((uint32_t)(Version) & 0xff)
                                                            ^

total: 1 errors, 2 warnings, 425 lines checked

Patch 2/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

3/3 Checking commit 180104d55786 (Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon))
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#188: 
new file mode 100644

WARNING: line over 80 characters
#248: FILE: hw/hyperv/hv-balloon.c:56:
+#define HV_BALLOON_HA_CHUNK_PAGES (HV_BALLOON_HA_CHUNK_SIZE / HV_BALLOON_PAGE_SIZE)

WARNING: line over 80 characters
#1236: FILE: hw/hyperv/hv-balloon.c:1044:
+                                  (1 << balloon->caps.cap_bits.hot_add_alignment)

WARNING: line over 80 characters
#1618: FILE: hw/hyperv/hv-balloon.c:1426:
+            warn_report("guest reported removed page backed by unsupported page size %zu",

WARNING: line over 80 characters
#1692: FILE: hw/hyperv/hv-balloon.c:1500:
+                warn_report("guest reported removed page %"PRIu64" not found in RAM",

ERROR: line over 90 characters
#1700: FILE: hw/hyperv/hv-balloon.c:1508:
+                warn_report("guest reported removed page %"PRIu64" in a section smaller than page size",

ERROR: line over 90 characters
#1710: FILE: hw/hyperv/hv-balloon.c:1518:
+                warn_report("guest reported removed page %"PRIu64" in a section that is not an ordinary RAM",

ERROR: line over 90 characters
#1749: FILE: hw/hyperv/hv-balloon.c:1557:
+        warn_report("guest reported more pages removed than currently pending (%"PRIu64" vs %"PRIu64")",

WARNING: line over 80 characters
#1863: FILE: hw/hyperv/hv-balloon.c:1671:
+    qapi_event_send_hv_balloon_status_report((uint64_t)msgStatus->num_committed *

ERROR: line over 90 characters
#1918: FILE: hw/hyperv/hv-balloon.c:1726:
+            warn_report("DM_HOT_ADD_RESPONSE page count higher than requested (%"PRIu32" vs %"PRIu64")",

total: 4 errors, 6 warnings, 2369 lines checked

Patch 3/3 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/cover.1600556526.git.maciej.szmigiero@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
                   ` (3 preceding siblings ...)
  2020-09-20 14:16 ` [PATCH 0/3] " no-reply
@ 2020-09-21  9:00 ` Igor Mammedov
  2020-09-21  9:29   ` David Hildenbrand
  2020-09-21  9:10 ` David Hildenbrand
  5 siblings, 1 reply; 14+ messages in thread
From: Igor Mammedov @ 2020-09-21  9:00 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, David Hildenbrand, Paolo Bonzini,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

On Sun, 20 Sep 2020 15:25:19 +0200
"Maciej S. Szmigiero" <mail@maciej.szmigiero.name> wrote:

> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

From description it sounds like an alternative of virtio-mem,
CCing David.

> This series adds a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
> and its protocol definitions.
> Also included is a driver providing backing devices for memory hot-add
> protocols ("haprots").
> 
> A haprot device works like a virtual DIMM stick: it allows inserting
> extra RAM into the guest at run time.
> 
> The main differences from the ACPI-based PC DIMM hotplug are:
> * Notifying the guest about the new memory range is not done via ACPI but
> via a protocol handler that registers with the haprot framework.
> This means that the ACPI DIMM slot limit does not apply.
> 
> * A protocol handler can prevent removal of a haprot device when it is
> still in use by setting its "busy" field.
> 
> * A protocol handler can also register an "unplug" callback so it gets
> notified when an user decides to remove the haprot device.
> This way the protocol handler can inform the guest about this fact and / or
> do its own cleanup.
> 
> The hv-balloon driver is like virtio-balloon on steroids: it allows both
> changing the guest memory allocation via ballooning and inserting extra
> RAM into it by adding haprot virtual DIMM sticks.
> One of advantages of these over ACPI-based PC DIMM hotplug is that such
> memory can be hotplugged in much smaller granularity because the ACPI DIMM
> slot limit does not apply.
> 
> In contrast with ACPI DIMM hotplug where one can only request to unplug a
> whole DIMM stick this driver allows removing memory from guest in single
> page (4k) units via ballooning.
> Then, once the guest has released the whole memory backed by a haprot
> virtual DIMM stick such device is marked "unused" and can be removed from
> the VM, if one wants so.
> A "HV_BALLOON_HAPROT_UNUSED" QMP event is emitted in this case so the
> software controlling QEMU knows that this operation is now possible.
> 
> The haprot devices are also marked unused after a VM reboot (with a
> corresponding "HV_BALLOON_HAPROT_UNUSED" QMP event).
> They are automatically reinserted (if still present) after the guest
> reconnects to this protocol (a "HV_BALLOON_HAPROT_INUSE" QMP event is then
> emitted).
> 
> For performance reasons, the guest-released memory is tracked in few range
> trees, as a series of (start, count) ranges.
> Each time a new page range is inserted into such tree its neighbors are
> checked as candidates for possible merging with it.
> 
> Besides performance reasons, the Dynamic Memory protocol itself uses page
> ranges as the data structure in its messages, so relevant pages need to be
> merged into such ranges anyway.
> 
> One has to be careful when tracking the guest-released pages, since the
> guest can maliciously report returning pages outside its current address
> space, which later clash with the address range of newly added memory.
> Similarly, the guest can report freeing the same page twice.
> 
> The above design results in much better ballooning performance than when
> using virtio-balloon with the same guest: 230 GB / minute with this driver
> versus 70 GB / minute with virtio-balloon.
> 
> During a ballooning operation most of time is spent waiting for the guest
> to come up with newly freed page ranges, processing the received ranges on
> the host side (in QEMU / KVM) is nearly instantaneous.
> 
> The unballoon operation is also pretty much instantaneous:
> thanks to the merging of the ballooned out page ranges 200 GB of memory can
> be returned to the guest in about 1 second.
> With virtio-balloon this operation takes about 2.5 minutes.
> 
> These tests were done against a Windows Server 2019 guest running on a
> Xeon E5-2699, after dirtying the whole memory inside guest before each
> balloon operation.
> 
> Using a range tree instead of a bitmap to track the removed memory also
> means that the solution scales well with the guest size: even a 1 TB range
> takes just few bytes of memory.
> 
> The required GTree operations are available at
> https://gitlab.gnome.org/maciejsszmigiero/glib/-/tree/gtree-add-iterators
> Since they are not yet in the upstream Glib a check for them was added to
> "configure" script, together with new "--enable-hv-balloon" and
> "--disable-hv-balloon" arguments.
> If these GTree operations are missing in the system Glib this driver will
> be skipped during QEMU build.
> 
> An optional "status-report=on" device parameter requests memory status
> events from the guest (typically sent every second), which allow the host
> to learn both the guest memory available and the guest memory in use
> counts.
> They are emitted externally as "HV_BALLOON_STATUS_REPORT" QMP events.
> 
> The driver is named hv-balloon since the Linux kernel client driver for
> the Dynamic Memory Protocol is named as such and to follow the naming
> pattern established by the virtio-balloon driver.
> The whole protocol runs over Hyper-V VMBus that has its implementation
> recently merged in.
> 
> The driver was tested against Windows Server 2012 R2, Windows Server 2016
> and Windows Server 2016 guests and obeys the guest alignment requirements
> reported to the host via DM_CAPABILITIES_REPORT message.
> Extensive event tracing is available under 'hv_balloon_*' prefix.
> 
> Example usage:
> * Add "-device vmbus-bridge,id=vmbus-bridge -device hv-balloon,id=hvb"
>   to the QEMU command line and set "maxmem" value to something large,
>   like 1T.
> 
> * Use QEMU monitor commands to add a haprot virtual DIMM stick, together
>   with its memory backend:
>   object_add memory-backend-ram,id=mem1,size=200G
>   device_add mem-haprot,id=ha1,memdev=mem1
>   The first command is actually the same as for ACPI-based DIMM hotplug.
> 
> * Use the ballooning interface monitor commands to force the guest to give
>   out as much memory as possible:
>   balloon 1
>   The ballooning interface monitor commands can also be used to resize
>   the guest up and down appropriately.
> 
> * One can check the current guest size by issuing a "info balloon" command.
>   This is useful to know what is happening, since large ballooning or
>   unballooning operations take some time to complete.
> 
> * Once the guest releases the whole memory backed by a haprot device
>   (or is restarted) a "HV_BALLOON_HAPROT_UNUSED" QMP event will be
>   generated.
>   The haprot device then can be removed, together with its memory backend:
>   device_del ha1
>   object_del mem1
> 
> Future directions:
> * Allow sharing the ballooning QEMU interface between hv-balloon and
>   virtio-balloon drivers.
>   Currently, only one of them can be added to the VM at the same time.
> 
> * Allow new haport devices to reuse the same address range as the ones
>   that were previously deleted via device_del monitor command without
>   having to restart the VM.
> 
> * Add vmstate / live migration support to the hv-balloon driver.
> 
> * Use haprot device to also add memory via virtio interface (this requires
>   defining a new operation in virtio-balloon protocol and appropriate
>   support from the client virtio-balloon driver in the Linux kernel).
> 
>  Kconfig.host                     |    3 +
>  configure                        |   35 +
>  hw/hyperv/Kconfig                |    5 +
>  hw/hyperv/hv-balloon.c           | 2172 ++++++++++++++++++++++++++++++
>  hw/hyperv/meson.build            |    1 +
>  hw/hyperv/trace-events           |   17 +
>  hw/i386/Kconfig                  |    2 +
>  hw/i386/pc.c                     |   18 +-
>  hw/mem/Kconfig                   |    4 +
>  hw/mem/haprot.c                  |  263 ++++
>  hw/mem/meson.build               |    1 +
>  include/hw/hyperv/dynmem-proto.h |  425 ++++++
>  include/hw/mem/haprot.h          |   72 +
>  meson.build                      |    2 +
>  qapi/misc.json                   |   74 +
>  15 files changed, 3093 insertions(+), 1 deletion(-)
>  create mode 100644 hw/hyperv/hv-balloon.c
>  create mode 100644 hw/mem/haprot.c
>  create mode 100644 include/hw/hyperv/dynmem-proto.h
>  create mode 100644 include/hw/mem/haprot.h
> 
> 



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
                   ` (4 preceding siblings ...)
  2020-09-21  9:00 ` Igor Mammedov
@ 2020-09-21  9:10 ` David Hildenbrand
  2020-09-21 22:22   ` Maciej S. Szmigiero
  5 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2020-09-21  9:10 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Paolo Bonzini, Richard Henderson, Eduardo Habkost
  Cc: Michael S. Tsirkin, qemu-devel, Markus Armbruster, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky

On 20.09.20 15:25, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> This series adds a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
> and its protocol definitions.
> Also included is a driver providing backing devices for memory hot-add
> protocols ("haprots").
> 
> A haprot device works like a virtual DIMM stick: it allows inserting
> extra RAM into the guest at run time.
> 
> The main differences from the ACPI-based PC DIMM hotplug are:
> * Notifying the guest about the new memory range is not done via ACPI but
> via a protocol handler that registers with the haprot framework.
> This means that the ACPI DIMM slot limit does not apply.
> 
> * A protocol handler can prevent removal of a haprot device when it is
> still in use by setting its "busy" field.
> 
> * A protocol handler can also register an "unplug" callback so it gets
> notified when an user decides to remove the haprot device.
> This way the protocol handler can inform the guest about this fact and / or
> do its own cleanup.
> 
> The hv-balloon driver is like virtio-balloon on steroids: it allows both
> changing the guest memory allocation via ballooning and inserting extra
> RAM into it by adding haprot virtual DIMM sticks.
> One of advantages of these over ACPI-based PC DIMM hotplug is that such
> memory can be hotplugged in much smaller granularity because the ACPI DIMM
> slot limit does not apply.

Reading further below, it's essentially DIMM-based memory hotplug +
virtio-balloon - except the 256MB DIMM limit. But reading below, I don't
see how you want to avoid the KVM memory slot limit that's in a similar
size (I recall 256*2 due to 2 address spaces). Or avoid VMA limits when
wanting to grow a VM big in very tiny steps over time (e.g., adding 64MB
at a time).

> 
> In contrast with ACPI DIMM hotplug where one can only request to unplug a
> whole DIMM stick this driver allows removing memory from guest in single
> page (4k) units via ballooning.
> Then, once the guest has released the whole memory backed by a haprot
> virtual DIMM stick such device is marked "unused" and can be removed from
> the VM, if one wants so.
> A "HV_BALLOON_HAPROT_UNUSED" QMP event is emitted in this case so the
> software controlling QEMU knows that this operation is now possible.
> 
> The haprot devices are also marked unused after a VM reboot (with a
> corresponding "HV_BALLOON_HAPROT_UNUSED" QMP event).
> They are automatically reinserted (if still present) after the guest
> reconnects to this protocol (a "HV_BALLOON_HAPROT_INUSE" QMP event is then
> emitted).
> 
> For performance reasons, the guest-released memory is tracked in few range
> trees, as a series of (start, count) ranges.
> Each time a new page range is inserted into such tree its neighbors are
> checked as candidates for possible merging with it.
> 
> Besides performance reasons, the Dynamic Memory protocol itself uses page
> ranges as the data structure in its messages, so relevant pages need to be
> merged into such ranges anyway.
> 
> One has to be careful when tracking the guest-released pages, since the
> guest can maliciously report returning pages outside its current address
> space, which later clash with the address range of newly added memory.
> Similarly, the guest can report freeing the same page twice.
> 
> The above design results in much better ballooning performance than when
> using virtio-balloon with the same guest: 230 GB / minute with this driver
> versus 70 GB / minute with virtio-balloon.

I assume these numbers apply with Windows guests only. IIRC Linux
hv_balloon does not support page migration/compaction, while
virtio-balloon does. So you might end up with quite some fragmented
memory with hv_balloon in Linux guests - of course, usually only in
corner cases.

> 
> During a ballooning operation most of time is spent waiting for the guest
> to come up with newly freed page ranges, processing the received ranges on
> the host side (in QEMU / KVM) is nearly instantaneous.
> 
> The unballoon operation is also pretty much instantaneous:
> thanks to the merging of the ballooned out page ranges 200 GB of memory can
> be returned to the guest in about 1 second.
> With virtio-balloon this operation takes about 2.5 minutes.
> 
> These tests were done against a Windows Server 2019 guest running on a
> Xeon E5-2699, after dirtying the whole memory inside guest before each
> balloon operation.
> 
> Using a range tree instead of a bitmap to track the removed memory also
> means that the solution scales well with the guest size: even a 1 TB range
> takes just few bytes of memory.
> Example usage:
> * Add "-device vmbus-bridge,id=vmbus-bridge -device hv-balloon,id=hvb"
>   to the QEMU command line and set "maxmem" value to something large,
>   like 1T.
> 
> * Use QEMU monitor commands to add a haprot virtual DIMM stick, together
>   with its memory backend:
>   object_add memory-backend-ram,id=mem1,size=200G
>   device_add mem-haprot,id=ha1,memdev=mem1
>   The first command is actually the same as for ACPI-based DIMM hotplug.
> 
> * Use the ballooning interface monitor commands to force the guest to give
>   out as much memory as possible:
>   balloon 1

At least under virtio-balloon with Linux, that will pretty sure trigger
a guest crash. Is something like that expected to work with Windows
guests reasonably well?


>   The ballooning interface monitor commands can also be used to resize
>   the guest up and down appropriately.
> 
> * One can check the current guest size by issuing a "info balloon" command.
>   This is useful to know what is happening, since large ballooning or
>   unballooning operations take some time to complete.

So, every time you want to add more memory (after the balloon was
deflated) to a guest, you have to plug a new mem-haprot device, correct?

So your QEMU user has to be well aware of how to balance "balloon" and
"object_add/device_add/object_del/device_del" commands to achieve the
desired guest size.

> 
> * Once the guest releases the whole memory backed by a haprot device
>   (or is restarted) a "HV_BALLOON_HAPROT_UNUSED" QMP event will be
>   generated.
>   The haprot device then can be removed, together with its memory backend:
>   device_del ha1
>   object_del mem1

So, you rely on some external entity to properly shrink a guest again
(e.g., during reboot).

> 
> Future directions:
> * Allow sharing the ballooning QEMU interface between hv-balloon and
>   virtio-balloon drivers.
>   Currently, only one of them can be added to the VM at the same time.

Yeah, that makes sense. Only one at a time.

> 
> * Allow new haport devices to reuse the same address range as the ones
>   that were previously deleted via device_del monitor command without
>   having to restart the VM.
> 
> * Add vmstate / live migration support to the hv-balloon driver.
> 
> * Use haprot device to also add memory via virtio interface (this requires
>   defining a new operation in virtio-balloon protocol and appropriate
>   support from the client virtio-balloon driver in the Linux kernel).

Most probably not the direction we are going to take. We have virtio-mem
for clean, fine-grained, NUMA-aware, paravirtualized memory hot(un)plug
now, and we are well aware of various issues with (base-page size based)
memory ballooning that are fairly impossible to solve (especially in the
context of vfio).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-21  9:00 ` Igor Mammedov
@ 2020-09-21  9:29   ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2020-09-21  9:29 UTC (permalink / raw)
  To: Igor Mammedov, Maciej S. Szmigiero
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, David Hildenbrand, Paolo Bonzini,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

On 21.09.20 11:00, Igor Mammedov wrote:
> On Sun, 20 Sep 2020 15:25:19 +0200
> "Maciej S. Szmigiero" <mail@maciej.szmigiero.name> wrote:
> 
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> From description it sounds like an alternative of virtio-mem,
> CCing David.

Hah! Was already replying when your cc came in (thanks anyway!).

Not quite an alternative to virtio-mem, more like a (in some factors)
optimized alternative to DIMM-based memory hotplug + virtio-balloon.

The core design of virtio-mem avoids basic issues related to memory
ballooning. (it was the result of analyzing existing technologies and
their shortcomings - including HV dynamic memory).

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-21  9:10 ` David Hildenbrand
@ 2020-09-21 22:22   ` Maciej S. Szmigiero
  2020-09-22  7:26     ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-21 22:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, Igor Mammedov, Paolo Bonzini,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

Hi David,

Thank you for your comments.

First, I want to underline that this driver targets Windows guests,
where the ability to modify and adapt the guest memory management
code is extremely limited.

While it does work with Linux guests, too, this is definitely not its
native environment.

It also has to support rather big guests, up to 1 TB of RAM, so
performance-related things are important.

Further answers are below.

On 21.09.2020 11:10, David Hildenbrand wrote:
> On 20.09.20 15:25, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> This series adds a Hyper-V Dynamic Memory Protocol driver (hv-balloon)
>> and its protocol definitions.
>> Also included is a driver providing backing devices for memory hot-add
>> protocols ("haprots").
>>
>> A haprot device works like a virtual DIMM stick: it allows inserting
>> extra RAM into the guest at run time.
>>
>> The main differences from the ACPI-based PC DIMM hotplug are:
>> * Notifying the guest about the new memory range is not done via ACPI but
>> via a protocol handler that registers with the haprot framework.
>> This means that the ACPI DIMM slot limit does not apply.
>>
>> * A protocol handler can prevent removal of a haprot device when it is
>> still in use by setting its "busy" field.
>>
>> * A protocol handler can also register an "unplug" callback so it gets
>> notified when an user decides to remove the haprot device.
>> This way the protocol handler can inform the guest about this fact and / or
>> do its own cleanup.
>>
>> The hv-balloon driver is like virtio-balloon on steroids: it allows both
>> changing the guest memory allocation via ballooning and inserting extra
>> RAM into it by adding haprot virtual DIMM sticks.
>> One of advantages of these over ACPI-based PC DIMM hotplug is that such
>> memory can be hotplugged in much smaller granularity because the ACPI DIMM
>> slot limit does not apply.
> 
> Reading further below, it's essentially DIMM-based memory hotplug +
> virtio-balloon - except the 256MB DIMM limit. But reading below, I don't
> see how you want to avoid the KVM memory slot limit that's in a similar
> size (I recall 256*2 due to 2 address spaces). 

The idea is to use virtual DIMM sticks for hot-adding extra memory at
runtime, while using ballooning for runtime adjustment of the guest
memory size within the current maximum.

When the guest is rebooted, the virtual DIMM configuration is adjusted
by the software controlling QEMU (some are removed and / or some are
added) to give the guest the same effective memory size as it had before
the reboot.

So, yes, it will be a problem if the user expands their running guest
~256 times, each time making it even bigger than previously, without
rebooting it even once, but this does seem to be an edge use case.

In the future it would be better to automatically turn the current
effective guest size into its boot memory size when the VM restarts
(the VM will then have no virtual DIMMs inserted after a reboot), but
doing this requires quite a few changes to QEMU, which is why it isn't
there yet.

The above is basically how the Hyper-V hypervisor handles its memory size
changes and it seems to be as close to having a transparently resizable
guest as reasonably possible.


> Or avoid VMA limits when wanting to grow a VM big in very tiny steps over
> time (e.g., adding 64MB at a time).

Not sure if you are talking about VMA limits inside the host or the guest.

>>
>> In contrast with ACPI DIMM hotplug where one can only request to unplug a
>> whole DIMM stick this driver allows removing memory from guest in single
>> page (4k) units via ballooning.
>> Then, once the guest has released the whole memory backed by a haprot
>> virtual DIMM stick such device is marked "unused" and can be removed from
>> the VM, if one wants so.
>> A "HV_BALLOON_HAPROT_UNUSED" QMP event is emitted in this case so the
>> software controlling QEMU knows that this operation is now possible.
>>
>> The haprot devices are also marked unused after a VM reboot (with a
>> corresponding "HV_BALLOON_HAPROT_UNUSED" QMP event).
>> They are automatically reinserted (if still present) after the guest
>> reconnects to this protocol (a "HV_BALLOON_HAPROT_INUSE" QMP event is then
>> emitted).
>>
>> For performance reasons, the guest-released memory is tracked in few range
>> trees, as a series of (start, count) ranges.
>> Each time a new page range is inserted into such tree its neighbors are
>> checked as candidates for possible merging with it.
>>
>> Besides performance reasons, the Dynamic Memory protocol itself uses page
>> ranges as the data structure in its messages, so relevant pages need to be
>> merged into such ranges anyway.
>>
>> One has to be careful when tracking the guest-released pages, since the
>> guest can maliciously report returning pages outside its current address
>> space, which later clash with the address range of newly added memory.
>> Similarly, the guest can report freeing the same page twice.
>>
>> The above design results in much better ballooning performance than when
>> using virtio-balloon with the same guest: 230 GB / minute with this driver
>> versus 70 GB / minute with virtio-balloon.
> 
> I assume these numbers apply with Windows guests only. IIRC Linux
> hv_balloon does not support page migration/compaction, while
> virtio-balloon does. So you might end up with quite some fragmented
> memory with hv_balloon in Linux guests - of course, usually only in
> corner cases.

As I previously mentioned, this driver targets mainly Windows guests.

And Windows seems to be rather determined to free the requested number
of pages: waiting for the guest to reply to a 2GB balloon request
sometimes takes 2-3 seconds.
So I guess it does some kind of memory compaction during that request
processing time.

>>
>> During a ballooning operation most of time is spent waiting for the guest
>> to come up with newly freed page ranges, processing the received ranges on
>> the host side (in QEMU / KVM) is nearly instantaneous.
>>
>> The unballoon operation is also pretty much instantaneous:
>> thanks to the merging of the ballooned out page ranges 200 GB of memory can
>> be returned to the guest in about 1 second.
>> With virtio-balloon this operation takes about 2.5 minutes.
>>
>> These tests were done against a Windows Server 2019 guest running on a
>> Xeon E5-2699, after dirtying the whole memory inside guest before each
>> balloon operation.
>>
>> Using a range tree instead of a bitmap to track the removed memory also
>> means that the solution scales well with the guest size: even a 1 TB range
>> takes just few bytes of memory.
>> Example usage:
>> * Add "-device vmbus-bridge,id=vmbus-bridge -device hv-balloon,id=hvb"
>>   to the QEMU command line and set "maxmem" value to something large,
>>   like 1T.
>>
>> * Use QEMU monitor commands to add a haprot virtual DIMM stick, together
>>   with its memory backend:
>>   object_add memory-backend-ram,id=mem1,size=200G
>>   device_add mem-haprot,id=ha1,memdev=mem1
>>   The first command is actually the same as for ACPI-based DIMM hotplug.
>>
>> * Use the ballooning interface monitor commands to force the guest to give
>>   out as much memory as possible:
>>   balloon 1
> 
> At least under virtio-balloon with Linux, that will pretty sure trigger
> a guest crash. Is something like that expected to work with Windows
> guests reasonably well?

Windows will generally leave some memory free when processing balloon
requests, although the precise amount varies from a few hundred MB to
values like 1+ GB.

Usually it runs stably even with these few hundred MBs of free memory
remaining, but I have seen occasional crashes at shutdown time in this
case (probably something critical failing to initialize due to the
system running out of memory).

While the above command was just a quick example, I personally think
it is the guest who should be enforcing a balloon floor since it is
the guest that knows its internal memory requirements, not the host.

For this reason the hv_balloon client driver inside the Linux kernel
implements its own, rough balloon floor - see compute_balloon_floor().

On the other hand, one can also argue that the user's wish should be
respected as much as possible.
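
To illustrate the idea of such a guest-enforced floor, here is a rough
sketch only - the real piecewise function is compute_balloon_floor() in
drivers/hv/hv_balloon.c and uses different thresholds; the numbers below
are made up just for the example:

#include <stdint.h>

#define MB_TO_PAGES(mb) ((uint64_t)(mb) << (20 - 12)) /* MB -> 4k pages */

/* pages that should never be ballooned out, as a function of guest RAM */
static uint64_t balloon_floor_pages(uint64_t total_pages)
{
    if (total_pages < MB_TO_PAGES(2048)) {
        /* smaller guests: fixed base plus 1/8 of RAM */
        return MB_TO_PAGES(128) + total_pages / 8;
    }
    /* larger guests: bigger base, smaller fraction */
    return MB_TO_PAGES(512) + total_pages / 16;
}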

>>   The ballooning interface monitor commands can also be used to resize
>>   the guest up and down appropriately.
>>
>> * One can check the current guest size by issuing a "info balloon" command.
>>   This is useful to know what is happening, since large ballooning or
>>   unballooning operations take some time to complete.
> 
> So, every time you want to add more memory (after the balloon was
> deflated) to a guest, you have to plug a new mem-haprot device, correct?

Yes.

> So your QEMU user has to be well aware of how to balance "balloon" and
> "object_add/device_add/object_del_device_del" commands to achieve the
> desired guest size.

In this case the VM user does not interact directly with the QEMU process.

Rather, the user tells the software controlling QEMU (think: libvirt)
how large they want the guest to be, and this software then does
everything that is necessary to achieve that target and make it
persistent across guest reboots.
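
Roughly, the flow on a resize request could look like the sketch below;
the qmp_*() helpers are hypothetical stand-ins for whatever QMP client
library such management software uses, not actual QEMU or libvirt APIs:

#include <stdint.h>

/* hypothetical QMP client helpers, declared only for this sketch */
void qmp_balloon(uint64_t target_bytes);
void qmp_object_add_ram(const char *id, uint64_t size_bytes);
void qmp_device_add_haprot(const char *id, const char *memdev_id);

static void resize_guest(uint64_t target, uint64_t boot_mem,
                         uint64_t plugged_mem)
{
    uint64_t max = boot_mem + plugged_mem; /* current maximum guest size */

    if (target > max) {
        /* grow past the current maximum: plug another virtual DIMM stick */
        qmp_object_add_ram("memN", target - max);
        qmp_device_add_haprot("haN", "memN");
    }

    /* fine-grained adjustment within the (new) maximum via ballooning */
    qmp_balloon(target);
}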

 
>>
>> * Once the guest releases the whole memory backed by a haprot device
>>   (or is restarted) a "HV_BALLOON_HAPROT_UNUSED" QMP event will be
>>   generated.
>>   The haprot device then can be removed, together with its memory backend:
>>   device_del ha1
>>   object_del mem1
> 
> So, you rely on some external entity to properly shrink a guest again
> (e.g., during reboot).

Yes.

>>
>> Future directions:
>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>   virtio-balloon drivers.
>>   Currently, only one of them can be added to the VM at the same time.
> 
> Yeah, that makes sense. Only one at a time.

Having only one *active* at a time makes sense; however, it ultimately
would be nice to be able to have both of them inserted into a VM:
one for Windows guests and one for Linux ones,
even though only one would obviously be active at the same time.

>>
>> * Allow new haport devices to reuse the same address range as the ones
>>   that were previously deleted via device_del monitor command without
>>   having to restart the VM.
>>
>> * Add vmstate / live migration support to the hv-balloon driver.
>>
>> * Use haprot device to also add memory via virtio interface (this requires
>>   defining a new operation in virtio-balloon protocol and appropriate
>>   support from the client virtio-balloon driver in the Linux kernel).
> 
> Most probably not the direction we are going to take. We have virtio-mem
> for clean, fine-grained, NUMA-aware, paravirtualized memory hot(un)plug
> now, and we are well aware of various issues with (base-page size based)
> memory ballooning that are fairly impossible to solve (especially in the
> context of vfio).
> 

Thanks,
Maciej


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-21 22:22   ` Maciej S. Szmigiero
@ 2020-09-22  7:26     ` David Hildenbrand
  2020-09-22 23:19       ` Maciej S. Szmigiero
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2020-09-22  7:26 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, Paolo Bonzini, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

On 22.09.20 00:22, Maciej S. Szmigiero wrote:
> Hi David,
> 
> Thank you for your comments.
> 
> First, I want to underline that this driver targets Windows guests,
> where ability to modify and adapt the guest memory management
> code is extremely limited.

Yeah, I know the pain.

[...]

> 
> The idea is to use virtual DIMM sticks for hot-adding extra memory at
> runtime, while using ballooning for runtime adjustment of the guest
> memory size within the current maximum.
> 
> When the guest is rebooted the virtual DIMMs configuration is adjusted
> by the software controlling QEMU (some are removed and / or some are
> added) to give the guest the same effective memory size as it had before
> the reboot.

Okay, so while "the ACPI DIMM slot limit does not apply", the KVM memory
slot limit (currently) applies, resulting in exactly the same behavior.

The only (conceptual) difference I am able to spot is then a
notification to the user on reboot, so the guest memory layout can be
adjusted (which I consider very ugly, but it's the same thing when
mixing ballooning and DIMMs - which is why it's usually never done).

[...]

> 
> So, yes, it will be a problem if the user expands their running guest
> ~256 times, each time making it even bigger than previously, without
> rebooting it even once, but this does seem to be an edge use case.

IIRC, that's exactly what dynamic memory under Windows does in automatic
mode, no? Monitor the guests, distribute memory accordingly - usually in
smaller steps. But I am no expert on Hyper-V.

> 
> In the future it would be better to automatically turn the current
> effective guest size into its boot memory size when the VM restarts
> (the VM will then have no virtual DIMMs inserted after a reboot), but
> doing this requires quite a few changes to QEMU, that's why it isn't
> there yet.

Will most probably never happen as reshuffling the layout of your boot
memory (especially with NUMA) within QEMU can break live migration in
various ways.

If you already notify the user on a reboot, the user can just kill the
VM and start it with an adjusted boot memory size. Yeah, that's ugly,
but so is the whole "adjust DIMM/balloon configuration during a reboot
from outside QEMU".

BTW, how would you handle: Start guest with 10G. Inflate balloon to 5G.
Reboot. There are no virtual DIMMs to adjust.

> 
> The above is basically how Hyper-V hypervisor handles its memory size
> changes and it seems to be as close to having a transparently resizable
> guest as reasonably possible.

"having a transparently resizable _Windows_ guests right now" :)

> 
> 
>> Or avoid VMA limits when wanting to grow a VM big in very tiny steps over
>> time (e.g., adding 64MB at a time).
> 
> Not sure if you are taking about VMA limits inside the host or the guest.

Host. One virtual DIMM corresponds to one VMA. But the KVM memory limit
already applies before that, so it doesn't matter.

[...]

>> I assume these numbers apply with Windows guests only. IIRC Linux
>> hv_balloon does not support page migration/compaction, while
>> virtio-balloon does. So you might end up with quite some fragmented
>> memory with hv_balloon in Linux guests - of course, usually only in
>> corner cases.
> 
> As I previously mentioned, this driver targets mainly Windows guests.

... and you cannot enforce that people will only use it with Windows
guests :)

[...]

> Windows will generally leave some memory free when processing balloon
> requests, although the precise amount varies between few hundred MB to
> values like 1+ GB.
> 
> Usually it runs stable even with these few hundred MBs of free memory
> remaining but I have seen occasional crashes at shutdown time in this
> case (probably something critical failing to initialize due to the
> system running out of memory).
> 
> While the above command was just a quick example, I personally think
> it is the guest who should be enforcing a balloon floor since it is
> the guest that knows its internal memory requirements, not the host.

Even the guest has no idea about the (future) working set size. That's a
known problem.

There are always cases where the calculation is wrong, and if the
monitoring process isn't fast enough to react and adjust the guest size,
your things will end up baldy in your guest. Just as the reboot case you
mentioned, where the VM crashes.

[...]

>>>
>>> Future directions:
>>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>>   virtio-balloon drivers.
>>>   Currently, only one of them can be added to the VM at the same time.
>>
>> Yeah, that makes sense. Only one at a time.
> 
> Having only one *active* at a time makes sense, however it ultimately
> would be nice to be able to have them both inserted into a VM:
> one for Windows guests and one for Linux ones.
> Even though only one obviously would be active at the same time.

I don't think that's the right way forward - that should be configured
when the VM is started.

Personal opinion: I can understand the motivation to implement
hypervisor-specific devices to better support closed-source operating
systems. But I doubt we want to introduce+support ten different
proprietary devices based on proprietary standards doing roughly the
same thing just because closed-source operating systems are too lazy to
support open standards properly.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-22  7:26     ` David Hildenbrand
@ 2020-09-22 23:19       ` Maciej S. Szmigiero
  2020-09-23 12:48         ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-22 23:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, Paolo Bonzini, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

On 22.09.2020 09:26, David Hildenbrand wrote:
> On 22.09.20 00:22, Maciej S. Szmigiero wrote:
>> Hi David,
>>
>> Thank you for your comments.
>>
(...)
>>
>> The idea is to use virtual DIMM sticks for hot-adding extra memory at
>> runtime, while using ballooning for runtime adjustment of the guest
>> memory size within the current maximum.
>>
>> When the guest is rebooted the virtual DIMMs configuration is adjusted
>> by the software controlling QEMU (some are removed and / or some are
>> added) to give the guest the same effective memory size as it had before
>> the reboot.
> 
> Okay, so while "the ACPI DIMM slot limit does not apply", the KVM memory
> slot limit (currently) applies, resulting in exactly the same behavior.
>
> The only (conceptual difference) I am able to spot is then a
> notification to the user on reboot, so the guest memory layout can be
> adjusted (which I consider very ugly, but it's the same thing when
> mixing ballooning and DIMMs - which is why it's usually never done).

If you want to shrink a guest at runtime you'll pretty much have to use
ballooning as {ACPI-based PC, virtual} DIMM stick sizes are far too
large to make anything but rough adjustments to the guest memory size.

In addition to that, with ACPI-based PC DIMM hotplug it is the host that
chooses which particular DIMM stick to unplug, while having no feedback
from the guest about how much of each DIMM stick's memory range is
currently in use (and so will have to be copied somewhere else).

I know that this is a source of significant hot removal slowdown, especially
when a "ripple effect" happens on removal:
1) There are 3 extra DIMMs plugged into the guest: A, B, C.
   A and B are nearly empty, but C is nearly full.

2) The host does not know anything about which DIMM is empty and which is full,
   so it requests the guest to unplug the stick C,

3) The guest copies the content of the stick C to the stick B,

4) Once again, the host does not know anything about which DIMM is empty and
   which is full, so it requests the guest to unplug the stick B,

5) The guest now has to copy the same data from the stick B to the
   stick A, once again.

With virtual DIMM sticks + this driver it is the guest which chooses
which particular pages to release, hopefully choosing the already unused
ones.
Once the whole memory behind a DIMM stick is released the host knows
that it can be unplugged now without any copying.

While it might seem like this will cause a lot of fragmentation, in
practice Windows seems to try to give out the largest contiguous range
of pages it is able to find.

One can also see in the hv_balloon client driver from the Linux kernel
that this driver tries to do 2 MB allocations for as long as it can
before giving out single pages.
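
Conceptually that guest-side strategy looks something like the sketch
below (illustrative only, with a hypothetical try_alloc_pages() helper;
the real logic lives in drivers/hv/hv_balloon.c):

#include <stdbool.h>

bool try_alloc_pages(unsigned long nr_pages); /* hypothetical helper */

/* give pages_wanted 4k pages back to the host, trying 2 MB chunks first */
static unsigned long balloon_up(unsigned long pages_wanted)
{
    unsigned long alloc_unit = 512; /* 2 MB worth of 4k pages */
    unsigned long done = 0;

    while (done < pages_wanted) {
        if (pages_wanted - done < alloc_unit) {
            alloc_unit = 1; /* don't overshoot the request */
        }
        if (!try_alloc_pages(alloc_unit)) {
            if (alloc_unit == 1) {
                break; /* out of memory, report what we got so far */
            }
            alloc_unit = 1; /* large chunks failed, fall back to single pages */
            continue;
        }
        done += alloc_unit;
    }
    return done;
}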

The reason why ballooning and DIMMs weren't being used together previously
is probably because virtio-balloon will (at least on Windows) mark the
ballooned out pages as in use inside the guest, preventing the removal
of the DIMM stick backing them.

In addition to the above, virtio-balloon is also very slow, as the whole
protocol operates on single pages only, not on page ranges.
There might also be some interference with Windows memory management
causing an extra slowdown in comparison to the native Windows DM
protocol.
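
For comparison, the DM protocol and this driver deal in (start, count)
page ranges, conceptually like the sketch below (the field names are
illustrative only, the real wire format is in
include/hw/hyperv/dynmem-proto.h); the host side keeps such ranges in
GTrees and tries to merge a newly inserted range with its neighbors:

#include <stdbool.h>
#include <stdint.h>

struct page_range {
    uint64_t start; /* first page frame number */
    uint64_t count; /* number of 4k pages */
};

/* extend 'a' by 'b' if 'b' directly follows it (the neighbor-merge case) */
static bool page_range_try_merge(struct page_range *a,
                                 const struct page_range *b)
{
    if (a->start + a->count != b->start) {
        return false;
    }
    a->count += b->count;
    return true;
}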

If the KVM slot limit starts to be a problem in practice then we can
think about what can be done about it.
It's always one obstacle less.

I see that the same KVM slot limit probably applies also for virtio-mem,
since it uses memory-backend-ram as its backing memory device, too,
right?

If not, then how do you map a totally new memory range into the guest
address space without consuming a KVM memory slot?
If that's somehow possible then maybe the same mechanism can simply be
reused for this driver.

> [...]
> 
>>
>> So, yes, it will be a problem if the user expands their running guest
>> ~256 times, each time making it even bigger than previously, without
>> rebooting it even once, but this does seem to be an edge use case.
> 
> IIRC, that's exactly what dynamic memory under Windows does in automatic
> mode, no? Monitor the guests, distribute memory accordingly - usually in
> smaller steps. But I am no expert on Hyper-V.

Yes, they call their automatic mode "Dynamic Memory" in recent Windows
versions.

This is a bit confusing because even if you disable this feature
the Hyper-V hypervisor will still provide this Dynamic Memory Protocol
service and use it to resize the guest on (user) demand.
It just won't do such a resize on its own, only when explicitly
requested.

Don't know if they internally have any limit that is similar to the KVM
memory slot limit, though.

>>
>> In the future it would be better to automatically turn the current
>> effective guest size into its boot memory size when the VM restarts
>> (the VM will then have no virtual DIMMs inserted after a reboot), but
>> doing this requires quite a few changes to QEMU, that's why it isn't
>> there yet.
> 
> Will most probably never happen as reshuffling the layout of your boot
> memory (especially with NUMA) within QEMU can break live migration in
> various ways.

That's why this functionality is not in the current driver version as
it is a bit hard to implement :)

> If you already notify the user on a reboot, the user can just kill the
> VM and start it with an adjusted boot memory size. Yeah, that's ugly,
> but so is the whole "adjust DIMM/balloon configuration during a reboot
> from outside QEMU".
>
> BTW, how would you handle: Start guest with 10G. Inflate balloon to 5G.
> Reboot. There are no virtual DIMMs to adjust.

You'll typically want to avoid relaunching QEMU as much as possible
since things like chardev sockets and a VNC connection will disconnect
if the QEMU process exits.
Not to mention that it takes some time for it to actually start again.

However, there is a trade-off here: one can either start the guest with
a relatively large boot memory size, but then shrinking the guest means
that it will see the whole boot memory size again during reboot, until
it is ballooned down again after it has connected to the DM protocol.

Or it can be started with a small boot memory size, but this means that
a few virtual DIMMs might always be inserted (their size and / or count
can be optimized during the next reboot or if they become unused due
to ballooning).

Or one can choose some point in between these two scenarios.
 
I think a virtio-mem user has to choose a similar trade-off between
the boot memory size and the size and count of plugged-in virtio-mem
devices, right?

>>
>> The above is basically how Hyper-V hypervisor handles its memory size
>> changes and it seems to be as close to having a transparently resizable
>> guest as reasonably possible.
> 
> "having a transparently resizable _Windows_ guests right now" :)

Right.

(...)
> 
>>> I assume these numbers apply with Windows guests only. IIRC Linux
>>> hv_balloon does not support page migration/compaction, while
>>> virtio-balloon does. So you might end up with quite some fragmented
>>> memory with hv_balloon in Linux guests - of course, usually only in
>>> corner cases.
>>
>> As I previously mentioned, this driver targets mainly Windows guests.
> 
> ... and you cannot enforce that people will only use it with Windows
> guests :)

If people want to run this driver with Linux, or port the hv_balloon
client driver from the Linux kernel to, for example, GNU Hurd and run
the DM protocol there, then they are free to do so.
It just really isn't this driver's target environment.

> [...]
> 
>> Windows will generally leave some memory free when processing balloon
>> requests, although the precise amount varies between few hundred MB to
>> values like 1+ GB.
>>
>> Usually it runs stable even with these few hundred MBs of free memory
>> remaining but I have seen occasional crashes at shutdown time in this
>> case (probably something critical failing to initialize due to the
>> system running out of memory).
>>
>> While the above command was just a quick example, I personally think
>> it is the guest who should be enforcing a balloon floor since it is
>> the guest that knows its internal memory requirements, not the host.
> 
> Even the guest has no idea about the (future) working set size. That's a
> known problem.
> 
> There are always cases where the calculation is wrong, and if the
> monitoring process isn't fast enough to react and adjust the guest size,
> your things will end up baldy in your guest. Just as the reboot case you
> mentioned, where the VM crashes.

The actual Hyper-V hypervisor somehow manages not to over-balloon its
guests to the point that they run out of memory and crash.
So this is definitely doable (with a margin of safety).

However, such heuristics are really an issue for the software
controlling QEMU and so are outside the scope of this driver.

By the way, that's why DM guests emit a STATUS message each second
with various memory counters (translated into a QMP event by this driver)
- to give the host hints about the guest memory pressure.
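
The counters in that STATUS message are in 4k protocol page units, so
the driver just scales them to bytes before emitting the
HV_BALLOON_STATUS_REPORT QMP event, along the lines of this simplified
snippet (num_avail here stands in for whatever the status message calls
its available-pages counter):

#include <stdint.h>

#define HV_BALLOON_PAGE_SIZE 4096 /* DM protocol page size (4k) */

static void status_counters_to_bytes(uint32_t num_committed,
                                     uint32_t num_avail,
                                     uint64_t *committed_bytes,
                                     uint64_t *available_bytes)
{
    /* widen before multiplying so large guests do not overflow 32 bits */
    *committed_bytes = (uint64_t)num_committed * HV_BALLOON_PAGE_SIZE;
    *available_bytes = (uint64_t)num_avail * HV_BALLOON_PAGE_SIZE;
}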

> [...]
> 
>>>>
>>>> Future directions:
>>>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>>>   virtio-balloon drivers.
>>>>   Currently, only one of them can be added to the VM at the same time.
>>>
>>> Yeah, that makes sense. Only one at a time.
>>
>> Having only one *active* at a time makes sense, however it ultimately
>> would be nice to be able to have them both inserted into a VM:
>> one for Windows guests and one for Linux ones.
>> Even though only one obviously would be active at the same time.
> 
> I don't think that's the right way forward - that should be configured
> when the VM is started.
> 
> Personal opinion: I can understand the motivation to implement
> hypervisor-specific devices to better support closed-source operating
> systems. But I doubt we want to introduce+support ten different
> proprietary devices based on proprietary standards doing roughly the
> same thing just because closed-source operating systems are too lazy to
> support open standards properly.
> 

What do you mean by "ten" proprietary devices?
Is there another balloon protocol driver currently in the tree other
than virtio-balloon running over various buses?

People are running Windows guests using QEMU, too.

That's why there are a dozen or so Hyper-V enlightenments implemented
even though they duplicate KVM PV stuff, and why there is kvmvapic
with its Windows guest live-patching.

Not to mention many, many devices like e1000 or VMware vmxnet3 even
though virtio-net exists or PIIX IDE even though virtio-{blk,scsi} exist.
Or the applesmc driver, which is cleanly designed to help run just
one proprietary OS.

Thanks,
Maciej



* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-22 23:19       ` Maciej S. Szmigiero
@ 2020-09-23 12:48         ` David Hildenbrand
  2020-09-24 22:37           ` Maciej S. Szmigiero
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2020-09-23 12:48 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, Paolo Bonzini, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

>> Okay, so while "the ACPI DIMM slot limit does not apply", the KVM memory
>> slot limit (currently) applies, resulting in exactly the same behavior.
>>
>> The only (conceptual difference) I am able to spot is then a
>> notification to the user on reboot, so the guest memory layout can be
>> adjusted (which I consider very ugly, but it's the same thing when
>> mixing ballooning and DIMMs - which is why it's usually never done).
> 
> If you want to shrink a guest at runtime you'll pretty much have to use
> ballooning as {ACPI-based PC, virtual} DIMM stick sizes are far too
> large to make anything but rough adjustments to the guest memory size.

Right.

> 
> In addition to that with ACPI-based PC DIMM hotplug it is the host that
> chooses which particular DIMM stick to unplug while having no feedback
> from the guest how much of each DIMM stick memory range is currently
> in use and so will have to be copied somewhere else.

Yeah, these external requests are bad in general. Not only for
performance - also because they can fail silently (if there is unmovable
data on it).

> 
> I know that this is a source of significant hot removal slowdown, especially
> when a "ripple effect" happens on removal:
> 1) There are 3 extra DIMMs plugged into the guest: A, B, C.
>    A and B are nearly empty, but C is nearly full.
> 
> 2) The host does not know which DIMM is empty and which is full,
>    so it requests the guest to unplug the stick C,

In theory, the host can simply track inflation requests. In practice,
guests tend to sometimes re-access balloon-inflated memory (e.g., simple
kexec-style reboot under Linux, kdump on older Linux versions), so it's
not completely safe to do.

> 
> 3) The guest copies the content of the stick C to the stick B,
> 
> 4) Once again, the host does not know which DIMM is empty and
>    which is full, so it requests the guest to unplug the stick B,
> 
> 5) The guest now has to copy the same data from the stick B to the
>    stick A, once again.
> 
> With virtual DIMM sticks + this driver it is the guest which chooses
> which particular pages to release, hopefully choosing the already unused
> ones.
> Once the whole memory behind a DIMM stick is released the host knows
> that it can be unplugged now without any copying.
> 
> While it might seem like this will cause a lot of fragmentation in
> practice Windows seems to try to give out the largest continuous range
> of pages it is able to find.
> 
> One can also see in the hv_balloon client driver from the Linux kernel
> that this driver tries to do 2 MB allocations for as long as it can
> before giving out single pages.

Yeah, something similar was proposed for virtio-balloon already (and
there is a paper about that - see below). For virtio-balloon, we were
not yet convinced that stealing most hugepages from the guest is always
such a good idea. It at least would have to be configurable, to not mess
with existing use cases.

> 
> The reason why ballooning and DIMMs wasn't being used together previously
> is probably because virtio-balloon will (at least on Windows) mark the
> ballooned out pages as in use inside the guest, preventing the removal
> of the DIMM stick backing them.

Same on Linux - but the pages are movable, so they can at least be moved
around to offline + unplug a DIMM.

Some of the reasons why ballooning + DIMMs are not used, as far as I know, are:
- Management issues. Using it without some managing instance
  (plug/unplug DIMM, control balloon) is impossible. Try using it with
  bare QEMU - basically impossible.
- Memory Hotplug limitations: Maximum DIMM count. Minimum DIMM size that
  an OS can use (e.g., >= 128MiB under Linux, sometimes even 1GB). The
  granularity restrictions you mentioned apply.
- Memory Hotplug reliability: It can happen easily that hotplugging a
  DIMM / onlining it under Linux fails (e.g., minimum DIMM size). "What
  you think the guest actually has as available memory might be wrong".
  If you ignore that (and you don't even get notified) and adjust the
  balloon later, your (Linux) guest might be in trouble. Assume
  you hotplug an 8GB DIMM into a 2GB VM and later try to inflate the balloon to
  4GB ... so you need reliable monitoring and error handling.

So yeah, I can understand how hv-balloon tries to work around some of
these issues.

> 
> In addition to the above, virtio-balloon is also very slow, as the whole
> protocol operates on single pages only, not on page ranges.
> There might also be some interference with Windows memory management
> causing an extra slowdown in comparison to the native Windows DM
> protocol.

Yes, because I assume Windows doesn't really care too much about
optimizing for virtio-balloon. There isn't too much external developers
can do about that. See below for hugepage ballooning in virtio-balloon.

> 
> If the KVM slot limit starts to be a problem in practice then we can
> think what can be done about it.
> It's always one obstacle less.

I'm not a friend of leaving the challenging/fundamental problems to be
sorted out in the future (e.g., resizing initial boot memory, dealing
with fundamental limits - like KVM memory slots or VMA). But I get how
it's easier to get something running this way :)

> 
> I see that the same KVM slot limit probably applies also for virtio-mem,
> since it uses memory-backend-ram as its backing memory device, too,
> right?

Yes, one memory backend per virtio-mem device. You usually have one
device per NUMA node.

> 
> If not, then how do you map a totally new memory range into the guest
> address space without consuming a KVM memory slot?
> If that's somehow possible then maybe the same mechanism can simply be
> reused for this driver.

So, virtio-mem will (in the future, still to be upstreamed by me) use
resizeable memory regions / ramblocks / KVM memory slots. The region can
grow (e.g., memory hotplug) and shrink (e.g., during reboot, but later
also if unplugged memory would allow for it). Nice thing is that
migration code fully supports resizeable ramblocks already.

Resizes are triggered by the virtio-mem device, so stuff is completely
handled inside QEMU. For hv-balloon, you could grow the region when
required (e.g., balloon X, whereby X is > ram size after inflating), and
shrink during reboot (or whenever it might be valid to shrink). However,
you cannot "rip out" anything in between, you'll have to rely on
MADV_DONTNEED until the guest reboots (well, just like basic
ballooning), and then you can definitely shrink the region.

That's the tradeoff virtio-mem decided to take for now to be able to
manage any size changes + migration completely in QEMU, avoiding any
coordination with an external entity (e.g., libvirt in your example)
when resizing a guest.

> 
>> [...]
>>
>>>
>>> So, yes, it will be a problem if the user expands their running guest
>>> ~256 times, each time making it even bigger than previously, without
>>> rebooting it even once, but this does seem to be an edge use case.
>>
>> IIRC, that's exactly what dynamic memory under Windows does in automatic
>> mode, no? Monitor the guests, distribute memory accordingly - usually in
>> smaller steps. But I am no expert on Hyper-V.
> 
> Yes, they call their automatic mode "Dynamic Memory" in recent Windows
> versions.
> 
> This is a bit confusing because even if you disable this feature
> the Hyper-V hypervisor will still provide this Dynamic Memory Protocol
> service and use it to resize the guest on (user) demand.
> Just it won't do such resize on its own but only when explicitly
> requested.

Interesting, thanks.

> 
> Don't know if they internally have any limit that is similar to the KVM
> memory slot limit, though.

That would be interesting for me - like which limits do actually apply
under Hyper-V.

[...]

> 
>> If you already notify the user on a reboot, the user can just kill the
>> VM and start it with an adjusted boot memory size. Yeah, that's ugly,
>> but so is the whole "adjust DIMM/balloon configuration during a reboot
>> from outside QEMU".
>>
>> BTW, how would you handle: Start guest with 10G. Inflate balloon to 5G.
>> Reboot. There are no virtual DIMMs to adjust.
> 
> You'll typically want to avoid relaunching QEMU as much as possible
> since things like chardev sockets and a VNC connection will disconnect
> if the QEMU process exits.
> Not to mention that it takes some time for it to actually start again.

Exactly my thoughts, that's why I tried to avoid that as well with
virtio-mem.

> 
> However, there is a trade-off here: one can either start the guest with
> a relatively large boot memory size, but then shrinking the guest means
> that it will see the whole boot memory size again during reboot, until
> it is ballooned down again after it has connected to the DM protocol.

Yeah, one of the main issues of memory ballooning.

> 
> Or it can be started with a small boot memory size, but this means that
> few virtual DIMMs might always be inserted (their size and / or count
> can be optimized during the next reboot or if they become unused due
> to ballooning).
> 
> Or one can choose some point in between these two scenarios.
>  
> I think a virtio-mem user has to choose a similar trade-off between
> the boot memory size and the size and count of plugged-in virtio-mem
> devices, right?

Partially yes, partially no. It doesn't really care about the second
case you mention ("few virtual DIMMs might always be inserted") due to
the way it works. And it never has to deal with "inflate balloon
after/during reboot".

A virtio-mem device manages only its assigned memory, it does not work
on random system memory like memory ballooning. So you can never unplug
initial memory.

However, you can do something like

-m 4G,maxmem=104G

and define a virtio-mem device with a maximum size of 100G and an
initial size of - say 16G. When booting up, the guest will detect the
additional 16GB and have effectively 20GB. However, you can only ever
shrink back down to 4GB (e.g., reliably during a reboot).
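
For illustration, such a setup could look roughly like this on the
command line (sizes and IDs are arbitrary, and depending on the
machine type you may also have to specify a slots= parameter):

qemu-system-x86_64 \
    -m 4G,maxmem=104G \
    -object memory-backend-ram,id=vmem0,size=100G \
    -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=16G \
    ...

The requested-size property can later be changed at runtime (e.g., via
qom-set) anywhere between 0 and the 100G maximum, in multiples of the
device block size.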

And it might not always be desirable (at least under Linux) to have
little boot memory - say the CMA allocator might want to reserve bigger
chunks of memory early during boot - if the initial memory size is too
small, this can fail easily.

[...]

>>
>>> Windows will generally leave some memory free when processing balloon
>>> requests, although the precise amount varies between few hundred MB to
>>> values like 1+ GB.
>>>
>>> Usually it runs stable even with these few hundred MBs of free memory
>>> remaining but I have seen occasional crashes at shutdown time in this
>>> case (probably something critical failing to initialize due to the
>>> system running out of memory).
>>>
>>> While the above command was just a quick example, I personally think
>>> it is the guest who should be enforcing a balloon floor since it is
>>> the guest that knows its internal memory requirements, not the host.
>>
>> Even the guest has no idea about the (future) working set size. That's a
>> known problem.
>>
>> There are always cases where the calculation is wrong, and if the
>> monitoring process isn't fast enough to react and adjust the guest size,
>> your things will end up badly in your guest. Just as the reboot case you
>> mentioned, where the VM crashes.
> 
> The actual Hyper-V hypervisor somehow manages not to over-balloon its
> guests to the point that they run out of memory and crash.
> So this is definitely doable (with a margin of safety).
> 
> However, such heuristics are really an issue for the software
> controlling QEMU and so are outside the scope of this driver.

Yeah, just like any heuristic, it can be wrong. I wonder if we could add
something similar for virtio-balloon (at least don't immediately deflate
until your VM dies ...).

> 
> By the way, that's why DM guests emit a STATUS message each second
> with various memory counters (translated into a QMP event by this driver)
> - to give its host hints about the guest memory pressure.

Right, we have something similar for virtio-balloon, via the stats VQ.
IIRC, it's used by auto-ballooning mechanisms implemented in OpenStack
(I'd say similar to dynamic memory, it just won't try to increase the
size of a guest using new DIMMs).
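
(For reference, a management application polls these stats via QOM - a
rough sketch, assuming the balloon device was created with id=balloon0,
so the path below is just an example:

{ "execute": "qom-set",
  "arguments": { "path": "/machine/peripheral/balloon0",
                 "property": "guest-stats-polling-interval",
                 "value": 2 } }
{ "execute": "qom-get",
  "arguments": { "path": "/machine/peripheral/balloon0",
                 "property": "guest-stats" } }

The second command returns the most recent counters reported by the
guest over the stats VQ, e.g. stat-free-memory, together with a
last-update timestamp.)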

>>
>>>>>
>>>>> Future directions:
>>>>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>>>>   virtio-balloon drivers.
>>>>>   Currently, only one of them can be added to the VM at the same time.
>>>>
>>>> Yeah, that makes sense. Only one at a time.
>>>
>>> Having only one *active* at a time makes sense, however it ultimately
>>> would be nice to be able to have them both inserted into a VM:
>>> one for Windows guests and one for Linux ones.
>>> Even though only one obviously would be active at the same time.
>>
>> I don't think that's the right way forward - that should be configured
>> when the VM is started.
>>
>> Personal opinion: I can understand the motivation to implement
>> hypervisor-specific devices to better support closed-source operating
>> systems. But I doubt we want to introduce+support ten different
>> proprietary devices based on proprietary standards doing roughly the
>> same thing just because closed-source operating systems are too lazy to
>> support open standards properly.
>>
> 
> What do you mean by "ten" proprietary devices?
> Is there another balloon protocol driver currently in the tree other
> than virtio-balloon running over various buses?

Maybe OSX wants to be next and re-invent the wheel with a proprietary
balloon driver for a custom hypervisor. I think you get the idea.

What I'm saying is that I'd much rather want to see Windows
improve+extend virtio-balloon and such (virtio-mem), instead of
requiring hypervisors to implement undocumented, proprietary devices to
get stuff running somewhat smoothly in modern cloud environments. I have
the feeling that quite some stuff you mention can simply be "fixed" by
extending/improving virtio-balloon under Windows - for example, inflation
speed can be improved significantly by inflating in bigger chunks. See

https://dl.acm.org/doi/10.1145/3240302.3240420

In contrast to hv-balloon *we* can extend/improve the
interface/standard/implementation on both sides (host/guest).

> 
> People are running Windows guests using QEMU, too.
> 
> That's why there are dozen or so Hyper-V enlightenments implemented,
> even though they duplicate KVM PV stuff or that there is kvmvapic
> with its Windows guest live-patching.

IIRC the Hyper-V enlightenment stuff is properly publicly documented -
whereby last time I checked, the hv-balloon is completely undocumented
and has to be reverse engineered from the Linux implementation. Please
correct me if I'm wrong - I am not able to spot references in your cover
letter either - I'd be interested in that!

> 
> Not to mention many, many devices like e1000 or VMware vmxnet3 even
> though virtio-net exists or PIIX IDE even though virtio-{blk,scsi} exist.
> Or the applesmc driver, which is cleanly designed to help run just
> one proprietary OS.

IIRC we need the devices either to bootstrap - e.g., use e1000 until we
can install virtio-net once the guest is up and running, or to support
older, unmodified guests. I'd like to stress that what you are proposing
is different in that sense. Your Windows VM will work just fine without
a hv-balloon device.

Again, just my personal opinion, I don't make any decisions around here :)

-- 
Thanks,

David / dhildenb




* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-23 12:48         ` David Hildenbrand
@ 2020-09-24 22:37           ` Maciej S. Szmigiero
  2020-09-25  6:49             ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Maciej S. Szmigiero @ 2020-09-24 22:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, Paolo Bonzini, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

On 23.09.2020 14:48, David Hildenbrand wrote:
(...)
>>
>> I know that this is a source of significant hot removal slowdown, especially
>> when a "ripple effect" happens on removal:
>> 1) There are 3 extra DIMMs plugged into the guest: A, B, C.
>>    A and B are nearly empty, but C is nearly full.
>>
>> 2) The host does not know which DIMM is empty and which is full,
>>    so it requests the guest to unplug the stick C,
> 
> In theory, the host can simply track inflation requests. In practice,
> guests tend to sometimes re-access balloon-inflated memory (e.g., simple
> kexec-style reboot under Linux, kdump on older Linux versions), so it's
> not completely safe to do.

You are describing the Linux situation here, while this driver targets
Windows.

I think the issues you describe (kexec, etc.) are probably fixable once
somebody is determined enough to do so.

I mean, either the old kernel needs to transfer information about
"forbidden" memory areas to the new kernel or the new kernel needs to
query these somehow (probably only if it receives a "do it" flag from
the old kernel).

>>
>> 3) The guest copies the content of the stick C to the stick B,
>>
>> 4) Once again, the host does not know which DIMM is empty and
>>    which is full, so it requests the guest to unplug the stick B,
>>
>> 5) The guest now has to copy the same data from the stick B to the
>>    stick A, once again.
>>
>> With virtual DIMM sticks + this driver it is the guest which chooses
>> which particular pages to release, hopefully choosing the already unused
>> ones.
>> Once the whole memory behind a DIMM stick is released the host knows
>> that it can be unplugged now without any copying.
>>
>> While it might seem like this will cause a lot of fragmentation in
>> practice Windows seems to try to give out the largest continuous range
>> of pages it is able to find.
>>
>> One can also see in the hv_balloon client driver from the Linux kernel
>> that this driver tries to do 2 MB allocations for as long as it can
>> before giving out single pages.
> 
> Yeah, something similar was proposed for virtio-balloon already (and
> there is a paper about that - see below). For virtio-balloon, we were
> not yet convinced that stealing most hugepages from the guest is always
> such a good idea. It at least would have to be configurable, to not mess
> with existing use cases.

Thanks for the paper.

The hv_balloon Linux client driver does large continuous allocations of
ordinary 4k pages, not only transparent hugepages.
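
Roughly, that allocation strategy amounts to something like this
(illustrative sketch only, with made-up names and GFP flags - this is
not the actual hv_balloon code):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Try a 2 MB chunk (order 9 with 4k pages) first, then fall back
 * to single pages once higher-order allocations start failing. */
static struct page *balloon_alloc_chunk(unsigned int *order)
{
        const gfp_t flags = GFP_HIGHUSER | __GFP_NORETRY |
                            __GFP_NOMEMALLOC | __GFP_NOWARN;
        struct page *pg;

        pg = alloc_pages(flags, 9);     /* 512 contiguous pages */
        if (pg) {
                *order = 9;
                return pg;
        }

        *order = 0;
        return alloc_pages(flags, 0);   /* a single 4k page */
}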

Even if the virtio-balloon client driver is changed to do higher-order
allocations the protocol itself only supports transporting individual
page numbers and not ranges.

Either the virtio-balloon protocol will need to be changed to allow
sending page ranges (good) or there will need to be some implicit
agreement between the host and client drivers that pages from a
continuous range will be sent consecutively.
Then the host driver can reassemble the whole range, ending either
when it receives a page outside the range or when a reassembly timeout
happens.
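
As a rough sketch, such host-side reassembly would boil down to
something like this (the names below are made up for this example):

#include <stdbool.h>
#include <stdint.h>

struct pfn_range {
        uint64_t start_pfn;
        uint64_t num_pages;
};

/*
 * Try to extend the range currently being assembled with the next
 * PFN reported by the guest.  Returns false when the PFN is not the
 * immediate successor, in which case the caller flushes the range
 * and starts a new one (or gives up after a reassembly timeout).
 */
static bool pfn_range_extend(struct pfn_range *r, uint64_t pfn)
{
        if (r->num_pages != 0 && pfn == r->start_pfn + r->num_pages) {
                r->num_pages++;
                return true;
        }
        return false;
}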

But that's just really ugly - realistically the virtio-balloon
protocol will need to be changed in such case.

> 
>>
>> The reason why ballooning and DIMMs wasn't being used together previously
>> is probably because virtio-balloon will (at least on Windows) mark the
>> ballooned out pages as in use inside the guest, preventing the removal
>> of the DIMM stick backing them.
> 
> Same on Linux - but the pages are movable, so they can at least be moved
> around to offline + unplug a DIMM.
> 
> Some of the reasons why ballooning + DIMMs are not used, as far as I know, are:
> - Management issues. Using it without some managing instance
>   (plug/unplug DIMM, control balloon) is impossible. Try using it with
>   bare QEMU - basically impossible.

Well, it's not impossible, it just needs some manual reconfiguration.

But such advanced usage scenarios usually are using QEMU with an
external controller anyway.

> - Memory Hotplug limitations: Maximum DIMM count. Minimum DIMM size that
>   an OS can use (e.g., >= 128MiB under Linux, sometimes even 1GB). The
>   granularity restrictions you mentioned apply.

Windows has a memory hotplug granularity of 1 MB.

> - Memory Hotplug reliability: It can happen easily that hotplugging a
>   DIMM / onlining it under Linux fails (e.g., minimum DIMM size). "What
>   you think the guest actually has as available memory might be wrong".
>   If you ignore that (and you don't even get notified) and adjust the
>   balloon later, your (Linux) guest might be in trouble. Assume
>   you hotplug an 8GB DIMM into a 2GB VM and later try to inflate the balloon to
>   4GB ... so you need reliable monitoring and error handling.

The DM "hot add" message has a "page_count" field to tell the host
how much memory the guest has actually added.

Quoting the DM protocol header file:
> Hot adds may also fail due to low resources; in this case, the guest must
> not complete this message until the hot add can succeed, and the host must
> not send a new hot add request until the response is sent.
> If VSC fails to hot add memory DYNMEM_NUMBER_OF_UNSUCCESSFUL_HOTADD_ATTEMPTS
> times it fails the request.

The host also knows the current guest memory size from its STATUS
messages.

> So yeah, I can understand how hv-balloon tries to work around some of
> these issues.
>
>>
>> In addition to the above, virtio-balloon is also very slow, as the whole
>> protocol operates on single pages only, not on page ranges.
>> There might also be some interference with Windows memory management
>> causing an extra slowdown in comparison to the native Windows DM
>> protocol.
> 
> Yes, because I assume Windows doesn't really care too much about
> optimizing for virtio-balloon. There isn't too much external developers
> can do about that. See below for hugepage ballooning in virtio-balloon.

Exactly, that's why the DM protocol is the best thing we have to offer
for Windows guests right now.

>>
>> If the KVM slot limit starts to be a problem in practice then we can
>> think what can be done about it.
>> It's always one obstacle less.
> 
> I'm not a friend of leaving the challenging/fundamental problems to be
> sorted out in the future (e.g., resizing initial boot memory, dealing
> with fundamental limits - like KVM memory slots or VMA). But I get how
> it's easier to get something running this way :)

Constraints can be removed step-by-step, when they actually start to bite.

It is unrealistic to have a perfect guest resize solution in a single
patch series upfront, the issue is just too complex.

>>
>> I see that the same KVM slot limit probably applies also for virtio-mem,
>> since it uses memory-backend-ram as its backing memory device, too,
>> right?
> 
> Yes, one memory backend per virtio-mem device. You usually have one
> device per NUMA node.

So if you want to dynamically manage most of the guest memory these
virtio-mem devices + their backends will be very large
(circa 1/4 guest size each for a 4-node machine).

That means in practice they won't ever be able to be hot-removed before
the VM is rebooted since there will very likely be at least single stuck
page in each of these backing devices preventing their removal.
(If hot-removal support is ever enabled for virtio-mem, it looks like it
is not possible yet even on a VM reboot).

And I can see that removing a single RAM block in virtio-mem is done
by discarding it via MADV_DONTNEED, just like in ballooning.
Only the minimum block size is 2 MB and not 4 KB, so all 512 consecutive
pages in a block will need to be free in order to discard it.
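
A minimal sketch of that discard step, assuming host_addr points at
the start of one such fully-free block inside the QEMU process (the
constant name below is mine):

#include <sys/mman.h>

#define BLOCK_SIZE (2 * 1024 * 1024)    /* one 2 MB block */

/* Drop the backing pages; the guest sees zero-filled pages if it
 * ever touches this range again. */
static int discard_block(void *host_addr)
{
        return madvise(host_addr, BLOCK_SIZE, MADV_DONTNEED);
}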

>>
>> If not, then how do you map a totally new memory range into the guest
>> address space without consuming a KVM memory slot?
>> If that's somehow possible then maybe the same mechanism can simply be
>> reused for this driver.
> 
> So, virtio-mem will (in the future, still to be upstreamed by me) use
> resizeable memory regions / ramblocks / KVM memory slots. The region can
> grow (e.g., memory hotplug) and shrink (e.g., during reboot, but later
> also if unplugged memory would allow for it). Nice thing is that
> migration code fully supports resizeable ramblocks already.
> 
> Resizes are triggered by the virtio-mem device, so stuff is completely
> handled inside QEMU. For hv-balloon, you could grow the region when
> required (e.g., balloon X, whereby X is > ram size after inflating), and
> shrink during reboot (or whenever it might be valid to shrink). However,
> you cannot "rip out" anything in between, you'll have to rely on
> MADV_DONTNEED until the guest reboots (well, just like basic
> ballooning), and then you can definitely shrink the region.
> 
> That's the tradeoff virtio-mem decided to take for now to be able to
> manage any size changes + migration completely in QEMU, avoiding any
> coordination with an external entity (e.g., libvirt in your example)
> when resizing a guest.

For hv-balloon (that is, Windows guests), it seems to me that the slot
resizing would be a viable solution for growing the guest without
hitting the KVM slot limit or requiring a VM reboot.

For shrinking such guests it isn't really necessary since DM ballooning
is enough until a reboot happens.

During a reboot, virtual DIMM config optimization can be used for now,
while the best, long-term, solution would be to actually do what the
Hyper-V hypervisor does in this case: resize the boot memory to match
the target guest size.

This way the whole issue of the guest seeing only the boot memory part,
not the dynamic part, during its boot will no longer be there.

(...)
>>
>> Or it can be started with a small boot memory size, but this means that
>> few virtual DIMMs might always be inserted (their size and / or count
>> can be optimized during the next reboot or if they become unused due
>> to ballooning).
>>
>> Or one can choose some point in between these two scenarios.
>>  
>> I think a virtio-mem user has to choose a similar trade-off between
>> the boot memory size and the size and count of plugged-in virtio-mem
>> devices, right?
> 
> Partially yes, partially no. It doesn't really care about the second
> case you mention ("few virtual DIMMs might always be inserted") due to
> the way it works. And it never has to deal with "inflate balloon
> after/during reboot".
> 
> A virtio-mem device manages only its assigned memory, it does not work
> on random system memory like memory ballooning. So you can never unplug
> initial memory.
> 
> However, you can do something like
> 
> -m 4G,maxmem=104G
> 
> and define a virtio-mem device with a maximum size of 100G and an
> initial size of - say 16G. When booting up, the guest will detect the
> additional 16GB and have effectively 20GB. However, you can only ever
> shrink back down to 4GB (e.g., reliably during a reboot).
>
> And it might not always be desirable (at least under Linux) to have
> little boot memory - say the CMA allocator might want to reserve bigger
> chunks of memory early during boot - if the initial memory size is too
> small, this can fail easily.
> 

So the virtio-mem trade-off is between the size of the boot memory and
the dynamically managed part(s).
And also the block size, since as far as I can see a single stuck page
in a block will prevent it from being discarded.

Of course, we are talking about Linux guests here - a Windows guest
will see just the boot memory part.

(...)
>>>
>>>>>>
>>>>>> Future directions:
>>>>>> * Allow sharing the ballooning QEMU interface between hv-balloon and
>>>>>>   virtio-balloon drivers.
>>>>>>   Currently, only one of them can be added to the VM at the same time.
>>>>>
>>>>> Yeah, that makes sense. Only one at a time.
>>>>
>>>> Having only one *active* at a time makes sense, however it ultimately
>>>> would be nice to be able to have them both inserted into a VM:
>>>> one for Windows guests and one for Linux ones.
>>>> Even though only one obviously would be active at the same time.
>>>
>>> I don't think that's the right way forward - that should be configured
>>> when the VM is started.
>>>
>>> Personal opinion: I can understand the motivation to implement
>>> hypervisor-specific devices to better support closed-source operating
>>> systems. But I doubt we want to introduce+support ten different
>>> proprietary devices based on proprietary standards doing roughly the
>>> same thing just because closed-source operating systems are too lazy to
>>> support open standards properly.
>>>
>>
>> What do you mean by "ten" proprietary devices?
>> Is there another balloon protocol driver currently in the tree other
>> than virtio-balloon running over various buses?
> 
> Maybe OSX wants to be next and re-invent the wheel with a proprietary
> balloon driver for a custom hypervisor. I think you get the idea.
> 
> What I'm saying is that I'd much rather want to see Windows
> improve+extend virtio-balloon and such (virtio-mem), instead of
> requiring hypervisors to implement undocumented, proprietary devices to
> get stuff running somewhat smoothly in modern cloud environments. I have
> the feeling that quite some stuff you mention can simply be "fixed" by
> extending/improving virtio-balloon under Windows - for example, inflation
> speed can be improved significantly by inflating in bigger chunks. See
> 
> https://dl.acm.org/doi/10.1145/3240302.3240420
> 
> In contrast to hv-balloon *we* can extend/improve the
> interface/standard/implementation on both sides (host/guest).

Even if we switched virtio-balloon to bigger allocations and made the
protocol return page ranges the allocation is still done by simply
using an ordinary alloc_pages()-equivalent API.
I don't see any exported Windows kernel API for allocating balloon
memory.

The same goes for adding new RAM.

The above alone means that supporting virtio-mem semantics on current
Windows versions is likely not possible.

>>
>> People are running Windows guests using QEMU, too.
>>
>> That's why there are dozen or so Hyper-V enlightenments implemented,
>> even though they duplicate KVM PV stuff or that there is kvmvapic
>> with its Windows guest live-patching.
> 
> IIRC the Hyper-V enlightenment stuff is properly publicly documented -

They are only documented from the guest perspective - basically what
the guest of a Hyper-V hypervisor can possibly use.

There is nothing in the Hyper-V TLFS about which of the functionality it
documents is supported or required by any Windows version, or about how
Windows guests actually make use of these features.

Not to mention that the documented interface could say the guest can
expect values A or B or C for parameter X, which is technically true, 
however the actual Hyper-V hypervisor always uses A and that's what
Windows will expect.
 
> whereby last time I checked, the hv-balloon is completely undocumented
> and has to be reverse engineered from the Linux implementation. Please
> correct me if I'm wrong - I am not able to spot references in your cover
> letter as well - I'd be interested into that!

The DM protocol is rather straightforward - the Linux driver contains
well-commented definitions of its messages.

For a hot add the host simply provides the start page frame number and
the count of pages to add and in response receives the number of pages
the guest was actually able to hot add.

For a balloon request the host provides a count of pages it wants the
guest to free and the guest responds with page ranges it has managed
to release.
The reverse happens for an unballoon request.

The protocol consists of just a few simple messages, well described
in the Linux driver.

The VMBus part of the protocol works in the same way as in other VMBus
devices.
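
For a rough idea of the message shapes (paraphrased only - the field
names here are approximations, the Linux driver remains the
authoritative reference):

#include <linux/types.h>

#define __packed __attribute__((packed))        /* as in the kernel */

struct dm_header {
        __u16 type;
        __u16 size;
        __u32 trans_id;
} __packed;

/* A page range: a 40-bit start PFN plus a 24-bit page count. */
union dm_mem_page_range {
        struct {
                __u64 start_page:40;
                __u64 page_cnt:24;
        } finfo;
        __u64 page_range;
} __packed;

/* Host -> guest: please hot add this range ... */
struct dm_hot_add {
        struct dm_header hdr;
        union dm_mem_page_range range;
} __packed;

/* ... guest -> host: how many pages were actually added + status. */
struct dm_hot_add_response {
        struct dm_header hdr;
        __u32 page_count;
        __u32 result;
} __packed;

/* Host -> guest: please release this many pages ... */
struct dm_balloon {
        struct dm_header hdr;
        __u32 num_pages;
        __u32 reservedz;
} __packed;

/* ... guest -> host: the page ranges actually released. */
struct dm_balloon_response {
        struct dm_header hdr;
        __u32 reservedz;
        __u32 more_pages:1;
        __u32 range_count:31;
        union dm_mem_page_range range_array[];
} __packed;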

>>
>> Not to mention many, many devices like e1000 or VMware vmxnet3 even
>> though virtio-net exists or PIIX IDE even though virtio-{blk,scsi} exist.
>> Or the applesmc driver, which is cleanly designed to help run just
>> one proprietary OS.
> 
> IIRC we need the devices either to bootstrap - e.g., use e1000 until we
> can install virtio-net once the guest is up and running, or to support
> older, unmodified guests. I'd like to stress that what you are proposing
> is different in that sense. Your Windows VM will work just fine without
> a hv-balloon device.

There are also "accelerator" devices like kvmvapic or proprietary devices
where an older open standard exists, like sii3112 SATA, even though
IDE / AHCI can be used to bootstrap the guest.
The sii3112 driver was only added in 2017, although the hardware that
it emulates comes from the early 21st century.

> Again, just my personal opinion, I don't make any decisions around here :)
> 

Thanks for taking the time to provide your feedback and insight,
Maciej



* Re: [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon)
  2020-09-24 22:37           ` Maciej S. Szmigiero
@ 2020-09-25  6:49             ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2020-09-25  6:49 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Eduardo Habkost, Michael S. Tsirkin, qemu-devel,
	Markus Armbruster, Paolo Bonzini, Igor Mammedov,
	Vitaly Kuznetsov, Boris Ostrovsky, Richard Henderson

On 25.09.20 00:37, Maciej S. Szmigiero wrote:
> On 23.09.2020 14:48, David Hildenbrand wrote:
> (...)
>>>
>>> I know that this is a source of significant hot removal slowdown, especially
>>> when a "ripple effect" happens on removal:
>>> 1) There are 3 extra DIMMs plugged into the guest: A, B, C.
>>>    A and B are nearly empty, but C is nearly full.
>>>
>>> 2) The host does not know which DIMM is empty and which is full,
>>>    so it requests the guest to unplug the stick C,
>>
>> In theory, the host can simply track inflation requests. In practice,
>> guests tend to sometimes re-access balloon-inflated memory (e.g., simple
>> kexec-style reboot under Linux, kdump on older Linux versions), so it's
>> not completely safe to do.
> 
> You are describing the Linux situation here, while this driver targets
> Windows.

Yeah, this stuff is also broken with Linux under Hyper-V IIRC (I
remember fixing kdump/makedumpfile to not touch inflated pages).

> 
> I think the issues you describe (kexec, etc.) are probably fixable once
> somebody is determined enough to do so.
> 
> I mean, either the old kernel needs to transfer information about
> "forbidden" memory areas to the new kernel or the new kernel needs to
> query these somehow (probably only if it receives a "do it" flag from
> the old kernel).

One idea is to notify the host like "I'll reuse any page again". But
doing that via virtio while you're dying isn't always possible. So the new
kernel would have to do it - but then virtio-balloon would have to be up
and running extremely early during boot, before any possibly-inflated
page gets touched - also a head scratcher. Then, what if your new kernel
doesn't support virtio-balloon ... not easy.

[...]

> But that's just really ugly - realistically the virtio-balloon
> protocol will need to be changed in such case.

Yes, and there were RFCs already.

https://lkml.kernel.org/r/1589276501-16026-1-git-send-email-teawater@gmail.com

[...]

> 
>>>
>>> If the KVM slot limit starts to be a problem in practice then we can
>>> think what can be done about it.
>>> It's always one obstacle less.
>>
>> I'm not a friend of leaving the challenging/fundamental problems to be
> sorted out in the future (e.g., resizing initial boot memory, dealing
>> with fundamental limits - like KVM memory slots or VMA). But I get how
>> it's easier to get something running this way :)
> 
> Constraints can be removed step-by-step, when they actually start to bite.

Not if they are fundamental. Try implementing vfio support for memory
ballooning, you're going to have a hard time.

> 
> It is unrealistic to have a perfect guest resize solution in a single
> patch series upfront, the issue is just too complex.

I disagree, but that's a different discussion. :)

> 
>>>
>>> I see that the same KVM slot limit probably applies also for virtio-mem,
>>> since it uses memory-backend-ram as its backing memory device, too,
>>> right?
>>
>> Yes, one memory backend per virtio-mem device. You usually have one
>> device per NUMA node.
> 
> So if you want to dynamically manage most of the guest memory these
> virtio-mem devices + their backends will be very large
> (circa 1/4 guest size each for a 4-node machine).

Right, or even bigger, depending on the setup.

> 
> That means in practice they won't ever be able to be hot-removed before
> the VM is rebooted since there will very likely be at least a single stuck
> page in each of these backing devices preventing their removal.
> (If hot-removal support is ever enabled for virtio-mem, it looks like it
> is not possible yet even on a VM reboot).

Yeah, hot-removing a virtio-mem device is not one of the important use
cases (it's completely blocked). If you want to unplug memory, adjust
the requested size.

There are plans to support it in the future (for example during reboot),
but I barely see a need for it currently (especially once we support
resizeable memory backends upstream).

> 
> And I can see that removing a single RAM block in virtio-mem is done
> by discarding it via MADV_DONTNEED, just like in ballooning.
> Only the minimum block size is 2 MB and not 4 KB, so all 512 consecutive
> pages in a block will need to be free in order to discard it.

Right, that was a decision to avoid issues known from base-page-size
based ballooning (like fragmenting guest memory, big tracking bitmaps,
incompatibility with vfio, breaking THP and degrading performance,
incompatibility with hugetlbfs ...)

[...]

> 
> During a reboot, virtual DIMM config optimization can be used for now,
> while the best, long-term, solution would be to actually do what the
> Hyper-V hypervisor does in this case: resize the boot memory to match
> the target guest size.
> 
> This way the whole issue of the guest seeing only the boot memory part,
> not the dynamic part, during its boot will no longer be there.
> 

Yeah, I had the same thought back when designing virtio-mem (and looking
into similar handling), but decided that it's impossible to get right -
at least in NUMA setups (and regarding migration). But I can see that
Hyper-V Dynamic Memory doesn't care too much about NUMA at all (and
NUMA-aware ballooning has its own set of issues).

[...]

> 
> So the virtio-mem trade-off is between the size of the boot memory and
> the dynamically managed part(s).
> And also the block size, since as far as I can see a single stuck page
> in a block will prevent it from being discarded.

Yes. It's really something in-between memory ballooning and DIMM-based
memory hot(un)plug. The block size will be comparatively large in some
setups (esp., with vfio). You're definitely not able to squeeze out the
last page of your guest - we have virtio-balloon for that if one really wants
to do that - not the target use case of virtio-mem.

> 
> Of course, we are talking about Linux guests here - Windows guest
> will see just the boot memory part.

Yes, until we have support for it.
[...]

> Even if we switched virtio-balloon to bigger allocations and made the
> protocol return page ranges the allocation is still done by simply
> using an ordinary alloc_pages()-equivalent API.
> I don't see any exported Windows kernel API for allocating balloon
> memory.
> 
> The same goes for adding new RAM.
> 
> The above alone means that supporting virtio-mem semantics on current
> Windows versions is likely not possible.

All I can say is that there are (unofficial?) APIs :) (e.g., the ones
used by DM). But yeah, that's what you get with closed-source operating
systems - and personally, I think, it shouldn't be us that have to suffer.

> 
>>>
>>> People are running Windows guests using QEMU, too.
>>>
>>> That's why there are dozen or so Hyper-V enlightenments implemented,
>>> even though they duplicate KVM PV stuff or that there is kvmvapic
>>> with its Windows guest live-patching.
>>
>> IIRC the Hyper-V enlightenment stuff is properly publicly documented -
> 
> They are only documented from the guest perspective - basically what
> the guest of a Hyper-V hypervisor can possibly use.
> 
> There is nothing in the Hyper-V TLFS about which of the functionality it
> documents is supported or required by any Windows version, or about how
> Windows guests actually make use of these features.
> 
> Not to mention that the documented interface could say the guest can
> expect values A or B or C for parameter X, which is technically true, 
> however the actual Hyper-V hypervisor always uses A and that's what
> Windows will expect.
>  

Interesting, thanks.

>> whereby last time I checked, the hv-balloon is completely undocumented
>> and has to be reverse engineered from the Linux implementation. Please
>> correct me if I'm wrong - I am not able to spot references in your cover
>> letter as well - I'd be interested into that!
> 
> The DM protocol is rather straightforward - the Linux driver contains
> well-commented definitions of its messages.
> 
> For a hot add the host simply provides the start page frame number and
> the count of pages to add and in response receives the number of pages
> the guest was actually able to hot add.
> 
> For a balloon request the host provides a count of pages it wants the
> guest to free and the guest responds with page ranges it has managed
> to release.
> The reverse happens for an unballoon request.
> 
> The protocol consists of just a few simple messages, well described
> in the Linux driver.
> 
> The VMBus part of the protocol works in the same way as in other VMBus
> devices.
> 

Yeah, but then there are things like

https://lkml.kernel.org/r/20200107130950.2983-1-Tianyu.Lan@microsoft.com

that left me clueless - it seems like we're missing some things, maybe
there is more (I am pretty sure there is more ... :) )? (I do some work
on the Linux hv_balloon driver every now and then when working on
optimizing the other balloon drivers / virtio-mem).

>>>
>>> Not to mention many, many devices like e1000 or VMware vmxnet3 even
>>> though virtio-net exists or PIIX IDE even though virtio-{blk,scsi} exist.
>>> Or the applesmc driver, which is cleanly designed to help run just
>>> one proprietary OS.
>>
>> IIRC we need the devices either to bootstrap - e.g., use e1000 until we
>> can install virtio-net once the guest is up and running, or to support
>> older, unmodified guests. I'd like to stress that what you are proposing
>> is different in that sense. Your Windows VM will work just fine without
>> a hv-balloon device.
> 
> There are also "accelerator" devices like kvmvapic or proprietary devices
> where an older open standard exists, like sii3112 SATA, even though
> IDE / AHCI can be used to bootstrap the guest.
> The sii3112 driver was only added in 2017, although the hardware that
> it emulates comes from early 21st century.
> 
>> Again, just my personal opinion, I don't make any decisions around here :)
>>
> 
> Thanks for taking the time to provide your feedback and insight,

I'll try to give your series a look. I can definitely say that

1) I dislike that an external entity has to do vDIMM adaptions /
ballooning adaptions when rebooting or when wanting to resize a guest.

2) I am not sure ignoring the kvm memory slot limit is a good idea. (or
the fundamental issue of resizing boot memory - ever)

Once you have the current approach upstream (vDIMMs, ballooning), there
is no easy way to change that later (requires deprecating, etc.).

But we talked about that already.

-- 
Thanks,

David / dhildenb




end of thread, other threads:[~2020-09-25  6:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-20 13:25 [PATCH 0/3] Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
2020-09-20 13:25 ` [PATCH 1/3] haprot: add a device for memory hot-add protocols Maciej S. Szmigiero
2020-09-20 13:25 ` [PATCH 2/3] Add Hyper-V Dynamic Memory Protocol definitions Maciej S. Szmigiero
2020-09-20 13:25 ` [PATCH 3/3] Add a Hyper-V Dynamic Memory Protocol driver (hv-balloon) Maciej S. Szmigiero
2020-09-20 14:16 ` [PATCH 0/3] " no-reply
2020-09-21  9:00 ` Igor Mammedov
2020-09-21  9:29   ` David Hildenbrand
2020-09-21  9:10 ` David Hildenbrand
2020-09-21 22:22   ` Maciej S. Szmigiero
2020-09-22  7:26     ` David Hildenbrand
2020-09-22 23:19       ` Maciej S. Szmigiero
2020-09-23 12:48         ` David Hildenbrand
2020-09-24 22:37           ` Maciej S. Szmigiero
2020-09-25  6:49             ` David Hildenbrand
