From: Juan Quintela <quintela@redhat.com>
To: qemu-devel@nongnu.org
Cc: "Richard Henderson" <richard.henderson@linaro.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Laurent Vivier" <lvivier@redhat.com>,
	"Ilya Leoshkevich" <iii@linux.ibm.com>,
	"Halil Pasic" <pasic@linux.ibm.com>,
	"Marc-André Lureau" <marcandre.lureau@redhat.com>,
	"Coiby Xu" <Coiby.Xu@gmail.com>,
	"Eric Farman" <farman@linux.ibm.com>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Christian Borntraeger" <borntraeger@linux.ibm.com>,
	"Stefan Hajnoczi" <stefanha@redhat.com>,
	"Philippe Mathieu-Daudé" <philmd@linaro.org>,
	"Stefan Berger" <stefanb@linux.vnet.ibm.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Eduardo Habkost" <eduardo@habkost.net>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Thomas Huth" <thuth@redhat.com>,
	"David Hildenbrand" <david@redhat.com>,
	"Marcel Apfelbaum" <marcel.apfelbaum@gmail.com>,
	"John Snow" <jsnow@redhat.com>,
	"Yanan Wang" <wangyanan55@huawei.com>,
	"Daniel P. Berrangé" <berrange@redhat.com>,
	"Vladimir Sementsov-Ogievskiy" <vsementsov@yandex-team.ru>,
	qemu-block@nongnu.org, "Paolo Bonzini" <pbonzini@redhat.com>,
	"Juan Quintela" <quintela@redhat.com>,
	"Fam Zheng" <fam@euphon.net>,
	qemu-s390x@nongnu.org, "Jing Qi" <jinqi@redhat.com>,
	"Peter Xu" <peterx@redhat.com>
Subject: [PULL 19/26] virtio-mem: Proper support for preallocation with migration
Date: Thu,  2 Feb 2023 17:06:33 +0100	[thread overview]
Message-ID: <20230202160640.2300-20-quintela@redhat.com> (raw)
In-Reply-To: <20230202160640.2300-1-quintela@redhat.com>

From: David Hildenbrand <david@redhat.com>

Ordinary memory preallocation runs when QEMU starts up and creates the
memory backends, before processing the incoming migration stream. With
virtio-mem, we don't know which memory blocks to preallocate until
migration has started. Now that we migrate the virtio-mem bitmap early,
before any RAM content, we can safely preallocate memory for all plugged
memory blocks before the actual RAM migration starts.

This is especially relevant for the following cases:

(1) User errors

With hugetlb/files, if we don't have sufficient backend memory available on
the migration destination, QEMU will crash (SIGBUS) during RAM migration
once it runs out of backend memory. Preallocating memory before the actual
RAM migration allows us to fail gracefully and inform the user about the
setup problem; a minimal example setup is sketched below.

(2) Excluded memory ranges during migration

For example, virtio-balloon free page hinting will exclude some pages
from getting migrated. In that case, we won't crash during RAM
migration, but only later, when running the VM on the destination, which
is bad.
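
A minimal sketch of the hugetlb setup from case (1) (IDs, paths and sizes
are illustrative and unrelated options are omitted; note that, per patch
17/26, "prealloc=on" belongs on the virtio-mem device, not on the memory
backend):

  qemu-system-x86_64 \
    -m 4G,maxmem=36G \
    -object memory-backend-file,id=mem0,mem-path=/dev/hugepages,size=32G,share=on \
    -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=16G,prealloc=on \
    ...

If the destination lacks enough free hugetlb pages for the plugged blocks,
the incoming migration now fails cleanly instead of the destination QEMU
getting killed by SIGBUS in the middle of RAM migration.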

To fix this for new QEMU machine types that migrate the bitmap early,
preallocate the memory early, before any RAM migration. For old QEMU
machine types, only warn.

Getting postcopy right is a bit tricky, but we essentially now implement
the same (problematic) preallocation logic as ordinary preallocation:
preallocate memory early and discard it again before precopy starts. During
ordinary preallocation, discarding of RAM happens when postcopy is advised.
As the state (the bitmap) is loaded after postcopy was advised but before
postcopy starts listening, we have to immediately discard the memory we
just preallocated again ourselves.
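
As a rough sketch of the resulting ordering on the incoming side with
postcopy enabled (following the description above):

  1. POSTCOPY_INCOMING_ADVISE -> all (ordinarily preallocated) RAM is discarded
  2. early device state load  -> virtio-mem bitmap restored, plugged ranges
                                 preallocated, then immediately discarded again
  3. precopy RAM migration    -> RAM content is streamed to the destination
  4. postcopy listen/run      -> remaining pages are requested from the source
                                 on demand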

Note that nothing (not even hugetlb reservations) guarantees for postcopy
that backend memory (especially, hugetlb pages) is still free after it
was freed once while discarding RAM. Still, allocating that memory at
least once helps catch some basic setup problems.

Before this change, trying to restore a VM when insufficient hugetlb
pages are available results in the process crashing due to a "Bus error"
(SIGBUS). With this change, QEMU fails gracefully:

  qemu-system-x86_64: qemu_prealloc_mem: preallocating memory failed: Bad address
  qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-mem-device-early'
  qemu-system-x86_64: load of migration failed: Cannot allocate memory

And we can even introspect the early migration data, including the
bitmap:
  $ ./scripts/analyze-migration.py -f STATEFILE
  {
  "ram (2)": {
      "section sizes": {
          "0000:00:03.0/mem0": "0x0000000780000000",
          "0000:00:04.0/mem1": "0x0000000780000000",
          "pc.ram": "0x0000000100000000",
          "/rom@etc/acpi/tables": "0x0000000000020000",
          "pc.bios": "0x0000000000040000",
          "0000:00:02.0/e1000.rom": "0x0000000000040000",
          "pc.rom": "0x0000000000020000",
          "/rom@etc/table-loader": "0x0000000000001000",
          "/rom@etc/acpi/rsdp": "0x0000000000001000"
      }
  },
  "0000:00:03.0/virtio-mem-device-early (51)": {
      "tmp": "00 00 00 01 40 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
      "size": "0x0000000040000000",
      "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]
  },
  "0000:00:04.0/virtio-mem-device-early (53)": {
      "tmp": "00 00 00 08 c0 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
      "size": "0x00000001fa400000",
      "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]
  },
  [...]

Reported-by: Jing Qi <jinqi@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 hw/virtio/virtio-mem.c | 87 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index ca37949df8..957fe77dc0 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -204,6 +204,30 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
+static int virtio_mem_for_each_plugged_range(const VirtIOMEM *vmem, void *arg,
+                                             virtio_mem_range_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_bit = find_first_bit(vmem->bitmap, vmem->bitmap_size);
+    while (first_bit < vmem->bitmap_size) {
+        offset = first_bit * vmem->block_size;
+        last_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * vmem->block_size;
+
+        ret = cb(vmem, arg, offset, size);
+        if (ret) {
+            break;
+        }
+        first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+                                  last_bit + 2);
+    }
+    return ret;
+}
+
 /*
  * Adjust the memory section to cover the intersection with the given range.
  *
@@ -938,6 +962,10 @@ static int virtio_mem_post_load(void *opaque, int version_id)
     RamDiscardListener *rdl;
     int ret;
 
+    if (vmem->prealloc && !vmem->early_migration) {
+        warn_report("Proper preallocation with migration requires a newer QEMU machine");
+    }
+
     /*
      * We started out with all memory discarded and our memory region is mapped
      * into an address space. Replay, now that we updated the bitmap.
@@ -957,6 +985,64 @@ static int virtio_mem_post_load(void *opaque, int version_id)
     return virtio_mem_restore_unplugged(vmem);
 }
 
+static int virtio_mem_prealloc_range_cb(const VirtIOMEM *vmem, void *arg,
+                                        uint64_t offset, uint64_t size)
+{
+    void *area = memory_region_get_ram_ptr(&vmem->memdev->mr) + offset;
+    int fd = memory_region_get_fd(&vmem->memdev->mr);
+    Error *local_err = NULL;
+
+    qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return -ENOMEM;
+    }
+    return 0;
+}
+
+static int virtio_mem_post_load_early(void *opaque, int version_id)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+    RAMBlock *rb = vmem->memdev->mr.ram_block;
+    int ret;
+
+    if (!vmem->prealloc) {
+        return 0;
+    }
+
+    /*
+     * We restored the bitmap and verified that the basic properties
+     * match on source and destination, so we can go ahead and preallocate
+     * memory for all plugged memory blocks, before actual RAM migration starts
+     * touching this memory.
+     */
+    ret = virtio_mem_for_each_plugged_range(vmem, NULL,
+                                            virtio_mem_prealloc_range_cb);
+    if (ret) {
+        return ret;
+    }
+
+    /*
+     * This is tricky: postcopy wants to start with a clean slate. On
+     * POSTCOPY_INCOMING_ADVISE, postcopy code discards all (ordinarily
+     * preallocated) RAM such that postcopy will work as expected later.
+     *
+     * However, we run after POSTCOPY_INCOMING_ADVISE -- but before actual
+     * RAM migration. So let's discard all memory again. This looks like an
+     * expensive NOP, but actually serves a purpose: we made sure that we
+     * were able to allocate all required backend memory once. We cannot
+     * guarantee that the backend memory we will free will remain free
+     * until we need it during postcopy, but at least we can catch the
+     * obvious setup issues this way.
+     */
+    if (migration_incoming_postcopy_advised()) {
+        if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) {
+            return -EBUSY;
+        }
+    }
+    return 0;
+}
+
 typedef struct VirtIOMEMMigSanityChecks {
     VirtIOMEM *parent;
     uint64_t addr;
@@ -1068,6 +1154,7 @@ static const VMStateDescription vmstate_virtio_mem_device_early = {
     .minimum_version_id = 1,
     .version_id = 1,
     .early_setup = true,
+    .post_load = virtio_mem_post_load_early,
     .fields = (VMStateField[]) {
         VMSTATE_WITH_TMP(VirtIOMEM, VirtIOMEMMigSanityChecks,
                          vmstate_virtio_mem_sanity_checks),
-- 
2.39.1



Thread overview: 31+ messages
2023-02-02 16:06 [PULL 00/26] Next patches Juan Quintela
2023-02-02 16:06 ` [PULL 01/26] migration: Fix migration crash when target psize larger than host Juan Quintela
2023-02-02 16:06 ` [PULL 02/26] migration: No save_live_pending() method uses the QEMUFile parameter Juan Quintela
2023-02-02 16:06 ` [PULL 03/26] migration: Split save_live_pending() into state_pending_* Juan Quintela
2023-02-02 16:06 ` [PULL 04/26] migration: Remove unused threshold_size parameter Juan Quintela
2023-02-02 16:06 ` [PULL 05/26] migration: simplify migration_iteration_run() Juan Quintela
2023-02-02 16:06 ` [PULL 06/26] util/userfaultfd: Add uffd_open() Juan Quintela
2023-02-02 16:06 ` [PULL 07/26] migration/ram: Fix populate_read_range() Juan Quintela
2023-02-02 16:06 ` [PULL 08/26] migration/ram: Fix error handling in ram_write_tracking_start() Juan Quintela
2023-02-02 16:06 ` [PULL 09/26] migration/ram: Don't explicitly unprotect when unregistering uffd-wp Juan Quintela
2023-02-02 16:06 ` [PULL 10/26] migration/ram: Rely on used_length for uffd_change_protection() Juan Quintela
2023-02-02 16:06 ` [PULL 11/26] migration/ram: Optimize ram_write_tracking_start() for RamDiscardManager Juan Quintela
2023-02-02 16:06 ` [PULL 12/26] migration/savevm: Move more savevm handling into vmstate_save() Juan Quintela
2023-02-02 16:06 ` [PULL 13/26] migration/savevm: Prepare vmdesc json writer in qemu_savevm_state_setup() Juan Quintela
2023-02-02 16:06 ` [PULL 14/26] migration/savevm: Allow immutable device state to be migrated early (i.e., before RAM) Juan Quintela
2023-02-02 16:06 ` [PULL 15/26] migration/vmstate: Introduce VMSTATE_WITH_TMP_TEST() and VMSTATE_BITMAP_TEST() Juan Quintela
2023-02-02 16:06 ` [PULL 16/26] migration/ram: Factor out check for advised postcopy Juan Quintela
2023-02-02 16:06 ` [PULL 17/26] virtio-mem: Fail if a memory backend with "prealloc=on" is specified Juan Quintela
2023-02-02 16:06 ` [PULL 18/26] virtio-mem: Migrate immutable properties early Juan Quintela
2023-02-02 16:06 ` Juan Quintela [this message]
2023-02-02 16:06 ` [PULL 20/26] migration: Show downtime during postcopy phase Juan Quintela
2023-02-02 16:06 ` [PULL 21/26] migration/rdma: fix return value for qio_channel_rdma_{readv, writev} Juan Quintela
2023-02-02 16:06 ` [PULL 22/26] migration: Add canary to VMSTATE_END_OF_LIST Juan Quintela
2023-02-02 16:06 ` [PULL 23/26] migration: Perform vmsd structure check during tests Juan Quintela
2023-02-02 16:06 ` [PULL 24/26] migration/dirtyrate: Show sample pages only in page-sampling mode Juan Quintela
2023-02-02 16:06 ` [PULL 25/26] io: Add support for MSG_PEEK for socket channel Juan Quintela
2023-02-02 16:06 ` [PULL 26/26] migration: check magic value for deciding the mapping of channels Juan Quintela
2023-02-04 10:19 ` [PULL 00/26] Next patches Peter Maydell
2023-02-06 22:06   ` Peter Xu
2023-02-06 23:33     ` Juan Quintela
2023-02-07  0:49 ` Juan Quintela
