qemu-devel.nongnu.org archive mirror
* [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager
@ 2021-09-04 16:09 David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
                   ` (9 more replies)
  0 siblings, 10 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

virtio-mem exposes a dynamic amount of memory within RAMBlocks by
coordinating with the VM. Memory within a RAMBlock can either get
plugged and consequently used by the VM, or unplugged and consequently no
longer used by the VM. Logical unplug is realized by discarding the
physical memory backing for virtual memory ranges, similar to memory
ballooning.

However, there are important differences from virtio-balloon:

a) A virtio-mem device only operates on its assigned memory region /
   RAMBlock ("device memory")
b) Initially, all device memory is logically unplugged
c) Virtual machines will never accidentally reuse memory that is currently
   logically unplugged. The spec defines most accesses to unplugged memory
   as "undefined behavior" -- except reading unplugged memory, which is
   currently expected to work, but that will change in the future.
d) The (un)plug granularity is in the range of megabytes -- "memory blocks"
e) The state (plugged/unplugged) of a memory block is always known and
   properly tracked.
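
The properties listed above can be pictured with a small standalone model.
This is only a sketch, not QEMU code: MemDeviceModel, the 64-block limit and
all function names are assumptions made for illustration. It keeps one bit
per memory block, starts with everything unplugged, and answers whether a
byte range is fully plugged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per memory block, 1 = plugged, 0 = unplugged. */
#define MODEL_NR_BLOCKS 64u

typedef struct {
    uint64_t block_size; /* (un)plug granularity in bytes, e.g. 2 MiB */
    uint64_t plugged;    /* bit i set <=> block i is plugged */
} MemDeviceModel;

/* Initially, all device memory is logically unplugged. */
static void model_init(MemDeviceModel *m, uint64_t block_size)
{
    assert(block_size != 0);
    m->block_size = block_size;
    m->plugged = 0;
}

/* Record a state change of a single block. */
static void model_set_plugged(MemDeviceModel *m, unsigned block, bool plugged)
{
    assert(block < MODEL_NR_BLOCKS);
    if (plugged) {
        m->plugged |= UINT64_C(1) << block;
    } else {
        m->plugged &= ~(UINT64_C(1) << block);
    }
}

/* True if every block touched by [offset, offset + size) is plugged. */
static bool model_range_is_plugged(const MemDeviceModel *m,
                                   uint64_t offset, uint64_t size)
{
    assert(size != 0);
    uint64_t first = offset / m->block_size;
    uint64_t last = (offset + size - 1) / m->block_size;

    assert(last < MODEL_NR_BLOCKS);
    for (uint64_t i = first; i <= last; i++) {
        if (!(m->plugged & (UINT64_C(1) << i))) {
            return false;
        }
    }
    return true;
}
```

Because the state of every block is always known, a consumer such as
migration code can ask about any range without guessing.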

Whenever memory blocks within the RAMBlock get (un)plugged, changes are
communicated via the RamDiscardManager to other QEMU subsystems, most
prominently vfio which updates the DMA mapping accordingly. "Unplugging"
corresponds to "discarding" and "plugging" corresponds to "populating".

While migrating (precopy/postcopy), the state of such memory blocks cannot
change, as virtio-mem will reject any guest requests that would change
the state of blocks with "busy". We don't want to migrate such logically
unplugged memory, because it can result in unintended memory consumption
both on the source (when reading memory from some memory backends) and on
the destination (when writing memory). Further, migration time can be
heavily reduced when skipping logically unplugged blocks and we avoid
populating unnecessary page tables in Linux.

Right now, virtio-mem reuses the free page hinting infrastructure during
precopy to exclude all logically unplugged ("discarded") parts from the
migration stream. However, there are some scenarios that are not handled
properly and need fixing. Further, there are some ugly corner cases in
postcopy code and background snapshotting code that similarly have to
handle such special RAMBlocks.

Let's reuse the RamDiscardManager infrastructure to essentially handle
precopy, postcopy and background snapshots cleanly, which means:

a) In precopy code, fixing up the initial dirty bitmaps (in the RAMBlock
   and e.g., KVM) to exclude discarded ranges.
b) In postcopy code, placing a zeropage when requested to handle a page
   falling into a discarded range -- because the source will never send it.
   Further, fix up the dirty bitmap when overwriting it in recovery mode.
c) In background snapshot code, never populating discarded ranges, not even
   with the shared zeropage, to avoid unintended memory consumption,
   especially in the future with hugetlb and shmem.

Detail: When realizing a virtio-mem device, it will register the RAM
        for migration via vmstate_register_ram(). Further, it will
        set itself as the RamDiscardManager for the corresponding memory
        region of the RAMBlock via memory_region_set_ram_discard_manager().
        Last but not least, memory device code will actually map the
        memory region into guest physical address space. So migration
        code can always properly identify such RAMBlocks.

Tested with precopy/postcopy on shmem, where even reading unpopulated
memory ranges will populate actual memory and not the shared zeropage.
Tested with background snapshots on anonymous memory, because other
backends are not supported yet with upstream Linux.

Ideally, this should all go via the migration tree.

v4 -> v5:
- "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
   destination"
-- Use ROUND_DOWN and fix compile warning on 32 bit
-- Use int128_make64() instead of wrongly int128_get64()
- "migration: Simplify alignment and alignment checks"
-- Use ROUND_DOWN where possible instead of QEMU_ALIGN_DOWN and fix
   compilation warning on 32 bit
- "migration/ram: Factor out populating pages readable in
   ram_block_populate_pages()"
-- Rename functions, add a comment.
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on background
   snapshots"
-- Adjust to changed function names

v3 -> v4:
- Added ACKs
- "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
   destination"
-- Use QEMU_ALIGN_DOWN() to align to ram pagesize
- "migration: Simplify alignment and alignment checks"
-- Added
- "migration/ram: Factor out populating pages readable in
   ram_block_populate_pages()"
-- Added
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on background
   snapshots"
-- Simplified due to factored out code

v2 -> v3:
- "migration/ram: Don't passs RAMState to
   migration_clear_memory_region_dirty_bitmap_*()"
-- Added to make the next patch easier to implement
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration
   source"
-- Fixup the dirty bitmaps only initially and during postcopy recovery,
   not after every bitmap sync. Also properly clear the dirty bitmaps e.g.,
   in KVM. [Peter]
- "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
   destination"
-- Take care of proper host-page alignment [Peter]

v1 -> v2:
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on the
   migration source"
-- Added a note on how it interacts with the clear_bmap and what we might want
   to further optimize in the future when synchronizing bitmaps.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Philippe Mathieu-Daudé <philmd@redhat.com>

David Hildenbrand (9):
  memory: Introduce replay_discarded callback for RamDiscardManager
  virtio-mem: Implement replay_discarded RamDiscardManager callback
  migration/ram: Don't passs RAMState to
    migration_clear_memory_region_dirty_bitmap_*()
  migration/ram: Handle RAMBlocks with a RamDiscardManager on the
    migration source
  virtio-mem: Drop precopy notifier
  migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
    destination
  migration: Simplify alignment and alignment checks
  migration/ram: Factor out populating pages readable in
    ram_block_populate_pages()
  migration/ram: Handle RAMBlocks with a RamDiscardManager on background
    snapshots

 hw/virtio/virtio-mem.c         |  92 ++++++++++-------
 include/exec/memory.h          |  21 ++++
 include/hw/virtio/virtio-mem.h |   3 -
 migration/migration.c          |   6 +-
 migration/postcopy-ram.c       |  40 ++++++--
 migration/ram.c                | 180 +++++++++++++++++++++++++++++----
 migration/ram.h                |   1 +
 softmmu/memory.c               |  11 ++
 8 files changed, 284 insertions(+), 70 deletions(-)

-- 
2.31.1




* [PATCH v5 1/9] memory: Introduce replay_discarded callback for RamDiscardManager
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback David Hildenbrand
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Introduce replay_discarded callback similar to our existing
replay_populated callback, to be used by migration code to never migrate
discarded memory.

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/memory.h | 21 +++++++++++++++++++++
 softmmu/memory.c      | 11 +++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3d417d317..93e972b55a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -537,6 +537,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
 }
 
 typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
+typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -625,6 +626,21 @@ struct RamDiscardManagerClass {
                             MemoryRegionSection *section,
                             ReplayRamPopulate replay_fn, void *opaque);
 
+    /**
+     * @replay_discarded:
+     *
+     * Call the #ReplayRamDiscard callback for all discarded parts within the
+     * #MemoryRegionSection via the #RamDiscardManager.
+     *
+     * @rdm: the #RamDiscardManager
+     * @section: the #MemoryRegionSection
+     * @replay_fn: the #ReplayRamDiscard callback
+     * @opaque: pointer to forward to the callback
+     */
+    void (*replay_discarded)(const RamDiscardManager *rdm,
+                             MemoryRegionSection *section,
+                             ReplayRamDiscard replay_fn, void *opaque);
+
     /**
      * @register_listener:
      *
@@ -669,6 +685,11 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          ReplayRamPopulate replay_fn,
                                          void *opaque);
 
+void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscard replay_fn,
+                                          void *opaque);
+
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
                                            MemoryRegionSection *section);
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bfedaf9c4d..cd86205627 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2076,6 +2076,17 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
     return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
+void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscard replay_fn,
+                                          void *opaque)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
+
+    g_assert(rdmc->replay_discarded);
+    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+}
+
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
                                            MemoryRegionSection *section)
-- 
2.31.1




* [PATCH v5 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Implement it similar to the replay_populated callback.
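
As a rough illustration of the replay pattern (not the QEMU implementation),
the sketch below walks a block bitmap and invokes a callback once per maximal
run of unplugged blocks. The bitmap layout (one bit per block, LSB first),
the callback type, and all names are assumptions for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: receives one contiguous discarded byte range. */
typedef void (*discard_range_cb)(uint64_t offset, uint64_t size, void *opaque);

/* Bit i of the bitmap set <=> block i is plugged (assumed layout). */
static bool block_is_plugged(const uint8_t *bitmap, size_t bit)
{
    return bitmap[bit / 8] & (1u << (bit % 8));
}

/* Call cb once for every maximal run of unplugged (zero) blocks. */
static void for_each_unplugged_range(const uint8_t *bitmap, size_t nr_blocks,
                                     uint64_t block_size,
                                     discard_range_cb cb, void *opaque)
{
    size_t i = 0;

    assert(bitmap && cb && block_size != 0);
    while (i < nr_blocks) {
        size_t start;

        if (block_is_plugged(bitmap, i)) {
            i++;
            continue;
        }
        /* Found the start of a run; extend it to the next plugged block. */
        start = i;
        while (i < nr_blocks && !block_is_plugged(bitmap, i)) {
            i++;
        }
        cb(start * block_size, (i - start) * block_size, opaque);
    }
}

/* Example consumer: collects simple statistics about discarded ranges. */
typedef struct {
    unsigned nr_ranges;
    uint64_t first_offset;
    uint64_t total_bytes;
} DiscardStats;

static void count_discarded(uint64_t offset, uint64_t size, void *opaque)
{
    DiscardStats *stats = opaque;

    if (stats->nr_ranges == 0) {
        stats->first_offset = offset;
    }
    stats->nr_ranges++;
    stats->total_bytes += size;
}
```

Reporting whole runs instead of single blocks keeps the number of callback
invocations small when large parts of the device memory are unplugged.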

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 58 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index df91e454b2..284096ec5f 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -228,6 +228,38 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
     return ret;
 }
 
+static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
+                                                 MemoryRegionSection *s,
+                                                 void *arg,
+                                                 virtio_mem_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_bit = s->offset_within_region / vmem->block_size;
+    first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, first_bit);
+    while (first_bit < vmem->bitmap_size) {
+        MemoryRegionSection tmp = *s;
+
+        offset = first_bit * vmem->block_size;
+        last_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+                                 first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * vmem->block_size;
+
+        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+            break;
+        }
+        ret = cb(&tmp, arg);
+        if (ret) {
+            break;
+        }
+        first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+                                       last_bit + 2);
+    }
+    return ret;
+}
+
 static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
 {
     RamDiscardListener *rdl = arg;
@@ -1170,6 +1202,31 @@ static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                             virtio_mem_rdm_replay_populated_cb);
 }
 
+static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
+                                              void *arg)
+{
+    struct VirtIOMEMReplayData *data = arg;
+
+    ((ReplayRamDiscard)data->fn)(s, data->opaque);
+    return 0;
+}
+
+static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                            MemoryRegionSection *s,
+                                            ReplayRamDiscard replay_fn,
+                                            void *opaque)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
+    struct VirtIOMEMReplayData data = {
+        .fn = replay_fn,
+        .opaque = opaque,
+    };
+
+    g_assert(s->mr == &vmem->memdev->mr);
+    virtio_mem_for_each_unplugged_section(vmem, s, &data,
+                                          virtio_mem_rdm_replay_discarded_cb);
+}
+
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
                                              RamDiscardListener *rdl,
                                              MemoryRegionSection *s)
@@ -1234,6 +1291,7 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data)
     rdmc->get_min_granularity = virtio_mem_rdm_get_min_granularity;
     rdmc->is_populated = virtio_mem_rdm_is_populated;
     rdmc->replay_populated = virtio_mem_rdm_replay_populated;
+    rdmc->replay_discarded = virtio_mem_rdm_replay_discarded;
     rdmc->register_listener = virtio_mem_rdm_register_listener;
     rdmc->unregister_listener = virtio_mem_rdm_unregister_listener;
 }
-- 
2.31.1




* [PATCH v5 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*()
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-09  7:26   ` Juan Quintela
  2021-09-04 16:09 ` [PATCH v5 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source David Hildenbrand
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

The parameter is unused, let's drop it.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 7a43bfd7af..bb908822d5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -789,8 +789,7 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     return find_next_bit(bitmap, size, start);
 }
 
-static void migration_clear_memory_region_dirty_bitmap(RAMState *rs,
-                                                       RAMBlock *rb,
+static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
                                                        unsigned long page)
 {
     uint8_t shift;
@@ -818,8 +817,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMState *rs,
 }
 
 static void
-migration_clear_memory_region_dirty_bitmap_range(RAMState *rs,
-                                                 RAMBlock *rb,
+migration_clear_memory_region_dirty_bitmap_range(RAMBlock *rb,
                                                  unsigned long start,
                                                  unsigned long npages)
 {
@@ -832,7 +830,7 @@ migration_clear_memory_region_dirty_bitmap_range(RAMState *rs,
      * exclusive.
      */
     for (i = chunk_start; i < chunk_end; i += chunk_pages) {
-        migration_clear_memory_region_dirty_bitmap(rs, rb, i);
+        migration_clear_memory_region_dirty_bitmap(rb, i);
     }
 }
 
@@ -850,7 +848,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
      * the page in the chunk we clear the remote dirty bitmap for all.
      * Clearing it earlier won't be a problem, but too late will.
      */
-    migration_clear_memory_region_dirty_bitmap(rs, rb, page);
+    migration_clear_memory_region_dirty_bitmap(rb, page);
 
     ret = test_and_clear_bit(page, rb->bmap);
     if (ret) {
@@ -2777,8 +2775,7 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
          * are initially set. Otherwise those skipped pages will be sent in
          * the next round after syncing from the memory region bitmap.
          */
-        migration_clear_memory_region_dirty_bitmap_range(ram_state, block,
-                                                         start, npages);
+        migration_clear_memory_region_dirty_bitmap_range(block, start, npages);
         ram_state->migration_dirty_pages -=
                       bitmap_count_one_with_offset(block->bmap, start, npages);
         bitmap_clear(block->bmap, start, npages);
-- 
2.31.1




* [PATCH v5 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 5/9] virtio-mem: Drop precopy notifier David Hildenbrand
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

We don't want to migrate memory that corresponds to discarded ranges as
managed by a RamDiscardManager responsible for the mapped memory region of
the RAMBlock. The content of these pages is essentially stale and
without any guarantees for the VM ("logically unplugged").

Depending on the underlying memory type, even reading memory might populate
memory on the source, resulting in an undesired memory consumption. Of
course, on the destination, even writing a zeropage consumes memory,
which we also want to avoid (similar to free page hinting).

Currently, virtio-mem tries to achieve that goal (not migrating "unplugged"
memory that was discarded) by going via qemu_guest_free_page_hint(), but
that approach is hackish and incomplete.

For example, background snapshots still end up reading all memory, as
they don't do bitmap syncs. Postcopy recovery code will re-add
previously cleared bits to the dirty bitmap and migrate them.

Let's consult the RamDiscardManager after setting up our dirty bitmap
initially and when postcopy recovery code reinitializes it: clear
corresponding bits in the dirty bitmaps (e.g., of the RAMBlock and inside
KVM). It's important to fix up the dirty bitmap *after* our initial bitmap
sync, such that the corresponding dirty bits in KVM are actually cleared.

As colo is incompatible with discarding of RAM and inhibits it, we don't
have to bother.

Note: if a misbehaving guest used discarded ranges after migration
started, we would still migrate that memory; however, by then that memory
would already be populated on the migration source.
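
The bitmap fixup and its accounting can be sketched in isolation as follows.
This is a simplified model, not the code in migration/ram.c: DirtyModel, the
64-page limit and the function names are hypothetical. The point it shows is
that only bits that were actually set are subtracted from the dirty-page
counter, so clearing an already clean range does not skew the accounting.

```c
#include <assert.h>
#include <stdint.h>

#define PAGES_MAX 64u

/* Hypothetical stand-in for a RAMBlock dirty bitmap plus its page counter. */
typedef struct {
    uint64_t dirty;       /* bit i set <=> page i still has to be sent */
    uint64_t dirty_pages; /* number of set bits, kept in sync */
} DirtyModel;

/* Migration starts with every page marked dirty. */
static void dirty_model_init_all_dirty(DirtyModel *d, unsigned npages)
{
    assert(npages >= 1 && npages <= PAGES_MAX);
    d->dirty = npages == 64 ? UINT64_MAX : (UINT64_C(1) << npages) - 1;
    d->dirty_pages = npages;
}

/*
 * Clear [start, start + npages), as done for a discarded range, and return
 * how many bits were actually set before clearing.
 */
static uint64_t dirty_model_clear_range(DirtyModel *d, unsigned start,
                                        unsigned npages)
{
    uint64_t cleared = 0;

    assert(start + npages <= PAGES_MAX);
    for (unsigned i = start; i < start + npages; i++) {
        uint64_t bit = UINT64_C(1) << i;

        if (d->dirty & bit) {
            d->dirty &= ~bit;
            cleared++;
        }
    }
    d->dirty_pages -= cleared;
    return cleared;
}
```

In this model, clearing an overlapping range a second time only subtracts
the pages that were still dirty.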

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index bb908822d5..3be969f749 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -858,6 +858,60 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
+static void dirty_bitmap_clear_section(MemoryRegionSection *section,
+                                       void *opaque)
+{
+    const hwaddr offset = section->offset_within_region;
+    const hwaddr size = int128_get64(section->size);
+    const unsigned long start = offset >> TARGET_PAGE_BITS;
+    const unsigned long npages = size >> TARGET_PAGE_BITS;
+    RAMBlock *rb = section->mr->ram_block;
+    uint64_t *cleared_bits = opaque;
+
+    /*
+     * We don't grab ram_state->bitmap_mutex because we expect to run
+     * only when starting migration or during postcopy recovery where
+     * we don't have concurrent access.
+     */
+    if (!migration_in_postcopy() && !migrate_background_snapshot()) {
+        migration_clear_memory_region_dirty_bitmap_range(rb, start, npages);
+    }
+    *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
+    bitmap_clear(rb->bmap, start, npages);
+}
+
+/*
+ * Exclude all dirty pages from migration that fall into a discarded range as
+ * managed by a RamDiscardManager responsible for the mapped memory region of
+ * the RAMBlock. Clear the corresponding bits in the dirty bitmaps.
+ *
+ * Discarded pages ("logically unplugged") have undefined content and must
+ * not get migrated, because even reading these pages for migration might
+ * result in undesired behavior.
+ *
+ * Returns the number of cleared bits in the RAMBlock dirty bitmap.
+ *
+ * Note: The result is only stable while migrating (precopy/postcopy).
+ */
+static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
+{
+    uint64_t cleared_bits = 0;
+
+    if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+        MemoryRegionSection section = {
+            .mr = rb->mr,
+            .offset_within_region = 0,
+            .size = int128_make64(qemu_ram_get_used_length(rb)),
+        };
+
+        ram_discard_manager_replay_discarded(rdm, &section,
+                                             dirty_bitmap_clear_section,
+                                             &cleared_bits);
+    }
+    return cleared_bits;
+}
+
 /* Called with RCU critical section */
 static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb)
 {
@@ -2668,6 +2722,19 @@ static void ram_list_init_bitmaps(void)
     }
 }
 
+static void migration_bitmap_clear_discarded_pages(RAMState *rs)
+{
+    unsigned long pages;
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
+        pages = ramblock_dirty_bitmap_clear_discarded_pages(rb);
+        rs->migration_dirty_pages -= pages;
+    }
+}
+
 static void ram_init_bitmaps(RAMState *rs)
 {
     /* For memory_global_dirty_log_start below.  */
@@ -2684,6 +2751,12 @@ static void ram_init_bitmaps(RAMState *rs)
     }
     qemu_mutex_unlock_ramlist();
     qemu_mutex_unlock_iothread();
+
+    /*
+     * After an eventual first bitmap sync, fixup the initial bitmap
+     * containing all 1s to exclude any discarded pages from migration.
+     */
+    migration_bitmap_clear_discarded_pages(rs);
 }
 
 static int ram_init_all(RAMState **rsp)
@@ -4112,6 +4185,10 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
      */
     bitmap_complement(block->bmap, block->bmap, nbits);
 
+    /* Clear dirty bits of discarded ranges that we don't want to migrate. */
+    ramblock_dirty_bitmap_clear_discarded_pages(block);
+
+    /* We'll recalculate migration_dirty_pages in ram_state_resume_prepare(). */
     trace_ram_dirty_bitmap_reload_complete(block->idstr);
 
     /*
-- 
2.31.1




* [PATCH v5 5/9] virtio-mem: Drop precopy notifier
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination David Hildenbrand
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Migration code now properly handles RAMBlocks which are indirectly managed
by a RamDiscardManager. No need for manual handling via the free page
optimization interface, let's get rid of it.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c         | 34 ----------------------------------
 include/hw/virtio/virtio-mem.h |  3 ---
 2 files changed, 37 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 284096ec5f..d5a578142b 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -776,7 +776,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     host_memory_backend_set_mapped(vmem->memdev, true);
     vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
     qemu_register_reset(virtio_mem_system_reset, vmem);
-    precopy_add_notifier(&vmem->precopy_notifier);
 
     /*
      * Set ourselves as RamDiscardManager before the plug handler maps the
@@ -796,7 +795,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
      * found via an address space anymore. Unset ourselves.
      */
     memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-    precopy_remove_notifier(&vmem->precopy_notifier);
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
     host_memory_backend_set_mapped(vmem->memdev, false);
@@ -1089,43 +1087,11 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
     vmem->block_size = value;
 }
 
-static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg,
-                                               uint64_t offset, uint64_t size)
-{
-    void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block);
-
-    qemu_guest_free_page_hint(host + offset, size);
-    return 0;
-}
-
-static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem)
-{
-    virtio_mem_for_each_unplugged_range(vmem, NULL,
-                                        virtio_mem_precopy_exclude_range_cb);
-}
-
-static int virtio_mem_precopy_notify(NotifierWithReturn *n, void *data)
-{
-    VirtIOMEM *vmem = container_of(n, VirtIOMEM, precopy_notifier);
-    PrecopyNotifyData *pnd = data;
-
-    switch (pnd->reason) {
-    case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
-        virtio_mem_precopy_exclude_unplugged(vmem);
-        break;
-    default:
-        break;
-    }
-
-    return 0;
-}
-
 static void virtio_mem_instance_init(Object *obj)
 {
     VirtIOMEM *vmem = VIRTIO_MEM(obj);
 
     notifier_list_init(&vmem->size_change_notifiers);
-    vmem->precopy_notifier.notify = virtio_mem_precopy_notify;
     QLIST_INIT(&vmem->rdl_list);
 
     object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 9a6e348fa2..a5dd6a493b 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -65,9 +65,6 @@ struct VirtIOMEM {
     /* notifiers to notify when "size" changes */
     NotifierList size_change_notifiers;
 
-    /* don't migrate unplugged memory */
-    NotifierWithReturn precopy_notifier;
-
     /* listeners to notify on plug/unplug activity. */
     QLIST_HEAD(, RamDiscardListener) rdl_list;
 };
-- 
2.31.1




* [PATCH v5 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 5/9] virtio-mem: Drop precopy notifier David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-04 16:09 ` [PATCH v5 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Currently, when someone (i.e., the VM) accesses discarded parts inside a
RAMBlock with a RamDiscardManager managing the corresponding mapped memory
region, postcopy will request migration of the corresponding page from the
source. The source, however, will never answer, because it refuses to
migrate such pages with undefined content ("logically unplugged"): the
pages are never dirty, and get_queued_page() will consequently skip
processing these postcopy requests.

In particular, reading discarded ("logically unplugged") ranges is supposed
to work in some setups (for example, with current virtio-mem), although it
rarely happens. Still, not placing a page would currently stall the VM, as
it cannot make forward progress.

Let's check the state via the RamDiscardManager (the state of, e.g.,
virtio-mem is migrated during precopy) and avoid sending a request
that will never get answered. Place a fresh zero page instead to keep
the VM working. This is the same behavior that would happen
automatically without userfaultfd being active, when accessing virtual
memory regions without populated pages -- "populate on demand".

For now, there are valid cases (as documented in the virtio-mem spec) where
a VM might read discarded memory; in the future, we will disallow that.
Then, we might want to handle that case differently, e.g., warning the
user that the VM seems to be misbehaving.
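
The destination-side decision described in this message can be modeled in a
few lines. This is only a sketch, not the QEMU code: ToyBlock, ToyAction and
toy_request_page() are hypothetical names, and two bool arrays stand in for
the RamDiscardManager state and the receive bitmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 8

/* Hypothetical stand-in for a RAMBlock on the destination. */
typedef struct {
    bool discarded[TOY_PAGES]; /* state as tracked by a RamDiscardManager */
    bool received[TOY_PAGES];  /* stand-in for the receive bitmap */
} ToyBlock;

typedef enum {
    TOY_NOTHING_TO_DO,   /* a page was already placed earlier */
    TOY_PLACED_ZEROPAGE, /* discarded page: resolved locally */
    TOY_REQUESTED,       /* populated page: ask the source for it */
} ToyAction;

/* Decide how to resolve a fault on the given page of the block. */
static ToyAction toy_request_page(ToyBlock *rb, size_t page)
{
    assert(page < TOY_PAGES);

    if (rb->discarded[page]) {
        /* The source will never send this page, so never ask for it. */
        if (rb->received[page]) {
            return TOY_NOTHING_TO_DO;
        }
        /* Placing the zero page also marks it as received. */
        rb->received[page] = true;
        return TOY_PLACED_ZEROPAGE;
    }
    return TOY_REQUESTED;
}
```

In this model, a second fault on the same discarded page is a no-op, which
mirrors the intent of not placing a zero page twice.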

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/postcopy-ram.c | 31 +++++++++++++++++++++++++++----
 migration/ram.c          | 21 +++++++++++++++++++++
 migration/ram.h          |  1 +
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2e9697bdd2..3609ce7e52 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -671,6 +671,29 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
     return ret;
 }
 
+static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
+                                 ram_addr_t start, uint64_t haddr)
+{
+    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
+
+    /*
+     * Discarded pages (via RamDiscardManager) are never migrated. On unlikely
+     * access, place a zeropage, which will also set the relevant bits in the
+     * recv_bitmap accordingly, so we won't try placing a zeropage twice.
+     *
+     * Checking a single bit is sufficient to handle pagesize > TPS as either
+     * all relevant bits are set or not.
+     */
+    assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb)));
+    if (ramblock_page_is_discarded(rb, start)) {
+        bool received = ramblock_recv_bitmap_test_byte_offset(rb, start);
+
+        return received ? 0 : postcopy_place_page_zero(mis, aligned, rb);
+    }
+
+    return migrate_send_rp_req_pages(mis, rb, start, haddr);
+}
+
 /*
  * Callback from shared fault handlers to ask for a page,
  * the page must be specified by a RAMBlock and an offset in that rb
@@ -690,7 +713,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                         qemu_ram_get_idstr(rb), rb_offset);
         return postcopy_wake_shared(pcfd, client_addr, rb);
     }
-    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
+    postcopy_request_page(mis, rb, aligned_rbo, client_addr);
     return 0;
 }
 
@@ -984,8 +1007,8 @@ retry:
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
-                                            msg.arg.pagefault.address);
+            ret = postcopy_request_page(mis, rb, rb_offset,
+                                        msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
@@ -993,7 +1016,7 @@ retry:
                     goto retry;
                 } else {
                     /* This is a unavoidable fault */
-                    error_report("%s: migrate_send_rp_req_pages() get %d",
+                    error_report("%s: postcopy_request_page() get %d",
                                  __func__, ret);
                     break;
                 }
diff --git a/migration/ram.c b/migration/ram.c
index 3be969f749..56240f0f17 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -912,6 +912,27 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
     return cleared_bits;
 }
 
+/*
+ * Check if a host-page aligned page falls into a discarded range as managed by
+ * a RamDiscardManager responsible for the mapped memory region of the RAMBlock.
+ *
+ * Note: The result is only stable while migrating (precopy/postcopy).
+ */
+bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
+{
+    if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+        MemoryRegionSection section = {
+            .mr = rb->mr,
+            .offset_within_region = start,
+            .size = int128_make64(qemu_ram_pagesize(rb)),
+        };
+
+        return !ram_discard_manager_is_populated(rdm, &section);
+    }
+    return false;
+}
+
 /* Called with RCU critical section */
 static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb)
 {
diff --git a/migration/ram.h b/migration/ram.h
index 4833e9fd5b..dda1988f3d 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -72,6 +72,7 @@ void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 int64_t ramblock_recv_bitmap_send(QEMUFile *file,
                                   const char *block_name);
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
+bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 
 /* ram cache */
 int colo_init_ram_cache(void);
-- 
2.31.1




* [PATCH v5 7/9] migration: Simplify alignment and alignment checks
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-28 15:31   ` Peter Xu
  2021-09-04 16:09 ` [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to
read.
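
For power-of-two page sizes, the open-coded masks and the helper-based forms
compute the same values. The sketch below illustrates that with local
stand-ins: TOY_ALIGN_DOWN() and TOY_IS_ALIGNED() are hypothetical macros
written for this example, not copies of the definitions in QEMU's headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for this example; not copied from QEMU's headers. */
#define TOY_ALIGN_DOWN(n, m) ((n) / (m) * (m))
#define TOY_IS_ALIGNED(n, m) (((n) % (m)) == 0)

/* Open-coded variant: mask off the low bits (pagesize must be 2^k). */
static uint64_t mask_align_down(uint64_t addr, uint64_t pagesize)
{
    return addr & ~(pagesize - 1);
}

/* Helper-based variant of the same computation. */
static uint64_t macro_align_down(uint64_t addr, uint64_t pagesize)
{
    return TOY_ALIGN_DOWN(addr, pagesize);
}

/* Open-coded alignment check (pagesize must be 2^k). */
static bool mask_is_aligned(uint64_t addr, uint64_t pagesize)
{
    return (addr & (pagesize - 1)) == 0;
}

/* Helper-based alignment check. */
static bool macro_is_aligned(uint64_t addr, uint64_t pagesize)
{
    return TOY_IS_ALIGNED(addr, pagesize);
}
```

The mask-based forms silently give wrong results for sizes that are not a
power of two, which is one more reason to prefer named helpers that state
the intent.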

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/migration.c    | 6 +++---
 migration/postcopy-ram.c | 9 ++++-----
 migration/ram.c          | 2 +-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index bb909781b7..c99360297b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
                               RAMBlock *rb, ram_addr_t start, uint64_t haddr)
 {
-    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
+    void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));
     bool received = false;
 
     WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
@@ -2619,8 +2619,8 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
      * Since we currently insist on matching page sizes, just sanity check
      * we're being asked for whole host pages.
      */
-    if (start & (our_host_ps - 1) ||
-       (len & (our_host_ps - 1))) {
+    if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
+        !QEMU_IS_ALIGNED(len, our_host_ps)) {
         error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
                      " len: %zd", __func__, start, len);
         mark_source_rp_bad(ms);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 3609ce7e52..e721f69d0f 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -402,7 +402,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
                      strerror(errno));
         goto out;
     }
-    g_assert(((size_t)testarea & (pagesize - 1)) == 0);
+    g_assert(QEMU_PTR_IS_ALIGNED(testarea, pagesize));
 
     reg_struct.range.start = (uintptr_t)testarea;
     reg_struct.range.len = pagesize;
@@ -660,7 +660,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
     struct uffdio_range range;
     int ret;
     trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
-    range.start = client_addr & ~(pagesize - 1);
+    range.start = ROUND_DOWN(client_addr, pagesize);
     range.len = pagesize;
     ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
     if (ret) {
@@ -702,8 +702,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t rb_offset)
 {
-    size_t pagesize = qemu_ram_pagesize(rb);
-    uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
+    uint64_t aligned_rbo = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
     MigrationIncomingState *mis = migration_incoming_get_current();
 
     trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
@@ -993,7 +992,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
                 break;
             }
 
-            rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
+            rb_offset = ROUND_DOWN(rb_offset, qemu_ram_pagesize(rb));
             trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
                                                 rb_offset,
diff --git a/migration/ram.c b/migration/ram.c
index 56240f0f17..b225ec7507 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -811,7 +811,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
     assert(shift >= 6);
 
     size = 1ULL << (TARGET_PAGE_BITS + shift);
-    start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
+    start = QEMU_ALIGN_DOWN((ram_addr_t)page << TARGET_PAGE_BITS, size);
     trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
     memory_region_clear_dirty_bitmap(rb->mr, start, size);
 }
-- 
2.31.1




* [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-28 15:34   ` Peter Xu
  2021-09-04 16:09 ` [PATCH v5 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots David Hildenbrand
  2021-09-28 14:01 ` [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
  9 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Let's factor out prefaulting/populating to make further changes easier to
review and add a comment on what we actually expect to happen. While at
it, use the actual page size of the ramblock, which defaults to
qemu_real_host_page_size for anonymous memory. Further, rename
ram_block_populate_pages() to ram_block_populate_read() as well, to make
it clearer what we are doing.

In the future, we might want to use MADV_POPULATE_READ to speed up
population.
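
As a standalone illustration of populating by reading, the sketch below
touches one byte per page of an anonymous private mapping. It is not the
QEMU code: toy_populate_read() and toy_prefault_anonymous() are hypothetical
names, and a volatile read is used here in place of the empty asm statement
to keep the compiler from dropping the access. The loop bound is computed as
offset + size so that the helper also handles ranges that do not start at 0.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS with strict -std= settings */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read one byte per page in [offset, offset + size); returns pages touched. */
static size_t toy_populate_read(const char *host, size_t offset, size_t size,
                                size_t page_size)
{
    const size_t end = offset + size;
    size_t touched = 0;

    for (; offset < end; offset += page_size) {
        /* The volatile access cannot be optimized out. */
        (void)*(const volatile char *)(host + offset);
        touched++;
    }
    return touched;
}

/* Map anonymous private memory, prefault it by reading, then unmap it. */
static size_t toy_prefault_anonymous(size_t nr_pages)
{
    const size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    const size_t len = nr_pages * page_size;
    char *host = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t touched;

    assert(host != MAP_FAILED);
    touched = toy_populate_read(host, 0, len, page_size);
    munmap(host, len);
    return touched;
}
```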

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index b225ec7507..c212081f85 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1639,26 +1639,35 @@ out:
     return ret;
 }
 
+static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
+                                       ram_addr_t size)
+{
+    /*
+     * We read one byte of each page; this will preallocate page tables if
+     * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
+     * where no page was populated yet. This might require adaption when
+     * supporting other mappings, like shmem.
+     */
+    for (; offset < size; offset += block->page_size) {
+        char tmp = *((char *)block->host + offset);
+
+        /* Don't optimize the read out */
+        asm volatile("" : "+r" (tmp));
+    }
+}
+
 /*
- * ram_block_populate_pages: populate memory in the RAM block by reading
- *   an integer from the beginning of each page.
+ * ram_block_populate_read: preallocate page tables and populate pages in the
+ *   RAM block by reading a byte of each page.
  *
  * Since it's solely used for userfault_fd WP feature, here we just
  *   hardcode page size to qemu_real_host_page_size.
  *
  * @block: RAM block to populate
  */
-static void ram_block_populate_pages(RAMBlock *block)
+static void ram_block_populate_read(RAMBlock *block)
 {
-    char *ptr = (char *) block->host;
-
-    for (ram_addr_t offset = 0; offset < block->used_length;
-            offset += qemu_real_host_page_size) {
-        char tmp = *(ptr + offset);
-
-        /* Don't optimize the read out */
-        asm volatile("" : "+r" (tmp));
-    }
+    populate_read_range(block, 0, block->used_length);
 }
 
 /*
@@ -1684,7 +1693,7 @@ void ram_write_tracking_prepare(void)
          * UFFDIO_WRITEPROTECT_MODE_WP mode setting would silently skip
          * pages with pte_none() entries in page table.
          */
-        ram_block_populate_pages(block);
+        ram_block_populate_read(block);
     }
 }
 
-- 
2.31.1




* [PATCH v5 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
@ 2021-09-04 16:09 ` David Hildenbrand
  2021-09-28 14:01 ` [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-04 16:09 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

We already never migrate memory that corresponds to discarded ranges
as managed by a RamDiscardManager responsible for the mapped memory region
of the RAMBlock.

virtio-mem uses this mechanism to logically unplug parts of a RAMBlock.
Right now, we still populate zeropages for the whole usable part of the
RAMBlock, which is undesired because:

1. Even populating the shared zeropage will result in memory getting
   consumed for page tables.
2. Memory backends without a shared zeropage (like hugetlbfs and shmem)
   will populate an actual, fresh page, resulting in unintended
   memory consumption.

Discarded ("logically unplugged") parts have to remain discarded. As
these pages are never part of the migration stream, there is no need to
track modifications via userfaultfd WP reliably for these parts.

Further, any writes to these ranges by the VM are invalid and the
behavior is undefined.

Note that Linux only supports userfaultfd WP on private anonymous memory
for now, which usually results in the shared zeropage getting populated.
The issue will become more relevant once userfaultfd WP supports shmem
and hugetlb.
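
The idea of visiting only populated ranges can be sketched with a toy replay
helper. Everything here is hypothetical and simplified: a bool array stands
in for the plugged/unplugged state of memory blocks, and
toy_replay_populated() only imitates the shape of a replay-with-callback
interface; it is not the RamDiscardManager API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*ToyReplayFn)(size_t first_block, size_t nr_blocks, void *opaque);

/* Invoke fn once per maximal run of populated (plugged) blocks. */
static int toy_replay_populated(const bool *plugged, size_t nr_blocks,
                                ToyReplayFn fn, void *opaque)
{
    size_t i = 0;

    assert(fn != NULL);
    while (i < nr_blocks) {
        size_t start;
        int ret;

        if (!plugged[i]) {
            /* Discarded blocks are skipped and never reach the callback. */
            i++;
            continue;
        }
        start = i;
        while (i < nr_blocks && plugged[i]) {
            i++;
        }
        ret = fn(start, i - start, opaque);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Callback that only counts the blocks it would populate. */
static int toy_count_blocks(size_t first_block, size_t nr_blocks, void *opaque)
{
    size_t *total = opaque;

    (void)first_block;
    *total += nr_blocks;
    return 0;
}

/* Number of blocks a populate pass would touch for the given state. */
static size_t toy_blocks_to_populate(const bool *plugged, size_t nr_blocks)
{
    size_t total = 0;

    toy_replay_populated(plugged, nr_blocks, toy_count_blocks, &total);
    return total;
}
```

A real callback would populate the range it is handed instead of counting;
the point of the sketch is that discarded blocks never cause any work.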

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index c212081f85..dbbb1e6712 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1656,6 +1656,17 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
     }
 }
 
+static inline int populate_read_section(MemoryRegionSection *section,
+                                        void *opaque)
+{
+    const hwaddr size = int128_get64(section->size);
+    hwaddr offset = section->offset_within_region;
+    RAMBlock *block = section->mr->ram_block;
+
+    populate_read_range(block, offset, size);
+    return 0;
+}
+
 /*
  * ram_block_populate_read: preallocate page tables and populate pages in the
  *   RAM block by reading a byte of each page.
@@ -1665,9 +1676,32 @@ static inline void populate_read_range(RAMBlock *block, ram_addr_t offset,
  *
  * @block: RAM block to populate
  */
-static void ram_block_populate_read(RAMBlock *block)
+static void ram_block_populate_read(RAMBlock *rb)
 {
-    populate_read_range(block, 0, block->used_length);
+    /*
+     * Skip populating all pages that fall into a discarded range as managed by
+     * a RamDiscardManager responsible for the mapped memory region of the
+     * RAMBlock. Such discarded ("logically unplugged") parts of a RAMBlock
+     * must not get populated automatically. We don't have to track
+     * modifications via userfaultfd WP reliably, because these pages will
+     * not be part of the migration stream either way -- see
+     * ramblock_dirty_bitmap_exclude_discarded_pages().
+     *
+     * Note: The result is only stable while migrating (precopy/postcopy).
+     */
+    if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+        MemoryRegionSection section = {
+            .mr = rb->mr,
+            .offset_within_region = 0,
+            .size = rb->mr->size,
+        };
+
+        ram_discard_manager_replay_populated(rdm, &section,
+                                             populate_read_section, NULL);
+    } else {
+        populate_read_range(rb, 0, rb->used_length);
+    }
 }
 
 /*
-- 
2.31.1




* Re: [PATCH v5 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*()
  2021-09-04 16:09 ` [PATCH v5 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
@ 2021-09-09  7:26   ` Juan Quintela
  0 siblings, 0 replies; 15+ messages in thread
From: Juan Quintela @ 2021-09-09  7:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, teawater,
	qemu-devel, Peter Xu, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

David Hildenbrand <david@redhat.com> wrote:
> The parameter is unused, let's drop it.
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>




* Re: [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager
  2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (8 preceding siblings ...)
  2021-09-04 16:09 ` [PATCH v5 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots David Hildenbrand
@ 2021-09-28 14:01 ` David Hildenbrand
  9 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-28 14:01 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	Dr. David Alan Gilbert, Peter Xu, Marek Kedzierski,
	Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 04.09.21 18:09, David Hildenbrand wrote:
> virtio-mem exposes a dynamic amount of memory within RAMBlocks by
> coordinating with the VM. Memory within a RAMBlock can either get
> plugged and consequently used by the VM, or unplugged and consequently no
> longer used by the VM. Logical unplug is realized by discarding the
> physical memory backing for virtual memory ranges, similar to memory
> ballooning.
> 
> However, important differences to virtio-balloon are:
> 
> a) A virtio-mem device only operates on its assigned memory region /
>     RAMBlock ("device memory")
> b) Initially, all device memory is logically unplugged
> c) Virtual machines will never accidentally reuse memory that is currently
>     logically unplugged. The spec defines most accesses to unplugged memory
>     as "undefined behavior" -- except reading unplugged memory, which is
>     currently expected to work, but that will change in the future.
> d) The (un)plug granularity is in the range of megabytes -- "memory blocks"
> e) The state (plugged/unplugged) of a memory block is always known and
>     properly tracked.
> 
> Whenever memory blocks within the RAMBlock get (un)plugged, changes are
> communicated via the RamDiscardManager to other QEMU subsystems, most
> prominently vfio which updates the DMA mapping accordingly. "Unplugging"
> corresponds to "discarding" and "plugging" corresponds to "populating".
> 
> While migrating (precopy/postcopy), the state of such memory blocks cannot
> change, as virtio-mem will reject any guest requests that would change
> the state of blocks with "busy". We don't want to migrate such logically
> unplugged memory, because it can result in unintended memory consumption
> both on the source (when reading memory from some memory backends) and on
> the destination (when writing memory). Further, migration time can be
> heavily reduced when skipping logically unplugged blocks, and we avoid
> populating unnecessary page tables in Linux.
> 
> Right now, virtio-mem reuses the free page hinting infrastructure during
> precopy to exclude all logically unplugged ("discarded") parts from the
> migration stream. However, there are some scenarios that are not handled
> properly and need fixing. Further, there are some ugly corner cases in
> postcopy code and background snapshotting code that similarly have to
> handle such special RAMBlocks.
> 
> Let's reuse the RamDiscardManager infrastructure to essentially handle
> precopy, postcopy and background snapshots cleanly, which means:
> 
> a) In precopy code, fixing up the initial dirty bitmaps (in the RAMBlock
>     and e.g., KVM) to exclude discarded ranges.
> b) In postcopy code, placing a zeropage when requested to handle a page
>     falling into a discarded range -- because the source will never send it.
>     Further, fix up the dirty bitmap when overwriting it in recovery mode.
> c) In background snapshot code, never populating discarded ranges, not even
>     with the shared zeropage, to avoid unintended memory consumption,
>     especially in the future with hugetlb and shmem.
> 
> Detail: When realizing a virtio-mem device, it will register the RAM
>          for migration via vmstate_register_ram(). Further, it will
>          set itself as the RamDiscardManager for the corresponding memory
>          region of the RAMBlock via memory_region_set_ram_discard_manager().
>          Last but not least, memory device code will actually map the
>          memory region into guest physical address space. So migration
>          code can always properly identify such RAMBlocks.
> 
> Tested with precopy/postcopy on shmem, where even reading unpopulated
> memory ranges will populate actual memory and not the shared zeropage.
> Tested with background snapshots on anonymous memory, because other
> backends are not supported yet with upstream Linux.
> 

Gentle ping.


-- 
Thanks,

David / dhildenb




* Re: [PATCH v5 7/9] migration: Simplify alignment and alignment checks
  2021-09-04 16:09 ` [PATCH v5 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
@ 2021-09-28 15:31   ` Peter Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Xu @ 2021-09-28 15:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On Sat, Sep 04, 2021 at 06:09:11PM +0200, David Hildenbrand wrote:
> Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to
> read.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-04 16:09 ` [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
@ 2021-09-28 15:34   ` Peter Xu
  2021-09-28 15:34     ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2021-09-28 15:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On Sat, Sep 04, 2021 at 06:09:12PM +0200, David Hildenbrand wrote:
> Let's factor out prefaulting/populating to make further changes easier to
> review and add a comment on what we actually expect to happen. While at
> it, use the actual page size of the ramblock, which defaults to
> qemu_real_host_page_size for anonymous memory. Further, rename
> ram_block_populate_pages() to ram_block_populate_read() as well, to make
> it clearer what we are doing.
> 
> In the future, we might want to use MADV_POPULATE_READ to speed up
> population.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-28 15:34   ` Peter Xu
@ 2021-09-28 15:34     ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2021-09-28 15:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 28.09.21 17:34, Peter Xu wrote:
> On Sat, Sep 04, 2021 at 06:09:12PM +0200, David Hildenbrand wrote:
>> Let's factor out prefaulting/populating to make further changes easier to
>> review and add a comment on what we actually expect to happen. While at
>> it, use the actual page size of the ramblock, which defaults to
>> qemu_real_host_page_size for anonymous memory. Further, rename
>> ram_block_populate_pages() to ram_block_populate_read() as well, to make
>> it clearer what we are doing.
>>
>> In the future, we might want to use MADV_POPULATE_READ to speed up
>> population.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 

Thanks Peter!

-- 
Thanks,

David / dhildenb





Thread overview: 15+ messages
2021-09-04 16:09 [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
2021-09-09  7:26   ` Juan Quintela
2021-09-04 16:09 ` [PATCH v5 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 5/9] virtio-mem: Drop precopy notifier David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
2021-09-28 15:31   ` Peter Xu
2021-09-04 16:09 ` [PATCH v5 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
2021-09-28 15:34   ` Peter Xu
2021-09-28 15:34     ` David Hildenbrand
2021-09-04 16:09 ` [PATCH v5 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots David Hildenbrand
2021-09-28 14:01 ` [PATCH v5 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
