* [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager
@ 2021-09-02 13:14 David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

virtio-mem exposes a dynamic amount of memory within RAMBlocks by
coordinating with the VM. Memory within a RAMBlock can either get
plugged and consequently used by the VM, or unplugged and consequently no
longer used by the VM. Logical unplug is realized by discarding the
physical memory backing for virtual memory ranges, similar to memory
ballooning.
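
At the Linux level, "discarding" boils down to what QEMU's
ram_block_discard_range() does; roughly the following, depending on the
backend (a sketch; host_addr, fd, offset and size stand in for the obvious
parameters):

    /* Anonymous memory: drop the backing pages. */
    madvise(host_addr + offset, size, MADV_DONTNEED);

    /* File-backed memory (e.g., shmem, hugetlbfs): punch a hole. */
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, size);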

However, the important differences to virtio-balloon are:

a) A virtio-mem device only operates on its assigned memory region /
   RAMBlock ("device memory")
b) Initially, all device memory is logically unplugged
c) Virtual machines will never accidentally reuse memory that is currently
   logically unplugged. The spec defines most accesses to unplugged memory
   as "undefined behavior" -- except reading unplugged memory, which is
   currently expected to work, but that will change in the future.
d) The (un)plug granularity is in the range of megabytes -- "memory blocks"
e) The state (plugged/unplugged) of a memory block is always known and
   properly tracked.

Whenever memory blocks within the RAMBlock get (un)plugged, changes are
communicated via the RamDiscardManager to other QEMU subsystems, most
prominently vfio, which updates the DMA mapping accordingly. "Unplugging"
corresponds to "discarding" and "plugging" corresponds to "populating".
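
Sketched from a listener's point of view, that coordination looks roughly
like the following; the dma_notify_* callbacks are placeholders, not actual
vfio code:

    static int dma_notify_populate(RamDiscardListener *rdl,
                                   MemoryRegionSection *section)
    {
        /* e.g., create the DMA mapping for the newly plugged range */
        return 0;
    }

    static void dma_notify_discard(RamDiscardListener *rdl,
                                   MemoryRegionSection *section)
    {
        /* e.g., tear down the DMA mapping for the unplugged range */
    }

    RamDiscardListener rdl;

    ram_discard_listener_init(&rdl, dma_notify_populate, dma_notify_discard,
                              true);
    ram_discard_manager_register_listener(rdm, &rdl, section);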

While migrating (precopy/postcopy), the state of such memory blocks cannot
change, as virtio-mem will reject with "busy" any guest requests that would
change the state of blocks. We don't want to migrate such logically
unplugged memory, because it can result in unintended memory consumption
both on the source (when reading memory from some memory backends) and on
the destination (when writing memory). Further, migration time can be
heavily reduced by skipping logically unplugged blocks, and we avoid
populating unnecessary page tables in Linux.

Right now, virtio-mem reuses the free page hinting infrastructure during
precopy to exclude all logically unplugged ("discarded") parts from the
migration stream. However, there are some scenarios that are not handled
properly and need fixing. Further, there are some ugly corner cases in
postcopy code and background snapshotting code that similarly have to
handle such special RAMBlocks.

Let's reuse the RamDiscardManager infrastructure to essentially handle
precopy, postcopy and background snapshots cleanly, which means:

a) In precopy code, fixing up the initial dirty bitmaps (in the RAMBlock
   and e.g., KVM) to exclude discarded ranges.
b) In postcopy code, placing a zeropage when requested to handle a page
   falling into a discarded range -- because the source will never send it.
   Further, fix up the dirty bitmap when overwriting it in recovery mode.
c) In background snapshot code, never populating discarded ranges, not even
   with the shared zeropage, to avoid unintended memory consumption,
   especially in the future with hugetlb and shmem.

Detail: When realizing a virtio-mem device, it will register the RAM
        for migration via vmstate_register_ram(). Further, it will
        set itself as the RamDiscardManager for the corresponding memory
        region of the RAMBlock via memory_region_set_ram_discard_manager().
        Last but not least, memory device code will actually map the
        memory region into guest physical address space. So migration
        code can always properly identify such RAMBlocks.
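
Migration code can detect such RAMBlocks with a check along the following
lines (the same pattern used throughout this series):

    if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);

        /* ... replay discarded/populated parts via rdm ... */
    }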

Tested with precopy/postcopy on shmem, where even reading unpopulated
memory ranges will populate actual memory and not the shared zeropage.
Tested with background snapshots on anonymous memory, because other
backends are not supported yet with upstream Linux.

Ideally, this should all go via the migration tree.

v3 -> v4:
- Added ACKs
- "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
   destination"
-- Use QEMU_ALIGN_DOWN() to align to ram pagesize
- "migration: Simplify alignment and alignment checks"
-- Added
- "migration/ram: Factor out populating pages readable in
   ram_block_populate_pages()"
-- Added
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on background
   snapshots"
-- Simplified due to factored out code

v2 -> v3:
- "migration/ram: Don't passs RAMState to
   migration_clear_memory_region_dirty_bitmap_*()"
-- Added to make the next patch easier to implement
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration
   source"
-- Fixup the dirty bitmaps only initially and during postcopy recovery,
   not after every bitmap sync. Also properly clear the dirty bitmaps e.g.,
   in KVM. [Peter]
- "migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
   destination"
-- Take care of proper host-page alignment [Peter]

v1 -> v2:
- "migration/ram: Handle RAMBlocks with a RamDiscardManager on the
   migration source"
-- Added a note on how it interacts with the clear_bmap and what we might want
   to further optimize in the future when synchronizing bitmaps.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Juan Quintela <quintela@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrey Gruzdev <andrey.gruzdev@virtuozzo.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
Cc: Philippe Mathieu-Daudé <philmd@redhat.com>

David Hildenbrand (9):
  memory: Introduce replay_discarded callback for RamDiscardManager
  virtio-mem: Implement replay_discarded RamDiscardManager callback
  migration/ram: Don't pass RAMState to
    migration_clear_memory_region_dirty_bitmap_*()
  migration/ram: Handle RAMBlocks with a RamDiscardManager on the
    migration source
  virtio-mem: Drop precopy notifier
  migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the
    destination
  migration: Simplify alignment and alignment checks
  migration/ram: Factor out populating pages readable in
    ram_block_populate_pages()
  migration/ram: Handle RAMBlocks with a RamDiscardManager on background
    snapshots

 hw/virtio/virtio-mem.c         |  92 +++++++++++-------
 include/exec/memory.h          |  21 +++++
 include/hw/virtio/virtio-mem.h |   3 -
 migration/migration.c          |   6 +-
 migration/postcopy-ram.c       |  40 ++++++--
 migration/ram.c                | 167 +++++++++++++++++++++++++++++----
 migration/ram.h                |   1 +
 softmmu/memory.c               |  11 +++
 8 files changed, 274 insertions(+), 67 deletions(-)

-- 
2.31.1




* [PATCH v4 1/9] memory: Introduce replay_discarded callback for RamDiscardManager
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback David Hildenbrand
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Introduce a replay_discarded callback, similar to our existing
replay_populated callback, to be used by migration code to never migrate
discarded memory.
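
The migration source (see the corresponding patch in this series) will use
it roughly as follows, with a callback that clears the dirty bits of each
discarded section:

    static void dirty_bitmap_clear_section(MemoryRegionSection *section,
                                           void *opaque)
    {
        /* clear the bits of the dirty bitmap(s) covering this range */
    }

    ram_discard_manager_replay_discarded(rdm, &section,
                                         dirty_bitmap_clear_section,
                                         &cleared_bits);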

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/exec/memory.h | 21 +++++++++++++++++++++
 softmmu/memory.c      | 11 +++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c3d417d317..93e972b55a 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -537,6 +537,7 @@ static inline void ram_discard_listener_init(RamDiscardListener *rdl,
 }
 
 typedef int (*ReplayRamPopulate)(MemoryRegionSection *section, void *opaque);
+typedef void (*ReplayRamDiscard)(MemoryRegionSection *section, void *opaque);
 
 /*
  * RamDiscardManagerClass:
@@ -625,6 +626,21 @@ struct RamDiscardManagerClass {
                             MemoryRegionSection *section,
                             ReplayRamPopulate replay_fn, void *opaque);
 
+    /**
+     * @replay_discarded:
+     *
+     * Call the #ReplayRamDiscard callback for all discarded parts within the
+     * #MemoryRegionSection via the #RamDiscardManager.
+     *
+     * @rdm: the #RamDiscardManager
+     * @section: the #MemoryRegionSection
+     * @replay_fn: the #ReplayRamDiscard callback
+     * @opaque: pointer to forward to the callback
+     */
+    void (*replay_discarded)(const RamDiscardManager *rdm,
+                             MemoryRegionSection *section,
+                             ReplayRamDiscard replay_fn, void *opaque);
+
     /**
      * @register_listener:
      *
@@ -669,6 +685,11 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
                                          ReplayRamPopulate replay_fn,
                                          void *opaque);
 
+void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscard replay_fn,
+                                          void *opaque);
+
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
                                            MemoryRegionSection *section);
diff --git a/softmmu/memory.c b/softmmu/memory.c
index bfedaf9c4d..cd86205627 100644
--- a/softmmu/memory.c
+++ b/softmmu/memory.c
@@ -2076,6 +2076,17 @@ int ram_discard_manager_replay_populated(const RamDiscardManager *rdm,
     return rdmc->replay_populated(rdm, section, replay_fn, opaque);
 }
 
+void ram_discard_manager_replay_discarded(const RamDiscardManager *rdm,
+                                          MemoryRegionSection *section,
+                                          ReplayRamDiscard replay_fn,
+                                          void *opaque)
+{
+    RamDiscardManagerClass *rdmc = RAM_DISCARD_MANAGER_GET_CLASS(rdm);
+
+    g_assert(rdmc->replay_discarded);
+    rdmc->replay_discarded(rdm, section, replay_fn, opaque);
+}
+
 void ram_discard_manager_register_listener(RamDiscardManager *rdm,
                                            RamDiscardListener *rdl,
                                            MemoryRegionSection *section)
-- 
2.31.1




* [PATCH v4 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 3/9] migration/ram: Don't pass RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Implement it similarly to the replay_populated callback.
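
For example (hypothetical values): with a block size of 2 MiB and a plug
bitmap of 0b1100 (blocks 0 and 1 unplugged, blocks 2 and 3 plugged), the
first iteration finds first_bit = 0 and last_bit = 1 and replays a single
unplugged section at offset 0 with size 4 MiB; the following
find_next_zero_bit() finds no further zero bits, terminating the loop.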

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c | 58 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index df91e454b2..284096ec5f 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -228,6 +228,38 @@ static int virtio_mem_for_each_plugged_section(const VirtIOMEM *vmem,
     return ret;
 }
 
+static int virtio_mem_for_each_unplugged_section(const VirtIOMEM *vmem,
+                                                 MemoryRegionSection *s,
+                                                 void *arg,
+                                                 virtio_mem_section_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_bit = s->offset_within_region / vmem->bitmap_size;
+    first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size, first_bit);
+    while (first_bit < vmem->bitmap_size) {
+        MemoryRegionSection tmp = *s;
+
+        offset = first_bit * vmem->block_size;
+        last_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+                                 first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * vmem->block_size;
+
+        if (!virito_mem_intersect_memory_section(&tmp, offset, size)) {
+            break;
+        }
+        ret = cb(&tmp, arg);
+        if (ret) {
+            break;
+        }
+        first_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+                                       last_bit + 2);
+    }
+    return ret;
+}
+
 static int virtio_mem_notify_populate_cb(MemoryRegionSection *s, void *arg)
 {
     RamDiscardListener *rdl = arg;
@@ -1170,6 +1202,31 @@ static int virtio_mem_rdm_replay_populated(const RamDiscardManager *rdm,
                                             virtio_mem_rdm_replay_populated_cb);
 }
 
+static int virtio_mem_rdm_replay_discarded_cb(MemoryRegionSection *s,
+                                              void *arg)
+{
+    struct VirtIOMEMReplayData *data = arg;
+
+    ((ReplayRamDiscard)data->fn)(s, data->opaque);
+    return 0;
+}
+
+static void virtio_mem_rdm_replay_discarded(const RamDiscardManager *rdm,
+                                            MemoryRegionSection *s,
+                                            ReplayRamDiscard replay_fn,
+                                            void *opaque)
+{
+    const VirtIOMEM *vmem = VIRTIO_MEM(rdm);
+    struct VirtIOMEMReplayData data = {
+        .fn = replay_fn,
+        .opaque = opaque,
+    };
+
+    g_assert(s->mr == &vmem->memdev->mr);
+    virtio_mem_for_each_unplugged_section(vmem, s, &data,
+                                          virtio_mem_rdm_replay_discarded_cb);
+}
+
 static void virtio_mem_rdm_register_listener(RamDiscardManager *rdm,
                                              RamDiscardListener *rdl,
                                              MemoryRegionSection *s)
@@ -1234,6 +1291,7 @@ static void virtio_mem_class_init(ObjectClass *klass, void *data)
     rdmc->get_min_granularity = virtio_mem_rdm_get_min_granularity;
     rdmc->is_populated = virtio_mem_rdm_is_populated;
     rdmc->replay_populated = virtio_mem_rdm_replay_populated;
+    rdmc->replay_discarded = virtio_mem_rdm_replay_discarded;
     rdmc->register_listener = virtio_mem_rdm_register_listener;
     rdmc->unregister_listener = virtio_mem_rdm_unregister_listener;
 }
-- 
2.31.1




* [PATCH v4 3/9] migration/ram: Don't pass RAMState to migration_clear_memory_region_dirty_bitmap_*()
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source David Hildenbrand
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

The parameter is unused; let's drop it.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index 7a43bfd7af..bb908822d5 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -789,8 +789,7 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     return find_next_bit(bitmap, size, start);
 }
 
-static void migration_clear_memory_region_dirty_bitmap(RAMState *rs,
-                                                       RAMBlock *rb,
+static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
                                                        unsigned long page)
 {
     uint8_t shift;
@@ -818,8 +817,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMState *rs,
 }
 
 static void
-migration_clear_memory_region_dirty_bitmap_range(RAMState *rs,
-                                                 RAMBlock *rb,
+migration_clear_memory_region_dirty_bitmap_range(RAMBlock *rb,
                                                  unsigned long start,
                                                  unsigned long npages)
 {
@@ -832,7 +830,7 @@ migration_clear_memory_region_dirty_bitmap_range(RAMState *rs,
      * exclusive.
      */
     for (i = chunk_start; i < chunk_end; i += chunk_pages) {
-        migration_clear_memory_region_dirty_bitmap(rs, rb, i);
+        migration_clear_memory_region_dirty_bitmap(rb, i);
     }
 }
 
@@ -850,7 +848,7 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
      * the page in the chunk we clear the remote dirty bitmap for all.
      * Clearing it earlier won't be a problem, but too late will.
      */
-    migration_clear_memory_region_dirty_bitmap(rs, rb, page);
+    migration_clear_memory_region_dirty_bitmap(rb, page);
 
     ret = test_and_clear_bit(page, rb->bmap);
     if (ret) {
@@ -2777,8 +2775,7 @@ void qemu_guest_free_page_hint(void *addr, size_t len)
          * are initially set. Otherwise those skipped pages will be sent in
          * the next round after syncing from the memory region bitmap.
          */
-        migration_clear_memory_region_dirty_bitmap_range(ram_state, block,
-                                                         start, npages);
+        migration_clear_memory_region_dirty_bitmap_range(block, start, npages);
         ram_state->migration_dirty_pages -=
                       bitmap_count_one_with_offset(block->bmap, start, npages);
         bitmap_clear(block->bmap, start, npages);
-- 
2.31.1




* [PATCH v4 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (2 preceding siblings ...)
  2021-09-02 13:14 ` [PATCH v4 3/9] migration/ram: Don't pass RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 5/9] virtio-mem: Drop precopy notifier David Hildenbrand
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

We don't want to migrate memory that corresponds to discarded ranges as
managed by a RamDiscardManager responsible for the mapped memory region of
the RAMBlock. The content of these pages is essentially stale and
without any guarantees for the VM ("logically unplugged").

Depending on the underlying memory type, even reading memory might populate
memory on the source, resulting in an undesired memory consumption. Of
course, on the destination, even writing a zeropage consumes memory,
which we also want to avoid (similar to free page hinting).

Currently, virtio-mem tries to achieve that goal (not migrating "unplugged"
memory that was discarded) by going via qemu_guest_free_page_hint() - but
it's hackish and incomplete.

For example, background snapshots still end up reading all memory, as
they don't do bitmap syncs. Postcopy recovery code will re-add
previously cleared bits to the dirty bitmap and migrate them.

Let's consult the RamDiscardManager after setting up our dirty bitmap
initially and when postcopy recovery code reinitializes it: clear the
corresponding bits in the dirty bitmaps (e.g., of the RAMBlock and inside
KVM). It's important to fix up the dirty bitmap *after* our initial bitmap
sync, such that the corresponding dirty bits in KVM are actually cleared.

As COLO is incompatible with discarding of RAM and inhibits it, we don't
have to bother.

Note: if a misbehaving guest were to use discarded ranges after migration
started, we would still migrate that memory: however, we then already
populated that memory on the migration source.
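
Schematically, the resulting order of operations in ram_init_bitmaps()
becomes (leaving out locking and the background snapshot special case):

    bitmap_set(block->bmap, 0, pages);          /* per RAMBlock: all 1s */
    memory_global_dirty_log_start();
    migration_bitmap_sync_precopy(rs);          /* initial bitmap sync */
    migration_bitmap_clear_discarded_pages(rs); /* fixup *after* the sync */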

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/migration/ram.c b/migration/ram.c
index bb908822d5..3be969f749 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -858,6 +858,60 @@ static inline bool migration_bitmap_clear_dirty(RAMState *rs,
     return ret;
 }
 
+static void dirty_bitmap_clear_section(MemoryRegionSection *section,
+                                       void *opaque)
+{
+    const hwaddr offset = section->offset_within_region;
+    const hwaddr size = int128_get64(section->size);
+    const unsigned long start = offset >> TARGET_PAGE_BITS;
+    const unsigned long npages = size >> TARGET_PAGE_BITS;
+    RAMBlock *rb = section->mr->ram_block;
+    uint64_t *cleared_bits = opaque;
+
+    /*
+     * We don't grab ram_state->bitmap_mutex because we expect to run
+     * only when starting migration or during postcopy recovery where
+     * we don't have concurrent access.
+     */
+    if (!migration_in_postcopy() && !migrate_background_snapshot()) {
+        migration_clear_memory_region_dirty_bitmap_range(rb, start, npages);
+    }
+    *cleared_bits += bitmap_count_one_with_offset(rb->bmap, start, npages);
+    bitmap_clear(rb->bmap, start, npages);
+}
+
+/*
+ * Exclude all dirty pages from migration that fall into a discarded range as
+ * managed by a RamDiscardManager responsible for the mapped memory region of
+ * the RAMBlock. Clear the corresponding bits in the dirty bitmaps.
+ *
+ * Discarded pages ("logically unplugged") have undefined content and must
+ * not get migrated, because even reading these pages for migration might
+ * result in undesired behavior.
+ *
+ * Returns the number of cleared bits in the RAMBlock dirty bitmap.
+ *
+ * Note: The result is only stable while migrating (precopy/postcopy).
+ */
+static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
+{
+    uint64_t cleared_bits = 0;
+
+    if (rb->mr && rb->bmap && memory_region_has_ram_discard_manager(rb->mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+        MemoryRegionSection section = {
+            .mr = rb->mr,
+            .offset_within_region = 0,
+            .size = int128_make64(qemu_ram_get_used_length(rb)),
+        };
+
+        ram_discard_manager_replay_discarded(rdm, &section,
+                                             dirty_bitmap_clear_section,
+                                             &cleared_bits);
+    }
+    return cleared_bits;
+}
+
 /* Called with RCU critical section */
 static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb)
 {
@@ -2668,6 +2722,19 @@ static void ram_list_init_bitmaps(void)
     }
 }
 
+static void migration_bitmap_clear_discarded_pages(RAMState *rs)
+{
+    unsigned long pages;
+    RAMBlock *rb;
+
+    RCU_READ_LOCK_GUARD();
+
+    RAMBLOCK_FOREACH_NOT_IGNORED(rb) {
+            pages = ramblock_dirty_bitmap_clear_discarded_pages(rb);
+            rs->migration_dirty_pages -= pages;
+    }
+}
+
 static void ram_init_bitmaps(RAMState *rs)
 {
     /* For memory_global_dirty_log_start below.  */
@@ -2684,6 +2751,12 @@ static void ram_init_bitmaps(RAMState *rs)
     }
     qemu_mutex_unlock_ramlist();
     qemu_mutex_unlock_iothread();
+
+    /*
+     * After an eventual first bitmap sync, fixup the initial bitmap
+     * containing all 1s to exclude any discarded pages from migration.
+     */
+    migration_bitmap_clear_discarded_pages(rs);
 }
 
 static int ram_init_all(RAMState **rsp)
@@ -4112,6 +4185,10 @@ int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *block)
      */
     bitmap_complement(block->bmap, block->bmap, nbits);
 
+    /* Clear dirty bits of discarded ranges that we don't want to migrate. */
+    ramblock_dirty_bitmap_clear_discarded_pages(block);
+
+    /* We'll recalculate migration_dirty_pages in ram_state_resume_prepare(). */
     trace_ram_dirty_bitmap_reload_complete(block->idstr);
 
     /*
-- 
2.31.1




* [PATCH v4 5/9] virtio-mem: Drop precopy notifier
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (3 preceding siblings ...)
  2021-09-02 13:14 ` [PATCH v4 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination David Hildenbrand
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Migration code now properly handles RAMBlocks that are indirectly managed
by a RamDiscardManager. There is no need for manual handling via the free
page optimization interface anymore; let's get rid of it.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 hw/virtio/virtio-mem.c         | 34 ----------------------------------
 include/hw/virtio/virtio-mem.h |  3 ---
 2 files changed, 37 deletions(-)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index 284096ec5f..d5a578142b 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -776,7 +776,6 @@ static void virtio_mem_device_realize(DeviceState *dev, Error **errp)
     host_memory_backend_set_mapped(vmem->memdev, true);
     vmstate_register_ram(&vmem->memdev->mr, DEVICE(vmem));
     qemu_register_reset(virtio_mem_system_reset, vmem);
-    precopy_add_notifier(&vmem->precopy_notifier);
 
     /*
      * Set ourselves as RamDiscardManager before the plug handler maps the
@@ -796,7 +795,6 @@ static void virtio_mem_device_unrealize(DeviceState *dev)
      * found via an address space anymore. Unset ourselves.
      */
     memory_region_set_ram_discard_manager(&vmem->memdev->mr, NULL);
-    precopy_remove_notifier(&vmem->precopy_notifier);
     qemu_unregister_reset(virtio_mem_system_reset, vmem);
     vmstate_unregister_ram(&vmem->memdev->mr, DEVICE(vmem));
     host_memory_backend_set_mapped(vmem->memdev, false);
@@ -1089,43 +1087,11 @@ static void virtio_mem_set_block_size(Object *obj, Visitor *v, const char *name,
     vmem->block_size = value;
 }
 
-static int virtio_mem_precopy_exclude_range_cb(const VirtIOMEM *vmem, void *arg,
-                                               uint64_t offset, uint64_t size)
-{
-    void * const host = qemu_ram_get_host_addr(vmem->memdev->mr.ram_block);
-
-    qemu_guest_free_page_hint(host + offset, size);
-    return 0;
-}
-
-static void virtio_mem_precopy_exclude_unplugged(VirtIOMEM *vmem)
-{
-    virtio_mem_for_each_unplugged_range(vmem, NULL,
-                                        virtio_mem_precopy_exclude_range_cb);
-}
-
-static int virtio_mem_precopy_notify(NotifierWithReturn *n, void *data)
-{
-    VirtIOMEM *vmem = container_of(n, VirtIOMEM, precopy_notifier);
-    PrecopyNotifyData *pnd = data;
-
-    switch (pnd->reason) {
-    case PRECOPY_NOTIFY_AFTER_BITMAP_SYNC:
-        virtio_mem_precopy_exclude_unplugged(vmem);
-        break;
-    default:
-        break;
-    }
-
-    return 0;
-}
-
 static void virtio_mem_instance_init(Object *obj)
 {
     VirtIOMEM *vmem = VIRTIO_MEM(obj);
 
     notifier_list_init(&vmem->size_change_notifiers);
-    vmem->precopy_notifier.notify = virtio_mem_precopy_notify;
     QLIST_INIT(&vmem->rdl_list);
 
     object_property_add(obj, VIRTIO_MEM_SIZE_PROP, "size", virtio_mem_get_size,
diff --git a/include/hw/virtio/virtio-mem.h b/include/hw/virtio/virtio-mem.h
index 9a6e348fa2..a5dd6a493b 100644
--- a/include/hw/virtio/virtio-mem.h
+++ b/include/hw/virtio/virtio-mem.h
@@ -65,9 +65,6 @@ struct VirtIOMEM {
     /* notifiers to notify when "size" changes */
     NotifierList size_change_notifiers;
 
-    /* don't migrate unplugged memory */
-    NotifierWithReturn precopy_notifier;
-
     /* listeners to notify on plug/unplug activity. */
     QLIST_HEAD(, RamDiscardListener) rdl_list;
 };
-- 
2.31.1




* [PATCH v4 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (4 preceding siblings ...)
  2021-09-02 13:14 ` [PATCH v4 5/9] virtio-mem: Drop precopy notifier David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Currently, when someone (i.e., the VM) accesses discarded parts inside a
RAMBlock with a RamDiscardManager managing the corresponding mapped memory
region, postcopy will request migration of the corresponding page from the
source. The source, however, will never answer, because it refuses to
migrate such pages with undefined content ("logically unplugged"): the
pages are never dirty, and get_queued_page() will consequently skip
processing these postcopy requests.

In particular, reading discarded ("logically unplugged") ranges is supposed
to work in some setups (for example, with current virtio-mem), although it
barely ever happens: still, not placing a page would currently stall the
VM, as it could not make forward progress.

Let's check the state via the RamDiscardManager (the state e.g.,
of virtio-mem is migrated during precopy) and avoid sending a request
that will never get answered. Place a fresh zero page instead to keep
the VM working. This is the same behavior that would happen
automatically without userfaultfd being active, when accessing virtual
memory regions without populated pages -- "populate on demand".
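
Without userfaultfd being active, ordinary demand paging already behaves
like that; e.g., the first read of a private anonymous mapping maps the
shared zeropage instead of allocating memory (a minimal sketch):

    #include <sys/mman.h>

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    volatile char c = p[0]; /* read fault: shared zeropage, no allocation */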

For now, there are valid cases (as documented in the virtio-mem spec) where
a VM might read discarded memory; in the future, we will disallow that.
Then, we might want to handle that case differently, e.g., warning the
user that the VM seems to be mis-behaving.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/postcopy-ram.c | 31 +++++++++++++++++++++++++++----
 migration/ram.c          | 21 +++++++++++++++++++++
 migration/ram.h          |  1 +
 3 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 2e9697bdd2..39e3e057b4 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -671,6 +671,29 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
     return ret;
 }
 
+static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
+                                 ram_addr_t start, uint64_t haddr)
+{
+    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
+
+    /*
+     * Discarded pages (via RamDiscardManager) are never migrated. On unlikely
+     * access, place a zeropage, which will also set the relevant bits in the
+     * recv_bitmap accordingly, so we won't try placing a zeropage twice.
+     *
+     * Checking a single bit is sufficient to handle pagesize > TPS as either
+     * all relevant bits are set or not.
+     */
+    assert(QEMU_IS_ALIGNED(start, qemu_ram_pagesize(rb)));
+    if (ramblock_page_is_discarded(rb, start)) {
+        bool received = ramblock_recv_bitmap_test_byte_offset(rb, start);
+
+        return received ? 0 : postcopy_place_page_zero(mis, aligned, rb);
+    }
+
+    return migrate_send_rp_req_pages(mis, rb, start, haddr);
+}
+
 /*
  * Callback from shared fault handlers to ask for a page,
  * the page must be specified by a RAMBlock and an offset in that rb
@@ -690,7 +713,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                         qemu_ram_get_idstr(rb), rb_offset);
         return postcopy_wake_shared(pcfd, client_addr, rb);
     }
-    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
+    postcopy_request_page(mis, rb, aligned_rbo, client_addr);
     return 0;
 }
 
@@ -984,8 +1007,8 @@ retry:
              * Send the request to the source - we want to request one
              * of our host page sizes (which is >= TPS)
              */
-            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
-                                            msg.arg.pagefault.address);
+            ret = postcopy_request_page(mis, rb, rb_offset,
+                                        msg.arg.pagefault.address);
             if (ret) {
                 /* May be network failure, try to wait for recovery */
                 if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
@@ -993,7 +1016,7 @@ retry:
                     goto retry;
                 } else {
                     /* This is a unavoidable fault */
-                    error_report("%s: migrate_send_rp_req_pages() get %d",
+                    error_report("%s: postcopy_request_page() get %d",
                                  __func__, ret);
                     break;
                 }
diff --git a/migration/ram.c b/migration/ram.c
index 3be969f749..e8abe10ddb 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -912,6 +912,27 @@ static uint64_t ramblock_dirty_bitmap_clear_discarded_pages(RAMBlock *rb)
     return cleared_bits;
 }
 
+/*
+ * Check if a host-page aligned page falls into a discarded range as managed by
+ * a RamDiscardManager responsible for the mapped memory region of the RAMBlock.
+ *
+ * Note: The result is only stable while migrating (precopy/postcopy).
+ */
+bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start)
+{
+    if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+        MemoryRegionSection section = {
+            .mr = rb->mr,
+            .offset_within_region = start,
+            .size = int128_get64(qemu_ram_pagesize(rb)),
+        };
+
+        return !ram_discard_manager_is_populated(rdm, &section);
+    }
+    return false;
+}
+
 /* Called with RCU critical section */
 static void ramblock_sync_dirty_bitmap(RAMState *rs, RAMBlock *rb)
 {
diff --git a/migration/ram.h b/migration/ram.h
index 4833e9fd5b..dda1988f3d 100644
--- a/migration/ram.h
+++ b/migration/ram.h
@@ -72,6 +72,7 @@ void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr, size_t nr);
 int64_t ramblock_recv_bitmap_send(QEMUFile *file,
                                   const char *block_name);
 int ram_dirty_bitmap_reload(MigrationState *s, RAMBlock *rb);
+bool ramblock_page_is_discarded(RAMBlock *rb, ram_addr_t start);
 
 /* ram cache */
 int colo_init_ram_cache(void);
-- 
2.31.1




* [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (5 preceding siblings ...)
  2021-09-02 13:14 ` [PATCH v4 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 22:32   ` Peter Xu
  2021-09-02 13:14 ` [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
  2021-09-02 13:14 ` [PATCH v4 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots David Hildenbrand
  8 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Let's use QEMU_ALIGN_DOWN() and friends to make the code a bit easier to
read.
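
For example, both of the following compute the same host-page-aligned
address, as the page size is always a power of two; the macro just reads
more clearly:

    haddr & (-qemu_ram_pagesize(rb));               /* before */
    QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));  /* after  */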

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/migration.c    | 6 +++---
 migration/postcopy-ram.c | 9 ++++-----
 migration/ram.c          | 2 +-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index bb909781b7..ae97c2c461 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
 int migrate_send_rp_req_pages(MigrationIncomingState *mis,
                               RAMBlock *rb, ram_addr_t start, uint64_t haddr)
 {
-    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
+    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
     bool received = false;
 
     WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
@@ -2619,8 +2619,8 @@ static void migrate_handle_rp_req_pages(MigrationState *ms, const char* rbname,
      * Since we currently insist on matching page sizes, just sanity check
      * we're being asked for whole host pages.
      */
-    if (start & (our_host_ps - 1) ||
-       (len & (our_host_ps - 1))) {
+    if (!QEMU_IS_ALIGNED(start, our_host_ps) ||
+        !QEMU_IS_ALIGNED(len, our_host_ps)) {
         error_report("%s: Misaligned page request, start: " RAM_ADDR_FMT
                      " len: %zd", __func__, start, len);
         mark_source_rp_bad(ms);
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index 39e3e057b4..3f0a1f7aa6 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -402,7 +402,7 @@ bool postcopy_ram_supported_by_host(MigrationIncomingState *mis)
                      strerror(errno));
         goto out;
     }
-    g_assert(((size_t)testarea & (pagesize - 1)) == 0);
+    g_assert(QEMU_PTR_IS_ALIGNED(testarea, pagesize));
 
     reg_struct.range.start = (uintptr_t)testarea;
     reg_struct.range.len = pagesize;
@@ -660,7 +660,7 @@ int postcopy_wake_shared(struct PostCopyFD *pcfd,
     struct uffdio_range range;
     int ret;
     trace_postcopy_wake_shared(client_addr, qemu_ram_get_idstr(rb));
-    range.start = client_addr & ~(pagesize - 1);
+    range.start = QEMU_ALIGN_DOWN(client_addr, pagesize);
     range.len = pagesize;
     ret = ioctl(pcfd->fd, UFFDIO_WAKE, &range);
     if (ret) {
@@ -702,8 +702,7 @@ static int postcopy_request_page(MigrationIncomingState *mis, RAMBlock *rb,
 int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
                                  uint64_t client_addr, uint64_t rb_offset)
 {
-    size_t pagesize = qemu_ram_pagesize(rb);
-    uint64_t aligned_rbo = rb_offset & ~(pagesize - 1);
+    uint64_t aligned_rbo = QEMU_ALIGN_DOWN(rb_offset, qemu_ram_pagesize(rb));
     MigrationIncomingState *mis = migration_incoming_get_current();
 
     trace_postcopy_request_shared_page(pcfd->idstr, qemu_ram_get_idstr(rb),
@@ -993,7 +992,7 @@ static void *postcopy_ram_fault_thread(void *opaque)
                 break;
             }
 
-            rb_offset &= ~(qemu_ram_pagesize(rb) - 1);
+            rb_offset = QEMU_ALIGN_DOWN(rb_offset, qemu_ram_pagesize(rb));
             trace_postcopy_ram_fault_thread_request(msg.arg.pagefault.address,
                                                 qemu_ram_get_idstr(rb),
                                                 rb_offset,
diff --git a/migration/ram.c b/migration/ram.c
index e8abe10ddb..e1c158dc92 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -811,7 +811,7 @@ static void migration_clear_memory_region_dirty_bitmap(RAMBlock *rb,
     assert(shift >= 6);
 
     size = 1ULL << (TARGET_PAGE_BITS + shift);
-    start = (((ram_addr_t)page) << TARGET_PAGE_BITS) & (-size);
+    start = QEMU_ALIGN_DOWN((ram_addr_t)page << TARGET_PAGE_BITS, size);
     trace_migration_bitmap_clear_dirty(rb->idstr, start, size, page);
     memory_region_clear_dirty_bitmap(rb->mr, start, size);
 }
-- 
2.31.1




* [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (6 preceding siblings ...)
  2021-09-02 13:14 ` [PATCH v4 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  2021-09-02 22:28   ` Peter Xu
  2021-09-02 13:14 ` [PATCH v4 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots David Hildenbrand
  8 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

Let's factor out prefaulting/populating to make further changes easier to
review. While at it, use the actual page size of the ramblock, which
defaults to qemu_real_host_page_size for anonymous memory.
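
Reading a single byte per page is sufficient to trigger a page fault
wherever nothing is mapped yet; the empty asm statement in the loop merely
marks "tmp" as used, so the compiler cannot optimize the read away.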

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index e1c158dc92..de47650c90 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1639,6 +1639,17 @@ out:
     return ret;
 }
 
+static inline void populate_range(RAMBlock *block, ram_addr_t offset,
+                                  ram_addr_t size)
+{
+    for (; offset < size; offset += block->page_size) {
+        char tmp = *((char *)block->host + offset);
+
+        /* Don't optimize the read out */
+        asm volatile("" : "+r" (tmp));
+    }
+}
+
 /*
  * ram_block_populate_pages: populate memory in the RAM block by reading
  *   an integer from the beginning of each page.
@@ -1650,15 +1661,7 @@ out:
  */
 static void ram_block_populate_pages(RAMBlock *block)
 {
-    char *ptr = (char *) block->host;
-
-    for (ram_addr_t offset = 0; offset < block->used_length;
-            offset += qemu_real_host_page_size) {
-        char tmp = *(ptr + offset);
-
-        /* Don't optimize the read out */
-        asm volatile("" : "+r" (tmp));
-    }
+    populate_range(block, 0, block->used_length);
 }
 
 /*
-- 
2.31.1




* [PATCH v4 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots
  2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
                   ` (7 preceding siblings ...)
  2021-09-02 13:14 ` [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
@ 2021-09-02 13:14 ` David Hildenbrand
  8 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-02 13:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eduardo Habkost, Michael S. Tsirkin, Pankaj Gupta, Juan Quintela,
	David Hildenbrand, Dr. David Alan Gilbert, Peter Xu,
	Marek Kedzierski, Alex Williamson, teawater, Paolo Bonzini,
	Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

We already don't ever migrate memory that corresponds to discarded ranges
as managed by a RamDiscardManager responsible for the mapped memory region
of the RAMBlock.

virtio-mem uses this mechanism to logically unplug parts of a RAMBlock.
Right now, we still populate zeropages for the whole usable part of the
RAMBlock, which is undesired because:

1. Even populating the shared zeropage will result in memory getting
   consumed for page tables.
2. Memory backends without a shared zeropage (like hugetlbfs and shmem)
   will populate an actual, fresh page, resulting in an unintended
   memory consumption.

Discarded ("logically unplugged") parts have to remain discarded. As
these pages are never part of the migration stream, there is no need to
track modifications via userfaultfd WP reliably for these parts.

Further, any writes to these ranges by the VM are invalid and the
behavior is undefined.

Note that Linux only supports userfaultfd WP on private anonymous memory
for now, which usually results in the shared zeropage getting populated.
The issue will become more relevant once userfaultfd WP supports shmem
and hugetlb.
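
For context, write-protecting a range via plain Linux userfaultfd (not the
QEMU wrapper) looks roughly like this, with uffd, host_addr and size being
placeholder values:

    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)host_addr, .len = size },
        .mode = UFFDIO_REGISTER_MODE_WP,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);    /* register the range */

    struct uffdio_writeprotect wp = {
        .range = { .start = (uintptr_t)host_addr, .len = size },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp); /* arm write-protection */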

Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 migration/ram.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/migration/ram.c b/migration/ram.c
index de47650c90..2f7ceb84b8 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1650,6 +1650,16 @@ static inline void populate_range(RAMBlock *block, ram_addr_t offset,
     }
 }
 
+static inline int populate_section(MemoryRegionSection *section, void *opaque)
+{
+    const hwaddr size = int128_get64(section->size);
+    hwaddr offset = section->offset_within_region;
+    RAMBlock *block = section->mr->ram_block;
+
+    populate_range(block, offset, size);
+    return 0;
+}
+
 /*
  * ram_block_populate_pages: populate memory in the RAM block by reading
  *   an integer from the beginning of each page.
@@ -1659,9 +1669,32 @@ static inline void populate_range(RAMBlock *block, ram_addr_t offset,
  *
  * @block: RAM block to populate
  */
-static void ram_block_populate_pages(RAMBlock *block)
+static void ram_block_populate_pages(RAMBlock *rb)
 {
-    populate_range(block, 0, block->used_length);
+    /*
+     * Skip populating all pages that fall into a discarded range as managed by
+     * a RamDiscardManager responsible for the mapped memory region of the
+     * RAMBlock. Such discarded ("logically unplugged") parts of a RAMBlock
+     * must not get populated automatically. We don't have to track
+     * modifications via userfaultfd WP reliably, because these pages will
+     * not be part of the migration stream either way -- see
+     * ramblock_dirty_bitmap_exclude_discarded_pages().
+     *
+     * Note: The result is only stable while migrating (precopy/postcopy).
+     */
+    if (rb->mr && memory_region_has_ram_discard_manager(rb->mr)) {
+        RamDiscardManager *rdm = memory_region_get_ram_discard_manager(rb->mr);
+        MemoryRegionSection section = {
+            .mr = rb->mr,
+            .offset_within_region = 0,
+            .size = rb->mr->size,
+        };
+
+        ram_discard_manager_replay_populated(rdm, &section,
+                                             populate_section, NULL);
+    } else {
+        populate_range(rb, 0, rb->used_length);
+    }
 }
 
 /*
-- 
2.31.1




* Re: [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-02 13:14 ` [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
@ 2021-09-02 22:28   ` Peter Xu
  2021-09-03  7:45     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-09-02 22:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On Thu, Sep 02, 2021 at 03:14:31PM +0200, David Hildenbrand wrote:
> Let's factor out prefaulting/populating to make further changes easier to
> review. While at it, use the actual page size of the ramblock, which
> defaults to qemu_real_host_page_size for anonymous memory.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  migration/ram.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index e1c158dc92..de47650c90 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1639,6 +1639,17 @@ out:
>      return ret;
>  }
>  
> +static inline void populate_range(RAMBlock *block, ram_addr_t offset,
> +                                  ram_addr_t size)
> +{
> +    for (; offset < size; offset += block->page_size) {
> +        char tmp = *((char *)block->host + offset);
> +
> +        /* Don't optimize the read out */
> +        asm volatile("" : "+r" (tmp));
> +    }
> +}

If to make it a common function, make it populate_range_read()?

Just to identify from RW, as we'll fill the holes with zero pages only, not
doing page allocations yet, so not a complete "populate".

That'll be good enough for live snapshot as uffd-wp works for zero pages,
however I'm just afraid it may stop working for some new users of it when zero
pages won't suffice.

Maybe some comment would help too?

-- 
Peter Xu




* Re: [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-02 13:14 ` [PATCH v4 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
@ 2021-09-02 22:32   ` Peter Xu
  2021-09-03  8:47     ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-09-02 22:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On Thu, Sep 02, 2021 at 03:14:30PM +0200, David Hildenbrand wrote:
> diff --git a/migration/migration.c b/migration/migration.c
> index bb909781b7..ae97c2c461 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>                                RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> -    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
> +    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));

Is uintptr_t still needed?  I thought it would generate a warning otherwise but
not sure.

Also, maybe ROUND_DOWN() is better?  QEMU_ALIGN_DOWN is the slow version for
arbitrary numbers.
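
For reference, the two are defined in include/qemu/osdep.h roughly as:

    /* requires d to be a power of two; compiles to a mask */
    #define ROUND_DOWN(n, d)      ((n) & -(0 ? (n) : (d)))

    /* works for arbitrary m; compiles to a division */
    #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))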

-- 
Peter Xu




* Re: [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-02 22:28   ` Peter Xu
@ 2021-09-03  7:45     ` David Hildenbrand
  2021-09-03  7:58       ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03  7:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 00:28, Peter Xu wrote:
> On Thu, Sep 02, 2021 at 03:14:31PM +0200, David Hildenbrand wrote:
>> Let's factor out prefaulting/populating to make further changes easier to
>> review. While at it, use the actual page size of the ramblock, which
>> defaults to qemu_real_host_page_size for anonymous memory.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   migration/ram.c | 21 ++++++++++++---------
>>   1 file changed, 12 insertions(+), 9 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index e1c158dc92..de47650c90 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1639,6 +1639,17 @@ out:
>>       return ret;
>>   }
>>   
>> +static inline void populate_range(RAMBlock *block, ram_addr_t offset,
>> +                                  ram_addr_t size)
>> +{
>> +    for (; offset < size; offset += block->page_size) {
>> +        char tmp = *((char *)block->host + offset);
>> +
>> +        /* Don't optimize the read out */
>> +        asm volatile("" : "+r" (tmp));
>> +    }
>> +}
> 
> If to make it a common function, make it populate_range_read()?

Indeed, makes sense.

> 
> Just to identify from RW, as we'll fill the holes with zero pages only, not
> doing page allocations yet, so not a complete "populate".

Well, depending on the actual memory backend ...

> 
> That'll be good enough for live snapshot as uffd-wp works for zero pages,
> however I'm just afraid it may stop working for some new users of it when zero
> pages won't suffice.

I thought about that as well. But snapshots/migration will read all 
memory either way and consume real memory when there is no shared zero 
page. So it's just shifting the point in time when we allocate all these 
pages I guess.

> 
> Maybe some comment would help too?
>
Yes, will do, thanks!

-- 
Thanks,

David / dhildenb




* Re: [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-03  7:45     ` David Hildenbrand
@ 2021-09-03  7:58       ` David Hildenbrand
  2021-09-03 19:20         ` Peter Xu
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03  7:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

>> That'll be good enough for live snapshot as uffd-wp works for zero pages,
>> however I'm just afraid it may stop working for some new users of it when zero
>> pages won't suffice.
> 
> I thought about that as well. But snapshots/migration will read all
> memory either way and consume real memory when there is no shared zero
> page. So it's just shifting the point in time when we allocate all these
> pages, I guess.

... thinking again, even when populating on shmem and friends there is 
nothing stopping pages from getting mapped out again.

What would happen when trying uffd-wp protection on a pte_none() in your 
current shmem implementation? Will it look up whether there is something 
in the page cache (not a hole) and set a PTE marker? Or will it simply 
skip, as there is currently nothing in the page table? Or will it 
unconditionally install a PTE marker, even if there is a hole?

Having an uffd-wp mode that doesn't require pre-population would really 
be great. I remember you shared prototypes.
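
For reference, the protection step itself is just an ioctl on the
registered uffd; roughly like this (simplified sketch, error handling
mostly omitted, assuming the range was registered with
UFFDIO_REGISTER_MODE_WP):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Simplified sketch: write-protect [addr, addr + len) via uffd-wp. */
static int uffd_wp_range(int uffd, void *addr, uint64_t len)
{
    struct uffdio_writeprotect wp = {
        .range = {
            .start = (uint64_t)(uintptr_t)addr,
            .len = len,
        },
        .mode = UFFDIO_WRITEPROTECT_MODE_WP,
    };

    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) ? -errno : 0;
}

Anything that is still pte_none() at that point simply has nothing to
protect yet, which is why the pre-population matters today.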

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-02 22:32   ` Peter Xu
@ 2021-09-03  8:47     ` David Hildenbrand
  2021-09-03 10:07       ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03  8:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 00:32, Peter Xu wrote:
> On Thu, Sep 02, 2021 at 03:14:30PM +0200, David Hildenbrand wrote:
>> diff --git a/migration/migration.c b/migration/migration.c
>> index bb909781b7..ae97c2c461 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>>   int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>>                                 RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>>   {
>> -    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
>> +    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
> 
> Is uintptr_t still needed?  I thought it would generate a warning otherwise but
> not sure.

It doesn't in my setup, but maybe it will on 32bit archs ...

I discussed this with Phil in

https://lkml.kernel.org/r/2c8d80ad-f171-7d5f-3235-92f02fa174b3@redhat.com

Maybe

QEMU_ALIGN_PTR_DOWN((void *)haddr, qemu_ram_pagesize(rb))

is really what we want.

> 
> Also, maybe ROUND_DOWN() is better?  QEMU_ALIGN_DOWN() is the slow version
> that also works for arbitrary (non-power-of-two) numbers.

We do have exactly 2 direct users of ROUND_DOWN() in the tree (well, we 
do have some more for ROUND_UP) :)

QEMU_ALIGN_DOWN vs. QEMU_ALIGN_UP is much easier to map and understand,
IMHO, and there is usually little need to optimize.

I do wonder how much of a difference it actually makes on
modern CPUs ...
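
For reference, the competing definitions (as in include/qemu/osdep.h at
the time, comments abbreviated; the ROUND_* variants only work when the
divisor is a power of two):

/* Round number down to multiple; d must be a power of two. */
#define ROUND_DOWN(n, d)      ((n) & -(0 ? (n) : (d)))
/* Round number up to multiple; d must be a power of two. */
#define ROUND_UP(n, d)        (((n) + (d) - 1) & -(0 ? (n) : (d)))
/* Round number down to multiple; works for any m. */
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
/* Round number up to multiple; works for any m. */
#define QEMU_ALIGN_UP(n, m)   QEMU_ALIGN_DOWN((n) + (m) - 1, m)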

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-03  8:47     ` David Hildenbrand
@ 2021-09-03 10:07       ` David Hildenbrand
  2021-09-03 10:22         ` David Hildenbrand
  2021-09-03 19:14         ` Peter Xu
  0 siblings, 2 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03 10:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 10:47, David Hildenbrand wrote:
> On 03.09.21 00:32, Peter Xu wrote:
>> On Thu, Sep 02, 2021 at 03:14:30PM +0200, David Hildenbrand wrote:
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index bb909781b7..ae97c2c461 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>>>    int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>>>                                  RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>>>    {
>>> -    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
>>> +    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
>>
>> Is uintptr_t still needed?  I thought it would generate a warning otherwise but
>> not sure.
> 
> It doesn't in my setup, but maybe it will on 32bit archs ...
> 
> I discussed this with Phil in
> 
> https://lkml.kernel.org/r/2c8d80ad-f171-7d5f-3235-92f02fa174b3@redhat.com
> 
> Maybe
> 
> QEMU_ALIGN_PTR_DOWN((void *)haddr, qemu_ram_pagesize(rb))
> 
> is really what we want.

... but it would suffer the same issue, I think. I just ran it through the 
gitlab pipeline, including "i386-fedora-cross-compile" ... and it seems 
to compile just fine, which is weird, because I'd also expect

"warning: cast to pointer from integer of different size 
[-Wint-to-pointer-cast]"

We most certainly need the "(void *)(uintptr_t)" to convert from u64 to 
a pointer.

Let's just do it cleanly:

void *unaligned = (void *)(uintptr_t)haddr;
void *aligned = QEMU_ALIGN_PTR_DOWN(unaligned, qemu_ram_pagesize(rb));

Thoughts?
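
(For context: QEMU_ALIGN_PTR_DOWN already does the uintptr_t round-trip
internally; from include/qemu/osdep.h, modulo comment wording:

#define QEMU_ALIGN_PTR_DOWN(p, n) \
    ((typeof(p))QEMU_ALIGN_DOWN((uintptr_t)(p), (n)))

so the only conversion we'd have to spell out ourselves is the one from
u64 to void *.)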

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-03 10:07       ` David Hildenbrand
@ 2021-09-03 10:22         ` David Hildenbrand
  2021-09-03 19:14         ` Peter Xu
  1 sibling, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03 10:22 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 12:07, David Hildenbrand wrote:
> On 03.09.21 10:47, David Hildenbrand wrote:
>> On 03.09.21 00:32, Peter Xu wrote:
>>> On Thu, Sep 02, 2021 at 03:14:30PM +0200, David Hildenbrand wrote:
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index bb909781b7..ae97c2c461 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>>>>     int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>>>>                                   RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>>>>     {
>>>> -    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
>>>> +    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
>>>
>>> Is uintptr_t still needed?  I thought it would generate a warning otherwise but
>>> not sure.
>>
>> It doesn't in my setup, but maybe it will on 32bit archs ...
>>
>> I discussed this with Phil in
>>
>> https://lkml.kernel.org/r/2c8d80ad-f171-7d5f-3235-92f02fa174b3@redhat.com
>>
>> Maybe
>>
>> QEMU_ALIGN_PTR_DOWN((void *)haddr, qemu_ram_pagesize(rb))
>>
>> is really what we want.
> 
> ... but it would suffer the same issue, I think. I just ran it through the
> gitlab pipeline, including "i386-fedora-cross-compile" ... and it seems
> to compile just fine, which is weird, because I'd also expect

[I know, talking to myself] Some 32bit tests actually did fail later, 
so the CI is able to catch this properly.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-03 10:07       ` David Hildenbrand
  2021-09-03 10:22         ` David Hildenbrand
@ 2021-09-03 19:14         ` Peter Xu
  2021-09-03 19:37           ` David Hildenbrand
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-09-03 19:14 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On Fri, Sep 03, 2021 at 12:07:20PM +0200, David Hildenbrand wrote:
> On 03.09.21 10:47, David Hildenbrand wrote:
> > On 03.09.21 00:32, Peter Xu wrote:
> > > On Thu, Sep 02, 2021 at 03:14:30PM +0200, David Hildenbrand wrote:
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index bb909781b7..ae97c2c461 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
> > > >    int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> > > >                                  RAMBlock *rb, ram_addr_t start, uint64_t haddr)
> > > >    {
> > > > -    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
> > > > +    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
> > > 
> > > Is uintptr_t still needed?  I thought it would generate a warning otherwise but
> > > not sure.
> > 
> > It doesn't in my setup, but maybe it will on 32bit archs ...
> > 
> > I discussed this with Phil in
> > 
> > https://lkml.kernel.org/r/2c8d80ad-f171-7d5f-3235-92f02fa174b3@redhat.com
> > 
> > Maybe
> > 
> > QEMU_ALIGN_PTR_DOWN((void *)haddr, qemu_ram_pagesize(rb))
> > 
> > is really what we want.
> 
> ... but it would suffer the same issue, I think. I just ran it through the
> gitlab pipeline, including "i386-fedora-cross-compile" ... and it seems to
> compile just fine, which is weird, because I'd also expect
> 
> "warning: cast to pointer from integer of different size
> [-Wint-to-pointer-cast]"
> 
> We most certainly need the "(void *)(uintptr_t)" to convert from u64 to a
> pointer.
> 
> Let's just do it cleanly:
> 
> void *unaligned = (void *)(uintptr_t)haddr;
> void *aligned = QEMU_ALIGN_PTR_DOWN(unaligned, qemu_ram_pagesize(rb));
> 
> Thoughts?

---8<---
$ cat a.c
#include <stdio.h>
#include <time.h>
#include <assert.h>

#define ROUND_DOWN(n, d) ((n) & -(0 ? (n) : (d)))
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))

unsigned long getns(void)
{
    struct timespec tp;

    clock_gettime(CLOCK_MONOTONIC, &tp);
    return tp.tv_sec * 1000000000 + tp.tv_nsec;
}

int main(void)
{
    int i;
    unsigned long start, end, v1 = 0x1234567890, v2 = 0x1000;

    start = getns();
    for (i = 0; i < 1000000; i++) {
        v1 = ROUND_DOWN(v1, v2);
    }
    end = getns();
    printf("ROUND_DOWN took: \t%ld (us)\n", (end - start) / 1000);

    start = getns();
    for (i = 0; i < 1000000; i++) {
        v1 = QEMU_ALIGN_DOWN(v1, v2);
    }
    end = getns();
    printf("QEMU_ALIGN_DOWN took: \t%ld (us)\n", (end - start) / 1000);
}
$ make a
$ ./a
ROUND_DOWN took:        1445 (us)
QEMU_ALIGN_DOWN took:   9684 (us)
---8<---

So it's ~5 times slower here on the laptop, even if the numbers are not very
stable.  Agreed, it's not a big deal. :)

It's just that, since we know it's still faster, I'd then second:

  (uintptr_t)ROUND_DOWN(...);

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-03  7:58       ` David Hildenbrand
@ 2021-09-03 19:20         ` Peter Xu
  2021-09-03 19:40           ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2021-09-03 19:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On Fri, Sep 03, 2021 at 09:58:06AM +0200, David Hildenbrand wrote:
> > > That'll be good enough for live snapshots, as uffd-wp works for zero pages;
> > > however, I'm just afraid it may stop working for some new users of it when zero
> > > pages won't suffice.
> > 
> > I thought about that as well. But snapshots/migration will read all
> > memory either way and consume real memory when there is no shared zero
> > page. So it's just shifting the point in time when we allocate all these
> > pages, I guess.
> 
> ... thinking again, even when populating on shmem and friends there is
> nothing stopping pages from getting mapped out again.
> 
> What would happen when trying uffd-wp protection on a pte_none() in your
> current shmem implementation? Will it look up whether there is something in
> the page cache (not a hole) and set a PTE marker? Or will it simply skip, as
> there is currently nothing in the page table? Or will it
> unconditionally install a PTE marker, even if there is a hole?

It (will - I haven't rebased and posted yet) sets a pte marker.  So uffd-wp will
always work on read prefault regardless of memory type in the future.

> 
> Having an uffd-wp mode that doesn't require pre-population would really be
> great. I remember you shared prototypes.

Yes, I planned to do that after the shmem bits, because they conflict
somewhat. I don't want to mess with the current series any further either;
it is already hard to push, which is very unfortunate.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 7/9] migration: Simplify alignment and alignment checks
  2021-09-03 19:14         ` Peter Xu
@ 2021-09-03 19:37           ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03 19:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 21:14, Peter Xu wrote:
> On Fri, Sep 03, 2021 at 12:07:20PM +0200, David Hildenbrand wrote:
>> On 03.09.21 10:47, David Hildenbrand wrote:
>>> On 03.09.21 00:32, Peter Xu wrote:
>>>> On Thu, Sep 02, 2021 at 03:14:30PM +0200, David Hildenbrand wrote:
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index bb909781b7..ae97c2c461 100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -391,7 +391,7 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>>>>>     int migrate_send_rp_req_pages(MigrationIncomingState *mis,
>>>>>                                   RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>>>>>     {
>>>>> -    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
>>>>> +    void *aligned = (void *)QEMU_ALIGN_DOWN(haddr, qemu_ram_pagesize(rb));
>>>>
>>>> Is uintptr_t still needed?  I thought it would generate a warning otherwise but
>>>> not sure.
>>>
>>> It doesn't in my setup, but maybe it will on 32bit archs ...
>>>
>>> I discussed this with Phil in
>>>
>>> https://lkml.kernel.org/r/2c8d80ad-f171-7d5f-3235-92f02fa174b3@redhat.com
>>>
>>> Maybe
>>>
>>> QEMU_ALIGN_PTR_DOWN((void *)haddr, qemu_ram_pagesize(rb))
>>>
>>> is really what we want.
>>
>> ... but it would suffer the same issue, I think. I just ran it through the
>> gitlab pipeline, including "i386-fedora-cross-compile" ... and it seems to
>> compile just fine, which is weird, because I'd also expect
>>
>> "warning: cast to pointer from integer of different size
>> [-Wint-to-pointer-cast]"
>>
>> We most certainly need the "(void *)(uintptr_t)" to convert from u64 to a
>> pointer.
>>
>> Let's just do it cleanly:
>>
>> void *unaligned = (void *)(uintptr_t)haddr;
>> void *aligned = QEMU_ALIGN_PTR_DOWN(unaligned, qemu_ram_pagesize(rb));
>>
>> Thoughts?
> 
> ---8<---
> $ cat a.c
> #include <stdio.h>
> #include <time.h>
> #include <assert.h>
> 
> #define ROUND_DOWN(n, d) ((n) & -(0 ? (n) : (d)))
> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
> 
> unsigned long getns(void)
> {
>      struct timespec tp;
> 
>      clock_gettime(CLOCK_MONOTONIC, &tp);
>      return tp.tv_sec * 1000000000 + tp.tv_nsec;
> }
> 
> int main(void)
> {
>      int i;
>      unsigned long start, end, v1 = 0x1234567890, v2 = 0x1000;
> 
>      start = getns();
>      for (i = 0; i < 1000000; i++) {
>          v1 = ROUND_DOWN(v1, v2);
>      }
>      end = getns();
>      printf("ROUND_DOWN took: \t%ld (us)\n", (end - start) / 1000);
> 
>      start = getns();
>      for (i = 0; i < 1000000; i++) {
>          v1 = QEMU_ALIGN_DOWN(v1, v2);
>      }
>      end = getns();
>      printf("QEMU_ALIGN_DOWN took: \t%ld (us)\n", (end - start) / 1000);
> }
> $ make a
> $ ./a
> ROUND_DOWN took:        1445 (us)
> QEMU_ALIGN_DOWN took:   9684 (us)
> ---8<---
> 
> So it's ~5 times slower here on the laptop, even if the numbers are not very
> stable.  Agreed, it's not a big deal. :)

Same results for me, even if I turn v1 and v2 into global volatiles, make
sure the results won't get optimized out, and compile with -O3.
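
I.e., something like this (sketch of the variant I tried):

#include <stdio.h>
#include <time.h>

#define ROUND_DOWN(n, d) ((n) & -(0 ? (n) : (d)))
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* volatile globals: force real loads/stores so the loops can't be folded */
static volatile unsigned long v1 = 0x1234567890, v2 = 0x1000, sink;

static unsigned long getns(void)
{
    struct timespec tp;

    clock_gettime(CLOCK_MONOTONIC, &tp);
    return tp.tv_sec * 1000000000ul + tp.tv_nsec;
}

int main(void)
{
    unsigned long start;
    int i;

    start = getns();
    for (i = 0; i < 1000000; i++) {
        sink = ROUND_DOWN(v1, v2);
    }
    printf("ROUND_DOWN took: \t%lu (us)\n", (getns() - start) / 1000);

    start = getns();
    for (i = 0; i < 1000000; i++) {
        sink = QEMU_ALIGN_DOWN(v1, v2);
    }
    printf("QEMU_ALIGN_DOWN took: \t%lu (us)\n", (getns() - start) / 1000);
    return 0;
}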

> 
> It's just that, since we know it's still faster, I'd then second:
> 
>    (uintptr_t)ROUND_DOWN(...);

Well okay then,

void *aligned = (void *)(uintptr_t)ROUND_DOWN(haddr, qemu_ram_pagesize(rb));

fits precisely into a single line :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-03 19:20         ` Peter Xu
@ 2021-09-03 19:40           ` David Hildenbrand
  2021-09-03 19:45             ` David Hildenbrand
  0 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03 19:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 21:20, Peter Xu wrote:
> On Fri, Sep 03, 2021 at 09:58:06AM +0200, David Hildenbrand wrote:
>>>> That'll be good enough for live snapshots, as uffd-wp works for zero pages;
>>>> however, I'm just afraid it may stop working for some new users of it when zero
>>>> pages won't suffice.
>>>
>>> I thought about that as well. But snapshots/migration will read all
>>> memory either way and consume real memory when there is no shared zero
>>> page. So it's just shifting the point in time when we allocate all these
>>> pages, I guess.
>>
>> ... thinking again, even when populating on shmem and friends there is
>> nothing stopping pages from getting mapped out again.
>>
>> What would happen when trying uffd-wp protection on a pte_none() in your
>> current shmem implementation? Will it look up whether there is something in
>> the page cache (not a hole) and set a PTE marker? Or will it simply skip, as
>> there is currently nothing in the page table? Or will it
>> unconditionally install a PTE marker, even if there is a hole?
> 
> It (will - I haven't rebased and posted yet) sets a pte marker.  So uffd-wp will
> always work on read prefault regardless of memory type in the future.
> 
>>
>> Having an uffd-wp mode that doesn't require pre-population would really be
>> great. I remember you shared prototypes.
> 
> Yes, I planned to do that after the shmem bits, because they conflict
> somewhat. I don't want to mess with the current series any further either;
> it is already hard to push, which is very unfortunate.
> 

Yeah ... alternatively, we could simply populate the shared zeropage on 
private anonymous memory when trying to protect a pte_none(). That might 
actually be a very elegant solution.
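
(From user space today, the closest equivalent would be placing zeropages
explicitly before protecting; a rough sketch, assuming 'uffd' was
registered with UFFDIO_REGISTER_MODE_MISSING over the range:

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Rough sketch: map the shared zeropage into [start, start + len). */
static int place_zeropages(int uffd, uint64_t start, uint64_t len)
{
    struct uffdio_zeropage zp = {
        .range = { .start = start, .len = len },
        .mode = 0,
    };

    /* EEXIST means (parts of) the range were already populated. */
    if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) && errno != EEXIST) {
        return -errno;
    }
    return 0;
}

Doing it in the kernel during UFFDIO_WRITEPROTECT would avoid the extra
round trip, though.)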

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages()
  2021-09-03 19:40           ` David Hildenbrand
@ 2021-09-03 19:45             ` David Hildenbrand
  0 siblings, 0 replies; 22+ messages in thread
From: David Hildenbrand @ 2021-09-03 19:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: Eduardo Habkost, Juan Quintela, Pankaj Gupta, Michael S. Tsirkin,
	teawater, qemu-devel, Dr. David Alan Gilbert, Alex Williamson,
	Marek Kedzierski, Paolo Bonzini, Philippe Mathieu-Daudé,
	Andrey Gruzdev, Wei Yang

On 03.09.21 21:40, David Hildenbrand wrote:
> On 03.09.21 21:20, Peter Xu wrote:
>> On Fri, Sep 03, 2021 at 09:58:06AM +0200, David Hildenbrand wrote:
>>>>> That'll be good enough for live snapshots, as uffd-wp works for zero pages;
>>>>> however, I'm just afraid it may stop working for some new users of it when zero
>>>>> pages won't suffice.
>>>>
>>>> I thought about that as well. But snapshots/migration will read all
>>>> memory either way and consume real memory when there is no shared zero
>>>> page. So it's just shifting the point in time when we allocate all these
>>>> pages, I guess.
>>>
>>> ... thinking again, even when populating on shmem and friends there is
>>> nothing stopping pages from getting mapped out again.
>>>
>>> What would happen when trying uffd-wp protection on a pte_none() in your
>>> current shmem implementation? Will it look up whether there is something in
>>> the page cache (not a hole) and set a PTE marker? Or will it simply skip, as
>>> there is currently nothing in the page table? Or will it
>>> unconditionally install a PTE marker, even if there is a hole?
>>
>> It (will - I haven't rebased and posted yet) sets a pte marker.  So uffd-wp will
>> always work on read prefault regardless of memory type in the future.
>>
>>>
>>> Having an uffd-wp mode that doesn't require pre-population would really be
>>> great. I remember you shared prototypes.
>>
>> Yes, I planned to do that after the shmem bits, because they conflict
>> somewhat. I don't want to mess with the current series any further either;
>> it is already hard to push, which is very unfortunate.
>>
> 
> Yeah ... alternatively, we could simply populate the shared zeropage on
> private anonymous memory when trying to protect a pte_none(). That might
> actually be a very elegant solution.
> 

Oh well, it's late in Germany ... doing it properly completely avoids 
having to modify/allocate page tables. So that is certainly the better 
approach.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-09-03 20:02 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-02 13:14 [PATCH v4 0/9] migration/ram: Optimize for virtio-mem via RamDiscardManager David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 1/9] memory: Introduce replay_discarded callback for RamDiscardManager David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 2/9] virtio-mem: Implement replay_discarded RamDiscardManager callback David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 3/9] migration/ram: Don't passs RAMState to migration_clear_memory_region_dirty_bitmap_*() David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 4/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on the migration source David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 5/9] virtio-mem: Drop precopy notifier David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 6/9] migration/postcopy: Handle RAMBlocks with a RamDiscardManager on the destination David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 7/9] migration: Simplify alignment and alignment checks David Hildenbrand
2021-09-02 22:32   ` Peter Xu
2021-09-03  8:47     ` David Hildenbrand
2021-09-03 10:07       ` David Hildenbrand
2021-09-03 10:22         ` David Hildenbrand
2021-09-03 19:14         ` Peter Xu
2021-09-03 19:37           ` David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 8/9] migration/ram: Factor out populating pages readable in ram_block_populate_pages() David Hildenbrand
2021-09-02 22:28   ` Peter Xu
2021-09-03  7:45     ` David Hildenbrand
2021-09-03  7:58       ` David Hildenbrand
2021-09-03 19:20         ` Peter Xu
2021-09-03 19:40           ` David Hildenbrand
2021-09-03 19:45             ` David Hildenbrand
2021-09-02 13:14 ` [PATCH v4 9/9] migration/ram: Handle RAMBlocks with a RamDiscardManager on background snapshots David Hildenbrand
