* [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets
From: Peter Xu @ 2020-04-28 19:42 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Tian Kevin, Dr . David Alan Gilbert, peterx,
	Juan Quintela

This RFC series starts from the fact that we currently sync the dirty bitmap
when removing a KVM memslot.  IIUC that was mainly to maintain the dirty
bitmap even across a system reboot.

This series wants to move that sync from kvm memslot removal to system reset.

(I still don't know why the system would need to keep the RAM state from
 before the reset.  I thought things like kdump might use it to retrieve info
 from a previous kernel panic; however, IIUC that's not what kdump does now.
 Anyway, I'd be more than glad if anyone knows the real scenario behind
 this...)

The current solution (sync at kvm memslot removal) works in most cases, but:

  - it will be nearly impossible to make work for the dirty ring, and,

  - it has an existing race-condition flaw. [1]

So if system reset is the only thing we care about here, I'm thinking whether
we can move this sync explicitly to system reset, so we do a global sync there
instead of syncing every time the memory layout changes and causes memory
removals.  I think it is more explicit to sync during system reset, and in
that context it will also be far easier for the kvm dirty ring to provide the
same logic.

This is totally RFC because I'm still trying to find out whether there are
other cases besides system reset where we want to keep the dirty bits for a
to-be-removed memslot (real memory removals like unplugging memory shouldn't
matter, because we won't care about the dirty bits if the memory is never
coming back, not to mention we won't allow such things during a migration).
So far I don't see any.

I've run some tests using either the old dirty log or the dirty ring, with
either some memory load or reboots on the source, and I see no issues so far.

Comments greatly welcomed.  Thanks.

[1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/

Peter Xu (4):
  migration: Export migration_bitmap_sync_precopy()
  migration: Introduce migrate_is_precopy()
  vl: Sync dirty bits for system resets during precopy
  kvm: No need to sync dirty bitmap before memslot removal any more

 accel/kvm/kvm-all.c      |  3 ---
 include/migration/misc.h |  2 ++
 migration/migration.c    |  7 +++++++
 migration/ram.c          | 10 +++++-----
 softmmu/vl.c             | 16 ++++++++++++++++
 5 files changed, 30 insertions(+), 8 deletions(-)

-- 
2.24.1




* [PATCH RFC 1/4] migration: Export migration_bitmap_sync_precopy()
From: Peter Xu @ 2020-04-28 19:42 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Tian Kevin, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Make it usable outside migration.  To make it easier to use, remove the
RAMState parameter, since ram.c already has a direct reference to ram_state
in its context.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/migration/misc.h |  1 +
 migration/ram.c          | 10 +++++-----
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index d2762257aa..e338be8c30 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -66,6 +66,7 @@ void remove_migration_state_change_notifier(Notifier *notify);
 bool migration_in_setup(MigrationState *);
 bool migration_has_finished(MigrationState *);
 bool migration_has_failed(MigrationState *);
+void migration_bitmap_sync_precopy(void);
 /* ...and after the device transmission */
 bool migration_in_postcopy_after_devices(MigrationState *);
 void migration_global_dump(Monitor *mon);
diff --git a/migration/ram.c b/migration/ram.c
index 04f13feb2e..d737175d4e 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -970,7 +970,7 @@ static void migration_bitmap_sync(RAMState *rs)
     }
 }
 
-static void migration_bitmap_sync_precopy(RAMState *rs)
+void migration_bitmap_sync_precopy(void)
 {
     Error *local_err = NULL;
 
@@ -983,7 +983,7 @@ static void migration_bitmap_sync_precopy(RAMState *rs)
         local_err = NULL;
     }
 
-    migration_bitmap_sync(rs);
+    migration_bitmap_sync(ram_state);
 
     if (precopy_notify(PRECOPY_NOTIFY_AFTER_BITMAP_SYNC, &local_err)) {
         error_report_err(local_err);
@@ -2303,7 +2303,7 @@ static void ram_init_bitmaps(RAMState *rs)
     WITH_RCU_READ_LOCK_GUARD() {
         ram_list_init_bitmaps();
         memory_global_dirty_log_start();
-        migration_bitmap_sync_precopy(rs);
+        migration_bitmap_sync_precopy();
     }
     qemu_mutex_unlock_ramlist();
     qemu_mutex_unlock_iothread();
@@ -2592,7 +2592,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
 
     WITH_RCU_READ_LOCK_GUARD() {
         if (!migration_in_postcopy()) {
-            migration_bitmap_sync_precopy(rs);
+            migration_bitmap_sync_precopy();
         }
 
         ram_control_before_iterate(f, RAM_CONTROL_FINISH);
@@ -2642,7 +2642,7 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
         remaining_size < max_size) {
         qemu_mutex_lock_iothread();
         WITH_RCU_READ_LOCK_GUARD() {
-            migration_bitmap_sync_precopy(rs);
+            migration_bitmap_sync_precopy();
         }
         qemu_mutex_unlock_iothread();
         remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
-- 
2.24.1




* [PATCH RFC 2/4] migration: Introduce migrate_is_precopy()
From: Peter Xu @ 2020-04-28 19:42 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Tian Kevin, Dr . David Alan Gilbert, peterx,
	Juan Quintela

Export a helper globally to check whether we're in a precopy.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/migration/misc.h | 1 +
 migration/migration.c    | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/migration/misc.h b/include/migration/misc.h
index e338be8c30..b4f6bf7842 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -61,6 +61,7 @@ void migration_shutdown(void);
 void qemu_start_incoming_migration(const char *uri, Error **errp);
 bool migration_is_idle(void);
 bool migration_is_active(MigrationState *);
+bool migration_is_precopy(void);
 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
 bool migration_in_setup(MigrationState *);
diff --git a/migration/migration.c b/migration/migration.c
index 187ac0410c..0082880279 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1795,6 +1795,13 @@ bool migration_is_active(MigrationState *s)
             s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
 }
 
+bool migration_is_precopy(void)
+{
+    MigrationState *s = migrate_get_current();
+
+    return s && s->state == MIGRATION_STATUS_ACTIVE;
+}
+
 void migrate_init(MigrationState *s)
 {
     /*
-- 
2.24.1




* [PATCH RFC 3/4] vl: Sync dirty bits for system resets during precopy
From: Peter Xu @ 2020-04-28 19:42 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Tian Kevin, Dr . David Alan Gilbert, peterx,
	Juan Quintela

System resets also reset the system memory layout.  Although the memory
layout after the reset should probably be the same as before the reset, we
still do frequent memory section removals and additions during the reset
process.  Those operations could accidentally lose per-mem-section
information like KVM memslot dirty bitmaps.

Previously we kept those dirty bitmaps by syncing them during memory removal.
However, that is hard to get right after all [1].  Instead, we sync the dirty
pages before a system reset if we know we're in a precopy migration.  This
should solve the same problem explicitly.

[1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 softmmu/vl.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 32c0047889..8f864fee43 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -1387,6 +1387,22 @@ void qemu_system_reset(ShutdownCause reason)
 
     cpu_synchronize_all_states();
 
+    /*
+     * System reboot could reset memory layout.  Although the final status of
+     * the memory layout should be the same as before the reset, the memory
+     * sections can still be removed and added back frequently due to the reset
+     * process.  This could potentially drop the dirty bits tracked for
+     * those memory sections before the reset.
+     *
+     * Do a global dirty sync before the reset happens if we are in a
+     * precopy, so we don't lose the dirty bits during the memory shuffles.
+     */
+    if (migration_is_precopy()) {
+        WITH_RCU_READ_LOCK_GUARD() {
+            migration_bitmap_sync_precopy();
+        }
+    }
+
     if (mc && mc->reset) {
         mc->reset(current_machine);
     } else {
-- 
2.24.1




* [PATCH RFC 4/4] kvm: No need to sync dirty bitmap before memslot removal any more
From: Peter Xu @ 2020-04-28 19:42 UTC
  To: qemu-devel
  Cc: Paolo Bonzini, Tian Kevin, Dr . David Alan Gilbert, peterx,
	Juan Quintela

With the system reset dirty sync in qemu_system_reset(), we should be able
to drop this operation now.  After all, it doesn't really fix the problem
cleanly, because logically we could still have a race [1].

[1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 439a4efe52..e1c87fa4e1 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1061,9 +1061,6 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
             if (!mem) {
                 goto out;
             }
-            if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
-                kvm_physical_sync_dirty_bitmap(kml, section);
-            }
 
             /* unregister the slot */
             g_free(mem->dirty_bmap);
-- 
2.24.1




* Re: [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets
From: Dr. David Alan Gilbert @ 2020-04-29 13:26 UTC
  To: Peter Xu; +Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> This RFC series starts from the fact that we currently sync the dirty bitmap
> when removing a KVM memslot.  IIUC that was mainly to maintain the dirty
> bitmap even across a system reboot.
> 
> This series wants to move that sync from kvm memslot removal to system reset.
> 
> (I still don't know why the system would need to keep the RAM state from
>  before the reset.  I thought things like kdump might use it to retrieve info
>  from a previous kernel panic; however, IIUC that's not what kdump does now.
>  Anyway, I'd be more than glad if anyone knows the real scenario behind
>  this...)

Aren't there pages written by the BIOS that are read by the system as it
comes up through reset - so you need those pages intact?
(But I don't think that slot gets removed? Or does it - the BIOS has
some weird aliasing.)

> The current solution (sync at kvm memslot removal) works in most cases, but:
> 
>   - it will be nearly impossible to make work for the dirty ring, and,

Why doesn't that work with dirty ring?

>   - it has an existing race-condition flaw. [1]
> 
> So if system reset is the only thing we care about here, I'm thinking whether
> we can move this sync explicitly to system reset, so we do a global sync there
> instead of syncing every time the memory layout changes and causes memory
> removals.  I think it is more explicit to sync during system reset, and in
> that context it will also be far easier for the kvm dirty ring to provide the
> same logic.
> 
> This is totally RFC because I'm still trying to find out whether there are
> other cases besides system reset where we want to keep the dirty bits for a
> to-be-removed memslot (real memory removals like unplugging memory shouldn't
> matter, because we won't care about the dirty bits if the memory is never
> coming back, not to mention we won't allow such things during a migration).
> So far I don't see any.

I'm still unsure when slot removal happens for real; but if it's
happening for RAM on PCI devices, then that would make sense as
something you might want to keep.

Dave

> I've run some tests using either the old dirty log or the dirty ring, with
> either some memory load or reboots on the source, and I see no issues so far.
> 
> Comments greatly welcomed.  Thanks.
> 
> [1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/
> 
> Peter Xu (4):
>   migration: Export migration_bitmap_sync_precopy()
>   migration: Introduce migrate_is_precopy()
>   vl: Sync dirty bits for system resets during precopy
>   kvm: No need to sync dirty bitmap before memslot removal any more
> 
>  accel/kvm/kvm-all.c      |  3 ---
>  include/migration/misc.h |  2 ++
>  migration/migration.c    |  7 +++++++
>  migration/ram.c          | 10 +++++-----
>  softmmu/vl.c             | 16 ++++++++++++++++
>  5 files changed, 30 insertions(+), 8 deletions(-)
> 
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets
From: Peter Xu @ 2020-04-29 14:32 UTC
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

On Wed, Apr 29, 2020 at 02:26:07PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > This RFC series starts from the fact that we currently sync the dirty bitmap
> > when removing a KVM memslot.  IIUC that was mainly to maintain the dirty
> > bitmap even across a system reboot.
> > 
> > This series wants to move that sync from kvm memslot removal to system reset.
> > 
> > (I still don't know why the system would need to keep the RAM state from
> >  before the reset.  I thought things like kdump might use it to retrieve info
> >  from a previous kernel panic; however, IIUC that's not what kdump does now.
> >  Anyway, I'd be more than glad if anyone knows the real scenario behind
> >  this...)
> 
> Aren't there pages written by the BIOS that are read by the system as it
> comes up through reset - so you need those pages intact?
> (But I don't think that slot gets removed? Or does it - the BIOS has
> some weird aliasing.)

Do you know which section it writes to?  I can trace a bit to see...

My previous understanding was that once RAM starts to be written, the layout
shouldn't change anymore, but I could be wrong.

> 
> > The current solution (sync at kvm memslot removal) works in most cases, but:
> > 
> >   - it will be nearly impossible to make work for the dirty ring, and,
> 
> Why doesn't that work with dirty ring?

Because a strict sync is hard IMHO... I shouldn't say it's impossible, but
it would depend on fixes to the existing kvm get-dirty-log path, so it's even
harder (please see below).

I'll start with the normal get-dirty-log path.  Currently, QEMU's KVM code
syncs dirty bits with:

  - remove memslot
    - KVM_GET_DIRTY_LOG on memslot
    - KVM_SET_USER_MEMORY_REGION to remove

But it's racy because writes can happen between the two steps.  One correct way
to do that is:

  - remove memslot
    - KVM_SET_USER_MEMORY_REGION to set it READONLY [a]
      - kick all vcpu out synchronously to make sure dirty bits are flushed to
        KVM per-memslot dirty bitmaps [b]
    - KVM_GET_DIRTY_LOG on memslot
    - KVM_SET_USER_MEMORY_REGION to remove

The fix should contain both a QEMU fix at [a] to first mark the slot as
READONLY so no further writes can happen before the get-dirty-log, and a
kernel fix at [b] to synchronously flush each vcpu to make sure PML is
flushed (which could be very awkward; I proposed an idea [1] but I haven't
got any positive feedback yet).

[1] https://lore.kernel.org/qemu-devel/20200402223240.GG103677@xz-x1/
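
To make that concrete, here is a minimal C sketch of the corrected sequence
(step [a] plus the get-dirty-log; step [b] is only a placeholder comment).
The uapi structs and ioctls are real, but the helper name and the QEMU
plumbing are my own simplification and untested; note also that the kernel
may refuse to flip READONLY on a live slot, which is part of what fix [a]
would need to settle:

    static void kvm_slot_sync_then_remove(KVMState *s, KVMSlot *mem)
    {
        struct kvm_userspace_memory_region region = {
            .slot = mem->slot,
            .flags = mem->flags | KVM_MEM_READONLY,        /* step [a] */
            .guest_phys_addr = mem->start_addr,
            .memory_size = mem->memory_size,
            .userspace_addr = (uint64_t)(uintptr_t)mem->ram,
        };
        struct kvm_dirty_log log = {
            .slot = mem->slot,
            .dirty_bitmap = mem->dirty_bmap,
        };

        /* [a]: stop further writes from dirtying the slot */
        kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &region);
        /* [b] would go here: kick all vcpus so e.g. PML gets flushed */
        kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &log);
        /* finally remove the slot: size 0 deletes a memslot */
        region.memory_size = 0;
        kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &region);
    }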

So I think it's not easy to make it fully right even for get dirty log.

For dirty ring, the sync will look like:

  - remove memslot
    - KVM_SET_USER_MEMORY_REGION to set it READONLY [a]
      - kick all vcpu out synchronously to make sure dirty bits are flushed to
        KVM per-memslot dirty bitmaps [b]
    - collect dirty ring [c]

So it will depend on us fixing [a,b] first; then, because the dirty ring is
per-vcpu rather than per-slot, [c] needs to collect the data of all vcpus
(across all slots).  So it would need to take all the slots_lock even though
we're removing a single small memslot with one slots_lock already taken.  I
think we can still do it, but as you see, it's awkward and harder than the
legacy get-dirty-log process.

I was always wondering why this is so useful and in what cases we're using
it.  If system reset is the only place, then it'll be very clean to move the
sync there, because that's a very high point in the stack where we only hold
the BQL, and at the same time all the vcpus are paused anyway, so most of the
nasty things are gone.  We just do a simple sync like in this series and it
will work nicely for both the dirty log and the dirty ring (then we don't
need to fix [a] and [b] either).

> 
> >   - it has an existing race-condition flaw. [1]
> > 
> > So if system reset is the only thing we care about here, I'm thinking whether
> > we can move this sync explicitly to system reset, so we do a global sync there
> > instead of syncing every time the memory layout changes and causes memory
> > removals.  I think it is more explicit to sync during system reset, and in
> > that context it will also be far easier for the kvm dirty ring to provide the
> > same logic.
> > 
> > This is totally RFC because I'm still trying to find out whether there are
> > other cases besides system reset where we want to keep the dirty bits for a
> > to-be-removed memslot (real memory removals like unplugging memory shouldn't
> > matter, because we won't care about the dirty bits if the memory is never
> > coming back, not to mention we won't allow such things during a migration).
> > So far I don't see any.
> 
> I'm still unsure when slot removal happens for real; but if it's
> happening for RAM on PCI devices, then that would make sense as
> something you might want to keep.

Yeah indeed.  Though I'm also curious about a real user of that which would
frequently map/unmap an MMIO region of device RAM.  There's also a workaround
for this issue: mark the region as dirty when unmapping the MMIO region.  But
I haven't checked whether that's easy to do in QEMU; I assume not that much...
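
Just for illustration, a rough sketch of that workaround; the placement
inside kvm_set_phys_mem()'s removal path is my assumption and it's untested,
though memory_region_set_dirty() itself is the existing helper:

    /* Hypothetical workaround, not part of this series: instead of
     * fetching the precise bitmap from KVM, conservatively mark the
     * whole removed section dirty before the memslot goes away. */
    if (mem->flags & KVM_MEM_LOG_DIRTY_PAGES) {
        memory_region_set_dirty(section->mr, section->offset_within_region,
                                int128_get64(section->size));
    }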

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets
From: Peter Xu @ 2020-05-02 21:04 UTC
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

On Wed, Apr 29, 2020 at 10:32:27AM -0400, Peter Xu wrote:
> On Wed, Apr 29, 2020 at 02:26:07PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > This RFC series starts from the fact that we currently sync the dirty bitmap
> > > when removing a KVM memslot.  IIUC that was mainly to maintain the dirty
> > > bitmap even across a system reboot.
> > > 
> > > This series wants to move that sync from kvm memslot removal to system reset.
> > > 
> > > (I still don't know why the system would need to keep the RAM state from
> > >  before the reset.  I thought things like kdump might use it to retrieve info
> > >  from a previous kernel panic; however, IIUC that's not what kdump does now.
> > >  Anyway, I'd be more than glad if anyone knows the real scenario behind
> > >  this...)
> > 
> > Aren't there pages written by the BIOS that are read by the system as it
> > comes up through reset - so you need those pages intact?
> > (But I don't think that slot gets removed? Or does it - the BIOS has
> > some weird aliasing.)
> 
> Do you know which section it writes to?  I can trace a bit to see...
> 
> My previous understanding was that once RAM starts to be written, the layout
> shouldn't change anymore, but I could be wrong.
> 
> > 
> > > The current solution (sync at kvm memslot removal) works in most cases, but:
> > > 
> > >   - it will be nearly impossible to make work for the dirty ring, and,
> > 
> > Why doesn't that work with dirty ring?
> 
> Because a strict sync is hard IMHO... I shouldn't say it's impossible, but
> it would depend on fixes to the existing kvm get-dirty-log path, so it's even
> harder (please see below).
> 
> I'll start with the normal get-dirty-log path.  Currently, QEMU's KVM code
> syncs dirty bits with:
> 
>   - remove memslot
>     - KVM_GET_DIRTY_LOG on memslot
>     - KVM_SET_USER_MEMORY_REGION to remove
> 
> But it's racy because writes can happen between the two steps.  One correct way
> to do that is:
> 
>   - remove memslot
>     - KVM_SET_USER_MEMORY_REGION to set it READONLY [a]
>       - kick all vcpu out synchronously to make sure dirty bits are flushed to
>         KVM per-memslot dirty bitmaps [b]
>     - KVM_GET_DIRTY_LOG on memslot
>     - KVM_SET_USER_MEMORY_REGION to remove
> 
> The fix should contain both a QEMU fix at [a] to first mark the slot as
> READONLY so no further writes can happen before the get-dirty-log, and a
> kernel fix at [b] to synchronously flush each vcpu to make sure PML is
> flushed (which could be very awkward; I proposed an idea [1] but I haven't
> got any positive feedback yet).
> 
> [1] https://lore.kernel.org/qemu-devel/20200402223240.GG103677@xz-x1/
> 
> So I think it's not easy to make it fully right even for get dirty log.
> 
> For dirty ring, the sync will look like:
> 
>   - remove memslot
>     - KVM_SET_USER_MEMORY_REGION to set it READONLY [a]
>       - kick all vcpu out synchronously to make sure dirty bits are flushed to
>         KVM per-memslot dirty bitmaps [b]
>     - collect dirty ring [c]
> 
> So it will depend on us fixing [a,b] first; then, because the dirty ring is
> per-vcpu rather than per-slot, [c] needs to collect the data of all vcpus
> (across all slots).  So it would need to take all the slots_lock even though
> we're removing a single small memslot with one slots_lock already taken.  I
> think we can still do it, but as you see, it's awkward and harder than the
> legacy get-dirty-log process.
> 
> I was always wondering why this is so useful and in what cases we're using
> it.  If system reset is the only place, then it'll be very clean to move the
> sync there, because that's a very high point in the stack where we only hold
> the BQL, and at the same time all the vcpus are paused anyway, so most of the
> nasty things are gone.  We just do a simple sync like in this series and it
> will work nicely for both the dirty log and the dirty ring (then we don't
> need to fix [a] and [b] either).

Paolo,

Do you have any opinion on this series?  More generally, comments on your
preferred way to fix the known dirty sync issue with memslot removal would be
welcome too.  For my own part, I prefer the approach of this series, since we
won't need to touch the kernel at all and it also paves the way for the dirty
ring.  In any case, I do think collecting the dirty bits of a slot being
removed is slightly awkward already.

Dave has valid concerns about other cases besides reboot where we might still
want to keep the dirty bits: either (1) during BIOS/... boots, or (2) guest
unmaps of device MMIO regions that are backed by RAM.  I can continue to look
into these, but before that I'd appreciate knowing whether the direction is
correct and acceptable.

IMHO the worst case for a series like this would be a regression where dirty
bits are lost, but then we can simply revert the last patch to verify or fix
it (or fix the missing case instead).

Thanks,

> 
> > 
> > >   - it has an existing race-condition flaw. [1]
> > > 
> > > So if system reset is the only thing we care about here, I'm thinking whether
> > > we can move this sync explicitly to system reset, so we do a global sync there
> > > instead of syncing every time the memory layout changes and causes memory
> > > removals.  I think it is more explicit to sync during system reset, and in
> > > that context it will also be far easier for the kvm dirty ring to provide the
> > > same logic.
> > > 
> > > This is totally RFC because I'm still trying to find out whether there are
> > > other cases besides system reset where we want to keep the dirty bits for a
> > > to-be-removed memslot (real memory removals like unplugging memory shouldn't
> > > matter, because we won't care about the dirty bits if the memory is never
> > > coming back, not to mention we won't allow such things during a migration).
> > > So far I don't see any.
> > 
> > I'm still unsure when slot removal happens for real; but if it's
> > happening for RAM on PCI devices, then that would make sense as
> > something you might want to keep.
> 
> Yeah indeed.  Though I'm also curious about a real user of that which would
> frequently map/unmap an MMIO region of device RAM.  There's also a workaround
> for this issue: mark the region as dirty when unmapping the MMIO region.  But
> I haven't checked whether that's easy to do in QEMU; I assume not that much...
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu




* Re: [PATCH RFC 1/4] migration: Export migration_bitmap_sync_precopy()
From: Dr. David Alan Gilbert @ 2020-05-06  9:58 UTC
  To: Peter Xu; +Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Make it usable outside migration.  To make it easier to use, remove the
> RAMState parameter, since ram.c already has a direct reference to ram_state
> in its context.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  include/migration/misc.h |  1 +
>  migration/ram.c          | 10 +++++-----
>  2 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index d2762257aa..e338be8c30 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -66,6 +66,7 @@ void remove_migration_state_change_notifier(Notifier *notify);
>  bool migration_in_setup(MigrationState *);
>  bool migration_has_finished(MigrationState *);
>  bool migration_has_failed(MigrationState *);
> +void migration_bitmap_sync_precopy(void);
>  /* ...and after the device transmission */
>  bool migration_in_postcopy_after_devices(MigrationState *);
>  void migration_global_dump(Monitor *mon);
> diff --git a/migration/ram.c b/migration/ram.c
> index 04f13feb2e..d737175d4e 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -970,7 +970,7 @@ static void migration_bitmap_sync(RAMState *rs)
>      }
>  }
>  
> -static void migration_bitmap_sync_precopy(RAMState *rs)
> +void migration_bitmap_sync_precopy(void)
>  {
>      Error *local_err = NULL;
>  
> @@ -983,7 +983,7 @@ static void migration_bitmap_sync_precopy(RAMState *rs)
>          local_err = NULL;
>      }
>  
> -    migration_bitmap_sync(rs);
> +    migration_bitmap_sync(ram_state);
>  
>      if (precopy_notify(PRECOPY_NOTIFY_AFTER_BITMAP_SYNC, &local_err)) {
>          error_report_err(local_err);
> @@ -2303,7 +2303,7 @@ static void ram_init_bitmaps(RAMState *rs)
>      WITH_RCU_READ_LOCK_GUARD() {
>          ram_list_init_bitmaps();
>          memory_global_dirty_log_start();
> -        migration_bitmap_sync_precopy(rs);
> +        migration_bitmap_sync_precopy();
>      }
>      qemu_mutex_unlock_ramlist();
>      qemu_mutex_unlock_iothread();
> @@ -2592,7 +2592,7 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>  
>      WITH_RCU_READ_LOCK_GUARD() {
>          if (!migration_in_postcopy()) {
> -            migration_bitmap_sync_precopy(rs);
> +            migration_bitmap_sync_precopy();
>          }
>  
>          ram_control_before_iterate(f, RAM_CONTROL_FINISH);
> @@ -2642,7 +2642,7 @@ static void ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size,
>          remaining_size < max_size) {
>          qemu_mutex_lock_iothread();
>          WITH_RCU_READ_LOCK_GUARD() {
> -            migration_bitmap_sync_precopy(rs);
> +            migration_bitmap_sync_precopy();
>          }
>          qemu_mutex_unlock_iothread();
>          remaining_size = rs->migration_dirty_pages * TARGET_PAGE_SIZE;
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 2/4] migration: Introduce migrate_is_precopy()
From: Dr. David Alan Gilbert @ 2020-05-06 10:05 UTC
  To: Peter Xu; +Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> Export a helper globally to check whether we're in a precopy.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Can you change this to 'migration_in_precopy' to match the existing
'migration_in_postcopy'?

Dave
> ---
>  include/migration/misc.h | 1 +
>  migration/migration.c    | 7 +++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/migration/misc.h b/include/migration/misc.h
> index e338be8c30..b4f6bf7842 100644
> --- a/include/migration/misc.h
> +++ b/include/migration/misc.h
> @@ -61,6 +61,7 @@ void migration_shutdown(void);
>  void qemu_start_incoming_migration(const char *uri, Error **errp);
>  bool migration_is_idle(void);
>  bool migration_is_active(MigrationState *);
> +bool migration_is_precopy(void);
>  void add_migration_state_change_notifier(Notifier *notify);
>  void remove_migration_state_change_notifier(Notifier *notify);
>  bool migration_in_setup(MigrationState *);
> diff --git a/migration/migration.c b/migration/migration.c
> index 187ac0410c..0082880279 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1795,6 +1795,13 @@ bool migration_is_active(MigrationState *s)
>              s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
>  }
>  
> +bool migration_is_precopy(void)
> +{
> +    MigrationState *s = migrate_get_current();
> +
> +    return s && s->state == MIGRATION_STATUS_ACTIVE;
> +}
> +
>  void migrate_init(MigrationState *s)
>  {
>      /*
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 3/4] vl: Sync dirty bits for system resets during precopy
From: Dr. David Alan Gilbert @ 2020-05-06 10:53 UTC
  To: Peter Xu; +Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

* Peter Xu (peterx@redhat.com) wrote:
> System resets also reset the system memory layout.  Although the memory
> layout after the reset should probably be the same as before the reset, we
> still do frequent memory section removals and additions during the reset
> process.  Those operations could accidentally lose per-mem-section
> information like KVM memslot dirty bitmaps.
> 
> Previously we kept those dirty bitmaps by syncing them during memory removal.
> However, that is hard to get right after all [1].  Instead, we sync the dirty
> pages before a system reset if we know we're in a precopy migration.  This
> should solve the same problem explicitly.
> 
> [1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/

I think the problem is knowing whether this is sufficient or whether you
definitely need to do it for other cases of removal/readd.

However, assuming we need to do it during reset, how do we know this is
the right point to do it, and not something further inside the reset
process?  Is this just on the basis that vcpus are stopped?

Dave

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  softmmu/vl.c | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index 32c0047889..8f864fee43 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -1387,6 +1387,22 @@ void qemu_system_reset(ShutdownCause reason)
>  
>      cpu_synchronize_all_states();
>  
> +    /*
> +     * System reboot could reset memory layout.  Although the final status of
> +     * the memory layout should be the same as before the reset, the memory
> +     * sections can still be removed and added back frequently due to the reset
> +     * process.  This could potentially drop the dirty bits tracked for
> +     * those memory sections before the reset.
> +     *
> +     * Do a global dirty sync before the reset happens if we are in a
> +     * precopy, so we don't lose the dirty bits during the memory shuffles.
> +     */
> +    if (migration_is_precopy()) {
> +        WITH_RCU_READ_LOCK_GUARD() {
> +            migration_bitmap_sync_precopy();
> +        }
> +    }
> +
>      if (mc && mc->reset) {
>          mc->reset(current_machine);
>      } else {
> -- 
> 2.24.1
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH RFC 2/4] migration: Introduce migrate_is_precopy()
From: Peter Xu @ 2020-05-06 14:52 UTC
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

On Wed, May 06, 2020 at 11:05:49AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Export a helper globally to check whether we're in a precopy.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> Can you change this to 'migration_in_precopy' to match the existing
> 'migration_in_postcopy'?

Sure.

-- 
Peter Xu




* Re: [PATCH RFC 3/4] vl: Sync dirty bits for system resets during precopy
From: Peter Xu @ 2020-05-06 15:38 UTC
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

On Wed, May 06, 2020 at 11:53:40AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > System resets also reset the system memory layout.  Although the memory
> > layout after the reset should probably be the same as before the reset, we
> > still do frequent memory section removals and additions during the reset
> > process.  Those operations could accidentally lose per-mem-section
> > information like KVM memslot dirty bitmaps.
> > 
> > Previously we kept those dirty bitmaps by syncing them during memory removal.
> > However, that is hard to get right after all [1].  Instead, we sync the dirty
> > pages before a system reset if we know we're in a precopy migration.  This
> > should solve the same problem explicitly.
> > 
> > [1] https://lore.kernel.org/qemu-devel/20200327150425.GJ422390@xz-x1/
> 
> I think the problem is knowing whether this is sufficient or whether you
> definitely need to do it for other cases of removal/readd.

Right, that's the tricky part.  I planned to look into the other potential
cases after I got feedback from Paolo or anyone who thinks this approach is
acceptable, but of course I can start to look into them now.

When I said this should mostly be for reset, it comes from this commit message:

    commit 4cc856fabae1447d53890e707c70f257a7691174
    Author: zhanghailiang <zhang.zhanghailiang@huawei.com>
    Date:   Thu Apr 2 19:26:31 2015 +0000

    kvm-all: Sync dirty-bitmap from kvm before kvm destroy the corresponding dirty_bitmap
    
    Sometimes, we destroy the dirty_bitmap in kvm_memory_slot before any sync action
    occur, this bit in dirty_bitmap will be missed, and which will lead the corresponding
    dirty pages to be missed in migration.
    
    This usually happens when do migration during VM's Start-up or Reboot.

That's where the get-dirty-log was missing, and Hailiang reported an issue
with that, probably on system reboot (but, again, I know nothing about the
use case behind it...).

Today I even tried to dig deeper into the place where this was initially
introduced, and it's:

    commit 3fbffb628c001bd540dc9c1805bdf7aa8591da4d
    Author: Avi Kivity <avi@redhat.com>
    Date:   Sun Jan 15 16:13:59 2012 +0200

    kvm: flush the dirty log when unregistering a slot
    
    Otherwise, the dirty log information is lost in the kernel forever.
    
    Fixes opensuse-12.1 boot screen, which changes the vga windows rapidly.
    
    Signed-off-by: Avi Kivity <avi@redhat.com>

So it's Avi's patch from 8 years ago, probably trying to fix a display issue,
since VGA tracks dirty bits even without migration.  IIUC this belongs to the
case you raised a concern about: removing a RAM-backed device MMIO region.
So I'll definitely look into that too, as I had planned to, after all.

> 
> However, assuming we need to do it during reset, how do we know this is
> the right point to do it, and not something further inside the reset
> process?  Is this just on the basis that vcpus are stopped?

"Stopped vcpus" is not the hint for "this is the right place", but definitely a
good thing, because we don't even need to consider the multiple layers of
dirty-bit caches when they're stopped (the last layer being the PML, which
is unluckily per-vcpu rather than per-memslot).  It's just something nice
and extra that we get.  Even if vcpus were not stopped at this point (I
believe the above VGA MMIO region removal could be a case I'll need to take
care of later, where we would have to collect dirty bits while the vcpus are
running), as I mentioned in the other thread this is still a point high
enough in the stack to do a global dirty sync (because both the dirty log
and the dirty ring need some global sync, and I believe that's one of the
reasons the current code is still racy - we always need to flush the PML on
every vcpu even for the dirty log!).

Sorry, I went too far... let me go back to the topic of "the right place" - I
think this is the right place for reset, because AFAIK all the memslot
add/remove is done inside the device reset hooks, i.e. qemu_devices_reset()
(or MachineClass.reset(), which could also change stuff, but finally calls
qemu_devices_reset() too).  I tried to verify this by enabling the tracepoint
at trace_kvm_set_user_memory() and also dumping something at
qemu_system_reset() before qemu_devices_reset(); I think it proves my
understanding that all trace_kvm_set_user_memory() events trigger after that
point.

Thanks,

-- 
Peter Xu




* Re: [PATCH RFC 0/4] vl: Sync dirty bitmap when system resets
From: Peter Xu @ 2020-05-23 22:49 UTC
  To: Dr. David Alan Gilbert
  Cc: Paolo Bonzini, Tian Kevin, qemu-devel, Juan Quintela

On Sat, May 02, 2020 at 05:04:23PM -0400, Peter Xu wrote:
> Paolo,
> 
> Do you have any opinion on this series?  More generally, comments on your
> preferred way to fix the known dirty sync issue with memslot removal would be
> welcome too.  For my own part, I prefer the approach of this series, since we
> won't need to touch the kernel at all and it also paves the way for the dirty
> ring.  In any case, I do think collecting the dirty bits of a slot being
> removed is slightly awkward already.
> 
> Dave has valid concerns about other cases besides reboot where we might still
> want to keep the dirty bits: either (1) during BIOS/... boots, or (2) guest
> unmaps of device MMIO regions that are backed by RAM.  I can continue to look
> into these, but before that I'd appreciate knowing whether the direction is
> correct and acceptable.
> direction is correct and acceptable.
> 
> IMHO the worst case for a series like this would be a regression where dirty
> bits are lost, but then we can simply revert the last patch to verify or fix
> it (or fix the missing case instead).

Hmm... I didn't find a good and clean way to cover what Dave has mentioned.
Everything would be either ugly or even uglier.  Sad.

Since I didn't get any feedback on this for some time (or on the rest of the
discussion threads and proposals I opened), I think it's time to move on...
It starts to feel unwise to keep the two kvm dirty ring series blocked by
this extremely rare corner-case race condition; after all, it's not a problem
specific to the kvm dirty ring but common to both methods.  I'll continue
with the dirty ring for now, and we can fix the race for the dirty log as
well once we find a better solution.

I'll soon post a new QEMU version, keeping the memslot-removal sync
operation, but, just like the dirty log, it'll be a best-effort sync (e.g.,
PML ignored).  However, I'll add a comment to at least mention the issues
we've found, so we can get back to it later.

Thanks,

-- 
Peter Xu


