* [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate
@ 2022-12-23 14:23 Chuang Xu
  2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu
  ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread

From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw)
To: qemu-devel; +Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo

In this version:

- attach more information in the cover letter.
- remove changes on virtio_load().
- add rcu_read_locked() to detect holding of the RCU read lock.

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (measured from the timestamp at which the source QEMU
stops to the timestamp at which the target QEMU starts). Most of that time
is spent committing memory region changes repeatedly. This series packs
all the memory region changes made while loading non-iterable vmstate into
a single memory transaction. The more devices a VM has, the larger the
improvement.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM
  - 8 16-queue vhost-net devices
  - 16 4-queue vhost-user-blk devices

           time of loading non-iterable vmstate   downtime
  before   about 150 ms                           740+ ms
  after    about 30 ms                            630+ ms

(This result is different from that of v1. Something may have changed on
my host, but it does not affect the demonstration of the optimization.)

In test2, we keep the number of devices the same as in test1 and reduce
the number of queues per device. Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM
  - 8 1-queue vhost-net devices
  - 16 1-queue vhost-user-blk devices
           time of loading non-iterable vmstate   downtime
  before   about 90 ms                            about 250 ms
  after    about 25 ms                            about 160 ms

In test3, we keep the number of queues per device the same as in test1 and
reduce the number of devices. Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device

           time of loading non-iterable vmstate   downtime
  before   about 20 ms                            about 70 ms
  after    about 11 ms                            about 60 ms

As the test results above show, both the number of queues and the number
of devices have a great impact on the time of loading non-iterable
vmstate. More devices and queues lead to more memory region commits, and
the time spent rebuilding the flatview increases accordingly.

Please review,
Chuang

[v3]

- move virtio_load_check_delay() from virtio_memory_listener_commit() to
  virtio_vmstate_change().
- add a delay_check flag to VirtIODevice to make sure
  virtio_load_check_delay() is called when delay_check is true.

[v2]

- rebase to latest upstream.
- add a sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration loading completes.

[v1]

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (measured from the timestamp at which the source QEMU
stops to the timestamp at which the target QEMU starts). Most of that time
is spent committing memory region changes repeatedly. This series packs
all the memory region changes made while loading non-iterable vmstate into
a single memory transaction. The more devices a VM has, the larger the
improvement.

Here are the test results:

test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net devices
- 16 4-queue vhost-user-blk devices

           time of loading non-iterable vmstate
  before   about 210 ms
  after    about 40 ms

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [RFC v4 1/3] rcu: introduce rcu_read_locked() 2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu @ 2022-12-23 14:23 ` Chuang Xu 2023-01-04 14:20 ` Alex Bennée 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw) To: qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, Chuang Xu add rcu_read_locked() to detect holding of rcu lock. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> --- include/qemu/rcu.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index b063c6fde8..42cbd0080f 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void) } } +static inline bool rcu_read_locked(void) +{ + struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader(); + + return p_rcu_reader->depth > 0; +} + extern void synchronize_rcu(void); /* -- 2.20.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 1/3] rcu: introduce rcu_read_locked() 2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu @ 2023-01-04 14:20 ` Alex Bennée 2023-01-05 8:17 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Alex Bennée @ 2023-01-04 14:20 UTC (permalink / raw) To: Chuang Xu Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, qemu-devel Chuang Xu <xuchuangxclwt@bytedance.com> writes: > add rcu_read_locked() to detect holding of rcu lock. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > include/qemu/rcu.h | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h > index b063c6fde8..42cbd0080f 100644 > --- a/include/qemu/rcu.h > +++ b/include/qemu/rcu.h > @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void) > } > } > > +static inline bool rcu_read_locked(void) We use the locked suffix to indicate functions that should be called with a lock held. Perhaps renaming this to rcu_read_is_locked() would make the intent of the function clearer? > +{ > + struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader(); > + > + return p_rcu_reader->depth > 0; > +} > + > extern void synchronize_rcu(void); > > /* -- Alex Bennée Virtualisation Tech Lead @ Linaro ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 1/3] rcu: introduce rcu_read_locked()
  2023-01-04 14:20 ` Alex Bennée
@ 2023-01-05  8:17 ` Chuang Xu
  0 siblings, 0 replies; 22+ messages in thread

From: Chuang Xu @ 2023-01-05 8:17 UTC (permalink / raw)
To: Alex Bennée
Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]

On 2023/1/4 10:20 PM, Alex Bennée wrote:
> Chuang Xu writes:
>
>> add rcu_read_locked() to detect holding of rcu lock.
>>
>> Signed-off-by: Chuang Xu
>> ---
>>  include/qemu/rcu.h | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
>> index b063c6fde8..42cbd0080f 100644
>> --- a/include/qemu/rcu.h
>> +++ b/include/qemu/rcu.h
>> @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
>>      }
>>  }
>>
>> +static inline bool rcu_read_locked(void)
> We use the locked suffix to indicate functions that should be called
> with a lock held. Perhaps renaming this to rcu_read_is_locked() would
> make the intent of the function clearer?

Yes, rcu_read_is_locked() does make the intent of the function clearer.
I'll rename the function in v5. Thanks!

>> +{
>> +    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
>> +
>> +    return p_rcu_reader->depth > 0;
>> +}
>> +
>>  extern void synchronize_rcu(void);
>>
>>  /*

[-- Attachment #2: Type: text/html, Size: 1381 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu 2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu @ 2022-12-23 14:23 ` Chuang Xu 2022-12-23 15:37 ` Peter Xu ` (2 more replies) 2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu 2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu 3 siblings, 3 replies; 22+ messages in thread From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw) To: qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, Chuang Xu Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> --- include/exec/memory.h | 9 +++++++++ softmmu/memory.c | 5 +++++ 2 files changed, 14 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 91f8a2395a..66c43b4862 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1069,8 +1069,17 @@ struct FlatView { MemoryRegion *root; }; +int memory_region_transaction_get_depth(void); + static inline FlatView *address_space_to_flatview(AddressSpace *as) { + /* + * Before using any flatview, sanity check we're not during a memory + * region transaction or the map can be invalid. Note that this can + * also be called during commit phase of memory transaction, but that + * should also only happen when the depth decreases to 0 first. 
+ */ + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); return qatomic_rcu_read(&as->current_map); } diff --git a/softmmu/memory.c b/softmmu/memory.c index bc0be3f62c..01192e2e5b 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1116,6 +1116,11 @@ void memory_region_transaction_commit(void) } } +int memory_region_transaction_get_depth(void) +{ + return memory_region_transaction_depth; +} + static void memory_region_destructor_none(MemoryRegion *mr) { } -- 2.20.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu @ 2022-12-23 15:37 ` Peter Xu 2022-12-23 15:47 ` Paolo Bonzini 2022-12-28 10:50 ` Philippe Mathieu-Daudé 2 siblings, 0 replies; 22+ messages in thread From: Peter Xu @ 2022-12-23 15:37 UTC (permalink / raw) To: Chuang Xu Cc: qemu-devel, dgilbert, quintela, pbonzini, david, philmd, zhouyibo On Fri, Dec 23, 2022 at 10:23:06PM +0800, Chuang Xu wrote: > Before using any flatview, sanity check we're not during a memory > region transaction or the map can be invalid. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > include/exec/memory.h | 9 +++++++++ > softmmu/memory.c | 5 +++++ > 2 files changed, 14 insertions(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 91f8a2395a..66c43b4862 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1069,8 +1069,17 @@ struct FlatView { > MemoryRegion *root; > }; > > +int memory_region_transaction_get_depth(void); > + > static inline FlatView *address_space_to_flatview(AddressSpace *as) > { > + /* > + * Before using any flatview, sanity check we're not during a memory > + * region transaction or the map can be invalid. Note that this can > + * also be called during commit phase of memory transaction, but that > + * should also only happen when the depth decreases to 0 first. Nitpick: after adding the RCU check the comment may need slight touch up: * Meanwhile it's safe to access current_map with RCU read lock held * even if during a memory transaction. It means the user can bear * with an obsolete map. 
> + */ > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > return qatomic_rcu_read(&as->current_map); > } > > diff --git a/softmmu/memory.c b/softmmu/memory.c > index bc0be3f62c..01192e2e5b 100644 > --- a/softmmu/memory.c > +++ b/softmmu/memory.c > @@ -1116,6 +1116,11 @@ void memory_region_transaction_commit(void) > } > } > > +int memory_region_transaction_get_depth(void) > +{ > + return memory_region_transaction_depth; > +} > + > static void memory_region_destructor_none(MemoryRegion *mr) > { > } > -- > 2.20.1 > -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu 2022-12-23 15:37 ` Peter Xu @ 2022-12-23 15:47 ` Paolo Bonzini 2022-12-23 15:54 ` Peter Xu 2022-12-28 10:50 ` Philippe Mathieu-Daudé 2 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2022-12-23 15:47 UTC (permalink / raw) To: Chuang Xu, qemu-devel; +Cc: dgilbert, quintela, peterx, david, philmd, zhouyibo On 12/23/22 15:23, Chuang Xu wrote: > static inline FlatView *address_space_to_flatview(AddressSpace *as) > { > + /* > + * Before using any flatview, sanity check we're not during a memory > + * region transaction or the map can be invalid. Note that this can > + * also be called during commit phase of memory transaction, but that > + * should also only happen when the depth decreases to 0 first. > + */ > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > return qatomic_rcu_read(&as->current_map); > } This is not valid because the transaction could happen in *another* thread. In that case memory_region_transaction_depth() will be > 0, but RCU is needed. Paolo ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 15:47 ` Paolo Bonzini @ 2022-12-23 15:54 ` Peter Xu 2022-12-28 8:27 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2022-12-23 15:54 UTC (permalink / raw) To: Paolo Bonzini Cc: Chuang Xu, qemu-devel, dgilbert, quintela, david, philmd, zhouyibo Hi, Paolo, On Fri, Dec 23, 2022 at 04:47:57PM +0100, Paolo Bonzini wrote: > On 12/23/22 15:23, Chuang Xu wrote: > > static inline FlatView *address_space_to_flatview(AddressSpace *as) > > { > > + /* > > + * Before using any flatview, sanity check we're not during a memory > > + * region transaction or the map can be invalid. Note that this can > > + * also be called during commit phase of memory transaction, but that > > + * should also only happen when the depth decreases to 0 first. > > + */ > > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > > return qatomic_rcu_read(&as->current_map); > > } > > This is not valid because the transaction could happen in *another* thread. > In that case memory_region_transaction_depth() will be > 0, but RCU is > needed. Do you mean the code is wrong, or the comment? Note that the code has checked rcu_read_locked() where introduced in patch 1, but maybe something else was missed? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 15:54 ` Peter Xu @ 2022-12-28 8:27 ` Paolo Bonzini 2023-01-03 17:43 ` Peter Xu 0 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2022-12-28 8:27 UTC (permalink / raw) To: Peter Xu Cc: Chuang Xu, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo [-- Attachment #1: Type: text/plain, Size: 706 bytes --] Il ven 23 dic 2022, 16:54 Peter Xu <peterx@redhat.com> ha scritto: > > This is not valid because the transaction could happen in *another* > thread. > > In that case memory_region_transaction_depth() will be > 0, but RCU is > > needed. > > Do you mean the code is wrong, or the comment? Note that the code has > checked rcu_read_locked() where introduced in patch 1, but maybe something > else was missed? > The assertion is wrong. It will succeed even if RCU is unlocked in this thread but a transaction is in progress in another thread. Perhaps you can check (memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? Paolo Thanks, > > -- > Peter Xu > > [-- Attachment #2: Type: text/html, Size: 1371 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-28 8:27 ` Paolo Bonzini @ 2023-01-03 17:43 ` Peter Xu 2023-01-10 8:09 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2023-01-03 17:43 UTC (permalink / raw) To: Paolo Bonzini Cc: Chuang Xu, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo Hi, Paolo, On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: > Il ven 23 dic 2022, 16:54 Peter Xu <peterx@redhat.com> ha scritto: > > > > This is not valid because the transaction could happen in *another* > > thread. > > > In that case memory_region_transaction_depth() will be > 0, but RCU is > > > needed. > > > > Do you mean the code is wrong, or the comment? Note that the code has > > checked rcu_read_locked() where introduced in patch 1, but maybe something > > else was missed? > > > > The assertion is wrong. It will succeed even if RCU is unlocked in this > thread but a transaction is in progress in another thread. IIUC this is the case where the context: (1) doesn't have RCU read lock held, and, (2) doesn't have BQL held. Is it safe at all to reference any flatview in such a context? The thing is I think the flatview pointer can be freed anytime if both locks are not taken. > Perhaps you can check (memory_region_transaction_depth() > 0 && > !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? What if one thread calls address_space_to_flatview() with BQL held but not RCU read lock held? I assume it's a legal operation, but it seems to be able to trigger the assert already? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-03 17:43 ` Peter Xu @ 2023-01-10 8:09 ` Chuang Xu 2023-01-10 14:45 ` Peter Xu 0 siblings, 1 reply; 22+ messages in thread From: Chuang Xu @ 2023-01-10 8:09 UTC (permalink / raw) To: Peter Xu, Paolo Bonzini Cc: qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo [-- Attachment #1: Type: text/plain, Size: 2656 bytes --] Hi, Peter and Paolo, I'm sorry I didn't reply to your email in time. I was infected with COVID-19 two weeks ago, so I couldn't think about the problems discussed in your email for a long time.😷 On 2023/1/4 上午1:43, Peter Xu wrote: > Hi, Paolo, > > On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: >> Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: >> >>>> This is not valid because the transaction could happen in *another* >>> thread. >>>> In that case memory_region_transaction_depth() will be > 0, but RCU is >>>> needed. >>> Do you mean the code is wrong, or the comment? Note that the code has >>> checked rcu_read_locked() where introduced in patch 1, but maybe something >>> else was missed? >>> >> The assertion is wrong. It will succeed even if RCU is unlocked in this >> thread but a transaction is in progress in another thread. > IIUC this is the case where the context: > > (1) doesn't have RCU read lock held, and, > (2) doesn't have BQL held. > > Is it safe at all to reference any flatview in such a context? The thing > is I think the flatview pointer can be freed anytime if both locks are not > taken. > >> Perhaps you can check (memory_region_transaction_depth() > 0 && >> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? > What if one thread calls address_space_to_flatview() with BQL held but not > RCU read lock held? I assume it's a legal operation, but it seems to be > able to trigger the assert already? 
>
> Thanks,
>

I'm not sure whether I understand the content of your discussion
correctly, so let me explain my understanding first.

From my perspective, Paolo thinks that when thread 1 is in a transaction,
thread 2 will trigger the assertion when accessing the flatview without
holding the RCU read lock, although sometimes thread 2's access to the
flatview is legal. So Paolo suggests checking
(memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked())
|| rcu_read_locked() instead.

And Peter thinks that as long as no thread holds the BQL or the RCU read
lock, the old flatview can be released (actually executed by the rcu
thread with the BQL held). When thread 1 is in a transaction, if thread 2
accesses the flatview with the BQL held but not the RCU read lock, it's a
legal operation. In this legal case, it seems that both my code and
Paolo's code will trigger the assertion.

Do I have a correct understanding of your emails? I think checking
(memory_region_transaction_get_depth() == 0 || rcu_read_locked() ||
qemu_mutex_iothread_locked()) should cover the case you discussed.

What's your take?

Thanks!

[-- Attachment #2: Type: text/html, Size: 3151 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-10 8:09 ` Chuang Xu @ 2023-01-10 14:45 ` Peter Xu 2023-01-12 7:59 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2023-01-10 14:45 UTC (permalink / raw) To: Chuang Xu Cc: Paolo Bonzini, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote: > Hi, Peter and Paolo, Hi, Chuang, Paolo, > I'm sorry I didn't reply to your email in time. I was infected with > COVID-19 two weeks ago, so I couldn't think about the problems discussed > in your email for a long time.😷 > > On 2023/1/4 上午1:43, Peter Xu wrote: > > Hi, Paolo, > > > > On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: > >> Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: > >> > >>>> This is not valid because the transaction could happen in *another* > >>> thread. > >>>> In that case memory_region_transaction_depth() will be > 0, but RCU is > >>>> needed. > >>> Do you mean the code is wrong, or the comment? Note that the code has > >>> checked rcu_read_locked() where introduced in patch 1, but maybe > something > >>> else was missed? > >>> > >> The assertion is wrong. It will succeed even if RCU is unlocked in this > >> thread but a transaction is in progress in another thread. > > IIUC this is the case where the context: > > > > (1) doesn't have RCU read lock held, and, > > (2) doesn't have BQL held. > > > > Is it safe at all to reference any flatview in such a context? The thing > > is I think the flatview pointer can be freed anytime if both locks are > not > > taken. > > > >> Perhaps you can check (memory_region_transaction_depth() > 0 && > >> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? > > What if one thread calls address_space_to_flatview() with BQL held but > not > > RCU read lock held? I assume it's a legal operation, but it seems to be > > able to trigger the assert already? 
> > > > Thanks, > > > I'm not sure whether I understand the content of your discussion correctly, > so here I want to explain my understanding firstly. > > From my perspective, Paolo thinks that when thread 1 is in a transaction, > thread 2 will trigger the assertion when accessing the flatview without > holding RCU read lock, although sometimes the thread 2's access to flatview > is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0 > && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead. > > And Peter thinks that as long as no thread holds the BQL or RCU read lock, > the old flatview can be released (actually executed by the rcu thread with > BQL held). When thread 1 is in a transaction, if thread 2 access the > flatview > with BQL held but not RCU read lock held, it's a legal operation. In this > legal case, it seems that both my code and Paolo's code will trigger > assertion. IIUC your original patch is fine in this case (BQL held, RCU not held), as long as depth==0. IMHO what we want to trap here is when BQL held (while RCU is not) and depth>0 which can cause unpredictable side effect of using obsolete flatview. To summarize, the original check didn't consider BQL, and if to consider BQL I think it should be something like: /* Guarantees valid access to the flatview, either lock works */ assert(BQL_HELD() || RCU_HELD()); /* * Guarantees any BQL holder is not reading obsolete flatview (e.g. when * during vm load) */ if (BQL_HELD()) assert(depth==0); IIUC it can be merged into: assert((BQL_HELD() && depth==0) || RCU_HELD()); > > I'm not sure if I have a good understanding of your emails? I think > checking(memory_region_transaction_get_depth() == 0 || rcu_read_locked() || > qemu_mutex_iothread_locked()) should cover the case you discussed. This seems still problematic too? Since the assert can pass even if neither BQL nor RCU is held (as long as depth==0). 
Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-10 14:45 ` Peter Xu @ 2023-01-12 7:59 ` Chuang Xu 2023-01-12 15:13 ` Peter Xu 0 siblings, 1 reply; 22+ messages in thread From: Chuang Xu @ 2023-01-12 7:59 UTC (permalink / raw) To: Peter Xu, Paolo Bonzini Cc: qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo Hi, Peter, Paolo, On 2023/1/10 下午10:45, Peter Xu wrote: > On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote: >> Hi, Peter and Paolo, > Hi, Chuang, Paolo, > >> I'm sorry I didn't reply to your email in time. I was infected with >> COVID-19 two weeks ago, so I couldn't think about the problems discussed >> in your email for a long time.😷 >> >> On 2023/1/4 上午1:43, Peter Xu wrote: >>> Hi, Paolo, >>> >>> On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: >>>> Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: >>>> >>>>>> This is not valid because the transaction could happen in *another* >>>>> thread. >>>>>> In that case memory_region_transaction_depth() will be > 0, but RCU is >>>>>> needed. >>>>> Do you mean the code is wrong, or the comment? Note that the code has >>>>> checked rcu_read_locked() where introduced in patch 1, but maybe >> something >>>>> else was missed? >>>>> >>>> The assertion is wrong. It will succeed even if RCU is unlocked in this >>>> thread but a transaction is in progress in another thread. >>> IIUC this is the case where the context: >>> >>> (1) doesn't have RCU read lock held, and, >>> (2) doesn't have BQL held. >>> >>> Is it safe at all to reference any flatview in such a context? The thing >>> is I think the flatview pointer can be freed anytime if both locks are >> not >>> taken. >>> >>>> Perhaps you can check (memory_region_transaction_depth() > 0 && >>>> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? >>> What if one thread calls address_space_to_flatview() with BQL held but >> not >>> RCU read lock held? 
I assume it's a legal operation, but it seems to be
>>> able to trigger the assert already?
>>>
>>> Thanks,
>>>
>> I'm not sure whether I understand the content of your discussion correctly,
>> so here I want to explain my understanding firstly.
>>
>> From my perspective, Paolo thinks that when thread 1 is in a transaction,
>> thread 2 will trigger the assertion when accessing the flatview without
>> holding RCU read lock, although sometimes the thread 2's access to flatview
>> is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0
>> && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead.
>>
>> And Peter thinks that as long as no thread holds the BQL or RCU read lock,
>> the old flatview can be released (actually executed by the rcu thread with
>> BQL held). When thread 1 is in a transaction, if thread 2 access the
>> flatview with BQL held but not RCU read lock held, it's a legal operation.
>> In this legal case, it seems that both my code and Paolo's code will
>> trigger assertion.
> IIUC your original patch is fine in this case (BQL held, RCU not held), as
> long as depth==0. IMHO what we want to trap here is when BQL held (while
> RCU is not) and depth>0 which can cause unpredictable side effect of using
> obsolete flatview.

I don't quite understand the side effects of depth>0 when the BQL is held
(while RCU is not). From my point of view, both the BQL and RCU ensure
that the flatview will not be released while a worker thread is accessing
it: before the rcu thread releases the flatview, it makes sure that it
holds the BQL itself and that the worker thread does not hold RCU. So,
whether the depth is 0 or not, as long as the BQL or RCU is held, the
worker thread will not access an obsolete flatview (IIUC 'obsolete' means
that the flatview has been released).
>
> To summarize, the original check didn't consider BQL, and if to consider
> BQL I think it should be something like:
>
>     /* Guarantees valid access to the flatview, either lock works */
>     assert(BQL_HELD() || RCU_HELD());
>
>     /*
>      * Guarantees any BQL holder is not reading obsolete flatview (e.g. when
>      * during vm load)
>      */
>     if (BQL_HELD())
>         assert(depth==0);
>
> IIUC it can be merged into:
>
>     assert((BQL_HELD() && depth==0) || RCU_HELD());

IMHO assert(BQL_HELD() || RCU_HELD()) is enough. Or do you think that once
an mr transaction is in progress, the old flatview is already obsolete? If
we regard the flatview as obsolete while an mr transaction is in progress,
how can RCU ensure that the flatview is not obsolete?

What does Paolo think of this check?

Thanks!

> > I'm not sure if I have a good understanding of your emails? I think
> > checking (memory_region_transaction_get_depth() == 0 || rcu_read_locked() ||
> > qemu_mutex_iothread_locked()) should cover the case you discussed.
> This seems still problematic too? Since the assert can pass even if
> neither BQL nor RCU is held (as long as depth==0).
>

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-12 7:59 ` Chuang Xu @ 2023-01-12 15:13 ` Peter Xu 2023-01-13 19:29 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2023-01-12 15:13 UTC (permalink / raw) To: Chuang Xu Cc: Paolo Bonzini, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo On Thu, Jan 12, 2023 at 03:59:55PM +0800, Chuang Xu wrote: > Hi, Peter, Paolo, Chuang, > > On 2023/1/10 下午10:45, Peter Xu wrote: > > On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote: > > > Hi, Peter and Paolo, > > Hi, Chuang, Paolo, > > > > > I'm sorry I didn't reply to your email in time. I was infected with > > > COVID-19 two weeks ago, so I couldn't think about the problems discussed > > > in your email for a long time.😷 > > > > > > On 2023/1/4 上午1:43, Peter Xu wrote: > > > > Hi, Paolo, > > > > > > > > On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: > > > > > Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: > > > > > > > > > > > > This is not valid because the transaction could happen in *another* > > > > > > thread. > > > > > > > In that case memory_region_transaction_depth() will be > 0, but RCU is > > > > > > > needed. > > > > > > Do you mean the code is wrong, or the comment? Note that the code has > > > > > > checked rcu_read_locked() where introduced in patch 1, but maybe > > > something > > > > > > else was missed? > > > > > > > > > > > The assertion is wrong. It will succeed even if RCU is unlocked in this > > > > > thread but a transaction is in progress in another thread. > > > > IIUC this is the case where the context: > > > > > > > > (1) doesn't have RCU read lock held, and, > > > > (2) doesn't have BQL held. > > > > > > > > Is it safe at all to reference any flatview in such a context? The thing > > > > is I think the flatview pointer can be freed anytime if both locks are > > > not > > > > taken. 
> > > > > > > > > Perhaps you can check (memory_region_transaction_depth() > 0 && > > > > > !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? > > > > What if one thread calls address_space_to_flatview() with BQL held but > > > not > > > > RCU read lock held? I assume it's a legal operation, but it seems to be > > > > able to trigger the assert already? > > > > > > > > Thanks, > > > > > > > I'm not sure whether I understand the content of your discussion correctly, > > > so here I want to explain my understanding firstly. > > > > > > From my perspective, Paolo thinks that when thread 1 is in a transaction, > > > thread 2 will trigger the assertion when accessing the flatview without > > > holding RCU read lock, although sometimes the thread 2's access to flatview > > > is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0 > > > && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead. > > > > > > And Peter thinks that as long as no thread holds the BQL or RCU read lock, > > > the old flatview can be released (actually executed by the rcu thread with > > > BQL held). When thread 1 is in a transaction, if thread 2 access the > > > flatview > > > with BQL held but not RCU read lock held, it's a legal operation. In this > > > legal case, it seems that both my code and Paolo's code will trigger > > > assertion. > > IIUC your original patch is fine in this case (BQL held, RCU not held), as > > long as depth==0. IMHO what we want to trap here is when BQL held (while > > RCU is not) and depth>0 which can cause unpredictable side effect of using > > obsolete flatview. > > I don't quite understand the side effects of depth>0 when BQL is held (while > RCU is not). We wanted to capture outliers when you apply the follow up patch to vm load procedure. That will make depth>0 for the whole process of vm load during migration, and we wanted to make sure it's safe, hence this patch, right? 
> > In my perspective, both BQL and RCU can ensure that the flatview will not be > released when the worker thread accesses the flatview, because before the rcu > thread releases the flatview, it will make sure itself holding BQL and the > worker thread not holding RCU. So, whether the depth is 0 or not, as long as > BQL or RCU is held, the worker thread will not access the obsolete flatview > (IIUC 'obsolete' means that flatview is released). > > > > > To summarize, the original check didn't consider BQL, and if to consider > > BQL I think it should be something like: > > > > /* Guarantees valid access to the flatview, either lock works */ > > assert(BQL_HELD() || RCU_HELD()); > > > > /* > > * Guarantees any BQL holder is not reading obsolete flatview (e.g. when > > * during vm load) > > */ > > if (BQL_HELD()) > > assert(depth==0); > > > > IIUC it can be merged into: > > > > assert((BQL_HELD() && depth==0) || RCU_HELD()); > > IMHO assert(BQL_HELD() || RCU_HELD()) is enough.. Yes, but IMHO this will guarantee safe use of the flatview only _before_ your follow-up patch. Before that patch, the depth==0 should always stand (when BQL_HELD() stands), I think. After that patch, since depth will be increased at the entry of vm load, there's a risk that we can overlook code that will be referencing the flatview during vm load and that can reference an obsolete flatview. Since the whole process of qemu_loadvm_state_main() will have the BQL held, we won't hit the assertion if we only check "BQL_HELD() || RCU_HELD()", because BQL_HELD() is always satisfied. > > Or you think that once a mr transaction is in progress, the old flatview has > been obsolete? If we regard the flatview as obsolete when a mr transaction is in > progress, how can RCU ensure that the flatview is not obsolete? AFAIU RCU cannot guarantee that. So IIUC any RCU lock user needs to be able to tolerate obsolete flatviews being referenced, and it should not harm the system.
If it needs the latest update, it should take care of that separately. For example, the virtio code we're looking at in this series uses the RCU lock to build an address space cache for the device vrings, which is based on the current flatview of mem. It's safe to reference an obsolete flatview in this case (meaning the flatview may be in the middle of an update while virtio is establishing the address space cache); IMHO that's fine because the address space cache will be updated again in virtio_memory_listener_commit(), so it'll be consistent in the end. The intermediate phase of inconsistency should be fine in this case, just like any DMA that happens during a memory hotplug. For this specific patch, IMHO the core is to check depth>0 references, and we need RCU_HELD to mask out where the obsolete references are fine. Thanks, > > What does Paolo think of this check? > > Thanks! > > > > I'm not sure if I have a good understanding of your emails? I think > > > checking (memory_region_transaction_get_depth() == 0 || rcu_read_locked() || > > > qemu_mutex_iothread_locked()) should cover the case you discussed. > > This still seems problematic too, since the assert can pass even if > > neither BQL nor RCU is held (as long as depth==0). > > > > Thanks, > > > -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-12 15:13 ` Peter Xu @ 2023-01-13 19:29 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2023-01-13 19:29 UTC (permalink / raw) To: Peter Xu Cc: Paolo Bonzini, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo Hi, Peter, On 2023/1/12 11:13 PM, Peter Xu wrote: > We wanted to capture outliers when you apply the follow-up patch to the vm load > procedure. > > That will make depth>0 for the whole process of vm load during migration, > and we wanted to make sure it's safe, hence this patch, right? > >> In my perspective, both BQL and RCU can ensure that the flatview will not be >> released when the worker thread accesses the flatview, because before the rcu >> thread releases the flatview, it will make sure itself holding BQL and the >> worker thread not holding RCU. So, whether the depth is 0 or not, as long as >> BQL or RCU is held, the worker thread will not access the obsolete flatview >> (IIUC 'obsolete' means that flatview is released). >> >>> To summarize, the original check didn't consider BQL, and if to consider >>> BQL I think it should be something like: >>> >>> /* Guarantees valid access to the flatview, either lock works */ >>> assert(BQL_HELD() || RCU_HELD()); >>> >>> /* >>> * Guarantees any BQL holder is not reading obsolete flatview (e.g. when >>> * during vm load) >>> */ >>> if (BQL_HELD()) >>> assert(depth==0); >>> >>> IIUC it can be merged into: >>> >>> assert((BQL_HELD() && depth==0) || RCU_HELD()); >> IMHO assert(BQL_HELD() || RCU_HELD()) is enough.. > Yes, but IMHO this will guarantee safe use of the flatview only _before_ > your follow-up patch. > > Before that patch, the depth==0 should always stand (when BQL_HELD() > stands), I think.
> > After that patch, since depth will be increased at the entry of vm load, > there's a risk that we can overlook code that will be referencing the flatview > during vm load and that can reference an obsolete flatview. Since the > whole process of qemu_loadvm_state_main() will have the BQL held, we won't hit > the assertion if we only check "BQL_HELD() || RCU_HELD()", because BQL_HELD() > is always satisfied. > >> Or you think that once a mr transaction is in progress, the old flatview has >> been obsolete? If we regard the flatview as obsolete when a mr transaction is in >> progress, how can RCU ensure that the flatview is not obsolete? > AFAIU RCU cannot guarantee that. So IIUC any RCU lock user needs to be able > to tolerate obsolete flatviews being referenced, and it should not harm the > system. If it needs the latest update, it should take care of that > separately. > > For example, the virtio code we're looking at in this series uses the RCU lock > to build an address space cache for the device vrings, which is based on the > current flatview of mem. It's safe to reference an obsolete flatview in this > case (meaning the flatview may be in the middle of an update while virtio is > establishing the address space cache); IMHO that's fine because the address > space cache will be updated again in virtio_memory_listener_commit(), so > it'll be consistent in the end. The intermediate phase of inconsistency > should be fine in this case, just like any DMA that happens during a memory > hotplug. > > For this specific patch, IMHO the core is to check depth>0 references, and > we need RCU_HELD to mask out where the obsolete references are fine. > > Thanks, Thanks a lot for your patience! I ignored the preconditions for this discussion, so I asked a stupid question. Now I'm in favor of checking “(BQL_HELD() && depth==0) || RCU_HELD()” in v5. And what does Paolo think of Peter's solution? Thanks again! ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu 2022-12-23 15:37 ` Peter Xu 2022-12-23 15:47 ` Paolo Bonzini @ 2022-12-28 10:50 ` Philippe Mathieu-Daudé 2023-01-04 7:39 ` [External] " Chuang Xu 2 siblings, 1 reply; 22+ messages in thread From: Philippe Mathieu-Daudé @ 2022-12-28 10:50 UTC (permalink / raw) To: Chuang Xu, qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, zhouyibo On 23/12/22 15:23, Chuang Xu wrote: > Before using any flatview, sanity check we're not during a memory > region transaction or the map can be invalid. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > include/exec/memory.h | 9 +++++++++ > softmmu/memory.c | 5 +++++ > 2 files changed, 14 insertions(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 91f8a2395a..66c43b4862 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1069,8 +1069,17 @@ struct FlatView { > MemoryRegion *root; > }; > > +int memory_region_transaction_get_depth(void); Do we want to expose this; isn't the depth internal? If we need to expose something, can we restrict it to bool memory_region_in_transaction(void) or bool memory_region_transaction_in_progress(void)? > static inline FlatView *address_space_to_flatview(AddressSpace *as) > { > + /* > + * Before using any flatview, sanity check we're not during a memory > + * region transaction or the map can be invalid. Note that this can > + * also be called during commit phase of memory transaction, but that > + * should also only happen when the depth decreases to 0 first. > + */ > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > return qatomic_rcu_read(&as->current_map); > } ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [External] Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-28 10:50 ` Philippe Mathieu-Daudé @ 2023-01-04 7:39 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2023-01-04 7:39 UTC (permalink / raw) To: Philippe Mathieu-Daudé, qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, zhouyibo [-- Attachment #1: Type: text/plain, Size: 1020 bytes --] On 2022/12/28 下午6:50, Philippe Mathieu-Daudé wrote: On 23/12/22 15:23, Chuang Xu wrote: Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> <xuchuangxclwt@bytedance.com> --- include/exec/memory.h | 9 +++++++++ softmmu/memory.c | 5 +++++ 2 files changed, 14 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 91f8a2395a..66c43b4862 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1069,8 +1069,17 @@ struct FlatView { MemoryRegion *root; }; +int memory_region_transaction_get_depth(void); Do we want to expose this; isn't the depth internal? If we need to expose something, can we restrict it to bool memory_region_in_transaction(void) or bool memory_region_transaction_in_progress(void)? Yes, we'd better not expose the value of an internal variable. I'll make changes in v5. Thanks! [-- Attachment #2: Type: text/html, Size: 2032 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate
  2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu
  2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu
  2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu
@ 2022-12-23 14:23 ` Chuang Xu
  2022-12-23 16:06 ` David Hildenbrand
  2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu
  3 siblings, 1 reply; 22+ messages in thread
From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw)
To: qemu-devel
Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, Chuang Xu

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

        time of loading non-iterable vmstate    downtime
before  about 150 ms                            740+ ms
after   about 30 ms                             630+ ms

In test2, we keep the number of devices the same as in test1 and reduce the
number of queues per device:

Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

        time of loading non-iterable vmstate    downtime
before  about 90 ms                             about 250 ms
after   about 25 ms                             about 160 ms

In test3, we keep the number of queues per device the same as in test1 and
reduce the number of devices:

Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

        time of loading non-iterable vmstate    downtime
before  about 20 ms                             about 70 ms
after   about 11 ms                             about 60 ms

As we can see from the test results above, both the number of queues and
the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues will
lead to more mr commits, and the time consumption caused by the flatview
reconstruction will also increase.

Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
---
 migration/savevm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..19785e5a54 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2687,10 @@ out:
 
             goto retry;
         }
     }
+
+    /* call memory_region_transaction_commit() after loading vmstate */
+    memory_region_transaction_commit();
+
     return ret;
 }
-- 
2.20.1

^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu @ 2022-12-23 16:06 ` David Hildenbrand 2023-01-04 7:31 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: David Hildenbrand @ 2022-12-23 16:06 UTC (permalink / raw) To: Chuang Xu, qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, philmd, zhouyibo On 23.12.22 15:23, Chuang Xu wrote: > The duration of loading non-iterable vmstate accounts for a significant > portion of downtime (starting with the timestamp of source qemu stop and > ending with the timestamp of target qemu start). Most of the time is spent > committing memory region changes repeatedly. > > This patch packs all the changes to memory region during the period of > loading non-iterable vmstate in a single memory transaction. With the > increase of devices, this patch will greatly improve the performance. > > Here are the test1 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 16-queue vhost-net device > - 16 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 150 ms 740+ ms > after about 30 ms 630+ ms > > In test2, we keep the number of the device the same as test1, reduce the > number of queues per device: > > Here are the test2 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 1-queue vhost-net device > - 16 1-queue vhost-user-blk device. 
> > time of loading non-iterable vmstate downtime > before about 90 ms about 250 ms > > after about 25 ms about 160 ms > > In test3, we keep the number of queues per device the same as test1, reduce > the number of devices: > > Here are the test3 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 1 16-queue vhost-net device > - 1 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 20 ms about 70 ms > after about 11 ms about 60 ms > > As we can see from the test results above, both the number of queues and > the number of devices have a great impact on the time of loading non-iterable > vmstate. The growth of the number of devices and queues will lead to more > mr commits, and the time consumption caused by the flatview reconstruction > will also increase. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > migration/savevm.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/migration/savevm.c b/migration/savevm.c > index a0cdb714f7..19785e5a54 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis) > uint8_t section_type; > int ret = 0; > > + /* call memory_region_transaction_begin() before loading vmstate */ I'd suggest extending the comment *why* you are doing that, that it's a pure performance optimization, and how it achieves that. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 16:06 ` David Hildenbrand @ 2023-01-04 7:31 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2023-01-04 7:31 UTC (permalink / raw) To: David Hildenbrand Cc: qemu-devel, dgilbert, quintela, pbonzini, peterx, philmd, zhouyibo [-- Attachment #1: Type: text/plain, Size: 2988 bytes --] On 2022/12/24 上午12:06, David Hildenbrand wrote: On 23.12.22 15:23, Chuang Xu wrote: The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance. Here are the test1 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device. time of loading non-iterable vmstate downtime before about 150 ms 740+ ms after about 30 ms 630+ ms In test2, we keep the number of the device the same as test1, reduce the number of queues per device: Here are the test2 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 1-queue vhost-net device - 16 1-queue vhost-user-blk device. time of loading non-iterable vmstate downtime before about 90 ms about 250 ms after about 25 ms about 160 ms In test3, we keep the number of queues per device the same as test1, reduce the number of devices: Here are the test3 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 1 16-queue vhost-net device - 1 4-queue vhost-user-blk device. 
time of loading non-iterable vmstate downtime before about 20 ms about 70 ms after about 11 ms about 60 ms As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> <xuchuangxclwt@bytedance.com> --- migration/savevm.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/migration/savevm.c b/migration/savevm.c index a0cdb714f7..19785e5a54 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis) uint8_t section_type; int ret = 0; + /* call memory_region_transaction_begin() before loading vmstate */ I'd suggest extending the comment *why* you are doing that, that it's a pure performance optimization, and how it achieves that. Thanks! I'll extend the comment in v5. [-- Attachment #2: Type: text/html, Size: 5433 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu ` (2 preceding siblings ...) 2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu @ 2022-12-23 15:50 ` Peter Xu 2022-12-23 19:11 ` Chuang Xu 3 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2022-12-23 15:50 UTC (permalink / raw) To: Chuang Xu Cc: qemu-devel, dgilbert, quintela, pbonzini, david, philmd, zhouyibo Chuang, On Fri, Dec 23, 2022 at 10:23:04PM +0800, Chuang Xu wrote: > In this version: > > - attach more information in the cover letter. > - remove changes on virtio_load(). > - add rcu_read_locked() to detect holding of rcu lock. > > The duration of loading non-iterable vmstate accounts for a significant > portion of downtime (starting with the timestamp of source qemu stop and > ending with the timestamp of target qemu start). Most of the time is spent > committing memory region changes repeatedly. > > This patch packs all the changes to memory region during the period of > loading non-iterable vmstate in a single memory transaction. With the > increase of devices, this patch will greatly improve the performance. > > Here are the test1 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 16-queue vhost-net device > - 16 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 150 ms 740+ ms > after about 30 ms 630+ ms Have you investigated why multi-queue added so much downtime overhead with the same environment, comparing to below [1]? > > (This result is different from that of v1. It may be that someone has > changed something on my host.., but it does not affect the display of > the optimization effect.) 
> > > In test2, we keep the number of the device the same as test1, reduce the > number of queues per device: > > Here are the test2 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 1-queue vhost-net device > - 16 1-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 90 ms about 250 ms > after about 25 ms about 160 ms [1] > > In test3, we keep the number of queues per device the same as test1, reduce > the number of devices: > > Here are the test3 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 1 16-queue vhost-net device > - 1 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 20 ms about 70 ms > after about 11 ms about 60 ms > > > As we can see from the test results above, both the number of queues and > the number of devices have a great impact on the time of loading non-iterable > vmstate. The growth of the number of devices and queues will lead to more > mr commits, and the time consumption caused by the flatview reconstruction > will also increase. The downtime measured in precopy can be more complicated than postcopy because the time of switch is calculated by qemu based on the downtime setup, and also that contains part of RAM migrations. Postcopy should be more accurate on that because there's no calculation done, meanwhile there's no RAM transferred during downtime. However postcopy downtime is not accurate either in implementation of it in postcopy_start(), where the downtime is measured right after we flushed the packed data, and right below it there's some idea of optimizing it: if (migrate_postcopy_ram()) { /* * Although this ping is just for debug, it could potentially be * used for getting a better measurement of downtime at the source. 
*/ qemu_savevm_send_ping(ms->to_dst_file, 4); } So maybe I'll have a look there. Besides above, personally I'm happy with the series, one trivial comment in patch 2 but not a huge deal. I don't expect you can get any more comment before the end of this year.. but let's wait until after the Xmas holiday. Thanks! -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu @ 2022-12-23 19:11 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2022-12-23 19:11 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, dgilbert, quintela, pbonzini, david, philmd, zhouyibo [-- Attachment #1: Type: text/plain, Size: 5580 bytes --] On 2022/12/23 下午11:50, Peter Xu wrote: Chuang, On Fri, Dec 23, 2022 at 10:23:04PM +0800, Chuang Xu wrote: In this version: - attach more information in the cover letter. - remove changes on virtio_load(). - add rcu_read_locked() to detect holding of rcu lock. The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance. Here are the test1 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device. time of loading non-iterable vmstate downtime before about 150 ms 740+ ms after about 30 ms 630+ ms Have you investigated why multi-queue added so much downtime overhead with the same environment, comparing to below [1]? I have analyzed the downtime in detail. Both stopping and starting the device are time-consuming. For stopping vhost-net devices, vhost_net_stop_one() will be called once more for each additional queue, while vhost_virtqueue_stop() will be called twice in vhost_dev_stop(). For example, we need to call vhost_virtqueue_stop() 32(= 16 * 2) times to stop a 16-queue vhost-net device. 
In vhost_virtqueue_stop(), QEMU needs to negotiate with the vhost-user daemon. The same is true for vhost-net device startup. For stopping vhost-user-blk devices, vhost_virtqueue_stop() will be called once more for each additional queue. For example, we need to call vhost_virtqueue_stop() 4 times to stop a 4-queue vhost-user-blk device. The same is true for vhost-user-blk device startup. It seems that the vhost-user-blk device is less affected by the number of queues than the vhost-net device. However, the vhost-user-blk device needs to prepare inflight when it is started. The negotiation with SPDK in this step is also time-consuming. I tried to move this step to the startup phase of the target QEMU before the migration started. In my test, this optimization can greatly reduce the vhost-user-blk device startup time and thus reduce the downtime. I'm not sure whether this is hacky. If you are interested in this, maybe we can discuss it further. (This result is different from that of v1. It may be that someone has changed something on my host..., but it does not affect the display of the optimization effect.) In test2, we keep the number of the device the same as test1, reduce the number of queues per device: Here are the test2 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 1-queue vhost-net device - 16 1-queue vhost-user-blk device.
time of loading non-iterable vmstate downtime before about 20 ms about 70 ms after about 11 ms about 60 ms As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase. The downtime measured in precopy can be more complicated than postcopy because the time of switch is calculated by qemu based on the downtime setup, and also that contains part of RAM migrations. Postcopy should be more accurate on that because there's no calculation done, meanwhile there's no RAM transferred during downtime. However postcopy downtime is not accurate either in implementation of it in postcopy_start(), where the downtime is measured right after we flushed the packed data, and right below it there's some idea of optimizing it: if (migrate_postcopy_ram()) { /* * Although this ping is just for debug, it could potentially be * used for getting a better measurement of downtime at the source. */ qemu_savevm_send_ping(ms->to_dst_file, 4); } So maybe I'll have a look there. The current calculation of downtime is really inaccurate, because the source side calculation does not take into account the time consumption of various factors at the destination side. Maybe we can consider transmitting some key timestamps to the destination, and the destination will calculate the actual downtime data after startup. Besides above, personally I'm happy with the series, one trivial comment in patch 2 but not a huge deal. I don't expect you can get any more comment before the end of this year.. but let's wait until after the Xmas holiday. Thanks! I’ll further modify the patch 2 according to your comments. Merry Christmas! [-- Attachment #2: Type: text/html, Size: 6702 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2023-01-13 19:30 UTC | newest]

Thread overview: 22+ messages:
2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu
2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu
2023-01-04 14:20 ` Alex Bennée
2023-01-05 8:17 ` Chuang Xu
2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu
2022-12-23 15:37 ` Peter Xu
2022-12-23 15:47 ` Paolo Bonzini
2022-12-23 15:54 ` Peter Xu
2022-12-28 8:27 ` Paolo Bonzini
2023-01-03 17:43 ` Peter Xu
2023-01-10 8:09 ` Chuang Xu
2023-01-10 14:45 ` Peter Xu
2023-01-12 7:59 ` Chuang Xu
2023-01-12 15:13 ` Peter Xu
2023-01-13 19:29 ` Chuang Xu
2022-12-28 10:50 ` Philippe Mathieu-Daudé
2023-01-04 7:39 ` [External] " Chuang Xu
2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu
2022-12-23 16:06 ` David Hildenbrand
2023-01-04 7:31 ` Chuang Xu
2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu
2022-12-23 19:11 ` Chuang Xu