* [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate
@ 2022-12-23 14:23 Chuang Xu
  2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu
  ` (3 more replies)
  0 siblings, 4 replies; 22+ messages in thread

From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw)
To: qemu-devel; +Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo

In this version:

- attach more information in the cover letter.
- remove changes on virtio_load().
- add rcu_read_locked() to detect holding of the RCU read lock.

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (measured from the timestamp at which the source QEMU
stops to the timestamp at which the target QEMU starts). Most of that time
is spent committing memory region changes repeatedly. This series packs
all the memory region changes made while loading non-iterable vmstate into
a single memory transaction. The more devices a VM has, the larger the
improvement.

Here are the test1 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM
  - 8 16-queue vhost-net devices
  - 16 4-queue vhost-user-blk devices

           time of loading non-iterable vmstate   downtime
  before   about 150 ms                           740+ ms
  after    about 30 ms                            630+ ms

(This result is different from that of v1. Something may have changed on
my host, but it does not affect the demonstration of the optimization.)

In test2, we keep the number of devices the same as in test1 and reduce
the number of queues per device. Here are the test2 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM
  - 8 1-queue vhost-net devices
  - 16 1-queue vhost-user-blk devices
           time of loading non-iterable vmstate   downtime
  before   about 90 ms                            about 250 ms
  after    about 25 ms                            about 160 ms

In test3, we keep the number of queues per device the same as in test1 and
reduce the number of devices. Here are the test3 results:

test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device

           time of loading non-iterable vmstate   downtime
  before   about 20 ms                            about 70 ms
  after    about 11 ms                            about 60 ms

As the test results above show, both the number of queues and the number
of devices have a great impact on the time of loading non-iterable
vmstate. More devices and queues lead to more memory region commits, and
the time spent rebuilding the flatview increases accordingly.

Please review,
Chuang

[v3]

- move virtio_load_check_delay() from virtio_memory_listener_commit() to
  virtio_vmstate_change().
- add a delay_check flag to VirtIODevice to make sure
  virtio_load_check_delay() is called when delay_check is true.

[v2]

- rebase to latest upstream.
- add a sanity check to address_space_to_flatview().
- postpone the init of the vring cache until migration loading completes.

[v1]

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (measured from the timestamp at which the source QEMU
stops to the timestamp at which the target QEMU starts). Most of that time
is spent committing memory region changes repeatedly. This series packs
all the memory region changes made while loading non-iterable vmstate into
a single memory transaction. The more devices a VM has, the larger the
improvement.

Here are the test results:

test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net devices
- 16 4-queue vhost-user-blk devices

           time of loading non-iterable vmstate
  before   about 210 ms
  after    about 40 ms

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [RFC v4 1/3] rcu: introduce rcu_read_locked() 2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu @ 2022-12-23 14:23 ` Chuang Xu 2023-01-04 14:20 ` Alex Bennée 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw) To: qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, Chuang Xu add rcu_read_locked() to detect holding of rcu lock. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> --- include/qemu/rcu.h | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h index b063c6fde8..42cbd0080f 100644 --- a/include/qemu/rcu.h +++ b/include/qemu/rcu.h @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void) } } +static inline bool rcu_read_locked(void) +{ + struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader(); + + return p_rcu_reader->depth > 0; +} + extern void synchronize_rcu(void); /* -- 2.20.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 1/3] rcu: introduce rcu_read_locked() 2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu @ 2023-01-04 14:20 ` Alex Bennée 2023-01-05 8:17 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Alex Bennée @ 2023-01-04 14:20 UTC (permalink / raw) To: Chuang Xu Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, qemu-devel Chuang Xu <xuchuangxclwt@bytedance.com> writes: > add rcu_read_locked() to detect holding of rcu lock. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > include/qemu/rcu.h | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h > index b063c6fde8..42cbd0080f 100644 > --- a/include/qemu/rcu.h > +++ b/include/qemu/rcu.h > @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void) > } > } > > +static inline bool rcu_read_locked(void) We use the locked suffix to indicate functions that should be called with a lock held. Perhaps renaming this to rcu_read_is_locked() would make the intent of the function clearer? > +{ > + struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader(); > + > + return p_rcu_reader->depth > 0; > +} > + > extern void synchronize_rcu(void); > > /* -- Alex Bennée Virtualisation Tech Lead @ Linaro ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 1/3] rcu: introduce rcu_read_locked()
  2023-01-04 14:20 ` Alex Bennée
@ 2023-01-05  8:17 ` Chuang Xu
  0 siblings, 0 replies; 22+ messages in thread

From: Chuang Xu @ 2023-01-05 8:17 UTC (permalink / raw)
To: Alex Bennée
Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]

On 2023/1/4 10:20 PM, Alex Bennée wrote:
> Chuang Xu writes:
>
>> add rcu_read_locked() to detect holding of rcu lock.
>>
>> Signed-off-by: Chuang Xu
>> ---
>>  include/qemu/rcu.h | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
>> index b063c6fde8..42cbd0080f 100644
>> --- a/include/qemu/rcu.h
>> +++ b/include/qemu/rcu.h
>> @@ -119,6 +119,13 @@ static inline void rcu_read_unlock(void)
>>      }
>>  }
>>
>> +static inline bool rcu_read_locked(void)
> We use the locked suffix to indicate functions that should be called
> with a lock held. Perhaps renaming this to rcu_read_is_locked() would
> make the intent of the function clearer?

Yes, rcu_read_is_locked() does make the intent of the function clearer.
I'll rename the function in v5. Thanks!

>> +{
>> +    struct rcu_reader_data *p_rcu_reader = get_ptr_rcu_reader();
>> +
>> +    return p_rcu_reader->depth > 0;
>> +}
>> +
>>  extern void synchronize_rcu(void);
>>
>>  /*

[-- Attachment #2: Type: text/html, Size: 1381 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu 2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu @ 2022-12-23 14:23 ` Chuang Xu 2022-12-23 15:37 ` Peter Xu ` (2 more replies) 2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu 2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu 3 siblings, 3 replies; 22+ messages in thread From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw) To: qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, Chuang Xu Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> --- include/exec/memory.h | 9 +++++++++ softmmu/memory.c | 5 +++++ 2 files changed, 14 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 91f8a2395a..66c43b4862 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1069,8 +1069,17 @@ struct FlatView { MemoryRegion *root; }; +int memory_region_transaction_get_depth(void); + static inline FlatView *address_space_to_flatview(AddressSpace *as) { + /* + * Before using any flatview, sanity check we're not during a memory + * region transaction or the map can be invalid. Note that this can + * also be called during commit phase of memory transaction, but that + * should also only happen when the depth decreases to 0 first. 
+ */ + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); return qatomic_rcu_read(&as->current_map); } diff --git a/softmmu/memory.c b/softmmu/memory.c index bc0be3f62c..01192e2e5b 100644 --- a/softmmu/memory.c +++ b/softmmu/memory.c @@ -1116,6 +1116,11 @@ void memory_region_transaction_commit(void) } } +int memory_region_transaction_get_depth(void) +{ + return memory_region_transaction_depth; +} + static void memory_region_destructor_none(MemoryRegion *mr) { } -- 2.20.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu @ 2022-12-23 15:37 ` Peter Xu 2022-12-23 15:47 ` Paolo Bonzini 2022-12-28 10:50 ` Philippe Mathieu-Daudé 2 siblings, 0 replies; 22+ messages in thread From: Peter Xu @ 2022-12-23 15:37 UTC (permalink / raw) To: Chuang Xu Cc: qemu-devel, dgilbert, quintela, pbonzini, david, philmd, zhouyibo On Fri, Dec 23, 2022 at 10:23:06PM +0800, Chuang Xu wrote: > Before using any flatview, sanity check we're not during a memory > region transaction or the map can be invalid. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > include/exec/memory.h | 9 +++++++++ > softmmu/memory.c | 5 +++++ > 2 files changed, 14 insertions(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 91f8a2395a..66c43b4862 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1069,8 +1069,17 @@ struct FlatView { > MemoryRegion *root; > }; > > +int memory_region_transaction_get_depth(void); > + > static inline FlatView *address_space_to_flatview(AddressSpace *as) > { > + /* > + * Before using any flatview, sanity check we're not during a memory > + * region transaction or the map can be invalid. Note that this can > + * also be called during commit phase of memory transaction, but that > + * should also only happen when the depth decreases to 0 first. Nitpick: after adding the RCU check the comment may need slight touch up: * Meanwhile it's safe to access current_map with RCU read lock held * even if during a memory transaction. It means the user can bear * with an obsolete map. 
> + */ > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > return qatomic_rcu_read(&as->current_map); > } > > diff --git a/softmmu/memory.c b/softmmu/memory.c > index bc0be3f62c..01192e2e5b 100644 > --- a/softmmu/memory.c > +++ b/softmmu/memory.c > @@ -1116,6 +1116,11 @@ void memory_region_transaction_commit(void) > } > } > > +int memory_region_transaction_get_depth(void) > +{ > + return memory_region_transaction_depth; > +} > + > static void memory_region_destructor_none(MemoryRegion *mr) > { > } > -- > 2.20.1 > -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu 2022-12-23 15:37 ` Peter Xu @ 2022-12-23 15:47 ` Paolo Bonzini 2022-12-23 15:54 ` Peter Xu 2022-12-28 10:50 ` Philippe Mathieu-Daudé 2 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2022-12-23 15:47 UTC (permalink / raw) To: Chuang Xu, qemu-devel; +Cc: dgilbert, quintela, peterx, david, philmd, zhouyibo On 12/23/22 15:23, Chuang Xu wrote: > static inline FlatView *address_space_to_flatview(AddressSpace *as) > { > + /* > + * Before using any flatview, sanity check we're not during a memory > + * region transaction or the map can be invalid. Note that this can > + * also be called during commit phase of memory transaction, but that > + * should also only happen when the depth decreases to 0 first. > + */ > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > return qatomic_rcu_read(&as->current_map); > } This is not valid because the transaction could happen in *another* thread. In that case memory_region_transaction_depth() will be > 0, but RCU is needed. Paolo ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 15:47 ` Paolo Bonzini @ 2022-12-23 15:54 ` Peter Xu 2022-12-28 8:27 ` Paolo Bonzini 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2022-12-23 15:54 UTC (permalink / raw) To: Paolo Bonzini Cc: Chuang Xu, qemu-devel, dgilbert, quintela, david, philmd, zhouyibo Hi, Paolo, On Fri, Dec 23, 2022 at 04:47:57PM +0100, Paolo Bonzini wrote: > On 12/23/22 15:23, Chuang Xu wrote: > > static inline FlatView *address_space_to_flatview(AddressSpace *as) > > { > > + /* > > + * Before using any flatview, sanity check we're not during a memory > > + * region transaction or the map can be invalid. Note that this can > > + * also be called during commit phase of memory transaction, but that > > + * should also only happen when the depth decreases to 0 first. > > + */ > > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > > return qatomic_rcu_read(&as->current_map); > > } > > This is not valid because the transaction could happen in *another* thread. > In that case memory_region_transaction_depth() will be > 0, but RCU is > needed. Do you mean the code is wrong, or the comment? Note that the code has checked rcu_read_locked() where introduced in patch 1, but maybe something else was missed? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 15:54 ` Peter Xu @ 2022-12-28 8:27 ` Paolo Bonzini 2023-01-03 17:43 ` Peter Xu 0 siblings, 1 reply; 22+ messages in thread From: Paolo Bonzini @ 2022-12-28 8:27 UTC (permalink / raw) To: Peter Xu Cc: Chuang Xu, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo [-- Attachment #1: Type: text/plain, Size: 706 bytes --] Il ven 23 dic 2022, 16:54 Peter Xu <peterx@redhat.com> ha scritto: > > This is not valid because the transaction could happen in *another* > thread. > > In that case memory_region_transaction_depth() will be > 0, but RCU is > > needed. > > Do you mean the code is wrong, or the comment? Note that the code has > checked rcu_read_locked() where introduced in patch 1, but maybe something > else was missed? > The assertion is wrong. It will succeed even if RCU is unlocked in this thread but a transaction is in progress in another thread. Perhaps you can check (memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? Paolo Thanks, > > -- > Peter Xu > > [-- Attachment #2: Type: text/html, Size: 1371 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-28 8:27 ` Paolo Bonzini @ 2023-01-03 17:43 ` Peter Xu 2023-01-10 8:09 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2023-01-03 17:43 UTC (permalink / raw) To: Paolo Bonzini Cc: Chuang Xu, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo Hi, Paolo, On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: > Il ven 23 dic 2022, 16:54 Peter Xu <peterx@redhat.com> ha scritto: > > > > This is not valid because the transaction could happen in *another* > > thread. > > > In that case memory_region_transaction_depth() will be > 0, but RCU is > > > needed. > > > > Do you mean the code is wrong, or the comment? Note that the code has > > checked rcu_read_locked() where introduced in patch 1, but maybe something > > else was missed? > > > > The assertion is wrong. It will succeed even if RCU is unlocked in this > thread but a transaction is in progress in another thread. IIUC this is the case where the context: (1) doesn't have RCU read lock held, and, (2) doesn't have BQL held. Is it safe at all to reference any flatview in such a context? The thing is I think the flatview pointer can be freed anytime if both locks are not taken. > Perhaps you can check (memory_region_transaction_depth() > 0 && > !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? What if one thread calls address_space_to_flatview() with BQL held but not RCU read lock held? I assume it's a legal operation, but it seems to be able to trigger the assert already? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-03 17:43 ` Peter Xu @ 2023-01-10 8:09 ` Chuang Xu 2023-01-10 14:45 ` Peter Xu 0 siblings, 1 reply; 22+ messages in thread From: Chuang Xu @ 2023-01-10 8:09 UTC (permalink / raw) To: Peter Xu, Paolo Bonzini Cc: qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo [-- Attachment #1: Type: text/plain, Size: 2656 bytes --] Hi, Peter and Paolo, I'm sorry I didn't reply to your email in time. I was infected with COVID-19 two weeks ago, so I couldn't think about the problems discussed in your email for a long time.😷 On 2023/1/4 上午1:43, Peter Xu wrote: > Hi, Paolo, > > On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: >> Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: >> >>>> This is not valid because the transaction could happen in *another* >>> thread. >>>> In that case memory_region_transaction_depth() will be > 0, but RCU is >>>> needed. >>> Do you mean the code is wrong, or the comment? Note that the code has >>> checked rcu_read_locked() where introduced in patch 1, but maybe something >>> else was missed? >>> >> The assertion is wrong. It will succeed even if RCU is unlocked in this >> thread but a transaction is in progress in another thread. > IIUC this is the case where the context: > > (1) doesn't have RCU read lock held, and, > (2) doesn't have BQL held. > > Is it safe at all to reference any flatview in such a context? The thing > is I think the flatview pointer can be freed anytime if both locks are not > taken. > >> Perhaps you can check (memory_region_transaction_depth() > 0 && >> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? > What if one thread calls address_space_to_flatview() with BQL held but not > RCU read lock held? I assume it's a legal operation, but it seems to be > able to trigger the assert already? 
>
> Thanks,
>

I'm not sure whether I understand the content of your discussion
correctly, so let me explain my understanding first.

From my perspective, Paolo thinks that when thread 1 is in a transaction,
thread 2 will trigger the assertion when accessing the flatview without
holding the RCU read lock, although sometimes thread 2's access to the
flatview is legal. So Paolo suggests checking
(memory_region_transaction_depth() > 0 && !qemu_mutex_iothread_locked())
|| rcu_read_locked() instead.

And Peter thinks that as long as no thread holds the BQL or the RCU read
lock, the old flatview can be released (actually executed by the rcu
thread with the BQL held). When thread 1 is in a transaction, if thread 2
accesses the flatview with the BQL held but not the RCU read lock, it's a
legal operation. In this legal case, it seems that both my code and
Paolo's code will trigger the assertion.

Do I have a correct understanding of your emails? I think checking
(memory_region_transaction_get_depth() == 0 || rcu_read_locked() ||
qemu_mutex_iothread_locked()) should cover the case you discussed.

What's your take?

Thanks!

[-- Attachment #2: Type: text/html, Size: 3151 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-10 8:09 ` Chuang Xu @ 2023-01-10 14:45 ` Peter Xu 2023-01-12 7:59 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2023-01-10 14:45 UTC (permalink / raw) To: Chuang Xu Cc: Paolo Bonzini, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote: > Hi, Peter and Paolo, Hi, Chuang, Paolo, > I'm sorry I didn't reply to your email in time. I was infected with > COVID-19 two weeks ago, so I couldn't think about the problems discussed > in your email for a long time.😷 > > On 2023/1/4 上午1:43, Peter Xu wrote: > > Hi, Paolo, > > > > On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: > >> Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: > >> > >>>> This is not valid because the transaction could happen in *another* > >>> thread. > >>>> In that case memory_region_transaction_depth() will be > 0, but RCU is > >>>> needed. > >>> Do you mean the code is wrong, or the comment? Note that the code has > >>> checked rcu_read_locked() where introduced in patch 1, but maybe > something > >>> else was missed? > >>> > >> The assertion is wrong. It will succeed even if RCU is unlocked in this > >> thread but a transaction is in progress in another thread. > > IIUC this is the case where the context: > > > > (1) doesn't have RCU read lock held, and, > > (2) doesn't have BQL held. > > > > Is it safe at all to reference any flatview in such a context? The thing > > is I think the flatview pointer can be freed anytime if both locks are > not > > taken. > > > >> Perhaps you can check (memory_region_transaction_depth() > 0 && > >> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? > > What if one thread calls address_space_to_flatview() with BQL held but > not > > RCU read lock held? I assume it's a legal operation, but it seems to be > > able to trigger the assert already? 
> > > > Thanks, > > > I'm not sure whether I understand the content of your discussion correctly, > so here I want to explain my understanding firstly. > > From my perspective, Paolo thinks that when thread 1 is in a transaction, > thread 2 will trigger the assertion when accessing the flatview without > holding RCU read lock, although sometimes the thread 2's access to flatview > is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0 > && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead. > > And Peter thinks that as long as no thread holds the BQL or RCU read lock, > the old flatview can be released (actually executed by the rcu thread with > BQL held). When thread 1 is in a transaction, if thread 2 access the > flatview > with BQL held but not RCU read lock held, it's a legal operation. In this > legal case, it seems that both my code and Paolo's code will trigger > assertion. IIUC your original patch is fine in this case (BQL held, RCU not held), as long as depth==0. IMHO what we want to trap here is when BQL held (while RCU is not) and depth>0 which can cause unpredictable side effect of using obsolete flatview. To summarize, the original check didn't consider BQL, and if to consider BQL I think it should be something like: /* Guarantees valid access to the flatview, either lock works */ assert(BQL_HELD() || RCU_HELD()); /* * Guarantees any BQL holder is not reading obsolete flatview (e.g. when * during vm load) */ if (BQL_HELD()) assert(depth==0); IIUC it can be merged into: assert((BQL_HELD() && depth==0) || RCU_HELD()); > > I'm not sure if I have a good understanding of your emails? I think > checking(memory_region_transaction_get_depth() == 0 || rcu_read_locked() || > qemu_mutex_iothread_locked()) should cover the case you discussed. This seems still problematic too? Since the assert can pass even if neither BQL nor RCU is held (as long as depth==0). 
Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-10 14:45 ` Peter Xu @ 2023-01-12 7:59 ` Chuang Xu 2023-01-12 15:13 ` Peter Xu 0 siblings, 1 reply; 22+ messages in thread From: Chuang Xu @ 2023-01-12 7:59 UTC (permalink / raw) To: Peter Xu, Paolo Bonzini Cc: qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo Hi, Peter, Paolo, On 2023/1/10 下午10:45, Peter Xu wrote: > On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote: >> Hi, Peter and Paolo, > Hi, Chuang, Paolo, > >> I'm sorry I didn't reply to your email in time. I was infected with >> COVID-19 two weeks ago, so I couldn't think about the problems discussed >> in your email for a long time.😷 >> >> On 2023/1/4 上午1:43, Peter Xu wrote: >>> Hi, Paolo, >>> >>> On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: >>>> Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: >>>> >>>>>> This is not valid because the transaction could happen in *another* >>>>> thread. >>>>>> In that case memory_region_transaction_depth() will be > 0, but RCU is >>>>>> needed. >>>>> Do you mean the code is wrong, or the comment? Note that the code has >>>>> checked rcu_read_locked() where introduced in patch 1, but maybe >> something >>>>> else was missed? >>>>> >>>> The assertion is wrong. It will succeed even if RCU is unlocked in this >>>> thread but a transaction is in progress in another thread. >>> IIUC this is the case where the context: >>> >>> (1) doesn't have RCU read lock held, and, >>> (2) doesn't have BQL held. >>> >>> Is it safe at all to reference any flatview in such a context? The thing >>> is I think the flatview pointer can be freed anytime if both locks are >> not >>> taken. >>> >>>> Perhaps you can check (memory_region_transaction_depth() > 0 && >>>> !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? >>> What if one thread calls address_space_to_flatview() with BQL held but >> not >>> RCU read lock held? 
I assume it's a legal operation, but it seems to be
>>> able to trigger the assert already?
>>>
>>> Thanks,
>>>
>> I'm not sure whether I understand the content of your discussion correctly,
>> so here I want to explain my understanding firstly.
>>
>> From my perspective, Paolo thinks that when thread 1 is in a transaction,
>> thread 2 will trigger the assertion when accessing the flatview without
>> holding RCU read lock, although sometimes the thread 2's access to flatview
>> is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0
>> && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead.
>>
>> And Peter thinks that as long as no thread holds the BQL or RCU read lock,
>> the old flatview can be released (actually executed by the rcu thread with
>> BQL held). When thread 1 is in a transaction, if thread 2 access the
>> flatview with BQL held but not RCU read lock held, it's a legal operation.
>> In this legal case, it seems that both my code and Paolo's code will
>> trigger assertion.
> IIUC your original patch is fine in this case (BQL held, RCU not held), as
> long as depth==0. IMHO what we want to trap here is when BQL held (while
> RCU is not) and depth>0 which can cause unpredictable side effect of using
> obsolete flatview.

I don't quite understand the side effects of depth>0 when the BQL is held
(while RCU is not). From my point of view, both the BQL and RCU ensure
that the flatview will not be released while a worker thread is accessing
it: before the rcu thread releases the flatview, it makes sure that it
holds the BQL itself and that the worker thread does not hold RCU. So,
whether the depth is 0 or not, as long as the BQL or RCU is held, the
worker thread will not access an obsolete flatview (IIUC 'obsolete' means
that the flatview has been released).
>
> To summarize, the original check didn't consider BQL, and if to consider
> BQL I think it should be something like:
>
>     /* Guarantees valid access to the flatview, either lock works */
>     assert(BQL_HELD() || RCU_HELD());
>
>     /*
>      * Guarantees any BQL holder is not reading obsolete flatview (e.g. when
>      * during vm load)
>      */
>     if (BQL_HELD())
>         assert(depth==0);
>
> IIUC it can be merged into:
>
>     assert((BQL_HELD() && depth==0) || RCU_HELD());

IMHO assert(BQL_HELD() || RCU_HELD()) is enough. Or do you think that once
an mr transaction is in progress, the old flatview is already obsolete? If
we regard the flatview as obsolete while an mr transaction is in progress,
how can RCU ensure that the flatview is not obsolete?

What does Paolo think of this check?

Thanks!

> > I'm not sure if I have a good understanding of your emails? I think
> > checking (memory_region_transaction_get_depth() == 0 || rcu_read_locked() ||
> > qemu_mutex_iothread_locked()) should cover the case you discussed.
> This seems still problematic too? Since the assert can pass even if
> neither BQL nor RCU is held (as long as depth==0).
>

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-12 7:59 ` Chuang Xu @ 2023-01-12 15:13 ` Peter Xu 2023-01-13 19:29 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2023-01-12 15:13 UTC (permalink / raw) To: Chuang Xu Cc: Paolo Bonzini, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo On Thu, Jan 12, 2023 at 03:59:55PM +0800, Chuang Xu wrote: > Hi, Peter, Paolo, Chuang, > > On 2023/1/10 下午10:45, Peter Xu wrote: > > On Tue, Jan 10, 2023 at 12:09:41AM -0800, Chuang Xu wrote: > > > Hi, Peter and Paolo, > > Hi, Chuang, Paolo, > > > > > I'm sorry I didn't reply to your email in time. I was infected with > > > COVID-19 two weeks ago, so I couldn't think about the problems discussed > > > in your email for a long time.😷 > > > > > > On 2023/1/4 上午1:43, Peter Xu wrote: > > > > Hi, Paolo, > > > > > > > > On Wed, Dec 28, 2022 at 09:27:50AM +0100, Paolo Bonzini wrote: > > > > > Il ven 23 dic 2022, 16:54 Peter Xu ha scritto: > > > > > > > > > > > > This is not valid because the transaction could happen in *another* > > > > > > thread. > > > > > > > In that case memory_region_transaction_depth() will be > 0, but RCU is > > > > > > > needed. > > > > > > Do you mean the code is wrong, or the comment? Note that the code has > > > > > > checked rcu_read_locked() where introduced in patch 1, but maybe > > > something > > > > > > else was missed? > > > > > > > > > > > The assertion is wrong. It will succeed even if RCU is unlocked in this > > > > > thread but a transaction is in progress in another thread. > > > > IIUC this is the case where the context: > > > > > > > > (1) doesn't have RCU read lock held, and, > > > > (2) doesn't have BQL held. > > > > > > > > Is it safe at all to reference any flatview in such a context? The thing > > > > is I think the flatview pointer can be freed anytime if both locks are > > > not > > > > taken. 
> > > > > > > > > Perhaps you can check (memory_region_transaction_depth() > 0 && > > > > > !qemu_mutex_iothread_locked()) || rcu_read_locked() instead? > > > > What if one thread calls address_space_to_flatview() with BQL held but > > > not > > > > RCU read lock held? I assume it's a legal operation, but it seems to be > > > > able to trigger the assert already? > > > > > > > > Thanks, > > > > > > > I'm not sure whether I understand the content of your discussion correctly, > > > so here I want to explain my understanding firstly. > > > > > > From my perspective, Paolo thinks that when thread 1 is in a transaction, > > > thread 2 will trigger the assertion when accessing the flatview without > > > holding RCU read lock, although sometimes the thread 2's access to flatview > > > is legal. So Paolo suggests checking (memory_region_transaction_depth() > 0 > > > && !qemu_mutex_iothread_locked()) || rcu_read_locked() instead. > > > > > > And Peter thinks that as long as no thread holds the BQL or RCU read lock, > > > the old flatview can be released (actually executed by the rcu thread with > > > BQL held). When thread 1 is in a transaction, if thread 2 access the > > > flatview > > > with BQL held but not RCU read lock held, it's a legal operation. In this > > > legal case, it seems that both my code and Paolo's code will trigger > > > assertion. > > IIUC your original patch is fine in this case (BQL held, RCU not held), as > > long as depth==0. IMHO what we want to trap here is when BQL held (while > > RCU is not) and depth>0 which can cause unpredictable side effect of using > > obsolete flatview. > > I don't quite understand the side effects of depth>0 when BQL is held (while > RCU is not). We wanted to capture outliers when you apply the follow up patch to vm load procedure. That will make depth>0 for the whole process of vm load during migration, and we wanted to make sure it's safe, hence this patch, right? 
> > In my perspective, both BQL and RCU can ensure that the flatview will not be > released when the worker thread accesses the flatview, because before the rcu > thread releases the flatview, it will make sure itself holding BQL and the > worker thread not holding RCU. So, whether the depth is 0 or not, as long as > BQL or RCU is held, the worker thread will not access the obsolete flatview > (IIUC 'obsolete' means that flatview is released). > > > > > To summarize, the original check didn't consider BQL, and if to consider > > BQL I think it should be something like: > > > > /* Guarantees valid access to the flatview, either lock works */ > > assert(BQL_HELD() || RCU_HELD()); > > > > /* > > * Guarantees any BQL holder is not reading obsolete flatview (e.g. when > > * during vm load) > > */ > > if (BQL_HELD()) > > assert(depth==0); > > > > IIUC it can be merged into: > > > > assert((BQL_HELD() && depth==0) || RCU_HELD()); > > IMHO assert(BQL_HELD() || RCU_HELD()) is enough.. Yes, but IMHO this will guarantee safe use of the flatview only _before_ your follow-up patch. Before that patch, the depth==0 should always stand (when BQL_HELD() stands), I think. After that patch, since depth will be increased at the entry of vm load, there's a risk that we can overlook code that will be referencing the flatview during vm load and that can reference an obsolete flatview. Since the whole process of qemu_loadvm_state_main() will have the BQL held, we won't hit the assertion if we only check "BQL_HELD() || RCU_HELD()", because BQL_HELD() is always satisfied. > > Or you think that once a mr transaction is in progress, the old flatview has > been obsolete? If we regard the flatview as obsolete when a mr transaction is in > progress, how can RCU ensure that the flatview is not obsolete? AFAIU RCU cannot guarantee that. So IIUC any RCU lock user needs to be able to tolerate obsolete flatviews being referenced, and it should not harm the system.
If it needs the latest update, it should take care of that separately. For example, the virtio code we're looking at in this series uses the RCU lock to build an address space cache for the device vrings, which is based on the current flatview of mem. It's safe to reference an obsolete flatview in this case (meaning the flatview may be in the middle of an update while virtio is establishing the address space cache); IMHO that's fine because the address space cache will be updated again in virtio_memory_listener_commit(), so it'll be consistent in the end. The intermediate phase of inconsistency should be fine in this case, just like any DMA that happens during a memory hotplug. For this specific patch, IMHO the core is to check depth>0 references, and we need RCU_HELD to mask out where the obsolete references are fine. Thanks, > > What does Paolo think of this check? > > Thanks! > > > > I'm not sure if I have a good understanding of your emails? I think > > > checking (memory_region_transaction_get_depth() == 0 || rcu_read_locked() || > > > qemu_mutex_iothread_locked()) should cover the case you discussed. > > This still seems problematic too, since the assert can pass even if > > neither BQL nor RCU is held (as long as depth==0). > > > > Thanks, > > > -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2023-01-12 15:13 ` Peter Xu @ 2023-01-13 19:29 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2023-01-13 19:29 UTC (permalink / raw) To: Peter Xu Cc: Paolo Bonzini, qemu-devel, David Gilbert, Quintela, Juan, David Hildenbrand, Philippe Mathieu-Daudé, zhouyibo Hi, Peter, On 2023/1/12 11:13 PM, Peter Xu wrote: > We wanted to capture outliers when you apply the follow-up patch to the vm load > procedure. > > That will make depth>0 for the whole process of vm load during migration, > and we wanted to make sure it's safe, hence this patch, right? > >> In my perspective, both BQL and RCU can ensure that the flatview will not be >> released when the worker thread accesses the flatview, because before the rcu >> thread releases the flatview, it will make sure itself holding BQL and the >> worker thread not holding RCU. So, whether the depth is 0 or not, as long as >> BQL or RCU is held, the worker thread will not access the obsolete flatview >> (IIUC 'obsolete' means that flatview is released). >> >>> To summarize, the original check didn't consider BQL, and if to consider >>> BQL I think it should be something like: >>> >>> /* Guarantees valid access to the flatview, either lock works */ >>> assert(BQL_HELD() || RCU_HELD()); >>> >>> /* >>> * Guarantees any BQL holder is not reading obsolete flatview (e.g. when >>> * during vm load) >>> */ >>> if (BQL_HELD()) >>> assert(depth==0); >>> >>> IIUC it can be merged into: >>> >>> assert((BQL_HELD() && depth==0) || RCU_HELD()); >> IMHO assert(BQL_HELD() || RCU_HELD()) is enough.. > Yes, but IMHO this will guarantee safe use of the flatview only _before_ > your follow-up patch. > > Before that patch, the depth==0 should always stand (when BQL_HELD() > stands), I think.
> > After that patch, since depth will be increased at the entry of vm load, > there's a risk that we can overlook code that will be referencing the flatview > during vm load and that can reference an obsolete flatview. Since the > whole process of qemu_loadvm_state_main() will have the BQL held, we won't hit > the assertion if we only check "BQL_HELD() || RCU_HELD()", because BQL_HELD() > is always satisfied. > >> Or you think that once a mr transaction is in progress, the old flatview has >> been obsolete? If we regard the flatview as obsolete when a mr transaction is in >> progress, how can RCU ensure that the flatview is not obsolete? > AFAIU RCU cannot guarantee that. So IIUC any RCU lock user needs to be able > to tolerate obsolete flatviews being referenced, and it should not harm the > system. If it needs the latest update, it should take care of that > separately. > > For example, the virtio code we're looking at in this series uses the RCU lock > to build an address space cache for the device vrings, which is based on the > current flatview of mem. It's safe to reference an obsolete flatview in this > case (meaning the flatview may be in the middle of an update while virtio is > establishing the address space cache); IMHO that's fine because the address > space cache will be updated again in virtio_memory_listener_commit(), so > it'll be consistent in the end. The intermediate phase of inconsistency > should be fine in this case, just like any DMA that happens during a memory > hotplug. > > For this specific patch, IMHO the core is to check depth>0 references, and > we need RCU_HELD to mask out where the obsolete references are fine. > > Thanks, Thanks a lot for your patience! I ignored the preconditions for this discussion, so I asked a stupid question. Now I'm in favor of checking “(BQL_HELD() && depth==0) || RCU_HELD()” in v5. And what does Paolo think of Peter's solution? Thanks again! ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu 2022-12-23 15:37 ` Peter Xu 2022-12-23 15:47 ` Paolo Bonzini @ 2022-12-28 10:50 ` Philippe Mathieu-Daudé 2023-01-04 7:39 ` [External] " Chuang Xu 2 siblings, 1 reply; 22+ messages in thread From: Philippe Mathieu-Daudé @ 2022-12-28 10:50 UTC (permalink / raw) To: Chuang Xu, qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, zhouyibo On 23/12/22 15:23, Chuang Xu wrote: > Before using any flatview, sanity check we're not during a memory > region transaction or the map can be invalid. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > include/exec/memory.h | 9 +++++++++ > softmmu/memory.c | 5 +++++ > 2 files changed, 14 insertions(+) > > diff --git a/include/exec/memory.h b/include/exec/memory.h > index 91f8a2395a..66c43b4862 100644 > --- a/include/exec/memory.h > +++ b/include/exec/memory.h > @@ -1069,8 +1069,17 @@ struct FlatView { > MemoryRegion *root; > }; > > +int memory_region_transaction_get_depth(void); Do we want to expose this; isn't the depth internal? If we need to expose something, can we restrict it to bool memory_region_in_transaction(void) or bool memory_region_transaction_in_progress(void)? > static inline FlatView *address_space_to_flatview(AddressSpace *as) > { > + /* > + * Before using any flatview, sanity check we're not during a memory > + * region transaction or the map can be invalid. Note that this can > + * also be called during commit phase of memory transaction, but that > + * should also only happen when the depth decreases to 0 first. > + */ > + assert(memory_region_transaction_get_depth() == 0 || rcu_read_locked()); > return qatomic_rcu_read(&as->current_map); > } ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [External] Re: [RFC v4 2/3] memory: add depth assert in address_space_to_flatview 2022-12-28 10:50 ` Philippe Mathieu-Daudé @ 2023-01-04 7:39 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2023-01-04 7:39 UTC (permalink / raw) To: Philippe Mathieu-Daudé, qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, david, zhouyibo [-- Attachment #1: Type: text/plain, Size: 1020 bytes --] On 2022/12/28 下午6:50, Philippe Mathieu-Daudé wrote: On 23/12/22 15:23, Chuang Xu wrote: Before using any flatview, sanity check we're not during a memory region transaction or the map can be invalid. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> <xuchuangxclwt@bytedance.com> --- include/exec/memory.h | 9 +++++++++ softmmu/memory.c | 5 +++++ 2 files changed, 14 insertions(+) diff --git a/include/exec/memory.h b/include/exec/memory.h index 91f8a2395a..66c43b4862 100644 --- a/include/exec/memory.h +++ b/include/exec/memory.h @@ -1069,8 +1069,17 @@ struct FlatView { MemoryRegion *root; }; +int memory_region_transaction_get_depth(void); Do we want to expose this; isn't the depth internal? If we need to expose something, can we restrict it to bool memory_region_in_transaction(void) or bool memory_region_transaction_in_progress(void)? Yes, we'd better not expose the value of an internal variable. I'll make changes in v5. Thanks! [-- Attachment #2: Type: text/html, Size: 2032 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate
  2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu
  2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu
  2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu
@ 2022-12-23 14:23 ` Chuang Xu
  2022-12-23 16:06 ` David Hildenbrand
  2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu
  3 siblings, 1 reply; 22+ messages in thread
From: Chuang Xu @ 2022-12-23 14:23 UTC (permalink / raw)
To: qemu-devel
Cc: dgilbert, quintela, pbonzini, peterx, david, philmd, zhouyibo, Chuang Xu

The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory region during the period of
loading non-iterable vmstate in a single memory transaction. With the
increase of devices, this patch will greatly improve the performance.

Here are the test1 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 16-queue vhost-net device
  - 16 4-queue vhost-user-blk device.

        time of loading non-iterable vmstate    downtime
before  about 150 ms                            740+ ms
after   about 30 ms                             630+ ms

In test2, we keep the number of devices the same as in test1 and reduce the
number of queues per device:

Here are the test2 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 8 1-queue vhost-net device
  - 16 1-queue vhost-user-blk device.

        time of loading non-iterable vmstate    downtime
before  about 90 ms                             about 250 ms
after   about 25 ms                             about 160 ms

In test3, we keep the number of queues per device the same as in test1 and
reduce the number of devices:

Here are the test3 results:
test info:
- Host
  - Intel(R) Xeon(R) Platinum 8260 CPU
  - NVIDIA Mellanox ConnectX-5
- VM
  - 32 CPUs 128GB RAM VM
  - 1 16-queue vhost-net device
  - 1 4-queue vhost-user-blk device.

        time of loading non-iterable vmstate    downtime
before  about 20 ms                             about 70 ms
after   about 11 ms                             about 60 ms

As we can see from the test results above, both the number of queues and
the number of devices have a great impact on the time of loading
non-iterable vmstate. The growth of the number of devices and queues will
lead to more mr commits, and the time consumption caused by the flatview
reconstruction will also increase.

Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
---
 migration/savevm.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index a0cdb714f7..19785e5a54 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis)
     uint8_t section_type;
     int ret = 0;
 
+    /* call memory_region_transaction_begin() before loading vmstate */
+    memory_region_transaction_begin();
+
 retry:
     while (true) {
         section_type = qemu_get_byte(f);
@@ -2684,6 +2687,10 @@ out:
 
             goto retry;
         }
     }
+
+    /* call memory_region_transaction_commit() after loading vmstate */
+    memory_region_transaction_commit();
+
     return ret;
 }
-- 
2.20.1

^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu @ 2022-12-23 16:06 ` David Hildenbrand 2023-01-04 7:31 ` Chuang Xu 0 siblings, 1 reply; 22+ messages in thread From: David Hildenbrand @ 2022-12-23 16:06 UTC (permalink / raw) To: Chuang Xu, qemu-devel Cc: dgilbert, quintela, pbonzini, peterx, philmd, zhouyibo On 23.12.22 15:23, Chuang Xu wrote: > The duration of loading non-iterable vmstate accounts for a significant > portion of downtime (starting with the timestamp of source qemu stop and > ending with the timestamp of target qemu start). Most of the time is spent > committing memory region changes repeatedly. > > This patch packs all the changes to memory region during the period of > loading non-iterable vmstate in a single memory transaction. With the > increase of devices, this patch will greatly improve the performance. > > Here are the test1 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 16-queue vhost-net device > - 16 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 150 ms 740+ ms > after about 30 ms 630+ ms > > In test2, we keep the number of the device the same as test1, reduce the > number of queues per device: > > Here are the test2 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 1-queue vhost-net device > - 16 1-queue vhost-user-blk device. 
> > time of loading non-iterable vmstate downtime > before about 90 ms about 250 ms > > after about 25 ms about 160 ms > > In test3, we keep the number of queues per device the same as test1, reduce > the number of devices: > > Here are the test3 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 1 16-queue vhost-net device > - 1 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 20 ms about 70 ms > after about 11 ms about 60 ms > > As we can see from the test results above, both the number of queues and > the number of devices have a great impact on the time of loading non-iterable > vmstate. The growth of the number of devices and queues will lead to more > mr commits, and the time consumption caused by the flatview reconstruction > will also increase. > > Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> > --- > migration/savevm.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > diff --git a/migration/savevm.c b/migration/savevm.c > index a0cdb714f7..19785e5a54 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis) > uint8_t section_type; > int ret = 0; > > + /* call memory_region_transaction_begin() before loading vmstate */ I'd suggest extending the comment *why* you are doing that, that it's a pure performance optimization, and how it achieves that. -- Thanks, David / dhildenb ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 16:06 ` David Hildenbrand @ 2023-01-04 7:31 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2023-01-04 7:31 UTC (permalink / raw) To: David Hildenbrand Cc: qemu-devel, dgilbert, quintela, pbonzini, peterx, philmd, zhouyibo [-- Attachment #1: Type: text/plain, Size: 2988 bytes --] On 2022/12/24 上午12:06, David Hildenbrand wrote: On 23.12.22 15:23, Chuang Xu wrote: The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance. Here are the test1 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device. time of loading non-iterable vmstate downtime before about 150 ms 740+ ms after about 30 ms 630+ ms In test2, we keep the number of the device the same as test1, reduce the number of queues per device: Here are the test2 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 1-queue vhost-net device - 16 1-queue vhost-user-blk device. time of loading non-iterable vmstate downtime before about 90 ms about 250 ms after about 25 ms about 160 ms In test3, we keep the number of queues per device the same as test1, reduce the number of devices: Here are the test3 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 1 16-queue vhost-net device - 1 4-queue vhost-user-blk device. 
time of loading non-iterable vmstate downtime before about 20 ms about 70 ms after about 11 ms about 60 ms As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase. Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com> <xuchuangxclwt@bytedance.com> --- migration/savevm.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/migration/savevm.c b/migration/savevm.c index a0cdb714f7..19785e5a54 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -2617,6 +2617,9 @@ int qemu_loadvm_state_main(QEMUFile *f, MigrationIncomingState *mis) uint8_t section_type; int ret = 0; + /* call memory_region_transaction_begin() before loading vmstate */ I'd suggest extending the comment *why* you are doing that, that it's a pure performance optimization, and how it achieves that. Thanks! I'll extend the comment in v5. [-- Attachment #2: Type: text/html, Size: 5433 bytes --] ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu ` (2 preceding siblings ...) 2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu @ 2022-12-23 15:50 ` Peter Xu 2022-12-23 19:11 ` Chuang Xu 3 siblings, 1 reply; 22+ messages in thread From: Peter Xu @ 2022-12-23 15:50 UTC (permalink / raw) To: Chuang Xu Cc: qemu-devel, dgilbert, quintela, pbonzini, david, philmd, zhouyibo Chuang, On Fri, Dec 23, 2022 at 10:23:04PM +0800, Chuang Xu wrote: > In this version: > > - attach more information in the cover letter. > - remove changes on virtio_load(). > - add rcu_read_locked() to detect holding of rcu lock. > > The duration of loading non-iterable vmstate accounts for a significant > portion of downtime (starting with the timestamp of source qemu stop and > ending with the timestamp of target qemu start). Most of the time is spent > committing memory region changes repeatedly. > > This patch packs all the changes to memory region during the period of > loading non-iterable vmstate in a single memory transaction. With the > increase of devices, this patch will greatly improve the performance. > > Here are the test1 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 16-queue vhost-net device > - 16 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 150 ms 740+ ms > after about 30 ms 630+ ms Have you investigated why multi-queue added so much downtime overhead with the same environment, comparing to below [1]? > > (This result is different from that of v1. It may be that someone has > changed something on my host.., but it does not affect the display of > the optimization effect.) 
> > > In test2, we keep the number of the device the same as test1, reduce the > number of queues per device: > > Here are the test2 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 8 1-queue vhost-net device > - 16 1-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 90 ms about 250 ms > after about 25 ms about 160 ms [1] > > In test3, we keep the number of queues per device the same as test1, reduce > the number of devices: > > Here are the test3 results: > test info: > - Host > - Intel(R) Xeon(R) Platinum 8260 CPU > - NVIDIA Mellanox ConnectX-5 > - VM > - 32 CPUs 128GB RAM VM > - 1 16-queue vhost-net device > - 1 4-queue vhost-user-blk device. > > time of loading non-iterable vmstate downtime > before about 20 ms about 70 ms > after about 11 ms about 60 ms > > > As we can see from the test results above, both the number of queues and > the number of devices have a great impact on the time of loading non-iterable > vmstate. The growth of the number of devices and queues will lead to more > mr commits, and the time consumption caused by the flatview reconstruction > will also increase. The downtime measured in precopy can be more complicated than postcopy because the time of switch is calculated by qemu based on the downtime setup, and also that contains part of RAM migrations. Postcopy should be more accurate on that because there's no calculation done, meanwhile there's no RAM transferred during downtime. However postcopy downtime is not accurate either in implementation of it in postcopy_start(), where the downtime is measured right after we flushed the packed data, and right below it there's some idea of optimizing it: if (migrate_postcopy_ram()) { /* * Although this ping is just for debug, it could potentially be * used for getting a better measurement of downtime at the source. 
*/ qemu_savevm_send_ping(ms->to_dst_file, 4); } So maybe I'll have a look there. Besides above, personally I'm happy with the series, one trivial comment in patch 2 but not a huge deal. I don't expect you can get any more comment before the end of this year.. but let's wait until after the Xmas holiday. Thanks! -- Peter Xu ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate 2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu @ 2022-12-23 19:11 ` Chuang Xu 0 siblings, 0 replies; 22+ messages in thread From: Chuang Xu @ 2022-12-23 19:11 UTC (permalink / raw) To: Peter Xu Cc: qemu-devel, dgilbert, quintela, pbonzini, david, philmd, zhouyibo [-- Attachment #1: Type: text/plain, Size: 5580 bytes --] On 2022/12/23 下午11:50, Peter Xu wrote: Chuang, On Fri, Dec 23, 2022 at 10:23:04PM +0800, Chuang Xu wrote: In this version: - attach more information in the cover letter. - remove changes on virtio_load(). - add rcu_read_locked() to detect holding of rcu lock. The duration of loading non-iterable vmstate accounts for a significant portion of downtime (starting with the timestamp of source qemu stop and ending with the timestamp of target qemu start). Most of the time is spent committing memory region changes repeatedly. This patch packs all the changes to memory region during the period of loading non-iterable vmstate in a single memory transaction. With the increase of devices, this patch will greatly improve the performance. Here are the test1 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 16-queue vhost-net device - 16 4-queue vhost-user-blk device. time of loading non-iterable vmstate downtime before about 150 ms 740+ ms after about 30 ms 630+ ms Have you investigated why multi-queue added so much downtime overhead with the same environment, comparing to below [1]? I have analyzed the downtime in detail. Both stopping and starting the device are time-consuming. For stopping vhost-net devices, vhost_net_stop_one() will be called once more for each additional queue, while vhost_virtqueue_stop() will be called twice in vhost_dev_stop(). For example, we need to call vhost_virtqueue_stop() 32(= 16 * 2) times to stop a 16-queue vhost-net device. 
In vhost_virtqueue_stop(), QEMU needs to negotiate with the vhost-user daemon. The same is true for vhost-net device startup. For stopping vhost-user-blk devices, vhost_virtqueue_stop() will be called once more for each additional queue. For example, we need to call vhost_virtqueue_stop() 4 times to stop a 4-queue vhost-user-blk device. The same is true for vhost-user-blk device startup. It seems that the vhost-user-blk device is less affected by the number of queues than the vhost-net device. However, the vhost-user-blk device needs to prepare inflight when it is started. The negotiation with SPDK in this step is also time-consuming. I tried to move this step to the startup phase of the target QEMU before the migration started. In my test, this optimization can greatly reduce the vhost-user-blk device startup time and thus reduce the downtime. I'm not sure whether this is hacky. If you are interested in this, maybe we can discuss it further. (This result is different from that of v1. It may be that someone has changed something on my host..., but it does not affect the display of the optimization effect.) In test2, we keep the number of the device the same as test1, reduce the number of queues per device: Here are the test2 results: test info: - Host - Intel(R) Xeon(R) Platinum 8260 CPU - NVIDIA Mellanox ConnectX-5 - VM - 32 CPUs 128GB RAM VM - 8 1-queue vhost-net device - 16 1-queue vhost-user-blk device.
time of loading non-iterable vmstate downtime before about 20 ms about 70 ms after about 11 ms about 60 ms As we can see from the test results above, both the number of queues and the number of devices have a great impact on the time of loading non-iterable vmstate. The growth of the number of devices and queues will lead to more mr commits, and the time consumption caused by the flatview reconstruction will also increase. The downtime measured in precopy can be more complicated than postcopy because the time of switch is calculated by qemu based on the downtime setup, and also that contains part of RAM migrations. Postcopy should be more accurate on that because there's no calculation done, meanwhile there's no RAM transferred during downtime. However postcopy downtime is not accurate either in implementation of it in postcopy_start(), where the downtime is measured right after we flushed the packed data, and right below it there's some idea of optimizing it: if (migrate_postcopy_ram()) { /* * Although this ping is just for debug, it could potentially be * used for getting a better measurement of downtime at the source. */ qemu_savevm_send_ping(ms->to_dst_file, 4); } So maybe I'll have a look there. The current calculation of downtime is really inaccurate, because the source side calculation does not take into account the time consumption of various factors at the destination side. Maybe we can consider transmitting some key timestamps to the destination, and the destination will calculate the actual downtime data after startup. Besides above, personally I'm happy with the series, one trivial comment in patch 2 but not a huge deal. I don't expect you can get any more comment before the end of this year.. but let's wait until after the Xmas holiday. Thanks! I’ll further modify the patch 2 according to your comments. Merry Christmas! [-- Attachment #2: Type: text/html, Size: 6702 bytes --] ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2023-01-13 19:30 UTC | newest]

Thread overview: 22+ messages:
2022-12-23 14:23 [RFC v4 0/3] migration: reduce time of loading non-iterable vmstate Chuang Xu
2022-12-23 14:23 ` [RFC v4 1/3] rcu: introduce rcu_read_locked() Chuang Xu
2023-01-04 14:20 ` Alex Bennée
2023-01-05 8:17 ` Chuang Xu
2022-12-23 14:23 ` [RFC v4 2/3] memory: add depth assert in address_space_to_flatview Chuang Xu
2022-12-23 15:37 ` Peter Xu
2022-12-23 15:47 ` Paolo Bonzini
2022-12-23 15:54 ` Peter Xu
2022-12-28 8:27 ` Paolo Bonzini
2023-01-03 17:43 ` Peter Xu
2023-01-10 8:09 ` Chuang Xu
2023-01-10 14:45 ` Peter Xu
2023-01-12 7:59 ` Chuang Xu
2023-01-12 15:13 ` Peter Xu
2023-01-13 19:29 ` Chuang Xu
2022-12-28 10:50 ` Philippe Mathieu-Daudé
2023-01-04 7:39 ` [External] " Chuang Xu
2022-12-23 14:23 ` [RFC v4 3/3] migration: reduce time of loading non-iterable vmstate Chuang Xu
2022-12-23 16:06 ` David Hildenbrand
2023-01-04 7:31 ` Chuang Xu
2022-12-23 15:50 ` [RFC v4 0/3] " Peter Xu
2022-12-23 19:11 ` Chuang Xu