* [PATCH] migration/rdma: Use huge page register VM memory
@ 2021-06-07 13:57 LIZHAOXIN1 [李照鑫]
  2021-06-07 14:17 ` Daniel P. Berrangé
  0 siblings, 1 reply; 5+ messages in thread
From: LIZHAOXIN1 [李照鑫] @ 2021-06-07 13:57 UTC (permalink / raw)
  To: qemu-devel, quintela, dgilbert
  Cc: LIZHAOXIN1 [李照鑫], sunhao2 [孙昊],
	DENGLINWEN [邓林文],
	YANGFENG1 [杨峰]

When using libvirt for RDMA live migration, if the VM memory is large,
deregistering it at the source side takes a long time, resulting in a
long downtime (for a 64G VM, deregistration takes about 400ms).

Although the VM's memory is backed by 2M huge pages, the MLNX driver
still pins and unpins the memory in 4K pages. So for huge-page-backed
memory we skip the pin and unpin steps entirely, reducing downtime.
   
The test environment:
kernel: linux-5.12
MLNX: ConnectX-4 LX
libvirt command:
virsh migrate --live --p2p --persistent --copy-storage-inc --listen-address \
0.0.0.0 --rdma-pin-all --migrateuri rdma://192.168.0.2 [VM] qemu+tcp://192.168.0.2/system
    
Signed-off-by: lizhaoxin <lizhaoxin1@kingsoft.com>

diff --git a/migration/rdma.c b/migration/rdma.c
index 1cdb4561f3..9823449297 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -1123,13 +1123,26 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
     RDMALocalBlocks *local = &rdma->local_ram_blocks;
 
     for (i = 0; i < local->nb_blocks; i++) {
-        local->block[i].mr =
-            ibv_reg_mr(rdma->pd,
-                    local->block[i].local_host_addr,
-                    local->block[i].length,
-                    IBV_ACCESS_LOCAL_WRITE |
-                    IBV_ACCESS_REMOTE_WRITE
-                    );
+        if (strcmp(local->block[i].block_name,"pc.ram") == 0) {
+            local->block[i].mr =
+                ibv_reg_mr(rdma->pd,
+                        local->block[i].local_host_addr,
+                        local->block[i].length,
+                        IBV_ACCESS_LOCAL_WRITE |
+                        IBV_ACCESS_REMOTE_WRITE |
+                        IBV_ACCESS_ON_DEMAND |
+                        IBV_ACCESS_HUGETLB
+                        );
+        } else {
+            local->block[i].mr =
+                ibv_reg_mr(rdma->pd,
+                        local->block[i].local_host_addr,
+                        local->block[i].length,
+                        IBV_ACCESS_LOCAL_WRITE |
+                        IBV_ACCESS_REMOTE_WRITE
+                        );
+        }
+
         if (!local->block[i].mr) {
             perror("Failed to register local dest ram block!\n");
             break;


* Re: [PATCH] migration/rdma: Use huge page register VM memory
  2021-06-07 13:57 [PATCH] migration/rdma: Use huge page register VM memory LIZHAOXIN1 [李照鑫]
@ 2021-06-07 14:17 ` Daniel P. Berrangé
  2021-06-07 15:00   ` Dr. David Alan Gilbert
  2021-06-10 15:33   ` LIZHAOXIN1 [李照鑫]
  0 siblings, 2 replies; 5+ messages in thread
From: Daniel P. Berrangé @ 2021-06-07 14:17 UTC (permalink / raw)
  To: LIZHAOXIN1 [李照鑫]
  Cc: sunhao2 [孙昊], YANGFENG1 [杨峰],
	quintela, DENGLINWEN [邓林文],
	qemu-devel, dgilbert

On Mon, Jun 07, 2021 at 01:57:02PM +0000, LIZHAOXIN1 [李照鑫] wrote:
> When using libvirt for RDMA live migration, if the VM memory is large,
> deregistering it at the source side takes a long time, resulting in a
> long downtime (for a 64G VM, deregistration takes about 400ms).
>
> Although the VM's memory is backed by 2M huge pages, the MLNX driver
> still pins and unpins the memory in 4K pages. So for huge-page-backed
> memory we skip the pin and unpin steps entirely, reducing downtime.
>    
> The test environment:
> kernel: linux-5.12
> MLNX: ConnectX-4 LX
> libvirt command:
> virsh migrate --live --p2p --persistent --copy-storage-inc --listen-address \
> 0.0.0.0 --rdma-pin-all --migrateuri rdma://192.168.0.2 [VM] qemu+tcp://192.168.0.2/system
>     
> Signed-off-by: lizhaoxin <lizhaoxin1@kingsoft.com>
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 1cdb4561f3..9823449297 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1123,13 +1123,26 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>      RDMALocalBlocks *local = &rdma->local_ram_blocks;
>  
>      for (i = 0; i < local->nb_blocks; i++) {
> -        local->block[i].mr =
> -            ibv_reg_mr(rdma->pd,
> -                    local->block[i].local_host_addr,
> -                    local->block[i].length,
> -                    IBV_ACCESS_LOCAL_WRITE |
> -                    IBV_ACCESS_REMOTE_WRITE
> -                    );
> +        if (strcmp(local->block[i].block_name,"pc.ram") == 0) {

'pc.ram' is an x86 architecture specific name, so this will still
leave a problem on other architectures I assume.

> +            local->block[i].mr =
> +                ibv_reg_mr(rdma->pd,
> +                        local->block[i].local_host_addr,
> +                        local->block[i].length,
> +                        IBV_ACCESS_LOCAL_WRITE |
> +                        IBV_ACCESS_REMOTE_WRITE |
> +                        IBV_ACCESS_ON_DEMAND |
> +                        IBV_ACCESS_HUGETLB
> +                        );
> +        } else {
> +            local->block[i].mr =
> +                ibv_reg_mr(rdma->pd,
> +                        local->block[i].local_host_addr,
> +                        local->block[i].length,
> +                        IBV_ACCESS_LOCAL_WRITE |
> +                        IBV_ACCESS_REMOTE_WRITE
> +                        );
> +        }
> +
>          if (!local->block[i].mr) {
>              perror("Failed to register local dest ram block!\n");
>              break;

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|




* Re: [PATCH] migration/rdma: Use huge page register VM memory
  2021-06-07 14:17 ` Daniel P. Berrangé
@ 2021-06-07 15:00   ` Dr. David Alan Gilbert
  2021-06-10 15:35     ` Re: " LIZHAOXIN1 [李照鑫]
  2021-06-10 15:33   ` LIZHAOXIN1 [李照鑫]
  1 sibling, 1 reply; 5+ messages in thread
From: Dr. David Alan Gilbert @ 2021-06-07 15:00 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: sunhao2 [孙昊], YANGFENG1 [杨峰],
	quintela, DENGLINWEN [邓林文],
	qemu-devel, LIZHAOXIN1 [李照鑫]

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Jun 07, 2021 at 01:57:02PM +0000, LIZHAOXIN1 [李照鑫] wrote:
> > When using libvirt for RDMA live migration, if the VM memory is large,
> > deregistering it at the source side takes a long time, resulting in a
> > long downtime (for a 64G VM, deregistration takes about 400ms).
> >
> > Although the VM's memory is backed by 2M huge pages, the MLNX driver
> > still pins and unpins the memory in 4K pages. So for huge-page-backed
> > memory we skip the pin and unpin steps entirely, reducing downtime.
> >    
> > The test environment:
> > kernel: linux-5.12
> > MLNX: ConnectX-4 LX
> > libvirt command:
> > virsh migrate --live --p2p --persistent --copy-storage-inc --listen-address \
> > 0.0.0.0 --rdma-pin-all --migrateuri rdma://192.168.0.2 [VM] qemu+tcp://192.168.0.2/system
> >     
> > Signed-off-by: lizhaoxin <lizhaoxin1@kingsoft.com>
> > 
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 1cdb4561f3..9823449297 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -1123,13 +1123,26 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
> >      RDMALocalBlocks *local = &rdma->local_ram_blocks;
> >  
> >      for (i = 0; i < local->nb_blocks; i++) {
> > -        local->block[i].mr =
> > -            ibv_reg_mr(rdma->pd,
> > -                    local->block[i].local_host_addr,
> > -                    local->block[i].length,
> > -                    IBV_ACCESS_LOCAL_WRITE |
> > -                    IBV_ACCESS_REMOTE_WRITE
> > -                    );
> > +        if (strcmp(local->block[i].block_name,"pc.ram") == 0) {
> 
> 'pc.ram' is an x86 architecture specific name, so this will still
> leave a problem on other architectures I assume.

Yes, and it will also break even on PC when using NUMA.
I think the thing to do here is to call qemu_ram_pagesize on the
RAMBlock; 

  if (qemu_ram_pagesize(RAMBlock....) != qemu_real_host_page_size)
     it's a huge page

I guess it's probably best to do that in qemu_rdma_init_one_block or
something?
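
Something like this rough sketch, perhaps (the page_size field on
RDMALocalBlock is hypothetical, and the real shape of
qemu_rdma_init_one_block may differ):

  /* Hypothetical: remember each block's backing page size when the
   * local block list is built, so the registration loop can test it
   * later instead of matching block names. */
  static void rdma_record_block_pagesize(RDMALocalBlock *block, RAMBlock *rb)
  {
      block->page_size = qemu_ram_pagesize(rb); /* hypothetical field */
  }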

I wonder how that all works when there's a mix of different huge page
sizes?

Dave

> > +            local->block[i].mr =
> > +                ibv_reg_mr(rdma->pd,
> > +                        local->block[i].local_host_addr,
> > +                        local->block[i].length,
> > +                        IBV_ACCESS_LOCAL_WRITE |
> > +                        IBV_ACCESS_REMOTE_WRITE |
> > +                        IBV_ACCESS_ON_DEMAND |
> > +                        IBV_ACCESS_HUGETLB
> > +                        );
> > +        } else {
> > +            local->block[i].mr =
> > +                ibv_reg_mr(rdma->pd,
> > +                        local->block[i].local_host_addr,
> > +                        local->block[i].length,
> > +                        IBV_ACCESS_LOCAL_WRITE |
> > +                        IBV_ACCESS_REMOTE_WRITE
> > +                        );
> > +        }
> > +
> >          if (!local->block[i].mr) {
> >              perror("Failed to register local dest ram block!\n");
> >              break;
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




* Re: [PATCH] migration/rdma: Use huge page register VM memory
  2021-06-07 14:17 ` Daniel P. Berrangé
  2021-06-07 15:00   ` Dr. David Alan Gilbert
@ 2021-06-10 15:33   ` LIZHAOXIN1 [李照鑫]
  1 sibling, 0 replies; 5+ messages in thread
From: LIZHAOXIN1 [李照鑫] @ 2021-06-10 15:33 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: sunhao2 [孙昊], YANGFENG1 [杨峰],
	quintela, DENGLINWEN [邓林文],
	qemu-devel, dgilbert

Yes, 'pc.ram' is the x86-specific name. I have read that
memory_region_allocate_system_memory assigns different names
on other architectures.
Thanks for the reminder.

Regards,
lizhaoxin.

-----Original Message-----
From: Daniel P. Berrangé <berrange@redhat.com> 
Sent: June 7, 2021 22:18
To: LIZHAOXIN1 [李照鑫] <LIZHAOXIN1@kingsoft.com>
Cc: qemu-devel@nongnu.org; quintela@redhat.com; dgilbert@redhat.com; sunhao2 [孙昊] <sunhao2@kingsoft.com>; DENGLINWEN [邓林文] <DENGLINWEN@kingsoft.com>; YANGFENG1 [杨峰] <YANGFENG1@kingsoft.com>
Subject: Re: [PATCH] migration/rdma: Use huge page register VM memory

On Mon, Jun 07, 2021 at 01:57:02PM +0000, LIZHAOXIN1 [李照鑫] wrote:
> When using libvirt for RDMA live migration, if the VM memory is large,
> deregistering it at the source side takes a long time, resulting in a
> long downtime (for a 64G VM, deregistration takes about 400ms).
>
> Although the VM's memory is backed by 2M huge pages, the MLNX driver
> still pins and unpins the memory in 4K pages. So for huge-page-backed
> memory we skip the pin and unpin steps entirely, reducing downtime.
>    
> The test environment:
> kernel: linux-5.12
> MLNX: ConnectX-4 LX
> libvirt command:
> virsh migrate --live --p2p --persistent --copy-storage-inc 
> --listen-address \
> 0.0.0.0 --rdma-pin-all --migrateuri rdma://192.168.0.2 [VM] 
> qemu+tcp://192.168.0.2/system
>     
> Signed-off-by: lizhaoxin <lizhaoxin1@kingsoft.com>
> 
> diff --git a/migration/rdma.c b/migration/rdma.c index 
> 1cdb4561f3..9823449297 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -1123,13 +1123,26 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>      RDMALocalBlocks *local = &rdma->local_ram_blocks;
>  
>      for (i = 0; i < local->nb_blocks; i++) {
> -        local->block[i].mr =
> -            ibv_reg_mr(rdma->pd,
> -                    local->block[i].local_host_addr,
> -                    local->block[i].length,
> -                    IBV_ACCESS_LOCAL_WRITE |
> -                    IBV_ACCESS_REMOTE_WRITE
> -                    );
> +        if (strcmp(local->block[i].block_name,"pc.ram") == 0) {

'pc.ram' is an x86 architecture specific name, so this will still leave a problem on other architectures I assume.

> +            local->block[i].mr =
> +                ibv_reg_mr(rdma->pd,
> +                        local->block[i].local_host_addr,
> +                        local->block[i].length,
> +                        IBV_ACCESS_LOCAL_WRITE |
> +                        IBV_ACCESS_REMOTE_WRITE |
> +                        IBV_ACCESS_ON_DEMAND |
> +                        IBV_ACCESS_HUGETLB
> +                        );
> +        } else {
> +            local->block[i].mr =
> +                ibv_reg_mr(rdma->pd,
> +                        local->block[i].local_host_addr,
> +                        local->block[i].length,
> +                        IBV_ACCESS_LOCAL_WRITE |
> +                        IBV_ACCESS_REMOTE_WRITE
> +                        );
> +        }
> +
>          if (!local->block[i].mr) {
>              perror("Failed to register local dest ram block!\n");
>              break;

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH] migration/rdma: Use huge page register VM memory
  2021-06-07 15:00   ` Dr. David Alan Gilbert
@ 2021-06-10 15:35     ` LIZHAOXIN1 [李照鑫]
  0 siblings, 0 replies; 5+ messages in thread
From: LIZHAOXIN1 [李照鑫] @ 2021-06-10 15:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Daniel P. Berrangé
  Cc: sunhao2 [孙昊], YANGFENG1 [杨峰],
	quintela, DENGLINWEN [邓林文],
	qemu-devel, LIZHAOXIN1 [李照鑫]

Yes, when I configured two NUMA nodes for the VM, the memory blocks
were named 'ram-node*', and other architectures use different names
as well.
As you suggested, I now use qemu_ram_pagesize() and
qemu_real_host_page_size to determine which RAMBlocks use huge pages;
a rough sketch is below. I will send the second version of the patch
later.

When there is a mix of different huge page sizes, there is no difference
in behavior: registering or pinning only prevents the memory from being
swapped out. Huge pages are never swapped out, so they need no
deregistration or unpinning.
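
A minimal sketch of that registration loop (not the actual v2 patch;
the per-block page_size field is an assumption, recorded from
qemu_ram_pagesize() when the block list is built):

  for (i = 0; i < local->nb_blocks; i++) {
      int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;

      /* Huge-page-backed blocks are never swapped out, so on-demand
       * registration can skip the per-4K-page pin/unpin entirely. */
      if (local->block[i].page_size != qemu_real_host_page_size) {
          access |= IBV_ACCESS_ON_DEMAND | IBV_ACCESS_HUGETLB;
      }

      local->block[i].mr = ibv_reg_mr(rdma->pd,
                                      local->block[i].local_host_addr,
                                      local->block[i].length,
                                      access);
      if (!local->block[i].mr) {
          perror("Failed to register local dest ram block!\n");
          break;
      }
  }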

The libvirt XML of my VM is:
...
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB' nodeset='0'/>
    <page size='1048576' unit='KiB' nodeset='1'/>
  </hugepages>
</memoryBacking>
...
<numa>
  <cell id='0' cpus='0-7' memory='31457280' unit='KiB' memAccess='shared'/>
  <cell id='1' cpus='8-15' memory='2097152' unit='KiB' memAccess='shared'/>
</numa>
...

After testing, RDMA live migration works normally, and the downtime is significantly reduced.

-----Original Message-----
From: Dr. David Alan Gilbert <dgilbert@redhat.com> 
Sent: June 7, 2021 23:00
To: Daniel P. Berrangé <berrange@redhat.com>
Cc: LIZHAOXIN1 [李照鑫] <LIZHAOXIN1@kingsoft.com>; qemu-devel@nongnu.org; quintela@redhat.com; sunhao2 [孙昊] <sunhao2@kingsoft.com>; DENGLINWEN [邓林文] <DENGLINWEN@kingsoft.com>; YANGFENG1 [杨峰] <YANGFENG1@kingsoft.com>
Subject: Re: [PATCH] migration/rdma: Use huge page register VM memory

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Jun 07, 2021 at 01:57:02PM +0000, LIZHAOXIN1 [李照鑫] wrote:
> > When using libvirt for RDMA live migration, if the VM memory is large,
> > deregistering it at the source side takes a long time, resulting in a
> > long downtime (for a 64G VM, deregistration takes about 400ms).
> >
> > Although the VM's memory is backed by 2M huge pages, the MLNX driver
> > still pins and unpins the memory in 4K pages. So for huge-page-backed
> > memory we skip the pin and unpin steps entirely, reducing downtime.
> >    
> > The test environment:
> > kernel: linux-5.12
> > MLNX: ConnectX-4 LX
> > libvirt command:
> > virsh migrate --live --p2p --persistent --copy-storage-inc 
> > --listen-address \
> > 0.0.0.0 --rdma-pin-all --migrateuri rdma://192.168.0.2 [VM] 
> > qemu+tcp://192.168.0.2/system
> >     
> > Signed-off-by: lizhaoxin <lizhaoxin1@kingsoft.com>
> > 
> > diff --git a/migration/rdma.c b/migration/rdma.c index 
> > 1cdb4561f3..9823449297 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -1123,13 +1123,26 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
> >      RDMALocalBlocks *local = &rdma->local_ram_blocks;
> >  
> >      for (i = 0; i < local->nb_blocks; i++) {
> > -        local->block[i].mr =
> > -            ibv_reg_mr(rdma->pd,
> > -                    local->block[i].local_host_addr,
> > -                    local->block[i].length,
> > -                    IBV_ACCESS_LOCAL_WRITE |
> > -                    IBV_ACCESS_REMOTE_WRITE
> > -                    );
> > +        if (strcmp(local->block[i].block_name,"pc.ram") == 0) {
> 
> 'pc.ram' is an x86 architecture specific name, so this will still 
> leave a problem on other architectures I assume.

Yes, and it will also break even on PC when using NUMA.
I think the thing to do here is to call qemu_ram_pagesize on the RAMBlock; 

  if (qemu_ram_pagesize(RAMBlock....) != qemu_real_host_page_size)
     it's a huge page

I guess it's probably best to do that in qemu_rdma_init_one_block or something?

I wonder how that all works when there's a mix of different huge page sizes?

Dave

> > +            local->block[i].mr =
> > +                ibv_reg_mr(rdma->pd,
> > +                        local->block[i].local_host_addr,
> > +                        local->block[i].length,
> > +                        IBV_ACCESS_LOCAL_WRITE |
> > +                        IBV_ACCESS_REMOTE_WRITE |
> > +                        IBV_ACCESS_ON_DEMAND |
> > +                        IBV_ACCESS_HUGETLB
> > +                        );
> > +        } else {
> > +            local->block[i].mr =
> > +                ibv_reg_mr(rdma->pd,
> > +                        local->block[i].local_host_addr,
> > +                        local->block[i].length,
> > +                        IBV_ACCESS_LOCAL_WRITE |
> > +                        IBV_ACCESS_REMOTE_WRITE
> > +                        );
> > +        }
> > +
> >          if (!local->block[i].mr) {
> >              perror("Failed to register local dest ram block!\n");
> >              break;
> 
> Regards,
> Daniel
> --
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


