From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "sunhao2 [孙昊]" <sunhao2@kingsoft.com>,
	"YANGFENG1 [杨峰]" <YANGFENG1@kingsoft.com>,
	"quintela@redhat.com" <quintela@redhat.com>,
	"DENGLINWEN [邓林文]" <DENGLINWEN@kingsoft.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"LIZHAOXIN1 [李照鑫]" <LIZHAOXIN1@kingsoft.com>
Subject: Re: [PATCH] migration/rdma: Use huge page register VM memory
Date: Mon, 7 Jun 2021 16:00:28 +0100
Message-ID: <YL40jJgKFQBnq3Us@work-vm>
In-Reply-To: <YL4qh35GquFrbSfq@redhat.com>

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Mon, Jun 07, 2021 at 01:57:02PM +0000, LIZHAOXIN1 [李照鑫] wrote:
> > When using libvirt for RDMA live migration, if the VM's memory is large,
> > deregistering it on the source side takes a long time, resulting in a
> > long downtime (for a 64G VM, deregistration takes about 400ms).
> >
> > Although the VM's memory is backed by 2M huge pages, the MLNX driver
> > still pins and unpins memory in 4K pages, so we register with huge pages
> > to skip the per-4K-page pin and unpin work and reduce downtime.
> >    
> > The test environment:
> > kernel: linux-5.12
> > MLNX: ConnectX-4 LX
> > libvirt command:
> > virsh migrate --live --p2p --persistent --copy-storage-inc --listen-address \
> > 0.0.0.0 --rdma-pin-all --migrateuri rdma://192.168.0.2 [VM] qemu+tcp://192.168.0.2/system
> >     
> > Signed-off-by: lizhaoxin <lizhaoxin1@kingsoft.com>
> > 
> > diff --git a/migration/rdma.c b/migration/rdma.c
> > index 1cdb4561f3..9823449297 100644
> > --- a/migration/rdma.c
> > +++ b/migration/rdma.c
> > @@ -1123,13 +1123,26 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
> >      RDMALocalBlocks *local = &rdma->local_ram_blocks;
> >  
> >      for (i = 0; i < local->nb_blocks; i++) {
> > -        local->block[i].mr =
> > -            ibv_reg_mr(rdma->pd,
> > -                    local->block[i].local_host_addr,
> > -                    local->block[i].length,
> > -                    IBV_ACCESS_LOCAL_WRITE |
> > -                    IBV_ACCESS_REMOTE_WRITE
> > -                    );
> > +        if (strcmp(local->block[i].block_name, "pc.ram") == 0) {
> 
> 'pc.ram' is an x86-specific name, so this will still leave a problem
> on other architectures, I assume.

Yes, and it also breaks even on a PC when using NUMA, since guest RAM is
then split into per-node blocks that aren't named "pc.ram".
I think the thing to do here is to call qemu_ram_pagesize on the
RAMBlock:

  if (qemu_ram_pagesize(RAMBlock....) != qemu_real_host_page_size)
     it's a huge page

I guess it's probably best to do that in qemu_rdma_init_one_block or
something?

I wonder how that all works when there's a mix of different huge page
sizes?
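
For illustration, a minimal sketch of that per-block check (not the posted
patch: the "pagesize" field on RDMALocalBlock and where it gets filled in
are assumptions):

  /*
   * Hypothetical: record each block's backing page size when the block
   * is added (e.g. in qemu_rdma_init_one_block):
   *
   *     block->pagesize = qemu_ram_pagesize(rb);
   *
   * then choose the registration flags per block instead of matching
   * the "pc.ram" name:
   */
  for (i = 0; i < local->nb_blocks; i++) {
      int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE;

      if (local->block[i].pagesize != qemu_real_host_page_size) {
          /* Huge-page backed: skip 4K-granularity pin/unpin. */
          access |= IBV_ACCESS_ON_DEMAND | IBV_ACCESS_HUGETLB;
      }
      local->block[i].mr = ibv_reg_mr(rdma->pd,
                                      local->block[i].local_host_addr,
                                      local->block[i].length,
                                      access);
      if (!local->block[i].mr) {
          perror("Failed to register local dest ram block!\n");
          break;
      }
  }

Tracking the page size per block would also cope with a mix of huge page
sizes, since each block carries its own value.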

Dave

> > +            local->block[i].mr =
> > +                ibv_reg_mr(rdma->pd,
> > +                        local->block[i].local_host_addr,
> > +                        local->block[i].length,
> > +                        IBV_ACCESS_LOCAL_WRITE |
> > +                        IBV_ACCESS_REMOTE_WRITE |
> > +                        IBV_ACCESS_ON_DEMAND |
> > +                        IBV_ACCESS_HUGETLB
> > +                        );
> > +        } else {
> > +            local->block[i].mr =
> > +                ibv_reg_mr(rdma->pd,
> > +                        local->block[i].local_host_addr,
> > +                        local->block[i].length,
> > +                        IBV_ACCESS_LOCAL_WRITE |
> > +                        IBV_ACCESS_REMOTE_WRITE
> > +                        );
> > +        }
> > +
> >          if (!local->block[i].mr) {
> >              perror("Failed to register local dest ram block!\n");
> >              break;
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK




Thread overview: 5+ messages
2021-06-07 13:57 [PATCH] migration/rdma: Use huge page register VM memory LIZHAOXIN1 [李照鑫]
2021-06-07 14:17 ` Daniel P. Berrangé
2021-06-07 15:00   ` Dr. David Alan Gilbert [this message]
2021-06-10 15:35     ` Re: " LIZHAOXIN1 [李照鑫]
2021-06-10 15:33   ` LIZHAOXIN1 [李照鑫]
