From: "Michael S. Tsirkin"
To: "Li, Liang Z"
Cc: "qemu-devel@nongnu.org", "kvm@vger.kernel.org",
 "linux-kernel@vger.kenel.org", "pbonzini@redhat.com", "rth@twiddle.net",
 "ehabkost@redhat.com", "amit.shah@redhat.com", "quintela@redhat.com",
 "dgilbert@redhat.com", "mohan_parthasarathy@hpe.com",
 "jitendra.kolhe@hpe.com", "simhan@hpe.com", "rkagan@virtuozzo.com",
 "riel@redhat.com"
Subject: Re: [RFC Design Doc] Speed up live migration by skipping free pages
Date: Wed, 23 Mar 2016 16:08:04 +0200
Message-ID: <20160323155325-mutt-send-email-mst@redhat.com>
References: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
 <20160322101116.GA9532@redhat.com>

On Wed, Mar 23, 2016 at 06:05:27AM +0000, Li, Liang Z wrote:
> > > To make things easier, I wrote this doc about the possible designs and
> > > my choices. Comments are welcome!
> > 
> > Thanks for putting this together, and especially for taking the trouble
> > to benchmark existing code paths!
> > 
> > I think these numbers do show that there are gains to be had from
> > merging your code with the existing balloon device. It will probably be
> > a bit more work, but I think it'll be worth it.
> > 
> > More comments below.
> > 
> 
> Thanks for your comments!
> 
> > > 2. Why not use virtio-balloon
> > > Actually, virtio-balloon can do a similar thing by inflating the
> > > balloon before live migration, but its performance is not good: for an
> > > 8GB idle guest that has just booted, it takes about 5.7 seconds to
> > > inflate the balloon to 7GB, while it takes only 25ms to get a valid
> > > free page bitmap from the guest. There are several reasons for the bad
> > > performance of virtio-balloon:
> > > a. allocating pages (5%, 304ms)
> > 
> > Interesting. This is definitely worth improving in the guest kernel.
> > Also, will it be faster if we allocate and pass huge pages instead?
> > Might speed up madvise as well.
> 
> Maybe.
> 
> > > b. sending PFNs to host (71%, 4194ms)
> > 
> > OK, so we probably should teach the balloon to pass huge lists in
> > bitmaps. That will be beneficial for regular balloon operation as well.
> > 
> 
> Agree. The current balloon sends just 256 PFNs at a time; that's too few
> and leads to too many virtio transmissions, which is the main reason for
> the bad performance.
> Changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a larger value can improve the
> performance significantly. Maybe we should increase it before doing the
> further optimization, what do you think?

We could push it up a bit higher: 256 PFNs are 1 kbyte in size, so we can
make the array 3x bigger and still fit struct virtio_balloon in a single
page. But if we are going to add the bitmap variant anyway, we probably
shouldn't bother.
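For a rough sense of scale (back-of-the-envelope arithmetic only, assuming
4 KiB pages and one 32-bit entry per PFN as in the current driver), compare
the number of 256-entry messages needed to report 7GB with the size of a
single bitmap covering the same range:

-----------------------------------------------
/* Illustrative arithmetic only, not driver code. */
#include <stdio.h>

int main(void)
{
    unsigned long long bytes = 7ULL << 30;       /* 7 GiB to report   */
    unsigned long long pages = bytes / 4096;     /* ~1.8M 4 KiB pages */

    printf("pages to report:    %llu\n", pages);
    printf("256-PFN messages:   %llu (1 KiB payload each)\n", pages / 256);
    printf("one bitmap instead: %llu KiB total\n", pages / 8 / 1024);
    return 0;
}
-----------------------------------------------

That is roughly 7000 one-kilobyte messages versus a single 224 KiB bitmap,
which is why the bitmap variant looks more attractive than just growing the
PFN array.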
> > > c. address translation and madvise() operation (24%, 1423ms)
> > 
> > How is this split between translation and madvise?  I suspect it's
> > mostly madvise, since you need translation when using a bitmap as well.
> > Correct? Could you measure this please?  Also, what if we use the new
> > MADV_FREE instead?  By how much would this help?
> > 
> For the current balloon, address translation is needed.
> But for live migration, there is no need to do address translation.

Well, you need the ram address in order to clear the dirty bit.
How would you get it without translation?

> 
> I did another run and got the following data:
> a. allocating pages (6.4%, 402ms)
> b. sending PFNs to host (68.3%, 4263ms)
> c. address translation (6.2%, 389ms)
> d. madvise (19.0%, 1188ms)
> 
> Address translation is a time-consuming operation too.
> I will try MADV_FREE later.

Thanks!

> > Finally, we could teach the balloon to skip madvise completely.
> > By how much would this help?
> > 
> > > Debugging shows the time spent on these operations; it is listed in
> > > the brackets above. By changing VIRTIO_BALLOON_ARRAY_PFNS_MAX to a
> > > large value, such as 16384, the time spent on sending the PFNs can be
> > > reduced to about 400ms, but it's still too long.
> > > Obviously, the virtio-balloon mechanism has a bigger performance
> > > impact on the guest than the approach we are trying to implement.
> > 
> > Since, as we see, some of the new interfaces might be beneficial to the
> > balloon as well, I am rather of the opinion that extending the balloon
> > (basically 3a) might be the right thing to do.
> > 
> > > 3. Virtio interface
> > > There are three different ways of using the virtio interface to send
> > > the free page information.
> > > a. Extend the current virtio device
> > > The virtio spec has already defined some virtio devices, and we can
> > > extend one of these devices so as to use it to transport the free page
> > > information. It requires modifying the virtio spec.
> > 
> > You don't have to do it all by yourself, by the way.
> > Submit the proposal to the OASIS virtio TC mailing list; we will take it
> > from there.
> > 
> That's great.
> 
> >> 4. Construct free page bitmap
> >> To minimize the space needed for the free page information, it's better
> >> to use a bitmap to describe the free pages. There are two ways to
> >> construct the free page bitmap.
> >> 
> >> a. Construct the free page bitmap on demand (my choice). The guest
> >> allocates memory for the free page bitmap only when it receives the
> >> request from QEMU, and sets the bitmap by traversing the free page
> >> list. The advantage of this way is that it's quite simple and easy to
> >> implement. The disadvantage is that the traversal may take quite a long
> >> time when there are a lot of free pages (about 20ms for 7GB of free
> >> pages).
> >> 
> >> b. Update the free page bitmap when allocating/freeing pages. Another
> >> choice is to allocate the memory for the free page bitmap when the
> >> guest boots, and then update the bitmap when allocating/freeing pages.
> >> It needs more modification to the memory management code in the guest.
> >> The advantage of this way is that the guest can respond to QEMU's
> >> request for a free page bitmap very quickly, no matter how many free
> >> pages there are. Do the kernel guys like this?
> >> 
> 
> > > 8. Pseudo code
> > > Dirty page logging should be enabled before getting the free page
> > > information from the guest. This is important because, while the free
> > > pages are being collected, some of them may be reused and written by
> > > the guest; dirty page logging can track these pages. The pseudo code
> > > is like below:
> > >
> > > -----------------------------------------------
> > > MigrationState *s = migrate_get_current();
> > > ...
> > >
> > > memory_global_dirty_log_start();
> > >
> > > if (get_guest_mem_info(&info)) {
> > >     while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
> > >            s->state != MIGRATION_STATUS_CANCELLING) {
> > >         usleep(1000); // sleep for 1 ms
> > >     }
> > >
> > >     tighten_free_page_bmap =
> > >         tighten_guest_free_pages(free_page_bitmap);
> > >     filter_out_guest_free_pages(tighten_free_page_bmap);
> > > }
> > >
> > > migration_bitmap_sync();
> > > ...
> > >
> > > -----------------------------------------------
> > 
> > 
> > I don't completely agree with this part.  In my opinion, it should be
> > asynchronous, driven by getting the page lists from the guest:
> > 
> > anywhere/periodically:
> > ...
> > request_guest_mem_info
> > ...
> > 
> 
> Periodically? That means filtering out guest free pages not only in the
> RAM bulk stage, but during the whole process of live migration, right?
> If so, it's better to use 4b to construct the free page bitmap.

That's up to the guest. I would say focus on 4a first; once it works,
experiment with 4b and see what the speedup is.
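For reference, option 4a boils down to the kind of buddy free-list walk the
hibernation code already does in mark_free_pages(). Below is only a sketch:
the function name and the flat one-bit-per-PFN bitmap layout are assumptions
for illustration, and races with allocations are intentionally ignored.

-----------------------------------------------
/* Sketch of option 4a: mark every page currently on the buddy free lists
 * in a caller-supplied bitmap (one bit per PFN, starting at PFN 0).
 * Function name and bitmap layout are illustrative assumptions. */
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/bitmap.h>

static void fill_free_page_bitmap(unsigned long *bitmap, unsigned long max_pfn)
{
	struct zone *zone;
	struct page *page;
	unsigned long pfn, flags;
	unsigned int order, t;

	for_each_populated_zone(zone) {
		spin_lock_irqsave(&zone->lock, flags);
		for (order = 0; order < MAX_ORDER; order++) {
			for (t = 0; t < MIGRATE_TYPES; t++) {
				list_for_each_entry(page,
						&zone->free_area[order].free_list[t],
						lru) {
					pfn = page_to_pfn(page);
					/* a free buddy block covers 2^order pages */
					if (pfn + (1UL << order) <= max_pfn)
						bitmap_set(bitmap, pfn, 1UL << order);
				}
			}
		}
		spin_unlock_irqrestore(&zone->lock, flags);
	}
}
-----------------------------------------------

A walk like this is where the ~20ms for 7GB quoted in 4a goes. Nothing here
pins the pages, which is why dirty page logging has to stay enabled while the
bitmap is in flight, as section 8 already notes.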
The pseudo code=C2= =A0is > > > like below: > > > > > > ----------------------------------------------- > > > MigrationState *s =3D migrate_get_current(); > > > ... > > > > > > memory_global_dirty_log_start(); > > > > > > if (get_guest_mem_info(&info)) { > > > while (!get_free_page_bmap(free_page_bitmap, drop_page_c= ache) > > && > > > s->state !=3D MIGRATION_STATUS_CANCELLING) { > > > usleep(1000) // sleep for 1 ms > > > } > > > > > > tighten_free_page_bmap =3D > > tighten_guest_free_pages(free_page_bitmap); > > > filter_out_guest_free_pages(tighten_free_page_bmap); > > > } > > > > > > migration_bitmap_sync(); > > > ... > > > > > > ----------------------------------------------- > >=20 > >=20 > > I don't completely agree with this part. In my opinion, it should = be > > asynchronous, depending on getting page lists from guest: > >=20 > > anywhere/periodically: > > ... > > request_guest_mem_info > > ... > >=20 >=20 > Periodically? That means filtering out guest free pages not only > in the ram bulk stage, but during the whole process of live migration= =2E right? =20 > If so, it's better to use 4b to construct the free page bitmap. That's up to guest. I would say focus on 4a first, once it works, experiment with 4b and see what the speedup is. > > later: > >=20 > >=20 > > handle_guest_mem_info() > > { > > address_space_sync_dirty_bitmap > > filter_out_guest_free_pages > > } > >=20 > > as long as we filter with VCPU stopped like this, we can drop the s= ync dirty > > stage, or alternatively we could move filter_out_guest_free_pages i= nto bh > > so it happens later while VCPU is running. > >=20 > > This removes any need for waiting. > >=20 > >=20 > > Introducing delay into migration might still be benefitial but this= way it is > > optional, we still get part of the benefit even if we don't wait lo= ng enough. > >=20 >=20 > Yes, I agree asynchronous mode is better and I will change it. > From the perspective of saving resources(CPU and network bandwidth),= waiting is not so bad. :) >=20 > Liang Sure, all I am saying is don't tie the logic to waiting enough. > >=20 > > > > > > -- > > > 1.9.1 From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49933) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aijSI-0006Ec-Bd for qemu-devel@nongnu.org; Wed, 23 Mar 2016 10:08:15 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aijSE-0003yV-Vs for qemu-devel@nongnu.org; Wed, 23 Mar 2016 10:08:14 -0400 Received: from mx1.redhat.com ([209.132.183.28]:24234) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aijSE-0003yN-LZ for qemu-devel@nongnu.org; Wed, 23 Mar 2016 10:08:10 -0400 Date: Wed, 23 Mar 2016 16:08:04 +0200 From: "Michael S. 
Tsirkin" Message-ID: <20160323155325-mutt-send-email-mst@redhat.com> References: <1458632629-4649-1-git-send-email-liang.z.li@intel.com> <20160322101116.GA9532@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [RFC Design Doc]Speed up live migration by skipping free pages List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: "Li, Liang Z" Cc: "rkagan@virtuozzo.com" , "linux-kernel@vger.kenel.org" , "ehabkost@redhat.com" , "kvm@vger.kernel.org" , "quintela@redhat.com" , "simhan@hpe.com" , "qemu-devel@nongnu.org" , "dgilbert@redhat.com" , "jitendra.kolhe@hpe.com" , "mohan_parthasarathy@hpe.com" , "amit.shah@redhat.com" , "pbonzini@redhat.com" , "rth@twiddle.net" On Wed, Mar 23, 2016 at 06:05:27AM +0000, Li, Liang Z wrote: > > > To make things easier, I wrote this doc about the possible designs = and > > > my choices. Comments are welcome! > >=20 > > Thanks for putting this together, and especially for taking the troub= le to > > benchmark existing code paths! > >=20 > > I think these numbers do show that there are gains to be had from mer= ging > > your code with the existing balloon device. It will probably be a bit= more work, > > but I think it'll be worth it. > >=20 > > More comments below. > >=20 >=20 > Thanks for your comments! >=20 > > > 2. Why not use virtio-balloon > > > Actually, the virtio-balloon can do the similar thing by inflating = the > > > balloon before live migration, but its performance is no good, for = an > > > 8GB idle guest just boots, it takes about 5.7 Sec to inflate the > > > balloon to 7GB, but it only takes 25ms to get a valid free page bit= map > > > from the guest. There are some of reasons for the bad performance = of > > > vitio-balloon: > > > a. allocating pages (5%, 304ms) > >=20 > > Interesting. This is definitely worth improving in guest kernel. > > Also, will it be faster if we allocate and pass to guest huge pages i= nstead? > > Might speed up madvise as well. >=20 > Maybe. >=20 > > > b. sending PFNs to host (71%, 4194ms) > >=20 > > OK, so we probably should teach balloon to pass huge lists in bitmaps= . > > Will be benefitial for regular balloon operation, as well. > >=20 >=20 > Agree. Current balloon just send 256 PFNs a time, that's too few and le= ad to too many times=20 > of virtio transmission, that's the main reason for the bad performance. > Change the VIRTIO_BALLOON_ARRAY_PFNS_MAX to a large value can improve = the > performance significant. Maybe we should increase it before doing the f= urther optimization, > do you think so ? We could push it up a bit higher: 256 is 1kbyte in size, so we can make it 3x bigger and still fit struct virtio_balloon is a single page. But if we are going to add the bitmap variant anyway, we probably shouldn't bother. > > > c. address translation and madvise() operation (24%, 1423ms) > >=20 > > How is this split between translation and madvise? I suspect it's mo= stly > > madvise since you need translation when using bitmap as well. > > Correct? Could you measure this please? Also, what if we use the new > > MADV_FREE instead? By how much would this help? > >=20 > For the current balloon, address translation is needed.=20 > But for live migration, there is no need to do address translation. Well you need ram address in order to clear the dirty bit. How would you get it without translation? >=20 > I did a another try and got the following data: > a. 
allocating pages (6.4%, 402ms) > b. sending PFNs to host (68.3%, 4263ms) > c. address translation (6.2%, 389ms) > d. madvise (19.0%, 1188ms) >=20 > The address translation is a time consuming operation too. > I will try MADV_FREE later. Thanks! > > Finally, we could teach balloon to skip madvise completely. > > By how much would this help? > >=20 > > > Debugging shows the time spends on these operations are listed in t= he > > > brackets above. By changing the VIRTIO_BALLOON_ARRAY_PFNS_MAX to > > a > > > large value, such as 16384, the time spends on sending the PFNs can= be > > > reduced to about 400ms, but it=E2=80=99s still too long. > > > Obviously, the virtio-balloon mechanism has a bigger performance > > > impact to the guest than the way we are trying to implement. > >=20 > > Since as we see some of the new interfaces might be benefitial to bal= loon as > > well, I am rather of the opinion that extending the balloon (basicall= y 3a) > > might be the right thing to do. > >=20 > > > 3. Virtio interface > > > There are three different ways of using the virtio interface to sen= d > > > the free page information. > > > a. Extend the current virtio device > > > The virtio spec has already defined some virtio devices, and we can > > > extend one of these devices so as to use it to transport the free p= age > > > information. It requires modifying the virtio spec. > >=20 > > You don't have to do it all by yourself by the way. > > Submit the proposal to the oasis virtio tc mailing list, we will take= it from there. > >=20 > That's great. >=20 > >> 4. Construct free page bitmap > >> To minimize the space for saving free page information, it=E2=80=99s= better to=20 > >> use a bitmap to describe the free pages. There are two ways to=20 > >> construct the free page bitmap. > >>=20 > >> a. Construct free page bitmap when demand (My choice) Guest can=20 > >> allocate memory for the free page bitmap only when it receives the=20 > >> request from QEMU, and set the free page bitmap by traversing the fr= ee=20 > >> page list. The advantage of this way is that it=E2=80=99s quite simp= le and=20 > >> easy to implement. The disadvantage is that the traversing operation= =20 > >> may consume quite a long time when there are a lot of free pages.=20 > >> (About 20ms for 7GB free pages) > >>=20 > >> b. Update free page bitmap when allocating/freeing pages Another=20 > >> choice is to allocate the memory for the free page bitmap when guest= =20 > >>boots, and then update the free page bitmap when allocating/freeing=20 > >> pages. It needs more modification to the code related to memory=20 > >>management in guest. The advantage of this way is that guest can=20 > >> response QEMU=E2=80=99s request for a free page bitmap very quickly,= no matter=20 > >> how many free pages in the guest. Do the kernel guys like this? > >> >=20 > > > 8. Pseudo code > > > Dirty page logging should be enabled before getting the free page > > > information from guest, this is important because during the proces= s > > > of getting free pages, some free pages may be used and written by t= he > > > guest, dirty page logging can trace these pages. The pseudo code=C2= =A0is > > > like below: > > > > > > ----------------------------------------------- > > > MigrationState *s =3D migrate_get_current(); > > > ... 
> > > > > > memory_global_dirty_log_start(); > > > > > > if (get_guest_mem_info(&info)) { > > > while (!get_free_page_bmap(free_page_bitmap, drop_page_cac= he) > > && > > > s->state !=3D MIGRATION_STATUS_CANCELLING) { > > > usleep(1000) // sleep for 1 ms > > > } > > > > > > tighten_free_page_bmap =3D > > tighten_guest_free_pages(free_page_bitmap); > > > filter_out_guest_free_pages(tighten_free_page_bmap); > > > } > > > > > > migration_bitmap_sync(); > > > ... > > > > > > ----------------------------------------------- > >=20 > >=20 > > I don't completely agree with this part. In my opinion, it should be > > asynchronous, depending on getting page lists from guest: > >=20 > > anywhere/periodically: > > ... > > request_guest_mem_info > > ... > >=20 >=20 > Periodically? That means filtering out guest free pages not only > in the ram bulk stage, but during the whole process of live migration. = right? =20 > If so, it's better to use 4b to construct the free page bitmap. That's up to guest. I would say focus on 4a first, once it works, experiment with 4b and see what the speedup is. > > later: > >=20 > >=20 > > handle_guest_mem_info() > > { > > address_space_sync_dirty_bitmap > > filter_out_guest_free_pages > > } > >=20 > > as long as we filter with VCPU stopped like this, we can drop the syn= c dirty > > stage, or alternatively we could move filter_out_guest_free_pages int= o bh > > so it happens later while VCPU is running. > >=20 > > This removes any need for waiting. > >=20 > >=20 > > Introducing delay into migration might still be benefitial but this w= ay it is > > optional, we still get part of the benefit even if we don't wait long= enough. > >=20 >=20 > Yes, I agree asynchronous mode is better and I will change it. > From the perspective of saving resources(CPU and network bandwidth), w= aiting is not so bad. :) >=20 > Liang Sure, all I am saying is don't tie the logic to waiting enough. > >=20 > > > > > > -- > > > 1.9.1