From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Dr. David Alan Gilbert"
Subject: Re: [RFC Design Doc]Speed up live migration by skipping free pages
Date: Tue, 22 Mar 2016 19:05:31 +0000
Message-ID: <20160322190530.GI2216@work-vm>
References: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, rth@twiddle.net, ehabkost@redhat.com, mst@redhat.com,
	amit.shah@redhat.com, quintela@redhat.com, mohan_parthasarathy@hpe.com,
	jitendra.kolhe@hpe.com, simhan@hpe.com, rkagan@virtuozzo.com, riel@redhat.com
To: Liang Li
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:49823 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750988AbcCVTFi
	(ORCPT ); Tue, 22 Mar 2016 15:05:38 -0400
Content-Disposition: inline
In-Reply-To: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

* Liang Li (liang.z.li@intel.com) wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:

Thanks!

> Obviously, the virtio-balloon mechanism has a bigger performance
> impact to the guest than the way we are trying to implement.

Yeh, we should separately try and fix that; if it's that slow then
people will be annoyed about it when they're just using it for balloon.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.
>
> b. 
Implement a new virtio device
> Implementing a brand new virtio device to exchange information
> between host and guest is another choice. It requires modifying the
> virtio spec too.

If the right solution is to change the spec then we should do it;
we shouldn't use a technically worse solution just to avoid the
spec change; although we have to be even more careful to get the
right solution if we want to change the spec.

> c. Make use of virtio-serial (Amit's suggestion, my choice)
> It's possible to make use of virtio-serial for communication between
> host and guest; the benefit of this solution is that there is no need
> to modify the virtio spec.
>
> 4. Construct free page bitmap
> To minimize the space for saving free page information, it's better to
> use a bitmap to describe the free pages. There are two ways to
> construct the free page bitmap.
>
> a. Construct free page bitmap on demand (My choice)
> Guest can allocate memory for the free page bitmap only when it
> receives the request from QEMU, and set the free page bitmap by
> traversing the free page list. The advantage of this way is that it's
> quite simple and easy to implement. The disadvantage is that the
> traversing operation may consume quite a long time when there are a
> lot of free pages. (About 20ms for 7GB of free pages)

I wonder how that scales; 20ms isn't too bad - but I'm more worried about
what happens when someone does it to the 1TB database VM.

> b. Update free page bitmap when allocating/freeing pages
> Another choice is to allocate the memory for the free page bitmap
> when the guest boots, and then update the free page bitmap when
> allocating/freeing pages. It needs more modification to the code
> related to memory management in the guest. The advantage of this way
> is that the guest can respond to QEMU's request for a free page bitmap
> very quickly, no matter how many free pages there are in the guest.
> Do the kernel guys like this?
>
> 5. 
Tighten the free page bitmap
> At last, the free page bitmap should be operated with the
> ramlist.dirty_memory to filter out the free pages. We should make sure
> that bit N in the free page bitmap and bit N in the
> ramlist.dirty_memory correspond to the same guest page.
> On some arches, like x86, there are 'holes' in the memory's physical
> address space, which means there are no actual physical RAM pages
> corresponding to some PFNs. So, some arch specific information is
> needed to construct a proper free page bitmap.
>
> migration dirty page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
> loose free page bitmap:
> -----------------------------
> |a|b|c|d|e|f| | | | |g|h|i|j|
> -----------------------------
> tight free page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
>
> There are two places for tightening the free page bitmap:
> a. In guest
> Constructing the free page bitmap in the guest requires adding the
> arch related code in the guest for building a tight bitmap. The
> advantage of this way is that less memory is needed to store the
> free page bitmap.
> b. In QEMU (My choice)
> Constructing the free page bitmap in QEMU is more flexible; we can get
> a loose free page bitmap which contains the holes, and then filter out
> the holes in QEMU. The advantage of this way is that we can keep the
> kernel code as simple as we can; the disadvantage is that more memory
> is needed to save the loose free page bitmap. Because this is mainly a
> QEMU feature, if possible, doing all the related things in QEMU is
> better.

Yes, maybe; although we'd have to be careful to validate that what
the guest fills in makes sense.

> 6. 
Handling page cache in the guest
> The memory used for page cache in the guest will change depending on
> the workload; if the guest runs some block-IO intensive workload,
> there will be lots of pages used for page cache and only a few free
> pages left in the guest. In order to get more free pages, we can
> choose to ask the guest to drop some page caches. Because dropping the
> page cache may lead to performance degradation, only the clean cache
> should be dropped and we should let the user decide whether to do
> this.
>
> 7. APIs for live migration
> To make things work, the following APIs should be implemented.
>
> a. Get memory info of the guest, like this:
> bool get_guest_mem_info(struct guest_mem_info *info)
>
> struct guest_mem_info is defined as below:
>
> struct guest_mem_info {
>	uint64_t free_pages_num;	// guest's free pages count
>	uint64_t cached_pages_num;	// total cached pages count
>	uint64_t max_pfn;		// the max pfn of the guest
> };

What do you need max_pfn for?
(We'll also have to think how hotplugged memory works with this).
Also be careful of how big a page is; some architectures can
choose between different guest page sizes (4, 16, 64k I think on ARM),
so we just need to make sure what unit we're dealing with.
That size is also not necessarily the same as the unit size of the
migration bitmap; this is always a bit tricky.

> Return value:
> false, when QEMU or guest can't support this operation.
> true, when success.
>
> b. Request guest's current free pages information.
> int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
>
> Return value:
> -1, when QEMU or guest can't support this operation.
> 0, when the free page bitmap is still in the progress of constructing.
> 1, when a valid free page bitmap is ready.
I suggest not using 'long' - I know we do it a lot in QEMU but it's a
pain; let's nail this down to a uint64_t and then we don't have to
worry about what the guest is running.

> c. Tighten the free page bitmap
> unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
>
> This function is an arch specific function to rebuild the loose free
> page bitmap so as to get a tight bitmap which can be operated easily
> with ramlist.dirty_memory.

I'm not sure you actually need this; as long as what you expect is
just a (small) series of chunks of bitmap, then you'd just have
something like:
   (start at 0...) (start at 1MB...) (start at 1GB...)

> 8. Pseudo code
> Dirty page logging should be enabled before getting the free page
> information from the guest; this is important because during the
> process of getting free pages, some free pages may be used and written
> by the guest, and dirty page logging can trace these pages. The pseudo
> code is like below:
>
> -----------------------------------------------
> MigrationState *s = migrate_get_current();
> ...
>
> memory_global_dirty_log_start();
>
> if (get_guest_mem_info(&info)) {
>	while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>	       s->state != MIGRATION_STATUS_CANCELLING) {
>		usleep(1000); // sleep for 1 ms
>	}
>
>	tighten_free_page_bmap = tighten_guest_free_pages(free_page_bitmap);
>	filter_out_guest_free_pages(tighten_free_page_bmap);
> }

Given the typical speed of networks, it wouldn't do too much harm
to start sending assuming all pages are dirty and then, when the guest
finally gets around to finishing the bitmap, update - so it's
asynchronous - and then if the guest never responds we don't really
care.

Dave

>
> migration_bitmap_sync();
> ...
>
> -----------------------------------------------
>
>
> --
> 1.9.1
>

--
Dr. 
David Alan Gilbert / dgilbert@redhat.com / Manchester, UK