From: Liang Li
Subject: [RFC Design Doc] Speed up live migration by skipping free pages
Date: Tue, 22 Mar 2016 15:43:49 +0800
Message-ID: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
To: qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: pbonzini@redhat.com, rth@twiddle.net, ehabkost@redhat.com, mst@redhat.com,
    amit.shah@redhat.com, quintela@redhat.com, dgilbert@redhat.com,
    mohan_parthasarathy@hpe.com, jitendra.kolhe@hpe.com, simhan@hpe.com,
    rkagan@virtuozzo.com, riel@redhat.com, Liang Li

I have sent an RFC patch set that optimizes live migration by skipping the
free pages in the ram bulk stage, and it received a lot of comments. The
related threads can be found at:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html

To make things easier, I wrote this doc about the possible designs and my
choices. Comments are welcome!

Content
=======
1. Background
2. Why not use virtio-balloon
3. Virtio interface
4. Constructing the free page bitmap
5. Tightening the free page bitmap
6. Handling page cache in the guest
7. APIs for live migration
8. Pseudo code

Details
=======
1. Background
In the ram bulk stage of live migration, the current QEMU implementation
marks all of the guest's RAM pages as dirty. Every page is then checked for
being a zero page, and its content is sent to the destination depending on
the result. This process consumes quite a lot of CPU cycles and network
bandwidth.

From the guest's point of view, some pages are currently unused and the
guest does not care about their content; free pages are exactly this kind
of page. We can make use of this fact and skip processing the free pages in
the ram bulk stage, which saves a lot of CPU cycles, reduces the network
traffic and clearly speeds up the live migration process.

Usually only the guest has the information about its free pages, but it's
possible to let the guest pass this information to QEMU through some
mechanism, e.g. the virtio interface. Once QEMU has the free page
information, it can skip these pages in the ram bulk stage by clearing the
corresponding bits in the migration bitmap.

2. Why not use virtio-balloon
Virtio-balloon can achieve something similar by inflating the balloon
before live migration, but its performance is poor: for an 8GB idle guest
that has just booted, it takes about 5.7 seconds to inflate the balloon to
7GB, while it takes only about 25ms to get a valid free page bitmap from
the guest.

There are several reasons for the bad performance of virtio-balloon:
a. allocating pages (5%, 304ms)
b. sending PFNs to the host (71%, 4194ms)
c. address translation and madvise() operations (24%, 1423ms)
The time spent on each operation, measured by debugging, is listed in the
brackets above. By raising VIRTIO_BALLOON_ARRAY_PFNS_MAX to a larger value,
such as 16384, the time spent on sending the PFNs can be reduced to about
400ms, but that is still too long. Obviously, the virtio-balloon mechanism
has a bigger performance impact on the guest than the approach we are
trying to implement.

3. Virtio interface
There are three different ways of using the virtio interface to send the
free page information.

a. Extend an existing virtio device
The virtio spec already defines a number of virtio devices, and we could
extend one of them to transport the free page information. This requires
modifying the virtio spec.

b. Implement a new virtio device
Implementing a brand new virtio device to exchange information between host
and guest is another choice. It also requires modifying the virtio spec.

c. Make use of virtio-serial (Amit's suggestion, my choice)
It's possible to use virtio-serial for the communication between host and
guest; the benefit of this solution is that no virtio spec change is
needed.

4. Constructing the free page bitmap
To minimize the space needed to store the free page information, it's
better to use a bitmap to describe the free pages. There are two ways to
construct the free page bitmap.

a. Construct the free page bitmap on demand (my choice)
The guest allocates memory for the free page bitmap only when it receives
the request from QEMU, and fills the bitmap by traversing the free page
lists (a rough sketch follows at the end of this section). The advantage of
this approach is that it's quite simple and easy to implement. The
disadvantage is that the traversal may take quite a long time when there
are a lot of free pages (about 20ms for 7GB of free pages).

b. Update the free page bitmap when allocating/freeing pages
Another choice is to allocate the memory for the free page bitmap when the
guest boots and then update the bitmap whenever pages are allocated or
freed. This needs more modification to the guest's memory management code.
The advantage is that the guest can respond to QEMU's request for a free
page bitmap very quickly, no matter how many free pages the guest has. Do
the kernel guys like this?
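For reference, below is a rough sketch of what option (a) could look like
inside the guest kernel. It is loosely modeled on the existing
mark_free_pages() logic in mm/page_alloc.c: walk the buddy allocator's free
lists under the zone lock and set one bit per free PFN. The function name
fill_free_page_bitmap and its arguments are illustrative only, not an
existing kernel API; a real implementation would also have to bound the
bitmap size and accept that pages allocated right after the walk are stale
entries (which is exactly what the dirty page logging in section 8 covers).

-----------------------------------------------
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/bitops.h>
#include <linux/spinlock.h>

/* Sketch only: mark every currently free guest PFN in 'bitmap'. */
static void fill_free_page_bitmap(unsigned long *bitmap, unsigned long max_pfn)
{
    struct zone *zone;
    unsigned long flags, pfn, i;
    unsigned int order, t;
    struct page *page;

    for_each_populated_zone(zone) {
        spin_lock_irqsave(&zone->lock, flags);
        for_each_migratetype_order(order, t) {
            list_for_each_entry(page,
                                &zone->free_area[order].free_list[t], lru) {
                pfn = page_to_pfn(page);
                /* A free block of this order spans 2^order pages. */
                for (i = 0; i < (1UL << order); i++) {
                    if (pfn + i < max_pfn)
                        set_bit(pfn + i, bitmap);
                }
            }
        }
        spin_unlock_irqrestore(&zone->lock, flags);
    }
}
-----------------------------------------------

Holding zone->lock for the whole walk is what makes this take tens of
milliseconds on a mostly idle multi-GB guest, which matches the ~20ms
figure quoted above.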
5. Tightening the free page bitmap
In the end, the free page bitmap has to be combined with
ramlist.dirty_memory to filter out the free pages, so we must make sure
that bit N in the free page bitmap and bit N in ramlist.dirty_memory refer
to the same guest page.

On some archs, like x86, there are 'holes' in the physical address space,
which means that some PFNs have no actual physical RAM page behind them. So
some arch specific information is needed to construct a proper free page
bitmap.

migration dirty page bitmap:
---------------------
|a|b|c|d|e|f|g|h|i|j|
---------------------
loose free page bitmap:
-----------------------------
|a|b|c|d|e|f| | | | |g|h|i|j|
-----------------------------
tight free page bitmap:
---------------------
|a|b|c|d|e|f|g|h|i|j|
---------------------

There are two places where the free page bitmap can be tightened:

a. In the guest
Constructing the tight free page bitmap in the guest requires adding arch
related code to the guest. The advantage of this way is that less memory is
needed to store the free page bitmap.

b. In QEMU (my choice)
Constructing the free page bitmap in QEMU is more flexible: we can get a
loose free page bitmap which still contains the holes and then filter the
holes out in QEMU. The advantage is that the kernel code stays as simple as
possible; the disadvantage is that more memory is needed to store the loose
free page bitmap. Because this is mainly a QEMU feature, doing all the
related work in QEMU is preferable where possible.
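To make the QEMU-side choice (b) concrete, here is a minimal, self-contained
sketch of the tightening step. GuestRamRange and the helper names are
hypothetical; in real QEMU code the ranges would be derived from the guest's
RAM layout (e.g. the hole below 4GB on x86). The loose bitmap is indexed by
raw guest PFN, holes included; the tight bitmap uses the same indexing as
ramlist.dirty_memory.

-----------------------------------------------
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    uint64_t start_pfn;   /* first guest PFN of this RAM range */
    uint64_t nr_pages;    /* number of pages in the range */
} GuestRamRange;

static inline bool test_bit64(const uint64_t *map, uint64_t nr)
{
    return (map[nr / 64] >> (nr % 64)) & 1;
}

static inline void set_bit64(uint64_t *map, uint64_t nr)
{
    map[nr / 64] |= 1ULL << (nr % 64);
}

/* Copy the loose bitmap into 'tight', skipping the PFN holes. */
static void tighten_free_page_bmap(const uint64_t *loose, uint64_t *tight,
                                   const GuestRamRange *ranges,
                                   size_t nr_ranges)
{
    uint64_t out = 0;   /* running bit index into the tight bitmap */
    size_t i;
    uint64_t j;

    for (i = 0; i < nr_ranges; i++) {
        for (j = 0; j < ranges[i].nr_pages; j++, out++) {
            if (test_bit64(loose, ranges[i].start_pfn + j)) {
                set_bit64(tight, out);
            }
        }
    }
}
-----------------------------------------------

With this arrangement the per-range start_pfn values are the only arch
specific input, so the guest-side code needs no knowledge of the physical
memory layout.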
6. Handling page cache in the guest
The amount of memory used for the page cache in the guest changes with the
workload: if the guest runs a block-IO-intensive workload, lots of pages
are used for the page cache and only a few free pages are left. In order to
get more free pages, we can optionally ask the guest to drop some of its
page cache. Because dropping the page cache may lead to performance
degradation, only the clean cache should be dropped, and the user should
decide whether to do this.

7. APIs for live migration
To make things work, the following APIs should be implemented.

a. Get memory info of the guest, like this:

bool get_guest_mem_info(struct guest_mem_info *info);

struct guest_mem_info is defined as below:

struct guest_mem_info {
    uint64_t free_pages_num;    // guest's free pages count
    uint64_t cached_pages_num;  // total cached pages count
    uint64_t max_pfn;           // the max PFN of the guest
};

Return value:
false, when QEMU or the guest can't support this operation.
true, on success.

b. Request the guest's current free page information.

int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);

Return value:
-1, when QEMU or the guest can't support this operation.
 0, when the free page bitmap is still being constructed.
 1, when a valid free page bitmap is ready.

c. Tighten the free page bitmap

unsigned long *tighten_free_page_bmap(unsigned long *bitmap);

This is an arch specific function that rebuilds the loose free page bitmap
so as to get a tight bitmap which can easily be combined with
ramlist.dirty_memory.

8. Pseudo code
Dirty page logging should be enabled before requesting the free page
information from the guest. This is important because, while the free pages
are being collected, some of them may already be allocated and written by
the guest; dirty page logging lets us catch those pages again. The pseudo
code is like below:

-----------------------------------------------
MigrationState *s = migrate_get_current();
...

memory_global_dirty_log_start();

if (get_guest_mem_info(&info)) {
    while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
           s->state != MIGRATION_STATUS_CANCELLING) {
        usleep(1000); // sleep for 1 ms
    }

    tight_free_page_bmap = tighten_free_page_bmap(free_page_bitmap);
    filter_out_guest_free_pages(tight_free_page_bmap);
}

migration_bitmap_sync();
...
-----------------------------------------------
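For completeness, a hypothetical sketch of what filter_out_guest_free_pages()
boils down to: a word-wise and-not of the tight free page bitmap into the
migration dirty bitmap, so the ram bulk stage never touches pages the guest
reported as free. The parameter names and signature are illustrative; QEMU's
real dirty bitmap bookkeeping is more involved.

-----------------------------------------------
#include <stdint.h>
#include <stddef.h>

/* Sketch only: clear the dirty bit of every page marked free. */
static void filter_out_guest_free_pages(uint64_t *migration_bitmap,
                                        const uint64_t *tight_free_bitmap,
                                        size_t nr_words)
{
    size_t i;

    for (i = 0; i < nr_words; i++) {
        /* dirty &= ~free: only pages that are not free remain to be sent. */
        migration_bitmap[i] &= ~tight_free_bitmap[i];
    }
}
-----------------------------------------------

Any free page that the guest writes after reporting it is re-marked by the
dirty page logging enabled above, so skipping it in the bulk stage is safe.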
--
1.9.1