From: Liang Li
Subject: [RFC Design Doc] Speed up live migration by skipping free pages
Date: Tue, 22 Mar 2016 15:43:49 +0800
Message-ID: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
To: qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: pbonzini@redhat.com, rth@twiddle.net, ehabkost@redhat.com, mst@redhat.com,
    amit.shah@redhat.com, quintela@redhat.com, dgilbert@redhat.com,
    mohan_parthasarathy@hpe.com, jitendra.kolhe@hpe.com, simhan@hpe.com,
    rkagan@virtuozzo.com, riel@redhat.com, Liang Li

I have sent an RFC patch set that optimizes live migration by skipping the
free pages in the ram bulk stage, and it received a lot of comments. The
related threads can be found at:

https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00715.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00714.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00717.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00716.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00718.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00719.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00720.html
https://lists.gnu.org/archive/html/qemu-devel/2016-03/msg00721.html

To make things easier, I wrote this doc about the possible designs and my
choices. Comments are welcome!

Content
=======
1. Background
2. Why not use virtio-balloon
3. Virtio interface
4. Constructing the free page bitmap
5. Tightening the free page bitmap
6. Handling page cache in the guest
7. APIs for live migration
8. Pseudo code

Details
=======
1. Background
In the ram bulk stage of live migration, the current QEMU implementation
marks all of the guest's RAM pages as dirty. Every page is then checked for
being a zero page, and its content is sent to the destination depending on
the result. This process consumes quite a lot of CPU cycles and network
bandwidth.

From the guest's point of view, some pages are currently unused and the
guest does not care about their content; free pages are exactly this kind
of page. We can make use of this fact and skip processing the free pages in
the ram bulk stage, which saves a lot of CPU cycles, reduces the network
traffic and clearly speeds up the live migration process.

Usually only the guest has the information about its free pages, but it's
possible to let the guest pass this information to QEMU through some
mechanism, e.g. the virtio interface. Once QEMU has the free page
information, it can skip these pages in the ram bulk stage by clearing the
corresponding bits in the migration bitmap.

2. Why not use virtio-balloon
Virtio-balloon can achieve something similar by inflating the balloon
before live migration, but its performance is poor: for an 8GB idle guest
that has just booted, it takes about 5.7 seconds to inflate the balloon to
7GB, while it takes only about 25ms to get a valid free page bitmap from
the guest.

There are several reasons for the bad performance of virtio-balloon:
a. allocating pages (5%, 304ms)
b. sending PFNs to the host (71%, 4194ms)
c. address translation and madvise() operations (24%, 1423ms)
The time spent on each operation, measured by debugging, is listed in the
brackets above. By raising VIRTIO_BALLOON_ARRAY_PFNS_MAX to a larger value,
such as 16384, the time spent on sending the PFNs can be reduced to about
400ms, but that is still too long. Obviously, the virtio-balloon mechanism
has a bigger performance impact on the guest than the approach we are
trying to implement.

3. Virtio interface
There are three different ways of using the virtio interface to send the
free page information.

a. Extend an existing virtio device
The virtio spec already defines a number of virtio devices, and we could
extend one of them to transport the free page information. This requires
modifying the virtio spec.

b. Implement a new virtio device
Implementing a brand new virtio device to exchange information between host
and guest is another choice. It also requires modifying the virtio spec.

c. Make use of virtio-serial (Amit's suggestion, my choice)
It's possible to use virtio-serial for the communication between host and
guest; the benefit of this solution is that no virtio spec change is
needed.

4. Constructing the free page bitmap
To minimize the space needed to store the free page information, it's
better to use a bitmap to describe the free pages. There are two ways to
construct the free page bitmap.

a. Construct the free page bitmap on demand (my choice)
The guest allocates memory for the free page bitmap only when it receives
the request from QEMU, and fills the bitmap by traversing the free page
lists (a rough sketch follows at the end of this section). The advantage of
this approach is that it's quite simple and easy to implement. The
disadvantage is that the traversal may take quite a long time when there
are a lot of free pages (about 20ms for 7GB of free pages).

b. Update the free page bitmap when allocating/freeing pages
Another choice is to allocate the memory for the free page bitmap when the
guest boots and then update the bitmap whenever pages are allocated or
freed. This needs more modification to the guest's memory management code.
The advantage is that the guest can respond to QEMU's request for a free
page bitmap very quickly, no matter how many free pages the guest has. Do
the kernel guys like this?
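For reference, below is a rough sketch of what option (a) could look like
inside the guest kernel. It is loosely modeled on the existing
mark_free_pages() logic in mm/page_alloc.c: walk the buddy allocator's free
lists under the zone lock and set one bit per free PFN. The function name
fill_free_page_bitmap and its arguments are illustrative only, not an
existing kernel API; a real implementation would also have to bound the
bitmap size and accept that pages allocated right after the walk are stale
entries (which is exactly what the dirty page logging in section 8 covers).

-----------------------------------------------
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/bitops.h>
#include <linux/spinlock.h>

/* Sketch only: mark every currently free guest PFN in 'bitmap'. */
static void fill_free_page_bitmap(unsigned long *bitmap, unsigned long max_pfn)
{
    struct zone *zone;
    unsigned long flags, pfn, i;
    unsigned int order, t;
    struct page *page;

    for_each_populated_zone(zone) {
        spin_lock_irqsave(&zone->lock, flags);
        for_each_migratetype_order(order, t) {
            list_for_each_entry(page,
                                &zone->free_area[order].free_list[t], lru) {
                pfn = page_to_pfn(page);
                /* A free block of this order spans 2^order pages. */
                for (i = 0; i < (1UL << order); i++) {
                    if (pfn + i < max_pfn)
                        set_bit(pfn + i, bitmap);
                }
            }
        }
        spin_unlock_irqrestore(&zone->lock, flags);
    }
}
-----------------------------------------------

Holding zone->lock for the whole walk is what makes this take tens of
milliseconds on a mostly idle multi-GB guest, which matches the ~20ms
figure quoted above.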
5. Tightening the free page bitmap
In the end, the free page bitmap has to be combined with
ramlist.dirty_memory to filter out the free pages, so we must make sure
that bit N in the free page bitmap and bit N in ramlist.dirty_memory refer
to the same guest page.

On some archs, like x86, there are 'holes' in the physical address space,
which means that some PFNs have no actual physical RAM page behind them. So
some arch specific information is needed to construct a proper free page
bitmap.

migration dirty page bitmap:
---------------------
|a|b|c|d|e|f|g|h|i|j|
---------------------
loose free page bitmap:
-----------------------------
|a|b|c|d|e|f| | | | |g|h|i|j|
-----------------------------
tight free page bitmap:
---------------------
|a|b|c|d|e|f|g|h|i|j|
---------------------

There are two places where the free page bitmap can be tightened:

a. In the guest
Constructing the tight free page bitmap in the guest requires adding arch
related code to the guest. The advantage of this way is that less memory is
needed to store the free page bitmap.

b. In QEMU (my choice)
Constructing the free page bitmap in QEMU is more flexible: we can get a
loose free page bitmap which still contains the holes and then filter the
holes out in QEMU. The advantage is that the kernel code stays as simple as
possible; the disadvantage is that more memory is needed to store the loose
free page bitmap. Because this is mainly a QEMU feature, doing all the
related work in QEMU is preferable where possible.
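To make the QEMU-side choice (b) concrete, here is a minimal, self-contained
sketch of the tightening step. GuestRamRange and the helper names are
hypothetical; in real QEMU code the ranges would be derived from the guest's
RAM layout (e.g. the hole below 4GB on x86). The loose bitmap is indexed by
raw guest PFN, holes included; the tight bitmap uses the same indexing as
ramlist.dirty_memory.

-----------------------------------------------
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    uint64_t start_pfn;   /* first guest PFN of this RAM range */
    uint64_t nr_pages;    /* number of pages in the range */
} GuestRamRange;

static inline bool test_bit64(const uint64_t *map, uint64_t nr)
{
    return (map[nr / 64] >> (nr % 64)) & 1;
}

static inline void set_bit64(uint64_t *map, uint64_t nr)
{
    map[nr / 64] |= 1ULL << (nr % 64);
}

/* Copy the loose bitmap into 'tight', skipping the PFN holes. */
static void tighten_free_page_bmap(const uint64_t *loose, uint64_t *tight,
                                   const GuestRamRange *ranges,
                                   size_t nr_ranges)
{
    uint64_t out = 0;   /* running bit index into the tight bitmap */
    size_t i;
    uint64_t j;

    for (i = 0; i < nr_ranges; i++) {
        for (j = 0; j < ranges[i].nr_pages; j++, out++) {
            if (test_bit64(loose, ranges[i].start_pfn + j)) {
                set_bit64(tight, out);
            }
        }
    }
}
-----------------------------------------------

With this arrangement the per-range start_pfn values are the only arch
specific input, so the guest-side code needs no knowledge of the physical
memory layout.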
6. Handling page cache in the guest
The amount of memory used for the page cache in the guest changes with the
workload: if the guest runs a block-IO-intensive workload, lots of pages
are used for the page cache and only a few free pages are left. In order to
get more free pages, we can optionally ask the guest to drop some of its
page cache. Because dropping the page cache may lead to performance
degradation, only the clean cache should be dropped, and the user should
decide whether to do this.

7. APIs for live migration
To make things work, the following APIs should be implemented.

a. Get memory info of the guest, like this:

bool get_guest_mem_info(struct guest_mem_info *info);

struct guest_mem_info is defined as below:

struct guest_mem_info {
    uint64_t free_pages_num;    // guest's free pages count
    uint64_t cached_pages_num;  // total cached pages count
    uint64_t max_pfn;           // the max PFN of the guest
};

Return value:
false, when QEMU or the guest can't support this operation.
true, on success.

b. Request the guest's current free page information.

int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);

Return value:
-1, when QEMU or the guest can't support this operation.
 0, when the free page bitmap is still being constructed.
 1, when a valid free page bitmap is ready.

c. Tighten the free page bitmap

unsigned long *tighten_free_page_bmap(unsigned long *bitmap);

This is an arch specific function that rebuilds the loose free page bitmap
so as to get a tight bitmap which can easily be combined with
ramlist.dirty_memory.

8. Pseudo code
Dirty page logging should be enabled before requesting the free page
information from the guest. This is important because, while the free pages
are being collected, some of them may already be allocated and written by
the guest; dirty page logging lets us catch those pages again. The pseudo
code is like below:

-----------------------------------------------
MigrationState *s = migrate_get_current();
...

memory_global_dirty_log_start();

if (get_guest_mem_info(&info)) {
    while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
           s->state != MIGRATION_STATUS_CANCELLING) {
        usleep(1000); // sleep for 1 ms
    }

    tight_free_page_bmap = tighten_free_page_bmap(free_page_bitmap);
    filter_out_guest_free_pages(tight_free_page_bmap);
}

migration_bitmap_sync();
...
-----------------------------------------------
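For completeness, a hypothetical sketch of what filter_out_guest_free_pages()
boils down to: a word-wise and-not of the tight free page bitmap into the
migration dirty bitmap, so the ram bulk stage never touches pages the guest
reported as free. The parameter names and signature are illustrative; QEMU's
real dirty bitmap bookkeeping is more involved.

-----------------------------------------------
#include <stdint.h>
#include <stddef.h>

/* Sketch only: clear the dirty bit of every page marked free. */
static void filter_out_guest_free_pages(uint64_t *migration_bitmap,
                                        const uint64_t *tight_free_bitmap,
                                        size_t nr_words)
{
    size_t i;

    for (i = 0; i < nr_words; i++) {
        /* dirty &= ~free: only pages that are not free remain to be sent. */
        migration_bitmap[i] &= ~tight_free_bitmap[i];
    }
}
-----------------------------------------------

Any free page that the guest writes after reporting it is re-marked by the
dirty page logging enabled above, so skipping it in the bulk stage is safe.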
--
1.9.1