From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Dr. David Alan Gilbert"
Subject: Re: [RFC Design Doc]Speed up live migration by skipping free pages
Date: Tue, 22 Mar 2016 19:05:31 +0000
Message-ID: <20160322190530.GI2216@work-vm>
References: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, rth@twiddle.net, ehabkost@redhat.com, mst@redhat.com,
	amit.shah@redhat.com, quintela@redhat.com, mohan_parthasarathy@hpe.com,
	jitendra.kolhe@hpe.com, simhan@hpe.com, rkagan@virtuozzo.com, riel@redhat.com
To: Liang Li
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:49823 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750988AbcCVTFi
	(ORCPT ); Tue, 22 Mar 2016 15:05:38 -0400
Content-Disposition: inline
In-Reply-To: <1458632629-4649-1-git-send-email-liang.z.li@intel.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

* Liang Li (liang.z.li@intel.com) wrote:
> I have sent the RFC version patch set for live migration optimization
> by skipping processing the free pages in the ram bulk stage and
> received a lot of comments. The related threads can be found at:

Thanks!

> Obviously, the virtio-balloon mechanism has a bigger performance
> impact to the guest than the way we are trying to implement.

Yeh, we should separately try and fix that; if it's that slow then
people will be annoyed about it when they're just using it for balloon.

> 3. Virtio interface
> There are three different ways of using the virtio interface to
> send the free page information.
> a. Extend the current virtio device
> The virtio spec has already defined some virtio devices, and we can
> extend one of these devices so as to use it to transport the free page
> information. It requires modifying the virtio spec.
>
> b. 
Implement a new virtio device
> Implementing a brand new virtio device to exchange information
> between host and guest is another choice. It requires modifying the
> virtio spec too.

If the right solution is to change the spec then we should do it;
we shouldn't use a technically worse solution just to avoid the
spec change; although we have to be even more careful to get the
right solution if we want to change the spec.

> c. Make use of virtio-serial (Amit's suggestion, my choice)
> It's possible to make use of virtio-serial for communication between
> host and guest; the benefit of this solution is that there is no need
> to modify the virtio spec.
>
> 4. Construct free page bitmap
> To minimize the space for saving free page information, it's better to
> use a bitmap to describe the free pages. There are two ways to
> construct the free page bitmap.
>
> a. Construct free page bitmap on demand (My choice)
> Guest can allocate memory for the free page bitmap only when it
> receives the request from QEMU, and set the free page bitmap by
> traversing the free page list. The advantage of this way is that it's
> quite simple and easy to implement. The disadvantage is that the
> traversing operation may consume quite a long time when there are a
> lot of free pages. (About 20ms for 7GB of free pages)

I wonder how that scales; 20ms isn't too bad - but I'm more worried about
what happens when someone does it to the 1TB database VM.

> b. Update free page bitmap when allocating/freeing pages
> Another choice is to allocate the memory for the free page bitmap
> when the guest boots, and then update the free page bitmap when
> allocating/freeing pages. It needs more modification to the code
> related to memory management in the guest. The advantage of this way
> is that the guest can respond to QEMU's request for a free page bitmap
> very quickly, no matter how many free pages there are in the guest.
> Do the kernel guys like this?
>
> 5. 
Tighten the free page bitmap
> At last, the free page bitmap should be operated with the
> ramlist.dirty_memory to filter out the free pages. We should make sure
> that bit N in the free page bitmap and bit N in the
> ramlist.dirty_memory correspond to the same guest page.
> On some arches, like x86, there are 'holes' in the memory's physical
> address space, which means there are no actual physical RAM pages
> corresponding to some PFNs. So, some arch specific information is
> needed to construct a proper free page bitmap.
>
> migration dirty page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
> loose free page bitmap:
> -----------------------------
> |a|b|c|d|e|f| | | | |g|h|i|j|
> -----------------------------
> tight free page bitmap:
> ---------------------
> |a|b|c|d|e|f|g|h|i|j|
> ---------------------
>
> There are two places for tightening the free page bitmap:
> a. In guest
> Constructing the free page bitmap in the guest requires adding the
> arch related code in the guest for building a tight bitmap. The
> advantage of this way is that less memory is needed to store the
> free page bitmap.
> b. In QEMU (My choice)
> Constructing the free page bitmap in QEMU is more flexible; we can get
> a loose free page bitmap which contains the holes, and then filter out
> the holes in QEMU. The advantage of this way is that we can keep the
> kernel code as simple as we can; the disadvantage is that more memory
> is needed to save the loose free page bitmap. Because this is mainly a
> QEMU feature, if possible, doing all the related things in QEMU is
> better.

Yes, maybe; although we'd have to be careful to validate that what
the guest fills in makes sense.

> 6. 
Handling page cache in the guest
> The memory used for page cache in the guest will change depending on
> the workload; if the guest runs some block-IO intensive workload,
> there will be lots of pages used for page cache and only a few free
> pages left in the guest. In order to get more free pages, we can
> choose to ask the guest to drop some page caches. Because dropping the
> page cache may lead to performance degradation, only the clean cache
> should be dropped and we should let the user decide whether to do
> this.
>
> 7. APIs for live migration
> To make things work, the following APIs should be implemented.
>
> a. Get memory info of the guest, like this:
> bool get_guest_mem_info(struct guest_mem_info *info)
>
> struct guest_mem_info is defined as below:
>
> struct guest_mem_info {
>	uint64_t free_pages_num;	// guest's free pages count
>	uint64_t cached_pages_num;	// total cached pages count
>	uint64_t max_pfn;		// the max pfn of the guest
> };

What do you need max_pfn for?
(We'll also have to think how hotplugged memory works with this).
Also be careful of how big a page is; some architectures can
choose between different guest page sizes (4, 16, 64k I think on ARM),
so we just need to make sure what unit we're dealing with.
That size is also not necessarily the same as the unit size of the
migration bitmap; this is always a bit tricky.

> Return value:
> false, when QEMU or guest can't support this operation.
> true, when success.
>
> b. Request guest's current free pages information.
> int get_free_page_bmap(unsigned long *bitmap, bool drop_cache);
>
> Return value:
> -1, when QEMU or guest can't support this operation.
> 0, when the free page bitmap is still in the progress of constructing.
> 1, when a valid free page bitmap is ready.
I suggest not using 'long' - I know we do it a lot in QEMU but it's a
pain; let's nail this down to a uint64_t and then we don't have to
worry about what the guest is running.

> c. Tighten the free page bitmap
> unsigned long *tighten_free_page_bmap(unsigned long *bitmap);
>
> This function is an arch specific function to rebuild the loose free
> page bitmap so as to get a tight bitmap which can be operated easily
> with ramlist.dirty_memory.

I'm not sure you actually need this; as long as what you expect is
just a (small) series of chunks of bitmap, then you'd just have
something like:
   (start at 0...) (start at 1MB...) (start at 1GB...)

> 8. Pseudo code
> Dirty page logging should be enabled before getting the free page
> information from the guest; this is important because during the
> process of getting free pages, some free pages may be used and written
> by the guest, and dirty page logging can trace these pages. The pseudo
> code is like below:
>
> -----------------------------------------------
> MigrationState *s = migrate_get_current();
> ...
>
> memory_global_dirty_log_start();
>
> if (get_guest_mem_info(&info)) {
>	while (!get_free_page_bmap(free_page_bitmap, drop_page_cache) &&
>	       s->state != MIGRATION_STATUS_CANCELLING) {
>		usleep(1000); // sleep for 1 ms
>	}
>
>	tighten_free_page_bmap = tighten_guest_free_pages(free_page_bitmap);
>	filter_out_guest_free_pages(tighten_free_page_bmap);
> }

Given the typical speed of networks, it wouldn't do too much harm
to start sending assuming all pages are dirty and then, when the guest
finally gets around to finishing the bitmap, update - so it's
asynchronous - and then if the guest never responds we don't really
care.

Dave

>
> migration_bitmap_sync();
> ...
>
> -----------------------------------------------
>
>
> --
> 1.9.1
>

--
Dr. 
David Alan Gilbert / dgilbert@redhat.com / Manchester, UK