From mboxrd@z Thu Jan 1 00:00:00 1970
From: Dor Laor
Subject: Re: [PATCH 00/21][RFC] postcopy live migration
Date: Sun, 01 Jan 2012 11:52:27 +0200
Message-ID: <4F002CDB.7070708@redhat.com>
References: <4EFCEC38.3080308@codemonkey.ws>
Reply-To: dlaor@redhat.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, satoshi.itoh@aist.go.jp, t.hirofuchi@aist.go.jp,
 Juan Quintela, Michael Roth, qemu-devel@nongnu.org, Isaku Yamahata,
 Umesh Deshpande
To: Anthony Liguori
In-Reply-To: <4EFCEC38.3080308@codemonkey.ws>
Errors-To: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org
Sender: qemu-devel-bounces+gceq-qemu-devel=gmane.org@nongnu.org
List-Id: kvm.vger.kernel.org

On 12/30/2011 12:39 AM, Anthony Liguori wrote:
> On 12/28/2011 07:25 PM, Isaku Yamahata wrote:
>> Intro
>> =====
>> This patch series implements postcopy live migration.[1]
>> As discussed at KVM Forum 2011, a dedicated character device is used for
>> distributed shared memory between the migration source and destination.
>> Now we can discuss/benchmark/compare with precopy. I believe there is
>> much room for improvement.
>>
>> [1] http://wiki.qemu.org/Features/PostCopyLiveMigration
>>
>>
>> Usage
>> =====
>> You need to load the umem character device on the host before starting
>> migration.
>> Postcopy can be used with both the tcg and kvm accelerators. The
>> implementation depends only on the Linux umem character device, and the
>> driver-dependent code is split out into its own file.
>> I tested only the host page size == guest page size case, but the
>> implementation allows the host page size != guest page size case.
>>
>> The following options are added with this patch series.
>> - incoming part
>>   command line options
>>   -postcopy [-postcopy-flags]
>>   where flags is for changing behavior for benchmark/debugging
>>   Currently the following flags are available
>>   0: default
>>   1: enable touching page request
>>
>>   example:
>>   qemu -postcopy -incoming tcp:0:4444 -monitor stdio -machine accel=kvm
>>
>> - outgoing part
>>   options for migrate command
>>   migrate [-p [-n]] URI
>>   -p: indicate postcopy migration
>>   -n: disable background transferring pages: This is for
>>       benchmark/debugging
>>
>>   example:
>>   migrate -p -n tcp::4444
>>
>>
>> TODO
>> ====
>> - benchmark/evaluation. Especially how async page fault affects the
>>   result.
>
> I'll review this series next week (Mike/Juan, please also review when
> you can).
>
> But we really need to think hard about whether this is the right thing
> to take into the tree. I worry a lot about the fact that we don't test
> pre-copy migration nearly enough and adding a second form just
> introduces more things to test.

It is an issue, but it can't be a merge criterion; Isaku is not to blame
for the lack of testing of pre-copy live migration. I would say that 90%
of live migration problems are not related to the pre|post stage but
rather to saving the device model state. So post-copy shouldn't add a
significant regression here. It would probably be good to ask every
migration patch writer to write an additional unit test for migration.

> It's also not clear to me why post-copy is better. If you were going to
> sit down and explain to someone building a management tool when they
> should use pre-copy and when they should use post-copy, what would you
> tell them?

Today, we have a default max-downtime of 100ms. If either the guest
working set size or the host networking throughput can't match the
downtime, migration won't end. The mgmt user options are:
  - increase the downtime more and more, up to an actual stop
  - fail the migration

With post-copy there is another option.
Performance measurements will teach us (probably prior to commit) when
this stage is valuable. Most likely, we had better try pre-copy first,
and if we can't meet the downtime we can optionally use post-copy.

Here's a paper by Umesh (the migration thread writer):
http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf

Regards,
Dor

>
> Regards,
>
> Anthony Liguori
>
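
To make the max-downtime argument in the reply above concrete, here is a rough sketch of why pre-copy may never finish. It is a toy model, not QEMU's actual algorithm: it assumes a constant dirty rate and a constant bandwidth, and the names precopy_final_pass and choose_strategy are made up for illustration. Only the 100ms default downtime comes from the message itself.

```python
def precopy_final_pass(ram_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                       max_downtime_s=0.1, max_passes=30):
    """Return the pass that can run as the final stop-and-copy, or None.

    Toy model (assumption): every pass sends whatever is still dirty, and
    the guest keeps dirtying memory at a constant rate while it is sent.
    """
    remaining_mb = ram_mb  # the first pass has to send all of guest RAM
    for pass_no in range(1, max_passes + 1):
        send_time_s = remaining_mb / bandwidth_mb_per_s
        # The guest can be stopped once the rest fits in the downtime budget.
        if send_time_s <= max_downtime_s:
            return pass_no
        # Pages dirtied during this pass must be sent again in the next one.
        remaining_mb = min(ram_mb, dirty_mb_per_s * send_time_s)
    return None  # never got under max-downtime: migration won't end


def choose_strategy(ram_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                    max_downtime_s=0.1):
    """Hypothetical management policy: try pre-copy first, else post-copy."""
    final_pass = precopy_final_pass(ram_mb, dirty_mb_per_s,
                                    bandwidth_mb_per_s, max_downtime_s)
    return "pre-copy" if final_pass is not None else "post-copy"


if __name__ == "__main__":
    # 4 GB guest, 1 GB/s link, guest dirtying 100 MB/s: converges quickly.
    print(choose_strategy(4096, 100, 1024))
    # Same guest dirtying 2 GB/s, faster than the link can send: no end.
    print(choose_strategy(4096, 2048, 1024))
```

In this model pre-copy converges only while the dirty rate stays below the link bandwidth. Otherwise the amount left to send never shrinks, and the only choices are the ones listed above: raise the downtime, fail the migration, or switch to post-copy.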