From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:33790) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YlZPj-00006u-3G for qemu-devel@nongnu.org; Fri, 24 Apr 2015 04:56:48 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YlZPd-0004Ss-9P for qemu-devel@nongnu.org; Fri, 24 Apr 2015 04:56:47 -0400
Received: from mx1.redhat.com ([209.132.183.28]:42539) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YlZPd-0004Sg-2N for qemu-devel@nongnu.org; Fri, 24 Apr 2015 04:56:41 -0400
Date: Fri, 24 Apr 2015 09:56:23 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20150424085622.GB2139@work-vm>
References: <1427347774-8960-1-git-send-email-zhang.zhanghailiang@huawei.com> <5524E3E9.1070102@huawei.com> <20150422111833.GD2386@work-vm> <553A043A.6040509@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <553A043A.6040509@huawei.com>
Subject: Re: [Qemu-devel] [RFC PATCH v4 00/28] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service
To: zhanghailiang
Cc: lizhijian@cn.fujitsu.com, quintela@redhat.com, yunhong.jiang@intel.com, eddie.dong@intel.com, peter.huangpeng@huawei.com, qemu-devel@nongnu.org, arei.gonglei@huawei.com, amit.shah@redhat.com, david@gibson.dropbear.id.au

* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> On 2015/4/22 19:18, Dr. David Alan Gilbert wrote:
> >* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >>Hi,
> >>
> >>ping ...
> >
> >I will get to look at this again; but not until after next week.
> >
>
> OK, thanks for your reply. :)
>
> >>The main blocked bugs for COLO have been solved,
> >
> >I've got the v3 set running, but the biggest problem I hit are problems
> >with the packet comparison module; I've seen a panic which I think is
>
> What's the panic log?
See my reply to Wen I just sent.

> >in colo_send_checkpoint_req that I think is due to the use of
> >GFP_KERNEL to allocate the netlink message and I think it can schedule
> >there. I tried making that a GFP_ATOMIC but I'm hitting other
> >problems with :
> >
> >kcolo_thread, no conn, schedule out
> >
>
> Er, it is OK to get this messages if you enable the debug,
> if there is no net connect to VM, or there is a checkpoint request happening,
> it is no need to compare any network packets. So we just schedule out the kcolo_thread.
> Is it just this messages been printed ? Or maybe some other problems ?

The problem is that the primary stops at that point; I've not looked why yet.

> >that I've not had time to look into yet.
> >
> >So I only get about a 50% success rate of starting COLO.
>
> This is really strange, yes, sometimes we can come across problems like kernel panic in our tests,
> but not so often. Can you describe the problems in detail ?
>
> >I see there are stuff in the TODO of the colo-proxy that
> >seem to say the netlink stuff should change, maybe you're already fixing
> >that?
> >
>
> Yes, we are trying to replace the current netlink in COLO with nfnetlink interface.
> Hope to merge the code in next version.

Good.

> >>we also have finished some new features and optimization on COLO. (If you are interested in this,
> >>we can send them to you in private ;))
> >
> >>For easy of review, it is better to keep it simple now, so we will not add too much new codes into this frame
> >>patch set before it been totally reviewed.
> >
> >I'd like to see those; but I don't want to take code privately.
> >It's OK to post extra stuff as a separate set.
> >
>
> Hmm, there is really a good idea, maybe we should also add a branch
> with all the optimization and new features in github.

Yes, that would be good.

Dave

> >>COLO is a totally new feature which is still in early stage, we hope to speed up the development,
> >>so your comments and feedback are warmly welcomed. :)
> >
> >Yes, it's getting there though; I don't think anyone else has
> >got this close to getting a full FT set working with disk and networking.
> >
>
> Thanks,
> zhanghailiang
>
> >>
> >>On 2015/3/26 13:29, zhanghailiang wrote:
> >>>This is the 4th version of COLO, here is only COLO frame part, include: VM checkpoint,
> >>>failover, proxy API, block replication API, not include block replication.
> >>>The block part has been sent by wencongyang:
> >>>[RFC PATCH COLO v2 00/13] Block replication for continuous checkpoints
> >>>
> >>>Compared with last version, there aren't too much optimize and new functions.
> >>>The main reason is that there is an known issue that still unsolved, we found
> >>>some dirty pages which have been missed setting bit in corresponding bitmap.
> >>>And it will trigger strange problem in VM.
> >>>We hope to resolve it before add more codes.
> >>>
> >>>You can get the newest integrated qemu colo patches from github:
> >>>https://github.com/coloft/qemu/commits/colo-v1.1
> >>>
> >>>About how to test COLO, Please reference to the follow link.
> >>>http://wiki.qemu.org/Features/COLO.
> >>>
> >>>Please review and test.
> >>>
> >>>Known issue still unsolved:
> >>>(1) Some pages dirtied without setting its corresponding dirty-bitmap.
> >>>
> >>>Previous posted RFC patch series:
> >>>http://lists.nongnu.org/archive/html/qemu-devel/2014-06/msg05567.html
> >>>http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg04459.html
> >>>https://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04771.html
> >>>
> >>>TODO list:
> >>>1 Optimize the process of checkpoint, shorten the time-consuming:
> >>>   (Partly done, patch is not include into this series)
> >>>   1) separate ram and device save/load process to reduce size of extra memory
> >>>      used during checkpoint
> >>>   2) live migrate part of dirty pages to slave during sleep time.
> >>>2 Add more debug/stat info
> >>>   (Partly done, patch is not include into this series)
> >>>   include checkpoint count, proxy discompare count, downtime,
> >>>   number of live migrated pages, total sent pages, etc.
> >>>3 Strengthen failover
> >>>4 optimize proxy part, include proxy script.
> >>>5 The capability of continuous FT
> >>>
> >>>v4:
> >>>- New block replication scheme (use image-fleecing for secondary side)
> >>>- Address some comments from Eric Blake and Dave
> >>>- Add command colo-set-checkpoint-period to set the time of periodic checkpoint
> >>>- Add a delay (100ms) between continuous checkpoint requests to ensure VM
> >>>  run 100ms at least since last pause.
> >>>
> >>>v3:
> >>>- use proxy instead of colo agent to compare network packets
> >>>- add block replication
> >>>- Optimize failover disposal
> >>>- handle shutdown
> >>>
> >>>v2:
> >>>- use QEMUSizedBuffer/QEMUFile as COLO buffer
> >>>- colo support is enabled by default
> >>>- add nic replication support
> >>>- addressed comments from Eric Blake and Dr. David Alan Gilbert
> >>>
> >>>v1:
> >>>- implement the frame of colo
> >>>
> >>>Wen Congyang (1):
> >>>  COLO: Add block replication into colo process
> >>>
> >>>zhanghailiang (27):
> >>>  configure: Add parameter for configure to enable/disable COLO support
> >>>  migration: Introduce capability 'colo' to migration
> >>>  COLO: migrate colo related info to slave
> >>>  migration: Integrate COLO checkpoint process into migration
> >>>  migration: Integrate COLO checkpoint process into loadvm
> >>>  COLO: Implement colo checkpoint protocol
> >>>  COLO: Add a new RunState RUN_STATE_COLO
> >>>  QEMUSizedBuffer: Introduce two help functions for qsb
> >>>  COLO: Save VM state to slave when do checkpoint
> >>>  COLO RAM: Load PVM's dirty page into SVM's RAM cache temporarily
> >>>  COLO VMstate: Load VM state into qsb before restore it
> >>>  arch_init: Start to trace dirty pages of SVM
> >>>  COLO RAM: Flush cached RAM into SVM's memory
> >>>  COLO failover: Introduce a new command to trigger a failover
> >>>  COLO failover: Implement COLO master/slave failover work
> >>>  COLO failover: Don't do failover during loading VM's state
> >>>  COLO: Add new command parameter 'colo_nicname' 'colo_script' for net
> >>>  COLO NIC: Init/remove colo nic devices when add/cleanup tap devices
> >>>  COLO NIC: Implement colo nic device interface configure()
> >>>  COLO NIC : Implement colo nic init/destroy function
> >>>  COLO NIC: Some init work related with proxy module
> >>>  COLO: Do checkpoint according to the result of net packets comparing
> >>>  COLO: Improve checkpoint efficiency by do additional periodic
> >>>    checkpoint
> >>>  COLO: Add colo-set-checkpoint-period command
> >>>  COLO NIC: Implement NIC checkpoint and failover
> >>>  COLO: Disable qdev hotplug when VM is in COLO mode
> >>>  COLO: Implement shutdown checkpoint
> >>>
> >>> arch_init.c                            | 199 +++++++-
> >>> configure                              |  14 +
> >>> hmp-commands.hx                        |  30 ++
> >>> hmp.c                                  |  14 +
> >>> hmp.h                                  |   2 +
> >>> include/exec/cpu-all.h                 |   1 +
> >>> include/migration/migration-colo.h     |  58 +++
> >>> include/migration/migration-failover.h |  22 +
> >>> include/migration/migration.h          |   3 +
> >>> include/migration/qemu-file.h          |   3 +-
> >>> include/net/colo-nic.h                 |  25 +
> >>> include/net/net.h                      |   4 +
> >>> include/sysemu/sysemu.h                |   3 +
> >>> migration/Makefile.objs                |   2 +
> >>> migration/colo-comm.c                  |  80 ++++
> >>> migration/colo-failover.c              |  48 ++
> >>> migration/colo.c                       | 809 +++++++++++++++++++++++++++++++++
> >>> migration/migration.c                  |  60 ++-
> >>> migration/qemu-file-buf.c              |  58 +++
> >>> net/Makefile.objs                      |   1 +
> >>> net/colo-nic.c                         | 438 ++++++++++++++++++
> >>> net/tap.c                              |  45 +-
> >>> qapi-schema.json                       |  42 +-
> >>> qemu-options.hx                        |  10 +-
> >>> qmp-commands.hx                        |  41 ++
> >>> savevm.c                               |   2 +-
> >>> scripts/colo-proxy-script.sh           |  97 ++++
> >>> stubs/Makefile.objs                    |   1 +
> >>> stubs/migration-colo.c                 |  58 +++
> >>> vl.c                                   |  36 +-
> >>> 30 files changed, 2178 insertions(+), 28 deletions(-)
> >>> create mode 100644 include/migration/migration-colo.h
> >>> create mode 100644 include/migration/migration-failover.h
> >>> create mode 100644 include/net/colo-nic.h
> >>> create mode 100644 migration/colo-comm.c
> >>> create mode 100644 migration/colo-failover.c
> >>> create mode 100644 migration/colo.c
> >>> create mode 100644 net/colo-nic.c
> >>> create mode 100755 scripts/colo-proxy-script.sh
> >>> create mode 100644 stubs/migration-colo.c
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >.
> >
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK