From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: zhanghailiang <zhang.zhanghailiang@huawei.com>
Cc: lizhijian@cn.fujitsu.com, quintela@redhat.com,
	yunhong.jiang@intel.com, eddie.dong@intel.com,
	peter.huangpeng@huawei.com, qemu-devel@nongnu.org,
	arei.gonglei@huawei.com, amit.shah@redhat.com,
	david@gibson.dropbear.id.au
Subject: Re: [Qemu-devel] [RFC PATCH v4 00/28] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service
Date: Fri, 24 Apr 2015 09:56:23 +0100
Message-ID: <20150424085622.GB2139@work-vm>
In-Reply-To: <553A043A.6040509@huawei.com>

* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> On 2015/4/22 19:18, Dr. David Alan Gilbert wrote:
> >* zhanghailiang (zhang.zhanghailiang@huawei.com) wrote:
> >>Hi,
> >>
> >>ping ...
> >
> >I will get to look at this again, but not until after next week.
> >
> 
> OK, thanks for your reply. :)
> 
> >>The main blocked bugs for COLO have been solved,
> >
> >I've got the v3 set running, but the biggest problems I've hit are
> >with the packet comparison module; I've seen a panic which I think is
> 
> What's the panic log?

See my reply to Wen I just sent.

> >in colo_send_checkpoint_req that I think is due to the use of
> >GFP_KERNEL to allocate the netlink message; I think it can schedule
> >there.  I tried making that GFP_ATOMIC, but I'm hitting other
> >problems with:
> >
> >kcolo_thread, no conn, schedule out
> >
> 
> Er, it is OK to get this message if you enable debugging:
> if there is no network connection to the VM, or a checkpoint request is in progress,
> there is no need to compare any network packets, so we just schedule out the kcolo_thread.
> Is this message the only thing printed? Or are there other problems?

The problem is that the primary stops at that point; I've not looked why
yet.

> >that I've not had time to look into yet.
> >
> >So I only get about a 50% success rate of starting COLO.
> 
> This is really strange. Yes, sometimes we come across problems like kernel panics in our tests,
> but not that often. Can you describe the problems in detail?
> 
> >I see there is stuff in the TODO of the colo-proxy that
> >seems to say the netlink stuff should change; maybe you're already fixing
> >that?
> >
> 
> Yes, we are trying to replace the current netlink interface in COLO with nfnetlink.
> We hope to merge the code in the next version.

Good.

> >>We have also finished some new features and optimizations for COLO. (If you are interested in them,
> >>we can send them to you privately ;))
> >
> >>For ease of review, it is better to keep it simple for now, so we will not add much new code to this frame
> >>patch set before it has been fully reviewed.
> >
> >I'd like to see those; but I don't want to take code privately.
> >It's OK to post extra stuff as a separate set.
> >
> 
> Hmm, that is a really good idea; maybe we should also add a branch
> with all the optimizations and new features on GitHub.

Yes, that would be good.

Dave

> >>COLO is a totally new feature which is still at an early stage; we hope to speed up its development,
> >>so your comments and feedback are warmly welcome. :)
> >
> >Yes, it's getting there though; I don't think anyone else has
> >got this close to getting a full FT set working with disk and networking.
> >
> 
> Thanks,
> zhanghailiang
> 
> >>
> >>On 2015/3/26 13:29, zhanghailiang wrote:
> >>>This is the 4th version of COLO. It contains only the COLO frame part, including VM checkpoint,
> >>>failover, proxy API and block replication API; it does not include block replication itself.
> >>>The block part has been sent by Wen Congyang:
> >>>[RFC PATCH COLO v2 00/13] Block replication for continuous checkpoints
> >>>
> >>>Compared with the last version, there are not many optimizations or new functions.
> >>>The main reason is that a known issue is still unsolved: we found
> >>>some dirty pages whose bits were not set in the corresponding bitmap,
> >>>and this triggers strange problems in the VM.
> >>>We hope to resolve it before adding more code.
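The known issue above can be illustrated with a toy dirty log. This is a minimal sketch, not QEMU's migration bitmap code: the page count, `mark_page_dirty()` and `collect_and_clear_dirty()` are hypothetical. Assuming, as in ordinary live migration, that only pages whose bit is set are sent at the next checkpoint, a write that skips `mark_page_dirty()` leaves a stale copy of that page on the secondary.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Toy guest size (hypothetical); one bit per page in the dirty log. */
#define TOY_NPAGES 128
#define TOY_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITMAP_WORDS ((TOY_NPAGES + TOY_BITS_PER_WORD - 1) / TOY_BITS_PER_WORD)

static unsigned long dirty_bitmap[TOY_BITMAP_WORDS];

/* Every path that writes guest memory must call this.  A path that
 * skips it is the "page dirtied without its bit set" case. */
static void mark_page_dirty(size_t page)
{
    assert(page < TOY_NPAGES);
    dirty_bitmap[page / TOY_BITS_PER_WORD] |= 1UL << (page % TOY_BITS_PER_WORD);
}

/* At checkpoint time, list the dirty pages (the only ones that would be
 * sent to the secondary) in out[], clear the log for the next epoch,
 * and return how many pages were listed. */
static size_t collect_and_clear_dirty(size_t out[], size_t max)
{
    size_t n = 0;

    assert(max >= TOY_NPAGES);
    for (size_t page = 0; page < TOY_NPAGES; page++) {
        if (dirty_bitmap[page / TOY_BITS_PER_WORD] &
            (1UL << (page % TOY_BITS_PER_WORD))) {
            out[n++] = page;
        }
    }
    memset(dirty_bitmap, 0, sizeof(dirty_bitmap));
    return n;
}
```

For example, after `mark_page_dirty(3)` and `mark_page_dirty(70)`, a write to page 5 that never calls `mark_page_dirty(5)` is simply absent from the list returned at the next checkpoint.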
> >>>
> >>>You can get the newest integrated qemu colo patches from github:
> >>>https://github.com/coloft/qemu/commits/colo-v1.1
> >>>
> >>>For how to test COLO, please refer to the following link:
> >>>http://wiki.qemu.org/Features/COLO
> >>>
> >>>Please review and test.
> >>>
> >>>Known issue still unsolved:
> >>>(1) Some pages are dirtied without their corresponding dirty-bitmap bits being set.
> >>>
> >>>Previous posted RFC patch series:
> >>>http://lists.nongnu.org/archive/html/qemu-devel/2014-06/msg05567.html
> >>>http://lists.nongnu.org/archive/html/qemu-devel/2014-09/msg04459.html
> >>>https://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04771.html
> >>>
> >>>TODO list:
> >>>1 Optimize the checkpoint process to shorten the time it takes:
> >>>   (Partly done; the patches are not included in this series)
> >>>    1) separate the RAM and device save/load processes to reduce the extra memory
> >>>       used during a checkpoint
> >>>    2) live migrate part of the dirty pages to the slave during sleep time.
> >>>2 Add more debug/stat info
> >>>   (Partly done; the patches are not included in this series)
> >>>   including checkpoint count, proxy miscompare count, downtime,
> >>>    number of live migrated pages, total sent pages, etc.
> >>>3 Strengthen failover
> >>>4 Optimize the proxy part, including the proxy script.
> >>>5 Add the capability of continuous FT
> >>>
> >>>v4:
> >>>- New block replication scheme (uses image-fleecing on the secondary side)
> >>>- Address some comments from Eric Blake and Dave
> >>>- Add command colo-set-checkpoint-period to set the time of the periodic checkpoint
> >>>- Add a delay (100ms) between continuous checkpoint requests to ensure the VM
> >>>   runs for at least 100ms after the last pause.
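The v4 checkpoint policy (checkpoint on a network packet miscompare, an additional periodic checkpoint whose period is set by colo-set-checkpoint-period, and at least 100ms of run time after the last pause) can be sketched as a single decision function. This is a hedged illustration, not the patch code: `struct colo_ckpt_inputs` and `colo_should_checkpoint()` are hypothetical, and it assumes a caller that polls again later when the answer is "not yet".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimum run time after a pause, from the v4 changelog (100ms). */
#define COLO_MIN_RUN_MS 100

/* Hypothetical inputs; all times in milliseconds on a monotonic clock. */
struct colo_ckpt_inputs {
    int64_t now_ms;
    int64_t last_resume_ms; /* VM resumed after the previous checkpoint */
    int64_t period_ms;      /* as set by colo-set-checkpoint-period */
    bool packets_mismatch;  /* proxy saw differing PVM/SVM output */
};

/* Returns true if a checkpoint should start now; when it returns false
 * the caller is assumed to poll again later. */
static bool colo_should_checkpoint(const struct colo_ckpt_inputs *in)
{
    int64_t ran_ms;

    assert(in->now_ms >= in->last_resume_ms);
    ran_ms = in->now_ms - in->last_resume_ms;

    /* Let the VM run at least COLO_MIN_RUN_MS since the last pause. */
    if (ran_ms < COLO_MIN_RUN_MS) {
        return false;
    }
    /* Checkpoint on a packet miscompare, or when the period expires. */
    return in->packets_mismatch || ran_ms >= in->period_ms;
}
```

With this ordering, a miscompare reported 50ms after resuming is deferred rather than dropped, which matches the stated goal of guaranteeing the VM some run time between back-to-back checkpoint requests.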
> >>>
> >>>v3:
> >>>- use proxy instead of colo agent to compare network packets
> >>>- add block replication
> >>>- Optimize failover disposal
> >>>- handle shutdown
> >>>
> >>>v2:
> >>>- use QEMUSizedBuffer/QEMUFile as COLO buffer
> >>>- colo support is enabled by default
> >>>- add nic replication support
> >>>- addressed comments from Eric Blake and Dr. David Alan Gilbert
> >>>
> >>>v1:
> >>>- implement the frame of colo
> >>>
> >>>Wen Congyang (1):
> >>>   COLO: Add block replication into colo process
> >>>
> >>>zhanghailiang (27):
> >>>   configure: Add parameter for configure to enable/disable COLO support
> >>>   migration: Introduce capability 'colo' to migration
> >>>   COLO: migrate colo related info to slave
> >>>   migration: Integrate COLO checkpoint process into migration
> >>>   migration: Integrate COLO checkpoint process into loadvm
> >>>   COLO: Implement colo checkpoint protocol
> >>>   COLO: Add a new RunState RUN_STATE_COLO
> >>>   QEMUSizedBuffer: Introduce two help functions for qsb
> >>>   COLO: Save VM state to slave when do checkpoint
> >>>   COLO RAM: Load PVM's dirty page into SVM's RAM cache temporarily
> >>>   COLO VMstate: Load VM state into qsb before restore it
> >>>   arch_init: Start to trace dirty pages of SVM
> >>>   COLO RAM: Flush cached RAM into SVM's memory
> >>>   COLO failover: Introduce a new command to trigger a failover
> >>>   COLO failover: Implement COLO master/slave failover work
> >>>   COLO failover: Don't do failover during loading VM's state
> >>>   COLO: Add new command parameter 'colo_nicname' 'colo_script' for net
> >>>   COLO NIC: Init/remove colo nic devices when add/cleanup tap devices
> >>>   COLO NIC: Implement colo nic device interface configure()
> >>>   COLO NIC : Implement colo nic init/destroy function
> >>>   COLO NIC: Some init work related with proxy module
> >>>   COLO: Do checkpoint according to the result of net packets comparing
> >>>   COLO: Improve checkpoint efficiency by do additional periodic
> >>>     checkpoint
> >>>   COLO: Add colo-set-checkpoint-period command
> >>>   COLO NIC: Implement NIC checkpoint and failover
> >>>   COLO: Disable qdev hotplug when VM is in COLO mode
> >>>   COLO: Implement shutdown checkpoint
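The two "COLO RAM" patches above load the PVM's dirty pages into a RAM cache on the secondary and later flush the cache into the SVM's memory. A minimal sketch of that two-phase pattern follows; `struct svm_ram`, `svm_load_page()` and `svm_flush_cache()` are hypothetical names, and the sizes are toy values. The presumed benefit (an assumption, not stated in the cover letter) is that the SVM's memory is only updated once a complete checkpoint has arrived, so it is never left half-applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy sizes (hypothetical). */
#define TOY_PAGE_SIZE 4096
#define TOY_RAM_PAGES 8

/* Hypothetical model of the secondary's RAM. */
struct svm_ram {
    unsigned char mem[TOY_RAM_PAGES][TOY_PAGE_SIZE];   /* memory the SVM runs on */
    unsigned char cache[TOY_RAM_PAGES][TOY_PAGE_SIZE]; /* staging for PVM pages */
    bool cached[TOY_RAM_PAGES];                        /* received this checkpoint */
};

/* While a checkpoint is being received, PVM pages only go into the cache. */
static void svm_load_page(struct svm_ram *r, size_t page, const unsigned char *data)
{
    assert(page < TOY_RAM_PAGES);
    memcpy(r->cache[page], data, TOY_PAGE_SIZE);
    r->cached[page] = true;
}

/* Once the whole checkpoint has arrived, copy every cached page into the
 * SVM's memory, reset the markers, and return the number of pages flushed. */
static size_t svm_flush_cache(struct svm_ram *r)
{
    size_t flushed = 0;

    for (size_t page = 0; page < TOY_RAM_PAGES; page++) {
        if (r->cached[page]) {
            memcpy(r->mem[page], r->cache[page], TOY_PAGE_SIZE);
            r->cached[page] = false;
            flushed++;
        }
    }
    return flushed;
}
```

Until `svm_flush_cache()` runs, `mem` still holds the state from the previous checkpoint, which is the point of staging the incoming pages.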
> >>>
> >>>  arch_init.c                            | 199 +++++++-
> >>>  configure                              |  14 +
> >>>  hmp-commands.hx                        |  30 ++
> >>>  hmp.c                                  |  14 +
> >>>  hmp.h                                  |   2 +
> >>>  include/exec/cpu-all.h                 |   1 +
> >>>  include/migration/migration-colo.h     |  58 +++
> >>>  include/migration/migration-failover.h |  22 +
> >>>  include/migration/migration.h          |   3 +
> >>>  include/migration/qemu-file.h          |   3 +-
> >>>  include/net/colo-nic.h                 |  25 +
> >>>  include/net/net.h                      |   4 +
> >>>  include/sysemu/sysemu.h                |   3 +
> >>>  migration/Makefile.objs                |   2 +
> >>>  migration/colo-comm.c                  |  80 ++++
> >>>  migration/colo-failover.c              |  48 ++
> >>>  migration/colo.c                       | 809 +++++++++++++++++++++++++++++++++
> >>>  migration/migration.c                  |  60 ++-
> >>>  migration/qemu-file-buf.c              |  58 +++
> >>>  net/Makefile.objs                      |   1 +
> >>>  net/colo-nic.c                         | 438 ++++++++++++++++++
> >>>  net/tap.c                              |  45 +-
> >>>  qapi-schema.json                       |  42 +-
> >>>  qemu-options.hx                        |  10 +-
> >>>  qmp-commands.hx                        |  41 ++
> >>>  savevm.c                               |   2 +-
> >>>  scripts/colo-proxy-script.sh           |  97 ++++
> >>>  stubs/Makefile.objs                    |   1 +
> >>>  stubs/migration-colo.c                 |  58 +++
> >>>  vl.c                                   |  36 +-
> >>>  30 files changed, 2178 insertions(+), 28 deletions(-)
> >>>  create mode 100644 include/migration/migration-colo.h
> >>>  create mode 100644 include/migration/migration-failover.h
> >>>  create mode 100644 include/net/colo-nic.h
> >>>  create mode 100644 migration/colo-comm.c
> >>>  create mode 100644 migration/colo-failover.c
> >>>  create mode 100644 migration/colo.c
> >>>  create mode 100644 migration/colo.c.
> >>>  create mode 100644 net/colo-nic.c
> >>>  create mode 100755 scripts/colo-proxy-script.sh
> >>>  create mode 100644 stubs/migration-colo.c
> >>>
> >>
> >>
> >--
> >Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >.
> >
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Thread overview: 51+ messages
2015-03-26  5:29 [Qemu-devel] [RFC PATCH v4 00/28] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 01/28] configure: Add parameter for configure to enable/disable COLO support zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 02/28] migration: Introduce capability 'colo' to migration zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 03/28] COLO: migrate colo related info to slave zhanghailiang
2015-05-15 11:38   ` Dr. David Alan Gilbert
2015-05-18  5:04     ` zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 04/28] migration: Integrate COLO checkpoint process into migration zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 05/28] migration: Integrate COLO checkpoint process into loadvm zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 06/28] COLO: Implement colo checkpoint protocol zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 07/28] COLO: Add a new RunState RUN_STATE_COLO zhanghailiang
2015-05-15 11:28   ` Dr. David Alan Gilbert
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 08/28] QEMUSizedBuffer: Introduce two help functions for qsb zhanghailiang
2015-05-15 11:56   ` Dr. David Alan Gilbert
2015-05-18  5:10     ` zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 09/28] COLO: Save VM state to slave when do checkpoint zhanghailiang
2015-05-15 12:09   ` Dr. David Alan Gilbert
2015-05-18  9:11     ` zhanghailiang
2015-05-18 12:10       ` Dr. David Alan Gilbert
2015-05-18 12:22         ` zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 10/28] COLO RAM: Load PVM's dirty page into SVM's RAM cache temporarily zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 11/28] COLO VMstate: Load VM state into qsb before restore it zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 12/28] arch_init: Start to trace dirty pages of SVM zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 13/28] COLO RAM: Flush cached RAM into SVM's memory zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 14/28] COLO failover: Introduce a new command to trigger a failover zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 15/28] COLO failover: Implement COLO master/slave failover work zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 16/28] COLO failover: Don't do failover during loading VM's state zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 17/28] COLO: Add new command parameter 'colo_nicname' 'colo_script' for net zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 18/28] COLO NIC: Init/remove colo nic devices when add/cleanup tap devices zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 19/28] COLO NIC: Implement colo nic device interface configure() zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 20/28] COLO NIC : Implement colo nic init/destroy function zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 21/28] COLO NIC: Some init work related with proxy module zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 22/28] COLO: Do checkpoint according to the result of net packets comparing zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 23/28] COLO: Improve checkpoint efficiency by do additional periodic checkpoint zhanghailiang
2015-05-18 16:48   ` Dr. David Alan Gilbert
2015-05-19  6:08     ` zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 24/28] COLO: Add colo-set-checkpoint-period command zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 25/28] COLO NIC: Implement NIC checkpoint and failover zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 26/28] COLO: Disable qdev hotplug when VM is in COLO mode zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 27/28] COLO: Implement shutdown checkpoint zhanghailiang
2015-03-26  5:29 ` [Qemu-devel] [RFC PATCH v4 28/28] COLO: Add block replication into colo process zhanghailiang
2015-04-08  8:16 ` [Qemu-devel] [RFC PATCH v4 00/28] COarse-grain LOck-stepping(COLO) Virtual Machines for Non-stop Service zhanghailiang
2015-04-22 11:18   ` Dr. David Alan Gilbert
2015-04-24  7:25     ` Wen Congyang
2015-04-24  8:35       ` Dr. David Alan Gilbert
2015-04-28 10:51         ` zhanghailiang
2015-05-06 17:11           ` Dr. David Alan Gilbert
2015-04-24  8:52     ` zhanghailiang
2015-04-24  8:56       ` Dr. David Alan Gilbert [this message]
2015-05-14 12:14 ` Dr. David Alan Gilbert
2015-05-14 12:58   ` zhanghailiang
2015-05-14 16:09     ` Dr. David Alan Gilbert
