[Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
@ 2014-12-26  3:31 Yang Hongyang
  2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO Yang Hongyang
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yang Hongyang @ 2014-12-26  3:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	dgilbert, mrhines, stefanha, amit.shah, pbonzini, walid.nouri,
	Yang Hongyang

Hi all,

Please ignore the previous one, this is the updated patch.

We are implementing COLO feature for QEMU.
For what COLO is and steps to setup and runing COLO, refer to:
http://wiki.qemu.org/Features/COLO
  * you can get almost everything about COLO from that Wiki page.
Previously posted RFC proposal:
http://lists.nongnu.org/archive/html/qemu-devel/2014-06/msg05567.html
Latest COLO code:
https://github.com/macrosheep/qemu (checkout the latest COLO branch)

Our current work is implementing COLO Disk manager, we are planning to
implement it as a block driver which called 'blkcolo'. But we need the
feedbacks from commiunity.

This patchset is a Proof of Concept of our Block replication design,
including:
  1. A document 'docs/blkcolo.txt' which describes the drift design
     specification.
  2. Some DEMO/POC code that will help you to understand how we are
     going to implement it.

Please feel free to comment.
We want comments/feedbacks as many as possiable please, thanks in advance.

Thanks,
Yang.

Wen Congyang (1):
  PoC: Block replication for COLO

Yang Hongyang (1):
  Block: Block replication design for COLO

 block.c                   |  48 +++++++
 block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
 docs/blkcolo.txt          |  85 ++++++++++++
 include/block/block.h     |   6 +
 include/block/block_int.h |  21 +++
 5 files changed, 498 insertions(+)
 create mode 100644 block/blkcolo.c
 create mode 100644 docs/blkcolo.txt

-- 
1.9.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO
  2014-12-26  3:31 [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Yang Hongyang
@ 2014-12-26  3:31 ` Yang Hongyang
  2015-03-25 16:06   ` Eric Blake
  2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 2/2] PoC: Block replication " Yang Hongyang
  2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
  2 siblings, 1 reply; 11+ messages in thread
From: Yang Hongyang @ 2014-12-26  3:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, Lai Jiangshan, quintela, GuiJianfeng, yunhong.jiang,
	eddie.dong, dgilbert, mrhines, stefanha, amit.shah, pbonzini,
	walid.nouri, Yang Hongyang

This is the initial design of block replication.
The blkcolo block driver enables disk replication for continuous
checkpoints. It is designed for COLO that Secondary VM is running.
It can also be applied for FT/HA scene that Secondary VM is not
running.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 docs/blkcolo.txt | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 docs/blkcolo.txt

diff --git a/docs/blkcolo.txt b/docs/blkcolo.txt
new file mode 100644
index 0000000..41c2a05
--- /dev/null
+++ b/docs/blkcolo.txt
@@ -0,0 +1,85 @@
+Disk replication using blkcolo
+----------------------------------------
+Copyright Fujitsu, Corp. 2014
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+The blkcolo block driver enables disk replication for continuous checkpoints.
+It is designed for COLO that Secondary VM is running. It can also be applied
+for FT/HA scene that Secondary VM is not running.
+
+This document gives an overview of blkcolo's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoint. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Disk Buffer ==
+The following is the image of Disk buffer:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be bufferd in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Capture I/O request ==
+The blkcolo is a new block driver protocol, so all I/O requests can be
+captured in the driver interface bdrv_co_readv()/bdrv_co_writev().
+
+== Checkpoint & failover ==
+The blkcolo buffers the write requests in Secondary QEMU. And the buffer
+should be dropped at a checkpoint, or be flushed to Secondary disk when
+failover. We add four block driver interfaces to do this:
+a. bdrv_prepare_checkpoint()
+   This interface may block, and return when all Primary write
+   requests are forwarded to Secondary QEMU.
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transfered to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+c. bdrv_get_sent_data_size()
+   This is used on Primary node.
+   It should be called by migration/checkpoint thread in order
+   to decide whether to start a new checkpoint or not. If the data
+   amount being sent is too large, we should start a new checkpoint.
+d. bdrv_stop_replication()
+   It is called when failover. We will flush the Disk buffer into
+   Secondary Disk and stop disk replication.
+
+== Usage ==
+On both Primary/Secondary host, invoke QEMU with the following parameters:
+    "-drive file=blkcolo:host:port:/path/to/image"
+a. host
+   Hostname or IP of the Secondary host.
+b. port
+   The Secondary QEMU will listen on this port, and the Primary QEMU
+   will connect to this port.
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [Qemu-devel] [PATCH RESEND 2/2] PoC: Block replication for COLO
  2014-12-26  3:31 [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Yang Hongyang
  2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO Yang Hongyang
@ 2014-12-26  3:31 ` Yang Hongyang
  2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
  2 siblings, 0 replies; 11+ messages in thread
From: Yang Hongyang @ 2014-12-26  3:31 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	dgilbert, mrhines, stefanha, amit.shah, pbonzini, walid.nouri,
	Yang Hongyang

From: Wen Congyang <wency@cn.fujitsu.com>

It is not finished, but only show how it will be implemented.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
 block.c                   |  48 +++++++
 block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |   6 +
 include/block/block_int.h |  21 +++
 4 files changed, 413 insertions(+)
 create mode 100644 block/blkcolo.c

diff --git a/block.c b/block.c
index 4165d42..82a5283 100644
--- a/block.c
+++ b/block.c
@@ -6086,3 +6086,51 @@ BlockAcctStats *bdrv_get_stats(BlockDriverState *bs)
 {
     return &bs->stats;
 }
+
+int bdrv_prepare_checkpoint(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_prepare_checkpoint) {
+        return drv->bdrv_prepare_checkpoint(bs);
+    } else if (bs->file) {
+        return bdrv_prepare_checkpoint(bs->file);
+    }
+
+    return -1;
+}
+
+int bdrv_do_checkpoint(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_do_checkpoint) {
+        return drv->bdrv_do_checkpoint(bs);
+    } else if (bs->file) {
+        return bdrv_do_checkpoint(bs->file);
+    }
+
+    return -1;
+}
+
+int64_t bdrv_get_sent_data_size(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_get_sent_data_size) {
+        return drv->bdrv_get_sent_data_size(bs);
+    } else if (bs->file) {
+        return bdrv_get_sent_data_size(bs->file);
+    }
+
+    return -1;
+}
+
+int bdrv_stop_replication(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv && drv->bdrv_stop_replication) {
+        return drv->bdrv_stop_replication(bs);
+    } else if (bs->file) {
+        return bdrv_stop_replication(bs->file);
+    }
+
+    return -1;
+}
diff --git a/block/blkcolo.c b/block/blkcolo.c
new file mode 100644
index 0000000..57ed4df
--- /dev/null
+++ b/block/blkcolo.c
@@ -0,0 +1,338 @@
+/*
+ * Primary mode functions
+ */
+
+static void coroutine_fn colo_pvm_forward_co(void *opaque)
+{
+    /*
+     * If the list is empty:
+     *   the status is COLO_PVM_CHECKPOINT_NONE, set the
+     *   state to idle and yield.
+     *   the status is COLO_PVM_CHECKPOINT_START, send
+     *   COLO_BLOCK_CHECKPOINT_SEC to the secondary QEMU.
+     * Otherwise, send the write requests to the secondary
+     * QEMU.
+     */
+}
+
+static colo_forward_state *colo_pvm_forward_request(BDRVBlkcoloState *s,
+                                                    int64_t sector_num,
+                                                    int nb_sectors,
+                                                    QEMUIOVector *qiov)
+{
+    /*
+     * Add the write requests to the tail of the list.
+     * Wakeup the coroutine colo_pvm_forward_co() if
+     * it is in idle state.
+     */
+}
+
+static int coroutine_fn colo_pvm_handle_write_request(BlockDriverState *bs,
+                                                      int64_t sector_num,
+                                                      int nb_sectors,
+                                                      QEMUIOVector *qiov)
+{
+    int ret;
+
+    /*
+     * call colo_pvm_forward_request to forward the primary
+     * write requests to the secondary QEMU.
+     */
+
+    ret = bdrv_co_writev(bs->file, sector_num, nb_sectors, qiov);
+
+    /* wait until the write request is forwarded to the secondary QEMU */
+
+    return ret;
+}
+
+static int coroutine_fn colo_pvm_handle_read_request(BlockDriverState *bs,
+                                                     int64_t sector_num,
+                                                     int nb_sectors,
+                                                     QEMUIOVector *qiov)
+{
+    return bdrv_co_readv(bs->file, sector_num, nb_sectors, qiov);
+}
+
+/* It should be called in the migration/checkpoint thread */
+static int colo_pvm_hanlde_checkpoint(BDRVBlkcoloState *s)
+{
+    /*
+     * wait until COLO_BLOCK_CHECKPOINT_SEC is sent to the
+     * secondary QEMU
+     */
+}
+
+/* It should be called in the migration/checkpoint thread */
+static void cancel_pvm_forward(BDRVBlkcoloState *s)
+{
+    /*
+     * Set the state to cancelled, and wait all coroutines
+     * exit.
+     */
+
+    /* switch to unprotected mode */
+}
+
+/*
+ * Secondary mode functions
+ *
+ * All write requests are forwarded to secondary QEMU from primary QEMU.
+ * The secondary QEMU should do the following things:
+ * 1. Receive and handle the forwarded write requests
+ * 2. Buffer the secondary write requests
+ */
+
+static void coroutine_fn colo_svm_handle_pvm_write_req_co(void *opaque)
+{
+    /*
+     * Do the following things:
+     * 1. read the original sector content
+     * 2. write the original sector content into disk buffer
+     *    if the sector content is not buffered
+     * 3. write the request to disk buffer
+     */
+}
+
+static void coroutine_fn colo_svm_handle_pvm_write_reqs_co(void *opaque)
+{
+    /*
+     * If the list is empty, set the state to idle, and yield.
+     * Otherwise, pick the first forwarded primary write requests,
+     * and create a coroutine colo_svm_handle_pvm_write_req_co()
+     * to handle it.
+     */
+}
+
+static void coroutine_fn colo_svm_recv_pvm_write_requests_co(void *opaque)
+{
+    /*
+     * Receive the forwarded primary write requests,
+     * and put it to the tail of the list. Wakeup the
+     * coroutine colo_svm_handle_pvm_write_reqs_co to
+     * handle the write requests if the coroutine is
+     * idle.
+     */
+}
+
+/* It should be called in the migration/checkpoint thread */
+static int svm_wait_recv_completed(BDRVBlkcoloState *s)
+{
+    /* wait until all forwarded write requests are received */
+}
+
+/*
+ * It should be called in the migration/checkpoint thread, and the caller
+ * should be hold io thread lock
+ */
+static int svm_handle_checkpoint(BlockDriverState *bs)
+{
+    /*
+     * wait until all forwarded write requests are written
+     * to the secondary disk, and then clear disk buffer.
+     */
+}
+
+/* It should be called in the migration/checkpoint thread */
+static void cancel_svm_receive(BDRVBlkcoloState *s)
+{
+    /*
+     * Set the state to cancelled, and wait all coroutines
+     * exit.
+     */
+
+    /* switch to unprotected mode */
+}
+
+static int coroutine_fn colo_svm_handle_write_request(BlockDriverState *bs,
+                                                      int64_t sector_num,
+                                                      int nb_sectors,
+                                                      QEMUIOVector *qiov)
+{
+    /*
+     * Write the request to the disk buffer. How to limit the
+     * write speed?
+     */
+}
+
+static int coroutine_fn colo_svm_handle_read_request(BlockDriverState *bs,
+                                                     int64_t sector_num,
+                                                     int nb_sectors,
+                                                     QEMUIOVector *qiov)
+{
+    /*
+     * Read the sector content from secondary disk first. If the sector
+     * content is buffered, use the buffered content.
+     */
+}
+
+/* Unprotected mode functions */
+static int coroutine_fn
+colo_unprotected_handle_write_request(BlockDriverState *bs, int64_t sector_num,
+                                      int nb_sectors, QEMUIOVector *qiov)
+{
+    return bdrv_co_writev(bs->file, sector_num, nb_sectors, qiov);
+}
+
+static int coroutine_fn
+colo_unprotected_handle_read_request(BlockDriverState *bs, int64_t sector_num,
+                                     int nb_sectors, QEMUIOVector *qiov)
+{
+    return bdrv_co_readv(bs->file, sector_num, nb_sectors, qiov);
+}
+
+/* Valid blkcolo filenames look like blkcolo:host:port:/path/to/image */
+static void blkcolo_parse_filename(const char *filename, QDict *options,
+                                   Error **errp)
+{
+}
+
+static int blkcolo_open(BlockDriverState *bs, QDict *options, int flags,
+                        Error **errp)
+{
+    /*
+     * Open the file, don't use BDRV_O_PROTOCOL to ensure that we are above
+     * the real format
+     */
+
+    /*
+     * try to listen host:port. The host is secondary host, so
+     * inet_listen() should return -1 and the errno should be
+     * EADDRNOTAVAIL if the mode is PRIMARY_MODE. But we cannot
+     * get errno after inet_listen() returns.
+     *
+     * TODO: Add a new API like inet_listen() but return -errno?
+     */
+    return 0;
+}
+
+static void blkcolo_close(BlockDriverState *bs)
+{
+
+}
+
+static int64_t blkcolo_getlength(BlockDriverState *bs)
+{
+    return bdrv_getlength(bs->file);
+}
+
+static void blkcolo_refresh_filename(BlockDriverState *bs)
+{
+}
+
+static int blkcolo_co_readv(BlockDriverState *bs, int64_t sector_num,
+                            int nb_sectors, QEMUIOVector *qiov)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    switch (s->mode) {
+    case UNPROTECTED_MODE:
+        return colo_unprotected_handle_read_request(bs, sector_num,
+                                                    nb_sectors, qiov);
+    case PRIMARY_MODE:
+        return colo_pvm_handle_read_request(bs, sector_num, nb_sectors, qiov);
+    case SECONDARY_MODE:
+        return colo_svm_handle_read_request(bs, sector_num, nb_sectors, qiov);
+    default:
+        assert(0);
+        return -1;
+    }
+}
+
+static int blkcolo_co_writev(BlockDriverState *bs, int64_t sector_num,
+                             int nb_sectors, QEMUIOVector *qiov)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    switch (s->mode) {
+    case UNPROTECTED_MODE:
+        return colo_unprotected_handle_write_request(bs, sector_num,
+                                                     nb_sectors, qiov);
+    case PRIMARY_MODE:
+        return colo_pvm_handle_write_request(bs, sector_num, nb_sectors, qiov);
+    case SECONDARY_MODE:
+        return colo_svm_handle_write_request(bs, sector_num, nb_sectors, qiov);
+    default:
+        assert(0);
+        return -1;
+    }
+}
+
+static int blkcolo_prepare_checkpoint(BlockDriverState *bs)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    switch (s->mode) {
+    case SECONDARY_MODE:
+        return svm_wait_recv_completed(s);
+    case UNPROTECTED_MODE:
+    case PRIMARY_MODE:
+    default:
+        assert(0);
+        return -1;
+    }
+}
+
+static int blkcolo_do_checkpoint(BlockDriverState *bs)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    switch (s->mode) {
+    case PRIMARY_MODE:
+        return pvm_hanlde_checkpoint(s);
+    case SECONDARY_MODE:
+        return svm_handle_checkpoint(s);
+    case UNPROTECTED_MODE:
+    default:
+        assert(0);
+        return -1;
+    }
+}
+
+static int64_t blkcolo_sent_data_size(BlockDriverState *bs)
+{
+    return 0;
+}
+
+static int blkcolo_stop_replication(BlockDriverState *bs)
+{
+    BDRVBlkcoloState *s = bs->opaque;
+
+    switch (s->mode) {
+    case PRIMARY_MODE:
+        return cancel_pvm_forward(s);
+    case SECONDARY_MODE:
+        return cancel_svm_receive(s);
+    case UNPROTECTED_MODE:
+    default:
+        assert(0);
+        return -1;
+    }
+}
+
+static BlockDriver bdrv_blkcolo = {
+    .format_name = "blkcolo",
+    .protocol_name = "blkcolo",
+    .instance_size = sizeof(BDRVBlkcoloState),
+
+    .bdrv_parse_filename = blkcolo_parse_filename,
+    .bdrv_file_open = blkcolo_open,
+    .bdrv_close = blkcolo_close,
+    .bdrv_getlength = blkcolo_getlength,
+    .bdrv_refresh_filename = blkcolo_refresh_filename,
+
+    .bdrv_co_readv = blkcolo_co_readv,
+    .bdrv_co_writev = blkcolo_co_writev,
+
+    .bdrv_prepare_checkpoint = blkcolo_prepare_checkpoint,
+    .bdrv_do_checkpoint = blkcolo_do_checkpoint,
+    .bdrv_get_sent_data_size = blkcolo_sent_data_size,
+    .bdrv_stop_replication = blkcolo_stop_replication,
+};
+
+static void bdrv_blkcolo_init(void)
+{
+    bdrv_register(&bdrv_blkcolo);
+};
+
+block_int(bdrv_blkcolo_init);
diff --git a/include/block/block.h b/include/block/block.h
index 6e7275d..9086abc 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -546,4 +546,10 @@ void bdrv_flush_io_queue(BlockDriverState *bs);
 
 BlockAcctStats *bdrv_get_stats(BlockDriverState *bs);
 
+/* Checkpoint control, called in migration/checkpoint thread */
+int bdrv_prepare_checkpoint(BlockDriverState *bs);
+int bdrv_do_checkpoint(BlockDriverState *bs);
+int64_t bdrv_get_sent_data_size(BlockDriverState *bs);
+int bdrv_stop_replication(BlockDriverState *bs);
+
 #endif
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 06a21dd..ee6320e 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -273,6 +273,27 @@ struct BlockDriver {
     void (*bdrv_io_unplug)(BlockDriverState *bs);
     void (*bdrv_flush_io_queue)(BlockDriverState *bs);
 
+    /* Checkpoint control, called in migration/checkpoint thread */
+    /*
+     * Before doing a new checkpoint, we should wait all write requests
+     * being transfered from Primary QEMU to Secondary QEMU.
+     */
+    int (*bdrv_prepare_checkpoint)(BlockDriverState *bs);
+    /*
+     * Drop Disk buffer when doing checkpoint.
+     */
+    int (*bdrv_do_checkpoint)(BlockDriverState *bs);
+    /* This is used on Primary node.
+     * Migration/checkpoint thread should call this interface in order
+     * to decide whether to start a new checkpoint or not. If the data
+     * amount being sent is too large, we should start a new checkpoint.
+     */
+    int64_t (*bdrv_get_sent_data_size)(BlockDriverState *bs);
+    /* After failover, we should flush Disk buffer into secondary disk
+     * and stop block replication.
+     */
+    int (*bdrv_stop_replication)(BlockDriverState *bs);
+
     QLIST_ENTRY(BlockDriver) list;
 };
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
  2014-12-26  3:31 [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Yang Hongyang
  2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO Yang Hongyang
  2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 2/2] PoC: Block replication " Yang Hongyang
@ 2014-12-27 15:23 ` Paolo Bonzini
  2014-12-30  7:52   ` Hongyang Yang
                     ` (3 more replies)
  2 siblings, 4 replies; 11+ messages in thread
From: Paolo Bonzini @ 2014-12-27 15:23 UTC (permalink / raw)
  To: Yang Hongyang, qemu-devel
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	dgilbert, mrhines, stefanha, Amit Shah, walid.nouri

On 26/12/2014 04:31, Yang Hongyang wrote:
> Please feel free to comment.
> We want comments/feedbacks as many as possiable please, thanks in advance.

Hi Yang,

I think it's possible to build COLO block replication from many basic
blocks that are already in QEMU.  The only new piece would be the disk
buffer on the secondary.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||
        /        \        ||
   Primary      2 NBD  ------->  2 NBD
     disk       client    ||     server                  virtio-blk
                          ||        ^                         ^
--------.                 ||        |                         |
Primary |                 ||  Secondary disk <--------- COLO buffer 3
--------'                 ||                   backing

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM.  The read pattern patches for quorum
(http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
be used/extended to make the primary always read from the local disk
instead of going through NBD.

2) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

3) The disk on the secondary is represented by a custom block device
("COLO buffer").  The disk buffer's backing image is the secondary disk,
and the disk buffer uses bdrv_add_before_write_notifier to implement
copy-on-write, similar to block/backup.c.

4) Checkpointing can use new bdrv_prepare_checkpoint and
bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
similar to your patches (you did not explain why you do checkpointing in
two steps).  Failover instead is done with bdrv_commit or can even be
done without stopping the secondary (live commit, block/commit.c).

The missing parts are:

1) NBD server on the backing image of the COLO buffer.  This means the
backing image needs its own BlockBackend.  Apart for this, no new
infrastructure is needed to receive writes on the secondary.

2) Read pattern support for quorum need to be extended for the needs of
the COLO primary.  It may be simpler or faster to write a simple
"replication" driver that writes to N children but always reads from the
first.  But in any case initial tests can be done with the quorum
driver, even without read pattern support.  Again, all the network
infrastructure to replicate writes already exists in QEMU.

3) Of course the disk buffer itself.

Paolo

> Thanks,
> Yang.
> 
> Wen Congyang (1):
>   PoC: Block replication for COLO
> 
> Yang Hongyang (1):
>   Block: Block replication design for COLO
> 
>  block.c                   |  48 +++++++
>  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>  docs/blkcolo.txt          |  85 ++++++++++++
>  include/block/block.h     |   6 +
>  include/block/block_int.h |  21 +++
>  5 files changed, 498 insertions(+)
>  create mode 100644 block/blkcolo.c
>  create mode 100644 docs/blkcolo.txt
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
  2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
@ 2014-12-30  7:52   ` Hongyang Yang
  2015-01-05 10:44   ` Dr. David Alan Gilbert
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Hongyang Yang @ 2014-12-30  7:52 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, Lai Jiangshan, quintela, GuiJianfeng, yunhong.jiang,
	eddie.dong, dgilbert, mrhines, stefanha, Amit Shah, walid.nouri

Hi Paolo,

   Thank you very much for the reply, you certainly get what we are going to
do, we are investigating Quorum and NBD, and will give you replies in a few
days.

Thanks again,
Yang.

在 12/27/2014 11:23 PM, Paolo Bonzini 写道:
>
>
> On 26/12/2014 04:31, Yang Hongyang wrote:
>> Please feel free to comment.
>> We want comments/feedbacks as many as possiable please, thanks in advance.
>
> Hi Yang,
>
> I think it's possible to build COLO block replication from many basic
> blocks that are already in QEMU.  The only new piece would be the disk
> buffer on the secondary.
>
>           virtio-blk       ||
>               ^            ||                            .----------
>               |            ||                            | Secondary
>          1 Quorum          ||                            '----------
>           /      \         ||
>          /        \        ||
>     Primary      2 NBD  ------->  2 NBD
>       disk       client    ||     server                  virtio-blk
>                            ||        ^                         ^
> --------.                 ||        |                         |
> Primary |                 ||  Secondary disk <--------- COLO buffer 3
> --------'                 ||                   backing
>
>
> 1) The disk on the primary is represented by a block device with two
> children, providing replication between a primary disk and the host that
> runs the secondary VM.  The read pattern patches for quorum
> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
> be used/extended to make the primary always read from the local disk
> instead of going through NBD.
>
> 2) The secondary disk receives writes from the primary VM through QEMU's
> embedded NBD server (speculative write-through).
>
> 3) The disk on the secondary is represented by a custom block device
> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
> and the disk buffer uses bdrv_add_before_write_notifier to implement
> copy-on-write, similar to block/backup.c.
>
> 4) Checkpointing can use new bdrv_prepare_checkpoint and
> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
> similar to your patches (you did not explain why you do checkpointing in
> two steps).  Failover instead is done with bdrv_commit or can even be
> done without stopping the secondary (live commit, block/commit.c).
>
>
> The missing parts are:
>
> 1) NBD server on the backing image of the COLO buffer.  This means the
> backing image needs its own BlockBackend.  Apart for this, no new
> infrastructure is needed to receive writes on the secondary.
>
> 2) Read pattern support for quorum need to be extended for the needs of
> the COLO primary.  It may be simpler or faster to write a simple
> "replication" driver that writes to N children but always reads from the
> first.  But in any case initial tests can be done with the quorum
> driver, even without read pattern support.  Again, all the network
> infrastructure to replicate writes already exists in QEMU.
>
> 3) Of course the disk buffer itself.
>
> Paolo
>
>> Thanks,
>> Yang.
>>
>> Wen Congyang (1):
>>    PoC: Block replication for COLO
>>
>> Yang Hongyang (1):
>>    Block: Block replication design for COLO
>>
>>   block.c                   |  48 +++++++
>>   block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>>   docs/blkcolo.txt          |  85 ++++++++++++
>>   include/block/block.h     |   6 +
>>   include/block/block_int.h |  21 +++
>>   5 files changed, 498 insertions(+)
>>   create mode 100644 block/blkcolo.c
>>   create mode 100644 docs/blkcolo.txt
>>
> .
>

-- 
Thanks,
Yang.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
  2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
  2014-12-30  7:52   ` Hongyang Yang
@ 2015-01-05 10:44   ` Dr. David Alan Gilbert
  2015-01-06  1:28     ` Wen Congyang
  2015-01-09  9:31   ` Hongyang Yang
  2015-01-28  6:42   ` Wen Congyang
  3 siblings, 1 reply; 11+ messages in thread
From: Dr. David Alan Gilbert @ 2015-01-05 10:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	qemu-devel, mrhines, stefanha, Amit Shah, walid.nouri,
	Yang Hongyang

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 26/12/2014 04:31, Yang Hongyang wrote:
> > Please feel free to comment.
> > We want comments/feedbacks as many as possiable please, thanks in advance.
> 
> Hi Yang,
> 
> I think it's possible to build COLO block replication from many basic
> blocks that are already in QEMU.  The only new piece would be the disk
> buffer on the secondary.
> 
>          virtio-blk       ||
>              ^            ||                            .----------
>              |            ||                            | Secondary
>         1 Quorum          ||                            '----------
>          /      \         ||
>         /        \        ||
>    Primary      2 NBD  ------->  2 NBD
>      disk       client    ||     server                  virtio-blk
>                           ||        ^                         ^
> --------.                 ||        |                         |
> Primary |                 ||  Secondary disk <--------- COLO buffer 3
> --------'                 ||                   backing
> 

I think the other thing about this structure is that it provides
a way of doing an initial synchronisation of the secondary's disk at
the start of COLO operation by using the NBD server (which I think is
similar to the way the newer migration does it?)

> 1) The disk on the primary is represented by a block device with two
> children, providing replication between a primary disk and the host that
> runs the secondary VM.  The read pattern patches for quorum
> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
> be used/extended to make the primary always read from the local disk
> instead of going through NBD.
> 
> 2) The secondary disk receives writes from the primary VM through QEMU's
> embedded NBD server (speculative write-through).
> 
> 3) The disk on the secondary is represented by a custom block device
> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
> and the disk buffer uses bdrv_add_before_write_notifier to implement
> copy-on-write, similar to block/backup.c.
> 
> 4) Checkpointing can use new bdrv_prepare_checkpoint and
> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
> similar to your patches (you did not explain why you do checkpointing in
> two steps).  Failover instead is done with bdrv_commit or can even be
> done without stopping the secondary (live commit, block/commit.c).
> 
> 
> The missing parts are:
> 
> 1) NBD server on the backing image of the COLO buffer.  This means the
> backing image needs its own BlockBackend.  Apart for this, no new
> infrastructure is needed to receive writes on the secondary.
> 
> 2) Read pattern support for quorum need to be extended for the needs of
> the COLO primary.  It may be simpler or faster to write a simple
> "replication" driver that writes to N children but always reads from the
> first.  But in any case initial tests can be done with the quorum
> driver, even without read pattern support.  Again, all the network
> infrastructure to replicate writes already exists in QEMU.
> 
> 3) Of course the disk buffer itself.

I think there's also:
  a) How does the secondary becomes a primary - e.g. after
     the original primary dies and you need to bring it back into
     resilience; the block structure has to morph into the primary
     with the quorum etc

  b) There's some sequencing needed somewhere to ensure that at a
    checkpoint boundary, the secondary restarts it's buffer at the
    right point after all the writes from the previous checkpoint have
    been received and before any writes coming from after the checkpoint.
    Similarly at failover to make sure there aren't any left over blocks
    still going through the nbd server.

  c) Someone always has to have a valid disk after a power failure;
    I guess the worst case is that the primary goes first, the secondary starts
    replaying it's buffer to disk but then dies part way through the replay.

Dave
    
> Paolo
> 
> > Thanks,
> > Yang.
> > 
> > Wen Congyang (1):
> >   PoC: Block replication for COLO
> > 
> > Yang Hongyang (1):
> >   Block: Block replication design for COLO
> > 
> >  block.c                   |  48 +++++++
> >  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
> >  docs/blkcolo.txt          |  85 ++++++++++++
> >  include/block/block.h     |   6 +
> >  include/block/block_int.h |  21 +++
> >  5 files changed, 498 insertions(+)
> >  create mode 100644 block/blkcolo.c
> >  create mode 100644 docs/blkcolo.txt
> > 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
  2015-01-05 10:44   ` Dr. David Alan Gilbert
@ 2015-01-06  1:28     ` Wen Congyang
  0 siblings, 0 replies; 11+ messages in thread
From: Wen Congyang @ 2015-01-06  1:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Paolo Bonzini
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	qemu-devel, mrhines, stefanha, Amit Shah, walid.nouri,
	Yang Hongyang

On 01/05/2015 06:44 PM, Dr. David Alan Gilbert wrote:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>
>>
>> On 26/12/2014 04:31, Yang Hongyang wrote:
>>> Please feel free to comment.
>>> We want comments/feedbacks as many as possiable please, thanks in advance.
>>
>> Hi Yang,
>>
>> I think it's possible to build COLO block replication from many basic
>> blocks that are already in QEMU.  The only new piece would be the disk
>> buffer on the secondary.
>>
>>          virtio-blk       ||
>>              ^            ||                            .----------
>>              |            ||                            | Secondary
>>         1 Quorum          ||                            '----------
>>          /      \         ||
>>         /        \        ||
>>    Primary      2 NBD  ------->  2 NBD
>>      disk       client    ||     server                  virtio-blk
>>                           ||        ^                         ^
>> --------.                 ||        |                         |
>> Primary |                 ||  Secondary disk <--------- COLO buffer 3
>> --------'                 ||                   backing
>>
> 
> I think the other thing about this structure is that it provides
> a way of doing an initial synchronisation of the secondary's disk at
> the start of COLO operation by using the NBD server (which I think is
> similar to the way the newer migration does it?)
> 
>> 1) The disk on the primary is represented by a block device with two
>> children, providing replication between a primary disk and the host that
>> runs the secondary VM.  The read pattern patches for quorum
>> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
>> be used/extended to make the primary always read from the local disk
>> instead of going through NBD.
>>
>> 2) The secondary disk receives writes from the primary VM through QEMU's
>> embedded NBD server (speculative write-through).
>>
>> 3) The disk on the secondary is represented by a custom block device
>> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
>> and the disk buffer uses bdrv_add_before_write_notifier to implement
>> copy-on-write, similar to block/backup.c.
>>
>> 4) Checkpointing can use new bdrv_prepare_checkpoint and
>> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
>> similar to your patches (you did not explain why you do checkpointing in
>> two steps).  Failover instead is done with bdrv_commit or can even be
>> done without stopping the secondary (live commit, block/commit.c).
>>
>>
>> The missing parts are:
>>
>> 1) NBD server on the backing image of the COLO buffer.  This means the
>> backing image needs its own BlockBackend.  Apart for this, no new
>> infrastructure is needed to receive writes on the secondary.
>>
>> 2) Read pattern support for quorum need to be extended for the needs of
>> the COLO primary.  It may be simpler or faster to write a simple
>> "replication" driver that writes to N children but always reads from the
>> first.  But in any case initial tests can be done with the quorum
>> driver, even without read pattern support.  Again, all the network
>> infrastructure to replicate writes already exists in QEMU.
>>
>> 3) Of course the disk buffer itself.
> 
> I think there's also:
>   a) How does the secondary becomes a primary - e.g. after
>      the original primary dies and you need to bring it back into
>      resilience; the block structure has to morph into the primary
>      with the quorum etc

What about this:

         virtio-blk       ||                                       virtio-blk
             ^            ||                .----------                ^
             |            ||                | Secondary                |
        1 Quorum          ||                '----------            1 Quorum
         /      \         ||                                       /      \
        /        \        ||                                      /        \
   Primary      2 NBD  ------->  2 NBD                           /          \
     disk       client    ||     server                         /            \
                          ||        ^                          /              \
--------.                 ||        |                         /                \
Primary |                 ||  Secondary disk <--------- COLO buffer 3          4 NBD
--------'                 ||                   backing                         client

The NBD client in secondary will work when it becomes primary.

> 
>   b) There's some sequencing needed somewhere to ensure that at a
>     checkpoint boundary, the secondary restarts it's buffer at the
>     right point after all the writes from the previous checkpoint have
>     been received and before any writes coming from after the checkpoint.
>     Similarly at failover to make sure there aren't any left over blocks
>     still going through the nbd server.

NBD client sends a write request to NBD server, and NBD server will return an
ACK to NBD client. We will wait all ACKs when we stop the vm.

> 
>   c) Someone always has to have a valid disk after a power failure;
>     I guess the worst case is that the primary goes first, the secondary starts
>     replaying it's buffer to disk but then dies part way through the replay.

COLO will use migration to do the first checkpoint, so we can use disk migration
to sync the disk first. And then start disk replication.

Thanks
Wen Congyang

> 
> Dave
>     
>> Paolo
>>
>>> Thanks,
>>> Yang.
>>>
>>> Wen Congyang (1):
>>>   PoC: Block replication for COLO
>>>
>>> Yang Hongyang (1):
>>>   Block: Block replication design for COLO
>>>
>>>  block.c                   |  48 +++++++
>>>  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>>>  docs/blkcolo.txt          |  85 ++++++++++++
>>>  include/block/block.h     |   6 +
>>>  include/block/block_int.h |  21 +++
>>>  5 files changed, 498 insertions(+)
>>>  create mode 100644 block/blkcolo.c
>>>  create mode 100644 docs/blkcolo.txt
>>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 
> .
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
  2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
  2014-12-30  7:52   ` Hongyang Yang
  2015-01-05 10:44   ` Dr. David Alan Gilbert
@ 2015-01-09  9:31   ` Hongyang Yang
  2015-01-28  6:42   ` Wen Congyang
  3 siblings, 0 replies; 11+ messages in thread
From: Hongyang Yang @ 2015-01-09  9:31 UTC (permalink / raw)
  To: Paolo Bonzini, qemu-devel
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	dgilbert, mrhines, stefanha, Amit Shah, walid.nouri

Hi Paolo,

   Seems there are no more comments for now.
   We are going to implement COLO Disk replication as you suggested. And I add
your comments to the design doc, thank you!

在 12/27/2014 11:23 PM, Paolo Bonzini 写道:
>
>
> On 26/12/2014 04:31, Yang Hongyang wrote:
>> Please feel free to comment.
>> We want comments/feedbacks as many as possiable please, thanks in advance.
>
> Hi Yang,
>
> I think it's possible to build COLO block replication from many basic
> blocks that are already in QEMU.  The only new piece would be the disk
> buffer on the secondary.
>
>           virtio-blk       ||
>               ^            ||                            .----------
>               |            ||                            | Secondary
>          1 Quorum          ||                            '----------
>           /      \         ||
>          /        \        ||
>     Primary      2 NBD  ------->  2 NBD
>       disk       client    ||     server                  virtio-blk
>                            ||        ^                         ^
> --------.                 ||        |                         |
> Primary |                 ||  Secondary disk <--------- COLO buffer 3
> --------'                 ||                   backing
>
>
> 1) The disk on the primary is represented by a block device with two
> children, providing replication between a primary disk and the host that
> runs the secondary VM.  The read pattern patches for quorum
> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
> be used/extended to make the primary always read from the local disk
> instead of going through NBD.
>
> 2) The secondary disk receives writes from the primary VM through QEMU's
> embedded NBD server (speculative write-through).
>
> 3) The disk on the secondary is represented by a custom block device
> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
> and the disk buffer uses bdrv_add_before_write_notifier to implement
> copy-on-write, similar to block/backup.c.
>
> 4) Checkpointing can use new bdrv_prepare_checkpoint and
> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
> similar to your patches (you did not explain why you do checkpointing in
> two steps).  Failover instead is done with bdrv_commit or can even be

If we use NBD to send block request, we don't need to do checkpoint in two
steps, because NBD will ensure all block req being sent to the secondary.
we use pre_checkpoint to wait for all request being received on secondary(
primary send an END flag to secondary when all request been sent at checkpoint,
secondary will wait for the flag been received on all disks and then
do_checkpoint).

We delete bdrv_pre_checkpoint interface in the design doc.

-- 
Thanks,
Yang.


 From ef8a236d6fdcc88559cd9ce926173ef6eff74f77 Mon Sep 17 00:00:00 2001
From: Yang Hongyang <yanghy@cn.fujitsu.com>
Date: Thu, 25 Dec 2014 13:33:00 +0800
Subject: [POC v2] Block: Block replication design for COLO

This is the initial design of block replication.
The blkcolo block driver enables disk replication for continuous
checkpoints. It is designed for COLO that Secondary VM is running.
It can also be applied for FT/HA scene that Secondary VM is not
running.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
---
  docs/blkcolo.txt | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 134 insertions(+)
  create mode 100644 docs/blkcolo.txt

diff --git a/docs/blkcolo.txt b/docs/blkcolo.txt
new file mode 100644
index 0000000..3021928
--- /dev/null
+++ b/docs/blkcolo.txt
@@ -0,0 +1,134 @@
+Disk replication using blkcolo
+----------------------------------------
+Copyright Fujitsu, Corp. 2015
+
+This work is licensed under the terms of the GNU GPL, version 2 or later.
+See the COPYING file in the top-level directory.
+
+The blkcolo block driver enables disk replication for continuous checkpoints.
+It is designed for COLO that Secondary VM is running. It can also be applied
+for FT/HA scene that Secondary VM is not running.
+
+This document gives an overview of blkcolo's design.
+
+== Background ==
+High availability solutions such as micro checkpoint and COLO will do
+consecutive checkpoint. The VM state of Primary VM and Secondary VM is
+identical right after a VM checkpoint, but becomes different as the VM
+executes till the next checkpoint. To support disk contents checkpoint,
+the modified disk contents in the Secondary VM must be buffered, and are
+only dropped at next checkpoint time. To reduce the network transportation
+effort at the time of checkpoint, the disk modification operations of
+Primary disk are asynchronously forwarded to the Secondary node.
+
+== Disk Buffer ==
+The following is the image of Disk buffer:
+
+        +----------------------+            +------------------------+
+        |Primary Write Requests|            |Secondary Write Requests|
+        +----------------------+            +------------------------+
+                  |                                       |
+                  |                                      (4)
+                  |                                       V
+                  |                              /-------------\
+                  |      Copy and Forward        |             |
+                  |---------(1)----------+       | Disk Buffer |
+                  |                      |       |             |
+                  |                     (3)      \-------------/
+                  |                 speculative      ^
+                  |                write through    (2)
+                  |                      |           |
+                  V                      V           |
+           +--------------+           +----------------+
+           | Primary Disk |           | Secondary Disk |
+           +--------------+           +----------------+
+    1) Primary write requests will be copied and forwarded to Secondary
+       QEMU.
+    2) Before Primary write requests are written to Secondary disk, the
+       original sector content will be read from Secondary disk and
+       buffered in the Disk buffer, but it will not overwrite the existing
+       sector content in the Disk buffer.
+    3) Primary write requests will be written to Secondary disk.
+    4) Secondary write requests will be bufferd in the Disk buffer and it
+       will overwrite the existing sector content in the buffer.
+
+== Implementation ==
+
+We are going to implement COLO block replication from many basic
+blocks that are already in QEMU.  The only new piece would be the disk
+buffer on the secondary.
+
+         virtio-blk       ||
+             ^            ||                            .----------
+             |            ||                            | Secondary
+        1 Quorum          ||                            '----------
+         /      \         ||
+        /        \        ||
+   Primary      2 NBD  ------->  2 NBD
+     disk       client    ||     server                  virtio-blk
+                          ||        ^                         ^
+--------.                 ||        |                         |
+Primary |                 ||  Secondary disk <--------- COLO buffer 3
+--------'                 ||                   backing
+
+1) The disk on the primary is represented by a block device with two
+children, providing replication between a primary disk and the host that
+runs the secondary VM.  The read pattern patches for quorum
+(http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
+be used/extended to make the primary always read from the local disk
+instead of going through NBD.
+
+2) The secondary disk receives writes from the primary VM through QEMU's
+embedded NBD server (speculative write-through).
+
+3) The disk on the secondary is represented by a custom block device
+("COLO buffer").  The disk buffer's backing image is the secondary disk,
+and the disk buffer uses bdrv_add_before_write_notifier to implement
+copy-on-write, similar to block/backup.c.
+
+4) Checkpointing can use bdrv_do_checkpoint interface in BlockDriver to
+discard the COLO buffer. Failover instead is done with bdrv_commit or
+can be done without stopping the secondary (live commit, block/commit.c).
+
+
+The missing parts are:
+
+1) NBD server on the backing image of the COLO buffer.  This means the
+backing image needs its own BlockBackend.  Apart for this, no new
+infrastructure is needed to receive writes on the secondary.
+
+2) Read pattern support for quorum need to be extended for the needs of
+the COLO primary.  It may be simpler or faster to write a simple
+"replication" driver that writes to N children but always reads from the
+first.  But in any case initial tests can be done with the quorum
+driver, even without read pattern support.
+
+3) The disk buffer itself.
+
+== Checkpoint & failover ==
+The blkcolo buffers the write requests in Secondary QEMU. And the buffer
+should be dropped at a checkpoint, or be flushed to Secondary disk when
+failover. We add four block driver interfaces to do this:
+a. bdrv_start_replication()
+   Start replication, called in migration/checkpoint thread
+b. bdrv_do_checkpoint()
+   This interface is called after all VM state is transfered to
+   Secondary QEMU. The Disk buffer will be dropped in this interface.
+c. bdrv_get_sent_data_size()
+   This is used on Primary node.
+   It should be called by migration/checkpoint thread in order
+   to decide whether to start a new checkpoint or not. If the data
+   amount being sent is too large, we should start a new checkpoint.
+d. bdrv_stop_replication()
+   It is called when failover. We will flush the Disk buffer into
+   Secondary Disk and stop disk replication.
+
+== Usage ==
+Primary:
+  1. NBD Client should not be the first child of quorum.
+  2. There should be only one NBD Client.
+
+Secondary:
+  -drive if=xxx,driver=colo,export=xxx,\
+         backing.file.filename=1.raw,\
+         backing.driver=raw
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing
  2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
                     ` (2 preceding siblings ...)
  2015-01-09  9:31   ` Hongyang Yang
@ 2015-01-28  6:42   ` Wen Congyang
  3 siblings, 0 replies; 11+ messages in thread
From: Wen Congyang @ 2015-01-28  6:42 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Hongyang, qemu-devel
  Cc: kwolf, quintela, GuiJianfeng, yunhong.jiang, eddie.dong,
	dgilbert, mrhines, stefanha, Amit Shah, walid.nouri

On 12/27/2014 11:23 PM, Paolo Bonzini wrote:
> 
> 
> On 26/12/2014 04:31, Yang Hongyang wrote:
>> Please feel free to comment.
>> We want comments/feedbacks as many as possiable please, thanks in advance.
> 
> Hi Yang,
> 
> I think it's possible to build COLO block replication from many basic
> blocks that are already in QEMU.  The only new piece would be the disk
> buffer on the secondary.
> 
>          virtio-blk       ||
>              ^            ||                            .----------
>              |            ||                            | Secondary
>         1 Quorum          ||                            '----------
>          /      \         ||
>         /        \        ||
>    Primary      2 NBD  ------->  2 NBD
>      disk       client    ||     server                  virtio-blk
>                           ||        ^                         ^
> --------.                 ||        |                         |
> Primary |                 ||  Secondary disk <--------- COLO buffer 3
> --------'                 ||                   backing
> 
> 
> 1) The disk on the primary is represented by a block device with two
> children, providing replication between a primary disk and the host that
> runs the secondary VM.  The read pattern patches for quorum
> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
> be used/extended to make the primary always read from the local disk
> instead of going through NBD.
> 
> 2) The secondary disk receives writes from the primary VM through QEMU's
> embedded NBD server (speculative write-through).
> 
> 3) The disk on the secondary is represented by a custom block device
> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
> and the disk buffer uses bdrv_add_before_write_notifier to implement
> copy-on-write, similar to block/backup.c.
> 
> 4) Checkpointing can use new bdrv_prepare_checkpoint and
> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
> similar to your patches (you did not explain why you do checkpointing in
> two steps).  Failover instead is done with bdrv_commit or can even be
> done without stopping the secondary (live commit, block/commit.c).
> 
> 
> The missing parts are:
> 
> 1) NBD server on the backing image of the COLO buffer.  This means the
> backing image needs its own BlockBackend.  Apart for this, no new
> infrastructure is needed to receive writes on the secondary.

Backing image is always opened read-only. How to remove this limitaion?
Add a option to control it?

Thanks
Wen Congyang

> 
> 2) Read pattern support for quorum need to be extended for the needs of
> the COLO primary.  It may be simpler or faster to write a simple
> "replication" driver that writes to N children but always reads from the
> first.  But in any case initial tests can be done with the quorum
> driver, even without read pattern support.  Again, all the network
> infrastructure to replicate writes already exists in QEMU.
> 
> 3) Of course the disk buffer itself.
> 
> Paolo
> 
>> Thanks,
>> Yang.
>>
>> Wen Congyang (1):
>>   PoC: Block replication for COLO
>>
>> Yang Hongyang (1):
>>   Block: Block replication design for COLO
>>
>>  block.c                   |  48 +++++++
>>  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>>  docs/blkcolo.txt          |  85 ++++++++++++
>>  include/block/block.h     |   6 +
>>  include/block/block_int.h |  21 +++
>>  5 files changed, 498 insertions(+)
>>  create mode 100644 block/blkcolo.c
>>  create mode 100644 docs/blkcolo.txt
>>
> 
> .
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO
  2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO Yang Hongyang
@ 2015-03-25 16:06   ` Eric Blake
  2015-03-25 16:11     ` Eric Blake
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2015-03-25 16:06 UTC (permalink / raw)
  To: Yang Hongyang, qemu-devel
  Cc: kwolf, Lai Jiangshan, quintela, GuiJianfeng, yunhong.jiang,
	eddie.dong, dgilbert, mrhines, stefanha, Amit Shah, pbonzini,
	walid.nouri

[-- Attachment #1: Type: text/plain, Size: 6124 bytes --]

On 12/25/2014 08:31 PM, Yang Hongyang wrote:
> This is the initial design of block replication.
> The blkcolo block driver enables disk replication for continuous
> checkpoints. It is designed for COLO that Secondary VM is running.
> It can also be applied for FT/HA scene that Secondary VM is not
> running.
> 
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
> ---
>  docs/blkcolo.txt | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
>  create mode 100644 docs/blkcolo.txt

Grammar review only (I'll leave the technical review to others)

> 
> diff --git a/docs/blkcolo.txt b/docs/blkcolo.txt
> new file mode 100644
> index 0000000..41c2a05
> --- /dev/null
> +++ b/docs/blkcolo.txt
> @@ -0,0 +1,85 @@
> +Disk replication using blkcolo
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2014

Visually, the separator line should match the length of the line above,
and maybe have a blank line after.

> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +The blkcolo block driver enables disk replication for continuous checkpoints.
> +It is designed for COLO that Secondary VM is running. It can also be applied

similar comments as for Wen's RFC COLO v2 series for
docs/block-replication.txt (in fact, do we need two files, or should all
this information be merged into a single file?):

s/for COLO that/for COLO (COurse-grain LOck-stepping replication), where/

> +for FT/HA scene that Secondary VM is not running.

s/for FT/HA scene that/to FT/HA (Fault-tolerance/High assurance)
scenarios, where/

> +
> +This document gives an overview of blkcolo's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoint. The VM state of Primary VM and Secondary VM is

s/checkpoint/checkpoints/

> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of
> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Disk Buffer ==
> +The following is the image of Disk buffer:
> +
> +        +----------------------+            +------------------------+
> +        |Primary Write Requests|            |Secondary Write Requests|
> +        +----------------------+            +------------------------+
> +                  |                                       |
> +                  |                                      (4)
> +                  |                                       V
> +                  |                              /-------------\
> +                  |      Copy and Forward        |             |
> +                  |---------(1)----------+       | Disk Buffer |
> +                  |                      |       |             |
> +                  |                     (3)      \-------------/
> +                  |                 speculative      ^
> +                  |                write through    (2)
> +                  |                      |           |
> +                  V                      V           |
> +           +--------------+           +----------------+
> +           | Primary Disk |           | Secondary Disk |
> +           +--------------+           +----------------+
> +    1) Primary write requests will be copied and forwarded to Secondary
> +       QEMU.
> +    2) Before Primary write requests are written to Secondary disk, the
> +       original sector content will be read from Secondary disk and
> +       buffered in the Disk buffer, but it will not overwrite the existing
> +       sector content in the Disk buffer.
> +    3) Primary write requests will be written to Secondary disk.
> +    4) Secondary write requests will be bufferd in the Disk buffer and it

s/bufferd/buffered/

> +       will overwrite the existing sector content in the buffer.
> +
> +== Capture I/O request ==
> +The blkcolo is a new block driver protocol, so all I/O requests can be
> +captured in the driver interface bdrv_co_readv()/bdrv_co_writev().
> +
> +== Checkpoint & failover ==
> +The blkcolo buffers the write requests in Secondary QEMU. And the buffer
> +should be dropped at a checkpoint, or be flushed to Secondary disk when

s/when/on/

> +failover. We add four block driver interfaces to do this:
> +a. bdrv_prepare_checkpoint()
> +   This interface may block, and return when all Primary write

s/return/returns/

> +   requests are forwarded to Secondary QEMU.
> +b. bdrv_do_checkpoint()
> +   This interface is called after all VM state is transfered to

s/transfered/transferred/

> +   Secondary QEMU. The Disk buffer will be dropped in this interface.
> +c. bdrv_get_sent_data_size()
> +   This is used on Primary node.
> +   It should be called by migration/checkpoint thread in order
> +   to decide whether to start a new checkpoint or not. If the data
> +   amount being sent is too large, we should start a new checkpoint.
> +d. bdrv_stop_replication()
> +   It is called when failover. We will flush the Disk buffer into

s/when/on/

> +   Secondary Disk and stop disk replication.
> +
> +== Usage ==
> +On both Primary/Secondary host, invoke QEMU with the following parameters:
> +    "-drive file=blkcolo:host:port:/path/to/image"
> +a. host
> +   Hostname or IP of the Secondary host.
> +b. port
> +   The Secondary QEMU will listen on this port, and the Primary QEMU
> +   will connect to this port.
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO
  2015-03-25 16:06   ` Eric Blake
@ 2015-03-25 16:11     ` Eric Blake
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Blake @ 2015-03-25 16:11 UTC (permalink / raw)
  To: Yang Hongyang, qemu-devel
  Cc: kwolf, Lai Jiangshan, quintela, GuiJianfeng, yunhong.jiang,
	eddie.dong, dgilbert, mrhines, stefanha, Amit Shah, pbonzini,
	walid.nouri

[-- Attachment #1: Type: text/plain, Size: 1465 bytes --]

	On 03/25/2015 10:06 AM, Eric Blake wrote:
> On 12/25/2014 08:31 PM, Yang Hongyang wrote:
>> This is the initial design of block replication.
>> The blkcolo block driver enables disk replication for continuous
>> checkpoints. It is designed for COLO that Secondary VM is running.
>> It can also be applied for FT/HA scene that Secondary VM is not
>> running.
>>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
>> ---
>>  docs/blkcolo.txt | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 85 insertions(+)
>>  create mode 100644 docs/blkcolo.txt
> 
> Grammar review only (I'll leave the technical review to others)

Yikes; I replied to an old thread because I forgot to clear filtering on
my mail reader.  Apologies for the noise, since...


>> +The blkcolo block driver enables disk replication for continuous checkpoints.
>> +It is designed for COLO that Secondary VM is running. It can also be applied
> 
> similar comments as for Wen's RFC COLO v2 series for
> docs/block-replication.txt (in fact, do we need two files, or should all
> this information be merged into a single file?):

it looks like the patch I commented on has already been morphed into
newer versions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-03-25 16:11 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-26  3:31 [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Yang Hongyang
2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 1/2] Block: Block replication design for COLO Yang Hongyang
2015-03-25 16:06   ` Eric Blake
2015-03-25 16:11     ` Eric Blake
2014-12-26  3:31 ` [Qemu-devel] [PATCH RESEND 2/2] PoC: Block replication " Yang Hongyang
2014-12-27 15:23 ` [Qemu-devel] [PATCH RESEND 0/2] PoC: Block replication for continuous checkpointing Paolo Bonzini
2014-12-30  7:52   ` Hongyang Yang
2015-01-05 10:44   ` Dr. David Alan Gilbert
2015-01-06  1:28     ` Wen Congyang
2015-01-09  9:31   ` Hongyang Yang
2015-01-28  6:42   ` Wen Congyang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.