On 01/13/2016 02:18 AM, Changlong Xie wrote:
> From: Wen Congyang
>
> Signed-off-by: Wen Congyang
> Signed-off-by: zhanghailiang
> Signed-off-by: Gonglei
> Signed-off-by: Changlong Xie
> ---
>  docs/block-replication.txt | 229 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 229 insertions(+)
>  create mode 100644 docs/block-replication.txt
>
> diff --git a/docs/block-replication.txt b/docs/block-replication.txt
> new file mode 100644
> index 0000000..d1a231e
> --- /dev/null
> +++ b/docs/block-replication.txt
> @@ -0,0 +1,229 @@
> +Block replication
> +----------------------------------------
> +Copyright Fujitsu, Corp. 2015
> +Copyright (c) 2015 Intel Corporation
> +Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.

Do you want to claim 2016 for any of this?

> +
> +This work is licensed under the terms of the GNU GPL, version 2 or later.
> +See the COPYING file in the top-level directory.
> +
> +Block replication is used for continuous checkpoints. It is designed
> +for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
> +It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
> +where the Secondary VM is not running.
> +
> +This document gives an overview of block replication's design.
> +
> +== Background ==
> +High availability solutions such as micro checkpoint and COLO will do
> +consecutive checkpoints. The VM state of Primary VM and Secondary VM is

s/of Primary/of the Primary/

> +identical right after a VM checkpoint, but becomes different as the VM
> +executes till the next checkpoint. To support disk contents checkpoint,
> +the modified disk contents in the Secondary VM must be buffered, and are
> +only dropped at next checkpoint time. To reduce the network transportation
> +effort at the time of checkpoint, the disk modification operations of

s/at the time of/during a vmstate/
s/operations of/operations of the/

> +Primary disk are asynchronously forwarded to the Secondary node.
> +
> +== Workflow ==
> +== Architecture ==
> +
> +6) The drive-backup job(sync=none) is run to allow hidden-disk to buffer

Space before ( in English description.

> +any state that would otherwise be lost by the speculative write-through
> +of the NBD server into the secondary disk. So before block replication,
> +the primary disk and secondary disk should contain the same data.
> +
> +== Failure Handling ==
> +== Usage ==
> +Primary:
> +  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
> +         children.0.file.filename=1.raw,\
> +         children.0.driver=raw
> +
> +  Run qmp command in primary qemu:
> +    { 'execute': 'human-monitor-command',
> +      'arguments': {
> +          'command-line': 'drive_add buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1,if=none'

Eww. We shouldn't ever have to pack a command line as a single QMP
string that needs reparsing. Instead, you should pass the information
as a nested QMP dictionary, something like:

  'arguments': {
    'remote-command': {
      'command': 'drive_add',
      'name': 'buddy',
      'driver': 'replication',
      'mode': 'primary',
      'file': {
        'driver': 'nbd',
        'host': 'xxxx',
        ...
      }
    }
  }

> +      }
> +    }
> +    { 'execute': 'x-blockdev-change',
> +      'arguments': {
> +          'parent': 'colo1',
> +          'node': 'nbd_client1'
> +      }
> +    }
> +  Note:
> +  1. There should be only one NBD Client for each primary disk.
> +  2. host is the secondary physical machine's hostname or IP
> +  3. Each disk must have its own export name.
> +  4. It is all a single argument to -drive and you should ignore the
> +     leading whitespace.
> +  5. The qmp command line must be run after running qmp command line in
> +     secondary qemu.
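
To make the nested-dictionary suggestion above a bit more concrete, here
is a rough Python sketch that turns the dotted key=value string from the
patch into a nested dictionary. It is only an illustration of the shape
of the data, not an existing QEMU interface: the helper name
nest_options is hypothetical, and the 'drive_add buddy' prefix and the
if=none option are left out on purpose.

```python
import json


def nest_options(flat):
    """Turn 'a=1,file.driver=nbd' into {'a': '1', 'file': {'driver': 'nbd'}}.

    Hypothetical helper for illustration only; values stay strings.
    """
    nested = {}
    for pair in flat.split(","):
        key, _, value = pair.partition("=")
        # Every dotted component except the last names a sub-dictionary.
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Option string taken from the quoted patch, minus 'drive_add buddy'
# and 'if=none'.
flat = ("driver=replication,mode=primary,file.driver=nbd,"
        "file.host=xxxx,file.port=xxxx,file.export=colo1,"
        "node-name=nbd_client1")
options = nest_options(flat)
print(json.dumps(options, indent=2))
```

The point is that the 'file.*' keys already describe a sub-dictionary, so
nothing is lost by sending them in structured form. Which QMP command
would accept such arguments is a separate question.
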
> +
> +Secondary:
> +  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
> +  -drive if=xxx,driver=replication,mode=secondary,\
> +         file.file.filename=active_disk.qcow2,\
> +         file.driver=qcow2,\
> +         file.backing.file.filename=hidden_disk.qcow2,\
> +         file.backing.driver=qcow2,\
> +         file.backing.backing=colo1
> +
> +  Then run qmp command in secondary qemu:
> +    { 'execute': 'nbd-server-start',
> +      'arguments': {
> +          'addr': {
> +              'type': 'inet',
> +              'data': {
> +                  'host': 'xxx',
> +                  'port': 'xxx'
> +              }
> +          }
> +      }
> +    }
> +    { 'execute': 'nbd-server-add',
> +      'arguments': {
> +          'device': 'colo1',
> +          'writable': true
> +      }
> +    }
> +
> +  Note:
> +  1. The export name in secondary QEMU command line is the secondary
> +     disk's id.
> +  2. The export name for the same disk must be the same
> +  3. The qmp command nbd-server-start and nbd-server-add must be run
> +     before running the qmp command migrate on primary QEMU
> +  4. Active disk, hidden disk and nbd target's length should be the
> +     same.
> +  5. It is better to put active disk and hidden disk in ramdisk.
> +  6. It is all a single argument to -drive, and you should ignore
> +     the leading whitespace.
> +
> +After Failover:
> +Primary:
> +  The secondary host is down, so we should run the following qmp command
> +  to remove the nbd child from the quorum:
> +  { 'execute': 'x-blockdev-change',
> +    'arguments': {
> +        'parent': 'colo1',
> +        'child': 'children.1'
> +    }
> +  }
> +  Note: there is no qmp command to remove the blockdev now
> +
> +Secondary:
> +  The primary host is down, so we should do the following thing:
> +  { 'execute': 'nbd-server-stop' }
> +
> +TODO:
> +1. Continuous block replication
> +2. Shared disk
>

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org