* [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerance through micro-checkpointing
@ 2014-02-18  8:50 mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing mrhines
                   ` (12 more replies)
  0 siblings, 13 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

Changes since v1:

1. Re-based against Juan's improved migration_bitmap performance changes
2. Overhauled RDMA support to prepare for better usage of RDMA in 
   other parts of the QEMU code base (such as storage).
3. Fix for netlink issues that failed to cleanup the network buffer
   device for development testing.

Michael R. Hines (12):
  mc: add documentation for micro-checkpointing
  mc: timestamp migration_bitmap and KVM logdirty usage
  mc: introduce a 'checkpointing' status check into the VCPU states
  mc: support custom page loading and copying
  rdma: accelerated memcpy() support and better external RDMA user
    interfaces
  mc: introduce state machine changes for MC
  mc: introduce additional QMP statistics for micro-checkpointing
  mc: core logic
  mc: configure and makefile support
  mc: expose tunable parameter for checkpointing frequency
  mc: introduce new capabilities to control micro-checkpointing
  mc: activate and use MC if requested

 Makefile.objs                 |    1 +
 arch_init.c                   |   72 +-
 configure                     |   45 +
 cpus.c                        |    9 +-
 docs/mc.txt                   |  222 ++++
 hmp-commands.hx               |   16 +-
 hmp.c                         |   23 +
 hmp.h                         |    1 +
 include/migration/migration.h |   70 +-
 include/migration/qemu-file.h |   55 +-
 migration-checkpoint.c        | 1565 +++++++++++++++++++++++++
 migration-rdma.c              | 2605 +++++++++++++++++++++++++++--------------
 migration.c                   |  156 ++-
 qapi-schema.json              |   86 +-
 qemu-file.c                   |   80 +-
 qmp-commands.hx               |   23 +
 vl.c                          |    9 +
 17 files changed, 4097 insertions(+), 941 deletions(-)
 create mode 100644 docs/mc.txt
 create mode 100644 migration-checkpoint.c

-- 
1.8.1.2


* [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerance through micro-checkpointing mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-18 12:45   ` Dr. David Alan Gilbert
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage mrhines
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
Github: git@github.com:hinesmr/qemu.git, 'mc' branch

NOTE: This is a direct copy of the QEMU wiki page for the convenience
of the review process. Since this series is very much in flux, instead
of maintaining two copies of documentation in two different formats,
this documentation will be properly formatted once the review process
has completed.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/mc.txt | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 docs/mc.txt

diff --git a/docs/mc.txt b/docs/mc.txt
new file mode 100644
index 0000000..5d4b5fe
--- /dev/null
+++ b/docs/mc.txt
@@ -0,0 +1,222 @@
+Micro Checkpointing Specification v1
+==============================================
+Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
+Github: git@github.com:hinesmr/qemu.git, 'mc' branch
+
+Copyright (C) 2014 Michael R. Hines <mrhines@us.ibm.com>
+
+Contents
+1 Summary
+1.1 Contact
+1.2 Introduction
+2 The Micro-Checkpointing Process
+2.1 Basic Algorithm
+2.2 I/O buffering
+2.3 Failure Recovery
+3 Optimizations
+3.1 Memory Management
+3.2 RDMA Integration
+4 Usage
+4.1 BEFORE Running
+4.2 Running
+5 Performance
+6 TODO
+7 FAQ / Frequently Asked Questions
+7.1 What happens if a failure occurs in the *middle* of a flush of the network buffer?
+7.2 What's different about this implementation?
+Summary
+This is an implementation of Micro Checkpointing for memory and cpu state. Also known as: "Continuous Replication" or "Fault Tolerance" or 100 other different names - choose your poison.
+
+Contact
+Name: Michael Hines
+Email: mrhines@us.ibm.com
+Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
+
+Github: http://github.com/hinesmr/qemu.git, 'mc' branch
+
+Libvirt Support: http://github.com/hinesmr/libvirt.git, 'mc' branch
+
+Copyright (C) 2014 IBM Michael R. Hines <mrhines@us.ibm.com>
+
+Introduction
+Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a running virtual machine (VM) with little or no runtime assistance from the guest kernel or guest application software. Furthermore, Fault Tolerance is one method of providing high availability to a VM such that, from the perspective of the outside world (clients, devices, and neighboring VMs that may be paired with it), the VM and its applications have not lost any runtime state in the event of either a failure of the hypervisor/hardware to allow the VM to make forward progress or a complete loss of power. This mechanism for providing fault tolerance does *not* provide any protection whatsoever against software-level faults in the guest kernel or applications. In fact, due to the potentially extended lifetime of the VM because of this type of high availability, such software-level bugs may in fact manifest themselves more often than they otherwise ordinarily would, in which case you would need to employ other, software-level fault-tolerance mechanisms in addition to this one.
+
+This implementation is also fully compatible with RDMA and has undergone special optimizations to support the use of RDMA. (See docs/rdma.txt for more details.)
+
+The Micro-Checkpointing Process
+Basic Algorithm
+Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
+
+1. After N milliseconds, stop the VM.
+2. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
+3. Resume the VM immediately so that it can make forward progress.
+4. Transmit the checkpoint to the destination.
+5. Repeat
+Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
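+
+Below is a minimal, self-contained C sketch of this control loop. All helper names are hypothetical stand-ins; the real series drives this through QEMU's live migration thread rather than a standalone loop:
+
+    #include <stdbool.h>
+    #include <unistd.h>
+
+    void vm_stop(void);                  /* stub: pause all VCPUs */
+    void vm_start(void);                 /* stub: resume all VCPUs */
+    void mc_copy_dirty_to_staging(void); /* stub: generate the MC locally */
+    int  mc_transmit_and_wait_ack(void); /* stub: send MC, wait for ACK */
+
+    void mc_loop(int interval_ms, volatile bool *enabled)
+    {
+        while (*enabled) {
+            usleep(interval_ms * 1000);   /* 1. let the VM run for N ms   */
+            vm_stop();                    /* 2. stop the VM               */
+            mc_copy_dirty_to_staging();   /* 3. copy dirty memory locally */
+            vm_start();                   /* 4. resume immediately        */
+            if (mc_transmit_and_wait_ack() < 0) { /* 5. transmit + ack    */
+                break;                    /* lost the destination: the    */
+            }                             /* recovery path takes over     */
+        }
+    }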
+
+I/O buffering
+Additionally, a MC must include a consistent view of device I/O, particularly the network, a problem commonly referred to as "output commit". This means that the outside world cannot be allowed to experience duplicate state that was committed by the virtual machine after failure. Such duplicate state is possible because the VM runs ahead, committing up to N milliseconds of new state while the current MC is still being transmitted to the destination.
+
+To guard against this problem, first, we must "buffer" the TX output of the network (not the input) between MCs until the current MC is safely received by the destination. For example, all outbound network packets must be held at the source until the MC is transmitted. After transmission is complete, those packets can be released. Similarly, in the case of disk I/O, we must ensure that either the contents of the local disk are safely mirrored to a remote disk before completing a MC or that the output to a shared disk, such as iSCSI, is also buffered between checkpoints and then later released in the same way.
+
+For the network in particular, buffering is performed using a series of netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation. All packets go through netlink in the host kernel - there are no exceptions and no gaps. Even while one buffer is being released (say, after a checkpoint has been saved), another plug will have already been initiated to hold the next round of packets simultaneously while the current round of packets are being released. Thus, at any given time, there may be as many as two simultaneous buffers in place.
+
+With this in mind, here is the extended procedure for the micro checkpointing process:
+
+1. Insert a new Qdisc plug (Buffer A).
+Repeat Forever:
+
+2. After N milliseconds, stop the VM.
+3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
+4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
+5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
+6. Transmit the MC to the destination.
+7. Wait for acknowledgement.
+8. Acknowledged.
+9. Release the Qdisc plug for Buffer A.
+10. Qdisc Buffer B now becomes (is symbolically renamed) the most recent Buffer A
+11. Go back to Step 2
+This implementation *currently* only supports buffering for the network. (Any help implementing disk support would be greatly appreciated.) Until disk buffering or mirroring support is available (ideally through drive-mirror), this lack of disk support requires that the VM's root disk and any other non-ephemeral disks also be made network-accessible directly from within the VM: the only "consistent" way to provide full fault tolerance for the VM's non-ephemeral disks is to construct a VM whose root disk boots directly from iSCSI or NFS or similar, such that all disk I/O is translated into network I/O.
+
+Buffering is performed with the combination of an IFB device attached to the KVM tap device combined with a netlink Qdisc plug (exactly like the Xen remus solution).
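+
+For illustration, here is a sketch of how the two plug operations could map onto libnl3 calls (API usage as we understand it; the qdisc/socket setup against the IFB device, the netlink flags, and error handling are elided and may differ from the actual code in this series):
+
+    #include <netlink/route/qdisc.h>
+    #include <netlink/route/qdisc/plug.h>
+
+    /* Assumed to be initialized elsewhere against the IFB device. */
+    static struct nl_sock *sock;
+    static struct rtnl_qdisc *qdisc;
+
+    static int plug_insert(void)          /* start buffering (new plug) */
+    {
+        rtnl_qdisc_plug_buffer(qdisc);
+        return rtnl_qdisc_add(sock, qdisc, NLM_F_REQUEST);
+    }
+
+    static int plug_release_one(void)     /* release the oldest buffer */
+    {
+        rtnl_qdisc_plug_release_one(qdisc);
+        return rtnl_qdisc_add(sock, qdisc, NLM_F_REQUEST);
+    }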
+
+Failure Recovery
+Due to the high-frequency nature of micro-checkpointing, we expect a new MC to be generated many times per second. Even missing just a few MCs easily constitutes a failure. Missing an MC is nevertheless safe, because device I/O is buffered consistently and is not committed to the outside world until the MC has been received at the destination.
+
+Failure is thus assumed under two conditions:
+
+1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This is detected very early after the loss of the latest MC, not only because a very large number of bytes is typically being sequenced in the TCP stream, but perhaps also because the acknowledgement of receipt of a commit message by the destination times out.
+
+2. MC over RDMA: Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.
+
+In both cases, whether due to a failed TCP socket connection or a lost group of RDMA keep-alive messages, either the sender or the receiver can be deemed to have failed.
+
+If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
+
+If the destination is deemed to be lost, we perform the same action as a live migration: resume the sender normally and wait for management software to make a policy decision about whether or not to re-protect the VM, which may involve a third-party to identify a new destination host again to use as a backup for the VM.
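+
+A trimmed-down C sketch of the keep-alive miss accounting (the interval and miss-threshold constants appear in patch 05 of this series; the callback body itself is illustrative):
+
+    #include <stdint.h>
+
+    #define RDMA_KEEPALIVE_INTERVAL_MS 300
+    #define RDMA_MAX_LOST_KEEPALIVE    10
+
+    struct keepalive_state {
+        uint64_t keepalive;           /* counter written by the peer   */
+        uint64_t last_keepalive;      /* value seen at the last tick   */
+        uint64_t nb_missed_keepalive; /* consecutive missed intervals  */
+    };
+
+    /* Runs every RDMA_KEEPALIVE_INTERVAL_MS; returns nonzero on failure. */
+    static int keepalive_tick(struct keepalive_state *ks)
+    {
+        if (ks->keepalive == ks->last_keepalive) {
+            ks->nb_missed_keepalive++; /* peer's counter did not advance */
+        } else {
+            ks->nb_missed_keepalive = 0;
+        }
+        ks->last_keepalive = ks->keepalive;
+        return ks->nb_missed_keepalive > RDMA_MAX_LOST_KEEPALIVE;
+    }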
+
+Optimizations
+Memory Management
+Managing QEMU memory usage in this implementation is critical to the performance of any micro-checkpointing (MC) implementation.
+
+MCs are typically only a few MB when idle. However, they can easily grow very large during heavy workloads. In the *extreme* worst case, QEMU will need twice the amount of main memory originally allocated to the virtual machine.
+
+To support this variability during transient periods, a MC consists of a linked list of slabs, each of identical size. A better name would be welcome, as "slab" was only chosen because it resembles Linux memory allocation. Because MCs occur several times per second (at a granularity of 10s of milliseconds), slabs allow MCs to grow and shrink without constantly re-allocating all memory in place during each checkpoint. During steady state, the 'head' slab is permanently allocated and never goes away, so when the VM is idle there is no memory allocation at all. This design also supports the use of RDMA: since RDMA requires memory pinning, we must be able to hold on to a slab for a reasonable amount of time to get any real use out of it.
+
+Regardless, the current strategy taken will be:
+
+1. If the checkpoint size increases, then grow the number of slabs to support it.
+2. If the next checkpoint size is smaller than the last one, then that's a "strike".
+3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).
+As of this writing, the average size of a Linux-based Idle-VM checkpoint is under 5MB.
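+
+A short C sketch of the slab bookkeeping and the strike-based shrink rule above (names and the slab size are illustrative, not the ones used in migration-checkpoint.c):
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    typedef struct MCSlab {
+        struct MCSlab *next;            /* linked list of slabs         */
+        size_t used;                    /* bytes used this checkpoint   */
+        uint8_t buf[2 * 1024 * 1024];   /* fixed-size payload           */
+    } MCSlab;
+
+    /* Called once per checkpoint with the current and previous sizes. */
+    static void mc_adjust_cache(size_t this_mc, size_t last_mc,
+                                unsigned *strikes, unsigned max_strikes,
+                                unsigned *nb_slabs)
+    {
+        if (this_mc > last_mc) {
+            *strikes = 0;               /* grew: keep (or add) slabs    */
+        } else if (this_mc < last_mc && ++*strikes >= max_strikes) {
+            *strikes = 0;
+            if (*nb_slabs > 1) {
+                *nb_slabs /= 2;         /* halve, to a minimum of 1     */
+            }
+        }
+    }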
+
+RDMA Integration
+RDMA is instrumental in enabling better MC performance, which is the reason why it was introduced into QEMU first.
+
+RDMA is used for two different reasons:
+
+1. Checkpoint generation (RDMA-based memcpy)
+2. Checkpoint transmission
+Checkpoint generation must be done while the VM is paused. In the worst case, the checkpoint can be equal in size to the total amount of memory in use by the VM. In order to resume VM execution as fast as possible, the checkpoint is first copied consistently into a local staging area before transmission. A standard memcpy() of such a potentially large amount of memory not only gets no use out of the CPU cache but also potentially clogs up the CPU pipeline, which could otherwise be used by neighboring VMs scheduled on the same physical node. To minimize the effect on neighbors, we use RDMA to perform a "local" memcpy(), bypassing the host processor. On more recent processors, a 'beefy' enough memory bus architecture can move memory just as fast (sometimes faster) as a pure-software, CPU-only optimized memcpy() from libc. However, on older computers, this feature only gives you the benefit of lower CPU utilization at the expense of a slower copy.
+
+Checkpoint transmission can potentially also consume very large amounts of both bandwidth and CPU utilization that could otherwise be used by the VM itself or its neighbors. Once the aforementioned local copy of the checkpoint is saved, this implementation makes use of the same RDMA hardware to perform the transmission, exactly the same way a live migration happens over RDMA (see docs/rdma.txt).
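+
+For reference, here is a generic ibverbs sketch of the "local memcpy" idea: a one-sided RDMA write posted to an already-connected (loopback) queue pair, with both buffers pre-registered. Connection setup, completion polling, and chunking are omitted, and none of these names are this series' wrappers:
+
+    #include <stdint.h>
+    #include <infiniband/verbs.h>
+
+    static int rdma_local_copy(struct ibv_qp *qp,
+                               struct ibv_mr *src_mr, void *src,
+                               struct ibv_mr *dst_mr, void *dst, uint32_t len)
+    {
+        struct ibv_sge sge = {
+            .addr   = (uintptr_t)src,
+            .length = len,
+            .lkey   = src_mr->lkey,
+        };
+        struct ibv_send_wr wr = {
+            .opcode              = IBV_WR_RDMA_WRITE,
+            .sg_list             = &sge,
+            .num_sge             = 1,
+            .send_flags          = IBV_SEND_SIGNALED,
+            .wr.rdma.remote_addr = (uintptr_t)dst,
+            .wr.rdma.rkey        = dst_mr->rkey,
+        };
+        struct ibv_send_wr *bad_wr;
+
+        /* The HCA moves the bytes; the host CPU pipeline stays free. */
+        return ibv_post_send(qp, &wr, &bad_wr);
+    }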
+
+Usage
+BEFORE Running
+First, compile QEMU with '--enable-mc' and ensure that the corresponding libraries for netlink (libnl3) are available. The netlink 'plug' support from the Qdisc functionality is required in particular, because it allows QEMU to direct the kernel to buffer outbound network packets between checkpoints as described previously. Do not proceed without this support in a production environment, or you risk corrupting the state of your I/O.
+
+$ git clone http://github.com/hinesmr/qemu.git
+$ git checkout 'mc'
+$ ./configure --enable-mc [other options]
+Next, start the VM that you want to protect using your standard procedures.
+
+Enable MC like this:
+
+QEMU Monitor Command:
+
+$ migrate_set_capability x-mc on # disabled by default
+Currently, only one network interface is supported, *and* you must ensure that the root disk of your VM is booted either directly from iSCSI or NFS, as described previously. This will be rectified with future improvements.
+
+For testing only, you can ignore the aforementioned requirements if you simply want to get an understanding of the performance penalties associated with activating this feature.
+
+Next, you can optionally disable network buffering for additional test-only execution. This is useful if you want a breakdown of the cost of checkpointing memory state alone, without the cost of checkpointing device state.
+
+QEMU Monitor Command:
+
+$ migrate_set_capability mc-net-disable on # buffering activated by default 
+Next, you can optionally enable RDMA 'memcpy' support. This is only valid if you have RDMA support compiled into QEMU and you intend to use the 'rdma' migration URI upon initiating MC as described later.
+
+QEMU Monitor Command:
+
+$ migrate_set_capability mc-rdma-copy on # disabled by default
+Finally, if you are using QEMU's support for RDMA migration, you will want to enable RDMA keep-alive support to allow quick detection of failure. If you are using TCP/IP, this is not required:
+
+QEMU Monitor Command:
+
+$ migrate_set_capability rdma-keepalive on # disabled by default
+Running
+First, make sure the IFB device kernel module is loaded
+
+$ modprobe ifb numifbs=100 # (or some large number)
+Now, install a Qdisc plug to the tap device, using the same naming convention as the tap device created by QEMU (it must be the same, because QEMU needs to interact with the IFB device and the only mechanism we currently have for knowing the name of the IFB device is to assume that it matches the tap device numbering scheme):
+
+$ ip link set up ifb0 # <= corresponds to tap device 'tap0'
+$ tc qdisc add dev tap0 ingress
+$ tc filter add dev tap0 parent ffff: proto ip pref 10 u32 match u32 0 0 action mirred egress redirect dev ifb0
+(You will need a script to automate the part above until the libvirt patches are more complete).
+
+Now that the network buffering connection is ready, MC can be initiated with exactly the same command as standard live migration:
+
+QEMU Monitor Command:
+
+$ migrate -d (tcp|rdma):host:port
+Upon failure, the destination VM will detect a loss in network connectivity and automatically revert to the last checkpoint taken and resume execution immediately. There is no need for additional QEMU monitor commands to initiate the recovery process.
+
+Performance
+By far, the biggest cost is network throughput. Virtual machines are capable of dirtying memory well in excess of the bandwidth provided by a commodity 1 Gbps network link. When that happens, the MC process will always lag behind the virtual machine and forward progress will be poor. It is highly recommended to use at least a 10 Gbps link when using MC.
+
+Numbers are still coming in, but without output buffering of network I/O, the performance penalty of a typical 4GB RAM Java-based application server workload using a 10 Gbps link (a good worst case for testing due to Java's constant garbage collection) is on the order of 25%. With network buffering activated, this can be as high as 50%.
+
+Assuming that you have a reasonable 10G (or RDMA) network in place, the majority of the penalty is due to the time it takes to copy the dirty memory into a staging area before transmission of the checkpoint. Any optimizations / proposals to speed this up would be welcome!
+
+The remaining penalty, which comes from network buffering, is typically due to checkpoints not occurring fast enough: the typical "round trip" time between an application-level transaction request and the corresponding response should ideally be longer than the time it takes to complete a checkpoint. Otherwise, the response to the application within the VM will appear to be congested, since the VM's network endpoint may not even have received the TX request from the application in the first place.
+
+We believe that this effect is "amplified" by the poor performance of copying dirty memory to staging: because an application-level RTT cannot be serviced with more frequent checkpoints, network I/O tends to be held in the buffer too long. This has the effect of causing the guest TCP/IP stack to experience congestion, propagating this artificially created delay all the way up to the application.
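+
+For illustration (numbers hypothetical): if a checkpoint takes 50ms to generate and transmit but the application's natural transaction round trip is 10ms, a response produced just after a checkpoint begins may sit in the network buffer for most of those 50ms before release. The client then observes a round-trip time inflated roughly five-fold, which the guest's TCP/IP stack interprets as congestion even though the guest serviced the request immediately.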
+
+TODO
+1. The main bottleneck is the performance of the local memory copy into staging memory. The faster we can copy, the faster we can flush the network buffer.
+
+2. Implement local disk mirroring by integrating with QEMU's 'drive-mirror' feature in order to fully support virtual machines with local storage.
+
+3. Implement output commit buffering for shared storage.
+
+FAQ / Frequently Asked Questions
+What happens if a failure occurs in the *middle* of a flush of the network buffer?
+Micro-Checkpointing depends *heavily* on the correctness of TCP/IP. Thus, this is not a problem, because the network buffer holds packets only for the last *committed* checkpoint (meaning that the last micro-checkpoint must have been acknowledged as received successfully by the backup host). With that understood, it is then important to understand how network buffering is repeated between checkpoints: *ALL* packets go through the buffer - there are no exceptions and no gaps. There is no situation in which newer packets slip through while the buffer is being flushed - that's not how it works. Please refer to the previous section "I/O buffering" for a detailed description of how network buffering works.
+
+Why is this not a problem?
+
+Example: Let's say we have packets "A" and "B" in the buffer.
+
+Packet A is sent successfully and a failure occurs before packet B is transmitted.
+
+Packet A) This is acceptable. The guest checkpoint has already recorded delivery of the packet from the guest's perspective. The network fabric can deliver or not deliver as it sees fit. Thus the buffer simply has the same effect as an additional network switch: it does not alter fault tolerance as viewed by the external world any more than another faulty hop in a traditional network architecture would. The packet will never get RE-generated, because the checkpoint corresponding to its transmission (from the perspective of the virtual machine) has already been committed at the destination. Any FUTURE packets generated while the VM resumes execution are *also* buffered as described previously.
+
+Packet B) This is acceptable. This packet will be lost. This will result in a TCP-level timeout on the peer side of the connection if packet B was an ACK, or a timeout on the guest side of the connection if the packet was a TCP PUSH. Either way, as soon as the virtual machine resumes execution, the packet will be re-transmitted, because the data was either never acknowledged or never received.
+
+What's different about this implementation?
+Several things about this implementation attempt are different from previous implementations:
+
+1. We are dedicated to seeing this through the community review process and to staying current with the master branch.
+
+2. This implementation is 100% compatible with RDMA.
+
+3. Memory management is completely overhauled - malloc()/free() churn is reduced to a minimum.
+
+4. This is not a port of Kemari. Kemari is obsolete and incompatible with the most recent QEMU.
+
+5. Network I/O buffering is outsourced to the host kernel, using netlink code introduced by the Remus/Xen project.
+
+6. We make every attempt to change as little of the existing migration call path as possible.
-- 
1.8.1.2


* [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerance through micro-checkpointing mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-18 10:32   ` Dr. David Alan Gilbert
  2014-03-11 21:31   ` Juan Quintela
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states mrhines
                   ` (10 subsequent siblings)
  12 siblings, 2 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

We timestamp the migration_bitmap and KVM logdirty synchronization
and later export these statistics over QMP for better
monitoring of micro-checkpointing as the workload changes.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 80574a0..b8364b0 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -193,6 +193,8 @@ typedef struct AccountingInfo {
     uint64_t skipped_pages;
     uint64_t norm_pages;
     uint64_t iterations;
+    uint64_t log_dirty_time;
+    uint64_t migration_bitmap_time;
     uint64_t xbzrle_bytes;
     uint64_t xbzrle_pages;
     uint64_t xbzrle_cache_miss;
@@ -201,7 +203,7 @@ typedef struct AccountingInfo {
 
 static AccountingInfo acct_info;
 
-static void acct_clear(void)
+void acct_clear(void)
 {
     memset(&acct_info, 0, sizeof(acct_info));
 }
@@ -236,6 +238,16 @@ uint64_t norm_mig_pages_transferred(void)
     return acct_info.norm_pages;
 }
 
+uint64_t norm_mig_log_dirty_time(void)
+{
+    return acct_info.log_dirty_time;
+}
+
+uint64_t norm_mig_bitmap_time(void)
+{
+    return acct_info.migration_bitmap_time;
+}
+
 uint64_t xbzrle_mig_bytes_transferred(void)
 {
     return acct_info.xbzrle_bytes;
@@ -426,27 +438,35 @@ static void migration_bitmap_sync(void)
     static int64_t num_dirty_pages_period;
     int64_t end_time;
     int64_t bytes_xfer_now;
+    int64_t begin_time;
+    int64_t dirty_time;
 
     if (!bytes_xfer_prev) {
         bytes_xfer_prev = ram_bytes_transferred();
     }
 
+    begin_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     if (!start_time) {
         start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
     }
-
     trace_migration_bitmap_sync_start();
     address_space_sync_dirty_bitmap(&address_space_memory);
 
+    dirty_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
     QTAILQ_FOREACH(block, &ram_list.blocks, next) {
         migration_bitmap_sync_range(block->mr->ram_addr, block->length);
     }
+
     trace_migration_bitmap_sync_end(migration_dirty_pages
                                     - num_dirty_pages_init);
     num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
     end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
 
-    /* more than 1 second = 1000 millisecons */
+    acct_info.log_dirty_time += dirty_time - begin_time;
+    acct_info.migration_bitmap_time += end_time - dirty_time;
+
+    /* more than 1 second = 1000 milliseconds */
     if (end_time > start_time + 1000) {
         if (migrate_auto_converge()) {
             /* The following detection logic can be refined later. For now:
@@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             /* XBZRLE overflow or normal page */
             if (bytes_sent == -1) {
                 bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
-                qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
-                bytes_sent += TARGET_PAGE_SIZE;
-                acct_info.norm_pages++;
+                if (ret != RAM_SAVE_CONTROL_DELAYED) {
+                    qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
+                    bytes_sent += TARGET_PAGE_SIZE;
+                    acct_info.norm_pages++;
+                }
             }
 
             /* if page is unmodified, continue to the next */
-- 
1.8.1.2


* [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerance through micro-checkpointing mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-03-11 21:36   ` Juan Quintela
  2014-03-11 21:40   ` Eric Blake
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 04/12] mc: support custom page loading and copying mrhines
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

During micro-checkpointing, the VCPUs get repeatedly paused and
resumed. We need to not freak out when the VM begins micro-checkpointing.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 cpus.c                        |  9 ++++++++-
 include/migration/migration.h | 21 +++++++++++++++++++++
 qapi-schema.json              |  4 +++-
 vl.c                          |  7 +++++++
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index 945d85b..b876d6b 100644
--- a/cpus.c
+++ b/cpus.c
@@ -532,7 +532,14 @@ static int do_vm_stop(RunState state)
         pause_all_vcpus();
         runstate_set(state);
         vm_state_notify(0, state);
-        monitor_protocol_event(QEVENT_STOP, NULL);
+        /*
+         * If MC is enabled, libvirt gets confused
+         * because it thinks the VM is stopped when
+         * it's just being micro-checkpointed.
+         */
+        if (state != RUN_STATE_CHECKPOINT_VM) {
+            monitor_protocol_event(QEVENT_STOP, NULL);
+        }
     }
 
     bdrv_drain_all();
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 3e1e6c7..9c62e2f 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -121,10 +121,31 @@ uint64_t skipped_mig_bytes_transferred(void);
 uint64_t skipped_mig_pages_transferred(void);
 uint64_t norm_mig_bytes_transferred(void);
 uint64_t norm_mig_pages_transferred(void);
+uint64_t norm_mig_log_dirty_time(void);
+uint64_t norm_mig_bitmap_time(void);
 uint64_t xbzrle_mig_bytes_transferred(void);
 uint64_t xbzrle_mig_pages_transferred(void);
 uint64_t xbzrle_mig_pages_overflow(void);
 uint64_t xbzrle_mig_pages_cache_miss(void);
+void acct_clear(void);
+
+void migrate_set_state(MigrationState *s, int old_state, int new_state);
+
+enum {
+    MIG_STATE_ERROR = -1,
+    MIG_STATE_NONE,
+    MIG_STATE_SETUP,
+    MIG_STATE_CANCELLED,
+    MIG_STATE_CANCELLING,
+    MIG_STATE_ACTIVE,
+    MIG_STATE_CHECKPOINTING,
+    MIG_STATE_COMPLETED,
+};
+
+int mc_enable_buffering(void);
+int mc_start_buffer(void);
+void mc_init_checkpointer(MigrationState *s);
+void mc_process_incoming_checkpoints_if_requested(QEMUFile *f);
 
 void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
 
diff --git a/qapi-schema.json b/qapi-schema.json
index 7cfb5e5..3c2ee4d 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -169,6 +169,8 @@
 #
 # @save-vm: guest is paused to save the VM state
 #
+# @checkpoint-vm: guest is paused to checkpoint the VM state
+#
 # @shutdown: guest is shut down (and -no-shutdown is in use)
 #
 # @suspended: guest is suspended (ACPI S3)
@@ -181,7 +183,7 @@
   'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused',
             'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm',
             'running', 'save-vm', 'shutdown', 'suspended', 'watchdog',
-            'guest-panicked' ] }
+            'guest-panicked', 'checkpoint-vm' ] }
 
 ##
 # @SnapshotInfo
diff --git a/vl.c b/vl.c
index 316de54..2fb5b1f 100644
--- a/vl.c
+++ b/vl.c
@@ -552,6 +552,7 @@ static const RunStateTransition runstate_transitions_def[] = {
 
     { RUN_STATE_PAUSED, RUN_STATE_RUNNING },
     { RUN_STATE_PAUSED, RUN_STATE_FINISH_MIGRATE },
+    { RUN_STATE_PAUSED, RUN_STATE_CHECKPOINT_VM },
 
     { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE },
@@ -562,14 +563,18 @@ static const RunStateTransition runstate_transitions_def[] = {
 
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_RUNNING },
     { RUN_STATE_FINISH_MIGRATE, RUN_STATE_POSTMIGRATE },
+    { RUN_STATE_FINISH_MIGRATE, RUN_STATE_CHECKPOINT_VM },
 
     { RUN_STATE_RESTORE_VM, RUN_STATE_RUNNING },
 
+    { RUN_STATE_CHECKPOINT_VM, RUN_STATE_RUNNING },
+
     { RUN_STATE_RUNNING, RUN_STATE_DEBUG },
     { RUN_STATE_RUNNING, RUN_STATE_INTERNAL_ERROR },
     { RUN_STATE_RUNNING, RUN_STATE_IO_ERROR },
     { RUN_STATE_RUNNING, RUN_STATE_PAUSED },
     { RUN_STATE_RUNNING, RUN_STATE_FINISH_MIGRATE },
+    { RUN_STATE_RUNNING, RUN_STATE_CHECKPOINT_VM },
     { RUN_STATE_RUNNING, RUN_STATE_RESTORE_VM },
     { RUN_STATE_RUNNING, RUN_STATE_SAVE_VM },
     { RUN_STATE_RUNNING, RUN_STATE_SHUTDOWN },
@@ -585,9 +590,11 @@ static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_RUNNING, RUN_STATE_SUSPENDED },
     { RUN_STATE_SUSPENDED, RUN_STATE_RUNNING },
     { RUN_STATE_SUSPENDED, RUN_STATE_FINISH_MIGRATE },
+    { RUN_STATE_SUSPENDED, RUN_STATE_CHECKPOINT_VM },
 
     { RUN_STATE_WATCHDOG, RUN_STATE_RUNNING },
     { RUN_STATE_WATCHDOG, RUN_STATE_FINISH_MIGRATE },
+    { RUN_STATE_WATCHDOG, RUN_STATE_CHECKPOINT_VM },
 
     { RUN_STATE_GUEST_PANICKED, RUN_STATE_RUNNING },
     { RUN_STATE_GUEST_PANICKED, RUN_STATE_FINISH_MIGRATE },
-- 
1.8.1.2


* [Qemu-devel] [RFC PATCH v2 04/12] mc: support custom page loading and copying
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerance through micro-checkpointing mrhines
                   ` (2 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 05/12] rdma: accelerated memcpy() support and better external RDMA user interfaces mrhines
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

Just as RDMA has custom routines for saving memory,
this provides RDMA with custom routines for loading
and copying memory as well.

Micro-checkpointing needs this support in order to modify
arch_init.c as little as possible while still being able
to load RDMA-based memory from checkpoints in a
performance-optimal way as they are received from the network.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c                   |  9 +++--
 include/migration/migration.h | 33 ++++++++++++++++--
 include/migration/qemu-file.h | 54 +++++++++++++++++++++++++++--
 qemu-file.c                   | 80 +++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 167 insertions(+), 9 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index b8364b0..db75120 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -540,7 +540,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             /* In doubt sent page as normal */
             bytes_sent = -1;
             ret = ram_control_save_page(f, block->offset,
-                               offset, TARGET_PAGE_SIZE, &bytes_sent);
+                       block->host, offset, TARGET_PAGE_SIZE, &bytes_sent);
 
             if (ret != RAM_SAVE_CONTROL_NOT_SUPP) {
                 if (ret != RAM_SAVE_CONTROL_DELAYED) {
@@ -1004,13 +1004,18 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
             ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
         } else if (flags & RAM_SAVE_FLAG_PAGE) {
             void *host;
+            int r;
 
             host = host_from_stream_offset(f, addr, flags);
             if (!host) {
                 return -EINVAL;
             }
 
-            qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+            r = ram_control_load_page(f, host, TARGET_PAGE_SIZE);
+
+            if (r == RAM_LOAD_CONTROL_NOT_SUPP) {
+                qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+            }
         } else if (flags & RAM_SAVE_FLAG_XBZRLE) {
             void *host = host_from_stream_offset(f, addr, flags);
             if (!host) {
diff --git a/include/migration/migration.h b/include/migration/migration.h
index 9c62e2f..5c1a574 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -190,9 +190,38 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags);
 
 #define RAM_SAVE_CONTROL_NOT_SUPP -1000
 #define RAM_SAVE_CONTROL_DELAYED  -2000
+#define RAM_LOAD_CONTROL_NOT_SUPP -3000
+#define RAM_LOAD_CONTROL_DELAYED  -4000
+#define RAM_COPY_CONTROL_NOT_SUPP -5000
+#define RAM_COPY_CONTROL_DELAYED  -6000
 
-size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                             ram_addr_t offset, size_t size,
+#define RDMA_CONTROL_VERSION_CURRENT 1
+
+int ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
+                             uint8_t *host_addr,
+                             ram_addr_t offset, long size,
                              int *bytes_sent);
 
+int ram_control_load_page(QEMUFile *f,
+                             void *host_addr,
+                             long size);
+
+int ram_control_copy_page(QEMUFile *f, 
+                             ram_addr_t block_offset_dest,
+                             ram_addr_t offset_dest,
+                             ram_addr_t block_offset_source,
+                             ram_addr_t offset_source,
+                             long size);
+
+int migrate_use_mc(void);
+int migrate_use_mc_net(void);
+int migrate_use_mc_rdma_copy(void);
+
+#define MC_VERSION 1
+
+int mc_info_load(QEMUFile *f, void *opaque, int version_id);
+void mc_info_save(QEMUFile *f, void *opaque);
+
+void qemu_rdma_info_save(QEMUFile *f, void *opaque);
+int qemu_rdma_info_load(QEMUFile *f, void *opaque, int version_id);
 #endif
diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index a191fb6..c50de0d 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -71,17 +71,63 @@ typedef int (QEMURamHookFunc)(QEMUFile *f, void *opaque, uint64_t flags);
 #define RAM_CONTROL_ROUND    1
 #define RAM_CONTROL_HOOK     2
 #define RAM_CONTROL_FINISH   3
+#define RAM_CONTROL_FLUSH    4
 
 /*
  * This function allows override of where the RAM page
  * is saved (such as RDMA, for example.)
  */
-typedef size_t (QEMURamSaveFunc)(QEMUFile *f, void *opaque,
+typedef int (QEMURamSaveFunc)(QEMUFile *f, void *opaque,
                                ram_addr_t block_offset,
+                               uint8_t *host_addr,
                                ram_addr_t offset,
-                               size_t size,
+                               long size,
                                int *bytes_sent);
 
+/*
+ * This function allows override of where the RAM page
+ * is saved (such as RDMA, for example.)
+ */
+typedef int (QEMURamLoadFunc)(QEMUFile *f,
+                               void *opaque,
+                               void *host_addr,
+                               long size);
+
+/*
+ * This function allows *local* RDMA copying memory between two registered
+ * RAMBlocks, both real ones as well as private memory areas independently
+ * registered by external callers (such as MC). If RDMA is not available,
+ * then this function does nothing and the caller should just use memcpy().
+ */
+typedef int (QEMURamCopyFunc)(QEMUFile *f, void *opaque,
+                               ram_addr_t block_offset_dest,
+                               ram_addr_t offset_dest,
+                               ram_addr_t block_offset_source,
+                               ram_addr_t offset_source,
+                               long size);
+
+/* 
+ * Inform the underlying transport of a new virtual memory area.
+ * If this area is an actual RAMBlock, then pass the corresponding
+ * parameters of that block.
+ * If this area is an arbitrary virtual memory address, then
+ * pass the same value for both @host_addr and @block_offset.
+ */
+typedef int (QEMURamAddFunc)(QEMUFile *f, void *opaque,
+                               void *host_addr,
+                               ram_addr_t block_offset,
+                               uint64_t length);
+
+/* 
+ * Remove an underlying new virtual memory area.
+ * If this area is an actual RAMBlock, then pass the corresponding
+ * parameters of that block.
+ * If this area is an arbitrary virtual memory address, then
+ * pass the same value for both @host_addr and @block_offset.
+ */
+typedef int (QEMURamRemoveFunc)(QEMUFile *f, void *opaque,
+                               ram_addr_t block_offset);
+
 typedef struct QEMUFileOps {
     QEMUFilePutBufferFunc *put_buffer;
     QEMUFileGetBufferFunc *get_buffer;
@@ -92,6 +138,10 @@ typedef struct QEMUFileOps {
     QEMURamHookFunc *after_ram_iterate;
     QEMURamHookFunc *hook_ram_load;
     QEMURamSaveFunc *save_page;
+    QEMURamLoadFunc *load_page;
+    QEMURamCopyFunc *copy_page;
+    QEMURamAddFunc *add;
+    QEMURamRemoveFunc *remove;
 } QEMUFileOps;
 
 QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
diff --git a/qemu-file.c b/qemu-file.c
index 9473b67..3d7428f 100644
--- a/qemu-file.c
+++ b/qemu-file.c
@@ -501,14 +501,17 @@ void ram_control_load_hook(QEMUFile *f, uint64_t flags)
     }
 }
 
-size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
-                         ram_addr_t offset, size_t size, int *bytes_sent)
+int ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
+                         uint8_t *host_addr,
+                         ram_addr_t offset, long size, int *bytes_sent)
 {
     if (f->ops->save_page) {
         int ret = f->ops->save_page(f, f->opaque, block_offset,
+                                    host_addr,
                                     offset, size, bytes_sent);
 
-        if (ret != RAM_SAVE_CONTROL_DELAYED) {
+        if (ret != RAM_SAVE_CONTROL_DELAYED
+                && ret != RAM_SAVE_CONTROL_NOT_SUPP) {
             if (bytes_sent && *bytes_sent > 0) {
                 qemu_update_position(f, *bytes_sent);
             } else if (ret < 0) {
@@ -522,6 +525,77 @@ size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset,
     return RAM_SAVE_CONTROL_NOT_SUPP;
 }
 
+int ram_control_load_page(QEMUFile *f, void *host_addr, long size)
+{
+    if (f->ops->load_page) {
+        int ret = f->ops->load_page(f, f->opaque, host_addr, size);
+
+        if (ret != RAM_LOAD_CONTROL_DELAYED 
+                && ret != RAM_LOAD_CONTROL_NOT_SUPP) {
+            if (ret < 0) {
+                qemu_file_set_error(f, ret);
+            }
+        }
+
+        return ret;
+    }
+
+    return RAM_LOAD_CONTROL_NOT_SUPP;
+}
+
+int ram_control_copy_page(QEMUFile *f, 
+                             ram_addr_t block_offset_dest,
+                             ram_addr_t offset_dest,
+                             ram_addr_t block_offset_source,
+                             ram_addr_t offset_source,
+                             long size)
+{
+    if (f->ops->copy_page) {
+        int ret = f->ops->copy_page(f, f->opaque,
+                                    block_offset_dest,
+                                    offset_dest,
+                                    block_offset_source,
+                                    offset_source,
+                                    size);
+
+        if (ret != RAM_COPY_CONTROL_DELAYED) {
+            if (ret < 0) {
+                qemu_file_set_error(f, ret);
+            }
+        }
+
+        return ret;
+    }
+
+    return RAM_COPY_CONTROL_NOT_SUPP;
+}
+
+
+void ram_control_add(QEMUFile *f, void *host_addr,
+                         ram_addr_t block_offset, uint64_t length)
+{
+    int ret = 0;
+
+    if (f->ops->add) {
+        ret = f->ops->add(f, f->opaque, host_addr, block_offset, length);
+        if (ret < 0) {
+            qemu_file_set_error(f, ret);
+        }
+    }
+}
+
+void ram_control_remove(QEMUFile *f, ram_addr_t block_offset)
+{
+    int ret = 0;
+
+    if (f->ops->remove) {
+        ret = f->ops->remove(f, f->opaque, block_offset);
+        if (ret < 0) {
+            qemu_file_set_error(f, ret);
+        }
+    }
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
-- 
1.8.1.2


* [Qemu-devel] [RFC PATCH v2 05/12] rdma: accelerated memcpy() support and better external RDMA user interfaces
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerance through micro-checkpointing mrhines
                   ` (3 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 04/12] mc: support custom page loading and copying mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC mrhines
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

This patch implements the ability to provide accelerated
memory copies as an alternative to memcpy() using RDMA.

It requires the user to "register" the src and dest memory
with the RDMA subsystem using new interfaces and then call
a special copy function that is introduced to submit the
copy to the hardware.

This also significantly re-works the RDMA code itself such that
it is more agile: Users can call into the RDMA code from arbitrary
parts of the QEMU code base to create new connections or send
memory across an existing connection. (While this has not yet been
tested with, say, storage or something similar, the hard work of
fixing the RDMA plumbing to make it work has mostly been done.)

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/migration.h |    6 +
 migration-rdma.c              | 2605 +++++++++++++++++++++++++++--------------
 2 files changed, 1749 insertions(+), 862 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index 5c1a574..a7c54fe 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -180,7 +180,13 @@ int64_t xbzrle_cache_resize(int64_t new_size);
 void ram_control_before_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_after_iterate(QEMUFile *f, uint64_t flags);
 void ram_control_load_hook(QEMUFile *f, uint64_t flags);
+void ram_control_add(QEMUFile *f, void *host_addr,
+                         ram_addr_t block_offset, uint64_t length);
+void ram_control_remove(QEMUFile *f, ram_addr_t block_offset);
 
+#define MBPS(bytes, time) time ? ((((double) bytes * 8)         \
+        / ((double) time / 1000.0)) / 1000.0 / 1000.0) : 0.0
+        
 /* Whenever this is found in the data stream, the flags
  * will be passed to ram_control_load_hook in the incoming-migration
  * side. This lets before_ram_iterate/after_ram_iterate add
diff --git a/migration-rdma.c b/migration-rdma.c
index f94f3b4..56d1d39 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -31,28 +31,39 @@
 //#define DEBUG_RDMA_VERBOSE
 //#define DEBUG_RDMA_REALLY_VERBOSE
 
+/*
+ * Ability to runtime-enable debug statements while inside GDB.
+ * Choices are 1, 2, or 3 (so far).
+ */
+#if !defined(DEBUG_RDMA) || !defined(DEBUG_RDMA_VERBOSE) || \
+    !defined(DEBUG_RDMA_REALLY_VERBOSE)
+static int rdma_debug = 0;
+#endif
+
+#define RPRINTF(fmt, ...) printf("rdma: " fmt, ## __VA_ARGS__)
+
 #ifdef DEBUG_RDMA
 #define DPRINTF(fmt, ...) \
-    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+    do { RPRINTF(fmt, ## __VA_ARGS__); } while (0)
 #else
 #define DPRINTF(fmt, ...) \
-    do { } while (0)
+    do { if (rdma_debug >= 1) RPRINTF(fmt, ## __VA_ARGS__); } while (0)
 #endif
 
 #ifdef DEBUG_RDMA_VERBOSE
 #define DDPRINTF(fmt, ...) \
-    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+    do { RPRINTF(fmt, ## __VA_ARGS__); } while (0)
 #else
 #define DDPRINTF(fmt, ...) \
-    do { } while (0)
+    do { if (rdma_debug >= 2) RPRINTF(fmt, ## __VA_ARGS__); } while (0)
 #endif
 
 #ifdef DEBUG_RDMA_REALLY_VERBOSE
 #define DDDPRINTF(fmt, ...) \
-    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+    do { RPRINTF(fmt, ## __VA_ARGS__); } while (0)
 #else
 #define DDDPRINTF(fmt, ...) \
-    do { } while (0)
+    do { if (rdma_debug >= 3) RPRINTF(fmt, ## __VA_ARGS__); } while (0)
 #endif
 
 /*
@@ -60,17 +71,20 @@
  */
 #define ERROR(errp, fmt, ...) \
     do { \
+        Error **e = errp; \
         fprintf(stderr, "RDMA ERROR: " fmt "\n", ## __VA_ARGS__); \
-        if (errp && (*(errp) == NULL)) { \
-            error_setg(errp, "RDMA ERROR: " fmt, ## __VA_ARGS__); \
+        if (e && ((*e) == NULL)) { \
+            error_setg(e, "RDMA ERROR: " fmt, ## __VA_ARGS__); \
         } \
     } while (0)
 
+#define SET_ERROR(rdma, err) if (!rdma->error_state) rdma->error_state = err
+
 #define RDMA_RESOLVE_TIMEOUT_MS 10000
 
 /* Do not merge data if larger than this. */
 #define RDMA_MERGE_MAX (2 * 1024 * 1024)
-#define RDMA_SIGNALED_SEND_MAX (RDMA_MERGE_MAX / 4096)
+#define RDMA_SEND_MAX (RDMA_MERGE_MAX / 4096)
 
 #define RDMA_REG_CHUNK_SHIFT 20 /* 1 MB */
 
@@ -87,18 +101,30 @@
  */
 #define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
 #define RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE 4096
-
-#define RDMA_CONTROL_VERSION_CURRENT 1
 /*
  * Capabilities for negotiation.
  */
 #define RDMA_CAPABILITY_PIN_ALL 0x01
+#define RDMA_CAPABILITY_KEEPALIVE 0x02
+
+/*
+ * Max # missed keepalive before we assume remote side is unavailable.
+ */
+#define RDMA_CONNECTION_INTERVAL_MS 300
+#define RDMA_KEEPALIVE_INTERVAL_MS 300
+#define RDMA_KEEPALIVE_FIRST_MISSED_OFFSET 1000
+#define RDMA_MAX_LOST_KEEPALIVE 10
+#define RDMA_MAX_STARTUP_MISSED_KEEPALIVE 400
 
 /*
  * Add the other flags above to this list of known capabilities
  * as they are introduced.
  */
-static uint32_t known_capabilities = RDMA_CAPABILITY_PIN_ALL;
+static uint32_t known_capabilities = RDMA_CAPABILITY_PIN_ALL
+                                   | RDMA_CAPABILITY_KEEPALIVE
+                                   ;
+static QEMUTimer *connection_timer = NULL;
+static QEMUTimer *keepalive_timer = NULL;
 
 #define CHECK_ERROR_STATE() \
     do { \
@@ -143,14 +169,18 @@ static uint32_t known_capabilities = RDMA_CAPABILITY_PIN_ALL;
  */
 enum {
     RDMA_WRID_NONE = 0,
-    RDMA_WRID_RDMA_WRITE = 1,
+    RDMA_WRID_RDMA_WRITE_REMOTE = 1,
+    RDMA_WRID_RDMA_WRITE_LOCAL = 2,
+    RDMA_WRID_RDMA_KEEPALIVE = 3,
     RDMA_WRID_SEND_CONTROL = 2000,
     RDMA_WRID_RECV_CONTROL = 4000,
 };
 
 const char *wrid_desc[] = {
     [RDMA_WRID_NONE] = "NONE",
-    [RDMA_WRID_RDMA_WRITE] = "WRITE RDMA",
+    [RDMA_WRID_RDMA_WRITE_REMOTE] = "WRITE RDMA REMOTE",
+    [RDMA_WRID_RDMA_WRITE_LOCAL] = "WRITE RDMA LOCAL",
+    [RDMA_WRID_RDMA_KEEPALIVE] = "KEEPALIVE",
     [RDMA_WRID_SEND_CONTROL] = "CONTROL SEND",
     [RDMA_WRID_RECV_CONTROL] = "CONTROL RECV",
 };
@@ -216,21 +246,41 @@ typedef struct {
 /*
  * Negotiate RDMA capabilities during connection-setup time.
  */
-typedef struct {
+typedef struct QEMU_PACKED RDMACapabilities {
     uint32_t version;
     uint32_t flags;
+    uint32_t keepalive_rkey;
+    uint64_t keepalive_addr;
 } RDMACapabilities;
 
+static uint64_t htonll(uint64_t v)
+{
+    union { uint32_t lv[2]; uint64_t llv; } u;
+    u.lv[0] = htonl(v >> 32);
+    u.lv[1] = htonl(v & 0xFFFFFFFFULL);
+    return u.llv;
+}
+
+static uint64_t ntohll(uint64_t v) {
+    union { uint32_t lv[2]; uint64_t llv; } u;
+    u.llv = v;
+    return ((uint64_t)ntohl(u.lv[0]) << 32) | (uint64_t) ntohl(u.lv[1]);
+}
+
 static void caps_to_network(RDMACapabilities *cap)
 {
     cap->version = htonl(cap->version);
     cap->flags = htonl(cap->flags);
+    cap->keepalive_rkey = htonl(cap->keepalive_rkey);
+    cap->keepalive_addr = htonll(cap->keepalive_addr);
 }
 
 static void network_to_caps(RDMACapabilities *cap)
 {
     cap->version = ntohl(cap->version);
     cap->flags = ntohl(cap->flags);
+    cap->keepalive_rkey = ntohl(cap->keepalive_rkey);
+    cap->keepalive_addr = ntohll(cap->keepalive_addr);
 }
 
 /*
@@ -245,11 +295,15 @@ typedef struct RDMALocalBlock {
     uint64_t remote_host_addr; /* remote virtual address */
     uint64_t offset;
     uint64_t length;
-    struct   ibv_mr **pmr;     /* MRs for chunk-level registration */
-    struct   ibv_mr *mr;       /* MR for non-chunk-level registration */
-    uint32_t *remote_keys;     /* rkeys for chunk-level registration */
-    uint32_t remote_rkey;      /* rkeys for non-chunk-level registration */
-    int      index;            /* which block are we */
+    struct ibv_mr **pmr;      /* MRs for remote chunk-level registration */
+    struct ibv_mr *mr;        /* MR for non-chunk-level registration */
+    struct ibv_mr **pmr_src;  /* MRs for copy chunk-level registration */
+    struct ibv_mr *mr_src;    /* MR for copy non-chunk-level registration */
+    struct ibv_mr **pmr_dest; /* MRs for copy chunk-level registration */
+    struct ibv_mr *mr_dest;   /* MR for copy non-chunk-level registration */
+    uint32_t *remote_keys;    /* rkeys for chunk-level registration */
+    uint32_t remote_rkey;     /* rkeys for non-chunk-level registration */
+    int      index;           /* which block are we */
     bool     is_ram_block;
     int      nb_chunks;
     unsigned long *transit_bitmap;
@@ -271,20 +325,6 @@ typedef struct QEMU_PACKED RDMARemoteBlock {
     uint32_t padding;
 } RDMARemoteBlock;
 
-static uint64_t htonll(uint64_t v)
-{
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.lv[0] = htonl(v >> 32);
-    u.lv[1] = htonl(v & 0xFFFFFFFFULL);
-    return u.llv;
-}
-
-static uint64_t ntohll(uint64_t v) {
-    union { uint32_t lv[2]; uint64_t llv; } u;
-    u.llv = v;
-    return ((uint64_t)ntohl(u.lv[0]) << 32) | (uint64_t) ntohl(u.lv[1]);
-}
-
 static void remote_block_to_network(RDMARemoteBlock *rb)
 {
     rb->remote_host_addr = htonll(rb->remote_host_addr);
@@ -313,15 +353,74 @@ typedef struct RDMALocalBlocks {
 } RDMALocalBlocks;
 
 /*
+ * We provide RDMA to QEMU by way of 2 mechanisms:
+ *
+ * 1. Local copy to remote copy
+ * 2. Local copy to local copy - like memcpy().
+ *
+ * Three instances of this structure are maintained inside of RDMAContext
+ * to manage both mechanisms.
+ */
+typedef struct RDMACurrentChunk {
+    /* store info about current buffer so that we can
+       merge it with future sends */
+    uint64_t current_addr;
+    uint64_t current_length;
+    /* index of ram block the current buffer belongs to */
+    int current_block_idx;
+    /* index of the chunk in the current ram block */
+    int current_chunk;
+
+    uint64_t block_offset;
+    uint64_t offset;
+
+    /* parameters for qemu_rdma_write() */
+    uint64_t chunk_idx;
+    uint8_t *chunk_start;
+    uint8_t *chunk_end;
+    RDMALocalBlock *block;
+    uint8_t *addr;
+    uint64_t chunks;
+} RDMACurrentChunk;
+
+/*
+ * Three copies of the following structure are used to hold the infiniband
+ * connection variables for each of the aforementioned mechanisms, one for
+ * the remote copy and two for local copies.
+ */
+typedef struct RDMALocalContext {
+    bool source;
+    bool dest;
+    bool connected;
+    char *host;
+    int port;
+    struct rdma_cm_id *cm_id;
+    struct rdma_cm_id *listen_id;
+    struct rdma_event_channel *channel;
+    struct ibv_context *verbs;
+    struct ibv_pd *pd;
+    struct ibv_comp_channel *comp_chan;
+    struct ibv_cq *cq;
+    struct ibv_qp *qp;
+    int nb_sent;
+    int64_t start_time;
+    int max_nb_sent;
+    const char * id_str;
+} RDMALocalContext;
+
+/*
  * Main data structure for RDMA state.
  * While there is only one copy of this structure being allocated right now,
  * this is the place where one would start if you wanted to consider
  * having more than one RDMA connection open at the same time.
+ *
+ * It is used for performing both local and remote RDMA operations
+ * with a single RDMA connection.
+ *
+ * Local operations are done by allocating separate queue pairs after
+ * the initial RDMA remote connection is initialized.
  */
 typedef struct RDMAContext {
-    char *host;
-    int port;
-
     RDMAWorkRequestData wr_data[RDMA_WRID_MAX];
 
     /*
@@ -333,37 +432,15 @@ typedef struct RDMAContext {
      */
     int control_ready_expected;
 
-    /* number of outstanding writes */
+    /* number of posts */
     int nb_sent;
 
-    /* store info about current buffer so that we can
-       merge it with future sends */
-    uint64_t current_addr;
-    uint64_t current_length;
-    /* index of ram block the current buffer belongs to */
-    int current_index;
-    /* index of the chunk in the current ram block */
-    int current_chunk;
+    RDMACurrentChunk chunk_remote;
+    RDMACurrentChunk chunk_local_src;
+    RDMACurrentChunk chunk_local_dest;
 
     bool pin_all;
-
-    /*
-     * infiniband-specific variables for opening the device
-     * and maintaining connection state and so forth.
-     *
-     * cm_id also has ibv_context, rdma_event_channel, and ibv_qp in
-     * cm_id->verbs, cm_id->channel, and cm_id->qp.
-     */
-    struct rdma_cm_id *cm_id;               /* connection manager ID */
-    struct rdma_cm_id *listen_id;
-    bool connected;
-
-    struct ibv_context          *verbs;
-    struct rdma_event_channel   *channel;
-    struct ibv_qp *qp;                      /* queue pair */
-    struct ibv_comp_channel *comp_channel;  /* completion channel */
-    struct ibv_pd *pd;                      /* protection domain */
-    struct ibv_cq *cq;                      /* completion queue */
+    bool do_keepalive;
 
     /*
      * If a previous write failed (perhaps because of a failed
@@ -384,17 +461,150 @@ typedef struct RDMAContext {
      * Then use coroutine yield function.
      * Source runs in a thread, so we don't care.
      */
-    int migration_started_on_destination;
+    bool migration_started;
 
     int total_registrations;
     int total_writes;
 
     int unregister_current, unregister_next;
-    uint64_t unregistrations[RDMA_SIGNALED_SEND_MAX];
+    uint64_t unregistrations[RDMA_SEND_MAX];
 
     GHashTable *blockmap;
+
+    uint64_t keepalive;
+    uint64_t last_keepalive;
+    uint64_t nb_missed_keepalive;
+    uint64_t next_keepalive;
+    struct ibv_mr *keepalive_mr;
+    struct ibv_mr *next_keepalive_mr;
+    uint32_t keepalive_rkey;
+    uint64_t keepalive_addr;
+    bool keepalive_startup;
+
+    RDMALocalContext lc_src;
+    RDMALocalContext lc_dest;
+    RDMALocalContext lc_remote;
+
+    /* who are we? */
+    bool source;
+    bool dest;
 } RDMAContext;
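
For orientation, the choice among the three RDMALocalContext instances above
reduces to a single expression in qemu_rdma_write() later in this patch; a
minimal sketch (the helper name is invented for illustration):

    static RDMALocalContext *pick_lc(RDMAContext *rdma, bool copy)
    {
        /* Local copies get a dedicated QP; everything else goes remote. */
        if (migrate_use_mc_rdma_copy() && copy) {
            return rdma->source ? &rdma->lc_src : &rdma->lc_dest;
        }
        return &rdma->lc_remote;
    }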
 
+static void close_ibv(RDMAContext *rdma, RDMALocalContext *lc)
+{
+    if (lc->cm_id && lc->qp) {
+        struct ibv_qp_attr attr = { .qp_state = IBV_QPS_ERR };
+        ibv_modify_qp(lc->qp, &attr, IBV_QP_STATE);
+        rdma_destroy_qp(lc->cm_id);
+        lc->qp = NULL;
+    }
+
+    if (lc->cq) {
+        ibv_destroy_cq(lc->cq);
+        lc->cq = NULL;
+    }
+
+    if (lc->comp_chan) {
+        ibv_destroy_comp_channel(lc->comp_chan);
+        lc->comp_chan = NULL;
+    }
+
+    if (lc->pd) {
+        ibv_dealloc_pd(lc->pd);
+        lc->pd = NULL;
+    }
+
+    if (lc->listen_id) {
+        rdma_destroy_id(lc->listen_id);
+        lc->listen_id = NULL;
+    }
+
+    if (lc->cm_id) {
+        rdma_destroy_id(lc->cm_id);
+        lc->cm_id = NULL;
+    }
+
+    if (lc->verbs) {
+        ibv_close_device(lc->verbs);
+        lc->verbs = NULL;
+    }
+
+    if (lc->channel) {
+        rdma_destroy_event_channel(lc->channel);
+        lc->channel = NULL;
+    }
+
+    g_free(lc->host);
+    lc->host = NULL;
+}
+
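Teardown must release the QP before its CQ and protection domain, which is
why the QP block comes first above. A minimal usage sketch for shutting down
all three contexts (assuming each was opened; ordering among the contexts
themselves does not matter):

    static void close_all_ibv(RDMAContext *rdma)
    {
        /* Each context owns its own ids, channel and verbs resources. */
        close_ibv(rdma, &rdma->lc_src);
        close_ibv(rdma, &rdma->lc_dest);
        close_ibv(rdma, &rdma->lc_remote);
    }
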
+/*
+ * Create protection domain and completion queues
+ */
+static int qemu_rdma_alloc_pd_cq_qp(RDMAContext *rdma, RDMALocalContext *lc)
+{
+    struct rlimit r = { .rlim_cur = RLIM_INFINITY, .rlim_max = RLIM_INFINITY };
+    struct ibv_qp_init_attr attr = { 0 };
+    int ret;
+
+    if (getrlimit(RLIMIT_MEMLOCK, &r) < 0) {
+        perror("getrlimit");
+        ERROR(NULL, "getrlimit(RLIMIT_MEMLOCK)");
+        goto err_alloc;
+    }
+
+    DPRINTF("MemLock Limits cur: %" PRId64 " max: %" PRId64 "\n",
+            r.rlim_cur, r.rlim_max);
+
+    lc->pd = ibv_alloc_pd(lc->verbs);
+    if (!lc->pd) {
+        ERROR(NULL, "allocate protection domain");
+        goto err_alloc;
+    }
+
+    /* create completion channel */
+    lc->comp_chan = ibv_create_comp_channel(lc->verbs);
+    if (!lc->comp_chan) {
+        ERROR(NULL, "allocate completion channel");
+        goto err_alloc;
+    }
+
+    /*
+     * Completion queue can be filled by both read and write work requests,
+     * so must reflect the sum of both possible queue sizes.
+     */
+    lc->cq = ibv_create_cq(lc->verbs, (RDMA_SEND_MAX * 3), NULL, 
+                           lc->comp_chan, 0);
+    if (!lc->cq) {
+        ERROR(NULL, "allocate completion queue");
+        goto err_alloc;
+    }
+
+    attr.cap.max_send_wr = RDMA_SEND_MAX;
+    attr.cap.max_recv_wr = 3;
+    attr.cap.max_send_sge = 1;
+    attr.cap.max_recv_sge = 1;
+    attr.send_cq = lc->cq;
+    attr.recv_cq = lc->cq;
+    attr.qp_type = IBV_QPT_RC;
+
+    ret = rdma_create_qp(lc->cm_id, lc->pd, &attr);
+    if (ret) {
+        ERROR(NULL, "alloc queue pair");
+        goto err_alloc;
+    }
+
+    lc->qp = lc->cm_id->qp;
+
+    return 0;
+
+err_alloc:
+    ERROR(NULL, "allocating pd and cq and qp! Your mlock()"
+                " limits may be too low. Please check $ ulimit -a # and "
+                "search for 'ulimit -l' in the output");
+    close_ibv(rdma, lc);
+    return -EINVAL;
+}
+
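Since the failure mode here is nearly always a too-small RLIMIT_MEMLOCK, a
caller could try to raise the soft limit to the hard limit before
connecting; a standalone sketch, not part of this patch:

    #include <errno.h>
    #include <sys/resource.h>

    static int try_raise_memlock(void)
    {
        struct rlimit r;

        if (getrlimit(RLIMIT_MEMLOCK, &r) < 0) {
            return -errno;
        }
        /* The soft limit may be raised to the hard limit unprivileged. */
        r.rlim_cur = r.rlim_max;
        if (setrlimit(RLIMIT_MEMLOCK, &r) < 0) {
            return -errno;
        }
        return 0;
    }
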
 /*
  * Interface to the rest of the migration call stack.
  */
@@ -440,7 +650,7 @@ typedef struct QEMU_PACKED {
         uint64_t current_addr;  /* offset into the ramblock of the chunk */
         uint64_t chunk;         /* chunk to lookup if unregistering */
     } key;
-    uint32_t current_index; /* which ramblock the chunk belongs to */
+    uint32_t current_block_idx;     /* which ramblock the chunk belongs to */
     uint32_t padding;
     uint64_t chunks;            /* how many sequential chunks to register */
 } RDMARegister;
@@ -448,14 +658,14 @@ typedef struct QEMU_PACKED {
 static void register_to_network(RDMARegister *reg)
 {
     reg->key.current_addr = htonll(reg->key.current_addr);
-    reg->current_index = htonl(reg->current_index);
+    reg->current_block_idx = htonl(reg->current_block_idx);
     reg->chunks = htonll(reg->chunks);
 }
 
 static void network_to_register(RDMARegister *reg)
 {
     reg->key.current_addr = ntohll(reg->key.current_addr);
-    reg->current_index = ntohl(reg->current_index);
+    reg->current_block_idx = ntohl(reg->current_block_idx);
     reg->chunks = ntohll(reg->chunks);
 }
 
@@ -578,10 +788,10 @@ static int __qemu_rdma_add_block(RDMAContext *rdma, void *host_addr,
 
     g_hash_table_insert(rdma->blockmap, (void *) block_offset, block);
 
-    DDPRINTF("Added Block: %d, addr: %" PRIu64 ", offset: %" PRIu64
-           " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d\n",
-            local->nb_blocks, (uint64_t) block->local_host_addr, block->offset,
-            block->length, (uint64_t) (block->local_host_addr + block->length),
+    DDPRINTF("Added Block: %d, addr: %p, offset: %" PRIu64
+           " length: %" PRIu64 " end: %p bits %" PRIu64 " chunks %d\n",
+            local->nb_blocks, block->local_host_addr, block->offset,
+            block->length, block->local_host_addr + block->length,
                 BITS_TO_LONGS(block->nb_chunks) *
                     sizeof(unsigned long) * 8, block->nb_chunks);
 
@@ -621,35 +831,51 @@ static int qemu_rdma_init_ram_blocks(RDMAContext *rdma)
     return 0;
 }
 
-static int __qemu_rdma_delete_block(RDMAContext *rdma, ram_addr_t block_offset)
+static void qemu_rdma_free_pmrs(RDMAContext *rdma, RDMALocalBlock *block,
+                               struct ibv_mr ***mrs)
 {
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    RDMALocalBlock *block = g_hash_table_lookup(rdma->blockmap,
-        (void *) block_offset);
-    RDMALocalBlock *old = local->block;
-    int x;
-
-    assert(block);
-
-    if (block->pmr) {
+    if (*mrs) {
         int j;
 
         for (j = 0; j < block->nb_chunks; j++) {
-            if (!block->pmr[j]) {
+            if (!(*mrs)[j]) {
                 continue;
             }
-            ibv_dereg_mr(block->pmr[j]);
+            ibv_dereg_mr((*mrs)[j]);
             rdma->total_registrations--;
         }
-        g_free(block->pmr);
-        block->pmr = NULL;
+        g_free(*mrs);
+
+        *mrs = NULL;
     }
+}
 
-    if (block->mr) {
-        ibv_dereg_mr(block->mr);
+static void qemu_rdma_free_mr(RDMAContext *rdma, struct ibv_mr **mr)
+{
+    if (*mr) {
+        ibv_dereg_mr(*mr);
         rdma->total_registrations--;
-        block->mr = NULL;
+        *mr = NULL;
     }
+}
+
+static int __qemu_rdma_delete_block(RDMAContext *rdma, ram_addr_t block_offset)
+{
+    RDMALocalBlocks *local = &rdma->local_ram_blocks;
+    RDMALocalBlock *block = g_hash_table_lookup(rdma->blockmap,
+        (void *) block_offset);
+    RDMALocalBlock *old = local->block;
+    int x;
+
+    assert(block);
+
+    qemu_rdma_free_pmrs(rdma, block, &block->pmr);
+    qemu_rdma_free_pmrs(rdma, block, &block->pmr_src);
+    qemu_rdma_free_pmrs(rdma, block, &block->pmr_dest);
+
+    qemu_rdma_free_mr(rdma, &block->mr);
+    qemu_rdma_free_mr(rdma, &block->mr_src);
+    qemu_rdma_free_mr(rdma, &block->mr_dest);
 
     g_free(block->transit_bitmap);
     block->transit_bitmap = NULL;
@@ -674,7 +900,12 @@ static int __qemu_rdma_delete_block(RDMAContext *rdma, ram_addr_t block_offset)
         }
 
         if (block->index < (local->nb_blocks - 1)) {
-            memcpy(local->block + block->index, old + (block->index + 1),
+            RDMALocalBlock *end = old + (block->index + 1);
+            for (x = 0; x < (local->nb_blocks - (block->index + 1)); x++) {
+                end[x].index--;
+            }
+
+            memcpy(local->block + block->index, end,
                 sizeof(RDMALocalBlock) *
                     (local->nb_blocks - (block->index + 1)));
         }
@@ -683,6 +914,10 @@ static int __qemu_rdma_delete_block(RDMAContext *rdma, ram_addr_t block_offset)
         local->block = NULL;
     }
 
+    local->nb_blocks--;
+
     DDPRINTF("Deleted Block: %d, addr: %" PRIu64 ", offset: %" PRIu64
            " length: %" PRIu64 " end: %" PRIu64 " bits %" PRIu64 " chunks %d\n",
             local->nb_blocks, (uint64_t) block->local_host_addr, block->offset,
@@ -690,10 +925,6 @@ static int __qemu_rdma_delete_block(RDMAContext *rdma, ram_addr_t block_offset)
                 BITS_TO_LONGS(block->nb_chunks) *
                     sizeof(unsigned long) * 8, block->nb_chunks);
 
     g_free(old);
-
-    local->nb_blocks--;
-
     if (local->nb_blocks) {
         for (x = 0; x < local->nb_blocks; x++) {
             g_hash_table_insert(rdma->blockmap, (void *)local->block[x].offset,
@@ -873,190 +1104,66 @@ static int qemu_rdma_broken_ipv6_kernel(Error **errp, struct ibv_context *verbs)
     return 0;
 }
 
-/*
- * Figure out which RDMA device corresponds to the requested IP hostname
- * Also create the initial connection manager identifiers for opening
- * the connection.
- */
-static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
+static int qemu_rdma_reg_keepalive(RDMAContext *rdma)
 {
-    int ret;
-    struct rdma_addrinfo *res;
-    char port_str[16];
-    struct rdma_cm_event *cm_event;
-    char ip[40] = "unknown";
-    struct rdma_addrinfo *e;
-
-    if (rdma->host == NULL || !strcmp(rdma->host, "")) {
-        ERROR(errp, "RDMA hostname has not been set");
-        return -EINVAL;
-    }
-
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        ERROR(errp, "could not create CM channel");
-        return -EINVAL;
-    }
-
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &rdma->cm_id, NULL, RDMA_PS_TCP);
-    if (ret) {
-        ERROR(errp, "could not create channel id");
-        goto err_resolve_create_id;
-    }
-
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
-
-    ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-    if (ret < 0) {
-        ERROR(errp, "could not rdma_getaddrinfo address %s", rdma->host);
-        goto err_resolve_get_addr;
-    }
-
-    for (e = res; e != NULL; e = e->ai_next) {
-        inet_ntop(e->ai_family,
-            &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-        DPRINTF("Trying %s => %s\n", rdma->host, ip);
-
-        ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_dst_addr,
-                RDMA_RESOLVE_TIMEOUT_MS);
-        if (!ret) {
-            if (e->ai_family == AF_INET6) {
-                ret = qemu_rdma_broken_ipv6_kernel(errp, rdma->cm_id->verbs);
-                if (ret) {
-                    continue;
-                }
-            }
-            goto route;
-        }
-    }
-
-    ERROR(errp, "could not resolve address %s", rdma->host);
-    goto err_resolve_get_addr;
-
-route:
-    qemu_rdma_dump_gid("source_resolve_addr", rdma->cm_id);
+    rdma->keepalive_mr = ibv_reg_mr(rdma->lc_remote.pd,
+            &rdma->keepalive, sizeof(rdma->keepalive),
+            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
 
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret) {
-        ERROR(errp, "could not perform event_addr_resolved");
-        goto err_resolve_get_addr;
+    if (!rdma->keepalive_mr) {
+        perror("Failed to register keepalive location!");
+        SET_ERROR(rdma, -ENOMEM);
+        goto err_alloc;
     }
 
-    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
-        ERROR(errp, "result not equal to event_addr_resolved %s",
-                rdma_event_str(cm_event->event));
-        perror("rdma_resolve_addr");
-        ret = -EINVAL;
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
+    rdma->next_keepalive_mr = ibv_reg_mr(rdma->lc_remote.pd,
+            &rdma->next_keepalive, sizeof(rdma->next_keepalive),
+            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
 
-    /* resolve route */
-    ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
-    if (ret) {
-        ERROR(errp, "could not resolve rdma route");
-        goto err_resolve_get_addr;
+    if (!rdma->next_keepalive_mr) {
+        perror("Failed to register next keepalive location!");
+        SET_ERROR(rdma, -ENOMEM);
+        goto err_alloc;
     }
 
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret) {
-        ERROR(errp, "could not perform event_route_resolved");
-        goto err_resolve_get_addr;
-    }
-    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
-        ERROR(errp, "result not equal to event_route_resolved: %s",
-                        rdma_event_str(cm_event->event));
-        rdma_ack_cm_event(cm_event);
-        ret = -EINVAL;
-        goto err_resolve_get_addr;
-    }
-    rdma_ack_cm_event(cm_event);
-    rdma->verbs = rdma->cm_id->verbs;
-    qemu_rdma_dump_id("source_resolve_host", rdma->cm_id->verbs);
-    qemu_rdma_dump_gid("source_resolve_host", rdma->cm_id);
     return 0;
 
-err_resolve_get_addr:
-    rdma_destroy_id(rdma->cm_id);
-    rdma->cm_id = NULL;
-err_resolve_create_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    return ret;
-}
-
-/*
- * Create protection domain and completion queues
- */
-static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
-{
-    /* allocate pd */
-    rdma->pd = ibv_alloc_pd(rdma->verbs);
-    if (!rdma->pd) {
-        fprintf(stderr, "failed to allocate protection domain\n");
-        return -1;
-    }
+err_alloc:
 
-    /* create completion channel */
-    rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
-    if (!rdma->comp_channel) {
-        fprintf(stderr, "failed to allocate completion channel\n");
-        goto err_alloc_pd_cq;
+    if (rdma->keepalive_mr) {
+        ibv_dereg_mr(rdma->keepalive_mr);
+        rdma->keepalive_mr = NULL;
     }
 
-    /*
-     * Completion queue can be filled by both read and write work requests,
-     * so must reflect the sum of both possible queue sizes.
-     */
-    rdma->cq = ibv_create_cq(rdma->verbs, (RDMA_SIGNALED_SEND_MAX * 3),
-            NULL, rdma->comp_channel, 0);
-    if (!rdma->cq) {
-        fprintf(stderr, "failed to allocate completion queue\n");
-        goto err_alloc_pd_cq;
+    if (rdma->next_keepalive_mr) {
+        ibv_dereg_mr(rdma->next_keepalive_mr);
+        rdma->next_keepalive_mr = NULL;
     }
 
-    return 0;
-
-err_alloc_pd_cq:
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-    }
-    if (rdma->comp_channel) {
-        ibv_destroy_comp_channel(rdma->comp_channel);
-    }
-    rdma->pd = NULL;
-    rdma->comp_channel = NULL;
     return -1;
-
 }
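
The two registered words above are meant to be RDMA-written by the peer; the
code that notices a stall is not in this hunk, so the following is only a
sketch under that assumption (the threshold parameter and helper name are
invented):

    /* Sketch: called periodically; returns true once the peer stops
     * advancing the keepalive counter for more than 'max_missed' ticks. */
    static bool keepalive_missed(RDMAContext *rdma, uint64_t max_missed)
    {
        if (rdma->keepalive == rdma->last_keepalive) {
            rdma->nb_missed_keepalive++;
        } else {
            rdma->nb_missed_keepalive = 0;
        }
        rdma->last_keepalive = rdma->keepalive;
        return rdma->nb_missed_keepalive > max_missed;
    }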
 
-/*
- * Create queue pairs.
- */
-static int qemu_rdma_alloc_qp(RDMAContext *rdma)
+static int qemu_rdma_reg_whole_mr(RDMAContext *rdma, 
+                                  struct ibv_pd *pd,
+                                  struct ibv_mr **mr,
+                                  int index)
 {
-    struct ibv_qp_init_attr attr = { 0 };
-    int ret;
-
-    attr.cap.max_send_wr = RDMA_SIGNALED_SEND_MAX;
-    attr.cap.max_recv_wr = 3;
-    attr.cap.max_send_sge = 1;
-    attr.cap.max_recv_sge = 1;
-    attr.send_cq = rdma->cq;
-    attr.recv_cq = rdma->cq;
-    attr.qp_type = IBV_QPT_RC;
+    RDMALocalBlocks *local = &rdma->local_ram_blocks;
 
-    ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
-    if (ret) {
+    *mr = ibv_reg_mr(pd,
+                local->block[index].local_host_addr,
+                local->block[index].length,
+                IBV_ACCESS_LOCAL_WRITE |
+                IBV_ACCESS_REMOTE_WRITE
+                );
+    if (!(*mr)) {
+        perror("Failed to register local dest ram block!\n");
         return -1;
     }
+    rdma->total_registrations++;
 
-    rdma->qp = rdma->cm_id->qp;
     return 0;
-}
+}
 
 static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
 {
@@ -1064,18 +1171,23 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
     RDMALocalBlocks *local = &rdma->local_ram_blocks;
 
     for (i = 0; i < local->nb_blocks; i++) {
-        local->block[i].mr =
-            ibv_reg_mr(rdma->pd,
-                    local->block[i].local_host_addr,
-                    local->block[i].length,
-                    IBV_ACCESS_LOCAL_WRITE |
-                    IBV_ACCESS_REMOTE_WRITE
-                    );
-        if (!local->block[i].mr) {
-            perror("Failed to register local dest ram block!\n");
+        if (qemu_rdma_reg_whole_mr(rdma, rdma->lc_remote.pd, &local->block[i].mr, i)) {
             break;
         }
-        rdma->total_registrations++;
+
+        if (migrate_use_mc_rdma_copy()) {
+            if (rdma->source) {
+                if (qemu_rdma_reg_whole_mr(rdma, rdma->lc_src.pd, 
+                        &local->block[i].mr_src, i)) {
+                    break;
+                }
+            } else {
+                if (qemu_rdma_reg_whole_mr(rdma, rdma->lc_dest.pd, 
+                        &local->block[i].mr_dest, i)) {
+                    break;
+                }
+            }
+        }
     }
 
     if (i >= local->nb_blocks) {
@@ -1083,8 +1195,12 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
     }
 
     for (i--; i >= 0; i--) {
-        ibv_dereg_mr(local->block[i].mr);
-        rdma->total_registrations--;
+        qemu_rdma_free_mr(rdma, &local->block[i].mr);
+        if (migrate_use_mc_rdma_copy()) {
+            qemu_rdma_free_mr(rdma, rdma->source ?
+                                &local->block[i].mr_src :
+                                &local->block[i].mr_dest);
+        }
     }
 
     return -1;
@@ -1129,24 +1245,34 @@ static int qemu_rdma_search_ram_block(RDMAContext *rdma,
  * to perform the actual RDMA operation.
  */
 static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
-        RDMALocalBlock *block, uint8_t *host_addr,
-        uint32_t *lkey, uint32_t *rkey, int chunk,
-        uint8_t *chunk_start, uint8_t *chunk_end)
+                                           RDMACurrentChunk *cc,
+                                           RDMALocalContext *lc,
+                                           bool copy,
+                                           uint32_t *lkey, 
+                                           uint32_t *rkey)
 {
-    if (block->mr) {
+    struct ibv_mr ***pmr = copy ? (rdma->source ? &cc->block->pmr_src : 
+                           &cc->block->pmr_dest) : &cc->block->pmr;
+    struct ibv_mr **mr = copy ? (rdma->source ? &cc->block->mr_src :
+                         &cc->block->mr_dest) : &cc->block->mr;
+
+    /*
+     * Use pre-registered keys for the entire VM, if available.
+     */
+    if (*mr) {
         if (lkey) {
-            *lkey = block->mr->lkey;
+            *lkey = (*mr)->lkey;
         }
         if (rkey) {
-            *rkey = block->mr->rkey;
+            *rkey = (*mr)->rkey;
         }
         return 0;
     }
 
     /* allocate memory to store chunk MRs */
-    if (!block->pmr) {
-        block->pmr = g_malloc0(block->nb_chunks * sizeof(struct ibv_mr *));
-        if (!block->pmr) {
+    if (!(*pmr)) {
+        *pmr = g_malloc0(cc->block->nb_chunks * sizeof(struct ibv_mr *));
+        if (!(*pmr)) {
             return -1;
         }
     }
@@ -1154,38 +1280,38 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
     /*
      * If 'rkey', then we're the destination, so grant access to the source.
      *
-     * If 'lkey', then we're the source VM, so grant access only to ourselves.
+     * If 'lkey', then we're the source, so grant access only to ourselves.
      */
-    if (!block->pmr[chunk]) {
-        uint64_t len = chunk_end - chunk_start;
+    if (!(*pmr)[cc->chunk_idx]) {
+        uint64_t len = cc->chunk_end - cc->chunk_start;
 
         DDPRINTF("Registering %" PRIu64 " bytes @ %p\n",
-                 len, chunk_start);
+                 len, cc->chunk_start);
 
-        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
-                chunk_start, len,
-                (rkey ? (IBV_ACCESS_LOCAL_WRITE |
-                        IBV_ACCESS_REMOTE_WRITE) : 0));
+        (*pmr)[cc->chunk_idx] = ibv_reg_mr(lc->pd, cc->chunk_start, len,
+                    (rkey ? (IBV_ACCESS_LOCAL_WRITE |
+                            IBV_ACCESS_REMOTE_WRITE) : 0));
 
-        if (!block->pmr[chunk]) {
+        if (!(*pmr)[cc->chunk_idx]) {
             perror("Failed to register chunk!");
-            fprintf(stderr, "Chunk details: block: %d chunk index %d"
+            fprintf(stderr, "Chunk details: block: %d chunk index %" PRIu64
                             " start %" PRIu64 " end %" PRIu64 " host %" PRIu64
                             " local %" PRIu64 " registrations: %d\n",
-                            block->index, chunk, (uint64_t) chunk_start,
-                            (uint64_t) chunk_end, (uint64_t) host_addr,
-                            (uint64_t) block->local_host_addr,
+                            cc->block->index, cc->chunk_idx, (uint64_t) cc->chunk_start,
+                            (uint64_t) cc->chunk_end, (uint64_t) cc->addr,
+                            (uint64_t) cc->block->local_host_addr,
                             rdma->total_registrations);
             return -1;
         }
+
         rdma->total_registrations++;
     }
 
     if (lkey) {
-        *lkey = block->pmr[chunk]->lkey;
+        *lkey = (*pmr)[cc->chunk_idx]->lkey;
     }
     if (rkey) {
-        *rkey = block->pmr[chunk]->rkey;
+        *rkey = (*pmr)[cc->chunk_idx]->rkey;
     }
     return 0;
 }
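
All of the chunk bookkeeping above is shift arithmetic at
RDMA_REG_CHUNK_SHIFT granularity; a simplified restatement of the
ram_chunk_index()/ram_chunk_start() helpers this function relies on (they
are defined earlier in this file):

    static uint64_t chunk_index(uint8_t *block_start, uint8_t *host)
    {
        /* Which chunk of the RAM block does 'host' fall into? */
        return (uint64_t)(host - block_start) >> RDMA_REG_CHUNK_SHIFT;
    }

    static uint8_t *chunk_start_addr(RDMALocalBlock *b, uint64_t i)
    {
        /* First byte of chunk 'i' within block 'b'. */
        return b->local_host_addr + (i << RDMA_REG_CHUNK_SHIFT);
    }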
@@ -1196,7 +1322,7 @@ static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
  */
 static int qemu_rdma_reg_control(RDMAContext *rdma, int idx)
 {
-    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->pd,
+    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->lc_remote.pd,
             rdma->wr_data[idx].control, RDMA_CONTROL_MAX_BUFFER,
             IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
     if (rdma->wr_data[idx].control_mr) {
@@ -1257,11 +1383,11 @@ static int qemu_rdma_unregister_waiting(RDMAContext *rdma)
         uint64_t wr_id = rdma->unregistrations[rdma->unregister_current];
         uint64_t chunk =
             (wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
+        uint64_t block_index =
             (wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
         RDMALocalBlock *block =
-            &(rdma->local_ram_blocks.block[index]);
-        RDMARegister reg = { .current_index = index };
+            &(rdma->local_ram_blocks.block[block_index]);
+        RDMARegister reg = { .current_block_idx = block_index };
         RDMAControlHeader resp = { .type = RDMA_CONTROL_UNREGISTER_FINISHED,
                                  };
         RDMAControlHeader head = { .len = sizeof(RDMARegister),
@@ -1275,7 +1401,7 @@ static int qemu_rdma_unregister_waiting(RDMAContext *rdma)
         rdma->unregistrations[rdma->unregister_current] = 0;
         rdma->unregister_current++;
 
-        if (rdma->unregister_current == RDMA_SIGNALED_SEND_MAX) {
+        if (rdma->unregister_current == RDMA_SEND_MAX) {
             rdma->unregister_current = 0;
         }
 
@@ -1339,7 +1465,7 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
                                         uint64_t chunk, uint64_t wr_id)
 {
     if (rdma->unregistrations[rdma->unregister_next] != 0) {
-        fprintf(stderr, "rdma migration: queue is full!\n");
+        ERROR(NULL, "queue is full!");
     } else {
         RDMALocalBlock *block = &(rdma->local_ram_blocks.block[index]);
 
@@ -1350,7 +1476,7 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
             rdma->unregistrations[rdma->unregister_next++] =
                     qemu_rdma_make_wrid(wr_id, index, chunk);
 
-            if (rdma->unregister_next == RDMA_SIGNALED_SEND_MAX) {
+            if (rdma->unregister_next == RDMA_SEND_MAX) {
                 rdma->unregister_next = 0;
             }
         } else {
@@ -1365,14 +1491,21 @@ static void qemu_rdma_signal_unregister(RDMAContext *rdma, uint64_t index,
  * (of any kind) has completed.
  * Return the work request ID that completed.
  */
-static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
+static uint64_t qemu_rdma_poll(RDMAContext *rdma,
+                               RDMALocalContext *lc, 
+                               uint64_t *wr_id_out,
                                uint32_t *byte_len)
 {
+    int64_t current_time;
     int ret;
     struct ibv_wc wc;
     uint64_t wr_id;
 
-    ret = ibv_poll_cq(rdma->cq, 1, &wc);
+    if (!lc->start_time) {
+        lc->start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    }
+
+    ret = ibv_poll_cq(lc->cq, 1, &wc);
 
     if (!ret) {
         *wr_id_out = RDMA_WRID_NONE;
@@ -1380,16 +1513,17 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
     }
 
     if (ret < 0) {
-        fprintf(stderr, "ibv_poll_cq return %d!\n", ret);
+        fprintf(stderr, "ibv_poll_cq return %d (%s)!\n", ret, lc->id_str);
         return ret;
     }
 
     wr_id = wc.wr_id & RDMA_WRID_TYPE_MASK;
 
     if (wc.status != IBV_WC_SUCCESS) {
-        fprintf(stderr, "ibv_poll_cq wc.status=%d %s!\n",
-                        wc.status, ibv_wc_status_str(wc.status));
-        fprintf(stderr, "ibv_poll_cq wrid=%s!\n", wrid_desc[wr_id]);
+        fprintf(stderr, "ibv_poll_cq wc.status=%d %s! (%s)\n",
+                        wc.status, ibv_wc_status_str(wc.status), lc->id_str);
+        fprintf(stderr, "ibv_poll_cq wrid=%s! (%s)\n", wrid_desc[wr_id],
+                                                        lc->id_str);
 
         return -1;
     }
@@ -1397,29 +1531,49 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
     if (rdma->control_ready_expected &&
         (wr_id >= RDMA_WRID_RECV_CONTROL)) {
         DDDPRINTF("completion %s #%" PRId64 " received (%" PRId64 ")"
-                  " left %d\n", wrid_desc[RDMA_WRID_RECV_CONTROL],
-                  wr_id - RDMA_WRID_RECV_CONTROL, wr_id, rdma->nb_sent);
+                  " left %d (per qp %d) (%s)\n", 
+                  wrid_desc[RDMA_WRID_RECV_CONTROL],
+                  wr_id - RDMA_WRID_RECV_CONTROL, wr_id, 
+                  rdma->nb_sent, lc->nb_sent, lc->id_str);
         rdma->control_ready_expected = 0;
     }
 
-    if (wr_id == RDMA_WRID_RDMA_WRITE) {
+    if (wr_id == RDMA_WRID_RDMA_WRITE_REMOTE) {
         uint64_t chunk =
             (wc.wr_id & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
-        uint64_t index =
+        uint64_t block_idx =
             (wc.wr_id & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
-        RDMALocalBlock *block = &(rdma->local_ram_blocks.block[index]);
-
-        DDDPRINTF("completions %s (%" PRId64 ") left %d, "
-                 "block %" PRIu64 ", chunk: %" PRIu64 " %p %p\n",
-                 print_wrid(wr_id), wr_id, rdma->nb_sent, index, chunk,
-                 block->local_host_addr, (void *)block->remote_host_addr);
+        RDMALocalBlock *block = &(rdma->local_ram_blocks.block[block_idx]);
 
         clear_bit(chunk, block->transit_bitmap);
 
+        if (lc->nb_sent > lc->max_nb_sent) {
+            lc->max_nb_sent = lc->nb_sent;
+        }
+
+        current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+        
+        if ((current_time - lc->start_time) > 1000) {
+            lc->start_time = current_time;
+            DDPRINTF("outstanding %s total: %d context: %d max %d (%s)\n",
+                lc->id_str, rdma->nb_sent, lc->nb_sent, lc->max_nb_sent,
+                lc->id_str);
+        }
+
         if (rdma->nb_sent > 0) {
             rdma->nb_sent--;
         }
 
+        if (lc->nb_sent > 0) {
+            lc->nb_sent--;
+        }
+
+        DDDPRINTF("completions %s (%" PRId64 ") left %d (per qp %d), "
+                 "block %" PRIu64 ", chunk: %" PRIu64 " %p %p (%s)\n",
+                 print_wrid(wr_id), wr_id, rdma->nb_sent, lc->nb_sent,
+                 block_idx, chunk, block->local_host_addr, 
+                 (void *)block->remote_host_addr, lc->id_str);
+
         if (!rdma->pin_all) {
             /*
              * FYI: If one wanted to signal a specific chunk to be unregistered
@@ -1428,12 +1582,15 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
              * unregistered later.
              */
 #ifdef RDMA_UNREGISTRATION_EXAMPLE
-            qemu_rdma_signal_unregister(rdma, index, chunk, wc.wr_id);
+             if (block->pmr[chunk]) { 
+                 qemu_rdma_signal_unregister(rdma, block_idx, chunk, wc.wr_id);
+             }
 #endif
         }
     } else {
-        DDDPRINTF("other completion %s (%" PRId64 ") received left %d\n",
-            print_wrid(wr_id), wr_id, rdma->nb_sent);
+        DDDPRINTF("other completion %s (%" 
+                  PRId64 ") received left %d (per qp %d) (%s)\n",
+            print_wrid(wr_id), wr_id, rdma->nb_sent, lc->nb_sent, lc->id_str);
     }
 
     *wr_id_out = wc.wr_id;
@@ -1457,7 +1614,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, uint64_t *wr_id_out,
  * completions only need to be recorded, but do not actually
  * need further processing.
  */
-static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
+static int qemu_rdma_block_for_wrid(RDMAContext *rdma, 
+                                    RDMALocalContext *lc,
+                                    int wrid_requested,
                                     uint32_t *byte_len)
 {
     int num_cq_events = 0, ret = 0;
@@ -1465,12 +1624,15 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
     void *cq_ctx;
     uint64_t wr_id = RDMA_WRID_NONE, wr_id_in;
 
-    if (ibv_req_notify_cq(rdma->cq, 0)) {
-        return -1;
+    ret = ibv_req_notify_cq(lc->cq, 0);
+    if (ret) {
+        perror("ibv_req_notify_cq");
+        return -ret;
     }
+
     /* poll cq first */
     while (wr_id != wrid_requested) {
-        ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
+        ret = qemu_rdma_poll(rdma, lc, &wr_id_in, byte_len);
         if (ret < 0) {
             return ret;
         }
@@ -1481,9 +1643,9 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
             break;
         }
         if (wr_id != wrid_requested) {
-            DDDPRINTF("A Wanted wrid %s (%d) but got %s (%" PRIu64 ")\n",
+            DDDPRINTF("A Wanted wrid %s (%d) but got %s (%" PRIu64 ") (%s)\n",
                 print_wrid(wrid_requested),
-                wrid_requested, print_wrid(wr_id), wr_id);
+                wrid_requested, print_wrid(wr_id), wr_id, lc->id_str);
         }
     }
 
@@ -1496,23 +1658,27 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
          * Coroutine doesn't start until process_incoming_migration()
          * so don't yield unless we know we're running inside of a coroutine.
          */
-        if (rdma->migration_started_on_destination) {
-            yield_until_fd_readable(rdma->comp_channel->fd);
+        if (qemu_in_coroutine()) {
+            yield_until_fd_readable(lc->comp_chan->fd);
         }
 
-        if (ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx)) {
+        ret = ibv_get_cq_event(lc->comp_chan, &cq, &cq_ctx);
+        if (ret < 0) {
             perror("ibv_get_cq_event");
             goto err_block_for_wrid;
         }
 
         num_cq_events++;
 
-        if (ibv_req_notify_cq(cq, 0)) {
+        ret = ibv_req_notify_cq(cq, 0);
+        if (ret) {
+            ret = -ret;
+            perror("ibv_req_notify_cq");
             goto err_block_for_wrid;
         }
 
         while (wr_id != wrid_requested) {
-            ret = qemu_rdma_poll(rdma, &wr_id_in, byte_len);
+            ret = qemu_rdma_poll(rdma, lc, &wr_id_in, byte_len);
             if (ret < 0) {
                 goto err_block_for_wrid;
             }
@@ -1523,9 +1689,9 @@ static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid_requested,
                 break;
             }
             if (wr_id != wrid_requested) {
-                DDDPRINTF("B Wanted wrid %s (%d) but got %s (%" PRIu64 ")\n",
+                DDDPRINTF("B Wanted wrid %s (%d) but got %s (%" PRIu64 ") (%s)\n",
                     print_wrid(wrid_requested), wrid_requested,
-                    print_wrid(wr_id), wr_id);
+                    print_wrid(wr_id), wr_id, lc->id_str);
             }
         }
 
@@ -1589,18 +1755,17 @@ static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
     }
 
 
-    if (ibv_post_send(rdma->qp, &send_wr, &bad_wr)) {
-        return -1;
-    }
+    ret = ibv_post_send(rdma->lc_remote.qp, &send_wr, &bad_wr);
 
-    if (ret < 0) {
-        fprintf(stderr, "Failed to use post IB SEND for control!\n");
-        return ret;
+    if (ret > 0) {
+        ERROR(NULL, "Failed to use post IB SEND for control!");
+        return -ret;
     }
 
-    ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL);
+    ret = qemu_rdma_block_for_wrid(rdma, &rdma->lc_remote,
+                                   RDMA_WRID_SEND_CONTROL, NULL);
     if (ret < 0) {
-        fprintf(stderr, "rdma migration: send polling control error!\n");
+        ERROR(NULL, "send polling control!");
     }
 
     return ret;
@@ -1626,7 +1791,7 @@ static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx)
                                  };
 
 
-    if (ibv_post_recv(rdma->qp, &recv_wr, &bad_wr)) {
+    if (ibv_post_recv(rdma->lc_remote.qp, &recv_wr, &bad_wr)) {
         return -1;
     }
 
@@ -1640,11 +1805,12 @@ static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
                 RDMAControlHeader *head, int expecting, int idx)
 {
     uint32_t byte_len;
-    int ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx,
+    int ret = qemu_rdma_block_for_wrid(rdma, &rdma->lc_remote,
+                                       RDMA_WRID_RECV_CONTROL + idx,
                                        &byte_len);
 
     if (ret < 0) {
-        fprintf(stderr, "rdma migration: recv polling control error!\n");
+        ERROR(NULL, "recv polling control!");
         return ret;
     }
 
@@ -1731,8 +1897,7 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
     if (resp) {
         ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_DATA);
         if (ret) {
-            fprintf(stderr, "rdma migration: error posting"
-                    " extra control recv for anticipated result!");
+            ERROR(NULL, "posting extra control recv for anticipated result!");
             return ret;
         }
     }
@@ -1742,7 +1907,7 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
      */
     ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY);
     if (ret) {
-        fprintf(stderr, "rdma migration: error posting first control recv!");
+        ERROR(NULL, "posting first control recv!");
         return ret;
     }
 
@@ -1752,7 +1917,7 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
     ret = qemu_rdma_post_send_control(rdma, data, head);
 
     if (ret < 0) {
-        fprintf(stderr, "Failed to send control buffer!\n");
+        ERROR(NULL, "sending control buffer!");
         return ret;
     }
 
@@ -1829,30 +1994,51 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
      */
     ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY);
     if (ret) {
-        fprintf(stderr, "rdma migration: error posting second control recv!");
+        ERROR(NULL, "posting second control recv!");
         return ret;
     }
 
     return 0;
 }
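
Every control message follows this same lock-step pattern over pre-posted
receives. A sketch of one sender-side round trip, modeled on the
registration request further down in this patch (assumes an RDMARegister
'reg' has already been filled in):

    RDMAControlHeader head = { .len = sizeof(RDMARegister),
                               .type = RDMA_CONTROL_REGISTER_REQUEST,
                               .repeat = 1 };
    RDMAControlHeader resp = { .type = RDMA_CONTROL_REGISTER_RESULT };
    int reg_result_idx;

    register_to_network(&reg);
    if (qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
                                &resp, &reg_result_idx, NULL) < 0) {
        return -EIO;    /* lock-step broken; fail the migration */
    }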
 
+static inline void install_boundaries(RDMAContext *rdma, RDMACurrentChunk *cc)
+{
+    uint64_t len = cc->block->is_ram_block ? 
+                   cc->current_length : cc->block->length;
+
+    cc->chunks = len / (1UL << RDMA_REG_CHUNK_SHIFT);
+
+    if (cc->chunks && ((len % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
+        cc->chunks--;
+    }
+
+    cc->addr = (uint8_t *) (uint64_t)(cc->block->local_host_addr +
+                                 (cc->current_addr - cc->block->offset));
+
+    cc->chunk_idx = ram_chunk_index(cc->block->local_host_addr, cc->addr);
+    cc->chunk_start = ram_chunk_start(cc->block, cc->chunk_idx);
+    cc->chunk_end = ram_chunk_end(cc->block, cc->chunk_idx + cc->chunks);
+
+    DDPRINTF("Block %d chunk %" PRIu64 " has %" PRIu64
+             " chunks, (%" PRIu64 " MB)\n", cc->block->index, cc->chunk_idx,
+                cc->chunks + 1, (cc->chunks + 1) * 
+                    (1UL << RDMA_REG_CHUNK_SHIFT) / 1024 / 1024);
+}
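
Note that 'chunks' deliberately counts the chunks past the first one. A
worked example, assuming the 1 MB chunk size this file uses:

    /* A 3 MB, chunk-aligned buffer: */
    uint64_t len    = 3 * (1UL << RDMA_REG_CHUNK_SHIFT);
    uint64_t chunks = len / (1UL << RDMA_REG_CHUNK_SHIFT);      /* 3 */

    if (chunks && ((len % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
        chunks--;                   /* 2: the first chunk plus two more */
    }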
+
 /*
- * Write an actual chunk of memory using RDMA.
+ * Push out any unwritten RDMA operations.
  *
  * If we're using dynamic registration on the dest-side, we have to
  * send a registration command first.
  */
-static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
-                               int current_index, uint64_t current_addr,
-                               uint64_t length)
+static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
+                                 RDMACurrentChunk *src,
+                                 RDMACurrentChunk *dest)
 {
     struct ibv_sge sge;
     struct ibv_send_wr send_wr = { 0 };
     struct ibv_send_wr *bad_wr;
     int reg_result_idx, ret, count = 0;
-    uint64_t chunk, chunks;
-    uint8_t *chunk_start, *chunk_end;
-    RDMALocalBlock *block = &(rdma->local_ram_blocks.block[current_index]);
+    bool copy;
+    RDMALocalContext *lc;
     RDMARegister reg;
     RDMARegisterResult *reg_result;
     RDMAControlHeader resp = { .type = RDMA_CONTROL_REGISTER_RESULT };
@@ -1861,82 +2047,84 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
                                .repeat = 1,
                              };
 
-retry:
-    sge.addr = (uint64_t)(block->local_host_addr +
-                            (current_addr - block->offset));
-    sge.length = length;
+    if (!src->current_length) {
+        return 0;
+    }
 
-    chunk = ram_chunk_index(block->local_host_addr, (uint8_t *) sge.addr);
-    chunk_start = ram_chunk_start(block, chunk);
+    if (dest == src) {
+        dest = NULL;
+    }
 
-    if (block->is_ram_block) {
-        chunks = length / (1UL << RDMA_REG_CHUNK_SHIFT);
+    copy = (dest != NULL);
+
+    lc = (migrate_use_mc_rdma_copy() && copy) ?
+        (rdma->source ? &rdma->lc_src : &rdma->lc_dest) : &rdma->lc_remote;
 
-        if (chunks && ((length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
-    } else {
-        chunks = block->length / (1UL << RDMA_REG_CHUNK_SHIFT);
+retry:
+    src->block = &(rdma->local_ram_blocks.block[src->current_block_idx]);
+    install_boundaries(rdma, src);
 
-        if (chunks && ((block->length % (1UL << RDMA_REG_CHUNK_SHIFT)) == 0)) {
-            chunks--;
-        }
+    if (dest) {
+        dest->block = &(rdma->local_ram_blocks.block[dest->current_block_idx]);
+        install_boundaries(rdma, dest);
     }
 
-    DDPRINTF("Writing %" PRIu64 " chunks, (%" PRIu64 " MB)\n",
-        chunks + 1, (chunks + 1) * (1UL << RDMA_REG_CHUNK_SHIFT) / 1024 / 1024);
-
-    chunk_end = ram_chunk_end(block, chunk + chunks);
-
     if (!rdma->pin_all) {
 #ifdef RDMA_UNREGISTRATION_EXAMPLE
         qemu_rdma_unregister_waiting(rdma);
 #endif
     }
 
-    while (test_bit(chunk, block->transit_bitmap)) {
+    while (test_bit(src->chunk_idx, src->block->transit_bitmap)) {
         (void)count;
-        DDPRINTF("(%d) Not clobbering: block: %d chunk %" PRIu64
-                " current %" PRIu64 " len %" PRIu64 " %d %d\n",
-                count++, current_index, chunk,
-                sge.addr, length, rdma->nb_sent, block->nb_chunks);
+        DPRINTF("(%d) Not clobbering: block: %d chunk %" PRIu64
+                " current_addr %" PRIu64 " len %" PRIu64 
+                " left %d (per qp left %d) chunks %d (%s)\n",
+                count++, src->current_block_idx, src->chunk_idx,
+                (uint64_t) src->addr, src->current_length, 
+                rdma->nb_sent, lc->nb_sent, src->block->nb_chunks, lc->id_str);
 
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
+        ret = qemu_rdma_block_for_wrid(rdma, lc, 
+                                       RDMA_WRID_RDMA_WRITE_REMOTE, NULL);
 
         if (ret < 0) {
             fprintf(stderr, "Failed to Wait for previous write to complete "
                     "block %d chunk %" PRIu64
-                    " current %" PRIu64 " len %" PRIu64 " %d\n",
-                    current_index, chunk, sge.addr, length, rdma->nb_sent);
+                    " current_addr %" PRIu64 " len %" PRIu64
+                    " left %d (per qp left %d) (%s)\n",
+                    src->current_block_idx, src->chunk_idx, (uint64_t) src->addr, 
+                    src->current_length, rdma->nb_sent, lc->nb_sent, lc->id_str);
             return ret;
         }
     }
 
-    if (!rdma->pin_all || !block->is_ram_block) {
-        if (!block->remote_keys[chunk]) {
+    if (!rdma->pin_all || !src->block->is_ram_block) {
+        if (!src->block->remote_keys[src->chunk_idx]) {
             /*
              * This chunk has not yet been registered, so first check to see
              * if the entire chunk is zero. If so, tell the other size to
              * memset() + madvise() the entire chunk without RDMA.
              */
 
-            if (can_use_buffer_find_nonzero_offset((void *)sge.addr, length)
-                   && buffer_find_nonzero_offset((void *)sge.addr,
-                                                    length) == length) {
+            if (src->block->is_ram_block &&
+                   can_use_buffer_find_nonzero_offset(src->addr, src->current_length)
+                   && buffer_find_nonzero_offset(src->addr,
+                                                    src->current_length) == src->current_length) {
                 RDMACompress comp = {
-                                        .offset = current_addr,
+                                        .offset = src->current_addr,
                                         .value = 0,
-                                        .block_idx = current_index,
-                                        .length = length,
+                                        .block_idx = src->current_block_idx,
+                                        .length = src->current_length,
                                     };
 
                 head.len = sizeof(comp);
                 head.type = RDMA_CONTROL_COMPRESS;
 
-                DDPRINTF("Entire chunk is zero, sending compress: %"
-                    PRIu64 " for %d "
-                    "bytes, index: %d, offset: %" PRId64 "...\n",
-                    chunk, sge.length, current_index, current_addr);
+                DDPRINTF("Entire chunk is zero, sending compress: %" PRIu64 
+                         " for %" PRIu64 " bytes, index: %d"
+                         ", offset: %" PRId64 " (%s)...\n",
+                         src->chunk_idx, src->current_length, 
+                         src->current_block_idx, src->current_addr, lc->id_str);
 
                 compress_to_network(&comp);
                 ret = qemu_rdma_exchange_send(rdma, &head,
@@ -1946,109 +2134,125 @@ retry:
                     return -EIO;
                 }
 
-                acct_update_position(f, sge.length, true);
+                acct_update_position(f, src->current_length, true);
 
                 return 1;
             }
 
             /*
-             * Otherwise, tell other side to register.
+             * Otherwise, tell other side to register. (Only for remote RDMA)
              */
-            reg.current_index = current_index;
-            if (block->is_ram_block) {
-                reg.key.current_addr = current_addr;
-            } else {
-                reg.key.chunk = chunk;
-            }
-            reg.chunks = chunks;
+            if (!dest) {
+                reg.current_block_idx = src->current_block_idx;
+                if (src->block->is_ram_block) {
+                    reg.key.current_addr = src->current_addr;
+                } else {
+                    reg.key.chunk = src->chunk_idx;
+                }
+                reg.chunks = src->chunks;
 
-            DDPRINTF("Sending registration request chunk %" PRIu64 " for %d "
-                    "bytes, index: %d, offset: %" PRId64 "...\n",
-                    chunk, sge.length, current_index, current_addr);
+                DDPRINTF("Sending registration request chunk %" PRIu64 
+                         " for %" PRIu64 " bytes, index: %d, offset: %" 
+                         PRId64 " (%s)...\n",
+                         src->chunk_idx, src->current_length, 
+                         src->current_block_idx, src->current_addr, lc->id_str);
 
-            register_to_network(&reg);
-            ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                    &resp, &reg_result_idx, NULL);
-            if (ret < 0) {
-                return ret;
+                register_to_network(&reg);
+                ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
+                                        &resp, &reg_result_idx, NULL);
+                if (ret < 0) {
+                    return ret;
+                }
             }
 
             /* try to overlap this single registration with the one we sent. */
-            if (qemu_rdma_register_and_get_keys(rdma, block,
-                                                (uint8_t *) sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
+            if (qemu_rdma_register_and_get_keys(rdma, src, lc, copy, 
+                                                &sge.lkey, NULL)) {
                 fprintf(stderr, "cannot get lkey!\n");
                 return -EINVAL;
             }
 
-            reg_result = (RDMARegisterResult *)
-                    rdma->wr_data[reg_result_idx].control_curr;
+            if (!dest) {
+                reg_result = (RDMARegisterResult *)
+                        rdma->wr_data[reg_result_idx].control_curr;
 
-            network_to_result(reg_result);
+                network_to_result(reg_result);
 
-            DDPRINTF("Received registration result:"
-                    " my key: %x their key %x, chunk %" PRIu64 "\n",
-                    block->remote_keys[chunk], reg_result->rkey, chunk);
+                DDPRINTF("Received registration result:"
+                        " my key: %x their key %x, chunk %" PRIu64 " (%s)\n",
+                        src->block->remote_keys[src->chunk_idx], 
+                        reg_result->rkey, src->chunk_idx, lc->id_str);
 
-            block->remote_keys[chunk] = reg_result->rkey;
-            block->remote_host_addr = reg_result->host_addr;
+                src->block->remote_keys[src->chunk_idx] = reg_result->rkey;
+                src->block->remote_host_addr = reg_result->host_addr;
+            }
         } else {
             /* already registered before */
-            if (qemu_rdma_register_and_get_keys(rdma, block,
-                                                (uint8_t *)sge.addr,
-                                                &sge.lkey, NULL, chunk,
-                                                chunk_start, chunk_end)) {
+            if (qemu_rdma_register_and_get_keys(rdma, src, lc, copy,
+                                                &sge.lkey, NULL)) {
                 fprintf(stderr, "cannot get lkey!\n");
                 return -EINVAL;
             }
         }
 
-        send_wr.wr.rdma.rkey = block->remote_keys[chunk];
+        send_wr.wr.rdma.rkey = src->block->remote_keys[src->chunk_idx];
     } else {
-        send_wr.wr.rdma.rkey = block->remote_rkey;
+        send_wr.wr.rdma.rkey = src->block->remote_rkey;
 
-        if (qemu_rdma_register_and_get_keys(rdma, block, (uint8_t *)sge.addr,
-                                                     &sge.lkey, NULL, chunk,
-                                                     chunk_start, chunk_end)) {
+        if (qemu_rdma_register_and_get_keys(rdma, src, lc, copy, 
+                                            &sge.lkey, NULL)) {
             fprintf(stderr, "cannot get lkey!\n");
             return -EINVAL;
         }
     }
 
+    if (migrate_use_mc_rdma_copy() && dest) {
+        if (qemu_rdma_register_and_get_keys(rdma, dest,
+                                            &rdma->lc_dest, copy,
+                                            NULL, &send_wr.wr.rdma.rkey)) {
+            fprintf(stderr, "cannot get rkey!\n");
+            return -EINVAL;
+        }
+    }
+
     /*
      * Encode the ram block index and chunk within this wrid.
      * We will use this information at the time of completion
      * to figure out which bitmap to check against and then which
      * chunk in the bitmap to look for.
      */
-    send_wr.wr_id = qemu_rdma_make_wrid(RDMA_WRID_RDMA_WRITE,
-                                        current_index, chunk);
+    send_wr.wr_id = qemu_rdma_make_wrid(RDMA_WRID_RDMA_WRITE_REMOTE,
+                                        src->current_block_idx, src->chunk_idx);
 
+    sge.length = src->current_length;
+    sge.addr = (uint64_t) src->addr;
     send_wr.opcode = IBV_WR_RDMA_WRITE;
     send_wr.send_flags = IBV_SEND_SIGNALED;
     send_wr.sg_list = &sge;
     send_wr.num_sge = 1;
-    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
-                                (current_addr - block->offset);
+    send_wr.wr.rdma.remote_addr = (dest ? (uint64_t) dest->addr : 
+                (src->block->remote_host_addr + 
+                    (src->current_addr - src->block->offset)));
 
-    DDDPRINTF("Posting chunk: %" PRIu64 ", addr: %lx"
-              " remote: %lx, bytes %" PRIu32 "\n",
-              chunk, sge.addr, send_wr.wr.rdma.remote_addr,
-              sge.length);
+    DDPRINTF("Posting chunk: %" PRIu64 ", addr: %lx"
+             " remote: %lx, bytes %" PRIu32 " lkey %" PRIu32 
+             " rkey %" PRIu32 " (%s)\n",
+             src->chunk_idx, sge.addr, 
+             send_wr.wr.rdma.remote_addr, sge.length,
+             sge.lkey, send_wr.wr.rdma.rkey, lc->id_str);
 
     /*
      * ibv_post_send() does not return negative error numbers,
      * per the specification they are positive - no idea why.
      */
-    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
+    ret = ibv_post_send(lc->qp, &send_wr, &bad_wr);
 
     if (ret == ENOMEM) {
         DDPRINTF("send queue is full. wait a little....\n");
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
+        ret = qemu_rdma_block_for_wrid(rdma, lc,
+                                       RDMA_WRID_RDMA_WRITE_REMOTE, NULL);
         if (ret < 0) {
-            fprintf(stderr, "rdma migration: failed to make "
-                            "room in full send queue! %d\n", ret);
+            ERROR(NULL, "could not make room in full send queue! %d", ret);
             return ret;
         }
 
@@ -2059,80 +2263,67 @@ retry:
         return -ret;
     }
 
-    set_bit(chunk, block->transit_bitmap);
-    acct_update_position(f, sge.length, false);
-    rdma->total_writes++;
-
-    return 0;
-}
-
-/*
- * Push out any unwritten RDMA operations.
- *
- * We support sending out multiple chunks at the same time.
- * Not all of them need to get signaled in the completion queue.
- */
-static int qemu_rdma_write_flush(QEMUFile *f, RDMAContext *rdma)
-{
-    int ret;
+    set_bit(src->chunk_idx, src->block->transit_bitmap);
 
-    if (!rdma->current_length) {
-        return 0;
+    if (!dest) {
+        acct_update_position(f, sge.length, false);
     }
 
-    ret = qemu_rdma_write_one(f, rdma,
-            rdma->current_index, rdma->current_addr, rdma->current_length);
+    rdma->total_writes++;
+    rdma->nb_sent++;
+    lc->nb_sent++;
 
-    if (ret < 0) {
-        return ret;
-    }
+    DDDPRINTF("sent total: %d sent lc: %d (%s)\n",
+                rdma->nb_sent, lc->nb_sent, lc->id_str);
 
-    if (ret == 0) {
-        rdma->nb_sent++;
-        DDDPRINTF("sent total: %d\n", rdma->nb_sent);
-    }
+    src->current_length = 0;
+    src->current_addr = 0;
 
-    rdma->current_length = 0;
-    rdma->current_addr = 0;
+    if (dest) {
+        dest->current_length = 0;
+        dest->current_addr = 0;
+    }
 
     return 0;
 }
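
The completion path recovers (block, chunk) from the 64-bit wrid using the
masks shown earlier; a simplified restatement of the encode/decode pair to
make the round trip explicit:

    static uint64_t make_wrid(uint64_t type, uint64_t block, uint64_t chunk)
    {
        return type | (block << RDMA_WRID_BLOCK_SHIFT) |
                      (chunk << RDMA_WRID_CHUNK_SHIFT);
    }

    static void parse_wrid(uint64_t wrid, uint64_t *block, uint64_t *chunk)
    {
        *block = (wrid & RDMA_WRID_BLOCK_MASK) >> RDMA_WRID_BLOCK_SHIFT;
        *chunk = (wrid & RDMA_WRID_CHUNK_MASK) >> RDMA_WRID_CHUNK_SHIFT;
    }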
 
 static inline int qemu_rdma_buffer_mergable(RDMAContext *rdma,
-                    uint64_t offset, uint64_t len)
+                                            RDMACurrentChunk *cc,
+                                            uint64_t current_addr, 
+                                            uint64_t len)
 {
     RDMALocalBlock *block;
     uint8_t *host_addr;
     uint8_t *chunk_end;
 
-    if (rdma->current_index < 0) {
+    if (cc->current_block_idx < 0) {
         return 0;
     }
 
-    if (rdma->current_chunk < 0) {
+    if (cc->current_chunk < 0) {
         return 0;
     }
 
-    block = &(rdma->local_ram_blocks.block[rdma->current_index]);
-    host_addr = block->local_host_addr + (offset - block->offset);
-    chunk_end = ram_chunk_end(block, rdma->current_chunk);
+    block = &(rdma->local_ram_blocks.block[cc->current_block_idx]);
+    host_addr = block->local_host_addr + (current_addr - block->offset);
+    chunk_end = ram_chunk_end(block, cc->current_chunk);
 
-    if (rdma->current_length == 0) {
+    if (cc->current_length == 0) {
         return 0;
     }
 
     /*
      * Only merge into chunk sequentially.
      */
-    if (offset != (rdma->current_addr + rdma->current_length)) {
+    if (current_addr != (cc->current_addr + cc->current_length)) {
         return 0;
     }
 
-    if (offset < block->offset) {
+    if (current_addr < block->offset) {
         return 0;
     }
 
-    if ((offset + len) > (block->offset + block->length)) {
+    if ((current_addr + len) > (block->offset + block->length)) {
         return 0;
     }
 
@@ -2143,80 +2334,148 @@ static inline int qemu_rdma_buffer_mergable(RDMAContext *rdma,
     return 1;
 }
 
-/*
- * We're not actually writing here, but doing three things:
- *
- * 1. Identify the chunk the buffer belongs to.
- * 2. If the chunk is full or the buffer doesn't belong to the current
- *    chunk, then start a new chunk and flush() the old chunk.
- * 3. To keep the hardware busy, we also group chunks into batches
- *    and only require that a batch gets acknowledged in the completion
- *    qeueue instead of each individual chunk.
+static int write_start(RDMAContext *rdma,
+                        RDMACurrentChunk *cc,
+                        uint64_t len,
+                        uint64_t current_addr)
+{
+    int ret;
+    uint64_t block_idx, chunk;
+
+    cc->current_addr = current_addr;
+    block_idx = cc->current_block_idx;
+    chunk = cc->current_chunk;
+
+    ret = qemu_rdma_search_ram_block(rdma, cc->block_offset,
+                                     cc->offset, len, &block_idx, &chunk);
+    if (ret) {
+        ERROR(NULL, "ram block search failed");
+        return ret;
+    }
+
+    cc->current_block_idx = block_idx;
+    cc->current_chunk = chunk;
+
+    return 0;
+}
+
+/* 
+ * If we cannot merge it, we flush the current buffer first.
  */
-static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
-                           uint64_t block_offset, uint64_t offset,
-                           uint64_t len)
+static int qemu_rdma_flush_unmergable(RDMAContext *rdma,
+                                      RDMACurrentChunk *src,
+                                      RDMACurrentChunk *dest,
+                                      QEMUFile *f, uint64_t len)
 {
-    uint64_t current_addr = block_offset + offset;
-    uint64_t index = rdma->current_index;
-    uint64_t chunk = rdma->current_chunk;
+    uint64_t current_addr_src;
+    uint64_t current_addr_dest;
     int ret;
 
-    /* If we cannot merge it, we flush the current buffer first. */
-    if (!qemu_rdma_buffer_mergable(rdma, current_addr, len)) {
-        ret = qemu_rdma_write_flush(f, rdma);
-        if (ret) {
-            return ret;
+    current_addr_src = src->block_offset + src->offset;
+
+    if (dest) {
+        current_addr_dest = dest->block_offset + dest->offset;
+    }
+
+    if (qemu_rdma_buffer_mergable(rdma, src, current_addr_src, len)) {
+        if (dest) {
+            if (qemu_rdma_buffer_mergable(rdma, dest, current_addr_dest, len)) {
+                goto merge;
+            }
+        } else {
+            goto merge;
         }
-        rdma->current_length = 0;
-        rdma->current_addr = current_addr;
+    }
+
+    ret = qemu_rdma_write(f, rdma, src, dest);
+
+    if (ret) {
+        return ret;
+    }
+
+    ret = write_start(rdma, src, len, current_addr_src);
+
+    if (ret) {
+        return ret;
+    }
+
+    if (dest) {
+        ret = write_start(rdma, dest, len, current_addr_dest);
 
-        ret = qemu_rdma_search_ram_block(rdma, block_offset,
-                                         offset, len, &index, &chunk);
         if (ret) {
-            fprintf(stderr, "ram block search failed\n");
             return ret;
         }
-        rdma->current_index = index;
-        rdma->current_chunk = chunk;
     }
 
-    /* merge it */
-    rdma->current_length += len;
-
-    /* flush it if buffer is too large */
-    if (rdma->current_length >= RDMA_MERGE_MAX) {
-        return qemu_rdma_write_flush(f, rdma);
+merge:
+    src->current_length += len;
+    if (dest) {
+        dest->current_length += len;
     }
 
     return 0;
 }
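
The merge path above only coalesces a request that sequentially extends
the currently queued chunk and stays inside the same RAM block; anything
else forces a flush first. A minimal sketch of that predicate, using
simplified stand-in types rather than the real RDMACurrentChunk and
RDMALocalBlock structures (the constant name is also invented):

    #include <stdint.h>
    #include <stdbool.h>

    #define MERGE_MAX (4 * 1024 * 1024)   /* stand-in for RDMA_MERGE_MAX */

    /* Hypothetical, simplified view of a pending chunk. */
    struct pending {
        uint64_t addr;       /* guest address of first byte queued */
        uint64_t length;     /* bytes queued so far                */
        uint64_t blk_base;   /* start of the containing RAM block  */
        uint64_t blk_len;    /* length of the containing RAM block */
    };

    /* True if [addr, addr+len) simply extends the pending chunk. */
    static bool mergable(const struct pending *p, uint64_t addr,
                         uint64_t len)
    {
        if (p->length == 0) {
            return false;                   /* nothing queued yet   */
        }
        if (addr != p->addr + p->length) {
            return false;                   /* not sequential       */
        }
        if (addr < p->blk_base ||
            addr + len > p->blk_base + p->blk_len) {
            return false;                   /* leaves the RAM block */
        }
        return p->length + len < MERGE_MAX; /* oversize => flush    */
    }

The real code additionally tracks block/chunk indices and leaves the
RDMA_MERGE_MAX flush decision to the save_page()/copy_page() callers.
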
 
-static void qemu_rdma_cleanup(RDMAContext *rdma)
+static void disconnect_ibv(RDMAContext *rdma, RDMALocalContext *lc, bool force)
 {
     struct rdma_cm_event *cm_event;
-    int ret, idx;
+    int ret;
 
-    if (rdma->cm_id && rdma->connected) {
-        if (rdma->error_state) {
+    if (!lc->cm_id || !lc->connected) {
+        return;
+    }
+
+    if ((lc == (&rdma->lc_remote)) && rdma->error_state) {
+        if (rdma->error_state != -ENETUNREACH) {
             RDMAControlHeader head = { .len = 0,
                                        .type = RDMA_CONTROL_ERROR,
                                        .repeat = 1,
                                      };
             fprintf(stderr, "Early error. Sending error.\n");
             qemu_rdma_post_send_control(rdma, NULL, &head);
+        } else {
+            fprintf(stderr, "Early error.\n");
+            rdma_disconnect(lc->cm_id);
+            goto finish;
         }
+    }
 
-        ret = rdma_disconnect(rdma->cm_id);
+    ret = rdma_disconnect(lc->cm_id);
+    if (!ret && !force) {
+        DDPRINTF("waiting for disconnect\n");
+        ret = rdma_get_cm_event(lc->channel, &cm_event);
         if (!ret) {
-            DDPRINTF("waiting for disconnect\n");
-            ret = rdma_get_cm_event(rdma->channel, &cm_event);
-            if (!ret) {
-                rdma_ack_cm_event(cm_event);
-            }
+            rdma_ack_cm_event(cm_event);
         }
-        DDPRINTF("Disconnected.\n");
-        rdma->connected = false;
+    }
+
+finish:
+
+    DDPRINTF("Disconnected.\n");
+    lc->verbs = NULL;
+    lc->connected = false;
+}
+
+static void qemu_rdma_cleanup(RDMAContext *rdma, bool force)
+{
+    int idx;
+
+    if (connection_timer) {
+        timer_del(connection_timer);
+        timer_free(connection_timer);
+        connection_timer = NULL;
+    }
+
+    if (keepalive_timer) {
+        timer_del(keepalive_timer);
+        timer_free(keepalive_timer);
+        keepalive_timer = NULL;
+    }
+
+    disconnect_ibv(rdma, &rdma->lc_remote, force);
+    if (migrate_use_mc_rdma_copy()) {
+        disconnect_ibv(rdma, &rdma->lc_src, force);
+        disconnect_ibv(rdma, &rdma->lc_dest, force);
     }
 
     g_free(rdma->block);
@@ -2237,40 +2496,192 @@ static void qemu_rdma_cleanup(RDMAContext *rdma)
         }
     }
 
-    if (rdma->qp) {
-        rdma_destroy_qp(rdma->cm_id);
-        rdma->qp = NULL;
+    close_ibv(rdma, &rdma->lc_remote);
+    if (migrate_use_mc_rdma_copy()) {
+        close_ibv(rdma, &rdma->lc_src);
+        close_ibv(rdma, &rdma->lc_dest);
+    }
+
+    if (rdma->keepalive_mr) {
+        ibv_dereg_mr(rdma->keepalive_mr);
+        rdma->keepalive_mr = NULL;
+    }
+    if (rdma->next_keepalive_mr) {
+        ibv_dereg_mr(rdma->next_keepalive_mr);
+        rdma->next_keepalive_mr = NULL;
+    }
+}
+
+static int qemu_rdma_device_init(RDMAContext *rdma, Error **errp,
+                                 RDMALocalContext *lc)
+{
+    struct rdma_cm_event *cm_event;
+    int ret = -EINVAL;
+    char ip[40] = "unknown";
+    struct rdma_addrinfo *res;
+    char port_str[16];
+
+    if (lc->host == NULL) {
+        ERROR(errp, "RDMA host is not set!");
+        SET_ERROR(rdma, -EINVAL);
+        return -1;
+    }
+
+    /* create CM channel */
+    lc->channel = rdma_create_event_channel();
+    if (!lc->channel) {
+        ERROR(errp, "could not create rdma event channel (%s)", lc->id_str);
+        SET_ERROR(rdma, -EINVAL);
+        return -1;
     }
-    if (rdma->cq) {
-        ibv_destroy_cq(rdma->cq);
-        rdma->cq = NULL;
+
+    /* create CM id */
+    if (lc->listen_id) {
+        lc->cm_id = lc->listen_id;
+    } else {
+        ret = rdma_create_id(lc->channel, &lc->cm_id, NULL, RDMA_PS_TCP);
+        if (ret) {
+            ERROR(errp, "could not create cm_id! (%s)", lc->id_str);
+            goto err_device_init_create_id;
+        }
     }
-    if (rdma->comp_channel) {
-        ibv_destroy_comp_channel(rdma->comp_channel);
-        rdma->comp_channel = NULL;
+
+    snprintf(port_str, 16, "%d", lc->port);
+    port_str[15] = '\0';
+
+    if (lc->host && strcmp("", lc->host)) {
+        struct rdma_addrinfo *e;
+
+        ret = rdma_getaddrinfo(lc->host, port_str, NULL, &res);
+        if (ret < 0) {
+            ERROR(errp, "could not rdma_getaddrinfo address %s (%s)", 
+                        lc->host, lc->id_str);
+            goto err_device_init_bind_addr;
+        }
+
+        for (e = res; e != NULL; e = e->ai_next) {
+            inet_ntop(e->ai_family,
+                &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
+            DPRINTF("Trying %s => %s (port %s) (%s)\n", lc->host,
+                                                        ip, port_str,
+                                                        lc->id_str);
+
+            if (lc->dest) {
+                ret = rdma_bind_addr(lc->cm_id, e->ai_dst_addr);
+            } else {
+                ret = rdma_resolve_addr(lc->cm_id, NULL, e->ai_dst_addr,
+                    RDMA_RESOLVE_TIMEOUT_MS);
+            }
+            if (!ret) {
+                if (e->ai_family == AF_INET6) {
+                    ret = qemu_rdma_broken_ipv6_kernel(errp, lc->cm_id->verbs);
+                    if (ret) {
+                        continue;
+                    }
+                }
+                    
+                goto next;
+            }
+        }
+
+        ERROR(errp, "initialize/bind/resolve device! (%s)", lc->id_str);
+        goto err_device_init_bind_addr;
+    } else {
+        ERROR(errp, "migration host and port not specified! (%s)", lc->id_str);
+        ret = -EINVAL;
+        goto err_device_init_bind_addr;
     }
-    if (rdma->pd) {
-        ibv_dealloc_pd(rdma->pd);
-        rdma->pd = NULL;
+next:
+    qemu_rdma_dump_gid("device_init", lc->cm_id);
+
+    if (lc->source) {
+        ret = rdma_get_cm_event(lc->channel, &cm_event);
+        if (ret) {
+            ERROR(errp, "could not perform event_addr_resolved (%s)", lc->id_str);
+            goto err_device_init_bind_addr;
+        }
+
+        if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
+            ERROR(errp, "result not equal to event_addr_resolved %s (%s)",
+                    rdma_event_str(cm_event->event), lc->id_str);
+            perror("rdma_resolve_addr");
+            ret = -EINVAL;
+            goto err_device_init_bind_addr;
+        }
+
+        rdma_ack_cm_event(cm_event);
+
+        /* resolve route */
+        ret = rdma_resolve_route(lc->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
+        if (ret) {
+            ERROR(errp, "could not resolve rdma route");
+            goto err_device_init_bind_addr;
+        }
+
+        ret = rdma_get_cm_event(lc->channel, &cm_event);
+        if (ret) {
+            ERROR(errp, "could not perform event_route_resolved");
+            goto err_device_init_bind_addr;
+        }
+
+        if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
+            ERROR(errp, "result not equal to event_route_resolved: %s",
+                            rdma_event_str(cm_event->event));
+            rdma_ack_cm_event(cm_event);
+            ret = -EINVAL;
+            goto err_device_init_bind_addr;
+        }
+
+        lc->verbs = lc->cm_id->verbs;
+        printf("verbs: %p (%s)\n", lc->verbs, lc->id_str);
+
+        rdma_ack_cm_event(cm_event);
+
+        ret = qemu_rdma_alloc_pd_cq_qp(rdma, lc);
+        if (ret) {
+            goto err_device_init_bind_addr;
+        }
+
+        qemu_rdma_dump_id("rdma_accept_start", lc->verbs);
+    } else {
+        lc->listen_id = lc->cm_id;
+        lc->cm_id = NULL;
+
+        ret = rdma_listen(lc->listen_id, 1);
+
+        if (ret) {
+            perror("rdma_listen");
+            ERROR(errp, "listening on socket! (%s)", lc->id_str);
+            goto err_device_init_bind_addr;
+        }
+
+        DPRINTF("rdma_listen success\n");
     }
-    if (rdma->listen_id) {
-        rdma_destroy_id(rdma->listen_id);
-        rdma->listen_id = NULL;
+
+    DPRINTF("qemu_rdma_device_init success\n");
+    return 0;
+
+err_device_init_bind_addr:
+    if (lc->cm_id) {
+        rdma_destroy_id(lc->cm_id);
+        lc->cm_id = NULL;
     }
-    if (rdma->cm_id) {
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
+    if (lc->listen_id) {
+        rdma_destroy_id(lc->listen_id);
+        lc->listen_id = NULL;
     }
-    if (rdma->channel) {
-        rdma_destroy_event_channel(rdma->channel);
-        rdma->channel = NULL;
+err_device_init_create_id:
+    if (lc->channel) {
+        rdma_destroy_event_channel(lc->channel);
+        lc->channel = NULL;
     }
-    g_free(rdma->host);
-    rdma->host = NULL;
+    SET_ERROR(rdma, ret);
+    return ret;
 }
 
-
-static int qemu_rdma_source_init(RDMAContext *rdma, Error **errp, bool pin_all)
+static int qemu_rdma_init_outgoing(RDMAContext *rdma,
+                                 Error **errp,
+                                 MigrationState *s)
 {
     int ret, idx;
     Error *local_err = NULL, **temp = &local_err;
@@ -2279,107 +2690,157 @@ static int qemu_rdma_source_init(RDMAContext *rdma, Error **errp, bool pin_all)
      * Will be validated against destination's actual capabilities
      * after the connect() completes.
      */
-    rdma->pin_all = pin_all;
+    rdma->pin_all = s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
+    rdma->do_keepalive = s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_KEEPALIVE];
 
-    ret = qemu_rdma_resolve_host(rdma, temp);
-    if (ret) {
-        goto err_rdma_source_init;
+    for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
+        rdma->wr_data[idx].control_len = 0;
+        rdma->wr_data[idx].control_curr = NULL;
     }
 
-    ret = qemu_rdma_alloc_pd_cq(rdma);
+    rdma->source = true;
+    rdma->dest = false;
+    rdma->lc_remote.source = true;
+    rdma->lc_remote.dest = false;
+
+    ret = qemu_rdma_device_init(rdma, temp, &rdma->lc_remote);
     if (ret) {
-        ERROR(temp, "rdma migration: error allocating pd and cq! Your mlock()"
-                    " limits may be too low. Please check $ ulimit -a # and "
-                    "search for 'ulimit -l' in the output");
-        goto err_rdma_source_init;
+        goto err_rdma_init_outgoing;
     }
 
-    ret = qemu_rdma_alloc_qp(rdma);
+    ret = qemu_rdma_reg_keepalive(rdma);
+
     if (ret) {
-        ERROR(temp, "rdma migration: error allocating qp!");
-        goto err_rdma_source_init;
+        ERROR(temp, "allocating keepalive structures");
+        goto err_rdma_init_outgoing;
     }
 
     ret = qemu_rdma_init_ram_blocks(rdma);
     if (ret) {
-        ERROR(temp, "rdma migration: error initializing ram blocks!");
-        goto err_rdma_source_init;
+        ERROR(temp, "initializing ram blocks!");
+        goto err_rdma_init_outgoing;
     }
 
     for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
         ret = qemu_rdma_reg_control(rdma, idx);
         if (ret) {
-            ERROR(temp, "rdma migration: error registering %d control!",
-                                                            idx);
-            goto err_rdma_source_init;
+            ERROR(temp, "registering %d control!", idx);
+            goto err_rdma_init_outgoing;
         }
     }
 
     return 0;
 
-err_rdma_source_init:
+err_rdma_init_outgoing:
     error_propagate(errp, local_err);
-    qemu_rdma_cleanup(rdma);
+    qemu_rdma_cleanup(rdma, false);
     return -1;
 }
 
-static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
+static int qemu_rdma_connect_finish(RDMAContext *rdma,
+                                    RDMALocalContext *lc,
+                                    Error **errp,
+                                    struct rdma_cm_event **return_event)
 {
-    RDMACapabilities cap = {
-                                .version = RDMA_CONTROL_VERSION_CURRENT,
-                                .flags = 0,
-                           };
-    struct rdma_conn_param conn_param = { .initiator_depth = 2,
-                                          .retry_count = 5,
-                                          .private_data = &cap,
-                                          .private_data_len = sizeof(cap),
-                                        };
+    int ret = 0;
     struct rdma_cm_event *cm_event;
-    int ret;
 
-    /*
-     * Only negotiate the capability with destination if the user
-     * on the source first requested the capability.
-     */
-    if (rdma->pin_all) {
-        DPRINTF("Server pin-all memory requested.\n");
+    ret = rdma_get_cm_event(lc->channel, &cm_event);
+    if (ret) {
+        perror("rdma_get_cm_event after rdma_connect");
+        goto err;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        perror("rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect");
+        rdma_ack_cm_event(cm_event);
+        ret = -1;
+        goto err;
+    }
+
+    /*
+     * The rdmacm "private data area" may contain information from the receiver,
+     * just as we may have done the same from the sender side. If so, we cannot
+     * ack this CM event until we have processed/copied this small data
+     * out of the cm_event structure, otherwise, the ACK will free the structure
+     * and we will lose the data.
+     *
+     * Thus, we allow the caller to ACK this event if there is important
+     * information inside. Otherwise, we will ACK by ourselves.
+     */
+    if (return_event) {
+        *return_event = cm_event;
+    } else {
+        rdma_ack_cm_event(cm_event);
+    }
+
+    lc->connected = true;
+
+    return 0;
+err:
+    ERROR(errp, "connecting to destination!");
+    rdma_destroy_id(lc->cm_id);
+    lc->cm_id = NULL;
+    return ret;
+}
+
+static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
+{
+    RDMACapabilities cap = {
+                                .version = RDMA_CONTROL_VERSION_CURRENT,
+                                .flags = 0,
+                                .keepalive_rkey = rdma->keepalive_mr->rkey,
+                                .keepalive_addr = (uint64_t) &rdma->keepalive,
+                           };
+    struct rdma_conn_param conn_param = { .initiator_depth = 2,
+                                          .retry_count = 5,
+                                          .private_data = &cap,
+                                          .private_data_len = sizeof(cap),
+                                        };
+
+    struct rdma_cm_event *cm_event = NULL;
+    int ret;
+
+    /*
+     * Only negotiate the capability with destination if the user
+     * on the source first requested the capability.
+     */
+    if (rdma->pin_all) {
+        DPRINTF("Server pin-all memory requested.\n");
         cap.flags |= RDMA_CAPABILITY_PIN_ALL;
     }
 
+    if (rdma->do_keepalive) {
+        DPRINTF("Keepalives requested.\n");
+        cap.flags |= RDMA_CAPABILITY_KEEPALIVE;
+    }
+
+    DDPRINTF("Sending keepalive params: key %x addr: %" PRIx64 "\n",
+            cap.keepalive_rkey, cap.keepalive_addr);
     caps_to_network(&cap);
 
-    ret = rdma_connect(rdma->cm_id, &conn_param);
+    ret = rdma_connect(rdma->lc_remote.cm_id, &conn_param);
     if (ret) {
         perror("rdma_connect");
-        ERROR(errp, "connecting to destination!");
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
         goto err_rdma_source_connect;
     }
 
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret) {
-        perror("rdma_get_cm_event after rdma_connect");
-        ERROR(errp, "connecting to destination!");
-        rdma_ack_cm_event(cm_event);
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
-        goto err_rdma_source_connect;
-    }
+    ret = qemu_rdma_connect_finish(rdma, &rdma->lc_remote, errp, &cm_event);
 
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        perror("rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect");
-        ERROR(errp, "connecting to destination!");
-        rdma_ack_cm_event(cm_event);
-        rdma_destroy_id(rdma->cm_id);
-        rdma->cm_id = NULL;
+    if (ret) {
         goto err_rdma_source_connect;
     }
-    rdma->connected = true;
 
     memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
     network_to_caps(&cap);
 
+    rdma->keepalive_rkey = cap.keepalive_rkey;
+    rdma->keepalive_addr = cap.keepalive_addr;
+
+    DDPRINTF("Received keepalive params: key %x addr: %" PRIx64 "\n",
+            cap.keepalive_rkey, cap.keepalive_addr);
+
     /*
      * Verify that the *requested* capabilities are supported by the destination
      * and disable them otherwise.
@@ -2390,7 +2851,14 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
         rdma->pin_all = false;
     }
 
+    if (rdma->do_keepalive && !(cap.flags & RDMA_CAPABILITY_KEEPALIVE)) {
+        ERROR(errp, "Server cannot support keepalives. "
+                        "Will not check for them.");
+        rdma->do_keepalive = false;
+    }
+
     DPRINTF("Pin all memory: %s\n", rdma->pin_all ? "enabled" : "disabled");
+    DPRINTF("Keepalives: %s\n", rdma->do_keepalive ? "enabled" : "disabled");
 
     rdma_ack_cm_event(cm_event);
 
@@ -2405,93 +2873,115 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
     return 0;
 
 err_rdma_source_connect:
-    qemu_rdma_cleanup(rdma);
-    return -1;
+    SET_ERROR(rdma, ret);
+    qemu_rdma_cleanup(rdma, false);
+    return rdma->error_state;
 }
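
As the comment in qemu_rdma_connect_finish() notes, the capability
exchange piggy-backs on the rdmacm private data area, and the peer's
answer must be copied out of the CM event before the event is acked.
A reduced sketch of the active side; 'struct caps' is a hypothetical
stand-in for RDMACapabilities, the librdmacm calls are real:

    #include <stdint.h>
    #include <string.h>
    #include <rdma/rdma_cma.h>

    /* Hypothetical wire format; the real struct differs. */
    struct caps {
        uint32_t version;
        uint32_t flags;     /* e.g. PIN_ALL / KEEPALIVE bits */
        uint32_t rkey;      /* keepalive buffer rkey         */
        uint64_t addr;      /* keepalive buffer address      */
    } __attribute__((packed));

    static int connect_with_caps(struct rdma_cm_id *id,
                                 struct rdma_event_channel *ch,
                                 struct caps *mine, struct caps *theirs)
    {
        struct rdma_conn_param cp = {
            .initiator_depth  = 2,
            .retry_count      = 5,
            .private_data     = mine,
            .private_data_len = sizeof(*mine),
        };
        struct rdma_cm_event *ev;

        if (rdma_connect(id, &cp)) {
            return -1;
        }
        if (rdma_get_cm_event(ch, &ev)) {
            return -1;
        }
        if (ev->event != RDMA_CM_EVENT_ESTABLISHED) {
            rdma_ack_cm_event(ev);
            return -1;
        }
        /* Copy the reply *before* acking: the ack frees the event. */
        memcpy(theirs, ev->param.conn.private_data, sizeof(*theirs));
        rdma_ack_cm_event(ev);
        return 0;
    }
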
 
-static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
+static void send_keepalive(void *opaque)
 {
-    int ret = -EINVAL, idx;
-    struct rdma_cm_id *listen_id;
-    char ip[40] = "unknown";
-    struct rdma_addrinfo *res;
-    char port_str[16];
+    RDMAContext *rdma = opaque;
+    struct ibv_sge sge;
+    struct ibv_send_wr send_wr = { 0 };
+    struct ibv_send_wr *bad_wr;
+    int ret;
 
-    for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
-        rdma->wr_data[idx].control_len = 0;
-        rdma->wr_data[idx].control_curr = NULL;
+    if (!rdma->migration_started) {
+        goto reset;
     }
 
-    if (rdma->host == NULL) {
-        ERROR(errp, "RDMA host is not set!");
-        rdma->error_state = -EINVAL;
-        return -1;
-    }
-    /* create CM channel */
-    rdma->channel = rdma_create_event_channel();
-    if (!rdma->channel) {
-        ERROR(errp, "could not create rdma event channel");
-        rdma->error_state = -EINVAL;
-        return -1;
-    }
+    rdma->next_keepalive++;
+retry:
 
-    /* create CM id */
-    ret = rdma_create_id(rdma->channel, &listen_id, NULL, RDMA_PS_TCP);
-    if (ret) {
-        ERROR(errp, "could not create cm_id!");
-        goto err_dest_init_create_listen_id;
+    sge.addr = (uint64_t) &rdma->next_keepalive;
+    sge.length = sizeof(rdma->next_keepalive);
+    sge.lkey = rdma->next_keepalive_mr->lkey;
+    send_wr.wr_id = RDMA_WRID_RDMA_KEEPALIVE;
+    send_wr.opcode = IBV_WR_RDMA_WRITE;
+    send_wr.send_flags = 0;
+    send_wr.sg_list = &sge;
+    send_wr.num_sge = 1;
+    send_wr.wr.rdma.remote_addr = rdma->keepalive_addr;
+    send_wr.wr.rdma.rkey = rdma->keepalive_rkey;
+
+    DDPRINTF("Posting keepalive: addr: %lx"
+              " remote: %lx, bytes %" PRIu32 "\n",
+              sge.addr, send_wr.wr.rdma.remote_addr, sge.length);
+
+    ret = ibv_post_send(rdma->lc_remote.qp, &send_wr, &bad_wr);
+
+    if (ret == ENOMEM) {
+        DPRINTF("send queue is full. wait a little....\n");
+        g_usleep(RDMA_KEEPALIVE_INTERVAL_MS * 1000);
+        goto retry;
+    } else if (ret > 0) {
+        perror("rdma migration: post keepalive");
+        SET_ERROR(rdma, -ret);
+        return;
     }
 
-    snprintf(port_str, 16, "%d", rdma->port);
-    port_str[15] = '\0';
+reset:
+    timer_mod(keepalive_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
+                    RDMA_KEEPALIVE_INTERVAL_MS);
+}
 
-    if (rdma->host && strcmp("", rdma->host)) {
-        struct rdma_addrinfo *e;
+static void check_qp_state(void *opaque)
+{
+    RDMAContext *rdma = opaque;
+    int first_missed = 0;
 
-        ret = rdma_getaddrinfo(rdma->host, port_str, NULL, &res);
-        if (ret < 0) {
-            ERROR(errp, "could not rdma_getaddrinfo address %s", rdma->host);
-            goto err_dest_init_bind_addr;
+    if (!rdma->migration_started) {
+        goto reset;
+    }
+
+    if (rdma->last_keepalive == rdma->keepalive) {
+        rdma->nb_missed_keepalive++;
+        if (rdma->nb_missed_keepalive == 1) {
+            first_missed = RDMA_KEEPALIVE_FIRST_MISSED_OFFSET;
+            DDPRINTF("Setting first missed additional delay\n");
+        } else {
+            DPRINTF("WARN: missed keepalive: %" PRIu64 "\n",
+                        rdma->nb_missed_keepalive);
         }
+    } else {
+        rdma->keepalive_startup = true;
+        rdma->nb_missed_keepalive = 0;
+    }
 
-        for (e = res; e != NULL; e = e->ai_next) {
-            inet_ntop(e->ai_family,
-                &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
-            DPRINTF("Trying %s => %s\n", rdma->host, ip);
-            ret = rdma_bind_addr(listen_id, e->ai_dst_addr);
-            if (!ret) {
-                if (e->ai_family == AF_INET6) {
-                    ret = qemu_rdma_broken_ipv6_kernel(errp, listen_id->verbs);
-                    if (ret) {
-                        continue;
-                    }
-                }
-                    
-                goto listen;
+    rdma->last_keepalive = rdma->keepalive;
+
+    if (rdma->keepalive_startup) {
+        if (rdma->nb_missed_keepalive > RDMA_MAX_LOST_KEEPALIVE) {
+            struct ibv_qp_attr attr = {.qp_state = IBV_QPS_ERR };
+            SET_ERROR(rdma, -ENETUNREACH);
+            ERROR(NULL, "peer keepalive failed.");
+             
+            if (ibv_modify_qp(rdma->lc_remote.qp, &attr, IBV_QP_STATE)) {
+                ERROR(NULL, "failed to modify QP to ERR");
+                return;
             }
+            return;
         }
-
-        ERROR(errp, "Error: could not rdma_bind_addr!");
-        goto err_dest_init_bind_addr;
+    } else if (rdma->nb_missed_keepalive < RDMA_MAX_STARTUP_MISSED_KEEPALIVE) {
+        DDPRINTF("Keepalive startup waiting: %" PRIu64 "\n",
+                rdma->nb_missed_keepalive);
     } else {
-        ERROR(errp, "migration host and port not specified!");
-        ret = -EINVAL;
-        goto err_dest_init_bind_addr;
+        DDPRINTF("Keepalive startup too long.\n");
+        rdma->keepalive_startup = true;
     }
-listen:
-
-    rdma->listen_id = listen_id;
-    qemu_rdma_dump_gid("dest_init", listen_id);
-    return 0;
 
-err_dest_init_bind_addr:
-    rdma_destroy_id(listen_id);
-err_dest_init_create_listen_id:
-    rdma_destroy_event_channel(rdma->channel);
-    rdma->channel = NULL;
-    rdma->error_state = ret;
-    return ret;
+reset:
+    timer_mod(connection_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
+                    RDMA_KEEPALIVE_INTERVAL_MS + first_missed);
+}
 
+static void qemu_rdma_keepalive_start(void)
+{
+    DPRINTF("Starting up keepalives....\n");
+    timer_mod(connection_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 
+                    RDMA_CONNECTION_INTERVAL_MS);
+    timer_mod(keepalive_timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
+                    RDMA_KEEPALIVE_INTERVAL_MS);
 }
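
The keepalive is one-sided by design: the sender RDMA-writes a
monotonically increasing counter into the buffer whose rkey/address
were exchanged at connect time, and the watchdog merely checks whether
the value that landed locally has advanced. A hedged sketch of both
halves; 'struct ka' is invented for illustration, the verbs calls and
work-request fields are real:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    struct ka {
        struct ibv_qp *qp;
        struct ibv_mr *mr;      /* registers &counter locally  */
        uint64_t counter;       /* value we push to the peer   */
        uint64_t remote_addr;   /* peer's keepalive buffer     */
        uint32_t remote_rkey;
        uint64_t landed;        /* what the peer wrote into us */
        uint64_t last_seen;
        unsigned missed;
    };

    static int ka_send(struct ka *k)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)&k->counter,
            .length = sizeof(k->counter),
            .lkey   = k->mr->lkey,
        };
        struct ibv_send_wr wr = {
            .opcode              = IBV_WR_RDMA_WRITE,
            .sg_list             = &sge,
            .num_sge             = 1,
            .wr.rdma.remote_addr = k->remote_addr,
            .wr.rdma.rkey        = k->remote_rkey,
        }, *bad;

        k->counter++;
        return ibv_post_send(k->qp, &wr, &bad);
    }

    /* Run from a timer: no progress for too long => peer is gone. */
    static int ka_check(struct ka *k, unsigned max_missed)
    {
        k->missed = (k->landed == k->last_seen) ? k->missed + 1 : 0;
        k->last_seen = k->landed;
        return k->missed > max_missed ? -1 : 0;
    }
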
 
 static void *qemu_rdma_data_init(const char *host_port, Error **errp)
@@ -2502,19 +2992,32 @@ static void *qemu_rdma_data_init(const char *host_port, Error **errp)
     if (host_port) {
         rdma = g_malloc0(sizeof(RDMAContext));
         memset(rdma, 0, sizeof(RDMAContext));
-        rdma->current_index = -1;
-        rdma->current_chunk = -1;
+        rdma->chunk_remote.current_block_idx = -1;
+        rdma->chunk_remote.current_chunk = -1;
+        rdma->chunk_local_src.current_block_idx = -1;
+        rdma->chunk_local_src.current_chunk = -1;
+        rdma->chunk_local_dest.current_block_idx = -1;
+        rdma->chunk_local_dest.current_chunk = -1;
 
         addr = inet_parse(host_port, NULL);
         if (addr != NULL) {
-            rdma->port = atoi(addr->port);
-            rdma->host = g_strdup(addr->host);
+            rdma->lc_remote.port = atoi(addr->port);
+            rdma->lc_remote.host = g_strdup(addr->host);
         } else {
             ERROR(errp, "bad RDMA migration address '%s'", host_port);
             g_free(rdma);
-            return NULL;
+            rdma = NULL;
         }
+
+        qapi_free_InetSocketAddress(addr);
     }
+
+    if (!rdma) {
+        return NULL;
+    }
+
+    rdma->keepalive_startup = false;
+    connection_timer = timer_new_ms(QEMU_CLOCK_REALTIME, check_qp_state, rdma);
+    keepalive_timer = timer_new_ms(QEMU_CLOCK_REALTIME, send_keepalive, rdma);
+    rdma->lc_dest.id_str = "local destination";
+    rdma->lc_src.id_str = "local src";
+    rdma->lc_remote.id_str = "remote";
 
     return rdma;
 }
@@ -2540,9 +3043,9 @@ static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf,
      * Push out any writes that
      * we're queued up for pc.ram.
      */
-    ret = qemu_rdma_write_flush(f, rdma);
+    ret = qemu_rdma_write(f, rdma, &rdma->chunk_remote, NULL);
     if (ret < 0) {
-        rdma->error_state = ret;
+        SET_ERROR(rdma, ret);
         return ret;
     }
 
@@ -2558,7 +3061,7 @@ static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf,
         ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL, NULL);
 
         if (ret < 0) {
-            rdma->error_state = ret;
+            SET_ERROR(rdma, ret);
             return ret;
         }
 
@@ -2618,7 +3121,7 @@ static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf,
     ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_QEMU_FILE);
 
     if (ret < 0) {
-        rdma->error_state = ret;
+        SET_ERROR(rdma, ret);
         return ret;
     }
 
@@ -2631,18 +3134,23 @@ static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf,
 /*
  * Block until all the outstanding chunks have been delivered by the hardware.
  */
-static int qemu_rdma_drain_cq(QEMUFile *f, RDMAContext *rdma)
+static int qemu_rdma_drain_cq(QEMUFile *f, RDMAContext *rdma,
+                              RDMACurrentChunk *src,
+                              RDMACurrentChunk *dest)
 {
     int ret;
+    RDMALocalContext *lc = (migrate_use_mc_rdma_copy() && dest && dest != src) ? 
+            (rdma->source ? &rdma->lc_src : &rdma->lc_dest) : &rdma->lc_remote;
 
-    if (qemu_rdma_write_flush(f, rdma) < 0) {
+    if (qemu_rdma_write(f, rdma, src, dest) < 0) {
         return -EIO;
     }
 
-    while (rdma->nb_sent) {
-        ret = qemu_rdma_block_for_wrid(rdma, RDMA_WRID_RDMA_WRITE, NULL);
+    while (lc->nb_sent) {
+        ret = qemu_rdma_block_for_wrid(rdma, lc,
+                                       RDMA_WRID_RDMA_WRITE_REMOTE, NULL);
         if (ret < 0) {
-            fprintf(stderr, "rdma migration: complete polling error!\n");
+            ERROR(NULL, "complete polling!");
             return -EIO;
         }
     }
@@ -2657,13 +3165,190 @@ static int qemu_rdma_close(void *opaque)
     DPRINTF("Shutting down connection.\n");
     QEMUFileRDMA *r = opaque;
     if (r->rdma) {
-        qemu_rdma_cleanup(r->rdma);
+        qemu_rdma_cleanup(r->rdma, false);
         g_free(r->rdma);
     }
     g_free(r);
     return 0;
 }
 
+static int qemu_rdma_instruct_unregister(RDMAContext *rdma, QEMUFile *f,
+                                         ram_addr_t block_offset,
+                                         ram_addr_t offset, long size)
+{
+    int ret;
+    uint64_t block, chunk;
+
+    if (size < 0) {
+        ret = qemu_rdma_drain_cq(f, rdma, &rdma->chunk_remote, NULL);
+        if (ret < 0) {
+            fprintf(stderr, "rdma: failed to synchronously drain"
+                            " completion queue before unregistration.\n");
+            return ret;
+        }
+    }
+
+    ret = qemu_rdma_search_ram_block(rdma, block_offset, 
+                                     offset, size, &block, &chunk);
+
+    if (ret) {
+        fprintf(stderr, "ram block search failed\n");
+        return ret;
+    }
+
+    qemu_rdma_signal_unregister(rdma, block, chunk, 0);
+
+    /*
+     * Synchronous, guaranteed unregistration (should not occur during
+     * the fast path). Otherwise, unregisters are processed on the next call to
+     * qemu_rdma_drain_cq()
+     */
+    if (size < 0) {
+        qemu_rdma_unregister_waiting(rdma);
+    }
+
+    return 0;
+}
+
+
+static int qemu_rdma_poll_until_empty(RDMAContext *rdma, RDMALocalContext *lc)
+{
+    uint64_t wr_id, wr_id_in;
+    int ret;
+
+    /*
+     * Drain the Completion Queue if possible, but do not block,
+     * just poll.
+     *
+     * If nothing to poll, the end of the iteration will do this
+     * again to make sure we don't overflow the request queue.
+     */
+    while (1) {
+        ret = qemu_rdma_poll(rdma, lc, &wr_id_in, NULL);
+        if (ret < 0) {
+            ERROR(NULL, "empty polling error! %d", ret);
+            return ret;
+        }
+
+        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
+
+        if (wr_id == RDMA_WRID_NONE) {
+            break;
+        }
+    }
+
+    return 0;
+}
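
qemu_rdma_poll() bottoms out in ibv_poll_cq(), so the "drain but never
block" behaviour above reduces to the following pattern (illustrative
only; the per-wr_id bookkeeping the real code does is elided):

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Drain completions without blocking; < 0 on error. */
    static int drain_cq_nonblock(struct ibv_cq *cq)
    {
        struct ibv_wc wc;
        int n;

        while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
            if (wc.status != IBV_WC_SUCCESS) {
                fprintf(stderr, "wr %llu failed: %s\n",
                        (unsigned long long)wc.wr_id,
                        ibv_wc_status_str(wc.status));
                return -1;
            }
            /* match wc.wr_id against outstanding writes here */
        }
        return n;   /* 0: queue empty, < 0: poll error */
    }
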
+
+/*
+ * Parameters:
+ *    @offset_{source|dest} == 0 :
+ *        This means that 'block_offset' is a full virtual address that does not
+ *        belong to a RAMBlock of the virtual machine and instead
+ *        represents a private malloc'd memory area that the caller wishes to
+ *        transfer. Source and dest can be different (either real RAMBlocks or
+ *        private).
+ *
+ *    @offset != 0 :
+ *        Offset is an offset to be added to block_offset and used
+ *        to also look up the corresponding RAMBlock. Source and dest can be different
+ *        (either real RAMBlocks or private).
+ *
+ *    @size > 0 :
+ *        Amount of memory to copy locally using RDMA.
+ *
+ *    @size == 0 :
+ *        A 'hint' or 'advice' that means that we wish to speculatively
+ *        and asynchronously unregister either the source or destination memory.
+ *        In this case, there is no guarantee that the unregister will actually happen,
+ *        for example, if the memory is being actively copied. Additionally, the memory
+ *        may be re-registered at any future time if a copy within the same
+ *        range was requested again, even if you attempted to unregister it here.
+ *
+ *    @size < 0 : TODO, not yet supported
+ *        Unregister the memory NOW. This means that the caller does not
+ *        expect there to be any future RDMA copies and we just want to clean
+ *        things up. This is used in case the upper layer owns the memory and
+ *        cannot wait for qemu_fclose() to occur.
+ */
+static int qemu_rdma_copy_page(QEMUFile *f, void *opaque,
+                                  ram_addr_t block_offset_dest,
+                                  ram_addr_t offset_dest,
+                                  ram_addr_t block_offset_source,
+                                  ram_addr_t offset_source,
+                                  long size)
+{
+    QEMUFileRDMA *rfile = opaque;
+    RDMAContext *rdma = rfile->rdma;
+    int ret;
+    RDMACurrentChunk *src = &rdma->chunk_local_src;
+    RDMACurrentChunk *dest = &rdma->chunk_local_dest;
+
+    CHECK_ERROR_STATE();
+
+    qemu_fflush(f);
+
+    if (size > 0) {
+        /*
+         * Add this page to the current 'chunk'. If the chunk
+         * is full, or the page doesn't belong to the current chunk,
+         * an actual RDMA write will occur and a new chunk will be formed.
+         */
+        src->block_offset = block_offset_source;
+        src->offset = offset_source;
+        dest->block_offset = block_offset_dest;
+        dest->offset = offset_dest;
+
+        DDPRINTF("Copy page: %p src offset %" PRIu64
+                " dest %p offset %" PRIu64 "\n",
+                (void *) block_offset_source, offset_source,
+                (void *) block_offset_dest, offset_dest);
+
+        ret = qemu_rdma_flush_unmergable(rdma, src, dest, f, size);
+
+        if (ret) {
+            ERROR(NULL, "local copy flush");
+            goto err;
+        }
+
+        if ((src->current_length >= RDMA_MERGE_MAX) || 
+            (dest->current_length >= RDMA_MERGE_MAX)) {
+            ret = qemu_rdma_write(f, rdma, src, dest);
+
+            if (ret < 0) {
+                goto err;
+            }
+        } else {
+            ret = 0;
+        }
+    } else {
+        ret = qemu_rdma_instruct_unregister(rdma, f, block_offset_source,
+                                                  offset_source, size);
+        if (ret) {
+            goto err;
+        }
+
+        ret = qemu_rdma_instruct_unregister(rdma, f, block_offset_dest, 
+                                                  offset_dest, size);
+
+        if (ret) {
+            goto err;
+        }
+    }
+
+    ret = qemu_rdma_poll_until_empty(rdma, 
+                rdma->source ? &rdma->lc_src : &rdma->lc_dest);
+
+    if (ret) {
+        goto err;
+    }
+
+    return RAM_COPY_CONTROL_DELAYED;
+err:
+    SET_ERROR(rdma, ret);
+    return ret;
+}
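
Putting the parameter rules together, a checkpointing caller would
drive the hook roughly as below. In-tree it is reached through the new
QEMUFileOps copy_page member rather than called directly; the extern
declaration, stand-in typedefs and helper are purely illustrative:

    #include <stdint.h>

    typedef uint64_t ram_addr_t;        /* stand-in for exec.h's type */
    typedef struct QEMUFile QEMUFile;   /* opaque here                */

    extern int qemu_rdma_copy_page(QEMUFile *f, void *opaque,
                                   ram_addr_t block_offset_dest,
                                   ram_addr_t offset_dest,
                                   ram_addr_t block_offset_source,
                                   ram_addr_t offset_source,
                                   long size);

    /* Hypothetical: copy one 4K page, then hint an unregister. */
    static int checkpoint_one_page(QEMUFile *f, void *opaque,
                                   ram_addr_t dst_blk, ram_addr_t src_blk,
                                   ram_addr_t off)
    {
        /* size > 0: the copy is merged/batched and may complete later
         * (success is RAM_COPY_CONTROL_DELAYED, a non-negative value). */
        int ret = qemu_rdma_copy_page(f, opaque, dst_blk, off,
                                      src_blk, off, 4096);
        if (ret < 0) {
            return ret;
        }
        /* size == 0: advisory unregister of both ranges (may be a no-op). */
        return qemu_rdma_copy_page(f, opaque, dst_blk, off, src_blk, off, 0);
    }
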
+
 /*
  * Parameters:
  *    @offset == 0 :
@@ -2672,6 +3357,20 @@ static int qemu_rdma_close(void *opaque)
  *        represents a private malloc'd memory area that the caller wishes to
  *        transfer.
  *
+ *        This allows callers to initiate RDMA transfers of arbitrary memory
+ *        areas, not just transfers initiated by the migration code itself.
+ *
+ *        If this is true, then the virtual address specified by 'block_offset'
+ *        below must have been pre-registered with us in advance by calling the
+ *        new QEMUFileOps->add()/remove() functions on both sides of the
+ *        connection.
+ *
+ *        Also note: add()/remove() must be called in the *same sequence* and
+ *        against the *same size* private virtual memory on both sides of the
+ *        connection for this to work, regardless of whether the transfer of
+ *        this private memory was initiated by the migration code or a private
+ *        caller.
+ *
  *    @offset != 0 :
  *        Offset is an offset to be added to block_offset and used
  *        to also lookup the corresponding RAMBlock.
@@ -2680,7 +3379,7 @@ static int qemu_rdma_close(void *opaque)
  *        Initiate an transfer this size.
  *
  *    @size == 0 :
- *        A 'hint' or 'advice' that means that we wish to speculatively
+ *        A 'hint' that means that we wish to speculatively
  *        and asynchronously unregister this memory. In this case, there is no
  *        guarantee that the unregister will actually happen, for example,
  *        if the memory is being actively transmitted. Additionally, the memory
@@ -2698,12 +3397,15 @@ static int qemu_rdma_close(void *opaque)
  *                  sent. Usually, this will not be more than a few bytes of
  *                  the protocol because most transfers are sent asynchronously.
  */
-static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
-                                  ram_addr_t block_offset, ram_addr_t offset,
-                                  size_t size, int *bytes_sent)
+static int qemu_rdma_save_page(QEMUFile *f, void *opaque,
+                                  ram_addr_t block_offset,
+                                  uint8_t *host_addr,
+                                  ram_addr_t offset,
+                                  long size, int *bytes_sent)
 {
     QEMUFileRDMA *rfile = opaque;
     RDMAContext *rdma = rfile->rdma;
+    RDMACurrentChunk *cc = &rdma->chunk_remote;
     int ret;
 
     CHECK_ERROR_STATE();
@@ -2716,12 +3418,27 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
          * is full, or the page doen't belong to the current chunk,
          * an actual RDMA write will occur and a new chunk will be formed.
          */
-        ret = qemu_rdma_write(f, rdma, block_offset, offset, size);
-        if (ret < 0) {
-            fprintf(stderr, "rdma migration: write error! %d\n", ret);
+        cc->block_offset = block_offset;
+        cc->offset = offset;
+
+        ret = qemu_rdma_flush_unmergable(rdma, cc, NULL, f, size);
+
+        if (ret) {
+            ERROR(NULL, "remote flush unmergable");
             goto err;
         }
 
+        if (cc->current_length >= RDMA_MERGE_MAX) {
+            ret = qemu_rdma_write(f, rdma, cc, NULL);
+
+            if (ret < 0) {
+                ERROR(NULL, "remote write! %d", ret);
+                goto err;
+            }
+        } else {
+            ret = 0;
+        }
+
         /*
          * We always return 1 bytes because the RDMA
          * protocol is completely asynchronous. We do not yet know
@@ -2734,65 +3451,100 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
             *bytes_sent = 1;
         }
     } else {
-        uint64_t index, chunk;
-
-        /* TODO: Change QEMUFileOps prototype to be signed: size_t => long
-        if (size < 0) {
-            ret = qemu_rdma_drain_cq(f, rdma);
-            if (ret < 0) {
-                fprintf(stderr, "rdma: failed to synchronously drain"
-                                " completion queue before unregistration.\n");
-                goto err;
-            }
-        }
-        */
-
-        ret = qemu_rdma_search_ram_block(rdma, block_offset,
-                                         offset, size, &index, &chunk);
+        ret = qemu_rdma_instruct_unregister(rdma, f, block_offset, offset, size);
 
         if (ret) {
-            fprintf(stderr, "ram block search failed\n");
             goto err;
         }
+    }
 
-        qemu_rdma_signal_unregister(rdma, index, chunk, 0);
+    ret = qemu_rdma_poll_until_empty(rdma, &rdma->lc_remote);
 
-        /*
-         * TODO: Synchronous, guaranteed unregistration (should not occur during
-         * fast-path). Otherwise, unregisters will process on the next call to
-         * qemu_rdma_drain_cq()
-        if (size < 0) {
-            qemu_rdma_unregister_waiting(rdma);
-        }
-        */
+    if (ret) {
+        goto err;
     }
 
-    /*
-     * Drain the Completion Queue if possible, but do not block,
-     * just poll.
-     *
-     * If nothing to poll, the end of the iteration will do this
-     * again to make sure we don't overflow the request queue.
-     */
-    while (1) {
-        uint64_t wr_id, wr_id_in;
-        int ret = qemu_rdma_poll(rdma, &wr_id_in, NULL);
-        if (ret < 0) {
-            fprintf(stderr, "rdma migration: polling error! %d\n", ret);
-            goto err;
-        }
+    return RAM_SAVE_CONTROL_DELAYED;
+err:
+    SET_ERROR(rdma, ret);
+    return ret;
+}
 
-        wr_id = wr_id_in & RDMA_WRID_TYPE_MASK;
+static int qemu_rdma_accept_start(RDMAContext *rdma,
+                                  RDMALocalContext *lc,
+                                  struct rdma_cm_event **return_event)
+{
+    struct rdma_cm_event *cm_event = NULL;
+    int ret;
 
-        if (wr_id == RDMA_WRID_NONE) {
-            break;
-        }
+    ret = rdma_get_cm_event(lc->channel, &cm_event);
+    if (ret) {
+        ERROR(NULL, "failed to wait for initial connect request");
+        goto err;
     }
 
-    return RAM_SAVE_CONTROL_DELAYED;
+    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+        ERROR(NULL, "initial connect request is invalid");
+        ret = -EINVAL;
+        rdma_ack_cm_event(cm_event);
+        goto err;
+    }
+
+    if (lc->verbs && (lc->verbs != cm_event->id->verbs)) {
+        ret = -EINVAL;
+        ERROR(NULL, "ibv context %p != %p!", lc->verbs, 
+                                             cm_event->id->verbs);
+        goto err;
+    }
+
+    lc->cm_id = cm_event->id;
+    lc->verbs = cm_event->id->verbs;
+
+    DPRINTF("verbs context after listen: %p\n", lc->verbs);
+    qemu_rdma_dump_id("rdma_accept_start", lc->verbs);
+
+    if (return_event) {
+        *return_event = cm_event;
+    } else {
+        rdma_ack_cm_event(cm_event);
+    }
+
+    ret = qemu_rdma_alloc_pd_cq_qp(rdma, lc);
+    if (ret) {
+        goto err;
+    }
+
+    return 0;
 err:
-    rdma->error_state = ret;
-    return ret;
+    SET_ERROR(rdma, ret);
+    return rdma->error_state;
+}
+
+static int qemu_rdma_accept_finish(RDMAContext *rdma,
+                                   RDMALocalContext *lc)
+{
+    struct rdma_cm_event *cm_event;
+    int ret;
+
+    ret = rdma_get_cm_event(lc->channel, &cm_event);
+    if (ret) {
+        ERROR(NULL, "rdma_accept get_cm_event failed %d!", ret);
+        goto err;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        ERROR(NULL, "rdma_accept not event established!");
+        rdma_ack_cm_event(cm_event);
+        goto err;
+    }
+
+    rdma_ack_cm_event(cm_event);
+    lc->connected = true;
+
+    return 0;
+err:
+    SET_ERROR(rdma, ret);
+    return rdma->error_state;
 }
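
For reference, the two CM events consumed by qemu_rdma_accept_start()
and qemu_rdma_accept_finish() follow the canonical passive-side rdmacm
sequence sketched below (error paths abbreviated; the real code also
copies the connect request's private data before acking):

    #include <rdma/rdma_cma.h>

    static int accept_one(struct rdma_event_channel *ch,
                          struct rdma_cm_id **conn_id)
    {
        struct rdma_cm_event *ev;
        struct rdma_conn_param cp = { .responder_resources = 2 };

        if (rdma_get_cm_event(ch, &ev) ||
            ev->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
            return -1;
        }
        *conn_id = ev->id;          /* PD/CQ/QP are built on this id */
        rdma_ack_cm_event(ev);

        if (rdma_accept(*conn_id, &cp)) {
            return -1;
        }
        if (rdma_get_cm_event(ch, &ev) ||
            ev->event != RDMA_CM_EVENT_ESTABLISHED) {
            return -1;
        }
        rdma_ack_cm_event(ev);
        return 0;
    }
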
 
 static int qemu_rdma_accept(RDMAContext *rdma)
@@ -2804,19 +3556,10 @@ static int qemu_rdma_accept(RDMAContext *rdma)
                                             .private_data_len = sizeof(cap),
                                          };
     struct rdma_cm_event *cm_event;
-    struct ibv_context *verbs;
     int ret = -EINVAL;
     int idx;
 
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret) {
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
+    ret = qemu_rdma_accept_start(rdma, &rdma->lc_remote, &cm_event);
+
+    if (ret) {
+        goto err_rdma_dest_wait;
+    }
 
     memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
 
@@ -2829,6 +3572,13 @@ static int qemu_rdma_accept(RDMAContext *rdma)
             goto err_rdma_dest_wait;
     }
 
+    rdma->keepalive_rkey = cap.keepalive_rkey;
+    rdma->keepalive_addr = cap.keepalive_addr;
+
+    DDPRINTF("Received keepalive params: key %x addr: %" PRIx64 
+            " local %" PRIx64 "\n",
+            cap.keepalive_rkey, cap.keepalive_addr, (uint64_t) &rdma->keepalive);
+
     /*
      * Respond with only the capabilities this version of QEMU knows about.
      */
@@ -2838,103 +3588,80 @@ static int qemu_rdma_accept(RDMAContext *rdma)
      * Enable the ones that we do know about.
      * Add other checks here as new ones are introduced.
      */
-    if (cap.flags & RDMA_CAPABILITY_PIN_ALL) {
-        rdma->pin_all = true;
-    }
+    rdma->pin_all = cap.flags & RDMA_CAPABILITY_PIN_ALL;
+    rdma->do_keepalive = cap.flags & RDMA_CAPABILITY_KEEPALIVE;
 
-    rdma->cm_id = cm_event->id;
-    verbs = cm_event->id->verbs;
+    DPRINTF("Memory pin all: %s\n", rdma->pin_all ? "enabled" : "disabled");
+    DPRINTF("Keepalives: %s\n", rdma->do_keepalive ? "enabled" : "disabled");
 
     rdma_ack_cm_event(cm_event);
 
-    DPRINTF("Memory pin all: %s\n", rdma->pin_all ? "enabled" : "disabled");
+    ret = qemu_rdma_reg_keepalive(rdma);
 
-    caps_to_network(&cap);
+    if (ret) {
+        ERROR(NULL, "allocating keepalive structures");
+        goto err_rdma_dest_wait;
+    }
 
-    DPRINTF("verbs context after listen: %p\n", verbs);
+    cap.keepalive_rkey = rdma->keepalive_mr->rkey;
+    cap.keepalive_addr = (uint64_t) &rdma->keepalive;
 
-    if (!rdma->verbs) {
-        rdma->verbs = verbs;
-    } else if (rdma->verbs != verbs) {
-            fprintf(stderr, "ibv context not matching %p, %p!\n",
-                    rdma->verbs, verbs);
-            goto err_rdma_dest_wait;
-    }
+    DDPRINTF("Sending keepalive params: key %x addr: %" PRIx64 
+            " remote: %" PRIx64 "\n",
+            cap.keepalive_rkey, cap.keepalive_addr, rdma->keepalive_addr);
 
-    qemu_rdma_dump_id("dest_init", verbs);
+    caps_to_network(&cap);
 
-    ret = qemu_rdma_alloc_pd_cq(rdma);
+    ret = rdma_accept(rdma->lc_remote.cm_id, &conn_param);
     if (ret) {
-        fprintf(stderr, "rdma migration: error allocating pd and cq!\n");
+        ERROR(NULL, "rdma_accept returns %d!", ret);
         goto err_rdma_dest_wait;
     }
 
-    ret = qemu_rdma_alloc_qp(rdma);
+    ret = qemu_rdma_accept_finish(rdma, &rdma->lc_remote);
+
     if (ret) {
-        fprintf(stderr, "rdma migration: error allocating qp!\n");
+        ERROR(NULL, "finishing connection with capabilities to source");
         goto err_rdma_dest_wait;
     }
 
     ret = qemu_rdma_init_ram_blocks(rdma);
     if (ret) {
-        fprintf(stderr, "rdma migration: error initializing ram blocks!\n");
+        ERROR(NULL, "initializing ram blocks!");
         goto err_rdma_dest_wait;
     }
 
     for (idx = 0; idx < RDMA_WRID_MAX; idx++) {
         ret = qemu_rdma_reg_control(rdma, idx);
         if (ret) {
-            fprintf(stderr, "rdma: error registering %d control!\n", idx);
+            ERROR(NULL, "registering %d control!", idx);
             goto err_rdma_dest_wait;
         }
     }
 
-    qemu_set_fd_handler2(rdma->channel->fd, NULL, NULL, NULL, NULL);
-
-    ret = rdma_accept(rdma->cm_id, &conn_param);
-    if (ret) {
-        fprintf(stderr, "rdma_accept returns %d!\n", ret);
-        goto err_rdma_dest_wait;
-    }
-
-    ret = rdma_get_cm_event(rdma->channel, &cm_event);
-    if (ret) {
-        fprintf(stderr, "rdma_accept get_cm_event failed %d!\n", ret);
-        goto err_rdma_dest_wait;
-    }
-
-    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
-        fprintf(stderr, "rdma_accept not event established!\n");
-        rdma_ack_cm_event(cm_event);
-        goto err_rdma_dest_wait;
-    }
-
-    rdma_ack_cm_event(cm_event);
-    rdma->connected = true;
+    qemu_set_fd_handler2(rdma->lc_remote.channel->fd, NULL, NULL, NULL, NULL);
 
     ret = qemu_rdma_post_recv_control(rdma, RDMA_WRID_READY);
     if (ret) {
-        fprintf(stderr, "rdma migration: error posting second control recv!\n");
+        ERROR(NULL, "posting second control recv!");
         goto err_rdma_dest_wait;
     }
 
-    qemu_rdma_dump_gid("dest_connect", rdma->cm_id);
+    qemu_rdma_dump_gid("dest_connect", rdma->lc_remote.cm_id);
 
     return 0;
 
 err_rdma_dest_wait:
-    rdma->error_state = ret;
-    qemu_rdma_cleanup(rdma);
+    SET_ERROR(rdma, ret);
+    qemu_rdma_cleanup(rdma, false);
     return ret;
 }
 
 /*
  * During each iteration of the migration, we listen for instructions
- * by the source VM to perform dynamic page registrations before they
+ * by the source VM to perform pinning operations before they
  * can perform RDMA operations.
  *
- * We respond with the 'rkey'.
- *
  * Keep doing this until the source tells us to stop.
  */
 static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
@@ -2957,8 +3684,8 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
     RDMARegister *reg, *registers;
     RDMACompress *comp;
     RDMARegisterResult *reg_result;
-    static RDMARegisterResult results[RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE];
     RDMALocalBlock *block;
+    static RDMARegisterResult results[RDMA_CONTROL_MAX_COMMANDS_PER_MESSAGE];
     void *host_addr;
     int ret = 0;
     int idx = 0;
@@ -3009,8 +3736,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
             if (rdma->pin_all) {
                 ret = qemu_rdma_reg_whole_ram_blocks(rdma);
                 if (ret) {
-                    fprintf(stderr, "rdma migration: error dest "
-                                    "registering ram blocks!\n");
+                    ERROR(NULL, "dest registering ram blocks!");
                     goto out;
                 }
             }
@@ -3043,7 +3769,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
                                         (uint8_t *) rdma->block, &blocks);
 
             if (ret < 0) {
-                fprintf(stderr, "rdma migration: error sending remote info!\n");
+                ERROR(NULL, "sending remote info!");
                 goto out;
             }
 
@@ -3055,8 +3781,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
             registers = (RDMARegister *) rdma->wr_data[idx].control_curr;
 
             for (count = 0; count < head.repeat; count++) {
-                uint64_t chunk;
-                uint8_t *chunk_start, *chunk_end;
+                RDMACurrentChunk cc;
 
                 reg = &registers[count];
                 network_to_register(reg);
@@ -3065,30 +3790,28 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
 
                 DDPRINTF("Registration request (%d): index %d, current_addr %"
                          PRIu64 " chunks: %" PRIu64 "\n", count,
-                         reg->current_index, reg->key.current_addr, reg->chunks);
-
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
-                if (block->is_ram_block) {
-                    host_addr = (block->local_host_addr +
-                                (reg->key.current_addr - block->offset));
-                    chunk = ram_chunk_index(block->local_host_addr,
-                                            (uint8_t *) host_addr);
+                         reg->current_block_idx, reg->key.current_addr, reg->chunks);
+
+                cc.block = &(rdma->local_ram_blocks.block[reg->current_block_idx]);
+                if (cc.block->is_ram_block) {
+                    cc.addr = (cc.block->local_host_addr +
+                                (reg->key.current_addr - cc.block->offset));
+                    cc.chunk_idx = ram_chunk_index(cc.block->local_host_addr, cc.addr);
                 } else {
-                    chunk = reg->key.chunk;
-                    host_addr = block->local_host_addr +
+                    cc.chunk_idx = reg->key.chunk;
+                    cc.addr = cc.block->local_host_addr +
                         (reg->key.chunk * (1UL << RDMA_REG_CHUNK_SHIFT));
                 }
-                chunk_start = ram_chunk_start(block, chunk);
-                chunk_end = ram_chunk_end(block, chunk + reg->chunks);
-                if (qemu_rdma_register_and_get_keys(rdma, block,
-                            (uint8_t *)host_addr, NULL, &reg_result->rkey,
-                            chunk, chunk_start, chunk_end)) {
+                cc.chunk_start = ram_chunk_start(cc.block, cc.chunk_idx);
+                cc.chunk_end = ram_chunk_end(cc.block, cc.chunk_idx + reg->chunks);
+                if (qemu_rdma_register_and_get_keys(rdma, &cc, &rdma->lc_remote,
+                                            false, NULL, &reg_result->rkey)) {
                     fprintf(stderr, "cannot get rkey!\n");
                     ret = -EINVAL;
                     goto out;
                 }
 
-                reg_result->host_addr = (uint64_t) block->local_host_addr;
+                reg_result->host_addr = (uint64_t) cc.block->local_host_addr;
 
                 DDPRINTF("Registered rkey for this request: %x\n",
                                 reg_result->rkey);
@@ -3115,9 +3838,9 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
 
                 DDPRINTF("Unregistration request (%d): "
                          " index %d, chunk %" PRIu64 "\n",
-                         count, reg->current_index, reg->key.chunk);
+                         count, reg->current_block_idx, reg->key.chunk);
 
-                block = &(rdma->local_ram_blocks.block[reg->current_index]);
+                block = &(rdma->local_ram_blocks.block[reg->current_block_idx]);
 
                 ret = ibv_dereg_mr(block->pmr[reg->key.chunk]);
                 block->pmr[reg->key.chunk] = NULL;
@@ -3154,7 +3877,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
     } while (1);
 out:
     if (ret < 0) {
-        rdma->error_state = ret;
+        SET_ERROR(rdma, ret);
     }
     return ret;
 }
@@ -3168,7 +3891,23 @@ static int qemu_rdma_registration_start(QEMUFile *f, void *opaque,
     CHECK_ERROR_STATE();
 
     DDDPRINTF("start section: %" PRIu64 "\n", flags);
-    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
+
+    if (flags == RAM_CONTROL_FLUSH) {
+        int ret;
+
+        if (rdma->source) {
+            ret = qemu_rdma_drain_cq(f, rdma, &rdma->chunk_local_src, 
+                                              &rdma->chunk_local_dest);
+
+            if (ret < 0) {
+                return ret;
+            }
+        }
+
+    } else {
+        qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
+    }
+
     qemu_fflush(f);
 
     return 0;
@@ -3190,7 +3929,7 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
     CHECK_ERROR_STATE();
 
     qemu_fflush(f);
-    ret = qemu_rdma_drain_cq(f, rdma);
+    ret = qemu_rdma_drain_cq(f, rdma, &rdma->chunk_remote, NULL);
 
     if (ret < 0) {
         goto err;
@@ -3225,13 +3964,13 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
         /*
          * The protocol uses two different sets of rkeys (mutually exclusive):
          * 1. One key to represent the virtual address of the entire ram block.
-         *    (dynamic chunk registration disabled - pin everything with one rkey.)
+         *    (pinning enabled - pin everything with one rkey.)
          * 2. One to represent individual chunks within a ram block.
-         *    (dynamic chunk registration enabled - pin individual chunks.)
+         *    (pinning disabled - pin individual chunks.)
          *
          * Once the capability is successfully negotiated, the destination transmits
          * the keys to use (or sends them later) including the virtual addresses
-         * and then propagates the remote ram block descriptions to his local copy.
+         * and then propagates the remote ram block descriptions to their local copy.
          */
 
         if (local->nb_blocks != nb_remote_blocks) {
@@ -3285,7 +4024,7 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
 
     return 0;
 err:
-    rdma->error_state = ret;
+    SET_ERROR(rdma, ret);
     return ret;
 }
 
@@ -3294,7 +4033,23 @@ static int qemu_rdma_get_fd(void *opaque)
     QEMUFileRDMA *rfile = opaque;
     RDMAContext *rdma = rfile->rdma;
 
-    return rdma->comp_channel->fd;
+    return rdma->lc_remote.comp_chan->fd;
+}
+
+static int qemu_rdma_delete_block(QEMUFile *f, void *opaque,
+                                  ram_addr_t block_offset)
+{
+    QEMUFileRDMA *rfile = opaque;
+    return __qemu_rdma_delete_block(rfile->rdma, block_offset);
+}
+
+
+static int qemu_rdma_add_block(QEMUFile *f, void *opaque, void *host_addr,
+                         ram_addr_t block_offset, uint64_t length)
+{
+    QEMUFileRDMA *rfile = opaque;
+    return __qemu_rdma_add_block(rfile->rdma, host_addr,
+                                 block_offset, length);
 }
 
 const QEMUFileOps rdma_read_ops = {
@@ -3302,6 +4057,9 @@ const QEMUFileOps rdma_read_ops = {
     .get_fd        = qemu_rdma_get_fd,
     .close         = qemu_rdma_close,
     .hook_ram_load = qemu_rdma_registration_handle,
+    .copy_page     = qemu_rdma_copy_page,
+    .add           = qemu_rdma_add_block,
+    .remove        = qemu_rdma_delete_block,
 };
 
 const QEMUFileOps rdma_write_ops = {
@@ -3310,6 +4068,9 @@ const QEMUFileOps rdma_write_ops = {
     .before_ram_iterate = qemu_rdma_registration_start,
     .after_ram_iterate  = qemu_rdma_registration_stop,
     .save_page          = qemu_rdma_save_page,
+    .copy_page          = qemu_rdma_copy_page,
+    .add                = qemu_rdma_add_block,
+    .remove             = qemu_rdma_delete_block,
 };
 
 static void *qemu_fopen_rdma(RDMAContext *rdma, const char *mode)
@@ -3331,6 +4092,91 @@ static void *qemu_fopen_rdma(RDMAContext *rdma, const char *mode)
     return r->file;
 }
 
+static int init_local(RDMAContext *rdma)
+{
+    int ret;
+    struct rdma_conn_param cp_dest   = { .responder_resources = 2 },
+                           cp_source = { .initiator_depth = 2,
+                                         .retry_count = 5,
+                                       };
+
+    if (!migrate_use_mc_rdma_copy()) {
+        printf("RDMA local copy is disabled.\n");
+        return 0;
+    }
+
+    rdma->lc_dest.port = 0;
+    rdma->lc_src.host = g_strdup("127.0.0.1");
+    rdma->lc_dest.host = g_strdup(rdma->lc_src.host);
+    rdma->lc_src.source = true;
+    rdma->lc_src.dest = false;
+    rdma->lc_dest.source = false;
+    rdma->lc_dest.dest = true;
+
+    /* bind & listen */
+    ret = qemu_rdma_device_init(rdma, NULL, &rdma->lc_dest);
+    if (ret) {
+        ERROR(NULL, "initialize local device destination");
+        goto err;
+    }
+
+    rdma->lc_src.port = ntohs(rdma_get_src_port(rdma->lc_dest.listen_id));
+
+    DPRINTF("bound to port: %d\n", rdma->lc_src.port);
+
+    /* resolve */
+    ret = qemu_rdma_device_init(rdma, NULL, &rdma->lc_src);
+
+    if (ret) {
+        ERROR(NULL, "Failed to initialize local device source");
+        goto err;
+    }
+
+    /* async connect */
+    ret = rdma_connect(rdma->lc_src.cm_id, &cp_source);
+    if (ret) {
+        ERROR(NULL, "connect local device source");
+        goto err;
+    }
+
+    /* async accept */
+    ret = qemu_rdma_accept_start(rdma, &rdma->lc_dest, NULL);
+    if (ret) {
+        ERROR(NULL, "starting accept for local connection");
+        goto err;
+    }
+
+    /* accept */
+    ret = rdma_accept(rdma->lc_dest.cm_id, &cp_dest);
+    if (ret) {
+        ERROR(NULL, "rdma_accept returns %d (%s)!", ret, rdma->lc_dest.id_str);
+        goto err;
+    }
+
+    /* ack accept */
+    ret = qemu_rdma_connect_finish(rdma, &rdma->lc_src, NULL, NULL);
+    if (ret) {
+        ERROR(NULL, "finish local connection with source");
+        goto err;
+    }
+
+    /* established */
+    ret = qemu_rdma_accept_finish(rdma, &rdma->lc_dest);
+
+    if (ret) {
+        ERROR(NULL, "finish accept connection");
+        goto err;
+    }
+
+    return 0;
+err:
+    perror("init_local");
+    SET_ERROR(rdma, -ret);
+    return rdma->error_state;
+}
+
 static void rdma_accept_incoming_migration(void *opaque)
 {
     RDMAContext *rdma = opaque;
@@ -3342,7 +4188,7 @@ static void rdma_accept_incoming_migration(void *opaque)
     ret = qemu_rdma_accept(rdma);
 
     if (ret) {
-        ERROR(errp, "RDMA Migration initialization failed!");
+        ERROR(errp, "initialization failed!");
         return;
     }
 
@@ -3351,12 +4197,45 @@ static void rdma_accept_incoming_migration(void *opaque)
     f = qemu_fopen_rdma(rdma, "rb");
     if (f == NULL) {
         ERROR(errp, "could not qemu_fopen_rdma!");
-        qemu_rdma_cleanup(rdma);
-        return;
+        goto err;
     }
 
-    rdma->migration_started_on_destination = 1;
+    if (rdma->do_keepalive) {
+        qemu_rdma_keepalive_start();
+    }
+
+    rdma->migration_started = 1;
     process_incoming_migration(f);
+    return;
+err:
+    qemu_rdma_cleanup(rdma, false);
+}
+
+static int qemu_rdma_init_incoming(RDMAContext *rdma, Error **errp)
+{
+    int ret;
+    Error *local_err = NULL;
+
+    rdma->source = false;
+    rdma->dest = true;
+    rdma->lc_remote.source = false;
+    rdma->lc_remote.dest = true;
+
+    ret = qemu_rdma_device_init(rdma, &local_err, &rdma->lc_remote);
+
+    if (ret) {
+        goto err;
+    }
+
+    return 0;
+err:
+    if (rdma->lc_remote.listen_id) {
+        rdma_destroy_id(rdma->lc_remote.listen_id);
+        rdma->lc_remote.listen_id = NULL;
+    }
+    error_propagate(errp, local_err);
+
+    return ret;
 }
 
 void rdma_start_incoming_migration(const char *host_port, Error **errp)
@@ -3372,24 +4251,13 @@ void rdma_start_incoming_migration(const char *host_port, Error **errp)
         goto err;
     }
 
-    ret = qemu_rdma_dest_init(rdma, &local_err);
+    ret = qemu_rdma_init_incoming(rdma, &local_err);
 
     if (ret) {
         goto err;
     }
 
-    DPRINTF("qemu_rdma_dest_init success\n");
-
-    ret = rdma_listen(rdma->listen_id, 5);
-
-    if (ret) {
-        ERROR(errp, "listening on socket!");
-        goto err;
-    }
-
-    DPRINTF("rdma_listen success\n");
-
-    qemu_set_fd_handler2(rdma->channel->fd, NULL,
+    qemu_set_fd_handler2(rdma->lc_remote.channel->fd, NULL,
                          rdma_accept_incoming_migration, NULL,
                             (void *)(intptr_t) rdma);
     return;
@@ -3411,14 +4279,21 @@ void rdma_start_outgoing_migration(void *opaque,
         goto err;
     }
 
-    ret = qemu_rdma_source_init(rdma, &local_err,
-        s->enabled_capabilities[MIGRATION_CAPABILITY_X_RDMA_PIN_ALL]);
+    rdma->source = true;
+    rdma->dest = false;
+
+    if (init_local(rdma)) {
+        ERROR(temp, "could not initialize local rdma queue pairs!");
+        goto err;
+    }
+
+    ret = qemu_rdma_init_outgoing(rdma, &local_err, s);
 
     if (ret) {
         goto err;
     }
 
-    DPRINTF("qemu_rdma_source_init success\n");
+    DPRINTF("qemu_rdma_init_outgoing success\n");
     ret = qemu_rdma_connect(rdma, &local_err);
 
     if (ret) {
@@ -3428,6 +4303,12 @@ void rdma_start_outgoing_migration(void *opaque,
     DPRINTF("qemu_rdma_source_connect success\n");
 
     s->file = qemu_fopen_rdma(rdma, "wb");
+    rdma->migration_started = 1;
+
+    if (rdma->do_keepalive) {
+        qemu_rdma_keepalive_start();
+    }
+
     migrate_fd_connect(s);
     return;
 err:
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (4 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 05/12] rdma: accelerated memcpy() support and better external RDMA user interfaces mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-19  1:00   ` Li Guang
  2014-03-11 21:57   ` Juan Quintela
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing mrhines
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

This patch sets up the initial changes to the migration state
machine, along with the prototypes the checkpointing code will use
to interact with it, so that we can later handle failure and
recovery scenarios.
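
To make the mechanics concrete, here is a minimal sketch of the
intended flow. The exact position of the new state in the enum, and
the call site that enters it, live elsewhere in this series and are
shown here only for illustration:

    enum {
        MIG_STATE_ERROR = -1,
        MIG_STATE_NONE,
        MIG_STATE_SETUP,
        MIG_STATE_CANCELLING,
        MIG_STATE_CANCELLED,
        MIG_STATE_ACTIVE,
        MIG_STATE_CHECKPOINTING, /* new: continuous checkpointing */
        MIG_STATE_COMPLETED,
    };

    /* Instead of completing, an MC-enabled migration keeps going: */
    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CHECKPOINTING);

In the diff below, migrate_set_state() also loses its 'static' so the
checkpointing code can drive these transitions itself.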

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c                   | 29 ++++++++++++++++++++++++-----
 include/migration/migration.h |  2 ++
 migration.c                   | 37 +++++++++++++++++++++----------------
 3 files changed, 47 insertions(+), 21 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index db75120..e9d4d9e 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
     migration_end();
 }
 
-static void reset_ram_globals(void)
+static void reset_ram_globals(bool reset_bulk_stage)
 {
     last_seen_block = NULL;
     last_sent_block = NULL;
     last_offset = 0;
     last_version = ram_list.version;
-    ram_bulk_stage = true;
+    ram_bulk_stage = reset_bulk_stage;
 }
 
 #define MAX_WAIT 50 /* ms, half buffered_file limit */
@@ -674,6 +674,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     RAMBlock *block;
     int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
 
+    /*
+     * RAM stays open during micro-checkpointing for the next transaction.
+     */
+    if (migration_is_mc(migrate_get_current())) {
+        qemu_mutex_lock_ramlist();
+        reset_ram_globals(false);
+        goto skip_setup;
+    }
+
     migration_bitmap = bitmap_new(ram_pages);
     bitmap_set(migration_bitmap, 0, ram_pages);
     migration_dirty_pages = ram_pages;
@@ -710,12 +719,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     qemu_mutex_lock_iothread();
     qemu_mutex_lock_ramlist();
     bytes_transferred = 0;
-    reset_ram_globals();
+    reset_ram_globals(true);
 
     memory_global_dirty_log_start();
     migration_bitmap_sync();
     qemu_mutex_unlock_iothread();
 
+skip_setup:
+
     qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
 
     QTAILQ_FOREACH(block, &ram_list.blocks, next) {
@@ -744,7 +755,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
     qemu_mutex_lock_ramlist();
 
     if (ram_list.version != last_version) {
-        reset_ram_globals();
+        reset_ram_globals(true);
     }
 
     ram_control_before_iterate(f, RAM_CONTROL_ROUND);
@@ -825,7 +836,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     }
 
     ram_control_after_iterate(f, RAM_CONTROL_FINISH);
-    migration_end();
+
+    /*
+     * Only cleanup at the end of normal migrations
+     * or if the MC destination failed and we got an error.
+     * Otherwise, we are (or will soon be) in MIG_STATE_CHECKPOINTING.
+     */
+    if (!migrate_use_mc() || migration_has_failed(migrate_get_current())) {
+        migration_end();
+    }
 
     qemu_mutex_unlock_ramlist();
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
diff --git a/include/migration/migration.h b/include/migration/migration.h
index a7c54fe..e876a2c 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -101,7 +101,9 @@ int migrate_fd_close(MigrationState *s);
 
 void add_migration_state_change_notifier(Notifier *notify);
 void remove_migration_state_change_notifier(Notifier *notify);
+bool migration_is_active(MigrationState *);
 bool migration_in_setup(MigrationState *);
+bool migration_is_mc(MigrationState *s);
 bool migration_has_finished(MigrationState *);
 bool migration_has_failed(MigrationState *);
 MigrationState *migrate_get_current(void);
diff --git a/migration.c b/migration.c
index 25add6f..f42dae4 100644
--- a/migration.c
+++ b/migration.c
@@ -36,16 +36,6 @@
     do { } while (0)
 #endif
 
-enum {
-    MIG_STATE_ERROR = -1,
-    MIG_STATE_NONE,
-    MIG_STATE_SETUP,
-    MIG_STATE_CANCELLING,
-    MIG_STATE_CANCELLED,
-    MIG_STATE_ACTIVE,
-    MIG_STATE_COMPLETED,
-};
-
 #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
 
 /* Amount of time to allocate to each "chunk" of bandwidth-throttled
@@ -273,7 +263,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
     MigrationState *s = migrate_get_current();
     MigrationCapabilityStatusList *cap;
 
-    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
+    if (migration_is_active(s)) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -285,7 +275,13 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
 
 /* shared migration helpers */
 
-static void migrate_set_state(MigrationState *s, int old_state, int new_state)
+bool migration_is_active(MigrationState *s)
+{
+    return s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP ||
+           s->state == MIG_STATE_CHECKPOINTING;
+}
+
+void migrate_set_state(MigrationState *s, int old_state, int new_state)
 {
     if (atomic_cmpxchg(&s->state, old_state, new_state) == new_state) {
         trace_migrate_set_state(new_state);
@@ -309,7 +305,7 @@ static void migrate_fd_cleanup(void *opaque)
         s->file = NULL;
     }
 
-    assert(s->state != MIG_STATE_ACTIVE);
+    assert(!migration_is_active(s));
 
     if (s->state != MIG_STATE_COMPLETED) {
         qemu_savevm_state_cancel();
@@ -356,7 +352,12 @@ void remove_migration_state_change_notifier(Notifier *notify)
 
 bool migration_in_setup(MigrationState *s)
 {
     return s->state == MIG_STATE_SETUP;
+}
+
+bool migration_is_mc(MigrationState *s)
+{
+    return s->state == MIG_STATE_CHECKPOINTING;
 }
 
 bool migration_has_finished(MigrationState *s)
@@ -419,7 +420,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
     params.shared = has_inc && inc;
 
     if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP ||
-        s->state == MIG_STATE_CANCELLING) {
+        s->state == MIG_STATE_CANCELLING ||
+        s->state == MIG_STATE_CHECKPOINTING) {
         error_set(errp, QERR_MIGRATION_ACTIVE);
         return;
     }
@@ -624,7 +626,10 @@ static void *migration_thread(void *opaque)
                 }
 
                 if (!qemu_file_get_error(s->file)) {
-                    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
+                    if (!migrate_use_mc()) {
+                        migrate_set_state(s,
+                            MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
+                    }
                     break;
                 }
             }
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (5 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-03-11 21:45   ` Eric Blake
  2014-03-11 21:59   ` Juan Quintela
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic mrhines
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

MC provides a lot of new information, including the same RAM statistics
that ordinary migration does, so we centralize much of that printing
code into a common function to avoid duplicating the QMP printing
statements.

We also introduce a new MCStats structure (like MigrationStats) due to
the large number of non-migration-related statistics - we don't want to
confuse migration and MC too much, so let's keep them separate for now.
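
For illustration, a query-migrate reply during steady-state
checkpointing would then look roughly like this (values invented;
the usual 'ram' and 'xbzrle-cache' sections are omitted):

    { "return": {
        "status": "checkpointing",
        "mc": {
            "checkpoints": 2415,
            "xmit-time": 14,
            "log-dirty-time": 2,
            "migration-bitmap-time": 1,
            "ram-copy-time": 4,
            "copy-mbps": 3200.0,
            "mbps": 980.5
        }
    } }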

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 hmp.c                         | 17 +++++++++
 include/migration/migration.h |  6 +++
 migration.c                   | 86 ++++++++++++++++++++++++++-----------------
 qapi-schema.json              | 33 +++++++++++++++++
 4 files changed, 109 insertions(+), 33 deletions(-)

diff --git a/hmp.c b/hmp.c
index 1af0809..edf062e 100644
--- a/hmp.c
+++ b/hmp.c
@@ -203,6 +203,23 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
                        info->disk->total >> 10);
     }
 
+    if (info->has_mc) {
+        monitor_printf(mon, "checkpoints: %" PRIu64 "\n",
+                       info->mc->checkpoints);
+        monitor_printf(mon, "xmit_time: %" PRIu64 " ms\n",
+                       info->mc->xmit_time);
+        monitor_printf(mon, "log_dirty_time: %" PRIu64 " ms\n",
+                       info->mc->log_dirty_time);
+        monitor_printf(mon, "migration_bitmap_time: %" PRIu64 " ms\n",
+                       info->mc->migration_bitmap_time);
+        monitor_printf(mon, "ram_copy_time: %" PRIu64 " ms\n",
+                       info->mc->ram_copy_time);
+        monitor_printf(mon, "copy_mbps: %0.2f mbps\n",
+                       info->mc->copy_mbps);
+        monitor_printf(mon, "throughput: %0.2f mbps\n",
+                       info->mc->mbps);
+    }
+
     if (info->has_xbzrle_cache) {
         monitor_printf(mon, "cache size: %" PRIu64 " bytes\n",
                        info->xbzrle_cache->cache_size);
diff --git a/include/migration/migration.h b/include/migration/migration.h
index e876a2c..f18ff5e 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -53,14 +53,20 @@ struct MigrationState
     int state;
     MigrationParams params;
     double mbps;
+    double copy_mbps;
     int64_t total_time;
     int64_t downtime;
     int64_t expected_downtime;
+    int64_t xmit_time;
+    int64_t ram_copy_time;
+    int64_t log_dirty_time;
+    int64_t bitmap_time;
     int64_t dirty_pages_rate;
     int64_t dirty_bytes_rate;
     bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
     int64_t xbzrle_cache_size;
     int64_t setup_time;
+    int64_t checkpoints;
 };
 
 void process_incoming_migration(QEMUFile *f);
diff --git a/migration.c b/migration.c
index f42dae4..0ccbeaa 100644
--- a/migration.c
+++ b/migration.c
@@ -59,7 +59,6 @@ MigrationState *migrate_get_current(void)
         .state = MIG_STATE_NONE,
         .bandwidth_limit = MAX_THROTTLE,
         .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
-        .mbps = -1,
     };
 
     return &current_migration;
@@ -173,6 +172,31 @@ static void get_xbzrle_cache_stats(MigrationInfo *info)
     }
 }
 
+static void get_ram_stats(MigrationState *s, MigrationInfo *info)
+{
+    info->has_total_time = true;
+    info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
+        - s->total_time;
+
+    info->has_ram = true;
+    info->ram = g_malloc0(sizeof(*info->ram));
+    info->ram->transferred = ram_bytes_transferred();
+    info->ram->total = ram_bytes_total();
+    info->ram->duplicate = dup_mig_pages_transferred();
+    info->ram->skipped = skipped_mig_pages_transferred();
+    info->ram->normal = norm_mig_pages_transferred();
+    info->ram->normal_bytes = norm_mig_bytes_transferred();
+    info->ram->mbps = s->mbps;
+
+    if (blk_mig_active()) {
+        info->has_disk = true;
+        info->disk = g_malloc0(sizeof(*info->disk));
+        info->disk->transferred = blk_mig_bytes_transferred();
+        info->disk->remaining = blk_mig_bytes_remaining();
+        info->disk->total = blk_mig_bytes_total();
+    }
+}
+
 MigrationInfo *qmp_query_migrate(Error **errp)
 {
     MigrationInfo *info = g_malloc0(sizeof(*info));
@@ -199,26 +223,8 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         info->has_setup_time = true;
         info->setup_time = s->setup_time;
 
-        info->has_ram = true;
-        info->ram = g_malloc0(sizeof(*info->ram));
-        info->ram->transferred = ram_bytes_transferred();
-        info->ram->remaining = ram_bytes_remaining();
-        info->ram->total = ram_bytes_total();
-        info->ram->duplicate = dup_mig_pages_transferred();
-        info->ram->skipped = skipped_mig_pages_transferred();
-        info->ram->normal = norm_mig_pages_transferred();
-        info->ram->normal_bytes = norm_mig_bytes_transferred();
+        get_ram_stats(s, info);
         info->ram->dirty_pages_rate = s->dirty_pages_rate;
-        info->ram->mbps = s->mbps;
-
-        if (blk_mig_active()) {
-            info->has_disk = true;
-            info->disk = g_malloc0(sizeof(*info->disk));
-            info->disk->transferred = blk_mig_bytes_transferred();
-            info->disk->remaining = blk_mig_bytes_remaining();
-            info->disk->total = blk_mig_bytes_total();
-        }
-
         get_xbzrle_cache_stats(info);
         break;
     case MIG_STATE_COMPLETED:
@@ -227,22 +233,37 @@ MigrationInfo *qmp_query_migrate(Error **errp)
         info->has_status = true;
         info->status = g_strdup("completed");
         info->has_total_time = true;
-        info->total_time = s->total_time;
+        info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
+            - s->total_time;
         info->has_downtime = true;
         info->downtime = s->downtime;
         info->has_setup_time = true;
         info->setup_time = s->setup_time;
 
-        info->has_ram = true;
-        info->ram = g_malloc0(sizeof(*info->ram));
-        info->ram->transferred = ram_bytes_transferred();
-        info->ram->remaining = 0;
-        info->ram->total = ram_bytes_total();
-        info->ram->duplicate = dup_mig_pages_transferred();
-        info->ram->skipped = skipped_mig_pages_transferred();
-        info->ram->normal = norm_mig_pages_transferred();
-        info->ram->normal_bytes = norm_mig_bytes_transferred();
-        info->ram->mbps = s->mbps;
+        get_ram_stats(s, info);
+        break;
+    case MIG_STATE_CHECKPOINTING:
+        info->has_status = true;
+        info->status = g_strdup("checkpointing");
+        info->has_setup_time = true;
+        info->setup_time = s->setup_time;
+        info->has_downtime = true;
+        info->downtime = s->downtime;
+
+        get_ram_stats(s, info);
+        info->ram->dirty_pages_rate = s->dirty_pages_rate;
+        get_xbzrle_cache_stats(info);
+
+        info->has_mc = true;
+        info->mc = g_malloc0(sizeof(*info->mc));
+        info->mc->xmit_time = s->xmit_time;
+        info->mc->log_dirty_time = s->log_dirty_time;
+        info->mc->migration_bitmap_time = s->bitmap_time;
+        info->mc->ram_copy_time = s->ram_copy_time;
+        info->mc->copy_mbps = s->copy_mbps;
+        info->mc->mbps = s->mbps;
+        info->mc->checkpoints = s->checkpoints;
         break;
     case MIG_STATE_ERROR:
         info->has_status = true;
@@ -646,8 +667,7 @@ static void *migration_thread(void *opaque)
             double bandwidth = transferred_bytes / time_spent;
             max_size = bandwidth * migrate_max_downtime() / 1000000;
 
-            s->mbps = time_spent ? (((double) transferred_bytes * 8.0) /
-                    ((double) time_spent / 1000.0)) / 1000.0 / 1000.0 : -1;
+            s->mbps = MBPS(transferred_bytes, time_spent);
 
             DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
                     " bandwidth %g max_size %" PRId64 "\n",
diff --git a/qapi-schema.json b/qapi-schema.json
index 3c2ee4d..7306adc 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -603,6 +603,36 @@
            'cache-miss': 'int', 'overflow': 'int' } }
 
 ##
+# @MCStats
+#
+# Detailed Micro Checkpointing (MC) statistics
+#
+# @mbps: throughput of transmitting the last MC
+#
+# @xmit-time: milliseconds to transmit the last MC
+#
+# @log-dirty-time: milliseconds to GET_LOG_DIRTY for the last MC
+#
+# @migration-bitmap-time: milliseconds to prepare the dirty bitmap for the last MC
+#
+# @ram-copy-time: milliseconds to ram_save_live() the last MC to staging memory
+#
+# @copy-mbps: throughput of ram_save_live() to staging memory for the last MC
+#
+# @checkpoints: cumulative total number of MCs generated
+#
+# Since: 2.x
+##
+{ 'type': 'MCStats',
+  'data': {'mbps': 'number',
+           'xmit-time': 'uint64',
+           'log-dirty-time': 'uint64',
+           'migration-bitmap-time': 'uint64',
+           'ram-copy-time': 'uint64',
+           'checkpoints' : 'uint64',
+           'copy-mbps': 'number' }}
+
+##
 # @MigrationInfo
 #
 # Information about current migration process.
@@ -624,6 +654,8 @@
 #                migration statistics, only returned if XBZRLE feature is on and
 #                status is 'active' or 'completed' (since 1.2)
 #
+# @mc: #optional @MCStats containing detailed Micro-Checkpointing statistics
+#
 # @total-time: #optional total amount of milliseconds since migration started.
 #        If migration has ended, it returns the total migration
 #        time. (since 1.2)
@@ -648,6 +680,7 @@
   'data': {'*status': 'str', '*ram': 'MigrationStats',
            '*disk': 'MigrationStats',
            '*xbzrle-cache': 'XBZRLECacheStats',
+           '*mc': 'MCStats',
            '*total-time': 'int',
            '*expected-downtime': 'int',
            '*downtime': 'int',
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (6 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-19  1:07   ` Li Guang
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 09/12] mc: configure and makefile support mrhines
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

This implements the core logic, all of which is described in the
first patch (docs/mc.txt).
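
As a reading aid, one iteration of the mc_thread() loop below reduces
to the following skeleton (error handling and the RDMA-offloaded slab
path are omitted; COMMIT/ACK only occur on the plain socket path):

    capture_checkpoint(&mc, s);              /* VM paused only in here */
    mc_send(s->file, MC_TRANSACTION_START);
    qemu_put_be64(s->file, mc.slab_total);   /* sizes first...         */
    qemu_put_be64(s->file, mc.start_copyset);
    qemu_put_be64(s->file, mc.used_slabs);
    mc_send(s->file, MC_TRANSACTION_COMMIT); /* ...then the slab data  */
    /* per used slab: qemu_put_be64(size) + qemu_put_buffer_async()    */
    mc_recv(mc_control, MC_TRANSACTION_ACK, NULL);
    mc_flush_oldest_buffer();                /* release buffered net
                                                packets for this MC    */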

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-checkpoint.c | 1565 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1565 insertions(+)
 create mode 100644 migration-checkpoint.c

diff --git a/migration-checkpoint.c b/migration-checkpoint.c
new file mode 100644
index 0000000..a69edb2
--- /dev/null
+++ b/migration-checkpoint.c
@@ -0,0 +1,1565 @@
+/*
+ *  Micro-Checkpointing (MC) support 
+ *  (a.k.a. Fault Tolerance or Continuous Replication)
+ *
+ *  Copyright IBM, Corp. 2014
+ *
+ *  Authors:
+ *   Michael R. Hines <mrhines@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ *
+ */
+#include <libnl3/netlink/route/qdisc/plug.h>
+#include <libnl3/netlink/route/class.h>
+#include <libnl3/netlink/cli/utils.h>
+#include <libnl3/netlink/cli/tc.h>
+#include <libnl3/netlink/cli/qdisc.h>
+#include <libnl3/netlink/cli/link.h>
+#include "qemu-common.h"
+#include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-net.h"
+#include "qemu/sockets.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "qmp-commands.h"
+#include "net/tap-linux.h"
+#include <sys/ioctl.h>
+
+#define DEBUG_MC
+//#define DEBUG_MC_VERBOSE
+//#define DEBUG_MC_REALLY_VERBOSE
+
+#ifdef DEBUG_MC
+#define DPRINTF(fmt, ...) \
+    do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#ifdef DEBUG_MC_VERBOSE
+#define DDPRINTF(fmt, ...) \
+    do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#ifdef DEBUG_MC_REALLY_VERBOSE
+#define DDDPRINTF(fmt, ...) \
+    do { printf("mc: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDDPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+/*
+ * Micro checkpoints (MCs) are typically only a few MB when idle.
+ * However, they can easily be very large during heavy workloads.
+ * In the *extreme* worst case, QEMU may need twice the amount of main
+ * memory originally allocated to the virtual machine.
+ *
+ * To support this variability during transient periods, an MC
+ * consists of a linked list of slabs, each of identical size. A better name
+ * would be welcome; the name was only chosen because it resembles Linux
+ * memory allocation. Because MCs occur several times per second
+ * (at a frequency of tens of milliseconds), slabs allow MCs to grow and
+ * shrink without constantly re-allocating all memory in place during each
+ * checkpoint.
+ *
+ * During steady-state, the 'head' slab is permanently allocated and never goes
+ * away, so when the VM is idle, there is no memory allocation at all.
+ * This design supports the use of RDMA. Since RDMA requires memory pinning, we
+ * must be able to hold on to a slab for a reasonable amount of time to get any
+ * real use out of it.
+ *
+ * Regardless, the current strategy taken is:
+ *
+ * 1. If the checkpoint size increases,
+ *    then grow the number of slabs to support it
+ *    (if and only if RDMA is activated, these slabs will be pinned).
+ * 2. If the next checkpoint size is smaller than the last one,
+ *    then that's a "strike".
+ * 3. After N strikes, cut the size of the slab cache in half
+ *    (to a minimum of 1 slab as described before).
+ *
+ * As of this writing, the typical size of
+ * an idle-VM checkpoint is under 5MB.
+ */
+
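+/*
+ * Worked example (illustration only): with the defaults below -
+ * a 100ms checkpoint frequency and 10 seconds-worth of strikes -
+ * CALC_MAX_STRIKES() yields (10 * 1000) / 100 = 100 accumulated
+ * strikes before the slab cache is cut in half.
+ */
+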
+#define MC_SLAB_BUFFER_SIZE     (5UL * 1024UL * 1024UL) /* empirical */
+#define MC_DEV_NAME_MAX_SIZE    256
+
+#define MC_DEFAULT_CHECKPOINT_FREQ_MS 100 /* too slow, but best for now */
+#define CALC_MAX_STRIKES()                                           \
+    do {  max_strikes = (max_strikes_delay_secs * 1000) / freq_ms; } \
+    while (0)
+
+/*
+ * How many "seconds-worth" of checkpoints to wait before re-evaluating the size
+ * of the slab list?
+ *
+ * #strikes_until_shrink_cache = Function(#checkpoints/sec)
+ *
+ * Increasing the number of seconds also increases the number of strikes needed
+ * to be reached until it is time to cut the cache in half.
+ *
+ * The value below is open for debate - we just want it to be small enough
+ * to ensure that a large, idle slab list doesn't stay too large for too long.
+ */
+#define MC_DEFAULT_SLAB_MAX_CHECK_DELAY_SECS 10
+
+/*
+ * MC serializes the RAM page contents in such a way that the actual
+ * pages are separated from the meta-data (all the QEMUFile stuff).
+ *
+ * This is done strictly for the purposes of being able to use RDMA
+ * and to replace memcpy() on the local machine for hardware with
+ * very fast RAM memory.
+ *
+ * This serialization requires recording the page descriptions and then
+ * pushing them into slabs after the checkpoint has been captured
+ * (minus the page data).
+ *
+ * The memory holding the page descriptions is allocated in unison with the
+ * slabs themselves, and thus we need to know in advance the maximum number of
+ * page descriptions that can fit into a slab before allocating the slab.
+ * It should be safe to assume the *minimum* page size (not the maximum,
+ * that would be dangerous) is 4096.
+ *
+ * We're not actually using this assumption for any memory management,
+ * only as a hint to know how big of an array to allocate.
+ *
+ * The following adds a fixed cost of about 40 KB to each slab.
+ */
+#define MC_MAX_SLAB_COPY_DESCRIPTORS (MC_SLAB_BUFFER_SIZE / 4096)
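+
+/*
+ * Illustration: 5MB / 4096 = 1280 descriptors per slab; at 32 bytes per
+ * packed MCCopy (four uint64_t fields, defined below) that is the fixed
+ * cost of about 40 KB noted above.
+ */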
+
+#define SLAB_RESET(s) do { \
+        (s)->size = 0;     \
+        (s)->read = 0;     \
+    } while (0)
+
+uint64_t freq_ms = MC_DEFAULT_CHECKPOINT_FREQ_MS;
+uint32_t max_strikes_delay_secs = MC_DEFAULT_SLAB_MAX_CHECK_DELAY_SECS;
+uint32_t max_strikes = -1;
+
+typedef struct QEMU_PACKED MCCopy {
+    uint64_t ramblock_offset;
+    uint64_t host_addr;
+    uint64_t offset;
+    uint64_t size;
+} MCCopy;
+
+typedef struct QEMU_PACKED MCCopyset {
+    QTAILQ_ENTRY(MCCopyset) node;
+    MCCopy copies[MC_MAX_SLAB_COPY_DESCRIPTORS];
+    uint64_t nb_copies;
+    int idx;
+} MCCopyset;
+
+typedef struct QEMU_PACKED MCSlab {
+    QTAILQ_ENTRY(MCSlab) node;
+    uint8_t buf[MC_SLAB_BUFFER_SIZE];
+    uint64_t read;
+    uint64_t size;
+    int idx;
+} MCSlab;
+
+typedef struct MCParams {
+    QTAILQ_HEAD(shead, MCSlab) slab_head;
+    QTAILQ_HEAD(chead, MCCopyset) copy_head;
+    MCSlab *curr_slab;
+    MCSlab *mem_slab;
+    MCCopyset *curr_copyset;
+    MCCopy *copy;
+    QEMUFile *file;
+    QEMUFile *staging;
+    uint64_t start_copyset;
+    uint64_t slab_total;
+    uint64_t total_copies;
+    uint64_t nb_slabs;
+    uint64_t used_slabs;
+    uint32_t slab_strikes;
+    uint32_t copy_strikes;
+    int nb_copysets;
+    uint64_t checkpoints;
+} MCParams;
+
+enum {
+    MC_TRANSACTION_NACK = 300,
+    MC_TRANSACTION_START,
+    MC_TRANSACTION_COMMIT,
+    MC_TRANSACTION_ABORT,
+    MC_TRANSACTION_ACK,
+    MC_TRANSACTION_END,
+    MC_TRANSACTION_ANY,
+};
+
+static const char * mc_desc[] = {
+    [MC_TRANSACTION_NACK] = "NACK",
+    [MC_TRANSACTION_START] = "START",
+    [MC_TRANSACTION_COMMIT] = "COMMIT",
+    [MC_TRANSACTION_ABORT] = "ABORT",
+    [MC_TRANSACTION_ACK] = "ACK",
+    [MC_TRANSACTION_END] = "END",
+    [MC_TRANSACTION_ANY] = "ANY",
+};
+
+static struct rtnl_qdisc        *qdisc      = NULL;
+static struct nl_sock           *sock       = NULL;
+static struct rtnl_tc           *tc         = NULL;
+static struct nl_cache          *link_cache = NULL;
+static struct rtnl_tc_ops       *ops        = NULL;
+static struct nl_cli_tc_module  *tm         = NULL;
+static int first_nic_chosen = 0;
+
+/*
+ * Assuming a guest can 'try' to fill a 1 Gbps pipe,
+ * that works out to about 125000000 bytes/sec.
+ *
+ * Netlink better not be pre-allocating megabytes in the
+ * kernel qdisc, that would be crazy....
+ */
+#define START_BUFFER (1000*1000*1000 / 8)
+static int buffer_size = START_BUFFER, new_buffer_size = START_BUFFER;
+static const char * parent = "root";
+static int buffering_enabled = 0;
+static const char * BUFFER_NIC_PREFIX = "ifb";
+static QEMUBH *checkpoint_bh = NULL;
+static bool mc_requested = false;
+
+int migrate_use_mc(void)
+{
+    MigrationState *s = migrate_get_current();
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_MC];
+}
+
+int migrate_use_mc_net(void)
+{
+    MigrationState *s = migrate_get_current();
+    return !s->enabled_capabilities[MIGRATION_CAPABILITY_MC_NET_DISABLE];
+}
+
+int migrate_use_mc_rdma_copy(void)
+{
+    MigrationState *s = migrate_get_current();
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_MC_RDMA_COPY];
+}
+
+static int mc_deliver(int update)
+{
+    int err, flags = NLM_F_CREATE | NLM_F_REPLACE;
+
+    if (!buffering_enabled) {
+        return -EINVAL;
+    }
+
+    if (!update) {
+        flags |= NLM_F_EXCL;
+    }
+
+    if ((err = rtnl_qdisc_add(sock, qdisc, flags)) < 0) {
+        fprintf(stderr, "Unable to control qdisc: %s! %p %p %d\n",
+            nl_geterror(err), sock, qdisc, flags);
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
+static int mc_set_buffer_size(int size)
+{
+    int err;
+
+    if (!buffering_enabled) {
+        return 1;
+    }
+
+    buffer_size = size;
+    new_buffer_size = size;
+
+    if ((err = rtnl_qdisc_plug_set_limit((void *) qdisc, size)) < 0) {
+        fprintf(stderr, "MC: Unable to change buffer size: %s\n",
+                nl_geterror(err));
+        return -EINVAL;
+    }
+
+    DPRINTF("Set buffer size to %d bytes\n", size);
+
+    return mc_deliver(1);
+}
+
+/*
+ * Micro-checkpointing may require buffering network packets.
+ * Set that up for the first NIC only.... We'll worry about
+ * multiple NICs later.
+ */
+static void init_mc_nic_buffering(NICState *nic, void *opaque)
+{
+    char * device = opaque;
+    NetClientState * nc = &nic->ncs[0];
+    const char * key = "ifname=";
+    int keylen = strlen(key);
+    char * name;
+    int end = 0;
+    bool use_fd = false;
+   
+    if (first_nic_chosen) {
+        fprintf(stderr, "Micro-Checkpointing with multiple NICs "
+                        "not yet supported!\n");
+        return;
+    }
+
+    if (!nc->peer) {
+        fprintf(stderr, "Micro-Checkpoint nic %s does not have a peer host "
+                        "device for buffering. VM will not be consistent.\n",
+                nc->name);
+        return;
+    }
+
+    name = nc->peer->info_str;
+
+    DPRINTF("Checking contents of %s\n", name);
+
+    if (strncmp(name, key, keylen)) {
+        fprintf(stderr, "Micro-Checkpoint nic %s does not have 'ifname' "
+                        "in its description (%s, %s). Trying workaround...\n",
+                        nc->name, name, nc->peer->name);
+        key = "fd=";
+        keylen = strlen(key);
+        if (strncmp(name, key, keylen)) {
+            fprintf(stderr, "Still cannot find 'fd=' either. Failure.\n");
+            return;
+        }
+
+        use_fd = true;
+    }
+
+    name += keylen;
+
+    while (name[end++] != (use_fd ? '\0' : ','));
+
+    strncpy(device, name, end - 1);
+    memset(&device[end - 1], 0, MC_DEV_NAME_MAX_SIZE - (end - 1));
+
+    if (use_fd) {
+        struct ifreq r;
+        DPRINTF("Want to retrieve name from fd: %d\n", atoi(device));
+
+        if (ioctl(atoi(device), TUNGETIFF, &r) == -1) {
+            fprintf(stderr, "Failed to convert fd %s to name.\n", device);
+            return;
+        }
+
+        DPRINTF("Got name %s!\n", r.ifr_name);
+        strcpy(device, r.ifr_name);
+    }
+
+    first_nic_chosen = 1;
+}
+
+static int mc_suspend_buffering(void)
+{
+    int err;
+
+    if (!buffering_enabled) {
+        return -EINVAL;
+    }
+
+    if ((err = rtnl_qdisc_plug_release_indefinite((void *) qdisc)) < 0) {
+        fprintf(stderr, "MC: Unable to release indefinite: %s\n",
+            nl_geterror(err));
+        return -EINVAL;
+    }
+
+    DPRINTF("Buffering suspended\n");
+
+    return mc_deliver(1);
+}
+
+static int mc_disable_buffering(void)
+{
+    int err;
+
+    if (!buffering_enabled) {
+        goto out;
+    }
+
+    mc_suspend_buffering();
+
+    if (qdisc && sock && (err = rtnl_qdisc_delete(sock, (void *) qdisc)) < 0) {
+        fprintf(stderr, "Unable to release indefinite: %s\n", nl_geterror(err));
+    }
+
+out:
+    buffering_enabled = 0;
+    qdisc = NULL;
+    sock = NULL;
+    tc = NULL;
+    link_cache = NULL;
+    ops = NULL;
+    tm = NULL;
+
+    DPRINTF("Buffering disabled\n");
+
+    return 0;
+}
+
+/*
+ * Install a Qdisc plug for micro-checkpointing.
+ * If it exists already (say, from a previous dead VM or debugging
+ * session) then just open all the netlink data structures pointing
+ * to the existing plug and replace it.
+ */
+int mc_enable_buffering(void)
+{
+    char dev[MC_DEV_NAME_MAX_SIZE], buffer_dev[MC_DEV_NAME_MAX_SIZE];
+    int prefix_len = 0;
+    int buffer_prefix_len = strlen(BUFFER_NIC_PREFIX);
+
+    if (buffering_enabled) {
+        fprintf(stderr, "Buffering already enabled. Skipping.\n");
+        return 0;
+    }
+
+    first_nic_chosen = 0;
+
+    qemu_foreach_nic(init_mc_nic_buffering, dev);
+
+    if (!first_nic_chosen) {
+        fprintf(stderr, "Enumeration of NICs complete, but failed.\n");
+        goto failed;
+    }
+
+    while ((dev[prefix_len] < '0') || (dev[prefix_len] > '9')) {
+        prefix_len++;
+    }
+
+    strcpy(buffer_dev, BUFFER_NIC_PREFIX);
+    strncpy(buffer_dev + buffer_prefix_len,
+                dev + prefix_len, strlen(dev) - prefix_len + 1);
+
+    fprintf(stderr, "Initializing buffering for nic %s => %s\n", dev, buffer_dev);
+
+    if (sock == NULL) {
+        sock = (struct nl_sock *) nl_cli_alloc_socket();
+        if (!sock) {
+            fprintf(stderr, "MC: failed to allocate netlink socket\n");
+            goto failed;
+        }
+        nl_cli_connect(sock, NETLINK_ROUTE);
+    }
+
+    if (qdisc == NULL) {
+        qdisc = nl_cli_qdisc_alloc();
+        if (!qdisc) {
+            fprintf(stderr, "MC: failed to allocate netlink qdisc\n");
+            goto failed;
+        }
+        tc = (struct rtnl_tc *) qdisc;
+    }
+
+    if (link_cache == NULL) {
+        link_cache = nl_cli_link_alloc_cache(sock);
+        if (!link_cache) {
+            fprintf(stderr, "MC: failed to allocate netlink link_cache\n");
+            goto failed;
+        }
+    }
+
+    nl_cli_tc_parse_dev(tc, link_cache, (char *) buffer_dev);
+    nl_cli_tc_parse_parent(tc, (char *) parent);
+
+    if (!rtnl_tc_get_ifindex(tc)) {
+        fprintf(stderr, "Qdisc device '%s' does not exist!\n", buffer_dev);
+        goto failed;
+    }
+
+    if (!rtnl_tc_get_parent(tc)) {
+        fprintf(stderr, "Qdisc parent '%s' is not valid!\n", parent);
+        goto failed;
+    }
+
+    if (rtnl_tc_set_kind(tc, "plug") < 0) {
+        fprintf(stderr, "Could not open qdisc plug!\n");
+        goto failed;
+    }
+
+    if (!(ops = rtnl_tc_get_ops(tc))) {
+        fprintf(stderr, "Could not open qdisc plug!\n");
+        goto failed;
+    }
+
+    if (!(tm = nl_cli_tc_lookup(ops))) {
+        fprintf(stderr, "Qdisc plug not supported!\n");
+        goto failed;
+    }
+   
+    buffering_enabled = 1;
+
+    if (mc_deliver(0) < 0) {
+        fprintf(stderr, "First time qdisc create failed\n");
+        goto failed;
+    }
+
+    DPRINTF("Buffering enabled, size: %d MB.\n", buffer_size / 1024 / 1024);
+  
+    if (mc_set_buffer_size(buffer_size) < 0) {
+        goto failed;
+    }
+
+    if (mc_suspend_buffering() < 0) {
+        goto failed;
+    }
+
+    return 0;
+
+failed:
+    mc_disable_buffering();
+    return -EINVAL;
+}
+
+int mc_start_buffer(void)
+{
+    int err;
+
+    if (!buffering_enabled) {
+        return -EINVAL;
+    }
+
+    if (new_buffer_size != buffer_size) {
+        buffer_size = new_buffer_size;
+        fprintf(stderr, "GDB setting new buffer size to %d\n", buffer_size);
+        if (mc_set_buffer_size(buffer_size) < 0) {
+            return -EINVAL;
+        }
+    }
+
+    if ((err = rtnl_qdisc_plug_buffer((void *) qdisc)) < 0) {
+        fprintf(stderr, "Unable to insert checkpoint barrier: %s\n", nl_geterror(err));
+        return -EINVAL;
+    }
+
+    DDPRINTF("Inserted checkpoint barrier\n");
+
+    return mc_deliver(1);
+}
+
+static int mc_flush_oldest_buffer(void)
+{
+    int err;
+
+    if (!buffering_enabled) {
+        return -EINVAL;
+    }
+
+    if ((err = rtnl_qdisc_plug_release_one((void *) qdisc)) < 0) {
+        fprintf(stderr, "Unable to flush oldest checkpoint: %s\n", nl_geterror(err));
+        return -EINVAL;
+    }
+
+    DDPRINTF("Flushed oldest checkpoint barrier\n");
+
+    return mc_deliver(1);
+}
+
+/*
+ * Get the next slab in the list. If there is none, then make one.
+ */
+static MCSlab *mc_slab_next(MCParams *mc, MCSlab *slab)
+{
+    if (!QTAILQ_NEXT(slab, node)) {
+        int idx = mc->nb_slabs++;
+        mc->used_slabs++;
+        DDPRINTF("Extending slabs by one: %" PRIu64 " slabs total, "
+                 "%" PRIu64 " MB\n", mc->nb_slabs,
+                 mc->nb_slabs * sizeof(MCSlab) / 1024UL / 1024UL);
+        mc->curr_slab = qemu_memalign(4096, sizeof(MCSlab));
+        memset(mc->curr_slab, 0, sizeof(*(mc->curr_slab)));
+        mc->curr_slab->idx = idx;
+        QTAILQ_INSERT_TAIL(&mc->slab_head, mc->curr_slab, node);
+        slab = mc->curr_slab;
+        ram_control_add(mc->file, slab->buf, 
+                (uint64_t) slab->buf, MC_SLAB_BUFFER_SIZE);
+    } else {
+        slab = QTAILQ_NEXT(slab, node);
+        mc->used_slabs++;
+    }
+
+    mc->curr_slab = slab;
+    SLAB_RESET(slab);
+
+    if (slab->idx == mc->start_copyset) {
+        DDPRINTF("Found copyset slab @ idx %d\n", slab->idx);
+        mc->mem_slab = slab;
+    }
+
+    return slab;
+}
+
+static int mc_put_buffer(void *opaque, const uint8_t *buf,
+                                  int64_t pos, int size)
+{
+    MCParams *mc = opaque;
+    MCSlab *slab = mc->curr_slab;
+    uint64_t len = size;
+
+    assert(slab);
+
+    while (len) {
+        long put = MIN(MC_SLAB_BUFFER_SIZE - slab->size, len);
+
+        if (put == 0) {
+            DDPRINTF("Reached the end of slab %d. Need a new one.\n", slab->idx);
+            goto zero;
+        }
+
+        if (mc->copy && migrate_use_mc_rdma_copy()) {
+            int ret = ram_control_copy_page(mc->file, 
+                                        (uint64_t) slab->buf,
+                                        slab->size,
+                                        (ram_addr_t) mc->copy->ramblock_offset,
+                                        (ram_addr_t) mc->copy->offset,
+                                        put);
+
+            DDDPRINTF("Attempted offloaded memcpy.\n");
+
+            if (ret != RAM_COPY_CONTROL_NOT_SUPP) {
+                if (ret == RAM_COPY_CONTROL_DELAYED) {
+                    DDDPRINTF("Offloaded memcpy successful.\n"); 
+                    mc->copy->offset += put;
+                    goto next;
+                } else {
+                    fprintf(stderr, "Offloaded memcpy failed: %d\n", ret);
+                    return ret;
+                }
+            }
+        }
+
+        DDDPRINTF("Copying to %p from %p, size %" PRId64 "\n",
+                 slab->buf + slab->size, buf, put);
+
+        memcpy(slab->buf + slab->size, buf, put);
+next:
+
+        buf            += put;
+        slab->size     += put;
+        len            -= put;
+        mc->slab_total += put;
+
+        DDDPRINTF("put: %" PRIu64 " len: %" PRIu64
+                  " total %" PRIu64 " size: %" PRIu64 
+                  " slab %d\n",
+                  put, len, mc->slab_total, slab->size,
+                  slab->idx);
+zero:
+        if (len) {
+            slab = mc_slab_next(mc, slab);
+        }
+    }
+
+    return size;
+}
+
+/*
+ * Stop the VM, generate the micro checkpoint,
+ * but save the dirty memory into staging memory until
+ * we can re-activate the VM as soon as possible.
+ */
+static int capture_checkpoint(MCParams *mc, MigrationState *s)
+{
+    MCCopyset *copyset;
+    int idx, ret = 0;
+    uint64_t start, stop, copies = 0;
+    int64_t start_time;
+
+    mc->total_copies = 0;
+    qemu_mutex_lock_iothread();
+    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
+    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    /*
+     * If buffering is enabled, insert a Qdisc plug here
+     * to hold packets for the *next* MC, (not this one,
+     * the packets for this one have already been plugged
+     * and will be released after the MC has been transmitted.
+     */
+    mc_start_buffer();
+
+    qemu_savevm_state_begin(mc->staging, &s->params);
+    ret = qemu_file_get_error(s->file);
+
+    if (ret < 0) {
+        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+    }
+
+    qemu_savevm_state_complete(mc->staging);
+
+    ret = qemu_file_get_error(s->file);
+    if (ret < 0) {
+        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+        goto out;
+    }
+
+    /*
+     * The copied memory gets appended to the end of the snapshot, so let's
+     * remember where its going to go first and start a new slab.
+     */
+    mc_slab_next(mc, mc->curr_slab);
+    mc->start_copyset = mc->curr_slab->idx;
+
+    start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    /*
+     * Now perform the actual copy of memory into the tail end of the slab list. 
+     */
+    QTAILQ_FOREACH(copyset, &mc->copy_head, node) {
+        if (!copyset->nb_copies) {
+            break;
+        }
+
+        copies += copyset->nb_copies;
+
+        DDDPRINTF("copyset %d copies: %" PRIu64 " total: %" PRIu64 "\n",
+                copyset->idx, copyset->nb_copies, copies);
+
+        for (idx = 0; idx < copyset->nb_copies; idx++) {
+            uint8_t *addr;
+            long size;
+            mc->copy = &copyset->copies[idx];
+            addr = (uint8_t *) (mc->copy->host_addr + mc->copy->offset);
+            size = mc_put_buffer(mc, addr, mc->copy->offset, mc->copy->size);
+            if (size != mc->copy->size) {
+                fprintf(stderr, "Failure to initiate copyset %d index %d\n",
+                        copyset->idx, idx);
+                migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+                vm_start();
+                goto out;
+            }
+
+            DDDPRINTF("Success copyset %d index %d\n", copyset->idx, idx);
+        }
+
+        copyset->nb_copies = 0;
+    }
+
+    s->ram_copy_time = (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
+
+    mc->copy = NULL;
+    ram_control_before_iterate(mc->file, RAM_CONTROL_FLUSH); 
+    assert(mc->total_copies == copies);
+
+    stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+    /*
+     * MC is safe in staging area. Let the VM go.
+     */
+    vm_start();
+    qemu_fflush(mc->staging);
+
+    s->downtime = stop - start;
+out:
+    qemu_mutex_unlock_iothread();
+    return ret;
+}
+
+/*
+ * Synchronously send a micro-checkpointing command
+ */
+static int mc_send(QEMUFile *f, uint64_t request)
+{
+    int ret = 0;
+
+    qemu_put_be64(f, request);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        fprintf(stderr, "transaction: send error while sending %" PRIu64 ", "
+                "bailing: %s\n", request, strerror(-ret));
+    } else {
+        DDPRINTF("transaction: sent: %s (%" PRIu64 ")\n", 
+            mc_desc[request], request);
+    }
+
+    qemu_fflush(f);
+
+    return ret;
+}
+
+/*
+ * Synchronously receive a micro-checkpointing command
+ */
+static int mc_recv(QEMUFile *f, uint64_t request, uint64_t *action)
+{
+    int ret = 0;
+    uint64_t got;
+
+    got = qemu_get_be64(f);
+
+    ret = qemu_file_get_error(f);
+    if (ret) {
+        fprintf(stderr, "transaction: recv error while expecting %s (%"
+                PRIu64 "), bailing: %s\n", mc_desc[request], 
+                request, strerror(-ret));
+    } else {
+        if ((request != MC_TRANSACTION_ANY) && request != got) {
+            fprintf(stderr, "transaction: was expecting %s (%" PRIu64 
+                    ") but got %" PRIu64 " instead\n",
+                    mc_desc[request], request, got);
+            ret = -EINVAL;
+        } else {
+            DDPRINTF("transaction: recv: %s (%" PRIu64 ")\n", 
+                     mc_desc[got], got);
+            ret = 0;
+            if (action) {
+                *action = got;
+            }
+        }
+    }
+
+    return ret;
+}
+
+static MCSlab *mc_slab_start(MCParams *mc)
+{
+    if (mc->nb_slabs > 2) {
+        if (mc->slab_strikes >= max_strikes) {
+            uint64_t nb_slabs_to_free = MAX(1, (((mc->nb_slabs - 1) / 2)));
+
+            DPRINTF("MC has reached max strikes. Will free %" 
+                    PRIu64 " / %" PRIu64 " slabs max %d, "
+                    "checkpoints %" PRIu64 "\n",
+                    nb_slabs_to_free, mc->nb_slabs,
+                    max_strikes, mc->checkpoints);
+
+            mc->slab_strikes = 0;
+
+            while (nb_slabs_to_free) {
+                MCSlab *slab = QTAILQ_LAST(&mc->slab_head, shead);
+                ram_control_remove(mc->file, (uint64_t) slab->buf);
+                QTAILQ_REMOVE(&mc->slab_head, slab, node);
+                g_free(slab);
+                nb_slabs_to_free--;
+                mc->nb_slabs--;
+            }
+
+            goto skip;
+        } else if (((mc->slab_total <= 
+                    ((mc->nb_slabs - 1) * MC_SLAB_BUFFER_SIZE)))) {
+            mc->slab_strikes++;
+            DDPRINTF("MC has strike %d slabs %" PRIu64 " max %d\n", 
+                     mc->slab_strikes, mc->nb_slabs, max_strikes);
+            goto skip;
+        }
+    }
+
+    if (mc->slab_strikes) {
+        DDPRINTF("MC used all slabs. Resetting strikes to zero.\n");
+        mc->slab_strikes = 0;
+    }
+skip:
+
+    mc->used_slabs = 1;
+    mc->slab_total = 0;
+    mc->curr_slab = QTAILQ_FIRST(&mc->slab_head);
+    SLAB_RESET(mc->curr_slab);
+
+    return mc->curr_slab;
+}
+
+static MCCopyset *mc_copy_start(MCParams *mc)
+{
+    if (mc->nb_copysets >= 2) {
+        if (mc->copy_strikes >= max_strikes) {
+            int nb_copies_to_free = MAX(1, (((mc->nb_copysets - 1) / 2)));
+
+            DPRINTF("MC has reached max strikes. Will free %d / %d copies max %d\n",
+                    nb_copies_to_free, mc->nb_copysets, max_strikes);
+
+            mc->copy_strikes = 0;
+
+            while (nb_copies_to_free) {
+                MCCopyset * copyset = QTAILQ_LAST(&mc->copy_head, chead);
+                QTAILQ_REMOVE(&mc->copy_head, copyset, node);
+                g_free(copyset);
+                nb_copies_to_free--;
+                mc->nb_copysets--;
+            }
+
+            goto skip;
+        } else if (((mc->total_copies <= 
+                    ((mc->nb_copysets - 1) * MC_MAX_SLAB_COPY_DESCRIPTORS)))) {
+            mc->copy_strikes++;
+            DDPRINTF("MC has strike %d copies %d max %d\n", 
+                     mc->copy_strikes, mc->nb_copysets, max_strikes);
+            goto skip;
+        }
+    }
+
+    if (mc->copy_strikes) {
+        DDPRINTF("MC used all copies. Resetting strikes to zero.\n");
+        mc->copy_strikes = 0;
+    }
+skip:
+
+    mc->total_copies = 0;
+    mc->curr_copyset = QTAILQ_FIRST(&mc->copy_head);
+    mc->curr_copyset->nb_copies = 0;
+
+    return mc->curr_copyset;
+}
+
+/*
+ * Main MC loop. Stop the VM, dump the dirty memory
+ * into staging, restart the VM, transmit the MC,
+ * and then sleep for some milliseconds before
+ * starting the next MC.
+ */
+static void *mc_thread(void *opaque)
+{
+    MigrationState *s = opaque;
+    MCParams mc = { .file = s->file };
+    MCSlab * slab;
+    int64_t initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+    int ret = 0, fd = qemu_get_fd(s->file), x;
+    QEMUFile *mc_control, *mc_staging = NULL;
+    uint64_t wait_time = 0;
+   
+    if (!(mc_control = qemu_fopen_socket(fd, "rb"))) {
+        fprintf(stderr, "Failed to setup read MC control\n");
+        goto err;
+    }
+
+    if (!(mc_staging = qemu_fopen_mc(&mc, "wb"))) {
+        fprintf(stderr, "Failed to setup MC staging area\n");
+        goto err;
+    }
+
+    mc.staging = mc_staging;
+
+    qemu_set_block(fd);
+    socket_set_nodelay(fd);
+
+    s->checkpoints = 0;
+
+    while (s->state == MIG_STATE_CHECKPOINTING) {
+        int64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+        int64_t start_time, xmit_start, end_time;
+        bool commit_sent = false;
+        int nb_slab = 0;
+        (void)nb_slab;
+        
+        slab = mc_slab_start(&mc);
+        mc_copy_start(&mc);
+        acct_clear();
+        start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+        if (capture_checkpoint(&mc, s) < 0) {
+            break;
+        }
+
+        xmit_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+
+        if ((ret = mc_send(s->file, MC_TRANSACTION_START)) < 0) {
+            fprintf(stderr, "transaction start failed\n");
+            break;
+        }
+        
+        DDPRINTF("Sending checkpoint size %" PRId64 
+                 " copyset start: %" PRIu64 " nb slab %" PRIu64 
+                 " used slabs %" PRIu64 "\n",
+                 mc.slab_total, mc.start_copyset, mc.nb_slabs, mc.used_slabs);
+
+        mc.curr_slab = QTAILQ_FIRST(&mc.slab_head);
+
+        qemu_put_be64(s->file, mc.slab_total);
+        qemu_put_be64(s->file, mc.start_copyset);
+        qemu_put_be64(s->file, mc.used_slabs);
+
+        qemu_fflush(s->file);
+       
+        DDPRINTF("Transaction commit\n");
+
+        /*
+         * The MC is safe, and VM is running again.
+         * Start a transaction and send it.
+         */
+        ram_control_before_iterate(s->file, RAM_CONTROL_ROUND); 
+
+        slab = QTAILQ_FIRST(&mc.slab_head);
+
+        for (x = 0; x < mc.used_slabs; x++) {
+            DDPRINTF("Attempting write to slab #%d: %p"
+                    " total size: %" PRId64 " / %" PRIu64 "\n",
+                    nb_slab++, slab->buf, slab->size, MC_SLAB_BUFFER_SIZE);
+
+            ret = ram_control_save_page(s->file, (uint64_t) slab->buf,
+                                        NULL, 0, slab->size, NULL);
+
+            if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
+                if (!commit_sent) {
+                    if ((ret = mc_send(s->file, MC_TRANSACTION_COMMIT)) < 0) {
+                        fprintf(stderr, "transaction commit failed\n");
+                        break;
+                    }
+                    commit_sent = true;
+                }
+
+                qemu_put_be64(s->file, slab->size);
+                qemu_put_buffer_async(s->file, slab->buf, slab->size);
+            } else if ((ret < 0) && (ret != RAM_SAVE_CONTROL_DELAYED)) {
+                fprintf(stderr, "failed 1, skipping send\n");
+                goto err;
+            }
+
+            if (qemu_file_get_error(s->file)) {
+                fprintf(stderr, "failed 2, skipping send\n");
+                goto err;
+            }
+                
+            DDPRINTF("Sent %" PRId64 " all %ld\n", slab->size, mc.slab_total);
+
+            slab = QTAILQ_NEXT(slab, node);
+        }
+
+        if (!commit_sent) {
+            ram_control_after_iterate(s->file, RAM_CONTROL_ROUND); 
+            slab = QTAILQ_FIRST(&mc.slab_head);
+
+            for (x = 0; x < mc.used_slabs; x++) {
+                qemu_put_be64(s->file, slab->size);
+                slab = QTAILQ_NEXT(slab, node);
+            }
+        }
+
+        qemu_fflush(s->file);
+
+        if (commit_sent) {
+            DDPRINTF("Waiting for commit ACK\n");
+
+            if ((ret = mc_recv(mc_control, MC_TRANSACTION_ACK, NULL)) < 0) {
+                goto err;
+            }
+        }
+
+        ret = qemu_file_get_error(s->file);
+        if (ret) {
+            fprintf(stderr, "Error sending checkpoint: %d\n", ret);
+            goto err;
+        }
+
+        DDPRINTF("Memory transfer complete.\n");
+
+        /*
+         * The MC is safe on the other side now,
+         * go along our merry way and release the network
+         * packets from the buffer if enabled.
+         */
+        mc_flush_oldest_buffer();
+
+        end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
+        s->total_time = end_time - start_time;
+        s->xmit_time = end_time - xmit_start;
+        s->bitmap_time = norm_mig_bitmap_time();
+        s->log_dirty_time = norm_mig_log_dirty_time();
+        s->mbps = MBPS(mc.slab_total, s->xmit_time);
+        s->copy_mbps = MBPS(mc.slab_total, s->ram_copy_time);
+        s->bytes_xfer = mc.slab_total;
+        s->checkpoints = mc.checkpoints++;
+
+        wait_time = (s->downtime <= freq_ms) ? (freq_ms - s->downtime) : 0;
+
+        if (current_time >= initial_time + 1000) {
+            DPRINTF("bytes %" PRIu64 " xmit_mbps %0.1f xmit_time %" PRId64
+                    " downtime %" PRIu64 " sync_time %" PRId64
+                    " logdirty_time %" PRId64 " ram_copy_time %" PRId64
+                    " copy_mbps %0.1f wait time %" PRIu64
+                    " checkpoints %" PRId64 "\n",
+                    s->bytes_xfer,
+                    s->mbps,
+                    s->xmit_time,
+                    s->downtime,
+                    s->bitmap_time,
+                    s->log_dirty_time,
+                    s->ram_copy_time,
+                    s->copy_mbps,
+                    wait_time,
+                    s->checkpoints);
+            initial_time = current_time;
+        }
+
+        /*
+         * Checkpoint frequency in milliseconds.
+         *
+         * Sometimes, when checkpoints are very large,
+         * all of the wait time is dominated by the
+         * time taken to copy the checkpoint into the staging area,
+         * in which case wait_time will probably be zero and we
+         * will end up diving right back into the next checkpoint
+         * as soon as the previous transmission completes.
+         */
+        if (wait_time) {
+            g_usleep(wait_time * 1000);
+        }
+    }
+
+    goto out;
+
+err:
+    /*
+     * TODO: Possible split-brain scenario:
+     * Normally, this should never be reached unless there was a
+     * connection error or network partition - in which case
+     * only the management software can resume the VM safely 
+     * when it knows the exact state of the MC destination.
+     *
+     * We need management to poll the source and destination to determine
+     * if the destination has already taken control. If not, then
+     * we need to resume the source.
+     *
+     * If there was a connection error during checkpoint *transmission*
+     * then the destination VM will likely have already resumed,
+     * in which case we need to stop the current VM from running
+     * and throw away any buffered packets.
+     * 
+     * Verify that "disable_buffering" below does not release any traffic.
+     */
+    migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
+out:
+    if (mc_staging) {
+        qemu_fclose(mc_staging);
+    }
+
+    if (mc_control) {
+        qemu_fclose(mc_control);
+    }
+
+    mc_disable_buffering();
+
+    qemu_mutex_lock_iothread();
+
+    if (s->state != MIG_STATE_ERROR) {
+        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_COMPLETED);
+    }
+
+    qemu_bh_schedule(s->cleanup_bh);
+    qemu_mutex_unlock_iothread();
+
+    return NULL;
+}
+
+/*
+ * Get the next copyset in the list. If there is none, then make one.
+ */
+static MCCopyset *mc_copy_next(MCParams *mc, MCCopyset *copyset)
+{
+    if (!QTAILQ_NEXT(copyset, node)) {
+        int idx = mc->nb_copysets++;
+        DDPRINTF("Extending copysets by one: %d sets total, "
+                 "%" PRIu64 " MB\n", mc->nb_copysets,
+                 mc->nb_copysets * sizeof(MCCopyset) / 1024UL / 1024UL);
+        mc->curr_copyset = g_malloc(sizeof(MCCopyset));
+        mc->curr_copyset->idx = idx;
+        QTAILQ_INSERT_TAIL(&mc->copy_head, mc->curr_copyset, node);
+        copyset = mc->curr_copyset;
+    } else {
+        copyset = QTAILQ_NEXT(copyset, node);
+    }
+
+    mc->curr_copyset = copyset;
+    copyset->nb_copies = 0;
+
+    return copyset;
+}
+
+void mc_process_incoming_checkpoints_if_requested(QEMUFile *f)
+{
+    MCParams mc = { .file = f };
+    MCSlab *slab;
+    int fd = qemu_get_fd(f);
+    QEMUFile *mc_control, *mc_staging;
+    uint64_t checkpoint_size, action;
+    uint64_t slabs;
+    int got, x, ret, received = 0;
+    bool checkpoint_received;
+
+    CALC_MAX_STRIKES();
+
+    if (!mc_requested) {
+        DPRINTF("Source has not requested MC. Returning.\n");
+        return;
+    }
+   
+    if (!(mc_control = qemu_fopen_socket(fd, "wb"))) {
+        fprintf(stderr, "Could not make incoming MC control channel\n");
+        goto rollback;
+    }
+
+    if (!(mc_staging = qemu_fopen_mc(&mc, "rb"))) {
+        fprintf(stderr, "Could not make outgoing MC staging area\n");
+        goto rollback;
+    }
+
+    //qemu_set_block(fd);
+    socket_set_nodelay(fd);
+
+    while (true) {
+        checkpoint_received = false;
+        ret = mc_recv(f, MC_TRANSACTION_ANY, &action);
+        if (ret < 0) {
+            goto rollback;
+        }
+
+        switch (action) {
+        case MC_TRANSACTION_START:
+            checkpoint_size = qemu_get_be64(f);
+            mc.start_copyset = qemu_get_be64(f);
+            slabs = qemu_get_be64(f);
+
+            DDPRINTF("Transaction start: size %" PRIu64 
+                     " copyset start: %" PRIu64 " slabs %" PRIu64 "\n",
+                     checkpoint_size, mc.start_copyset, slabs);
+
+            assert(checkpoint_size);
+            break;
+        case MC_TRANSACTION_COMMIT: /* tcp */
+            slab = mc_slab_start(&mc);
+            received = 0;
+
+            while (received < checkpoint_size) {
+                int total = 0;
+                slab->size = qemu_get_be64(f);
+
+                DDPRINTF("Expecting size: %" PRIu64 "\n", slab->size);
+
+                while (total != slab->size) {
+                    got = qemu_get_buffer(f, slab->buf + total, slab->size - total);
+                    if (got <= 0) {
+                        fprintf(stderr, "Error pre-filling checkpoint: %d\n", got);
+                        goto rollback;
+                    }
+                    DDPRINTF("Received %d slab %d / %ld received %d total %"
+                             PRIu64 "\n", got, total, slab->size, 
+                             received, checkpoint_size);
+                    received += got;
+                    total += got;
+                }
+
+                if (received != checkpoint_size) {
+                    slab = mc_slab_next(&mc, slab);
+                }
+            }
+
+            DDPRINTF("Acknowledging successful commit\n");
+
+            if (mc_send(mc_control, MC_TRANSACTION_ACK) < 0) {
+                goto rollback;
+            }
+
+            checkpoint_received = true;
+            break;
+        case RAM_SAVE_FLAG_HOOK: /* rdma */
+            /*
+             * Must be RDMA registration handling. Preallocate
+             * the slabs (if not already done in a previous checkpoint)
+             * before allowing RDMA to register them.
+             */
+            slab = mc_slab_start(&mc);
+
+            DDPRINTF("Pre-populating slabs %" PRIu64 "...\n", slabs);
+
+            for (x = 1; x < slabs; x++) {
+                slab = mc_slab_next(&mc, slab);
+            }
+
+            ram_control_load_hook(f, action);
+
+            DDPRINTF("Hook complete.\n");
+
+            slab = QTAILQ_FIRST(&mc.slab_head);
+
+            for (x = 0; x < slabs; x++) {
+                slab->size = qemu_get_be64(f);
+                slab = QTAILQ_NEXT(slab, node);
+            }
+
+            checkpoint_received = true;
+            break;
+        default:
+            fprintf(stderr, "Unknown MC action: %" PRIu64 "\n", action);
+            goto rollback;
+        }
+
+        if (checkpoint_received) {
+            mc.curr_slab = QTAILQ_FIRST(&mc.slab_head);
+            mc.slab_total = checkpoint_size;
+
+            DDPRINTF("Committed Loading MC state \n");
+
+            mc_copy_start(&mc);
+
+            if (qemu_loadvm_state(mc_staging) < 0) {
+                fprintf(stderr, "loadvm transaction failed\n");
+                /*
+                 * This is fatal. No rollback is possible because we may have
+                 * applied only a subset of the checkpoint to main memory,
+                 * potentially leaving the VM in an inconsistent state.
+                 */
+                goto err;
+            }
+
+            mc.slab_total = checkpoint_size;
+
+            DDPRINTF("Transaction complete.\n");
+            mc.checkpoints++;
+        }
+    }
+
+rollback:
+    fprintf(stderr, "MC: checkpointing stopped. Recovering VM\n");
+    goto out;
+err:
+    fprintf(stderr, "Micro Checkpointing Protocol Failed\n");
+    exit(1); 
+out:
+    if (mc_staging) {
+        qemu_fclose(mc_staging);
+    }
+
+    if (mc_control) {
+        qemu_fclose(mc_control);
+    }
+}
+
+static int mc_get_buffer_internal(void *opaque, uint8_t *buf, int64_t pos,
+                                  int size, MCSlab **curr_slab, uint64_t end_idx)
+{
+    uint64_t len = size;
+    uint8_t *data = (uint8_t *) buf;
+    MCSlab *slab = *curr_slab;
+    MCParams *mc = opaque;
+
+    assert(slab);
+
+    DDDPRINTF("got request for %d bytes %p %p. idx %d\n",
+              size, slab, QTAILQ_FIRST(&mc->slab_head), slab->idx);
+
+    while (len && slab) {
+        uint64_t get = MIN(slab->size - slab->read, len);
+
+        memcpy(data, slab->buf + slab->read, get);
+
+        data           += get;
+        slab->read     += get;
+        len            -= get;
+        mc->slab_total -= get;
+
+        DDDPRINTF("got: %" PRIu64 " read: %" PRIu64 
+                 " len %" PRIu64 " slab_total %" PRIu64 
+                 " size %" PRIu64 " addr: %p slab %d"
+                 " requested %d\n",
+                 get, slab->read, len, mc->slab_total, 
+                 slab->size, slab->buf, slab->idx, size);
+
+        if (len) {
+            if (slab->idx == end_idx) {
+                break;
+            }
+
+            slab = QTAILQ_NEXT(slab, node);
+        }
+    }
+
+    *curr_slab = slab;
+    DDDPRINTF("Returning %" PRIu64 " / %d bytes\n", size - len, size);
+
+    return size - len;
+}
+
+static int mc_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
+{
+    MCParams *mc = opaque;
+
+    return mc_get_buffer_internal(mc, buf, pos, size, &mc->curr_slab,
+                                  mc->start_copyset - 1);
+}
+
+static int mc_load_page(QEMUFile *f, void *opaque, void *host_addr, long size)
+{
+    MCParams *mc = opaque;
+
+    DDDPRINTF("Loading page into %p of size %" PRIu64 "\n", host_addr, size);
+
+    return mc_get_buffer_internal(mc, host_addr, 0, size, &mc->mem_slab,
+                                  mc->nb_slabs - 1);
+}
+
+/*
+ * Provide QEMUFile with a *local* RDMA-based way to do memcpy().
+ * This lowers cache pollution and allows the CPU pipeline to
+ * remain free for regular use by VMs (as well as by neighbors).
+ *
+ * In a future implementation, we may attempt to perform this
+ * copy *without* stopping the source VM - if the data shows
+ * that it can be done effectively.
+ */
+static int mc_save_page(QEMUFile *f, void *opaque,
+                           ram_addr_t block_offset, 
+                           uint8_t *host_addr,
+                           ram_addr_t offset,
+                           long size, int *bytes_sent)
+{
+    MCParams *mc = opaque;
+    MCCopyset *copyset = mc->curr_copyset;
+    MCCopy *c;
+
+    if (copyset->nb_copies >= MC_MAX_SLAB_COPY_DESCRIPTORS) {
+        copyset = mc_copy_next(mc, copyset);
+    }
+
+    c = &copyset->copies[copyset->nb_copies++];
+    c->ramblock_offset = (uint64_t) block_offset;
+    c->host_addr = (uint64_t) host_addr;
+    c->offset = (uint64_t) offset;
+    c->size = (uint64_t) size;
+    mc->total_copies++;
+
+    return RAM_SAVE_CONTROL_DELAYED;
+}
+
+static ssize_t mc_writev_buffer(void *opaque, struct iovec *iov,
+                                int iovcnt, int64_t pos)
+{
+    ssize_t len = 0;
+    unsigned int i;
+
+    for (i = 0; i < iovcnt; i++) {
+        DDDPRINTF("iov # %d, len: %" PRId64 "\n", i, iov[i].iov_len); 
+        len += mc_put_buffer(opaque, iov[i].iov_base, 0, iov[i].iov_len); 
+    }
+
+    return len;
+}
+
+static int mc_get_fd(void *opaque)
+{
+    MCParams *mc = opaque;
+
+    return qemu_get_fd(mc->file);
+}
+
+static int mc_close(void *opaque)
+{
+    MCParams *mc = opaque;
+    MCSlab *slab, *next;
+
+    QTAILQ_FOREACH_SAFE(slab, &mc->slab_head, node, next) {
+        ram_control_remove(mc->file, (uint64_t) slab->buf);
+        QTAILQ_REMOVE(&mc->slab_head, slab, node);
+        g_free(slab);
+    }
+
+    mc->curr_slab = NULL;
+
+    return 0;
+}
+
+static const QEMUFileOps mc_write_ops = {
+    .writev_buffer = mc_writev_buffer,
+    .put_buffer = mc_put_buffer,
+    .get_fd = mc_get_fd,
+    .close = mc_close,
+    .save_page = mc_save_page,
+};
+
+static const QEMUFileOps mc_read_ops = {
+    .get_buffer = mc_get_buffer,
+    .get_fd = mc_get_fd,
+    .close = mc_close,
+    .load_page = mc_load_page,
+};
+
+QEMUFile *qemu_fopen_mc(void *opaque, const char *mode)
+{
+    MCParams *mc = opaque;
+    MCSlab *slab;
+    MCCopyset *copyset;
+
+    if (qemu_file_mode_is_not_valid(mode)) {
+        return NULL;
+    }
+
+    QTAILQ_INIT(&mc->slab_head);
+    QTAILQ_INIT(&mc->copy_head);
+
+    slab = qemu_memalign(8, sizeof(MCSlab));
+    memset(slab, 0, sizeof(*slab));
+    slab->idx = 0;
+    QTAILQ_INSERT_HEAD(&mc->slab_head, slab, node);
+    mc->slab_total = 0;
+    mc->curr_slab = slab;
+    mc->nb_slabs = 1;
+    mc->slab_strikes = 0;
+
+    ram_control_add(mc->file, slab->buf, (uint64_t) slab->buf, MC_SLAB_BUFFER_SIZE);
+
+    copyset = g_malloc(sizeof(MCCopyset));
+    copyset->idx = 0;
+    QTAILQ_INSERT_HEAD(&mc->copy_head, copyset, node);
+    mc->total_copies = 0;
+    mc->curr_copyset = copyset;
+    mc->nb_copysets = 1;
+    mc->copy_strikes = 0;
+
+    if (mode[0] == 'w') {
+        return qemu_fopen_ops(mc, &mc_write_ops);
+    }
+
+    return qemu_fopen_ops(mc, &mc_read_ops);
+}
+
+static void mc_start_checkpointer(void *opaque)
+{
+    MigrationState *s = opaque;
+
+    if (checkpoint_bh) {
+        qemu_bh_delete(checkpoint_bh);
+        checkpoint_bh = NULL;
+    }
+
+    qemu_mutex_unlock_iothread();
+    qemu_thread_join(s->thread);
+    g_free(s->thread);
+    qemu_mutex_lock_iothread();
+
+    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CHECKPOINTING);
+    s->thread = g_malloc0(sizeof(*s->thread));
+    qemu_thread_create(s->thread, mc_thread, s, QEMU_THREAD_JOINABLE);
+}
+
+void mc_init_checkpointer(MigrationState *s)
+{
+    CALC_MAX_STRIKES();
+    checkpoint_bh = qemu_bh_new(mc_start_checkpointer, s);
+    qemu_bh_schedule(checkpoint_bh);
+}
+
+void qmp_migrate_set_mc_delay(int64_t value, Error **errp)
+{
+    freq_ms = value;
+    CALC_MAX_STRIKES();
+    DPRINTF("Setting checkpoint frequency to %" PRId64 " ms and "
+            "resetting strikes to %d based on a %d sec delay.\n",
+            freq_ms, max_strikes, max_strikes_delay_secs);
+}
+
+int mc_info_load(QEMUFile *f, void *opaque, int version_id)
+{
+    bool mc_enabled = qemu_get_byte(f);
+
+    if (mc_enabled && !mc_requested) {
+        DPRINTF("MC is requested\n");
+        mc_requested = true;
+    }
+
+    max_strikes = qemu_get_be32(f);
+
+    return 0;
+}
+
+void mc_info_save(QEMUFile *f, void *opaque)
+{
+    qemu_put_byte(f, migrate_use_mc());
+    qemu_put_be32(f, max_strikes);
+}
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 09/12] mc: configure and makefile support
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (7 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency mrhines
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

Add the --enable-mc / --disable-mc configure options, probe for the
libnl3 netlink libraries needed for network buffering, and wire
migration-checkpoint.o into the build.
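
For example, to build with MC support (assuming the libnl3 development
packages are installed; package names vary by distro):

    $ ./configure --enable-mc [...other options...]
    $ make

With no explicit flag, configure autodetects libnl3 and enables MC only
when the headers are present; --disable-mc skips the probe entirely.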

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 Makefile.objs |  1 +
 configure     | 45 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/Makefile.objs b/Makefile.objs
index ac1d0e1..db70f93 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -53,6 +53,7 @@ common-obj-y += migration.o migration-tcp.o
 common-obj-y += vmstate.o
 common-obj-y += qemu-file.o
 common-obj-$(CONFIG_RDMA) += migration-rdma.o
+common-obj-$(CONFIG_MC) += migration-checkpoint.o
 common-obj-y += qemu-char.o #aio.o
 common-obj-y += block-migration.o
 common-obj-y += page_cache.o xbzrle.o
diff --git a/configure b/configure
index 0eadab5..507ee99 100755
--- a/configure
+++ b/configure
@@ -197,6 +197,7 @@ kvm="no"
 rdma=""
 gprof="no"
 debug_tcg="no"
+mc=""
 debug="no"
 strip_opt="yes"
 tcg_interpreter="no"
@@ -1001,6 +1002,10 @@ for opt do
   ;;
   --enable-libssh2) libssh2="yes"
   ;;
+  --disable-mc) mc="no"
+  ;;
+  --enable-mc) mc="yes"
+  ;;
   --enable-vhdx) vhdx="yes"
   ;;
   --disable-vhdx) vhdx="no"
@@ -1189,6 +1194,8 @@ Advanced options (experts only):
   --enable-kvm             enable KVM acceleration support
   --disable-rdma           disable RDMA-based migration support
   --enable-rdma            enable RDMA-based migration support
+  --disable-mc             disable Micro-Checkpointing support
+  --enable-mc              enable Micro-Checkpointing support
   --enable-tcg-interpreter enable TCG with bytecode interpreter (TCI)
   --enable-system          enable all system emulation targets
   --disable-system         disable all system emulation targets
@@ -1901,6 +1908,35 @@ EOF
   fi
 fi
 
+##################################################
+# Micro-Checkpointing requires netlink (libnl3)
+if test "$mc" != "no" ; then
+  cat > $TMPC <<EOF
+#include <libnl3/netlink/route/qdisc/plug.h>
+#include <libnl3/netlink/route/class.h>
+#include <libnl3/netlink/cli/utils.h>
+#include <libnl3/netlink/cli/tc.h>
+#include <libnl3/netlink/cli/qdisc.h>
+#include <libnl3/netlink/cli/link.h>
+int main(void) { return 0; }
+EOF
+  mc_libs="-lnl-3 -lnl-cli-3 -lnl-route-3"
+  mc_cflags="-I/usr/include/libnl3"
+  if compile_prog "$mc_cflags" "$mc_libs" ; then
+    mc="yes"
+    libs_softmmu="$libs_softmmu $mc_libs"
+    QEMU_CFLAGS="$QEMU_CFLAGS $mc_cflags"
+  else
+    if test "$mc" = "yes" ; then
+        error_exit \
+            " NetLink v3 libs/headers not present." \
+            " Please install the libnl3-*-dev(el) packages from your distro."
+    fi
+    mc="no"
+  fi
+fi
+
+
 ##########################################
 # VNC TLS/WS detection
 if test "$vnc" = "yes" -a \( "$vnc_tls" != "no" -o "$vnc_ws" != "no" \) ; then
@@ -3829,6 +3865,7 @@ echo "KVM support       $kvm"
 echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
 echo "fdt support       $fdt"
+echo "Micro checkpointing $mc"
 echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
 echo "madvise           $madvise"
@@ -4334,6 +4371,10 @@ if test "$rdma" = "yes" ; then
   echo "CONFIG_RDMA=y" >> $config_host_mak
 fi
 
+if test "$mc" = "yes" ; then
+  echo "CONFIG_MC=y" >> $config_host_mak
+fi
+
 if test "$tcg_interpreter" = "yes"; then
   QEMU_INCLUDES="-I\$(SRC_PATH)/tcg/tci $QEMU_INCLUDES"
 elif test "$ARCH" = "sparc64" ; then
@@ -4769,6 +4810,10 @@ echo "QEMU_CFLAGS+=$cflags" >> $config_target_mak
 
 done # for target in $targets
 
+if test "$mc" = "yes" ; then
+echo "CONFIG_MC=y" >> $config_host_mak
+fi
+
 if [ "$pixman" = "internal" ]; then
   echo "config-host.h: subdir-pixman" >> $config_host_mak
 fi
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (8 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 09/12] mc: configure and makefile support mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-03-11 21:49   ` Eric Blake
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing mrhines
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

This exposes a QMP command that allows the management software
or policy to control the frequency of micro-checkpointing.
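
For example, from the HMP monitor (using the command added below):

    (qemu) migrate-set-mc-delay 100

would ask for a micro-checkpoint at most every 100 milliseconds; the
equivalent QMP invocation is shown in the qmp-commands.hx hunk.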

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 hmp-commands.hx  | 16 +++++++++++++++-
 hmp.c            |  6 ++++++
 hmp.h            |  1 +
 qapi-schema.json | 13 +++++++++++++
 qmp-commands.hx  | 23 +++++++++++++++++++++++
 5 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index f3fc514..2066c76 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -888,7 +888,7 @@ ETEXI
 		      "\n\t\t\t -b for migration without shared storage with"
 		      " full copy of disk\n\t\t\t -i for migration without "
 		      "shared storage with incremental copy of disk "
-		      "(base image shared between src and destination)",
+ 		      "(base image shared between src and destination)",
         .mhandler.cmd = hmp_migrate,
     },
 
@@ -965,6 +965,20 @@ Set maximum tolerated downtime (in seconds) for migration.
 ETEXI
 
     {
+        .name       = "migrate-set-mc-delay",
+        .args_type  = "value:i",
+        .params     = "value",
+        .help       = "Set maximum delay (in milliseconds) between micro-checkpoints",
+        .mhandler.cmd = hmp_migrate_set_mc_delay,
+    },
+
+STEXI
+@item migrate-set-mc-delay @var{millisecond}
+@findex migrate-set-mc-delay
+Set maximum delay (in milliseconds) between micro-checkpoints.
+ETEXI
+
+    {
         .name       = "migrate_set_capability",
         .args_type  = "capability:s,state:b",
         .params     = "capability state",
diff --git a/hmp.c b/hmp.c
index edf062e..9880bc8 100644
--- a/hmp.c
+++ b/hmp.c
@@ -1029,6 +1029,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
     qmp_migrate_set_downtime(value, NULL);
 }
 
+void hmp_migrate_set_mc_delay(Monitor *mon, const QDict *qdict)
+{
+    int64_t value = qdict_get_int(qdict, "value");
+    qmp_migrate_set_mc_delay(value, NULL);
+}
+
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
 {
     int64_t value = qdict_get_int(qdict, "value");
diff --git a/hmp.h b/hmp.h
index ed58f0e..068b2c1 100644
--- a/hmp.h
+++ b/hmp.h
@@ -60,6 +60,7 @@ void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
 void hmp_drive_backup(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
+void hmp_migrate_set_mc_delay(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
diff --git a/qapi-schema.json b/qapi-schema.json
index 7306adc..98abdac 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2160,6 +2160,19 @@
 { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
 
 ##
+# @migrate-set-mc-delay
+#
+# Set delay (in milliseconds) between micro checkpoints.
+#
+# @value: maximum delay in milliseconds 
+#
+# Returns: nothing on success
+#
+# Since: 2.x
+##
+{ 'command': 'migrate-set-mc-delay', 'data': {'value': 'int'} }
+
+##
 # @migrate_set_speed
 #
 # Set maximum speed for migration.
diff --git a/qmp-commands.hx b/qmp-commands.hx
index cce6b81..d8b9c34 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -754,6 +754,29 @@ Example:
 EQMP
 
     {
+        .name       = "migrate-set-mc-delay",
+        .args_type  = "value:i",
+        .mhandler.cmd_new = qmp_marshal_input_migrate_set_mc_delay,
+    },
+
+SQMP
+migrate-set-mc-delay
+--------------------
+
+Set maximum delay (in milliseconds) between micro-checkpoints.
+
+Arguments:
+
+- "value": maximum delay (json-int)
+
+Example:
+
+-> { "execute": "migrate-set-mc-delay", "arguments": { "value": 100 } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "client_migrate_info",
         .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
         .params     = "protocol hostname port tls-port cert-subject",
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (9 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-03-11 21:57   ` Eric Blake
  2014-03-11 22:02   ` Juan Quintela
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 12/12] mc: activate and use MC if requested mrhines
  2014-02-18  9:28 ` [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing Li Guang
  12 siblings, 2 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

New capabilities include the use of RDMA acceleration,
use of network buffering, and keepalive support, as documented
in patch #1.
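
For example, a management tool could turn the new capabilities on with
the existing migrate-set-capabilities command (illustrative values):

  -> { "execute": "migrate-set-capabilities", "arguments":
       { "capabilities": [ { "capability": "mc", "state": true },
                           { "capability": "mc-rdma-copy", "state": true } ] } }
  <- { "return": {} }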

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 qapi-schema.json | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index 98abdac..1fdf208 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -720,10 +720,44 @@
 # @auto-converge: If enabled, QEMU will automatically throttle down the guest
 #          to speed up convergence of RAM migration. (since 1.6)
 #
+# @mc: The migration will never end, and the VM will instead be continuously
+#          micro-checkpointed (MC). Use the command migrate-set-mc-delay to 
+#          control the frequency at which the checkpoints occur. 
+#          Disabled by default. (Since 2.x)
+#
+# @mc-net-disable: Deactivate network buffering against outbound network 
+#          traffic while Micro-Checkpointing (@mc) is active.
+#          Enabled by default. Disabling will make the MC protocol inconsistent
+#          and potentially break network connections upon an actual failure.
+#          Only for performance testing. (Since 2.x)
+#
+# @mc-rdma-copy: MC requires creating a local-memory checkpoint before
+#          transmission to the destination. This requires heavy use of 
+#          memcpy() which dominates the processor pipeline. This option 
+#          makes use of *local* RDMA to perform the copy instead of the CPU.
+#          Enabled by default only if the migration transport is RDMA.
+#          Disabled by default otherwise. (Since 2.x)
+#
+# @rdma-keepalive: RDMA connections do not timeout by themselves if a peer
+#         has disconnected prematurely or failed. User-level keepalives
+#         allow the migration to abort cleanly if there is a problem with the
+#         destination host. For debugging, this can be problematic as
+#         the keepalive may cause the peer to abort prematurely if we are
+#         at a GDB breakpoint, for example.
+#         Enabled by default. (Since 2.x)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle', 'x-rdma-pin-all', 'auto-converge', 'zero-blocks'] }
+  'data': ['xbzrle', 
+           'rdma-pin-all', 
+           'auto-converge', 
+           'zero-blocks',
+           'mc', 
+           'mc-net-disable',
+           'mc-rdma-copy',
+           'rdma-keepalive'
+          ] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [Qemu-devel] [RFC PATCH v2 12/12] mc: activate and use MC if requested
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (10 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing mrhines
@ 2014-02-18  8:50 ` mrhines
  2014-02-18  9:28 ` [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing Li Guang
  12 siblings, 0 replies; 68+ messages in thread
From: mrhines @ 2014-02-18  8:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, EREZH, owasserm, junqing.wang,
	onom, abali, isaku.yamahata, gokul, dbulkow, hinesmr, BIRAN,
	lig.fnst, Michael R. Hines

From: "Michael R. Hines" <mrhines@us.ibm.com>

Once the initial migration has completed, we kick off the migration_thread,
which never dies. Additionally, we register load/save functions for MC
which allow us to inform the destination that we are requesting a
micro-checkpointing session without needing to add additional command-line
switches on the destination side.
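
The handshake itself is just the "mc" section registered below: on the
source, mc_info_save() (patch #8) records whether the mc capability is
set, and on the destination mc_info_load() flips mc_requested so that
mc_process_incoming_checkpoints_if_requested() takes over once the
initial qemu_loadvm_state() completes. A typical invocation, assuming a
destination already listening via the usual -incoming option (host and
port below are illustrative), would be:

    (qemu) migrate_set_capability mc on
    (qemu) migrate -d tcp:dest-host:6262

after which the source enters MIG_STATE_CHECKPOINTING instead of
completing the migration.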

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/migration.h |  2 +-
 include/migration/qemu-file.h |  1 +
 migration.c                   | 33 ++++++++++++++++++++++++++++-----
 vl.c                          |  2 ++
 4 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index f18ff5e..1695e9e 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -46,7 +46,7 @@ struct MigrationState
     int64_t bandwidth_limit;
     size_t bytes_xfer;
     size_t xfer_limit;
-    QemuThread thread;
+    QemuThread *thread;
     QEMUBH *cleanup_bh;
     QEMUFile *file;
 
diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index c50de0d..89e7c19 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -148,6 +148,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd, const char *mode);
+QEMUFile *qemu_fopen_mc(void *opaque, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_get_fd(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
diff --git a/migration.c b/migration.c
index 0ccbeaa..1a6fdbd 100644
--- a/migration.c
+++ b/migration.c
@@ -93,6 +93,9 @@ static void process_incoming_migration_co(void *opaque)
     int ret;
 
     ret = qemu_loadvm_state(f);
+    if (ret >= 0) {
+        mc_process_incoming_checkpoints_if_requested(f);
+    }
     qemu_fclose(f);
     free_xbzrle_decoded_buf();
     if (ret < 0) {
@@ -313,14 +316,17 @@ static void migrate_fd_cleanup(void *opaque)
 {
     MigrationState *s = opaque;
 
-    qemu_bh_delete(s->cleanup_bh);
-    s->cleanup_bh = NULL;
+    if (s->cleanup_bh) {
+        qemu_bh_delete(s->cleanup_bh);
+        s->cleanup_bh = NULL;
+    }
 
     if (s->file) {
         DPRINTF("closing file\n");
         qemu_mutex_unlock_iothread();
-        qemu_thread_join(&s->thread);
+        qemu_thread_join(s->thread);
         qemu_mutex_lock_iothread();
+        g_free(s->thread);
 
         qemu_fclose(s->file);
         s->file = NULL;
@@ -695,11 +701,27 @@ static void *migration_thread(void *opaque)
         s->downtime = end_time - start_time;
         runstate_set(RUN_STATE_POSTMIGRATE);
     } else {
+        if (migrate_use_mc()) {
+            qemu_fflush(s->file);
+            if (migrate_use_mc_net()) {
+                if (mc_enable_buffering() < 0 ||
+                        mc_start_buffer() < 0) {
+                    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_ERROR);
+                }
+            }
+        }
+
         if (old_vm_running) {
             vm_start();
         }
     }
-    qemu_bh_schedule(s->cleanup_bh);
+
+    if (migrate_use_mc() && s->state != MIG_STATE_ERROR) {
+        mc_init_checkpointer(s);
+    } else {
+        qemu_bh_schedule(s->cleanup_bh);
+    }
+
     qemu_mutex_unlock_iothread();
 
     return NULL;
@@ -720,6 +742,7 @@ void migrate_fd_connect(MigrationState *s)
     /* Notify before starting migration thread */
     notifier_list_notify(&migration_state_notifiers, s);
 
-    qemu_thread_create(&s->thread, migration_thread, s,
+    s->thread = g_malloc0(sizeof(*s->thread));
+    qemu_thread_create(s->thread, migration_thread, s,
                        QEMU_THREAD_JOINABLE);
 }
diff --git a/vl.c b/vl.c
index 2fb5b1f..093eb20 100644
--- a/vl.c
+++ b/vl.c
@@ -4145,6 +4145,8 @@ int main(int argc, char **argv, char **envp)
     default_drive(default_sdcard, snapshot, IF_SD, 0, SD_OPTS);
 
     register_savevm_live(NULL, "ram", 0, 4, &savevm_ram_handlers, NULL);
+    register_savevm(NULL, "mc", -1, MC_VERSION, mc_info_save, 
+                                mc_info_load, NULL); 
 
     if (nb_numa_nodes > 0) {
         int i;
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing
  2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
                   ` (11 preceding siblings ...)
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 12/12] mc: activate and use MC if requested mrhines
@ 2014-02-18  9:28 ` Li Guang
  2014-02-19  1:29   ` Michael R. Hines
  12 siblings, 1 reply; 68+ messages in thread
From: Li Guang @ 2014-02-18  9:28 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, Michael R. Hines, gokul, dbulkow,
	pbonzini, abali, isaku.yamahata

Hi, Michael

this patch set will break the normal build (without --enable-mc):

migration.c: In function ‘migrate_rdma_pin_all’:
migration.c:564: error: ‘MIGRATION_CAPABILITY_X_RDMA_PIN_ALL’ undeclared
(first use in this function)
migration.c:564: error: (Each undeclared identifier is reported only once
migration.c:564: error: for each function it appears in.)
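
Presumably migrate_rdma_pin_all() still references the old enum name
after the x-rdma-pin-all -> rdma-pin-all rename in patch #11, so
something like this in migration.c ought to fix it (untested):

    return s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];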

Thanks!
Li Guang

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines"<mrhines@us.ibm.com>
>
> Changes since v1:
>
> 1. Re-based against Juan's improved migration_bitmap performance changes
> 2. Overhauled RDMA support to prepare for better usage of RDMA in
>     other parts of the QEMU code base (such as storage).
> 3. Fix for netlink issues that failed to cleanup the network buffer
>     device for development testing.
>
> Michael R. Hines (12):
>    mc: add documentation for micro-checkpointing
>    mc: timestamp migration_bitmap and KVM logdirty usage
>    mc: introduce a 'checkpointing' status check into the VCPU states
>    mc: support custom page loading and copying
>    rdma: accelerated memcpy() support and better external RDMA user
>      interfaces
>    mc: introduce state machine changes for MC
>    mc: introduce additional QMP statistics for micro-checkpointing
>    mc: core logic
>    mc: configure and makefile support
>    mc: expose tunable parameter for checkpointing frequency
>    mc: introduce new capabilities to control micro-checkpointing
>    mc: activate and use MC if requested
>
>   Makefile.objs                 |    1 +
>   arch_init.c                   |   72 +-
>   configure                     |   45 +
>   cpus.c                        |    9 +-
>   docs/mc.txt                   |  222 ++++
>   hmp-commands.hx               |   16 +-
>   hmp.c                         |   23 +
>   hmp.h                         |    1 +
>   include/migration/migration.h |   70 +-
>   include/migration/qemu-file.h |   55 +-
>   migration-checkpoint.c        | 1565 +++++++++++++++++++++++++
>   migration-rdma.c              | 2605 +++++++++++++++++++++++++++--------------
>   migration.c                   |  156 ++-
>   qapi-schema.json              |   86 +-
>   qemu-file.c                   |   80 +-
>   qmp-commands.hx               |   23 +
>   vl.c                          |    9 +
>   17 files changed, 4097 insertions(+), 941 deletions(-)
>   create mode 100644 docs/mc.txt
>   create mode 100644 migration-checkpoint.c
>
>    


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage mrhines
@ 2014-02-18 10:32   ` Dr. David Alan Gilbert
  2014-02-19  1:42     ` Michael R. Hines
  2014-03-11 21:31   ` Juan Quintela
  1 sibling, 1 reply; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-02-18 10:32 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, lig.fnst, gokul, dbulkow, pbonzini,
	abali, isaku.yamahata, Michael R. Hines

* mrhines@linux.vnet.ibm.com (mrhines@linux.vnet.ibm.com) wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> We also later export these statistics over QMP for better
> monitoring of micro-checkpointing as the workload changes.

<snip>

> @@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>              /* XBZRLE overflow or normal page */
>              if (bytes_sent == -1) {
>                  bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
> -                qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
> -                bytes_sent += TARGET_PAGE_SIZE;
> -                acct_info.norm_pages++;
> +                if (ret != RAM_SAVE_CONTROL_DELAYED) {
> +                    qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
> +                    bytes_sent += TARGET_PAGE_SIZE;
> +                    acct_info.norm_pages++;
> +                }
>              }

Is that last change intended for this patch; it doesn't look
timestamp related.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing mrhines
@ 2014-02-18 12:45   ` Dr. David Alan Gilbert
  2014-02-19  1:40     ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-02-18 12:45 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, lig.fnst, gokul, dbulkow, pbonzini,
	abali, isaku.yamahata, Michael R. Hines

* mrhines@linux.vnet.ibm.com (mrhines@linux.vnet.ibm.com) wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
> Github: git@github.com:hinesmr/qemu.git, 'mc' branch
> 
> NOTE: This is a direct copy of the QEMU wiki page for the convenience
> of the review process. Since this series very much in flux, instead of
> maintaing two copies of documentation in two different formats, this
> documentation will be properly formatted in the future when the review
> process has completed.

It seems to be picking up some truncations as well.

> +The Micro-Checkpointing Process
> +Basic Algorithm
> +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
> +
> +1. After N milliseconds, stop the VM.
> +2. Generate an MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
> +3. Resume the VM immediately so that it can make forward progress.
> +4. Transmit the checkpoint to the destination.
> +5. Repeat
> +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.

Later you talk about the memory allocation and how you grow the memory as
needed to fit the checkpoint; have you tried going the other way and
triggering the checkpoints sooner if they're taking too much memory?
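
(E.g. something like the converse of your slab 'strikes' heuristic;
purely a sketch, with mc_memory_limit being a made-up knob:

    if (mc.nb_slabs * MC_SLAB_BUFFER_SIZE > mc_memory_limit) {
        wait_time = 0;   /* take the next checkpoint immediately */
    }

placed in mc_thread() before the g_usleep().)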

> +1. MC over TCP/IP: Once the socket connection breaks, we assume
> failure. This happens very early in the loss of the latest MC, not only
> because a very large number of bytes is typically being sequenced in a
> TCP stream but perhaps also because of the timeout on acknowledgement
> of the receipt of a commit message by the destination.
> +
> +2. MC over RDMA: Since Infiniband does not provide any underlying
> timeout mechanisms, this implementation enhances QEMU's RDMA migration
> protocol to include a simple keep-alive. Upon the loss of multiple
> keep-alive messages, the sender is deemed to have failed.
> +
> +In both cases, whether due to a failed TCP socket connection or a lost RDMA keep-alive group, either the sender or the receiver can be deemed to have failed.
> +
> +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
> +
> +If the destination is deemed to be lost, we perform the same action
> as a live migration: resume the sender normally and wait for management
> software to make a policy decision about whether or not to re-protect
> the VM, which may involve a third party identifying a new destination
> host to use as a backup for the VM.

In this world what is making the decision about whether the sender/destination
should win - how do you avoid a split brain situation where both
VMs are running but the only thing that failed is the comms between them?
Is there any guarantee that you'll have received knowledge of the comms
failure before you pull the plug out and enable the corked packets to be
sent on the sender side?

<snip>

> +RDMA is used for two different reasons:
> +
> +1. Checkpoint generation (RDMA-based memcpy):
> +2. Checkpoint transmission
> +Checkpoint generation must be done while the VM is paused. In the
> worst case, the size of the checkpoint can be equal in size to the amount
> of memory in total use by the VM. In order to resume VM execution as
> fast as possible, the checkpoint is copied consistently locally into
> a staging area before transmission. A standard memcpy() of potentially
> such a large amount of memory not only gets no use out of the CPU cache
> but also potentially clogs up the CPU pipeline which would otherwise
> be usable by other neighboring VMs on the same physical node that could be
> scheduled for execution. To minimize the effect on neighbor VMs, we use
> RDMA to perform a "local" memcpy(), bypassing the host processor. On
> more recent processors, a 'beefy' enough memory bus architecture can
> move memory just as fast (sometimes faster) as a pure-software CPU-only
> optimized memcpy() from libc. However, on older computers, this feature
> only gives you the benefit of lower CPU-utilization at the expense of

Isn't there a generic kernel DMA ABI for doing this type of thing? (I
think there was at one point; people have suggested things like using
graphics cards to do it, but I don't know if it ever happened.)
The other question is, do you always need to copy - what about something
like COWing the pages?

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC mrhines
@ 2014-02-19  1:00   ` Li Guang
  2014-02-19  2:14     ` Michael R. Hines
                       ` (2 more replies)
  2014-03-11 21:57   ` Juan Quintela
  1 sibling, 3 replies; 68+ messages in thread
From: Li Guang @ 2014-02-19  1:00 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, pbonzini, quintela, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, Michael R. Hines, gokul, dbulkow,
	hinesmr, BIRAN, isaku.yamahata

Hi,

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines"<mrhines@us.ibm.com>
>
> This patch sets up the initial changes to the migration state
> machine and prototypes to be used by the checkpointing code
> to interact with the state machine so that we can later handle
> failure and recovery scenarios.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>   include/migration/migration.h |  2 ++
>   migration.c                   | 37 +++++++++++++++++++++----------------
>   3 files changed, 47 insertions(+), 21 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index db75120..e9d4d9e 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>       migration_end();
>   }
>
> -static void reset_ram_globals(void)
> +static void reset_ram_globals(bool reset_bulk_stage)
>   {
>       last_seen_block = NULL;
>       last_sent_block = NULL;
>       last_offset = 0;
>       last_version = ram_list.version;
> -    ram_bulk_stage = true;
> +    ram_bulk_stage = reset_bulk_stage;
>   }
>
>    

there is a chance that ram_save_block will never break out of its while
loop if last_seen_block is reset for MC when there are no dirty pages
to be migrated.

Thanks!

>   #define MAX_WAIT 50 /* ms, half buffered_file limit */
> @@ -674,6 +674,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>       RAMBlock *block;
>       int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
>
> +    /*
> +     * RAM stays open during micro-checkpointing for the next transaction.
> +     */
> +    if (migration_is_mc(migrate_get_current())) {
> +        qemu_mutex_lock_ramlist();
> +        reset_ram_globals(false);
> +        goto skip_setup;
> +    }
> +
>       migration_bitmap = bitmap_new(ram_pages);
>       bitmap_set(migration_bitmap, 0, ram_pages);
>       migration_dirty_pages = ram_pages;
> @@ -710,12 +719,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>       qemu_mutex_lock_iothread();
>       qemu_mutex_lock_ramlist();
>       bytes_transferred = 0;
> -    reset_ram_globals();
> +    reset_ram_globals(true);
>
>       memory_global_dirty_log_start();
>       migration_bitmap_sync();
>       qemu_mutex_unlock_iothread();
>
> +skip_setup:
> +
>       qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
>
>       QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> @@ -744,7 +755,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>       qemu_mutex_lock_ramlist();
>
>       if (ram_list.version != last_version) {
> -        reset_ram_globals();
> +        reset_ram_globals(true);
>       }
>
>       ram_control_before_iterate(f, RAM_CONTROL_ROUND);
> @@ -825,7 +836,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>       }
>
>       ram_control_after_iterate(f, RAM_CONTROL_FINISH);
> -    migration_end();
> +
> +    /*
> +     * Only cleanup at the end of normal migrations
> +     * or if the MC destination failed and we got an error.
> +     * Otherwise, we are (or will soon be) in MIG_STATE_CHECKPOINTING.
> +     */
> +    if (!migrate_use_mc() || migration_has_failed(migrate_get_current())) {
> +        migration_end();
> +    }
>
>       qemu_mutex_unlock_ramlist();
>       qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index a7c54fe..e876a2c 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -101,7 +101,9 @@ int migrate_fd_close(MigrationState *s);
>
>   void add_migration_state_change_notifier(Notifier *notify);
>   void remove_migration_state_change_notifier(Notifier *notify);
> +bool migration_is_active(MigrationState *);
>   bool migration_in_setup(MigrationState *);
> +bool migration_is_mc(MigrationState *s);
>   bool migration_has_finished(MigrationState *);
>   bool migration_has_failed(MigrationState *);
>   MigrationState *migrate_get_current(void);
> diff --git a/migration.c b/migration.c
> index 25add6f..f42dae4 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -36,16 +36,6 @@
>       do { } while (0)
>   #endif
>
> -enum {
> -    MIG_STATE_ERROR = -1,
> -    MIG_STATE_NONE,
> -    MIG_STATE_SETUP,
> -    MIG_STATE_CANCELLING,
> -    MIG_STATE_CANCELLED,
> -    MIG_STATE_ACTIVE,
> -    MIG_STATE_COMPLETED,
> -};
> -
>   #define MAX_THROTTLE  (32 << 20)      /* Migration speed throttling */
>
>   /* Amount of time to allocate to each "chunk" of bandwidth-throttled
> @@ -273,7 +263,7 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>       MigrationState *s = migrate_get_current();
>       MigrationCapabilityStatusList *cap;
>
> -    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
> +    if (migration_is_active(s)) {
>           error_set(errp, QERR_MIGRATION_ACTIVE);
>           return;
>       }
> @@ -285,7 +275,13 @@ void qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>
>   /* shared migration helpers */
>
> -static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> +bool migration_is_active(MigrationState *s)
> +{
> +    return (s->state == MIG_STATE_ACTIVE) || s->state == MIG_STATE_SETUP
> +            || s->state == MIG_STATE_CHECKPOINTING;
> +}
> +
> +void migrate_set_state(MigrationState *s, int old_state, int new_state)
>   {
>       if (atomic_cmpxchg(&s->state, old_state, new_state) == new_state) {
>           trace_migrate_set_state(new_state);
> @@ -309,7 +305,7 @@ static void migrate_fd_cleanup(void *opaque)
>           s->file = NULL;
>       }
>
> -    assert(s->state != MIG_STATE_ACTIVE);
> +    assert(!migration_is_active(s));
>
>       if (s->state != MIG_STATE_COMPLETED) {
>           qemu_savevm_state_cancel();
> @@ -356,7 +352,12 @@ void remove_migration_state_change_notifier(Notifier *notify)
>
>   bool migration_in_setup(MigrationState *s)
>   {
> -    return s->state == MIG_STATE_SETUP;
> +        return s->state == MIG_STATE_SETUP;
> +}
> +
> +bool migration_is_mc(MigrationState *s)
> +{
> +        return s->state == MIG_STATE_CHECKPOINTING;
>   }
>
>   bool migration_has_finished(MigrationState *s)
> @@ -419,7 +420,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>       params.shared = has_inc && inc;
>
>       if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP ||
> -        s->state == MIG_STATE_CANCELLING) {
> +        s->state == MIG_STATE_CANCELLING
> +         || s->state == MIG_STATE_CHECKPOINTING) {
>           error_set(errp, QERR_MIGRATION_ACTIVE);
>           return;
>       }
> @@ -624,7 +626,10 @@ static void *migration_thread(void *opaque)
>                   }
>
>                   if (!qemu_file_get_error(s->file)) {
> -                    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
> +                    if (!migrate_use_mc()) {
> +                        migrate_set_state(s,
> +                            MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
> +                    }
>                       break;
>                   }
>               }
>    

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic mrhines
@ 2014-02-19  1:07   ` Li Guang
  2014-02-19  2:16     ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Li Guang @ 2014-02-19  1:07 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, Michael R. Hines, gokul, dbulkow,
	pbonzini, abali, isaku.yamahata

Hi,
mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines"<mrhines@us.ibm.com>
>
> This implements the core logic,
> all described in the first patch (docs/mc.txt).
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>   migration-checkpoint.c | 1565 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 1565 insertions(+)
>   create mode 100644 migration-checkpoint.c
>
>
>    
[big snip] ...

> +
> +/*
> + * Stop the VM, generate the micro checkpoint,
> + * but save the dirty memory into staging memory so that
> + * we can re-activate the VM as soon as possible.
> + */
> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
> +{
> +    MCCopyset *copyset;
> +    int idx, ret = 0;
> +    uint64_t start, stop, copies = 0;
> +    int64_t start_time;
> +
> +    mc->total_copies = 0;
> +    qemu_mutex_lock_iothread();
> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /*
> +     * If buffering is enabled, insert a Qdisc plug here
> +     * to hold packets for the *next* MC, (not this one,
> +     * the packets for this one have already been plugged
> +     * and will be released after the MC has been transmitted.
> +     */
> +    mc_start_buffer();
>    

actually, I have a special request:
if QEMU is started without a netdev,
then don't bother with the Qdisc network buffering. :-)
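
Something like this at the top of mc_start_buffer() would do (just a
sketch; qemu_has_net_clients() is a made-up helper, not an existing API):

    /* Skip the Qdisc plug entirely when there is nothing to buffer. */
    if (!qemu_has_net_clients()) {
        return 0;
    }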

Thanks!

> +
> +    qemu_savevm_state_begin(mc->staging, &s->params);
> +    ret = qemu_file_get_error(s->file);
> +
> +    if (ret < 0) {
> +        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +    }
> +
> +    qemu_savevm_state_complete(mc->staging);
> +
> +    ret = qemu_file_get_error(s->file);
> +    if (ret < 0) {
> +        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +        goto out;
> +    }
> +
> +    /*
> +     * The copied memory gets appended to the end of the snapshot, so let's
> +     * remember where its going to go first and start a new slab.
> +     */
> +    mc_slab_next(mc, mc->curr_slab);
> +    mc->start_copyset = mc->curr_slab->idx;
> +
> +    start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /*
> +     * Now perform the actual copy of memory into the tail end of the slab list.
> +     */
> +    QTAILQ_FOREACH(copyset, &mc->copy_head, node) {
> +        if (!copyset->nb_copies) {
> +            break;
> +        }
> +
> +        copies += copyset->nb_copies;
> +
> +        DDDPRINTF("copyset %d copies: %" PRIu64 " total: %" PRIu64 "\n",
> +                copyset->idx, copyset->nb_copies, copies);
> +
> +        for (idx = 0; idx < copyset->nb_copies; idx++) {
> +            uint8_t *addr;
> +            long size;
> +            mc->copy = &copyset->copies[idx];
> +            addr = (uint8_t *) (mc->copy->host_addr + mc->copy->offset);
> +            size = mc_put_buffer(mc, addr, mc->copy->offset, mc->copy->size);
> +            if (size != mc->copy->size) {
> +                fprintf(stderr, "Failure to initiate copyset %d index %d\n",
> +                        copyset->idx, idx);
> +                migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +                vm_start();
> +                goto out;
> +            }
> +
> +            DDDPRINTF("Success copyset %d index %d\n", copyset->idx, idx);
> +        }
> +
> +        copyset->nb_copies = 0;
> +    }
> +
> +    s->ram_copy_time = (qemu_clock_get_ms(QEMU_CLOCK_REALTIME) - start_time);
> +
> +    mc->copy = NULL;
> +    ram_control_before_iterate(mc->file, RAM_CONTROL_FLUSH);
> +    assert(mc->total_copies == copies);
> +
> +    stop = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +    /*
> +     * MC is safe in staging area. Let the VM go.
> +     */
> +    vm_start();
> +    qemu_fflush(mc->staging);
> +
> +    s->downtime = stop - start;
> +out:
> +    qemu_mutex_unlock_iothread();
> +    return ret;
> +}
> +
> +/*
> + * Synchronously send a micro-checkpointing command
> + */
> +static int mc_send(QEMUFile *f, uint64_t request)
> +{
> +    int ret = 0;
> +
> +    qemu_put_be64(f, request);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        fprintf(stderr, "transaction: send error while sending %" PRIu64 ", "
> +                "bailing: %s\n", request, strerror(-ret));
> +    } else {
> +        DDPRINTF("transaction: sent: %s (%" PRIu64 ")\n",
> +            mc_desc[request], request);
> +    }
> +
> +    qemu_fflush(f);
> +
> +    return ret;
> +}
> +
> +/*
> + * Synchronously receive a micro-checkpointing command
> + */
> +static int mc_recv(QEMUFile *f, uint64_t request, uint64_t *action)
> +{
> +    int ret = 0;
> +    uint64_t got;
> +
> +    got = qemu_get_be64(f);
> +
> +    ret = qemu_file_get_error(f);
> +    if (ret) {
> +        fprintf(stderr, "transaction: recv error while expecting %s (%"
> +                PRIu64 "), bailing: %s\n", mc_desc[request],
> +                request, strerror(-ret));
> +    } else {
> +        if ((request != MC_TRANSACTION_ANY) && request != got) {
> +            fprintf(stderr, "transaction: was expecting %s (%" PRIu64
> +                    ") but got %" PRIu64 " instead\n",
> +                    mc_desc[request], request, got);
> +            ret = -EINVAL;
> +        } else {
> +            DDPRINTF("transaction: recv: %s (%" PRIu64 ")\n",
> +                     mc_desc[got], got);
> +            ret = 0;
> +            if (action) {
> +                *action = got;
> +            }
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +static MCSlab *mc_slab_start(MCParams *mc)
> +{
> +    if (mc->nb_slabs > 2) {
> +        if (mc->slab_strikes >= max_strikes) {
> +            uint64_t nb_slabs_to_free = MAX(1, (((mc->nb_slabs - 1) / 2)));
> +
> +            DPRINTF("MC has reached max strikes. Will free %"
> +                    PRIu64 " / %" PRIu64 " slabs max %d, "
> +                    "checkpoints %" PRIu64 "\n",
> +                    nb_slabs_to_free, mc->nb_slabs,
> +                    max_strikes, mc->checkpoints);
> +
> +            mc->slab_strikes = 0;
> +
> +            while (nb_slabs_to_free) {
> +                MCSlab *slab = QTAILQ_LAST(&mc->slab_head, shead);
> +                ram_control_remove(mc->file, (uint64_t) slab->buf);
> +                QTAILQ_REMOVE(&mc->slab_head, slab, node);
> +                g_free(slab);
> +                nb_slabs_to_free--;
> +                mc->nb_slabs--;
> +            }
> +
> +            goto skip;
> +        } else if (((mc->slab_total <=
> +                    ((mc->nb_slabs - 1) * MC_SLAB_BUFFER_SIZE)))) {
> +            mc->slab_strikes++;
> +            DDPRINTF("MC has strike %d slabs %" PRIu64 " max %d\n",
> +                     mc->slab_strikes, mc->nb_slabs, max_strikes);
> +            goto skip;
> +        }
> +    }
> +
> +    if (mc->slab_strikes) {
> +        DDPRINTF("MC used all slabs. Resetting strikes to zero.\n");
> +        mc->slab_strikes = 0;
> +    }
> +skip:
> +
> +    mc->used_slabs = 1;
> +    mc->slab_total = 0;
> +    mc->curr_slab = QTAILQ_FIRST(&mc->slab_head);
> +    SLAB_RESET(mc->curr_slab);
> +
> +    return mc->curr_slab;
> +}
> +
> +static MCCopyset *mc_copy_start(MCParams *mc)
> +{
> +    if (mc->nb_copysets >= 2) {
> +        if (mc->copy_strikes >= max_strikes) {
> +            int nb_copies_to_free = MAX(1, (((mc->nb_copysets - 1) / 2)));
> +
> +            DPRINTF("MC has reached max strikes. Will free %d / %d copies max %d\n",
> +                    nb_copies_to_free, mc->nb_copysets, max_strikes);
> +
> +            mc->copy_strikes = 0;
> +
> +            while (nb_copies_to_free) {
> +                MCCopyset *copyset = QTAILQ_LAST(&mc->copy_head, chead);
> +                QTAILQ_REMOVE(&mc->copy_head, copyset, node);
> +                g_free(copyset);
> +                nb_copies_to_free--;
> +                mc->nb_copysets--;
> +            }
> +
> +            goto skip;
> +        } else if (((mc->total_copies <=
> +                    ((mc->nb_copysets - 1) * MC_MAX_SLAB_COPY_DESCRIPTORS)))) {
> +            mc->copy_strikes++;
> +            DDPRINTF("MC has strike %d copies %d max %d\n",
> +                     mc->copy_strikes, mc->nb_copysets, max_strikes);
> +            goto skip;
> +        }
> +    }
> +
> +    if (mc->copy_strikes) {
> +        DDPRINTF("MC used all copies. Resetting strikes to zero.\n");
> +        mc->copy_strikes = 0;
> +    }
> +skip:
> +
> +    mc->total_copies = 0;
> +    mc->curr_copyset = QTAILQ_FIRST(&mc->copy_head);
> +    mc->curr_copyset->nb_copies = 0;
> +
> +    return mc->curr_copyset;
> +}
> +
> +/*
> + * Main MC loop. Stop the VM, dump the dirty memory
> + * into staging, restart the VM, transmit the MC,
> + * and then sleep for some milliseconds before
> + * starting the next MC.
> + */
> +static void *mc_thread(void *opaque)
> +{
> +    MigrationState *s = opaque;
> +    MCParams mc = { .file = s->file };
> +    MCSlab *slab;
> +    int64_t initial_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +    int ret = 0, fd = qemu_get_fd(s->file), x;
> +    QEMUFile *mc_control, *mc_staging = NULL;
> +    uint64_t wait_time = 0;
> +
> +    if (!(mc_control = qemu_fopen_socket(fd, "rb"))) {
> +        fprintf(stderr, "Failed to setup read MC control\n");
> +        goto err;
> +    }
> +
> +    if (!(mc_staging = qemu_fopen_mc(&mc, "wb"))) {
> +        fprintf(stderr, "Failed to setup MC staging area\n");
> +        goto err;
> +    }
> +
> +    mc.staging = mc_staging;
> +
> +    qemu_set_block(fd);
> +    socket_set_nodelay(fd);
> +
> +    s->checkpoints = 0;
> +
> +    while (s->state == MIG_STATE_CHECKPOINTING) {
> +        int64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +        int64_t start_time, xmit_start, end_time;
> +        bool commit_sent = false;
> +        int nb_slab = 0;
> +        (void)nb_slab;
> +
> +        slab = mc_slab_start(&mc);
> +        mc_copy_start(&mc);
> +        acct_clear();
> +        start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +        if (capture_checkpoint(&mc, s) < 0) {
> +            break;
> +        }
> +
> +        xmit_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
> +        if ((ret = mc_send(s->file, MC_TRANSACTION_START)) < 0) {
> +            fprintf(stderr, "transaction start failed\n");
> +            break;
> +        }
> +
> +        DDPRINTF("Sending checkpoint size %" PRId64
> +                 " copyset start: %" PRIu64 " nb slab %" PRIu64
> +                 " used slabs %" PRIu64 "\n",
> +                 mc.slab_total, mc.start_copyset, mc.nb_slabs, mc.used_slabs);
> +
> +        mc.curr_slab = QTAILQ_FIRST(&mc.slab_head);
> +
> +        qemu_put_be64(s->file, mc.slab_total);
> +        qemu_put_be64(s->file, mc.start_copyset);
> +        qemu_put_be64(s->file, mc.used_slabs);
> +
> +        qemu_fflush(s->file);
> +
> +        DDPRINTF("Transaction commit\n");
> +
> +        /*
> +         * The MC is safe, and VM is running again.
> +         * Start a transaction and send it.
> +         */
> +        ram_control_before_iterate(s->file, RAM_CONTROL_ROUND);
> +
> +        slab = QTAILQ_FIRST(&mc.slab_head);
> +
> +        for (x = 0; x < mc.used_slabs; x++) {
> +            DDPRINTF("Attempting write to slab #%d: %p"
> +                    " total size: %" PRId64 " / %" PRIu64 "\n",
> +                    nb_slab++, slab->buf, slab->size, MC_SLAB_BUFFER_SIZE);
> +
> +            ret = ram_control_save_page(s->file, (uint64_t) slab->buf,
> +                                        NULL, 0, slab->size, NULL);
> +
> +            if (ret == RAM_SAVE_CONTROL_NOT_SUPP) {
> +                if (!commit_sent) {
> +                    if ((ret = mc_send(s->file, MC_TRANSACTION_COMMIT)) < 0) {
> +                        fprintf(stderr, "transaction commit failed\n");
> +                        break;
> +                    }
> +                    commit_sent = true;
> +                }
> +
> +                qemu_put_be64(s->file, slab->size);
> +                qemu_put_buffer_async(s->file, slab->buf, slab->size);
> +            } else if ((ret < 0) && (ret != RAM_SAVE_CONTROL_DELAYED)) {
> +                fprintf(stderr, "failed 1, skipping send\n");
> +                goto err;
> +            }
> +
> +            if (qemu_file_get_error(s->file)) {
> +                fprintf(stderr, "failed 2, skipping send\n");
> +                goto err;
> +            }
> +
> +            DDPRINTF("Sent %" PRId64 " all %ld\n", slab->size, mc.slab_total);
> +
> +            slab = QTAILQ_NEXT(slab, node);
> +        }
> +
> +        if (!commit_sent) {
> +            ram_control_after_iterate(s->file, RAM_CONTROL_ROUND);
> +            slab = QTAILQ_FIRST(&mc.slab_head);
> +
> +            for (x = 0; x<  mc.used_slabs; x++) {
> +                qemu_put_be64(s->file, slab->size);
> +                slab = QTAILQ_NEXT(slab, node);
> +            }
> +        }
> +
> +        qemu_fflush(s->file);
> +
> +        if (commit_sent) {
> +            DDPRINTF("Waiting for commit ACK\n");
> +
> +            if ((ret = mc_recv(mc_control, MC_TRANSACTION_ACK, NULL)) < 0) {
> +                goto err;
> +            }
> +        }
> +
> +        ret = qemu_file_get_error(s->file);
> +        if (ret) {
> +            fprintf(stderr, "Error sending checkpoint: %d\n", ret);
> +            goto err;
> +        }
> +
> +        DDPRINTF("Memory transfer complete.\n");
> +
> +        /*
> +         * The MC is safe on the other side now,
> +         * go along our merry way and release the network
> +         * packets from the buffer if enabled.
> +         */
> +        mc_flush_oldest_buffer();
> +
> +        end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +        s->total_time = end_time - start_time;
> +        s->xmit_time = end_time - xmit_start;
> +        s->bitmap_time = norm_mig_bitmap_time();
> +        s->log_dirty_time = norm_mig_log_dirty_time();
> +        s->mbps = MBPS(mc.slab_total, s->xmit_time);
> +        s->copy_mbps = MBPS(mc.slab_total, s->ram_copy_time);
> +        s->bytes_xfer = mc.slab_total;
> +        s->checkpoints = mc.checkpoints++;
> +
> +        wait_time = (s->downtime <= freq_ms) ? (freq_ms - s->downtime) : 0;
> +
> +        if (current_time >= initial_time + 1000) {
> +            DPRINTF("bytes %" PRIu64 " xmit_mbps %0.1f xmit_time %" PRId64
> +                    " downtime %" PRIu64 " sync_time %" PRId64
> +                    " logdirty_time %" PRId64 " ram_copy_time %" PRId64
> +                    " copy_mbps %0.1f wait time %" PRIu64
> +                    " checkpoints %" PRId64 "\n",
> +                    s->bytes_xfer,
> +                    s->mbps,
> +                    s->xmit_time,
> +                    s->downtime,
> +                    s->bitmap_time,
> +                    s->log_dirty_time,
> +                    s->ram_copy_time,
> +                    s->copy_mbps,
> +                    wait_time,
> +                    s->checkpoints);
> +            initial_time = current_time;
> +        }
> +
> +        /*
> +         * Checkpoint frequency in microseconds.
> +         *
> +         * Sometimes, when checkpoints are very large,
> +         * all of the wait time was dominated by the
> +         * time taken to copy the checkpoint into the staging area,
> +         * in which case wait_time, will probably be zero and we
> +         * will end up diving right back into the next checkpoint
> +         * as soon as the previous transmission completed.
> +         */
> +        if (wait_time) {
> +            g_usleep(wait_time * 1000);
> +        }
> +    }
> +
> +    goto out;
> +
> +err:
> +    /*
> +     * TODO: Possible split-brain scenario:
> +     * Normally, this should never be reached unless there was a
> +     * connection error or network partition - in which case
> +     * only the management software can resume the VM safely
> +     * when it knows the exact state of the MC destination.
> +     *
> + * We need management to poll the source and destination to determine
> +     * if the destination has already taken control. If not, then
> +     * we need to resume the source.
> +     *
> +     * If there was a connection error during checkpoint *transmission*
> +     * then the destination VM will likely have already resumed,
> +     * in which case we need to stop the current VM from running
> +     * and throw away any buffered packets.
> +     *
> +     * Verify that "disable_buffering" below does not release any traffic.
> +     */
> +    migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
> +out:
> +    if (mc_staging) {
> +        qemu_fclose(mc_staging);
> +    }
> +
> +    if (mc_control) {
> +        qemu_fclose(mc_control);
> +    }
> +
> +    mc_disable_buffering();
> +
> +    qemu_mutex_lock_iothread();
> +
> +    if (s->state != MIG_STATE_ERROR) {
> +        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_COMPLETED);
> +    }
> +
> +    qemu_bh_schedule(s->cleanup_bh);
> +    qemu_mutex_unlock_iothread();
> +
> +    return NULL;
> +}
> +
> +/*
> + * Get the next copyset in the list. If there is none, then make one.
> + */
> +static MCCopyset *mc_copy_next(MCParams *mc, MCCopyset *copyset)
> +{
> +    if (!QTAILQ_NEXT(copyset, node)) {
> +        int idx = mc->nb_copysets++;
> +        DDPRINTF("Extending copysets by one: %d sets total, "
> +                 "%" PRIu64 " MB\n", mc->nb_copysets,
> +                 mc->nb_copysets * sizeof(MCCopyset) / 1024UL / 1024UL);
> +        mc->curr_copyset = g_malloc(sizeof(MCCopyset));
> +        mc->curr_copyset->idx = idx;
> +        QTAILQ_INSERT_TAIL(&mc->copy_head, mc->curr_copyset, node);
> +        copyset = mc->curr_copyset;
> +    } else {
> +        copyset = QTAILQ_NEXT(copyset, node);
> +    }
> +
> +    mc->curr_copyset = copyset;
> +    copyset->nb_copies = 0;
> +
> +    return copyset;
> +}
> +
> +void mc_process_incoming_checkpoints_if_requested(QEMUFile *f)
> +{
> +    MCParams mc = { .file = f };
> +    MCSlab *slab;
> +    int fd = qemu_get_fd(f);
> +    QEMUFile *mc_control, *mc_staging;
> +    uint64_t checkpoint_size, action;
> +    uint64_t slabs;
> +    int got, x, ret, received = 0;
> +    bool checkpoint_received;
> +
> +    CALC_MAX_STRIKES();
> +
> +    if (!mc_requested) {
> +        DPRINTF("Source has not requested MC. Returning.\n");
> +        return;
> +    }
> +
> +    if (!(mc_control = qemu_fopen_socket(fd, "wb"))) {
> +        fprintf(stderr, "Could not make incoming MC control channel\n");
> +        goto rollback;
> +    }
> +
> +    if (!(mc_staging = qemu_fopen_mc(&mc, "rb"))) {
> +        fprintf(stderr, "Could not make outgoing MC staging area\n");
> +        goto rollback;
> +    }
> +
> +    //qemu_set_block(fd);
> +    socket_set_nodelay(fd);
> +
> +    while (true) {
> +        checkpoint_received = false;
> +        ret = mc_recv(f, MC_TRANSACTION_ANY, &action);
> +        if (ret < 0) {
> +            goto rollback;
> +        }
> +
> +        switch (action) {
> +        case MC_TRANSACTION_START:
> +            checkpoint_size = qemu_get_be64(f);
> +            mc.start_copyset = qemu_get_be64(f);
> +            slabs = qemu_get_be64(f);
> +
> +            DDPRINTF("Transaction start: size %" PRIu64
> +                     " copyset start: %" PRIu64 " slabs %" PRIu64 "\n",
> +                     checkpoint_size, mc.start_copyset, slabs);
> +
> +            assert(checkpoint_size);
> +            break;
> +        case MC_TRANSACTION_COMMIT: /* tcp */
> +            slab = mc_slab_start(&mc);
> +            received = 0;
> +
> +            while (received < checkpoint_size) {
> +                int total = 0;
> +                slab->size = qemu_get_be64(f);
> +
> +                DDPRINTF("Expecting size: %" PRIu64 "\n", slab->size);
> +
> +                while (total != slab->size) {
> +                    got = qemu_get_buffer(f, slab->buf + total, slab->size - total);
> +                    if (got <= 0) {
> +                        fprintf(stderr, "Error pre-filling checkpoint: %d\n", got);
> +                        goto rollback;
> +                    }
> +                    DDPRINTF("Received %d slab %d / %ld received %d total %"
> +                             PRIu64 "\n", got, total, slab->size,
> +                             received, checkpoint_size);
> +                    received += got;
> +                    total += got;
> +                }
> +
> +                if (received != checkpoint_size) {
> +                    slab = mc_slab_next(&mc, slab);
> +                }
> +            }
> +
> +            DDPRINTF("Acknowledging successful commit\n");
> +
> +            if (mc_send(mc_control, MC_TRANSACTION_ACK) < 0) {
> +                goto rollback;
> +            }
> +
> +            checkpoint_received = true;
> +            break;
> +        case RAM_SAVE_FLAG_HOOK: /* rdma */
> +            /*
> +             * Must be RDMA registration handling. Preallocate
> +             * the slabs (if not already done in a previous checkpoint)
> +             * before allowing RDMA to register them.
> +             */
> +            slab = mc_slab_start(&mc);
> +
> +            DDPRINTF("Pre-populating slabs %" PRIu64 "...\n", slabs);
> +
> +            for (x = 1; x < slabs; x++) {
> +                slab = mc_slab_next(&mc, slab);
> +            }
> +
> +            ram_control_load_hook(f, action);
> +
> +            DDPRINTF("Hook complete.\n");
> +
> +            slab = QTAILQ_FIRST(&mc.slab_head);
> +
> +            for (x = 0; x < slabs; x++) {
> +                slab->size = qemu_get_be64(f);
> +                slab = QTAILQ_NEXT(slab, node);
> +            }
> +
> +            checkpoint_received = true;
> +            break;
> +        default:
> +            fprintf(stderr, "Unknown MC action: %" PRIu64 "\n", action);
> +            goto rollback;
> +        }
> +
> +        if (checkpoint_received) {
> +            mc.curr_slab = QTAILQ_FIRST(&mc.slab_head);
> +            mc.slab_total = checkpoint_size;
> +
> +            DDPRINTF("Committed Loading MC state \n");
> +
> +            mc_copy_start(&mc);
> +
> +            if (qemu_loadvm_state(mc_staging) < 0) {
> +                fprintf(stderr, "loadvm transaction failed\n");
> +                /*
> +                 * This is fatal. No rollback possible because we have potentially
> +                 * applied only a subset of the checkpoint to main memory, potentially
> +                 * leaving the VM in an inconsistent state.
> +                 */
> +                goto err;
> +            }
> +
> +            mc.slab_total = checkpoint_size;
> +
> +            DDPRINTF("Transaction complete.\n");
> +            mc.checkpoints++;
> +        }
> +    }
> +
> +rollback:
> +    fprintf(stderr, "MC: checkpointing stopped. Recovering VM\n");
> +    goto out;
> +err:
> +    fprintf(stderr, "Micro Checkpointing Protocol Failed\n");
> +    exit(1);
> +out:
> +    if (mc_staging) {
> +        qemu_fclose(mc_staging);
> +    }
> +
> +    if (mc_control) {
> +        qemu_fclose(mc_control);
> +    }
> +}
> +
> +static int mc_get_buffer_internal(void *opaque, uint8_t *buf, int64_t pos,
> +                                  int size, MCSlab **curr_slab, uint64_t end_idx)
> +{
> +    uint64_t len = size;
> +    uint8_t *data = (uint8_t *) buf;
> +    MCSlab *slab = *curr_slab;
> +    MCParams *mc = opaque;
> +
> +    assert(slab);
> +
> +    DDDPRINTF("got request for %d bytes %p %p. idx %d\n",
> +              size, slab, QTAILQ_FIRST(&mc->slab_head), slab->idx);
> +
> +    while (len && slab) {
> +        uint64_t get = MIN(slab->size - slab->read, len);
> +
> +        memcpy(data, slab->buf + slab->read, get);
> +
> +        data           += get;
> +        slab->read     += get;
> +        len            -= get;
> +        mc->slab_total -= get;
> +
> +        DDDPRINTF("got: %" PRIu64 " read: %" PRIu64
> +                 " len %" PRIu64 " slab_total %" PRIu64
> +                 " size %" PRIu64 " addr: %p slab %d"
> +                 " requested %d\n",
> +                 get, slab->read, len, mc->slab_total,
> +                 slab->size, slab->buf, slab->idx, size);
> +
> +        if (len) {
> +            if (slab->idx == end_idx) {
> +                break;
> +            }
> +
> +            slab = QTAILQ_NEXT(slab, node);
> +        }
> +    }
> +
> +    *curr_slab = slab;
> +    DDDPRINTF("Returning %" PRIu64 " / %d bytes\n", size - len, size);
> +
> +    return size - len;
> +}
> +static int mc_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
> +{
> +    MCParams *mc = opaque;
> +
> +    return mc_get_buffer_internal(mc, buf, pos, size, &mc->curr_slab,
> +                                  mc->start_copyset - 1);
> +}
> +
> +static int mc_load_page(QEMUFile *f, void *opaque, void *host_addr, long size)
> +{
> +    MCParams *mc = opaque;
> +
> +    DDDPRINTF("Loading page into %p of size %" PRIu64 "\n", host_addr, size);
> +
> +    return mc_get_buffer_internal(mc, host_addr, 0, size, &mc->mem_slab,
> +                                  mc->nb_slabs - 1);
> +}
> +
> +/*
> + * Provide QEMUFile with an *local* RDMA-based way to do memcpy().
> + * This lowers cache pollution and allows the CPU pipeline to
> + * remain free for regular use by VMs (as well as by neighbors).
> + *
> + * In a future implementation, we may attempt to perform this
> + * copy *without* stopping the source VM - if the data shows
> + * that it can be done effectively.
> + */
> +static int mc_save_page(QEMUFile *f, void *opaque,
> +                           ram_addr_t block_offset,
> +                           uint8_t *host_addr,
> +                           ram_addr_t offset,
> +                           long size, int *bytes_sent)
> +{
> +    MCParams *mc = opaque;
> +    MCCopyset *copyset = mc->curr_copyset;
> +    MCCopy *c;
> +
> +    if (copyset->nb_copies >= MC_MAX_SLAB_COPY_DESCRIPTORS) {
> +        copyset = mc_copy_next(mc, copyset);
> +    }
> +
> +    c = &copyset->copies[copyset->nb_copies++];
> +    c->ramblock_offset = (uint64_t) block_offset;
> +    c->host_addr = (uint64_t) host_addr;
> +    c->offset = (uint64_t) offset;
> +    c->size = (uint64_t) size;
> +    mc->total_copies++;
> +
> +    return RAM_SAVE_CONTROL_DELAYED;
> +}
> +
> +static ssize_t mc_writev_buffer(void *opaque, struct iovec *iov,
> +                                int iovcnt, int64_t pos)
> +{
> +    ssize_t len = 0;
> +    unsigned int i;
> +
> +    for (i = 0; i < iovcnt; i++) {
> +        DDDPRINTF("iov # %d, len: %" PRId64 "\n", i, iov[i].iov_len);
> +        len += mc_put_buffer(opaque, iov[i].iov_base, 0, iov[i].iov_len);
> +    }
> +
> +    return len;
> +}
> +
> +static int mc_get_fd(void *opaque)
> +{
> +    MCParams *mc = opaque;
> +
> +    return qemu_get_fd(mc->file);
> +}
> +
> +static int mc_close(void *opaque)
> +{
> +    MCParams *mc = opaque;
> +    MCSlab *slab, *next;
> +
> +    QTAILQ_FOREACH_SAFE(slab, &mc->slab_head, node, next) {
> +        ram_control_remove(mc->file, (uint64_t) slab->buf);
> +        QTAILQ_REMOVE(&mc->slab_head, slab, node);
> +        g_free(slab);
> +    }
> +
> +    mc->curr_slab = NULL;
> +
> +    return 0;
> +}
> +
> +static const QEMUFileOps mc_write_ops = {
> +    .writev_buffer = mc_writev_buffer,
> +    .put_buffer = mc_put_buffer,
> +    .get_fd = mc_get_fd,
> +    .close = mc_close,
> +    .save_page = mc_save_page,
> +};
> +
> +static const QEMUFileOps mc_read_ops = {
> +    .get_buffer = mc_get_buffer,
> +    .get_fd = mc_get_fd,
> +    .close = mc_close,
> +    .load_page = mc_load_page,
> +};
> +
> +QEMUFile *qemu_fopen_mc(void *opaque, const char *mode)
> +{
> +    MCParams *mc = opaque;
> +    MCSlab *slab;
> +    MCCopyset *copyset;
> +
> +    if (qemu_file_mode_is_not_valid(mode)) {
> +        return NULL;
> +    }
> +
> +    QTAILQ_INIT(&mc->slab_head);
> +    QTAILQ_INIT(&mc->copy_head);
> +
> +    slab = qemu_memalign(8, sizeof(MCSlab));
> +    memset(slab, 0, sizeof(*slab));
> +    slab->idx = 0;
> +    QTAILQ_INSERT_HEAD(&mc->slab_head, slab, node);
> +    mc->slab_total = 0;
> +    mc->curr_slab = slab;
> +    mc->nb_slabs = 1;
> +    mc->slab_strikes = 0;
> +
> +    ram_control_add(mc->file, slab->buf, (uint64_t) slab->buf, MC_SLAB_BUFFER_SIZE);
> +
> +    copyset = g_malloc(sizeof(MCCopyset));
> +    copyset->idx = 0;
> +    QTAILQ_INSERT_HEAD(&mc->copy_head, copyset, node);
> +    mc->total_copies = 0;
> +    mc->curr_copyset = copyset;
> +    mc->nb_copysets = 1;
> +    mc->copy_strikes = 0;
> +
> +    if (mode[0] == 'w') {
> +        return qemu_fopen_ops(mc, &mc_write_ops);
> +    }
> +
> +    return qemu_fopen_ops(mc, &mc_read_ops);
> +}
> +
> +static void mc_start_checkpointer(void *opaque)
> +{
> +    MigrationState *s = opaque;
> +
> +    if (checkpoint_bh) {
> +        qemu_bh_delete(checkpoint_bh);
> +        checkpoint_bh = NULL;
> +    }
> +
> +    qemu_mutex_unlock_iothread();
> +    qemu_thread_join(s->thread);
> +    g_free(s->thread);
> +    qemu_mutex_lock_iothread();
> +
> +    migrate_set_state(s, MIG_STATE_ACTIVE, MIG_STATE_CHECKPOINTING);
> +    s->thread = g_malloc0(sizeof(*s->thread));
> +    qemu_thread_create(s->thread, mc_thread, s, QEMU_THREAD_JOINABLE);
> +}
> +
> +void mc_init_checkpointer(MigrationState *s)
> +{
> +    CALC_MAX_STRIKES();
> +    checkpoint_bh = qemu_bh_new(mc_start_checkpointer, s);
> +    qemu_bh_schedule(checkpoint_bh);
> +}
> +
> +void qmp_migrate_set_mc_delay(int64_t value, Error **errp)
> +{
> +    freq_ms = value;
> +    CALC_MAX_STRIKES();
> +    DPRINTF("Setting checkpoint frequency to %" PRId64 " ms and "
> +            "resetting strikes to %d based on a %d sec delay.\n",
> +            freq_ms, max_strikes, max_strikes_delay_secs);
> +}
> +
> +int mc_info_load(QEMUFile *f, void *opaque, int version_id)
> +{
> +    bool mc_enabled = qemu_get_byte(f);
> +
> +    if (mc_enabled && !mc_requested) {
> +        DPRINTF("MC is requested\n");
> +        mc_requested = true;
> +    }
> +
> +    max_strikes = qemu_get_be32(f);
> +
> +    return 0;
> +}
> +
> +void mc_info_save(QEMUFile *f, void *opaque)
> +{
> +    qemu_put_byte(f, migrate_use_mc());
> +    qemu_put_be32(f, max_strikes);
> +}
>    

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing
  2014-02-18  9:28 ` [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing Li Guang
@ 2014-02-19  1:29   ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-02-19  1:29 UTC (permalink / raw)
  To: Li Guang
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, Michael R. Hines, gokul, dbulkow,
	pbonzini, abali, isaku.yamahata



On 02/18/2014 05:28 PM, Li Guang wrote:
> Hi, Michael
>
> this patch set will break a normal build (without --enable-mc):
>
> migration.c: In function ‘migrate_rdma_pin_all’:
> migration.c:564: error: ‘MIGRATION_CAPABILITY_X_RDMA_PIN_ALL’ 
> undeclared (first use in this function)
> migration.c:564: error: (Each undeclared identifier is reported only once
> migration.c:564: error: for each function it appears in.)
>
> Thanks!
> Li Guang
>

Could you use the github.com version for this RFC?

https://github.com/hinesmr/qemu/tree/mc

Just do "git remote-add" followed by "git fetch" and then "get checkout" 
in your existing QEMU git clone directory.
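
That is, something along these lines (the remote name here is arbitrary):

    $ git remote add mrhines https://github.com/hinesmr/qemu.git
    $ git fetch mrhines
    $ git checkout -b mc mrhines/mc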

- Michael

> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"<mrhines@us.ibm.com>
[big snip] ...

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-18 12:45   ` Dr. David Alan Gilbert
@ 2014-02-19  1:40     ` Michael R. Hines
  2014-02-19 11:27       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-02-19  1:40 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, lig.fnst, gokul, dbulkow, pbonzini,
	abali, isaku.yamahata, Michael R. Hines

On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
>> +The Micro-Checkpointing Process
>> +Basic Algorithm
>> +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
>> +
>> +1. After N milliseconds, stop the VM.
>> +2. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
>> +3. Resume the VM immediately so that it can make forward progress.
>> +4. Transmit the checkpoint to the destination.
>> +5. Repeat
>> +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
> Later you talk about the memory allocation and how you grow the memory as needed
> to fit the checkpoint; have you tried going the other way and triggering the
> checkpoints sooner if they're taking too much memory?

There is a 'knob' in this patch called "mc-set-delay" which was designed
to solve exactly that problem. It allows policy or management software
to make an independent decision about what the frequency of the
checkpoints should be.

I wasn't comfortable implementing policy directly inside the patch, as
that seemed less likely to get accepted by the community.
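
(For reference: assuming the QMP command is wired up under the same name
as the qmp_migrate_set_mc_delay() handler in this series, management
software would drive the knob with something like

    -> { "execute": "migrate-set-mc-delay", "arguments": { "value": 100 } }
    <- { "return": {} }

to request a checkpoint every 100 milliseconds.)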

>> +1. MC over TCP/IP: Once the socket connection breaks, we assume
>> failure. This happens very early in the loss of the latest MC not only
>> because a very large amount of bytes is typically being sequenced in a
>> TCP stream but perhaps also because of the timeout in acknowledgement
>> of the receipt of a commit message by the destination.
>> +
>> +2. MC over RDMA: Since Infiniband does not provide any underlying
>> timeout mechanisms, this implementation enhances QEMU's RDMA migration
>> protocol to include a simple keep-alive. Upon the loss of multiple
>> keep-alive messages, the sender is deemed to have failed.
>> +
>> +In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.
>> +
>> +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
>> +
>> +If the destination is deemed to be lost, we perform the same action
>> as a live migration: resume the sender normally and wait for management
>> software to make a policy decision about whether or not to re-protect
>> the VM, which may involve a third-party to identify a new destination
>> host again to use as a backup for the VM.
> In this world what is making the decision about whether the sender/destination
> should win - how do you avoid a split brain situation where both
> VMs are running but the only thing that failed is the comms between them?
> Is there any guarantee that you'll have received knowledge of the comms
> failure before you pull the plug out and enable the corked packets to be
> sent on the sender side?

Good question in general - I'll add it to the FAQ. The patch implements
a basic 'transaction' mechanism in coordination with an outbound I/O
buffer (documented further down). With these two things in
place, split-brain is not possible because the destination is not running.
We don't allow the destination to resume execution until a committed
transaction has been acknowledged by the destination, and only then
do we allow any outbound network traffic to be released to the
outside world.
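
Condensed from mc_thread() in the core-logic patch (error handling and
the RDMA path omitted), the per-checkpoint ordering is:

    capture_checkpoint(&mc, s);     /* plug outbound packets, stop the VM,
                                       copy dirty memory into staging,
                                       then restart the VM                */
    mc_send(s->file, MC_TRANSACTION_START);
    /* ... transmit the staged slabs (COMMIT + data over TCP) ... */
    mc_recv(mc_control, MC_TRANSACTION_ACK, NULL);
    mc_flush_oldest_buffer();       /* only now are held packets released
                                       to the outside world               */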

> <snip>
>
>> +RDMA is used for two different reasons:
>> +
>> +1. Checkpoint generation (RDMA-based memcpy):
>> +2. Checkpoint transmission
>> +Checkpoint generation must be done while the VM is paused. In the
>> worst case, the size of the checkpoint can be equal in size to the amount
>> of memory in total use by the VM. In order to resume VM execution as
>> fast as possible, the checkpoint is copied consistently locally into
>> a staging area before transmission. A standard memcpy() of potentially
>> such a large amount of memory not only gets no use out of the CPU cache
>> but also potentially clogs up the CPU pipeline which would otherwise
>> be useful by other neighbor VMs on the same physical node that could be
>> scheduled for execution. To minimize the effect on neighbor VMs, we use
>> RDMA to perform a "local" memcpy(), bypassing the host processor. On
>> more recent processors, a 'beefy' enough memory bus architecture can
>> move memory just as fast (sometimes faster) as a pure-software CPU-only
>> optimized memcpy() from libc. However, on older computers, this feature
>> only gives you the benefit of lower CPU-utilization at the expense of
> Isn't there a generic kernel DMA ABI for doing this type of thing (I
> think there was at one point, people have suggested things like using
> graphics cards to do it but I don't know if it ever happened).
> The other question is, do you always need to copy - what about something
> like COWing the pages?

Excellent question! Responding in two parts:

1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
      me if I'm wrong, but vmsplice was actually designed to avoid copies
      entirely between two userspace programs to be able to move memory
      more efficiently - whereas a fault-tolerant system actually *needs*
      a copy to be made.

2) Using COW: Actually, I think that's an excellent idea. I've bounced that
      around with my colleagues, but we simply didn't have the manpower
      to implement it and benchmark it. There was also some concern about
      performance: Would the writable working set of the guest be so
      active/busy that COW would not get you much benefit? I think it's
      worth a try. Patches welcome =)

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
  2014-02-18 10:32   ` Dr. David Alan Gilbert
@ 2014-02-19  1:42     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-02-19  1:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, qemu-devel, EREZH,
	owasserm, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	junqing.wang, BIRAN, lig.fnst, Michael R. Hines

On 02/18/2014 06:32 PM, Dr. David Alan Gilbert wrote:
> * mrhines@linux.vnet.ibm.com (mrhines@linux.vnet.ibm.com) wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> We also later export these statistics over QMP for better
>> monitoring of micro-checkpointing as the workload changes.
> <snip>
>
>> @@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>               /* XBZRLE overflow or normal page */
>>               if (bytes_sent == -1) {
>>                   bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
>> -                qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
>> -                bytes_sent += TARGET_PAGE_SIZE;
>> -                acct_info.norm_pages++;
>> +                if (ret != RAM_SAVE_CONTROL_DELAYED) {
>> +                    qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
>> +                    bytes_sent += TARGET_PAGE_SIZE;
>> +                    acct_info.norm_pages++;
>> +                }
>>               }
> Is that last change intended for this patch; it doesn't look
> timestamp related.
>
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>

Oooops. I failed to split out that patch correctly. How'd that get in
there? =)

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-19  1:00   ` Li Guang
@ 2014-02-19  2:14     ` Michael R. Hines
  2014-02-20  5:03     ` Michael R. Hines
  2014-02-21  8:13     ` Michael R. Hines
  2 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-02-19  2:14 UTC (permalink / raw)
  To: Li Guang
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, Michael R. Hines, gokul, dbulkow,
	pbonzini, abali, isaku.yamahata

On 02/19/2014 09:00 AM, Li Guang wrote:
> Hi,
>
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>
>> This patch sets up the initial changes to the migration state
>> machine and prototypes to be used by the checkpointing code
>> to interact with the state machine so that we can later handle
>> failure and recovery scenarios.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>>   include/migration/migration.h |  2 ++
>>   migration.c                   | 37 
>> +++++++++++++++++++++----------------
>>   3 files changed, 47 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index db75120..e9d4d9e 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>>       migration_end();
>>   }
>>
>> -static void reset_ram_globals(void)
>> +static void reset_ram_globals(bool reset_bulk_stage)
>>   {
>>       last_seen_block = NULL;
>>       last_sent_block = NULL;
>>       last_offset = 0;
>>       last_version = ram_list.version;
>> -    ram_bulk_stage = true;
>> +    ram_bulk_stage = reset_bulk_stage;
>>   }
>>
>
> there is a chance that ram_save_block() will never break out of its
> while loop if last_seen_block is reset for MC when there are no
> dirty pages to be migrated.
>
> Thanks!

I see. Question:

While running the code, I have never seen a case where there are no
dirty pages. Have you seen this before? And even if there are no dirty
pages during a single 100 millisecond checkpoint interval (the default
which can be changed) - the probability that there will continue to be
no dirty pages during the next checkpoint interval will be very low,
right?

If there are no dirty pages during a checkpoint, should not
ram_save_block() keep waiting? The virtual machine is still running
during MC - the MC code does not stop the virtual machine, so if there
are no dirty pages, isn't it safe to just let ram_save_block() keep looping
until there are dirty pages available?

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
  2014-02-19  1:07   ` Li Guang
@ 2014-02-19  2:16     ` Michael R. Hines
  2014-02-19  2:53       ` Li Guang
  0 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-02-19  2:16 UTC (permalink / raw)
  To: Li Guang
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, Michael R. Hines, gokul, dbulkow,
	pbonzini, abali, isaku.yamahata

On 02/19/2014 09:07 AM, Li Guang wrote:
> Hi,
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>
>> This implements the core logic,
>> all described in the first patch (docs/mc.txt).
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   migration-checkpoint.c | 1565 
>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 1565 insertions(+)
>>   create mode 100644 migration-checkpoint.c
>>
>>
> [big snip] ...
>
>> +
>> +/*
>> + * Stop the VM, generate the micro checkpoint,
>> + * but save the dirty memory into staging memory until
>> + * we can re-activate the VM as soon as possible.
>> + */
>> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
>> +{
>> +    MCCopyset *copyset;
>> +    int idx, ret = 0;
>> +    uint64_t start, stop, copies = 0;
>> +    int64_t start_time;
>> +
>> +    mc->total_copies = 0;
>> +    qemu_mutex_lock_iothread();
>> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
>> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>> +
>> +    /*
>> +     * If buffering is enabled, insert a Qdisc plug here
>> +     * to hold packets for the *next* MC, (not this one,
>> +     * the packets for this one have already been plugged
>> +     * and will be released after the MC has been transmitted.
>> +     */
>> +    mc_start_buffer();
>
> Actually, I have a special request:
> if QEMU is started without a netdev,
> then don't bother me with the Qdisc network buffering. :-)
>
> Thanks!
>

That ability is already available in the patchset.
It is called "mc-net-disable" capability. (See the wiki or docs/mc.txt).

Did you try it?

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
  2014-02-19  2:16     ` Michael R. Hines
@ 2014-02-19  2:53       ` Li Guang
  2014-02-19  4:27         ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Li Guang @ 2014-02-19  2:53 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, qemu-devel, EREZH,
	owasserm, onom, hinesmr, Michael R. Hines, gokul, dbulkow,
	junqing.wang, BIRAN, isaku.yamahata

Michael R. Hines wrote:
> On 02/19/2014 09:07 AM, Li Guang wrote:
>> Hi,
>> mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>>
>>> This implements the core logic,
>>> all described in the first patch (docs/mc.txt).
>>>
>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>> ---
>>>   migration-checkpoint.c | 1565 
>>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 1565 insertions(+)
>>>   create mode 100644 migration-checkpoint.c
>>>
>>>
>> [big snip] ...
>>
>>> +
>>> +/*
>>> + * Stop the VM, generate the micro checkpoint,
>>> + * but save the dirty memory into staging memory until
>>> + * we can re-activate the VM as soon as possible.
>>> + */
>>> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
>>> +{
>>> +    MCCopyset *copyset;
>>> +    int idx, ret = 0;
>>> +    uint64_t start, stop, copies = 0;
>>> +    int64_t start_time;
>>> +
>>> +    mc->total_copies = 0;
>>> +    qemu_mutex_lock_iothread();
>>> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
>>> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>> +
>>> +    /*
>>> +     * If buffering is enabled, insert a Qdisc plug here
>>> +     * to hold packets for the *next* MC, (not this one,
>>> +     * the packets for this one have already been plugged
>>> +     * and will be released after the MC has been transmitted.
>>> +     */
>>> +    mc_start_buffer();
>>
>> Actually, I have a special request:
>> if QEMU is started without a netdev,
>> then don't bother me with the Qdisc network buffering. :-)
>>
>> Thanks!
>>
>
> That ability is already available in the patchset.
> It is called "mc-net-disable" capability. (See the wiki or docs/mc.txt).
>
> Did you try it?
>

I don't mean disabling it manually; I mean don't even start network
buffering when there is no netdev.

Thanks!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic
  2014-02-19  2:53       ` Li Guang
@ 2014-02-19  4:27         ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-02-19  4:27 UTC (permalink / raw)
  To: Li Guang
  Cc: GILR, SADEKJ, quintela, hinesmr, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, Michael R. Hines, gokul, dbulkow,
	pbonzini, BIRAN, isaku.yamahata

On 02/19/2014 10:53 AM, Li Guang wrote:
> Michael R. Hines wrote:
>> On 02/19/2014 09:07 AM, Li Guang wrote:
>>> Hi,
>>> mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>>>
>>>> This implements the core logic,
>>>> all described in the first patch (docs/mc.txt).
>>>>
>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>>> ---
>>>>   migration-checkpoint.c | 1565 
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 1565 insertions(+)
>>>>   create mode 100644 migration-checkpoint.c
>>>>
>>>>
>>> [big snip] ...
>>>
>>>> +
>>>> +/*
>>>> + * Stop the VM, generate the micro checkpoint,
>>>> + * but save the dirty memory into staging memory until
>>>> + * we can re-activate the VM as soon as possible.
>>>> + */
>>>> +static int capture_checkpoint(MCParams *mc, MigrationState *s)
>>>> +{
>>>> +    MCCopyset *copyset;
>>>> +    int idx, ret = 0;
>>>> +    uint64_t start, stop, copies = 0;
>>>> +    int64_t start_time;
>>>> +
>>>> +    mc->total_copies = 0;
>>>> +    qemu_mutex_lock_iothread();
>>>> +    vm_stop_force_state(RUN_STATE_CHECKPOINT_VM);
>>>> +    start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>>> +
>>>> +    /*
>>>> +     * If buffering is enabled, insert a Qdisc plug here
>>>> +     * to hold packets for the *next* MC, (not this one,
>>>> +     * the packets for this one have already been plugged
>>>> +     * and will be released after the MC has been transmitted.
>>>> +     */
>>>> +    mc_start_buffer();
>>>
>>> Actually, I have a special request:
>>> if QEMU is started without a netdev,
>>> then don't bother me with the Qdisc network buffering. :-)
>>>
>>> Thanks!
>>>
>>
>> That ability is already available in the patchset.
>> It is called "mc-net-disable" capability. (See the wiki or docs/mc.txt).
>>
>> Did you try it?
>>
>
> I don't mean disabling it manually; I mean don't even start network
> buffering when there is no netdev.
>
> Thanks!
>
>

Oh, I see. Got it. I will update the patch =).

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-19  1:40     ` Michael R. Hines
@ 2014-02-19 11:27       ` Dr. David Alan Gilbert
  2014-02-20  1:17         ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-02-19 11:27 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, qemu-devel, EREZH,
	owasserm, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	junqing.wang, BIRAN, lig.fnst, Michael R. Hines

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
> >>+The Micro-Checkpointing Process
> >>+Basic Algorithm
> >>+Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
> >>+
> >>+1. After N milliseconds, stop the VM.
> >>+2. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
> >>+3. Resume the VM immediately so that it can make forward progress.
> >>+4. Transmit the checkpoint to the destination.
> >>+5. Repeat
> >>+Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
> >Later you talk about the memory allocation and how you grow the memory as needed
> >to fit the checkpoint; have you tried going the other way and triggering the
> >checkpoints sooner if they're taking too much memory?
> 
> There is a 'knob' in this patch called "mc-set-delay" which was designed
> to solve exactly that problem. It allows policy or management software
> to make an independent decision about what the frequency of the
> checkpoints should be.
> 
> I wasn't comfortable implementing policy directly inside the patch, as
> that seemed less likely to get accepted by the community.

I was just wondering if a separate 'max buffer size' knob would allow
you to more reasonably bound memory without setting policy; I don't think
people like having potentially x2 memory.

> >>+1. MC over TCP/IP: Once the socket connection breaks, we assume
> >>failure. This happens very early in the loss of the latest MC not only
> >>because a very large amount of bytes is typically being sequenced in a
> >>TCP stream but perhaps also because of the timeout in acknowledgement
> >>of the receipt of a commit message by the destination.
> >>+
> >>+2. MC over RDMA: Since Infiniband does not provide any underlying
> >>timeout mechanisms, this implementation enhances QEMU's RDMA migration
> >>protocol to include a simple keep-alive. Upon the loss of multiple
> >>keep-alive messages, the sender is deemed to have failed.
> >>+
> >>+In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.
> >>+
> >>+If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
> >>+
> >>+If the destination is deemed to be lost, we perform the same action
> >>as a live migration: resume the sender normally and wait for management
> >>software to make a policy decision about whether or not to re-protect
> >>the VM, which may involve a third-party to identify a new destination
> >>host again to use as a backup for the VM.
> >In this world what is making the decision about whether the sender/destination
> >should win - how do you avoid a split brain situation where both
> >VMs are running but the only thing that failed is the comms between them?
> >Is there any guarantee that you'll have received knowledge of the comms
> >failure before you pull the plug out and enable the corked packets to be
> >sent on the sender side?
> 
> Good question in general - I'll add it to the FAQ. The patch implements
> a basic 'transaction' mechanism in coordination with an outbound I/O
> buffer (documented further down). With these two things in
> place, split-brain is not possible because the destination is not running.
> We don't allow the destination to resume execution until a committed
> transaction has been acknowledged by the destination, and only then
> do we allow any outbound network traffic to be released to the
> outside world.

Yeh I see the IO buffer, what I've not figured out is how:
>   1) MC over TCP/IP gets an acknowledgement on the source to know when
>      it can unplug its buffer.
  2) Lets say the MC connection fails, so that ack never arrives,
>      the source must assume the destination has failed and release its
     packets and carry on.
     The destination must assume the source has failed and take over.

     Now they're both running - and that's bad and it's standard
     split brain.
>   3) If we're relying on the TCP/IP timeout, that's quite long.

> >>+RDMA is used for two different reasons:
> >>+
> >>+1. Checkpoint generation (RDMA-based memcpy):
> >>+2. Checkpoint transmission
> >>+Checkpoint generation must be done while the VM is paused. In the
> >>worst case, the size of the checkpoint can be equal in size to the amount
> >>of memory in total use by the VM. In order to resume VM execution as
> >>fast as possible, the checkpoint is copied consistently locally into
> >>a staging area before transmission. A standard memcpy() of potentially
> >>such a large amount of memory not only gets no use out of the CPU cache
> >>but also potentially clogs up the CPU pipeline which would otherwise
> >>be useful by other neighbor VMs on the same physical node that could be
> >>scheduled for execution. To minimize the effect on neighbor VMs, we use
> >>RDMA to perform a "local" memcpy(), bypassing the host processor. On
> >>more recent processors, a 'beefy' enough memory bus architecture can
> >>move memory just as fast (sometimes faster) as a pure-software CPU-only
> >>optimized memcpy() from libc. However, on older computers, this feature
> >>only gives you the benefit of lower CPU-utilization at the expense of
> >Isn't there a generic kernel DMA ABI for doing this type of thing (I
> >think there was at one point, people have suggested things like using
> >graphics cards to do it but I don't know if it ever happened).
> >The other question is, do you always need to copy - what about something
> >like COWing the pages?
> 
> Excellent question! Responding in two parts:
> 
> 1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
>      me if I'm wrong, but vmsplice was actually designed to avoid copies
>      entirely between two userspace programs to be able to move memory
>      more efficiently - whereas a fault-tolerant system actually *needs*
>      a copy to be made.

No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
of the use of Intel's I/OAT, graphics cards, etc for doing things like page
zeroing and DMAing data around; I can see there is a dmaengine API in the
kernel, I haven't found where if anywhere that is available to userspace.

> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>      around with my colleagues, but we simply didn't have the manpower
>      to implement it and benchmark it. There was also some concern about
>      performance: Would the writable working set of the guest be so
> active/busy
>      that COW would not get you much benefit? I think it's worth a try.
>      Patches welcome =)

It's possible that might be doable with some of the same tricks I'm
looking at for post-copy, I'll see what I can do.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-19 11:27       ` Dr. David Alan Gilbert
@ 2014-02-20  1:17         ` Michael R. Hines
  2014-02-20 10:09           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-02-20  1:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, qemu-devel, EREZH,
	owasserm, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	junqing.wang, BIRAN, lig.fnst, Michael R. Hines

On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>
> I was just wondering if a separate 'max buffer size' knob would allow
> you to more reasonably bound memory without setting policy; I don't think
> people like having potentially x2 memory.

Note: Checkpoint memory is not monotonic in this patchset (which
is unique to this implementation). Only if the guest actually dirties
100% of its memory between one checkpoint and the next will
the host experience 2x memory usage for a short period of time.

The patch has a 'slab' mechanism built in to it which implements
a water-mark style policy that throws away unused portions of
the 2x checkpoint memory if later checkpoints are much smaller
(which is likely to be the case if the writable working set size changes).

However, to answer your question: Such a knob could be achieved, but
the same could be achieved simply by tuning the checkpoint frequency
itself. Memory usage would thus be a function of the checkpoint frequency.

If the guest application was maniacal, banging away at all the memory,
there's very little that can be done in the first place, but if the 
guest application
was mildly busy, you don't want to throw away your ability to be fault
tolerant - you would just need more frequent checkpoints to keep up with
the dirty rate.

Once the application died down - the water-mark policy would kick in
and start freeing checkpoint memory. (Note: this policy happens on
both sides in the patchset because the patch has to be fully compatible
with RDMA memory pinning).

What is *not* exposed, however, are the watermark knobs themselves;
I definitely think those need to be exposed - that would also get you
a similar control to 'max buffer size' - you could place a time limit
on the slab list in the patch or something like that.......
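
To make the water-mark idea concrete, here is a minimal sketch
(MCSlab, mc_free_unused_slabs and the field names are illustrative,
not the patchset's actual code):

    #include <glib.h>
    #include <stdint.h>

    typedef struct MCSlab {
        struct MCSlab *next;
        uint8_t *buf;     /* staging memory for checkpoint data */
        size_t size;      /* capacity of this slab */
        size_t used;      /* bytes used by the most recent checkpoint */
    } MCSlab;

    /* Free slabs above the water-mark that the last checkpoint never
     * touched, so memory usage tracks the writable working set. */
    static void mc_free_unused_slabs(MCSlab **head, size_t watermark)
    {
        size_t total = 0;
        MCSlab **prev = head;

        while (*prev) {
            MCSlab *slab = *prev;
            total += slab->size;
            if (total > watermark && slab->used == 0) {
                *prev = slab->next;
                g_free(slab->buf);
                g_free(slab);
            } else {
                prev = &slab->next;
            }
        }
    }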


>>
>> Good question in general - I'll add it to the FAQ. The patch implements
>> a basic 'transaction' mechanism in coordination with an outbound I/O
>> buffer (documented further down). With these two things in
>> place, split-brain is not possible because the destination is not running.
>> We don't allow the destination to resume execution until a committed
>> transaction has been acknowledged by the destination, and only then
>> do we allow any outbound network traffic to be released to the
>> outside world.
> Yeh I see the IO buffer, what I've not figured out is how:
>    1) MC over TCP/IP gets an acknowledge on the source to know when
>       it can unplug its buffer.

Only partially correct (See the steps on the wiki). There are two I/O
buffers at any given time which protect against a split-brain scenario:
One buffer for the current checkpoint that is being generated (running VM)
and one buffer for the checkpoint that is being committed in a transaction.

>    2) Let's say the MC connection fails, so that ack never arrives,
>       the source must assume the destination has failed and release its
>       packets and carry on.

Only the packets for Buffer A are released for the current committed
checkpoint after a completed transaction. The packets for Buffer B
(the current running VM) are still being held up until the next 
transaction starts.
Later once the transaction completes and A is released, B becomes the
new A and a new buffer is installed to become the new Buffer B for
the current running VM.
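
In sketch form, the rotation at each committed transaction looks
roughly like this (names are illustrative, not the patchset's code):

    typedef struct MCNetBuffer MCNetBuffer;    /* holds corked packets */

    void mc_release_packets(MCNetBuffer *buf); /* hypothetical: flush A to the world */
    void mc_recycle_buffer(MCNetBuffer *buf);  /* hypothetical: empty it for reuse */

    static void mc_transaction_committed(MCNetBuffer **a, MCNetBuffer **b)
    {
        MCNetBuffer *recycled;

        mc_release_packets(*a);   /* committed checkpoint's packets go out */
        mc_recycle_buffer(*a);
        recycled = *a;
        *a = *b;                  /* B becomes the new A ... */
        *b = recycled;            /* ... and a fresh buffer becomes the new B */
    }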


>       The destination must assume the source has failed and take over.

The destination must also receive an ACK. The ack goes both ways.

Only once the source and destination both acknowledge a completed
transaction does the source VM resume execution - and even then
its packets are still being buffered until the next transaction starts.
(That's why it's important to checkpoint as frequently as possible).


>    3) If we're relying on TCP/IP timeout that's quite long.
>

Actually, my experience has been that TCP seems to have more than
one kind of timeout - if the receiver is not responding *at all* - it seems
that TCP has a dedicated timer for that. The socket API immediately
sends back an error code and the patchset closes the connection
on the destination and recovers.
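
That failure path amounts to something like the following (a simplified
sketch; mc_fd and the recovery hook are hypothetical names):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void mc_activate_last_checkpoint(void);   /* hypothetical recovery hook */

    static void mc_check_peer(int mc_fd)
    {
        char buf[4096];
        ssize_t n = recv(mc_fd, buf, sizeof(buf), 0);

        /* A dead peer surfaces as 0 (orderly close) or -1 with a hard
         * error such as ECONNRESET. */
        if (n == 0 || (n < 0 && errno != EAGAIN && errno != EINTR)) {
            close(mc_fd);
            mc_activate_last_checkpoint();
        }
    }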

> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
> zeroing and DMAing data around; I can see there is a dmaengine API in the
> kernel, I haven't found where if anywhere that is available to userspace.
>
>> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>>       around with my colleagues, but we simply didn't have the manpower
>>       to implement it and benchmark it. There was also some concern about
>>       performance: Would the writable working set of the guest be so
>> active/busy
>>       that COW would not get you much benefit? I think it's worth a try.
>>       Patches welcome =)
> It's possible that might be doable with some of the same tricks I'm
> looking at for post-copy, I'll see what I can do.

That's great news - I'm very interested to see how this applies
to post-copy and to any patches of that kind.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-19  1:00   ` Li Guang
  2014-02-19  2:14     ` Michael R. Hines
@ 2014-02-20  5:03     ` Michael R. Hines
  2014-02-21  8:13     ` Michael R. Hines
  2 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-02-20  5:03 UTC (permalink / raw)
  To: Li Guang
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	owasserm, onom, junqing.wang, Michael R. Hines, gokul, dbulkow,
	pbonzini, abali, isaku.yamahata

On 02/19/2014 09:00 AM, Li Guang wrote:
> Hi,
>
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>
>> This patch sets up the initial changes to the migration state
>> machine and prototypes to be used by the checkpointing code
>> to interact with the state machine so that we can later handle
>> failure and recovery scenarios.
>>
>> Signed-off-by: Michael R. Hines<mrhines@us.ibm.com>
>> ---
>>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>>   include/migration/migration.h |  2 ++
>>   migration.c                   | 37 +++++++++++++++++++++----------------
>>   3 files changed, 47 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index db75120..e9d4d9e 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>>       migration_end();
>>   }
>>
>> -static void reset_ram_globals(void)
>> +static void reset_ram_globals(bool reset_bulk_stage)
>>   {
>>       last_seen_block = NULL;
>>       last_sent_block = NULL;
>>       last_offset = 0;
>>       last_version = ram_list.version;
>> -    ram_bulk_stage = true;
>> +    ram_bulk_stage = reset_bulk_stage;
>>   }
>>
>
> there is a chance that ram_save_block will never break out of its while loop
> if last_seen_block is reset for MC when there are no dirty pages
> to be migrated.
>
> Thanks! 

OK, now I have finally experienced this bug: when I deleted the netdev
network card, I got an infinite loop in ram_save_block.......

Need to figure out how to fix it =)

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-20  1:17         ` Michael R. Hines
@ 2014-02-20 10:09           ` Dr. David Alan Gilbert
  2014-02-20 11:14             ` Li Guang
  2014-02-20 14:57             ` Michael R. Hines
  0 siblings, 2 replies; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-02-20 10:09 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: SADEKJ, pbonzini, quintela, abali, qemu-devel, EREZH, owasserm,
	onom, hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang,
	BIRAN, lig.fnst, Michael R. Hines

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
> >
> >I was just wondering if a separate 'max buffer size' knob would allow
> >you to more reasonably bound memory without setting policy; I don't think
> >people like having potentially x2 memory.
> 
> Note: Checkpoint memory is not monotonic in this patchset (which
> is unique to this implementation). Only if the guest actually dirties
> 100% of its memory between one checkpoint and the next will
> the host experience 2x memory usage for a short period of time.

Right, but that doesn't really help - if someone comes along and says
'How much memory do I need to be able to run an mc system?' the only
safe answer is 2x, otherwise we're adding a reason why the previously
stable guest might OOM.

> The patch has a 'slab' mechanism built in to it which implements
> a water-mark style policy that throws away unused portions of
> the 2x checkpoint memory if later checkpoints are much smaller
> (which is likely to be the case if the writable working set size changes).
> 
> However, to answer your question: Such a knob could be achieved, but
> the same could be achieved simply by tuning the checkpoint frequency
> itself. Memory usage would thus be a function of the checkpoint frequency.

> If the guest application was maniacal, banging away at all the memory,
> there's very little that can be done in the first place, but if the
> guest application
> was mildly busy, you don't want to throw away your ability to be fault
> tolerant - you would just need more frequent checkpoints to keep up with
> the dirty rate.

I'm not convinced; I can tune my checkpoint frequency until normal operation
makes a reasonable trade off between mc frequency and RAM usage,
but that doesn't prevent it running away when a garbage collect or some
other thing suddenly dirties a load of ram in one particular checkpoint.
Some management tool that watches ram usage etc can also help tune
it, but in the end it can't stop it taking loads of RAM.

> Once the application died down - the water-mark policy would kick in
> and start freeing checkpoint memory. (Note: this policy happens on
> both sides in the patchset because the patch has to be fully compatible
> with RDMA memory pinning).
> 
> What is *not* exposed, however, are the watermark knobs themselves,
> I definitely think that needs to be exposed - that would also get
> you a similar
> control to 'max buffer size' - you could place a time limit on the
> slab list in the patch or something like that.......
> 
> 
> >>
> >>Good question in general - I'll add it to the FAQ. The patch implements
> >>a basic 'transaction' mechanism in coordination with an outbound I/O
> >>buffer (documented further down). With these two things in
> >>place, split-brain is not possible because the destination is not running.
> >>We don't allow the destination to resume execution until a committed
> >>transaction has been acknowledged by the destination, and only then
> >>do we allow any outbound network traffic to be released to the
> >>outside world.
> >Yeh I see the IO buffer, what I've not figured out is how:
> >   1) MC over TCP/IP gets an acknowledge on the source to know when
> >      it can unplug its buffer.
> 
> Only partially correct (See the steps on the wiki). There are two I/O
> buffers at any given time which protect against a split-brain scenario:
> One buffer for the current checkpoint that is being generated (running VM)
> and one buffer for the checkpoint that is being committed in a transaction.
> 
> >   2) Let's say the MC connection fails, so that ack never arrives,
> >      the source must assume the destination has failed and release its
> >      packets and carry on.
> 
> Only the packets for Buffer A are released for the current committed
> checkpoint after a completed transaction. The packets for Buffer B
> (the current running VM) are still being held up until the next
> transaction starts.
> Later once the transaction completes and A is released, B becomes the
> new A and a new buffer is installed to become the new Buffer B for
> the current running VM.
> 
> 
> >      The destination must assume the source has failed and take over.
> 
> The destination must also receive an ACK. The ack goes both ways.
> 
> Only once the source and destination both acknowledge a completed
> transaction does the source VM resume execution - and even then
> its packets are still being buffered until the next transaction starts.
> (That's why it's important to checkpoint as frequently as possible).

I think I understand normal operation - my question here is about failure;
what happens when neither side gets any ACKs.

> >   3) If we're relying on TCP/IP timeout that's quite long.
> >
> 
> Actually, my experience has been that TCP seems to have more than
> one kind of timeout - if the receiver is not responding *at all* - it seems
> that TCP has a dedicated timer for that. The socket API immediately
> sends back an error code and the patchset closes the connection
> on the destination and recovers.

How did you test that?
My experience is that if a host knows that it has no route to the destination
(e.g. it has no route to try, that matches the destination, because someone
took the network interface away) you immediately get a 'no route to host',
however if an intermediate link disappears then it takes a while to time out.

> >No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
> >of the use of Intel's I/OAT, graphics cards, etc for doing things like page
> >zeroing and DMAing data around; I can see there is a dmaengine API in the
> >kernel, I haven't found where if anywhere that is available to userspace.
> >
> >>2) Using COW: Actually, I think that's an excellent idea. I've bounced that
> >>      around with my colleagues, but we simply didn't have the manpower
> >>      to implement it and benchmark it. There was also some concern about
> >>      performance: Would the writable working set of the guest be so
> >>active/busy
> >>      that COW would not get you much benefit? I think it's worth a try.
> >>      Patches welcome =)
> >It's possible that might be doable with some of the same tricks I'm
> >looking at for post-copy, I'll see what I can do.
> 
> That's great news - I'm very interested to see how this applies
> to post-copy and to any patches of that kind.
> 
> - Michael
> 

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-20 10:09           ` Dr. David Alan Gilbert
@ 2014-02-20 11:14             ` Li Guang
  2014-02-20 14:58               ` Michael R. Hines
  2014-02-20 14:57             ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Li Guang @ 2014-02-20 11:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: SADEKJ, quintela, hinesmr, qemu-devel, Michael R. Hines,
	owasserm, junqing.wang, onom, abali, EREZH, gokul, dbulkow,
	pbonzini, BIRAN, isaku.yamahata, Michael R. Hines

Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>    
>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>      
>>> I was just wondering if a separate 'max buffer size' knob would allow
>>> you to more reasonably bound memory without setting policy; I don't think
>>> people like having potentially x2 memory.
>>>        
>> Note: Checkpoint memory is not monotonic in this patchset (which
>> is unique to this implementation). Only if the guest actually dirties
>> 100% of its memory between one checkpoint and the next will
>> the host experience 2x memory usage for a short period of time.
>>      
> Right, but that doesn't really help - if someone comes along and says
> 'How much memory do I need to be able to run an mc system?' the only
> safe answer is 2x, otherwise we're adding a reason why the previously
> stable guest might OOM.
>
>    

so we may have to involve some disk operations
to handle memory exhaustion.

Thanks!

>> The patch has a 'slab' mechanism built in to it which implements
>> a water-mark style policy that throws away unused portions of
>> the 2x checkpoint memory if later checkpoints are much smaller
>> (which is likely to be the case if the writable working set size changes).
>>
>> However, to answer your question: Such a knob could be achieved, but
>> the same could be achieved simply by tuning the checkpoint frequency
>> itself. Memory usage would thus be a function of the checkpoint frequency.
>>      
>    
>> If the guest application was maniacal, banging away at all the memory,
>> there's very little that can be done in the first place, but if the
>> guest application
>> was mildly busy, you don't want to throw away your ability to be fault
>> tolerant - you would just need more frequent checkpoints to keep up with
>> the dirty rate.
>>      
> I'm not convinced; I can tune my checkpoint frequency until normal operation
> makes a reasonable trade off between mc frequency and RAM usage,
> but that doesn't prevent it running away when a garbage collect or some
> other thing suddenly dirties a load of ram in one particular checkpoint.
> Some management tool that watches ram usage etc can also help tune
> it, but in the end it can't stop it taking loads of RAM.
>
>    
>> Once the application died down - the water-mark policy would kick in
>> and start freeing checkpoint memory. (Note: this policy happens on
>> both sides in the patchset because the patch has to be fully compatible
>> with RDMA memory pinning).
>>
>> What is *not* exposed, however, are the watermark knobs themselves,
>> I definitely think that needs to be exposed - that would also get
>> you a similar
>> control to 'max buffer size' - you could place a time limit on the
>> slab list in the patch or something like that.......
>>
>>
>>      
>>>> Good question in general - I'll add it to the FAQ. The patch implements
>>>> a basic 'transaction' mechanism in coordination with an outbound I/O
>>>> buffer (documented further down). With these two things in
>>>> place, split-brain is not possible because the destination is not running.
>>>> We don't allow the destination to resume execution until a committed
>>>> transaction has been acknowledged by the destination, and only then
>>>> do we allow any outbound network traffic to be released to the
>>>> outside world.
>>>>          
>>> Yeh I see the IO buffer, what I've not figured out is how:
>>>    1) MC over TCP/IP gets an acknowledge on the source to know when
>>>       it can unplug its buffer.
>>>        
>> Only partially correct (See the steps on the wiki). There are two I/O
>> buffers at any given time which protect against a split-brain scenario:
>> One buffer for the current checkpoint that is being generated (running VM)
>> and one buffer for the checkpoint that is being committed in a transaction.
>>
>>      
>>>    2) Let's say the MC connection fails, so that ack never arrives,
>>>       the source must assume the destination has failed and release its
>>>       packets and carry on.
>>>        
>> Only the packets for Buffer A are released for the current committed
>> checkpoint after a completed transaction. The packets for Buffer B
>> (the current running VM) are still being held up until the next
>> transaction starts.
>> Later once the transaction completes and A is released, B becomes the
>> new A and a new buffer is installed to become the new Buffer B for
>> the current running VM.
>>
>>
>>      
>>>       The destination must assume the source has failed and take over.
>>>        
>> The destination must also receive an ACK. The ack goes both ways.
>>
>> Only once the source and destination both acknowledge a completed
>> transaction does the source VM resume execution - and even then
>> its packets are still being buffered until the next transaction starts.
>> (That's why it's important to checkpoint as frequently as possible).
>>      
> I think I understand normal operation - my question here is about failure;
> what happens when neither side gets any ACKs.
>
>    
>>>    3) If we're relying on TCP/IP timeout that's quite long.
>>>
>>>        
>> Actually, my experience has been that TCP seems to have more than
>> one kind of timeout - if the receiver is not responding *at all* - it seems
>> that TCP has a dedicated timer for that. The socket API immediately
>> sends back an error code and the patchset closes the connection
>> on the destination and recovers.
>>      
> How did you test that?
> My experience is that if a host knows that it has no route to the destination
> (e.g. it has no route to try, that matches the destination, because someone
> took the network interface away) you immediately get a 'no route to host',
> however if an intermediate link disappears then it takes a while to time out.
>
>    
>>> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
>>> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
>>> zeroing and DMAing data around; I can see there is a dmaengine API in the
>>> kernel, I haven't found where if anywhere that is available to userspace.
>>>
>>>        
>>>> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>>>>       around with my colleagues, but we simply didn't have the manpower
>>>>       to implement it and benchmark it. There was also some concern about
>>>>       performance: Would the writable working set of the guest be so
>>>> active/busy
>>>>       that COW would not get you much benefit? I think it's worth a try.
>>>>       Patches welcome =)
>>>>          
>>> It's possible that might be doable with some of the same tricks I'm
>>> looking at for post-copy, I'll see what I can do.
>>>        
>> That's great news - I'm very interested to see how this applies
>> to post-copy and to any patches of that kind.
>>
>> - Michael
>>
>>      
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>    

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-20 10:09           ` Dr. David Alan Gilbert
  2014-02-20 11:14             ` Li Guang
@ 2014-02-20 14:57             ` Michael R. Hines
  2014-02-20 16:32               ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-02-20 14:57 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: SADEKJ, pbonzini, quintela, abali, qemu-devel, EREZH, owasserm,
	onom, hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang,
	BIRAN, lig.fnst, Michael R. Hines

On 02/20/2014 06:09 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>> I was just wondering if a separate 'max buffer size' knob would allow
>>> you to more reasonably bound memory without setting policy; I don't think
>>> people like having potentially x2 memory.
>> Note: Checkpoint memory is not monotonic in this patchset (which
>> is unique to this implementation). Only if the guest actually dirties
>> 100% of its memory between one checkpoint and the next will
>> the host experience 2x memory usage for a short period of time.
> Right, but that doesn't really help - if someone comes along and says
> 'How much memory do I need to be able to run an mc system?' the only
> safe answer is 2x, otherwise we're adding a reason why the previously
> stable guest might OOM.

Yes, exactly. Running MC is expensive and probably always will be,
to some degree. Saving memory and having 100% fault tolerance
are sometimes mutually exclusive. Expectations have to be managed here.

The bottom line is: if you put a *hard* constraint on memory usage,
what will happen to the guest when that garbage collection you mentioned
shows up later and runs for several minutes? How about an hour?
Are we just going to block the guest from being allowed to start a
checkpoint until the memory usage goes down just for the sake of avoiding
the 2x memory usage? If you block the guest from being checkpointed,
then what happens if there is a failure during that extended period?
We will have saved memory at the expense of availability.

The customer that is expecting 100% fault tolerance and the provider
who is supporting it need to have an understanding that fault tolerance
is not free and that constraining memory usage will adversely affect
the VM's ability to be protected.

Do I understand your expectations correctly? Is fault tolerance
something you're willing to sacrifice?

>> The patch has a 'slab' mechanism built in to it which implements
>> a water-mark style policy that throws away unused portions of
>> the 2x checkpoint memory if later checkpoints are much smaller
>> (which is likely to be the case if the writable working set size changes).
>>
>> However, to answer your question: Such a knob could be achieved, but
>> the same could be achieved simply by tuning the checkpoint frequency
>> itself. Memory usage would thus be a function of the checkpoint frequency.
>> If the guest application was maniacal, banging away at all the memory,
>> there's very little that can be done in the first place, but if the
>> guest application
>> was mildly busy, you don't want to throw away your ability to be fault
>> tolerant - you would just need more frequent checkpoints to keep up with
>> the dirty rate.
> I'm not convinced; I can tune my checkpoint frequency until normal operation
> makes a reasonable trade off between mc frequency and RAM usage,
> but that doesn't prevent it running away when a garbage collect or some
> other thing suddenly dirties a load of ram in one particular checkpoint.
> Some management tool that watches ram usage etc can also help tune
> it, but in the end it can't stop it taking loads of RAM.

That's correct. See above comment....

>
>> Once the application died down - the water-mark policy would kick in
>> and start freeing checkpoint memory. (Note: this policy happens on
>> both sides in the patchset because the patch has to be fully compatible
>> with RDMA memory pinning).
>>
> >>What is *not* exposed, however, are the watermark knobs themselves,
>> I definitely think that needs to be exposed - that would also get
>> you a similar
>> control to 'max buffer size' - you could place a time limit on the
>> slab list in the patch or something like that.......
>>
>>
>>>> Good question in general - I'll add it to the FAQ. The patch implements
>>>> a basic 'transaction' mechanism in coordination with an outbound I/O
>>>> buffer (documented further down). With these two things in
>>>>place, split-brain is not possible because the destination is not running.
>>>>We don't allow the destination to resume execution until a committed
>>>>transaction has been acknowledged by the destination, and only then
>>>>do we allow any outbound network traffic to be released to the
>>>> outside world.
>>> Yeh I see the IO buffer, what I've not figured out is how:
>>>    1) MC over TCP/IP gets an acknowledge on the source to know when
>>>      it can unplug its buffer.
>> Only partially correct (See the steps on the wiki). There are two I/O
>> buffers at any given time which protect against a split-brain scenario:
>> One buffer for the current checkpoint that is being generated (running VM)
>> and one buffer for the checkpoint that is being committed in a transaction.
>>
>>>    2) Let's say the MC connection fails, so that ack never arrives,
>>>       the source must assume the destination has failed and release its
>>>       packets and carry on.
>> Only the packets for Buffer A are released for the current committed
>> checkpoint after a completed transaction. The packets for Buffer B
>> (the current running VM) are still being held up until the next
>> transaction starts.
>> Later once the transaction completes and A is released, B becomes the
>> new A and a new buffer is installed to become the new Buffer B for
>> the current running VM.
>>
>>
>>>       The destination must assume the source has failed and take over.
>> The destination must also receive an ACK. The ack goes both ways.
>>
> >>Only once the source and destination both acknowledge a completed
> >>transaction does the source VM resume execution - and even then
> >>its packets are still being buffered until the next transaction starts.
>> (That's why it's important to checkpoint as frequently as possible).
> I think I understand normal operation - my question here is about failure;
> what happens when neither side gets any ACKs.

Well, that's simple: If there is a failure of the source, the destination
will simply revert to the previous checkpoint using the same mode
of operation. The lost ACKs that you're curious about only
apply to the checkpoint that is in progress. Just because a
checkpoint is in progress does not mean that the previous checkpoint
is thrown away - it is already loaded into the destination's memory
and ready to be activated.

>
>>>    3) If we're relying on TCP/IP timeout that's quite long.
>>>
>> Actually, my experience has been that TCP seems to have more than
>> one kind of timeout - if the receiver is not responding *at all* - it seems
>> that TCP has a dedicated timer for that. The socket API immediately
>> sends back an error code and the patchset closes the connection
>> on the destination and recovers.
> How did you test that?
> My experience is that if a host knows that it has no route to the destination
> (e.g. it has no route to try, that matches the destination, because someone
> took the network interface away) you immediately get a 'no route to host',
> however if an intermediate link disappears then it takes a while to time out.

We have a script architecture (not on github) which runs MC in a tight
loop hundreds of times, kills the source QEMU, and timestamps how
quickly the destination QEMU loses the TCP socket connection and
receives an error code from the kernel - every single time, the
destination resumes nearly instantaneously.
I've not empirically seen a case where the socket just hangs or doesn't
change state.

I'm not very familiar with the internal Linux TCP/IP stack
implementation itself, but I have not had any problem with the
Linux socket API dependably shutting down the socket as soon
as possible.

The RDMA implementation uses a manual keepalive mechanism that
I had to write from scratch - but I never ported this to the TCP
implementation simply because failure detection always worked
fine without it.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-20 11:14             ` Li Guang
@ 2014-02-20 14:58               ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-02-20 14:58 UTC (permalink / raw)
  To: Li Guang, Dr. David Alan Gilbert
  Cc: SADEKJ, quintela, hinesmr, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, Michael R. Hines, gokul, dbulkow,
	pbonzini, BIRAN, isaku.yamahata

On 02/20/2014 07:14 PM, Li Guang wrote:
> Dr. David Alan Gilbert wrote:
>> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>>> I was just wondering if a separate 'max buffer size' knob would allow
>>>> you to more reasonably bound memory without setting policy; I don't 
>>>> think
>>>> people like having potentially x2 memory.
>>> Note: Checkpoint memory is not monotonic in this patchset (which
>>> is unique to this implementation). Only if the guest actually dirties
>>> 100% of its memory between one checkpoint and the next will
>>> the host experience 2x memory usage for a short period of time.
>> Right, but that doesn't really help - if someone comes along and says
>> 'How much memory do I need to be able to run an mc system?' the only
>> safe answer is 2x, otherwise we're adding a reason why the previously
>> stable guest might OOM.
>>
>
> so we may have to involve some disk operations
> to handle memory exhaustion.
>
> Thanks!

Like a cgroups memory limit, for example?
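
(For reference, a hedged sketch of what such a cap looks like with the
classic cgroup-v1 memory controller - the cgroup path is illustrative,
and QEMU's PID would have to be added to the cgroup separately:)

    #include <stdio.h>

    static int set_cgroup_memory_limit(const char *cgroup, long long bytes)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/memory/%s/memory.limit_in_bytes", cgroup);
        f = fopen(path, "w");
        if (!f) {
            return -1;
        }
        fprintf(f, "%lld\n", bytes);
        return fclose(f);
    }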

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-20 14:57             ` Michael R. Hines
@ 2014-02-20 16:32               ` Dr. David Alan Gilbert
  2014-02-21  4:54                 ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-02-20 16:32 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: SADEKJ, quintela, hinesmr, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, lig.fnst, gokul, dbulkow, pbonzini,
	BIRAN, isaku.yamahata, Michael R. Hines

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/20/2014 06:09 PM, Dr. David Alan Gilbert wrote:
> >* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> >>On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
> >>>I was just wondering if a separate 'max buffer size' knob would allow
> >>>you to more reasonably bound memory without setting policy; I don't think
> >>>people like having potentially x2 memory.
> >>Note: Checkpoint memory is not monotonic in this patchset (which
> >>is unique to this implementation). Only if the guest actually dirties
> >>100% of its memory between one checkpoint and the next will
> >>the host experience 2x memory usage for a short period of time.
> >Right, but that doesn't really help - if someone comes along and says
> >'How much memory do I need to be able to run an mc system?' the only
> >safe answer is 2x, otherwise we're adding a reason why the previously
> >stable guest might OOM.
> 
> Yes, exactly. Running MC is expensive and probably always will be,
> to some degree. Saving memory and having 100% fault tolerance
> are sometimes mutually exclusive. Expectations have to be managed here.

I'm happy to use more memory to get FT; all I'm trying to do is see
if it's possible to put a bound lower than 2x on it while still maintaining
full FT, at the expense of performance in the case where it uses
a lot of memory.

> The bottom line is: if you put a *hard* constraint on memory usage,
> what will happen to the guest when that garbage collection you mentioned
> shows up later and runs for several minutes? How about an hour?
> Are we just going to block the guest from being allowed to start a
> checkpoint until the memory usage goes down just for the sake of avoiding
> the 2x memory usage?

Yes, or move to the next checkpoint sooner than the N milliseconds when
we see the buffer is getting full.

> If you block the guest from being checkpointed,
> then what happens if there is a failure during that extended period?
> We will have saved memory at the expense of availability.

If the active machine fails during this time then the secondary carries
on from its last good snapshot in the knowledge that the active
never finished the new snapshot and so never uncorked its previous packets.

If the secondary machine fails during this time then the active drops
its nascent snapshot and carries on.

However, what you have made me realise is that I don't have an answer
for the memory usage on the secondary; while the primary can pause
its guest until the secondary acks the checkpoint, the secondary has
to rely on the primary not to send it huge checkpoints.

> The customer that is expecting 100% fault tolerance and the provider
> who is supporting it need to have an understanding that fault tolerance
> is not free and that constraining memory usage will adversely affect
> the VM's ability to be protected.
> 
> Do I understand your expectations correctly? Is fault tolerance
> something you're willing to sacrifice?

As above, no I'm willing to sacrifice performance but not fault tolerance.
(It is entirely possible that others would want the other trade off, i.e.
some minimum performance is worse than useless, so if we can't maintain
that performance then dropping FT leaves us in a more-working position).

> >>The patch has a 'slab' mechanism built in to it which implements
> >>a water-mark style policy that throws away unused portions of
> >>the 2x checkpoint memory if later checkpoints are much smaller
> >>(which is likely to be the case if the writable working set size changes).
> >>
> >>However, to answer your question: Such a knob could be achieved, but
> >>the same could be achieved simply by tuning the checkpoint frequency
> >>itself. Memory usage would thus be a function of the checkpoint frequency.
> >>If the guest application was maniacal, banging away at all the memory,
> >>there's very little that can be done in the first place, but if the
> >>guest application
> >>was mildly busy, you don't want to throw away your ability to be fault
> >>tolerant - you would just need more frequent checkpoints to keep up with
> >>the dirty rate.
> >I'm not convinced; I can tune my checkpoint frequency until normal operation
> >makes a reasonable trade off between mc frequency and RAM usage,
> >but that doesn't prevent it running away when a garbage collect or some
> >other thing suddenly dirties a load of ram in one particular checkpoint.
> >Some management tool that watches ram usage etc can also help tune
> >it, but in the end it can't stop it taking loads of RAM.
> 
> That's correct. See above comment....
> 
> >
> >>Once the application died down - the water-mark policy would kick in
> >>and start freeing checkpoint memory. (Note: this policy happens on
> >>both sides in the patchset because the patch has to be fully compatible
> >>with RDMA memory pinning).
> >>
> >>What is *not* exposed, however, are the watermark knobs themselves,
> >>I definitely think that needs to be exposed - that would also get
> >>you a similar
> >>control to 'max buffer size' - you could place a time limit on the
> >>slab list in the patch or something like that.......
> >>
> >>
> >>>>Good question in general - I'll add it to the FAQ. The patch implements
> >>>>a basic 'transaction' mechanism in coordination with an outbound I/O
> >>>>buffer (documented further down). With these two things in
> >>>>place, split-brain is not possible because the destination is not running.
> >>>>We don't allow the destination to resume execution until a committed
> >>>>transaction has been acknowledged by the destination, and only then
> >>>>do we allow any outbound network traffic to be released to the
> >>>>outside world.
> >>>Yeh I see the IO buffer, what I've not figured out is how:
> >>>   1) MC over TCP/IP gets an acknowledge on the source to know when
> >>>      it can unplug its buffer.
> >>Only partially correct (See the steps on the wiki). There are two I/O
> >>buffers at any given time which protect against a split-brain scenario:
> >>One buffer for the current checkpoint that is being generated (running VM)
> >>and one buffer for the checkpoint that is being committed in a transaction.
> >>
> >>>   2) Let's say the MC connection fails, so that ack never arrives,
> >>>      the source must assume the destination has failed and release its
> >>>      packets and carry on.
> >>Only the packets for Buffer A are released for the current committed
> >>checkpoint after a completed transaction. The packets for Buffer B
> >>(the current running VM) are still being held up until the next
> >>transaction starts.
> >>Later once the transaction completes and A is released, B becomes the
> >>new A and a new buffer is installed to become the new Buffer B for
> >>the current running VM.
> >>
> >>
> >>>      The destination must assume the source has failed and take over.
> >>The destination must also receive an ACK. The ack goes both ways.
> >>
> >>Only once the source and destination both acknowledge a completed
> >>transaction does the source VM resume execution - and even then
> >>its packets are still being buffered until the next transaction starts.
> >>(That's why it's important to checkpoint as frequently as possible).
> >I think I understand normal operation - my question here is about failure;
> >what happens when neither side gets any ACKs.
> 
> Well, that's simple: If there is a failure of the source, the destination
> will simply revert to the previous checkpoint using the same mode
> of operation. The lost ACKs that you're curious about only
> apply to the checkpoint that is in progress. Just because a
> checkpoint is in progress does not mean that the previous checkpoint
> is thrown away - it is already loaded into the destination's memory
> and ready to be activated.

I still don't see why, if the link between them fails, the destination
doesn't fall back to its previous checkpoint, AND the source carries
on running - I don't see how they can differentiate which of them has failed.

> >>>   3) If we're relying on TCP/IP timeout that's quite long.
> >>>
> >>Actually, my experience has been that TCP seems to have more than
> >>one kind of timeout - if the receiver is not responding *at all* - it seems
> >>that TCP has a dedicated timer for that. The socket API immediately
> >>sends back an error code and the patchset closes the connection
> >>on the destination and recovers.
> >How did you test that?
> >My experience is that if a host knows that it has no route to the destination
> >(e.g. it has no route to try, that matches the destination, because someone
> >took the network interface away) you immediately get a 'no route to host',
> >however if an intermediate link disappears then it takes a while to time out.
> 
> We have a script architecture (not on github) which runs MC in a tight
> loop hundreds of times, kills the source QEMU, and timestamps how
> quickly the destination QEMU loses the TCP socket connection and
> receives an error code from the kernel - every single time, the
> destination resumes nearly instantaneously.
> I've not empirically seen a case where the socket just hangs or doesn't
> change state.
> 
> I'm not very familiar with the internal Linux TCP/IP stack
> implementation itself, but I have not had any problem with the
> Linux socket API dependably shutting down the socket as soon
> as possible.

OK, that only covers a very small range of normal failures.
When you kill the destination QEMU the host OS knows that QEMU is dead
and sends a packet back closing the socket, hence the source knows
the destination is dead very quickly.
If:
   a) The destination machine was to lose power or hang
   b) Or a network link fail  (other than the one attached to the source
      possibly)

the source would have to do a full TCP timeout.

To test a,b I'd use an iptables rule somewhere to cause the packets to
be dropped (not rejected).  Stopping the qemu in gdb might be good enough.

> The RDMA implementation uses a manual keepalive mechanism that
> I had to write from scratch - but I never ported this to the TCP
> implementation simply because failure detection always worked
> fine without it.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-20 16:32               ` Dr. David Alan Gilbert
@ 2014-02-21  4:54                 ` Michael R. Hines
  2014-02-21  9:44                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-02-21  4:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: SADEKJ, quintela, hinesmr, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, lig.fnst, gokul, dbulkow, pbonzini,
	BIRAN, isaku.yamahata, Michael R. Hines

On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
>
> I'm happy to use more memory to get FT; all I'm trying to do is see
> if it's possible to put a bound lower than 2x on it while still maintaining
> full FT, at the expense of performance in the case where it uses
> a lot of memory.
>
>> The bottom line is: if you put a *hard* constraint on memory usage,
>> what will happen to the guest when that garbage collection you mentioned
>> shows up later and runs for several minutes? How about an hour?
>> Are we just going to block the guest from being allowed to start a
>> checkpoint until the memory usage goes down just for the sake of avoiding
>> the 2x memory usage?
> Yes, or move to the next checkpoint sooner than the N milliseconds when
> we see the buffer is getting full.

OK, I see there is definitely some common ground there: So to be
more specific, what we really need is two things: (I've learned that
the reviewers are very cautious about adding too much policy into
QEMU itself, but let's iron this out anyway:)

1. First, we need to throttle down the guest (QEMU can already do this
     using the recently introduced "auto-converge" feature). This means
     that the guest is still making forward progress, albeit slow progress.

2. Then we would need some kind of policy, or better yet, a trigger that
     does something to the effect of "we're about to use a whole lot of
     checkpoint memory soon - can we afford this much memory usage".
     Such a trigger would be conditional on the current policy of the
     administrator or management software: We would have a QMP
     command with a boolean flag that says "Yes" or "No" - whether it is
     tolerable to use that much memory in the next checkpoint (see the
     sketch after this list).

     If the answer is "Yes", then nothing changes.
     If the answer is "No", then we should either:
        a) Throttle down the guest
        b) Adjust the checkpoint frequency
        c) Or pause it altogether while we migrate some other VMs off the
           host such that we can complete the next checkpoint in its entirety.
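
A rough sketch of such a trigger (everything below - the names, the
MCState fields, the policy enum - is hypothetical, purely to make the
control flow concrete; it is not code from the patchset):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum {
        MC_POLICY_THROTTLE,   /* a) throttle down the guest */
        MC_POLICY_DEFER,      /* b) adjust the checkpoint frequency */
        MC_POLICY_PAUSE,      /* c) pause until management intervenes */
    } MCPolicyAction;

    typedef struct {
        uint64_t dirty_pages;
        uint64_t page_size;
        uint64_t memory_limit;        /* 0 means "no limit" */
        uint64_t checkpoint_delay_ms;
        MCPolicyAction policy_action;
    } MCState;

    void mc_throttle_guest(MCState *s);  /* hypothetical, e.g. auto-converge */
    void mc_pause_guest(MCState *s);     /* hypothetical */

    static bool mc_checkpoint_allowed(MCState *s)
    {
        uint64_t needed = s->dirty_pages * s->page_size;

        if (!s->memory_limit || needed <= s->memory_limit) {
            return true;                     /* "Yes": proceed as usual */
        }
        switch (s->policy_action) {          /* "No": apply a remedy */
        case MC_POLICY_THROTTLE:
            mc_throttle_guest(s);
            return true;
        case MC_POLICY_DEFER:
            s->checkpoint_delay_ms *= 2;     /* lower the frequency */
            return false;
        case MC_POLICY_PAUSE:
        default:
            mc_pause_guest(s);
            return false;
        }
    }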

It's not clear to me how much (if any) of this control loop should
be in QEMU or in the management software, but I would definitely agree
that, at a minimum, the ability to detect the situation and remedy
it should be in QEMU. I'm not entirely convinced that the
ability to *decide* to remedy the situation should be in QEMU, though.


>
>> If you block the guest from being checkpointed,
>> then what happens if there is a failure during that extended period?
>> We will have saved memory at the expense of availability.
> If the active machine fails during this time then the secondary carries
> on from its last good snapshot in the knowledge that the active
> never finished the new snapshot and so never uncorked its previous packets.
>
> If the secondary machine fails during this time then the active drops
> its nascent snapshot and carries on.

Yes, that makes sense. Where would that policy go, though,
continuing the above concern?

> However, what you have made me realise is that I don't have an answer
> for the memory usage on the secondary; while the primary can pause
> its guest until the secondary acks the checkpoint, the secondary has
> to rely on the primary not to send it huge checkpoints.

Good question: There are a lot of ideas out there in the academic
community to compress the secondary, or push the secondary to
a flash-based device, or de-duplicate the secondary. I'm sure any
of them would put a dent in the problem, but I'm not seeing a smoking
gun solution that would absolutely save all that memory completely.

(Personally, I don't believe in swap. I wouldn't even consider swap
or any kind of traditional disk-based remedy to be a viable solution).

>> The customer that is expecting 100% fault tolerance and the provider
>> who is supporting it need to have an understanding that fault tolerance
>> is not free and that constraining memory usage will adversely affect
>> the VM's ability to be protected.
>>
>> Do I understand your expectations correctly? Is fault tolerance
>> something you're willing to sacrifice?
> As above, no I'm willing to sacrifice performance but not fault tolerance.
> (It is entirely possible that others would want the other trade off, i.e.
> some minimum performance is worse than useless, so if we can't maintain
> that performance then dropping FT leaves us in a more-working position).
>

Agreed - I think a "proactive" failover in this case would solve the 
problem.
If we observed that availability/fault tolerance was going to be at
risk soon (which is relatively easy to detect) - we could just *force*
a failover to the secondary host and restart the protection from
scratch.


>>
>> Well, that's simple: If there is a failure of the source, the destination
>> will simply revert to the previous checkpoint using the same mode
>> of operation. The lost ACKs that you're curious about only
>> apply to the checkpoint that is in progress. Just because a
>> checkpoint is in progress does not mean that the previous checkpoint
>> is thrown away - it is already loaded into the destination's memory
>> and ready to be activated.
> I still don't see why, if the link between them fails, the destination
> doesn't fall back to its previous checkpoint, AND the source carries
> on running - I don't see how they can differentiate which of them has failed.

I think you're forgetting that the source I/O is buffered - it doesn't
matter that the source VM is still running. As long as its output is
buffered - it cannot have any non-fault-tolerant effect on the outside
world.

In the future, if a technician accesses the machine or the network
is restored, the management software can terminate the stale
source virtual machine.

>> We have a script architecture (not on github) which runs MC in a tight
>> loop hundreds of times and kills the source QEMU and timestamps how
>> quickly the
>> destination QEMU loses the TCP socket connection receives an error code
>> from the kernel - every single time, the destination resumes nearly
>> instantaneously.
>> I've not empirically seen a case where the socket just hangs or doesn't
>> change state.
>>
>> I'm not very familiar with the internal linux TCP/IP stack
>> implementation itself,
>> but I have not had a problem with the dependability of the linux socket
>> not being able to shutdown the socket as soon as possible.
> OK, that only covers a very small range of normal failures.
> When you kill the destination QEMU the host OS knows that QEMU is dead
> and sends a packet back closing the socket, hence the source knows
> the destination is dead very quickly.
> If:
>     a) The destination machine was to lose power or hang
>     b) Or a network link fail  (other than the one attached to the source
>        possibly)
>
> the source would have to do a full TCP timeout.
>
> To test a,b I'd use an iptables rule somewhere to cause the packets to
> be dropped (not rejected).  Stopping the qemu in gdb might be good enough.

Very good idea - I'll add that to the "todo" list of things to do
in my test infrastructure. It may indeed turn out to be necessary
to add a formal keepalive between the source and destination.
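
For the TCP path, a minimal sketch of what bounding detection time
could look like with standard Linux socket options (values are
illustrative; TCP_USER_TIMEOUT is Linux-specific):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static void mc_bound_failure_detection(int fd)
    {
        int on = 1, idle = 1, intvl = 1, cnt = 3;
        unsigned int user_timeout_ms = 5000;

        /* Probe an idle connection aggressively... */
        setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
        /* ...and cap how long unacknowledged data may remain in flight. */
        setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   &user_timeout_ms, sizeof(user_timeout_ms));
    }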

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-19  1:00   ` Li Guang
  2014-02-19  2:14     ` Michael R. Hines
  2014-02-20  5:03     ` Michael R. Hines
@ 2014-02-21  8:13     ` Michael R. Hines
  2014-02-24  6:48       ` Li Guang
  2 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-02-21  8:13 UTC (permalink / raw)
  To: Li Guang
  Cc: GILR, SADEKJ, pbonzini, quintela, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, Michael R. Hines, gokul, dbulkow,
	hinesmr, BIRAN, isaku.yamahata

On 02/19/2014 09:00 AM, Li Guang wrote:
> Hi,
>
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>
>> This patch sets up the initial changes to the migration state
>> machine and prototypes to be used by the checkpointing code
>> to interact with the state machine so that we can later handle
>> failure and recovery scenarios.
>>
>> Signed-off-by: Michael R. Hines<mrhines@us.ibm.com>
>> ---
>>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>>   include/migration/migration.h |  2 ++
>>   migration.c                   | 37 +++++++++++++++++++++----------------
>>   3 files changed, 47 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index db75120..e9d4d9e 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>>       migration_end();
>>   }
>>
>> -static void reset_ram_globals(void)
>> +static void reset_ram_globals(bool reset_bulk_stage)
>>   {
>>       last_seen_block = NULL;
>>       last_sent_block = NULL;
>>       last_offset = 0;
>>       last_version = ram_list.version;
>> -    ram_bulk_stage = true;
>> +    ram_bulk_stage = reset_bulk_stage;
>>   }
>>
>
> there is a chance that ram_save_block will never break out of its while loop
> if last_seen_block is reset for MC when there are no dirty pages
> to be migrated.
>
> Thanks!
>

This bug is fixed now - you can re-pull from github.com.

     Believe it or not, when there are no network devices attached to the
     guest whatsoever, the initial bootup process can be extremely slow:
     almost no processes are dirtying memory at all, or only occasionally
     (maybe a DHCP client). This results in some 100ms periods of time
     where there are actually *no* dirty pages - hard to believe, but it
     does happen.

     ram_save_block() really doesn't understand this possibility,
     surprisingly. It results in an infinite loop because it was expecting
     last_seen_block to always be non-NULL, when in fact we have reset
     the value so that the scan can start from the beginning and cover
     the entire VM looking for dirty memory.
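
The shape of the fix is a termination guard on the scan - below is a
simplified, self-contained sketch (types and names are illustrative,
not QEMU's actual code): after exactly one full pass over the
circularly-traversed block list with no dirty page found, give up
instead of spinning.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Block {
        struct Block *next;
    } Block;

    bool block_has_dirty_pages(const Block *b);   /* hypothetical predicate */

    static Block *find_next_dirty(Block *list, Block *last_seen)
    {
        Block *start = last_seen ? last_seen : list;
        Block *b = start;

        do {
            if (block_has_dirty_pages(b)) {
                return b;
            }
            b = b->next ? b->next : list;    /* wrap around */
        } while (b != start);

        return NULL;   /* no dirty pages this round: caller must not spin */
    }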


>>   #define MAX_WAIT 50 /* ms, half buffered_file limit */
>> @@ -674,6 +674,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>       RAMBlock *block;
>>       int64_t ram_pages = last_ram_offset()>> TARGET_PAGE_BITS;
>>
>> +    /*
>> +     * RAM stays open during micro-checkpointing for the next transaction.
>> +     */
>> +    if (migration_is_mc(migrate_get_current())) {
>> +        qemu_mutex_lock_ramlist();
>> +        reset_ram_globals(false);
>> +        goto skip_setup;
>> +    }
>> +
>>       migration_bitmap = bitmap_new(ram_pages);
>>       bitmap_set(migration_bitmap, 0, ram_pages);
>>       migration_dirty_pages = ram_pages;
>> @@ -710,12 +719,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>       qemu_mutex_lock_iothread();
>>       qemu_mutex_lock_ramlist();
>>       bytes_transferred = 0;
>> -    reset_ram_globals();
>> +    reset_ram_globals(true);
>>
>>       memory_global_dirty_log_start();
>>       migration_bitmap_sync();
>>       qemu_mutex_unlock_iothread();
>>
>> +skip_setup:
>> +
>>       qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
>>
>>       QTAILQ_FOREACH(block,&ram_list.blocks, next) {
>> @@ -744,7 +755,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>       qemu_mutex_lock_ramlist();
>>
>>       if (ram_list.version != last_version) {
>> -        reset_ram_globals();
>> +        reset_ram_globals(true);
>>       }
>>
>>       ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>> @@ -825,7 +836,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>       }
>>
>>       ram_control_after_iterate(f, RAM_CONTROL_FINISH);
>> -    migration_end();
>> +
>> +    /*
>> +     * Only cleanup at the end of normal migrations
>> +     * or if the MC destination failed and we got an error.
>> +     * Otherwise, we are (or will soon be) in MIG_STATE_CHECKPOINTING.
>> +     */
>> +    if(!migrate_use_mc() || migration_has_failed(migrate_get_current())) {
>> +        migration_end();
>> +    }
>>
>>       qemu_mutex_unlock_ramlist();
>>       qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>> diff --git a/include/migration/migration.h 
>> b/include/migration/migration.h
>> index a7c54fe..e876a2c 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -101,7 +101,9 @@ int migrate_fd_close(MigrationState *s);
>>
>>   void add_migration_state_change_notifier(Notifier *notify);
>>   void remove_migration_state_change_notifier(Notifier *notify);
>> +bool migration_is_active(MigrationState *);
>>   bool migration_in_setup(MigrationState *);
>> +bool migration_is_mc(MigrationState *s);
>>   bool migration_has_finished(MigrationState *);
>>   bool migration_has_failed(MigrationState *);
>>   MigrationState *migrate_get_current(void);
>> diff --git a/migration.c b/migration.c
>> index 25add6f..f42dae4 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -36,16 +36,6 @@
>>       do { } while (0)
>>   #endif
>>
>> -enum {
>> -    MIG_STATE_ERROR = -1,
>> -    MIG_STATE_NONE,
>> -    MIG_STATE_SETUP,
>> -    MIG_STATE_CANCELLING,
>> -    MIG_STATE_CANCELLED,
>> -    MIG_STATE_ACTIVE,
>> -    MIG_STATE_COMPLETED,
>> -};
>> -
>>   #define MAX_THROTTLE  (32<<  20)      /* Migration speed throttling */
>>
>>   /* Amount of time to allocate to each "chunk" of bandwidth-throttled
>> @@ -273,7 +263,7 @@ void 
>> qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>>       MigrationState *s = migrate_get_current();
>>       MigrationCapabilityStatusList *cap;
>>
>> -    if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP) {
>> +    if (migration_is_active(s)) {
>>           error_set(errp, QERR_MIGRATION_ACTIVE);
>>           return;
>>       }
>> @@ -285,7 +275,13 @@ void 
>> qmp_migrate_set_capabilities(MigrationCapabilityStatusList *params,
>>
>>   /* shared migration helpers */
>>
>> -static void migrate_set_state(MigrationState *s, int old_state, int 
>> new_state)
>> +bool migration_is_active(MigrationState *s)
>> +{
>> +    return (s->state == MIG_STATE_ACTIVE) || s->state == 
>> MIG_STATE_SETUP
>> +            || s->state == MIG_STATE_CHECKPOINTING;
>> +}
>> +
>> +void migrate_set_state(MigrationState *s, int old_state, int new_state)
>>   {
>>       if (atomic_cmpxchg(&s->state, old_state, new_state) == 
>> new_state) {
>>           trace_migrate_set_state(new_state);
>> @@ -309,7 +305,7 @@ static void migrate_fd_cleanup(void *opaque)
>>           s->file = NULL;
>>       }
>>
>> -    assert(s->state != MIG_STATE_ACTIVE);
>> +    assert(!migration_is_active(s));
>>
>>       if (s->state != MIG_STATE_COMPLETED) {
>>           qemu_savevm_state_cancel();
>> @@ -356,7 +352,12 @@ void 
>> remove_migration_state_change_notifier(Notifier *notify)
>>
>>   bool migration_in_setup(MigrationState *s)
>>   {
>> -    return s->state == MIG_STATE_SETUP;
>> +        return s->state == MIG_STATE_SETUP;
>> +}
>> +
>> +bool migration_is_mc(MigrationState *s)
>> +{
>> +        return s->state == MIG_STATE_CHECKPOINTING;
>>   }
>>
>>   bool migration_has_finished(MigrationState *s)
>> @@ -419,7 +420,8 @@ void qmp_migrate(const char *uri, bool has_blk, 
>> bool blk,
>>       params.shared = has_inc&&  inc;
>>
>>       if (s->state == MIG_STATE_ACTIVE || s->state == MIG_STATE_SETUP ||
>> -        s->state == MIG_STATE_CANCELLING) {
>> +        s->state == MIG_STATE_CANCELLING
>> +         || s->state == MIG_STATE_CHECKPOINTING) {
>>           error_set(errp, QERR_MIGRATION_ACTIVE);
>>           return;
>>       }
>> @@ -624,7 +626,10 @@ static void *migration_thread(void *opaque)
>>                   }
>>
>>                   if (!qemu_file_get_error(s->file)) {
>> -                    migrate_set_state(s, MIG_STATE_ACTIVE, 
>> MIG_STATE_COMPLETED);
>> +                    if (!migrate_use_mc()) {
>> +                        migrate_set_state(s,
>> +                            MIG_STATE_ACTIVE, MIG_STATE_COMPLETED);
>> +                    }
>>                       break;
>>                   }
>>               }
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-21  4:54                 ` Michael R. Hines
@ 2014-02-21  9:44                   ` Dr. David Alan Gilbert
  2014-03-03  6:08                     ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-02-21  9:44 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: SADEKJ, pbonzini, quintela, BIRAN, qemu-devel, EREZH, owasserm,
	onom, hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang,
	abali, lig.fnst, Michael R. Hines

* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
> >
> >I'm happy to use more memory to get FT, all I'm trying to do is see
> >if it's possible to put a lower bound than 2x on it while still maintaining
> >full FT, at the expense of performance in the case where it uses
> >a lot of memory.
> >
> >>The bottom line is: if you put a *hard* constraint on memory usage,
> >>what will happen to the guest when that garbage collection you mentioned
> >>shows up later and runs for several minutes? How about an hour?
> >>Are we just going to block the guest from being allowed to start a
> >>checkpoint until the memory usage goes down just for the sake of avoiding
> >>the 2x memory usage?
> >Yes, or move to the next checkpoint sooner than the N milliseconds when
> >we see the buffer is getting full.
> 
> OK, I see there is definitely some common ground there: So to be
> more specific, what we really need is two things: (I've learned that
> the reviewers are very cautious about adding too much policy into
> QEMU itself, but let's iron this out anyway:)
> 
> 1. First, we need to throttle down the guest (QEMU can already do this
>     using the recently introduced "auto-converge" feature). This means
>     that the guest is still making forward progress, albeit slow progress.
> 
> 2. Then we would need some kind of policy, or better yet, a trigger that
>     does something to the effect of "we're about to use a whole lot of
>     checkpoint memory soon - can we afford this much memory usage".
>     Such a trigger would be conditional on the current policy of the
>     administrator or management software: We would have a QMP command
>     with a boolean flag that says whether or not it is tolerable to
>     use that much memory in the next checkpoint.
> 
>     If the answer is "Yes", then nothing changes.
>     If the answer is "No", then we should either:
>        a) throttle down the guest
>        b) Adjust the checkpoint frequency
>        c) Or pause it altogether while we migrate some other VMs off the
>            host such that we can complete the next checkpoint in its
> entirety.

Yes I think so, although what I was thinking was mainly (b) possibly
to the point of not starting the next checkpoint.

> It's not clear to me how much (if any) of this control loop should
> be in QEMU or in the management software, but I would definitely agree
> that a minimum of at least the ability to detect the situation and remedy
> the situation should be in QEMU. I'm not entirely convinced that the
> ability to *decide* to remedy the situation should be in QEMU, though.

The management software access is low frequency, high latency; it should
be setting general parameters (max memory allowed, desired checkpoint
frequency etc) but I don't see that we can use it to do anything on
a sooner than a few second basis; so yes it can monitor things and
tweak the knobs if it sees the host as a whole is getting tight on RAM
etc - but we can't rely on it to throw on the brakes if this guest
suddenly decides to take bucket loads of RAM; something has to react
quickly in relation to previously set limits.

> >>If you block the guest from being checkpointed,
> >>then what happens if there is a failure during that extended period?
> >>We will have saved memory at the expense of availability.
> >If the active machine fails during this time then the secondary carries
> >on from its last good snapshot in the knowledge that the active
> >never finished the new snapshot and so never uncorked its previous packets.
> >
> >If the secondary machine fails during this time then the active drops
> >its nascent snapshot and carries on.
> 
> Yes, that makes sense. Where would that policy go, though,
> continuing the above concern?

I think there has to be some input from the management layer for failover,
because (as per my split-brain concerns) something has to make the decision
about which of the source/destination is to take over, and I don't
believe individual instances have that information.

> >However, what you have made me realise is that I don't have an answer
> >for the memory usage on the secondary; while the primary can pause
> >it's guest until the secondary ack's the checkpoint, the secondary has
> >to rely on the primary not to send it huge checkpoints.
> 
> Good question: there are a lot of ideas out there in the academic
> community to compress the secondary, push the secondary to
> a flash-based device, or de-duplicate the secondary. I'm sure any
> of them would put a dent in the problem, but I'm not seeing a smoking
> gun solution that would absolutely save all that memory completely.

Ah, I was thinking that flash would be a good solution for secondary;
it would be a nice demo.

> (Personally, I don't believe in swap. I wouldn't even consider swap
> or any kind of traditional disk-based remedy to be a viable solution).

Well it certainly exists - I've seen it!
Swap works well in limited circumstances; but as soon as you've got
multiple VMs fighting over something with 10s of ms latency you're doomed.

> >>The customer that is expecting 100% fault tolerance and the provider
> >>who is supporting it need to have an understanding that fault tolerance
> >>is not free and that constraining memory usage will adversely affect
> >>the VM's ability to be protected.
> >>
> >>Do I understand your expectations correctly? Is fault tolerance
> >>something you're willing to sacrifice?
> >As above, no I'm willing to sacrifice performance but not fault tolerance.
> >(It is entirely possible that others would want the other trade off, i.e.
> >some minimum performance is worse than useless, so if we can't maintain
> >that performance then dropping FT leaves us in a more-working position).
> >
> 
> Agreed - I think a "proactive" failover in this case would solve the
> problem.
> If we observed that availability/fault tolerance was going to be at
> risk soon (which is relatively easy to detect) - we could just *force*
> a failover to the secondary host and restart the protection from
> scratch.
> 
> 
> >>
> >>Well, that's simple: If there is a failure of the source, the destination
> >>will simply revert to the previous checkpoint using the same mode
> >>of operation. The lost ACKs that you're curious about only
> >>apply to the checkpoint that is in progress. Just because a
> >>checkpoint is in progress does not mean that the previous checkpoint
> >>is thrown away - it is already loaded into the destination's memory
> >>and ready to be activated.
> >I still don't see why, if the link between them fails, the destination
> >doesn't fall back to its previous checkpoint, AND the source carries
> >on running - I don't see how they can differentiate which of them has failed.
> 
> I think you're forgetting that the source I/O is buffered - it doesn't
> matter that the source VM is still running. As long as its output is
> buffered - it cannot have any non-fault-tolerant effect on the outside
> world.
> 
> In the future, if a technician accesses the machine or the network
> is restored, the management software can terminate the stale
> source virtual machine.

I think going with my comment above; I'm working on the basis it's just
as likely for the destination to fail as it is for the source to fail,
and a destination failure shouldn't kill the source; and in the case
of a destination failure the source is going to have to let its buffered
I/Os start going again.

> >>We have a script architecture (not on github) which runs MC in a tight
> >>loop hundreds of times, kills the source QEMU, and timestamps how
> >>quickly the destination QEMU loses the TCP socket connection and
> >>receives an error code from the kernel - every single time, the
> >>destination resumes nearly instantaneously.
> >>I've not empirically seen a case where the socket just hangs or doesn't
> >>change state.
> >>
> >>I'm not very familiar with the internal Linux TCP/IP stack
> >>implementation itself,
> >>but I have not had a problem with the dependability of the Linux
> >>socket shutting down as soon as possible.
> >OK, that only covers a very small range of normal failures.
> >When you kill the destination QEMU the host OS knows that QEMU is dead
> >and sends a packet back closing the socket, hence the source knows
> >the destination is dead very quickly.
> >If:
> >    a) The destination machine was to lose power or hang
> >    b) Or a network link fail  (other than the one attached to the source
> >       possibly)
> >
> >the source would have to do a full TCP timeout.
> >
> >To test a,b I'd use an iptables rule somewhere to cause the packets to
> >be dropped (not rejected).  Stopping the qemu in gdb might be good enough.
> 
> Very good idea - I'll add that to the "todo" list of things to do
> in my test infrastructure. It may indeed turn out to be necessary
> to add a formal keepalive between the source and destination.
> 
> - Michael
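
(For reference, a user-level keepalive of the sort discussed could be
as small as enabling the kernel's TCP keepalive on the migration
socket - a sketch using standard Linux socket options with arbitrary
timer values; the packet-drop rule above is an easy way to verify
that it actually fires:)

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Declare the peer dead after roughly idle + cnt * intvl seconds
     * of silence.  To simulate failures (a)/(b) without pulling
     * cables: iptables -A INPUT -p tcp --sport <port> -j DROP */
    static int socket_enable_keepalive(int fd)
    {
        int on = 1, idle = 1, intvl = 1, cnt = 5;

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt))) {
            return -1;
        }
        return 0;
    }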

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-21  8:13     ` Michael R. Hines
@ 2014-02-24  6:48       ` Li Guang
  2014-02-26  2:52         ` Li Guang
  0 siblings, 1 reply; 68+ messages in thread
From: Li Guang @ 2014-02-24  6:48 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: GILR, SADEKJ, pbonzini, quintela, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, Michael R. Hines, gokul, dbulkow,
	hinesmr, BIRAN, isaku.yamahata

Michael R. Hines wrote:
> On 02/19/2014 09:00 AM, Li Guang wrote:
>> Hi,
>>
>> mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>>
>>> This patch sets up the initial changes to the migration state
>>> machine and prototypes to be used by the checkpointing code
>>> to interact with the state machine so that we can later handle
>>> failure and recovery scenarios.
>>>
>>> Signed-off-by: Michael R. Hines<mrhines@us.ibm.com>
>>> ---
>>>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>>>   include/migration/migration.h |  2 ++
>>>   migration.c                   | 37 
>>> +++++++++++++++++++++----------------
>>>   3 files changed, 47 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/arch_init.c b/arch_init.c
>>> index db75120..e9d4d9e 100644
>>> --- a/arch_init.c
>>> +++ b/arch_init.c
>>> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>>>       migration_end();
>>>   }
>>>
>>> -static void reset_ram_globals(void)
>>> +static void reset_ram_globals(bool reset_bulk_stage)
>>>   {
>>>       last_seen_block = NULL;
>>>       last_sent_block = NULL;
>>>       last_offset = 0;
>>>       last_version = ram_list.version;
>>> -    ram_bulk_stage = true;
>>> +    ram_bulk_stage = reset_bulk_stage;
>>>   }
>>>
>>
>> there is a chance that ram_save_block() will never break out of its
>> while loop if last_seen_block is reset for MC when there are no dirty
>> pages to be migrated.
>>
>> Thanks!
>>
>
> This bug is fixed now - you can re-pull from github.com.
>
>     Believe it or not, when there are no network devices attached to the
>     guest whatsoever, the initial bootup process can be extremely slow,
>     with almost no processes dirtying memory at all, or only
>     occasionally, except for maybe a DHCP client. This results in
>     some 100ms periods of time where there are actually *no* dirty
>     pages - hard to believe, but it does happen.

sorry, it seems all my pull requests to github were blocked;
let me check it later.

Thanks!

> [...]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-24  6:48       ` Li Guang
@ 2014-02-26  2:52         ` Li Guang
  0 siblings, 0 replies; 68+ messages in thread
From: Li Guang @ 2014-02-26  2:52 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: GILR, SADEKJ, pbonzini, quintela, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, abali, Michael R. Hines, gokul, dbulkow,
	hinesmr, BIRAN, isaku.yamahata

Li Guang wrote:
> Michael R. Hines wrote:
>> On 02/19/2014 09:00 AM, Li Guang wrote:
>>> Hi,
>>>
>>> mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines"<mrhines@us.ibm.com>
>>>>
>>>> This patch sets up the initial changes to the migration state
>>>> machine and prototypes to be used by the checkpointing code
>>>> to interact with the state machine so that we can later handle
>>>> failure and recovery scenarios.
>>>>
>>>> Signed-off-by: Michael R. Hines<mrhines@us.ibm.com>
>>>> ---
>>>>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>>>>   include/migration/migration.h |  2 ++
>>>>   migration.c                   | 37 
>>>> +++++++++++++++++++++----------------
>>>>   3 files changed, 47 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/arch_init.c b/arch_init.c
>>>> index db75120..e9d4d9e 100644
>>>> --- a/arch_init.c
>>>> +++ b/arch_init.c
>>>> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>>>>       migration_end();
>>>>   }
>>>>
>>>> -static void reset_ram_globals(void)
>>>> +static void reset_ram_globals(bool reset_bulk_stage)
>>>>   {
>>>>       last_seen_block = NULL;
>>>>       last_sent_block = NULL;
>>>>       last_offset = 0;
>>>>       last_version = ram_list.version;
>>>> -    ram_bulk_stage = true;
>>>> +    ram_bulk_stage = reset_bulk_stage;
>>>>   }
>>>>
>>>
>>> there is a chance that ram_save_block() will never break out of its
>>> while loop if last_seen_block is reset for MC when there are no dirty
>>> pages to be migrated.
>>>
>>> Thanks!
>>>
>>
>> This bug is fixed now - you can re-pull from github.com.
>>
>>     Believe it or not, when there are no network devices attached to the
>>     guest whatsoever, the initial bootup process can be extremely slow,
>>     with almost no processes dirtying memory at all, or only
>>     occasionally, except for maybe a DHCP client. This results in
>>     some 100ms periods of time where there are actually *no* dirty
>>     pages - hard to believe, but it does happen.
>
> sorry, it seems all my pull requests to github were blocked;
> let me check it later.
>
> Thanks!
>

tested, works well for me.

Thanks!

>> [...]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
  2014-02-21  9:44                   ` Dr. David Alan Gilbert
@ 2014-03-03  6:08                     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-03-03  6:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: SADEKJ, pbonzini, quintela, BIRAN, qemu-devel, EREZH, owasserm,
	onom, hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang,
	abali, lig.fnst, Michael R. Hines

On 02/21/2014 05:44 PM, Dr. David Alan Gilbert wrote:
>> It's not clear to me how much (if any) of this control loop should
>> be in QEMU or in the management software, but I would definitely agree
>> that a minimum of at least the ability to detect the situation and remedy
>> the situation should be in QEMU. I'm not entirely convinced that the
>> ability to *decide* to remedy the situation should be in QEMU, though.
> The management software access is low frequency, high latency; it should
> be setting general parameters (max memory allowed, desired checkpoint
> frequency etc) but I don't see that we can use it to do anything on
> a sooner than a few second basis; so yes it can monitor things and
> tweak the knobs if it sees the host as a whole is getting tight on RAM
> etc - but we can't rely on it to throw on the brakes if this guest
> suddenly decides to take bucket loads of RAM; something has to react
> quickly in relation to previously set limits.

I agree - the boolean flag I mentioned previously would do just
that: setting the flag (or a state, perhaps, instead of a boolean)
would indicate to QEMU which particular type of sacrifice to make:

A flag of "0" might mean "Throttle the guest in an emergency"
A flag of "1" might mean "Throttling is not acceptable, just let the 
guest use the extra memory"
A flag of "2" might mean "Neither one is acceptable, fail now and inform 
the management software to restart somewhere else".

Or something to that effect - roughly sketched below.
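
As a sketch only - every name and all of the plumbing here are
hypothetical, none of it exists in the series yet:

typedef enum {
    MC_MEM_POLICY_THROTTLE = 0, /* throttle the guest in an emergency */
    MC_MEM_POLICY_ALLOW    = 1, /* just let the guest use the memory */
    MC_MEM_POLICY_FAIL     = 2, /* fail now; management restarts FT */
} MCMemPolicy;

/* Hypothetical check before each checkpoint, given a size estimate. */
static int mc_check_memory_budget(MigrationState *s, uint64_t next_mc_bytes)
{
    if (next_mc_bytes <= s->mc_max_bytes) {
        return 0;                     /* within budget, proceed */
    }
    switch (s->mc_mem_policy) {
    case MC_MEM_POLICY_THROTTLE:
        mc_throttle_guest(s);         /* e.g. reuse auto-converge */
        return 0;
    case MC_MEM_POLICY_ALLOW:
        return 0;                     /* knowingly exceed the budget */
    case MC_MEM_POLICY_FAIL:
    default:
        migrate_set_state(s, MIG_STATE_CHECKPOINTING, MIG_STATE_ERROR);
        return -1;                    /* a QMP event alerts management */
    }
}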

>>>> If you block the guest from being checkpointed,
>>>> then what happens if there is a failure during that extended period?
>>>> We will have saved memory at the expense of availability.
>>> If the active machine fails during this time then the secondary carries
>>> on from its last good snapshot in the knowledge that the active
>>> never finished the new snapshot and so never uncorked its previous packets.
>>>
>>> If the secondary machine fails during this time then the active drops
>>> its nascent snapshot and carries on.
>> Yes, that makes sense. Where would that policy go, though,
>> continuing the above concern?
> I think there has to be some input from the management layer for failover,
> because (as per my split-brain concerns) something has to make the decision
> about which of the source/destination is to take over, and I don't
> believe individual instances have that information.

Agreed - so the "ability" (as hinted at above) should be in QEMU,
but the decision to recover from the situation probably should not
be, where "recover" is defined as the VM is back in a fully running,
fully fault-tolerant protected state (potentially where the source VM
is on a different machine than it was before).

>
>>>> Well, that's simple: If there is a failure of the source, the destination
>>>> will simply revert to the previous checkpoint using the same mode
>>>> of operation. The lost ACKs that you're curious about only
>>>> apply to the checkpoint that is in progress. Just because a
>>>> checkpoint is in progress does not mean that the previous checkpoint
>>>> is thrown away - it is already loaded into the destination's memory
>>>> and ready to be activated.
>>> I still don't see why, if the link between them fails, the destination
>>> doesn't fall back to its previous checkpoint, AND the source carries
>>> on running - I don't see how they can differentiate which of them has failed.
>> I think you're forgetting that the source I/O is buffered - it doesn't
>> matter that the source VM is still running. As long as its output is
>> buffered - it cannot have any non-fault-tolerant effect on the outside
>> world.
>>
>> In the future, if a technician accesses the machine or the network
>> is restored, the management software can terminate the stale
>> source virtual machine.
> I think going with my comment above; I'm working on the basis it's just
> as likely for the destination to fail as it is for the source to fail,
> and a destination failure shouldn't kill the source; and in the case
> of a destination failure the source is going to have to let its buffered
> I/Os start going again.

Yes, that's correct, but only after management software knows about
the failure. If we're on a tightly-coupled fast LAN, there's no reason
to believe that libvirt, for example, would be so slow that we cannot
wait a few extra (10s of?) milliseconds after destination failure to
choose a new destination and restart the previous checkpoint.

But if management *is* too slow, which is not unlikely, then I think
we should just tell the source to migrate entirely and get out of that
environment.

Either way - this isn't something QEMU itself necessarily needs to
worry about - it just needs to know not to explode if the destination
fails and wait for instructions on what to do next.

Alternatively, if the administrator "prefers" restarting the fault-tolerance
instead of Migration, we could have a QMP command that specifies
a "backup" destination (or even a "duplicate" destination) that QEMU
would automatically know about in the case of destination failure.

But, I wouldn't implement something like that until at least a first version
was accepted by the community.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage mrhines
  2014-02-18 10:32   ` Dr. David Alan Gilbert
@ 2014-03-11 21:31   ` Juan Quintela
  2014-04-04  3:08     ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Juan Quintela @ 2014-03-11 21:31 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, pbonzini, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	abali, BIRAN, lig.fnst, Michael R. Hines

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> We also later export these statistics over QMP for better
> monitoring of micro-checkpointing as the workload changes.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  arch_init.c | 34 ++++++++++++++++++++++++++++------
>  1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 80574a0..b8364b0 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -193,6 +193,8 @@ typedef struct AccountingInfo {
>      uint64_t skipped_pages;
>      uint64_t norm_pages;
>      uint64_t iterations;
> +    uint64_t log_dirty_time;
> +    uint64_t migration_bitmap_time;
>      uint64_t xbzrle_bytes;
>      uint64_t xbzrle_pages;
>      uint64_t xbzrle_cache_miss;
> @@ -201,7 +203,7 @@ typedef struct AccountingInfo {
>  
>  static AccountingInfo acct_info;
>  
> -static void acct_clear(void)
> +void acct_clear(void)
>  {
>      memset(&acct_info, 0, sizeof(acct_info));
>  }
> @@ -236,6 +238,16 @@ uint64_t norm_mig_pages_transferred(void)
>      return acct_info.norm_pages;
>  }
>  
> +uint64_t norm_mig_log_dirty_time(void)
> +{
> +    return acct_info.log_dirty_time;
> +}
> +
> +uint64_t norm_mig_bitmap_time(void)
> +{
> +    return acct_info.migration_bitmap_time;
> +}
> +
>  uint64_t xbzrle_mig_bytes_transferred(void)
>  {
>      return acct_info.xbzrle_bytes;
> @@ -426,27 +438,35 @@ static void migration_bitmap_sync(void)
>      static int64_t num_dirty_pages_period;
>      int64_t end_time;
>      int64_t bytes_xfer_now;
> +    int64_t begin_time;
> +    int64_t dirty_time;
>  
>      if (!bytes_xfer_prev) {
>          bytes_xfer_prev = ram_bytes_transferred();
>      }
>  
> +    begin_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      if (!start_time) {
>          start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>      }

     if (!start_time) {
         start_time = begin_time;
     }

Although I think we need to search for better names?

start_time --> migration_start_time
begin_time --> iteration_start_time
?

I am open to better names.
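
With those names, the hunk above would read roughly:

    int64_t iteration_start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);

    if (!migration_start_time) {
        migration_start_time = iteration_start_time;
    }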

> -
>      trace_migration_bitmap_sync_start();
>      address_space_sync_dirty_bitmap(&address_space_memory);
>  
> +    dirty_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> +
>      QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>          migration_bitmap_sync_range(block->mr->ram_addr, block->length);
>      }
> +
>      trace_migration_bitmap_sync_end(migration_dirty_pages
>                                      - num_dirty_pages_init);
>      num_dirty_pages_period += migration_dirty_pages - num_dirty_pages_init;
>      end_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>  
> -    /* more than 1 second = 1000 millisecons */
> +    acct_info.log_dirty_time += dirty_time - begin_time;
> +    acct_info.migration_bitmap_time += end_time - dirty_time;
> +
> +    /* more than 1 second = 1000 milliseconds */
>      if (end_time > start_time + 1000) {
>          if (migrate_auto_converge()) {
>              /* The following detection logic can be refined later. For now:
> @@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>              /* XBZRLE overflow or normal page */
>              if (bytes_sent == -1) {
>                  bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
> -                qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
> -                bytes_sent += TARGET_PAGE_SIZE;
> -                acct_info.norm_pages++;
> +                if (ret != RAM_SAVE_CONTROL_DELAYED) {
> +                    qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
> +                    bytes_sent += TARGET_PAGE_SIZE;
> +                    acct_info.norm_pages++;
> +                }
>              }
>  
>              /* if page is unmodified, continue to the next */

Except for this bit, the rest of the patch is OK.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states mrhines
@ 2014-03-11 21:36   ` Juan Quintela
  2014-04-04  3:11     ` Michael R. Hines
  2014-03-11 21:40   ` Eric Blake
  1 sibling, 1 reply; 68+ messages in thread
From: Juan Quintela @ 2014-03-11 21:36 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, pbonzini, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	abali, BIRAN, lig.fnst, Michael R. Hines

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> During micro-checkpointing, the VCPUs get repeatedly paused and
> resumed. We need to not freak out when the VM begins micro-checkpointing.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index 3e1e6c7..9c62e2f 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -121,10 +121,31 @@ uint64_t skipped_mig_bytes_transferred(void);
>  uint64_t skipped_mig_pages_transferred(void);
>  uint64_t norm_mig_bytes_transferred(void);
>  uint64_t norm_mig_pages_transferred(void);
> +uint64_t norm_mig_log_dirty_time(void);
> +uint64_t norm_mig_bitmap_time(void);
>  uint64_t xbzrle_mig_bytes_transferred(void);
>  uint64_t xbzrle_mig_pages_transferred(void);
>  uint64_t xbzrle_mig_pages_overflow(void);
>  uint64_t xbzrle_mig_pages_cache_miss(void);
> +void acct_clear(void);
> +
> +void migrate_set_state(MigrationState *s, int old_state, int new_state);
> +
> +enum {
> +    MIG_STATE_ERROR = -1,
> +    MIG_STATE_NONE,
> +    MIG_STATE_SETUP,
> +    MIG_STATE_CANCELLED,
> +    MIG_STATE_CANCELLING,
> +    MIG_STATE_ACTIVE,
> +    MIG_STATE_CHECKPOINTING,
> +    MIG_STATE_COMPLETED,
> +};
> +
> +int mc_enable_buffering(void);
> +int mc_start_buffer(void);
> +void mc_init_checkpointer(MigrationState *s);
> +void mc_process_incoming_checkpoints_if_requested(QEMUFile *f);
>  
>  void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
>  

This clearly doesn't work in this patch.

The rest of it is OK.

Later, Juan.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states mrhines
  2014-03-11 21:36   ` Juan Quintela
@ 2014-03-11 21:40   ` Eric Blake
  2014-04-04  3:12     ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Blake @ 2014-03-11 21:40 UTC (permalink / raw)
  To: mrhines, qemu-devel
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> During micro-checkpointing, the VCPUs get repeatedly paused and
> resumed. We need to not freak out when the VM begins micro-checkpointing.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  cpus.c                        |  9 ++++++++-
>  include/migration/migration.h | 21 +++++++++++++++++++++
>  qapi-schema.json              |  4 +++-
>  vl.c                          |  7 +++++++
>  4 files changed, 39 insertions(+), 2 deletions(-)
> 

> +++ b/qapi-schema.json
> @@ -169,6 +169,8 @@
>  #
>  # @save-vm: guest is paused to save the VM state
>  #
> +# @checkpoint-vm: guest is paused to checkpoint the VM state
> +#

It would be nice to mention '(since 2.1)'.

>  # @shutdown: guest is shut down (and -no-shutdown is in use)
>  #
>  # @suspended: guest is suspended (ACPI S3)
> @@ -181,7 +183,7 @@
>    'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused',
>              'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm',
>              'running', 'save-vm', 'shutdown', 'suspended', 'watchdog',
> -            'guest-panicked' ] }
> +            'guest-panicked', 'checkpoint-vm' ] }

It would also be nice to document the enum variables in the same order
you declare them.  The declaration of 'checkpoint-vm' does not have to
be at the end; we use named enums precisely so that the QMP interface is
not tied to the C integer value underlying the enum.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing mrhines
@ 2014-03-11 21:45   ` Eric Blake
  2014-04-04  3:15     ` Michael R. Hines
  2014-03-11 21:59   ` Juan Quintela
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Blake @ 2014-03-11 21:45 UTC (permalink / raw)
  To: mrhines, qemu-devel
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> MC provides a lot of new information, including the same RAM statistics
> that ordinary migration does, so we centralize a lot of that printing
> code into a common function so that the QMP printing statements don't
> get duplicated too much.
> 
> We also introduce a new MCStats structure (like MigrationStats) due
> to the large number of non-migration related statistics - don't want
> to confuse migration and MC too much, so let's keep them separate for now.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---

> +++ b/qapi-schema.json
> @@ -603,6 +603,36 @@
>             'cache-miss': 'int', 'overflow': 'int' } }
>  
>  ##
> +# @MCStats
> +#
> +# Detailed Micro Checkpointing (MC) statistics
> +#
> +# @mbps: throughput of transmitting last MC 
> +#
> +# @xmit-time: milliseconds to transmit last MC 

Trailing whitespace.

Rather than abbreviate, how about naming this 'transmit-time'.

> +#
> +# @checkpoints: cummulative total number of MCs generated 

More trailing whitespace.  Please run your series through
scripts/checkpatch.pl.

s/cummulative total/cumulative/

> +#
> +# Since: 2.x
> +##
> +{ 'type': 'MCStats',
> +  'data': {'mbps': 'number',
> +           'xmit-time': 'uint64',
> +           'log-dirty-time': 'uint64',
> +           'migration-bitmap-time': 'uint64', 
> +           'ram-copy-time': 'uint64',
> +           'checkpoints' : 'uint64',
> +           'copy-mbps': 'number' }}

Again, it helps to document the fields in the same order as they are
declared (no, it's not a hard requirement, but being nice to readers is
always worth the effort).

> +
> +##
>  # @MigrationInfo
>  #
>  # Information about current migration process.
> @@ -624,6 +654,8 @@
>  #                migration statistics, only returned if XBZRLE feature is on and
>  #                status is 'active' or 'completed' (since 1.2)
>  #
> +# @mc: #options @MCStats containing details Micro-Checkpointing statistics

s/options/optional/ - I'm assuming it is optional because it only
appears when MC is in use.

'mc' is a rather short name, maybe 'micro-checkpoint' is better?

Missing a '(since 2.1)' designation (or 2.x, as you used above as a
placeholder, although obviously we'd fix the .x before actually bringing
into mainline)

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency mrhines
@ 2014-03-11 21:49   ` Eric Blake
  2014-03-11 22:15     ` Juan Quintela
  2014-04-04  3:29     ` Michael R. Hines
  0 siblings, 2 replies; 68+ messages in thread
From: Eric Blake @ 2014-03-11 21:49 UTC (permalink / raw)
  To: mrhines, qemu-devel
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This exposes a QMP command that allows the management software
> or policy to control the frequency of micro-checkpointing.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  hmp-commands.hx  | 16 +++++++++++++++-
>  hmp.c            |  6 ++++++
>  hmp.h            |  1 +
>  qapi-schema.json | 13 +++++++++++++
>  qmp-commands.hx  | 23 +++++++++++++++++++++++
>  5 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index f3fc514..2066c76 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -888,7 +888,7 @@ ETEXI
>  		      "\n\t\t\t -b for migration without shared storage with"
>  		      " full copy of disk\n\t\t\t -i for migration without "
>  		      "shared storage with incremental copy of disk "
> -		      "(base image shared between src and destination)",
> + 		      "(base image shared between src and destination)",

Spurious hunk.  Oh, I see - you managed to take TAB damage and make it
worse with a space-TAB (I guess this file isn't tab-clean, like the
.json file is).  Eww.

>          .mhandler.cmd = hmp_migrate,
>      },
>  
> @@ -965,6 +965,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>  ETEXI
>  
>      {
> +        .name       = "migrate-set-mc-delay",

We're building up a LOT of migrate- tunable commands.  Maybe it's time
to think about building a more generic migrate-set-parameter, which
takes both the name of the parameter to set and its value, so that a
single command serves all parameters, instead of needing a proliferation
of commands.  Of course, for that to be useful, we also need a way to
introspect which parameters can be tuned; whereas with the current
approach of one command per parameter (well, 2 for set vs. get) the
introspection is based on whether the command exists.
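
As a sketch only (nothing like this exists in the tree yet, and the
parameter names are hypothetical), the generic setter might look like:

    void qmp_migrate_set_parameter(const char *name, int64_t value,
                                   Error **errp)
    {
        MigrationState *s = migrate_get_current();

        if (strcmp(name, "mc-delay") == 0) {
            s->mc_delay_ms = value;                   /* hypothetical */
        } else if (strcmp(name, "xbzrle-cache-size") == 0) {
            qmp_migrate_set_cache_size(value, errp);  /* existing setter */
        } else {
            error_setg(errp, "unknown migration parameter '%s'", name);
        }
    }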


> +++ b/qapi-schema.json
> @@ -2160,6 +2160,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>  
>  ##
> +# @migrate-set-mc-delay
> +#
> +# Set delay (in milliseconds) between micro checkpoints.
> +#
> +# @value: maximum delay in milliseconds 
> +#
> +# Returns: nothing on success
> +#
> +# Since: 2.x
> +##
> +{ 'command': 'migrate-set-mc-delay', 'data': {'value': 'int'} }
> +
> +##

I hate write-only interfaces.  If I can set the parameter, I _also_ need
a way to query the current value of the parameter.  Either an existing
migration statistics output should be modified to include this new
information, or you need to add a counterpart migrate-get-mc-delay command.
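
A counterpart getter could be as small as this sketch (the field name
is hypothetical), although folding the value into the existing
query-migrate output would serve just as well:

    int64_t qmp_migrate_get_mc_delay(Error **errp)
    {
        return migrate_get_current()->mc_delay_ms;    /* hypothetical */
    }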

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing mrhines
@ 2014-03-11 21:57   ` Eric Blake
  2014-04-04  3:38     ` Michael R. Hines
  2014-03-11 22:02   ` Juan Quintela
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Blake @ 2014-03-11 21:57 UTC (permalink / raw)
  To: mrhines, qemu-devel
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> New capabilities include the use of RDMA acceleration,
> use of network buffering, and keepalive support, as documented
> in patch #1.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  qapi-schema.json | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 

> +#          Only for performance testing. (Since 2.x)
> +#
> +# @mc-rdma-copy: MC requires creating a local-memory checkpoint before
> +#          transmission to the destination. This requires heavy use of 
> +#          memcpy() which dominates the processor pipeline. This option 
> +#          makes use of *local* RDMA to perform the copy instead of the CPU.
> +#          Enabled by default only if the migration transport is RDMA.
> +#          Disabled by default otherwise. (Since 2.x)

How does that work?  If I query migration capabilities before requesting
a migration, what state am I going to read?  Is there coupling where I
would observe the state of this flag change merely because I did some
other action?  And if so, then how do I know that explicitly setting
this flag won't be undone by similar coupling?

It sounds like you are describing a tri-state option (unspecified so
default to migration transport, explicitly disabled, explicitly
enabled); but that doesn't work for something that only lists boolean
capabilities.  The only way around that is to have 2 separate
capabilities (one on whether to base decision on transport or to honor
override, and the other to provide the override value which is ignored
when defaulting by transport).
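
As a strawman, resolving the two capabilities could look like this
minimal C sketch (the flag names are made up, not anything in the
series):

#include <stdbool.h>

/* Hypothetical flags modelling the two-capability split. */
static bool mc_rdma_copy_overridden; /* honor the explicit value below? */
static bool mc_rdma_copy_value;      /* the explicit value, if honored */

static bool mc_use_rdma_copy(bool transport_is_rdma)
{
    if (mc_rdma_copy_overridden) {
        return mc_rdma_copy_value;   /* explicit capability wins */
    }
    return transport_is_rdma;        /* otherwise follow the transport */
}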

> +#
> +# @rdma-keepalive: RDMA connections do not timeout by themselves if a peer
> +#         has disconnected prematurely or failed. User-level keepalives
> +#         allow the migration to abort cleanly if there is a problem with the
> +#         destination host. For debugging, this can be problematic as
> +#         the keepalive may cause the peer to abort prematurely if we are
> +#         at a GDB breakpoint, for example.
> +#         Enabled by default. (Since 2.x)

Enabled-by-default is an interesting choice, but I suppose it is okay.


> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle', 'x-rdma-pin-all', 'auto-converge', 'zero-blocks'] }
> +  'data': ['xbzrle', 
> +           'rdma-pin-all', 
> +           'auto-converge', 
> +           'zero-blocks',
> +           'mc', 
> +           'mc-net-disable',
> +           'mc-rdma-copy',
> +           'rdma-keepalive'
> +          ] }
>  
>  ##
>  # @MigrationCapabilityStatus
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC mrhines
  2014-02-19  1:00   ` Li Guang
@ 2014-03-11 21:57   ` Juan Quintela
  2014-04-04  3:50     ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Juan Quintela @ 2014-03-11 21:57 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, pbonzini, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	abali, BIRAN, lig.fnst, Michael R. Hines

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> This patch sets up the initial changes to the migration state
> machine and prototypes to be used by the checkpointing code
> to interact with the state machine so that we can later handle
> failure and recovery scenarios.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  arch_init.c                   | 29 ++++++++++++++++++++++++-----
>  include/migration/migration.h |  2 ++
>  migration.c                   | 37 +++++++++++++++++++++----------------
>  3 files changed, 47 insertions(+), 21 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index db75120..e9d4d9e 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>      migration_end();
>  }
>  
> -static void reset_ram_globals(void)
> +static void reset_ram_globals(bool reset_bulk_stage)
>  {
>      last_seen_block = NULL;
>      last_sent_block = NULL;
>      last_offset = 0;
>      last_version = ram_list.version;
> -    ram_bulk_stage = true;
> +    ram_bulk_stage = reset_bulk_stage;
>  }
>  
>  #define MAX_WAIT 50 /* ms, half buffered_file limit */
> @@ -674,6 +674,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      RAMBlock *block;
>      int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
>  
> +    /*
> +     * RAM stays open during micro-checkpointing for the next transaction.
> +     */
> +    if (migration_is_mc(migrate_get_current())) {
> +        qemu_mutex_lock_ramlist();
> +        reset_ram_globals(false);
> +        goto skip_setup;
> +    }
> +
>      migration_bitmap = bitmap_new(ram_pages);
>      bitmap_set(migration_bitmap, 0, ram_pages);
>      migration_dirty_pages = ram_pages;
> @@ -710,12 +719,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>      qemu_mutex_lock_iothread();
>      qemu_mutex_lock_ramlist();
>      bytes_transferred = 0;
> -    reset_ram_globals();
> +    reset_ram_globals(true);
>  
>      memory_global_dirty_log_start();
>      migration_bitmap_sync();
>      qemu_mutex_unlock_iothread();
>  
> +skip_setup:
> +
>      qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
>  
>      QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> @@ -744,7 +755,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>      qemu_mutex_lock_ramlist();
>  
>      if (ram_list.version != last_version) {
> -        reset_ram_globals();
> +        reset_ram_globals(true);
>      }
>  
>      ram_control_before_iterate(f, RAM_CONTROL_ROUND);
> @@ -825,7 +836,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>      }
>  
>      ram_control_after_iterate(f, RAM_CONTROL_FINISH);
> -    migration_end();
> +
> +    /*
> +     * Only cleanup at the end of normal migrations
> +     * or if the MC destination failed and we got an error.
> +     * Otherwise, we are (or will soon be) in MIG_STATE_CHECKPOINTING.
> +     */
> +    if(!migrate_use_mc() || migration_has_failed(migrate_get_current())) {
> +        migration_end();
> +    }
>  
>      qemu_mutex_unlock_ramlist();
>      qemu_put_be64(f, RAM_SAVE_FLAG_EOS);



I haven't looked at the code in detail, but what we have here is
essentially:


ram_save_complete()
{
   code not needed for mc
   common code for migration and mc
   code not needed for mc
}

Similar code in ram_save_setup.  Yes, I know that there are some locking
issues here.


Should we be able to do something like

__ram_save_complete()
{
    common code
}

mc_ram_save_complete()
{
    # Possible something else here
    __ram_save_complete()
}

rest_ram_save_complete()
{
    code not needed for mc
    __ram_save_complete()
    code not needed for mc
}

My problem here is that the current code is already quite complex and
convoluted.  At some point we are going to need to change it to
something that is easier to understand.


> -enum {
> -    MIG_STATE_ERROR = -1,
> -    MIG_STATE_NONE,
> -    MIG_STATE_SETUP,
> -    MIG_STATE_CANCELLING,
> -    MIG_STATE_CANCELLED,
> -    MIG_STATE_ACTIVE,
> -    MIG_STATE_COMPLETED,
> -};
> -

Here comes the code seen on the previous patch O:-)

>  
> -static void migrate_set_state(MigrationState *s, int old_state, int new_state)
> +bool migration_is_active(MigrationState *s)
> +{
> +    return (s->state == MIG_STATE_ACTIVE) || s->state == MIG_STATE_SETUP
> +            || s->state == MIG_STATE_CHECKPOINTING;
> +}

The whole idea of moving MIG_STATE_* to this file was to "force" all
other users to use accessor functions.  This way we know what the others
expect from us.

> -    assert(s->state != MIG_STATE_ACTIVE);
> +    assert(!migration_is_active(s));

I can understand that we want MIG_STATE_CHECKPOINTING here, but _SETUP?
Or is it a bug upstream?

Thanks, Juan.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing mrhines
  2014-03-11 21:45   ` Eric Blake
@ 2014-03-11 21:59   ` Juan Quintela
  2014-04-04  3:55     ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Juan Quintela @ 2014-03-11 21:59 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, pbonzini, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	abali, BIRAN, lig.fnst, Michael R. Hines

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> MC provides a lot of new information, including the same RAM statistics
> that ordinary migration does, so we centralize a lot of that printing
> code into a common function so that the QMP printing statements don't
> get duplicated too much.
>
> We also introduce a new MCStats structure (like MigrationStats) due
> to the large number of non-migration related statistics - don't want
> to confuse migration and MC too much, so let's keep them separate for now.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>

We can add the non-mc stats if you split them.  And you get a smaller
series.

Later, Juan.


> ---
>  hmp.c                         | 17 +++++++++
>  include/migration/migration.h |  6 +++
>  migration.c                   | 86 ++++++++++++++++++++++++++-----------------
>  qapi-schema.json              | 33 +++++++++++++++++
>  4 files changed, 109 insertions(+), 33 deletions(-)
>
> diff --git a/hmp.c b/hmp.c
> index 1af0809..edf062e 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -203,6 +203,23 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>                         info->disk->total >> 10);
>      }
>  
> +    if (info->has_mc) {
> +        monitor_printf(mon, "checkpoints: %" PRIu64 "\n",
> +                       info->mc->checkpoints);
> +        monitor_printf(mon, "xmit_time: %" PRIu64 " ms\n",
> +                       info->mc->xmit_time);
> +        monitor_printf(mon, "log_dirty_time: %" PRIu64 " ms\n",
> +                       info->mc->log_dirty_time);
> +        monitor_printf(mon, "migration_bitmap_time: %" PRIu64 " ms\n",
> +                       info->mc->migration_bitmap_time);
> +        monitor_printf(mon, "ram_copy_time: %" PRIu64 " ms\n",
> +                       info->mc->ram_copy_time);
> +        monitor_printf(mon, "copy_mbps: %0.2f mbps\n",
> +                       info->mc->copy_mbps);
> +        monitor_printf(mon, "throughput: %0.2f mbps\n",
> +                       info->mc->mbps);
> +    }
> +
>      if (info->has_xbzrle_cache) {
>          monitor_printf(mon, "cache size: %" PRIu64 " bytes\n",
>                         info->xbzrle_cache->cache_size);
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e876a2c..f18ff5e 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -53,14 +53,20 @@ struct MigrationState
>      int state;
>      MigrationParams params;
>      double mbps;
> +    double copy_mbps;
>      int64_t total_time;
>      int64_t downtime;
>      int64_t expected_downtime;
> +    int64_t xmit_time;
> +    int64_t ram_copy_time;
> +    int64_t log_dirty_time;
> +    int64_t bitmap_time;
>      int64_t dirty_pages_rate;
>      int64_t dirty_bytes_rate;
>      bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
>      int64_t xbzrle_cache_size;
>      int64_t setup_time;
> +    int64_t checkpoints;
>  };
>  
>  void process_incoming_migration(QEMUFile *f);
> diff --git a/migration.c b/migration.c
> index f42dae4..0ccbeaa 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -59,7 +59,6 @@ MigrationState *migrate_get_current(void)
>          .state = MIG_STATE_NONE,
>          .bandwidth_limit = MAX_THROTTLE,
>          .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
> -        .mbps = -1,
>      };
>  
>      return &current_migration;
> @@ -173,6 +172,31 @@ static void get_xbzrle_cache_stats(MigrationInfo *info)
>      }
>  }
>  
> +static void get_ram_stats(MigrationState *s, MigrationInfo *info)
> +{
> +    info->has_total_time = true;
> +    info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
> +        - s->total_time;
> +
> +    info->has_ram = true;
> +    info->ram = g_malloc0(sizeof(*info->ram));
> +    info->ram->transferred = ram_bytes_transferred();
> +    info->ram->total = ram_bytes_total();
> +    info->ram->duplicate = dup_mig_pages_transferred();
> +    info->ram->skipped = skipped_mig_pages_transferred();
> +    info->ram->normal = norm_mig_pages_transferred();
> +    info->ram->normal_bytes = norm_mig_bytes_transferred();
> +    info->ram->mbps = s->mbps;
> +
> +    if (blk_mig_active()) {
> +        info->has_disk = true;
> +        info->disk = g_malloc0(sizeof(*info->disk));
> +        info->disk->transferred = blk_mig_bytes_transferred();
> +        info->disk->remaining = blk_mig_bytes_remaining();
> +        info->disk->total = blk_mig_bytes_total();
> +    }
> +}
> +
>  MigrationInfo *qmp_query_migrate(Error **errp)
>  {
>      MigrationInfo *info = g_malloc0(sizeof(*info));
> @@ -199,26 +223,8 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          info->has_setup_time = true;
>          info->setup_time = s->setup_time;
>  
> -        info->has_ram = true;
> -        info->ram = g_malloc0(sizeof(*info->ram));
> -        info->ram->transferred = ram_bytes_transferred();
> -        info->ram->remaining = ram_bytes_remaining();
> -        info->ram->total = ram_bytes_total();
> -        info->ram->duplicate = dup_mig_pages_transferred();
> -        info->ram->skipped = skipped_mig_pages_transferred();
> -        info->ram->normal = norm_mig_pages_transferred();
> -        info->ram->normal_bytes = norm_mig_bytes_transferred();
> +        get_ram_stats(s, info);
>          info->ram->dirty_pages_rate = s->dirty_pages_rate;
> -        info->ram->mbps = s->mbps;
> -
> -        if (blk_mig_active()) {
> -            info->has_disk = true;
> -            info->disk = g_malloc0(sizeof(*info->disk));
> -            info->disk->transferred = blk_mig_bytes_transferred();
> -            info->disk->remaining = blk_mig_bytes_remaining();
> -            info->disk->total = blk_mig_bytes_total();
> -        }
> -
>          get_xbzrle_cache_stats(info);
>          break;
>      case MIG_STATE_COMPLETED:
> @@ -227,22 +233,37 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>          info->has_status = true;
>          info->status = g_strdup("completed");
>          info->has_total_time = true;
> -        info->total_time = s->total_time;
> +        info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
> +            - s->total_time;
>          info->has_downtime = true;
>          info->downtime = s->downtime;
>          info->has_setup_time = true;
>          info->setup_time = s->setup_time;
>  
> -        info->has_ram = true;
> -        info->ram = g_malloc0(sizeof(*info->ram));
> -        info->ram->transferred = ram_bytes_transferred();
> -        info->ram->remaining = 0;
> -        info->ram->total = ram_bytes_total();
> -        info->ram->duplicate = dup_mig_pages_transferred();
> -        info->ram->skipped = skipped_mig_pages_transferred();
> -        info->ram->normal = norm_mig_pages_transferred();
> -        info->ram->normal_bytes = norm_mig_bytes_transferred();
> -        info->ram->mbps = s->mbps;
> +        get_ram_stats(s, info);
> +        break;
> +    case MIG_STATE_CHECKPOINTING:
> +        info->has_status = true;
> +        info->status = g_strdup("checkpointing");
> +        info->has_setup_time = true;
> +        info->setup_time = s->setup_time;
> +        info->has_downtime = true;
> +        info->downtime = s->downtime;
> +
> +        get_ram_stats(s, info);
> +        info->ram->dirty_pages_rate = s->dirty_pages_rate;
> +        get_xbzrle_cache_stats(info);
> +
> +
> +        info->has_mc = true;
> +        info->mc = g_malloc0(sizeof(*info->mc));
> +        info->mc->xmit_time = s->xmit_time;
> +        info->mc->log_dirty_time = s->log_dirty_time; 
> +        info->mc->migration_bitmap_time = s->bitmap_time;
> +        info->mc->ram_copy_time = s->ram_copy_time;
> +        info->mc->copy_mbps = s->copy_mbps;
> +        info->mc->mbps = s->mbps;
> +        info->mc->checkpoints = s->checkpoints;
>          break;
>      case MIG_STATE_ERROR:
>          info->has_status = true;
> @@ -646,8 +667,7 @@ static void *migration_thread(void *opaque)
>              double bandwidth = transferred_bytes / time_spent;
>              max_size = bandwidth * migrate_max_downtime() / 1000000;
>  
> -            s->mbps = time_spent ? (((double) transferred_bytes * 8.0) /
> -                    ((double) time_spent / 1000.0)) / 1000.0 / 1000.0 : -1;
> +            s->mbps = MBPS(transferred_bytes, time_spent);
>  
>              DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
>                      " bandwidth %g max_size %" PRId64 "\n",
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 3c2ee4d..7306adc 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -603,6 +603,36 @@
>             'cache-miss': 'int', 'overflow': 'int' } }
>  
>  ##
> +# @MCStats
> +#
> +# Detailed Micro Checkpointing (MC) statistics
> +#
> +# @mbps: throughput of transmitting last MC 
> +#
> +# @xmit-time: milliseconds to transmit last MC 
> +#
> +# @log-dirty-time: milliseconds to GET_LOG_DIRTY for last MC 
> +#
> +# @migration-bitmap-time: milliseconds to prepare dirty bitmap for last MC
> +#
> +# @ram-copy-time: milliseconds to ram_save_live() last MC to staging memory
> +#
> +# @copy-mbps: throughput of ram_save_live() to staging memory for last MC 
> +#
> +# @checkpoints: cummulative total number of MCs generated 
> +#
> +# Since: 2.x
> +##
> +{ 'type': 'MCStats',
> +  'data': {'mbps': 'number',
> +           'xmit-time': 'uint64',
> +           'log-dirty-time': 'uint64',
> +           'migration-bitmap-time': 'uint64', 
> +           'ram-copy-time': 'uint64',
> +           'checkpoints' : 'uint64',
> +           'copy-mbps': 'number' }}
> +
> +##
>  # @MigrationInfo
>  #
>  # Information about current migration process.
> @@ -624,6 +654,8 @@
>  #                migration statistics, only returned if XBZRLE feature is on and
>  #                status is 'active' or 'completed' (since 1.2)
>  #
> +# @mc: #options @MCStats containing details Micro-Checkpointing statistics
> +#
>  # @total-time: #optional total amount of milliseconds since migration started.
>  #        If migration has ended, it returns the total migration
>  #        time. (since 1.2)
> @@ -648,6 +680,7 @@
>    'data': {'*status': 'str', '*ram': 'MigrationStats',
>             '*disk': 'MigrationStats',
>             '*xbzrle-cache': 'XBZRLECacheStats',
> +           '*mc': 'MCStats',
>             '*total-time': 'int',
>             '*expected-downtime': 'int',
>             '*downtime': 'int',

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing mrhines
  2014-03-11 21:57   ` Eric Blake
@ 2014-03-11 22:02   ` Juan Quintela
  2014-03-11 22:07     ` Eric Blake
  2014-04-04  3:56     ` Michael R. Hines
  1 sibling, 2 replies; 68+ messages in thread
From: Juan Quintela @ 2014-03-11 22:02 UTC (permalink / raw)
  To: mrhines
  Cc: GILR, SADEKJ, pbonzini, qemu-devel, EREZH, owasserm,
	junqing.wang, onom, hinesmr, isaku.yamahata, gokul, dbulkow,
	abali, BIRAN, lig.fnst, Michael R. Hines

mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> New capabilities include the use of RDMA acceleration,
> use of network buffering, and keepalive support, as documented
> in patch #1.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  qapi-schema.json | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 98abdac..1fdf208 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -720,10 +720,44 @@
>  # @auto-converge: If enabled, QEMU will automatically throttle down the guest
>  #          to speed up convergence of RAM migration. (since 1.6)
>  #
> +# @mc: The migration will never end, and the VM will instead be continuously
> +#          micro-checkpointed (MC). Use the command migrate-set-mc-delay to 
> +#          control the frequency at which the checkpoints occur. 
> +#          Disabled by default. (Since 2.x)
> +#
> +# @mc-net-disable: Deactivate network buffering against outbound network 
> +#          traffic while Micro-Checkpointing (@mc) is active.
> +#          Enabled by default. Disabling will make the MC protocol inconsistent
> +#          and potentially break network connections upon an actual failure.
> +#          Only for performance testing. (Since 2.x)

If it is dangerous, can we put dangerous/unsafe in the name?  Having an
option that can corrupt things makes me nervous.

> +#
> +# @mc-rdma-copy: MC requires creating a local-memory checkpoint before
> +#          transmission to the destination. This requires heavy use of 
> +#          memcpy() which dominates the processor pipeline. This option 
> +#          makes use of *local* RDMA to perform the copy instead of the CPU.
> +#          Enabled by default only if the migration transport is RDMA.
> +#          Disabled by default otherwise. (Since 2.x)
> +#
> +# @rdma-keepalive: RDMA connections do not timeout by themselves if a peer
> +#         has disconnected prematurely or failed. User-level keepalives
> +#         allow the migration to abort cleanly if there is a problem with the
> +#         destination host. For debugging, this can be problematic as
> +#         the keepalive may cause the peer to abort prematurely if we are
> +#         at a GDB breakpoint, for example.
> +#         Enabled by default. (Since 2.x)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle', 'x-rdma-pin-all', 'auto-converge', 'zero-blocks'] }
> +  'data': ['xbzrle', 
> +           'rdma-pin-all', 
> +           'auto-converge', 
> +           'zero-blocks',
> +           'mc', 
> +           'mc-net-disable',
> +           'mc-rdma-copy',
> +           'rdma-keepalive'
> +          ] }
>  
>  ##
>  # @MigrationCapabilityStatus

Thanks, Juan.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-03-11 22:02   ` Juan Quintela
@ 2014-03-11 22:07     ` Eric Blake
  2014-04-04  3:57       ` Michael R. Hines
  2014-04-04  3:56     ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Blake @ 2014-03-11 22:07 UTC (permalink / raw)
  To: quintela, mrhines
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

On 03/11/2014 04:02 PM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>

>> +# @mc-net-disable: Deactivate network buffering against outbound network 
>> +#          traffic while Micro-Checkpointing (@mc) is active.
>> +#          Enabled by default. Disabling will make the MC protocol inconsistent
>> +#          and potentially break network connections upon an actual failure.
>> +#          Only for performance testing. (Since 2.x)
> 
> If it is dangerous, can we put dangerous/unsafe on the name?  Having an option that
> can corrupt things make me nervous.

Or even name it x-mc-net-disable, so that we reserve the right to remove
it, as well as make it obvious that management must not try to tune it,
only developers.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-03-11 21:49   ` Eric Blake
@ 2014-03-11 22:15     ` Juan Quintela
  2014-03-11 22:49       ` Eric Blake
  2014-04-04  3:29     ` Michael R. Hines
  1 sibling, 1 reply; 68+ messages in thread
From: Juan Quintela @ 2014-03-11 22:15 UTC (permalink / raw)
  To: Eric Blake
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, lig.fnst,
	owasserm, onom, junqing.wang, mrhines, gokul, dbulkow, pbonzini,
	Luiz Capitulino, abali, isaku.yamahata, Michael R. Hines

Eric Blake <eblake@redhat.com> wrote:
> On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>

> We're building up a LOT of migrate- tunable commands.  Maybe it's time
> to think about building a more generic migrate-set-parameter, which
> takes both the name of the parameter to set and its value, so that a
> single command serves all parameters, instead of needing a proliferation
> of commands.  Of course, for that to be useful, we also need a way to
> introspect which parameters can be tuned; whereas with the current
> approach of one command per parameter (well, 2 for set vs. get) the
> introspection is based on whether the command exists.

I asked to have that.  My suggestion was:

migrate_set_capability auto-throttle on

So we could add new variables without extra changes.

And I agree that having a way to read them, and ask what values they
have is a good idea.

Luiz, any good idea about how to do it through QMP?

Making the migration changes is easy; the problem is knowing how we want
them.

Later, Juan.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-03-11 22:15     ` Juan Quintela
@ 2014-03-11 22:49       ` Eric Blake
  2014-04-04  5:29         ` Michael R. Hines
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Blake @ 2014-03-11 22:49 UTC (permalink / raw)
  To: quintela
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, lig.fnst,
	owasserm, onom, junqing.wang, mrhines, gokul, dbulkow, pbonzini,
	Luiz Capitulino, abali, isaku.yamahata, Michael R. Hines

[-- Attachment #1: Type: text/plain, Size: 4586 bytes --]

On 03/11/2014 04:15 PM, Juan Quintela wrote:
> Eric Blake <eblake@redhat.com> wrote:
>> On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
>> We're building up a LOT of migrate- tunable commands.  Maybe it's time
>> to think about building a more generic migrate-set-parameter, which
>> takes both the name of the parameter to set and its value, so that a
>> single command serves all parameters, instead of needing a proliferation
>> of commands.  Of course, for that to be useful, we also need a way to
>> introspect which parameters can be tuned; whereas with the current
>> approach of one command per parameter (well, 2 for set vs. get) the
>> introspection is based on whether the command exists.
> 
> I asked to have that.  My suggestion was that
> 
> migrate_set_capability auto-throotle on
> 
> So we could add it to new variables without extra change.
> 
> And I agree that having a way to read them, and ask what values they
> have is a good idea.
> 
> Luiz, any good idea about how to do it through QMP?

I'm trying to think of a back-compat method, which exploits the fact
that we now have flat unions (something we didn't have when
migrate-set-capabilities was first added).  Maybe something like:

{ 'type': 'MigrationCapabilityBase',
  'data': { 'capability': 'MigrationCapability' } }
{ 'type': 'MigrationCapabilityBool',
  'data': { 'state': 'bool' } }
{ 'type': 'MigrationCapabilityInt',
  'data': { 'value': 'int' } }
{ 'union': 'MigrationCapabilityStatus',
  'base': 'MigrationCapabilityBase',
  'discriminator': 'capability',
  'data': {
    'xbzrle': 'MigrationCapabilityBool',
    'auto-converge': 'MigrationCapabilityBool',
...
    'mc-delay': 'MigrationCapabilityInt'
  } }

along with a tweak to query-migrate-capabilities for full back-compat:

# @query-migrate-capabilities
# @extended: #optional defaults to false; set to true to see non-boolean
#            capabilities (since 2.1)
{ 'command': 'query-migrate-capabilities',
  'data': { '*extended': 'bool' },
  'returns': ['MigrationCapabilityStatus'] }

Now, observe what happens.  If an old client calls { "execute":
"query-migrate-capabilities" }, they get a return that lists ONLY the
boolean members of the MigrationCapabilityStatus array (good, because if
we returned a non-boolean, we would confuse the consumer when they are
expecting a 'state' variable that is not present) - what's more, this
representation is identical on the wire to the format used in earlier
qemu.  But new clients can call { "execute":
"query-migrate-capabilities", "arguments": { "extended": true } }, and
get back:

{ "capabilities": [
   { "capability": "xbzrle", "state": true },
   { "capability": "auto-converge", "state": false },
...
   { "capability": "mc-delay", "value": 100 }
  ] }

Also, once a new client has learned of non-boolean extended
capabilities, they can also set them via the existing command:
{ "execute": "migrate-set-capabilities",
  "arguments": [
     { "capability": "xbzrle", "state", false },
     { "capability": "mc-delay", "value", 200 }
  ] }

So, what do you think?  My slick type manipulation means that we need
zero new commands, just a new option to the query command, and a new
flat union type that replaces the current struct type.  The existence
(but not the type) of non-boolean parameters is already introspectible
to a client new enough to request an 'extended' query, and down the
road, if we ever gain full QAPI introspection, then a client also would
gain the ability to learn the type of any non-boolean parameter as well.
Stability-wise, as long as we never change the type of a capability
once first exposed, then if a client plans on using a particular
parameter when available, it can already hard-code what type that
parameter should have without even needing full QAPI introspection (that
is, if libvirt is taught to manipulate mc-delay, libvirt will already
know to expect mc-delay as an int, and not any other type, and merely
needs to probe if qemu supports the 'mc-delay' extended capability).
And of course, this new schema idea can retroactively cover all existing
migration tunables, such as migrate_set_downtime, migrate_set_speed,
migrate-set-cache-size, and so on.

> 
> Having the migration changes is easy, the problem is knowing how we want
> them.

And maybe my proposal just solved that.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage
  2014-03-11 21:31   ` Juan Quintela
@ 2014-04-04  3:08     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:08 UTC (permalink / raw)
  To: quintela
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 03/12/2014 05:31 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> We also later export these statistics over QMP for better
>> monitoring of micro-checkpointing as the workload changes.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   arch_init.c | 34 ++++++++++++++++++++++++++++------
>>   1 file changed, 28 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 80574a0..b8364b0 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -193,6 +193,8 @@ typedef struct AccountingInfo {
>>       uint64_t skipped_pages;
>>       uint64_t norm_pages;
>>       uint64_t iterations;
>> +    uint64_t log_dirty_time;
>> +    uint64_t migration_bitmap_time;
>>       uint64_t xbzrle_bytes;
>>       uint64_t xbzrle_pages;
>>       uint64_t xbzrle_cache_miss;
>> @@ -201,7 +203,7 @@ typedef struct AccountingInfo {
>>   
>>   static AccountingInfo acct_info;
>>   
>> -static void acct_clear(void)
>> +void acct_clear(void)
>>   {
>>       memset(&acct_info, 0, sizeof(acct_info));
>>   }
>> @@ -236,6 +238,16 @@ uint64_t norm_mig_pages_transferred(void)
>>       return acct_info.norm_pages;
>>   }
>>   
>> +uint64_t norm_mig_log_dirty_time(void)
>> +{
>> +    return acct_info.log_dirty_time;
>> +}
>> +
>> +uint64_t norm_mig_bitmap_time(void)
>> +{
>> +    return acct_info.migration_bitmap_time;
>> +}
>> +
>>   uint64_t xbzrle_mig_bytes_transferred(void)
>>   {
>>       return acct_info.xbzrle_bytes;
>> @@ -426,27 +438,35 @@ static void migration_bitmap_sync(void)
>>       static int64_t num_dirty_pages_period;
>>       int64_t end_time;
>>       int64_t bytes_xfer_now;
>> +    int64_t begin_time;
>> +    int64_t dirty_time;
>>   
>>       if (!bytes_xfer_prev) {
>>           bytes_xfer_prev = ram_bytes_transferred();
>>       }
>>   
>> +    begin_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>       if (!start_time) {
>>           start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
>>       }
>       if (!start_time) {
>           start_time = begin_time;
>       }
>
> Although I think we need to search for better names?
>
> start_time --> migration_start_time
> begin_time --> iteration_start_time
> ?

Will do. These new names are fine - no problem =)

> I am open to better names.
>
>> @@ -548,9 +568,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>               /* XBZRLE overflow or normal page */
>>               if (bytes_sent == -1) {
>>                   bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
>> -                qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
>> -                bytes_sent += TARGET_PAGE_SIZE;
>> -                acct_info.norm_pages++;
>> +                if (ret != RAM_SAVE_CONTROL_DELAYED) {
>> +                    qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
>> +                    bytes_sent += TARGET_PAGE_SIZE;
>> +                    acct_info.norm_pages++;
>> +                }
>>               }
>>   
>>               /* if page is unmodified, continue to the next */
> Except for this bit, rest of the patch ok.
>


The goal of this patch is to allow the virtual machine to resume execution
of the main VCPUs "as soon as possible" after each checkpoint completes.
In order to make that possible, all the other micro-checkpointing
implementations use a "staging" buffer for this to work:

The purpose of the staging buffer is to hold a complete copy of the dirty
memory locally and capture that memory *before* transmitting it to the
other side. Once we have a complete copy of the dirty memory, we can allow
the virtual machine to continue execution immediately, without waiting for
the memory to be transmitted to the other side of the connection.

Since this patch is very critical to performance, I'll make it a separate
patch with its own summary in the series.
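
To make the flow concrete, each checkpoint does roughly the sketch below
(every helper name here is made up for illustration, not code from the
series):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing in for the real series code. */
void pause_all_vcpus(void);
void resume_all_vcpus(void);
size_t capture_dirty_pages(uint8_t *staging);
void transmit_to_destination(const uint8_t *buf, size_t len);

static void mc_checkpoint_once(uint8_t *staging)
{
    size_t used;

    pause_all_vcpus();                   /* brief stop of the guest */
    used = capture_dirty_pages(staging); /* local copy, no network wait */
    resume_all_vcpus();                  /* guest runs again immediately */

    /* Transmission now overlaps with guest execution. */
    transmit_to_destination(staging, used);
}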

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states
  2014-03-11 21:36   ` Juan Quintela
@ 2014-04-04  3:11     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:11 UTC (permalink / raw)
  To: quintela
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 03/12/2014 05:36 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> During micro-checkpointing, the VCPUs get repeatedly paused and
>> resumed. We need to not freak out when the VM begins micro-checkpointing.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index 3e1e6c7..9c62e2f 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -121,10 +121,31 @@ uint64_t skipped_mig_bytes_transferred(void);
>>   uint64_t skipped_mig_pages_transferred(void);
>>   uint64_t norm_mig_bytes_transferred(void);
>>   uint64_t norm_mig_pages_transferred(void);
>> +uint64_t norm_mig_log_dirty_time(void);
>> +uint64_t norm_mig_bitmap_time(void);
>>   uint64_t xbzrle_mig_bytes_transferred(void);
>>   uint64_t xbzrle_mig_pages_transferred(void);
>>   uint64_t xbzrle_mig_pages_overflow(void);
>>   uint64_t xbzrle_mig_pages_cache_miss(void);
>> +void acct_clear(void);
>> +
>> +void migrate_set_state(MigrationState *s, int old_state, int new_state);
>> +
>> +enum {
>> +    MIG_STATE_ERROR = -1,
>> +    MIG_STATE_NONE,
>> +    MIG_STATE_SETUP,
>> +    MIG_STATE_CANCELLED,
>> +    MIG_STATE_CANCELLING,
>> +    MIG_STATE_ACTIVE,
>> +    MIG_STATE_CHECKPOINTING,
>> +    MIG_STATE_COMPLETED,
>> +};
>> +
>> +int mc_enable_buffering(void);
>> +int mc_start_buffer(void);
>> +void mc_init_checkpointer(MigrationState *s);
>> +void mc_process_incoming_checkpoints_if_requested(QEMUFile *f);
>>   
>>   void ram_handle_compressed(void *host, uint8_t ch, uint64_t size);
>>   
> This clearly don't work on this patch.
>
> Rest of it is ok.
>
> Later, Juan.
>

Yeah, I was lazy - I'll split it out correctly next time.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states
  2014-03-11 21:40   ` Eric Blake
@ 2014-04-04  3:12     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:12 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines

On 03/12/2014 05:40 AM, Eric Blake wrote:
>> +++ b/qapi-schema.json
>> @@ -169,6 +169,8 @@
>>   #
>>   # @save-vm: guest is paused to save the VM state
>>   #
>> +# @checkpoint-vm: guest is paused to checkpoint the VM state
>> +#
> It would be nice to mention '(since 2.1)'.

Acknowledged.

>>   # @shutdown: guest is shut down (and -no-shutdown is in use)
>>   #
>>   # @suspended: guest is suspended (ACPI S3)
>> @@ -181,7 +183,7 @@
>>     'data': [ 'debug', 'inmigrate', 'internal-error', 'io-error', 'paused',
>>               'postmigrate', 'prelaunch', 'finish-migrate', 'restore-vm',
>>               'running', 'save-vm', 'shutdown', 'suspended', 'watchdog',
>> -            'guest-panicked' ] }
>> +            'guest-panicked', 'checkpoint-vm' ] }
> It would also be nice to document the enum variables in the same order
> you declare them.  The declaration of 'checkpoint-vm' does not have to
> be at the end; we use named enums precisely so that the QMP interface is
> not tied to the C integer value underlying the enum.
>

Acknowledged.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
  2014-03-11 21:45   ` Eric Blake
@ 2014-04-04  3:15     ` Michael R. Hines
  2014-04-04  4:22       ` Eric Blake
  0 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:15 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines

On 03/12/2014 05:45 AM, Eric Blake wrote:
>> +++ b/qapi-schema.json
>> @@ -603,6 +603,36 @@
>>              'cache-miss': 'int', 'overflow': 'int' } }
>>   
>>   ##
>> +# @MCStats
>> +#
>> +# Detailed Micro Checkpointing (MC) statistics
>> +#
>> +# @mbps: throughput of transmitting last MC
>> +#
>> +# @xmit-time: milliseconds to transmit last MC
> Trailing whitespace.
>
> Rather than abbreviate, how about naming this 'transmit-time'.

Acknowledged.

>> +#
>> +# @checkpoints: cummulative total number of MCs generated
> More trailing whitespace.  Please run your series through
> scripts/checkpatch.pl.
>
> s/cummulative total/cumulative/

Acknowledged.


>> +#
>> +# Since: 2.x
>> +##
>> +{ 'type': 'MCStats',
>> +  'data': {'mbps': 'number',
>> +           'xmit-time': 'uint64',
>> +           'log-dirty-time': 'uint64',
>> +           'migration-bitmap-time': 'uint64',
>> +           'ram-copy-time': 'uint64',
>> +           'checkpoints' : 'uint64',
>> +           'copy-mbps': 'number' }}
> Again, it helps to document the fields in the same order as they are
> declared (no, it's not a hard requirement, but being nice to readers is
> always worth the effort).

Acknowledged.

>> +
>> +##
>>   # @MigrationInfo
>>   #
>>   # Information about current migration process.
>> @@ -624,6 +654,8 @@
>>   #                migration statistics, only returned if XBZRLE feature is on and
>>   #                status is 'active' or 'completed' (since 1.2)
>>   #
>> +# @mc: #options @MCStats containing details Micro-Checkpointing statistics
> s/options/optional/ - I'm assuming it is optional because it only
> appears when MC is in use.
>
> 'mc' is a rather short name, maybe 'micro-checkpoint' is better?

Funny. I thought 'micro-checkpoint' was too long, particularly
in the ./configure output and the QEMU Monitor 'help' command output =)

> Missing a '(since 2.1)' designation (or 2.x, as you used above as a
> placeholder, although obviously we'd fix the .x before actually bringing
> into mainline)
>

Acknowledged.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-03-11 21:49   ` Eric Blake
  2014-03-11 22:15     ` Juan Quintela
@ 2014-04-04  3:29     ` Michael R. Hines
  1 sibling, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:29 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines

On 03/12/2014 05:49 AM, Eric Blake wrote:
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index f3fc514..2066c76 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -888,7 +888,7 @@ ETEXI
>   		      "\n\t\t\t -b for migration without shared storage with"
>   		      " full copy of disk\n\t\t\t -i for migration without "
>   		      "shared storage with incremental copy of disk "
> -		      "(base image shared between src and destination)",
> + 		      "(base image shared between src and destination)",
> Spurious hunk.  Oh, I see - you managed to take TAB damage and make it
> worse with a space-TAB (I guess this file isn't tab-clean, like the
> .json file is).  Eww.

Ooops. =)

>>           .mhandler.cmd = hmp_migrate,
>>       },
>>   
>> @@ -965,6 +965,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>>   ETEXI
>>   
>>       {
>> +        .name       = "migrate-set-mc-delay",
> We're building up a LOT of migrate- tunable commands.  Maybe it's time
> to think about building a more generic migrate-set-parameter, which
> takes both the name of the parameter to set and its value, so that a
> single command serves all parameters, instead of needing a proliferation
> of commands.  Of course, for that to be useful, we also need a way to
> introspect which parameters can be tuned; whereas with the current
> approach of one command per parameter (well, 2 for set vs. get) the
> introspection is based on whether the command exists.
>

Well, unless there are any more strong objections, I didn't find it too
difficult to add a new command in QEMU, although I did find it quite
painful to expose this command in Libvirt - I had to modify something
like 5 or 6 (IIRC) different files in libvirt to accomplish the same goal.

Could we "merge" the commands into a single command at the
libvirt level instead of the QEMU level?

Is there any other "pressing" reason to merge them at the QEMU
level?


>> +++ b/qapi-schema.json
>> @@ -2160,6 +2160,19 @@
>>   { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>>   
>>   ##
>> +# @migrate-set-mc-delay
>> +#
>> +# Set delay (in milliseconds) between micro checkpoints.
>> +#
>> +# @value: maximum delay in milliseconds
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 2.x
>> +##
>> +{ 'command': 'migrate-set-mc-delay', 'data': {'value': 'int'} }
>> +
>> +##
> I hate write-only interfaces.  If I can set the parameter, I _also_ need
> a way to query the current value of the parameter.  Either an existing
> migration statistics output should be modified to include this new
> information, or you need to add a counterpart migrate-get-mc-delay command.
>

Totally forgot about that - will get a 'get' command in there ASAP.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-03-11 21:57   ` Eric Blake
@ 2014-04-04  3:38     ` Michael R. Hines
  2014-04-04  4:25       ` Eric Blake
  0 siblings, 1 reply; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:38 UTC (permalink / raw)
  To: Eric Blake, qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines

On 03/12/2014 05:57 AM, Eric Blake wrote:
> ---
>   qapi-schema.json | 36 +++++++++++++++++++++++++++++++++++-
>   1 file changed, 35 insertions(+), 1 deletion(-)
>
>> +#          Only for performance testing. (Since 2.x)
>> +#
>> +# @mc-rdma-copy: MC requires creating a local-memory checkpoint before
>> +#          transmission to the destination. This requires heavy use of
>> +#          memcpy() which dominates the processor pipeline. This option
>> +#          makes use of *local* RDMA to perform the copy instead of the CPU.
>> +#          Enabled by default only if the migration transport is RDMA.
>> +#          Disabled by default otherwise. (Since 2.x)
> How does that work?  If I query migration capabilities before requesting
> a migration, what state am I going to read?  Is there coupling where I
> would observe the state of this flag change merely because I did some
> other action?  And if so, then how do I know that explicitly setting
> this flag won't be undone by similar coupling?
>
> It sounds like you are describing a tri-state option (unspecified so
> default to migration transport, explicitly disabled, explicitly
> enabled); but that doesn't work for something that only lists boolean
> capabilities.  The only way around that is to have 2 separate
> capabilities (one on whether to base decision on transport or to honor
> override, and the other to provide the override value which is ignored
> when defaulting by transport).

Yes, now that I think about it, this 'tri-state' possibility is indeed
confusing to the management software. I'll stop this behavior
and instead require that it be manually enabled when needed.

>> +#
>> +# @rdma-keepalive: RDMA connections do not timeout by themselves if a peer
>> +#         has disconnected prematurely or failed. User-level keepalives
>> +#         allow the migration to abort cleanly if there is a problem with the
>> +#         destination host. For debugging, this can be problematic as
>> +#         the keepalive may cause the peer to abort prematurely if we are
>> +#         at a GDB breakpoint, for example.
>> +#         Enabled by default. (Since 2.x)
> Enabled-by-default is an interesting choice, but I suppose it is okay.

I'll rename the capability to "rdma-disable-keepalive" and change
the default to "disabled".

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC
  2014-03-11 21:57   ` Juan Quintela
@ 2014-04-04  3:50     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:50 UTC (permalink / raw)
  To: quintela
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 03/12/2014 05:57 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This patch sets up the initial changes to the migration state
>> machine and prototypes to be used by the checkpointing code
>> to interact with the state machine so that we can later handle
>> failure and recovery scenarios.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   arch_init.c                   | 29 ++++++++++++++++++++++++-----
>>   include/migration/migration.h |  2 ++
>>   migration.c                   | 37 +++++++++++++++++++++----------------
>>   3 files changed, 47 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index db75120..e9d4d9e 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -658,13 +658,13 @@ static void ram_migration_cancel(void *opaque)
>>       migration_end();
>>   }
>>   
>> -static void reset_ram_globals(void)
>> +static void reset_ram_globals(bool reset_bulk_stage)
>>   {
>>       last_seen_block = NULL;
>>       last_sent_block = NULL;
>>       last_offset = 0;
>>       last_version = ram_list.version;
>> -    ram_bulk_stage = true;
>> +    ram_bulk_stage = reset_bulk_stage;
>>   }
>>   
>>   #define MAX_WAIT 50 /* ms, half buffered_file limit */
>> @@ -674,6 +674,15 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>       RAMBlock *block;
>>       int64_t ram_pages = last_ram_offset() >> TARGET_PAGE_BITS;
>>   
>> +    /*
>> +     * RAM stays open during micro-checkpointing for the next transaction.
>> +     */
>> +    if (migration_is_mc(migrate_get_current())) {
>> +        qemu_mutex_lock_ramlist();
>> +        reset_ram_globals(false);
>> +        goto skip_setup;
>> +    }
>> +
>>       migration_bitmap = bitmap_new(ram_pages);
>>       bitmap_set(migration_bitmap, 0, ram_pages);
>>       migration_dirty_pages = ram_pages;
>> @@ -710,12 +719,14 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
>>       qemu_mutex_lock_iothread();
>>       qemu_mutex_lock_ramlist();
>>       bytes_transferred = 0;
>> -    reset_ram_globals();
>> +    reset_ram_globals(true);
>>   
>>       memory_global_dirty_log_start();
>>       migration_bitmap_sync();
>>       qemu_mutex_unlock_iothread();
>>   
>> +skip_setup:
>> +
>>       qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
>>   
>>       QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>> @@ -744,7 +755,7 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
>>       qemu_mutex_lock_ramlist();
>>   
>>       if (ram_list.version != last_version) {
>> -        reset_ram_globals();
>> +        reset_ram_globals(true);
>>       }
>>   
>>       ram_control_before_iterate(f, RAM_CONTROL_ROUND);
>> @@ -825,7 +836,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
>>       }
>>   
>>       ram_control_after_iterate(f, RAM_CONTROL_FINISH);
>> -    migration_end();
>> +
>> +    /*
>> +     * Only cleanup at the end of normal migrations
>> +     * or if the MC destination failed and we got an error.
>> +     * Otherwise, we are (or will soon be) in MIG_STATE_CHECKPOINTING.
>> +     */
>> +    if(!migrate_use_mc() || migration_has_failed(migrate_get_current())) {
>> +        migration_end();
>> +    }
>>   
>>       qemu_mutex_unlock_ramlist();
>>       qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
>
>
> I haven't looked at the code in detail, but what we have here is
> esentially:
>
>
> ram_save_complete()
> {
>     code not needed for mc
>     common codo for migration and mc
>     code not needed for mc
> }
>
> Similar code on ram_save_setup.  Yes, I know that there are some locking
> issues here.
>
>
> SHould we be able do do something like
>
> __ram_save_complete()
> {
>      common code
> }
>
> mc_ram_save_complete()
> {
>      # Possible something else here
>      __ram_save_complete()
> }
>
> rest_ram_save_complete()
> {
>      code not needed for mc
>      __ram_save_complete()
>      code not needed for mc
> }
>
> My problem here is that current code is already quite complex and
> convoluted.  At some point we are going to need to change it to
> something that is easier to understand?

Understood: so it looks like we need some "accessor" function
pointers or something here, similar to the way Paolo suggested
that we break out "before" and "after" iteration methods for
localhost migration and RDMA migration.

I'll cook something up and re-submit.
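
Something along these lines, perhaps (purely a sketch under assumed
names; none of these exist in the series yet):

typedef struct QEMUFile QEMUFile;   /* opaque, as in QEMU proper */

/* Hypothetical hook table splitting shared vs. migration-only work. */
typedef struct RAMSaveHooks {
    void (*pre)(QEMUFile *f);       /* skipped while checkpointing */
    void (*common)(QEMUFile *f);    /* shared by migration and MC */
    void (*post)(QEMUFile *f);      /* skipped while checkpointing */
} RAMSaveHooks;

static void ram_save_complete_with(const RAMSaveHooks *h, QEMUFile *f,
                                   bool checkpointing)
{
    if (!checkpointing && h->pre) {
        h->pre(f);
    }
    h->common(f);
    if (!checkpointing && h->post) {
        h->post(f);
    }
}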

>
>> -enum {
>> -    MIG_STATE_ERROR = -1,
>> -    MIG_STATE_NONE,
>> -    MIG_STATE_SETUP,
>> -    MIG_STATE_CANCELLING,
>> -    MIG_STATE_CANCELLED,
>> -    MIG_STATE_ACTIVE,
>> -    MIG_STATE_COMPLETED,
>> -};
>> -
> Here comes the code seen on the previous patch O:-)
>
>>   
>> -static void migrate_set_state(MigrationState *s, int old_state, int new_state)
>> +bool migration_is_active(MigrationState *s)
>> +{
>> +    return (s->state == MIG_STATE_ACTIVE) || s->state == MIG_STATE_SETUP
>> +            || s->state == MIG_STATE_CHECKPOINTING;
>> +}
> The whole idea of moving MIG_STATE_* to this file was to "force" all
> other users to use accessor functions.  This way we know what the others
> expect frum us.

Acknowledged - I'll work on creating (or enhancing) the accessor
functions to avoid moving the flags again.

>> -    assert(s->state != MIG_STATE_ACTIVE);
>> +    assert(!migration_is_active(s));
> I can understand that we want here MIG_STATE_CHECKPOINTING, but _SETUP?
> Or it is a bug on upstream?

My fault, I think there was some merge breakage here when I started
walking through the diff. Ignore this one for now...

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
  2014-03-11 21:59   ` Juan Quintela
@ 2014-04-04  3:55     ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:55 UTC (permalink / raw)
  To: quintela
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 03/12/2014 05:59 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> MC provides a lot of new information, including the same RAM statistics
>> that ordinary migration does, so we centralize a lot of that printing
>> code into a common function so that the QMP printing statements don't
>> get duplicated too much.
>>
>> We also introduce a new MCStats structure (like MigrationStats) due
>> to the large number of non-migration related statistics - don't want
>> to confuse migration and MC too much, so let's keep them separate for now.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> We can add the non-mc stats if you split them.  And you get a smaller
> series.

Well, the MIG_STATE_COMPLETED and the MIG_STATE_ACTIVE cleanup
is removing a ton of duplicated code anyway - that needed to be cleaned
up regardless.

But, for the MC-related statistics, this really is a completely new state
in the migration state machine with so many new statistics - I don't think
they belong in the MigrationInfo structure at all.

- Michael

> Later, Juan.
>
>
>> ---
>>   hmp.c                         | 17 +++++++++
>>   include/migration/migration.h |  6 +++
>>   migration.c                   | 86 ++++++++++++++++++++++++++-----------------
>>   qapi-schema.json              | 33 +++++++++++++++++
>>   4 files changed, 109 insertions(+), 33 deletions(-)
>>
>> diff --git a/hmp.c b/hmp.c
>> index 1af0809..edf062e 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -203,6 +203,23 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>>                          info->disk->total >> 10);
>>       }
>>   
>> +    if (info->has_mc) {
>> +        monitor_printf(mon, "checkpoints: %" PRIu64 "\n",
>> +                       info->mc->checkpoints);
>> +        monitor_printf(mon, "xmit_time: %" PRIu64 " ms\n",
>> +                       info->mc->xmit_time);
>> +        monitor_printf(mon, "log_dirty_time: %" PRIu64 " ms\n",
>> +                       info->mc->log_dirty_time);
>> +        monitor_printf(mon, "migration_bitmap_time: %" PRIu64 " ms\n",
>> +                       info->mc->migration_bitmap_time);
>> +        monitor_printf(mon, "ram_copy_time: %" PRIu64 " ms\n",
>> +                       info->mc->ram_copy_time);
>> +        monitor_printf(mon, "copy_mbps: %0.2f mbps\n",
>> +                       info->mc->copy_mbps);
>> +        monitor_printf(mon, "throughput: %0.2f mbps\n",
>> +                       info->mc->mbps);
>> +    }
>> +
>>       if (info->has_xbzrle_cache) {
>>           monitor_printf(mon, "cache size: %" PRIu64 " bytes\n",
>>                          info->xbzrle_cache->cache_size);
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index e876a2c..f18ff5e 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -53,14 +53,20 @@ struct MigrationState
>>       int state;
>>       MigrationParams params;
>>       double mbps;
>> +    double copy_mbps;
>>       int64_t total_time;
>>       int64_t downtime;
>>       int64_t expected_downtime;
>> +    int64_t xmit_time;
>> +    int64_t ram_copy_time;
>> +    int64_t log_dirty_time;
>> +    int64_t bitmap_time;
>>       int64_t dirty_pages_rate;
>>       int64_t dirty_bytes_rate;
>>       bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
>>       int64_t xbzrle_cache_size;
>>       int64_t setup_time;
>> +    int64_t checkpoints;
>>   };
>>   
>>   void process_incoming_migration(QEMUFile *f);
>> diff --git a/migration.c b/migration.c
>> index f42dae4..0ccbeaa 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -59,7 +59,6 @@ MigrationState *migrate_get_current(void)
>>           .state = MIG_STATE_NONE,
>>           .bandwidth_limit = MAX_THROTTLE,
>>           .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
>> -        .mbps = -1,
>>       };
>>   
>>       return &current_migration;
>> @@ -173,6 +172,31 @@ static void get_xbzrle_cache_stats(MigrationInfo *info)
>>       }
>>   }
>>   
>> +static void get_ram_stats(MigrationState *s, MigrationInfo *info)
>> +{
>> +    info->has_total_time = true;
>> +    info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
>> +        - s->total_time;
>> +
>> +    info->has_ram = true;
>> +    info->ram = g_malloc0(sizeof(*info->ram));
>> +    info->ram->transferred = ram_bytes_transferred();
>> +    info->ram->total = ram_bytes_total();
>> +    info->ram->duplicate = dup_mig_pages_transferred();
>> +    info->ram->skipped = skipped_mig_pages_transferred();
>> +    info->ram->normal = norm_mig_pages_transferred();
>> +    info->ram->normal_bytes = norm_mig_bytes_transferred();
>> +    info->ram->mbps = s->mbps;
>> +
>> +    if (blk_mig_active()) {
>> +        info->has_disk = true;
>> +        info->disk = g_malloc0(sizeof(*info->disk));
>> +        info->disk->transferred = blk_mig_bytes_transferred();
>> +        info->disk->remaining = blk_mig_bytes_remaining();
>> +        info->disk->total = blk_mig_bytes_total();
>> +    }
>> +}
>> +
>>   MigrationInfo *qmp_query_migrate(Error **errp)
>>   {
>>       MigrationInfo *info = g_malloc0(sizeof(*info));
>> @@ -199,26 +223,8 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>>           info->has_setup_time = true;
>>           info->setup_time = s->setup_time;
>>   
>> -        info->has_ram = true;
>> -        info->ram = g_malloc0(sizeof(*info->ram));
>> -        info->ram->transferred = ram_bytes_transferred();
>> -        info->ram->remaining = ram_bytes_remaining();
>> -        info->ram->total = ram_bytes_total();
>> -        info->ram->duplicate = dup_mig_pages_transferred();
>> -        info->ram->skipped = skipped_mig_pages_transferred();
>> -        info->ram->normal = norm_mig_pages_transferred();
>> -        info->ram->normal_bytes = norm_mig_bytes_transferred();
>> +        get_ram_stats(s, info);
>>           info->ram->dirty_pages_rate = s->dirty_pages_rate;
>> -        info->ram->mbps = s->mbps;
>> -
>> -        if (blk_mig_active()) {
>> -            info->has_disk = true;
>> -            info->disk = g_malloc0(sizeof(*info->disk));
>> -            info->disk->transferred = blk_mig_bytes_transferred();
>> -            info->disk->remaining = blk_mig_bytes_remaining();
>> -            info->disk->total = blk_mig_bytes_total();
>> -        }
>> -
>>           get_xbzrle_cache_stats(info);
>>           break;
>>       case MIG_STATE_COMPLETED:
>> @@ -227,22 +233,37 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>>           info->has_status = true;
>>           info->status = g_strdup("completed");
>>           info->has_total_time = true;
>> -        info->total_time = s->total_time;
>> +        info->total_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME)
>> +            - s->total_time;
>>           info->has_downtime = true;
>>           info->downtime = s->downtime;
>>           info->has_setup_time = true;
>>           info->setup_time = s->setup_time;
>>   
>> -        info->has_ram = true;
>> -        info->ram = g_malloc0(sizeof(*info->ram));
>> -        info->ram->transferred = ram_bytes_transferred();
>> -        info->ram->remaining = 0;
>> -        info->ram->total = ram_bytes_total();
>> -        info->ram->duplicate = dup_mig_pages_transferred();
>> -        info->ram->skipped = skipped_mig_pages_transferred();
>> -        info->ram->normal = norm_mig_pages_transferred();
>> -        info->ram->normal_bytes = norm_mig_bytes_transferred();
>> -        info->ram->mbps = s->mbps;
>> +        get_ram_stats(s, info);
>> +        break;
>> +    case MIG_STATE_CHECKPOINTING:
>> +        info->has_status = true;
>> +        info->status = g_strdup("checkpointing");
>> +        info->has_setup_time = true;
>> +        info->setup_time = s->setup_time;
>> +        info->has_downtime = true;
>> +        info->downtime = s->downtime;
>> +
>> +        get_ram_stats(s, info);
>> +        info->ram->dirty_pages_rate = s->dirty_pages_rate;
>> +        get_xbzrle_cache_stats(info);
>> +
>> +
>> +        info->has_mc = true;
>> +        info->mc = g_malloc0(sizeof(*info->mc));
>> +        info->mc->xmit_time = s->xmit_time;
>> +        info->mc->log_dirty_time = s->log_dirty_time;
>> +        info->mc->migration_bitmap_time = s->bitmap_time;
>> +        info->mc->ram_copy_time = s->ram_copy_time;
>> +        info->mc->copy_mbps = s->copy_mbps;
>> +        info->mc->mbps = s->mbps;
>> +        info->mc->checkpoints = s->checkpoints;
>>           break;
>>       case MIG_STATE_ERROR:
>>           info->has_status = true;
>> @@ -646,8 +667,7 @@ static void *migration_thread(void *opaque)
>>               double bandwidth = transferred_bytes / time_spent;
>>               max_size = bandwidth * migrate_max_downtime() / 1000000;
>>   
>> -            s->mbps = time_spent ? (((double) transferred_bytes * 8.0) /
>> -                    ((double) time_spent / 1000.0)) / 1000.0 / 1000.0 : -1;
>> +            s->mbps = MBPS(transferred_bytes, time_spent);
>>   
>>               DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
>>                       " bandwidth %g max_size %" PRId64 "\n",
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 3c2ee4d..7306adc 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -603,6 +603,36 @@
>>              'cache-miss': 'int', 'overflow': 'int' } }
>>   
>>   ##
>> +# @MCStats
>> +#
>> +# Detailed Micro Checkpointing (MC) statistics
>> +#
>> +# @mbps: throughput of transmitting last MC
>> +#
>> +# @xmit-time: milliseconds to transmit last MC
>> +#
>> +# @log-dirty-time: milliseconds to GET_LOG_DIRTY for last MC
>> +#
>> +# @migration-bitmap-time: milliseconds to prepare dirty bitmap for last MC
>> +#
>> +# @ram-copy-time: milliseconds to ram_save_live() last MC to staging memory
>> +#
>> +# @copy-mbps: throughput of ram_save_live() to staging memory for last MC
>> +#
>> +# @checkpoints: cumulative total number of MCs generated
>> +#
>> +# Since: 2.x
>> +##
>> +{ 'type': 'MCStats',
>> +  'data': {'mbps': 'number',
>> +           'xmit-time': 'uint64',
>> +           'log-dirty-time': 'uint64',
>> +           'migration-bitmap-time': 'uint64',
>> +           'ram-copy-time': 'uint64',
>> +           'checkpoints' : 'uint64',
>> +           'copy-mbps': 'number' }}
>> +
>> +##
>>   # @MigrationInfo
>>   #
>>   # Information about current migration process.
>> @@ -624,6 +654,8 @@
>>   #                migration statistics, only returned if XBZRLE feature is on and
>>   #                status is 'active' or 'completed' (since 1.2)
>>   #
>> +# @mc: #options @MCStats containing details Micro-Checkpointing statistics
>> +#
>>   # @total-time: #optional total amount of milliseconds since migration started.
>>   #        If migration has ended, it returns the total migration
>>   #        time. (since 1.2)
>> @@ -648,6 +680,7 @@
>>     'data': {'*status': 'str', '*ram': 'MigrationStats',
>>              '*disk': 'MigrationStats',
>>              '*xbzrle-cache': 'XBZRLECacheStats',
>> +           '*mc': 'MCStats',
>>              '*total-time': 'int',
>>              '*expected-downtime': 'int',
>>              '*downtime': 'int',
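
As an aside, a minimal sketch of what the MBPS() helper used in the
migration_thread() hunk above might expand to, reconstructed from the
inline formula that hunk removes (an assumption, not the series' actual
definition):

    /* Hypothetical reconstruction: megabits per second from bytes
     * transferred and milliseconds spent, yielding -1 when no time has
     * elapsed - mirroring the inline formula removed in the diff above. */
    #define MBPS(bytes, time_spent) \
        ((time_spent) ? ((((double) (bytes) * 8.0) / \
                          ((double) (time_spent) / 1000.0)) / 1000.0 / 1000.0) \
                      : -1.0)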

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-03-11 22:02   ` Juan Quintela
  2014-03-11 22:07     ` Eric Blake
@ 2014-04-04  3:56     ` Michael R. Hines
  1 sibling, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:56 UTC (permalink / raw)
  To: quintela
  Cc: GILR, SADEKJ, BIRAN, hinesmr, qemu-devel, EREZH, owasserm, onom,
	junqing.wang, lig.fnst, gokul, dbulkow, pbonzini, abali,
	isaku.yamahata, Michael R. Hines

On 03/12/2014 06:02 AM, Juan Quintela wrote:
> mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> New capabilities include the use of RDMA acceleration,
>> use of network buffering, and keepalive support, as documented
>> in patch #1.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   qapi-schema.json | 36 +++++++++++++++++++++++++++++++++++-
>>   1 file changed, 35 insertions(+), 1 deletion(-)
>>
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 98abdac..1fdf208 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -720,10 +720,44 @@
>>   # @auto-converge: If enabled, QEMU will automatically throttle down the guest
>>   #          to speed up convergence of RAM migration. (since 1.6)
>>   #
>> +# @mc: The migration will never end, and the VM will instead be continuously
>> +#          micro-checkpointed (MC). Use the command migrate-set-mc-delay to
>> +#          control the frequency at which the checkpoints occur.
>> +#          Disabled by default. (Since 2.x)
>> +#
>> +# @mc-net-disable: Deactivate network buffering against outbound network
>> +#          traffic while Micro-Checkpointing (@mc) is active.
>> +#          Enabled by default. Disabling will make the MC protocol inconsistent
>> +#          and potentially break network connections upon an actual failure.
>> +#          Only for performance testing. (Since 2.x)
> If it is dangerous, can we put dangerous/unsafe on the name?  Having an option that
> can corrupt things makes me nervous.

You got it =)

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-03-11 22:07     ` Eric Blake
@ 2014-04-04  3:57       ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  3:57 UTC (permalink / raw)
  To: Eric Blake, quintela
  Cc: GILR, SADEKJ, pbonzini, abali, qemu-devel, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines

On 03/12/2014 06:07 AM, Eric Blake wrote:
> On 03/11/2014 04:02 PM, Juan Quintela wrote:
>> mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>
>>> +# @mc-net-disable: Deactivate network buffering against outbound network
>>> +#          traffic while Micro-Checkpointing (@mc) is active.
>>> +#          Enabled by default. Disabling will make the MC protocol inconsistent
>>> +#          and potentially break network connections upon an actual failure.
>>> +#          Only for performance testing. (Since 2.x)
>> If it is dangerous, can we put dangerous/unsafe on the name?  Having an option that
>> can corrupt things makes me nervous.
> Or even name it x-mc-net-disable, so that we reserve the right to remove
> it, as well as make it obvious that management must not try to tune it,
> only developers.
>

Good idea..... will do.

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing
  2014-04-04  3:15     ` Michael R. Hines
@ 2014-04-04  4:22       ` Eric Blake
  0 siblings, 0 replies; 68+ messages in thread
From: Eric Blake @ 2014-04-04  4:22 UTC (permalink / raw)
  To: Michael R. Hines, qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines


On 04/03/2014 09:15 PM, Michael R. Hines wrote:

>>>   #
>>> +# @mc: #options @MCStats containing details Micro-Checkpointing statistics
>> s/options/optional/ - I'm assuming it is optional because it only
>> appears when MC is in use.
>>
>> 'mc' is a rather short name, maybe 'micro-checkpoint' is better?
> 
> Funny. I thought 'micro-checkpoint' was too long, particularly
> in the ./configure output and the QEMU Monitor 'help' command output =)

For ./configure output, short is fine.  For QMP, it's a programmatic
interface, and management programs like libvirt have no problem
outputting longer names.  Where the longer names are nice is when
reading logs generated by a management program, as we tried hard to make
the QMP wire format still seem more or less legible to a human, even
though humans aren't the primary driver of the interface.  At the end of
the day, though, the name doesn't matter, so it's not worth arguing what
color to paint the bikeshed - you as author have some sway on what name
you want :)

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing
  2014-04-04  3:38     ` Michael R. Hines
@ 2014-04-04  4:25       ` Eric Blake
  0 siblings, 0 replies; 68+ messages in thread
From: Eric Blake @ 2014-04-04  4:25 UTC (permalink / raw)
  To: Michael R. Hines, qemu-devel
  Cc: GILR, SADEKJ, pbonzini, quintela, abali, EREZH, owasserm, onom,
	hinesmr, isaku.yamahata, gokul, dbulkow, junqing.wang, BIRAN,
	lig.fnst, Michael R. Hines


On 04/03/2014 09:38 PM, Michael R. Hines wrote:

>>> +# @rdma-keepalive: RDMA connections do not timeout by themselves if a peer
>>> +#         has disconnected prematurely or failed. User-level keepalives
>>> +#         allow the migration to abort cleanly if there is a problem with the
>>> +#         destination host. For debugging, this can be problematic as
>>> +#         the keepalive may cause the peer to abort prematurely if we are
>>> +#         at a GDB breakpoint, for example.
>>> +#         Enabled by default. (Since 2.x)
>> Enabled-by-default is an interesting choice, but I suppose it is okay.
> 
> I'll rename the command to "rdma-disable-keepalive" and change
> the default to "disabled".

Hopefully this doesn't lead to awkward double-negative interpretation
questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-03-11 22:49       ` Eric Blake
@ 2014-04-04  5:29         ` Michael R. Hines
  2014-04-04 14:56           ` Eric Blake
  2014-04-04 16:28           ` Dr. David Alan Gilbert
  0 siblings, 2 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-04  5:29 UTC (permalink / raw)
  To: Eric Blake, quintela
  Cc: GILR, SADEKJ, pbonzini, abali, qemu-devel, EREZH,
	Luiz Capitulino, owasserm, onom, hinesmr, isaku.yamahata, gokul,
	dbulkow, junqing.wang, BIRAN, lig.fnst, Michael R. Hines

On 03/12/2014 06:49 AM, Eric Blake wrote:
> On 03/11/2014 04:15 PM, Juan Quintela wrote:
>> Eric Blake <eblake@redhat.com> wrote:
>>> On 02/18/2014 01:50 AM, mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>> We're building up a LOT of migrate- tunable commands.  Maybe it's time
>>> to think about building a more generic migrate-set-parameter, which
>>> takes both the name of the parameter to set and its value, so that a
>>> single command serves all parameters, instead of needing a proliferation
>>> of commands.  Of course, for that to be useful, we also need a way to
>>> introspect which parameters can be tuned; whereas with the current
>>> approach of one command per parameter (well, 2 for set vs. get) the
>>> introspection is based on whether the command exists.
>> I asked to have that.  My suggestion was that
>>
>> migrate_set_capability auto-throttle on
>>
>> So we could add new variables without extra changes.
>>
>> And I agree that having a way to read them, and ask what values they
>> have is a good idea.
>>
>> Luiz, any good idea about how to do it through QMP?
> I'm trying to think of a back-compat method, which exploits the fact
> that we now have flat unions (something we didn't have when
> migrate-set-capabilities was first added).  Maybe something like:
>
> { 'type': 'MigrationCapabilityBase',
>    'data': { 'capability': 'MigrationCapability' } }
> { 'type': 'MigrationCapabilityBool',
>    'data': { 'state': 'bool' } }
> { 'type': 'MigrationCapabilityInt',
>    'data': { 'value': 'int' } }
> { 'union': 'MigrationCapabilityStatus',
>    'base': 'MigrationCapabilityBase',
>    'discriminator': 'capability',
>    'data': {
>      'xbzrle': 'MigrationCapabilityBool',
>      'auto-converge': 'MigrationCapabilityBool',
> ...
>      'mc-delay': 'MigrationCapabilityInt'
>    } }
>
> along with a tweak to query-migrate-capabilities for full back-compat:
>
> # @query-migrate-capabilities
> # @extended: #optional defaults to false; set to true to see non-boolean
> # capabilities (since 2.1)
> { 'command': 'query-migrate-capabilities',
>    'data': { '*extended': 'bool' },
>    'returns': ['MigrationCapabilityStatus'] }
>
> Now, observe what happens.  If an old client calls { "execute":
> "query-migrate-capabilities" }, they get a return that lists ONLY the
> boolean members of the MigrationCapabilityStatus array (good, because if
> we returned a non-boolean, we would confuse the consumer when they are
> expecting a 'state' variable that is not present) - what's more, this
> representation is identical on the wire to the format used in earlier
> qemu.  But new clients can call { "execute":
> "query-migrate-capabilities", "arguments": { "extended": true } }, and
> get back:
>
> { "capabilities": [
>     { "capability": "xbzrle", "state": true },
>     { "capability": "auto-converge", "state": false },
> ...
>     { "capability": "mc-delay", "value": 100 }
>    ] }
>
> Also, once a new client has learned of non-boolean extended
> capabilities, they can also set them via the existing command:
> { "execute": "migrate-set-capabilities",
>    "arguments": [
>       { "capability": "xbzrle", "state": false },
>       { "capability": "mc-delay", "value": 200 }
>    ] }
>
> So, what do you think?  My slick type manipulation means that we need
> zero new commands, just a new option to the query command, and a new
> flat union type that replaces the current struct type.  The existence
> (but not the type) of non-boolean parameters is already introspectible
> to a client new enough to request an 'extended' query, and down the
> road, if we ever gain full QAPI introspection, then a client also would
> gain the ability to learn the type of any non-boolean parameter as well.
>   Stability wise, as long as we never change the type of a capability
> once first exposed, then if a client plans on using a particular
> parameter when available, it can already hard-code what type that
> parameter should have without even needing full QAPI introspection (that
> is, if libvirt is taught to manipulate mc-delay, libvirt will already
> know to expect mc-delay as an int, and not any other type, and merely
> needs to probe if qemu supports the 'mc-delay' extended capability).
> And of course, this new schema idea can retroactively cover all existing
> migration tunables, such as migrate_set_downtime, migrate_set_speed,
> migrate-set-cache-size, and so on.

I like this a lot - it's very complicated, but it is clean, I think.

Maybe you should also add some "reserved" fields to the union, in case
you want to expand the number of its members in the future?

- Michael

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-04-04  5:29         ` Michael R. Hines
@ 2014-04-04 14:56           ` Eric Blake
  2014-04-11  6:10             ` Michael R. Hines
  2014-04-04 16:28           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Blake @ 2014-04-04 14:56 UTC (permalink / raw)
  To: Michael R. Hines, quintela
  Cc: GILR, SADEKJ, pbonzini, abali, qemu-devel, EREZH,
	Luiz Capitulino, owasserm, onom, hinesmr, isaku.yamahata, gokul,
	dbulkow, junqing.wang, BIRAN, lig.fnst, Michael R. Hines


On 04/03/2014 11:29 PM, Michael R. Hines wrote:
>> I'm trying to think of a back-compat method, which exploits the fact
>> that we now have flat unions (something we didn't have when
>> migrate-set-capabilities was first added).  Maybe something like:
>>
>> { 'type': 'MigrationCapabilityBase',
>>    'data': { 'capability': 'MigrationCapability' } }
>> { 'type': 'MigrationCapabilityBool',
>>    'data': { 'state': 'bool' } }
>> { 'type': 'MigrationCapabilityInt',
>>    'data': { 'value': 'int' } }
>> { 'union': 'MigrationCapabilityStatus',
>>    'base': 'MigrationCapabilityBase',
>>    'discriminator': 'capability',
>>    'data': {
>>      'xbzrle': 'MigrationCapabilityBool',
>>      'auto-converge': 'MigrationCapabilityBool',
>> ...
>>      'mc-delay': 'MigrationCapabilityInt'
>>    } }
>>
>> along with a tweak to query-migrate-capabilities for full back-compat:
>>
>> # @query-migrate-capabilities
>> # @extended: #optional defaults to false; set to true to see non-boolean
>> # capabilities (since 2.1)
>> { 'command': 'query-migrate-capabilities',
>>    'data': { '*extended': 'bool' },
>>    'returns': ['MigrationCapabilityStatus'] }
>>

> 
> I like this a lot - it's very complicated, but it is clean, I think.

Good - that means I made sense in trying to explain it.  And the more I
re-read my mail, the more I like the idea - fewer new commands, and make
the existing commands both more powerful and more easily extensible, all
while still being discoverable by libvirt without waiting for full
schema introspection.

> 
> Maybe you should also add some "reserved" fields to the union, in case
> you want to expand the number of its members in the future?

Adding members to a union is back-compat safe.  No need to reserve
slots, we just add them when we have a use for them.  Besides, how would
you reserve a slot?  QAPI requires a name (not just a type) - but unless
you know what to name your slot, you can't really reserve it.  (We are
NOT worried about C ABI compatibility, where a union type must be large
enough to occupy enough bytes for any future larger structs carved into
the union - we are only worried about QAPI API compatibility which
requires a name for each branch of the union)
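
That C-ABI contrast can be made concrete with a hedged, stand-alone sketch
(this is not QEMU code, and CapabilityValue is a made-up name): a C union's
size is baked into the ABI, so growth must be reserved in bytes up front,
whereas a QAPI union only ever grows by adding a new named branch:

    #include <stdbool.h>
    #include <stdint.h>

    /* C-ABI style: the union's size is fixed at compile time, so any
     * future, larger member needs its bytes reserved today. */
    union CapabilityValue {
        bool    state;          /* boolean capability */
        int64_t value;          /* integer tunable */
        char    reserved[32];   /* byte-level slot reservation for growth */
    };

    /* QAPI has no such constraint: a new branch is just a new name in
     * the schema, so there is nothing meaningful to reserve in advance. */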

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-04-04  5:29         ` Michael R. Hines
  2014-04-04 14:56           ` Eric Blake
@ 2014-04-04 16:28           ` Dr. David Alan Gilbert
  2014-04-04 16:35             ` Eric Blake
  1 sibling, 1 reply; 68+ messages in thread
From: Dr. David Alan Gilbert @ 2014-04-04 16:28 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: GILR, SADEKJ, quintela, BIRAN, hinesmr, qemu-devel, EREZH,
	Luiz Capitulino, owasserm, junqing.wang, onom, abali, lig.fnst,
	gokul, dbulkow, pbonzini, isaku.yamahata, Michael R. Hines

One thing to be a little careful about if we merge these tunables
together is which tunables are allowed to be changed while the migration
is running.  The 'capabilities' are currently fixed once the migration
starts, but I know people want to change at least some of the tunables
while the migration is in progress - and some care is needed there,
since (as we found with the xbzrle cache size) things get tricky when
the value is used from a different thread.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-04-04 16:28           ` Dr. David Alan Gilbert
@ 2014-04-04 16:35             ` Eric Blake
  0 siblings, 0 replies; 68+ messages in thread
From: Eric Blake @ 2014-04-04 16:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Michael R. Hines
  Cc: GILR, SADEKJ, quintela, hinesmr, qemu-devel, EREZH,
	Luiz Capitulino, owasserm, junqing.wang, onom, abali, lig.fnst,
	gokul, dbulkow, pbonzini, BIRAN, isaku.yamahata,
	Michael R. Hines


On 04/04/2014 10:28 AM, Dr. David Alan Gilbert wrote:
> One thing to be a little careful about if we merge these tunables
> together, is what tunables are allowed to be changed while the migration
> is running.  The 'capabilities' are currently fixed once the migration
> starts, but I know at least some of the tuneables people want to change
> while things are going - and some care is needed with it since (as we
> found with the xbzrle cache size) we get fun due to the use being in
> a different thread.

That could be solved by adding an annotation to the extended query output
marking which tunables are live, and by making the set command reject any
change to a tunable that is not live once migration is already underway.
But yes, worth thinking about.
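
A minimal, hedged sketch of such a guard - every name below is
hypothetical, invented for illustration rather than taken from the series:

    #include <stdbool.h>

    /* Hypothetical capability ids, for illustration only. */
    typedef enum {
        CAP_XBZRLE,         /* fixed once migration starts      */
        CAP_AUTO_CONVERGE,  /* fixed once migration starts      */
        CAP_MC_DELAY        /* live tunable, adjustable mid-run */
    } Capability;

    /* Hypothetical annotation: is this tunable safe to change live? */
    static bool cap_is_live(Capability cap)
    {
        return cap == CAP_MC_DELAY;
    }

    /* The set command would reject a change to any non-live tunable
     * once migration is already underway. */
    static bool cap_change_allowed(Capability cap, bool migration_active)
    {
        return !migration_active || cap_is_live(cap);
    }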

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency
  2014-04-04 14:56           ` Eric Blake
@ 2014-04-11  6:10             ` Michael R. Hines
  0 siblings, 0 replies; 68+ messages in thread
From: Michael R. Hines @ 2014-04-11  6:10 UTC (permalink / raw)
  To: Eric Blake, quintela
  Cc: GILR, SADEKJ, pbonzini, abali, qemu-devel, EREZH,
	Luiz Capitulino, owasserm, onom, hinesmr, isaku.yamahata, gokul,
	dbulkow, junqing.wang, BIRAN, lig.fnst, Michael R. Hines


On 04/04/2014 10:56 PM, Eric Blake wrote:
> On 04/03/2014 11:29 PM, Michael R. Hines wrote:
>>> I'm trying to think of a back-compat method, which exploits the fact
>>> that we now have flat unions (something we didn't have when
>>> migrate-set-capabilities was first added).  Maybe something like:
>>>
>>> { 'type': 'MigrationCapabilityBase',
>>>     'data': { 'capability': 'MigrationCapability' } }
>>> { 'type': 'MigrationCapabilityBool',
>>>     'data': { 'state': 'bool' } }
>>> { 'type': 'MigrationCapabilityInt',
>>>     'data': { 'value': 'int' } }
>>> { 'union': 'MigrationCapabilityStatus',
>>>     'base': 'MigrationCapabilityBase',
>>>     'discriminator': 'capability',
>>>     'data': {
>>>       'xbzrle': 'MigrationCapabilityBool',
>>>       'auto-converge': 'MigrationCapabilityBool',
>>> ...
>>>       'mc-delay': 'MigrationCapabilityInt'
>>>     } }
>>>
>>> along with a tweak to query-migrate-capabilities for full back-compat:
>>>
>>> # @query-migrate-capabilities
>>> # @extended: #optional defaults to false; set to true to see non-boolean
>>> # capabilities (since 2.1)
>>> { 'command': 'query-migrate-capabilities',
>>>     'data': { '*extended': 'bool' },
>>>     'returns': ['MigrationCapabilityStatus'] }
>>>
>> I like this a lot - it's very complicated, but it is clean, I think.
> Good - that means I made sense in trying to explain it.  And the more I
> re-read my mail, the more I like the idea - fewer new commands, and make
> the existing commands both more powerful and more easily extensible, all
> while still being discoverable by libvirt without waiting for full
> schema introspection.

Alright, I've saved this proposal in the TODO section of the
MicroCheckpointing wiki page:

http://wiki.qemu.org/Features/MicroCheckpointing#TODO

For now, I've got several other issues to address before "someone"
gets around to this. (I'd assume the maintainer or someone else would
want to test the 'extended' feature by itself, in isolation with the
existing set of migration commands, before someone like me attempts to
use it or starts adding new features to it.)

- Michael



^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2014-04-11  6:12 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-18  8:50 [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing mrhines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing mrhines
2014-02-18 12:45   ` Dr. David Alan Gilbert
2014-02-19  1:40     ` Michael R. Hines
2014-02-19 11:27       ` Dr. David Alan Gilbert
2014-02-20  1:17         ` Michael R. Hines
2014-02-20 10:09           ` Dr. David Alan Gilbert
2014-02-20 11:14             ` Li Guang
2014-02-20 14:58               ` Michael R. Hines
2014-02-20 14:57             ` Michael R. Hines
2014-02-20 16:32               ` Dr. David Alan Gilbert
2014-02-21  4:54                 ` Michael R. Hines
2014-02-21  9:44                   ` Dr. David Alan Gilbert
2014-03-03  6:08                     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 02/12] mc: timestamp migration_bitmap and KVM logdirty usage mrhines
2014-02-18 10:32   ` Dr. David Alan Gilbert
2014-02-19  1:42     ` Michael R. Hines
2014-03-11 21:31   ` Juan Quintela
2014-04-04  3:08     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 03/12] mc: introduce a 'checkpointing' status check into the VCPU states mrhines
2014-03-11 21:36   ` Juan Quintela
2014-04-04  3:11     ` Michael R. Hines
2014-03-11 21:40   ` Eric Blake
2014-04-04  3:12     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 04/12] mc: support custom page loading and copying mrhines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 05/12] rdma: accelerated memcpy() support and better external RDMA user interfaces mrhines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 06/12] mc: introduce state machine changes for MC mrhines
2014-02-19  1:00   ` Li Guang
2014-02-19  2:14     ` Michael R. Hines
2014-02-20  5:03     ` Michael R. Hines
2014-02-21  8:13     ` Michael R. Hines
2014-02-24  6:48       ` Li Guang
2014-02-26  2:52         ` Li Guang
2014-03-11 21:57   ` Juan Quintela
2014-04-04  3:50     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 07/12] mc: introduce additional QMP statistics for micro-checkpointing mrhines
2014-03-11 21:45   ` Eric Blake
2014-04-04  3:15     ` Michael R. Hines
2014-04-04  4:22       ` Eric Blake
2014-03-11 21:59   ` Juan Quintela
2014-04-04  3:55     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 08/12] mc: core logic mrhines
2014-02-19  1:07   ` Li Guang
2014-02-19  2:16     ` Michael R. Hines
2014-02-19  2:53       ` Li Guang
2014-02-19  4:27         ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 09/12] mc: configure and makefile support mrhines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 10/12] mc: expose tunable parameter for checkpointing frequency mrhines
2014-03-11 21:49   ` Eric Blake
2014-03-11 22:15     ` Juan Quintela
2014-03-11 22:49       ` Eric Blake
2014-04-04  5:29         ` Michael R. Hines
2014-04-04 14:56           ` Eric Blake
2014-04-11  6:10             ` Michael R. Hines
2014-04-04 16:28           ` Dr. David Alan Gilbert
2014-04-04 16:35             ` Eric Blake
2014-04-04  3:29     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 11/12] mc: introduce new capabilities to control micro-checkpointing mrhines
2014-03-11 21:57   ` Eric Blake
2014-04-04  3:38     ` Michael R. Hines
2014-04-04  4:25       ` Eric Blake
2014-03-11 22:02   ` Juan Quintela
2014-03-11 22:07     ` Eric Blake
2014-04-04  3:57       ` Michael R. Hines
2014-04-04  3:56     ` Michael R. Hines
2014-02-18  8:50 ` [Qemu-devel] [RFC PATCH v2 12/12] mc: activate and use MC if requested mrhines
2014-02-18  9:28 ` [Qemu-devel] [RFC PATCH v2 00/12] mc: fault tolerante through micro-checkpointing Li Guang
2014-02-19  1:29   ` Michael R. Hines
