* [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation
@ 2013-03-18  3:18 mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 01/10] ./configure --enable-rdma mrhines
                   ` (9 more replies)
  0 siblings, 10 replies; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Changes since v3:

- Compile-tested with and without --enable-rdma; both builds work.
- Updated docs/rdma.txt (included below)
- Merged with latest pull queue from Paolo
- Implemented qemu_ram_foreach_block()

mrhines@mrhinesdev:~/qemu$ git diff --stat master
Makefile.objs                 |    1 +
arch_init.c                   |   28 +-
configure                     |   25 ++
docs/rdma.txt                 |  190 +++++++++++
exec.c                        |   21 ++
include/exec/cpu-common.h     |    6 +
include/migration/migration.h |    3 +
include/migration/qemu-file.h |   10 +
include/migration/rdma.h      |  269 ++++++++++++++++
include/qemu/sockets.h        |    1 +
migration-rdma.c              |  205 ++++++++++++
migration.c                   |   19 +-
rdma.c                        | 1511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
savevm.c                      |  172 +++++++++-
util/qemu-sockets.c           |    2 +-
15 files changed, 2445 insertions(+), 18 deletions(-)

QEMUFileRDMA:
==================================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions provide an RDMA transport
(not a protocol) without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

In order to provide the same bytestream interface 
for RDMA, we use SEND messages instead of sockets.
The operations themselves and the protocol built on 
top of QEMUFile used throughout the migration 
process do not change whatsoever.

An infiniband SEND message is the standard ibverbs
message type used by infiniband applications.
The only difference between a SEND message and an RDMA
message is that SEND messages cause completion notifications
to be posted to the completion queue (CQ) on the 
infiniband receiver side, whereas RDMA messages (used
for pc.ram) do not (to behave like an actual DMA).
    
Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a transport for migration of device state.

After the initial connection setup (migration-rdma.c),
this coordination starts by having both sides post
a single work request to the RQ before any users
of QEMUFile are activated.

Once an initial receive work request is posted,
we have a put_buffer()/get_buffer() implementation
that looks like this:

Logically:

qemu_rdma_get_buffer():

1. A user on top of QEMUFile calls ops->get_buffer(),
   which calls us.
2. We transmit an empty SEND to let the sender know that 
   we are *ready* to receive some bytes from QEMUFileRDMA.
   These bytes will come in the form of another SEND.
3. Before attempting to receive that SEND, we post another
   RQ work request to replace the one we just used up.
4. Block on a CQ event channel and wait for the SEND
   to arrive.
5. When the SEND arrives, librdmacm will unblock us
   and we can consume the bytes (described later).
   
qemu_rdma_put_buffer(): 

1. A user on top of QEMUFile calls ops->put_buffer(),
   which calls us.
2. Block on the CQ event channel waiting for a SEND
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
3. When the "ready" SEND arrives, librdmacm will 
   unblock us and we immediately post an RQ work request
   to replace the one we just used up.
4. Now, we can actually deliver the bytes that
   put_buffer() wants and return. 

NOTE: This entire sequence of events is designed this
way to mimic the operations of a bytestream and is not
typical of an infiniband application. (Something like MPI
would not 'ping-pong' messages like this and would not
block after every request, which would normally defeat
the purpose of using zero-copy infiniband in the first place).
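As a toy illustration of this ping-pong, the following self-contained C model (the names `wire`, `sender_put`, and `receiver_get` are hypothetical, not from the patch) replaces the CQ event channel with a single variable, keeping only the ordering constraint: a data SEND may only follow the receiver's "ready" SEND.

```c
#include <stddef.h>
#include <string.h>

/* Toy, single-threaded model of the ready/data ping-pong above.
 * `wire` stands in for the SEND channel; the real code instead
 * blocks on a completion queue (CQ) event channel. */
enum msg { MSG_NONE, MSG_READY, MSG_DATA };

static enum msg wire = MSG_NONE;   /* one in-flight SEND at a time */
static char payload[64];

/* Receiver, step 2 of qemu_rdma_get_buffer(): announce readiness. */
static void receiver_announce_ready(void) { wire = MSG_READY; }

/* Sender, steps 2-4 of qemu_rdma_put_buffer(): only after the
 * "ready" SEND arrives may the bytes travel as another SEND. */
static int sender_put(const char *bytes)
{
    if (wire != MSG_READY) {
        return -1;                 /* real code would block here */
    }
    strncpy(payload, bytes, sizeof(payload) - 1);
    wire = MSG_DATA;
    return 0;
}

/* Receiver, steps 4-5 of qemu_rdma_get_buffer(): consume the bytes. */
static int receiver_get(char *out, size_t outlen)
{
    if (wire != MSG_DATA) {
        return -1;                 /* real code would block here */
    }
    strncpy(out, payload, outlen - 1);
    out[outlen - 1] = '\0';
    wire = MSG_NONE;
    return 0;
}
```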

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're faking a bytestream abstraction on
top of discrete messages (not unlike individual UDP frames),
we have to hold on to the bytes received from SEND in memory.

Each time we get to "Step 5" above for get_buffer(),
the bytes from SEND are copied into a local holding buffer.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the buffer until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above for qemu_rdma_get_buffer() and block waiting
for another SEND message to re-fill the buffer.
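A minimal sketch of that holding-buffer logic, assuming a flat buffer that each SEND refills whole (the names here are illustrative; the real state lives inside QEMUFileRDMA, sized by QEMU_FILE_RDMA_MAX in the patch below):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model of the receive-side holding
 * buffer described above: bytes arriving from a SEND are stashed
 * in `buf`, and get_buffer() drains them across multiple calls. */
typedef struct {
    unsigned char buf[512 * 1024]; /* cf. QEMU_FILE_RDMA_MAX */
    size_t len;                    /* bytes currently held */
    size_t pos;                    /* next unread byte */
} HoldingBuffer;

/* Called when a SEND arrives ("Step 5"): copy its payload in. */
static void holding_buffer_fill(HoldingBuffer *hb,
                                const void *data, size_t size)
{
    memcpy(hb->buf, data, size);
    hb->len = size;
    hb->pos = 0;
}

/* get_buffer(): hand out up to `want` bytes, leaving the rest for
 * the next pass; returns 0 when empty (the caller would then block
 * waiting for another SEND to refill the buffer). */
static size_t holding_buffer_read(HoldingBuffer *hb,
                                  void *dest, size_t want)
{
    size_t avail = hb->len - hb->pos;
    size_t n = want < avail ? want : avail;
    memcpy(dest, hb->buf + hb->pos, n);
    hb->pos += n;
    return n;
}
```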

Migration of pc.ram:
===============================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate the list of RAMBlocks
to be registered with each other into a structure.

Then, using a single SEND message, they exchange this
structure with each other, to be used later during the
iteration of main memory. This structure includes a list
of all the RAMBlocks, their offsets and lengths.
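The exchanged structure is essentially an array of per-RAMBlock records. The sketch below models one record on the RDMARemoteBlock struct from the patch; the pack/unpack helpers are illustrative, since the patch SENDs the structs directly from a registered memory region:

```c
#include <stdint.h>
#include <string.h>

/* Per-RAMBlock record carried by the single SEND, modeled on
 * RDMARemoteBlock in include/migration/rdma.h. */
typedef struct {
    uint64_t remote_host_addr;
    uint64_t offset;
    uint64_t length;
    uint32_t remote_rkey;
} BlockInfo;

/* Illustrative helpers: flatten a table of records into the byte
 * area a SEND message would carry, and recover it on the far side. */
static size_t pack_blocks(const BlockInfo *b, int n, uint8_t *out)
{
    memcpy(out, b, n * sizeof(*b));
    return n * sizeof(*b);
}

static int unpack_blocks(const uint8_t *in, size_t len, BlockInfo *out)
{
    int n = (int)(len / sizeof(*out));
    memcpy(out, in, n * sizeof(*out));
    return n;
}
```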

Main memory is not migrated with SEND infiniband 
messages, but is instead migrated with RDMA infiniband
messages.

Main memory is migrated in "chunks" (currently about 64
pages each). The chunk size is not dynamic, but it could
be made so in a future implementation.

When a total of 64 pages has been aggregated (or a flush()
occurs), the memory backing the chunk on the sender side is
registered with librdmacm and pinned in memory.

After pinning, an RDMA write is generated and transmitted
for the entire chunk.
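The chunk bookkeeping amounts to simple address arithmetic. A sketch, assuming 4 KiB pages and the 64-page chunk mentioned above (the patch's RDMA_REG_CHUNK_SHIFT actually yields 1 MiB chunks, so these constants are illustrative):

```c
#include <stdint.h>

/* Illustrative chunk arithmetic in the style of the
 * RDMA_REG_CHUNK_INDEX / RDMA_REG_NUM_CHUNKS macros below,
 * assuming 4 KiB pages and 64-page chunks. */
#define PAGE_SHIFT       12
#define PAGES_PER_CHUNK  64UL
#define CHUNK_SIZE       (PAGES_PER_CHUNK << PAGE_SHIFT) /* 256 KiB */

/* Which chunk of a RAMBlock does host_addr fall into? */
static unsigned long chunk_index(uintptr_t block_start,
                                 uintptr_t host_addr)
{
    return (host_addr / CHUNK_SIZE) - (block_start / CHUNK_SIZE);
}

/* How many chunks cover a block of `length` bytes?  Mirrors the
 * macro's convention of counting the chunk containing the end
 * address, so registration never misses the tail of a block. */
static unsigned long num_chunks(uintptr_t block_start, uint64_t length)
{
    return chunk_index(block_start, block_start + length) + 1;
}
```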

Error-handling:
===============================

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode
we use for RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors, and unregister
all the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way it would be if the TCP socket
were broken during a non-RDMA based migration.

USAGE
===============================

Compiling:

$ ./configure --enable-rdma --target-list=x86_64-softmmu

$ make

Command-line on the Source machine AND Destination:

$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device

Finally, perform the actual migration:

$ virsh migrate domain rdma:xx.xx.xx.xx:port

PERFORMANCE
===================

Using a 40 Gbps infiniband link, performing a worst-case stress test:

1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
   Approximately 30 Gbps (slightly better than the paper)

2. TCP throughput with the same stress test:
   Approximately 8 Gbps (using IPoIB, IP over Infiniband)

Average downtime (stop time) ranges between 28 and 33 milliseconds.

An *exhaustive* paper (2010) with additional performance
details is linked on the QEMU wiki:

http://wiki.qemu.org/Features/RDMALiveMigration

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v4: 01/10] ./configure --enable-rdma
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
@ 2013-03-18  3:18 ` mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 02/10] check for CONFIG_RDMA mrhines
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 configure |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/configure b/configure
index 46a7594..bdc6b13 100755
--- a/configure
+++ b/configure
@@ -170,6 +170,7 @@ xfs=""
 
 vhost_net="no"
 kvm="no"
+rdma="no"
 gprof="no"
 debug_tcg="no"
 debug="no"
@@ -904,6 +905,10 @@ for opt do
   ;;
   --enable-gtk) gtk="yes"
   ;;
+  --enable-rdma) rdma="yes"
+  ;;
+  --disable-rdma) rdma="no"
+  ;;
   --with-gtkabi=*) gtkabi="$optarg"
   ;;
   --enable-tpm) tpm="yes"
@@ -1104,6 +1109,8 @@ echo "  --enable-bluez           enable bluez stack connectivity"
 echo "  --disable-slirp          disable SLIRP userspace network connectivity"
 echo "  --disable-kvm            disable KVM acceleration support"
 echo "  --enable-kvm             enable KVM acceleration support"
+echo "  --disable-rdma           disable RDMA-based migration support"
+echo "  --enable-rdma            enable RDMA-based migration support"
 echo "  --enable-tcg-interpreter enable TCG with bytecode interpreter (TCI)"
 echo "  --disable-nptl           disable usermode NPTL support"
 echo "  --enable-nptl            enable usermode NPTL support"
@@ -1766,6 +1773,18 @@ EOF
   libs_softmmu="$sdl_libs $libs_softmmu"
 fi
 
+if test "$rdma" = "yes" ; then
+  cat > $TMPC <<EOF
+#include <rdma/rdma_cma.h>
+int main(void) { return 0; }
+EOF
+  rdma_libs="-lrdmacm -libverbs"
+  if ! compile_prog "" "$rdma_libs" ; then
+      feature_not_found "rdma"
+  fi
+    
+fi
+
 ##########################################
 # VNC TLS/WS detection
 if test "$vnc" = "yes" -a \( "$vnc_tls" != "no" -o "$vnc_ws" != "no" \) ; then
@@ -3412,6 +3431,7 @@ echo "Linux AIO support $linux_aio"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
+echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
@@ -4384,6 +4404,11 @@ if [ "$pixman" = "internal" ]; then
   echo "config-host.h: subdir-pixman" >> $config_host_mak
 fi
 
+if test "$rdma" = "yes" ; then
+echo "CONFIG_RDMA=y" >> $config_host_mak
+echo "LIBS+=$rdma_libs" >> $config_host_mak
+fi
+
 # build tree in object directory in case the source is not in the current directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas"
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v4: 02/10] check for CONFIG_RDMA
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 01/10] ./configure --enable-rdma mrhines
@ 2013-03-18  3:18 ` mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport mrhines
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Make both rdma.c and migration-rdma.c conditionally built.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 Makefile.objs |    1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile.objs b/Makefile.objs
index f99841c..d12208b 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -58,6 +58,7 @@ common-obj-$(CONFIG_POSIX) += os-posix.o
 common-obj-$(CONFIG_LINUX) += fsdev/
 
 common-obj-y += migration.o migration-tcp.o
+common-obj-$(CONFIG_RDMA) += migration-rdma.o rdma.o
 common-obj-y += qemu-char.o #aio.o
 common-obj-y += block-migration.o
 common-obj-y += page_cache.o xbzrle.o
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 01/10] ./configure --enable-rdma mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 02/10] check for CONFIG_RDMA mrhines
@ 2013-03-18  3:18 ` mrhines
  2013-03-18 10:40   ` Michael S. Tsirkin
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks mrhines
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This tries to cover all the questions I got the last time.

Please do tell me what is not clear, and I'll revise again.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |  208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 208 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..2a48ab0
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,208 @@
+Changes since v3:
+
+- Compile-tested with and without --enable-rdma; both builds work.
+- Updated docs/rdma.txt (included below)
+- Merged with latest pull queue from Paolo
+- Implemented qemu_ram_foreach_block()
+
+mrhines@mrhinesdev:~/qemu$ git diff --stat master
+Makefile.objs                 |    1 +
+arch_init.c                   |   28 +-
+configure                     |   25 ++
+docs/rdma.txt                 |  190 +++++++++++
+exec.c                        |   21 ++
+include/exec/cpu-common.h     |    6 +
+include/migration/migration.h |    3 +
+include/migration/qemu-file.h |   10 +
+include/migration/rdma.h      |  269 ++++++++++++++++
+include/qemu/sockets.h        |    1 +
+migration-rdma.c              |  205 ++++++++++++
+migration.c                   |   19 +-
+rdma.c                        | 1511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+savevm.c                      |  172 +++++++++-
+util/qemu-sockets.c           |    2 +-
+15 files changed, 2445 insertions(+), 18 deletions(-)
+
+QEMUFileRDMA:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions provide an RDMA transport
+(not a protocol) without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+In order to provide the same bytestream interface 
+for RDMA, we use SEND messages instead of sockets.
+The operations themselves and the protocol built on 
+top of QEMUFile used throughout the migration 
+process do not change whatsoever.
+
+An infiniband SEND message is the standard ibverbs
+message type used by infiniband applications.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the 
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+    
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+After the initial connection setup (migration-rdma.c),
+this coordination starts by having both sides post
+a single work request to the RQ before any users
+of QEMUFile are activated.
+
+Once an initial receive work request is posted,
+we have a put_buffer()/get_buffer() implementation
+that looks like this:
+
+Logically:
+
+qemu_rdma_get_buffer():
+
+1. A user on top of QEMUFile calls ops->get_buffer(),
+   which calls us.
+2. We transmit an empty SEND to let the sender know that 
+   we are *ready* to receive some bytes from QEMUFileRDMA.
+   These bytes will come in the form of another SEND.
+3. Before attempting to receive that SEND, we post another
+   RQ work request to replace the one we just used up.
+4. Block on a CQ event channel and wait for the SEND
+   to arrive.
+5. When the SEND arrives, librdmacm will unblock us
+   and we can consume the bytes (described later).
+   
+qemu_rdma_put_buffer(): 
+
+1. A user on top of QEMUFile calls ops->put_buffer(),
+   which calls us.
+2. Block on the CQ event channel waiting for a SEND
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+3. When the "ready" SEND arrives, librdmacm will 
+   unblock us and we immediately post an RQ work request
+   to replace the one we just used up.
+4. Now, we can actually deliver the bytes that
+   put_buffer() wants and return. 
+
+NOTE: This entire sequence of events is designed this
+way to mimic the operations of a bytestream and is not
+typical of an infiniband application. (Something like MPI
+would not 'ping-pong' messages like this and would not
+block after every request, which would normally defeat
+the purpose of using zero-copy infiniband in the first place).
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're faking a bytestream abstraction on
+top of discrete messages (not unlike individual UDP frames),
+we have to hold on to the bytes received from SEND in memory.
+
+Each time we get to "Step 5" above for get_buffer(),
+the bytes from SEND are copied into a local holding buffer.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the buffer until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above for qemu_rdma_get_buffer() and block waiting
+for another SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+
+Then, using a single SEND message, they exchange this
+structure with each other, to be used later during the
+iteration of main memory. This structure includes a list
+of all the RAMBlocks, their offsets and lengths.
+
+Main memory is not migrated with SEND infiniband 
+messages, but is instead migrated with RDMA infiniband
+messages.
+
+Main memory is migrated in "chunks" (currently about 64
+pages each). The chunk size is not dynamic, but it could
+be made so in a future implementation.
+
+When a total of 64 pages has been aggregated (or a flush()
+occurs), the memory backing the chunk on the sender side is
+registered with librdmacm and pinned in memory.
+
+After pinning, an RDMA write is generated and transmitted
+for the entire chunk.
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely,
+clean up all the RDMA descriptors, and unregister
+all the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way it would be if the TCP socket
+were broken during a non-RDMA based migration.
+
+USAGE
+===============================
+
+Compiling:
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+
+$ make
+
+Command-line on the Source machine AND Destination:
+
+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+PERFORMANCE
+===================
+
+Using a 40 Gbps infiniband link, performing a worst-case stress test:
+
+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
+   Approximately 30 Gbps (slightly better than the paper)
+
+2. TCP throughput with the same stress test:
+   Approximately 8 Gbps (using IPoIB, IP over Infiniband)
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) with additional performance
+details is linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
                   ` (2 preceding siblings ...)
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport mrhines
@ 2013-03-18  3:18 ` mrhines
  2013-03-18  8:48   ` Paolo Bonzini
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 05/10] reuse function for parsing the QMP 'migrate' string mrhines
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This introduces:
1. qemu_ram_foreach_block
2. qemu_ram_count_blocks

Both used in communicating the RAMBlocks
to each side for later memory registration.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 exec.c                    |   21 +++++++++++++++++++++
 include/exec/cpu-common.h |    6 ++++++
 2 files changed, 27 insertions(+)

diff --git a/exec.c b/exec.c
index 8a6aac3..a985da8 100644
--- a/exec.c
+++ b/exec.c
@@ -2629,3 +2629,24 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
              memory_region_is_romd(section->mr));
 }
 #endif
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
+{
+    RAMBlock *block;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        func(block->host, block->offset, block->length, opaque);
+    }
+}
+
+int qemu_ram_count_blocks(void)
+{
+    RAMBlock *block;
+    int total = 0;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        total++;
+    }
+
+    return total;
+}
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 2e5f11f..aea3fe0 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -119,6 +119,12 @@ extern struct MemoryRegion io_mem_rom;
 extern struct MemoryRegion io_mem_unassigned;
 extern struct MemoryRegion io_mem_notdirty;
 
+typedef void  (RAMBlockIterFunc)(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque); 
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
+int qemu_ram_count_blocks(void);
+
 #endif
 
 #endif /* !CPU_COMMON_H */
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v4: 05/10] reuse function for parsing the QMP 'migrate' string
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
                   ` (3 preceding siblings ...)
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks mrhines
@ 2013-03-18  3:18 ` mrhines
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 06/10] core RDMA migration code (rdma.c) mrhines
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/qemu/sockets.h |    1 +
 util/qemu-sockets.c    |    2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/qemu/sockets.h b/include/qemu/sockets.h
index ae5c21c..5066fca 100644
--- a/include/qemu/sockets.h
+++ b/include/qemu/sockets.h
@@ -48,6 +48,7 @@ typedef void NonBlockingConnectHandler(int fd, void *opaque);
 int inet_listen_opts(QemuOpts *opts, int port_offset, Error **errp);
 int inet_listen(const char *str, char *ostr, int olen,
                 int socktype, int port_offset, Error **errp);
+InetSocketAddress *inet_parse(const char *str, Error **errp);
 int inet_connect_opts(QemuOpts *opts, Error **errp,
                       NonBlockingConnectHandler *callback, void *opaque);
 int inet_connect(const char *str, Error **errp);
diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
index 83e4e08..6b60b63 100644
--- a/util/qemu-sockets.c
+++ b/util/qemu-sockets.c
@@ -485,7 +485,7 @@ err:
 }
 
 /* compatibility wrapper */
-static InetSocketAddress *inet_parse(const char *str, Error **errp)
+InetSocketAddress *inet_parse(const char *str, Error **errp)
 {
     InetSocketAddress *addr;
     const char *optstr, *h;
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v4: 06/10] core RDMA migration code (rdma.c)
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
                   ` (4 preceding siblings ...)
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 05/10] reuse function for parsing the QMP 'migrate' string mrhines
@ 2013-03-18  3:18 ` mrhines
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA mrhines
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 73+ messages in thread
From: mrhines @ 2013-03-18  3:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/rdma.h |  244 ++++++++
 rdma.c                   | 1532 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1776 insertions(+)
 create mode 100644 include/migration/rdma.h
 create mode 100644 rdma.c

diff --git a/include/migration/rdma.h b/include/migration/rdma.h
new file mode 100644
index 0000000..a6c521a
--- /dev/null
+++ b/include/migration/rdma.h
@@ -0,0 +1,244 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  RDMA data structures and helper functions (for migration)
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _RDMA_H
+#define _RDMA_H
+
+#include "config-host.h"
+#ifdef CONFIG_RDMA 
+#include <rdma/rdma_cma.h>
+#endif
+#include "monitor/monitor.h"
+#include "exec/cpu-common.h"
+#include "migration/migration.h"
+
+#define Gbps(bytes, ms) ((double) bytes * 8.0 / ((double) ms / 1000.0)) \
+                                / 1000.0 / 1000.0
+#define qemu_rdma_print(msg) fprintf(stderr, msg "\n")
+//#define qemu_rdma_print(msg) error_setg(errp, msg)
+
+#define RDMA_CHUNK_REGISTRATION
+
+#define RDMA_LAZY_REGISTRATION
+
+#define RDMA_REG_CHUNK_SHIFT 20
+#define RDMA_REG_CHUNK_SIZE (1UL << (RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_CHUNK_INDEX(start_addr, host_addr) \
+            (((unsigned long)(host_addr) >> RDMA_REG_CHUNK_SHIFT) - \
+            ((unsigned long)(start_addr) >> RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_NUM_CHUNKS(rdma_ram_block) \
+            (RDMA_REG_CHUNK_INDEX((rdma_ram_block)->local_host_addr,\
+                (rdma_ram_block)->local_host_addr +\
+                (rdma_ram_block)->length) + 1)
+#define RDMA_REG_CHUNK_START(rdma_ram_block, i) ((uint8_t *)\
+            ((((unsigned long)((rdma_ram_block)->local_host_addr) >> \
+                RDMA_REG_CHUNK_SHIFT) + (i)) << \
+                RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_CHUNK_END(rdma_ram_block, i) \
+            (RDMA_REG_CHUNK_START(rdma_ram_block, i) + \
+             RDMA_REG_CHUNK_SIZE)
+
+/*
+ * This is only for non-live state being migrated.
+ * Instead of RDMA_WRITE messages, we use RDMA_SEND
+ * messages for that state, which requires a different
+ * delivery design than main memory.
+ */
+#define RDMA_SEND_INCREMENT 32768
+#define QEMU_FILE_RDMA_MAX (512 * 1024)
+
+#define RDMA_BLOCKING
+
+#ifdef CONFIG_RDMA
+enum {
+    RDMA_WRID_NONE = 0,
+    RDMA_WRID_RDMA,
+    RDMA_WRID_SEND_REMOTE_INFO,
+    RDMA_WRID_RECV_REMOTE_INFO,
+    RDMA_WRID_SEND_QEMU_FILE = 1000,
+    RDMA_WRID_RECV_QEMU_FILE = 2000,
+};
+
+typedef struct RDMAContext {
+    /* cm_id also has ibv_conext, rdma_event_channel, and ibv_qp in
+       cm_id->verbs, cm_id->channel, and cm_id->qp. */
+    struct rdma_cm_id *cm_id;
+    struct rdma_cm_id *listen_id;
+
+    struct ibv_context *verbs;
+    struct rdma_event_channel *channel;
+    struct ibv_qp *qp;
+
+    struct ibv_comp_channel *comp_channel;
+    struct ibv_pd *pd;
+    struct ibv_cq *cq;
+} RDMAContext;
+
+typedef struct RDMALocalBlock {
+    uint8_t *local_host_addr;
+    uint64_t remote_host_addr;
+    uint64_t offset;
+    uint64_t length;
+    struct ibv_mr **pmr;
+    struct ibv_mr *mr;
+    uint32_t remote_rkey;
+} RDMALocalBlock;
+
+typedef struct RDMARemoteBlock {
+    uint64_t remote_host_addr;
+    uint64_t offset;
+    uint64_t length;
+    uint32_t remote_rkey;
+} RDMARemoteBlock;
+
+typedef struct RDMALocalBlocks {
+    int num_blocks;
+    RDMALocalBlock *block;
+} RDMALocalBlocks;
+
+typedef struct RDMARemoteBlocks {
+    int * num_blocks;
+    RDMARemoteBlock *block;
+    void * remote_info_area;
+    int info_size;
+} RDMARemoteBlocks;
+
+typedef struct RDMAData {
+    char *host;
+    int port;
+    int enabled;
+    int gidx;
+    union ibv_gid gid;
+    uint8_t b;
+
+    RDMAContext rdma_ctx;
+    RDMALocalBlocks rdma_local_ram_blocks;
+
+    /* This is used for synchronization: We use
+       IBV_WR_SEND to send it after all IBV_WR_RDMA_WRITEs
+       are done. When the receiver gets it, it can be certain
+       that all the RDMAs are completed. */
+    int sync;
+    struct ibv_mr *sync_mr;
+
+    /* This is used for the server to write the remote
+       ram blocks info. */
+    RDMARemoteBlocks remote_info;
+    struct ibv_mr *remote_info_mr;
+
+    /* This is used by the migration protocol to transmit
+     * device and CPU state that's not part of the VM's
+     * main memory.
+     */
+    uint8_t qemu_file[QEMU_FILE_RDMA_MAX];
+    struct ibv_mr *qemu_file_mr;
+    size_t qemu_file_len;
+    uint8_t *qemu_file_curr;
+    int qemu_file_send_waiting;
+
+    /* The rest is only for the initiator of the migration. */
+    int client_init_done;
+
+    /* number of outstanding unsignaled sends */
+    int num_unsignaled_send;
+
+    /* number of outstanding signaled sends */
+    int num_signaled_send;
+
+    /* store info about current buffer so that we can
+       merge it with future sends */
+    uint64_t current_offset;
+    uint64_t current_length;
+    /* index of ram block the current buffer belongs to */
+    int current_index;
+    /* index of the chunk in the current ram block */
+    int current_chunk;
+
+    uint64_t total_bytes;
+
+    /* TODO: the initial post_send is happening too quickly;
+     * try to delay it, or record it and then check
+     * for its receipt later. */
+    int initial_kick_not_received;
+} RDMAData;
+
+void qemu_rdma_disable(RDMAData *rdma);
+
+int qemu_rdma_resolve_host(RDMAContext *rdma_ctx,
+        const char *host, int port);
+int qemu_rdma_alloc_pd_cq(RDMAContext *rdma_ctx);
+int qemu_rdma_alloc_qp(RDMAContext *rdma_ctx);
+int qemu_rdma_migrate_connect(RDMAContext *rdma_ctx,
+        void *in_data, int *in_len, void *out_data, int out_len);
+int qemu_rdma_migrate_accept(RDMAContext *rdma_ctx,
+        void *in_data, int *in_len, void *out_data, int out_len);
+void qemu_rdma_migrate_disconnect(RDMAContext *rdma_ctx);
+int qemu_rdma_exchange_send(RDMAData *rdma, uint8_t *data, size_t len);
+int qemu_rdma_exchange_recv(void *rdma);
+
+
+int qemu_rdma_migrate_listen(RDMAData *mdata, char *host, int port);
+int qemu_rdma_poll_for_wrid(RDMAData *mdata, int wrid);
+int qemu_rdma_block_for_wrid(RDMAData *mdata, int wrid);
+
+int qemu_rdma_post_send_remote_info(RDMAData *mdata);
+int qemu_rdma_post_recv_qemu_file(RDMAData *mdata);
+void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id);
+
+void qemu_rdma_cleanup(RDMAData *mdata);
+int qemu_rdma_client_init(RDMAData *mdata, Error **errp);
+int qemu_rdma_client_connect(RDMAData *mdata, Error **errp);
+int qemu_rdma_data_init(RDMAData *mdata, const char *host_port, Error **errp);
+int qemu_rdma_server_init(RDMAData *mdata, Error **errp);
+int qemu_rdma_server_prepare(RDMAData *mdata, Error **errp);
+int qemu_rdma_write(RDMAData *mdata, uint64_t addr, uint64_t len);
+int qemu_rdma_write_flush(RDMAData *mdata);
+int qemu_rdma_poll(RDMAData *mdata);
+int qemu_rdma_wait_for_wrid(RDMAData *mdata, int wrid);
+int qemu_rdma_enabled(void *rdma);
+int qemu_rdma_drain_cq(void *opaque);
+size_t qemu_rdma_fill(void *opaque, uint8_t *buf, int size);
+size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, ram_addr_t offset,
+        int cont, size_t size);
+void rdma_start_outgoing_migration(void *opaque, const char *host_port,
+        Error **errp);
+int rdma_start_incoming_migration(const char *host_port, Error **errp);
+
+#else /* !defined(CONFIG_RDMA) */
+#define NOT_CONFIGURED() do { printf("WARN: RDMA is not configured\n"); } while (0)
+#define qemu_rdma_cleanup(...) NOT_CONFIGURED()
+#define qemu_rdma_data_init(...) NOT_CONFIGURED()
+#define rdma_start_outgoing_migration(...) NOT_CONFIGURED()
+#define rdma_start_incoming_migration(...) NOT_CONFIGURED()
+#define qemu_rdma_client_init(...) -1
+#define qemu_rdma_client_connect(...) -1
+#define qemu_rdma_server_init(...) -1
+#define qemu_rdma_server_prepare(...) -1
+#define qemu_rdma_write(...) -1
+#define qemu_rdma_write_flush(...) -1
+#define qemu_rdma_poll(...) -1
+#define qemu_rdma_wait_for_wrid(...) -1
+#define qemu_rdma_enabled(...) 0
+#define qemu_rdma_exchange_send(...) 0
+#define qemu_rdma_exchange_recv(...) 0
+#define qemu_rdma_drain_cq(...) 0
+#define qemu_rdma_fill(...) 0
+#define save_rdma_page(...) 0
+
+#endif /* CONFIG_RDMA */
+
+#endif
diff --git a/rdma.c b/rdma.c
new file mode 100644
index 0000000..c56bd20
--- /dev/null
+++ b/rdma.c
@@ -0,0 +1,1532 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  RDMA data structures and helper functions
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "migration/rdma.h"
+#include "qemu-common.h"
+#include "migration/migration.h"
+#include "exec/cpu-common.h"
+#include "qemu/sockets.h"
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+//#define DEBUG_RDMA
+
+#ifdef DEBUG_RDMA
+#define DPRINTF(fmt, ...) \
+    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#define RDMA_RESOLVE_TIMEOUT_MS 10000
+/*
+ * The completion queue can be filled by both read and write work requests,
+ * so it must be sized to reflect the sum of both possible queue sizes.
+ */
+#define RDMA_QP_SIZE 1000
+#define RDMA_CQ_SIZE (RDMA_QP_SIZE * 3)
+
+const char *wrid_desc[] = {
+    [RDMA_WRID_NONE] = "NONE",
+    [RDMA_WRID_RDMA] = "WRITE RDMA",
+    [RDMA_WRID_SEND_REMOTE_INFO] = "INFO SEND",
+    [RDMA_WRID_RECV_REMOTE_INFO] = "INFO RECV",
+    [RDMA_WRID_SEND_QEMU_FILE] = "QEMU SEND",
+    [RDMA_WRID_RECV_QEMU_FILE] = "QEMU RECV",
+};
+
+/*
+ * Memory regions need to be registered with the device and queue pairs
+ * set up in advance before the migration starts. This tells us where the
+ * RAM blocks are so that we can register them individually.
+ */
+
+static void qemu_rdma_init_one_block(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    RDMALocalBlocks *rdma_local_ram_blocks = opaque;
+    int num_blocks = rdma_local_ram_blocks->num_blocks;
+
+    rdma_local_ram_blocks->block[num_blocks].local_host_addr = host_addr;
+    rdma_local_ram_blocks->block[num_blocks].offset = (uint64_t)offset;
+    rdma_local_ram_blocks->block[num_blocks].length = (uint64_t)length;
+    rdma_local_ram_blocks->num_blocks++;
+}
+
+static int qemu_rdma_init_ram_blocks(RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int num_blocks = qemu_ram_count_blocks();
+
+    memset(rdma_local_ram_blocks, 0, sizeof *rdma_local_ram_blocks);
+
+    rdma_local_ram_blocks->block = g_malloc0(sizeof(RDMALocalBlock) *
+                                    num_blocks);
+    rdma_local_ram_blocks->num_blocks = 0;
+
+    qemu_ram_foreach_block(qemu_rdma_init_one_block, rdma_local_ram_blocks);
+
+    DPRINTF("Allocated %d local ram block structures\n",
+            rdma_local_ram_blocks->num_blocks);
+    return 0;
+}
+
+/*
+ * Put in the log file which RDMA device was opened and the details
+ * associated with that device.
+ */
+static void qemu_rdma_dump_id(const char *who, struct ibv_context *verbs)
+{
+    printf("%s RDMA verbs Device opened: kernel name %s "
+           "uverbs device name %s, "
+           "infiniband_verbs class device path %s, "
+           "infiniband class device path %s\n",
+           who,
+           verbs->device->name,
+           verbs->device->dev_name,
+           verbs->device->dev_path,
+           verbs->device->ibdev_path);
+}
+
+/*
+ * Put in the log file the RDMA GID addressing information,
+ * useful for folks who have trouble understanding the
+ * RDMA device hierarchy in the kernel.
+ */
+void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
+{
+    char sgid[INET6_ADDRSTRLEN];
+    char dgid[INET6_ADDRSTRLEN];
+    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.sgid, sgid, sizeof sgid);
+    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.dgid, dgid, sizeof dgid);
+    DPRINTF("%s Source GID: %s, Dest GID: %s\n", who, sgid, dgid);
+}
+
+int qemu_rdma_resolve_host(RDMAContext *rdma_ctx, const char *host, int port)
+{
+    int ret;
+    struct addrinfo *res;
+    char port_str[16];
+    struct rdma_cm_event *cm_event;
+    char ip[40] = "unknown";
+
+    if (host == NULL || !strcmp(host, "")) {
+        fprintf(stderr, "RDMA hostname has not been set\n");
+        return -1;
+    }
+
+    /* create CM channel */
+    rdma_ctx->channel = rdma_create_event_channel();
+    if (!rdma_ctx->channel) {
+        fprintf(stderr, "could not create CM channel\n");
+        return -1;
+    }
+
+    /* create CM id */
+    ret = rdma_create_id(rdma_ctx->channel, &rdma_ctx->cm_id, NULL,
+            RDMA_PS_TCP);
+    if (ret) {
+        fprintf(stderr, "could not create channel id\n");
+        goto err_resolve_create_id;
+    }
+
+    snprintf(port_str, 16, "%d", port);
+    port_str[15] = '\0';
+
+    ret = getaddrinfo(host, port_str, NULL, &res);
+    if (ret) {
+        fprintf(stderr, "getaddrinfo failed for destination address %s\n",
+                host);
+        goto err_resolve_get_addr;
+    }
+
+    inet_ntop(AF_INET, &((struct sockaddr_in *) res->ai_addr)->sin_addr, 
+                                ip, sizeof ip);
+    printf("%s => %s\n", host, ip);
+
+    /* resolve the first address */
+    ret = rdma_resolve_addr(rdma_ctx->cm_id, NULL, res->ai_addr,
+            RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve address %s\n", host);
+        goto err_resolve_get_addr;
+    }
+
+    qemu_rdma_dump_gid("client_resolve_addr", rdma_ctx->cm_id);
+
+    ret = rdma_get_cm_event(rdma_ctx->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_addr_resolved\n");
+        goto err_resolve_get_addr;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
+        fprintf(stderr, "result not equal to event_addr_resolved %s\n", 
+                rdma_event_str(cm_event->event));
+        perror("rdma_resolve_addr");
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
+
+    /* resolve route */
+    ret = rdma_resolve_route(rdma_ctx->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve rdma route\n");
+        goto err_resolve_get_addr;
+    }
+
+    ret = rdma_get_cm_event(rdma_ctx->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_route_resolved\n");
+        goto err_resolve_get_addr;
+    }
+    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
+        fprintf(stderr, "result not equal to event_route_resolved: %s\n",
+                rdma_event_str(cm_event->event));
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
+    rdma_ctx->verbs = rdma_ctx->cm_id->verbs;
+    qemu_rdma_dump_id("client_resolve_host", rdma_ctx->cm_id->verbs);
+    qemu_rdma_dump_gid("client_resolve_host", rdma_ctx->cm_id);
+    return 0;
+
+err_resolve_get_addr:
+    rdma_destroy_id(rdma_ctx->cm_id);
+err_resolve_create_id:
+    rdma_destroy_event_channel(rdma_ctx->channel);
+    rdma_ctx->channel = NULL;
+
+    return -1;
+}
+
+int qemu_rdma_alloc_pd_cq(RDMAContext *rdma_ctx)
+{
+    /* allocate pd */
+    rdma_ctx->pd = ibv_alloc_pd(rdma_ctx->verbs);
+    if (!rdma_ctx->pd) {
+        return -1;
+    }
+
+#ifdef RDMA_BLOCKING
+    /* create completion channel */
+    rdma_ctx->comp_channel = ibv_create_comp_channel(rdma_ctx->verbs);
+    if (!rdma_ctx->comp_channel) {
+        goto err_alloc_pd_cq;
+    }
+#endif
+
+    /* create cq */
+    rdma_ctx->cq = ibv_create_cq(rdma_ctx->verbs, RDMA_CQ_SIZE,
+            NULL, rdma_ctx->comp_channel, 0);
+    if (!rdma_ctx->cq) {
+        goto err_alloc_pd_cq;
+    }
+
+    return 0;
+
+err_alloc_pd_cq:
+    if (rdma_ctx->pd) {
+        ibv_dealloc_pd(rdma_ctx->pd);
+    }
+    if (rdma_ctx->comp_channel) {
+        ibv_destroy_comp_channel(rdma_ctx->comp_channel);
+    }
+    rdma_ctx->pd = NULL;
+    rdma_ctx->comp_channel = NULL;
+    return -1;
+}
+
+int qemu_rdma_alloc_qp(RDMAContext *rdma_ctx)
+{
+    struct ibv_qp_init_attr attr = { 0 };
+    int ret;
+
+    attr.cap.max_send_wr = RDMA_QP_SIZE;
+    attr.cap.max_recv_wr = 3;
+    attr.cap.max_send_sge = 1;
+    attr.cap.max_recv_sge = 1;
+    attr.send_cq = rdma_ctx->cq;
+    attr.recv_cq = rdma_ctx->cq;
+    attr.qp_type = IBV_QPT_RC;
+
+    ret = rdma_create_qp(rdma_ctx->cm_id, rdma_ctx->pd, &attr);
+    if (ret) {
+        return -1;
+    }
+
+    rdma_ctx->qp = rdma_ctx->cm_id->qp;
+    return 0;
+}
+
+int qemu_rdma_migrate_connect(RDMAContext *rdma_ctx,
+        void *in_data, int *in_len, void *out_data, int out_len)
+{
+    int ret;
+    struct rdma_conn_param conn_param = { 0 };
+    struct rdma_cm_event *cm_event;
+
+    conn_param.initiator_depth = 2;
+    conn_param.retry_count = 5;
+    conn_param.private_data = out_data;
+    conn_param.private_data_len = out_len;
+
+    ret = rdma_connect(rdma_ctx->cm_id, &conn_param);
+    if (ret) {
+        perror("rdma_connect");
+        return -1;
+    }
+
+    ret = rdma_get_cm_event(rdma_ctx->channel, &cm_event);
+    if (ret) {
+        perror("rdma_get_cm_event after rdma_connect");
+        return -1;
+    }
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        perror("rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect");
+        rdma_ack_cm_event(cm_event);
+        return -1;
+    }
+
+    if (in_len) {
+        if (*in_len > cm_event->param.conn.private_data_len) {
+            *in_len = cm_event->param.conn.private_data_len;
+        }
+        if (*in_len) {
+            memcpy(in_data, cm_event->param.conn.private_data, *in_len);
+        }
+    }
+
+    rdma_ack_cm_event(cm_event);
+
+    return 0;
+}
+
+int qemu_rdma_migrate_listen(RDMAData *rdma, char *host,
+        int port)
+{
+    int ret;
+    struct rdma_cm_event *cm_event;
+    RDMAContext *rdma_ctx = &rdma->rdma_ctx;
+    struct ibv_context *verbs;
+
+    ret = rdma_get_cm_event(rdma_ctx->channel, &cm_event);
+    if (ret) {
+        goto err_listen;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+        rdma_ack_cm_event(cm_event);
+        goto err_listen;
+    }
+
+    rdma_ctx->cm_id = cm_event->id;
+    verbs = cm_event->id->verbs;
+    DPRINTF("verbs context after listen: %p\n", verbs);
+    rdma_ack_cm_event(cm_event);
+
+    if (!rdma_ctx->verbs) {
+        rdma_ctx->verbs = verbs;
+        ret = qemu_rdma_server_prepare(rdma, NULL);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error preparing server!\n");
+            goto err_listen;
+        }
+    } else if (rdma_ctx->verbs != verbs) {
+        fprintf(stderr, "ibv context not matching %p, %p!\n",
+                rdma_ctx->verbs, verbs);
+        goto err_listen;
+    }
+    /* TODO: destroy listen_id? */
+
+    return 0;
+
+err_listen:
+    return -1;
+}
+
+int qemu_rdma_migrate_accept(RDMAContext *rdma_ctx,
+        void *in_data, int *in_len, void *out_data, int out_len)
+{
+    int ret;
+    struct rdma_conn_param conn_param = { 0 };
+    struct rdma_cm_event *cm_event;
+
+    conn_param.responder_resources = 2;
+    conn_param.private_data = out_data;
+    conn_param.private_data_len = out_len;
+
+    ret = rdma_accept(rdma_ctx->cm_id, &conn_param);
+    if (ret) {
+        fprintf(stderr, "rdma_accept returns %d!\n", ret);
+        return -1;
+    }
+
+    ret = rdma_get_cm_event(rdma_ctx->channel, &cm_event);
+    if (ret) {
+        return -1;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        rdma_ack_cm_event(cm_event);
+        return -1;
+    }
+
+    if (in_len) {
+        if (*in_len > cm_event->param.conn.private_data_len) {
+            *in_len = cm_event->param.conn.private_data_len;
+        }
+        if (*in_len) {
+            memcpy(in_data, cm_event->param.conn.private_data, *in_len);
+        }
+    }
+
+    rdma_ack_cm_event(cm_event);
+
+    return 0;
+}
+
+void qemu_rdma_migrate_disconnect(RDMAContext *rdma_ctx)
+{
+    int ret;
+    struct rdma_cm_event *cm_event;
+
+    ret = rdma_disconnect(rdma_ctx->cm_id);
+    if (ret) {
+        return;
+    }
+    ret = rdma_get_cm_event(rdma_ctx->channel, &cm_event);
+    if (ret) {
+        return;
+    }
+    rdma_ack_cm_event(cm_event);
+}
+
+int qemu_rdma_reg_chunk_ram_blocks(RDMAContext *rdma_ctx,
+        RDMALocalBlocks *rdma_local_ram_blocks);
+
+int qemu_rdma_reg_chunk_ram_blocks(RDMAContext *rdma_ctx,
+        RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i, j;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        RDMALocalBlock *block = &(rdma_local_ram_blocks->block[i]);
+        int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+        /* allocate memory to store chunk MRs */
+        rdma_local_ram_blocks->block[i].pmr = g_malloc0(
+                                num_chunks * sizeof(struct ibv_mr *));
+
+        if (!block->pmr) {
+            goto err_reg_chunk_ram_blocks;
+        }
+
+        for (j = 0; j < num_chunks; j++) {
+            uint8_t *start_addr = RDMA_REG_CHUNK_START(block, j);
+            uint8_t *end_addr = RDMA_REG_CHUNK_END(block, j);
+            if (start_addr < block->local_host_addr) {
+                start_addr = block->local_host_addr;
+            }
+            if (end_addr > block->local_host_addr + block->length) {
+                end_addr = block->local_host_addr + block->length;
+            }
+            block->pmr[j] = ibv_reg_mr(rdma_ctx->pd,
+                                start_addr,
+                                end_addr - start_addr,
+                                IBV_ACCESS_LOCAL_WRITE |
+                                IBV_ACCESS_REMOTE_WRITE |
+                                IBV_ACCESS_REMOTE_READ);
+            if (!block->pmr[j]) {
+                break;
+            }
+        }
+        if (j < num_chunks) {
+            for (j--; j >= 0; j--) {
+                ibv_dereg_mr(block->pmr[j]);
+            }
+            g_free(block->pmr);
+            block->pmr = NULL;
+            goto err_reg_chunk_ram_blocks;
+        }
+    }
+
+    return 0;
+
+err_reg_chunk_ram_blocks:
+    for (i--; i >= 0; i--) {
+        int num_chunks =
+            RDMA_REG_NUM_CHUNKS(&(rdma_local_ram_blocks->block[i]));
+        for (j = 0; j < num_chunks; j++) {
+            ibv_dereg_mr(rdma_local_ram_blocks->block[i].pmr[j]);
+        }
+        g_free(rdma_local_ram_blocks->block[i].pmr);
+        rdma_local_ram_blocks->block[i].pmr = NULL;
+    }
+
+    return -1;
+}
+
+static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma_ctx,
+        RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        rdma_local_ram_blocks->block[i].mr =
+            ibv_reg_mr(rdma_ctx->pd,
+                    rdma_local_ram_blocks->block[i].local_host_addr,
+                    rdma_local_ram_blocks->block[i].length,
+                    IBV_ACCESS_LOCAL_WRITE |
+                    IBV_ACCESS_REMOTE_WRITE |
+                    IBV_ACCESS_REMOTE_READ);
+        if (!rdma_local_ram_blocks->block[i].mr) {
+                break;
+        }
+    }
+
+    if (i >= rdma_local_ram_blocks->num_blocks) {
+        return 0;
+    }
+
+    for (i--; i >= 0; i--) {
+        ibv_dereg_mr(rdma_local_ram_blocks->block[i].mr);
+    }
+
+    return -1;
+}
+
+static int qemu_rdma_client_reg_ram_blocks(RDMAContext *rdma_ctx,
+                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+#ifdef RDMA_CHUNK_REGISTRATION
+#ifdef RDMA_LAZY_REGISTRATION
+    return 0;
+#else
+    return qemu_rdma_reg_chunk_ram_blocks(rdma_ctx, rdma_local_ram_blocks);
+#endif
+#else
+    return qemu_rdma_reg_whole_ram_blocks(rdma_ctx, rdma_local_ram_blocks);
+#endif
+}
+
+static int qemu_rdma_server_reg_ram_blocks(RDMAContext *rdma_ctx,
+                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    return qemu_rdma_reg_whole_ram_blocks(rdma_ctx, rdma_local_ram_blocks);
+}
+
+static void qemu_rdma_dereg_ram_blocks(RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i, j;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        int num_chunks;
+        if (!rdma_local_ram_blocks->block[i].pmr) {
+            continue;
+        }
+        num_chunks = RDMA_REG_NUM_CHUNKS(&(rdma_local_ram_blocks->block[i]));
+        for (j = 0; j < num_chunks; j++) {
+            if (!rdma_local_ram_blocks->block[i].pmr[j]) {
+                continue;
+            }
+            ibv_dereg_mr(rdma_local_ram_blocks->block[i].pmr[j]);
+        }
+        g_free(rdma_local_ram_blocks->block[i].pmr);
+        rdma_local_ram_blocks->block[i].pmr = NULL;
+    }
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        if (!rdma_local_ram_blocks->block[i].mr) {
+            continue;
+        }
+        ibv_dereg_mr(rdma_local_ram_blocks->block[i].mr);
+        rdma_local_ram_blocks->block[i].mr = NULL;
+    }
+}
+
+static void qemu_rdma_copy_to_remote_ram_blocks(RDMALocalBlocks *local,
+                RDMARemoteBlocks *remote)
+{
+    int i;
+    DPRINTF("Allocating %d remote ram block structures\n", local->num_blocks);
+    *remote->num_blocks = local->num_blocks;
+
+    for (i = 0; i < local->num_blocks; i++) {
+        remote->block[i].remote_host_addr =
+            (uint64_t)(local->block[i].local_host_addr);
+        remote->block[i].remote_rkey = local->block[i].mr->rkey;
+        remote->block[i].offset = local->block[i].offset;
+        remote->block[i].length = local->block[i].length;
+    }
+}
+
+static int qemu_rdma_process_remote_ram_blocks(RDMALocalBlocks *local,
+        RDMARemoteBlocks *remote)
+{
+    int i, j;
+
+    if (local->num_blocks != *remote->num_blocks) {
+        fprintf(stderr, "local %d != remote %d\n", 
+            local->num_blocks, *remote->num_blocks);
+        return -1;
+    }
+
+    for (i = 0; i < *remote->num_blocks; i++) {
+        /* search local ram blocks */
+        for (j = 0; j < local->num_blocks; j++) {
+            if (remote->block[i].offset != local->block[j].offset) {
+                continue;
+            }
+            if (remote->block[i].length != local->block[j].length) {
+                return -1;
+            }
+            local->block[j].remote_host_addr =
+                remote->block[i].remote_host_addr;
+            local->block[j].remote_rkey = remote->block[i].remote_rkey;
+            break;
+        }
+        if (j >= local->num_blocks) {
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+static int qemu_rdma_search_ram_block(uint64_t offset, uint64_t length,
+        RDMALocalBlocks *blocks, int *block_index, int *chunk_index)
+{
+    int i;
+    for (i = 0; i < blocks->num_blocks; i++) {
+        if (offset < blocks->block[i].offset) {
+            continue;
+        }
+        if (offset + length >
+                blocks->block[i].offset + blocks->block[i].length) {
+            continue;
+        }
+        *block_index = i;
+        if (chunk_index) {
+            uint8_t *host_addr = blocks->block[i].local_host_addr +
+                (offset - blocks->block[i].offset);
+            *chunk_index = RDMA_REG_CHUNK_INDEX(
+                    blocks->block[i].local_host_addr, host_addr);
+        }
+        return 0;
+    }
+    return -1;
+}
+
+static int qemu_rdma_get_lkey(RDMAContext *rdma_ctx,
+        RDMALocalBlock *block, uint64_t host_addr,
+        uint32_t *lkey)
+{
+    int chunk;
+    if (block->mr) {
+        *lkey = block->mr->lkey;
+        return 0;
+    }
+    if (!block->pmr) {
+        int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+        /* allocate memory to store chunk MRs */
+        block->pmr = g_malloc0(num_chunks *
+                sizeof(struct ibv_mr *));
+        if (!block->pmr) {
+            return -1;
+        }
+    }
+    chunk = RDMA_REG_CHUNK_INDEX(block->local_host_addr, host_addr);
+    if (!block->pmr[chunk]) {
+        uint8_t *start_addr = RDMA_REG_CHUNK_START(block, chunk);
+        uint8_t *end_addr = RDMA_REG_CHUNK_END(block, chunk);
+        if (start_addr < block->local_host_addr) {
+            start_addr = block->local_host_addr;
+        }
+        if (end_addr > block->local_host_addr + block->length) {
+            end_addr = block->local_host_addr + block->length;
+        }
+        block->pmr[chunk] = ibv_reg_mr(rdma_ctx->pd,
+                start_addr,
+                end_addr - start_addr,
+                IBV_ACCESS_LOCAL_WRITE |
+                IBV_ACCESS_REMOTE_WRITE |
+                IBV_ACCESS_REMOTE_READ);
+        if (!block->pmr[chunk]) {
+            return -1;
+        }
+    }
+    *lkey = block->pmr[chunk]->lkey;
+    return 0;
+}
+
+/* Do not merge data if larger than this. */
+#define RDMA_MERGE_MAX (4 * 1024 * 1024)
+
+#define RDMA_UNSIGNALED_SEND_MAX 64
+
+static int qemu_rdma_reg_remote_info(RDMAData *rdma)
+{
+    int info_size = (sizeof(RDMARemoteBlock) * 
+                rdma->rdma_local_ram_blocks.num_blocks)
+                +   sizeof(*rdma->remote_info.num_blocks);
+
+    DPRINTF("Preparing %d bytes for remote info\n", info_size);
+
+    rdma->remote_info.remote_info_area = g_malloc0(info_size);
+    rdma->remote_info.info_size = info_size;
+    rdma->remote_info.num_blocks = rdma->remote_info.remote_info_area;
+    rdma->remote_info.block = (void *) (rdma->remote_info.num_blocks + 1);
+
+    rdma->remote_info_mr = ibv_reg_mr(rdma->rdma_ctx.pd,
+            rdma->remote_info.remote_info_area, info_size,
+            IBV_ACCESS_LOCAL_WRITE |
+            IBV_ACCESS_REMOTE_WRITE |
+            IBV_ACCESS_REMOTE_READ);
+    if (rdma->remote_info_mr) {
+        return 0;
+    }
+    return -1;
+}
+
+static int qemu_rdma_dereg_remote_info(RDMAData *rdma)
+{
+    int ret = ibv_dereg_mr(rdma->remote_info_mr);
+
+    g_free(rdma->remote_info.remote_info_area);
+
+    return ret;
+}
+
+static int qemu_rdma_reg_qemu_file(RDMAData *rdma)
+{
+    rdma->qemu_file_mr = ibv_reg_mr(rdma->rdma_ctx.pd,
+            rdma->qemu_file, QEMU_FILE_RDMA_MAX,
+            IBV_ACCESS_LOCAL_WRITE |
+            IBV_ACCESS_REMOTE_WRITE |
+            IBV_ACCESS_REMOTE_READ);
+    if (rdma->qemu_file_mr) {
+        return 0;
+    }
+    return -1;
+}
+
+static int qemu_rdma_dereg_qemu_file(RDMAData *rdma)
+{
+    return ibv_dereg_mr(rdma->qemu_file_mr);
+}
+
+static int qemu_rdma_post_send(RDMAData *rdma, struct ibv_sge *sge,
+        uint64_t wr_id)
+{
+    struct ibv_send_wr send_wr = { 0 };
+    struct ibv_send_wr *bad_wr;
+
+    send_wr.wr_id = wr_id;
+    send_wr.opcode = IBV_WR_SEND;
+    send_wr.send_flags = IBV_SEND_SIGNALED;
+    send_wr.sg_list = sge;
+    send_wr.num_sge = 1;
+
+    if (ibv_post_send(rdma->rdma_ctx.qp, &send_wr, &bad_wr)) {
+        return -1;
+    }
+
+    return 0;
+}
+
+static int qemu_rdma_post_recv(RDMAData *rdma, struct ibv_sge *sge,
+        uint64_t wr_id)
+{
+    struct ibv_recv_wr recv_wr = { 0 };
+    struct ibv_recv_wr *bad_wr;
+
+    recv_wr.wr_id = wr_id;
+    recv_wr.sg_list = sge;
+    recv_wr.num_sge = 1;
+
+    if (ibv_post_recv(rdma->rdma_ctx.qp, &recv_wr, &bad_wr)) {
+        return -1;
+    }
+
+    return 0;
+}
+
+int qemu_rdma_post_send_remote_info(RDMAData *rdma)
+{
+    int ret;
+    struct ibv_sge sge;
+
+    sge.addr = (uint64_t)(rdma->remote_info.remote_info_area);
+    sge.length = rdma->remote_info.info_size;
+    sge.lkey = rdma->remote_info_mr->lkey;
+
+    ret = qemu_rdma_post_send(rdma, &sge, RDMA_WRID_SEND_REMOTE_INFO);
+    return ret;
+}
+
+static int qemu_rdma_post_recv_remote_info(RDMAData *rdma)
+{
+    struct ibv_sge sge;
+
+    sge.addr = (uint64_t)(rdma->remote_info.remote_info_area);
+    sge.length = rdma->remote_info.info_size;
+    sge.lkey = rdma->remote_info_mr->lkey;
+
+    return qemu_rdma_post_recv(rdma, &sge, RDMA_WRID_RECV_REMOTE_INFO);
+}
+
+static int qemu_rdma_post_send_qemu_file(RDMAData *rdma, uint8_t *buf,
+        size_t len)
+{
+    int ret;
+    struct ibv_sge sge;
+    int count_len = sizeof(size_t);
+
+    memcpy(rdma->qemu_file, &len, count_len);
+    memcpy(rdma->qemu_file + count_len, buf, len);
+
+    len += count_len;
+
+    sge.addr = (uint64_t)(rdma->qemu_file);
+    sge.length = len;
+    sge.lkey = rdma->qemu_file_mr->lkey;
+
+    ret = qemu_rdma_post_send(rdma, &sge, RDMA_WRID_SEND_QEMU_FILE);
+
+    if (ret < 0) {
+        fprintf(stderr, "Failed to use post IB SEND for qemu file!\n");
+        return ret;
+    }
+
+    ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_SEND_QEMU_FILE);
+    if (ret < 0) {
+        qemu_rdma_print("rdma migration: polling qemu file error!");
+    }
+
+    return ret;
+}
+
+
+int qemu_rdma_post_recv_qemu_file(RDMAData *rdma)
+{
+    struct ibv_sge sge;
+
+    sge.addr = (uint64_t)(rdma->qemu_file);
+    sge.length = QEMU_FILE_RDMA_MAX;
+    sge.lkey = rdma->qemu_file_mr->lkey;
+
+    return qemu_rdma_post_recv(rdma, &sge, RDMA_WRID_RECV_QEMU_FILE);
+}
+
+static int __qemu_rdma_write(RDMAContext *rdma_ctx,
+        RDMALocalBlock *block,
+        uint64_t offset, uint64_t length,
+        uint64_t wr_id, enum ibv_send_flags flag)
+{
+    struct ibv_sge sge;
+    struct ibv_send_wr send_wr = { 0 };
+    struct ibv_send_wr *bad_wr;
+
+    sge.addr = (uint64_t)(block->local_host_addr + (offset - block->offset));
+    sge.length = length;
+    if (qemu_rdma_get_lkey(rdma_ctx, block, sge.addr, &sge.lkey)) {
+        fprintf(stderr, "cannot get lkey!\n");
+        return -EINVAL;
+    }
+    send_wr.wr_id = wr_id;
+    send_wr.opcode = IBV_WR_RDMA_WRITE;
+    send_wr.send_flags = flag;
+    send_wr.sg_list = &sge;
+    send_wr.num_sge = 1;
+    send_wr.wr.rdma.rkey = block->remote_rkey;
+    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
+        (offset - block->offset);
+
+    return ibv_post_send(rdma_ctx->qp, &send_wr, &bad_wr);
+}
+
+int qemu_rdma_write_flush(RDMAData *rdma)
+{
+    int ret;
+    enum ibv_send_flags flags = 0;
+
+    if (!rdma->current_length) {
+        return 0;
+    }
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        flags = IBV_SEND_SIGNALED;
+    }
+
+    while (1) {
+        ret = __qemu_rdma_write(&rdma->rdma_ctx,
+                &(rdma->rdma_local_ram_blocks.block[rdma->current_index]),
+                rdma->current_offset,
+                rdma->current_length,
+                RDMA_WRID_RDMA, flags);
+        if (ret) {
+            if (ret == ENOMEM) {
+                DPRINTF("send queue is full. wait a little....\n");
+                ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_RDMA);
+                if (ret < 0) {
+                    fprintf(stderr, "rdma migration: failed to make room "
+                            "in full send queue! %d\n", ret);
+                    return -EIO;
+                }
+            } else {
+                fprintf(stderr, "rdma migration: write flush error! %d\n",
+                        ret);
+                perror("write flush error");
+                return -EIO;
+            }
+        } else {
+            break;
+        }
+    }
+
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        rdma->num_unsignaled_send = 0;
+        rdma->num_signaled_send++;
+        DPRINTF("signaled total: %d\n", rdma->num_signaled_send);
+    } else {
+        rdma->num_unsignaled_send++;
+    }
+
+    rdma->total_bytes += rdma->current_length;
+    rdma->current_length = 0;
+    rdma->current_offset = 0;
+
+    return 0;
+}
+
+static inline int qemu_rdma_in_current_block(RDMAData *rdma,
+                uint64_t offset, uint64_t len)
+{
+    RDMALocalBlock *block;
+    if (rdma->current_index < 0) {
+        return 0;
+    }
+    block = &(rdma->rdma_local_ram_blocks.block[rdma->current_index]);
+    if (offset < block->offset) {
+        return 0;
+    }
+    if (offset + len > block->offset + block->length) {
+        return 0;
+    }
+    return 1;
+}
+
+static inline int qemu_rdma_in_current_chunk(RDMAData *rdma,
+                uint64_t offset, uint64_t len)
+{
+    RDMALocalBlock *block;
+    uint8_t *chunk_start, *chunk_end, *host_addr;
+    if (rdma->current_chunk < 0) {
+        return 0;
+    }
+    block = &(rdma->rdma_local_ram_blocks.block[rdma->current_index]);
+    host_addr = block->local_host_addr + (offset - block->offset);
+    chunk_start = RDMA_REG_CHUNK_START(block, rdma->current_chunk);
+    if (chunk_start < block->local_host_addr) {
+        chunk_start = block->local_host_addr;
+    }
+    if (host_addr < chunk_start) {
+        return 0;
+    }
+    chunk_end = RDMA_REG_CHUNK_END(block, rdma->current_chunk);
+    if (chunk_end > chunk_start + block->length) {
+        chunk_end = chunk_start + block->length;
+    }
+    if (host_addr + len > chunk_end) {
+        return 0;
+    }
+    return 1;
+}
+
+static inline int qemu_rdma_buffer_mergable(RDMAData *rdma,
+                    uint64_t offset, uint64_t len)
+{
+    if (rdma->current_length == 0) {
+        return 0;
+    }
+    if (offset != rdma->current_offset + rdma->current_length) {
+        return 0;
+    }
+    if (!qemu_rdma_in_current_block(rdma, offset, len)) {
+        return 0;
+    }
+#ifdef RDMA_CHUNK_REGISTRATION
+    if (!qemu_rdma_in_current_chunk(rdma, offset, len)) {
+        return 0;
+    }
+#endif
+    return 1;
+}
+
+/* Note that buffer must be within a single block/chunk. */
+int qemu_rdma_write(RDMAData *rdma, uint64_t offset, uint64_t len)
+{
+    int index = rdma->current_index;
+    int chunk_index = rdma->current_chunk;
+    int ret;
+
+    /* If we cannot merge it, we flush the current buffer first. */
+    if (!qemu_rdma_buffer_mergable(rdma, offset, len)) {
+        ret = qemu_rdma_write_flush(rdma);
+        if (ret) {
+            return ret;
+        }
+        rdma->current_length = 0;
+        rdma->current_offset = offset;
+
+        ret = qemu_rdma_search_ram_block(offset, len,
+                    &rdma->rdma_local_ram_blocks, &index, &chunk_index);
+        if (ret) {
+            fprintf(stderr, "ram block search failed\n");
+            return ret;
+        }
+        rdma->current_index = index;
+        rdma->current_chunk = chunk_index;
+    }
+
+    /* merge it */
+    rdma->current_length += len;
+
+    /* flush it if buffer is too large */
+    if (rdma->current_length >= RDMA_MERGE_MAX) {
+        return qemu_rdma_write_flush(rdma);
+    }
+
+    return 0;
+}
+
+int qemu_rdma_poll(RDMAData *rdma)
+{
+    int ret;
+    struct ibv_wc wc;
+
+    ret = ibv_poll_cq(rdma->rdma_ctx.cq, 1, &wc);
+    if (!ret) {
+        return RDMA_WRID_NONE;
+    }
+    if (ret < 0) {
+        fprintf(stderr, "ibv_poll_cq returned %d!\n", ret);
+        return ret;
+    }
+    if (wc.status != IBV_WC_SUCCESS) {
+        fprintf(stderr, "ibv_poll_cq wc.status=%d %s!\n",
+                        wc.status, ibv_wc_status_str(wc.status));
+        fprintf(stderr, "ibv_poll_cq wrid=%s!\n", wrid_desc[wc.wr_id]);
+
+        return -1;
+    }
+
+    if (rdma->qemu_file_send_waiting &&
+        (wc.wr_id == RDMA_WRID_RECV_QEMU_FILE)) {
+        DPRINTF("completion %s received\n", wrid_desc[wc.wr_id]);
+        rdma->qemu_file_send_waiting = 0;
+    }
+
+    if (wc.wr_id == RDMA_WRID_RDMA) {
+        rdma->num_signaled_send--;
+        DPRINTF("completions %d %s left %d\n",
+            ret, wrid_desc[wc.wr_id], rdma->num_signaled_send);
+    } else {
+        DPRINTF("other completion %d %s received left %d\n",
+            ret, wrid_desc[wc.wr_id], rdma->num_signaled_send);
+    }
+
+    return (int)wc.wr_id;
+}
+
+int qemu_rdma_wait_for_wrid(RDMAData *rdma, int wrid)
+{
+#ifdef RDMA_BLOCKING
+    return qemu_rdma_block_for_wrid(rdma, wrid);
+#else
+    return qemu_rdma_poll_for_wrid(rdma, wrid);
+#endif
+}
+
+int qemu_rdma_poll_for_wrid(RDMAData *rdma, int wrid)
+{
+    int r = RDMA_WRID_NONE;
+    while (r != wrid) {
+        r = qemu_rdma_poll(rdma);
+        if (r < 0) {
+            return r;
+        }
+    }
+    return 0;
+}
+
+int qemu_rdma_block_for_wrid(RDMAData *rdma, int wrid)
+{
+    int num_cq_events = 0;
+    int r = RDMA_WRID_NONE;
+    struct ibv_cq *cq;
+    void *cq_ctx;
+
+    if (ibv_req_notify_cq(rdma->rdma_ctx.cq, 0)) {
+        return -1;
+    }
+    /* poll cq first */
+    while (r != wrid) {
+        r = qemu_rdma_poll(rdma);
+        if (r < 0) {
+            return r;
+        }
+        if (r == RDMA_WRID_NONE) {
+            break;
+        }
+        if (r != wrid) {
+            DPRINTF("A Wanted wrid %d but got %d\n", wrid, r);
+        }
+    }
+    if (r == wrid) {
+        return 0;
+    }
+
+    while (1) {
+        if (ibv_get_cq_event(rdma->rdma_ctx.comp_channel,
+                    &cq, &cq_ctx)) {
+            goto err_block_for_wrid;
+        }
+        num_cq_events++;
+        if (ibv_req_notify_cq(cq, 0)) {
+            goto err_block_for_wrid;
+        }
+        /* poll cq */
+        while (r != wrid) {
+            r = qemu_rdma_poll(rdma);
+            if (r < 0) {
+                goto err_block_for_wrid;
+            }
+            if (r == RDMA_WRID_NONE) {
+                break;
+            }
+            if (r != wrid) {
+                DPRINTF("B Wanted wrid %d but got %d\n", wrid, r);
+            }
+        }
+        if (r == wrid) {
+            goto success_block_for_wrid;
+        }
+    }
+
+success_block_for_wrid:
+    if (num_cq_events) {
+        ibv_ack_cq_events(cq, num_cq_events);
+    }
+    return 0;
+
+err_block_for_wrid:
+    if (num_cq_events) {
+        ibv_ack_cq_events(cq, num_cq_events);
+    }
+    return -1;
+}
+
+void qemu_rdma_cleanup(RDMAData *rdma)
+{
+    RDMAContext *rdma_ctx = &rdma->rdma_ctx;
+    
+    rdma->enabled = 0;
+    if (rdma->remote_info_mr) {
+        qemu_rdma_dereg_remote_info(rdma);
+    }
+    if (rdma->qemu_file_mr) {
+        qemu_rdma_dereg_qemu_file(rdma);
+    }
+    rdma->sync_mr = NULL;
+    rdma->remote_info_mr = NULL;
+    rdma->qemu_file_mr = NULL;
+    qemu_rdma_dereg_ram_blocks(&rdma->rdma_local_ram_blocks);
+
+    g_free(rdma->rdma_local_ram_blocks.block);
+
+    if (rdma_ctx->qp) {
+        ibv_destroy_qp(rdma_ctx->qp);
+    }
+    if (rdma_ctx->cq) {
+        ibv_destroy_cq(rdma_ctx->cq);
+    }
+    if (rdma_ctx->comp_channel) {
+        ibv_destroy_comp_channel(rdma_ctx->comp_channel);
+    }
+    if (rdma_ctx->pd) {
+        ibv_dealloc_pd(rdma_ctx->pd);
+    }
+    if (rdma_ctx->listen_id) {
+        rdma_destroy_id(rdma_ctx->listen_id);
+    }
+    if (rdma_ctx->cm_id) {
+        rdma_destroy_id(rdma_ctx->cm_id);
+    }
+    if (rdma_ctx->channel) {
+        rdma_destroy_event_channel(rdma_ctx->channel);
+    }
+
+    qemu_rdma_data_init(rdma, NULL, NULL);
+}
+
+int qemu_rdma_client_init(RDMAData *rdma, Error **errp)
+{
+    int ret;
+
+    if (rdma->client_init_done) {
+        return 0;
+    }
+
+    ret = qemu_rdma_resolve_host(&rdma->rdma_ctx, rdma->host, rdma->port);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error resolving host!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_alloc_pd_cq(&rdma->rdma_ctx);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error allocating pd and cq!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_alloc_qp(&rdma->rdma_ctx);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error allocating qp!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_init_ram_blocks(&rdma->rdma_local_ram_blocks);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error initializing ram blocks!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_client_reg_ram_blocks(&rdma->rdma_ctx, &rdma->rdma_local_ram_blocks);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error registering ram blocks!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_reg_remote_info(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error registering remote info!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_reg_qemu_file(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error registering 1st qemu file!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_post_recv_remote_info(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error posting remote info recv!");
+        goto err_rdma_client_init;
+    }
+
+    rdma->client_init_done = 1;
+    return 0;
+
+err_rdma_client_init:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+int qemu_rdma_client_connect(RDMAData *rdma, Error **errp)
+{
+    int ret;
+    ret = qemu_rdma_migrate_connect(&rdma->rdma_ctx, NULL, NULL, NULL, 0);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = qemu_rdma_post_recv_qemu_file(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error posting first qemu file recv!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_RECV_REMOTE_INFO);
+    if (ret < 0) {
+        qemu_rdma_print("rdma migration: polling remote info error!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = qemu_rdma_process_remote_ram_blocks(
+            &rdma->rdma_local_ram_blocks, &rdma->remote_info);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error processing remote ram blocks!");
+        goto err_rdma_client_connect;
+    }
+
+    rdma->qemu_file_send_waiting = 1;
+    rdma->num_signaled_send = 0;
+    rdma->total_bytes = 0;
+    rdma->enabled = 1;
+    return 0;
+
+err_rdma_client_connect:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+int qemu_rdma_server_init(RDMAData *rdma, Error **errp)
+{
+    int ret;
+    struct sockaddr_in sin;
+    struct rdma_cm_id *listen_id;
+    RDMAContext *rdma_ctx = &rdma->rdma_ctx;
+    char ip[40] = "unknown";
+    rdma->qemu_file_len = 0;
+    rdma->qemu_file_curr = NULL;
+
+    if (rdma->host == NULL) {
+        qemu_rdma_print("Error: RDMA host is not set!");
+        return -1;
+    }
+    /* create CM channel */
+    rdma_ctx->channel = rdma_create_event_channel();
+    if (!rdma_ctx->channel) {
+        qemu_rdma_print("Error: could not create rdma event channel");
+        return -1;
+    }
+
+    /* create CM id */
+    ret = rdma_create_id(rdma_ctx->channel, &listen_id, NULL, RDMA_PS_TCP);
+    if (ret) {
+        qemu_rdma_print("Error: could not create cm_id!");
+        goto err_server_init_create_listen_id;
+    }
+
+    memset(&sin, 0, sizeof(sin));
+    sin.sin_family = AF_INET;
+    sin.sin_port = htons(rdma->port);
+
+    if (rdma->host && strcmp("", rdma->host)) {
+        struct hostent *server_addr;
+        server_addr = gethostbyname(rdma->host);
+        if (!server_addr) {
+            qemu_rdma_print("Error: migration could not gethostbyname!");
+            goto err_server_init_bind_addr;
+        }
+        memcpy(&sin.sin_addr.s_addr, server_addr->h_addr,
+                server_addr->h_length);
+        inet_ntop(AF_INET, server_addr->h_addr, ip, sizeof ip);
+    } else {
+        sin.sin_addr.s_addr = INADDR_ANY;
+    }
+
+    DPRINTF("%s => %s\n", rdma->host, ip);
+
+    ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin);
+    if (ret) {
+        qemu_rdma_print("Error: could not rdma_bind_addr!");
+        goto err_server_init_bind_addr;
+    }
+
+    rdma_ctx->listen_id = listen_id;
+    if (listen_id->verbs) {
+        rdma_ctx->verbs = listen_id->verbs;
+    }
+    qemu_rdma_dump_id("server_init", rdma_ctx->verbs);
+    qemu_rdma_dump_gid("server_init", listen_id);
+    return 0;
+
+err_server_init_bind_addr:
+    rdma_destroy_id(listen_id);
+err_server_init_create_listen_id:
+    rdma_destroy_event_channel(rdma_ctx->channel);
+    rdma_ctx->channel = NULL;
+    return -1;
+}
+
+int qemu_rdma_server_prepare(RDMAData *rdma, Error **errp)
+{
+    int ret;
+    RDMAContext *rdma_ctx = &rdma->rdma_ctx;
+
+    if (!rdma_ctx->verbs) {
+        qemu_rdma_print("rdma migration: no verbs context!");
+        return -1;
+    }
+
+    ret = qemu_rdma_alloc_pd_cq(rdma_ctx);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error allocating pd and cq!");
+        goto err_rdma_server_prepare;
+    }
+
+    ret = qemu_rdma_init_ram_blocks(&rdma->rdma_local_ram_blocks);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error initializing ram blocks!");
+        goto err_rdma_server_prepare;
+    }
+
+    ret = qemu_rdma_server_reg_ram_blocks(rdma_ctx,
+            &rdma->rdma_local_ram_blocks);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error registering ram blocks!");
+        goto err_rdma_server_prepare;
+    }
+
+    ret = qemu_rdma_reg_remote_info(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error registering remote info!");
+        goto err_rdma_server_prepare;
+    }
+
+    qemu_rdma_copy_to_remote_ram_blocks(&rdma->rdma_local_ram_blocks,
+            &rdma->remote_info);
+
+    ret = qemu_rdma_reg_qemu_file(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error registering 1st qemu file!");
+        goto err_rdma_server_prepare;
+    }
+
+    ret = rdma_listen(rdma_ctx->listen_id, 5);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error listening on socket!");
+        goto err_rdma_server_prepare;
+    }
+
+    return 0;
+
+err_rdma_server_prepare:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+int qemu_rdma_data_init(RDMAData *rdma, const char *host_port, Error **errp)
+{
+    InetSocketAddress *addr;
+
+    memset(rdma, 0, sizeof(RDMAData));
+
+    rdma->current_index = -1;
+    rdma->current_chunk = -1;
+
+    if (host_port) {
+        addr = inet_parse(host_port, errp);
+        if (addr != NULL) {
+            rdma->port = atoi(addr->port);
+            rdma->host = g_strdup(addr->host);
+            printf("rdma host: %s\n", rdma->host);
+            printf("rdma port: %d\n", rdma->port);
+        } else {
+            error_setg(errp, "bad RDMA migration address '%s'", host_port);
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+void qemu_rdma_disable(RDMAData *rdma)
+{
+    rdma->port = -1;
+    rdma->enabled = 0;
+}
+
+int qemu_rdma_exchange_send(RDMAData *rdma, uint8_t *data, size_t len)
+{
+    int ret;
+
+    if (rdma->qemu_file_send_waiting) {
+        ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_RECV_QEMU_FILE);
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: polling qemu file error!\n");
+            return ret;
+        }
+    }
+
+    rdma->qemu_file_send_waiting = 1;
+
+    ret = qemu_rdma_post_recv_qemu_file(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting first qemu file recv!\n");
+        return ret;
+    }
+
+    ret = qemu_rdma_post_send_qemu_file(rdma, data, len);
+    if (ret < 0) {
+        fprintf(stderr, "Failed to send qemu file buffer!\n");
+        return ret;
+    }
+
+    return 0;
+}
+
+int qemu_rdma_exchange_recv(void *opaque)
+{
+    RDMAData *rdma = opaque;
+    int ret = 0;
+    int count_len = sizeof(size_t);
+
+    ret = qemu_rdma_post_send_qemu_file(rdma, &(rdma->b), 1);
+    if (ret < 0) {
+        fprintf(stderr, "Failed to send qemu file buffer!\n");
+        return ret;
+    }
+
+    ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_RECV_QEMU_FILE);
+    if (ret < 0) {
+        fprintf(stderr, "rdma migration: polling qemu file error!\n");
+        return ret;
+    }
+
+    rdma->qemu_file_len = *((size_t *)rdma->qemu_file);
+    rdma->qemu_file_curr = rdma->qemu_file + count_len;
+
+    ret = qemu_rdma_post_recv_qemu_file(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting second qemu file recv!\n");
+        return ret;
+    }
+
+    return 0;
+}
+
+int qemu_rdma_drain_cq(void *opaque)
+{
+    RDMAData *rdma = opaque;
+    int ret;
+
+    if (qemu_rdma_write_flush(rdma) < 0) {
+        return -EIO;
+    }
+
+    while (rdma->num_signaled_send) {
+        ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_RDMA);
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: complete polling error!\n");
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+int qemu_rdma_enabled(void *opaque)
+{
+    RDMAData *rdma = opaque;
+    return rdma->enabled;
+}
+
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 73+ messages in thread
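The coalescing logic in qemu_rdma_write()/qemu_rdma_write_flush() above folds contiguous pc.ram pages into a single pending RDMA write, flushing only when a page fails to merge or the buffer reaches RDMA_MERGE_MAX. The core merge decision can be sketched in isolation as follows (hypothetical MergeState type and merge_page() helper, not code from the patch; the real code additionally checks RAM block and chunk boundaries before merging):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, pared-down mirror of the merge state kept in RDMAData. */
typedef struct {
    uint64_t current_offset;   /* start of the pending, unflushed write */
    uint64_t current_length;   /* bytes accumulated so far */
} MergeState;

/* Returns 1 when offset extends the pending write contiguously,
 * so the page can be folded into the same RDMA write. */
static int buffer_mergable(const MergeState *s, uint64_t offset)
{
    if (s->current_length == 0) {
        return 0;              /* nothing pending: start a new buffer */
    }
    return offset == s->current_offset + s->current_length;
}

/* Add one page to the buffer; returns 1 when the caller had to
 * flush (i.e. the page was not contiguous with the pending write). */
int merge_page(MergeState *s, uint64_t offset, uint64_t len)
{
    int must_flush = !buffer_mergable(s, offset);
    if (must_flush) {
        /* flush resets the window, as qemu_rdma_write_flush() does */
        s->current_offset = offset;
        s->current_length = 0;
    }
    s->current_length += len;
    return must_flush;
}
```

A run of contiguous 4 KB pages therefore triggers one flush at the first page and none afterwards, which is the whole point of the chunking: far fewer work requests than one per page.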

* [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
                   ` (5 preceding siblings ...)
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 06/10] core RDMA migration code (rdma.c) mrhines
@ 2013-03-18  3:19 ` mrhines
  2013-03-18  8:56   ` Paolo Bonzini
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA mrhines
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 73+ messages in thread
From: mrhines @ 2013-03-18  3:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 205 insertions(+)
 create mode 100644 migration-rdma.c

diff --git a/migration-rdma.c b/migration-rdma.c
new file mode 100644
index 0000000..e1ea055
--- /dev/null
+++ b/migration-rdma.c
@@ -0,0 +1,205 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "migration/rdma.h"
+#include "qemu-common.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+//#define DEBUG_MIGRATION_RDMA
+
+#ifdef DEBUG_MIGRATION_RDMA
+#define DPRINTF(fmt, ...) \
+    do { printf("migration-rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+static int rdma_accept_incoming_migration(RDMAData *rdma, Error **errp)
+{
+    int ret;
+
+    ret = qemu_rdma_migrate_listen(rdma, rdma->host, rdma->port);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error listening!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = qemu_rdma_alloc_qp(&rdma->rdma_ctx);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error allocating qp!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = qemu_rdma_migrate_accept(&rdma->rdma_ctx, NULL, NULL, NULL, 0);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error accepting connection!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = qemu_rdma_post_recv_qemu_file(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error posting second qemu file recv!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = qemu_rdma_post_send_remote_info(rdma);
+    if (ret) {
+        qemu_rdma_print("rdma migration: error sending remote info!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_SEND_REMOTE_INFO);
+    if (ret < 0) {
+        qemu_rdma_print("rdma migration: polling remote info error!");
+        goto err_rdma_server_wait;
+    }
+
+    rdma->total_bytes = 0;
+    rdma->enabled = 1;
+    qemu_rdma_dump_gid("server_connect", rdma->rdma_ctx.cm_id);
+    return 0;
+
+err_rdma_server_wait:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+
+}
+
+int rdma_start_incoming_migration(const char * host_port, Error **errp)
+{
+    RDMAData *rdma = g_malloc0(sizeof(RDMAData));
+    QEMUFile *f;
+    int ret;
+
+    if ((ret = qemu_rdma_data_init(rdma, host_port, errp)) < 0)
+        return ret;
+
+    ret = qemu_rdma_server_init(rdma, NULL);
+
+    DPRINTF("Starting RDMA-based incoming migration\n");
+
+    if (!ret) {
+        DPRINTF("qemu_rdma_server_init success\n");
+        ret = qemu_rdma_server_prepare(rdma, NULL);
+
+        if (!ret) {
+            DPRINTF("qemu_rdma_server_prepare success\n");
+
+            ret = rdma_accept_incoming_migration(rdma, NULL);
+            if (ret)
+                return ret;
+            f = qemu_fopen_rdma(rdma, "rb");
+            if (f == NULL) {
+                fprintf(stderr, "could not qemu_fopen RDMA\n");
+                return -EIO;
+            }
+
+            process_incoming_migration(f);
+        }
+    }
+
+    return ret;
+}
+
+void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp)
+{
+    RDMAData *rdma = g_malloc0(sizeof(RDMAData));
+    MigrationState *s = opaque;
+    int ret;
+
+    if (qemu_rdma_data_init(rdma, host_port, errp) < 0)
+        return;
+
+    ret = qemu_rdma_client_init(rdma, NULL);
+    if (!ret) {
+        DPRINTF("qemu_rdma_client_init success\n");
+        ret = qemu_rdma_client_connect(rdma, NULL);
+
+        if (!ret) {
+            s->file = qemu_fopen_rdma(rdma, "wb");
+            DPRINTF("qemu_rdma_client_connect success\n");
+            migrate_fd_connect(s);
+            return;
+        }
+    }
+
+    migrate_fd_error(s);
+}
+
+size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, ram_addr_t offset, int cont, size_t size)
+{
+    int ret;
+    size_t bytes_sent = 0;
+    ram_addr_t current_addr;
+    RDMAData *rdma = migrate_use_rdma(f);
+
+    current_addr = block_offset + offset;
+
+    /*
+     * Add this page to the current 'chunk'. If the chunk
+     * is full, an actual RDMA write will occur.
+     */
+    if ((ret = qemu_rdma_write(rdma, current_addr, size)) < 0) {
+        fprintf(stderr, "rdma migration: write error! %d\n", ret);
+        return ret;
+    }
+
+    /*
+     * Drain the Completion Queue if possible.
+     * If not, the end of the iteration will do this
+     * again to make sure we don't overflow the
+     * request queue.
+     */
+    while (1) {
+        ret = qemu_rdma_poll(rdma);
+        if (ret == RDMA_WRID_NONE) {
+            break;
+        }
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: polling error! %d\n", ret);
+            return ret;
+        }
+    }
+
+    bytes_sent += size;
+    return bytes_sent;
+}
+
+size_t qemu_rdma_fill(void *opaque, uint8_t *buf, int size)
+{
+    RDMAData *rdma = opaque;
+    size_t len = 0;
+
+    if (rdma->qemu_file_len) {
+        DPRINTF("RDMA %zu of %d bytes already in buffer\n",
+                rdma->qemu_file_len, size);
+
+        len = MIN(size, rdma->qemu_file_len);
+        memcpy(buf, rdma->qemu_file_curr, len);
+        rdma->qemu_file_curr += len;
+        rdma->qemu_file_len -= len;
+    }
+
+    return len;
+}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 73+ messages in thread
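qemu_rdma_fill() above serves QEMUFile reads out of the most recent SEND message until it is exhausted; only then does qemu_rdma_exchange_recv() block for the next one. That buffered-read behavior can be sketched on its own like this (hypothetical RecvBuf type and fill() helper, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the receive-side state in RDMAData:
 * qemu_file_curr / qemu_file_len in the patch. */
typedef struct {
    const unsigned char *curr;  /* next unread byte of the last SEND */
    size_t len;                 /* bytes remaining from that SEND */
} RecvBuf;

/* Mirrors the shape of qemu_rdma_fill(): copy out as many of the
 * requested bytes as the buffered SEND message still holds. */
size_t fill(RecvBuf *r, unsigned char *buf, size_t size)
{
    size_t n = size < r->len ? size : r->len;
    if (n) {
        memcpy(buf, r->curr, n);
        r->curr += n;
        r->len -= n;
    }
    return n;
}
```

When fill() returns 0 the buffer is drained, which is exactly the point at which qemu_rdma_get_buffer() blocks for another SEND and then retries the fill.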

* [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
                   ` (6 preceding siblings ...)
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA mrhines
@ 2013-03-18  3:19 ` mrhines
  2013-03-18  9:09   ` Paolo Bonzini
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls mrhines
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 10/10] send pc.ram over RDMA mrhines
  9 siblings, 1 reply; 73+ messages in thread
From: mrhines @ 2013-03-18  3:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This compiles with and without --enable-rdma.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/qemu-file.h |   10 +++
 savevm.c                      |  172 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 172 insertions(+), 10 deletions(-)

diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index df81261..9046751 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -51,23 +51,33 @@ typedef int (QEMUFileCloseFunc)(void *opaque);
  */
 typedef int (QEMUFileGetFD)(void *opaque);
 
+/* 
+ * 'drain' from a QEMUFile perspective means
+ * to flush the outbound send buffer
+ * (if one exists). (Only used by RDMA right now)
+ */
+typedef int (QEMUFileDrainFunc)(void *opaque);
+
 typedef struct QEMUFileOps {
     QEMUFilePutBufferFunc *put_buffer;
     QEMUFileGetBufferFunc *get_buffer;
     QEMUFileCloseFunc *close;
     QEMUFileGetFD *get_fd;
+    QEMUFileDrainFunc *drain;
 } QEMUFileOps;
 
 QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
 QEMUFile *qemu_fopen(const char *filename, const char *mode);
 QEMUFile *qemu_fdopen(int fd, const char *mode);
 QEMUFile *qemu_fopen_socket(int fd, const char *mode);
+QEMUFile *qemu_fopen_rdma(void *opaque, const char *mode);
 QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
 int qemu_get_fd(QEMUFile *f);
 int qemu_fclose(QEMUFile *f);
 int64_t qemu_ftell(QEMUFile *f);
 void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
 void qemu_put_byte(QEMUFile *f, int v);
+int qemu_drain(QEMUFile *f);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
diff --git a/savevm.c b/savevm.c
index 35c8d1e..9b90b7f 100644
--- a/savevm.c
+++ b/savevm.c
@@ -32,6 +32,7 @@
 #include "qemu/timer.h"
 #include "audio/audio.h"
 #include "migration/migration.h"
+#include "migration/rdma.h"
 #include "qemu/sockets.h"
 #include "qemu/queue.h"
 #include "sysemu/cpus.h"
@@ -143,6 +144,13 @@ typedef struct QEMUFileSocket
     QEMUFile *file;
 } QEMUFileSocket;
 
+typedef struct QEMUFileRDMA
+{
+    void *rdma;
+    size_t len;
+    QEMUFile *file;
+} QEMUFileRDMA;
+
 typedef struct {
     Coroutine *co;
     int fd;
@@ -178,6 +186,66 @@ static int socket_get_fd(void *opaque)
     return s->fd;
 }
 
+/*
+ * SEND messages for non-live state only.
+ * pc.ram is handled elsewhere...
+ */
+static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileRDMA *r = opaque;
+    size_t remaining = size;
+    uint8_t *data = (void *)buf;
+
+    /*
+     * Although we're sending non-live
+     * state here, push out any writes that
+     * we've queued up for pc.ram anyway.
+     */
+    if (qemu_rdma_write_flush(r->rdma) < 0)
+        return -EIO;
+
+    while (remaining) {
+        r->len = MIN(remaining, RDMA_SEND_INCREMENT);
+        remaining -= r->len;
+
+        if (qemu_rdma_exchange_send(r->rdma, data, r->len) < 0)
+            return -EINVAL;
+
+        data += r->len;
+    }
+
+    return size;
+}
+
+/*
+ * RDMA links don't use bytestreams, so we have to
+ * return bytes to QEMUFile opportunistically.
+ */
+static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileRDMA *r = opaque;
+
+    /*
+     * First, we hold on to the last SEND message we
+     * were given and dish out the bytes until we run
+     * out of bytes.
+     */
+    if ((r->len = qemu_rdma_fill(r->rdma, buf, size)))
+        return r->len;
+
+    /*
+     * Once we run out, we block and wait for another
+     * SEND message to arrive.
+     */
+    if (qemu_rdma_exchange_recv(r->rdma) < 0)
+        return -EINVAL;
+
+    /*
+     * SEND was received with new bytes, now try again.
+     */
+    return qemu_rdma_fill(r->rdma, buf, size);
+}
+
 static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
 {
     QEMUFileSocket *s = opaque;
@@ -390,16 +458,24 @@ static const QEMUFileOps socket_write_ops = {
     .close =      socket_close
 };
 
-QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+static bool qemu_mode_is_not_valid(const char * mode)
 {
-    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
-
     if (mode == NULL ||
         (mode[0] != 'r' && mode[0] != 'w') ||
         mode[1] != 'b' || mode[2] != 0) {
         fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
-        return NULL;
+        return true;
     }
+    
+    return false;
+}
+
+QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+{
+    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
+
+    if (qemu_mode_is_not_valid(mode))
+        return NULL;
 
     s->fd = fd;
     if (mode[0] == 'w') {
@@ -411,16 +487,66 @@ QEMUFile *qemu_fopen_socket(int fd, const char *mode)
     return s->file;
 }
 
+static int qemu_rdma_close(void *opaque)
+{
+    QEMUFileRDMA *r = opaque;
+    if (r->rdma) {
+        qemu_rdma_cleanup(r->rdma);
+        g_free(r->rdma);
+    }
+    g_free(r);
+    return 0;
+}
+
+void *migrate_use_rdma(QEMUFile *f)
+{
+    QEMUFileRDMA *r = f->opaque;
+
+    return qemu_rdma_enabled(r->rdma) ? r->rdma : NULL;
+}
+
+static int qemu_rdma_drain_completion(void *opaque)
+{
+    QEMUFileRDMA *r = opaque;
+    r->len = 0;
+    return qemu_rdma_drain_cq(r->rdma);
+}
+
+static const QEMUFileOps rdma_read_ops = {
+    .get_buffer = qemu_rdma_get_buffer,
+    .close =      qemu_rdma_close,
+};
+
+static const QEMUFileOps rdma_write_ops = {
+    .put_buffer = qemu_rdma_put_buffer,
+    .close =      qemu_rdma_close,
+    .drain =      qemu_rdma_drain_completion,
+};
+
+QEMUFile *qemu_fopen_rdma(void *opaque, const char * mode)
+{
+    QEMUFileRDMA *r = g_malloc0(sizeof(QEMUFileRDMA));
+
+    if (qemu_mode_is_not_valid(mode))
+        return NULL;
+
+    r->rdma = opaque;
+
+    if (mode[0] == 'w') {
+        r->file = qemu_fopen_ops(r, &rdma_write_ops);
+    } else {
+        r->file = qemu_fopen_ops(r, &rdma_read_ops);
+    }
+
+    return r->file;
+}
+
 QEMUFile *qemu_fopen(const char *filename, const char *mode)
 {
     QEMUFileStdio *s;
 
-    if (mode == NULL ||
-	(mode[0] != 'r' && mode[0] != 'w') ||
-	mode[1] != 'b' || mode[2] != 0) {
-        fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
-        return NULL;
-    }
+    if (qemu_mode_is_not_valid(mode))
+        return NULL;
 
     s = g_malloc0(sizeof(QEMUFileStdio));
 
@@ -497,6 +623,24 @@ static void qemu_file_set_error(QEMUFile *f, int ret)
     }
 }
 
+/*
+ * Called only for RDMA right now at the end 
+ * of each live iteration of memory.
+ *
+ * 'drain' from a QEMUFile perspective means
+ * to flush the outbound send buffer
+ * (if one exists). 
+ *
+ * For RDMA, this means to make sure we've
+ * received completion queue (CQ) messages
+ * successfully for all of the RDMA writes
+ * that we requested.
+ */ 
+int qemu_drain(QEMUFile *f)
+{
+    return f->ops->drain ? f->ops->drain(f->opaque) : 0;
+}
+
 /** Flushes QEMUFile buffer
  *
  */
@@ -723,6 +867,8 @@ int qemu_get_byte(QEMUFile *f)
 int64_t qemu_ftell(QEMUFile *f)
 {
     qemu_fflush(f);
+    if (migrate_use_rdma(f))
+        return delta_norm_mig_bytes_transferred();
     return f->pos;
 }
 
@@ -1737,6 +1883,12 @@ void qemu_savevm_state_complete(QEMUFile *f)
         }
     }
 
+    if ((ret = qemu_drain(f)) < 0) {
+        fprintf(stderr, "failed to drain RDMA first!\n");
+        qemu_file_set_error(f, ret);
+        return;
+    }
+
     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         int len;
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA mrhines
@ 2013-03-18  3:19 ` mrhines
  2013-03-18  8:47   ` Paolo Bonzini
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 10/10] send pc.ram over RDMA mrhines
  9 siblings, 1 reply; 73+ messages in thread
From: mrhines @ 2013-03-18  3:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Since we're not using TCP anymore, we skip these calls.

Also print a little extra text while debugging, such as "gbps",
which is helpful for seeing how the link is being utilized.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/migration.h |    3 +++
 migration.c                   |   19 +++++++++++++------
 2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index bb617fd..88ab5f6 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -20,6 +20,7 @@
 #include "qemu/notify.h"
 #include "qapi/error.h"
 #include "migration/vmstate.h"
+#include "migration/rdma.h"
 #include "qapi-types.h"
 
 struct MigrationParams {
@@ -102,6 +103,7 @@ uint64_t xbzrle_mig_bytes_transferred(void);
 uint64_t xbzrle_mig_pages_transferred(void);
 uint64_t xbzrle_mig_pages_overflow(void);
 uint64_t xbzrle_mig_pages_cache_miss(void);
+uint64_t delta_norm_mig_bytes_transferred(void);
 
 /**
  * @migrate_add_blocker - prevent migration from proceeding
@@ -122,6 +124,7 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
 int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
 
 int migrate_use_xbzrle(void);
+void *migrate_use_rdma(QEMUFile *f);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
diff --git a/migration.c b/migration.c
index 185d112..634437a 100644
--- a/migration.c
+++ b/migration.c
@@ -15,6 +15,7 @@
 
 #include "qemu-common.h"
 #include "migration/migration.h"
+#include "migration/rdma.h"
 #include "monitor/monitor.h"
 #include "migration/qemu-file.h"
 #include "sysemu/sysemu.h"
@@ -77,6 +78,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
 
     if (strstart(uri, "tcp:", &p))
         tcp_start_incoming_migration(p, errp);
+    else if (strstart(uri, "rdma:", &p))
+        rdma_start_incoming_migration(p, errp);
 #if !defined(WIN32)
     else if (strstart(uri, "exec:", &p))
         exec_start_incoming_migration(p, errp);
@@ -118,10 +121,11 @@ static void process_incoming_migration_co(void *opaque)
 void process_incoming_migration(QEMUFile *f)
 {
     Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
-    int fd = qemu_get_fd(f);
-
-    assert(fd != -1);
-    socket_set_nonblock(fd);
+    if(!migrate_use_rdma(f)) {
+        int fd = qemu_get_fd(f);
+        assert(fd != -1);
+        socket_set_nonblock(fd);
+    }
     qemu_coroutine_enter(co, f);
 }
 
@@ -404,6 +408,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
     if (strstart(uri, "tcp:", &p)) {
         tcp_start_outgoing_migration(s, p, &local_err);
+    } else if (strstart(uri, "rdma:", &p)) {
+        rdma_start_outgoing_migration(s, p, &local_err);
 #if !defined(WIN32)
     } else if (strstart(uri, "exec:", &p)) {
         exec_start_outgoing_migration(s, p, &local_err);
@@ -545,8 +551,9 @@ static void *migration_thread(void *opaque)
             max_size = bandwidth * migrate_max_downtime() / 1000000;
 
             DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
-                    " bandwidth %g max_size %" PRId64 "\n",
-                    transferred_bytes, time_spent, bandwidth, max_size);
+                    " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
+                    transferred_bytes, time_spent, 
+                    bandwidth, Gbps(transferred_bytes, time_spent), max_size);
             /* if we haven't sent anything, we don't want to recalculate
                10000 is a small enough number for our purposes */
             if (s->dirty_bytes_rate && transferred_bytes > 10000) {
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v4: 10/10] send pc.ram over RDMA
  2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls mrhines
@ 2013-03-18  3:19 ` mrhines
  9 siblings, 0 replies; 73+ messages in thread
From: mrhines @ 2013-03-18  3:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c |   28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch_init.c b/arch_init.c
index 98e2bc6..b013cc8 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -45,6 +45,7 @@
 #include "exec/address-spaces.h"
 #include "hw/pcspk.h"
 #include "migration/page_cache.h"
+#include "migration/rdma.h"
 #include "qemu/config-file.h"
 #include "qmp-commands.h"
 #include "trace.h"
@@ -225,6 +226,18 @@ static void acct_clear(void)
     memset(&acct_info, 0, sizeof(acct_info));
 }
 
+/*
+ * RDMA pc.ram doesn't go through QEMUFile directly,
+ * but still needs to be accounted for...
+ */
+uint64_t delta_norm_mig_bytes_transferred(void)
+{
+    static uint64_t last_norm_pages = 0;
+    uint64_t delta_bytes = (acct_info.norm_pages - last_norm_pages) * TARGET_PAGE_SIZE;
+    last_norm_pages = acct_info.norm_pages; 
+    return delta_bytes;
+}
+
 uint64_t dup_mig_bytes_transferred(void)
 {
     return acct_info.dup_pages * TARGET_PAGE_SIZE;
@@ -463,7 +476,11 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
 
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_dup_page(p)) {
+            if (migrate_use_rdma(f)) {
+                /* for now, mapping the page is slower than RDMA */
+                acct_info.norm_pages++;
+                bytes_sent = save_rdma_page(f, block->offset, offset, cont, TARGET_PAGE_SIZE);
+            } else if (is_dup_page(p)) {
                 acct_info.dup_pages++;
                 bytes_sent = save_block_hdr(f, block, offset, cont,
                                             RAM_SAVE_FLAG_COMPRESS);
@@ -648,6 +665,15 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 
     qemu_mutex_unlock_ramlist();
 
+    /*
+     * Don't go to the next iteration without
+     * ensuring RDMA transfers have completed.
+     */
+    if ((ret = qemu_drain(f)) < 0) {
+        fprintf(stderr, "failed to drain RDMA first!\n");
+        return ret;
+    }
+
     if (ret < 0) {
         bytes_transferred += total_sent;
         return ret;
-- 
1.7.10.4


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls mrhines
@ 2013-03-18  8:47   ` Paolo Bonzini
  2013-03-18 20:37     ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-18  8:47 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 18/03/2013 04:19, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Since we're not using TCP anymore, we skip these calls.
> 
> Also print a little extra text while debugging, such as "gbps",
> which is helpful for seeing how the link is being utilized.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  include/migration/migration.h |    3 +++
>  migration.c                   |   19 +++++++++++++------
>  2 files changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index bb617fd..88ab5f6 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -20,6 +20,7 @@
>  #include "qemu/notify.h"
>  #include "qapi/error.h"
>  #include "migration/vmstate.h"
> +#include "migration/rdma.h"
>  #include "qapi-types.h"
>  
>  struct MigrationParams {
> @@ -102,6 +103,7 @@ uint64_t xbzrle_mig_bytes_transferred(void);
>  uint64_t xbzrle_mig_pages_transferred(void);
>  uint64_t xbzrle_mig_pages_overflow(void);
>  uint64_t xbzrle_mig_pages_cache_miss(void);
> +uint64_t delta_norm_mig_bytes_transferred(void);

Please add the protocol under the
>  
>  /**
>   * @migrate_add_blocker - prevent migration from proceeding
> @@ -122,6 +124,7 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
>  int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
>  
>  int migrate_use_xbzrle(void);
> +void *migrate_use_rdma(QEMUFile *f);

Perhaps you can add a new send_page function to QEMUFile?  If it
returns -ENOTSUP, proceed with the normal is_dup_page + put_buffer
path.  I wonder if that would let us remove migrate_use_rdma()
completely.
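
[Editor's sketch of the fallback being proposed. The types and helper
bodies below are hypothetical, simplified stand-ins, not the actual
definitions from qemu-file.h or arch_init.c: a transport that
implements send_page handles the page itself, and everything else
falls through to the generic dup-page/put_buffer path.]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the QEMU types under
 * discussion -- not the real definitions from qemu-file.h. */
typedef struct QEMUFileOps {
    /* Returns bytes handled, or -ENOTSUP if this transport has no
     * special page path (e.g. plain TCP). */
    int (*send_page)(void *opaque, const unsigned char *page, size_t len);
} QEMUFileOps;

typedef struct QEMUFile {
    const QEMUFileOps *ops;
    void *opaque;
    size_t buffered;              /* bytes written via put_buffer */
} QEMUFile;

static int is_dup_page(const unsigned char *page, size_t len)
{
    size_t i;
    for (i = 1; i < len; i++) {
        if (page[i] != page[0]) {
            return 0;
        }
    }
    return 1;
}

static void put_buffer(QEMUFile *f, const unsigned char *buf, size_t len)
{
    (void)buf;
    f->buffered += len;           /* stand-in for the bytestream write */
}

/* A transport that does implement send_page (stand-in for RDMA). */
static int rdma_send_page(void *opaque, const unsigned char *page, size_t len)
{
    (void)opaque; (void)page;
    return (int)len;              /* pretend the write was queued */
}

/* The dispatch Paolo suggests: try the transport's page hook first and
 * fall back to the generic dup-page/put_buffer path on -ENOTSUP. */
static int ram_save_page(QEMUFile *f, const unsigned char *page, size_t len)
{
    if (f->ops->send_page) {
        int ret = f->ops->send_page(f->opaque, page, len);
        if (ret != -ENOTSUP) {
            return ret;           /* the transport handled (or failed) it */
        }
    }
    if (is_dup_page(page, len)) {
        return 0;                 /* would go out as a compressed header */
    }
    put_buffer(f, page, len);
    return (int)len;
}
```

[With this shape, arch_init.c never needs to ask which transport it is
talking to; the ops table answers -ENOTSUP instead.]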

Also, if QEMUFileRDMA is moved to rdma.c, the number of public and
stubbed functions should decrease noticeably.  There is a patch on the
list to move QEMUFile to its own source file.  You could incorporate it
in your series.

>  int64_t migrate_xbzrle_cache_size(void);
>  
>  int64_t xbzrle_cache_resize(int64_t new_size);
> diff --git a/migration.c b/migration.c
> index 185d112..634437a 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -15,6 +15,7 @@
>  
>  #include "qemu-common.h"
>  #include "migration/migration.h"
> +#include "migration/rdma.h"
>  #include "monitor/monitor.h"
>  #include "migration/qemu-file.h"
>  #include "sysemu/sysemu.h"
> @@ -77,6 +78,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
>  
>      if (strstart(uri, "tcp:", &p))
>          tcp_start_incoming_migration(p, errp);
> +    else if (strstart(uri, "rdma:", &p))
> +        rdma_start_incoming_migration(p, errp);
>  #if !defined(WIN32)
>      else if (strstart(uri, "exec:", &p))
>          exec_start_incoming_migration(p, errp);
> @@ -118,10 +121,11 @@ static void process_incoming_migration_co(void *opaque)
>  void process_incoming_migration(QEMUFile *f)
>  {
>      Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
> -    int fd = qemu_get_fd(f);
> -
> -    assert(fd != -1);
> -    socket_set_nonblock(fd);
> +    if(!migrate_use_rdma(f)) {
> +        int fd = qemu_get_fd(f);
> +        assert(fd != -1);
> +        socket_set_nonblock(fd);

Is this because qemu_get_fd(f) returns -1 for RDMA?  Then, you can
instead put socket_set_nonblock under an if(fd != -1).

> +    }
>      qemu_coroutine_enter(co, f);
>  }
>  
> @@ -404,6 +408,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>  
>      if (strstart(uri, "tcp:", &p)) {
>          tcp_start_outgoing_migration(s, p, &local_err);
> +    } else if (strstart(uri, "rdma:", &p)) {
> +        rdma_start_outgoing_migration(s, p, &local_err);
>  #if !defined(WIN32)
>      } else if (strstart(uri, "exec:", &p)) {
>          exec_start_outgoing_migration(s, p, &local_err);
> @@ -545,8 +551,9 @@ static void *migration_thread(void *opaque)
>              max_size = bandwidth * migrate_max_downtime() / 1000000;
>  
>              DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
> -                    " bandwidth %g max_size %" PRId64 "\n",
> -                    transferred_bytes, time_spent, bandwidth, max_size);
> +                    " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
> +                    transferred_bytes, time_spent, 
> +                    bandwidth, Gbps(transferred_bytes, time_spent), max_size);
>              /* if we haven't sent anything, we don't want to recalculate
>                 10000 is a small enough number for our purposes */
>              if (s->dirty_bytes_rate && transferred_bytes > 10000) {
> 

Otherwise looks good.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks mrhines
@ 2013-03-18  8:48   ` Paolo Bonzini
  2013-03-18 20:25     ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-18  8:48 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 18/03/2013 04:18, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This introduces:
> 1. qemu_ram_foreach_block
> 2. qemu_ram_count_blocks
> 
> Both used in communicating the RAMBlocks
> to each side for later memory registration.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  exec.c                    |   21 +++++++++++++++++++++
>  include/exec/cpu-common.h |    6 ++++++
>  2 files changed, 27 insertions(+)
> 
> diff --git a/exec.c b/exec.c
> index 8a6aac3..a985da8 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -2629,3 +2629,24 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
>               memory_region_is_romd(section->mr));
>  }
>  #endif
> +
> +void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
> +{
> +    RAMBlock *block;
> +
> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> +        func(block->host, block->offset, block->length, opaque);
> +    }
> +}
> +
> +int qemu_ram_count_blocks(void)
> +{
> +    RAMBlock *block;
> +    int total = 0;
> +
> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
> +        total++;
> +    }

Please move this to rdma.c, and implement it using qemu_ram_foreach_block.

Otherwise looks good.

Paolo
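
[Editor's sketch of what is being asked for, using stand-in types and
a fake three-block qemu_ram_foreach_block so the example is
self-contained -- the real iterator walks ram_list.blocks in exec.c.
The counter becomes a trivial callback, so the helper can live in
rdma.c without direct access to the RAM block list.]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: the real RAMBlock list lives in exec.c and the
 * iterator signature comes from the patch; the three fake blocks here
 * exist only so the sketch is self-contained. */
typedef unsigned long ram_addr_t;

typedef void (RAMBlockIterFunc)(void *host_addr, ram_addr_t offset,
                                ram_addr_t length, void *opaque);

static char block_a[4], block_b[4], block_c[4];

static void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
{
    func(block_a, 0, sizeof(block_a), opaque);
    func(block_b, 4, sizeof(block_b), opaque);
    func(block_c, 8, sizeof(block_c), opaque);
}

static void count_cb(void *host_addr, ram_addr_t offset,
                     ram_addr_t length, void *opaque)
{
    (void)host_addr; (void)offset; (void)length;
    (*(int *)opaque) += 1;
}

/* What the counting helper could look like once it moves into rdma.c,
 * built on top of qemu_ram_foreach_block as requested. */
static int qemu_rdma_count_ram_blocks(void)
{
    int total = 0;
    qemu_ram_foreach_block(count_cb, &total);
    return total;
}
```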

> +    return total;
> +}
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 2e5f11f..aea3fe0 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -119,6 +119,12 @@ extern struct MemoryRegion io_mem_rom;
>  extern struct MemoryRegion io_mem_unassigned;
>  extern struct MemoryRegion io_mem_notdirty;
>  
> +typedef void  (RAMBlockIterFunc)(void *host_addr, 
> +    ram_addr_t offset, ram_addr_t length, void *opaque); 
> +
> +void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
> +int qemu_ram_count_blocks(void);
> +
>  #endif
>  
>  #endif /* !CPU_COMMON_H */
> 


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA mrhines
@ 2013-03-18  8:56   ` Paolo Bonzini
  2013-03-18 20:26     ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-18  8:56 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 18/03/2013 04:19, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  migration-rdma.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 205 insertions(+)
>  create mode 100644 migration-rdma.c
> 
> diff --git a/migration-rdma.c b/migration-rdma.c
> new file mode 100644
> index 0000000..e1ea055
> --- /dev/null
> +++ b/migration-rdma.c
> @@ -0,0 +1,205 @@
> +/*
> + *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
> + *  Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; under version 2 of the License.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +#include "migration/rdma.h"
> +#include "qemu-common.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <netdb.h>
> +#include <arpa/inet.h>
> +#include <string.h>
> +
> +//#define DEBUG_MIGRATION_RDMA
> +
> +#ifdef DEBUG_MIGRATION_RDMA
> +#define DPRINTF(fmt, ...) \
> +    do { printf("migration-rdma: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DPRINTF(fmt, ...) \
> +    do { } while (0)
> +#endif
> +
> +static int rdma_accept_incoming_migration(RDMAData *rdma, Error **errp)
> +{
> +    int ret;
> +
> +    ret = qemu_rdma_migrate_listen(rdma, rdma->host, rdma->port);
> +    if (ret) {
> +        qemu_rdma_print("rdma migration: error listening!");
> +        goto err_rdma_server_wait;
> +    }
> +
> +    ret = qemu_rdma_alloc_qp(&rdma->rdma_ctx);
> +    if (ret) {
> +        qemu_rdma_print("rdma migration: error allocating qp!");
> +        goto err_rdma_server_wait;
> +    }
> +
> +    ret = qemu_rdma_migrate_accept(&rdma->rdma_ctx, NULL, NULL, NULL, 0);
> +    if (ret) {
> +        qemu_rdma_print("rdma migration: error accepting connection!");
> +        goto err_rdma_server_wait;
> +    }
> +
> +    ret = qemu_rdma_post_recv_qemu_file(rdma);
> +    if (ret) {
> +        qemu_rdma_print("rdma migration: error posting second qemu file recv!");
> +        goto err_rdma_server_wait;
> +    }
> +
> +    ret = qemu_rdma_post_send_remote_info(rdma);
> +    if (ret) {
> +        qemu_rdma_print("rdma migration: error sending remote info!");
> +        goto err_rdma_server_wait;
> +    }
> +
> +    ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_SEND_REMOTE_INFO);
> +    if (ret < 0) {
> +        qemu_rdma_print("rdma migration: polling remote info error!");
> +        goto err_rdma_server_wait;
> +    }

In a "socket-like" abstraction, all of these steps except the initial
listen are part of "accept".  Please move them to
qemu_rdma_migrate_accept (possibly renaming the existing
qemu_rdma_migrate_accept to a different name).

Similarly, perhaps you can merge qemu_rdma_server_prepare and
qemu_rdma_migrate_listen.

Try to make the public API between modules as small as possible (but not
smaller :)), so that you can easily document it without too many
references to RDMA concepts.
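
[Editor's sketch of the consolidation: a single accept() entry point
that hides every post-listen step. The step bodies below are stubs
that only record their order; in the patch they would be the
qemu_rdma_alloc_qp / migrate_accept / post_recv / send_remote_info /
wait_for_wrid calls.]

```c
#include <assert.h>
#include <string.h>

/* Stand-in state: the trace records which steps ran, in order. */
typedef struct RDMAState {
    char trace[128];
} RDMAState;

static int step(RDMAState *s, const char *name)
{
    strcat(s->trace, name);
    strcat(s->trace, ";");
    return 0;                     /* 0 = success, matching the patch */
}

static int rdma_listen(RDMAState *s)
{
    return step(s, "listen");     /* stays a separate, public call */
}

/* One public accept() hiding qp allocation, the low-level CM accept,
 * posting the first receive, and the remote-info handshake. */
static int rdma_accept(RDMAState *s)
{
    int ret;
    if ((ret = step(s, "alloc_qp")) < 0)  return ret;
    if ((ret = step(s, "cm_accept")) < 0) return ret;
    if ((ret = step(s, "post_recv")) < 0) return ret;
    if ((ret = step(s, "send_info")) < 0) return ret;
    if ((ret = step(s, "wait_info")) < 0) return ret;
    return 0;
}
```

[Callers then only ever see listen + accept, exactly like the socket
transports, and the RDMA vocabulary stays inside one module.]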

Thanks,

Paolo

> +    rdma->total_bytes = 0;
> +    rdma->enabled = 1;
> +    qemu_rdma_dump_gid("server_connect", rdma->rdma_ctx.cm_id);
> +    return 0;
> +
> +err_rdma_server_wait:
> +    qemu_rdma_cleanup(rdma);
> +    return -1;
> +
> +}
> +
> +int rdma_start_incoming_migration(const char * host_port, Error **errp)
> +{
> +    RDMAData *rdma = g_malloc0(sizeof(RDMAData));
> +    QEMUFile *f;
> +    int ret;
> +
> +    if ((ret = qemu_rdma_data_init(rdma, host_port, errp)) < 0)
> +        return ret; 
> +
> +    ret = qemu_rdma_server_init(rdma, NULL);
> +
> +    DPRINTF("Starting RDMA-based incoming migration\n");
> +
> +    if (!ret) {
> +        DPRINTF("qemu_rdma_server_init success\n");
> +        ret = qemu_rdma_server_prepare(rdma, NULL);
> +
> +        if (!ret) {
> +            DPRINTF("qemu_rdma_server_prepare success\n");
> +
> +            ret = rdma_accept_incoming_migration(rdma, NULL);
> +            if(!ret)
> +                DPRINTF("qemu_rdma_accept_incoming_migration success\n");
> +            f = qemu_fopen_rdma(rdma, "rb");
> +            if (f == NULL) {
> +                fprintf(stderr, "could not qemu_fopen RDMA\n");
> +                ret = -EIO;
> +            }
> +
> +            process_incoming_migration(f);
> +        }
> +    }
> +
> +    return ret;
> +}
> +
> +void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp)
> +{
> +    RDMAData *rdma = g_malloc0(sizeof(RDMAData));
> +    MigrationState *s = opaque;
> +    int ret;
> +
> +    if (qemu_rdma_data_init(rdma, host_port, errp) < 0)
> +        return; 
> +
> +    ret = qemu_rdma_client_init(rdma, NULL);
> +    if(!ret) {
> +        DPRINTF("qemu_rdma_client_init success\n");
> +        ret = qemu_rdma_client_connect(rdma, NULL);
> +
> +        if(!ret) {
> +            s->file = qemu_fopen_rdma(rdma, "wb");
> +            DPRINTF("qemu_rdma_client_connect success\n");
> +            migrate_fd_connect(s);
> +            return;
> +        }
> +    }
> +
> +    migrate_fd_error(s);
> +}
> +
> +size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, ram_addr_t offset, int cont, size_t size)
> +{
> +    int ret;
> +    size_t bytes_sent = 0;
> +    ram_addr_t current_addr;
> +    RDMAData * rdma = migrate_use_rdma(f);
> +
> +    current_addr = block_offset + offset;
> +
> +    /*
> +     * Add this page to the current 'chunk'. If the chunk
> +     * is full, an actual RDMA write will occur.
> +     */
> +    if ((ret = qemu_rdma_write(rdma, current_addr, size)) < 0) {
> +        fprintf(stderr, "rdma migration: write error! %d\n", ret);
> +        return ret;
> +    }
> +
> +    /*
> +     * Drain the Completion Queue if possible.
> +     * If not, the end of the iteration will do this
> +     * again to make sure we don't overflow the
> +     * request queue. 
> +     */
> +    while (1) {
> +        int ret = qemu_rdma_poll(rdma);
> +        if (ret == RDMA_WRID_NONE) {
> +            break;
> +        }
> +        if (ret < 0) {
> +            fprintf(stderr, "rdma migration: polling error! %d\n", ret);
> +            return ret;
> +        }
> +    }
> +
> +    bytes_sent += size;
> +    return bytes_sent;
> +}
> +
> +size_t qemu_rdma_fill(void * opaque, uint8_t *buf, int size)
> +{
> +    RDMAData * rdma = opaque;
> +    size_t len = 0;
> +
> +    if(rdma->qemu_file_len) {
> +        DPRINTF("RDMA %" PRId64 " of %d bytes already in buffer\n",
> +	    rdma->qemu_file_len, size);
> +
> +        len = MIN(size, rdma->qemu_file_len);
> +        memcpy(buf, rdma->qemu_file_curr, len);
> +        rdma->qemu_file_curr += len;
> +        rdma->qemu_file_len -= len;
> +    }
> +
> +    return len;
> +}
> 


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA mrhines
@ 2013-03-18  9:09   ` Paolo Bonzini
  2013-03-18 20:33     ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-18  9:09 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 18/03/2013 04:19, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This compiles with and without --enable-rdma.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  include/migration/qemu-file.h |   10 +++
>  savevm.c                      |  172 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 172 insertions(+), 10 deletions(-)
> 
> diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
> index df81261..9046751 100644
> --- a/include/migration/qemu-file.h
> +++ b/include/migration/qemu-file.h
> @@ -51,23 +51,33 @@ typedef int (QEMUFileCloseFunc)(void *opaque);
>   */
>  typedef int (QEMUFileGetFD)(void *opaque);
>  
> +/* 
> + * 'drain' from a QEMUFile perspective means
> + * to flush the outbound send buffer
> + * (if one exists). (Only used by RDMA right now)
> + */
> +typedef int (QEMUFileDrainFunc)(void *opaque);
> +
>  typedef struct QEMUFileOps {
>      QEMUFilePutBufferFunc *put_buffer;
>      QEMUFileGetBufferFunc *get_buffer;
>      QEMUFileCloseFunc *close;
>      QEMUFileGetFD *get_fd;
> +    QEMUFileDrainFunc *drain;
>  } QEMUFileOps;
>  
>  QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
>  QEMUFile *qemu_fopen(const char *filename, const char *mode);
>  QEMUFile *qemu_fdopen(int fd, const char *mode);
>  QEMUFile *qemu_fopen_socket(int fd, const char *mode);
> +QEMUFile *qemu_fopen_rdma(void *opaque, const char *mode);
>  QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
>  int qemu_get_fd(QEMUFile *f);
>  int qemu_fclose(QEMUFile *f);
>  int64_t qemu_ftell(QEMUFile *f);
>  void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
>  void qemu_put_byte(QEMUFile *f, int v);
> +int qemu_drain(QEMUFile *f);
>  
>  static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
>  {
> diff --git a/savevm.c b/savevm.c
> index 35c8d1e..9b90b7f 100644
> --- a/savevm.c
> +++ b/savevm.c
> @@ -32,6 +32,7 @@
>  #include "qemu/timer.h"
>  #include "audio/audio.h"
>  #include "migration/migration.h"
> +#include "migration/rdma.h"
>  #include "qemu/sockets.h"
>  #include "qemu/queue.h"
>  #include "sysemu/cpus.h"
> @@ -143,6 +144,13 @@ typedef struct QEMUFileSocket
>      QEMUFile *file;
>  } QEMUFileSocket;
>  
> +typedef struct QEMUFileRDMA
> +{
> +    void *rdma;

This is an RDMAData *.  Please avoid using void * as much as possible.

> +    size_t len;
> +    QEMUFile *file;
> +} QEMUFileRDMA;
> +
>  typedef struct {
>      Coroutine *co;
>      int fd;
> @@ -178,6 +186,66 @@ static int socket_get_fd(void *opaque)
>      return s->fd;
>  }
>  
> +/*
> + * SEND messages for non-live state only.
> + * pc.ram is handled elsewhere...
> + */
> +static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
> +{
> +    QEMUFileRDMA *r = opaque;
> +    size_t remaining = size;
> +    uint8_t * data = (void *) buf;
> +
> +    /*
> +     * Although we're sending non-live
> +     * state here, push out any writes that
> +     * we've queued up for pc.ram anyway.
> +     */
> +    if (qemu_rdma_write_flush(r->rdma) < 0)
> +        return -EIO;
> +
> +    while(remaining) {
> +        r->len = MIN(remaining, RDMA_SEND_INCREMENT);
> +        remaining -= r->len;
> +
> +        if(qemu_rdma_exchange_send(r->rdma, data, r->len) < 0)
> +                return -EINVAL;
> +
> +        data += r->len;
> +    }
> +
> +    return size;
> +} 
> +
> +/*
> + * RDMA links don't use bytestreams, so we have to
> + * return bytes to QEMUFile opportunistically.
> + */
> +static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
> +{
> +    QEMUFileRDMA *r = opaque;
> +
> +    /*
> +     * First, we hold on to the last SEND message we 
> +     * were given and dish out the bytes until we run 
> +     * out of bytes.
> +     */
> +    if((r->len = qemu_rdma_fill(r->rdma, buf, size)))
> +	return r->len; 
> +
> +     /*
> +      * Once we run out, we block and wait for another
> +      * SEND message to arrive.
> +      */
> +    if(qemu_rdma_exchange_recv(r->rdma) < 0)
> +	return -EINVAL;
> +
> +    /*
> +     * SEND was received with new bytes, now try again.
> +     */
> +    return qemu_rdma_fill(r->rdma, buf, size);
> +} 

Please move these functions closer to qemu_fopen_rdma (or better, to an
RDMA-specific file altogether).  Also, using qemu_rdma_fill introduces a
dependency of savevm.c on migration-rdma.c.  There should be no such
dependency; migration-rdma.c should be used only by migration.c.

>  static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
>  {
>      QEMUFileSocket *s = opaque;
> @@ -390,16 +458,24 @@ static const QEMUFileOps socket_write_ops = {
>      .close =      socket_close
>  };
>  
> -QEMUFile *qemu_fopen_socket(int fd, const char *mode)
> +static bool qemu_mode_is_not_valid(const char * mode)
>  {
> -    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
> -
>      if (mode == NULL ||
>          (mode[0] != 'r' && mode[0] != 'w') ||
>          mode[1] != 'b' || mode[2] != 0) {
>          fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
> -        return NULL;
> +        return true;
>      }
> +    
> +    return false;
> +}
> +
> +QEMUFile *qemu_fopen_socket(int fd, const char *mode)
> +{
> +    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
> +
> +    if(qemu_mode_is_not_valid(mode))
> +	return NULL;
>  
>      s->fd = fd;
>      if (mode[0] == 'w') {
> @@ -411,16 +487,66 @@ QEMUFile *qemu_fopen_socket(int fd, const char *mode)
>      return s->file;
>  }
>  
> +static int qemu_rdma_close(void *opaque)
> +{
> +    QEMUFileRDMA *r = opaque;
> +    if(r->rdma) {
> +        qemu_rdma_cleanup(r->rdma);
> +        g_free(r->rdma);
> +    }
> +    g_free(r);
> +    return 0;
> +}
> +
> +void * migrate_use_rdma(QEMUFile *f)
> +{
> +    QEMUFileRDMA *r = f->opaque;
> +
> +    return qemu_rdma_enabled(r->rdma) ? r->rdma : NULL;

You cannot be sure that f->opaque really points to a QEMUFileRDMA.
For example, the first field of a socket QEMUFile is a file descriptor.

Instead, you could use a qemu_file_ops_are(const QEMUFile *, const
QEMUFileOps *) function that checks if the file uses the given ops.
Then, migrate_use_rdma can simply check if the QEMUFile is using the
RDMA ops structure.

With this change, the "enabled" field of RDMAData should go.
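
[Editor's sketch of the proposed qemu_file_ops_are() helper -- an
assumed name, not existing QEMU API -- with simplified stand-in types:
identity of the ops table, rather than the contents of opaque, decides
whether the file is an RDMA file.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; the real QEMUFileOps holds function
 * pointers. */
typedef struct QEMUFileOps {
    int dummy;
} QEMUFileOps;

typedef struct QEMUFile {
    const QEMUFileOps *ops;
    void *opaque;
} QEMUFile;

static const QEMUFileOps rdma_write_ops = { 0 };
static const QEMUFileOps socket_write_ops = { 0 };

static bool qemu_file_ops_are(const QEMUFile *f, const QEMUFileOps *ops)
{
    return f->ops == ops;         /* identity of the table, not f->opaque */
}

/* migrate_use_rdma() no longer trusts the layout behind f->opaque: it
 * first proves the file really is an RDMA file. */
static void *migrate_use_rdma(QEMUFile *f)
{
    return qemu_file_ops_are(f, &rdma_write_ops) ? f->opaque : NULL;
}
```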

> +}
> +
> +static int qemu_rdma_drain_completion(void *opaque)
> +{
> +    QEMUFileRDMA *r = opaque;
> +    r->len = 0;
> +    return qemu_rdma_drain_cq(r->rdma);
> +}
> +
> +static const QEMUFileOps rdma_read_ops = {
> +    .get_buffer = qemu_rdma_get_buffer,
> +    .close =      qemu_rdma_close,
> +};
> +
> +static const QEMUFileOps rdma_write_ops = {
> +    .put_buffer = qemu_rdma_put_buffer,
> +    .close =      qemu_rdma_close,
> +    .drain =	  qemu_rdma_drain_completion,
> +};
> +
> +QEMUFile *qemu_fopen_rdma(void *opaque, const char * mode)
> +{
> +    QEMUFileRDMA *r = g_malloc0(sizeof(QEMUFileRDMA));
> +
> +    if(qemu_mode_is_not_valid(mode))
> +	return NULL;
> +
> +    r->rdma = opaque;
> +
> +    if (mode[0] == 'w') {
> +        r->file = qemu_fopen_ops(r, &rdma_write_ops);
> +    } else {
> +        r->file = qemu_fopen_ops(r, &rdma_read_ops);
> +    }
> +
> +    return r->file;
> +}
> +
>  QEMUFile *qemu_fopen(const char *filename, const char *mode)
>  {
>      QEMUFileStdio *s;
>  
> -    if (mode == NULL ||
> -	(mode[0] != 'r' && mode[0] != 'w') ||
> -	mode[1] != 'b' || mode[2] != 0) {
> -        fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
> -        return NULL;
> -    }
> +    if(qemu_mode_is_not_valid(mode))
> +	return NULL;
>  
>      s = g_malloc0(sizeof(QEMUFileStdio));
>  
> @@ -497,6 +623,24 @@ static void qemu_file_set_error(QEMUFile *f, int ret)
>      }
>  }
>  
> +/*
> + * Called only for RDMA right now at the end 
> + * of each live iteration of memory.
> + *
> + * 'drain' from a QEMUFile perspective means
> + * to flush the outbound send buffer
> + * (if one exists). 
> + *
> + * For RDMA, this means to make sure we've
> + * received completion queue (CQ) messages
> + * successfully for all of the RDMA writes
> + * that we requested.
> + */ 
> +int qemu_drain(QEMUFile *f)
> +{
> +    return f->ops->drain ? f->ops->drain(f->opaque) : 0;
> +}

Hmm, this is very similar to qemu_fflush, but not quite. :/

Why exactly is this needed?

>  /** Flushes QEMUFile buffer
>   *
>   */
> @@ -723,6 +867,8 @@ int qemu_get_byte(QEMUFile *f)
>  int64_t qemu_ftell(QEMUFile *f)
>  {
>      qemu_fflush(f);
> +    if(migrate_use_rdma(f))
> +	return delta_norm_mig_bytes_transferred();

Not needed, and another undesirable dependency (savevm.c ->
arch_init.c).  Just update f->pos in save_rdma_page.
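
[Editor's sketch of the suggested alternative, with a heavily
simplified QEMUFile: save_rdma_page() accounts its own bytes into
f->pos, so the generic qemu_ftell() needs no RDMA special case.]

```c
#include <assert.h>
#include <stddef.h>

/* Heavily simplified QEMUFile: only the position counter matters for
 * this sketch. */
typedef struct QEMUFile {
    long long pos;                /* bytes accounted so far */
} QEMUFile;

static size_t save_rdma_page(QEMUFile *f, size_t size)
{
    /* ... the RDMA write for the page would be queued here ... */
    f->pos += (long long)size;    /* account the transfer on the file */
    return size;
}

static long long qemu_ftell(QEMUFile *f)
{
    return f->pos;                /* no migrate_use_rdma() branch needed */
}
```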

This is taking shape.  Thanks for persevering!

Paolo

>      return f->pos;
>  }
>  
> @@ -1737,6 +1883,12 @@ void qemu_savevm_state_complete(QEMUFile *f)
>          }
>      }
>  
> +    if ((ret = qemu_drain(f)) < 0) {
> +	fprintf(stderr, "failed to drain RDMA first!\n");
> +        qemu_file_set_error(f, ret);
> +	return;
> +    }
> +
>      QTAILQ_FOREACH(se, &savevm_handlers, entry) {
>          int len;
>  
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport mrhines
@ 2013-03-18 10:40   ` Michael S. Tsirkin
  2013-03-18 20:24     ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-18 10:40 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Sun, Mar 17, 2013 at 11:18:56PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This tries to cover all the questions I got the last time.
> 
> Please do tell me what is not clear, and I'll revise again.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |  208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 208 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..2a48ab0
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,208 @@
> +Changes since v3:
> +
> +- Compile-tested with and without --enable-rdma is working.
> +- Updated docs/rdma.txt (included below)
> +- Merged with latest pull queue from Paolo
> +- Implemented qemu_ram_foreach_block()
> +
> +mrhines@mrhinesdev:~/qemu$ git diff --stat master
> +Makefile.objs                 |    1 +
> +arch_init.c                   |   28 +-
> +configure                     |   25 ++
> +docs/rdma.txt                 |  190 +++++++++++
> +exec.c                        |   21 ++
> +include/exec/cpu-common.h     |    6 +
> +include/migration/migration.h |    3 +
> +include/migration/qemu-file.h |   10 +
> +include/migration/rdma.h      |  269 ++++++++++++++++
> +include/qemu/sockets.h        |    1 +
> +migration-rdma.c              |  205 ++++++++++++
> +migration.c                   |   19 +-
> +rdma.c                        | 1511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +savevm.c                      |  172 +++++++++-
> +util/qemu-sockets.c           |    2 +-
> +15 files changed, 2445 insertions(+), 18 deletions(-)


Above looks strange :)

> +QEMUFileRDMA:

I think there are two things here, API documentation
and protocol documentation; the protocol documentation
still needs some more work. Also, if what I understand
from this document is correct, this breaks memory overcommit
on the destination, which needs to be fixed.


> +==================================
> +
> +QEMUFileRDMA introduces a couple of new functions:
> +
> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +
> +These two functions provide an RDMA transport
> +(not a protocol) without changing the upper-level
> +users of QEMUFile that depend on a bytestream abstraction.
> +
> +In order to provide the same bytestream interface 
> +for RDMA, we use SEND messages instead of sockets.
> +The operations themselves and the protocol built on 
> +top of QEMUFile used throughout the migration 
> +process do not change whatsoever.
> +
> +An infiniband SEND message is the standard ibverbs
> +message used by applications of infiniband hardware.
> +The only difference between a SEND message and an RDMA
> +message is that SEND messages cause completion notifications
> +to be posted to the completion queue (CQ) on the
> +infiniband receiver side, whereas RDMA messages (used
> +for pc.ram) do not (to behave like an actual DMA).
> +    
> +Messages in infiniband require two things:
> +
> +1. registration of the memory that will be transmitted
> +2. (SEND only) work requests to be posted on both
> +   sides of the network before the actual transmission
> +   can occur.
> +
> +RDMA messages are much easier to deal with. Once the memory
> +on the receiver side is registered and pinned, we're
> +basically done. All that is required is for the sender
> +side to start dumping bytes onto the link.
> +
> +SEND messages require more coordination because the
> +receiver must have reserved space (using a receive
> +work request) on the receive queue (RQ) before QEMUFileRDMA
> +can start using them to carry all the bytes as
> +a transport for migration of device state.
> +
> +After the initial connection setup (migration-rdma.c),

Is there any feature and/or version negotiation? How are we going to
handle compatibility when we extend the protocol?

> +this coordination starts by having both sides post
> +a single work request to the RQ before any users
> +of QEMUFile are activated.

So how does the destination know it's ok to send anything
to the source?
I suspect this is wrong. When using CM you must post
on the RQ before completing the connection negotiation,
not after it's done.

> +
> +Once an initial receive work request is posted,
> +we have a put_buffer()/get_buffer() implementation
> +that looks like this:
> +
> +Logically:
> +
> +qemu_rdma_get_buffer():
> +
> +1. A user on top of QEMUFile calls ops->get_buffer(),
> +   which calls us.
> +2. We transmit an empty SEND to let the sender know that 
> +   we are *ready* to receive some bytes from QEMUFileRDMA.
> +   These bytes will come in the form of another SEND.
> +3. Before attempting to receive that SEND, we post another
> +   RQ work request to replace the one we just used up.
> +4. Block on a CQ event channel and wait for the SEND
> +   to arrive.
> +5. When the send arrives, librdmacm will unblock us
> +   and we can consume the bytes (described later).

Using an empty message seems somewhat hacky, a fixed header in the
message would let you do more things if protocol is ever extended.

> +qemu_rdma_put_buffer(): 
> +
> +1. A user on top of QEMUFile calls ops->put_buffer(),
> +   which calls us.
> +2. Block on the CQ event channel waiting for a SEND
> +   from the receiver to tell us that the receiver
> +   is *ready* for us to transmit some new bytes.
> +3. When the "ready" SEND arrives, librdmacm will 
> +   unblock us and we immediately post a RQ work request
> +   to replace the one we just used up.
> +4. Now, we can actually deliver the bytes that
> +   put_buffer() wants and return. 

OK to summarize flow control: at any time there's
either 0 or 1 outstanding buffers in RQ.
At each time only one side can talk.
Destination always goes first, then source, etc.
At each time a single send message can be passed.


Just FYI, this means you are often at 0 buffers in RQ and IIRC 0 buffers
is a worst-case path for infiniband. It's better to keep at least 1
buffers in RQ at all times, so prepost 2 initially so it would fluctuate
between 1 and 2.

> +
> +NOTE: This entire sequence of events is designed this
> +way to mimic the operations of a bytestream and is not
> +typical of an infiniband application. (Something like MPI
> +would not 'ping-pong' messages like this and would not
> +block after every request, which would normally defeat
> +the purpose of using zero-copy infiniband in the first place).
> +
> +Finally, how do we handoff the actual bytes to get_buffer()?
> +
> +Again, because we're trying to "fake" a bytestream abstraction
> +using an analogy not unlike individual UDP frames, we have
> +to hold on to the bytes received from SEND in memory.
> +
> +Each time we get to "Step 5" above for get_buffer(),
> +the bytes from SEND are copied into a local holding buffer.
> +
> +Then, we return the number of bytes requested by get_buffer()
> +and leave the remaining bytes in the buffer until get_buffer()
> +comes around for another pass.
> +
> +If the buffer is empty, then we follow the same steps
> +listed above for qemu_rdma_get_buffer() and block waiting
> +for another SEND message to re-fill the buffer.
> +
> +Migration of pc.ram:
> +===============================
> +
> +At the beginning of the migration (migration-rdma.c),
> +the sender and the receiver each populate a structure with
> +the list of RAMBlocks to be registered with the other side.

Could you add the packet format here as well please?
Need to document endian-ness etc.

> +Then, using a single SEND message, they exchange this
> +structure with each other, to be used later during the
> +iteration of main memory. This structure includes a list
> +of all the RAMBlocks, their offsets and lengths.

This basically means that all memory on the destination has to be
registered upfront.  A typical guest has gigabytes of memory; IMHO
that's too much memory to have pinned.

> +
> +Main memory is not migrated with SEND infiniband 
> +messages, but is instead migrated with RDMA infiniband
> +messages.
> +
> +Pages are migrated in "chunks" (about 64 pages right now).
> +Chunk size is not dynamic, but it could be in a future
> +implementation.
> +
> +When a total of 64 pages have been aggregated (or a flush()
> +occurs), the memory backing the chunk on the sender side is
> +registered with librdmacm and pinned in memory.
> +
> +After pinning, an RDMA write is generated and transmitted
> +for the entire chunk.

I think something chunk-based on the destination side is required
as well. You also can't trust the source to tell you the chunk
size; it could be malicious and ask for too much.
Maybe the source gives a chunk size hint and the destination
responds with what it wants to use.


> +Error-handling:
> +===============================
> +
> +Infiniband has what is called a "Reliable, Connected"
> +link (one of 4 choices). This is the mode we use
> +for RDMA migration.
> +
> +If a *single* message fails,
> +the decision is to abort the migration entirely and
> +cleanup all the RDMA descriptors and unregister all
> +the memory.
> +
> +After cleanup, the Virtual Machine is returned to normal
> +operation the same way that would happen if the TCP
> +socket is broken during a non-RDMA based migration.

Yes but we also need to report errors detected during migration.
Need to document how this is done.
We also need to report success.

> +
> +USAGE
> +===============================
> +
> +Compiling:
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +
> +$ make
> +
> +Command-line on the Source machine AND Destination:
> +
> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infiniband link performing a worst-case stress test,
> +with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +
> +1. Average worst-case RDMA throughput:
> +   approximately 30 gbps (a little better than the paper)
> +2. Average worst-case TCP throughput:
> +   approximately 8 gbps (using IPoIB, IP over Infiniband)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) with additional performance details
> +is linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-18 10:40   ` Michael S. Tsirkin
@ 2013-03-18 20:24     ` Michael R. Hines
  2013-03-18 21:26       ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-18 20:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> I think there are two things here, API documentation and protocol 
> documentation, protocol documentation still needs some more work. Also 
> if what I understand from this document is correct this breaks memory 
> overcommit on destination which needs to be fixed.
>
> I think something chunk-based on the destination side is required as 
> well. You also can't trust the source to tell you the chunk size it 
> could be malicious and ask for too much. Maybe source gives chunk size 
> hint and destination responds with what it wants to use. 

Do we allow ballooning *during* the live migration? Is that necessary?

Would it be sufficient to inform the destination which pages are ballooned
and then only register the ones that the VM actually owns?

> Is there any feature and/or version negotiation? How are we going to
> handle compatibility when we extend the protocol?
You mean, on top of the protocol versioning that's already
built in to QEMUFile, inside qemu_savevm_state_begin()?

Should I piggy-back an additional protocol version number
before QEMUFile sends its version number?

> So how does destination know it's ok to send anything to source? I 
> suspect this is wrong. When using CM you must post on RQ before 
> completing the connection negotiation, not after it's done. 

This is already handled by the RDMA connection manager (librdmacm).

The library already has functions like listen() and accept() the same
way that TCP does.

Once these functions return success, we have a guarantee that both
sides of the connection have already posted the appropriate work
requests sufficient for driving the migration.


>> +2. We transmit an empty SEND to let the sender know that
>> +   we are *ready* to receive some bytes from QEMUFileRDMA.
>> +   These bytes will come in the form of another SEND.
> Using an empty message seems somewhat hacky, a fixed header in the
> message would let you do more things if protocol is ever extended.

Great idea....... I'll add a struct RDMAHeader to each send
message in the next RFC which includes a version number.

(Until now, there were *only* QEMUFile bytes, nothing else,
so I didn't have any reason for a formal structure.)


> OK to summarize flow control: at any time there's either 0 or 1 
> outstanding buffers in RQ. At each time only one side can talk. 
> Destination always goes first, then source, etc. At each time a single 
> send message can be passed. Just FYI, this means you are often at 0 
> buffers in RQ and IIRC 0 buffers is a worst-case path for infiniband. 
> It's better to keep at least 1 buffers in RQ at all times, so prepost 
> 2 initially so it would fluctuate between 1 and 2. 

That's correct. Having 0 buffers is not possible - sending
a message with 0 buffers would throw an error. The "protocol"
as I described ensures that there is always one buffer posted
before waiting for another message to arrive.

I avoided "better" flow control because the non-live state
is so small in comparison to the pc.ram contents that would be sent.
The non-live state is in the range of kilobytes, so it seemed silly to
have more rigorous flow control....

>> +Migration of pc.ram:
>> +===============================
>> +
>> +At the beginning of the migration, (migration-rdma.c),
>> +the sender and the receiver populate the list of RAMBlocks
>> +to be registered with each other into a structure.
> Could you add the packet format here as well please?
> Need to document endian-ness etc.

There is no packet format for pc.ram. It's just bytes - raw RDMA
writes of each 4K page, because the memory must be registered
before the RDMA write can begin.

(As discussed, there will be a format for SEND, though - so I'll
take care of that in my next RFC).

>  Yes but we also need to report errors detected during migration. Need 
> to document how this is done. We also need to report success. 
Acknowledged - I'll add more verbosity to the different error conditions.

- Michael R. Hines

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks
  2013-03-18  8:48   ` Paolo Bonzini
@ 2013-03-18 20:25     ` Michael R. Hines
  0 siblings, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-18 20:25 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Acknowledged.

On 03/18/2013 04:48 AM, Paolo Bonzini wrote:
> Il 18/03/2013 04:18, mrhines@linux.vnet.ibm.com ha scritto:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This introduces:
>> 1. qemu_ram_foreach_block
>> 2. qemu_ram_count_blocks
>>
>> Both used in communicating the RAMBlocks
>> to each side for later memory registration.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   exec.c                    |   21 +++++++++++++++++++++
>>   include/exec/cpu-common.h |    6 ++++++
>>   2 files changed, 27 insertions(+)
>>
>> diff --git a/exec.c b/exec.c
>> index 8a6aac3..a985da8 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -2629,3 +2629,24 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
>>                memory_region_is_romd(section->mr));
>>   }
>>   #endif
>> +
>> +void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
>> +{
>> +    RAMBlock *block;
>> +
>> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>> +        func(block->host, block->offset, block->length, opaque);
>> +    }
>> +}
>> +
>> +int qemu_ram_count_blocks(void)
>> +{
>> +    RAMBlock *block;
>> +    int total = 0;
>> +
>> +    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
>> +        total++;
>> +    }
> Please move this to rdma.c, and implement it using qemu_ram_foreach_block.
>
> Otherwise looks good.
>
> Paolo
>
>> +    return total;
>> +}
>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>> index 2e5f11f..aea3fe0 100644
>> --- a/include/exec/cpu-common.h
>> +++ b/include/exec/cpu-common.h
>> @@ -119,6 +119,12 @@ extern struct MemoryRegion io_mem_rom;
>>   extern struct MemoryRegion io_mem_unassigned;
>>   extern struct MemoryRegion io_mem_notdirty;
>>   
>> +typedef void  (RAMBlockIterFunc)(void *host_addr,
>> +    ram_addr_t offset, ram_addr_t length, void *opaque);
>> +
>> +void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
>> +int qemu_ram_count_blocks(void);
>> +
>>   #endif
>>   
>>   #endif /* !CPU_COMMON_H */
>>
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA
  2013-03-18  8:56   ` Paolo Bonzini
@ 2013-03-18 20:26     ` Michael R. Hines
  0 siblings, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-18 20:26 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Acknowledged.

On 03/18/2013 04:56 AM, Paolo Bonzini wrote:
> In a "socket-like" abstraction, all of these steps except the initial 
> listen are part of "accept". Please move them to 
> qemu_rdma_migrate_accept (possibly renaming the existing 
> qemu_rdma_migrate_accept to a different name). Similarly, perhaps you 
> can merge qemu_rdma_server_prepare and qemu_rdma_migrate_listen. Try 
> to make the public API between modules as small as possible (but not 
> smaller :)), so that you can easily document it without too many 
> references to RDMA concepts. Thanks, Paolo
>> +    rdma->total_bytes = 0;
>> +    rdma->enabled = 1;
>> +    qemu_rdma_dump_gid("server_connect", rdma->rdma_ctx.cm_id);
>> +    return 0;
>> +
>> +err_rdma_server_wait:
>> +    qemu_rdma_cleanup(rdma);
>> +    return -1;
>> +
>> +}
>> +
>> +int rdma_start_incoming_migration(const char * host_port, Error **errp)
>> +{
>> +    RDMAData *rdma = g_malloc0(sizeof(RDMAData));
>> +    QEMUFile *f;
>> +    int ret;
>> +
>> +    if ((ret = qemu_rdma_data_init(rdma, host_port, errp)) < 0)
>> +        return ret;
>> +
>> +    ret = qemu_rdma_server_init(rdma, NULL);
>> +
>> +    DPRINTF("Starting RDMA-based incoming migration\n");
>> +
>> +    if (!ret) {
>> +        DPRINTF("qemu_rdma_server_init success\n");
>> +        ret = qemu_rdma_server_prepare(rdma, NULL);
>> +
>> +        if (!ret) {
>> +            DPRINTF("qemu_rdma_server_prepare success\n");
>> +
>> +            ret = rdma_accept_incoming_migration(rdma, NULL);
>> +            if(!ret)
>> +                DPRINTF("qemu_rdma_accept_incoming_migration success\n");
>> +            f = qemu_fopen_rdma(rdma, "rb");
>> +            if (f == NULL) {
>> +                fprintf(stderr, "could not qemu_fopen RDMA\n");
>> +                ret = -EIO;
>> +            }
>> +
>> +            process_incoming_migration(f);
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp)
>> +{
>> +    RDMAData *rdma = g_malloc0(sizeof(RDMAData));
>> +    MigrationState *s = opaque;
>> +    int ret;
>> +
>> +    if (qemu_rdma_data_init(rdma, host_port, errp) < 0)
>> +        return;
>> +
>> +    ret = qemu_rdma_client_init(rdma, NULL);
>> +    if(!ret) {
>> +        DPRINTF("qemu_rdma_client_init success\n");
>> +        ret = qemu_rdma_client_connect(rdma, NULL);
>> +
>> +        if(!ret) {
>> +            s->file = qemu_fopen_rdma(rdma, "wb");
>> +            DPRINTF("qemu_rdma_client_connect success\n");
>> +            migrate_fd_connect(s);
>> +            return;
>> +        }
>> +    }
>> +
>> +    migrate_fd_error(s);
>> +}
>> +
>> +size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, ram_addr_t offset, int cont, size_t size)
>> +{
>> +    int ret;
>> +    size_t bytes_sent = 0;
>> +    ram_addr_t current_addr;
>> +    RDMAData * rdma = migrate_use_rdma(f);
>> +
>> +    current_addr = block_offset + offset;
>> +
>> +    /*
>> +     * Add this page to the current 'chunk'. If the chunk
>> +     * is full, an actual RDMA write will occur.
>> +     */
>> +    if ((ret = qemu_rdma_write(rdma, current_addr, size)) < 0) {
>> +        fprintf(stderr, "rdma migration: write error! %d\n", ret);
>> +        return ret;
>> +    }
>> +
>> +    /*
>> +     * Drain the Completion Queue if possible.
>> +     * If not, the end of the iteration will do this
>> +     * again to make sure we don't overflow the
>> +     * request queue.
>> +     */
>> +    while (1) {
>> +        int ret = qemu_rdma_poll(rdma);
>> +        if (ret == RDMA_WRID_NONE) {
>> +            break;
>> +        }
>> +        if (ret < 0) {
>> +            fprintf(stderr, "rdma migration: polling error! %d\n", ret);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    bytes_sent += size;
>> +    return bytes_sent;
>> +}
>> +
>> +size_t qemu_rdma_fill(void * opaque, uint8_t *buf, int size)
>> +{
>> +    RDMAData * rdma = opaque;
>> +    size_t len = 0;
>> +
>> +    if(rdma->qemu_file_len) {
>> +        DPRINTF("RDMA %" PRId64 " of %d bytes already in buffer\n",
>> +	    rdma->qemu_file_len, size);
>> +
>> +        len = MIN(size, rdma->qemu_file_len);
>> +        memcpy(buf, rdma->qemu_file_curr, len);
>> +        rdma->qemu_file_curr += len;
>> +        rdma->qemu_file_len -= len;
>> +    }
>> +
>> +    return len;
>> +}
>>
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-18  9:09   ` Paolo Bonzini
@ 2013-03-18 20:33     ` Michael R. Hines
  2013-03-19  9:18       ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-18 20:33 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Comments inline - tell me what you think.......

On 03/18/2013 05:09 AM, Paolo Bonzini wrote:
> +typedef struct QEMUFileRDMA
> +{
> +    void *rdma;
> This is an RDMAData *.  Please avoid using void * as much as possible.
Acknowledged - forgot to move this to rdma.c, so it doesn't have to be 
void anymore.
>      */
> +    return qemu_rdma_fill(r->rdma, buf, size);
> +}
> Please move these functions closer to qemu_fopen_rdma (or better, to an
> RDMA-specific file altogether).  Also, using qemu_rdma_fill introduces a
> dependency of savevm.c on migration-rdma.c.  There should be no such
> dependency; migration-rdma.c should be used only by migration.c.
Acknowledged......
> +void * migrate_use_rdma(QEMUFile *f)
> +{
> +    QEMUFileRDMA *r = f->opaque;
> +
> +    return qemu_rdma_enabled(r->rdma) ? r->rdma : NULL;
> You cannot be sure that f->opaque->rdma is a valid pointer.  For
> example, the first field in a socket QEMUFile's is a file descriptor.
>
> Instead, you could use a qemu_file_ops_are(const QEMUFile *, const
> QEMUFileOps *) function that checks if the file uses the given ops.
> Then, migrate_use_rdma can simply check if the QEMUFile is using the
> RDMA ops structure.
>
> With this change, the "enabled" field of RDMAData should go.

Great - I like that...... will do....

> +/*
> + * Called only for RDMA right now at the end
> + * of each live iteration of memory.
> + *
> + * 'drain' from a QEMUFile perspective means
> + * to flush the outbound send buffer
> + * (if one exists).
> + *
> + * For RDMA, this means to make sure we've
> + * received completion queue (CQ) messages
> + * successfully for all of the RDMA writes
> + * that we requested.
> + */
> +int qemu_drain(QEMUFile *f)
> +{
> +    return f->ops->drain ? f->ops->drain(f->opaque) : 0;
> +}
> Hmm, this is very similar to qemu_fflush, but not quite. :/
>
> Why exactly is this needed?

Good idea - I'll replace drain with flush once I've added
the "qemu_file_ops_are(const QEMUFile *, const QEMUFileOps *)"
that you recommended......


>>   /** Flushes QEMUFile buffer
>>    *
>>    */
>> @@ -723,6 +867,8 @@ int qemu_get_byte(QEMUFile *f)
>>   int64_t qemu_ftell(QEMUFile *f)
>>   {
>>       qemu_fflush(f);
>> +    if(migrate_use_rdma(f))
>> +	return delta_norm_mig_bytes_transferred();
> Not needed, and another undesirable dependency (savevm.c ->
> arch_init.c).  Just update f->pos in save_rdma_page.

f->pos isn't good enough because save_rdma_page does not
go through QEMUFile directly - only non-live state goes
through QEMUFile ....... pc.ram uses direct RDMA writes.

As a result, the position pointer does not get updated
and the accounting is missed........

- Michael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls
  2013-03-18  8:47   ` Paolo Bonzini
@ 2013-03-18 20:37     ` Michael R. Hines
  2013-03-19  9:23       ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-18 20:37 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Comments inline.......

On 03/18/2013 04:47 AM, Paolo Bonzini wrote:
>   int migrate_use_xbzrle(void);
> +void *migrate_use_rdma(QEMUFile *f);
> Perhaps you can add a new function to QEMUFile send_page?  And if it
> returns -ENOTSUP, proceed with the normal is_dup_page + put_buffer.  I
> wonder if that lets use remove migrate_use_rdma() completely.

That's great - I'll make the modification......

>   void process_incoming_migration(QEMUFile *f)
>   {
>       Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
> -    int fd = qemu_get_fd(f);
> -
> -    assert(fd != -1);
> -    socket_set_nonblock(fd);
> +    if(!migrate_use_rdma(f)) {
> +        int fd = qemu_get_fd(f);
> +        assert(fd != -1);
> +        socket_set_nonblock(fd);
> Is this because qemu_get_fd(f) returns -1 for RDMA?  Then, you can
> instead put socket_set_nonblock under an if(fd != -1).

Yes, I proposed doing that check (for -1) in a previous RFC,
but you told me to remove it and make a separate patch =)

Is it OK to keep it in this patch?

> Otherwise looks good. 

Thanks for taking the time =)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-18 20:24     ` Michael R. Hines
@ 2013-03-18 21:26       ` Michael S. Tsirkin
  2013-03-18 23:23         ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-18 21:26 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Mon, Mar 18, 2013 at 04:24:44PM -0400, Michael R. Hines wrote:
> On 03/18/2013 06:40 AM, Michael S. Tsirkin wrote:
> >I think there are two things here, API documentation and protocol
> >documentation, protocol documentation still needs some more work.
> >Also if what I understand from this document is correct this
> >breaks memory overcommit on destination which needs to be fixed.
> >
> >I think something chunk-based on the destination side is required
> >as well. You also can't trust the source to tell you the chunk
> >size it could be malicious and ask for too much. Maybe source
> >gives chunk size hint and destination responds with what it wants
> >to use.
> 
> Do we allow ballooning *during* the live migration? Is that necessary?

Probably but I haven't mentioned ballooning at all.

memory overcommit != ballooning

> Would it be sufficient to inform the destination which pages are ballooned
> and then only register the ones that the VM actually owns?

I haven't thought about it.

> >Is there any feature and/or version negotiation? How are we going to
> >handle compatibility when we extend the protocol?
> You mean, on top of the protocol versioning that's already
> builtin to QEMUFile? inside qemu_savevm_state_begin()?

I mean for protocol things like credit negotiation, which are unrelated
to high level QEMUFile.

> Should I piggy-back an additional protocol version number
> before QEMUFile sends its version number?

CM can exchange a bit of data during connection setup, maybe use that?

> >So how does destination know it's ok to send anything to source? I
> >suspect this is wrong. When using CM you must post on RQ before
> >completing the connection negotiation, not after it's done.
> 
> This is already handled by the RDMA connection manager (librdmacm).
> 
> The library already has functions like listen() and accept() the same
> way that TCP does.
> 
> Once these functions return success, we have a gaurantee that both
> sides of the connection have already posted the appropriate work
> requests sufficient for driving the migration.

Not if you don't post anything. librdmacm does not post requests.  So
everyone posts 1 buffer on RQ during connection setup?
OK though this is not what the document said, I was under the impression
this is done after connection setup.

> 
> >>+2. We transmit an empty SEND to let the sender know that
> >>+   we are *ready* to receive some bytes from QEMUFileRDMA.
> > >>+   These bytes will come in the form of another SEND.
> >Using an empty message seems somewhat hacky, a fixed header in the
> >message would let you do more things if protocol is ever extended.
> 
> Great idea....... I'll add a struct RDMAHeader to each send
> message in the next RFC which includes a version number.
> 
> (Until now, there were *only* QEMUFile bytes, nothing else,
> so I didn't have any reason for a formal structure.)
> 
> 
> >OK to summarize flow control: at any time there's either 0 or 1
> >outstanding buffers in RQ. At each time only one side can talk.
> >Destination always goes first, then source, etc. At each time a
> >single send message can be passed. Just FYI, this means you are
> >often at 0 buffers in RQ and IIRC 0 buffers is a worst-case path
> >for infiniband. It's better to keep at least 1 buffers in RQ at
> >all times, so prepost 2 initially so it would fluctuate between 1
> >and 2.
> 
> That's correct. Having 0 buffers is not possible - sending
> a message with 0 buffers would throw an error. The "protocol"
> as I described ensures that there is always one buffer posted
> before waiting for another message to arrive.

So # of buffers goes 0 -> 1 -> 0 -> 1.
What I am saying is you should have an extra buffer
so it goes 1 -> 2 -> 1 -> 2
otherwise you keep hitting slow path in RQ processing:
each time you consume the last buffer, IIRC receiver sends
an ACK to sender saying "hey this is the last buffer, slow down".
You don't want that.

> I avoided "better" flow control because the non-live state
> is so small in comparison to the pc.ram contents that would be sent.
> The non-live state is in the range of kilobytes, so it seemed silly to
> have more rigorous flow control....

I think it's good enough, just add an extra unused buffer to make
hardware happy.
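The effect of the extra buffer can be modeled with a toy depth counter (illustrative C, not verbs code): with one preposted buffer the receive queue touches zero after every incoming SEND, while preposting two keeps the depth fluctuating between 1 and 2.

```c
#include <assert.h>

/* Toy model of receive-queue depth under the ping-pong protocol.
 * With one preposted buffer the depth touches 0 after every incoming
 * SEND (the slow path); preposting 2 keeps it fluctuating 1 <-> 2. */
static int rq_min_depth(int prepost, int rounds)
{
    int depth = prepost;    /* buffers posted before connection setup */
    int min = depth;
    int i;

    for (i = 0; i < rounds; i++) {
        depth--;            /* an incoming SEND consumes one RQ buffer */
        if (depth < min) {
            min = depth;
        }
        depth++;            /* repost a buffer before replying */
    }
    return min;
}
```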

> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >Could you add the packet format here as well please?
> >Need to document endian-ness etc.
> 
> There is no packet format for pc.ram.

The 'structure' above is passed using SEND so there is
a format.

> It's just bytes - raw RDMA
> writes of each 4K page, because the memory must be registered
> before the RDMA write can begin.
> 
> (As discussed, there will be a format for SEND, though - so I'll
> take care of that in my next RFC).
> 
> > Yes but we also need to report errors detected during migration.
> >Need to document how this is done. We also need to report success.
> Acknowledged - I'll add more verbosity to the different error conditions.
> 
> - Michael R. Hines

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-18 21:26       ` Michael S. Tsirkin
@ 2013-03-18 23:23         ` Michael R. Hines
  2013-03-19  8:19           ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-18 23:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 03/18/2013 05:26 PM, Michael S. Tsirkin wrote:
>
> Probably but I haven't mentioned ballooning at all.
>
> memory overcommit != ballooning

Sure, then setting ballooning aside for the moment,
then let's just consider regular (unused) virtual memory.

In this case, what's wrong with the destination mapping
and pinning all the memory if it is not being ballooned?

If the guest touches all the memory during normal operation
before migration begins (which would be the common case),
then overcommit is irrelevant, no?

> This is already handled by the RDMA connection manager (librdmacm).
>
> The library already has functions like listen() and accept() the same
> way that TCP does.
>
> Once these functions return success, we have a guarantee that both
> sides of the connection have already posted the appropriate work
> requests sufficient for driving the migration.
> Not if you don't post anything. librdmacm does not post requests.  So
> everyone posts 1 buffer on RQ during connection setup?
> OK though this is not what the document said, I was under the impression
> this is done after connection setup.

Sorry, I wasn't being clear. Here's the existing sequence
that I've already coded and validated:

1. Receiver and Sender are started (command line):
      (The receiver has to be running before QMP migrate
       can connect, of course, or this all falls apart.)

2. Both sides post RQ work requests (or multiple ones)
3. Receiver does listen()
4. Sender does connect()
         At this point both sides have already posted
         work requests as stated before.
5. Receiver accept() => issue first SEND message

At this point the sequence of events I describe in the
documentation for put_buffer() / get_buffer() all kick
in and everything is normal.

I'll be sure to post an extra few work requests as suggested.
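The ordering in steps 2-5 amounts to a small invariant: neither side completes the connection until it has posted at least one receive work request, so the first SEND always finds a buffer. A toy model of that invariant (names are illustrative, not librdmacm calls):

```c
#include <assert.h>

/* Toy endpoint: connection setup must not complete before at least one
 * receive work request is posted, so the peer's first SEND lands safely. */
struct toy_endpoint {
    int rq_posted;      /* receive work requests currently posted */
    int connected;      /* has accept()/connect() completed? */
};

static void toy_post_recv(struct toy_endpoint *ep, int n)
{
    ep->rq_posted += n;
}

static int toy_complete_connection(struct toy_endpoint *ep)
{
    if (ep->rq_posted == 0) {
        return -1;      /* refuse: first SEND would find no buffer */
    }
    ep->connected = 1;
    return 0;
}

static int toy_deliver_send(struct toy_endpoint *ep)
{
    if (!ep->connected || ep->rq_posted == 0) {
        return -1;
    }
    ep->rq_posted--;    /* the SEND consumes one posted buffer */
    return 0;
}
```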

>
> So # of buffers goes 0 -> 1 -> 0 -> 1.
> What I am saying is you should have an extra buffer
> so it goes 1 -> 2 -> 1 -> 2
> otherwise you keep hitting slow path in RQ processing:
> each time you consume the last buffer, IIRC receiver sends
> an ACK to sender saying "hey this is the last buffer, slow down".
> You don't want that.

No problem - I'll take care of it.......


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-18 23:23         ` Michael R. Hines
@ 2013-03-19  8:19           ` Michael S. Tsirkin
  2013-03-19 13:21             ` Michael R. Hines
  2013-03-19 15:08             ` Michael R. Hines
  0 siblings, 2 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-19  8:19 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Mon, Mar 18, 2013 at 07:23:53PM -0400, Michael R. Hines wrote:
> On 03/18/2013 05:26 PM, Michael S. Tsirkin wrote:
> >
> >Probably but I haven't mentioned ballooning at all.
> >
> >memory overcommit != ballooning
> 
> Sure, then setting ballooning aside for the moment,
> then let's just consider regular (unused) virtual memory.
> 
> In this case, what's wrong with the destination mapping
> and pinning all the memory if it is not being ballooned?
> 
> If the guest touches all the memory during normal operation
> before migration begins (which would be the common case),
> then overcommit is irrelevant, no?

We have ways (e.g. cgroups) to limit what a VM can do. If it tries to
use more RAM than we let it, it will swap, still making progress, just
slower.  OTOH it looks like pinning more memory than allowed by the
cgroups limit will just get stuck forever (probably a bug,
should fail instead? but does not help your protocol
which needs it all pinned at all times).

There are also per-task resource limits. If you exceed this
registration will fail, so not good either.

I just don't see why do registration by chunks
on source but not on destination.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-18 20:33     ` Michael R. Hines
@ 2013-03-19  9:18       ` Paolo Bonzini
  2013-03-19 13:12         ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19  9:18 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 18/03/2013 21:33, Michael R. Hines ha scritto:
>>
>> +int qemu_drain(QEMUFile *f)
>> +{
>> +    return f->ops->drain ? f->ops->drain(f->opaque) : 0;
>> +}
>> Hmm, this is very similar to qemu_fflush, but not quite. :/
>>
>> Why exactly is this needed?
> 
> Good idea - I'll replace drain with flush once I've added
> the  "qemu_file_ops_are(const QEMUFile *, const QEMUFileOps *) "
> that you recommended......

If I understand correctly, the problem is that save_rdma_page is
asynchronous and you have to wait for pending operations to do the
put_buffer protocol correctly.

Would it work to just do the "drain" in the put_buffer operation, if and
only if it was preceded by a save_rdma_page operation?

> 
>>>   /** Flushes QEMUFile buffer
>>>    *
>>>    */
>>> @@ -723,6 +867,8 @@ int qemu_get_byte(QEMUFile *f)
>>>   int64_t qemu_ftell(QEMUFile *f)
>>>   {
>>>       qemu_fflush(f);
>>> +    if(migrate_use_rdma(f))
>>> +    return delta_norm_mig_bytes_transferred();
>> Not needed, and another undesirable dependency (savevm.c ->
>> arch_init.c).  Just update f->pos in save_rdma_page.
> 
> f->pos isn't good enough because save_rdma_page does not
> go through QEMUFile directly - only non-live state goes
> through QEMUFile ....... pc.ram uses direct RDMA writes.
> 
> As a result, the position pointer does not get updated
> and the accounting is missed........

Yes, I am suggesting to modify f->pos in save_rdma_page instead.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls
  2013-03-18 20:37     ` Michael R. Hines
@ 2013-03-19  9:23       ` Paolo Bonzini
  2013-03-19 13:08         ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19  9:23 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 18/03/2013 21:37, Michael R. Hines ha scritto:
>>
>> +    if(!migrate_use_rdma(f)) {
>> +        int fd = qemu_get_fd(f);
>> +        assert(fd != -1);
>> +        socket_set_nonblock(fd);
>> Is this because qemu_get_fd(f) returns -1 for RDMA?  Then, you can
>> instead put socket_set_nonblock under an if(fd != -1).
> 
> Yes, I proposed doing that check (for -1) in a previous RFC,
> but you told me to remove it and make a separate patch =)
> 
> Is it OK to keep it in this patch?

Yes---this is a separate patch.  Apologies if you had the if(fd != -1)
before. :)  In fact, both the if(fd != -1) and the
if(!migrate_use_rdma(f)) are bad, but I prefer to eliminate as many uses
as possible of migrate_use_rdma.

The reason why they are bad, is that we try to operate on the socket in
a non-blocking manner, so that the monitor keeps working during incoming
migration.  We do it with non-blocking sockets because incoming
migration does not (yet?) have a separate thread, and is not a
bottleneck (the VM is not running, so it's not a problem to hold the big
QEMU lock for extended periods of time).

Does librdmacm support non-blocking operation, similar to select() or
poll()?  Perhaps we can add support for that later.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls
  2013-03-19  9:23       ` Paolo Bonzini
@ 2013-03-19 13:08         ` Michael R. Hines
  2013-03-19 13:20           ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 13:08 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 03/19/2013 05:23 AM, Paolo Bonzini wrote:
>
> Yes---this is a separate patch.  Apologies if you had the if(fd != -1)
> before. :)  In fact, both the if(fd != -1) and the
> if(!migrate_use_rdma(f)) are bad, but I prefer to eliminate as many uses
> as possible of migrate_use_rdma.
I agree. In my current patch I've eliminated all of them.

> Does librdmacm support non-blocking operation, similar to select() or
> poll()?  Perhaps we can add support for that later.

Yes, it does, actually. The library provides what is called an "event 
channel".
(This term is overloaded by other technologies, but that's OK).

An event channel is a file descriptor provided by (I believe) the rdma_cm
kernel module driver.

When you poll on this file descriptor, it behaves like other files or
sockets: it tells you when data is ready or when events of interest
have completed (for example, when the completion queue has elements).

In my current patch, I'm using this during 
"rdma_accept_incoming_connection()",
but I'm not currently using it for the rest of the coroutine on the 
receiver side.
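Since the event channel is just a file descriptor, it can be multiplexed with poll()/select() like any socket. A self-contained POSIX sketch of that pattern, using a pipe as a stand-in for the channel's fd:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms for fd to become readable; returns 1 if ready,
 * 0 on timeout, -1 on error.  The same pattern applies to the fd of an
 * rdma_cm event channel (here a plain pipe stands in for it). */
static int fd_ready(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0) {
        return -1;
    }
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```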


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19  9:18       ` Paolo Bonzini
@ 2013-03-19 13:12         ` Michael R. Hines
  2013-03-19 13:25           ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 13:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 03/19/2013 05:18 AM, Paolo Bonzini wrote:
> Il 18/03/2013 21:33, Michael R. Hines ha scritto:
>>> +int qemu_drain(QEMUFile *f)
>>> +{
>>> +    return f->ops->drain ? f->ops->drain(f->opaque) : 0;
>>> +}
>>> Hmm, this is very similar to qemu_fflush, but not quite. :/
>>>
>>> Why exactly is this needed?
>> Good idea - I'll replace drain with flush once I've added
>> the  "qemu_file_ops_are(const QEMUFile *, const QEMUFileOps *) "
>> that you recommended......
> If I understand correctly, the problem is that save_rdma_page is
> asynchronous and you have to wait for pending operations to do the
> put_buffer protocol correctly.
>
> Would it work to just do the "drain" in the put_buffer operation, if and
> only if it was preceded by a save_rdma_page operation?

Yes, the drain needs to happen in a few places already:

1. During save_rdma_page (if the current "chunk" is full of pages)
2. At the end of each iteration (now using qemu_fflush in my current
patch)
3. And also during qemu_savevm_state_complete(), also using qemu_fflush.
>>>>    /** Flushes QEMUFile buffer
>>>>     *
>>>>     */
>>>> @@ -723,6 +867,8 @@ int qemu_get_byte(QEMUFile *f)
>>>>    int64_t qemu_ftell(QEMUFile *f)
>>>>    {
>>>>        qemu_fflush(f);
>>>> +    if(migrate_use_rdma(f))
>>>> +    return delta_norm_mig_bytes_transferred();
>>> Not needed, and another undesirable dependency (savevm.c ->
>>> arch_init.c).  Just update f->pos in save_rdma_page.
>> f->pos isn't good enough because save_rdma_page does not
>> go through QEMUFile directly - only non-live state goes
>> through QEMUFile ....... pc.ram uses direct RDMA writes.
>>
>> As a result, the position pointer does not get updated
>> and the accounting is missed........
> Yes, I am suggesting to modify f->pos in save_rdma_page instead.
>
> Paolo
>

Would that not confuse the other QEMUFile users?
If I change that pointer (without actually putting bytes
into QEMUFile), won't the f->pos pointer be
incorrectly updated?


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls
  2013-03-19 13:08         ` Michael R. Hines
@ 2013-03-19 13:20           ` Paolo Bonzini
  0 siblings, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 13:20 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 19/03/2013 14:08, Michael R. Hines ha scritto:
> On 03/19/2013 05:23 AM, Paolo Bonzini wrote:
>>
>> Yes---this is a separate patch.  Apologies if you had the if(fd != -1)
>> before. :)  In fact, both the if(fd != -1) and the
>> if(!migrate_use_rdma(f)) are bad, but I prefer to eliminate as many uses
>> as possible of migrate_use_rdma.
> I agree. In my current patch I've eliminated all of them.

Very nice.  It remains to be seen how many are replaced by checks on the
QEMUFileOps :) but it cannot be worse!

>> Does librdmacm support non-blocking operation, similar to select() or
>> poll()?  Perhaps we can add support for that later.
> 
> Yes, it does, actually. The library provides what is called an "event
> channel".
> (This term is overloaded by other technologies, but that's OK).
> 
> An event channel is a file descriptor provided by (I believe) the rdma_cm
> kernel module driver.
> 
> When you poll on this file descriptor, it can tell you all sorts of things
> just like other files or sockets like when data is ready or when
> events of interest have completed (like the completion queue has elements).
> 
> In my current patch, I'm using this during
> "rdma_accept_incoming_connection()",
> but I'm not currently using it for the rest of the coroutine on the
> receiver side.

Ok, this can be added later.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19  8:19           ` Michael S. Tsirkin
@ 2013-03-19 13:21             ` Michael R. Hines
  2013-03-19 15:08             ` Michael R. Hines
  1 sibling, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 13:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 03/19/2013 04:19 AM, Michael S. Tsirkin wrote:
> We have ways (e.g. cgroups) to limit what a VM can do. If it tries to 
> use more RAM than we let it, it will swap, still making progress, just 
> slower. OTOH it looks like pinning more memory than allowed by the 
> cgroups limit will just get stuck forever (probably a bug, should fail 
> instead? but does not help your protocol which needs it all pinned at 
> all times). There are also per-task resource limits. If you exceed 
> this registration will fail, so not good either. I just don't see why 
> do registration by chunks on source but not on destination. 

Would this be a hard requirement for an initial version?

I do understand how and why this makes things more flexible during
the long run, but it does have the potential to slow down the RDMA
protocol significantly.

The way it's implemented now, the sender can dump bytes
onto the wire at full speed (up to 30gbps last time I measured it),
but if we insert a round-trip message + registration on the
destination side before we're allowed to push more bytes out,
we'll have to introduce more complex flow control only for
the benefit of making the destination side have the flexibility
that you described.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 13:12         ` Michael R. Hines
@ 2013-03-19 13:25           ` Paolo Bonzini
  2013-03-19 13:40             ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 13:25 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 19/03/2013 14:12, Michael R. Hines ha scritto:
> On 03/19/2013 05:18 AM, Paolo Bonzini wrote:
>> Il 18/03/2013 21:33, Michael R. Hines ha scritto:
>>>> +int qemu_drain(QEMUFile *f)
>>>> +{
>>>> +    return f->ops->drain ? f->ops->drain(f->opaque) : 0;
>>>> +}
>>>> Hmm, this is very similar to qemu_fflush, but not quite. :/
>>>>
>>>> Why exactly is this needed?
>>> Good idea - I'll replace drain with flush once I've added
>>> the  "qemu_file_ops_are(const QEMUFile *, const QEMUFileOps *) "
>>> that you recommended......
>> If I understand correctly, the problem is that save_rdma_page is
>> asynchronous and you have to wait for pending operations to do the
>> put_buffer protocol correctly.
>>
>> Would it work to just do the "drain" in the put_buffer operation, if and
>> only if it was preceded by a save_rdma_page operation?
> 
> Yes, the drain needs to happen in a few places already:
> 
> 1. During save_rdma_page (if the current "chunk" is full of pages)

Ok, this is internal to RDMA so no problem.

> 2. During the end of each iteration (now using qemu_fflush in my current
> patch)

Why?

> 3. And also during qemu_savevm_state_complete(), also using qemu_fflush.

This would be caught by put_buffer, but (2) would not.

>>>>>    /** Flushes QEMUFile buffer
>>>>>     *
>>>>>     */
>>>>> @@ -723,6 +867,8 @@ int qemu_get_byte(QEMUFile *f)
>>>>>    int64_t qemu_ftell(QEMUFile *f)
>>>>>    {
>>>>>        qemu_fflush(f);
>>>>> +    if(migrate_use_rdma(f))
>>>>> +    return delta_norm_mig_bytes_transferred();
>>>> Not needed, and another undesirable dependency (savevm.c ->
>>>> arch_init.c).  Just update f->pos in save_rdma_page.
>>> f->pos isn't good enough because save_rdma_page does not
>>> go through QEMUFile directly - only non-live state goes
>>> through QEMUFile ....... pc.ram uses direct RDMA writes.
>>>
>>> As a result, the position pointer does not get updated
>>> and the accounting is missed........
>> Yes, I am suggesting to modify f->pos in save_rdma_page instead.
>>
>> Paolo
>>
> 
> Would that not confuse the other QEMUFile users?
> If I change that pointer (without actually putting bytes
> into QEMUFile), won't the f->pos pointer be
> incorrectly updated?

f->pos is never used directly by QEMUFile, it is almost an opaque value.
 It is accumulated on every qemu_fflush (so that it can be passed to the
->put_buffer function), and returned by qemu_ftell; nothing else.

If you make somehow save_rdma_page a new op, returning a value from that
op and adding it to f->pos would be a good way to achieve this.
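A toy version of that accounting scheme (struct and function names mimic QEMUFile but are illustrative, not QEMU's actual API): the page op transfers data out-of-band and reports its byte count, which the caller folds into f->pos so qemu_ftell stays accurate.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy QEMUFile position accounting: pos is opaque, bumped by buffered
 * writes on flush *and* by whatever byte count the RDMA page op reports. */
struct toy_file {
    int64_t pos;
};

static int64_t toy_ftell(struct toy_file *f)
{
    return f->pos;          /* no flush needed in the toy version */
}

static size_t toy_save_rdma_page(struct toy_file *f, size_t page_size)
{
    /* ... RDMA write of the page happens out-of-band here ... */
    f->pos += page_size;    /* report the bytes for accounting */
    return page_size;
}
```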

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 13:25           ` Paolo Bonzini
@ 2013-03-19 13:40             ` Michael R. Hines
  2013-03-19 13:45               ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 13:40 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 03/19/2013 09:25 AM, Paolo Bonzini wrote:
> Yes, the drain needs to happen in a few places already:
>
> 1. During save_rdma_page (if the current "chunk" is full of pages)
> Ok, this is internal to RDMA so no problem.
>
>> 2. During the end of each iteration (now using qemu_fflush in my current
>> patch)
> Why?

This is because of downtime: You have to drain the queue anyway at the
very end, and if you don't drain it in advance after each iteration, then
the queue will have lots of bytes in it waiting for transmission and the
Virtual Machine will be stopped for a much longer period of time during
the last iteration waiting for RDMA card to finish transmission of all those
bytes.

If you wait till the last iteration to do this, then all of that
waiting time gets counted as downtime, causing the VCPUs to be
unnecessarily stopped.


>> 3. And also during qemu_savevm_state_complete(), also using qemu_fflush.
> This would be caught by put_buffer, but (2) would not.
>

I'm not sure this is good enough either - we don't want to flush
the queue *frequently*..... only when it's necessary for performance
.... we do want the queue to have some meat to it so the hardware
can write bytes as fast as possible.....

If we flush inside put_buffer (which is called very frequently): then
we have no way to distinguish *where* put_buffer was called from
(either from qemu_savevm_state_complete() or from a device-level
function call that's using QEMUFile).

>>> Yes, I am suggesting to modify f->pos in save_rdma_page instead.
>>>
>>> Paolo
>>>
>> Would that not confuse the other QEMUFile users?
>> If I change that pointer (without actually putting bytes
>> in into QEMUFile), won't the f->pos pointer be
>> incorrectly updated?
> f->pos is never used directly by QEMUFile, it is almost an opaque value.
>   It is accumulated on every qemu_fflush (so that it can be passed to the
> ->put_buffer function), and returned by qemu_ftell; nothing else.
>
> If you make somehow save_rdma_page a new op, returning a value from that
> op and adding it to f->pos would be a good way to achieve this.

Ok, great - I'll take advantage of that........Thanks.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 13:40             ` Michael R. Hines
@ 2013-03-19 13:45               ` Paolo Bonzini
  2013-03-19 14:10                 ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 13:45 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 19/03/2013 14:40, Michael R. Hines ha scritto:
> On 03/19/2013 09:25 AM, Paolo Bonzini wrote:
>> Yes, the drain needs to happen in a few places already:
>>
>> 1. During save_rdma_page (if the current "chunk" is full of pages)
>> Ok, this is internal to RDMA so no problem.
>>
>>> 2. During the end of each iteration (now using qemu_fflush in my current
>>> patch)
>> Why?
> 
> This is because of downtime: You have to drain the queue anyway at the
> very end, and if you don't drain it in advance after each iteration, then
> the queue will have lots of bytes in it waiting for transmission and the
> Virtual Machine will be stopped for a much longer period of time during
> the last iteration waiting for RDMA card to finish transmission of all
> those
> bytes.

Shouldn't the "current chunk full" case take care of it too?

Of course if you disable chunking you have to add a different condition,
perhaps directly into save_rdma_page.

> If you wait till the last iteration to do this, then all of that waiting time gets
> counted as downtime, causing the VCPUs to be unnecessarily stopped.
> 
>>> 3. And also during qemu_savevm_state_complete(), also using qemu_fflush.
>> This would be caught by put_buffer, but (2) would not.
>>
> 
> I'm not sure this is good enough either - we don't want to flush
> the queue *frequently*..... only when it's necessary for performance
> .... we do want the queue to have some meat to it so the hardware
> can write bytes as fast as possible.....
> 
> If we flush inside put_buffer (which is called very frequently):

Is it called at any time during RAM migration?

> then we have no way to distinguish *where* put_buffer was called from
> (either from qemu_savevm_state_complete() or from a device-level
> function call that's using QEMUFile).

Can you make drain a no-op if there is nothing in flight?  Then every
call to put_buffer after the first should not have any overhead.
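A sketch of that no-op drain (names are illustrative, not the patch's actual qemu_drain): only the first put_buffer after a burst of RDMA writes pays a drain; subsequent calls see nothing in flight and return immediately.

```c
#include <assert.h>

/* Toy drain: only the first put_buffer after a burst of RDMA writes
 * pays a drain; later calls see nothing in flight and return at once. */
static int in_flight;       /* RDMA writes not yet completed */
static int drains_done;     /* how many real drains were performed */

static void toy_save_rdma_page(void)
{
    in_flight++;            /* asynchronous write queued on the card */
}

static void toy_drain(void)
{
    if (in_flight == 0) {
        return;             /* no-op: nothing pending */
    }
    drains_done++;
    in_flight = 0;          /* block until all completions arrive */
}

static void toy_put_buffer(void)
{
    toy_drain();            /* cheap to call unconditionally */
    /* ... SEND the buffered device-state bytes ... */
}
```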

Paolo

>>>> Yes, I am suggesting to modify f->pos in save_rdma_page instead.
>>>>
>>>> Paolo
>>>>
>>> Would that not confuse the other QEMUFile users?
>>> If I change that pointer (without actually putting bytes
>>> in into QEMUFile), won't the f->pos pointer be
>>> incorrectly updated?
>> f->pos is never used directly by QEMUFile, it is almost an opaque value.
>>   It is accumulated on every qemu_fflush (so that it can be passed to the
>> ->put_buffer function), and returned by qemu_ftell; nothing else.
>>
>> If you make somehow save_rdma_page a new op, returning a value from that
>> op and adding it to f->pos would be a good way to achieve this.
> 
> Ok, great - I'll take advantage of that........Thanks.
> 


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 13:45               ` Paolo Bonzini
@ 2013-03-19 14:10                 ` Michael R. Hines
  2013-03-19 14:22                   ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 14:10 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 03/19/2013 09:45 AM, Paolo Bonzini wrote:
> This is because of downtime: You have to drain the queue anyway at the
> very end, and if you don't drain it in advance after each iteration, then
> the queue will have lots of bytes in it waiting for transmission and the
> Virtual Machine will be stopped for a much longer period of time during
> the last iteration waiting for RDMA card to finish transmission of all
> those
> bytes.
> Shouldn't the "current chunk full" case take care of it too?
>
> Of course if you disable chunking you have to add a different condition,
> perhaps directly into save_rdma_page.

No, we don't want to flush on "chunk full" - that has a different meaning.
We want to have as many chunks submitted to the hardware for transmission
as possible to keep the bytes moving.


>>>> 3. And also during qemu_savevm_state_complete(), also using qemu_fflush.
>>> This would be caught by put_buffer, but (2) would not.
>>>
>> I'm not sure this is good enough either - we don't want to flush
>> the queue *frequently*..... only when it's necessary for performance
>> .... we do want the queue to have some meat to it so the hardware
>> can write bytes as fast as possible.....
>>
>> If we flush inside put_buffer (which is called very frequently):
> Is it called at any time during RAM migration?

I don't understand the question: the flushing we've been discussing
is *only* for RAM migration - not for the non-live state.

I haven't introduced any "new" flushes for non-live state other than
when it's absolutely necessary to flush for RAM migration.


>> then we have no way to distinguish *where* put_buffer was called from
>> (either from qemu_savevm_state_complete() or from a device-level
>> function call that's using QEMUFile).
> Can you make drain a no-op if there is nothing in flight?  Then every
> call to put_buffer after the first should not have any overhead.
>
> Paolo

That still doesn't solve the problem: If there is nothing in flight,
then there is no reason to call qemu_fflush() in the first place.

This is why I avoided using fflush() in the beginning, because it
sort of "confuses" who is using it: from the perspective of fflush(),
you can't tell if the user is calling it for RAM or for non-live state.

The flushes we need are only for RAM, not the rest of it......

Make sense?


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 14:10                 ` Michael R. Hines
@ 2013-03-19 14:22                   ` Paolo Bonzini
  2013-03-19 15:02                     ` [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration? Michael R. Hines
  2013-03-19 18:27                     ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA Michael R. Hines
  0 siblings, 2 replies; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 14:22 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 19/03/2013 15:10, Michael R. Hines ha scritto:
> On 03/19/2013 09:45 AM, Paolo Bonzini wrote:
>> This is because of downtime: You have to drain the queue anyway at the
>> very end, and if you don't drain it in advance after each iteration, then
>> the queue will have lots of bytes in it waiting for transmission and the
>> Virtual Machine will be stopped for a much longer period of time during
>> the last iteration waiting for RDMA card to finish transmission of all
>> those
>> bytes.
>> Shouldn't the "current chunk full" case take care of it too?
>>
>> Of course if you disable chunking you have to add a different condition,
>> perhaps directly into save_rdma_page.
> 
> No, we don't want to flush on "chunk full" - that has a different meaning.
> We want to have as many chunks submitted to the hardware for transmission
> as possible to keep the bytes moving.

That however gives me an idea...  Instead of the full drain at the end
of an iteration, does it make sense to do a "partial" drain at every
chunk full, so that you don't have > N bytes pending and the downtime is
correspondingly limited?
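The bound suggested above can be modeled with a toy counter: draining whenever pending bytes reach a chunk-sized limit N guarantees the backlog (and hence the final-drain downtime) never exceeds N. The limit and names are illustrative:

```c
#include <assert.h>

enum { CHUNK_LIMIT = 1 << 20 };  /* N: illustrative 1 MB pending cap */

static long pending;        /* bytes queued but not yet completed */
static long max_pending;    /* worst-case backlog (proxy for downtime) */

static void toy_drain(void)
{
    pending = 0;            /* wait for the card to finish everything */
}

static void toy_save_page(long bytes)
{
    pending += bytes;
    if (pending > max_pending) {
        max_pending = pending;
    }
    if (pending >= CHUNK_LIMIT) {
        toy_drain();        /* partial drain at every "chunk full" */
    }
}
```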

>>>>> 3. And also during qemu_savevm_state_complete(), also using
>>>>> qemu_fflush.
>>>> This would be caught by put_buffer, but (2) would not.
>>>>
>>> I'm not sure this is good enough either - we don't want to flush
>>> the queue *frequently*..... only when it's necessary for performance
>>> .... we do want the queue to have some meat to it so the hardware
>>> can write bytes as fast as possible.....
>>>
>>> If we flush inside put_buffer (which is called very frequently):
>> Is it called at any time during RAM migration?
> 
> I don't understand the question: the flushing we've been discussing
> is *only* for RAM migration - not for the non-live state.

Yes.  But I would like to piggyback the final, full drain on the switch
from RAM migration to device migration.

>> Can you make drain a no-op if there is nothing in flight?  Then every
>> call to put_buffer after the first should not have any overhead.
> 
> That still doesn't solve the problem: If there is nothing in flight,
> then there is no reason to call qemu_fflush() in the first place.

If there is no RAM migration in flight.  So you have

   migrate RAM
   ...
   RAM migration finished, device migration start
   put_buffer <<<<< QEMUFileRDMA triggers drain
   put_buffer
   put_buffer
   put_buffer
   ...

> The flushes we need are only for RAM, not the rest of it......
> 
> Make sense?

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration?
  2013-03-19 14:22                   ` Paolo Bonzini
@ 2013-03-19 15:02                     ` Michael R. Hines
  2013-03-19 15:12                       ` Michael R. Hines
  2013-03-19 18:27                     ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA Michael R. Hines
  1 sibling, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 15:02 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Consider the following sequence:

1. Boot fresh VM (say, a boring 1GB vm)               => Resident set is small, say 100M
2. Touch all the memory (with a utility or something) => Resident set is ~1G
3. Send QMP "balloon 500"                             => Resident set is ~500M
4. Now, migrate the VM                                => Resident set is 1G again

This suggests to me that migration is not accounting for
what memory was ballooned.

I suspect this is because the migration_bitmap does not coordinate
with the list of ballooned-out memory that was madvise()'d.

This affects RDMA as well as TCP on the sender side.

Is there any hard reason why we're not validating migration_bitmap against
the memory that was madvise()'d?

- Michael R. Hines

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19  8:19           ` Michael S. Tsirkin
  2013-03-19 13:21             ` Michael R. Hines
@ 2013-03-19 15:08             ` Michael R. Hines
  2013-03-19 15:16               ` Michael S. Tsirkin
  1 sibling, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 15:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

This is actually a much bigger problem than I thought, not just for RDMA:

Currently the *sender* side does not support overcommit
during a regular TCP migration....... I assume this is because the
migration_bitmap does not know which memory is mapped or
unmapped by the host kernel.

Is this a known issue?

- Michael

On 03/19/2013 04:19 AM, Michael S. Tsirkin wrote:
> On Mon, Mar 18, 2013 at 07:23:53PM -0400, Michael R. Hines wrote:
>> On 03/18/2013 05:26 PM, Michael S. Tsirkin wrote:
>>> Probably but I haven't mentioned ballooning at all.
>>>
>>> memory overcommit != ballooning
>> Sure, then setting ballooning aside for the moment,
>> then let's just consider regular (unused) virtual memory.
>>
>> In this case, what's wrong with the destination mapping
>> and pinning all the memory if it is not being ballooned?
>>
>> If the guest touches all the memory during normal operation
>> before migration begins (which would be the common case),
>> then overcommit is irrelevant, no?
> We have ways (e.g. cgroups) to limit what a VM can do. If it tries to
> use more RAM than we let it, it will swap, still making progress, just
> slower.  OTOH it looks like pinning more memory than allowed by the
> cgroups limit will just get stuck forever (probably a bug,
> should fail instead? but does not help your protocol
> which needs it all pinned at all times).
>
> There are also per-task resource limits. If you exceed this
> registration will fail, so not good either.
>
> I just don't see why do registration by chunks
> on source but not on destination.
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration?
  2013-03-19 15:02                     ` [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration? Michael R. Hines
@ 2013-03-19 15:12                       ` Michael R. Hines
  2013-03-19 15:17                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 15:12 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Actually, you don't even need ballooning to reproduce this behavior.

Is this a known issue?

- Michael


On 03/19/2013 11:02 AM, Michael R. Hines wrote:
> Consider the following sequence:
>
> 1. Boot fresh VM (say, a boring 1GB vm)                    => Resident 
> set is small, say 100M
> 2. Touch all the memory (with a utility or something) => Resident set 
> is ~1G
> 3. Send QMP "balloon 500" => Resident set is ~500M
> 4. Now, migrate the VM => Resident set is 1G again
>
> This suggests to me that migration is not accounting for
> what memory was ballooned.
>
> I suspect this is because the migration_bitmap does not coordinate
> with the list of ballooned-out memory that was MADVISED().
>
> This affects RDMA as well as TCP on the sender side.
>
> Is there any hard reason why we're not validating migration_bitmap 
> against
> the memory that was MADVISED()'d?
>
> - Michael R. Hines
>
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 15:08             ` Michael R. Hines
@ 2013-03-19 15:16               ` Michael S. Tsirkin
  2013-03-19 15:32                 ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-19 15:16 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
> This is actual a much bigger problem that I thought, not just for RDMA:
> 
> Currently the *sender* side is does not support overcommit
> during a regular TCP migration.......I assume because the
> migration_bitmap does not know which memory is mapped or
> unmapped by the host kernel.
> 
> Is this a known issue?
> 
> - Michael

I don't really understand what you are saying here.
Do you see some bug with migration where we might use
more memory than allowed by cgroups?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration?
  2013-03-19 15:12                       ` Michael R. Hines
@ 2013-03-19 15:17                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-19 15:17 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Tue, Mar 19, 2013 at 11:12:50AM -0400, Michael R. Hines wrote:
> Actually, you don't even need ballooning to reproduce this behavior.
> 
> Is this a known issue?
> 
> - Michael

Yes.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 15:16               ` Michael S. Tsirkin
@ 2013-03-19 15:32                 ` Michael R. Hines
  2013-03-19 15:36                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 15:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 03/19/2013 11:16 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
>> This is actual a much bigger problem that I thought, not just for RDMA:
>>
>> Currently the *sender* side is does not support overcommit
>> during a regular TCP migration.......I assume because the
>> migration_bitmap does not know which memory is mapped or
>> unmapped by the host kernel.
>>
>> Is this a known issue?
>>
>> - Michael
> I don't really understand what you are saying here.
> Do you see some bug with migration where we might use
> more memory than allowed by cgroups?
>

Yes: cgroups does not coordinate with the list of pages
that have "not yet been mapped" or touched by the
virtual machine, right?

I may be missing something here from what I read in
the code, but even if I set a cgroups limit on memory,
QEMU will still attempt to access that memory if the
migration_bitmap tells it to, as far as I can tell.

Is this an accurate observation?

A simple solution would be to just have QEMU consult with /dev/pagemap, no?
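(A hedged sketch of what consulting the pagemap interface would involve:
the kernel exposes one little-endian u64 per virtual page, with bit 63
meaning "page present in RAM"; the helper names below are made up for
illustration:)

```c
/* Sketch of the pagemap lookup: the pagemap file holds one 64-bit
 * entry per virtual page, and bit 63 of an entry is the "present"
 * bit.  Helper names are illustrative, not QEMU symbols. */
#include <stdint.h>

#define PAGEMAP_PRESENT (1ULL << 63)

/* Byte offset of the pagemap entry covering a virtual address. */
uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    return (vaddr / page_size) * sizeof(uint64_t);
}

/* Nonzero if a pagemap entry says the page is resident in RAM. */
int pagemap_entry_present(uint64_t entry)
{
    return (entry & PAGEMAP_PRESENT) != 0;
}
```

Migration would pread() the entry at that offset and skip (or decline to
pin) any page whose present bit is clear.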

- Michael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 15:32                 ` Michael R. Hines
@ 2013-03-19 15:36                   ` Michael S. Tsirkin
  2013-03-19 17:09                     ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-19 15:36 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Tue, Mar 19, 2013 at 11:32:49AM -0400, Michael R. Hines wrote:
> On 03/19/2013 11:16 AM, Michael S. Tsirkin wrote:
> >On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
> >>This is actual a much bigger problem that I thought, not just for RDMA:
> >>
> >>Currently the *sender* side is does not support overcommit
> >>during a regular TCP migration.......I assume because the
> >>migration_bitmap does not know which memory is mapped or
> >>unmapped by the host kernel.
> >>
> >>Is this a known issue?
> >>
> >>- Michael
> >I don't really understand what you are saying here.
> >Do you see some bug with migration where we might use
> >more memory than allowed by cgroups?
> >
> 
> Yes: cgroups does not coordinate with the list of pages
> that have "not yet been mapped" or touched by the
> virtual machine, right?
> 
> I may be missing something here from what I read in
> the code, but even if I set a cgroups limit on memory,
> QEMU will still attempt to access that memory if the
> migration_bitmap tells it to, as far as I can tell.
> 
> Is this an accurate observation?

Yes but this simply means QEMU will hit swap.

> A simple solution would be to just have QEMU consult with /dev/pagemap, no?
> 
> - Michael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 15:36                   ` Michael S. Tsirkin
@ 2013-03-19 17:09                     ` Michael R. Hines
  2013-03-19 17:14                       ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 17:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

Allowing QEMU to swap due to a cgroup limit during migration is a viable 
overcommit option?

I'm trying to keep an open mind, but that would kill the migration time.....

- Michael

On 03/19/2013 11:36 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 11:32:49AM -0400, Michael R. Hines wrote:
>> On 03/19/2013 11:16 AM, Michael S. Tsirkin wrote:
>>> On Tue, Mar 19, 2013 at 11:08:24AM -0400, Michael R. Hines wrote:
>>>> This is actual a much bigger problem that I thought, not just for RDMA:
>>>>
>>>> Currently the *sender* side is does not support overcommit
>>>> during a regular TCP migration.......I assume because the
>>>> migration_bitmap does not know which memory is mapped or
>>>> unmapped by the host kernel.
>>>>
>>>> Is this a known issue?
>>>>
>>>> - Michael
>>> I don't really understand what you are saying here.
>>> Do you see some bug with migration where we might use
>>> more memory than allowed by cgroups?
>>>
>> Yes: cgroups does not coordinate with the list of pages
>> that have "not yet been mapped" or touched by the
>> virtual machine, right?
>>
>> I may be missing something here from what I read in
>> the code, but even if I set a cgroups limit on memory,
>> QEMU will still attempt to access that memory if the
>> migration_bitmap tells it to, as far as I can tell.
>>
>> Is this an accurate observation?
> Yes but this simply means QEMU will hit swap.
>
>> A simple solution would be to just have QEMU consult with /dev/pagemap, no?
>>
>> - Michael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:09                     ` Michael R. Hines
@ 2013-03-19 17:14                       ` Paolo Bonzini
  2013-03-19 17:23                         ` Michael S. Tsirkin
                                           ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 17:14 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 19/03/2013 18:09, Michael R. Hines ha scritto:
> Allowing QEMU to swap due to a cgroup limit during migration is a viable
> overcommit option?
> 
> I'm trying to keep an open mind, but that would kill the migration
> time.....

Would it swap?  Doesn't the kernel back all zero pages with a single
copy-on-write page?  If that still accounts towards cgroup limits, it
would be a bug.

Old kernels do not have a shared zero hugepage, and that includes some
distro kernels.  Perhaps that's the problem.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:14                       ` Paolo Bonzini
@ 2013-03-19 17:23                         ` Michael S. Tsirkin
  2013-03-19 17:40                         ` Michael R. Hines
  2013-03-19 17:49                         ` Michael R. Hines
  2 siblings, 0 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-19 17:23 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Tue, Mar 19, 2013 at 06:14:45PM +0100, Paolo Bonzini wrote:
> Il 19/03/2013 18:09, Michael R. Hines ha scritto:
> > Allowing QEMU to swap due to a cgroup limit during migration is a viable
> > overcommit option?
> > 
> > I'm trying to keep an open mind, but that would kill the migration
> > time.....

Maybe not if you have a fast SSD, or are using swap in RAM or compressed
swap or ...

> Would it swap?  Doesn't the kernel back all zero pages with a single
> copy-on-write page?  If that still accounts towards cgroup limits, it
> would be a bug.
> 
> Old kernels do not have a shared zero hugepage, and that includes some
> distro kernels.  Perhaps that's the problem.
> 
> Paolo

AFAIK for zero pages, yes. I'm not sure what the problem is either.

-- 
MST

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:14                       ` Paolo Bonzini
  2013-03-19 17:23                         ` Michael S. Tsirkin
@ 2013-03-19 17:40                         ` Michael R. Hines
  2013-03-19 17:52                           ` Paolo Bonzini
  2013-03-19 17:49                         ` Michael R. Hines
  2 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 17:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

OK, so I did a quick test and the cgroup does appear to be working 
correctly for zero pages.

Nevertheless, this still doesn't solve the chunk registration problem 
for RDMA.

Even with a cgroup on the sender *or* receiver side, there is no API
that I know of that would correctly indicate to the migration process
which pages are safe to register with the hardware. Without such an
API, even a "smarter" chunked memory registration scheme would not
work with cgroups, because we would be attempting to pin zero pages
(for no reason) that cgroups has already kicked out, which would
defeat the purpose of using cgroups.

So, if I submit a separate patch to fix this, would you guys review it? 
(Using /dev/pagemap).

Unless there is a better idea? Does KVM expose the necessary mappings?

- Michael

On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
> Il 19/03/2013 18:09, Michael R. Hines ha scritto:
>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>> overcommit option?
>>
>> I'm trying to keep an open mind, but that would kill the migration
>> time.....
> Would it swap?  Doesn't the kernel back all zero pages with a single
> copy-on-write page?  If that still accounts towards cgroup limits, it
> would be a bug.
>
> Old kernels do not have a shared zero hugepage, and that includes some
> distro kernels.  Perhaps that's the problem.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:14                       ` Paolo Bonzini
  2013-03-19 17:23                         ` Michael S. Tsirkin
  2013-03-19 17:40                         ` Michael R. Hines
@ 2013-03-19 17:49                         ` Michael R. Hines
  2013-03-21  6:11                           ` Michael S. Tsirkin
  2 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 17:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)

So, infiniband is not smart enough to know how to avoid pinning a zero 
page, I guess.

- Michael

On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
> Il 19/03/2013 18:09, Michael R. Hines ha scritto:
>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>> overcommit option?
>>
>> I'm trying to keep an open mind, but that would kill the migration
>> time.....
> Would it swap?  Doesn't the kernel back all zero pages with a single
> copy-on-write page?  If that still accounts towards cgroup limits, it
> would be a bug.
>
> Old kernels do not have a shared zero hugepage, and that includes some
> distro kernels.  Perhaps that's the problem.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:40                         ` Michael R. Hines
@ 2013-03-19 17:52                           ` Paolo Bonzini
  2013-03-19 18:04                             ` Michael R. Hines
  2013-03-20 13:07                             ` Michael S. Tsirkin
  0 siblings, 2 replies; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 17:52 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 19/03/2013 18:40, Michael R. Hines ha scritto:
> registration scheme would not work with cgroups because we would be 
> attempting to pin zero pages (for no reason) that cgroups has already
> kicked out, which would defeat the purpose of using cgroups.

Yeah, pinning would be a problem.

> So, if I submit a separate patch to fix this, would you guys review it?
> (Using /dev/pagemap).

Sorry about the ignorance, but what is /dev/pagemap? :)

> Unless there is a better idea? Does KVM expose the necessary mappings?

We could have the balloon driver track the pages.  Michael and I did
some initial work a few months ago on extending the virtio-balloon spec
to allow this.  It went nowhere, though.

Still, at this point this is again an RDMA-specific problem, I don't
think it would be that bad if the first iterations of RDMA didn't
support ballooning/overcommit.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:52                           ` Paolo Bonzini
@ 2013-03-19 18:04                             ` Michael R. Hines
  2013-03-20 13:07                             ` Michael S. Tsirkin
  1 sibling, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 18:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 03/19/2013 01:52 PM, Paolo Bonzini wrote:
> So, if I submit a separate patch to fix this, would you guys review it?
> (Using /dev/pagemap).
> Sorry about the ignorance, but what is /dev/pagemap? :)
/dev/pagemap is a recent interface for userland access to the page tables.

https://www.kernel.org/doc/Documentation/vm/pagemap.txt

It would very easily tell you (without extra tracking) which pages
were mapped and which were not mapped.

It should work for both cgroups and ballooning. We've used it before.

- Michael

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 14:22                   ` Paolo Bonzini
  2013-03-19 15:02                     ` [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration? Michael R. Hines
@ 2013-03-19 18:27                     ` Michael R. Hines
  2013-03-19 18:40                       ` Paolo Bonzini
  1 sibling, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-19 18:27 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 03/19/2013 10:22 AM, Paolo Bonzini wrote:
> Il 19/03/2013 15:10, Michael R. Hines ha scritto:
>> On 03/19/2013 09:45 AM, Paolo Bonzini wrote:
>>> This is because of downtime: You have to drain the queue anyway at the
>>> very end, and if you don't drain it in advance after each iteration, then
>>> the queue will have lots of bytes in it waiting for transmission and the
>>> Virtual Machine will be stopped for a much longer period of time during
>>> the last iteration waiting for RDMA card to finish transmission of all
>>> those
>>> bytes.
>>> Shouldn't the "current chunk full" case take care of it too?
>>>
>>> Of course if you disable chunking you have to add a different condition,
>>> perhaps directly into save_rdma_page.
>> No, we don't want to flush on "chunk full" - that has a different meaning.
>> We want to have as many chunks submitted to the hardware for transmission
>> as possible to keep the bytes moving.
> That however gives me an idea...  Instead of the full drain at the end
> of an iteration, does it make sense to do a "partial" drain at every
> chunk full, so that you don't have > N bytes pending and the downtime is
> correspondingly limited?


Sure, you could do that, but it seems overly complex just to avoid
a single flush() call at the end of each iteration, right?

> If there is no RAM migration in flight.  So you have
>
>     migrate RAM
>     ...
>     RAM migration finished, device migration start
>     put_buffer <<<<< QEMUFileRDMA triggers drain
>     put_buffer
>     put_buffer
>     put_buffer
>     ...

Ah, yes, ok. Very simple modification......

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 18:27                     ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA Michael R. Hines
@ 2013-03-19 18:40                       ` Paolo Bonzini
  2013-03-20 15:20                         ` Paolo Bonzini
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-19 18:40 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 19/03/2013 19:27, Michael R. Hines ha scritto:
>>>
>> That however gives me an idea...  Instead of the full drain at the end
>> of an iteration, does it make sense to do a "partial" drain at every
>> chunk full, so that you don't have > N bytes pending and the downtime is
>> correspondingly limited?
> 
> 
> Sure, you could do that, but it seems overly complex just to avoid
> a single flush() call at the end of each iteration, right?

Would it really be that complex?  Not having an extra QEMUFile op
perhaps balances that complexity (and the complexity remains hidden in
rdma.c, which is an advantage).

You could alternatively drain every N megabytes sent, or something like
that.  But a partial drain would help obeying the maximum downtime
limitations.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:52                           ` Paolo Bonzini
  2013-03-19 18:04                             ` Michael R. Hines
@ 2013-03-20 13:07                             ` Michael S. Tsirkin
  2013-03-20 15:15                               ` Michael R. Hines
  1 sibling, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 13:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Tue, Mar 19, 2013 at 06:52:59PM +0100, Paolo Bonzini wrote:
> Il 19/03/2013 18:40, Michael R. Hines ha scritto:
> > registration scheme would not work with cgroups because we would be 
> > attempting to pin zero pages (for no reason) that cgroups has already
> > kicked out, which would defeat the purpose of using cgroups.
> 
> Yeah, pinning would be a problem.
> 
> > So, if I submit a separate patch to fix this, would you guys review it?
> > (Using /dev/pagemap).
> 
> Sorry about the ignorance, but what is /dev/pagemap? :)
> 
> > Unless there is a better idea? Does KVM expose the necessary mappings?
> 
> We could have the balloon driver track the pages.  I and Michael had
> some initial work a few months ago on extending the virtio-balloon spec
> to allow this.  It went nowhere, though.
> 
> Still, at this point this is again an RDMA-specific problem, I don't
> think it would be that bad if the first iterations of RDMA didn't
> support ballooning/overcommit.
> 
> Paolo

My problem is with the protocol. If it assumes at the protocol level
that everything is pinned down on the destination, we'll have to rework
it all to make it really useful.

-- 
MST

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 13:07                             ` Michael S. Tsirkin
@ 2013-03-20 15:15                               ` Michael R. Hines
  2013-03-20 15:22                                 ` Michael R. Hines
  2013-03-20 15:55                                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 15:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

OK, can we make a deal? =)

I'm willing to put in the work to perform the dynamic registration on 
the destination side,
but let's go a step further and piggy-back on the effort:

We need to couple this registration with a very small modification to 
save_ram_block():

Currently, save_ram_block does:

1. is RDMA turned on?      if yes, unconditionally add to next chunk
                                          (will be made to dynamically 
register on destination)
2. is_dup_page() ?            if yes, skip
3. in xbzrle cache?           if yes, skip
4. still not sent?                if yes, transmit

I propose adding a "stub" function that adds:

0. is page mapped?         if yes, skip   (always returns true for now)
1. same
2. same
3. same
4. same

Then, later, in a separate patch, I can implement /dev/pagemap support.

When that's done, RDMA dynamic registration will actually take effect and
benefit from actually verifying that the page is mapped or not.
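(As a sketch, the proposed step 0 could start as nothing more than a stub
that conservatively reports every page as mapped; the function name and
the caller comments below are hypothetical, not from QEMU:)

```c
/* Hypothetical stub for the proposed step 0.  Today it always says
 * "mapped"; a later patch would read the present bit (63) of the
 * page's /proc/self/pagemap entry instead. */
#include <stdbool.h>

bool qemu_page_is_mapped(const void *host_addr)
{
    (void)host_addr;
    return true; /* conservatively treat every page as mapped for now */
}

/* Usage inside the save loop (illustrative only):
 *
 *   if (!qemu_page_is_mapped(p))      // step 0: unmapped -> skip
 *       continue;
 *   // ... existing RDMA chunking / is_dup_page() /
 *   //     xbzrle cache / transmit logic unchanged ...
 */
```

With the stub in place, the save path's behavior is identical until the
pagemap-backed version lands, at which point dynamic registration stops
pinning pages the host never mapped.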

- Michael


On 03/20/2013 09:07 AM, Michael S. Tsirkin wrote:
> My problem is with the protocol. If it assumes at the protocol level 
> that everything is pinned down on the destination, we'll have to 
> rework it all to make it really useful. 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-19 18:40                       ` Paolo Bonzini
@ 2013-03-20 15:20                         ` Paolo Bonzini
  2013-03-20 16:09                           ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Paolo Bonzini @ 2013-03-20 15:20 UTC (permalink / raw)
  Cc: aliguori, mst, qemu-devel, Michael R. Hines, owasserm, abali,
	mrhines, gokul

Il 19/03/2013 19:40, Paolo Bonzini ha scritto:
>>> >> That however gives me an idea...  Instead of the full drain at the end
>>> >> of an iteration, does it make sense to do a "partial" drain at every
>>> >> chunk full, so that you don't have > N bytes pending and the downtime is
>>> >> correspondingly limited?
>> > 
>> > 
>> > Sure, you could do that, but it seems overly complex just to avoid
>> > a single flush() call at the end of each iteration, right?
> Would it really be that complex?  Not having an extra QEMUFile op
> perhaps balances that complexity (and the complexity remains hidden in
> rdma.c, which is an advantage).
> 
> You could alternatively drain every N megabytes sent, or something like
> that.  But a partial drain would help obeying the maximum downtime
> limitations.

On second thought: just keep the drain operation, but make it clear that
it is related to the new save_ram_page QEMUFileOps field.  You could
call it flush_ram_pages or something like that.

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 15:15                               ` Michael R. Hines
@ 2013-03-20 15:22                                 ` Michael R. Hines
  2013-03-20 15:55                                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

s / is page mapped?/ is page unmapped?/ g


On 03/20/2013 11:15 AM, Michael R. Hines wrote:
> OK, can we make a deal? =)
>
> I'm willing to put in the work to perform the dynamic registration on 
> the destination side,
> but let's go a step further and piggy-back on the effort:
>
> We need to couple this registration with a very small modification to 
> save_ram_block():
>
> Currently, save_ram_block does:
>
> 1. is RDMA turned on?      if yes, unconditionally add to next chunk
>                                          (will be made to dynamically 
> register on destination)
> 2. is_dup_page() ?            if yes, skip
> 3. in xbzrle cache?           if yes, skip
> 4. still not sent?                if yes, transmit
>
> I propose adding a "stub" function that adds:
>
> 0. is page mapped?         if yes, skip   (always returns true for now)
> 1. same
> 2. same
> 3. same
> 4. same
>
> Then, later, in a separate patch, I can implement /dev/pagemap support.
>
> When that's done, RDMA dynamic registration will actually take effect and
> benefit from actually verifying that the page is mapped or not.
>
> - Michael
>
>
> On 03/20/2013 09:07 AM, Michael S. Tsirkin wrote:
>> My problem is with the protocol. If it assumes at the protocol level 
>> that everything is pinned down on the destination, we'll have to 
>> rework it all to make it really useful. 
>
>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 15:15                               ` Michael R. Hines
  2013-03-20 15:22                                 ` Michael R. Hines
@ 2013-03-20 15:55                                 ` Michael S. Tsirkin
  2013-03-20 16:08                                   ` Michael R. Hines
  2013-03-20 20:24                                   ` Michael R. Hines
  1 sibling, 2 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 15:55 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 11:15:48AM -0400, Michael R. Hines wrote:
> OK, can we make a deal? =)
> 
> I'm willing to put in the work to perform the dynamic registration
> on the destination side,
> but let's go a step further and piggy-back on the effort:
> 
> We need to couple this registration with a very small modification
> to save_ram_block():
> 
> Currently, save_ram_block does:
> 
> 1. is RDMA turned on?      if yes, unconditionally add to next chunk
>                                          (will be made to
> dynamically register on destination)
> 2. is_dup_page() ?            if yes, skip
> 3. in xbzrle cache?           if yes, skip
> 4. still not sent?                if yes, transmit
> 
> I propose adding a "stub" function that adds:
> 
> 0. is page mapped?         if yes, skip   (always returns true for now)
> 1. same
> 2. same
> 3. same
> 4. same
> 
> Then, later, in a separate patch, I can implement /dev/pagemap support.
> 
> When that's done, RDMA dynamic registration will actually take effect and
> benefit from actually verifying that the page is mapped or not.
> 
> - Michael

Mapped into guest? You mean e.g. for ballooning?


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 15:55                                 ` Michael S. Tsirkin
@ 2013-03-20 16:08                                   ` Michael R. Hines
  2013-03-20 19:06                                     ` Michael S. Tsirkin
  2013-03-20 20:24                                   ` Michael R. Hines
  1 sibling, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini


On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 11:15:48AM -0400, Michael R. Hines wrote:
>> OK, can we make a deal? =)
>>
>> I'm willing to put in the work to perform the dynamic registration
>> on the destination side,
>> but let's go a step further and piggy-back on the effort:
>>
>> We need to couple this registration with a very small modification
>> to save_ram_block():
>>
>> Currently, save_ram_block does:
>>
>> 1. is RDMA turned on?      if yes, unconditionally add to next chunk
>>                                           (will be made to
>> dynamically register on destination)
>> 2. is_dup_page() ?            if yes, skip
>> 3. in xbzrle cache?           if yes, skip
>> 4. still not sent?                if yes, transmit
>>
>> I propose adding a "stub" function that adds:
>>
>> 0. is page mapped?         if yes, skip   (always returns true for now)
>> 1. same
>> 2. same
>> 3. same
>> 4. same
>>
>> Then, later, in a separate patch, I can implement /dev/pagemap support.
>>
>> When that's done, RDMA dynamic registration will actually take effect and
>> benefit from actually verifying that the page is mapped or not.
>>
>> - Michael
> Mapped into guest? You mean e.g. for ballooning?
>

No, not just ballooning. Overcommit (i.e. cgroups).

Anytime cgroups kicks out a page (or anytime the balloon kicks in),
the page would become unmapped.

To make dynamic registration useful, we have to actually have something
in place in the future that knows how to *check* if a page is unmapped
from the virtual machine, either because it has never been dirtied before
(and might be pointing to the zero page) or because it has been madvised()
out or has been detached because of a cgroup limit.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA
  2013-03-20 15:20                         ` Paolo Bonzini
@ 2013-03-20 16:09                           ` Michael R. Hines
  0 siblings, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 16:09 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 03/20/2013 11:20 AM, Paolo Bonzini wrote:
> Il 19/03/2013 19:40, Paolo Bonzini ha scritto:
>>>>>> That however gives me an idea...  Instead of the full drain at the end
>>>>>> of an iteration, does it make sense to do a "partial" drain at every
>>>>>> chunk full, so that you don't have > N bytes pending and the downtime is
>>>>>> correspondingly limited?
>>>>
>>>> Sure, you could do that, but it seems overly complex just to avoid
>>>> a single flush() call at the end of each iteration, right?
>> Would it really be that complex?  Not having an extra QEMUFile op
>> perhaps balances that complexity (and the complexity remains hidden in
>> rdma.c, which is an advantage).
>>
>> You could alternatively drain every N megabytes sent, or something like
>> that.  But a partial drain would help obeying the maximum downtime
>> limitations.
> On second thought: just keep the drain operation, but make it clear that
> it is related to the new save_ram_page QEMUFileOps field.  You could
> call it flush_ram_pages or something like that.
>
> Paolo
>

Acknowledged. This helps a lot, thank you. I'll be sure to
clearly conditionalize everything in the next RFC.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 16:08                                   ` Michael R. Hines
@ 2013-03-20 19:06                                     ` Michael S. Tsirkin
  2013-03-20 20:20                                       ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 19:06 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 12:08:40PM -0400, Michael R. Hines wrote:
> 
> On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> >On Wed, Mar 20, 2013 at 11:15:48AM -0400, Michael R. Hines wrote:
> >>OK, can we make a deal? =)
> >>
> >>I'm willing to put in the work to perform the dynamic registration
> >>on the destination side,
> >>but let's go a step further and piggy-back on the effort:
> >>
> >>We need to couple this registration with a very small modification
> >>to save_ram_block():
> >>
> >>Currently, save_ram_block does:
> >>
> >>1. is RDMA turned on?      if yes, unconditionally add to next chunk
> >>                                          (will be made to
> >>dynamically register on destination)
> >>2. is_dup_page() ?            if yes, skip
> >>3. in xbzrle cache?           if yes, skip
> >>4. still not sent?                if yes, transmit
> >>
> >>I propose adding a "stub" function that adds:
> >>
> >>0. is page mapped?         if yes, skip   (always returns true for now)
> >>1. same
> >>2. same
> >>3. same
> >>4. same
> >>
> >>Then, later, in a separate patch, I can implement /dev/pagemap support.
> >>
> >>When that's done, RDMA dynamic registration will actually take effect and
> >>benefit from actually verifying that the page is mapped or not.
> >>
> >>- Michael
> >Mapped into guest? You mean e.g. for ballooning?
> >
> 
> No, not just ballooning. Overcommit (i.e. cgroups).
> 
> Anytime cgroups kicks out a page (or anytime the balloon kicks in),
> the page would become unmapped.

OK but we still need to send that page to remote.
It's in swap but has guest data in there, you can't
just ignore it.

> The make dynamic registration useful, we have to actually have something
> in place in the future that knows how to *check* if a page is unmapped
> from the virtual machine, either because it has never been dirtied before
> (and might be pointing to the zero page) or because it has been madvised()
> out or has been detatched because of a cgroup limit.
> 
> - Michael
> 


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 19:06                                     ` Michael S. Tsirkin
@ 2013-03-20 20:20                                       ` Michael R. Hines
  2013-03-20 20:31                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 20:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini


On 03/20/2013 03:06 PM, Michael S. Tsirkin wrote:
> No, not just ballooning. Overcommit (i.e. cgroups).
>
> Anytime cgroups kicks out a page (or anytime the balloon kicks in),
> the page would become unmapped.
> OK but we still need to send that page to remote.
> It's in swap but has guest data in there, you can't
> just ignore it.

Yes, absolutely: https://www.kernel.org/doc/Documentation/vm/pagemap.txt

The pagemap will tell you that.

In fact the pagemap ideally would *only* be used for the 1st migration 
round.

The rest of them would depend exclusively on the dirty bitmap as they do.

Basically, we could use the pagemap as first-time "hint" for the bulk of
the memory that costs the most to transmit.



* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 15:55                                 ` Michael S. Tsirkin
  2013-03-20 16:08                                   ` Michael R. Hines
@ 2013-03-20 20:24                                   ` Michael R. Hines
  2013-03-20 20:37                                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 20:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> Then, later, in a separate patch, I can implement /dev/pagemap support.
>
> When that's done, RDMA dynamic registration will actually take effect and
> benefit from actually verifying that the page is mapped or not.
>
> - Michael
> Mapped into guest? You mean e.g. for ballooning?
>

Three scenarios are candidates for mapped checking:

1. Anytime the virtual machine has not yet accessed a page (usually
during first-time boot).
2. Anytime madvise(DONTNEED) happens (for ballooning).
3. Anytime cgroups kicks out a zero page that was accessed and faulted
but never dirtied, which makes it a clean candidate for unmapping.
        (I did a test that seems to confirm that cgroups is pretty
"smart" about that.)

Basically, anytime the pagemap says "this page is *not* in swap and
*not* mapped", the page is not important during the 1st iteration.

On the subsequent iterations, we come along as normal, checking the dirty
bitmap as usual.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:20                                       ` Michael R. Hines
@ 2013-03-20 20:31                                         ` Michael S. Tsirkin
  2013-03-20 20:39                                           ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 20:31 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 04:20:06PM -0400, Michael R. Hines wrote:
> On 03/20/2013 03:06 PM, Michael S. Tsirkin wrote:
> 
>     No, not just ballooning. Overcommit (i.e. cgroups).
> 
>     Anytime cgroups kicks out a page (or anytime the balloon kicks in),
>     the page would become unmapped.
> 
>     OK but we still need to send that page to remote.
>     It's in swap but has guest data in there, you can't
>     just ignore it.
> 
> 
> Yes, absolutely: https://www.kernel.org/doc/Documentation/vm/pagemap.txt
> 
> The pagemap will tell you that.
> 
> In fact the pagemap ideally would *only* be used for the 1st migration round.
> 
> The rest of them would depend exclusively on the dirty bitmap as they do.
> 
> Basically, we could use the pagemap as first-time "hint" for the bulk of
> the memory that costs the most to transmit.

OK sure, this could be useful to detect pages deduplicated by KSM and only
transmit one copy. There's still the question of creating same
duplicate mappings on destination - do you just do data copy on destination?

Not sure why you talk about unmapped pages above though, it seems
not really relevant...

There's also the matter of KSM not touching pinned pages,
that's another good reason not to pin all pages on destination,
they won't be deduplicated.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:24                                   ` Michael R. Hines
@ 2013-03-20 20:37                                     ` Michael S. Tsirkin
  2013-03-20 20:45                                       ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 20:37 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 04:24:14PM -0400, Michael R. Hines wrote:
> On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> >Then, later, in a separate patch, I can implement /dev/pagemap support.
> >
> >When that's done, RDMA dynamic registration will actually take effect and
> >benefit from actually verifying that the page is mapped or not.
> >
> >- Michael
> >Mapped into guest? You mean e.g. for ballooning?
> >
> 
> Three scenarios are candidates for mapped checking:
> 
> 1. anytime the virtual machine has not yet accessed a page (usually
> during the 1st-time boot)

So migrating booting machines is faster now?  Why is this worth
optimizing for?

> 2. Anytime madvise(DONTNEED) happens (for ballooning)

This is likely worth optimizing.
I think a better way to handle this one is by tracking
ballooned state. Just mark these pages as unused in qemu.

> 3.  Anytime cgroups kicks out a zero page that was accessed and
> faulted but not dirty that is a clean candidate for unmapping.
>        (I did a test that seems to confirm that cgroups is pretty
> "smart" about that)
> Basically, anytime the pagemap says "this page is *not* swap and
> *not* mapped
> - then the page is not important during the 1st iteration.
> On the subsequent iterations, we come along as normal checking the
> dirty bitmap as usual.
> 
> - Michael

If it will never be dirty you will never migrate it?
Seems wrong - it could have guest data on disk - AFAIK clean does not
mean no data, it means disk is in sync with memory.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:31                                         ` Michael S. Tsirkin
@ 2013-03-20 20:39                                           ` Michael R. Hines
  2013-03-20 20:46                                             ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 20:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

Agreed. Very useful for KSM.

Unmapped virtual addresses cannot be pinned for RDMA (the hardware will 
break),
but there's no way to know they are unmapped without checking another 
data structure.

- Michael

On 03/20/2013 04:31 PM, Michael S. Tsirkin wrote:
>
> OK sure, this could be useful to detect pages deduplicated by KSM and only
> transmit one copy. There's still the question of creating same
> duplicate mappings on destination - do you just do data copy on destination?
>
> Not sure why you talk about unmapped pages above though, it seems
> not really relevant...
>
> There's also the matter of KSM not touching pinned pages,
> that's another good reason not to pin all pages on destination,
> they won't be deduplicated.
>


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:37                                     ` Michael S. Tsirkin
@ 2013-03-20 20:45                                       ` Michael R. Hines
  2013-03-20 20:52                                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 20:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 03/20/2013 04:37 PM, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 04:24:14PM -0400, Michael R. Hines wrote:
>> On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
>>> Then, later, in a separate patch, I can implement /dev/pagemap support.
>>>
>>> When that's done, RDMA dynamic registration will actually take effect and
>>> benefit from actually verifying that the page is mapped or not.
>>>
>>> - Michael
>>> Mapped into guest? You mean e.g. for ballooning?
>>>
>> Three scenarios are candidates for mapped checking:
>>
>> 1. anytime the virtual machine has not yet accessed a page (usually
>> during the 1st-time boot)
> So migrating booting machines is faster now?  Why is this worth
> optimizing for?
Yes, it helps both the TCP migration and RDMA migration simultaneously.
>
>> 2. Anytime madvise(DONTNEED) happens (for ballooning)
> This is likely worth optimizing.
> I think a better the way to handling this one is by tracking
> ballooned state. Just mark these pages as unused in qemu.

Paolo said somebody attempted that, but stopped work on it for some reason?

>> 3.  Anytime cgroups kicks out a zero page that was accessed and
>> faulted but not dirty that is a clean candidate for unmapping.
>>         (I did a test that seems to confirm that cgroups is pretty
>> "smart" about that)
>> Basically, anytime the pagemap says "this page is *not* swap and
>> *not* mapped
>> - then the page is not important during the 1st iteration.
>> On the subsequent iterations, we come along as normal checking the
>> dirty bitmap as usual.
>>
>> - Michael
> If it will never be dirty you will never migrate it?
> Seems wrong - it could have guest data on disk - AFAIK clean does not
> mean no data, it means disk is in sync with memory.
>

Sorry, yes - that was a mis-statement: clean pages are always mapped (or 
swapped) and would have to
be transmitted at least once.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:39                                           ` Michael R. Hines
@ 2013-03-20 20:46                                             ` Michael S. Tsirkin
  2013-03-20 20:56                                               ` Michael R. Hines
  0 siblings, 1 reply; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 20:46 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 04:39:00PM -0400, Michael R. Hines wrote:
> Unmapped virtual addresses cannot be pinned for RDMA (the hardware
> will break),
> but there's no way to know they are unmapped without checking
> another data structure.

So for RDMA, when you try to register them, this will fault them in.
For regular migration we really should try using vmsplice.  Anyone up to
it? If we do this TCP could outperform RDMA for some workloads ...

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:45                                       ` Michael R. Hines
@ 2013-03-20 20:52                                         ` Michael S. Tsirkin
  0 siblings, 0 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-20 20:52 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 04:45:05PM -0400, Michael R. Hines wrote:
> On 03/20/2013 04:37 PM, Michael S. Tsirkin wrote:
> >On Wed, Mar 20, 2013 at 04:24:14PM -0400, Michael R. Hines wrote:
> >>On 03/20/2013 11:55 AM, Michael S. Tsirkin wrote:
> >>>Then, later, in a separate patch, I can implement /dev/pagemap support.
> >>>
> >>>When that's done, RDMA dynamic registration will actually take effect and
> >>>benefit from actually verifying that the page is mapped or not.
> >>>
> >>>- Michael
> >>>Mapped into guest? You mean e.g. for ballooning?
> >>>
> >>Three scenarios are candidates for mapped checking:
> >>
> >>1. anytime the virtual machine has not yet accessed a page (usually
> >>during the 1st-time boot)
> >So migrating booting machines is faster now?  Why is this worth
> >optimizing for?
> Yes, it helps both the TCP migration and RDMA migration simultaneously.

But for a class of VMs that is only common when you want to
run a benchmark. People do live migration precisely to
avoid the need to reboot the VM.

> >
> >>2. Anytime madvise(DONTNEED) happens (for ballooning)
> >This is likely worth optimizing.
> >I think a better the way to handling this one is by tracking
> >ballooned state. Just mark these pages as unused in qemu.
> 
> Paolo said somebody attempted that, but stopped work on it for some reason?
> 
> >>3.  Anytime cgroups kicks out a zero page that was accessed and
> >>faulted but not dirty that is a clean candidate for unmapping.
> >>        (I did a test that seems to confirm that cgroups is pretty
> >>"smart" about that)
> >>Basically, anytime the pagemap says "this page is *not* swap and
> >>*not* mapped
> >>- then the page is not important during the 1st iteration.
> >>On the subsequent iterations, we come along as normal checking the
> >>dirty bitmap as usual.
> >>
> >>- Michael
> >If it will never be dirty you will never migrate it?
> >Seems wrong - it could have guest data on disk - AFAIK clean does not
> >mean no data, it means disk is in sync with memory.
> >
> 
> Sorry, yes - that was a mis-statement: clean pages are always mapped
> (or swapped) and would have to
> be transmitted at least once.
> 
> - Michael

Right so maybe my idea of looking at the PFNs in pagemap and transmitting
only once could help some VMs (and it would cover the booting VMs as a
partial case), and it could be a useful though linux-specific
optimization, but I don't see how looking at whether page is
mapped would help for TCP.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:46                                             ` Michael S. Tsirkin
@ 2013-03-20 20:56                                               ` Michael R. Hines
  2013-03-21  5:20                                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 73+ messages in thread
From: Michael R. Hines @ 2013-03-20 20:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini


Forgive me, vmsplice system call? Or some other interface?

I'm not following......

On 03/20/2013 04:46 PM, Michael S. Tsirkin wrote:
> On Wed, Mar 20, 2013 at 04:39:00PM -0400, Michael R. Hines wrote:
>> Unmapped virtual addresses cannot be pinned for RDMA (the hardware
>> will break),
>> but there's no way to know they are unmapped without checking
>> another data structure.
> So for RDMA, when you try to register them, this will fault them in.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-20 20:56                                               ` Michael R. Hines
@ 2013-03-21  5:20                                                 ` Michael S. Tsirkin
  0 siblings, 0 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-21  5:20 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Wed, Mar 20, 2013 at 04:56:01PM -0400, Michael R. Hines wrote:
> 
> Forgive me, vmsplice system call? Or some other interface?
> 
> I'm not following......
> 
> On 03/20/2013 04:46 PM, Michael S. Tsirkin wrote:
> >On Wed, Mar 20, 2013 at 04:39:00PM -0400, Michael R. Hines wrote:
> >>Unmapped virtual addresses cannot be pinned for RDMA (the hardware
> >>will break),
> >>but there's no way to know they are unmapped without checking
> >>another data structure.
> >So for RDMA, when you try to register them, this will fault them in.

I'm just saying get_user_pages brings pages back in from swap.


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-19 17:49                         ` Michael R. Hines
@ 2013-03-21  6:11                           ` Michael S. Tsirkin
  2013-03-21 15:22                             ` Michael R. Hines
                                               ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Michael S. Tsirkin @ 2013-03-21  6:11 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Tue, Mar 19, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
> I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)
> 
> So, infiniband is not smart enough to know how to avoid pinning a
> zero page, I guess.
> 
> - Michael
> 
> On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
> >Il 19/03/2013 18:09, Michael R. Hines ha scritto:
> >>Allowing QEMU to swap due to a cgroup limit during migration is a viable
> >>overcommit option?
> >>
> >>I'm trying to keep an open mind, but that would kill the migration
> >>time.....
> >Would it swap?  Doesn't the kernel back all zero pages with a single
> >copy-on-write page?  If that still accounts towards cgroup limits, it
> >would be a bug.
> >
> >Old kernels do not have a shared zero hugepage, and that includes some
> >distro kernels.  Perhaps that's the problem.
> >
> >Paolo
> >

It really shouldn't break COW if you don't request LOCAL_WRITE.
I think it's a kernel bug, and apparently has been there in the code since the
first version: get_user_pages parameters swapped.

I'll send a patch. If it's applied, you should also
change your code from

+                                IBV_ACCESS_LOCAL_WRITE |
+                                IBV_ACCESS_REMOTE_WRITE |
+                                IBV_ACCESS_REMOTE_READ);

to

+                                IBV_ACCESS_REMOTE_READ);

on send side.
Then, each time we detect a page has changed we must make sure to
unregister and re-register it. Or if you want to be very
smart, check that the PFN didn't change and reregister
if it did.

This will make overcommit work.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-21  6:11                           ` Michael S. Tsirkin
@ 2013-03-21 15:22                             ` Michael R. Hines
  2013-04-05 20:45                             ` Michael R. Hines
  2013-04-05 20:46                             ` Michael R. Hines
  2 siblings, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-03-21 15:22 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini


Very nice catch. Yes, I didn't think about that.

Thanks.

On 03/21/2013 02:11 AM, Michael S. Tsirkin wrote:
>
> I really shouldn't break COW if you don't request LOCAL_WRITE.
> I think it's a kernel bug, and apparently has been there in the code since the
> first version: get_user_pages parameters swapped.
>
> I'll send a patch. If it's applied, you should also
> change your code from
>
> +                                IBV_ACCESS_LOCAL_WRITE |
> +                                IBV_ACCESS_REMOTE_WRITE |
> +                                IBV_ACCESS_REMOTE_READ);
>
> to
>
> +                                IBV_ACCESS_REMOTE_READ);
>
> on send side.
> Then, each time we detect a page has changed we must make sure to
> unregister and re-register it. Or if you want to be very
> smart, check that the PFN didn't change and reregister
> if it did.
>
> This will make overcommit work.
>


* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-21  6:11                           ` Michael S. Tsirkin
  2013-03-21 15:22                             ` Michael R. Hines
@ 2013-04-05 20:45                             ` Michael R. Hines
  2013-04-05 20:46                             ` Michael R. Hines
  2 siblings, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-04-05 20:45 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 03/21/2013 02:11 AM, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>> I also did a test using RDMA + cgroup, and the kernel killed my QEMU :)
>>
>> So, infiniband is not smart enough to know how to avoid pinning a
>> zero page, I guess.
>>
>> - Michael
>>
>> On 03/19/2013 01:14 PM, Paolo Bonzini wrote:
>>> Il 19/03/2013 18:09, Michael R. Hines ha scritto:
>>>> Allowing QEMU to swap due to a cgroup limit during migration is a viable
>>>> overcommit option?
>>>>
>>>> I'm trying to keep an open mind, but that would kill the migration
>>>> time.....
>>> Would it swap?  Doesn't the kernel back all zero pages with a single
>>> copy-on-write page?  If that still accounts towards cgroup limits, it
>>> would be a bug.
>>>
>>> Old kernels do not have a shared zero hugepage, and that includes some
>>> distro kernels.  Perhaps that's the problem.
>>>
>>> Paolo
>>>
> I really shouldn't break COW if you don't request LOCAL_WRITE.
> I think it's a kernel bug, and apparently has been there in the code since the
> first version: get_user_pages parameters swapped.
>
> I'll send a patch. If it's applied, you should also
> change your code from
>
> +                                IBV_ACCESS_LOCAL_WRITE |
> +                                IBV_ACCESS_REMOTE_WRITE |
> +                                IBV_ACCESS_REMOTE_READ);
>
> to
>
> +                                IBV_ACCESS_REMOTE_READ);
>
> on send side.
> Then, each time we detect a page has changed we must make sure to
> unregister and re-register it. Or if you want to be very
> smart, check that the PFN didn't change and reregister
> if it did.
>
> This will make overcommit work.
>
Unfortunately RDMA + cgroups still kills QEMU:

I removed the *_WRITE flags and did a test like this:

1. Start QEMU with 2GB ram configured

$ cd /sys/fs/cgroup/memory/libvirt/qemu
$ echo "-1" > memory.memsw.limit_in_bytes
$ echo "-1" > memory.limit_in_bytes
$ echo $(pidof qemu-system-x86_64) > tasks
$ echo 512M > memory.limit_in_bytes              # maximum RSS
$ echo 3G > memory.memsw.limit_in_bytes     # maximum RSS + swap, extra 1G to be safe

2. Start RDMA migration

3. RSS of 512M is reached
4. swap starts filling up
5. the kernel kills QEMU
6. dmesg:

[ 2981.657135] Task in /libvirt/qemu killed as a result of limit of /libvirt/qemu
[ 2981.657140] memory: usage 524288kB, limit 524288kB, failcnt 18031
[ 2981.657143] memory+swap: usage 525460kB, limit 3145728kB, failcnt 0
[ 2981.657146] Mem-Info:
[ 2981.657148] Node 0 DMA per-cpu:
[ 2981.657152] CPU    0: hi:    0, btch:   1 usd:   0
[ 2981.657155] CPU    1: hi:    0, btch:   1 usd:   0
[ 2981.657157] CPU    2: hi:    0, btch:   1 usd:   0
[ 2981.657160] CPU    3: hi:    0, btch:   1 usd:   0
[ 2981.657163] CPU    4: hi:    0, btch:   1 usd:   0
[ 2981.657165] CPU    5: hi:    0, btch:   1 usd:   0
[ 2981.657167] CPU    6: hi:    0, btch:   1 usd:   0
[ 2981.657170] CPU    7: hi:    0, btch:   1 usd:   0
[ 2981.657172] Node 0 DMA32 per-cpu:
[ 2981.657176] CPU    0: hi:  186, btch:  31 usd: 160
[ 2981.657178] CPU    1: hi:  186, btch:  31 usd:  22
[ 2981.657181] CPU    2: hi:  186, btch:  31 usd: 179
[ 2981.657184] CPU    3: hi:  186, btch:  31 usd:   6
[ 2981.657186] CPU    4: hi:  186, btch:  31 usd:  21
[ 2981.657189] CPU    5: hi:  186, btch:  31 usd:  15
[ 2981.657191] CPU    6: hi:  186, btch:  31 usd:  19
[ 2981.657194] CPU    7: hi:  186, btch:  31 usd:  22
[ 2981.657196] Node 0 Normal per-cpu:
[ 2981.657200] CPU    0: hi:  186, btch:  31 usd:  44
[ 2981.657202] CPU    1: hi:  186, btch:  31 usd:  58
[ 2981.657205] CPU    2: hi:  186, btch:  31 usd: 156
[ 2981.657207] CPU    3: hi:  186, btch:  31 usd: 107
[ 2981.657210] CPU    4: hi:  186, btch:  31 usd:  44
[ 2981.657213] CPU    5: hi:  186, btch:  31 usd:  70
[ 2981.657215] CPU    6: hi:  186, btch:  31 usd:  76
[ 2981.657218] CPU    7: hi:  186, btch:  31 usd: 173
[ 2981.657223] active_anon:181703 inactive_anon:68856 isolated_anon:0
[ 2981.657224]  active_file:66881 inactive_file:141056 isolated_file:0
[ 2981.657225]  unevictable:2174 dirty:6 writeback:0 unstable:0
[ 2981.657226]  free:4058168 slab_reclaimable:5152 slab_unreclaimable:10785
[ 2981.657227]  mapped:7709 shmem:192 pagetables:1913 bounce:0
[ 2981.657230] Node 0 DMA free:15896kB min:56kB low:68kB high:84kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2981.657242] lowmem_reserve[]: 0 1966 18126 18126
[ 2981.657249] Node 0 DMA32 free:1990652kB min:7324kB low:9152kB high:10984kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2013280kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2981.657260] lowmem_reserve[]: 0 0 16160 16160
[ 2981.657268] Node 0 Normal free:14226124kB min:60200kB low:75248kB high:90300kB active_anon:726812kB inactive_anon:275424kB active_file:267524kB inactive_file:564224kB unevictable:8696kB isolated(anon):0kB isolated(file):0kB present:16547840kB mlocked:6652kB dirty:24kB writeback:0kB mapped:30832kB shmem:768kB slab_reclaimable:20608kB slab_unreclaimable:43140kB kernel_stack:1784kB pagetables:7652kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 2981.657281] lowmem_reserve[]: 0 0 0 0
[ 2981.657289] Node 0 DMA: 0*4kB 1*8kB 1*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15896kB
[ 2981.657307] Node 0 DMA32: 17*4kB 9*8kB 7*16kB 4*32kB 8*64kB 5*128kB 6*256kB 4*512kB 3*1024kB 6*2048kB 481*4096kB = 1990652kB
[ 2981.657325] Node 0 Normal: 2*4kB 1*8kB 991*16kB 893*32kB 271*64kB 50*128kB 50*256kB 12*512kB 5*1024kB 1*2048kB 3450*4096kB = 14225504kB
[ 2981.657343] 277718 total pagecache pages
[ 2981.657345] 68816 pages in swap cache
[ 2981.657348] Swap cache stats: add 656848, delete 588032, find 19850/22338
[ 2981.657350] Free swap  = 15288376kB
[ 2981.657353] Total swap = 15564796kB
[ 2981.706982] 4718576 pages RAM

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
  2013-03-21  6:11                           ` Michael S. Tsirkin
  2013-03-21 15:22                             ` Michael R. Hines
  2013-04-05 20:45                             ` Michael R. Hines
@ 2013-04-05 20:46                             ` Michael R. Hines
  2 siblings, 0 replies; 73+ messages in thread
From: Michael R. Hines @ 2013-04-05 20:46 UTC (permalink / raw)
  To: qemu-devel

FYI, I used the following Red Hat cgroups instructions to test whether
overcommit + RDMA was working:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

- Michael




Thread overview: 73+ messages
2013-03-18  3:18 [Qemu-devel] [RFC PATCH RDMA support v4: 00/10] cleaner ramblocks and documentation mrhines
2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 01/10] ./configure --enable-rdma mrhines
2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 02/10] check for CONFIG_RDMA mrhines
2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport mrhines
2013-03-18 10:40   ` Michael S. Tsirkin
2013-03-18 20:24     ` Michael R. Hines
2013-03-18 21:26       ` Michael S. Tsirkin
2013-03-18 23:23         ` Michael R. Hines
2013-03-19  8:19           ` Michael S. Tsirkin
2013-03-19 13:21             ` Michael R. Hines
2013-03-19 15:08             ` Michael R. Hines
2013-03-19 15:16               ` Michael S. Tsirkin
2013-03-19 15:32                 ` Michael R. Hines
2013-03-19 15:36                   ` Michael S. Tsirkin
2013-03-19 17:09                     ` Michael R. Hines
2013-03-19 17:14                       ` Paolo Bonzini
2013-03-19 17:23                         ` Michael S. Tsirkin
2013-03-19 17:40                         ` Michael R. Hines
2013-03-19 17:52                           ` Paolo Bonzini
2013-03-19 18:04                             ` Michael R. Hines
2013-03-20 13:07                             ` Michael S. Tsirkin
2013-03-20 15:15                               ` Michael R. Hines
2013-03-20 15:22                                 ` Michael R. Hines
2013-03-20 15:55                                 ` Michael S. Tsirkin
2013-03-20 16:08                                   ` Michael R. Hines
2013-03-20 19:06                                     ` Michael S. Tsirkin
2013-03-20 20:20                                       ` Michael R. Hines
2013-03-20 20:31                                         ` Michael S. Tsirkin
2013-03-20 20:39                                           ` Michael R. Hines
2013-03-20 20:46                                             ` Michael S. Tsirkin
2013-03-20 20:56                                               ` Michael R. Hines
2013-03-21  5:20                                                 ` Michael S. Tsirkin
2013-03-20 20:24                                   ` Michael R. Hines
2013-03-20 20:37                                     ` Michael S. Tsirkin
2013-03-20 20:45                                       ` Michael R. Hines
2013-03-20 20:52                                         ` Michael S. Tsirkin
2013-03-19 17:49                         ` Michael R. Hines
2013-03-21  6:11                           ` Michael S. Tsirkin
2013-03-21 15:22                             ` Michael R. Hines
2013-04-05 20:45                             ` Michael R. Hines
2013-04-05 20:46                             ` Michael R. Hines
2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 04/10] iterators for getting the RAMBlocks mrhines
2013-03-18  8:48   ` Paolo Bonzini
2013-03-18 20:25     ` Michael R. Hines
2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 05/10] reuse function for parsing the QMP 'migrate' string mrhines
2013-03-18  3:18 ` [Qemu-devel] [RFC PATCH RDMA support v4: 06/10] core RDMA migration code (rdma.c) mrhines
2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 07/10] connection-establishment for RDMA mrhines
2013-03-18  8:56   ` Paolo Bonzini
2013-03-18 20:26     ` Michael R. Hines
2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA mrhines
2013-03-18  9:09   ` Paolo Bonzini
2013-03-18 20:33     ` Michael R. Hines
2013-03-19  9:18       ` Paolo Bonzini
2013-03-19 13:12         ` Michael R. Hines
2013-03-19 13:25           ` Paolo Bonzini
2013-03-19 13:40             ` Michael R. Hines
2013-03-19 13:45               ` Paolo Bonzini
2013-03-19 14:10                 ` Michael R. Hines
2013-03-19 14:22                   ` Paolo Bonzini
2013-03-19 15:02                     ` [Qemu-devel] [Bug]? (RDMA-related) ballooned memory not consulted during migration? Michael R. Hines
2013-03-19 15:12                       ` Michael R. Hines
2013-03-19 15:17                         ` Michael S. Tsirkin
2013-03-19 18:27                     ` [Qemu-devel] [RFC PATCH RDMA support v4: 08/10] introduce QEMUFileRDMA Michael R. Hines
2013-03-19 18:40                       ` Paolo Bonzini
2013-03-20 15:20                         ` Paolo Bonzini
2013-03-20 16:09                           ` Michael R. Hines
2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 09/10] check for QMP string and bypass nonblock() calls mrhines
2013-03-18  8:47   ` Paolo Bonzini
2013-03-18 20:37     ` Michael R. Hines
2013-03-19  9:23       ` Paolo Bonzini
2013-03-19 13:08         ` Michael R. Hines
2013-03-19 13:20           ` Paolo Bonzini
2013-03-18  3:19 ` [Qemu-devel] [RFC PATCH RDMA support v4: 10/10] send pc.ram over RDMA mrhines
