* [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design
@ 2013-04-09  3:04 mrhines
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma mrhines
                   ` (13 more replies)
  0 siblings, 14 replies; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Changes since v4:

- Created a "formal" protocol for the RDMA control channel
- Dynamic, chunked page registration now implemented on *both* the server and client
- Created new 'capability' for page registration
- Created new 'capability' for is_zero_page() - enabled by default
  (needed to test dynamic page registration)
- Created version-check before protocol begins at connection-time 
- no more migrate_use_rdma() !

NOTE: While dynamic registration works on both sides now,
      it does *not* work with cgroups swap limits. This functionality
      remains broken with infiniband. (It works fine with TCP.) So, in order
      to take full advantage of this feature, a fix will have to be developed
      on the kernel side. A proposed alternative is to use /proc/<pid>/pagemap; a patch will be submitted.

Contents:
=================================
* Compiling
* Running (please read before running)
* RDMA Protocol Description
* Versioning
* QEMUFileRDMA Interface
* Migration of pc.ram
* Error handling
* TODO
* Performance

COMPILING:
===============================

$ ./configure --enable-rdma --target-list=x86_64-softmmu
$ make

RUNNING:
===============================

First, decide if you want dynamic page registration on the server-side.
This always happens on the primary-VM side, but is optional on the server.
Enabling it allows you to support overcommit (such as cgroups or
ballooning) with a smaller footprint on the server-side, without having
to register the entire VM memory footprint.
NOTE: This significantly slows down performance (by about 30%).

$ virsh qemu-monitor-command --hmp \
    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default

Next, if you decided *not* to use chunked registration on the server,
it is recommended to also disable zero page detection. While this is not
strictly necessary, zero page detection also significantly slows down
performance on higher-throughput links (by about 50%), such as 40 gbps infiniband cards:

$ virsh qemu-monitor-command --hmp \
    --cmd "migrate_set_capability check_for_zero off" # always enabled by default

Next, set the migration speed to match your hardware's capabilities:

$ virsh qemu-monitor-command --hmp \
    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device

Finally, perform the actual migration:

$ virsh migrate domain rdma:xx.xx.xx.xx:port

RDMA Protocol Description:
=================================

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal 
protocol now, consisting of infiniband SEND / RECV messages.

An infiniband SEND message is the standard ibverbs
message used by applications of infiniband hardware.
The only difference between a SEND message and an RDMA
message is that SEND messages cause completion notifications
to be posted to the completion queue (CQ) on the
infiniband receiver side, whereas RDMA messages (used
for pc.ram) do not (so that they behave like an actual DMA).
    
Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND/RECV only) work requests to be posted on both
   sides of the network before the actual transmission
   can occur.

RDMA messages are much easier to deal with. Once the memory
on the receiver side is registered and pinned, we're
basically done. All that is required is for the sender
side to start dumping bytes onto the link.

SEND messages require more coordination because the
receiver must have reserved space (using a receive
work request) on the receive queue (RQ) before QEMUFileRDMA
can start using them to carry all the bytes as
a transport for migration of device state.

To begin the migration, the initial connection setup is
as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt).
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)

At this point, we define a control channel on top of SEND messages
which is described by a formal protocol. Each SEND message has a 
header portion and a data portion (but together are transmitted 
as a single SEND message).

Header:
    * Length  (of the data portion)
    * Type    (what command to perform, described below)
    * Version (protocol version validated before send/recv occurs)

The 'type' field has 7 different command values:
    1. None
    2. Ready             (control-channel is available) 
    3. QEMU File         (for sending non-live device state) 
    4. RAM Blocks        (used right after connection setup)
    5. Register request  (dynamic chunk registration) 
    6. Register result   ('rkey' to be used by sender)
    7. Register finished (registration for current iteration finished)
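The header above can be sketched as a C declaration. This is an illustrative sketch only: the field widths, the enum names, and the network-byte-order layout are assumptions for the example, not QEMU's actual wire format.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command values mirroring the list above. */
enum {
    RDMA_CONTROL_NONE = 0,
    RDMA_CONTROL_READY,
    RDMA_CONTROL_QEMU_FILE,
    RDMA_CONTROL_RAM_BLOCKS,
    RDMA_CONTROL_REGISTER_REQUEST,
    RDMA_CONTROL_REGISTER_RESULT,
    RDMA_CONTROL_REGISTER_FINISHED,
};

typedef struct {
    uint32_t len;     /* length of the data portion */
    uint32_t type;    /* one of the command values above */
    uint32_t version; /* protocol version, validated on every message */
} RDMAControlHeader;

/* Serialize in network byte order so both sides agree on the layout. */
static void control_to_network(const RDMAControlHeader *h, uint8_t *buf)
{
    uint32_t v;
    v = htonl(h->len);     memcpy(buf + 0, &v, 4);
    v = htonl(h->type);    memcpy(buf + 4, &v, 4);
    v = htonl(h->version); memcpy(buf + 8, &v, 4);
}

static void network_to_control(const uint8_t *buf, RDMAControlHeader *h)
{
    uint32_t v;
    memcpy(&v, buf + 0, 4); h->len     = ntohl(v);
    memcpy(&v, buf + 4, 4); h->type    = ntohl(v);
    memcpy(&v, buf + 8, 4); h->version = ntohl(v);
}
```

The header and data portion would then be sent together as the body of a single SEND message.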

After connection setup is completed, we have two protocol-level
functions, responsible for communicating control-channel commands
using the above list of values: 

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that 
   we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another
   RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data): 

1. Block on the CQ event channel waiting for a READY command
   from the receiver to tell us that the receiver
   is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response to the command
   (which we have not yet transmitted), post an RQ
   work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will 
   unblock us and we immediately post a RQ work request
   to replace the one we just used up.
4. Now, we can actually post the work request to SEND
   the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before),
   we block again and wait for that response using the additional
   work request we previously posted. (This is used to carry
   'Register result' commands (#6) back to the sender, which
   hold the rkey needed to perform RDMA.)
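The invariant both functions maintain - never SEND until the peer has a receive work request posted, and re-post each work request as it is consumed - can be modeled with a toy sketch. This is purely illustrative (a counter standing in for the receive queue), not ibverbs code:

```c
/* Toy model of the SEND/RECV discipline above: a SEND only succeeds if
 * the peer has a receive work request (WR) posted, and every exchange
 * re-posts the WR it consumed. */
typedef struct {
    int posted_rq; /* receive work requests currently outstanding */
} Peer;

static void post_recv(Peer *p)
{
    p->posted_rq++;
}

static int do_send(Peer *to)
{
    if (to->posted_rq == 0) {
        return -1;      /* no RQ entry posted: the SEND would fail */
    }
    to->posted_rq--;    /* the SEND consumes exactly one WR */
    post_recv(to);      /* replace the WR we just used up */
    return 0;
}
```

This is why both sides post two RQ work requests during connection setup: one for the next expected command, one spare so a replacement can always be posted before the next receive.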

All of the remaining command types (not including 'Ready')
described above use the aforementioned two functions to do the hard work:

1. After connection setup, RAMBlock information is exchanged using
   this protocol before the actual migration begins.
2. During runtime, once a 'chunk' becomes full of pages ready to
   be sent with RDMA, the registration commands are used to ask the
   other side to register the memory for this chunk and respond
   with the result (rkey) of the registration.
3. The QEMUFile interfaces (described below) also call these functions
   when transmitting non-live state, such as device state, or to send
   their own protocol information during the migration process.

Versioning
==================================

librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.

This is a convenient place to check for protocol versioning because the
user does not need to register memory to transmit a few bytes of version
information.

This is also a convenient place to negotiate capabilities
(like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.
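That check can be sketched as follows; the constant names and capability bit assignments are hypothetical, chosen for the example:

```c
#include <stdint.h>

/* Hypothetical protocol version and capability bits. */
#define RDMA_PROTOCOL_VERSION   1u
#define RDMA_CAP_CHUNK_REGISTER (1u << 0)  /* dynamic page registration */
#define RDMA_CAP_ZERO_PAGE      (1u << 1)  /* is_zero_page() detection  */

/* Returns -1 for an invalid version; otherwise keeps only the
 * capabilities both sides advertise, silently ignoring unknown bits. */
static int negotiate(uint32_t remote_version, uint32_t remote_caps,
                     uint32_t local_caps, uint32_t *agreed)
{
    if (remote_version != RDMA_PROTOCOL_VERSION) {
        return -1;
    }
    *agreed = remote_caps & local_caps;
    return 0;
}
```

Because this runs inside librdmacm's private-data exchange, no memory registration is needed before the two sides agree.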

QEMUFileRDMA Interface:
==================================

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol
described above to deliver bytes without changing the upper-level
users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we handoff the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction
using an analogy not unlike individual UDP frames, we have
to hold on to the bytes received from control-channel's SEND 
messages in memory.

Each time we receive a complete "QEMU File" control-channel 
message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer()
and leave the remaining bytes in the holding area until get_buffer()
comes around for another pass.

If the buffer is empty, then we follow the same steps
listed above and issue another "QEMU File" protocol command,
asking for a new SEND message to re-fill the buffer.
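The holding-area behaviour described above can be sketched in a few lines; the buffer size and function names here are illustrative, not QEMU's:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes from a completed "QEMU File" SEND are cached here, and
 * get_buffer() drains them over as many passes as the caller needs. */
typedef struct {
    uint8_t data[4096];
    size_t  len;   /* bytes currently held */
    size_t  pos;   /* next byte to hand out */
} HoldingArea;

/* Called when a complete "QEMU File" control message arrives. */
static void holding_fill(HoldingArea *h, const uint8_t *buf, size_t len)
{
    memcpy(h->data, buf, len);
    h->len = len;
    h->pos = 0;
}

/* Serve get_buffer(): hand out up to 'want' bytes, leave the rest. */
static size_t holding_get(HoldingArea *h, uint8_t *out, size_t want)
{
    size_t avail = h->len - h->pos;
    size_t n = want < avail ? want : avail;
    memcpy(out, h->data + h->pos, n);
    h->pos += n;
    return n; /* 0 means the caller must request another SEND */
}
```

A return of zero is the cue to issue another "QEMU File" protocol command and refill the area.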

Migration of pc.ram:
===============================

At the beginning of the migration (migration-rdma.c),
the sender and the receiver populate a structure with the list
of RAMBlocks to be registered with each other.
Then, using the aforementioned protocol, they exchange a
description of these blocks with each other, to be used later
during the iteration over main memory. This description includes
a list of all the RAMBlocks, their offsets and lengths, and,
if dynamic page registration was disabled on the server-side,
their pre-registered RDMA keys.

Main memory is not migrated with the aforementioned protocol, 
but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (about 1 Megabyte right now).
Chunk size is not dynamic, but it could be in a future implementation.
There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by 
the chunk is registered with librdmacm and pinned in memory on 
both sides using the aforementioned protocol.

After pinning, an RDMA Write is generated and transmitted
for the entire chunk.

Chunks are also transmitted in batches: This means that we
do not request that the hardware signal the completion queue
for the completion of *every* chunk. The current batch size
is about 64 chunks (corresponding to 64 MB of memory).
Only the last chunk in a batch must be signaled.
This helps keep everything as asynchronous as possible
and helps keep the hardware busy performing RDMA operations.
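The chunk and batch arithmetic above amounts to a couple of one-liners; the sizes match the text, but the helper names are made up for the example:

```c
#include <stdint.h>

#define CHUNK_SIZE (1024 * 1024)  /* ~1 MB chunks, per the text        */
#define BATCH_SIZE 64             /* ~64 chunks (~64 MB) per batch     */

/* Which chunk does a given offset within a RAMBlock fall into? */
static uint64_t chunk_index(uint64_t offset)
{
    return offset / CHUNK_SIZE;
}

/* Only the last chunk of each batch asks the hardware to post a
 * completion; all other RDMA Writes in the batch go unsignaled. */
static int chunk_is_signaled(uint64_t chunk)
{
    return (chunk + 1) % BATCH_SIZE == 0;
}
```

Waiting on the signaled chunk's completion implicitly confirms the 63 unsignaled writes before it, since the Reliable Connected queue pair preserves ordering.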

Error-handling:
===============================

Infiniband has what is called a "Reliable, Connected"
link (one of 4 choices). This is the mode
we use for RDMA migration.

If a *single* message fails,
the decision is to abort the migration entirely,
clean up all the RDMA descriptors, and unregister all
the memory.

After cleanup, the Virtual Machine is returned to normal
operation the same way that would happen if the TCP
socket is broken during a non-RDMA based migration.

TODO:
=================================
1. Currently, cgroups swap limits for *both* TCP and RDMA
   on the sender-side are broken. This is more pronounced for
   RDMA because RDMA requires memory registration.
   Fixing this requires infiniband page registrations to be
   zero-page aware, and this does not yet work properly.
2. Currently, overcommit for the *receiver* side of
   TCP works, but not for RDMA. While dynamic page registration
   *does* work, it is only useful if the is_zero_page() capability
   remains enabled (which it is by default).
   However, leaving this capability turned on *significantly* slows
   down RDMA throughput, particularly on hardware capable
   of transmitting faster than 10 gbps (such as 40 gbps links).
3. Use of the recent /proc/<pid>/pagemap interface would likely solve
   some of these problems.
4. Some form of balloon-device usage tracking would also
   help alleviate some of these issues.

PERFORMANCE
===================

Using a 40 gbps infiniband link, performing a worst-case stress test:

1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep :
   average worst-case throughput of approximately 30 gbps
   (a little better than the paper).
2. TCP throughput with the same stress test:
   approximately 8 gbps (using IPoIB, IP over Infiniband).

Average downtime (stop time) ranges between 28 and 33 milliseconds.

An *exhaustive* paper (2010) shows additional performance details
linked on the QEMU wiki:

http://wiki.qemu.org/Features/RDMALiveMigration


* [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 17:05   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 02/12] check for CONFIG_RDMA mrhines
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 configure |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/configure b/configure
index 3738de4..127a299 100755
--- a/configure
+++ b/configure
@@ -180,6 +180,7 @@ xfs=""
 
 vhost_net="no"
 kvm="no"
+rdma="no"
 gprof="no"
 debug_tcg="no"
 debug="no"
@@ -918,6 +919,10 @@ for opt do
   ;;
   --enable-gtk) gtk="yes"
   ;;
+  --enable-rdma) rdma="yes"
+  ;;
+  --disable-rdma) rdma="no"
+  ;;
   --with-gtkabi=*) gtkabi="$optarg"
   ;;
   --enable-tpm) tpm="yes"
@@ -1122,6 +1127,8 @@ echo "  --enable-bluez           enable bluez stack connectivity"
 echo "  --disable-slirp          disable SLIRP userspace network connectivity"
 echo "  --disable-kvm            disable KVM acceleration support"
 echo "  --enable-kvm             enable KVM acceleration support"
+echo "  --disable-rdma           disable RDMA-based migration support"
+echo "  --enable-rdma            enable RDMA-based migration support"
 echo "  --enable-tcg-interpreter enable TCG with bytecode interpreter (TCI)"
 echo "  --disable-nptl           disable usermode NPTL support"
 echo "  --enable-nptl            enable usermode NPTL support"
@@ -1767,6 +1774,18 @@ EOF
   libs_softmmu="$sdl_libs $libs_softmmu"
 fi
 
+if test "$rdma" = "yes" ; then
+  cat > $TMPC <<EOF
+#include <rdma/rdma_cma.h>
+int main(void) { return 0; }
+EOF
+  rdma_libs="-lrdmacm -libverbs"
+  if ! compile_prog "" "$rdma_libs" ; then
+      feature_not_found "rdma"
+  fi
+    
+fi
+
 ##########################################
 # VNC TLS/WS detection
 if test "$vnc" = "yes" -a \( "$vnc_tls" != "no" -o "$vnc_ws" != "no" \) ; then
@@ -3408,6 +3427,7 @@ echo "Linux AIO support $linux_aio"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
+echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
@@ -4377,6 +4397,11 @@ if [ "$pixman" = "internal" ]; then
   echo "config-host.h: subdir-pixman" >> $config_host_mak
 fi
 
+if test "$rdma" = "yes" ; then
+echo "CONFIG_RDMA=y" >> $config_host_mak
+echo "LIBS+=$rdma_libs" >> $config_host_mak
+fi
+
 # build tree in object directory in case the source is not in the current directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas"
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 02/12] check for CONFIG_RDMA
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 16:46   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation mrhines
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Make both rdma.c and migration-rdma.c conditionally built.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 Makefile.objs |    1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile.objs b/Makefile.objs
index e568c01..32f39d3 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -49,6 +49,7 @@ common-obj-$(CONFIG_POSIX) += os-posix.o
 common-obj-$(CONFIG_LINUX) += fsdev/
 
 common-obj-y += migration.o migration-tcp.o
+common-obj-$(CONFIG_RDMA) += migration-rdma.o rdma.o
 common-obj-y += qemu-char.o #aio.o
 common-obj-y += block-migration.o
 common-obj-y += page_cache.o xbzrle.o
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma mrhines
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 02/12] check for CONFIG_RDMA mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-10  5:27   ` Michael S. Tsirkin
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 04/12] introduce qemu_ram_foreach_block() mrhines
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Both the protocol and interfaces are elaborated in more detail,
including the new use of dynamic chunk registration, versioning,
and capabilities negotiation.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 313 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..e9fa4cd
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,313 @@
+Several changes since v4:
+
+- Created a "formal" protocol for the RDMA control channel
+- Dynamic, chunked page registration now implemented on *both* the server and client
+- Created new 'capability' for page registration
+- Created new 'capability' for is_zero_page() - enabled by default
+  (needed to test dynamic page registration)
+- Created version-check before protocol begins at connection-time 
+- no more migrate_use_rdma() !
+
+NOTE: While dynamic registration works on both sides now,
+      it does *not* work with cgroups swap limits. This functionality
+      remains broken with infiniband. (It works fine with TCP.) So, in order
+      to take full advantage of this feature, a fix will have to be developed
+      on the kernel side. A proposed alternative is to use /proc/<pid>/pagemap; a patch will be submitted.
+
+Contents:
+=================================
+* Compiling
+* Running (please read before running)
+* RDMA Protocol Description
+* Versioning
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+COMPILING:
+===============================
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+$ make
+
+RUNNING:
+===============================
+
+First, decide if you want dynamic page registration on the server-side.
+This always happens on the primary-VM side, but is optional on the server.
+Enabling it allows you to support overcommit (such as cgroups or
+ballooning) with a smaller footprint on the server-side, without having
+to register the entire VM memory footprint.
+NOTE: This significantly slows down performance (by about 30%).
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
+
+Next, if you decided *not* to use chunked registration on the server,
+it is recommended to also disable zero page detection. While this is not
+strictly necessary, zero page detection also significantly slows down
+performance on higher-throughput links (by about 50%), such as 40 gbps infiniband cards:
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
+
+Next, set the migration speed to match your hardware's capabilities:
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+RDMA Protocol Description:
+=================================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal 
+protocol now, consisting of infiniband SEND / RECV messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (so that they behave like an actual DMA).
+    
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND/RECV only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt).
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a 
+header portion and a data portion (but together are transmitted 
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion)
+    * Type    (what command to perform, described below)
+    * Version (protocol version validated before send/recv occurs)
+
+The 'type' field has 7 different command values:
+    1. None
+    2. Ready             (control-channel is available) 
+    3. QEMU File         (for sending non-live device state) 
+    4. RAM Blocks        (used right after connection setup)
+    5. Register request  (dynamic chunk registration) 
+    6. Register result   ('rkey' to be used by sender)
+    7. Register finished (registration for current iteration finished)
+
+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values: 
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type)
+
+1. We transmit a READY command to let the sender know that 
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received match the ones we expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data): 
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response to the command
+   (which we have not yet transmitted), post an RQ
+   work request to receive that data a few moments later.
+3. When the READY arrives, librdmacm will 
+   unblock us and we immediately post a RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands (#6) back to the sender, which
+   hold the rkey needed to perform RDMA.)
+
+All of the remaining command types (not including 'Ready')
+described above use the aforementioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. The QEMUFile interfaces (described below) also call these functions
+   when transmitting non-live state, such as device state, or to send
+   their own protocol information during the migration process.
+
+Versioning
+==================================
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+This is a convenient place to check for protocol versioning because the
+user does not need to register memory to transmit a few bytes of version
+information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
+
+QEMUFileRDMA Interface:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply use the protocol
+described above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+Finally, how do we handoff the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND 
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel 
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver populate a structure with the list
+of RAMBlocks to be registered with each other.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later
+during the iteration over main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths, and,
+if dynamic page registration was disabled on the server-side,
+their pre-registered RDMA keys.
+
+Main memory is not migrated with the aforementioned protocol, 
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (about 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by 
+the chunk is registered with librdmacm and pinned in memory on 
+both sides using the aforementioned protocol.
+
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely,
+clean up all the RDMA descriptors, and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
+
+TODO:
+=================================
+1. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side are broken. This is more pronounced for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+2. Currently, overcommit for the *receiver* side of
+   TCP works, but not for RDMA. While dynamic page registration
+   *does* work, it is only useful if the is_zero_page() capability
+   remains enabled (which it is by default).
+   However, leaving this capability turned on *significantly* slows
+   down RDMA throughput, particularly on hardware capable
+   of transmitting faster than 10 gbps (such as 40 gbps links).
+3. Use of the recent /proc/<pid>/pagemap interface would likely solve
+   some of these problems.
+4. Some form of balloon-device usage tracking would also
+   help alleviate some of these issues.
+
+PERFORMANCE
+===================
+
+Using a 40 gbps infiniband link, performing a worst-case stress test:
+
+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep :
+   average worst-case throughput of approximately 30 gbps
+   (a little better than the paper).
+2. TCP throughput with the same stress test:
+   approximately 8 gbps (using IPoIB, IP over Infiniband).
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 04/12] introduce qemu_ram_foreach_block()
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (2 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 05/12] core RDMA migration logic w/ new protocol mrhines
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 exec.c                    |    9 +++++++++
 include/exec/cpu-common.h |    5 +++++
 2 files changed, 14 insertions(+)

diff --git a/exec.c b/exec.c
index 786987a..5d284fc 100644
--- a/exec.c
+++ b/exec.c
@@ -2631,3 +2631,12 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
              memory_region_is_romd(section->mr));
 }
 #endif
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
+{
+    RAMBlock *block;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        func(block->host, block->offset, block->length, opaque);
+    }
+}
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 2e5f11f..88cb741 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -119,6 +119,11 @@ extern struct MemoryRegion io_mem_rom;
 extern struct MemoryRegion io_mem_unassigned;
 extern struct MemoryRegion io_mem_notdirty;
 
+typedef void  (RAMBlockIterFunc)(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque); 
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
+
 #endif
 
 #endif /* !CPU_COMMON_H */
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 05/12] core RDMA migration logic w/ new protocol
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (3 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 04/12] introduce qemu_ram_foreach_block() mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 16:57   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 06/12] connection-establishment for RDMA mrhines
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Code well-commented throughout.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/rdma.h |   82 ++
 rdma.c                   | 2413 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 2495 insertions(+)
 create mode 100644 include/migration/rdma.h
 create mode 100644 rdma.c

diff --git a/include/migration/rdma.h b/include/migration/rdma.h
new file mode 100644
index 0000000..c08db4c
--- /dev/null
+++ b/include/migration/rdma.h
@@ -0,0 +1,82 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  RDMA data structures and helper functions (for migration)
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef _RDMA_H
+#define _RDMA_H
+
+#include "config-host.h"
+#ifdef CONFIG_RDMA 
+#include <rdma/rdma_cma.h>
+#endif
+#include "monitor/monitor.h"
+#include "exec/cpu-common.h"
+#include "migration/migration.h"
+
+#define Mbps(bytes, ms) ((double) bytes * 8.0 / ((double) ms / 1000.0)) \
+                                / 1000.0 / 1000.0
+
+extern const QEMUFileOps rdma_read_ops;
+extern const QEMUFileOps rdma_write_ops;
+
+#ifdef CONFIG_RDMA
+
+void qemu_rdma_disable(void *opaque);
+void qemu_rdma_cleanup(void *opaque);
+int qemu_rdma_client_init(void *opaque, Error **errp,
+            bool chunk_register_destination);
+int qemu_rdma_connect(void *opaque, Error **errp);
+void *qemu_rdma_data_init(const char *host_port, Error **errp);
+int qemu_rdma_server_init(void *opaque, Error **errp);
+int qemu_rdma_server_prepare(void *opaque, Error **errp);
+int qemu_rdma_drain_cq(QEMUFile *f);
+int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, 
+                            int64_t pos, int size);
+int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size);
+int qemu_rdma_close(void *opaque);
+size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, 
+            ram_addr_t offset, int cont, size_t size, bool zero);
+void *qemu_fopen_rdma(void *opaque, const char * mode);
+int qemu_rdma_get_fd(void *opaque);
+int qemu_rdma_accept(void *opaque);
+void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp);
+void rdma_start_incoming_migration(const char * host_port, Error **errp);
+int qemu_rdma_handle_registrations(QEMUFile *f);
+int qemu_rdma_finish_registrations(QEMUFile *f);
+
+#else /* !defined(CONFIG_RDMA) */
+#define NOT_CONFIGURED() do { printf("WARN: RDMA is not configured\n"); } while(0)
+#define qemu_rdma_cleanup(...) NOT_CONFIGURED()
+#define qemu_rdma_data_init(...) NOT_CONFIGURED() 
+#define rdma_start_outgoing_migration(...) NOT_CONFIGURED()
+#define rdma_start_incoming_migration(...) NOT_CONFIGURED()
+#define qemu_rdma_handle_registrations(...) 0
+#define qemu_rdma_finish_registrations(...) 0
+#define qemu_rdma_get_buffer NULL
+#define qemu_rdma_put_buffer NULL
+#define qemu_rdma_close NULL
+#define qemu_fopen_rdma(...) NULL
+#define qemu_rdma_client_init(...) -1 
+#define qemu_rdma_client_connect(...) -1 
+#define qemu_rdma_server_init(...) -1 
+#define qemu_rdma_server_prepare(...) -1 
+#define qemu_rdma_drain_cq(...) -1 
+#define save_rdma_page(...) -ENOTSUP
+
+#endif /* CONFIG_RDMA */
+
+#endif
diff --git a/rdma.c b/rdma.c
new file mode 100644
index 0000000..7246b86
--- /dev/null
+++ b/rdma.c
@@ -0,0 +1,2413 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2010 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  RDMA protocol and interfaces
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "migration/rdma.h"
+#include "migration/qemu-file.h"
+#include "qemu-common.h"
+#include "migration/migration.h"
+#include "exec/cpu-common.h"
+#include "qemu/sockets.h"
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+//#define DEBUG_RDMA
+//#define DEBUG_RDMA_VERBOSE
+
+#ifdef DEBUG_RDMA
+#define DPRINTF(fmt, ...) \
+    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#ifdef DEBUG_RDMA_VERBOSE
+#define DDPRINTF(fmt, ...) \
+    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#define RDMA_RESOLVE_TIMEOUT_MS 10000
+
+#define RDMA_CHUNK_REGISTRATION
+
+#define RDMA_LAZY_CLIENT_REGISTRATION
+
+/* Do not merge data if larger than this. */
+#define RDMA_MERGE_MAX (4 * 1024 * 1024)
+#define RDMA_UNSIGNALED_SEND_MAX 64
+
+#define RDMA_REG_CHUNK_SHIFT 20
+#define RDMA_REG_CHUNK_SIZE (1UL << (RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_CHUNK_INDEX(start_addr, host_addr) \
+            (((unsigned long)(host_addr) >> RDMA_REG_CHUNK_SHIFT) - \
+            ((unsigned long)(start_addr) >> RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_NUM_CHUNKS(rdma_ram_block) \
+            (RDMA_REG_CHUNK_INDEX((rdma_ram_block)->local_host_addr,\
+                (rdma_ram_block)->local_host_addr +\
+                (rdma_ram_block)->length) + 1)
+#define RDMA_REG_CHUNK_START(rdma_ram_block, i) ((uint8_t *)\
+            ((((unsigned long)((rdma_ram_block)->local_host_addr) >> \
+                RDMA_REG_CHUNK_SHIFT) + (i)) << \
+                RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_CHUNK_END(rdma_ram_block, i) \
+            (RDMA_REG_CHUNK_START(rdma_ram_block, i) + \
+             RDMA_REG_CHUNK_SIZE)
+
+/*
+ * This is only for non-live state being migrated.
+ * Instead of RDMA_WRITE messages, we use RDMA_SEND
+ * messages for that state, which requires a different
+ * delivery design than main memory.
+ */
+#define RDMA_SEND_INCREMENT 32768
+
+#define RDMA_BLOCKING
+/*
+ * Completion queue can be filled by both read and write work requests, 
+ * so must reflect the sum of both possible queue sizes.
+ */
+#define RDMA_QP_SIZE 1000
+#define RDMA_CQ_SIZE (RDMA_QP_SIZE * 3)
+
+/*
+ * Maximum size infiniband SEND message
+ */
+#define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
+#define RDMA_CONTROL_MAX_WR 2
+
+/*
+ * RDMA migration protocol:
+ * 1. RDMA Writes (data messages, i.e. RAM)
+ * 2. IB Send/Recv (control channel messages)
+ */
+enum {
+    RDMA_WRID_NONE = 0,
+    RDMA_WRID_RDMA_WRITE,
+    RDMA_WRID_SEND_CONTROL = 1000,
+    RDMA_WRID_RECV_CONTROL = 2000,
+};
+
+const char * wrid_desc[] = {
+        [RDMA_WRID_NONE] = "NONE",
+        [RDMA_WRID_RDMA_WRITE] = "WRITE RDMA",
+        [RDMA_WRID_SEND_CONTROL] = "CONTROL SEND",
+        [RDMA_WRID_RECV_CONTROL] = "CONTROL RECV",
+};
+
+/*
+ * SEND/RECV IB Control Messages.
+ */ 
+enum {
+    RDMA_CONTROL_NONE = 0,
+    RDMA_CONTROL_READY,             /* ready to receive */
+    RDMA_CONTROL_QEMU_FILE,         /* QEMUFile-transmitted bytes */
+    RDMA_CONTROL_RAM_BLOCKS,       /* RAMBlock synchronization */
+    RDMA_CONTROL_REGISTER_REQUEST,  /* dynamic page registration */
+    RDMA_CONTROL_REGISTER_RESULT,   /* key to use after registration */
+    RDMA_CONTROL_REGISTER_FINISHED, /* current iteration finished */
+};
+
+const char * control_desc[] = {
+        [RDMA_CONTROL_NONE] = "NONE",
+        [RDMA_CONTROL_READY] = "READY",
+        [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
+        [RDMA_CONTROL_RAM_BLOCKS] = "REMOTE INFO",
+        [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
+        [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
+        [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
+};
+
+/*
+ * Memory and MR structures used to represent an IB Send/Recv work request.
+ * This is *not* used for RDMA, only IB Send/Recv.
+ */
+typedef struct {
+    uint8_t  control[RDMA_CONTROL_MAX_BUFFER]; /* actual buffer to register */
+    struct   ibv_mr *control_mr;               /* registration metadata */
+    size_t   control_len;                      /* length of the message */
+    uint8_t *control_curr;                     /* start of unconsumed bytes */
+} RDMAWorkRequestData;
+
+/*
+ * Negotiate RDMA capabilities during connection-setup time.
+ */
+typedef struct {
+    int len;
+    int version;
+    int chunk_register_destination;
+} RDMACapabilities;
+
+/*
+ * Main data structure for RDMA state.
+ * While there is only one copy of this structure being allocated right now,
+ * this is the place to start if one wanted to support
+ * having more than one RDMA connection open at the same time.
+ */
+typedef struct RDMAContext {
+    char *host;
+    int port;
+
+    /* This is used by the migration protocol to transmit
+     * control messages (such as device state and registration commands)
+     * 
+     * WR #0 is for control channel ready messages from the server.
+     * WR #1 is for control channel data messages from the server. 
+     * WR #2 is for control channel send messages.
+     *
+     * We could use more WRs, but we have enough for now.
+     */
+    RDMAWorkRequestData wr_data[RDMA_CONTROL_MAX_WR + 1];
+
+    /* 
+     * This is used by *_exchange_send() to figure out whether
+     * the initial "READY" message has already been received.
+     * This is because other functions may potentially poll() and detect
+     * the READY message before send() does, in which case we need to
+     * know if it completed.
+     */
+    int control_ready_expected;
+
+    /* The rest is only for the initiator of the migration. */
+    int client_init_done;
+
+    /* number of outstanding unsignaled send */
+    int num_unsignaled_send;
+
+    /* number of outstanding signaled send */
+    int num_signaled_send;
+
+    /* store info about current buffer so that we can
+       merge it with future sends */
+    uint64_t current_offset;
+    uint64_t current_length;
+    /* index of ram block the current buffer belongs to */
+    int current_index;
+    /* index of the chunk in the current ram block */
+    int current_chunk;
+
+    int chunk_register_destination;
+
+    /* 
+     * infiniband-specific variables for opening the device
+     * and maintaining connection state and so forth.
+     * 
+     * cm_id also has ibv_context, rdma_event_channel, and ibv_qp in
+     * cm_id->verbs, cm_id->channel, and cm_id->qp.
+     */
+    struct rdma_cm_id *cm_id;               /* connection manager ID */
+    struct rdma_cm_id *listen_id; 
+
+    struct ibv_context *verbs;
+    struct rdma_event_channel *channel;
+    struct ibv_qp *qp;                      /* queue pair */
+    struct ibv_comp_channel *comp_channel;  /* completion channel */
+    struct ibv_pd *pd;                      /* protection domain */
+    struct ibv_cq *cq;                      /* completion queue */
+} RDMAContext;
+
+/*
+ * Interface to the rest of the migration call stack. 
+ */
+typedef struct QEMUFileRDMA
+{
+    RDMAContext *rdma;
+    size_t len;
+    void *file;
+} QEMUFileRDMA;
+
+/*
+ * Representation of a RAMBlock from an RDMA perspective.
+ * This and subsequent structures cannot be linked lists
+ * because we're using a single IB message to transmit
+ * the information. It's small anyway, so a list is overkill.
+ */
+typedef struct RDMALocalBlock {
+    uint8_t  *local_host_addr; /* local virtual address */
+    uint64_t remote_host_addr; /* remote virtual address */
+    uint64_t offset;
+    uint64_t length;
+    struct   ibv_mr **pmr;     /* MRs for chunk-level registration */
+    struct   ibv_mr *mr;       /* MR for non-chunk-level registration */
+    uint32_t *remote_keys;     /* rkeys for chunk-level registration */ 
+    uint32_t remote_rkey;      /* rkeys for non-chunk-level registration */
+} RDMALocalBlock;
+
+/*
+ * Also represents a RAMBlock, but only on the server.
+ * This gets transmitted by the server during connection-time 
+ * to the client / primary VM and then is used to populate the 
+ * corresponding RDMALocalBlock with
+ * the information needed to perform the actual RDMA.
+ *
+ */
+typedef struct RDMARemoteBlock {
+    uint64_t remote_host_addr;
+    uint64_t offset;
+    uint64_t length;
+    uint32_t remote_rkey;
+} RDMARemoteBlock;
+
+/*
+ * Virtual address of the above structures used for transmitting
+ * the RAMBlock descriptions at connection-time.
+ */
+typedef struct RDMALocalBlocks {
+    int num_blocks;
+    RDMALocalBlock *block;
+} RDMALocalBlocks;
+
+/*
+ * Same as above
+ */
+typedef struct RDMARemoteBlocks {
+    int * num_blocks;
+    RDMARemoteBlock *block;
+    void * remote_area;
+    int remote_size;
+} RDMARemoteBlocks;
+
+#define RDMA_CONTROL_VERSION_1      1
+//#define RDMA_CONTROL_VERSION_2      2  /* next version */
+#define RDMA_CONTROL_VERSION_MAX    1
+#define RDMA_CONTROL_VERSION_MIN    1    /* change on next version */
+
+#define RDMA_CONTROL_CURRENT_VERSION RDMA_CONTROL_VERSION_1
+
+/*
+ * Main structure for IB Send/Recv control messages.
+ * This gets prepended at the beginning of every Send/Recv.
+ */
+typedef struct {
+    uint64_t    len;
+    uint32_t    type;
+    uint32_t    version;
+} RDMAControlHeader;
+
+/*
+ * Register a single Chunk.
+ * Information sent by the primary VM to inform the server
+ * to register a single chunk of memory before we can perform
+ * the actual RDMA operation.
+ */
+typedef struct {
+    size_t   len;              /* length of the chunk to be registered */
+    int      current_index;    /* which ramblock the chunk belongs to */
+    uint64_t offset;           /* offset into the ramblock of the chunk */
+} RDMARegister;
+
+/*
+ * The server's memory registration produces an "rkey"
+ * which the primary VM must reference in order to perform
+ * the RDMA operation.
+ */
+typedef struct {
+    uint32_t rkey;
+} RDMARegisterResult;
+
+#define RDMAControlHeaderSize sizeof(RDMAControlHeader)
+
+RDMALocalBlocks local_ram_blocks;
+RDMARemoteBlocks remote_ram_blocks;
+
+/*
+ * Memory regions need to be registered with the device and queue pairs set up
+ * in advance before the migration starts. This tells us where the RAM blocks
+ * are so that we can register them individually.
+ */
+static void qemu_rdma_init_one_block(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    RDMALocalBlocks *rdma_local_ram_blocks = opaque;
+    int num_blocks = rdma_local_ram_blocks->num_blocks;
+
+    rdma_local_ram_blocks->block[num_blocks].local_host_addr = host_addr;
+    rdma_local_ram_blocks->block[num_blocks].offset = (uint64_t)offset;
+    rdma_local_ram_blocks->block[num_blocks].length = (uint64_t)length;
+    rdma_local_ram_blocks->num_blocks++;
+}
+
+static void qemu_rdma_ram_block_counter(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    int * num_blocks = opaque;
+    *num_blocks = *num_blocks + 1;
+}
+
+/*
+ * Identify the RAMBlocks and their quantity. They will be referenced both
+ * to identify chunk boundaries inside each RAMBlock and
+ * during dynamic page registration.
+ */
+static int qemu_rdma_init_ram_blocks(RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int num_blocks = 0;
+    
+    qemu_ram_foreach_block(qemu_rdma_ram_block_counter, &num_blocks);  
+
+    memset(rdma_local_ram_blocks, 0, sizeof *rdma_local_ram_blocks);
+    rdma_local_ram_blocks->block = g_malloc0(sizeof(RDMALocalBlock) *
+                                    num_blocks);
+
+    rdma_local_ram_blocks->num_blocks = 0;
+    qemu_ram_foreach_block(qemu_rdma_init_one_block, rdma_local_ram_blocks);
+
+    DPRINTF("Allocated %d local ram block structures\n", 
+                    rdma_local_ram_blocks->num_blocks);
+    return 0;
+}
+
+/*
+ * Put in the log file which RDMA device was opened and the details
+ * associated with that device.
+ */
+static void qemu_rdma_dump_id(const char * who, struct ibv_context * verbs)
+{
+    printf("%s RDMA Device opened: kernel name %s "
+           "uverbs device name %s, "
+           "infiniband_verbs class device path %s,"
+           " infiniband class device path %s\n", 
+                who, 
+                verbs->device->name, 
+                verbs->device->dev_name, 
+                verbs->device->dev_path, 
+                verbs->device->ibdev_path);
+}
+
+/*
+ * Put in the log file the RDMA gid addressing information,
+ * useful for folks who have trouble understanding the
+ * RDMA device hierarchy in the kernel. 
+ */
+static void qemu_rdma_dump_gid(const char * who, struct rdma_cm_id * id)
+{
+    char sgid[33];
+    char dgid[33];
+    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.sgid, sgid, sizeof sgid);
+    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.dgid, dgid, sizeof dgid);
+    DPRINTF("%s Source GID: %s, Dest GID: %s\n", who, sgid, dgid);
+}
+
+/*
+ * Figure out which RDMA device corresponds to the requested IP hostname
+ * Also create the initial connection manager identifiers for opening
+ * the connection.
+ */
+static int qemu_rdma_resolve_host(RDMAContext *rdma)
+{
+    int ret;
+    struct addrinfo *res;
+    char port_str[16];
+    struct rdma_cm_event *cm_event;
+    char ip[40] = "unknown";
+
+    if (rdma->host == NULL || !strcmp(rdma->host, "")) {
+        fprintf(stderr, "RDMA hostname has not been set\n");
+        return -1;
+    }
+
+    /* create CM channel */
+    rdma->channel = rdma_create_event_channel();
+    if (!rdma->channel) {
+        fprintf(stderr, "could not create CM channel\n");
+        return -1;
+    }
+
+    /* create CM id */
+    ret = rdma_create_id(rdma->channel, &rdma->cm_id, NULL, RDMA_PS_TCP);
+    if (ret) {
+        fprintf(stderr, "could not create channel id\n");
+        goto err_resolve_create_id;
+    }
+
+    snprintf(port_str, 16, "%d", rdma->port);
+    port_str[15] = '\0';
+
+    ret = getaddrinfo(rdma->host, port_str, NULL, &res);
+    if (ret < 0) {
+        fprintf(stderr, "could not getaddrinfo destination address %s\n", rdma->host);
+        goto err_resolve_get_addr;
+    }
+
+    inet_ntop(AF_INET, &((struct sockaddr_in *) res->ai_addr)->sin_addr, 
+                                ip, sizeof ip);
+    printf("%s => %s\n", rdma->host, ip);
+
+    /* resolve the first address */
+    ret = rdma_resolve_addr(rdma->cm_id, NULL, res->ai_addr,
+            RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve address %s\n", rdma->host);
+        goto err_resolve_get_addr;
+    }
+
+    qemu_rdma_dump_gid("client_resolve_addr", rdma->cm_id);
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_addr_resolved\n");
+        goto err_resolve_get_addr;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
+        fprintf(stderr, "result not equal to event_addr_resolved %s\n", 
+                rdma_event_str(cm_event->event));
+        perror("rdma_resolve_addr");
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
+
+    /* resolve route */
+    ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve rdma route\n");
+        goto err_resolve_get_addr;
+    }
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_route_resolved\n");
+        goto err_resolve_get_addr;
+    }
+    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
+        fprintf(stderr, "result not equal to event_route_resolved: %s\n", rdma_event_str(cm_event->event));
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
+    rdma->verbs = rdma->cm_id->verbs;
+    qemu_rdma_dump_id("client_resolve_host", rdma->cm_id->verbs);
+    qemu_rdma_dump_gid("client_resolve_host", rdma->cm_id);
+    return 0;
+
+err_resolve_get_addr:
+    rdma_destroy_id(rdma->cm_id);
+err_resolve_create_id:
+    rdma_destroy_event_channel(rdma->channel);
+    rdma->channel = NULL;
+
+    return -1;
+}
+
+/*
+ * Create protection domain and completion queues
+ */
+static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
+{
+    /* allocate pd */
+    rdma->pd = ibv_alloc_pd(rdma->verbs);
+    if (!rdma->pd) {
+        return -1;
+    }
+
+#ifdef RDMA_BLOCKING
+    /* create completion channel */
+    rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
+    if (!rdma->comp_channel) {
+        goto err_alloc_pd_cq;
+    }
+#endif
+
+    /* create cq */
+    rdma->cq = ibv_create_cq(rdma->verbs, RDMA_CQ_SIZE,
+            NULL, rdma->comp_channel, 0);
+    if (!rdma->cq) {
+        goto err_alloc_pd_cq;
+    }
+
+    return 0;
+
+err_alloc_pd_cq:
+    if (rdma->pd) {
+        ibv_dealloc_pd(rdma->pd);
+    }
+    if (rdma->comp_channel) {
+        ibv_destroy_comp_channel(rdma->comp_channel);
+    }
+    rdma->pd = NULL;
+    rdma->comp_channel = NULL;
+    return -1;
+
+}
+
+/*
+ * Create queue pairs.
+ */
+static int qemu_rdma_alloc_qp(RDMAContext *rdma)
+{
+    struct ibv_qp_init_attr attr = { 0 };
+    int ret;
+
+    attr.cap.max_send_wr = RDMA_QP_SIZE;
+    attr.cap.max_recv_wr = 3;
+    attr.cap.max_send_sge = 1;
+    attr.cap.max_recv_sge = 1;
+    attr.send_cq = rdma->cq;
+    attr.recv_cq = rdma->cq;
+    attr.qp_type = IBV_QPT_RC;
+
+    ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
+    if (ret) {
+        return -1;
+    }
+
+    rdma->qp = rdma->cm_id->qp;
+    return 0;
+}
+
+/*
+ * For QEMUFile, used for setting non-blocking mode
+ * on the connection manager so that QEMU can poll
+ * and perform an asynchronous connection.
+ *
+ * We cannot block on the connection manager, otherwise
+ * the QEMU monitor will not be available.
+ */
+int qemu_rdma_get_fd(void *opaque)
+{
+    RDMAContext *rdma = opaque;
+    return rdma->channel->fd;
+}
+
+/*
+ * This is probably dead code, but it's here anyway for testing.
+ * Sometimes it's nice to know the performance tradeoffs of pinning.
+ */
+#if !defined(RDMA_LAZY_CLIENT_REGISTRATION)
+static int qemu_rdma_reg_chunk_ram_blocks(RDMAContext *rdma,
+        RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i, j;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        RDMALocalBlock *block = &(rdma_local_ram_blocks->block[i]);
+        int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+        /* allocate memory to store chunk MRs */
+        rdma_local_ram_blocks->block[i].pmr = g_malloc0(
+                                num_chunks * sizeof(struct ibv_mr *));
+
+        if (!block->pmr) {
+            goto err_reg_chunk_ram_blocks;
+        }
+
+        for (j = 0; j < num_chunks; j++) {
+            uint8_t *start_addr = RDMA_REG_CHUNK_START(block, j);
+            uint8_t *end_addr = RDMA_REG_CHUNK_END(block, j);
+            if (start_addr < block->local_host_addr) {
+                start_addr = block->local_host_addr;
+            }
+            if (end_addr > block->local_host_addr + block->length) {
+                end_addr = block->local_host_addr + block->length;
+            }
+            block->pmr[j] = ibv_reg_mr(rdma->pd,
+                                start_addr,
+                                end_addr - start_addr,
+                                //IBV_ACCESS_LOCAL_WRITE |
+                                //IBV_ACCESS_REMOTE_WRITE |
+                                //IBV_ACCESS_GIFT |
+                                IBV_ACCESS_REMOTE_READ
+                                );
+            if (!block->pmr[j]) {
+                break;
+            }
+        }
+        if (j < num_chunks) {
+            for (j--; j >= 0; j--) {
+                ibv_dereg_mr(block->pmr[j]);
+            }
+            block->pmr[i] = NULL;
+            goto err_reg_chunk_ram_blocks;
+        }
+    }
+
+    return 0;
+
+err_reg_chunk_ram_blocks:
+    for (i--; i >= 0; i--) {
+        int num_chunks =
+            RDMA_REG_NUM_CHUNKS(&(rdma_local_ram_blocks->block[i]));
+        for (j = 0; j < num_chunks; j++) {
+            ibv_dereg_mr(rdma_local_ram_blocks->block[i].pmr[j]);
+        }
+        free(rdma_local_ram_blocks->block[i].pmr);
+        rdma_local_ram_blocks->block[i].pmr = NULL;
+    }
+
+    return -1;
+
+}
+#endif
+
+/*
+ * Also probably dead code, but for the same reason, it's nice
+ * to know the performance tradeoffs of dynamic registration
+ * on both sides of the connection.
+ */
+static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, 
+                                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        rdma_local_ram_blocks->block[i].mr =
+            ibv_reg_mr(rdma->pd,
+                    rdma_local_ram_blocks->block[i].local_host_addr,
+                    rdma_local_ram_blocks->block[i].length,
+                    IBV_ACCESS_LOCAL_WRITE |
+                    IBV_ACCESS_REMOTE_WRITE
+                    );
+        if (!rdma_local_ram_blocks->block[i].mr) {
+            fprintf(stderr, "Failed to register local server ram block!\n");
+            break;
+        }
+    }
+
+    if (i >= rdma_local_ram_blocks->num_blocks) {
+        return 0;
+    }
+
+    for (i--; i >= 0; i--) {
+        ibv_dereg_mr(rdma_local_ram_blocks->block[i].mr);
+    }
+
+    return -1;
+
+}
+
+static int qemu_rdma_client_reg_ram_blocks(RDMAContext *rdma,
+                                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+#ifdef RDMA_CHUNK_REGISTRATION
+#ifdef RDMA_LAZY_CLIENT_REGISTRATION
+    return 0;
+#else
+    return qemu_rdma_reg_chunk_ram_blocks(rdma, rdma_local_ram_blocks);
+#endif
+#else
+    return qemu_rdma_reg_whole_ram_blocks(rdma, rdma_local_ram_blocks);
+#endif
+}
+
+static int qemu_rdma_server_reg_ram_blocks(RDMAContext *rdma,
+                                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    return qemu_rdma_reg_whole_ram_blocks(rdma, rdma_local_ram_blocks);
+}
+
+/*
+ * Shutdown and clean things up.
+ */
+static void qemu_rdma_dereg_ram_blocks(RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i, j;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        int num_chunks;
+        if (!rdma_local_ram_blocks->block[i].pmr) {
+            continue;
+        }
+        num_chunks = RDMA_REG_NUM_CHUNKS(&(rdma_local_ram_blocks->block[i]));
+        for (j = 0; j < num_chunks; j++) {
+            if (!rdma_local_ram_blocks->block[i].pmr[j]) {
+                continue;
+            }
+            ibv_dereg_mr(rdma_local_ram_blocks->block[i].pmr[j]);
+        }
+        free(rdma_local_ram_blocks->block[i].pmr);
+        rdma_local_ram_blocks->block[i].pmr = NULL;
+    }
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        if (!rdma_local_ram_blocks->block[i].mr) {
+            continue;
+        }
+        ibv_dereg_mr(rdma_local_ram_blocks->block[i].mr);
+        rdma_local_ram_blocks->block[i].mr = NULL;
+    }
+}
+
+/*
+ * Server uses this to prepare to transmit the RAMBlock descriptions
+ * to the primary VM after connection setup.
+ * Both sides use the "remote" structure to communicate and update
+ * their "local" descriptions with what was sent.
+ */
+static void qemu_rdma_copy_to_remote_ram_blocks(RDMAContext *rdma,
+                                                RDMALocalBlocks *local,
+                                                RDMARemoteBlocks *remote)
+{
+    int i;
+    DPRINTF("Allocating %d remote ram block structures\n", local->num_blocks);
+    *remote->num_blocks = local->num_blocks;
+
+    for (i = 0; i < local->num_blocks; i++) {
+            remote->block[i].remote_host_addr =
+                (uint64_t)(local->block[i].local_host_addr);
+
+            if (rdma->chunk_register_destination == false) {
+                remote->block[i].remote_rkey = local->block[i].mr->rkey;
+            }
+
+            remote->block[i].offset = local->block[i].offset;
+            remote->block[i].length = local->block[i].length;
+    }
+}
+
+/*
+ * The client then propagates the remote ram block descriptions to its local
+ * copy. Really, only the virtual addresses are useful, but we propagate
+ * everything anyway.
+ *
+ * If we're using dynamic registration on the server side (the default), then
+ * the 'rkeys' are not useful because we will re-ask for them later during
+ * runtime.
+ */
+static int qemu_rdma_process_remote_ram_blocks(RDMALocalBlocks *local,
+                                               RDMARemoteBlocks *remote)
+{
+    int i, j;
+
+    if (local->num_blocks != *remote->num_blocks) {
+        fprintf(stderr, "local %d != remote %d\n", 
+            local->num_blocks, *remote->num_blocks);
+        return -1;
+    }
+
+    for (i = 0; i < *remote->num_blocks; i++) {
+        /* search local ram blocks */
+        for (j = 0; j < local->num_blocks; j++) {
+            if (remote->block[i].offset != local->block[j].offset) {
+                continue;
+            }
+            if (remote->block[i].length != local->block[j].length) {
+                return -1;
+            }
+            local->block[j].remote_host_addr =
+                remote->block[i].remote_host_addr;
+            local->block[j].remote_rkey = remote->block[i].remote_rkey;
+            break;
+        }
+        if (j >= local->num_blocks) {
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Find the ram block that corresponds to the page requested to be
+ * transmitted by QEMU.
+ *
+ * Once the block is found, also identify which 'chunk' within that
+ * block that the page belongs to.
+ *
+ * This search cannot fail or the migration will fail. 
+ */
+static int qemu_rdma_search_ram_block(uint64_t offset, uint64_t length,
+        RDMALocalBlocks *blocks, int *block_index, int *chunk_index)
+{
+    int i;
+    for (i = 0; i < blocks->num_blocks; i++) {
+        if (offset < blocks->block[i].offset) {
+            continue;
+        }
+        if (offset + length >
+                blocks->block[i].offset + blocks->block[i].length) {
+            continue;
+        }
+        *block_index = i;
+        if (chunk_index) {
+            uint8_t *host_addr = blocks->block[i].local_host_addr +
+                (offset - blocks->block[i].offset);
+            *chunk_index = RDMA_REG_CHUNK_INDEX(
+                    blocks->block[i].local_host_addr, host_addr);
+        }
+        return 0;
+    }
+    return -1;
+}
+
+/*
+ * Register a chunk with IB. If the chunk was already registered
+ * previously, then skip.
+ *
+ * Also return the keys associated with the registration needed
+ * to perform the actual RDMA operation.
+ */ 
+static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
+        RDMALocalBlock *block, uint64_t host_addr,
+        uint32_t *lkey, uint32_t *rkey)
+{
+    int chunk;
+    if (block->mr) {
+        if (lkey) {
+            *lkey = block->mr->lkey;
+        }
+        if (rkey) {
+            *rkey = block->mr->rkey;
+        }
+        return 0;
+    }
+
+    /* allocate memory to store chunk MRs */
+    if (!block->pmr) {
+        /* g_malloc0() aborts on failure, so no NULL check is needed */
+        block->pmr = g_malloc0(RDMA_REG_NUM_CHUNKS(block) *
+                               sizeof(struct ibv_mr *));
+    }
+
+    /*
+     * If 'rkey', then we're the server performing a dynamic
+     * registration, so grant access to the client.
+     *
+     * If 'lkey', then we're the primary VM performing a dynamic
+     * registration, so grant access only to ourselves.
+     */
+    chunk = RDMA_REG_CHUNK_INDEX(block->local_host_addr, host_addr);
+    if (!block->pmr[chunk]) {
+        uint8_t *start_addr = RDMA_REG_CHUNK_START(block, chunk);
+        uint8_t *end_addr = RDMA_REG_CHUNK_END(block, chunk);
+        if (start_addr < block->local_host_addr) {
+            start_addr = block->local_host_addr;
+        }
+        if (end_addr > block->local_host_addr + block->length) {
+            end_addr = block->local_host_addr + block->length;
+        }
+        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
+                start_addr,
+                end_addr - start_addr,
+                //(lkey ? IBV_ACCESS_GIFT : 0) |
+                (rkey ? (IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE) : 0)
+                | IBV_ACCESS_REMOTE_READ);
+        if (!block->pmr[chunk]) {
+            fprintf(stderr, "Failed to register chunk!\n");
+            return -1;
+        }
+    }
+    if (lkey) {
+        *lkey = block->pmr[chunk]->lkey;
+    }
+    if (rkey) {
+        *rkey = block->pmr[chunk]->rkey;
+    }
+    return 0;
+}
+
+/*
+ * Register (at connection time) the memory used for control
+ * channel messages.
+ */
+static int qemu_rdma_reg_control(RDMAContext *rdma, int idx)
+{
+    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->pd,
+            rdma->wr_data[idx].control, RDMA_CONTROL_MAX_BUFFER,
+            IBV_ACCESS_LOCAL_WRITE |
+            IBV_ACCESS_REMOTE_WRITE |
+            IBV_ACCESS_REMOTE_READ);
+    if (rdma->wr_data[idx].control_mr) {
+        return 0;
+    }
+    return -1;
+}
+
+static int qemu_rdma_dereg_control(RDMAContext *rdma, int idx)
+{
+    return ibv_dereg_mr(rdma->wr_data[idx].control_mr);
+}
+
+#if defined(DEBUG_RDMA) || defined(DEBUG_RDMA_VERBOSE)
+static const char *print_wrid(int wrid)
+{
+    if (wrid >= RDMA_WRID_RECV_CONTROL) {
+        return wrid_desc[RDMA_WRID_RECV_CONTROL];
+    }
+    return wrid_desc[wrid];
+}
+#endif
+
+/*
+ * Poll the completion queue to see if a work request
+ * (of any kind) has completed.
+ * Return the work request ID that completed.
+ */
+static int qemu_rdma_poll(RDMAContext *rdma)
+{
+    int ret;
+    struct ibv_wc wc;
+
+    ret = ibv_poll_cq(rdma->cq, 1, &wc);
+    if (!ret) {
+        return RDMA_WRID_NONE;
+    }
+    if (ret < 0) {
+        fprintf(stderr, "ibv_poll_cq return %d!\n", ret);
+        return ret;
+    }
+    if (wc.status != IBV_WC_SUCCESS) {
+        fprintf(stderr, "ibv_poll_cq wc.status=%d %s!\n",
+                        wc.status, ibv_wc_status_str(wc.status));
+        fprintf(stderr, "ibv_poll_cq wrid=%s!\n", wrid_desc[wc.wr_id]);
+
+        return -1;
+    }
+
+    if (rdma->control_ready_expected &&
+        (wc.wr_id >= RDMA_WRID_RECV_CONTROL)) {
+        DPRINTF("completion %s #%" PRId64 " received (%" PRId64 ")\n",
+            wrid_desc[RDMA_WRID_RECV_CONTROL], wc.wr_id -
+            RDMA_WRID_RECV_CONTROL, wc.wr_id);
+        rdma->control_ready_expected = 0;
+    }
+
+    if (wc.wr_id == RDMA_WRID_RDMA_WRITE) {
+        rdma->num_signaled_send--;
+        DPRINTF("completions %s (%" PRId64 ") left %d\n",
+            print_wrid(wc.wr_id), wc.wr_id, rdma->num_signaled_send);
+    } else {
+        DPRINTF("other completion %s (%" PRId64 ") received left %d\n",
+            print_wrid(wc.wr_id), wc.wr_id, rdma->num_signaled_send);
+    }
+
+    return (int)wc.wr_id;
+}
+
+/*
+ * Block until the next work request has completed.
+ * 
+ * First poll to see if a work request has already completed,
+ * otherwise block.
+ *
+ * If we encounter completed work requests for IDs other than
+ * the one we're interested in, then that's generally an error.
+ *
+ * The only exception is actual RDMA Write completions. These
+ * completions only need to be recorded, but do not actually
+ * need further processing.
+ */
+#ifdef RDMA_BLOCKING
+static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid)
+{
+    int num_cq_events = 0;
+    int r = RDMA_WRID_NONE;
+    struct ibv_cq *cq;
+    void *cq_ctx;
+
+    if (ibv_req_notify_cq(rdma->cq, 0)) {
+        return -1;
+    }
+    /* poll cq first */
+    while (r != wrid) {
+        r = qemu_rdma_poll(rdma);
+        if (r < 0) {
+            return r;
+        }
+        if (r == RDMA_WRID_NONE) {
+            break;
+        }
+        if (r != wrid) {
+            DPRINTF("A Wanted wrid %s (%d) but got %s (%d)\n",
+                print_wrid(wrid), wrid, print_wrid(r), r);
+        }
+    }
+    if (r == wrid) {
+        return 0;
+    }
+
+    while (1) {
+        if (ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx)) {
+            goto err_block_for_wrid;
+        }
+        num_cq_events++;
+        if (ibv_req_notify_cq(cq, 0)) {
+            goto err_block_for_wrid;
+        }
+        /* poll cq */
+        while (r != wrid) {
+            r = qemu_rdma_poll(rdma);
+            if (r < 0) {
+                goto err_block_for_wrid;
+            }
+            if (r == RDMA_WRID_NONE) {
+                break;
+            }
+            if (r != wrid) {
+                DPRINTF("B Wanted wrid %s (%d) but got %s (%d)\n",
+                    print_wrid(wrid), wrid, print_wrid(r), r);
+            }
+        }
+        if (r == wrid) {
+            goto success_block_for_wrid;
+        }
+    }
+
+success_block_for_wrid:
+    if (num_cq_events) {
+        ibv_ack_cq_events(cq, num_cq_events);
+    }
+    return 0;
+
+err_block_for_wrid:
+    if (num_cq_events) {
+        ibv_ack_cq_events(cq, num_cq_events);
+    }
+    return -1;
+}
+#else
+static int qemu_rdma_poll_for_wrid(RDMAContext *rdma, int wrid)
+{
+    int r = RDMA_WRID_NONE;
+    while (r != wrid) {
+        r = qemu_rdma_poll(rdma);
+        if (r < 0) {
+            return r;
+        }
+    }
+    return 0;
+}
+#endif
+
+
+static int wait_for_wrid(RDMAContext *rdma, int wrid)
+{
+#ifdef RDMA_BLOCKING
+    return qemu_rdma_block_for_wrid(rdma, wrid);
+#else
+    return qemu_rdma_poll_for_wrid(rdma, wrid);
+#endif
+}
+
+/*
+ * Post a SEND message work request for the control channel
+ * containing some data and block until the post completes.
+ */
+static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t *buf,
+                                       RDMAControlHeader *head)
+{
+    int ret = 0;
+    RDMAWorkRequestData *wr = &rdma->wr_data[RDMA_CONTROL_MAX_WR];
+    struct ibv_send_wr *bad_wr;
+    struct ibv_sge sge = {
+        .addr = (uint64_t)(wr->control),
+        .length = head->len + RDMAControlHeaderSize,
+        .lkey = wr->control_mr->lkey,
+    };
+    struct ibv_send_wr send_wr = {
+        .wr_id = RDMA_WRID_SEND_CONTROL,
+        .opcode = IBV_WR_SEND,
+        .send_flags = IBV_SEND_SIGNALED,
+        .sg_list = &sge,
+        .num_sge = 1,
+    };
+
+    if (head->version < RDMA_CONTROL_VERSION_MIN || 
+            head->version > RDMA_CONTROL_VERSION_MAX) {
+        fprintf(stderr, "SEND: Invalid control message version: %d,"
+                        " min: %d, max: %d\n", 
+                        head->version, RDMA_CONTROL_VERSION_MIN,
+                        RDMA_CONTROL_VERSION_MAX);
+        return -1;
+    }
+
+    DPRINTF("CONTROL: sending %s..\n", control_desc[head->type]);
+
+    /*
+     * We don't actually need to do a memcpy() in here if we used
+     * the "sge" properly, but since we're only sending control messages
+     * (not RAM in a performance-critical path), it's OK for now.
+     *
+     * The copy makes the RDMAControlHeader simpler to manipulate
+     * for the time being.
+     */
+    memcpy(wr->control, head, RDMAControlHeaderSize);
+    if (buf) {
+        memcpy(wr->control + RDMAControlHeaderSize, buf, head->len);
+    }
+
+    ret = ibv_post_send(rdma->qp, &send_wr, &bad_wr);
+    if (ret) {
+        fprintf(stderr, "Failed to post IB SEND for control!\n");
+        return -ret;
+    }
+
+    ret = wait_for_wrid(rdma, RDMA_WRID_SEND_CONTROL);
+    if (ret < 0) {
+        fprintf(stderr, "rdma migration: polling control error!");
+    }
+
+    return ret;
+}
+
+/*
+ * Post a RECV work request in anticipation of some future receipt
+ * of data on the control channel.
+ */
+static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx)
+{
+    struct ibv_recv_wr *bad_wr;
+    struct ibv_sge sge = {
+        .addr = (uint64_t)(rdma->wr_data[idx].control),
+        .length = RDMA_CONTROL_MAX_BUFFER,
+        .lkey = rdma->wr_data[idx].control_mr->lkey,
+    };
+
+    struct ibv_recv_wr recv_wr = {
+        .wr_id = RDMA_WRID_RECV_CONTROL + idx,
+        .sg_list = &sge,
+        .num_sge = 1,
+    };
+
+    if (ibv_post_recv(rdma->qp, &recv_wr, &bad_wr)) {
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Block and wait for a RECV control channel message to arrive. 
+ */
+static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
+                RDMAControlHeader *head, int expecting, int idx)
+{
+    int ret = wait_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx);
+    RDMAControlHeader *temp = (RDMAControlHeader *) rdma->wr_data[idx].control;
+
+    if (ret < 0) {
+        fprintf(stderr, "rdma migration: polling control error!\n");
+        return ret;
+    }
+
+    if (temp->version < RDMA_CONTROL_VERSION_MIN || 
+            temp->version > RDMA_CONTROL_VERSION_MAX) {
+        fprintf(stderr, "RECV: Invalid control message version: %d,"
+                        " min: %d, max: %d\n", 
+                        temp->version, RDMA_CONTROL_VERSION_MIN,
+                        RDMA_CONTROL_VERSION_MAX);
+        return -1;
+    }
+
+    memcpy(head, temp, RDMAControlHeaderSize);
+
+    DPRINTF("CONTROL: %s received\n", control_desc[expecting]);
+
+    if (expecting != RDMA_CONTROL_NONE && head->type != expecting) {
+        fprintf(stderr, "Was expecting a %s control message"
+                ", but got: %s, length: %" PRId64 "\n", 
+                control_desc[expecting], 
+                control_desc[head->type], head->len);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+/*
+ * When a RECV work request has completed, the work request's
+ * buffer starts with the header.
+ *
+ * This advances the data pointer past the header to the data
+ * portion of the control message that was populated once the
+ * work request finished.
+ */
+static void qemu_rdma_move_header(RDMAContext *rdma, int idx,
+                                  RDMAControlHeader *head)
+{
+    rdma->wr_data[idx].control_len = head->len;
+    rdma->wr_data[idx].control_curr = rdma->wr_data[idx].control +
+                                      RDMAControlHeaderSize;
+}
+
+/*
+ * This is an 'atomic' high-level operation to deliver a single, unified
+ * control-channel message.
+ *
+ * Additionally, if the user is expecting some kind of reply to this message,
+ * they can request a 'resp' response message be filled in by posting an
+ * additional work request on behalf of the user and waiting for an additional
+ * completion.
+ *
+ * The extra (optional) response saves us from having to perform an
+ * *additional* exchange of messages during registration just to provide a
+ * response, by piggy-backing on the acknowledgement instead.
+ */
+static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
+                                   uint8_t *data, RDMAControlHeader *resp,
+                                   int *resp_idx)
+{
+    int ret = 0;
+    int idx = 0;
+
+    /*
+     * Wait until the server is ready before attempting to deliver the message
+     * by waiting for a READY message.
+     */
+    if (rdma->control_ready_expected) {
+        RDMAControlHeader ready;
+        ret = qemu_rdma_exchange_get_response(rdma,
+                                    &ready, RDMA_CONTROL_READY, idx);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    /*
+     * If the user is expecting a response, post a WR in anticipation of it.
+     */
+    if (resp) {
+        ret = qemu_rdma_post_recv_control(rdma, idx + 1);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error posting"
+                    " extra control recv for anticipated result!");
+            return ret;
+        }
+    }
+
+    /*
+     * Post a WR to replace the one we just consumed for the READY message.
+     */
+    ret = qemu_rdma_post_recv_control(rdma, idx);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting first control recv!");
+        return ret;
+    }
+
+    /*
+     * Deliver the control message that was requested.
+     */
+    ret = qemu_rdma_post_send_control(rdma, data, head);
+
+    if (ret < 0) {
+        fprintf(stderr, "Failed to send control buffer!\n");
+        return ret;
+    }
+
+    /*
+     * If we're expecting a response, block and wait for it.
+     */
+    if (resp) {
+        DPRINTF("Waiting for response %s\n", control_desc[resp->type]);
+        ret = qemu_rdma_exchange_get_response(rdma, resp, resp->type, idx + 1);
+
+        if (ret < 0) {
+            return ret;
+        }
+
+        qemu_rdma_move_header(rdma, idx + 1, resp);
+        *resp_idx = idx + 1;
+        DPRINTF("Response %s received.\n", control_desc[resp->type]);
+    }
+
+    rdma->control_ready_expected = 1;
+
+    return 0;
+}
+
+/*
+ * This is an 'atomic' high-level operation to receive a single, unified
+ * control-channel message.
+ */
+static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
+                                   int expecting)
+{
+    RDMAControlHeader ready = {
+        .len = 0,
+        .type = RDMA_CONTROL_READY,
+        .version = RDMA_CONTROL_CURRENT_VERSION,
+    };
+    int ret;
+    int idx = 0;
+
+    /*
+     * Inform the client that we're ready to receive a message.
+     */
+    ret = qemu_rdma_post_send_control(rdma, NULL, &ready);
+
+    if (ret < 0) {
+        fprintf(stderr, "Failed to send control buffer!\n");
+        return ret;
+    }
+
+    /*
+     * Block and wait for the message.
+     */
+    ret = qemu_rdma_exchange_get_response(rdma, head, expecting, idx);
+
+    if (ret < 0) {
+        return ret;
+    }
+
+    qemu_rdma_move_header(rdma, idx, head);
+
+    /*
+     * Post a new RECV work request to replace the one we just consumed.
+     */
+    ret = qemu_rdma_post_recv_control(rdma, idx);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting second control recv!");
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * Write an actual chunk of memory using RDMA.
+ *
+ * If we're using dynamic registration on the server-side, we have to
+ * send a registration command first.
+ */
+static int __qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
+        int current_index,
+        uint64_t offset, uint64_t length,
+        uint64_t wr_id, enum ibv_send_flags flag)
+{
+    struct ibv_sge sge;
+    struct ibv_send_wr send_wr = { 0 };
+    struct ibv_send_wr *bad_wr;
+    RDMALocalBlock *block = &(local_ram_blocks.block[current_index]);
+    int chunk;
+    RDMARegister reg;
+    RDMARegisterResult *reg_result;
+    int reg_result_idx;
+    RDMAControlHeader resp = { .len = sizeof(RDMARegisterResult),
+                               .type = RDMA_CONTROL_REGISTER_RESULT,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                              };
+    RDMAControlHeader head = { .len = sizeof(RDMARegister), 
+                               .type = RDMA_CONTROL_REGISTER_REQUEST,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    int ret;
+
+    sge.addr = (uint64_t)(block->local_host_addr + (offset - block->offset));
+    sge.length = length;
+    if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr, &sge.lkey, NULL)) {
+        fprintf(stderr, "cannot get lkey!\n");
+        return -EINVAL;
+    }
+
+    send_wr.wr_id = wr_id;
+    send_wr.opcode = IBV_WR_RDMA_WRITE;
+    send_wr.send_flags = flag;
+    send_wr.sg_list = &sge;
+    send_wr.num_sge = 1;
+    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
+        (offset - block->offset);
+
+    if (rdma->chunk_register_destination) {
+        chunk = RDMA_REG_CHUNK_INDEX(block->local_host_addr, sge.addr);
+        if (!block->remote_keys[chunk]) {
+            /*
+             * Tell the other side to register.
+             */
+            reg.len = sge.length;
+            reg.current_index = current_index;
+            reg.offset = offset;
+
+            DPRINTF("Sending registration request chunk %d for %d bytes...\n",
+                    chunk, sge.length);
+            ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
+                                          &resp, &reg_result_idx);
+            if (ret < 0) {
+                return ret;
+            }
+
+            reg_result = (RDMARegisterResult *)
+                    rdma->wr_data[reg_result_idx].control_curr;
+            DPRINTF("Received registration result:"
+                    " my key: %x their key %x, chunk %d\n",
+                    block->remote_keys[chunk], reg_result->rkey, chunk);
+
+            block->remote_keys[chunk] = reg_result->rkey;
+        }
+
+        send_wr.wr.rdma.rkey = block->remote_keys[chunk];
+    } else {
+        send_wr.wr.rdma.rkey = block->remote_rkey;
+    }
+
+    return ibv_post_send(rdma->qp, &send_wr, &bad_wr);
+}
+
+/*
+ * Push out any unwritten RDMA operations.
+ *
+ * We support sending out multiple chunks at the same time.
+ * Not all of them need to get signaled in the completion queue.
+ */
+static int qemu_rdma_write_flush(QEMUFile *f, RDMAContext *rdma)
+{
+    int ret;
+    enum ibv_send_flags flags = 0;
+
+    if (!rdma->current_length) {
+        return 0;
+    }
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        flags = IBV_SEND_SIGNALED;
+    }
+
+    while (1) {
+        ret = __qemu_rdma_write(f, rdma,
+                rdma->current_index,
+                rdma->current_offset,
+                rdma->current_length,
+                RDMA_WRID_RDMA_WRITE, flags);
+        if (!ret) {
+            break;
+        }
+        if (ret == ENOMEM) {
+            DPRINTF("send queue is full. wait a little....\n");
+            ret = wait_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+            if (ret < 0) {
+                fprintf(stderr, "rdma migration: failed to make room in"
+                        " full send queue! %d\n", ret);
+                return -EIO;
+            }
+        } else {
+            fprintf(stderr, "rdma migration: write flush error! %d\n", ret);
+            perror("write flush error");
+            return -EIO;
+        }
+    }
+
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        rdma->num_unsignaled_send = 0;
+        rdma->num_signaled_send++;
+        DPRINTF("signaled total: %d\n", rdma->num_signaled_send);
+    } else {
+        rdma->num_unsignaled_send++;
+    }
+
+    rdma->current_length = 0;
+    rdma->current_offset = 0;
+
+    return 0;
+}
+
+static inline int qemu_rdma_in_current_block(RDMAContext *rdma,
+                uint64_t offset, uint64_t len)
+{
+    RDMALocalBlock *block;
+    if (rdma->current_index < 0) {
+        return 0;
+    }
+    block = &(local_ram_blocks.block[rdma->current_index]);
+    if (offset < block->offset) {
+        return 0;
+    }
+    if (offset + len > block->offset + block->length) {
+        return 0;
+    }
+    return 1;
+}
+
+static inline int qemu_rdma_in_current_chunk(RDMAContext *rdma,
+                uint64_t offset, uint64_t len)
+{
+    RDMALocalBlock *block;
+    uint8_t *chunk_start, *chunk_end, *host_addr;
+    if (rdma->current_chunk < 0 || rdma->current_index < 0) {
+        return 0;
+    }
+    block = &(local_ram_blocks.block[rdma->current_index]);
+    host_addr = block->local_host_addr + (offset - block->offset);
+    chunk_start = RDMA_REG_CHUNK_START(block, rdma->current_chunk);
+    if (chunk_start < block->local_host_addr) {
+        chunk_start = block->local_host_addr;
+    }
+    if (host_addr < chunk_start) {
+        return 0;
+    }
+    chunk_end = RDMA_REG_CHUNK_END(block, rdma->current_chunk);
+    if (chunk_end > block->local_host_addr + block->length) {
+        chunk_end = block->local_host_addr + block->length;
+    }
+    if (host_addr + len > chunk_end) {
+        return 0;
+    }
+    return 1;
+}
+
+static inline int qemu_rdma_buffer_mergable(RDMAContext *rdma,
+                    uint64_t offset, uint64_t len)
+{
+    if (rdma->current_length == 0) {
+        return 0;
+    }
+    if (offset != rdma->current_offset + rdma->current_length) {
+        return 0;
+    }
+    if (!qemu_rdma_in_current_block(rdma, offset, len)) {
+        return 0;
+    }
+#ifdef RDMA_CHUNK_REGISTRATION
+    if (!qemu_rdma_in_current_chunk(rdma, offset, len)) {
+        return 0;
+    }
+#endif
+    return 1;
+}
+
+/*
+ * We're not actually writing here, but doing three things:
+ *
+ * 1. Identify the chunk the buffer belongs to.
+ * 2. If the chunk is full or the buffer doesn't belong to the current
+ *    chunk, then start a new chunk and flush() the old chunk.
+ * 3. To keep the hardware busy, we also group chunks into batches
+ *    and only require that a batch gets acknowledged in the completion
+ *    queue instead of each individual chunk.
+ */
+static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
+                           uint64_t offset, uint64_t len)
+{
+    int index = rdma->current_index;
+    int chunk_index = rdma->current_chunk;
+    int ret;
+
+    /* If we cannot merge it, we flush the current buffer first. */
+    if (!qemu_rdma_buffer_mergable(rdma, offset, len)) {
+        ret = qemu_rdma_write_flush(f, rdma);
+        if (ret) {
+            return ret;
+        }
+        rdma->current_length = 0;
+        rdma->current_offset = offset;
+
+        ret = qemu_rdma_search_ram_block(offset, len,
+                    &local_ram_blocks, &index, &chunk_index);
+        if (ret) {
+            fprintf(stderr, "ram block search failed\n");
+            return ret;
+        }
+        rdma->current_index = index;
+        rdma->current_chunk = chunk_index;
+    }
+
+    /* merge it */
+    rdma->current_length += len;
+
+    /* flush it if buffer is too large */
+    if (rdma->current_length >= RDMA_MERGE_MAX) {
+        return qemu_rdma_write_flush(f, rdma);
+    }
+
+    return 0;
+}
+
+void qemu_rdma_cleanup(void *opaque)
+{
+    RDMAContext *rdma = opaque;
+    struct rdma_cm_event *cm_event;
+    int ret, idx;
+
+    if (rdma->cm_id) {
+        DPRINTF("Disconnecting...\n");
+        ret = rdma_disconnect(rdma->cm_id);
+        if (!ret) {
+            ret = rdma_get_cm_event(rdma->channel, &cm_event);
+            if (!ret) {
+                rdma_ack_cm_event(cm_event);
+            }
+        }
+        DPRINTF("Disconnected.\n");
+    }
+
+    g_free(remote_ram_blocks.remote_area);
+
+    for (idx = 0; idx < (RDMA_CONTROL_MAX_WR + 1); idx++) {
+        if (rdma->wr_data[idx].control_mr) {
+            qemu_rdma_dereg_control(rdma, idx);
+        }
+        rdma->wr_data[idx].control_mr = NULL;
+    }
+
+    qemu_rdma_dereg_ram_blocks(&local_ram_blocks);
+
+    if (local_ram_blocks.block) {
+        if (rdma->chunk_register_destination) {
+            for (idx = 0; idx < local_ram_blocks.num_blocks; idx++) {
+                RDMALocalBlock *block = &(local_ram_blocks.block[idx]);
+                g_free(block->remote_keys);
+            }
+        }
+        g_free(local_ram_blocks.block);
+    }
+
+    if (rdma->qp) {
+        ibv_destroy_qp(rdma->qp);
+    }
+    if (rdma->cq) {
+        ibv_destroy_cq(rdma->cq);
+    }
+    if (rdma->comp_channel) {
+        ibv_destroy_comp_channel(rdma->comp_channel);
+    }
+    if (rdma->pd) {
+        ibv_dealloc_pd(rdma->pd);
+    }
+    if (rdma->listen_id) {
+        rdma_destroy_id(rdma->listen_id);
+    }
+    if (rdma->cm_id) {
+        rdma_destroy_id(rdma->cm_id);
+        rdma->cm_id = NULL;
+    }
+    if (rdma->channel) {
+        rdma_destroy_event_channel(rdma->channel);
+    }
+}
+
+static void qemu_rdma_remote_ram_blocks_init(void)
+{
+    int remote_size = (sizeof(RDMARemoteBlock) * 
+                        local_ram_blocks.num_blocks)
+                        +   sizeof(*remote_ram_blocks.num_blocks);
+
+    DPRINTF("Preparing %d bytes for remote info\n", remote_size);
+
+    remote_ram_blocks.remote_area = g_malloc0(remote_size);
+    remote_ram_blocks.remote_size = remote_size;
+    remote_ram_blocks.num_blocks = remote_ram_blocks.remote_area;
+    remote_ram_blocks.block = (void *) (remote_ram_blocks.num_blocks + 1);
+}
+
+int qemu_rdma_client_init(void *opaque, Error **errp,
+                          bool chunk_register_destination)
+{
+    RDMAContext *rdma = opaque;
+    int ret, idx;
+
+    if (rdma->client_init_done) {
+        return 0;
+    }
+
+    rdma->chunk_register_destination = chunk_register_destination;
+
+    ret = qemu_rdma_resolve_host(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error resolving host!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_alloc_pd_cq(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating pd and cq!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_alloc_qp(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating qp!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_init_ram_blocks(&local_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error initializing ram blocks!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_client_reg_ram_blocks(rdma, &local_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error client registering ram blocks!");
+        goto err_rdma_client_init;
+    }
+
+    for (idx = 0; idx < (RDMA_CONTROL_MAX_WR + 1); idx++) {
+        ret = qemu_rdma_reg_control(rdma, idx);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error registering %d control!",
+                    idx);
+            goto err_rdma_client_init;
+        }
+    }
+
+    qemu_rdma_remote_ram_blocks_init();
+
+    rdma->client_init_done = 1;
+    return 0;
+
+err_rdma_client_init:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+int qemu_rdma_connect(void *opaque, Error **errp)
+{
+    RDMAControlHeader head;
+    RDMAContext *rdma = opaque;
+    struct rdma_cm_event *cm_event;
+    RDMACapabilities cap = {
+        .len = sizeof(RDMACapabilities),
+        .version = RDMA_CONTROL_CURRENT_VERSION,
+        .chunk_register_destination = rdma->chunk_register_destination,
+    };
+    struct rdma_conn_param conn_param = {
+        .initiator_depth = 2,
+        .retry_count = 5,
+        .private_data = &cap,
+        .private_data_len = sizeof(cap),
+    };
+    int ret;
+    int idx = 0;
+    int x;
+
+    ret = rdma_connect(rdma->cm_id, &conn_param);
+    if (ret) {
+        perror("rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        perror("rdma_get_cm_event after rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        perror("rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    rdma_ack_cm_event(cm_event);
+
+    ret = qemu_rdma_post_recv_control(rdma, idx + 1);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting extra control recv!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = qemu_rdma_post_recv_control(rdma, idx);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting first control recv!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = qemu_rdma_exchange_get_response(rdma,
+                                &head, RDMA_CONTROL_RAM_BLOCKS, idx + 1);
+
+    if (ret < 0) {
+        fprintf(stderr, "rdma migration: error receiving remote info!");
+        goto err_rdma_client_connect;
+    }
+
+    qemu_rdma_move_header(rdma, idx + 1, &head);
+    memcpy(remote_ram_blocks.remote_area, rdma->wr_data[idx + 1].control_curr, 
+                    remote_ram_blocks.remote_size);
+
+    ret = qemu_rdma_process_remote_ram_blocks(
+                            &local_ram_blocks, &remote_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error processing remote ram blocks!\n");
+        goto err_rdma_client_connect;
+    }
+
+    if(rdma->chunk_register_destination) {
+        for (x = 0; x < local_ram_blocks.num_blocks; x++) {
+            RDMALocalBlock *block = &(local_ram_blocks.block[x]);
+            int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+            /* allocate memory to store remote rkeys */
+            block->remote_keys = g_malloc0(num_chunks * sizeof(uint32_t));
+        }
+    }
+    rdma->control_ready_expected = 1;
+    rdma->num_signaled_send = 0;
+    return 0;
+
+err_rdma_client_connect:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+int qemu_rdma_server_init(void * opaque, Error **errp)
+{
+    RDMAContext *rdma = opaque;
+    int ret, idx;
+    struct sockaddr_in sin;
+    struct rdma_cm_id *listen_id;
+    char ip[40] = "unknown";
+
+    for(idx = 0; idx < RDMA_CONTROL_MAX_WR; idx++) {
+        rdma->wr_data[idx].control_len = 0;
+        rdma->wr_data[idx].control_curr = NULL;  
+    }
+
+    if(rdma->host == NULL) {
+        fprintf(stderr, "Error: RDMA host is not set!");
+        return -1;
+    }
+    /* create CM channel */
+    rdma->channel = rdma_create_event_channel();
+    if (!rdma->channel) {
+        fprintf(stderr, "Error: could not create rdma event channel");
+        return -1;
+    }
+
+    /* create CM id */
+    ret = rdma_create_id(rdma->channel, &listen_id, NULL, RDMA_PS_TCP);
+    if (ret) {
+        fprintf(stderr, "Error: could not create cm_id!");
+        goto err_server_init_create_listen_id;
+    }
+
+    memset(&sin, 0, sizeof(sin));
+    sin.sin_family = AF_INET;
+    sin.sin_port = htons(rdma->port);
+
+    if (rdma->host && strcmp("", rdma->host)) {
+        struct hostent *server_addr;
+        server_addr = gethostbyname(rdma->host);
+        if (!server_addr) {
+            fprintf(stderr, "Error: migration could not gethostbyname!");
+            goto err_server_init_bind_addr;
+        }
+        memcpy(&sin.sin_addr.s_addr, server_addr->h_addr,
+                server_addr->h_length);
+        inet_ntop(AF_INET, server_addr->h_addr, ip, sizeof ip);
+    } else {
+        sin.sin_addr.s_addr = INADDR_ANY;
+    }
+
+    DPRINTF("%s => %s\n", rdma->host, ip);
+
+    ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin);
+    if (ret) {
+        fprintf(stderr, "Error: could not rdma_bind_addr!");
+        goto err_server_init_bind_addr;
+    }
+
+    rdma->listen_id = listen_id;
+    if (listen_id->verbs) {
+        rdma->verbs = listen_id->verbs;
+    }
+    qemu_rdma_dump_id("server_init", rdma->verbs);
+    qemu_rdma_dump_gid("server_init", listen_id);
+    return 0;
+
+err_server_init_bind_addr:
+    rdma_destroy_id(listen_id);
+err_server_init_create_listen_id:
+    rdma_destroy_event_channel(rdma->channel);
+    rdma->channel = NULL;
+    return -1;
+
+}
+
+int qemu_rdma_server_prepare(void * opaque, Error **errp)
+{
+    RDMAContext *rdma = opaque;
+    int ret;
+    int idx;
+
+    if (!rdma->verbs) {
+        fprintf(stderr, "rdma migration: no verbs context!");
+        return 0;
+    }
+
+    ret = qemu_rdma_alloc_pd_cq(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating pd and cq!");
+        goto err_rdma_server_prepare;
+    }
+
+    ret = qemu_rdma_init_ram_blocks(&local_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error initializing ram blocks!");
+        goto err_rdma_server_prepare;
+    }
+
+    qemu_rdma_remote_ram_blocks_init();
+
+    /* Extra one for the send buffer */
+    for(idx = 0; idx < (RDMA_CONTROL_MAX_WR + 1); idx++) {
+        ret = qemu_rdma_reg_control(rdma, idx);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error registering %d control!", idx);
+            goto err_rdma_server_prepare;
+        }
+    }
+
+    ret = rdma_listen(rdma->listen_id, 5);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error listening on socket!");
+        goto err_rdma_server_prepare;
+    }
+
+    return 0;
+
+err_rdma_server_prepare:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+void *qemu_rdma_data_init(const char *host_port, Error **errp)
+{
+    RDMAContext *rdma = NULL;
+    InetSocketAddress *addr;
+
+    if(host_port) {
+        rdma = g_malloc0(sizeof(RDMAContext));
+        rdma->current_index = -1;
+        rdma->current_chunk = -1;
+
+        addr = inet_parse(host_port, errp);
+        if (addr != NULL) {
+            rdma->port = atoi(addr->port);
+            rdma->host = g_strdup(addr->host);
+            printf("rdma host: %s\n", rdma->host);
+            printf("rdma port: %d\n", rdma->port);
+        } else {
+            error_setg(errp, "bad RDMA migration address '%s'", host_port);
+            g_free(rdma);
+            return NULL;
+        }
+    }
+
+    return rdma;
+}
+
+void qemu_rdma_disable(void * opaque)
+{
+    RDMAContext *rdma = opaque;
+    rdma->port = -1;
+}
+
+/*
+ * QEMUFile interface to the control channel.
+ * SEND messages for control only.
+ * pc.ram is handled with regular RDMA messages.
+ */
+int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileRDMA *r = opaque;
+    QEMUFile *f = r->file;
+    RDMAContext *rdma = r->rdma;
+    size_t remaining = size;
+    uint8_t * data = (void *) buf;
+    int ret;
+
+    /*
+     * Push out any writes that
+     * we're queued up for pc.ram.
+     */
+    if (qemu_rdma_write_flush(f, rdma) < 0)
+        return -EIO;
+
+    while(remaining) {
+        RDMAControlHeader head;
+
+        r->len = MIN(remaining, RDMA_SEND_INCREMENT);
+        remaining -= r->len;
+
+        head.len = r->len;
+        head.type = RDMA_CONTROL_QEMU_FILE;
+        head.version = RDMA_CONTROL_CURRENT_VERSION;
+
+        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL);
+
+        if(ret < 0)
+            return ret;
+
+        data += r->len;
+    }
+
+    return size;
+} 
+
+static size_t qemu_rdma_fill(RDMAContext * rdma, uint8_t *buf, int size, int idx)
+{
+    size_t len = 0;
+
+    if(rdma->wr_data[idx].control_len) {
+        DPRINTF("RDMA %" PRId64 " of %d bytes already in buffer\n",
+            rdma->wr_data[idx].control_len, size);
+
+        len = MIN(size, rdma->wr_data[idx].control_len);
+        memcpy(buf, rdma->wr_data[idx].control_curr, len);
+        rdma->wr_data[idx].control_curr += len;
+        rdma->wr_data[idx].control_len -= len;
+    }
+
+    return len;
+}
+
+/*
+ * QEMUFile interface to the control channel.
+ * RDMA links don't use bytestreams, so we have to
+ * return bytes to QEMUFile opportunistically.
+ */
+int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileRDMA *r = opaque;
+    RDMAContext *rdma = r->rdma;
+    RDMAControlHeader head;
+    int ret = 0;
+
+    /*
+     * First, we hold on to the last SEND message we 
+     * were given and dish out the bytes until we run 
+     * out of bytes.
+     */
+    if((r->len = qemu_rdma_fill(r->rdma, buf, size, 0)))
+        return r->len; 
+
+     /*
+      * Once we run out, we block and wait for another
+      * SEND message to arrive.
+      */
+    ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_QEMU_FILE);
+
+    if(ret < 0)
+        return ret;
+
+    /*
+     * SEND was received with new bytes, now try again.
+     */
+    return qemu_rdma_fill(r->rdma, buf, size, 0);
+} 
+
+/*
+ * Block until all the outstanding chunks have been delivered by the hardware.
+ */
+int qemu_rdma_drain_cq(QEMUFile *f)
+{
+    QEMUFileRDMA *rfile = qemu_file_ops_are(f, &rdma_write_ops);
+    RDMAContext *rdma = rfile->rdma;
+    int ret;
+
+    if (qemu_rdma_write_flush(f, rdma) < 0) {
+        return -EIO;
+    }
+
+    while (rdma->num_signaled_send) {
+        ret = wait_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: complete polling error!\n");
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+int qemu_rdma_close(void *opaque)
+{
+    QEMUFileRDMA *r = opaque;
+    if(r->rdma) {
+        qemu_rdma_cleanup(r->rdma);
+        g_free(r->rdma);
+    }
+    g_free(r);
+    return 0;
+}
+
+void *qemu_fopen_rdma(void * opaque, const char * mode)
+{
+    RDMAContext *rdma = opaque;
+    QEMUFileRDMA *r = g_malloc0(sizeof(QEMUFileRDMA));
+
+    if(qemu_file_mode_is_not_valid(mode))
+        return NULL;
+
+    r->rdma = rdma;
+
+    if (mode[0] == 'w') {
+        r->file = qemu_fopen_ops(r, &rdma_write_ops);
+    } else {
+        r->file = qemu_fopen_ops(r, &rdma_read_ops);
+    }
+
+    return r->file;
+}
+
+size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, ram_addr_t offset,
+                        int cont, size_t size, bool zero)
+{
+    int ret;
+    ram_addr_t current_addr = block_offset + offset;
+    QEMUFileRDMA * rfile = qemu_file_ops_are(f, &rdma_write_ops);
+    RDMAContext * rdma;
+
+    if (rfile) {
+        rdma = rfile->rdma;
+    } else {
+        return -ENOTSUP;
+    }
+
+    qemu_ftell(f); /* qemu_ftell() flushes any bytes still queued in the QEMUFile buffer */
+
+    if(zero)
+        return 0;
+
+    /*
+     * Add this page to the current 'chunk'. If the chunk
+     * is full, or the page doesn't belong to the current chunk,
+     * an actual RDMA write will occur and a new chunk will be formed.
+     */
+    if ((ret = qemu_rdma_write(f, rdma, current_addr, size)) < 0) {
+        fprintf(stderr, "rdma migration: write error! %d\n", ret);
+        return ret;
+    }
+
+    /*
+     * Drain the Completion Queue if possible.
+     * If not, the end of the iteration will do this
+     * again to make sure we don't overflow the
+     * request queue. 
+     */
+    while (1) {
+        int ret = qemu_rdma_poll(rdma);
+        if (ret == RDMA_WRID_NONE) {
+            break;
+        }
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: polling error! %d\n", ret);
+            return ret;
+        }
+    }
+
+    return size;
+}
+
+int qemu_rdma_accept(void * opaque)
+{
+    RDMAContext *rdma = opaque;
+    RDMAControlHeader head = { .len = remote_ram_blocks.remote_size, 
+                               .type = RDMA_CONTROL_RAM_BLOCKS,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    RDMACapabilities cap; 
+    const RDMACapabilities *test;
+    struct rdma_conn_param conn_param = { 
+                                            .responder_resources = 2,
+                                            .private_data = NULL,
+                                            .private_data_len = 0, 
+                                         };
+    struct rdma_cm_event *cm_event;
+    struct ibv_context *verbs;
+    int ret;
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        goto err_rdma_server_wait;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+        rdma_ack_cm_event(cm_event);
+        goto err_rdma_server_wait;
+    }
+
+    test = cm_event->param.conn.private_data;
+
+    if (test->version < RDMA_CONTROL_VERSION_MIN ||
+        test->version > RDMA_CONTROL_VERSION_MAX) {
+        fprintf(stderr, "Unknown client RDMA version: %d, bailing...\n",
+                test->version);
+        rdma_ack_cm_event(cm_event);
+        goto err_rdma_server_wait;
+    }
+
+    memcpy(&cap, test, MIN(test->len, cm_event->param.conn.private_data_len));
+
+    switch(test->version) {
+        //case RDMA_CONTROL_VERSION_2:
+        //    rdma->feature = cap.new_feature;
+        case RDMA_CONTROL_VERSION_1:
+            rdma->chunk_register_destination = cap.chunk_register_destination;
+            printf("Chunked registration: %d\n", rdma->chunk_register_destination);
+            break;
+        default:
+            fprintf(stderr, "Unknown client RDMA version: %d, bailing...\n",
+                            test->version);
+            rdma_ack_cm_event(cm_event);
+            goto err_rdma_server_wait;
+    }
+
+    rdma->cm_id = cm_event->id;
+    verbs = cm_event->id->verbs;
+
+    rdma_ack_cm_event(cm_event);
+
+    DPRINTF("verbs context after listen: %p\n", verbs);
+
+    if (!rdma->verbs) {
+        rdma->verbs = verbs;
+        ret = qemu_rdma_server_prepare(rdma, NULL);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error preparing server!\n");
+            goto err_rdma_server_wait;
+        }
+    } else if (rdma->verbs != verbs) {
+            fprintf(stderr, "ibv context not matching %p, %p!\n",
+                    rdma->verbs, verbs);
+            goto err_rdma_server_wait;
+    }
+
+    /* xxx destroy listen_id ??? */
+
+    qemu_set_fd_handler2(qemu_rdma_get_fd(rdma), NULL, NULL, NULL, NULL);
+
+    ret = qemu_rdma_alloc_qp(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating qp!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = rdma_accept(rdma->cm_id, &conn_param);
+    if (ret) {
+        fprintf(stderr, "rdma_accept returns %d!\n", ret);
+        goto err_rdma_server_wait;
+    }
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "rdma_accept get_cm_event failed %d!\n", ret);
+        goto err_rdma_server_wait;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        fprintf(stderr, "rdma_accept not event established!\n");
+        rdma_ack_cm_event(cm_event);
+        goto err_rdma_server_wait;
+    }
+
+    rdma_ack_cm_event(cm_event);
+
+    ret = qemu_rdma_post_recv_control(rdma, 0);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting initial control recv!\n");
+        goto err_rdma_server_wait;
+    }
+
+    if (!rdma->chunk_register_destination) {
+        ret = qemu_rdma_server_reg_ram_blocks(rdma, &local_ram_blocks);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error server registering ram blocks!");
+            goto err_rdma_server_wait;
+        }
+    }
+
+    qemu_rdma_copy_to_remote_ram_blocks(rdma, &local_ram_blocks, &remote_ram_blocks);
+
+    ret = qemu_rdma_post_send_control(rdma, (uint8_t *) remote_ram_blocks.remote_area, &head);
+
+    if(ret < 0) {
+        fprintf(stderr, "rdma migration: error sending remote info!");
+        goto err_rdma_server_wait;
+    }
+
+    qemu_rdma_dump_gid("server_connect", rdma->cm_id);
+
+    return 0;
+
+err_rdma_server_wait:
+    qemu_rdma_cleanup(rdma);
+    return ret;
+}
+
+/*
+ * During each iteration of the migration, we listen for instructions
+ * by the primary VM to perform dynamic page registrations before they
+ * can perform RDMA operations.
+ *
+ * We respond with the 'rkey'.
+ *
+ * Keep doing this until the primary tells us to stop.
+ */
+int qemu_rdma_handle_registrations(QEMUFile *f)
+{
+    RDMAControlHeader resp = { .len = sizeof(RDMARegisterResult),
+                               .type = RDMA_CONTROL_REGISTER_RESULT,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    QEMUFileRDMA * rfile = qemu_file_ops_are(f, &rdma_read_ops);
+    RDMAContext * rdma = rfile->rdma;
+    RDMAControlHeader head;
+    RDMARegister * reg;
+    RDMARegisterResult reg_result;
+    RDMALocalBlock *block;
+    uint64_t host_addr;
+    int ret;
+    int idx = 0;
+
+    DPRINTF("Waiting for next registration...\n");
+
+    do {
+        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_NONE);
+
+        if(ret < 0)
+            break;
+
+        switch(head.type) {
+            case RDMA_CONTROL_REGISTER_FINISHED:
+                DPRINTF("Current registrations complete.\n");
+                return 0;
+            case RDMA_CONTROL_REGISTER_REQUEST:
+                reg = (RDMARegister *) rdma->wr_data[idx].control_curr;
+
+                DPRINTF("Registration request: %" PRId64 
+                    " bytes, index %d, offset %" PRId64 "\n", 
+                    reg->len, reg->current_index, reg->offset);
+
+                block = &(local_ram_blocks.block[reg->current_index]);
+                host_addr = (uint64_t)(block->local_host_addr + (reg->offset - block->offset));
+                if (qemu_rdma_register_and_get_keys(rdma, block, host_addr, NULL, &reg_result.rkey)) {
+                    fprintf(stderr, "cannot get rkey!\n");
+                    return -EINVAL;
+                }
+
+                DPRINTF("Registered rkey for this request: %x\n", reg_result.rkey);
+                ret = qemu_rdma_post_send_control(rdma, (uint8_t *) &reg_result, &resp);
+
+                if(ret < 0) {
+                    fprintf(stderr, "Failed to send control buffer!\n");
+                    return ret;
+                }
+                break;
+            case RDMA_CONTROL_REGISTER_RESULT:
+                fprintf(stderr, "Invalid RESULT message at server.\n");
+                return -EIO;
+            default:
+                fprintf(stderr, "Unknown control message %s\n", control_desc[head.type]);
+                return -EIO;
+        }
+
+    } while(1);
+
+    return ret;
+}
+
+/*
+ * Inform the server that we've finished dynamic page registrations for the
+ * current migration iteration.
+ */
+int qemu_rdma_finish_registrations(QEMUFile *f)
+{
+    QEMUFileRDMA * rfile = qemu_file_ops_are(f, &rdma_write_ops);
+    RDMAContext * rdma = rfile->rdma;
+    RDMAControlHeader head = { .len = 0,
+                               .type = RDMA_CONTROL_REGISTER_FINISHED,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+
+    DPRINTF("Sending registration finish...\n");
+
+    return qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL);
+}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v5: 06/12] connection-establishment for RDMA
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (4 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 05/12] core RDMA migration logic w/ new protocol mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors " mrhines
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Introduce migration-rdma.c, which performs connection establishment
for both incoming and outgoing RDMA migration.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c |  121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)
 create mode 100644 migration-rdma.c

diff --git a/migration-rdma.c b/migration-rdma.c
new file mode 100644
index 0000000..2a0becd
--- /dev/null
+++ b/migration-rdma.c
@@ -0,0 +1,121 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2010 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "migration/rdma.h"
+#include "qemu-common.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <arpa/inet.h>
+#include <string.h>
+
+//#define DEBUG_MIGRATION_RDMA
+
+#ifdef DEBUG_MIGRATION_RDMA
+#define DPRINTF(fmt, ...) \
+    do { printf("migration-rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+static void rdma_accept_incoming_migration(void *opaque)
+{
+    int ret;
+    QEMUFile *f;
+
+    DPRINTF("Accepting rdma connection...\n");
+
+    if ((ret = qemu_rdma_accept(opaque))) {
+        fprintf(stderr, "RDMA Migration initialization failed!\n");
+        goto err;
+    }
+
+    DPRINTF("Accepted migration\n");
+
+    f = qemu_fopen_rdma(opaque, "rb");
+    if (f == NULL) {
+        fprintf(stderr, "could not qemu_fopen_rdma!\n");
+        goto err;
+    }
+
+    process_incoming_migration(f);
+    return;
+
+err:
+    qemu_rdma_cleanup(opaque);
+}
+
+void rdma_start_incoming_migration(const char * host_port, Error **errp)
+{
+    int ret;
+    void *opaque;
+
+    DPRINTF("Starting RDMA-based incoming migration\n");
+
+    if ((opaque = qemu_rdma_data_init(host_port, errp)) == NULL) {
+        return;
+    }
+
+    ret = qemu_rdma_server_init(opaque, NULL);
+
+    if (!ret) {
+        DPRINTF("qemu_rdma_server_init success\n");
+        ret = qemu_rdma_server_prepare(opaque, NULL);
+
+        if (!ret) {
+            DPRINTF("qemu_rdma_server_prepare success\n");
+
+            qemu_set_fd_handler2(qemu_rdma_get_fd(opaque), NULL, 
+                                 rdma_accept_incoming_migration, NULL,
+                                    (void *)(intptr_t) opaque);
+            return;
+        }
+    }
+
+    g_free(opaque);
+}
+
+void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp)
+{
+    MigrationState *s = opaque;
+    void *rdma_opaque = NULL;
+    int ret;
+
+    if ((rdma_opaque = qemu_rdma_data_init(host_port, errp)) == NULL)
+        return; 
+
+    ret = qemu_rdma_client_init(rdma_opaque, NULL,
+        s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION]);
+
+    if(!ret) {
+        DPRINTF("qemu_rdma_client_init success\n");
+        ret = qemu_rdma_connect(rdma_opaque, NULL);
+
+        if(!ret) {
+            s->file = qemu_fopen_rdma(rdma_opaque, "wb");
+            DPRINTF("qemu_rdma_client_connect success\n");
+            migrate_fd_connect(s);
+            return;
+        }
+    }
+
+    g_free(rdma_opaque);
+    migrate_fd_error(s);
+}
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors for RDMA
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (5 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 06/12] connection-establishment for RDMA mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 17:03   ` Paolo Bonzini
  2013-04-09 17:31   ` Peter Maydell
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma' mrhines
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

1. qemu_file_ops_are()
2. qemu_file_update_position()    (for f->pos)

These ops tables also need to live here so that qemu_file_ops_are()
can reference them:
rdma_read_ops
rdma_write_ops

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 savevm.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 48 insertions(+), 9 deletions(-)

diff --git a/savevm.c b/savevm.c
index b1d8988..0f5c7aa 100644
--- a/savevm.c
+++ b/savevm.c
@@ -32,6 +32,7 @@
 #include "qemu/timer.h"
 #include "audio/audio.h"
 #include "migration/migration.h"
+#include "migration/rdma.h"
 #include "qemu/sockets.h"
 #include "qemu/queue.h"
 #include "sysemu/cpus.h"
@@ -409,16 +410,24 @@ static const QEMUFileOps socket_write_ops = {
     .close =      socket_close
 };
 
-QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+bool qemu_file_mode_is_not_valid(const char * mode)
 {
-    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
-
     if (mode == NULL ||
         (mode[0] != 'r' && mode[0] != 'w') ||
         mode[1] != 'b' || mode[2] != 0) {
         fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
-        return NULL;
+        return true;
     }
+
+    return false;
+}
+
+QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+{
+    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
+
+    if (qemu_file_mode_is_not_valid(mode)) {
+        return NULL;
+    }
 
     s->fd = fd;
     if (mode[0] == 'w') {
@@ -430,16 +439,27 @@ QEMUFile *qemu_fopen_socket(int fd, const char *mode)
     return s->file;
 }
 
+/*
+ * These have to be here for qemu_file_ops_are()
+ * The function pointers compile to NULL if
+ * RDMA is disabled at configure time. 
+ */
+const QEMUFileOps rdma_read_ops = {
+    .get_buffer = qemu_rdma_get_buffer,
+    .close =      qemu_rdma_close,
+};
+
+const QEMUFileOps rdma_write_ops = {
+    .put_buffer = qemu_rdma_put_buffer,
+    .close =      qemu_rdma_close,
+};
+
 QEMUFile *qemu_fopen(const char *filename, const char *mode)
 {
     QEMUFileStdio *s;
 
-    if (mode == NULL ||
-	(mode[0] != 'r' && mode[0] != 'w') ||
-	mode[1] != 'b' || mode[2] != 0) {
-        fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
+    if(qemu_file_mode_is_not_valid(mode))
         return NULL;
-    }
 
     s = g_malloc0(sizeof(QEMUFileStdio));
 
@@ -790,6 +810,17 @@ int qemu_get_byte(QEMUFile *f)
     return result;
 }
 
+/*
+ * Validate which operations are actually in use
+ * before attempting to access opaque data.
+ */
+void * qemu_file_ops_are(QEMUFile *f, const QEMUFileOps *ops)
+{
+    if (f->ops == ops)
+        return f->opaque;
+    return NULL;
+}
+
 int64_t qemu_ftell(QEMUFile *f)
 {
     qemu_fflush(f);
@@ -807,6 +838,14 @@ int qemu_file_rate_limit(QEMUFile *f)
     return 0;
 }
 
+/*
+ * For users, like RDMA, that don't go through the QEMUFile buffer directly.
+ */
+void qemu_file_update_position(QEMUFile *f, int64_t inc)
+{
+    f->pos += inc;
+}
+
 int64_t qemu_file_get_rate_limit(QEMUFile *f)
 {
     return f->xfer_limit;
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (6 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors " mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 17:01   ` Paolo Bonzini
  2013-04-09 17:02   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 09/12] transmit pc.ram using RDMA mrhines
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

1. capability for zero pages (enabled by default)
2. capability for dynamic server chunk registration (disabled by default)

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration.c |   41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/migration.c b/migration.c
index 3b4b467..f01efa9 100644
--- a/migration.c
+++ b/migration.c
@@ -15,6 +15,7 @@
 
 #include "qemu-common.h"
 #include "migration/migration.h"
+#include "migration/rdma.h"
 #include "monitor/monitor.h"
 #include "migration/qemu-file.h"
 #include "sysemu/sysemu.h"
@@ -68,6 +69,18 @@ MigrationState *migrate_get_current(void)
         .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
     };
 
+    static bool first_time = true;
+
+    /*
+     * Historically, checking for zeros is enabled
+     * by default. Require the user to disable it
+     * (for example RDMA), if they really want to.
+     */
+    if (first_time) {
+        current_migration.enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true;
+        first_time = false;
+    }
+
     return &current_migration;
 }
 
@@ -77,6 +90,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
 
     if (strstart(uri, "tcp:", &p))
         tcp_start_incoming_migration(p, errp);
+    else if (strstart(uri, "rdma:", &p))
+        rdma_start_incoming_migration(p, errp);
 #if !defined(WIN32)
     else if (strstart(uri, "exec:", &p))
         exec_start_incoming_migration(p, errp);
@@ -120,7 +135,6 @@ void process_incoming_migration(QEMUFile *f)
     Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
     int fd = qemu_get_fd(f);
 
-    assert(fd != -1);
     qemu_set_nonblock(fd);
     qemu_coroutine_enter(co, f);
 }
@@ -405,6 +419,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
     if (strstart(uri, "tcp:", &p)) {
         tcp_start_outgoing_migration(s, p, &local_err);
+    } else if (strstart(uri, "rdma:", &p)) {
+        rdma_start_outgoing_migration(s, p, &local_err);
 #if !defined(WIN32)
     } else if (strstart(uri, "exec:", &p)) {
         exec_start_outgoing_migration(s, p, &local_err);
@@ -474,6 +490,24 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+bool migrate_chunk_register_destination(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION];
+}
+
+bool migrate_check_for_zero(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO];
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
@@ -546,8 +580,9 @@ static void *migration_thread(void *opaque)
             max_size = bandwidth * migrate_max_downtime() / 1000000;
 
             DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
-                    " bandwidth %g max_size %" PRId64 "\n",
-                    transferred_bytes, time_spent, bandwidth, max_size);
+                    " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
+                    transferred_bytes, time_spent, 
+                    bandwidth, Mbps(transferred_bytes, time_spent), max_size);
             /* if we haven't sent anything, we don't want to recalculate
                10000 is a small enough number for our purposes */
             if (s->dirty_bytes_rate && transferred_bytes > 10000) {
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 09/12] transmit pc.ram using RDMA
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (7 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma' mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 16:50   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 10/12] new header file prototypes for savevm.c mrhines
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Hook save_rdma_page() into ram_save_block() so that pc.ram pages are
transmitted as RDMA writes instead of through the QEMUFile buffer.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c |   59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index c2cbc71..5cf7509 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -45,6 +45,7 @@
 #include "exec/address-spaces.h"
 #include "hw/pcspk.h"
 #include "migration/page_cache.h"
+#include "migration/rdma.h"
 #include "qemu/config-file.h"
 #include "qmp-commands.h"
 #include "trace.h"
@@ -115,6 +116,7 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
+#define RAM_SAVE_FLAG_RDMA     0x80 /* Do server dynamic RDMA registrations */
 
 
 static struct defconfig_file {
@@ -447,15 +449,23 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
                 ram_bulk_stage = false;
             }
         } else {
+            bool zero;
             uint8_t *p;
             int cont = (block == last_sent_block) ?
                 RAM_SAVE_FLAG_CONTINUE : 0;
 
             p = memory_region_get_ram_ptr(mr) + offset;
 
+            /* use capability now, defaults to true */
+            zero = migrate_check_for_zero() ? is_zero_page(p) : false;
+
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_zero_page(p)) {
+            if ((bytes_sent = save_rdma_page(f, block->offset, 
+                            offset, cont, TARGET_PAGE_SIZE, zero)) >= 0) {
+                acct_info.norm_pages++;
+                qemu_file_update_position(f, bytes_sent);
+            } else if (zero) {
                 acct_info.dup_pages++;
                 if (!ram_bulk_stage) {
                     bytes_sent = save_block_hdr(f, block, offset, cont,
@@ -476,7 +486,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             }
 
             /* XBZRLE overflow or normal page */
-            if (bytes_sent == -1) {
+            if (bytes_sent == -1 || bytes_sent == -ENOTSUP) {
                 bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
                 qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
                 bytes_sent += TARGET_PAGE_SIZE;
@@ -603,6 +613,33 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     return 0;
 }
 
+/* 
+ * Inform server to begin handling dynamic page registrations
+ */
+static void ram_registration_start(QEMUFile *f)
+{
+    if(qemu_file_ops_are(f, &rdma_write_ops)) {
+        qemu_put_be64(f, RAM_SAVE_FLAG_RDMA);
+    }
+}
+
+/*
+ * Inform server that dynamic registrations are done for now.
+ * First, flush writes, if any.
+ */
+static int ram_registration_stop(QEMUFile *f)
+{
+    int ret = 0;
+
+    if (qemu_file_ops_are(f, &rdma_write_ops)) {
+        ret = qemu_rdma_drain_cq(f);
+        if(ret >= 0)
+            ret = qemu_rdma_finish_registrations(f);
+    }
+
+    return ret;
+}
+
 static int ram_save_iterate(QEMUFile *f, void *opaque)
 {
     int ret;
@@ -616,6 +653,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         reset_ram_globals();
     }
 
+    ram_registration_start(f);
+
     t0 = qemu_get_clock_ns(rt_clock);
     i = 0;
     while ((ret = qemu_file_rate_limit(f)) == 0) {
@@ -646,6 +685,9 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 
     qemu_mutex_unlock_ramlist();
 
+    if(ret >= 0)
+        ret = ram_registration_stop(f);
+
     if (ret < 0) {
         bytes_transferred += total_sent;
         return ret;
@@ -660,8 +702,11 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 
 static int ram_save_complete(QEMUFile *f, void *opaque)
 {
+    int ret = 0;
+
     qemu_mutex_lock_ramlist();
     migration_bitmap_sync();
+    ram_registration_start(f);
 
     /* try transferring iterative blocks of memory */
 
@@ -676,12 +721,15 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         }
         bytes_transferred += bytes_sent;
     }
+
+    ret = ram_registration_stop(f);
+
     migration_end();
 
     qemu_mutex_unlock_ramlist();
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 
-    return 0;
+    return ret;
 }
 
 static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
@@ -864,6 +912,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                 ret = -EINVAL;
                 goto done;
             }
+        } else if ((flags & RAM_SAVE_FLAG_RDMA) &&
+                          qemu_file_ops_are(f, &rdma_read_ops)) {
+            ret = qemu_rdma_handle_registrations(f);
+            if(ret < 0)
+                goto done;
         }
         error = qemu_file_get_error(f);
         if (error) {
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 10/12] new header file prototypes for savevm.c
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (8 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 09/12] transmit pc.ram using RDMA mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 16:43   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 11/12] update schema to define new capabilities mrhines
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/migration.h |    3 +++
 include/migration/qemu-file.h |    3 +++
 2 files changed, 6 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..40de049 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,7 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_check_for_zero(void);
+bool migrate_chunk_register_destination(void);
 #endif
diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 623c434..4ee0ed2 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -80,6 +80,9 @@ void qemu_put_byte(QEMUFile *f, int v);
  * The buffer should be available till it is sent asynchronously.
  */
 void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, int size);
+void *qemu_file_ops_are(QEMUFile *f, const QEMUFileOps *ops);
+bool qemu_file_mode_is_not_valid(const char * mode);
+void qemu_file_update_position(QEMUFile *f, int64_t inc);
 
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 11/12] update schema to define new capabilities
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (9 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 10/12] new header file prototypes for savevm.c mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 16:43   ` Paolo Bonzini
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 12/12] don't set nonblock on invalid file descriptor mrhines
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>


Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 qapi-schema.json |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index db542f6..7ebcf99 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -602,7 +602,7 @@
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'check_for_zero', 'chunk_register_destination'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.10.4


* [Qemu-devel] [RFC PATCH RDMA support v5: 12/12] don't set nonblock on invalid file descriptor
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (10 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 11/12] update schema to define new capabilities mrhines
@ 2013-04-09  3:04 ` mrhines
  2013-04-09 16:45   ` Paolo Bonzini
  2013-04-09  4:24 ` [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design Michael R. Hines
  2013-04-09 12:44 ` Michael S. Tsirkin
  13 siblings, 1 reply; 97+ messages in thread
From: mrhines @ 2013-04-09  3:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

migration.c thinks this is an error for RDMA, but it's not.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 util/oslib-posix.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 4e4b819..0b398f4 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -144,6 +144,8 @@ void qemu_set_block(int fd)
 void qemu_set_nonblock(int fd)
 {
     int f;
+    if(fd == -1)
+        return;
     f = fcntl(fd, F_GETFL);
     fcntl(fd, F_SETFL, f | O_NONBLOCK);
 }
-- 
1.7.10.4


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (11 preceding siblings ...)
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 12/12] don't set nonblock on invalid file descriptor mrhines
@ 2013-04-09  4:24 ` Michael R. Hines
  2013-04-09 12:44 ` Michael S. Tsirkin
  13 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-09  4:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

FYI: Testable patchset can be found here: github.com:hinesmr/qemu.git, 
'rdma' branch

- Michael

On 04/08/2013 11:04 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v4:
>
> - Created a "formal" protocol for the RDMA control channel
> - Dynamic, chunked page registration now implemented on *both* the server and client
> - Created new 'capability' for page registration
> - Created new 'capability' for is_zero_page() - enabled by default
>    (needed to test dynamic page registration)
> - Created version-check before protocol begins at connection-time
> - no more migrate_use_rdma() !
>
> NOTE: While dynamic registration works on both sides now,
>        it does *not* work with cgroups swap limits. This functionality with infiniband
>        remains broken. (It works fine with TCP). So, in order to take full
>        advantage of this feature, a fix will have to be developed on the kernel side.
>        The proposed alternative is to use /dev/<pid>/pagemap; a patch will be submitted.
>
> Contents:
> =================================
> * Compiling
> * Running (please read this before running)
> * RDMA Protocol Description
> * Versioning
> * QEMUFileRDMA Interface
> * Migration of pc.ram
> * Error handling
> * TODO
> * Performance
>
> COMPILING:
> ===============================
>
> $ ./configure --enable-rdma --target-list=x86_64-softmmu
> $ make
>
> RUNNING:
> ===============================
>
> First, decide if you want dynamic page registration on the server-side.
> This always happens on the primary-VM side, but is optional on the server.
> Doing this allows you to support overcommit (such as cgroups or ballooning)
> with a smaller footprint on the server-side without having to register the
> entire VM memory footprint.
> NOTE: This significantly slows down performance (about 30% slower).
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
>
> Next, if you decided *not* to use chunked registration on the server,
> it is recommended to also disable zero page detection. While this is not
> strictly necessary, zero page detection also significantly slows down
> performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_capability check_for_zero off" # always enabled by default
>
> Finally, set the migration speed to match your hardware's capabilities:
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>
> Finally, perform the actual migration:
>
> $ virsh migrate domain rdma:xx.xx.xx.xx:port
>
> RDMA Protocol Description:
> =================================
>
> Migration with RDMA is separated into two parts:
>
> 1. The transmission of the pages using RDMA
> 2. Everything else (a control channel is introduced)
>
> "Everything else" is transmitted using a formal
> protocol now, consisting of infiniband SEND / RECV messages.
>
> An infiniband SEND message is the standard ibverbs
> message used by applications of infiniband hardware.
> The only difference between a SEND message and an RDMA
> message is that SEND messages cause completion notifications
> to be posted to the completion queue (CQ) on the
> infiniband receiver side, whereas RDMA messages (used
> for pc.ram) do not (to behave like an actual DMA).
>      
> Messages in infiniband require two things:
>
> 1. registration of the memory that will be transmitted
> 2. (SEND/RECV only) work requests to be posted on both
>     sides of the network before the actual transmission
>     can occur.
>
> RDMA messages are much easier to deal with. Once the memory
> on the receiver side is registered and pinned, we're
> basically done. All that is required is for the sender
> side to start dumping bytes onto the link.
>
> SEND messages require more coordination because the
> receiver must have reserved space (using a receive
> work request) on the receive queue (RQ) before QEMUFileRDMA
> can start using them to carry all the bytes as
> a transport for migration of device state.
>
> To begin the migration, the initial connection setup is
> as follows (migration-rdma.c):
>
> 1. Receiver and Sender are started (command line or libvirt):
> 2. Both sides post two RQ work requests
> 3. Receiver does listen()
> 4. Sender does connect()
> 5. Receiver accept()
> 6. Check versioning and capabilities (described later)
>
> At this point, we define a control channel on top of SEND messages
> which is described by a formal protocol. Each SEND message has a
> header portion and a data portion (the two are transmitted
> together as a single SEND message).
>
> Header:
>      * Length  (of the data portion)
>      * Type    (what command to perform, described below)
>      * Version (protocol version validated before send/recv occurs)
>
> The 'type' field has 7 different command values:
>      1. None
>      2. Ready             (control-channel is available)
>      3. QEMU File         (for sending non-live device state)
>      4. RAM Blocks        (used right after connection setup)
>      5. Register request  (dynamic chunk registration)
>      6. Register result   ('rkey' to be used by sender)
>      7. Register finished (registration for current iteration finished)
>
> After connection setup is completed, we have two protocol-level
> functions, responsible for communicating control-channel commands
> using the above list of values:
>
> Logically:
>
> qemu_rdma_exchange_recv(header, expected command type)
>
> 1. We transmit a READY command to let the sender know that
>     we are *ready* to receive some data bytes on the control channel.
> 2. Before attempting to receive the expected command, we post another
>     RQ work request to replace the one we just used up.
> 3. Block on a CQ event channel and wait for the SEND to arrive.
> 4. When the send arrives, librdmacm will unblock us.
> 5. Verify that the command-type and version received matches the one we expected.
>
> qemu_rdma_exchange_send(header, data, optional response header & data):
>
> 1. Block on the CQ event channel waiting for a READY command
>     from the receiver to tell us that the receiver
>     is *ready* for us to transmit some new bytes.
> 2. Optionally: if we are expecting a response from the command
>     (that we have not yet transmitted), we post an RQ
>     work request to receive that data a few moments later.
> 3. When the READY arrives, librdmacm will
>     unblock us and we immediately post a RQ work request
>     to replace the one we just used up.
> 4. Now, we can actually post the work request to SEND
>     the requested command type of the header we were asked for.
> 5. Optionally, if we are expecting a response (as before),
>     we block again and wait for that response using the additional
>     work request we previously posted. (This is used to carry
>     'Register result' commands (#6) back to the sender, which
>     hold the rkey needed to perform RDMA.)
>
> All of the remaining command types (other than 'Ready')
> use the aforementioned two functions to do the hard work:
>
> 1. After connection setup, RAMBlock information is exchanged using
>     this protocol before the actual migration begins.
> 2. During runtime, once a 'chunk' becomes full of pages ready to
>     be sent with RDMA, the registration commands are used to ask the
>     other side to register the memory for this chunk and respond
>     with the result (rkey) of the registration.
> 3. The QEMUFile interfaces also call these functions (described below)
>     when transmitting non-live state, such as devices or to send
>     its own protocol information during the migration process.
>
> Versioning
> ==================================
>
> librdmacm provides the user with a 'private data' area to be exchanged
> at connection-setup time before any infiniband traffic is generated.
>
> This is a convenient place to check for protocol versioning because the
> user does not need to register memory to transmit a few bytes of version
> information.
>
> This is also a convenient place to negotiate capabilities
> (like dynamic page registration).
>
> If the version is invalid, we throw an error.
>
> If the version is new, we only negotiate the capabilities that the
> requested version is able to perform and ignore the rest.
>
> QEMUFileRDMA Interface:
> ==================================
>
> QEMUFileRDMA introduces a couple of new functions:
>
> 1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> 2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
>
> These two functions are very short and simply use the protocol
> described above to deliver bytes without changing the upper-level
> users of QEMUFile, which depend on a bytestream abstraction.
>
> Finally, how do we handoff the actual bytes to get_buffer()?
>
> Again, because we're trying to "fake" a bytestream abstraction
> using an analogy not unlike individual UDP frames, we have
> to hold on to the bytes received from the control-channel's SEND
> messages in memory.
>
> Each time we receive a complete "QEMU File" control-channel
> message, the bytes from SEND are copied into a small local holding area.
>
> Then, we return the number of bytes requested by get_buffer()
> and leave the remaining bytes in the holding area until get_buffer()
> comes around for another pass.
>
> If the buffer is empty, then we follow the same steps
> listed above and issue another "QEMU File" protocol command,
> asking for a new SEND message to re-fill the buffer.
>
> Migration of pc.ram:
> ===============================
>
> At the beginning of the migration, (migration-rdma.c),
> the sender and the receiver populate the list of RAMBlocks
> to be registered with each other into a structure.
> Then, using the aforementioned protocol, they exchange a
> description of these blocks with each other, to be used later
> during the iteration of main memory. This description includes
> a list of all the RAMBlocks with their offsets and lengths, and,
> if dynamic page registration was disabled on the server-side,
> the pre-registered RDMA keys for those blocks.
>
> Main memory is not migrated with the aforementioned protocol,
> but is instead migrated with normal RDMA Write operations.
>
> Pages are migrated in "chunks" (about 1 Megabyte right now).
> Chunk size is not dynamic, but it could be in a future implementation.
> There's nothing to indicate that this is useful right now.
>
> When a chunk is full (or a flush() occurs), the memory backed by
> the chunk is registered with librdmacm and pinned in memory on
> both sides using the aforementioned protocol.
>
> After pinning, an RDMA Write is generated and transmitted
> for the entire chunk.
>
> Chunks are also transmitted in batches: This means that we
> do not request that the hardware signal the completion queue
> for the completion of *every* chunk. The current batch size
> is about 64 chunks (corresponding to 64 MB of memory).
> Only the last chunk in a batch must be signaled.
> This helps keep everything as asynchronous as possible
> and helps keep the hardware busy performing RDMA operations.
>
> Error-handling:
> ===============================
>
> Infiniband has what is called a "Reliable, Connected"
> link (one of 4 choices). This is the mode we use
> for RDMA migration.
>
> If a *single* message fails,
> the decision is to abort the migration entirely and
> cleanup all the RDMA descriptors and unregister all
> the memory.
>
> After cleanup, the Virtual Machine is returned to normal
> operation the same way that would happen if the TCP
> socket is broken during a non-RDMA based migration.
>
> TODO:
> =================================
> 1. Currently, cgroups swap limits for *both* TCP and RDMA
>     on the sender-side is broken. This is more poignant for
>     RDMA because RDMA requires memory registration.
>     Fixing this requires infiniband page registrations to be
>     zero-page aware, and this does not yet work properly.
> 2. Currently overcommit for the *receiver* side of
>     TCP works, but not for RDMA. While dynamic page registration
>     *does* work, it is only useful if the is_zero_page() capability
>     remains enabled (which it is by default).
>     However, leaving this capability turned on *significantly* slows
>     down the RDMA throughput, particularly on hardware capable
>     of transmitting faster than 10 gbps (such as 40gbps links).
> 3. Use of the recent /dev/<pid>/pagemap would likely solve some
>     of these problems.
> 4. Some form of balloon-device usage tracking would also
>     help alleviate some of these issues.
>
> PERFORMANCE
> ===================
>
> Using a 40 gbps infiniband link, performing a worst-case stress test:
>
> 1. Average worst-case RDMA throughput with
>    $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>    approximately 30 gbps (a little better than the paper)
> 2. Average worst-case TCP throughput with the same stress test:
>    approximately 8 gbps (using IPoIB, IP over Infiniband)
>
> Average downtime (stop time) ranges between 28 and 33 milliseconds.
>
> An *exhaustive* paper (2010) shows additional performance details
> linked on the QEMU wiki:
>
> http://wiki.qemu.org/Features/RDMALiveMigration
>
>


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design
  2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
                   ` (12 preceding siblings ...)
  2013-04-09  4:24 ` [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design Michael R. Hines
@ 2013-04-09 12:44 ` Michael S. Tsirkin
  2013-04-09 14:23   ` Michael R. Hines
  13 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-09 12:44 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Mon, Apr 08, 2013 at 11:04:29PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>

off topic - did you have a chance to try my patches
that enable overcommit on source?
Need some response asap if we want it in 3.10


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design
  2013-04-09 12:44 ` Michael S. Tsirkin
@ 2013-04-09 14:23   ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-09 14:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

Yes, I sent several responses to the list some days ago =)

On 04/09/2013 08:44 AM, Michael S. Tsirkin wrote:
> On Mon, Apr 08, 2013 at 11:04:29PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
> off topic - did you have a chance to try my patches
> that enable overcommit on source?
> Need some response asap if we want it in 3.10
>
>


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 10/12] new header file prototypes for savevm.c
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 10/12] new header file prototypes for savevm.c mrhines
@ 2013-04-09 16:43   ` Paolo Bonzini
  0 siblings, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 16:43 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  include/migration/migration.h |    3 +++
>  include/migration/qemu-file.h |    3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e2acec6..40de049 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -127,4 +127,7 @@ int migrate_use_xbzrle(void);
>  int64_t migrate_xbzrle_cache_size(void);
>  
>  int64_t xbzrle_cache_resize(int64_t new_size);
> +
> +bool migrate_check_for_zero(void);
> +bool migrate_chunk_register_destination(void);
>  #endif
> diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
> index 623c434..4ee0ed2 100644
> --- a/include/migration/qemu-file.h
> +++ b/include/migration/qemu-file.h
> @@ -80,6 +80,9 @@ void qemu_put_byte(QEMUFile *f, int v);
>   * The buffer should be available till it is sent asynchronously.
>   */
>  void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, int size);
> +void *qemu_file_ops_are(QEMUFile *f, const QEMUFileOps *ops);
> +bool qemu_file_mode_is_not_valid(const char * mode);
> +void qemu_file_update_position(QEMUFile *f, int64_t inc);
>  
>  static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
>  {
> 

This should be placed together with the patch that adds them (and in
turn the patch that adds them should be before the patch that uses them).

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 11/12] update schema to define new capabilities
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 11/12] update schema to define new capabilities mrhines
@ 2013-04-09 16:43   ` Paolo Bonzini
  0 siblings, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 16:43 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  qapi-schema.json |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/qapi-schema.json b/qapi-schema.json
> index db542f6..7ebcf99 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -602,7 +602,7 @@
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle'] }
> +  'data': ['xbzrle', 'check_for_zero', 'chunk_register_destination'] }
>  
>  ##
>  # @MigrationCapabilityStatus
> 

Same comment as for patch 10.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 12/12] don't set nonblock on invalid file descriptor
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 12/12] don't set nonblock on invalid file descriptor mrhines
@ 2013-04-09 16:45   ` Paolo Bonzini
  0 siblings, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 16:45 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> index 4e4b819..0b398f4 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -144,6 +144,8 @@ void qemu_set_block(int fd)
>  void qemu_set_nonblock(int fd)
>  {
>      int f;
> +    if(fd == -1)
> +        return;
>      f = fcntl(fd, F_GETFL);
>      fcntl(fd, F_SETFL, f | O_NONBLOCK);
>  }
> -- 1.7.10.4

Do you need to remove the assertion in process_incoming_migration, too?

I would prefer to touch process_incoming_migration instead, actually,
replacing the assertion with an if() there.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 02/12] check for CONFIG_RDMA
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 02/12] check for CONFIG_RDMA mrhines
@ 2013-04-09 16:46   ` Paolo Bonzini
  0 siblings, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 16:46 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Make both rdma.c and migration-rdma.c conditionally built.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  Makefile.objs |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Makefile.objs b/Makefile.objs
> index e568c01..32f39d3 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -49,6 +49,7 @@ common-obj-$(CONFIG_POSIX) += os-posix.o
>  common-obj-$(CONFIG_LINUX) += fsdev/
>  
>  common-obj-y += migration.o migration-tcp.o
> +common-obj-$(CONFIG_RDMA) += migration-rdma.o rdma.o
>  common-obj-y += qemu-char.o #aio.o
>  common-obj-y += block-migration.o
>  common-obj-y += page_cache.o xbzrle.o
> 

Please add this line (and subsequently modify it) on the patches that
add the files.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 09/12] transmit pc.ram using RDMA
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 09/12] transmit pc.ram using RDMA mrhines
@ 2013-04-09 16:50   ` Paolo Bonzini
  0 siblings, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 16:50 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> 
> +            if ((bytes_sent = save_rdma_page(f, block->offset, 
> +                            offset, cont, TARGET_PAGE_SIZE, zero)) >= 0) {
> +                acct_info.norm_pages++;
> +                qemu_file_update_position(f, bytes_sent);
> +            } else if (zero) {

I think this should become a new QEMUFileOps member, save_ram_page.  If
NULL it can return ENOTSUP.

> +/* 
> + * Inform server to begin handling dynamic page registrations
> + */
> +static void ram_registration_start(QEMUFile *f)
> +{
> +    if(qemu_file_ops_are(f, &rdma_write_ops)) {
> +        qemu_put_be64(f, RAM_SAVE_FLAG_RDMA);
> +    }
> +}
> +
> +/*
> + * Inform server that dynamic registrations are done for now.
> + * First, flush writes, if any.
> + */
> +static int ram_registration_stop(QEMUFile *f)
> +{
> +    int ret = 0;
> +
> +    if (qemu_file_ops_are(f, &rdma_write_ops)) {
> +        ret = qemu_rdma_drain_cq(f);
> +        if(ret >= 0)
> +            ret = qemu_rdma_finish_registrations(f);
> +    }
> +
> +    return ret;

I think this should become two QEMUFileOps instead: before_ram_iterate
and after_ram_iterate, or something like that.  Again, if NULL they
should just do nothing.

Errors from the callback should be persisted in the QEMUFile with
qemu_file_set_error.  Then you do not need any checks in the caller.
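
A sketch of the hook-plus-sticky-error pattern described above (types and names are hypothetical stand-ins, not the eventual QEMU code): a NULL hook is a no-op, and a failure is recorded in the QEMUFile rather than checked at every call site.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the optional hook scheme; the real QEMU
 * types and member names differ. */
typedef struct QEMUFile QEMUFile;

typedef struct QEMUFileOps {
    int (*before_ram_iterate)(QEMUFile *f);
    int (*after_ram_iterate)(QEMUFile *f);
} QEMUFileOps;

struct QEMUFile {
    const QEMUFileOps *ops;
    int last_error;                 /* sticky error, first one wins */
};

static void qemu_file_set_error(QEMUFile *f, int ret)
{
    if (f->last_error == 0) {
        f->last_error = ret;
    }
}

/* A NULL hook just does nothing; errors are persisted in the QEMUFile
 * instead of being returned to every caller. */
static void ram_control_after_iterate(QEMUFile *f)
{
    if (f->ops->after_ram_iterate) {
        int ret = f->ops->after_ram_iterate(f);
        if (ret < 0) {
            qemu_file_set_error(f, ret);
        }
    }
}
```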

Paolo

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 05/12] core RDMA migration logic w/ new protocol
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 05/12] core RDMA migration logic w/ new protocol mrhines
@ 2013-04-09 16:57   ` Paolo Bonzini
  0 siblings, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 16:57 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> +void qemu_rdma_disable(void *opaque);
> +void qemu_rdma_cleanup(void *opaque);
> +int qemu_rdma_client_init(void *opaque, Error **errp,
> +            bool chunk_register_destination);
> +int qemu_rdma_connect(void *opaque, Error **errp);
> +void *qemu_rdma_data_init(const char *host_port, Error **errp);
> +int qemu_rdma_server_init(void *opaque, Error **errp);
> +int qemu_rdma_server_prepare(void *opaque, Error **errp);
> +int qemu_rdma_drain_cq(QEMUFile *f);
> +int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, 
> +                            int64_t pos, int size);
> +int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size);
> +int qemu_rdma_close(void *opaque);
> +size_t save_rdma_page(QEMUFile *f, ram_addr_t block_offset, 
> +            ram_addr_t offset, int cont, size_t size, bool zero);
> +void *qemu_fopen_rdma(void *opaque, const char * mode);
> +int qemu_rdma_get_fd(void *opaque);
> +int qemu_rdma_accept(void *opaque);
> +void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp);
> +void rdma_start_incoming_migration(const char * host_port, Error **errp);
> +int qemu_rdma_handle_registrations(QEMUFile *f);
> +int qemu_rdma_finish_registrations(QEMUFile *f);

I think you have accumulated some dead code, for example qemu_rdma_disable.

Also, most of these functions can be static now.

> +#else /* !defined(CONFIG_RDMA) */
> +#define NOT_CONFIGURED() do { printf("WARN: RDMA is not configured\n"); } while(0)
> +#define qemu_rdma_cleanup(...) NOT_CONFIGURED()
> +#define qemu_rdma_data_init(...) NOT_CONFIGURED() 
> +#define rdma_start_outgoing_migration(...) NOT_CONFIGURED()
> +#define rdma_start_incoming_migration(...) NOT_CONFIGURED()
> +#define qemu_rdma_handle_registrations(...) 0
> +#define qemu_rdma_finish_registrations(...) 0
> +#define qemu_rdma_get_buffer NULL
> +#define qemu_rdma_put_buffer NULL
> +#define qemu_rdma_close NULL
> +#define qemu_fopen_rdma(...) NULL
> +#define qemu_rdma_client_init(...) -1 
> +#define qemu_rdma_client_connect(...) -1 
> +#define qemu_rdma_server_init(...) -1 
> +#define qemu_rdma_server_prepare(...) -1 
> +#define qemu_rdma_drain_cq(...) -1 
> +#define save_rdma_page(...) -ENOTSUP
> +
> +#endif /* CONFIG_RDMA */
> +

Please leave the prototypes even if CONFIG_RDMA is not defined.

This is because these symbols should not have any references when
CONFIG_RDMA is not defined.  The prototypes do not hurt.

You should almost be there.  The only functions that have some
references left in arch_init.c should be save_rdma_page,
qemu_rdma_drain_cq, qemu_rdma_finish_registrations.  Turn these into
QEMUFileOps, and there will be no undefined references even with
--disable-rdma.

But actually, once you get there, it could even make sense to merge
rdma.c and migration-rdma.c into a single file (migration-rdma.c; put
the former migration-rdma.c last so that you do not need forward
declarations, we tend to avoid them).  You can then eliminate the header
completely!

Paolo

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma' mrhines
@ 2013-04-09 17:01   ` Paolo Bonzini
  2013-04-10  1:11     ` Michael R. Hines
  2013-04-09 17:02   ` Paolo Bonzini
  1 sibling, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 17:01 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> 1. capability for zero pages (enabled by default)
> 2. capability for dynamic server chunk registration (disabled by default)

The "zero" capability should be a separate patch.

The hunk adding mbps should also be a separate patch.

Otherwise, please merge this with patch 6.

> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  migration.c |   41 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/migration.c b/migration.c
> index 3b4b467..f01efa9 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -15,6 +15,7 @@
>  
>  #include "qemu-common.h"
>  #include "migration/migration.h"
> +#include "migration/rdma.h"
>  #include "monitor/monitor.h"
>  #include "migration/qemu-file.h"
>  #include "sysemu/sysemu.h"
> @@ -68,6 +69,18 @@ MigrationState *migrate_get_current(void)
>          .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
>      };
>  
> +    static bool first_time = 1;
> +
> +    /*
> +     * Historically, checking for zeros is enabled
> +     * by default. Require the user to disable it
> +     * (for example RDMA), if they really want to.
> +     */
> +    if(first_time) {
> +        current_migration.enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true;
> +        first_time = 0;
> +    }

Just add

    .enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true,

to the initializer above.
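
For reference, the designated-initializer form meant here, reduced to a minimal compilable sketch (the real MigrationState and capability enum are much larger): C99 designators can target a single array element, so no run-time "first_time" flag is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal sketch; QEMU's MigrationState has many more fields. */
typedef enum {
    MIGRATION_CAPABILITY_XBZRLE,
    MIGRATION_CAPABILITY_CHECK_FOR_ZERO,
    MIGRATION_CAPABILITY_MAX
} MigrationCapability;

typedef struct {
    bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
} MigrationState;

/* The array-element designator sets exactly one slot; the rest are
 * zero-initialized, exactly as the first_time branch would have left them. */
static MigrationState current_migration = {
    .enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true,
};
```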

> +
>      return &current_migration;
>  }
>  
> @@ -77,6 +90,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
>  
>      if (strstart(uri, "tcp:", &p))
>          tcp_start_incoming_migration(p, errp);
> +    else if (strstart(uri, "rdma:", &p))
> +        rdma_start_incoming_migration(p, errp);
>  #if !defined(WIN32)
>      else if (strstart(uri, "exec:", &p))
>          exec_start_incoming_migration(p, errp);
> @@ -120,7 +135,6 @@ void process_incoming_migration(QEMUFile *f)
>      Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
>      int fd = qemu_get_fd(f);
>  
> -    assert(fd != -1);

Oh, missed it.  Here it is. :)

Please make this an if() instead.

>      qemu_set_nonblock(fd);
>      qemu_coroutine_enter(co, f);
>  }
> @@ -405,6 +419,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>  
>      if (strstart(uri, "tcp:", &p)) {
>          tcp_start_outgoing_migration(s, p, &local_err);
> +    } else if (strstart(uri, "rdma:", &p)) {
> +        rdma_start_outgoing_migration(s, p, &local_err);
>  #if !defined(WIN32)
>      } else if (strstart(uri, "exec:", &p)) {
>          exec_start_outgoing_migration(s, p, &local_err);
> @@ -474,6 +490,24 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>  
> +bool migrate_chunk_register_destination(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION];
> +}
> +
> +bool migrate_check_for_zero(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO];
> +}
> +
>  int migrate_use_xbzrle(void)
>  {
>      MigrationState *s;
> @@ -546,8 +580,9 @@ static void *migration_thread(void *opaque)
>              max_size = bandwidth * migrate_max_downtime() / 1000000;
>  
>              DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
> -                    " bandwidth %g max_size %" PRId64 "\n",
> -                    transferred_bytes, time_spent, bandwidth, max_size);
> +                    " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
> +                    transferred_bytes, time_spent, 
> +                    bandwidth, Mbps(transferred_bytes, time_spent), max_size);
>              /* if we haven't sent anything, we don't want to recalculate
>                 10000 is a small enough number for our purposes */
>              if (s->dirty_bytes_rate && transferred_bytes > 10000) {
> 

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma' mrhines
  2013-04-09 17:01   ` Paolo Bonzini
@ 2013-04-09 17:02   ` Paolo Bonzini
  1 sibling, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 17:02 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> +    } else if (strstart(uri, "rdma:", &p)) {
> +        rdma_start_outgoing_migration(s, p, &local_err);

Forgot one: please wrap this and the equivalent incoming migration hunk
with #ifdef CONFIG_RDMA.

Prototypes can go in include/migration/migration.h.

Paolo

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors for RDMA
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors " mrhines
@ 2013-04-09 17:03   ` Paolo Bonzini
  2013-04-09 17:31   ` Peter Maydell
  1 sibling, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 17:03 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> 1. qemu_file_ops_are()
> 2. qemu_file_update_position()    (for f->pos)
> 
> Also need to be here:
> rdma_read_ops
> rdma_write_ops
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  savevm.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 48 insertions(+), 9 deletions(-)

As agreed before, please return the page size from save_ram_page and
update position in savevm.c.

Add enough QEMUFileOps, and you won't need qemu_file_ops_are.

Paolo

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma mrhines
@ 2013-04-09 17:05   ` Paolo Bonzini
  2013-04-09 18:07     ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-09 17:05 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
> +if test "$rdma" = "yes" ; then
> +  cat > $TMPC <<EOF
> +#include <rdma/rdma_cma.h>
> +int main(void) { return 0; }
> +EOF
> +  rdma_libs="-lrdmacm -libverbs"
> +  if ! compile_prog "" "$rdma_libs" ; then
> +      feature_not_found "rdma"
> +  fi
> +    

Please enable this by default, or it will bitrot.

The test should be like this:

if test "$rdma" != "no" ; then
  cat > $TMPC << EOF
...
EOF
  rdma_libs="-lrdmacm -libverbs"
  if compile_prog "-Werror" "$rdma_libs" ; then
    rdma="yes"
    libs_softmmu="$libs_softmmu $rdma_libs"
  else
    if test "$rdma" = "yes" ; then
      feature_not_found "rdma"
    fi
    rdma="no"
  fi
fi


...

if test "$rdma" = "yes" ; then
  echo "CONFIG_RDMA=y" >> $config_host_mak
fi

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors for RDMA
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors " mrhines
  2013-04-09 17:03   ` Paolo Bonzini
@ 2013-04-09 17:31   ` Peter Maydell
  2013-04-09 18:04     ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Peter Maydell @ 2013-04-09 17:31 UTC (permalink / raw)
  To: mrhines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 9 April 2013 04:04,  <mrhines@linux.vnet.ibm.com> wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> 1. qemu_file_ops_are()
> 2. qemu_file_update_position()    (for f->pos)
>
> Also need to be here:
> rdma_read_ops
> rdma_write_ops

Do you think you could try to expand on your commit messages
a bit? The idea is that a commit message should generally
give an overview of the patch including rationale; it should
be reasonably meaningful if you look only at the commit
message and not the patch itself. This one has a very
abbreviated description of the "what" and is missing any
kind of "why".

thanks
-- PMM

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors for RDMA
  2013-04-09 17:31   ` Peter Maydell
@ 2013-04-09 18:04     ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-09 18:04 UTC (permalink / raw)
  To: Peter Maydell
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/09/2013 01:31 PM, Peter Maydell wrote:
> On 9 April 2013 04:04,  <mrhines@linux.vnet.ibm.com> wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> 1. qemu_file_ops_are()
>> 2. qemu_file_update_position()    (for f->pos)
>>
>> Also need to be here:
>> rdma_read_ops
>> rdma_write_ops

My apologies..... will do =)

> Do you think you could try to expand on your commit messages
> a bit? The idea is that a commit message should generally
> give an overview of the patch including rationale; it should
> be reasonably meaningful if you look only at the commit
> message and not the patch itself. This one has a very
> abbreviated description of the "what" and is missing any
> kind of "why".
>
> thanks
> -- PMM
>

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma
  2013-04-09 17:05   ` Paolo Bonzini
@ 2013-04-09 18:07     ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-09 18:07 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Thanks for all the comments...... will implement......


On 04/09/2013 01:05 PM, Paolo Bonzini wrote:
> Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
>> +if test "$rdma" = "yes" ; then
>> +  cat > $TMPC <<EOF
>> +#include <rdma/rdma_cma.h>
>> +int main(void) { return 0; }
>> +EOF
>> +  rdma_libs="-lrdmacm -libverbs"
>> +  if ! compile_prog "" "$rdma_libs" ; then
>> +      feature_not_found "rdma"
>> +  fi
>> +
> Please enable this by default, or it will bitrot.
>
> The test should be like this:
>
> if test "$rdma" != "no" ; then
>    cat > $TMPC << EOF
> ...
> EOF
>    rdma_libs="-lrdmacm -libverbs"
>    if compile_prog "-Werror" "$rdma_libs" ; then
>      rdma="yes"
>      libs_softmmu="$libs_softmmu $rdma_libs"
>    else
>      if test "$rdma" = "yes" ; then
>        feature_not_found "rdma"
>      fi
>      rdma="no"
>    fi
> fi
>
>
> ...
>
> if test "$rdma" = "yes" ; then
>    echo "CONFIG_RDMA=y" >> $config_host_mak
> fi
>

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-09 17:01   ` Paolo Bonzini
@ 2013-04-10  1:11     ` Michael R. Hines
  2013-04-10  8:07       ` Paolo Bonzini
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-10  1:11 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Actually, this capability needs to be a part of the same patch.

RDMA must have the ability to turn this off or performance will die.

Similarly, dynamic chunk registration on the server depends on having
the ability to turn this on, or you cannot get the advantages of dynamic
registration.

- Michael

On 04/09/2013 01:01 PM, Paolo Bonzini wrote:
> Il 09/04/2013 05:04, mrhines@linux.vnet.ibm.com ha scritto:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> 1. capability for zero pages (enabled by default)
>> 2. capability for dynamic server chunk registration (disabled by default)
> The "zero" capability should be a separate patch.
>
> The hunk adding mbps should also be a separate patch.
>
> Otherwise, please merge this with patch 6.
>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   migration.c |   41 ++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 38 insertions(+), 3 deletions(-)
>>
>> diff --git a/migration.c b/migration.c
>> index 3b4b467..f01efa9 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -15,6 +15,7 @@
>>   
>>   #include "qemu-common.h"
>>   #include "migration/migration.h"
>> +#include "migration/rdma.h"
>>   #include "monitor/monitor.h"
>>   #include "migration/qemu-file.h"
>>   #include "sysemu/sysemu.h"
>> @@ -68,6 +69,18 @@ MigrationState *migrate_get_current(void)
>>           .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
>>       };
>>   
>> +    static bool first_time = 1;
>> +
>> +    /*
>> +     * Historically, checking for zeros is enabled
>> +     * by default. Require the user to disable it
>> +     * (for example RDMA), if they really want to.
>> +     */
>> +    if(first_time) {
>> +        current_migration.enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true;
>> +        first_time = 0;
>> +    }
> Just add
>
>      .enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO] = true,
>
> to the initializer above.
>
>> +
>>       return &current_migration;
>>   }
>>   
>> @@ -77,6 +90,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
>>   
>>       if (strstart(uri, "tcp:", &p))
>>           tcp_start_incoming_migration(p, errp);
>> +    else if (strstart(uri, "rdma:", &p))
>> +        rdma_start_incoming_migration(p, errp);
>>   #if !defined(WIN32)
>>       else if (strstart(uri, "exec:", &p))
>>           exec_start_incoming_migration(p, errp);
>> @@ -120,7 +135,6 @@ void process_incoming_migration(QEMUFile *f)
>>       Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
>>       int fd = qemu_get_fd(f);
>>   
>> -    assert(fd != -1);
> Oh, missed it.  Here it is. :)
>
> Please make this an if() instead.
>
>>       qemu_set_nonblock(fd);
>>       qemu_coroutine_enter(co, f);
>>   }
>> @@ -405,6 +419,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>>   
>>       if (strstart(uri, "tcp:", &p)) {
>>           tcp_start_outgoing_migration(s, p, &local_err);
>> +    } else if (strstart(uri, "rdma:", &p)) {
>> +        rdma_start_outgoing_migration(s, p, &local_err);
>>   #if !defined(WIN32)
>>       } else if (strstart(uri, "exec:", &p)) {
>>           exec_start_outgoing_migration(s, p, &local_err);
>> @@ -474,6 +490,24 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>>       max_downtime = (uint64_t)value;
>>   }
>>   
>> +bool migrate_chunk_register_destination(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION];
>> +}
>> +
>> +bool migrate_check_for_zero(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHECK_FOR_ZERO];
>> +}
>> +
>>   int migrate_use_xbzrle(void)
>>   {
>>       MigrationState *s;
>> @@ -546,8 +580,9 @@ static void *migration_thread(void *opaque)
>>               max_size = bandwidth * migrate_max_downtime() / 1000000;
>>   
>>               DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
>> -                    " bandwidth %g max_size %" PRId64 "\n",
>> -                    transferred_bytes, time_spent, bandwidth, max_size);
>> +                    " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
>> +                    transferred_bytes, time_spent,
>> +                    bandwidth, Mbps(transferred_bytes, time_spent), max_size);
>>               /* if we haven't sent anything, we don't want to recalculate
>>                  10000 is a small enough number for our purposes */
>>               if (s->dirty_bytes_rate && transferred_bytes > 10000) {
>>
>

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation mrhines
@ 2013-04-10  5:27   ` Michael S. Tsirkin
  2013-04-10 13:04     ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-10  5:27 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

Below is a great high-level overview. The protocol looks correct.
A bit more detail would be helpful, as noted below.

The main thing I'd like to see changed is that there are already
two protocols here: chunk-based and non-chunk-based.
We'll need to use versioning and capabilities going forward, but in the
first version we don't need to maintain compatibility with legacy, so
two versions seem like unnecessary pain.  Chunk-based is somewhat slower,
and that is worth fixing longer term, but it seems like the way forward. So
let's implement a single chunk-based protocol in the first version we
merge.

Some more minor improvement suggestions below.

On Mon, Apr 08, 2013 at 11:04:32PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Both the protocol and interfaces are elaborated in more detail,
> including the new use of dynamic chunk registration, versioning,
> and capabilities negotiation.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 313 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..e9fa4cd
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,313 @@
> +Several changes since v4:
> +
> +- Created a "formal" protocol for the RDMA control channel
> +- Dynamic, chunked page registration now implemented on *both* the server and client
> +- Created new 'capability' for page registration
> +- Created new 'capability' for is_zero_page() - enabled by default
> +  (needed to test dynamic page registration)
> +- Created version-check before protocol begins at connection-time 
> +- no more migrate_use_rdma() !
> +
> +NOTE: While dynamic registration works on both sides now,
> +      it does *not* work with cgroups swap limits. This functionality with infiniband
> +      remains broken. (It works fine with TCP). So, in order to take full 
> +      advantage of this feature, a fix will have to be developed on the kernel side.
> +      Alternative proposed is use /dev/<pid>/pagemap. Patch will be submitted.

You mean the idea of using pagemap to detect shared pages created by KSM
and/or zero pages? That would be helpful for TCP migration, thanks!

> +

BTW the above comments belong outside both the document and the commit
log: after the "---" and before the diff.

> +Contents:
> +=================================
> +* Compiling
> +* Running (please readme before running)
> +* RDMA Protocol Description
> +* Versioning
> +* QEMUFileRDMA Interface
> +* Migration of pc.ram
> +* Error handling
> +* TODO
> +* Performance
> +
> +COMPILING:
> +===============================
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +$ make
> +
> +RUNNING:
> +===============================
> +
> +First, decide if you want dynamic page registration on the server-side.
> +This always happens on the primary-VM side, but is optional on the server.
> +Doing this allows you to support overcommit (such as cgroups or ballooning)
> +with a smaller footprint on the server-side without having to register the
> +entire VM memory footprint. 
> +NOTE: This significantly slows down performance (about 30% slower).

Where does the overhead come from? It appears from the description that
you have exactly the same amount of data to exchange using send messages,
either way?
Or are you using bigger chunks with upfront registration?

> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default

I think the right choice is to make chunk based the default, and remove
the non chunk based from code.  This will simplify the protocol a tiny bit,
and make us focus on improving chunk based long term so that it's as
fast as upfront registration.

> +
> +Next, if you decided *not* to use chunked registration on the server,
> +it is recommended to also disable zero page detection. While this is not
> +strictly necessary, zero page detection also significantly slows down
> +performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:

What is meant by performance here? downtime?

> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
> +
> +Finally, set the migration speed to match your hardware's capabilities:
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +RDMA Protocol Description:
> +=================================
> +
> +Migration with RDMA is separated into two parts:
> +
> +1. The transmission of the pages using RDMA
> +2. Everything else (a control channel is introduced)
> +
> +"Everything else" is transmitted using a formal 
> +protocol now, consisting of infiniband SEND / RECV messages.
> +
> +An infiniband SEND message is the standard ibverbs
> +message used by applications of infiniband hardware.
> +The only difference between a SEND message and an RDMA
> +message is that SEND message cause completion notifications
> +to be posted to the completion queue (CQ) on the 
> +infiniband receiver side, whereas RDMA messages (used
> +for pc.ram) do not (to behave like an actual DMA).
> +    
> +Messages in infiniband require two things:
> +
> +1. registration of the memory that will be transmitted
> +2. (SEND/RECV only) work requests to be posted on both
> +   sides of the network before the actual transmission
> +   can occur.
> +
> +RDMA messages are much easier to deal with. Once the memory
> +on the receiver side is registered and pinned, we're
> +basically done. All that is required is for the sender
> +side to start dumping bytes onto the link.

When is memory unregistered and unpinned on the send and receive
sides?

> +
> +SEND messages require more coordination because the
> +receiver must have reserved space (using a receive
> +work request) on the receive queue (RQ) before QEMUFileRDMA
> +can start using them to carry all the bytes as
> +a transport for migration of device state.
> +
> +To begin the migration, the initial connection setup is
> +as follows (migration-rdma.c):
> +
> +1. Receiver and Sender are started (command line or libvirt):
> +2. Both sides post two RQ work requests

Okay, this could be where the problem is. This means that with
chunk-based registration the receive side does:

loop:
	receive request
	register
	send response

while with non-chunk-based registration it does:

receive request
send response
loop:
	register

In reality each request/response requires two network round-trips
with the Ready credit-management messages.
So the overhead will likely be avoided if we add better pipelining:
allow multiple registration requests in the air, and add more
send/receive credits so the overhead of credit management can be
reduced.

There's no requirement to implement these optimizations upfront
before merging the first version, but let's remove the
non-chunk-based crutch unless we see it as absolutely necessary.
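
One possible shape for the pipelining suggested above, purely as a sketch (the type and names are invented, not code from the patches): track the credits the peer has advertised and allow several Register requests to stay in flight before blocking.

```c
#include <assert.h>

/* Hypothetical credit window: the sender may keep up to `credits`
 * registration requests outstanding instead of paying one full
 * request/response round-trip per chunk. */
typedef struct {
    int credits;      /* receive slots the peer has advertised */
    int outstanding;  /* Register requests awaiting a Register result */
} RegPipeline;

static int can_post_register(const RegPipeline *p)
{
    return p->outstanding < p->credits;
}

static void post_register(RegPipeline *p)       { p->outstanding++; }
static void got_register_result(RegPipeline *p) { p->outstanding--; }
```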

> +3. Receiver does listen()
> +4. Sender does connect()
> +5. Receiver accept()
> +6. Check versioning and capabilities (described later)
> +
> +At this point, we define a control channel on top of SEND messages
> +which is described by a formal protocol. Each SEND message has a 
> +header portion and a data portion (but together are transmitted 
> +as a single SEND message).
> +
> +Header:
> +    * Length  (of the data portion)
> +    * Type    (what command to perform, described below)
> +    * Version (protocol version validated before send/recv occurs)

What's the expected value for the Version field?
Also, confusing: below you mention using the private-data field in
librdmacm instead?
Need to add the # of bytes and endianness of each field.
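
To make that request concrete, here is one way the layout could be specified (the field widths and the helper below are a hypothetical sketch, not what the patches define): three fixed-width fields, serialized big-endian regardless of host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical control-channel header: 12 bytes on the wire, big-endian. */
typedef struct {
    uint32_t len;     /* length of the data portion, in bytes */
    uint32_t type;    /* command value, e.g. 2 = Ready, 5 = Register request */
    uint32_t version; /* protocol version, validated on every exchange */
} RDMAControlHeader;

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Serialize independent of host endianness. */
static void control_header_to_wire(const RDMAControlHeader *h, uint8_t wire[12])
{
    put_be32(wire + 0, h->len);
    put_be32(wire + 4, h->type);
    put_be32(wire + 8, h->version);
}
```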

> +
> +The 'type' field has 7 different command values:

0. Unused.

> +    1. None

you mean this is unused?

> +    2. Ready             (control-channel is available) 
> +    3. QEMU File         (for sending non-live device state) 
> +    4. RAM Blocks        (used right after connection setup)
> +    5. Register request  (dynamic chunk registration) 
> +    6. Register result   ('rkey' to be used by sender)

Hmm, don't you also need a virtual address for RDMA writes?

> +    7. Register finished (registration for current iteration finished)

What does Register finished mean and how is it used?

Need to add which commands have a data portion, and in what format.

> +
> +After connection setup is completed, we have two protocol-level
> +functions, responsible for communicating control-channel commands
> +using the above list of values: 
> +
> +Logically:
> +
> +qemu_rdma_exchange_recv(header, expected command type)
> +
> +1. We transmit a READY command to let the sender know that 

you call it Ready above, so better be consistent.

> +   we are *ready* to receive some data bytes on the control channel.
> +2. Before attempting to receive the expected command, we post another
> +   RQ work request to replace the one we just used up.
> +3. Block on a CQ event channel and wait for the SEND to arrive.
> +4. When the send arrives, librdmacm will unblock us.
> +5. Verify that the command-type and version received matches the one we expected.
> +
> +qemu_rdma_exchange_send(header, data, optional response header & data): 
> +
> +1. Block on the CQ event channel waiting for a READY command
> +   from the receiver to tell us that the receiver
> +   is *ready* for us to transmit some new bytes.
> +2. Optionally: if we are expecting a response from the command
> +   (that we have not yet transmitted),

Which commands expect a result? Only Register request?

> let's post an RQ
> +   work request to receive that data a few moments later. 
> +3. When the READY arrives, librdmacm will 
> +   unblock us and we immediately post a RQ work request
> +   to replace the one we just used up.
> +4. Now, we can actually post the work request to SEND
> +   the requested command type of the header we were asked for.
> +5. Optionally, if we are expecting a response (as before),
> +   we block again and wait for that response using the additional
> +   work request we previously posted. (This is used to carry
> +   'Register result' commands #6 back to the sender, which
> +   hold the rkey needed to perform RDMA.)
> +
> +All of the remaining command types (not including 'ready')
> +described above use the aforementioned two functions to do the hard work:
> +
> +1. After connection setup, RAMBlock information is exchanged using
> +   this protocol before the actual migration begins.
> +2. During runtime, once a 'chunk' becomes full of pages ready to
> +   be sent with RDMA, the registration commands are used to ask the
> +   other side to register the memory for this chunk and respond
> +   with the result (rkey) of the registration.
> +3. The QEMUFile interfaces also call these functions (described below)
> +   when transmitting non-live state, such as device state, or to send
> +   its own protocol information during the migration process.
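The READY credit ordering that the two exchange functions implement can be modeled with a toy sketch (illustrative only: ibverbs work requests, CQ event channels, and the optional response path are deliberately left out, and all names below are invented, not the qemu_rdma_* C code):

```python
from collections import deque

class Endpoint:
    """Toy single-threaded model of one side of the control channel.

    Real RQ work requests and CQ blocking are replaced by an in-memory
    inbox; only the READY credit ordering is modeled.
    """
    def __init__(self):
        self.inbox = deque()
        self.peer = None

    def grant_ready(self):
        # exchange_recv step 1: tell the sender we can accept one message.
        self.peer.inbox.append(("READY", None))

    def exchange_send(self, command, data):
        # exchange_send step 1: block until a READY credit has arrived.
        cmd, _ = self.inbox.popleft()
        assert cmd == "READY"
        # exchange_send step 4: post the SEND carrying the command.
        self.peer.inbox.append((command, data))

    def receive(self, expected):
        # exchange_recv steps 3-5: take the SEND and verify its type.
        command, data = self.inbox.popleft()
        assert command == expected, "unexpected control command"
        return data
```

Calls run in order - grant_ready(), exchange_send(), receive() - mirroring how the receiver's READY always precedes the sender's SEND.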
> +
> +Versioning
> +==================================
> +
> +librdmacm provides the user with a 'private data' area to be exchanged
> +at connection-setup time before any infiniband traffic is generated.
> +
> +This is a convenient place to check for protocol versioning because the
> +user does not need to register memory to transmit a few bytes of version
> +information.
> +
> +This is also a convenient place to negotiate capabilities
> +(like dynamic page registration).

This would be a good place to document the format of the
private data field.
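Agreed that the format belongs here. As a sketch of what version plus capability negotiation over the private data area could look like - the layout below (a 32-bit version followed by a 32-bit capability bitmask) is an assumption for illustration, not the patch's actual format:

```python
import struct

# Hypothetical capability bits; the series currently discusses dynamic
# registration and the zero-page check.
CAP_DYNAMIC_REGISTRATION = 1 << 0
CAP_ZERO_PAGE_CHECK      = 1 << 1

def pack_private_data(version, caps):
    return struct.pack("!II", version, caps)

def negotiate(local_version, local_caps, remote_blob):
    remote_version, remote_caps = struct.unpack("!II", remote_blob)
    if remote_version != local_version:
        # "If the version is invalid, we throw an error."
        raise ValueError("invalid RDMA migration protocol version")
    # Keep only the capabilities both sides can perform; ignore the rest.
    return local_caps & remote_caps
```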

> +
> +If the version is invalid, we throw an error.

Which version is valid in this specification?

> +
> +If the version is new, we only negotiate the capabilities that the
> +requested version is able to perform and ignore the rest.

What are these capabilities and how do we negotiate them?

> +QEMUFileRDMA Interface:
> +==================================
> +
> +QEMUFileRDMA introduces a couple of new functions:
> +
> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +
> +These two functions are very short and simply use the protocol
> +described above to deliver bytes without changing the upper-level
> +users of QEMUFile that depend on a bytestream abstraction.
> +
> +Finally, how do we hand off the actual bytes to get_buffer()?
> +
> +Again, because we're trying to "fake" a bytestream abstraction
> +using an analogy not unlike individual UDP frames, we have
> +to hold on to the bytes received from control-channel's SEND 
> +messages in memory.
> +
> +Each time we receive a complete "QEMU File" control-channel 
> +message, the bytes from SEND are copied into a small local holding area.
> +
> +Then, we return the number of bytes requested by get_buffer()
> +and leave the remaining bytes in the holding area until get_buffer()
> +comes around for another pass.
> +
> +If the buffer is empty, then we follow the same steps
> +listed above and issue another "QEMU File" protocol command,
> +asking for a new SEND message to re-fill the buffer.
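The holding-area behavior of get_buffer() described above can be sketched as follows (illustrative Python, not the actual QEMUFileRDMA C code; refill stands in for issuing another "QEMU File" command and waiting for the SEND):

```python
class HoldingArea:
    """Models get_buffer() consuming bytes from control-channel SENDs."""

    def __init__(self, refill):
        self.refill = refill   # callable returning the next SEND payload
        self.held = b""

    def get_buffer(self, size):
        # Empty? Issue another "QEMU File" command to re-fill the buffer.
        if not self.held:
            self.held = self.refill()
        # Hand back up to 'size' bytes; keep the rest for the next pass.
        out, self.held = self.held[:size], self.held[size:]
        return out
```

Note a call may return fewer bytes than requested when a SEND payload runs out, which is fine for a bytestream consumer that loops.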
> +
> +Migration of pc.ram:
> +===============================
> +
> +At the beginning of the migration, (migration-rdma.c),
> +the sender and the receiver populate the list of RAMBlocks
> +to be registered with each other into a structure.
> +Then, using the aforementioned protocol, they exchange a
> +description of these blocks with each other, to be used later 
> +during the iteration of main memory. This description includes
> +a list of all the RAMBlocks, their offsets and lengths and
> +possibly includes pre-registered RDMA keys in case dynamic
> +page registration was disabled on the server-side; otherwise they are omitted.

Worth mentioning here that memory hotplug will require a protocol
extension. That's also true of TCP so not a big deal ...
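For illustration, the exchanged RAMBlock description could be serialized roughly as below - the entry contents (offset, length, optional pre-registered rkey) follow the text above, but the exact field widths and encoding are my guess, not the patch's format:

```python
import struct

# One entry per RAMBlock: 64-bit offset, 64-bit length, 32-bit rkey.
# The rkey is only meaningful when dynamic registration is disabled;
# otherwise it is carried as zero here.
BLOCK_FMT = "!QQI"

def pack_ram_blocks(blocks):
    out = struct.pack("!I", len(blocks))   # entry count first
    for offset, length, rkey in blocks:
        out += struct.pack(BLOCK_FMT, offset, length, rkey)
    return out

def unpack_ram_blocks(blob):
    (count,) = struct.unpack_from("!I", blob)
    entry = struct.calcsize(BLOCK_FMT)
    return [struct.unpack_from(BLOCK_FMT, blob, 4 + i * entry)
            for i in range(count)]
```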

> +
> +Main memory is not migrated with the aforementioned protocol, 
> +but is instead migrated with normal RDMA Write operations.
> +
> +Pages are migrated in "chunks" (about 1 Megabyte right now).

Why "about"? This is not dynamic so needs to be exactly same
on both sides, right?

> +Chunk size is not dynamic, but it could be in a future implementation.
> +There's nothing to indicate that this is useful right now.
> +
> +When a chunk is full (or a flush() occurs), the memory backed by 
> +the chunk is registered with librdmacm and pinned in memory on 
> +both sides using the aforementioned protocol.
> +
> +After pinning, an RDMA Write is generated and transmitted
> +for the entire chunk.
> +
> +Chunks are also transmitted in batches: This means that we
> +do not request that the hardware signal the completion queue
> +for the completion of *every* chunk. The current batch size
> +is about 64 chunks (corresponding to 64 MB of memory).
> +Only the last chunk in a batch must be signaled.
> +This helps keep everything as asynchronous as possible
> +and helps keep the hardware busy performing RDMA operations.
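The chunking and batch-signaling arithmetic above reduces to a few lines (a sketch; the constants match the text, the function names are invented):

```python
CHUNK_SIZE = 1 << 20    # chunks are 1 MB
BATCH_SIZE = 64         # 64 chunks, i.e. 64 MB, per signaled batch

def chunk_index(ram_offset):
    # Which chunk a page at this offset belongs to.
    return ram_offset // CHUNK_SIZE

def needs_completion_signal(nth_chunk):
    # Only the last chunk of each batch asks the hardware to post a
    # completion-queue entry; the rest stay unsignaled.
    return (nth_chunk + 1) % BATCH_SIZE == 0
```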
> +
> +Error-handling:
> +===============================
> +
> +Infiniband has what is called a "Reliable, Connected"
> +link (one of 4 choices). This is the mode
> +we use for RDMA migration.
> +
> +If a *single* message fails,
> +the decision is to abort the migration entirely and
> +cleanup all the RDMA descriptors and unregister all
> +the memory.
> +
> +After cleanup, the Virtual Machine is returned to normal
> +operation the same way that would happen if the TCP
> +socket is broken during a non-RDMA based migration.

That's on sender side? Presumably this means you respond to
completion with error?
 How does receive side know
migration is complete?

> +
> +TODO:
> +=================================
> +1. Currently, cgroups swap limits for *both* TCP and RDMA
> +   on the sender-side are broken. This is more acute for
> +   RDMA because RDMA requires memory registration.
> +   Fixing this requires infiniband page registrations to be
> +   zero-page aware, and this does not yet work properly.
> +2. Currently overcommit for the *receiver* side of
> +   TCP works, but not for RDMA. While dynamic page registration
> +   *does* work, it is only useful if the is_zero_page() capability
> +   remains enabled (which it is by default).
> +   However, leaving this capability turned on *significantly* slows
> +   down the RDMA throughput, particularly on hardware capable
> +   of transmitting faster than 10 gbps (such as 40gbps links).
> +3. Use of the recent /dev/<pid>/pagemap would likely solve some
> +   of these problems.
> +4. Also, some form of balloon-device usage tracking would also
> +   help alleviate some of these issues.
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infiniband link performing a worst-case stress test:
> +
> +1. Average worst-case RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 30 gbps (a little better than the paper)
> +2. Average worst-case TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 8 gbps (using IPoIB, IP over Infiniband)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) shows additional performance details
> +linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-10  1:11     ` Michael R. Hines
@ 2013-04-10  8:07       ` Paolo Bonzini
  2013-04-10 10:35         ` Michael S. Tsirkin
  2013-04-10 12:24         ` Michael R. Hines
  0 siblings, 2 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-10  8:07 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 10/04/2013 03:11, Michael R. Hines ha scritto:
> Actually, this capability needs to be a part of the same patch.
> 
> RDMA must have the ability to turn this off or performance will die.

Yes, but you can put it in a separate patch at the beginning of this
series.  RDMA depends on it, but it doesn't depend on RDMA.

Paolo

> Similarly dynamic chunk registration on the server depends on having
> the ability to turn this on, or you cannot get the advantages of dynamic
> registration.


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-10  8:07       ` Paolo Bonzini
@ 2013-04-10 10:35         ` Michael S. Tsirkin
  2013-04-10 12:24         ` Michael R. Hines
  1 sibling, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-10 10:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Wed, Apr 10, 2013 at 10:07:47AM +0200, Paolo Bonzini wrote:
> Il 10/04/2013 03:11, Michael R. Hines ha scritto:
> > Actually, this capability needs to be a part of the same patch.
> > 
> > RDMA must have the ability to turn this off or performance will die.
> 
> Yes, but you can put it in a separate patch at the beginning of this
> series.  RDMA depends on it, but it doesn't depend on RDMA.
> 
> Paolo

Further, it's an implementation detail really. We don't want
management to play with this capability - it will not know what
to set it to.  Let's just do the right thing automatically.

> > Similarly dynamic chunk registration on the server depends on having
> > the ability to turn this on, or you cannot get the advantages of dynamic
> > registration.


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma'
  2013-04-10  8:07       ` Paolo Bonzini
  2013-04-10 10:35         ` Michael S. Tsirkin
@ 2013-04-10 12:24         ` Michael R. Hines
  1 sibling, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-10 12:24 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

On 04/10/2013 04:07 AM, Paolo Bonzini wrote:
> Il 10/04/2013 03:11, Michael R. Hines ha scritto:
>> Actually, this capability needs to be a part of the same patch.
>>
>> RDMA must have the ability to turn this off or performance will die.
> Yes, but you can put it in a separate patch at the beginning of this
> series.  RDMA depends on it, but it doesn't depend on RDMA.

Acknowledged.


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-10  5:27   ` Michael S. Tsirkin
@ 2013-04-10 13:04     ` Michael R. Hines
  2013-04-10 13:34       ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-10 13:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> Below is a great high level overview. the protocol looks correct.
> A bit more detail would be helpful, as noted below.
>
> The main thing I'd like to see changed is that there are already
> two protocols here: chunk-based and non chunk based.
> We'll need to use versioning and capabilities going forward but in the
> first version we don't need to maintain compatibility with legacy so
> two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> that is worth fixing longer term, but seems like the way forward. So
> let's implement a single chunk-based protocol in the first version we
> merge.
>
> Some more minor improvement suggestions below.
Thanks.

However, IMHO restricting the policy to chunk-based only is really
not an acceptable choice:

Here's the reason: Using my 10gbps RDMA hardware, throughput takes a dive
from 10gbps to 6gbps.

But if I disable chunk-based registration altogether (forgoing 
overcommit), then performance comes back.

The reason for this is the additional control channel traffic needed
to ask the server to register memory pages on demand - without this
traffic, we can easily saturate the link.

But with this traffic, the user needs to know (and be given the option) 
to disable the feature
in case they want performance instead of flexibility.

> On Mon, Apr 08, 2013 at 11:04:32PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Both the protocol and interfaces are elaborated in more detail,
>> including the new use of dynamic chunk registration, versioning,
>> and capabilities negotiation.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 313 insertions(+)
>>   create mode 100644 docs/rdma.txt
>>
>> diff --git a/docs/rdma.txt b/docs/rdma.txt
>> new file mode 100644
>> index 0000000..e9fa4cd
>> --- /dev/null
>> +++ b/docs/rdma.txt
>> @@ -0,0 +1,313 @@
>> +Several changes since v4:
>> +
>> +- Created a "formal" protocol for the RDMA control channel
>> +- Dynamic, chunked page registration now implemented on *both* the server and client
>> +- Created new 'capability' for page registration
>> +- Created new 'capability' for is_zero_page() - enabled by default
>> +  (needed to test dynamic page registration)
>> +- Created version-check before protocol begins at connection-time
>> +- no more migrate_use_rdma() !
>> +
>> +NOTE: While dynamic registration works on both sides now,
>> +      it does *not* work with cgroups swap limits. This functionality with infiniband
>> +      remains broken. (It works fine with TCP). So, in order to take full
>> +      advantage of this feature, a fix will have to be developed on the kernel side.
>> +      The proposed alternative is to use /dev/<pid>/pagemap. A patch will be submitted.
> You mean the idea of using pagemap to detect shared pages created by KSM
> and/or zero pages? That would be helpful for TCP migration, thanks!

Yes, absolutely. This would *also* help the above registration problem.

We could use this to *pre-register* pages in advance, but that would be
an entirely different patch series (which I'm willing to write and submit).

>> +
> BTW the above comments belong outside both document and commit log,
> after --- before diff.
Acknowledged.

>> +Contents:
>> +=================================
>> +* Compiling
>> +* Running (please readme before running)
>> +* RDMA Protocol Description
>> +* Versioning
>> +* QEMUFileRDMA Interface
>> +* Migration of pc.ram
>> +* Error handling
>> +* TODO
>> +* Performance
>> +
>> +COMPILING:
>> +===============================
>> +
>> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
>> +$ make
>> +
>> +RUNNING:
>> +===============================
>> +
>> +First, decide if you want dynamic page registration on the server-side.
>> +This always happens on the primary-VM side, but is optional on the server.
>> +Doing this allows you to support overcommit (such as cgroups or ballooning)
>> +with a smaller footprint on the server-side without having to register the
>> +entire VM memory footprint.
>> +NOTE: This significantly slows down performance (about 30% slower).
> Where does the overhead come from? It appears from the description that
> you have exactly same amount of data to exchange using send messages,
> either way?
> Or are you using bigger chunks with upfront registration?

Answer is above.

Upfront registration registers the entire VM before migration starts
where as dynamic registration (on both sides) registers chunks in
1 MB increments as they are requested by the migration_thread.

The extra send messages required to request the server to register
the memory means that the RDMA must block until those messages
complete before the RDMA can begin.

>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
> I think the right choice is to make chunk based the default, and remove
> the non chunk based from code.  This will simplify the protocol a tiny bit,
> and make us focus on improving chunk based long term so that it's as
> fast as upfront registration.
Answer above.

>> +
>> +Next, if you decided *not* to use chunked registration on the server,
>> +it is recommended to also disable zero page detection. While this is not
>> +strictly necessary, zero page detection also significantly slows down
>> +performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
> What is meant by performance here? downtime?

Throughput. Zero page scanning (and dynamic registration) reduces 
throughput significantly.

>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
>> +
>> +Finally, set the migration speed to match your hardware's capabilities:
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>> +
>> +Finally, perform the actual migration:
>> +
>> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
>> +
>> +RDMA Protocol Description:
>> +=================================
>> +
>> +Migration with RDMA is separated into two parts:
>> +
>> +1. The transmission of the pages using RDMA
>> +2. Everything else (a control channel is introduced)
>> +
>> +"Everything else" is transmitted using a formal
>> +protocol now, consisting of infiniband SEND / RECV messages.
>> +
>> +An infiniband SEND message is the standard ibverbs
>> +message used by applications of infiniband hardware.
>> +The only difference between a SEND message and an RDMA
>> +message is that SEND messages cause completion notifications
>> +to be posted to the completion queue (CQ) on the
>> +infiniband receiver side, whereas RDMA messages (used
>> +for pc.ram) do not (to behave like an actual DMA).
>> +
>> +Messages in infiniband require two things:
>> +
>> +1. registration of the memory that will be transmitted
>> +2. (SEND/RECV only) work requests to be posted on both
>> +   sides of the network before the actual transmission
>> +   can occur.
>> +
>> +RDMA messages are much easier to deal with. Once the memory
>> +on the receiver side is registered and pinned, we're
>> +basically done. All that is required is for the sender
>> +side to start dumping bytes onto the link.
> When is memory unregistered and unpinned on send and receive
> sides?
Only when the migration ends completely. Will update the documentation.

>> +
>> +SEND messages require more coordination because the
>> +receiver must have reserved space (using a receive
>> +work request) on the receive queue (RQ) before QEMUFileRDMA
>> +can start using them to carry all the bytes as
>> +a transport for migration of device state.
>> +
>> +To begin the migration, the initial connection setup is
>> +as follows (migration-rdma.c):
>> +
>> +1. Receiver and Sender are started (command line or libvirt):
>> +2. Both sides post two RQ work requests
> Okay this could be where the problem is. This means with chunk
> based receive side does:
>
> loop:
> 	receive request
> 	register
> 	send response
>
> while with non chunk based it does:
>
> receive request
> send response
> loop:
> 	register
No, that's incorrect. With "non" chunk based, the receive side does 
*not* communicate
during the migration of pc.ram.

The control channel is only used for chunk registration and device 
state, not RAM.

I will update the documentation to make that more clear.

> In reality each request/response requires two network round-trips
> with the Ready credit-management messages.
> So the overhead will likely be avoided if we add better pipelining:
> allow multiple registration requests in the air, and add more
> send/receive credits so the overhead of credit management can be
> reduced.
Unfortunately, the migration thread doesn't work that way.
The thread only generates one page write at a time.

If someone were to write a patch which submits multiple
writes at the same time, I would be very interested in
consuming that feature and making chunk registration more
efficient by batching multiple registrations into fewer messages.

> There's no requirement to implement these optimizations upfront
> before merging the first version, but let's remove the
> non-chunkbased crutch unless we see it as absolutely necessary.
>
>> +3. Receiver does listen()
>> +4. Sender does connect()
>> +5. Receiver accept()
>> +6. Check versioning and capabilities (described later)
>> +
>> +At this point, we define a control channel on top of SEND messages
>> +which is described by a formal protocol. Each SEND message has a
>> +header portion and a data portion (but together are transmitted
>> +as a single SEND message).
>> +
>> +Header:
>> +    * Length  (of the data portion)
>> +    * Type    (what command to perform, described below)
>> +    * Version (protocol version validated before send/recv occurs)
> What's the expected value for Version field?
> Also, confusing.  Below mentions using private field in librdmacm instead?
> Need to add # of bytes and endian-ness of each field.

Correct, those are two separate versions. One for capability negotiation
and one for the protocol itself.

I will update the documentation.

>> +
>> +The 'type' field has 7 different command values:
> 0. Unused.
>
>> +    1. None
> you mean this is unused?

Correct - will update.

>> +    2. Ready             (control-channel is available)
>> +    3. QEMU File         (for sending non-live device state)
>> +    4. RAM Blocks        (used right after connection setup)
>> +    5. Register request  (dynamic chunk registration)
>> +    6. Register result   ('rkey' to be used by sender)
> Hmm, don't you also need a virtual address for RDMA writes?
>

The virtual addresses are communicated at the beginning of the
migration using command #4 "Ram blocks".

>> +    7. Register finished (registration for current iteration finished)
> What does Register finished mean and how is it used?
>
> Need to add which commands have a data portion, and in what format.

Acknowledged. "finished" signals that a migration round has completed
and that the receiver side can move to the next iteration.


>> +
>> +After connection setup is completed, we have two protocol-level
>> +functions, responsible for communicating control-channel commands
>> +using the above list of values:
>> +
>> +Logically:
>> +
>> +qemu_rdma_exchange_recv(header, expected command type)
>> +
>> +1. We transmit a READY command to let the sender know that
> you call it Ready above, so better be consistent.
>
>> +   we are *ready* to receive some data bytes on the control channel.
>> +2. Before attempting to receive the expected command, we post another
>> +   RQ work request to replace the one we just used up.
>> +3. Block on a CQ event channel and wait for the SEND to arrive.
>> +4. When the send arrives, librdmacm will unblock us.
>> +5. Verify that the command-type and version received matches the one we expected.
>> +
>> +qemu_rdma_exchange_send(header, data, optional response header & data):
>> +
>> +1. Block on the CQ event channel waiting for a READY command
>> +   from the receiver to tell us that the receiver
>> +   is *ready* for us to transmit some new bytes.
>> +2. Optionally: if we are expecting a response from the command
>> +   (that we have not yet transmitted),
> Which commands expect result? Only Register request?

Yes, only register. In the code, the command is #define 
RDMA_CONTROL_REGISTER_RESULT

>> let's post an RQ
>> +   work request to receive that data a few moments later.
>> +3. When the READY arrives, librdmacm will
>> +   unblock us and we immediately post a RQ work request
>> +   to replace the one we just used up.
>> +4. Now, we can actually post the work request to SEND
>> +   the requested command type of the header we were asked for.
>> +5. Optionally, if we are expecting a response (as before),
>> +   we block again and wait for that response using the additional
>> +   work request we previously posted. (This is used to carry
>> +   'Register result' commands #6 back to the sender, which
>> +   hold the rkey needed to perform RDMA.)
>> +
>> +All of the remaining command types (not including 'ready')
>> +described above use the aforementioned two functions to do the hard work:
>> +
>> +1. After connection setup, RAMBlock information is exchanged using
>> +   this protocol before the actual migration begins.
>> +2. During runtime, once a 'chunk' becomes full of pages ready to
>> +   be sent with RDMA, the registration commands are used to ask the
>> +   other side to register the memory for this chunk and respond
>> +   with the result (rkey) of the registration.
>> +3. The QEMUFile interfaces also call these functions (described below)
>> +   when transmitting non-live state, such as device state, or to send
>> +   its own protocol information during the migration process.
>> +
>> +Versioning
>> +==================================
>> +
>> +librdmacm provides the user with a 'private data' area to be exchanged
>> +at connection-setup time before any infiniband traffic is generated.
>> +
>> +This is a convenient place to check for protocol versioning because the
>> +user does not need to register memory to transmit a few bytes of version
>> +information.
>> +
>> +This is also a convenient place to negotiate capabilities
>> +(like dynamic page registration).
> This would be a good place to document the format of the
> private data field.

Acknowledged.


>> +
>> +If the version is invalid, we throw an error.
> Which version is valid in this specification?
Version 1. Will update.
>> +
>> +If the version is new, we only negotiate the capabilities that the
>> +requested version is able to perform and ignore the rest.
> What are these capabilities and how do we negotiate them?
There is only one capability right now: dynamic server registration.

The client must tell the server whether or not the capability was
enabled or not on the primary VM side.

Will update the documentation.

>> +QEMUFileRDMA Interface:
>> +==================================
>> +
>> +QEMUFileRDMA introduces a couple of new functions:
>> +
>> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
>> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
>> +
>> +These two functions are very short and simply use the protocol
>> +described above to deliver bytes without changing the upper-level
>> +users of QEMUFile that depend on a bytestream abstraction.
>> +
>> +Finally, how do we hand off the actual bytes to get_buffer()?
>> +
>> +Again, because we're trying to "fake" a bytestream abstraction
>> +using an analogy not unlike individual UDP frames, we have
>> +to hold on to the bytes received from control-channel's SEND
>> +messages in memory.
>> +
>> +Each time we receive a complete "QEMU File" control-channel
>> +message, the bytes from SEND are copied into a small local holding area.
>> +
>> +Then, we return the number of bytes requested by get_buffer()
>> +and leave the remaining bytes in the holding area until get_buffer()
>> +comes around for another pass.
>> +
>> +If the buffer is empty, then we follow the same steps
>> +listed above and issue another "QEMU File" protocol command,
>> +asking for a new SEND message to re-fill the buffer.
>> +
>> +Migration of pc.ram:
>> +===============================
>> +
>> +At the beginning of the migration, (migration-rdma.c),
>> +the sender and the receiver populate the list of RAMBlocks
>> +to be registered with each other into a structure.
>> +Then, using the aforementioned protocol, they exchange a
>> +description of these blocks with each other, to be used later
>> +during the iteration of main memory. This description includes
>> +a list of all the RAMBlocks, their offsets and lengths and
>> +possibly includes pre-registered RDMA keys in case dynamic
>> +page registration was disabled on the server-side; otherwise they are omitted.
> Worth mentioning here that memory hotplug will require a protocol
> extension. That's also true of TCP so not a big deal ...

Acknowledged.

>> +
>> +Main memory is not migrated with the aforementioned protocol,
>> +but is instead migrated with normal RDMA Write operations.
>> +
>> +Pages are migrated in "chunks" (about 1 Megabyte right now).
> Why "about"? This is not dynamic so needs to be exactly same
> on both sides, right?
About is a typo =). It is hard-coded to exactly 1MB.

>
>> +Chunk size is not dynamic, but it could be in a future implementation.
>> +There's nothing to indicate that this is useful right now.
>> +
>> +When a chunk is full (or a flush() occurs), the memory backed by
>> +the chunk is registered with librdmacm and pinned in memory on
>> +both sides using the aforementioned protocol.
>> +
>> +After pinning, an RDMA Write is generated and transmitted
>> +for the entire chunk.
>> +
>> +Chunks are also transmitted in batches: This means that we
>> +do not request that the hardware signal the completion queue
>> +for the completion of *every* chunk. The current batch size
>> +is about 64 chunks (corresponding to 64 MB of memory).
>> +Only the last chunk in a batch must be signaled.
>> +This helps keep everything as asynchronous as possible
>> +and helps keep the hardware busy performing RDMA operations.
>> +
>> +Error-handling:
>> +===============================
>> +
>> +Infiniband has what is called a "Reliable, Connected"
>> +link (one of 4 choices). This is the mode
>> +we use for RDMA migration.
>> +
>> +If a *single* message fails,
>> +the decision is to abort the migration entirely and
>> +cleanup all the RDMA descriptors and unregister all
>> +the memory.
>> +
>> +After cleanup, the Virtual Machine is returned to normal
>> +operation the same way that would happen if the TCP
>> +socket is broken during a non-RDMA based migration.
> That's on sender side? Presumably this means you respond to
> completion with error?
>   How does receive side know
> migration is complete?

Yes, on the sender side.

Migration "completeness" logic has not changed in this patch series.

Please recall that the entire QEMUFile protocol is still
happening at the upper-level inside of savevm.c/arch_init.c.



>> +
>> +TODO:
>> +=================================
>> +1. Currently, cgroups swap limits for *both* TCP and RDMA
>> +   on the sender-side are broken. This is more acute for
>> +   RDMA because RDMA requires memory registration.
>> +   Fixing this requires infiniband page registrations to be
>> +   zero-page aware, and this does not yet work properly.
>> +2. Currently overcommit for the *receiver* side of
>> +   TCP works, but not for RDMA. While dynamic page registration
>> +   *does* work, it is only useful if the is_zero_page() capability
>> +   remains enabled (which it is by default).
>> +   However, leaving this capability turned on *significantly* slows
>> +   down the RDMA throughput, particularly on hardware capable
>> +   of transmitting faster than 10 gbps (such as 40gbps links).
>> +3. Use of the recent /dev/<pid>/pagemap would likely solve some
>> +   of these problems.
>> +4. Also, some form of balloon-device usage tracking would also
>> +   help alleviate some of these issues.
>> +
>> +PERFORMANCE
>> +===================
>> +
>> +Using a 40gbps infiniband link performing a worst-case stress test:
>> +
>> +1. Average worst-case RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>> +   approximately 30 gbps (a little better than the paper)
>> +2. Average worst-case TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>> +   approximately 8 gbps (using IPoIB, IP over Infiniband)
>> +
>> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
>> +
>> +An *exhaustive* paper (2010) shows additional performance details
>> +linked on the QEMU wiki:
>> +
>> +http://wiki.qemu.org/Features/RDMALiveMigration
>> -- 
>> 1.7.10.4


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-10 13:04     ` Michael R. Hines
@ 2013-04-10 13:34       ` Michael S. Tsirkin
  2013-04-10 15:29         ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-10 13:34 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
> On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> >Below is a great high level overview. the protocol looks correct.
> >A bit more detail would be helpful, as noted below.
> >
> >The main thing I'd like to see changed is that there are already
> >two protocols here: chunk-based and non chunk based.
> >We'll need to use versioning and capabilities going forward but in the
> >first version we don't need to maintain compatibility with legacy so
> >two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> >that is worth fixing longer term, but seems like the way forward. So
> >let's implement a single chunk-based protocol in the first version we
> >merge.
> >
> >Some more minor improvement suggestions below.
> Thanks.
> 
> However, IMHO restricting the policy to only use chunk-based
> registration is really not an acceptable choice:
> 
> Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
> dive from 10gbps to 6gbps.

Who cares about the throughput really? What we do care about
is how long the whole process takes.



> But if I disable chunk-based registration altogether (forgoing
> overcommit), then performance comes back.
> 
> The reason for this is the additional control channel traffic
> needed to ask the server to register memory pages on demand -
> without this traffic, we can easily saturate the link.
> But with this traffic, the user needs to know about the feature
> (and be given the option to disable it) in case they want
> performance instead of flexibility.
> 

IMO that's just because the current control protocol is so inefficient.
You just need to pipeline the registration: request the next chunk
while remote side is handling the previous one(s).

With any protocol, you still need to:
	register all memory
	send addresses and keys to source
	get notification that write is done
what is different with chunk based?
simply that there are several network roundtrips
before the process can start.
So part of the time you are not doing writes,
you are waiting for the next control message.

So you should be doing several in parallel.
This will complicate the protocol though, so I am not asking
for this right away.
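To make the idea concrete, here is a minimal sketch (illustrative only, not QEMU code; all names are invented) of bounding several in-flight registration requests so writes for earlier chunks proceed while later chunks are being pinned:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of pipelined chunk registration: allow up to
 * MAX_INFLIGHT Register requests on the wire at once, so the sender
 * keeps writing while the receiver pins earlier chunks. */
#define MAX_INFLIGHT 4

typedef struct {
    int inflight; /* Register requests still awaiting a Register result */
} RegWindow;

/* True if another Register request may be posted without blocking. */
static bool reg_window_can_post(const RegWindow *w)
{
    return w->inflight < MAX_INFLIGHT;
}

static void reg_window_post(RegWindow *w)
{
    assert(reg_window_can_post(w));
    w->inflight++;
}

/* Called when a Register result arrives from the receiver. */
static void reg_window_complete(RegWindow *w)
{
    assert(w->inflight > 0);
    w->inflight--;
}
```

The migration thread only blocks when the window is full, instead of on every single chunk.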

But a broken pin-it-all alternative will just confuse matters.  It is
best to keep it out of tree.


> >On Mon, Apr 08, 2013 at 11:04:32PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >>From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>
> >>Both the protocol and interfaces are elaborated in more detail,
> >>including the new use of dynamic chunk registration, versioning,
> >>and capabilities negotiation.
> >>
> >>Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >>---
> >>  docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 313 insertions(+)
> >>  create mode 100644 docs/rdma.txt
> >>
> >>diff --git a/docs/rdma.txt b/docs/rdma.txt
> >>new file mode 100644
> >>index 0000000..e9fa4cd
> >>--- /dev/null
> >>+++ b/docs/rdma.txt
> >>@@ -0,0 +1,313 @@
> >>+Several changes since v4:
> >>+
> >>+- Created a "formal" protocol for the RDMA control channel
> >>+- Dynamic, chunked page registration now implemented on *both* the server and client
> >>+- Created new 'capability' for page registration
> >>+- Created new 'capability' for is_zero_page() - enabled by default
> >>+  (needed to test dynamic page registration)
> >>+- Created version-check before protocol begins at connection-time
> >>+- no more migrate_use_rdma() !
> >>+
> >>+NOTE: While dynamic registration works on both sides now,
> >>+      it does *not* work with cgroups swap limits. This functionality with infiniband
> >>+      remains broken. (It works fine with TCP). So, in order to take full
> >>+      advantage of this feature, a fix will have to be developed on the kernel side.
> >>+      The proposed alternative is to use /dev/<pid>/pagemap. A patch will be submitted.
> >You mean the idea of using pagemap to detect shared pages created by KSM
> >and/or zero pages? That would be helpful for TCP migration, thanks!
> 
> Yes, absolutely. This would *also* help the above registration problem.
> 
> We could use this to *pre-register* pages in advance, but that would be
> an entirely different patch series (which I'm willing to write and submit).
> 
> >>+
> >BTW the above comments belong outside both document and commit log,
> >after --- before diff.
> Acknowledged.
> 
> >>+Contents:
> >>+=================================
> >>+* Compiling
> >>+* Running (please readme before running)
> >>+* RDMA Protocol Description
> >>+* Versioning
> >>+* QEMUFileRDMA Interface
> >>+* Migration of pc.ram
> >>+* Error handling
> >>+* TODO
> >>+* Performance
> >>+
> >>+COMPILING:
> >>+===============================
> >>+
> >>+$ ./configure --enable-rdma --target-list=x86_64-softmmu
> >>+$ make
> >>+
> >>+RUNNING:
> >>+===============================
> >>+
> >>+First, decide if you want dynamic page registration on the server-side.
> >>+This always happens on the primary-VM side, but is optional on the server.
> >>+Doing this allows you to support overcommit (such as cgroups or ballooning)
> >>+with a smaller footprint on the server-side without having to register the
> >>+entire VM memory footprint.
> >>+NOTE: This significantly slows down performance (about 30% slower).
> >Where does the overhead come from? It appears from the description that
> >you have exactly same amount of data to exchange using send messages,
> >either way?
> >Or are you using bigger chunks with upfront registration?
> 
> Answer is above.
> 
> Upfront registration registers the entire VM before migration starts
> whereas dynamic registration (on both sides) registers chunks in
> 1 MB increments as they are requested by the migration_thread.
> 
> The extra send messages required to request the server to register
> the memory mean that the RDMA write must block until those messages
> complete before it can begin.

So make the protocol smarter and fix this. This is not something
management needs to know about.


If you like, you can teach management to specify the max amount of
memory pinned. It should be specified at the appropriate place:
on the remote for remote, on source for source.

> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
> >I think the right choice is to make chunk based the default, and remove
> >the non chunk based from code.  This will simplify the protocol a tiny bit,
> >and make us focus on improving chunk based long term so that it's as
> >fast as upfront registration.
> Answer above.
> 
> >>+
> >>+Next, if you decided *not* to use chunked registration on the server,
> >>+it is recommended to also disable zero page detection. While this is not
> >>+strictly necessary, zero page detection also significantly slows down
> >>+performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
> >What is meant by performance here? downtime?
> 
> Throughput. Zero page scanning (and dynamic registration) reduces
> throughput significantly.

Again, not something management should worry about.
Do the right thing internally.

> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
> >>+
> >>+Finally, set the migration speed to match your hardware's capabilities:
> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> >>+
> >>+Finally, perform the actual migration:
> >>+
> >>+$ virsh migrate domain rdma:xx.xx.xx.xx:port
> >>+
> >>+RDMA Protocol Description:
> >>+=================================
> >>+
> >>+Migration with RDMA is separated into two parts:
> >>+
> >>+1. The transmission of the pages using RDMA
> >>+2. Everything else (a control channel is introduced)
> >>+
> >>+"Everything else" is transmitted using a formal
> >>+protocol now, consisting of infiniband SEND / RECV messages.
> >>+
> >>+An infiniband SEND message is the standard ibverbs
> >>+message used by applications of infiniband hardware.
> >>+The only difference between a SEND message and an RDMA
> >>+message is that SEND messages cause completion notifications
> >>+to be posted to the completion queue (CQ) on the
> >>+infiniband receiver side, whereas RDMA messages (used
> >>+for pc.ram) do not (to behave like an actual DMA).
> >>+
> >>+Messages in infiniband require two things:
> >>+
> >>+1. registration of the memory that will be transmitted
> >>+2. (SEND/RECV only) work requests to be posted on both
> >>+   sides of the network before the actual transmission
> >>+   can occur.
> >>+
> >>+RDMA messages are much easier to deal with. Once the memory
> >>+on the receiver side is registered and pinned, we're
> >>+basically done. All that is required is for the sender
> >>+side to start dumping bytes onto the link.
> >When is memory unregistered and unpinned on send and receive
> >sides?
> Only when the migration ends completely. Will update the documentation.
> 
> >>+
> >>+SEND messages require more coordination because the
> >>+receiver must have reserved space (using a receive
> >>+work request) on the receive queue (RQ) before QEMUFileRDMA
> >>+can start using them to carry all the bytes as
> >>+a transport for migration of device state.
> >>+
> >>+To begin the migration, the initial connection setup is
> >>+as follows (migration-rdma.c):
> >>+
> >>+1. Receiver and Sender are started (command line or libvirt):
> >>+2. Both sides post two RQ work requests
> >Okay this could be where the problem is. This means with chunk
> >based receive side does:
> >
> >loop:
> >	receive request
> >	register
> >	send response
> >
> >while with non chunk based it does:
> >
> >receive request
> >send response
> >loop:
> >	register
> No, that's incorrect. With "non" chunk based, the receive side does
> *not* communicate
> during the migration of pc.ram.

It does not matter when this happens. What we care about is downtime and
total time from start of qemu on remote and until migration completes.
Not peak throughput.
If you don't count registration time on remote, that's just wrong.

> The control channel is only used for chunk registration and device
> state, not RAM.
> 
> I will update the documentation to make that more clear.

It's clear enough I think. But it seems you are measuring
the wrong things.

> >In reality each request/response requires two network round-trips
> >with the Ready credit-management messages.
> >So the overhead will likely be avoided if we add better pipelining:
> >allow multiple registration requests in the air, and add more
> >send/receive credits so the overhead of credit management can be
> >reduced.
> Unfortunately, the migration thread doesn't work that way.
> The thread only generates one page write at-a-time.

Yes but you do not have to block it. Each page is in these states:
	- unpinned not sent
	- pinned no rkey
	- pinned have rkey
	- unpinned sent

Each time you get a new page, it's in unpinned not sent state.
So you can start it on this state machine, and tell migration thread
to proceed to the next page.
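The four states above can be written as a tiny C state machine (a sketch of the suggestion, not code from the patch):

```c
#include <assert.h>

/* The four page states listed above; transitions happen off the
 * migration thread so it never blocks on registration. */
typedef enum {
    PAGE_UNPINNED_NOT_SENT, /* dirty page handed over by migration thread */
    PAGE_PINNED_NO_RKEY,    /* Register request sent, awaiting rkey */
    PAGE_PINNED_HAVE_RKEY,  /* rkey received, RDMA write can be posted */
    PAGE_UNPINNED_SENT      /* write completed */
} PageState;

/* Advance a page one step through the machine. */
static PageState page_advance(PageState s)
{
    switch (s) {
    case PAGE_UNPINNED_NOT_SENT: return PAGE_PINNED_NO_RKEY;
    case PAGE_PINNED_NO_RKEY:    return PAGE_PINNED_HAVE_RKEY;
    default:                     return PAGE_UNPINNED_SENT;
    }
}
```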

> If someone were to write a patch which submits multiple
> writes at the same time, I would be very interested in
> consuming that feature and making chunk registration more
> efficient by batching multiple registrations into fewer messages.

No changes to migration core is necessary I think.
But assuming they are - your protocol design and
management API should not be driven by internal qemu APIs.

> >There's no requirement to implement these optimizations upfront
> >before merging the first version, but let's remove the
> >non-chunkbased crutch unless we see it as absolutely necessary.
> >
> >>+3. Receiver does listen()
> >>+4. Sender does connect()
> >>+5. Receiver accept()
> >>+6. Check versioning and capabilities (described later)
> >>+
> >>+At this point, we define a control channel on top of SEND messages
> >>+which is described by a formal protocol. Each SEND message has a
> >>+header portion and a data portion (but together are transmitted
> >>+as a single SEND message).
> >>+
> >>+Header:
> >>+    * Length  (of the data portion)
> >>+    * Type    (what command to perform, described below)
> >>+    * Version (protocol version validated before send/recv occurs)
> >What's the expected value for Version field?
> >Also, confusing.  Below mentions using private field in librdmacm instead?
> >Need to add # of bytes and endian-ness of each field.
> 
> Correct, those are two separate versions. One for capability negotiation
> and one for the protocol itself.
> 
> I will update the documentation.
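For reference, the header described above maps naturally onto a three-field C struct (field widths and byte order here are assumptions, not taken from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed wire format of the control-channel SEND header; the patch
 * does not specify field widths or endianness, so treat these as
 * placeholders (fields would be converted to network byte order). */
typedef struct {
    uint32_t len;     /* length of the data portion in bytes */
    uint32_t type;    /* command value: Ready, QEMU File, ... */
    uint32_t version; /* protocol version, validated before send/recv */
} RDMAControlHeader;
```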

Just drop the all-pinned version, and we'll work to improve
the chunk-based one until it has reasonable performance.
It seems to get a decent speed already: consider that
most people run migration with the default speed limit.
Supporting all-pinned will just be a pain down the road when
we fix performance for chunk based one.


> >>+
> >>+The 'type' field has 7 different command values:
> >0. Unused.
> >
> >>+    1. None
> >you mean this is unused?
> 
> Correct - will update.
> 
> >>+    2. Ready             (control-channel is available)
> >>+    3. QEMU File         (for sending non-live device state)
> >>+    4. RAM Blocks        (used right after connection setup)
> >>+    5. Register request  (dynamic chunk registration)
> >>+    6. Register result   ('rkey' to be used by sender)
> >Hmm, don't you also need a virtual address for RDMA writes?
> >
> 
> The virtual addresses are communicated at the beginning of the
> migration using command #4 "Ram blocks".

Yes but ram blocks are sent source to dest.
virtual address needs to be sent dest to source no?

> >>+    7. Register finished (registration for current iteration finished)
> >What does Register finished mean and how it's used?
> >
> >Need to add which commands have a data portion, and in what format.
> 
> Acknowledged. "finished" signals that a migration round has completed
> and that the receiver side can move to the next iteration.
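The seven command values could be written as an enum; RDMA_CONTROL_REGISTER_RESULT is the one name confirmed in this thread, the rest follow that pattern as guesses:

```c
#include <assert.h>

/* Command values 1-7 from the list above (0 is unused). Only
 * RDMA_CONTROL_REGISTER_RESULT is a name mentioned in the thread;
 * the others are invented by analogy. */
typedef enum {
    RDMA_CONTROL_NONE = 1,          /* unused */
    RDMA_CONTROL_READY,             /* control channel is available */
    RDMA_CONTROL_QEMU_FILE,         /* non-live device state bytes */
    RDMA_CONTROL_RAM_BLOCKS,        /* sent right after connection setup */
    RDMA_CONTROL_REGISTER_REQUEST,  /* dynamic chunk registration */
    RDMA_CONTROL_REGISTER_RESULT,   /* rkey returned to the sender */
    RDMA_CONTROL_REGISTER_FINISHED  /* iteration's registrations done */
} RDMAControlType;
```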
> 
> 
> >>+
> >>+After connection setup is completed, we have two protocol-level
> >>+functions, responsible for communicating control-channel commands
> >>+using the above list of values:
> >>+
> >>+Logically:
> >>+
> >>+qemu_rdma_exchange_recv(header, expected command type)
> >>+
> >>+1. We transmit a READY command to let the sender know that
> >you call it Ready above, so better be consistent.
> >
> >>+   we are *ready* to receive some data bytes on the control channel.
> >>+2. Before attempting to receive the expected command, we post another
> >>+   RQ work request to replace the one we just used up.
> >>+3. Block on a CQ event channel and wait for the SEND to arrive.
> >>+4. When the send arrives, librdmacm will unblock us.
> >>+5. Verify that the command-type and version received matches the one we expected.
> >>+
> >>+qemu_rdma_exchange_send(header, data, optional response header & data):
> >>+
> >>+1. Block on the CQ event channel waiting for a READY command
> >>+   from the receiver to tell us that the receiver
> >>+   is *ready* for us to transmit some new bytes.
> >>+2. Optionally: if we are expecting a response from the command
> >>+   (that we have not yet transmitted),
> >Which commands expect result? Only Register request?
> 
> Yes, only register. In the code, the command is #define
> RDMA_CONTROL_REGISTER_RESULT
> 
> >>let's post an RQ
> >>+   work request to receive that data a few moments later.
> >>+3. When the READY arrives, librdmacm will
> >>+   unblock us and we immediately post a RQ work request
> >>+   to replace the one we just used up.
> >>+4. Now, we can actually post the work request to SEND
> >>+   the requested command type of the header we were asked for.
> >>+5. Optionally, if we are expecting a response (as before),
> >>+   we block again and wait for that response using the additional
> >>+   work request we previously posted. (This is used to carry
> >>+   'Register result' commands #6 back to the sender which
> >>+   holds the rkey needed to perform RDMA.)
> >>+
> >>+All of the remaining command types (not including 'Ready')
> >>+described above use the aforementioned two functions to do the hard work:
> >>+
> >>+1. After connection setup, RAMBlock information is exchanged using
> >>+   this protocol before the actual migration begins.
> >>+2. During runtime, once a 'chunk' becomes full of pages ready to
> >>+   be sent with RDMA, the registration commands are used to ask the
> >>+   other side to register the memory for this chunk and respond
> >>+   with the result (rkey) of the registration.
> >>+3. Also, the QEMUFile interfaces also call these functions (described below)
> >>+   when transmitting non-live state, such as devices or to send
> >>+   its own protocol information during the migration process.
> >>+
> >>+Versioning
> >>+==================================
> >>+
> >>+librdmacm provides the user with a 'private data' area to be exchanged
> >>+at connection-setup time before any infiniband traffic is generated.
> >>+
> >>+This is a convenient place to check for protocol versioning because the
> >>+user does not need to register memory to transmit a few bytes of version
> >>+information.
> >>+
> >>+This is also a convenient place to negotiate capabilities
> >>+(like dynamic page registration).
> >This would be a good place to document the format of the
> >private data field.
> 
> Acknowledged.
> 
> 
> >>+
> >>+If the version is invalid, we throw an error.
> >Which version is valid in this specification?
> Version 1. Will update.
> >>+
> >>+If the version is new, we only negotiate the capabilities that the
> >>+requested version is able to perform and ignore the rest.
> >What are these capabilities and how do we negotiate them?
> There is only one capability right now: dynamic server registration.
> 
> The client must tell the server whether or not the capability was
> enabled or not on the primary VM side.
> 
> Will update the documentation.

Cool, best add an exact structure format.
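By way of example, the private-data blob could be as small as this (an assumption purely to illustrate the request for an exact format, not the actual layout):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical private-data layout for rdma_connect(): a version
 * number plus a capability bitmask. Entirely an illustration; the
 * real format is up to the patch author. */
#define RDMA_CAP_DYNAMIC_REGISTRATION (1u << 0)

typedef struct {
    uint32_t version;      /* valid value is 1 in this specification */
    uint32_t capabilities; /* bitmask of RDMA_CAP_* flags */
} RDMACapHeader;
```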

> >>+QEMUFileRDMA Interface:
> >>+==================================
> >>+
> >>+QEMUFileRDMA introduces a couple of new functions:
> >>+
> >>+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> >>+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> >>+
> >>+These two functions are very short and simply use the protocol
> >>+described above to deliver bytes without changing the upper-level
> >>+users of QEMUFile that depend on a bytestream abstraction.
> >>+
> >>+Finally, how do we handoff the actual bytes to get_buffer()?
> >>+
> >>+Again, because we're trying to "fake" a bytestream abstraction
> >>+using an analogy not unlike individual UDP frames, we have
> >>+to hold on to the bytes received from control-channel's SEND
> >>+messages in memory.
> >>+
> >>+Each time we receive a complete "QEMU File" control-channel
> >>+message, the bytes from SEND are copied into a small local holding area.
> >>+
> >>+Then, we return the number of bytes requested by get_buffer()
> >>+and leave the remaining bytes in the holding area until get_buffer()
> >>+comes around for another pass.
> >>+
> >>+If the buffer is empty, then we follow the same steps
> >>+listed above and issue another "QEMU File" protocol command,
> >>+asking for a new SEND message to re-fill the buffer.
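The holding-area logic described above amounts to draining a small buffer of SEND bytes on demand; a rough sketch (names invented, not the patch's code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative holding area: bytes from a "QEMU File" SEND are copied
 * into buf, and get_buffer() drains them on demand. */
typedef struct {
    uint8_t buf[4096];
    size_t  len; /* bytes received from the last SEND */
    size_t  pos; /* bytes already handed to get_buffer() */
} HoldingArea;

/* Copy up to 'want' bytes out; a return of 0 means the buffer is empty
 * and another "QEMU File" command must be issued to refill it. */
static size_t holding_get_buffer(HoldingArea *h, uint8_t *out, size_t want)
{
    size_t avail = h->len - h->pos;
    size_t n = want < avail ? want : avail;
    memcpy(out, h->buf + h->pos, n);
    h->pos += n;
    return n;
}
```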
> >>+
> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >>+Then, using the aforementioned protocol, they exchange a
> >>+description of these blocks with each other, to be used later
> >>+during the iteration of main memory. This description includes
> >>+a list of all the RAMBlocks, their offsets and lengths and
> >>+possibly includes pre-registered RDMA keys in case dynamic
> >>+page registration was disabled on the server-side, otherwise not.
> >Worth mentioning here that memory hotplug will require a protocol
> >extension. That's also true of TCP so not a big deal ...
> 
> Acknowledged.
> 
> >>+
> >>+Main memory is not migrated with the aforementioned protocol,
> >>+but is instead migrated with normal RDMA Write operations.
> >>+
> >>+Pages are migrated in "chunks" (about 1 Megabyte right now).
> >Why "about"? This is not dynamic so needs to be exactly same
> >on both sides, right?
> About is a typo =). It is hard-coded to exactly 1MB.

This, by the way, is something management *may* want to control.

> >
> >>+Chunk size is not dynamic, but it could be in a future implementation.
> >>+There's nothing to indicate that this is useful right now.
> >>+
> >>+When a chunk is full (or a flush() occurs), the memory backed by
> >>+the chunk is registered with librdmacm and pinned in memory on
> >>+both sides using the aforementioned protocol.
> >>+
> >>+After pinning, an RDMA Write is generated and transmitted
> >>+for the entire chunk.
> >>+
> >>+Chunks are also transmitted in batches: This means that we
> >>+do not request that the hardware signal the completion queue
> >>+for the completion of *every* chunk. The current batch size
> >>+is about 64 chunks (corresponding to 64 MB of memory).
> >>+Only the last chunk in a batch must be signaled.
> >>+This helps keep everything as asynchronous as possible
> >>+and helps keep the hardware busy performing RDMA operations.
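The batching rule above reduces to signaling one work request per batch of 64; as a sketch (the constant mirrors the "about 64 chunks" in the text):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the completion-signaling policy described above: only the
 * last RDMA write in each batch of 64 chunks requests a completion
 * queue entry; the rest are posted unsignaled. */
#define CHUNK_BATCH 64

static bool chunk_write_is_signaled(unsigned chunk_index)
{
    return (chunk_index % CHUNK_BATCH) == (CHUNK_BATCH - 1);
}
```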
> >>+
> >>+Error-handling:
> >>+===============================
> >>+
> >>+Infiniband has what is called a "Reliable, Connected"
> >>+link (one of 4 choices). This is the mode we use
> >>+for RDMA migration.
> >>+
> >>+If a *single* message fails,
> >>+the decision is to abort the migration entirely and
> >>+cleanup all the RDMA descriptors and unregister all
> >>+the memory.
> >>+
> >>+After cleanup, the Virtual Machine is returned to normal
> >>+operation the same way that would happen if the TCP
> >>+socket is broken during a non-RDMA based migration.
> >That's on sender side? Presumably this means you respond to
> >completion with error?
> >  How does receive side know
> >migration is complete?
> 
> Yes, on the sender side.
> 
> Migration "completeness" logic has not changed in this patch series.
> 
> Please recall that the entire QEMUFile protocol is still
> happening at the upper level inside of savevm.c/arch_init.c.
> 

So basically receive side detects that migration is complete by
looking at the QEMUFile data?

> 
> >>+
> >>+TODO:
> >>+=================================
> >>+1. Currently, cgroups swap limits for *both* TCP and RDMA
> >>+   on the sender-side are broken. This is more acute for
> >>+   RDMA because RDMA requires memory registration.
> >>+   Fixing this requires infiniband page registrations to be
> >>+   zero-page aware, and this does not yet work properly.
> >>+2. Currently overcommit for the *receiver* side of
> >>+   TCP works, but not for RDMA. While dynamic page registration
> >>+   *does* work, it is only useful if the is_zero_page() capability
> >>+   remains enabled (which it is by default).
> >>+   However, leaving this capability turned on *significantly* slows
> >>+   down the RDMA throughput, particularly on hardware capable
> >>+   of transmitting faster than 10 gbps (such as 40gbps links).
> >>+3. Use of the recent /dev/<pid>/pagemap would likely solve some
> >>+   of these problems.
> >>+4. Also, some form of balloon-device usage tracking would
> >>+   help alleviate some of these issues.
> >>+
> >>+PERFORMANCE
> >>+===================
> >>+
> >>+Using a 40gbps infiniband link performing a worst-case stress test:
> >>+
> >>+1. Average worst-case RDMA throughput with
> >>+   $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> >>+   Approximately 30 gbps (a little better than the paper)
> >>+2. Average worst-case TCP throughput with the same stress test:
> >>+   Approximately 8 gbps (using IPoIB, IP over Infiniband)
> >>+
> >>+Average downtime (stop time) ranges between 28 and 33 milliseconds.
> >>+
> >>+An *exhaustive* paper (2010) shows additional performance details
> >>+linked on the QEMU wiki:
> >>+
> >>+http://wiki.qemu.org/Features/RDMALiveMigration
> >>-- 
> >>1.7.10.4

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-10 13:34       ` Michael S. Tsirkin
@ 2013-04-10 15:29         ` Michael R. Hines
  2013-04-10 17:41           ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-10 15:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/10/2013 09:34 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
>> On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
>>> Below is a great high level overview. the protocol looks correct.
>>> A bit more detail would be helpful, as noted below.
>>>
>>> The main thing I'd like to see changed is that there are already
>>> two protocols here: chunk-based and non chunk based.
>>> We'll need to use versioning and capabilities going forward but in the
>>> first version we don't need to maintain compatibility with legacy so
>>> two versions seems like unnecessary pain.  Chunk based is somewhat slower and
>>> that is worth fixing longer term, but seems like the way forward. So
>>> let's implement a single chunk-based protocol in the first version we
>>> merge.
>>>
>>> Some more minor improvement suggestions below.
>> Thanks.
>>
>> However, IMHO restricting the policy to only use chunk-based
>> registration is really not an acceptable choice:
>>
>> Here's the reason: Using my 10gbps RDMA hardware, throughput takes a
>> dive from 10gbps to 6gbps.
> Who cares about the throughput really? What we do care about
> is how long the whole process takes.
>

Low latency and high throughput are very important =)

Without these properties of RDMA, many workloads simply either
take too long to finish migrating or do not converge to a stopping
point altogether.

Not making this a configurable option would defeat the purpose of
using RDMA altogether.

Otherwise, you're no better off than just using TCP.


>
>> But if I disable chunk-based registration altogether (forgoing
>> overcommit), then performance comes back.
>>
>> The reason for this is the additional control channel traffic
>> needed to ask the server to register memory pages on demand -
>> without this traffic, we can easily saturate the link.
>> But with this traffic, the user needs to know about the feature
>> (and be given the option to disable it) in case they want
>> performance instead of flexibility.
>>
> IMO that's just because the current control protocol is so inefficient.
> You just need to pipeline the registration: request the next chunk
> while remote side is handling the previous one(s).
>
> With any protocol, you still need to:
> 	register all memory
> 	send addresses and keys to source
> 	get notification that write is done
> what is different with chunk based?
> simply that there are several network roundtrips
> before the process can start.
> So part of the time you are not doing writes,
> you are waiting for the next control message.
>
> So you should be doing several in parallel.
> This will complicate the protocol though, so I am not asking
> for this right away.
>
> But a broken pin-it-all alternative will just confuse matters.  It is
> best to keep it out of tree.

There's a huge difference. (Answer continued below this one).

The devil is in the details, here: Pipelining is simply not possible
right now because the migration thread has total control over
when and which pages are requested to be migrated.

You can't pipeline page registrations if you don't know the pages are
dirty - and the only way to know that pages are dirty is if the
migration thread told you to save them.

On the other hand, advanced registration of *known* dirty pages
is very important - I will certainly be submitting a patch in the future
which attempts to handle this case.


> So make the protocol smarter and fix this. This is not something
> management needs to know about.
>
>
> If you like, you can teach management to specify the max amount of
> memory pinned. It should be specified at the appropriate place:
> on the remote for remote, on source for source.
>

Answer below.

>>>
>>> What is meant by performance here? downtime?
>> Throughput. Zero page scanning (and dynamic registration) reduces
>> throughput significantly.
> Again, not something management should worry about.
> Do the right thing internally.

I disagree with that: This is an entirely workload-specific decision,
not a system-level decision.

If I have a known memory-intensive workload that is virtualized,
then it would be "too late" to disable zero page detection *after*
the RDMA migration begins.

We have management tools already that are that smart - there's
nothing wrong with smart management knowing in advance that
a workload is memory-intensive and also knowing that an RDMA
migration is going to be issued.

There's no way for QEMU to know that in advance without some kind
of advanced heuristic that tracks the behavior of the VM over time,
which I don't think anybody wants to get into the business of writing =)

>>>> +
>>>> +SEND messages require more coordination because the
>>>> +receiver must have reserved space (using a receive
>>>> +work request) on the receive queue (RQ) before QEMUFileRDMA
>>>> +can start using them to carry all the bytes as
>>>> +a transport for migration of device state.
>>>> +
>>>> +To begin the migration, the initial connection setup is
>>>> +as follows (migration-rdma.c):
>>>> +
>>>> +1. Receiver and Sender are started (command line or libvirt):
>>>> +2. Both sides post two RQ work requests
>>> Okay this could be where the problem is. This means with chunk
>>> based receive side does:
>>>
>>> loop:
>>> 	receive request
>>> 	register
>>> 	send response
>>>
>>> while with non chunk based it does:
>>>
>>> receive request
>>> send response
>>> loop:
>>> 	register
>> No, that's incorrect. With "non" chunk based, the receive side does
>> *not* communicate
>> during the migration of pc.ram.
> It does not matter when this happens. What we care about is downtime and
> total time from start of qemu on remote and until migration completes.
> Not peak throughput.
> If you don't count registration time on remote, that's just wrong.

Answer above.


>> The control channel is only used for chunk registration and device
>> state, not RAM.
>>
>> I will update the documentation to make that more clear.
> It's clear enough I think. But it seems you are measuring
> the wrong things.
>
>>> In reality each request/response requires two network round-trips
>>> with the Ready credit-management messages.
>>> So the overhead will likely be avoided if we add better pipelining:
>>> allow multiple registration requests in the air, and add more
>>> send/receive credits so the overhead of credit management can be
>>> reduced.
>> Unfortunately, the migration thread doesn't work that way.
>> The thread only generates one page write at-a-time.
> Yes but you do not have to block it. Each page is in these states:
> 	- unpinned not sent
> 	- pinned no rkey
> 	- pinned have rkey
> 	- unpinned sent
>
> Each time you get a new page, it's in unpinned not sent state.
> So you can start it on this state machine, and tell migration thread
> to proceed to the next page.

Yes, I'm doing that already (documented as "batching") in the
docs file.

But the problem is more complicated than that: there is no coordination
between the migration_thread and RDMA right now because Paolo is
trying to maintain a very clean separation of function.

However we *can* do what you described in a future patch like this:

1. Migration thread says "iteration starts, how much memory is dirty?"
2. RDMA protocol says "Is there a lot of dirty memory?"
         OK, yes? Then batch all the registration messages into a single
         request, but do not write the memory until all the registrations
         have completed.

         OK, no? Then just issue registrations with very little batching
         so that we can quickly move on to the next iteration round.

Make sense?
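As a sketch, the decision in step 2 could be as simple as a dirty-memory threshold (the constant below is a placeholder, not a measured value):

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder policy: batch registrations aggressively only when an
 * iteration starts with a lot of dirty memory. The threshold is an
 * assumed value for illustration only. */
#define DIRTY_BATCH_THRESHOLD (256ULL * 1024 * 1024)  /* 256 MB */

static bool batch_registrations(uint64_t dirty_bytes)
{
    return dirty_bytes > DIRTY_BATCH_THRESHOLD;
}
```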

>> If someone were to write a patch which submits multiple
>> writes at the same time, I would be very interested in
>> consuming that feature and making chunk registration more
>> efficient by batching multiple registrations into fewer messages.
> No changes to migration core is necessary I think.
> But assuming they are - your protocol design and
> management API should not be driven by internal qemu APIs.

Answer above.

>>> There's no requirement to implement these optimizations upfront
>>> before merging the first version, but let's remove the
>>> non-chunkbased crutch unless we see it as absolutely necessary.
>>>
>>>> +3. Receiver does listen()
>>>> +4. Sender does connect()
>>>> +5. Receiver accept()
>>>> +6. Check versioning and capabilities (described later)
>>>> +
>>>> +At this point, we define a control channel on top of SEND messages
>>>> +which is described by a formal protocol. Each SEND message has a
>>>> +header portion and a data portion (but together are transmitted
>>>> +as a single SEND message).
>>>> +
>>>> +Header:
>>>> +    * Length  (of the data portion)
>>>> +    * Type    (what command to perform, described below)
>>>> +    * Version (protocol version validated before send/recv occurs)
>>> What's the expected value for Version field?
>>> Also, confusing.  Below mentions using private field in librdmacm instead?
>>> Need to add # of bytes and endian-ness of each field.
>> Correct, those are two separate versions. One for capability negotiation
>> and one for the protocol itself.
>>
>> I will update the documentation.
> Just drop the all-pinned version, and we'll work to improve
> the chunk-based one until it has reasonable performance.
> It seems to get a decent speed already: consider that
> most people run migration with the default speed limit.
> Supporting all-pinned will just be a pain down the road when
> we fix performance for chunk based one.
>

The speed tops out at 6gbps, that's not good enough for a 40gbps link.

The migration could complete *much* faster by disabling chunk registration.

We have very large physical machines, where chunk registration is not as
important as migrating the workload very quickly with very little downtime.

In these cases, chunk registration just "gets in the way".

>>>> +
>>>> +The 'type' field has 7 different command values:
>>> 0. Unused.
>>>
>>>> +    1. None
>>> you mean this is unused?
>> Correct - will update.
>>
>>>> +    2. Ready             (control-channel is available)
>>>> +    3. QEMU File         (for sending non-live device state)
>>>> +    4. RAM Blocks        (used right after connection setup)
>>>> +    5. Register request  (dynamic chunk registration)
>>>> +    6. Register result   ('rkey' to be used by sender)
>>> Hmm, don't you also need a virtual address for RDMA writes?
>>>
>> The virtual addresses are communicated at the beginning of the
>> migration using command #4 "Ram blocks".
> Yes but ram blocks are sent source to dest.
> virtual address needs to be sent dest to source no?

I just said that, no? =)
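For concreteness, the header and command set described above might be pinned down like this (field widths, byte order, and identifier names are assumptions for the sketch; the patch does not yet specify them):

```c
#include <stdint.h>

/* Hypothetical control-channel header: fixed-width fields, transmitted
 * in network byte order. Widths are assumed, not from the patch. */
typedef struct {
    uint32_t len;      /* length of the data portion, in bytes */
    uint32_t type;     /* one of the command values below */
    uint32_t version;  /* protocol version, validated before send/recv */
} RDMAControlHeader;

/* The seven 'type' values from the documentation (0 is unused). */
enum {
    RDMA_CONTROL_UNUSED = 0,
    RDMA_CONTROL_NONE,             /* 1: unused placeholder */
    RDMA_CONTROL_READY,            /* 2: control channel is available */
    RDMA_CONTROL_QEMU_FILE,        /* 3: non-live device state */
    RDMA_CONTROL_RAM_BLOCKS,       /* 4: sent right after connection setup */
    RDMA_CONTROL_REGISTER_REQUEST, /* 5: dynamic chunk registration */
    RDMA_CONTROL_REGISTER_RESULT   /* 6: rkey to be used by the sender */
};
```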

>>
>> There is only one capability right now: dynamic server registration.
>>
>> The client must tell the server whether or not the capability was
>> enabled or not on the primary VM side.
>>
>> Will update the documentation.
> Cool, best add an exact structure format.

Acknowledged.

>>>> +
>>>> +Main memory is not migrated with the aforementioned protocol,
>>>> +but is instead migrated with normal RDMA Write operations.
>>>> +
>>>> +Pages are migrated in "chunks" (about 1 Megabyte right now).
>>> Why "about"? This is not dynamic so needs to be exactly same
>>> on both sides, right?
>> About is a typo =). It is hard-coded to exactly 1MB.
> This, by the way, is something management *may* want to control.

Acknowledged.

>>>> +Chunk size is not dynamic, but it could be in a future implementation.
>>>> +There's nothing to indicate that this is useful right now.
>>>> +
>>>> +When a chunk is full (or a flush() occurs), the memory backed by
>>>> +the chunk is registered with librdmacm and pinned in memory on
>>>> +both sides using the aforementioned protocol.
>>>> +
>>>> +After pinning, an RDMA Write is generated and transmitted
>>>> +for the entire chunk.
>>>> +
>>>> +Chunks are also transmitted in batches: This means that we
>>>> +do not request that the hardware signal the completion queue
>>>> +for the completion of *every* chunk. The current batch size
>>>> +is about 64 chunks (corresponding to 64 MB of memory).
>>>> +Only the last chunk in a batch must be signaled.
>>>> +This helps keep everything as asynchronous as possible
>>>> +and helps keep the hardware busy performing RDMA operations.
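The chunking and selective signaling described above reduce to two small calculations. A sketch using the constants quoted in the text (helper names are made up):

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_CHUNK_SIZE (1ULL * 1024 * 1024) /* hard-coded 1 MB chunks */
#define SIGNAL_INTERVAL 64                   /* one signaled WR per 64 chunks */

/* Which chunk does a guest-RAM offset fall into? */
static uint64_t chunk_index(uint64_t ram_offset)
{
    return ram_offset / RDMA_CHUNK_SIZE;
}

/* Request a completion-queue entry only for the last chunk of each
 * batch; all other RDMA writes are posted unsignaled. */
static bool chunk_is_signaled(uint64_t index)
{
    return (index % SIGNAL_INTERVAL) == SIGNAL_INTERVAL - 1;
}
```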
>>>> +
>>>> +Error-handling:
>>>> +===============================
>>>> +
>>>> +InfiniBand has what is called a "Reliable, Connected"
>>>> +link (one of 4 choices). This is the mode
>>>> +we use for RDMA migration.
>>>> +
>>>> +If a *single* message fails,
>>>> +the decision is to abort the migration entirely and
>>>> +clean up all the RDMA descriptors and unregister all
>>>> +the memory.
>>>> +
>>>> +After cleanup, the Virtual Machine is returned to normal
>>>> +operation the same way that would happen if the TCP
>>>> +socket is broken during a non-RDMA based migration.
>>> That's on sender side? Presumably this means you respond to
>>> completion with error?
>>>   How does receive side know
>>> migration is complete?
>> Yes, on the sender side.
>>
>> Migration "completeness" logic has not changed in this patch series.
>>
>> Please recall that the entire QEMUFile protocol is still
>> happening at the upper-level inside of savevm.c/arch_init.c.
>>
> So basically receive side detects that migration is complete by
> looking at the QEMUFile data?
>

That's correct - same mechanism used by TCP.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-10 15:29         ` Michael R. Hines
@ 2013-04-10 17:41           ` Michael S. Tsirkin
  2013-04-10 20:05             ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-10 17:41 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Wed, Apr 10, 2013 at 11:29:24AM -0400, Michael R. Hines wrote:
> On 04/10/2013 09:34 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
> >>On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> >>>Below is a great high level overview. the protocol looks correct.
> >>>A bit more detail would be helpful, as noted below.
> >>>
> >>>The main thing I'd like to see changed is that there are already
> >>>two protocols here: chunk-based and non chunk based.
> >>>We'll need to use versioning and capabilities going forward but in the
> >>>first version we don't need to maintain compatibility with legacy so
> >>>two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> >>>that is worth fixing longer term, but seems like the way forward. So
> >>>let's implement a single chunk-based protocol in the first version we
> >>>merge.
> >>>
> >>>Some more minor improvement suggestions below.
> >>Thanks.
> >>
> >>However, IMHO restricting the policy to only used chunk-based is really
> >>not an acceptable choice:
> >>
> >>Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
> >>dive from 10gbps to 6gbps.
> >Who cares about the throughput really? What we do care about
> >is how long the whole process takes.
> >
> 
> Low latency and high throughput is very important =)
> 
> Without these properties of RDMA, many workloads simply either
> take too long to finish migrating or do not converge to a stopping
> point altogether.
> 
> *Not making this a configurable option would defeat the purpose of
> using RDMA altogether.
> 
> Otherwise, you're no better off than just using TCP.

So we have two protocols implemented: one is slow the other pins all
memory on destination indefinitely.

I see two options here:
- improve the slow version so it's fast, drop the pin all version
- give up and declare RDMA requires pinning all memory on destination

But giving management a way to do RDMA at the speed of TCP? Why is this
useful?

> 
> >
> >>But if I disable chunk-based registration altogether (forgoing
> >>overcommit), then performance comes back.
> >>
> >>The reason for this is the additional control channel traffic
> >>needed to ask the server to register
> >>memory pages on demand - without this traffic, we can easily
> >>saturate the link.
> >>But with this traffic, the user needs to know (and be given the
> >>option) to disable the feature
> >>in case they want performance instead of flexibility.
> >>
> >IMO that's just because the current control protocol is so inefficient.
> >You just need to pipeline the registration: request the next chunk
> >while remote side is handling the previous one(s).
> >
> >With any protocol, you still need to:
> >	register all memory
> >	send addresses and keys to source
> >	get notification that write is done
> >what is different with chunk based?
> >simply that there are several network roundtrips
> >before the process can start.
> >So part of the time you are not doing writes,
> >you are waiting for the next control message.
> >
> >So you should be doing several in parallel.
> >This will complicate the protocol though, so I am not asking
> >for this right away.
> >
> >But a broken pin-it-all alternative will just confuse matters.  It is
> >best to keep it out of tree.
> 
> There's a huge difference. (Answer continued below this one).
> 
> The devil is in the details, here: Pipelining is simply not possible
> right now because the migration thread has total control over
> when and which pages are requested to be migrated.
> 
> You can't pipeline page registrations if you don't know the pages
> are dirty -
> and the only way to know that pages are dirty is if the migration thread told
> you to save them.


So it tells you to save them. It does not mean you need to start
RDMA immediately.  Note the address and start the process of
notifying the remote.


>
> On the other hand, advanced registration of *known* dirty pages
> is very important - I will certainly be submitting a patch in the future
> which attempts to handle this case.

Maybe I miss something, and there are changes in the migration core
that are prerequisite to making rdma fast. So take the time and make
these changes, that's better than maintaining a broken protocol
indefinitely.


> >So make the protocol smarter and fix this. This is not something
> >management needs to know about.
> >
> >
> >If you like, you can teach management to specify the max amount of
> >memory pinned. It should be specified at the appropriate place:
> >on the remote for remote, on source for source.
> >
> 
> Answer below.
> 
> >>>
> >>>What is meant by performance here? downtime?
> >>Throughput. Zero page scanning (and dynamic registration) reduces
> >>throughput significantly.
> >Again, not something management should worry about.
> >Do the right thing internally.
> 
> I disagree with that: This is an entirely workload-specific decision,
> not a system-level decision.
> 
> If I have a known memory-intensive workload that is virtualized,
> then it would be "too late" to disable zero page detection *after*
> the RDMA migration begins.
> 
> We have management tools already that are that smart - there's
> nothing wrong with smart managment knowing in advance that
> a workload is memory-intensive and also knowing that an RDMA
> migration is going to be issued.

"zero page detection" just cries out "implementation specific".

There's very little chance e.g. a different algorithm will have exactly
same performance tradeoffs. So we change some qemu internals and
suddenly your management carefully tuned for your workload is making all
the wrong decisions.



>
> There's no way for QEMU to know that in advance without some kind
> of advanced heuristic that tracks the behavior of the VM over time,
> which I don't think anybody wants to get into the business of writing =)

There's even less chance a management tool will make an
intelligent decision here. It's too tied to QEMU internals.

> >>>>+
> >>>>+SEND messages require more coordination because the
> >>>>+receiver must have reserved space (using a receive
> >>>>+work request) on the receive queue (RQ) before QEMUFileRDMA
> >>>>+can start using them to carry all the bytes as
> >>>>+a transport for migration of device state.
> >>>>+
> >>>>+To begin the migration, the initial connection setup is
> >>>>+as follows (migration-rdma.c):
> >>>>+
> >>>>+1. Receiver and Sender are started (command line or libvirt):
> >>>>+2. Both sides post two RQ work requests
> >>>Okay this could be where the problem is. This means with chunk
> >>>based receive side does:
> >>>
> >>>loop:
> >>>	receive request
> >>>	register
> >>>	send response
> >>>
> >>>while with non chunk based it does:
> >>>
> >>>receive request
> >>>send response
> >>>loop:
> >>>	register
> >>No, that's incorrect. With "non" chunk based, the receive side does
> >>*not* communicate
> >>during the migration of pc.ram.
> >It does not matter when this happens. What we care about is downtime and
> >total time from start of qemu on remote and until migration completes.
> >Not peak throughput.
> >If you don't count registration time on remote, that's just wrong.
> 
> Answer above.


I don't see it above.
> 
> >>The control channel is only used for chunk registration and device
> >>state, not RAM.
> >>
> >>I will update the documentation to make that more clear.
> >It's clear enough I think. But it seems you are measuring
> >the wrong things.
> >
> >>>In reality each request/response requires two network round-trips
> >>>with the Ready credit-management messages.
> >>>So the overhead will likely be avoided if we add better pipelining:
> >>>allow multiple registration requests in the air, and add more
> >>>send/receive credits so the overhead of credit management can be
> >>>reduced.
> >>Unfortunately, the migration thread doesn't work that way.
> >>The thread only generates one page write at a time.
> >Yes but you do not have to block it. Each page is in these states:
> >	- unpinned not sent
> >	- pinned no rkey
> >	- pinned have rkey
> >	- unpinned sent
> >
> >Each time you get a new page, it's in unpinned not sent state.
> >So you can start it on this state machine, and tell migration thread
> >to proceed to the next page.
> 
> Yes, I'm doing that already (documented as "batching") in the
> docs file.

All I see is a scheme to reduce the number of transmit completions.
This only gives a marginal gain.  E.g. you explicitly say there's a
single command in the air so another registration request can not even
start until you get a registration response.

> But the problem is more complicated than that: there is no coordination
> between the migration_thread and RDMA right now because Paolo is
> trying to maintain a very clean separation of function.
> 
> However we *can* do what you described in a future patch like this:
> 
> 1. Migration thread says "iteration starts, how much memory is dirty?"
> 2. RDMA protocol says "Is there a lot of dirty memory?"
>         OK, yes? Then batch all the registration messages into a single
>         request, but do not write the memory until all the registrations
>         have completed.
>
>         OK, no? Then just issue registrations with very little batching
>         so that we can quickly move on to the next iteration round.
> 
> Make sense?

Actually, I think you just need to get a page from migration core and
give it to the FSM above.  Then let it give you another page, until you
have N pages in flight in the FSM all at different stages in the
pipeline.  That's the theory.

But if you want to try changing management core, go wild.  Very little
is written in stone here.

> >>If someone were to write a patch which submits multiple
> >>writes at the same time, I would be very interested in
> >>consuming that feature and making chunk registration more
> >>efficient by batching multiple registrations into fewer messages.
> >No changes to migration core is necessary I think.
> >But assuming they are - your protocol design and
> >management API should not be driven by internal qemu APIs.
> 
> Answer above.
> 
> >>>There's no requirement to implement these optimizations upfront
> >>>before merging the first version, but let's remove the
> >>>non-chunkbased crutch unless we see it as absolutely necessary.
> >>>
> >>>>+3. Receiver does listen()
> >>>>+4. Sender does connect()
> >>>>+5. Receiver accept()
> >>>>+6. Check versioning and capabilities (described later)
> >>>>+
> >>>>+At this point, we define a control channel on top of SEND messages
> >>>>+which is described by a formal protocol. Each SEND message has a
> >>>>+header portion and a data portion (but together are transmitted
> >>>>+as a single SEND message).
> >>>>+
> >>>>+Header:
> >>>>+    * Length  (of the data portion)
> >>>>+    * Type    (what command to perform, described below)
> >>>>+    * Version (protocol version validated before send/recv occurs)
> >>>What's the expected value for Version field?
> >>>Also, confusing.  Below mentions using private field in librdmacm instead?
> >>>Need to add # of bytes and endian-ness of each field.
> >>Correct, those are two separate versions. One for capability negotiation
> >>and one for the protocol itself.
> >>
> >>I will update the documentation.
> >Just drop the all-pinned version, and we'll work to improve
> >the chunk-based one until it has reasonable performance.
> >It seems to get a decent speed already: consider that
> >most people run migration with the default speed limit.
> >Supporting all-pinned will just be a pain down the road when
> >we fix performance for chunk based one.
> >
> 
> The speed tops out at 6gbps, that's not good enough for a 40gbps link.
> 
> The migration could complete *much* faster by disabling chunk registration.
> 
> We have very large physical machines, where chunk registration is not as
> important as migrating the workload very quickly with very little downtime.
> 
> In these cases, chunk registration just "gets in the way".

Well IMO you give up too early.

It gets in the way because you are not doing data transfers while
you are doing registration. You are doing it by chunks on the
source, and the source is much busier: it needs to find dirty pages,
and it needs to run VCPUs. Surely remote which is mostly idle should
be able to keep up with the demand.

Just fix the protocol so the control latency is less of the problem.


> >>>>+
> >>>>+The 'type' field has 7 different command values:
> >>>0. Unused.
> >>>
> >>>>+    1. None
> >>>you mean this is unused?
> >>Correct - will update.
> >>
> >>>>+    2. Ready             (control-channel is available)
> >>>>+    3. QEMU File         (for sending non-live device state)
> >>>>+    4. RAM Blocks        (used right after connection setup)
> >>>>+    5. Register request  (dynamic chunk registration)
> >>>>+    6. Register result   ('rkey' to be used by sender)
> >>>Hmm, don't you also need a virtual address for RDMA writes?
> >>>
> >>The virtual addresses are communicated at the beginning of the
> >>migration using command #4 "Ram blocks".
> >Yes but ram blocks are sent source to dest.
> >virtual address needs to be sent dest to source no?
> 
> I just said that, no? =)

You didn't previously.

> >>
> >>There is only one capability right now: dynamic server registration.
> >>
> >>The client must tell the server whether or not the capability was
> >>enabled or not on the primary VM side.
> >>
> >>Will update the documentation.
> >Cool, best add an exact structure format.
> 
> Acknowledged.
> 
> >>>>+
> >>>>+Main memory is not migrated with the aforementioned protocol,
> >>>>+but is instead migrated with normal RDMA Write operations.
> >>>>+
> >>>>+Pages are migrated in "chunks" (about 1 Megabyte right now).
> >>>Why "about"? This is not dynamic so needs to be exactly same
> >>>on both sides, right?
> >>About is a typo =). It is hard-coded to exactly 1MB.
> >This, by the way, is something management *may* want to control.
> 
> Acknowledged.
> 
> >>>>+Chunk size is not dynamic, but it could be in a future implementation.
> >>>>+There's nothing to indicate that this is useful right now.
> >>>>+
> >>>>+When a chunk is full (or a flush() occurs), the memory backed by
> >>>>+the chunk is registered with librdmacm and pinned in memory on
> >>>>+both sides using the aforementioned protocol.
> >>>>+
> >>>>+After pinning, an RDMA Write is generated and transmitted
> >>>>+for the entire chunk.
> >>>>+
> >>>>+Chunks are also transmitted in batches: This means that we
> >>>>+do not request that the hardware signal the completion queue
> >>>>+for the completion of *every* chunk. The current batch size
> >>>>+is about 64 chunks (corresponding to 64 MB of memory).
> >>>>+Only the last chunk in a batch must be signaled.
> >>>>+This helps keep everything as asynchronous as possible
> >>>>+and helps keep the hardware busy performing RDMA operations.
> >>>>+
> >>>>+Error-handling:
> >>>>+===============================
> >>>>+
> >>>>+InfiniBand has what is called a "Reliable, Connected"
> >>>>+link (one of 4 choices). This is the mode
> >>>>+we use for RDMA migration.
> >>>>+
> >>>>+If a *single* message fails,
> >>>>+the decision is to abort the migration entirely and
> >>>>+clean up all the RDMA descriptors and unregister all
> >>>>+the memory.
> >>>>+
> >>>>+After cleanup, the Virtual Machine is returned to normal
> >>>>+operation the same way that would happen if the TCP
> >>>>+socket is broken during a non-RDMA based migration.
> >>>That's on sender side? Presumably this means you respond to
> >>>completion with error?
> >>>  How does receive side know
> >>>migration is complete?
> >>Yes, on the sender side.
> >>
> >>Migration "completeness" logic has not changed in this patch series.
> >>
> >>Please recall that the entire QEMUFile protocol is still
> >>happening at the upper-level inside of savevm.c/arch_init.c.
> >>
> >So basically receive side detects that migration is complete by
> >looking at the QEMUFile data?
> >
> 
> That's correct - same mechanism used by TCP.
> 


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-10 17:41           ` Michael S. Tsirkin
@ 2013-04-10 20:05             ` Michael R. Hines
  2013-04-11  7:19               ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-10 20:05 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/10/2013 01:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> Thanks.
>>>>
>>>> However, IMHO restricting the policy to only used chunk-based is really
>>>> not an acceptable choice:
>>>>
>>>> Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
>>>> dive from 10gbps to 6gbps.
>>> Who cares about the throughput really? What we do care about
>>> is how long the whole process takes.
>>>
>> Low latency and high throughput is very important =)
>>
>> Without these properties of RDMA, many workloads simply either
> >>take too long to finish migrating or do not converge to a stopping
>> point altogether.
>>
>> *Not making this a configurable option would defeat the purpose of
>> using RDMA altogether.
>>
>> Otherwise, you're no better off than just using TCP.
> So we have two protocols implemented: one is slow the other pins all
> memory on destination indefinitely.
>
> I see two options here:
> - improve the slow version so it's fast, drop the pin all version
> - give up and declare RDMA requires pinning all memory on destination
>
> But giving management a way to do RDMA at the speed of TCP? Why is this
> useful?

This is "useful" because of the overcommit concerns you brought up
before, which is the reason why I volunteered to write dynamic
server registration in the first place. We never required that overcommit
and performance had to coexist.

From prior experience, I don't believe overcommit and good performance
are compatible with each other in general (e.g. using compression,
page sharing, etc, etc.), but that's a debate for another day =)

I would like to propose a compromise:

How about we *keep* the registration capability and leave it enabled by 
default?

This gives management tools the ability to get performance if they want to,
but also satisfies your requirements in case management doesn't know the
feature exists - they will just get the default enabled?
>> But the problem is more complicated than that: there is no coordination
>> between the migration_thread and RDMA right now because Paolo is
>> trying to maintain a very clean separation of function.
>>
>> However we *can* do what you described in a future patch like this:
>>
>> 1. Migration thread says "iteration starts, how much memory is dirty?"
>> 2. RDMA protocol says "Is there a lot of dirty memory?"
>>          OK, yes? Then batch all the registration messages into a single
>>          request, but do not write the memory until all the registrations
>>          have completed.
>>
>>          OK, no? Then just issue registrations with very little batching
>>          so that we can quickly move on to the next iteration round.
>>
>> Make sense?
> Actually, I think you just need to get a page from migration core and
> give it to the FSM above.  Then let it give you another page, until you
> have N pages in flight in the FSM all at different stages in the
> pipeline.  That's the theory.
>
> But if you want to try changing management core, go wild.  Very little
> is written in stone here.

The FSM and what I described are basically the same thing, I just
described it more abstractly than you did.

Either way, I agree that the optimization would be very useful,
but I disagree that it is possible for an optimized registration algorithm
to perform *as well as* the case when there is no dynamic registration 
at all.

The point is that dynamic registration *only* helps overcommitment.

It does nothing for performance - and since that's true any optimizations
that improve on dynamic registrations will always be sub-optimal to turning
off dynamic registration in the first place.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-10 20:05             ` Michael R. Hines
@ 2013-04-11  7:19               ` Michael S. Tsirkin
  2013-04-11 13:12                 ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11  7:19 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> On 04/10/2013 01:41 PM, Michael S. Tsirkin wrote:
> >>>>
> >>>>Thanks.
> >>>>
> >>>>However, IMHO restricting the policy to only used chunk-based is really
> >>>>not an acceptable choice:
> >>>>
> >>>>Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
> >>>>dive from 10gbps to 6gbps.
> >>>Who cares about the throughput really? What we do care about
> >>>is how long the whole process takes.
> >>>
> >>Low latency and high throughput is very important =)
> >>
> >>Without these properties of RDMA, many workloads simply either
> >>take too long to finish migrating or do not converge to a stopping
> >>point altogether.
> >>
> >>*Not making this a configurable option would defeat the purpose of
> >>using RDMA altogether.
> >>
> >>Otherwise, you're no better off than just using TCP.
> >So we have two protocols implemented: one is slow the other pins all
> >memory on destination indefinitely.
> >
> >I see two options here:
> >- improve the slow version so it's fast, drop the pin all version
> >- give up and declare RDMA requires pinning all memory on destination
> >
> >But giving management a way to do RDMA at the speed of TCP? Why is this
> >useful?
> 
> This is "useful" because of the overcommit concerns you brought up
> before, which is the reason why I volunteered to write dynamic
> server registration in the first place. We never required that overcommit
> and performance had to coexist.
> 
> From prior experience, I don't believe overcommit and good performance
> are compatible with each other in general (e.g. using compression,
> page sharing, etc, etc.), but that's a debate for another day =)

Maybe we should just say "RDMA is incompatible with memory overcommit"
and be done with it then. But see below.

> I would like to propose a compromise:
> 
> How about we *keep* the registration capability and leave it enabled
> by default?
> 
> This gives management tools the ability to get performance if they want to,
> but also satisfies your requirements in case management doesn't know the
> feature exists - they will just get the default enabled?

Well unfortunately the "overcommit" feature as implemented seems useless
really.  Someone wants to migrate with RDMA but with low performance?
Why not migrate with TCP then?

> >>But the problem is more complicated than that: there is no coordination
> >>between the migration_thread and RDMA right now because Paolo is
> >>trying to maintain a very clean separation of function.
> >>
> >>However we *can* do what you described in a future patch like this:
> >>
> >>1. Migration thread says "iteration starts, how much memory is dirty?"
> >>2. RDMA protocol says "Is there a lot of dirty memory?"
> >>         OK, yes? Then batch all the registration messages into a single
> >>         request, but do not write the memory until all the registrations
> >>         have completed.
> >>
> >>         OK, no? Then just issue registrations with very little batching
> >>         so that we can quickly move on to the next iteration round.
> >>
> >>Make sense?
> >Actually, I think you just need to get a page from migration core and
> >give it to the FSM above.  Then let it give you another page, until you
> >have N pages in flight in the FSM all at different stages in the
> >pipeline.  That's the theory.
> >
> >But if you want to try changing management core, go wild.  Very little
> >is written in stone here.
> 
> The FSM and what I described are basically the same thing, I just
> described it more abstractly than you did.

Yes but I'm saying it can be part of RDMA code, no strict need to
change anything else.

> Either way, I agree that the optimization would be very useful,
> but I disagree that it is possible for an optimized registration algorithm
> to perform *as well as* the case when there is no dynamic
> registration at all.
> 
> The point is that dynamic registration *only* helps overcommitment.
> 
> It does nothing for performance - and since that's true any optimizations
> that improve on dynamic registrations will always be sub-optimal to turning
> off dynamic registration in the first place.
> 
> - Michael

So you've given up on it.  Question is, sub-optimal by how much?  And
where's the bottleneck?

Let's do some math. Assume you send a 16-byte registration request and
get back a 16-byte response for each 4-Kbyte page (are 16 bytes enough?).  That's
32/4096 < 1% transport overhead. Negligible.
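That overhead estimate can be checked directly; the 16-byte request/response sizes are the assumption from the paragraph above, not measured values:

```python
# Per-page control-channel overhead of dynamic registration, using the
# assumed message sizes: one 16-byte request plus one 16-byte response
# per 4-Kbyte page.
PAGE_SIZE = 4096
REQ_BYTES = 16
RES_BYTES = 16

overhead = (REQ_BYTES + RES_BYTES) / PAGE_SIZE
print(f"{overhead:.2%}")  # 0.78% -- well under 1%
```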

Is it the source CPU then? But the CPU on the source is basically doing the same
things as with pre-registration: you do not pin all memory on the source.

So it must be the destination CPU that does not keep up then?
But it has to do even less than the source CPU.

I suggest one explanation: the protocol you proposed is inefficient.
It seems to basically do everything in a single thread:
get a chunk, pin, wait for control credit, request, response, rdma, unpin.
There are two round-trips of send/receive here where you are not
doing anything useful. Why not let migration proceed?

Doesn't all of this sound worth checking before we give up?
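A toy timing model (with made-up numbers, not measurements) shows why the serialized sequence hurts: the two control round-trips sit on the critical path of every chunk, so round-trip latency adds directly to migration time instead of being hidden.

```python
# Serialized per-chunk flow: wait for credit (1 round-trip), then
# register and wait for the response (1 round-trip), then RDMA-write.
# Nothing overlaps, so each chunk pays both round-trips in full.

def serialized_time(chunks, rtt, write_time):
    return chunks * (2 * rtt + write_time)

# Illustrative numbers: 100 chunks, 10 us control round-trip,
# 40 us RDMA write per chunk.
total = serialized_time(100, 10e-6, 40e-6)
print(total)  # ~0.006 s, of which a third is pure round-trip stall
```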

-- 
MST

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11  7:19               ` Michael S. Tsirkin
@ 2013-04-11 13:12                 ` Michael R. Hines
  2013-04-11 13:48                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 13:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> Maybe we should just say "RDMA is incompatible with memory overcommit" 
> and be done with it then. But see below.
>> I would like to propose a compromise:
>>
>> How about we *keep* the registration capability and leave it enabled
>> by default?
>>
>> This gives management tools the ability to get performance if they want to,
>> but also satisfies your requirements in case management doesn't know the
>> feature exists - they will just get the default enabled?
> Well unfortunately the "overcommit" feature as implemented seems useless
> really.  Someone wants to migrate with RDMA but with low performance?
> Why not migrate with TCP then?

Answer below.

>> Either way, I agree that the optimization would be very useful,
>> but I disagree that it is possible for an optimized registration algorithm
>> to perform *as well as* the case when there is no dynamic
>> registration at all.
>>
>> The point is that dynamic registration *only* helps overcommitment.
>>
>> It does nothing for performance - and since that's true any optimizations
>> that improve on dynamic registrations will always be sub-optimal to turning
>> off dynamic registration in the first place.
>>
>> - Michael
> So you've given up on it.  Question is, sub-optimal by how much?  And
> where's the bottleneck?
>
> Let's do some math. Assume you send 16 bytes registration request and
> get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
> 32/4096 < 1% transport overhead. Negligeable.
>
> Is it the source CPU then? But CPU on source is basically doing same
> things as with pre-registration: you do not pin all memory on source.
>
> So it must be the destination CPU that does not keep up then?
> But it has to do even less than the source CPU.
>
> I suggest one explanation: the protocol you proposed is inefficient.
> It seems to basically do everything in a single thread:
> get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> There are two round-trips of send/receive here where you are not
> going anything useful. Why not let migration proceed?
>
> Doesn't all of this sound worth checking before we give up?
>
First, let me remind you:

Chunks are already doing this!

Perhaps you don't fully understand how chunks work, or perhaps I should
be more verbose in the documentation. The protocol is already joining
multiple pages into a single chunk without issuing any writes. It is
only when the chunk is full that an actual page registration request
occurs.
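That accumulate-then-register behavior can be modeled in a few lines (a toy sketch, not the patch's actual data structures; the chunk size is illustrative):

```python
# Toy model of chunked registration: pages accumulate into the current
# chunk and one (simulated) registration request is issued only when
# the chunk fills.

CHUNK_PAGES = 256                  # pages per chunk, illustrative

class ChunkedSender:
    def __init__(self):
        self.current = []
        self.registrations = 0     # registration requests issued

    def queue_page(self, page):
        self.current.append(page)
        if len(self.current) == CHUNK_PAGES:
            self.registrations += 1
            self.current.clear()

sender = ChunkedSender()
for page in range(1024):           # 1024 dirty pages
    sender.queue_page(page)
print(sender.registrations)  # 4 requests, not 1024
```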

So, basically what you want to know is what happens if we *change* the
chunk size dynamically?

Something like this:

1. Chunk = 1MB, what is the performance?
2. Chunk = 2MB, what is the performance?
3. Chunk = 4MB, what is the performance?
4. Chunk = 8MB, what is the performance?
5. Chunk = 16MB, what is the performance?
6. Chunk = 32MB, what is the performance?
7. Chunk = 64MB, what is the performance?
8. Chunk = 128MB, what is the performance?

I'll get you this table today. Expect an email soon.

- Michael

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 13:12                 ` Michael R. Hines
@ 2013-04-11 13:48                   ` Michael S. Tsirkin
  2013-04-11 13:58                     ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 13:48 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >Maybe we should just say "RDMA is incompatible with memory
> >overcommit" and be done with it then. But see below.
> >>I would like to propose a compromise:
> >>
> >>How about we *keep* the registration capability and leave it enabled
> >>by default?
> >>
> >>This gives management tools the ability to get performance if they want to,
> >>but also satisfies your requirements in case management doesn't know the
> >>feature exists - they will just get the default enabled?
> >Well unfortunately the "overcommit" feature as implemented seems useless
> >really.  Someone wants to migrate with RDMA but with low performance?
> >Why not migrate with TCP then?
> 
> Answer below.
> 
> >>Either way, I agree that the optimization would be very useful,
> >>but I disagree that it is possible for an optimized registration algorithm
> >>to perform *as well as* the case when there is no dynamic
> >>registration at all.
> >>
> >>The point is that dynamic registration *only* helps overcommitment.
> >>
> >>It does nothing for performance - and since that's true any optimizations
> >>that improve on dynamic registrations will always be sub-optimal to turning
> >>off dynamic registration in the first place.
> >>
> >>- Michael
> >So you've given up on it.  Question is, sub-optimal by how much?  And
> >where's the bottleneck?
> >
> >Let's do some math. Assume you send 16 bytes registration request and
> >get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
> >32/4096 < 1% transport overhead. Negligeable.
> >
> >Is it the source CPU then? But CPU on source is basically doing same
> >things as with pre-registration: you do not pin all memory on source.
> >
> >So it must be the destination CPU that does not keep up then?
> >But it has to do even less than the source CPU.
> >
> >I suggest one explanation: the protocol you proposed is inefficient.
> >It seems to basically do everything in a single thread:
> >get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> >There are two round-trips of send/receive here where you are not
> >going anything useful. Why not let migration proceed?
> >
> >Doesn't all of this sound worth checking before we give up?
> >
> First, let me remind you:
> 
> Chunks are already doing this!
> 
> Perhaps you don't fully understand how chunks work or perhaps I
> should be more verbose
> in the documentation. The protocol is already joining multiple pages into a
> single chunk without issuing any writes. It is only until the chunk
> is full that an
> actual page registration request occurs.

I think I got that at a high level.
But there is a stall between chunks. If you make chunks smaller,
but pipeline registration, then there will never be any stall.

> So, basically what you want to know is what happens if we *change*
> the chunk size
> dynamically?

What I wanted to know is where the performance is going.
Why is chunk-based slower? It's not the extra messages
on the wire; these take up negligible BW.

> Something like this:
> 
> 1. Chunk = 1MB, what is the performance?
> 2. Chunk = 2MB, what is the performance?
> 3. Chunk = 4MB, what is the performance?
> 4. Chunk = 8MB, what is the performance?
> 5. Chunk = 16MB, what is the performance?
> 6. Chunk = 32MB, what is the performance?
> 7. Chunk = 64MB, what is the performance?
> 8. Chunk = 128MB, what is the performance?
> 
> I'll get you a this table today. Expect an email soon.
> 
> - Michael

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 13:48                   ` Michael S. Tsirkin
@ 2013-04-11 13:58                     ` Michael R. Hines
  2013-04-11 14:37                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 13:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
>> On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
>>> On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
>>> Maybe we should just say "RDMA is incompatible with memory
>>> overcommit" and be done with it then. But see below.
>>>> I would like to propose a compromise:
>>>>
>>>> How about we *keep* the registration capability and leave it enabled
>>>> by default?
>>>>
>>>> This gives management tools the ability to get performance if they want to,
>>>> but also satisfies your requirements in case management doesn't know the
>>>> feature exists - they will just get the default enabled?
>>> Well unfortunately the "overcommit" feature as implemented seems useless
>>> really.  Someone wants to migrate with RDMA but with low performance?
>>> Why not migrate with TCP then?
>> Answer below.
>>
>>>> Either way, I agree that the optimization would be very useful,
>>>> but I disagree that it is possible for an optimized registration algorithm
>>>> to perform *as well as* the case when there is no dynamic
>>>> registration at all.
>>>>
>>>> The point is that dynamic registration *only* helps overcommitment.
>>>>
>>>> It does nothing for performance - and since that's true any optimizations
>>>> that improve on dynamic registrations will always be sub-optimal to turning
>>>> off dynamic registration in the first place.
>>>>
>>>> - Michael
>>> So you've given up on it.  Question is, sub-optimal by how much?  And
>>> where's the bottleneck?
>>>
>>> Let's do some math. Assume you send 16 bytes registration request and
>>> get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
>>> 32/4096 < 1% transport overhead. Negligeable.
>>>
>>> Is it the source CPU then? But CPU on source is basically doing same
>>> things as with pre-registration: you do not pin all memory on source.
>>>
>>> So it must be the destination CPU that does not keep up then?
>>> But it has to do even less than the source CPU.
>>>
>>> I suggest one explanation: the protocol you proposed is inefficient.
>>> It seems to basically do everything in a single thread:
>>> get a chunk,pin,wait for control credit,request,response,rdma,unpin,
>>> There are two round-trips of send/receive here where you are not
>>> going anything useful. Why not let migration proceed?
>>>
>>> Doesn't all of this sound worth checking before we give up?
>>>
>> First, let me remind you:
>>
>> Chunks are already doing this!
>>
>> Perhaps you don't fully understand how chunks work or perhaps I
>> should be more verbose
>> in the documentation. The protocol is already joining multiple pages into a
>> single chunk without issuing any writes. It is only until the chunk
>> is full that an
>> actual page registration request occurs.
> I think I got that at a high level.
> But there is a stall between chunks. If you make chunks smaller,
> but pipeline registration, then there will never be any stall.

Pipelining == chunking. You cannot eliminate the stall;
that's impossible.

You can *grow* the chunk size (i.e. the pipeline)
to amortize the cost of the stall, but you cannot eliminate
the stall at the end of the pipeline.

At some point you have to flush the pipeline (i.e. the chunk),
whether you like it or not.
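The amortization point can be made concrete. If each flush costs a fixed stall S and a chunk of C bytes takes C/B to transfer at bandwidth B, the stalled fraction is S / (S + C/B): it shrinks as the chunk grows but never reaches zero. A sketch with made-up numbers:

```python
# Fraction of time lost to the per-chunk flush stall, for a fixed
# stall per flush and link bandwidth (both illustrative).

def stall_fraction(chunk_bytes, stall=20e-6, bandwidth=5e9):
    transfer = chunk_bytes / bandwidth
    return stall / (stall + transfer)

for mb in (1, 8, 64, 128):
    print(f"{mb:4d} MB chunk: {stall_fraction(mb * 2**20):.2%} stalled")
```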


>> So, basically what you want to know is what happens if we *change*
>> the chunk size
>> dynamically?
> What I wanted to know is where is performance going?
> Why is chunk based slower? It's not the extra messages,
> on the wire, these take up negligeable BW.

Answer above.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 13:58                     ` Michael R. Hines
@ 2013-04-11 14:37                       ` Michael S. Tsirkin
  2013-04-11 14:50                         ` Paolo Bonzini
  2013-04-11 15:18                         ` Michael R. Hines
  0 siblings, 2 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 14:37 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Thu, Apr 11, 2013 at 09:58:50AM -0400, Michael R. Hines wrote:
> On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> >>On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >>>On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >>>Maybe we should just say "RDMA is incompatible with memory
> >>>overcommit" and be done with it then. But see below.
> >>>>I would like to propose a compromise:
> >>>>
> >>>>How about we *keep* the registration capability and leave it enabled
> >>>>by default?
> >>>>
> >>>>This gives management tools the ability to get performance if they want to,
> >>>>but also satisfies your requirements in case management doesn't know the
> >>>>feature exists - they will just get the default enabled?
> >>>Well unfortunately the "overcommit" feature as implemented seems useless
> >>>really.  Someone wants to migrate with RDMA but with low performance?
> >>>Why not migrate with TCP then?
> >>Answer below.
> >>
> >>>>Either way, I agree that the optimization would be very useful,
> >>>>but I disagree that it is possible for an optimized registration algorithm
> >>>>to perform *as well as* the case when there is no dynamic
> >>>>registration at all.
> >>>>
> >>>>The point is that dynamic registration *only* helps overcommitment.
> >>>>
> >>>>It does nothing for performance - and since that's true any optimizations
> >>>>that improve on dynamic registrations will always be sub-optimal to turning
> >>>>off dynamic registration in the first place.
> >>>>
> >>>>- Michael
> >>>So you've given up on it.  Question is, sub-optimal by how much?  And
> >>>where's the bottleneck?
> >>>
> >>>Let's do some math. Assume you send 16 bytes registration request and
> >>>get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
> >>>32/4096 < 1% transport overhead. Negligeable.
> >>>
> >>>Is it the source CPU then? But CPU on source is basically doing same
> >>>things as with pre-registration: you do not pin all memory on source.
> >>>
> >>>So it must be the destination CPU that does not keep up then?
> >>>But it has to do even less than the source CPU.
> >>>
> >>>I suggest one explanation: the protocol you proposed is inefficient.
> >>>It seems to basically do everything in a single thread:
> >>>get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> >>>There are two round-trips of send/receive here where you are not
> >>>going anything useful. Why not let migration proceed?
> >>>
> >>>Doesn't all of this sound worth checking before we give up?
> >>>
> >>First, let me remind you:
> >>
> >>Chunks are already doing this!
> >>
> >>Perhaps you don't fully understand how chunks work or perhaps I
> >>should be more verbose
> >>in the documentation. The protocol is already joining multiple pages into a
> >>single chunk without issuing any writes. It is only until the chunk
> >>is full that an
> >>actual page registration request occurs.
> >I think I got that at a high level.
> >But there is a stall between chunks. If you make chunks smaller,
> >but pipeline registration, then there will never be any stall.
> 
> Pipelineing == chunking.

pipelining:
https://en.wikipedia.org/wiki/Pipeline_%28computing%29
chunking:
https://en.wikipedia.org/wiki/Chunking_%28computing%29

> You cannot eliminate the stall,
> that's impossible.

Sure, you can eliminate the stalls. Just hide them
behind data transfers. See the diagram below.


> You can *grow* the chunk size (i.e. the pipeline)
> to amortize the cost of the stall, but you cannot eliminate
> the stall at the end of the pipeline.
> 
> At some point you have to flush the pipeline (i.e. the chunk),
> whether you like it or not.

You can process many chunks in parallel: make chunks smaller but process
them in a pipelined fashion.  Yes, the pipe might stall, but it won't if
the receive side is as fast as the send side; then you won't have to
flush at all.


> >>So, basically what you want to know is what happens if we *change*
> >>the chunk size
> >>dynamically?
> >What I wanted to know is where is performance going?
> >Why is chunk based slower? It's not the extra messages,
> >on the wire, these take up negligeable BW.
> 
> Answer above.


Here's how things are supposed to work in a pipeline:

req -> registration request
res -> response
done -> rdma done notification (remote can unregister)
pgX  -> page, or chunk, or whatever unit is used
        for registration
rdma -> one or more rdma write requests



pg1 ->  pin -> req -> res -> rdma -> done
        pg2 ->  pin -> req -> res -> rdma -> done
                pg3 -> pin -> req -> res -> rdma -> done
                       pg4 -> pin -> req -> res -> rdma -> done
                              pg5 -> pin -> req -> res -> rdma -> done



It's like an assembly line, see?  So while software does the registration
round-trip dance, hardware is processing rdma requests for previous
chunks.
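A toy timing model of the diagram (illustrative numbers only): once the registration round-trip is overlapped against earlier chunks' writes, steady-state cost per chunk is bounded by the slower stage rather than by the sum of the stages.

```python
# Per-chunk cost: one registration round-trip (rtt) plus one RDMA
# write (write_time).

def no_overlap(chunks, rtt, write_time):
    # Each chunk waits out its own round-trip before writing.
    return chunks * (rtt + write_time)

def assembly_line(chunks, rtt, write_time):
    # First chunk fills the pipe; after that, registrations overlap
    # earlier writes, so each chunk costs only the slower stage.
    return rtt + chunks * max(rtt, write_time)

rtt, wt = 10e-6, 40e-6            # made-up latencies
print(no_overlap(100, rtt, wt), assembly_line(100, rtt, wt))
```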

....

When do you have to stall? When you run out of rx buffer credits, so you
cannot start a new req.  Your protocol has 2 outstanding buffers,
so you can only have one req in the air. Do more and
you will not need to stall - possibly not at all.

One other minor point is that your protocol requires extra explicit
ready commands. You can pass the number of rx buffers as extra payload
in the traffic you are sending anyway, and reduce that overhead.
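A sketch of the credit idea (class and field names are illustrative, not from the patch): each control message consumes one receive-buffer credit, and responses piggyback the receiver's refreshed buffer count so the sender rarely runs dry.

```python
# Sender-side credit accounting for control messages.  The receiver
# grants some number of rx buffers; each send consumes one credit, and
# a response can piggyback the receiver's refreshed buffer count.

class CreditedChannel:
    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.stalls = 0

    def send(self, piggybacked_credits=0):
        if self.credits == 0:
            self.stalls += 1            # would block on a ready command
        else:
            self.credits -= 1
        self.credits = max(self.credits, piggybacked_credits)

two_buffers = CreditedChannel(2)        # roughly: one req in the air
piggybacked = CreditedChannel(2)        # credits refreshed by responses
for _ in range(8):
    two_buffers.send()
    piggybacked.send(piggybacked_credits=2)
print(two_buffers.stalls, piggybacked.stalls)  # 6 0
```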

-- 
MST

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 14:37                       ` Michael S. Tsirkin
@ 2013-04-11 14:50                         ` Paolo Bonzini
  2013-04-11 14:56                           ` Michael S. Tsirkin
  2013-04-11 15:01                           ` Michael R. Hines
  2013-04-11 15:18                         ` Michael R. Hines
  1 sibling, 2 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-11 14:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> 
> pg1 ->  pin -> req -> res -> rdma -> done
>         pg2 ->  pin -> req -> res -> rdma -> done
>                 pg3 -> pin -> req -> res -> rdma -> done
>                        pg4 -> pin -> req -> res -> rdma -> done
>                               pg4 -> pin -> req -> res -> rdma -> done
> 
> It's like a assembly line see?  So while software does the registration
> roundtrip dance, hardware is processing rdma requests for previous
> chunks.

Does this only affect the implementation, or also the wire protocol?
Does the destination have to be aware that the source is doing pipelining?

Paolo

> 
> ....
> 
> When do you have to stall? when you run out of rx buffer credits so you
> can not start a new req.  Your protocol has 2 outstanding buffers,
> so you can only have one req in the air. Do more and
> you will not need to stall - possibly at all.
> 
> One other minor point is that your protocol requires extra explicit
> ready commands. You can pass the number of rx buffers as extra payload
> in the traffic you are sending anyway, and reduce that overhead.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 14:50                         ` Paolo Bonzini
@ 2013-04-11 14:56                           ` Michael S. Tsirkin
  2013-04-11 17:49                             ` Michael R. Hines
  2013-04-11 15:01                           ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 14:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> > 
> > pg1 ->  pin -> req -> res -> rdma -> done
> >         pg2 ->  pin -> req -> res -> rdma -> done
> >                 pg3 -> pin -> req -> res -> rdma -> done
> >                        pg4 -> pin -> req -> res -> rdma -> done
> >                               pg4 -> pin -> req -> res -> rdma -> done
> > 
> > It's like a assembly line see?  So while software does the registration
> > roundtrip dance, hardware is processing rdma requests for previous
> > chunks.
> 
> Does this only affects the implementation, or also the wire protocol?

It affects the wire protocol.

> Does the destination have to be aware that the source is doing pipelining?
> 
> Paolo

Yes. At the moment the protocol assumption is that there's only one
outstanding command on the control queue.  So the destination has to
prequeue multiple buffers on the hardware receive queue, and keep the source
updated about the number of available buffers. Preferably it should do
this using existing responses; maybe a separate ready command
is enough - this needs some thought, since a separate command
consumes buffers itself.

> > 
> > ....
> > 
> > When do you have to stall? when you run out of rx buffer credits so you
> > can not start a new req.  Your protocol has 2 outstanding buffers,
> > so you can only have one req in the air. Do more and
> > you will not need to stall - possibly at all.
> > 
> > One other minor point is that your protocol requires extra explicit
> > ready commands. You can pass the number of rx buffers as extra payload
> > in the traffic you are sending anyway, and reduce that overhead.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 14:50                         ` Paolo Bonzini
  2013-04-11 14:56                           ` Michael S. Tsirkin
@ 2013-04-11 15:01                           ` Michael R. Hines
  1 sibling, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 15:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

You cannot write data in the pipeline because you do not have the
permissions to do so until the registrations in the pipeline have
completed and been received by the primary VM.

On 04/11/2013 10:50 AM, Paolo Bonzini wrote:
> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>> pg1 ->  pin -> req -> res -> rdma -> done
>>          pg2 ->  pin -> req -> res -> rdma -> done
>>                  pg3 -> pin -> req -> res -> rdma -> done
>>                         pg4 -> pin -> req -> res -> rdma -> done
>>                                pg4 -> pin -> req -> res -> rdma -> done
>>
>> It's like a assembly line see?  So while software does the registration
>> roundtrip dance, hardware is processing rdma requests for previous
>> chunks.
> Does this only affects the implementation, or also the wire protocol?
> Does the destination have to be aware that the source is doing pipelining?
>
> Paolo

Yes, the destination has to be aware. The destination has to acknowledge
all of the registrations in the pipeline *and* the primary-VM has to block
until all the registrations in the pipeline have been received.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 14:37                       ` Michael S. Tsirkin
  2013-04-11 14:50                         ` Paolo Bonzini
@ 2013-04-11 15:18                         ` Michael R. Hines
  2013-04-11 15:33                           ` Paolo Bonzini
  2013-04-11 15:44                           ` Michael S. Tsirkin
  1 sibling, 2 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 15:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

First of all, this whole argument should not even exist for the 
following reason:

Page registrations are supposed to be *rare* - once a page is registered, it
is registered for life. There is nothing in the design that says a page must
be "unregistered" and I do not believe anybody is proposing that.
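The register-for-life behavior amounts to a write-once cache: a chunk is registered the first time it is touched and the handle is reused on every later pass (a toy sketch; the rkey values are fake):

```python
# Register-for-life cache: the (fake) rkey for a chunk is created on
# first touch and reused forever after; repeated migration rounds pay
# no further registration cost.

class RegistrationCache:
    def __init__(self):
        self.rkeys = {}
        self.registrations = 0

    def rkey_for(self, chunk_id):
        if chunk_id not in self.rkeys:
            self.registrations += 1        # real code would register here
            self.rkeys[chunk_id] = 0x1000 + chunk_id
        return self.rkeys[chunk_id]

cache = RegistrationCache()
# Bulk round touches every chunk; later dirty rounds re-touch a few.
for round_chunks in ([0, 1, 2, 3], [1, 3], [3]):
    for chunk in round_chunks:
        cache.rkey_for(chunk)
print(cache.registrations)  # 4 -- all paid during the bulk round
```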

Second, this means that my previous analysis showing that performance
was reduced was also incorrect, because most of the RDMA transfers were
against pages during the bulk phase round, which incorrectly makes
dynamic page registration look bad. I should have done more testing
*after* the bulk phase round, and I apologize for not doing that.

Indeed, when I do such a test (with the 'stress' command) the cost of
page registration disappears, because most of the registrations have
already completed a long time ago.

Thanks, Paolo, for reminding us about the bulk-phase behavior to begin with.

Third, this means that optimizing this protocol would not be helpful and
that we should follow the "keep it simple" approach, because during the
steady-state phase of the migration most of the pages should have
already been registered.

- Michael


On 04/11/2013 10:37 AM, Michael S. Tsirkin wrote:
> Answer above.
>
> Here's how things are supposed to work in a pipeline:
>
> req -> registration request
> res -> response
> done -> rdma done notification (remote can unregister)
> pgX  -> page, or chunk, or whatever unit is used
>          for registration
> rdma -> one or more rdma write requests
>
>
>
> pg1 ->  pin -> req -> res -> rdma -> done
>          pg2 ->  pin -> req -> res -> rdma -> done
>                  pg3 -> pin -> req -> res -> rdma -> done
>                         pg4 -> pin -> req -> res -> rdma -> done
>                                pg4 -> pin -> req -> res -> rdma -> done
>
>
>
> It's like a assembly line see?  So while software does the registration
> roundtrip dance, hardware is processing rdma requests for previous
> chunks.
>
> ....
>
> When do you have to stall? when you run out of rx buffer credits so you
> can not start a new req.  Your protocol has 2 outstanding buffers,
> so you can only have one req in the air. Do more and
> you will not need to stall - possibly at all.
>
> One other minor point is that your protocol requires extra explicit
> ready commands. You can pass the number of rx buffers as extra payload
> in the traffic you are sending anyway, and reduce that overhead.
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:18                         ` Michael R. Hines
@ 2013-04-11 15:33                           ` Paolo Bonzini
  2013-04-11 15:46                             ` Michael S. Tsirkin
  2013-04-12  5:10                             ` Michael R. Hines
  2013-04-11 15:44                           ` Michael S. Tsirkin
  1 sibling, 2 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-11 15:33 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 17:18, Michael R. Hines ha scritto:
> First of all, this whole argument should not even exist for the 
> following reason:
> 
> Page registrations are supposed to be *rare* - once a page is 
> registered, it is registered for life.

Uh-oh.  That changes things a lot.  We do not even need to benchmark the
various chunk sizes.

> Third, this means that optimizing this protocol would not be helpful
> and that we should follow the "keep it simple" approach because
> during steady-state phase of the migration most of the pages should
> have already been registered.

Ok, let's keep it simple.  The only two things we need are:

1) remove the patch to disable is_dup_page

2) rename the transport to "x-rdma" (just in migration.c)

Both things together let us keep it safe for a release or two.  Let's
merge this thing.

Paolo

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:18                         ` Michael R. Hines
  2013-04-11 15:33                           ` Paolo Bonzini
@ 2013-04-11 15:44                           ` Michael S. Tsirkin
  2013-04-11 16:09                             ` Michael R. Hines
  2013-04-11 16:13                             ` Michael R. Hines
  1 sibling, 2 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 15:44 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
> First of all,

I know it's a hard habit to break, but could you
please stop top-posting?

> this whole argument should not even exist for the
> following reason:
> 
> Page registrations are supposed to be *rare* - once a page is registered, it
> is registered for life. There is nothing in the design that says a page must
> be "unregistered" and I do not believe anybody is proposing that.

Hmm, proposing what? Of course you need to unregister pages
eventually, otherwise your pinned memory on the destination
will just grow indefinitely. People often use
registration caches to help reduce the overhead,
but never unregistering seems too aggressive.

You mean the chunk-based thing just delays the agony
until all guest memory is pinned for RDMA anyway?
Wait, is it registered for life on the source too?

Well this kind of explains why qemu was dying on OOM,
doesn't it?
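For contrast, the registration-cache approach mentioned above bounds pinned memory by evicting (unregistering) cold chunks, at the price of occasional re-registration. A toy LRU sketch (capacity and chunk ids are illustrative):

```python
# LRU-capped pin cache: touching a chunk pins (registers) it; when the
# cap is hit, the coldest chunk is unpinned first, so pinned memory
# stays bounded instead of growing with the guest's footprint.
from collections import OrderedDict

class LruPinCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = OrderedDict()

    def touch(self, chunk):
        if chunk in self.pinned:
            self.pinned.move_to_end(chunk)     # recently used
            return
        if len(self.pinned) >= self.capacity:
            self.pinned.popitem(last=False)    # unregister coldest chunk
        self.pinned[chunk] = True              # register/pin this chunk

cache = LruPinCache(capacity=100)
for chunk in range(1000):
    cache.touch(chunk)
print(len(cache.pinned))  # 100: bounded, not 1000
```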

> Second, this means that my previous analysis showing that
> performance was reduced
> was also incorrect because most of the RDMA transfers were against
> pages during
> the bulk phase round, which incorrectly makes dynamic page
> registration look bad.
> I should have done more testing *after* the bulk phase round,
> and I apologize for not doing that.
> 
> Indeed when I do such a test (with the 'stress' command) the cost of
> page registration disappears
> because most of the registrations have already completed a long time ago.
> 
> Thanks, Paolo for reminding us about the bulk-phase behavior to being with.
> 
> Third, this means that optimizing this protocol would not be helpful
> and that we should
> follow the "keep it simple" approach because during steady-state
> phase of the migration
> most of the pages should have already been registered.
> 
> - Michael

If you mean that registering all memory is a requirement,
then I am not sure I agree: you wrote one slow protocol; this
does not mean that there can't be a fast one.

But if you mean to say that the current chunk based code
is useless, then I'd have to agree.

> 
> On 04/11/2013 10:37 AM, Michael S. Tsirkin wrote:
> >Answer above.
> >
> >Here's how things are supposed to work in a pipeline:
> >
> >req -> registration request
> >res -> response
> >done -> rdma done notification (remote can unregister)
> >pgX  -> page, or chunk, or whatever unit is used
> >         for registration
> >rdma -> one or more rdma write requests
> >
> >
> >
> >pg1 ->  pin -> req -> res -> rdma -> done
> >         pg2 ->  pin -> req -> res -> rdma -> done
> >                 pg3 -> pin -> req -> res -> rdma -> done
> >                        pg4 -> pin -> req -> res -> rdma -> done
> >                               pg4 -> pin -> req -> res -> rdma -> done
> >
> >
> >
> >It's like an assembly line, see?  So while software does the registration
> >roundtrip dance, hardware is processing rdma requests for previous
> >chunks.
> >
> >....
> >
> >When do you have to stall? when you run out of rx buffer credits so you
> >can not start a new req.  Your protocol has 2 outstanding buffers,
> >so you can only have one req in the air. Do more and
> >you will not need to stall - possibly at all.
> >
> >One other minor point is that your protocol requires extra explicit
> >ready commands. You can pass the number of rx buffers as extra payload
> >in the traffic you are sending anyway, and reduce that overhead.
> >
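
The credit scheme described in the quoted text above can be sketched
like this (illustrative Python, not QEMU code; `CreditedSender` and its
methods are hypothetical names):

```python
class CreditedSender:
    """Credit-based flow control for a control channel.

    Instead of explicit 'ready' commands, every response the receiver
    sends piggybacks how many rx buffers it currently has free.  The
    sender may only post a new request while it holds credits; with
    enough credits outstanding, it never needs to stall.
    """
    def __init__(self, initial_credits):
        self.credits = initial_credits

    def can_send(self):
        return self.credits > 0

    def send_request(self):
        # Posting a request consumes one rx buffer on the receiver.
        assert self.credits > 0, "out of rx credits: must stall"
        self.credits -= 1

    def on_response(self, piggybacked_free_buffers):
        # The credit update rides on traffic that flows anyway.
        self.credits = piggybacked_free_buffers
```

With only 2 outstanding buffers, as in the protocol being criticized,
the sender runs out of credits after a single in-flight request; a
larger credit pool is what allows the pipelined flow shown above.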

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:33                           ` Paolo Bonzini
@ 2013-04-11 15:46                             ` Michael S. Tsirkin
  2013-04-11 15:47                               ` Paolo Bonzini
  2013-04-12  5:10                             ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 15:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Thu, Apr 11, 2013 at 05:33:41PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 17:18, Michael R. Hines ha scritto:
> > First of all, this whole argument should not even exist for the 
> > following reason:
> > 
> > Page registrations are supposed to be *rare* - once a page is 
> > registered, it is registered for life.
> 
> Uh-oh.  That changes things a lot.  We do not even need to benchmark the
> various chunk sizes.
> 
> > Third, this means that optimizing this protocol would not be helpful
> > and that we should follow the "keep it simple" approach because
> > during steady-state phase of the migration most of the pages should
> > have already been registered.
> 
> Ok, let's keep it simple.  The only two things we need are:
> 
> 1) remove the patch to disable is_dup_page
> 
> 2) rename the transport to "x-rdma" (just in migration.c)
> 
> Both things together let us keep it safe for a release or two.  Let's
> merge this thing.
> 
> Paolo

I would drop the chunk based thing too.  Besides being slow, it turns
out that it pins all memory anyway. So no memory overcommit.

-- 
MST

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:46                             ` Michael S. Tsirkin
@ 2013-04-11 15:47                               ` Paolo Bonzini
  2013-04-11 15:58                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-11 15:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

Il 11/04/2013 17:46, Michael S. Tsirkin ha scritto:
> > Ok, let's keep it simple.  The only two things we need are:
> > 
> > 1) remove the patch to disable is_dup_page
> > 
> > 2) rename the transport to "x-rdma" (just in migration.c)
> > 
> > Both things together let us keep it safe for a release or two.  Let's
> > merge this thing.
> 
> I would drop the chunk based thing too.  Besides being slow, it turns
> out that it pins all memory anyway. So no memory overcommit.

It doesn't pin zero pages.  Those are never transmitted (it's a recent
change).  So pages that are ballooned at the beginning of migration, and
remain ballooned throughout, will never be pinned.
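
The zero-page check being discussed (QEMU's is_dup_page for the
all-zero case) amounts to something like the following sketch
(illustrative Python, not the actual QEMU implementation):

```python
def is_zero_page(page: bytes) -> bool:
    """A page containing no nonzero byte need not be transmitted,
    and therefore never needs to be registered or pinned."""
    return page.count(0) == len(page)

def pages_to_pin(pages):
    """Only non-zero pages are candidates for registration, which is
    why ballooned (zeroed) pages stay unpinned on the destination."""
    return [i for i, p in enumerate(pages) if not is_zero_page(p)]
```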

Paolo

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:47                               ` Paolo Bonzini
@ 2013-04-11 15:58                                 ` Michael S. Tsirkin
  2013-04-11 16:06                                   ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 15:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Thu, Apr 11, 2013 at 05:47:53PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 17:46, Michael S. Tsirkin ha scritto:
> > > Ok, let's keep it simple.  The only two things we need are:
> > > 
> > > 1) remove the patch to disable is_dup_page
> > > 
> > > 2) rename the transport to "x-rdma" (just in migration.c)
> > > 
> > > Both things together let us keep it safe for a release or two.  Let's
> > > merge this thing.
> > 
> > I would drop the chunk based thing too.  Besides being slow, it turns
> > out that it pins all memory anyway. So no memory overcommit.
> 
> It doesn't pin zero pages.  Those are never transmitted (it's a recent
> change).  So pages that are ballooned at the beginning of migration, and
> remain ballooned throughout, will never be pinned.
> 
> Paolo

Of course Michael says it's slow unless you disable zero page detection,
and then I'm guessing it does?

-- 
MST

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:58                                 ` Michael S. Tsirkin
@ 2013-04-11 16:06                                   ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 16:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/11/2013 11:58 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 05:47:53PM +0200, Paolo Bonzini wrote:
>> Il 11/04/2013 17:46, Michael S. Tsirkin ha scritto:
>>>> Ok, let's keep it simple.  The only two things we need are:
>>>>
>>>> 1) remove the patch to disable is_dup_page
>>>>
>>>> 2) rename the transport to "x-rdma" (just in migration.c)
>>>>
>>>> Both things together let us keep it safe for a release or two.  Let's
>>>> merge this thing.
>>> I would drop the chunk based thing too.  Besides being slow, it turns
>>> out that it pins all memory anyway. So no memory overcommit.
>> It doesn't pin zero pages.  Those are never transmitted (it's a recent
>> change).  So pages that are ballooned at the beginning of migration, and
>> remain ballooned throughout, will never be pinned.
>>
>> Paolo
> Of course Michael says it's slow unless you disable zero page detection,
> and then I'm guessing it does?
>
Only during the bulk phase round, and even then, as Paolo described,
zero pages do not get pinned on the destination.

Chunk registration is still very valuable when zero page detection is
activated.

The realization is that chunk registration (and zero page scanning) have
very little effect whatsoever on performance *after* the bulk phase 
round because
pages have already been mapped and already pinned in memory for life.

- Michael

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:44                           ` Michael S. Tsirkin
@ 2013-04-11 16:09                             ` Michael R. Hines
  2013-04-11 17:04                               ` Michael S. Tsirkin
  2013-04-11 16:13                             ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 16:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/11/2013 11:44 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
>> First of all,
> I know it's a hard habit to break but could you
> please stop top-posting?
Acknowledged.
>
>> this whole argument should not even exist for the
>> following reason:
>>
>> Page registrations are supposed to be *rare* - once a page is registered, it
>> is registered for life. There is nothing in the design that says a page must
>> be "unregistered" and I do not believe anybody is proposing that.
> Hmm proposing what? Of course you need to unregister pages
> eventually otherwise your pinned memory on the destination
> will just grow indefinitely. People are often doing
> registration caches to help reduce the overhead,
> but never unregistering seems too aggressive.
>
> You mean the chunk-based thing just delays the agony
> until all guest memory is pinned for RDMA anyway?
> Wait, is it registered for life on the source too?
>
> Well this kind of explains why qemu was dying on OOM,
> doesn't it?

Yes, that's correct. The agony is just delayed. The right thing to do
in a future patch would be to pin as much as possible in advance
before the bulk phase round even begins (using the pagemap).
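
The pagemap idea above relies on /proc/<pid>/pagemap, which exposes one
little-endian 64-bit entry per virtual page (bit 63 = present in RAM,
bit 62 = swapped). A minimal sketch of finding resident pages to pin up
front might look like this (illustrative Python; a real implementation
would pread() the pagemap file over the guest RAM range):

```python
import struct

PAGEMAP_PRESENT = 1 << 63  # page is resident in RAM
PAGEMAP_SWAPPED = 1 << 62  # page is in swap

def resident_pages(pagemap_bytes):
    """Parse raw pagemap data and return the indices of pages that are
    currently resident, i.e. candidates for pinning in advance of the
    bulk phase round."""
    count = len(pagemap_bytes) // 8
    entries = struct.unpack('<%dQ' % count, pagemap_bytes)
    return [i for i, e in enumerate(entries) if e & PAGEMAP_PRESENT]
```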

In the meantime, chunk registration performance is still very good
so long as total migration time is not the metric you are optimizing for.

>> Second, this means that my previous analysis showing that
>> performance was reduced
>> was also incorrect because most of the RDMA transfers were against
>> pages during
>> the bulk phase round, which incorrectly makes dynamic page
>> registration look bad.
>> I should have done more testing *after* the bulk phase round,
>> and I apologize for not doing that.
>>
>> Indeed when I do such a test (with the 'stress' command) the cost of
>> page registration disappears
>> because most of the registrations have already completed a long time ago.
>>
>> Thanks, Paolo, for reminding us about the bulk-phase behavior to begin with.
>>
>> Third, this means that optimizing this protocol would not be helpful
>> and that we should
>> follow the "keep it simple" approach because during steady-state
>> phase of the migration
>> most of the pages should have already been registered.
>>
>> - Michael
> If you mean that registering all memory is a requirement,
> then I am not sure I agree: you wrote one slow protocol, this
> does not mean that there can't be a fast one.
>
> But if you mean to say that the current chunk based code
> is useless, then I'd have to agree.

Answer above.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:44                           ` Michael S. Tsirkin
  2013-04-11 16:09                             ` Michael R. Hines
@ 2013-04-11 16:13                             ` Michael R. Hines
  1 sibling, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 16:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/11/2013 11:44 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
>> First of all,
> I know it's a hard habit to break but could you
> please stop top-posting?
>
>> this whole argument should not even exist for the
>> following reason:
>>
>> Page registrations are supposed to be *rare* - once a page is registered, it
>> is registered for life. There is nothing in the design that says a page must
>> be "unregistered" and I do not believe anybody is proposing that.
> Hmm proposing what? Of course you need to unregister pages
> eventually otherwise your pinned memory on the destination
> will just grow indefinitely. People are often doing
> registration caches to help reduce the overhead,
> but never unregistering seems too aggressive.
>
> You mean the chunk-based thing just delays the agony
> until all guest memory is pinned for RDMA anyway?
> Wait, is it registered for life on the source too?
>
> Well this kind of explains why qemu was dying on OOM,
> doesn't it?
>
>> Second, this means that my previous analysis showing that
>> performance was reduced
>> was also incorrect because most of the RDMA transfers were against
>> pages during
>> the bulk phase round, which incorrectly makes dynamic page
>> registration look bad.
>> I should have done more testing *after* the bulk phase round,
>> and I apologize for not doing that.
>>
>> Indeed when I do such a test (with the 'stress' command) the cost of
>> page registration disappears
>> because most of the registrations have already completed a long time ago.
>>
>> Thanks, Paolo, for reminding us about the bulk-phase behavior to begin with.
>>
>> Third, this means that optimizing this protocol would not be helpful
>> and that we should
>> follow the "keep it simple" approach because during steady-state
>> phase of the migration
>> most of the pages should have already been registered.
>>
>> - Michael
>
> But if you mean to say that the current chunk based code
> is useless, then I'd have to agree.
>
Well, you asked me to write an overcommit solution, so I wrote one. =)

Second, there is *no need* for a fast registration protocol, as I've
summarized, because most of the page registrations are supposed to have
already completed before the steady-state iterative phase of the
migration begins (which will be further optimized in a later patch).
You're complaining about a non-issue.

- Michael

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 16:09                             ` Michael R. Hines
@ 2013-04-11 17:04                               ` Michael S. Tsirkin
  2013-04-11 17:27                                 ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 17:04 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Thu, Apr 11, 2013 at 12:09:44PM -0400, Michael R. Hines wrote:
> On 04/11/2013 11:44 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
> >>First of all,
> >I know it's a hard habit to break but could you
> >please stop top-posting?
> Acknowledged.
> >
> >>this whole argument should not even exist for the
> >>following reason:
> >>
> >>Page registrations are supposed to be *rare* - once a page is registered, it
> >>is registered for life. There is nothing in the design that says a page must
> >>be "unregistered" and I do not believe anybody is proposing that.
> >Hmm proposing what? Of course you need to unregister pages
> >eventually otherwise your pinned memory on the destination
> >will just grow indefinitely. People are often doing
> >registration caches to help reduce the overhead,
> >but never unregistering seems too aggressive.
> >
> >You mean the chunk-based thing just delays the agony
> >until all guest memory is pinned for RDMA anyway?
> >Wait, is it registered for life on the source too?
> >
> >Well this kind of explains why qemu was dying on OOM,
> >doesn't it?
> 
> Yes, that's correct. The agony is just delayed. The right thing to do
> in a future patch would be to pin as much as possible in advance
> before the bulk phase round even begins (using the pagemap).

IMHO the right thing is to unpin memory after it's sent.

> In the meantime, chunk registration performance is still very good
> so long as total migration time is not the metric you are optimizing for.

You mean it has better downtime than TCP? Or lower host CPU
overhead? These are the metrics we care about.

> >>Second, this means that my previous analysis showing that
> >>performance was reduced
> >>was also incorrect because most of the RDMA transfers were against
> >>pages during
> >>the bulk phase round, which incorrectly makes dynamic page
> >>registration look bad.
> >>I should have done more testing *after* the bulk phase round,
> >>and I apologize for not doing that.
> >>
> >>Indeed when I do such a test (with the 'stress' command) the cost of
> >>page registration disappears
> >>because most of the registrations have already completed a long time ago.
> >>
> >>Thanks, Paolo, for reminding us about the bulk-phase behavior to begin with.
> >>
> >>Third, this means that optimizing this protocol would not be helpful
> >>and that we should
> >>follow the "keep it simple" approach because during steady-state
> >>phase of the migration
> >>most of the pages should have already been registered.
> >>
> >>- Michael
> >If you mean that registering all memory is a requirement,
> >then I am not sure I agree: you wrote one slow protocol, this
> >does not mean that there can't be a fast one.
> >
> >But if you mean to say that the current chunk based code
> >is useless, then I'd have to agree.
> 
> Answer above.

I don't see it above. What does "keep it simple mean"?

-- 
MST

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 17:04                               ` Michael S. Tsirkin
@ 2013-04-11 17:27                                 ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 17:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On 04/11/2013 01:04 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 12:09:44PM -0400, Michael R. Hines wrote:
>>
>> Yes, that's correct. The agony is just delayed. The right thing to do
>> in a future patch would be to pin as much as possible in advance
>> before the bulk phase round even begins (using the pagemap).
> IMHO the right thing is to unpin memory after it's sent.

Based on what, exactly? Would you unpin a hot page? Would you
unpin a cold page that becomes hot again later? I don't see how we can
know in advance the behavior of individual pages and make the decision
to unpin them - we probably don't want to know either.

Trying to build a more complex protocol just for something that's 
unpredictable
(and probably not the common case) doesn't seem like a good focus for 
debate.

Overcommit is really only useful when the "overcommitted" memory
is not expected to fluctuate.  Unpinning pages just so they can be 
overcommitted
later means that it was probably a bad idea to overcommit those pages in 
the first place....

What you're asking for is very fine-grained overcommitment, which, in my
experience, is not a decision that QEMU can ever practically make.
Memory footprints tend to either be very big or very small
and they stay that way for a very long time until something comes along 
to change that.

>> In the meantime, chunk registration performance is still very good
>> so long as total migration time is not the metric you are optimizing for.
> You mean it has better downtime than TCP? Or lower host CPU
> overhead? These are the metrics we care about.
Yes, it does indeed have better downtime because RDMA latencies are much
lower and *most* of the page registrations will have already occurred after
the bulk phase round has passed in the first iteration.

- Michael

>>> If you mean that registering all memory is a requirement,
>>> then I am not sure I agree: you wrote one slow protocol, this
>>> does not mean that there can't be a fast one.
>>>
>>> But if you mean to say that the current chunk based code
>>> is useless, then I'd have to agree.
>> Answer above.
> I don't see it above. What does "keep it simple mean"?
>

By simple, I mean the argument for a simpler protocol that I made above.

- Michael

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 14:56                           ` Michael S. Tsirkin
@ 2013-04-11 17:49                             ` Michael R. Hines
  2013-04-11 19:15                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 17:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>>> pg1 ->  pin -> req -> res -> rdma -> done
>>>          pg2 ->  pin -> req -> res -> rdma -> done
>>>                  pg3 -> pin -> req -> res -> rdma -> done
>>>                         pg4 -> pin -> req -> res -> rdma -> done
>>>                                pg4 -> pin -> req -> res -> rdma -> done
>>>
>>> It's like an assembly line, see?  So while software does the registration
>>> roundtrip dance, hardware is processing rdma requests for previous
>>> chunks.
>> Does this only affect the implementation, or also the wire protocol?
> It affects the wire protocol.

I *do* believe chunked registration was a *very* useful request by
the community, and I want to thank you for convincing me to implement it.

But, with all due respect, pipelining is a "solution looking for a problem".

Improving the protocol does not help the behavior of any well-known 
workloads,
because it is based on the idea that the memory footprint of a VM would
*rapidly* shrink and contract up and down during the steady-state iteration
rounds while the migration is taking place.

This simply does not happen - workloads don't behave that way - they either
grow really big or they grow really small and they settle that way for a 
reasonable
amount of time before the load on the application changes at a future 
point in time.

- Michael

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 17:49                             ` Michael R. Hines
@ 2013-04-11 19:15                               ` Michael S. Tsirkin
  2013-04-11 20:33                                 ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 19:15 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
> >>Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> >>>pg1 ->  pin -> req -> res -> rdma -> done
> >>>         pg2 ->  pin -> req -> res -> rdma -> done
> >>>                 pg3 -> pin -> req -> res -> rdma -> done
> >>>                        pg4 -> pin -> req -> res -> rdma -> done
> >>>                               pg4 -> pin -> req -> res -> rdma -> done
> >>>
> >>>It's like an assembly line, see?  So while software does the registration
> >>>roundtrip dance, hardware is processing rdma requests for previous
> >>>chunks.
> >>Does this only affect the implementation, or also the wire protocol?
> >It affects the wire protocol.
> 
> I *do* believe chunked registration was a *very* useful request by
> the community, and I want to thank you for convincing me to implement it.
> 
> But, with all due respect, pipelining is a "solution looking for a problem".

The problem is bad performance, isn't it?
If it wasn't we'd use chunk based all the time.

> Improving the protocol does not help the behavior of any well-known
> workloads,
> because it is based on the idea that the memory footprint of a VM would
> *rapidly* shrink and contract up and down during the steady-state iteration
> rounds while the migration is taking place.

What gave you that idea? Not at all.  It is based on the idea
of doing control actions in parallel with data transfers,
so that control latency does not degrade performance.

> This simply does not happen - workloads don't behave that way - they either
> grow really big or they grow really small and they settle that way
> for a reasonable
> amount of time before the load on the application changes at a
> future point in time.
> 
> - Michael

What is the bottleneck for chunk-based? Can you tell me that?  Find out,
and you will maybe see pipelining will help.

Basically to me, when you describe the protocol in detail the problems
become apparent.

I think you worry too much about what the guest does, what APIs are
exposed from the migration core and the specifics of the workload. Build
a sane protocol for data transfers and layer the workload on top.
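
The latency argument here can be made concrete with a back-of-the-
envelope model (illustrative Python; the function names and the idea of
modeling each chunk as a registration round trip plus an RDMA transfer
are assumptions, not measurements):

```python
def serialized_time(n_chunks, reg_rtt, rdma_xfer):
    """Each chunk waits for its full registration round trip before its
    RDMA write starts, and the next chunk waits for both to finish."""
    return n_chunks * (reg_rtt + rdma_xfer)

def pipelined_time(n_chunks, reg_rtt, rdma_xfer):
    """Registration round trips for later chunks overlap the RDMA
    writes of earlier chunks (the assembly line): after the first
    registration, throughput is bounded only by the slower stage."""
    return reg_rtt + n_chunks * max(reg_rtt, rdma_xfer)
```

With, say, a 2-unit registration round trip and a 5-unit transfer per
chunk, pipelining hides the registration latency almost entirely, which
is exactly the claim that control latency need not degrade performance.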

-- 
MST

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 19:15                               ` Michael S. Tsirkin
@ 2013-04-11 20:33                                 ` Michael R. Hines
  2013-04-12 10:48                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-11 20:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>>>> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>>>>> pg1 ->  pin -> req -> res -> rdma -> done
>>>>>          pg2 ->  pin -> req -> res -> rdma -> done
>>>>>                  pg3 -> pin -> req -> res -> rdma -> done
>>>>>                         pg4 -> pin -> req -> res -> rdma -> done
>>>>>                                pg4 -> pin -> req -> res -> rdma -> done
>>>>>
>>>>> It's like an assembly line, see?  So while software does the registration
>>>>> roundtrip dance, hardware is processing rdma requests for previous
>>>>> chunks.
>>>> Does this only affect the implementation, or also the wire protocol?
>>> It affects the wire protocol.
>> I *do* believe chunked registration was a *very* useful request by
>> the community, and I want to thank you for convincing me to implement it.
>>
>> But, with all due respect, pipelining is a "solution looking for a problem".
> The problem is bad performance, isn't it?
> If it wasn't we'd use chunk based all the time.
>
>> Improving the protocol does not help the behavior of any well-known
>> workloads,
>> because it is based on the idea that the memory footprint of a VM would
>> *rapidly* shrink and contract up and down during the steady-state iteration
>> rounds while the migration is taking place.
> What gave you that idea? Not at all.  It is based on the idea
> of doing control actions in parallel with data transfers,
> so that control latency does not degrade performance.
Again, this parallelization is trying to solve a problem that doesn't 
exist.

As I've described before, I re-executed the worst-case memory stress hog
tests with RDMA *after* the bulk-phase round completes and determined
that RDMA throughput remains unaffected because most of the memory
was already registered in advance.

>> This simply does not happen - workloads don't behave that way - they either
>> grow really big or they grow really small and they settle that way
>> for a reasonable
>> amount of time before the load on the application changes at a
>> future point in time.
>>
>> - Michael
> What is the bottleneck for chunk-based? Can you tell me that?  Find out,
> and you will maybe see pipelining will help.
>
> Basically to me, when you describe the protocol in detail the problems
> become apparent.
>
> I think you worry too much about what the guest does, what APIs are
> exposed from the migration core and the specifics of the workload. Build
> a sane protocol for data transfers and layer the workload on top.
>

What is the point in enhancing a protocol to solve a problem that will
never manifest?

We're trying to overlap two *completely different use cases* that are 
completely unrelated:

1. Static overcommit
2. Dynamic, fine-grained overcommit (at small time scales... seconds or 
minutes)

#1 Happens all the time. Cram a bunch of virtual machines with fixed 
workloads
and fixed writable working sets into the same place, and you're good to go.

#2 never happens. Ever. It just doesn't happen, and the enhancements you've
described are trying to protect against #2, when we should really be 
focused on #1.

It is not standard practice for a workload to expect high overcommit
performance in the *middle* of a relocation, and nobody in the industry
I have met over the years has expressed any desire to do so.

Workloads just don't behave that way.

Dynamic registration does an excellent job at overcommitment for #1 
because most
of the registrations are done at the very beginning and can be further 
optimized to
cause little or no performance loss by simply issuing the registrations 
before the
migration ever begins.

Performance for #2 even with dynamic registration is excellent and I am not
experiencing any problems associated with it.

So, we're discussing a non-issue.

- Michael




^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 15:33                           ` Paolo Bonzini
  2013-04-11 15:46                             ` Michael S. Tsirkin
@ 2013-04-12  5:10                             ` Michael R. Hines
  2013-04-12  5:26                               ` Paolo Bonzini
  1 sibling, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-12  5:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/11/2013 11:33 AM, Paolo Bonzini wrote:
> 2) rename the transport to "x-rdma" (just in migration.c) 

What does this mean?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12  5:10                             ` Michael R. Hines
@ 2013-04-12  5:26                               ` Paolo Bonzini
  2013-04-12  5:54                                 ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-12  5:26 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 12/04/2013 07:10, Michael R. Hines ha scritto:
> On 04/11/2013 11:33 AM, Paolo Bonzini wrote:
>> 2) rename the transport to "x-rdma" (just in migration.c) 
> 
> What does this mean?

Use "migrate x-rdma:192.168.10.12" to migrate, to indicate it's
experimental and the protocol might change.  It's just to err on the
safe side.
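
The rename is only a string change in the transport dispatch; a sketch
of what the URI handling amounts to (illustrative Python, not the
actual migration.c code; `parse_migration_uri` is a hypothetical name):

```python
def parse_migration_uri(uri):
    """Split 'x-rdma:host[:port]' into its parts.  The 'x-' prefix
    marks the transport as experimental, so the scheme string can be
    changed later without breaking a stable interface."""
    scheme, _, rest = uri.partition(':')
    if scheme != 'x-rdma':
        raise ValueError('unsupported transport: %s' % scheme)
    host, _, port = rest.partition(':')
    return host, int(port) if port else None
```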

Paolo

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12  5:26                               ` Paolo Bonzini
@ 2013-04-12  5:54                                 ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-12  5:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/12/2013 01:26 AM, Paolo Bonzini wrote:
> Il 12/04/2013 07:10, Michael R. Hines ha scritto:
>> On 04/11/2013 11:33 AM, Paolo Bonzini wrote:
>>> 2) rename the transport to "x-rdma" (just in migration.c)
>> What does this mean?
> Use "migrate x-rdma:192.168.10.12" to migrate, to indicate it's
> experimental and the protocol might change.  It's just to err on the
> safe side.
>
> Paolo
>
>
Ooops, you're not gonna make me re-send the patch, are you? =)


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-11 20:33                                 ` Michael R. Hines
@ 2013-04-12 10:48                                   ` Michael S. Tsirkin
  2013-04-12 10:53                                     ` Paolo Bonzini
  2013-04-12 13:47                                     ` Michael R. Hines
  0 siblings, 2 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-12 10:48 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Thu, Apr 11, 2013 at 04:33:03PM -0400, Michael R. Hines wrote:
> On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
> >>On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
> >>>On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
> >>>>On 11/04/2013 16:37, Michael S. Tsirkin wrote:
> >>>>>pg1 ->  pin -> req -> res -> rdma -> done
> >>>>>         pg2 ->  pin -> req -> res -> rdma -> done
> >>>>>                 pg3 -> pin -> req -> res -> rdma -> done
> >>>>>                        pg4 -> pin -> req -> res -> rdma -> done
> >>>>>                               pg5 -> pin -> req -> res -> rdma -> done
> >>>>>
> >>>>>It's like an assembly line, see?  So while software does the registration
> >>>>>roundtrip dance, hardware is processing rdma requests for previous
> >>>>>chunks.
> >>>>Does this only affect the implementation, or also the wire protocol?
> >>>It affects the wire protocol.
> >>I *do* believe chunked registration was a *very* useful request by
> >>the community, and I want to thank you for convincing me to implement it.
> >>
> >>But, with all due respect, pipelining is a "solution looking for a problem".
> >The problem is bad performance, isn't it?
> >If it wasn't we'd use chunk based all the time.
> >
> >>Improving the protocol does not help the behavior of any well-known
> >>workloads,
> >>because it is based on the idea that the memory footprint of a VM would
> >>*rapidly* expand and contract up and down during the steady-state iteration
> >>rounds while the migration is taking place.
> >What gave you that idea? Not at all.  It is based on the idea
> >of doing control actions in parallel with data transfers,
> >so that control latency does not degrade performance.
> Again, this parallelization is trying to solve a problem that
> doesn't exist.
> 
> As I've described before, I re-executed the worst-case memory stress hog
> tests with RDMA *after* the bulk-phase round completes and determined
> that RDMA throughput remains unaffected because most of the memory
> was already registered in advance.
> 
> >>This simply does not happen - workloads don't behave that way - they either
> >>grow really big or they grow really small and they settle that way
> >>for a reasonable
> >>amount of time before the load on the application changes at a
> >>future point in time.
> >>
> >>- Michael
> >What is the bottleneck for chunk-based? Can you tell me that?  Find out,
> >and you will maybe see pipelining will help.
> >
> >Basically to me, when you describe the protocol in detail the problems
> >become apparent.
> >
> >I think you worry too much about what the guest does, what APIs are
> >exposed from the migration core and the specifics of the workload. Build
> >a sane protocol for data transfers and layer the workload on top.
> >
> 
> What is the point in enhancing a protocol to solve a problem that will
> never be manifested?
> 
> We're trying to overlap two *completely different use cases* that
> are completely unrelated:
> 
> 1. Static overcommit
> 2. Dynamic, fine-grained overcommit (at small time scales... seconds
> or minutes)
> 
> #1 Happens all the time. Cram a bunch of virtual machines with fixed
> workloads
> and fixed writable working sets into the same place, and you're good to go.
> 
> #2 never happens. Ever. It just doesn't happen, and the enhancements you've
> described are trying to protect against #2, when we should really be
> focused on #1.
> 
> It is not standard practice for a workload to expect high overcommit
> performance
> in the *middle* of a relocation and nobody in the industry that I
> have met over the
> years has expressed any such desire to do so.
> 

Depends on who you talk to, I guess.  Almost everyone
overcommits to some level; they might not know it.
It depends on the amount of overcommit.  You pin all (at least non-zero)
memory eventually, breaking memory overcommit completely. If I
overcommit by 4 kilobytes, do you expect performance to go completely
down? It does not make sense.

> Workloads just don't behave that way.
> 
> Dynamic registration does an excellent job at overcommitment for #1
> because most
> of the registrations are done at the very beginning and can be
> further optimized to
> cause little or no performance loss by simply issuing the
> registrations before the
> migration ever begins.

How does it? You pin all VM's memory eventually.
You said your tests have the OOM killer triggering.


> Performance for #2 even with dynamic registration is excellent and I am not
> experiencing any problems associated with it.

Well previously you said the reverse. You keep vaguely speaking about
performance.  We care about these metrics:

	1. total migration time: measured by:

	time
	 ssh dest qemu -incoming &;echo migrate > monitor
	time

	2.  min allowed downtime that lets migration converge

	3. average host CPU utilization during migration,
	   on source and destination

	4. max real memory used by qemu

Can you fill this table for TCP, and two protocol versions?
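The four metrics above can be scripted against the monitor rather than eyeballed. A minimal sketch of parsing the timing fields out of HMP "info migrate" output follows; the exact field names and "milliseconds" units are assumptions about what QEMU's monitor printed in this era, so adjust to your build:

```python
import re

def parse_info_migrate(text):
    """Pull timing metrics out of HMP 'info migrate' output.

    The 'total time' / 'downtime' field names are assumptions;
    adapt the patterns to whatever your QEMU actually prints.
    """
    metrics = {}
    for key in ("total time", "downtime"):
        m = re.search(r"%s: (\d+) milliseconds" % key, text)
        if m:
            metrics[key] = int(m.group(1))
    return metrics

# Hypothetical monitor output for illustration:
sample = ("Migration status: completed\n"
          "total time: 41324 milliseconds\n"
          "downtime: 287 milliseconds\n")
print(parse_info_migrate(sample))  # {'total time': 41324, 'downtime': 287}
```

Host CPU utilization and max resident memory (metrics 3 and 4) would come from the usual tools (e.g. sampling /proc/<pid>/stat and status) rather than the monitor.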

If dynamic works as well as static, this is a good reason to drop the
static one.  As the next step, fix the dynamic to unregister
memory (this is required for _GIFT anyway). When you do this
it is possible that pipelining is required.

> So, we're discussing a non-issue.
> 
> - Michael
> 

There are two issues.

1.  You have two protocols already and this does not make sense in
version 1 of the patch.  You said dynamic is slow so I pointed out ways
to improve it. Now you say it's as fast as static?  So drop static
then. At no point does it make sense to have management commands to play
with low level protocol details.

> 
> 
> Overcommit has two

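The assembly-line argument in the message above can be made concrete with a toy latency model (illustrative numbers only, not measurements): if each chunk needs a registration round-trip of C microseconds and an RDMA transfer of D microseconds, serial operation pays C + D per chunk, while pipelining registers chunk i+1 during the transfer of chunk i, so the steady state pays only max(C, D):

```python
# Toy model of serial vs pipelined chunk registration.
# c = registration round-trip latency, d = RDMA transfer time (same units).

def serial_time(n, c, d):
    # Every chunk waits for its own registration before transferring.
    return n * (c + d)

def pipelined_time(n, c, d):
    # The first registration is exposed; afterwards registration of the
    # next chunk overlaps the current transfer, so each step costs max(c, d).
    return c + n * max(c, d) if n else 0

print(serial_time(1000, 50, 100))     # 150000
print(pipelined_time(1000, 50, 100))  # 100050
```

Under these made-up numbers the control latency is almost entirely hidden; when C exceeds D, registration becomes the bottleneck and pipelining only caps, rather than removes, the cost.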

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12 10:48                                   ` Michael S. Tsirkin
@ 2013-04-12 10:53                                     ` Paolo Bonzini
  2013-04-12 11:25                                       ` Michael S. Tsirkin
  2013-04-12 13:47                                     ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-12 10:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On 12/04/2013 12:48, Michael S. Tsirkin wrote:
> 1.  You have two protocols already and this does not make sense in
> version 1 of the patch.

It makes sense if we consider it experimental (add x- in front of
transport and capability) and would like people to play with it.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12 10:53                                     ` Paolo Bonzini
@ 2013-04-12 11:25                                       ` Michael S. Tsirkin
  2013-04-12 14:43                                         ` Paolo Bonzini
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-12 11:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> On 12/04/2013 12:48, Michael S. Tsirkin wrote:
> > 1.  You have two protocols already and this does not make sense in
> > version 1 of the patch.
> 
> It makes sense if we consider it experimental (add x- in front of
> transport and capability) and would like people to play with it.
> 
> Paolo

But it's not testable yet.  I see problems just reading the
documentation.  Author thinks "ulimit -l 10000000000" on both source and
destination is just fine.  This can easily crash host or cause OOM
killer to kill QEMU.  So why is there any need for extra testers?  Fix
the major bugs first.

There's a similar issue with device assignment - we can't fix it there,
and despite being available for years, this was one of two reasons that
has kept this feature out of hands of lots of users (and assuming guest
has lots of zero pages won't work: balloon is not widely used either
since it depends on a well-behaved guest to work correctly).

And it's entirely avoidable, just fix the protocol and the code.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12 10:48                                   ` Michael S. Tsirkin
  2013-04-12 10:53                                     ` Paolo Bonzini
@ 2013-04-12 13:47                                     ` Michael R. Hines
  2013-04-14  8:28                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-12 13:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/12/2013 06:48 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 04:33:03PM -0400, Michael R. Hines wrote:
>> On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>>>> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>>>>>> On 11/04/2013 16:37, Michael S. Tsirkin wrote:
>>>>>>> pg1 ->  pin -> req -> res -> rdma -> done
>>>>>>>          pg2 ->  pin -> req -> res -> rdma -> done
>>>>>>>                  pg3 -> pin -> req -> res -> rdma -> done
>>>>>>>                         pg4 -> pin -> req -> res -> rdma -> done
>>>>>>>                                pg5 -> pin -> req -> res -> rdma -> done
>>>>>>>
>>>>>>> It's like an assembly line, see?  So while software does the registration
>>>>>>> roundtrip dance, hardware is processing rdma requests for previous
>>>>>>> chunks.
>>>>>> Does this only affect the implementation, or also the wire protocol?
>>>>> It affects the wire protocol.
>>>> I *do* believe chunked registration was a *very* useful request by
>>>> the community, and I want to thank you for convincing me to implement it.
>>>>
>>>> But, with all due respect, pipelining is a "solution looking for a problem".
>>> The problem is bad performance, isn't it?
>>> If it wasn't we'd use chunk based all the time.
>>>
>>>> Improving the protocol does not help the behavior of any well-known
>>>> workloads,
>>>> because it is based on the idea that the memory footprint of a VM would
>>>> *rapidly* expand and contract up and down during the steady-state iteration
>>>> rounds while the migration is taking place.
>>> What gave you that idea? Not at all.  It is based on the idea
>>> of doing control actions in parallel with data transfers,
>>> so that control latency does not degrade performance.
>> Again, this parallelization is trying to solve a problem that
>> doesn't exist.
>>
>> As I've described before, I re-executed the worst-case memory stress hog
>> tests with RDMA *after* the bulk-phase round completes and determined
>> that RDMA throughput remains unaffected because most of the memory
>> was already registered in advance.
>>
>>>> This simply does not happen - workloads don't behave that way - they either
>>>> grow really big or they grow really small and they settle that way
>>>> for a reasonable
>>>> amount of time before the load on the application changes at a
>>>> future point in time.
>>>>
>>>> - Michael
>>> What is the bottleneck for chunk-based? Can you tell me that?  Find out,
>>> and you will maybe see pipelining will help.
>>>
>>> Basically to me, when you describe the protocol in detail the problems
>>> become apparent.
>>>
>>> I think you worry too much about what the guest does, what APIs are
>>> exposed from the migration core and the specifics of the workload. Build
>>> a sane protocol for data transfers and layer the workload on top.
>>>
>> What is the point in enhancing a protocol to solve a problem that will
>> never be manifested?
>>
>> We're trying to overlap two *completely different use cases* that
>> are completely unrelated:
>>
>> 1. Static overcommit
>> 2. Dynamic, fine-grained overcommit (at small time scales... seconds
>> or minutes)
>>
>> #1 Happens all the time. Cram a bunch of virtual machines with fixed
>> workloads
>> and fixed writable working sets into the same place, and you're good to go.
>>
>> #2 never happens. Ever. It just doesn't happen, and the enhancements you've
>> described are trying to protect against #2, when we should really be
>> focused on #1.
>>
>> It is not standard practice for a workload to expect high overcommit
>> performance
>> in the *middle* of a relocation and nobody in the industry that I
>> have met over the
>> years has expressed any such desire to do so.
>>
> Depends on who you talk to, I guess.  Almost everyone
> overcommits to some level; they might not know it.
> It depends on the amount of overcommit.  You pin all (at least non-zero)
> memory eventually, breaking memory overcommit completely. If I
> overcommit by 4 kilobytes, do you expect performance to go completely
> down? It does not make sense.
>
>> Workloads just don't behave that way.
>>
>> Dynamic registration does an excellent job at overcommitment for #1
>> because most
>> of the registrations are done at the very beginning and can be
>> further optimized to
>> cause little or no performance loss by simply issuing the
>> registrations before the
>> migration ever begins.
> How does it? You pin all VM's memory eventually.
> You said your tests have the OOM killer triggering.
>

That's because of cgroups memory limitations, not the protocol.

Infiniband was never designed to work with cgroups - that's a kernel
problem, not a QEMU problem or a protocol problem. Why do
we have to worry about that, exactly?

>> Performance for #2 even with dynamic registration is excellent and I am not
>> experiencing any problems associated with it.
> Well previously you said the reverse. You keep vaguely speaking about
> performance.  We care about these metrics:
>
> 	1. total migration time: measured by:
>
> 	time
> 	 ssh dest qemu -incoming &;echo migrate > monitor
> 	time
>
> 	2.  min allowed downtime that lets migration converge
>
> 	3. average host CPU utilization during migration,
> 	   on source and destination
>
> 	4. max real memory used by qemu
>
> Can you fill this table for TCP, and two protocol versions?
>
> If dynamic works as well as static, this is a good reason to drop the
> static one.  As the next step, fix the dynamic to unregister
> memory (this is required for _GIFT anyway). When you do this
> it is possible that pipelining is required.

First, yes, I'm happy to fill out the table - let me address
Paolo's last requested changes (including the COMPRESS fix) first.

Second, there are not two protocol versions. That's incorrect.
There's only one protocol, which - like any protocol - can operate
in different ways. It has different command types, not all of which
need to be used at the same time.

Second, as I've explained, I strongly, strongly disagree with unregistering
memory for all of the aforementioned reasons - workloads do not
operate in such a manner that they can tolerate memory to be
pulled out from underneath them at such fine-grained time scales
in the *middle* of a relocation and I will not commit to writing a solution
for a problem that doesn't exist.

If you can prove (through some kind of analysis) that workloads
would benefit from this kind of fine-grained memory overcommit
by having cgroups swap out memory to disk underneath them
without their permission, I would happily reconsider my position.

- Michael



>> So, we're discussing a non-issue.
>>
>> - Michael
>>
> There are two issues.
>
> 1.  You have two protocols already and this does not make sense in
> version 1 of the patch.  You said dynamic is slow so I pointed out ways
> to improve it. Now you say it's as fast as static?  So drop static
> then. At no point does it make sense to have management commands to play
> with low level protocol details.
>
>>
>> Overcommit has two


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12 11:25                                       ` Michael S. Tsirkin
@ 2013-04-12 14:43                                         ` Paolo Bonzini
  2013-04-14 11:59                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-12 14:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On 12/04/2013 13:25, Michael S. Tsirkin wrote:
> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>> On 12/04/2013 12:48, Michael S. Tsirkin wrote:
>>> 1.  You have two protocols already and this does not make sense in
>>> version 1 of the patch.
>>
>> It makes sense if we consider it experimental (add x- in front of
>> transport and capability) and would like people to play with it.
>>
>> Paolo
> 
> But it's not testable yet.  I see problems just reading the
> documentation.  Author thinks "ulimit -l 10000000000" on both source and
> destination is just fine.  This can easily crash host or cause OOM
> killer to kill QEMU.  So why is there any need for extra testers?  Fix
> the major bugs first.
> 
> There's a similar issue with device assignment - we can't fix it there,
> and despite being available for years, this was one of two reasons that
> has kept this feature out of hands of lots of users (and assuming guest
> has lots of zero pages won't work: balloon is not widely used either
> since it depends on a well-behaved guest to work correctly).

I agree assuming guest has lots of zero pages won't work, but I think
you are overstating the importance of overcommit.  Let's mark the damn
thing as experimental, and stop making perfect the enemy of good.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12 13:47                                     ` Michael R. Hines
@ 2013-04-14  8:28                                       ` Michael S. Tsirkin
  2013-04-14 14:31                                         ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14  8:28 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> Second, as I've explained, I strongly, strongly disagree with unregistering
> memory for all of the aforementioned reasons - workloads do not
> operate in such a manner that they can tolerate memory to be
> pulled out from underneath them at such fine-grained time scales
> in the *middle* of a relocation and I will not commit to writing a solution
> for a problem that doesn't exist.

Exactly the same thing happens with swap, doesn't it?
You are saying workloads simply can not tolerate swap.

> If you can prove (through some kind of analysis) that workloads
> would benefit from this kind of fine-grained memory overcommit
> by having cgroups swap out memory to disk underneath them
> without their permission, I would happily reconsider my position.
> 
> - Michael

This has nothing to do with cgroups directly, it's just a way to
demonstrate you have a bug.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-12 14:43                                         ` Paolo Bonzini
@ 2013-04-14 11:59                                           ` Michael S. Tsirkin
  2013-04-14 14:09                                             ` Paolo Bonzini
  2013-04-14 14:27                                             ` Michael R. Hines
  0 siblings, 2 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14 11:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> On 12/04/2013 13:25, Michael S. Tsirkin wrote:
> > On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >> On 12/04/2013 12:48, Michael S. Tsirkin wrote:
> >>> 1.  You have two protocols already and this does not make sense in
> >>> version 1 of the patch.
> >>
> >> It makes sense if we consider it experimental (add x- in front of
> >> transport and capability) and would like people to play with it.
> >>
> >> Paolo
> > 
> > But it's not testable yet.  I see problems just reading the
> > documentation.  Author thinks "ulimit -l 10000000000" on both source and
> > destination is just fine.  This can easily crash host or cause OOM
> > killer to kill QEMU.  So why is there any need for extra testers?  Fix
> > the major bugs first.
> > 
> > There's a similar issue with device assignment - we can't fix it there,
> > and despite being available for years, this was one of two reasons that
> > has kept this feature out of hands of lots of users (and assuming guest
> > has lots of zero pages won't work: balloon is not widely used either
> > since it depends on a well-behaved guest to work correctly).
> 
> I agree assuming guest has lots of zero pages won't work, but I think
> you are overstating the importance of overcommit.  Let's mark the damn
> thing as experimental, and stop making perfect the enemy of good.
> 
> Paolo

It looks like we have to decide, before merging, whether migration with
rdma that breaks overcommit is worth it or not.  Since the author made
it very clear he does not intend to make it work with overcommit, ever.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 11:59                                           ` Michael S. Tsirkin
@ 2013-04-14 14:09                                             ` Paolo Bonzini
  2013-04-14 14:40                                               ` Michael R. Hines
  2013-04-14 14:27                                             ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-14 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines, gokul

On 14/04/2013 13:59, Michael S. Tsirkin wrote:
> > I agree assuming guest has lots of zero pages won't work, but I think
> > you are overstating the importance of overcommit.  Let's mark the damn
> > thing as experimental, and stop making perfect the enemy of good.
> 
> It looks like we have to decide, before merging, whether migration with
> rdma that breaks overcommit is worth it or not.  Since the author made
> it very clear he does not intend to make it work with overcommit, ever.

To me it is very much worth it.

I would like to understand if unregistration would require a protocol
change, but that's really more a curiosity than anything else.

Perhaps it would make sense to make chunk registration permanent only
after the bulk phase.  Chunks registered in the bulk phase are not
permanent.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 11:59                                           ` Michael S. Tsirkin
  2013-04-14 14:09                                             ` Paolo Bonzini
@ 2013-04-14 14:27                                             ` Michael R. Hines
  2013-04-14 16:03                                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 14:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>> On 12/04/2013 13:25, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>> On 12/04/2013 12:48, Michael S. Tsirkin wrote:
>>>>> 1.  You have two protocols already and this does not make sense in
>>>>> version 1 of the patch.
>>>> It makes sense if we consider it experimental (add x- in front of
>>>> transport and capability) and would like people to play with it.
>>>>
>>>> Paolo
>>> But it's not testable yet.  I see problems just reading the
>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>> destination is just fine.  This can easily crash host or cause OOM
>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>> the major bugs first.
>>>
>>> There's a similar issue with device assignment - we can't fix it there,
>>> and despite being available for years, this was one of two reasons that
>>> has kept this feature out of hands of lots of users (and assuming guest
>>> has lots of zero pages won't work: balloon is not widely used either
>>> since it depends on a well-behaved guest to work correctly).
>> I agree assuming guest has lots of zero pages won't work, but I think
>> you are overstating the importance of overcommit.  Let's mark the damn
>> thing as experimental, and stop making perfect the enemy of good.
>>
>> Paolo
> It looks like we have to decide, before merging, whether migration with
> rdma that breaks overcommit is worth it or not.  Since the author made
> it very clear he does not intend to make it work with overcommit, ever.
>
That depends entirely on what you define as overcommit.

The pages do get unregistered at the end of the migration =)

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14  8:28                                       ` Michael S. Tsirkin
@ 2013-04-14 14:31                                         ` Michael R. Hines
  2013-04-14 18:51                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 14:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
>> Second, as I've explained, I strongly, strongly disagree with unregistering
>> memory for all of the aforementioned reasons - workloads do not
>> operate in such a manner that they can tolerate memory to be
>> pulled out from underneath them at such fine-grained time scales
>> in the *middle* of a relocation and I will not commit to writing a solution
>> for a problem that doesn't exist.
> Exactly the same thing happens with swap, doesn't it?
> You are saying workloads simply can not tolerate swap.
>
>> If you can prove (through some kind of analysis) that workloads
>> would benefit from this kind of fine-grained memory overcommit
>> by having cgroups swap out memory to disk underneath them
>> without their permission, I would happily reconsider my position.
>>
>> - Michael
> This has nothing to do with cgroups directly, it's just a way to
> demonstrate you have a bug.
>

If your datacenter or your cloud or your product does not want to
tolerate page registration, then don't use RDMA!

The bottom line is: RDMA is useless without page registration. Without
it, its performance will be crippled. If you define that as a bug,
then so be it.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 14:09                                             ` Paolo Bonzini
@ 2013-04-14 14:40                                               ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 14:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/14/2013 10:09 AM, Paolo Bonzini wrote:
> On 14/04/2013 13:59, Michael S. Tsirkin wrote:
>>> I agree assuming guest has lots of zero pages won't work, but I think
>>> you are overstating the importance of overcommit.  Let's mark the damn
>>> thing as experimental, and stop making perfect the enemy of good.
>> It looks like we have to decide, before merging, whether migration with
>> rdma that breaks overcommit is worth it or not.  Since the author made
>> it very clear he does not intend to make it work with overcommit, ever.
> To me it is very much worth it.
>
> I would like to understand if unregistration would require a protocol
> change, but that's really more a curiosity than anything else.

Yes, it would require a protocol change. Either the source or the
destination would have to arbitrarily "decide" when it is time to
perform the unregistration without causing the page
to be RE-registered over and over again during future iterations.

I really don't see how QEMU can accurately make such a decision.

> Perhaps it would make sense to make chunk registration permanent only
> after the bulk phase.  Chunks registered in the bulk phase are not
> permanent.

Unfortunately, that would require the entire memory footprint
to be pinned during the bulk round, which was what Michael
was originally trying to avoid a couple of weeks ago.

Nevertheless, the observation is accurate: We already have
a capability to disable chunk registration entirely.

If the user doesn't want it, they can just turn it off.


- Michael

> Paolo
>


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 14:27                                             ` Michael R. Hines
@ 2013-04-14 16:03                                               ` Michael S. Tsirkin
  2013-04-14 16:07                                                 ` Michael R. Hines
  2013-04-14 16:40                                                 ` Michael R. Hines
  0 siblings, 2 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14 16:03 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>On 12/04/2013 13:25, Michael S. Tsirkin wrote:
> >>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>On 12/04/2013 12:48, Michael S. Tsirkin wrote:
> >>>>>1.  You have two protocols already and this does not make sense in
> >>>>>version 1 of the patch.
> >>>>It makes sense if we consider it experimental (add x- in front of
> >>>>transport and capability) and would like people to play with it.
> >>>>
> >>>>Paolo
> >>>But it's not testable yet.  I see problems just reading the
> >>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>destination is just fine.  This can easily crash host or cause OOM
> >>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>the major bugs first.
> >>>
> >>>There's a similar issue with device assignment - we can't fix it there,
> >>>and despite being available for years, this was one of two reasons that
> >>>have kept this feature out of the hands of lots of users (and assuming guest
> >>>has lots of zero pages won't work: balloon is not widely used either
> >>>since it depends on a well-behaved guest to work correctly).
> >>I agree assuming guest has lots of zero pages won't work, but I think
> >>you are overstating the importance of overcommit.  Let's mark the damn
> >>thing as experimental, and stop making perfect the enemy of good.
> >>
> >>Paolo
> >It looks like we have to decide, before merging, whether migration with
> >rdma that breaks overcommit is worth it or not.  Since the author made
> >it very clear he does not intend to make it work with overcommit, ever.
> >
> That depends entirely on what you define as overcommit.

You don't get to define your own terms.  Look it up in Wikipedia or
something.

> 
> The pages do get unregistered at the end of the migration =)
> 
> - Michael

The limitations are pretty clear, and you really should document them:

1. run qemu as root, or under ulimit -l <total guest memory> on both source and
  destination

2. expect that as much as that amount of memory is pinned
  and unavailable to the host kernel and applications for
  an arbitrarily long time.
  Make sure you have much more RAM in the host or QEMU will get killed.

To me, especially 1 is an unacceptable security tradeoff.
It is entirely fixable but we both have other priorities,
so it'll stay broken.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 16:03                                               ` Michael S. Tsirkin
@ 2013-04-14 16:07                                                 ` Michael R. Hines
  2013-04-14 16:40                                                 ` Michael R. Hines
  1 sibling, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 16:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>> version 1 of the patch.
>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>> transport and capability) and would like people to play with it.
>>>>>>
>>>>>> Paolo
>>>>> But it's not testable yet.  I see problems just reading the
>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>> the major bugs first.
>>>>>
>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>> and despite being available for years, this was one of two reasons that
>>>>> have kept this feature out of the hands of lots of users (and assuming guest
>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>> since it depends on a well-behaved guest to work correctly).
>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>
>>>> Paolo
>>> It looks like we have to decide, before merging, whether migration with
>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>> it very clear he does not intend to make it work with overcommit, ever.
>>>
>> That depends entirely on what you define as overcommit.
> You don't get to define your own terms.  Look it up in Wikipedia or
> something.
>
>> The pages do get unregistered at the end of the migration =)
>>
>> - Michael
> The limitations are pretty clear, and you really should document them:
>
> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>    destination
>
> 2. expect that as much as that amount of memory is pinned
>    and unavailable to the host kernel and applications for
>    an arbitrarily long time.
>    Make sure you have much more RAM in the host or QEMU will get killed.
>
> To me, especially 1 is an unacceptable security tradeoff.
> It is entirely fixable but we both have other priorities,
> so it'll stay broken.
>

Agreed, the documentation should be clear.

So, if you define that scenario as broken, then yes, it's broken.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 16:03                                               ` Michael S. Tsirkin
  2013-04-14 16:07                                                 ` Michael R. Hines
@ 2013-04-14 16:40                                                 ` Michael R. Hines
  2013-04-14 18:30                                                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 16:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>> version 1 of the patch.
>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>> transport and capability) and would like people to play with it.
>>>>>>
>>>>>> Paolo
>>>>> But it's not testable yet.  I see problems just reading the
>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>> the major bugs first.
>>>>>
>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>> and despite being available for years, this was one of two reasons that
>>>>> have kept this feature out of the hands of lots of users (and assuming guest
>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>> since it depends on a well-behaved guest to work correctly).
>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>
>>>> Paolo
>>> It looks like we have to decide, before merging, whether migration with
>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>> it very clear he does not intend to make it work with overcommit, ever.
>>>
>> That depends entirely on what you define as overcommit.
> You don't get to define your own terms.  Look it up in Wikipedia or
> something.
>
>> The pages do get unregistered at the end of the migration =)
>>
>> - Michael
> The limitations are pretty clear, and you really should document them:
>
> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>    destination
>
> 2. expect that as much as that amount of memory is pinned
>    and unavailable to the host kernel and applications for
>    an arbitrarily long time.
>    Make sure you have much more RAM in the host or QEMU will get killed.
>
> To me, especially 1 is an unacceptable security tradeoff.
> It is entirely fixable but we both have other priorities,
> so it'll stay broken.
>

I've modified the beginning of docs/rdma.txt to say the following:

$ cat docs/rdma.txt

... snip ..

BEFORE RUNNING:
===============

Use of RDMA requires pinning and registering memory with the
hardware. If this is not acceptable for your application or
product, then the use of RDMA is strongly discouraged and you
should revert back to standard TCP-based migration.

Next, decide if you want dynamic page registration on the server-side.
For example, if you have an 8GB RAM virtual machine, but only 1GB
is in active use, then disabling this feature will cause all 8GB to
be pinned and resident in memory. This feature mostly affects the
bulk-phase round of the migration and can be disabled for extremely
high-performance RDMA hardware using the following command:

QEMU Monitor Command:
$ migrate_set_capability chunk_register_destination off # enabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

RUNNING:
=======

..... snip ...

I'll group this change into a future patch whenever the current patch
gets pulled, and I will also update the QEMU wiki to make this point clear.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 16:40                                                 ` Michael R. Hines
@ 2013-04-14 18:30                                                   ` Michael S. Tsirkin
  2013-04-14 19:06                                                     ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14 18:30 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>version 1 of the patch.
> >>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>transport and capability) and would like people to play with it.
> >>>>>>
> >>>>>>Paolo
> >>>>>But it's not testable yet.  I see problems just reading the
> >>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>the major bugs first.
> >>>>>
> >>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>and despite being available for years, this was one of two reasons that
> >>>>have kept this feature out of the hands of lots of users (and assuming guest
> >>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>since it depends on a well-behaved guest to work correctly).
> >>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>
> >>>>Paolo
> >>>It looks like we have to decide, before merging, whether migration with
> >>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>it very clear he does not intend to make it work with overcommit, ever.
> >>>
> >>That depends entirely on what you define as overcommit.
> >You don't get to define your own terms.  Look it up in Wikipedia or
> >something.
> >
> >>The pages do get unregistered at the end of the migration =)
> >>
> >>- Michael
> >The limitations are pretty clear, and you really should document them:
> >
> >1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >   destination
> >
> >2. expect that as much as that amount of memory is pinned
> >   and unavailable to the host kernel and applications for
> >   an arbitrarily long time.
> >   Make sure you have much more RAM in the host or QEMU will get killed.
> >
> >To me, especially 1 is an unacceptable security tradeoff.
> >It is entirely fixable but we both have other priorities,
> >so it'll stay broken.
> >
> 
> I've modified the beginning of docs/rdma.txt to say the following:

It really should say this, in a very prominent place:

BUGS:
1. You must run qemu as root, or under
   ulimit -l <total guest memory> on both source and destination

2. Expect as much as that amount of memory to be locked
   and unavailable to the host kernel and applications for
   an arbitrarily long time.
   Make sure you have much more RAM in the host, otherwise QEMU,
   or some other arbitrary application on the same host, will get killed.

3. Migration with RDMA support is experimental and unsupported.
   In particular, please do not expect it to work across qemu versions,
   and do not expect the management interface to be stable.
   

> 
> $ cat docs/rdma.txt
> 
> ... snip ..
> 
> BEFORE RUNNING:
> ===============
> 
> Use of RDMA requires pinning and registering memory with the
> hardware. If this is not acceptable for your application or
> product, then the use of RDMA is strongly discouraged and you
> should revert back to standard TCP-based migration.

No one knows, or should know, what "pinning and registering" means.
For which applications and products is it appropriate?
Also, you are talking about current QEMU
code using RDMA for migration but say "RDMA" generally.

> Next, decide if you want dynamic page registration on the server-side.
> For example, if you have an 8GB RAM virtual machine, but only 1GB
> is in active use, then disabling this feature will cause all 8GB to
> be pinned and resident in memory. This feature mostly affects the
> bulk-phase round of the migration and can be disabled for extremely
> high-performance RDMA hardware using the following command:
> QEMU Monitor Command:
> $ migrate_set_capability chunk_register_destination off # enabled by default
> 
> Performing this action will cause all 8GB to be pinned, so if that's
> not what you want, then please ignore this step altogether.

This does not make it clear what the benefit of disabling this
capability is. I think it's best to avoid options and just use
chunk-based registration always.
If it's here "so people can play with it" then please rename
it to something like "x-unsupported-chunk_register_destination"
so people know this is unsupported and not to be used for production.

> RUNNING:
> =======
> 
> ..... snip ...
> 
> I'll group this change into a future patch whenever the current patch
> gets pulled, and I will also update the QEMU wiki to make this point clear.
> 
> - Michael
> 
> 
> 


-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 14:31                                         ` Michael R. Hines
@ 2013-04-14 18:51                                           ` Michael S. Tsirkin
  2013-04-14 19:43                                             ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14 18:51 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
> On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> >On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> >>Second, as I've explained, I strongly, strongly disagree with unregistering
> >>memory for all of the aforementioned reasons - workloads do not
> >>operate in such a manner that they can tolerate memory to be
> >>pulled out from underneath them at such fine-grained time scales
> >>in the *middle* of a relocation and I will not commit to writing a solution
> >>for a problem that doesn't exist.
> >Exactly same thing happens with swap, doesn't it?
> >You are saying workloads simply can not tolerate swap.
> >
> >>If you can prove (through some kind of analysis) that workloads
> >>would benefit from this kind of fine-grained memory overcommit
> >>by having cgroups swap out memory to disk underneath them
> >>without their permission, I would happily reconsider my position.
> >>
> >>- Michael
> >This has nothing to do with cgroups directly, it's just a way to
> >demonstrate you have a bug.
> >
> 
> If your datacenter or your cloud or your product does not want to
> tolerate page registration, then don't use RDMA!
> 
> The bottom line is: RDMA is useless without page registration. Without
> it, the performance of it will be crippled. If you define that as a bug,
> then so be it.
> 
> - Michael

No one cares if you do page registration or not.  ulimit -l 10g is the
problem.  You should limit the amount of locked memory.
Lots of good research went into making RDMA go fast with limited locked
memory, with some success. Search for "registration cache" for example.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 18:30                                                   ` Michael S. Tsirkin
@ 2013-04-14 19:06                                                     ` Michael R. Hines
  2013-04-14 21:10                                                       ` Michael S. Tsirkin
  2013-04-15  8:26                                                       ` Paolo Bonzini
  0 siblings, 2 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 19:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>>>> version 1 of the patch.
>>>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>>>> transport and capability) and would like people to play with it.
>>>>>>>>
>>>>>>>> Paolo
>>>>>>> But it's not testable yet.  I see problems just reading the
>>>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>>>> the major bugs first.
>>>>>>>
>>>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>>>> and despite being available for years, this was one of two reasons that
>>>>>>> have kept this feature out of the hands of lots of users (and assuming guest
>>>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>>>> since it depends on a well-behaved guest to work correctly).
>>>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>>>
>>>>>> Paolo
>>>>> It looks like we have to decide, before merging, whether migration with
>>>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>>>> it very clear he does not intend to make it work with overcommit, ever.
>>>>>
>>>> That depends entirely on what you define as overcommit.
>>> You don't get to define your own terms.  Look it up in Wikipedia or
>>> something.
>>>
>>>> The pages do get unregistered at the end of the migration =)
>>>>
>>>> - Michael
>>> The limitations are pretty clear, and you really should document them:
>>>
>>> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>>>    destination
>>>
>>> 2. expect that as much as that amount of memory is pinned
>>>    and unvailable to host kernel and applications for
>>>    arbitrarily long time.
>>>    Make sure you have much more RAM in host or QEMU will get killed.
>>>
>>> To me, especially 1 is an unacceptable security tradeoff.
>>> It is entirely fixable but we both have other priorities,
>>> so it'll stay broken.
>>>
>> I've modified the beginning of docs/rdma.txt to say the following:
> It really should say this, in a very prominent place:
>
> BUGS:
Not a bug. We'll have to agree to disagree. Please drop this.
> 1. You must run qemu as root, or under
>     ulimit -l <total guest memory> on both source and destination
Good, will update the documentation now.
> >2. Expect as much as that amount of memory to be locked
> >    and unavailable to the host kernel and applications for
> >    an arbitrarily long time.
> >    Make sure you have much more RAM in the host, otherwise QEMU,
> >    or some other arbitrary application on the same host, will get killed.
This is implied already. The docs say "If you don't want pinning, then 
use TCP".
That's enough warning.
> 3. Migration with RDMA support is experimental and unsupported.
>     In particular, please do not expect it to work across qemu versions,
>     and do not expect the management interface to be stable.
>     

The only correct statement here is that it's experimental.

I will update the docs to reflect that.

>> $ cat docs/rdma.txt
>>
>> ... snip ..
>>
>> BEFORE RUNNING:
>> ===============
>>
>> Use of RDMA requires pinning and registering memory with the
>> hardware. If this is not acceptable for your application or
>> product, then the use of RDMA is strongly discouraged and you
>> should revert back to standard TCP-based migration.
> No one knows, or should know, what "pinning and registering" means.

I will define it in the docs, then.

> For which applications and products is it appropriate?

That's up to the vendor or user to decide, not us.

> Also, you are talking about current QEMU
> code using RDMA for migration but say "RDMA" generally.

Sure, I will fix the docs.

>> Next, decide if you want dynamic page registration on the server-side.
>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>> is in active use, then disabling this feature will cause all 8GB to
>> be pinned and resident in memory. This feature mostly affects the
>> bulk-phase round of the migration and can be disabled for extremely
>> high-performance RDMA hardware using the following command:
>> QEMU Monitor Command:
>> $ migrate_set_capability chunk_register_destination off # enabled by default
>>
>> Performing this action will cause all 8GB to be pinned, so if that's
>> not what you want, then please ignore this step altogether.
> This does not make it clear what the benefit of disabling this
> capability is. I think it's best to avoid options and just use
> chunk-based registration always.
> If it's here "so people can play with it" then please rename
> it to something like "x-unsupported-chunk_register_destination"
> so people know this is unsupported and not to be used for production.

Again, please drop the request for removing chunking.

Paolo already told me to use "x-rdma" - so that's enough for now.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 18:51                                           ` Michael S. Tsirkin
@ 2013-04-14 19:43                                             ` Michael R. Hines
  2013-04-14 21:16                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-14 19:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
>> On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
>>>> Second, as I've explained, I strongly, strongly disagree with unregistering
>>>> memory for all of the aforementioned reasons - workloads do not
>>>> operate in such a manner that they can tolerate memory to be
>>>> pulled out from underneath them at such fine-grained time scales
>>>> in the *middle* of a relocation and I will not commit to writing a solution
>>>> for a problem that doesn't exist.
>>> Exactly same thing happens with swap, doesn't it?
>>> You are saying workloads simply can not tolerate swap.
>>>
>>>> If you can prove (through some kind of analysis) that workloads
>>>> would benefit from this kind of fine-grained memory overcommit
>>>> by having cgroups swap out memory to disk underneath them
>>>> without their permission, I would happily reconsider my position.
>>>>
>>>> - Michael
>>> This has nothing to do with cgroups directly, it's just a way to
>>> demonstrate you have a bug.
>>>
>> If your datacenter or your cloud or your product does not want to
>> tolerate page registration, then don't use RDMA!
>>
>> The bottom line is: RDMA is useless without page registration. Without
>> it, the performance of it will be crippled. If you define that as a bug,
>> then so be it.
>>
>> - Michael
> No one cares if you do page registration or not.  ulimit -l 10g is the
> problem.  You should limit the amount of locked memory.
> Lots of good research went into making RDMA go fast with limited locked
> memory, with some success. Search for "registration cache" for example.
>

Patches using such a cache would be welcome.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 19:06                                                     ` Michael R. Hines
@ 2013-04-14 21:10                                                       ` Michael S. Tsirkin
  2013-04-15  1:06                                                         ` Michael R. Hines
  2013-04-15  8:26                                                       ` Paolo Bonzini
  1 sibling, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14 21:10 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
> On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>>>version 1 of the patch.
> >>>>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>>>transport and capability) and would like people to play with it.
> >>>>>>>>
> >>>>>>>>Paolo
> >>>>>>>But it's not testable yet.  I see problems just reading the
> >>>>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>>>the major bugs first.
> >>>>>>>
> >>>>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>>>and despite being available for years, this was one of two reasons that
> >>>>>>have kept this feature out of the hands of lots of users (and assuming guest
> >>>>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>>>since it depends on a well-behaved guest to work correctly).
> >>>>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>>>
> >>>>>>Paolo
> >>>>>It looks like we have to decide, before merging, whether migration with
> >>>>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>>>it very clear he does not intend to make it work with overcommit, ever.
> >>>>>
> >>>>That depends entirely on what you define as overcommit.
> >>>You don't get to define your own terms.  Look it up in Wikipedia or
> >>>something.
> >>>
> >>>>The pages do get unregistered at the end of the migration =)
> >>>>
> >>>>- Michael
> >>>The limitations are pretty clear, and you really should document them:
> >>>
> >>>1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >>>   destination
> >>>
> >>>2. expect that as much as that amount of memory is pinned
> >>>   and unavailable to the host kernel and applications for
> >>>   an arbitrarily long time.
> >>>   Make sure you have much more RAM in the host or QEMU will get killed.
> >>>
> >>>To me, especially 1 is an unacceptable security tradeoff.
> >>>It is entirely fixable but we both have other priorities,
> >>>so it'll stay broken.
> >>>
> >>I've modified the beginning of docs/rdma.txt to say the following:
> >It really should say this, in a very prominent place:
> >
> >BUGS:
> Not a bug. We'll have to agree to disagree. Please drop this.

It's not a feature: it makes management harder and
will bite some users who are not careful enough
to read the documentation and know what to expect.

> >1. You must run qemu as root, or under
> >    ulimit -l <total guest memory> on both source and destination
> Good, will update the documentation now.
> >2. Expect as much as that amount of memory to be locked
> >    and unavailable to the host kernel and applications for
> >    an arbitrarily long time.
> >    Make sure you have much more RAM in the host, otherwise QEMU,
> >    or some other arbitrary application on the same host, will get killed.
> This is implied already. The docs say "If you don't want pinning,
> then use TCP".
> That's enough warning.

No, it's not. Pinning is jargon, and does not mean locking
up gigabytes.  Why are you using jargon?
Explain the limitation in plain English so people know
when to expect things to work.

> >3. Migration with RDMA support is experimental and unsupported.
> >    In particular, please do not expect it to work across qemu versions,
> >    and do not expect the management interface to be stable.
> 
> The only correct statement here is that it's experimental.
> 
> I will update the docs to reflect that.
> 
> >>$ cat docs/rdma.txt
> >>
> >>... snip ..
> >>
> >>BEFORE RUNNING:
> >>===============
> >>
> >>Use of RDMA requires pinning and registering memory with the
> >>hardware. If this is not acceptable for your application or
> >>product, then the use of RDMA is strongly discouraged and you
> >>should revert back to standard TCP-based migration.
> >No one knows or should know what "pinning and registering" means.
> 
> I will define it in the docs, then.

Keep it simple. Just tell people what they need to know.
It's silly to expect users to understand internals of
the product before they even try it for the first time.

> >For which applications and products is it appropriate?
> 
> That's up to the vendor or user to decide, not us.

With zero information so far, no one will be
able to decide.

> >Also, you are talking about current QEMU
> >code using RDMA for migration but say "RDMA" generally.
> 
> Sure, I will fix the docs.
> 
> >>Next, decide if you want dynamic page registration on the server-side.
> >>For example, if you have an 8GB RAM virtual machine, but only 1GB
> >>is in active use, then disabling this feature will cause all 8GB to
> >>be pinned and resident in memory. This feature mostly affects the
> >>bulk-phase round of the migration and can be disabled for extremely
> >>high-performance RDMA hardware using the following command:
> >>QEMU Monitor Command:
> >>$ migrate_set_capability chunk_register_destination off # enabled by default
> >>
> >>Performing this action will cause all 8GB to be pinned, so if that's
> >>not what you want, then please ignore this step altogether.
> >This does not make it clear what is the benefit of disabling this
> >capability. I think it's best to avoid options, just use chunk
> >based always.
> >If it's here "so people can play with it" then please rename
> >it to something like "x-unsupported-chunk_register_destination"
> >so people know this is unsupported and not to be used for production.
> 
> Again, please drop the request for removing chunking.
> 
> Paolo already told me to use "x-rdma" - so that's enough for now.
> 
> - Michael

You are adding a new command that's also experimental, so you must tag
it explicitly too.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 19:43                                             ` Michael R. Hines
@ 2013-04-14 21:16                                               ` Michael S. Tsirkin
  2013-04-15  1:10                                                 ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-14 21:16 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 03:43:28PM -0400, Michael R. Hines wrote:
> On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> >>>On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> >>>>Second, as I've explained, I strongly, strongly disagree with unregistering
> >>>>memory for all of the aforementioned reasons - workloads do not
> >>>>operate in such a manner that they can tolerate memory to be
> >>>>pulled out from underneath them at such fine-grained time scales
> >>>>in the *middle* of a relocation and I will not commit to writing a solution
> >>>>for a problem that doesn't exist.
> >>>Exactly the same thing happens with swap, doesn't it?
> >>>You are saying workloads simply can not tolerate swap.
> >>>
> >>>>If you can prove (through some kind of analysis) that workloads
> >>>>would benefit from this kind of fine-grained memory overcommit
> >>>>by having cgroups swap out memory to disk underneath them
> >>>>without their permission, I would happily reconsider my position.
> >>>>
> >>>>- Michael
> >>>This has nothing to do with cgroups directly, it's just a way to
> >>>demonstrate you have a bug.
> >>>
> >>If your datacenter or your cloud or your product does not want to
> >>tolerate page registration, then don't use RDMA!
> >>
> >>The bottom line is: RDMA is useless without page registration. Without
> >>it, the performance of it will be crippled. If you define that as a bug,
> >>then so be it.
> >>
> >>- Michael
> >No one cares if you do page registration or not.  ulimit -l 10g is the
> >problem.  You should limit the amount of locked memory.
> >Lots of good research went into making RDMA go fast with limited locked
> >memory, with some success. Search for "registration cache" for example.
> >
> 
> Patches using such a cache would be welcome.
> 
> - Michael
> 

And when someone writes them one day, we'll have to carry the old code
around for interoperability as well. Not pretty.  To avoid that, you
need to explicitly say in the documentation that it's experimental and
unsupported.
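
[Editorial note: the "registration cache" approach mentioned above can be sketched roughly as follows. This is an illustrative model only (the names, chunk size, and eviction policy are invented, and register/unregister stand in for ibv_reg_mr()/ibv_dereg_mr()), not the actual designs from the research literature.]

```python
from collections import OrderedDict

CHUNK_SIZE = 1 << 20  # 1 MiB chunks; an arbitrary illustrative granularity

class RegistrationCache:
    """Keeps at most `max_pinned` bytes registered at once; the least
    recently used chunk is unregistered to make room, bounding locked
    memory instead of pinning the whole guest RAM."""
    def __init__(self, max_pinned):
        self.max_pinned = max_pinned
        self.chunks = OrderedDict()   # chunk index -> pinned size

    def lookup_or_register(self, chunk_index):
        if chunk_index in self.chunks:          # cache hit: reuse registration
            self.chunks.move_to_end(chunk_index)
            return True
        while (len(self.chunks) + 1) * CHUNK_SIZE > self.max_pinned:
            self.chunks.popitem(last=False)     # evict LRU chunk (unregister)
        self.chunks[chunk_index] = CHUNK_SIZE   # register the new chunk
        return False

    def pinned_bytes(self):
        return len(self.chunks) * CHUNK_SIZE

cache = RegistrationCache(max_pinned=4 * CHUNK_SIZE)
for i in [0, 1, 2, 3, 4, 0]:        # the 5th chunk evicts the LRU entry
    cache.lookup_or_register(i)
print(cache.pinned_bytes() <= 4 * CHUNK_SIZE)  # True
```

The trade-off is extra unregister/re-register work on cache misses in exchange for a hard bound on locked memory.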

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 21:10                                                       ` Michael S. Tsirkin
@ 2013-04-15  1:06                                                         ` Michael R. Hines
  2013-04-15  6:00                                                           ` Michael S. Tsirkin
  2013-04-15  8:28                                                           ` Paolo Bonzini
  0 siblings, 2 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-15  1:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
>>>>> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>>>>>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>>>>>> On 12/04/2013 13:25, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>>>>>> On 12/04/2013 12:48, Michael S. Tsirkin wrote:
>>>>>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>>>>>> version 1 of the patch.
>>>>>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>>>>>> transport and capability) and would like people to play with it.
>>>>>>>>>>
>>>>>>>>>> Paolo
>>>>>>>>> But it's not testable yet.  I see problems just reading the
>>>>>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>>>>>> the major bugs first.
>>>>>>>>>
>>>>>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>>>>>> and despite being available for years, this was one of two reasons that
>>>>>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>>>>>> since it depends on a well-behaved guest to work correctly).
>>>>>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>>>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>>>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>>>>>
>>>>>>>> Paolo
>>>>>>> It looks like we have to decide, before merging, whether migration with
>>>>>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>>>>>> it very clear he does not intend to make it work with overcommit, ever.
>>>>>>>
>>>>>> That depends entirely on what you define as overcommit.
>>>>> You don't get to define your own terms.  Look it up in wikipedia or
>>>>> something.
>>>>>
>>>>>> The pages do get unregistered at the end of the migration =)
>>>>>>
>>>>>> - Michael
>>>>> The limitations are pretty clear, and you really should document them:
>>>>>
>>>>> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>>>>>    destination
>>>>>
>>>>> 2. expect that as much as that amount of memory is pinned
>>>>>    and unavailable to host kernel and applications for
>>>>>    arbitrarily long time.
>>>>>    Make sure you have much more RAM in host or QEMU will get killed.
>>>>>
>>>>> To me, especially 1 is an unacceptable security tradeoff.
>>>>> It is entirely fixable but we both have other priorities,
>>>>> so it'll stay broken.
>>>>>
>>>> I've modified the beginning of docs/rdma.txt to say the following:
>>> It really should say this, in a very prominent place:
>>>
>>> BUGS:
>> Not a bug. We'll have to agree to disagree. Please drop this.
> It's not a feature, it makes management harder and
> will bite some users who are not careful enough
> to read documentation and know what to expect.

Something that does not exist cannot be a bug. That's called a 
non-existent optimization.

>>> 1. You must run qemu as root, or under
>>>     ulimit -l <total guest memory> on both source and destination
>> Good, will update the documentation now.
>>> 2. Expect as much as that amount of memory to be locked
>>>     and unavailable to host kernel and applications for
>>>     arbitrarily long time.
>>>     Make sure you have much more RAM in host otherwise QEMU,
>>>     or some other arbitrary application on same host, will get killed.
>> This is implied already. The docs say "If you don't want pinning,
>> then use TCP".
>> That's enough warning.
> No it's not. Pinning is jargon, and does not mean locking
> up gigabytes.  Why are you using jargon?
> Explain the limitation in plain English so people know
> when to expect things to work.

Already done.

>>> 3. Migration with RDMA support is experimental and unsupported.
>>>     In particular, please do not expect it to work across qemu versions,
>>>     and do not expect the management interface to be stable.
>> The only correct statement here is that it's experimental.
>>
>> I will update the docs to reflect that.
>>
>>>> $ cat docs/rdma.txt
>>>>
>>>> ... snip ..
>>>>
>>>> BEFORE RUNNING:
>>>> ===============
>>>>
>>>> Use of RDMA requires pinning and registering memory with the
>>>> hardware. If this is not acceptable for your application or
>>>> product, then the use of RDMA is strongly discouraged and you
>>>> should revert back to standard TCP-based migration.
>>> No one knows or should know what "pinning and registering" means.
>> I will define it in the docs, then.
> Keep it simple. Just tell people what they need to know.
> It's silly to expect users to understand internals of
> the product before they even try it for the first time.

Agreed.

>>> For which applications and products is it appropriate?
>> That's up to the vendor or user to decide, not us.
> With zero information so far, no one will be
> able to decide.

There is plenty of information. Including this email thread.


>>> Also, you are talking about current QEMU
>>> code using RDMA for migration but say "RDMA" generally.
>> Sure, I will fix the docs.
>>
>>>> Next, decide if you want dynamic page registration on the server-side.
>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>> is in active use, then disabling this feature will cause all 8GB to
>>>> be pinned and resident in memory. This feature mostly affects the
>>>> bulk-phase round of the migration and can be disabled for extremely
>>>> high-performance RDMA hardware using the following command:
>>>> QEMU Monitor Command:
>>>> $ migrate_set_capability chunk_register_destination off # enabled by default
>>>>
>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>> not what you want, then please ignore this step altogether.
>>> This does not make it clear what is the benefit of disabling this
>>> capability. I think it's best to avoid options, just use chunk
>>> based always.
>>> If it's here "so people can play with it" then please rename
>>> it to something like "x-unsupported-chunk_register_destination"
>>> so people know this is unsupported and not to be used for production.
>> Again, please drop the request for removing chunking.
>>
>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>
>> - Michael
> You are adding a new command that's also experimental, so you must tag
> it explicitly too.

The entire migration is experimental - which by extension makes the 
capability experimental.
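
[Editorial note: the trade-off behind the chunk_register_destination capability quoted above can be modeled roughly like this. The numbers mirror the 8 GB guest / 1 GB active example from the docs; the function is an illustration, not QEMU's actual accounting.]

```python
CHUNK = 1 << 20  # illustrative 1 MiB registration granularity

def pinned_during_bulk_phase(ram_bytes, active_chunks, chunked):
    """Rough model of the capability's effect: with chunked (dynamic)
    registration only chunks that actually carry data get pinned; with
    the capability disabled, the whole RAM region is pinned up front."""
    if chunked:
        return len(active_chunks) * CHUNK
    return ram_bytes

ram = 8 << 30                            # the 8 GB guest from the docs
active = set(range((1 << 30) // CHUNK))  # ~1 GB in active use
print(pinned_during_bulk_phase(ram, active, chunked=True) // (1 << 30))   # 1
print(pinned_during_bulk_phase(ram, active, chunked=False) // (1 << 30))  # 8
```

Disabling chunking trades that 8x footprint increase for fewer registration round-trips on very fast hardware.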


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 21:16                                               ` Michael S. Tsirkin
@ 2013-04-15  1:10                                                 ` Michael R. Hines
  2013-04-15  6:10                                                   ` Michael S. Tsirkin
  2013-04-15  8:34                                                   ` Paolo Bonzini
  0 siblings, 2 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-15  1:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/14/2013 05:16 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 03:43:28PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
>>>>>> Second, as I've explained, I strongly, strongly disagree with unregistering
>>>>>> memory for all of the aforementioned reasons - workloads do not
>>>>>> operate in such a manner that they can tolerate memory to be
>>>>>> pulled out from underneath them at such fine-grained time scales
>>>>>> in the *middle* of a relocation and I will not commit to writing a solution
>>>>>> for a problem that doesn't exist.
>>>>> Exactly the same thing happens with swap, doesn't it?
>>>>> You are saying workloads simply can not tolerate swap.
>>>>>
>>>>>> If you can prove (through some kind of analysis) that workloads
>>>>>> would benefit from this kind of fine-grained memory overcommit
>>>>>> by having cgroups swap out memory to disk underneath them
>>>>>> without their permission, I would happily reconsider my position.
>>>>>>
>>>>>> - Michael
>>>>> This has nothing to do with cgroups directly, it's just a way to
>>>>> demonstrate you have a bug.
>>>>>
>>>> If your datacenter or your cloud or your product does not want to
>>>> tolerate page registration, then don't use RDMA!
>>>>
>>>> The bottom line is: RDMA is useless without page registration. Without
>>>> it, the performance of it will be crippled. If you define that as a bug,
>>>> then so be it.
>>>>
>>>> - Michael
>>> No one cares if you do page registration or not.  ulimit -l 10g is the
>>> problem.  You should limit the amount of locked memory.
>>> Lots of good research went into making RDMA go fast with limited locked
>>> memory, with some success. Search for "registration cache" for example.
>>>
>> Patches using such a cache would be welcome.
>>
>> - Michael
>>
> And when someone writes them one day, we'll have to carry the old code
> around for interoperability as well. Not pretty.  To avoid that, you
> need to explicitly say in the documentation that it's experimental and
> unsupported.
>

That's what protocols are for.

As I've already said, I've incorporated this into the design of the protocol
already.

The protocol already has a field called "repeat" which allows a user to
request multiple chunk registrations at the same time.

If you insist, I can add a capability / command to the protocol called 
"unregister chunk",
but I'm not volunteering to implement that command as I don't have any data
showing it to be of any value.

That would insulate the protocol against any such future "registration 
cache" design.

- Michael
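
[Editorial note: the batched-registration idea behind the "repeat" field mentioned above could look something like this on the wire. The layout below is purely hypothetical, invented for illustration; it is not the actual QEMU RDMA control-channel format.]

```python
import struct

# Hypothetical register-request control message: a fixed header carrying
# a "repeat" count, followed by that many 64-bit chunk indexes, so one
# message can request several chunk registrations at once.
HDR = struct.Struct('>IHH')   # body length, message type, repeat count
REGISTER_REQUEST = 1

def pack_register_request(chunk_indexes):
    body = b''.join(struct.pack('>Q', c) for c in chunk_indexes)
    return HDR.pack(len(body), REGISTER_REQUEST, len(chunk_indexes)) + body

def unpack_register_request(data):
    length, msg_type, repeat = HDR.unpack_from(data)
    assert msg_type == REGISTER_REQUEST
    offset = HDR.size
    return [struct.unpack_from('>Q', data, offset + 8 * i)[0]
            for i in range(repeat)]

msg = pack_register_request([3, 4, 5])
print(unpack_register_request(msg))  # [3, 4, 5]
```

A future "unregister chunk" command would simply be another message type in the same framing, which is how a protocol can leave room for a registration cache without implementing one.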


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  1:06                                                         ` Michael R. Hines
@ 2013-04-15  6:00                                                           ` Michael S. Tsirkin
  2013-04-15 13:07                                                             ` Michael R. Hines
  2013-04-15  8:28                                                           ` Paolo Bonzini
  1 sibling, 1 reply; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-15  6:00 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 09:06:36PM -0400, Michael R. Hines wrote:
> On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >>>>>On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>>>>>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>>>>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>>>>>On 12/04/2013 13:25, Michael S. Tsirkin wrote:
> >>>>>>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>>>>>On 12/04/2013 12:48, Michael S. Tsirkin wrote:
> >>>>>>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>>>>>version 1 of the patch.
> >>>>>>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>>>>>transport and capability) and would like people to play with it.
> >>>>>>>>>>
> >>>>>>>>>>Paolo
> >>>>>>>>>But it's not testable yet.  I see problems just reading the
> >>>>>>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>>>>>the major bugs first.
> >>>>>>>>>
> >>>>>>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>>>>>and despite being available for years, this was one of two reasons that
> >>>>>>>>>has kept this feature out of hands of lots of users (and assuming guest
> >>>>>>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>>>>>since it depends on a well-behaved guest to work correctly).
> >>>>>>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>>>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>>>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>>>>>
> >>>>>>>>Paolo
> >>>>>>>It looks like we have to decide, before merging, whether migration with
> >>>>>>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>>>>>it very clear he does not intend to make it work with overcommit, ever.
> >>>>>>>
> >>>>>>That depends entirely on what you define as overcommit.
> >>>>>You don't get to define your own terms.  Look it up in wikipedia or
> >>>>>something.
> >>>>>
> >>>>>>The pages do get unregistered at the end of the migration =)
> >>>>>>
> >>>>>>- Michael
> >>>>>The limitations are pretty clear, and you really should document them:
> >>>>>
> >>>>>1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >>>>>   destination
> >>>>>
> >>>>>2. expect that as much as that amount of memory is pinned
> >>>>>   and unavailable to host kernel and applications for
> >>>>>   arbitrarily long time.
> >>>>>   Make sure you have much more RAM in host or QEMU will get killed.
> >>>>>
> >>>>>To me, especially 1 is an unacceptable security tradeoff.
> >>>>>It is entirely fixable but we both have other priorities,
> >>>>>so it'll stay broken.
> >>>>>
> >>>>I've modified the beginning of docs/rdma.txt to say the following:
> >>>It really should say this, in a very prominent place:
> >>>
> >>>BUGS:
> >>Not a bug. We'll have to agree to disagree. Please drop this.
> >It's not a feature, it makes management harder and
> >will bite some users who are not careful enough
> >to read documentation and know what to expect.
> 
> Something that does not exist cannot be a bug. That's called a
> non-existent optimization.

No because overcommit already exists, and works with migration.  It's
your patch that breaks it.  We already have a ton of migration variants
and they all work fine.  So in 2013 overcommit is a given.

Look we can include code with known bugs, but we have to be very
explicit about them, because someone *will* be confused.  If it's a hard
bug to fix it won't get solved quickly but please stop pretending it's
perfect.



> >>>1. You must run qemu as root, or under
> >>>    ulimit -l <total guest memory> on both source and destination
> >>Good, will update the documentation now.
> >>>2. Expect as much as that amount of memory to be locked
> >>>    and unavailable to host kernel and applications for
> >>>    arbitrarily long time.
> >>>    Make sure you have much more RAM in host otherwise QEMU,
> >>>    or some other arbitrary application on same host, will get killed.
> >>This is implied already. The docs say "If you don't want pinning,
> >>then use TCP".
> >>That's enough warning.
> >No it's not. Pinning is jargon, and does not mean locking
> >up gigabytes.  Why are you using jargon?
> >Explain the limitation in plain English so people know
> >when to expect things to work.
> 
> Already done.
> 
> >>>3. Migration with RDMA support is experimental and unsupported.
> >>>    In particular, please do not expect it to work across qemu versions,
> >>>    and do not expect the management interface to be stable.
> >>The only correct statement here is that it's experimental.
> >>
> >>I will update the docs to reflect that.
> >>
> >>>>$ cat docs/rdma.txt
> >>>>
> >>>>... snip ..
> >>>>
> >>>>BEFORE RUNNING:
> >>>>===============
> >>>>
> >>>>Use of RDMA requires pinning and registering memory with the
> >>>>hardware. If this is not acceptable for your application or
> >>>>product, then the use of RDMA is strongly discouraged and you
> >>>>should revert back to standard TCP-based migration.
> >>>No one knows or should know what "pinning and registering" means.
> >>I will define it in the docs, then.
> >Keep it simple. Just tell people what they need to know.
> >It's silly to expect users to understand internals of
> >the product before they even try it for the first time.
> 
> Agreed.
> 
> >>>For which applications and products is it appropriate?
> >>That's up to the vendor or user to decide, not us.
> >With zero information so far, no one will be
> >able to decide.
> 
> There is plenty of information. Including this email thread.

Nowhere in this email thread or in your patchset did you tell anyone for
which applications and products it is appropriate.  You also expect
someone to answer this question before they run your code.  It looks
like the purpose of this phrase is to assign blame rather than to
inform.

> 
> >>>Also, you are talking about current QEMU
> >>>code using RDMA for migration but say "RDMA" generally.
> >>Sure, I will fix the docs.
> >>
> >>>>Next, decide if you want dynamic page registration on the server-side.
> >>>>For example, if you have an 8GB RAM virtual machine, but only 1GB
> >>>>is in active use, then disabling this feature will cause all 8GB to
> >>>>be pinned and resident in memory. This feature mostly affects the
> >>>>bulk-phase round of the migration and can be disabled for extremely
> >>>>high-performance RDMA hardware using the following command:
> >>>>QEMU Monitor Command:
> >>>>$ migrate_set_capability chunk_register_destination off # enabled by default
> >>>>
> >>>>Performing this action will cause all 8GB to be pinned, so if that's
> >>>>not what you want, then please ignore this step altogether.
> >>>This does not make it clear what is the benefit of disabling this
> >>>capability. I think it's best to avoid options, just use chunk
> >>>based always.
> >>>If it's here "so people can play with it" then please rename
> >>>it to something like "x-unsupported-chunk_register_destination"
> >>>so people know this is unsupported and not to be used for production.
> >>Again, please drop the request for removing chunking.
> >>
> >>Paolo already told me to use "x-rdma" - so that's enough for now.
> >>
> >>- Michael
> >You are adding a new command that's also experimental, so you must tag
> >it explicitly too.
> 
> The entire migration is experimental - which by extension makes the
> capability experimental.

Again the purpose of documentation is not to educate people about
qemu or rdma internals but to educate them how to use a feature.
It doesn't even mention rdma anywhere in the name of the capability.
Users won't make the connection.  You also didn't bother telling anyone
when to set the option.  Is it here "to be able to play with it"?  Does
it have any purpose for users not in a playful mood?  If yes your
documentation should say what it is, if no mention that.

-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  1:10                                                 ` Michael R. Hines
@ 2013-04-15  6:10                                                   ` Michael S. Tsirkin
  2013-04-15  8:34                                                   ` Paolo Bonzini
  1 sibling, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-15  6:10 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Sun, Apr 14, 2013 at 09:10:36PM -0400, Michael R. Hines wrote:
> On 04/14/2013 05:16 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 03:43:28PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> >>>>>On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> >>>>>>Second, as I've explained, I strongly, strongly disagree with unregistering
> >>>>>>memory for all of the aforementioned reasons - workloads do not
> >>>>>>operate in such a manner that they can tolerate memory to be
> >>>>>>pulled out from underneath them at such fine-grained time scales
> >>>>>>in the *middle* of a relocation and I will not commit to writing a solution
> >>>>>>for a problem that doesn't exist.
> >>>>>Exactly the same thing happens with swap, doesn't it?
> >>>>>You are saying workloads simply can not tolerate swap.
> >>>>>
> >>>>>>If you can prove (through some kind of analysis) that workloads
> >>>>>>would benefit from this kind of fine-grained memory overcommit
> >>>>>>by having cgroups swap out memory to disk underneath them
> >>>>>>without their permission, I would happily reconsider my position.
> >>>>>>
> >>>>>>- Michael
> >>>>>This has nothing to do with cgroups directly, it's just a way to
> >>>>>demonstrate you have a bug.
> >>>>>
> >>>>If your datacenter or your cloud or your product does not want to
> >>>>tolerate page registration, then don't use RDMA!
> >>>>
> >>>>The bottom line is: RDMA is useless without page registration. Without
> >>>>it, the performance of it will be crippled. If you define that as a bug,
> >>>>then so be it.
> >>>>
> >>>>- Michael
> >>>No one cares if you do page registration or not.  ulimit -l 10g is the
> >>>problem.  You should limit the amount of locked memory.
> >>>Lots of good research went into making RDMA go fast with limited locked
> >>>memory, with some success. Search for "registration cache" for example.
> >>>
> >>Patches using such a cache would be welcome.
> >>
> >>- Michael
> >>
> >And when someone writes them one day, we'll have to carry the old code
> >around for interoperability as well. Not pretty.  To avoid that, you
> >need to explicitly say in the documentation that it's experimental and
> >unsupported.
> >
> 
> That's what protocols are for.
> 
> As I've already said, I've incorporated this into the design of the protocol
> already.
> 
> The protocol already has a field called "repeat" which allows a user to
> request multiple chunk registrations at the same time.
> If you insist, I can add a capability / command to the protocol
> called "unregister chunk",
> but I'm not volunteering to implement that command as I don't have any data
> showing it to be of any value.

The value would be being able to run your code in qemu as an unprivileged
user.

> That would insulate the protocol against any such future
> "registration cache" design.
> 
> - Michael
>

It won't.  If it's unimplemented it won't be of any use since now your
code does not implement the protocol fully.
 
-- 
MST


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-14 19:06                                                     ` Michael R. Hines
  2013-04-14 21:10                                                       ` Michael S. Tsirkin
@ 2013-04-15  8:26                                                       ` Paolo Bonzini
  1 sibling, 0 replies; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-15  8:26 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 14/04/2013 21:06, Michael R. Hines wrote:
> 
>> 3. Migration with RDMA support is experimental and unsupported.
>>     In particular, please do not expect it to work across qemu versions,
>>     and do not expect the management interface to be stable.
>>     
> 
> The only correct statement here is that it's experimental.

Actually no, this is correct.  The capabilities are experimental too,
the "x-rdma" will become "rdma" in the future, and we are free to modify
the protocol.

Will it happen?  Perhaps not.  But for the moment, that is the
situation.  The alternative is not merging, and it is a much worse
alternative IMHO.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  1:06                                                         ` Michael R. Hines
  2013-04-15  6:00                                                           ` Michael S. Tsirkin
@ 2013-04-15  8:28                                                           ` Paolo Bonzini
  2013-04-15 13:08                                                             ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-15  8:28 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 15/04/2013 03:06, Michael R. Hines wrote:
>>>>> Next, decide if you want dynamic page registration on the server-side.
>>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>>> is in active use, then disabling this feature will cause all 8GB to
>>>>> be pinned and resident in memory. This feature mostly affects the
>>>>> bulk-phase round of the migration and can be disabled for extremely
>>>>> high-performance RDMA hardware using the following command:
>>>>> QEMU Monitor Command:
>>>>> $ migrate_set_capability chunk_register_destination off # enabled
>>>>> by default
>>>>>
>>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>>> not what you want, then please ignore this step altogether.
>>>> This does not make it clear what the benefit of disabling this
>>>> capability is. I think it's best to avoid options, just use chunk
>>>> based always.
>>>> If it's here "so people can play with it" then please rename
>>>> it to something like "x-unsupported-chunk_register_destination"
>>>> so people know this is unsupported and not to be used for production.
>>> Again, please drop the request for removing chunking.
>>>
>>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>
>> You are adding a new command that's also experimental, so you must tag
>> it explicitly too.
> 
> The entire migration is experimental - which by extension makes the
> capability experimental.

You still have to mark it as "x-".  Of course not "x-unsupported-", that
is a pleonasm.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  1:10                                                 ` Michael R. Hines
  2013-04-15  6:10                                                   ` Michael S. Tsirkin
@ 2013-04-15  8:34                                                   ` Paolo Bonzini
  2013-04-15 13:24                                                     ` Michael R. Hines
  1 sibling, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-15  8:34 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 15/04/2013 03:10, Michael R. Hines ha scritto:
>>>
>> And when someone writes them one day, we'll have to carry the old code
>> around for interoperability as well. Not pretty.  To avoid that, you
>> need to explicitly say in the documentation that it's experimental and
>> unsupported.
>>
> 
> That's what protocols are for.
> 
> As I've already said, I've incorporated this into the design of the
> protocol
> already.
> 
> The protocol already has a field called "repeat" which allows a user to
> request multiple chunk registrations at the same time.
> 
> If you insist, I can add a capability / command to the protocol called
> "unregister chunk",
> but I'm not volunteering to implement that command as I don't have any data
> showing it to be of any value.

Implementing it on the destination side would be of value because it
would make the implementation interoperable.

A very basic implementation would be "during the bulk phase, unregister
the previous chunk every time you register a chunk".  It would work
great when migrating an idle guest, for example.  It would probably be
faster than TCP (which is now at 4.2 Gbps).

On one hand this should not block merging the patches; on the other
hand, "agreeing to disagree" without having done any test is not very
fruitful.  You can disagree on the priorities (and I agree with you on
this), but what mst is proposing is absolutely reasonable.

Paolo

> That would insulate the protocol against any such future "registration
> cache" design.


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  6:00                                                           ` Michael S. Tsirkin
@ 2013-04-15 13:07                                                             ` Michael R. Hines
  2013-04-15 22:20                                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-15 13:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On 04/15/2013 02:00 AM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 09:06:36PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
>>>>> On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
>>>>>> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
>>>>>>> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>>>>>>>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>>>>>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>>>>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>>>>>>>> version 1 of the patch.
>>>>>>>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>>>>>>>> transport and capability) and would like people to play with it.
>>>>>>>>>>>>
>>>>>>>>>>>> Paolo
>>>>>>>>>>> But it's not testable yet.  I see problems just reading the
>>>>>>>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>>>>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>>>>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>>>>>>>> the major bugs first.
>>>>>>>>>>>
>>>>>>>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>>>>>>>> and despite being available for years, this was one of two reasons that
>>>>>>>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>>>>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>>>>>>>> since it depends on a well-behaved guest to work correctly).
>>>>>>>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>>>>>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>>>>>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>>>>>>>
>>>>>>>>>> Paolo
>>>>>>>>> It looks like we have to decide, before merging, whether migration with
>>>>>>>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>>>>>>>> it very clear he does not intend to make it work with overcommit, ever.
>>>>>>>>>
>>>>>>>> That depends entirely on what you define as overcommit.
>>>>>>> You don't get to define your own terms.  Look it up in wikipedia or
>>>>>>> something.
>>>>>>>
>>>>>>>> The pages do get unregistered at the end of the migration =)
>>>>>>>>
>>>>>>>> - Michael
>>>>>>> The limitations are pretty clear, and you really should document them:
>>>>>>>
>>>>>>> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>>>>>>>    destination
>>>>>>>
>>>>>>> 2. expect that as much as that amount of memory is pinned
>>>>>>>    and unavailable to host kernel and applications for
>>>>>>>    arbitrarily long time.
>>>>>>>    Make sure you have much more RAM in host or QEMU will get killed.
>>>>>>>
>>>>>>> To me, especially 1 is an unacceptable security tradeoff.
>>>>>>> It is entirely fixable but we both have other priorities,
>>>>>>> so it'll stay broken.
>>>>>>>
>>>>>> I've modified the beginning of docs/rdma.txt to say the following:
>>>>> It really should say this, in a very prominent place:
>>>>>
>>>>> BUGS:
>>>> Not a bug. We'll have to agree to disagree. Please drop this.
>>> It's not a feature, it makes management harder and
>>> will bite some users who are not careful enough
>>> to read documentation and know what to expect.
>> Something that does not exist cannot be a bug. That's called a
>> non-existent optimization.
> No, because overcommit already exists, and works with migration.  It's
> your patch that breaks it.  We already have a ton of migration variants
> and they all work fine.  So in 2013 overcommit is a given.
>
> Look we can include code with known bugs, but we have to be very
> explicit about them, because someone *will* be confused.  If it's a hard
> bug to fix it won't get solved quickly but please stop pretending it's
> perfect.
>

Setting aside RDMA for the moment, are you trying to tell me that
someone would *willingly* migrate a VM to a hypervisor without first
validating (programmatically) whether or not the machine already has
enough memory to support the entire footprint of the VM?

If the answer to that question is yes, it's a bug.

That also means that *any* use of RDMA in any application in the
universe is also a bug, and that any HPC application running against
cgroups is also buggy.

I categorically refuse to believe that someone runs a datacenter in
this manner.

>
>>>>> 1. You must run qemu as root, or under
>>>>>     ulimit -l <total guest memory> on both source and destination
>>>> Good, will update the documentation now.
>>>>> 2. Expect that as much as that amount of memory to be locked
>>>>>     and unavailable to host kernel and applications for
>>>>>     arbitrarily long time.
>>>>>     Make sure you have much more RAM in host otherwise QEMU,
>>>>>     or some other arbitrary application on same host, will get killed.
>>>> This is implied already. The docs say "If you don't want pinning,
>>>> then use TCP".
>>>> That's enough warning.
>>> No it's not. Pinning is jargon, and does not mean locking
>>> up gigabytes.  Why are you using jargon?
>>> Explain the limitation in plain English so people know
>>> when to expect things to work.
>> Already done.
>>
>>>>> 3. Migration with RDMA support is experimental and unsupported.
>>>>>     In particular, please do not expect it to work across qemu versions,
>>>>>     and do not expect the management interface to be stable.
>>>> The only correct statement here is that it's experimental.
>>>>
>>>> I will update the docs to reflect that.
>>>>
>>>>>> $ cat docs/rdma.txt
>>>>>>
>>>>>> ... snip ..
>>>>>>
>>>>>> BEFORE RUNNING:
>>>>>> ===============
>>>>>>
>>>>>> Use of RDMA requires pinning and registering memory with the
>>>>>> hardware. If this is not acceptable for your application or
>>>>>> product, then the use of RDMA is strongly discouraged and you
>>>>>> should revert back to standard TCP-based migration.
>>>>> No one knows or should know what "pinning and registering" means.
>>>> I will define it in the docs, then.
>>> Keep it simple. Just tell people what they need to know.
>>> It's silly to expect users to understand internals of
>>> the product before they even try it for the first time.
>> Agreed.
>>
>>>>> For which applications and products is it appropriate?
>>>> That's up to the vendor or user to decide, not us.
>>> With zero information so far, no one will be
>>> able to decide.
>> There is plenty of information. Including this email thread.
> Nowhere in this email thread or in your patchset did you tell anyone for
> which applications and products it is appropriate.  You also expect
> someone to answer this question before they run your code.  It looks
> like the purpose of this phrase is to assign blame rather than to
> inform.
>>>>> Also, you are talking about current QEMU
>>>>> code using RDMA for migration but say "RDMA" generally.
>>>> Sure, I will fix the docs.
>>>>
>>>>>> Next, decide if you want dynamic page registration on the server-side.
>>>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>>>> is in active use, then disabling this feature will cause all 8GB to
>>>>>> be pinned and resident in memory. This feature mostly affects the
>>>>>> bulk-phase round of the migration and can be disabled for extremely
>>>>>> high-performance RDMA hardware using the following command:
>>>>>> QEMU Monitor Command:
>>>>>> $ migrate_set_capability chunk_register_destination off # enabled by default
>>>>>>
>>>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>>>> not what you want, then please ignore this step altogether.
>>>>> This does not make it clear what the benefit of disabling this
>>>>> capability is. I think it's best to avoid options, just use chunk
>>>>> based always.
>>>>> If it's here "so people can play with it" then please rename
>>>>> it to something like "x-unsupported-chunk_register_destination"
>>>>> so people know this is unsupported and not to be used for production.
>>>> Again, please drop the request for removing chunking.
>>>>
>>>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>>>
>>>> - Michael
>>> You are adding a new command that's also experimental, so you must tag
>>> it explicitly too.
>> The entire migration is experimental - which by extension makes the
>> capability experimental.
> Again the purpose of documentation is not to educate people about
> qemu or rdma internals but to educate them how to use a feature.
> It doesn't even mention rdma anywhere in the name of the capability.
> Users won't make the connection.  You also didn't bother telling anyone
> when to set the option.  Is it here "to be able to play with it"?  Does
> it have any purpose for users not in a playful mood?  If yes your
> documentation should say what it is, if no mention that.
>

The purpose of the capability is made blatantly clear in the documentation.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  8:28                                                           ` Paolo Bonzini
@ 2013-04-15 13:08                                                             ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-15 13:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/15/2013 04:28 AM, Paolo Bonzini wrote:
> Il 15/04/2013 03:06, Michael R. Hines ha scritto:
>>>>>> Next, decide if you want dynamic page registration on the server-side.
>>>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>>>> is in active use, then disabling this feature will cause all 8GB to
>>>>>> be pinned and resident in memory. This feature mostly affects the
>>>>>> bulk-phase round of the migration and can be disabled for extremely
>>>>>> high-performance RDMA hardware using the following command:
>>>>>> QEMU Monitor Command:
>>>>>> $ migrate_set_capability chunk_register_destination off # enabled
>>>>>> by default
>>>>>>
>>>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>>>> not what you want, then please ignore this step altogether.
>>>>> This does not make it clear what the benefit of disabling this
>>>>> capability is. I think it's best to avoid options, just use chunk
>>>>> based always.
>>>>> If it's here "so people can play with it" then please rename
>>>>> it to something like "x-unsupported-chunk_register_destination"
>>>>> so people know this is unsupported and not to be used for production.
>>>> Again, please drop the request for removing chunking.
>>>>
>>>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>> You are adding a new command that's also experimental, so you must tag
>>> it explicitly too.
>> The entire migration is experimental - which by extension makes the
>> capability experimental.
> You still have to mark it as "x-".  Of course not "x-unsupported-", that
> is a pleonasm.
>
> Paolo
>

Sure, I'm happy to add another 'x'. I will submit a patch with all the new
changes as soon as the pull completes.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15  8:34                                                   ` Paolo Bonzini
@ 2013-04-15 13:24                                                     ` Michael R. Hines
  2013-04-15 13:30                                                       ` Paolo Bonzini
  0 siblings, 1 reply; 97+ messages in thread
From: Michael R. Hines @ 2013-04-15 13:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/15/2013 04:34 AM, Paolo Bonzini wrote:
> Il 15/04/2013 03:10, Michael R. Hines ha scritto:
>>> And when someone writes them one day, we'll have to carry the old code
>>> around for interoperability as well. Not pretty.  To avoid that, you
>>> need to explicitly say in the documentation that it's experimental and
>>> unsupported.
>>>
>> That's what protocols are for.
>>
>> As I've already said, I've incorporated this into the design of the
>> protocol
>> already.
>>
>> The protocol already has a field called "repeat" which allows a user to
>> request multiple chunk registrations at the same time.
>>
>> If you insist, I can add a capability / command to the protocol called
>> "unregister chunk",
>> but I'm not volunteering to implement that command as I don't have any data
>> showing it to be of any value.
> Implementing it on the destination side would be of value because it
> would make the implementation interoperable.
>
> A very basic implementation would be "during the bulk phase, unregister
> the previous chunk every time you register a chunk".  It would work
> great when migrating an idle guest, for example.  It would probably be
> faster than TCP (which is now at 4.2 Gbps).
>
> On one hand this should not block merging the patches; on the other
> hand, "agreeing to disagree" without having done any test is not very
> fruitful.  You can disagree on the priorities (and I agree with you on
> this), but what mst is proposing is absolutely reasonable.
>
> Paolo

Ok, I think I understand the disconnect here. So let's continue to use
the above example that you described, and let me ask another question.

Let's say the above-mentioned idle VM is chosen, for whatever reason,
*not* to use TCP migration, but RDMA instead. (I recommend against
choosing RDMA in the current docs, but let's stick to this example for
the sake of argument).

Now, in this example, let's say the migration starts up and the hypervisor
has run out of physical memory and starts swapping during the migration.
(also for the sake of argument).

The next thing that would immediately happen is the next IB verbs
function call: "ib_reg_mr()".

This function call would probably fail because there's nothing else
left to pin, and the function call would return an error.

So my question is: is it not sufficient to send a message back to the
primary-VM side of the connection which says:

"Your migration cannot proceed anymore, please resume the VM and try
again somewhere else".

In this case, both the system administrator and the virtual machine are
safe: nothing has been killed, nothing has crashed, and the management
software can proceed to make a new management decision.

Is there something wrong with this sequence of events?

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15 13:24                                                     ` Michael R. Hines
@ 2013-04-15 13:30                                                       ` Paolo Bonzini
  2013-04-15 19:55                                                         ` Michael R. Hines
  0 siblings, 1 reply; 97+ messages in thread
From: Paolo Bonzini @ 2013-04-15 13:30 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 15/04/2013 15:24, Michael R. Hines ha scritto:
> Now, in this example, let's say the migration starts up and the hypervisor
> has run out of physical memory and starts swapping during the migration.
> (also for the sake of argument).
> 
> The next thing that would immediately happen is the
> next IB verbs function call: "ib_reg_mr()".
> 
> This function call would probably fail because there's nothing else left
> to pin and the function call would return an error.
> 
> So my question is: Is it not sufficient to send a message back to the
> primary-VM side of the connection which says:
> 
> "Your migration cannot proceed anymore, please resume the VM and try
> again somewhere else".
> 
> In this case, both the system administrator and the virtual machine are safe,
> nothing has been killed, nothing has crashed, and the management software
> can proceed to make a new management decision.
> 
> Is there something wrong with this sequence of events?

I think it's good enough.  "info migrate" will then report that
migration failed.

Paolo


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15 13:30                                                       ` Paolo Bonzini
@ 2013-04-15 19:55                                                         ` Michael R. Hines
  0 siblings, 0 replies; 97+ messages in thread
From: Michael R. Hines @ 2013-04-15 19:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/15/2013 09:30 AM, Paolo Bonzini wrote:
> Il 15/04/2013 15:24, Michael R. Hines ha scritto:
>> Now, in this example, let's say the migration starts up and the hypervisor
>> has run out of physical memory and starts swapping during the migration.
>> (also for the sake of argument).
>>
>> The next thing that would immediately happen is the
>> next IB verbs function call: "ib_reg_mr()".
>>
>> This function call would probably fail because there's nothing else left
>> to pin and the function call would return an error.
>>
>> So my question is: Is it not sufficient to send a message back to the
>> primary-VM side of the connection which says:
>>
>> "Your migration cannot proceed anymore, please resume the VM and try
>> again somewhere else".
>>
>> In this case, both the system administrator and the virtual machine are safe,
>> nothing has been killed, nothing has crashed, and the management software
>> can proceed to make a new management decision.
>>
>> Is there something wrong with this sequence of events?
> I think it's good enough.  "info migrate" will then report that
> migration failed.
>
> Paolo
>

Ok, that's good. So the current patch "[PATCH v2] rdma" is not handling
this particular error condition properly, so that's a bug.

I'll send out a trivial patch to fix this after the pull, along with all
the other documentation updates we have discussed.

- Michael


* Re: [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation
  2013-04-15 13:07                                                             ` Michael R. Hines
@ 2013-04-15 22:20                                                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 97+ messages in thread
From: Michael S. Tsirkin @ 2013-04-15 22:20 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, Paolo Bonzini

On Mon, Apr 15, 2013 at 09:07:01AM -0400, Michael R. Hines wrote:
> On 04/15/2013 02:00 AM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 09:06:36PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> >>>>>On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> >>>>>>On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >>>>>>>On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>>>>>>>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>>>>>>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>>>>>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>>>>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>>>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>>>>>>>version 1 of the patch.
> >>>>>>>>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>>>>>>>transport and capability) and would like people to play with it.
> >>>>>>>>>>>>
> >>>>>>>>>>>>Paolo
> >>>>>>>>>>>But it's not testable yet.  I see problems just reading the
> >>>>>>>>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>>>>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>>>>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>>>>>>>the major bugs first.
> >>>>>>>>>>>
> >>>>>>>>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>>>>>>>and despite being available for years, this was one of two reasons that
> >>>>>>>>>>>has kept this feature out of hands of lots of users (and assuming guest
> >>>>>>>>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>>>>>>>since it depends on a well-behaved guest to work correctly).
> >>>>>>>>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>>>>>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>>>>>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>>>>>>>
> >>>>>>>>>>Paolo
> >>>>>>>>>It looks like we have to decide, before merging, whether migration with
> >>>>>>>>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>>>>>>>it very clear he does not intend to make it work with overcommit, ever.
> >>>>>>>>>
> >>>>>>>>That depends entirely on what you define as overcommit.
> >>>>>>>You don't get to define your own terms.  Look it up in wikipedia or
> >>>>>>>something.
> >>>>>>>
> >>>>>>>>The pages do get unregistered at the end of the migration =)
> >>>>>>>>
> >>>>>>>>- Michael
> >>>>>>>The limitations are pretty clear, and you really should document them:
> >>>>>>>
> >>>>>>>1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >>>>>>>   destination
> >>>>>>>
> >>>>>>>2. expect that as much as that amount of memory is pinned
> >>>>>>>   and unavailable to host kernel and applications for
> >>>>>>>   arbitrarily long time.
> >>>>>>>   Make sure you have much more RAM in host or QEMU will get killed.
> >>>>>>>
> >>>>>>>To me, especially 1 is an unacceptable security tradeoff.
> >>>>>>>It is entirely fixable but we both have other priorities,
> >>>>>>>so it'll stay broken.
> >>>>>>>
> >>>>>>I've modified the beginning of docs/rdma.txt to say the following:
> >>>>>It really should say this, in a very prominent place:
> >>>>>
> >>>>>BUGS:
> >>>>Not a bug. We'll have to agree to disagree. Please drop this.
> >>>It's not a feature, it makes management harder and
> >>>will bite some users who are not careful enough
> >>>to read documentation and know what to expect.
> >>Something that does not exist cannot be a bug. That's called a
> >>non-existent optimization.
> >No, because overcommit already exists, and works with migration.  It's
> >your patch that breaks it.  We already have a ton of migration variants
> >and they all work fine.  So in 2013 overcommit is a given.
> >
> >Look we can include code with known bugs, but we have to be very
> >explicit about them, because someone *will* be confused.  If it's a hard
> >bug to fix it won't get solved quickly but please stop pretending it's
> >perfect.
> >
> 
> Setting aside RDMA for the moment, are you trying to tell me that
> someone would
> *willingly* migrate a VM to a hypervisor without first validating
> (programmatically)
> whether or not the machine already has enough memory to support the entire
> footprint of the VM?
> 
> If the answer to that question is yes, it's a bug.

enough virtual memory, not physical memory.

> That also means *any* use of RDMA in any application in the universe is also
> a bug and it also means that any HPC application running against cgroups is
> also buggy.

no, people don't normally lock up gigabytes of memory in HPC either.

> I categorically refuse to believe that someone runs a datacenter in
> this manner.

so you don't believe people overcommit memory.

> >
> >>>>>1. You must run qemu as root, or under
> >>>>>    ulimit -l <total guest memory> on both source and destination
> >>>>Good, will update the documentation now.
> >>>>>2. Expect that as much as that amount of memory to be locked
> >>>>>    and unavailable to host kernel and applications for
> >>>>>    arbitrarily long time.
> >>>>>    Make sure you have much more RAM in host otherwise QEMU,
> >>>>>    or some other arbitrary application on same host, will get killed.
> >>>>This is implied already. The docs say "If you don't want pinning,
> >>>>then use TCP".
> >>>>That's enough warning.
> >>>No it's not. Pinning is jargon, and does not mean locking
> >>>up gigabytes.  Why are you using jargon?
> >>>Explain the limitation in plain English so people know
> >>>when to expect things to work.
> >>Already done.
> >>
> >>>>>3. Migration with RDMA support is experimental and unsupported.
> >>>>>    In particular, please do not expect it to work across qemu versions,
> >>>>>    and do not expect the management interface to be stable.
> >>>>The only correct statement here is that it's experimental.
> >>>>
> >>>>I will update the docs to reflect that.
> >>>>
> >>>>>>$ cat docs/rdma.txt
> >>>>>>
> >>>>>>... snip ..
> >>>>>>
> >>>>>>BEFORE RUNNING:
> >>>>>>===============
> >>>>>>
> >>>>>>Use of RDMA requires pinning and registering memory with the
> >>>>>>hardware. If this is not acceptable for your application or
> >>>>>>product, then the use of RDMA is strongly discouraged and you
> >>>>>>should revert back to standard TCP-based migration.
> >>>>>No one knows or should know what "pinning and registering" means.
> >>>>I will define it in the docs, then.
> >>>Keep it simple. Just tell people what they need to know.
> >>>It's silly to expect users to understand internals of
> >>>the product before they even try it for the first time.
> >>Agreed.
> >>
> >>>>>For which applications and products is it appropriate?
> >>>>That's up to the vendor or user to decide, not us.
> >>>With zero information so far, no one will be
> >>>able to decide.
> >>There is plenty of information. Including this email thread.
> >Nowhere in this email thread or in your patchset did you tell anyone for
> >which applications and products it is appropriate.  You also expect
> >someone to answer this question before they run your code.  It looks
> >like the purpose of this phrase is to assign blame rather than to
> >inform.
> >>>>>Also, you are talking about current QEMU
> >>>>>code using RDMA for migration but say "RDMA" generally.
> >>>>Sure, I will fix the docs.
> >>>>
> >>>>>>Next, decide if you want dynamic page registration on the server-side.
> >>>>>>For example, if you have an 8GB RAM virtual machine, but only 1GB
> >>>>>>is in active use, then disabling this feature will cause all 8GB to
> >>>>>>be pinned and resident in memory. This feature mostly affects the
> >>>>>>bulk-phase round of the migration and can be disabled for extremely
> >>>>>>high-performance RDMA hardware using the following command:
> >>>>>>QEMU Monitor Command:
> >>>>>>$ migrate_set_capability chunk_register_destination off # enabled by default
> >>>>>>
> >>>>>>Performing this action will cause all 8GB to be pinned, so if that's
> >>>>>>not what you want, then please ignore this step altogether.
> >>>>>This does not make it clear what the benefit of disabling this
> >>>>>capability is. I think it's best to avoid options, just use chunk
> >>>>>based always.
> >>>>>If it's here "so people can play with it" then please rename
> >>>>>it to something like "x-unsupported-chunk_register_destination"
> >>>>>so people know this is unsupported and not to be used for production.
> >>>>Again, please drop the request for removing chunking.
> >>>>
> >>>>Paolo already told me to use "x-rdma" - so that's enough for now.
> >>>>
> >>>>- Michael
> >>>You are adding a new command that's also experimental, so you must tag
> >>>it explicitly too.
> >>The entire migration is experimental - which by extension makes the
> >>capability experimental.
> >Again the purpose of documentation is not to educate people about
> >qemu or rdma internals but to educate them how to use a feature.
> >It doesn't even mention rdma anywhere in the name of the capability.
> >Users won't make the connection.  You also didn't bother telling anyone
> >when to set the option.  Is it here "to be able to play with it"?  Does
> >it have any purpose for users not in a playful mood?  If yes, your
> >documentation should say what it is; if not, it should say so.
> >
> 
> The purpose of the capability is made blatantly clear in the documentation.
> 
> - Michael

I know it's not clear to me.

You ask people to decide whether to use it or not basically as the first
or second thing. So tell them, in plain English, then and there,
what they need to know in order to decide, not ten pages
down in the middle of a description of qemu internals.
Or just drop one of the variants - how much speed difference
is there between them?
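[Editor's note: the memory-footprint tradeoff being debated above can be
sketched as follows. With chunked (dynamic) registration enabled, only the
chunks that actually receive data get pinned on the destination; with it
disabled, all guest RAM is pinned and resident up front. This is an
illustrative sketch only, not QEMU code; the chunk size and helper name
are hypothetical.]

```python
CHUNK_SIZE = 1 << 20  # hypothetical 1 MB registration chunk

def pinned_bytes(ram_bytes, active_bytes, chunk_register=True):
    """Estimate destination memory pinned during an RDMA migration.

    chunk_register=True  -> only chunks that receive pages are pinned.
    chunk_register=False -> all guest RAM is pinned up front.
    """
    if not chunk_register:
        return ram_bytes  # e.g. the full 8 GB from the quoted example
    # Round the active working set up to whole chunks.
    chunks = -(-active_bytes // CHUNK_SIZE)
    return chunks * CHUNK_SIZE

GB = 1 << 30
# The 8 GB VM / 1 GB active working set from the quoted documentation:
print(pinned_bytes(8 * GB, 1 * GB, chunk_register=False))  # 8589934592 (8 GB)
print(pinned_bytes(8 * GB, 1 * GB, chunk_register=True))   # 1073741824 (1 GB)
```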


end of thread, other threads:[~2013-04-15 22:20 UTC | newest]

Thread overview: 97+ messages
2013-04-09  3:04 [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design mrhines
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 01/12] ./configure with and without --enable-rdma mrhines
2013-04-09 17:05   ` Paolo Bonzini
2013-04-09 18:07     ` Michael R. Hines
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 02/12] check for CONFIG_RDMA mrhines
2013-04-09 16:46   ` Paolo Bonzini
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 03/12] comprehensive protocol documentation mrhines
2013-04-10  5:27   ` Michael S. Tsirkin
2013-04-10 13:04     ` Michael R. Hines
2013-04-10 13:34       ` Michael S. Tsirkin
2013-04-10 15:29         ` Michael R. Hines
2013-04-10 17:41           ` Michael S. Tsirkin
2013-04-10 20:05             ` Michael R. Hines
2013-04-11  7:19               ` Michael S. Tsirkin
2013-04-11 13:12                 ` Michael R. Hines
2013-04-11 13:48                   ` Michael S. Tsirkin
2013-04-11 13:58                     ` Michael R. Hines
2013-04-11 14:37                       ` Michael S. Tsirkin
2013-04-11 14:50                         ` Paolo Bonzini
2013-04-11 14:56                           ` Michael S. Tsirkin
2013-04-11 17:49                             ` Michael R. Hines
2013-04-11 19:15                               ` Michael S. Tsirkin
2013-04-11 20:33                                 ` Michael R. Hines
2013-04-12 10:48                                   ` Michael S. Tsirkin
2013-04-12 10:53                                     ` Paolo Bonzini
2013-04-12 11:25                                       ` Michael S. Tsirkin
2013-04-12 14:43                                         ` Paolo Bonzini
2013-04-14 11:59                                           ` Michael S. Tsirkin
2013-04-14 14:09                                             ` Paolo Bonzini
2013-04-14 14:40                                               ` Michael R. Hines
2013-04-14 14:27                                             ` Michael R. Hines
2013-04-14 16:03                                               ` Michael S. Tsirkin
2013-04-14 16:07                                                 ` Michael R. Hines
2013-04-14 16:40                                                 ` Michael R. Hines
2013-04-14 18:30                                                   ` Michael S. Tsirkin
2013-04-14 19:06                                                     ` Michael R. Hines
2013-04-14 21:10                                                       ` Michael S. Tsirkin
2013-04-15  1:06                                                         ` Michael R. Hines
2013-04-15  6:00                                                           ` Michael S. Tsirkin
2013-04-15 13:07                                                             ` Michael R. Hines
2013-04-15 22:20                                                               ` Michael S. Tsirkin
2013-04-15  8:28                                                           ` Paolo Bonzini
2013-04-15 13:08                                                             ` Michael R. Hines
2013-04-15  8:26                                                       ` Paolo Bonzini
2013-04-12 13:47                                     ` Michael R. Hines
2013-04-14  8:28                                       ` Michael S. Tsirkin
2013-04-14 14:31                                         ` Michael R. Hines
2013-04-14 18:51                                           ` Michael S. Tsirkin
2013-04-14 19:43                                             ` Michael R. Hines
2013-04-14 21:16                                               ` Michael S. Tsirkin
2013-04-15  1:10                                                 ` Michael R. Hines
2013-04-15  6:10                                                   ` Michael S. Tsirkin
2013-04-15  8:34                                                   ` Paolo Bonzini
2013-04-15 13:24                                                     ` Michael R. Hines
2013-04-15 13:30                                                       ` Paolo Bonzini
2013-04-15 19:55                                                         ` Michael R. Hines
2013-04-11 15:01                           ` Michael R. Hines
2013-04-11 15:18                         ` Michael R. Hines
2013-04-11 15:33                           ` Paolo Bonzini
2013-04-11 15:46                             ` Michael S. Tsirkin
2013-04-11 15:47                               ` Paolo Bonzini
2013-04-11 15:58                                 ` Michael S. Tsirkin
2013-04-11 16:06                                   ` Michael R. Hines
2013-04-12  5:10                             ` Michael R. Hines
2013-04-12  5:26                               ` Paolo Bonzini
2013-04-12  5:54                                 ` Michael R. Hines
2013-04-11 15:44                           ` Michael S. Tsirkin
2013-04-11 16:09                             ` Michael R. Hines
2013-04-11 17:04                               ` Michael S. Tsirkin
2013-04-11 17:27                                 ` Michael R. Hines
2013-04-11 16:13                             ` Michael R. Hines
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 04/12] introduce qemu_ram_foreach_block() mrhines
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 05/12] core RDMA migration logic w/ new protocol mrhines
2013-04-09 16:57   ` Paolo Bonzini
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 06/12] connection-establishment for RDMA mrhines
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 07/12] additional savevm.c accessors " mrhines
2013-04-09 17:03   ` Paolo Bonzini
2013-04-09 17:31   ` Peter Maydell
2013-04-09 18:04     ` Michael R. Hines
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 08/12] new capabilities added and check for QMP string 'rdma' mrhines
2013-04-09 17:01   ` Paolo Bonzini
2013-04-10  1:11     ` Michael R. Hines
2013-04-10  8:07       ` Paolo Bonzini
2013-04-10 10:35         ` Michael S. Tsirkin
2013-04-10 12:24         ` Michael R. Hines
2013-04-09 17:02   ` Paolo Bonzini
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 09/12] transmit pc.ram using RDMA mrhines
2013-04-09 16:50   ` Paolo Bonzini
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 10/12] new header file prototypes for savevm.c mrhines
2013-04-09 16:43   ` Paolo Bonzini
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 11/12] update schema to define new capabilities mrhines
2013-04-09 16:43   ` Paolo Bonzini
2013-04-09  3:04 ` [Qemu-devel] [RFC PATCH RDMA support v5: 12/12] don't set nonblock on invalid file descriptor mrhines
2013-04-09 16:45   ` Paolo Bonzini
2013-04-09  4:24 ` [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design Michael R. Hines
2013-04-09 12:44 ` Michael S. Tsirkin
2013-04-09 14:23   ` Michael R. Hines
