* [Qemu-devel]  [RFC PATCH] replication agent module
@ 2012-02-07 10:29 Ori Mamluk
  2012-02-07 12:12 ` Anthony Liguori
  2012-02-07 13:34 ` Kevin Wolf
  0 siblings, 2 replies; 66+ messages in thread
From: Ori Mamluk @ 2012-02-07 10:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, dlaor


Repagent is a new module that allows an external replication system to
replicate a volume of a Qemu VM.

This RFC patch adds the repagent client module to Qemu.



Documentation of the module role and API is in the patch at
replication/qemu-repagent.txt



The main motivation behind the module is to allow replication of VMs in a
virtualization environment like RhevM.

To achieve this we need basic replication support in Qemu.



This is the first submission of this module, which was written as a proof
of concept and used successfully to replicate and recover a Qemu VM.

Points and open issues:

*             The module interfaces the Qemu storage stack at the block.c
generic layer. Is this the right place to intercept/inject IOs?

*             The patch performs IO reads invoked from a new thread (a TCP
listener thread). See repaget_read_vol in repagent.c. These reads are not
protected by any lock – is this OK?

*             VM ID – the replication system implies an environment with
several VMs connected to a central replication system (Rephub).

                This requires some sort of identification for a VM. The
current patch does not include a VM ID – I did not find any adequate ID to
use.

                Any suggestions?



Appreciate any feedback or suggestions.  Thanks,

Ori.





From 5a0d88689ddcf325f25fdfca2a2012f1bbf141b9 Mon Sep 17 00:00:00 2001

From: Ori Mamluk <orim@orim-fedora.(none)>

Date: Tue, 7 Feb 2012 11:12:12 +0200

Subject: [PATCH] Added replication agent module (repagent) to Qemu under

replication directory, added repagent configure and run

options, and the repagent API usage in block.c



Added build options to ./configure:  --enable-replication --disable-replication

Added a commandline option to enable: -repagent <rep hub IP>

Added the module files under replication.



Signed-off-by: Ori Mamluk <orim@zerto.com>

---

Makefile                      |    9 +-

Makefile.objs                 |    6 +

block.c                       |   20 +++-

configure                     |   11 ++

qemu-options.hx               |    6 +

replication/qemu-repagent.txt |  104 +++++++++++++

replication/repagent.c        |  322 +++++++++++++++++++++++++++++++++++++++++

replication/repagent.h        |   46 ++++++

replication/repagent_client.c |  138 ++++++++++++++++++

replication/repagent_client.h |   36 +++++

replication/repcmd.h          |   59 ++++++++

replication/repcmd_listener.c |  137 +++++++++++++++++

replication/repcmd_listener.h |   32 ++++

replication/rephub_cmds.h     |  150 +++++++++++++++++++

replication/rephub_defs.h     |   40 +++++

vl.c                          |   10 ++

16 files changed, 1121 insertions(+), 5 deletions(-)

mode change 100644 => 100755 Makefile.objs

mode change 100644 => 100755 qemu-options.hx

create mode 100755 replication/qemu-repagent.txt

create mode 100644 replication/repagent.c

create mode 100644 replication/repagent.h

create mode 100644 replication/repagent_client.c

create mode 100644 replication/repagent_client.h

create mode 100644 replication/repcmd.h

create mode 100644 replication/repcmd_listener.c

create mode 100644 replication/repcmd_listener.h

create mode 100644 replication/rephub_cmds.h

create mode 100644 replication/rephub_defs.h



diff --git a/Makefile b/Makefile

index 4f6eaa4..a1b3701 100644

--- a/Makefile

+++ b/Makefile

@@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o qemu-ga.o: $(GENERATED_HEADERS

tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \

               qemu-timer-common.o cutils.o

-qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)

-qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)

-qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)

+qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)

+qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)

+qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)

 qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx

               $(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > $@,"  GEN   $@")

@@ -228,6 +228,7 @@ clean:

               rm -f trace-dtrace.dtrace trace-dtrace.dtrace-timestamp

               rm -f trace-dtrace.h trace-dtrace.h-timestamp

               rm -rf $(qapi-dir)

+             rm -f replication/*.{o,d}

               $(MAKE) -C tests clean

               for d in $(ALL_SUBDIRS) $(QEMULIBS) libcacard; do \

               if test -d $$d; then $(MAKE) -C $$d $@ || exit 1; fi; \

@@ -387,4 +388,4 @@ tar:

               rm -rf /tmp/$(FILE)

 # Include automatically generated dependency files

--include $(wildcard *.d audio/*.d slirp/*.d block/*.d net/*.d ui/*.d qapi/*.d qga/*.d)

+-include $(wildcard *.d audio/*.d slirp/*.d block/*.d net/*.d ui/*.d qapi/*.d qga/*.d replication/*.d)

diff --git a/Makefile.objs b/Makefile.objs

old mode 100644

new mode 100755

index d7a6539..dbd6f15

--- a/Makefile.objs

+++ b/Makefile.objs

@@ -74,6 +74,7 @@ fsdev-obj-$(CONFIG_VIRTFS) += $(addprefix fsdev/, $(fsdev-nested-y))

# CPUs and machines.

 common-obj-y = $(block-obj-y) blockdev.o

+common-obj-y += $(replication-obj-$(CONFIG_REPLICATION))

common-obj-y += $(net-obj-y)

common-obj-y += $(qobject-obj-y)

common-obj-$(CONFIG_LINUX) += $(fsdev-obj-$(CONFIG_LINUX))

@@ -413,6 +414,11 @@ common-obj-y += qmp-marshal.o qapi-visit.o qapi-types.o $(qapi-obj-y)

common-obj-y += qmp.o hmp.o

 ######################################################################

+# replication

+replication-nested-y = repagent_client.o  repagent.o  repcmd_listener.o

+replication-obj-y = $(addprefix replication/, $(replication-nested-y))

+

+######################################################################

# guest agent

 qga-nested-y = guest-agent-commands.o guest-agent-command-state.o

diff --git a/block.c b/block.c

index 9bb236c..f3b8387 100644

--- a/block.c

+++ b/block.c

@@ -31,6 +31,10 @@

#include "qemu-coroutine.h"

#include "qmp-commands.h"

+#ifdef CONFIG_REPLICATION

+#include "replication/repagent.h"

+#endif

+

#ifdef CONFIG_BSD

#include <sys/types.h>

#include <sys/stat.h>

@@ -640,6 +644,9 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,

         goto unlink_and_fail;

     }

+#ifdef CONFIG_REPLICATION

+    repagent_register_drive(filename,  bs);

+#endif

     /* Open the image */

     ret = bdrv_open_common(bs, filename, flags, drv);

     if (ret < 0) {

@@ -1292,6 +1299,17 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,

     ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);

+

+#ifdef CONFIG_REPLICATION

+    if (bs->device_name[0] != '\0') {

+        /* We split the IO only at the highest stack driver layer.

+           Currently we know that by checking device_name - only

+           highest level (closest to the guest) has that name.

+           */

+           repagent_handle_protected_write(bs, sector_num,

+                nb_sectors, qiov, ret);

+    }

+#endif

     if (bs->dirty_bitmap) {

         set_dirty_bitmap(bs, sector_num, nb_sectors, 1);

     }

@@ -1783,7 +1801,7 @@ int bdrv_has_zero_init(BlockDriverState *bs)

  * 'nb_sectors' is the max value 'pnum' should be set to.

  */

int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors,

-              int *pnum)

+    int *pnum)

{

     int64_t n;

     if (!bs->drv->bdrv_is_allocated) {

diff --git a/configure b/configure

index 9e5da44..93d600e 100755

--- a/configure

+++ b/configure

@@ -179,6 +179,7 @@ spice=""

rbd=""

smartcard=""

smartcard_nss=""

+replication=""

usb_redir=""

opengl=""

zlib="yes"

@@ -772,6 +773,10 @@ for opt do

   ;;

   --enable-smartcard-nss) smartcard_nss="yes"

   ;;

+  --disable-replication) replication="no"

+  ;;

+  --enable-replication) replication="yes"

+  ;;

   --disable-usb-redir) usb_redir="no"

   ;;

   --enable-usb-redir) usb_redir="yes"

@@ -1067,6 +1072,7 @@ echo "  --disable-usb-redir      disable usb network redirection support"

echo "  --enable-usb-redir       enable usb network redirection support"

echo "  --disable-guest-agent    disable building of the QEMU Guest Agent"

echo "  --enable-guest-agent     enable building of the QEMU Guest Agent"

+echo "  --enable-replication     enable replication support"

echo ""

echo "NOTE: The object files are built at the place where configure is launched"

exit 1

@@ -2733,6 +2739,7 @@ echo "curl support      $curl"

echo "check support     $check_utests"

echo "mingw32 support   $mingw32"

echo "Audio drivers     $audio_drv_list"

+echo "Replication          $replication"

echo "Extra audio cards $audio_card_list"

echo "Block whitelist   $block_drv_whitelist"

echo "Mixer emulation   $mixemu"

@@ -3080,6 +3087,10 @@ if test "$smartcard_nss" = "yes" ; then

   echo "CONFIG_SMARTCARD_NSS=y" >> $config_host_mak

fi

+if test "$replication" = "yes" ; then

+  echo "CONFIG_REPLICATION=y" >> $config_host_mak

+fi

+

if test "$usb_redir" = "yes" ; then

   echo "CONFIG_USB_REDIR=y" >> $config_host_mak

fi

diff --git a/qemu-options.hx b/qemu-options.hx

old mode 100644

new mode 100755

index 681eaf1..c97e4f8

--- a/qemu-options.hx

+++ b/qemu-options.hx

@@ -2602,3 +2602,9 @@ HXCOMM This is the last statement. Insert new options before this line!

STEXI

@end table

ETEXI

+

+DEF("repagent", HAS_ARG, QEMU_OPTION_repagent,

+    "-repagent [hub IP/name]\n"

+    "                Enable replication support for disks\n"

+    "                hub is the ip or name of the machine running the replication hub.\n",

+    QEMU_ARCH_ALL)

diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt

new file mode 100755

index 0000000..e3b0c1e

--- /dev/null

+++ b/replication/qemu-repagent.txt

@@ -0,0 +1,104 @@

+             repagent - replication agent - a Qemu module for enabling continuous async replication of VM volumes

+

+Introduction

+             This document describes a feature in Qemu - a replication agent (AKA Repagent).

+             The Repagent is a new module that exposes an API to an external replication system (AKA Rephub).

+             This API allows a Rephub to communicate with a Qemu VM and continuously replicate its volumes.

+             The implementation of a Rephub is outside the scope of this document. There may be several Rephub

+             implementations using the same repagent in Qemu.

+

+Main features of Repagent

+             Repagent does the following:

+             * Report volumes - report a list of all volumes in a VM to the Rephub.

+             * Report writes to a volume - send all writes made to a protected volume to the Rephub.

+                             The reporting of an IO is asynchronous - i.e. the IO is not delayed by the Repagent to get any acknowledgement from the Rephub.

+                             It is only copied to the Rephub.

+             * Read a protected volume - allows the Rephub to read a protected volume, to enable the Rephub to synchronize the content of a protected volume.

+

+Description of the Repagent module

+

+Build and run options

+             New configure option: --enable-replication

+             New command line option:

+             -repagent [hub IP/name]

+                             Enable replication support for disks

+                             hub is the ip or name of the machine running the replication hub.

+

+Module APIs

+             The Repagent module interfaces with two main components:

+             1. The Rephub - an external API based on socket messages

+             2. The generic block layer - block.c

+

+             Rephub message API

+                             The external replication API is a message-based API.

+                             We won't go into the structure of the messages here - just the semantics.

+

+                             Messages list

+                                             (The updated list and comments are in rephub_cmds.h)

+

+                                             Messages from the Repagent to the Rephub:

+                                             * Protected write

+                                                             The Repagent sends each write to a protected volume to the hub with the IO status.

+                                                             In case the status is bad, the write content is not sent.

+                                             * Report VM volumes

+                                                             The agent reports all the volumes of the VM to the hub.

+                                             * Read Volume Response

+                                                             A response to a Read Volume Request.

+                                                             Sends the data read from a protected volume to the hub.

+                                             * Agent shutdown

+                                                             Notifies the hub that the agent is about to shut down.

+                                                             This allows a graceful shutdown. Any disconnection of an agent without

+                                                             sending this command will result in a full sync of the VM volumes.

+

+                                             Messages from the Rephub to the Repagent:

+                                             * Start protect

+                                                             The hub instructs the agent to start protecting a volume. When a volume is protected

+                                                             all its writes are sent to the hub.

+                                                             With this command the hub also assigns a volume ID to the given volume name.

+                                             * Read volume request

+                                                             The hub issues a read IO to a protected volume.

+                                                             This command is used during sync - when the hub needs to read unsynchronized

+                                                             sections of a protected volume.

+                                                             This command is a request; the read data is returned by the read volume response message (see above).

+             block.c API

+                             The API to the generic block storage layer contains 3 functionalities:

+                             1. Handle writes to protected volumes

+                                             In bdrv_co_do_writev, each write is reported to the Repagent module.

+                             2. Handle each new volume that registers

+                                             In bdrv_open - each new bottom-level block driver that registers is reported.

+                             3. Read from a volume

+                                             Repagent calls bdrv_aio_readv to handle read requests coming from the hub.

+

+

+General description of a Rephub - a replication system the repagent connects to

+             This section describes at a high level a sample Rephub - a replication system that uses the repagent API

+             to replicate disks.

+             It describes a simple Rephub that continuously maintains a mirror of the volumes of a VM.

+

+             Say we have a VM we want to protect - call it PVM; say it has 2 volumes - V1, V2.

+             Our Rephub is called SingleRephub - a Rephub protecting a single VM.

+

+             Preparations

+             1. The user chooses a host to run SingleRephub - a different host than PVM, call it Host2.

+             2. The user creates two volumes on Host2 - the same sizes as V1 and V2, call them V1R (V1 recovery) and V2R.

+             3. The user runs the SingleRephub process on Host2, and gives V1R and V2R as command line arguments.

+                             From now on SingleRephub waits for the protected VM's repagent to connect.

+             4. The user runs the protected VM PVM - and uses the switch -repagent <Host2 IP>.

+

+             Runtime

+             1. The repagent module connects to SingleRephub on startup.

+             2. repagent reports V1 and V2 to SingleRephub.

+             3. SingleRephub starts to perform an initial synchronization of the protected volumes -

+                             it reads each protected volume (V1 and V2) - using read volume requests - and copies the data into the

+                             recovery volumes V1R and V2R.

+             4. SingleRephub enters 'protection' mode - each write to the protected volume is sent by the repagent to the Rephub,

+                             and the Rephub performs the write on the matching recovery volume.

+

+             * Note that during stage 3 writes to the protected volumes are not ignored - they're kept in a bitmap,

+                             and will be read again when stage 3 ends, in an iterative converging process.

+

+             This flow continuously maintains an updated recovery volume.

+             If the protected system is damaged, the user can create a new VM on Host2 with the replicated volumes attached to it.

+             The new VM is a replica of the protected system.

+

+

diff --git a/replication/repagent.c b/replication/repagent.c

new file mode 100644

index 0000000..c66eae7

--- /dev/null

+++ b/replication/repagent.c

@@ -0,0 +1,322 @@

+#include <string.h>

+#include <stdlib.h>

+#include <stdio.h>

+#include <pthread.h>

+#include <stdint.h>

+

+#include "block.h"

+#include "rephub_defs.h"

+#include "block_int.h"

+#include "repagent_client.h"

+#include "repagent.h"

+#include "rephub_cmds.h"

+

+#define ZERO_MEM_OBJ(pObj) memset(pObj, 0, sizeof(*pObj))

+#define REPAGENT_MAX_NUM_VOLUMES (64)

+#define REPAGENT_VOLUME_ID_NONE (0)

+

+typedef struct RepagentVolume {

+    uint64_t vol_id;

+    const char *vol_path;

+    BlockDriverState *driver_ptr;

+} RepagentVolume;

+

+struct RepAgentState {

+    int is_init;

+    int num_volumes;

+    RepagentVolume * volumes[REPAGENT_MAX_NUM_VOLUMES];

+};

+

+typedef struct RepagentReadVolIo {

+    QEMUIOVector qiov;

+    RepCmdReadVolReq rep_cmd;

+    uint8_t *buf;

+    struct timeval start_time;

+} RepagentReadVolIo;

+

+static int repagent_get_volume_by_name(const char *name);

+static void repagent_report_volumes_to_hub(void);

+static void repagent_vol_read_done(void *opaque, int ret);

+static struct timeval tsub(struct timeval t1, struct timeval t2);

+

+RepAgentState g_rep_agent = { 0 };

+

+void repagent_init(const char *hubname, int port)

+{

+    /* It is the responsibility of the thread to free this struct */

+    rephub_params *pParams = (rephub_params *)g_malloc(sizeof(rephub_params));

+    if (hubname == NULL) {

+        hubname = "127.0.0.1";

+    }

+    if (port == 0) {

+        port = 9010;

+    }

+

+    printf("repagent_init %s\n", hubname);

+

+    pParams->port = port;

+    pParams->name = g_strdup(hubname);

+

+    pthread_t thread_id = 0;

+

+    /* Create the repagent client listener thread */

+    pthread_create(&thread_id, 0, repagent_listen, (void *) pParams);

+    pthread_detach(thread_id);

+}

+

+void repagent_register_drive(const char *drive_path,

+        BlockDriverState *driver_ptr)

+{

+    int i;

+    for (i = 0; i < g_rep_agent.num_volumes ; i++) {

+        RepagentVolume *vol = g_rep_agent.volumes[i];

+        if (vol != NULL) {

+            assert(

+                    strcmp(drive_path, vol->vol_path) != 0

+                    && driver_ptr != vol->driver_ptr);

+        }

+    }

+

+    assert(g_rep_agent.num_volumes < REPAGENT_MAX_NUM_VOLUMES);

+

+    printf("zerto repagent: Registering drive. Num drives %d, path %s\n",

+            g_rep_agent.num_volumes, drive_path);

+    g_rep_agent.volumes[i] =

+            (RepagentVolume *)g_malloc(sizeof(RepagentVolume));

+    g_rep_agent.volumes[i]->driver_ptr = driver_ptr;

+    /* orim todo strcpy? */

+    g_rep_agent.volumes[i]->vol_path = drive_path;

+

+    /* Orim todo thread-safety? */

+    g_rep_agent.num_volumes++;

+

+    repagent_report_volumes_to_hub();

+}

+

+/* orim todo destruction? */

+

+static RepagentVolume *repagent_get_protected_volume_by_driver(

+        BlockDriverState *bs)

+{

+    /* orim todo optimize search */

+    int i = 0;

+    for (i = 0; i < g_rep_agent.num_volumes ; i++) {

+        RepagentVolume *p_vol = g_rep_agent.volumes[i];

+        if (p_vol != NULL && p_vol->driver_ptr == (void *) bs) {

+            return p_vol;

+        }

+    }

+    return NULL;

+}

+

+void repagent_handle_protected_write(BlockDriverState *bs, int64_t sector_num,

+        int nb_sectors, QEMUIOVector *qiov, int ret_status)

+{

+    printf("zerto Protected write offset %lld, size %d, IO return status %d",

+            (long long int) sector_num, nb_sectors, ret_status);

+    if (bs->filename != NULL) {

+        printf(", filename %s", bs->filename);

+    }

+

+    printf("\n");

+

+    RepagentVolume *p_vol = repagent_get_protected_volume_by_driver(bs);

+    if (p_vol == NULL || p_vol->vol_id == REPAGENT_VOLUME_ID_NONE) {

+        /* Unprotected */

+        printf("Got a write to an unprotected volume.\n");

+        return;

+    }

+

+    /* Report IO to rephub */

+

+    int data_size = qiov->size;

+    if (ret_status < 0) {

+        /* On failed ios we don't send the data to the hub */

+        data_size = 0;

+    }

+    uint8_t *pdata = NULL;

+    RepCmdProtectedWrite *p_cmd = (RepCmdProtectedWrite *) repcmd_new(

+            REPHUB_CMD_PROTECTED_WRITE, data_size, (uint8_t **) &pdata);

+

+    if (ret_status >= 0) {

+        qemu_iovec_to_buffer(qiov, pdata);

+    }

+

+    p_cmd->volume_id = p_vol->vol_id;

+    p_cmd->offset_sectors = sector_num;

+    p_cmd->size_sectors = nb_sectors;

+    p_cmd->ret_status = ret_status;

+

+    if (repagent_client_send((RepCmd *) p_cmd) != 0) {

+        printf("Error sending command\n");

+    }

+}

+

+static void repagent_report_volumes_to_hub(void)

+{

+    /* Report IO to rephub */

+    int i;

+    RepCmdDataReportVmVolumes *p_cmd_data = NULL;

+    RepCmdReportVmVolumes *p_cmd = (RepCmdReportVmVolumes *) repcmd_new(

+            REPHUB_CMD_REPORT_VM_VOLUMES,

+            g_rep_agent.num_volumes * sizeof(RepVmVolumeInfo),

+            (uint8_t **) &p_cmd_data);

+    p_cmd->num_volumes = g_rep_agent.num_volumes;

+    printf("reporting %u volumes\n", g_rep_agent.num_volumes);

+    for (i = 0; i < g_rep_agent.num_volumes ; i++) {

+        assert(g_rep_agent.volumes[i] != NULL);

+        printf("reporting volume %s size %u\n",

+                g_rep_agent.volumes[i]->vol_path,

+                (uint32_t) sizeof(p_cmd_data->volumes[i].name));

+        strncpy((char *) p_cmd_data->volumes[i].name,

+                g_rep_agent.volumes[i]->vol_path,

+                sizeof(p_cmd_data->volumes[i].name));

+        p_cmd_data->volumes[i].volume_id = g_rep_agent.volumes[i]->vol_id;

+    }

+    if (repagent_client_send((RepCmd *) p_cmd) != 0) {

+        printf("Error sending command\n");

+    }

+}

+

+int repaget_start_protect(RepCmdStartProtect *pcmd,

+        RepCmdDataStartProtect *pcmd_data)

+{

+    printf("Start protect vol %s, ID %llu\n", pcmd_data->volume_name,

+            (unsigned long long) pcmd->volume_id);

+    int vol_index = repagent_get_volume_by_name(pcmd_data->volume_name);

+    if (vol_index < 0) {

+        printf("The volume doesn't exist\n");

+        return TRUE;

+    }

+    /* orim todo protect */

+    g_rep_agent.volumes[vol_index]->vol_id = pcmd->volume_id;

+

+    return TRUE;

+}

+

+static int repagent_get_volume_by_name(const char *name)

+{

+    int i = 0;

+    for (i = 0; i < g_rep_agent.num_volumes ; i++) {

+        if (g_rep_agent.volumes[i] != NULL

+                && strcmp(name, g_rep_agent.volumes[i]->vol_path) == 0) {

+            return i;

+        }

+    }

+    return -1;

+}

+

+static int repagent_get_volume_by_id(uint64_t vol_id)

+{

+    int i = 0;

+    for (i = 0; i < g_rep_agent.num_volumes ; i++) {

+        if (g_rep_agent.volumes[i] != NULL

+                && g_rep_agent.volumes[i]->vol_id == vol_id) {

+            return i;

+        }

+    }

+    return -1;

+}

+

+int repaget_read_vol(RepCmdReadVolReq *pcmd, uint8_t *pdata)

+{

+    int index = repagent_get_volume_by_id(pcmd->volume_id);

+    int size_bytes = pcmd->size_sectors * 512;

+    if (index < 0) {

+        printf("Vol read - Could not find vol id %llu\n",

+                (unsigned long long int) pcmd->volume_id);

+        RepCmdReadVolRes *p_res_cmd = (RepCmdReadVolRes *) repcmd_new(

+                REPHUB_CMD_READ_VOL_RES, 0, NULL);

+        p_res_cmd->req_id = pcmd->req_id;

+        p_res_cmd->volume_id = pcmd->volume_id;

+        p_res_cmd->is_status_success = FALSE;

+        repagent_client_send((RepCmd *) p_res_cmd);

+        return TRUE;

+    }

+

+    printf("Vol read - driver %p, volId %llu, offset %llu, size %u\n",

+            g_rep_agent.volumes[index]->driver_ptr,

+            (unsigned long long int) pcmd->volume_id,

+            (unsigned long long int) pcmd->offset_sectors, pcmd->size_sectors);

+

+    {

+        RepagentReadVolIo *read_xact = calloc(1, sizeof(RepagentReadVolIo));

+

+/*        BlockDriverAIOCB *acb; */

+

+        ZERO_MEM_OBJ(read_xact);

+

+        qemu_iovec_init(&read_xact->qiov, 1);

+

+        /*read_xact->buf =

+        qemu_blockalign(g_rep_agent.volumes[index]->driver_ptr, size_bytes); */

+        read_xact->buf = (uint8_t *) g_malloc(size_bytes);

+        read_xact->rep_cmd = *pcmd;

+        qemu_iovec_add(&read_xact->qiov, read_xact->buf, size_bytes);

+

+        gettimeofday(&read_xact->start_time, NULL);

+        /* orim TODO - use the returned acb to cancel the request on shutdown */

+        /*acb = */bdrv_aio_readv(g_rep_agent.volumes[index]->driver_ptr,

+                read_xact->rep_cmd.offset_sectors, &read_xact->qiov,

+                read_xact->rep_cmd.size_sectors, repagent_vol_read_done,

+                read_xact);

+    }

+

+    return TRUE;

+}

+

+static void repagent_vol_read_done(void *opaque, int ret)

+{

+    struct timeval t2;

+    RepagentReadVolIo *read_xact = (RepagentReadVolIo *) opaque;

+    uint8_t *pdata = NULL;

+    RepCmdReadVolRes *pcmd = (RepCmdReadVolRes *) repcmd_new(

+            REPHUB_CMD_READ_VOL_RES, read_xact->rep_cmd.size_sectors * 512,

+            &pdata);

+    pcmd->req_id = read_xact->rep_cmd.req_id;

+    pcmd->volume_id = read_xact->rep_cmd.volume_id;

+    pcmd->is_status_success = FALSE;

+

+    printf("Protected vol read - volId %llu, offset %llu, size %u\n",

+            (unsigned long long int) read_xact->rep_cmd.volume_id,

+            (unsigned long long int) read_xact->rep_cmd.offset_sectors,

+            read_xact->rep_cmd.size_sectors);

+    gettimeofday(&t2, NULL);

+

+    if (ret >= 0) {

+        /* Read response - send the data to the hub */

+        t2 = tsub(t2, read_xact->start_time);

+        printf("Read prot vol done. Took %u seconds, %u us.",

+                (uint32_t) t2.tv_sec, (uint32_t) t2.tv_usec);

+

+        pcmd->is_status_success = TRUE;

+        /* orim todo optimize - don't copy, use the qiov buffer */

+        qemu_iovec_to_buffer(&read_xact->qiov, pdata);

+    } else {

+        printf("readv failed: %s\n", strerror(-ret));

+    }

+

+    repagent_client_send((RepCmd *) pcmd);

+

+    /*qemu_vfree(read_xact->buf); */

+    g_free(read_xact->buf);

+

+    g_free(read_xact);

+}

+

+static struct timeval tsub(struct timeval t1, struct timeval t2)

+{

+    t1.tv_usec -= t2.tv_usec;

+    if (t1.tv_usec < 0) {

+        t1.tv_usec += 1000000;

+        t1.tv_sec--;

+    }

+    t1.tv_sec -= t2.tv_sec;

+    return t1;

+}

+

+void repagent_client_connected(void)

+{

+    /* orim todo thread protection */

+    repagent_report_volumes_to_hub();

+}

diff --git a/replication/repagent.h b/replication/repagent.h

new file mode 100644

index 0000000..98ccbf2

--- /dev/null

+++ b/replication/repagent.h

@@ -0,0 +1,46 @@

+/*

+ * QEMU System Emulator

+ *

+ * Copyright (c) 2003-2008 Fabrice Bellard

+ *

+ * Permission is hereby granted, free of charge, to any person obtaining a copy

+ * of this software and associated documentation files (the "Software"), to deal

+ * in the Software without restriction, including without limitation the rights

+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell

+ * copies of the Software, and to permit persons to whom the Software is

+ * furnished to do so, subject to the following conditions:

+ *

+ * The above copyright notice and this permission notice shall be included in

+ * all copies or substantial portions of the Software.

+ *

+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR

+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER

+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,

+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN

+ * THE SOFTWARE.

+ */

+#ifndef REPAGENT_H

+#define REPAGENT_H

+#include <stdint.h>

+

+#include "qemu-common.h"

+

+typedef struct RepAgentState RepAgentState;

+typedef struct RepCmdStartProtect RepCmdStartProtect;

+typedef struct RepCmdDataStartProtect RepCmdDataStartProtect;

+struct RepCmdReadVolReq;

+

+void repagent_init(const char *hubname, int port);

+void repagent_handle_protected_write(BlockDriverState *bs,

+        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov, int ret_status);

+void repagent_register_drive(const char *drive_path,

+        BlockDriverState *driver_ptr);

+int repaget_start_protect(RepCmdStartProtect *pcmd,

+        RepCmdDataStartProtect *pcmd_data);

+int repaget_read_vol(struct RepCmdReadVolReq *pcmd, uint8_t *pdata);

+void repagent_client_connected(void);

+

+

+#endif /* REPAGENT_H */

diff --git a/replication/repagent_client.c b/replication/repagent_client.c

new file mode 100644

index 0000000..4dd9ea4

--- /dev/null

+++ b/replication/repagent_client.c

@@ -0,0 +1,138 @@

+#include "repcmd.h"

+#include "rephub_cmds.h"

+#include "repcmd_listener.h"

+#include "repagent_client.h"

+#include "repagent.h"

+

+#include <string.h>

+#include <stdlib.h>

+#include <errno.h>

+#include <stdio.h>

+#include <resolv.h>

+#include <sys/socket.h>

+#include <arpa/inet.h>

+#include <netinet/in.h>

+#include <unistd.h>

+

+#define ZERO_MEM_OBJ(pObj) memset(pObj, 0, sizeof(*pObj))

+

+static void repagent_process_cmd(RepCmd *pCmd, uint8_t *pData, void *clientPtr);

+

+typedef struct repagent_client_state {

+    int is_connected;

+    int is_terminate_receive;

+    int hsock;

+} repagent_client_state;

+

+static repagent_client_state g_client_state = { 0 };

+

+void *repagent_listen(void *pParam)

+{

+    rephub_params *pServerParams = (rephub_params *) pParam;

+    int host_port = pServerParams->port;

+    const char *host_name = pServerParams->name;

+

+    printf("Creating repagent listener thread...\n");

+    g_free(pServerParams);

+

+    struct sockaddr_in my_addr;

+

+    int err;

+    int retries = 0;

+

+    g_client_state.hsock = socket(AF_INET, SOCK_STREAM, 0);

+    if (g_client_state.hsock == -1) {

+        printf("Error initializing socket %d\n", errno);

+        return (void *) -1;

+    }

+

+    int param = 1;

+

+    if ((setsockopt(g_client_state.hsock, SOL_SOCKET, SO_REUSEADDR,

+            (char *) &param, sizeof(int)) == -1)

+            || (setsockopt(g_client_state.hsock, SOL_SOCKET, SO_KEEPALIVE,

+                    (char *) &param, sizeof(int)) == -1)) {

+        printf("Error setting options %d\n", errno);

+        return (void *) -1;

+    }

+

+    my_addr.sin_family = AF_INET;

+    my_addr.sin_port = htons(host_port);

+    memset(&(my_addr.sin_zero), 0, 8);

+

+    my_addr.sin_addr.s_addr = inet_addr(host_name);

+

+    /* Reconnect loop */

+    while (!g_client_state.is_terminate_receive) {

+

+        if (connect(g_client_state.hsock, (struct sockaddr *) &my_addr,

+                sizeof(my_addr)) == -1) {

+            err = errno;

+            if (err != EINPROGRESS) {

+                retries++;

+                fprintf(

+                        stderr,

+                        "Error connecting socket %d. Host %s, port %u.
Retry count %d\n",

+                        errno, host_name, host_port, retries);

+                usleep(5 * 1000 * 1000);

+                continue;

+            }

+        }

+        retries = 0;

+

+        g_client_state.is_connected = 1;

+

+        repagent_client_connected();

+        repcmd_listener(g_client_state.hsock, repagent_process_cmd, NULL);

+        close(g_client_state.hsock);

+

+        g_client_state.is_connected = 0;

+    }

+    return 0;

+}

+

+void repagent_process_cmd(RepCmd *pcmd, uint8_t *pdata, void *clientPtr)

+{

+    int is_free_data = 1;

+    printf("Repagent got cmd %d\n", pcmd->hdr.cmdid);

+    switch (pcmd->hdr.cmdid) {

+    case REPHUB_CMD_START_PROTECT: {

+        is_free_data = repaget_start_protect((RepCmdStartProtect *) pcmd,

+                (RepCmdDataStartProtect *) pdata);

+    }

+        break;

+    case REPHUB_CMD_READ_VOL_REQ: {

+        is_free_data = repaget_read_vol((RepCmdReadVolReq *) pcmd, pdata);

+    }

+        break;

+    default:

+        assert(0);

+        break;

+

+    }

+

+    if (is_free_data) {

+        g_free(pdata);

+    }

+}

+

+int repagent_client_send(RepCmd *p_cmd)

+{

+    int bytecount = 0;

+    printf("Send cmd %u, data size %u\n", p_cmd->hdr.cmdid,

+            p_cmd->hdr.data_size_bytes);

+    if (!g_client_state.is_connected) {

+        printf("Not connected to hub\n");

+        return -1;

+    }

+

+    bytecount = send(g_client_state.hsock, p_cmd,

+            sizeof(RepCmd) + p_cmd->hdr.data_size_bytes, 0);

+    if (bytecount < sizeof(RepCmd) + p_cmd->hdr.data_size_bytes) {

+        printf("Bad send %d, errno %d\n", bytecount, errno);

+        return bytecount;

+    }

+

+    /* Success */

+    return 0;

+}

diff --git a/replication/repagent_client.h b/replication/repagent_client.h

new file mode 100644

index 0000000..62a5377

--- /dev/null

+++ b/replication/repagent_client.h

@@ -0,0 +1,36 @@

+/*

+ * QEMU System Emulator

+ *

+ * Copyright (c) 2003-2008 Fabrice Bellard

+ *

+ * Permission is hereby granted, free of charge, to any person obtaining a
copy

+ * of this software and associated documentation files (the "Software"),
to deal

+ * in the Software without restriction, including without limitation the
rights

+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell

+ * copies of the Software, and to permit persons to whom the Software is

+ * furnished to do so, subject to the following conditions:

+ *

+ * The above copyright notice and this permission notice shall be included
in

+ * all copies or substantial portions of the Software.

+ *

+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR

+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER

+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM,

+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN

+ * THE SOFTWARE.

+ */

+#ifndef REPAGENT_CLIENT_H

+#define REPAGENT_CLIENT_H

+#include "repcmd.h"

+

+typedef struct rephub_params {

+    char *name;

+    int port;

+} rephub_params;

+

+void *repagent_listen(void *pParam);

+int repagent_client_send(RepCmd *p_cmd);

+

+#endif /* REPAGENT_CLIENT_H */

diff --git a/replication/repcmd.h b/replication/repcmd.h

new file mode 100644

index 0000000..8c6cf1b

--- /dev/null

+++ b/replication/repcmd.h

@@ -0,0 +1,59 @@

+/*

+ * QEMU System Emulator

+ *

+ * Copyright (c) 2003-2008 Fabrice Bellard

+ *

+ * Permission is hereby granted, free of charge, to any person obtaining a
copy

+ * of this software and associated documentation files (the "Software"),
to deal

+ * in the Software without restriction, including without limitation the
rights

+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell

+ * copies of the Software, and to permit persons to whom the Software is

+ * furnished to do so, subject to the following conditions:

+ *

+ * The above copyright notice and this permission notice shall be included
in

+ * all copies or substantial portions of the Software.

+ *

+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR

+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER

+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM,

+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN

+ * THE SOFTWARE.

+ */

+#ifndef REPCMD_H

+#define REPCMD_H

+

+#include <stdint.h>

+

+#define REPCMD_MAGIC1 (0x1122)

+#define REPCMD_MAGIC2 (0x3344)

+#define REPCMD_NUM_U32_PARAMS (11)

+

+enum RepCmds {

+    REPCMD_FIRST_INVALID                    = 0,

+    REPCMD_FIRST_HUBCMD                     = 1,

+    REPHUB_CMD_PROTECTED_WRITE              = 2,

+    REPHUB_CMD_REPORT_VM_VOLUMES            = 3,

+    REPHUB_CMD_START_PROTECT                = 4,

+    REPHUB_CMD_READ_VOL_REQ                 = 5,

+    REPHUB_CMD_READ_VOL_RES                 = 6,

+    REPHUB_CMD_AGENT_SHUTDOWN               = 7,

+};

+

+typedef struct RepCmdHdr {

+    uint16_t magic1;

+    uint16_t cmdid;

+    uint32_t data_size_bytes;

+} RepCmdHdr;

+

+typedef struct RepCmd {

+    RepCmdHdr hdr;

+    unsigned int parameters[REPCMD_NUM_U32_PARAMS];

+    unsigned int magic2;

+    uint8_t data[0];

+} RepCmd;

+

+RepCmd *repcmd_new(int cmd_id, int data_size, uint8_t **p_out_pdata);

+

+#endif /* REPCMD_H */
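The fixed-size `RepCmd` above carries two magic words — one at the start of the header and one after the parameter array — so a receiver can sanity-check framing on both sides of the fixed part before trusting `data_size_bytes`. A minimal standalone sketch of building and validating such a command (mirroring the layout from repcmd.h; `cmd_new`/`cmd_is_valid` are illustrative names, not from the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define REPCMD_MAGIC1 (0x1122)
#define REPCMD_MAGIC2 (0x3344)
#define REPCMD_NUM_U32_PARAMS (11)

typedef struct RepCmdHdr {
    uint16_t magic1;
    uint16_t cmdid;
    uint32_t data_size_bytes;
} RepCmdHdr;

typedef struct RepCmd {
    RepCmdHdr hdr;
    unsigned int parameters[REPCMD_NUM_U32_PARAMS];
    unsigned int magic2;
    uint8_t data[];   /* payload follows the fixed part (the patch uses data[0]) */
} RepCmd;

/* Allocate a command with a trailing payload, as repcmd_new() does. */
static RepCmd *cmd_new(uint16_t cmd_id, uint32_t data_size, uint8_t **out_data)
{
    RepCmd *cmd = calloc(1, sizeof(RepCmd) + data_size);
    assert(cmd != NULL);
    cmd->hdr.magic1 = REPCMD_MAGIC1;
    cmd->hdr.cmdid = cmd_id;
    cmd->hdr.data_size_bytes = data_size;
    cmd->magic2 = REPCMD_MAGIC2;
    if (out_data) {
        *out_data = cmd->data;
    }
    return cmd;
}

/* Receiver-side sanity check on both magics. */
static int cmd_is_valid(const RepCmd *cmd)
{
    return cmd->hdr.magic1 == REPCMD_MAGIC1 && cmd->magic2 == REPCMD_MAGIC2;
}
```

A receiver that has accumulated `sizeof(RepCmd)` bytes can check both magics first, and only then allocate `data_size_bytes` for the payload — which is exactly the order the listener in repcmd_listener.c follows.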

diff --git a/replication/repcmd_listener.c b/replication/repcmd_listener.c

new file mode 100644

index 0000000..a211927

--- /dev/null

+++ b/replication/repcmd_listener.c

@@ -0,0 +1,137 @@

+#include <fcntl.h>

+#include <string.h>

+#include <stdlib.h>

+#include <errno.h>

+#include <stdio.h>

+#include <netinet/in.h>

+#include <resolv.h>

+#include <sys/socket.h>

+#include <arpa/inet.h>

+#include <unistd.h>

+#include <pthread.h>

+#include <assert.h>

+

+/* Use the CONFIG_REPLICATION flag to determine whether

+ * we're under a qemu build or a hub. When under

+ * qemu, use g_malloc */

+#ifdef CONFIG_REPLICATION

+#include <glib.h>

+#define REPCMD_MALLOC g_malloc

+#else

+#define REPCMD_MALLOC malloc

+#endif

+

+#include "repcmd.h"

+#include "repcmd_listener.h"

+

+#define ZERO_MEM_OBJ(pObj) memset((void *)pObj, 0, sizeof(*pObj))

+

+typedef struct RepCmdListenerState {

+    int is_terminate_receive;

+} RepCmdListenerState;

+

+static RepCmdListenerState g_listenerState = { 0 };

+

+/* Returns 0 on initiated termination or disconnect, or the errno value on socket error */

+int repcmd_listener(int hsock, pfn_received_cmd_cb callback, void
*clientPtr)

+{

+    RepCmd curCmd;

+    uint8_t *pReadBuf = (uint8_t *) &curCmd;

+    int bytesToGet = sizeof(RepCmd);

+    int bytesGotten = 0;

+    int isGotHeader = 0;

+    uint8_t *pdata = NULL;

+

+    assert(callback != NULL);

+

+    /* receive loop */

+    while (!g_listenerState.is_terminate_receive) {

+        int bytecount;

+

+        bytecount = recv(hsock, pReadBuf + bytesGotten,

+                bytesToGet - bytesGotten, 0);

+        if (bytecount == -1) {

+            fprintf(stderr, "Error receiving data %d\n", errno);

+            return errno;

+        }

+

+        if (bytecount == 0) {

+            printf("Disconnected\n");

+            return 0;

+        }

+        bytesGotten += bytecount;

+/*     printf("Received bytes %d, got %d/%d\n",

+                bytecount, bytesGotten, bytesToGet); */

+        /* print content */

+        if (0) {

+            int i;

+            for (i = 0; i < bytecount ; i += 4) {

+                /*printf("%d/%d", i, bytecount/4); */

+                printf("%#x ",

+                        *(int *) (&pReadBuf[bytesGotten - bytecount + i]));

+

+            }

+            printf("\n");

+        }

+        assert(bytesGotten <= bytesToGet);

+        if (bytesGotten == bytesToGet) {

+            int isGotData = 0;

+            bytesGotten = 0;

+            if (!isGotHeader) {

+                /* We just got the header */

+                isGotHeader = 1;

+

+                assert(curCmd.hdr.magic1 == REPCMD_MAGIC1);

+                assert(curCmd.magic2 == REPCMD_MAGIC2);

+                if (curCmd.hdr.data_size_bytes > 0) {

+                    pdata = (uint8_t *)REPCMD_MALLOC(

+                                curCmd.hdr.data_size_bytes);

+/*                    printf("malloc %p\n", pdata); */

+                    pReadBuf = pdata;

+                } else {

+                    /* no data */

+                    isGotData = 1;

+                    pdata = NULL;

+                }

+                bytesToGet = curCmd.hdr.data_size_bytes;

+            } else {

+                isGotData = 1;

+            }

+

+            if (isGotData) {

+                /* Got command and data */

+                (*callback)(&curCmd, pdata, clientPtr);

+

+                /* It's the callee's responsibility to free pData */

+                pdata = NULL;

+                ZERO_MEM_OBJ(&curCmd);

+                pReadBuf = (uint8_t *) &curCmd;

+                bytesGotten = 0;

+                bytesToGet = sizeof(RepCmd);

+                isGotHeader = 0;

+            }

+        }

+    }

+    return 0;

+}

+

+RepCmd *repcmd_new(int cmd_id, int data_size, uint8_t **p_out_pdata)

+{

+    RepCmd *p_cmd = (RepCmd *)REPCMD_MALLOC(sizeof(RepCmd) + data_size);

+    assert(p_cmd != NULL);

+

+    /* Zero the CMD (not the data) */

+    ZERO_MEM_OBJ(p_cmd);

+

+    p_cmd->hdr.cmdid = cmd_id;

+    p_cmd->hdr.magic1 = REPCMD_MAGIC1;

+    p_cmd->magic2 = REPCMD_MAGIC2;

+    p_cmd->hdr.data_size_bytes = data_size;

+

+    if (p_out_pdata != NULL) {

+        *p_out_pdata = p_cmd->data;

+    }

+

+    return p_cmd;

+}

+
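The receive loop above is a two-phase state machine: it first accumulates `sizeof(RepCmd)` header bytes, then — if `data_size_bytes` is non-zero — retargets the read buffer at a freshly allocated payload before invoking the callback. A standalone sketch of the same framing, fed from an in-memory byte stream instead of a socket (the `Framer` type and function names are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct Hdr {
    uint32_t data_size;     /* payload bytes following the header */
} Hdr;

typedef struct Framer {
    Hdr hdr;
    uint8_t *payload;
    size_t got;             /* bytes accumulated in the current phase */
    int have_hdr;           /* 0: reading header, 1: reading payload */
} Framer;

typedef void (*frame_cb)(const Hdr *hdr, const uint8_t *payload);

/* Feed an arbitrary chunk of bytes; invoke cb once per complete frame. */
static void framer_feed(Framer *f, const uint8_t *buf, size_t len, frame_cb cb)
{
    while (len > 0) {
        uint8_t *dst = f->have_hdr ? f->payload : (uint8_t *)&f->hdr;
        size_t want = f->have_hdr ? f->hdr.data_size : sizeof(Hdr);
        size_t take = (want - f->got < len) ? want - f->got : len;

        memcpy(dst + f->got, buf, take);
        f->got += take;
        buf += take;
        len -= take;

        if (f->got < want) {
            continue;                   /* phase incomplete, wait for more */
        }
        f->got = 0;
        if (!f->have_hdr && f->hdr.data_size > 0) {
            f->payload = malloc(f->hdr.data_size);
            assert(f->payload != NULL);
            f->have_hdr = 1;            /* switch to the payload phase */
        } else {
            cb(&f->hdr, f->payload);    /* complete frame (payload may be NULL) */
            free(f->payload);
            f->payload = NULL;
            f->have_hdr = 0;
        }
    }
}
```

Because the state lives in the `Framer` struct rather than in loop-local variables, this variant also tolerates `recv()` returning the header and payload split across arbitrary chunk boundaries — the property the loop above relies on when it compares `bytesGotten` against `bytesToGet`.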

diff --git a/replication/repcmd_listener.h b/replication/repcmd_listener.h

new file mode 100644

index 0000000..c09a12e

--- /dev/null

+++ b/replication/repcmd_listener.h

@@ -0,0 +1,32 @@

+/*

+ * QEMU System Emulator

+ *

+ * Copyright (c) 2003-2008 Fabrice Bellard

+ *

+ * Permission is hereby granted, free of charge, to any person obtaining a
copy

+ * of this software and associated documentation files (the "Software"),
to deal

+ * in the Software without restriction, including without limitation the
rights

+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell

+ * copies of the Software, and to permit persons to whom the Software is

+ * furnished to do so, subject to the following conditions:

+ *

+ * The above copyright notice and this permission notice shall be included
in

+ * all copies or substantial portions of the Software.

+ *

+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR

+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER

+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM,

+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN

+ * THE SOFTWARE.

+ */

+#ifndef REPCMD_LISTENER_H

+#define REPCMD_LISTENER_H

+#include <stdint.h>

+typedef void (*pfn_received_cmd_cb)(RepCmd *pCmd,

+                uint8_t *pData, void *clientPtr);

+

+int repcmd_listener(int hsock, pfn_received_cmd_cb callback, void
*clientPtr);

+

+#endif /* REPCMD_LISTENER_H */

diff --git a/replication/rephub_cmds.h b/replication/rephub_cmds.h

new file mode 100644

index 0000000..820c37d

--- /dev/null

+++ b/replication/rephub_cmds.h

@@ -0,0 +1,150 @@

+/*

+ * QEMU System Emulator

+ *

+ * Copyright (c) 2003-2008 Fabrice Bellard

+ *

+ * Permission is hereby granted, free of charge, to any person obtaining a
copy

+ * of this software and associated documentation files (the "Software"),
to deal

+ * in the Software without restriction, including without limitation the
rights

+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell

+ * copies of the Software, and to permit persons to whom the Software is

+ * furnished to do so, subject to the following conditions:

+ *

+ * The above copyright notice and this permission notice shall be included
in

+ * all copies or substantial portions of the Software.

+ *

+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR

+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER

+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM,

+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN

+ * THE SOFTWARE.

+ */

+#ifndef REPHUB_CMDS_H

+#define REPHUB_CMDS_H

+

+#include <stdint.h>

+#include "repcmd.h"

+#include "rephub_defs.h"

+

+/*********************************************************

+ * RepCmd Report a protected IO

+ *

+ * REPHUB_CMD_PROTECTED_WRITE

+ * Direction: agent->hub

+ *

+ * Any write to a protected volume is sent with this

+ * message to the hub, along with its status.

+ * In case the status is bad, no data is sent

+ *********************************************************/

+typedef struct RepCmdProtectedWrite {

+    RepCmdHdr hdr;

+    uint64_t volume_id;

+    uint64_t offset_sectors;

+    /* The size field duplicates the RepCmd size,

+     * but it is needed for reporting failed IOs' sizes */

+    uint32_t size_sectors;

+    int ret_status;

+} RepCmdProtectedWrite;

+

+/*********************************************************

+ * RepCmd Report VM volumes

+ *

+ * REPHUB_CMD_REPORT_VM_VOLUMES

+ * Direction: agent->hub

+ *

+ * The agent reports all the volumes of the VM

+ * to the hub.

+ *********************************************************/

+typedef struct RepVmVolumeInfo {

+    char name[REPHUB_MAX_VOL_NAME_LEN];

+    uint64_t volume_id;

+    uint32_t size_mb;

+} RepVmVolumeInfo;

+

+typedef struct RepCmdReportVmVolumes {

+    RepCmdHdr hdr;

+    int num_volumes;

+} RepCmdReportVmVolumes;

+

+typedef struct RepCmdDataReportVmVolumes {

+    RepVmVolumeInfo volumes[0];

+} RepCmdDataReportVmVolumes;

+

+

+/*********************************************************

+ * RepCmd Start protect

+ *

+ * REPHUB_CMD_START_PROTECT

+ * Direction: hub->agent

+ *

+ * The hub instructs the agent to start protecting

+ * a volume. When a volume is protected all its writes

+ * are sent to the hub.

+ * With this command the hub also assigns a volume ID to

+ * the given volume name.

+ *********************************************************/

+typedef struct RepCmdStartProtect {

+    RepCmdHdr hdr;

+    uint64_t volume_id;

+} RepCmdStartProtect;

+

+typedef struct RepCmdDataStartProtect {

+    char volume_name[REPHUB_MAX_VOL_NAME_LEN];

+} RepCmdDataStartProtect;

+

+

+/*********************************************************

+ * RepCmd Read Volume Request

+ *

+ * REPHUB_CMD_READ_VOL_REQ

+ * Direction: hub->agent

+ *

+ * The hub issues a read IO to a protected volume.

+ * This command is used during sync - when the hub needs

+ * to read unsynchronized sections of a protected volume.

+ * This command is a request, the read data is returned

+ * by the response command REPHUB_CMD_READ_VOL_RES

+ *********************************************************/

+typedef struct RepCmdReadVolReq {

+    RepCmdHdr hdr;

+    int req_id;

+    int size_sectors;

+    uint64_t volume_id;

+    uint64_t offset_sectors;

+} RepCmdReadVolReq;

+

+/*********************************************************

+ * RepCmd Read Volume Response

+ *

+ * REPHUB_CMD_READ_VOL_RES

+ * Direction: agent->hub

+ *

+ * A response to REPHUB_CMD_READ_VOL_REQ.

+ * Sends the data read from a protected volume

+ *********************************************************/

+typedef struct RepCmdReadVolRes {

+    RepCmdHdr hdr;

+    int req_id;

+    int is_status_success;

+    uint64_t volume_id;

+} RepCmdReadVolRes;

+

+/*********************************************************

+ * RepCmd Agent shutdown

+ *

+ * REPHUB_CMD_AGENT_SHUTDOWN

+ * Direction: agent->hub

+ *

+ * Notifies the hub that the agent is about to shut down.

+ * This allows a graceful shutdown. Any disconnection

+ * of an agent without sending this command will result

+ * in a full sync of the VM volumes.

+ *********************************************************/

+typedef struct RepCmdAgentShutdown {

+    RepCmdHdr hdr;

+} RepCmdAgentShutdown;

+

+

+#endif /* REPHUB_CMDS_H */
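The `RepCmdProtectedWrite` layout above addresses writes in sectors rather than bytes. A hedged sketch of how an agent might translate a byte-granular guest write into those fields — the 512-byte sector size and the helper name are assumptions for illustration, not taken from the patch:

```c
#include <stdint.h>

/* QEMU's block layer traditionally works in 512-byte sectors;
 * treated here as an assumption for illustration. */
#define SECTOR_SIZE 512

typedef struct ProtectedWrite {
    uint64_t volume_id;
    uint64_t offset_sectors;
    uint32_t size_sectors;
    int ret_status;
} ProtectedWrite;

/* Fill the notification fields for a guest write given in bytes.
 * Returns 0 on success, -1 if the write is not sector-aligned. */
static int make_protected_write(ProtectedWrite *pw, uint64_t volume_id,
                                uint64_t offset_bytes, uint64_t len_bytes,
                                int ret_status)
{
    if (offset_bytes % SECTOR_SIZE || len_bytes % SECTOR_SIZE) {
        return -1;
    }
    pw->volume_id = volume_id;
    pw->offset_sectors = offset_bytes / SECTOR_SIZE;
    pw->size_sectors = (uint32_t)(len_bytes / SECTOR_SIZE);
    pw->ret_status = ret_status;
    return 0;
}
```

Note how `size_sectors` duplicates information carried by the command's data size, as the comment in the struct explains: for a failed IO no data is attached, so the size must travel in the command body itself.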

diff --git a/replication/rephub_defs.h b/replication/rephub_defs.h

new file mode 100644

index 0000000..e34e0ce

--- /dev/null

+++ b/replication/rephub_defs.h

@@ -0,0 +1,40 @@

+/*

+ * QEMU System Emulator

+ *

+ * Copyright (c) 2003-2008 Fabrice Bellard

+ *

+ * Permission is hereby granted, free of charge, to any person obtaining a
copy

+ * of this software and associated documentation files (the "Software"),
to deal

+ * in the Software without restriction, including without limitation the
rights

+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or
sell

+ * copies of the Software, and to permit persons to whom the Software is

+ * furnished to do so, subject to the following conditions:

+ *

+ * The above copyright notice and this permission notice shall be included
in

+ * all copies or substantial portions of the Software.

+ *

+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR

+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,

+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL

+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER

+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM,

+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
IN

+ * THE SOFTWARE.

+ */

+#ifndef REP_HUB_DEFS_H

+#define REP_HUB_DEFS_H

+

+#include <stdint.h>

+

+#define REPHUB_MAX_VOL_NAME_LEN (1024)

+#define REPHUB_MAX_NUM_VOLUMES (512)

+

+#ifndef TRUE

+    #define TRUE (1)

+#endif

+

+#ifndef FALSE

+    #define FALSE (0)

+#endif

+

+#endif /* REP_HUB_DEFS_H */

diff --git a/vl.c b/vl.c

index 624da0f..506b5dc 100644

--- a/vl.c

+++ b/vl.c

@@ -167,6 +167,7 @@ int main(int argc, char **argv)

 #include "ui/qemu-spice.h"

+#include "replication/repagent.h"

//#define DEBUG_NET

//#define DEBUG_SLIRP

@@ -2307,6 +2308,15 @@ int main(int argc, char **argv, char **envp)

                 drive_add(IF_DEFAULT, popt->index - QEMU_OPTION_hda,
optarg,

                           HD_OPTS);

                 break;

+            case QEMU_OPTION_repagent:

+#ifdef CONFIG_REPLICATION

+                repagent_init(optarg, 0);

+#else

+                fprintf(stderr, "Replication support is disabled. "

+                    "Don't use -repagent option.\n");

+                exit(1);

+#endif

+                break;

             case QEMU_OPTION_drive:

                 if (drive_def(optarg) == NULL) {

                     exit(1);

-- 

1.7.6.4



* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 10:29 [Qemu-devel] [RFC PATCH] replication agent module Ori Mamluk
@ 2012-02-07 12:12 ` Anthony Liguori
  2012-02-07 12:25   ` Dor Laor
  2012-02-07 13:34 ` Kevin Wolf
  1 sibling, 1 reply; 66+ messages in thread
From: Anthony Liguori @ 2012-02-07 12:12 UTC (permalink / raw)
  To: Ori Mamluk; +Cc: Kevin Wolf, dlaor, qemu-devel

Hi,

On 02/07/2012 04:29 AM, Ori Mamluk wrote:
> Repagent is a new module that allows an external replication system to
> replicate a volume of a Qemu VM.
>
> This RFC patch adds the repagent client module to Qemu.

Please read http://wiki.qemu.org/Contribute/SubmitAPatch

In particular, use a tool like git-send-email and split this patch up into more 
manageable chunks.

Is there an Open Source rehub available?  As a project policy, adding external 
APIs specifically for proprietary software is not something we're willing to do.

Regards,

Anthony Liguori

> Documentation of the module role and API is in the patch at
> replication/qemu-repagent.txt
>
>
>
> The main motivation behind the module is to allow replication of VMs in a
> virtualization environment like RhevM.
>
> To achieve this we need basic replication support in Qemu.
>
>
>
> This is the first submission of this module, which was written as a Proof
> Of Concept, and used successfully for replicating and recovering a Qemu VM.
>
> Points and open issues:
>
> *             The module interfaces the Qemu storage stack at block.c
> generic layer. Is this the right place to intercept/inject IOs?
>
> *             The patch contains performing IO reads invoked by a new
> thread (a TCP listener thread). See repaget_read_vol in repagent.c. It is
> not protected by any lock – is this OK?
>
> *             VM ID – the replication system implies an environment with
> several VMs connected to a central replication system (Rephub).
>
>                  This requires some sort of identification for a VM. The
> current patch does not include a VM ID – I did not find any adequate ID to
> use.
>
>                  Any suggestions?
>
>
>
> Appreciate any feedback or suggestions.  Thanks,
>
> Ori.
>
>
>
>
>
>  From 5a0d88689ddcf325f25fdfca2a2012f1bbf141b9 Mon Sep 17 00:00:00 2001
>
> From: Ori Mamluk<orim@orim-fedora.(none)>
>
> Date: Tue, 7 Feb 2012 11:12:12 +0200
>
> Subject: [PATCH] Added replication agent module (repagent) to Qemu under
>
> replication directory, added repagent configure and run
>
> options, and the repagent API usage in bloc
>
>
>
> Added build options to ./configure:  --enable-replication --disable-replicat
>
> Added a commandline option to enable: -repagent<rep hub IP>
>
> Added the module files under replication.
>
>
>
> Signed-off-by: Ori Mamluk<orim@zerto.com>
>
> ---
>
> Makefile                      |    9 +-
>
> Makefile.objs                 |    6 +
>
> block.c                       |   20 +++-
>
> configure                     |   11 ++
>
> qemu-options.hx               |    6 +
>
> replication/qemu-repagent.txt |  104 +++++++++++++
>
> replication/repagent.c        |  322
> +++++++++++++++++++++++++++++++++++++++++
>
> replication/repagent.h        |   46 ++++++
>
> replication/repagent_client.c |  138 ++++++++++++++++++
>
> replication/repagent_client.h |   36 +++++
>
> replication/repcmd.h          |   59 ++++++++
>
> replication/repcmd_listener.c |  137 +++++++++++++++++
>
> replication/repcmd_listener.h |   32 ++++
>
> replication/rephub_cmds.h     |  150 +++++++++++++++++++
>
> replication/rephub_defs.h     |   40 +++++
>
> vl.c                          |   10 ++
>
> 16 files changed, 1121 insertions(+), 5 deletions(-)
>
> mode change 100644 =>  100755 Makefile.objs
>
> mode change 100644 =>  100755 qemu-options.hx
>
> create mode 100755 replication/qemu-repagent.txt
>
> create mode 100644 replication/repagent.c
>
> create mode 100644 replication/repagent.h
>
> create mode 100644 replication/repagent_client.c
>
> create mode 100644 replication/repagent_client.h
>
> create mode 100644 replication/repcmd.h
>
> create mode 100644 replication/repcmd_listener.c
>
> create mode 100644 replication/repcmd_listener.h
>
> create mode 100644 replication/rephub_cmds.h
>
> create mode 100644 replication/rephub_defs.h
>
>
>
> diff --git a/Makefile b/Makefile
>
> index 4f6eaa4..a1b3701 100644
>
> --- a/Makefile
>
> +++ b/Makefile
>
> @@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o
> qemu-ga.o: $(GENERATED_HEADERS
>
> tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \
>
>                 qemu-timer-common.o cutils.o
>
> -qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
>
> -qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
>
> -qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
>
> +qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
> $(replication-obj-y)
>
> +qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
> $(replication-obj-y)
>
> +qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
> $(replication-obj-y)
>
>   qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
>
>                 $(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h<  $<  >
> $@,"  GEN   $@")
>
> @@ -228,6 +228,7 @@ clean:
>
>                 rm -f trace-dtrace.dtrace trace-dtrace.dtrace-timestamp
>
>                 rm -f trace-dtrace.h trace-dtrace.h-timestamp
>
>                 rm -rf $(qapi-dir)
>
> +             rm -f replication/*.{o,d}
>
>                 $(MAKE) -C tests clean
>
>                 for d in $(ALL_SUBDIRS) $(QEMULIBS) libcacard; do \
>
>                 if test -d $$d; then $(MAKE) -C $$d $@ || exit 1; fi; \
>
> @@ -387,4 +388,4 @@ tar:
>
>                 rm -rf /tmp/$(FILE)
>
>   # Include automatically generated dependency files
>
> --include $(wildcard *.d audio/*.d slirp/*.d block/*.d net/*.d ui/*.d
> qapi/*.d qga/*.d)
>
> +-include $(wildcard *.d audio/*.d slirp/*.d block/*.d net/*.d ui/*.d
> qapi/*.d qga/*.d replication/*.d)
>
> diff --git a/Makefile.objs b/Makefile.objs
>
> old mode 100644
>
> new mode 100755
>
> index d7a6539..dbd6f15
>
> --- a/Makefile.objs
>
> +++ b/Makefile.objs
>
> @@ -74,6 +74,7 @@ fsdev-obj-$(CONFIG_VIRTFS) += $(addprefix fsdev/,
> $(fsdev-nested-y))
>
> # CPUs and machines.
>
>   common-obj-y = $(block-obj-y) blockdev.o
>
> +common-obj-y += $(replication-obj-$(CONFIG_REPLICATION))
>
> common-obj-y += $(net-obj-y)
>
> common-obj-y += $(qobject-obj-y)
>
> common-obj-$(CONFIG_LINUX) += $(fsdev-obj-$(CONFIG_LINUX))
>
> @@ -413,6 +414,11 @@ common-obj-y += qmp-marshal.o qapi-visit.o
> qapi-types.o $(qapi-obj-y)
>
> common-obj-y += qmp.o hmp.o
>
>   ######################################################################
>
> +# replication
>
> +replication-nested-y = repagent_client.o  repagent.o  repcmd_listener.o
>
> +replication-obj-y = $(addprefix replication/, $(replication-nested-y))
>
> +
>
> +######################################################################
>
> # guest agent
>
>   qga-nested-y = guest-agent-commands.o guest-agent-command-state.o
>
> diff --git a/block.c b/block.c
>
> index 9bb236c..f3b8387 100644
>
> --- a/block.c
>
> +++ b/block.c
>
> @@ -31,6 +31,10 @@
>
> #include "qemu-coroutine.h"
>
> #include "qmp-commands.h"
>
> +#ifdef CONFIG_REPLICATION
>
> +#include "replication/repagent.h"
>
> +#endif
>
> +
>
> #ifdef CONFIG_BSD
>
> #include<sys/types.h>
>
> #include<sys/stat.h>
>
> @@ -640,6 +644,9 @@ int bdrv_open(BlockDriverState *bs, const char
> *filename, int flags,
>
>           goto unlink_and_fail;
>
>       }
>
> +#ifdef CONFIG_REPLICATION
>
> +    repagent_register_drive(filename,  bs);
>
> +#endif
>
>       /* Open the image */
>
>       ret = bdrv_open_common(bs, filename, flags, drv);
>
>       if (ret<  0) {
>
> @@ -1292,6 +1299,17 @@ static int coroutine_fn
> bdrv_co_do_writev(BlockDriverState *bs,
>
>       ret = drv->bdrv_co_writev(bs, sector_num, nb_sectors, qiov);
>
> +
>
> +#ifdef CONFIG_REPLICATION
>
> +    if (bs->device_name[0] != '\0') {
>
> +        /* We split the IO only at the highest stack driver layer.
>
> +           Currently we know that by checking device_name - only
>
> +           highest level (closest to the guest) has that name.
>
> +           */
>
> +           repagent_handle_protected_write(bs, sector_num,
>
> +                nb_sectors, qiov, ret);
>
> +    }
>
> +#endif
>
>       if (bs->dirty_bitmap) {
>
>           set_dirty_bitmap(bs, sector_num, nb_sectors, 1);
>
>       }
>
> @@ -1783,7 +1801,7 @@ int bdrv_has_zero_init(BlockDriverState *bs)
>
>    * 'nb_sectors' is the max value 'pnum' should be set to.
>
>    */
>
> int bdrv_is_allocated(BlockDriverState *bs, int64_t sector_num, int
> nb_sectors,
>
> -              int *pnum)
>
> +    int *pnum)
>
> {
>
>       int64_t n;
>
>       if (!bs->drv->bdrv_is_allocated) {
>
> diff --git a/configure b/configure
>
> index 9e5da44..93d600e 100755
>
> --- a/configure
>
> +++ b/configure
>
> @@ -179,6 +179,7 @@ spice=""
>
> rbd=""
>
> smartcard=""
>
> smartcard_nss=""
>
> +replication=""
>
> usb_redir=""
>
> opengl=""
>
> zlib="yes"
>
> @@ -772,6 +773,10 @@ for opt do
>
>     ;;
>
>     --enable-smartcard-nss) smartcard_nss="yes"
>
>     ;;
>
> +  --disable-replication) replication="no"
>
> +  ;;
>
> +  --enable-replication) replication="yes"
>
> +  ;;
>
>     --disable-usb-redir) usb_redir="no"
>
>     ;;
>
>     --enable-usb-redir) usb_redir="yes"
>
> @@ -1067,6 +1072,7 @@ echo "  --disable-usb-redir      disable usb network
> redirection support"
>
> echo "  --enable-usb-redir       enable usb network redirection support"
>
> echo "  --disable-guest-agent    disable building of the QEMU Guest Agent"
>
> echo "  --enable-guest-agent     enable building of the QEMU Guest Agent"
>
> +echo "  --enable-replication     enable replication support"
>
> echo ""
>
> echo "NOTE: The object files are built at the place where configure is
> launched"
>
> exit 1
>
> @@ -2733,6 +2739,7 @@ echo "curl support      $curl"
>
> echo "check support     $check_utests"
>
> echo "mingw32 support   $mingw32"
>
> echo "Audio drivers     $audio_drv_list"
>
> +echo "Replication          $replication"
>
> echo "Extra audio cards $audio_card_list"
>
> echo "Block whitelist   $block_drv_whitelist"
>
> echo "Mixer emulation   $mixemu"
>
> @@ -3080,6 +3087,10 @@ if test "$smartcard_nss" = "yes" ; then
>
>     echo "CONFIG_SMARTCARD_NSS=y">>  $config_host_mak
>
> fi
>
> +if test "$replication" = "yes" ; then
>
> +  echo "CONFIG_REPLICATION=y">>  $config_host_mak
>
> +fi
>
> +
>
> if test "$usb_redir" = "yes" ; then
>
>     echo "CONFIG_USB_REDIR=y">>  $config_host_mak
>
> fi
>
> diff --git a/qemu-options.hx b/qemu-options.hx
>
> old mode 100644
>
> new mode 100755
>
> index 681eaf1..c97e4f8
>
> --- a/qemu-options.hx
>
> +++ b/qemu-options.hx
>
> @@ -2602,3 +2602,9 @@ HXCOMM This is the last statement. Insert new options
> before this line!
>
> STEXI
>
> @end table
>
> ETEXI
>
> +
>
> +DEF("repagent", HAS_ARG, QEMU_OPTION_repagent,
>
> +    "-repagent [hub IP/name]\n"
>
> +    "                Enable replication support for disks\n"
>
> +    "                hub is the IP or name of the machine running the
> replication hub.\n",
>
> +    QEMU_ARCH_ALL)
>
> diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt
>
> new file mode 100755
>
> index 0000000..e3b0c1e
>
> --- /dev/null
>
> +++ b/replication/qemu-repagent.txt
>
> @@ -0,0 +1,104 @@
>
> +             repagent - replication agent - a Qemu module for enabling
> continuous async replication of VM volumes
>
> +
>
> +Introduction
>
> +             This document describes a feature in Qemu - a replication
> agent (AKA Repagent).
>
> +             The Repagent is a new module that exposes an API to an
> external replication system (AKA Rephub).
>
> +             This API allows a Rephub to communicate with a Qemu VM and
> continuously replicate its volumes.
>
> +             The implementation of a Rephub is outside the scope of this
> document. There may be several Rephub
>
> +             implementations using the same repagent in Qemu.
>
> +
>
> +Main feature of Repagent
>
> +             Repagent does the following:
>
> +             * Report volumes - report a list of all volumes in a VM to
> the Rephub.
>
> +             * Report writes to a volume - send all writes made to a
> protected volume to the Rephub.
>
> +                             The reporting of an IO is asynchronous - i.e.
> the IO is not delayed by the Repagent to get any acknowledgement from the
> Rephub.
>
> +                             It is only copied to the Rephub.
>
> +             * Read a protected volume - allows the Rephub to read a
> protected volume, to enable the hub to synchronize the content of
> a protected volume.
>
> +
>
> +Description of the Repagent module
>
> +
>
> +Build and run options
>
> +             New configure option: --enable-replication
>
> +             New command line option:
>
> +             -repagent [hub IP/name]
>
> +
> Enable replication support for disks
>
> +
> hub is the IP or name of the machine running the replication hub.
>
> +
>
> +Module APIs
>
> +             The Repagent module interfaces two main components:
>
> +             1. The Rephub - An external API based on socket messages
>
> +             2. The generic block layer- block.c
>
> +
>
> +             Rephub message API
>
> +                             The external replication API is a message
> based API.
>
> +                             We won't go into the structure of the
> messages here - just the semantics.
>
> +
>
> +                             Messages list
>
> +                                             (The updated list and
> comments are in rephub_cmds.h)
>
> +
>
> +                                             Messages from the Repagent to
> the Rephub:
>
> +                                             * Protected write
>
> +                                                             The Repagent
> sends each write to a protected volume to the hub with the IO status.
>
> +                                                             If the
> status indicates failure, the write content is not sent.
>
> +                                             * Report VM volumes
>
> +                                                             The agent
> reports all the volumes of the VM to the hub.
>
> +                                             * Read Volume Response
>
> +                                                             A response to
> a Read Volume Request
>
> +                                                             Sends the
> data read from a protected volume to the hub
>
> +                                             * Agent shutdown
>
> +                                                             Notifies the
> hub that the agent is about to shutdown.
>
> +                                                             This allows a
> graceful shutdown. Any disconnection of an agent without
>
> +                                                             sending this
> command will result in a full sync of the VM volumes.
>
> +
>
> +                                             Messages from the Rephub to
> the Repagent:
>
> +                                             * Start protect
>
> +                                                             The hub
> instructs the agent to start protecting a volume. When a volume is protected
>
> +                                                             all its
> writes are sent to the hub.
>
> +                                                             With this
> command the hub also assigns a volume ID to the given volume name.
>
> +                                             * Read volume request
>
> +                                                             The hub
> issues a read IO to a protected volume.
>
> +                                                             This command
> is used during sync - when the hub needs to read unsynchronized
>
> +                                                             sections of a
> protected volume.
>
> +                                                             This command
> is a request; the read data is returned by the Read Volume Response message
> (see above).
>
> +             block.c API
>
> +                             The API to the generic block storage layer
> contains 3 functionalities:
>
> +                             1. Handle writes to protected volumes
>
> +                                             In bdrv_co_do_writev, each
> write is reported to the Repagent module.
>
> +                             2. Handle each new volume that registers
>
> +                                             In bdrv_open - each new
> bottom-level block driver that registers is reported.
>
> +                             3. Read from a volume
>
> +                                             Repagent calls bdrv_aio_readv
> to handle read requests coming from the hub.
>
> +
>
> +
>
> +General description of a Rephub  - a replication system the repagent
> connects to
>
> +             This section describes in high level a sample Rephub - a
> replication system that uses the repagent API
>
> +             to replicate disks.
>
> +             It describes a simple Rephub that continuously maintains a
> mirror of the volumes of a VM.
>
> +
>
> +             Say we have a VM we want to protect - call it PVM, say it has
> 2 volumes - V1, V2.
>
> +             Our Rephub is called SingleRephub - a Rephub protecting a
> single VM.
>
> +
>
> +             Preparations
>
> +             1. The user chooses a host to run SingleRephub - a different
> host than PVM, call it Host2
>
> +             2. The user creates two volumes on Host2 - the same sizes as V1
> and V2, call them V1R (V1 recovery) and V2R.
>
> +             3. The user runs SingleRephub process on Host2, and gives V1R
> and V2R as command line arguments.
>
> +                             From now on SingleRephub waits for the
> protected VM repagent to connect.
>
> +             4. The user runs the protected VM PVM - and uses the switch
> -repagent <Host2 IP>.
>
> +
>
> +             Runtime
>
> +             1. The repagent module connects to SingleRephub on startup.
>
> +             2. repagent reports V1 and V2 to SingleRephub.
>
> +             3. SingleRephub starts to perform an initial synchronization
> of the protected volumes-
>
> +                             it reads each protected volume (V1 and V2) -
> using read volume requests - and copies the data into the
>
> +                             recovery volumes V1R and V2R.
>
> +             4. SingleRephub enters 'protection' mode - each write to the
> protected volume is sent by the repagent to the Rephub,
>
> +                             and the Rephub performs the write on the
> matching recovery volume.
>
> +
>
> +             * Note that during stage 3 writes to the protected volumes
> are not ignored - they're kept in a bitmap,
>
> +                             and will be read again when stage 3 ends, in
> an iterative converging process.
>
> +
>
> +             This flow continuously maintains an updated recovery volume.
>
> +             If the protected system is damaged, the user can create a new
> VM on Host2 with the replicated volumes attached to it.
>
> +             The new VM is a replica of the protected system.
>
> +
>
> +
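[Reviewer note, not part of the patch] The "iterative converging" initial sync described in the Runtime section is easier to evaluate with a concrete sketch. The following models volumes as memory buffers and uses a one-flag-per-chunk dirty map; the chunk granularity and all names here are our own assumptions, not anything defined by the patch:

```c
#include <string.h>

#define CHUNKS 8
#define CHUNK_SIZE 512

/* Model: protected volume, recovery volume, and a dirty map
 * with one flag per chunk. */
typedef struct SyncState {
    unsigned char prot[CHUNKS][CHUNK_SIZE];
    unsigned char reco[CHUNKS][CHUNK_SIZE];
    unsigned char dirty[CHUNKS];
} SyncState;

/* A guest write during sync marks the chunk dirty instead of being lost. */
static void guest_write(SyncState *s, int chunk, unsigned char fill)
{
    memset(s->prot[chunk], fill, CHUNK_SIZE);
    s->dirty[chunk] = 1;
}

/* Converging sync: keep copying dirty chunks until a pass finds none.
 * The flag is cleared *before* the copy so that a write landing mid-copy
 * re-marks the chunk and gets picked up by the next pass. */
static int converge(SyncState *s)
{
    int passes = 0;
    int any = 1;
    while (any) {
        int i;
        any = 0;
        for (i = 0; i < CHUNKS; i++) {
            if (s->dirty[i]) {
                s->dirty[i] = 0;
                memcpy(s->reco[i], s->prot[i], CHUNK_SIZE);
                any = 1;
            }
        }
        passes++;
    }
    return passes;
}

int converge_demo(void)
{
    SyncState s;
    int i;
    memset(&s, 0, sizeof(s));
    for (i = 0; i < CHUNKS; i++) {
        guest_write(&s, i, (unsigned char)(i + 1)); /* full initial sync */
    }
    converge(&s);
    for (i = 0; i < CHUNKS; i++) {
        if (memcmp(s.reco[i], s.prot[i], CHUNK_SIZE) != 0) {
            return -1;
        }
    }
    return 0;
}
```

In the real system the copy step would be a Read Volume Request/Response round trip to the agent rather than a memcpy, but the convergence argument is the same.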
>
> diff --git a/replication/repagent.c b/replication/repagent.c
>
> new file mode 100644
>
> index 0000000..c66eae7
>
> --- /dev/null
>
> +++ b/replication/repagent.c
>
> @@ -0,0 +1,322 @@
>
> +#include <string.h>
>
> +#include <stdlib.h>
>
> +#include <stdio.h>
>
> +#include <pthread.h>
>
> +#include <stdint.h>
>
> +
>
> +#include "block.h"
>
> +#include "rephub_defs.h"
>
> +#include "block_int.h"
>
> +#include "repagent_client.h"
>
> +#include "repagent.h"
>
> +#include "rephub_cmds.h"
>
> +
>
> +#define ZERO_MEM_OBJ(pObj) memset(pObj, 0, sizeof(*pObj))
>
> +#define REPAGENT_MAX_NUM_VOLUMES (64)
>
> +#define REPAGENT_VOLUME_ID_NONE (0)
>
> +
>
> +typedef struct RepagentVolume {
>
> +    uint64_t vol_id;
>
> +    const char *vol_path;
>
> +    BlockDriverState *driver_ptr;
>
> +} RepagentVolume;
>
> +
>
> +struct RepAgentState {
>
> +    int is_init;
>
> +    int num_volumes;
>
> +    RepagentVolume * volumes[REPAGENT_MAX_NUM_VOLUMES];
>
> +};
>
> +
>
> +typedef struct RepagentReadVolIo {
>
> +    QEMUIOVector qiov;
>
> +    RepCmdReadVolReq rep_cmd;
>
> +    uint8_t *buf;
>
> +    struct timeval start_time;
>
> +} RepagentReadVolIo;
>
> +
>
> +static int repagent_get_volume_by_name(const char *name);
>
> +static void repagent_report_volumes_to_hub(void);
>
> +static void repagent_vol_read_done(void *opaque, int ret);
>
> +static struct timeval tsub(struct timeval t1, struct timeval t2);
>
> +
>
> +RepAgentState g_rep_agent = { 0 };
>
> +
>
> +void repagent_init(const char *hubname, int port)
>
> +{
>
> +    /* It is the responsibility of the thread to free this struct */
>
> +    rephub_params *pParams = (rephub_params
> *)g_malloc(sizeof(rephub_params));
>
> +    if (hubname == NULL) {
>
> +        hubname = "127.0.0.1";
>
> +    }
>
> +    if (port == 0) {
>
> +        port = 9010;
>
> +    }
>
> +
>
> +    printf("repagent_init %s\n", hubname);
>
> +
>
> +    pParams->port = port;
>
> +    pParams->name = g_strdup(hubname);
>
> +
>
> +    pthread_t thread_id = 0;
>
> +
>
> +    /* Create the repagent client listener thread */
>
> +    pthread_create(&thread_id, 0, repagent_listen, (void *) pParams);
>
> +    pthread_detach(thread_id);
>
> +}
>
> +
>
> +void repagent_register_drive(const char *drive_path,
>
> +        BlockDriverState *driver_ptr)
>
> +{
>
> +    int i;
>
> +    for (i = 0; i < g_rep_agent.num_volumes; i++) {
>
> +        RepagentVolume *vol = g_rep_agent.volumes[i];
>
> +        if (vol != NULL) {
>
> +            assert(strcmp(drive_path, vol->vol_path) != 0
>
> +                   && driver_ptr != vol->driver_ptr);
>
> +        }
>
> +    }
>
> +
>
> +    assert(g_rep_agent.num_volumes < REPAGENT_MAX_NUM_VOLUMES);
>
> +
>
> +    printf("zerto repagent: Registering drive. Num drives %d, path %s\n",
>
> +            g_rep_agent.num_volumes, drive_path);
>
> +    g_rep_agent.volumes[i] =
>
> +            (RepagentVolume *)g_malloc(sizeof(RepagentVolume));
>
> +    g_rep_agent.volumes[i]->driver_ptr = driver_ptr;
>
> +    /* orim todo strcpy? */
>
> +    g_rep_agent.volumes[i]->vol_path = drive_path;
>
> +
>
> +    /* Orim todo thread-safety? */
>
> +    g_rep_agent.num_volumes++;
>
> +
>
> +    repagent_report_volumes_to_hub();
>
> +}
>
> +
>
> +/* orim todo destruction? */
>
> +
>
> +static RepagentVolume *repagent_get_protected_volume_by_driver(
>
> +        BlockDriverState *bs)
>
> +{
>
> +    /* orim todo optimize search */
>
> +    int i = 0;
>
> +    for (i = 0; i < g_rep_agent.num_volumes; i++) {
>
> +        RepagentVolume *p_vol = g_rep_agent.volumes[i];
>
> +        if (p_vol != NULL && p_vol->driver_ptr == (void *) bs) {
>
> +            return p_vol;
>
> +        }
>
> +    }
>
> +    return NULL;
>
> +}
>
> +
>
> +void repagent_handle_protected_write(BlockDriverState *bs, int64_t
> sector_num,
>
> +        int nb_sectors, QEMUIOVector *qiov, int ret_status)
>
> +{
>
> +    printf("zerto Protected write offset %lld, size %d, IO return status
> %d",
>
> +            (long long int) sector_num, nb_sectors, ret_status);
>
> +    if (bs->filename != NULL) {
>
> +        printf(", filename %s", bs->filename);
>
> +    }
>
> +
>
> +    printf("\n");
>
> +
>
> +    RepagentVolume *p_vol = repagent_get_protected_volume_by_driver(bs);
>
> +    if (p_vol == NULL || p_vol->vol_id == REPAGENT_VOLUME_ID_NONE) {
>
> +        /* Unprotected */
>
> +        printf("Got a write to an unprotected volume.\n");
>
> +        return;
>
> +    }
>
> +
>
> +    /* Report IO to rephub */
>
> +
>
> +    int data_size = qiov->size;
>
> +    if (ret_status < 0) {
>
> +        /* On failed ios we don't send the data to the hub */
>
> +        data_size = 0;
>
> +    }
>
> +    uint8_t *pdata = NULL;
>
> +    RepCmdProtectedWrite *p_cmd = (RepCmdProtectedWrite *) repcmd_new(
>
> +            REPHUB_CMD_PROTECTED_WRITE, data_size, (uint8_t **)&pdata);
>
> +
>
> +    if (ret_status >= 0) {
>
> +        qemu_iovec_to_buffer(qiov, pdata);
>
> +    }
>
> +
>
> +    p_cmd->volume_id = p_vol->vol_id;
>
> +    p_cmd->offset_sectors = sector_num;
>
> +    p_cmd->size_sectors = nb_sectors;
>
> +    p_cmd->ret_status = ret_status;
>
> +
>
> +    if (repagent_client_send((RepCmd *) p_cmd) != 0) {
>
> +        printf("Error sending command\n");
>
> +    }
>
> +}
>
> +
>
> +static void repagent_report_volumes_to_hub(void)
>
> +{
>
> +    /* Report IO to rephub */
>
> +    int i;
>
> +    RepCmdDataReportVmVolumes *p_cmd_data = NULL;
>
> +    RepCmdReportVmVolumes *p_cmd = (RepCmdReportVmVolumes *) repcmd_new(
>
> +            REPHUB_CMD_REPORT_VM_VOLUMES,
>
> +            g_rep_agent.num_volumes * sizeof(RepVmVolumeInfo),
>
> +            (uint8_t **)&p_cmd_data);
>
> +    p_cmd->num_volumes = g_rep_agent.num_volumes;
>
> +    printf("reporting %u volumes\n", g_rep_agent.num_volumes);
>
> +    for (i = 0; i < g_rep_agent.num_volumes; i++) {
>
> +        assert(g_rep_agent.volumes[i] != NULL);
>
> +        printf("reporting volume %s size %u\n",
>
> +                g_rep_agent.volumes[i]->vol_path,
>
> +                (uint32_t) sizeof(p_cmd_data->volumes[i].name));
>
> +        strncpy((char *) p_cmd_data->volumes[i].name,
>
> +                g_rep_agent.volumes[i]->vol_path,
>
> +                sizeof(p_cmd_data->volumes[i].name));
>
> +        p_cmd_data->volumes[i].volume_id = g_rep_agent.volumes[i]->vol_id;
>
> +    }
>
> +    if (repagent_client_send((RepCmd *) p_cmd) != 0) {
>
> +        printf("Error sending command\n");
>
> +    }
>
> +}
>
> +
>
> +int repaget_start_protect(RepCmdStartProtect *pcmd,
>
> +        RepCmdDataStartProtect *pcmd_data)
>
> +{
>
> +    printf("Start protect vol %s, ID %llu\n", pcmd_data->volume_name,
>
> +            (unsigned long long) pcmd->volume_id);
>
> +    int vol_index = repagent_get_volume_by_name(pcmd_data->volume_name);
>
> +    if (vol_index < 0) {
>
> +        printf("The volume doesn't exist\n");
>
> +        return TRUE;
>
> +    }
>
> +    /* orim todo protect */
>
> +    g_rep_agent.volumes[vol_index]->vol_id = pcmd->volume_id;
>
> +
>
> +    return TRUE;
>
> +}
>
> +
>
> +static int repagent_get_volume_by_name(const char *name)
>
> +{
>
> +    int i = 0;
>
> +    for (i = 0; i < g_rep_agent.num_volumes; i++) {
>
> +        if (g_rep_agent.volumes[i] != NULL
>
> +            && strcmp(name, g_rep_agent.volumes[i]->vol_path) == 0) {
>
> +            return i;
>
> +        }
>
> +    }
>
> +    return -1;
>
> +}
>
> +
>
> +static int repagent_get_volume_by_id(uint64_t vol_id)
>
> +{
>
> +    int i = 0;
>
> +    for (i = 0; i < g_rep_agent.num_volumes; i++) {
>
> +        if (g_rep_agent.volumes[i] != NULL
>
> +            && g_rep_agent.volumes[i]->vol_id == vol_id) {
>
> +            return i;
>
> +        }
>
> +    }
>
> +    return -1;
>
> +}
>
> +
>
> +int repaget_read_vol(RepCmdReadVolReq *pcmd, uint8_t *pdata)
>
> +{
>
> +    int index = repagent_get_volume_by_id(pcmd->volume_id);
>
> +    int size_bytes = pcmd->size_sectors * 512;
>
> +    if (index < 0) {
>
> +        printf("Vol read - Could not find vol id %llu\n",
>
> +                (unsigned long long int) pcmd->volume_id);
>
> +        RepCmdReadVolRes *p_res_cmd = (RepCmdReadVolRes *) repcmd_new(
>
> +                REPHUB_CMD_READ_VOL_RES, 0, NULL);
>
> +        p_res_cmd->req_id = pcmd->req_id;
>
> +        p_res_cmd->volume_id = pcmd->volume_id;
>
> +        p_res_cmd->is_status_success = FALSE;
>
> +        repagent_client_send((RepCmd *) p_res_cmd);
>
> +        return TRUE;
>
> +    }
>
> +
>
> +    printf("Vol read - driver %p, volId %llu, offset %llu, size %u\n",
>
> +            g_rep_agent.volumes[index]->driver_ptr,
>
> +            (unsigned long long int) pcmd->volume_id,
>
> +            (unsigned long long int) pcmd->offset_sectors,
> pcmd->size_sectors);
>
> +
>
> +    {
>
> +        RepagentReadVolIo *read_xact = calloc(1,
> sizeof(RepagentReadVolIo));
>
> +
>
> +/*        BlockDriverAIOCB *acb; */
>
> +
>
> +        ZERO_MEM_OBJ(read_xact);
>
> +
>
> +        qemu_iovec_init(&read_xact->qiov, 1);
>
> +
>
> +        /*read_xact->buf =
>
> +        qemu_blockalign(g_rep_agent.volumes[index]->driver_ptr,
> size_bytes); */
>
> +        read_xact->buf = (uint8_t *) g_malloc(size_bytes);
>
> +        read_xact->rep_cmd = *pcmd;
>
> +        qemu_iovec_add(&read_xact->qiov, read_xact->buf, size_bytes);
>
> +
>
> +        gettimeofday(&read_xact->start_time, NULL);
>
> +        /* orim TODO - use the returned acb to cancel the request on
> shutdown */
>
> +        /*acb = */bdrv_aio_readv(g_rep_agent.volumes[index]->driver_ptr,
>
> +                read_xact->rep_cmd.offset_sectors, &read_xact->qiov,
>
> +                read_xact->rep_cmd.size_sectors, repagent_vol_read_done,
>
> +                read_xact);
>
> +    }
>
> +
>
> +    return TRUE;
>
> +}
>
> +
>
> +static void repagent_vol_read_done(void *opaque, int ret)
>
> +{
>
> +    struct timeval t2;
>
> +    RepagentReadVolIo *read_xact = (RepagentReadVolIo *) opaque;
>
> +    uint8_t *pdata = NULL;
>
> +    RepCmdReadVolRes *pcmd = (RepCmdReadVolRes *) repcmd_new(
>
> +            REPHUB_CMD_READ_VOL_RES, read_xact->rep_cmd.size_sectors * 512,
>
> +            &pdata);
>
> +    pcmd->req_id = read_xact->rep_cmd.req_id;
>
> +    pcmd->volume_id = read_xact->rep_cmd.volume_id;
>
> +    pcmd->is_status_success = FALSE;
>
> +
>
> +    printf("Protected vol read - volId %llu, offset %llu, size %u\n",
>
> +            (unsigned long long int) read_xact->rep_cmd.volume_id,
>
> +            (unsigned long long int) read_xact->rep_cmd.offset_sectors,
>
> +            read_xact->rep_cmd.size_sectors);
>
> +    gettimeofday(&t2, NULL);
>
> +
>
> +    if (ret >= 0) {
>
> +        /* Read response - send the data to the hub */
>
> +        t2 = tsub(t2, read_xact->start_time);
>
> +        printf("Read prot vol done. Took %u seconds, %u us.",
>
> +                (uint32_t) t2.tv_sec, (uint32_t) t2.tv_usec);
>
> +
>
> +        pcmd->is_status_success = TRUE;
>
> +        /* orim todo optimize - don't copy, use the qiov buffer */
>
> +        qemu_iovec_to_buffer(&read_xact->qiov, pdata);
>
> +    } else {
>
> +        printf("readv failed: %s\n", strerror(-ret));
>
> +    }
>
> +
>
> +    repagent_client_send((RepCmd *) pcmd);
>
> +
>
> +    /*qemu_vfree(read_xact->buf); */
>
> +    g_free(read_xact->buf);
>
> +
>
> +    g_free(read_xact);
>
> +}
>
> +
>
> +static struct timeval tsub(struct timeval t1, struct timeval t2)
>
> +{
>
> +    t1.tv_usec -= t2.tv_usec;
>
> +    if (t1.tv_usec < 0) {
>
> +        t1.tv_usec += 1000000;
>
> +        t1.tv_sec--;
>
> +    }
>
> +    t1.tv_sec -= t2.tv_sec;
>
> +    return t1;
>
> +}
>
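[Reviewer note, not part of the patch] The tsub() helper above carries a microsecond borrow; a standalone copy with the borrow case exercised (identical logic, self-contained):

```c
#include <sys/time.h>

/* Same logic as tsub() in repagent.c: t1 - t2 with microsecond borrow. */
static struct timeval tsub(struct timeval t1, struct timeval t2)
{
    t1.tv_usec -= t2.tv_usec;
    if (t1.tv_usec < 0) {
        t1.tv_usec += 1000000;   /* borrow one second */
        t1.tv_sec--;
    }
    t1.tv_sec -= t2.tv_sec;
    return t1;
}

int tsub_demo(void)
{
    struct timeval a = { 5, 100 };   /* 5.000100 s */
    struct timeval b = { 2, 200 };   /* 2.000200 s */
    struct timeval d = tsub(a, b);   /* borrow path: 2.999900 s */
    return (d.tv_sec == 2 && d.tv_usec == 999900) ? 0 : -1;
}
```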
> +
>
> +void repagent_client_connected(void)
>
> +{
>
> +    /* orim todo thread protection */
>
> +    repagent_report_volumes_to_hub();
>
> +}
>
> diff --git a/replication/repagent.h b/replication/repagent.h
>
> new file mode 100644
>
> index 0000000..98ccbf2
>
> --- /dev/null
>
> +++ b/replication/repagent.h
>
> @@ -0,0 +1,46 @@
>
> +/*
>
> + * QEMU System Emulator
>
> + *
>
> + * Copyright (c) 2003-2008 Fabrice Bellard
>
> + *
>
> + * Permission is hereby granted, free of charge, to any person obtaining a
> copy
>
> + * of this software and associated documentation files (the "Software"),
> to deal
>
> + * in the Software without restriction, including without limitation the
> rights
>
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or
> sell
>
> + * copies of the Software, and to permit persons to whom the Software is
>
> + * furnished to do so, subject to the following conditions:
>
> + *
>
> + * The above copyright notice and this permission notice shall be included
> in
>
> + * all copies or substantial portions of the Software.
>
> + *
>
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
> OR
>
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
>
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> FROM,
>
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> IN
>
> + * THE SOFTWARE.
>
> + */
>
> +#ifndef REPAGENT_H
>
> +#define REPAGENT_H
>
> +#include <stdint.h>
>
> +
>
> +#include "qemu-common.h"
>
> +
>
> +typedef struct RepAgentState RepAgentState;
>
> +typedef struct RepCmdStartProtect RepCmdStartProtect;
>
> +typedef struct RepCmdDataStartProtect RepCmdDataStartProtect;
>
> +struct RepCmdReadVolReq;
>
> +
>
> +void repagent_init(const char *hubname, int port);
>
> +void repagent_handle_protected_write(BlockDriverState *bs,
>
> +        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov, int
> ret_status);
>
> +void repagent_register_drive(const char *drive_path,
>
> +        BlockDriverState *driver_ptr);
>
> +int repaget_start_protect(RepCmdStartProtect *pcmd,
>
> +        RepCmdDataStartProtect *pcmd_data);
>
> +int repaget_read_vol(struct RepCmdReadVolReq *pcmd, uint8_t *pdata);
>
> +void repagent_client_connected(void);
>
> +
>
> +
>
> +#endif /* REPAGENT_H */
>
> diff --git a/replication/repagent_client.c b/replication/repagent_client.c
>
> new file mode 100644
>
> index 0000000..4dd9ea4
>
> --- /dev/null
>
> +++ b/replication/repagent_client.c
>
> @@ -0,0 +1,138 @@
>
> +#include "repcmd.h"
>
> +#include "rephub_cmds.h"
>
> +#include "repcmd_listener.h"
>
> +#include "repagent_client.h"
>
> +#include "repagent.h"
>
> +
>
> +#include <string.h>
>
> +#include <stdlib.h>
>
> +#include <errno.h>
>
> +#include <stdio.h>
>
> +#include <resolv.h>
>
> +#include <sys/socket.h>
>
> +#include <arpa/inet.h>
>
> +#include <netinet/in.h>
>
> +#include <unistd.h>
>
> +
>
> +#define ZERO_MEM_OBJ(pObj) memset(pObj, 0, sizeof(*pObj))
>
> +
>
> +static void repagent_process_cmd(RepCmd *pCmd, uint8_t *pData, void
> *clientPtr);
>
> +
>
> +typedef struct repagent_client_state {
>
> +    int is_connected;
>
> +    int is_terminate_receive;
>
> +    int hsock;
>
> +} repagent_client_state;
>
> +
>
> +static repagent_client_state g_client_state = { 0 };
>
> +
>
> +void *repagent_listen(void *pParam)
>
> +{
>
> +    rephub_params *pServerParams = (rephub_params *) pParam;
>
> +    int host_port = pServerParams->port;
>
> +    const char *host_name = pServerParams->name;
>
> +
>
> +    printf("Creating repagent listener thread...\n");
>
> +    g_free(pServerParams);
>
> +
>
> +    struct sockaddr_in my_addr;
>
> +
>
> +    int err;
>
> +    int retries = 0;
>
> +
>
> +    g_client_state.hsock = socket(AF_INET, SOCK_STREAM, 0);
>
> +    if (g_client_state.hsock == -1) {
>
> +        printf("Error initializing socket %d\n", errno);
>
> +        return (void *) -1;
>
> +    }
>
> +
>
> +    int param = 1;
>
> +
>
> +    if ((setsockopt(g_client_state.hsock, SOL_SOCKET, SO_REUSEADDR,
>
> +            (char *)&param, sizeof(int)) == -1)
>
> +            || (setsockopt(g_client_state.hsock, SOL_SOCKET, SO_KEEPALIVE,
>
> +                    (char *)&param, sizeof(int)) == -1)) {
>
> +        printf("Error setting options %d\n", errno);
>
> +        return (void *) -1;
>
> +    }
>
> +
>
> +    my_addr.sin_family = AF_INET;
>
> +    my_addr.sin_port = htons(host_port);
>
> +    memset(&(my_addr.sin_zero), 0, 8);
>
> +
>
> +    my_addr.sin_addr.s_addr = inet_addr(host_name);
>
> +
>
> +    /* Reconnect loop */
>
> +    while (!g_client_state.is_terminate_receive) {
>
> +
>
> +        if (connect(g_client_state.hsock, (struct sockaddr *)&my_addr,
>
> +                sizeof(my_addr)) == -1) {
>
> +            err = errno;
>
> +            if (err != EINPROGRESS) {
>
> +                retries++;
>
> +                fprintf(
>
> +                        stderr,
>
> +                        "Error connecting socket %d. Host %s, port %u.
> Retry count %d\n",
>
> +                        errno, host_name, host_port, retries);
>
> +                usleep(5 * 1000 * 1000);
>
> +                continue;
>
> +            }
>
> +        }
>
> +        retries = 0;
>
> +
>
> +        g_client_state.is_connected = 1;
>
> +
>
> +        repagent_client_connected();
>
> +        repcmd_listener(g_client_state.hsock, repagent_process_cmd, NULL);
>
> +        close(g_client_state.hsock);
>
> +
>
> +        g_client_state.is_connected = 0;
>
> +    }
>
> +    return 0;
>
> +}
>
> +
>
> +void repagent_process_cmd(RepCmd *pcmd, uint8_t *pdata, void *clientPtr)
>
> +{
>
> +    int is_free_data = 1;
>
> +    printf("Repagent got cmd %d\n", pcmd->hdr.cmdid);
>
> +    switch (pcmd->hdr.cmdid) {
>
> +    case REPHUB_CMD_START_PROTECT: {
>
> +        is_free_data = repaget_start_protect((RepCmdStartProtect *) pcmd,
>
> +                (RepCmdDataStartProtect *) pdata);
>
> +    }
>
> +        break;
>
> +    case REPHUB_CMD_READ_VOL_REQ: {
>
> +        is_free_data = repaget_read_vol((RepCmdReadVolReq *) pcmd, pdata);
>
> +    }
>
> +        break;
>
> +    default:
>
> +        assert(0);
>
> +        break;
>
> +
>
> +    }
>
> +
>
> +    if (is_free_data) {
>
> +        g_free(pdata);
>
> +    }
>
> +}
>
> +
>
> +int repagent_client_send(RepCmd *p_cmd)
>
> +{
>
> +    int bytecount = 0;
>
> +    printf("Send cmd %u, data size %u\n", p_cmd->hdr.cmdid,
>
> +            p_cmd->hdr.data_size_bytes);
>
> +    if (!g_client_state.is_connected) {
>
> +        printf("Not connected to hub\n");
>
> +        return -1;
>
> +    }
>
> +
>
> +    bytecount = send(g_client_state.hsock, p_cmd,
>
> +            sizeof(RepCmd) + p_cmd->hdr.data_size_bytes, 0);
>
> +    if (bytecount < sizeof(RepCmd) + p_cmd->hdr.data_size_bytes) {
>
> +        printf("Bad send %d, errno %d\n", bytecount, errno);
>
> +        return bytecount;
>
> +    }
>
> +
>
> +    /* Success */
>
> +    return 0;
>
> +}
>
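[Reviewer note, not part of the patch] One thing to flag in repagent_client_send(): send() on a stream socket may legitimately transmit fewer bytes than requested, and the code above treats that as a failure. The usual fix is to loop until the whole command is written. A sketch under that assumption (the helper name and demo are ours):

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

/* Write exactly len bytes to fd, looping over short sends.
 * Returns 0 on success, -1 on error. */
static int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n < 0) {
            if (errno == EINTR) {
                continue;   /* retry after signal interruption */
            }
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

int send_all_demo(void)
{
    int fds[2];
    char out[64], in[64];
    ssize_t got = 0;

    /* A local socketpair stands in for the agent-to-hub connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }
    memset(out, 0xAB, sizeof(out));
    if (send_all(fds[0], out, sizeof(out)) != 0) {
        return -1;
    }
    while (got < (ssize_t)sizeof(in)) {
        ssize_t n = recv(fds[1], in + got, sizeof(in) - got, 0);
        if (n <= 0) {
            return -1;
        }
        got += n;
    }
    close(fds[0]);
    close(fds[1]);
    return memcmp(in, out, sizeof(out)) == 0 ? 0 : -1;
}
```

The receive side (repcmd_listener) already loops until bytesToGet is satisfied, so making the send side symmetric seems natural.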
> diff --git a/replication/repagent_client.h b/replication/repagent_client.h
>
> new file mode 100644
>
> index 0000000..62a5377
>
> --- /dev/null
>
> +++ b/replication/repagent_client.h
>
> @@ -0,0 +1,36 @@
>
> +/*
>
> + * QEMU System Emulator
>
> + *
>
> + * Copyright (c) 2003-2008 Fabrice Bellard
>
> + *
>
> + * Permission is hereby granted, free of charge, to any person obtaining a
> copy
>
> + * of this software and associated documentation files (the "Software"),
> to deal
>
> + * in the Software without restriction, including without limitation the
> rights
>
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or
> sell
>
> + * copies of the Software, and to permit persons to whom the Software is
>
> + * furnished to do so, subject to the following conditions:
>
> + *
>
> + * The above copyright notice and this permission notice shall be included
> in
>
> + * all copies or substantial portions of the Software.
>
> + *
>
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
> OR
>
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
>
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> FROM,
>
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> IN
>
> + * THE SOFTWARE.
>
> + */
>
> +#ifndef REPAGENT_CLIENT_H
>
> +#define REPAGENT_CLIENT_H
>
> +#include "repcmd.h"
>
> +
>
> +typedef struct rephub_params {
>
> +    char *name;
>
> +    int port;
>
> +} rephub_params;
>
> +
>
> +void *repagent_listen(void *pParam);
>
> +int repagent_client_send(RepCmd *p_cmd);
>
> +
>
> +#endif /* REPAGENT_CLIENT_H */
>
> diff --git a/replication/repcmd.h b/replication/repcmd.h
>
> new file mode 100644
>
> index 0000000..8c6cf1b
>
> --- /dev/null
>
> +++ b/replication/repcmd.h
>
> @@ -0,0 +1,59 @@
>
> +/*
>
> + * QEMU System Emulator
>
> + *
>
> + * Copyright (c) 2003-2008 Fabrice Bellard
>
> + *
>
> + * Permission is hereby granted, free of charge, to any person obtaining a
> copy
>
> + * of this software and associated documentation files (the "Software"),
> to deal
>
> + * in the Software without restriction, including without limitation the
> rights
>
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or
> sell
>
> + * copies of the Software, and to permit persons to whom the Software is
>
> + * furnished to do so, subject to the following conditions:
>
> + *
>
> + * The above copyright notice and this permission notice shall be included
> in
>
> + * all copies or substantial portions of the Software.
>
> + *
>
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
> OR
>
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
> OTHER
>
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> FROM,
>
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> IN
>
> + * THE SOFTWARE.
>
> + */
>
> +#ifndef REPCMD_H
>
> +#define REPCMD_H
>
> +
>
> +#include <stdint.h>
>
> +
>
> +#define REPCMD_MAGIC1 (0x1122)
>
> +#define REPCMD_MAGIC2 (0x3344)
>
> +#define REPCMD_NUM_U32_PARAMS (11)
>
> +
>
> +enum RepCmds {
>
> +    REPCMD_FIRST_INVALID                    = 0,
>
> +    REPCMD_FIRST_HUBCMD                     = 1,
>
> +    REPHUB_CMD_PROTECTED_WRITE              = 2,
>
> +    REPHUB_CMD_REPORT_VM_VOLUMES            = 3,
>
> +    REPHUB_CMD_START_PROTECT                = 4,
>
> +    REPHUB_CMD_READ_VOL_REQ                 = 5,
>
> +    REPHUB_CMD_READ_VOL_RES                 = 6,
>
> +    REPHUB_CMD_AGENT_SHUTDOWN               = 7,
>
> +};
>
> +
>
> +typedef struct RepCmdHdr {
>
> +    uint16_t magic1;
>
> +    uint16_t cmdid;
>
> +    uint32_t data_size_bytes;
>
> +} RepCmdHdr;
>
> +
>
> +typedef struct RepCmd {
>
> +    RepCmdHdr hdr;
>
> +    unsigned int parameters[REPCMD_NUM_U32_PARAMS];
>
> +    unsigned int magic2;
>
> +    uint8_t data[0];
>
> +} RepCmd;
>
> +
>
> +RepCmd *repcmd_new(int cmd_id, int data_size, uint8_t **p_out_pdata);
>
> +
>
> +#endif /* REPCMD_H */
>
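[Reviewer note, not part of the patch] The framing repcmd.h defines is: a fixed-size RepCmd block (header, inline u32 parameters, trailing magic) followed by data_size_bytes of payload in the flexible data[] member. repcmd_new() itself is not in this posting, so the allocator below is only our assumed-equivalent sketch of its contract:

```c
#include <stdint.h>
#include <stdlib.h>

#define REPCMD_MAGIC1 (0x1122)
#define REPCMD_MAGIC2 (0x3344)
#define REPCMD_NUM_U32_PARAMS (11)

typedef struct RepCmdHdr {
    uint16_t magic1;
    uint16_t cmdid;
    uint32_t data_size_bytes;
} RepCmdHdr;

typedef struct RepCmd {
    RepCmdHdr hdr;
    unsigned int parameters[REPCMD_NUM_U32_PARAMS];
    unsigned int magic2;
    uint8_t data[];   /* payload follows the fixed block (data[0] in patch) */
} RepCmd;

/* Assumed-equivalent of repcmd_new(): allocate header + payload in one
 * buffer and hand back a pointer to the payload area. */
static RepCmd *repcmd_new_sketch(int cmd_id, int data_size,
                                 uint8_t **p_out_pdata)
{
    RepCmd *cmd = calloc(1, sizeof(RepCmd) + data_size);
    if (cmd == NULL) {
        return NULL;
    }
    cmd->hdr.magic1 = REPCMD_MAGIC1;
    cmd->hdr.cmdid = (uint16_t)cmd_id;
    cmd->hdr.data_size_bytes = (uint32_t)data_size;
    cmd->magic2 = REPCMD_MAGIC2;
    if (p_out_pdata != NULL) {
        *p_out_pdata = cmd->data;
    }
    return cmd;
}

int repcmd_frame_demo(void)
{
    uint8_t *pdata = NULL;
    RepCmd *cmd = repcmd_new_sketch(2 /* REPHUB_CMD_PROTECTED_WRITE */,
                                    16, &pdata);
    int ok = cmd != NULL
        && pdata == cmd->data
        && cmd->hdr.data_size_bytes == 16
        /* wire size, as computed by repagent_client_send() */
        && sizeof(RepCmd) + cmd->hdr.data_size_bytes
           == sizeof(RepCmdHdr) + sizeof(cmd->parameters)
              + sizeof(cmd->magic2) + 16;
    free(cmd);
    return ok ? 0 : -1;
}
```

Note there is no explicit endianness handling: the header fields go on the wire in host byte order, which is fine while agent and hub share an architecture but worth a comment in the header.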
> diff --git a/replication/repcmd_listener.c b/replication/repcmd_listener.c
>
> new file mode 100644
>
> index 0000000..a211927
>
> --- /dev/null
>
> +++ b/replication/repcmd_listener.c
>
> @@ -0,0 +1,137 @@
>
> +#include <fcntl.h>
>
> +#include <string.h>
>
> +#include <stdlib.h>
>
> +#include <errno.h>
>
> +#include <stdio.h>
>
> +#include <netinet/in.h>
>
> +#include <resolv.h>
>
> +#include <sys/socket.h>
>
> +#include <arpa/inet.h>
>
> +#include <unistd.h>
>
> +#include <pthread.h>
>
> +#include <assert.h>
>
> +
>
> +/* Use the CONFIG_REPLICATION flag to determine whether
>
> + * we're under qemu build or a hub When under
>
> + * qemu use g_malloc */
>
> +#ifdef CONFIG_REPLICATION
>
> +#include<glib.h>
>
> +#define REPCMD_MALLOC g_malloc
>
> +#else
>
> +#define REPCMD_MALLOC malloc
>
> +#endif
>
> +
>
> +#include "repcmd.h"
>
> +#include "repcmd_listener.h"
>
> +
>
> +#define ZERO_MEM_OBJ(pObj) memset((void *)pObj, 0, sizeof(*pObj))
>
> +
>
> +typedef struct RepCmdListenerState {
>
> +    int is_terminate_receive;
>
> +} RepCmdListenerState;
>
> +
>
> +static RepCmdListenerState g_listenerState = { 0 };
>
> +
>
> +/* Returns 0 for initiated termination or socket error value on error */
>
> +int repcmd_listener(int hsock, pfn_received_cmd_cb callback, void
> *clientPtr)
>
> +{
>
> +    RepCmd curCmd;
>
> +    uint8_t *pReadBuf = (uint8_t *)&curCmd;
>
> +    int bytesToGet = sizeof(RepCmd);
>
> +    int bytesGotten = 0;
>
> +    int isGotHeader = 0;
>
> +    uint8_t *pdata = NULL;
>
> +
>
> +    assert(callback != NULL);
>
> +
>
> +    /* receive loop */
>
> +    while (!g_listenerState.is_terminate_receive) {
>
> +        int bytecount;
>
> +
>
> +        bytecount = recv(hsock, pReadBuf + bytesGotten,
>
> +                bytesToGet - bytesGotten, 0);
>
> +        if (bytecount == -1) {
>
> +            fprintf(stderr, "Error receiving data %d\n", errno);
>
> +            return errno;
>
> +        }
>
> +
>
> +        if (bytecount == 0) {
>
> +            printf("Disconnected\n");
>
> +            return 0;
>
> +        }
>
> +        bytesGotten += bytecount;
>
> +/*     printf("Recieved bytes %d, got %d/%d\n",
>
> +                bytecount, bytesGotten, bytesToGet); */
>
> +        /* print content */
>
> +        if (0) {
>
> +            int i;
>
> +            for (i = 0; i<  bytecount ; i += 4) {
>
> +                /*printf("%d/%d", i, bytecount/4); */
>
> +                printf("%#x ",
>
> +                        *(int *) (&pReadBuf[bytesGotten - bytecount + i]));
>
> +
>
> +            }
>
> +            printf("\n");
>
> +        }
>
> +        assert(bytesGotten<= bytesToGet);
>
> +        if (bytesGotten == bytesToGet) {
>
> +            int isGotData = 0;
>
> +            bytesGotten = 0;
>
> +            if (!isGotHeader) {
>
> +                /* We just got the header */
>
> +                isGotHeader = 1;
>
> +
>
> +                assert(curCmd.hdr.magic1 == REPCMD_MAGIC1);
>
> +                assert(curCmd.magic2 == REPCMD_MAGIC2);
>
> +                if (curCmd.hdr.data_size_bytes>  0) {
>
> +                    pdata = (uint8_t *)REPCMD_MALLOC(
>
> +                                curCmd.hdr.data_size_bytes);
>
> +/*                    printf("malloc %p\n", pdata); */
>
> +                    pReadBuf = pdata;
>
> +                } else {
>
> +                    /* no data */
>
> +                    isGotData = 1;
>
> +                    pdata = NULL;
>
> +                }
>
> +                bytesToGet = curCmd.hdr.data_size_bytes;
>
> +            } else {
>
> +                isGotData = 1;
>
> +            }
>
> +
>
> +            if (isGotData) {
>
> +                /* Got command and data */
>
> +                (*callback)(&curCmd, pdata, clientPtr);
>
> +
>
> +                /* It's the callee responsibility to free pData */
>
> +                pdata = NULL;
>
> +                ZERO_MEM_OBJ(&curCmd);
>
> +                pReadBuf = (uint8_t *)&curCmd;
>
> +                bytesGotten = 0;
>
> +                bytesToGet = sizeof(RepCmd);
>
> +                isGotHeader = 0;
>
> +            }
>
> +        }
>
> +    }
>
> +    return 0;
>
> +}
>
> +
>
> +RepCmd *repcmd_new(int cmd_id, int data_size, uint8_t **p_out_pdata)
>
> +{
>
> +    RepCmd *p_cmd = (RepCmd *)REPCMD_MALLOC(sizeof(RepCmd) + data_size);
>
> +    assert(p_cmd != NULL);
>
> +
>
> +    /* Zero the CMD (not the data) */
>
> +    ZERO_MEM_OBJ(p_cmd);
>
> +
>
> +    p_cmd->hdr.cmdid = cmd_id;
>
> +    p_cmd->hdr.magic1 = REPCMD_MAGIC1;
>
> +    p_cmd->magic2 = REPCMD_MAGIC2;
>
> +    p_cmd->hdr.data_size_bytes = data_size;
>
> +
>
> +    if (p_out_pdata != NULL) {
>
> +        *p_out_pdata = p_cmd->data;
>
> +    }
>
> +
>
> +    return p_cmd;
>
> +}
>
> +
>
> diff --git a/replication/repcmd_listener.h b/replication/repcmd_listener.h
> new file mode 100644
> index 0000000..c09a12e
> --- /dev/null
> +++ b/replication/repcmd_listener.h
> @@ -0,0 +1,32 @@
> +/*
> + * QEMU System Emulator
> + *
> + * Copyright (c) 2003-2008 Fabrice Bellard
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +#ifndef REPCMD_LISTENER_H
> +#define REPCMD_LISTENER_H
> +#include <stdint.h>
> +typedef void (*pfn_received_cmd_cb)(RepCmd *pCmd,
> +                uint8_t *pData, void *clientPtr);
> +
> +int repcmd_listener(int hsock, pfn_received_cmd_cb callback, void *clientPtr);
> +
> +#endif /* REPCMD_LISTENER_H */
>
> diff --git a/replication/rephub_cmds.h b/replication/rephub_cmds.h
> new file mode 100644
> index 0000000..820c37d
> --- /dev/null
> +++ b/replication/rephub_cmds.h
> @@ -0,0 +1,150 @@
> +/*
> + * QEMU System Emulator
> + *
> + * Copyright (c) 2003-2008 Fabrice Bellard
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +#ifndef REPHUB_CMDS_H
> +#define REPHUB_CMDS_H
> +
> +#include <stdint.h>
> +#include "repcmd.h"
> +#include "rephub_defs.h"
> +
> +/*********************************************************
> + * RepCmd Report a protected IO
> + *
> + * REPHUB_CMD_PROTECTED_WRITE
> + * Direction: agent->hub
> + *
> + * Any write to a protected volume is sent with this
> + * message to the hub, with its status.
> + * When the status is bad no data is sent.
> + *********************************************************/
> +typedef struct RepCmdProtectedWrite {
> +    RepCmdHdr hdr;
> +    uint64_t volume_id;
> +    uint64_t offset_sectors;
> +    /* The size field duplicates the RepCmd size,
> +     * but it is needed for reporting failed IOs' sizes */
> +    uint32_t size_sectors;
> +    int ret_status;
> +} RepCmdProtectedWrite;
> +
> +/*********************************************************
> + * RepCmd Report VM volumes
> + *
> + * REPHUB_CMD_REPORT_VM_VOLUMES
> + * Direction: agent->hub
> + *
> + * The agent reports all the volumes of the VM
> + * to the hub.
> + *********************************************************/
> +typedef struct RepVmVolumeInfo {
> +    char name[REPHUB_MAX_VOL_NAME_LEN];
> +    uint64_t volume_id;
> +    uint32_t size_mb;
> +} RepVmVolumeInfo;
> +
> +typedef struct RepCmdReportVmVolumes {
> +    RepCmdHdr hdr;
> +    int num_volumes;
> +} RepCmdReportVmVolumes;
> +
> +typedef struct RepCmdDataReportVmVolumes {
> +    RepVmVolumeInfo volumes[0];
> +} RepCmdDataReportVmVolumes;
> +
> +
> +/*********************************************************
> + * RepCmd Start protect
> + *
> + * REPHUB_CMD_START_PROTECT
> + * Direction: hub->agent
> + *
> + * The hub instructs the agent to start protecting
> + * a volume. When a volume is protected all its writes
> + * are sent to the hub.
> + * With this command the hub also assigns a volume ID to
> + * the given volume name.
> + *********************************************************/
> +typedef struct RepCmdStartProtect {
> +    RepCmdHdr hdr;
> +    uint64_t volume_id;
> +} RepCmdStartProtect;
> +
> +typedef struct RepCmdDataStartProtect {
> +    char volume_name[REPHUB_MAX_VOL_NAME_LEN];
> +} RepCmdDataStartProtect;
> +
> +
> +/*********************************************************
> + * RepCmd Read Volume Request
> + *
> + * REPHUB_CMD_READ_VOL_REQ
> + * Direction: hub->agent
> + *
> + * The hub issues a read IO to a protected volume.
> + * This command is used during sync - when the hub needs
> + * to read unsynchronized sections of a protected volume.
> + * This command is a request; the read data is returned
> + * by the response command REPHUB_CMD_READ_VOL_RES.
> + *********************************************************/
> +typedef struct RepCmdReadVolReq {
> +    RepCmdHdr hdr;
> +    int req_id;
> +    int size_sectors;
> +    uint64_t volume_id;
> +    uint64_t offset_sectors;
> +} RepCmdReadVolReq;
> +
> +/*********************************************************
> + * RepCmd Read Volume Response
> + *
> + * REPHUB_CMD_READ_VOL_RES
> + * Direction: agent->hub
> + *
> + * A response to REPHUB_CMD_READ_VOL_REQ.
> + * Sends the data read from a protected volume.
> + *********************************************************/
> +typedef struct RepCmdReadVolRes {
> +    RepCmdHdr hdr;
> +    int req_id;
> +    int is_status_success;
> +    uint64_t volume_id;
> +} RepCmdReadVolRes;
> +
> +/*********************************************************
> + * RepCmd Agent shutdown
> + *
> + * REPHUB_CMD_AGENT_SHUTDOWN
> + * Direction: agent->hub
> + *
> + * Notifies the hub that the agent is about to shut down.
> + * This allows a graceful shutdown. Any disconnection
> + * of an agent without sending this command will result
> + * in a full sync of the VM volumes.
> + *********************************************************/
> +typedef struct RepCmdAgentShutdown {
> +    RepCmdHdr hdr;
> +} RepCmdAgentShutdown;
> +
> +
> +#endif /* REPHUB_CMDS_H */
>
> diff --git a/replication/rephub_defs.h b/replication/rephub_defs.h
> new file mode 100644
> index 0000000..e34e0ce
> --- /dev/null
> +++ b/replication/rephub_defs.h
> @@ -0,0 +1,40 @@
> +/*
> + * QEMU System Emulator
> + *
> + * Copyright (c) 2003-2008 Fabrice Bellard
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +#ifndef REP_HUB_DEFS_H
> +#define REP_HUB_DEFS_H
> +
> +#include <stdint.h>
> +
> +#define REPHUB_MAX_VOL_NAME_LEN (1024)
> +#define REPHUB_MAX_NUM_VOLUMES (512)
> +
> +#ifndef TRUE
> +    #define TRUE (1)
> +#endif
> +
> +#ifndef FALSE
> +    #define FALSE (0)
> +#endif
> +
> +#endif /* REP_HUB_DEFS_H */
>
> diff --git a/vl.c b/vl.c
> index 624da0f..506b5dc 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -167,6 +167,7 @@ int main(int argc, char **argv)
>   #include "ui/qemu-spice.h"
> +#include "replication/repagent.h"
> //#define DEBUG_NET
> //#define DEBUG_SLIRP
> @@ -2307,6 +2308,15 @@ int main(int argc, char **argv, char **envp)
>                   drive_add(IF_DEFAULT, popt->index - QEMU_OPTION_hda, optarg,
>                             HD_OPTS);
>                   break;
> +            case QEMU_OPTION_repagent:
> +#ifdef CONFIG_REPLICATION
> +                repagent_init(optarg, 0);
> +#else
> +                fprintf(stderr, "Replication support is disabled. "
> +                    "Don't use -repagent option.\n");
> +                exit(1);
> +#endif
> +                break;
>               case QEMU_OPTION_drive:
>                   if (drive_def(optarg) == NULL) {
>                       exit(1);

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 12:12 ` Anthony Liguori
@ 2012-02-07 12:25   ` Dor Laor
  2012-02-07 12:30     ` Ori Mamluk
  0 siblings, 1 reply; 66+ messages in thread
From: Dor Laor @ 2012-02-07 12:25 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Ori Mamluk

On 02/07/2012 02:12 PM, Anthony Liguori wrote:
> Hi,
>
> On 02/07/2012 04:29 AM, Ori Mamluk wrote:
>> Repagent is a new module that allows an external replication system to
>> replicate a volume of a Qemu VM.
>>
>> This RFC patch adds the repagent client module to Qemu.
>
> Please read http://wiki.qemu.org/Contribute/SubmitAPatch
>
> In particular, use a tool like git-send-email and split this patch up
> into more manageable chunks.
>
> Is there an Open Source rephub available? As a project policy, adding
> external APIs specifically for proprietary software is not something
> we're willing to do.
>
> Regards,
>
> Anthony Liguori

In addition, I don't see that the listener thread holds any lock while 
it reads the image. I guess that during that period the guest runs and 
may race w/ this new thread.

About image ID for the replication hub, you can use the VM's pid or VM's 
uuid paired w/ the specific disk uuid

Thanks,
Dor

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 12:25   ` Dor Laor
@ 2012-02-07 12:30     ` Ori Mamluk
  2012-02-07 12:40       ` Anthony Liguori
  0 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-07 12:30 UTC (permalink / raw)
  To: dlaor; +Cc: Kevin Wolf, qemu-devel

> In addition, I don't see that the listener thread holds any lock while
it reads the image. I guess that during that period the guest runs and may
race w/ this new thread.

Yes - I mentioned that in the patch mail as one of the open issues. Can
you direct me to the lock I need? The function I call from a new thread
context is bdrv_aio_readv.

-----Original Message-----
From: Dor Laor [mailto:dlaor@redhat.com]
Sent: Tuesday, February 07, 2012 2:25 PM
To: Anthony Liguori
Cc: Ori Mamluk; Kevin Wolf; qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module

On 02/07/2012 02:12 PM, Anthony Liguori wrote:
> Hi,
>
> On 02/07/2012 04:29 AM, Ori Mamluk wrote:
>> Repagent is a new module that allows an external replication system
>> to replicate a volume of a Qemu VM.
>>
>> This RFC patch adds the repagent client module to Qemu.
>
> Please read http://wiki.qemu.org/Contribute/SubmitAPatch
>
> In particular, use a tool like git-send-email and split this patch up
> into more manageable chunks.
>
> Is there an Open Source rephub available? As a project policy, adding
> external APIs specifically for proprietary software is not something
> we're willing to do.
>
> Regards,
>
> Anthony Liguori

In addition, I don't see that the listener thread holds any lock while it
reads the image. I guess that during that period the guest runs and may
race w/ this new thread.

About image ID for the replication hub, you can use the VM's pid or VM's
uuid paired w/ the specific disk uuid

Thanks,
Dor

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 12:30     ` Ori Mamluk
@ 2012-02-07 12:40       ` Anthony Liguori
  2012-02-07 14:06         ` Ori Mamluk
  0 siblings, 1 reply; 66+ messages in thread
From: Anthony Liguori @ 2012-02-07 12:40 UTC (permalink / raw)
  To: Ori Mamluk; +Cc: Kevin Wolf, dlaor, qemu-devel

On 02/07/2012 06:30 AM, Ori Mamluk wrote:
>> In addition, I don't see that the listener thread holds any lock while
> it reads the image. I guess that during that period the guest runs and may
> race w/ this new thread.
>
> Yes - I mentioned that in the patch mail as one of the open issues. Can
> you direct me to the lock I need? The function I call from a new thread
> context is bdrv_aio_readv.

One thing about this proposal has me confused: why not just make the rephub 
an iSCSI target provider and call it a day?

Regards,

Anthony Liguori

>
> -----Original Message-----
> From: Dor Laor [mailto:dlaor@redhat.com]
> Sent: Tuesday, February 07, 2012 2:25 PM
> To: Anthony Liguori
> Cc: Ori Mamluk; Kevin Wolf; qemu-devel@nongnu.org
> Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module
>
> On 02/07/2012 02:12 PM, Anthony Liguori wrote:
>> Hi,
>>
>> On 02/07/2012 04:29 AM, Ori Mamluk wrote:
>>> Repagent is a new module that allows an external replication system
>>> to replicate a volume of a Qemu VM.
>>>
>>> This RFC patch adds the repagent client module to Qemu.
>>
>> Please read http://wiki.qemu.org/Contribute/SubmitAPatch
>>
>> In particular, use a tool like git-send-email and split this patch up
>> into more manageable chunks.
>>
>> Is there an Open Source rephub available? As a project policy, adding
>> external APIs specifically for proprietary software is not something
>> we're willing to do.
>>
>> Regards,
>>
>> Anthony Liguori
>
> In addition, I don't see that the listener thread holds any lock while it
> reads the image. I guess that during that period the guest runs and may
> race w/ this new thread.
>
> About image ID for the replication hub, you can use the VM's pid or VM's
> uuid paired w/ the specific disk uuid
>
> Thanks,
> Dor
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 10:29 [Qemu-devel] [RFC PATCH] replication agent module Ori Mamluk
  2012-02-07 12:12 ` Anthony Liguori
@ 2012-02-07 13:34 ` Kevin Wolf
  2012-02-07 13:50   ` Stefan Hajnoczi
                     ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Kevin Wolf @ 2012-02-07 13:34 UTC (permalink / raw)
  To: Ori Mamluk; +Cc: dlaor, qemu-devel, Luiz Capitulino

Am 07.02.2012 11:29, schrieb Ori Mamluk:
> Repagent is a new module that allows an external replication system to
> replicate a volume of a Qemu VM.
>
> This RFC patch adds the repagent client module to Qemu.
>
> Documentation of the module role and API is in the patch at
> replication/qemu-repagent.txt
>
> The main motivation behind the module is to allow replication of VMs in
> a virtualization environment like RhevM.
>
> To achieve this we need basic replication support in Qemu.
>
> This is the first submission of this module, which was written as a
> Proof Of Concept, and used successfully for replicating and recovering a
> Qemu VM.

I'll mostly ignore the code for now and just comment on the design.

One thing to consider for the next version of the RFC would be to split
this into a series of smaller patches. This one has become quite large,
which makes it hard to review (and yes, please use git send-email).

> Points and open issues:
> 
> *             The module interfaces the Qemu storage stack at block.c
> generic layer. Is this the right place to intercept/inject IOs?

There are two ways to intercept I/O requests. The first one is what you
chose, just add some code to bdrv_co_do_writev, and I think it's
reasonable to do this.

The other one would be to add a special block driver for a replication:
protocol that writes to two different places (the real block driver for
the image, and the network connection). Generally this feels even a bit
more elegant, but it brings new problems with it: For example, when you
create an external snapshot, you need to pay attention not to lose the
replication because the protocol is somewhere in the middle of a backing
file chain.

> *             The patch contains performing IO reads invoked by a new
> thread (a TCP listener thread). See repaget_read_vol in repagent.c. It
> is not protected by any lock – is this OK?

No, definitely not. Block layer code expects that it holds
qemu_global_mutex.

I'm not sure if a thread is the right solution. You should probably use
something that resembles other asynchronous code in qemu, i.e. either
callback or coroutine based.

> *             VM ID – the replication system implies an environment with
> several VMs connected to a central replication system (Rephub).
>                 This requires some sort of identification for a VM. The
> current patch does not include a VM ID – I did not find any adequate ID
> to use.

The replication hub already opened a connection to the VM, so it somehow
managed to know which VM this process represents, right?

The unique ID would be something like the PID of the VM or the file
descriptor of the communication channel to it.

> diff --git a/Makefile b/Makefile
> index 4f6eaa4..a1b3701 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o qemu-ga.o: $(GENERATED_HEADERS
> tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \
>                qemu-timer-common.o cutils.o
> -qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
> -qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
> -qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
> +qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
> +qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)
> +qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y) $(replication-obj-y)

$(replication-obj-y) should be included in $(block-obj-y) instead


> @@ -2733,6 +2739,7 @@ echo "curl support      $curl"
> echo "check support     $check_utests"
> echo "mingw32 support   $mingw32"
> echo "Audio drivers     $audio_drv_list"
> +echo "Replication          $replication"
> echo "Extra audio cards $audio_card_list"
> echo "Block whitelist   $block_drv_whitelist"
> echo "Mixer emulation   $mixemu"

Why do you add it in the middle rather than at the end?

> diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt
> new file mode 100755
> index 0000000..e3b0c1e
> --- /dev/null
> +++ b/replication/qemu-repagent.txt
> @@ -0,0 +1,104 @@
> +    repagent - replication agent - a Qemu module for enabling continuous
> +    async replication of VM volumes
> +
> +Introduction
> +    This document describes a feature in Qemu - a replication agent (AKA
> +    Repagent).
> +    The Repagent is a new module that exposes an API to an external
> +    replication system (AKA Rephub).
> +    This API allows a Rephub to communicate with a Qemu VM and
> +    continuously replicate its volumes.
> +    The implementation of a Rephub is outside the scope of this document.
> +    There may be several different Rephub implementations using the same
> +    repagent in Qemu.
> +
> +Main features of Repagent
> +    Repagent does the following:
> +    * Report volumes - report a list of all volumes in a VM to the Rephub.

Does the query-block QMP command give you what you need?

> +    * Report writes to a volume - send all writes made to a protected
> +      volume to the Rephub.
> +      The reporting of an IO is asynchronous - i.e. the IO is not delayed
> +      by the Repagent to get any acknowledgement from the Rephub. It is
> +      only copied to the Rephub.
> +    * Read a protected volume - allows the Rephub to read a protected
> +      volume, to enable the Rephub to synchronize the content of a
> +      protected volume.

We were discussing using NBD as the protocol for any data that is
transferred from/to the replication hub, so that we can use the existing
NBD client and server code that qemu has. It seems you came to the
conclusion to use a different protocol? What are the reasons?

The other message types could possibly be implemented as QMP commands. I
guess we might need to attach multiple QMP monitors for this to work
(one for libvirt, one for the rephub). I'm not sure if there is a
fundamental problem with this or if it just needs to be done.

> +
> +Description of the Repagent module
> +
> +Build and run options
> +    New configure option: --enable-replication
> +    New command line option:
> +    -repagent [hub IP/name]

You'll probably want a monitor command to enable this at runtime.

> +        Enable replication support for disks.
> +        hub is the IP or name of the machine running the replication hub.
>
> +
> +Module APIs
> +    The Repagent module interfaces two main components:
> +    1. The Rephub - an external API based on socket messages
> +    2. The generic block layer - block.c
> +
> +    Rephub message API
> +        The external replication API is a message-based API.
> +        We won't go into the structure of the messages here - just the
> +        semantics.
> +
> +        Messages list
> +            (The updated list and comments are in rephub_cmds.h)
> +
> +            Messages from the Repagent to the Rephub:
> +            * Protected write
> +                The Repagent sends each write to a protected volume to
> +                the hub with the IO status.
> +                In case the status is bad the write content is not sent.
> +            * Report VM volumes
> +                The agent reports all the volumes of the VM to the hub.
> +            * Read Volume Response
> +                A response to a Read Volume Request.
> +                Sends the data read from a protected volume to the hub.
> +            * Agent shutdown
> +                Notifies the hub that the agent is about to shut down.
> +                This allows a graceful shutdown. Any disconnection of an
> +                agent without sending this command will result in a full
> +                sync of the VM volumes.

What does "full sync" mean, what data is synced with which other place?
Is it bad when this happens just because the network is down for a
moment, but the VM actually keeps running?

> +
> +            Messages from the Rephub to the Repagent:
> +            * Start protect
> +                The hub instructs the agent to start protecting a volume.
> +                When a volume is protected all its writes are sent to
> +                the hub.
> +                With this command the hub also assigns a volume ID to
> +                the given volume name.
> +            * Read volume request
> +                The hub issues a read IO to a protected volume.
> +                This command is used during sync - when the hub needs to
> +                read unsynchronized sections of a protected volume.
> +                This command is a request; the read data is returned by
> +                the read volume response message (see above).
> +
> +    block.c API
> +        The API to the generic block storage layer contains 3
> +        functionalities:
> +        1. Handle writes to protected volumes
> +            In bdrv_co_do_writev, each write is reported to the Repagent
> +            module.
> +        2. Handle each new volume that registers
> +            In bdrv_open - each new bottom-level block driver that
> +            registers is reported.

Could probably be a QMP event.

> +        3. Read from a volume
> +            Repagent calls bdrv_aio_readv to handle read requests coming
> +            from the hub.
>
> +
> +
> +General description of a Rephub - a replication system the repagent
> +connects to
> +    This section describes at a high level a sample Rephub - a
> +    replication system that uses the repagent API to replicate disks.
> +    It describes a simple Rephub that continuously maintains a mirror of
> +    the volumes of a VM.
> +
> +    Say we have a VM we want to protect - call it PVM; say it has 2
> +    volumes - V1, V2.
> +    Our Rephub is called SingleRephub - a Rephub protecting a single VM.
> +
> +    Preparations
> +    1. The user chooses a host to run SingleRephub - a different host
> +       than PVM's, call it Host2.
> +    2. The user creates two volumes on Host2 - the same sizes as V1 and
> +       V2, call them V1R (V1 recovery) and V2R.
> +    3. The user runs the SingleRephub process on Host2, and gives V1R
> +       and V2R as command line arguments.
> +       From now on SingleRephub waits for the protected VM's repagent to
> +       connect.
> +    4. The user runs the protected VM PVM - and uses the switch
> +       -repagent <Host2 IP>.
> +
> +    Runtime
> +    1. The repagent module connects to SingleRephub on startup.
> +    2. repagent reports V1 and V2 to SingleRephub.
> +    3. SingleRephub starts to perform an initial synchronization of the
> +       protected volumes - it reads each protected volume (V1 and V2) -
> +       using read volume requests - and copies the data into the
> +       recovery volumes V1R and V2R.
>
Are you really going to do this on every start of the VM? Comparing the
whole content of an image will take quite some time.

> +	4. SingleRephub enters 'protection' mode - each write to a
> +	   protected volume is sent by the repagent to the Rephub, and the
> +	   Rephub performs the write on the matching recovery volume.
> +
> +	* Note that during stage 3, writes to the protected volumes are not
> +	  ignored - they are tracked in a bitmap and read again when stage 3
> +	  ends, in an iterative converging process.
> +
> +	This flow continuously maintains an updated recovery volume.
> +	If the protected system is damaged, the user can create a new VM on
> +	Host2 with the replicated volumes attached to it.
> +	The new VM is a replica of the protected system.

Have you meanwhile had the time to take a look at Kemari and check how
big the overlap is?

Kevin
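The iterative converging sync described in the quoted text (stage 3 plus the bitmap note) can be sketched roughly as follows. This is a simplification with made-up names, assuming one dirty bit per fixed-size chunk; the actual copy to the recovery volume is elided:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define CHUNKS 8

/* Illustrative converging initial sync: while the hub copies chunks,
 * concurrent guest writes mark chunks dirty again; passes repeat
 * until a pass finds nothing dirty. */
static bool dirty[CHUNKS];

void guest_write(int chunk)
{
    dirty[chunk] = true;    /* writes during sync are not ignored */
}

/* One pass: copy every dirty chunk to the recovery volume (copy
 * itself elided) and clear its bit. Returns chunks copied. */
int sync_pass(void)
{
    int copied = 0;
    for (int i = 0; i < CHUNKS; i++) {
        if (dirty[i]) {
            dirty[i] = false;   /* chunk is now in sync */
            copied++;
        }
    }
    return copied;
}

/* Converge: keep passing until a pass copies nothing. */
int sync_until_converged(void)
{
    int passes = 0;
    while (sync_pass() > 0) {
        passes++;
    }
    return passes;
}
```

Convergence relies on the write rate being lower than the copy rate, which also relates to Kevin's question about re-reading whole images on every VM start.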

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:34 ` Kevin Wolf
@ 2012-02-07 13:50   ` Stefan Hajnoczi
  2012-02-07 13:58     ` Paolo Bonzini
                       ` (3 more replies)
  2012-02-07 14:45   ` Ori Mamluk
  2012-02-08 11:45   ` Luiz Capitulino
  2 siblings, 4 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-07 13:50 UTC (permalink / raw)
  To: Ori Mamluk; +Cc: Kevin Wolf, dlaor, qemu-devel, Luiz Capitulino

On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>> Repagent is a new module that allows an external replication system to
>> replicate a volume of a Qemu VM.

I recently joked with Kevin that QEMU is on its way to reimplementing
the Linux block and device-mapper layers.  Now we have drbd, thanks!
:P

Except for image files, the way to do this on a Linux host would be
using drbd block devices.  We still haven't figured out a nice way to
make image files full-fledged Linux block devices, so we're
reimplementing all the block code in QEMU userspace.

>>
>> This RFC patch adds the repagent client module to Qemu.
>>
>>
>>
>> Documentation of the module role and API is in the patch at
>> replication/qemu-repagent.txt
>>
>>
>>
>> The main motivation behind the module is to allow replication of VMs in
>> a virtualization environment like RhevM.
>>
>> To achieve this we need basic replication support in Qemu.
>>
>>
>>
>> This is the first submission of this module, which was written as a
>> Proof Of Concept, and used successfully for replicating and recovering a
>> Qemu VM.
>
> I'll mostly ignore the code for now and just comment on the design.
>
> One thing to consider for the next version of the RFC would be to split
> this in a series smaller patches. This one has become quite large, which
> makes it hard to review (and yes, please use git send-email).
>
>> Points and open issues:
>>
>> *             The module interfaces the Qemu storage stack at block.c
>> generic layer. Is this the right place to intercept/inject IOs?
>
> There are two ways to intercept I/O requests. The first one is what you
> chose, just add some code to bdrv_co_do_writev, and I think it's
> reasonable to do this.
>
> The other one would be to add a special block driver for a replication:
> protocol that writes to two different places (the real block driver for
> the image, and the network connection). Generally this feels even a bit
> more elegant, but it brings new problems with it: For example, when you
> create an external snapshot, you need to pay attention not to lose the
> replication because the protocol is somewhere in the middle of a backing
> file chain.
>
>> *             The patch contains performing IO reads invoked by a new
>> thread (a TCP listener thread). See repaget_read_vol in repagent.c. It
>> is not protected by any lock – is this OK?
>
> No, definitely not. Block layer code expects that it holds
> qemu_global_mutex.
>
> I'm not sure if a thread is the right solution. You should probably use
> something that resembles other asynchronous code in qemu, i.e. either
> callback or coroutine based.

There is a flow control problem here which is interesting.  If the
rephub is slower than the writer or unavailable, then eventually we
either need to stop replicating writes or we need to throttle the
guest writes.  I haven't read through the whole patch yet but the flow
control solution is very closely tied to how you use
threads/coroutines and how you use network sockets.
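One common way to resolve that trade-off - not something the patch implements, just a hedged sketch with invented names - is a bounded replication queue: when the hub cannot keep up, stop mirroring and remember that a resync is needed, rather than throttling the guest:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flow-control policy: mirror writes into a bounded
 * queue; on overflow, degrade to "resync needed" instead of
 * blocking guest I/O. */
enum { QUEUE_CAP = 4 };

struct rep_state {
    int queued;          /* writes in flight to the hub */
    bool needs_resync;   /* set when we dropped out of mirroring */
};

void on_guest_write(struct rep_state *s)
{
    if (s->needs_resync) {
        return;                  /* already degraded; a bitmap tracks it */
    }
    if (s->queued == QUEUE_CAP) {
        s->needs_resync = true;  /* hub too slow: degrade, don't block */
        return;
    }
    s->queued++;                 /* write mirrored to the hub */
}

void on_hub_ack(struct rep_state *s)
{
    if (s->queued > 0) {
        s->queued--;
    }
}
```

The choice of QUEUE_CAP is exactly where the coupling to the socket and coroutine design shows up: it bounds both memory use and how far the hub may lag.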

>> +             * Read a protected volume - allows the Rephub to read a
>> protected volume, to enable the Rephub to synchronize the content
>> of a protected volume.
>
> We were discussing using NBD as the protocol for any data that is
> transferred from/to the replication hub, so that we can use the existing
> NBD client and server code that qemu has. Seems you came to the
> conclusion to use different protocol? What are the reasons?
>
> The other message types could possibly be implemented as QMP commands. I
> guess we might need to attach multiple QMP monitors for this to work
> (one for libvirt, one for the rephub). I'm not sure if there is a
> fundamental problem with this or if it just needs to be done.

Agreed.  You can already query block devices using QMP 'query-block'.
By adding in-process NBD server support you could then launch an NBD
server for each volume which you wish to replicate.  However, in this
case it sounds almost like you want the reverse - you could provide an
NBD server on the rephub and QEMU would mirror writes to it (the NBD
client code is already in QEMU).

There is also interest from other external software (like libvirt) to
be able to read volumes while the VM is running.

BTW, do you poll the volumes or how do you handle hotplug?  Does
anything special need to be done when a volume is unplugged?

Stefan


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:50   ` Stefan Hajnoczi
@ 2012-02-07 13:58     ` Paolo Bonzini
  2012-02-07 14:05     ` Paolo Bonzini
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-07 13:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, dlaor, qemu-devel, Ori Mamluk, Luiz Capitulino

On 02/07/2012 02:50 PM, Stefan Hajnoczi wrote:
>   We still haven't figured out a nice way to
> make image files full-fledged Linux block devices, so we're
> reimplementing all the block code in QEMU userspace.

What about my favorite hammer, NBD?

Paolo


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:50   ` Stefan Hajnoczi
  2012-02-07 13:58     ` Paolo Bonzini
@ 2012-02-07 14:05     ` Paolo Bonzini
  2012-02-08 12:17       ` Orit Wasserman
  2012-02-07 14:18     ` Ori Mamluk
  2012-02-07 14:59     ` Anthony Liguori
  3 siblings, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-07 14:05 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, dlaor, qemu-devel, Ori Mamluk, Luiz Capitulino

On 02/07/2012 02:50 PM, Stefan Hajnoczi wrote:
>> I guess we might need to attach multiple QMP monitors for this to work
>> (one for libvirt, one for the rephub). I'm not sure if there is a
>> fundamental problem with this or if it just needs to be done.
>
> Agreed.  You can already query block devices using QMP 'query-block'.
> By adding in-process NBD server support you could then launch an NBD
> server for each volume which you wish to replicate.  However, in this
> case it sounds almost like you want the reverse - you could provide an
> NBD server on the rephub and QEMU would mirror writes to it (the NBD
> client code is already in QEMU).

Yes, this is how we were also planning to do migration without shared 
storage, right?

Paolo


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 12:40       ` Anthony Liguori
@ 2012-02-07 14:06         ` Ori Mamluk
  2012-02-07 14:40           ` Paolo Bonzini
  0 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-07 14:06 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, עודד קדם,
	dlaor,
	תומר בן אור,
	qemu-devel

On 07/02/2012 14:40, Anthony Liguori wrote:
> On 02/07/2012 06:30 AM, Ori Mamluk wrote:
>>> In addition, I don't see that the listener thread holds any lock while
>> it reads the image. I guess that during that period the guest runs 
>> and may
>> race w/ this new thread.
>>
>> Yes - I mentioned that in the patch mail as one of the open issues. Can
>> you direct me to the lock I need? The function I call from a new thread
>> context is bdrv_aio_readv.
>
> One thing that has me confused about this proposal.  Why not just make 
> rehub be an iSCSI target provider and call it a day?
>
You've got a point there. This also goes with what Kevin wrote about 
splitting the IOs with a block driver and not at the generic level.

The main issue about it is that the Rephub also needs the other 
direction - to read the protected volume.

I get the feeling that with live block copy and NBD there's probably 
something that might fit this need, no?
With a 'new' agent like the one I need, this is relatively easily 
achieved by a bidirectional protocol, but I agree a more generic 
protocol would be more elegant, although it will probably require a 
socket per direction, no?

Some smaller questions:
* Is there already a working iSCSI initiator as a block driver (I hope 
I'm using the right terminology) in Qemu, or do I need to write one?
* This driver would need to be added at run-time - to allow starting to 
protect a running VM. Maybe via a monitor command. I guess that's OK, right?
* What can you say about NBD vs. iSCSI - with respect to our 
requirements - which is more mature in Qemu?

One more thing about the iSCSI initiator - it will not be a standard 
backing for a drive, because the 'production' drive (i.e. the original 
image) is more important than the replicated one. This means that even 
though we use iSCSI, this is still a replication agent - not a generic 
'additional' iSCSI backing.




> Regards,
>
> Anthony Liguori
>
>>
>> -----Original Message-----
>> From: Dor Laor [mailto:dlaor@redhat.com]
>> Sent: Tuesday, February 07, 2012 2:25 PM
>> To: Anthony Liguori
>> Cc: Ori Mamluk; Kevin Wolf; qemu-devel@nongnu.org
>> Subject: Re: [Qemu-devel] [RFC PATCH] replication agent module
>>
>> On 02/07/2012 02:12 PM, Anthony Liguori wrote:
>>> Hi,
>>>
>>> On 02/07/2012 04:29 AM, Ori Mamluk wrote:
>>>> Repagent is a new module that allows an external replication system
>>>> to replicate a volume of a Qemu VM.
>>>>
>>>> This RFC patch adds the repagent client module to Qemu.
>>>
>>> Please read http://wiki.qemu.org/Contribute/SubmitAPatch
>>>
>>> In particular, use a tool like git-send-email and split this patch up
>>> into more manageable chunks.
>>>
>>> Is there an Open Source rehub available? As a project policy, adding
>>> external APIs specifically for proprietary software is not something
>>> we're willing to do.
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>
>> In addition, I don't see that the listener thread holds any lock 
>> while it
>> reads the image. I guess that during that period the guest runs and may
>> race w/ this new thread.
>>
>> About image ID for the replication hub, you can use the VM's pid or VM's
>> uuid paired w/ the specific disk uuid
>>
>> Thanks,
>> Dor
>>
>


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:50   ` Stefan Hajnoczi
  2012-02-07 13:58     ` Paolo Bonzini
  2012-02-07 14:05     ` Paolo Bonzini
@ 2012-02-07 14:18     ` Ori Mamluk
  2012-02-07 14:59     ` Anthony Liguori
  3 siblings, 0 replies; 66+ messages in thread
From: Ori Mamluk @ 2012-02-07 14:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Luiz Capitulino

On 07/02/2012 15:50, Stefan Hajnoczi wrote:

First let me say that I'm not completely used to the inline replies - so 
I initially missed some of your mail before.
> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>>> Repagent is a new module that allows an external replication system to
>>> replicate a volume of a Qemu VM.
> I recently joked with Kevin that QEMU is on its way to reimplementing
> the Linux block and device-mapper layers.  Now we have drbd, thanks!
> :P
>
> Except for image files, the way to do this on a Linux host would be
> using drbd block devices.  We still haven't figured out a nice way to
> make image files full-fledged Linux block devices, so we're
> reimplementing all the block code in QEMU userspace.
>
>>> This RFC patch adds the repagent client module to Qemu.
>>>
>>>
>>>
>>> Documentation of the module role and API is in the patch at
>>> replication/qemu-repagent.txt
>>>
>>>
>>>
>>> The main motivation behind the module is to allow replication of VMs in
>>> a virtualization environment like RhevM.
>>>
>>> To achieve this we need basic replication support in Qemu.
>>>
>>>
>>>
>>> This is the first submission of this module, which was written as a
>>> Proof Of Concept, and used successfully for replicating and recovering a
>>> Qemu VM.
>> I'll mostly ignore the code for now and just comment on the design.
>>
>> One thing to consider for the next version of the RFC would be to split
>> this in a series smaller patches. This one has become quite large, which
>> makes it hard to review (and yes, please use git send-email).
>>
>>> Points and open issues:
>>>
>>> *             The module interfaces the Qemu storage stack at block.c
>>> generic layer. Is this the right place to intercept/inject IOs?
>> There are two ways to intercept I/O requests. The first one is what you
>> chose, just add some code to bdrv_co_do_writev, and I think it's
>> reasonable to do this.
>>
>> The other one would be to add a special block driver for a replication:
>> protocol that writes to two different places (the real block driver for
>> the image, and the network connection). Generally this feels even a bit
>> more elegant, but it brings new problems with it: For example, when you
>> create an external snapshot, you need to pay attention not to lose the
>> replication because the protocol is somewhere in the middle of a backing
>> file chain.
>>
>>> *             The patch contains performing IO reads invoked by a new
>>> thread (a TCP listener thread). See repaget_read_vol in repagent.c. It
>>> is not protected by any lock – is this OK?
>> No, definitely not. Block layer code expects that it holds
>> qemu_global_mutex.
>>
>> I'm not sure if a thread is the right solution. You should probably use
>> something that resembles other asynchronous code in qemu, i.e. either
>> callback or coroutine based.
> There is a flow control problem here which is interesting.  If the
> rephub is slower than the writer or unavailable, then eventually we
> either need to stop replicating writes or we need to throttle the
> guest writes.  I haven't read through the whole patch yet but the flow
> control solution is very closely tied to how you use
> threads/coroutines and how you use network sockets.
In general the replication is naturally less important than the main 
(production) volume.
This means that the solution aims never to throttle the guest writes.
In the current stage, both IOs need to complete before completion is 
reported back to the guest, but the volume write is a real write to 
storage while the Rephub write may involve only copying to memory.
Later on we can get rid of waiting for the replicated IO altogether by 
adding a bitmap - but this is only for a later stage.
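The "both IOs must complete before the guest sees completion" behaviour described here amounts to a completion counter shared by the storage write and the hub copy. A minimal sketch with invented names:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative dual completion: the guest write is acknowledged only
 * after both the real storage write and the hub copy have finished. */
struct dual_io {
    int remaining;      /* outstanding sub-IOs */
    bool guest_acked;
};

void dual_io_start(struct dual_io *io)
{
    io->remaining = 2;  /* storage write + hub copy */
    io->guest_acked = false;
}

/* Called from either sub-IO's completion path. */
void sub_io_done(struct dual_io *io)
{
    if (--io->remaining == 0) {
        io->guest_acked = true; /* now report completion to the guest */
    }
}
```

Dropping the hub copy from `remaining` (once a bitmap tracks unreplicated writes) is exactly the later optimization mentioned above.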

>
>>> +             * Read a protected volume - allows the Rephub to read a
>>> protected volume, to enable the Rephub to synchronize the content
>>> of a protected volume.
>> We were discussing using NBD as the protocol for any data that is
>> transferred from/to the replication hub, so that we can use the existing
>> NBD client and server code that qemu has. Seems you came to the
>> conclusion to use different protocol? What are the reasons?
>>
>> The other message types could possibly be implemented as QMP commands. I
>> guess we might need to attach multiple QMP monitors for this to work
>> (one for libvirt, one for the rephub). I'm not sure if there is a
>> fundamental problem with this or if it just needs to be done.
> Agreed.  You can already query block devices using QMP 'query-block'.
> By adding in-process NBD server support you could then launch an NBD
> server for each volume which you wish to replicate.  However, in this
> case it sounds almost like you want the reverse - you could provide an
> NBD server on the rephub and QEMU would mirror writes to it (the NBD
> client code is already in QEMU).
>
> There is also interest from other external software (like libvirt) to
> be able to read volumes while the VM is running.
>
> BTW, do you poll the volumes or how do you handle hotplug?  Does
> anything special need to be done when a volume is unplugged?
We assume that we handle the hotplug top-down - via the management 
system, and not from the VM.
In general, we don't protect 'all volumes' of a VM - the management 
system (either RhevM or the Rephub - depending on the design) 
specifically instructs us to start protecting a volume.
> Stefan


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:06         ` Ori Mamluk
@ 2012-02-07 14:40           ` Paolo Bonzini
  2012-02-07 14:48             ` Ori Mamluk
                               ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-07 14:40 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel

On 02/07/2012 03:06 PM, Ori Mamluk wrote:
> The main issue about it is that the Rephub also needs the other
> direction - to read the protected volume. I get the feeling that with
> live block copy and NBD there's probably something that might fit
> this need, no?

Yes, with two NBD sockets you could do it.  But would you use both at 
the same time?  I would have thought that either the rephub is streaming 
from the protected volume, or QEMU is streaming to the rephub.

The current streaming code in QEMU only deals with the former. 
Streaming to a remote server would not be supported.

> With a 'new' agent like I need this is relatively easily achieved by a
> bidirectional protocol, but I agree a more generic protocol would be
> more elegant, although it will probably require a socket per direction, no?
>
> I Some smaller questions:
> * Is there already a working iScsi initiator as a block driver (I hope
> I'm using the right terminology) in Qemu, or do I need to write one?

Yes, there is one using libiscsi.  But I think Anthony was not referring 
to iSCSI in particular, NBD would work just as well.

> * This driver would need to be added in run-time - to allow starting to
> protect a running VM. Maybe via a monitor command. I guess that's OK,
> right?

Yes, I think you can detach a block device from a drive and reattach the 
new mirroring device.

> * What can you say about NBD via iScsi - with respect to our
> requirements- who is more mature in Qemu?

Personally I prefer NBD because it is lighter-weight and there is a 
server inside QEMU (so you can use it easily with non-raw images).  It 
is more mature, but it is a bit less extensible.

> One more thing about the iScsi initiator - it will not be a standard
> backing for a drive, because the 'production' drive (i.e. the original
> image) is more important than the replicated one. This means that even
> though we use iScsi, this is still a replication agent - not a generic
> 'additional' iscsi backing.

Yes, understood.

Paolo


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:34 ` Kevin Wolf
  2012-02-07 13:50   ` Stefan Hajnoczi
@ 2012-02-07 14:45   ` Ori Mamluk
  2012-02-08 12:29     ` Orit Wasserman
  2012-02-08 11:45   ` Luiz Capitulino
  2 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-07 14:45 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Luiz Capitulino

On 07/02/2012 15:34, Kevin Wolf wrote:
> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>> Repagent is a new module that allows an external replication system to
>> replicate a volume of a Qemu VM.
>>
>> This RFC patch adds the repagent client module to Qemu.
>>
>>
>>
>> Documentation of the module role and API is in the patch at
>> replication/qemu-repagent.txt
>>
>>
>>
>> The main motivation behind the module is to allow replication of VMs in
>> a virtualization environment like RhevM.
>>
>> To achieve this we need basic replication support in Qemu.
>>
>>
>>
>> This is the first submission of this module, which was written as a
>> Proof Of Concept, and used successfully for replicating and recovering a
>> Qemu VM.
> I'll mostly ignore the code for now and just comment on the design.
That's fine. The code was mainly for my understanding of the system.
> One thing to consider for the next version of the RFC would be to split
> this in a series smaller patches. This one has become quite large, which
> makes it hard to review (and yes, please use git send-email).
>
>> Points and open issues:
>>
>> *             The module interfaces the Qemu storage stack at block.c
>> generic layer. Is this the right place to intercept/inject IOs?
> There are two ways to intercept I/O requests. The first one is what you
> chose, just add some code to bdrv_co_do_writev, and I think it's
> reasonable to do this.
>
> The other one would be to add a special block driver for a replication:
> protocol that writes to two different places (the real block driver for
> the image, and the network connection). Generally this feels even a bit
> more elegant, but it brings new problems with it: For example, when you
> create an external snapshot, you need to pay attention not to lose the
> replication because the protocol is somewhere in the middle of a backing
> file chain.
Yes. With this solution we'll have to somehow make sure that the 
replication driver is closer to the guest than any driver which alters 
the IO.

>
>> *             The patch contains performing IO reads invoked by a new
>> thread (a TCP listener thread). See repaget_read_vol in repagent.c. It
>> is not protected by any lock – is this OK?
> No, definitely not. Block layer code expects that it holds
> qemu_global_mutex.
>
> I'm not sure if a thread is the right solution. You should probably use
> something that resembles other asynchronous code in qemu, i.e. either
> callback or coroutine based.
I call bdrv_aio_readv - which in my understanding creates a coroutine, 
so my current solution is coroutine based. Did I get something wrong?
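For reference, the callback style being discussed looks roughly like this. Everything below is mocked out for illustration - only bdrv_aio_readv itself is a real name in this thread; here a stand-in completes the "read" immediately, whereas real code would complete it later from the event loop, under qemu_global_mutex:

```c
#include <assert.h>
#include <stddef.h>

/* Mocked-up callback-based AIO in the style of bdrv_aio_readv: the
 * caller passes a completion callback plus an opaque pointer and
 * never blocks waiting for the result. */
typedef void CompletionFunc(void *opaque, int ret);

struct read_ctx {
    int result;
    int done;
};

static void read_done(void *opaque, int ret)
{
    struct read_ctx *ctx = opaque;
    ctx->result = ret;
    ctx->done = 1;
}

/* Stand-in for an async read: real code would queue the request and
 * invoke cb from the main loop when the read completes. */
void fake_aio_readv(long sector, int nb_sectors,
                    CompletionFunc *cb, void *opaque)
{
    (void)sector;
    cb(opaque, nb_sectors); /* pretend all sectors were read */
}
```

The point of the pattern is that the Rephub read request handler never sits in a foreign thread calling into the block layer; it submits and returns, and completion arrives on the thread that holds the lock.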

>
>> *             VM ID – the replication system implies an environment with
>> several VMs connected to a central replication system (Rephub).
>>                  This requires some sort of identification for a VM. The
>> current patch does not include a VM ID – I did not find any adequate ID
>> to use.
> The replication hub already opened a connection to the VM, so it somehow
> managed to know which VM this process represents, right?
The current design has the server at the Rephub side, so the VM connects 
to the Rephub, and not the other way around.
The VM could be instructed to "enable protection" by a monitor command, 
and then it connects to the 'known' Rephub.
> The unique ID would be something like the PID of the VM or the file
> descriptor of the communication channel to it.
The PID might be useful - we'll later need to correlate it to the way 
Rhevm identifies the machine, but not right now...
>> diff --git a/Makefile b/Makefile
>>
>> index 4f6eaa4..a1b3701 100644
>>
>> --- a/Makefile
>>
>> +++ b/Makefile
>>
>> @@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o
>> qemu-ga.o: $(GENERATED_HEADERS
>>
>> tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \
>>
>>                 qemu-timer-common.o cutils.o
>>
>> -qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
>>
>> -qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
>>
>> -qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
>>
>> +qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
>> $(replication-obj-y)
>>
>> +qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
>> $(replication-obj-y)
>>
>> +qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
>> $(replication-obj-y)
> $(replication-obj-y) should be included in $(block-obj-y) instead
>
>
>> @@ -2733,6 +2739,7 @@ echo "curl support      $curl"
>>
>> echo "check support     $check_utests"
>>
>> echo "mingw32 support   $mingw32"
>>
>> echo "Audio drivers     $audio_drv_list"
>>
>> +echo "Replication          $replication"
>>
>> echo "Extra audio cards $audio_card_list"
>>
>> echo "Block whitelist   $block_drv_whitelist"
>>
>> echo "Mixer emulation   $mixemu"
> Why do you add it in the middle rather than at the end?
No reason, I'll change it.
>
>> diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt
>>
>> new file mode 100755
>>
>> index 0000000..e3b0c1e
>>
>> --- /dev/null
>>
>> +++ b/replication/qemu-repagent.txt
>>
>> @@ -0,0 +1,104 @@
>>
>> +             repagent - replication agent - a Qemu module for enabling
>> continuous async replication of VM volumes
>>
>> +
>>
>> +Introduction
>>
>> +             This document describes a feature in Qemu - a replication
>> agent (AKA Repagent).
>>
>> +             The Repagent is a new module that exposes an API to an
>> external replication system (AKA Rephub).
>>
>> +             This API allows a Rephub to communicate with a Qemu VM and
>> continuously replicate its volumes.
>>
>> +             The implementation of a Rephub is outside of the scope of
>> this document. There may be several different Rephub
>>
>> +             implementations using the same repagent in Qemu.
>>
>> +
>>
>> +Main feature of Repagent
>>
>> +             Repagent does the following:
>>
>> +             * Report volumes - report a list of all volumes in a VM to
>> the Rephub.
> Does the query-block QMP command give you what you need?
I'll look into it.
>> +             * Report writes to a volume - send all writes made to a
>> protected volume to the Rephub.
>>
>> +                             The reporting of an IO is asynchronous -
>> i.e. the IO is not delayed by the Repagent to get any acknowledgement
>> from the Rephub.
>> +                             It is only copied to the Rephub.
>>
>> +             * Read a protected volume - allows the Rephub to read a
>> protected volume, to enable the Rephub to synchronize the content
>> of a protected volume.
> We were discussing using NBD as the protocol for any data that is
> transferred from/to the replication hub, so that we can use the existing
> NBD client and server code that qemu has. Seems you came to the
> conclusion to use different protocol? What are the reasons?
Initially I thought there would have to be more functionality in the agent.
Now it seems that you're right, and Stefan also pointed out something 
similar.
Let me think about how I can get the same functionality with NBD (or 
iSCSI) server and client.
>
> The other message types could possibly be implemented as QMP commands. I
> guess we might need to attach multiple QMP monitors for this to work
> (one for libvirt, one for the rephub). I'm not sure if there is a
> fundamental problem with this or if it just needs to be done.
>> +
>>
>> +Description of the Repagent module
>>
>> +
>>
>> +Build and run options
>>
>> +             New configure option: --enable-replication
>>
>> +             New command line option:
>>
>> +             -repagent [hub IP/name]
> You'll probably want a monitor command to enable this at runtime.
Yep.
>> +
>> Enable replication support for disks
>>
>> +
>> hub is the ip or name of the machine running the replication hub.
>>
>> +
>>
>> +Module APIs
>>
>> +             The Repagent module interfaces two main components:
>>
>> +             1. The Rephub - An external API based on socket messages
>>
>> +             2. The generic block layer- block.c
>>
>> +
>>
>> +             Rephub message API
>>
>> +                             The external replication API is a message
>> based API.
>>
>> +                             We won't go into the structure of the
>> messages here - just the semantics.
>>
>> +
>>
>> +                             Messages list
>>
>> +                                             (The updated list and
>> comments are in Rephub_cmds.h)
>>
>> +
>>
>> +                                             Messages from the Repagent
>> to the Rephub:
>>
>> +                                             * Protected write
>>
>> +                                                             The
>> Repagent sends each write to a protected volume to the hub with the IO
>> status.
>>
>> +                                                             In case
>> the status is bad the write content is not sent
>>
>> +                                             * Report VM volumes
>>
>> +                                                             The agent
>> reports all the volumes of the VM to the hub.
>>
>> +                                             * Read Volume Response
>>
>> +                                                             A response
>> to a Read Volume Request
>>
>> +                                                             Sends the
>> data read from a protected volume to the hub
>>
>> +                                             * Agent shutdown
>>
>> +                                                             Notifies
>> the hub that the agent is about to shutdown.
>>
>> +                                                             This
>> allows a graceful shutdown. Any disconnection of an agent without
>>
>> +                                                             sending
>> this command will result in a full sync of the VM volumes.
> What does "full sync" mean, what data is synced with which other place?
> Is it bad when this happens just because the network is down for a
> moment, but the VM actually keeps running?
Full sync means reading the entire volume.
It is bad when it happens because of a short network outage, but I think 
it's a good 'intermediate' step to start with.
We can first build a system which assumes that the connection between 
the agent and the Rephub is solid, and at a next stage add a bitmap 
mechanism in the agent that will optimize it - overcoming outages 
without a full sync.
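The bitmap mechanism mentioned here could be as simple as one dirty bit per fixed-size chunk of the volume; a hypothetical sketch (all names and the 64 KB granularity are assumptions, not part of the patch):

```c
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SIZE (64 * 1024)  /* assumed tracking granularity: 64 KB */

typedef struct DirtyBitmap {
    uint64_t nb_chunks;
    unsigned long *bits;
} DirtyBitmap;

static DirtyBitmap *dirty_bitmap_new(uint64_t volume_bytes)
{
    DirtyBitmap *bm = malloc(sizeof(*bm));
    bm->nb_chunks = (volume_bytes + CHUNK_SIZE - 1) / CHUNK_SIZE;
    size_t nwords = (bm->nb_chunks + 8 * sizeof(unsigned long) - 1) /
                    (8 * sizeof(unsigned long));
    bm->bits = calloc(nwords, sizeof(unsigned long));
    return bm;
}

/* Mark a write that could not be shipped to the Rephub. */
static void dirty_bitmap_mark(DirtyBitmap *bm, uint64_t offset, uint64_t len)
{
    uint64_t first = offset / CHUNK_SIZE;
    uint64_t last = (offset + len - 1) / CHUNK_SIZE;
    for (uint64_t c = first; c <= last; c++) {
        bm->bits[c / (8 * sizeof(unsigned long))] |=
            1UL << (c % (8 * sizeof(unsigned long)));
    }
}

static int dirty_bitmap_test(const DirtyBitmap *bm, uint64_t chunk)
{
    return !!(bm->bits[chunk / (8 * sizeof(unsigned long))] &
              (1UL << (chunk % (8 * sizeof(unsigned long)))));
}
```

After an outage, only chunks whose bit is set would need to be re-read, instead of the whole volume.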
>> +
>> +            Messages from the Rephub to the Repagent:
>> +            * Start protect
>> +                The hub instructs the agent to start protecting a volume.
>> +                When a volume is protected all its writes are sent to the
>> +                hub. With this command the hub also assigns a volume ID to
>> +                the given volume name.
>> +            * Read volume request
>> +                The hub issues a read IO to a protected volume.
>> +                This command is used during sync - when the hub needs to
>> +                read unsynchronized sections of a protected volume.
>> +                This command is a request; the read data is returned by the
>> +                read volume response message (see above).
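Purely as an illustration of the shape such a message-based API takes - the real definitions live in Rephub_cmds.h, and every name below is invented:

```c
#include <stdint.h>

/* Invented message identifiers -- the actual list is in Rephub_cmds.h. */
typedef enum {
    /* agent -> hub */
    REP_MSG_PROTECTED_WRITE = 1,
    REP_MSG_REPORT_VOLUMES,
    REP_MSG_READ_VOL_RESPONSE,
    REP_MSG_AGENT_SHUTDOWN,
    /* hub -> agent */
    REP_MSG_START_PROTECT,
    REP_MSG_READ_VOL_REQUEST,
} RepMsgType;

/* A fixed header followed by a type-specific payload. */
typedef struct {
    uint32_t type;        /* RepMsgType */
    uint32_t payload_len; /* bytes following the header */
    uint32_t volume_id;   /* assigned by the hub in Start protect */
    uint64_t offset;      /* byte offset for write/read messages */
    int32_t  io_status;   /* 0 = success; on failure no payload is sent */
} RepMsgHeader;
```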
>>
>> +    block.c API
>> +        The API to the generic block storage layer contains 3
>> +        functionalities:
>> +        1. Handle writes to protected volumes
>> +            In bdrv_co_do_writev, each write is reported to the Repagent
>> +            module.
>> +        2. Handle each new volume that registers
>> +            In bdrv_open - each new bottom-level block driver that
>> +            registers is reported.
> Could probably be a QMP event.
OK
>> +        3. Read from a volume
>> +            Repagent calls bdrv_aio_readv to handle read requests coming
>> +            from the hub.
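A hedged sketch of the write tap described in item 1, with stubs standing in for QEMU's block layer (repagent_report_write() and the reports_sent counter are invented for illustration, not the actual patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Stub for QEMU's BlockDriverState -- only what the sketch needs. */
typedef struct BlockDriverState {
    const char *name;   /* volume identification */
    int is_protected;   /* set when the hub sends "start protect" */
} BlockDriverState;

static int reports_sent;   /* visible for the example only */

/* Forward offset/length/status to the hub; on a failed write only the
 * status would be sent, not the payload. */
static void repagent_report_write(BlockDriverState *bs, uint64_t offset,
                                  uint64_t len, const void *buf, int status)
{
    (void)bs; (void)offset; (void)len; (void)buf; (void)status;
    reports_sent++;
}

static int bdrv_co_do_writev(BlockDriverState *bs, uint64_t offset,
                             uint64_t len, const void *buf)
{
    int ret = 0;   /* pretend the underlying write succeeded */
    if (bs->is_protected) {
        /* Report after completion, so the IO status is known. */
        repagent_report_write(bs, offset, len, ret == 0 ? buf : NULL, ret);
    }
    return ret;
}
```

Only writes to protected volumes are reported; unprotected volumes go through untouched.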
>>
>> +
>> +General description of a Rephub - a replication system the repagent
>> +connects to
>> +    This section describes in high level a sample Rephub - a replication
>> +    system that uses the repagent API to replicate disks.
>> +    It describes a simple Rephub that continuously maintains a mirror of
>> +    the volumes of a VM.
>> +
>> +    Say we have a VM we want to protect - call it PVM; say it has 2
>> +    volumes - V1, V2.
>> +    Our Rephub is called SingleRephub - a Rephub protecting a single VM.
>> +
>> +    Preparations
>> +    1. The user chooses a host to run SingleRephub - a different host
>> +       than PVM's; call it Host2.
>> +    2. The user creates two volumes on Host2 - the same sizes as V1 and
>> +       V2; call them V1R (V1 recovery) and V2R.
>> +    3. The user runs the SingleRephub process on Host2, and gives V1R
>> +       and V2R as command line arguments.
>> +       From now on SingleRephub waits for the protected VM repagent to
>> +       connect.
>> +    4. The user runs the protected VM PVM - and uses the switch
>> +       -repagent <Host2 IP>.
>>
>> +
>> +    Runtime
>> +    1. The repagent module connects to SingleRephub on startup.
>> +    2. repagent reports V1 and V2 to SingleRephub.
>> +    3. SingleRephub starts to perform an initial synchronization of the
>> +       protected volumes - it reads each protected volume (V1 and V2)
>> +       using read volume requests, and copies the data into the recovery
>> +       volumes V1R and V2R.
> Are you really going to do this on every start of the VM? Comparing the
> whole content of an image will take quite some time.
It is done when you first start protecting a volume, not each time a VM 
boots. A VM can reboot without needing a full sync.
>
>> +    4. SingleRephub enters 'protection' mode - each write to the
>> +       protected volume is sent by the repagent to the Rephub, and the
>> +       Rephub performs the write on the matching recovery volume.
>> +
>> +    * Note that during stage 3 writes to the protected volumes are not
>> +      ignored - they're kept in a bitmap, and will be read again when
>> +      stage 3 ends, in an iterative converging process.
>> +
>> +    This flow continuously maintains an updated recovery volume.
>> +    If the protected system is damaged, the user can create a new VM on
>> +    Host2 with the replicated volumes attached to it.
>> +    The new VM is a replica of the protected system.
> Have you meanwhile had the time to take a look at Kemari and check how
> big the overlap is?
No. What's Kemari? I'll look it up.
>
> Kevin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:40           ` Paolo Bonzini
@ 2012-02-07 14:48             ` Ori Mamluk
  2012-02-07 15:47               ` Paolo Bonzini
  2012-02-07 14:53             ` Kevin Wolf
  2012-02-07 15:00             ` Anthony Liguori
  2 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-07 14:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel

On 07/02/2012 16:40, Paolo Bonzini wrote:
> On 02/07/2012 03:06 PM, Ori Mamluk wrote:
>> The main issue about it is that the Rephub also needs the other
>> direction - to read the protected volume. I get the feeling that with
>> live block copy and NBD there's probably something that might fit
>> this need, no?
>
> Yes, with two NBD sockets you could do it.  But would you use both at 
> the same time?  I would have thought that either the rephub is 
> streaming from the protected volume, or QEMU is streaming to the rephub.
>
> The current streaming code in QEMU only deals with the former. 
> Streaming to a remote server would not be supported.
>
I need it at the same time. The Rephub reads either the full volume or 
parts of it, and concurrently protects new IOs.
>> With a 'new' agent like I need this is relatively easily achieved by a
>> bidirectional protocol, but I agree a more generic protocol would be
>> more elegant, although it will probably require a socket per 
>> direction, no?
>>
>> Some smaller questions:
>> * Is there already a working iSCSI initiator as a block driver (I hope
>> I'm using the right terminology) in Qemu, or do I need to write one?
>
> Yes, there is one using libiscsi.  But I think Anthony was not 
> referring to iSCSI in particular, NBD would work just as well.
>
>> * This driver would need to be added in run-time - to allow starting to
>> protect a running VM. Maybe via a monitor command. I guess that's OK,
>> right?
>
> Yes, I think you can detach a block device from a drive and reattach 
> the new mirroring device.
>
>> * What can you say about NBD vs. iSCSI - with respect to our
>> requirements - which is more mature in Qemu?
>
> Personally I prefer NBD because it is lighter-weight and there is a 
> server inside QEMU (so you can use it easily with non-raw images).  It 
> is more mature, but it is a bit less extensible.
>
I think we don't need an extensible or rich API, so NBD may fit. I'll 
look deeper into it.
>> One more thing about the iScsi initiator - it will not be a standard
>> backing for a drive, because the 'production' drive (i.e. the original
>> image) is more important than the replicated one. This means that even
>> though we use iScsi, this is still a replication agent - not a generic
>> 'additional' iscsi backing.
>
> Yes, understood.
>
> Paolo


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:40           ` Paolo Bonzini
  2012-02-07 14:48             ` Ori Mamluk
@ 2012-02-07 14:53             ` Kevin Wolf
  2012-02-07 15:00             ` Anthony Liguori
  2 siblings, 0 replies; 66+ messages in thread
From: Kevin Wolf @ 2012-02-07 14:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: תומר בן אור,
	Ori Mamluk, עודד קדם,
	dlaor, qemu-devel

Am 07.02.2012 15:40, schrieb Paolo Bonzini:
> On 02/07/2012 03:06 PM, Ori Mamluk wrote:
>> The main issue about it is that the Rephub also needs the other
>> direction - to read the protected volume. I get the feeling that with
>> live block copy and NBD there's probably something that might fit
>> this need, no?
> 
> Yes, with two NBD sockets you could do it.  But would you use both at 
> the same time?  I would have thought that either the rephub is streaming 
> from the protected volume, or QEMU is streaming to the rephub.
> 
> The current streaming code in QEMU only deals with the former. 
> Streaming to a remote server would not be supported.

Eventually we'll want to have it. We have discussed a mirror block
driver more than once. I think the same thing could be reused for
both replication and pre-copy live block migration.

You would probably have some flags that describe differences in the
detail (e.g. whether to wait for the mirrored write or not), but they
should be relatively small.

Kevin


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:50   ` Stefan Hajnoczi
                       ` (2 preceding siblings ...)
  2012-02-07 14:18     ` Ori Mamluk
@ 2012-02-07 14:59     ` Anthony Liguori
  2012-02-07 15:20       ` Stefan Hajnoczi
  2012-02-21 16:01       ` Markus Armbruster
  3 siblings, 2 replies; 66+ messages in thread
From: Anthony Liguori @ 2012-02-07 14:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, dlaor, qemu-devel, Ori Mamluk, Luiz Capitulino

On 02/07/2012 07:50 AM, Stefan Hajnoczi wrote:
> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>>> Repagent is a new module that allows an external replication system to
>>> replicate a volume of a Qemu VM.
>
> I recently joked with Kevin that QEMU is on its way to reimplementing
> the Linux block and device-mapper layers.  Now we have drbd, thanks!
> :P

I don't think it's a joke.  Do we really want to get into this space?  Why not 
just use drbd?

If it's because we want to also work with image formats, perhaps we should 
export our image format code as a shared library and let drbd link against it.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:40           ` Paolo Bonzini
  2012-02-07 14:48             ` Ori Mamluk
  2012-02-07 14:53             ` Kevin Wolf
@ 2012-02-07 15:00             ` Anthony Liguori
  2 siblings, 0 replies; 66+ messages in thread
From: Anthony Liguori @ 2012-02-07 15:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, dlaor,
	עודד קדם,
	תומר בן אור,
	qemu-devel, Ori Mamluk

On 02/07/2012 08:40 AM, Paolo Bonzini wrote:
> On 02/07/2012 03:06 PM, Ori Mamluk wrote:
>> The main issue about it is that the Rephub also needs the other
>> direction - to read the protected volume. I get the feeling that with
>> live block copy and NBD there's probably something that might fit
>> this need, no?
>
> Yes, with two NBD sockets you could do it. But would you use both at the same
> time? I would have thought that either the rephub is streaming from the
> protected volume, or QEMU is streaming to the rephub.
>
> The current streaming code in QEMU only deals with the former. Streaming to a
> remote server would not be supported.
>
>> With a 'new' agent like I need this is relatively easily achieved by a
>> bidirectional protocol, but I agree a more generic protocol would be
>> more elegant, although it will probably require a socket per direction, no?
>>
>> Some smaller questions:
>> * Is there already a working iSCSI initiator as a block driver (I hope
>> I'm using the right terminology) in Qemu, or do I need to write one?
>
> Yes, there is one using libiscsi. But I think Anthony was not referring to iSCSI
> in particular, NBD would work just as well.
>
>> * This driver would need to be added in run-time - to allow starting to
>> protect a running VM. Maybe via a monitor command. I guess that's OK,
>> right?
>
> Yes, I think you can detach a block device from a drive and reattach the new
> mirroring device.
>
>> * What can you say about NBD vs. iSCSI - with respect to our
>> requirements - which is more mature in Qemu?
>
> Personally I prefer NBD because it is lighter-weight and there is a server
> inside QEMU (so you can use it easily with non-raw images). It is more mature,
> but it is a bit less extensible.

Which is also fine.

You could also just use DRBD ;-)

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:59     ` Anthony Liguori
@ 2012-02-07 15:20       ` Stefan Hajnoczi
  2012-02-07 16:25         ` Anthony Liguori
  2012-02-21 16:01       ` Markus Armbruster
  1 sibling, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-07 15:20 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, dlaor, qemu-devel, Ori Mamluk, Luiz Capitulino

On Tue, Feb 7, 2012 at 2:59 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 02/07/2012 07:50 AM, Stefan Hajnoczi wrote:
>>
>> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>
>>> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>>>>
>>>> Repagent is a new module that allows an external replication system to
>>>> replicate a volume of a Qemu VM.
>>
>>
>> I recently joked with Kevin that QEMU is on its way to reimplementing
>> the Linux block and device-mapper layers.  Now we have drbd, thanks!
>> :P
>
>
> I don't think it's a joke.  Do we really want to get into this space?  Why
> not just use drbd?
>
> If it's because we want to also work with image formats, perhaps we should
> export our image format code as a shared library and let drbd link against
> it.

When the guest disk image is on an LVM volume the picture looks like this:

Guest -> QEMU -> drbd -> LVM volume

When an image file is in use we need a way for Linux to access it:

Guest -> QEMU -> drbd -> local NBD server

The local NBD server runs the qcow2, qed, etc code.

Both scenarios are possible today without modifications to QEMU code.
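The layering above could be wired up along these lines; the device paths, resource name, and addresses below are illustrative assumptions, not tested configuration:

```
# Export the image-format code as a block device, then put drbd on top.
# (illustrative only)
#   qemu-nbd --connect=/dev/nbd0 /images/guest.qcow2
#
# /etc/drbd.d/guest.res
resource guest {
    device    /dev/drbd0;
    disk      /dev/nbd0;        # the local NBD export of the qcow2 image
    meta-disk internal;
    on host1 { address 192.168.0.1:7789; }
    on host2 { address 192.168.0.2:7789; }
}
```

The guest then attaches to /dev/drbd0, so every write flows through drbd's replication before reaching the image-format code.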

For reference, here is the drbd website http://www.drbd.org/, it's
described as "network based raid-1".

Looking seriously at drbd is worthwhile because it already implements
the advanced features that this prototype patch omits.  Take a look at
the documentation at http://www.drbd.org/docs/working/.  Checksums,
rate of synchronization, congestion policies, I/O error handling
policies, and so on are all supported already.

I suspect using drbd would require more management stack integration
rather than QEMU patches.  For example, libvirt would need to launch
NBD servers and drbd.

Stefan


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:48             ` Ori Mamluk
@ 2012-02-07 15:47               ` Paolo Bonzini
  2012-02-08  6:10                 ` Ori Mamluk
  0 siblings, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-07 15:47 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf, עודד קדם,
	תומר בן אור,
	dlaor, qemu-devel

On 02/07/2012 03:48 PM, Ori Mamluk wrote:
>> The current streaming code in QEMU only deals with the former.
>> Streaming to a remote server would not be supported.
>>
> I need it at the same time. The Rephub reads either the full volume or
> parts of, and concurrently protects new IOs.

Why can't QEMU itself stream the full volume in the background, and send 
that together with any new I/O?  Is it because the rephub knows which 
parts are out-of-date and need recovery?  In that case, as a first 
approximation the rephub can pass the sector at which streaming should 
start.

But I'm also starting to wonder whether it would be simpler to use 
existing replication code.  DRBD is more feature-rich, and you can use 
it over loopback or NBD devices (respectively raw and non-raw), and also 
store the replication metadata on a file using the loopback device. 
Ceph even has a userspace library and support within QEMU.

Paolo


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 15:20       ` Stefan Hajnoczi
@ 2012-02-07 16:25         ` Anthony Liguori
  0 siblings, 0 replies; 66+ messages in thread
From: Anthony Liguori @ 2012-02-07 16:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, dlaor, qemu-devel, Ori Mamluk, Luiz Capitulino

On 02/07/2012 09:20 AM, Stefan Hajnoczi wrote:
> On Tue, Feb 7, 2012 at 2:59 PM, Anthony Liguori<anthony@codemonkey.ws>  wrote:
>> On 02/07/2012 07:50 AM, Stefan Hajnoczi wrote:
>>>
>>> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<kwolf@redhat.com>    wrote:
>>>>
>>>> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>>>>>
>>>>> Repagent is a new module that allows an external replication system to
>>>>> replicate a volume of a Qemu VM.
>>>
>>>
>>> I recently joked with Kevin that QEMU is on its way to reimplementing
>>> the Linux block and device-mapper layers.  Now we have drbd, thanks!
>>> :P
>>
>>
>> I don't think it's a joke.  Do we really want to get into this space?  Why
>> not just use drbd?
>>
>> If it's because we want to also work with image formats, perhaps we should
>> export our image format code as a shared library and let drbd link against
>> it.
>
> When the guest disk image is on an LVM volume the picture looks like this:
>
> Guest ->  QEMU ->  drbd ->  LVM volume
>
> When an image file is in use we need a way for Linux to access it:
>
> Guest ->  QEMU ->  drbd ->  local NBD server
>
> The local NBD server runs the qcow2, qed, etc code.
>
> Both scenarios are possible today without modifications to QEMU code.
>
> For reference, here is the drbd website http://www.drbd.org/, it's
> described as "network based raid-1".
>
> Looking seriously at drbd is worthwhile because it already implements
> the advanced features that this prototype patch omits.  Take a look at
> the documentation at http://www.drbd.org/docs/working/.  Checksums,
> rate of synchronization, congestion policies, I/O error handling
> policies, and so on are all supported already.
>
> I suspect using drbd would require more management stack integration
> rather than QEMU patches.  For example, libvirt would need to launch
> NBD servers and drbd.

I don't have a problem with a libqemu that presented an easy to use interface 
for libvirt to consume....

But NIH, justified by the fact that we can provide a single management 
interface, isn't a good justification IMHO.

Regards,

Anthony Liguori

>
> Stefan
>


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 15:47               ` Paolo Bonzini
@ 2012-02-08  6:10                 ` Ori Mamluk
  2012-02-08  8:49                   ` Dor Laor
                                     ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Ori Mamluk @ 2012-02-08  6:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf, dlaor,
	עודד קדם,
	תומר בן אור,
	qemu-devel, Yair Kuszpet

On 07/02/2012 17:47, Paolo Bonzini wrote:
> On 02/07/2012 03:48 PM, Ori Mamluk wrote:
>>> The current streaming code in QEMU only deals with the former.
>>> Streaming to a remote server would not be supported.
>>>
>> I need it at the same time. The Rephub reads either the full volume or
>> parts of, and concurrently protects new IOs.
>
> Why can't QEMU itself stream the full volume in the background, and 
> send that together with any new I/O?  Is it because the rephub knows 
> which parts are out-of-date and need recovery?  In that case, as a 
> first approximation the rephub can pass the sector at which streaming 
> should start.
Yes - it's because rephub knows. The parts that need recovery may be a 
series of random IOs that were lost because of a network outage 
somewhere along the replication pipe.
Easy to think of it as a bitmap holding the not-yet-replicated IOs. The 
rephub occasionally reads those areas to 'sync' them, so in effect the 
rephub needs read access - it's not really to trigger streaming from an 
offset.
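The 'read those areas to sync them' step amounts to a loop over the not-yet-replicated bitmap; a hypothetical sketch, where read_chunk stands in for a read-volume request/response round trip (names and the 4 KB chunk size are invented):

```c
#include <stdint.h>

#define CHUNK 4096   /* assumed sync granularity */

/* Stand-in for sending a read-volume request and copying the answer
 * onto the recovery volume. Returns 0 on success. */
typedef int (*read_chunk_fn)(uint64_t offset, void *buf, void *opaque);

/* Walk the dirty map, re-read every marked chunk from the protected
 * volume, and clear the bit once the data landed on the replica.
 * Returns the number of chunks synced, or -1 on a failed read. */
static int sync_dirty_chunks(uint8_t *dirty, uint64_t nb_chunks,
                             read_chunk_fn read_chunk, void *opaque)
{
    uint8_t buf[CHUNK];
    int synced = 0;
    for (uint64_t c = 0; c < nb_chunks; c++) {
        if (!dirty[c]) {
            continue;
        }
        if (read_chunk(c * CHUNK, buf, opaque) != 0) {
            return -1;          /* leave the bit set, retry later */
        }
        dirty[c] = 0;
        synced++;
    }
    return synced;
}
```

New writes arriving during the walk simply set their bits again, which is the converging behaviour described for stage 3 of the sample Rephub.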
>
> But I'm also starting to wonder whether it would be simpler to use 
> existing replication code.  DRBD is more feature-rich, and you can use 
> it over loopback or NBD devices (respectively raw and non-raw), and 
> also store the replication metadata on a file using the loopback 
> device. Ceph even has a userspace library and support within QEMU.
>
I think there are two immediate problems that drbd poses:
1. Our replication is not a simple mirror - it maintains history. I.e. 
you can recover to any point in time in the last X hours (usually 24) at 
a granularity of about 5 seconds.
To be able to do that and keep the replica consistent we need to be 
notified for each IO.
2. drbd is 'below' all the Qemu block layers - if the protected volume 
is qcow2 then drbd doesn't get the raw IOs, right?
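A history-keeping replica as described in point 1 is essentially a write journal rather than a plain mirror; a minimal sketch of what each journaled IO would need to carry (all field names are invented, not Zerto's actual format):

```c
#include <stdint.h>

/* One journaled write: enough to redo (or locate) the IO when rolling
 * the replica forward to a requested point in time. */
typedef struct JournalEntry {
    uint64_t timestamp_us;   /* when the guest issued the write */
    uint32_t volume_id;      /* which protected volume */
    uint64_t offset;         /* byte offset of the write */
    uint32_t length;         /* byte length of the write */
    uint64_t data_pos;       /* where the payload lives in the journal */
} JournalEntry;

/* Find the last entry at or before the requested recovery time;
 * replaying entries [0, idx] reproduces the volume at that instant.
 * Returns -1 if the requested time precedes the journal. */
static int64_t journal_find_point(const JournalEntry *log, int64_t n,
                                  uint64_t recover_us)
{
    int64_t idx = -1;
    for (int64_t i = 0; i < n; i++) {
        if (log[i].timestamp_us <= recover_us) {
            idx = i;
        } else {
            break;          /* entries are appended in time order */
        }
    }
    return idx;
}
```

This is why per-IO notification matters: a mirror that coalesces writes loses the ordering the journal needs.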

Ori


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  6:10                 ` Ori Mamluk
@ 2012-02-08  8:49                   ` Dor Laor
  2012-02-08 11:59                     ` Stefan Hajnoczi
  2012-02-08  8:55                   ` Kevin Wolf
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Dor Laor @ 2012-02-08  8:49 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf, עודד קדם,
	תומר בן אור,
	qemu-devel, Yair Kuszpet, Paolo Bonzini

On 02/08/2012 08:10 AM, Ori Mamluk wrote:
> On 07/02/2012 17:47, Paolo Bonzini wrote:
>> On 02/07/2012 03:48 PM, Ori Mamluk wrote:
>>>> The current streaming code in QEMU only deals with the former.
>>>> Streaming to a remote server would not be supported.
>>>>
>>> I need it at the same time. The Rephub reads either the full volume or
>>> parts of, and concurrently protects new IOs.
>>
>> Why can't QEMU itself stream the full volume in the background, and
>> send that together with any new I/O? Is it because the rephub knows
>> which parts are out-of-date and need recovery? In that case, as a
>> first approximation the rephub can pass the sector at which streaming
>> should start.
> Yes - it's because rephub knows. The parts that need recovery may be a
> series of random IOs that were lost because of a network outage
> somewhere along the replication pipe.
> Easy to think of it as a bitmap holding the not-yet-replicated IOs. The
> rephub occasionally reads those areas to 'sync' them, so in effect the
> rephub needs read access - it's not really to trigger streaming from an
> offset.
>>
>> But I'm also starting to wonder whether it would be simpler to use
>> existing replication code. DRBD is more feature-rich, and you can use
>> it over loopback or NBD devices (respectively raw and non-raw), and
>> also store the replication metadata on a file using the loopback
>> device. Ceph even has a userspace library and support within QEMU.
>>
> I think there are two immediate problems that drbd poses:
> 1. Our replication is not a simple mirror - it maintains history. I.e.
> you can recover to any point in time in the last X hours (usually 24) at
> a granularity of about 5 seconds.
> To be able to do that and keep the replica consistent we need to be
> notified for each IO.

Can you please elaborate some more on the exact details?
In theory, you can build a setup where the drbd (or nbd) copy on the 
destination side writes to an intermediate image, and every such write is 
trapped locally on the destination, so you may not immediately propagate 
it to the disk image the VM sees.

> 2. drbd is 'below' all the Qemu block layers - if the protected volume
> is qcow2 then drbd doesn't get the raw IOs, right?

That's one of the major caveats in drbd/iscsi/nbd - there is no support 
for block level snapshots[1]. I wonder if the scsi protocol has 
something like this, so we could get efficient replication of qcow2/lvm 
snapshots whose base is already shared. If we gain such 
functionality, we'll benefit from it for a storage VM motion solution too.

Another issue w/ drbd is that a continuous backup solution requires 
taking a consistent snapshot, calling filesystem freeze, and syncing it 
with the current block IO transfer. Neither DRBD nor the other protocols 
do that. Of course DRBD can be enhanced, but that will take a lot more time.

A third requirement, similar to the above, is to group snapshots of 
several VMs so that a consistent _cross-VM application view_ is created. 
It demands some control over IO tagging.

To summarize, IMHO drbd (which I used successfully 6 years ago and 
love) is not a drop-in replacement for this case.
I recommend we either fit the nbd/iscsi case and improve our VM 
storage motion on the way, or in the worst case develop proprietary logic 
that can live outside of qemu using an IO tapping interface, similar to 
the guidelines Ori outlined.

Thanks,
Dor

[1] Check the far too basic approach for snapshots: 
http://www.drbd.org/users-guide/s-lvm-snapshots.html
>
> Ori
>


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  6:10                 ` Ori Mamluk
  2012-02-08  8:49                   ` Dor Laor
@ 2012-02-08  8:55                   ` Kevin Wolf
  2012-02-08  9:47                     ` Ori Mamluk
  2012-02-08 11:02                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
  2012-02-08 12:03                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
  3 siblings, 1 reply; 66+ messages in thread
From: Kevin Wolf @ 2012-02-08  8:55 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: dlaor, עודד קדם,
	תומר בן אור,
	qemu-devel, Yair Kuszpet, Paolo Bonzini

Am 08.02.2012 07:10, schrieb Ori Mamluk:
> On 07/02/2012 17:47, Paolo Bonzini wrote:
>> On 02/07/2012 03:48 PM, Ori Mamluk wrote:
>>>> The current streaming code in QEMU only deals with the former.
>>>> Streaming to a remote server would not be supported.
>>>>
>>> I need it at the same time. The Rephub reads either the full volume or
>>> parts of, and concurrently protects new IOs.
>>
>> Why can't QEMU itself stream the full volume in the background, and 
>> send that together with any new I/O?  Is it because the rephub knows 
>> which parts are out-of-date and need recovery?  In that case, as a 
>> first approximation the rephub can pass the sector at which streaming 
>> should start.
> Yes - it's because rephub knows. The parts that need recovery may be a 
> series of random IOs that were lost because of a network outage 
> somewhere along the replication pipe.
> Easy to think of it as a bitmap holding the not-yet-replicated IOs. The 
> rephub occasionally reads those areas to 'sync' them, so in effect the 
> rephub needs read access - it's not really to trigger streaming from an 
> offset.

So how does the rephub know which areas were touched by lost requests?
Isn't qemu the only one who could know what it sent?

Kevin


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  8:55                   ` Kevin Wolf
@ 2012-02-08  9:47                     ` Ori Mamluk
  2012-02-08 10:04                       ` Kevin Wolf
  0 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-08  9:47 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: dlaor, עודד קדם,
	תומר בן אור,
	qemu-devel, Yair Kuszpet, Paolo Bonzini

On 08/02/2012 10:55, Kevin Wolf wrote:
> Am 08.02.2012 07:10, schrieb Ori Mamluk:
>> On 07/02/2012 17:47, Paolo Bonzini wrote:
>>>
>>> Why can't QEMU itself stream the full volume in the background, and
>>> send that together with any new I/O?  Is it because the rephub knows
>>> which parts are out-of-date and need recovery?  In that case, as a
>>> first approximation the rephub can pass the sector at which streaming
>>> should start.
>> Yes - it's because rephub knows. The parts that need recovery may be a
>> series of random IOs that were lost because of a network outage
>> somewhere along the replication pipe.
>> Easy to think of it as a bitmap holding the not-yet-replicated IOs. The
>> rephub occasionally reads those areas to 'sync' them, so in effect the
>> rephub needs read access - it's not really to trigger streaming from an
>> offset.
> So how does the rephub know which areas were touched by lost requests?
> Isn't qemu the only one who could know what it sent?
>
> Kevin
You're right. Currently only Qemu knows.
The problem is that if we move the responsibility to a layer below Qemu, 
then the rephub will never know.
Our (Zerto's) solution for vmware has a different design, but it has 3 
parts relevant to this discussion:
1. Tapping to protected writes / read protected volume
2. Maintain a bitmap
3. Provide cross-VM consistency for recovery.

I want to simplify our design by taking it one step at a time.
My first goal for Qemu is to have only step 1 - meaning tap all 
protected writes, and be able to read.
I think it will be simpler for all of us to complete that first, and it 
provides a basic ability (though not optimal) for protection and recovery.

I think using an external streaming mechanism will make the next stages 
impossible.


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  9:47                     ` Ori Mamluk
@ 2012-02-08 10:04                       ` Kevin Wolf
  2012-02-08 13:28                         ` [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module) Ori Mamluk
  0 siblings, 1 reply; 66+ messages in thread
From: Kevin Wolf @ 2012-02-08 10:04 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: dlaor, עודד קדם,
	תומר בן אור,
	qemu-devel, Yair Kuszpet, Paolo Bonzini

Am 08.02.2012 10:47, schrieb Ori Mamluk:
> On 08/02/2012 10:55, Kevin Wolf wrote:
>> Am 08.02.2012 07:10, schrieb Ori Mamluk:
>>> On 07/02/2012 17:47, Paolo Bonzini wrote:
>>>>
>>>> Why can't QEMU itself stream the full volume in the background, and
>>>> send that together with any new I/O?  Is it because the rephub knows
>>>> which parts are out-of-date and need recovery?  In that case, as a
>>>> first approximation the rephub can pass the sector at which streaming
>>>> should start.
>>> Yes - it's because rephub knows. The parts that need recovery may be a
>>> series of random IOs that were lost because of a network outage
>>> somewhere along the replication pipe.
>>> Easy to think of it as a bitmap holding the not-yet-replicated IOs. The
>>> rephub occasionally reads those areas to 'sync' them, so in effect the
>>> rephub needs read access - it's not really to trigger streaming from an
>>> offset.
>> So how does the rephub know which areas were touched by lost requests?
>> Isn't qemu the only one who could know what it sent?
>>
>> Kevin
> You're right. Currently only Qemu knows.

How could it change later on? If the network is down, qemu can't
communicate it to anyone else, so it stays the only one who knows.

> The problem is that if we move the responsibility to a layer below Qemu 
> - then rephub will never know.
> Our (Zerto's) solution for vmware has a different design, but it has 3 
> parts relevant to this discussion:
> 1. Tapping to protected writes / read protected volume
> 2. Maintain a bitmap
> 3. Provide cross-VM consistency for recovery.
> 
> I want to simplify our design by taking it one step at a time.
> My first goal for Qemu is to have only step 1 - meaning tap all 
> protected writes, and be able to read.
> I think it will be simpler for all of us to complete that first, and it 
> provides a basic ability (though not optimal) for protection and recovery.
> 
> I think using an external streaming mechanism will make the next stages 
> impossible.

Well, then we need to discuss all stages now. If you tell only part of
what you're going to do, you'll get a design that will only work for
part of what you need.

Kevin


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  6:10                 ` Ori Mamluk
  2012-02-08  8:49                   ` Dor Laor
  2012-02-08  8:55                   ` Kevin Wolf
@ 2012-02-08 11:02                   ` Stefan Hajnoczi
  2012-02-08 13:00                     ` [Qemu-devel] [RFC] Replication agent requirements (was [RFC PATCH] replication agent module) Ori Mamluk
  2012-02-08 12:03                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
  3 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 11:02 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Yair Kuszpet, Paolo Bonzini

On Wed, Feb 8, 2012 at 6:10 AM, Ori Mamluk <omamluk@zerto.com> wrote:
> 2. drbd is 'below' all the Qemu block layers - if the protected volume is
> qcow2 then drbd doesn't get the raw IOs, right?

No, if you look at the layers again:

Guest -> QEMU -> drbd -> local NBD server

The local NBD server runs the qcow2, qed, etc code.

drbd is on top of image format code.
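The distinction can be illustrated with a toy model (hypothetical Python classes, not QEMU code): a tap sitting above the image format code, as in the stack above where the NBD server runs qcow2 below drbd, observes the raw guest offsets, while a tap below the format code would only observe the remapped file offsets.

```python
# Toy model of the layering discussed above (hypothetical, not QEMU code).
# A qcow2-like driver remaps guest offsets to host-file offsets; a tap
# above it sees the raw guest I/O, while a tap below it only sees the
# format's on-disk layout.

class FormatDriver:
    """Minimal copy-on-write-style remapping: guest cluster -> file cluster."""
    CLUSTER = 4096

    def __init__(self, backend):
        self.backend = backend      # receives (offset, data) after remapping
        self.map = {}               # guest cluster index -> file cluster index
        self.next_free = 0

    def write(self, guest_off, data):
        idx = guest_off // self.CLUSTER
        if idx not in self.map:     # allocate on first write, like qcow2
            self.map[idx] = self.next_free
            self.next_free += 1
        file_off = self.map[idx] * self.CLUSTER + guest_off % self.CLUSTER
        self.backend.write(file_off, data)

class Tap:
    def __init__(self, lower=None):
        self.seen = []              # (offset, data) pairs observed
        self.lower = lower

    def write(self, off, data):
        self.seen.append((off, data))
        if self.lower:
            self.lower.write(off, data)

below = Tap()                        # tap below the format driver
fmt = FormatDriver(below)
above = Tap(lower=fmt)               # tap above the format driver

above.write(10 * 4096, b"guest data")   # guest writes to cluster 10
# The upper tap records the guest offset; the lower tap records the
# allocated file offset (cluster 0, since it is the first allocation).
```

In Stefan's proposed stack, drbd plays the role of the upper tap, which is why it sees raw guest I/O even for a qcow2 image.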

Stefan


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 13:34 ` Kevin Wolf
  2012-02-07 13:50   ` Stefan Hajnoczi
  2012-02-07 14:45   ` Ori Mamluk
@ 2012-02-08 11:45   ` Luiz Capitulino
  2 siblings, 0 replies; 66+ messages in thread
From: Luiz Capitulino @ 2012-02-08 11:45 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: dlaor, qemu-devel, Ori Mamluk

On Tue, 07 Feb 2012 14:34:12 +0100
Kevin Wolf <kwolf@redhat.com> wrote:

> The other message types could possibly be implemented as QMP commands. I
> guess we might need to attach multiple QMP monitors for this to work
> (one for libvirt, one for the rephub). I'm not sure if there is a
> fundamental problem with this or if it just needs to be done.

Afaik, multiple QMP instances are not well tested. It's something I've tried
not to break when adding new functionality, but I don't think it has had any
serious usage so far.


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  8:49                   ` Dor Laor
@ 2012-02-08 11:59                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 11:59 UTC (permalink / raw)
  To: dlaor
  Cc: Kevin Wolf, עודד קדם,
	תומר בן אור,
	qemu-devel, Ori Mamluk, Yair Kuszpet, Paolo Bonzini

2012/2/8 Dor Laor <dlaor@redhat.com>:
> On 02/08/2012 08:10 AM, Ori Mamluk wrote:
>>
>> 2. drbd is 'below' all the Qemu block layers - if the protected volume
>> is qcow2 then drbd doesn't get the raw IOs, right?
>
>
> That's one of the major caveats in drbd/iscsi/nbd - there is no support for
> block level snapshots[1]. I wonder if the scsi protocol has something like
> this so we'll get efficient replication of qcow2/lvm snapshots whose
> base is already shared. If we gain such functionality, we'll benefit from
> it for the storage vm motion solution too.

In the case of copy-on-write disk images we do want to mirror all
writes because, by definition, they are not shared.  I think the
trickier part is how to do the initial synchronization without copying
the entire backing file.
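One way to sketch that initial synchronization, assuming an allocation-map query in the spirit of QEMU's bdrv_is_allocated, is to send only the clusters allocated in the copy-on-write top layer, since everything else still lives in the shared backing file (hypothetical helpers, not real QEMU APIs):

```python
# Sketch of an initial sync that avoids copying the whole backing file:
# only clusters allocated in the copy-on-write top layer are sent.
# Hypothetical data structures, not QEMU code.

def initial_sync(top_allocation_map, read_cluster, send_cluster):
    """top_allocation_map: dict of cluster index -> allocated? (in the
    spirit of bdrv_is_allocated); read_cluster/send_cluster are callbacks."""
    sent = 0
    for idx, allocated in sorted(top_allocation_map.items()):
        if allocated:
            send_cluster(idx, read_cluster(idx))
            sent += 1
    return sent

# Example: a 6-cluster image where only clusters 1 and 4 were written
# after the snapshot; the rest still live in the shared backing file.
alloc = {0: False, 1: True, 2: False, 3: False, 4: True, 5: False}
data = {1: b"A", 4: b"B"}
received = {}
n = initial_sync(alloc, lambda i: data[i],
                 lambda i, d: received.__setitem__(i, d))
```

Only two of the six clusters cross the wire; the server reconstructs the rest from its own copy of the shared base.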

> Another issue w/ drbd is that a continuous backup solution requires taking a
> consistent snapshot, calling a file system freeze, and syncing it w/ the
> current block IO transfer. Neither DRBD nor the other protocols do that. Of
> course DRBD can be enhanced, but it will take a lot more time.

Ori's patch simply mirrors writes; it doesn't have any higher-level
consistent snapshot support either.  Consistent snapshots are
different from continuous backups - I thought these were being
addressed with completely separate QMP and guest agent commands?

> A third requirement and similar to above is to group snapshots of several
> VMs so a consistent _cross vm application view_ will be created. It demands
> some control over IO tagging.

If I understand correctly this means being able to go back to time T
across multiple VMs' volumes.  That sounds like a timestamping issue
and is mainly a server-side feature; the agent is not involved.
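That server-side idea might be sketched roughly like this (hypothetical structures; the real write-order-fidelity machinery across hosts is glossed over): the hub keeps a timestamped write journal per VM and recovers a consistent group view by replaying, for every VM, only entries up to time T.

```python
# Server-side sketch of cross-VM point-in-time recovery: a timestamped
# write journal per VM, replayed up to a common cut time T. Hypothetical
# classes, not part of any real replication hub.

class Journal:
    def __init__(self):
        self.entries = []           # (timestamp, offset, data), in arrival order

    def record(self, ts, off, data):
        self.entries.append((ts, off, data))

    def image_at(self, t):
        """Replay all writes with timestamp <= t."""
        img = {}
        for ts, off, data in self.entries:
            if ts <= t:
                img[off] = data
        return img

def group_recover(journals, t):
    """Consistent view across VMs: every journal cut at the same time t."""
    return {vm: j.image_at(t) for vm, j in journals.items()}

j1, j2 = Journal(), Journal()
j1.record(1, 0, b"a1"); j1.record(5, 0, b"a2")
j2.record(2, 0, b"b1"); j2.record(6, 0, b"b2")
view = group_recover({"vm1": j1, "vm2": j2}, t=4)
# At t=4 both VMs roll back to their last write before the cut.
```

The agent only has to deliver the writes with timestamps; choosing and applying the cut is entirely the hub's job, which matches the point above.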

> To summarize, IMHO drbd (which I used successfully 6 years ago and I love)
> is not a drop-in replacement solution for this case.
> I recommend we either fit the nbd/iscsi case and improve our vm storage
> motion on the way, or in the worst case develop proprietary logic that can
> live outside of qemu using an IO tapping interface, similar to the
> guidelines Ori outlined.

Perhaps we can figure out how to make this replication functionality
fit in with image streaming and block migration.  If it provides
generally useful functionality (outside of just the replication case)
then that would be worth adding to QEMU because it would be useful
beyond drbd territory.

Stefan


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08  6:10                 ` Ori Mamluk
                                     ` (2 preceding siblings ...)
  2012-02-08 11:02                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
@ 2012-02-08 12:03                   ` Stefan Hajnoczi
  2012-02-08 12:46                     ` Paolo Bonzini
  3 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 12:03 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Yair Kuszpet, Paolo Bonzini

On Wed, Feb 8, 2012 at 6:10 AM, Ori Mamluk <omamluk@zerto.com> wrote:
> On 07/02/2012 17:47, Paolo Bonzini wrote:
>> But I'm also starting to wonder whether it would be simpler to use
>> existing replication code.  DRBD is more feature-rich, and you can use it
>> over loopback or NBD devices (respectively raw and non-raw), and also store
>> the replication metadata on a file using the loopback device. Ceph even has
>> a userspace library and support within QEMU.
>>
> I think there are two immediate problems that drbd poses:
> 1. Our replication is not a simple mirror - it maintains history. I.e. you
> can recover to any point in time in the last X hours (usually 24) at a
> granularity of about 5 seconds.
> To be able to do that and keep the replica consistent we need to be notified
> for each IO.

If you intend to run an unmodified drbd server on the rephub, then it
may not be possible to get point-in-time backups.  (Although this
probably depends since things like btrfs or zfs may allow you to get
back to arbitrary transactions or timestamps.)

But you could consider drbd as a network protocol and implement your
own server which speaks the protocol.  Then you can add any
functionality you like, just like the case with the proprietary rephub
server you mentioned in your patch.

So the only difference is that instead of using a new custom protocol
the rephub would need to speak the drbd protocol.

Stefan


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:05     ` Paolo Bonzini
@ 2012-02-08 12:17       ` Orit Wasserman
  0 siblings, 0 replies; 66+ messages in thread
From: Orit Wasserman @ 2012-02-08 12:17 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

On 02/07/2012 04:05 PM, Paolo Bonzini wrote:
> On 02/07/2012 02:50 PM, Stefan Hajnoczi wrote:
>>> I guess we might need to attach multiple QMP monitors for this to work
>>> (one for libvirt, one for the rephub). I'm not sure if there is a
>>> fundamental problem with this or if it just needs to be done.
>>
>> Agreed.  You can already query block devices using QMP 'query-block'.
>> By adding in-process NBD server support you could then launch an NBD
>> server for each volume which you wish to replicate.  However, in this
>> case it sounds almost like you want the reverse - you could provide an
>> NBD server on the rephub and QEMU would mirror writes to it (the NBD
>> client code is already in QEMU).
> 
> Yes, this is how we were also planning to do migration without shared storage, right?

We originally planned to run an iSCSI target. But now that NBD can handle a file chain, we can use it too.

> 
> Paolo
> 
> 


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:45   ` Ori Mamluk
@ 2012-02-08 12:29     ` Orit Wasserman
  0 siblings, 0 replies; 66+ messages in thread
From: Orit Wasserman @ 2012-02-08 12:29 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf, dlaor,
	עודד קדם,
	תומר בן אור,
	qemu-devel, Luiz Capitulino

On 02/07/2012 04:45 PM, Ori Mamluk wrote:
> On 07/02/2012 15:34, Kevin Wolf wrote:
>> Am 07.02.2012 11:29, schrieb Ori Mamluk:
>>> Repagent is a new module that allows an external replication system to
>>> replicate a volume of a Qemu VM.
>>>
>>> This RFC patch adds the repagent client module to Qemu.
>>>
>>>
>>>
>>> Documentation of the module role and API is in the patch at
>>> replication/qemu-repagent.txt
>>>
>>>
>>>
>>> The main motivation behind the module is to allow replication of VMs in
>>> a virtualization environment like RhevM.
>>>
>>> To achieve this we need basic replication support in Qemu.
>>>
>>>
>>>
>>> This is the first submission of this module, which was written as a
>>> Proof Of Concept, and used successfully for replicating and recovering a
>>> Qemu VM.
>> I'll mostly ignore the code for now and just comment on the design.
> That's fine. The code was mainly for my understanding of the system.
>> One thing to consider for the next version of the RFC would be to split
>> this in a series smaller patches. This one has become quite large, which
>> makes it hard to review (and yes, please use git send-email).
>>
>>> Points and open issues:
>>>
>>> *             The module interfaces the Qemu storage stack at block.c
>>> generic layer. Is this the right place to intercept/inject IOs?
>> There are two ways to intercept I/O requests. The first one is what you
>> chose, just add some code to bdrv_co_do_writev, and I think it's
>> reasonable to do this.
>>
>> The other one would be to add a special block driver for a replication:
>> protocol that writes to two different places (the real block driver for
>> the image, and the network connection). Generally this feels even a bit
>> more elegant, but it brings new problems with it: For example, when you
>> create an external snapshot, you need to pay attention not to lose the
>> replication because the protocol is somewhere in the middle of a backing
>> file chain.
> Yes. With this solution we'll have to somehow make sure that the replication driver is closer to the guest than any driver which alters the IO.
> 
>>
>>> *             The patch contains performing IO reads invoked by a new
>>> thread (a TCP listener thread). See repaget_read_vol in repagent.c. It
>>> is not protected by any lock – is this OK?
>> No, definitely not. Block layer code expects that it holds
>> qemu_global_mutex.
>>
>> I'm not sure if a thread is the right solution. You should probably use
>> something that resembles other asynchronous code in qemu, i.e. either
>> callback or coroutine based.
> I call bdrv_aio_readv - which in my understanding creates a coroutine, so my current solution is coroutine-based. Did I get something wrong?
> 
>>
>>> *             VM ID – the replication system implies an environment with
>>> several VMs connected to a central replication system (Rephub).
>>>                  This requires some sort of identification for a VM. The
>>> current patch does not include a VM ID – I did not find any adequate ID
>>> to use.
>> The replication hub already opened a connection to the VM, so it somehow
>> managed to know which VM this process represents, right?
> The current design has the server at the Rephub side, so the VM connects to the Rephub, and not the other way around.
> The VM could be instructed to "enable protection" by a monitor command, and then it connects to the 'known' Rephub.
>> The unique ID would be something like the PID of the VM or the file
>> descriptor of the communication channel to it.
> The PID might be useful - we'll later need to correlate it to the way Rhevm identifies the machine, but not right now...
>>> diff --git a/Makefile b/Makefile
>>>
>>> index 4f6eaa4..a1b3701 100644
>>>
>>> --- a/Makefile
>>>
>>> +++ b/Makefile
>>>
>>> @@ -149,9 +149,9 @@ qemu-img.o qemu-tool.o qemu-nbd.o qemu-io.o cmd.o
>>> qemu-ga.o: $(GENERATED_HEADERS
>>>
>>> tools-obj-y = qemu-tool.o $(oslib-obj-y) $(trace-obj-y) \
>>>
>>>                 qemu-timer-common.o cutils.o
>>>
>>> -qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
>>>
>>> -qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
>>>
>>> -qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
>>>
>>> +qemu-img$(EXESUF): qemu-img.o $(tools-obj-y) $(block-obj-y)
>>> $(replication-obj-y)
>>>
>>> +qemu-nbd$(EXESUF): qemu-nbd.o $(tools-obj-y) $(block-obj-y)
>>> $(replication-obj-y)
>>>
>>> +qemu-io$(EXESUF): qemu-io.o cmd.o $(tools-obj-y) $(block-obj-y)
>>> $(replication-obj-y)
>> $(replication-obj-y) should be included in $(block-obj-y) instead
>>
>>
>>> @@ -2733,6 +2739,7 @@ echo "curl support      $curl"
>>>
>>> echo "check support     $check_utests"
>>>
>>> echo "mingw32 support   $mingw32"
>>>
>>> echo "Audio drivers     $audio_drv_list"
>>>
>>> +echo "Replication          $replication"
>>>
>>> echo "Extra audio cards $audio_card_list"
>>>
>>> echo "Block whitelist   $block_drv_whitelist"
>>>
>>> echo "Mixer emulation   $mixemu"
>> Why do you add it in the middle rather than at the end?
> No reason, I'll change it.
>>
>>> diff --git a/replication/qemu-repagent.txt b/replication/qemu-repagent.txt
>>>
>>> new file mode 100755
>>>
>>> index 0000000..e3b0c1e
>>>
>>> --- /dev/null
>>>
>>> +++ b/replication/qemu-repagent.txt
>>>
>>> @@ -0,0 +1,104 @@
>>>
>>> +             repagent - replication agent - a Qemu module for enabling
>>> continuous async replication of VM volumes
>>>
>>> +
>>>
>>> +Introduction
>>>
>>> +             This document describes a feature in Qemu - a replication
>>> agent (AKA Repagent).
>>>
>>> +             The Repagent is a new module that exposes an API to an
>>> external replication system (AKA Rephub).
>>>
>>> +             This API allows a Rephub to communicate with a Qemu VM and
>>> continuously replicate its volumes.
>>>
>>> +             The implementation of a Rephub is outside of the scope of
>>> this document. There may be several different Rephub
>>>
>>> +             implementations using the same repagent in Qemu.
>>>
>>> +
>>>
>>> +Main feature of Repagent
>>>
>>> +             Repagent does the following:
>>>
>>> +             * Report volumes - report a list of all volumes in a VM to
>>> the Rephub.
>> Does the query-block QMP command give you what you need?
> I'll look into it.
>>> +             * Report writes to a volume - send all writes made to a
>>> protected volume to the Rephub.
>>>
>>> +                             The reporting of an IO is asynchronous -
>>> i.e. the IO is not delayed by the Repagent to get any acknowledgement
>>> from the Rephub.
>>> +                             It is only copied to the Rephub.
>>>
>>> +             * Read a protected volume - allows the Rephub to read a
>>> protected volume, to enable the protected hub to synchronize the content
>>> of a protected volume.
>> We were discussing using NBD as the protocol for any data that is
>> transferred from/to the replication hub, so that we can use the existing
>> NBD client and server code that qemu has. Seems you came to the
>> conclusion to use different protocol? What are the reasons?
> Initially I thought there will have to be more functionality in the agent.
> Now it seems that you're right, and Stefan also pointed out something similar.
> Let me think about how I can get the same functionality with NBD (or iSCSI) server and client.
>>
>> The other message types could possibly be implemented as QMP commands. I
>> guess we might need to attach multiple QMP monitors for this to work
>> (one for libvirt, one for the rephub). I'm not sure if there is a
>> fundamental problem with this or if it just needs to be done.
>>> +
>>>
>>> +Description of the Repagent module
>>>
>>> +
>>>
>>> +Build and run options
>>>
>>> +             New configure option: --enable-replication
>>>
>>> +             New command line option:
>>>
>>> +             -repagent [hub IP/name]
>> You'll probably want a monitor command to enable this at runtime.
> Yep.
>>> +
>>> Enable replication support for disks
>>>
>>> +
>>> hub is the ip or name of the machine running the replication hub.
>>>
>>> +
>>>
>>> +Module APIs
>>>
>>> +             The Repagent module interfaces two main components:
>>>
>>> +             1. The Rephub - An external API based on socket messages
>>>
>>> +             2. The generic block layer- block.c
>>>
>>> +
>>>
>>> +             Rephub message API
>>>
>>> +                             The external replication API is a message
>>> based API.
>>>
>>> +                             We won't go into the structure of the
>>> messages here - just the sematics.
>>>
>>> +
>>>
>>> +                             Messages list
>>>
>>> +                                             (The updated list and
>>> comments are in Rephub_cmds.h)
>>>
>>> +
>>>
>>> +                                             Messages from the Repagent
>>> to the Rephub:
>>>
>>> +                                             * Protected write
>>>
>>> +                                                             The
>>> Repagent sends each write to a protected volume to the hub with the IO
>>> status.
>>>
>>> +                                                             In case
>>> the status is bad the write content is not sent
>>>
>>> +                                             * Report VM volumes
>>>
>>> +                                                             The agent
>>> reports all the volumes of the VM to the hub.
>>>
>>> +                                             * Read Volume Response
>>>
>>> +                                                             A response
>>> to a Read Volume Request
>>>
>>> +                                                             Sends the
>>> data read from a protected volume to the hub
>>>
>>> +                                             * Agent shutdown
>>>
>>> +                                                             Notifies
>>> the hub that the agent is about to shutdown.
>>>
>>> +                                                             This
>>> allows a graceful shutdown. Any disconnection of an agent without
>>>
>>> +                                                             sending
>>> this command will result in a full sync of the VM volumes.
>> What does "full sync" mean, what data is synced with which other place?
>> Is it bad when this happens just because the network is down for a
>> moment, but the VM actually keeps running?
> Full sync means reading the entire volume.
> It is bad when it happens because of a short network outage, but I think that it's a good 'intermediate' step to do so.
> We can first build a system which assumes that the connection between the agent and the Rephub is solid, and on a next stage add a bitmap mechanism in the agent that will optimize it - to overcome outages without full sync.
>>> +
>>>
>>> +                                             Messages from the Rephub
>>> to the Repagent:
>>>
>>> +                                             * Start protect
>>>
>>> +                                                             The hub
>>> instructs the agent to start protecting a volume. When a volume is protected
>>>
>>> +                                                             all its
>>> writes are sent to the hub.
>>>
>>> +                                                             With this
>>> command the hub also assigns a volume ID to the given volume name.
>>>
>>> +                                             * Read volume request
>>>
>>> +                                                             The hub
>>> issues a read IO to a protected volume.
>>>
>>> +                                                             This
>>> command is used during sync - when the hub needs to read unsynchronized
>>>
>>> +                                                             sections
>>> of a protected volume.
>>>
>>> +                                                             This
>>> command is a request, the read data is returned by the read volume
>>> response message (see above).
>>>
>>> +             block.c API
>>>
>>> +                             The API to the generic block storage layer
>>> contains 3 functionalities:
>>>
>>> +                             1. Handle writes to protected volumes
>>>
>>> +                                             In bdrv_co_do_writev, each
>>> write is reported to the Repagent module.
>>>
>>> +                             2. Handle each new volume that registers
>>>
>>> +                                             In bdrv_open - each new
>>> bottom-level block driver that registers is reported.
>> Could probably be a QMP event.
> OK
>>> +                             2. Read from a volume
>>>
>>> +                                             Repagent calls
>>> bdrv_aio_readv to handle read requests coming from the hub.
>>>
>>> +
>>>
>>> +
>>>
>>> +General description of a Rephub  - a replication system the repagent
>>> connects to
>>>
>>> +             This section describes in high level a sample Rephub - a
>>> replication system that uses the repagent API
>>>
>>> +             to replicate disks.
>>>
>>> It describes a simple Rephub that continuously maintains
>>> a mirror of the volumes of a VM.
>>>
>>> +
>>>
>>> +             Say we have a VM we want to protect - call it PVM, say it
>>> has 2 volumes - V1, V2.
>>>
>>> +             Our Rephub is called SingleRephub - a Rephub protecting a
>>> single VM.
>>>
>>> +
>>>
>>> +             Preparations
>>>
>>> +             1. The user chooses a host to run SingleRephub - a
>>> different host than PVM, call it Host2
>>>
>>> +             2. The user creates two volumes on Host2 - same sizes of
>>> V1 and V2, call them V1R (V1 recovery) and V2R.
>>>
>>> +             3. The user runs SingleRephub process on Host2, and gives
>>> V1R and V2R as command line arguments.
>>>
>>> +                             From now on SingleRephub waits for the
>>> protected VM repagent to connect.
>>>
>>> +             4. The user runs the protected VM PVM - and uses the
>>> switch -repagent<Host2 IP>.
>>>
>>> +
>>>
>>> +             Runtime
>>>
>>> +             1. The repagent module connects to SingleRephub on startup.
>>>
>>> +             2. repagent reports V1 and V2 to SingleRephub.
>>>
>>> +             3. SingleRephub starts to perform an initial
>>> synchronization of the protected volumes-
>>>
>>> +                             it reads each protected volume (V1 and V2)
>>> - using read volume requests - and copies the data into the
>>>
>>> +                             recovery volume V1R and V2R.
>> Are you really going to do this on every start of the VM? Comparing the
>> whole content of an image will take quite some time.
> It is done when you first start protecting a volume, not each time a VM boots. A VM can reboot without needing a full sync.
>>
>>> +             4. SingleRephub enters 'protection' mode - each write to
>>> the protected volume is sent by the repagent to the Rephub,
>>>
>>> +                             and the Rephub performs the write on the
>>> matching recovery volume.
>>>
>>> +
>>>
>>> +             * Note that during stage 3 writes to the protected volumes
>>> are not ignored - they're kept in a bitmap,
>>>
>>> +                             and will be read again when stage 3 ends,
>>> in an iterative converging process.
>>>
>>> +
>>>
>>> +             This flow continuously maintains an updated recovery volume.
>>>
>>> +             If the protected system is damaged, the user can create a
>>> new VM on Host2 with the replicated volumes attached to it.
>>>
>>> +             The new VM is a replica of the protected system.
>> Have you meanwhile had the time to take a look at Kemari and check how
>> big the overlap is?
> No. What's Kemari? I'll look it up.

Kemari is a fault tolerance solution for KVM: http://wiki.qemu.org/Features/FaultTolerance

It syncs the guest memory to a remote instance by using the live migration mechanism in QEMU.
As for the image, it assumes shared storage.

The similarity is that the synchronization is done when there is an IO event (not only block IO but also network).
It needs to trap the IO event and delay it till the sync is complete.

The code is based on an older QEMU version without coroutines.
Not sure how much it can help you.

Orit
>>
>> Kevin
> 
> 


* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08 12:03                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
@ 2012-02-08 12:46                     ` Paolo Bonzini
  2012-02-08 14:39                       ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-08 12:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Ori Mamluk, Yair Kuszpet

On 02/08/2012 01:03 PM, Stefan Hajnoczi wrote:
> If you intend to run an unmodified drbd server on the rephub, then it
> may not be possible to get point-in-time backups.  (Although this
> probably depends since things like btrfs or zfs may allow you to get
> back to arbitrary transactions or timestamps.)

I'm not sure what's the overhead, but btrfs copy-on-write (reflinks) may 
help.

> But you could consider drbd as a network protocol and implement your
> own server which speaks the protocol.  Then you can add any
> functionality you like, just like the case with the proprietary rephub
> server you mentioned in your patch.
>
> So the only difference is that instead of using a new custom protocol
> the rephub would need to speak the drbd protocol.

So you're suggesting DRBD-over-NBD on the client, and for the 
replication hub a custom server speaking the DRBD protocol?  I didn't 
find any documentation for DRBD and the code is only in the kernel, so 
this sounds like a lot of work.

What about taking the existing Ceph/RBD driver in QEMU and changing it 
to support arbitrary image formats rather than just raw?  That sounds 
much much easier.  The main advantage is that Ceph has a user-space 
library for use in the replication hub.  It also supports snapshots.

Paolo


* [Qemu-devel] [RFC] Replication agent requirements (was [RFC PATCH] replication agent module)
  2012-02-08 11:02                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
@ 2012-02-08 13:00                     ` Ori Mamluk
  2012-02-08 13:30                       ` Anthony Liguori
  0 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-08 13:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Yair Kuszpet, Paolo Bonzini

Hi,
Following previous mails from Kevin and Dor, I'd like to specify the 
high level requirements of a replication agent as I see them.

1. Report each write to a protected volume to the rephub, at an IO 
transaction granularity
     * The reporting is not synchronous, i.e. the write completion is 
not delayed until the rephub received it.
     * The IOs have to be the raw guest IOs - i.e. not converted to any 
sparse format or another filter that alters the size/offset
2. Report failures to report an IO (e.g. socket disconnect or send 
timeout) or failed IOs (bad status from storage) to rephub
     * It is enough to disconnect the socket - that can be considered a 
'failure report'
3. Enable rephub to read arbitrary regions in the protected volume
     * Assume that rephub can identify IOs which were dropped by the 
replication system, and needs to re-read the data of these IOs.

We'd like to treat the following requirement as a second stage - not to 
implement it in the first version:
4. Synchronously report IO writes meta data (offset, size) to an 
external API
     * Synchronous meaning that it is reported (blocking) before the IO 
is processed by storage.
     * The goal is to maintain a dirty bitmap outside of the Qemu process
     * The tracking needs to be more persistent than the Qemu process. A 
good example for that is to expose an additional process API
         (yet another NBD??) that will hold the bitmap either in 
host RAM or by writing it persistently to storage.

The emphasis on reporting single IO transactions is because high-end 
replication (near-synchronous) requires access to every IO shortly after 
it is written to the storage.
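A rough sketch of how requirements 1 and 2 could fit together, under the assumption that a bounded queue decouples the guest write path from the rephub link (hypothetical classes, not the repagent code): the tap never blocks a guest write, and any reporting failure collapses into a disconnect, which per requirement 2 is itself the failure report.

```python
import queue

# Sketch of requirements 1 and 2: writes are reported to the rephub
# through a bounded queue so the guest write path never blocks; if the
# queue overflows (rephub too slow or link down), the tap disconnects,
# which counts as the failure report. Hypothetical, not the repagent code.

class ReplicationTap:
    def __init__(self, depth=1024):
        self.q = queue.Queue(maxsize=depth)
        self.connected = True

    def on_guest_write(self, offset, data, status_ok=True):
        if not self.connected:
            return
        # Per requirement 2, a failed guest IO is reported without its data.
        item = (offset, data if status_ok else None, status_ok)
        try:
            self.q.put_nowait(item)     # never delay write completion
        except queue.Full:
            self.connected = False      # disconnect == failure report

tap = ReplicationTap(depth=2)
tap.on_guest_write(0, b"x")
tap.on_guest_write(512, b"y", status_ok=False)   # bad IO: no payload sent
tap.on_guest_write(1024, b"z")                   # queue full -> disconnect
```

Requirement 3 (arbitrary reads by the rephub) would then be served separately, e.g. by the NBD reader discussed elsewhere in the thread, and a drained-and-disconnected tap is exactly what would trigger such a re-read.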

Thanks,
Ori


* Re: [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module)
  2012-02-08 10:04                       ` Kevin Wolf
@ 2012-02-08 13:28                         ` Ori Mamluk
  2012-02-08 14:59                           ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-08 13:28 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: dlaor, עודד קדם,
	תומר בן אור,
	qemu-devel, Yair Kuszpet, Paolo Bonzini

Hi,
Thanks for all the valuable inputs provided so far, I'll try to suggest 
a design based on them.
The main inputs were about the use a new transport protocol between 
repagent and rephub.
It was suggested to use some standard network storage protocol instead, 
and use QMP commands for the control path.

The main idea is to use two NBD connections per protected volume:
NBD tap - protected VM is the client, rephub is the server, used to 
report writes.
     The tap is not a standard NBD backend - it is for replication, 
meaning that it is less important than
     the main image path. Errors are not reported to the protected VM as 
IO errors.
NBD reader - protected VM is the server, rephub is the client, used for 
reading the protected volume.
     The NBD reader is a generic remote read capability (write could be 
added too), probably usable for various other needs.
     Actually the reader would probably be more useful as a 
reader/writer, but for the agent only read is required.

Here's a list of the protocol messages from the previous design and how 
they're implemented in this design:
Rephub --> Repagent:
* Start protect
      Will be done via QMP command.
* Read volume request
       Covered by NBD reader

Repagent --> Rephub
* Protected write
      Covered by NBD tap
* Report VM volumes
      Isn't required in the protocol. I assume the management system 
tracks the volumes
* Read Volume Response
      Covered by NBD reader
* Agent shutdown
      Not covered.

The start protect scenario will look something like:
* User calls start protect for a volume
* Mgmt system (e.g. Rhev) sends QMP command to VM - start protect, with 
volume details (path) and an
     IP+port number for the NBD tap
     --> Qemu connects to the NBD tap server
* Mgmt system sends QMP command to VM - start remote reader with volume 
details and port number for NBD reader.
     --> Qemu starts to listen as an NBD server on that port
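To make the flow above concrete, the two QMP commands might look roughly
like the following. This is a Python sketch for illustration only: the
command names ("start-protect", "start-reader") and their argument fields
are assumptions, not existing QEMU commands.

```python
import json

def start_protect_cmd(volume_path, hub_ip, hub_port):
    """QMP command asking QEMU to connect to the rephub's NBD tap.
    Command name and argument layout are illustrative assumptions."""
    return json.dumps({
        "execute": "start-protect",
        "arguments": {
            "volume": volume_path,   # path of the protected volume
            "tap-ip": hub_ip,        # NBD tap server (rephub) address
            "tap-port": hub_port,
        },
    })

def start_reader_cmd(volume_path, listen_port):
    """QMP command asking QEMU to serve the volume as an NBD server,
    so the rephub can connect as an NBD client and read it."""
    return json.dumps({
        "execute": "start-reader",
        "arguments": {
            "volume": volume_path,
            "listen-port": listen_port,
        },
    })
```

In a real deployment the management system would send these over the VM's
QMP socket after the usual capabilities negotiation.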

Issues:
* As far as I understand, NBD requires a socket/port per volume, which 
the management system must allocate. This is a little cumbersome:
     the original design had a single server in the rephub - a single 
port allocation, and a socket per Qemu.

Appreciate any comments and ideas.
Thanks,
Ori

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC] Replication agent requirements (was [RFC PATCH] replication agent module)
  2012-02-08 13:00                     ` [Qemu-devel] [RFC] Replication agent requirements (was [RFC PATCH] replication agent module) Ori Mamluk
@ 2012-02-08 13:30                       ` Anthony Liguori
  0 siblings, 0 replies; 66+ messages in thread
From: Anthony Liguori @ 2012-02-08 13:30 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf, dlaor, Stefan Hajnoczi,
	תומר בן אור,
	qemu-devel, עודד קדם,
	Yair Kuszpet, Paolo Bonzini

On 02/08/2012 07:00 AM, Ori Mamluk wrote:
> Hi,
> Following previous mails from Kevin and Dor, I'd like to specify the high level
> requirements of a replication agent as I see them.
>
> 1. Report each write to a protected volume to the rephub, at an IO transaction
> granularity
> * The reporting is not synchronous, i.e. the write completion is not delayed
> until the rephub received it.
> * The IOs have to be the raw guest IOs - i.e. not converted to any sparse format
> or another filter that alters the size/offset

For now.  I'm sure you'll eventually have a synchronous replication requirement.

We're doomed to reinvent all of the Linux storage layer it seems.  I think we 
really only have two choices: make better use of kernel facilities for this 
(like drbd) or have a proper, pluggable, storage interface so that QEMU proper 
doesn't have to deal with all of this.

Gluster is appealing as a pluggable storage interface although the license is 
problematic for us today.

I'm quite confident that we shouldn't be in the business of replicating storage 
though.  If the answer is NBD++, that's fine too.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08 12:46                     ` Paolo Bonzini
@ 2012-02-08 14:39                       ` Stefan Hajnoczi
  2012-02-08 14:55                         ` Paolo Bonzini
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 14:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Ori Mamluk, Yair Kuszpet

On Wed, Feb 8, 2012 at 12:46 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/08/2012 01:03 PM, Stefan Hajnoczi wrote:
>>
>> If you intend to run an unmodified drbd server on the rephub, then it
>> may not be possible to get point-in-time backups.  (Although this
>> probably depends since things like btrfs or zfs may allow you to get
>> back to arbitrary transactions or timestamps.)
>
>
> I'm not sure what's the overhead, but btrfs copy-on-write (reflinks) may
> help.
>
>
>> But you could consider drbd as a network protocol and implement your
>> own server which speaks the protocol.  Then you can add any
>> functionality you like, just like the case with the proprietary rephub
>> server you mentioned in your patch.
>>
>> So the only difference is that instead of using a new custom protocol
>> the rephub would need to speak the drbd protocol.
>
>
> So you're suggesting DRBD-over-NBD on the client, and for the replication
> hub a custom server speaking the DRBD protocol?  I didn't find any
> documentation for DRBD and the code is only in the kernel, so this sounds
> like a lot of work.

Adding code to QEMU is definitely the easiest solution.  I'm just not
sure whether it's the right one if this will evolve to have the same
features as drbd.

> What about taking the existing Ceph/RBD driver in QEMU and changing it to
> support arbitrary image formats rather than just raw?  That sounds much much
> easier.  The main advantage is that Ceph has a user-space library for use in
> the replication hub.  It also supports snapshots.

I missed how Ceph/RBD helps.  Can you explain how we would use it?

Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08 14:39                       ` Stefan Hajnoczi
@ 2012-02-08 14:55                         ` Paolo Bonzini
  2012-02-08 15:07                           ` Stefan Hajnoczi
  0 siblings, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-08 14:55 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Ori Mamluk, Yair Kuszpet

On 02/08/2012 03:39 PM, Stefan Hajnoczi wrote:
>
>> >  What about taking the existing Ceph/RBD driver in QEMU and changing it to
>> >  support arbitrary image formats rather than just raw?  That sounds much much
>> >  easier.  The main advantage is that Ceph has a user-space library for use in
>> >  the replication hub.  It also supports snapshots.
> I missed how Ceph/RBD helps.  Can you explain how we would use it?

Ceph supports replication, you would just put images in a Ceph store 
rather than in a "normal" filesystem.

Paolo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module)
  2012-02-08 13:28                         ` [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module) Ori Mamluk
@ 2012-02-08 14:59                           ` Stefan Hajnoczi
  2012-02-08 14:59                             ` Stefan Hajnoczi
  2012-02-19 13:40                             ` Ori Mamluk
  0 siblings, 2 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 14:59 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Yair Kuszpet, Paolo Bonzini

On Wed, Feb 8, 2012 at 1:28 PM, Ori Mamluk <omamluk@zerto.com> wrote:
> Hi,
> Thanks for all the valuable inputs provided so far, I'll try to suggest a
> design based on them.
> The main inputs were about the use a new transport protocol between repagent
> and rephub.
> It was suggested to use some standard network storage protocol instead, and
> use QMP commands for the control path.
>
> The main idea is to use two NBD connections per protected volume:
> NBD tap - protected VM is the client, rephub is the server, used to report
> writes.
>    The tap is not a standard NBD backing - it is for replication, meaning
> that its importance is lesser than
>    the main image path. Errors are not reported to the protected VM as IO
> error.

You mentioned a future feature that sends request metadata (offset,
length) to the rephub synchronously so that protection is 100%.
(Otherwise a network failure or crash might result in missed writes
that the rephub does not know about.)

The NBD tap might not be the right channel for sending synchronous
request metadata, since the protocol is geared towards block I/O
requests that include the actual data.  I'm not sure that QMP should
be used either - even though we have the concept of QMP events -
because it's not a low-latency, high ops communications channel.

Which channel do you use in your existing products for synchronous
request metadata?

Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module)
  2012-02-08 14:59                           ` Stefan Hajnoczi
@ 2012-02-08 14:59                             ` Stefan Hajnoczi
  2012-02-19 13:40                             ` Ori Mamluk
  1 sibling, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 14:59 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Yair Kuszpet, Paolo Bonzini

On Wed, Feb 8, 2012 at 2:59 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 1:28 PM, Ori Mamluk <omamluk@zerto.com> wrote:
>> Hi,
>> Thanks for all the valuable inputs provided so far, I'll try to suggest a
>> design based on them.
>> The main inputs were about the use a new transport protocol between repagent
>> and rephub.
>> It was suggested to use some standard network storage protocol instead, and
>> use QMP commands for the control path.
>>
>> The main idea is to use two NBD connections per protected volume:
>> NBD tap - protected VM is the client, rephub is the server, used to report
>> writes.
>>    The tap is not a standard NBD backing - it is for replication, meaning
>> that its importance is lesser than
>>    the main image path. Errors are not reported to the protected VM as IO
>> error.
>
> You mentioned a future feature that sends request metadata (offset,
> length) to the rephub synchronously so that protection is 100%.
> (Otherwise a network failure or crash might result in missed writes
> that the rephub does not know about.)
>
> The NBD tap might not be the right channel for sending synchronous
> request metadata, since the protocol is geared towards block I/O
> requests that include the actual data.  I'm not sure that QMP should
> be used either - even though we have the concept of QMP events -
> because it's not a low-latency, high ops communications channel.
>
> Which channel do you use in your existing products for synchronous
> request metadata?

BTW, your design makes sense to me.

Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-08 14:55                         ` Paolo Bonzini
@ 2012-02-08 15:07                           ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-08 15:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Ori Mamluk, Yair Kuszpet

On Wed, Feb 8, 2012 at 2:55 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/08/2012 03:39 PM, Stefan Hajnoczi wrote:
>>
>>
>>> >  What about taking the existing Ceph/RBD driver in QEMU and changing it
>>> > to
>>> >  support arbitrary image formats rather than just raw?  That sounds
>>> > much much
>>> >  easier.  The main advantage is that Ceph has a user-space library for
>>> > use in
>>> >  the replication hub.  It also supports snapshots.
>>
>> I missed how Ceph/RBD helps.  Can you explain how we would use it?
>
>
> Ceph supports replication, you would just put images in a Ceph store rather
> than in a "normal" filesystem.

I don't think that meets the need to replicate guest I/Os before
they've been sliced and diced by an image format.

Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module)
  2012-02-08 14:59                           ` Stefan Hajnoczi
  2012-02-08 14:59                             ` Stefan Hajnoczi
@ 2012-02-19 13:40                             ` Ori Mamluk
  2012-02-20 14:32                               ` Paolo Bonzini
  1 sibling, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-19 13:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Yair Kuszpet, Paolo Bonzini

On 08/02/2012 16:59, Stefan Hajnoczi wrote:
> On Wed, Feb 8, 2012 at 1:28 PM, Ori Mamluk<omamluk@zerto.com>  wrote:
> You mentioned a future feature that sends request metadata (offset,
> length) to the rephub synchronously so that protection is 100%.
> (Otherwise a network failure or crash might result in missed writes
> that the rephub does not know about.)
>
> The NBD tap might not be the right channel for sending synchronous
> request metadata, since the protocol is geared towards block I/O
> requests that include the actual data.  I'm not sure that QMP should
> be used either - even though we have the concept of QMP events -
> because it's not a low-latency, high ops communications channel.
>
> Which channel do you use in your existing products for synchronous
> request metadata?
>
> Stefan

Looking a little deeper into the NBD solution, I see another 
problematic angle.
Assuming Rhev is managing the system - it will need to allocate a port 
per volume on the host.
I don't see a clean way to do it.
Also, the idea of opening 3 process-external APIs for the replication 
(NBD client, NBD server, meta-data tap) doesn't feel right to me.

Going back to Anthony's older mail :
> We're doomed to reinvent all of the Linux storage layer it seems.  I 
> think we really only have two choices: make better use of kernel 
> facilities for this (like drbd) or have a proper, pluggable, storage 
> interface so that QEMU proper doesn't have to deal with all of this.
>
> Gluster is appealing as a pluggable storage interface although the 
> license is problematic for us today.
>
> I'm quite confident that we shouldn't be in the business of 
> replicating storage though.  If the answer is NBD++, that's fine too. 

I think it might be better to go back to my original less generic design.
We can regard it as a 'plugin' for a specific application - in this 
case, replication.
I can add a plugin interface in the generic block layer that allows 
building a proper storage stack.
The plugin will have filter-driver-like capabilities - getting hold of 
the request on its way down (from VM to storage) and on its way up (IO 
completion), and being able to block or stall either.

As for the plugin mechanism - it's clear to me that a dynamically 
loaded plugin is out of the question. It could be a compile-time 
convention - for example a 'plugins' directory under block/ containing 
the plugin code, enabled by command line or QMP commands.
This way we create a separation between the Qemu code and the storage 
filters.

The downside is that the plugin code tends to be less generic and 
reusable.
The advantage is that by separating them we don't complicate the Qemu 
storage stack code with application-specific requirements.

How about it?

Ori.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module)
  2012-02-19 13:40                             ` Ori Mamluk
@ 2012-02-20 14:32                               ` Paolo Bonzini
  2012-02-21  9:03                                 ` [Qemu-devel] BlockDriverState stack and BlockListeners (was: [RFC] Replication agent design) Kevin Wolf
  0 siblings, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-20 14:32 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	Stefan Hajnoczi, dlaor, qemu-devel,
	עודד קדם,
	Yair Kuszpet

On 02/19/2012 02:40 PM, Ori Mamluk wrote:
> 
> I think it might be better to go back to my original less generic design.
> We can regard it as a 'plugin' for a specific application - in this
> case, replication.
> I can add a plugin interface in the generic block layer that allows
> building a proper storage stack.
> The plugin will have capabilities like a filter driver - getting hold of
> the request on its way down (from VM to storage) and on its way up (IO
> completion), allowing to block or stall both.

Stefan and I talked about this recently... we called it a BlockListener.
 It seems like a good idea, and copy-on-read should probably be
converted to a BlockListener in due time, too.

Paolo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [Qemu-devel] BlockDriverState stack and BlockListeners (was: [RFC] Replication agent design)
  2012-02-20 14:32                               ` Paolo Bonzini
@ 2012-02-21  9:03                                 ` Kevin Wolf
  2012-02-21  9:15                                   ` [Qemu-devel] BlockDriverState stack and BlockListeners Paolo Bonzini
  2012-02-29  8:38                                   ` Ori Mamluk
  0 siblings, 2 replies; 66+ messages in thread
From: Kevin Wolf @ 2012-02-21  9:03 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: תומר בן אור,
	Stefan Hajnoczi, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

Am 20.02.2012 15:32, schrieb Paolo Bonzini:
> On 02/19/2012 02:40 PM, Ori Mamluk wrote:
>>
>> I think it might be better to go back to my original less generic design.
>> We can regard it as a 'plugin' for a specific application - in this
>> case, replication.
>> I can add a plugin interface in the generic block layer that allows
>> building a proper storage stack.
>> The plugin will have capabilities like a filter driver - getting hold of
>> the request on its way down (from VM to storage) and on its way up (IO
>> completion), allowing to block or stall both.
> 
> I and Stefan talked about this recently... we called it a BlockListener.
>  It seems like a good idea, and probably copy-on-read should be
> converted in due to time to a BlockListener, too.

After thinking a bit about it, I tend to agree. However, I wouldn't call
it a BlockListener because it could do much more than just observing
requests, it can modify them. Basically it would take a request and do
anything with it. It could enqueue the request and do nothing for the
moment (I/O throttling), it could use a different buffer and do copy on
read, it could mirror writes, etc.

So let's check which features could make use of it:

- Copy on read
- I/O throttling
- blkmirror for precopy storage migration
- replication agent
- Old style block migration (btw, we should deprecate this)
- Maybe even bdrv_check_request and high watermark? However, they are
  not optional, so it probably makes less sense.

I think these are enough cases to justify it. Now, which operations do
we need to intercept?

- bdrv_co_read
- bdrv_co_write
- bdrv_drain (btw, we need a version for only one BDS)
- Probably bdrv_co_discard as well

Anything I missed? Now the interesting question that comes to mind is:
What is really the difference between the proposed BlockListener and a
BlockDriver? Sure, a listener would implement much less functionality,
but we also have BlockDrivers today that implement very few of the
callbacks.

A bdrv_drain callback doesn't exist yet in BlockDrivers, but I consider
this a bug (qemu_aio_flush() is really the implementation for raw-posix
and possibly some network protocols), so we should just add this to
BlockDriver.

The main difference that I see is that the listeners stay always on top.
For example, let's assume that if implemented a blkmirror driver in
today's infrastructure, you would get a BlockDriverState stack like
blkmirror -> qcow2 -> file. If you take a live snapshot now, you don't
want to have the blkmirror applied to the old top-level image, which is
now a read-only backing file. Instead, it should move to the new
top-level image. I believe this is similar with I/O throttling, to some
degree with copy on read, etc.

So maybe we just need to extend the current BlockDriverState stack to
distinguish "normal" and "always on top" BlockDrivers, where the latter
would roughly correspond to BlockListeners?

Kevin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21  9:03                                 ` [Qemu-devel] BlockDriverState stack and BlockListeners (was: [RFC] Replication agent design) Kevin Wolf
@ 2012-02-21  9:15                                   ` Paolo Bonzini
  2012-02-21  9:49                                     ` Kevin Wolf
  2012-02-29  8:38                                   ` Ori Mamluk
  1 sibling, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-21  9:15 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	Stefan Hajnoczi, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

On 02/21/2012 10:03 AM, Kevin Wolf wrote:
> So let's check which features could make use of it:
> 
> - Copy on read
> - I/O throttling
> - blkmirror for precopy storage migration
> - replication agent
> - Old style block migration

More precisely, dirty bitmap handling.

> (btw, we should deprecate this)

Yeah, but we need blkmirror to provide an alternative.

> - Maybe even bdrv_check_request and high watermark? However, they are
>   not optional, so probably makes less sense.
> 
> I think these are enough cases to justify it. Now, which operations do
> we need to intercept?
> 
> - bdrv_co_read
> - bdrv_co_write
> - bdrv_drain (btw, we need a version for only one BDS)
> - Probably bdrv_co_discard as well

Yes.

> Anything I missed?

bdrv_co_flush, I think.

> Now the interesting question that comes to mind is:
> What is really the difference between the proposed BlockListener and a
> BlockDriver? Sure, a listener would implement much less functionality,
> but we also have BlockDrivers today that implement very few of the
> callbacks.

The two differences that come to mind are:

1) BlockListener would be coroutine-based by design; I know we disagree
on this (you want to change raw to coroutines long term; I would like to
reintroduce some AIO fastpaths when there are no active listeners).

2) BlockListener would be entirely an implementation detail, used in the
implementation of other commands.

Third, perhaps the interface to BlockListener could be
bdrv_before/after_read.  Cannot really say without writing one or two
BlockListeners whether this would be useful or a limitation.

> The main difference that I see is that the listeners stay always on top.
> For example, let's assume that if implemented a blkmirror driver in
> today's infrastructure, you would get a BlockDriverState stack like
> blkmirror -> qcow2 -> file. If you take a live snapshot now, you don't
> want to have the blkmirror applied to the old top-level image, which is
> now a read-only backing file. Instead, it should move to the new
> top-level image.

Yes, that's because a BlockListener always applies to a
BlockDriverState, and live snapshots close+reopen the BDS but do not
delete/recreate it.

> So maybe we just need to extend the current BlockDriverState stack to
> distinguish "normal" and "always on top" BlockDrivers, where the latter
> would roughly correspond to BlockListeners?

I would prefer having separate objects.  Even if you do not count fields
related to throttling or copy-on-read or other tasks in the list above,
there are many fields in BDS that do not really apply to BlockListeners.
 Backing files, device ops, encryption, even size.  Having extra methods
is not a big problem, but unwanted data items smell...

Paolo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21  9:15                                   ` [Qemu-devel] BlockDriverState stack and BlockListeners Paolo Bonzini
@ 2012-02-21  9:49                                     ` Kevin Wolf
  2012-02-21 10:09                                       ` Paolo Bonzini
  2012-02-21 10:20                                       ` Ori Mamluk
  0 siblings, 2 replies; 66+ messages in thread
From: Kevin Wolf @ 2012-02-21  9:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: תומר בן אור,
	Stefan Hajnoczi, Jeff Cody, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

Am 21.02.2012 10:15, schrieb Paolo Bonzini:
> On 02/21/2012 10:03 AM, Kevin Wolf wrote:
>> So let's check which features could make use of it:
>>
>> - Copy on read
>> - I/O throttling
>> - blkmirror for precopy storage migration
>> - replication agent
>> - Old style block migration
> 
> More precisely, dirty bitmap handling.

Yes, but is it used anywhere else?

>> (btw, we should deprecate this)
> 
> Yeah, but we need blkmirror to provide an alternative.

Does block migration even work these days without corrupting things left
and right?

>> - Maybe even bdrv_check_request and high watermark? However, they are
>>   not optional, so probably makes less sense.
>>
>> I think these are enough cases to justify it. Now, which operations do
>> we need to intercept?
>>
>> - bdrv_co_read
>> - bdrv_co_write
>> - bdrv_drain (btw, we need a version for only one BDS)
>> - Probably bdrv_co_discard as well
> 
> Yes.
> 
>> Anything I missed?
> 
> bdrv_co_flush, I think.

Right, we'll need bdrv_co_flush as well.

>> Now the interesting question that comes to mind is:
>> What is really the difference between the proposed BlockListener and a
>> BlockDriver? Sure, a listener would implement much less functionality,
>> but we also have BlockDrivers today that implement very few of the
>> callbacks.
> 
> The two differences that come to mind are:
> 
> 1) BlockListener would be by-design coroutine based; I know we disagree
> on this (you want to change raw to coroutines long term; I would like to
> reintroduce some AIO fastpaths when there are no active listeners).

I can't see a technical reason why a BlockListener could not be callback
based. The only reason might be a "there are only coroutines" policy.

But even then, just don't implement bdrv_aio_* like all other
coroutine-based block drivers.

> 2) BlockListener would be entirely an implementation detail, used in the
> implementation of other commands.

Depending on what you mean by command (presumably QMP commands?), I
think I disagree. Management tools will want to start a VM with
BlockListeners already applied (which would be possible via -blockdev).

And on the other hand, protocols like file are entirely implementation
details as well, and still they are BlockDrivers.

> Third, perhaps the interface to BlockListener could be
> bdrv_before/after_read.  Cannot really say without writing one or two
> BlockListeners whether this would be useful or a limitation.

/* bdrv_before code here */
ret = bdrv_co_read(bs->file, ...);
/* bdrv_after code here */

Should be pretty much equivalent, no?
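The difference between the two shapes can be seen in a toy sketch
(illustrative Python, not QEMU code): a before/after hook pair is a
special case of a wrapping callback, but only the wrapping form can
substitute the result or skip the lower layer entirely, as copy on
read needs to.

```python
def read_with_hooks(before, after, lower_read, offset, nbytes):
    """The bdrv_before/after_read style: fixed hook points around
    the lower layer's read; the hooks cannot change the outcome."""
    before(offset, nbytes)
    ret = lower_read(offset, nbytes)
    after(offset, nbytes, ret)
    return ret


def copy_on_read(cache, lower_read, offset, nbytes):
    """What a full wrapping callback can do that a hook pair cannot:
    serve the request from a cache and never touch the lower layer."""
    if offset in cache:
        return cache[offset]
    data = lower_read(offset, nbytes)
    cache[offset] = data
    return data
```

So the wrapping form is strictly more expressive, at the cost of the
listener having to call down explicitly.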

>> The main difference that I see is that the listeners stay always on top.
>> For example, let's assume that if implemented a blkmirror driver in
>> today's infrastructure, you would get a BlockDriverState stack like
>> blkmirror -> qcow2 -> file. If you take a live snapshot now, you don't
>> want to have the blkmirror applied to the old top-level image, which is
>> now a read-only backing file. Instead, it should move to the new
>> top-level image.
> 
> Yes, that's because a BlockListener always applies to a
> BlockDriverState, and live snapshots close+reopen the BDS but do not
> delete/recreate it.

Hm, that's an interesting angle to look at it. The reasoning makes sense
to me (though I would reword it as a BlockListener belongs to a
drive/block device rather than a BDS, which is an implementation detail).

However, live snapshots can't close and reopen the BDS in the future,
because the reopen could fail and you must not have closed the old image
yet in this case. So what Jeff and I were looking into recently is to
change this so that new top-level images are opened first without
backing file and then the backing file relationship is created with the
existing BDS.

Of course, we stumbled across the thing that you're telling me here,
that devices refer to the same BDS as before. So their content must be
swapped, but some data like the device name (and now BlockListeners)
stay with the top-level image instead.

Which in turn reminds me of a discussion I had with Stefan a while ago,
where we came to the conclusion that we need to separate the
representation of an image file and a "view" of it which represents a
block device (as a whole). One of the reasons then was that one qcow2
image could offer multiple views (one r/w view and for each snapshot a
r/o one). I think the separation that this would require might actually
be the same as the separation of things that stay top-level and that
belong to the image file.

Isn't it cool how everything is connected with everything? :-)

>> So maybe we just need to extend the current BlockDriverState stack to
>> distinguish "normal" and "always on top" BlockDrivers, where the latter
>> would roughly correspond to BlockListeners?
> 
> I would prefer having separate objects.  Even if you do not count fields
> related to throttling or copy-on-read or other tasks in the list above,
> there are many fields in BDS that do not really apply to BlockListeners.
>  Backing files, device ops, encryption, even size.  Having extra methods
> is not a big problem, but unwanted data items smell...

Most other block drivers use only a few of them. We can try to clean up
some of them (and making the separation described above would probably
help with it), but BlockListeners aren't really different here from
existing drivers.

Kevin

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21  9:49                                     ` Kevin Wolf
@ 2012-02-21 10:09                                       ` Paolo Bonzini
  2012-02-21 10:51                                         ` Kevin Wolf
  2012-02-21 10:20                                       ` Ori Mamluk
  1 sibling, 1 reply; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-21 10:09 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	Stefan Hajnoczi, Jeff Cody, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

On 02/21/2012 10:49 AM, Kevin Wolf wrote:
> Am 21.02.2012 10:15, schrieb Paolo Bonzini:
>> On 02/21/2012 10:03 AM, Kevin Wolf wrote:
>>> - Old style block migration
>>
>> More precisely, dirty bitmap handling.
> 
> Yes, but is it used anywhere else?

No, just nitpicking.

>>> (btw, we should deprecate this)
>>
>> Yeah, but we need blkmirror to provide an alternative.
> 
> Does block migration even work meanwhile without corrupting things left
> and right?

No idea.  Somebody must have used it at some point. :)

>> 1) BlockListener would be by-design coroutine based; I know we disagree
>> on this (you want to change raw to coroutines long term; I would like to
>> reintroduce some AIO fastpaths when there are no active listeners).
> 
> I can't see a technical reason why a BlockListener could not be callback
> based. The only reason might be a "there are only coroutines" policy.

Not a technical reason, more of a sanity reason.  Coroutines were
introduced to allow a number of the things in the list.

>> 2) BlockListener would be entirely an implementation detail, used in the
>> implementation of other commands.
> 
> Depending on what you mean by command (presumably QMP commands?),

QMP commands, command-line (-drive), whatever.

> And on the other hand, protocols like file are entirely implementation
> details as well, and still they are BlockDrivers.

True.  Formats and protocols also do not have perfectly overlapping
functionality (backing images are only a format thing, for example).

>>> The main difference that I see is that the listeners stay always on top.
>>> For example, let's assume that if implemented a blkmirror driver in
>>> today's infrastructure, you would get a BlockDriverState stack like
>>> blkmirror -> qcow2 -> file. If you take a live snapshot now, you don't
>>> want to have the blkmirror applied to the old top-level image, which is
>>> now a read-only backing file. Instead, it should move to the new
>>> top-level image.
>>
>> Yes, that's because a BlockListener always applies to a
>> BlockDriverState, and live snapshots close+reopen the BDS but do not
>> delete/recreate it.
> 
> Hm, that's an interesting angle to look at it. The reasoning makes sense
> to me (though I would reword it as a BlockListener belongs to a
> drive/block device rather than a BDS, which is an implementation detail).

Yes.

> However, live snapshots can't close and reopen the BDS in the future,
> because the reopen could fail and you must not have closed the old image
> yet in this case. So what Jeff and I were looking into recently is to
> change this so that new top-level images are opened first without
> backing file and then the backing file relationship is created with the
> existing BDS.
> 
> Of course, we stumbled across the thing that you're telling me here,
> that devices refer to the same BDS as before. So their content must be
> swapped, but some data like the device name (and now BlockListeners)
> stay with the top-level image instead.
> 
> Which in turn reminds me of a discussion I had with Stefan a while ago,
> where we came to the conclusion that we need to separate the
> representation of an image file and a "view" of it which represents a
> block device (as a whole). [...]
> 
> Isn't it cool how everything is connected with everything? :-)

:)

So we'd have:

- Protocols -- Protocols tell you where to get the raw bits.
- Formats -- Formats transform those raw bits into a block device.
- Views -- Views can move from one format to another.  A format can use a
default view implementation, or provide its own (e.g. to access
different snapshots).
- Listeners -- I think a view can have-a listener?

with the following relationship:

- A format has-a protocol for the raw bits and has-a view for the
backing file.
- A view has-a format, a device has-a view.
- A view can have-a listener?  Or is it formats?

But I think we're digressing...

>>> So maybe we just need to extend the current BlockDriverState stack to
>>> distinguish "normal" and "always on top" BlockDrivers, where the latter
>>> would roughly correspond to BlockListeners?
>>
>> I would prefer having separate objects.  Even if you do not count fields
>> related to throttling or copy-on-read or other tasks in the list above,
>> there are many fields in BDS that do not really apply to BlockListeners.
>>  Backing files, device ops, encryption, even size.  Having extra methods
>> is not a big problem, but unwanted data items smell...
> 
> Most other block drivers use only little of them. We can try to clean up
> some of them (and making the separation described above would probably
> help with it), but BlockListeners aren't really different here from
> existing drivers.

True, the question only matters insofar as having a separate data
structure simplifies the design.  ("Simplify" means "we can actually
understand it and be reasonably sure that it's correct and implementable").

Paolo

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21  9:49                                     ` Kevin Wolf
  2012-02-21 10:09                                       ` Paolo Bonzini
@ 2012-02-21 10:20                                       ` Ori Mamluk
  1 sibling, 0 replies; 66+ messages in thread
From: Ori Mamluk @ 2012-02-21 10:20 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Stefan Hajnoczi, Yair Kuszpet,
	Paolo Bonzini

On 21/02/2012 11:49, Kevin Wolf wrote:
> Am 21.02.2012 10:15, schrieb Paolo Bonzini:
>>> So maybe we just need to extend the current BlockDriverState stack to
>>> distinguish "normal" and "always on top" BlockDrivers, where the latter
>>> would roughly correspond to BlockListeners?
>> I would prefer having separate objects.  Even if you do not count fields
>> related to throttling or copy-on-read or other tasks in the list above,
>> there are many fields in BDS that do not really apply to BlockListeners.
>>   Backing files, device ops, encryption, even size.  Having extra methods
>> is not a big problem, but unwanted data items smell...
> Most other block drivers use only little of them. We can try to clean up
> some of them (and making the separation described above would probably
> help with it), but BlockListeners aren't really different here from
> existing drivers.
My impression as an outside observer was similar - it appears as though 
the block driver object contains stuff that belongs to the specific 
driver (e.g. bitmap).

An additional capability that I need in the replication filter is for 
the driver to initiate a new IO. It means that if we have a stack of 
drivers guest->bdrv1->bdrv2->bdrv3->image, then bdrv2 should be able to 
invoke a read - which will be processed only by the deeper parts of the 
stack - i.e. bdrv3 and below.
Does that make sense?

On the question of upper/lower drivers, I tend to think that there are 
two kinds - those that need the original guest IO transactions and those 
that transform them. Maybe two separate driver stacks (upper and lower) 
are enough to implement this difference.
> Kevin


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 10:09                                       ` Paolo Bonzini
@ 2012-02-21 10:51                                         ` Kevin Wolf
  2012-02-21 11:36                                           ` Paolo Bonzini
  0 siblings, 1 reply; 66+ messages in thread
From: Kevin Wolf @ 2012-02-21 10:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: תומר בן אור,
	Stefan Hajnoczi, Jeff Cody, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

Am 21.02.2012 11:09, schrieb Paolo Bonzini:
>>> 2) BlockListener would be entirely an implementation detail, used in the
>>> implementation of other commands.
>>
>> Depending on what you mean by command (presumably QMP commands?),
> 
> QMP commands, command-line (-drive), whatever.
> 
>> And on the other hand, protocols like file are entirely implementation
>> details as well, and still they are BlockDrivers.
> 
> True.  Formats and protocols also do not have perfectly overlapping
> functionality (backing images are only a format thing, for example).

And even protocols and protocols don't. Compare file to blkdebug, for
example. In fact, blkdebug and blkverify are already very close to what
BlockListeners would be.

>>>> The main difference that I see is that the listeners stay always on top.
>>>> For example, let's assume that if implemented a blkmirror driver in
>>>> today's infrastructure, you would get a BlockDriverState stack like
>>>> blkmirror -> qcow2 -> file. If you take a live snapshot now, you don't
>>>> want to have the blkmirror applied to the old top-level image, which is
>>>> now a read-only backing file. Instead, it should move to the new
>>>> top-level image.
>>>
>>> Yes, that's because a BlockListener always applies to a
>>> BlockDriverState, and live snapshots close+reopen the BDS but do not
>>> delete/recreate it.
>>
>> Hm, that's an interesting angle to look at it. The reasoning makes sense
>> to me (though I would reword it as a BlockListener belongs to a
>> drive/block device rather than a BDS, which is an implementation detail).
> 
> Yes.
> 
>> However, live snapshots can't close and reopen the BDS in the future,
>> because the reopen could fail and you must not have closed the old image
>> yet in this case. So what Jeff and I were looking into recently is to
>> change this so that new top-level images are opened first without
>> backing file and then the backing file relationship is created with the
>> existing BDS.
>>
>> Of course, we stumbled across the thing that you're telling me here,
>> that devices refer to the same BDS as before. So their content must be
>> swapped, but some data like the device name (and now BlockListeners)
>> stay with the top-level image instead.
>>
>> Which in turn reminds me of a discussion I had with Stefan a while ago,
>> where we came to the conclusion that we need to separate the
>> representation of an image file and a "view" of it which represents a
>> block device (as a whole). [...]
>>
>> Isn't it cool how everything is connected with everything? :-)
> 
> :)
> 
> So we'd have:
> 
> - Protocols -- Protocols tell you where to get the raw bits.
> - Formats -- Formats transform those raw bits into a block device.
> - Views -- Views can move from a format to another.  A format can use a
> default view implementation, or provide its own (e.g. to access
> different snapshots).
> - Listeners -- I think a view can have-a listener?

Where protocols, formats and listeners are-a image (Not the best name,
I'm open for suggestions. Something like "BDS stack building block"
would be most accurate...). Or actually not is-a in terms of
inheritance, but I think it would be the very same thing without any
subclassing, implemented by a BlockDriver.

> with the following relationship:
> 
> - A format has-a protocol for the raw bits and has-a view for the
> backing file.

An image has-a image from which it takes its data (bs->file). And it
has-a view for the backing file, yes. Both could be a listener.

> - A view has-a format, a device has-a view.
> - A view can have-a listener?  Or is it formats?

A view has-a image. This may happen to be a listener, which in turn
has-a image (could be another listener, a format or a protocol).

The question is what the semantics is with live snapshots (there are
probably similar problems, but this is the obvious one). For example we
could now have:

mirror -> qcow2 -> blkdebug -> file

There are two listeners here, mirror and blkdebug. (Things like blkdebug
are why view has-a listener isn't enough). After creating an external
snapshot, we expect the graph to look like this (the arrow down is the
backing file):

mirror -> qcow2 -> file
            |
            +-> qcow2 -> blkdebug -> file

The question is: Can we assume that any listeners that are on top of the
first format or protocol (i.e. those that would fit your model) should
move to the new top-level view? Or would it sometimes make sense to keep
it at the old one?

> But I think we're digressing...

No, in fact I think we need to get an idea of what we want to have in
the end. And we need to do it soon, almost any new topic that's coming
up ends up in a discussion about the same shortcomings of the block layer.

It doesn't make sense to hack in more and more stuff without having
proper infrastructure for it.

>>>> So maybe we just need to extend the current BlockDriverState stack to
>>>> distinguish "normal" and "always on top" BlockDrivers, where the latter
>>>> would roughly correspond to BlockListeners?
>>>
>>> I would prefer having separate objects.  Even if you do not count fields
>>> related to throttling or copy-on-read or other tasks in the list above,
>>> there are many fields in BDS that do not really apply to BlockListeners.
>>>  Backing files, device ops, encryption, even size.  Having extra methods
>>> is not a big problem, but unwanted data items smell...
>>
>> Most other block drivers use only little of them. We can try to clean up
>> some of them (and making the separation described above would probably
>> help with it), but BlockListeners aren't really different here from
>> existing drivers.
> 
> True, the question only matters insofar as having a separate data
> structure simplifies the design.  ("Simplify" means "we can actually
> understand it and be reasonably sure that it's correct and implementable").

Having only one kind of building block for the block driver graph is a
simplification, too. Or at least one common base class.

Kevin


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 10:51                                         ` Kevin Wolf
@ 2012-02-21 11:36                                           ` Paolo Bonzini
  2012-02-21 12:22                                             ` Stefan Hajnoczi
  2012-02-21 13:10                                             ` Kevin Wolf
  0 siblings, 2 replies; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-21 11:36 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	Stefan Hajnoczi, Jeff Cody, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

On 02/21/2012 11:51 AM, Kevin Wolf wrote:
> And even protocols and protocols don't. Compare file to blkdebug, for
> example. In fact, blkdebug and blkverify are already very close to what
> BlockListeners would be.

Yes, and I think considering blkdebug and blkverify helps in the design.
They provide the difference between views and listeners: listeners can
be applied to both a protocol and a view, while views can only be
applied to a format.

> > - Protocols -- Protocols tell you where to get the raw bits.
> > - Formats -- Formats transform those raw bits into a block device.
> > - Views -- Views can move from a format to another.  A format can use a
> > default view implementation, or provide its own (e.g. to access
> > different snapshots).
> > - Listeners -- I think a view can have-a listener?
> 
> Where protocols, formats and listeners are-a image (Not the best name,
> I'm open for suggestions. Something like "BDS stack building block"
> would be most accurate...). Or actually not is-a in terms of
> inheritance, but I think it would be the very same thing without any
> subclassing, implemented by a BlockDriver.
> 
> > with the following relationship:
> > 
> > - A format has-a protocol for the raw bits and has-a view for the
> > backing file.
> 
> An image has-a image from which it takes its data (bs->file).

No. A protocol has neither an image below it, nor a backing file.  I
think a view has no backing file either (except as provided by the
format).  I'm not sure that a listener can have a backing file, either;
you could say that blkverify has one, but it could just as well have
three or four, so it's not the same thing.

A format does not do much more than create views, do snapshots, and hold
state that is common to all of its views.

So, let's put aside listeners for a moment, and consider this
alternative hierarchy:

BlockDriver
    Protocol
        FileProtocol
        ...
    View
        QCOW2View
        RawView
        ...
BlockFormat
    QCOW2Format
    RawFormat
    ...

Now we have to figure out how to fit listeners in this picture.

> And it has-a view for the backing file, yes. Both could be a listener.
> 
>> > - A view has-a format, a device has-a view.
>> > - A view can have-a listener?  Or is it formats?
> A view has-a image. This may happen to be a listener, which in turn
> has-a image (could be another listener, a format or a protocol).
> 
> The question is what the semantics is with live snapshots (there are
> probably similar problems, but this is the obvious one). For example we
> could now have:
> 
> mirror -> qcow2 -> blkdebug -> file

Let's be precise here:

mirror -> currentSnapshot -> qcow2 -> blkdebug -> file

- file is a protocol.

- blkdebug is a listener

- qcow2 is a format

- currentSnapshot is a view

- mirror is a listener

The difference between blkdebug/mirror on one side, and currentSnapshot
on the other, is that (as you said) blkdebug/mirror are always stacked
on top of something else. A driver that references currentSnapshot
actually gets mirror.

So we get to the actual hierarchy I'm proposing:

BlockDriver
    BlockSource (see below)
    Protocol (bdrv_file_open)
        FileProtocol
        ...
    View (has-a BlockFormat)
        QCOW2View
        RawView
        ...
    BlockListener (has-a BlockDriver)
        MirrorListener
        BlkDebugListener
BlockFormat (has-a BlockSource)
    QCOW2Format
    RawFormat
    ...

Protocols and views are only internal.  Formats and devices in practice
will only ever see BlockSources.  A BlockSource is a reference to a stack of
BlockDrivers, where the base must be a protocol or view and there can
be a number of listeners stacked on top of it.  Listeners can be
added or removed from the stack, and the bottom driver can be swapped
for another (for snapshots).

So, here is how it would look:

  .== BlockSource ==.                   .== BlockSource ===.
  | MirrorListener  |                   | BlkDebugListener |
  | QCOW2View ------+--> QCOW2Format -> | FileProtocol     |
  '================='                   '=================='


> There are two listeners here, mirror and blkdebug. (Things like blkdebug
> are why view has-a listener isn't enough). After creating an external
> snapshot, we expect the graph to look like this (the arrow down is the
> backing file):
> 
> mirror -> qcow2 -> file
>             |
>             +-> qcow2 -> blkdebug -> file

And here:

  .== BlockSource ==.
  | MirrorListener  |                   .== BlockSource ==.
  | QCOW2View ------+--> QCOW2Format -> | FileProtocol    |
  '================='    |              '================='
                         |                                          .== BlockSource ===.
                         |   .== BlockSource ==.                    | BlkDebugListener |
                         '-> | QCOW2View ------+--> QCOW2Format --> | FileProtocol     |
                             '================='                    '=================='

Does it seem sane?

> The question is: Can we assume that any listeners that are on top of the
> first format or protocol (i.e. those that would fit your model) should
> move to the new top-level view? Or would it sometimes make sense to keep
> it at the old one?

I think it depends, but both possibilities should be doable in this model.

Paolo


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 11:36                                           ` Paolo Bonzini
@ 2012-02-21 12:22                                             ` Stefan Hajnoczi
  2012-02-21 12:57                                               ` Paolo Bonzini
  2012-02-21 15:49                                               ` Markus Armbruster
  2012-02-21 13:10                                             ` Kevin Wolf
  1 sibling, 2 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-21 12:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Ori Mamluk, Yair Kuszpet

On Tue, Feb 21, 2012 at 11:36 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 02/21/2012 11:51 AM, Kevin Wolf wrote:
>> And even protocols and protocols don't. Compare file to blkdebug, for
>> example. In fact, blkdebug and blkverify are already very close to what
>> BlockListeners would be.
>
> Yes, and I think considering blkdebug and blkverify help in the design.
> They provide the difference between views and listeners: listeners can
> be applied to both a protocol and a view, while views can only be
> applied to a format.
>
>> > - Protocols -- Protocols tell you where to get the raw bits.
>> > - Formats -- Formats transform those raw bits into a block device.
>> > - Views -- Views can move from a format to another.  A format can use a
>> > default view implementation, or provide its own (e.g. to access
>> > different snapshots).
>> > - Listeners -- I think a view can have-a listener?
>>
>> Where protocols, formats and listeners are-a image (Not the best name,
>> I'm open for suggestions. Something like "BDS stack building block"
>> would be most accurate...). Or actually not is-a in terms of
>> inheritance, but I think it would be the very same thing without any
>> subclassing, implemented by a BlockDriver.
>>
>> > with the following relationship:
>> >
>> > - A format has-a protocol for the raw bits and has-a view for the
>> > backing file.
>>
>> An image has-a image from which it takes its data (bs->file).
>
> No. A protocol has neither an image below it, nor a backing file.  I
> think a view has no backing file either (except as provided by the
> format).  I'm not sure that a listener can have a backing file, either;
> you could say that blkverify has one, but it could just as well have
> three or four, so it's not the same thing.
>
> A format does not do much more than create views, do snapshots, and hold
> state that is common to all of its views.
>
> So, let's put aside listeners for a moment, and consider this
> alternative hierarchy:
>
> BlockDriver
>    Protocol
>        FileProtocol
>        ...
>    View
>        QCOW2View
>        RawView
>        ...
> BlockFormat
>    QCOW2Format
>    RawFormat
>    ...
>
> Now we have to figure out how to fit listeners in this picture.
>
>> And it has-a view for the backing file, yes. Both could be a listener.
>>
>>> > - A view has-a format, a device has-a view.
>>> > - A view can have-a listener?  Or is it formats?
>> A view has-a image. This may happen to be a listener, which in turn
>> has-a image (could be another listener, a format or a protocol).
>>
>> The question is what the semantics is with live snapshots (there are
>> probably similar problems, but this is the obvious one). For example we
>> could now have:
>>
>> mirror -> qcow2 -> blkdebug -> file
>
> Let's be precise here:
>
> mirror -> currentSnapshot -> qcow2 -> blkdebug -> file
>
> - file is a protocol.
>
> - blkdebug is a listener
>
> - qcow2 is a format
>
> - currentSnapshot is a view
>
> - mirror is a listener
>
> The difference between blkdebug/mirror on one side, and currentSnapshot
> on the other, is that (as you said) blkdebug/mirror are always stacked
> on top of something else. A driver that references currentSnapshot
> actually gets mirror.
>
> So we get to the actual hierarchy I'm proposing:
>
> BlockDriver
>    BlockSource (see below)
>    Protocol (bdrv_file_open)
>        FileProtocol
>        ...
>    View (has-a BlockFormat)
>        QCOW2View
>        RawView
>        ...
>    BlockListener (has-a BlockDriver)
>        MirrorListener
>        BlkDebugListener
> BlockFormat (has-a BlockSource)
>    QCOW2Format
>    RawFormat
>    ...
>
> Protocols and views are only internal.  Formats and devices in practice
> will only ever see BlockSources.  A BlockSource is a reference to a stack of
> BlockDrivers, where the base must be a protocol or view and there can
> be a number of listeners stacked on top of it.  Listeners can be
> added or removed from the stack, and the bottom driver can be swapped
> for another (for snapshots).
>
> So, here is how it would look:
>
>  .== BlockSource ==.                   .== BlockSource ===.
>  | MirrorListener  |                   | BlkDebugListener |
>  | QCOW2View ------+--> QCOW2Format -> | FileProtocol     |
>  '================='                   '=================='
>
>
>> There are two listeners here, mirror and blkdebug. (Things like blkdebug
>> are why view has-a listener isn't enough). After creating an external
>> snapshot, we expect the graph to look like this (the arrow down is the
>> backing file):
>>
>> mirror -> qcow2 -> file
>>             |
>>             +-> qcow2 -> blkdebug -> file
>
> And here:
>
>  .== BlockSource ==.
>  | MirrorListener  |                   .== BlockSource ==.
>  | QCOW2View ------+--> QCOW2Format -> | FileProtocol    |
>  '================='    |              '================='
>                         |                                          .== BlockSource ===.
>                         |   .== BlockSource ==.                    | BlkDebugListener |
>                         '-> | QCOW2View ------+--> QCOW2Format --> | FileProtocol     |
>                             '================='                    '=================='
>
> Does it seem sane?

This is a good discussion because BlockDriverState has become bloated
and complex.  A lot of fields only apply to sub-cases and we should
really split this struct.

Fields like "backing_file" *should* be in generic code, not duplicated
in each Format.  But BlockDriverState is too generic since it also
encompasses Protocols and Listeners.

You mentioned that some of these classes would be "internal".  I think
everything should be exposed in the QOM just like Linux exposes kernel
objects in sysfs.  It makes troubleshooting and debugging easier.

As has been pointed out, "Listener" suggests a passive role.  Perhaps
BlockFilter, BlockProcessor, or BlockModule is a better name.

Ideally Formats can be isolated from the rest of the block layer so
that it becomes easy to create a libimageformat.  If we bake
coroutines, I/O APIs, and memory allocation functions too deeply into
Formats then they are hard to test and impossible to use outside of
QEMU.

Stefan


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 12:22                                             ` Stefan Hajnoczi
@ 2012-02-21 12:57                                               ` Paolo Bonzini
  2012-02-21 15:49                                               ` Markus Armbruster
  1 sibling, 0 replies; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-21 12:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Ori Mamluk, Yair Kuszpet

On 02/21/2012 01:22 PM, Stefan Hajnoczi wrote:
> This is a good discussion because BlockDriverState has become bloated
> and complex.  A lot of fields only apply to sub-cases and we should
> really split this struct.
> 
> Fields like "backing_file" *should* be in generic code, not duplicated
> in each Format.  But BlockDriverState is too generic since it also
> encompasses Protocols and Listeners.

Yes.

> You mentioned that some of these classes would be "internal".  I think
> everything should be exposed in the QOM just like Linux exposes kernel
> objects in sysfs.  It makes troubleshooting and debugging easier.

Yes, exposing in QOM makes sense.  But QOM can see all things internal
by design. :)  The question is more what to expose to the rest of QEMU.
 For me the answer would be: BlockSource to the device and monitor,
Format to the monitor only.

> As has been pointed out, "Listener" suggests a passive role.  Perhaps
> BlockFilter, BlockProcessor, or BlockModule is a better name.

BlockFilter sounds good.

> Ideally Formats can be isolated from the rest of the block layer so
> that it becomes easy to create a libimageformat.  If we bake
> coroutines, I/O APIs, and memory allocation functions too deeply into
> Formats then they are hard to test and impossible to use outside of
> QEMU.

I'm afraid that the only way to do this is to first replace coroutines
with threads.

Paolo


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 11:36                                           ` Paolo Bonzini
  2012-02-21 12:22                                             ` Stefan Hajnoczi
@ 2012-02-21 13:10                                             ` Kevin Wolf
  2012-02-21 13:21                                               ` Paolo Bonzini
                                                                 ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: Kevin Wolf @ 2012-02-21 13:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: תומר בן אור,
	Stefan Hajnoczi, Jeff Cody, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

Am 21.02.2012 12:36, schrieb Paolo Bonzini:
> And here:
> 
>   .== BlockSource ==.
>   | MirrorListener  |                   .== BlockSource ==.
>   | QCOW2View ------+--> QCOW2Format -> | FileProtocol    |
>   '================='    |              '================='
>                          |                                          .== BlockSource ===.
>                          |   .== BlockSource ==.                    | BlkDebugListener |
>                          '-> | QCOW2View ------+--> QCOW2Format --> | FileProtocol     |
>                              '================='                    '=================='
> 
> Does it seem sane?

Yes, this (and the whole explanation that I didn't quote) looks good to
me. For now, at least, until I find the first example where it doesn't
work. Or maybe I can just trust Markus to bring something up.

>> The question is: Can we assume that any listeners that are on top of the
>> first format or protocol (i.e. those that would fit your model) should
>> move to the new top-level view? Or would it sometimes make sense to keep
>> it at the old one?
> 
> I think it depends, but both possibilities should be doable in this model.

Meh. :-)

Maybe we need to introduce something outside of the whole stack, an
entity that is referred to by the device (as in IDE, virtio-blk, ...)
and that refers to a stack of top-level listeners (which would be moved
to the new top-level BlockSource on live snapshot) and to the first
BlockSource (which can have more listeners, and those would stick with
the same BlockSource even if it moves down the chain).

Oh, and just to open another can of worms: We should probably design in
the notion of media (which can be ejected etc.) and drives (which always
stay there). We don't have a clean separation today.

Kevin


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 13:10                                             ` Kevin Wolf
@ 2012-02-21 13:21                                               ` Paolo Bonzini
  2012-02-21 15:56                                               ` Markus Armbruster
  2012-02-21 17:16                                               ` Stefan Hajnoczi
  2 siblings, 0 replies; 66+ messages in thread
From: Paolo Bonzini @ 2012-02-21 13:21 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	Stefan Hajnoczi, Jeff Cody, dlaor, qemu-devel, Markus Armbruster,
	Zhi Yong Wu, Federico Simoncelli, Ori Mamluk,
	עודד קדם,
	Yair Kuszpet

On 02/21/2012 02:10 PM, Kevin Wolf wrote:
>> > I think it depends, but both possibilities should be doable in this model.
> 
> Meh. :-)

Agreed. :)

> Maybe we need to introduce something outside of the whole stack, an
> entity that is referred to by the device (as in IDE, virtio-blk, ...)
> and that refers to a stack of top-level listeners (which would be moved
> to the new top-level BlockSource on live snapshot) and to the first
> BlockSource (which can have more listeners, and those would stick with
> the same BlockSource even if moves down the chain).

That would be a nested BlockSource.  I'm not sure why this is needed
though.  You can move BlockDrivers to another BlockSource down the chain
(stacking them on top of those that are there already), and leave the
upper BlockSource with the Protocol/View only.

> Oh, and just to open another can of worms: We should probably design in
> the notion of media (which can be ejected etc.) and drives (which always
> stay there). We don't have a clean separation today.

It is there: the drive is the BlockSource, the medium is the Protocol.
An empty drive is just another protocol.

Paolo


* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 12:22                                             ` Stefan Hajnoczi
  2012-02-21 12:57                                               ` Paolo Bonzini
@ 2012-02-21 15:49                                               ` Markus Armbruster
  1 sibling, 0 replies; 66+ messages in thread
From: Markus Armbruster @ 2012-02-21 15:49 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Zhi Yong Wu, Federico Simoncelli,
	Ori Mamluk, Yair Kuszpet, Paolo Bonzini

Stefan Hajnoczi <stefanha@gmail.com> writes:

> This is a good discussion because BlockDriverState has become bloated
> and complex.  A lot of fields only apply to sub-cases and we should
> really split this struct.

Yup.

We've had discussions where we couldn't even agree whether a specific block
driver is a format or a protocol or something else entirely.  This is
because the code doesn't really make distinctions.

> Fields like "backing_file" *should* be in generic code, not duplicated
> in each Format.

Debatable.

>                  But BlockDriverState is too generic since it also
> encompasses Protocols and Listeners.
>
> You mentioned that some of these classes would be "internal".  I think
> everything should be exposed in the QOM just like Linux exposes kernel
> objects in sysfs.  It makes troubleshooting and debugging easier.
>
> As has been pointed out, "Listener" suggests a passive role.  Perhaps
> BlockFilter, BlockProcessor, or BlockModule is a better name.

BlockFilter sounds good to me.

> Ideally Formats can be isolated from the rest of the block layer so
> that it becomes easy to create a libimageformat.  If we bake
> coroutines, I/O APIs, and memory allocation functions too deeply into
> Formats then they are hard to test and impossible to use outside of
> QEMU.

Wouldn't that be nice.

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 13:10                                             ` Kevin Wolf
  2012-02-21 13:21                                               ` Paolo Bonzini
@ 2012-02-21 15:56                                               ` Markus Armbruster
  2012-02-21 16:04                                                 ` Kevin Wolf
  2012-02-21 17:16                                               ` Stefan Hajnoczi
  2 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2012-02-21 15:56 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Zhi Yong Wu, Federico Simoncelli,
	Ori Mamluk, Stefan Hajnoczi, Yair Kuszpet, Paolo Bonzini

Kevin Wolf <kwolf@redhat.com> writes:

> On 21.02.2012 12:36, Paolo Bonzini wrote:
>> And here:
>> 
>>   .== BlockSource ==.
>>   | MirrorListener  |                   .== BlockSource ==.
>>   | QCOW2View ------+--> QCOW2Format -> | FileProtocol    |
>>   '================='    |              '================='
>>                          |                                          .== BlockSource ===.
>>                          |   .== BlockSource ==.                    | BlkDebugListener |
>>                          '-> | QCOW2View ------+--> QCOW2Format --> | FileProtocol     |
>>                              '================='                    '=================='
>> 
>> Does it seem sane?
>
> Yes, this (and the whole explanation that I didn't quote) looks good to
> me. For now, at least, until I find the first example where it doesn't
> work. Or maybe I can just trust Markus to bring something up.
>
>>> The question is: Can we assume that any listeners that are on top of the
>>> first format or protocol (i.e. those that would fit your model) should
>>> move to the new top-level view? Or would it sometimes make sense to keep
>>> it at the old one?
>> 
>> I think it depends, but both possibilities should be doable in this model.
>
> Meh. :-)
>
> Maybe we need to introduce something outside of the whole stack, an
> entity that is referred to by the device (as in IDE, virtio-blk, ...)
> and that refers to a stack of top-level listeners (which would be moved
> to the new top-level BlockSource on live snapshot) and to the first
> BlockSource (which can have more listeners, and those would stick with
> the same BlockSource even if it moves down the chain).

The top-level BDS is already special.  I think it makes sense to factor
out the specialness into a "block backend" type, and let it point to a
non-special block driver instance (root of a tree of block driver
instances, in general).

> Oh, and just to open another can of worms: We should probably design in
> the notion of media (which can be ejected etc.) and drives (which always
> stay there). We don't have a clean separation today.

The "closed BDS means no media" thing works, but it's odd.

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-07 14:59     ` Anthony Liguori
  2012-02-07 15:20       ` Stefan Hajnoczi
@ 2012-02-21 16:01       ` Markus Armbruster
  2012-02-21 17:31         ` Stefan Hajnoczi
  1 sibling, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2012-02-21 16:01 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, dlaor, qemu-devel, Luiz Capitulino,
	Ori Mamluk

Anthony Liguori <anthony@codemonkey.ws> writes:

> On 02/07/2012 07:50 AM, Stefan Hajnoczi wrote:
>> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>> On 07.02.2012 11:29, Ori Mamluk wrote:
>>>> Repagent is a new module that allows an external replication system to
>>>> replicate a volume of a Qemu VM.
>>
>> I recently joked with Kevin that QEMU is on its way to reimplementing
>> the Linux block and device-mapper layers.  Now we have drbd, thanks!
>> :P
>
> I don't think it's a joke.  Do we really want to get into this space?
> Why not just use drbd?
>
> If it's because we want to also work with image formats, perhaps we
> should export our image format code as a shared library and let drbd
> link against it.

I want to be able to "mount foo.qcow2 /mnt".

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 15:56                                               ` Markus Armbruster
@ 2012-02-21 16:04                                                 ` Kevin Wolf
  2012-02-21 16:19                                                   ` Markus Armbruster
  0 siblings, 1 reply; 66+ messages in thread
From: Kevin Wolf @ 2012-02-21 16:04 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Zhi Yong Wu, Federico Simoncelli,
	Ori Mamluk, Stefan Hajnoczi, Yair Kuszpet, Paolo Bonzini

On 21.02.2012 16:56, Markus Armbruster wrote:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
>> On 21.02.2012 12:36, Paolo Bonzini wrote:
>>> And here:
>>>
>>>   .== BlockSource ==.
>>>   | MirrorListener  |                   .== BlockSource ==.
>>>   | QCOW2View ------+--> QCOW2Format -> | FileProtocol    |
>>>   '================='    |              '================='
>>>                          |                                          .== BlockSource ===.
>>>                          |   .== BlockSource ==.                    | BlkDebugListener |
>>>                          '-> | QCOW2View ------+--> QCOW2Format --> | FileProtocol     |
>>>                              '================='                    '=================='
>>>
>>> Does it seem sane?
>>
>> Yes, this (and the whole explanation that I didn't quote) looks good to
>> me. For now, at least, until I find the first example where it doesn't
>> work. Or maybe I can just trust Markus to bring something up.
>>
>>>> The question is: Can we assume that any listeners that are on top of the
>>>> first format or protocol (i.e. those that would fit your model) should
>>>> move to the new top-level view? Or would it sometimes make sense to keep
>>>> it at the old one?
>>>
>>> I think it depends, but both possibilities should be doable in this model.
>>
>> Meh. :-)
>>
>> Maybe we need to introduce something outside of the whole stack, an
>> entity that is referred to by the device (as in IDE, virtio-blk, ...)
>> and that refers to a stack of top-level listeners (which would be moved
>> to the new top-level BlockSource on live snapshot) and to the first
>> BlockSource (which can have more listeners, and those would stick with
>> the same BlockSource even if it moves down the chain).
> 
> The top-level BDS is already special.  I think it makes sense to factor
> out the specialness into a "block backend" type, and let it point to a
> non-special block driver instance (root of a tree of block driver
> instances, in general).

I think this is what I meant.

>> Oh, and just to open another can of worms: We should probably design in
>> the notion of media (which can be ejected etc.) and drives (which always
>> stay there). We don't have a clean separation today.
> 
> The "closed BDS means no media" thing works, but it's odd.

I'm more talking about data that belongs to the media, like geometry.
This came up recently with Hervé's floppy patches.

Kevin

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 16:04                                                 ` Kevin Wolf
@ 2012-02-21 16:19                                                   ` Markus Armbruster
  2012-02-21 16:39                                                     ` Kevin Wolf
  0 siblings, 1 reply; 66+ messages in thread
From: Markus Armbruster @ 2012-02-21 16:19 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Zhi Yong Wu, Federico Simoncelli,
	Ori Mamluk, Stefan Hajnoczi, Yair Kuszpet, Paolo Bonzini

Kevin Wolf <kwolf@redhat.com> writes:

> On 21.02.2012 16:56, Markus Armbruster wrote:
>> Kevin Wolf <kwolf@redhat.com> writes:
[...]
>>> Maybe we need to introduce something outside of the whole stack, an
>>> entity that is referred to by the device (as in IDE, virtio-blk, ...)
>>> and that refers to a stack of top-level listeners (which would be moved
>>> to the new top-level BlockSource on live snapshot) and to the first
>>> BlockSource (which can have more listeners, and those would stick with
>>> the same BlockSource even if it moves down the chain).
>> 
>> The top-level BDS is already special.  I think it makes sense to factor
>> out the specialness into a "block backend" type, and let it point to a
>> non-special block driver instance (root of a tree of block driver
>> instances, in general).
>
> I think this is what I meant.

Then we're in violent agreement :)

>>> Oh, and just to open another can of worms: We should probably design in
>>> the notion of media (which can be ejected etc.) and drives (which always
>>> stay there). We don't have a clean separation today.
>> 
>> The "closed BDS means no media" thing works, but it's odd.
>
> I'm more talking about data that belongs to the media, like geometry.
> This came up recently with Hervé's floppy patches.

Is geometry relevant to anything but floppies and really small disks
being accessed via really old interfaces?

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 16:19                                                   ` Markus Armbruster
@ 2012-02-21 16:39                                                     ` Kevin Wolf
  0 siblings, 0 replies; 66+ messages in thread
From: Kevin Wolf @ 2012-02-21 16:39 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Zhi Yong Wu, Federico Simoncelli,
	Ori Mamluk, Stefan Hajnoczi, Yair Kuszpet, Paolo Bonzini

On 21.02.2012 17:19, Markus Armbruster wrote:
>>>> Oh, and just to open another can of worms: We should probably design in
>>>> the notion of media (which can be ejected etc.) and drives (which always
>>>> stay there). We don't have a clean separation today.
>>>
>>> The "closed BDS means no media" thing works, but it's odd.
>>
>> I'm more talking about data that belongs to the media, like geometry.
>> This came up recently with Hervé's floppy patches.
> 
> Is geometry relevant to anything but floppies and really small disks
> being accessed via really old interfaces?

Not sure what it's actually used for, but even virtio-blk does have a
geometry.

But is it really only geometry? I think the read-only flag belongs to
the medium as well. There are probably more candidates.

Kevin

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21 13:10                                             ` Kevin Wolf
  2012-02-21 13:21                                               ` Paolo Bonzini
  2012-02-21 15:56                                               ` Markus Armbruster
@ 2012-02-21 17:16                                               ` Stefan Hajnoczi
  2 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-21 17:16 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	עודד קדם,
	Jeff Cody, dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Ori Mamluk, Yair Kuszpet, Paolo Bonzini

On Tue, Feb 21, 2012 at 1:10 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> On 21.02.2012 12:36, Paolo Bonzini wrote:
> Oh, and just to open another can of worms: We should probably design in
> the notion of media (which can be ejected etc.) and drives (which always
> stay there). We don't have a clean separation today.

Media and drives are guest state; they should be part of device
emulation and not of the host device implementation.

The layering issue is that things like floppy and CD-ROM passthrough
do deal with media change and status.

Does it make sense to have a hw/drive.c for the guest state associated
with a -drive?

Stefan

* Re: [Qemu-devel] [RFC PATCH] replication agent module
  2012-02-21 16:01       ` Markus Armbruster
@ 2012-02-21 17:31         ` Stefan Hajnoczi
  0 siblings, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-02-21 17:31 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Kevin Wolf, Ori Mamluk, dlaor, qemu-devel, Luiz Capitulino

On Tue, Feb 21, 2012 at 4:01 PM, Markus Armbruster <armbru@redhat.com> wrote:
> Anthony Liguori <anthony@codemonkey.ws> writes:
>
>> On 02/07/2012 07:50 AM, Stefan Hajnoczi wrote:
>>> On Tue, Feb 7, 2012 at 1:34 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>> On 07.02.2012 11:29, Ori Mamluk wrote:
>>>>> Repagent is a new module that allows an external replication system to
>>>>> replicate a volume of a Qemu VM.
>>>
>>> I recently joked with Kevin that QEMU is on its way to reimplementing
>>> the Linux block and device-mapper layers.  Now we have drbd, thanks!
>>> :P
>>
>> I don't think it's a joke.  Do we really want to get into this space?
>> Why not just use drbd?
>>
>> If it's because we want to also work with image formats, perhaps we
>> should export our image format code as a shared library and let drbd
>> link against it.
>
> I want to be able to "mount foo.qcow2 /mnt".

That would only be possible if foo.qcow2 contains a flat file system,
without a partition table or logical volumes.

But I completely agree that working with image files on the host
should be as easy as mount.
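The usual host-side workaround today is to go through the kernel's nbd driver with qemu-nbd. A sketch only — it assumes root privileges, an unused /dev/nbd0, and (for the partitioned case) a first partition holding the filesystem:

```shell
# Load the nbd driver with partition scanning enabled.
modprobe nbd max_part=8

# Export the image; the kernel now sees it as a block device.
qemu-nbd --connect=/dev/nbd0 foo.qcow2

# A bare filesystem mounts directly; with a partition table you
# mount the partition device instead.
mount /dev/nbd0 /mnt        # image is a flat filesystem
mount /dev/nbd0p1 /mnt      # image has a partition table

# Tear down when done.
umount /mnt
qemu-nbd --disconnect /dev/nbd0
```

This gets close to "mount foo.qcow2 /mnt" in effect, though not in convenience.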

Stefan

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-21  9:03                                 ` [Qemu-devel] BlockDriverState stack and BlockListeners (was: [RFC] Replication agent design) Kevin Wolf
  2012-02-21  9:15                                   ` [Qemu-devel] BlockDriverState stack and BlockListeners Paolo Bonzini
@ 2012-02-29  8:38                                   ` Ori Mamluk
  2012-03-03 11:46                                     ` Stefan Hajnoczi
  1 sibling, 1 reply; 66+ messages in thread
From: Ori Mamluk @ 2012-02-29  8:38 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Stefan Hajnoczi, Yair Kuszpet,
	Paolo Bonzini

Hi Kevin and all,
I think the BlockFilter direction goes very well with our plans for a 
replication module.
I guess it would take some discussions and time to form a solid layer 
for the BlockFilters, and I'd like to move ahead in parallel with the 
replication module.
What do you say about the following plan:
I'll continue with the module development in the spirit of my original 
RFC, and use a tentative filter-like API to block.c that I'll add.
Then I can submit the patches, get comments on the internal parts of 
the filter, and later on, when some form of generic BlockFilter API is 
there, change my filter to use it.
This way we can have some anchor in Qemu to work with and develop the 
rest of the management parts.
Does it sound reasonable?

Thanks,
Ori

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-02-29  8:38                                   ` Ori Mamluk
@ 2012-03-03 11:46                                     ` Stefan Hajnoczi
  2012-03-04  5:14                                       ` Ori Mamluk
  0 siblings, 1 reply; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-03-03 11:46 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Yair Kuszpet, Paolo Bonzini

On Wed, Feb 29, 2012 at 8:38 AM, Ori Mamluk <omamluk@zerto.com> wrote:
> I think the BlockFilter direction goes very well with our plans for a
> replication module.
> I guess it would take some discussions and time to form a solid layer for
> the BlockFilters, and I'd like to move ahead in parallel with the
> replication module.

Will the replication module still use a custom network protocol or do
you plan to implement the in-process NBD server?

I have added the in-process NBD server idea to the Google Summer of
Code 2012 project ideas page.  Perhaps students will be interested in
implementing it this summer.  But if you are already working on it I
can remove the idea, please let me know.

Thanks,
Stefan

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-03-03 11:46                                     ` Stefan Hajnoczi
@ 2012-03-04  5:14                                       ` Ori Mamluk
  2012-03-04  8:56                                         ` Paolo Bonzini
  2012-03-05 12:04                                         ` Stefan Hajnoczi
  0 siblings, 2 replies; 66+ messages in thread
From: Ori Mamluk @ 2012-03-04  5:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Yair Kuszpet, Paolo Bonzini

On 03/03/2012 13:46, Stefan Hajnoczi wrote:
> On Wed, Feb 29, 2012 at 8:38 AM, Ori Mamluk<omamluk@zerto.com>  wrote:
>> I think the BlockFilter direction goes very well with our plans for a
>> replication module.
>> I guess it would take some discussions and time to form a solid layer for
>> the BlockFilters, and I'd like to move ahead in parallel with the
>> replication module.
> Will the replication module still use a custom network protocol or do
> you plan to implement the in-process NBD server?
>
> I have added the in-process NBD server idea to the Google Summer of
> Code 2012 project ideas page.  Perhaps students will be interested in
> implementing it this summer.  But if you are already working on it I
> can remove the idea, please let me know.
>
> Thanks,
> Stefan
I prefer not to do it as an NBD server, mainly because NBD by definition 
requires a port per volume, and I think that will pose management overhead.
So my current plan is a custom protocol - but it's still TBD.

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-03-04  5:14                                       ` Ori Mamluk
@ 2012-03-04  8:56                                         ` Paolo Bonzini
  2012-03-05 12:04                                         ` Stefan Hajnoczi
  1 sibling, 0 replies; 66+ messages in thread
From: Paolo Bonzini @ 2012-03-04  8:56 UTC (permalink / raw)
  To: qemu-devel

On 04/03/2012 06:14, Ori Mamluk wrote:
> I prefer not to do it as NBD server, mainly because NBD by definition
> requires a port per volume and I think it will pose a management overhead.

NBD supports multiple volumes on the same server; it's just not
implemented in QEMU.  Also, you could use a different negotiation but
still the same request format, which would let you share some code.

Paolo

* Re: [Qemu-devel] BlockDriverState stack and BlockListeners
  2012-03-04  5:14                                       ` Ori Mamluk
  2012-03-04  8:56                                         ` Paolo Bonzini
@ 2012-03-05 12:04                                         ` Stefan Hajnoczi
  1 sibling, 0 replies; 66+ messages in thread
From: Stefan Hajnoczi @ 2012-03-05 12:04 UTC (permalink / raw)
  To: Ori Mamluk
  Cc: Kevin Wolf,
	תומר בן אור,
	עודד קדם,
	dlaor, qemu-devel, Markus Armbruster, Zhi Yong Wu,
	Federico Simoncelli, Yair Kuszpet, Paolo Bonzini

On Sun, Mar 4, 2012 at 5:14 AM, Ori Mamluk <omamluk@zerto.com> wrote:
> On 03/03/2012 13:46, Stefan Hajnoczi wrote:
>>
>> On Wed, Feb 29, 2012 at 8:38 AM, Ori Mamluk<omamluk@zerto.com>  wrote:
>>>
>>> I think the BlockFilter direction goes very well with our plans for a
>>> replication module.
>>> I guess it would take some discussions and time to form a solid layer for
>>> the BlockFilters, and I'd like to move ahead in parallel with the
>>> replication module.
>>
>> Will the replication module still use a custom network protocol or do
>> you plan to implement the in-process NBD server?
>>
>> I have added the in-process NBD server idea to the Google Summer of
>> Code 2012 project ideas page.  Perhaps students will be interested in
>> implementing it this summer.  But if you are already working on it I
>> can remove the idea, please let me know.
>>
>> Thanks,
>> Stefan
>
> I prefer not to do it as an NBD server, mainly because NBD by definition
> requires a port per volume, and I think that will pose management overhead.
> So my current plan is a custom protocol - but it's still TBD.

Okay, in that case I'll leave the GSoC project because the in-process
NBD server is a useful feature to have in the future.

Stefan

end of thread (newest message: 2012-03-05 12:04 UTC)

Thread overview: 66+ messages
2012-02-07 10:29 [Qemu-devel] [RFC PATCH] replication agent module Ori Mamluk
2012-02-07 12:12 ` Anthony Liguori
2012-02-07 12:25   ` Dor Laor
2012-02-07 12:30     ` Ori Mamluk
2012-02-07 12:40       ` Anthony Liguori
2012-02-07 14:06         ` Ori Mamluk
2012-02-07 14:40           ` Paolo Bonzini
2012-02-07 14:48             ` Ori Mamluk
2012-02-07 15:47               ` Paolo Bonzini
2012-02-08  6:10                 ` Ori Mamluk
2012-02-08  8:49                   ` Dor Laor
2012-02-08 11:59                     ` Stefan Hajnoczi
2012-02-08  8:55                   ` Kevin Wolf
2012-02-08  9:47                     ` Ori Mamluk
2012-02-08 10:04                       ` Kevin Wolf
2012-02-08 13:28                         ` [Qemu-devel] [RFC] Replication agent design (was [RFC PATCH] replication agent module) Ori Mamluk
2012-02-08 14:59                           ` Stefan Hajnoczi
2012-02-08 14:59                             ` Stefan Hajnoczi
2012-02-19 13:40                             ` Ori Mamluk
2012-02-20 14:32                               ` Paolo Bonzini
2012-02-21  9:03                                 ` [Qemu-devel] BlockDriverState stack and BlockListeners (was: [RFC] Replication agent design) Kevin Wolf
2012-02-21  9:15                                   ` [Qemu-devel] BlockDriverState stack and BlockListeners Paolo Bonzini
2012-02-21  9:49                                     ` Kevin Wolf
2012-02-21 10:09                                       ` Paolo Bonzini
2012-02-21 10:51                                         ` Kevin Wolf
2012-02-21 11:36                                           ` Paolo Bonzini
2012-02-21 12:22                                             ` Stefan Hajnoczi
2012-02-21 12:57                                               ` Paolo Bonzini
2012-02-21 15:49                                               ` Markus Armbruster
2012-02-21 13:10                                             ` Kevin Wolf
2012-02-21 13:21                                               ` Paolo Bonzini
2012-02-21 15:56                                               ` Markus Armbruster
2012-02-21 16:04                                                 ` Kevin Wolf
2012-02-21 16:19                                                   ` Markus Armbruster
2012-02-21 16:39                                                     ` Kevin Wolf
2012-02-21 17:16                                               ` Stefan Hajnoczi
2012-02-21 10:20                                       ` Ori Mamluk
2012-02-29  8:38                                   ` Ori Mamluk
2012-03-03 11:46                                     ` Stefan Hajnoczi
2012-03-04  5:14                                       ` Ori Mamluk
2012-03-04  8:56                                         ` Paolo Bonzini
2012-03-05 12:04                                         ` Stefan Hajnoczi
2012-02-08 11:02                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
2012-02-08 13:00                     ` [Qemu-devel] [RFC] Replication agent requirements (was [RFC PATCH] replication agent module) Ori Mamluk
2012-02-08 13:30                       ` Anthony Liguori
2012-02-08 12:03                   ` [Qemu-devel] [RFC PATCH] replication agent module Stefan Hajnoczi
2012-02-08 12:46                     ` Paolo Bonzini
2012-02-08 14:39                       ` Stefan Hajnoczi
2012-02-08 14:55                         ` Paolo Bonzini
2012-02-08 15:07                           ` Stefan Hajnoczi
2012-02-07 14:53             ` Kevin Wolf
2012-02-07 15:00             ` Anthony Liguori
2012-02-07 13:34 ` Kevin Wolf
2012-02-07 13:50   ` Stefan Hajnoczi
2012-02-07 13:58     ` Paolo Bonzini
2012-02-07 14:05     ` Paolo Bonzini
2012-02-08 12:17       ` Orit Wasserman
2012-02-07 14:18     ` Ori Mamluk
2012-02-07 14:59     ` Anthony Liguori
2012-02-07 15:20       ` Stefan Hajnoczi
2012-02-07 16:25         ` Anthony Liguori
2012-02-21 16:01       ` Markus Armbruster
2012-02-21 17:31         ` Stefan Hajnoczi
2012-02-07 14:45   ` Ori Mamluk
2012-02-08 12:29     ` Orit Wasserman
2012-02-08 11:45   ` Luiz Capitulino
