* [PATCH 0/5] colo: Introduce resource agent and test suite/CI
@ 2020-05-11 12:26 Lukas Straub
2020-05-11 12:26 ` [PATCH 1/5] block/quorum.c: stable children names Lukas Straub
` (5 more replies)
0 siblings, 6 replies; 14+ messages in thread
From: Lukas Straub @ 2020-05-11 12:26 UTC (permalink / raw)
To: qemu-devel; +Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert
Hello Everyone,
These patches introduce a resource agent for fully automatic management of COLO,
and a test suite building on the resource agent to test COLO extensively.
Test suite features:
-Tests failover with the peer crashing or hanging, as well as failover during a checkpoint
-Tests the network using ssh and iperf3
-The quick test requires no special configuration
-A network test for testing colo-compare
-A stress test: failover repeatedly under network load
Resource agent features:
-Fully automatic management of COLO
-Handles many failures: hanging/crashing qemu, replication errors, disk errors, ...
-Recovers from a hanging qemu by using the "yank" out-of-band (oob) command
-Tracks which node has up-to-date data
-Works well in clusters with more than 2 nodes
Run times on my laptop:
Quick test: 200s
Network test: 800s (tagged as slow)
Stress test: 1300s (tagged as slow)
The test suite needs access to a network bridge to properly test the network,
so some parameters need to be given to the test run. See
tests/acceptance/colo.py for more information.
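For illustration, a run might look like this (a hedged sketch: the `-p` parameter names below are hypothetical placeholders; tests/acceptance/colo.py documents the parameters actually expected):

```shell
# Hypothetical invocation sketch -- the -p parameter names are
# placeholders; see tests/acceptance/colo.py for the real ones.
avocado run tests/acceptance/colo.py \
    -p bridge_name=br0 -p bridge_helper=/usr/lib/qemu/qemu-bridge-helper

# The network and stress tests are tagged as slow and can be
# excluded via tag filtering:
avocado run --filter-by-tags=-slow tests/acceptance/colo.py
```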
I wonder how this integrates with existing CI infrastructure. Is there a common
CI for qemu where this can run, or does every subsystem have to run its own
CI?
Regards,
Lukas Straub
Lukas Straub (5):
block/quorum.c: stable children names
colo: Introduce resource agent
colo: Introduce high-level test suite
configure,Makefile: Install colo resource-agent
MAINTAINERS: Add myself as maintainer for COLO resource agent
MAINTAINERS | 6 +
Makefile | 5 +
block/quorum.c | 20 +-
configure | 10 +
scripts/colo-resource-agent/colo | 1429 ++++++++++++++++++++++
scripts/colo-resource-agent/crm_master | 44 +
scripts/colo-resource-agent/crm_resource | 12 +
tests/acceptance/colo.py | 689 +++++++++++
8 files changed, 2209 insertions(+), 6 deletions(-)
create mode 100755 scripts/colo-resource-agent/colo
create mode 100755 scripts/colo-resource-agent/crm_master
create mode 100755 scripts/colo-resource-agent/crm_resource
create mode 100644 tests/acceptance/colo.py
--
2.20.1
* [PATCH 1/5] block/quorum.c: stable children names
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
@ 2020-05-11 12:26 ` Lukas Straub
2020-06-02 1:01 ` Zhang, Chen
2020-06-02 11:07 ` Alberto Garcia
2020-05-11 12:26 ` [PATCH 2/5] colo: Introduce resource agent Lukas Straub
` (4 subsequent siblings)
5 siblings, 2 replies; 14+ messages in thread
From: Lukas Straub @ 2020-05-11 12:26 UTC (permalink / raw)
To: qemu-devel; +Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert
When removing the child with the highest index from the quorum,
decrement s->next_child_index. This way we get stable child
names as long as only the last child is removed.
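The effect of this change can be modeled in a few lines of Python (an illustrative sketch of the naming logic only, not the QEMU code):

```python
# Standalone model of the child-naming scheme: children are named
# "children.<index>", and removing the child with the highest index
# decrements next_child_index so that its name can be reused.

class Quorum:
    def __init__(self):
        self.children = []       # names of current children
        self.next_child_index = 0

    def add_child(self):
        name = "children.%d" % self.next_child_index
        self.next_child_index += 1
        self.children.append(name)
        return name

    def del_child(self, name):
        # Only when the last (highest-index) child is removed does the
        # index shrink; removing a middle child leaves a gap, as before.
        if name == "children.%d" % (self.next_child_index - 1):
            self.next_child_index -= 1
        self.children.remove(name)

q = Quorum()
q.add_child()                 # children.0
q.add_child()                 # children.1
q.del_child("children.1")
assert q.add_child() == "children.1"  # stable name after remove + re-add
```

This mirrors the quorum_del_child() hunk: the name reuse only happens when the removed child is the one matching next_child_index - 1.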
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
block/quorum.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/block/quorum.c b/block/quorum.c
index 6d7a56bd93..acfa09c2cc 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -29,6 +29,8 @@
#define HASH_LENGTH 32
+#define INDEXSTR_LEN 32
+
#define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
#define QUORUM_OPT_BLKVERIFY "blkverify"
#define QUORUM_OPT_REWRITE "rewrite-corrupted"
@@ -972,9 +974,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
opened = g_new0(bool, s->num_children);
for (i = 0; i < s->num_children; i++) {
- char indexstr[32];
- ret = snprintf(indexstr, 32, "children.%d", i);
- assert(ret < 32);
+ char indexstr[INDEXSTR_LEN];
+ ret = snprintf(indexstr, INDEXSTR_LEN, "children.%d", i);
+ assert(ret < INDEXSTR_LEN);
s->children[i] = bdrv_open_child(NULL, options, indexstr, bs,
&child_format, false, &local_err);
@@ -1026,7 +1028,7 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
{
BDRVQuorumState *s = bs->opaque;
BdrvChild *child;
- char indexstr[32];
+ char indexstr[INDEXSTR_LEN];
int ret;
if (s->is_blkverify) {
@@ -1041,8 +1043,8 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
return;
}
- ret = snprintf(indexstr, 32, "children.%u", s->next_child_index);
- if (ret < 0 || ret >= 32) {
+ ret = snprintf(indexstr, INDEXSTR_LEN, "children.%u", s->next_child_index);
+ if (ret < 0 || ret >= INDEXSTR_LEN) {
error_setg(errp, "cannot generate child name");
return;
}
@@ -1069,6 +1071,7 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
Error **errp)
{
BDRVQuorumState *s = bs->opaque;
+ char indexstr[INDEXSTR_LEN];
int i;
for (i = 0; i < s->num_children; i++) {
@@ -1090,6 +1093,11 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
/* We know now that num_children > threshold, so blkverify must be false */
assert(!s->is_blkverify);
+ snprintf(indexstr, INDEXSTR_LEN, "children.%u", s->next_child_index - 1);
+ if (!strncmp(child->name, indexstr, INDEXSTR_LEN)) {
+ s->next_child_index--;
+ }
+
bdrv_drained_begin(bs);
/* We can safely remove this child now */
--
2.20.1
* [PATCH 2/5] colo: Introduce resource agent
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
2020-05-11 12:26 ` [PATCH 1/5] block/quorum.c: stable children names Lukas Straub
@ 2020-05-11 12:26 ` Lukas Straub
2020-05-11 12:27 ` [PATCH 3/5] colo: Introduce high-level test suite Lukas Straub
` (3 subsequent siblings)
5 siblings, 0 replies; 14+ messages in thread
From: Lukas Straub @ 2020-05-11 12:26 UTC (permalink / raw)
To: qemu-devel; +Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert
Introduce a resource agent which can be used to manage qemu COLO
in a Pacemaker cluster.
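As a rough illustration of how the agent is wired into Pacemaker (a hedged sketch: the resource name, provider prefix, and parameter values below are placeholders, not taken from this patch; the real options are described in the agent's metadata), it would be configured as a master/slave resource with notify=true, with the node holding the up-to-date image marked via crm_master:

```shell
# Hypothetical crmsh sketch -- resource name, provider ("ocf:qemu:colo")
# and parameter values are placeholders, not defined by this patch.
crm configure primitive colo-vm ocf:qemu:colo \
    params options="-vnc :0 -enable-kvm -m 512 -netdev bridge,id=hn0 ..." \
           disk_size=10G active_hidden_dir=/dev/shm \
    op monitor interval=1000ms role=Slave timeout=30s \
    op monitor interval=1001ms role=Master timeout=30s
# notify=true is required by the agent (see its metadata)
crm configure ms ms-colo-vm colo-vm meta notify=true
# Tell Pacemaker which node has up-to-date data:
crm_master -r colo-vm -v 10
```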
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
scripts/colo-resource-agent/colo | 1429 ++++++++++++++++++++++++++++++
1 file changed, 1429 insertions(+)
create mode 100755 scripts/colo-resource-agent/colo
diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
new file mode 100755
index 0000000000..fbc5dc2c13
--- /dev/null
+++ b/scripts/colo-resource-agent/colo
@@ -0,0 +1,1429 @@
+#!/usr/bin/env python3
+
+# Resource agent for qemu COLO for use with Pacemaker CRM
+#
+# Copyright (c) Lukas Straub <lukasstraub2@web.de>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later. See the COPYING file in the top-level directory.
+
+from __future__ import print_function
+import subprocess
+import sys
+import os
+import os.path
+import signal
+import socket
+import select
+import json
+import re
+import time
+import logging
+import logging.handlers
+
+# Constants
+OCF_SUCCESS = 0
+OCF_ERR_GENERIC = 1
+OCF_ERR_ARGS = 2
+OCF_ERR_UNIMPLEMENTED = 3
+OCF_ERR_PERM = 4
+OCF_ERR_INSTALLED = 5
+OCF_ERR_CONFIGURED = 6
+OCF_NOT_RUNNING = 7
+OCF_RUNNING_MASTER = 8
+OCF_FAILED_MASTER = 9
+
+# Get environment variables
+OCF_RESKEY_CRM_meta_notify_type \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_type")
+OCF_RESKEY_CRM_meta_notify_operation \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_operation")
+OCF_RESKEY_CRM_meta_notify_key_operation \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_key_operation")
+OCF_RESKEY_CRM_meta_notify_start_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_start_uname", "")
+OCF_RESKEY_CRM_meta_notify_stop_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_stop_uname", "")
+OCF_RESKEY_CRM_meta_notify_active_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_active_uname", "")
+OCF_RESKEY_CRM_meta_notify_promote_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_promote_uname", "")
+OCF_RESKEY_CRM_meta_notify_demote_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_demote_uname", "")
+OCF_RESKEY_CRM_meta_notify_master_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_master_uname", "")
+OCF_RESKEY_CRM_meta_notify_slave_uname \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify_slave_uname", "")
+
+HA_RSCTMP = os.getenv("HA_RSCTMP", "/run/resource-agents")
+HA_LOGFACILITY = os.getenv("HA_LOGFACILITY")
+HA_LOGFILE = os.getenv("HA_LOGFILE")
+HA_DEBUG = os.getenv("HA_debug", "0")
+HA_DEBUGLOG = os.getenv("HA_DEBUGLOG")
+OCF_RESOURCE_INSTANCE = os.getenv("OCF_RESOURCE_INSTANCE", "default-instance")
+OCF_RESKEY_CRM_meta_timeout \
+ = os.getenv("OCF_RESKEY_CRM_meta_timeout", "60000")
+OCF_RESKEY_CRM_meta_interval \
+ = int(os.getenv("OCF_RESKEY_CRM_meta_interval", "1"))
+OCF_RESKEY_CRM_meta_clone_max \
+ = int(os.getenv("OCF_RESKEY_CRM_meta_clone_max", "1"))
+OCF_RESKEY_CRM_meta_clone_node_max \
+ = int(os.getenv("OCF_RESKEY_CRM_meta_clone_node_max", "1"))
+OCF_RESKEY_CRM_meta_master_max \
+ = int(os.getenv("OCF_RESKEY_CRM_meta_master_max", "1"))
+OCF_RESKEY_CRM_meta_master_node_max \
+ = int(os.getenv("OCF_RESKEY_CRM_meta_master_node_max", "1"))
+OCF_RESKEY_CRM_meta_notify \
+ = os.getenv("OCF_RESKEY_CRM_meta_notify")
+OCF_RESKEY_CRM_meta_globally_unique \
+ = os.getenv("OCF_RESKEY_CRM_meta_globally_unique")
+
+HOSTNAME = os.getenv("OCF_RESKEY_CRM_meta_on_node", socket.gethostname())
+
+OCF_ACTION = os.getenv("__OCF_ACTION")
+if not OCF_ACTION and len(sys.argv) == 2:
+ OCF_ACTION = sys.argv[1]
+
+# Resource parameters
+OCF_RESKEY_qemu_binary_default = "qemu-system-x86_64"
+OCF_RESKEY_qemu_img_binary_default = "qemu-img"
+OCF_RESKEY_log_dir_default = HA_RSCTMP
+OCF_RESKEY_options_default = ""
+OCF_RESKEY_disk_size_default = ""
+OCF_RESKEY_active_hidden_dir_default = ""
+OCF_RESKEY_listen_address_default = "0.0.0.0"
+OCF_RESKEY_base_port_default = "9000"
+OCF_RESKEY_checkpoint_interval_default = "20000"
+OCF_RESKEY_compare_timeout_default = "3000"
+OCF_RESKEY_expired_scan_cycle_default = "3000"
+OCF_RESKEY_use_filter_rewriter_default = "true"
+OCF_RESKEY_vnet_hdr_default = "false"
+OCF_RESKEY_max_disk_errors_default = "1"
+OCF_RESKEY_monitor_timeout_default = "20000"
+OCF_RESKEY_yank_timeout_default = "10000"
+OCF_RESKEY_fail_fast_timeout_default = "5000"
+OCF_RESKEY_debug_default = "0"
+
+OCF_RESKEY_qemu_binary \
+ = os.getenv("OCF_RESKEY_qemu_binary", OCF_RESKEY_qemu_binary_default)
+OCF_RESKEY_qemu_img_binary \
+ = os.getenv("OCF_RESKEY_qemu_img_binary", OCF_RESKEY_qemu_img_binary_default)
+OCF_RESKEY_log_dir \
+ = os.getenv("OCF_RESKEY_log_dir", OCF_RESKEY_log_dir_default)
+OCF_RESKEY_options \
+ = os.getenv("OCF_RESKEY_options", OCF_RESKEY_options_default)
+OCF_RESKEY_disk_size \
+ = os.getenv("OCF_RESKEY_disk_size", OCF_RESKEY_disk_size_default)
+OCF_RESKEY_active_hidden_dir \
+ = os.getenv("OCF_RESKEY_active_hidden_dir", OCF_RESKEY_active_hidden_dir_default)
+OCF_RESKEY_listen_address \
+ = os.getenv("OCF_RESKEY_listen_address", OCF_RESKEY_listen_address_default)
+OCF_RESKEY_base_port \
+ = os.getenv("OCF_RESKEY_base_port", OCF_RESKEY_base_port_default)
+OCF_RESKEY_checkpoint_interval \
+ = os.getenv("OCF_RESKEY_checkpoint_interval", OCF_RESKEY_checkpoint_interval_default)
+OCF_RESKEY_compare_timeout \
+ = os.getenv("OCF_RESKEY_compare_timeout", OCF_RESKEY_compare_timeout_default)
+OCF_RESKEY_expired_scan_cycle \
+ = os.getenv("OCF_RESKEY_expired_scan_cycle", OCF_RESKEY_expired_scan_cycle_default)
+OCF_RESKEY_use_filter_rewriter \
+ = os.getenv("OCF_RESKEY_use_filter_rewriter", OCF_RESKEY_use_filter_rewriter_default)
+OCF_RESKEY_vnet_hdr \
+ = os.getenv("OCF_RESKEY_vnet_hdr", OCF_RESKEY_vnet_hdr_default)
+OCF_RESKEY_max_disk_errors \
+ = os.getenv("OCF_RESKEY_max_disk_errors", OCF_RESKEY_max_disk_errors_default)
+OCF_RESKEY_monitor_timeout \
+ = os.getenv("OCF_RESKEY_monitor_timeout", OCF_RESKEY_monitor_timeout_default)
+OCF_RESKEY_yank_timeout \
+ = os.getenv("OCF_RESKEY_yank_timeout", OCF_RESKEY_yank_timeout_default)
+OCF_RESKEY_fail_fast_timeout \
+ = os.getenv("OCF_RESKEY_fail_fast_timeout", OCF_RESKEY_fail_fast_timeout_default)
+OCF_RESKEY_debug \
+ = os.getenv("OCF_RESKEY_debug", OCF_RESKEY_debug_default)
+
+ACTIVE_IMAGE = os.path.join(OCF_RESKEY_active_hidden_dir, \
+ OCF_RESOURCE_INSTANCE + "-active.qcow2")
+HIDDEN_IMAGE = os.path.join(OCF_RESKEY_active_hidden_dir, \
+ OCF_RESOURCE_INSTANCE + "-hidden.qcow2")
+
+QMP_SOCK = os.path.join(HA_RSCTMP, OCF_RESOURCE_INSTANCE + "-qmp.sock")
+HELPER_SOCK = os.path.join(HA_RSCTMP, OCF_RESOURCE_INSTANCE + "-helper.sock")
+COMP_SOCK = os.path.join(HA_RSCTMP, OCF_RESOURCE_INSTANCE + "-compare.sock")
+COMP_OUT_SOCK = os.path.join(HA_RSCTMP, OCF_RESOURCE_INSTANCE \
+ + "-comp_out.sock")
+
+PID_FILE = os.path.join(HA_RSCTMP, OCF_RESOURCE_INSTANCE + "-qemu.pid")
+
+QMP_LOG = os.path.join(OCF_RESKEY_log_dir, OCF_RESOURCE_INSTANCE + "-qmp.log")
+QEMU_LOG = os.path.join(OCF_RESKEY_log_dir, OCF_RESOURCE_INSTANCE + "-qemu.log")
+HELPER_LOG = os.path.join(OCF_RESKEY_log_dir, OCF_RESOURCE_INSTANCE \
+ + "-helper.log")
+
+START_TIME = time.time()
+did_yank = False
+
+# Exception raised only by this agent itself
+class Error(Exception):
+ pass
+
+def setup_constants():
+ # This function is called after the parameters were validated
+ global OCF_RESKEY_CRM_meta_timeout
+ if OCF_ACTION == "monitor":
+ OCF_RESKEY_CRM_meta_timeout = OCF_RESKEY_monitor_timeout
+
+ global MIGRATE_PORT, MIRROR_PORT, COMPARE_IN_PORT, NBD_PORT
+ MIGRATE_PORT = int(OCF_RESKEY_base_port)
+ MIRROR_PORT = int(OCF_RESKEY_base_port) + 1
+ COMPARE_IN_PORT = int(OCF_RESKEY_base_port) + 2
+ NBD_PORT = int(OCF_RESKEY_base_port) + 3
+
+ global QEMU_PRIMARY_CMDLINE
+ QEMU_PRIMARY_CMDLINE = \
+ ("'%(OCF_RESKEY_qemu_binary)s' %(OCF_RESKEY_options)s"
+ " -drive if=none,node-name=colo-disk0,driver=quorum,read-pattern=fifo,"
+ "vote-threshold=1,children.0=parent0"
+ " -qmp unix:'%(QMP_SOCK)s',server,nowait"
+ " -daemonize -D '%(QEMU_LOG)s' -pidfile '%(PID_FILE)s'") % globals()
+
+ global QEMU_SECONDARY_CMDLINE
+ QEMU_SECONDARY_CMDLINE = \
+ ("'%(OCF_RESKEY_qemu_binary)s' %(OCF_RESKEY_options)s"
+ " -chardev socket,id=red0,host='%(OCF_RESKEY_listen_address)s',"
+ "port=%(MIRROR_PORT)s,server,nowait,nodelay,yank"
+ " -chardev socket,id=red1,host='%(OCF_RESKEY_listen_address)s',"
+ "port=%(COMPARE_IN_PORT)s,server,nowait,nodelay,yank"
+ " -object filter-redirector,id=f1,netdev=hn0,queue=tx,indev=red0"
+ " -object filter-redirector,id=f2,netdev=hn0,queue=rx,outdev=red1") \
+ % globals()
+
+ if is_true(OCF_RESKEY_use_filter_rewriter):
+ QEMU_SECONDARY_CMDLINE += \
+ " -object filter-rewriter,id=rew0,netdev=hn0,queue=all"
+
+ QEMU_SECONDARY_CMDLINE += \
+ (" -drive if=none,node-name=childs0,top-id=colo-disk0,"
+ "driver=replication,mode=secondary,file.driver=qcow2,"
+ "file.file.filename='%(ACTIVE_IMAGE)s',file.backing.driver=qcow2,"
+ "file.backing.file.filename='%(HIDDEN_IMAGE)s',"
+ "file.backing.backing=parent0"
+ " -drive if=none,node-name=colo-disk0,driver=quorum,read-pattern=fifo,"
+ "vote-threshold=1,children.0=childs0"
+ " -incoming tcp:'%(OCF_RESKEY_listen_address)s':%(MIGRATE_PORT)s"
+ " -global migration.yank=true"
+ " -qmp unix:'%(QMP_SOCK)s',server,nowait"
+ " -daemonize -D '%(QEMU_LOG)s' -pidfile '%(PID_FILE)s'") % globals()
+
+def qemu_colo_meta_data():
+ print("""\
+<?xml version="1.0"?>
+<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
+<resource-agent name="colo">
+
+ <version>1.0</version>
+ <longdesc lang="en">
+Resource agent for qemu COLO. (https://wiki.qemu.org/Features/COLO)
+
+After defining the master/slave instance, the master score has to be
+manually set to show which node has up-to-date data. So you copy your
+image to one host (and create empty images on the other host(s)) and then
+run "crm_master -r name_of_your_primitive -v 10" on that host.
+Also, you have to set 'notify=true' in the metadata attributes when
+defining the master/slave instance.
+
+Note:
+-If the instance is stopped cluster-wide, the resource agent will do a
+clean shutdown. Set the demote timeout to the time it takes for your
+guest to shut down.
+-Colo replication is started from the monitor action. Set the monitor
+timeout to at least the time it takes for replication to start. You can
+set the monitor_timeout parameter for a soft timeout, which the resource
+agent tries to satisfy.
+-The resource agent may notify Pacemaker about peer failure;
+these failures will show up with exitreason="Simulated failure".
+ </longdesc>
+ <shortdesc lang="en">Qemu COLO</shortdesc>
+
+ <parameters>
+
+ <parameter name="qemu_binary" unique="0" required="0">
+ <longdesc lang="en">qemu binary to use</longdesc>
+ <shortdesc lang="en">qemu binary</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_qemu_binary_default + """\"/>
+ </parameter>
+
+ <parameter name="qemu_img_binary" unique="0" required="0">
+ <longdesc lang="en">qemu-img binary to use</longdesc>
+ <shortdesc lang="en">qemu-img binary</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_qemu_img_binary_default + """\"/>
+ </parameter>
+
+ <parameter name="log_dir" unique="0" required="0">
+ <longdesc lang="en">Directory to place logs in</longdesc>
+ <shortdesc lang="en">Log directory</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_log_dir_default + """\"/>
+ </parameter>
+
+ <parameter name="options" unique="0" required="1">
+ <longdesc lang="en">
+Options to pass to qemu. These will be passed alongside COLO specific
+options, so you need to follow these conventions: The netdev should have
+id=hn0 and the disk controller drive=colo-disk0. The image node should
+have id/node-name=parent0, but should not be connected to the guest.
+Example:
+-vnc :0 -enable-kvm -cpu qemu64,+kvmclock -m 512 -netdev bridge,id=hn0
+-device e1000,netdev=hn0 -device virtio-blk,drive=colo-disk0
+-drive if=none,id=parent0,format=qcow2,file=/mnt/vms/vm01.qcow2
+ </longdesc>
+ <shortdesc lang="en">Options to pass to qemu.</shortdesc>
+ </parameter>
+
+ <parameter name="disk_size" unique="0" required="1">
+ <longdesc lang="en">Disk size of the image</longdesc>
+ <shortdesc lang="en">Disk size of the image</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_disk_size_default + """\"/>
+ </parameter>
+
+ <parameter name="active_hidden_dir" unique="0" required="1">
+ <longdesc lang="en">
+Directory where the active and hidden images will be stored. It is
+recommended to put this on a ramdisk.
+ </longdesc>
+ <shortdesc lang="en">Path to active and hidden images</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_active_hidden_dir_default + """\"/>
+ </parameter>
+
+ <parameter name="listen_address" unique="0" required="0">
+ <longdesc lang="en">Address to listen on.</longdesc>
+ <shortdesc lang="en">Listen address</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_listen_address_default + """\"/>
+ </parameter>
+
+ <parameter name="base_port" unique="1" required="0">
+ <longdesc lang="en">
+4 tcp ports that are unique for each instance. (base_port to base_port + 3)
+ </longdesc>
+ <shortdesc lang="en">Ports to use</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_base_port_default + """\"/>
+ </parameter>
+
+ <parameter name="checkpoint_interval" unique="0" required="0">
+ <longdesc lang="en">
+Interval for regular checkpoints in milliseconds.
+ </longdesc>
+ <shortdesc lang="en">Interval for regular checkpoints</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_checkpoint_interval_default + """\"/>
+ </parameter>
+
+ <parameter name="compare_timeout" unique="0" required="0">
+ <longdesc lang="en">
+Maximum time to hold a primary packet if the secondary hasn't sent it yet,
+in milliseconds.
+You should also adjust "expired_scan_cycle" accordingly.
+ </longdesc>
+ <shortdesc lang="en">Compare timeout</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_compare_timeout_default + """\"/>
+ </parameter>
+
+ <parameter name="expired_scan_cycle" unique="0" required="0">
+ <longdesc lang="en">
+Interval for checking for expired primary packets in milliseconds.
+ </longdesc>
+ <shortdesc lang="en">Expired packet check interval</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_expired_scan_cycle_default + """\"/>
+ </parameter>
+
+ <parameter name="use_filter_rewriter" unique="0" required="0">
+ <longdesc lang="en">
+Use filter-rewriter to increase similarity between the VMs.
+ </longdesc>
+ <shortdesc lang="en">Use filter-rewriter</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_use_filter_rewriter_default + """\"/>
+ </parameter>
+
+ <parameter name="vnet_hdr" unique="0" required="0">
+ <longdesc lang="en">
+Set this to true if your system supports vnet_hdr and you enabled
+it on the tap netdev.
+ </longdesc>
+ <shortdesc lang="en">vnet_hdr support</shortdesc>
+ <content type="string" default=\"""" \
+ + OCF_RESKEY_vnet_hdr_default + """\"/>
+ </parameter>
+
+ <parameter name="max_disk_errors" unique="0" required="0">
+ <longdesc lang="en">
+Maximum disk read errors per monitor interval before marking the resource
+as failed. A write error is always fatal except if the value is 0.
+A value of 0 will disable disk error handling.
+Primary disk errors are only handled if there is a healthy secondary.
+ </longdesc>
+ <shortdesc lang="en">Maximum disk errors</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_max_disk_errors_default + """\"/>
+ </parameter>
+
+ <parameter name="monitor_timeout" unique="0" required="0">
+ <longdesc lang="en">
+Soft timeout for monitor, in milliseconds.
+Must be lower than the monitor action timeout.
+ </longdesc>
+ <shortdesc lang="en">Monitor timeout</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_monitor_timeout_default + """\"/>
+ </parameter>
+
+ <parameter name="yank_timeout" unique="0" required="0">
+ <longdesc lang="en">
+Timeout for QMP commands after which to execute the "yank" command,
+in milliseconds.
+Must be lower than any of the action timeouts.
+ </longdesc>
+ <shortdesc lang="en">Yank timeout</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_yank_timeout_default + """\"/>
+ </parameter>
+
+ <parameter name="fail_fast_timeout" unique="0" required="0">
+ <longdesc lang="en">
+Timeout for QMP commands used in the stop and demote actions to speed
+up recovery from a hanging qemu, in milliseconds.
+Must be lower than any of the action timeouts.
+ </longdesc>
+ <shortdesc lang="en">Timeout for fast paths</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_fail_fast_timeout_default + """\"/>
+ </parameter>
+
+ <parameter name="debug" unique="0" required="0">
+ <longdesc lang="en">
+Control debugging:
+0: disable debugging
+1: log debug messages and qmp commands
+2: + dump core of hanging qemu
+ </longdesc>
+ <shortdesc lang="en">Control debugging</shortdesc>
+ <content type="integer" default=\"""" \
+ + OCF_RESKEY_debug_default + """\"/>
+ </parameter>
+
+ </parameters>
+
+ <actions>
+ <action name="start" timeout="30s" />
+ <action name="stop" timeout="10s" />
+ <action name="monitor" timeout="30s" \
+ interval="1000ms" depth="0" role="Slave" />
+ <action name="monitor" timeout="30s" \
+ interval="1001ms" depth="0" role="Master" />
+ <action name="notify" timeout="30s" />
+ <action name="promote" timeout="30s" />
+ <action name="demote" timeout="120s" />
+ <action name="meta-data" timeout="5s" />
+ <action name="validate-all" timeout="20s" />
+ </actions>
+
+</resource-agent>
+""")
+
+def logs_open():
+ global log
+ log = logging.getLogger(OCF_RESOURCE_INSTANCE)
+ if int(OCF_RESKEY_debug) >= 1 or HA_DEBUG != "0":
+ log.setLevel(logging.DEBUG)
+ else:
+ log.setLevel(logging.INFO)
+
+ formatter = logging.Formatter("(%(name)s) %(levelname)s: %(message)s")
+
+ if sys.stdout.isatty():
+ handler = logging.StreamHandler(stream=sys.stderr)
+ handler.setFormatter(formatter)
+ log.addHandler(handler)
+
+ if HA_LOGFACILITY:
+ handler = logging.handlers.SysLogHandler("/dev/log")
+ handler.setFormatter(formatter)
+ log.addHandler(handler)
+
+ if HA_LOGFILE:
+ handler = logging.FileHandler(HA_LOGFILE)
+ handler.setFormatter(formatter)
+ log.addHandler(handler)
+
+ if HA_DEBUGLOG and HA_DEBUGLOG != HA_LOGFILE:
+ handler = logging.FileHandler(HA_DEBUGLOG)
+ handler.setFormatter(formatter)
+ log.addHandler(handler)
+
+ global qmp_log
+ qmp_log = logging.getLogger("qmp_log")
+ qmp_log.setLevel(logging.DEBUG)
+ formatter = logging.Formatter("%(message)s")
+
+ if int(OCF_RESKEY_debug) >= 1:
+ handler = logging.handlers.WatchedFileHandler(QMP_LOG)
+ handler.setFormatter(formatter)
+ qmp_log.addHandler(handler)
+ else:
+ handler = logging.NullHandler()
+ qmp_log.addHandler(handler)
+
+def rotate_logfile(logfile, numlogs):
+ numlogs -= 1
+ for n in range(numlogs, -1, -1):
+ file = logfile
+ if n != 0:
+ file = "%s.%s" % (file, n)
+ if os.path.exists(file):
+ if n == numlogs:
+ os.remove(file)
+ else:
+ newname = "%s.%s" % (logfile, n + 1)
+ os.rename(file, newname)
+
+def is_writable(file):
+ return os.access(file, os.W_OK)
+
+def is_executable_file(file):
+ return os.path.isfile(file) and os.access(file, os.X_OK)
+
+def is_true(var):
+ return re.match("yes|true|1|YES|TRUE|True|ja|on|ON", str(var)) != None
+
+# Check if the binary exists and is executable
+def check_binary(binary):
+ if is_executable_file(binary):
+ return True
+ PATH = os.getenv("PATH", os.defpath)
+ for dir in PATH.split(os.pathsep):
+ if is_executable_file(os.path.join(dir, binary)):
+ return True
+ log.error("binary \"%s\" doesn't exist or is not executable" % binary)
+ return False
+
+def run_command(commandline):
+ proc = subprocess.Popen(commandline, shell=True, stdout=subprocess.PIPE,
+ stderr=subprocess.STDOUT, universal_newlines=True)
+ stdout, stderr = proc.communicate()
+ if proc.returncode != 0:
+ log.error("command \"%s\" failed with code %s:\n%s" \
+ % (commandline, proc.returncode, stdout))
+ raise Error()
+
+# Functions for setting and getting the master score to tell Pacemaker which
+# host has the most recent data
+def set_master_score(score):
+ if score == 0:
+ run_command("crm_master -q -l forever -D")
+ else:
+ run_command("crm_master -q -l forever -v %s" % score)
+
+def set_remote_master_score(remote, score):
+ if score == 0:
+ run_command("crm_master -q -l forever -N '%s' -D" % remote)
+ else:
+ run_command("crm_master -q -l forever -N '%s' -v %s" % (remote, score))
+
+def get_master_score():
+ proc = subprocess.Popen("crm_master -q -G", shell=True,
+ stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
+ universal_newlines=True)
+ stdout, stderr = proc.communicate()
+ if proc.returncode != 0:
+ return 0
+ else:
+ return int(str.strip(stdout))
+
+def get_remote_master_score(remote):
+ proc = subprocess.Popen("crm_master -q -N '%s' -G" % remote, shell=True,
+ stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
+ universal_newlines=True)
+ stdout, stderr = proc.communicate()
+ if proc.returncode != 0:
+ return 0
+ else:
+ return int(str.strip(stdout))
+
+# Tell Pacemaker that the remote resource failed
+def report_remote_failure(remote):
+ run_command("crm_resource --resource '%s' --fail --node '%s'"
+ % (OCF_RESOURCE_INSTANCE, remote))
+
+def recv_line(fd):
+ line = ""
+ while True:
+ tmp = fd.recv(1).decode()
+ line += tmp
+ if tmp == "\n" or len(tmp) == 0:
+ break
+ return line
+
+# Filter out events
+def read_answer(fd):
+ while True:
+ line = recv_line(fd)
+ qmp_log.debug(str.strip(line))
+
+ if len(line) == 0:
+ log.error("qmp connection closed")
+ raise Error()
+
+ answer = json.loads(line)
+ # Ignore everything else
+ if "return" in answer or "error" in answer:
+ break
+ return answer
+
+# Execute one or more qmp commands
+def qmp_execute(fd, commands, ignore_error = False, do_yank = True):
+ for command in commands:
+ if not command:
+ continue
+
+ try:
+ to_send = json.dumps(command)
+ fd.sendall(str.encode(to_send + "\n"))
+ qmp_log.debug(to_send)
+
+ answer = read_answer(fd)
+ except Exception as e:
+ if isinstance(e, socket.timeout) and do_yank:
+ log.warning("Command timed out, trying to unfreeze qemu")
+ new_timeout = max(2, (int(OCF_RESKEY_CRM_meta_timeout)/1000) \
+ - (time.time() - START_TIME) - 2)
+ fd.settimeout(new_timeout)
+ answer = yank(fd)
+ # Read answer of timed-out command
+ try:
+ if "id" in answer:
+ answer = read_answer(fd)
+ else:
+ read_answer(fd)
+ except socket.error as e:
+ log.error("while reading answer of timed out command: "
+ "%s\n%s" % (json.dumps(command), e))
+ raise Error()
+ elif isinstance(e, socket.timeout) or isinstance(e, socket.error):
+ log.error("while executing qmp command: %s\n%s" \
+ % (json.dumps(command), e))
+ raise Error()
+ else:
+ raise
+
+ if not ignore_error and ("error" in answer):
+ log.error("qmp command returned error:\n%s\n%s" \
+ % (json.dumps(command), json.dumps(answer)))
+ raise Error()
+
+ return answer
+
+# Open qemu qmp connection
+def qmp_open(fail_fast = False):
+ if fail_fast:
+ timeout = int(OCF_RESKEY_fail_fast_timeout)/1000
+ else:
+ timeout = int(OCF_RESKEY_yank_timeout)/1000
+
+ try:
+ fd = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+ fd.settimeout(timeout)
+ fd.connect(HELPER_SOCK)
+ except socket.error as e:
+ log.error("while connecting to helper socket: %s" % e)
+ raise Error()
+
+ return fd
+
+def yank(fd):
+ global did_yank
+ did_yank = True
+ answer = qmp_execute(fd, [{"exec-oob": "yank", "id": "yank0"}], \
+ do_yank = False, ignore_error = True)
+ return answer
+
+def oob_helper_exec(client, cmd, events):
+ if cmd["exec-helper"] == "get-events":
+ event = cmd["arguments"]["event"]
+ if (event in events):
+ to_send = json.dumps({"return": events[event]})
+ client.sendall(str.encode(to_send + "\n"))
+ else:
+ client.sendall(str.encode("{\"return\": []}\n"))
+ elif cmd["exec-helper"] == "clear-events":
+ events.clear()
+ client.sendall(str.encode("{\"return\": {}}\n"))
+ else:
+ client.sendall(str.encode("{\"error\": \"Unknown helper command\"}\n"))
+
+def oob_helper(qmp, server):
+ max_events = max(100, int(OCF_RESKEY_max_disk_errors))
+ events = {}
+ try:
+ os.close(0)
+ os.close(1)
+ os.close(2)
+ logging.shutdown()
+
+ client = None
+ while True:
+ if client:
+ watch = [client, qmp]
+ else:
+ watch = [server, qmp]
+ sel = select.select(watch, [], [])
+ try:
+ if client in sel[0]:
+ cmd = recv_line(client)
+ if len(cmd) == 0:
+ # client socket was closed: wait for new client
+ client.close()
+ client = None
+ continue
+ else:
+ parsed = json.loads(cmd)
+ if ("exec-helper" in parsed):
+ oob_helper_exec(client, parsed, events)
+ else:
+ qmp.sendall(str.encode(cmd))
+ if qmp in sel[0]:
+ answer = recv_line(qmp)
+ if len(answer) == 0:
+ # qmp socket was closed: qemu died, exit
+ os._exit(0)
+ else:
+ parsed = json.loads(answer)
+ if ("event" in parsed):
+ event = parsed["event"]
+ if (event not in events):
+ events[event] = []
+ if len(events[event]) < max_events:
+ events[event].append(parsed)
+ elif client:
+ client.sendall(str.encode(answer))
+ if server in sel[0]:
+ client, client_addr = server.accept()
+ except socket.error as e:
+ pass
+ except Exception as e:
+ with open(HELPER_LOG, 'a') as f:
+ f.write(str(e) + "\n")
+ os._exit(0)
+
+# Fork off helper to keep the oob qmp connection open and to catch events
+def oob_helper_open():
+ try:
+ qmp = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+ qmp.connect(QMP_SOCK)
+ qmp_execute(qmp, [{"execute": "qmp_capabilities", "arguments": {"enable": ["oob"]}}])
+ except socket.error as e:
+ log.error("while connecting to qmp socket: %s" % e)
+ raise Error()
+
+ try:
+ if os.path.exists(HELPER_SOCK):
+ os.unlink(HELPER_SOCK)
+ server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+ server.bind(HELPER_SOCK)
+ server.listen(1)
+ except socket.error as e:
+ log.error("while opening helper socket: %s" % e)
+ raise Error()
+
+ qmp.set_inheritable(True)
+ server.set_inheritable(True)
+
+ try:
+ pid = os.fork()
+ except OSError as e:
+ log.error("while forking off oob helper: %s" % e)
+ raise Error()
+
+ if pid == 0:
+ # child 1: exits after forking off child 2, so init (pid 1)
+ # becomes responsible for the grandchild
+ os.setsid()
+
+ pid = os.fork()
+
+ if pid == 0:
+ # child 2: here the actual work is being done
+ oob_helper(qmp, server)
+ else:
+ os._exit(0)
+
+ qmp.close()
+ server.close()
+
+# Get the host of the nbd node
+def qmp_get_nbd_remote(fd):
+ block_nodes = qmp_execute(fd, [{"execute": "query-named-block-nodes", "arguments": {"flat": True}}])
+ for node in block_nodes["return"]:
+ if node["node-name"] == "nbd0":
+ url = str(node["image"]["filename"])
+ return str.split(url, "//")[1].split("/")[0].split(":")[0]
+ return None
+
+# Check if we are currently resyncing
+def qmp_check_resync(fd):
+ answer = qmp_execute(fd, [{"execute": "query-block-jobs"}])
+ for job in answer["return"]:
+ if job["device"] == "resync":
+ return job
+ return None
+
+def qmp_start_resync(fd, remote):
+ answer = qmp_execute(fd, [{"execute": "blockdev-add", "arguments": {"driver": "nbd", "node-name": "nbd0", "server": {"type": "inet", "host": str(remote), "port": str(NBD_PORT)}, "export": "parent0", "detect-zeroes": "on", "yank": True}}], ignore_error = True)
+ if "error" in answer:
+ log.warning("Failed to add nbd node: %s" % json.dumps(answer))
+ log.warning("Assuming peer failure")
+ report_remote_failure(remote)
+ else:
+ qmp_execute(fd, [{"execute": "blockdev-mirror", "arguments": {"device": "colo-disk0", "job-id": "resync", "target": "nbd0", "sync": "full", "on-target-error": "report", "on-source-error": "ignore", "auto-dismiss": False}}])
+
+def qmp_cancel_resync(fd):
+ timeout = START_TIME + (int(OCF_RESKEY_yank_timeout)/1000)
+
+ if qmp_check_resync(fd)["status"] != "concluded":
+ qmp_execute(fd, [{"execute": "block-job-cancel", "arguments": {"device": "resync", "force": True}}], ignore_error = True)
+ # Wait for the block-job to finish
+ while time.time() < timeout:
+ if qmp_check_resync(fd)["status"] == "concluded":
+ break
+ log.debug("Waiting for block-job to finish in qmp_cancel_resync()")
+ time.sleep(1)
+ else:
+ log.warning("Timed out, trying to unfreeze qemu")
+ yank(fd)
+ while qmp_check_resync(fd)["status"] != "concluded":
+ log.debug("Waiting for block-job to finish")
+ time.sleep(1)
+
+ qmp_execute(fd, [
+ {"execute": "block-job-dismiss", "arguments": {"id": "resync"}},
+ {"execute": "blockdev-del", "arguments": {"node-name": "nbd0"}}
+ ])
+
+def qmp_start_colo(fd, remote):
+ # Check if we have a filter-rewriter
+ answer = qmp_execute(fd, [{"execute": "qom-list", "arguments": {"path": "/objects/rew0"}}], ignore_error = True)
+ if "error" in answer:
+ if answer["error"]["class"] == "DeviceNotFound":
+ have_filter_rewriter = False
+ else:
+ log.error("while checking for filter-rewriter:\n%s" \
+ % json.dumps(answer))
+ raise Error()
+ else:
+ have_filter_rewriter = True
+
+ # Pause VM and cancel resync
+ qmp_execute(fd, [
+ {"execute": "stop"},
+ {"execute": "block-job-cancel", "arguments": {"device": "resync"}}
+ ])
+
+ # Wait for the block-job to finish
+ while qmp_check_resync(fd)["status"] != "concluded":
+ log.debug("Waiting for block-job to finish in qmp_start_colo()")
+ time.sleep(1)
+
+ # Add nbd to the quorum node
+ qmp_execute(fd, [
+ {"execute": "block-job-dismiss", "arguments": {"id": "resync"}},
+ {"execute": "x-blockdev-change", "arguments": {"parent": "colo-disk0", "node": "nbd0"}}
+ ])
+
+ # Connect mirror and compare_in to secondary
+ qmp_execute(fd, [
+ {"execute": "chardev-add", "arguments": {"id": "comp_pri_in0<", "backend": {"type": "socket", "data": {"addr": {"type": "unix", "data": {"path": str(COMP_SOCK)}}, "server": True}}}},
+ {"execute": "chardev-add", "arguments": {"id": "comp_pri_in0>", "backend": {"type": "socket", "data": {"addr": {"type": "unix", "data": {"path": str(COMP_SOCK)}}, "server": False}}}},
+ {"execute": "chardev-add", "arguments": {"id": "comp_out0<", "backend": {"type": "socket", "data": {"addr": {"type": "unix", "data": {"path": str(COMP_OUT_SOCK)}}, "server": True}}}},
+ {"execute": "chardev-add", "arguments": {"id": "comp_out0>", "backend": {"type": "socket", "data": {"addr": {"type": "unix", "data": {"path": str(COMP_OUT_SOCK)}}, "server": False}}}},
+ {"execute": "chardev-add", "arguments": {"id": "mirror0", "backend": {"type": "socket", "data": {"addr": {"type": "inet", "data": {"host": str(remote), "port": str(MIRROR_PORT)}}, "server": False, "nodelay": True, "yank": True}}}},
+ {"execute": "chardev-add", "arguments": {"id": "comp_sec_in0", "backend": {"type": "socket", "data": {"addr": {"type": "inet", "data": {"host": str(remote), "port": str(COMPARE_IN_PORT)}}, "server": False, "nodelay": True, "yank": True}}}}
+ ])
+
+ # Add the COLO filters
+ vnet_hdr_support = is_true(OCF_RESKEY_vnet_hdr)
+ if have_filter_rewriter:
+ qmp_execute(fd, [
+ {"execute": "object-add", "arguments": {"qom-type": "filter-mirror", "id": "m0", "props": {"insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "tx", "outdev": "mirror0", "vnet_hdr_support": vnet_hdr_support}}},
+ {"execute": "object-add", "arguments": {"qom-type": "filter-redirector", "id": "redire0", "props": {"insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "rx", "indev": "comp_out0<", "vnet_hdr_support": vnet_hdr_support}}},
+ {"execute": "object-add", "arguments": {"qom-type": "filter-redirector", "id": "redire1", "props": {"insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "rx", "outdev": "comp_pri_in0<", "vnet_hdr_support": vnet_hdr_support}}},
+ {"execute": "object-add", "arguments": {"qom-type": "iothread", "id": "iothread1"}},
+ {"execute": "object-add", "arguments": {"qom-type": "colo-compare", "id": "comp0", "props": {"primary_in": "comp_pri_in0>", "secondary_in": "comp_sec_in0", "outdev": "comp_out0>", "iothread": "iothread1", "compare_timeout": int(OCF_RESKEY_compare_timeout), "expired_scan_cycle": int(OCF_RESKEY_expired_scan_cycle), "vnet_hdr_support": vnet_hdr_support}}}
+ ])
+ else:
+ qmp_execute(fd, [
+ {"execute": "object-add", "arguments": {"qom-type": "filter-mirror", "id": "m0", "props": {"netdev": "hn0", "queue": "tx", "outdev": "mirror0", "vnet_hdr_support": vnet_hdr_support}}},
+ {"execute": "object-add", "arguments": {"qom-type": "filter-redirector", "id": "redire0", "props": {"netdev": "hn0", "queue": "rx", "indev": "comp_out0<", "vnet_hdr_support": vnet_hdr_support}}},
+ {"execute": "object-add", "arguments": {"qom-type": "filter-redirector", "id": "redire1", "props": {"netdev": "hn0", "queue": "rx", "outdev": "comp_pri_in0<", "vnet_hdr_support": vnet_hdr_support}}},
+ {"execute": "object-add", "arguments": {"qom-type": "iothread", "id": "iothread1"}},
+ {"execute": "object-add", "arguments": {"qom-type": "colo-compare", "id": "comp0", "props": {"primary_in": "comp_pri_in0>", "secondary_in": "comp_sec_in0", "outdev": "comp_out0>", "iothread": "iothread1", "compare_timeout": int(OCF_RESKEY_compare_timeout), "expired_scan_cycle": int(OCF_RESKEY_expired_scan_cycle), "vnet_hdr_support": vnet_hdr_support}}}
+ ])
+
+ # Start COLO
+ qmp_execute(fd, [
+ {"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "x-colo", "state": True }] }},
+ {"execute": "migrate-set-parameters" , "arguments": {"x-checkpoint-delay": int(OCF_RESKEY_checkpoint_interval), "yank": True}},
+ {"execute": "migrate", "arguments": {"uri": "tcp:%s:%s" % (remote, MIGRATE_PORT)}}
+ ])
+
+ # Wait for COLO to start
+ while qmp_execute(fd, [{"execute": "query-status"}])["return"]["status"] \
+ == "paused" \
+ or qmp_execute(fd, [{"execute": "query-colo-status"}])["return"]["mode"] \
+ != "primary" :
+ log.debug("Waiting for colo replication to start")
+ time.sleep(1)
+
+def qmp_primary_failover(fd):
+ qmp_execute(fd, [
+ {"execute": "object-del", "arguments": {"id": "m0"}},
+ {"execute": "object-del", "arguments": {"id": "redire0"}},
+ {"execute": "object-del", "arguments": {"id": "redire1"}},
+ {"execute": "x-colo-lost-heartbeat"},
+ {"execute": "object-del", "arguments": {"id": "comp0"}},
+ {"execute": "object-del", "arguments": {"id": "iothread1"}},
+ {"execute": "x-blockdev-change", "arguments": {"parent": "colo-disk0", "child": "children.1"}},
+ {"execute": "blockdev-del", "arguments": {"node-name": "nbd0"}},
+ {"execute": "chardev-remove", "arguments": {"id": "mirror0"}},
+ {"execute": "chardev-remove", "arguments": {"id": "comp_sec_in0"}},
+ {"execute": "chardev-remove", "arguments": {"id": "comp_pri_in0>"}},
+ {"execute": "chardev-remove", "arguments": {"id": "comp_pri_in0<"}},
+ {"execute": "chardev-remove", "arguments": {"id": "comp_out0>"}},
+ {"execute": "chardev-remove", "arguments": {"id": "comp_out0<"}}
+ ])
+
+def qmp_secondary_failover(fd):
+ qmp_execute(fd, [
+ {"execute": "nbd-server-stop"},
+ {"execute": "object-del", "arguments": {"id": "f2"}},
+ {"execute": "object-del", "arguments": {"id": "f1"}},
+ {"execute": "x-colo-lost-heartbeat"},
+ {"execute": "chardev-remove", "arguments": {"id": "red1"}},
+ {"execute": "chardev-remove", "arguments": {"id": "red0"}},
+ ])
+
+# Check qemu health and colo role
+def qmp_check_state(fd, fail_fast = False):
+ answer = qmp_execute(fd, [{"execute": "query-status"}], \
+ do_yank = not fail_fast)
+ vm_status = answer["return"]
+
+ answer = qmp_execute(fd, [{"execute": "query-colo-status"}], \
+ do_yank = not fail_fast)
+ colo_status = answer["return"]
+
+ if vm_status["status"] == "inmigrate":
+ role = OCF_SUCCESS
+ replication = OCF_NOT_RUNNING
+
+ elif (vm_status["status"] == "running" \
+ or vm_status["status"] == "colo" \
+ or vm_status["status"] == "finish-migrate") \
+ and colo_status["mode"] == "none" \
+ and (colo_status["reason"] == "request" \
+ or colo_status["reason"] == "none"):
+ role = OCF_RUNNING_MASTER
+ replication = OCF_NOT_RUNNING
+
+ elif (vm_status["status"] == "running" \
+ or vm_status["status"] == "colo" \
+ or vm_status["status"] == "finish-migrate") \
+ and colo_status["mode"] == "secondary":
+ role = OCF_SUCCESS
+ replication = OCF_SUCCESS
+
+ elif (vm_status["status"] == "running" \
+ or vm_status["status"] == "colo" \
+ or vm_status["status"] == "finish-migrate") \
+ and colo_status["mode"] == "primary":
+ role = OCF_RUNNING_MASTER
+ replication = OCF_SUCCESS
+
+ else:
+ log.error("Invalid qemu status:\nvm status: %s\ncolo status: %s" \
+ % (vm_status, colo_status))
+ role = OCF_ERR_GENERIC
+ replication = OCF_ERR_GENERIC
+
+ return role, replication
+
+# Sanity checks: check parameters, files, binaries, etc.
+def qemu_colo_validate_all():
+ # Check resource parameters
+ if not str.isdigit(OCF_RESKEY_base_port):
+ log.error("base_port needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_checkpoint_interval):
+ log.error("checkpoint_interval needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_compare_timeout):
+ log.error("compare_timeout needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_expired_scan_cycle):
+ log.error("expired_scan_cycle needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_max_disk_errors):
+ log.error("max_disk_errors needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_monitor_timeout):
+ log.error("monitor_timeout needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_yank_timeout):
+ log.error("yank_timeout needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_fail_fast_timeout):
+ log.error("fail_fast_timeout needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not str.isdigit(OCF_RESKEY_debug):
+ log.error("debug needs to be a number")
+ return OCF_ERR_CONFIGURED
+
+ if not OCF_RESKEY_active_hidden_dir:
+ log.error("active_hidden_dir needs to be specified")
+ return OCF_ERR_CONFIGURED
+
+ if not OCF_RESKEY_disk_size:
+ log.error("disk_size needs to be specified")
+ return OCF_ERR_CONFIGURED
+
+ # Check resource meta configuration
+ if OCF_ACTION != "stop":
+ if OCF_RESKEY_CRM_meta_master_max != 1:
+ log.error("only one master allowed")
+ return OCF_ERR_CONFIGURED
+
+ if OCF_RESKEY_CRM_meta_clone_max > 2:
+ log.error("maximum 2 clones allowed")
+ return OCF_ERR_CONFIGURED
+
+ if OCF_RESKEY_CRM_meta_master_node_max != 1:
+ log.error("only one master per node allowed")
+ return OCF_ERR_CONFIGURED
+
+ if OCF_RESKEY_CRM_meta_clone_node_max != 1:
+ log.error("only one clone per node allowed")
+ return OCF_ERR_CONFIGURED
+
+ # Check if notify is enabled
+ if OCF_ACTION != "stop" and OCF_ACTION != "monitor":
+ if not is_true(OCF_RESKEY_CRM_meta_notify) \
+ and not OCF_RESKEY_CRM_meta_notify_start_uname:
+ log.error("notify needs to be enabled")
+ return OCF_ERR_CONFIGURED
+
+ # Check that globally-unique is disabled
+ if is_true(OCF_RESKEY_CRM_meta_globally_unique):
+ log.error("globally-unique needs to be disabled")
+ return OCF_ERR_CONFIGURED
+
+ # Check binaries
+ if not check_binary(OCF_RESKEY_qemu_binary):
+ return OCF_ERR_INSTALLED
+
+ if not check_binary(OCF_RESKEY_qemu_img_binary):
+ return OCF_ERR_INSTALLED
+
+ # Check paths and files
+ if not is_writable(OCF_RESKEY_active_hidden_dir) \
+ or not os.path.isdir(OCF_RESKEY_active_hidden_dir):
+ log.error("active and hidden image directory missing or not writable")
+ return OCF_ERR_PERM
+
+ return OCF_SUCCESS
+
+# Check if qemu is running
+def check_pid():
+ if not os.path.exists(PID_FILE):
+ return OCF_NOT_RUNNING, None
+
+ fd = open(PID_FILE, "r")
+ pid = int(str.strip(fd.readline()))
+ fd.close()
+ try:
+ os.kill(pid, 0)
+ except OSError:
+ log.info("qemu is not running")
+ return OCF_NOT_RUNNING, pid
+ else:
+ return OCF_SUCCESS, pid
+
+def qemu_colo_monitor(fail_fast = False):
+ status, pid = check_pid()
+ if status != OCF_SUCCESS:
+ return status, OCF_NOT_RUNNING
+
+ fd = qmp_open(fail_fast)
+
+ role, replication = qmp_check_state(fd, fail_fast)
+ if role != OCF_SUCCESS and role != OCF_RUNNING_MASTER:
+ return role, replication
+
+ colo_events = qmp_execute(fd, [{"exec-helper": "get-events", "arguments": {"event": "COLO_EXIT"}}], do_yank = False)
+ for event in colo_events["return"]:
+ if event["data"]["reason"] == "error":
+ if replication == OCF_SUCCESS:
+ replication = OCF_ERR_GENERIC
+
+ if did_yank and replication == OCF_SUCCESS:
+ replication = OCF_ERR_GENERIC
+
+ peer_disk_errors = 0
+ local_disk_errors = 0
+ quorum_events = qmp_execute(fd, [{"exec-helper": "get-events", "arguments": {"event": "QUORUM_REPORT_BAD"}}], do_yank = False)
+ for event in quorum_events["return"]:
+ if event["data"]["node-name"] == "nbd0":
+ if event["data"]["type"] == "read":
+ peer_disk_errors += 1
+ else:
+ peer_disk_errors += int(OCF_RESKEY_max_disk_errors)
+ else:
+ if event["data"]["type"] == "read":
+ local_disk_errors += 1
+ else:
+ local_disk_errors += int(OCF_RESKEY_max_disk_errors)
+
+ if int(OCF_RESKEY_max_disk_errors) != 0:
+ if peer_disk_errors >= int(OCF_RESKEY_max_disk_errors):
+ log.error("Peer disk error")
+ if replication == OCF_SUCCESS:
+ replication = OCF_ERR_GENERIC
+
+ if local_disk_errors >= int(OCF_RESKEY_max_disk_errors):
+ if replication == OCF_SUCCESS:
+ log.error("Local disk error")
+ role = OCF_ERR_GENERIC
+ else:
+ log.warning("Local disk error")
+
+ if not fail_fast and OCF_RESKEY_CRM_meta_interval != 0:
+ # This isn't a probe monitor
+ block_job = qmp_check_resync(fd)
+ if block_job:
+ if "error" in block_job:
+ log.error("resync error: %s" % block_job["error"])
+ peer = qmp_get_nbd_remote(fd)
+ qmp_cancel_resync(fd)
+ report_remote_failure(peer)
+ elif block_job["ready"] == True:
+ log.info("resync done, starting colo")
+ peer = qmp_get_nbd_remote(fd)
+ qmp_start_colo(fd, peer)
+ # COLO started, our secondary now can be promoted if the
+ # primary fails
+ set_remote_master_score(peer, 100)
+ else:
+ pct_done = (float(block_job["offset"]) \
+ / float(block_job["len"])) * 100
+ log.info("resync %.1f%% done" % pct_done)
+ else:
+ if replication == OCF_ERR_GENERIC:
+ if role == OCF_RUNNING_MASTER:
+ log.error("Replication error")
+ peer = qmp_get_nbd_remote(fd)
+ if peer:
+ report_remote_failure(peer)
+ else:
+ log.warning("Replication error")
+ qmp_execute(fd, [{"exec-helper": "clear-events"}], do_yank = False)
+
+ fd.close()
+
+ return role, replication
+
+def qemu_colo_start():
+ if check_pid()[0] == OCF_SUCCESS:
+ log.info("qemu is already running")
+ return OCF_SUCCESS
+
+ run_command("'%s' create -q -f qcow2 %s %s" \
+ % (OCF_RESKEY_qemu_img_binary, ACTIVE_IMAGE, OCF_RESKEY_disk_size))
+ run_command("'%s' create -q -f qcow2 %s %s" \
+ % (OCF_RESKEY_qemu_img_binary, HIDDEN_IMAGE, OCF_RESKEY_disk_size))
+
+ rotate_logfile(QMP_LOG, 8)
+ rotate_logfile(QEMU_LOG, 8)
+ run_command(QEMU_SECONDARY_CMDLINE)
+ oob_helper_open()
+
+ fd = qmp_open()
+ qmp_execute(fd, [
+ {"execute": "nbd-server-start", "arguments": {"addr": {"type": "inet", "data": {"host": str(OCF_RESKEY_listen_address), "port": str(NBD_PORT)}}}},
+ {"execute": "nbd-server-add", "arguments": {"device": "parent0", "writable": True}}
+ ])
+ fd.close()
+
+ return OCF_SUCCESS
+
+def env_do_shutdown_guest():
+ return OCF_RESKEY_CRM_meta_notify_active_uname \
+ and OCF_RESKEY_CRM_meta_notify_stop_uname \
+ and str.strip(OCF_RESKEY_CRM_meta_notify_active_uname) \
+ == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname)
+
+def env_find_secondary():
+ # slave(s) =
+ # OCF_RESKEY_CRM_meta_notify_slave_uname
+ # - OCF_RESKEY_CRM_meta_notify_stop_uname
+ # + OCF_RESKEY_CRM_meta_notify_start_uname
+ # Filter out hosts that are stopping and ourselves
+ for host in str.split(OCF_RESKEY_CRM_meta_notify_slave_uname, " "):
+ if host:
+ for stopping_host \
+ in str.split(OCF_RESKEY_CRM_meta_notify_stop_uname, " "):
+ if host == stopping_host:
+ break
+ else:
+ if host != HOSTNAME:
+ # we found a valid secondary
+ return host
+
+ for host in str.split(OCF_RESKEY_CRM_meta_notify_start_uname, " "):
+ if host != HOSTNAME:
+ # we found a valid secondary
+ return host
+
+ # we found no secondary
+ return None
+
+def _qemu_colo_stop(monstatus, shutdown_guest):
+ # stop action must do everything possible to stop the resource
+ try:
+ timeout = START_TIME + (int(OCF_RESKEY_CRM_meta_timeout)/1000) - 5
+ force_stop = False
+
+ if monstatus == OCF_NOT_RUNNING:
+ log.info("resource is already stopped")
+ return OCF_SUCCESS
+ elif monstatus == OCF_RUNNING_MASTER or monstatus == OCF_SUCCESS:
+ force_stop = False
+ else:
+ force_stop = True
+
+ if not force_stop:
+ fd = qmp_open(True)
+ if shutdown_guest:
+ if monstatus == OCF_RUNNING_MASTER:
+ qmp_execute(fd, [{"execute": "system_powerdown"}], \
+ do_yank = False)
+ else:
+ qmp_execute(fd, [{"execute": "quit"}], do_yank = False)
+ fd.close()
+
+ # wait for qemu to stop
+ while time.time() < timeout:
+ status, pid = check_pid()
+ if status == OCF_NOT_RUNNING:
+ # qemu stopped
+ return OCF_SUCCESS
+ elif status == OCF_SUCCESS:
+ # wait
+ log.debug("Waiting for qemu to stop")
+ time.sleep(1)
+ else:
+ # something went wrong, force stop instead
+ break
+
+ log.warning("clean stop timeout reached")
+ except Exception as e:
+ log.warning("error while stopping: %s" % e)
+
+ log.info("force stopping qemu")
+
+ status, pid = check_pid()
+ if status == OCF_NOT_RUNNING:
+ return OCF_SUCCESS
+ try:
+ if int(OCF_RESKEY_debug) >= 2:
+ os.kill(pid, signal.SIGSEGV)
+ else:
+ os.kill(pid, signal.SIGTERM)
+ time.sleep(2)
+ os.kill(pid, signal.SIGKILL)
+ except Exception:
+ pass
+
+ while check_pid()[0] != OCF_NOT_RUNNING:
+ time.sleep(1)
+
+ return OCF_SUCCESS
+
+def qemu_colo_stop():
+ shutdown_guest = env_do_shutdown_guest()
+ try:
+ role, replication = qemu_colo_monitor(True)
+ except Exception:
+ role, replication = OCF_ERR_GENERIC, OCF_ERR_GENERIC
+
+ status = _qemu_colo_stop(role, shutdown_guest)
+
+ if HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
+ if str.strip(OCF_RESKEY_CRM_meta_notify_promote_uname) != HOSTNAME:
+ # We were the primary and the secondary is to be promoted.
+ # We are going to be out of date.
+ set_master_score(0)
+ else:
+ if role == OCF_RUNNING_MASTER:
+ # We were a healthy primary but had no healthy secondary or it
+ # was stopped as well. So we have up-to-date data.
+ set_master_score(10)
+ else:
+ # We were an unhealthy primary but also had no healthy secondary.
+ # So we should still have up-to-date data.
+ set_master_score(5)
+ else:
+ if get_master_score() > 10:
+ if role == OCF_SUCCESS:
+ if shutdown_guest:
+ # We were a healthy secondary and (probably) had a healthy
+ # primary and both were stopped. So we have up-to-date data
+ # too.
+ set_master_score(10)
+ else:
+ # We were a healthy secondary and (probably) had a healthy
+ # primary still running. So we are now out of date.
+ set_master_score(0)
+ else:
+ # We were an unhealthy secondary. So we are now out of date.
+ set_master_score(0)
+
+ return status
+
+def qemu_colo_notify():
+ action = "%s-%s" % (OCF_RESKEY_CRM_meta_notify_type, \
+ OCF_RESKEY_CRM_meta_notify_operation)
+
+ if action == "post-start":
+ if HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
+ peer = str.strip(OCF_RESKEY_CRM_meta_notify_start_uname)
+ fd = qmp_open()
+ qmp_start_resync(fd, peer)
+ # The secondary has inconsistent data until resync is finished
+ set_remote_master_score(peer, 0)
+ fd.close()
+
+ elif action == "pre-stop":
+ if not env_do_shutdown_guest() \
+ and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname) \
+ and HOSTNAME != str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
+ fd = qmp_open()
+ peer = qmp_get_nbd_remote(fd)
+ log.debug("our peer: %s" % peer)
+ if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
+ if qmp_check_resync(fd):
+ qmp_cancel_resync(fd)
+ elif get_remote_master_score(peer) > 10:
+ qmp_primary_failover(fd)
+ qmp_execute(fd, [{"exec-helper": "clear-events"}], do_yank = False)
+ fd.close()
+
+ elif action == "post-stop" \
+ and OCF_RESKEY_CRM_meta_notify_key_operation == "stonith" \
+ and (HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname)
+ or HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_promote_uname)):
+ peer = str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname)
+ set_remote_master_score(peer, 0)
+
+ return OCF_SUCCESS
+
+def qemu_colo_promote():
+ role, replication = qemu_colo_monitor()
+
+ if role == OCF_SUCCESS and replication == OCF_NOT_RUNNING:
+ status = _qemu_colo_stop(role, False)
+ if status != OCF_SUCCESS:
+ return status
+
+ rotate_logfile(QMP_LOG, 8)
+ rotate_logfile(QEMU_LOG, 8)
+ run_command(QEMU_PRIMARY_CMDLINE)
+ oob_helper_open()
+ set_master_score(101)
+
+ peer = env_find_secondary()
+ if peer:
+ fd = qmp_open()
+ qmp_start_resync(fd, peer)
+ # The secondary has inconsistent data until resync is finished
+ set_remote_master_score(peer, 0)
+ fd.close()
+ return OCF_SUCCESS
+ elif role == OCF_SUCCESS and replication != OCF_NOT_RUNNING:
+ fd = qmp_open()
+ qmp_secondary_failover(fd)
+ set_master_score(101)
+
+ peer = env_find_secondary()
+ if peer:
+ qmp_start_resync(fd, peer)
+ # The secondary has inconsistent data until resync is finished
+ set_remote_master_score(peer, 0)
+ qmp_execute(fd, [{"exec-helper": "clear-events"}], do_yank=False)
+ fd.close()
+ return OCF_SUCCESS
+ else:
+ return OCF_ERR_GENERIC
+
+def qemu_colo_demote():
+ status = qemu_colo_stop()
+ if status != OCF_SUCCESS:
+ return status
+ return qemu_colo_start()
+
+
+if OCF_ACTION == "meta-data":
+ qemu_colo_meta_data()
+ exit(OCF_SUCCESS)
+
+logs_open()
+
+status = qemu_colo_validate_all()
+# Exit here if our sanity checks fail, but try to continue if we need to stop
+if status != OCF_SUCCESS and OCF_ACTION != "stop":
+ exit(status)
+
+setup_constants()
+
+try:
+ if OCF_ACTION == "start":
+ status = qemu_colo_start()
+ elif OCF_ACTION == "stop":
+ status = qemu_colo_stop()
+ elif OCF_ACTION == "monitor":
+ status = qemu_colo_monitor()[0]
+ elif OCF_ACTION == "notify":
+ status = qemu_colo_notify()
+ elif OCF_ACTION == "promote":
+ status = qemu_colo_promote()
+ elif OCF_ACTION == "demote":
+ status = qemu_colo_demote()
+ elif OCF_ACTION == "validate-all":
+ status = qemu_colo_validate_all()
+ else:
+ status = OCF_ERR_UNIMPLEMENTED
+except Error:
+ exit(OCF_ERR_GENERIC)
+else:
+ exit(status)
--
2.20.1
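As an aside for reviewers: the QUORUM_REPORT_BAD accounting in qemu_colo_monitor() above is easy to miss on first read. Read errors count once, while any other error type is weighted as max_disk_errors, so a single write or flush error immediately trips the threshold. A minimal standalone sketch of that logic (function name and sample events are illustrative, not part of the patch):

```python
def count_disk_errors(quorum_events, max_disk_errors):
    """Return (peer_errors, local_errors) from QUORUM_REPORT_BAD events.

    Errors on the "nbd0" node are attributed to the peer, everything
    else to the local disk. Non-read errors are weighted so that one
    occurrence already reaches the max_disk_errors threshold.
    """
    peer = local = 0
    for event in quorum_events:
        weight = 1 if event["data"]["type"] == "read" else max_disk_errors
        if event["data"]["node-name"] == "nbd0":
            peer += weight
        else:
            local += weight
    return peer, local


if __name__ == "__main__":
    events = [
        {"data": {"node-name": "nbd0", "type": "read"}},      # peer read error
        {"data": {"node-name": "quorum0", "type": "write"}},  # local write error
    ]
    # With max_disk_errors = 10: one peer read error, local write
    # error weighted to 10, i.e. (1, 10)
    print(count_disk_errors(events, 10))
```

This mirrors why the agent treats write errors as fatal: a failed write means the replicas have already diverged, whereas read errors may be transient.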
* [PATCH 3/5] colo: Introduce high-level test suite
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
2020-05-11 12:26 ` [PATCH 1/5] block/quorum.c: stable children names Lukas Straub
2020-05-11 12:26 ` [PATCH 2/5] colo: Introduce resource agent Lukas Straub
@ 2020-05-11 12:27 ` Lukas Straub
2020-06-02 12:19 ` Philippe Mathieu-Daudé
2020-05-11 12:27 ` [PATCH 4/5] configure,Makefile: Install colo resource-agent Lukas Straub
` (2 subsequent siblings)
5 siblings, 1 reply; 14+ messages in thread
From: Lukas Straub @ 2020-05-11 12:27 UTC (permalink / raw)
To: qemu-devel; +Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert
Add a high-level test that relies on the colo resource-agent to test
all failover cases while checking guest network connectivity.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
scripts/colo-resource-agent/crm_master | 44 ++
scripts/colo-resource-agent/crm_resource | 12 +
tests/acceptance/colo.py | 689 +++++++++++++++++++++++
3 files changed, 745 insertions(+)
create mode 100755 scripts/colo-resource-agent/crm_master
create mode 100755 scripts/colo-resource-agent/crm_resource
create mode 100644 tests/acceptance/colo.py
diff --git a/scripts/colo-resource-agent/crm_master b/scripts/colo-resource-agent/crm_master
new file mode 100755
index 0000000000..886f523bda
--- /dev/null
+++ b/scripts/colo-resource-agent/crm_master
@@ -0,0 +1,44 @@
+#!/bin/bash
+
+# Fake crm_master for COLO testing
+#
+# Copyright (c) Lukas Straub <lukasstraub2@web.de>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later. See the COPYING file in the top-level directory.
+
+TMPDIR="$HA_RSCTMP"
+score=0
+query=0
+
+OPTIND=1
+while getopts 'Qql:Dv:N:G' opt; do
+ case "$opt" in
+ Q|q)
+ # Noop
+ ;;
+ "l")
+ # Noop
+ ;;
+ "D")
+ score=0
+ ;;
+ "v")
+ score=$OPTARG
+ ;;
+ "N")
+ TMPDIR="$COLO_TEST_REMOTE_TMP"
+ ;;
+ "G")
+ query=1
+ ;;
+ esac
+done
+
+if (( query )); then
+ cat "${TMPDIR}/master_score" || exit 1
+else
+ echo $score > "${TMPDIR}/master_score" || exit 1
+fi
+
+exit 0
diff --git a/scripts/colo-resource-agent/crm_resource b/scripts/colo-resource-agent/crm_resource
new file mode 100755
index 0000000000..ad69ff3c6b
--- /dev/null
+++ b/scripts/colo-resource-agent/crm_resource
@@ -0,0 +1,12 @@
+#!/bin/sh
+
+# Fake crm_resource for COLO testing
+#
+# Copyright (c) Lukas Straub <lukasstraub2@web.de>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later. See the COPYING file in the top-level directory.
+
+# Noop
+
+exit 0
diff --git a/tests/acceptance/colo.py b/tests/acceptance/colo.py
new file mode 100644
index 0000000000..465513fb6c
--- /dev/null
+++ b/tests/acceptance/colo.py
@@ -0,0 +1,689 @@
+# High-level test suite for qemu COLO testing all failover cases while checking
+# guest network connectivity
+#
+# Copyright (c) Lukas Straub <lukasstraub2@web.de>
+#
+# This work is licensed under the terms of the GNU GPL, version 2 or
+# later. See the COPYING file in the top-level directory.
+
+# HOWTO:
+#
+# This test has the following parameters:
+# bridge_name: name of the bridge interface to connect qemu to
+# host_address: ip address of the bridge interface
+# guest_address: ip address that the guest gets from the dhcp server
+# bridge_helper: path to the bridge helper
+# (default: /usr/lib/qemu/qemu-bridge-helper)
+# install_cmd: command to run to install iperf3 and memtester in the guest
+# (default: "sudo -n dnf -q -y install iperf3 memtester")
+#
+# To run the network tests, you have to specify the parameters.
+#
+# Example for running the colo tests:
+# make check-acceptance FEDORA_31_ARCHES="x86_64" AVOCADO_TAGS="-t colo \
+# -p bridge_name=br0 -p host_address=192.168.220.1 \
+# -p guest_address=192.168.220.222"
+#
+# The colo tests currently only use x86_64 test vm images. With the
+# FEDORA_31_ARCHES make variable as in the example, only the x86_64 images will
+# be downloaded.
+#
+# If you're running the network tests as an unprivileged user, you need to set
+# the suid bit on the bridge helper (chmod +s <bridge-helper>).
+#
+# The DHCP server should assign a static IP to the guest, otherwise the test
+# may be unreliable. The MAC address of the guest is always 52:54:00:12:34:56.
+
+
+import select
+import sys
+import subprocess
+import shutil
+import os
+import signal
+import os.path
+import time
+import tempfile
+
+from avocado import Test
+from avocado import skipUnless
+from avocado.utils import network
+from avocado.utils import vmimage
+from avocado.utils import cloudinit
+from avocado.utils import ssh
+from avocado.utils.path import find_command, CmdNotFoundError
+
+from avocado_qemu import pick_default_qemu_bin, BUILD_DIR, SOURCE_DIR
+from qemu.qmp import QEMUMonitorProtocol
+
+def iperf3_available():
+ try:
+ find_command("iperf3")
+ except CmdNotFoundError:
+ return False
+ return True
+
+class ColoTest(Test):
+
+ # Constants
+ OCF_SUCCESS = 0
+ OCF_ERR_GENERIC = 1
+ OCF_ERR_ARGS = 2
+ OCF_ERR_UNIMPLEMENTED = 3
+ OCF_ERR_PERM = 4
+ OCF_ERR_INSTALLED = 5
+ OCF_ERR_CONFIGURED = 6
+ OCF_NOT_RUNNING = 7
+ OCF_RUNNING_MASTER = 8
+ OCF_FAILED_MASTER = 9
+
+ HOSTA = 10
+ HOSTB = 11
+
+ QEMU_OPTIONS = (" -display none -vga none -enable-kvm"
+ " -smp 2 -cpu host -m 768"
+ " -device e1000,mac=52:54:00:12:34:56,netdev=hn0"
+ " -device virtio-blk,drive=colo-disk0")
+
+ FEDORA_VERSION = "31"
+ IMAGE_CHECKSUM = "e3c1b309d9203604922d6e255c2c5d098a309c2d46215d8fc026954f3c5c27a0"
+ IMAGE_SIZE = "4294967296b"
+
+ hang_qemu = False
+ checkpoint_failover = False
+ traffic_procs = []
+
+ def get_image(self, temp_dir):
+ try:
+ return vmimage.get(
+ "fedora", arch="x86_64", version=self.FEDORA_VERSION,
+ checksum=self.IMAGE_CHECKSUM, algorithm="sha256",
+ cache_dir=self.cache_dirs[0],
+ snapshot_dir=temp_dir)
+ except Exception:
+ self.cancel("Failed to download/prepare image")
+
+ @skipUnless(ssh.SSH_CLIENT_BINARY, "No SSH client available")
+ def setUp(self):
+ # Qemu and qemu-img binary
+ default_qemu_bin = pick_default_qemu_bin()
+ self.QEMU_BINARY = self.params.get("qemu_bin", default=default_qemu_bin)
+
+ # If qemu-img has been built, use it, otherwise the system wide one
+ # will be used. If none is available, the test will cancel.
+ qemu_img = os.path.join(BUILD_DIR, "qemu-img")
+ if not os.path.exists(qemu_img):
+ qemu_img = find_command("qemu-img", False)
+ if qemu_img is False:
+ self.cancel("Could not find \"qemu-img\", which is required to "
+ "create the bootable image")
+ self.QEMU_IMG_BINARY = qemu_img
+ vmimage.QEMU_IMG = qemu_img
+
+ self.RESOURCE_AGENT = os.path.join(SOURCE_DIR,
+ "scripts/colo-resource-agent/colo")
+ self.ADD_PATH = os.path.join(SOURCE_DIR, "scripts/colo-resource-agent")
+
+ # Logs
+ self.RA_LOG = os.path.join(self.outputdir, "resource-agent.log")
+ self.HOSTA_LOGDIR = os.path.join(self.outputdir, "hosta")
+ self.HOSTB_LOGDIR = os.path.join(self.outputdir, "hostb")
+ os.makedirs(self.HOSTA_LOGDIR)
+ os.makedirs(self.HOSTB_LOGDIR)
+
+ # Temporary directories
+ # We don't use self.workdir because of unix socket path length
+ # limitations
+ self.TMPDIR = tempfile.mkdtemp()
+ self.TMPA = os.path.join(self.TMPDIR, "hosta")
+ self.TMPB = os.path.join(self.TMPDIR, "hostb")
+ os.makedirs(self.TMPA)
+ os.makedirs(self.TMPB)
+
+ # Network
+ self.BRIDGE_NAME = self.params.get("bridge_name")
+ if self.BRIDGE_NAME:
+ self.HOST_ADDRESS = self.params.get("host_address")
+ self.GUEST_ADDRESS = self.params.get("guest_address")
+ self.BRIDGE_HELPER = self.params.get("bridge_helper",
+ default="/usr/lib/qemu/qemu-bridge-helper")
+ self.SSH_PORT = 22
+ else:
+ # QEMU's hard coded usermode router address
+ self.HOST_ADDRESS = "10.0.2.2"
+ self.GUEST_ADDRESS = "127.0.0.1"
+ self.BRIDGE_HOSTA_PORT = network.find_free_port(address="127.0.0.1")
+ self.BRIDGE_HOSTB_PORT = network.find_free_port(address="127.0.0.1")
+ self.SSH_PORT = network.find_free_port(address="127.0.0.1")
+
+ self.CLOUDINIT_HOME_PORT = network.find_free_port()
+
+ # Find free port range
+ base_port = 1024
+ while True:
+ base_port = network.find_free_port(start_port=base_port,
+ address="127.0.0.1")
+ if base_port is None:
+ self.cancel("Failed to find a free port")
+ for n in range(base_port, base_port + 4):
+ if n > 65535:
+ break
+ if not network.is_port_free(n, "127.0.0.1"):
+ break
+ else:
+ # for loop above didn't break
+ break
+
+ self.BASE_PORT = base_port
+
+ # Disk images
+ self.log.info("Downloading/preparing boot image")
+ self.HOSTA_IMAGE = self.get_image(self.TMPA).path
+ self.HOSTB_IMAGE = self.get_image(self.TMPB).path
+ self.CLOUDINIT_ISO = os.path.join(self.TMPDIR, "cloudinit.iso")
+
+ self.log.info("Preparing cloudinit image")
+ try:
+ cloudinit.iso(self.CLOUDINIT_ISO, self.name,
+ username="test", password="password",
+ phone_home_host=self.HOST_ADDRESS,
+ phone_home_port=self.CLOUDINIT_HOME_PORT)
+ except Exception as e:
+ self.cancel("Failed to prepare cloudinit image: %s" % e)
+
+ self.QEMU_OPTIONS += " -cdrom %s" % self.CLOUDINIT_ISO
+
+ # Network bridge
+ if not self.BRIDGE_NAME:
+ self.BRIDGE_PIDFILE = os.path.join(self.TMPDIR, "bridge.pid")
+ self.run_command(("'%s' -pidfile '%s'"
+ " -M none -display none -daemonize"
+ " -netdev user,id=host,hostfwd=tcp:127.0.0.1:%s-:22"
+ " -netdev socket,id=hosta,listen=127.0.0.1:%s"
+ " -netdev socket,id=hostb,listen=127.0.0.1:%s"
+ " -netdev hubport,id=hostport,hubid=0,netdev=host"
+ " -netdev hubport,id=porta,hubid=0,netdev=hosta"
+ " -netdev hubport,id=portb,hubid=0,netdev=hostb")
+ % (self.QEMU_BINARY, self.BRIDGE_PIDFILE, self.SSH_PORT,
+ self.BRIDGE_HOSTA_PORT, self.BRIDGE_HOSTB_PORT), 0)
+
+ def tearDown(self):
+ try:
+ pid = self.read_pidfile(self.BRIDGE_PIDFILE)
+ if pid and self.check_pid(pid):
+ os.kill(pid, signal.SIGKILL)
+ except Exception as e:
+ pass
+
+ try:
+ self.ra_stop(self.HOSTA)
+ except Exception as e:
+ pass
+
+ try:
+ self.ra_stop(self.HOSTB)
+ except Exception as e:
+ pass
+
+ try:
+ self.ssh_close()
+ except Exception as e:
+ pass
+
+ for proc in self.traffic_procs:
+ try:
+ os.killpg(proc.pid, signal.SIGTERM)
+ except Exception as e:
+ pass
+
+ shutil.rmtree(self.TMPDIR)
+
+ def run_command(self, cmdline, expected_status, env=None):
+ proc = subprocess.Popen(cmdline, shell=True, stdout=subprocess.PIPE,
+ stderr=subprocess.STDOUT,
+ universal_newlines=True, env=env)
+ stdout, _ = proc.communicate()
+ if proc.returncode != expected_status:
+ self.fail("command \"%s\" failed with code %s:\n%s"
+ % (cmdline, proc.returncode, stdout))
+
+ return proc.returncode
+
+ def cat_line(self, path):
+ line = ""
+ try:
+ with open(path, "r") as fd:
+ line = fd.readline().strip()
+ except OSError:
+ pass
+ return line
+
+ def read_pidfile(self, pidfile):
+ try:
+ pid = int(self.cat_line(pidfile))
+ except ValueError:
+ return None
+ else:
+ return pid
+
+ def check_pid(self, pid):
+ try:
+ os.kill(pid, 0)
+ except OSError:
+ return False
+ else:
+ return True
+
+ def ssh_open(self):
+ self.ssh_conn = ssh.Session(self.GUEST_ADDRESS, self.SSH_PORT,
+ user="test", password="password")
+ self.ssh_conn.DEFAULT_OPTIONS += (("UserKnownHostsFile", "/dev/null"),)
+ for i in range(10):
+ try:
+ if self.ssh_conn.connect():
+ return
+ except Exception as e:
+ pass
+ time.sleep(4)
+ self.fail("sshd timeout")
+
+ def ssh_ping(self):
+ if self.ssh_conn.cmd("echo ping").stdout != b"ping\n":
+ self.fail("ssh ping failed")
+
+ def ssh_close(self):
+ self.ssh_conn.quit()
+
+ def setup_base_env(self, host):
+ PATH = os.getenv("PATH", "")
+ env = { "PATH": "%s:%s" % (self.ADD_PATH, PATH),
+ "HA_LOGFILE": self.RA_LOG,
+ "OCF_RESOURCE_INSTANCE": "colo-test",
+ "OCF_RESKEY_CRM_meta_clone_max": "2",
+ "OCF_RESKEY_CRM_meta_notify": "true",
+ "OCF_RESKEY_CRM_meta_timeout": "30000",
+ "OCF_RESKEY_qemu_binary": self.QEMU_BINARY,
+ "OCF_RESKEY_qemu_img_binary": self.QEMU_IMG_BINARY,
+ "OCF_RESKEY_disk_size": str(self.IMAGE_SIZE),
+ "OCF_RESKEY_checkpoint_interval": "10000",
+ "OCF_RESKEY_base_port": str(self.BASE_PORT),
+ "OCF_RESKEY_debug": "2"}
+
+ if host == self.HOSTA:
+ env.update({"OCF_RESKEY_options":
+ ("%s -qmp unix:%s,server,nowait"
+ " -drive if=none,id=parent0,file='%s'")
+ % (self.QEMU_OPTIONS, self.get_qmp_sock(host),
+ self.HOSTA_IMAGE),
+ "OCF_RESKEY_active_hidden_dir": self.TMPA,
+ "OCF_RESKEY_listen_address": "127.0.0.1",
+ "OCF_RESKEY_log_dir": self.HOSTA_LOGDIR,
+ "OCF_RESKEY_CRM_meta_on_node": "127.0.0.1",
+ "HA_RSCTMP": self.TMPA,
+ "COLO_TEST_REMOTE_TMP": self.TMPB})
+
+ else:
+ env.update({"OCF_RESKEY_options":
+ ("%s -qmp unix:%s,server,nowait"
+ " -drive if=none,id=parent0,file='%s'")
+ % (self.QEMU_OPTIONS, self.get_qmp_sock(host),
+ self.HOSTB_IMAGE),
+ "OCF_RESKEY_active_hidden_dir": self.TMPB,
+ "OCF_RESKEY_listen_address": "127.0.0.2",
+ "OCF_RESKEY_log_dir": self.HOSTB_LOGDIR,
+ "OCF_RESKEY_CRM_meta_on_node": "127.0.0.2",
+ "HA_RSCTMP": self.TMPB,
+ "COLO_TEST_REMOTE_TMP": self.TMPA})
+
+ if self.BRIDGE_NAME:
+ env["OCF_RESKEY_options"] += \
+ " -netdev bridge,id=hn0,br=%s,helper='%s'" \
+ % (self.BRIDGE_NAME, self.BRIDGE_HELPER)
+ else:
+ if host == self.HOSTA:
+ env["OCF_RESKEY_options"] += \
+ " -netdev socket,id=hn0,connect=127.0.0.1:%s" \
+ % self.BRIDGE_HOSTA_PORT
+ else:
+ env["OCF_RESKEY_options"] += \
+ " -netdev socket,id=hn0,connect=127.0.0.1:%s" \
+ % self.BRIDGE_HOSTB_PORT
+ return env
+
+ def ra_start(self, host):
+ env = self.setup_base_env(host)
+ self.run_command(self.RESOURCE_AGENT + " start", self.OCF_SUCCESS, env)
+
+ def ra_stop(self, host):
+ env = self.setup_base_env(host)
+ self.run_command(self.RESOURCE_AGENT + " stop", self.OCF_SUCCESS, env)
+
+ def ra_monitor(self, host, expected_status):
+ env = self.setup_base_env(host)
+ self.run_command(self.RESOURCE_AGENT + " monitor", expected_status, env)
+
+ def ra_promote(self, host):
+ env = self.setup_base_env(host)
+ self.run_command(self.RESOURCE_AGENT + " promote", self.OCF_SUCCESS,env)
+
+ def ra_notify_start(self, host):
+ env = self.setup_base_env(host)
+
+ env.update({"OCF_RESKEY_CRM_meta_notify_type": "post",
+ "OCF_RESKEY_CRM_meta_notify_operation": "start"})
+
+ if host == self.HOSTA:
+ env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.1",
+ "OCF_RESKEY_CRM_meta_notify_start_uname": "127.0.0.2"})
+ else:
+ env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.2",
+ "OCF_RESKEY_CRM_meta_notify_start_uname": "127.0.0.1"})
+
+ self.run_command(self.RESOURCE_AGENT + " notify", self.OCF_SUCCESS, env)
+
+ def ra_notify_stop(self, host):
+ env = self.setup_base_env(host)
+
+ env.update({"OCF_RESKEY_CRM_meta_notify_type": "pre",
+ "OCF_RESKEY_CRM_meta_notify_operation": "stop"})
+
+ if host == self.HOSTA:
+ env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.1",
+ "OCF_RESKEY_CRM_meta_notify_stop_uname": "127.0.0.2"})
+ else:
+ env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.2",
+ "OCF_RESKEY_CRM_meta_notify_stop_uname": "127.0.0.1"})
+
+ self.run_command(self.RESOURCE_AGENT + " notify", self.OCF_SUCCESS, env)
+
+ def get_pid(self, host):
+ if host == self.HOSTA:
+ return self.read_pidfile(os.path.join(self.TMPA,
+ "colo-test-qemu.pid"))
+ else:
+ return self.read_pidfile(os.path.join(self.TMPB,
+ "colo-test-qemu.pid"))
+
+ def get_master_score(self, host):
+ if host == self.HOSTA:
+ return int(self.cat_line(os.path.join(self.TMPA, "master_score")))
+ else:
+ return int(self.cat_line(os.path.join(self.TMPB, "master_score")))
+
+ def get_qmp_sock(self, host):
+ if host == self.HOSTA:
+ return os.path.join(self.TMPA, "my-qmp.sock")
+ else:
+ return os.path.join(self.TMPB, "my-qmp.sock")
+
+
+ def kill_qemu_pre(self, host):
+ pid = self.get_pid(host)
+
+ qmp_sock = self.get_qmp_sock(host)
+
+ if self.checkpoint_failover:
+ qmp_conn = QEMUMonitorProtocol(qmp_sock)
+ qmp_conn.settimeout(10)
+ qmp_conn.connect()
+ while True:
+ event = qmp_conn.pull_event(wait=True)
+ if event["event"] == "STOP":
+ break
+ qmp_conn.close()
+
+
+ if pid and self.check_pid(pid):
+ if self.hang_qemu:
+ os.kill(pid, signal.SIGSTOP)
+ else:
+ os.kill(pid, signal.SIGKILL)
+ while self.check_pid(pid):
+ time.sleep(1)
+
+ def kill_qemu_post(self, host):
+ pid = self.get_pid(host)
+
+ if self.hang_qemu and pid and self.check_pid(pid):
+ os.kill(pid, signal.SIGKILL)
+ while self.check_pid(pid):
+ time.sleep(1)
+
+ def prepare_guest(self):
+ pass
+
+ def cycle_start(self, cycle):
+ pass
+
+ def active_section(self):
+ return False
+
+ def cycle_end(self, cycle):
+ pass
+
+ def check_connection(self):
+ self.ssh_ping()
+ for proc in self.traffic_procs:
+ if proc.poll() is not None:
+ self.fail("Traffic process died")
+
+ def _test_colo(self, loop=1):
+ loop = max(loop, 1)
+ self.log.info("Will put logs in %s" % self.outputdir)
+
+ self.ra_stop(self.HOSTA)
+ self.ra_stop(self.HOSTB)
+
+ self.log.info("*** Startup ***")
+ self.ra_start(self.HOSTA)
+ self.ra_start(self.HOSTB)
+
+ self.ra_monitor(self.HOSTA, self.OCF_SUCCESS)
+ self.ra_monitor(self.HOSTB, self.OCF_SUCCESS)
+
+ self.log.info("*** Promoting ***")
+ self.ra_promote(self.HOSTA)
+ cloudinit.wait_for_phone_home(("0.0.0.0", self.CLOUDINIT_HOME_PORT),
+ self.name)
+ self.ssh_open()
+ self.prepare_guest()
+
+ self.ra_notify_start(self.HOSTA)
+
+ while self.get_master_score(self.HOSTB) != 100:
+ self.ra_monitor(self.HOSTA, self.OCF_RUNNING_MASTER)
+ self.ra_monitor(self.HOSTB, self.OCF_SUCCESS)
+ time.sleep(1)
+
+ self.log.info("*** Replication started ***")
+
+ self.check_connection()
+
+ primary = self.HOSTA
+ secondary = self.HOSTB
+
+ for n in range(loop):
+ self.cycle_start(n)
+ self.log.info("*** Secondary failover ***")
+ self.kill_qemu_pre(primary)
+ self.ra_notify_stop(secondary)
+ self.ra_monitor(secondary, self.OCF_SUCCESS)
+ self.ra_promote(secondary)
+ self.ra_monitor(secondary, self.OCF_RUNNING_MASTER)
+ self.kill_qemu_post(primary)
+
+ self.check_connection()
+
+ tmp = primary
+ primary = secondary
+ secondary = tmp
+
+ self.log.info("*** Secondary continue replication ***")
+ self.ra_start(secondary)
+ self.ra_notify_start(primary)
+
+ self.check_connection()
+
+ # Wait for resync
+ while self.get_master_score(secondary) != 100:
+ self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
+ self.ra_monitor(secondary, self.OCF_SUCCESS)
+ time.sleep(1)
+
+ self.log.info("*** Replication started ***")
+
+ self.check_connection()
+
+ if self.active_section():
+ break
+
+ self.log.info("*** Primary failover ***")
+ self.kill_qemu_pre(secondary)
+ self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
+ self.ra_notify_stop(primary)
+ self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
+ self.kill_qemu_post(secondary)
+
+ self.check_connection()
+
+ self.log.info("*** Primary continue replication ***")
+ self.ra_start(secondary)
+ self.ra_notify_start(primary)
+
+ self.check_connection()
+
+ # Wait for resync
+ while self.get_master_score(secondary) != 100:
+ self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
+ self.ra_monitor(secondary, self.OCF_SUCCESS)
+ time.sleep(1)
+
+ self.log.info("*** Replication started ***")
+
+ self.check_connection()
+
+ self.cycle_end(n)
+
+ self.ssh_close()
+
+ self.ra_stop(self.HOSTA)
+ self.ra_stop(self.HOSTB)
+
+ self.ra_monitor(self.HOSTA, self.OCF_NOT_RUNNING)
+ self.ra_monitor(self.HOSTB, self.OCF_NOT_RUNNING)
+ self.log.info("*** all ok ***")
+
+
+class ColoQuickTest(ColoTest):
+ """
+ :avocado: tags=colo
+ :avocado: tags=quick
+ :avocado: tags=arch:x86_64
+ """
+
+ timeout = 300
+
+ def cycle_end(self, cycle):
+ self.log.info("Testing with peer qemu hanging"
+ " and failover during checkpoint")
+ self.hang_qemu = True
+
+ def test_quick(self):
+ self.checkpoint_failover = True
+ self.log.info("Testing with peer qemu crashing"
+ " and failover during checkpoint")
+ self._test_colo(loop=2)
+
+
+class ColoNetworkTest(ColoTest):
+
+ def prepare_guest(self):
+ install_cmd = self.params.get("install_cmd", default=
+ "sudo -n dnf -q -y install iperf3 memtester")
+ self.ssh_conn.cmd(install_cmd)
+ # Use two instances to work around a bug in iperf3
+ self.ssh_conn.cmd("iperf3 -sD; iperf3 -sD -p 5202")
+
+ def _cycle_start(self, cycle):
+ pass
+
+ def cycle_start(self, cycle):
+ self._cycle_start(cycle)
+ tests = [("", "client-to-server tcp"), ("-R", "server-to-client tcp")]
+
+ self.log.info("Testing iperf %s" % tests[cycle % 2][1])
+ iperf_cmd = "iperf3 %s -t 60 -i 60 --connect-timeout 30000 -c %s" \
+ % (tests[cycle % 2][0], self.GUEST_ADDRESS)
+ proc = subprocess.Popen("while %s && %s; do sleep 1; done >>'%s' 2>&1"
+ % (iperf_cmd, iperf_cmd + " -p 5202",
+ os.path.join(self.outputdir, "iperf.log")),
+ shell=True, preexec_fn=os.setsid, stdin=subprocess.DEVNULL,
+ stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+ self.traffic_procs.append(proc)
+ time.sleep(5) # Wait for iperf to get up to speed
+
+ def cycle_end(self, cycle):
+ for proc in self.traffic_procs:
+ try:
+ os.killpg(proc.pid, signal.SIGTERM)
+ proc.wait()
+ except Exception as e:
+ pass
+ self.traffic_procs.clear()
+ time.sleep(20)
+
+class ColoRealNetworkTest(ColoNetworkTest):
+ """
+ :avocado: tags=colo
+ :avocado: tags=slow
+ :avocado: tags=network_test
+ :avocado: tags=arch:x86_64
+ """
+
+ timeout = 900
+
+ def active_section(self):
+ time.sleep(300)
+ return False
+
+ @skipUnless(iperf3_available(), "iperf3 not available")
+ def test_network(self):
+ if not self.BRIDGE_NAME:
+ self.cancel("bridge options not given, will skip network test")
+ self.log.info("Testing with peer qemu crashing and network load")
+ self._test_colo(loop=2)
+
+class ColoStressTest(ColoNetworkTest):
+ """
+ :avocado: tags=colo
+ :avocado: tags=slow
+ :avocado: tags=stress_test
+ :avocado: tags=arch:x86_64
+ """
+
+ timeout = 1800
+
+ def _cycle_start(self, cycle):
+ if cycle == 4:
+ self.log.info("Stresstest with peer qemu hanging, network load"
+ " and failover during checkpoint")
+ self.checkpoint_failover = True
+ self.hang_qemu = True
+ elif cycle == 8:
+ self.log.info("Stresstest with peer qemu crashing and network load")
+ self.checkpoint_failover = False
+ self.hang_qemu = False
+ elif cycle == 12:
+ self.log.info("Stresstest with peer qemu hanging and network load")
+ self.checkpoint_failover = False
+ self.hang_qemu = True
+
+ @skipUnless(iperf3_available(), "iperf3 not available")
+ def test_stress(self):
+ if not self.BRIDGE_NAME:
+ self.cancel("bridge options not given, will skip network test")
+ self.log.info("Stresstest with peer qemu crashing, network load"
+ " and failover during checkpoint")
+ self.checkpoint_failover = True
+ self._test_colo(loop=16)
--
2.20.1
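The failover loop in `_test_colo()` above always promotes the surviving secondary and then swaps roles before re-establishing replication. A stripped-down sketch of just that role-swap bookkeeping (plain Python, hypothetical host labels, no QEMU involved):

```python
def failover_cycles(loop):
    """Model the primary/secondary pairing across _test_colo() cycles.

    Each cycle starts with a secondary failover: the primary is killed,
    the secondary is promoted, and the roles are swapped before
    replication resumes. The later primary failover in the same cycle
    kills the new secondary, so the pairing is unchanged by it.
    """
    primary, secondary = "hosta", "hostb"
    pairs = []
    for _ in range(max(loop, 1)):   # same clamping as _test_colo()
        primary, secondary = secondary, primary
        pairs.append((primary, secondary))
    return pairs

print(failover_cycles(2))  # [('hostb', 'hosta'), ('hosta', 'hostb')]
```

With loop=2, as in the quick test, the roles end up back where they started, which is why the test tears both hosts down symmetrically at the end.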
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/5] configure,Makefile: Install colo resource-agent
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
` (2 preceding siblings ...)
2020-05-11 12:27 ` [PATCH 3/5] colo: Introduce high-level test suite Lukas Straub
@ 2020-05-11 12:27 ` Lukas Straub
2020-05-11 12:27 ` [PATCH 5/5] MAINTAINERS: Add myself as maintainer for COLO resource agent Lukas Straub
2020-05-18 9:38 ` [PATCH 0/5] colo: Introduce resource agent and test suite/CI Zhang, Chen
5 siblings, 0 replies; 14+ messages in thread
From: Lukas Straub @ 2020-05-11 12:27 UTC (permalink / raw)
To: qemu-devel; +Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert
[-- Attachment #1: Type: text/plain, Size: 2453 bytes --]
Optionally install the resource-agent so it gets picked up by
pacemaker.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
Makefile | 5 +++++
configure | 10 ++++++++++
2 files changed, 15 insertions(+)
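For context: pacemaker hands an OCF resource agent its parameters as OCF_RESKEY_* environment variables and invokes the agent executable with an action argument ("start", "stop", "monitor", ...), treating exit code 0 (OCF_SUCCESS) as success. This is also how the test suite in this series drives the agent. A minimal Python sketch of that calling convention (the parameter names follow tests/acceptance/colo.py; the agent path and values are placeholders):

```python
import os
import subprocess

def build_agent_env(qemu_binary, base_port, log_dir):
    # OCF agents read their configuration from OCF_RESKEY_* variables;
    # these particular names mirror the ones used by the COLO test suite.
    env = dict(os.environ)
    env.update({
        "OCF_RESOURCE_INSTANCE": "colo-test",
        "OCF_RESKEY_qemu_binary": qemu_binary,
        "OCF_RESKEY_base_port": str(base_port),
        "OCF_RESKEY_log_dir": log_dir,
    })
    return env

def run_agent(agent_path, action, env):
    # pacemaker invokes "<agent> <action>" and checks for OCF_SUCCESS (0).
    result = subprocess.run([agent_path, action], env=env)
    return result.returncode == 0
```

A cluster manager would effectively do run_agent("/usr/lib/ocf/resource.d/qemu/colo", "start", env) once the agent is installed by the rule above; the path shown assumes the default libdir.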
diff --git a/Makefile b/Makefile
index 8a9113e666..2ebffc4465 100644
--- a/Makefile
+++ b/Makefile
@@ -973,6 +973,11 @@ ifneq ($(DESCS),)
$(INSTALL_DATA) "$$tmpf" \
"$(DESTDIR)$(qemu_datadir)/firmware/$$x"; \
done
+endif
+ifdef INSTALL_COLO_RA
+ mkdir -p "$(DESTDIR)$(libdir)/ocf/resource.d/qemu"
+ $(INSTALL_PROG) "scripts/colo-resource-agent/colo" \
+ "$(DESTDIR)$(libdir)/ocf/resource.d/qemu/colo"
endif
for s in $(ICON_SIZES); do \
mkdir -p "$(DESTDIR)$(qemu_icondir)/hicolor/$${s}/apps"; \
diff --git a/configure b/configure
index 23b5e93752..c9252030cf 100755
--- a/configure
+++ b/configure
@@ -430,6 +430,7 @@ softmmu="yes"
linux_user="no"
bsd_user="no"
blobs="yes"
+colo_ra="no"
edk2_blobs="no"
pkgversion=""
pie=""
@@ -1309,6 +1310,10 @@ for opt do
;;
--disable-blobs) blobs="no"
;;
+ --disable-colo-ra) colo_ra="no"
+ ;;
+ --enable-colo-ra) colo_ra="yes"
+ ;;
--with-pkgversion=*) pkgversion="$optarg"
;;
--with-coroutine=*) coroutine="$optarg"
@@ -1776,6 +1781,7 @@ Advanced options (experts only):
--enable-gcov enable test coverage analysis with gcov
--gcov=GCOV use specified gcov [$gcov_tool]
--disable-blobs disable installing provided firmware blobs
+ --enable-colo-ra enable installing the COLO resource agent for pacemaker
--with-vss-sdk=SDK-path enable Windows VSS support in QEMU Guest Agent
--with-win-sdk=SDK-path path to Windows Platform SDK (to build VSS .tlb)
--tls-priority default TLS protocol/cipher priority string
@@ -6647,6 +6653,7 @@ echo "Linux AIO support $linux_aio"
echo "Linux io_uring support $linux_io_uring"
echo "ATTR/XATTR support $attr"
echo "Install blobs $blobs"
+echo "Install COLO resource agent $colo_ra"
echo "KVM support $kvm"
echo "HAX support $hax"
echo "HVF support $hvf"
@@ -7188,6 +7195,9 @@ fi
if test "$blobs" = "yes" ; then
echo "INSTALL_BLOBS=yes" >> $config_host_mak
fi
+if test "$colo_ra" = "yes" ; then
+ echo "INSTALL_COLO_RA=yes" >> $config_host_mak
+fi
if test "$iovec" = "yes" ; then
echo "CONFIG_IOVEC=y" >> $config_host_mak
fi
--
2.20.1
* [PATCH 5/5] MAINTAINERS: Add myself as maintainer for COLO resource agent
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
` (3 preceding siblings ...)
2020-05-11 12:27 ` [PATCH 4/5] configure,Makefile: Install colo resource-agent Lukas Straub
@ 2020-05-11 12:27 ` Lukas Straub
2020-05-18 9:38 ` [PATCH 0/5] colo: Introduce resource agent and test suite/CI Zhang, Chen
5 siblings, 0 replies; 14+ messages in thread
From: Lukas Straub @ 2020-05-11 12:27 UTC (permalink / raw)
To: qemu-devel; +Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert
[-- Attachment #1: Type: text/plain, Size: 694 bytes --]
While I'm not going to have much time for this, I'll still
try to test and review patches.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 8cbc1fac2b..4c623a96e1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2466,6 +2466,12 @@ F: net/colo*
F: net/filter-rewriter.c
F: net/filter-mirror.c
+COLO resource agent and testing
+M: Lukas Straub <lukasstraub2@web.de>
+S: Odd fixes
+F: scripts/colo-resource-agent/*
+F: tests/acceptance/colo.py
+
Record/replay
M: Pavel Dovgalyuk <pavel.dovgaluk@ispras.ru>
R: Paolo Bonzini <pbonzini@redhat.com>
--
2.20.1
* RE: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
` (4 preceding siblings ...)
2020-05-11 12:27 ` [PATCH 5/5] MAINTAINERS: Add myself as maintainer for COLO resource agent Lukas Straub
@ 2020-05-18 9:38 ` Zhang, Chen
2020-06-06 18:59 ` Lukas Straub
5 siblings, 1 reply; 14+ messages in thread
From: Zhang, Chen @ 2020-05-18 9:38 UTC (permalink / raw)
To: Lukas Straub, qemu-devel
Cc: Jason Wang, Alberto Garcia, Dr. David Alan Gilbert
> -----Original Message-----
> From: Lukas Straub <lukasstraub2@web.de>
> Sent: Monday, May 11, 2020 8:27 PM
> To: qemu-devel <qemu-devel@nongnu.org>
> Cc: Alberto Garcia <berto@igalia.com>; Dr. David Alan Gilbert
> <dgilbert@redhat.com>; Zhang, Chen <chen.zhang@intel.com>
> Subject: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
>
> Hello Everyone,
> These patches introduce a resource agent for fully automatic management of
> colo and a test suite building upon the resource agent to extensively test colo.
>
> Test suite features:
> -Tests failover with peer crashing and hanging and failover during checkpoint
> -Tests network using ssh and iperf3
> -Quick test requires no special configuration
> -Network test for testing colo-compare
> -Stress test: failover all the time with network load
>
> Resource agent features:
> -Fully automatic management of colo
> -Handles many failures: hanging/crashing qemu, replication error, disk error, ...
> -Recovers from hanging qemu by using the "yank" oob command
> -Tracks which node has up-to-date data
> -Works well in clusters with more than 2 nodes
>
> Run times on my laptop:
> Quick test: 200s
> Network test: 800s (tagged as slow)
> Stress test: 1300s (tagged as slow)
>
> The test suite needs access to a network bridge to properly test the network,
> so some parameters need to be given to the test run. See
> tests/acceptance/colo.py for more information.
>
> I wonder how this integrates in existing CI infrastructure. Is there a common
> CI for qemu where this can run or does every subsystem have to run their
> own CI?
Wow~ Very happy to see this series.
I have checked the "how to" in tests/acceptance/colo.py,
but it doesn't look like enough for users. Can you write an independent document for this series?
It could include a test-infrastructure ASCII diagram, the test case design, a detailed how-to, and
more information on the pacemaker cluster and resource agent, etc.
Thanks
Zhang Chen
>
> Regards,
> Lukas Straub
>
>
> Lukas Straub (5):
> block/quorum.c: stable children names
> colo: Introduce resource agent
> colo: Introduce high-level test suite
> configure,Makefile: Install colo resource-agent
> MAINTAINERS: Add myself as maintainer for COLO resource agent
>
> MAINTAINERS | 6 +
> Makefile | 5 +
> block/quorum.c | 20 +-
> configure | 10 +
> scripts/colo-resource-agent/colo | 1429 ++++++++++++++++++++++
> scripts/colo-resource-agent/crm_master | 44 +
> scripts/colo-resource-agent/crm_resource | 12 +
> tests/acceptance/colo.py | 689 +++++++++++
> 8 files changed, 2209 insertions(+), 6 deletions(-)
> create mode 100755 scripts/colo-resource-agent/colo
> create mode 100755 scripts/colo-resource-agent/crm_master
> create mode 100755 scripts/colo-resource-agent/crm_resource
> create mode 100644 tests/acceptance/colo.py
>
> --
> 2.20.1
* RE: [PATCH 1/5] block/quorum.c: stable children names
2020-05-11 12:26 ` [PATCH 1/5] block/quorum.c: stable children names Lukas Straub
@ 2020-06-02 1:01 ` Zhang, Chen
2020-06-02 11:07 ` Alberto Garcia
1 sibling, 0 replies; 14+ messages in thread
From: Zhang, Chen @ 2020-06-02 1:01 UTC (permalink / raw)
To: Lukas Straub, qemu-devel; +Cc: Alberto Garcia, Dr. David Alan Gilbert
> -----Original Message-----
> From: Lukas Straub <lukasstraub2@web.de>
> Sent: Monday, May 11, 2020 8:27 PM
> To: qemu-devel <qemu-devel@nongnu.org>
> Cc: Alberto Garcia <berto@igalia.com>; Dr. David Alan Gilbert
> <dgilbert@redhat.com>; Zhang, Chen <chen.zhang@intel.com>
> Subject: [PATCH 1/5] block/quorum.c: stable children names
>
> If we remove the child with the highest index from the quorum, decrement
> s->next_child_index. This way we get stable children names as long as we
> only remove the last child.
>
Looks good to me, and it can solve this bug:
colo: Can not recover colo after svm failover twice
https://bugs.launchpad.net/bugs/1881231
Reviewed-by: Zhang Chen <chen.zhang@intel.com>
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
> block/quorum.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/block/quorum.c b/block/quorum.c
> index 6d7a56bd93..acfa09c2cc 100644
> --- a/block/quorum.c
> +++ b/block/quorum.c
> @@ -29,6 +29,8 @@
>
>  #define HASH_LENGTH 32
>
> +#define INDEXSTR_LEN 32
> +
>  #define QUORUM_OPT_VOTE_THRESHOLD "vote-threshold"
>  #define QUORUM_OPT_BLKVERIFY "blkverify"
>  #define QUORUM_OPT_REWRITE "rewrite-corrupted"
> @@ -972,9 +974,9 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
>      opened = g_new0(bool, s->num_children);
>
>      for (i = 0; i < s->num_children; i++) {
> -        char indexstr[32];
> -        ret = snprintf(indexstr, 32, "children.%d", i);
> -        assert(ret < 32);
> +        char indexstr[INDEXSTR_LEN];
> +        ret = snprintf(indexstr, INDEXSTR_LEN, "children.%d", i);
> +        assert(ret < INDEXSTR_LEN);
>
>          s->children[i] = bdrv_open_child(NULL, options, indexstr, bs,
>                                           &child_format, false, &local_err);
> @@ -1026,7 +1028,7 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
>  {
>      BDRVQuorumState *s = bs->opaque;
>      BdrvChild *child;
> -    char indexstr[32];
> +    char indexstr[INDEXSTR_LEN];
>      int ret;
>
>      if (s->is_blkverify) {
> @@ -1041,8 +1043,8 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
>          return;
>      }
>
> -    ret = snprintf(indexstr, 32, "children.%u", s->next_child_index);
> -    if (ret < 0 || ret >= 32) {
> +    ret = snprintf(indexstr, INDEXSTR_LEN, "children.%u", s->next_child_index);
> +    if (ret < 0 || ret >= INDEXSTR_LEN) {
>          error_setg(errp, "cannot generate child name");
>          return;
>      }
> @@ -1069,6 +1071,7 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
>                               Error **errp)
>  {
>      BDRVQuorumState *s = bs->opaque;
> +    char indexstr[INDEXSTR_LEN];
>      int i;
>
>      for (i = 0; i < s->num_children; i++) {
> @@ -1090,6 +1093,11 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
>      /* We know now that num_children > threshold, so blkverify must be false */
>      assert(!s->is_blkverify);
>
> +    snprintf(indexstr, INDEXSTR_LEN, "children.%u", s->next_child_index - 1);
> +    if (!strncmp(child->name, indexstr, INDEXSTR_LEN)) {
> +        s->next_child_index--;
> +    }
> +
>      bdrv_drained_begin(bs);
>
>      /* We can safely remove this child now */
> --
> 2.20.1
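To see why the decrement makes the names stable, here is a toy Python model of the children.%d naming scheme from the patch (illustrative only, not QEMU code): removing the highest-indexed child frees its name, so the next add reuses it instead of allocating children.3.

```python
class ToyQuorum:
    """Models only the child-naming logic of block/quorum.c."""

    def __init__(self, num_children):
        self.children = ["children.%d" % i for i in range(num_children)]
        self.next_child_index = num_children

    def add_child(self):
        # Names are always allocated from next_child_index, as in
        # quorum_add_child().
        name = "children.%d" % self.next_child_index
        self.next_child_index += 1
        self.children.append(name)
        return name

    def del_child(self, name):
        # The fix: if the removed child is the highest-indexed one,
        # decrement next_child_index so its name can be reused.
        if name == "children.%d" % (self.next_child_index - 1):
            self.next_child_index -= 1
        self.children.remove(name)

q = ToyQuorum(2)                        # children.0, children.1
name = q.add_child()                    # children.2
q.del_child(name)                       # frees index 2 again
assert q.add_child() == "children.2"    # stable name after remove/re-add
```

Without the decrement, the final add would have produced children.3, which is what broke repeated secondary re-attachment in the failover-twice scenario.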
* Re: [PATCH 1/5] block/quorum.c: stable children names
2020-05-11 12:26 ` [PATCH 1/5] block/quorum.c: stable children names Lukas Straub
2020-06-02 1:01 ` Zhang, Chen
@ 2020-06-02 11:07 ` Alberto Garcia
1 sibling, 0 replies; 14+ messages in thread
From: Alberto Garcia @ 2020-06-02 11:07 UTC (permalink / raw)
To: Lukas Straub, qemu-devel; +Cc: Zhang Chen, Dr. David Alan Gilbert
On Mon 11 May 2020 02:26:54 PM CEST, Lukas Straub wrote:
> If we remove the child with the highest index from the quorum,
> decrement s->next_child_index. This way we get stable children
> names as long as we only remove the last child.
>
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Berto
* Re: [PATCH 3/5] colo: Introduce high-level test suite
2020-05-11 12:27 ` [PATCH 3/5] colo: Introduce high-level test suite Lukas Straub
@ 2020-06-02 12:19 ` Philippe Mathieu-Daudé
2020-06-04 10:55 ` Lukas Straub
0 siblings, 1 reply; 14+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-06-02 12:19 UTC (permalink / raw)
To: Lukas Straub, qemu-devel
Cc: Zhang Chen, Alberto Garcia, Dr. David Alan Gilbert,
Wainer dos Santos Moschetta, Cleber Rosa
+Cleber/Wainer
On 5/11/20 2:27 PM, Lukas Straub wrote:
> Add high-level test relying on the colo resource-agent to test
> all failover cases while checking guest network connectivity.
>
> Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> ---
> scripts/colo-resource-agent/crm_master | 44 ++
> scripts/colo-resource-agent/crm_resource | 12 +
> tests/acceptance/colo.py | 689 +++++++++++++++++++++++
> 3 files changed, 745 insertions(+)
> create mode 100755 scripts/colo-resource-agent/crm_master
> create mode 100755 scripts/colo-resource-agent/crm_resource
> create mode 100644 tests/acceptance/colo.py
>
> diff --git a/scripts/colo-resource-agent/crm_master b/scripts/colo-resource-agent/crm_master
> new file mode 100755
> index 0000000000..886f523bda
> --- /dev/null
> +++ b/scripts/colo-resource-agent/crm_master
> @@ -0,0 +1,44 @@
> +#!/bin/bash
> +
> +# Fake crm_master for COLO testing
> +#
> +# Copyright (c) Lukas Straub <lukasstraub2@web.de>
> +#
> +# This work is licensed under the terms of the GNU GPL, version 2 or
> +# later. See the COPYING file in the top-level directory.
> +
> +TMPDIR="$HA_RSCTMP"
> +score=0
> +query=0
> +
> +OPTIND=1
> +while getopts 'Qql:Dv:N:G' opt; do
> + case "$opt" in
> + Q|q)
> + # Noop
> + ;;
> + "l")
> + # Noop
> + ;;
> + "D")
> + score=0
> + ;;
> + "v")
> + score=$OPTARG
> + ;;
> + "N")
> + TMPDIR="$COLO_TEST_REMOTE_TMP"
> + ;;
> + "G")
> + query=1
> + ;;
> + esac
> +done
> +
> +if (( query )); then
> + cat "${TMPDIR}/master_score" || exit 1
> +else
> + echo $score > "${TMPDIR}/master_score" || exit 1
> +fi
> +
> +exit 0
> diff --git a/scripts/colo-resource-agent/crm_resource b/scripts/colo-resource-agent/crm_resource
> new file mode 100755
> index 0000000000..ad69ff3c6b
> --- /dev/null
> +++ b/scripts/colo-resource-agent/crm_resource
> @@ -0,0 +1,12 @@
> +#!/bin/sh
> +
> +# Fake crm_resource for COLO testing
> +#
> +# Copyright (c) Lukas Straub <lukasstraub2@web.de>
> +#
> +# This work is licensed under the terms of the GNU GPL, version 2 or
> +# later. See the COPYING file in the top-level directory.
> +
> +# Noop
> +
> +exit 0
> diff --git a/tests/acceptance/colo.py b/tests/acceptance/colo.py
> new file mode 100644
> index 0000000000..465513fb6c
> --- /dev/null
> +++ b/tests/acceptance/colo.py
> @@ -0,0 +1,689 @@
> +# High-level test suite for qemu COLO testing all failover cases while checking
> +# guest network connectivity
> +#
> +# Copyright (c) Lukas Straub <lukasstraub2@web.de>
> +#
> +# This work is licensed under the terms of the GNU GPL, version 2 or
> +# later. See the COPYING file in the top-level directory.
> +
> +# HOWTO:
> +#
> +# This test has the following parameters:
> +# bridge_name: name of the bridge interface to connect qemu to
> +# host_address: ip address of the bridge interface
> +# guest_address: ip address that the guest gets from the dhcp server
> +# bridge_helper: path to the bridge helper
> +# (default: /usr/lib/qemu/qemu-bridge-helper)
> +# install_cmd: command to run to install iperf3 and memtester in the guest
> +# (default: "sudo -n dnf -q -y install iperf3 memtester")
> +#
> +# To run the network tests, you have to specify the parameters.
> +#
> +# Example for running the colo tests:
> +# make check-acceptance FEDORA_31_ARCHES="x86_64" AVOCADO_TAGS="-t colo \
> +# -p bridge_name=br0 -p host_address=192.168.220.1 \
> +# -p guest_address=192.168.220.222"
> +#
> +# The colo tests currently only use x86_64 test vm images. With the
> +# FEDORA_31_ARCHES make variable as in the example, only the x86_64 images will
> +# be downloaded.
> +#
> +# If you're running the network tests as an unprivileged user, you need to set
> +# the suid bit on the bridge helper (chmod +s <bridge-helper>).
> +#
> +# The DHCP server should assign a static IP to the guest, else the test may be
> +# unreliable. The MAC address for the guest is always 52:54:00:12:34:56.
> +
> +
> +import select
> +import sys
> +import subprocess
> +import shutil
> +import os
> +import signal
> +import os.path
> +import time
> +import tempfile
> +
> +from avocado import Test
> +from avocado import skipUnless
> +from avocado.utils import network
> +from avocado.utils import vmimage
> +from avocado.utils import cloudinit
> +from avocado.utils import ssh
> +from avocado.utils.path import find_command, CmdNotFoundError
> +
> +from avocado_qemu import pick_default_qemu_bin, BUILD_DIR, SOURCE_DIR
> +from qemu.qmp import QEMUMonitorProtocol
> +
> +def iperf3_available():
> + try:
> + find_command("iperf3")
> + except CmdNotFoundError:
> + return False
> + return True
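As an aside, the try/except around find_command could be avoided entirely with
the stdlib; a minimal sketch of an equivalent check using shutil.which instead
of avocado's API:

```python
import shutil

def iperf3_available():
    # shutil.which returns the path to the executable, or None if no
    # "iperf3" binary is found on PATH, so no exception handling is needed
    return shutil.which("iperf3") is not None
```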
> +
> +class ColoTest(Test):
> +
> + # Constants
> + OCF_SUCCESS = 0
> + OCF_ERR_GENERIC = 1
> + OCF_ERR_ARGS = 2
> + OCF_ERR_UNIMPLEMENTED = 3
> + OCF_ERR_PERM = 4
> + OCF_ERR_INSTALLED = 5
> + OCF_ERR_CONFIGURED = 6
> + OCF_NOT_RUNNING = 7
> + OCF_RUNNING_MASTER = 8
> + OCF_FAILED_MASTER = 9
> +
> + HOSTA = 10
> + HOSTB = 11
> +
> + QEMU_OPTIONS = (" -display none -vga none -enable-kvm"
> + " -smp 2 -cpu host -m 768"
> + " -device e1000,mac=52:54:00:12:34:56,netdev=hn0"
> + " -device virtio-blk,drive=colo-disk0")
> +
> + FEDORA_VERSION = "31"
> + IMAGE_CHECKSUM = "e3c1b309d9203604922d6e255c2c5d098a309c2d46215d8fc026954f3c5c27a0"
> + IMAGE_SIZE = "4294967296b"
> +
> + hang_qemu = False
> + checkpoint_failover = False
> + traffic_procs = []
> +
> + def get_image(self, temp_dir):
> + try:
> + return vmimage.get(
> + "fedora", arch="x86_64", version=self.FEDORA_VERSION,
> + checksum=self.IMAGE_CHECKSUM, algorithm="sha256",
> + cache_dir=self.cache_dirs[0],
> + snapshot_dir=temp_dir)
> + except:
> + self.cancel("Failed to download/prepare image")
> +
> + @skipUnless(ssh.SSH_CLIENT_BINARY, "No SSH client available")
> + def setUp(self):
> + # Qemu and qemu-img binary
> + default_qemu_bin = pick_default_qemu_bin()
> + self.QEMU_BINARY = self.params.get("qemu_bin", default=default_qemu_bin)
> +
> + # If qemu-img has been built, use it, otherwise the system wide one
> + # will be used. If none is available, the test will cancel.
> + qemu_img = os.path.join(BUILD_DIR, "qemu-img")
> + if not os.path.exists(qemu_img):
> + qemu_img = find_command("qemu-img", False)
> + if qemu_img is False:
> + self.cancel("Could not find \"qemu-img\", which is required to "
> + "create the bootable image")
> + self.QEMU_IMG_BINARY = qemu_img
Probably worth refactoring that as a re-usable pick_qemuimg_bin(), or
something better named?
Similarly with BRIDGE_HELPER... We need a generic class to get binaries
from the environment or the build dir.
> + vmimage.QEMU_IMG = qemu_img
> +
> + self.RESOURCE_AGENT = os.path.join(SOURCE_DIR,
> + "scripts/colo-resource-agent/colo")
> + self.ADD_PATH = os.path.join(SOURCE_DIR, "scripts/colo-resource-agent")
> +
> + # Logs
> + self.RA_LOG = os.path.join(self.outputdir, "resource-agent.log")
> + self.HOSTA_LOGDIR = os.path.join(self.outputdir, "hosta")
> + self.HOSTB_LOGDIR = os.path.join(self.outputdir, "hostb")
> + os.makedirs(self.HOSTA_LOGDIR)
> + os.makedirs(self.HOSTB_LOGDIR)
> +
> + # Temporary directories
> + # We don't use self.workdir because of unix socket path length
> + # limitations
> + self.TMPDIR = tempfile.mkdtemp()
> + self.TMPA = os.path.join(self.TMPDIR, "hosta")
> + self.TMPB = os.path.join(self.TMPDIR, "hostb")
> + os.makedirs(self.TMPA)
> + os.makedirs(self.TMPB)
> +
> + # Network
> + self.BRIDGE_NAME = self.params.get("bridge_name")
> + if self.BRIDGE_NAME:
> + self.HOST_ADDRESS = self.params.get("host_address")
> + self.GUEST_ADDRESS = self.params.get("guest_address")
> + self.BRIDGE_HELPER = self.params.get("bridge_helper",
> + default="/usr/lib/qemu/qemu-bridge-helper")
> + self.SSH_PORT = 22
> + else:
> + # QEMU's hard coded usermode router address
> + self.HOST_ADDRESS = "10.0.2.2"
> + self.GUEST_ADDRESS = "127.0.0.1"
> + self.BRIDGE_HOSTA_PORT = network.find_free_port(address="127.0.0.1")
> + self.BRIDGE_HOSTB_PORT = network.find_free_port(address="127.0.0.1")
> + self.SSH_PORT = network.find_free_port(address="127.0.0.1")
> +
> + self.CLOUDINIT_HOME_PORT = network.find_free_port()
> +
> + # Find free port range
> + base_port = 1024
> + while True:
> + base_port = network.find_free_port(start_port=base_port,
> + address="127.0.0.1")
> +            if base_port is None:
> +                self.cancel("Failed to find a free port")
> +            for n in range(base_port, base_port + 4):
> + if n > 65535:
> + break
> + if not network.is_port_free(n, "127.0.0.1"):
> + break
> + else:
> + # for loop above didn't break
> + break
> +
> + self.BASE_PORT = base_port
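The while/for/else construct above is correct but takes a moment to parse.
The same search for a run of four consecutive free ports can be sketched
without for/else; the helper names here are hypothetical, not avocado's
network module:

```python
import socket

def port_is_free(port, address="127.0.0.1"):
    # A successful bind means the port is free right now (inherently racy)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((address, port))
            return True
        except OSError:
            return False

def find_free_port_range(count, start=1024, end=65535):
    # Return the first port p such that p .. p+count-1 are all free,
    # or None if no such run exists at or below `end`
    port = start
    while port + count - 1 <= end:
        if all(port_is_free(p) for p in range(port, port + count)):
            return port
        port += 1
    return None
```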
> +
> + # Disk images
> + self.log.info("Downloading/preparing boot image")
> + self.HOSTA_IMAGE = self.get_image(self.TMPA).path
> + self.HOSTB_IMAGE = self.get_image(self.TMPB).path
> + self.CLOUDINIT_ISO = os.path.join(self.TMPDIR, "cloudinit.iso")
> +
> + self.log.info("Preparing cloudinit image")
> + try:
> + cloudinit.iso(self.CLOUDINIT_ISO, self.name,
> + username="test", password="password",
> + phone_home_host=self.HOST_ADDRESS,
> + phone_home_port=self.CLOUDINIT_HOME_PORT)
> + except Exception as e:
> + self.cancel("Failed to prepare cloudinit image")
> +
> + self.QEMU_OPTIONS += " -cdrom %s" % self.CLOUDINIT_ISO
> +
> + # Network bridge
> + if not self.BRIDGE_NAME:
> + self.BRIDGE_PIDFILE = os.path.join(self.TMPDIR, "bridge.pid")
> + self.run_command(("'%s' -pidfile '%s'"
> + " -M none -display none -daemonize"
> + " -netdev user,id=host,hostfwd=tcp:127.0.0.1:%s-:22"
> + " -netdev socket,id=hosta,listen=127.0.0.1:%s"
> + " -netdev socket,id=hostb,listen=127.0.0.1:%s"
> + " -netdev hubport,id=hostport,hubid=0,netdev=host"
> + " -netdev hubport,id=porta,hubid=0,netdev=hosta"
> + " -netdev hubport,id=portb,hubid=0,netdev=hostb")
> + % (self.QEMU_BINARY, self.BRIDGE_PIDFILE, self.SSH_PORT,
> + self.BRIDGE_HOSTA_PORT, self.BRIDGE_HOSTB_PORT), 0)
> +
> + def tearDown(self):
> + try:
> + pid = self.read_pidfile(self.BRIDGE_PIDFILE)
> + if pid and self.check_pid(pid):
> + os.kill(pid, signal.SIGKILL)
> + except Exception as e:
> + pass
> +
> + try:
> + self.ra_stop(self.HOSTA)
> + except Exception as e:
> + pass
> +
> + try:
> + self.ra_stop(self.HOSTB)
> + except Exception as e:
> + pass
> +
> + try:
> + self.ssh_close()
> + except Exception as e:
> + pass
> +
> + for proc in self.traffic_procs:
> + try:
> + os.killpg(proc.pid, signal.SIGTERM)
> + except Exception as e:
> + pass
> +
> + shutil.rmtree(self.TMPDIR)
> +
> + def run_command(self, cmdline, expected_status, env=None):
> + proc = subprocess.Popen(cmdline, shell=True, stdout=subprocess.PIPE,
> + stderr=subprocess.STDOUT,
> + universal_newlines=True, env=env)
> + stdout, stderr = proc.communicate()
> + if proc.returncode != expected_status:
> + self.fail("command \"%s\" failed with code %s:\n%s"
> + % (cmdline, proc.returncode, stdout))
> +
> + return proc.returncode
> +
> + def cat_line(self, path):
> + line=""
> + try:
> + fd = open(path, "r")
> + line = str.strip(fd.readline())
> + fd.close()
> + except:
> + pass
> + return line
> +
> + def read_pidfile(self, pidfile):
> + try:
> + pid = int(self.cat_line(pidfile))
> + except ValueError:
> + return None
> + else:
> + return pid
> +
> + def check_pid(self, pid):
> + try:
> + os.kill(pid, 0)
> + except OSError:
> + return False
> + else:
> + return True
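Catching OSError here conflates ESRCH (no such process) with EPERM (process
exists but is owned by another user), so a live foreign process would be
reported as dead. That shouldn't matter in this test since qemu runs as the
same user, but a sketch that distinguishes the two cases:

```python
import os

def pid_alive(pid):
    # Signal 0 performs existence and permission checks only;
    # no signal is actually delivered to the target process.
    try:
        os.kill(pid, 0)
    except ProcessLookupError:   # ESRCH: pid does not exist
        return False
    except PermissionError:      # EPERM: pid exists under another user
        return True
    return True
```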
> +
> + def ssh_open(self):
> + self.ssh_conn = ssh.Session(self.GUEST_ADDRESS, self.SSH_PORT,
> + user="test", password="password")
> + self.ssh_conn.DEFAULT_OPTIONS += (("UserKnownHostsFile", "/dev/null"),)
> + for i in range(10):
> + try:
> + if self.ssh_conn.connect():
> + return
> + except Exception as e:
> + pass
> + time.sleep(4)
> + self.fail("sshd timeout")
> +
> + def ssh_ping(self):
> + if self.ssh_conn.cmd("echo ping").stdout != b"ping\n":
> + self.fail("ssh ping failed")
> +
> + def ssh_close(self):
> + self.ssh_conn.quit()
> +
> + def setup_base_env(self, host):
> + PATH = os.getenv("PATH", "")
> + env = { "PATH": "%s:%s" % (self.ADD_PATH, PATH),
> + "HA_LOGFILE": self.RA_LOG,
> + "OCF_RESOURCE_INSTANCE": "colo-test",
> + "OCF_RESKEY_CRM_meta_clone_max": "2",
> + "OCF_RESKEY_CRM_meta_notify": "true",
> + "OCF_RESKEY_CRM_meta_timeout": "30000",
> + "OCF_RESKEY_qemu_binary": self.QEMU_BINARY,
> + "OCF_RESKEY_qemu_img_binary": self.QEMU_IMG_BINARY,
> + "OCF_RESKEY_disk_size": str(self.IMAGE_SIZE),
We can remove IMAGE_SIZE and use the file size of HOSTA_IMAGE at runtime
instead.
> + "OCF_RESKEY_checkpoint_interval": "10000",
> + "OCF_RESKEY_base_port": str(self.BASE_PORT),
> + "OCF_RESKEY_debug": "2"}
> +
> + if host == self.HOSTA:
> + env.update({"OCF_RESKEY_options":
> + ("%s -qmp unix:%s,server,nowait"
> + " -drive if=none,id=parent0,file='%s'")
> + % (self.QEMU_OPTIONS, self.get_qmp_sock(host),
> + self.HOSTA_IMAGE),
> + "OCF_RESKEY_active_hidden_dir": self.TMPA,
> + "OCF_RESKEY_listen_address": "127.0.0.1",
> + "OCF_RESKEY_log_dir": self.HOSTA_LOGDIR,
> + "OCF_RESKEY_CRM_meta_on_node": "127.0.0.1",
> + "HA_RSCTMP": self.TMPA,
> + "COLO_TEST_REMOTE_TMP": self.TMPB})
> +
> + else:
> + env.update({"OCF_RESKEY_options":
> + ("%s -qmp unix:%s,server,nowait"
> + " -drive if=none,id=parent0,file='%s'")
> + % (self.QEMU_OPTIONS, self.get_qmp_sock(host),
> + self.HOSTB_IMAGE),
> + "OCF_RESKEY_active_hidden_dir": self.TMPB,
> + "OCF_RESKEY_listen_address": "127.0.0.2",
> + "OCF_RESKEY_log_dir": self.HOSTB_LOGDIR,
> + "OCF_RESKEY_CRM_meta_on_node": "127.0.0.2",
> + "HA_RSCTMP": self.TMPB,
> + "COLO_TEST_REMOTE_TMP": self.TMPA})
> +
> + if self.BRIDGE_NAME:
> + env["OCF_RESKEY_options"] += \
> + " -netdev bridge,id=hn0,br=%s,helper='%s'" \
> + % (self.BRIDGE_NAME, self.BRIDGE_HELPER)
> + else:
> + if host == self.HOSTA:
> + env["OCF_RESKEY_options"] += \
> + " -netdev socket,id=hn0,connect=127.0.0.1:%s" \
> + % self.BRIDGE_HOSTA_PORT
> + else:
> + env["OCF_RESKEY_options"] += \
> + " -netdev socket,id=hn0,connect=127.0.0.1:%s" \
> + % self.BRIDGE_HOSTB_PORT
> + return env
> +
> + def ra_start(self, host):
> + env = self.setup_base_env(host)
> + self.run_command(self.RESOURCE_AGENT + " start", self.OCF_SUCCESS, env)
> +
> + def ra_stop(self, host):
> + env = self.setup_base_env(host)
> + self.run_command(self.RESOURCE_AGENT + " stop", self.OCF_SUCCESS, env)
> +
> + def ra_monitor(self, host, expected_status):
> + env = self.setup_base_env(host)
> + self.run_command(self.RESOURCE_AGENT + " monitor", expected_status, env)
> +
> + def ra_promote(self, host):
> + env = self.setup_base_env(host)
> + self.run_command(self.RESOURCE_AGENT + " promote", self.OCF_SUCCESS,env)
> +
> + def ra_notify_start(self, host):
> + env = self.setup_base_env(host)
> +
> + env.update({"OCF_RESKEY_CRM_meta_notify_type": "post",
> + "OCF_RESKEY_CRM_meta_notify_operation": "start"})
> +
> + if host == self.HOSTA:
> + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.1",
> + "OCF_RESKEY_CRM_meta_notify_start_uname": "127.0.0.2"})
> + else:
> + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.2",
> + "OCF_RESKEY_CRM_meta_notify_start_uname": "127.0.0.1"})
> +
> + self.run_command(self.RESOURCE_AGENT + " notify", self.OCF_SUCCESS, env)
> +
> + def ra_notify_stop(self, host):
> + env = self.setup_base_env(host)
> +
> + env.update({"OCF_RESKEY_CRM_meta_notify_type": "pre",
> + "OCF_RESKEY_CRM_meta_notify_operation": "stop"})
> +
> + if host == self.HOSTA:
> + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.1",
> + "OCF_RESKEY_CRM_meta_notify_stop_uname": "127.0.0.2"})
> + else:
> + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.2",
> + "OCF_RESKEY_CRM_meta_notify_stop_uname": "127.0.0.1"})
> +
> + self.run_command(self.RESOURCE_AGENT + " notify", self.OCF_SUCCESS, env)
> +
> + def get_pid(self, host):
> + if host == self.HOSTA:
> + return self.read_pidfile(os.path.join(self.TMPA,
> + "colo-test-qemu.pid"))
> + else:
> + return self.read_pidfile(os.path.join(self.TMPB,
> + "colo-test-qemu.pid"))
> +
> + def get_master_score(self, host):
> + if host == self.HOSTA:
> + return int(self.cat_line(os.path.join(self.TMPA, "master_score")))
> + else:
> + return int(self.cat_line(os.path.join(self.TMPB, "master_score")))
> +
> + def get_qmp_sock(self, host):
> + if host == self.HOSTA:
> + return os.path.join(self.TMPA, "my-qmp.sock")
> + else:
> + return os.path.join(self.TMPB, "my-qmp.sock")
> +
> +
> + def kill_qemu_pre(self, host):
> + pid = self.get_pid(host)
> +
> + qmp_sock = self.get_qmp_sock(host)
> +
> + if self.checkpoint_failover:
> + qmp_conn = QEMUMonitorProtocol(qmp_sock)
> + qmp_conn.settimeout(10)
> + qmp_conn.connect()
> + while True:
> + event = qmp_conn.pull_event(wait=True)
> + if event["event"] == "STOP":
> + break
> + qmp_conn.close()
> +
> +
> + if pid and self.check_pid(pid):
> + if self.hang_qemu:
> + os.kill(pid, signal.SIGSTOP)
> + else:
> + os.kill(pid, signal.SIGKILL)
> + while self.check_pid(pid):
> + time.sleep(1)
> +
> + def kill_qemu_post(self, host):
> + pid = self.get_pid(host)
> +
> + if self.hang_qemu and pid and self.check_pid(pid):
> + os.kill(pid, signal.SIGKILL)
> + while self.check_pid(pid):
> + time.sleep(1)
> +
> + def prepare_guest(self):
> + pass
> +
> + def cycle_start(self, cycle):
> + pass
> +
> + def active_section(self):
> + return False
> +
> + def cycle_end(self, cycle):
> + pass
> +
> + def check_connection(self):
> + self.ssh_ping()
> + for proc in self.traffic_procs:
> +            if proc.poll() is not None:
> + self.fail("Traffic process died")
> +
> + def _test_colo(self, loop=1):
> + loop = max(loop, 1)
> + self.log.info("Will put logs in %s" % self.outputdir)
> +
> + self.ra_stop(self.HOSTA)
> + self.ra_stop(self.HOSTB)
> +
> + self.log.info("*** Startup ***")
> + self.ra_start(self.HOSTA)
> + self.ra_start(self.HOSTB)
> +
> + self.ra_monitor(self.HOSTA, self.OCF_SUCCESS)
> + self.ra_monitor(self.HOSTB, self.OCF_SUCCESS)
> +
> + self.log.info("*** Promoting ***")
> + self.ra_promote(self.HOSTA)
> + cloudinit.wait_for_phone_home(("0.0.0.0", self.CLOUDINIT_HOME_PORT),
> + self.name)
> + self.ssh_open()
> + self.prepare_guest()
> +
> + self.ra_notify_start(self.HOSTA)
> +
> + while self.get_master_score(self.HOSTB) != 100:
> + self.ra_monitor(self.HOSTA, self.OCF_RUNNING_MASTER)
> + self.ra_monitor(self.HOSTB, self.OCF_SUCCESS)
> + time.sleep(1)
> +
> + self.log.info("*** Replication started ***")
> +
> + self.check_connection()
> +
> + primary = self.HOSTA
> + secondary = self.HOSTB
> +
> + for n in range(loop):
> + self.cycle_start(n)
> + self.log.info("*** Secondary failover ***")
> + self.kill_qemu_pre(primary)
> + self.ra_notify_stop(secondary)
> + self.ra_monitor(secondary, self.OCF_SUCCESS)
> + self.ra_promote(secondary)
> + self.ra_monitor(secondary, self.OCF_RUNNING_MASTER)
> + self.kill_qemu_post(primary)
> +
> + self.check_connection()
> +
> + tmp = primary
> + primary = secondary
> + secondary = tmp
> +
> + self.log.info("*** Secondary continue replication ***")
> + self.ra_start(secondary)
> + self.ra_notify_start(primary)
> +
> + self.check_connection()
> +
> + # Wait for resync
> + while self.get_master_score(secondary) != 100:
> + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> + self.ra_monitor(secondary, self.OCF_SUCCESS)
> + time.sleep(1)
> +
> + self.log.info("*** Replication started ***")
> +
> + self.check_connection()
> +
> + if self.active_section():
> + break
> +
> + self.log.info("*** Primary failover ***")
> + self.kill_qemu_pre(secondary)
> + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> + self.ra_notify_stop(primary)
> + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> + self.kill_qemu_post(secondary)
> +
> + self.check_connection()
> +
> + self.log.info("*** Primary continue replication ***")
> + self.ra_start(secondary)
> + self.ra_notify_start(primary)
> +
> + self.check_connection()
> +
> + # Wait for resync
> + while self.get_master_score(secondary) != 100:
> + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> + self.ra_monitor(secondary, self.OCF_SUCCESS)
> + time.sleep(1)
> +
> + self.log.info("*** Replication started ***")
> +
> + self.check_connection()
> +
> + self.cycle_end(n)
Interesting test :)
> +
> + self.ssh_close()
> +
> + self.ra_stop(self.HOSTA)
> + self.ra_stop(self.HOSTB)
> +
> + self.ra_monitor(self.HOSTA, self.OCF_NOT_RUNNING)
> + self.ra_monitor(self.HOSTB, self.OCF_NOT_RUNNING)
> + self.log.info("*** all ok ***")
> +
> +
> +class ColoQuickTest(ColoTest):
> + """
> + :avocado: tags=colo
> + :avocado: tags=quick
> + :avocado: tags=arch:x86_64
> + """
> +
> + timeout = 300
> +
> + def cycle_end(self, cycle):
> + self.log.info("Testing with peer qemu hanging"
> + " and failover during checkpoint")
> + self.hang_qemu = True
> +
> + def test_quick(self):
> + self.checkpoint_failover = True
> + self.log.info("Testing with peer qemu crashing"
> + " and failover during checkpoint")
> + self._test_colo(loop=2)
> +
> +
> +class ColoNetworkTest(ColoTest):
> +
> + def prepare_guest(self):
> + install_cmd = self.params.get("install_cmd", default=
> + "sudo -n dnf -q -y install iperf3 memtester")
> + self.ssh_conn.cmd(install_cmd)
> + # Use two instances to work around a bug in iperf3
> + self.ssh_conn.cmd("iperf3 -sD; iperf3 -sD -p 5202")
> +
> + def _cycle_start(self, cycle):
> + pass
> +
> + def cycle_start(self, cycle):
> + self._cycle_start(cycle)
> + tests = [("", "client-to-server tcp"), ("-R", "server-to-client tcp")]
> +
> + self.log.info("Testing iperf %s" % tests[cycle % 2][1])
> + iperf_cmd = "iperf3 %s -t 60 -i 60 --connect-timeout 30000 -c %s" \
> + % (tests[cycle % 2][0], self.GUEST_ADDRESS)
> + proc = subprocess.Popen("while %s && %s; do sleep 1; done >>'%s' 2>&1"
> + % (iperf_cmd, iperf_cmd + " -p 5202",
> + os.path.join(self.outputdir, "iperf.log")),
> + shell=True, preexec_fn=os.setsid, stdin=subprocess.DEVNULL,
> + stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
So this runs on the host, requiring the host to be Linux with iperf3
installed. Don't we need to be privileged to run it?
> + self.traffic_procs.append(proc)
> + time.sleep(5) # Wait for iperf to get up to speed
> +
> + def cycle_end(self, cycle):
> + for proc in self.traffic_procs:
> + try:
> + os.killpg(proc.pid, signal.SIGTERM)
> + proc.wait()
> + except Exception as e:
> + pass
> + self.traffic_procs.clear()
> + time.sleep(20)
> +
> +class ColoRealNetworkTest(ColoNetworkTest):
> + """
> + :avocado: tags=colo
> + :avocado: tags=slow
> + :avocado: tags=network_test
> + :avocado: tags=arch:x86_64
> + """
> +
> + timeout = 900
> +
> + def active_section(self):
> + time.sleep(300)
> + return False
> +
> + @skipUnless(iperf3_available(), "iperf3 not available")
> + def test_network(self):
> + if not self.BRIDGE_NAME:
> + self.cancel("bridge options not given, will skip network test")
> + self.log.info("Testing with peer qemu crashing and network load")
> + self._test_colo(loop=2)
> +
> +class ColoStressTest(ColoNetworkTest):
> + """
> + :avocado: tags=colo
> + :avocado: tags=slow
> + :avocado: tags=stress_test
> + :avocado: tags=arch:x86_64
> + """
> +
> + timeout = 1800
How long does this test take on your hardware (which CPU, for comparison)?
> +
> + def _cycle_start(self, cycle):
> + if cycle == 4:
> + self.log.info("Stresstest with peer qemu hanging, network load"
> + " and failover during checkpoint")
> + self.checkpoint_failover = True
> + self.hang_qemu = True
> + elif cycle == 8:
> + self.log.info("Stresstest with peer qemu crashing and network load")
> + self.checkpoint_failover = False
> + self.hang_qemu = False
> + elif cycle == 12:
> + self.log.info("Stresstest with peer qemu hanging and network load")
> + self.checkpoint_failover = False
> + self.hang_qemu = True
> +
> + @skipUnless(iperf3_available(), "iperf3 not available")
> + def test_stress(self):
> + if not self.BRIDGE_NAME:
> + self.cancel("bridge options not given, will skip network test")
> + self.log.info("Stresstest with peer qemu crashing, network load"
> + " and failover during checkpoint")
> + self.checkpoint_failover = True
> + self._test_colo(loop=16)
>
* Re: [PATCH 3/5] colo: Introduce high-level test suite
2020-06-02 12:19 ` Philippe Mathieu-Daudé
@ 2020-06-04 10:55 ` Lukas Straub
0 siblings, 0 replies; 14+ messages in thread
From: Lukas Straub @ 2020-06-04 10:55 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: Alberto Garcia, qemu-devel, Wainer dos Santos Moschetta,
Dr. David Alan Gilbert, Zhang Chen, Cleber Rosa
On Tue, 2 Jun 2020 14:19:08 +0200
Philippe Mathieu-Daudé <philmd@redhat.com> wrote:
> +Cleber/Wainer
>
> On 5/11/20 2:27 PM, Lukas Straub wrote:
> > Add high-level test relying on the colo resource-agent to test
> > all failover cases while checking guest network connectivity.
> >
> > Signed-off-by: Lukas Straub <lukasstraub2@web.de>
> > ---
> > scripts/colo-resource-agent/crm_master | 44 ++
> > scripts/colo-resource-agent/crm_resource | 12 +
> > tests/acceptance/colo.py | 689 +++++++++++++++++++++++
> > 3 files changed, 745 insertions(+)
> > create mode 100755 scripts/colo-resource-agent/crm_master
> > create mode 100755 scripts/colo-resource-agent/crm_resource
> > create mode 100644 tests/acceptance/colo.py
> >
> > diff --git a/scripts/colo-resource-agent/crm_master b/scripts/colo-resource-agent/crm_master
> > new file mode 100755
> > index 0000000000..886f523bda
> > --- /dev/null
> > +++ b/scripts/colo-resource-agent/crm_master
> > @@ -0,0 +1,44 @@
> > +#!/bin/bash
> > +
> > +# Fake crm_master for COLO testing
> > +#
> > +# Copyright (c) Lukas Straub <lukasstraub2@web.de>
> > +#
> > +# This work is licensed under the terms of the GNU GPL, version 2 or
> > +# later. See the COPYING file in the top-level directory.
> > +
> > +TMPDIR="$HA_RSCTMP"
> > +score=0
> > +query=0
> > +
> > +OPTIND=1
> > +while getopts 'Qql:Dv:N:G' opt; do
> > + case "$opt" in
> > + Q|q)
> > + # Noop
> > + ;;
> > + "l")
> > + # Noop
> > + ;;
> > + "D")
> > + score=0
> > + ;;
> > + "v")
> > + score=$OPTARG
> > + ;;
> > + "N")
> > + TMPDIR="$COLO_TEST_REMOTE_TMP"
> > + ;;
> > + "G")
> > + query=1
> > + ;;
> > + esac
> > +done
> > +
> > +if (( query )); then
> > + cat "${TMPDIR}/master_score" || exit 1
> > +else
> > + echo $score > "${TMPDIR}/master_score" || exit 1
> > +fi
> > +
> > +exit 0
> > diff --git a/scripts/colo-resource-agent/crm_resource b/scripts/colo-resource-agent/crm_resource
> > new file mode 100755
> > index 0000000000..ad69ff3c6b
> > --- /dev/null
> > +++ b/scripts/colo-resource-agent/crm_resource
> > @@ -0,0 +1,12 @@
> > +#!/bin/sh
> > +
> > +# Fake crm_resource for COLO testing
> > +#
> > +# Copyright (c) Lukas Straub <lukasstraub2@web.de>
> > +#
> > +# This work is licensed under the terms of the GNU GPL, version 2 or
> > +# later. See the COPYING file in the top-level directory.
> > +
> > +# Noop
> > +
> > +exit 0
> > diff --git a/tests/acceptance/colo.py b/tests/acceptance/colo.py
> > new file mode 100644
> > index 0000000000..465513fb6c
> > --- /dev/null
> > +++ b/tests/acceptance/colo.py
> > @@ -0,0 +1,689 @@
> > +# High-level test suite for qemu COLO testing all failover cases while checking
> > +# guest network connectivity
> > +#
> > +# Copyright (c) Lukas Straub <lukasstraub2@web.de>
> > +#
> > +# This work is licensed under the terms of the GNU GPL, version 2 or
> > +# later. See the COPYING file in the top-level directory.
> > +
> > +# HOWTO:
> > +#
> > +# This test has the following parameters:
> > +# bridge_name: name of the bridge interface to connect qemu to
> > +# host_address: ip address of the bridge interface
> > +# guest_address: ip address that the guest gets from the dhcp server
> > +#     bridge_helper: path to the bridge helper
> > +# (default: /usr/lib/qemu/qemu-bridge-helper)
> > +# install_cmd: command to run to install iperf3 and memtester in the guest
> > +# (default: "sudo -n dnf -q -y install iperf3 memtester")
> > +#
> > +# To run the network tests, you have to specify the parameters.
> > +#
> > +# Example for running the colo tests:
> > +# make check-acceptance FEDORA_31_ARCHES="x86_64" AVOCADO_TAGS="-t colo \
> > +# -p bridge_name=br0 -p host_address=192.168.220.1 \
> > +# -p guest_address=192.168.220.222"
> > +#
> > +# The colo tests currently only use x86_64 test vm images. With the
> > +# FEDORA_31_ARCHES make variable as in the example, only the x86_64 images will
> > +# be downloaded.
> > +#
> > +# If you're running the network tests as an unprivileged user, you need to set
> > +# the suid bit on the bridge helper (chmod +s <bridge-helper>).
> > +#
> > +# The DHCP server should assign a static IP to the guest, else the test may be
> > +# unreliable. The MAC address for the guest is always 52:54:00:12:34:56.
> > +
> > +
> > +import select
> > +import sys
> > +import subprocess
> > +import shutil
> > +import os
> > +import signal
> > +import os.path
> > +import time
> > +import tempfile
> > +
> > +from avocado import Test
> > +from avocado import skipUnless
> > +from avocado.utils import network
> > +from avocado.utils import vmimage
> > +from avocado.utils import cloudinit
> > +from avocado.utils import ssh
> > +from avocado.utils.path import find_command, CmdNotFoundError
> > +
> > +from avocado_qemu import pick_default_qemu_bin, BUILD_DIR, SOURCE_DIR
> > +from qemu.qmp import QEMUMonitorProtocol
> > +
> > +def iperf3_available():
> > + try:
> > + find_command("iperf3")
> > + except CmdNotFoundError:
> > + return False
> > + return True
> > +
> > +class ColoTest(Test):
> > +
> > + # Constants
> > + OCF_SUCCESS = 0
> > + OCF_ERR_GENERIC = 1
> > + OCF_ERR_ARGS = 2
> > + OCF_ERR_UNIMPLEMENTED = 3
> > + OCF_ERR_PERM = 4
> > + OCF_ERR_INSTALLED = 5
> > + OCF_ERR_CONFIGURED = 6
> > + OCF_NOT_RUNNING = 7
> > + OCF_RUNNING_MASTER = 8
> > + OCF_FAILED_MASTER = 9
> > +
> > + HOSTA = 10
> > + HOSTB = 11
> > +
> > + QEMU_OPTIONS = (" -display none -vga none -enable-kvm"
> > + " -smp 2 -cpu host -m 768"
> > + " -device e1000,mac=52:54:00:12:34:56,netdev=hn0"
> > + " -device virtio-blk,drive=colo-disk0")
> > +
> > + FEDORA_VERSION = "31"
> > + IMAGE_CHECKSUM = "e3c1b309d9203604922d6e255c2c5d098a309c2d46215d8fc026954f3c5c27a0"
> > + IMAGE_SIZE = "4294967296b"
> > +
> > + hang_qemu = False
> > + checkpoint_failover = False
> > + traffic_procs = []
> > +
> > + def get_image(self, temp_dir):
> > + try:
> > + return vmimage.get(
> > + "fedora", arch="x86_64", version=self.FEDORA_VERSION,
> > + checksum=self.IMAGE_CHECKSUM, algorithm="sha256",
> > + cache_dir=self.cache_dirs[0],
> > + snapshot_dir=temp_dir)
> > + except:
> > + self.cancel("Failed to download/prepare image")
> > +
> > + @skipUnless(ssh.SSH_CLIENT_BINARY, "No SSH client available")
> > + def setUp(self):
> > + # Qemu and qemu-img binary
> > + default_qemu_bin = pick_default_qemu_bin()
> > + self.QEMU_BINARY = self.params.get("qemu_bin", default=default_qemu_bin)
> > +
> > + # If qemu-img has been built, use it, otherwise the system wide one
> > + # will be used. If none is available, the test will cancel.
> > + qemu_img = os.path.join(BUILD_DIR, "qemu-img")
> > + if not os.path.exists(qemu_img):
> > + qemu_img = find_command("qemu-img", False)
> > + if qemu_img is False:
> > + self.cancel("Could not find \"qemu-img\", which is required to "
> > + "create the bootable image")
> > + self.QEMU_IMG_BINARY = qemu_img
>
> Probably worth refactoring that as re-usable pick_qemuimg_bin() or
> better named?
>
> Similarly with BRIDGE_HELPER... We need a generic class to get binaries
> from environment or build dir.
Agreed, I think a new function pick_qemu_util(), similar to this code in
tests/acceptance/avocado_qemu/__init__.py, should suffice.
> > + vmimage.QEMU_IMG = qemu_img
> > +
> > + self.RESOURCE_AGENT = os.path.join(SOURCE_DIR,
> > + "scripts/colo-resource-agent/colo")
> > + self.ADD_PATH = os.path.join(SOURCE_DIR, "scripts/colo-resource-agent")
> > +
> > + # Logs
> > + self.RA_LOG = os.path.join(self.outputdir, "resource-agent.log")
> > + self.HOSTA_LOGDIR = os.path.join(self.outputdir, "hosta")
> > + self.HOSTB_LOGDIR = os.path.join(self.outputdir, "hostb")
> > + os.makedirs(self.HOSTA_LOGDIR)
> > + os.makedirs(self.HOSTB_LOGDIR)
> > +
> > + # Temporary directories
> > + # We don't use self.workdir because of unix socket path length
> > + # limitations
> > + self.TMPDIR = tempfile.mkdtemp()
> > + self.TMPA = os.path.join(self.TMPDIR, "hosta")
> > + self.TMPB = os.path.join(self.TMPDIR, "hostb")
> > + os.makedirs(self.TMPA)
> > + os.makedirs(self.TMPB)
> > +
> > + # Network
> > + self.BRIDGE_NAME = self.params.get("bridge_name")
> > + if self.BRIDGE_NAME:
> > + self.HOST_ADDRESS = self.params.get("host_address")
> > + self.GUEST_ADDRESS = self.params.get("guest_address")
> > + self.BRIDGE_HELPER = self.params.get("bridge_helper",
> > + default="/usr/lib/qemu/qemu-bridge-helper")
> > + self.SSH_PORT = 22
> > + else:
> > + # QEMU's hard coded usermode router address
> > + self.HOST_ADDRESS = "10.0.2.2"
> > + self.GUEST_ADDRESS = "127.0.0.1"
> > + self.BRIDGE_HOSTA_PORT = network.find_free_port(address="127.0.0.1")
> > + self.BRIDGE_HOSTB_PORT = network.find_free_port(address="127.0.0.1")
> > + self.SSH_PORT = network.find_free_port(address="127.0.0.1")
> > +
> > + self.CLOUDINIT_HOME_PORT = network.find_free_port()
> > +
> > + # Find free port range
> > + base_port = 1024
> > + while True:
> > + base_port = network.find_free_port(start_port=base_port,
> > + address="127.0.0.1")
> > +            if base_port is None:
> > +                self.cancel("Failed to find a free port")
> > +            for n in range(base_port, base_port + 4):
> > + if n > 65535:
> > + break
> > + if not network.is_port_free(n, "127.0.0.1"):
> > + break
> > + else:
> > + # for loop above didn't break
> > + break
> > +
> > + self.BASE_PORT = base_port
> > +
> > + # Disk images
> > + self.log.info("Downloading/preparing boot image")
> > + self.HOSTA_IMAGE = self.get_image(self.TMPA).path
> > + self.HOSTB_IMAGE = self.get_image(self.TMPB).path
> > + self.CLOUDINIT_ISO = os.path.join(self.TMPDIR, "cloudinit.iso")
> > +
> > + self.log.info("Preparing cloudinit image")
> > + try:
> > + cloudinit.iso(self.CLOUDINIT_ISO, self.name,
> > + username="test", password="password",
> > + phone_home_host=self.HOST_ADDRESS,
> > + phone_home_port=self.CLOUDINIT_HOME_PORT)
> > + except Exception as e:
> > + self.cancel("Failed to prepare cloudinit image")
> > +
> > + self.QEMU_OPTIONS += " -cdrom %s" % self.CLOUDINIT_ISO
> > +
> > + # Network bridge
> > + if not self.BRIDGE_NAME:
> > + self.BRIDGE_PIDFILE = os.path.join(self.TMPDIR, "bridge.pid")
> > + self.run_command(("'%s' -pidfile '%s'"
> > + " -M none -display none -daemonize"
> > + " -netdev user,id=host,hostfwd=tcp:127.0.0.1:%s-:22"
> > + " -netdev socket,id=hosta,listen=127.0.0.1:%s"
> > + " -netdev socket,id=hostb,listen=127.0.0.1:%s"
> > + " -netdev hubport,id=hostport,hubid=0,netdev=host"
> > + " -netdev hubport,id=porta,hubid=0,netdev=hosta"
> > + " -netdev hubport,id=portb,hubid=0,netdev=hostb")
> > + % (self.QEMU_BINARY, self.BRIDGE_PIDFILE, self.SSH_PORT,
> > + self.BRIDGE_HOSTA_PORT, self.BRIDGE_HOSTB_PORT), 0)
> > +
> > + def tearDown(self):
> > + try:
> > + pid = self.read_pidfile(self.BRIDGE_PIDFILE)
> > + if pid and self.check_pid(pid):
> > + os.kill(pid, signal.SIGKILL)
> > + except Exception as e:
> > + pass
> > +
> > + try:
> > + self.ra_stop(self.HOSTA)
> > + except Exception as e:
> > + pass
> > +
> > + try:
> > + self.ra_stop(self.HOSTB)
> > + except Exception as e:
> > + pass
> > +
> > + try:
> > + self.ssh_close()
> > + except Exception as e:
> > + pass
> > +
> > + for proc in self.traffic_procs:
> > + try:
> > + os.killpg(proc.pid, signal.SIGTERM)
> > + except Exception as e:
> > + pass
> > +
> > + shutil.rmtree(self.TMPDIR)
> > +
> > + def run_command(self, cmdline, expected_status, env=None):
> > + proc = subprocess.Popen(cmdline, shell=True, stdout=subprocess.PIPE,
> > + stderr=subprocess.STDOUT,
> > + universal_newlines=True, env=env)
> > + stdout, stderr = proc.communicate()
> > + if proc.returncode != expected_status:
> > + self.fail("command \"%s\" failed with code %s:\n%s"
> > + % (cmdline, proc.returncode, stdout))
> > +
> > + return proc.returncode
> > +
> > + def cat_line(self, path):
> > + line = ""
> > + try:
> > + fd = open(path, "r")
> > + line = str.strip(fd.readline())
> > + fd.close()
> > + except OSError:
> > + pass
> > + return line
> > +
> > + def read_pidfile(self, pidfile):
> > + try:
> > + pid = int(self.cat_line(pidfile))
> > + except ValueError:
> > + return None
> > + else:
> > + return pid
> > +
> > + def check_pid(self, pid):
> > + try:
> > + os.kill(pid, 0)
> > + except OSError:
> > + return False
> > + else:
> > + return True
> > +
> > + def ssh_open(self):
> > + self.ssh_conn = ssh.Session(self.GUEST_ADDRESS, self.SSH_PORT,
> > + user="test", password="password")
> > + self.ssh_conn.DEFAULT_OPTIONS += (("UserKnownHostsFile", "/dev/null"),)
> > + for i in range(10):
> > + try:
> > + if self.ssh_conn.connect():
> > + return
> > + except Exception as e:
> > + pass
> > + time.sleep(4)
> > + self.fail("sshd timeout")
> > +
> > + def ssh_ping(self):
> > + if self.ssh_conn.cmd("echo ping").stdout != b"ping\n":
> > + self.fail("ssh ping failed")
> > +
> > + def ssh_close(self):
> > + self.ssh_conn.quit()
> > +
> > + def setup_base_env(self, host):
> > + PATH = os.getenv("PATH", "")
> > + env = { "PATH": "%s:%s" % (self.ADD_PATH, PATH),
> > + "HA_LOGFILE": self.RA_LOG,
> > + "OCF_RESOURCE_INSTANCE": "colo-test",
> > + "OCF_RESKEY_CRM_meta_clone_max": "2",
> > + "OCF_RESKEY_CRM_meta_notify": "true",
> > + "OCF_RESKEY_CRM_meta_timeout": "30000",
> > + "OCF_RESKEY_qemu_binary": self.QEMU_BINARY,
> > + "OCF_RESKEY_qemu_img_binary": self.QEMU_IMG_BINARY,
> > + "OCF_RESKEY_disk_size": str(self.IMAGE_SIZE),
>
> We can remove IMAGE_SIZE and use the size of HOSTA_IMAGE queried at runtime instead.
I will change the resource-agent so it doesn't need OCF_RESKEY_disk_size.
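To illustrate what querying the size at runtime could look like (a sketch only, not the final resource-agent change; the function names are mine): for raw images `os.path.getsize()` is enough, while qcow2 images would need the virtual size reported by qemu-img:

```python
import json
import os
import subprocess

def image_virtual_size(qemu_img_binary, image_path):
    # For qcow2 images the guest-visible size differs from the host file
    # size, so ask qemu-img; "virtual-size" is a field of the
    # `qemu-img info --output=json` output.
    out = subprocess.check_output(
        [qemu_img_binary, "info", "--output=json", image_path])
    return json.loads(out)["virtual-size"]

def raw_image_size(image_path):
    # For raw images the host file size is the disk size.
    return os.path.getsize(image_path)
```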
> > + "OCF_RESKEY_checkpoint_interval": "10000",
> > + "OCF_RESKEY_base_port": str(self.BASE_PORT),
> > + "OCF_RESKEY_debug": "2"}
> > +
> > + if host == self.HOSTA:
> > + env.update({"OCF_RESKEY_options":
> > + ("%s -qmp unix:%s,server,nowait"
> > + " -drive if=none,id=parent0,file='%s'")
> > + % (self.QEMU_OPTIONS, self.get_qmp_sock(host),
> > + self.HOSTA_IMAGE),
> > + "OCF_RESKEY_active_hidden_dir": self.TMPA,
> > + "OCF_RESKEY_listen_address": "127.0.0.1",
> > + "OCF_RESKEY_log_dir": self.HOSTA_LOGDIR,
> > + "OCF_RESKEY_CRM_meta_on_node": "127.0.0.1",
> > + "HA_RSCTMP": self.TMPA,
> > + "COLO_TEST_REMOTE_TMP": self.TMPB})
> > +
> > + else:
> > + env.update({"OCF_RESKEY_options":
> > + ("%s -qmp unix:%s,server,nowait"
> > + " -drive if=none,id=parent0,file='%s'")
> > + % (self.QEMU_OPTIONS, self.get_qmp_sock(host),
> > + self.HOSTB_IMAGE),
> > + "OCF_RESKEY_active_hidden_dir": self.TMPB,
> > + "OCF_RESKEY_listen_address": "127.0.0.2",
> > + "OCF_RESKEY_log_dir": self.HOSTB_LOGDIR,
> > + "OCF_RESKEY_CRM_meta_on_node": "127.0.0.2",
> > + "HA_RSCTMP": self.TMPB,
> > + "COLO_TEST_REMOTE_TMP": self.TMPA})
> > +
> > + if self.BRIDGE_NAME:
> > + env["OCF_RESKEY_options"] += \
> > + " -netdev bridge,id=hn0,br=%s,helper='%s'" \
> > + % (self.BRIDGE_NAME, self.BRIDGE_HELPER)
> > + else:
> > + if host == self.HOSTA:
> > + env["OCF_RESKEY_options"] += \
> > + " -netdev socket,id=hn0,connect=127.0.0.1:%s" \
> > + % self.BRIDGE_HOSTA_PORT
> > + else:
> > + env["OCF_RESKEY_options"] += \
> > + " -netdev socket,id=hn0,connect=127.0.0.1:%s" \
> > + % self.BRIDGE_HOSTB_PORT
> > + return env
> > +
> > + def ra_start(self, host):
> > + env = self.setup_base_env(host)
> > + self.run_command(self.RESOURCE_AGENT + " start", self.OCF_SUCCESS, env)
> > +
> > + def ra_stop(self, host):
> > + env = self.setup_base_env(host)
> > + self.run_command(self.RESOURCE_AGENT + " stop", self.OCF_SUCCESS, env)
> > +
> > + def ra_monitor(self, host, expected_status):
> > + env = self.setup_base_env(host)
> > + self.run_command(self.RESOURCE_AGENT + " monitor", expected_status, env)
> > +
> > + def ra_promote(self, host):
> > + env = self.setup_base_env(host)
> > + self.run_command(self.RESOURCE_AGENT + " promote", self.OCF_SUCCESS,env)
> > +
> > + def ra_notify_start(self, host):
> > + env = self.setup_base_env(host)
> > +
> > + env.update({"OCF_RESKEY_CRM_meta_notify_type": "post",
> > + "OCF_RESKEY_CRM_meta_notify_operation": "start"})
> > +
> > + if host == self.HOSTA:
> > + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.1",
> > + "OCF_RESKEY_CRM_meta_notify_start_uname": "127.0.0.2"})
> > + else:
> > + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.2",
> > + "OCF_RESKEY_CRM_meta_notify_start_uname": "127.0.0.1"})
> > +
> > + self.run_command(self.RESOURCE_AGENT + " notify", self.OCF_SUCCESS, env)
> > +
> > + def ra_notify_stop(self, host):
> > + env = self.setup_base_env(host)
> > +
> > + env.update({"OCF_RESKEY_CRM_meta_notify_type": "pre",
> > + "OCF_RESKEY_CRM_meta_notify_operation": "stop"})
> > +
> > + if host == self.HOSTA:
> > + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.1",
> > + "OCF_RESKEY_CRM_meta_notify_stop_uname": "127.0.0.2"})
> > + else:
> > + env.update({"OCF_RESKEY_CRM_meta_notify_master_uname": "127.0.0.2",
> > + "OCF_RESKEY_CRM_meta_notify_stop_uname": "127.0.0.1"})
> > +
> > + self.run_command(self.RESOURCE_AGENT + " notify", self.OCF_SUCCESS, env)
> > +
> > + def get_pid(self, host):
> > + if host == self.HOSTA:
> > + return self.read_pidfile(os.path.join(self.TMPA,
> > + "colo-test-qemu.pid"))
> > + else:
> > + return self.read_pidfile(os.path.join(self.TMPB,
> > + "colo-test-qemu.pid"))
> > +
> > + def get_master_score(self, host):
> > + if host == self.HOSTA:
> > + return int(self.cat_line(os.path.join(self.TMPA, "master_score")))
> > + else:
> > + return int(self.cat_line(os.path.join(self.TMPB, "master_score")))
> > +
> > + def get_qmp_sock(self, host):
> > + if host == self.HOSTA:
> > + return os.path.join(self.TMPA, "my-qmp.sock")
> > + else:
> > + return os.path.join(self.TMPB, "my-qmp.sock")
> > +
> > +
> > + def kill_qemu_pre(self, host):
> > + pid = self.get_pid(host)
> > +
> > + qmp_sock = self.get_qmp_sock(host)
> > +
> > + if self.checkpoint_failover:
> > + qmp_conn = QEMUMonitorProtocol(qmp_sock)
> > + qmp_conn.settimeout(10)
> > + qmp_conn.connect()
> > + while True:
> > + event = qmp_conn.pull_event(wait=True)
> > + if event["event"] == "STOP":
> > + break
> > + qmp_conn.close()
> > +
> > +
> > + if pid and self.check_pid(pid):
> > + if self.hang_qemu:
> > + os.kill(pid, signal.SIGSTOP)
> > + else:
> > + os.kill(pid, signal.SIGKILL)
> > + while self.check_pid(pid):
> > + time.sleep(1)
> > +
> > + def kill_qemu_post(self, host):
> > + pid = self.get_pid(host)
> > +
> > + if self.hang_qemu and pid and self.check_pid(pid):
> > + os.kill(pid, signal.SIGKILL)
> > + while self.check_pid(pid):
> > + time.sleep(1)
> > +
> > + def prepare_guest(self):
> > + pass
> > +
> > + def cycle_start(self, cycle):
> > + pass
> > +
> > + def active_section(self):
> > + return False
> > +
> > + def cycle_end(self, cycle):
> > + pass
> > +
> > + def check_connection(self):
> > + self.ssh_ping()
> > + for proc in self.traffic_procs:
> > + if proc.poll() is not None:
> > + self.fail("Traffic process died")
> > +
> > + def _test_colo(self, loop=1):
> > + loop = max(loop, 1)
> > + self.log.info("Will put logs in %s" % self.outputdir)
> > +
> > + self.ra_stop(self.HOSTA)
> > + self.ra_stop(self.HOSTB)
> > +
> > + self.log.info("*** Startup ***")
> > + self.ra_start(self.HOSTA)
> > + self.ra_start(self.HOSTB)
> > +
> > + self.ra_monitor(self.HOSTA, self.OCF_SUCCESS)
> > + self.ra_monitor(self.HOSTB, self.OCF_SUCCESS)
> > +
> > + self.log.info("*** Promoting ***")
> > + self.ra_promote(self.HOSTA)
> > + cloudinit.wait_for_phone_home(("0.0.0.0", self.CLOUDINIT_HOME_PORT),
> > + self.name)
> > + self.ssh_open()
> > + self.prepare_guest()
> > +
> > + self.ra_notify_start(self.HOSTA)
> > +
> > + while self.get_master_score(self.HOSTB) != 100:
> > + self.ra_monitor(self.HOSTA, self.OCF_RUNNING_MASTER)
> > + self.ra_monitor(self.HOSTB, self.OCF_SUCCESS)
> > + time.sleep(1)
> > +
> > + self.log.info("*** Replication started ***")
> > +
> > + self.check_connection()
> > +
> > + primary = self.HOSTA
> > + secondary = self.HOSTB
> > +
> > + for n in range(loop):
> > + self.cycle_start(n)
> > + self.log.info("*** Secondary failover ***")
> > + self.kill_qemu_pre(primary)
> > + self.ra_notify_stop(secondary)
> > + self.ra_monitor(secondary, self.OCF_SUCCESS)
> > + self.ra_promote(secondary)
> > + self.ra_monitor(secondary, self.OCF_RUNNING_MASTER)
> > + self.kill_qemu_post(primary)
> > +
> > + self.check_connection()
> > +
> > + primary, secondary = secondary, primary
> > +
> > + self.log.info("*** Secondary continue replication ***")
> > + self.ra_start(secondary)
> > + self.ra_notify_start(primary)
> > +
> > + self.check_connection()
> > +
> > + # Wait for resync
> > + while self.get_master_score(secondary) != 100:
> > + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> > + self.ra_monitor(secondary, self.OCF_SUCCESS)
> > + time.sleep(1)
> > +
> > + self.log.info("*** Replication started ***")
> > +
> > + self.check_connection()
> > +
> > + if self.active_section():
> > + break
> > +
> > + self.log.info("*** Primary failover ***")
> > + self.kill_qemu_pre(secondary)
> > + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> > + self.ra_notify_stop(primary)
> > + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> > + self.kill_qemu_post(secondary)
> > +
> > + self.check_connection()
> > +
> > + self.log.info("*** Primary continue replication ***")
> > + self.ra_start(secondary)
> > + self.ra_notify_start(primary)
> > +
> > + self.check_connection()
> > +
> > + # Wait for resync
> > + while self.get_master_score(secondary) != 100:
> > + self.ra_monitor(primary, self.OCF_RUNNING_MASTER)
> > + self.ra_monitor(secondary, self.OCF_SUCCESS)
> > + time.sleep(1)
> > +
> > + self.log.info("*** Replication started ***")
> > +
> > + self.check_connection()
> > +
> > + self.cycle_end(n)
>
> Interesting test :)
>
> > +
> > + self.ssh_close()
> > +
> > + self.ra_stop(self.HOSTA)
> > + self.ra_stop(self.HOSTB)
> > +
> > + self.ra_monitor(self.HOSTA, self.OCF_NOT_RUNNING)
> > + self.ra_monitor(self.HOSTB, self.OCF_NOT_RUNNING)
> > + self.log.info("*** all ok ***")
> > +
> > +
> > +class ColoQuickTest(ColoTest):
> > + """
> > + :avocado: tags=colo
> > + :avocado: tags=quick
> > + :avocado: tags=arch:x86_64
> > + """
> > +
> > + timeout = 300
> > +
> > + def cycle_end(self, cycle):
> > + self.log.info("Testing with peer qemu hanging"
> > + " and failover during checkpoint")
> > + self.hang_qemu = True
> > +
> > + def test_quick(self):
> > + self.checkpoint_failover = True
> > + self.log.info("Testing with peer qemu crashing"
> > + " and failover during checkpoint")
> > + self._test_colo(loop=2)
> > +
> > +
> > +class ColoNetworkTest(ColoTest):
> > +
> > + def prepare_guest(self):
> > + install_cmd = self.params.get("install_cmd", default=
> > + "sudo -n dnf -q -y install iperf3 memtester")
> > + self.ssh_conn.cmd(install_cmd)
> > + # Use two instances to work around a bug in iperf3
> > + self.ssh_conn.cmd("iperf3 -sD; iperf3 -sD -p 5202")
> > +
> > + def _cycle_start(self, cycle):
> > + pass
> > +
> > + def cycle_start(self, cycle):
> > + self._cycle_start(cycle)
> > + tests = [("", "client-to-server tcp"), ("-R", "server-to-client tcp")]
> > +
> > + self.log.info("Testing iperf %s" % tests[cycle % 2][1])
> > + iperf_cmd = "iperf3 %s -t 60 -i 60 --connect-timeout 30000 -c %s" \
> > + % (tests[cycle % 2][0], self.GUEST_ADDRESS)
> > + proc = subprocess.Popen("while %s && %s; do sleep 1; done >>'%s' 2>&1"
> > + % (iperf_cmd, iperf_cmd + " -p 5202",
> > + os.path.join(self.outputdir, "iperf.log")),
> > + shell=True, preexec_fn=os.setsid, stdin=subprocess.DEVNULL,
> > + stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
>
> So this runs on the host and requires the host to be Linux with iperf3
> installed. Don't we need to be privileged to run it?
Not for iperf3, but the bridge helper needs the setuid bit to be set if the test runs unprivileged.
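For reference, a hedged sketch of how the suite could detect this up front and cancel early (the helper path varies by distribution, so it is left as a parameter here; the function name is mine):

```python
import os
import stat

def helper_is_setuid(path):
    # An unprivileged test run can only use the bridge helper if the
    # binary exists, is executable, and carries the setuid bit
    # (root does not need the bit at all).
    if os.geteuid() == 0:
        return True
    try:
        st = os.stat(path)
    except OSError:
        return False
    return bool(st.st_mode & stat.S_ISUID) and os.access(path, os.X_OK)
```

The test could then call something like `self.cancel("bridge helper is not setuid")` when `helper_is_setuid(self.BRIDGE_HELPER)` is False.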
> > + self.traffic_procs.append(proc)
> > + time.sleep(5) # Wait for iperf to get up to speed
> > +
> > + def cycle_end(self, cycle):
> > + for proc in self.traffic_procs:
> > + try:
> > + os.killpg(proc.pid, signal.SIGTERM)
> > + proc.wait()
> > + except Exception as e:
> > + pass
> > + self.traffic_procs.clear()
> > + time.sleep(20)
> > +
> > +class ColoRealNetworkTest(ColoNetworkTest):
> > + """
> > + :avocado: tags=colo
> > + :avocado: tags=slow
> > + :avocado: tags=network_test
> > + :avocado: tags=arch:x86_64
> > + """
> > +
> > + timeout = 900
> > +
> > + def active_section(self):
> > + time.sleep(300)
> > + return False
> > +
> > + @skipUnless(iperf3_available(), "iperf3 not available")
> > + def test_network(self):
> > + if not self.BRIDGE_NAME:
> > + self.cancel("bridge options not given, will skip network test")
> > + self.log.info("Testing with peer qemu crashing and network load")
> > + self._test_colo(loop=2)
> > +
> > +class ColoStressTest(ColoNetworkTest):
> > + """
> > + :avocado: tags=colo
> > + :avocado: tags=slow
> > + :avocado: tags=stress_test
> > + :avocado: tags=arch:x86_64
> > + """
> > +
> > + timeout = 1800
>
> How long does this test take on your hw (what CPU, to compare)?
The CPU is an Intel i7-5600U, with an M.2 SATA SSD (during resync the whole image is copied). Here are the runtimes:
Quick test: ~200s
Network test: ~800s
Stress test: ~1300s
> > +
> > + def _cycle_start(self, cycle):
> > + if cycle == 4:
> > + self.log.info("Stresstest with peer qemu hanging, network load"
> > + " and failover during checkpoint")
> > + self.checkpoint_failover = True
> > + self.hang_qemu = True
> > + elif cycle == 8:
> > + self.log.info("Stresstest with peer qemu crashing and network load")
> > + self.checkpoint_failover = False
> > + self.hang_qemu = False
> > + elif cycle == 12:
> > + self.log.info("Stresstest with peer qemu hanging and network load")
> > + self.checkpoint_failover = False
> > + self.hang_qemu = True
> > +
> > + @skipUnless(iperf3_available(), "iperf3 not available")
> > + def test_stress(self):
> > + if not self.BRIDGE_NAME:
> > + self.cancel("bridge options not given, will skip network test")
> > + self.log.info("Stresstest with peer qemu crashing, network load"
> > + " and failover during checkpoint")
> > + self.checkpoint_failover = True
> > + self._test_colo(loop=16)
> >
>
* Re: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
2020-05-18 9:38 ` [PATCH 0/5] colo: Introduce resource agent and test suite/CI Zhang, Chen
@ 2020-06-06 18:59 ` Lukas Straub
2020-06-16 1:42 ` Zhang, Chen
0 siblings, 1 reply; 14+ messages in thread
From: Lukas Straub @ 2020-06-06 18:59 UTC (permalink / raw)
To: Zhang, Chen
Cc: Jason Wang, Alberto Garcia, qemu-devel, Dr. David Alan Gilbert
On Mon, 18 May 2020 09:38:24 +0000
"Zhang, Chen" <chen.zhang@intel.com> wrote:
> > -----Original Message-----
> > From: Lukas Straub <lukasstraub2@web.de>
> > Sent: Monday, May 11, 2020 8:27 PM
> > To: qemu-devel <qemu-devel@nongnu.org>
> > Cc: Alberto Garcia <berto@igalia.com>; Dr. David Alan Gilbert
> > <dgilbert@redhat.com>; Zhang, Chen <chen.zhang@intel.com>
> > Subject: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
> >
> > Hello Everyone,
> > These patches introduce a resource agent for fully automatic management of
> > colo and a test suite building upon the resource agent to extensively test colo.
> >
> > Test suite features:
> > -Tests failover with peer crashing and hanging and failover during checkpoint
> > -Tests network using ssh and iperf3
> > -Quick test requires no special configuration
> > -Network test for testing colo-compare
> > -Stress test: failover all the time with network load
> >
> > Resource agent features:
> > -Fully automatic management of colo
> > -Handles many failures: hanging/crashing qemu, replication error, disk
> > error, ...
> > -Recovers from hanging qemu by using the "yank" oob command
> > -Tracks which node has up-to-date data
> > -Works well in clusters with more than 2 nodes
> >
> > Run times on my laptop:
> > Quick test: 200s
> > Network test: 800s (tagged as slow)
> > Stress test: 1300s (tagged as slow)
> >
> > The test suite needs access to a network bridge to properly test the network,
> > so some parameters need to be given to the test run. See
> > tests/acceptance/colo.py for more information.
> >
> > I wonder how this integrates in existing CI infrastructure. Is there a common
> > CI for qemu where this can run or does every subsystem have to run their
> > own CI?
>
> Wow~ Very happy to see this series.
> I have checked the "how to" in tests/acceptance/colo.py,
but it doesn't look sufficient for users. Can you write an independent document for this series?
It could include a test infrastructure ASCII diagram, the test case design, a detailed how-to, and more
information on the pacemaker cluster and resource agent, etc.?
Hi,
I quickly created a more complete howto for configuring a pacemaker cluster and using the resource agent; I hope it helps:
https://wiki.qemu.org/Features/COLO/Managed_HOWTO
Regards,
Lukas Straub
> Thanks
> Zhang Chen
>
>
> >
> > Regards,
> > Lukas Straub
> >
> >
> > Lukas Straub (5):
> > block/quorum.c: stable children names
> > colo: Introduce resource agent
> > colo: Introduce high-level test suite
> > configure,Makefile: Install colo resource-agent
> > MAINTAINERS: Add myself as maintainer for COLO resource agent
> >
> > MAINTAINERS | 6 +
> > Makefile | 5 +
> > block/quorum.c | 20 +-
> > configure | 10 +
> > scripts/colo-resource-agent/colo | 1429 ++++++++++++++++++++++
> > scripts/colo-resource-agent/crm_master | 44 +
> > scripts/colo-resource-agent/crm_resource | 12 +
> > tests/acceptance/colo.py | 689 +++++++++++
> > 8 files changed, 2209 insertions(+), 6 deletions(-)
> > create mode 100755 scripts/colo-resource-agent/colo
> > create mode 100755 scripts/colo-resource-agent/crm_master
> > create mode 100755 scripts/colo-resource-agent/crm_resource
> > create mode 100644 tests/acceptance/colo.py
> >
> > --
> > 2.20.1
* RE: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
2020-06-06 18:59 ` Lukas Straub
@ 2020-06-16 1:42 ` Zhang, Chen
2020-06-19 13:55 ` Lukas Straub
0 siblings, 1 reply; 14+ messages in thread
From: Zhang, Chen @ 2020-06-16 1:42 UTC (permalink / raw)
To: Lukas Straub
Cc: Zhanghailiang, Jason Wang, Alberto Garcia, qemu-devel,
Dr. David Alan Gilbert
> -----Original Message-----
> From: Lukas Straub <lukasstraub2@web.de>
> Sent: Sunday, June 7, 2020 3:00 AM
> To: Zhang, Chen <chen.zhang@intel.com>
> Cc: qemu-devel <qemu-devel@nongnu.org>; Alberto Garcia
> <berto@igalia.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>; Jason
> Wang <jasowang@redhat.com>
> Subject: Re: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
>
> On Mon, 18 May 2020 09:38:24 +0000
> "Zhang, Chen" <chen.zhang@intel.com> wrote:
>
> > > -----Original Message-----
> > > From: Lukas Straub <lukasstraub2@web.de>
> > > Sent: Monday, May 11, 2020 8:27 PM
> > > To: qemu-devel <qemu-devel@nongnu.org>
> > > Cc: Alberto Garcia <berto@igalia.com>; Dr. David Alan Gilbert
> > > <dgilbert@redhat.com>; Zhang, Chen <chen.zhang@intel.com>
> > > Subject: [PATCH 0/5] colo: Introduce resource agent and test
> > > suite/CI
> > >
> > > Hello Everyone,
> > > These patches introduce a resource agent for fully automatic
> > > management of colo and a test suite building upon the resource agent to
> > > extensively test colo.
> > >
> > > Test suite features:
> > > -Tests failover with peer crashing and hanging and failover during checkpoint
> > > -Tests network using ssh and iperf3
> > > -Quick test requires no special configuration
> > > -Network test for testing colo-compare
> > > -Stress test: failover all the time with network load
> > >
> > > Resource agent features:
> > > -Fully automatic management of colo
> > > -Handles many failures: hanging/crashing qemu, replication error,
> > > disk error, ...
> > > -Recovers from hanging qemu by using the "yank" oob command
> > > -Tracks which node has up-to-date data
> > > -Works well in clusters with more than 2 nodes
> > >
> > > Run times on my laptop:
> > > Quick test: 200s
> > > Network test: 800s (tagged as slow)
> > > Stress test: 1300s (tagged as slow)
> > >
> > > The test suite needs access to a network bridge to properly test the
> > > network, so some parameters need to be given to the test run. See
> > > tests/acceptance/colo.py for more information.
> > >
> > > I wonder how this integrates in existing CI infrastructure. Is there
> > > a common CI for qemu where this can run or does every subsystem have
> > > to run their own CI?
> >
> > Wow~ Very happy to see this series.
> > I have checked the "how to" in tests/acceptance/colo.py, but it doesn't
> > look sufficient for users. Can you write an independent document for this
> > series? It could include a test infrastructure ASCII diagram, the test
> > case design, a detailed how-to, and more information on the pacemaker
> > cluster and resource agent, etc.?
>
> Hi,
> I quickly created a more complete howto for configuring a pacemaker cluster
> and using the resource agent, I hope it helps:
> https://wiki.qemu.org/Features/COLO/Managed_HOWTO
Hi Lukas,
I noticed you contributed some content to the QEMU COLO wiki,
specifically the Features/COLO/Manual_HOWTO page:
https://wiki.qemu.org/Features/COLO/Manual_HOWTO
Why not keep the secondary-side start command the same as in qemu/docs/COLO-FT.txt?
If I understand correctly, adding the quorum-related commands on the secondary will support resuming replication.
Then we can add the primary/secondary resume steps here.
Thanks
Zhang Chen
>
> Regards,
> Lukas Straub
>
> > Thanks
> > Zhang Chen
> >
> >
> > >
> > > Regards,
> > > Lukas Straub
> > >
> > >
> > > Lukas Straub (5):
> > > block/quorum.c: stable children names
> > > colo: Introduce resource agent
> > > colo: Introduce high-level test suite
> > > configure,Makefile: Install colo resource-agent
> > > MAINTAINERS: Add myself as maintainer for COLO resource agent
> > >
> > > MAINTAINERS | 6 +
> > > Makefile | 5 +
> > > block/quorum.c | 20 +-
> > > configure | 10 +
> > > scripts/colo-resource-agent/colo | 1429 ++++++++++++++++++++++
> > > scripts/colo-resource-agent/crm_master | 44 +
> > > scripts/colo-resource-agent/crm_resource | 12 +
> > > tests/acceptance/colo.py | 689 +++++++++++
> > > 8 files changed, 2209 insertions(+), 6 deletions(-)
> > > create mode 100755 scripts/colo-resource-agent/colo
> > > create mode 100755 scripts/colo-resource-agent/crm_master
> > > create mode 100755 scripts/colo-resource-agent/crm_resource
> > > create mode 100644 tests/acceptance/colo.py
> > >
> > > --
> > > 2.20.1
* Re: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
2020-06-16 1:42 ` Zhang, Chen
@ 2020-06-19 13:55 ` Lukas Straub
0 siblings, 0 replies; 14+ messages in thread
From: Lukas Straub @ 2020-06-19 13:55 UTC (permalink / raw)
To: Zhang, Chen
Cc: Zhanghailiang, Jason Wang, Alberto Garcia, qemu-devel,
Dr. David Alan Gilbert
On Tue, 16 Jun 2020 01:42:45 +0000
"Zhang, Chen" <chen.zhang@intel.com> wrote:
> > -----Original Message-----
> > From: Lukas Straub <lukasstraub2@web.de>
> > Sent: Sunday, June 7, 2020 3:00 AM
> > To: Zhang, Chen <chen.zhang@intel.com>
> > Cc: qemu-devel <qemu-devel@nongnu.org>; Alberto Garcia
> > <berto@igalia.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>; Jason
> > Wang <jasowang@redhat.com>
> > Subject: Re: [PATCH 0/5] colo: Introduce resource agent and test suite/CI
> >
> > On Mon, 18 May 2020 09:38:24 +0000
> > "Zhang, Chen" <chen.zhang@intel.com> wrote:
> >
> > > > -----Original Message-----
> > > > From: Lukas Straub <lukasstraub2@web.de>
> > > > Sent: Monday, May 11, 2020 8:27 PM
> > > > To: qemu-devel <qemu-devel@nongnu.org>
> > > > Cc: Alberto Garcia <berto@igalia.com>; Dr. David Alan Gilbert
> > > > <dgilbert@redhat.com>; Zhang, Chen <chen.zhang@intel.com>
> > > > Subject: [PATCH 0/5] colo: Introduce resource agent and test
> > > > suite/CI
> > > >
> > > > Hello Everyone,
> > > > These patches introduce a resource agent for fully automatic
> > > > management of colo and a test suite building upon the resource agent to
> > > > extensively test colo.
> > > >
> > > > Test suite features:
> > > > -Tests failover with peer crashing and hanging and failover during checkpoint
> > > > -Tests network using ssh and iperf3
> > > > -Quick test requires no special configuration
> > > > -Network test for testing colo-compare
> > > > -Stress test: failover all the time with network load
> > > >
> > > > Resource agent features:
> > > > -Fully automatic management of colo
> > > > -Handles many failures: hanging/crashing qemu, replication error,
> > > > disk error, ...
> > > > -Recovers from hanging qemu by using the "yank" oob command
> > > > -Tracks which node has up-to-date data
> > > > -Works well in clusters with more than 2 nodes
> > > >
> > > > Run times on my laptop:
> > > > Quick test: 200s
> > > > Network test: 800s (tagged as slow)
> > > > Stress test: 1300s (tagged as slow)
> > > >
> > > > The test suite needs access to a network bridge to properly test the
> > > > network, so some parameters need to be given to the test run. See
> > > > tests/acceptance/colo.py for more information.
> > > >
> > > > I wonder how this integrates in existing CI infrastructure. Is there
> > > > a common CI for qemu where this can run or does every subsystem have
> > > > to run their own CI?
> > >
> > > Wow~ Very happy to see this series.
> > > I have checked the "how to" in tests/acceptance/colo.py, but it doesn't
> > > look sufficient for users. Can you write an independent document for this
> > > series? It could include a test infrastructure ASCII diagram, the test
> > > case design, a detailed how-to, and more information on the pacemaker
> > > cluster and resource agent, etc.?
> >
> > Hi,
> > I quickly created a more complete howto for configuring a pacemaker cluster
> > and using the resource agent, I hope it helps:
> > https://wiki.qemu.org/Features/COLO/Managed_HOWTO
>
> Hi Lukas,
>
> I noticed you contributed some content to the QEMU COLO wiki,
> specifically the Features/COLO/Manual_HOWTO page:
> https://wiki.qemu.org/Features/COLO/Manual_HOWTO
>
> Why not keep the secondary-side start command the same as in qemu/docs/COLO-FT.txt?
> If I understand correctly, adding the quorum-related commands on the secondary will support resuming replication.
> Then we can add the primary/secondary resume steps here.
I haven't updated the wiki from qemu/docs/COLO-FT.txt yet; I just moved it there from the main page.
Regards,
Lukas Straub
> Thanks
> Zhang Chen
end of thread, other threads:[~2020-06-19 14:00 UTC | newest]
Thread overview: 14+ messages
2020-05-11 12:26 [PATCH 0/5] colo: Introduce resource agent and test suite/CI Lukas Straub
2020-05-11 12:26 ` [PATCH 1/5] block/quorum.c: stable children names Lukas Straub
2020-06-02 1:01 ` Zhang, Chen
2020-06-02 11:07 ` Alberto Garcia
2020-05-11 12:26 ` [PATCH 2/5] colo: Introduce resource agent Lukas Straub
2020-05-11 12:27 ` [PATCH 3/5] colo: Introduce high-level test suite Lukas Straub
2020-06-02 12:19 ` Philippe Mathieu-Daudé
2020-06-04 10:55 ` Lukas Straub
2020-05-11 12:27 ` [PATCH 4/5] configure,Makefile: Install colo resource-agent Lukas Straub
2020-05-11 12:27 ` [PATCH 5/5] MAINTAINERS: Add myself as maintainer for COLO resource agent Lukas Straub
2020-05-18 9:38 ` [PATCH 0/5] colo: Introduce resource agent and test suite/CI Zhang, Chen
2020-06-06 18:59 ` Lukas Straub
2020-06-16 1:42 ` Zhang, Chen
2020-06-19 13:55 ` Lukas Straub