All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] tests: improve reliability of migration test
@ 2022-06-28 10:54 Daniel P. Berrangé
  2022-06-28 10:54 ` [PATCH 1/5] tests: wait max 120 seconds for migration test status changes Daniel P. Berrangé
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Thomas Huth,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert,
	Daniel P. Berrangé

Since the TLS tests were added a few people have reported seeing
hangs in some of the TLS test cases for migration. Debugging
has revealed that in all cases the test was waiting for a STOP
event that never arrived.

The problem is that TLS performance is highly dependant on the
crypto impl. Some people have been running tests on machines
which are highly efficient at running the guest dirtying workload
but relatively slow at TLS. This has prevented convergance from
being reliably achieved in the configured max downtime.

Since this test design has been long standing I suspect the
lack of convergance is a likely cause of previous hangs we've
seen in various scenarios that resulted in us disabling the test
on s390 TCG, ppc TCG and ppc KVM-PR.

Thus I have suggested we drop this skip conditions, though I would
note that I've not had the ability to actually test the effect that
this has. 

Daniel P. Berrangé (5):
  tests: wait max 120 seconds for migration test status changes
  tests: wait for migration completion before looking for STOP event
  tests: increase migration test converge downtime to 30 seconds
  tests: use consistent bandwidth/downtime limits in migration tests
  tests: stop skipping migration test on s390x/ppc64

 tests/qtest/migration-helpers.c | 14 ++++++
 tests/qtest/migration-test.c    | 80 ++++++++++-----------------------
 2 files changed, 38 insertions(+), 56 deletions(-)

-- 
2.36.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] tests: wait max 120 seconds for migration test status changes
  2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
@ 2022-06-28 10:54 ` Daniel P. Berrangé
  2022-06-28 12:47   ` Laurent Vivier
  2022-06-28 12:49   ` Thomas Huth
  2022-06-28 10:54 ` [PATCH 2/5] tests: wait for migration completion before looking for STOP event Daniel P. Berrangé
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Thomas Huth,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert,
	Daniel P. Berrangé

Currently the wait_for_migration_fail and wait_for_migration_complete
functions will spin in an infinite loop checking query-migrate status
to detect a specific change/goal. This is fine when everything goes
to plan, but when the unusual happens, these will hang the test suite
forever.

Any normally executing migration test case normally takes < 1 second
for a state change, with exception of the autoconverge test which
takes about 5 seconds. Taking into account possibility of people
running tests inside TCG, allowing a factor of x20 slowdown gives
a reasonable worst case of 120 seconds. Anything taking longer than
this is a strong sign that the test has hung, or the test should be
rewritten to be faster.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-helpers.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
index a6aa59e4e6..e81e831c85 100644
--- a/tests/qtest/migration-helpers.c
+++ b/tests/qtest/migration-helpers.c
@@ -15,6 +15,14 @@
 
 #include "migration-helpers.h"
 
+/*
+ * Number of seconds we wait when looking for migration
+ * status changes, to avoid test suite hanging forever
+ * when things go wrong. Needs to be higher enough to
+ * avoid false positives on loaded hosts.
+ */
+#define MIGRATION_STATUS_WAIT_TIMEOUT 120
+
 bool got_stop;
 
 static void check_stop_event(QTestState *who)
@@ -166,8 +174,11 @@ static bool check_migration_status(QTestState *who, const char *goal,
 void wait_for_migration_status(QTestState *who,
                                const char *goal, const char **ungoals)
 {
+    g_test_timer_start();
     while (!check_migration_status(who, goal, ungoals)) {
         usleep(1000);
+
+        g_assert(g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT);
     }
 }
 
@@ -178,6 +189,7 @@ void wait_for_migration_complete(QTestState *who)
 
 void wait_for_migration_fail(QTestState *from, bool allow_active)
 {
+    g_test_timer_start();
     QDict *rsp_return;
     char *status;
     bool failed;
@@ -193,6 +205,8 @@ void wait_for_migration_fail(QTestState *from, bool allow_active)
         g_assert(result);
         failed = !strcmp(status, "failed");
         g_free(status);
+
+        g_assert(g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT);
     } while (!failed);
 
     /* Is the machine currently running? */
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/5] tests: wait for migration completion before looking for STOP event
  2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
  2022-06-28 10:54 ` [PATCH 1/5] tests: wait max 120 seconds for migration test status changes Daniel P. Berrangé
@ 2022-06-28 10:54 ` Daniel P. Berrangé
  2022-06-28 12:47   ` Laurent Vivier
  2022-06-28 14:08   ` Dr. David Alan Gilbert
  2022-06-28 10:54 ` [PATCH 3/5] tests: increase migration test converge downtime to 30 seconds Daniel P. Berrangé
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Thomas Huth,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert,
	Daniel P. Berrangé

When moving into the convergance phase, the precopy tests will first
look for a STOP event and once found will look for migration completion
status. If the test VM is not converging, the test suite will be waiting
for the STOP event forever. If we wait for the migration completion
status first, then we will trigger the previously added timeout and
prevent the test hanging forever.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index d33e8060f9..ac9e303b1f 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -1232,6 +1232,10 @@ static void test_precopy_common(MigrateCommon *args)
 
         migrate_set_parameter_int(from, "downtime-limit", CONVERGE_DOWNTIME);
 
+        /* We do this first, as it has a timeout to stop us
+         * hanging forever if migration didn't converge */
+        wait_for_migration_complete(from);
+
         if (!got_stop) {
             qtest_qmp_eventwait(from, "STOP");
         }
@@ -1239,7 +1243,6 @@ static void test_precopy_common(MigrateCommon *args)
         qtest_qmp_eventwait(to, "RESUME");
 
         wait_for_serial("dest_serial");
-        wait_for_migration_complete(from);
     }
 
     if (args->finish_hook) {
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/5] tests: increase migration test converge downtime to 30 seconds
  2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
  2022-06-28 10:54 ` [PATCH 1/5] tests: wait max 120 seconds for migration test status changes Daniel P. Berrangé
  2022-06-28 10:54 ` [PATCH 2/5] tests: wait for migration completion before looking for STOP event Daniel P. Berrangé
@ 2022-06-28 10:54 ` Daniel P. Berrangé
  2022-06-28 12:47   ` Laurent Vivier
  2022-06-28 10:54 ` [PATCH 4/5] tests: use consistent bandwidth/downtime limits in migration tests Daniel P. Berrangé
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Thomas Huth,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert,
	Daniel P. Berrangé

While 1 second might be enough to converge migration on a fast host,
this is not guaranteed, especially if using TLS in the tests without
hardware accelerated crypto available.

Increasing the downtime to 30 seconds should guarantee it can converge
in any sane scenario.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index ac9e303b1f..a54eff6d56 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -47,7 +47,7 @@ unsigned end_address;
 static bool uffd_feature_thread_id;
 
 /* A downtime where the test really should converge */
-#define CONVERGE_DOWNTIME 1000
+#define CONVERGE_DOWNTIME (1000 * 30)
 
 #if defined(__linux__)
 #include <sys/syscall.h>
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/5] tests: use consistent bandwidth/downtime limits in migration tests
  2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
                   ` (2 preceding siblings ...)
  2022-06-28 10:54 ` [PATCH 3/5] tests: increase migration test converge downtime to 30 seconds Daniel P. Berrangé
@ 2022-06-28 10:54 ` Daniel P. Berrangé
  2022-06-28 14:16   ` Dr. David Alan Gilbert
  2022-06-28 10:54 ` [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64 Daniel P. Berrangé
  2022-06-28 13:19 ` [PATCH 0/5] tests: improve reliability of migration test Thomas Huth
  5 siblings, 1 reply; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Thomas Huth,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert,
	Daniel P. Berrangé

The different migration test cases are using a variety of settings
to ensure convergance/non-convergance. Introduce two helpers to
extra the common functionality and ensure consistency.

  * Non-convergance: 1ms downtime, 30mbs bandwidth
  * Convergance: 30s downtime, 1gbs bandwidth

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 54 +++++++++++++-----------------------
 1 file changed, 20 insertions(+), 34 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index a54eff6d56..9e64125f02 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -46,9 +46,6 @@ unsigned start_address;
 unsigned end_address;
 static bool uffd_feature_thread_id;
 
-/* A downtime where the test really should converge */
-#define CONVERGE_DOWNTIME (1000 * 30)
-
 #if defined(__linux__)
 #include <sys/syscall.h>
 #include <sys/vfs.h>
@@ -402,6 +399,20 @@ static void migrate_set_parameter_str(QTestState *who, const char *parameter,
     migrate_check_parameter_str(who, parameter, value);
 }
 
+static void migrate_ensure_non_converge(QTestState *who)
+{
+    /* Can't converge with 1ms downtime + 30 mbs bandwidth limit */
+    migrate_set_parameter_int(who, "max-bandwidth", 30 * 1000 * 1000);
+    migrate_set_parameter_int(who, "downtime-limit", 1);
+}
+
+static void migrate_ensure_converge(QTestState *who)
+{
+    /* Should converge with 30s downtime + 1 gbs bandwidth limit */
+    migrate_set_parameter_int(who, "max-bandwidth", 1 * 1000 * 1000 * 1000);
+    migrate_set_parameter_int(who, "downtime-limit", 30 * 1000);
+}
+
 static void migrate_pause(QTestState *who)
 {
     QDict *rsp;
@@ -984,12 +995,7 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
     migrate_set_capability(to, "postcopy-ram", true);
     migrate_set_capability(to, "postcopy-blocktime", true);
 
-    /* We want to pick a speed slow enough that the test completes
-     * quickly, but that it doesn't complete precopy even on a slow
-     * machine, so also set the downtime.
-     */
-    migrate_set_parameter_int(from, "max-bandwidth", 30000000);
-    migrate_set_parameter_int(from, "downtime-limit", 1);
+    migrate_ensure_non_converge(from);
 
     /* Wait for the first serial output from the source */
     wait_for_serial("src_serial");
@@ -1188,15 +1194,7 @@ static void test_precopy_common(MigrateCommon *args)
         return;
     }
 
-    /*
-     * We want to pick a speed slow enough that the test completes
-     * quickly, but that it doesn't complete precopy even on a slow
-     * machine, so also set the downtime.
-     */
-    /* 1 ms should make it not converge*/
-    migrate_set_parameter_int(from, "downtime-limit", 1);
-    /* 1GB/s */
-    migrate_set_parameter_int(from, "max-bandwidth", 1000000000);
+    migrate_ensure_non_converge(from);
 
     if (args->start_hook) {
         data_hook = args->start_hook(from, to);
@@ -1230,7 +1228,7 @@ static void test_precopy_common(MigrateCommon *args)
             wait_for_migration_pass(from);
         }
 
-        migrate_set_parameter_int(from, "downtime-limit", CONVERGE_DOWNTIME);
+        migrate_ensure_converge(from);
 
         /* We do this first, as it has a timeout to stop us
          * hanging forever if migration didn't converge */
@@ -1694,8 +1692,7 @@ static void test_migrate_auto_converge(void)
      * Set the initial parameters so that the migration could not converge
      * without throttling.
      */
-    migrate_set_parameter_int(from, "downtime-limit", 1);
-    migrate_set_parameter_int(from, "max-bandwidth", 100000000); /* ~100Mb/s */
+    migrate_ensure_non_converge(from);
 
     /* To check remaining size after precopy */
     migrate_set_capability(from, "pause-before-switchover", true);
@@ -2000,15 +1997,7 @@ static void test_multifd_tcp_cancel(void)
         return;
     }
 
-    /*
-     * We want to pick a speed slow enough that the test completes
-     * quickly, but that it doesn't complete precopy even on a slow
-     * machine, so also set the downtime.
-     */
-    /* 1 ms should make it not converge*/
-    migrate_set_parameter_int(from, "downtime-limit", 1);
-    /* 300MB/s */
-    migrate_set_parameter_int(from, "max-bandwidth", 30000000);
+    migrate_ensure_non_converge(from);
 
     migrate_set_parameter_int(from, "multifd-channels", 16);
     migrate_set_parameter_int(to, "multifd-channels", 16);
@@ -2054,10 +2043,7 @@ static void test_multifd_tcp_cancel(void)
 
     wait_for_migration_status(from, "cancelled", NULL);
 
-    /* 300ms it should converge */
-    migrate_set_parameter_int(from, "downtime-limit", 300);
-    /* 1GB/s */
-    migrate_set_parameter_int(from, "max-bandwidth", 1000000000);
+    migrate_ensure_converge(from);
 
     migrate_qmp(from, uri, "{}");
 
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
  2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
                   ` (3 preceding siblings ...)
  2022-06-28 10:54 ` [PATCH 4/5] tests: use consistent bandwidth/downtime limits in migration tests Daniel P. Berrangé
@ 2022-06-28 10:54 ` Daniel P. Berrangé
  2022-06-28 13:18   ` Thomas Huth
  2022-07-05  8:06   ` Thomas Huth
  2022-06-28 13:19 ` [PATCH 0/5] tests: improve reliability of migration test Thomas Huth
  5 siblings, 2 replies; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 10:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Thomas Huth,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert,
	Daniel P. Berrangé

There have been checks put into the migration test which skip it in a
few scenarios

 * ppc64 TCG
 * ppc64 KVM with kvm-pr
 * s390x TCG

In the original commits there are references to unexplained hangs in
the test. There is no record of details of where it was hanging, but
it is suspected that these were all a result of the max downtime limit
being set at too low a value to guarantee convergance.

Since a previous commit bumped the value from 1 second to 30 seconds,
it is believed that hangs due to non-convergance should be eliminated
and thus worth trying to remove the skipped scenarios.

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
---
 tests/qtest/migration-test.c | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
index 9e64125f02..500169f687 100644
--- a/tests/qtest/migration-test.c
+++ b/tests/qtest/migration-test.c
@@ -2085,7 +2085,6 @@ static bool kvm_dirty_ring_supported(void)
 int main(int argc, char **argv)
 {
     char template[] = "/tmp/migration-test-XXXXXX";
-    const bool has_kvm = qtest_has_accel("kvm");
     int ret;
 
     g_test_init(&argc, &argv, NULL);
@@ -2094,26 +2093,6 @@ int main(int argc, char **argv)
         return g_test_run();
     }
 
-    /*
-     * On ppc64, the test only works with kvm-hv, but not with kvm-pr and TCG
-     * is touchy due to race conditions on dirty bits (especially on PPC for
-     * some reason)
-     */
-    if (g_str_equal(qtest_get_arch(), "ppc64") &&
-        (!has_kvm || access("/sys/module/kvm_hv", F_OK))) {
-        g_test_message("Skipping test: kvm_hv not available");
-        return g_test_run();
-    }
-
-    /*
-     * Similar to ppc64, s390x seems to be touchy with TCG, so disable it
-     * there until the problems are resolved
-     */
-    if (g_str_equal(qtest_get_arch(), "s390x") && !has_kvm) {
-        g_test_message("Skipping test: s390x host with KVM is required");
-        return g_test_run();
-    }
-
     tmpfs = mkdtemp(template);
     if (!tmpfs) {
         g_test_message("mkdtemp on path (%s): %s", template, strerror(errno));
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] tests: wait max 120 seconds for migration test status changes
  2022-06-28 10:54 ` [PATCH 1/5] tests: wait max 120 seconds for migration test status changes Daniel P. Berrangé
@ 2022-06-28 12:47   ` Laurent Vivier
  2022-06-28 12:49   ` Thomas Huth
  1 sibling, 0 replies; 18+ messages in thread
From: Laurent Vivier @ 2022-06-28 12:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Juan Quintela, Thomas Huth, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12:54, Daniel P. Berrangé wrote:
> Currently the wait_for_migration_fail and wait_for_migration_complete
> functions will spin in an infinite loop checking query-migrate status
> to detect a specific change/goal. This is fine when everything goes
> to plan, but when the unusual happens, these will hang the test suite
> forever.
> 
> Any normally executing migration test case normally takes < 1 second
> for a state change, with exception of the autoconverge test which
> takes about 5 seconds. Taking into account possibility of people
> running tests inside TCG, allowing a factor of x20 slowdown gives
> a reasonable worst case of 120 seconds. Anything taking longer than
> this is a strong sign that the test has hung, or the test should be
> rewritten to be faster.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-helpers.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
> index a6aa59e4e6..e81e831c85 100644
> --- a/tests/qtest/migration-helpers.c
> +++ b/tests/qtest/migration-helpers.c
> @@ -15,6 +15,14 @@
>   
>   #include "migration-helpers.h"
>   
> +/*
> + * Number of seconds we wait when looking for migration
> + * status changes, to avoid test suite hanging forever
> + * when things go wrong. Needs to be higher enough to
> + * avoid false positives on loaded hosts.
> + */
> +#define MIGRATION_STATUS_WAIT_TIMEOUT 120
> +
>   bool got_stop;
>   
>   static void check_stop_event(QTestState *who)
> @@ -166,8 +174,11 @@ static bool check_migration_status(QTestState *who, const char *goal,
>   void wait_for_migration_status(QTestState *who,
>                                  const char *goal, const char **ungoals)
>   {
> +    g_test_timer_start();
>       while (!check_migration_status(who, goal, ungoals)) {
>           usleep(1000);
> +
> +        g_assert(g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT);
>       }
>   }
>   
> @@ -178,6 +189,7 @@ void wait_for_migration_complete(QTestState *who)
>   
>   void wait_for_migration_fail(QTestState *from, bool allow_active)
>   {
> +    g_test_timer_start();
>       QDict *rsp_return;
>       char *status;
>       bool failed;
> @@ -193,6 +205,8 @@ void wait_for_migration_fail(QTestState *from, bool allow_active)
>           g_assert(result);
>           failed = !strcmp(status, "failed");
>           g_free(status);
> +
> +        g_assert(g_test_timer_elapsed() < MIGRATION_STATUS_WAIT_TIMEOUT);
>       } while (!failed);
>   
>       /* Is the machine currently running? */

Reviewed-by: Laurent Vivier <laurent@vivier.eu>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] tests: wait for migration completion before looking for STOP event
  2022-06-28 10:54 ` [PATCH 2/5] tests: wait for migration completion before looking for STOP event Daniel P. Berrangé
@ 2022-06-28 12:47   ` Laurent Vivier
  2022-06-28 14:08   ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 18+ messages in thread
From: Laurent Vivier @ 2022-06-28 12:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Juan Quintela, Thomas Huth, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12:54, Daniel P. Berrangé wrote:
> When moving into the convergance phase, the precopy tests will first
> look for a STOP event and once found will look for migration completion
> status. If the test VM is not converging, the test suite will be waiting
> for the STOP event forever. If we wait for the migration completion
> status first, then we will trigger the previously added timeout and
> prevent the test hanging forever.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index d33e8060f9..ac9e303b1f 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -1232,6 +1232,10 @@ static void test_precopy_common(MigrateCommon *args)
>   
>           migrate_set_parameter_int(from, "downtime-limit", CONVERGE_DOWNTIME);
>   
> +        /* We do this first, as it has a timeout to stop us
> +         * hanging forever if migration didn't converge */
> +        wait_for_migration_complete(from);
> +
>           if (!got_stop) {
>               qtest_qmp_eventwait(from, "STOP");
>           }
> @@ -1239,7 +1243,6 @@ static void test_precopy_common(MigrateCommon *args)
>           qtest_qmp_eventwait(to, "RESUME");
>   
>           wait_for_serial("dest_serial");
> -        wait_for_migration_complete(from);
>       }
>   
>       if (args->finish_hook) {

Reviewed-by: Laurent Vivier <laurent@vivier.eu>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/5] tests: increase migration test converge downtime to 30 seconds
  2022-06-28 10:54 ` [PATCH 3/5] tests: increase migration test converge downtime to 30 seconds Daniel P. Berrangé
@ 2022-06-28 12:47   ` Laurent Vivier
  0 siblings, 0 replies; 18+ messages in thread
From: Laurent Vivier @ 2022-06-28 12:47 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Juan Quintela, Thomas Huth, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12:54, Daniel P. Berrangé wrote:
> While 1 second might be enough to converge migration on a fast host,
> this is not guaranteed, especially if using TLS in the tests without
> hardware accelerated crypto available.
> 
> Increasing the downtime to 30 seconds should guarantee it can converge
> in any sane scenario.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index ac9e303b1f..a54eff6d56 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -47,7 +47,7 @@ unsigned end_address;
>   static bool uffd_feature_thread_id;
>   
>   /* A downtime where the test really should converge */
> -#define CONVERGE_DOWNTIME 1000
> +#define CONVERGE_DOWNTIME (1000 * 30)
>   
>   #if defined(__linux__)
>   #include <sys/syscall.h>

Reviewed-by: Laurent Vivier <laurent@vivier.eu>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/5] tests: wait max 120 seconds for migration test status changes
  2022-06-28 10:54 ` [PATCH 1/5] tests: wait max 120 seconds for migration test status changes Daniel P. Berrangé
  2022-06-28 12:47   ` Laurent Vivier
@ 2022-06-28 12:49   ` Thomas Huth
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Huth @ 2022-06-28 12:49 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> Currently the wait_for_migration_fail and wait_for_migration_complete
> functions will spin in an infinite loop checking query-migrate status
> to detect a specific change/goal. This is fine when everything goes
> to plan, but when the unusual happens, these will hang the test suite
> forever.
> 
> Any normally executing migration test case normally takes < 1 second
> for a state change, with exception of the autoconverge test which
> takes about 5 seconds. Taking into account possibility of people
> running tests inside TCG, allowing a factor of x20 slowdown gives
> a reasonable worst case of 120 seconds. Anything taking longer than
> this is a strong sign that the test has hung, or the test should be
> rewritten to be faster.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-helpers.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/tests/qtest/migration-helpers.c b/tests/qtest/migration-helpers.c
> index a6aa59e4e6..e81e831c85 100644
> --- a/tests/qtest/migration-helpers.c
> +++ b/tests/qtest/migration-helpers.c
> @@ -15,6 +15,14 @@
>   
>   #include "migration-helpers.h"
>   
> +/*
> + * Number of seconds we wait when looking for migration
> + * status changes, to avoid test suite hanging forever
> + * when things go wrong. Needs to be higher enough to
> + * avoid false positives on loaded hosts.
> + */
> +#define MIGRATION_STATUS_WAIT_TIMEOUT 120

Reviewed-by: Thomas Huth <thuth@redhat.com>

... but I think I'd suggest to use an even longer timeout - sometimes people 
try to run the tests with TCI, or on heavy loaded CI machines, and then you 
can get really really long timeouts. Maybe 240 or even 300 seconds?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
  2022-06-28 10:54 ` [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64 Daniel P. Berrangé
@ 2022-06-28 13:18   ` Thomas Huth
  2022-07-05  8:06   ` Thomas Huth
  1 sibling, 0 replies; 18+ messages in thread
From: Thomas Huth @ 2022-06-28 13:18 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> There have been checks put into the migration test which skip it in a
> few scenarios
> 
>   * ppc64 TCG
>   * ppc64 KVM with kvm-pr
>   * s390x TCG
> 
> In the original commits there are references to unexplained hangs in
> the test. There is no record of details of where it was hanging, but
> it is suspected that these were all a result of the max downtime limit
> being set at too low a value to guarantee convergance.
> 
> Since a previous commit bumped the value from 1 second to 30 seconds,
> it is believed that hangs due to non-convergance should be eliminated
> and thus worth trying to remove the skipped scenarios.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 21 ---------------------
>   1 file changed, 21 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index 9e64125f02..500169f687 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -2085,7 +2085,6 @@ static bool kvm_dirty_ring_supported(void)
>   int main(int argc, char **argv)
>   {
>       char template[] = "/tmp/migration-test-XXXXXX";
> -    const bool has_kvm = qtest_has_accel("kvm");
>       int ret;
>   
>       g_test_init(&argc, &argv, NULL);
> @@ -2094,26 +2093,6 @@ int main(int argc, char **argv)
>           return g_test_run();
>       }
>   
> -    /*
> -     * On ppc64, the test only works with kvm-hv, but not with kvm-pr and TCG
> -     * is touchy due to race conditions on dirty bits (especially on PPC for
> -     * some reason)
> -     */
> -    if (g_str_equal(qtest_get_arch(), "ppc64") &&
> -        (!has_kvm || access("/sys/module/kvm_hv", F_OK))) {
> -        g_test_message("Skipping test: kvm_hv not available");
> -        return g_test_run();
> -    }
> -
> -    /*
> -     * Similar to ppc64, s390x seems to be touchy with TCG, so disable it
> -     * there until the problems are resolved
> -     */
> -    if (g_str_equal(qtest_get_arch(), "s390x") && !has_kvm) {
> -        g_test_message("Skipping test: s390x host with KVM is required");
> -        return g_test_run();
> -    }

I'm in favor of giving this now a try ... we still can revert the patch if 
it does not work out.

Reviewed-by: Thomas Huth <thuth@redhat.com>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/5] tests: improve reliability of migration test
  2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
                   ` (4 preceding siblings ...)
  2022-06-28 10:54 ` [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64 Daniel P. Berrangé
@ 2022-06-28 13:19 ` Thomas Huth
  5 siblings, 0 replies; 18+ messages in thread
From: Thomas Huth @ 2022-06-28 13:19 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> Since the TLS tests were added a few people have reported seeing
> hangs in some of the TLS test cases for migration. Debugging
> has revealed that in all cases the test was waiting for a STOP
> event that never arrived.
> 
> The problem is that TLS performance is highly dependant on the
> crypto impl. Some people have been running tests on machines
> which are highly efficient at running the guest dirtying workload
> but relatively slow at TLS. This has prevented convergance from
> being reliably achieved in the configured max downtime.
> 
> Since this test design has been long standing I suspect the
> lack of convergance is a likely cause of previous hangs we've
> seen in various scenarios that resulted in us disabling the test
> on s390 TCG, ppc TCG and ppc KVM-PR.
> 
> Thus I have suggested we drop this skip conditions, though I would
> note that I've not had the ability to actually test the effect that
> this has.
> 
> Daniel P. Berrangé (5):
>    tests: wait max 120 seconds for migration test status changes
>    tests: wait for migration completion before looking for STOP event
>    tests: increase migration test converge downtime to 30 seconds
>    tests: use consistent bandwidth/downtime limits in migration tests
>    tests: stop skipping migration test on s390x/ppc64
> 
>   tests/qtest/migration-helpers.c | 14 ++++++
>   tests/qtest/migration-test.c    | 80 ++++++++++-----------------------
>   2 files changed, 38 insertions(+), 56 deletions(-)

FYI, this is fixing the issue with the hang that I saw with the 
precopy/unix/tls/x509/override-host test on my RHEL8 s390x host.

Tested-by: Thomas Huth <thuth@redhat.com>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] tests: wait for migration completion before looking for STOP event
  2022-06-28 10:54 ` [PATCH 2/5] tests: wait for migration completion before looking for STOP event Daniel P. Berrangé
  2022-06-28 12:47   ` Laurent Vivier
@ 2022-06-28 14:08   ` Dr. David Alan Gilbert
  2022-06-28 14:10     ` Daniel P. Berrangé
  1 sibling, 1 reply; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-28 14:08 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Laurent Vivier, Juan Quintela,
	Thomas Huth, Cornelia Huck, qemu-s390x

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> When moving into the convergance phase, the precopy tests will first
> look for a STOP event and once found will look for migration completion
> status. If the test VM is not converging, the test suite will be waiting
> for the STOP event forever. If we wait for the migration completion
> status first, then we will trigger the previously added timeout and
> prevent the test hanging forever.

Yeh OK, I guess we might end up with a similar time limit being added to
qtest_qmp_eventwait.


Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>  tests/qtest/migration-test.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index d33e8060f9..ac9e303b1f 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -1232,6 +1232,10 @@ static void test_precopy_common(MigrateCommon *args)
>  
>          migrate_set_parameter_int(from, "downtime-limit", CONVERGE_DOWNTIME);
>  
> +        /* We do this first, as it has a timeout to stop us
> +         * hanging forever if migration didn't converge */
> +        wait_for_migration_complete(from);
> +
>          if (!got_stop) {
>              qtest_qmp_eventwait(from, "STOP");
>          }
> @@ -1239,7 +1243,6 @@ static void test_precopy_common(MigrateCommon *args)
>          qtest_qmp_eventwait(to, "RESUME");
>  
>          wait_for_serial("dest_serial");
> -        wait_for_migration_complete(from);
>      }
>  
>      if (args->finish_hook) {
> -- 
> 2.36.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/5] tests: wait for migration completion before looking for STOP event
  2022-06-28 14:08   ` Dr. David Alan Gilbert
@ 2022-06-28 14:10     ` Daniel P. Berrangé
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-06-28 14:10 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Laurent Vivier, Juan Quintela,
	Thomas Huth, Cornelia Huck, qemu-s390x

On Tue, Jun 28, 2022 at 03:08:29PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > When moving into the convergance phase, the precopy tests will first
> > look for a STOP event and once found will look for migration completion
> > status. If the test VM is not converging, the test suite will be waiting
> > for the STOP event forever. If we wait for the migration completion
> > status first, then we will trigger the previously added timeout and
> > prevent the test hanging forever.
> 
> Yeh OK, I guess we might end up with a similar time limit being added to
> qtest_qmp_eventwait.

Yeah, my first inclination was to modify that method, but it was
not so straightforward, since the wait is done by sitting in recv()
on a blocking socket and I didn't fancy getting into refactoring
that side of things.

> 
> 
> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> 
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >  tests/qtest/migration-test.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> > index d33e8060f9..ac9e303b1f 100644
> > --- a/tests/qtest/migration-test.c
> > +++ b/tests/qtest/migration-test.c
> > @@ -1232,6 +1232,10 @@ static void test_precopy_common(MigrateCommon *args)
> >  
> >          migrate_set_parameter_int(from, "downtime-limit", CONVERGE_DOWNTIME);
> >  
> > +        /* We do this first, as it has a timeout to stop us
> > +         * hanging forever if migration didn't converge */
> > +        wait_for_migration_complete(from);
> > +
> >          if (!got_stop) {
> >              qtest_qmp_eventwait(from, "STOP");
> >          }
> > @@ -1239,7 +1243,6 @@ static void test_precopy_common(MigrateCommon *args)
> >          qtest_qmp_eventwait(to, "RESUME");
> >  
> >          wait_for_serial("dest_serial");
> > -        wait_for_migration_complete(from);
> >      }
> >  
> >      if (args->finish_hook) {
> > -- 
> > 2.36.1
> > 
> -- 
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] tests: use consistent bandwidth/downtime limits in migration tests
  2022-06-28 10:54 ` [PATCH 4/5] tests: use consistent bandwidth/downtime limits in migration tests Daniel P. Berrangé
@ 2022-06-28 14:16   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2022-06-28 14:16 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: qemu-devel, Paolo Bonzini, Laurent Vivier, Juan Quintela,
	Thomas Huth, Cornelia Huck, qemu-s390x

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> The different migration test cases are using a variety of settings
> to ensure convergance/non-convergance. Introduce two helpers to
> extra the common functionality and ensure consistency.
> 
>   * Non-convergance: 1ms downtime, 30mbs bandwidth
>   * Convergance: 30s downtime, 1gbs bandwidth
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  tests/qtest/migration-test.c | 54 +++++++++++++-----------------------
>  1 file changed, 20 insertions(+), 34 deletions(-)
> 
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index a54eff6d56..9e64125f02 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -46,9 +46,6 @@ unsigned start_address;
>  unsigned end_address;
>  static bool uffd_feature_thread_id;
>  
> -/* A downtime where the test really should converge */
> -#define CONVERGE_DOWNTIME (1000 * 30)
> -
>  #if defined(__linux__)
>  #include <sys/syscall.h>
>  #include <sys/vfs.h>
> @@ -402,6 +399,20 @@ static void migrate_set_parameter_str(QTestState *who, const char *parameter,
>      migrate_check_parameter_str(who, parameter, value);
>  }
>  
> +static void migrate_ensure_non_converge(QTestState *who)
> +{
> +    /* Can't converge with 1ms downtime + 30 mbs bandwidth limit */
> +    migrate_set_parameter_int(who, "max-bandwidth", 30 * 1000 * 1000);
> +    migrate_set_parameter_int(who, "downtime-limit", 1);
> +}
> +
> +static void migrate_ensure_converge(QTestState *who)
> +{
> +    /* Should converge with 30s downtime + 1 gbs bandwidth limit */
> +    migrate_set_parameter_int(who, "max-bandwidth", 1 * 1000 * 1000 * 1000);
> +    migrate_set_parameter_int(who, "downtime-limit", 30 * 1000);
> +}
> +
>  static void migrate_pause(QTestState *who)
>  {
>      QDict *rsp;
> @@ -984,12 +995,7 @@ static int migrate_postcopy_prepare(QTestState **from_ptr,
>      migrate_set_capability(to, "postcopy-ram", true);
>      migrate_set_capability(to, "postcopy-blocktime", true);
>  
> -    /* We want to pick a speed slow enough that the test completes
> -     * quickly, but that it doesn't complete precopy even on a slow
> -     * machine, so also set the downtime.
> -     */
> -    migrate_set_parameter_int(from, "max-bandwidth", 30000000);
> -    migrate_set_parameter_int(from, "downtime-limit", 1);
> +    migrate_ensure_non_converge(from);
>  
>      /* Wait for the first serial output from the source */
>      wait_for_serial("src_serial");
> @@ -1188,15 +1194,7 @@ static void test_precopy_common(MigrateCommon *args)
>          return;
>      }
>  
> -    /*
> -     * We want to pick a speed slow enough that the test completes
> -     * quickly, but that it doesn't complete precopy even on a slow
> -     * machine, so also set the downtime.
> -     */
> -    /* 1 ms should make it not converge*/
> -    migrate_set_parameter_int(from, "downtime-limit", 1);
> -    /* 1GB/s */
> -    migrate_set_parameter_int(from, "max-bandwidth", 1000000000);
> +    migrate_ensure_non_converge(from);
>  
>      if (args->start_hook) {
>          data_hook = args->start_hook(from, to);
> @@ -1230,7 +1228,7 @@ static void test_precopy_common(MigrateCommon *args)
>              wait_for_migration_pass(from);
>          }
>  
> -        migrate_set_parameter_int(from, "downtime-limit", CONVERGE_DOWNTIME);
> +        migrate_ensure_converge(from);
>  
>          /* We do this first, as it has a timeout to stop us
>           * hanging forever if migration didn't converge */
> @@ -1694,8 +1692,7 @@ static void test_migrate_auto_converge(void)
>       * Set the initial parameters so that the migration could not converge
>       * without throttling.
>       */
> -    migrate_set_parameter_int(from, "downtime-limit", 1);
> -    migrate_set_parameter_int(from, "max-bandwidth", 100000000); /* ~100Mb/s */
> +    migrate_ensure_non_converge(from);
>  
>      /* To check remaining size after precopy */
>      migrate_set_capability(from, "pause-before-switchover", true);
> @@ -2000,15 +1997,7 @@ static void test_multifd_tcp_cancel(void)
>          return;
>      }
>  
> -    /*
> -     * We want to pick a speed slow enough that the test completes
> -     * quickly, but that it doesn't complete precopy even on a slow
> -     * machine, so also set the downtime.
> -     */
> -    /* 1 ms should make it not converge*/
> -    migrate_set_parameter_int(from, "downtime-limit", 1);
> -    /* 300MB/s */
> -    migrate_set_parameter_int(from, "max-bandwidth", 30000000);
> +    migrate_ensure_non_converge(from);
>  
>      migrate_set_parameter_int(from, "multifd-channels", 16);
>      migrate_set_parameter_int(to, "multifd-channels", 16);
> @@ -2054,10 +2043,7 @@ static void test_multifd_tcp_cancel(void)
>  
>      wait_for_migration_status(from, "cancelled", NULL);
>  
> -    /* 300ms it should converge */
> -    migrate_set_parameter_int(from, "downtime-limit", 300);
> -    /* 1GB/s */
> -    migrate_set_parameter_int(from, "max-bandwidth", 1000000000);
> +    migrate_ensure_converge(from);
>  
>      migrate_qmp(from, uri, "{}");
>  
> -- 
> 2.36.1
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
  2022-06-28 10:54 ` [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64 Daniel P. Berrangé
  2022-06-28 13:18   ` Thomas Huth
@ 2022-07-05  8:06   ` Thomas Huth
  2022-07-05  8:09     ` Daniel P. Berrangé
  1 sibling, 1 reply; 18+ messages in thread
From: Thomas Huth @ 2022-07-05  8:06 UTC (permalink / raw)
  To: Daniel P. Berrangé, qemu-devel
  Cc: Paolo Bonzini, Laurent Vivier, Juan Quintela, Cornelia Huck,
	qemu-s390x, Dr. David Alan Gilbert

On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> There have been checks put into the migration test which skip it in a
> few scenarios
> 
>   * ppc64 TCG
>   * ppc64 KVM with kvm-pr
>   * s390x TCG
> 
> In the original commits there are references to unexplained hangs in
> the test. There is no record of details of where it was hanging, but
> it is suspected that these were all a result of the max downtime limit
> being set at too low a value to guarantee convergance.
> 
> Since a previous commit bumped the value from 1 second to 30 seconds,
> it is believed that hangs due to non-convergance should be eliminated
> and thus worth trying to remove the skipped scenarios.
> 
> Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> ---
>   tests/qtest/migration-test.c | 21 ---------------------
>   1 file changed, 21 deletions(-)

I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:

/ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't 
support requested feature, cap-cfpc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-sbbc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ibs=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ccf-assist=on
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-cfpc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-sbbc=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ibs=workaround
qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ccf-assist=on
Memory content inconsistency at df6000 first_byte = 98 last_byte = 98 
current = 2 hit_edge = 0
Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97 
current = 96 hit_edge = 1
and in another 5542 pages**
ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram: 
assertion failed: (bad == 0)
Aborted (core dumped)

So I guess this workaround was about a different issue and we should drop 
this patch.

  Thomas



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
  2022-07-05  8:06   ` Thomas Huth
@ 2022-07-05  8:09     ` Daniel P. Berrangé
  2022-07-05  8:38       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel P. Berrangé @ 2022-07-05  8:09 UTC (permalink / raw)
  To: Thomas Huth
  Cc: qemu-devel, Paolo Bonzini, Laurent Vivier, Juan Quintela,
	Cornelia Huck, qemu-s390x, Dr. David Alan Gilbert

On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > There have been checks put into the migration test which skip it in a
> > few scenarios
> > 
> >   * ppc64 TCG
> >   * ppc64 KVM with kvm-pr
> >   * s390x TCG
> > 
> > In the original commits there are references to unexplained hangs in
> > the test. There is no record of details of where it was hanging, but
> > it is suspected that these were all a result of the max downtime limit
> > being set at too low a value to guarantee convergance.
> > 
> > Since a previous commit bumped the value from 1 second to 30 seconds,
> > it is believed that hangs due to non-convergance should be eliminated
> > and thus worth trying to remove the skipped scenarios.
> > 
> > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > ---
> >   tests/qtest/migration-test.c | 21 ---------------------
> >   1 file changed, 21 deletions(-)
> 
> I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
> 
> /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> support requested feature, cap-cfpc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-sbbc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ibs=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ccf-assist=on
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-cfpc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-sbbc=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ibs=workaround
> qemu-system-ppc64: warning: TCG doesn't support requested feature,
> cap-ccf-assist=on
> Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> current = 2 hit_edge = 0
> Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> current = 96 hit_edge = 1
> and in another 5542 pages**
> ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> assertion failed: (bad == 0)
> Aborted (core dumped)
> 
> So I guess this workaround was about a different issue and we should drop
> this patch.

Yeah, at the very least needs for investigation.

It is a little worrying though that we get such failures as it smells
like a genuine bug that we've been missing from having tests disabled.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64
  2022-07-05  8:09     ` Daniel P. Berrangé
@ 2022-07-05  8:38       ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 18+ messages in thread
From: Dr. David Alan Gilbert @ 2022-07-05  8:38 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Thomas Huth, qemu-devel, Paolo Bonzini, Laurent Vivier,
	Juan Quintela, Cornelia Huck, qemu-s390x

* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Tue, Jul 05, 2022 at 10:06:58AM +0200, Thomas Huth wrote:
> > On 28/06/2022 12.54, Daniel P. Berrangé wrote:
> > > There have been checks put into the migration test which skip it in a
> > > few scenarios
> > > 
> > >   * ppc64 TCG
> > >   * ppc64 KVM with kvm-pr
> > >   * s390x TCG
> > > 
> > > In the original commits there are references to unexplained hangs in
> > > the test. There is no record of details of where it was hanging, but
> > > it is suspected that these were all a result of the max downtime limit
> > > being set at too low a value to guarantee convergance.
> > > 
> > > Since a previous commit bumped the value from 1 second to 30 seconds,
> > > it is believed that hangs due to non-convergance should be eliminated
> > > and thus worth trying to remove the skipped scenarios.
> > > 
> > > Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
> > > ---
> > >   tests/qtest/migration-test.c | 21 ---------------------
> > >   1 file changed, 21 deletions(-)
> > 
> > I just gave this a try, and it's failing on my x86 laptop with the ppc64 target:
> > 
> > /ppc64/migration/auto_converge: qemu-system-ppc64: warning: TCG doesn't
> > support requested feature, cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-cfpc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-sbbc=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ibs=workaround
> > qemu-system-ppc64: warning: TCG doesn't support requested feature,
> > cap-ccf-assist=on
> > Memory content inconsistency at df6000 first_byte = 98 last_byte = 98
> > current = 2 hit_edge = 0

98->2 is a strangely large gap, and just one page.

> > Memory content inconsistency at 4e51000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1

Yeh that's broken;   the way I think about this is you've got a loop
and the guest is following the loop incrementing one page at a time;
if you stop the world you should see one 'edge' where the incrementer
has currently incremented the previous page but hasn't done the current
page yet.   e.g. in this case the 'start' of the memory is 98, and we
were seeing 97, so we've run past that 'edge' at some point earlier.
Now we've hit 96, that should be impossible, because all of the 96's
should have incremented out before there was ever a 98 in the loop.

> > Memory content inconsistency at 4e52000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e53000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e54000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e55000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e56000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e57000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e58000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > Memory content inconsistency at 4e59000 first_byte = 98 last_byte = 97
> > current = 96 hit_edge = 1
> > and in another 5542 pages**
> > ERROR:../../devel/qemu/tests/qtest/migration-test.c:280:check_guests_ram:
> > assertion failed: (bad == 0)
> > Aborted (core dumped)
> > 
> > So I guess this workaround was about a different issue and we should drop
> > this patch.
> 
> Yeah, at the very least needs for investigation.
> 
> It is a little worrying though that we get such failures as it smells
> like a genuine bug that we've been missing from having tests disabled.

Yeh I suspect it's a TCG bug not updating the 'changed' flag on the page
*after* writing the data.  I believe we've sene a case on ARM.

Dave

> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-07-05  8:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28 10:54 [PATCH 0/5] tests: improve reliability of migration test Daniel P. Berrangé
2022-06-28 10:54 ` [PATCH 1/5] tests: wait max 120 seconds for migration test status changes Daniel P. Berrangé
2022-06-28 12:47   ` Laurent Vivier
2022-06-28 12:49   ` Thomas Huth
2022-06-28 10:54 ` [PATCH 2/5] tests: wait for migration completion before looking for STOP event Daniel P. Berrangé
2022-06-28 12:47   ` Laurent Vivier
2022-06-28 14:08   ` Dr. David Alan Gilbert
2022-06-28 14:10     ` Daniel P. Berrangé
2022-06-28 10:54 ` [PATCH 3/5] tests: increase migration test converge downtime to 30 seconds Daniel P. Berrangé
2022-06-28 12:47   ` Laurent Vivier
2022-06-28 10:54 ` [PATCH 4/5] tests: use consistent bandwidth/downtime limits in migration tests Daniel P. Berrangé
2022-06-28 14:16   ` Dr. David Alan Gilbert
2022-06-28 10:54 ` [RFC PATCH 5/5] tests: stop skipping migration test on s390x/ppc64 Daniel P. Berrangé
2022-06-28 13:18   ` Thomas Huth
2022-07-05  8:06   ` Thomas Huth
2022-07-05  8:09     ` Daniel P. Berrangé
2022-07-05  8:38       ` Dr. David Alan Gilbert
2022-06-28 13:19 ` [PATCH 0/5] tests: improve reliability of migration test Thomas Huth

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.