* [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
@ 2017-08-23  6:51 Peter Xu
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init Peter Xu
                   ` (9 more replies)
  0 siblings, 10 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

v2:
- fixed "make check" error that patchew reported
- moved the thread join earlier in monitor_data_destroy(), before
  resources are released
- added one new patch (current patch 3) that fixes a nasty race
  condition with IOWatchPoll.  Please see the commit message for more
  information.
- added a g_main_context_wakeup() to make sure the separate loop
  thread can always be kicked when we want to destroy the per-monitor
  threads.
- added one new patch (current patch 8) to introduce migration mgmt
  lock for migrate_incoming.

This work extends migration postcopy recovery.  The series was tested
together with the following series to make sure it solves the monitor
hang that we have encountered during postcopy recovery:

  [RFC 00/29] Migration: postcopy failure recovery
  [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery

The root problem is that monitor commands are all handled in the main
loop thread now, no matter how many monitors we specify.  If the main
loop thread hangs for some reason, all monitors will be stuck.  The
reverse also holds: if any one of the monitors hangs, it hangs the
main loop, and with it the rest of the monitors (if there are any).

That affects postcopy recovery, since the recovery requires user input
on the destination side.  If the monitors hang, the destination VM
dies and loses hope for even a final recovery.

So we need to make sure that at least one of the monitors stays
alive.

The whole idea of this series is that instead of handling monitor
commands all in the main loop thread, we do it separately in
per-monitor threads.  Then, even if the main loop thread hangs at any
point for any reason, the per-monitor threads can still survive.
Further, we add a hint in QMP/HMP to show whether a command can be
executed without the BQL; if so, we avoid taking the BQL when running
that command.  This reduces contention on the BQL.  For now the only
user of that new parameter (currently I call it "without-bql") is the
"migrate-incoming" command, which is the only command that can rescue
a paused postcopy migration.

However, even with this series, it does not mean that per-monitor
threads will never hang.  One example is that we can still run "info
cpus" in per-monitor threads during a paused postcopy (in that state,
page faults are never handled, and "info cpus" will never return since
it tries to sync every vcpu).  So to make sure a monitor does not
hang, the per-monitor thread alone is not enough: the user also needs
to be careful about which commands to run on it.

For postcopy recovery, we may need a dedicated monitor channel for
recovery.  In other words, a destination VM that supports postcopy
recovery would possibly need:

  -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL

Here, the MAIN_CHANNEL can be MUXed and shared by other chardev
frontends, while the RECOVERY_CHANNEL should *ONLY* be used to input
the "migrate-incoming" command (similar thing applies to HMP
channels).  As long as we are following this rule, the
RECOVERY_CHANNEL can never hang.

Some details on each patch:

Patch 1: a simple cleanup only

Patch 2: allow monitors to create a per-monitor thread to handle
         monitor command requests.  Since the monitor is only one type
         of chardev frontend, we only do this if the backend is
         dedicated, that is, if MUX is not turned on (if MUX is on,
         the monitor still uses the main loop thread).

Patch 3: fix a race condition with IOWatchPoll that shows up once a
         context loop runs outside the main loop thread (new in v2).

Patch 4: based on patch 2, introduce a new parameter for QMP commands
         called "without-bql".  It is a hint that the command does not
         need the BQL.

Patch 5: introduce the sister parameter "without_bql" for HMP, which
         works just like QMP's "without-bql".

Patch 6: let the QMP command "migrate-incoming" avoid taking the BQL.

Patch 7: let the HMP command "migrate_incoming" avoid taking the BQL.

Patch 8: introduce a migration management lock for migrate_incoming
         (new in v2).

Please review. Thanks,

Peter Xu (8):
  monitor: move skip_flush into monitor_data_init
  monitor: allow monitor to create thread to poll
  char-io: fix possible risk on IOWatchPoll
  QAPI: new QMP command option "without-bql"
  hmp: support "without_bql"
  migration: qmp: migrate_incoming don't need BQL
  migration: hmp: migrate_incoming don't need BQL
  migration: add incoming mgmt lock

 chardev/char-io.c              | 15 +++++++-
 docs/devel/qapi-code-gen.txt   | 10 ++++-
 hmp-commands.hx                |  1 +
 include/qapi/qmp/dispatch.h    |  1 +
 migration/migration.c          |  6 +++
 migration/migration.h          |  3 ++
 monitor.c                      | 87 +++++++++++++++++++++++++++++++++++++++---
 qapi-schema.json               |  3 +-
 qapi/qmp-dispatch.c            | 26 +++++++++++++
 scripts/qapi-commands.py       | 18 ++++++---
 scripts/qapi-introspect.py     |  2 +-
 scripts/qapi.py                | 15 +++++---
 scripts/qapi2texi.py           |  2 +-
 tests/qapi-schema/test-qapi.py |  2 +-
 14 files changed, 168 insertions(+), 23 deletions(-)

-- 
2.7.4

* [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23 16:31   ` Dr. David Alan Gilbert
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll Peter Xu
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

It's part of the data init.  Collect it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 monitor.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/monitor.c b/monitor.c
index e0f8801..7c90df7 100644
--- a/monitor.c
+++ b/monitor.c
@@ -568,13 +568,14 @@ static void monitor_qapi_event_init(void)
 
 static void handle_hmp_command(Monitor *mon, const char *cmdline);
 
-static void monitor_data_init(Monitor *mon)
+static void monitor_data_init(Monitor *mon, bool skip_flush)
 {
     memset(mon, 0, sizeof(Monitor));
     qemu_mutex_init(&mon->out_lock);
     mon->outbuf = qstring_new();
     /* Use *mon_cmds by default. */
     mon->cmd_table = mon_cmds;
+    mon->skip_flush = skip_flush;
 }
 
 static void monitor_data_destroy(Monitor *mon)
@@ -594,8 +595,7 @@ char *qmp_human_monitor_command(const char *command_line, bool has_cpu_index,
     char *output = NULL;
     Monitor *old_mon, hmp;
 
-    monitor_data_init(&hmp);
-    hmp.skip_flush = true;
+    monitor_data_init(&hmp, true);
 
     old_mon = cur_mon;
     cur_mon = &hmp;
@@ -4098,7 +4098,7 @@ void monitor_init(Chardev *chr, int flags)
     }
 
     mon = g_malloc(sizeof(*mon));
-    monitor_data_init(mon);
+    monitor_data_init(mon, false);
 
     qemu_chr_fe_init(&mon->chr, chr, &error_abort);
     mon->flags = flags;
-- 
2.7.4

* [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23 17:35   ` Dr. David Alan Gilbert
  2017-08-25 15:27   ` Marc-André Lureau
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll Peter Xu
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

Firstly, introduce Monitor.use_thread, and set it for monitors whose
backend chardev is not of mux type.  We only do this for dedicated
monitors; mux-typed chardevs are not suitable, since the backend may be
shared (when it connects to, e.g., serials and the monitor together).

When use_thread is set, we create a standalone thread to poll the
monitor events, isolated from the main loop thread.  Here we still need
to take the BQL before dispatching the tasks, since some of the monitor
commands are not allowed to execute without the protection of the BQL.
This does give us the chance to avoid taking the BQL for some monitor
commands in the future.
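
The thread lifecycle described above (and the teardown order used in
monitor_data_destroy()) can be sketched with plain pthreads so that the
example is self-contained.  This is only an illustration under stated
assumptions: LoopThread, loop_thread_start() and loop_thread_stop() are
hypothetical names, and the condition variable stands in for the
GMainLoop/GMainContext pair that the patch really uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a monitor's private event loop. */
typedef struct LoopThread {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t wake;
    bool quit;      /* set by the owner to ask the loop to exit */
    bool exited;    /* set by the loop thread right before it returns */
} LoopThread;

static void *loop_thread_fn(void *opaque)
{
    LoopThread *lt = opaque;

    pthread_mutex_lock(&lt->lock);
    while (!lt->quit) {
        /* Sleep until someone wakes us, like a loop blocked in poll(). */
        pthread_cond_wait(&lt->wake, &lt->lock);
    }
    lt->exited = true;
    pthread_mutex_unlock(&lt->lock);
    return NULL;
}

int loop_thread_start(LoopThread *lt)
{
    pthread_mutex_init(&lt->lock, NULL);
    pthread_cond_init(&lt->wake, NULL);
    lt->quit = false;
    lt->exited = false;
    return pthread_create(&lt->thread, NULL, loop_thread_fn, lt);
}

void loop_thread_stop(LoopThread *lt)
{
    int rc;

    pthread_mutex_lock(&lt->lock);
    lt->quit = true;                 /* 1. request quit */
    pthread_cond_signal(&lt->wake);  /* 2. kick the sleeping loop */
    pthread_mutex_unlock(&lt->lock);

    rc = pthread_join(lt->thread, NULL);  /* 3. wait for the thread */
    assert(rc == 0);
    (void)rc;

    /* 4. only now release what the thread was using */
    pthread_cond_destroy(&lt->wake);
    pthread_mutex_destroy(&lt->lock);
}
```

The point of the ordering is that step 2 plays the role of
g_main_context_wakeup(): without it the loop thread may stay asleep and
the join in step 3 may not return promptly.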

* Why this change?

We need these per-monitor threads to make sure we can have at least one
monitor that will never get stuck (that can always receive further
monitor commands).

* So when will monitors get stuck?  And how?

With postcopy and remote page faults, it is simple to get the monitor
stuck (which also means the main loop thread is stuck):

(1) Monitor deadlock on BQL

As we may know, when postcopy is running on the destination VM, a vcpu
thread can get stuck at almost any time, as soon as it tries to access
an uncopied guest page.  Meanwhile, when that happens, it is possible
that the vcpu thread is holding the BQL.  If the page fault is not
handled quickly, you'll find that the monitors stop working, since they
are trying to take the BQL.

If the page fault cannot be handled at all (one case is a paused
postcopy, when the network is temporarily down), monitors will hang
forever.  Without this patch, that means the main loop hangs, and we'll
never find a way to talk to the VM again.

(2) Monitor tries to run code on page-faulted vcpus

The HMP command "info cpus" is a good example: it tries to kick all
the vcpus and sync status from them.  However, if any vcpu is stuck at
an unhandled page fault, the sync can never complete, and the HMP
command hangs.  Again, it hangs the main loop thread as well.

After either (1) or (2), we can see the deadlock problem:

- On one hand, if the monitor hangs, we cannot do the postcopy
  recovery, because postcopy recovery needs the user to specify a new
  listening port on the destination monitor.

- On the other hand, if we cannot recover the paused postcopy, then page
  faults cannot be serviced, and the monitors may hang forever.

* How does this patch help?

- Firstly, each dedicated monitor (that is, one whose backend chardev
  is only used for the monitor) gets its own thread, so even if the
  main loop thread hangs (which is always possible), this monitor
  thread may still survive.

- Not all monitor commands need the BQL.  We can selectively take the
  BQL (depending on which command we are running) to avoid waiting on a
  page-faulted vcpu thread that has taken the BQL (this will be done in
  follow-up patches).

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 monitor.c           | 75 +++++++++++++++++++++++++++++++++++++++++++++++++----
 qapi/qmp-dispatch.c | 15 +++++++++++
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/monitor.c b/monitor.c
index 7c90df7..3d4ecff 100644
--- a/monitor.c
+++ b/monitor.c
@@ -36,6 +36,8 @@
 #include "net/net.h"
 #include "net/slirp.h"
 #include "chardev/char-fe.h"
+#include "chardev/char-mux.h"
+#include "chardev/char-io.h"
 #include "ui/qemu-spice.h"
 #include "sysemu/numa.h"
 #include "monitor/monitor.h"
@@ -190,6 +192,8 @@ struct Monitor {
     int flags;
     int suspend_cnt;
     bool skip_flush;
+    /* Whether the monitor wants to be polled in standalone thread */
+    bool use_thread;
 
     QemuMutex out_lock;
     QString *outbuf;
@@ -206,6 +210,11 @@ struct Monitor {
     mon_cmd_t *cmd_table;
     QLIST_HEAD(,mon_fd_t) fds;
     QLIST_ENTRY(Monitor) entry;
+
+    /* Only used when "use_thread" is used */
+    QemuThread mon_thread;
+    GMainContext *mon_context;
+    GMainLoop *mon_loop;
 };
 
 /* QMP checker flags */
@@ -568,7 +577,7 @@ static void monitor_qapi_event_init(void)
 
 static void handle_hmp_command(Monitor *mon, const char *cmdline);
 
-static void monitor_data_init(Monitor *mon, bool skip_flush)
+static void monitor_data_init(Monitor *mon, bool skip_flush, bool use_thread)
 {
     memset(mon, 0, sizeof(Monitor));
     qemu_mutex_init(&mon->out_lock);
@@ -576,10 +585,34 @@ static void monitor_data_init(Monitor *mon, bool skip_flush)
     /* Use *mon_cmds by default. */
     mon->cmd_table = mon_cmds;
     mon->skip_flush = skip_flush;
+    mon->use_thread = use_thread;
+    if (use_thread) {
+        /*
+         * For monitors that use isolated threads, they'll need their
+         * own GMainContext and GMainLoop.  Otherwise, these pointers
+         * will be NULL, which means the default context will be used.
+         */
+        mon->mon_context = g_main_context_new();
+        mon->mon_loop = g_main_loop_new(mon->mon_context, TRUE);
+    }
 }
 
 static void monitor_data_destroy(Monitor *mon)
 {
+    /* Destroy the thread first if there is */
+    if (mon->use_thread) {
+        /* Notify the per-monitor thread to quit. */
+        g_main_loop_quit(mon->mon_loop);
+        /*
+         * Make sure the context will get the quit message since it's
+         * in another thread.  Without this, it may not be able to
+         * respond to the quit message immediately.
+         */
+        g_main_context_wakeup(mon->mon_context);
+        qemu_thread_join(&mon->mon_thread);
+        g_main_loop_unref(mon->mon_loop);
+        g_main_context_unref(mon->mon_context);
+    }
     qemu_chr_fe_deinit(&mon->chr, false);
     if (monitor_is_qmp(mon)) {
         json_message_parser_destroy(&mon->qmp.parser);
@@ -595,7 +628,7 @@ char *qmp_human_monitor_command(const char *command_line, bool has_cpu_index,
     char *output = NULL;
     Monitor *old_mon, hmp;
 
-    monitor_data_init(&hmp, true);
+    monitor_data_init(&hmp, true, false);
 
     old_mon = cur_mon;
     cur_mon = &hmp;
@@ -3101,6 +3134,11 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
 {
     QDict *qdict;
     const mon_cmd_t *cmd;
+    /*
+     * If we haven't taken the BQL (when called by per-monitor
+     * threads), we need to take care of the BQL on our own.
+     */
+    bool take_bql = !qemu_mutex_iothread_locked();
 
     trace_handle_hmp_command(mon, cmdline);
 
@@ -3116,7 +3154,16 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
         return;
     }
 
+    if (take_bql) {
+        qemu_mutex_lock_iothread();
+    }
+
     cmd->cmd(mon, qdict);
+
+    if (take_bql) {
+        qemu_mutex_unlock_iothread();
+    }
+
     QDECREF(qdict);
 }
 
@@ -4086,6 +4133,15 @@ static void __attribute__((constructor)) monitor_lock_init(void)
     qemu_mutex_init(&monitor_lock);
 }
 
+static void *monitor_thread(void *data)
+{
+    Monitor *mon = data;
+
+    g_main_loop_run(mon->mon_loop);
+
+    return NULL;
+}
+
 void monitor_init(Chardev *chr, int flags)
 {
     static int is_first_init = 1;
@@ -4098,7 +4154,9 @@ void monitor_init(Chardev *chr, int flags)
     }
 
     mon = g_malloc(sizeof(*mon));
-    monitor_data_init(mon, false);
+
+    /* For non-mux typed monitors, we create dedicated threads. */
+    monitor_data_init(mon, false, !CHARDEV_IS_MUX(chr));
 
     qemu_chr_fe_init(&mon->chr, chr, &error_abort);
     mon->flags = flags;
@@ -4112,12 +4170,19 @@ void monitor_init(Chardev *chr, int flags)
 
     if (monitor_is_qmp(mon)) {
         qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read, monitor_qmp_read,
-                                 monitor_qmp_event, NULL, mon, NULL, true);
+                                 monitor_qmp_event, NULL, mon,
+                                 mon->mon_context, true);
         qemu_chr_fe_set_echo(&mon->chr, true);
         json_message_parser_init(&mon->qmp.parser, handle_qmp_command);
     } else {
         qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read, monitor_read,
-                                 monitor_event, NULL, mon, NULL, true);
+                                 monitor_event, NULL, mon,
+                                 mon->mon_context, true);
+    }
+
+    if (mon->use_thread) {
+        qemu_thread_create(&mon->mon_thread, chr->label, monitor_thread,
+                           mon, QEMU_THREAD_JOINABLE);
     }
 
     qemu_mutex_lock(&monitor_lock);
diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 5ad36f8..3b6b224 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -19,6 +19,7 @@
 #include "qapi/qmp/qjson.h"
 #include "qapi-types.h"
 #include "qapi/qmp/qerror.h"
+#include "qemu/main-loop.h"
 
 static QDict *qmp_dispatch_check_obj(const QObject *request, Error **errp)
 {
@@ -75,6 +76,11 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
     QDict *args, *dict;
     QmpCommand *cmd;
     QObject *ret = NULL;
+    /*
+     * If we haven't taken the BQL (when called by per-monitor
+     * threads), we need to take care of the BQL on our own.
+     */
+    bool take_bql = !qemu_mutex_iothread_locked();
 
     dict = qmp_dispatch_check_obj(request, errp);
     if (!dict) {
@@ -101,7 +107,16 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
         QINCREF(args);
     }
 
+    if (take_bql) {
+        qemu_mutex_lock_iothread();
+    }
+
     cmd->fn(args, &ret, &local_err);
+
+    if (take_bql) {
+        qemu_mutex_unlock_iothread();
+    }
+
     if (local_err) {
         error_propagate(errp, local_err);
     } else if (cmd->options & QCO_NO_SUCCESS_RESP) {
-- 
2.7.4

* [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init Peter Xu
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-25 14:44   ` Marc-André Lureau
  2017-08-26  7:19   ` Fam Zheng
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql" Peter Xu
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

This is not a problem as long as there is only one single loop thread,
like before.  However, once per-monitor threads are introduced, that is
no longer true, and a race can happen.

The race can sometimes be triggered with "make check -j8":

  qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
  io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.

This patch keeps the reference to the watch object that is taken when
it is created in io_add_watch_poll(), so that the object can never be
released from within the context main loop, especially when the context
loop is running in another standalone thread.  Meanwhile, when we want
to remove the watch object, we always first detach it from its owner
context, and only then continue with the cleanup.

Without this patch, calling io_remove_watch_poll() in the main loop
thread is not thread-safe, since the per-monitor thread may be
modifying the watch object at the same time.
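
The ownership and ordering rule can be illustrated with a small
single-threaded model.  This is a sketch, not the real GSource API:
Source, source_attach(), source_detach() and watch_remove() are
hypothetical names, and the model only shows who holds which reference
at each step, not the actual concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reference-counted source, loosely modeled on GSource. */
typedef struct Source {
    int refcount;
    bool attached;          /* true while the loop context holds a ref */
    struct Source *child;   /* like iwp->src: owned by this source */
} Source;

int sources_freed;          /* counts finalized objects, for illustration */

Source *source_new(void)
{
    Source *s = calloc(1, sizeof(*s));
    assert(s != NULL);
    s->refcount = 1;        /* the creator's reference */
    return s;
}

void source_unref(Source *s)
{
    if (--s->refcount == 0) {
        /* Same invariant as the failed assertion: child already gone. */
        assert(s->child == NULL);
        free(s);
        sources_freed++;
    }
}

void source_attach(Source *s)
{
    s->attached = true;
    s->refcount++;          /* the context takes its own reference */
}

void source_detach(Source *s)
{
    if (s->attached) {
        s->attached = false;
        source_unref(s);    /* the context drops its reference */
    }
}

/* Teardown in the order this patch uses. */
void watch_remove(Source *parent)
{
    source_detach(parent);              /* 1. loop can no longer see it */
    if (parent->child) {                /* 2. now safe to touch child */
        source_detach(parent->child);
        source_unref(parent->child);
        parent->child = NULL;
    }
    source_unref(parent);               /* 3. drop creator's reference */
}
```

Because the creator keeps its own reference, step 1 cannot finalize the
parent, so the parent is only freed in step 3, after the child has been
cleared.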

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 chardev/char-io.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/chardev/char-io.c b/chardev/char-io.c
index f810524..5c52c40 100644
--- a/chardev/char-io.c
+++ b/chardev/char-io.c
@@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
     g_free(name);
 
     g_source_attach(&iwp->parent, context);
-    g_source_unref(&iwp->parent);
     return (GSource *)iwp;
 }
 
@@ -131,12 +130,24 @@ static void io_remove_watch_poll(GSource *source)
     IOWatchPoll *iwp;
 
     iwp = io_watch_poll_from_source(source);
+
+    /*
+     * Here the order of destruction really matters.  We need to first
+     * detach the IOWatchPoll object from the context (which may still
+     * be running in another loop thread), only after that could we
+     * continue to operate on iwp->src, or there may be a race condition
+     * between current thread and the context loop thread.
+     *
+     * Let's blame the glib bug mentioned in commit 2b3167 (again) for
+     * this extra complexity.
+     */
+    g_source_destroy(&iwp->parent);
     if (iwp->src) {
         g_source_destroy(iwp->src);
         g_source_unref(iwp->src);
         iwp->src = NULL;
     }
-    g_source_destroy(&iwp->parent);
+    g_source_unref(&iwp->parent);
 }
 
 void remove_fd_in_watch(Chardev *chr)
-- 
2.7.4

* [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (2 preceding siblings ...)
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23 17:44   ` Dr. David Alan Gilbert
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql" Peter Xu
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

Introduce this new parameter for QMP commands in general, to mark
commands that do not need the BQL.  Normally QMP commands are executed
under the protection of the BQL in QEMU.  However, the truth is that
not all QMP commands require the BQL.

This new parameter provides a way to allow QMP commands to run in
parallel when possible, without contention on the BQL.

Since the default value of "without-bql" is false, all QMP commands
are still protected by the BQL for now.
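
The decision the dispatcher ends up making can be summarized in one
small function.  This is a sketch only: the option bits are the ones
this patch defines, but dispatch_must_take_bql() is a hypothetical
helper that does not exist in the patch (which keeps the logic inline
in do_qmp_dispatch()).

```c
#include <assert.h>
#include <stdbool.h>

/* Option bits as defined by this patch in include/qapi/qmp/dispatch.h. */
typedef enum QmpCommandOptions {
    QCO_NO_OPTIONS = 0x0,
    QCO_NO_SUCCESS_RESP = 0x1,
    QCO_WITHOUT_BQL = 0x2,
} QmpCommandOptions;

/*
 * Hypothetical helper (not in the patch): returns true when the
 * dispatcher itself must take the BQL around the command handler.
 */
bool dispatch_must_take_bql(bool bql_already_held, int options)
{
    /* Only the bits known to this sketch may be set. */
    assert((options & ~(QCO_NO_SUCCESS_RESP | QCO_WITHOUT_BQL)) == 0);

    if (bql_already_held) {
        /* Called from the main loop: keep the lock, take nothing. */
        return false;
    }
    if (options & QCO_WITHOUT_BQL) {
        /* Command is marked as safe to run without the BQL. */
        return false;
    }
    return true;
}
```

Since the options are independent bits, a command can carry both
QCO_NO_SUCCESS_RESP and QCO_WITHOUT_BQL at once, which is why the
generator joins them with " | ".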

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 docs/devel/qapi-code-gen.txt   | 10 +++++++++-
 include/qapi/qmp/dispatch.h    |  1 +
 qapi/qmp-dispatch.c            | 11 +++++++++++
 scripts/qapi-commands.py       | 18 +++++++++++++-----
 scripts/qapi-introspect.py     |  2 +-
 scripts/qapi.py                | 15 ++++++++++-----
 scripts/qapi2texi.py           |  2 +-
 tests/qapi-schema/test-qapi.py |  2 +-
 8 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/docs/devel/qapi-code-gen.txt b/docs/devel/qapi-code-gen.txt
index 9903ac4..4960d00 100644
--- a/docs/devel/qapi-code-gen.txt
+++ b/docs/devel/qapi-code-gen.txt
@@ -556,7 +556,8 @@ following example objects:
 
 Usage: { 'command': STRING, '*data': COMPLEX-TYPE-NAME-OR-DICT,
          '*returns': TYPE-NAME, '*boxed': true,
-         '*gen': false, '*success-response': false }
+         '*gen': false, '*success-response': false,
+         '*without-bql': false }
 
 Commands are defined by using a dictionary containing several members,
 where three members are most common.  The 'command' member is a
@@ -636,6 +637,13 @@ possible, the command expression should include the optional key
 'success-response' with boolean value false.  So far, only QGA makes
 use of this member.
 
+Most of the commands require the Big QEMU Lock (BQL) be held during
+execution.  However, there is a small subset of the commands that may
+not really need BQL at all.  To mark out this kind of commands, we can
+specify "without-bql" to "true".  This parameter is only a hint for
+internal QMP implementation to provide possibility to allow commands
+be run in parallel, or reduce the contention of the lock.  Users of QMP
+should not really be aware of such information.
 
 === Events ===
 
diff --git a/include/qapi/qmp/dispatch.h b/include/qapi/qmp/dispatch.h
index 20578dc..ec5c620 100644
--- a/include/qapi/qmp/dispatch.h
+++ b/include/qapi/qmp/dispatch.h
@@ -23,6 +23,7 @@ typedef enum QmpCommandOptions
 {
     QCO_NO_OPTIONS = 0x0,
     QCO_NO_SUCCESS_RESP = 0x1,
+    QCO_WITHOUT_BQL = 0x2,
 } QmpCommandOptions;
 
 typedef struct QmpCommand
diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
index 3b6b224..b7fba5e 100644
--- a/qapi/qmp-dispatch.c
+++ b/qapi/qmp-dispatch.c
@@ -107,6 +107,17 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
         QINCREF(args);
     }
 
+    if (cmd->options & QCO_WITHOUT_BQL) {
+        /*
+         * If this command can live without BQL, then we don't take
+         * it.  One thing to mention: we may have already taken the
+         * BQL before reaching here.  If so, we just keep it.  So
+         * generally speaking we are trying our best on reducing the
+         * contention of BQL.
+         */
+        take_bql = false;
+    }
+
     if (take_bql) {
         qemu_mutex_lock_iothread();
     }
diff --git a/scripts/qapi-commands.py b/scripts/qapi-commands.py
index 974d0a4..155a0a4 100644
--- a/scripts/qapi-commands.py
+++ b/scripts/qapi-commands.py
@@ -192,10 +192,17 @@ out:
     return ret
 
 
-def gen_register_command(name, success_response):
-    options = 'QCO_NO_OPTIONS'
+def gen_register_command(name, success_response, without_bql):
+    options = []
+
     if not success_response:
-        options = 'QCO_NO_SUCCESS_RESP'
+        options += ['QCO_NO_SUCCESS_RESP']
+    if without_bql:
+        options += ['QCO_WITHOUT_BQL']
+
+    if not options:
+        options = ['QCO_NO_OPTIONS']
+    options = " | ".join(options)
 
     ret = mcgen('''
     qmp_register_command(cmds, "%(name)s",
@@ -241,7 +248,7 @@ class QAPISchemaGenCommandVisitor(QAPISchemaVisitor):
         self._visited_ret_types = None
 
     def visit_command(self, name, info, arg_type, ret_type,
-                      gen, success_response, boxed):
+                      gen, success_response, boxed, without_bql):
         if not gen:
             return
         self.decl += gen_command_decl(name, arg_type, boxed, ret_type)
@@ -250,7 +257,8 @@ class QAPISchemaGenCommandVisitor(QAPISchemaVisitor):
             self.defn += gen_marshal_output(ret_type)
         self.decl += gen_marshal_decl(name)
         self.defn += gen_marshal(name, arg_type, boxed, ret_type)
-        self._regy += gen_register_command(name, success_response)
+        self._regy += gen_register_command(name, success_response,
+                                           without_bql)
 
 
 (input_file, output_dir, do_c, do_h, prefix, opts) = parse_command_line()
diff --git a/scripts/qapi-introspect.py b/scripts/qapi-introspect.py
index 032bcea..a523544 100644
--- a/scripts/qapi-introspect.py
+++ b/scripts/qapi-introspect.py
@@ -154,7 +154,7 @@ const char %(c_name)s[] = %(c_string)s;
                                     for m in variants.variants]})
 
     def visit_command(self, name, info, arg_type, ret_type,
-                      gen, success_response, boxed):
+                      gen, success_response, boxed, without_bql):
         arg_type = arg_type or self._schema.the_empty_object_type
         ret_type = ret_type or self._schema.the_empty_object_type
         self._gen_json(name, 'command',
diff --git a/scripts/qapi.py b/scripts/qapi.py
index 8aa2775..3951143 100644
--- a/scripts/qapi.py
+++ b/scripts/qapi.py
@@ -920,7 +920,8 @@ def check_exprs(exprs):
         elif 'command' in expr:
             meta = 'command'
             check_keys(expr_elem, 'command', [],
-                       ['data', 'returns', 'gen', 'success-response', 'boxed'])
+                       ['data', 'returns', 'gen', 'success-response',
+                        'boxed', 'without-bql'])
         elif 'event' in expr:
             meta = 'event'
             check_keys(expr_elem, 'event', [], ['data', 'boxed'])
@@ -1031,7 +1032,7 @@ class QAPISchemaVisitor(object):
         pass
 
     def visit_command(self, name, info, arg_type, ret_type,
-                      gen, success_response, boxed):
+                      gen, success_response, boxed, without_bql):
         pass
 
     def visit_event(self, name, info, arg_type, boxed):
@@ -1398,7 +1399,7 @@ class QAPISchemaAlternateType(QAPISchemaType):
 
 class QAPISchemaCommand(QAPISchemaEntity):
     def __init__(self, name, info, doc, arg_type, ret_type,
-                 gen, success_response, boxed):
+                 gen, success_response, boxed, without_bql):
         QAPISchemaEntity.__init__(self, name, info, doc)
         assert not arg_type or isinstance(arg_type, str)
         assert not ret_type or isinstance(ret_type, str)
@@ -1408,6 +1409,7 @@ class QAPISchemaCommand(QAPISchemaEntity):
         self.ret_type = None
         self.gen = gen
         self.success_response = success_response
+        self.without_bql = without_bql
         self.boxed = boxed
 
     def check(self, schema):
@@ -1432,7 +1434,8 @@ class QAPISchemaCommand(QAPISchemaEntity):
     def visit(self, visitor):
         visitor.visit_command(self.name, self.info,
                               self.arg_type, self.ret_type,
-                              self.gen, self.success_response, self.boxed)
+                              self.gen, self.success_response,
+                              self.boxed, self.without_bql)
 
 
 class QAPISchemaEvent(QAPISchemaEntity):
@@ -1639,6 +1642,7 @@ class QAPISchema(object):
         rets = expr.get('returns')
         gen = expr.get('gen', True)
         success_response = expr.get('success-response', True)
+        without_bql = expr.get('without-bql', False)
         boxed = expr.get('boxed', False)
         if isinstance(data, OrderedDict):
             data = self._make_implicit_object_type(
@@ -1647,7 +1651,8 @@ class QAPISchema(object):
             assert len(rets) == 1
             rets = self._make_array_type(rets[0], info)
         self._def_entity(QAPISchemaCommand(name, info, doc, data, rets,
-                                           gen, success_response, boxed))
+                                           gen, success_response,
+                                           boxed, without_bql))
 
     def _def_event(self, expr, info, doc):
         name = expr['event']
diff --git a/scripts/qapi2texi.py b/scripts/qapi2texi.py
index a317526..659bd83 100755
--- a/scripts/qapi2texi.py
+++ b/scripts/qapi2texi.py
@@ -236,7 +236,7 @@ class QAPISchemaGenDocVisitor(qapi.QAPISchemaVisitor):
                              body=texi_entity(doc, 'Members'))
 
     def visit_command(self, name, info, arg_type, ret_type,
-                      gen, success_response, boxed):
+                      gen, success_response, boxed, without_bql):
         doc = self.cur_doc
         if self.out:
             self.out += '\n'
diff --git a/tests/qapi-schema/test-qapi.py b/tests/qapi-schema/test-qapi.py
index c7724d3..15aff29 100644
--- a/tests/qapi-schema/test-qapi.py
+++ b/tests/qapi-schema/test-qapi.py
@@ -36,7 +36,7 @@ class QAPISchemaTestVisitor(QAPISchemaVisitor):
         self._print_variants(variants)
 
     def visit_command(self, name, info, arg_type, ret_type,
-                      gen, success_response, boxed):
+                      gen, success_response, boxed, without_bql):
         print 'command %s %s -> %s' % \
             (name, arg_type and arg_type.name, ret_type and ret_type.name)
         print '   gen=%s success_response=%s boxed=%s' % \
-- 
2.7.4

* [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql"
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (3 preceding siblings ...)
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql" Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23 17:46   ` Dr. David Alan Gilbert
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 6/8] migration: qmp: migrate_incoming don't need BQL Peter Xu
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

Introduce a new option "without_bql" for HMP commands.  It works just
like the QMP "without-bql" option, but for HMP commands.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 monitor.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/monitor.c b/monitor.c
index 3d4ecff..c26c797 100644
--- a/monitor.c
+++ b/monitor.c
@@ -125,6 +125,8 @@ typedef struct mon_cmd_t {
     const char *args_type;
     const char *params;
     const char *help;
+    /* Whether this command can be run without taking BQL? */
+    bool without_bql;
     void (*cmd)(Monitor *mon, const QDict *qdict);
     /* @sub_table is a list of 2nd level of commands. If it does not exist,
      * cmd should be used. If it exists, sub_table[?].cmd should be
@@ -3154,6 +3156,14 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
         return;
     }
 
+    if (cmd->without_bql) {
+        /*
+         * This is similar to QMP's "without-bql".  See comments in
+         * do_qmp_dispatch().
+         */
+        take_bql = false;
+    }
+
     if (take_bql) {
         qemu_mutex_lock_iothread();
     }
-- 
2.7.4

* [Qemu-devel] [RFC v2 6/8] migration: qmp: migrate_incoming don't need BQL
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (4 preceding siblings ...)
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql" Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 7/8] migration: hmp: " Peter Xu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

Let the "migrate-incoming" command run without the BQL.  Then even if a
thread hangs with the BQL held, we can still run this command.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 qapi-schema.json | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/qapi-schema.json b/qapi-schema.json
index 802ea53..b55e73b 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -3259,7 +3259,8 @@
 # <- { "return": {} }
 #
 ##
-{ 'command': 'migrate-incoming', 'data': {'uri': 'str' } }
+{ 'command': 'migrate-incoming', 'data': {'uri': 'str' },
+  'without-bql': 'true' }
 
 ##
 # @xen-save-devices-state:
-- 
2.7.4

* [Qemu-devel] [RFC v2 7/8] migration: hmp: migrate_incoming don't need BQL
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (5 preceding siblings ...)
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 6/8] migration: qmp: migrate_incoming don't need BQL Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock Peter Xu
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

Allow the HMP "migrate_incoming" command to run without the BQL held.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hmp-commands.hx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 1941e19..e8d8812 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -968,6 +968,7 @@ ETEXI
         .params     = "uri",
         .help       = "Continue an incoming migration from an -incoming defer",
         .cmd        = hmp_migrate_incoming,
+        .without_bql = true,
     },
 
 STEXI
-- 
2.7.4

* [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (6 preceding siblings ...)
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 7/8] migration: hmp: " Peter Xu
@ 2017-08-23  6:51 ` Peter Xu
  2017-08-23 18:01   ` Dr. David Alan Gilbert
  2017-08-29 11:03 ` [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Daniel P. Berrange
  2017-09-06 14:50 ` Stefan Hajnoczi
  9 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-23  6:51 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Daniel P . Berrange, Fam Zheng, Juan Quintela,
	mdroth, peterx, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

Now that migrate_incoming can run without the BQL, several invocations
may run in parallel.  Let's provide a migration lock to serialize them.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 6 ++++++
 migration/migration.h | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index c3fe0ed..32058f7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -145,6 +145,7 @@ MigrationIncomingState *migration_incoming_get_current(void)
         mis_current.state = MIGRATION_STATUS_NONE;
         memset(&mis_current, 0, sizeof(MigrationIncomingState));
         qemu_mutex_init(&mis_current.rp_mutex);
+        qemu_mutex_init(&mis_current.mgmt_mutex);
         qemu_event_init(&mis_current.main_thread_load_event, false);
         once = true;
     }
@@ -1171,6 +1172,7 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
 {
     Error *local_err = NULL;
     static bool once = true;
+    MigrationIncomingState *mis = migration_incoming_get_current();
 
     if (!deferred_incoming) {
         error_setg(errp, "For use with '-incoming defer'");
@@ -1180,8 +1182,12 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
         error_setg(errp, "The incoming migration has already been started");
     }
 
+    qemu_mutex_lock(&mis->mgmt_mutex);
+
     qemu_start_incoming_migration(uri, &local_err);
 
+    qemu_mutex_unlock(&mis->mgmt_mutex);
+
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/migration/migration.h b/migration/migration.h
index 148c9fa..95f077b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -58,6 +58,9 @@ struct MigrationIncomingState {
     /* The coroutine we should enter (back) after failover */
     Coroutine *migration_incoming_co;
     QemuSemaphore colo_incoming_sem;
+
+    /* Migration incoming management lock */
+    QemuMutex mgmt_mutex;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
-- 
2.7.4

* Re: [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init Peter Xu
@ 2017-08-23 16:31   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-23 16:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> It's part of the data init.  Collect it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

OK, this can probably go separately as well.

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  monitor.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/monitor.c b/monitor.c
> index e0f8801..7c90df7 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -568,13 +568,14 @@ static void monitor_qapi_event_init(void)
>  
>  static void handle_hmp_command(Monitor *mon, const char *cmdline);
>  
> -static void monitor_data_init(Monitor *mon)
> +static void monitor_data_init(Monitor *mon, bool skip_flush)
>  {
>      memset(mon, 0, sizeof(Monitor));
>      qemu_mutex_init(&mon->out_lock);
>      mon->outbuf = qstring_new();
>      /* Use *mon_cmds by default. */
>      mon->cmd_table = mon_cmds;
> +    mon->skip_flush = skip_flush;
>  }
>  
>  static void monitor_data_destroy(Monitor *mon)
> @@ -594,8 +595,7 @@ char *qmp_human_monitor_command(const char *command_line, bool has_cpu_index,
>      char *output = NULL;
>      Monitor *old_mon, hmp;
>  
> -    monitor_data_init(&hmp);
> -    hmp.skip_flush = true;
> +    monitor_data_init(&hmp, true);
>  
>      old_mon = cur_mon;
>      cur_mon = &hmp;
> @@ -4098,7 +4098,7 @@ void monitor_init(Chardev *chr, int flags)
>      }
>  
>      mon = g_malloc(sizeof(*mon));
> -    monitor_data_init(mon);
> +    monitor_data_init(mon, false);
>  
>      qemu_chr_fe_init(&mon->chr, chr, &error_abort);
>      mon->flags = flags;
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll Peter Xu
@ 2017-08-23 17:35   ` Dr. David Alan Gilbert
  2017-08-25  4:25     ` Peter Xu
  2017-08-25 15:27   ` Marc-André Lureau
  1 sibling, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-23 17:35 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> Firstly, introduce Monitor.use_thread, and set it for monitors that are
> using non-mux typed backend chardev.  We only do this for monitors, so
> mux-typed chardevs are not suitable (when it connects to, e.g., serials
> and the monitor together).
> 
> When use_thread is set, we create standalone thread to poll the monitor
> events, isolated from the main loop thread.  Here we still need to take
> the BQL before dispatching the tasks since some of the monitor commands
> are not allowed to execute without the protection of BQL.  Then this
> gives us the chance to avoid taking the BQL for some monitor commands in
> the future.
> 
> * Why this change?
> 
> We need these per-monitor threads to make sure we can have at least one
> monitor that will never stuck (that can receive further monitor
> commands).
> 
> * So when will monitors stuck?  And, how do they stuck?

(Minor: 'stuck' is past tense, 'stick' is probably the right word; however
'block' is probably what you actually want)

> After we have postcopy and remote page faults, it's simple to achieve a
> stuck in the monitor (which is also a stuck in main loop thread):
> 
> (1) Monitor deadlock on BQL
> 
> As we may know, when postcopy is running on destination VM, the vcpu
> threads can stuck merely any time as long as it tries to access an
> uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> that the vcpu thread is holding the BQL.  If the page fault is not
> handled quickly, you'll find that monitors stop working, which is trying
> to take the BQL.
> 
> If the page fault cannot be handled correctly (one case is a paused
> postcopy, when network is temporarily down), monitors will hang
> forever.  Without current patch, that means the main loop hanged.  We'll
> never find a way to talk to VM again.
> 
> (2) Monitor tries to run codes page-faulted vcpus
> 
> The HMP command "info cpus" is one of the good example - it tries to
> kick all the vcpus and sync status from them.  However, if there is any
> vcpu that stuck at an unhandled page fault, it can never achieve the
> sync, then the HMP hangs.  Again, it hangs the main loop thread as well.
> 
> After either (1) or (2), we can see the deadlock problem:
> 
> - On one hand, if monitor hangs, we cannot do the postcopy recovery,
>   because postcopy recovery needs user to specify new listening port on
>   destination monitor.
> 
> - On the other hand, if we cannot recover the paused postcopy, then page
>   faults cannot be serviced, and the monitors will possibly hang
>   forever then.
> 
> * How this patch helps?
> 
> - Firstly, we'll have our own thread for each dedicated monitor (or say,
>   the backend chardev is only used for monitor), so even main loop
>   thread hangs (it is always possible), this monitor thread may still
>   survive.
> 
> - Not all monitor commands need the BQL.  We can selectively take the
>   BQL (depends on which command we are running) to avoid waiting on a
>   page-faulted vcpu thread that has taken the BQL (this will be done in
>   following up patches).
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

A few high level things:
  a) I think this patch probably wants to split into
     1) A patch that decides whether to create a new thread and
     initialises it
     2) One that starts to fix up the locking

  b) I think you also need to take the bql around any of the custom
     completion functions; (maybe in monitor_find_completion ?)
     since they do things like walk the lists of devices.

  c) As mentioned on irc there's fun to be had with cur_mon and error
     handling - in my local world I have cur_mon declared as __thread
     but never got around to thinking about what should set it up.
     There's also 'wavcapture: Convert to error_report' that I posted
     in March that got rid of some uses of cur_mon in wavcapture.c
     for error_report.  But there's some interesting stuff to be checked
     with where error_reporting goes.

  d) I wonder if it's better to have thread as a flag, so that you have
     to explicitly ask for a monitor to have its own thread.
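
To illustrate point (c), here is a minimal sketch of a thread-local
"current monitor" pointer.  It is not QEMU code: the Monitor struct is a
toy stand-in, and monitor_thread_fn() and run_in_monitor_thread() are
hypothetical helpers invented for this example.  It only shows that each
thread sees its own copy of a __thread variable (a GCC/Clang extension),
so something still has to set it up in every thread that uses it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy stand-in for the real Monitor structure (assumption). */
typedef struct Monitor {
    const char *name;
} Monitor;

/* One pointer per thread; starts out NULL in every new thread. */
static __thread Monitor *cur_mon;

/* Thread body: install this thread's monitor and report what it sees. */
static void *monitor_thread_fn(void *opaque)
{
    cur_mon = opaque;
    return cur_mon;
}

/*
 * Run a helper thread that sets its own cur_mon, and return the value
 * that thread observed.  The caller's cur_mon is left untouched.
 */
static Monitor *run_in_monitor_thread(Monitor *mon)
{
    pthread_t tid;
    void *seen = NULL;

    if (pthread_create(&tid, NULL, monitor_thread_fn, mon) != 0) {
        return NULL;
    }
    pthread_join(tid, &seen);
    return seen;
}
```

If the main thread sets cur_mon to one monitor and then calls
run_in_monitor_thread() with another, the helper thread sees the second
monitor while the main thread's cur_mon still points at the first.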

I'll leave it to Dan to check over the chardev mechanics in here.

Dave

> ---
>  monitor.c           | 75 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  qapi/qmp-dispatch.c | 15 +++++++++++
>  2 files changed, 85 insertions(+), 5 deletions(-)
> 
> diff --git a/monitor.c b/monitor.c
> index 7c90df7..3d4ecff 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -36,6 +36,8 @@
>  #include "net/net.h"
>  #include "net/slirp.h"
>  #include "chardev/char-fe.h"
> +#include "chardev/char-mux.h"
> +#include "chardev/char-io.h"
>  #include "ui/qemu-spice.h"
>  #include "sysemu/numa.h"
>  #include "monitor/monitor.h"
> @@ -190,6 +192,8 @@ struct Monitor {
>      int flags;
>      int suspend_cnt;
>      bool skip_flush;
> +    /* Whether the monitor wants to be polled in standalone thread */
> +    bool use_thread;
>  
>      QemuMutex out_lock;
>      QString *outbuf;
> @@ -206,6 +210,11 @@ struct Monitor {
>      mon_cmd_t *cmd_table;
>      QLIST_HEAD(,mon_fd_t) fds;
>      QLIST_ENTRY(Monitor) entry;
> +
> +    /* Only used when "use_thread" is used */
> +    QemuThread mon_thread;
> +    GMainContext *mon_context;
> +    GMainLoop *mon_loop;
>  };
>  
>  /* QMP checker flags */
> @@ -568,7 +577,7 @@ static void monitor_qapi_event_init(void)
>  
>  static void handle_hmp_command(Monitor *mon, const char *cmdline);
>  
> -static void monitor_data_init(Monitor *mon, bool skip_flush)
> +static void monitor_data_init(Monitor *mon, bool skip_flush, bool use_thread)
>  {
>      memset(mon, 0, sizeof(Monitor));
>      qemu_mutex_init(&mon->out_lock);
> @@ -576,10 +585,34 @@ static void monitor_data_init(Monitor *mon, bool skip_flush)
>      /* Use *mon_cmds by default. */
>      mon->cmd_table = mon_cmds;
>      mon->skip_flush = skip_flush;
> +    mon->use_thread = use_thread;
> +    if (use_thread) {
> +        /*
> +         * For monitors that use isolated threads, they'll need their
> +         * own GMainContext and GMainLoop.  Otherwise, these pointers
> +         * will be NULL, which means the default context will be used.
> +         */
> +        mon->mon_context = g_main_context_new();
> +        mon->mon_loop = g_main_loop_new(mon->mon_context, TRUE);
> +    }
>  }
>  
>  static void monitor_data_destroy(Monitor *mon)
>  {
> +    /* Destroy the thread first if there is */
> +    if (mon->use_thread) {
> +        /* Notify the per-monitor thread to quit. */
> +        g_main_loop_quit(mon->mon_loop);
> +        /*
> +         * Make sure the context will get the quit message since it's
> +         * in another thread.  Without this, it may not be able to
> +         * respond to the quit message immediately.
> +         */
> +        g_main_context_wakeup(mon->mon_context);
> +        qemu_thread_join(&mon->mon_thread);
> +        g_main_loop_unref(mon->mon_loop);
> +        g_main_context_unref(mon->mon_context);
> +    }
>      qemu_chr_fe_deinit(&mon->chr, false);
>      if (monitor_is_qmp(mon)) {
>          json_message_parser_destroy(&mon->qmp.parser);
> @@ -595,7 +628,7 @@ char *qmp_human_monitor_command(const char *command_line, bool has_cpu_index,
>      char *output = NULL;
>      Monitor *old_mon, hmp;
>  
> -    monitor_data_init(&hmp, true);
> +    monitor_data_init(&hmp, true, false);
>  
>      old_mon = cur_mon;
>      cur_mon = &hmp;
> @@ -3101,6 +3134,11 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
>  {
>      QDict *qdict;
>      const mon_cmd_t *cmd;
> +    /*
> +     * If we haven't take the BQL (when called by per-monitor
> +     * threads), we need to take care of the BQL on our own.
> +     */
> +    bool take_bql = !qemu_mutex_iothread_locked();
>  
>      trace_handle_hmp_command(mon, cmdline);
>  
> @@ -3116,7 +3154,16 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
>          return;
>      }
>  
> +    if (take_bql) {
> +        qemu_mutex_lock_iothread();
> +    }
> +
>      cmd->cmd(mon, qdict);
> +
> +    if (take_bql) {
> +        qemu_mutex_unlock_iothread();
> +    }
> +
>      QDECREF(qdict);
>  }
>  
> @@ -4086,6 +4133,15 @@ static void __attribute__((constructor)) monitor_lock_init(void)
>      qemu_mutex_init(&monitor_lock);
>  }
>  
> +static void *monitor_thread(void *data)
> +{
> +    Monitor *mon = data;
> +
> +    g_main_loop_run(mon->mon_loop);
> +
> +    return NULL;
> +}
> +
>  void monitor_init(Chardev *chr, int flags)
>  {
>      static int is_first_init = 1;
> @@ -4098,7 +4154,9 @@ void monitor_init(Chardev *chr, int flags)
>      }
>  
>      mon = g_malloc(sizeof(*mon));
> -    monitor_data_init(mon, false);
> +
> +    /* For non-mux typed monitors, we create dedicated threads. */
> +    monitor_data_init(mon, false, !CHARDEV_IS_MUX(chr));
>  
>      qemu_chr_fe_init(&mon->chr, chr, &error_abort);
>      mon->flags = flags;
> @@ -4112,12 +4170,19 @@ void monitor_init(Chardev *chr, int flags)
>  
>      if (monitor_is_qmp(mon)) {
>          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read, monitor_qmp_read,
> -                                 monitor_qmp_event, NULL, mon, NULL, true);
> +                                 monitor_qmp_event, NULL, mon,
> +                                 mon->mon_context, true);
>          qemu_chr_fe_set_echo(&mon->chr, true);
>          json_message_parser_init(&mon->qmp.parser, handle_qmp_command);
>      } else {
>          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read, monitor_read,
> -                                 monitor_event, NULL, mon, NULL, true);
> +                                 monitor_event, NULL, mon,
> +                                 mon->mon_context, true);
> +    }
> +
> +    if (mon->use_thread) {
> +        qemu_thread_create(&mon->mon_thread, chr->label, monitor_thread,
> +                           mon, QEMU_THREAD_JOINABLE);
>      }
>  
>      qemu_mutex_lock(&monitor_lock);
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index 5ad36f8..3b6b224 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -19,6 +19,7 @@
>  #include "qapi/qmp/qjson.h"
>  #include "qapi-types.h"
>  #include "qapi/qmp/qerror.h"
> +#include "qemu/main-loop.h"
>  
>  static QDict *qmp_dispatch_check_obj(const QObject *request, Error **errp)
>  {
> @@ -75,6 +76,11 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
>      QDict *args, *dict;
>      QmpCommand *cmd;
>      QObject *ret = NULL;
> +    /*
> +     * If we haven't take the BQL (when called by per-monitor
> +     * threads), we need to take care of the BQL on our own.
> +     */
> +    bool take_bql = !qemu_mutex_iothread_locked();
>  
>      dict = qmp_dispatch_check_obj(request, errp);
>      if (!dict) {
> @@ -101,7 +107,16 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
>          QINCREF(args);
>      }
>  
> +    if (take_bql) {
> +        qemu_mutex_lock_iothread();
> +    }
> +
>      cmd->fn(args, &ret, &local_err);
> +
> +    if (take_bql) {
> +        qemu_mutex_unlock_iothread();
> +    }
> +
>      if (local_err) {
>          error_propagate(errp, local_err);
>      } else if (cmd->options & QCO_NO_SUCCESS_RESP) {
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql" Peter Xu
@ 2017-08-23 17:44   ` Dr. David Alan Gilbert
  2017-08-23 23:37     ` Fam Zheng
  2017-08-25  5:35     ` Peter Xu
  0 siblings, 2 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-23 17:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> Introducing this new parameter for QMP commands in general to mark out
> when the command does not need BQL.  Normally QMP command executions are
> done with the protection of BQL in QEMU.  However the truth is that not
> all the QMP commands require the BQL.
> 
> This new parameter provides a way to allow QMP commands to run in
> parallel when possible, without the contention on the BQL.
> 
> Since the default value of "without-bql" is still false, so now all QMP
> commands are still protected by BQL still.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

We should define what a 'without-bql' command is allowed to do:
   'Commands that have without-bql set _may_ be called without the bql
   being taken.  They must not take the bql or any other lock that may
   become dependent on the bql.'
   (Do we need to say anything about RCU?)
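
As a rough sketch of that rule (not the QEMU implementation), the
dispatch decision could look like the code below.  Everything here is an
assumption made for illustration: big_lock is a plain pthread mutex
standing in for the bql, big_lock_held mimics what
qemu_mutex_iothread_locked() reports, and the CMD_* names and dispatch()
are hypothetical.  A flagged command is called without the lock being
taken on its behalf, and its handler must not take it either.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the bql; NOT QEMU's real lock (assumption). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
/* Per-thread "do I hold big_lock?" flag. */
static __thread bool big_lock_held;

/* Hypothetical option bits, shaped like the QCO_* flags in the patch. */
enum cmd_options {
    CMD_NO_OPTIONS  = 0x0,
    CMD_WITHOUT_BQL = 0x2,
};

typedef struct Command {
    enum cmd_options options;
    void (*fn)(bool *ran_with_lock);
} Command;

static void big_lock_acquire(void)
{
    pthread_mutex_lock(&big_lock);
    big_lock_held = true;
}

static void big_lock_release(void)
{
    big_lock_held = false;
    pthread_mutex_unlock(&big_lock);
}

/* Example handler: only records whether the lock was held when it ran. */
static void record_lock_state(bool *ran_with_lock)
{
    *ran_with_lock = big_lock_held;
}

static void dispatch(const Command *cmd, bool *ran_with_lock)
{
    /* Take the lock only if this thread does not already hold it... */
    bool take_lock = !big_lock_held;

    /* ...and never on behalf of a command flagged as not needing it. */
    if (cmd->options & CMD_WITHOUT_BQL) {
        take_lock = false;
    }

    if (take_lock) {
        big_lock_acquire();
    }
    cmd->fn(ran_with_lock);
    if (take_lock) {
        big_lock_release();
    }
}
```

Note the "_may_" in the wording: if the caller already holds the lock,
a flagged command still runs with it held, so the handler cannot assume
either state.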

Also, 'no-bql' is shorter :-)

Dave

> ---
>  docs/devel/qapi-code-gen.txt   | 10 +++++++++-
>  include/qapi/qmp/dispatch.h    |  1 +
>  qapi/qmp-dispatch.c            | 11 +++++++++++
>  scripts/qapi-commands.py       | 18 +++++++++++++-----
>  scripts/qapi-introspect.py     |  2 +-
>  scripts/qapi.py                | 15 ++++++++++-----
>  scripts/qapi2texi.py           |  2 +-
>  tests/qapi-schema/test-qapi.py |  2 +-
>  8 files changed, 47 insertions(+), 14 deletions(-)
> 
> diff --git a/docs/devel/qapi-code-gen.txt b/docs/devel/qapi-code-gen.txt
> index 9903ac4..4960d00 100644
> --- a/docs/devel/qapi-code-gen.txt
> +++ b/docs/devel/qapi-code-gen.txt
> @@ -556,7 +556,8 @@ following example objects:
>  
>  Usage: { 'command': STRING, '*data': COMPLEX-TYPE-NAME-OR-DICT,
>           '*returns': TYPE-NAME, '*boxed': true,
> -         '*gen': false, '*success-response': false }
> +         '*gen': false, '*success-response': false,
> +         '*without-bql': false }
>  
>  Commands are defined by using a dictionary containing several members,
>  where three members are most common.  The 'command' member is a
> @@ -636,6 +637,13 @@ possible, the command expression should include the optional key
>  'success-response' with boolean value false.  So far, only QGA makes
>  use of this member.
>  
> +Most of the commands require the Big QEMU Lock (BQL) be held during
> +execution.  However, there is a small subset of the commands that may
> +not really need BQL at all.  To mark out this kind of commands, we can
> +specify "without-bql" to "true".  This parameter is only a hint for
> +internal QMP implementation to provide possiblility to allow commands
> +be run in parallel, or reduce the contention of the lock.  Users of QMP
> +should not really be aware of such information.

Well, I think users of these commands might select them specifically
because they know that they won't block.  Those who care about latency might
look to use commands that don't take the lock because of a reduced
effect on the performance as well.

Dave

>  === Events ===
>  
> diff --git a/include/qapi/qmp/dispatch.h b/include/qapi/qmp/dispatch.h
> index 20578dc..ec5c620 100644
> --- a/include/qapi/qmp/dispatch.h
> +++ b/include/qapi/qmp/dispatch.h
> @@ -23,6 +23,7 @@ typedef enum QmpCommandOptions
>  {
>      QCO_NO_OPTIONS = 0x0,
>      QCO_NO_SUCCESS_RESP = 0x1,
> +    QCO_WITHOUT_BQL = 0x2,
>  } QmpCommandOptions;
>  
>  typedef struct QmpCommand
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index 3b6b224..b7fba5e 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -107,6 +107,17 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
>          QINCREF(args);
>      }
>  
> +    if (cmd->options & QCO_WITHOUT_BQL) {
> +        /*
> +         * If this command can live without BQL, then we don't take
> +         * it.  One thing to mention: we may have already taken the
> +         * BQL before reaching here.  If so, we just keep it.  So
> +         * generally speaking we are trying our best on reducing the
> +         * contention of BQL.
> +         */
> +        take_bql = false;
> +    }
> +
>      if (take_bql) {
>          qemu_mutex_lock_iothread();
>      }
> diff --git a/scripts/qapi-commands.py b/scripts/qapi-commands.py
> index 974d0a4..155a0a4 100644
> --- a/scripts/qapi-commands.py
> +++ b/scripts/qapi-commands.py
> @@ -192,10 +192,17 @@ out:
>      return ret
>  
>  
> -def gen_register_command(name, success_response):
> -    options = 'QCO_NO_OPTIONS'
> +def gen_register_command(name, success_response, without_bql):
> +    options = []
> +
>      if not success_response:
> -        options = 'QCO_NO_SUCCESS_RESP'
> +        options += ['QCO_NO_SUCCESS_RESP']
> +    if without_bql:
> +        options += ['QCO_WITHOUT_BQL']
> +
> +    if not options:
> +        options = ['QCO_NO_OPTIONS']
> +    options = " | ".join(options)
>  
>      ret = mcgen('''
>      qmp_register_command(cmds, "%(name)s",
> @@ -241,7 +248,7 @@ class QAPISchemaGenCommandVisitor(QAPISchemaVisitor):
>          self._visited_ret_types = None
>  
>      def visit_command(self, name, info, arg_type, ret_type,
> -                      gen, success_response, boxed):
> +                      gen, success_response, boxed, without_bql):
>          if not gen:
>              return
>          self.decl += gen_command_decl(name, arg_type, boxed, ret_type)
> @@ -250,7 +257,8 @@ class QAPISchemaGenCommandVisitor(QAPISchemaVisitor):
>              self.defn += gen_marshal_output(ret_type)
>          self.decl += gen_marshal_decl(name)
>          self.defn += gen_marshal(name, arg_type, boxed, ret_type)
> -        self._regy += gen_register_command(name, success_response)
> +        self._regy += gen_register_command(name, success_response,
> +                                           without_bql)
>  
>  
>  (input_file, output_dir, do_c, do_h, prefix, opts) = parse_command_line()
> diff --git a/scripts/qapi-introspect.py b/scripts/qapi-introspect.py
> index 032bcea..a523544 100644
> --- a/scripts/qapi-introspect.py
> +++ b/scripts/qapi-introspect.py
> @@ -154,7 +154,7 @@ const char %(c_name)s[] = %(c_string)s;
>                                      for m in variants.variants]})
>  
>      def visit_command(self, name, info, arg_type, ret_type,
> -                      gen, success_response, boxed):
> +                      gen, success_response, boxed, without_bql):
>          arg_type = arg_type or self._schema.the_empty_object_type
>          ret_type = ret_type or self._schema.the_empty_object_type
>          self._gen_json(name, 'command',
> diff --git a/scripts/qapi.py b/scripts/qapi.py
> index 8aa2775..3951143 100644
> --- a/scripts/qapi.py
> +++ b/scripts/qapi.py
> @@ -920,7 +920,8 @@ def check_exprs(exprs):
>          elif 'command' in expr:
>              meta = 'command'
>              check_keys(expr_elem, 'command', [],
> -                       ['data', 'returns', 'gen', 'success-response', 'boxed'])
> +                       ['data', 'returns', 'gen', 'success-response',
> +                        'boxed', 'without-bql'])
>          elif 'event' in expr:
>              meta = 'event'
>              check_keys(expr_elem, 'event', [], ['data', 'boxed'])
> @@ -1031,7 +1032,7 @@ class QAPISchemaVisitor(object):
>          pass
>  
>      def visit_command(self, name, info, arg_type, ret_type,
> -                      gen, success_response, boxed):
> +                      gen, success_response, boxed, without_bql):
>          pass
>  
>      def visit_event(self, name, info, arg_type, boxed):
> @@ -1398,7 +1399,7 @@ class QAPISchemaAlternateType(QAPISchemaType):
>  
>  class QAPISchemaCommand(QAPISchemaEntity):
>      def __init__(self, name, info, doc, arg_type, ret_type,
> -                 gen, success_response, boxed):
> +                 gen, success_response, boxed, without_bql):
>          QAPISchemaEntity.__init__(self, name, info, doc)
>          assert not arg_type or isinstance(arg_type, str)
>          assert not ret_type or isinstance(ret_type, str)
> @@ -1408,6 +1409,7 @@ class QAPISchemaCommand(QAPISchemaEntity):
>          self.ret_type = None
>          self.gen = gen
>          self.success_response = success_response
> +        self.without_bql = without_bql
>          self.boxed = boxed
>  
>      def check(self, schema):
> @@ -1432,7 +1434,8 @@ class QAPISchemaCommand(QAPISchemaEntity):
>      def visit(self, visitor):
>          visitor.visit_command(self.name, self.info,
>                                self.arg_type, self.ret_type,
> -                              self.gen, self.success_response, self.boxed)
> +                              self.gen, self.success_response,
> +                              self.boxed, self.without_bql)
>  
>  
>  class QAPISchemaEvent(QAPISchemaEntity):
> @@ -1639,6 +1642,7 @@ class QAPISchema(object):
>          rets = expr.get('returns')
>          gen = expr.get('gen', True)
>          success_response = expr.get('success-response', True)
> +        without_bql = expr.get('without-bql', False)
>          boxed = expr.get('boxed', False)
>          if isinstance(data, OrderedDict):
>              data = self._make_implicit_object_type(
> @@ -1647,7 +1651,8 @@ class QAPISchema(object):
>              assert len(rets) == 1
>              rets = self._make_array_type(rets[0], info)
>          self._def_entity(QAPISchemaCommand(name, info, doc, data, rets,
> -                                           gen, success_response, boxed))
> +                                           gen, success_response,
> +                                           boxed, without_bql))
>  
>      def _def_event(self, expr, info, doc):
>          name = expr['event']
> diff --git a/scripts/qapi2texi.py b/scripts/qapi2texi.py
> index a317526..659bd83 100755
> --- a/scripts/qapi2texi.py
> +++ b/scripts/qapi2texi.py
> @@ -236,7 +236,7 @@ class QAPISchemaGenDocVisitor(qapi.QAPISchemaVisitor):
>                               body=texi_entity(doc, 'Members'))
>  
>      def visit_command(self, name, info, arg_type, ret_type,
> -                      gen, success_response, boxed):
> +                      gen, success_response, boxed, without_bql):
>          doc = self.cur_doc
>          if self.out:
>              self.out += '\n'
> diff --git a/tests/qapi-schema/test-qapi.py b/tests/qapi-schema/test-qapi.py
> index c7724d3..15aff29 100644
> --- a/tests/qapi-schema/test-qapi.py
> +++ b/tests/qapi-schema/test-qapi.py
> @@ -36,7 +36,7 @@ class QAPISchemaTestVisitor(QAPISchemaVisitor):
>          self._print_variants(variants)
>  
>      def visit_command(self, name, info, arg_type, ret_type,
> -                      gen, success_response, boxed):
> +                      gen, success_response, boxed, without_bql):
>          print 'command %s %s -> %s' % \
>              (name, arg_type and arg_type.name, ret_type and ret_type.name)
>          print '   gen=%s success_response=%s boxed=%s' % \
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql"
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql" Peter Xu
@ 2017-08-23 17:46   ` Dr. David Alan Gilbert
  2017-08-25  5:44     ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-23 17:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> Introducing new option "without_bql" for HMP commands.  It works just
> like QMP "without-bql", but for HMP commands.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

It's going to be interesting when we have HMP commands that just call
their QMP equivalents; we'll need to check those for mistakes.

Dave

> ---
>  monitor.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/monitor.c b/monitor.c
> index 3d4ecff..c26c797 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -125,6 +125,8 @@ typedef struct mon_cmd_t {
>      const char *args_type;
>      const char *params;
>      const char *help;
> +    /* Whether this command can be run without taking BQL? */
> +    bool without_bql;
>      void (*cmd)(Monitor *mon, const QDict *qdict);
>      /* @sub_table is a list of 2nd level of commands. If it does not exist,
>       * cmd should be used. If it exists, sub_table[?].cmd should be
> @@ -3154,6 +3156,14 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
>          return;
>      }
>  
> +    if (cmd->without_bql) {
> +        /*
> +         * This is similar to QMP's "without-bql".  See comments in
> +         * do_qmp_dispatch().
> +         */
> +        take_bql = false;
> +    }
> +
>      if (take_bql) {
>          qemu_mutex_lock_iothread();
>      }
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock Peter Xu
@ 2017-08-23 18:01   ` Dr. David Alan Gilbert
  2017-08-25  5:49     ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-23 18:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> Now at least migrate_incoming can be run in parallel.  Let's provide a
> migration lock to protect it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 6 ++++++
>  migration/migration.h | 3 +++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index c3fe0ed..32058f7 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -145,6 +145,7 @@ MigrationIncomingState *migration_incoming_get_current(void)
>          mis_current.state = MIGRATION_STATUS_NONE;
>          memset(&mis_current, 0, sizeof(MigrationIncomingState));
>          qemu_mutex_init(&mis_current.rp_mutex);
> +        qemu_mutex_init(&mis_current.mgmt_mutex);
>          qemu_event_init(&mis_current.main_thread_load_event, false);
>          once = true;
>      }
> @@ -1171,6 +1172,7 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
>  {
>      Error *local_err = NULL;
>      static bool once = true;
> +    MigrationIncomingState *mis = migration_incoming_get_current();

migration_incoming_get_current isn't actually thread-safe itself unless
you can guarantee the initial allocation has happened - otherwise both
threads can race and do the 'once' code at the same time.

Similarly, these locks - they don't protect our 'once' - so a second
thread could come in here and both get past the !once check.
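
A minimal sketch of the two races described above, written as a Python
model rather than QEMU code.  All names here (IncomingState,
get_current, migrate_incoming, the "started" flag) are hypothetical
stand-ins: the point is only that the lazy initialisation needs its own
lock, and that the "already started" check has to happen under the same
lock that protects the start itself.

```python
import threading


class IncomingState:
    """Hypothetical stand-in for MigrationIncomingState."""

    def __init__(self):
        self.mgmt_lock = threading.Lock()
        self.started = False  # plays the role of the 'once' flag


_state = None
_state_init_lock = threading.Lock()


def get_current():
    """Return the singleton, creating it exactly once even if threads race."""
    global _state
    with _state_init_lock:
        if _state is None:
            _state = IncomingState()
    return _state


def migrate_incoming(uri, start):
    """Run start(uri) at most once, no matter how many threads call this."""
    mis = get_current()
    with mis.mgmt_lock:
        # Check and update the flag while holding the lock, so two callers
        # cannot both pass the check.
        if mis.started:
            raise RuntimeError("The incoming migration has already been started")
        start(uri)
        mis.started = True  # only reached if start() did not raise
```

In C the first half could equally be done by initialising the object
before any second thread exists, which avoids the extra lock.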

Dave

>  
>      if (!deferred_incoming) {
>          error_setg(errp, "For use with '-incoming defer'");
> @@ -1180,8 +1182,12 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
>          error_setg(errp, "The incoming migration has already been started");
>      }
>  
> +    qemu_mutex_lock(&mis->mgmt_mutex);
> +
>      qemu_start_incoming_migration(uri, &local_err);
>  
> +    qemu_mutex_unlock(&mis->mgmt_mutex);
> +
>      if (local_err) {
>          error_propagate(errp, local_err);
>          return;
> diff --git a/migration/migration.h b/migration/migration.h
> index 148c9fa..95f077b 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -58,6 +58,9 @@ struct MigrationIncomingState {
>      /* The coroutine we should enter (back) after failover */
>      Coroutine *migration_incoming_co;
>      QemuSemaphore colo_incoming_sem;
> +
> +    /* Migration incoming management lock */
> +    QemuMutex mgmt_mutex;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> -- 
> 2.7.4
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-23 17:44   ` Dr. David Alan Gilbert
@ 2017-08-23 23:37     ` Fam Zheng
  2017-08-25  5:37       ` Peter Xu
  2017-08-25  5:35     ` Peter Xu
  1 sibling, 1 reply; 104+ messages in thread
From: Fam Zheng @ 2017-08-23 23:37 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Daniel P . Berrange,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Wed, 08/23 18:44, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Introducing this new parameter for QMP commands in general to mark out
> > when the command does not need BQL.  Normally QMP command executions are
> > done with the protection of BQL in QEMU.  However the truth is that not
> > all the QMP commands require the BQL.
> > 
> > This new parameter provides a way to allow QMP commands to run in
> > parallel when possible, without the contention on the BQL.
> > 
> > Since the default value of "without-bql" is still false, so now all QMP
> > commands are still protected by BQL still.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> We should define what a 'without-bql' command is allowed to do:
>    'Commands that have without-bql set _may_ be called without the bql
>    being taken.  They must not take the bql or any other lock that may
>    become dependent on the bql.'
>    (Do we need to say anything about RCU?)
> 
> Also, 'no-bql' is shorter :-)

Or rather "need-bql" that defaults to true to avoid double negative (TM) with
"no-bql = false"?

Fam

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-23 17:35   ` Dr. David Alan Gilbert
@ 2017-08-25  4:25     ` Peter Xu
  2017-08-25  9:30       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-25  4:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Wed, Aug 23, 2017 at 06:35:35PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > using non-mux typed backend chardev.  We only do this for monitors, so
> > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > and the monitor together).
> > 
> > When use_thread is set, we create standalone thread to poll the monitor
> > events, isolated from the main loop thread.  Here we still need to take
> > the BQL before dispatching the tasks since some of the monitor commands
> > are not allowed to execute without the protection of BQL.  Then this
> > gives us the chance to avoid taking the BQL for some monitor commands in
> > the future.
> > 
> > * Why this change?
> > 
> > We need these per-monitor threads to make sure we can have at least one
> > monitor that will never stuck (that can receive further monitor
> > commands).
> > 
> > * So when will monitors stuck?  And, how do they stuck?
> 
> (Minor: 'stuck' is past tense, 'stick' is probably the right word; however
> 'block' is probably what you actually want)

Yet another English error.  Thanks! :-)

(I guess "monitors get stuck" should also work?)

> 
> > After we have postcopy and remote page faults, it's simple to achieve a
> > stuck in the monitor (which is also a stuck in main loop thread):
> > 
> > (1) Monitor deadlock on BQL
> > 
> > As we may know, when postcopy is running on destination VM, the vcpu
> > threads can stuck merely any time as long as it tries to access an
> > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > that the vcpu thread is holding the BQL.  If the page fault is not
> > handled quickly, you'll find that monitors stop working, which is trying
> > to take the BQL.
> > 
> > If the page fault cannot be handled correctly (one case is a paused
> > postcopy, when network is temporarily down), monitors will hang
> > forever.  Without current patch, that means the main loop hanged.  We'll
> > never find a way to talk to VM again.
> > 
> > (2) Monitor tries to run codes page-faulted vcpus
> > 
> > The HMP command "info cpus" is one of the good example - it tries to
> > kick all the vcpus and sync status from them.  However, if there is any
> > vcpu that stuck at an unhandled page fault, it can never achieve the
> > sync, then the HMP hangs.  Again, it hangs the main loop thread as well.
> > 
> > After either (1) or (2), we can see the deadlock problem:
> > 
> > - On one hand, if monitor hangs, we cannot do the postcopy recovery,
> >   because postcopy recovery needs user to specify new listening port on
> >   destination monitor.
> > 
> > - On the other hand, if we cannot recover the paused postcopy, then page
> >   faults cannot be serviced, and the monitors will possibly hang
> >   forever then.
> > 
> > * How this patch helps?
> > 
> > - Firstly, we'll have our own thread for each dedicated monitor (or say,
> >   the backend chardev is only used for monitor), so even main loop
> >   thread hangs (it is always possible), this monitor thread may still
> >   survive.
> > 
> > - Not all monitor commands need the BQL.  We can selectively take the
> >   BQL (depends on which command we are running) to avoid waiting on a
> >   page-faulted vcpu thread that has taken the BQL (this will be done in
> >   following up patches).
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> A few high level things:
>   a) I think this patch probably wants to split into
>      1) A patch that decides whether to create a new thread and
>      initialises it
>      2) One that starts to fix up the locking

Sure.

> 
>   b) I think you also need to take the bql around any of the custom
>      completion functions; (maybe in monitor_find_completion ?)
>      since they do things like walk the lists of devices.

Ah, yes.  Actually IMHO those completions should be protected by
smaller locks as well.  Considering this only affects HMP, how about
this: when "without-bql" is set for a command, it should mean that the
whole command does not need the BQL; this should include not only the
command execution part, but also the command auto completion routine.
So I take the BQL in the completion only for those whose "without-bql"
is unset, like the trick played for the command execution part.

For the only command "migrate_incoming", it does not have completion
routine, so "without-bql=true" still applies.

Or would you prefer I just take the lock unconditionally?
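
A minimal Python model of the first option, assuming a hypothetical
Command table entry and a plain lock standing in for the BQL (none of
these names are QEMU APIs).  It only shows the shape of the idea: the
same per-command flag decides whether the lock is taken for both
execution and completion.

```python
import threading
from contextlib import nullcontext

# Stand-in for the Big QEMU Lock.
bql = threading.Lock()


class Command:
    """Hypothetical HMP command table entry (loosely modelled on mon_cmd_t)."""

    def __init__(self, name, run, complete=None, without_bql=False):
        self.name = name
        self.run = run
        self.complete = complete
        self.without_bql = without_bql


def _guard(cmd):
    """Pick the context to run under: no lock at all, or the BQL."""
    return nullcontext() if cmd.without_bql else bql


def execute(cmd, args):
    """Run the command handler, taking the BQL unless the command opts out."""
    with _guard(cmd):
        return cmd.run(args)


def complete(cmd, prefix):
    """Run the completion routine under the same rule as execution."""
    if cmd.complete is None:
        return []
    with _guard(cmd):
        return cmd.complete(prefix)
```

Taking the lock unconditionally for completion would just mean
replacing `_guard(cmd)` with `bql` in `complete()`.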

> 
>   c) As mentioned on irc there's fun to be had with cur_mon and error
>      handling - in my local world I have cur_mon declared as __thread
>      but never got around to thinking about what should set it up.
>      There's also 'wavcapture: Convert to error_report' that I posted
>      in March that got rid of some uses of cur_mon in wavcapture.c
>      for error_report.

Yeh.  I at least also see a positive ACK from Markus in the other
thread for per-thread cur_mon, sounds like this is the right way to
go.

To set up cur_mon, what I can think of is to create a wrapper for
pthread_create() in qemu_thread_create().  I see that we have done a
similar thing in util/qemu-thread-win32.c for Windows.  With that we
can set up cur_mon before going into the real thread function but in
the right context, though we may need one more parameter for the
current qemu_thread_create():

void qemu_thread_create(QemuThread *thread, const char *name,
                       void *(*start_routine)(void*),
                       void *arg, int mode, Monitor *mon);

Then we can specify monitor for any new thread (default to cur_mon).
For per-monitor threads, I think we need to pass in that specific mon.

Is this doable?
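
A rough Python model of the wrapper idea, assuming thread-local storage
for cur_mon.  The helper names (get_cur_mon, set_cur_mon, thread_create)
are made up for illustration and the monitors are plain strings; the
real change would live in qemu_thread_create() in C.

```python
import threading

# Per-thread storage: each thread sees its own cur_mon.
_tls = threading.local()


def get_cur_mon():
    """Return the calling thread's monitor, or None if none was set."""
    return getattr(_tls, "cur_mon", None)


def set_cur_mon(mon):
    _tls.cur_mon = mon


def thread_create(start_routine, arg, mon=None):
    """Start a thread whose cur_mon is set before start_routine runs.

    If no monitor is given, the new thread inherits the creator's.
    """
    if mon is None:
        mon = get_cur_mon()  # evaluated in the creating thread

    def trampoline():
        set_cur_mon(mon)  # runs in the new thread's context
        start_routine(arg)

    thread = threading.Thread(target=trampoline)
    thread.start()
    return thread
```

A per-monitor thread would pass its own monitor explicitly, while
every other caller could leave `mon` unset and inherit.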

> But there's some interesting stuff to be checked
>      with where error_reporting goes.

Do you mean the case when e.g. we only have one HMP and that HMP is
threaded?  If so, I guess the error_report()s will be directed mostly
to stderr.

I believe it'll break some HMP users, but IIUC HMP behavior is allowed
to be changed and after all people can still catch the error message
somewhere, though outside that HMP console.  So I think it might be ok.

> 
>   d) I wonder if it's better to have thread as a flag, so that you have
>      to explicitly ask for a monitor to have its own thread.

This should be doable.  Would a new parameter for "-qmp" and "-hmp"
suffice?

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-23 17:44   ` Dr. David Alan Gilbert
  2017-08-23 23:37     ` Fam Zheng
@ 2017-08-25  5:35     ` Peter Xu
  2017-08-25  9:06       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-25  5:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Wed, Aug 23, 2017 at 06:44:12PM +0100, Dr. David Alan Gilbert wrote:

[...]

> > +Most of the commands require the Big QEMU Lock (BQL) be held during
> > +execution.  However, there is a small subset of the commands that may
> > +not really need BQL at all.  To mark out this kind of commands, we can
> > +specify "without-bql" to "true".  This parameter is only a hint for
> > +internal QMP implementation to provide possiblility to allow commands
> > +be run in parallel, or reduce the contention of the lock.  Users of QMP
> > +should not really be aware of such information.
> 
> Well, I think users of these commands might select them specifically
> because they know that they won't block.  Those who care about latency might
> look to use commands that don't take the lock because of a reduced
> effect on the performance as well.

What would be the best way to tell the user?  I think again this should
mostly be for HMP only, right?

Maybe we can add a new command to list these lock-free commands.  Or,
I can dump something in "help" and "help info" like:

(qemu) help migrate_incoming
migrate_incoming uri -- Continue an incoming migration from an -incoming defer (BQL-less)

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-23 23:37     ` Fam Zheng
@ 2017-08-25  5:37       ` Peter Xu
  2017-08-25  9:14         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-25  5:37 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini,
	Daniel P . Berrange, Juan Quintela, mdroth, Eric Blake,
	Laurent Vivier, Markus Armbruster

On Thu, Aug 24, 2017 at 07:37:32AM +0800, Fam Zheng wrote:
> On Wed, 08/23 18:44, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Introducing this new parameter for QMP commands in general to mark out
> > > when the command does not need BQL.  Normally QMP command executions are
> > > done with the protection of BQL in QEMU.  However the truth is that not
> > > all the QMP commands require the BQL.
> > > 
> > > This new parameter provides a way to allow QMP commands to run in
> > > parallel when possible, without the contention on the BQL.
> > > 
> > > Since the default value of "without-bql" is still false, so now all QMP
> > > commands are still protected by BQL still.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > We should define what a 'without-bql' command is allowed to do:
> >    'Commands that have without-bql set _may_ be called without the bql
> >    being taken.  They must not take the bql or any other lock that may
> >    become dependent on the bql.'

Sure.

> >    (Do we need to say anything about RCU?)

Could I ask how RCU is related?

> > 
> > Also, 'no-bql' is shorter :-)
> 
> Or rather "need-bql" that defaults to true to avoid double negative (TM) with
> "no-bql = false"?

Ok let me use "need-bql". :)
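
A small sketch of what the positive form could look like on the schema
side, mirroring the expr.get('without-bql', False) line in the patch.
The key name 'need-bql' is only the proposal from this thread, and
command_needs_bql is a hypothetical helper, not part of scripts/qapi.py.

```python
def command_needs_bql(expr):
    """Return whether a command expression requires the BQL.

    A missing 'need-bql' key means True, so existing commands keep
    today's behaviour and only explicit opt-outs change.
    """
    value = expr.get('need-bql', True)
    if not isinstance(value, bool):
        raise ValueError("'need-bql' must be a boolean")
    return value
```

With this shape the only spelling that changes behaviour is an explicit
'need-bql': False, which avoids the "no-bql = false" double negative.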

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql"
  2017-08-23 17:46   ` Dr. David Alan Gilbert
@ 2017-08-25  5:44     ` Peter Xu
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-25  5:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Wed, Aug 23, 2017 at 06:46:29PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Introducing new option "without_bql" for HMP commands.  It works just
> > like QMP "without-bql", but for HMP commands.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> It's going to be interesting when we have HMP commands that just call
> their QMP equivalents; we'll need to check those for mistakes.

But we don't have any QMP & HMP mapping between them, do we?  Then I
have no good idea on how we can do the check.

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock
  2017-08-23 18:01   ` Dr. David Alan Gilbert
@ 2017-08-25  5:49     ` Peter Xu
  2017-08-25  9:34       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-25  5:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Wed, Aug 23, 2017 at 07:01:35PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Now at least migrate_incoming can be run in parallel.  Let's provide a
> > migration lock to protect it.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/migration.c | 6 ++++++
> >  migration/migration.h | 3 +++
> >  2 files changed, 9 insertions(+)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index c3fe0ed..32058f7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -145,6 +145,7 @@ MigrationIncomingState *migration_incoming_get_current(void)
> >          mis_current.state = MIGRATION_STATUS_NONE;
> >          memset(&mis_current, 0, sizeof(MigrationIncomingState));
> >          qemu_mutex_init(&mis_current.rp_mutex);
> > +        qemu_mutex_init(&mis_current.mgmt_mutex);
> >          qemu_event_init(&mis_current.main_thread_load_event, false);
> >          once = true;
> >      }
> > @@ -1171,6 +1172,7 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
> >  {
> >      Error *local_err = NULL;
> >      static bool once = true;
> > +    MigrationIncomingState *mis = migration_incoming_get_current();
> 
> migration_incoming_get_current isn't actually thread-safe itself unless
> you can guarantee the initial allocation has happened - otherwise both
> threads can race and do the 'once' code at the same time.

How about I init the incoming object as well in
migration_object_init()?

> 
> Similarly, these locks - they don't protect our 'once' - so a second
> thread could come in here and both get past the !once check.

Oh I missed this one since actually I am removing that "once" variable
in postcopy recovery series. :)

I can put the last two patches into postcopy recovery series, then
it'll be fine.

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-25  5:35     ` Peter Xu
@ 2017-08-25  9:06       ` Dr. David Alan Gilbert
  2017-08-28  8:26         ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25  9:06 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Aug 23, 2017 at 06:44:12PM +0100, Dr. David Alan Gilbert wrote:
> 
> [...]
> 
> > > +Most of the commands require the Big QEMU Lock (BQL) be held during
> > > +execution.  However, there is a small subset of the commands that may
> > > +not really need BQL at all.  To mark out this kind of commands, we can
> > > +specify "without-bql" to "true".  This parameter is only a hint for
> > > +internal QMP implementation to provide possiblility to allow commands
> > > +be run in parallel, or reduce the contention of the lock.  Users of QMP
> > > +should not really be aware of such information.
> > 
> > Well, I think users of these commands might select them specifically
> > because they know that they won't block.  Those who care about latency might
> > look to use commands that don't take the lock because of a reduced
> > effect on the performance as well.
> 
> What would be the best way to tell the user?  I think again this should
> mostly be for HMP only, right?

It needs to be documented for QMP users as well so that those developing
management code know what's safe.

> Maybe we can add a new command to list these lock-free commands.  Or,
> I can dump something in "help" and "help info" like:
> 
> (qemu) help migrate_incoming
> migrate_incoming uri -- Continue an incoming migration from an -incoming defer (BQL-less)

'lock free' might be better?

Dave

> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-25  5:37       ` Peter Xu
@ 2017-08-25  9:14         ` Dr. David Alan Gilbert
  2017-08-28  8:08           ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25  9:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fam Zheng, qemu-devel, Paolo Bonzini, Daniel P . Berrange,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Thu, Aug 24, 2017 at 07:37:32AM +0800, Fam Zheng wrote:
> > On Wed, 08/23 18:44, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > Introducing this new parameter for QMP commands in general to mark out
> > > > when the command does not need BQL.  Normally QMP command executions are
> > > > done with the protection of BQL in QEMU.  However the truth is that not
> > > > all the QMP commands require the BQL.
> > > > 
> > > > This new parameter provides a way to allow QMP commands to run in
> > > > parallel when possible, without the contention on the BQL.
> > > > 
> > > > Since the default value of "without-bql" is still false, so now all QMP
> > > > commands are still protected by BQL still.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > 
> > > We should define what a 'without-bql' command is allowed to do:
> > >    'Commands that have without-bql set _may_ be called without the bql
> > >    being taken.  They must not take the bql or any other lock that may
> > >    become dependent on the bql.'
> 
> Sure.
> 
> > >    (Do we need to say anything about RCU?)
> 
> Could I ask how RCU is related?

My definition above said that anything declared without bql couldn't
take the bql, so couldn't block on any other thread holding the bql.
But is our command allowed to use synchronize_rcu or rcu_read_lock
that could wait for or block other threads doing rcu stuff?
Because if it did, is there any guarantee that it wouldn't block?


> 
> > > 
> > > Also, 'no-bql' is shorter :-)
> > 
> > Or rather "need-bql" that defaults to true to avoid double negative (TM) with
> > "no-bql = false"?
> 
> Ok let me use "need-bql". :)

Fine by me.

Dave

> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25  4:25     ` Peter Xu
@ 2017-08-25  9:30       ` Dr. David Alan Gilbert
  2017-08-28  5:53         ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25  9:30 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Aug 23, 2017 at 06:35:35PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > > using non-mux typed backend chardev.  We only do this for monitors, so
> > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > > and the monitor together).
> > > 
> > > When use_thread is set, we create standalone thread to poll the monitor
> > > events, isolated from the main loop thread.  Here we still need to take
> > > the BQL before dispatching the tasks since some of the monitor commands
> > > are not allowed to execute without the protection of BQL.  Then this
> > > gives us the chance to avoid taking the BQL for some monitor commands in
> > > the future.
> > > 
> > > * Why this change?
> > > 
> > > We need these per-monitor threads to make sure we can have at least one
> > > monitor that will never stuck (that can receive further monitor
> > > commands).
> > > 
> > > * So when will monitors stuck?  And, how do they stuck?
> > 
> > (Minor: 'stuck' is past tense, 'stick' is probably the right word; however
> > 'block' is probably what you actually want)
> 
> Yet another English error.  Thanks! :-)

That's OK - only minor.

> (I guess "monitors get stuck" should also work?)

Yes.

> > 
> > > After we have postcopy and remote page faults, it's simple to achieve a
> > > stuck in the monitor (which is also a stuck in main loop thread):
> > > 
> > > (1) Monitor deadlock on BQL
> > > 
> > > As we may know, when postcopy is running on destination VM, the vcpu
> > > threads can stuck merely any time as long as it tries to access an
> > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > > that the vcpu thread is holding the BQL.  If the page fault is not
> > > handled quickly, you'll find that monitors stop working, which is trying
> > > to take the BQL.
> > > 
> > > If the page fault cannot be handled correctly (one case is a paused
> > > postcopy, when network is temporarily down), monitors will hang
> > > forever.  Without current patch, that means the main loop hanged.  We'll
> > > never find a way to talk to VM again.
> > > 
> > > (2) Monitor tries to run codes page-faulted vcpus
> > > 
> > > The HMP command "info cpus" is one of the good example - it tries to
> > > kick all the vcpus and sync status from them.  However, if there is any
> > > vcpu that stuck at an unhandled page fault, it can never achieve the
> > > sync, then the HMP hangs.  Again, it hangs the main loop thread as well.
> > > 
> > > After either (1) or (2), we can see the deadlock problem:
> > > 
> > > - On one hand, if monitor hangs, we cannot do the postcopy recovery,
> > >   because postcopy recovery needs user to specify new listening port on
> > >   destination monitor.
> > > 
> > > - On the other hand, if we cannot recover the paused postcopy, then page
> > >   faults cannot be serviced, and the monitors will possibly hang
> > >   forever then.
> > > 
> > > * How this patch helps?
> > > 
> > > - Firstly, we'll have our own thread for each dedicated monitor (or say,
> > >   the backend chardev is only used for monitor), so even main loop
> > >   thread hangs (it is always possible), this monitor thread may still
> > >   survive.
> > > 
> > > - Not all monitor commands need the BQL.  We can selectively take the
> > >   BQL (depends on which command we are running) to avoid waiting on a
> > >   page-faulted vcpu thread that has taken the BQL (this will be done in
> > >   following up patches).
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > A few high level things:
> >   a) I think this patch probably wants to split into
> >      1) A patch that decides whether to create a new thread and
> >      initialises it
> >      2) One that starts to fix up the locking
> 
> Sure.
> 
> > 
> >   b) I think you also need to take the bql around any of the custom
> >      completion functions; (maybe in monitor_find_completion ?)
> >      since they do things like walk the lists of devices.
> 
> Ah, yes.  Actually IMHO those completions should be protected by
> smaller locks as well.  Considering this only affects HMP, how about
> this: when "without-bql" is set for a command, it should mean that the
> whole command does not need the BQL; this should include not only the
> command execution part, but also the command auto completion routine.
> So I take the BQL in the completion only for those whose "without-bql"
> is unset, like the trick played for the command execution part.
> 
> For the only command "migrate_incoming", it does not have completion
> routine, so "without-bql=true" still applies.
> 
> Or would you prefer I just take the lock unconditionally?

I think either of those would work; no preference.

> > 
> >   c) As mentioned on irc there's fun to be had with cur_mon and error
> >      handling - in my local world I have cur_mon declared as __thread
> >      but never got around to thinking about what should set it up.
> >      There's also 'wavcapture: Convert to error_report' that I posted
> >      in March that got rid of some uses of cur_mon in wavcapture.c
> >      for error_report.
> 
> Yeh.  I at least also see a positive ACK from Markus in the other
> thread for per-thread cur_mon, sounds like this is the right way to
> go.
> 
> To set up cur_mon, what I can think of is to create a wrapper for
> pthread_create() in qemu_thread_create().  I see that we have done a
> similar thing in util/qemu-thread-win32.c for Windows.  With that we
> can set up cur_mon before entering the real thread function, but
> already in the right context, though we may need one more parameter
> for the current qemu_thread_create():
> 
> void qemu_thread_create(QemuThread *thread, const char *name,
>                        void *(*start_routine)(void*),
>                        void *arg, int mode, Monitor *mon);
> 
> Then we can specify monitor for any new thread (default to cur_mon).
> For per-monitor threads, I think we need to pass in that specific mon.
> 
> Is this doable?

That would mean changing all the qemu_thread_create calls, but yes,
I guess it is doable.  I'd thought of it the other way: perhaps a new
thread inherits the Monitor, except when the monitor itself creates threads.
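
A minimal sketch of the wrapper idea, assuming a thread-local monitor pointer and a trampoline that sets it before the real thread function runs. DemoMonitor, demo_cur_mon and demo_thread_create() are hypothetical names; this is plain pthreads, not the actual qemu_thread_create().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMU's Monitor. */
typedef struct DemoMonitor {
    const char *name;
} DemoMonitor;

/* One pointer per thread, like a cur_mon declared as __thread. */
static __thread DemoMonitor *demo_cur_mon;

/* Everything the trampoline needs, handed over on the heap. */
typedef struct DemoThreadArgs {
    void *(*start_routine)(void *);
    void *arg;
    DemoMonitor *mon;
} DemoThreadArgs;

/* Runs in the new thread: set the per-thread monitor first, then call
 * the function the caller actually asked for. */
static void *demo_thread_trampoline(void *opaque)
{
    DemoThreadArgs *args = opaque;
    void *(*start_routine)(void *) = args->start_routine;
    void *arg = args->arg;

    demo_cur_mon = args->mon;
    free(args);
    return start_routine(arg);
}

/* Wrapper around pthread_create() with one extra parameter naming the
 * monitor the new thread should see.  Returns 0 on success. */
static int demo_thread_create(pthread_t *thread,
                              void *(*start_routine)(void *),
                              void *arg, DemoMonitor *mon)
{
    DemoThreadArgs *args;
    int ret;

    assert(start_routine != NULL);
    args = malloc(sizeof(*args));
    if (!args) {
        return -1;
    }
    args->start_routine = start_routine;
    args->arg = arg;
    args->mon = mon;
    ret = pthread_create(thread, NULL, demo_thread_trampoline, args);
    if (ret != 0) {
        free(args);
    }
    return ret;
}

/* Example thread body: reports which monitor this thread sees. */
static void *demo_get_cur_mon(void *unused)
{
    (void)unused;
    return demo_cur_mon;
}
```

The "inherit" variant needs no new parameter at the call sites: the wrapper would read the creating thread's demo_cur_mon and store it in args->mon itself, and only the monitor code would pass a different value.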

> > But there's some interesting stuff to be checked
> >      with where error_reporting goes.
> 
> Do you mean the case when e.g. we only have one HMP and that HMP is
> threaded?  If so, I guess the error_report()s will be directed mostly
> to stderr.
>
> I believe it'll break some HMP users, but IIUC HMP behavior is allowed
> to be changed and after all people can still catch the error message
> somewhere, though outside that HMP console.  So I think it might be ok.

I think if we get cur_mon right then it'll work OK.

> > 
> >   d) I wonder if it's better to have thread as a flag, so that you have
> > >      to explicitly ask for a monitor to have its own thread.
> 
> This should be doable.  Would a new parameter for "-qmp" and "-hmp"
> suffice?

Yes.

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock
  2017-08-25  5:49     ` Peter Xu
@ 2017-08-25  9:34       ` Dr. David Alan Gilbert
  2017-08-28  8:39         ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25  9:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Aug 23, 2017 at 07:01:35PM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > Now at least migrate_incoming can be run in parallel.  Let's provide a
> > > migration lock to protect it.
> > > 
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  migration/migration.c | 6 ++++++
> > >  migration/migration.h | 3 +++
> > >  2 files changed, 9 insertions(+)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index c3fe0ed..32058f7 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -145,6 +145,7 @@ MigrationIncomingState *migration_incoming_get_current(void)
> > >          mis_current.state = MIGRATION_STATUS_NONE;
> > >          memset(&mis_current, 0, sizeof(MigrationIncomingState));
> > >          qemu_mutex_init(&mis_current.rp_mutex);
> > > +        qemu_mutex_init(&mis_current.mgmt_mutex);
> > >          qemu_event_init(&mis_current.main_thread_load_event, false);
> > >          once = true;
> > >      }
> > > @@ -1171,6 +1172,7 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
> > >  {
> > >      Error *local_err = NULL;
> > >      static bool once = true;
> > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > 
> > migration_incoming_get_current isn't actually thread-safe itself unless
> > you can guarantee the initial allocation has happened - otherwise both
> > threads can race and do the 'once' code at the same time.
> 
> How about I init the incoming object as well in
> migration_object_init()?

Yes I think that might work.
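
For comparison with initialising the object eagerly at startup, here is a minimal sketch of a lazy getter that is safe against the race described above. It assumes POSIX pthread_once(); DemoIncomingState and the demo_* names are hypothetical and the structure is reduced to two fields.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for MigrationIncomingState. */
typedef struct DemoIncomingState {
    int state;
    pthread_mutex_t mgmt_mutex;
} DemoIncomingState;

static DemoIncomingState demo_mis;
static pthread_once_t demo_mis_once = PTHREAD_ONCE_INIT;
static int demo_init_calls;   /* only here so the effect can be observed */

/* Runs exactly once, no matter how many threads race into the getter. */
static void demo_mis_init(void)
{
    memset(&demo_mis, 0, sizeof(demo_mis));
    pthread_mutex_init(&demo_mis.mgmt_mutex, NULL);
    demo_init_calls++;
}

static DemoIncomingState *demo_incoming_get_current(void)
{
    int ret = pthread_once(&demo_mis_once, demo_mis_init);

    assert(ret == 0);
    (void)ret;
    return &demo_mis;
}

/* Example thread body used to exercise the getter concurrently. */
static void *demo_getter_thread(void *unused)
{
    (void)unused;
    return demo_incoming_get_current();
}
```

Initialising in a startup hook, as proposed above, avoids even the pthread_once() call on every access, as long as the hook is guaranteed to run before any other thread can reach the getter.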

> > 
> > Similarly, these locks - they don't protect our 'once' - so a second
> > thread could come in here and both get past the !once check.
> 
> Oh I missed this one since actually I am removing that "once" variable
> in postcopy recovery series. :)
> 
> I can put the last two patches into postcopy recovery series, then
> it'll be fine.

OK; these things just emphasise how hard it is to make a function really
lock free.
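
A minimal sketch of the point about the 'once' flag: the check and the update have to sit in the same critical section, otherwise two threads can both pass the check. The names are hypothetical and a plain pthread mutex stands in for the proposed mgmt_mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static pthread_mutex_t demo_mgmt_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool demo_incoming_allowed = true;   /* plays the role of 'once' */

/* Returns 0 for the single caller that gets to start the incoming
 * migration, -1 for everyone else (or if start() itself fails). */
static int demo_migrate_incoming(int (*start)(void))
{
    int ret = -1;

    assert(start != NULL);
    pthread_mutex_lock(&demo_mgmt_mutex);
    if (demo_incoming_allowed) {
        ret = start();
        if (ret == 0) {
            /* Cleared under the same lock that guarded the check. */
            demo_incoming_allowed = false;
        }
    }
    pthread_mutex_unlock(&demo_mgmt_mutex);
    return ret;
}

/* Example start routine that always succeeds. */
static int demo_start_ok(void)
{
    return 0;
}

/* Example thread body: every thread tries to start the migration. */
static void *demo_incoming_thread(void *unused)
{
    (void)unused;
    return (void *)(intptr_t)demo_migrate_incoming(demo_start_ok);
}
```

Holding the lock across start() keeps the sketch short. Whether that is acceptable in the real function depends on how long the start path can block, which this example does not model.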

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll Peter Xu
@ 2017-08-25 14:44   ` Marc-André Lureau
  2017-08-26  7:19   ` Fam Zheng
  1 sibling, 0 replies; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-25 14:44 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, Markus Armbruster,
	mdroth, Paolo Bonzini, Dr . David Alan Gilbert

On Wed, Aug 23, 2017 at 8:54 AM Peter Xu <peterx@redhat.com> wrote:

> This is not a problem if we are only having one single loop thread like
> before.  However, after per-monitor thread is introduced, this is not
> true any more, and the risk can happen.
>
> The risk can be triggered with "make check -j8" sometimes:
>
>   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
>   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
>
> This patch keeps the reference for the watch object when creating in
> io_add_watch_poll(), so that the object will never be released in the
> context main loop, especially when the context loop is running in
> another standalone thread.  Meanwhile, when we want to remove the watch
> object, we always first detach the watch object from its owner context,
> then we continue with the cleanup.
>
> Without this patch, calling io_remove_watch_poll() in main loop thread
> is not thread-safe, since the other per-monitor thread may be modifying
> the watch object at the same time.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
>

Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
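
The ownership order described in the quoted commit message can be illustrated with a toy, single-threaded reference-count model. This is not GLib: the DemoSource type and demo_* functions are invented for illustration and only mimic the documented behaviour that attaching a source makes the context hold a reference and destroying it detaches it from the context. The sketch shows the ordering only, not the race itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy model of a reference-counted event source. */
typedef struct DemoSource {
    int refcount;
    bool attached;              /* true while the "context" holds a ref */
    struct DemoSource *child;   /* plays the role of iwp->src */
    int *finalized;             /* set to 1 when the source is freed */
} DemoSource;

static DemoSource *demo_source_new(int *finalized)
{
    DemoSource *s = calloc(1, sizeof(*s));

    assert(s != NULL);
    s->refcount = 1;            /* the creator's reference */
    s->finalized = finalized;
    return s;
}

static void demo_source_unref(DemoSource *s)
{
    assert(s->refcount > 0);
    if (--s->refcount == 0) {
        /* Same invariant as the assertion in io_watch_poll_finalize(). */
        assert(s->child == NULL);
        if (s->finalized) {
            *s->finalized = 1;
        }
        free(s);
    }
}

/* The context takes its own reference when a source is attached. */
static void demo_source_attach(DemoSource *s)
{
    s->refcount++;
    s->attached = true;
}

/* Detach from the context and drop the context's reference. */
static void demo_source_destroy(DemoSource *s)
{
    if (s->attached) {
        s->attached = false;
        demo_source_unref(s);
    }
}

static DemoSource *demo_add_watch(int *parent_gone, int *child_gone)
{
    DemoSource *parent = demo_source_new(parent_gone);

    parent->child = demo_source_new(child_gone);
    demo_source_attach(parent);
    /* No unref here: the caller keeps its own reference, so the loop
     * can never be the one that frees the object. */
    return parent;
}

static void demo_remove_watch(DemoSource *parent)
{
    demo_source_destroy(parent);        /* 1. detach from the loop first */
    if (parent->child) {                /* 2. loop can no longer see it */
        demo_source_destroy(parent->child);
        demo_source_unref(parent->child);
        parent->child = NULL;
    }
    demo_source_unref(parent);          /* 3. drop our reference last */
}
```

In the model the parent is finalized only in step 3, after its child has been released, so the finalize-time assertion cannot fire.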



> ---
>  chardev/char-io.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/chardev/char-io.c b/chardev/char-io.c
> index f810524..5c52c40 100644
> --- a/chardev/char-io.c
> +++ b/chardev/char-io.c
> @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
>      g_free(name);
>
>      g_source_attach(&iwp->parent, context);
> -    g_source_unref(&iwp->parent);
>      return (GSource *)iwp;
>  }
>
> @@ -131,12 +130,24 @@ static void io_remove_watch_poll(GSource *source)
>      IOWatchPoll *iwp;
>
>      iwp = io_watch_poll_from_source(source);
> +
> +    /*
> +     * Here the order of destruction really matters.  We need to first
> +     * detach the IOWatchPoll object from the context (which may still
> +     * be running in another loop thread), only after that could we
> +     * continue to operate on iwp->src, or there may be risk condition
> +     * between current thread and the context loop thread.
> +     *
> +     * Let's blame the glib bug mentioned in commit 2b3167 (again) for
> +     * this extra complexity.
> +     */
> +    g_source_destroy(&iwp->parent);
>      if (iwp->src) {
>          g_source_destroy(iwp->src);
>          g_source_unref(iwp->src);
>          iwp->src = NULL;
>      }
> -    g_source_destroy(&iwp->parent);
> +    g_source_unref(&iwp->parent);
>  }
>
>  void remove_fd_in_watch(Chardev *chr)
> --
> 2.7.4
>
>
> --
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll Peter Xu
  2017-08-23 17:35   ` Dr. David Alan Gilbert
@ 2017-08-25 15:27   ` Marc-André Lureau
  2017-08-25 15:33     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-25 15:27 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, Markus Armbruster,
	mdroth, Paolo Bonzini, Dr . David Alan Gilbert

Hi

On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:

> Firstly, introduce Monitor.use_thread, and set it for monitors that are
> using non-mux typed backend chardev.  We only do this for monitors, so
> mux-typed chardevs are not suitable (when it connects to, e.g., serials
> and the monitor together).
>
> When use_thread is set, we create standalone thread to poll the monitor
> events, isolated from the main loop thread.  Here we still need to take
> the BQL before dispatching the tasks since some of the monitor commands
> are not allowed to execute without the protection of BQL.  Then this
> gives us the chance to avoid taking the BQL for some monitor commands in
> the future.
>
> * Why this change?
>
> We need these per-monitor threads to make sure we can have at least one
> monitor that will never get stuck (that can always receive further
> monitor commands).
>
> * So when will monitors get stuck?  And how?
>
> Once we have postcopy and remote page faults, it is easy to get the
> monitor stuck (which also means the main loop thread is stuck):
>
> (1) Monitor deadlock on BQL
>
> When postcopy is running on the destination VM, a vcpu thread can get
> stuck at almost any time, whenever it tries to access a guest page that
> has not been copied yet.  When that happens, the vcpu thread may be
> holding the BQL.  If the page fault is not handled quickly, monitors
> that are trying to take the BQL stop working.
>
> If the page fault cannot be handled at all (one case is a paused
> postcopy, when the network is temporarily down), monitors will hang
> forever.  Without this patch, that means the main loop hangs too, and
> we will never find a way to talk to the VM again.
>

Could the BQL be pushed down to the monitor command level instead? That
way we wouldn't need a separate thread to solve the hang on commands that
do not need the BQL.

We could also optionally make commands that need the BQL fail if the lock
is still held after a timeout?
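
A minimal sketch of the "fail after a timeout" idea, assuming POSIX pthread_mutex_timedlock() (not available on every platform) and a plain pthread mutex standing in for the BQL. All demo_* names are hypothetical. Note that this only helps if the thread running the command is not itself stuck.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Stand-in for the BQL. */
static pthread_mutex_t demo_big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Try to take the lock for at most timeout_ms milliseconds.
 * Returns 0 on success or a negative errno value (-ETIMEDOUT on timeout). */
static int demo_lock_timeout(pthread_mutex_t *lock, long timeout_ms)
{
    struct timespec deadline;
    int ret;

    assert(timeout_ms >= 0);
    /* pthread_mutex_timedlock() takes an absolute CLOCK_REALTIME deadline. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    ret = pthread_mutex_timedlock(lock, &deadline);
    return ret == 0 ? 0 : -ret;
}

/* Run a command that needs the lock, or report failure instead of
 * blocking forever when the lock cannot be obtained in time. */
static int demo_run_command(int (*fn)(void), long timeout_ms)
{
    int ret = demo_lock_timeout(&demo_big_lock, timeout_ms);

    if (ret != 0) {
        return ret;
    }
    ret = fn();
    pthread_mutex_unlock(&demo_big_lock);
    return ret;
}

/* Example command that does nothing and succeeds. */
static int demo_noop_command(void)
{
    return 0;
}

/* Example monitor thread: tries one command with a 50 ms budget. */
static void *demo_monitor_thread(void *unused)
{
    (void)unused;
    return (void *)(intptr_t)demo_run_command(demo_noop_command, 50);
}
```

If another thread holds demo_big_lock for longer than the budget, demo_run_command() returns -ETIMEDOUT and the caller can report an error to the monitor user rather than hanging.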



> (2) Monitor tries to run code on page-faulted vcpus
>
> The HMP command "info cpus" is a good example: it tries to kick all the
> vcpus and sync status from them.  However, if any vcpu is stuck at an
> unhandled page fault, the sync can never complete, so the HMP hangs.
> Again, this hangs the main loop thread as well.
>
> After either (1) or (2), we can see the deadlock problem:
>
> - On one hand, if the monitor hangs, we cannot do the postcopy recovery,
>   because postcopy recovery needs the user to specify a new listening
>   port on the destination monitor.
>
> - On the other hand, if we cannot recover the paused postcopy, then page
>   faults cannot be serviced, and the monitors may hang forever.
>
> * How does this patch help?
>
> - Firstly, we'll have a separate thread for each dedicated monitor (that
>   is, one whose backend chardev is only used for the monitor), so even
>   if the main loop thread hangs (which is always possible), this monitor
>   thread may still survive.
>
> - Not all monitor commands need the BQL.  We can selectively take the
>   BQL (depending on which command we are running) to avoid waiting on a
>   page-faulted vcpu thread that has taken the BQL (this will be done in
>   follow-up patches).
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  monitor.c           | 75 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  qapi/qmp-dispatch.c | 15 +++++++++++
>  2 files changed, 85 insertions(+), 5 deletions(-)
>
> diff --git a/monitor.c b/monitor.c
> index 7c90df7..3d4ecff 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -36,6 +36,8 @@
>  #include "net/net.h"
>  #include "net/slirp.h"
>  #include "chardev/char-fe.h"
> +#include "chardev/char-mux.h"
> +#include "chardev/char-io.h"
>  #include "ui/qemu-spice.h"
>  #include "sysemu/numa.h"
>  #include "monitor/monitor.h"
> @@ -190,6 +192,8 @@ struct Monitor {
>      int flags;
>      int suspend_cnt;
>      bool skip_flush;
> +    /* Whether the monitor wants to be polled in standalone thread */
> +    bool use_thread;
>
>      QemuMutex out_lock;
>      QString *outbuf;
> @@ -206,6 +210,11 @@ struct Monitor {
>      mon_cmd_t *cmd_table;
>      QLIST_HEAD(,mon_fd_t) fds;
>      QLIST_ENTRY(Monitor) entry;
> +
> +    /* Only used when "use_thread" is used */
> +    QemuThread mon_thread;
> +    GMainContext *mon_context;
> +    GMainLoop *mon_loop;
>  };
>
>  /* QMP checker flags */
> @@ -568,7 +577,7 @@ static void monitor_qapi_event_init(void)
>
>  static void handle_hmp_command(Monitor *mon, const char *cmdline);
>
> -static void monitor_data_init(Monitor *mon, bool skip_flush)
> +static void monitor_data_init(Monitor *mon, bool skip_flush, bool use_thread)
>  {
>      memset(mon, 0, sizeof(Monitor));
>      qemu_mutex_init(&mon->out_lock);
> @@ -576,10 +585,34 @@ static void monitor_data_init(Monitor *mon, bool skip_flush)
>      /* Use *mon_cmds by default. */
>      mon->cmd_table = mon_cmds;
>      mon->skip_flush = skip_flush;
> +    mon->use_thread = use_thread;
> +    if (use_thread) {
> +        /*
> +         * For monitors that use isolated threads, they'll need their
> +         * own GMainContext and GMainLoop.  Otherwise, these pointers
> +         * will be NULL, which means the default context will be used.
> +         */
> +        mon->mon_context = g_main_context_new();
> +        mon->mon_loop = g_main_loop_new(mon->mon_context, TRUE);
> +    }
>  }
>
>  static void monitor_data_destroy(Monitor *mon)
>  {
> +    /* Destroy the thread first if there is */
> +    if (mon->use_thread) {
> +        /* Notify the per-monitor thread to quit. */
> +        g_main_loop_quit(mon->mon_loop);
> +        /*
> +         * Make sure the context will get the quit message since it's
> +         * in another thread.  Without this, it may not be able to
> +         * respond to the quit message immediately.
> +         */
> +        g_main_context_wakeup(mon->mon_context);
> +        qemu_thread_join(&mon->mon_thread);
> +        g_main_loop_unref(mon->mon_loop);
> +        g_main_context_unref(mon->mon_context);
> +    }
>      qemu_chr_fe_deinit(&mon->chr, false);
>      if (monitor_is_qmp(mon)) {
>          json_message_parser_destroy(&mon->qmp.parser);
> @@ -595,7 +628,7 @@ char *qmp_human_monitor_command(const char *command_line, bool has_cpu_index,
>      char *output = NULL;
>      Monitor *old_mon, hmp;
>
> -    monitor_data_init(&hmp, true);
> +    monitor_data_init(&hmp, true, false);
>
>      old_mon = cur_mon;
>      cur_mon = &hmp;
> @@ -3101,6 +3134,11 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
>  {
>      QDict *qdict;
>      const mon_cmd_t *cmd;
> +    /*
> +     * If we haven't take the BQL (when called by per-monitor
> +     * threads), we need to take care of the BQL on our own.
> +     */
> +    bool take_bql = !qemu_mutex_iothread_locked();
>
>      trace_handle_hmp_command(mon, cmdline);
>
> @@ -3116,7 +3154,16 @@ static void handle_hmp_command(Monitor *mon, const char *cmdline)
>          return;
>      }
>
> +    if (take_bql) {
> +        qemu_mutex_lock_iothread();
> +    }
> +
>      cmd->cmd(mon, qdict);
> +
> +    if (take_bql) {
> +        qemu_mutex_unlock_iothread();
> +    }
> +
>      QDECREF(qdict);
>  }
>
> @@ -4086,6 +4133,15 @@ static void __attribute__((constructor)) monitor_lock_init(void)
>      qemu_mutex_init(&monitor_lock);
>  }
>
> +static void *monitor_thread(void *data)
> +{
> +    Monitor *mon = data;
> +
> +    g_main_loop_run(mon->mon_loop);
> +
> +    return NULL;
> +}
> +
>  void monitor_init(Chardev *chr, int flags)
>  {
>      static int is_first_init = 1;
> @@ -4098,7 +4154,9 @@ void monitor_init(Chardev *chr, int flags)
>      }
>
>      mon = g_malloc(sizeof(*mon));
> -    monitor_data_init(mon, false);
> +
> +    /* For non-mux typed monitors, we create dedicated threads. */
> +    monitor_data_init(mon, false, !CHARDEV_IS_MUX(chr));
>
>      qemu_chr_fe_init(&mon->chr, chr, &error_abort);
>      mon->flags = flags;
> @@ -4112,12 +4170,19 @@ void monitor_init(Chardev *chr, int flags)
>
>      if (monitor_is_qmp(mon)) {
>          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read, monitor_qmp_read,
> -                                 monitor_qmp_event, NULL, mon, NULL, true);
> +                                 monitor_qmp_event, NULL, mon,
> +                                 mon->mon_context, true);
>          qemu_chr_fe_set_echo(&mon->chr, true);
>          json_message_parser_init(&mon->qmp.parser, handle_qmp_command);
>      } else {
>          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read, monitor_read,
> -                                 monitor_event, NULL, mon, NULL, true);
> +                                 monitor_event, NULL, mon,
> +                                 mon->mon_context, true);
> +    }
> +
> +    if (mon->use_thread) {
> +        qemu_thread_create(&mon->mon_thread, chr->label, monitor_thread,
> +                           mon, QEMU_THREAD_JOINABLE);
>      }
>
>      qemu_mutex_lock(&monitor_lock);
> diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> index 5ad36f8..3b6b224 100644
> --- a/qapi/qmp-dispatch.c
> +++ b/qapi/qmp-dispatch.c
> @@ -19,6 +19,7 @@
>  #include "qapi/qmp/qjson.h"
>  #include "qapi-types.h"
>  #include "qapi/qmp/qerror.h"
> +#include "qemu/main-loop.h"
>
>  static QDict *qmp_dispatch_check_obj(const QObject *request, Error **errp)
>  {
> @@ -75,6 +76,11 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
>      QDict *args, *dict;
>      QmpCommand *cmd;
>      QObject *ret = NULL;
> +    /*
> +     * If we haven't take the BQL (when called by per-monitor
> +     * threads), we need to take care of the BQL on our own.
> +     */
> +    bool take_bql = !qemu_mutex_iothread_locked();
>
>      dict = qmp_dispatch_check_obj(request, errp);
>      if (!dict) {
> @@ -101,7 +107,16 @@ static QObject *do_qmp_dispatch(QmpCommandList *cmds, QObject *request,
>          QINCREF(args);
>      }
>
> +    if (take_bql) {
> +        qemu_mutex_lock_iothread();
> +    }
> +
>      cmd->fn(args, &ret, &local_err);
> +
> +    if (take_bql) {
> +        qemu_mutex_unlock_iothread();
> +    }
> +
>      if (local_err) {
>          error_propagate(errp, local_err);
>      } else if (cmd->options & QCO_NO_SUCCESS_RESP) {
> --
> 2.7.4
>
>
> --
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 15:27   ` Marc-André Lureau
@ 2017-08-25 15:33     ` Dr. David Alan Gilbert
  2017-08-25 16:07       ` Marc-André Lureau
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25 15:33 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> 
> > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > using non-mux typed backend chardev.  We only do this for monitors, so
> > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > and the monitor together).
> >
> > When use_thread is set, we create standalone thread to poll the monitor
> > events, isolated from the main loop thread.  Here we still need to take
> > the BQL before dispatching the tasks since some of the monitor commands
> > are not allowed to execute without the protection of BQL.  Then this
> > gives us the chance to avoid taking the BQL for some monitor commands in
> > the future.
> >
> > * Why this change?
> >
> > We need these per-monitor threads to make sure we can have at least one
> > monitor that will never stuck (that can receive further monitor
> > commands).
> >
> > * So when will monitors stuck?  And, how do they stuck?
> >
> > After we have postcopy and remote page faults, it's simple to achieve a
> > stuck in the monitor (which is also a stuck in main loop thread):
> >
> > (1) Monitor deadlock on BQL
> >
> > As we may know, when postcopy is running on destination VM, the vcpu
> > threads can stuck merely any time as long as it tries to access an
> > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > that the vcpu thread is holding the BQL.  If the page fault is not
> > handled quickly, you'll find that monitors stop working, which is trying
> > to take the BQL.
> >
> > If the page fault cannot be handled correctly (one case is a paused
> > postcopy, when network is temporarily down), monitors will hang
> > forever.  Without current patch, that means the main loop hanged.  We'll
> > never find a way to talk to VM again.
> >
> 
> Could the BQL be pushed down to the monitor commands level instead? That
> > way we wouldn't need a separate thread to solve the hang on commands that
> do not need BQL.

If the main thread is stuck though I don't see how that helps you; you
have to be able to run these commands on another thread.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 15:33     ` Dr. David Alan Gilbert
@ 2017-08-25 16:07       ` Marc-André Lureau
  2017-08-25 16:12         ` Dr. David Alan Gilbert
                           ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-25 16:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
wrote:

> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > Hi
> >
> > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > > using non-mux typed backend chardev.  We only do this for monitors, so
> > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > > and the monitor together).
> > >
> > > When use_thread is set, we create standalone thread to poll the monitor
> > > events, isolated from the main loop thread.  Here we still need to take
> > > the BQL before dispatching the tasks since some of the monitor commands
> > > are not allowed to execute without the protection of BQL.  Then this
> > > gives us the chance to avoid taking the BQL for some monitor commands in
> > > the future.
> > >
> > > * Why this change?
> > >
> > > We need these per-monitor threads to make sure we can have at least one
> > > monitor that will never stuck (that can receive further monitor
> > > commands).
> > >
> > > * So when will monitors stuck?  And, how do they stuck?
> > >
> > > After we have postcopy and remote page faults, it's simple to achieve a
> > > stuck in the monitor (which is also a stuck in main loop thread):
> > >
> > > (1) Monitor deadlock on BQL
> > >
> > > As we may know, when postcopy is running on destination VM, the vcpu
> > > threads can stuck merely any time as long as it tries to access an
> > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > > that the vcpu thread is holding the BQL.  If the page fault is not
> > > handled quickly, you'll find that monitors stop working, which is
> trying
> > > to take the BQL.
> > >
> > > If the page fault cannot be handled correctly (one case is a paused
> > > postcopy, when network is temporarily down), monitors will hang
> > > forever.  Without current patch, that means the main loop hanged.
> We'll
> > > never find a way to talk to VM again.
> > >
> >
> > Could the BQL be pushed down to the monitor commands level instead? That
> > way we wouldn't need a separate thread to solve the hang on commands that
> > do not need BQL.
>
> If the main thread is stuck though I don't see how that helps you; you
> have to be able to run these commands on another thread.
>

Why would the main thread be stuck? In (1), if the vcpu thread takes the BQL
and the command doesn't need it, the command would still work. In (2),
"info cpus" shouldn't keep the BQL (my qapi-async series would probably help
here).
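
A minimal sketch of what pushing the lock down to the command level could
look like. This is not code from the series: fake_bql is a plain pthread
mutex standing in for the BQL, and SketchCmd, need_bql, sketch_dispatch() and
the two commands are hypothetical names chosen for illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the BQL; QEMU itself uses qemu_mutex_lock_iothread(). */
static pthread_mutex_t fake_bql = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical command descriptor with a per-command locking flag. */
typedef struct {
    const char *name;
    int (*fn)(void);
    bool need_bql;
} SketchCmd;

static int cmd_touch_guest(void) { return 1; }   /* would access guest state */
static int cmd_recover(void)     { return 2; }   /* would not need the lock */

static const SketchCmd sketch_cmds[] = {
    { "touch-guest", cmd_touch_guest, true  },
    { "recover",     cmd_recover,     false },
};

/* Run the named command; returns its result, or -1 if it is unknown. */
static int sketch_dispatch(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(sketch_cmds) / sizeof(sketch_cmds[0]); i++) {
        const SketchCmd *c = &sketch_cmds[i];
        int ret;

        if (strcmp(c->name, name) != 0) {
            continue;
        }
        /* Only commands flagged need_bql ever wait on the lock. */
        if (c->need_bql) {
            pthread_mutex_lock(&fake_bql);
        }
        ret = c->fn();
        if (c->need_bql) {
            pthread_mutex_unlock(&fake_bql);
        }
        return ret;
    }
    return -1;
}
```

With this shape, sketch_dispatch("recover") returns even while something else
holds fake_bql, whereas sketch_dispatch("touch-guest") waits for it. It only
helps as long as the thread doing the dispatch is not itself blocked
elsewhere.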


> Dave
>
> > We could also optionally make commands that need the BQL fail if the lock
> > is held (after a timeout)?
> >
> >
> >
> > > (2) Monitor tries to run codes page-faulted vcpus
> > >
> > > The HMP command "info cpus" is one of the good example - it tries to
> > > kick all the vcpus and sync status from them.  However, if there is any
> > > vcpu that stuck at an unhandled page fault, it can never achieve the
> > > sync, then the HMP hangs.  Again, it hangs the main loop thread as
> well.
> > >
> > > After either (1) or (2), we can see the deadlock problem:
> > >
> > > - On one hand, if monitor hangs, we cannot do the postcopy recovery,
> > >   because postcopy recovery needs user to specify new listening port on
> > >   destination monitor.
> > >
> > > - On the other hand, if we cannot recover the paused postcopy, then
> page
> > >   faults cannot be serviced, and the monitors will possibly hang
> > >   forever then.
> > >
> > > * How this patch helps?
> > >
> > > - Firstly, we'll have our own thread for each dedicated monitor (or
> say,
> > >   the backend chardev is only used for monitor), so even main loop
> > >   thread hangs (it is always possible), this monitor thread may still
> > >   survive.
> > >
> > > - Not all monitor commands need the BQL.  We can selectively take the
> > >   BQL (depends on which command we are running) to avoid waiting on a
> > >   page-faulted vcpu thread that has taken the BQL (this will be done in
> > >   following up patches).
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  monitor.c           | 75
> > > +++++++++++++++++++++++++++++++++++++++++++++++++----
> > >  qapi/qmp-dispatch.c | 15 +++++++++++
> > >  2 files changed, 85 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/monitor.c b/monitor.c
> > > index 7c90df7..3d4ecff 100644
> > > --- a/monitor.c
> > > +++ b/monitor.c
> > > @@ -36,6 +36,8 @@
> > >  #include "net/net.h"
> > >  #include "net/slirp.h"
> > >  #include "chardev/char-fe.h"
> > > +#include "chardev/char-mux.h"
> > > +#include "chardev/char-io.h"
> > >  #include "ui/qemu-spice.h"
> > >  #include "sysemu/numa.h"
> > >  #include "monitor/monitor.h"
> > > @@ -190,6 +192,8 @@ struct Monitor {
> > >      int flags;
> > >      int suspend_cnt;
> > >      bool skip_flush;
> > > +    /* Whether the monitor wants to be polled in standalone thread */
> > > +    bool use_thread;
> > >
> > >      QemuMutex out_lock;
> > >      QString *outbuf;
> > > @@ -206,6 +210,11 @@ struct Monitor {
> > >      mon_cmd_t *cmd_table;
> > >      QLIST_HEAD(,mon_fd_t) fds;
> > >      QLIST_ENTRY(Monitor) entry;
> > > +
> > > +    /* Only used when "use_thread" is used */
> > > +    QemuThread mon_thread;
> > > +    GMainContext *mon_context;
> > > +    GMainLoop *mon_loop;
> > >  };
> > >
> > >  /* QMP checker flags */
> > > @@ -568,7 +577,7 @@ static void monitor_qapi_event_init(void)
> > >
> > >  static void handle_hmp_command(Monitor *mon, const char *cmdline);
> > >
> > > -static void monitor_data_init(Monitor *mon, bool skip_flush)
> > > +static void monitor_data_init(Monitor *mon, bool skip_flush, bool
> > > use_thread)
> > >  {
> > >      memset(mon, 0, sizeof(Monitor));
> > >      qemu_mutex_init(&mon->out_lock);
> > > @@ -576,10 +585,34 @@ static void monitor_data_init(Monitor *mon, bool
> > > skip_flush)
> > >      /* Use *mon_cmds by default. */
> > >      mon->cmd_table = mon_cmds;
> > >      mon->skip_flush = skip_flush;
> > > +    mon->use_thread = use_thread;
> > > +    if (use_thread) {
> > > +        /*
> > > +         * For monitors that use isolated threads, they'll need their
> > > +         * own GMainContext and GMainLoop.  Otherwise, these pointers
> > > +         * will be NULL, which means the default context will be used.
> > > +         */
> > > +        mon->mon_context = g_main_context_new();
> > > +        mon->mon_loop = g_main_loop_new(mon->mon_context, TRUE);
> > > +    }
> > >  }
> > >
> > >  static void monitor_data_destroy(Monitor *mon)
> > >  {
> > > +    /* Destroy the thread first if there is */
> > > +    if (mon->use_thread) {
> > > +        /* Notify the per-monitor thread to quit. */
> > > +        g_main_loop_quit(mon->mon_loop);
> > > +        /*
> > > +         * Make sure the context will get the quit message since it's
> > > +         * in another thread.  Without this, it may not be able to
> > > +         * respond to the quit message immediately.
> > > +         */
> > > +        g_main_context_wakeup(mon->mon_context);
> > > +        qemu_thread_join(&mon->mon_thread);
> > > +        g_main_loop_unref(mon->mon_loop);
> > > +        g_main_context_unref(mon->mon_context);
> > > +    }
> > >      qemu_chr_fe_deinit(&mon->chr, false);
> > >      if (monitor_is_qmp(mon)) {
> > >          json_message_parser_destroy(&mon->qmp.parser);
> > > @@ -595,7 +628,7 @@ char *qmp_human_monitor_command(const char
> > > *command_line, bool has_cpu_index,
> > >      char *output = NULL;
> > >      Monitor *old_mon, hmp;
> > >
> > > -    monitor_data_init(&hmp, true);
> > > +    monitor_data_init(&hmp, true, false);
> > >
> > >      old_mon = cur_mon;
> > >      cur_mon = &hmp;
> > > @@ -3101,6 +3134,11 @@ static void handle_hmp_command(Monitor *mon,
> const
> > > char *cmdline)
> > >  {
> > >      QDict *qdict;
> > >      const mon_cmd_t *cmd;
> > > +    /*
> > > +     * If we haven't take the BQL (when called by per-monitor
> > > +     * threads), we need to take care of the BQL on our own.
> > > +     */
> > > +    bool take_bql = !qemu_mutex_iothread_locked();
> > >
> > >      trace_handle_hmp_command(mon, cmdline);
> > >
> > > @@ -3116,7 +3154,16 @@ static void handle_hmp_command(Monitor *mon,
> const
> > > char *cmdline)
> > >          return;
> > >      }
> > >
> > > +    if (take_bql) {
> > > +        qemu_mutex_lock_iothread();
> > > +    }
> > > +
> > >      cmd->cmd(mon, qdict);
> > > +
> > > +    if (take_bql) {
> > > +        qemu_mutex_unlock_iothread();
> > > +    }
> > > +
> > >      QDECREF(qdict);
> > >  }
> > >
> > > @@ -4086,6 +4133,15 @@ static void __attribute__((constructor))
> > > monitor_lock_init(void)
> > >      qemu_mutex_init(&monitor_lock);
> > >  }
> > >
> > > +static void *monitor_thread(void *data)
> > > +{
> > > +    Monitor *mon = data;
> > > +
> > > +    g_main_loop_run(mon->mon_loop);
> > > +
> > > +    return NULL;
> > > +}
> > > +
> > >  void monitor_init(Chardev *chr, int flags)
> > >  {
> > >      static int is_first_init = 1;
> > > @@ -4098,7 +4154,9 @@ void monitor_init(Chardev *chr, int flags)
> > >      }
> > >
> > >      mon = g_malloc(sizeof(*mon));
> > > -    monitor_data_init(mon, false);
> > > +
> > > +    /* For non-mux typed monitors, we create dedicated threads. */
> > > +    monitor_data_init(mon, false, !CHARDEV_IS_MUX(chr));
> > >
> > >      qemu_chr_fe_init(&mon->chr, chr, &error_abort);
> > >      mon->flags = flags;
> > > @@ -4112,12 +4170,19 @@ void monitor_init(Chardev *chr, int flags)
> > >
> > >      if (monitor_is_qmp(mon)) {
> > >          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read,
> > > monitor_qmp_read,
> > > -                                 monitor_qmp_event, NULL, mon, NULL,
> > > true);
> > > +                                 monitor_qmp_event, NULL, mon,
> > > +                                 mon->mon_context, true);
> > >          qemu_chr_fe_set_echo(&mon->chr, true);
> > >          json_message_parser_init(&mon->qmp.parser,
> handle_qmp_command);
> > >      } else {
> > >          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read,
> > > monitor_read,
> > > -                                 monitor_event, NULL, mon, NULL,
> true);
> > > +                                 monitor_event, NULL, mon,
> > > +                                 mon->mon_context, true);
> > > +    }
> > > +
> > > +    if (mon->use_thread) {
> > > +        qemu_thread_create(&mon->mon_thread, chr->label,
> monitor_thread,
> > > +                           mon, QEMU_THREAD_JOINABLE);
> > >      }
> > >
> > >      qemu_mutex_lock(&monitor_lock);
> > > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> > > index 5ad36f8..3b6b224 100644
> > > --- a/qapi/qmp-dispatch.c
> > > +++ b/qapi/qmp-dispatch.c
> > > @@ -19,6 +19,7 @@
> > >  #include "qapi/qmp/qjson.h"
> > >  #include "qapi-types.h"
> > >  #include "qapi/qmp/qerror.h"
> > > +#include "qemu/main-loop.h"
> > >
> > >  static QDict *qmp_dispatch_check_obj(const QObject *request, Error
> **errp)
> > >  {
> > > @@ -75,6 +76,11 @@ static QObject *do_qmp_dispatch(QmpCommandList
> *cmds,
> > > QObject *request,
> > >      QDict *args, *dict;
> > >      QmpCommand *cmd;
> > >      QObject *ret = NULL;
> > > +    /*
> > > +     * If we haven't take the BQL (when called by per-monitor
> > > +     * threads), we need to take care of the BQL on our own.
> > > +     */
> > > +    bool take_bql = !qemu_mutex_iothread_locked();
> > >
> > >      dict = qmp_dispatch_check_obj(request, errp);
> > >      if (!dict) {
> > > @@ -101,7 +107,16 @@ static QObject *do_qmp_dispatch(QmpCommandList
> *cmds,
> > > QObject *request,
> > >          QINCREF(args);
> > >      }
> > >
> > > +    if (take_bql) {
> > > +        qemu_mutex_lock_iothread();
> > > +    }
> > > +
> > >      cmd->fn(args, &ret, &local_err);
> > > +
> > > +    if (take_bql) {
> > > +        qemu_mutex_unlock_iothread();
> > > +    }
> > > +
> > >      if (local_err) {
> > >          error_propagate(errp, local_err);
> > >      } else if (cmd->options & QCO_NO_SUCCESS_RESP) {
> > > --
> > > 2.7.4
> > >
> > >
> > > --
> > Marc-André Lureau
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 16:07       ` Marc-André Lureau
@ 2017-08-25 16:12         ` Dr. David Alan Gilbert
  2017-08-25 16:21           ` Marc-André Lureau
  2017-08-28  3:05         ` Peter Xu
  2017-08-28 11:08         ` Markus Armbruster
  2 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25 16:12 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
> 
> > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > > Hi
> > >
> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > > > using non-mux typed backend chardev.  We only do this for monitors, so
> > > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > > > and the monitor together).
> > > >
> > > > When use_thread is set, we create standalone thread to poll the monitor
> > > > events, isolated from the main loop thread.  Here we still need to take
> > > > the BQL before dispatching the tasks since some of the monitor commands
> > > > are not allowed to execute without the protection of BQL.  Then this
> > > > gives us the chance to avoid taking the BQL for some monitor commands
> > in
> > > > the future.
> > > >
> > > > * Why this change?
> > > >
> > > > We need these per-monitor threads to make sure we can have at least one
> > > > monitor that will never stuck (that can receive further monitor
> > > > commands).
> > > >
> > > > * So when will monitors stuck?  And, how do they stuck?
> > > >
> > > > After we have postcopy and remote page faults, it's simple to achieve a
> > > > stuck in the monitor (which is also a stuck in main loop thread):
> > > >
> > > > (1) Monitor deadlock on BQL
> > > >
> > > > As we may know, when postcopy is running on destination VM, the vcpu
> > > > threads can stuck merely any time as long as it tries to access an
> > > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > > > that the vcpu thread is holding the BQL.  If the page fault is not
> > > > handled quickly, you'll find that monitors stop working, which is
> > trying
> > > > to take the BQL.
> > > >
> > > > If the page fault cannot be handled correctly (one case is a paused
> > > > postcopy, when network is temporarily down), monitors will hang
> > > > forever.  Without current patch, that means the main loop hanged.
> > We'll
> > > > never find a way to talk to VM again.
> > > >
> > >
> > > Could the BQL be pushed down to the monitor commands level instead? That
> > > way we wouldn't need a separate thread to solve the hang on commands that
> > > do not need BQL.
> >
> > If the main thread is stuck though I don't see how that helps you; you
> > have to be able to run these commands on another thread.
> >
> 
> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> and the command doesn't need it, it would work.

True, assuming nothing else in the main loop is blocked, which is a big
if: we'd have to make sure no BHs etc. could block on guest memory or the BQL.

> In (2),  info cpus
> shouldn't keep the BQL (my qapi-async series would probably help here)

How does that work?

Dave

> 
> > Dave
> >
> > > We could also optionally make commands that need the BQL fail if the lock
> > > is held (after a timeout)?
> > >
> > >
> > >
> > > > (2) Monitor tries to run codes page-faulted vcpus
> > > >
> > > > The HMP command "info cpus" is one of the good example - it tries to
> > > > kick all the vcpus and sync status from them.  However, if there is any
> > > > vcpu that stuck at an unhandled page fault, it can never achieve the
> > > > sync, then the HMP hangs.  Again, it hangs the main loop thread as
> > well.
> > > >
> > > > After either (1) or (2), we can see the deadlock problem:
> > > >
> > > > - On one hand, if monitor hangs, we cannot do the postcopy recovery,
> > > >   because postcopy recovery needs user to specify new listening port on
> > > >   destination monitor.
> > > >
> > > > - On the other hand, if we cannot recover the paused postcopy, then
> > page
> > > >   faults cannot be serviced, and the monitors will possibly hang
> > > >   forever then.
> > > >
> > > > * How this patch helps?
> > > >
> > > > - Firstly, we'll have our own thread for each dedicated monitor (or
> > say,
> > > >   the backend chardev is only used for monitor), so even main loop
> > > >   thread hangs (it is always possible), this monitor thread may still
> > > >   survive.
> > > >
> > > > - Not all monitor commands need the BQL.  We can selectively take the
> > > >   BQL (depends on which command we are running) to avoid waiting on a
> > > >   page-faulted vcpu thread that has taken the BQL (this will be done in
> > > >   following up patches).
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  monitor.c           | 75
> > > > +++++++++++++++++++++++++++++++++++++++++++++++++----
> > > >  qapi/qmp-dispatch.c | 15 +++++++++++
> > > >  2 files changed, 85 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/monitor.c b/monitor.c
> > > > index 7c90df7..3d4ecff 100644
> > > > --- a/monitor.c
> > > > +++ b/monitor.c
> > > > @@ -36,6 +36,8 @@
> > > >  #include "net/net.h"
> > > >  #include "net/slirp.h"
> > > >  #include "chardev/char-fe.h"
> > > > +#include "chardev/char-mux.h"
> > > > +#include "chardev/char-io.h"
> > > >  #include "ui/qemu-spice.h"
> > > >  #include "sysemu/numa.h"
> > > >  #include "monitor/monitor.h"
> > > > @@ -190,6 +192,8 @@ struct Monitor {
> > > >      int flags;
> > > >      int suspend_cnt;
> > > >      bool skip_flush;
> > > > +    /* Whether the monitor wants to be polled in standalone thread */
> > > > +    bool use_thread;
> > > >
> > > >      QemuMutex out_lock;
> > > >      QString *outbuf;
> > > > @@ -206,6 +210,11 @@ struct Monitor {
> > > >      mon_cmd_t *cmd_table;
> > > >      QLIST_HEAD(,mon_fd_t) fds;
> > > >      QLIST_ENTRY(Monitor) entry;
> > > > +
> > > > +    /* Only used when "use_thread" is used */
> > > > +    QemuThread mon_thread;
> > > > +    GMainContext *mon_context;
> > > > +    GMainLoop *mon_loop;
> > > >  };
> > > >
> > > >  /* QMP checker flags */
> > > > @@ -568,7 +577,7 @@ static void monitor_qapi_event_init(void)
> > > >
> > > >  static void handle_hmp_command(Monitor *mon, const char *cmdline);
> > > >
> > > > -static void monitor_data_init(Monitor *mon, bool skip_flush)
> > > > +static void monitor_data_init(Monitor *mon, bool skip_flush, bool
> > > > use_thread)
> > > >  {
> > > >      memset(mon, 0, sizeof(Monitor));
> > > >      qemu_mutex_init(&mon->out_lock);
> > > > @@ -576,10 +585,34 @@ static void monitor_data_init(Monitor *mon, bool
> > > > skip_flush)
> > > >      /* Use *mon_cmds by default. */
> > > >      mon->cmd_table = mon_cmds;
> > > >      mon->skip_flush = skip_flush;
> > > > +    mon->use_thread = use_thread;
> > > > +    if (use_thread) {
> > > > +        /*
> > > > +         * For monitors that use isolated threads, they'll need their
> > > > +         * own GMainContext and GMainLoop.  Otherwise, these pointers
> > > > +         * will be NULL, which means the default context will be used.
> > > > +         */
> > > > +        mon->mon_context = g_main_context_new();
> > > > +        mon->mon_loop = g_main_loop_new(mon->mon_context, TRUE);
> > > > +    }
> > > >  }
> > > >
> > > >  static void monitor_data_destroy(Monitor *mon)
> > > >  {
> > > > +    /* Destroy the thread first if there is */
> > > > +    if (mon->use_thread) {
> > > > +        /* Notify the per-monitor thread to quit. */
> > > > +        g_main_loop_quit(mon->mon_loop);
> > > > +        /*
> > > > +         * Make sure the context will get the quit message since it's
> > > > +         * in another thread.  Without this, it may not be able to
> > > > +         * respond to the quit message immediately.
> > > > +         */
> > > > +        g_main_context_wakeup(mon->mon_context);
> > > > +        qemu_thread_join(&mon->mon_thread);
> > > > +        g_main_loop_unref(mon->mon_loop);
> > > > +        g_main_context_unref(mon->mon_context);
> > > > +    }
> > > >      qemu_chr_fe_deinit(&mon->chr, false);
> > > >      if (monitor_is_qmp(mon)) {
> > > >          json_message_parser_destroy(&mon->qmp.parser);
> > > > @@ -595,7 +628,7 @@ char *qmp_human_monitor_command(const char
> > > > *command_line, bool has_cpu_index,
> > > >      char *output = NULL;
> > > >      Monitor *old_mon, hmp;
> > > >
> > > > -    monitor_data_init(&hmp, true);
> > > > +    monitor_data_init(&hmp, true, false);
> > > >
> > > >      old_mon = cur_mon;
> > > >      cur_mon = &hmp;
> > > > @@ -3101,6 +3134,11 @@ static void handle_hmp_command(Monitor *mon,
> > const
> > > > char *cmdline)
> > > >  {
> > > >      QDict *qdict;
> > > >      const mon_cmd_t *cmd;
> > > > +    /*
> > > > +     * If we haven't take the BQL (when called by per-monitor
> > > > +     * threads), we need to take care of the BQL on our own.
> > > > +     */
> > > > +    bool take_bql = !qemu_mutex_iothread_locked();
> > > >
> > > >      trace_handle_hmp_command(mon, cmdline);
> > > >
> > > > @@ -3116,7 +3154,16 @@ static void handle_hmp_command(Monitor *mon,
> > const
> > > > char *cmdline)
> > > >          return;
> > > >      }
> > > >
> > > > +    if (take_bql) {
> > > > +        qemu_mutex_lock_iothread();
> > > > +    }
> > > > +
> > > >      cmd->cmd(mon, qdict);
> > > > +
> > > > +    if (take_bql) {
> > > > +        qemu_mutex_unlock_iothread();
> > > > +    }
> > > > +
> > > >      QDECREF(qdict);
> > > >  }
> > > >
> > > > @@ -4086,6 +4133,15 @@ static void __attribute__((constructor))
> > > > monitor_lock_init(void)
> > > >      qemu_mutex_init(&monitor_lock);
> > > >  }
> > > >
> > > > +static void *monitor_thread(void *data)
> > > > +{
> > > > +    Monitor *mon = data;
> > > > +
> > > > +    g_main_loop_run(mon->mon_loop);
> > > > +
> > > > +    return NULL;
> > > > +}
> > > > +
> > > >  void monitor_init(Chardev *chr, int flags)
> > > >  {
> > > >      static int is_first_init = 1;
> > > > @@ -4098,7 +4154,9 @@ void monitor_init(Chardev *chr, int flags)
> > > >      }
> > > >
> > > >      mon = g_malloc(sizeof(*mon));
> > > > -    monitor_data_init(mon, false);
> > > > +
> > > > +    /* For non-mux typed monitors, we create dedicated threads. */
> > > > +    monitor_data_init(mon, false, !CHARDEV_IS_MUX(chr));
> > > >
> > > >      qemu_chr_fe_init(&mon->chr, chr, &error_abort);
> > > >      mon->flags = flags;
> > > > @@ -4112,12 +4170,19 @@ void monitor_init(Chardev *chr, int flags)
> > > >
> > > >      if (monitor_is_qmp(mon)) {
> > > >          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read,
> > > > monitor_qmp_read,
> > > > -                                 monitor_qmp_event, NULL, mon, NULL,
> > > > true);
> > > > +                                 monitor_qmp_event, NULL, mon,
> > > > +                                 mon->mon_context, true);
> > > >          qemu_chr_fe_set_echo(&mon->chr, true);
> > > >          json_message_parser_init(&mon->qmp.parser,
> > handle_qmp_command);
> > > >      } else {
> > > >          qemu_chr_fe_set_handlers(&mon->chr, monitor_can_read,
> > > > monitor_read,
> > > > -                                 monitor_event, NULL, mon, NULL,
> > true);
> > > > +                                 monitor_event, NULL, mon,
> > > > +                                 mon->mon_context, true);
> > > > +    }
> > > > +
> > > > +    if (mon->use_thread) {
> > > > +        qemu_thread_create(&mon->mon_thread, chr->label,
> > monitor_thread,
> > > > +                           mon, QEMU_THREAD_JOINABLE);
> > > >      }
> > > >
> > > >      qemu_mutex_lock(&monitor_lock);
> > > > diff --git a/qapi/qmp-dispatch.c b/qapi/qmp-dispatch.c
> > > > index 5ad36f8..3b6b224 100644
> > > > --- a/qapi/qmp-dispatch.c
> > > > +++ b/qapi/qmp-dispatch.c
> > > > @@ -19,6 +19,7 @@
> > > >  #include "qapi/qmp/qjson.h"
> > > >  #include "qapi-types.h"
> > > >  #include "qapi/qmp/qerror.h"
> > > > +#include "qemu/main-loop.h"
> > > >
> > > >  static QDict *qmp_dispatch_check_obj(const QObject *request, Error
> > **errp)
> > > >  {
> > > > @@ -75,6 +76,11 @@ static QObject *do_qmp_dispatch(QmpCommandList
> > *cmds,
> > > > QObject *request,
> > > >      QDict *args, *dict;
> > > >      QmpCommand *cmd;
> > > >      QObject *ret = NULL;
> > > > +    /*
> > > > +     * If we haven't take the BQL (when called by per-monitor
> > > > +     * threads), we need to take care of the BQL on our own.
> > > > +     */
> > > > +    bool take_bql = !qemu_mutex_iothread_locked();
> > > >
> > > >      dict = qmp_dispatch_check_obj(request, errp);
> > > >      if (!dict) {
> > > > @@ -101,7 +107,16 @@ static QObject *do_qmp_dispatch(QmpCommandList
> > *cmds,
> > > > QObject *request,
> > > >          QINCREF(args);
> > > >      }
> > > >
> > > > +    if (take_bql) {
> > > > +        qemu_mutex_lock_iothread();
> > > > +    }
> > > > +
> > > >      cmd->fn(args, &ret, &local_err);
> > > > +
> > > > +    if (take_bql) {
> > > > +        qemu_mutex_unlock_iothread();
> > > > +    }
> > > > +
> > > >      if (local_err) {
> > > >          error_propagate(errp, local_err);
> > > >      } else if (cmd->options & QCO_NO_SUCCESS_RESP) {
> > > > --
> > > > 2.7.4
> > > >
> > > >
> > > > --
> > > Marc-André Lureau
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 16:12         ` Dr. David Alan Gilbert
@ 2017-08-25 16:21           ` Marc-André Lureau
  2017-08-25 16:29             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-25 16:21 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

Hi

On Fri, Aug 25, 2017 at 6:12 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
wrote:

> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <
> dgilbert@redhat.com>
> > wrote:
> >
> > > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > > > Hi
> > > >
> > > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > > Firstly, introduce Monitor.use_thread, and set it for monitors
> that are
> > > > > using non-mux typed backend chardev.  We only do this for
> monitors, so
> > > > > mux-typed chardevs are not suitable (when it connects to, e.g.,
> serials
> > > > > and the monitor together).
> > > > >
> > > > > When use_thread is set, we create standalone thread to poll the
> monitor
> > > > > events, isolated from the main loop thread.  Here we still need to
> take
> > > > > the BQL before dispatching the tasks since some of the monitor
> commands
> > > > > are not allowed to execute without the protection of BQL.  Then
> this
> > > > > gives us the chance to avoid taking the BQL for some monitor
> commands
> > > in
> > > > > the future.
> > > > >
> > > > > * Why this change?
> > > > >
> > > > > We need these per-monitor threads to make sure we can have at
> least one
> > > > > monitor that will never stuck (that can receive further monitor
> > > > > commands).
> > > > >
> > > > > * So when will monitors stuck?  And, how do they stuck?
> > > > >
> > > > > After we have postcopy and remote page faults, it's simple to
> achieve a
> > > > > stuck in the monitor (which is also a stuck in main loop thread):
> > > > >
> > > > > (1) Monitor deadlock on BQL
> > > > >
> > > > > As we may know, when postcopy is running on destination VM, the
> vcpu
> > > > > threads can stuck merely any time as long as it tries to access an
> > > > > uncopied guest page.  Meanwhile, when the stuck happens, it is
> possible
> > > > > that the vcpu thread is holding the BQL.  If the page fault is not
> > > > > handled quickly, you'll find that monitors stop working, which is
> > > trying
> > > > > to take the BQL.
> > > > >
> > > > > If the page fault cannot be handled correctly (one case is a paused
> > > > > postcopy, when network is temporarily down), monitors will hang
> > > > > forever.  Without current patch, that means the main loop hanged.
> > > We'll
> > > > > never find a way to talk to VM again.
> > > > >
> > > >
> > > > Could the BQL be pushed down to the monitor commands level instead?
> That
> > > > way we wouldn't need a separate thread to solve the hang on commands
> that
> > > > do not need BQL.
> > >
> > > If the main thread is stuck though I don't see how that helps you; you
> > > have to be able to run these commands on another thread.
> > >
> >
> > Why would the main thread be stuck? In (1) If the vcpu thread takes the
> BQL
> > and the command doesn't need it, it would work.
>
> True; assuming nothing else in the main loop is blocked;  which is a big
> if - making sure no bh's etc could block on guest memory or the bql.
>
>
> > In (2), info cpus shouldn't keep the BQL (my qapi-async series would
> > probably help here)
>
> How does that work?
>
>
https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03626.html

With the series, a command can be broken up into receive/start and
finish/reply. This allows re-entering the loop, potentially freeing the BQL,
and processing other events. This allowed me to fix a screendump glitch bug (
http://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03650.html). This
also opens the door to concurrent QMP commands (if the client turns on the
capability option).
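
To make the start/finish split concrete, here is a small self-contained
sketch. It is not the qapi-async API: PendingCmd, sketch_cmd_start(),
sketch_loop_iterate() and sketch_reply() are hypothetical names, and the
"loop" is a plain function that completes whatever was started earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion callback: delivers the reply for one command. */
typedef void (*finish_fn)(void *opaque, int result);

typedef struct {
    bool pending;
    int result;
    finish_fn finish;
    void *opaque;
} PendingCmd;

#define MAX_PENDING 4
static PendingCmd pending_cmds[MAX_PENDING];

/*
 * "receive/start": record the request and return right away, so the
 * caller can go back to its loop instead of blocking on the command.
 * Returns false if no slot is free.
 */
static bool sketch_cmd_start(int result_when_ready, finish_fn finish,
                             void *opaque)
{
    size_t i;

    assert(finish != NULL);
    for (i = 0; i < MAX_PENDING; i++) {
        if (!pending_cmds[i].pending) {
            pending_cmds[i].pending = true;
            pending_cmds[i].result = result_when_ready;
            pending_cmds[i].finish = finish;
            pending_cmds[i].opaque = opaque;
            return true;
        }
    }
    return false;
}

/*
 * One loop iteration, "finish/reply": complete every started command
 * and invoke its callback.  Returns the number of replies delivered.
 */
static int sketch_loop_iterate(void)
{
    int done = 0;
    size_t i;

    for (i = 0; i < MAX_PENDING; i++) {
        if (pending_cmds[i].pending) {
            pending_cmds[i].pending = false;
            pending_cmds[i].finish(pending_cmds[i].opaque,
                                   pending_cmds[i].result);
            done++;
        }
    }
    return done;
}

/* Example reply handler: stores the result where the caller can see it. */
static void sketch_reply(void *opaque, int result)
{
    *(int *)opaque = result;
}
```

Two commands can be started back to back before either reply is produced,
which is the property that would allow several QMP commands to be in flight
at once.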

-- 
Marc-André Lureau


* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 16:21           ` Marc-André Lureau
@ 2017-08-25 16:29             ` Dr. David Alan Gilbert
  2017-08-26  8:33               ` Marc-André Lureau
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-08-25 16:29 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

* Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> Hi
> 
> On Fri, Aug 25, 2017 at 6:12 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
> 
> > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > > On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <
> > dgilbert@redhat.com>
> > > wrote:
> > >
> > > > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > > > > Hi
> > > > >
> > > > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > > Firstly, introduce Monitor.use_thread, and set it for monitors
> > that are
> > > > > > using non-mux typed backend chardev.  We only do this for
> > monitors, so
> > > > > > mux-typed chardevs are not suitable (when it connects to, e.g.,
> > serials
> > > > > > and the monitor together).
> > > > > >
> > > > > > When use_thread is set, we create standalone thread to poll the
> > monitor
> > > > > > events, isolated from the main loop thread.  Here we still need to
> > take
> > > > > > the BQL before dispatching the tasks since some of the monitor
> > commands
> > > > > > are not allowed to execute without the protection of BQL.  Then
> > this
> > > > > > gives us the chance to avoid taking the BQL for some monitor
> > commands
> > > > in
> > > > > > the future.
> > > > > >
> > > > > > * Why this change?
> > > > > >
> > > > > > We need these per-monitor threads to make sure we can have at
> > least one
> > > > > > monitor that will never stuck (that can receive further monitor
> > > > > > commands).
> > > > > >
> > > > > > * So when will monitors stuck?  And, how do they stuck?
> > > > > >
> > > > > > After we have postcopy and remote page faults, it's simple to
> > achieve a
> > > > > > stuck in the monitor (which is also a stuck in main loop thread):
> > > > > >
> > > > > > (1) Monitor deadlock on BQL
> > > > > >
> > > > > > As we may know, when postcopy is running on destination VM, the
> > vcpu
> > > > > > threads can stuck merely any time as long as it tries to access an
> > > > > > uncopied guest page.  Meanwhile, when the stuck happens, it is
> > possible
> > > > > > that the vcpu thread is holding the BQL.  If the page fault is not
> > > > > > handled quickly, you'll find that monitors stop working, which is
> > > > trying
> > > > > > to take the BQL.
> > > > > >
> > > > > > If the page fault cannot be handled correctly (one case is a paused
> > > > > > postcopy, when network is temporarily down), monitors will hang
> > > > > > forever.  Without current patch, that means the main loop hanged.
> > > > We'll
> > > > > > never find a way to talk to VM again.
> > > > > >
> > > > >
> > > > > Could the BQL be pushed down to the monitor commands level instead?
> > That
> > > > > way we wouldn't need a seperate thread to solve the hang on commands
> > that
> > > > > do not need BQL.
> > > >
> > > > If the main thread is stuck though I don't see how that helps you; you
> > > > have to be able to run these commands on another thread.
> > > >
> > >
> > > Why would the main thread be stuck? In (1) If the vcpu thread takes the
> > BQL
> > > and the command doesn't need it, it would work.
> >
> > True; assuming nothing else in the main loop is blocked;  which is a big
> > if - making sure no bh's etc could block on guest memory or the bql.
> >
> >
> > In (2),  info cpus
> > > shouldn't keep the BQL (my qapi-async series would probably help here)
> >
> > How does that work?
> >
> >
> https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03626.html
> 
> With the series, a command can be broken up in receive/start &
> finish/reply. This allows to reenter the loop, potentially freeing the BQL,
> and process other events. This allowed me to fix a screendump glitch bug (
> http://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03650.html). This
> also open the door to concurrent QMP commands (if the client turns on the
> capability option).

Interesting.
I can see how that would work well for commands that know they're long
lived and do the work to split themselves into
receive/start/finish/reply.  However, I'm worried that it's fragile:
if something accesses guest memory without realising it, or some code
forgets that it's taking the lock, then we've got a command that can
occasionally block.  That's going to be a lot of analysis and design
on each command, and if we were to do it widely then we'd certainly
miss some cases.  Having the monitors in separate threads means you
only have to worry about the commands you want to be lock-free.

Dave

> -- 
> Marc-André Lureau
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll Peter Xu
  2017-08-25 14:44   ` Marc-André Lureau
@ 2017-08-26  7:19   ` Fam Zheng
  2017-08-28  5:56     ` Peter Xu
  1 sibling, 1 reply; 104+ messages in thread
From: Fam Zheng @ 2017-08-26  7:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

s/risk/race/ for $subject and the whole patch?

Fam

On Wed, 08/23 14:51, Peter Xu wrote:
> This is not a problem if we are only having one single loop thread like
> before.  However, after per-monitor thread is introduced, this is not
> true any more, and the risk can happen.
> 
> The risk can be triggered with "make check -j8" sometimes:
> 
>   qemu-system-x86_64: /root/git/qemu/chardev/char-io.c:91:
>   io_watch_poll_finalize: Assertion `iwp->src == NULL' failed.
> 
> This patch keeps the reference for the watch object when creating in
> io_add_watch_poll(), so that the object will never be released in the
> context main loop, especially when the context loop is running in
> another standalone thread.  Meanwhile, when we want to remove the watch
> object, we always first detach the watch object from its owner context,
> then we continue with the cleanup.
> 
> Without this patch, calling io_remove_watch_poll() in main loop thread
> is not thread-safe, since the other per-monitor thread may be modifying
> the watch object at the same time.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  chardev/char-io.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/chardev/char-io.c b/chardev/char-io.c
> index f810524..5c52c40 100644
> --- a/chardev/char-io.c
> +++ b/chardev/char-io.c
> @@ -122,7 +122,6 @@ GSource *io_add_watch_poll(Chardev *chr,
>      g_free(name);
>  
>      g_source_attach(&iwp->parent, context);
> -    g_source_unref(&iwp->parent);
>      return (GSource *)iwp;
>  }
>  
> @@ -131,12 +130,24 @@ static void io_remove_watch_poll(GSource *source)
>      IOWatchPoll *iwp;
>  
>      iwp = io_watch_poll_from_source(source);
> +
> +    /*
> +     * Here the order of destruction really matters.  We need to first
> +     * detach the IOWatchPoll object from the context (which may still
> +     * be running in another loop thread), only after that could we
> +     * continue to operate on iwp->src, or there may be risk condition
> +     * between current thread and the context loop thread.
> +     *
> +     * Let's blame the glib bug mentioned in commit 2b3167 (again) for
> +     * this extra complexity.
> +     */
> +    g_source_destroy(&iwp->parent);
>      if (iwp->src) {
>          g_source_destroy(iwp->src);
>          g_source_unref(iwp->src);
>          iwp->src = NULL;
>      }
> -    g_source_destroy(&iwp->parent);
> +    g_source_unref(&iwp->parent);
>  }
>  
>  void remove_fd_in_watch(Chardev *chr)
> -- 
> 2.7.4
> 

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 16:29             ` Dr. David Alan Gilbert
@ 2017-08-26  8:33               ` Marc-André Lureau
  0 siblings, 0 replies; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-26  8:33 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

Hi

On Fri, Aug 25, 2017 at 6:29 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
wrote:

> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > > In (2),  info cpus
> > > > shouldn't keep the BQL (my qapi-async series would probably help
> here)
> > >
> > > How does that work?
> > >
> > >
> > https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03626.html
> >
> > With the series, a command can be broken up in receive/start &
> > finish/reply. This allows to reenter the loop, potentially freeing the
> BQL,
> > and process other events. This allowed me to fix a screendump glitch bug
> (
> > http://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03650.html).
> This
> > also open the door to concurrent QMP commands (if the client turns on the
> > capability option).
>
> Interesting.
> I can see how that would work well for commands that know they're long
> lived and do that work to split themselves into
> receive/start/finish/reply.  However I'm worried that it means that
> it's fragile in that if something accesses guest memory when they didn't
> realise they were doing, or code that forgot it's taking the lock, then
> we've got a command that can occasionally block.  That's going to be a
> lot of analysis and design on each command and if we were to do it
> widely then we'd certainly miss some cases.  Having the monitors in
> separate threads means you only have to worry about the commands
> you want to be lock-free.
>

Well, the concurrency problems are essentially similar in both cases, but I
would argue that avoiding parallelism is easier to deal with. My approach
is also very conservative: only commands that are "async-free" are broken
up, so you mostly have to worry about those regarding concurrency. But with
a separate thread, you have additional concerns, since you may run while
the BQL is taken somewhere else.

-- 
Marc-André Lureau

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 16:07       ` Marc-André Lureau
  2017-08-25 16:12         ` Dr. David Alan Gilbert
@ 2017-08-28  3:05         ` Peter Xu
  2017-08-28 10:11           ` Marc-André Lureau
  2017-08-28 11:08         ` Markus Armbruster
  2 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-28  3:05 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Dr. David Alan Gilbert, qemu-devel, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, mdroth, Paolo Bonzini

On Fri, Aug 25, 2017 at 04:07:34PM +0000, Marc-André Lureau wrote:
> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
> 
> > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > > Hi
> > >
> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > > > using non-mux typed backend chardev.  We only do this for monitors, so
> > > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > > > and the monitor together).
> > > >
> > > > When use_thread is set, we create standalone thread to poll the monitor
> > > > events, isolated from the main loop thread.  Here we still need to take
> > > > the BQL before dispatching the tasks since some of the monitor commands
> > > > are not allowed to execute without the protection of BQL.  Then this
> > > > gives us the chance to avoid taking the BQL for some monitor commands
> > in
> > > > the future.
> > > >
> > > > * Why this change?
> > > >
> > > > We need these per-monitor threads to make sure we can have at least one
> > > > monitor that will never stuck (that can receive further monitor
> > > > commands).
> > > >
> > > > * So when will monitors stuck?  And, how do they stuck?
> > > >
> > > > After we have postcopy and remote page faults, it's simple to achieve a
> > > > stuck in the monitor (which is also a stuck in main loop thread):
> > > >
> > > > (1) Monitor deadlock on BQL
> > > >
> > > > As we may know, when postcopy is running on destination VM, the vcpu
> > > > threads can stuck merely any time as long as it tries to access an
> > > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > > > that the vcpu thread is holding the BQL.  If the page fault is not
> > > > handled quickly, you'll find that monitors stop working, which is
> > trying
> > > > to take the BQL.
> > > >
> > > > If the page fault cannot be handled correctly (one case is a paused
> > > > postcopy, when network is temporarily down), monitors will hang
> > > > forever.  Without current patch, that means the main loop hanged.
> > We'll
> > > > never find a way to talk to VM again.
> > > >
> > >
> > > Could the BQL be pushed down to the monitor commands level instead? That
> > > way we wouldn't need a separate thread to solve the hang on commands that
> > > do not need BQL.
> >
> > If the main thread is stuck though I don't see how that helps you; you
> > have to be able to run these commands on another thread.
> >
> 
> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> and the command doesn't need it, it would work.  In (2),  info cpus
> shouldn't keep the BQL (my qapi-async series would probably help here)

(Thanks for joining the discussion)

AFAIK the main thread can get stuck for many reasons.  I have seen one
stack trace where the VGA code (IIUC) was trying to write to guest
graphics memory in the main loop thread, but that guest page happened
to be not yet copied from the source.  As long as the main thread is
stuck for any reason, there is no chance to run monitor commands, even
if the commands support async operations.

So IMHO the only solution is doing these things in separate threads,
rather than all in a single one.
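
To illustrate the point, here is a minimal sketch in plain pthreads.  It
is not QEMU code: "bql", "Command", "CMD_WOULD_BLOCK" and
run_while_bql_is_held() are names made up for this example.  A stand-in
lock is held (as a stuck vcpu thread might hold the BQL) while a
separate monitor thread dispatches one command.  A command that does not
need the lock still completes; one that needs it cannot make progress.
The sketch uses trylock only so that it terminates; real code would
block there.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the BQL; purely illustrative. */
static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;

#define CMD_WOULD_BLOCK (-1)

/* Hypothetical command descriptor, not QEMU's real structure. */
typedef struct Command {
    bool need_bql;
    int (*handler)(void);
} Command;

int query_status(void)
{
    return 42;
}

/* Body of the per-monitor thread: dispatch exactly one command. */
static void *monitor_thread_fn(void *opaque)
{
    const Command *cmd = opaque;
    int ret;

    if (cmd->need_bql) {
        /* Real code would block in lock(); trylock keeps the demo finite. */
        if (pthread_mutex_trylock(&bql) != 0) {
            return (void *)(intptr_t)CMD_WOULD_BLOCK;
        }
        ret = cmd->handler();
        pthread_mutex_unlock(&bql);
    } else {
        ret = cmd->handler();
    }
    return (void *)(intptr_t)ret;
}

/* Hold the lock in the caller, then let a monitor thread try the command. */
int run_while_bql_is_held(const Command *cmd)
{
    pthread_t mon;
    void *ret = NULL;
    int err;

    pthread_mutex_lock(&bql);
    err = pthread_create(&mon, NULL, monitor_thread_fn, (void *)cmd);
    assert(err == 0);
    err = pthread_join(mon, &ret);
    assert(err == 0);
    pthread_mutex_unlock(&bql);
    return (int)(intptr_t)ret;
}
```

With the lock held, a Command with need_bql == false returns the
handler's value, while one with need_bql == true reports
CMD_WOULD_BLOCK.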

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25  9:30       ` Dr. David Alan Gilbert
@ 2017-08-28  5:53         ` Peter Xu
  2017-09-08 17:29           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-28  5:53 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Fri, Aug 25, 2017 at 10:30:42AM +0100, Dr. David Alan Gilbert wrote:

[...]

> > >   c) As mentioned on irc there's fun to be had with cur_mon and error
> > >      handling - in my local world I have cur_mon declared as __thread
> > >      but never got around to thinking about what should set it up.
> > >      There's also 'wavcapture: Convert to error_report' that I posted
> > >      in March that got rid of some uses of cur_mon in wavcapture.c
> > >      for error_report.
> > 
> > Yeh.  I at least also see a positive ACK from Markus in the other
> > thread for per-thread cur_mon, sounds like this is the right way to
> > go.
> > 
> > To setup cur_mon, what I can think of is create wrapper for
> > pthread_create() in qemu_thread_create().  I see that we have done
> > similar thing in util/qemu-thread-win32.c for Windows.  With that we
> > can setup the cur_mon before going into real thread function but in
> > the right context, though we may need one more parameter for current
> > qemu_thread_create():
> > 
> > void qemu_thread_create(QemuThread *thread, const char *name,
> >                        void *(*start_routine)(void*),
> >                        void *arg, int mode, Monitor *mon);
> > 
> > Then we can specify monitor for any new thread (default to cur_mon).
> > For per-monitor threads, I think we need to pass in that specific mon.
> > 
> > Is this doable?
> 
> That would mean changing all the qemu_thread_create calls, but yes
> I guess is doable.  I'd thought the other way, perhaps you inherit
> Monitor except in the case of when the monitor creates threads.

Do you mean setting up cur_mon in the monitor threads?

I'm afraid that may not be enough: once we mark cur_mon as a __thread
variable, it will be NULL in each newly created thread, so we need to
initialize it for every thread.  Or did I miss anything?
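
A small standalone sketch of that concern, using plain pthreads rather
than qemu_thread_create().  Everything here is hypothetical: the
Monitor struct is a stand-in, and thread_create_with_monitor(),
child_view_plain() and child_view_wrapped() exist only for this
example.  It shows that a __thread variable set in the parent is not
inherited by a new thread, and how a trampoline could set it before the
real start routine runs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMU's Monitor. */
typedef struct Monitor {
    const char *name;
} Monitor;

/* One copy per thread; every new thread starts with NULL. */
static __thread Monitor *cur_mon;

typedef struct ThreadStart {
    void *(*start_routine)(void *);
    void *arg;
    Monitor *mon;
} ThreadStart;

/* Runs in the new thread: set cur_mon, then call the real routine. */
static void *thread_trampoline(void *opaque)
{
    ThreadStart *ts = opaque;
    void *(*fn)(void *) = ts->start_routine;
    void *arg = ts->arg;

    cur_mon = ts->mon;
    free(ts);
    return fn(arg);
}

/* Assumed wrapper shape; the extra "mon" parameter is the idea above. */
int thread_create_with_monitor(pthread_t *thread,
                               void *(*start_routine)(void *),
                               void *arg, Monitor *mon)
{
    ThreadStart *ts = malloc(sizeof(*ts));
    int ret;

    if (!ts) {
        return -1;
    }
    ts->start_routine = start_routine;
    ts->arg = arg;
    ts->mon = mon;
    ret = pthread_create(thread, NULL, thread_trampoline, ts);
    if (ret != 0) {
        free(ts);
    }
    return ret;
}

/* Thread body that just reports which monitor it sees. */
static void *report_cur_mon(void *arg)
{
    (void)arg;
    return cur_mon;
}

/* Parent sets its own cur_mon; the child is created without the wrapper. */
Monitor *child_view_plain(Monitor *parent_mon)
{
    pthread_t t;
    void *seen = NULL;
    int err;

    cur_mon = parent_mon;   /* only changes the calling thread's copy */
    err = pthread_create(&t, NULL, report_cur_mon, NULL);
    assert(err == 0);
    err = pthread_join(t, &seen);
    assert(err == 0);
    return seen;
}

/* Same, but the child is created through the wrapper. */
Monitor *child_view_wrapped(Monitor *mon)
{
    pthread_t t;
    void *seen = NULL;
    int err;

    err = thread_create_with_monitor(&t, report_cur_mon, NULL, mon);
    assert(err == 0);
    err = pthread_join(t, &seen);
    assert(err == 0);
    return seen;
}
```

In this sketch the plain child sees NULL even though the parent set
cur_mon, while the wrapped child sees the monitor that was passed in.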

[...]

> > > 
> > >   d) I wonder if it's better to have thread as a flag, so that you have
> > >      to explicitly ask for a monitor to have its own thread.
> > 
> > This should be doable.  Would a new parameter for "-qmp" and "-hmp"
> > suffice?
> 
> Yes.

(I meant "-monitor" when saying "-hmp")

Hmm, it does not seem easy to simply add a new parameter for it, since we
already use "," to parse the chardev params in the monitor code, as in:

  -qmp telnet::8888,server,nowait

So I cannot simply do:

  -qmp telnet::8888,server,nowait,threaded=on

Otherwise it will be treated as a parameter for chardev type "telnet".

I can at least add something similar to QEMU_OPTION_qmp_pretty, like
QEMU_OPTION_qmp_threaded, and maybe I also need
QEMU_OPTION_monitor_pretty.  But I wonder whether there is anything
better.  Any suggestions?

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll
  2017-08-26  7:19   ` Fam Zheng
@ 2017-08-28  5:56     ` Peter Xu
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-28  5:56 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

On Sat, Aug 26, 2017 at 03:19:39PM +0800, Fam Zheng wrote:
> s/risk/race/ for $subject and the whole patch?

I think... Yes. :-)  Thanks.

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-25  9:14         ` Dr. David Alan Gilbert
@ 2017-08-28  8:08           ` Peter Xu
  2017-09-08 17:38             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-28  8:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Fam Zheng, qemu-devel, Paolo Bonzini, Daniel P . Berrange,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Fri, Aug 25, 2017 at 10:14:12AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Thu, Aug 24, 2017 at 07:37:32AM +0800, Fam Zheng wrote:
> > > On Wed, 08/23 18:44, Dr. David Alan Gilbert wrote:
> > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > Introducing this new parameter for QMP commands in general to mark out
> > > > > when the command does not need BQL.  Normally QMP command executions are
> > > > > done with the protection of BQL in QEMU.  However the truth is that not
> > > > > all the QMP commands require the BQL.
> > > > > 
> > > > > This new parameter provides a way to allow QMP commands to run in
> > > > > parallel when possible, without the contention on the BQL.
> > > > > 
> > > > > Since the default value of "without-bql" is still false, so now all QMP
> > > > > commands are still protected by BQL still.
> > > > > 
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > 
> > > > We should define what a 'without-bql' command is allowed to do:
> > > >    'Commands that have without-bql set _may_ be called without the bql
> > > >    being taken.  They must not take the bql or any other lock that may
> > > >    become dependent on the bql.'
> > 
> > Sure.
> > 
> > > >    (Do we need to say anything about RCU?)
> > 
> > Could I ask how is RCU related?
> 
> My definition above said that anything declared without bql couldn't
> take the bql, so couldn't block on any other thread holding the bql.
> But is our command allowed to use synchronize_rcu or rcu_read_lock
> that could wait for or block other threads doing rcu stuff?
> Because if it did is there any guarantee that it wouldn't block?

I see.  Shall we just ignore RCU for now?  Currently I don't see any
real synchronize_rcu() user in QEMU except the RCU thread.  And
rcu_read_lock() itself should not block, so IMHO calling only that in
monitor command handlers should always be fine?

> 
> 
> > 
> > > > 
> > > > Also, 'no-bql' is shorter :-)
> > > 
> > > Or rather "need-bql" that defaults to true to avoid double negative (TM) with
> > > "no-bql = false"?
> > 
> > Ok let me use "need-bql". :)
> 
> Fine by me.

I'm switching to "need-bql" for QMP only, and using "no-bql" in HMP,
since I failed to find a good way to initialize a mon_cmd_t field to
true by default.
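
The underlying C behaviour, as a minimal sketch: members omitted from a
static table initializer are zero-initialized, so a bool flag can only
default to false without touching every entry.  The HmpCommand struct,
its fields and hmp_cmd_needs_bql() below are hypothetical stand-ins (not
the real mon_cmd_t), and the table entries are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command descriptor, not QEMU's mon_cmd_t. */
typedef struct HmpCommand {
    const char *name;
    const char *help;
    bool no_bql;    /* negative sense, so "omitted" means "takes the BQL" */
} HmpCommand;

static const HmpCommand hmp_cmds[] = {
    /* no_bql is not mentioned here, so C initializes it to false. */
    { .name = "cont", .help = "resume emulation" },
    /* Only the rare lock-free command has to say so explicitly. */
    { .name = "migrate_incoming", .help = "continue an incoming migration",
      .no_bql = true },
};

/* Returns whether the named command must run with the lock held. */
bool hmp_cmd_needs_bql(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(hmp_cmds) / sizeof(hmp_cmds[0]); i++) {
        if (strcmp(hmp_cmds[i].name, name) == 0) {
            return !hmp_cmds[i].no_bql;
        }
    }
    return true;    /* unknown command: stay conservative */
}
```

A positive-sense "need_bql" field would need ".need_bql = true" spelled
out in every existing entry, which is why the negative-sense flag is the
smaller change for a hand-written table.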

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-25  9:06       ` Dr. David Alan Gilbert
@ 2017-08-28  8:26         ` Peter Xu
  2017-09-08 17:52           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-28  8:26 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Fri, Aug 25, 2017 at 10:06:27AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Aug 23, 2017 at 06:44:12PM +0100, Dr. David Alan Gilbert wrote:
> > 
> > [...]
> > 
> > > > +Most of the commands require the Big QEMU Lock (BQL) be held during
> > > > +execution.  However, there is a small subset of the commands that may
> > > > +not really need BQL at all.  To mark out this kind of commands, we can
> > > > +specify "without-bql" to "true".  This parameter is only a hint for
> > > > +internal QMP implementation to provide possibility to allow commands
> > > > +be run in parallel, or reduce the contention of the lock.  Users of QMP
> > > > +should not really be aware of such information.
> > > 
> > > Well, I think users of these commands might select them specifically
> > > because they know that they won't block.  Those who care about latency might
> > > look to use commands that don't take the lock because of a reduced
> > > effect on the performance as well.
> > 
> > What would be the best way to tell user?  I think again this should
> > mostly for HMP only, right?
> 
> It needs to be documented for QMP users as well so that those developing
> management code know what's safe.

I see.  What's the corresponding QMP documentation I should touch up?

> 
> > Maybe we can add a new command to list these lock-free commands.  Or,
> > I can dump something in "help" and "help info" like:
> > 
> > (qemu) help migrate_incoming
> > migrate_incoming uri -- Continue an incoming migration from an -incoming defer (BQL-less)
> 
> 'lock free' might be better?

I'm ok with it.  But would the word "lock" be too general?

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock
  2017-08-25  9:34       ` Dr. David Alan Gilbert
@ 2017-08-28  8:39         ` Peter Xu
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-08-28  8:39 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Fri, Aug 25, 2017 at 10:34:56AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Aug 23, 2017 at 07:01:35PM +0100, Dr. David Alan Gilbert wrote:
> > > * Peter Xu (peterx@redhat.com) wrote:
> > > > Now at least migrate_incoming can be run in parallel.  Let's provide a
> > > > migration lock to protect it.
> > > > 
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  migration/migration.c | 6 ++++++
> > > >  migration/migration.h | 3 +++
> > > >  2 files changed, 9 insertions(+)
> > > > 
> > > > diff --git a/migration/migration.c b/migration/migration.c
> > > > index c3fe0ed..32058f7 100644
> > > > --- a/migration/migration.c
> > > > +++ b/migration/migration.c
> > > > @@ -145,6 +145,7 @@ MigrationIncomingState *migration_incoming_get_current(void)
> > > >          mis_current.state = MIGRATION_STATUS_NONE;
> > > >          memset(&mis_current, 0, sizeof(MigrationIncomingState));
> > > >          qemu_mutex_init(&mis_current.rp_mutex);
> > > > +        qemu_mutex_init(&mis_current.mgmt_mutex);
> > > >          qemu_event_init(&mis_current.main_thread_load_event, false);
> > > >          once = true;
> > > >      }
> > > > @@ -1171,6 +1172,7 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
> > > >  {
> > > >      Error *local_err = NULL;
> > > >      static bool once = true;
> > > > +    MigrationIncomingState *mis = migration_incoming_get_current();
> > > 
> > > migration_incoming_get_current isn't actually thread-safe itself unless
> > > you can guarantee the initial allocation has happened - otherwise both
> > > threads can race and do the 'once' code at the same time.
> > 
> > How about I init the incoming object as well in
> > migration_object_init()?
> 
> Yes I think that might work.

This change would suit the postcopy recovery series better.  Will
add one more patch for it.

> 
> > > 
> > > Similarly, these locks - they don't protect our 'once' - so a second
> > > thread could come in here and both get past the !once check.
> > 
> > Oh I missed this one since actually I am removing that "once" variable
> > in postcopy recovery series. :)
> > 
> > I can put the last two patches into postcopy recovery series, then
> > it'll be fine.
> 
> OK; these things just emphasise how hard it is to make a function really
> lock free.

Agreed.
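
For reference, a generic POSIX sketch of race-free one-time
initialization.  This is not what was agreed above (initializing the
incoming state early in migration_object_init()); it only illustrates
why an unprotected "once" flag is not enough when two threads can enter
at the same time.  IncomingState, incoming_get_current() and
init_calls_after_race() are hypothetical names for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-in for MigrationIncomingState. */
typedef struct IncomingState {
    pthread_mutex_t mgmt_mutex;
    int state;
} IncomingState;

static IncomingState incoming;
static pthread_once_t incoming_once = PTHREAD_ONCE_INIT;
static int incoming_init_calls;     /* counts how often init really ran */

static void incoming_init(void)
{
    memset(&incoming, 0, sizeof(incoming));
    pthread_mutex_init(&incoming.mgmt_mutex, NULL);
    incoming_init_calls++;
}

/* Safe to call from any thread; init runs exactly once. */
IncomingState *incoming_get_current(void)
{
    pthread_once(&incoming_once, incoming_init);
    return &incoming;
}

static void *getter_thread(void *arg)
{
    (void)arg;
    return incoming_get_current();
}

/* Let several threads race into the getter, then report the init count. */
int init_calls_after_race(int nthreads)
{
    pthread_t threads[16];
    int i, started = 0, err;

    if (nthreads > 16) {
        nthreads = 16;
    }
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&threads[i], NULL, getter_thread, NULL) != 0) {
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++) {
        err = pthread_join(threads[i], NULL);
        assert(err == 0);
    }
    incoming_get_current();         /* the caller may also be first */
    return incoming_init_calls;
}
```

However many threads race, the init routine runs once.  A plain
"static bool once" check gives no such guarantee, since two threads can
both pass the check before either one sets the flag.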

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28  3:05         ` Peter Xu
@ 2017-08-28 10:11           ` Marc-André Lureau
  2017-08-28 12:48             ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-28 10:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, QEMU, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, Michael Roth, Paolo Bonzini

Hi

On Mon, Aug 28, 2017 at 5:05 AM, Peter Xu <peterx@redhat.com> wrote:
> On Fri, Aug 25, 2017 at 04:07:34PM +0000, Marc-André Lureau wrote:
>> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
>> wrote:
>>
>> > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
>> > > Hi
>> > >
>> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
>> > >
>> > > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
>> > > > using non-mux typed backend chardev.  We only do this for monitors, so
>> > > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
>> > > > and the monitor together).
>> > > >
>> > > > When use_thread is set, we create standalone thread to poll the monitor
>> > > > events, isolated from the main loop thread.  Here we still need to take
>> > > > the BQL before dispatching the tasks since some of the monitor commands
>> > > > are not allowed to execute without the protection of BQL.  Then this
>> > > > gives us the chance to avoid taking the BQL for some monitor commands
>> > in
>> > > > the future.
>> > > >
>> > > > * Why this change?
>> > > >
>> > > > We need these per-monitor threads to make sure we can have at least one
>> > > > monitor that will never stuck (that can receive further monitor
>> > > > commands).
>> > > >
>> > > > * So when will monitors stuck?  And, how do they stuck?
>> > > >
>> > > > After we have postcopy and remote page faults, it's simple to achieve a
>> > > > stuck in the monitor (which is also a stuck in main loop thread):
>> > > >
>> > > > (1) Monitor deadlock on BQL
>> > > >
>> > > > As we may know, when postcopy is running on destination VM, the vcpu
>> > > > threads can stuck merely any time as long as it tries to access an
>> > > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
>> > > > that the vcpu thread is holding the BQL.  If the page fault is not
>> > > > handled quickly, you'll find that monitors stop working, which is
>> > trying
>> > > > to take the BQL.
>> > > >
>> > > > If the page fault cannot be handled correctly (one case is a paused
>> > > > postcopy, when network is temporarily down), monitors will hang
>> > > > forever.  Without current patch, that means the main loop hanged.
>> > We'll
>> > > > never find a way to talk to VM again.
>> > > >
>> > >
>> > > Could the BQL be pushed down to the monitor commands level instead? That
>> > > way we wouldn't need a separate thread to solve the hang on commands that
>> > > do not need BQL.
>> >
>> > If the main thread is stuck though I don't see how that helps you; you
>> > have to be able to run these commands on another thread.
>> >
>>
>> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
>> and the command doesn't need it, it would work.  In (2),  info cpus
>> shouldn't keep the BQL (my qapi-async series would probably help here)
>
> (Thanks for joining the discussion)
>
> AFAIK the main thread can be stuck for many reasons.  I have seen one
> stack when the VGA code (IIUC) was trying to write to guest graphic
> memory in main loop thread but luckily that guest page is still not
> copied yet from source.  As long as the main thread is stuck for any
> reason, no chance for monitor commands, even if the commands support
> async operations.

If that command becomes async (it probably should, any command doing
IO probably should), then the main loop can keep running.

>
> So IMHO the only solution is doing these things in separate threads,
> rather than all in a single one.

I wouldn't say it's the only solution. I think the monitor can touch
many areas that haven't been written with multi-threading in mind. My
proposal is probably safer, although I don't know how hard it would be
to push the BQL down to QMP commands, and make existing IO commands
async. The benefits of this work are quite interesting imho,
because a stuck mainloop is basically a stuck qemu, and an additional
thread will not solve it...

-- 
Marc-André Lureau

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-25 16:07       ` Marc-André Lureau
  2017-08-25 16:12         ` Dr. David Alan Gilbert
  2017-08-28  3:05         ` Peter Xu
@ 2017-08-28 11:08         ` Markus Armbruster
  2017-08-28 12:28           ` Marc-André Lureau
  2 siblings, 1 reply; 104+ messages in thread
From: Markus Armbruster @ 2017-08-28 11:08 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Dr. David Alan Gilbert, Laurent Vivier, Fam Zheng, mdroth,
	Juan Quintela, Peter Xu, qemu-devel, Paolo Bonzini

Marc-André Lureau <marcandre.lureau@gmail.com> writes:

> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
>
>> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
>> > Hi
>> >
>> > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
>> >
>> > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
>> > > using non-mux typed backend chardev.  We only do this for monitors, so
>> > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
>> > > and the monitor together).
>> > >
>> > > When use_thread is set, we create standalone thread to poll the monitor
>> > > events, isolated from the main loop thread.  Here we still need to take
>> > > the BQL before dispatching the tasks since some of the monitor commands
>> > > are not allowed to execute without the protection of BQL.  Then this
>> > > gives us the chance to avoid taking the BQL for some monitor commands in
>> > > the future.
>> > >
>> > > * Why this change?
>> > >
>> > > We need these per-monitor threads to make sure we can have at least one
>> > > monitor that will never stuck (that can receive further monitor
>> > > commands).
>> > >
>> > > * So when will monitors stuck?  And, how do they stuck?
>> > >
>> > > After we have postcopy and remote page faults, it's simple to achieve a
>> > > stuck in the monitor (which is also a stuck in main loop thread):
>> > >
>> > > (1) Monitor deadlock on BQL
>> > >
>> > > As we may know, when postcopy is running on destination VM, the vcpu
>> > > threads can stuck merely any time as long as it tries to access an
>> > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
>> > > that the vcpu thread is holding the BQL.  If the page fault is not
>> > > handled quickly, you'll find that monitors stop working, which is trying
>> > > to take the BQL.
>> > >
>> > > If the page fault cannot be handled correctly (one case is a paused
>> > > postcopy, when network is temporarily down), monitors will hang
>> > > forever.  Without current patch, that means the main loop hanged. We'll
>> > > never find a way to talk to VM again.
>> > >
>> >
>> > Could the BQL be pushed down to the monitor commands level instead? That
>> > way we wouldn't need a separate thread to solve the hang on commands that
>> > do not need BQL.
>>
>> If the main thread is stuck though I don't see how that helps you; you
>> have to be able to run these commands on another thread.
>>
>
> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> and the command doesn't need it, it would work.  In (2),  info cpus
> shouldn't keep the BQL (my qapi-async series would probably help here)

This has been discussed several times[*], but of course not with
everybody, so I'll summarize once more: asynchronous commands are not
actually *required* for anything.  They are *one* way to package the
"kick off task, receive an asynchronous message when it's done" pattern.
Another way is synchronous command for the kick off, event for the
"done".

For better or worse, synchronous command + event is what we have today.
Whether adding another way to package the same thing improves the
QMP interface is doubtful.


[*] Try this one:
Message-ID: <87o9yv890z.fsf@dusky.pond.sub.org>
https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg05483.html

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28 11:08         ` Markus Armbruster
@ 2017-08-28 12:28           ` Marc-André Lureau
  2017-08-28 16:24             ` Markus Armbruster
  0 siblings, 1 reply; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-28 12:28 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Dr. David Alan Gilbert, Laurent Vivier, Fam Zheng, mdroth,
	Juan Quintela, Peter Xu, qemu-devel, Paolo Bonzini

Hi

On Mon, Aug 28, 2017 at 1:08 PM Markus Armbruster <armbru@redhat.com> wrote:

> Marc-André Lureau <marcandre.lureau@gmail.com> writes:
>
> > On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> > wrote:
> >
> >> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> >> > Hi
> >> >
> >> > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> >> >
> >> > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> >> > > using non-mux typed backend chardev.  We only do this for monitors, so
> >> > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> >> > > and the monitor together).
> >> > >
> >> > > When use_thread is set, we create standalone thread to poll the monitor
> >> > > events, isolated from the main loop thread.  Here we still need to take
> >> > > the BQL before dispatching the tasks since some of the monitor commands
> >> > > are not allowed to execute without the protection of BQL.  Then this
> >> > > gives us the chance to avoid taking the BQL for some monitor commands in
> >> > > the future.
> >> > >
> >> > > * Why this change?
> >> > >
> >> > > We need these per-monitor threads to make sure we can have at least one
> >> > > monitor that will never stuck (that can receive further monitor
> >> > > commands).
> >> > >
> >> > > * So when will monitors stuck?  And, how do they stuck?
> >> > >
> >> > > After we have postcopy and remote page faults, it's simple to achieve a
> >> > > stuck in the monitor (which is also a stuck in main loop thread):
> >> > >
> >> > > (1) Monitor deadlock on BQL
> >> > >
> >> > > As we may know, when postcopy is running on destination VM, the vcpu
> >> > > threads can stuck merely any time as long as it tries to access an
> >> > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> >> > > that the vcpu thread is holding the BQL.  If the page fault is not
> >> > > handled quickly, you'll find that monitors stop working, which is trying
> >> > > to take the BQL.
> >> > >
> >> > > If the page fault cannot be handled correctly (one case is a paused
> >> > > postcopy, when network is temporarily down), monitors will hang
> >> > > forever.  Without current patch, that means the main loop hanged. We'll
> >> > > never find a way to talk to VM again.
> >> > >
> >> >
> >> > Could the BQL be pushed down to the monitor commands level instead? That
> >> > way we wouldn't need a seperate thread to solve the hang on commands that
> >> > do not need BQL.
> >>
> >> If the main thread is stuck though I don't see how that helps you; you
> >> have to be able to run these commands on another thread.
> >>
> >
> > Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> > and the command doesn't need it, it would work.  In (2),  info cpus
> > shouldn't keep the BQL (my qapi-async series would probably help here)
>
> This has been discussed several times[*], but of course not with
> everybody, so I'll summarize once more: asynchronous commands are not a
> actually *required* for anything.  They are *one* way to package the
> "kick off task, receive an asynchronous message when it's done" pattern.
> Another way is synchronous command for the kick off, event for the
> "done".
>

But you would have to break or duplicate the QMP APIs. My proposal doesn't
need that: a command can re-enter the main loop and keep the current QMP API.


> For better or worse, synchronous command + event is what we have today.
> Whether adding another way to package the the same thing improves the
> QMP interface is doubtful.
>

I would argue my series is mostly about internal refactoring for the
benefit mentioned above. The fact that you can do (optionally) concurrent
QMP commands is a nice bonus. Furthermore, it simplifies the API compared
to CMD / dummy reply + EVENT. And it gives a meaning to the existing
command "id".


>
> [*] Try this one:
> Message-ID: <87o9yv890z.fsf@dusky.pond.sub.org>
> https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg05483.html
>
-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28 10:11           ` Marc-André Lureau
@ 2017-08-28 12:48             ` Peter Xu
  2017-09-05 18:58               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-28 12:48 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Dr. David Alan Gilbert, QEMU, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, Michael Roth, Paolo Bonzini

On Mon, Aug 28, 2017 at 12:11:38PM +0200, Marc-André Lureau wrote:
> Hi
> 
> On Mon, Aug 28, 2017 at 5:05 AM, Peter Xu <peterx@redhat.com> wrote:
> > On Fri, Aug 25, 2017 at 04:07:34PM +0000, Marc-André Lureau wrote:
> >> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> wrote:
> >>
> >> > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> >> > > Hi
> >> > >
> >> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> >> > >
> >> > > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> >> > > > using non-mux typed backend chardev.  We only do this for monitors, so
> >> > > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> >> > > > and the monitor together).
> >> > > >
> >> > > > When use_thread is set, we create standalone thread to poll the monitor
> >> > > > events, isolated from the main loop thread.  Here we still need to take
> >> > > > the BQL before dispatching the tasks since some of the monitor commands
> >> > > > are not allowed to execute without the protection of BQL.  Then this
> >> > > > gives us the chance to avoid taking the BQL for some monitor commands in
> >> > > > the future.
> >> > > >
> >> > > > * Why this change?
> >> > > >
> >> > > > We need these per-monitor threads to make sure we can have at least one
> >> > > > monitor that will never stuck (that can receive further monitor
> >> > > > commands).
> >> > > >
> >> > > > * So when will monitors stuck?  And, how do they stuck?
> >> > > >
> >> > > > After we have postcopy and remote page faults, it's simple to achieve a
> >> > > > stuck in the monitor (which is also a stuck in main loop thread):
> >> > > >
> >> > > > (1) Monitor deadlock on BQL
> >> > > >
> >> > > > As we may know, when postcopy is running on destination VM, the vcpu
> >> > > > threads can stuck merely any time as long as it tries to access an
> >> > > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> >> > > > that the vcpu thread is holding the BQL.  If the page fault is not
> >> > > > handled quickly, you'll find that monitors stop working, which is trying
> >> > > > to take the BQL.
> >> > > >
> >> > > > If the page fault cannot be handled correctly (one case is a paused
> >> > > > postcopy, when network is temporarily down), monitors will hang
> >> > > > forever.  Without current patch, that means the main loop hanged. We'll
> >> > > > never find a way to talk to VM again.
> >> > > >
> >> > >
> >> > > Could the BQL be pushed down to the monitor commands level instead? That
> >> > > way we wouldn't need a seperate thread to solve the hang on commands that
> >> > > do not need BQL.
> >> >
> >> > If the main thread is stuck though I don't see how that helps you; you
> >> > have to be able to run these commands on another thread.
> >> >
> >>
> >> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> >> and the command doesn't need it, it would work.  In (2),  info cpus
> >> shouldn't keep the BQL (my qapi-async series would probably help here)
> >
> > (Thanks for joining the discussion)
> >
> > AFAIK the main thread can be stuck for many reasons.  I have seen one
> > stack where the VGA code (IIUC) was trying to write to guest graphic
> > memory in main loop thread but luckily that guest page was still not
> > copied yet from source.  As long as the main thread is stuck for any
> > reason, no chance for monitor commands, even if the commands support
> > async operations.
> 
> If that command becomes async (it probably should, any command doing
> IO probably should), then the main loop can keep running.

The problem is that it's not blocked at "a command", but at a task
running on the main thread.  The task can access guest memory, and
when the guest page is not there, the main thread hangs.  Then it
hangs every monitor, and all other tasks that are bound to the main
thread.
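
To make the failure mode concrete, here is a toy Python sketch (not QEMU
code; every name in it is hypothetical).  A threading.Event that is never
set stands in for the guest page that never arrives.  The monitor served
by the blocked "main loop" never answers, while the monitor that has its
own thread does:

```python
import queue
import threading


def demo(timeout=0.5):
    """Return the replies that arrive while the 'main loop' is blocked."""
    page_arrived = threading.Event()  # never set: the guest page never arrives
    main_requests = queue.Queue()     # monitor served by the main loop
    thread_requests = queue.Queue()   # monitor served by its own thread
    replies = queue.Queue()

    def main_loop():
        # A task touches guest memory and blocks before any monitor is served.
        page_arrived.wait()
        while True:
            replies.put(("main-loop monitor", main_requests.get()))

    def monitor_thread():
        while True:
            replies.put(("threaded monitor", thread_requests.get()))

    for target in (main_loop, monitor_thread):
        threading.Thread(target=target, daemon=True).start()

    main_requests.put("ping")
    thread_requests.put("ping")

    # Collect replies until nothing more shows up within the timeout.
    answered = []
    while True:
        try:
            answered.append(replies.get(timeout=timeout))
        except queue.Empty:
            return answered
```

The sketch only shows the isolation between threads.  It says nothing
about the BQL, which the per-monitor thread still has to take before
dispatching most commands.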

> 
> >
> > So IMHO the only solution is doing these things in separate threads,
> > rather than all in a single one.
> 
> I wouldn't say it's the only solution. I think the monitor can touch
> many areas that haven't been written with multi-threading in mind. My
> proposal is probably safer, although I don't know how hard it would be
> to push the BQL down to QMP commands, and make async existing IO
> commands. The benefits of this work are quite interesting imho,
> because a stuck mainloop is basically a stuck qemu, and an additional
> thread will not solve it...
> 
> -- 
> Marc-André Lureau

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28 12:28           ` Marc-André Lureau
@ 2017-08-28 16:24             ` Markus Armbruster
  2017-08-28 17:24               ` Marc-André Lureau
  0 siblings, 1 reply; 104+ messages in thread
From: Markus Armbruster @ 2017-08-28 16:24 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Peter Xu, mdroth, Paolo Bonzini

Marc-André Lureau <marcandre.lureau@gmail.com> writes:

> Hi
>
> On Mon, Aug 28, 2017 at 1:08 PM Markus Armbruster <armbru@redhat.com> wrote:
>
>> Marc-André Lureau <marcandre.lureau@gmail.com> writes:
>>
>> > On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <
>> dgilbert@redhat.com>
>> > wrote:
>> >
>> >> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
[...]
>> >> > Could the BQL be pushed down to the monitor commands level instead? That
>> >> > way we wouldn't need a seperate thread to solve the hang on commands that
>> >> > do not need BQL.
>> >>
>> >> If the main thread is stuck though I don't see how that helps you; you
>> >> have to be able to run these commands on another thread.
>> >>
>> >
>> > Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
>> > and the command doesn't need it, it would work.  In (2),  info cpus
>> > shouldn't keep the BQL (my qapi-async series would probably help here)
>>
>> This has been discussed several times[*], but of course not with
>> everybody, so I'll summarize once more: asynchronous commands are not a
>> actually *required* for anything.  They are *one* way to package the
>> "kick off task, receive an asynchronous message when it's done" pattern.
>> Another way is synchronous command for the kick off, event for the
>> "done".
>>
>
> But you would have to break or duplicate the QMP APIs. My proposal doesn't
> need that, a command can reenter the main loop, and keep current QMP API.

Changing an existing command from synchronous to asynchronous is
definitely an API change, as discussed before.

>> For better or worse, synchronous command + event is what we have today.
>> Whether adding another way to package the the same thing improves the
>> QMP interface is doubtful.
>>
>
> I would argue my series is mostly about internal refactoring for the
> benefit mentionned above. The fact that you can do (optionnaly) concurrent
> QMP commands is a nice bonus. Furthermore, it simplifies the API compared
> to CMD / dummy reply + EVENT. And it gives a meaning to the exisiting
> command "id"..

Calling a change "mostly internal" when it fundamentally extends the QMP
protocol makes as much sense to me as "mostly not pregnant".

The non-dummy nature of the command reply has also been discussed
several times.  So has been the relative complexity of "synchronous
commands + events" vs. "asynchronous commands" vs. "both".  In
considerable detail, in fact:

    Message-ID: <87mvaszbsu.fsf@dusky.pond.sub.org>
    http://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01090.html

Let me quote its last few lines:

    Bottom line:

    1. I still don't want to merge this.

    2. I want us to tackle jobs sooner rather than later.

    3. Once we got at least a jobs prototype, I'm willing to reconsider
       asynchronous commands implemented as special case of jobs.

    One of the most important maintainer duties is saying "no".  It's also
    one of the least fun duties.

I'm happy to reconsider this conclusion when presented with new
evidence.

Asynchronous commands vs. synchronous commands + events are different
packaging of the same thing: neither can do anything the other could not
do.  If we want to make progress on the monitor hang problem (this
thread's topic), we should focus on the *concepts*, not how to best
package them for QMP.

>> [*] Try this one:
>> Message-ID: <87o9yv890z.fsf@dusky.pond.sub.org>
>> https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg05483.html

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28 16:24             ` Markus Armbruster
@ 2017-08-28 17:24               ` Marc-André Lureau
  2017-08-29  6:27                 ` Markus Armbruster
  0 siblings, 1 reply; 104+ messages in thread
From: Marc-André Lureau @ 2017-08-28 17:24 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, qemu-devel,
	Dr. David Alan Gilbert, Peter Xu, mdroth, Paolo Bonzini

Hi

On Mon, Aug 28, 2017 at 6:24 PM Markus Armbruster <armbru@redhat.com> wrote:

> Marc-André Lureau <marcandre.lureau@gmail.com> writes:
>
> > Hi
> >
> > On Mon, Aug 28, 2017 at 1:08 PM Markus Armbruster <armbru@redhat.com> wrote:
> >
> >> Marc-André Lureau <marcandre.lureau@gmail.com> writes:
> >>
> >> > On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> > wrote:
> >> >
> >> >> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> [...]
> >> >> > Could the BQL be pushed down to the monitor commands level instead? That
> >> >> > way we wouldn't need a seperate thread to solve the hang on commands that
> >> >> > do not need BQL.
> >> >>
> >> >> If the main thread is stuck though I don't see how that helps you; you
> >> >> have to be able to run these commands on another thread.
> >> >>
> >> >
> >> > Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> >> > and the command doesn't need it, it would work.  In (2),  info cpus
> >> > shouldn't keep the BQL (my qapi-async series would probably help here)
> >>
> >> This has been discussed several times[*], but of course not with
> >> everybody, so I'll summarize once more: asynchronous commands are not a
> >> actually *required* for anything.  They are *one* way to package the
> >> "kick off task, receive an asynchronous message when it's done" pattern.
> >> Another way is synchronous command for the kick off, event for the
> >> "done".
> >>
> >
> > But you would have to break or duplicate the QMP APIs. My proposal doesn't
> > need that, a command can reenter the main loop, and keep current QMP API.
>
> Changing an existing command from synchronous to asynchronous is
> definitely an API change, as discussed before.
>

We are getting off topic, but I really feel there is a misunderstanding and
a wrong evaluation of my -async proposal.

The command is in no way asynchronous from an external point of view, as
long as the QMP user doesn't declare async capability.

There is no external API change. The "internal API" change is made optional
too.


> >> For better or worse, synchronous command + event is what we have today.
> >> Whether adding another way to package the the same thing improves the
> >> QMP interface is doubtful.
> >>
> >
> > I would argue my series is mostly about internal refactoring for the
> > benefit mentionned above. The fact that you can do (optionnaly) concurrent
> > QMP commands is a nice bonus. Furthermore, it simplifies the API compared
> > to CMD / dummy reply + EVENT. And it gives a meaning to the exisiting
> > command "id"..
>
> Call a change "mostly internal" when it fundamentally extends the QMP
> protocol makes as much sense to me as "mostly not pregnant".


> The non-dummy nature of the command reply has also been discussed
> several times.  So has been the relative complexity of "synchronous
> commands + events" vs. "asynchronous commands" vs. "both".  In
> considerable detail, in fact:
>
>     Message-ID: <87mvaszbsu.fsf@dusky.pond.sub.org>
>     http://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01090.html
>
> Let me quote its last few lines:
>
>     Bottom line:
>
>     1. I still don't want to merge this.
>

>     2. I want us to tackle jobs sooner rather than later.
>
>     3. Once we got at least a jobs prototype, I'm willing to reconsider
>        asynchronous commands implemented as special case of jobs.
>

This is not being fair, tbh: my proposal has been an RFC for two years, and
has been ready for a while. I have no clear idea what the "jobs" API would
look like, or what the requirements are, so I can't work on it or defend my
work. But it will almost certainly benefit from my low-level -async work.


>     One of the most important maintainer duties is saying "no".  It's also
>     one of the least fun duties.
>
>
> I'm happy to reconsider this conclusion when presented with new
> evidence.
>

I am still convinced my proposal has many merits, so I will keep proposing
it when I see conflicts with what I have, even if you said "no" to an
earlier merge request.


> Asynchronous commands vs. synchronous commands + events are different
> packaging of the same thing: neither can do anything the other could not
> do.  If we want to make progress on the monitor hang problem (this
> thread's topic), we should focus on the *concepts*, not how to best
> package them for QMP.
>

Right, please ignore the optional QMP protocol -async capability and look
at the proposal itself: one way is to make long-lived/blocking QMP commands
-async internally, so other work can happen. This is 90% of my -async proposal.


> >> [*] Try this one:
> >> Message-ID: <87o9yv890z.fsf@dusky.pond.sub.org>
> >> https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg05483.html
>
-- 
Marc-André Lureau

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28 17:24               ` Marc-André Lureau
@ 2017-08-29  6:27                 ` Markus Armbruster
  0 siblings, 0 replies; 104+ messages in thread
From: Markus Armbruster @ 2017-08-29  6:27 UTC (permalink / raw)
  To: Marc-André Lureau
  Cc: Laurent Vivier, Fam Zheng, mdroth, Juan Quintela, qemu-devel,
	Peter Xu, Dr. David Alan Gilbert, Paolo Bonzini, John Snow

Marc-André Lureau <marcandre.lureau@gmail.com> writes:

> Hi
>
> On Mon, Aug 28, 2017 at 6:24 PM Markus Armbruster <armbru@redhat.com> wrote:
>
>> Marc-André Lureau <marcandre.lureau@gmail.com> writes:
>>
>> > Hi
>> >
>> > On Mon, Aug 28, 2017 at 1:08 PM Markus Armbruster <armbru@redhat.com> wrote:
>> >
>> >> Marc-André Lureau <marcandre.lureau@gmail.com> writes:
>> >>
>> >> > On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
>> >> > wrote:
>> >> >
>> >> >> * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
>> [...]
>> >> >> > Could the BQL be pushed down to the monitor commands level instead? That
>> >> >> > way we wouldn't need a seperate thread to solve the hang on commands that
>> >> >> > do not need BQL.
>> >> >>
>> >> >> If the main thread is stuck though I don't see how that helps you; you
>> >> >> have to be able to run these commands on another thread.
>> >> >>
>> >> >
>> >> > Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
>> >> > and the command doesn't need it, it would work.  In (2),  info cpus
>> >> > shouldn't keep the BQL (my qapi-async series would probably help here)
>> >>
>> >> This has been discussed several times[*], but of course not with
>> >> everybody, so I'll summarize once more: asynchronous commands are not a
>> >> actually *required* for anything.  They are *one* way to package the
>> >> "kick off task, receive an asynchronous message when it's done" pattern.
>> >> Another way is synchronous command for the kick off, event for the
>> >> "done".
>> >>
>> >
>> > But you would have to break or duplicate the QMP APIs. My proposal doesn't
>> > need that, a command can reenter the main loop, and keep current QMP API.
>>
>> Changing an existing command from synchronous to asynchronous is
>> definitely an API change, as discussed before.
>>
>
> We are getting of topic, but I really feel there is a misunderstanding and
> a wrong evaluation of my -async proposal.
>
> The command is in no way asynchronous from an external point of view, as
> long as the QMP user doesn't declare async capability.
>
> There is no external API change. The "internal API" change is made optional
> too.

Recapitulating another bit of previous discussion...

Making the change from synchronous to asynchronous opt-in is indeed
compatible evolution.

Another compatible evolution is an opt-in change from synchronous to
synchronous kick off + notify.

Once again, same thing, different packaging.
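
A small Python sketch of the two packagings, as an illustration only.
The message shapes are loosely modelled on QMP's "return", "event" and
"id" keys; the event name DEMO_TASK_DONE and the payload are
hypothetical, not part of any real QMP schema:

```python
def async_command(request_id, result):
    """Asynchronous command: a single reply, delivered when the task is done."""
    return [{"return": result, "id": request_id}]


def kickoff_plus_event(request_id, result):
    """Synchronous kick off: empty reply at once, result in a later event."""
    return [
        {"return": {}, "id": request_id},
        {"event": "DEMO_TASK_DONE", "data": result},  # hypothetical event name
    ]


def task_result(messages):
    """What the client ends up knowing, whichever packaging was used."""
    for message in messages:
        if message.get("event") == "DEMO_TASK_DONE":
            return message["data"]
    # No completion event: the (last) command reply carries the result.
    return messages[-1]["return"]
```

Either message stream leaves the client with the same information; only
the place where the result travels differs.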

>> >> For better or worse, synchronous command + event is what we have today.
>> >> Whether adding another way to package the the same thing improves the
>> >> QMP interface is doubtful.
>> >>
>> >
>> > I would argue my series is mostly about internal refactoring for the
>> > benefit mentionned above. The fact that you can do (optionnaly) concurrent
>> > QMP commands is a nice bonus. Furthermore, it simplifies the API compared
>> > to CMD / dummy reply + EVENT. And it gives a meaning to the exisiting
>> > command "id"..
>>
>> Call a change "mostly internal" when it fundamentally extends the QMP
>> protocol makes as much sense to me as "mostly not pregnant".
>
>
>> The non-dummy nature of the command reply has also been discussed
>> several times.  So has been the relative complexity of "synchronous
>> commands + events" vs. "asynchronous commands" vs. "both".  In
>> considerable detail, in fact:
>>
>>     Message-ID: <87mvaszbsu.fsf@dusky.pond.sub.org>
>>     http://lists.gnu.org/archive/html/qemu-devel/2017-05/msg01090.html
>>
>> Let me quote its last few lines:
>>
>>     Bottom line:
>>
>>     1. I still don't want to merge this.
>>
>>     2. I want us to tackle jobs sooner rather than later.
>>
>>     3. Once we got at least a jobs prototype, I'm willing to reconsider
>>        asynchronous commands implemented as special case of jobs.
>>
>
> This is not being fair tbh, my proposal is RFC for 2y, and has been ready
> for a while. I have no clear clue what the "jobs" API would look like, or
> what are the requirements, so I can't work on it or defend my work. But it
> will almost certainly benefit from my low-level -async work.

"Jobs" are even readier: it's what we use today.  The trouble is they
aren't instances of a generic jobs API.  We have a few that are
instances of a block jobs API, and a few more that are ad hoc, such as
migration, PCI hotplug with ACPI, dump-guest-memory with detach=true.
Item 2. above is rectifying that.  John Snow (cc'ed) has played with it
some.  I asked you to talk it over with him.  Have you done that?
Results?  If not, perhaps talking it over in person at the KVM Forum
would work better.

Note that both block jobs and migration provide more than just "kick
off" and "notify when done": they also let you examine and control the
job.  These features are needed for all but the simplest of cases.

>>     One of the most important maintainer duties is saying "no".  It's also
>>     one of the least fun duties.
>>
>>
>> I'm happy to reconsider this conclusion when presented with new
>> evidence.
>>
>
>  I am still convince my proposal has many merits, so I will keep proposing
> it when I see conflicts with what I have, even if you said "no" to an
> earlier merge request.

Want me to prepare a canned reply I can paste quickly?  You could do the
same for your counter-point, and we all save lots of time.

>> Asynchronous commands vs. synchronous commands + events are different
>> packaging of the same thing: neither can do anything the other could not
>> do.  If we want to make progress on the monitor hang problem (this
>> thread's topic), we should focus on the *concepts*, not how to best
>> package them for QMP.
>>
>
> Right, please ignore  the optional QMP protocol -async capability and look
> at it: one way is to make long-lived/blocking QMP commands -async
> internally, so other work can happen. This is 90% of my -async proposal.

Certain activities that are now synchronous commands should become "kick
off task, receive an asynchronous message when it's done".

The interesting question for *this* thread is how far such a change
could help with avoiding monitor hangs.  Would changing the "right"
commands and making clients use them correctly suffice?  If not, what
other work is needed?

>> >> [*] Try this one:
>> >> Message-ID: <87o9yv890z.fsf@dusky.pond.sub.org>
>> >> https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg05483.html

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (7 preceding siblings ...)
  2017-08-23  6:51 ` [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock Peter Xu
@ 2017-08-29 11:03 ` Daniel P. Berrange
  2017-08-30  7:06   ` Markus Armbruster
  2017-09-06  9:48   ` Dr. David Alan Gilbert
  2017-09-06 14:50 ` Stefan Hajnoczi
  9 siblings, 2 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-08-29 11:03 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela, mdroth,
	Eric Blake, Laurent Vivier, Markus Armbruster,
	Dr . David Alan Gilbert

On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> v2:
> - fixed "make check" error that patchew reported
> - moved the thread_join upper in monitor_data_destroy(), before
>   resources are released
> - added one new patch (current patch 3) that fixes a nasty risk
>   condition with IOWatchPoll.  Please see commit message for more
>   information.
> - added a g_main_context_wakeup() to make sure the separate loop
>   thread can be kicked always when we want to destroy the per-monitor
>   threads.
> - added one new patch (current patch 8) to introduce migration mgmt
>   lock for migrate_incoming.
> 
> This is an extended work for migration postcopy recovery. This series
> is tested with the following series to make sure it solves the monitor
> hang problem that we have encountered for postcopy recovery:
> 
>   [RFC 00/29] Migration: postcopy failure recovery
>   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> 
> The root problem is that, monitor commands are all handled in main
> loop thread now, no matter how many monitors we specify. And, if main
> loop thread hangs due to some reason, all monitors will be stuck.
> This can be done in reversed order as well: if any of the monitor
> hangs, it will hang the main loop, and the rest of the monitors (if
> there is any).
> 
> That affects postcopy recovery, since the recovery requires user input
> on destination side.  If monitors hang, the destination VM dies and
> lose hope for even a final recovery.
> 
> So, sometimes we need to make sure the monitor be alive, at least one
> of them.
> 
> The whole idea of this series is that instead of handling monitor
> commands all in main loop thread, we do it separately in per-monitor
> threads.  Then, even if main loop thread hangs at any point by any
> reason, per-monitor thread can still survive.  Further, we add hint in
> QMP/HMP to show whether a command can be executed without BQL, if so,
> we avoid taking BQL when running that command.  It greatly reduced
> contention of BQL.  Now the only user of that new parameter (currently
> I call it "without-bql") is "migrate-incoming" command, which is the
> only command to rescue a paused postcopy migration.
> 
> However, even with the series, it does not mean that per-monitor
> threads will never hang.  One example is that we can still run "info
> vcpus" in per-monitor threads during a paused postcopy (in that state,
> page faults are never handled, and "info cpus" will never return since
> it tries to sync every vcpus).  So to make sure it does not hang, we
> not only need the per-monitor thread, the user should be careful as
> well on how to use it.
> 
> For postcopy recovery, we may need dedicated monitor channel for
> recovery.  In other words, a destination VM that supports postcopy
> recovery would possibly need:
> 
>   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL

I think this is a really horrible thing to expose to management applications.
They should not need to be aware of the fact that QEMU is buggy and thus
requires that certain commands be run on different monitors to work around
the bug.

I'd much prefer to see the problem described handled transparently inside
QEMU. One approach is to have a dedicated thread in QEMU responsible for all
monitor I/O. This thread should never actually execute monitor commands
though; it would simply parse the command request and put data onto a queue
of pending commands, thus it could never hang. The command queue could be
processed by the main thread, or by another thread that is interested.
E.g., the migration thread could process any queued commands related to
migration directly.
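
As a rough illustration of that split, here is a Python sketch (not QEMU
code).  The command name "demo-status", its reply and the simplified
error string are placeholders; only the structure matters: the I/O
thread parses and enqueues, and some other thread executes.

```python
import json
import queue
import threading


def monitor_io_thread(lines, pending):
    """Parse raw request lines and enqueue them; never execute a command."""
    for line in lines:
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            continue  # a real monitor would report the parse error
        if isinstance(request, dict):
            pending.put(request)
    pending.put(None)  # sentinel: input closed


def dispatcher(pending, handlers, results):
    """Drain the queue; only this thread can block inside a handler."""
    while True:
        request = pending.get()
        if request is None:
            break
        handler = handlers.get(request.get("execute"))
        if handler is None:
            reply = {"id": request.get("id"), "error": "CommandNotFound"}
        else:
            reply = {"id": request.get("id"),
                     "return": handler(request.get("arguments", {}))}
        results.append(reply)


def run_demo():
    pending = queue.Queue()
    results = []
    handlers = {"demo-status": lambda args: {"running": True}}
    lines = [
        '{"execute": "demo-status", "id": 1}',
        'not json',
        '{"execute": "no-such-command", "id": 2}',
    ]
    io = threading.Thread(target=monitor_io_thread, args=(lines, pending))
    worker = threading.Thread(target=dispatcher,
                              args=(pending, handlers, results))
    io.start()
    worker.start()
    io.join()
    worker.join()
    return results
```

In this arrangement a handler that blocks stalls only the dispatcher;
the I/O thread keeps accepting and queueing requests, which another
consumer could then pick up.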

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-29 11:03 ` [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Daniel P. Berrange
@ 2017-08-30  7:06   ` Markus Armbruster
  2017-08-30 10:13     ` Daniel P. Berrange
  2017-09-06  9:48   ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 104+ messages in thread
From: Markus Armbruster @ 2017-08-30  7:06 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, Laurent Vivier, Fam Zheng, Juan Quintela, mdroth,
	qemu-devel, Paolo Bonzini, Dr . David Alan Gilbert, John Snow,
	Marc-André Lureau

"Daniel P. Berrange" <berrange@redhat.com> writes:

> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
>> v2:
>> - fixed "make check" error that patchew reported
>> - moved the thread_join upper in monitor_data_destroy(), before
>>   resources are released
>> - added one new patch (current patch 3) that fixes a nasty risk
>>   condition with IOWatchPoll.  Please see commit message for more
>>   information.
>> - added a g_main_context_wakeup() to make sure the separate loop
>>   thread can be kicked always when we want to destroy the per-monitor
>>   threads.
>> - added one new patch (current patch 8) to introduce migration mgmt
>>   lock for migrate_incoming.
>> 
>> This is an extended work for migration postcopy recovery. This series
>> is tested with the following series to make sure it solves the monitor
>> hang problem that we have encountered for postcopy recovery:
>> 
>>   [RFC 00/29] Migration: postcopy failure recovery
>>   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
>> 
>> The root problem is that, monitor commands are all handled in main
>> loop thread now, no matter how many monitors we specify. And, if main
>> loop thread hangs due to some reason, all monitors will be stuck.
>> This can be done in reversed order as well: if any of the monitor
>> hangs, it will hang the main loop, and the rest of the monitors (if
>> there is any).

Yes.

>> That affects postcopy recovery, since the recovery requires user input
>> on destination side.  If monitors hang, the destination VM dies and
>> lose hope for even a final recovery.
>> 
>> So, sometimes we need to make sure the monitor be alive, at least one
>> of them.
>> 
>> The whole idea of this series is that instead of handling monitor
>> commands all in main loop thread, we do it separately in per-monitor
>> threads.  Then, even if main loop thread hangs at any point by any
>> reason, per-monitor thread can still survive.

This takes care of "monitor hangs because other parts of the main loop
(including other monitors) hang".  It doesn't take care of "monitor
hangs because the current monitor command hangs".

>>                                                Further, we add hint in
>> QMP/HMP to show whether a command can be executed without BQL; if so,
>> we avoid taking BQL when running that command.  It greatly reduced
>> contention of BQL.  Now the only user of that new parameter (currently
>> I call it "without-bql") is "migrate-incoming" command, which is the
>> only command to rescue a paused postcopy migration.

This takes care of one way commands can hang.  There are other ways;
NFS server going AWOL is a classic.  I don't know whether any other way
applies to migrate-incoming.

>> However, even with the series, it does not mean that per-monitor
>> threads will never hang.  One example is that we can still run "info
>> vcpus" in per-monitor threads during a paused postcopy (in that state,
>> page faults are never handled, and "info cpus" will never return since
>> it tries to sync every vcpus).  So to make sure it does not hang, we
>> not only need the per-monitor thread, the user should be careful as
>> well on how to use it.
>> 
>> For postcopy recovery, we may need dedicated monitor channel for
>> recovery.  In other words, a destination VM that supports postcopy
>> recovery would possibly need:
>> 
>>   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL

Where RECOVERY_CHANNEL isn't necessarily just for postcopy, but for any
"emergency" QMP access.  If you use it only for commands that cannot
hang (i.e. terminate in bounded time), then you'll always be able to get
commands accepted there in bounded time.

> I think this is a really horrible thing to expose to management applications.
> They should not need to be aware of fact that QEMU is buggy and thus requires
> that certain commands be run on different monitors to work around the bug.

These are (serious) design limitations, not bugs in the narrow sense of
the word.

However, I quite agree that the need for clients to know whether a
monitor command can hang is impractical for the general case.  What
might be practical is a QMP monitor mode that accepts only known
hang-free commands.  Hang-free could be introspectable.
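
A rough Python model of such a restricted mode might look like the
sketch below.  Nothing here is QEMU code: the command table, the
"hang-free" flag, the introspect() helper and the "CommandNotAllowed"
error class are all hypothetical, and which real commands would qualify
as hang-free is left open.

```python
# Hypothetical command table: each entry records whether the command is
# assumed to finish in bounded time.  The names are illustrative only.
COMMANDS = {
    "query-status": {"hang-free": True, "handler": lambda: "running"},
    "slow-cmd": {"hang-free": False, "handler": lambda: "done"},
}


def introspect():
    """Expose the hang-free flag so a client can discover it."""
    return {name: info["hang-free"] for name, info in COMMANDS.items()}


def dispatch(name, restricted):
    """Run a command; a restricted monitor refuses anything not hang-free."""
    info = COMMANDS[name]
    if restricted and not info["hang-free"]:
        return {"error": "CommandNotAllowed"}  # hypothetical error class
    return {"return": info["handler"]()}


flags = introspect()
ok = dispatch("query-status", restricted=True)
refused = dispatch("slow-cmd", restricted=True)
allowed = dispatch("slow-cmd", restricted=False)
```

The point of the model is only that the restriction is enforced by the
monitor, so the client does not have to get the classification right.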

In case you consider that ugly: it's best to explore the design space
first, and recoil from "ugly" second.

> I'd much prefer to see the problem described handled transparently inside
> QEMU. One approach is have a dedicated thread in QEMU responsible for all
> monitor I/O. This thread should never actually execute monitor commands
> though, it would simply parse the command request and put data onto a queue
> of pending commands, thus it could never hang. The command queue could be
> processed by the main thread, or by another thread that is interested.
> eg the migration thread could process any queued commands related to
> migration directly.

The monitor itself can't hang then, but the thread(s) dequeuing parsed
commands can.

To maintain commands' synchronous semantics, their replies need to be
sent in order, which of course reintroduces the hangs.

Let's take a step back from the implementation, and talk about
*behavior* instead.

You prefer to have "the problem described handled transparently inside
QEMU".  I read that as "QEMU must ensure the QMP monitor is available at
all times".  "Available" means it accepts commands in bounded time.
Some commands will always finish in bounded time once accepted, others
may not, and whether they do may depend on the commands currently in
flight.

Commands that can always start and always terminate in bounded time are
no problem.

All the other commands have to become "job-starting": the QMP command
kicks off a "job", which runs concurrently with the QMP monitor for some
(possibly unbounded) time, then finishes.  Jobs can be examined (say to
monitor progress, if the job supports that) and controlled (say to
cancel, if the job supports that).
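
As a behavioral illustration only, a job-starting command could be
modeled in Python roughly as follows.  The JobManager and Job classes,
the "job-id" reply member and the status strings are assumptions made
for this sketch, not an existing QMP interface.

```python
import itertools
import threading


class Job:
    """State shared between the monitor and the thread running the job."""

    def __init__(self, job_id):
        self.id = job_id
        self.status = "running"
        self.cancel_requested = threading.Event()
        self.finished = threading.Event()


class JobManager:
    def __init__(self):
        self._ids = itertools.count(1)
        self.jobs = {}

    def start(self, work):
        """Command handler: only creates the job, then returns at once."""
        job = Job(next(self._ids))
        self.jobs[job.id] = job
        threading.Thread(target=self._run, args=(job, work), daemon=True).start()
        return {"job-id": job.id}

    def _run(self, job, work):
        work(job)  # may take unbounded time; the monitor is not blocked
        job.status = "cancelled" if job.cancel_requested.is_set() else "completed"
        job.finished.set()

    def query(self, job_id):
        return self.jobs[job_id].status

    def cancel(self, job_id):
        self.jobs[job_id].cancel_requested.set()


def unbounded_work(job):
    job.cancel_requested.wait()  # stands in for work that never ends by itself


mgr = JobManager()
reply = mgr.start(unbounded_work)            # returns without waiting
status_while_running = mgr.query(reply["job-id"])
mgr.cancel(reply["job-id"])
mgr.jobs[reply["job-id"]].finished.wait(timeout=5)
status_after_cancel = mgr.query(reply["job-id"])
```

Here start(), query() and cancel() all finish in bounded time, while
the work itself runs for as long as it needs.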

A few commands are already job-starting: migrate, the block job family,
dump-guest-memory with detach=true.  Whether they're already hang-free I
can't say; they could do risky work in their synchronous part.

Many commands that can hang are not job-starting.

Changing a command from "do the job" to "merely start the job" is a
compatibility break.

We could make the change opt-in to preserve compatibility.  But is
preserving a compatible QMP monitor that is prone to hang worthwhile?

If no, we may choose to use the resulting compatibility break to also
switch the packaging of jobs from the current "synchronous command +
broadcast message when done" to some variation of asynchronous command.
But that should be discussed in a separate thread, and only after we
know how we plan to ensure monitor availability.

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-30  7:06   ` Markus Armbruster
@ 2017-08-30 10:13     ` Daniel P. Berrange
  2017-08-31  3:31       ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-08-30 10:13 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Xu, Laurent Vivier, Fam Zheng, Juan Quintela, mdroth,
	qemu-devel, Paolo Bonzini, Dr . David Alan Gilbert, John Snow,
	Marc-André Lureau

On Wed, Aug 30, 2017 at 09:06:20AM +0200, Markus Armbruster wrote:
> "Daniel P. Berrange" <berrange@redhat.com> writes:
> 
> > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> 
> >> However, even with the series, it does not mean that per-monitor
> >> threads will never hang.  One example is that we can still run "info
> >> vcpus" in per-monitor threads during a paused postcopy (in that state,
> >> page faults are never handled, and "info cpus" will never return since
> >> it tries to sync every vcpus).  So to make sure it does not hang, we
> >> not only need the per-monitor thread, the user should be careful as
> >> well on how to use it.
> >> 
> >> For postcopy recovery, we may need dedicated monitor channel for
> >> recovery.  In other words, a destination VM that supports postcopy
> >> recovery would possibly need:
> >> 
> >>   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> 
> Where RECOVERY_CHANNEL isn't necessarily just for postcopy, but for any
> "emergency" QMP access.  If you use it only for commands that cannot
> hang (i.e. terminate in bounded time), then you'll always be able to get
> commands accepted there in bounded time.
> 
> > I think this is a really horrible thing to expose to management applications.
> > They should not need to be aware of fact that QEMU is buggy and thus requires
> > that certain commands be run on different monitors to work around the bug.
> 
> These are (serious) design limitations, not bugs in the narrow sense of
> the word.
> 
> However, I quite agree that the need for clients to know whether a
> monitor command can hang is impractical for the general case.  What
> might be practical is a QMP monitor mode that accepts only known
> hang-free commands.  Hang-free could be introspectable.
> 
> In case you consider that ugly: it's best to explore the design space
> first, and recoil from "ugly" second.

Actually you slightly mis-interpreted me there. I think it is ok for
applications to have knowledge about whether a particular command
may hang or not. Given that knowledge it should *not*, however, require
that the application issue such commands on separate monitor channels.
It is entirely possible to handle hang-free commands on the existing
channel.

> > I'd much prefer to see the problem described handled transparently inside
> > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > monitor I/O. This thread should never actually execute monitor commands
> > though, it would simply parse the command request and put data onto a queue
> > of pending commands, thus it could never hang. The command queue could be
> > processed by the main thread, or by another thread that is interested.
> > eg the migration thread could process any queued commands related to
> > migration directly.
> 
> The monitor itself can't hang then, but the thread(s) dequeuing parsed
> commands can.

If certain commands are hang-free then you can have a dedicated thread
that only de-queues & processes the hang-free commands. The approach I
outlined is exactly how libvirt deals with its own RPC dispatch. We have
certain commands that are guaranteed to not hang, which are processed by
a dedicated pool of threads. So even if all normal RPC commands have
hung, you can still run a subset of hang-free RPC commands.
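
To make the scheme concrete, here is a minimal Python model of it.  It
is neither libvirt nor QEMU code: the Dispatcher class, the HANG_FREE
set and the command names are hypothetical, and treating "query-status"
as hang-free is an assumption made purely for the example.

```python
import queue
import threading

# Hypothetical classification; a real implementation would derive this
# from the command schema rather than a hard-coded set.
HANG_FREE = {"query-status"}


class Dispatcher:
    """Routes parsed commands to a normal queue or a hang-free queue."""

    def __init__(self, handlers):
        self.handlers = handlers
        self.normal_q = queue.Queue()
        self.hang_free_q = queue.Queue()
        self.replies = queue.Queue()
        # One dedicated worker per queue.
        for q in (self.normal_q, self.hang_free_q):
            threading.Thread(target=self._worker, args=(q,), daemon=True).start()

    def submit(self, name):
        # The I/O side only enqueues, so it never waits on a handler.
        q = self.hang_free_q if name in HANG_FREE else self.normal_q
        q.put(name)

    def _worker(self, q):
        while True:
            name = q.get()
            self.replies.put((name, self.handlers[name]()))


stuck = threading.Event()  # stands in for a resource that never answers


def slow_command():
    stuck.wait()  # blocks until released below
    return "done"


d = Dispatcher({"slow-cmd": slow_command, "query-status": lambda: "running"})
d.submit("slow-cmd")       # occupies the normal worker
d.submit("query-status")   # still served by the hang-free worker
first_reply = d.replies.get(timeout=5)
stuck.set()                # let the blocked command finish
second_reply = d.replies.get(timeout=5)
```

The hang-free reply arrives first even though it was submitted second,
which is also why reply ordering has to be relaxed for this to help.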

> 
> To maintain commands' synchronous semantics, their replies need to be
> sent in order, which of course reintroduces the hangs.

The requirement for such ordering is just an arbitrary restriction that
QEMU currently imposes. It is reasonable to allow arbitrary ordering of
responses (which is what libvirt does in its RPC layer). Admittedly at
this stage though, we would likely require some "opt in" handshake when
initializing QMP for the app to tell QEMU it can cope with out of order
replies. It would require that each command request has a unique serial
number, which is included in the associated reply, so apps can match
them up. We used to have that but iirc it was then removed.
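
A client-side sketch of that matching, in Python, could look like the
following.  The PendingReplies class and the exact wire format (an "id"
member carried in both request and reply) are assumptions for
illustration, and the opt-in handshake is not modeled.

```python
import json


class PendingReplies:
    """Tracks outstanding requests so replies can arrive in any order."""

    def __init__(self):
        self._next_id = 0
        self._pending = {}

    def send(self, execute):
        # Tag every request with a unique serial number.
        self._next_id += 1
        self._pending[self._next_id] = execute
        return json.dumps({"execute": execute, "id": self._next_id})

    def receive(self, wire_reply):
        # Use the echoed serial number to find the matching request.
        reply = json.loads(wire_reply)
        command = self._pending.pop(reply["id"])
        return command, reply["return"]


client = PendingReplies()
req_a = client.send("slow-cmd")       # id 1
req_b = client.send("query-status")   # id 2

# The replies come back in the opposite order to the requests.
matched = [
    client.receive(json.dumps({"return": "running", "id": 2})),
    client.receive(json.dumps({"return": "done", "id": 1})),
]
```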

There are other ways to deal with this, such as the job-starting idea
you mention below.

The key point though is that I don't think creating multiple monitor
servers is a desirable approach - it is just a hack to avoid dealing
with the root cause problems. 

> Let's take a step back from the implementation, and talk about
> *behavior* instead.
> 
> You prefer to have "the problem described handled transparently inside
> QEMU".  I read that as "QEMU must ensure the QMP monitor is available at
> all times".  "Available" means it accepts commands in bounded time.
> Some commands will always finish in bounded time once accepted, others
> may not, and whether they do may depend on the commands currently in
> flight.
> 
> Commands that can always start and always terminate in bounded time are
> no problem.
> 
> All the other commands have to become "job-starting": the QMP command
> kicks off a "job", which runs concurrently with the QMP monitor for some
> (possibly unbounded) time, then finishes.  Jobs can be examined (say to
> monitor progress, if the job supports that) and controlled (say to
> cancel, if the job supports that).
> 
> A few commands are already job-starting: migrate, the block job family,
> dump-guest-memory with detach=true.  Whether they're already hang-free I
> can't say; they could do risky work in their synchronous part.
> 
> Many commands that can hang are not job-starting.
> 
> Changing a command from "do the job" to "merely start the job" is a
> compatibility break.
> 
> We could make the change opt-in to preserve compatibility.  But is
> preserving a compatible QMP monitor that is prone to hang worthwhile?
> 
> If no, we may choose to use the resulting compatibility break to also
> switch the packaging of jobs from the current "synchronous command +
> broadcast message when done" to some variation of asynchronous command.
> But that should be discussed in a separate thread, and only after we
> know how we plan to ensure monitor availability.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-30 10:13     ` Daniel P. Berrange
@ 2017-08-31  3:31       ` Peter Xu
  2017-08-31  9:14         ` Daniel P. Berrange
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-08-31  3:31 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Markus Armbruster, Laurent Vivier, Fam Zheng, Juan Quintela,
	mdroth, qemu-devel, Paolo Bonzini, Dr . David Alan Gilbert,
	John Snow, Marc-André Lureau

On Wed, Aug 30, 2017 at 11:13:11AM +0100, Daniel P. Berrange wrote:
> On Wed, Aug 30, 2017 at 09:06:20AM +0200, Markus Armbruster wrote:
> > "Daniel P. Berrange" <berrange@redhat.com> writes:
> > 
> > > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > 
> > >> However, even with the series, it does not mean that per-monitor
> > >> threads will never hang.  One example is that we can still run "info
> > >> vcpus" in per-monitor threads during a paused postcopy (in that state,
> > >> page faults are never handled, and "info cpus" will never return since
> > >> it tries to sync every vcpus).  So to make sure it does not hang, we
> > >> not only need the per-monitor thread, the user should be careful as
> > >> well on how to use it.
> > >> 
> > >> For postcopy recovery, we may need dedicated monitor channel for
> > >> recovery.  In other words, a destination VM that supports postcopy
> > >> recovery would possibly need:
> > >> 
> > >>   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > 
> > Where RECOVERY_CHANNEL isn't necessarily just for postcopy, but for any
> > "emergency" QMP access.  If you use it only for commands that cannot
> > hang (i.e. terminate in bounded time), then you'll always be able to get
> > commands accepted there in bounded time.
> > 
> > > I think this is a really horrible thing to expose to management applications.
> > > They should not need to be aware of fact that QEMU is buggy and thus requires
> > > that certain commands be run on different monitors to work around the bug.
> > 
> > These are (serious) design limitations, not bugs in the narrow sense of
> > the word.
> > 
> > However, I quite agree that the need for clients to know whether a
> > monitor command can hang is impractical for the general case.  What
> > might be practical is a QMP monitor mode that accepts only known
> > hang-free commands.  Hang-free could be introspectable.
> > 
> > In case you consider that ugly: it's best to explore the design space
> > first, and recoil from "ugly" second.
> 
> Actually you slightly mis-interpreted me there. I think it is ok for
> applications to have knowledge about whether a particular command
> may hang or not. Given that knowledge it should *not*, however, require
> that the application issue such commands on separate monitor channels.
> It is entirely possible to handle hang-free commands on the existing
> channel.
> 
> > > I'd much prefer to see the problem described handled transparently inside
> > > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > > monitor I/O. This thread should never actually execute monitor commands
> > > though, it would simply parse the command request and put data onto a queue
> > > of pending commands, thus it could never hang. The command queue could be
> > > processed by the main thread, or by another thread that is interested.
> > > eg the migration thread could process any queued commands related to
> > > migration directly.
> > 
> > The monitor itself can't hang then, but the thread(s) dequeuing parsed
> > commands can.
> 
> If certain commands are hang-free then you can have a dedicated thread
> that only de-queues & processes the hang-free commands. The approach I
> outlined is exactly how libvirt deals with its own RPC dispatch. We have
> certain commands that are guaranteed to not hang, which are processed by
> a dedicated pool of threads. So even if all normal RPC commands have
> hung, you can still run a subset of hang-free RPC commands.
> 
> > 
> > To maintain commands' synchronous semantics, their replies need to be
> > sent in order, which of course reintroduces the hangs.
> 
> The requirement for such ordering is just an arbitrary restriction that
> QEMU currently imposes. It is reasonable to allow arbitrary ordering of
> responses (which is what libvirt does in its RPC layer). Admittedly at
> this stage though, we would likely require some "opt in" handshake when
> initializing QMP for the app to tell QEMU it can cope with out of order
> replies. It would require that each command request has a unique serial
> number, which is included in the associated reply, so apps can match
> them up. We used to have that but iirc it was then removed.
> 
> There's other ways to deal with this, such as the job starting idea you
> mention below.
> 
> The key point though is that I don't think creating multiple monitor
> servers is a desirable approach - it is just a hack to avoid dealing
> with the root cause problems. 

Yeah, I agree.  It does not address the root cause, but AFAIU it is the
simplest way to solve the problem for now.  I think I understand the
major concern here, though: an extra channel is an interface change,
and it affects users of monitors.  So I agree we'd better be patient in
choosing a good enough interface.  It looks like we have two options:

- dedicated "hang-able" and "hang-free" channel, or,

- async command handling, then we will have one single dedicated
  command parser (possibly as well in separate thread rather than main
  thread), per-command ID, and possibly slightly more work.  For this
  one, I believe there are different implementations.

So it looks like we first need to choose an interface, and if we
choose the 2nd, we then need to choose the implementation.

Before reaching a conclusion, I just want to make sure we have
consensus that we should at least start moving the monitor command
handling out of the main thread and into a separate thread.  Am I
correct?

Thanks,

> 
> > Let's take a step back from the implementation, and talk about
> > *behavior* instead.
> > 
> > You prefer to have "the problem described handled transparently inside
> > QEMU".  I read that as "QEMU must ensure the QMP monitor is available at
> > all times".  "Available" means it accepts commands in bounded time.
> > Some commands will always finish in bounded time once accepted, others
> > may not, and whether they do may depend on the commands currently in
> > flight.
> > 
> > Commands that can always start and always terminate in bounded time are
> > no problem.
> > 
> > All the other commands have to become "job-starting": the QMP command
> > kicks off a "job", which runs concurrently with the QMP monitor for some
> > (possibly unbounded) time, then finishes.  Jobs can be examined (say to
> > monitor progress, if the job supports that) and controlled (say to
> > cancel, if the job supports that).
> > 
> > A few commands are already job-starting: migrate, the block job family,
> > dump-guest-memory with detach=true.  Whether they're already hang-free I
> > can't say; they could do risky work in their synchronous part.
> > 
> > Many commands that can hang are not job-starting.
> > 
> > Changing a command from "do the job" to "merely start the job" is a
> > compatibility break.
> > 
> > We could make the change opt-in to preserve compatibility.  But is
> > preserving a compatible QMP monitor that is prone to hang worthwhile?
> > 
> > If no, we may choose to use the resulting compatibility break to also
> > switch the packaging of jobs from the current "synchronous command +
> > broadcast message when done" to some variation of asynchronous command.
> > But that should be discussed in a separate thread, and only after we
> > know how we plan to ensure monitor availability.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-31  3:31       ` Peter Xu
@ 2017-08-31  9:14         ` Daniel P. Berrange
  0 siblings, 0 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-08-31  9:14 UTC (permalink / raw)
  To: Peter Xu
  Cc: Markus Armbruster, Laurent Vivier, Fam Zheng, Juan Quintela,
	mdroth, qemu-devel, Paolo Bonzini, Dr . David Alan Gilbert,
	John Snow, Marc-André Lureau

On Thu, Aug 31, 2017 at 11:31:55AM +0800, Peter Xu wrote:
> Before getting to an conclusion, just want to make sure we have got a
> consensus on that at least we should start to move the monitor command
> handling into a separate thread rather than main thread, am I correct?

Certainly agree on that, moving dispatch of monitor commands out of
the main thread is critical IMHO. The main thread should only ever
be doing work that is guaranteed to be non-blocking and completable in
a short, finite amount of time. This means that, for the monitor, it
should at most do the I/O.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28 12:48             ` Peter Xu
@ 2017-09-05 18:58               ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-05 18:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Marc-André Lureau, QEMU, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, Michael Roth, Paolo Bonzini

* Peter Xu (peterx@redhat.com) wrote:
> On Mon, Aug 28, 2017 at 12:11:38PM +0200, Marc-André Lureau wrote:
> > Hi
> > 
> > On Mon, Aug 28, 2017 at 5:05 AM, Peter Xu <peterx@redhat.com> wrote:
> > > On Fri, Aug 25, 2017 at 04:07:34PM +0000, Marc-André Lureau wrote:
> > >> On Fri, Aug 25, 2017 at 5:33 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> > >> wrote:
> > >>
> > >> > * Marc-André Lureau (marcandre.lureau@gmail.com) wrote:
> > >> > > Hi
> > >> > >
> > >> > > On Wed, Aug 23, 2017 at 8:52 AM Peter Xu <peterx@redhat.com> wrote:
> > >> > >
> > >> > > > Firstly, introduce Monitor.use_thread, and set it for monitors that are
> > >> > > > using non-mux typed backend chardev.  We only do this for monitors, so
> > >> > > > mux-typed chardevs are not suitable (when it connects to, e.g., serials
> > >> > > > and the monitor together).
> > >> > > >
> > >> > > > When use_thread is set, we create standalone thread to poll the monitor
> > >> > > > events, isolated from the main loop thread.  Here we still need to take
> > >> > > > the BQL before dispatching the tasks since some of the monitor commands
> > >> > > > are not allowed to execute without the protection of BQL.  Then this
> > >> > > > gives us the chance to avoid taking the BQL for some monitor commands
> > >> > in
> > >> > > > the future.
> > >> > > >
> > >> > > > * Why this change?
> > >> > > >
> > >> > > > We need these per-monitor threads to make sure we can have at least one
> > >> > > > monitor that will never stuck (that can receive further monitor
> > >> > > > commands).
> > >> > > >
> > >> > > > * So when will monitors stuck?  And, how do they stuck?
> > >> > > >
> > >> > > > After we have postcopy and remote page faults, it's simple to achieve a
> > >> > > > stuck in the monitor (which is also a stuck in main loop thread):
> > >> > > >
> > >> > > > (1) Monitor deadlock on BQL
> > >> > > >
> > >> > > > As we may know, when postcopy is running on destination VM, the vcpu
> > >> > > > threads can stuck merely any time as long as it tries to access an
> > >> > > > uncopied guest page.  Meanwhile, when the stuck happens, it is possible
> > >> > > > that the vcpu thread is holding the BQL.  If the page fault is not
> > >> > > > handled quickly, you'll find that monitors stop working, which is
> > >> > trying
> > >> > > > to take the BQL.
> > >> > > >
> > >> > > > If the page fault cannot be handled correctly (one case is a paused
> > >> > > > postcopy, when network is temporarily down), monitors will hang
> > >> > > > forever.  Without current patch, that means the main loop hanged.
> > >> > We'll
> > >> > > > never find a way to talk to VM again.
> > >> > > >
> > >> > >
> > >> > > Could the BQL be pushed down to the monitor commands level instead? That
> > >> > > way we wouldn't need a seperate thread to solve the hang on commands that
> > >> > > do not need BQL.
> > >> >
> > >> > If the main thread is stuck though I don't see how that helps you; you
> > >> > have to be able to run these commands on another thread.
> > >> >
> > >>
> > >> Why would the main thread be stuck? In (1) If the vcpu thread takes the BQL
> > >> and the command doesn't need it, it would work.  In (2),  info cpus
> > >> shouldn't keep the BQL (my qapi-async series would probably help here)
> > >
> > > (Thanks for joining the discussion)
> > >
> > > AFAIK the main thread can be stuck for many reasons.  I have seen one
> > > stack when the VGA code (IIUC) was trying to writting to guest graphic
> > > memory in main loop thread but luckily that guest page is still not
> > > copied yet from source.  As long as the main thread is stuck for any
> > > reason, no chance for monitor commands, even if the commands support
> > > async operations.
> > 
> > If that command becomes async (it probably should, any command doing
> > IO probaly should), then the main loop can keep running.
> 
> The problem is that, it's not blocked at "a command", but a task
> running on the main thread.  The task can access guest memory, and
> when the guest page is not there, the main thread hangs.  Then it
> hangs every monitors, and all other tasks that are bounded to main
> thread.

This is my main reason for believing we need this approach; I don't
see how the async-command solution solves it.

(I don't have anything against the async command stuff, I just don't
think it solves this problem)

Dave

> > 
> > >
> > > So IMHO the only solution is doing these things in separate threads,
> > > rather than all in a single one.
> > 
> > I wouldn't say it's the only solution. I think the monitor can touch
> > many areas that haven't been written with multi-threading in mind. My
> > proposal is probably safer, although I don't know how hard it would be
> > to push the BQL down to QMP commands, and make async existing IO
> > commands. The benefits of this work are quite interesting imho,
> > because a stuck mainloop is basically a stuck qemu, and an additional
> > thread will not solve it...
> > 
> > -- 
> > Marc-André Lureau
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-29 11:03 ` [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Daniel P. Berrange
  2017-08-30  7:06   ` Markus Armbruster
@ 2017-09-06  9:48   ` Dr. David Alan Gilbert
  2017-09-06 10:46     ` Daniel P. Berrange
  1 sibling, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-06  9:48 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > v2:
> > - fixed "make check" error that patchew reported
> > - moved the thread_join upper in monitor_data_destroy(), before
> >   resources are released
> > - added one new patch (current patch 3) that fixes a nasty risk
> >   condition with IOWatchPoll.  Please see commit message for more
> >   information.
> > - added a g_main_context_wakeup() to make sure the separate loop
> >   thread can be kicked always when we want to destroy the per-monitor
> >   threads.
> > - added one new patch (current patch 8) to introduce migration mgmt
> >   lock for migrate_incoming.
> > 
> > This is an extended work for migration postcopy recovery. This series
> > is tested with the following series to make sure it solves the monitor
> > hang problem that we have encountered for postcopy recovery:
> > 
> >   [RFC 00/29] Migration: postcopy failure recovery
> >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > 
> > The root problem is that, monitor commands are all handled in main
> > loop thread now, no matter how many monitors we specify. And, if main
> > loop thread hangs due to some reason, all monitors will be stuck.
> > This can be done in reversed order as well: if any of the monitor
> > hangs, it will hang the main loop, and the rest of the monitors (if
> > there is any).
> > 
> > That affects postcopy recovery, since the recovery requires user input
> > on destination side.  If monitors hang, the destination VM dies and
> > lose hope for even a final recovery.
> > 
> > So, sometimes we need to make sure the monitor be alive, at least one
> > of them.
> > 
> > The whole idea of this series is that instead of handling monitor
> > commands all in main loop thread, we do it separately in per-monitor
> > threads.  Then, even if main loop thread hangs at any point by any
> > reason, per-monitor thread can still survive.  Further, we add hint in
> > QMP/HMP to show whether a command can be executed without BQL; if so,
> > we avoid taking BQL when running that command.  It greatly reduced
> > contention of BQL.  Now the only user of that new parameter (currently
> > I call it "without-bql") is "migrate-incoming" command, which is the
> > only command to rescue a paused postcopy migration.
> > 
> > However, even with the series, it does not mean that per-monitor
> > threads will never hang.  One example is that we can still run "info
> > vcpus" in per-monitor threads during a paused postcopy (in that state,
> > page faults are never handled, and "info cpus" will never return since
> > it tries to sync every vcpus).  So to make sure it does not hang, we
> > not only need the per-monitor thread, the user should be careful as
> > well on how to use it.
> > 
> > For postcopy recovery, we may need dedicated monitor channel for
> > recovery.  In other words, a destination VM that supports postcopy
> > recovery would possibly need:
> > 
> >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> 
> I think this is a really horrible thing to expose to management applications.
> They should not need to be aware of fact that QEMU is buggy and thus requires
> that certain commands be run on different monitors to work around the bug.

It's unfortunately baked in way too deep to fix in the near term; the
BQL is just too contagious and we have a fundamental design of running
all the main IO emulation in one thread.

> I'd much prefer to see the problem described handled transparently inside
> QEMU. One approach is have a dedicated thread in QEMU responsible for all
> monitor I/O. This thread should never actually execute monitor commands
> though, it would simply parse the command request and put data onto a queue
> of pending commands, thus it could never hang. The command queue could be
> processed by the main thread, or by another thread that is interested.
> eg the migration thread could process any queued commands related to
> migration directly.

That requires a change in the current API to allow async command
completion (OK that is something Marc-Andre's world has) so that
from the one connection you can have multiple outstanding commands.
Hmm unless....

We've also got problems that some commands don't like being run outside
of the main thread (see Fam's reply on the 21st pointing out that a lot
of block commands would assert).

I think the way to move to what you describe would be:
  a) A separate thread for monitor IO
      This seems a separate problem
      How hard is that?  Will all the current IO mechanisms used
      for monitors just work if we run them in a separate thread?
      What about mux?

  b) Initially all commands get dispatched to the main thread
     so nothing changes about the API.

  c) We create a new thread for the lock-free commands, and route
      lock-free commands down it.

  d) We start with a rule that on any one monitor connection we
  don't allow you to start a command until the previous one has
  finished

(d) allows us to avoid any API changes, but allows us to do lock-free
stuff on a separate connection like Peter's world.
We can drop (d) once we have a way of doing async commands.
We can add dispatching to more threads once someone describes
what they want from those threads.
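
To make (a)-(d) a bit more concrete, here is a rough sketch in Python
rather than QEMU's C.  It is only an illustration under assumptions:
Monitor, io_dispatch, worker, LOCK_FREE and the two queues are made-up
names, not existing QEMU code.  The main thread is simulated as hung by
simply never draining its queue.

```python
import queue
import threading

# Assumption for illustration: the set of commands flagged as lock-free.
LOCK_FREE = {"migrate-incoming"}


class Monitor:
    """One monitor connection."""

    def __init__(self, name):
        self.name = name
        self.busy = threading.Lock()   # held while a command is in flight (rule d)
        self.replies = queue.Queue()


main_q = queue.Queue()       # (b) would be drained by the main thread
lockfree_q = queue.Queue()   # (c) drained by the lock-free thread


def io_dispatch(mon, cmd):
    """(a) Runs in the monitor IO thread: queue the command, never execute it."""
    if not mon.busy.acquire(blocking=False):
        return False             # (d) previous command has not finished
    target = lockfree_q if cmd in LOCK_FREE else main_q
    target.put((mon, cmd))
    return True


def worker(q, label):
    """Drain one queue until a None sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        mon, cmd = item
        mon.replies.put((cmd, label))
        mon.busy.release()       # connection may now accept its next command


# Only the lock-free thread runs; nothing drains main_q (the "hung" main thread).
lockfree = threading.Thread(target=worker, args=(lockfree_q, "lock-free"))
lockfree.start()

main_mon, recovery_mon = Monitor("main"), Monitor("recovery")
first = io_dispatch(main_mon, "info cpus")              # queued, stuck behind the hang
second = io_dispatch(main_mon, "migrate-incoming")      # refused by rule (d)
third = io_dispatch(recovery_mon, "migrate-incoming")   # separate connection: runs
reply = recovery_mon.replies.get(timeout=5)

lockfree_q.put(None)
lockfree.join()
print(first, second, third, reply)
```

In this sketch the recovery command only gets through on the second
connection: with (d) in place, a connection that already has a command
stuck behind the main thread cannot accept another one.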

Does that work for you Dan?

(IMHO this is still more complex than Peter's world and I don't
really see the advantage).

Dave


> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06  9:48   ` Dr. David Alan Gilbert
@ 2017-09-06 10:46     ` Daniel P. Berrange
  2017-09-06 10:48       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-06 10:46 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Wed, Sep 06, 2017 at 10:48:46AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > v2:
> > > - fixed "make check" error that patchew reported
> > > - moved the thread_join upper in monitor_data_destroy(), before
> > >   resources are released
> > > - added one new patch (current patch 3) that fixes a nasty risk
> > >   condition with IOWatchPoll.  Please see commit message for more
> > >   information.
> > > - added a g_main_context_wakeup() to make sure the separate loop
> > >   thread can be kicked always when we want to destroy the per-monitor
> > >   threads.
> > > - added one new patch (current patch 8) to introduce migration mgmt
> > >   lock for migrate_incoming.
> > > 
> > > This is an extended work for migration postcopy recovery. This series
> > > is tested with the following series to make sure it solves the monitor
> > > hang problem that we have encountered for postcopy recovery:
> > > 
> > >   [RFC 00/29] Migration: postcopy failure recovery
> > >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > > 
> > > The root problem is that, monitor commands are all handled in main
> > > loop thread now, no matter how many monitors we specify. And, if main
> > > loop thread hangs due to some reason, all monitors will be stuck.
> > > This can be done in reversed order as well: if any of the monitor
> > > hangs, it will hang the main loop, and the rest of the monitors (if
> > > there is any).
> > > 
> > > That affects postcopy recovery, since the recovery requires user input
> > > on destination side.  If monitors hang, the destination VM dies and
> > > lose hope for even a final recovery.
> > > 
> > > So, sometimes we need to make sure the monitor be alive, at least one
> > > of them.
> > > 
> > > The whole idea of this series is that instead of handling monitor
> > > commands all in the main loop thread, we do it separately in
> > > per-monitor threads.  Then, even if the main loop thread hangs at any
> > > point for any reason, the per-monitor threads can still survive.
> > > Further, we add a hint in QMP/HMP to show whether a command can be
> > > executed without the BQL; if so, we avoid taking the BQL when running
> > > that command.  This greatly reduces contention on the BQL.  For now the
> > > only user of that new parameter (currently I call it "without-bql") is
> > > the "migrate-incoming" command, which is the only command that can
> > > rescue a paused postcopy migration.
> > > 
> > > However, even with the series, per-monitor threads can still hang.
> > > One example is that we can still run "info cpus" in a per-monitor
> > > thread during a paused postcopy (in that state, page faults are never
> > > handled, and "info cpus" will never return since it tries to sync
> > > every vcpu).  So to make sure a monitor does not hang, we need not
> > > only the per-monitor thread; the user must also be careful about
> > > which commands to run on it.
> > > 
> > > For postcopy recovery, we may need dedicated monitor channel for
> > > recovery.  In other words, a destination VM that supports postcopy
> > > recovery would possibly need:
> > > 
> > >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > 
> > I think this is a really horrible thing to expose to management applications.
> > They should not need to be aware of fact that QEMU is buggy and thus requires
> > that certain commands be run on different monitors to work around the bug.
> 
> It's unfortunately baked in way too deep to fix in the near term; the
> BQL is just too contagious and we have a fundamental design of running
> all the main IO emulation in one thread.
> 
> > I'd much prefer to see the problem described handled transparently inside
> > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > monitor I/O. This thread should never actually execute monitor commands
> > though, it would simply parse the command request and put data onto a queue
> > of pending commands, thus it could never hang. The command queue could be
> > processed by the main thread, or by another thread that is interested.
> > eg the migration thread could process any queued commands related to
> > migration directly.
> 
> That requires a change in the current API to allow async command
> completion (OK that is something Marc-Andre's world has) so that
> from the one connection you can have multiple outstanding commands.
> Hmm unless....
> 
> We've also got problems that some commands don't like being run outside
> of the main thread (see Fam's reply on the 21st pointing out that a lot
> of block commands would assert).
> 
> I think the way to move to what you describe would be:
>   a) A separate thread for monitor IO
>       This seems a separate problem
>       How hard is that?  Will all the current IO mechanisms used
>       for monitors just work if we run them in a separate thread?
>       What about mux?
> 
>   b) Initially all commands get dispatched to the main thread
>      so nothing changes about the API.
> 
>   c) We create a new thread for the lock-free commands, and route
>       lock-free commands down it.
> 
>   d) We start with a rule that on any one monitor connection we
>   don't allow you to start a command until the previous one has
>   finished
> 
> (d) allows us to avoid any API changes, but allows us to do lock-free
> stuff on a separate connection like Peter's world.
> We can drop (d) once we have a way of doing async commands.
> We can add dispatching to more threads once someone describes
> what they want from those threads.
> 
> Does that work for you Dan?

It would *provided* that we do (c) for the commands Peter wants for
this migration series.  IOW, I don't want to have to have logic in
libvirt that either needs to add a 2nd monitor server, or open a 2nd
monitor connection, to deal with migration post-copy recovery in some
versions of QEMU.  So whatever is needed to make post-copy recovery
work has to be done for (c).

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 10:46     ` Daniel P. Berrange
@ 2017-09-06 10:48       ` Dr. David Alan Gilbert
  2017-09-06 10:54         ` Daniel P. Berrange
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-06 10:48 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Wed, Sep 06, 2017 at 10:48:46AM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > > v2:
> > > > - fixed "make check" error that patchew reported
> > > > - moved the thread_join upper in monitor_data_destroy(), before
> > > >   resources are released
> > > > - added one new patch (current patch 3) that fixes a nasty risk
> > > >   condition with IOWatchPoll.  Please see commit message for more
> > > >   information.
> > > > - added a g_main_context_wakeup() to make sure the separate loop
> > > >   thread can be kicked always when we want to destroy the per-monitor
> > > >   threads.
> > > > - added one new patch (current patch 8) to introduce migration mgmt
> > > >   lock for migrate_incoming.
> > > > 
> > > > This is an extended work for migration postcopy recovery. This series
> > > > is tested with the following series to make sure it solves the monitor
> > > > hang problem that we have encountered for postcopy recovery:
> > > > 
> > > >   [RFC 00/29] Migration: postcopy failure recovery
> > > >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > > > 
> > > > The root problem is that, monitor commands are all handled in main
> > > > loop thread now, no matter how many monitors we specify. And, if main
> > > > loop thread hangs due to some reason, all monitors will be stuck.
> > > > This can be done in reversed order as well: if any of the monitor
> > > > hangs, it will hang the main loop, and the rest of the monitors (if
> > > > there is any).
> > > > 
> > > > That affects postcopy recovery, since the recovery requires user input
> > > > on destination side.  If monitors hang, the destination VM dies and
> > > > lose hope for even a final recovery.
> > > > 
> > > > So, sometimes we need to make sure the monitor be alive, at least one
> > > > of them.
> > > > 
> > > > The whole idea of this series is that instead of handling monitor
> > > > commands all in the main loop thread, we do it separately in
> > > > per-monitor threads.  Then, even if the main loop thread hangs at any
> > > > point for any reason, the per-monitor threads can still survive.
> > > > Further, we add a hint in QMP/HMP to show whether a command can be
> > > > executed without the BQL; if so, we avoid taking the BQL when running
> > > > that command.  This greatly reduces contention on the BQL.  For now the
> > > > only user of that new parameter (currently I call it "without-bql") is
> > > > the "migrate-incoming" command, which is the only command that can
> > > > rescue a paused postcopy migration.
> > > > 
> > > > However, even with the series, per-monitor threads can still hang.
> > > > One example is that we can still run "info cpus" in a per-monitor
> > > > thread during a paused postcopy (in that state, page faults are never
> > > > handled, and "info cpus" will never return since it tries to sync
> > > > every vcpu).  So to make sure a monitor does not hang, we need not
> > > > only the per-monitor thread; the user must also be careful about
> > > > which commands to run on it.
> > > > 
> > > > For postcopy recovery, we may need dedicated monitor channel for
> > > > recovery.  In other words, a destination VM that supports postcopy
> > > > recovery would possibly need:
> > > > 
> > > >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > > 
> > > I think this is a really horrible thing to expose to management applications.
> > > They should not need to be aware of fact that QEMU is buggy and thus requires
> > > that certain commands be run on different monitors to work around the bug.
> > 
> > It's unfortunately baked in way too deep to fix in the near term; the
> > BQL is just too contagious and we have a fundamental design of running
> > all the main IO emulation in one thread.
> > 
> > > I'd much prefer to see the problem described handled transparently inside
> > > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > > monitor I/O. This thread should never actually execute monitor commands
> > > though, it would simply parse the command request and put data onto a queue
> > > of pending commands, thus it could never hang. The command queue could be
> > > processed by the main thread, or by another thread that is interested.
> > > eg the migration thread could process any queued commands related to
> > > migration directly.
> > 
> > That requires a change in the current API to allow async command
> > completion (OK that is something Marc-Andre's world has) so that
> > from the one connection you can have multiple outstanding commands.
> > Hmm unless....
> > 
> > We've also got problems that some commands don't like being run outside
> > of the main thread (see Fam's reply on the 21st pointing out that a lot
> > of block commands would assert).
> > 
> > I think the way to move to what you describe would be:
> >   a) A separate thread for monitor IO
> >       This seems a separate problem
> >       How hard is that?  Will all the current IO mechanisms used
> >       for monitors just work if we run them in a separate thread?
> >       What about mux?
> > 
> >   b) Initially all commands get dispatched to the main thread
> >      so nothing changes about the API.
> > 
> >   c) We create a new thread for the lock-free commands, and route
> >       lock-free commands down it.
> > 
> >   d) We start with a rule that on any one monitor connection we
> >   don't allow you to start a command until the previous one has
> >   finished
> > 
> > (d) allows us to avoid any API changes, but allows us to do lock-free
> > stuff on a separate connection like Peter's world.
> > We can drop (d) once we have a way of doing async commands.
> > We can add dispatching to more threads once someone describes
> > what they want from those threads.
> > 
> > Does that work for you Dan?
> 
> It would *provided* that we do (c) for the commands Peter wants for
> this migration series.  IOW, I don't want to have to have logic in
> libvirt that either needs to add a 2nd monitor server, or open a 2nd
> monitor connection, to deal with migration post-copy recovery in some
> versions of QEMU.  So whatever is needed to make post-copy recovery
> work has to be done for (c).

But then doesn't that mean you're requiring us to break (d) and change
the QMP interface to libvirt so it can do async stuff?

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 10:48       ` Dr. David Alan Gilbert
@ 2017-09-06 10:54         ` Daniel P. Berrange
  2017-09-06 10:57           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-06 10:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Wed, Sep 06, 2017 at 11:48:51AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Wed, Sep 06, 2017 at 10:48:46AM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > > > v2:
> > > > > - fixed "make check" error that patchew reported
> > > > > - moved the thread_join upper in monitor_data_destroy(), before
> > > > >   resources are released
> > > > > - added one new patch (current patch 3) that fixes a nasty risk
> > > > >   condition with IOWatchPoll.  Please see commit message for more
> > > > >   information.
> > > > > - added a g_main_context_wakeup() to make sure the separate loop
> > > > >   thread can be kicked always when we want to destroy the per-monitor
> > > > >   threads.
> > > > > - added one new patch (current patch 8) to introduce migration mgmt
> > > > >   lock for migrate_incoming.
> > > > > 
> > > > > This is an extended work for migration postcopy recovery. This series
> > > > > is tested with the following series to make sure it solves the monitor
> > > > > hang problem that we have encountered for postcopy recovery:
> > > > > 
> > > > >   [RFC 00/29] Migration: postcopy failure recovery
> > > > >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > > > > 
> > > > > The root problem is that, monitor commands are all handled in main
> > > > > loop thread now, no matter how many monitors we specify. And, if main
> > > > > loop thread hangs due to some reason, all monitors will be stuck.
> > > > > This can be done in reversed order as well: if any of the monitor
> > > > > hangs, it will hang the main loop, and the rest of the monitors (if
> > > > > there is any).
> > > > > 
> > > > > That affects postcopy recovery, since the recovery requires user input
> > > > > on destination side.  If monitors hang, the destination VM dies and
> > > > > lose hope for even a final recovery.
> > > > > 
> > > > > So, sometimes we need to make sure the monitor be alive, at least one
> > > > > of them.
> > > > > 
> > > > > The whole idea of this series is that instead of handling monitor
> > > > > commands all in the main loop thread, we do it separately in
> > > > > per-monitor threads.  Then, even if the main loop thread hangs at any
> > > > > point for any reason, the per-monitor threads can still survive.
> > > > > Further, we add a hint in QMP/HMP to show whether a command can be
> > > > > executed without the BQL; if so, we avoid taking the BQL when running
> > > > > that command.  This greatly reduces contention on the BQL.  For now the
> > > > > only user of that new parameter (currently I call it "without-bql") is
> > > > > the "migrate-incoming" command, which is the only command that can
> > > > > rescue a paused postcopy migration.
> > > > > 
> > > > > However, even with the series, per-monitor threads can still hang.
> > > > > One example is that we can still run "info cpus" in a per-monitor
> > > > > thread during a paused postcopy (in that state, page faults are never
> > > > > handled, and "info cpus" will never return since it tries to sync
> > > > > every vcpu).  So to make sure a monitor does not hang, we need not
> > > > > only the per-monitor thread; the user must also be careful about
> > > > > which commands to run on it.
> > > > > 
> > > > > For postcopy recovery, we may need dedicated monitor channel for
> > > > > recovery.  In other words, a destination VM that supports postcopy
> > > > > recovery would possibly need:
> > > > > 
> > > > >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > > > 
> > > > I think this is a really horrible thing to expose to management applications.
> > > > They should not need to be aware of fact that QEMU is buggy and thus requires
> > > > that certain commands be run on different monitors to work around the bug.
> > > 
> > > It's unfortunately baked in way too deep to fix in the near term; the
> > > BQL is just too contagious and we have a fundamental design of running
> > > all the main IO emulation in one thread.
> > > 
> > > > I'd much prefer to see the problem described handled transparently inside
> > > > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > > > monitor I/O. This thread should never actually execute monitor commands
> > > > though, it would simply parse the command request and put data onto a queue
> > > > of pending commands, thus it could never hang. The command queue could be
> > > > processed by the main thread, or by another thread that is interested.
> > > > eg the migration thread could process any queued commands related to
> > > > migration directly.
> > > 
> > > That requires a change in the current API to allow async command
> > > completion (OK that is something Marc-Andre's world has) so that
> > > from the one connection you can have multiple outstanding commands.
> > > Hmm unless....
> > > 
> > > We've also got problems that some commands don't like being run outside
> > > of the main thread (see Fam's reply on the 21st pointing out that a lot
> > > of block commands would assert).
> > > 
> > > I think the way to move to what you describe would be:
> > >   a) A separate thread for monitor IO
> > >       This seems a separate problem
> > >       How hard is that?  Will all the current IO mechanisms used
> > >       for monitors just work if we run them in a separate thread?
> > >       What about mux?
> > > 
> > >   b) Initially all commands get dispatched to the main thread
> > >      so nothing changes about the API.
> > > 
> > >   c) We create a new thread for the lock-free commands, and route
> > >       lock-free commands down it.
> > > 
> > >   d) We start with a rule that on any one monitor connection we
> > >   don't allow you to start a command until the previous one has
> > >   finished
> > > 
> > > (d) allows us to avoid any API changes, but allows us to do lock-free
> > > stuff on a separate connection like Peter's world.
> > > We can drop (d) once we have a way of doing async commands.
> > > We can add dispatching to more threads once someone describes
> > > what they want from those threads.
> > > 
> > > Does that work for you Dan?
> > 
> > It would *provided* that we do (c) for the commands Peter wants for
> > this migration series.  IOW, I don't want to have to have logic in
> > libvirt that either needs to add a 2nd monitor server, or open a 2nd
> > monitor connection, to deal with migration post-copy recovery in some
> > versions of QEMU.  So whatever is needed to make post-copy recovery
> > work has to be done for (c).
> 
> But then doesn't that mean you're requiring us to break (d) and change
> the QMP interface to libvirt so it can do async stuff?

Depends on your definition of break - I'm assuming there's either a way
to opt in to an async mode for existing commands in (c), or that
async commands would be added in parallel with existing sync commands.
IOW, it's not an API breakage - it's an opt-in extension of existing
functionality.
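
As a rough illustration of the first option (an opt-in async mode), here
is a minimal Python sketch.  Everything in it is hypothetical: the
Connection class, enable_async() and the refusal behaviour are made up
for this example and are not an existing QEMU or libvirt interface.  The
only thing borrowed from QMP is the "id" field, which a client can
already attach to a command and which is echoed back in the reply.

```python
class Connection:
    """Hypothetical monitor connection with an opt-in async mode."""

    def __init__(self):
        self.async_mode = False   # default: existing synchronous behaviour
        self.outstanding = {}     # command id -> command name

    def enable_async(self):
        """Client opts in; nothing changes for clients that never call this."""
        self.async_mode = True

    def submit(self, cmd_id, name):
        if self.outstanding and not self.async_mode:
            raise RuntimeError("previous command has not finished")
        self.outstanding[cmd_id] = name

    def complete(self, cmd_id):
        """Finish a command and build a reply matched to it by id."""
        del self.outstanding[cmd_id]
        return {"return": {}, "id": cmd_id}


# A client that never opts in keeps one command in flight at a time.
legacy = Connection()
legacy.submit(1, "info cpus")
try:
    legacy.submit(2, "migrate-incoming")
    legacy_refused = False
except RuntimeError:
    legacy_refused = True

# A client that opts in can issue the recovery command on the same connection.
modern = Connection()
modern.enable_async()
modern.submit(1, "info cpus")
modern.submit(2, "migrate-incoming")
recovery_reply = modern.complete(2)   # completes while command 1 is still pending
```

A client that never calls enable_async() sees the same one-at-a-time
behaviour as today, which is the sense in which this would be an
extension rather than a break.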

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 10:54         ` Daniel P. Berrange
@ 2017-09-06 10:57           ` Dr. David Alan Gilbert
  2017-09-06 11:06             ` Daniel P. Berrange
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-06 10:57 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Wed, Sep 06, 2017 at 11:48:51AM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > On Wed, Sep 06, 2017 at 10:48:46AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > > > > v2:
> > > > > > - fixed "make check" error that patchew reported
> > > > > > - moved the thread_join upper in monitor_data_destroy(), before
> > > > > >   resources are released
> > > > > > - added one new patch (current patch 3) that fixes a nasty risk
> > > > > >   condition with IOWatchPoll.  Please see commit message for more
> > > > > >   information.
> > > > > > - added a g_main_context_wakeup() to make sure the separate loop
> > > > > >   thread can be kicked always when we want to destroy the per-monitor
> > > > > >   threads.
> > > > > > - added one new patch (current patch 8) to introduce migration mgmt
> > > > > >   lock for migrate_incoming.
> > > > > > 
> > > > > > This is an extended work for migration postcopy recovery. This series
> > > > > > is tested with the following series to make sure it solves the monitor
> > > > > > hang problem that we have encountered for postcopy recovery:
> > > > > > 
> > > > > >   [RFC 00/29] Migration: postcopy failure recovery
> > > > > >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > > > > > 
> > > > > > The root problem is that, monitor commands are all handled in main
> > > > > > loop thread now, no matter how many monitors we specify. And, if main
> > > > > > loop thread hangs due to some reason, all monitors will be stuck.
> > > > > > This can be done in reversed order as well: if any of the monitor
> > > > > > hangs, it will hang the main loop, and the rest of the monitors (if
> > > > > > there is any).
> > > > > > 
> > > > > > That affects postcopy recovery, since the recovery requires user input
> > > > > > on destination side.  If monitors hang, the destination VM dies and
> > > > > > lose hope for even a final recovery.
> > > > > > 
> > > > > > So, sometimes we need to make sure the monitor be alive, at least one
> > > > > > of them.
> > > > > > 
> > > > > > The whole idea of this series is that instead of handling monitor
> > > > > > commands all in the main loop thread, we do it separately in
> > > > > > per-monitor threads.  Then, even if the main loop thread hangs at any
> > > > > > point for any reason, the per-monitor threads can still survive.
> > > > > > Further, we add a hint in QMP/HMP to show whether a command can be
> > > > > > executed without the BQL; if so, we avoid taking the BQL when running
> > > > > > that command.  This greatly reduces contention on the BQL.  For now the
> > > > > > only user of that new parameter (currently I call it "without-bql") is
> > > > > > the "migrate-incoming" command, which is the only command that can
> > > > > > rescue a paused postcopy migration.
> > > > > > 
> > > > > > However, even with the series, per-monitor threads can still hang.
> > > > > > One example is that we can still run "info cpus" in a per-monitor
> > > > > > thread during a paused postcopy (in that state, page faults are never
> > > > > > handled, and "info cpus" will never return since it tries to sync
> > > > > > every vcpu).  So to make sure a monitor does not hang, we need not
> > > > > > only the per-monitor thread; the user must also be careful about
> > > > > > which commands to run on it.
> > > > > > 
> > > > > > For postcopy recovery, we may need dedicated monitor channel for
> > > > > > recovery.  In other words, a destination VM that supports postcopy
> > > > > > recovery would possibly need:
> > > > > > 
> > > > > >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > > > > 
> > > > > I think this is a really horrible thing to expose to management applications.
> > > > > They should not need to be aware of fact that QEMU is buggy and thus requires
> > > > > that certain commands be run on different monitors to work around the bug.
> > > > 
> > > > It's unfortunately baked in way too deep to fix in the near term; the
> > > > BQL is just too contagious and we have a fundamental design of running
> > > > all the main IO emulation in one thread.
> > > > 
> > > > > I'd much prefer to see the problem described handled transparently inside
> > > > > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > > > > monitor I/O. This thread should never actually execute monitor commands
> > > > > though, it would simply parse the command request and put data onto a queue
> > > > > of pending commands, thus it could never hang. The command queue could be
> > > > > processed by the main thread, or by another thread that is interested.
> > > > > eg the migration thread could process any queued commands related to
> > > > > migration directly.
> > > > 
> > > > That requires a change in the current API to allow async command
> > > > completion (OK that is something Marc-Andre's world has) so that
> > > > from the one connection you can have multiple outstanding commands.
> > > > Hmm unless....
> > > > 
> > > > We've also got problems that some commands don't like being run outside
> > > > of the main thread (see Fam's reply on the 21st pointing out that a lot
> > > > of block commands would assert).
> > > > 
> > > > I think the way to move to what you describe would be:
> > > >   a) A separate thread for monitor IO
> > > >       This seems a separate problem
> > > >       How hard is that?  Will all the current IO mechanisms used
> > > >       for monitors just work if we run them in a separate thread?
> > > >       What about mux?
> > > > 
> > > >   b) Initially all commands get dispatched to the main thread
> > > >      so nothing changes about the API.
> > > > 
> > > >   c) We create a new thread for the lock-free commands, and route
> > > >       lock-free commands down it.
> > > > 
> > > >   d) We start with a rule that on any one monitor connection we
> > > >   don't allow you to start a command until the previous one has
> > > >   finished
> > > > 
> > > > (d) allows us to avoid any API changes, but allows us to do lock-free
> > > > stuff on a separate connection like Peter's world.
> > > > We can drop (d) once we have a way of doing async commands.
> > > > We can add dispatching to more threads once someone describes
> > > > what they want from those threads.
> > > > 
> > > > Does that work for you Dan?
> > > 
> > > It would *provided* that we do (c) for the commands Peter wants for
> > > this migration series.  IOW, I don't want to have to have logic in
> > > libvirt that either needs to add a 2nd monitor server, or open a 2nd
> > > monitor connection, to deal with migration post-copy recovery in some
> > > versions of QEMU.  So whatever is needed to make post-copy recovery
> > > work has to be done for (c).
> > 
> > But then doesn't that mean you're requiring us to break (d) and change
> > the QMP interface to libvirt so it can do async stuff?
> 
> Depends on your definition of break - I'm assuming there's either a way
> to opt-in to use of an async mode for existing commands in (c), or that
> async commands would be added in parallel with existing sync commands.
> IOW, it's not an API breakage - it's an opt-in extension of existing
> functionality.

But you'd need to do async commands for all commands you issued to avoid
blocking the io thread so that you could then issue the recovery
commands.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 10:57           ` Dr. David Alan Gilbert
@ 2017-09-06 11:06             ` Daniel P. Berrange
  2017-09-06 11:31               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-06 11:06 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Wed, Sep 06, 2017 at 11:57:05AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Wed, Sep 06, 2017 at 11:48:51AM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > On Wed, Sep 06, 2017 at 10:48:46AM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > > > > > v2:
> > > > > > > - fixed "make check" error that patchew reported
> > > > > > > - moved the thread_join upper in monitor_data_destroy(), before
> > > > > > >   resources are released
> > > > > > > - added one new patch (current patch 3) that fixes a nasty risk
> > > > > > >   condition with IOWatchPoll.  Please see commit message for more
> > > > > > >   information.
> > > > > > > - added a g_main_context_wakeup() to make sure the separate loop
> > > > > > >   thread can be kicked always when we want to destroy the per-monitor
> > > > > > >   threads.
> > > > > > > - added one new patch (current patch 8) to introduce migration mgmt
> > > > > > >   lock for migrate_incoming.
> > > > > > > 
> > > > > > > This is an extended work for migration postcopy recovery. This series
> > > > > > > is tested with the following series to make sure it solves the monitor
> > > > > > > hang problem that we have encountered for postcopy recovery:
> > > > > > > 
> > > > > > >   [RFC 00/29] Migration: postcopy failure recovery
> > > > > > >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > > > > > > 
> > > > > > > The root problem is that, monitor commands are all handled in main
> > > > > > > loop thread now, no matter how many monitors we specify. And, if main
> > > > > > > loop thread hangs due to some reason, all monitors will be stuck.
> > > > > > > This can be done in reversed order as well: if any of the monitor
> > > > > > > hangs, it will hang the main loop, and the rest of the monitors (if
> > > > > > > there is any).
> > > > > > > 
> > > > > > > That affects postcopy recovery, since the recovery requires user input
> > > > > > > on destination side.  If monitors hang, the destination VM dies and
> > > > > > > lose hope for even a final recovery.
> > > > > > > 
> > > > > > > So, sometimes we need to make sure the monitor be alive, at least one
> > > > > > > of them.
> > > > > > > 
> > > > > > > > The whole idea of this series is that instead of handling monitor
> > > > > > > commands all in main loop thread, we do it separately in per-monitor
> > > > > > > threads.  Then, even if main loop thread hangs at any point by any
> > > > > > > reason, per-monitor thread can still survive.  Further, we add hint in
> > > > > > > > QMP/HMP to show whether a command can be executed without BQL; if so,
> > > > > > > we avoid taking BQL when running that command.  It greatly reduced
> > > > > > > contention of BQL.  Now the only user of that new parameter (currently
> > > > > > > I call it "without-bql") is "migrate-incoming" command, which is the
> > > > > > > only command to rescue a paused postcopy migration.
> > > > > > > 
> > > > > > > However, even with the series, it does not mean that per-monitor
> > > > > > > > threads will never hang.  One example is that we can still run "info
> > > > > > > > cpus" in per-monitor threads during a paused postcopy (in that state,
> > > > > > > > page faults are never handled, and "info cpus" will never return since
> > > > > > > > it tries to sync every vcpu).  So to make sure it does not hang, we
> > > > > > > not only need the per-monitor thread, the user should be careful as
> > > > > > > well on how to use it.
> > > > > > > 
> > > > > > > For postcopy recovery, we may need dedicated monitor channel for
> > > > > > > recovery.  In other words, a destination VM that supports postcopy
> > > > > > > recovery would possibly need:
> > > > > > > 
> > > > > > >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > > > > > 
> > > > > > I think this is a really horrible thing to expose to management applications.
> > > > > > They should not need to be aware of fact that QEMU is buggy and thus requires
> > > > > > that certain commands be run on different monitors to work around the bug.
> > > > > 
> > > > > It's unfortunately baked in way too deep to fix in the near term; the
> > > > > > BQL is just too contagious and we have a fundamental design of running
> > > > > all the main IO emulation in one thread.
> > > > > 
> > > > > > I'd much prefer to see the problem described handled transparently inside
> > > > > > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > > > > > monitor I/O. This thread should never actually execute monitor commands
> > > > > > though, it would simply parse the command request and put data onto a queue
> > > > > > of pending commands, thus it could never hang. The command queue could be
> > > > > > processed by the main thread, or by another thread that is interested.
> > > > > > eg the migration thread could process any queued commands related to
> > > > > > migration directly.
> > > > > 
> > > > > That requires a change in the current API to allow async command
> > > > > completion (OK that is something Marc-Andre's world has) so that
> > > > > from the one connection you can have multiple outstanding commands.
> > > > > Hmm unless....
> > > > > 
> > > > > We've also got problems that some commands don't like being run outside
> > > > > of the main thread (see Fam's reply on the 21st pointing out that a lot
> > > > > of block commands would assert).
> > > > > 
> > > > > I think the way to move to what you describe would be:
> > > > >   a) A separate thread for monitor IO
> > > > >       This seems a separate problem
> > > > >       How hard is that?  Will all the current IO mechanisms used
> > > > >       for monitors just work if we run them in a separate thread?
> > > > >       What about mux?
> > > > > 
> > > > >   b) Initially all commands get dispatched to the main thread
> > > > >      so nothing changes about the API.
> > > > > 
> > > > >   c) We create a new thread for the lock-free commands, and route
> > > > >       lock-free commands down it.
> > > > > 
> > > > >   d) We start with a rule that on any one monitor connection we
> > > > >   don't allow you to start a command until the previous one has
> > > > >   finished
> > > > > 
> > > > > (d) allows us to avoid any API changes, but allows us to do lock-free
> > > > > stuff on a separate connection like Peter's world.
> > > > > We can drop (d) once we have a way of doing async commands.
> > > > > We can add dispatching to more threads once someone describes
> > > > > what they want from those threads.
> > > > > 
> > > > > Does that work for you Dan?
> > > > 
> > > > It would *provided* that we do (c) for the commands Peter wants for
> > > > this migration series.  IOW, I don't want to have to have logic in
> > > > libvirt that either needs to add a 2nd monitor server, or open a 2nd
> > > > monitor connection, to deal with migration post-copy recovery in some
> > > > versions of QEMU.  So whatever is needed to make post-copy recovery
> > > > work has to be done for (c).
> > > 
> > > But then doesn't that mean you're requiring us to break (d) and change
> > > the QMP interface to libvirt so it can do async stuff?
> > 
> > Depends on your definition of break - I'm assuming there's either a way
> > to opt-in to use of an async mode for existing commands in (c), or that
> > async commands would be added in parallel with existing sync commands.
> > IOW, it's not an API breakage - it's an opt-in extension of existing
> > functionality.
> 
> But you'd need to do async commands for all commands you issued to avoid
> blocking the io thread so that you could then issue the recovery
> commands.

I don't see why that has to be the case. In order to issue an async command
all that needs to be the case is that command replies should be allowed to
be sent out of order.

IOW if command A is blocking and command B is async, then we should be
allowed to have any of the following orderings:

   req A
   req B
   res A
   res B

Or

   req A
   req B
   res B
   res A

Or

   req B
   req A
   res B
   res A

etc.

This does imply that monitor I/O processing must be separate from the
command execution thread, but I see no need for all commands to suddenly
become async. Just allowing interleaved replies is sufficient from the
POV of the protocol definition. This interleaving is easy to handle from
the client POV - it just requires a unique 'serial' in the request from
the client, which QEMU copies into the reply.
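
[Editorial sketch, not part of the original mail.] The split described
above (an I/O side that only parses and queues, plus separate threads
that execute) can be illustrated with a small Python model. Everything
here is an assumption made for illustration: the command names "A" and
"B", the two queues and the helper names are hypothetical and are not
QEMU code. It shows only that a blocked command A does not prevent the
reply to B from going out first, and that the 'id' tells the client
which reply is which.

```python
import queue
import threading

replies = queue.Queue()        # replies, in the order they are produced
main_q = queue.Queue()         # commands run by the stand-in "main" thread
async_q = queue.Queue()        # commands run by a second executor thread
release_a = threading.Event()  # lets the demo keep command A blocked


def cmd_a():
    release_a.wait()           # stands in for a command stuck on a lock
    return "A done"


def cmd_b():
    return "B done"


# Hypothetical routing table: command name -> (handler, executor queue).
HANDLERS = {"A": (cmd_a, main_q), "B": (cmd_b, async_q)}


def io_dispatch(request):
    """Monitor I/O side: parse and enqueue only, never execute."""
    handler, target = HANDLERS[request["execute"]]
    target.put((request["id"], handler))


def worker(q):
    """Executor side: run queued commands, copy the id into the reply."""
    while True:
        item = q.get()
        if item is None:       # sentinel: shut the worker down
            break
        req_id, handler = item
        replies.put({"return": handler(), "id": req_id})


threads = [threading.Thread(target=worker, args=(q,)) for q in (main_q, async_q)]
for t in threads:
    t.start()

io_dispatch({"execute": "A", "id": 1})   # req A (blocks)
io_dispatch({"execute": "B", "id": 2})   # req B
first = replies.get(timeout=5)           # res B arrives while A is stuck
release_a.set()                          # now let A finish
second = replies.get(timeout=5)          # res A

for q in (main_q, async_q):
    q.put(None)
for t in threads:
    t.join()

print(first["id"], second["id"])
```

This corresponds to the "req A, req B, res B, res A" ordering listed
earlier in this mail; the I/O side never waits on a command, so it
stays responsive whatever the executors are doing.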

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 11:06             ` Daniel P. Berrange
@ 2017-09-06 11:31               ` Dr. David Alan Gilbert
  2017-09-06 11:54                 ` Daniel P. Berrange
  2017-09-07 13:59                 ` Eric Blake
  0 siblings, 2 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-06 11:31 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Wed, Sep 06, 2017 at 11:57:05AM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > On Wed, Sep 06, 2017 at 11:48:51AM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > On Wed, Sep 06, 2017 at 10:48:46AM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > > > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > > > > > > v2:
> > > > > > > > - fixed "make check" error that patchew reported
> > > > > > > > - moved the thread_join upper in monitor_data_destroy(), before
> > > > > > > >   resources are released
> > > > > > > > - added one new patch (current patch 3) that fixes a nasty risk
> > > > > > > >   condition with IOWatchPoll.  Please see commit message for more
> > > > > > > >   information.
> > > > > > > > - added a g_main_context_wakeup() to make sure the separate loop
> > > > > > > >   thread can be kicked always when we want to destroy the per-monitor
> > > > > > > >   threads.
> > > > > > > > - added one new patch (current patch 8) to introduce migration mgmt
> > > > > > > >   lock for migrate_incoming.
> > > > > > > > 
> > > > > > > > This is an extended work for migration postcopy recovery. This series
> > > > > > > > is tested with the following series to make sure it solves the monitor
> > > > > > > > hang problem that we have encountered for postcopy recovery:
> > > > > > > > 
> > > > > > > >   [RFC 00/29] Migration: postcopy failure recovery
> > > > > > > >   [RFC 0/6] migration: re-use migrate_incoming for postcopy recovery
> > > > > > > > 
> > > > > > > > The root problem is that, monitor commands are all handled in main
> > > > > > > > loop thread now, no matter how many monitors we specify. And, if main
> > > > > > > > loop thread hangs due to some reason, all monitors will be stuck.
> > > > > > > > This can be done in reversed order as well: if any of the monitor
> > > > > > > > hangs, it will hang the main loop, and the rest of the monitors (if
> > > > > > > > there is any).
> > > > > > > > 
> > > > > > > > That affects postcopy recovery, since the recovery requires user input
> > > > > > > > on destination side.  If monitors hang, the destination VM dies and
> > > > > > > > lose hope for even a final recovery.
> > > > > > > > 
> > > > > > > > So, sometimes we need to make sure the monitor be alive, at least one
> > > > > > > > of them.
> > > > > > > > 
> > > > > > > > > The whole idea of this series is that instead of handling monitor
> > > > > > > > commands all in main loop thread, we do it separately in per-monitor
> > > > > > > > threads.  Then, even if main loop thread hangs at any point by any
> > > > > > > > reason, per-monitor thread can still survive.  Further, we add hint in
> > > > > > > > > QMP/HMP to show whether a command can be executed without BQL; if so,
> > > > > > > > we avoid taking BQL when running that command.  It greatly reduced
> > > > > > > > contention of BQL.  Now the only user of that new parameter (currently
> > > > > > > > I call it "without-bql") is "migrate-incoming" command, which is the
> > > > > > > > only command to rescue a paused postcopy migration.
> > > > > > > > 
> > > > > > > > However, even with the series, it does not mean that per-monitor
> > > > > > > > > threads will never hang.  One example is that we can still run "info
> > > > > > > > > cpus" in per-monitor threads during a paused postcopy (in that state,
> > > > > > > > > page faults are never handled, and "info cpus" will never return since
> > > > > > > > > it tries to sync every vcpu).  So to make sure it does not hang, we
> > > > > > > > not only need the per-monitor thread, the user should be careful as
> > > > > > > > well on how to use it.
> > > > > > > > 
> > > > > > > > For postcopy recovery, we may need dedicated monitor channel for
> > > > > > > > recovery.  In other words, a destination VM that supports postcopy
> > > > > > > > recovery would possibly need:
> > > > > > > > 
> > > > > > > >   -qmp MAIN_CHANNEL -qmp RECOVERY_CHANNEL
> > > > > > > 
> > > > > > > I think this is a really horrible thing to expose to management applications.
> > > > > > > They should not need to be aware of fact that QEMU is buggy and thus requires
> > > > > > > that certain commands be run on different monitors to work around the bug.
> > > > > > 
> > > > > > It's unfortunately baked in way too deep to fix in the near term; the
> > > > > > > BQL is just too contagious and we have a fundamental design of running
> > > > > > all the main IO emulation in one thread.
> > > > > > 
> > > > > > > I'd much prefer to see the problem described handled transparently inside
> > > > > > > QEMU. One approach is have a dedicated thread in QEMU responsible for all
> > > > > > > monitor I/O. This thread should never actually execute monitor commands
> > > > > > > though, it would simply parse the command request and put data onto a queue
> > > > > > > of pending commands, thus it could never hang. The command queue could be
> > > > > > > processed by the main thread, or by another thread that is interested.
> > > > > > > eg the migration thread could process any queued commands related to
> > > > > > > migration directly.
> > > > > > 
> > > > > > That requires a change in the current API to allow async command
> > > > > > completion (OK that is something Marc-Andre's world has) so that
> > > > > > from the one connection you can have multiple outstanding commands.
> > > > > > Hmm unless....
> > > > > > 
> > > > > > We've also got problems that some commands don't like being run outside
> > > > > > of the main thread (see Fam's reply on the 21st pointing out that a lot
> > > > > > of block commands would assert).
> > > > > > 
> > > > > > I think the way to move to what you describe would be:
> > > > > >   a) A separate thread for monitor IO
> > > > > >       This seems a separate problem
> > > > > >       How hard is that?  Will all the current IO mechanisms used
> > > > > >       for monitors just work if we run them in a separate thread?
> > > > > >       What about mux?
> > > > > > 
> > > > > >   b) Initially all commands get dispatched to the main thread
> > > > > >      so nothing changes about the API.
> > > > > > 
> > > > > >   c) We create a new thread for the lock-free commands, and route
> > > > > >       lock-free commands down it.
> > > > > > 
> > > > > >   d) We start with a rule that on any one monitor connection we
> > > > > >   don't allow you to start a command until the previous one has
> > > > > >   finished
> > > > > > 
> > > > > > (d) allows us to avoid any API changes, but allows us to do lock-free
> > > > > > stuff on a separate connection like Peter's world.
> > > > > > We can drop (d) once we have a way of doing async commands.
> > > > > > We can add dispatching to more threads once someone describes
> > > > > > what they want from those threads.
> > > > > > 
> > > > > > Does that work for you Dan?
> > > > > 
> > > > > It would *provided* that we do (c) for the commands Peter wants for
> > > > > this migration series.  IOW, I don't want to have to have logic in
> > > > > libvirt that either needs to add a 2nd monitor server, or open a 2nd
> > > > > monitor connection, to deal with migration post-copy recovery in some
> > > > > versions of QEMU.  So whatever is needed to make post-copy recovery
> > > > > work has to be done for (c).
> > > > 
> > > > But then doesn't that mean you're requiring us to break (d) and change
> > > > the QMP interface to libvirt so it can do async stuff?
> > > 
> > > Depends on your definition of break - I'm assuming there's either a way
> > > to opt-in to use of an async mode for existing commands in (c), or that
> > > async commands would be added in parallel with existing sync commands.
> > > IOW, it's not an API breakage - it's an opt-in extension of existing
> > > functionality.
> > 
> > But you'd need to do async commands for all commands you issued to avoid
> > blocking the io thread so that you could then issue the recovery
> > commands.
> 
> I don't see why that has to be the case. In order to issue an async command
> all that needs to be the case is that command replies should be allowed to
> be sent out of order.
> 
> IOW if command A is blocking and command B is async, then we should be
> allowed to have the following
> 
>    req A
>    req B
>    res A
>    res B
> 
> Or
> 
>    req A
>    req B
>    res B
>    res A
> 
> Or
> 
>    req B
>    req A
>    res B
>    res A
> 
> etc.
> 
> This does imply that you need a separate monitor I/O processing, from the
> command execution thread, but I see no need for all commands to suddenly
> become async. Just allowing interleaved replies is sufficient from the
> POV of the protocol definition. This interleaving is easy to handle from
> the client POV - just requires a unique 'serial' in the request by the
> client, that is copied into the reply by QEMU.

OK, so for that we can just take Marc-André's syntax and call it 'id':
  https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html

Then it's up to the caller to ensure those ids are unique.

I do worry about two things:
  a) With this the caller doesn't really know which commands could
  run in parallel - for example if we've got a recovery command that's
  executed by this non-locking thread that's OK, we expect that
  to be doable in parallel.  If in the future though we do
  what you initially suggested and have a bunch of commands get
  routed to the migration thread (say) then those would suddenly
  operate in parallel with other commands that were previously
  synchronous.

  b) I still worry how the various IO channels will behave on another
  thread.  But that's more a general feeling rather than anything
  specific.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 11:31               ` Dr. David Alan Gilbert
@ 2017-09-06 11:54                 ` Daniel P. Berrange
  2017-09-07  8:13                   ` Peter Xu
  2017-09-07 10:04                   ` Dr. David Alan Gilbert
  2017-09-07 13:59                 ` Eric Blake
  1 sibling, 2 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-06 11:54 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > This does imply that you need a separate monitor I/O processing, from the
> > command execution thread, but I see no need for all commands to suddenly
> > become async. Just allowing interleaved replies is sufficient from the
> > POV of the protocol definition. This interleaving is easy to handle from
> > the client POV - just requires a unique 'serial' in the request by the
> > client, that is copied into the reply by QEMU.
> 
> OK, so for that we can just take Marc-André's syntax and call it 'id':
>   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> 
> then it's up to the caller to ensure those ids are unique.

Libvirt has in fact generated a unique 'id' for every monitor command
since day 1 of supporting QMP.
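
[Editorial sketch, not part of the original mail.] On the client side,
matching interleaved replies needs little more than a table of
outstanding ids. The Python below is a minimal sketch under stated
assumptions: the class name QmpClientSketch and the "sketch-N" id
format are hypothetical, and this is not how libvirt implements it.
Only the "execute", "arguments" and "id" members are taken from QMP
itself.

```python
import itertools
import json


class QmpClientSketch:
    """Hypothetical client helper: unique ids plus a pending-request table."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._pending = {}  # id -> command name still awaiting a reply

    def build_request(self, command, arguments=None):
        """Return the JSON text for a request carrying a fresh unique id."""
        req_id = f"sketch-{next(self._counter)}"
        message = {"execute": command, "id": req_id}
        if arguments is not None:
            message["arguments"] = arguments
        self._pending[req_id] = command
        return json.dumps(message)

    def handle_reply(self, line):
        """Match a reply to its request by id, in whatever order it arrives."""
        reply = json.loads(line)
        command = self._pending.pop(reply["id"])
        return command, reply

    def outstanding(self):
        """Names of commands that have not been answered yet."""
        return sorted(self._pending.values())
```

Because each reply is looked up by id rather than by arrival order, the
client needs no knowledge of which commands the server chose to run in
parallel.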

> I do worry about two things:
>   a) With this the caller doesn't really know which commands could be
>   in parallel - for example if we've got a recovery command that's
>   executed by this non-locking thread that's OK, we expect that
>   to be doable in parallel.  If in the future though we do
>   what you initially suggested and have a bunch of commands get
>   routed to the migration thread (say) then those would suddenly
>   operate in parallel with other commands that were previously
>   synchronous.

We could still have an opt-in for async commands, e.g. default to executing
all commands in the main thread, unless the client issues an explicit
"make it async" command to allow the migration thread to process the
listed commands asynchronously.

 { "execute": "qmp_allow_async",
   "data": { "commands": [
       "migrate_cancel"
   ] } }


 { "return": { "commands": [
       "migrate_cancel"
   ] } }

The server response contains the subset of commands from the request
for which async is supported.

That gives good negotiation ability going forward as we incrementally
support async on more commands.
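
[Editorial sketch, not part of the original mail.] The negotiation
amounts to intersecting the client's request with what the server can
run asynchronously. In the Python sketch below, "qmp_allow_async" is
the command proposed in this mail and does not exist in QEMU; the
contents of SERVER_ASYNC_CAPABLE and the function name are assumptions
chosen for illustration.

```python
# Hypothetical: the commands this server build could run outside the
# main thread. In a real server this would come from command metadata.
SERVER_ASYNC_CAPABLE = frozenset({"migrate_cancel", "migrate-incoming"})


def handle_qmp_allow_async(requested):
    """Reply with the subset of requested commands that support async.

    The client's ordering is preserved so the reply is easy to compare
    against the request.
    """
    granted = [name for name in requested if name in SERVER_ASYNC_CAPABLE]
    return {"return": {"commands": granted}}


if __name__ == "__main__":
    # An older server would simply return a shorter (or empty) list.
    print(handle_qmp_allow_async(["migrate_cancel", "query-block"]))
```

A client can then treat any command missing from the reply as
synchronous, which is what makes the scheme safe to extend one command
at a time.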

>   b) I still worry how the various IO channels will behave on another
>   thread.  But that's more a general feeling rather than anything
>   specific.

The only complexity will be around making sure the Chardev code uses
the right GMainContext for any watches on the underlying QIOChannel,
so that we poll() from the custom thread instead of the main thread.
IOW, as long as all I/O is done from the single thread everything
should work fine.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
                   ` (8 preceding siblings ...)
  2017-08-29 11:03 ` [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Daniel P. Berrange
@ 2017-09-06 14:50 ` Stefan Hajnoczi
  2017-09-06 15:14   ` Dr. David Alan Gilbert
  9 siblings, 1 reply; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-06 14:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini,
	Dr . David Alan Gilbert

On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> The root problem is that, monitor commands are all handled in main
> loop thread now, no matter how many monitors we specify. And, if main
> loop thread hangs due to some reason, all monitors will be stuck.

I see a larger issue with postcopy: existing QEMU code assumes that
guest memory access is instantaneous.

Postcopy breaks this assumption and introduces blocking points that can
now take unbounded time.

This problem isn't specific to the monitor.  It can also happen to other
components in QEMU like the gdbstub.

Do we need an asynchronous memory API?  Synchronous memory access should
only be allowed in vcpu threads.

Stefan


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 14:50 ` Stefan Hajnoczi
@ 2017-09-06 15:14   ` Dr. David Alan Gilbert
  2017-09-07  7:38     ` Peter Xu
  2017-09-07  8:58     ` Stefan Hajnoczi
  0 siblings, 2 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-06 15:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, mdroth, Paolo Bonzini

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > The root problem is that, monitor commands are all handled in main
> > loop thread now, no matter how many monitors we specify. And, if main
> > loop thread hangs due to some reason, all monitors will be stuck.
> 
> I see a larger issue with postcopy: existing QEMU code assumes that
> guest memory access is instantaneous.
> 
> Postcopy breaks this assumption and introduces blocking points that can
> now take unbounded time.
> 
> This problem isn't specific to the monitor.  It can also happen to other
> components in QEMU like the gdbstub.
> 
> Do we need an asynchronous memory API?  Synchronous memory access should
> only be allowed in vcpu threads.

It would probably be useful for gdbstub where the overhead of async
doesn't matter;  but doing that for all IO emulation is hard.

Dave

> Stefan
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 15:14   ` Dr. David Alan Gilbert
@ 2017-09-07  7:38     ` Peter Xu
  2017-09-07  8:58     ` Stefan Hajnoczi
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-09-07  7:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, qemu-devel, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, mdroth, Paolo Bonzini

On Wed, Sep 06, 2017 at 04:14:37PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> > > The root problem is that, monitor commands are all handled in main
> > > loop thread now, no matter how many monitors we specify. And, if main
> > > loop thread hangs due to some reason, all monitors will be stuck.
> > 
> > I see a larger issue with postcopy: existing QEMU code assumes that
> > guest memory access is instantaneous.
> > 
> > Postcopy breaks this assumption and introduces blocking points that can
> > now take unbounded time.
> > 
> > This problem isn't specific to the monitor.  It can also happen to other
> > components in QEMU like the gdbstub.
> > 
> > Do we need an asynchronous memory API?  Synchronous memory access should
> > only be allowed in vcpu threads.
> 
> It would probably be useful for gdbstub where the overhead of async
> doesn't matter;  but doing that for all IO emulation is hard.

Agreed.

IIUC one problem is that we likely have code that caches the HVA (host
virtual address) for a specific GPA (guest physical address); when it
wants to write to that GPA, it writes directly to the corresponding HVA
and no memory API is used.  I am not sure whether it's possible to
convert all these usages to the memory API (even if it supported async
operations).

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 11:54                 ` Daniel P. Berrange
@ 2017-09-07  8:13                   ` Peter Xu
  2017-09-07  8:49                     ` Stefan Hajnoczi
                                       ` (3 more replies)
  2017-09-07 10:04                   ` Dr. David Alan Gilbert
  1 sibling, 4 replies; 104+ messages in thread
From: Peter Xu @ 2017-09-07  8:13 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > This does imply that you need a separate monitor I/O processing, from the
> > > command execution thread, but I see no need for all commands to suddenly
> > > become async. Just allowing interleaved replies is sufficient from the
> > > POV of the protocol definition. This interleaving is easy to handle from
> > > the client POV - just requires a unique 'serial' in the request by the
> > > client, that is copied into the reply by QEMU.
> > 
> > OK, so for that we can just take Marc-André's syntax and call it 'id':
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > 
> > then it's up to the caller to ensure those ids are unique.
> 
> Libvirt has in fact generated a unique 'id' for every monitor command
> since day 1 of supporting QMP.
> 
> > I do worry about two things:
> >   a) With this the caller doesn't really know which commands could be
> >   in parallel - for example if we've got a recovery command that's
> >   executed by this non-locking thread that's OK, we expect that
> >   to be doable in parallel.  If in the future though we do
> >   what you initially suggested and have a bunch of commands get
> >   routed to the migration thread (say) then those would suddenly
> >   operate in parallel with other commands that were previously
> >   synchronous.
> 
> We could still have an opt-in for async commands. eg default to executing
> all commands in the main thread, unless the client issues an explicit
> "make it async" command, to switch to allowing the migration thread to
> process it async.
> 
>  { "execute": "qmp_allow_async",
>    "data": { "commands": [
>        "migrate_cancel",
>    ] } }
> 
> 
>  { "return": { "commands": [
>        "migrate_cancel",
>    ] } }
> 
> The server response contains the subset of commands from the request
> for which async is supported.
> 
> That gives good negotiation ability going forward as we incrementally
> support async on more commands.

I think this goes back to the discussion on which design we'd like to
choose.  IMHO the whole async idea plus the per-command-id is indeed
cleaner and nicer, and I believe that can benefit not only libvirt,
but also other QMP users.  The problem is, I have no idea how long
it'll take to get such a feature - I believe it will require both
QEMU and Libvirt to support it.  And it'll be a pity if postcopy
recovery cannot work only because we cannot guarantee a stable
monitor.

I'm curious whether there are other requirements (besides postcopy
recovery) that would want an always-alive monitor to run some
lock-free commands?  If there are, I'd be more inclined to first
provide a work-around solution like "-qmp-lockfree", and then provide
a better solution afterwards, once the whole async QMP work is
ready.
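
To make the "per-command-id" part concrete, here is a minimal Python
sketch of the client-side bookkeeping, assuming each request carries a
unique 'id' that is copied into its reply, so replies may arrive in
any order.  PendingReplies and negotiate_async are hypothetical names
used for illustration only; they are not QEMU or libvirt code, and
negotiate_async only models the "reply with the supported subset" idea
from the qmp_allow_async proposal quoted above.

```python
import json


class PendingReplies:
    """Client-side bookkeeping for interleaved replies (hypothetical helper)."""

    def __init__(self):
        self._next_id = 0
        self._pending = {}  # id -> command name still awaiting a reply

    def build_request(self, command, arguments=None):
        """Return a JSON request string tagged with a fresh, unique 'id'."""
        self._next_id += 1
        request = {"execute": command, "id": self._next_id}
        if arguments is not None:
            request["arguments"] = arguments
        self._pending[self._next_id] = command
        return json.dumps(request)

    def match_reply(self, line):
        """Return (command, reply) for one reply line, whatever the arrival order."""
        reply = json.loads(line)
        command = self._pending.pop(reply["id"])
        return command, reply


def negotiate_async(requested, supported):
    """Model of the proposed opt-in: keep only the commands that can run async."""
    return [name for name in requested if name in supported]
```

With this shape the client never depends on reply ordering, which is
why a unique id per command is enough on the protocol side.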

> 
> >   b) I still worry how the various IO channels will behave on another
> >   thread.  But that's more a general feeling rather than anything
> >   specific.
> 
> The only complexity will be around making sure the Chardev code uses
> the right GMainContext for any watches on the underlying QIOChannel,
> so that we poll() from the custom thread instead of the main thread.
> IOW, as long as all I/O is done from the single thread everything
> should work fine.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

-- 
Peter Xu

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:13                   ` Peter Xu
@ 2017-09-07  8:49                     ` Stefan Hajnoczi
  2017-09-07  9:18                       ` Dr. David Alan Gilbert
  2017-09-07  8:55                     ` Daniel P. Berrange
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-07  8:49 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrange, Laurent Vivier, Fam Zheng, Juan Quintela,
	qemu-devel, Markus Armbruster, Michael Roth,
	Dr. David Alan Gilbert, Paolo Bonzini

On Thu, Sep 7, 2017 at 9:13 AM, Peter Xu <peterx@redhat.com> wrote:
> On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
>> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
>> > * Daniel P. Berrange (berrange@redhat.com) wrote:
>> > > This does imply that you need a separate monitor I/O processing, from the
>> > > command execution thread, but I see no need for all commands to suddenly
>> > > become async. Just allowing interleaved replies is sufficient from the
>> > > POV of the protocol definition. This interleaving is easy to handle from
>> > > the client POV - just requires a unique 'serial' in the request by the
>> > > client, that is copied into the reply by QEMU.
>> >
>> > OK, so for that we can just take Marc-André's syntax and call it 'id':
>> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
>> >
>> > then it's upto the caller to ensure those id's are unique.
>>
>> Libvirt has in fact generated a unique 'id' for every monitor command
>> since day 1 of supporting QMP.
>>
>> > I do worry about two things:
>> >   a) With this the caller doesn't really know which commands could be
>> >   in parallel - for example if we've got a recovery command that's
>> >   executed by this non-locking thread that's OK, we expect that
>> >   to be doable in parallel.  If in the future though we do
>> >   what you initially suggested and have a bunch of commands get
>> >   routed to the migration thread (say) then those would suddenly
>> >   operate in parallel with other commands that we're previously
>> >   synchronous.
>>
>> We could still have an opt-in for async commands. eg default to executing
>> all commands in the main thread, unless the client issues an explicit
>> "make it async" command, to switch to allowing the migration thread to
>> process it async.
>>
>>  { "execute": "qmp_allow_async",
>>    "data": { "commands": [
>>        "migrate_cancel",
>>    ] } }
>>
>>
>>  { "return": { "commands": [
>>        "migrate_cancel",
>>    ] } }
>>
>> The server response contains the subset of commands from the request
>> for which async is supported.
>>
>> That gives good negotiation ability going forward as we incrementally
>> support async on more commands.
>
> I think this goes back to the discussion on which design we'd like to
> choose.  IMHO the whole async idea plus the per-command-id is indeed
> cleaner and nicer, and I believe that can benefit not only libvirt,
> but also other QMP users.  The problem is, I have no idea how long
> it'll take to let us have such a feature - I believe that will include
> QEMU and Libvirt to both support that.  And it'll be a pity if the
> postcopy recovery cannot work only because we cannot guarantee a
> stable monitor.

Please don't rush in a hack; hacks often introduce new bugs that we
have to support long-term once they are part of the QMP API.

In your original email you mentioned "info cpus".  Have you considered
modifying this command so it does not sync the CPU?  I'm not sure
callers really need to sync the CPU, typically they just want to know
the vcpu numbers, thread IDs, and current state (halted, running,
etc).

The next step after that would be to audit other monitor commands for
unnecessary vcpu synchronization.

Stefan

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:13                   ` Peter Xu
  2017-09-07  8:49                     ` Stefan Hajnoczi
@ 2017-09-07  8:55                     ` Daniel P. Berrange
  2017-09-07  9:19                       ` Dr. David Alan Gilbert
  2017-09-07  9:15                     ` Dr. David Alan Gilbert
  2017-09-07 12:59                     ` Markus Armbruster
  3 siblings, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-07  8:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, qemu-devel, Paolo Bonzini, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

On Thu, Sep 07, 2017 at 04:13:41PM +0800, Peter Xu wrote:
> On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > This does imply that you need a separate monitor I/O processing, from the
> > > > command execution thread, but I see no need for all commands to suddenly
> > > > become async. Just allowing interleaved replies is sufficient from the
> > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > the client POV - just requires a unique 'serial' in the request by the
> > > > client, that is copied into the reply by QEMU.
> > > 
> > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > 
> > > then it's upto the caller to ensure those id's are unique.
> > 
> > Libvirt has in fact generated a unique 'id' for every monitor command
> > since day 1 of supporting QMP.
> > 
> > > I do worry about two things:
> > >   a) With this the caller doesn't really know which commands could be
> > >   in parallel - for example if we've got a recovery command that's
> > >   executed by this non-locking thread that's OK, we expect that
> > >   to be doable in parallel.  If in the future though we do
> > >   what you initially suggested and have a bunch of commands get
> > >   routed to the migration thread (say) then those would suddenly
> > >   operate in parallel with other commands that we're previously
> > >   synchronous.
> > 
> > We could still have an opt-in for async commands. eg default to executing
> > all commands in the main thread, unless the client issues an explicit
> > "make it async" command, to switch to allowing the migration thread to
> > process it async.
> > 
> >  { "execute": "qmp_allow_async",
> >    "data": { "commands": [
> >        "migrate_cancel",
> >    ] } }
> > 
> > 
> >  { "return": { "commands": [
> >        "migrate_cancel",
> >    ] } }
> > 
> > The server response contains the subset of commands from the request
> > for which async is supported.
> > 
> > That gives good negotiation ability going forward as we incrementally
> > support async on more commands.
> 
> I think this goes back to the discussion on which design we'd like to
> choose.  IMHO the whole async idea plus the per-command-id is indeed
> cleaner and nicer, and I believe that can benefit not only libvirt,
> but also other QMP users.  The problem is, I have no idea how long
> it'll take to let us have such a feature - I believe that will include
> QEMU and Libvirt to both support that.  And it'll be a pity if the
> postcopy recovery cannot work only because we cannot guarantee a
> stable monitor.

This is not a blocker for having the postcopy recovery feature merged.
It merely means that in a situation where the mainloop is blocked,
we can't recover; in other situations we'll be able to recover
fine. Sure it would be nice to fix that problem too, but I don't
see it as a blocker.

I don't think the hacks proposed are a good tradeoff, compared to
fixing the fundamental problem with the monitor impl in QEMU. We
have discussed this monitor problem for years, pretty much since
day 1 of QMP being designed, but it never gets serious attention.
IMHO it is well overdue to change that and focus attention on the
root problem and not just punt it down the road yet again by adding
short term hacks.

Adding an extra monitor channel, even as a short term hack, is
*not* short term from libvirt's POV - we'll have to carry that
code for many years into the future, even after QEMU provides
a real fix. So even if QEMU provides such a short term hack, I
would nonetheless be strongly against libvirt using it.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 15:14   ` Dr. David Alan Gilbert
  2017-09-07  7:38     ` Peter Xu
@ 2017-09-07  8:58     ` Stefan Hajnoczi
  2017-09-07  9:35       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-07  8:58 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, Michael Roth, Paolo Bonzini

On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
>> > The root problem is that, monitor commands are all handled in main
>> > loop thread now, no matter how many monitors we specify. And, if main
>> > loop thread hangs due to some reason, all monitors will be stuck.
>>
>> I see a larger issue with postcopy: existing QEMU code assumes that
>> guest memory access is instantaneous.
>>
>> Postcopy breaks this assumption and introduces blocking points that can
>> now take unbounded time.
>>
>> This problem isn't specific to the monitor.  It can also happen to other
>> components in QEMU like the gdbstub.
>>
>> Do we need an asynchronous memory API?  Synchronous memory access should
>> only be allowed in vcpu threads.
>
> It would probably be useful for gdbstub where the overhead of async
> doesn't matter;  but doing that for all IO emulation is hard.

Why is it hard?

Memory access can be synchronous in the vcpu thread.  That eliminates
a lot of code straight away.

Anything using dma-helpers.c is already async.  They just don't know
that the memory access part is being made async too :).

The remaining cases are virtio and some other devices.

If you are worried about performance, the first rule is that async
memory access is only needed on the destination side when post-copy is
active.  Maybe use setjmp to return from the signal handler and queue
a callback for when the page has been loaded.
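
As a rough illustration of only the "queue a callback for when the
page has been loaded" part, here is a toy Python model.  It does not
model setjmp or the signal handler, and PostcopyMemory, read_async and
page_loaded are hypothetical names, not QEMU APIs.

```python
class PostcopyMemory:
    """Toy model: pages are either local or still to be fetched from the source."""

    def __init__(self, present):
        self._pages = dict(present)  # page number -> page contents
        self._waiters = {}           # page number -> callbacks queued until it arrives

    def read_async(self, page, callback):
        """Call back at once if the page is local, else queue until it loads.

        Returns True when the callback already ran, False when it was queued.
        """
        if page in self._pages:
            callback(self._pages[page])
            return True
        self._waiters.setdefault(page, []).append(callback)
        return False

    def page_loaded(self, page, data):
        """Simulate the page arriving; run every callback queued for it."""
        self._pages[page] = data
        for callback in self._waiters.pop(page, []):
            callback(data)
```

The point is that the caller never blocks: an access to a missing page
just parks a callback, so the thread issuing it stays responsive.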

Stefan

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:13                   ` Peter Xu
  2017-09-07  8:49                     ` Stefan Hajnoczi
  2017-09-07  8:55                     ` Daniel P. Berrange
@ 2017-09-07  9:15                     ` Dr. David Alan Gilbert
  2017-09-07  9:25                       ` Daniel P. Berrange
  2017-09-07 12:59                     ` Markus Armbruster
  3 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07  9:15 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrange, qemu-devel, Paolo Bonzini, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > This does imply that you need a separate monitor I/O processing, from the
> > > > command execution thread, but I see no need for all commands to suddenly
> > > > become async. Just allowing interleaved replies is sufficient from the
> > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > the client POV - just requires a unique 'serial' in the request by the
> > > > client, that is copied into the reply by QEMU.
> > > 
> > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > 
> > > then it's upto the caller to ensure those id's are unique.
> > 
> > Libvirt has in fact generated a unique 'id' for every monitor command
> > since day 1 of supporting QMP.
> > 
> > > I do worry about two things:
> > >   a) With this the caller doesn't really know which commands could be
> > >   in parallel - for example if we've got a recovery command that's
> > >   executed by this non-locking thread that's OK, we expect that
> > >   to be doable in parallel.  If in the future though we do
> > >   what you initially suggested and have a bunch of commands get
> > >   routed to the migration thread (say) then those would suddenly
> > >   operate in parallel with other commands that we're previously
> > >   synchronous.
> > 
> > We could still have an opt-in for async commands. eg default to executing
> > all commands in the main thread, unless the client issues an explicit
> > "make it async" command, to switch to allowing the migration thread to
> > process it async.
> > 
> >  { "execute": "qmp_allow_async",
> >    "data": { "commands": [
> >        "migrate_cancel",
> >    ] } }
> > 
> > 
> >  { "return": { "commands": [
> >        "migrate_cancel",
> >    ] } }
> > 
> > The server response contains the subset of commands from the request
> > for which async is supported.
> > 
> > That gives good negotiation ability going forward as we incrementally
> > support async on more commands.
> 
> I think this goes back to the discussion on which design we'd like to
> choose.  IMHO the whole async idea plus the per-command-id is indeed
> cleaner and nicer, and I believe that can benefit not only libvirt,
> but also other QMP users.  The problem is, I have no idea how long
> it'll take to let us have such a feature - I believe that will include
> QEMU and Libvirt to both support that.  And it'll be a pity if the
> postcopy recovery cannot work only because we cannot guarantee a
> stable monitor.

libvirt will need changes for postcopy recovery however we do it;
so we need to do it the way they want.

I think Dan's suggestion isn't as hard as it initially sounded;  a first
thing to try would be taking all the monitor IO into another thread
and feeding all commands to the main thread for execution - that sounds
like the hard part.
(I'm not sure how multiple monitors interact for this).
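
A minimal Python sketch of that split, assuming one I/O thread that
only parses input and a main thread that still executes every command
in order.  The function names and the handler table are hypothetical
and only stand in for the real monitor code; the 'id' is copied from
request to reply as discussed earlier in the thread.

```python
import json
import queue
import threading


def monitor_io_thread(lines, commands):
    """Parse incoming lines off the main thread and hand them over via a queue."""
    for line in lines:
        commands.put(json.loads(line))
    commands.put(None)  # sentinel: input closed


def main_thread_dispatch(commands, handlers):
    """Run queued commands one at a time; copy the request 'id' into each reply."""
    replies = []
    while True:
        request = commands.get()
        if request is None:
            break
        handler = handlers.get(request["execute"])
        if handler is None:
            reply = {"error": {"class": "CommandNotFound",
                               "desc": "unknown command"}}
        else:
            reply = {"return": handler(request.get("arguments", {}))}
        if "id" in request:
            reply["id"] = request["id"]
        replies.append(reply)
    return replies


def run_demo(lines, handlers):
    """Wire the two halves together: reader thread in, main-thread dispatch out."""
    commands = queue.Queue()
    reader = threading.Thread(target=monitor_io_thread, args=(lines, commands))
    reader.start()
    replies = main_thread_dispatch(commands, handlers)
    reader.join()
    return replies
```

In this sketch the main thread is still the only place commands run,
so existing commands keep their synchronous behaviour; only the socket
handling moves out.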

Dave

> I'm curious whether there are other requirements (besides postcopy
> recovery) that would want an always-alive monitor to run some
> lock-free commands?  If there is, I'd be more inclined to first
> provide a work-around solution like "-qmp-lockfree", and we can
> provide a better solution afterwards until when the whole async QMP
> work ready.
> 
> > 
> > >   b) I still worry how the various IO channels will behave on another
> > >   thread.  But that's more a general feeling rather than anything
> > >   specific.
> > 
> > The only complexity will be around making sure the Chardev code uses
> > the right GMainContext for any watches on the underlying QIOChannel,
> > so that we poll() from the custom thread instead of the main thread.
> > IOW, as long as all I/O is done from the single thread everything
> > should work fine.
> > 
> > Regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:49                     ` Stefan Hajnoczi
@ 2017-09-07  9:18                       ` Dr. David Alan Gilbert
  2017-09-07 10:19                         ` Stefan Hajnoczi
  2017-09-07 10:24                         ` Peter Xu
  0 siblings, 2 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07  9:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Xu, Daniel P. Berrange, Laurent Vivier, Fam Zheng,
	Juan Quintela, qemu-devel, Markus Armbruster, Michael Roth,
	Paolo Bonzini

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Thu, Sep 7, 2017 at 9:13 AM, Peter Xu <peterx@redhat.com> wrote:
> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> >> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> >> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> >> > > This does imply that you need a separate monitor I/O processing, from the
> >> > > command execution thread, but I see no need for all commands to suddenly
> >> > > become async. Just allowing interleaved replies is sufficient from the
> >> > > POV of the protocol definition. This interleaving is easy to handle from
> >> > > the client POV - just requires a unique 'serial' in the request by the
> >> > > client, that is copied into the reply by QEMU.
> >> >
> >> > OK, so for that we can just take Marc-André's syntax and call it 'id':
> >> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> >> >
> >> > then it's upto the caller to ensure those id's are unique.
> >>
> >> Libvirt has in fact generated a unique 'id' for every monitor command
> >> since day 1 of supporting QMP.
> >>
> >> > I do worry about two things:
> >> >   a) With this the caller doesn't really know which commands could be
> >> >   in parallel - for example if we've got a recovery command that's
> >> >   executed by this non-locking thread that's OK, we expect that
> >> >   to be doable in parallel.  If in the future though we do
> >> >   what you initially suggested and have a bunch of commands get
> >> >   routed to the migration thread (say) then those would suddenly
> >> >   operate in parallel with other commands that we're previously
> >> >   synchronous.
> >>
> >> We could still have an opt-in for async commands. eg default to executing
> >> all commands in the main thread, unless the client issues an explicit
> >> "make it async" command, to switch to allowing the migration thread to
> >> process it async.
> >>
> >>  { "execute": "qmp_allow_async",
> >>    "data": { "commands": [
> >>        "migrate_cancel",
> >>    ] } }
> >>
> >>
> >>  { "return": { "commands": [
> >>        "migrate_cancel",
> >>    ] } }
> >>
> >> The server response contains the subset of commands from the request
> >> for which async is supported.
> >>
> >> That gives good negotiation ability going forward as we incrementally
> >> support async on more commands.
> >
> > I think this goes back to the discussion on which design we'd like to
> > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > cleaner and nicer, and I believe that can benefit not only libvirt,
> > but also other QMP users.  The problem is, I have no idea how long
> > it'll take to let us have such a feature - I believe that will include
> > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > postcopy recovery cannot work only because we cannot guarantee a
> > stable monitor.
> 
> Please don't rush in a hack, they often introduce new bugs that we
> have to support long-term when they are part of the QMP API.
> 
> In your original email you mentioned "info cpus".  Have you considered
> modifying this command so it does not sync the CPU?  I'm not sure
> callers really need to sync the CPU, typically they just want to know
> the vcpu numbers, thread IDs, and current state (halted, running,
> etc).

But it has the pc as well, so that's actual state.

Dave

> The next step after that would be to audit other monitor commands for
> unnecessary vcpu synchronization.
> 
> Stefan
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:55                     ` Daniel P. Berrange
@ 2017-09-07  9:19                       ` Dr. David Alan Gilbert
  2017-09-07  9:22                         ` Daniel P. Berrange
  2017-09-07 11:19                         ` Markus Armbruster
  0 siblings, 2 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07  9:19 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Thu, Sep 07, 2017 at 04:13:41PM +0800, Peter Xu wrote:
> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > This does imply that you need a separate monitor I/O processing, from the
> > > > > command execution thread, but I see no need for all commands to suddenly
> > > > > become async. Just allowing interleaved replies is sufficient from the
> > > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > > the client POV - just requires a unique 'serial' in the request by the
> > > > > client, that is copied into the reply by QEMU.
> > > > 
> > > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > > 
> > > > then it's upto the caller to ensure those id's are unique.
> > > 
> > > Libvirt has in fact generated a unique 'id' for every monitor command
> > > since day 1 of supporting QMP.
> > > 
> > > > I do worry about two things:
> > > >   a) With this the caller doesn't really know which commands could be
> > > >   in parallel - for example if we've got a recovery command that's
> > > >   executed by this non-locking thread that's OK, we expect that
> > > >   to be doable in parallel.  If in the future though we do
> > > >   what you initially suggested and have a bunch of commands get
> > > >   routed to the migration thread (say) then those would suddenly
> > > >   operate in parallel with other commands that we're previously
> > > >   synchronous.
> > > 
> > > We could still have an opt-in for async commands. eg default to executing
> > > all commands in the main thread, unless the client issues an explicit
> > > "make it async" command, to switch to allowing the migration thread to
> > > process it async.
> > > 
> > >  { "execute": "qmp_allow_async",
> > >    "data": { "commands": [
> > >        "migrate_cancel",
> > >    ] } }
> > > 
> > > 
> > >  { "return": { "commands": [
> > >        "migrate_cancel",
> > >    ] } }
> > > 
> > > The server response contains the subset of commands from the request
> > > for which async is supported.
> > > 
> > > That gives good negotiation ability going forward as we incrementally
> > > support async on more commands.
> > 
> > I think this goes back to the discussion on which design we'd like to
> > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > cleaner and nicer, and I believe that can benefit not only libvirt,
> > but also other QMP users.  The problem is, I have no idea how long
> > it'll take to let us have such a feature - I believe that will include
> > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > postcopy recovery cannot work only because we cannot guarantee a
> > stable monitor.
> 
> This is not a blocker for having postcopy recovery feature merged.
> It merely means that in a situation where the mainloop is blocked,
> then we can't recover, in other situations we'll be able to recover
> fine. Sure it would be nice to fix that problem too, but I don't
> see it as a block.

It's probably OK to merge the recovery code before the monitor code;
but I don't think it's something you'd want to tell users about -
a 'postcopy recovery that only works rarely' isn't much use.

Dave

> I don't think the hacks proposed are a good tradeoff, compared to
> fixing the fundamental problem with the monitor impl in QEMU. We
> have discussed this monitor problem for years pretty much since
> day 1 of QMP being designed, but it never gets serious attention.
> IMHO it is well overdue to change that and focus attention on the
> root problem and not just punt it down the road yet again by adding
> short term hacks.
> 
> Adding an extra monitor channel, even as a short term hack, is
> *not* short term from libvirt's POV - we'll have to carry that
> code for many years into the future, even after QEMU provides
> a real fix. So even if QEMU provides such a short term hack, I
> would none the less be strongly against libvirt using it.
> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:19                       ` Dr. David Alan Gilbert
@ 2017-09-07  9:22                         ` Daniel P. Berrange
  2017-09-07  9:27                           ` Dr. David Alan Gilbert
  2017-09-07 11:19                         ` Markus Armbruster
  1 sibling, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-07  9:22 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Thu, Sep 07, 2017 at 10:19:47AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Thu, Sep 07, 2017 at 04:13:41PM +0800, Peter Xu wrote:
> > > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > > > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > > This does imply that you need a separate monitor I/O processing, from the
> > > > > > command execution thread, but I see no need for all commands to suddenly
> > > > > > become async. Just allowing interleaved replies is sufficient from the
> > > > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > > > the client POV - just requires a unique 'serial' in the request by the
> > > > > > client, that is copied into the reply by QEMU.
> > > > > 
> > > > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > > > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > > > 
> > > > > then it's upto the caller to ensure those id's are unique.
> > > > 
> > > > Libvirt has in fact generated a unique 'id' for every monitor command
> > > > since day 1 of supporting QMP.
> > > > 
> > > > > I do worry about two things:
> > > > >   a) With this the caller doesn't really know which commands could be
> > > > >   in parallel - for example if we've got a recovery command that's
> > > > >   executed by this non-locking thread that's OK, we expect that
> > > > >   to be doable in parallel.  If in the future though we do
> > > > >   what you initially suggested and have a bunch of commands get
> > > > >   routed to the migration thread (say) then those would suddenly
> > > > >   operate in parallel with other commands that we're previously
> > > > >   synchronous.
> > > > 
> > > > We could still have an opt-in for async commands. eg default to executing
> > > > all commands in the main thread, unless the client issues an explicit
> > > > "make it async" command, to switch to allowing the migration thread to
> > > > process it async.
> > > > 
> > > >  { "execute": "qmp_allow_async",
> > > >    "data": { "commands": [
> > > >        "migrate_cancel",
> > > >    ] } }
> > > > 
> > > > 
> > > >  { "return": { "commands": [
> > > >        "migrate_cancel",
> > > >    ] } }
> > > > 
> > > > The server response contains the subset of commands from the request
> > > > for which async is supported.
> > > > 
> > > > That gives good negotiation ability going forward as we incrementally
> > > > support async on more commands.
> > > 
> > > I think this goes back to the discussion on which design we'd like to
> > > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > > cleaner and nicer, and I believe that can benefit not only libvirt,
> > > but also other QMP users.  The problem is, I have no idea how long
> > > it'll take to let us have such a feature - I believe that will include
> > > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > > postcopy recovery cannot work only because we cannot guarantee a
> > > stable monitor.
> > 
> > This is not a blocker for having postcopy recovery feature merged.
> > It merely means that in a situation where the mainloop is blocked,
> > then we can't recover, in other situations we'll be able to recover
> > fine. Sure it would be nice to fix that problem too, but I don't
> > see it as a block.
> 
> It's probably OK to merge the recovery code before the monitor code;
> but I don't think it's something you'd want to tell users about -
> a 'postcopy recovery that only works rarely' isn't much use.

I dunno. Compared to today where there's zero post-copy recovery,
I think even an incremental improvement is useful. It's a choice
between "your VM is dead" and "you've a 50/50 chance of life".


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:15                     ` Dr. David Alan Gilbert
@ 2017-09-07  9:25                       ` Daniel P. Berrange
  0 siblings, 0 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-07  9:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Thu, Sep 07, 2017 at 10:15:09AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > This does imply that you need a separate monitor I/O processing, from the
> > > > > command execution thread, but I see no need for all commands to suddenly
> > > > > become async. Just allowing interleaved replies is sufficient from the
> > > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > > the client POV - just requires a unique 'serial' in the request by the
> > > > > client, that is copied into the reply by QEMU.
> > > > 
> > > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > > 
> > > > then it's up to the caller to ensure those ids are unique.
> > > 
> > > Libvirt has in fact generated a unique 'id' for every monitor command
> > > since day 1 of supporting QMP.
> > > 
> > > > I do worry about two things:
> > > >   a) With this the caller doesn't really know which commands could be
> > > >   in parallel - for example if we've got a recovery command that's
> > > >   executed by this non-locking thread that's OK, we expect that
> > > >   to be doable in parallel.  If in the future though we do
> > > >   what you initially suggested and have a bunch of commands get
> > > >   routed to the migration thread (say) then those would suddenly
> > > >   operate in parallel with other commands that were previously
> > > >   synchronous.
> > > 
> > > We could still have an opt-in for async commands. eg default to executing
> > > all commands in the main thread, unless the client issues an explicit
> > > "make it async" command, to switch to allowing the migration thread to
> > > process it async.
> > > 
> > >  { "execute": "qmp_allow_async",
> > >    "data": { "commands": [
> > >        "migrate_cancel"
> > >    ] } }
> > > 
> > > 
> > >  { "return": { "commands": [
> > >        "migrate_cancel"
> > >    ] } }
> > > 
> > > The server response contains the subset of commands from the request
> > > for which async is supported.
> > > 
> > > That gives good negotiation ability going forward as we incrementally
> > > support async on more commands.
> > 
> > I think this goes back to the discussion on which design we'd like to
> > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > cleaner and nicer, and I believe that can benefit not only libvirt,
> > but also other QMP users.  The problem is, I have no idea how long
> > it'll take to let us have such a feature - I believe that will include
> > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > postcopy recovery cannot work only because we cannot guarantee a
> > stable monitor.
> 
> libvirt will need changes for postcopy recovery however we do it;
> so we need to do it the way they want.
> 
> I think Dan's suggestion isn't as hard as it initially sounded;  a first
> thing to try would be taking all the monitor IO into another thread
> and feeding all commands to the main thread for execution - that sounds
> like the hard part.
> (I'm not sure how multiple monitors interact for this).

Multiple monitors are probably not as hard as they sound. No matter how
many monitors you have configured today, they're all serviced by the
main loop, so commands from each monitor are strictly serialized.

So if you moved I/O processing to a separate thread, and had a single
queue of commands to be executed by the main thread, you would still
have the exact same serialized processing across multiple monitors.
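
As a rough illustration of that arrangement (a Python sketch rather than
QEMU code; the function names and the sentinel convention are made up for
this example), each monitor gets its own I/O thread that only queues
requests, and a single consumer standing in for the main thread executes
them one at a time:

```python
import queue
import threading


def monitor_io_thread(name, raw_commands, command_queue):
    """Hypothetical per-monitor I/O thread: it only reads and queues requests."""
    for cmd in raw_commands:
        command_queue.put((name, cmd))
    command_queue.put((name, None))  # sentinel: this monitor has no more input


def run_main_loop(command_queue, monitor_count):
    """Single consumer standing in for the main thread.

    Commands from all monitors come out of one queue, so they are
    executed strictly one after another.
    """
    executed = []
    finished = 0
    while finished < monitor_count:
        name, cmd = command_queue.get()
        if cmd is None:
            finished += 1
            continue
        # "Executing" a command here just records who ran it.
        executed.append((name, cmd, threading.current_thread().name))
    return executed


def demo():
    command_queue = queue.Queue()
    inputs = {
        "mon0": ["query-status", "migrate_cancel"],
        "mon1": ["query-cpus"],
    }
    threads = [
        threading.Thread(target=monitor_io_thread,
                         args=(name, cmds, command_queue))
        for name, cmds in inputs.items()
    ]
    for t in threads:
        t.start()
    executed = run_main_loop(command_queue, len(threads))
    for t in threads:
        t.join()
    return executed
```

Because there is only one consumer, commands never run concurrently,
whichever monitor they came from; only the relative order between
monitors depends on arrival time.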

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:22                         ` Daniel P. Berrange
@ 2017-09-07  9:27                           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07  9:27 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Thu, Sep 07, 2017 at 10:19:47AM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > On Thu, Sep 07, 2017 at 04:13:41PM +0800, Peter Xu wrote:
> > > > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > > > > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > > > > This does imply that you need a separate monitor I/O processing, from the
> > > > > > > command execution thread, but I see no need for all commands to suddenly
> > > > > > > become async. Just allowing interleaved replies is sufficient from the
> > > > > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > > > > the client POV - just requires a unique 'serial' in the request by the
> > > > > > > client, that is copied into the reply by QEMU.
> > > > > > 
> > > > > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > > > > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > > > > 
> > > > > > then it's up to the caller to ensure those ids are unique.
> > > > > 
> > > > > Libvirt has in fact generated a unique 'id' for every monitor command
> > > > > since day 1 of supporting QMP.
> > > > > 
> > > > > > I do worry about two things:
> > > > > >   a) With this the caller doesn't really know which commands could be
> > > > > >   in parallel - for example if we've got a recovery command that's
> > > > > >   executed by this non-locking thread that's OK, we expect that
> > > > > >   to be doable in parallel.  If in the future though we do
> > > > > >   what you initially suggested and have a bunch of commands get
> > > > > >   routed to the migration thread (say) then those would suddenly
> > > > > >   operate in parallel with other commands that were previously
> > > > > >   synchronous.
> > > > > 
> > > > > We could still have an opt-in for async commands. eg default to executing
> > > > > all commands in the main thread, unless the client issues an explicit
> > > > > "make it async" command, to switch to allowing the migration thread to
> > > > > process it async.
> > > > > 
> > > > >  { "execute": "qmp_allow_async",
> > > > >    "data": { "commands": [
> > > > >        "migrate_cancel"
> > > > >    ] } }
> > > > > 
> > > > > 
> > > > >  { "return": { "commands": [
> > > > >        "migrate_cancel"
> > > > >    ] } }
> > > > > 
> > > > > The server response contains the subset of commands from the request
> > > > > for which async is supported.
> > > > > 
> > > > > That gives good negotiation ability going forward as we incrementally
> > > > > support async on more commands.
> > > > 
> > > > I think this goes back to the discussion on which design we'd like to
> > > > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > > > cleaner and nicer, and I believe that can benefit not only libvirt,
> > > > but also other QMP users.  The problem is, I have no idea how long
> > > > it'll take to let us have such a feature - I believe that will include
> > > > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > > > postcopy recovery cannot work only because we cannot guarantee a
> > > > stable monitor.
> > > 
> > > This is not a blocker for having postcopy recovery feature merged.
> > > It merely means that in a situation where the mainloop is blocked,
> > > then we can't recover, in other situations we'll be able to recover
> > > fine. Sure it would be nice to fix that problem too, but I don't
> > > see it as a block.
> > 
> > It's probably OK to merge the recovery code before the monitor code;
> > but I don't think it's something you'd want to tell users about -
> > a 'postcopy recovery that only works rarely' isn't much use.
> 
> I dunno. Compared to today where there's zero post-copy recovery,
> I think even an incremental improvement is useful. It's a choice
> between "your VM is dead" and "you've a 50/50 chance of life".

There's a chunk of people who won't use postcopy because they regard
it as dangerous; they need something that works in most cases before
they'll use it.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:58     ` Stefan Hajnoczi
@ 2017-09-07  9:35       ` Dr. David Alan Gilbert
  2017-09-07 10:09         ` Stefan Hajnoczi
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07  9:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, Michael Roth, Paolo Bonzini

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> >> > The root problem is that, monitor commands are all handled in main
> >> > loop thread now, no matter how many monitors we specify. And, if main
> >> > loop thread hangs due to some reason, all monitors will be stuck.
> >>
> >> I see a larger issue with postcopy: existing QEMU code assumes that
> >> guest memory access is instantaneous.
> >>
> >> Postcopy breaks this assumption and introduces blocking points that can
> >> now take unbounded time.
> >>
> >> This problem isn't specific to the monitor.  It can also happen to other
> >> components in QEMU like the gdbstub.
> >>
> >> Do we need an asynchronous memory API?  Synchronous memory access should
> >> only be allowed in vcpu threads.
> >
> > It would probably be useful for gdbstub where the overhead of async
> > doesn't matter;  but doing that for all IO emulation is hard.
> 
> Why is it hard?
> 
> Memory access can be synchronous in the vcpu thread.  That eliminates
> a lot of code straight away.
> 
> Anything using dma-helpers.c is already async.  They just don't know
> that the memory access part is being made async too :).

Can you point me to some info on that?

> The remaining cases are virtio and some other devices.
> 
> If you are worried about performance, the first rule is that async
> memory access is only needed on the destination side when post-copy is
> active.  Maybe use setjmp to return from the signal handler and queue
> a callback for when the page has been loaded.

I'm not sure it's worth trying to be too clever at avoiding this;
I see the fact that we're doing IO with the BQL held as a more
fundamental problem.

Dave

> Stefan
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 11:54                 ` Daniel P. Berrange
  2017-09-07  8:13                   ` Peter Xu
@ 2017-09-07 10:04                   ` Dr. David Alan Gilbert
  2017-09-07 10:08                     ` Daniel P. Berrange
  1 sibling, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 10:04 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

* Daniel P. Berrange (berrange@redhat.com) wrote:
> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > This does imply that you need a separate monitor I/O processing, from the
> > > command execution thread, but I see no need for all commands to suddenly
> > > become async. Just allowing interleaved replies is sufficient from the
> > > POV of the protocol definition. This interleaving is easy to handle from
> > > the client POV - just requires a unique 'serial' in the request by the
> > > client, that is copied into the reply by QEMU.
> > 
> > OK, so for that we can just take Marc-André's syntax and call it 'id':
> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > 
> > then it's up to the caller to ensure those ids are unique.
> 
> Libvirt has in fact generated a unique 'id' for every monitor command
> since day 1 of supporting QMP.
> 
> > I do worry about two things:
> >   a) With this the caller doesn't really know which commands could be
> >   in parallel - for example if we've got a recovery command that's
> >   executed by this non-locking thread that's OK, we expect that
> >   to be doable in parallel.  If in the future though we do
> >   what you initially suggested and have a bunch of commands get
> >   routed to the migration thread (say) then those would suddenly
> >   operate in parallel with other commands that were previously
> >   synchronous.
> 
> We could still have an opt-in for async commands. eg default to executing
> all commands in the main thread, unless the client issues an explicit
> "make it async" command, to switch to allowing the migration thread to
> process it async.
> 
>  { "execute": "qmp_allow_async",
>    "data": { "commands": [
>        "migrate_cancel"
>    ] } }
> 
> 
>  { "return": { "commands": [
>        "migrate_cancel"
>    ] } }
> 
> The server response contains the subset of commands from the request
> for which async is supported.
> 
> That gives good negotiation ability going forward as we incrementally
> support async on more commands.

Is that 'qmp_allow_async' a command purely to query whether a command
is async or is it a wrapper to cause that command to be executed async?

> >   b) I still worry how the various IO channels will behave on another
> >   thread.  But that's more a general feeling rather than anything
> >   specific.
> 
> The only complexity will be around making sure the Chardev code uses
> the right GMainContext for any watches on the underlying QIOChannel,
> so that we poll() from the custom thread instead of the main thread.
> IOW, as long as all I/O is done from the single thread everything
> should work fine.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 10:04                   ` Dr. David Alan Gilbert
@ 2017-09-07 10:08                     ` Daniel P. Berrange
  0 siblings, 0 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-07 10:08 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Eric Blake, Laurent Vivier, Markus Armbruster

On Thu, Sep 07, 2017 at 11:04:02AM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrange (berrange@redhat.com) wrote:
> > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > > > This does imply that you need a separate monitor I/O processing, from the
> > > > command execution thread, but I see no need for all commands to suddenly
> > > > become async. Just allowing interleaved replies is sufficient from the
> > > > POV of the protocol definition. This interleaving is easy to handle from
> > > > the client POV - just requires a unique 'serial' in the request by the
> > > > client, that is copied into the reply by QEMU.
> > > 
> > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > > 
> > > then it's up to the caller to ensure those ids are unique.
> > 
> > Libvirt has in fact generated a unique 'id' for every monitor command
> > since day 1 of supporting QMP.
> > 
> > > I do worry about two things:
> > >   a) With this the caller doesn't really know which commands could be
> > >   in parallel - for example if we've got a recovery command that's
> > >   executed by this non-locking thread that's OK, we expect that
> > >   to be doable in parallel.  If in the future though we do
> > >   what you initially suggested and have a bunch of commands get
> > >   routed to the migration thread (say) then those would suddenly
> > >   operate in parallel with other commands that were previously
> > >   synchronous.
> > 
> > We could still have an opt-in for async commands. eg default to executing
> > all commands in the main thread, unless the client issues an explicit
> > "make it async" command, to switch to allowing the migration thread to
> > process it async.
> > 
> >  { "execute": "qmp_allow_async",
> >    "data": { "commands": [
> >        "migrate_cancel"
> >    ] } }
> > 
> > 
> >  { "return": { "commands": [
> >        "migrate_cancel"
> >    ] } }
> > 
> > The server response contains the subset of commands from the request
> > for which async is supported.
> > 
> > That gives good negotiation ability going forward as we incrementally
> > support async on more commands.
> 
> Is that 'qmp_allow_async' a command purely to query whether a command
> is async or is it a wrapper to cause that command to be executed async?

The former.

It is merely used by the client to tell QEMU that it wants the command(s)
listed to have async processing enabled. QEMU reports back which commands
it has actually enabled async for.

IOW, before executing this, everything is still processed synchronously,
even if QEMU has support for async. This ensures back compat as we enable
support for async per command. After executing this command, future
uses of 'migrate_cancel' would be run async.
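
To make the negotiation concrete, here is a minimal Python sketch of how
the server side might behave. Note that 'qmp_allow_async' is only the
proposal from this thread, not an existing QMP command, and the class,
the ASYNC_CAPABLE set and the method names are assumptions made for
illustration:

```python
# Assumption: only this command has been made async-capable so far.
ASYNC_CAPABLE = {"migrate_cancel"}


class MonitorSession:
    """Per-client state; nothing runs async until the client opts in."""

    def __init__(self):
        self.async_enabled = set()

    def handle_allow_async(self, request):
        """Enable async for the supported subset and report that subset back."""
        requested = request["data"]["commands"]
        granted = [cmd for cmd in requested if cmd in ASYNC_CAPABLE]
        self.async_enabled.update(granted)
        return {"return": {"commands": granted}}

    def runs_async(self, command):
        """True only for commands the client asked for and the server granted."""
        return command in self.async_enabled
```

A client that never sends the request sees exactly the old synchronous
behaviour, and a client that asks for a command the server cannot run
async simply does not find it in the reply.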


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:35       ` Dr. David Alan Gilbert
@ 2017-09-07 10:09         ` Stefan Hajnoczi
  2017-09-07 12:02           ` Peter Xu
  0 siblings, 1 reply; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-07 10:09 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, qemu-devel, Laurent Vivier, Fam Zheng, Juan Quintela,
	Markus Armbruster, Michael Roth, Paolo Bonzini

On Thu, Sep 7, 2017 at 10:35 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> >> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
>> >> > The root problem is that, monitor commands are all handled in main
>> >> > loop thread now, no matter how many monitors we specify. And, if main
>> >> > loop thread hangs due to some reason, all monitors will be stuck.
>> >>
>> >> I see a larger issue with postcopy: existing QEMU code assumes that
>> >> guest memory access is instantaneous.
>> >>
>> >> Postcopy breaks this assumption and introduces blocking points that can
>> >> now take unbounded time.
>> >>
>> >> This problem isn't specific to the monitor.  It can also happen to other
>> >> components in QEMU like the gdbstub.
>> >>
>> >> Do we need an asynchronous memory API?  Synchronous memory access should
>> >> only be allowed in vcpu threads.
>> >
>> > It would probably be useful for gdbstub where the overhead of async
>> > doesn't matter;  but doing that for all IO emulation is hard.
>>
>> Why is it hard?
>>
>> Memory access can be synchronous in the vcpu thread.  That eliminates
>> a lot of code straight away.
>>
>> Anything using dma-helpers.c is already async.  They just don't know
>> that the memory access part is being made async too :).
>
> Can you point me to some info on that?

IDE and SCSI use dma-helpers.c to perform I/O:

  hw/ide/core.c:892: s->bus->dma->aiocb = dma_blk_io(blk_get_aio_context(s->blk),
  hw/ide/macio.c:189: s->bus->dma->aiocb = dma_blk_io(blk_get_aio_context(s->blk), &s->sg,
  hw/scsi/scsi-disk.c:348: r->req.aiocb = dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
  hw/scsi/scsi-disk.c:551: r->req.aiocb = dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),

They pass a scatter-gather list of guest RAM addresses to
dma-helpers.c.  They receive a callback when I/O has finished.

Try following the code path.  Request submission may be from a vcpu
thread or IOThread.  Completion occurs in the main loop or an
IOThread.

The main point is that this API is already asynchronous.  If any
changes are needed for async guest memory access (not sure, I haven't
checked), then at least the dma-helpers.c users do not need to be
modified.
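
To illustrate why callers of a callback-based API would not need to
change, here is a toy Python model. It is not the dma-helpers.c
interface; the class and method names are invented, and guest RAM is
just a bytearray:

```python
from collections import deque


class AsyncDmaSketch:
    """Toy stand-in for a callback-based DMA helper; not QEMU's real interface."""

    def __init__(self, guest_ram):
        self.guest_ram = guest_ram
        self.pending = deque()

    def submit_read(self, sg_list, callback):
        """Queue a request; sg_list is a list of (address, length) pairs.

        The caller gets nothing back here. The result only arrives
        through the callback, once the request has completed.
        """
        self.pending.append((sg_list, callback))

    def run_completions(self):
        """Called from the event loop: perform queued requests, notify callers."""
        while self.pending:
            sg_list, callback = self.pending.popleft()
            data = b"".join(
                bytes(self.guest_ram[addr:addr + length])
                for addr, length in sg_list
            )
            callback(data)
```

The caller only ever sees "submit" and "callback", so whether the memory
access underneath completes immediately or much later is invisible to it.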

>> The remaining cases are virtio and some other devices.
>>
>> If you are worried about performance, the first rule is that async
>> memory access is only needed on the destination side when post-copy is
>> active.  Maybe use setjmp to return from the signal handler and queue
>> a callback for when the page has been loaded.
>
> I'm not sure it's worth trying to be too clever at avoiding this;
> I see the fact that we're doing IO with the BQL held as a more
> fundamental problem.

QEMU should be doing I/O syscalls either asynchronously or in threadpool
workers (no BQL), so the BQL is not an issue.  Anything else could
cause unbounded waits even without postcopy.

Can you explain what you are worried about?

Stefan

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:18                       ` Dr. David Alan Gilbert
@ 2017-09-07 10:19                         ` Stefan Hajnoczi
  2017-09-07 10:24                         ` Peter Xu
  1 sibling, 0 replies; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-07 10:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, Daniel P. Berrange, Laurent Vivier, Fam Zheng,
	Juan Quintela, qemu-devel, Markus Armbruster, Michael Roth,
	Paolo Bonzini

On Thu, Sep 7, 2017 at 10:18 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> On Thu, Sep 7, 2017 at 9:13 AM, Peter Xu <peterx@redhat.com> wrote:
>> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
>> >> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
>> >> > * Daniel P. Berrange (berrange@redhat.com) wrote:
>> >> > > This does imply that you need a separate monitor I/O processing, from the
>> >> > > command execution thread, but I see no need for all commands to suddenly
>> >> > > become async. Just allowing interleaved replies is sufficient from the
>> >> > > POV of the protocol definition. This interleaving is easy to handle from
>> >> > > the client POV - just requires a unique 'serial' in the request by the
>> >> > > client, that is copied into the reply by QEMU.
>> >> >
>> >> > OK, so for that we can just take Marc-André's syntax and call it 'id':
>> >> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
>> >> >
>> >> > then it's up to the caller to ensure those ids are unique.
>> >>
>> >> Libvirt has in fact generated a unique 'id' for every monitor command
>> >> since day 1 of supporting QMP.
>> >>
>> >> > I do worry about two things:
>> >> >   a) With this the caller doesn't really know which commands could be
>> >> >   in parallel - for example if we've got a recovery command that's
>> >> >   executed by this non-locking thread that's OK, we expect that
>> >> >   to be doable in parallel.  If in the future though we do
>> >> >   what you initially suggested and have a bunch of commands get
>> >> >   routed to the migration thread (say) then those would suddenly
>> >> >   operate in parallel with other commands that were previously
>> >> >   synchronous.
>> >>
>> >> We could still have an opt-in for async commands. eg default to executing
>> >> all commands in the main thread, unless the client issues an explicit
>> >> "make it async" command, to switch to allowing the migration thread to
>> >> process it async.
>> >>
>> >>  { "execute": "qmp_allow_async",
>> >>    "data": { "commands": [
>> >>        "migrate_cancel"
>> >>    ] } }
>> >>
>> >>
>> >>  { "return": { "commands": [
>> >>        "migrate_cancel"
>> >>    ] } }
>> >>
>> >> The server response contains the subset of commands from the request
>> >> for which async is supported.
>> >>
>> >> That gives good negotiation ability going forward as we incrementally
>> >> support async on more commands.
>> >
>> > I think this goes back to the discussion on which design we'd like to
>> > choose.  IMHO the whole async idea plus the per-command-id is indeed
>> > cleaner and nicer, and I believe that can benefit not only libvirt,
>> > but also other QMP users.  The problem is, I have no idea how long
>> > it'll take to let us have such a feature - I believe that will include
>> > QEMU and Libvirt to both support that.  And it'll be a pity if the
>> > postcopy recovery cannot work only because we cannot guarantee a
>> > stable monitor.
>>
>> Please don't rush in a hack, they often introduce new bugs that we
>> have to support long-term when they are part of the QMP API.
>>
>> In your original email you mentioned "info cpus".  Have you considered
>> modifying this command so it does not sync the CPU?  I'm not sure
>> callers really need to sync the CPU, typically they just want to know
>> the vcpu numbers, thread IDs, and current state (halted, running,
>> etc).
>
> But it has the pc as well, so that's actual state.

In what circumstances is the pc useful?

If the client just wants the vcpu -> thread ID mapping, it doesn't
matter at all.

If the CPU is halted, then the PC is already accurate and
synchronization isn't a problem.

If the CPU is running, then an accurate PC is meaningless since it
will have changed the moment the monitor command completes.  We might
as well just keep a copy of the last PC when entering QEMU in a vcpu
thread.

So I think we can offer a perfectly useful PC value without syncing.
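
A minimal sketch of the idea, in Python for brevity. The class and field
names are hypothetical, and real code would need to think about how the
cached value is read safely from another thread:

```python
class VcpuSketch:
    """Toy vcpu that remembers the PC from its last exit out of guest code."""

    def __init__(self, index, thread_id):
        self.index = index
        self.thread_id = thread_id
        self.halted = False
        self.last_pc = 0  # copy taken the last time the vcpu thread entered QEMU

    def on_exit_from_guest(self, pc, halted=False):
        """Would run in the vcpu thread whenever it returns from guest code."""
        self.last_pc = pc
        self.halted = halted

    def info(self):
        """Monitor-side query: reads cached fields only, never waits for the vcpu."""
        return {
            "CPU": self.index,
            "thread_id": self.thread_id,
            "halted": self.halted,
            "pc": self.last_pc,
        }
```

For a halted vcpu the cached PC is exact; for a running one it is
slightly stale, which is no worse than a synced value that is already
out of date by the time the reply is sent.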

Stefan

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:18                       ` Dr. David Alan Gilbert
  2017-09-07 10:19                         ` Stefan Hajnoczi
@ 2017-09-07 10:24                         ` Peter Xu
  1 sibling, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-09-07 10:24 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, Daniel P. Berrange, Laurent Vivier, Fam Zheng,
	Juan Quintela, qemu-devel, Markus Armbruster, Michael Roth,
	Paolo Bonzini

On Thu, Sep 07, 2017 at 10:18:16AM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > On Thu, Sep 7, 2017 at 9:13 AM, Peter Xu <peterx@redhat.com> wrote:
> > > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> > >> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> > >> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> > >> > > This does imply that you need a separate monitor I/O processing, from the
> > >> > > command execution thread, but I see no need for all commands to suddenly
> > >> > > become async. Just allowing interleaved replies is sufficient from the
> > >> > > POV of the protocol definition. This interleaving is easy to handle from
> > >> > > the client POV - just requires a unique 'serial' in the request by the
> > >> > > client, that is copied into the reply by QEMU.
> > >> >
> > >> > OK, so for that we can just take Marc-André's syntax and call it 'id':
> > >> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> > >> >
> > >> > then it's up to the caller to ensure those ids are unique.
> > >>
> > >> Libvirt has in fact generated a unique 'id' for every monitor command
> > >> since day 1 of supporting QMP.
> > >>
> > >> > I do worry about two things:
> > >> >   a) With this the caller doesn't really know which commands could be
> > >> >   in parallel - for example if we've got a recovery command that's
> > >> >   executed by this non-locking thread that's OK, we expect that
> > >> >   to be doable in parallel.  If in the future though we do
> > >> >   what you initially suggested and have a bunch of commands get
> > >> >   routed to the migration thread (say) then those would suddenly
> > >> >   operate in parallel with other commands that were previously
> > >> >   synchronous.
> > >>
> > >> We could still have an opt-in for async commands. eg default to executing
> > >> all commands in the main thread, unless the client issues an explicit
> > >> "make it async" command, to switch to allowing the migration thread to
> > >> process it async.
> > >>
> > >>  { "execute": "qmp_allow_async",
> > >>    "data": { "commands": [
> > >>        "migrate_cancel"
> > >>    ] } }
> > >>
> > >>
> > >>  { "return": { "commands": [
> > >>        "migrate_cancel"
> > >>    ] } }
> > >>
> > >> The server response contains the subset of commands from the request
> > >> for which async is supported.
> > >>
> > >> That gives good negotiation ability going forward as we incrementally
> > >> support async on more commands.
> > >
> > > I think this goes back to the discussion on which design we'd like to
> > > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > > cleaner and nicer, and I believe that can benefit not only libvirt,
> > > but also other QMP users.  The problem is, I have no idea how long
> > > it'll take to let us have such a feature - I believe that will include
> > > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > > postcopy recovery cannot work only because we cannot guarantee a
> > > stable monitor.
> > 
> > Please don't rush in a hack, they often introduce new bugs that we
> > have to support long-term when they are part of the QMP API.

Sorry, I didn't mean to push anything.  I was trying to see what
would be the best way to go.

> > 
> > In your original email you mentioned "info cpus".  Have you considered
> > modifying this command so it does not sync the CPU?  I'm not sure
> > callers really need to sync the CPU, typically they just want to know
> > the vcpu numbers, thread IDs, and current state (halted, running,
> > etc).
> 
> But it has the pc as well, so that's actual state.

Yes.  Even if we didn't need to sync the pc registers for this single
"info cpus" command, it would feel slightly awkward to forbid things
like syncing the CPU in every command.  IMHO we just need to keep in
mind that these commands may block.

> 
> Dave
> 
> > The next step after that would be to audit other monitor commands for
> > unnecessary vcpu synchronization.

It's really hard to do this for every single command, at least for me.

Compared to this, I think I now prefer what Dan suggested in the other
reply: have an extra way to request async commands while keeping the
rest of the commands compatible (though obviously I misunderstood the
email when I was writing up my previous reply...).

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  9:19                       ` Dr. David Alan Gilbert
  2017-09-07  9:22                         ` Daniel P. Berrange
@ 2017-09-07 11:19                         ` Markus Armbruster
  2017-09-07 11:31                           ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 104+ messages in thread
From: Markus Armbruster @ 2017-09-07 11:19 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Daniel P. Berrange, Laurent Vivier, Fam Zheng, Juan Quintela,
	mdroth, Peter Xu, qemu-devel, Paolo Bonzini

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Daniel P. Berrange (berrange@redhat.com) wrote:
>> On Thu, Sep 07, 2017 at 04:13:41PM +0800, Peter Xu wrote:
>> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
>> > > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
>> > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
>> > > > > This does imply that you need a separate monitor I/O processing, from the
>> > > > > command execution thread, but I see no need for all commands to suddenly
>> > > > > become async. Just allowing interleaved replies is sufficient from the
>> > > > > POV of the protocol definition. This interleaving is easy to handle from
>> > > > > the client POV - just requires a unique 'serial' in the request by the
>> > > > > client, that is copied into the reply by QEMU.
>> > > > 
>> > > > OK, so for that we can just take Marc-André's syntax and call it 'id':
>> > > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
>> > > > 
>> > > > then it's upto the caller to ensure those id's are unique.
>> > > 
>> > > Libvirt has in fact generated a unique 'id' for every monitor command
>> > > since day 1 of supporting QMP.
>> > > 
>> > > > I do worry about two things:
>> > > >   a) With this the caller doesn't really know which commands could be
>> > > >   in parallel - for example if we've got a recovery command that's
>> > > >   executed by this non-locking thread that's OK, we expect that
>> > > >   to be doable in parallel.  If in the future though we do
>> > > >   what you initially suggested and have a bunch of commands get
>> > > >   routed to the migration thread (say) then those would suddenly
>> > > >   operate in parallel with other commands that we're previously
>> > > >   synchronous.
>> > > 
>> > > We could still have an opt-in for async commands. eg default to executing
>> > > all commands in the main thread, unless the client issues an explicit
>> > > "make it async" command, to switch to allowing the migration thread to
>> > > process it async.
>> > > 
>> > >  { "execute": "qmp_allow_async",
>> > >    "data": { "commands": [
>> > >        "migrate_cancel",
>> > >    ] } }
>> > > 
>> > > 
>> > >  { "return": { "commands": [
>> > >        "migrate_cancel",
>> > >    ] } }
>> > > 
>> > > The server response contains the subset of commands from the request
>> > > for which async is supported.
>> > > 
>> > > That gives good negotiation ability going forward as we incrementally
>> > > support async on more commands.
>> > 
>> > I think this goes back to the discussion on which design we'd like to
>> > choose.  IMHO the whole async idea plus the per-command-id is indeed
>> > cleaner and nicer, and I believe that can benefit not only libvirt,
>> > but also other QMP users.  The problem is, I have no idea how long
>> > it'll take to let us have such a feature - I believe that will include
>> > QEMU and Libvirt to both support that.  And it'll be a pity if the
>> > postcopy recovery cannot work only because we cannot guarantee a
>> > stable monitor.
>> 
>> This is not a blocker for having postcopy recovery feature merged.
>> It merely means that in a situation where the mainloop is blocked,
>> then we can't recover, in other situations we'll be able to recover
>> fine. Sure it would be nice to fix that problem too, but I don't
>> see it as a block.
>
> It's probably OK to merge the recovery code before the monitor code;
> but I don't think it's something you'd want to tell users about -
> a 'postcopy recovery that only works rarely' isn't much use.

"Rarely"?  Are main loop hangs *that* common?

Can we quantify the problem to help gauge urgency?


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 11:19                         ` Markus Armbruster
@ 2017-09-07 11:31                           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 11:31 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P. Berrange, Laurent Vivier, Fam Zheng, Juan Quintela,
	mdroth, Peter Xu, qemu-devel, Paolo Bonzini

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> >> On Thu, Sep 07, 2017 at 04:13:41PM +0800, Peter Xu wrote:
> >> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> >> > > On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> >> > > > * Daniel P. Berrange (berrange@redhat.com) wrote:
> >> > > > > This does imply that you need a separate monitor I/O processing, from the
> >> > > > > command execution thread, but I see no need for all commands to suddenly
> >> > > > > become async. Just allowing interleaved replies is sufficient from the
> >> > > > > POV of the protocol definition. This interleaving is easy to handle from
> >> > > > > the client POV - just requires a unique 'serial' in the request by the
> >> > > > > client, that is copied into the reply by QEMU.
> >> > > > 
> >> > > > OK, so for that we can just take Marc-André's syntax and call it 'id':
> >> > > >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> >> > > > 
> >> > > > then it's upto the caller to ensure those id's are unique.
> >> > > 
> >> > > Libvirt has in fact generated a unique 'id' for every monitor command
> >> > > since day 1 of supporting QMP.
> >> > > 
> >> > > > I do worry about two things:
> >> > > >   a) With this the caller doesn't really know which commands could be
> >> > > >   in parallel - for example if we've got a recovery command that's
> >> > > >   executed by this non-locking thread that's OK, we expect that
> >> > > >   to be doable in parallel.  If in the future though we do
> >> > > >   what you initially suggested and have a bunch of commands get
> >> > > >   routed to the migration thread (say) then those would suddenly
> >> > > >   operate in parallel with other commands that we're previously
> >> > > >   synchronous.
> >> > > 
> >> > > We could still have an opt-in for async commands. eg default to executing
> >> > > all commands in the main thread, unless the client issues an explicit
> >> > > "make it async" command, to switch to allowing the migration thread to
> >> > > process it async.
> >> > > 
> >> > >  { "execute": "qmp_allow_async",
> >> > >    "data": { "commands": [
> >> > >        "migrate_cancel",
> >> > >    ] } }
> >> > > 
> >> > > 
> >> > >  { "return": { "commands": [
> >> > >        "migrate_cancel",
> >> > >    ] } }
> >> > > 
> >> > > The server response contains the subset of commands from the request
> >> > > for which async is supported.
> >> > > 
> >> > > That gives good negotiation ability going forward as we incrementally
> >> > > support async on more commands.
> >> > 
> >> > I think this goes back to the discussion on which design we'd like to
> >> > choose.  IMHO the whole async idea plus the per-command-id is indeed
> >> > cleaner and nicer, and I believe that can benefit not only libvirt,
> >> > but also other QMP users.  The problem is, I have no idea how long
> >> > it'll take to let us have such a feature - I believe that will include
> >> > QEMU and Libvirt to both support that.  And it'll be a pity if the
> >> > postcopy recovery cannot work only because we cannot guarantee a
> >> > stable monitor.
> >> 
> >> This is not a blocker for having postcopy recovery feature merged.
> >> It merely means that in a situation where the mainloop is blocked,
> >> then we can't recover, in other situations we'll be able to recover
> >> fine. Sure it would be nice to fix that problem too, but I don't
> >> see it as a block.
> >
> > It's probably OK to merge the recovery code before the monitor code;
> > but I don't think it's something you'd want to tell users about -
> > a 'postcopy recovery that only works rarely' isn't much use.
> 
> "Rarely"?  Are main loop hangs *that* common?
> 
> Can we quantify the problem to help gauge urgency?

Not really; it depends on workload and behaviour.  The people who worry
about postcopy recovery actually care about it working in almost every
case.
So I'm OK to add the recovery code, it's just not something we should
be shouting about to users until we have the monitor fixed.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 10:09         ` Stefan Hajnoczi
@ 2017-09-07 12:02           ` Peter Xu
  2017-09-07 16:53             ` Stefan Hajnoczi
  0 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-09-07 12:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Dr. David Alan Gilbert, qemu-devel, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, Michael Roth, Paolo Bonzini

On Thu, Sep 07, 2017 at 11:09:29AM +0100, Stefan Hajnoczi wrote:
> On Thu, Sep 7, 2017 at 10:35 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
> >> <dgilbert@redhat.com> wrote:
> >> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> >> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> >> >> > The root problem is that, monitor commands are all handled in main
> >> >> > loop thread now, no matter how many monitors we specify. And, if main
> >> >> > loop thread hangs due to some reason, all monitors will be stuck.
> >> >>
> >> >> I see a larger issue with postcopy: existing QEMU code assumes that
> >> >> guest memory access is instantaneous.
> >> >>
> >> >> Postcopy breaks this assumption and introduces blocking points that can
> >> >> now take unbounded time.
> >> >>
> >> >> This problem isn't specific to the monitor.  It can also happen to other
> >> >> components in QEMU like the gdbstub.
> >> >>
> >> >> Do we need an asynchronous memory API?  Synchronous memory access should
> >> >> only be allowed in vcpu threads.
> >> >
> >> > It would probably be useful for gdbstub where the overhead of async
> >> > doesn't matter;  but doing that for all IO emulation is hard.
> >>
> >> Why is it hard?
> >>
> >> Memory access can be synchronous in the vcpu thread.  That eliminates
> >> a lot of code straight away.
> >>
> >> Anything using dma-helpers.c is already async.  They just don't know
> >> that the memory access part is being made async too :).
> >
> > Can you point me to some info on that ?
> 
> IDE and SCSI use dma-helpers.c to perform I/O:
> hw/ide/core.c:892:        s->bus->dma->aiocb =
> dma_blk_io(blk_get_aio_context(s->blk),
> hw/ide/macio.c:189:        s->bus->dma->aiocb =
> dma_blk_io(blk_get_aio_context(s->blk), &s->sg,
> hw/scsi/scsi-disk.c:348:        r->req.aiocb =
> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
> hw/scsi/scsi-disk.c:551:        r->req.aiocb =
> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
> 
> They pass a scatter-gather list of guest RAM addresses to
> dma-helpers.c.  They receive a callback when I/O has finished.
> 
> Try following the code path.  Request submission may be from a vcpu
> thread or IOThread.  Completion occurs in the main loop or an
> IOThread.
> 
> The main point is that this API is already asynchronous.  If any
> changes are needed for async guest memory access (not sure, I haven't
> checked), then at least the dma-helpers.c users do not need to be
> modified.
> 
> >> The remaining cases are virtio and some other devices.
> >>
> >> If you are worried about performance, the first rule is that async
> >> memory access is only needed on the destination side when post-copy is
> >> active.  Maybe use setjmp to return from the signal handler and queue
> >> a callback for when the page has been loaded.
> >
> > I'm not sure it's worth trying to be too clever at avoiding this;
> > I see the fact that we're doing IO with the bql held as a more
> > fundamental problem.
> 
> QEMU should be doing I/O syscalls in async fashion or threadpool
> workers (no BQL) so the BQL is not an issue.  Anything else could
> cause unbounded waits even without postcopy.

E.g. when a vcpu takes a page fault with the BQL held, while the main
thread needs the BQL to dispatch anything, including monitor commands.

So I think it's a two-part problem - we need to solve both (1) the main
thread accessing guest memory, for which a solution is still missing,
and (2) BQL deadlocks between vcpu threads and the main thread.
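
A minimal sketch of case (2), assuming nothing about the real QEMU
code: a plain Python lock stands in for the BQL, an event stands in for
a postcopy page fault that has not been served yet, and the function
names are made up for illustration.  It only shows why a dispatcher
that needs the lock stalls while one that does not need it keeps
working.

```python
import threading


def run_demo():
    bql = threading.Lock()               # stand-in for the BQL
    holding = threading.Event()          # set once the "vcpu" owns the lock
    fault_resolved = threading.Event()   # set once the missing page arrives

    def vcpu():
        # The vcpu thread faults while holding the lock and cannot
        # release it until the page has been received.
        with bql:
            holding.set()
            fault_resolved.wait()

    def main_loop_dispatch(timeout):
        # The main loop needs the lock before it can dispatch anything.
        if not bql.acquire(timeout=timeout):
            return "blocked"
        bql.release()
        return "dispatched"

    def lockfree_dispatch():
        # A dispatcher that never touches the lock is unaffected.
        return "dispatched"

    t = threading.Thread(target=vcpu)
    t.start()
    holding.wait()
    during = (main_loop_dispatch(0.05), lockfree_dispatch())
    fault_resolved.set()                 # e.g. after a successful recovery
    t.join()
    after = main_loop_dispatch(1.0)
    return during, after
```

Calling run_demo() reports the main-loop dispatcher as blocked while
the fault is pending and as working again once it has been resolved.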

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07  8:13                   ` Peter Xu
                                       ` (2 preceding siblings ...)
  2017-09-07  9:15                     ` Dr. David Alan Gilbert
@ 2017-09-07 12:59                     ` Markus Armbruster
  2017-09-07 13:22                       ` Daniel P. Berrange
  2017-09-07 14:20                       ` Dr. David Alan Gilbert
  3 siblings, 2 replies; 104+ messages in thread
From: Markus Armbruster @ 2017-09-07 12:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Daniel P. Berrange, Laurent Vivier, Fam Zheng, Juan Quintela,
	qemu-devel, mdroth, Dr. David Alan Gilbert, Paolo Bonzini,
	John Snow

Peter Xu <peterx@redhat.com> writes:

> On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
>> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
>> > * Daniel P. Berrange (berrange@redhat.com) wrote:
>> > > This does imply that you need a separate monitor I/O processing, from the
>> > > command execution thread, but I see no need for all commands to suddenly
>> > > become async. Just allowing interleaved replies is sufficient from the
>> > > POV of the protocol definition. This interleaving is easy to handle from
>> > > the client POV - just requires a unique 'serial' in the request by the
>> > > client, that is copied into the reply by QEMU.
>> > 
>> > OK, so for that we can just take Marc-André's syntax and call it 'id':
>> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
>> > 
>> > then it's upto the caller to ensure those id's are unique.
>> 
>> Libvirt has in fact generated a unique 'id' for every monitor command
>> since day 1 of supporting QMP.
>> 
>> > I do worry about two things:
>> >   a) With this the caller doesn't really know which commands could be
>> >   in parallel - for example if we've got a recovery command that's
>> >   executed by this non-locking thread that's OK, we expect that
>> >   to be doable in parallel.  If in the future though we do
>> >   what you initially suggested and have a bunch of commands get
>> >   routed to the migration thread (say) then those would suddenly
>> >   operate in parallel with other commands that we're previously
>> >   synchronous.
>> 
>> We could still have an opt-in for async commands. eg default to executing
>> all commands in the main thread, unless the client issues an explicit
>> "make it async" command, to switch to allowing the migration thread to
>> process it async.
>> 
>>  { "execute": "qmp_allow_async",
>>    "data": { "commands": [
>>        "migrate_cancel",
>>    ] } }
>> 
>> 
>>  { "return": { "commands": [
>>        "migrate_cancel",
>>    ] } }
>> 
>> The server response contains the subset of commands from the request
>> for which async is supported.
>> 
>> That gives good negotiation ability going forward as we incrementally
>> support async on more commands.
>
> I think this goes back to the discussion on which design we'd like to
> choose.  IMHO the whole async idea plus the per-command-id is indeed
> cleaner and nicer, and I believe that can benefit not only libvirt,

The following may be a bit harsh in places.  I apologize in advance.  A
better writer than me wouldn't have to resort to that.  I've tried a few
times to make my point that "async QMP" is neither necessary nor
sufficient for monitor availability, but apparently without luck, since
there's still talk like it was.  I hope this attempt will work.

> but also other QMP users.  The problem is, I have no idea how long
> it'll take to let us have such a feature - I believe that will include
> QEMU and Libvirt to both support that.  And it'll be a pity if the
> postcopy recovery cannot work only because we cannot guarantee a
> stable monitor.
>
> I'm curious whether there are other requirements (besides postcopy
> recovery) that would want an always-alive monitor to run some
> lock-free commands?  If there is, I'd be more inclined to first
> provide a work-around solution like "-qmp-lockfree", and we can
> provide a better solution afterwards until when the whole async QMP
> work ready.

Yes, there are other requirements for "async QMP", and no, "async QMP"
isn't a solution, but at best a part of a solution.

Before I talk about QMP requirements, I need to ask a whole raft of
questions, because so far this thread feels like dreaming up grand
designs with only superficial understanding of the subject matter.
Quite possibly because *my* understanding is superficial.  If yours
isn't, great!  Go answer my questions :)

The root problem is main loop hangs.  QMP monitor hangs are merely a
special case.

The main loop should not hang.  We've always violated that design
assumption in places, e.g. in monitor commands that write to disk, and
thus can hang indefinitely with NFS.  Post-copy adds more violations, as
Stefan pointed out.

I can't say whether solving the special case "QMP monitor hangs" without
also solving "main loop hangs" is useful.  A perfectly available QMP
monitor buys you nothing if it feeds a command queue that isn't being
emptied because its consumers all hang.

So, what exactly is going to drain the command queue?  If there's more
than one consumer, how exactly are commands from the queue dispatched to
the consumers?

What are the "no hang" guarantees (if any) and conditions for each of
these consumers?

We can have any number of QMP monitors today.  Would each of them feed
its own queue?  Would they all feed a shared queue?

How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
command?

What does it mean when an asynchronous command follows a synchronous
command in the same QMP monitor?  I would expect the synchronous command
to complete before the asynchronous command, because that's what
synchronous means, isn't it?  To keep your QMP monitor available, you
then must not send synchronous commands that can hang.

How can we determine whether a certain synchronous command can hang?
Note that with opt-in async, *all* commands are also synchronous
commands.

In short, explain to me how exactly you plan to ensure that certain QMP
commands (such as post-copy recovery) can always "get through", in the
presence of multiple monitors, hanging main loop, hanging synchronous
commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.


Now let's talk about QMP requirements.

Any addition to QMP must consider what exists already.

You may add more of the same.

You may generalize existing stuff.

You may change existing stuff if you have sufficient reason, subject to
backward compatibility constraints.

But attempts to add new ways to do the same old stuff without properly
integrating the existing ways are not going to fly.

In particular, any new way to start some job, monitor and control it
while it lives, get notified about its state changes and so forth must
integrate the existing ways.  These include block jobs (probably the
most sophisticated of the lot), migration, dump-guest-memory, and
possibly more.  They all work the same way: synchronous command to kick
off the job, more synchronous commands to monitor and control, events to
notify.  They do differ in detail.
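
To make the shared pattern concrete, here is a small sketch of "command
to start, command to query, event to notify".  The command names
(start-job, query-job), the JOB_STATUS event and the state names are
hypothetical and not taken from any existing QMP interface; the point
is only that the start command returns at once and completion is
reported separately.

```python
class Job:
    """Toy job driven by synchronous commands plus events."""

    def __init__(self, emit_event):
        self.status = "none"
        self._emit = emit_event      # callback that delivers an event

    def _set_status(self, status):
        self.status = status
        self._emit({"event": "JOB_STATUS", "data": {"status": status}})

    def cmd_start_job(self):
        # Synchronous command: kicks off the work and returns at once.
        self._set_status("active")
        return {"return": {}}

    def cmd_query_job(self):
        # Synchronous command: reports the current state on request.
        return {"return": {"status": self.status}}

    def work_finished(self):
        # Called by whatever does the work; the client learns about it
        # through the event, not through the reply to start-job.
        self._set_status("completed")


events = []
job = Job(events.append)
start_reply = job.cmd_start_job()
mid_reply = job.cmd_query_job()
job.work_finished()
end_reply = job.cmd_query_job()
```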

Asynchronous commands are a new way to do this.  When you only need to
be notified on "done", and don't need to monitor / control, they fit the
bill quite neatly.

However, we can't just ignore the cases where we need more than that!
For those, we want a single generic solution instead of the several ad
hoc solutions we have now.

If we add asynchronous commands *now*, and for simple cases only, we add
yet another special case for a future generic solution to integrate.
I'm not going to let that happen.

I figure the closest to a generic solution we have is block jobs.
Perhaps a generic solution could be had by abstracting away the "block"
from "block jobs", leaving just "jobs".

Another approach is generalizing the asynchronous command proposal to
fully cover the not-so-simple cases.

If you'd rather want to make progress on monitor availability without
cracking the "jobs" problem, you're in luck!  Use your license to "add
more of the same": synchronous command to start a job, query to monitor,
event to notify.  

If you insist on tying your monitor availability solution to
asynchronous commands, then I'm in luck!  I just found volunteers to
solve the "jobs" problem for me.


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 12:59                     ` Markus Armbruster
@ 2017-09-07 13:22                       ` Daniel P. Berrange
  2017-09-07 17:41                         ` Markus Armbruster
  2017-09-07 14:20                       ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-07 13:22 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Xu, Laurent Vivier, Fam Zheng, Juan Quintela, qemu-devel,
	mdroth, Dr. David Alan Gilbert, Paolo Bonzini, John Snow

On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
> So, what exactly is going to drain the command queue?  If there's more
> than one consumer, how exactly are commands from the queue dispatched to
> the consumers?

In terms of my proposal, for any single command there should only ever
be a single consumer. The default consumer would be the main event loop
thread, such that we have no semantic change to QMP operation from today.

Some commands that are capable of being made "async", would have a
different consumer. For example, if the client requested the 'migrate-cancel'
be made async, this would change things such that the migration thread is
now responsible for consuming the "migrate-cancel" command, instead of the
default main loop.

> What are the "no hang" guarantees (if any) and conditions for each of
> these consumers?

The non-main thread consumers would have to have some reasonable
guarantee that they won't block on a lock held by the main loop,
otherwise the whole feature is largely useless.

> We can have any number of QMP monitors today.  Would each of them feed
> its own queue?  Would they all feed a shared queue?

Currently with multiple QMP monitors, everything runs in the main
loop, so commands arriving across multiple monitors are 100%
serialized and processed strictly in the order in which QEMU reads
them off the wire.  To maintain these semantics, we would need to
have a single shared queue for the default main loop consumer, so
that ordering does not change.

> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
> command?

Per monitor+command. ie just because libvirt knows how to cope with
async execution on the monitor it has open, does not mean that a
different app on the 2nd monitor can cope. So in my proposal
the switch to async must be scoped to the particular command only
for the monitor connection that requested it.

> What does it mean when an asynchronous command follows a synchronous
> command in the same QMP monitor?  I would expect the synchronous command
> to complete before the asynchronous command, because that's what
> synchronous means, isn't it?  To keep your QMP monitor available, you
> then must not send synchronous commands that can hang.

No, that is not what I described. All synchronous commands are
serialized wrt each other, just as today. An asynchronous command
can run as soon as it is received, regardless of whether any
earlier sent sync commands are still executing or pending. This
is trivial to achieve when you separate monitor I/O from command
execution in separate threads, provided of course the async
command consumers are not in the main loop.
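
A rough sketch of the routing described above, with made-up names
(Dispatcher, allow_async, submit) and no claim about how QEMU would
implement it.  Every command has exactly one consumer; the default is a
single queue shared by all monitors so wire order is preserved, and a
command goes to a dedicated consumer only for the monitor that opted
in.

```python
import queue


class Dispatcher:
    def __init__(self):
        self.main_queue = queue.Queue()   # shared by all monitors
        self.async_queues = {}            # command -> dedicated consumer queue
        self.opted_in = set()             # (monitor, command) pairs

    def register_async_consumer(self, command):
        # A non-main-loop thread announces it can consume this command.
        q = queue.Queue()
        self.async_queues[command] = q
        return q

    def allow_async(self, monitor, commands):
        # Grant only the subset that has an async consumer, and only
        # for the requesting monitor.
        granted = [c for c in commands if c in self.async_queues]
        self.opted_in.update((monitor, c) for c in granted)
        return granted

    def submit(self, monitor, command, cmd_id):
        item = (monitor, command, cmd_id)
        if (monitor, command) in self.opted_in:
            self.async_queues[command].put(item)
            return "async"
        self.main_queue.put(item)         # default: main loop, in order
        return "main"


d = Dispatcher()
migration_queue = d.register_async_consumer("migrate-cancel")
granted = d.allow_async("mon0", ["migrate-cancel", "query-cpus"])
routes = [
    d.submit("mon0", "query-cpus", 1),
    d.submit("mon0", "migrate-cancel", 2),
    d.submit("mon1", "migrate-cancel", 3),   # mon1 never opted in
]
```

Here the request from mon1 still lands in the shared main-loop queue,
so a client that never opted in sees no change in behaviour.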

> How can we determine whether a certain synchronous command can hang?
> Note that with opt-in async, *all* commands are also synchronous
> commands.
> 
> In short, explain to me how exactly you plan to ensure that certain QMP
> commands (such as post-copy recovery) can always "get through", in the
> presence of multiple monitors, hanging main loop, hanging synchronous
> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.

Taking migrate-cancel as the example. The migration code already has
a background thread doing work independently of the main loop. Upon
marking the migrate-cancel command as async, the migration control
thread would become the consumer of migrate-cancel. This allows the
migration operation to be cancelled immediately, regardless of whether
there are earlier monitor commands blocked in the main loop.

Of course this assumes the migration control thread can't block
for locks held by the main thread. 

> Now let's talk about QMP requirements.
> 
> Any addition to QMP must consider what exists already.
> 
> You may add more of the same.
> 
> You may generalize existing stuff.
> 
> You may change existing stuff if you have sufficient reason, subject to
> backward compatibility constraints.
> 
> But attempts to add new ways to do the same old stuff without properly
> integrating the existing ways are not going to fly.
> 
> In particular, any new way to start some job, monitor and control it
> while it lives, get notified about its state changes and so forth must
> integrate the existing ways.  These include block jobs (probably the
> most sophisticated of the lot), migration, dump-guest-memory, and
> possibly more.  They all work the same way: synchronous command to kick
> off the job, more synchronous commands to monitor and control, events to
> notify.  They do differ in detail.
> 
> Asynchronous commands are a new way to do this.  When you only need to
> be notified on "done", and don't need to monitor / control, they fit the
> bill quite neatly.
> 
> However, we can't just ignore the cases where we need more than that!
> For those, we want a single generic solution instead of the several ad
> hoc solutions we have now.
> 
> If we add asynchronous commands *now*, and for simple cases only, we add
> yet another special case for a future generic solution to integrate.
> I'm not going to let that happen.

With the async commands suggestion, while it would initially not
provide a way to query incremental status, that could easily be
fitted in.  Because command replies from async commands may be
out-of-order wrt the original requests, clients would need to
provide a unique ID for each command run. This originally was
part of QMP spec but then dropped, but libvirt still actually
generates a unique ID for every QMP command.

Given this, one option is to actually use the QMP command ID as
a job ID, and let you query ongoing status via some new QMP
command that accepts the ID of the job to be queried. A complexity
with this is how to make the jobs visible across multiple QMP
monitors. The job ID might actually have to be a combination of
the serial ID from the QMP command and the ID of the monitor
chardev.
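
As an illustration of that last point, a sketch of a job table keyed by
the pair (monitor chardev ID, command ID).  The class and method names
are hypothetical.  Two monitors can reuse the same command ID without
colliding, and any monitor can look up a job once it knows the pair.

```python
class JobRegistry:
    def __init__(self):
        self._jobs = {}   # (monitor_id, command_id) -> job state

    def start(self, monitor_id, command_id, command):
        job_id = (monitor_id, command_id)
        if job_id in self._jobs:
            # The client is responsible for unique IDs per monitor.
            raise ValueError("duplicate job id: %r" % (job_id,))
        self._jobs[job_id] = {"command": command, "status": "running"}
        return job_id

    def finish(self, job_id):
        self._jobs[job_id]["status"] = "done"

    def query(self, job_id):
        # Return a copy so callers cannot modify the registry.
        return dict(self._jobs[job_id])


registry = JobRegistry()
job_a = registry.start("mon0", 1, "migrate-cancel")
job_b = registry.start("mon1", 1, "migrate-cancel")   # same serial, other monitor
registry.finish(job_a)
```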


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-06 11:31               ` Dr. David Alan Gilbert
  2017-09-06 11:54                 ` Daniel P. Berrange
@ 2017-09-07 13:59                 ` Eric Blake
  1 sibling, 0 replies; 104+ messages in thread
From: Eric Blake @ 2017-09-07 13:59 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Daniel P. Berrange
  Cc: Peter Xu, qemu-devel, Paolo Bonzini, Fam Zheng, Juan Quintela,
	mdroth, Laurent Vivier, Markus Armbruster


On 09/06/2017 06:31 AM, Dr. David Alan Gilbert wrote:

>> This does imply that you need a separate monitor I/O processing, from the
>> command execution thread, but I see no need for all commands to suddenly
>> become async. Just allowing interleaved replies is sufficient from the
>> POV of the protocol definition. This interleaving is easy to handle from
>> the client POV - just requires a unique 'serial' in the request by the
>> client, that is copied into the reply by QEMU.
> 
> OK, so for that we can just take Marc-André's syntax and call it 'id':
>   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> 
> then it's upto the caller to ensure those id's are unique.

We ALREADY support 'id', and it is already up to the caller to ensure
those id's are unique, even without Marc-André's additions.
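
For illustration, a sketch of how a client could use 'id' to match
replies that arrive in a different order than the requests.  The class
name and the choice of a simple counter for IDs are assumptions; it
relies only on the reply echoing the 'id' of the request, and on events
carrying an "event" key instead.

```python
import json


class ReplyMatcher:
    def __init__(self):
        self._next_id = 0
        self._pending = {}   # id -> command name still awaiting a reply

    def build_request(self, command, arguments=None):
        self._next_id += 1
        request = {"execute": command, "id": self._next_id}
        if arguments is not None:
            request["arguments"] = arguments
        self._pending[self._next_id] = command
        return json.dumps(request)

    def handle_line(self, line):
        message = json.loads(line)
        if "event" in message:
            # Events are not replies and are not matched against requests.
            return ("event", message["event"], message)
        command = self._pending.pop(message["id"])
        return ("reply", command, message)

    def pending(self):
        return sorted(self._pending)


matcher = ReplyMatcher()
first = json.loads(matcher.build_request("query-status"))
second = json.loads(matcher.build_request("migrate_cancel"))
# Replies arrive in the opposite order, with an event in between.
out_of_order = [
    matcher.handle_line('{"return": {}, "id": 2}'),
    matcher.handle_line('{"event": "STOP"}'),
    matcher.handle_line('{"return": {"running": true}, "id": 1}'),
]
```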

> 
> I do worry about two things:
>   a) With this the caller doesn't really know which commands could be
>   in parallel - for example if we've got a recovery command that's
>   executed by this non-locking thread that's OK, we expect that
>   to be doable in parallel.  If in the future though we do
>   what you initially suggested and have a bunch of commands get
>   routed to the migration thread (say) then those would suddenly
>   operate in parallel with other commands that we're previously
>   synchronous.

Presumably, all existing commands are NOT async, and introspection via
query-qmp-schema will let you query which new commands ARE async.  Or
existing commands will gain an optional parameter to opt-in to async
behavior for that command, defaulting to sync by default.  Thus, an old
libvirt will never call an async command, and never notice the
difference, but a new libvirt that is aware of async commands will opt
in to the commands that it wants to use in an async manner.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 12:59                     ` Markus Armbruster
  2017-09-07 13:22                       ` Daniel P. Berrange
@ 2017-09-07 14:20                       ` Dr. David Alan Gilbert
  2017-09-07 17:41                         ` Markus Armbruster
  1 sibling, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 14:20 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Peter Xu, Daniel P. Berrange, Laurent Vivier, Fam Zheng,
	Juan Quintela, qemu-devel, mdroth, Paolo Bonzini, John Snow

* Markus Armbruster (armbru@redhat.com) wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> >> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> >> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> >> > > This does imply that you need a separate monitor I/O processing, from the
> >> > > command execution thread, but I see no need for all commands to suddenly
> >> > > become async. Just allowing interleaved replies is sufficient from the
> >> > > POV of the protocol definition. This interleaving is easy to handle from
> >> > > the client POV - just requires a unique 'serial' in the request by the
> >> > > client, that is copied into the reply by QEMU.
> >> > 
> >> > OK, so for that we can just take Marc-André's syntax and call it 'id':
> >> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> >> > 
> >> > then it's upto the caller to ensure those id's are unique.
> >> 
> >> Libvirt has in fact generated a unique 'id' for every monitor command
> >> since day 1 of supporting QMP.
> >> 
> >> > I do worry about two things:
> >> >   a) With this the caller doesn't really know which commands could be
> >> >   in parallel - for example if we've got a recovery command that's
> >> >   executed by this non-locking thread that's OK, we expect that
> >> >   to be doable in parallel.  If in the future though we do
> >> >   what you initially suggested and have a bunch of commands get
> >> >   routed to the migration thread (say) then those would suddenly
> >> >   operate in parallel with other commands that we're previously
> >> >   synchronous.
> >> 
> >> We could still have an opt-in for async commands. eg default to executing
> >> all commands in the main thread, unless the client issues an explicit
> >> "make it async" command, to switch to allowing the migration thread to
> >> process it async.
> >> 
> >>  { "execute": "qmp_allow_async",
> >>    "data": { "commands": [
> >>        "migrate_cancel",
> >>    ] } }
> >> 
> >> 
> >>  { "return": { "commands": [
> >>        "migrate_cancel",
> >>    ] } }
> >> 
> >> The server response contains the subset of commands from the request
> >> for which async is supported.
> >> 
> >> That gives good negotiation ability going forward as we incrementally
> >> support async on more commands.
> >
> > I think this goes back to the discussion on which design we'd like to
> > choose.  IMHO the whole async idea plus the per-command-id is indeed
> > cleaner and nicer, and I believe that can benefit not only libvirt,
> 
> The following may be a bit harsh in places.  I apologize in advance.  A
> better writer than me wouldn't have to resort to that.  I've tried a few
> times to make my point that "async QMP" is neither necessary nor
> sufficient for monitor availability, but apparently without luck, since
> there's still talk like it was.  I hope this attempt will work.
> 
> > but also other QMP users.  The problem is, I have no idea how long
> > it'll take to let us have such a feature - I believe that will include
> > QEMU and Libvirt to both support that.  And it'll be a pity if the
> > postcopy recovery cannot work only because we cannot guarantee a
> > stable monitor.
> >
> > I'm curious whether there are other requirements (besides postcopy
> > recovery) that would want an always-alive monitor to run some
> > lock-free commands?  If there is, I'd be more inclined to first
> > provide a work-around solution like "-qmp-lockfree", and we can
> > provide a better solution afterwards until when the whole async QMP
> > work ready.
> 
> Yes, there are other requirements for "async QMP", and no, "async QMP"
> isn't a solution, but at best a part of a solution.
> 
> Before I talk about QMP requirements, I need to ask a whole raft of
> questions, because so far this thread feels like dreaming up grand
> designs with only superficial understanding of the subject matter.

I think Dan's suggestions are pretty good; while I preferred Peter's
implementation, I think Dan's will work fine and if that's good for
libvirt I'm OK with that.  I think we have a reasonable understanding
of the problem.

> Quite possibly because *my* understanding is superficial.  If yours
> isn't, great!  Go answer my questions :)
> 
> The root problem are main loop hangs.  QMP monitor hangs are merely a
> special case.
> 
> The main loop should not hang.  We've always violated that design
> assumption in places, e.g. in monitor commands that write to disk, and
> thus can hang indefinitely with NFS.  Post-copy adds more violations, as
> Stefan pointed out.
> 
> I can't say whether solving the special case "QMP monitor hangs" without
> also solving "main loop hangs" is useful.  A perfectly available QMP
> monitor buys you nothing if it feeds a command queue that isn't being
> emptied because its consumers all hang.

Correct.

> So, what exactly is going to drain the command queue?  If there's more
> than one consumer, how exactly are commands from the queue dispatched to
> the consumers?

The idea is to have 2 extra threads:
   a) An IO thread
   b) A thread that deals with non-blocking commands
   in addition to the existing main thread.

   The IO thread dispatches most commands to the main thread
but doesn't wait for the response.  When responses arrive it forwards
the response back.
   A class of commands is forwarded to the non-blocking command thread.

   More threads may be added in the future with some set of the commands
being moved off the main thread to these other threads.  Eventually
maybe no commands would be handled on the main thread.
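
A rough sketch of that routing, purely as an illustration and not as the proposed implementation: the names `NON_BLOCKING`, `main_queue`, `nonblocking_queue` and `io_thread_dispatch` are invented here, and which commands belong in the non-blocking set is an assumption.

```python
import queue

# Hypothetical set of commands known never to block; illustrative only.
NON_BLOCKING = {"migrate_cancel"}

main_queue = queue.Queue()         # drained by the existing main thread
nonblocking_queue = queue.Queue()  # drained by the non-blocking command thread

def io_thread_dispatch(cmd):
    """Route one parsed command and return at once, without waiting for a reply."""
    target = nonblocking_queue if cmd["execute"] in NON_BLOCKING else main_queue
    target.put(cmd)
    return target

io_thread_dispatch({"execute": "query-block", "id": 1})
io_thread_dispatch({"execute": "migrate_cancel", "id": 2})
```

The IO thread only classifies and enqueues, so a hung consumer on one queue cannot stop it from reading the next command.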

> What are the "no hang" guarantees (if any) and conditions for each of
> these consumers?

Commands sent to the main thread are as they are now.
The non-blocking-command thread *shall not block*, it will not access
guest memory, it won't take any lock that is taken by any other thread
that can block on the main thread or main memory.  Commands that run
on it can:
   a) Access state that can be read atomically - e.g. 
      'info status'
   b) Store parameters and then wake another thread
   c) Issue a non-blocking system call.


  In the case of postcopy recovery I see a command issued which starts
the new migration stream;  the command parses the path and makes sure
it's valid, and then stores it and kicks a recovery thread.
  In the case of a COLO failover I'd see something that does a
shutdown(2) on the migration stream.
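
To make pattern (b) concrete, here is a small sketch under stated assumptions: `RecoveryState`, `cmd_recover` and `recovery_thread` are hypothetical names, the URI check is a stand-in for real validation, and the lock is only ever held for a short assignment, never across I/O.

```python
import threading

class RecoveryState:
    def __init__(self):
        self.lock = threading.Lock()     # held only briefly, never across I/O
        self.uri = None
        self.wakeup = threading.Event()

def cmd_recover(state, uri):
    """Non-blocking command handler: validate, store, kick, return."""
    if not uri.startswith(("tcp:", "unix:")):
        return {"error": {"class": "GenericError", "desc": "unsupported URI"}}
    with state.lock:
        state.uri = uri
    state.wakeup.set()                   # wake the recovery thread, do not wait for it
    return {"return": {}}

def recovery_thread(state, result):
    state.wakeup.wait()
    with state.lock:
        uri = state.uri
    result.append(uri)                   # stand-in for the real, possibly blocking, work

state, result = RecoveryState(), []
worker = threading.Thread(target=recovery_thread, args=(state, result))
worker.start()
reply = cmd_recover(state, "tcp:192.0.2.1:4444")
worker.join(timeout=5)
```

The handler returns as soon as the parameters are stored; anything that might block happens in the woken thread.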

> We can have any number of QMP monitors today.  Would each of them feed
> its own queue?  Would they all feed a shared queue?

I see two queues; one which is the set of commands being forwarded
to the main thread, the other is the set of commands being forwarded
to the non-blocking thread.

> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
> command?

The command that Dan suggested is the opt-in; I think it's per monitor;
now we're starting to get a bit more fuzzy.

> What does it mean when an asynchronous command follows a synchronous
> command in the same QMP monitor?  I would expect the synchronous command
> to complete before the asynchronous command, because that's what
> synchronous means, isn't it?  To keep your QMP monitor available, you
> then must not send synchronous commands that can hang.

Once you opt-in, all commands operate in a semi-asynchronous fashion;
that is they don't block the IO thread, but at the same time there's
never any more than one command outstanding on any one thread.
You can issue any command you like; one command at a time waiting
for the response with the knowledge that you can then always
issue one of the non-blocking-commands after it.

> How can we determine whether a certain synchronous command can hang?
> Note that with opt-in async, *all* commands are also synchronous
> commands.

You regard all commands as blockable unless told otherwise.  The result
from Dan's command is a list of truly async commands.
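
As a sketch of that negotiation (assuming the 'qmp_allow_async' command from Dan's proposal, which does not exist in QEMU today; `ASYNC_CAPABLE` and `is_blockable` are names made up for this example):

```python
# Hypothetical server-side table of commands that can run asynchronously.
ASYNC_CAPABLE = {"migrate_cancel"}

def qmp_allow_async(requested):
    """Return the subset of the requested commands for which async is supported."""
    granted = [name for name in requested if name in ASYNC_CAPABLE]
    return {"return": {"commands": granted}}

def is_blockable(command, granted):
    """Client-side rule: treat a command as blockable unless it was granted."""
    return command not in granted

reply = qmp_allow_async(["migrate_cancel", "query-block"])
granted = set(reply["return"]["commands"])
```

A client that never sends the opt-in gets an empty granted set, so it keeps treating every command as blockable, which is today's behaviour.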

> In short, explain to me how exactly you plan to ensure that certain QMP
> commands (such as post-copy recovery) can always "get through", in the
> presence of multiple monitors, hanging main loop, hanging synchronous
> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.

Have I missed anything?

> 
> Now let's talk about QMP requirements.
> 
> Any addition to QMP must consider what exists already.

Yes.

> You may add more of the same.

Yes

> You may generalize existing stuff.

Yes

> You may change existing stuff if you have sufficient reason, subject to
> backward compatibility constraints.

Yes

> But attempts to add new ways to do the same old stuff without properly
> integrating the existing ways are not going to fly.

Agreed; that's why I'm following Dan's recommendations.

> In particular, any new way to start some job, monitor and control it
> while it lives, get notified about its state changes and so forth must
> integrate the existing ways.  These include block jobs (probably the
> most sophisticated of the lot), migration, dump-guest-memory, and
> possibly more.  They all work the same way: synchronous command to kick
> off the job, more synchronous commands to monitor and control, events to
> notify.  They do differ in detail.

And that's why we have the rule that all existing commands go onto the
main thread and only one of those is outstanding at a time.  That way
the actual behaviour of the existing commands doesn't change at all -
however you do require the 'id' field in the command to put into the
response so that you can distinguish the response of a command from each
thread.  Even if you enable async, if you don't use any of the
non-blocking commands the stream is just the same - send a command, get
a response, send a command, get a response....

> Asynchronous commands are a new way to do this.  When you only need to
> be notified on "done", and don't need to monitor / control, they fit the
> bill quite neatly.
> 
> However, we can't just ignore the cases where we need more than that!
> For those, we want a single generic solution instead of the several ad
> hoc solutions we have now.
> 
> If we add asynchronous commands *now*, and for simple cases only, we add
> yet another special case for a future generic solution to integrate.
> I'm not going to let that happen.
> 
> I figure the closest to a generic solution we have is block jobs.
> Perhaps a generic solution could be had by abstracting away the "block"
> from "block jobs", leaving just "jobs".

I don't know block jobs well enough to answer that.
I would suggest you could add a thread for asynchronous commands
and you could shuffle commands onto that thread as and when you feel
like it.

> Another approach is generalizing the asynchronous command proposal to
> fully cover the not-so-simple cases.
> 
> If you'd rather want to make progress on monitor availability without
> cracking the "jobs" problem, you're in luck!  Use your license to "add
> more of the same": synchronous command to start a job, query to monitor,
> event to notify.  
> 
> If you insist on tying your monitor availability solution to
> asynchronous commands, then I'm in luck!  I just found volunteers to
> solve the "jobs" problem for me.

I'm looking for minimal change here while keeping the door open for
the future, if there's anything you think we should do to make that
easy then tell us - but I'd rather this didn't turn into a 'fix all
known monitor problems' because frankly we may as well give up now.
So I don't see this as solving the 'jobs' problem, but if we can
do something to make it easier to solve in the future then let's do it.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 12:02           ` Peter Xu
@ 2017-09-07 16:53             ` Stefan Hajnoczi
  2017-09-07 17:14               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-07 16:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Dr. David Alan Gilbert, qemu-devel, Laurent Vivier, Fam Zheng,
	Juan Quintela, Markus Armbruster, Michael Roth, Paolo Bonzini

On Thu, Sep 7, 2017 at 1:02 PM, Peter Xu <peterx@redhat.com> wrote:
> On Thu, Sep 07, 2017 at 11:09:29AM +0100, Stefan Hajnoczi wrote:
>> On Thu, Sep 7, 2017 at 10:35 AM, Dr. David Alan Gilbert
>> <dgilbert@redhat.com> wrote:
>> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> >> On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
>> >> <dgilbert@redhat.com> wrote:
>> >> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> >> >> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
>> >> >> > The root problem is that, monitor commands are all handled in main
>> >> >> > loop thread now, no matter how many monitors we specify. And, if main
>> >> >> > loop thread hangs due to some reason, all monitors will be stuck.
>> >> >>
>> >> >> I see a larger issue with postcopy: existing QEMU code assumes that
>> >> >> guest memory access is instantaneous.
>> >> >>
>> >> >> Postcopy breaks this assumption and introduces blocking points that can
>> >> >> now take unbounded time.
>> >> >>
>> >> >> This problem isn't specific to the monitor.  It can also happen to other
>> >> >> components in QEMU like the gdbstub.
>> >> >>
>> >> >> Do we need an asynchronous memory API?  Synchronous memory access should
>> >> >> only be allowed in vcpu threads.
>> >> >
>> >> > It would probably be useful for gdbstub where the overhead of async
>> >> > doesn't matter;  but doing that for all IO emulation is hard.
>> >>
>> >> Why is it hard?
>> >>
>> >> Memory access can be synchronous in the vcpu thread.  That eliminates
>> >> a lot of code straight away.
>> >>
>> >> Anything using dma-helpers.c is already async.  They just don't know
>> >> that the memory access part is being made async too :).
>> >
>> > Can you point me to some info on that ?
>>
>> IDE and SCSI use dma-helpers.c to perform I/O:
>> hw/ide/core.c:892:        s->bus->dma->aiocb =
>> dma_blk_io(blk_get_aio_context(s->blk),
>> hw/ide/macio.c:189:        s->bus->dma->aiocb =
>> dma_blk_io(blk_get_aio_context(s->blk), &s->sg,
>> hw/scsi/scsi-disk.c:348:        r->req.aiocb =
>> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
>> hw/scsi/scsi-disk.c:551:        r->req.aiocb =
>> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
>>
>> They pass a scatter-gather list of guest RAM addresses to
>> dma-helpers.c.  They receive a callback when I/O has finished.
>>
>> Try following the code path.  Request submission may be from a vcpu
>> thread or IOThread.  Completion occurs in the main loop or an
>> IOThread.
>>
>> The main point is that this API is already asynchronous.  If any
>> changes are needed for async guest memory access (not sure, I haven't
>> checked), then at least the dma-helpers.c users do not need to be
>> modified.
>>
>> >> The remaining cases are virtio and some other devices.
>> >>
>> >> If you are worried about performance, the first rule is that async
>> >> memory access is only needed on the destination side when post-copy is
>> >> active.  Maybe use setjmp to return from the signal handler and queue
>> >> a callback for when the page has been loaded.
>> >
>> > I'm not sure it's worth trying to be too clever at avoiding this;
>> > I see the fact that we're doing IO with the bql held as a more
>> > fundamental problem.
>>
>> QEMU should be doing I/O syscalls in async fashion or threadpool
>> workers (no BQL) so the BQL is not an issue.  Anything else could
>> cause unbounded waits even without postcopy.
>
> E.g. when vcpu got page faulted with BQL taken, while the main thread
> needs the BQL to dispatch anything, including monitor commands.
>
> So I think it's a multiplex problem - we need to solve both (1) main
> thread accessing guest memories which is still missing, and (2) BQL
> deadlocks between vcpu threads and main thread.

I think we need a single solution and cannot treat these as separate.
This is because the same virtio device emulation code may run in 3
contexts:
1. vcpu thread (ioeventfd=off)
2. main loop thread (ioeventfd=on)
3. IOThread (ioeventfd=on, iothread=<id>)

If you try to solve them separately then the code won't work in all 3
contexts anymore.
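
The mapping from device configuration to execution context can be written down as a small function. This is only a restatement of the list above; `virtio_handler_context` is a hypothetical helper, not a QEMU function.

```python
def virtio_handler_context(ioeventfd, iothread=None):
    """Return the context in which the virtio emulation code would run."""
    if not ioeventfd:
        return "vcpu thread"           # case 1: ioeventfd=off
    if iothread is None:
        return "main loop thread"      # case 2: ioeventfd=on
    return f"IOThread {iothread}"      # case 3: ioeventfd=on, iothread=<id>

contexts = [
    virtio_handler_context(ioeventfd=False),
    virtio_handler_context(ioeventfd=True),
    virtio_handler_context(ioeventfd=True, iothread="iothread0"),
]
```

The same handler code is reached from all three return values, which is why a fix that only covers one of them is not enough.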

Stefan

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 16:53             ` Stefan Hajnoczi
@ 2017-09-07 17:14               ` Dr. David Alan Gilbert
  2017-09-07 17:35                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 17:14 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Xu, Laurent Vivier, Fam Zheng, Michael Roth, Juan Quintela,
	qemu-devel, Markus Armbruster, Paolo Bonzini

* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Thu, Sep 7, 2017 at 1:02 PM, Peter Xu <peterx@redhat.com> wrote:
> > On Thu, Sep 07, 2017 at 11:09:29AM +0100, Stefan Hajnoczi wrote:
> >> On Thu, Sep 7, 2017 at 10:35 AM, Dr. David Alan Gilbert
> >> <dgilbert@redhat.com> wrote:
> >> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> >> On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
> >> >> <dgilbert@redhat.com> wrote:
> >> >> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> >> >> >> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
> >> >> >> > The root problem is that, monitor commands are all handled in main
> >> >> >> > loop thread now, no matter how many monitors we specify. And, if main
> >> >> >> > loop thread hangs due to some reason, all monitors will be stuck.
> >> >> >>
> >> >> >> I see a larger issue with postcopy: existing QEMU code assumes that
> >> >> >> guest memory access is instantaneous.
> >> >> >>
> >> >> >> Postcopy breaks this assumption and introduces blocking points that can
> >> >> >> now take unbounded time.
> >> >> >>
> >> >> >> This problem isn't specific to the monitor.  It can also happen to other
> >> >> >> components in QEMU like the gdbstub.
> >> >> >>
> >> >> >> Do we need an asynchronous memory API?  Synchronous memory access should
> >> >> >> only be allowed in vcpu threads.
> >> >> >
> >> >> > It would probably be useful for gdbstub where the overhead of async
> >> >> > doesn't matter;  but doing that for all IO emulation is hard.
> >> >>
> >> >> Why is it hard?
> >> >>
> >> >> Memory access can be synchronous in the vcpu thread.  That eliminates
> >> >> a lot of code straight away.
> >> >>
> >> >> Anything using dma-helpers.c is already async.  They just don't know
> >> >> that the memory access part is being made async too :).
> >> >
> >> > Can you point me to some info on that ?
> >>
> >> IDE and SCSI use dma-helpers.c to perform I/O:
> >> hw/ide/core.c:892:        s->bus->dma->aiocb =
> >> dma_blk_io(blk_get_aio_context(s->blk),
> >> hw/ide/macio.c:189:        s->bus->dma->aiocb =
> >> dma_blk_io(blk_get_aio_context(s->blk), &s->sg,
> >> hw/scsi/scsi-disk.c:348:        r->req.aiocb =
> >> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
> >> hw/scsi/scsi-disk.c:551:        r->req.aiocb =
> >> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
> >>
> >> They pass a scatter-gather list of guest RAM addresses to
> >> dma-helpers.c.  They receive a callback when I/O has finished.
> >>
> >> Try following the code path.  Request submission may be from a vcpu
> >> thread or IOThread.  Completion occurs in the main loop or an
> >> IOThread.
> >>
> >> The main point is that this API is already asynchronous.  If any
> >> changes are needed for async guest memory access (not sure, I haven't
> >> checked), then at least the dma-helpers.c users do not need to be
> >> modified.
> >>
> >> >> The remaining cases are virtio and some other devices.
> >> >>
> >> >> If you are worried about performance, the first rule is that async
> >> >> memory access is only needed on the destination side when post-copy is
> >> >> active.  Maybe use setjmp to return from the signal handler and queue
> >> >> a callback for when the page has been loaded.
> >> >
> >> > I'm not sure it's worth trying to be too clever at avoiding this;
> >> > I see the fact that we're doing IO with the bql held as a more
> >> > fundamental problem.
> >>
> >> QEMU should be doing I/O syscalls in async fashion or threadpool
> >> workers (no BQL) so the BQL is not an issue.  Anything else could
> >> cause unbounded waits even without postcopy.
> >
> > E.g. when vcpu got page faulted with BQL taken, while the main thread
> > needs the BQL to dispatch anything, including monitor commands.
> >
> > So I think it's a multiplex problem - we need to solve both (1) main
> > thread accessing guest memories which is still missing, and (2) BQL
> > deadlocks between vcpu threads and main thread.
> 
> I think we need a single solution and cannot treat these as separate.
> This is because the same virtio device emulation code may run in 3
> contexts:
> 1. vcpu thread (ioeventfd=off)
> 2. main loop thread (ioeventfd=on)
> 3. IOThread (ioeventfd=on, iothread=<id>)
> 
> If you try to solve them separately then the code won't work in all 3
> contexts anymore.

I think you can also get main loop thread hangs on things like
network packet reception.

Dave

> 
> Stefan
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 17:14               ` Dr. David Alan Gilbert
@ 2017-09-07 17:35                 ` Stefan Hajnoczi
  0 siblings, 0 replies; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-07 17:35 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Xu, Laurent Vivier, Fam Zheng, Michael Roth, Juan Quintela,
	qemu-devel, Markus Armbruster, Paolo Bonzini

On Thu, Sep 7, 2017 at 6:14 PM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> On Thu, Sep 7, 2017 at 1:02 PM, Peter Xu <peterx@redhat.com> wrote:
>> > On Thu, Sep 07, 2017 at 11:09:29AM +0100, Stefan Hajnoczi wrote:
>> >> On Thu, Sep 7, 2017 at 10:35 AM, Dr. David Alan Gilbert
>> >> <dgilbert@redhat.com> wrote:
>> >> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> >> >> On Wed, Sep 6, 2017 at 4:14 PM, Dr. David Alan Gilbert
>> >> >> <dgilbert@redhat.com> wrote:
>> >> >> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
>> >> >> >> On Wed, Aug 23, 2017 at 02:51:03PM +0800, Peter Xu wrote:
>> >> >> >> > The root problem is that, monitor commands are all handled in main
>> >> >> >> > loop thread now, no matter how many monitors we specify. And, if main
>> >> >> >> > loop thread hangs due to some reason, all monitors will be stuck.
>> >> >> >>
>> >> >> >> I see a larger issue with postcopy: existing QEMU code assumes that
>> >> >> >> guest memory access is instantaneous.
>> >> >> >>
>> >> >> >> Postcopy breaks this assumption and introduces blocking points that can
>> >> >> >> now take unbounded time.
>> >> >> >>
>> >> >> >> This problem isn't specific to the monitor.  It can also happen to other
>> >> >> >> components in QEMU like the gdbstub.
>> >> >> >>
>> >> >> >> Do we need an asynchronous memory API?  Synchronous memory access should
>> >> >> >> only be allowed in vcpu threads.
>> >> >> >
>> >> >> > It would probably be useful for gdbstub where the overhead of async
>> >> >> > doesn't matter;  but doing that for all IO emulation is hard.
>> >> >>
>> >> >> Why is it hard?
>> >> >>
>> >> >> Memory access can be synchronous in the vcpu thread.  That eliminates
>> >> >> a lot of code straight away.
>> >> >>
>> >> >> Anything using dma-helpers.c is already async.  They just don't know
>> >> >> that the memory access part is being made async too :).
>> >> >
>> >> > Can you point me to some info on that ?
>> >>
>> >> IDE and SCSI use dma-helpers.c to perform I/O:
>> >> hw/ide/core.c:892:        s->bus->dma->aiocb =
>> >> dma_blk_io(blk_get_aio_context(s->blk),
>> >> hw/ide/macio.c:189:        s->bus->dma->aiocb =
>> >> dma_blk_io(blk_get_aio_context(s->blk), &s->sg,
>> >> hw/scsi/scsi-disk.c:348:        r->req.aiocb =
>> >> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
>> >> hw/scsi/scsi-disk.c:551:        r->req.aiocb =
>> >> dma_blk_io(blk_get_aio_context(s->qdev.conf.blk),
>> >>
>> >> They pass a scatter-gather list of guest RAM addresses to
>> >> dma-helpers.c.  They receive a callback when I/O has finished.
>> >>
>> >> Try following the code path.  Request submission may be from a vcpu
>> >> thread or IOThread.  Completion occurs in the main loop or an
>> >> IOThread.
>> >>
>> >> The main point is that this API is already asynchronous.  If any
>> >> changes are needed for async guest memory access (not sure, I haven't
>> >> checked), then at least the dma-helpers.c users do not need to be
>> >> modified.
>> >>
>> >> >> The remaining cases are virtio and some other devices.
>> >> >>
>> >> >> If you are worried about performance, the first rule is that async
>> >> >> memory access is only needed on the destination side when post-copy is
>> >> >> active.  Maybe use setjmp to return from the signal handler and queue
>> >> >> a callback for when the page has been loaded.
>> >> >
>> >> > I'm not sure it's worth trying to be too clever at avoiding this;
>> >> > I see the fact that we're doing IO with the bql held as a more
>> >> > fundamental problem.
>> >>
>> >> QEMU should be doing I/O syscalls in async fashion or threadpool
>> >> workers (no BQL) so the BQL is not an issue.  Anything else could
>> >> cause unbounded waits even without postcopy.
>> >
>> > E.g. when vcpu got page faulted with BQL taken, while the main thread
>> > needs the BQL to dispatch anything, including monitor commands.
>> >
>> > So I think it's a multiplex problem - we need to solve both (1) main
>> > thread accessing guest memories which is still missing, and (2) BQL
>> > deadlocks between vcpu threads and main thread.
>>
>> I think we need a single solution and cannot treat these as separate.
>> This is because the same virtio device emulation code may run in 3
>> contexts:
>> 1. vcpu thread (ioeventfd=off)
>> 2. main loop thread (ioeventfd=on)
>> 3. IOThread (ioeventfd=on, iothread=<id>)
>>
>> If you try to solve them separately then the code won't work in all 3
>> contexts anymore.
>
> I think you can also get main loop thread hangs on things like
> network packet reception.

That is case #2.  The QEMU net subsystem reads receive packets into a
temporary buffer (it's not zero-copy) and invokes the virtio-net
receive handler function from the main loop.

Stefan

^ permalink raw reply	[flat|nested] 104+ messages in thread

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 13:22                       ` Daniel P. Berrange
@ 2017-09-07 17:41                         ` Markus Armbruster
  2017-09-07 18:09                           ` Dr. David Alan Gilbert
  2017-09-08  9:27                           ` Daniel P. Berrange
  0 siblings, 2 replies; 104+ messages in thread
From: Markus Armbruster @ 2017-09-07 17:41 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, mdroth, Peter Xu,
	qemu-devel, Paolo Bonzini, John Snow, Dr. David Alan Gilbert

"Daniel P. Berrange" <berrange@redhat.com> writes:

> On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
>> So, what exactly is going to drain the command queue?  If there's more
>> than one consumer, how exactly are commands from the queue dispatched to
>> the consumers?
>
> In terms of my proposal, for any single command there should only ever
> be a single consumer. The default consumer would be the main event loop
> thread, such that we have no semantic change to QMP operation from today.
>
> Some commands that are capable of being made "async", would have a
> different consumer. For example, if the client requested the 'migrate-cancel'
> be made async, this would change things such that the migration thread is
> now responsible for consuming the "migrate-cancel" command, instead of the
> default main loop.
>
>> What are the "no hang" guarantees (if any) and conditions for each of
>> these consumers?
>
> The non-main thread consumers would have to have some reasonable
> guarantee that they won't block on a lock held by the main loop,
> otherwise the whole feature is largely useless.

Same if they block indefinitely on anything else, actually.  In other
words, we need to talk about liveness.

Threads by themselves don't buy us liveness.  Being careful with
operations that may block does.  That care may lead to farming out
certain operations to other threads, where they may block without harm.

You only talk about "the non-main thread consumers".  What about the
main thread?  Is it okay for the main thread to block?  If yes, why?

>> We can have any number of QMP monitors today.  Would each of them feed
>> its own queue?  Would they all feed a shared queue?
>
> Currently with multiple QMP monitors, everything runs in the main
> loop, so commands arriving across  multiple monitors are 100%
> serialized and processed strictly in the order in which QEMU reads
> them off the wire.  To maintain these semantics, we would need to
> have a single shared queue for the default main loop consumer, so
> that ordering does not change.
>
>> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
>> command?
>
> Per monitor+command. ie just because libvirt knows how to cope with
> async execution on the monitor it has open, does not mean that a
> different app on the 2nd monitor command can cope. So in my proposal
> the switch to async must be scoped to the particular command only
> for the monitor connection that requested it.
>
>> What does it mean when an asynchronous command follows a synchronous
>> command in the same QMP monitor?  I would expect the synchronous command
>> to complete before the asynchronous command, because that's what
>> synchronous means, isn't it?  To keep your QMP monitor available, you
>> then must not send synchronous commands that can hang.
>
> No, that is not what I described. All synchronous commands are
> serialized wrt each other, just as today. An asynchronous command
> can run as soon as it is received, regardless of whether any
> earlier sent sync commands are still executing or pending. This
> is trivial to achieve when you separate monitor I/O from command
> execution in separate threads, provided of course the async
> command consumers are not in the main loop.

So, a synchronous command is synchronous with respect to other commands,
except for certain non-blocking commands.  The distinctive feature of
the latter isn't so much an asynchronous reply, but out-of-band
dispatch.

Out-of-band dispatch of commands that cannot block is in fact orthogonal to
asynchronous replies.  I can't see why out-of-band dispatch of
synchronous non-blocking commands wouldn't work, too.
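
A toy model of that, to show the two properties are independent: the out-of-band command below replies synchronously, yet it is not held up by a stuck in-band command. All names (`main_blocked`, `in_band`, `monitor_receive`) are invented for this sketch, and the blocking is simulated with an event.

```python
import queue
import threading

main_blocked = threading.Event()   # simulates a hung main loop until set()
in_band = queue.Queue()
replies = queue.Queue()

def main_loop():
    """In-band consumer: runs commands strictly in arrival order."""
    while True:
        cmd = in_band.get()
        if cmd is None:
            break
        main_blocked.wait()                       # stand-in for a blocking operation
        replies.put({"return": {}, "id": cmd["id"]})

def monitor_receive(cmd, oob_commands):
    """Monitor I/O: run non-blocking commands out of band, queue the rest."""
    if cmd["execute"] in oob_commands:
        replies.put({"return": {}, "id": cmd["id"]})  # synchronous reply, no queueing
    else:
        in_band.put(cmd)

consumer = threading.Thread(target=main_loop)
consumer.start()
monitor_receive({"execute": "query-block", "id": 1}, {"migrate_cancel"})
monitor_receive({"execute": "migrate_cancel", "id": 2}, {"migrate_cancel"})
first = replies.get(timeout=5)     # arrives while command 1 is still stuck
main_blocked.set()                 # let the in-band command finish
second = replies.get(timeout=5)
in_band.put(None)
consumer.join(timeout=5)
```

Nothing here needs an asynchronous reply format; the only change is where the command is dispatched.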

>> How can we determine whether a certain synchronous command can hang?
>> Note that with opt-in async, *all* commands are also synchronous
>> commands.
>> 
>> In short, explain to me how exactly you plan to ensure that certain QMP
>> commands (such as post-copy recovery) can always "get through", in the
>> presence of multiple monitors, hanging main loop, hanging synchronous
>> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
>
> Taking migrate-cancel as the example. The migration code already has
> a background thread doing work independently of the main loop. Upon
> marking the migrate-cancel command as async, the migration control
> thread would become the consumer of migrate-cancel.

From 30,000 feet, the QMP monitor sends a "cancel" message to the
migration thread, and later receives a "canceled" message from the
migration thread.

From 300 feet, we use the migrate-cancel QMP command as the cancel
message, and its success response as the "canceled" message.

In other words, we're pressing the external QM-Protocol into service as
an internal message passing protocol.

>                                                     This allows the
> migration operation to be cancelled immediately, regardless of whether
> there are earlier monitor commands blocked in the main loop.

The necessary part is moving all operations that can block out of
whatever loop runs the monitor, be it the main loop, some other event
loop, or a dedicated monitor thread's monitor loop.

Moving out non-blocking operations isn't necessary.  migrate-cancel
could communicate with the migration thread by any suitable mechanism or
protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?
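
As an illustration of "any suitable mechanism", here is a minimal sketch
in which the monitor-side handler merely sets a flag that the worker
thread polls.  MigrationThread and its fields are hypothetical stand-ins,
not QEMU's actual migration code.

```python
import threading


class MigrationThread(threading.Thread):
    """Hypothetical stand-in for the migration thread."""

    def __init__(self):
        super().__init__()
        self.cancel_requested = threading.Event()
        self.state = "active"

    def run(self):
        # Event.wait() returns True once the flag is set, so the loop
        # ends promptly after a cancel request.
        while not self.cancel_requested.wait(timeout=0.01):
            pass  # one bounded slice of migration work would go here
        self.state = "cancelled"


def migrate_cancel(migration):
    """Monitor-side handler: set a flag, never wait for the thread."""
    migration.cancel_requested.set()
    return {}


def demo_cancel():
    migration = MigrationThread()
    migration.start()
    reply = migrate_cancel(migration)   # returns at once
    migration.join(timeout=5)           # only the demo waits, not the monitor
    return reply, migration.state
```

The handler cannot block, so it could run in whatever loop runs the
monitor; the state change would then be reported separately, e.g. by a
query command or an event.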

> Of course this assumes the migration control thread can't block
> for locks held by the main thread.

Thanks for your answers, they help.

>> Now let's talk about QMP requirements.
>> 
>> Any addition to QMP must consider what exists already.
>> 
>> You may add more of the same.
>> 
>> You may generalize existing stuff.
>> 
>> You may change existing stuff if you have sufficient reason, subject to
>> backward compatibility constraints.
>> 
>> But attempts to add new ways to do the same old stuff without properly
>> integrating the existing ways are not going to fly.
>> 
>> In particular, any new way to start some job, monitor and control it
>> while it lives, get notified about its state changes and so forth must
>> integrate the existing ways.  These include block jobs (probably the
>> most sophisticated of the lot), migration, dump-guest-memory, and
>> possibly more.  They all work the same way: synchronous command to kick
>> off the job, more synchronous commands to monitor and control, events to
>> notify.  They do differ in detail.
>> 
>> Asynchronous commands are a new way to do this.  When you only need to
>> be notified on "done", and don't need to monitor / control, they fit the
>> bill quite neatly.
>> 
>> However, we can't just ignore the cases where we need more than that!
>> For those, we want a single generic solution instead of the several ad
>> hoc solutions we have now.
>> 
>> If we add asynchronous commands *now*, and for simple cases only, we add
>> yet another special case for a future generic solution to integrate.
>> I'm not going to let that happen.
>
> With the async commands suggestion, while it would initially not
> provide a way to query incremental status, that could easily be
> fitted in.

This is [*] below.

>             Because command replies from async commands may be
> out-of-order wrt the original requests, clients would need to
> provide a unique ID for each command run. This originally was
> part of QMP spec but then dropped, but libvirt still actually
> generates a unique ID for every QMP command.
>
> Given this, one option is to actually use the QMP command ID as
> a job ID, and let you query ongoing status via some new QMP
> command that accepts the ID of the job to be queried. A complexity
> with this is how to make the jobs visible across multiple QMP
> monitors. The job ID might actually have to be a combination of
> the serial ID from the QMP command, and the ID of the monitor
> chardev combined.

Yes.  The job ID must be unique across all QMP monitors to make
broadcast notifications work.
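
A minimal sketch of such a combined ID, assuming each monitor has a
unique name of its own (here "mon0" and "mon1", both made up) and that
the client supplies an 'id' with every command.  JobId, make_job_id and
JobTable are hypothetical names.

```python
import json
from typing import NamedTuple


class JobId(NamedTuple):
    monitor: str   # id of the monitor the command arrived on
    command: str   # canonical JSON text of the command's 'id'


def make_job_id(monitor_id, command_id):
    # QMP lets 'id' be any JSON value; serialising it keeps 1 and "1" apart.
    return JobId(monitor_id, json.dumps(command_id, sort_keys=True))


class JobTable:
    """Jobs from all monitors, keyed by the combined ID."""

    def __init__(self):
        self._jobs = {}

    def register(self, monitor_id, command_id, description):
        job_id = make_job_id(monitor_id, command_id)
        if job_id in self._jobs:
            raise ValueError(f"duplicate job id {job_id}")
        self._jobs[job_id] = description
        return job_id

    def query(self, job_id):
        return self._jobs[job_id]
```

Two monitors that both happen to use command ID 7 then still get distinct
job IDs, while a client reusing an ID on its own monitor is rejected.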

>> I figure the closest to a generic solution we have is block jobs.
>> Perhaps a generic solution could be had by abstracting away the "block"
>> from "block jobs", leaving just "jobs".

[*] starts here:

>> Another approach is generalizing the asynchronous command proposal to
>> fully cover the not-so-simple cases.

We know asynchronous commands "fully cover" when we can use them to
replace all the existing job-like commands.

Until then, they enlarge rather than solve our jobs problem.

I get the need for an available monitor.  But I need to balance it with
other needs.  Can we find a solution for our monitor availability
problem that doesn't enlarge our jobs problem?

>> If you'd rather want to make progress on monitor availability without
>> cracking the "jobs" problem, you're in luck!  Use your license to "add
>> more of the same": synchronous command to start a job, query to monitor,
>> event to notify.  
>> 
>> If you insist on tying your monitor availability solution to
>> asynchronous commands, then I'm in luck!  I just found volunteers to
>> solve the "jobs" problem for me.


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 14:20                       ` Dr. David Alan Gilbert
@ 2017-09-07 17:41                         ` Markus Armbruster
  2017-09-07 18:04                           ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Markus Armbruster @ 2017-09-07 17:41 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, qemu-devel, Peter Xu,
	mdroth, Paolo Bonzini, John Snow

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
>> >> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
>> >> > * Daniel P. Berrange (berrange@redhat.com) wrote:
>> >> > > This does imply that you need a separate monitor I/O processing, from the
>> >> > > command execution thread, but I see no need for all commands to suddenly
>> >> > > become async. Just allowing interleaved replies is sufficient from the
>> >> > > POV of the protocol definition. This interleaving is easy to handle from
>> >> > > the client POV - just requires a unique 'serial' in the request by the
>> >> > > client, that is copied into the reply by QEMU.
>> >> > 
>> >> > OK, so for that we can just take Marc-André's syntax and call it 'id':
>> >> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
>> >> > 
>> >> > then it's up to the caller to ensure those ids are unique.
>> >> 
>> >> Libvirt has in fact generated a unique 'id' for every monitor command
>> >> since day 1 of supporting QMP.
>> >> 
>> >> > I do worry about two things:
>> >> >   a) With this the caller doesn't really know which commands could be
>> >> >   in parallel - for example if we've got a recovery command that's
>> >> >   executed by this non-locking thread that's OK, we expect that
>> >> >   to be doable in parallel.  If in the future though we do
>> >> >   what you initially suggested and have a bunch of commands get
>> >> >   routed to the migration thread (say) then those would suddenly
>> >> >   operate in parallel with other commands that were previously
>> >> >   synchronous.
>> >> 
>> >> We could still have an opt-in for async commands. eg default to executing
>> >> all commands in the main thread, unless the client issues an explicit
>> >> "make it async" command, to switch to allowing the migration thread to
>> >> process it async.
>> >> 
>> >>  { "execute": "qmp_allow_async",
>> >>    "data": { "commands": [
>> >>        "migrate_cancel"
>> >>    ] } }
>> >> 
>> >> 
>> >>  { "return": { "commands": [
>> >>        "migrate_cancel"
>> >>    ] } }
>> >> 
>> >> The server response contains the subset of commands from the request
>> >> for which async is supported.
>> >> 
>> >> That gives good negotiation ability going forward as we incrementally
>> >> support async on more commands.
>> >
>> > I think this goes back to the discussion on which design we'd like to
>> > choose.  IMHO the whole async idea plus the per-command-id is indeed
>> > cleaner and nicer, and I believe that can benefit not only libvirt,
>> 
>> The following may be a bit harsh in places.  I apologize in advance.  A
>> better writer than me wouldn't have to resort to that.  I've tried a few
>> times to make my point that "async QMP" is neither necessary nor
>> sufficient for monitor availability, but apparently without luck, since
>> there's still talk like it was.  I hope this attempt will work.
>> 
>> > but also other QMP users.  The problem is, I have no idea how long
>> > it'll take to let us have such a feature - I believe that will include
>> > QEMU and Libvirt to both support that.  And it'll be a pity if the
>> > postcopy recovery cannot work only because we cannot guarantee a
>> > stable monitor.
>> >
>> > I'm curious whether there are other requirements (besides postcopy
>> > recovery) that would want an always-alive monitor to run some
>> > lock-free commands?  If there is, I'd be more inclined to first
>> > provide a work-around solution like "-qmp-lockfree", and we can
>> > provide a better solution afterwards, once the whole async QMP
>> > work is ready.
>> 
>> Yes, there are other requirements for "async QMP", and no, "async QMP"
>> isn't a solution, but at best a part of a solution.
>> 
>> Before I talk about QMP requirements, I need to ask a whole raft of
>> questions, because so far this thread feels like dreaming up grand
>> designs with only superficial understanding of the subject matter.
>
> I think Dan's suggestions are pretty good; while I preferred Peter's
> implementation, I think Dan's will work fine and if that's good for
> libvirt I'm OK with that.  I think we have a reasonable understanding
> of the problem.
>
>> Quite possibly because *my* understanding is superficial.  If yours
>> isn't, great!  Go answer my questions :)
>> 
>> The root problem is main loop hangs.  QMP monitor hangs are merely a
>> special case.
>> 
>> The main loop should not hang.  We've always violated that design
>> assumption in places, e.g. in monitor commands that write to disk, and
>> thus can hang indefinitely with NFS.  Post-copy adds more violations, as
>> Stefan pointed out.
>> 
>> I can't say whether solving the special case "QMP monitor hangs" without
>> also solving "main loop hangs" is useful.  A perfectly available QMP
>> monitor buys you nothing if it feeds a command queue that isn't being
>> emptied because its consumers all hang.
>
> Correct.
>
>> So, what exactly is going to drain the command queue?  If there's more
>> than one consumer, how exactly are commands from the queue dispatched to
>> the consumers?
>
> The idea is to have 2 extra threads:
>    a) An IO thread
>    b) A thread that deals with non-blocking commands

These are the two extra threads, and ...

>    the existing main thread.

... they are "extras" to the existing main thread, right?

>    The IO thread dispatches most commands to the main thread
> but doesn't wait for the response.  When responses arrive it forwards
> the response back.

The QMP monitor runs in this I/O thread?

>    A class of commands is forwarded to the non-blocking command thread.

Since the non-blocking commands by definition don't block, why can't we
simply execute them in the I/O thread?

>    More threads may be added in the future with some set of the commands
> being moved off the main thread to these other threads.  Eventually
> maybe no commands would be handled on the main thread.
>
>> What are the "no hang" guarantees (if any) and conditions for each of
>> these consumers?
>
> Commands sent to the main thread are as they are now.
> The non-blocking-command thread *shall not block*, it will not access
> guest memory, it won't take any lock that is taken by any other thread
> that can block on the main thread or main memory.  Commands that run
> on it can:
>    a) Access state that can be read atomically - e.g. 
>       'info status'
>    b) Store parameters and then wake another thread
>    c) Issue a non-blocking system call.
>
>
>   In the case of postcopy recovery I see a command issued which starts
> the new migration stream;  the command parses the path and makes sure
> it's valid, and then stores it and kicks a recovery thread.
>   In the case of a COLO failover I'd see something that does a
> shutdown(2) on the migration stream.
>
>> We can have any number of QMP monitors today.  Would each of them feed
>> its own queue?  Would they all feed a shared queue?
>
> I see two queues; one which is the set of commands being forwarded
> to the main thread, the other is the set of commands being forwarded
> to the non-blocking thread.
>
>> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
>> command?
>
> The command that Dan suggested is the opt-in; I think it's per monitor;
> now we're starting to get a bit more fuzzy.
>
>> What does it mean when an asynchronous command follows a synchronous
>> command in the same QMP monitor?  I would expect the synchronous command
>> to complete before the asynchronous command, because that's what
>> synchronous means, isn't it?  To keep your QMP monitor available, you
>> then must not send synchronous commands that can hang.
>
> Once you opt-in, all commands operate in a semi-asynchronous fashion;
> that is they don't block the IO thread, but at the same time there's
> never any more than one command outstanding on any one thread.
> You can issue any command you like; one command at a time waiting
> for the response with the knowledge that you can then always
> issue one of the non-blocking-commands after it.
>
>> How can we determine whether a certain synchronous command can hang?
>> Note that with opt-in async, *all* commands are also synchronous
>> commands.
>
> You regard all commands as blockable unless told otherwise.  The result
> from Dan's command is a list of truly async commands.
>
>> In short, explain to me how exactly you plan to ensure that certain QMP
>> commands (such as post-copy recovery) can always "get through", in the
>> presence of multiple monitors, hanging main loop, hanging synchronous
>> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
>
> Have I missed anything?

I'm trying to square this with Dan's reply, but it's probably too late
in my day for me to succeed.

>> Now let's talk about QMP requirements.
>> 
>> Any addition to QMP must consider what exists already.
>
> Yes.
>
>> You may add more of the same.
>
> Yes
>
>> You may generalize existing stuff.
>
> Yes
>
>> You may change existing stuff if you have sufficient reason, subject to
>> backward compatibility constraints.
>
> Yes
>
>> But attempts to add new ways to do the same old stuff without properly
>> integrating the existing ways are not going to fly.
>
> Agreed; that's why I'm following Dan's recommendations.
>
>> In particular, any new way to start some job, monitor and control it
>> while it lives, get notified about its state changes and so forth must
>> integrate the existing ways.  These include block jobs (probably the
>> most sophisticated of the lot), migration, dump-guest-memory, and
>> possibly more.  They all work the same way: synchronous command to kick
>> off the job, more synchronous commands to monitor and control, events to
>> notify.  They do differ in detail.
>
> And that's why we have the rule that all existing commands go onto the
> main thread and only one of those is outstanding at a time.  That way
> the actual behaviour of the existing commands doesn't change at all -
> however you do require the 'id' field in the command to put into the
> response so that you can distinguish the response of a command from each
> thread.  Even if you enable async, if you don't use any of the
> non-blocking commands the stream is just the same - send a command, get
> a response, send a command, get a response....
>
>> Asynchronous commands are a new way to do this.  When you only need to
>> be notified on "done", and don't need to monitor / control, they fit the
>> bill quite neatly.
>> 
>> However, we can't just ignore the cases where we need more than that!
>> For those, we want a single generic solution instead of the several ad
>> hoc solutions we have now.
>> 
>> If we add asynchronous commands *now*, and for simple cases only, we add
>> yet another special case for a future generic solution to integrate.
>> I'm not going to let that happen.
>> 
>> I figure the closest to a generic solution we have is block jobs.
>> Perhaps a generic solution could be had by abstracting away the "block"
>> from "block jobs", leaving just "jobs".
>
> I don't know block jobs well enough to answer that.
> I would suggest you could add a thread for asynchronous commands
> and you could shuffle commands onto that thread as and when you feel
> like it.
>
>> Another approach is generalizing the asynchronous command proposal to
>> fully cover the not-so-simple cases.
>> 
>> If you'd rather want to make progress on monitor availability without
>> cracking the "jobs" problem, you're in luck!  Use your license to "add
>> more of the same": synchronous command to start a job, query to monitor,
>> event to notify.  
>> 
>> If you insist on tying your monitor availability solution to
>> asynchronous commands, then I'm in luck!  I just found volunteers to
>> solve the "jobs" problem for me.
>
> I'm looking for minimal change here while keeping the door open for
> the future, if there's anything you think we should do to make that
> easy then tell us - but I'd rather this didn't turn into a 'fix all
> known monitor problems' because frankly we may as well give up now.
> So I don't see this as solving the 'jobs' problem, but if we can
> do something to make it easier to solve in the future then let's do it.

Forget about asynchronous commands, jobs and the whole shebang of
distractions, and consider what you really need: I suspect it could be
out-of-band dispatch of non-blocking commands.  More on that in my reply
to Daniel.


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 17:41                         ` Markus Armbruster
@ 2017-09-07 18:04                           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 18:04 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, qemu-devel, Peter Xu,
	mdroth, Paolo Bonzini, John Snow

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > On Wed, Sep 06, 2017 at 12:54:28PM +0100, Daniel P. Berrange wrote:
> >> >> On Wed, Sep 06, 2017 at 12:31:58PM +0100, Dr. David Alan Gilbert wrote:
> >> >> > * Daniel P. Berrange (berrange@redhat.com) wrote:
> >> >> > > This does imply that you need a separate monitor I/O processing, from the
> >> >> > > command execution thread, but I see no need for all commands to suddenly
> >> >> > > become async. Just allowing interleaved replies is sufficient from the
> >> >> > > POV of the protocol definition. This interleaving is easy to handle from
> >> >> > > the client POV - just requires a unique 'serial' in the request by the
> >> >> > > client, that is copied into the reply by QEMU.
> >> >> > 
> >> >> > OK, so for that we can just take Marc-André's syntax and call it 'id':
> >> >> >   https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg03634.html
> >> >> > 
> >> >> > then it's up to the caller to ensure those ids are unique.
> >> >> 
> >> >> Libvirt has in fact generated a unique 'id' for every monitor command
> >> >> since day 1 of supporting QMP.
> >> >> 
> >> >> > I do worry about two things:
> >> >> >   a) With this the caller doesn't really know which commands could be
> >> >> >   in parallel - for example if we've got a recovery command that's
> >> >> >   executed by this non-locking thread that's OK, we expect that
> >> >> >   to be doable in parallel.  If in the future though we do
> >> >> >   what you initially suggested and have a bunch of commands get
> >> >> >   routed to the migration thread (say) then those would suddenly
> >> >> >   operate in parallel with other commands that were previously
> >> >> >   synchronous.
> >> >> 
> >> >> We could still have an opt-in for async commands. eg default to executing
> >> >> all commands in the main thread, unless the client issues an explicit
> >> >> "make it async" command, to switch to allowing the migration thread to
> >> >> process it async.
> >> >> 
> >> >>  { "execute": "qmp_allow_async",
> >> >>    "data": { "commands": [
> >> >>        "migrate_cancel"
> >> >>    ] } }
> >> >> 
> >> >> 
> >> >>  { "return": { "commands": [
> >> >>        "migrate_cancel"
> >> >>    ] } }
> >> >> 
> >> >> The server response contains the subset of commands from the request
> >> >> for which async is supported.
> >> >> 
> >> >> That gives good negotiation ability going forward as we incrementally
> >> >> support async on more commands.
> >> >
> >> > I think this goes back to the discussion on which design we'd like to
> >> > choose.  IMHO the whole async idea plus the per-command-id is indeed
> >> > cleaner and nicer, and I believe that can benefit not only libvirt,
> >> 
> >> The following may be a bit harsh in places.  I apologize in advance.  A
> >> better writer than me wouldn't have to resort to that.  I've tried a few
> >> times to make my point that "async QMP" is neither necessary nor
> >> sufficient for monitor availability, but apparently without luck, since
> >> there's still talk like it was.  I hope this attempt will work.
> >> 
> >> > but also other QMP users.  The problem is, I have no idea how long
> >> > it'll take to let us have such a feature - I believe that will include
> >> > QEMU and Libvirt to both support that.  And it'll be a pity if the
> >> > postcopy recovery cannot work only because we cannot guarantee a
> >> > stable monitor.
> >> >
> >> > I'm curious whether there are other requirements (besides postcopy
> >> > recovery) that would want an always-alive monitor to run some
> >> > lock-free commands?  If there is, I'd be more inclined to first
> >> > provide a work-around solution like "-qmp-lockfree", and we can
> >> > provide a better solution afterwards, once the whole async QMP
> >> > work is ready.
> >> 
> >> Yes, there are other requirements for "async QMP", and no, "async QMP"
> >> isn't a solution, but at best a part of a solution.
> >> 
> >> Before I talk about QMP requirements, I need to ask a whole raft of
> >> questions, because so far this thread feels like dreaming up grand
> >> designs with only superficial understanding of the subject matter.
> >
> > I think Dan's suggestions are pretty good; while I preferred Peter's
> > implementation, I think Dan's will work fine and if that's good for
> > libvirt I'm OK with that.  I think we have a reasonable understanding
> > of the problem.
> >
> >> Quite possibly because *my* understanding is superficial.  If yours
> >> isn't, great!  Go answer my questions :)
> >> 
> >> The root problem is main loop hangs.  QMP monitor hangs are merely a
> >> special case.
> >> 
> >> The main loop should not hang.  We've always violated that design
> >> assumption in places, e.g. in monitor commands that write to disk, and
> >> thus can hang indefinitely with NFS.  Post-copy adds more violations, as
> >> Stefan pointed out.
> >> 
> >> I can't say whether solving the special case "QMP monitor hangs" without
> >> also solving "main loop hangs" is useful.  A perfectly available QMP
> >> monitor buys you nothing if it feeds a command queue that isn't being
> >> emptied because its consumers all hang.
> >
> > Correct.
> >
> >> So, what exactly is going to drain the command queue?  If there's more
> >> than one consumer, how exactly are commands from the queue dispatched to
> >> the consumers?
> >
> > The idea is to have 2 extra threads:
> >    a) An IO thread
> >    b) A thread that deals with non-blocking commands
> 
> These are the two extra threads, and ...
> 
> >    the existing main thread.
> 
> ... they are "extras" to the existing main thread, right?

Yes; three threads total.

> >    The IO thread dispatches most commands to the main thread
> > but doesn't wait for the response.  When responses arrive it forwards
> > the response back.
> 
> The QMP monitor runs in this I/O thread?

Yes; all the output formatting and basic checking of the input stream
(assuming, that is, that it's all lock free).

> >    A class of commands is forwarded to the non-blocking command thread.
> 
> Since the non-blocking commands by definition don't block, why can't we
> simply execute them in the I/O thread?

Yes, I think that's possible.
The separate thread came from the shape of Dan's suggestion, so that in
future we could add more threads for other types of things (e.g. a
thread that handles block commands).
(I don't think that ends up looking too different from Peter's world;
it has one thread per monitor connection, which is the other way up
from this, but I think the behaviour is very similar.)

> >    More threads may be added in the future with some set of the commands
> > being moved off the main thread to these other threads.  Eventually
> > maybe no commands would be handled on the main thread.
> >
> >> What are the "no hang" guarantees (if any) and conditions for each of
> >> these consumers?
> >
> > Commands sent to the main thread are as they are now.
> > The non-blocking-command thread *shall not block*, it will not access
> > guest memory, it won't take any lock that is taken by any other thread
> > that can block on the main thread or main memory.  Commands that run
> > on it can:
> >    a) Access state that can be read atomically - e.g. 
> >       'info status'
> >    b) Store parameters and then wake another thread
> >    c) Issue a non-blocking system call.
> >
> >
> >   In the case of postcopy recovery I see a command issued which starts
> > the new migration stream;  the command parses the path and makes sure
> > it's valid, and then stores it and kicks a recovery thread.
> >   In the case of a COLO failover I'd see something that does a
> > shutdown(2) on the migration stream.
> >
> >> We can have any number of QMP monitors today.  Would each of them feed
> >> its own queue?  Would they all feed a shared queue?
> >
> > I see two queues; one which is the set of commands being forwarded
> > to the main thread, the other is the set of commands being forwarded
> > to the non-blocking thread.
> >
> >> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
> >> command?
> >
> > The command that Dan suggested is the opt-in; I think it's per monitor;
> > now we're starting to get a bit more fuzzy.
> >
> >> What does it mean when an asynchronous command follows a synchronous
> >> command in the same QMP monitor?  I would expect the synchronous command
> >> to complete before the asynchronous command, because that's what
> >> synchronous means, isn't it?  To keep your QMP monitor available, you
> >> then must not send synchronous commands that can hang.
> >
> > Once you opt-in, all commands operate in a semi-asynchronous fashion;
> > that is they don't block the IO thread, but at the same time there's
> > never any more than one command outstanding on any one thread.
> > You can issue any command you like; one command at a time waiting
> > for the response with the knowledge that you can then always
> > issue one of the non-blocking-commands after it.
> >
> >> How can we determine whether a certain synchronous command can hang?
> >> Note that with opt-in async, *all* commands are also synchronous
> >> commands.
> >
> > You regard all commands as blockable unless told otherwise.  The result
> > from Dan's command is a list of truly async commands.
> >
> >> In short, explain to me how exactly you plan to ensure that certain QMP
> >> commands (such as post-copy recovery) can always "get through", in the
> >> presence of multiple monitors, hanging main loop, hanging synchronous
> >> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
> >
> > Have I missed anything?
> 
> I'm trying to square this with Dan's reply, but it's probably too late
> in my day for me to succeed.

Nod.

> >> Now let's talk about QMP requirements.
> >> 
> >> Any addition to QMP must consider what exists already.
> >
> > Yes.
> >
> >> You may add more of the same.
> >
> > Yes
> >
> >> You may generalize existing stuff.
> >
> > Yes
> >
> >> You may change existing stuff if you have sufficient reason, subject to
> >> backward compatibility constraints.
> >
> > Yes
> >
> >> But attempts to add new ways to do the same old stuff without properly
> >> integrating the existing ways are not going to fly.
> >
> > Agreed; that's why I'm following Dan's recommendations.
> >
> >> In particular, any new way to start some job, monitor and control it
> >> while it lives, get notified about its state changes and so forth must
> >> integrate the existing ways.  These include block jobs (probably the
> >> most sophisticated of the lot), migration, dump-guest-memory, and
> >> possibly more.  They all work the same way: synchronous command to kick
> >> off the job, more synchronous commands to monitor and control, events to
> >> notify.  They do differ in detail.
> >
> > And that's why we have the rule that all existing commands go onto the
> > main thread and only one of those is outstanding at a time.  That way
> > the actual behaviour of the existing commands doesn't change at all -
> > however you do require the 'id' field in the command to put into the
> > response so that you can distinguish the response of a command from each
> > thread.  Even if you enable async, if you don't use any of the
> > non-blocking commands the stream is just the same - send a command, get
> > a response, send a command, get a response....
> >
> >> Asynchronous commands are a new way to do this.  When you only need to
> >> be notified on "done", and don't need to monitor / control, they fit the
> >> bill quite neatly.
> >> 
> >> However, we can't just ignore the cases where we need more than that!
> >> For those, we want a single generic solution instead of the several ad
> >> hoc solutions we have now.
> >> 
> >> If we add asynchronous commands *now*, and for simple cases only, we add
> >> yet another special case for a future generic solution to integrate.
> >> I'm not going to let that happen.
> >> 
> >> I figure the closest to a generic solution we have is block jobs.
> >> Perhaps a generic solution could be had by abstracting away the "block"
> >> from "block jobs", leaving just "jobs".
> >
> > I don't know block jobs well enough to answer that.
> > I would suggest you could add a thread for asynchronous commands
> > and you could shuffle commands onto that thread as and when you feel
> > like it.
> >
> >> Another approach is generalizing the asynchronous command proposal to
> >> fully cover the not-so-simple cases.
> >> 
> >> If you'd rather want to make progress on monitor availability without
> >> cracking the "jobs" problem, you're in luck!  Use your license to "add
> >> more of the same": synchronous command to start a job, query to monitor,
> >> event to notify.  
> >> 
> >> If you insist on tying your monitor availability solution to
> >> asynchronous commands, then I'm in luck!  I just found volunteers to
> >> solve the "jobs" problem for me.
> >
> > I'm looking for minimal change here while keeping the door open for
> > the future, if there's anything you think we should do to make that
> > easy then tell us - but I'd rather this didn't turn into a 'fix all
> > known monitor problems' because frankly we may as well give up now.
> > So i don't see this as solving the 'jobs' problem, but if we can
> > do something to make it easier to solve in the future then lets do it.
> 
> Forget about asynchronous commands, jobs and the whole shebang of
> distractions, and consider what you really need: I suspect it could be
> out-of-band dispatch of non-blocking commands.  More on that in my reply
> to Daniel.

Right, and then the only thing we need to do is make sure the caller
doesn't get the replies to those commands mixed up with the replies
to the blocking commands.
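
To make that concrete, here is a minimal client-side sketch.  It assumes
every command carries an 'id' that QEMU echoes back in the matching
response, as discussed above.  The class and method names are made up for
illustration and are not part of any existing client.

```python
import json


class ReplyMatcher:
    """Pairs QMP-style responses with the commands that caused them, by 'id'.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        self._pending = {}   # id -> command name still awaiting a response
        self._next_id = 0

    def make_command(self, name, arguments=None):
        """Build the JSON text of a command and remember its id."""
        self._next_id += 1
        cmd_id = "cmd-%d" % self._next_id
        self._pending[cmd_id] = name
        cmd = {"execute": name, "id": cmd_id}
        if arguments is not None:
            cmd["arguments"] = arguments
        return json.dumps(cmd)

    def handle_response(self, line):
        """Return (command name, message) for a response, None for an event."""
        msg = json.loads(line)
        if "event" in msg:
            return None          # events are not replies to any command
        name = self._pending.pop(msg["id"])
        return name, msg
```

With this, a response that overtakes an earlier blocking command is still
attributed to the right request, because the lookup goes by id rather than
by arrival order.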

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 17:41                         ` Markus Armbruster
@ 2017-09-07 18:09                           ` Dr. David Alan Gilbert
  2017-09-08  8:41                             ` Markus Armbruster
  2017-09-08  9:27                           ` Daniel P. Berrange
  1 sibling, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-07 18:09 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P. Berrange, Laurent Vivier, Fam Zheng, Juan Quintela,
	mdroth, Peter Xu, qemu-devel, Paolo Bonzini, John Snow

* Markus Armbruster (armbru@redhat.com) wrote:
> "Daniel P. Berrange" <berrange@redhat.com> writes:
> 
> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
> >> So, what exactly is going to drain the command queue?  If there's more
> >> than one consumer, how exactly are commands from the queue dispatched to
> >> the consumers?
> >
> > In terms of my proposal, for any single command there should only ever
> > be a single consumer. The default consumer would be the main event loop
> > thread, such that we have no semantic change to QMP operation from today.
> >
> > Some commands that are capable of being made "async", would have a
> > different consumer. For example, if the client requested the 'migrate-cancel'
> > be made async, this would change things such that the migration thread is
> > now responsible for consuming the "migrate-cancel" command, instead of the
> > default main loop.
> >
> >> What are the "no hang" guarantees (if any) and conditions for each of
> >> these consumers?
> >
> > The non-main thread consumers would have to have some reasonable
> > guarantee that they won't block on a lock held by the main loop,
> > otherwise the whole feature is largely useless.
> 
> Same if they block indefinitely on anything else, actually.  In other
> words, we need to talk about liveness.
> 
> Threads by themselves don't buy us liveness.  Being careful with
> operations that may block does.  That care may lead to farming out
> certain operations to other threads, where they may block without harm.
> 
> You only talk about "the non-main thread consumers".  What about the
> main thread?  Is it okay for the main thread to block?  If yes, why?

It would be great if the main thread never blocked; but IMHO that's
a huge task that we'll never get done [challenge].

> >> We can have any number of QMP monitors today.  Would each of them feed
> >> its own queue?  Would they all feed a shared queue?
> >
> > Currently with multiple QMP monitors, everything runs in the main
> > > loop, so commands arriving across multiple monitors are 100%
> > serialized and processed strictly in the order in which QEMU reads
> > them off the wire.  To maintain these semantics, we would need to
> > have a single shared queue for the default main loop consumer, so
> > that ordering does not change.
> >
> >> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
> >> command?
> >
> > Per monitor+command. ie just because libvirt knows how to cope with
> > async execution on the monitor it has open, does not mean that a
> > different app on the 2nd monitor command can cope. So in my proposal
> > the switch to async must be scoped to the particular command only
> > > for the monitor connection that requested it.
> >
> >> What does it mean when an asynchronous command follows a synchronous
> >> command in the same QMP monitor?  I would expect the synchronous command
> >> to complete before the asynchronous command, because that's what
> >> synchronous means, isn't it?  To keep your QMP monitor available, you
> >> then must not send synchronous commands that can hang.
> >
> > No, that is not what I described. All synchronous commands are
> > > serialized wrt each other, just as today. An asynchronous command
> > can run as soon as it is received, regardless of whether any
> > earlier sent sync commands are still executing or pending. This
> > is trivial to achieve when you separate monitor I/O from command
> > execution in separate threads, provided of course the async
> > command consumers are not in the main loop.
> 
> So, a synchronous command is synchronous with respect to other commands,
> except for certain non-blocking commands.  The distinctive feature of
> the latter isn't so much an asynchronous reply, but out-of-band
> dispatch.
> 
> Out-of-band dispatch of commands that cannot block is in fact orthogonal to
> asynchronous replies.  I can't see why out-of-band dispatch of
> synchronous non-blocking commands wouldn't work, too.
> 
> >> How can we determine whether a certain synchronous command can hang?
> >> Note that with opt-in async, *all* commands are also synchronous
> >> commands.
> >> 
> >> In short, explain to me how exactly you plan to ensure that certain QMP
> >> commands (such as post-copy recovery) can always "get through", in the
> >> presence of multiple monitors, hanging main loop, hanging synchronous
> >> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
> >
> > Taking migrate-cancel as the example. The migration code already has
> > > a background thread doing work independently of the main loop. Upon
> > marking the migrate-cancel command as async, the migration control
> > thread would become the consumer of migrate-cancel.
> 
> From 30,000 feet, the QMP monitor sends a "cancel" message to the
> migration thread, and later receives a "canceled" message from the
> migration thread.
> 
> From 300 feet, we use the migrate-cancel QMP command as the cancel
> message, and its success response as the "canceled" message.
> 
> In other words, we're pressing the external QM-Protocol into service as
> internal message passing protocol.

Be careful; it's not a cancel in the postcopy recovery case, it's a
restart.  The command is very much like the migrate-incoming command.
The management layer has to provide data with the request, so it's not
an internal command.

> >                                                     This allows the
> > migration operation to be cancelled immediately, regardless of whether
> > there are earlier monitor commands blocked in the main loop.
> 
> The necessary part is moving all operations that can block out of
> whatever loop runs the monitor, be it the main loop, some other event
> loop, or a dedicated monitor thread's monitor loop.
> 
> Moving out non-blocking operations isn't necessary.  migrate-cancel
> could communicate with the migration thread by any suitable mechanism or
> protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?

Because why invent another wheel?
This is a command that the management layer has to issue to qemu for
it to recover, including passing data, in a way similar to other
commands - so it looks like a QMP command, so why not use QMP.

Also, I think making other commands lock-free is advantageous -
some of the 'info' commands just don't really need locks, and making them
not use locks removes latency effects caused by the management layer
prodding qemu.

> > Of course this assumes the migration control thread can't block
> > for locks held by the main thread.
> 
> Thanks for your answers, they help.
> 
> >> Now let's talk about QMP requirements.
> >> 
> >> Any addition to QMP must consider what exists already.
> >> 
> >> You may add more of the same.
> >> 
> >> You may generalize existing stuff.
> >> 
> >> You may change existing stuff if you have sufficient reason, subject to
> >> backward compatibility constraints.
> >> 
> >> But attempts to add new ways to do the same old stuff without properly
> >> integrating the existing ways are not going to fly.
> >> 
> >> In particular, any new way to start some job, monitor and control it
> >> while it lives, get notified about its state changes and so forth must
> >> integrate the existing ways.  These include block jobs (probably the
> >> most sophisticated of the lot), migration, dump-guest-memory, and
> >> possibly more.  They all work the same way: synchronous command to kick
> >> off the job, more synchronous commands to monitor and control, events to
> >> notify.  They do differ in detail.
> >> 
> >> Asynchronous commands are a new way to do this.  When you only need to
> >> be notified on "done", and don't need to monitor / control, they fit the
> >> bill quite neatly.
> >> 
> >> However, we can't just ignore the cases where we need more than that!
> >> For those, we want a single generic solution instead of the several ad
> >> hoc solutions we have now.
> >> 
> >> If we add asynchronous commands *now*, and for simple cases only, we add
> >> yet another special case for a future generic solution to integrate.
> >> I'm not going to let that happen.
> >
> > With the async commands suggestion, while it would initially not
> > provide a way to query incremental status, that could easily be
> > fitted in.
> 
> This is [*] below.
> 
> >             Because command replies from async commands may be
> > out-of-order wrt the original requests, clients would need to
> > provide a unique ID for each command run. This originally was
> > part of QMP spec but then dropped, but libvirt still actually
> > > generates a unique ID for every QMP command.
> >
> > Given this, one option is to actually use the QMP command ID as
> > a job ID, and let you query ongoing status via some new QMP
> > command that accepts the ID of the job to be queried. A complexity
> > with this is how to make the jobs visible across multiple QMP
> > monitors. The job ID might actually have to be a combination of
> > the serial ID from the QMP command, and the ID of the monitor
> > chardev combined.
> 
> Yes.  The job ID must be unique across all QMP monitors to make
> broadcast notifications work.
> 
> >> I figure the closest to a generic solution we have is block jobs.
> >> Perhaps a generic solution could be had by abstracting away the "block"
> >> from "block jobs", leaving just "jobs".
> 
> [*] starts here:
> 
> >> Another approach is generalizing the asynchronous command proposal to
> >> fully cover the not-so-simple cases.
> 
> We know asynchronous commands "fully cover" when we can use them to
> replace all the existing job-like commands.
> 
> Until then, they enlarge rather than solve our jobs problem.
> 
> I get the need for an available monitor.  But I need to balance it with
> other needs.  Can we find a solution for our monitor availability
> problem that doesn't enlarge our jobs problem?

Hopefully!

Dave

> >> If you'd rather want to make progress on monitor availability without
> >> cracking the "jobs" problem, you're in luck!  Use your license to "add
> >> more of the same": synchronous command to start a job, query to monitor,
> >> event to notify.  
> >> 
> >> If you insist on tying your monitor availability solution to
> >> asynchronous commands, then I'm in luck!  I just found volunteers to
> >> solve the "jobs" problem for me.
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 18:09                           ` Dr. David Alan Gilbert
@ 2017-09-08  8:41                             ` Markus Armbruster
  2017-09-08  9:32                               ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 104+ messages in thread
From: Markus Armbruster @ 2017-09-08  8:41 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, mdroth, Peter Xu,
	qemu-devel, Paolo Bonzini, John Snow

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> "Daniel P. Berrange" <berrange@redhat.com> writes:
>> 
>> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
>> >> So, what exactly is going to drain the command queue?  If there's more
>> >> than one consumer, how exactly are commands from the queue dispatched to
>> >> the consumers?
>> >
>> > In terms of my proposal, for any single command there should only ever
>> > be a single consumer. The default consumer would be the main event loop
>> > thread, such that we have no semantic change to QMP operation from today.
>> >
>> > Some commands that are capable of being made "async", would have a
>> > different consumer. For example, if the client requested the 'migrate-cancel'
>> > be made async, this would change things such that the migration thread is
>> > now responsible for consuming the "migrate-cancel" command, instead of the
>> > default main loop.
>> >
>> >> What are the "no hang" guarantees (if any) and conditions for each of
>> >> these consumers?
>> >
>> > The non-main thread consumers would have to have some reasonable
>> > guarantee that they won't block on a lock held by the main loop,
>> > otherwise the whole feature is largely useless.
>> 
>> Same if they block indefinitely on anything else, actually.  In other
>> words, we need to talk about liveness.
>> 
>> Threads by themselves don't buy us liveness.  Being careful with
>> operations that may block does.  That care may lead to farming out
>> certain operations to other threads, where they may block without harm.
>> 
>> You only talk about "the non-main thread consumers".  What about the
>> main thread?  Is it okay for the main thread to block?  If yes, why?
>
> It would be great if the main thread never blocked; but IMHO that's
> a huge task that we'll never get done [challenge].

This is perhaps starting to wander off the topic, but here goes anyway.

What unpleasant things can happen when the main loop hangs?

What are the known causes of main loop hangs?  Any ideas on fixing them?

Are the unknown main loop hangs relevant in practice?

If we can't eliminate main loop hangs, any ideas on reducing their
impact?

>> >> We can have any number of QMP monitors today.  Would each of them feed
>> >> its own queue?  Would they all feed a shared queue?
>> >
>> > Currently with multiple QMP monitors, everything runs in the main
>> > loop, so commands arriving across multiple monitors are 100%
>> > serialized and processed strictly in the order in which QEMU reads
>> > them off the wire.  To maintain these semantics, we would need to
>> > have a single shared queue for the default main loop consumer, so
>> > that ordering does not change.
>> >
>> >> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
>> >> command?
>> >
>> > Per monitor+command. ie just because libvirt knows how to cope with
>> > async execution on the monitor it has open, does not mean that a
>> > different app on the 2nd monitor command can cope. So in my proposal
>> > the switch to async must be scoped to the particular command only
>> > for the monitor connection that requested it.
>> >
>> >> What does it mean when an asynchronous command follows a synchronous
>> >> command in the same QMP monitor?  I would expect the synchronous command
>> >> to complete before the asynchronous command, because that's what
>> >> synchronous means, isn't it?  To keep your QMP monitor available, you
>> >> then must not send synchronous commands that can hang.
>> >
>> > No, that is not what I described. All synchronous commands are
>> > serialized wrt each other, just as today. An asynchronous command
>> > can run as soon as it is received, regardless of whether any
>> > earlier sent sync commands are still executing or pending. This
>> > is trivial to achieve when you separate monitor I/O from command
>> > execution in separate threads, provided of course the async
>> > command consumers are not in the main loop.
>> 
>> So, a synchronous command is synchronous with respect to other commands,
>> except for certain non-blocking commands.  The distinctive feature of
>> the latter isn't so much an asynchronous reply, but out-of-band
>> dispatch.
>> 
>> Out-of-band dispatch of commands that cannot block is in fact orthogonal to
>> asynchronous replies.  I can't see why out-of-band dispatch of
>> synchronous non-blocking commands wouldn't work, too.
>> 
>> >> How can we determine whether a certain synchronous command can hang?
>> >> Note that with opt-in async, *all* commands are also synchronous
>> >> commands.
>> >> 
>> >> In short, explain to me how exactly you plan to ensure that certain QMP
>> >> commands (such as post-copy recovery) can always "get through", in the
>> >> presence of multiple monitors, hanging main loop, hanging synchronous
>> >> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
>> >
>> > Taking migrate-cancel as the example. The migration code already has
>> > a background thread doing work independently of the main loop. Upon
>> > marking the migrate-cancel command as async, the migration control
>> > thread would become the consumer of migrate-cancel.
>> 
>> From 30,000 feet, the QMP monitor sends a "cancel" message to the
>> migration thread, and later receives a "canceled" message from the
>> migration thread.
>> 
>> From 300 feet, we use the migrate-cancel QMP command as the cancel
>> message, and its success response as the "canceled" message.
>> 
>> In other words, we're pressing the external QM-Protocol into service as
>> internal message passing protocol.
>
> Be careful; it's not a cancel in the postcopy recovery case, it's a
> restart.  The command is very much like the migrate-incoming command.
> The management layer has to provide data with the request, so it's not
> an internal command.

It's still a message.

>> >                                                     This allows the
>> > migration operation to be cancelled immediately, regardless of whether
>> > there are earlier monitor commands blocked in the main loop.
>> 
>> The necessary part is moving all operations that can block out of
>> whatever loop runs the monitor, be it the main loop, some other event
>> loop, or a dedicated monitor thread's monitor loop.
>> 
>> Moving out non-blocking operations isn't necessary.  migrate-cancel
>> could communicate with the migration thread by any suitable mechanism or
>> protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?
>
> Because why invent another wheel?
> This is a command that the management layer has to issue to qemu for
> it to recover, including passing data, in a way similar to other
> commands - so it looks like a QMP command, so why not use QMP.

Point taken.

Minor terminology remark: I'd prefer to call this a reuse of QAPI rather
than QMP, because QMP makes me think of sockets and JSON, while QAPI
makes me think of generated data types and marshaling code.

> Also, I think making other commands lock-free is advantageous -
> some of the 'info' commands just don't really need locks, and making them
> not use locks removes latency effects caused by the management layer
> prodding qemu.

I get the desire to move commands that can block out of whatever loop
runs the monitor.  But moving out commands that always complete quickly
seems pointless: by the time you're done queuing them, you could be done
*executing* them.  More on that below.

>> > Of course this assumes the migration control thread can't block
>> > for locks held by the main thread.
>> 
>> Thanks for your answers, they help.
>> 
>> >> Now let's talk about QMP requirements.
>> >> 
>> >> Any addition to QMP must consider what exists already.
>> >> 
>> >> You may add more of the same.
>> >> 
>> >> You may generalize existing stuff.
>> >> 
>> >> You may change existing stuff if you have sufficient reason, subject to
>> >> backward compatibility constraints.
>> >> 
>> >> But attempts to add new ways to do the same old stuff without properly
>> >> integrating the existing ways are not going to fly.
>> >> 
>> >> In particular, any new way to start some job, monitor and control it
>> >> while it lives, get notified about its state changes and so forth must
>> >> integrate the existing ways.  These include block jobs (probably the
>> >> most sophisticated of the lot), migration, dump-guest-memory, and
>> >> possibly more.  They all work the same way: synchronous command to kick
>> >> off the job, more synchronous commands to monitor and control, events to
>> >> notify.  They do differ in detail.
>> >> 
>> >> Asynchronous commands are a new way to do this.  When you only need to
>> >> be notified on "done", and don't need to monitor / control, they fit the
>> >> bill quite neatly.
>> >> 
>> >> However, we can't just ignore the cases where we need more than that!
>> >> For those, we want a single generic solution instead of the several ad
>> >> hoc solutions we have now.
>> >> 
>> >> If we add asynchronous commands *now*, and for simple cases only, we add
>> >> yet another special case for a future generic solution to integrate.
>> >> I'm not going to let that happen.
>> >
>> > With the async commands suggestion, while it would initially not
>> > provide a way to query incremental status, that could easily be
>> > fitted in.
>> 
>> This is [*] below.
>> 
>> >             Because command replies from async commands may be
>> > out-of-order wrt the original requests, clients would need to
>> > provide a unique ID for each command run. This originally was
>> > part of QMP spec but then dropped, but libvirt still actually
>> > generates a unique ID for every QMP command.
>> >
>> > Given this, one option is to actually use the QMP command ID as
>> > a job ID, and let you query ongoing status via some new QMP
>> > command that accepts the ID of the job to be queried. A complexity
>> > with this is how to make the jobs visible across multiple QMP
>> > monitors. The job ID might actually have to be a combination of
>> > the serial ID from the QMP command, and the ID of the monitor
>> > chardev combined.
>> 
>> Yes.  The job ID must be unique across all QMP monitors to make
>> broadcast notifications work.
>> 
>> >> I figure the closest to a generic solution we have is block jobs.
>> >> Perhaps a generic solution could be had by abstracting away the "block"
>> >> from "block jobs", leaving just "jobs".
>> 
>> [*] starts here:
>> 
>> >> Another approach is generalizing the asynchronous command proposal to
>> >> fully cover the not-so-simple cases.
>> 
>> We know asynchronous commands "fully cover" when we can use them to
>> replace all the existing job-like commands.
>> 
>> Until then, they enlarge rather than solve our jobs problem.
>> 
>> I get the need for an available monitor.  But I need to balance it with
>> other needs.  Can we find a solution for our monitor availability
>> problem that doesn't enlarge our jobs problem?
>
> Hopefully!
>
> Dave
>
>> >> If you'd rather want to make progress on monitor availability without
>> >> cracking the "jobs" problem, you're in luck!  Use your license to "add
>> >> more of the same": synchronous command to start a job, query to monitor,
>> >> event to notify.  
>> >> 
>> >> If you insist on tying your monitor availability solution to
>> >> asynchronous commands, then I'm in luck!  I just found volunteers to
>> >> solve the "jobs" problem for me.

Let me try to distill the discussion so far into a design sketch.

1. A QMP monitor runs in a loop.  The loop may execute other stuff, but
   this must not unduly delay the monitor's work.  Thus, everything in
   this loop must complete "quickly".

   All QMP monitors currently run in the main loop, which really should
   satisfy "quickly", but doesn't.  Since fixing that to a tolerable
   degree is beyond our means (is it?), we move them out.

   Design alternative: either one loop and thread per monitor, or one
   loop and thread for all monitors, or something in between.

   I'm wary of "one thread per software artifact" designs.  "One
   (preemptable) thread per activity, all sharing state" is a lousy way
   to structure software.

2. A QMP monitor receives and dispatches commands, and sends command
   responses and events.

   What if sending a response or event would block?  See 6.

3. Arbitrary code can trigger QMP events.  Events are broadcast to all
   QMP monitors.  Each QMP monitor provides an event queue.  When an
   event is triggered, it gets put into all queues, subject to rate
   limiting.

   Rate limiting and queuing need some shared data, which is protected
   by a mutex.  The critical sections guarded by this mutex must be
   "quick".

   Nothing new here, it's how events work today.

   We could easily add events that go to just one monitor, if there's a
   need.

4. Commands are normally dispatched to a worker thread, where they can
   take their own sweet time to complete.

   Currently, the monitor runs in the main loop and executes commands
   directly.  This is effectively dispatching commands to the main loop.
   Dispatch to main loop is wrong, because it can make the main loop
   hang.  If it was the only relevant cause for main loop hangs, we'd
   move the command work out and be done.  Since it isn't (see 1.) we
   *also* have to move the monitor out to prevent main loop hangs from
   hanging the monitor.

   Moving monitor and command work to separate threads changes the
   dispatch from function call to queuing.  Need a pair of queues, one
   for commands, one for responses.

   Design alternative: one worker per monitor, or one worker for all
   monitors, or main loop is the one worker for all monitors.  The
   latter leaves the main loop hangs unaddressed.  No worse than before,
   so it could be okay as a first step.

   The worker provides the pair of queues.  It executes commands in
   order.  If a command blocks, the command queue stalls.

   The command queue can therefore grow without bounds.  See 6.

5. Certain commands that always complete "quickly" can instead be
   executed directly, at the QMP client's request.  This direct
   execution is out-of-band; the response can "overtake" prior in-band
   commands' responses.

   The amount of work these out-of-band commands do themselves is up to
   them.  A quick query command would do all the work.  migrate-cancel
   could perhaps notify the migration thread and be done.  Postcopy
   recovery could perhaps send its argument struct to whatever thread
   handles recovery.

6. Flow control

   We currently leave flow control to the underlying character device.
   If the client sends more quickly than the monitor can execute, the
   client's send eventually blocks or fails with EAGAIN.  If the monitor
   sends more quickly than the client accepts, the monitor buffers
   without bounds (I think).

   Buffering monitor output without bounds is bad.  We could perhaps
   kill a monitor when it exceeds its limit.

   Buffering monitor input (in the command queue) without bounds is just
   as bad.  It also destroys the existing flow control mechanism: the
   client can no longer detect that it's sending too much.  Not an issue
   for fully synchronous clients, i.e. clients that wait for the
   previous command's response before they send the next command.  Such
   clients cannot make use of out-of-band command execution.

   The obvious way to limit the command queue is to fail commands when
   the queue is "full".

   Note that we can't send an error response right away then, because
   the command is in-band (if it wasn't, we wouldn't queue it), so its
   response has to go after all the responses to the (in-band)
   commands currently in the queue.

   To tell the client right away, we could send an event.

   Delaying the "queue full" response until the correct time to send it
   requires state: at least the command ID.  We can just as well enqueue
   and pray memory will suffice.

   Note that the only reason for the command queue is out-of-band
   commands.  Without them, reading the next command is pointless.  This
   leads me to a possible solution: separate out-of-band mode, default
   off, QMP client can switch it on.  When off, we read monitor input
   just like we do now (no queue, no problem).  When on, we read and
   queue.  If the queue is full, we send a "queue full" event with the
   IDs of the commands we dropped on the floor.  By switching on
   out-of-band mode, the QMP client also opts into this event.

   Switching could be done with QMP capabilities negotiation.

7. How all this is related to "jobs"

   Out-of-band execution is a limited special case of asynchronous
   execution.  With general asynchronous execution, responses can be
   sent in any order.  With out-of-band execution, only the out-of-band
   responses can "jump" order, and only over in-band responses.

   "All commands are (to be treated as) asynchronous" is arguably more
   elegant than this out-of-band thing.  However, it runs into two
   roadblocks that don't apply to out-of-band.

   One, backward compatibility.  That's a roadblock only as much as we
   make it one.

   Two, consistency.  "All asynchronous, but we do most job-like things
   with commands + events anyway" is not acceptable to me.  I'd be
   willing to accept "all asynchronous" when it solves the jobs problem.

   You asked for a solution to the monitor availability problem that
   doesn't require you to solve the jobs problem first.  Well, here's my
   best try.  Go shoot some holes into it :)
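
To give people something concrete to shoot at, here is a rough toy model
of points 4 to 6.  It is only a sketch under assumptions: the class, the
"oob" flag on receive() and the COMMAND_DROPPED event name are all
invented for illustration, and the worker is stepped by hand instead of
running in a real thread so that the behaviour stays deterministic.

```python
import queue


class MonitorSketch:
    """Toy model of in-band queuing plus out-of-band dispatch.

    All names here are hypothetical; this is not QEMU code.
    """

    def __init__(self, handlers, oob_names, max_queued=2):
        self.handlers = handlers          # command name -> callable(arguments)
        self.oob_names = set(oob_names)   # commands allowed to run out-of-band
        self.inband = queue.Queue(maxsize=max_queued)
        self.output = []                  # what would be sent to the client
        self.oob_enabled = False          # default off, the client opts in

    def _run(self, cmd):
        result = self.handlers[cmd["execute"]](cmd.get("arguments", {}))
        self.output.append({"return": result, "id": cmd["id"]})

    def receive(self, cmd, oob=False):
        """Called by the monitor loop for each parsed command."""
        if not self.oob_enabled:
            # Mode off: no queue, execute directly and in order, as today.
            self._run(cmd)
        elif oob and cmd["execute"] in self.oob_names:
            # Executed directly; its response may overtake queued commands.
            self._run(cmd)
        else:
            try:
                self.inband.put_nowait(cmd)
            except queue.Full:
                # Tell the client right away which command was dropped.
                self.output.append({"event": "COMMAND_DROPPED",
                                    "data": {"id": cmd["id"]}})

    def worker_step(self):
        """One worker iteration: execute the oldest queued in-band command."""
        try:
            cmd = self.inband.get_nowait()
        except queue.Empty:
            return False
        self._run(cmd)
        return True
```

Note how the queue bound only matters once the client has switched the
mode on; with it off, nothing is read ahead and the character device's
flow control keeps working as before.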

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-07 17:41                         ` Markus Armbruster
  2017-09-07 18:09                           ` Dr. David Alan Gilbert
@ 2017-09-08  9:27                           ` Daniel P. Berrange
  1 sibling, 0 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-08  9:27 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, mdroth, Peter Xu,
	qemu-devel, Paolo Bonzini, John Snow, Dr. David Alan Gilbert

On Thu, Sep 07, 2017 at 07:41:29PM +0200, Markus Armbruster wrote:
> "Daniel P. Berrange" <berrange@redhat.com> writes:
> 
> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
> >> So, what exactly is going to drain the command queue?  If there's more
> >> than one consumer, how exactly are commands from the queue dispatched to
> >> the consumers?
> >
> > In terms of my proposal, for any single command there should only ever
> > be a single consumer. The default consumer would be the main event loop
> > thread, such that we have no semantic change to QMP operation from today.
> >
> > Some commands that are capable of being made "async", would have a
> > different consumer. For example, if the client requested the 'migrate-cancel'
> > be made async, this would change things such that the migration thread is
> > now responsible for consuming the "migrate-cancel" command, instead of the
> > default main loop.
> >
> >> What are the "no hang" guarantees (if any) and conditions for each of
> >> these consumers?
> >
> > The non-main thread consumers would have to have some reasonable
> > guarantee that they won't block on a lock held by the main loop,
> > otherwise the whole feature is largely useless.
> 
> Same if they block indefinitely on anything else, actually.  In other
> words, we need to talk about liveness.
> 
> Threads by themselves don't buy us liveness.  Being careful with
> operations that may block does.  That care may lead to farming out
> certain operations to other threads, where they may block without harm.
> 
> You only talk about "the non-main thread consumers".  What about the
> main thread?  Is it okay for the main thread to block?  If yes, why?

It isn't ok, but I feel that challenge is intractable in the short to
medium term. Agree that having separate threads doesn't automatically
give liveness, but I think it makes the problem tractable to solve for
at least a subset of scenarios.

> > No, that is not what I described. All synchronous commands are
> > serialized wrt each other, just as today. An asynchronous command
> > can run as soon as it is received, regardless of whether any
> > earlier sent sync commands are still executing or pending. This
> > is trivial to achieve when you separate monitor I/O from command
> > execution in separate threads, provided of course the async
> > command consumers are not in the main loop.
> 
> So, a synchronous command is synchronous with respect to other commands,
> except for certain non-blocking commands.  The distinctive feature of
> the latter isn't so much an asynchronous reply, but out-of-band
> dispatch.

The terminology synchronous vs asynchronous is not a great fit for
what I was describing. The distinction is really closer to being
serialized vs parallelizable commands.
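
As a rough illustration of that distinction (a minimal Python sketch, not
QEMU code; the consumer threads, the PARALLELIZABLE set and run_demo() are
all hypothetical names), serialized commands share one ordered queue drained
by a stand-in for the main loop, while a parallelizable command goes to its
own consumer and can complete even while the main loop is stuck:

```python
import queue
import threading

# Assumption: commands this monitor's client asked to have run in parallel.
PARALLELIZABLE = {"migrate-cancel"}


def run_demo():
    """Return command ids in completion order while the 'main loop' is stuck."""
    main_q, mig_q = queue.Queue(), queue.Queue()
    done = []                           # completion order
    done_lock = threading.Lock()
    main_loop_free = threading.Event()  # not set = main loop is hanging

    def consumer(q, gate=None):
        while True:
            cmd = q.get()
            if cmd is None:             # sentinel: stop this consumer
                return
            if gate is not None:
                gate.wait()             # models a hang in the main loop
            with done_lock:
                done.append(cmd["id"])

    main_loop = threading.Thread(target=consumer, args=(main_q, main_loop_free))
    migration = threading.Thread(target=consumer, args=(mig_q,))
    main_loop.start()
    migration.start()

    stream = [
        {"execute": "query-block", "id": 1},
        {"execute": "query-status", "id": 2},
        {"execute": "migrate-cancel", "id": 3},
    ]
    for cmd in stream:
        target = mig_q if cmd["execute"] in PARALLELIZABLE else main_q
        target.put(cmd)

    mig_q.put(None)
    migration.join()                    # id 3 is done; ids 1 and 2 still wait
    main_loop_free.set()                # the hang clears
    main_q.put(None)
    main_loop.join()
    return done
```

In this sketch the migrate-cancel (id 3) completes first, and the two
serialized commands then complete in their arrival order once the simulated
hang clears.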

> >                                                     This allows the
> > migration operation to be cancelled immediately, regardless of whether
> > there are earlier monitor commands blocked in the main loop.
> 
> The necessary part is moving all operations that can block out of
> whatever loop runs the monitor, be it the main loop, some other event
> loop, or a dedicated monitor thread's monitor loop.
> 
> Moving out non-blocking operations isn't necessary.  migrate-cancel
> could communicate with the migration thread by any suitable mechanism or
> protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?

I don't think we really want to invent yet another way of controlling
QEMU that isn't QMP, do we? Particularly not if it is special-cased
to just one operation.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-08  8:41                             ` Markus Armbruster
@ 2017-09-08  9:32                               ` Dr. David Alan Gilbert
  2017-09-08 11:49                                 ` Markus Armbruster
  0 siblings, 1 reply; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-08  9:32 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, mdroth, Peter Xu,
	qemu-devel, Paolo Bonzini, John Snow

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> "Daniel P. Berrange" <berrange@redhat.com> writes:
> >> 
> >> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
> >> >> So, what exactly is going to drain the command queue?  If there's more
> >> >> than one consumer, how exactly are commands from the queue dispatched to
> >> >> the consumers?
> >> >
> >> > In terms of my proposal, for any single command there should only ever
> >> > be a single consumer. The default consumer would be the main event loop
> >> > thread, such that we have no semantic change to QMP operation from today.
> >> >
> >> > Some commands that are capable of being made "async", would have a
> >> > different consumer. For example, if the client requested the 'migrate-cancel'
> >> > be made async, this would change things such that the migration thread is
> >> > now responsible for consuming the "migrate-cancel" command, instead of the
> >> > default main loop.
> >> >
> >> >> What are the "no hang" guarantees (if any) and conditions for each of
> >> >> these consumers?
> >> >
> >> > The non-main thread consumers would have to have some reasonable
> >> > guarantee that they won't block on a lock held by the main loop,
> >> > otherwise the whole feature is largely useless.
> >> 
> >> Same if they block indefinitely on anything else, actually.  In other
> >> words, we need to talk about liveness.
> >> 
> >> Threads by themselves don't buy us liveness.  Being careful with
> >> operations that may block does.  That care may lead to farming out
> >> certain operations to other threads, where they may block without harm.
> >> 
> >> You only talk about "the non-main thread consumers".  What about the
> >> main thread?  Is it okay for the main thread to block?  If yes, why?
> >
> > It would be great if the main thread never blocked; but IMHO that's
> > a huge task that we'll never get done [challenge].
> 
> This is perhaps starting to wander off the topic, but here goes anyway.
> 
> What unpleasant things can happen when the main loop hangs?

  a) We can't interact with the monitor to fix the cause of the hang
     (Which is my main interest here)
  b) IO emulation might also be blocked because it's waiting on the BQL

> What are the known causes of main loop hangs?  Any ideas on fixing them?

  c) hangs on networking while under BQL; there's at least one case near
  the end of migrate
  d) hangs on storage devices while under BQL; I think there are similar
  cases in migrate and possibly elsewhere
  e) postcopy pages not yet arrived - then a problem if the postcopy
  dies and needs recovery (because of a)

> Are the unknown main loop hangs relevant in practice?

Well, the unknown ones are unknown; the known ones however seem
relevant:
  f) I can't recover a failed postcopy
  g) A COLO synchronisation might hang at a bad point in migrate and
  you can't kill it off to cause one side to continue
  h) A failure of networking at just the wrong point in migrate can
  cause the source to be paused for a long time - but I don't think
  I've seen it in practice.

> If we can't eliminate main loop hangs, any ideas on reducing their
> impact?

Note there are two related things: main loop hangs and BQL hangs. I'm not
sure that the two are always the same.

Stefan mentioned some ways of doing asynchronous memory lookups/accesses
though I'm not sure they'd work in the postcopy case; but they'd need
work in lots of devices.
Some of the IO under the BQL might be fixable; IMHO in a lot of places
we don't really need the full BQL, we just need a 'you aren't going to
change the config' lock.

> >> >> We can have any number of QMP monitors today.  Would each of them feed
> >> >> its own queue?  Would they all feed a shared queue?
> >> >
> >> > Currently with multiple QMP monitors, everything runs in the main
> >> > loop, so commands arriving across  multiple monitors are 100%
> >> > serialized and processed strictly in the order in which QEMU reads
> >> > them off the wire.  To maintain these semantics, we would need to
> >> > have a single shared queue for the default main loop consumer, so
> >> > that ordering does not change.
> >> >
> >> >> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
> >> >> command?
> >> >
> >> > Per monitor+command. ie just because libvirt knows how to cope with
> >> > async execution on the monitor it has open, does not mean that a
> >> > different app on the 2nd monitor command can cope. So in my proposal
> >> > the switch to async must be scoped to the particular command only
> >> > for the monitor connection that requested it.
> >> >
> >> >> What does it mean when an asynchronous command follows a synchronous
> >> >> command in the same QMP monitor?  I would expect the synchronous command
> >> >> to complete before the asynchronous command, because that's what
> >> >> synchronous means, isn't it?  To keep your QMP monitor available, you
> >> >> then must not send synchronous commands that can hang.
> >> >
> >> > No, that is not what I described. All synchronous commands are
> >> > serialized wrt each other, just as today. An asynchronous command
> >> > can run as soon as it is received, regardless of whether any
> >> > earlier sent sync commands are still executing or pending. This
> >> > is trivial to achieve when you separate monitor I/O from command
> >> > execution in separate threads, provided of course the async
> >> > command consumers are not in the main loop.
> >> 
> >> So, a synchronous command is synchronous with respect to other commands,
> >> except for certain non-blocking commands.  The distinctive feature of
> >> the latter isn't so much an asynchronous reply, but out-of-band
> >> dispatch.
> >> 
> >> Out-of-band dispatch of commands that cannot block is in fact orthogonal to
> >> asynchronous replies.  I can't see why out-of-band dispatch of
> >> synchronous non-blocking commands wouldn't work, too.
> >> 
> >> >> How can we determine whether a certain synchronous command can hang?
> >> >> Note that with opt-in async, *all* commands are also synchronous
> >> >> commands.
> >> >> 
> >> >> In short, explain to me how exactly you plan to ensure that certain QMP
> >> >> commands (such as post-copy recovery) can always "get through", in the
> >> >> presence of multiple monitors, hanging main loop, hanging synchronous
> >> >> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
> >> >
> >> > Taking migrate-cancel as the example. The migration code already has
> >> > a background thread doing work independently of the main loop. Upon
> >> > marking the migrate-cancel command as async, the migration control
> >> > thread would become the consumer of migrate-cancel.
> >> 
> >> From 30,000 feet, the QMP monitor sends a "cancel" message to the
> >> migration thread, and later receives a "canceled" message from the
> >> migration thread.
> >> 
> >> From 300 feet, we use the migrate-cancel QMP command as the cancel
> >> message, and its success response as the "canceled" message.
> >> 
> >> In other words, we're pressing the external QM-Protocol into service as
> >> internal message passing protocol.
> >
> > Be careful; it's not a cancel in the postcopy recovery case, it's a
> > restart.  The command is very much like the migration-incoming command.
> > The management layer has to provide data with the request, so it's not
> > an internal command.
> 
> It's still a message.
> 
> >> >                                                     This allows the
> >> > migration operation to be cancelled immediately, regardless of whether
> >> > there are earlier monitor commands blocked in the main loop.
> >> 
> >> The necessary part is moving all operations that can block out of
> >> whatever loop runs the monitor, be it the main loop, some other event
> >> loop, or a dedicated monitor thread's monitor loop.
> >> 
> >> Moving out non-blocking operations isn't necessary.  migrate-cancel
> >> could communicate with the migration thread by any suitable mechanism or
> >> protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?
> >
> > Because why invent another wheel?
> > This is a command that the management layer has to issue to qemu for
> > it to recover, including passing data, in a way similar to other
> > commands - so it looks like a QMP command, so why not use QMP.
> 
> Point taken.
> 
> Minor terminology remark: I'd prefer to call this a reuse of QAPI rather
> than QMP, because QMP makes me think of sockets and JSON, while QAPI
> makes me think of generated data types and marshaling code.

Well it's a command that's got to come over the socket from management,
so I'm still thinking sockets and JSON. A lot of the problems you
describe below are more to do with the pain of managing the messages
squeezed through a socket.

> > Also, I think making other commands lock-free is advantageous - 
> > some of the 'info' commands just don't really need locks, making them
> > not use locks removes latency effects caused by the management layer
> > prodding qemu.
> 
> I get the desire to move commands that can block out of whatever loop
> runs the monitor.  But moving out commands that always complete quickly
> seems pointless: by the time you're done queuing them, you could be done
> *executing* them.  More on that below.

My thinking here wasn't about the speed of executing the command, my
interest was more on the performance of the guest/IO - avoiding taking
the BQL would have less impact on IO emulation, as would keeping the
main thread free.

> >> > Of course this assumes the migration control thread can't block
> >> > for locks held by the main thread.
> >> 
> >> Thanks for your answers, they help.
> >> 
> >> >> Now let's talk about QMP requirements.
> >> >> 
> >> >> Any addition to QMP must consider what exists already.
> >> >> 
> >> >> You may add more of the same.
> >> >> 
> >> >> You may generalize existing stuff.
> >> >> 
> >> >> You may change existing stuff if you have sufficient reason, subject to
> >> >> backward compatibility constraints.
> >> >> 
> >> >> But attempts to add new ways to do the same old stuff without properly
> >> >> integrating the existing ways are not going to fly.
> >> >> 
> >> >> In particular, any new way to start some job, monitor and control it
> >> >> while it lives, get notified about its state changes and so forth must
> >> >> integrate the existing ways.  These include block jobs (probably the
> >> >> most sophisticated of the lot), migration, dump-guest-memory, and
> >> >> possibly more.  They all work the same way: synchronous command to kick
> >> >> off the job, more synchronous commands to monitor and control, events to
> >> >> notify.  They do differ in detail.
> >> >> 
> >> >> Asynchronous commands are a new way to do this.  When you only need to
> >> >> be notified on "done", and don't need to monitor / control, they fit the
> >> >> bill quite neatly.
> >> >> 
> >> >> However, we can't just ignore the cases where we need more than that!
> >> >> For those, we want a single generic solution instead of the several ad
> >> >> hoc solutions we have now.
> >> >> 
> >> >> If we add asynchronous commands *now*, and for simple cases only, we add
> >> >> yet another special case for a future generic solution to integrate.
> >> >> I'm not going to let that happen.
> >> >
> >> > With the async commands suggestion, while it would initially not
> >> > provide a way to query incremental status, that could easily be
> >> > fitted in.
> >> 
> >> This is [*] below.
> >> 
> >> >             Because command replies from async commands may be
> >> > out-of-order wrt the original requests, clients would need to
> >> > provide a unique ID for each command run. This originally was
> >> > part of QMP spec but then dropped, but libvirt still actually
> >> > generates a unique ID for every QMP command.
> >> >
> >> > Given this, one option is to actually use the QMP command ID as
> >> > a job ID, and let you query ongoing status via some new QMP
> >> > command that accepts the ID of the job to be queried. A complexity
> >> > with this is how to make the jobs visible across multiple QMP
> >> > monitors. The job ID might actually have to be a combination of
> >> > the serial ID from the QMP command, and the ID of the monitor
> >> > chardev combined.
> >> 
> >> Yes.  The job ID must be unique across all QMP monitors to make
> >> broadcast notifications work.
> >> 
> >> >> I figure the closest to a generic solution we have is block jobs.
> >> >> Perhaps a generic solution could be had by abstracting away the "block"
> >> >> from "block jobs", leaving just "jobs".
> >> 
> >> [*] starts here:
> >> 
> >> >> Another approach is generalizing the asynchronous command proposal to
> >> >> fully cover the not-so-simple cases.
> >> 
> >> We know asynchronous commands "fully cover" when we can use them to
> >> replace all the existing job-like commands.
> >> 
> >> Until then, they enlarge rather than solve our jobs problem.
> >> 
> >> I get the need for an available monitor.  But I need to balance it with
> >> other needs.  Can we find a solution for our monitor availability
> >> problem that doesn't enlarge our jobs problem?
> >
> > Hopefully!
> >
> > Dave
> >
> >> >> If you'd rather want to make progress on monitor availability without
> >> >> cracking the "jobs" problem, you're in luck!  Use your license to "add
> >> >> more of the same": synchronous command to start a job, query to monitor,
> >> >> event to notify.  
> >> >> 
> >> >> If you insist on tying your monitor availability solution to
> >> >> asynchronous commands, then I'm in luck!  I just found volunteers to
> >> >> solve the "jobs" problem for me.
> 
> Let me try to distill the discussion so far into a design sketch.
> 
> 1. A QMP monitor runs in a loop.  The loop may execute other stuff, but
>    this must not unduly delay the monitor's work.  Thus, everything in
>    this loop must complete "quickly".
> 
>    All QMP monitors currently run in the main loop, which really should
>    satisfy "quickly", but doesn't.  Since fixing that to a tolerable
>    degree is beyond our means (is it?), we move them out.
> 
>    Design alternative: either one loop and thread per monitor, or one
>    loop and thread for all monitors, or something in between.
> 
>    I'm wary of "one thread per software artifact" designs.  "One
>    (preemptable) thread per activity, all sharing state" is a lousy way
>    to structure software.

Shrug; I've always thought of it as an easy solution unless you'd
get into hundreds of threads, which given the number of monitors, we
won't.

> 2. A QMP monitor receives and dispatches commands, and sends command
>    responses and events.
> 
>    What if sending a response or event would block?  See 6.
> 
> 3. Arbitrary code can trigger QMP events.  Events are broadcast to all
>    QMP monitors.  Each QMP monitor provides an event queue.  When an
>    event is triggered, it gets put into all queues, subject to rate
>    limiting.
> 
>    Rate limiting and queuing needs some shared data, which is protected
>    by a mutex.  The critical sections guarded by this mutex must be
>    "quick".
> 
>    Nothing new here, it's how events work today.
> 
>    We could easily add events that go to just one monitor, if there's a
>    need.

I don't think events could cause a problem here since they're always
outbound - so they could never block inbound commands?

> 4. Commands are normally dispatched to a worker thread, where they can
>    take their own sweet time to complete.
> 
>    Currently, the monitor runs in the main loop and executes commands
>    directly.  This is effectively dispatching commands to the main loop.
>    Dispatch to main loop is wrong, because it can make the main loop
>    hang.  If it was the only relevant cause for main loop hangs, we'd
>    move the command work out and be done.  Since it isn't (see 1.) we
>    *also* have to move the monitor out to prevent main loop hangs from
>    hanging the monitor.
> 
>    Moving monitor and command work to separate threads changes the
>    dispatch from function call to queuing.  Need a pair of queues, one
>    for commands, one for responses.
> 
>    Design alternative: one worker per monitor, or one worker for all
>    monitors, or main loop is the one worker for all monitors.  The
>    latter leaves the main loop hangs unaddressed.  No worse than before,
>    so it could be okay as a first step.
> 
>    The worker provides the pair of queues.  It executes commands in
>    order.  If a command blocks, the command queue stalls.
> 
>    The command queue can therefore grow without bounds.  See 6.
> 
> 5. Certain commands that always complete "quickly" can instead be
>    executed directly, at the QMP client's request.  This direct
>    execution is out-of-band; the response can "overtake" prior in-band
>    commands' responses.
> 
>    The amount of work these out-of-band commands do themselves is up to
>    them.  A quick query command would do all the work.  migrate-cancel
>    could perhaps notify the migration thread and be done.  Postcopy
>    recovery could perhaps send its argument struct to whatever thread
>    handles recovery.

Yes.

> 6. Flow control

I think this is potentially the tricky bit!

>    We currently leave flow control to the underlying character device.
>    If the client sends more quickly than the monitor can execute, the
>    client's send eventually blocks or fails with EAGAIN.  If the monitor
>    sends more quickly than the client accepts, the monitor buffers
>    without bounds (I think).
> 
>    Buffering monitor output without bounds is bad.  We could perhaps
>    kill a monitor when it exceeds its limit.

I'm not sure it's possible to define that limit; for example
'query-block' gives a list of information for all devices; there are
people running with 200+ block devices so the output for that would be
huge.

>    Buffering monitor input (in the command queue) without bounds is just
>    as bad.  It also destroys the existing flow control mechanism: the
>    client can no longer detect that it's sending too much.  Not an issue
>    for fully synchronous clients, i.e. clients that wait for the
>    previous command's response before they send the next command.  Such
>    clients cannot make use of out-of-band command execution.
> 
>    The obvious way to limit the command queue is to fail commands when
>    the queue is "full".
> 
>    Note that we can't send an error response right away then, because
>    the command is in-band (if it wasn't, we wouldn't queue it), so its
>    response has to go after all the responses to the (in-band)
>    commands currently in the queue.
> 
>    To tell the client right away, we could send an event.
> 
>    Delaying the "queue full" response until the correct time to send it
>    requires state: at least the command ID.  We can just as well enqueue
>    and pray memory will suffice.
> 
>    Note that the only reason for the command queue is out-of-band
>    commands.  Without them, reading the next command is pointless.  This
>    leads me to a possible solution: separate out-of-band mode, default
>    off, QMP client can switch it on.  When off, we read monitor input
>    just like we do now (no queue, no problem).  When on, we read and
>    queue.  If the queue is full, we send a "queue full" event with the
>    IDs of the commands we dropped on the floor.  By switching on
>    out-of-band-mode, the QMP client also opts into this event.
> 
>    Switching could be done with QMP capabilities negotiation.

I'm not sure how this queue interacts for multiple monitors using the
single IO thread.  It's currently legal for each monitor to send one
command and for that command to be outstanding; so 'queue full' mustn't
happen in that case, because we still want to allow any of the monitors
to issue one of the non-locking commands.
So I think we need two one-entry input queues per monitor: one for normal
commands and one for non-locking commands. I think that's different
from what we've previously suggested, which is two central queues.
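
A minimal sketch of that per-monitor layout, in Python rather than QEMU's C,
assuming queue.Queue with maxsize=1 stands in for a one-entry queue.
MonitorQueues and submit() are made-up names, not existing QEMU interfaces:

```python
import queue


class MonitorQueues:
    """Hypothetical per-monitor input queues: one slot per command class."""

    def __init__(self):
        self.normal = queue.Queue(maxsize=1)       # ordinary commands
        self.non_locking = queue.Queue(maxsize=1)  # commands that take no locks

    def submit(self, cmd, non_locking=False):
        """Queue cmd; return False if that class already has one outstanding."""
        q = self.non_locking if non_locking else self.normal
        try:
            q.put_nowait(cmd)
            return True
        except queue.Full:
            return False


# Two monitors, each with one normal command outstanding.
mon_a, mon_b = MonitorQueues(), MonitorQueues()
a_normal = mon_a.submit({"execute": "query-block", "id": "a1"})
b_normal = mon_b.submit({"execute": "query-block", "id": "b1"})

# Either monitor can still issue a non-locking command.
a_oob = mon_a.submit({"execute": "migrate-cancel", "id": "a2"}, non_locking=True)
b_oob = mon_b.submit({"execute": "migrate-cancel", "id": "b2"}, non_locking=True)

# A second normal command on the same monitor finds its slot occupied.
a_second = mon_a.submit({"execute": "query-status", "id": "a3"})
```

Because the slots are per monitor, one monitor's outstanding command never
makes another monitor see "queue full", which is the property two central
queues would not give.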

> 7. How all this is related to "jobs"
> 
>    Out-of-band execution is a limited special case of asynchronous
>    execution.  With general asynchronous execution, responses can be
>    sent in any order.  With out-of-band execution, only the out-of-band
>    responses can "jump" order, and only over in-band responses.
> 
>    "All commands are (to be treated as) asynchronous" is arguably more
>    elegant than this out-of-band thing.  However, it runs into two
>    roadblocks that don't apply to out-of-band.
> 
>    One, backward compatibility.  That's a roadblock only as much as we
>    make it one.
> 
>    Two, consistency.  "All asynchronous, but we do most job-like things
>    with commands + events anyway" is not acceptable to me.  I'd be
>    willing to accept "all asynchronous" when it solves the jobs problem.

I suspect there are other things that limit making everything
asynchronous; for example commands that currently only expect to be
executing in the main thread; if you wanted to make an existing command
async you'd have to audit it for all the possible places it could hang.

I also see the other problem as keeping the management layer's
understanding of which commands are asynchronous; Dan's suggestion is
a command where the management layer specifies which commands it
expects to be asynchronous, and qemu responds with which ones actually
are.
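
A sketch of how that negotiation might look, under the assumption that it
reduces to a per-monitor set intersection. The class, its methods and the
list of capable commands are hypothetical and do not describe an existing
QMP interface:

```python
class AsyncNegotiation:
    """Hypothetical per-monitor negotiation of asynchronous commands."""

    def __init__(self, capable):
        # Assumption: commands the QEMU side is able to run asynchronously.
        self.capable = frozenset(capable)
        self.granted = {}  # monitor name -> set of commands granted as async

    def request(self, monitor, commands):
        """Management asks for commands; reply with those actually granted."""
        granted = set(commands) & self.capable
        self.granted[monitor] = granted
        return sorted(granted)

    def is_async(self, monitor, command):
        """True only for the monitor connection that requested it."""
        return command in self.granted.get(monitor, set())


neg = AsyncNegotiation(capable={"migrate-cancel", "migrate-incoming"})
reply = neg.request("libvirt-mon", ["migrate-cancel", "query-block"])
```

Here the reply contains only migrate-cancel, and a second monitor that never
asked keeps fully synchronous behaviour, matching the per monitor+command
scoping discussed earlier in the thread.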

>    You asked for a solution to the monitor availability problem that
>    doesn't require you to solve the jobs problem first.  Well, here's my
>    best try.  Go shoot some holes into it :)

Hopefully we're running out of holes.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-08  9:32                               ` Dr. David Alan Gilbert
@ 2017-09-08 11:49                                 ` Markus Armbruster
  2017-09-08 13:19                                   ` Stefan Hajnoczi
                                                     ` (2 more replies)
  0 siblings, 3 replies; 104+ messages in thread
From: Markus Armbruster @ 2017-09-08 11:49 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Laurent Vivier, Fam Zheng, Juan Quintela, qemu-devel, Peter Xu,
	mdroth, Paolo Bonzini, John Snow

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>> 
>> > * Markus Armbruster (armbru@redhat.com) wrote:
>> >> "Daniel P. Berrange" <berrange@redhat.com> writes:
>> >> 
>> >> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
>> >> >> So, what exactly is going to drain the command queue?  If there's more
>> >> >> than one consumer, how exactly are commands from the queue dispatched to
>> >> >> the consumers?
>> >> >
>> >> > In terms of my proposal, for any single command there should only ever
>> >> > be a single consumer. The default consumer would be the main event loop
>> >> > thread, such that we have no semantic change to QMP operation from today.
>> >> >
>> >> > Some commands that are capable of being made "async", would have a
>> >> > different consumer. For example, if the client requested the 'migrate-cancel'
>> >> > be made async, this would change things such that the migration thread is
>> >> > now responsible for consuming the "migrate-cancel" command, instead of the
>> >> > default main loop.
>> >> >
>> >> >> What are the "no hang" guarantees (if any) and conditions for each of
>> >> >> these consumers?
>> >> >
>> >> > The non-main thread consumers would have to have some reasonable
>> >> > guarantee that they won't block on a lock held by the main loop,
>> >> > otherwise the whole feature is largely useless.
>> >> 
>> >> Same if they block indefinitely on anything else, actually.  In other
>> >> words, we need to talk about liveness.
>> >> 
>> >> Threads by themselves don't buy us liveness.  Being careful with
>> >> operations that may block does.  That care may lead to farming out
>> >> certain operations to other threads, where they may block without harm.
>> >> 
>> >> You only talk about "the non-main thread consumers".  What about the
>> >> main thread?  Is it okay for the main thread to block?  If yes, why?
>> >
>> > It would be great if the main thread never blocked; but IMHO that's
>> > a huge task that we'll never get done [challenge].
>> 
>> This is perhaps starting to wander off the topic, but here goes anyway.
>> 
>> What unpleasant things can happen when the main loop hangs?
>
>   a) We can't interact with the monitor to fix the cause of the hang
>      (Which is my main interest here)
>   b) IO emulation might also be blocked because it's waiting on the bql

To readers other than Dave: anything else?

>> What are the known causes of main loop hangs?  Any ideas on fixing them?
>
>   c) hangs on networking while under BQL; there's at least one case near
>   the end of migrate
>   d) hangs on storage devices while under BQL; I think there are similar
>   cases in migrate and possibly elsewhere
>   e) postcopy pages not yet arrived - then a problem if the postcopy
>   dies and needs recovery (because of a)

Any ideas on fixing them?

To readers other than Dave: anything else?

>> Are the unknown main loop hangs relevant in practice?
>
> Well, the unknown ones are unknown; the known ones however seem
> relevant:
>   f) I can't recover a failed postcopy
>   g) A COLO synchronisation might hang at a bad point in migrate and
>   you can't kill it off to cause one side to continue
>   h) A failure of networking at just the wrong point in migrate can
>   cause the source to be paused for a long time - but I don't think
>   I've seen it in practice.

I don't doubt your assertion that the known ones are relevant; I assume
you've run into them.

The purpose of my question is to find out how serious a problem the
unknown causes are.  I'm afraid the answer is "we don't know".

>> If we can't eliminate main loop hangs, any ideas on reducing their
>> impact?
>
> Note there's two related things; main loop hangs and bql hangs; I'm not
> sure that the two are always the same.
>
> Stefan mentioned some ways of doing asynchronous memory lookups/accesses
> though I'm not sure they'd work in the postcopy case; but they'd need
> work in lots of devices.
> Some of the IO under the BQL might be fixable; IMHO in a lot of places
> we don't really need the full BQL, we just need a 'you aren't going to
> change the config' lock.

This is all about reducing main loop hangs.  Another one is moving
"slow" code out of the main loop, e.g. monitor commands.

My question was aiming in a slightly different direction, however: given
that the main loop can hang, is there anything we can do to mitigate
known bad consequences of such hangs?

We're actually discussing one thing we can do to mitigate: moving the
monitor core out of the main loop, to keep the monitor available.  Any
other ideas?

>> >> >> We can have any number of QMP monitors today.  Would each of them feed
>> >> >> its own queue?  Would they all feed a shared queue?
>> >> >
>> >> > Currently with multiple QMP monitors, everything runs in the main
>> >> > loop, so commands arriving across  multiple monitors are 100%
>> >> > serialized and processed strictly in the order in which QEMU reads
>> >> > them off the wire.  To maintain these semantics, we would need to
>> >> > have a single shared queue for the default main loop consumer, so
>> >> > that ordering does not change.
>> >> >
>> >> >> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
>> >> >> command?
>> >> >
>> >> > Per monitor+command. ie just because libvirt knows how to cope with
>> >> > async execution on the monitor it has open, does not mean that a
>> >> > different app on the 2nd monitor command can cope. So in my proposal
>> >> > the switch to async must be scoped to the particular command only
>> >> > for the monitor connection that requested it.
>> >> >
>> >> >> What does it mean when an asynchronous command follows a synchronous
>> >> >> command in the same QMP monitor?  I would expect the synchronous command
>> >> >> to complete before the asynchronous command, because that's what
>> >> >> synchronous means, isn't it?  To keep your QMP monitor available, you
>> >> >> then must not send synchronous commands that can hang.
>> >> >
>> >> > No, that is not what I described. All synchronous commands are
>> >> > serialized wrt each other, just as today. An asynchronous command
>> >> > can run as soon as it is received, regardless of whether any
>> >> > earlier sent sync commands are still executing or pending. This
>> >> > is trivial to achieve when you separate monitor I/O from command
>> >> > execution in separate threads, provided of course the async
>> >> > command consumers are not in the main loop.
>> >> 
>> >> So, a synchronous command is synchronous with respect to other commands,
>> >> except for certain non-blocking commands.  The distinctive feature of
>> >> the latter isn't so much an asynchronous reply, but out-of-band
>> >> dispatch.
>> >> 
>> >> Out-of-band dispatch of commands that cannot block is in fact orthogonal to
>> >> asynchronous replies.  I can't see why out-of-band dispatch of
>> >> synchronous non-blocking commands wouldn't work, too.
>> >> 
>> >> >> How can we determine whether a certain synchronous command can hang?
>> >> >> Note that with opt-in async, *all* commands are also synchronous
>> >> >> commands.
>> >> >> 
>> >> >> In short, explain to me how exactly you plan to ensure that certain QMP
>> >> >> commands (such as post-copy recovery) can always "get through", in the
>> >> >> presence of multiple monitors, hanging main loop, hanging synchronous
>> >> >> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
>> >> >
>> >> > Taking migrate-cancel as the example. The migration code already has
>> >> > a background thread doing work independently of the main loop. Upon
>> >> > marking the migrate-cancel command as async, the migration control
>> >> > thread would become the consumer of migrate-cancel.
>> >> 
>> >> From 30,000 feet, the QMP monitor sends a "cancel" message to the
>> >> migration thread, and later receives a "canceled" message from the
>> >> migration thread.
>> >> 
>> >> From 300 feet, we use the migrate-cancel QMP command as the cancel
>> >> message, and its success response as the "canceled" message.
>> >> 
>> >> In other words, we're pressing the external QM-Protocol into service as
>> >> internal message passing protocol.
>> >
>> > Be careful; it's not a cancel in the postcopy recovery case, it's a
>> > restart.  The command is very much like the migration-incoming command.
>> > The management layer has to provide data with the request, so it's not
>> > an internal command.
>> 
>> It's still a message.
>> 
>> >> >                                                     This allows the
>> >> > migration operation to be cancelled immediately, regardless of whether
>> >> > there are earlier monitor commands blocked in the main loop.
>> >> 
>> >> The necessary part is moving all operations that can block out of
>> >> whatever loop runs the monitor, be it the main loop, some other event
>> >> loop, or a dedicated monitor thread's monitor loop.
>> >> 
>> >> Moving out non-blocking operations isn't necessary.  migrate-cancel
>> >> could communicate with the migration thread by any suitable mechanism or
>> >> protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?
>> >
>> > Because why invent another wheel?
>> > This is a command that the management layer has to issue to qemu for
>> > it to recover, including passing data, in a way similar to other
>> > commands - so it looks like a QMP command, so why not use QMP.
>> 
>> Point taken.
>> 
>> Minor terminology remark: I'd prefer to call this a reuse of QAPI rather
>> than QMP, because QMP makes me think of sockets and JSON, while QAPI
>> makes me think of generated data types and marshaling code.
>
> Well it's a command that's got to come over the socket from management,
> so I'm still thinking sockets and JSON. A lot of the problems you
> describe below are more to do with the pain of managing the messages
> squeezed through a socket.
>
>> > Also, I think making other commands lock-free is advantageous -
>> > some of the 'info' commands just don't really need locks, making them
>> > not use locks removes latency effects caused by the management layer
>> > prodding qemu.
>> 
>> I get the desire to move commands that can block out of whatever loop
>> runs the monitor.  But moving out commands that always complete quickly
>> seems pointless: by the time you're done queuing them, you could be done
>> *executing* them.  More on that below.
>
> My thinking here wasn't about the speed of executing the command, my
> interest was more on the performance of the guest/IO - avoiding taking
> the BQL would have less impact on IO emulation, as would keeping the
> main thread free.
>
>> >> > Of course this assumes the migration control thread can't block
>> >> > for locks held by the main thread.
>> >> 
>> >> Thanks for your answers, they help.
>> >> 
>> >> >> Now let's talk about QMP requirements.
>> >> >> 
>> >> >> Any addition to QMP must consider what exists already.
>> >> >> 
>> >> >> You may add more of the same.
>> >> >> 
>> >> >> You may generalize existing stuff.
>> >> >> 
>> >> >> You may change existing stuff if you have sufficient reason, subject to
>> >> >> backward compatibility constraints.
>> >> >> 
>> >> >> But attempts to add new ways to do the same old stuff without properly
>> >> >> integrating the existing ways are not going to fly.
>> >> >> 
>> >> >> In particular, any new way to start some job, monitor and control it
>> >> >> while it lives, get notified about its state changes and so forth must
>> >> >> integrate the existing ways.  These include block jobs (probably the
>> >> >> most sophisticated of the lot), migration, dump-guest-memory, and
>> >> >> possibly more.  They all work the same way: synchronous command to kick
>> >> >> off the job, more synchronous commands to monitor and control, events to
>> >> >> notify.  They do differ in detail.
>> >> >> 
>> >> >> Asynchronous commands are a new way to do this.  When you only need to
>> >> >> be notified on "done", and don't need to monitor / control, they fit the
>> >> >> bill quite neatly.
>> >> >> 
>> >> >> However, we can't just ignore the cases where we need more than that!
>> >> >> For those, we want a single generic solution instead of the several ad
>> >> >> hoc solutions we have now.
>> >> >> 
>> >> >> If we add asynchronous commands *now*, and for simple cases only, we add
>> >> >> yet another special case for a future generic solution to integrate.
>> >> >> I'm not going to let that happen.
>> >> >
>> >> > With the async commands suggestion, while it would initially not
>> >> > provide a way to query incremental status, that could easily be
>> >> > fitted in.
>> >> 
>> >> This is [*] below.
>> >> 
>> >> >             Because command replies from async commands may be
>> >> > out-of-order wrt the original requests, clients would need to
>> >> > provide a unique ID for each command run. This originally was
>> >> > part of QMP spec but then dropped, but libvirt still actually
>> >> > generates a unique ID for every QMP command.
>> >> >
>> >> > Given this, one option is to actually use the QMP command ID as
>> >> > a job ID, and let you query ongoing status via some new QMP
>> >> > command that accepts the ID of the job to be queried. A complexity
>> >> > with this is how to make the jobs visible across multiple QMP
>> >> > monitors. The job ID might actually have to be a combination of
>> >> > the serial ID from the QMP command, and the ID of the monitor
>> >> > chardev combined.
>> >> 
>> >> Yes.  The job ID must be unique across all QMP monitors to make
>> >> broadcast notifications work.
>> >> 
>> >> >> I figure the closest to a generic solution we have is block jobs.
>> >> >> Perhaps a generic solution could be had by abstracting away the "block"
>> >> >> from "block jobs", leaving just "jobs".
>> >> 
>> >> [*] starts here:
>> >> 
>> >> >> Another approach is generalizing the asynchronous command proposal to
>> >> >> fully cover the not-so-simple cases.
>> >> 
>> >> We know asynchronous commands "fully cover" when we can use them to
>> >> replace all the existing job-like commands.
>> >> 
>> >> Until then, they enlarge rather than solve our jobs problem.
>> >> 
>> >> I get the need for an available monitor.  But I need to balance it with
>> >> other needs.  Can we find a solution for our monitor availability
>> >> problem that doesn't enlarge our jobs problem?
>> >
>> > Hopefully!
>> >
>> > Dave
>> >
>> >> >> If you'd rather want to make progress on monitor availability without
>> >> >> cracking the "jobs" problem, you're in luck!  Use your license to "add
>> >> >> more of the same": synchronous command to start a job, query to monitor,
>> >> >> event to notify.  
>> >> >> 
>> >> >> If you insist on tying your monitor availability solution to
>> >> >> asynchronous commands, then I'm in luck!  I just found volunteers to
>> >> >> solve the "jobs" problem for me.
>> 
>> Let me try to distill the discussion so far into a design sketch.
>> 
>> 1. A QMP monitor runs in a loop.  The loop may execute other stuff, but
>>    this must not unduly delay the monitor's work.  Thus, everything in
>>    this loop must complete "quickly".
>> 
>>    All QMP monitors currently run in the main loop, which really should
>>    satisfy "quickly", but doesn't.  Since fixing that to a tolerable
>>    degree is beyond our means (is it?), we move them out.
>> 
>>    Design alternative: either one loop and thread per monitor, or one
>>    loop and thread for all monitors, or something in between.
>> 
>>    I'm wary of "one thread per software artifact" designs.  "One
>>    (preemptable) thread per activity, all sharing state" is a lousy way
>>    to structure software.
>
> Shrug; I've always thought of it as an easy solution unless you'd
> get into hundreds of threads, which given the number of monitors, we
> won't.
>
>> 2. A QMP monitor receives and dispatches commands, and sends command
>>    responses and events.
>> 
>>    What if sending a response or event would block?  See 6.
>> 
>> 3. Arbitrary code can trigger QMP events.  Events are broadcast to all
>>    QMP monitors.  Each QMP monitor provides an event queue.  When an
>>    event is triggered, it gets put into all queues, subject to rate
>>    limiting.

Correction: only events that are rate-limited go through the queue.  The
others bypass it.  This is an optimization.

>>    Rate limiting and queuing needs some shared data, which is protected
>>    by a mutex.  The critical sections guarded by this mutex must be
>>    "quick".

There's another mutex guarding the monitor's output buffer (see 6.),
among other things.

>>    Nothing new here, it's how events work today.
>> 
>>    We could easily add events that go to just one monitor, if there's a
>>    need.
>
> I don't think events could cause a problem here since they're always
> outbound - so they could never block inbound commands?

Events are indeed okay as they are.  I merely wanted to mention that
they don't *have* to broadcast.  The "queue full" event mentioned under
6. probably shouldn't be broadcast.

>> 4. Commands are normally dispatched to a worker thread, where they can
>>    take their own sweet time to complete.
>> 
>>    Currently, the monitor runs in the main loop and executes commands
>>    directly.  This is effectively dispatching commands to the main loop.
>>    Dispatch to main loop is wrong, because it can make the main loop
>>    hang.  If it was the only relevant cause for main loop hangs, we'd
>>    move the command work out and be done.  Since it isn't (see 1.) we
>>    *also* have to move the monitor out to prevent main loop hangs from
>>    hanging the monitor.
>> 
>>    Moving monitor and command work to separate threads changes the
>>    dispatch from function call to queuing.  Need a pair of queues, one
>>    for commands, one for responses.
>> 
>>    Design alternative: one worker per monitor, or one worker for all
>>    monitors, or main loop is the one worker for all monitors.  The
>>    latter leaves the main loop hangs unaddressed.  No worse than before,
>>    so it could be okay as a first step.
>> 
>>    The worker provides the pair of queues.  It executes commands in
>>    order.  If a command blocks, the command queue stalls.
>> 
>>    The command queue can therefore grow without bounds.  See 6.
>> 
>> 5. Certain commands that always complete "quickly" can instead be
>>    executed directly, at the QMP client's request.  This direct
>>    execution is out-of-band; the response can "overtake" prior in-band
>>    commands' responses.
>> 
>>    The amount of work these out-of-band commands do themselves is up to
>>    them.  A quick query command would do all the work.  migrate-cancel
>>    could perhaps notify the migration thread and be done.  Postcopy
>>    recovery could perhaps send its argument struct to whatever thread
>>    handles recovery.
>
> Yes.

Message sending needs to be non-blocking.  If the message can't be sent,
the command should fail.  Queuing instead is a problematic idea, because
then you get to deal with the same flow control problems we're
discussing below.
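
A minimal sketch of such a non-blocking send, assuming a small fixed-size
mailbox owned by the receiving thread.  Mailbox, mailbox_try_send() and
MAILBOX_SLOTS are made-up names for illustration, not existing QEMU
interfaces.  The mutex is only held for a few instructions; the point is
that the sender never waits for free space.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAILBOX_SLOTS 4         /* hypothetical capacity */

typedef struct Mailbox {
    pthread_mutex_t lock;
    void *slot[MAILBOX_SLOTS];
    size_t head;                /* index of the oldest message */
    size_t count;               /* number of queued messages */
} Mailbox;

static void mailbox_init(Mailbox *mb)
{
    pthread_mutex_init(&mb->lock, NULL);
    mb->head = 0;
    mb->count = 0;
}

/*
 * Never waits for space: returns false when the mailbox is full, so the
 * out-of-band command can fail right away instead of queuing.
 */
static bool mailbox_try_send(Mailbox *mb, void *msg)
{
    bool ok = false;

    pthread_mutex_lock(&mb->lock);
    if (mb->count < MAILBOX_SLOTS) {
        mb->slot[(mb->head + mb->count) % MAILBOX_SLOTS] = msg;
        mb->count++;
        ok = true;
    }
    pthread_mutex_unlock(&mb->lock);
    return ok;
}

/* Receiver side: returns NULL when nothing is pending. */
static void *mailbox_try_recv(Mailbox *mb)
{
    void *msg = NULL;

    pthread_mutex_lock(&mb->lock);
    if (mb->count > 0) {
        msg = mb->slot[mb->head];
        mb->head = (mb->head + 1) % MAILBOX_SLOTS;
        mb->count--;
    }
    pthread_mutex_unlock(&mb->lock);
    return msg;
}
```

An out-of-band handler would call mailbox_try_send() and turn a false
return into an error response.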

>> 6. Flow control
>
> I think this is potentially the tricky bit!
>
>>    We currently leave flow control to the underlying character device.
>>    If the client sends more quickly than the monitor can execute, the
>>    client's send eventually blocks or fails with EAGAIN.  If the monitor
>>    sends more quickly than the client accepts, the monitor buffers
>>    without bounds (I think).
>> 
>>    Buffering monitor output without bounds is bad.  We could perhaps
>>    kill a monitor when it exceeds its limit.
>
> I'm not sure it's possible to define that limit; for example
> 'query-block' gives a list of information for all devices; there are
> people running with 200+ block devices so the output for that would be
> huge.

For comparison, here are our current input limits:

* Sum of JSON token size: 64MiB
* JSON token count: 2Mi
* JSON nesting depth 1024

The first two limit heap usage, the third limits stack usage.  The first
and the last go back to Anthony (commit 29c75dd).  I added the second
because the first is insufficient (commit df64983).

If we hit these generous limits, surely something has gone haywire.
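
For illustration, the three limits can be sketched as a per-message
budget.  This is a sketch only: InputBudget and input_budget_add() are
hypothetical names, and treating a value equal to the limit as still
acceptable is an assumption, not a statement about the real checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LIMIT_TOKEN_BYTES (64ULL << 20)   /* sum of token sizes: 64MiB */
#define LIMIT_TOKEN_COUNT (2ULL << 20)    /* token count: 2Mi */
#define LIMIT_NESTING     1024ULL         /* nesting depth */

typedef struct InputBudget {
    unsigned long long token_bytes;       /* bounds heap usage */
    unsigned long long token_count;       /* bounds heap usage */
    unsigned long long nesting;           /* bounds stack usage */
} InputBudget;

/*
 * Account for one token.  depth_change is +1 for an opening bracket or
 * brace, -1 for a closing one, 0 otherwise.  Returns false as soon as
 * any of the three limits is exceeded.
 */
static bool input_budget_add(InputBudget *b, size_t token_len,
                             int depth_change)
{
    b->token_bytes += token_len;
    b->token_count++;
    if (depth_change > 0) {
        b->nesting++;
    } else if (depth_change < 0 && b->nesting > 0) {
        b->nesting--;
    }

    return b->token_bytes <= LIMIT_TOKEN_BYTES
        && b->token_count <= LIMIT_TOKEN_COUNT
        && b->nesting <= LIMIT_NESTING;
}
```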

An output limit of 64MiB should be good for ~100k block devices with
MiBs to spare.  Generous enough for a "if you hit this limit, you're
abusing QMP way too much" argument?  If not, how far left would you like
me to shift the limit?

>>    Buffering monitor input (in the command queue) without bounds is just
>>    as bad.  It also destroys the existing flow control mechanism: the
>>    client can no longer detect that it's sending too much.  Not an issue
>>    for fully synchronous clients, i.e. clients that wait for the
>>    previous command's response before they send the next command.  Such
>>    clients cannot make use of out-of-band command execution.
>> 
>>    The obvious way to limit the command queue is to fail commands when
>>    the queue is "full".
>> 
>>    Note that we can't send an error response right away then, because
>>    the command is in-band (if it wasn't, we wouldn't queue it), so its
>>    response has to go after all the responses to the (in-band)
>>    commands currently in the queue.
>> 
>>    To tell the client right away, we could send an event.
>> 
>>    Delaying the "queue full" response until the correct time to send it
>>    requires state: at least the command ID.  We can just as well enqueue
>>    and pray memory will suffice.
>> 
>>    Note that the only reason for the command queue is out-of-band
>>    commands.  Without them, reading the next command is pointless.  This
>>    leads me to a possible solution: separate out-of-band mode, default
>>    off, QMP client can switch it on.  When off, we read monitor input
>>    just like we do now (no queue, no problem).  When on, we read and
>>    queue.  If the queue is full, we send a "queue full" event with the
>>    IDs of the commands we dropped on the floor.  By switching on
>>    out-of-band-mode, the QMP client also opts into this event.
>> 
>>    Switching could be done with QMP capabilities negotiation.
>
> I'm not sure how this queue interacts for multiple monitors using the
> single IO thread.  It's currently legal for each monitor to send one
> command and for that command to be outstanding; so 'queue full' mustn't
> happen in that case, because we still want to allow any of the monitors
> to issue one of the non-locking commands.

Right, the "queue full" condition must be per monitor, and it must not
apply to out-of-band commands (which aren't queued anyway).

> So I think we need 2x 1 entry input queues per monitor; one for normal
> command and one for non-locking commands; I think that's different
> from what we've previously suggested which is 2 central queues.

Perhaps I was less than clear under 4., but I meant to propose these
design alternatives: one shared worker fed by one pair of queues, or one
worker per monitor, each fed by its own pair of queues.  Another
alternative would be one shared worker fed by one pair of queues per
monitor.

Pair of queues means one for in-band commands, one for their responses.
There is no queue for out-of-band commands, because out-of-band commands
are not queued.
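
A rough sketch of the per-monitor variant, under the assumption that
each monitor owns a bounded in-band command queue.  MonitorQueues,
monitor_dispatch() and INBAND_QUEUE_LEN are invented for this example,
and the response queue is left out to keep it short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INBAND_QUEUE_LEN 8      /* hypothetical per-monitor limit */

typedef enum {
    DISPATCH_DIRECT,            /* out-of-band: executed without queuing */
    DISPATCH_QUEUED,            /* in-band: handed to the worker */
    DISPATCH_QUEUE_FULL         /* in-band: dropped, client gets an event */
} DispatchResult;

typedef struct MonitorQueues {
    int cmd_id[INBAND_QUEUE_LEN];   /* IDs of pending in-band commands */
    size_t cmd_count;
} MonitorQueues;

static DispatchResult monitor_dispatch(MonitorQueues *mq, int id,
                                       bool out_of_band)
{
    if (out_of_band) {
        return DISPATCH_DIRECT;     /* never touches the queue */
    }
    if (mq->cmd_count == INBAND_QUEUE_LEN) {
        return DISPATCH_QUEUE_FULL; /* caller reports the dropped id */
    }
    mq->cmd_id[mq->cmd_count++] = id;
    return DISPATCH_QUEUED;
}
```

Because the queue lives in the monitor, one monitor filling its queue
does not stop another monitor from queuing a command, and out-of-band
commands get through in either case.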

>> 7. How all this is related to "jobs"
>> 
>>    Out-of-band execution is a limited special case of asynchronous
>>    execution.  With general asynchronous execution, responses can be
>>    sent in any order.  With out-of-band execution, only the out-of-band
>>    responses can "jump" order, and only over in-band responses.
>> 
>>    "All commands are (to be treated as) asynchronous" is arguably more
>>    elegant than this out-of-band thing.  However, it runs into two
>>    roadblocks that don't apply to out-of-band.
>> 
>>    One, backward compatibility.  That's a roadblock only as much as we
>>    make it one.
>> 
>>    Two, consistency.  "All asynchronous, but we do most job-like things
>>    with commands + events anyway" is not acceptable to me.  I'd be
>>    willing to accept "all asynchronous" when it solves the jobs problem.
>
> I suspect there are other things that limit making everything
> asynchronous; for example commands that currently only expect to be
> executing in the main thread; if you wanted to make an existing command
> async you'd have to audit it for all the possible places it could hang.

You're right.

> I also see the other problem as keeping the management level
> understanding of which commands are asynchronous; Dan's suggestion is
> a command where the management layer specifies which commands it
> expects to be asynchronous, and qemu responds with which ones actually
> are.

"Command supports out-of-band dispatch" would be visible in
query-qmp-schema.

Design alternative: either switching on out-of-band mode (see 6.)
switches all out-of-band commands to out-of-band dispatch, or it
doesn't, and the client has to request out-of-band dispatch explicitly.
The explicit request could either be per execute (say send {'exec-oob':
COMMAND-NAME ...} instead of {'execute': COMMAND-NAME...}), or per
session, i.e. with a new command to enable oob dispatch for a list of
oob-capable commands.

I figure explicit is safer, because it lets us make more commands
oob-capable without upsetting existing oob-aware QMP clients.
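
To illustrate the per-execute variant, here is a small sketch of the
dispatch decision.  It assumes the 'exec-oob' key floated above;
oob_decide() and OobDecision are hypothetical, and rejecting 'exec-oob'
for a command that isn't oob-capable (or when out-of-band mode is off)
is my assumption about what should happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum {
    RUN_IN_BAND,
    RUN_OUT_OF_BAND,
    REJECT
} OobDecision;

/*
 * key is the request's top-level key, "execute" or "exec-oob".
 * A plain "execute" always stays in-band, no matter whether the command
 * has become oob-capable in the meantime.
 */
static OobDecision oob_decide(const char *key, bool oob_mode_on,
                              bool cmd_oob_capable)
{
    if (strcmp(key, "execute") == 0) {
        return RUN_IN_BAND;
    }
    if (strcmp(key, "exec-oob") == 0) {
        return (oob_mode_on && cmd_oob_capable) ? RUN_OUT_OF_BAND : REJECT;
    }
    return REJECT;              /* unknown key */
}
```

The first branch is what makes explicit requests safe: an existing client
that keeps sending "execute" sees no change in ordering.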

>>    You asked for a solution to the monitor availability problem that
>>    doesn't require you to solve the jobs problem first.  Well, here's my
>>    best try.  Go shoot some holes into it :)
>
> Hopefully we're running out of holes.

Thanks!

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-08 11:49                                 ` Markus Armbruster
@ 2017-09-08 13:19                                   ` Stefan Hajnoczi
  2017-09-11 10:32                                   ` Peter Xu
  2017-09-11 10:43                                   ` Daniel P. Berrange
  2 siblings, 0 replies; 104+ messages in thread
From: Stefan Hajnoczi @ 2017-09-08 13:19 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Dr. David Alan Gilbert, Laurent Vivier, Fam Zheng, Juan Quintela,
	Michael Roth, Peter Xu, qemu-devel, Paolo Bonzini, John Snow

On Fri, Sep 8, 2017 at 12:49 PM, Markus Armbruster <armbru@redhat.com> wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>
>> * Markus Armbruster (armbru@redhat.com) wrote:
>>> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
>>>
>>> > * Markus Armbruster (armbru@redhat.com) wrote:
>>> >> "Daniel P. Berrange" <berrange@redhat.com> writes:
>>> >>
>>> >> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
>>> If we can't eliminate main loop hangs, any ideas on reducing their
>>> impact?
>>
>> Note there's two related things; main loop hangs and bql hangs; I'm not
>> sure that the two are always the same.
>>
>> Stefan mentioned some ways of doing asynchronous memory lookups/accesses
>> though I'm not sure they'd work in the postcopy case; but they'd need
>> work in lots of devices.
>> Some of the IO under the BQL might be fixable; IMHO in a lot of places
>> we don't really need the full BQL, we just need a 'you aren't going to
>> change the config' lock.
>
> This is all about reducing main loop hangs.  Another one is moving
> "slow" code out of the main loop, e.g. monitor commands.
>
> My question was aiming in a slightly different direction, however: given
> that the main loop can hang, is there anything we can do to mitigate
> known bad consequences of such hangs?

I don't think we can mitigate it completely but we can make it visible
and easier to study.

There were discussions about making the event loop observable in the
past.  In other words, logging which handler functions are firing.
That way you can debug scenarios where the loop is spinning
("main-loop: WARNING: I/O thread spun for 1000 iterations\n") and also
latency.  Collecting event handler latencies and looking at the
histogram would be interesting.  The outliers (e.g. 250+ microseconds)
are things that we should know about and consider refactoring.
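
As a sketch of what collecting those latencies could look like: a
power-of-two histogram in microseconds plus an outlier counter.
LatencyHist and latency_record() are made-up names, and the caller is
assumed to measure the time around each handler invocation itself.

```c
#include <assert.h>
#include <stdint.h>

#define LAT_BUCKETS 16
#define OUTLIER_US  250         /* threshold taken from the text above */

typedef struct LatencyHist {
    /*
     * bucket[0] counts 0..1 us, bucket[i] counts [2^i, 2^(i+1)) us;
     * the last bucket also takes everything larger.
     */
    uint64_t bucket[LAT_BUCKETS];
    uint64_t outliers;          /* samples of OUTLIER_US or more */
} LatencyHist;

static void latency_record(LatencyHist *h, uint64_t us)
{
    unsigned i = 0;
    uint64_t v = us;

    /* i = floor(log2(us)), clamped to the last bucket */
    while (v > 1 && i < LAT_BUCKETS - 1) {
        v >>= 1;
        i++;
    }
    h->bucket[i]++;
    if (us >= OUTLIER_US) {
        h->outliers++;
    }
}
```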

* Re: [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll
  2017-08-28  5:53         ` Peter Xu
@ 2017-09-08 17:29           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-08 17:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Aug 25, 2017 at 10:30:42AM +0100, Dr. David Alan Gilbert wrote:
> 
> [...]
> 
> > > >   c) As mentioned on irc there's fun to be had with cur_mon and error
> > > >      handling - in my local world I have cur_mon declared as __thread
> > > >      but never got around to thinking about what should set it up.
> > > >      There's also 'wavcapture: Convert to error_report' that I posted
> > > >      in March that got rid of some uses of cur_mon in wavcapture.c
> > > >      for error_report.
> > > 
> > > Yeh.  I at least also see a positive ACK from Markus in the other
> > > thread for per-thread cur_mon, sounds like this is the right way to
> > > go.
> > > 
> > > To setup cur_mon, what I can think of is create wrapper for
> > > pthread_create() in qemu_thread_create().  I see that we have done
> > > similar thing in util/qemu-thread-win32.c for Windows.  With that we
> > > can setup the cur_mon before going into real thread function but in
> > > the right context, though we may need one more parameter for current
> > > qemu_thread_create():
> > > 
> > > void qemu_thread_create(QemuThread *thread, const char *name,
> > >                        void *(*start_routine)(void*),
> > >                        void *arg, int mode, Monitor *mon);
> > > 
> > > Then we can specify monitor for any new thread (default to cur_mon).
> > > For per-monitor threads, I think we need to pass in that specific mon.
> > > 
> > > Is this doable?
> > 
> > That would mean changing all the qemu_thread_create calls, but yes
> > I guess is doable.  I'd thought the other way, perhaps you inherit
> > Monitor except in the case of when the monitor creates threads.
> 
> Do you mean setup cur_mon in monitor threads?
> 
> I'm afraid that may not be enough, since after we mark cur_mon as
> __thread variable, it should be NULL for each newly created threads,
> then we need to init them for every thread.  Or anything I missed?

Right, I thought you'd modify qemu_thread_create to setup cur_mon in
new threads to the same as the parent.
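
Roughly like this, as a sketch with plain pthreads rather than the real
qemu_thread_create().  thread_create_inherit(), ThreadStart and
report_cur_mon() are hypothetical names, and __thread is the GCC/Clang
spelling of thread-local storage.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct Monitor Monitor;         /* opaque in this sketch */

static __thread Monitor *cur_mon;       /* starts as NULL in each thread */

typedef struct ThreadStart {
    void *(*start_routine)(void *);
    void *arg;
    Monitor *parent_mon;                /* creator's cur_mon */
} ThreadStart;

static void *thread_trampoline(void *opaque)
{
    ThreadStart ts = *(ThreadStart *)opaque;

    free(opaque);
    cur_mon = ts.parent_mon;            /* runs in the new thread */
    return ts.start_routine(ts.arg);
}

/* Hypothetical stand-in for qemu_thread_create(). */
static int thread_create_inherit(pthread_t *thread,
                                 void *(*start_routine)(void *), void *arg)
{
    ThreadStart *ts = malloc(sizeof(*ts));
    int ret;

    if (!ts) {
        return -1;
    }
    ts->start_routine = start_routine;
    ts->arg = arg;
    ts->parent_mon = cur_mon;           /* read in the parent thread */
    ret = pthread_create(thread, NULL, thread_trampoline, ts);
    if (ret != 0) {
        free(ts);
    }
    return ret;
}

/* Demo thread body: stores the cur_mon it sees into *arg. */
static void *report_cur_mon(void *arg)
{
    *(Monitor **)arg = cur_mon;
    return NULL;
}
```

A per-monitor thread would then overwrite cur_mon with its own monitor
at the start of its thread function.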

Dave

> [...]
> 
> > > > 
> > > >   d) I wonder if it's better to have thread as a flag, so that you have
> > > >      to explicitly ask for a monitor to have its own thread.
> > > 
> > > This should be doable.  Would a new parameter for "-qmp" and "-hmp"
> > > suffice?
> > 
> > Yes.
> 
> (I meant "-monitor" when saying "-hmp")
> 
> Hmm, it seems not easy to simply add a new parameter for it, since we
> used "," already to parse the chardev params in monitor codes, like
> the usage of:
> 
>   -qmp telnet::8888,server,nowait
> 
> So I cannot simply do:
> 
>   -qmp telnet::8888,server,nowait,threaded=on
> 
> Or it will be treated as a parameter for chardev type "telnet".
> 
> I can at least add something similar to QEMU_OPTION_qmp_pretty, like:
> QEMU_OPTION_qmp_threaded, and maybe I also need
> QEMU_OPTION_monitor_pretty.  But I am thinking whether there can be
> anything better.  Any suggestion from anyone?
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-28  8:08           ` Peter Xu
@ 2017-09-08 17:38             ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-08 17:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: Fam Zheng, qemu-devel, Paolo Bonzini, Daniel P . Berrange,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Aug 25, 2017 at 10:14:12AM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Thu, Aug 24, 2017 at 07:37:32AM +0800, Fam Zheng wrote:
> > > > On Wed, 08/23 18:44, Dr. David Alan Gilbert wrote:
> > > > > * Peter Xu (peterx@redhat.com) wrote:
> > > > > > Introducing this new parameter for QMP commands in general to mark out
> > > > > > when the command does not need BQL.  Normally QMP command executions are
> > > > > > done with the protection of BQL in QEMU.  However the truth is that not
> > > > > > all the QMP commands require the BQL.
> > > > > > 
> > > > > > This new parameter provides a way to allow QMP commands to run in
> > > > > > parallel when possible, without the contention on the BQL.
> > > > > > 
> > > > > > Since the default value of "without-bql" is false, all QMP commands
> > > > > > are still protected by the BQL for now.
> > > > > > 
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > 
> > > > > We should define what a 'without-bql' command is allowed to do:
> > > > >    'Commands that have without-bql set _may_ be called without the bql
> > > > >    being taken.  They must not take the bql or any other lock that may
> > > > >    become dependent on the bql.'
> > > 
> > > Sure.
> > > 
> > > > >    (Do we need to say anything about RCU?)
> > > 
> > > Could I ask how is RCU related?
> > 
> > My definition above said that anything declared without bql couldn't
> > take the bql, so couldn't block on any other thread holding the bql.
> > But is our command allowed to use synchronize_rcu or rcu_read_lock
> > that could wait for or block other threads doing rcu stuff?
> > Because if it did is there any guarantee that it wouldn't block?
> 
> I see.  Shall we just ignore RCU for now?  Since currently I don't see
> a real synchronize_rcu() user yet in QEMU, except the RCU thread.  And
> rcu_read_lock() should not block itself, so IMHO calling it only in
> monitor command handlers should always be fine?

Yes, I think you're right that the rcu_read_lock is OK; just something
to keep in mind.

Dave

> > 
> > 
> > > 
> > > > > 
> > > > > Also, 'no-bql' is shorter :-)
> > > > 
> > > > Or rather "need-bql" that defaults to true to avoid double negative (TM) with
> > > > "no-bql = false"?
> > > 
> > > Ok let me use "need-bql". :)
> > 
> > Fine by me.
> 
> I'm switching to "need-bql" for QMP only, and used "no-bql" in HMP,
> since I failed to find a good way to init mon_cmd_t field to true by
> default.
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql"
  2017-08-28  8:26         ` Peter Xu
@ 2017-09-08 17:52           ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 104+ messages in thread
From: Dr. David Alan Gilbert @ 2017-09-08 17:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Paolo Bonzini, Daniel P . Berrange, Fam Zheng,
	Juan Quintela, mdroth, Eric Blake, Laurent Vivier,
	Markus Armbruster

* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Aug 25, 2017 at 10:06:27AM +0100, Dr. David Alan Gilbert wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > On Wed, Aug 23, 2017 at 06:44:12PM +0100, Dr. David Alan Gilbert wrote:
> > > 
> > > [...]
> > > 
> > > > > +Most of the commands require the Big QEMU Lock (BQL) be held during
> > > > > +execution.  However, there is a small subset of the commands that may
> > > > > +not really need BQL at all.  To mark out this kind of commands, we can
> > > > > +specify "without-bql" to "true".  This parameter is only a hint for
> > > > > > +internal QMP implementation to provide possibility to allow commands
> > > > > +be run in parallel, or reduce the contention of the lock.  Users of QMP
> > > > > +should not really be aware of such information.
> > > > 
> > > > Well, I think users of these commands might select them specifically
> > > > because they know that they won't block.  Those who care about latency might
> > > > look to use commands that don't take the lock because of a reduced
> > > > effect on the performance as well.
> > > 
> > > What would be the best way to tell the user?  I think again this should
> > > mostly be for HMP only, right?
> > 
> > It needs to be documented for QMP users as well so that those developing
> > management code know what's safe.
> 
> I see.  What's the corresponding QMP documentation I should touch up?

I'm not sure, but based on the long thread; I think the idea is to add
something to the schema so the flag appears in the introspection.  I'll
leave the details of how to Markus.

> > 
> > > Maybe we can add a new command to list these lock-free commands.  Or,
> > > I can dump something in "help" and "help info" like:
> > > 
> > > (qemu) help migrate_incoming
> > > migrate_incoming uri -- Continue an incoming migration from an -incoming defer (BQL-less)
> > 
> > 'lock free' might be better?
> 
> I'm ok with it.  But would the word "lock" be too general?

Maybe, but it's probably not just BQL.

Dave

> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-08 11:49                                 ` Markus Armbruster
  2017-09-08 13:19                                   ` Stefan Hajnoczi
@ 2017-09-11 10:32                                   ` Peter Xu
  2017-09-11 10:36                                     ` Peter Xu
  2017-09-11 10:43                                   ` Daniel P. Berrange
  2 siblings, 1 reply; 104+ messages in thread
From: Peter Xu @ 2017-09-11 10:32 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Dr. David Alan Gilbert, Laurent Vivier, Fam Zheng, Juan Quintela,
	qemu-devel, mdroth, Paolo Bonzini, John Snow

On Fri, Sep 08, 2017 at 01:49:41PM +0200, Markus Armbruster wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> >> 
> >> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> >> "Daniel P. Berrange" <berrange@redhat.com> writes:
> >> >> 
> >> >> > On Thu, Sep 07, 2017 at 02:59:28PM +0200, Markus Armbruster wrote:
> >> >> >> So, what exactly is going to drain the command queue?  If there's more
> >> >> >> than one consumer, how exactly are commands from the queue dispatched to
> >> >> >> the consumers?
> >> >> >
> >> >> > In terms of my proposal, for any single command there should only ever
> >> >> > be a single consumer. The default consumer would be the main event loop
> >> >> > thread, such that we have no semantic change to QMP operation from today.
> >> >> >
> >> >> > Some commands that are capable of being made "async", would have a
> >> >> > different consumer. For example, if the client requested the 'migrate-cancel'
> >> >> > be made async, this would change things such that the migration thread is
> >> >> > now responsible for consuming the "migrate-cancel" command, instead of the
> >> >> > default main loop.
> >> >> >
> >> >> >> What are the "no hang" guarantees (if any) and conditions for each of
> >> >> >> these consumers?
> >> >> >
> >> >> > The non-main thread consumers would have to have some reasonable
> >> >> > guarantee that they won't block on a lock held by the main loop,
> >> >> > otherwise the whole feature is largely useless.
> >> >> 
> >> >> Same if they block indefinitely on anything else, actually.  In other
> >> >> words, we need to talk about liveness.
> >> >> 
> >> >> Threads by themselves don't buy us liveness.  Being careful with
> >> >> operations that may block does.  That care may lead to farming out
> >> >> certain operations to other threads, where they may block without harm.
> >> >> 
> >> >> You only talk about "the non-main thread consumers".  What about the
> >> >> main thread?  Is it okay for the main thread to block?  If yes, why?
> >> >
> >> > It would be great if the main thread never blocked; but IMHO that's
> >> > a huge task that we'll never get done [challenge].
> >> 
> >> This is perhaps starting to wander off the topic, but here goes anyway.
> >> 
> >> What unpleasant things can happen when the main loop hangs?
> >
> >   a) We can't interact with the monitor to fix the cause of the hang
> >      (Which is my main interest here)
> >   b) IO emulation might also be blocked because it's waiting on the bql
> 
> To readers other than Dave: anything else?
> 
> >> What are the known causes of main loop hangs?  Any ideas on fixing them?
> >
> >   c) hangs on networking while under BQL; there's at least one case near
> >   the end of migrate
> >   d) hangs on storage devices while under BQL; I think there are similar
> >   cases in migrate and possibly elsewhere
> >   e) postcopy pages not yet arrived - then a problem if the postcopy
> >   dies and needs recovery (because of a)
> 
> Any ideas on fixing them?
> 
> To readers other than Dave: anything else?
> 
> >> Are the unknown main loop hangs relevant in practice?
> >
> > Well, the unknown ones are unknown; the known ones however seem
> > relevant:
> >   f) I can't recover a failed postcopy
> >   g) A COLO synchronisation might hang at a bad point in migrate and
> >   you can't kill it off to cause one side to continue
> >   h) A failure of networking at just the wrong point in migrate can
> >   cause the source to be paused for a long time - but I don't think
> >   I've seen it in practice.
> 
> I don't doubt your assertion that the known ones are relevant; I assume
> you've run into them.
> 
> The purpose of my question is to find out how serious a problem the
> unknown causes are.  I'm afraid the answer is "we don't know".
> 
> >> If we can't eliminate main loop hangs, any ideas on reducing their
> >> impact?
> >
> > Note there's two related things; main loop hangs and bql hangs; I'm not
> > sure that the two are always the same.
> >
> > Stefan mentioned some ways of doing asynchronous memory lookups/accesses
> > though I'm not sure they'd work in the postcopy case; but they'd need
> > work in lots of devices.
> > Some of the IO under the BQL might be fixable; IMHO in a lot of places
> > we don't really need the full BQL, we just need a 'you aren't going to
> > change the config' lock.
> 
> This is all about reducing main loop hangs.  Another one is moving
> "slow" code out of the main loop, e.g. monitor commands.
> 
> My question was aiming in a slightly different direction, however: given
> that the main loop can hang, is there anything we can do to mitigate
> known bad consequences of such hangs?
> 
> We're actually discussing one thing we can do to mitigate: moving the
> monitor core out of the main loop, to keep the monitor available.  Any
> other ideas?
> 
> >> >> >> We can have any number of QMP monitors today.  Would each of them feed
> >> >> >> its own queue?  Would they all feed a shared queue?
> >> >> >
> >> >> > Currently with multiple QMP monitors, everything runs in the main
> >> >> > loop, so commands arriving across  multiple monitors are 100%
> >> >> > serialized and processed strictly in the order in which QEMU reads
> >> >> > them off the wire.  To maintain these semantics, we would need to
> >> >> > have a single shared queue for the default main loop consumer, so
> >> >> > that ordering does not change.
> >> >> >
> >> >> >> How exactly is opt-in asynchronous to work?  Per QMP monitor?  Per
> >> >> >> command?
> >> >> >
> >> >> > Per monitor+command. ie just because libvirt knows how to cope with
> >> >> > async execution on the monitor it has open, does not mean that a
> >> >> > different app on the 2nd monitor command can cope. So in my proposal
> >> >> > the switch to async must be scoped to the particular command only
> >> >> > for the monitor connection that requested it.
> >> >> >
> >> >> >> What does it mean when an asynchronous command follows a synchronous
> >> >> >> command in the same QMP monitor?  I would expect the synchronous command
> >> >> >> to complete before the asynchronous command, because that's what
> >> >> >> synchronous means, isn't it?  To keep your QMP monitor available, you
> >> >> >> then must not send synchronous commands that can hang.
> >> >> >
> >> >> > No, that is not what I described. All synchronous commands are
> >> >> > serialized wrt each other, just as today. An asynchronous command
> >> >> > can run as soon as it is received, regardless of whether any
> >> >> > earlier sent sync commands are still executing or pending. This
> >> >> > is trivial to achieve when you separate monitor I/O from command
> >> >> > execution in separate threads, provided of course the async
> >> >> > command consumers are not in the main loop.
> >> >> 
> >> >> So, a synchronous command is synchronous with respect to other commands,
> >> >> except for certain non-blocking commands.  The distinctive feature of
> >> >> the latter isn't so much an asynchronous reply, but out-of-band
> >> >> dispatch.
> >> >> 
> >> >> Out-of-band dispatch of commands that cannot block is in fact orthogonal to
> >> >> asynchronous replies.  I can't see why out-of-band dispatch of
> >> >> synchronous non-blocking commands wouldn't work, too.
> >> >> 
> >> >> >> How can we determine whether a certain synchronous command can hang?
> >> >> >> Note that with opt-in async, *all* commands are also synchronous
> >> >> >> commands.
> >> >> >> 
> >> >> >> In short, explain to me how exactly you plan to ensure that certain QMP
> >> >> >> commands (such as post-copy recovery) can always "get through", in the
> >> >> >> presence of multiple monitors, hanging main loop, hanging synchronous
> >> >> >> commands, hanging whatever-else-can-now-hang-in-this-post-copy-world.
> >> >> >
> >> >> > Taking migrate-cancel as the example. The migration code already has
> >> >> > a background thread doing work independently of the main loop. Upon
> >> >> > marking the migrate-cancel command as async, the migration control
> >> >> > thread would become the consumer of migrate-cancel.
> >> >> 
> >> >> From 30,000 feet, the QMP monitor sends a "cancel" message to the
> >> >> migration thread, and later receives a "canceled" message from the
> >> >> migration thread.
> >> >> 
> >> >> From 300 feet, we use the migrate-cancel QMP command as the cancel
> >> >> message, and its success response as the "canceled" message.
> >> >> 
> >> >> In other words, we're pressing the external QM-Protocol into service as
> >> >> internal message passing protocol.
> >> >
> >> > Be careful; it's not a cancel in the postcopy recovery case, it's a
> >> > restart.  The command is very much like the migration-incoming command.
> >> > The management layer has to provide data with the request, so it's not
> >> > an internal command.
> >> 
> >> It's still a message.
> >> 
> >> >> >                                                     This allows the
> >> >> > migration operation to be cancelled immediately, regardless of whether
> >> >> > there are earlier monitor commands blocked in the main loop.
> >> >> 
> >> >> The necessary part is moving all operations that can block out of
> >> >> whatever loop runs the monitor, be it the main loop, some other event
> >> >> loop, or a dedicated monitor thread's monitor loop.
> >> >> 
> >> >> Moving out non-blocking operations isn't necessary.  migrate-cancel
> >> >> could communicate with the migration thread by any suitable mechanism or
> >> >> protocol.  It doesn't have to be QMP.  Why would we want it to be QMP?
> >> >
> >> > Because why invent another wheel?
> >> > This is a command that the management layer has to issue to qemu for
> >> > it to recover, including passing data, in a way similar to other
> >> > commands - so it looks like a QMP command, so why not use QMP.
> >> 
> >> Point taken.
> >> 
> >> Minor terminology remark: I'd prefer to call this a reuse of QAPI rather
> >> than QMP, because QMP makes me think of sockets and JSON, while QAPI
> >> makes me think of generated data types and marshaling code.
> >
> > Well it's a command that's got to come over the socket from management,
> > so I'm still thinking sockets and JSON. A lot of the problems you
> > describe below are more to do with the pain of managing the messages
> > squeezed through a socket.
> >
> >> > Also, I think making other commands lock-free is advantageous - 
> >> > some of the 'info' commands just don't really need locks, making them
> >> > not use locks removes latency effects caused by the management layer
> >> > prodding qemu.
> >> 
> >> I get the desire to move commands that can block out of whatever loop
> >> runs the monitor.  But moving out commands that always complete quickly
> >> seems pointless: by the time you're done queuing them, you could be done
> >> *executing* them.  More on that below.
> >
> > My thinking here wasn't about the speed of executing the command, my
> > interest was more on the performance of the guest/IO - avoiding taking
> > the BQL would have less impact on IO emulation, as would keeping the
> > main thread free.
> >
> >> >> > Of course this assumes the migration control thread can't block
> >> >> > for locks held by the main thread.
> >> >> 
> >> >> Thanks for your answers, they help.
> >> >> 
> >> >> >> Now let's talk about QMP requirements.
> >> >> >> 
> >> >> >> Any addition to QMP must consider what exists already.
> >> >> >> 
> >> >> >> You may add more of the same.
> >> >> >> 
> >> >> >> You may generalize existing stuff.
> >> >> >> 
> >> >> >> You may change existing stuff if you have sufficient reason, subject to
> >> >> >> backward compatibility constraints.
> >> >> >> 
> >> >> >> But attempts to add new ways to do the same old stuff without properly
> >> >> >> integrating the existing ways are not going to fly.
> >> >> >> 
> >> >> >> In particular, any new way to start some job, monitor and control it
> >> >> >> while it lives, get notified about its state changes and so forth must
> >> >> >> integrate the existing ways.  These include block jobs (probably the
> >> >> >> most sophisticated of the lot), migration, dump-guest-memory, and
> >> >> >> possibly more.  They all work the same way: synchronous command to kick
> >> >> >> off the job, more synchronous commands to monitor and control, events to
> >> >> >> notify.  They do differ in detail.
> >> >> >> 
> >> >> >> Asynchronous commands are a new way to do this.  When you only need to
> >> >> >> be notified on "done", and don't need to monitor / control, they fit the
> >> >> >> bill quite neatly.
> >> >> >> 
> >> >> >> However, we can't just ignore the cases where we need more than that!
> >> >> >> For those, we want a single generic solution instead of the several ad
> >> >> >> hoc solutions we have now.
> >> >> >> 
> >> >> >> If we add asynchronous commands *now*, and for simple cases only, we add
> >> >> >> yet another special case for a future generic solution to integrate.
> >> >> >> I'm not going to let that happen.
> >> >> >
> >> >> > With the async commands suggestion, while it would initially not
> >> >> > provide a way to query incremental status, that could easily be
> >> >> > fitted in.
> >> >> 
> >> >> This is [*] below.
> >> >> 
> >> >> >             Because command replies from async commands may be
> >> >> > out-of-order wrt the original requests, clients would need to
> >> >> > provide a unique ID for each command run. This originally was
> >> >> > part of QMP spec but then dropped, but libvirt still actually
> >> >> > generates a unique ID for every QMP command.
> >> >> >
> >> >> > Given this, one option is to actually use the QMP command ID as
> >> >> > a job ID, and let you query ongoing status via some new QMP
> >> >> > command that accepts the ID of the job to be queried. A complexity
> >> >> > with this is how to make the jobs visible across multiple QMP
> >> >> > monitors. The job ID might actually have to be a combination of
> >> >> > the serial ID from the QMP command, and the ID of the monitor
> >> >> > chardev combined.
> >> >> 
> >> >> Yes.  The job ID must be unique across all QMP monitors to make
> >> >> broadcast notifications work.
> >> >> 
> >> >> >> I figure the closest to a generic solution we have is block jobs.
> >> >> >> Perhaps a generic solution could be had by abstracting away the "block"
> >> >> >> from "block jobs", leaving just "jobs".
> >> >> 
> >> >> [*] starts here:
> >> >> 
> >> >> >> Another approach is generalizing the asynchronous command proposal to
> >> >> >> fully cover the not-so-simple cases.
> >> >> 
> >> >> We know asynchronous commands "fully cover" when we can use them to
> >> >> replace all the existing job-like commands.
> >> >> 
> >> >> Until then, they enlarge rather than solve our jobs problem.
> >> >> 
> >> >> I get the need for an available monitor.  But I need to balance it with
> >> >> other needs.  Can we find a solution for our monitor availability
> >> >> problem that doesn't enlarge our jobs problem?
> >> >
> >> > Hopefully!
> >> >
> >> > Dave
> >> >
> >> >> >> If you'd rather want to make progress on monitor availability without
> >> >> >> cracking the "jobs" problem, you're in luck!  Use your license to "add
> >> >> >> more of the same": synchronous command to start a job, query to monitor,
> >> >> >> event to notify.  
> >> >> >> 
> >> >> >> If you insist on tying your monitor availability solution to
> >> >> >> asynchronous commands, then I'm in luck!  I just found volunteers to
> >> >> >> solve the "jobs" problem for me.
> >> 
> >> Let me try to distill the discussion so far into a design sketch.
> >> 
> >> 1. A QMP monitor runs in a loop.  The loop may execute other stuff, but
> >>    this must not unduly delay the monitor's work.  Thus, everything in
> >>    this loop must complete "quickly".
> >> 
> >>    All QMP monitors currently run in the main loop, which really should
> >>    satisfy "quickly", but doesn't.  Since fixing that to a tolerable
> >>    degree is beyond our means (is it?), we move them out.
> >> 
> >>    Design alternative: either one loop and thread per monitor, or one
> >>    loop and thread for all monitors, or something in between.
> >> 
> >>    I'm wary of "one thread per software artifact" designs.  "One
> >>    (preemptable) thread per activity, all sharing state" is a lousy way
> >>    to structure software.
> >
> > Shrug; I've always thought of it as an easy solution unless you'd
> > get into hundreds of threads, which given the number of monitors, we
> > won't.
> >
> >> 2. A QMP monitor receives and dispatches commands, and sends command
> >>    responses and events.
> >> 
> >>    What if sending a response or event would block?  See 6.
> >> 
> >> 3. Arbitrary code can trigger QMP events.  Events are broadcast to all
> >>    QMP monitors.  Each QMP monitor provides an event queue.  When an
> >>    event is triggered, it gets put into all queues, subject to rate
> >>    limiting.
> 
> Correction: only events that are rate-limited go through the queue.  The
> others bypass it.  This is an optimization.
> 
> >>    Rate limiting and queuing needs some shared data, which is protected
> >>    by a mutex.  The critical sections guarded by this mutex must be
> >>    "quick".
> 
> There's another mutex guarding the monitor's output buffer (see 6.),
> among other things.
> 
> >>    Nothing new here, it's how events work today.
> >> 
> >>    We could easily add events that go to just one monitor, if there's a
> >>    need.
> >
> > I don't think events could cause a problem here since they're always
> > outbound - so they could never block inbound commands?
> 
> Events are indeed okay as they are.  I merely wanted to mention that
> they don't *have* to broadcast.  The "queue full" event mentioned under
> 6. probably shouldn't be broadcast.
> 
> >> 4. Commands are normally dispatched to a worker thread, where they can
> >>    take their own sweet time to complete.
> >> 
> >>    Currently, the monitor runs in the main loop and executes commands
> >>    directly.  This is effectively dispatching commands to the main loop.
> >>    Dispatch to main loop is wrong, because it can make the main loop
> >>    hang.  If it was the only relevant cause for main loop hangs, we'd
> >>    move the command work out and be done.  Since it isn't (see 1.) we
> >>    *also* have to move the monitor out to prevent main loop hangs from
> >>    hanging the monitor.
> >> 
> >>    Moving monitor and command work to separate threads changes the
> >>    dispatch from function call to queuing.  Need a pair of queues, one
> >>    for commands, one for responses.
> >> 
> >>    Design alternative: one worker per monitor, or one worker for all
> >>    monitors, or main loop is the one worker for all monitors.  The
> >>    latter leaves the main loop hangs unaddressed.  No worse than before,
> >>    so it could be okay as a first step.
> >> 
> >>    The worker provides the pair of queues.  It executes commands in
> >>    order.  If a command blocks, the command queue stalls.
> >> 
> >>    The command queue can therefore grow without bounds.  See 6.
> >> 
> >> 5. Certain commands that always complete "quickly" can instead be
> >>    executed directly, at the QMP client's request.  This direct
> >>    execution is out-of-band; the response can "overtake" prior in-band
> >>    commands' responses.
> >> 
> >>    The amount of work these out-of-band commands do themselves is up to
> >>    them.  A quick query command would do all the work.  migrate-cancel
> >>    could perhaps notify the migration thread and be done.  Postcopy
> >>    recovery could perhaps send its argument struct to whatever thread
> >>    handles recovery.
> >
> > Yes.
> 
> Message sending needs to be non-blocking.  If the message can't be sent,
> the command should fail.  Queuing instead is a problematic idea, because
> then you get to deal with the same flow control problems we're
> discussing below.
> 
> >> 6. Flow control
> >
> > I think this is potentially the tricky bit!
> >
> >>    We currently leave flow control to the underlying character device.
> >>    If the client sends more quickly than the monitor can execute, the
> >>    client's send eventually blocks or fails with EAGAIN.  If the monitor
> >>    sends more quickly than the client accepts, the monitor buffers
> >>    without bounds (I think).
> >> 
> >>    Buffering monitor output without bounds is bad.  We could perhaps
> >>    kill a monitor when it exceeds its limit.
> >
> > I'm not sure it's possible to define that limit; for example
> > 'query-block' gives a list of information for all devices; there are
> > people running with 200+ block devices so the output for that would be
> > huge.
> 
> For comparison, here are our current input limits:
> 
> * Sum of JSON token size: 64MiB
> * JSON token count: 2Mi
> * JSON nesting depth 1024
> 
> The first two limit heap usage, the third limits stack usage.  The first
> and the last go back to Anthony (commit 29c75dd).  I added the second
> because the first is insufficient (commit df64983).
> 
> If we hit these generous limits, surely something has gone haywire.
> 
> An output limit of 64MiB should be good for ~100k block devices with
> MiBs to spare.  Generous enough for a "if you hit this limit, you're
> abusing QMP way too much" argument?  If not, how far left would you like
> me to shift the limit?
> 
> >>    Buffering monitor input (in the command queue) without bounds is just
> >>    as bad.  It also destroys the existing flow control mechanism: the
> >>    client can no longer detect that it's sending too much.  Not an issue
> >>    for fully synchronous clients, i.e. clients that wait for the
> >>    previous command's response before they send the next command.  Such
> >>    clients cannot make use of out-of-band command execution.
> >> 
> >>    The obvious way to limit the command queue is to fail commands when
> >>    the queue is "full".
> >> 
> >>    Note that we can't send an error response right away then, because
> >>    the command is in-band (if it wasn't, we wouldn't queue it), so its
> >>    response has to go after all the responses to the (in-band)
> >>    commands currently in the queue.
> >> 
> >>    To tell the client right away, we could send an event.
> >> 
> >>    Delaying the "queue full" response until the correct time to send it
> >>    requires state: at least the command ID.  We can just as well enqueue
> >>    and pray memory will suffice.
> >> 
> >>    Note that the only reason for the command queue is out-of-band
> >>    commands.  Without them, reading the next command is pointless.  This
> >>    leads me to a possible solution: separate out-of-band mode, default
> >>    off, QMP client can switch it on.  When off, we read monitor input
> >>    just like we do now (no queue, no problem).  When on, we read and
> >>    queue.  If the queue is full, we send a "queue full" event with the
> >>    IDs of the commands we dropped on the floor.  By switching on
> >>    out-of-band-mode, the QMP client also opts into this event.
> >> 
> >>    Switching could be done with QMP capabilities negotiation.
> >
> > I'm not sure how this queue interacts for multiple monitors using the
> > single IO thread.  It's currently legal for each monitor to send one
> > command and for that command to be outstanding; so 'queue full' mustn't
> > happen in that case, because we still want to allow any of the monitors
> > to issue one of the non-locking commands.
> 
> Right, the "queue full" condition must be per monitor, and it must not
> apply to in-band commands (which aren't queued anyway).
> 
> > So I think we need 2x 1 entry input queues per monitor; one for normal
> > command and one for non-locking commands; I think that's different
> > from what we've previously suggested which is 2 central queues.
> 
> Perhaps I was less than clear under 4., but I meant to propose design
> alternatives: one shared worker fed by one pair of queues, and one worker
> per monitor, each fed by its own pair of queues.  Another alternative
> would be one shared worker fed by one pair of queues per monitor.
> 
> Pair of queues means one for in-band commands, one for their responses.
> There is no queue for out-of-band commands, because out-of-band commands
> are not queued.
> 
> >> 7. How all this is related to "jobs"
> >> 
> >>    Out-of-band execution is a limited special case of asynchronous
> >>    execution.  With general asynchronous execution, responses can be
> >>    sent in any order.  With out-of-band execution, only the out-of-band
> >>    responses can "jump" order, and only over in-band responses.
> >> 
> >>    "All commands are (to be treated as) asynchronous" is arguably more
> >>    elegant than this out-of-band thing.  However, it runs into two
> >>    roadblocks that don't apply to out-of-band.
> >> 
> >>    One, backward compatibility.  That's a roadblock only as much as we
> >>    make it one.
> >> 
> >>    Two, consistency.  "All asynchronous, but we do most job-like things
> >>    with commands + events anyway" is not acceptable to me.  I'd be
> >>    willing to accept "all asynchronous" when it solves the jobs problem.
> >
> > I suspect there are other things that limit making everything
> > asynchronous; for example commands that currently only expect to be
> > executing in the main thread; if you wanted to make an existing command
> > async you'd have to audit it for all the possible places it could hang.
> 
> You're right.
> 
> > I also see the other problem as keeping the management level
> > understanding of which commands are asynchronous; Dan's suggestion is
> > that command where the management layer specifies which commands it
> > expects to be asynchronous, and qemu responds with which ones actually
> > are.
> 
> "Command supports out-of-band dispatch" would be visible in
> query-qmp-schema.
> 
> Design alternative: either switching on out-of-band mode (see 6.)
> switches all out-of-band commands to out-of-band dispatch, or it
> doesn't, and the client has to request out-of-band dispatch explicitly.
> The explicit request could either be per execute (say send {'exec-oob':
> COMMAND-NAME ...} instead of {'execute': COMMAND-NAME...}), or per
> session, i.e. with a new command to enable oob dispatch for a list of
> oob-capable commands.
> 
> I figure explicit is safer, because it lets us make more commands
> oob-capable without upsetting existing oob-aware QMP clients.

I think this OOB solution should work for us, though I'm still trying
to digest this whole thing.  Thanks Markus for this design, much
appreciated.  Meanwhile, sorry to have troubled you on this. I really
didn't mean to!

Considering that we may still have some commands (like what Fam has
mentioned in block layer) that may need to be run only in main thread,
I think a first attempt may need to have one IO/parser thread (which
parses the monitor input stream and is also responsible for running
out-of-band commands) and no worker thread.  I'll then feed the
dispatching work back to the main thread so that this assumption still
holds.
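
To make that first attempt a bit more concrete, here is a rough Python
sketch (not QEMU code) of one I/O thread that parses input and runs
out-of-band commands itself, while in-band commands are queued for the
main loop.  The 'exec-oob' key follows the suggestion quoted above;
the class, the OOB_CAPABLE set, the error text, the queue bound and the
COMMAND_DROPPED event name are all made up for illustration.

```python
import json
import queue
import threading

# Hypothetical set of commands flagged as safe for out-of-band execution.
OOB_CAPABLE = {"migrate-incoming"}


class MonitorSketch:
    """Toy model of one monitor: an I/O-thread side and a main-loop side."""

    def __init__(self, handlers, max_pending=8):
        self.handlers = handlers                 # command name -> callable(arguments)
        self.pending = queue.Queue(max_pending)  # bounded in-band command queue
        self.sent = []                           # stands in for the output buffer
        self.out_lock = threading.Lock()         # guards the output buffer

    def _send(self, message):
        with self.out_lock:
            self.sent.append(message)

    def _run(self, name, req):
        result = self.handlers[name](req.get("arguments", {}))
        self._send({"return": result, "id": req.get("id")})

    def io_thread_feed(self, line):
        """Called from the I/O thread: parse one request and route it."""
        req = json.loads(line)
        if "exec-oob" in req:
            name = req["exec-oob"]
            if name in OOB_CAPABLE:
                # Executed right here, so the reply may overtake in-band replies.
                self._run(name, req)
            else:
                self._send({"error": {"class": "GenericError",
                                      "desc": "command is not OOB-capable"},
                            "id": req.get("id")})
            return
        try:
            self.pending.put_nowait(req)
        except queue.Full:
            # Queue full: tell the client right away with an event (assumed name).
            self._send({"event": "COMMAND_DROPPED",
                        "data": {"id": req.get("id")}})

    def main_loop_iteration(self):
        """Called from the main thread: execute one queued in-band command."""
        try:
            req = self.pending.get_nowait()
        except queue.Empty:
            return False
        self._run(req["execute"], req)
        return True
```

The in-band commands still run in the main thread in this sketch, so
code that assumes the main thread keeps working; only the parsing and
the out-of-band commands move out.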

Dan, do you think this will work from libvirt POV?  I won't try to
prototype anything without your confirmation as well.

Thanks!

> 
> >>    You asked for a solution to the monitor availability problem that
> >>    doesn't require you to solve the jobs problem first.  Well, here's my
> >>    best try.  Go shoot some holes into it :)
> >
> > Hopefully we're running out of holes.
> 
> Thanks!

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-11 10:32                                   ` Peter Xu
@ 2017-09-11 10:36                                     ` Peter Xu
  0 siblings, 0 replies; 104+ messages in thread
From: Peter Xu @ 2017-09-11 10:36 UTC (permalink / raw)
  To: Markus Armbruster, Daniel P. Berrange
  Cc: Dr. David Alan Gilbert, Laurent Vivier, Fam Zheng, Juan Quintela,
	qemu-devel, mdroth, Paolo Bonzini, John Snow

On Mon, Sep 11, 2017 at 06:32:03PM +0800, Peter Xu wrote:

[...]

> I think this OOB solution should work for us, though I'm still trying
> to digest this whole thing.  Thanks Markus for this design, much
> appreciated.  Meanwhile, sorry to have troubled you on this. I really
> didn't mean to!
> 
> Considering that we may still have some commands (like what Fam has
> mentioned in block layer) that may need to be run only in main thread,
> I think a first attempt may need to have one IO/parser thread (which
> parses the monitor input stream and is also responsible for running
> out-of-band commands) and no worker thread.  I'll then feed the
> dispatching work back to the main thread so that this assumption still
> holds.
> 
> Dan, do you think this will work from libvirt POV?  I won't try to
> prototype anything without your confirmation as well.
> 
> Thanks!

CC Daniel.

-- 
Peter Xu


* Re: [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread
  2017-09-08 11:49                                 ` Markus Armbruster
  2017-09-08 13:19                                   ` Stefan Hajnoczi
  2017-09-11 10:32                                   ` Peter Xu
@ 2017-09-11 10:43                                   ` Daniel P. Berrange
  2 siblings, 0 replies; 104+ messages in thread
From: Daniel P. Berrange @ 2017-09-11 10:43 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Dr. David Alan Gilbert, Laurent Vivier, Fam Zheng, Juan Quintela,
	mdroth, Peter Xu, qemu-devel, Paolo Bonzini, John Snow

On Fri, Sep 08, 2017 at 01:49:41PM +0200, Markus Armbruster wrote:
> > I also see the other problem as keeping the management level
> > understanding of which commands are asynchronous; Dan's suggestion is
> > that command where the management layer specifies which commands it
> > expects to be asynchronous, and qemu responds with which ones actually
> > are.
> 
> "Command supports out-of-band dispatch" would be visible in
> query-qmp-schema.
> 
> Design alternative: either switching on out-of-band mode (see 6.)
> switches all out-of-band commands to out-of-band dispatch, or it
> doesn't, and the client has to request out-of-band dispatch explicitly.
> The explicit request could either be per execute (say send {'exec-oob':
> COMMAND-NAME ...} instead of {'execute': COMMAND-NAME...}), or per
> session, i.e. with a new command to enable oob dispatch for a list of
> oob-capable commands.
> 
> I figure explicit is safer, because it lets us make more commands
> oob-capable without upsetting existing oob-aware QMP clients.

Yep, this is fine too - it achieves the same end goals as the approach
I suggest. Namely

 - clients can detect which commands can do OOB (via the schema)
 - clients can choose which commands to run OOB (via exec vs exec-oob)
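
As a rough client-side illustration of those two points, the Python
sketch below picks out OOB-capable commands from a query-qmp-schema
style list and then builds either an 'execute' or an 'exec-oob'
request.  The "allow-oob" flag name and both helper functions are
assumptions for the example; the thread has not settled how the flag
would actually appear in the schema.

```python
import json


def oob_capable(schema):
    """Return the names of commands the schema marks as OOB-capable.

    `schema` is a list of dicts shaped like query-qmp-schema output;
    the "allow-oob" key is a hypothetical flag name.
    """
    return {entry["name"] for entry in schema
            if entry.get("meta-type") == "command" and entry.get("allow-oob")}


def build_request(name, arguments, req_id, capable, want_oob=False):
    """Build one QMP request, using 'exec-oob' only if the command allows it."""
    key = "exec-oob" if want_oob and name in capable else "execute"
    request = {key: name, "id": req_id}
    if arguments:
        request["arguments"] = arguments
    return json.dumps(request)
```

A client that asks for OOB on a command the schema does not flag simply
falls back to plain 'execute' here, which is one way to keep older
QEMU binaries working.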

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Thread overview: 104+ messages
2017-08-23  6:51 [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Peter Xu
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 1/8] monitor: move skip_flush into monitor_data_init Peter Xu
2017-08-23 16:31   ` Dr. David Alan Gilbert
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 2/8] monitor: allow monitor to create thread to poll Peter Xu
2017-08-23 17:35   ` Dr. David Alan Gilbert
2017-08-25  4:25     ` Peter Xu
2017-08-25  9:30       ` Dr. David Alan Gilbert
2017-08-28  5:53         ` Peter Xu
2017-09-08 17:29           ` Dr. David Alan Gilbert
2017-08-25 15:27   ` Marc-André Lureau
2017-08-25 15:33     ` Dr. David Alan Gilbert
2017-08-25 16:07       ` Marc-André Lureau
2017-08-25 16:12         ` Dr. David Alan Gilbert
2017-08-25 16:21           ` Marc-André Lureau
2017-08-25 16:29             ` Dr. David Alan Gilbert
2017-08-26  8:33               ` Marc-André Lureau
2017-08-28  3:05         ` Peter Xu
2017-08-28 10:11           ` Marc-André Lureau
2017-08-28 12:48             ` Peter Xu
2017-09-05 18:58               ` Dr. David Alan Gilbert
2017-08-28 11:08         ` Markus Armbruster
2017-08-28 12:28           ` Marc-André Lureau
2017-08-28 16:24             ` Markus Armbruster
2017-08-28 17:24               ` Marc-André Lureau
2017-08-29  6:27                 ` Markus Armbruster
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 3/8] char-io: fix possible risk on IOWatchPoll Peter Xu
2017-08-25 14:44   ` Marc-André Lureau
2017-08-26  7:19   ` Fam Zheng
2017-08-28  5:56     ` Peter Xu
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 4/8] QAPI: new QMP command option "without-bql" Peter Xu
2017-08-23 17:44   ` Dr. David Alan Gilbert
2017-08-23 23:37     ` Fam Zheng
2017-08-25  5:37       ` Peter Xu
2017-08-25  9:14         ` Dr. David Alan Gilbert
2017-08-28  8:08           ` Peter Xu
2017-09-08 17:38             ` Dr. David Alan Gilbert
2017-08-25  5:35     ` Peter Xu
2017-08-25  9:06       ` Dr. David Alan Gilbert
2017-08-28  8:26         ` Peter Xu
2017-09-08 17:52           ` Dr. David Alan Gilbert
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 5/8] hmp: support "without_bql" Peter Xu
2017-08-23 17:46   ` Dr. David Alan Gilbert
2017-08-25  5:44     ` Peter Xu
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 6/8] migration: qmp: migrate_incoming don't need BQL Peter Xu
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 7/8] migration: hmp: " Peter Xu
2017-08-23  6:51 ` [Qemu-devel] [RFC v2 8/8] migration: add incoming mgmt lock Peter Xu
2017-08-23 18:01   ` Dr. David Alan Gilbert
2017-08-25  5:49     ` Peter Xu
2017-08-25  9:34       ` Dr. David Alan Gilbert
2017-08-28  8:39         ` Peter Xu
2017-08-29 11:03 ` [Qemu-devel] [RFC v2 0/8] monitor: allow per-monitor thread Daniel P. Berrange
2017-08-30  7:06   ` Markus Armbruster
2017-08-30 10:13     ` Daniel P. Berrange
2017-08-31  3:31       ` Peter Xu
2017-08-31  9:14         ` Daniel P. Berrange
2017-09-06  9:48   ` Dr. David Alan Gilbert
2017-09-06 10:46     ` Daniel P. Berrange
2017-09-06 10:48       ` Dr. David Alan Gilbert
2017-09-06 10:54         ` Daniel P. Berrange
2017-09-06 10:57           ` Dr. David Alan Gilbert
2017-09-06 11:06             ` Daniel P. Berrange
2017-09-06 11:31               ` Dr. David Alan Gilbert
2017-09-06 11:54                 ` Daniel P. Berrange
2017-09-07  8:13                   ` Peter Xu
2017-09-07  8:49                     ` Stefan Hajnoczi
2017-09-07  9:18                       ` Dr. David Alan Gilbert
2017-09-07 10:19                         ` Stefan Hajnoczi
2017-09-07 10:24                         ` Peter Xu
2017-09-07  8:55                     ` Daniel P. Berrange
2017-09-07  9:19                       ` Dr. David Alan Gilbert
2017-09-07  9:22                         ` Daniel P. Berrange
2017-09-07  9:27                           ` Dr. David Alan Gilbert
2017-09-07 11:19                         ` Markus Armbruster
2017-09-07 11:31                           ` Dr. David Alan Gilbert
2017-09-07  9:15                     ` Dr. David Alan Gilbert
2017-09-07  9:25                       ` Daniel P. Berrange
2017-09-07 12:59                     ` Markus Armbruster
2017-09-07 13:22                       ` Daniel P. Berrange
2017-09-07 17:41                         ` Markus Armbruster
2017-09-07 18:09                           ` Dr. David Alan Gilbert
2017-09-08  8:41                             ` Markus Armbruster
2017-09-08  9:32                               ` Dr. David Alan Gilbert
2017-09-08 11:49                                 ` Markus Armbruster
2017-09-08 13:19                                   ` Stefan Hajnoczi
2017-09-11 10:32                                   ` Peter Xu
2017-09-11 10:36                                     ` Peter Xu
2017-09-11 10:43                                   ` Daniel P. Berrange
2017-09-08  9:27                           ` Daniel P. Berrange
2017-09-07 14:20                       ` Dr. David Alan Gilbert
2017-09-07 17:41                         ` Markus Armbruster
2017-09-07 18:04                           ` Dr. David Alan Gilbert
2017-09-07 10:04                   ` Dr. David Alan Gilbert
2017-09-07 10:08                     ` Daniel P. Berrange
2017-09-07 13:59                 ` Eric Blake
2017-09-06 14:50 ` Stefan Hajnoczi
2017-09-06 15:14   ` Dr. David Alan Gilbert
2017-09-07  7:38     ` Peter Xu
2017-09-07  8:58     ` Stefan Hajnoczi
2017-09-07  9:35       ` Dr. David Alan Gilbert
2017-09-07 10:09         ` Stefan Hajnoczi
2017-09-07 12:02           ` Peter Xu
2017-09-07 16:53             ` Stefan Hajnoczi
2017-09-07 17:14               ` Dr. David Alan Gilbert
2017-09-07 17:35                 ` Stefan Hajnoczi
