* [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O
@ 2011-01-22  9:29 Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library Stefan Hajnoczi
                   ` (13 more replies)
  0 siblings, 14 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori

This patch series prototypes making QCOW2 fully asynchronous to eliminate the
timing jitter and poor performance that has been observed.  QCOW2 has
asynchronous I/O code paths for some common read/write cases, but metadata
access is always synchronous.

One solution is to rewrite QCOW2 to be fully asynchronous by splitting all
functions that perform blocking I/O into a series of callbacks.  Due to the
complexity of QCOW2, this conversion and the maintenance prospects are
unattractive.

This patch series prototypes an alternative solution to make QCOW2
asynchronous.  It introduces coroutines, cooperative userspace threads of
control, so that each QCOW2 request has its own call stack.  To perform I/O,
the coroutine submits an asynchronous I/O request and then yields back to QEMU.
The coroutine stays suspended while the I/O operation is being processed by
lower layers of the stack.  When the asynchronous I/O completes, the coroutine
is resumed.

The upshot of this is that QCOW2 can be implemented in a sequential fashion
without explicit callbacks, while all I/O actually happens asynchronously under
the covers.
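
To make the suspend/resume mechanics concrete, here is a minimal,
self-contained sketch of the same pattern using POSIX ucontext.  All names
(run_demo, fake_read, request) are hypothetical illustration, not code from
this series: request() reads top-to-bottom, yet it is suspended across the
simulated I/O and resumed by its caller, just as a QCOW2 request would be.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical names for illustration; not the API from these patches. */
static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];
static char trace[64];

/* Simulated "submit I/O then yield" step: in the real series the coroutine
 * would fire off an asynchronous request here and only resume when the
 * completion callback re-enters it. */
static void fake_read(void)
{
    strcat(trace, "submit,");
    swapcontext(&co_ctx, &main_ctx);   /* yield until the "I/O" completes */
    strcat(trace, "resume,");
}

static void request(void)
{
    fake_read();                       /* looks sequential... */
    strcat(trace, "done");             /* ...but ran across two resumptions */
}

const char *run_demo(void)
{
    trace[0] = '\0';
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;        /* return here when request() ends */
    makecontext(&co_ctx, request, 0);

    swapcontext(&main_ctx, &co_ctx);   /* enter; coroutine yields in fake_read() */
    strcat(trace, "io-complete,");     /* event loop sees the AIO finish */
    swapcontext(&main_ctx, &co_ctx);   /* resume coroutine to completion */
    return trace;
}
```

The trace shows the control flow interleaving: the coroutine submits, the
caller observes completion, then the coroutine resumes and finishes.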

This prototype implements reads, writes, and flushes.  It should install or
boot VMs successfully.  However, it has the following limitations:

1. QCOW2 requests are serialized because the code is not yet safe for
   concurrent requests.  See the last patch for details.

2. Coroutines are unoptimized.  We should pool coroutines (and their mmapped
   stacks) to avoid the cost of coroutine creation.

3. The qcow2_aio_read_cb() and qcow2_aio_write_cb() functions should be
   refactored into sequential code now that callbacks are no longer needed.

I think this approach can solve the performance and functional problems of the
current QCOW2 implementation.  It does not require invasive changes; much of
QCOW2 works unmodified.

Kevin: Do you like this approach and do you want to develop it further?

Stefan

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-26 15:25   ` Avi Kivity
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 02/12] continuation: Fix container_of() redefinition Stefan Hajnoczi
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Asynchronous image format code is becoming very complex.  Let's try using
coroutines so we can write sequential code without callbacks, while still
switching stacks under the hood.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 Makefile.objs        |    2 +-
 continuation.c       |   87 +++++++++++++++++++++++++++++++++
 continuation.h       |   58 ++++++++++++++++++++++
 coroutine.h          |   72 +++++++++++++++++++++++++++
 coroutine_ucontext.c |  131 ++++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 349 insertions(+), 1 deletions(-)
 create mode 100644 continuation.c
 create mode 100644 continuation.h
 create mode 100644 coroutine.h
 create mode 100644 coroutine_ucontext.c

diff --git a/Makefile.objs b/Makefile.objs
index c3e52c5..ae26396 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,7 +15,7 @@ oslib-obj-$(CONFIG_POSIX) += oslib-posix.o
 
 block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
 block-obj-y += nbd.o block.o aio.o aes.o qemu-config.o
-block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
+block-obj-$(CONFIG_POSIX) += posix-aio-compat.o continuation.o coroutine_ucontext.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
diff --git a/continuation.c b/continuation.c
new file mode 100644
index 0000000..f61aeb5
--- /dev/null
+++ b/continuation.c
@@ -0,0 +1,87 @@
+/*
+ * GTK VNC Widget
+ *
+ * Copyright (C) 2006  Anthony Liguori <anthony@codemonkey.ws>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.0 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include "continuation.h"
+
+/*
+ * va_args to makecontext() must be type 'int', so passing
+ * the pointer we need may require several int args. This
+ * union is a quick hack to let us do that
+ */
+union cc_arg {
+	void *p;
+	int i[2];
+};
+
+static void continuation_trampoline(int i0, int i1)
+{
+	union cc_arg arg;
+	struct continuation *cc;
+	arg.i[0] = i0;
+	arg.i[1] = i1;
+	cc = arg.p;
+
+	cc->entry(cc);
+}
+
+int cc_init(struct continuation *cc)
+{
+	volatile union cc_arg arg;
+	arg.p = cc;
+	if (getcontext(&cc->uc) == -1)
+		return -1;
+
+	cc->uc.uc_link = &cc->last;
+	cc->uc.uc_stack.ss_sp = cc->stack;
+	cc->uc.uc_stack.ss_size = cc->stack_size;
+	cc->uc.uc_stack.ss_flags = 0;
+
+	makecontext(&cc->uc, (void *)continuation_trampoline, 2, arg.i[0], arg.i[1]);
+
+	return 0;
+}
+
+int cc_release(struct continuation *cc)
+{
+	if (cc->release)
+		return cc->release(cc);
+
+	return 0;
+}
+
+int cc_swap(struct continuation *from, struct continuation *to)
+{
+	to->exited = 0;
+	if (getcontext(&to->last) == -1)
+		return -1;
+	else if (to->exited == 0)
+		to->exited = 1;
+	else if (to->exited == 1)
+		return 1;
+
+	return swapcontext(&from->uc, &to->uc);
+}
+/*
+ * Local variables:
+ *  c-indent-level: 8
+ *  c-basic-offset: 8
+ *  tab-width: 8
+ * End:
+ */
diff --git a/continuation.h b/continuation.h
new file mode 100644
index 0000000..585788e
--- /dev/null
+++ b/continuation.h
@@ -0,0 +1,58 @@
+/*
+ * GTK VNC Widget
+ *
+ * Copyright (C) 2006  Anthony Liguori <anthony@codemonkey.ws>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.0 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef _CONTINUATION_H_
+#define _CONTINUATION_H_
+
+#include <ucontext.h>
+
+struct continuation
+{
+	char *stack;
+	size_t stack_size;
+	void (*entry)(struct continuation *cc);
+	int (*release)(struct continuation *cc);
+
+	/* private */
+	ucontext_t uc;
+	ucontext_t last;
+	int exited;
+};
+
+int cc_init(struct continuation *cc);
+
+int cc_release(struct continuation *cc);
+
+/* you can use an uninitialized struct continuation for from if you do not have
+   the current continuation handy. */
+int cc_swap(struct continuation *from, struct continuation *to);
+
+#define offset_of(type, member) ((unsigned long)(&((type *)0)->member))
+#define container_of(obj, type, member) \
+        (type *)(((char *)obj) - offset_of(type, member))
+
+#endif
+/*
+ * Local variables:
+ *  c-indent-level: 8
+ *  c-basic-offset: 8
+ *  tab-width: 8
+ * End:
+ */
diff --git a/coroutine.h b/coroutine.h
new file mode 100644
index 0000000..5316535
--- /dev/null
+++ b/coroutine.h
@@ -0,0 +1,72 @@
+/*
+ * GTK VNC Widget
+ *
+ * Copyright (C) 2006  Anthony Liguori <anthony@codemonkey.ws>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.0 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#ifndef _COROUTINE_H_
+#define _COROUTINE_H_
+
+#define WITH_UCONTEXT 1
+
+#if WITH_UCONTEXT
+#include "continuation.h"
+#else
+#include <glib.h>
+#endif
+
+struct coroutine
+{
+	size_t stack_size;
+	void *(*entry)(void *);
+	int (*release)(struct coroutine *);
+
+	/* read-only */
+	int exited;
+
+	/* private */
+	struct coroutine *caller;
+	void *data;
+
+#if WITH_UCONTEXT
+	struct continuation cc;
+#else
+	GThread *thread;
+	gboolean runnable;
+#endif
+};
+
+int coroutine_init(struct coroutine *co);
+
+int coroutine_release(struct coroutine *co);
+
+void *coroutine_swap(struct coroutine *from, struct coroutine *to, void *arg);
+
+struct coroutine *coroutine_self(void);
+
+void *coroutine_yieldto(struct coroutine *to, void *arg);
+
+void *coroutine_yield(void *arg);
+
+#endif
+/*
+ * Local variables:
+ *  c-indent-level: 8
+ *  c-basic-offset: 8
+ *  tab-width: 8
+ * End:
+ */
diff --git a/coroutine_ucontext.c b/coroutine_ucontext.c
new file mode 100644
index 0000000..25e8687
--- /dev/null
+++ b/coroutine_ucontext.c
@@ -0,0 +1,131 @@
+/*
+ * GTK VNC Widget
+ *
+ * Copyright (C) 2006  Anthony Liguori <anthony@codemonkey.ws>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.0 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301 USA
+ */
+
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "coroutine.h"
+
+int coroutine_release(struct coroutine *co)
+{
+	return cc_release(&co->cc);
+}
+
+static int _coroutine_release(struct continuation *cc)
+{
+	struct coroutine *co = container_of(cc, struct coroutine, cc);
+
+	if (co->release) {
+		int ret = co->release(co);
+		if (ret < 0)
+			return ret;
+	}
+
+	co->caller = NULL;
+
+	return 0;
+}
+
+static void coroutine_trampoline(struct continuation *cc)
+{
+	struct coroutine *co = container_of(cc, struct coroutine, cc);
+	co->data = co->entry(co->data);
+}
+
+int coroutine_init(struct coroutine *co)
+{
+	if (co->stack_size == 0)
+		co->stack_size = 16 << 20;
+
+	co->cc.stack_size = co->stack_size;
+	co->cc.stack = mmap(0, co->stack_size,
+			    PROT_READ | PROT_WRITE,
+			    MAP_PRIVATE | MAP_ANONYMOUS,
+			    -1, 0);
+	if (co->cc.stack == MAP_FAILED)
+		return -1;
+	co->cc.entry = coroutine_trampoline;
+	co->cc.release = _coroutine_release;
+	co->exited = 0;
+
+	return cc_init(&co->cc);
+}
+
+#if 0
+static __thread struct coroutine leader;
+static __thread struct coroutine *current;
+#else
+static struct coroutine leader;
+static struct coroutine *current;
+#endif
+
+struct coroutine *coroutine_self(void)
+{
+	if (current == NULL)
+		current = &leader;
+	return current;
+}
+
+void *coroutine_swap(struct coroutine *from, struct coroutine *to, void *arg)
+{
+	int ret;
+	to->data = arg;
+	current = to;
+	ret = cc_swap(&from->cc, &to->cc);
+	if (ret == 0)
+		return from->data;
+	else if (ret == 1) {
+		coroutine_release(to);
+		current = &leader;
+		to->exited = 1;
+		return to->data;
+	}
+
+	return NULL;
+}
+
+void *coroutine_yieldto(struct coroutine *to, void *arg)
+{
+	if (to->caller) {
+		fprintf(stderr, "Co-routine is re-entering itself\n");
+		abort();
+	}
+	to->caller = coroutine_self();
+	return coroutine_swap(coroutine_self(), to, arg);
+}
+
+void *coroutine_yield(void *arg)
+{
+	struct coroutine *to = coroutine_self()->caller;
+	if (!to) {
+		fprintf(stderr, "Co-routine is yielding to no one\n");
+		abort();
+	}
+	coroutine_self()->caller = NULL;
+	return coroutine_swap(coroutine_self(), to, arg);
+}
+/*
+ * Local variables:
+ *  c-indent-level: 8
+ *  c-basic-offset: 8
+ *  tab-width: 8
+ * End:
+ */
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 02/12] continuation: Fix container_of() redefinition
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 03/12] Make sure to release allocated stack when coroutine is released Stefan Hajnoczi
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

QEMU already has a visible container_of() macro definition.  Guard our
definition to avoid the redefinition error.
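
For reference, the guarded macro performs the classic container_of pointer
arithmetic: given a pointer to a member embedded in a struct, it recovers a
pointer to the enclosing struct.  A tiny sketch (my_container_of and
coroutine_like are hypothetical names; the real macro lives in continuation.h):

```c
#include <assert.h>
#include <stddef.h>

/* Same computation as the continuation.h macro, written with the
 * standard offsetof for clarity. */
#define my_container_of(obj, type, member) \
    ((type *)((char *)(obj) - offsetof(type, member)))

struct coroutine_like {
    int flags;
    struct { int dummy; } cc;   /* embedded member, as in struct coroutine */
};

int demo_container_of(void)
{
    struct coroutine_like co = { .flags = 7 };
    /* Given only &co.cc, recover the parent struct and read its field. */
    struct coroutine_like *back =
        my_container_of(&co.cc, struct coroutine_like, cc);
    return back->flags;
}
```

This is exactly how coroutine_ucontext.c maps a struct continuation pointer
back to its enclosing struct coroutine.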

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 continuation.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/continuation.h b/continuation.h
index 585788e..112df65 100644
--- a/continuation.h
+++ b/continuation.h
@@ -45,8 +45,10 @@ int cc_release(struct continuation *cc);
 int cc_swap(struct continuation *from, struct continuation *to);
 
 #define offset_of(type, member) ((unsigned long)(&((type *)0)->member))
+#ifndef container_of
 #define container_of(obj, type, member) \
         (type *)(((char *)obj) - offset_of(type, member))
+#endif
 
 #endif
 /*
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 03/12] Make sure to release allocated stack when coroutine is released.
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 02/12] continuation: Fix container_of() redefinition Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 04/12] coroutine: Use thread-local leader and current variables Stefan Hajnoczi
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Copied from a fix to the original code by Anthony Liguori
<aliguori@us.ibm.com>.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 coroutine_ucontext.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/coroutine_ucontext.c b/coroutine_ucontext.c
index 25e8687..f76da94 100644
--- a/coroutine_ucontext.c
+++ b/coroutine_ucontext.c
@@ -38,6 +38,7 @@ static int _coroutine_release(struct continuation *cc)
 		if (ret < 0)
 			return ret;
 	}
+	munmap(co->cc.stack, co->cc.stack_size);
 
 	co->caller = NULL;
 
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 04/12] coroutine: Use thread-local leader and current variables
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 03/12] Make sure to release allocated stack when coroutine is released Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines Stefan Hajnoczi
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Each pthread should have its own current coroutine.  This ensures
coroutines are thread-safe.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 coroutine_ucontext.c |    5 -----
 1 files changed, 0 insertions(+), 5 deletions(-)

diff --git a/coroutine_ucontext.c b/coroutine_ucontext.c
index f76da94..289e5bd 100644
--- a/coroutine_ucontext.c
+++ b/coroutine_ucontext.c
@@ -70,13 +70,8 @@ int coroutine_init(struct coroutine *co)
 	return cc_init(&co->cc);
 }
 
-#if 0
 static __thread struct coroutine leader;
 static __thread struct coroutine *current;
-#else
-static struct coroutine leader;
-static struct coroutine *current;
-#endif
 
 struct coroutine *coroutine_self(void)
 {
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 04/12] coroutine: Use thread-local leader and current variables Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-26 15:29   ` Avi Kivity
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 06/12] coroutine: Add qemu_coroutine_self() Stefan Hajnoczi
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Add functions to create coroutines and transfer control into a coroutine
and back out again.
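
The documented contract -- opaque is passed in as the entry argument on first
enter or as the yield's return value, and a value comes back to whoever last
entered -- can be modeled with a toy single-coroutine ucontext implementation.
The co_create/co_enter/co_yield names below are hypothetical stand-ins, not
the patch's code, and this toy omits the pooling, release hooks, and
multi-coroutine support of the real series:

```c
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

typedef void *CoroutineEntry(void *opaque);

typedef struct {
    ucontext_t ctx, caller;
    CoroutineEntry *entry;
    void *value;              /* opaque handed back and forth */
    char stack[64 * 1024];
} Coroutine;

static Coroutine *co_current;

static void trampoline(void)
{
    Coroutine *co = co_current;
    co->value = co->entry(co->value);  /* entry's return goes to last enterer */
}

Coroutine *co_create(CoroutineEntry *entry)
{
    Coroutine *co = calloc(1, sizeof(*co));
    co->entry = entry;
    getcontext(&co->ctx);
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = sizeof(co->stack);
    co->ctx.uc_link = &co->caller;     /* where to go when entry returns */
    makecontext(&co->ctx, trampoline, 0);
    return co;
}

void *co_enter(Coroutine *co, void *opaque)
{
    co->value = opaque;                /* entry arg, or yield's return value */
    co_current = co;
    swapcontext(&co->caller, &co->ctx);
    return co->value;                  /* yielded value, or entry's return */
}

void *co_yield(void *opaque)
{
    Coroutine *co = co_current;
    co->value = opaque;
    swapcontext(&co->ctx, &co->caller);
    return co->value;
}

static void *entry(void *opaque)
{
    /* Receives 10, yields 11; resumed with 21, returns 22. */
    long n = (long)co_yield((void *)((long)opaque + 1));
    return (void *)(n + 1);
}

long demo(void)
{
    Coroutine *co = co_create(entry);
    long a = (long)co_enter(co, (void *)10L);       /* gets the yielded 11 */
    long b = (long)co_enter(co, (void *)(a + 10));  /* gets entry's 22 */
    free(co);
    return b;
}
```

Each value crosses the stack switch exactly once in each direction, which is
the property the qemu_coroutine_enter()/qemu_coroutine_yield() API relies on.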

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 Makefile.objs    |    2 +-
 qemu-coroutine.c |   40 ++++++++++++++++++++++++++++
 qemu-coroutine.h |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 117 insertions(+), 1 deletions(-)
 create mode 100644 qemu-coroutine.c
 create mode 100644 qemu-coroutine.h

diff --git a/Makefile.objs b/Makefile.objs
index ae26396..7933fc9 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,7 +15,7 @@ oslib-obj-$(CONFIG_POSIX) += oslib-posix.o
 
 block-obj-y = cutils.o cache-utils.o qemu-malloc.o qemu-option.o module.o
 block-obj-y += nbd.o block.o aio.o aes.o qemu-config.o
-block-obj-$(CONFIG_POSIX) += posix-aio-compat.o continuation.o coroutine_ucontext.o
+block-obj-$(CONFIG_POSIX) += posix-aio-compat.o continuation.o coroutine_ucontext.o qemu-coroutine.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
new file mode 100644
index 0000000..dd2cd8e
--- /dev/null
+++ b/qemu-coroutine.c
@@ -0,0 +1,40 @@
+/*
+ * QEMU coroutines
+ *
+ * Copyright IBM, Corp. 2011
+ *
+ * Authors:
+ *  Stefan Hajnoczi    <stefanha@linux.vnet.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#include "coroutine.h"
+
+#include "qemu-common.h"
+#include "qemu-coroutine.h"
+
+struct Coroutine {
+    struct coroutine co;
+};
+
+Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
+{
+    Coroutine *coroutine = qemu_mallocz(sizeof(*coroutine));
+
+    coroutine->co.entry = entry;
+    coroutine_init(&coroutine->co);
+    return coroutine;
+}
+
+void *qemu_coroutine_enter(Coroutine *coroutine, void *opaque)
+{
+    return coroutine_yieldto(&coroutine->co, opaque);
+}
+
+void * coroutine_fn qemu_coroutine_yield(void *opaque)
+{
+    return coroutine_yield(opaque);
+}
diff --git a/qemu-coroutine.h b/qemu-coroutine.h
new file mode 100644
index 0000000..22fe4ea
--- /dev/null
+++ b/qemu-coroutine.h
@@ -0,0 +1,76 @@
+/*
+ * QEMU coroutine implementation
+ *
+ * Copyright IBM, Corp. 2011
+ *
+ * Authors:
+ *  Stefan Hajnoczi    <stefanha@linux.vnet.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ *
+ */
+
+#ifndef QEMU_COROUTINE_H
+#define QEMU_COROUTINE_H
+
+/**
+ * Mark a function that executes in coroutine context
+ *
+ * Functions that execute in coroutine context cannot be called directly from
+ * normal functions.  In the future it would be nice to enable compiler or
+ * static checker support for catching such errors.  This annotation might make
+ * it possible and in the meantime it serves as documentation.
+ *
+ * For example:
+ *
+ *   static void coroutine_fn foo(void) {
+ *       ....
+ *   }
+ */
+#define coroutine_fn
+
+typedef struct Coroutine Coroutine;
+
+/**
+ * Coroutine entry point
+ *
+ * When the coroutine is entered for the first time, opaque is passed in as an
+ * argument.
+ *
+ * When this function returns, the coroutine is destroyed automatically and the
+ * return value is passed back to the caller who last entered the coroutine.
+ */
+typedef void * coroutine_fn CoroutineEntry(void *opaque);
+
+/**
+ * Create a new coroutine
+ *
+ * Use qemu_coroutine_enter() to actually transfer control to the coroutine.
+ */
+Coroutine *qemu_coroutine_create(CoroutineEntry *entry);
+
+/**
+ * Transfer control to a coroutine
+ *
+ * The opaque argument is made available to the coroutine either as the entry
+ * function argument if this is the first time a new coroutine is entered, or
+ * as the return value from qemu_coroutine_yield().
+ *
+ * The return value from this function is either an opaque value yielded by the
+ * coroutine or the coroutine entry function return value when the coroutine
+ * terminates.
+ */
+void *qemu_coroutine_enter(Coroutine *coroutine, void *opaque);
+
+/**
+ * Transfer control back to a coroutine's caller
+ *
+ * The opaque argument is returned from the calling qemu_coroutine_enter().
+ *
+ * The return value is the argument passed back in from the next
+ * qemu_coroutine_enter().
+ */
+void * coroutine_fn qemu_coroutine_yield(void *opaque);
+
+#endif /* QEMU_COROUTINE_H */
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 06/12] coroutine: Add qemu_coroutine_self()
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 07/12] coroutine: Add coroutine_is_leader() Stefan Hajnoczi
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Add a function to get the current coroutine.  There is always a current
coroutine, either the "leader" (default main coroutine) or a specific
coroutine created with qemu_coroutine_create().

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 qemu-coroutine.c |    5 +++++
 qemu-coroutine.h |    5 +++++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index dd2cd8e..e55b7c6 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -38,3 +38,8 @@ void * coroutine_fn qemu_coroutine_yield(void *opaque)
 {
     return coroutine_yield(opaque);
 }
+
+Coroutine * coroutine_fn qemu_coroutine_self(void)
+{
+    return (Coroutine*)coroutine_self();
+}
diff --git a/qemu-coroutine.h b/qemu-coroutine.h
index 22fe4ea..41f90bb 100644
--- a/qemu-coroutine.h
+++ b/qemu-coroutine.h
@@ -73,4 +73,9 @@ void *qemu_coroutine_enter(Coroutine *coroutine, void *opaque);
  */
 void * coroutine_fn qemu_coroutine_yield(void *opaque);
 
+/**
+ * Get the currently executing coroutine
+ */
+Coroutine * coroutine_fn qemu_coroutine_self(void);
+
 #endif /* QEMU_COROUTINE_H */
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 07/12] coroutine: Add coroutine_is_leader()
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 06/12] coroutine: Add qemu_coroutine_self() Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 08/12] coroutine: Add qemu_in_coroutine() Stefan Hajnoczi
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Make it possible to check whether a coroutine is the default main
coroutine (the "leader") or not.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 coroutine.h          |    2 ++
 coroutine_ucontext.c |    5 +++++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/coroutine.h b/coroutine.h
index 5316535..3aca19a 100644
--- a/coroutine.h
+++ b/coroutine.h
@@ -58,6 +58,8 @@ void *coroutine_swap(struct coroutine *from, struct coroutine *to, void *arg);
 
 struct coroutine *coroutine_self(void);
 
+int coroutine_is_leader(struct coroutine *co);
+
 void *coroutine_yieldto(struct coroutine *to, void *arg);
 
 void *coroutine_yield(void *arg);
diff --git a/coroutine_ucontext.c b/coroutine_ucontext.c
index 289e5bd..b90a2f6 100644
--- a/coroutine_ucontext.c
+++ b/coroutine_ucontext.c
@@ -80,6 +80,11 @@ struct coroutine *coroutine_self(void)
 	return current;
 }
 
+int coroutine_is_leader(struct coroutine *co)
+{
+	return co == &leader;
+}
+
 void *coroutine_swap(struct coroutine *from, struct coroutine *to, void *arg)
 {
 	int ret;
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 08/12] coroutine: Add qemu_in_coroutine()
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 07/12] coroutine: Add coroutine_is_leader() Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 09/12] block: Add bdrv_co_readv() and bdrv_co_writev() Stefan Hajnoczi
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

The qemu_in_coroutine() function checks whether or not we are currently
in a user coroutine.  This can be used to distinguish normal vcpu or
iothread code from code executing in a coroutine.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 qemu-coroutine.c |    5 +++++
 qemu-coroutine.h |    7 +++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index e55b7c6..9318600 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -43,3 +43,8 @@ Coroutine * coroutine_fn qemu_coroutine_self(void)
 {
     return (Coroutine*)coroutine_self();
 }
+
+bool qemu_in_coroutine(void)
+{
+    return !coroutine_is_leader(coroutine_self());
+}
diff --git a/qemu-coroutine.h b/qemu-coroutine.h
index 41f90bb..1fad3bf 100644
--- a/qemu-coroutine.h
+++ b/qemu-coroutine.h
@@ -14,6 +14,8 @@
 #ifndef QEMU_COROUTINE_H
 #define QEMU_COROUTINE_H
 
+#include <stdbool.h>
+
 /**
  * Mark a function that executes in coroutine context
  *
@@ -78,4 +80,9 @@ void * coroutine_fn qemu_coroutine_yield(void *opaque);
  */
 Coroutine * coroutine_fn qemu_coroutine_self(void);
 
+/**
+ * Return whether or not currently inside a coroutine
+ */
+bool qemu_in_coroutine(void);
+
 #endif /* QEMU_COROUTINE_H */
-- 
1.7.2.3


* [Qemu-devel] [RFC][PATCH 09/12] block: Add bdrv_co_readv() and bdrv_co_writev()
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (7 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 08/12] coroutine: Add qemu_in_coroutine() Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 10/12] block: Add coroutine support to synchronous I/O functions Stefan Hajnoczi
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Add coroutine versions of the bdrv_aio_readv() and bdrv_aio_writev()
functions.  The coroutine versions don't take a callback and simply
return the error value.  Behind the scenes they are implemented using
bdrv_aio_readv() and bdrv_aio_writev().

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 block.h |    7 +++++++
 2 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index ff2795b..ecb6538 100644
--- a/block.c
+++ b/block.c
@@ -2614,6 +2614,57 @@ void qemu_aio_release(void *p)
 }
 
 /**************************************************************/
+/* I/O for coroutines */
+
+typedef struct CoroutineIOCompletion {
+    Coroutine *coroutine;
+    int ret;
+} CoroutineIOCompletion;
+
+static void bdrv_co_complete(void *opaque, int ret)
+{
+    CoroutineIOCompletion *co = opaque;
+
+    co->ret = ret;
+    qemu_coroutine_enter(co->coroutine, NULL);
+}
+
+static int coroutine_fn bdrv_co_io(BlockDriverState *bs, int64_t sector_num,
+                                   QEMUIOVector *iov, int nb_sectors,
+                                   int is_write)
+{
+    CoroutineIOCompletion co = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    BlockDriverAIOCB *acb;
+
+    if (is_write) {
+        acb = bdrv_aio_writev(bs, sector_num, iov, nb_sectors,
+                              bdrv_co_complete, &co);
+    } else {
+        acb = bdrv_aio_readv(bs, sector_num, iov, nb_sectors,
+                             bdrv_co_complete, &co);
+    }
+    if (!acb) {
+        return -EIO;
+    }
+    qemu_coroutine_yield(NULL);
+    return co.ret;
+}
+
+int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
+                               QEMUIOVector *iov, int nb_sectors)
+{
+    return bdrv_co_io(bs, sector_num, iov, nb_sectors, 0);
+}
+
+int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
+                                QEMUIOVector *iov, int nb_sectors)
+{
+    return bdrv_co_io(bs, sector_num, iov, nb_sectors, 1);
+}
+
+/**************************************************************/
 /* removable device support */
 
 /**
diff --git a/block.h b/block.h
index f923add..472e3d4 100644
--- a/block.h
+++ b/block.h
@@ -4,6 +4,7 @@
 #include "qemu-aio.h"
 #include "qemu-common.h"
 #include "qemu-option.h"
+#include "qemu-coroutine.h"
 #include "qobject.h"
 
 /* block.c */
@@ -120,6 +121,12 @@ BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
 				 BlockDriverCompletionFunc *cb, void *opaque);
 void bdrv_aio_cancel(BlockDriverAIOCB *acb);
 
+/* block I/O for coroutines */
+int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
+                               QEMUIOVector *iov, int nb_sectors);
+int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
+                                QEMUIOVector *iov, int nb_sectors);
+
 typedef struct BlockRequest {
     /* Fields to be filled by multiwrite caller */
     int64_t sector;
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [Qemu-devel] [RFC][PATCH 10/12] block: Add coroutine support to synchronous I/O functions
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (8 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 09/12] block: Add bdrv_co_readv() and bdrv_co_writev() Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Implement transparent support for the following synchronous I/O
functions to be called from coroutines:

bdrv_read()
bdrv_write()
bdrv_pread()
bdrv_pwrite()
bdrv_pwrite_sync()
bdrv_write_sync()
bdrv_flush()

These functions will use asynchronous I/O behind the scenes but
otherwise behave the same as the truly synchronous implementations.

This change allows synchronous I/O code in QEMU to just work in a
coroutine.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 block.h |    1 +
 2 files changed, 43 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index ecb6538..f955ded 100644
--- a/block.c
+++ b/block.c
@@ -924,6 +924,17 @@ int bdrv_read(BlockDriverState *bs, int64_t sector_num,
 {
     BlockDriver *drv = bs->drv;
 
+    if (qemu_in_coroutine()) {
+        QEMUIOVector qiov;
+        struct iovec iov = {
+            .iov_base = buf,
+            .iov_len = nb_sectors * BDRV_SECTOR_SIZE,
+        };
+
+        qemu_iovec_init_external(&qiov, &iov, 1);
+        return bdrv_co_readv(bs, sector_num, &qiov, nb_sectors);
+    }
+
     if (!drv)
         return -ENOMEDIUM;
     if (bdrv_check_request(bs, sector_num, nb_sectors))
@@ -970,6 +981,18 @@ int bdrv_write(BlockDriverState *bs, int64_t sector_num,
                const uint8_t *buf, int nb_sectors)
 {
     BlockDriver *drv = bs->drv;
+
+    if (qemu_in_coroutine()) {
+        QEMUIOVector qiov;
+        struct iovec iov = {
+            .iov_base = (void *)buf,
+            .iov_len = nb_sectors * BDRV_SECTOR_SIZE,
+        };
+
+        qemu_iovec_init_external(&qiov, &iov, 1);
+        return bdrv_co_writev(bs, sector_num, &qiov, nb_sectors);
+    }
+
     if (!bs->drv)
         return -ENOMEDIUM;
     if (bs->read_only)
@@ -1471,6 +1494,10 @@ const char *bdrv_get_device_name(BlockDriverState *bs)
 
 int bdrv_flush(BlockDriverState *bs)
 {
+    if (qemu_in_coroutine()) {
+        return bdrv_co_flush(bs);
+    }
+
     if (bs->open_flags & BDRV_O_NO_FLUSH) {
         return 0;
     }
@@ -2664,6 +2691,21 @@ int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
     return bdrv_co_io(bs, sector_num, iov, nb_sectors, 1);
 }
 
+int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
+{
+    CoroutineIOCompletion co = {
+        .coroutine = qemu_coroutine_self(),
+    };
+    BlockDriverAIOCB *acb;
+
+    acb = bdrv_aio_flush(bs, bdrv_co_complete, &co);
+    if (!acb) {
+        return -EIO;
+    }
+    qemu_coroutine_yield(NULL);
+    return co.ret;
+}
+
 /**************************************************************/
 /* removable device support */
 
diff --git a/block.h b/block.h
index 472e3d4..bced0c5 100644
--- a/block.h
+++ b/block.h
@@ -126,6 +126,7 @@ int coroutine_fn bdrv_co_readv(BlockDriverState *bs, int64_t sector_num,
                                QEMUIOVector *iov, int nb_sectors);
 int coroutine_fn bdrv_co_writev(BlockDriverState *bs, int64_t sector_num,
                                 QEMUIOVector *iov, int nb_sectors);
+int coroutine_fn bdrv_co_flush(BlockDriverState *bs);
 
 typedef struct BlockRequest {
     /* Fields to be filled by multiwrite caller */
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (9 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 10/12] block: Add coroutine support to synchronous I/O functions Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-23 23:40   ` Anthony Liguori
  2011-01-26 15:40   ` Avi Kivity
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 12/12] qcow2: Serialize all requests Stefan Hajnoczi
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

Converting qcow2 to use coroutines is fairly simple since most of qcow2
is synchronous.  The synchronous I/O functions like bdrv_pread() now
transparently work when called from a coroutine, so all the synchronous
code just works.

The explicitly asynchronous code is adjusted to repeatedly call
qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
At that point the coroutine will return from its entry function and its
resources are freed.

The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
from a BH.  This is necessary since the user callback code does not
expect to be executed from a coroutine.

This conversion is not completely correct because the safety of the
synchronous code does not carry over to the coroutine version.
Previously, a synchronous code path could assume that it will never be
interleaved with another request executing.  This is no longer true
because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
other requests can be processed during that time.

The solution is to carefully introduce checks so that pending requests
do not step on each other's toes.  That is left for a future patch...

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qcow2.c |  160 ++++++++++++++++++++++++++++++---------------------------
 1 files changed, 85 insertions(+), 75 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index b6b094c..4b33ef3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -361,19 +361,20 @@ typedef struct QCowAIOCB {
     uint64_t bytes_done;
     uint64_t cluster_offset;
     uint8_t *cluster_data;
-    BlockDriverAIOCB *hd_aiocb;
     QEMUIOVector hd_qiov;
     QEMUBH *bh;
     QCowL2Meta l2meta;
     QLIST_ENTRY(QCowAIOCB) next_depend;
+    Coroutine *coroutine;
+    int ret; /* return code for user callback */
 } QCowAIOCB;
 
 static void qcow2_aio_cancel(BlockDriverAIOCB *blockacb)
 {
     QCowAIOCB *acb = container_of(blockacb, QCowAIOCB, common);
-    if (acb->hd_aiocb)
-        bdrv_aio_cancel(acb->hd_aiocb);
     qemu_aio_release(acb);
+    /* XXX This function looks broken, we could be in the middle of a request
+     * and releasing the acb is not a good idea */
 }
 
 static AIOPool qcow2_aio_pool = {
@@ -381,13 +382,14 @@ static AIOPool qcow2_aio_pool = {
     .cancel             = qcow2_aio_cancel,
 };
 
-static void qcow2_aio_read_cb(void *opaque, int ret);
-static void qcow2_aio_read_bh(void *opaque)
+static void qcow2_aio_bh(void *opaque)
 {
     QCowAIOCB *acb = opaque;
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
-    qcow2_aio_read_cb(opaque, 0);
+    acb->common.cb(acb->common.opaque, acb->ret);
+    qemu_iovec_destroy(&acb->hd_qiov);
+    qemu_aio_release(acb);
 }
 
 static int qcow2_schedule_bh(QEMUBHFunc *cb, QCowAIOCB *acb)
@@ -404,14 +406,13 @@ static int qcow2_schedule_bh(QEMUBHFunc *cb, QCowAIOCB *acb)
     return 0;
 }
 
-static void qcow2_aio_read_cb(void *opaque, int ret)
+static int coroutine_fn qcow2_aio_read_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
     BlockDriverState *bs = acb->common.bs;
     BDRVQcowState *s = bs->opaque;
     int index_in_cluster, n1;
 
-    acb->hd_aiocb = NULL;
     if (ret < 0)
         goto done;
 
@@ -469,22 +470,13 @@ static void qcow2_aio_read_cb(void *opaque, int ret)
                 acb->sector_num, acb->cur_nr_sectors);
             if (n1 > 0) {
                 BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
-                acb->hd_aiocb = bdrv_aio_readv(bs->backing_hd, acb->sector_num,
-                                    &acb->hd_qiov, acb->cur_nr_sectors,
-				    qcow2_aio_read_cb, acb);
-                if (acb->hd_aiocb == NULL)
-                    goto done;
-            } else {
-                ret = qcow2_schedule_bh(qcow2_aio_read_bh, acb);
-                if (ret < 0)
+                ret = bdrv_co_readv(bs->backing_hd, acb->sector_num, &acb->hd_qiov, acb->cur_nr_sectors);
+                if (ret < 0) {
                     goto done;
+                }
             }
         } else {
-            /* Note: in this case, no need to wait */
             qemu_iovec_memset(&acb->hd_qiov, 0, 512 * acb->cur_nr_sectors);
-            ret = qcow2_schedule_bh(qcow2_aio_read_bh, acb);
-            if (ret < 0)
-                goto done;
         }
     } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
         /* add AIO support for compressed blocks ? */
@@ -494,10 +486,6 @@ static void qcow2_aio_read_cb(void *opaque, int ret)
         qemu_iovec_from_buffer(&acb->hd_qiov,
             s->cluster_cache + index_in_cluster * 512,
             512 * acb->cur_nr_sectors);
-
-        ret = qcow2_schedule_bh(qcow2_aio_read_bh, acb);
-        if (ret < 0)
-            goto done;
     } else {
         if ((acb->cluster_offset & 511) != 0) {
             ret = -EIO;
@@ -522,34 +510,50 @@ static void qcow2_aio_read_cb(void *opaque, int ret)
         }
 
         BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
-        acb->hd_aiocb = bdrv_aio_readv(bs->file,
+        ret = bdrv_co_readv(bs->file,
                             (acb->cluster_offset >> 9) + index_in_cluster,
-                            &acb->hd_qiov, acb->cur_nr_sectors,
-                            qcow2_aio_read_cb, acb);
-        if (acb->hd_aiocb == NULL) {
-            ret = -EIO;
+                            &acb->hd_qiov, acb->cur_nr_sectors);
+        if (ret < 0) {
             goto done;
         }
     }
 
-    return;
+    return 1;
 done:
-    acb->common.cb(acb->common.opaque, ret);
-    qemu_iovec_destroy(&acb->hd_qiov);
-    qemu_aio_release(acb);
+    acb->ret = ret;
+    qcow2_schedule_bh(qcow2_aio_bh, acb);
+    return 0;
+}
+
+static void * coroutine_fn qcow2_co_read(void *opaque)
+{
+    QCowAIOCB *acb = opaque;
+
+    while (qcow2_aio_read_cb(acb, 0)) {
+    }
+    return NULL;
+}
+
+static int coroutine_fn qcow2_aio_write_cb(void *opaque, int ret);
+static void * coroutine_fn qcow2_co_write(void *opaque)
+{
+    QCowAIOCB *acb = opaque;
+
+    while (qcow2_aio_write_cb(acb, 0)) {
+    }
+    return NULL;
 }
 
-static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, int64_t sector_num,
-                                  QEMUIOVector *qiov, int nb_sectors,
-                                  BlockDriverCompletionFunc *cb,
-                                  void *opaque, int is_write)
+static BlockDriverAIOCB *qcow2_aio_setup(BlockDriverState *bs,
+        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
+        BlockDriverCompletionFunc *cb, void *opaque, int is_write)
 {
     QCowAIOCB *acb;
+    Coroutine *coroutine;
 
     acb = qemu_aio_get(&qcow2_aio_pool, bs, cb, opaque);
     if (!acb)
         return NULL;
-    acb->hd_aiocb = NULL;
     acb->sector_num = sector_num;
     acb->qiov = qiov;
 
@@ -561,7 +565,12 @@ static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, int64_t sector_num,
     acb->cluster_offset = 0;
     acb->l2meta.nb_clusters = 0;
     QLIST_INIT(&acb->l2meta.dependent_requests);
-    return acb;
+
+    coroutine = qemu_coroutine_create(is_write ? qcow2_co_write
+                                               : qcow2_co_read);
+    acb->coroutine = coroutine;
+    qemu_coroutine_enter(coroutine, acb);
+    return &acb->common;
 }
 
 static BlockDriverAIOCB *qcow2_aio_readv(BlockDriverState *bs,
@@ -570,38 +579,48 @@ static BlockDriverAIOCB *qcow2_aio_readv(BlockDriverState *bs,
                                          BlockDriverCompletionFunc *cb,
                                          void *opaque)
 {
-    QCowAIOCB *acb;
-
-    acb = qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
-    if (!acb)
-        return NULL;
-
-    qcow2_aio_read_cb(acb, 0);
-    return &acb->common;
+    return qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
 }
 
-static void qcow2_aio_write_cb(void *opaque, int ret);
-
-static void run_dependent_requests(QCowL2Meta *m)
+static void qcow2_co_run_dependent_requests(void *opaque)
 {
+    QCowAIOCB *acb = opaque;
     QCowAIOCB *req;
     QCowAIOCB *next;
 
+    qemu_bh_delete(acb->bh);
+    acb->bh = NULL;
+
+    /* Restart all dependent requests */
+    QLIST_FOREACH_SAFE(req, &acb->l2meta.dependent_requests, next_depend, next) {
+        qemu_coroutine_enter(req->coroutine, NULL);
+    }
+
+    /* Reenter the original request */
+    qemu_coroutine_enter(acb->coroutine, NULL);
+}
+
+static void run_dependent_requests(QCowL2Meta *m)
+{
     /* Take the request off the list of running requests */
     if (m->nb_clusters != 0) {
         QLIST_REMOVE(m, next_in_flight);
     }
 
-    /* Restart all dependent requests */
-    QLIST_FOREACH_SAFE(req, &m->dependent_requests, next_depend, next) {
-        qcow2_aio_write_cb(req, 0);
+    if (!QLIST_EMPTY(&m->dependent_requests)) {
+        /* TODO This is a hack to get at the acb, may not be correct if called
+         * with a QCowL2Meta that is not part of a QCowAIOCB.
+         */
+        QCowAIOCB *acb = container_of(m, QCowAIOCB, l2meta);
+        qcow2_schedule_bh(qcow2_co_run_dependent_requests, acb);
+        qemu_coroutine_yield(NULL);
     }
 
     /* Empty the list for the next part of the request */
     QLIST_INIT(&m->dependent_requests);
 }
 
-static void qcow2_aio_write_cb(void *opaque, int ret)
+static int coroutine_fn qcow2_aio_write_cb(void *opaque, int ret)
 {
     QCowAIOCB *acb = opaque;
     BlockDriverState *bs = acb->common.bs;
@@ -609,8 +628,6 @@ static void qcow2_aio_write_cb(void *opaque, int ret)
     int index_in_cluster;
     int n_end;
 
-    acb->hd_aiocb = NULL;
-
     if (ret >= 0) {
         ret = qcow2_alloc_cluster_link_l2(bs, &acb->l2meta);
     }
@@ -648,7 +665,8 @@ static void qcow2_aio_write_cb(void *opaque, int ret)
     if (acb->l2meta.nb_clusters == 0 && acb->l2meta.depends_on != NULL) {
         QLIST_INSERT_HEAD(&acb->l2meta.depends_on->dependent_requests,
             acb, next_depend);
-        return;
+        qemu_coroutine_yield(NULL);
+        return 1;
     }
 
     assert((acb->cluster_offset & 511) == 0);
@@ -675,25 +693,22 @@ static void qcow2_aio_write_cb(void *opaque, int ret)
     }
 
     BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
-    acb->hd_aiocb = bdrv_aio_writev(bs->file,
-                                    (acb->cluster_offset >> 9) + index_in_cluster,
-                                    &acb->hd_qiov, acb->cur_nr_sectors,
-                                    qcow2_aio_write_cb, acb);
-    if (acb->hd_aiocb == NULL) {
-        ret = -EIO;
+    ret = bdrv_co_writev(bs->file,
+                         (acb->cluster_offset >> 9) + index_in_cluster,
+                         &acb->hd_qiov, acb->cur_nr_sectors);
+    if (ret < 0) {
         goto fail;
     }
-
-    return;
+    return 1;
 
 fail:
     if (acb->l2meta.nb_clusters != 0) {
         QLIST_REMOVE(&acb->l2meta, next_in_flight);
     }
 done:
-    acb->common.cb(acb->common.opaque, ret);
-    qemu_iovec_destroy(&acb->hd_qiov);
-    qemu_aio_release(acb);
+    acb->ret = ret;
+    qcow2_schedule_bh(qcow2_aio_bh, acb);
+    return 0;
 }
 
 static BlockDriverAIOCB *qcow2_aio_writev(BlockDriverState *bs,
@@ -703,16 +718,10 @@ static BlockDriverAIOCB *qcow2_aio_writev(BlockDriverState *bs,
                                           void *opaque)
 {
     BDRVQcowState *s = bs->opaque;
-    QCowAIOCB *acb;
 
     s->cluster_cache_offset = -1; /* disable compressed cache */
 
-    acb = qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
-    if (!acb)
-        return NULL;
-
-    qcow2_aio_write_cb(acb, 0);
-    return &acb->common;
+    return qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
 }
 
 static void qcow2_close(BlockDriverState *bs)
@@ -824,6 +833,7 @@ static int qcow2_change_backing_file(BlockDriverState *bs,
     return qcow2_update_ext_header(bs, backing_file, backing_fmt);
 }
 
+/* TODO did we break this for coroutines? */
 static int preallocate(BlockDriverState *bs)
 {
     uint64_t nb_sectors;
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [Qemu-devel] [RFC][PATCH 12/12] qcow2: Serialize all requests
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (10 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
@ 2011-01-22  9:29 ` Stefan Hajnoczi
  2011-01-23 23:31 ` [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Anthony Liguori
  2011-01-24 11:58 ` [Qemu-devel] " Kevin Wolf
  13 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-22  9:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi

QCOW2 with coroutines is not safe because synchronous code paths are no
longer guaranteed to execute without interference from pending requests.
A blocking call like bdrv_pread() causes the coroutine to yield and
another request can be processed during that time, causing race
conditions or interference between pending requests.

The simple solution is to serialize all requests.  This is bad for
performance and a fine-grained solution needs to be implemented in
future patches.

Using this patch, QCOW2 with coroutines can reliably install a RHEL6 VM
with a virtio-blk disk.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qcow2.c |   21 ++++++++++++++++++++-
 block/qcow2.h |    5 +++--
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 4b33ef3..0cea0e8 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -231,6 +231,7 @@ static int qcow2_open(BlockDriverState *bs, int flags)
     }
 
     QLIST_INIT(&s->cluster_allocs);
+    QTAILQ_INIT(&s->request_list);
 
     /* read qcow2 extensions */
     if (header.backing_file_offset) {
@@ -365,6 +366,7 @@ typedef struct QCowAIOCB {
     QEMUBH *bh;
     QCowL2Meta l2meta;
     QLIST_ENTRY(QCowAIOCB) next_depend;
+    QTAILQ_ENTRY(QCowAIOCB) next_request;
     Coroutine *coroutine;
     int ret; /* return code for user callback */
 } QCowAIOCB;
@@ -385,11 +387,20 @@ static AIOPool qcow2_aio_pool = {
 static void qcow2_aio_bh(void *opaque)
 {
     QCowAIOCB *acb = opaque;
+    BDRVQcowState *s = acb->common.bs->opaque;
+
     qemu_bh_delete(acb->bh);
     acb->bh = NULL;
     acb->common.cb(acb->common.opaque, acb->ret);
     qemu_iovec_destroy(&acb->hd_qiov);
+    QTAILQ_REMOVE(&s->request_list, acb, next_request);
     qemu_aio_release(acb);
+
+    /* Start next request */
+    if (!QTAILQ_EMPTY(&s->request_list)) {
+        acb = QTAILQ_FIRST(&s->request_list);
+        qemu_coroutine_enter(acb->coroutine, acb);
+    }
 }
 
 static int qcow2_schedule_bh(QEMUBHFunc *cb, QCowAIOCB *acb)
@@ -548,8 +559,10 @@ static BlockDriverAIOCB *qcow2_aio_setup(BlockDriverState *bs,
         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque, int is_write)
 {
+    BDRVQcowState *s = bs->opaque;
     QCowAIOCB *acb;
     Coroutine *coroutine;
+    int start_request;
 
     acb = qemu_aio_get(&qcow2_aio_pool, bs, cb, opaque);
     if (!acb)
@@ -569,7 +582,13 @@ static BlockDriverAIOCB *qcow2_aio_setup(BlockDriverState *bs,
     coroutine = qemu_coroutine_create(is_write ? qcow2_co_write
                                                : qcow2_co_read);
     acb->coroutine = coroutine;
-    qemu_coroutine_enter(coroutine, acb);
+
+    /* Kick off the request if no others are currently executing */
+    start_request = QTAILQ_EMPTY(&s->request_list);
+    QTAILQ_INSERT_TAIL(&s->request_list, acb, next_request);
+    if (start_request) {
+        qemu_coroutine_enter(coroutine, acb);
+    }
     return &acb->common;
 }
 
diff --git a/block/qcow2.h b/block/qcow2.h
index 5217bea..159f86b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -78,6 +78,8 @@ typedef struct QCowSnapshot {
     uint64_t vm_clock_nsec;
 } QCowSnapshot;
 
+struct QCowAIOCB;
+
 typedef struct BDRVQcowState {
     int cluster_bits;
     int cluster_size;
@@ -98,6 +100,7 @@ typedef struct BDRVQcowState {
     uint8_t *cluster_data;
     uint64_t cluster_cache_offset;
     QLIST_HEAD(QCowClusterAlloc, QCowL2Meta) cluster_allocs;
+    QTAILQ_HEAD(, QCowAIOCB) request_list;
 
     uint64_t *refcount_table;
     uint64_t refcount_table_offset;
@@ -128,8 +131,6 @@ typedef struct QCowCreateState {
     int64_t refcount_block_offset;
 } QCowCreateState;
 
-struct QCowAIOCB;
-
 /* XXX This could be private for qcow2-cluster.c */
 typedef struct QCowL2Meta
 {
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (11 preceding siblings ...)
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 12/12] qcow2: Serialize all requests Stefan Hajnoczi
@ 2011-01-23 23:31 ` Anthony Liguori
  2011-02-01 13:23   ` Kevin Wolf
  2011-01-24 11:58 ` [Qemu-devel] " Kevin Wolf
  13 siblings, 1 reply; 46+ messages in thread
From: Anthony Liguori @ 2011-01-23 23:31 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

On 01/22/2011 03:29 AM, Stefan Hajnoczi wrote:
> This patch series prototypes making QCOW2 fully asynchronous to eliminate the
> timing jitter and poor performance that has been observed.  QCOW2 has
> asynchronous I/O code paths for some of the read/write common cases but
> metadata access is always synchronous.
>
> One solution is to rewrite QCOW2 to be fully asynchronous by splitting all
> functions that perform blocking I/O into a series of callbacks.  Due to the
> complexity of QCOW2, this conversion and the maintenance prospects are
> unattractive.
>
> This patch series prototypes an alternative solution to make QCOW2
> asynchronous.  It introduces coroutines, cooperative userspace threads of
> control, so that each QCOW2 request has its own call stack.  To perform I/O,
> the coroutine submits an asynchronous I/O request and then yields back to QEMU.
> The coroutine stays suspended while the I/O operation is being processed by
> lower layers of the stack.  When the asynchronous I/O completes, the coroutine
> is resumed.
>
> The upshot of this is that QCOW2 can be implemented in a sequential fashion
> without explicit callbacks but all I/O actually happens asynchronously under
> the covers.
>
> This prototype implements reads, writes, and flushes.  Should install or boot
> VMs successfully.  However, it has the following limitations:
>
> 1. QCOW2 requests are serialized because the code is not yet safe for
>     concurrent requests.  See the last patch for details.
>
> 2. Coroutines are unoptimized.  We should pool coroutines (and their mmapped
>     stacks) to avoid the cost of coroutine creation.
>
> 3. The qcow2_aio_read_cb() and qcow2_aio_write_cb() functions should be
>     refactored into sequential code now that callbacks are no longer needed.
>
> I think this approach can solve the performance and functional problems of the
> current QCOW2 implementation.  It does not require invasive changes, much of
> QCOW2 works unmodified.
>
> Kevin: Do you like this approach and do you want to develop it further?
>    

Couple of additional comments:

1) The coroutine code is copied in a number of projects.  I plan on 
splitting it into a shared library.

2) Coroutines are really just a way of working with threads.  The 
implementation Stefan posted happens to take advantage of it's 
characteristics to optimize the implementation but the original 
coroutine implementation also has a thread based version for compatibility.

3) Essentially, coroutines are threads that 1) run in lock step 2) 
transfer control explicitly to one another via yieldto() and 3) perform 
message passing as part of the yield process.

They are very similar to yield-based generators in Python except a bit 
more flexible.  Think of coroutines as a way to introduce threading into 
QEMU without requiring that all of QEMU's core infrastructure be 
re-entrant to start out with (because of property (1)).

Regards,

Anthony Liguori

> Stefan
>
>
>    

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
@ 2011-01-23 23:40   ` Anthony Liguori
  2011-01-24 11:09     ` Stefan Hajnoczi
  2011-01-26 15:40   ` Avi Kivity
  1 sibling, 1 reply; 46+ messages in thread
From: Anthony Liguori @ 2011-01-23 23:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel

On 01/22/2011 03:29 AM, Stefan Hajnoczi wrote:
> Converting qcow2 to use coroutines is fairly simple since most of qcow2
> is synchronous.  The synchronous I/O functions likes bdrv_pread() now
> transparently work when called from a coroutine, so all the synchronous
> code just works.
>
> The explicitly asynchronous code is adjusted to repeatedly call
> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
> At that point the coroutine will return from its entry function and its
> resources are freed.
>
> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
> from a BH.  This is necessary since the user callback code does not
> expect to be executed from a coroutine.
>
> This conversion is not completely correct because the safety of the
> synchronous code does not carry over to the coroutine version.
> Previously, a synchronous code path could assume that it will never be
> interleaved with another request executing.  This is no longer true
> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
> other requests can be processed during that time.
>
> The solution is to carefully introduce checks so that pending requests
> do not step on each other's toes.  That is left for a future patch...
>
> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>    

As an alternative approach, could we trap async calls from the block 
device, implement them in a synchronous fashion, then issue the callback 
immediately?

This would mean that qcow_aio_write() would become fully synchronous 
which also means that you can track when the operation is completed 
entirely within the block layer.  IOW, it should be possible to do this 
with almost no change to qcow2.

I think this is the right approach too.  If we're using coroutines, we 
shouldn't do anything asynchronous in the image formats.  The good bit 
about this is that we can probably dramatically simplify the block layer 
API by eliminating the sync/async versions of everything.

Regards,

Anthony Liguori

> ---
>   block/qcow2.c |  160 ++++++++++++++++++++++++++++++---------------------------
>   1 files changed, 85 insertions(+), 75 deletions(-)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index b6b094c..4b33ef3 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -361,19 +361,20 @@ typedef struct QCowAIOCB {
>       uint64_t bytes_done;
>       uint64_t cluster_offset;
>       uint8_t *cluster_data;
> -    BlockDriverAIOCB *hd_aiocb;
>       QEMUIOVector hd_qiov;
>       QEMUBH *bh;
>       QCowL2Meta l2meta;
>       QLIST_ENTRY(QCowAIOCB) next_depend;
> +    Coroutine *coroutine;
> +    int ret; /* return code for user callback */
>   } QCowAIOCB;
>
>   static void qcow2_aio_cancel(BlockDriverAIOCB *blockacb)
>   {
>       QCowAIOCB *acb = container_of(blockacb, QCowAIOCB, common);
> -    if (acb->hd_aiocb)
> -        bdrv_aio_cancel(acb->hd_aiocb);
>       qemu_aio_release(acb);
> +    /* XXX This function looks broken, we could be in the middle of a request
> +     * and releasing the acb is not a good idea */
>   }
>
>   static AIOPool qcow2_aio_pool = {
> @@ -381,13 +382,14 @@ static AIOPool qcow2_aio_pool = {
>       .cancel             = qcow2_aio_cancel,
>   };
>
> -static void qcow2_aio_read_cb(void *opaque, int ret);
> -static void qcow2_aio_read_bh(void *opaque)
> +static void qcow2_aio_bh(void *opaque)
>   {
>       QCowAIOCB *acb = opaque;
>       qemu_bh_delete(acb->bh);
>       acb->bh = NULL;
> -    qcow2_aio_read_cb(opaque, 0);
> +    acb->common.cb(acb->common.opaque, acb->ret);
> +    qemu_iovec_destroy(&acb->hd_qiov);
> +    qemu_aio_release(acb);
>   }
>
>   static int qcow2_schedule_bh(QEMUBHFunc *cb, QCowAIOCB *acb)
> @@ -404,14 +406,13 @@ static int qcow2_schedule_bh(QEMUBHFunc *cb, QCowAIOCB *acb)
>       return 0;
>   }
>
> -static void qcow2_aio_read_cb(void *opaque, int ret)
> +static int coroutine_fn qcow2_aio_read_cb(void *opaque, int ret)
>   {
>       QCowAIOCB *acb = opaque;
>       BlockDriverState *bs = acb->common.bs;
>       BDRVQcowState *s = bs->opaque;
>       int index_in_cluster, n1;
>
> -    acb->hd_aiocb = NULL;
>       if (ret<  0)
>           goto done;
>
> @@ -469,22 +470,13 @@ static void qcow2_aio_read_cb(void *opaque, int ret)
>                   acb->sector_num, acb->cur_nr_sectors);
>               if (n1>  0) {
>                   BLKDBG_EVENT(bs->file, BLKDBG_READ_BACKING_AIO);
> -                acb->hd_aiocb = bdrv_aio_readv(bs->backing_hd, acb->sector_num,
> -                                    &acb->hd_qiov, acb->cur_nr_sectors,
> -                                    qcow2_aio_read_cb, acb);
> -                if (acb->hd_aiocb == NULL)
> -                    goto done;
> -            } else {
> -                ret = qcow2_schedule_bh(qcow2_aio_read_bh, acb);
> -                if (ret < 0)
> +                ret = bdrv_co_readv(bs->backing_hd, acb->sector_num, &acb->hd_qiov, acb->cur_nr_sectors);
> +                if (ret < 0) {
>                       goto done;
> +                }
>               }
>           } else {
> -            /* Note: in this case, no need to wait */
>               qemu_iovec_memset(&acb->hd_qiov, 0, 512 * acb->cur_nr_sectors);
> -            ret = qcow2_schedule_bh(qcow2_aio_read_bh, acb);
> -            if (ret < 0)
> -                goto done;
>           }
>       } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
>           /* add AIO support for compressed blocks ? */
> @@ -494,10 +486,6 @@ static void qcow2_aio_read_cb(void *opaque, int ret)
>           qemu_iovec_from_buffer(&acb->hd_qiov,
>               s->cluster_cache + index_in_cluster * 512,
>               512 * acb->cur_nr_sectors);
> -
> -        ret = qcow2_schedule_bh(qcow2_aio_read_bh, acb);
> -        if (ret < 0)
> -            goto done;
>       } else {
>           if ((acb->cluster_offset & 511) != 0) {
>               ret = -EIO;
> @@ -522,34 +510,50 @@ static void qcow2_aio_read_cb(void *opaque, int ret)
>           }
>
>           BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
> -        acb->hd_aiocb = bdrv_aio_readv(bs->file,
> +        ret = bdrv_co_readv(bs->file,
>                               (acb->cluster_offset >> 9) + index_in_cluster,
> -                            &acb->hd_qiov, acb->cur_nr_sectors,
> -                            qcow2_aio_read_cb, acb);
> -        if (acb->hd_aiocb == NULL) {
> -            ret = -EIO;
> +                            &acb->hd_qiov, acb->cur_nr_sectors);
> +        if (ret < 0) {
>               goto done;
>           }
>       }
>
> -    return;
> +    return 1;
>   done:
> -    acb->common.cb(acb->common.opaque, ret);
> -    qemu_iovec_destroy(&acb->hd_qiov);
> -    qemu_aio_release(acb);
> +    acb->ret = ret;
> +    qcow2_schedule_bh(qcow2_aio_bh, acb);
> +    return 0;
> +}
> +
> +static void * coroutine_fn qcow2_co_read(void *opaque)
> +{
> +    QCowAIOCB *acb = opaque;
> +
> +    while (qcow2_aio_read_cb(acb, 0)) {
> +    }
> +    return NULL;
> +}
> +
> +static int coroutine_fn qcow2_aio_write_cb(void *opaque, int ret);
> +static void * coroutine_fn qcow2_co_write(void *opaque)
> +{
> +    QCowAIOCB *acb = opaque;
> +
> +    while (qcow2_aio_write_cb(acb, 0)) {
> +    }
> +    return NULL;
>   }
>
> -static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, int64_t sector_num,
> -                                  QEMUIOVector *qiov, int nb_sectors,
> -                                  BlockDriverCompletionFunc *cb,
> -                                  void *opaque, int is_write)
> +static BlockDriverAIOCB *qcow2_aio_setup(BlockDriverState *bs,
> +        int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
> +        BlockDriverCompletionFunc *cb, void *opaque, int is_write)
>   {
>       QCowAIOCB *acb;
> +    Coroutine *coroutine;
>
>       acb = qemu_aio_get(&qcow2_aio_pool, bs, cb, opaque);
>       if (!acb)
>           return NULL;
> -    acb->hd_aiocb = NULL;
>       acb->sector_num = sector_num;
>       acb->qiov = qiov;
>
> @@ -561,7 +565,12 @@ static QCowAIOCB *qcow2_aio_setup(BlockDriverState *bs, int64_t sector_num,
>       acb->cluster_offset = 0;
>       acb->l2meta.nb_clusters = 0;
>       QLIST_INIT(&acb->l2meta.dependent_requests);
> -    return acb;
> +
> +    coroutine = qemu_coroutine_create(is_write ? qcow2_co_write
> +                                               : qcow2_co_read);
> +    acb->coroutine = coroutine;
> +    qemu_coroutine_enter(coroutine, acb);
> +    return &acb->common;
>   }
>
>   static BlockDriverAIOCB *qcow2_aio_readv(BlockDriverState *bs,
> @@ -570,38 +579,48 @@ static BlockDriverAIOCB *qcow2_aio_readv(BlockDriverState *bs,
>                                            BlockDriverCompletionFunc *cb,
>                                            void *opaque)
>   {
> -    QCowAIOCB *acb;
> -
> -    acb = qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
> -    if (!acb)
> -        return NULL;
> -
> -    qcow2_aio_read_cb(acb, 0);
> -    return &acb->common;
> +    return qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 0);
>   }
>
> -static void qcow2_aio_write_cb(void *opaque, int ret);
> -
> -static void run_dependent_requests(QCowL2Meta *m)
> +static void qcow2_co_run_dependent_requests(void *opaque)
>   {
> +    QCowAIOCB *acb = opaque;
>       QCowAIOCB *req;
>       QCowAIOCB *next;
>
> +    qemu_bh_delete(acb->bh);
> +    acb->bh = NULL;
> +
> +    /* Restart all dependent requests */
> +    QLIST_FOREACH_SAFE(req, &acb->l2meta.dependent_requests, next_depend, next) {
> +        qemu_coroutine_enter(req->coroutine, NULL);
> +    }
> +
> +    /* Reenter the original request */
> +    qemu_coroutine_enter(acb->coroutine, NULL);
> +}
> +
> +static void run_dependent_requests(QCowL2Meta *m)
> +{
>       /* Take the request off the list of running requests */
>       if (m->nb_clusters != 0) {
>           QLIST_REMOVE(m, next_in_flight);
>       }
>
> -    /* Restart all dependent requests */
> -    QLIST_FOREACH_SAFE(req, &m->dependent_requests, next_depend, next) {
> -        qcow2_aio_write_cb(req, 0);
> +    if (!QLIST_EMPTY(&m->dependent_requests)) {
> +        /* TODO This is a hack to get at the acb, may not be correct if called
> +         * with a QCowL2Meta that is not part of a QCowAIOCB.
> +         */
> +        QCowAIOCB *acb = container_of(m, QCowAIOCB, l2meta);
> +        qcow2_schedule_bh(qcow2_co_run_dependent_requests, acb);
> +        qemu_coroutine_yield(NULL);
>       }
>
>       /* Empty the list for the next part of the request */
>       QLIST_INIT(&m->dependent_requests);
>   }
>
> -static void qcow2_aio_write_cb(void *opaque, int ret)
> +static int coroutine_fn qcow2_aio_write_cb(void *opaque, int ret)
>   {
>       QCowAIOCB *acb = opaque;
>       BlockDriverState *bs = acb->common.bs;
> @@ -609,8 +628,6 @@ static void qcow2_aio_write_cb(void *opaque, int ret)
>       int index_in_cluster;
>       int n_end;
>
> -    acb->hd_aiocb = NULL;
> -
>       if (ret >= 0) {
>           ret = qcow2_alloc_cluster_link_l2(bs, &acb->l2meta);
>       }
> @@ -648,7 +665,8 @@ static void qcow2_aio_write_cb(void *opaque, int ret)
>       if (acb->l2meta.nb_clusters == 0 && acb->l2meta.depends_on != NULL) {
>           QLIST_INSERT_HEAD(&acb->l2meta.depends_on->dependent_requests,
>               acb, next_depend);
> -        return;
> +        qemu_coroutine_yield(NULL);
> +        return 1;
>       }
>
>       assert((acb->cluster_offset & 511) == 0);
> @@ -675,25 +693,22 @@ static void qcow2_aio_write_cb(void *opaque, int ret)
>       }
>
>       BLKDBG_EVENT(bs->file, BLKDBG_WRITE_AIO);
> -    acb->hd_aiocb = bdrv_aio_writev(bs->file,
> -                                    (acb->cluster_offset >> 9) + index_in_cluster,
> -                                    &acb->hd_qiov, acb->cur_nr_sectors,
> -                                    qcow2_aio_write_cb, acb);
> -    if (acb->hd_aiocb == NULL) {
> -        ret = -EIO;
> +    ret = bdrv_co_writev(bs->file,
> +                         (acb->cluster_offset >> 9) + index_in_cluster,
> +                         &acb->hd_qiov, acb->cur_nr_sectors);
> +    if (ret < 0) {
>           goto fail;
>       }
> -
> -    return;
> +    return 1;
>
>   fail:
>       if (acb->l2meta.nb_clusters != 0) {
>           QLIST_REMOVE(&acb->l2meta, next_in_flight);
>       }
>   done:
> -    acb->common.cb(acb->common.opaque, ret);
> -    qemu_iovec_destroy(&acb->hd_qiov);
> -    qemu_aio_release(acb);
> +    acb->ret = ret;
> +    qcow2_schedule_bh(qcow2_aio_bh, acb);
> +    return 0;
>   }
>
>   static BlockDriverAIOCB *qcow2_aio_writev(BlockDriverState *bs,
> @@ -703,16 +718,10 @@ static BlockDriverAIOCB *qcow2_aio_writev(BlockDriverState *bs,
>                                             void *opaque)
>   {
>       BDRVQcowState *s = bs->opaque;
> -    QCowAIOCB *acb;
>
>       s->cluster_cache_offset = -1; /* disable compressed cache */
>
> -    acb = qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
> -    if (!acb)
> -        return NULL;
> -
> -    qcow2_aio_write_cb(acb, 0);
> -    return&acb->common;
> +    return qcow2_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, 1);
>   }
>
>   static void qcow2_close(BlockDriverState *bs)
> @@ -824,6 +833,7 @@ static int qcow2_change_backing_file(BlockDriverState *bs,
>       return qcow2_update_ext_header(bs, backing_file, backing_fmt);
>   }
>
> +/* TODO did we break this for coroutines? */
>   static int preallocate(BlockDriverState *bs)
>   {
>       uint64_t nb_sectors;
>    

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-23 23:40   ` Anthony Liguori
@ 2011-01-24 11:09     ` Stefan Hajnoczi
  0 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-24 11:09 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On Sun, Jan 23, 2011 at 11:40 PM, Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 01/22/2011 03:29 AM, Stefan Hajnoczi wrote:
>>
>> Converting qcow2 to use coroutines is fairly simple since most of qcow2
>> is synchronous.  The synchronous I/O functions like bdrv_pread() now
>> transparently work when called from a coroutine, so all the synchronous
>> code just works.
>>
>> The explicitly asynchronous code is adjusted to repeatedly call
>> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
>> At that point the coroutine will return from its entry function and its
>> resources are freed.
>>
>> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
>> from a BH.  This is necessary since the user callback code does not
>> expect to be executed from a coroutine.
>>
>> This conversion is not completely correct because the safety of the
>> synchronous code does not carry over to the coroutine version.
>> Previously, a synchronous code path could assume that it will never be
>> interleaved with another request executing.  This is no longer true
>> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
>> other requests can be processed during that time.
>>
>> The solution is to carefully introduce checks so that pending requests
>> do not step on each other's toes.  That is left for a future patch...
>>
>> Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>
>
> As an alternative approach, could we trap async calls from the block device,
> implement them in a synchronous fashion, then issue the callback
> immediately?
>
> This would mean that qcow_aio_write() would become fully synchronous which
> also means that you can track when the operation is completed entirely
> within the block layer.  IOW, it should be possible to do this with almost
> no change to qcow2.

I'm not sure I understand what you're suggesting.  Right now
bdrv_read() for coroutines is implemented on top of bdrv_aio_readv().
And bdrv_pread() is implemented on top of bdrv_read().  It doesn't
make sense to me to implement bdrv_aio_readv() in terms of
bdrv_read().  Also is it safe to invoke the callback without a BH?

> I think this is the right approach too.  If we're using coroutines, we
> shouldn't do anything asynchronous in the image formats.  The good bit about
> this is that we can probably dramatically simplify the block layer API but
> eliminating the sync/async versions of everything.

Hardware emulation needs the asynchronous API so I don't think we can
get rid of bdrv_aio_readv(), bdrv_aio_writev(), and bdrv_aio_flush()
completely.  IDE and SCSI also like to be able to cancel their aio
requests.

Non-invasive coroutine support in the block layer will allow us to
easily make the more obscure image formats asynchronous too :).

Stefan


* [Qemu-devel] Re: [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
                   ` (12 preceding siblings ...)
  2011-01-23 23:31 ` [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Anthony Liguori
@ 2011-01-24 11:58 ` Kevin Wolf
  2011-01-24 13:10   ` Stefan Hajnoczi
  13 siblings, 1 reply; 46+ messages in thread
From: Kevin Wolf @ 2011-01-24 11:58 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Anthony Liguori, qemu-devel

Am 22.01.2011 10:29, schrieb Stefan Hajnoczi:
> This patch series prototypes making QCOW2 fully asynchronous to eliminate the
> timing jitter and poor performance that has been observed.  QCOW2 has
> asynchronous I/O code paths for some of the read/write common cases but
> metadata access is always synchronous.
> 
> One solution is to rewrite QCOW2 to be fully asynchronous by splitting all
> functions that perform blocking I/O into a series of callbacks.  Due to the
> complexity of QCOW2, this conversion and the maintenance prospects are
> unattractive.
> 
> This patch series prototypes an alternative solution to make QCOW2
> asynchronous.  It introduces coroutines, cooperative userspace threads of
> control, so that each QCOW2 request has its own call stack.  To perform I/O,
> the coroutine submits an asynchronous I/O request and then yields back to QEMU.
> The coroutine stays suspended while the I/O operation is being processed by
> lower layers of the stack.  When the asynchronous I/O completes, the coroutine
> is resumed.
> 
> The upshot of this is that QCOW2 can be implemented in a sequential fashion
> without explicit callbacks but all I/O actually happens asynchronously under
> the covers.
> 
> This prototype implements reads, writes, and flushes.  Should install or boot
> VMs successfully.  However, it has the following limitations:
> 
> 1. QCOW2 requests are serialized because the code is not yet safe for
>    concurrent requests.  See the last patch for details.
> 
> 2. Coroutines are unoptimized.  We should pool coroutines (and their mmapped
>    stacks) to avoid the cost of coroutine creation.
> 
> 3. The qcow2_aio_read_cb() and qcow2_aio_write_cb() functions should be
>    refactored into sequential code now that callbacks are no longer needed.
> 
> I think this approach can solve the performance and functional problems of the
> current QCOW2 implementation.  It does not require invasive changes; much of
> QCOW2 works unmodified.
> 
> Kevin: Do you like this approach and do you want to develop it further?

I think it looks like a good start. The code will look much nicer this
way than with the callback jungle that you tried out in QED.

I'm not completely sure about patches 10 and 12, I don't think I agree
with the conversion approach. By making bdrv_pread/pwrite asynchronous,
you force drivers to be converted all at once - which leads to big
hammers as in patch 12 (by the way, I'm curious if you have tried how
much performance is hurt?)

Wouldn't we be better off if we added a bdrv_co_pread/pwrite and
converted qcow2 step by step? I'm not sure what the easy way forward
would be with patch 12; it looks more like a dead end to me (though I
haven't looked at it for more than a few minutes yet).

One more thing I want to mention is that bdrv_aio_read doesn't have the
same semantics as bdrv_read with respect to EOF. The AIO one returns
-EINVAL when reading beyond EOF whereas bdrv_read returns zeros. I'd
expect that we'll hit this with the conversion.

Kevin


* Re: [Qemu-devel] Re: [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-24 11:58 ` [Qemu-devel] " Kevin Wolf
@ 2011-01-24 13:10   ` Stefan Hajnoczi
  0 siblings, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-24 13:10 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Stefan Hajnoczi, qemu-devel

On Mon, Jan 24, 2011 at 11:58 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 22.01.2011 10:29, schrieb Stefan Hajnoczi:
>> Kevin: Do you like this approach and do you want to develop it further?
>
> I think it looks like a good start. The code will look much nicer this
> way than with the callback jungle that you tried out in QED.
>
> I'm not completely sure about patches 10 and 12, I don't think I agree
> with the conversion approach. By making bdrv_pread/pwrite asynchronous,
> you force drivers to be converted all at once - which leads to big
> hammers as in patch 12 (by the way, I'm curious if you have tried how
> much performance is hurt?)

I have not measured performance.  We're serializing requests and doing
a bunch of extra system calls per request so I expect a noticeable
degradation.  With some profiling and optimization we should be able
to get good performance though.

> Wouldn't we be better off if we added a bdrv_co_pread/pwrite and
> converted qcow2 step by step? I'm not sure what the easy way forward
> would be with patch 12; it looks more like a dead end to me (though I
> haven't looked at it for more than a few minutes yet).

I think you're right.  I wanted to prove that it is possible to make
qcow2 asynchronous using coroutines.  Perhaps we should lay off on
making everything asynchronous and instead convert code incrementally.
We wouldn't need patch 10 or patch 12.

There is one interesting feature of patch 10: it allows code to do
block I/O from normal and coroutine context.  i.e. you don't have to
rewrite all your metadata functions in order to use them from both
contexts.  This can be achieved or worked around in other ways, but I
think it's a neat feature :).

> One more thing I want to mention is that bdrv_aio_read doesn't have the
> same semantics as bdrv_read with respect to EOF. The AIO one returns
> -EINVAL when reading beyond EOF whereas bdrv_read returns zeros. I'd
> expect that we'll hit this with the conversion.

Thanks for pointing this out, I didn't know that.

I would like convert QED to use coroutines because we may be able to
simplify the code significantly.  Do you want to take over the qcow2
side of things from here?  I'm happy to do clean ups first so the code
is in a state that you feel comfortable with, just let me know.

Stefan


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library Stefan Hajnoczi
@ 2011-01-26 15:25   ` Avi Kivity
  2011-01-26 16:00     ` Anthony Liguori
  0 siblings, 1 reply; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 15:25 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
> Asynchronous image format code is becoming very complex.  Let's try
> using coroutines to write sequential code without callbacks, switching
> stacks under the hood.
>
>
> +
> +int cc_swap(struct continuation *from, struct continuation *to)
> +{
> +	to->exited = 0;
> +	if (getcontext(&to->last) == -1)
> +		return -1;
> +	else if (to->exited == 0)
> +		to->exited = 1;
> +	else if (to->exited == 1)
> +		return 1;
> +
> +	return swapcontext(&from->uc, &to->uc);
> +}

swapcontext() is very slow, involving the fpu and a syscall.

A nice trick I've used in the past is to use getcontext/makecontext for 
the initial setup and setjmp/longjmp for switching.  Of course this can 
be done later, as an optimization.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines Stefan Hajnoczi
@ 2011-01-26 15:29   ` Avi Kivity
  2011-01-26 16:00     ` Anthony Liguori
  2011-01-27  9:40     ` Stefan Hajnoczi
  0 siblings, 2 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 15:29 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
> Add functions to create coroutines and transfer control into a coroutine
> and back out again.
>
>
> +
> +struct Coroutine {
> +    struct coroutine co;
> +};
> +
>
> +/**
> + * Coroutine entry point
> + *
> + * When the coroutine is entered for the first time, opaque is passed in as an
> + * argument.
> + *
> + * When this function returns, the coroutine is destroyed automatically and the
> + * return value is passed back to the caller who last entered the coroutine.
> + */
> +typedef void * coroutine_fn CoroutineEntry(void *opaque);

The more modern style is to use the Coroutine structure as argument, and 
let the coroutine function use container_of() to obtain access to its 
own data structures.  Similarly it can store any return value there, 
avoiding casts to and from void pointers.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
  2011-01-23 23:40   ` Anthony Liguori
@ 2011-01-26 15:40   ` Avi Kivity
  2011-01-26 15:50     ` Kevin Wolf
  2011-01-27 10:09     ` Stefan Hajnoczi
  1 sibling, 2 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 15:40 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel

On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
> Converting qcow2 to use coroutines is fairly simple since most of qcow2
> is synchronous.  The synchronous I/O functions like bdrv_pread() now
> transparently work when called from a coroutine, so all the synchronous
> code just works.
>
> The explicitly asynchronous code is adjusted to repeatedly call
> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
> At that point the coroutine will return from its entry function and its
> resources are freed.
>
> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
> from a BH.  This is necessary since the user callback code does not
> expect to be executed from a coroutine.
>
> This conversion is not completely correct because the safety of the
> synchronous code does not carry over to the coroutine version.
> Previously, a synchronous code path could assume that it will never be
> interleaved with another request executing.  This is no longer true
> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
> other requests can be processed during that time.
>
> The solution is to carefully introduce checks so that pending requests
> do not step on each other's toes.  That is left for a future patch...
>


The way I thought of doing this is:

qcow_aio_write(...)
{
     execute_in_coroutine {
         co_mutex_lock(&bs->mutex);
         do_qcow_aio_write(...); // original qcow code
         co_mutex_release(&bs->mutex);
     }
}

(similar changes for the other callbacks)

if the code happens to be asynchronous (no metadata changes), we'll take 
the mutex and release it immediately after submitting the I/O, so no 
extra serialization happens.  If the code does issue a synchronous 
metadata write, we'll lock out all other operations on the same block 
device, but still allow the vcpu to execute, since all the locking 
happens in a coroutine.

Essentially, a mutex becomes the dependency tracking mechanism.  A global 
mutex means all synchronous operations are dependent.  Later, we can 
convert the metadata cache entry dependency lists to local mutexes 
inside the cache entry structures.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 15:40   ` Avi Kivity
@ 2011-01-26 15:50     ` Kevin Wolf
  2011-01-26 16:08       ` Anthony Liguori
  2011-01-26 16:08       ` Avi Kivity
  2011-01-27 10:09     ` Stefan Hajnoczi
  1 sibling, 2 replies; 46+ messages in thread
From: Kevin Wolf @ 2011-01-26 15:50 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Anthony Liguori, Stefan Hajnoczi, qemu-devel

Am 26.01.2011 16:40, schrieb Avi Kivity:
> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>> Converting qcow2 to use coroutines is fairly simple since most of qcow2
>> is synchronous.  The synchronous I/O functions like bdrv_pread() now
>> transparently work when called from a coroutine, so all the synchronous
>> code just works.
>>
>> The explicitly asynchronous code is adjusted to repeatedly call
>> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
>> At that point the coroutine will return from its entry function and its
>> resources are freed.
>>
>> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
>> from a BH.  This is necessary since the user callback code does not
>> expect to be executed from a coroutine.
>>
>> This conversion is not completely correct because the safety of the
>> synchronous code does not carry over to the coroutine version.
>> Previously, a synchronous code path could assume that it will never be
>> interleaved with another request executing.  This is no longer true
>> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
>> other requests can be processed during that time.
>>
>> The solution is to carefully introduce checks so that pending requests
>> do not step on each other's toes.  That is left for a future patch...
> 
> The way I thought of doing this is:
> 
> qcow_aio_write(...)
> {
>      execute_in_coroutine {
>          co_mutex_lock(&bs->mutex);
>          do_qcow_aio_write(...); // original qcow code
>          co_mutex_release(&bs->mutex);
>      }
> }
> 
> (similar changes for the other callbacks)
> 
> if the code happens to be asynchronous (no metadata changes), we'll take 
> the mutex and release it immediately after submitting the I/O, so no 
> extra serialization happens.  If the code does issue a synchronous 
> metadata write, we'll lock out all other operations on the same block 
> device, but still allow the vcpu to execute, since all the locking 
> happens in a coroutine.
> 
> Essentially, a mutex becomes the dependency tracking mechanism.  A global 
> mutex means all synchronous operations are dependent.  Later, we can 
> convert the metadata cache entry dependency lists to local mutexes 
> inside the cache entry structures.

I thought a bit about it since you mentioned it in the call yesterday
and I think this approach makes sense. Even immediately after the
conversion we should be in a better state than with Stefan's approach
because I/O without metadata disk access won't be serialized.

In the other thread you mentioned that you have written some code
independently. Do you have it in some public git repository?

Kevin


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-26 15:25   ` Avi Kivity
@ 2011-01-26 16:00     ` Anthony Liguori
  2011-01-26 16:13       ` Avi Kivity
  0 siblings, 1 reply; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 09:25 AM, Avi Kivity wrote:
> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>> Asynchronous image format code is becoming very complex.  Let's try
>> using coroutines to write sequential code without callbacks, switching
>> stacks under the hood.
>>
>>
>> +
>> +int cc_swap(struct continuation *from, struct continuation *to)
>> +{
>> +    to->exited = 0;
>> +    if (getcontext(&to->last) == -1)
>> +        return -1;
>> +    else if (to->exited == 0)
>> +        to->exited = 1;
>> +    else if (to->exited == 1)
>> +        return 1;
>> +
>> +    return swapcontext(&from->uc, &to->uc);
>> +}
>
> swapcontext() is very slow, involving the fpu and a syscall.
>
> A nice trick I've used in the past is to use getcontext/makecontext 
> for the initial setup and setjmp/longjmp for switching.  Of course 
> this can be done later, as an optimization.

Yeah, there are further optimizations that can be had but really, it's 
not important here.  In fact, I'd rather start with the threaded version 
just because it's far more portable than ucontext.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines
  2011-01-26 15:29   ` Avi Kivity
@ 2011-01-26 16:00     ` Anthony Liguori
  2011-01-27  9:40     ` Stefan Hajnoczi
  1 sibling, 0 replies; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 09:29 AM, Avi Kivity wrote:
> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>> Add functions to create coroutines and transfer control into a coroutine
>> and back out again.
>>
>>
>> +
>> +struct Coroutine {
>> +    struct coroutine co;
>> +};
>> +
>>
>> +/**
>> + * Coroutine entry point
>> + *
>> + * When the coroutine is entered for the first time, opaque is 
>> passed in as an
>> + * argument.
>> + *
>> + * When this function returns, the coroutine is destroyed 
>> automatically and the
>> + * return value is passed back to the caller who last entered the 
>> coroutine.
>> + */
>> +typedef void * coroutine_fn CoroutineEntry(void *opaque);
>
> The more modern style is to use the Coroutine structure as argument, 
> and let the coroutine function use container_of() to obtain access to 
> its own data structures.  Similarly it can store any return value 
> there, avoiding casts to and from void pointers.

This is old (shared) code from other projects.  My plan is to split it into 
a separate library instead of actually including this in QEMU.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 15:50     ` Kevin Wolf
@ 2011-01-26 16:08       ` Anthony Liguori
  2011-01-26 16:13         ` Avi Kivity
  2011-01-26 16:08       ` Avi Kivity
  1 sibling, 1 reply; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:08 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Anthony Liguori, Avi Kivity, Stefan Hajnoczi

On 01/26/2011 09:50 AM, Kevin Wolf wrote:
> Am 26.01.2011 16:40, schrieb Avi Kivity:
>    
>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>      
>>> Converting qcow2 to use coroutines is fairly simple since most of qcow2
>>> is synchronous.  The synchronous I/O functions like bdrv_pread() now
>>> transparently work when called from a coroutine, so all the synchronous
>>> code just works.
>>>
>>> The explicitly asynchronous code is adjusted to repeatedly call
>>> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
>>> At that point the coroutine will return from its entry function and its
>>> resources are freed.
>>>
>>> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
>>> from a BH.  This is necessary since the user callback code does not
>>> expect to be executed from a coroutine.
>>>
>>> This conversion is not completely correct because the safety of the
>>> synchronous code does not carry over to the coroutine version.
>>> Previously, a synchronous code path could assume that it will never be
>>> interleaved with another request executing.  This is no longer true
>>> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
>>> other requests can be processed during that time.
>>>
>>> The solution is to carefully introduce checks so that pending requests
>>> do not step on each other's toes.  That is left for a future patch...
>>>        
>> The way I thought of doing this is:
>>
>> qcow_aio_write(...)
>> {
>>       execute_in_coroutine {
>>           co_mutex_lock(&bs->mutex);
>>           do_qcow_aio_write(...); // original qcow code
>>           co_mutex_release(&bs->mutex);
>>      

The release has to be executed in the callback.

I think it's a bit nicer not to use a mutex, but rather to have a notion 
of freezing/unfreezing the block queue and instead do:

completion() {
    bdrv_unfreeze(bs);
}

coroutine {
    bdrv_freeze(bs);
    do_qcow_aio_write(completion);
}

Freeze/unfreeze is useful in a number of other places too (like 
snapshotting).

Regards,

Anthony Liguori

>>       }
>> }
>>
>> (similar changes for the other callbacks)
>>
>> if the code happens to be asynchronous (no metadata changes), we'll take
>> the mutex and release it immediately after submitting the I/O, so no
>> extra serialization happens.  If the code does issue a synchronous
>> metadata write, we'll lock out all other operations on the same block
>> device, but still allow the vcpu to execute, since all the locking
>> happens in a coroutine.
>>
>> Essentially, a mutex becomes the dependency tracking mechanism.  A global
>> mutex means all synchronous operations are dependent.  Later, we can
>> convert the metadata cache entry dependency lists to local mutexes
>> inside the cache entry structures.
>>      
> I thought a bit about it since you mentioned it in the call yesterday
> and I think this approach makes sense. Even immediately after the
> conversion we should be in a better state than with Stefan's approach
> because I/O without metadata disk access won't be serialized.
>
> In the other thread you mentioned that you have written some code
> independently. Do you have it in some public git repository?
>
> Kevin
>
>    

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 15:50     ` Kevin Wolf
  2011-01-26 16:08       ` Anthony Liguori
@ 2011-01-26 16:08       ` Avi Kivity
  1 sibling, 0 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 16:08 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 05:50 PM, Kevin Wolf wrote:
> In the other thread you mentioned that you have written some code
> independently. Do you have it in some public git repository?

I've written it mostly in my mind... IIRC I have the basic coroutine 
equivalents (in my series they are simply threads, without an explicit 
yield_to, but basically the same thing).  I've started work on making 
aio emulation use coroutines/uthreads, so vmdk and related automatically 
become non-blocking (but all requests serialize against each other 
on a built-in BlockDriverState::mutex).  I have done nothing on 
qcow2 beyond figuring out how to do it (which I outlined above).

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 16:08       ` Anthony Liguori
@ 2011-01-26 16:13         ` Avi Kivity
  2011-01-26 16:28           ` Anthony Liguori
  0 siblings, 1 reply; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 16:13 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 06:08 PM, Anthony Liguori wrote:
> On 01/26/2011 09:50 AM, Kevin Wolf wrote:
>> Am 26.01.2011 16:40, schrieb Avi Kivity:
>>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>>> Converting qcow2 to use coroutines is fairly simple since most of 
>>>> qcow2
>>>> is synchronous.  The synchronous I/O functions like bdrv_pread() now
>>>> transparently work when called from a coroutine, so all the 
>>>> synchronous
>>>> code just works.
>>>>
>>>> The explicitly asynchronous code is adjusted to repeatedly call
>>>> qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request 
>>>> completes.
>>>> At that point the coroutine will return from its entry function and 
>>>> its
>>>> resources are freed.
>>>>
>>>> The bdrv_aio_readv() and bdrv_aio_writev() user callback is now 
>>>> invoked
>>>> from a BH.  This is necessary since the user callback code does not
>>>> expect to be executed from a coroutine.
>>>>
>>>> This conversion is not completely correct because the safety of the
>>>> synchronous code does not carry over to the coroutine version.
>>>> Previously, a synchronous code path could assume that it will never be
>>>> interleaved with another request executing.  This is no longer true
>>>> because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield 
>>>> and
>>>> other requests can be processed during that time.
>>>>
>>>> The solution is to carefully introduce checks so that pending requests
>>>> do not step on each other's toes.  That is left for a future patch...
>>> The way I thought of doing this is:
>>>
>>> qcow_aio_write(...)
>>> {
>>>       execute_in_coroutine {
>>>           co_mutex_lock(&bs->mutex);
>>>           do_qcow_aio_write(...); // original qcow code
>>>           co_mutex_release(&bs->mutex);
>
> The release has to be executed in the callback.
>
> I think it's a bit nicer to not do a mutex, but rather to have a 
> notion of freezing/unfreezing the block queue and instead do:
>
> completion() {
>    bdrv_unfreeze(bs);
> }
>
> coroutine {
>    bdrv_freeze(bs);
>    do_qcow_aio_write(completion);
> }
>
> Freeze/unfreeze is useful in a number of other places too (like 
> snapshotting).

Serializing against a global mutex has the advantage that it can be 
treated as a global lock that is decomposed into fine-grained locks.

For example, we can start the code conversion from an explicit async 
model to a threaded sync model by converting the mutex into a 
shared/exclusive lock.  Operations like read and write take the lock for 
shared access (and take a fine-grained mutex on the metadata cache 
entry), while operations like creating a snapshot take the lock for 
exclusive access.  That doesn't work with freeze/thaw.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-26 16:00     ` Anthony Liguori
@ 2011-01-26 16:13       ` Avi Kivity
  2011-01-26 16:19         ` Anthony Liguori
  2011-01-26 16:21         ` Anthony Liguori
  0 siblings, 2 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 16:13 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 06:00 PM, Anthony Liguori wrote:
> On 01/26/2011 09:25 AM, Avi Kivity wrote:
>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>> Asynchronous image format code is becoming very complex.  Let's try
>>> using coroutines to write sequential code without callbacks but use
>>> coroutines to switch stacks under the hood.
>>>
>>>
>>> +
>>> +int cc_swap(struct continuation *from, struct continuation *to)
>>> +{
>>> +    to->exited = 0;
>>> +    if (getcontext(&to->last) == -1)
>>> +        return -1;
>>> +    else if (to->exited == 0)
>>> +        to->exited = 1;
>>> +    else if (to->exited == 1)
>>> +        return 1;
>>> +
>>> +    return swapcontext(&from->uc,&to->uc);
>>> +}
>>
>> swapcontext() is very slow, involving the fpu and a syscall.
>>
>> A nice trick I've used in the past is to use getcontext/makecontext 
>> for the initial setup and setjmp/longjmp for switching.  Of course 
>> this can be done later, as an optimization.
>
> Yeah, there are further optimizations that can be had but really, it's 
> not important here.  In fact, I'd rather start with the threaded 
> version just because it's far more portable than ucontext.
>

What do you mean by threaded version?

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-26 16:13       ` Avi Kivity
@ 2011-01-26 16:19         ` Anthony Liguori
  2011-01-26 16:22           ` Avi Kivity
  2011-01-26 16:21         ` Anthony Liguori
  1 sibling, 1 reply; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, Anthony Liguori, Stefan Hajnoczi,
	qemu-devel

On 01/26/2011 10:13 AM, Avi Kivity wrote:
> On 01/26/2011 06:00 PM, Anthony Liguori wrote:
>> On 01/26/2011 09:25 AM, Avi Kivity wrote:
>>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>>> Asynchronous image format code is becoming very complex.  Let's try
>>>> using coroutines to write sequential code without callbacks but use
>>>> coroutines to switch stacks under the hood.
>>>>
>>>>
>>>> +
>>>> +int cc_swap(struct continuation *from, struct continuation *to)
>>>> +{
>>>> +    to->exited = 0;
>>>> +    if (getcontext(&to->last) == -1)
>>>> +        return -1;
>>>> +    else if (to->exited == 0)
>>>> +        to->exited = 1;
>>>> +    else if (to->exited == 1)
>>>> +        return 1;
>>>> +
>>>> +    return swapcontext(&from->uc,&to->uc);
>>>> +}
>>>
>>> swapcontext() is very slow, involving the fpu and a syscall.
>>>
>>> A nice trick I've used in the past is to use getcontext/makecontext 
>>> for the initial setup and setjmp/longjmp for switching.  Of course 
>>> this can be done later, as an optimization.
>>
>> Yeah, there are further optimizations that can be had but really, 
>> it's not important here.  In fact, I'd rather start with the threaded 
>> version just because it's far more portable than ucontext.
>>
>
> What do you mean by threaded version?

Stefan didn't post it, but the original code also has a GThread based 
implementation when ucontext isn't available (like on Windows).  It uses 
a mutex to control the execution of the coroutines.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-26 16:13       ` Avi Kivity
  2011-01-26 16:19         ` Anthony Liguori
@ 2011-01-26 16:21         ` Anthony Liguori
  1 sibling, 0 replies; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:21 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 10:13 AM, Avi Kivity wrote:
> On 01/26/2011 06:00 PM, Anthony Liguori wrote:
>> On 01/26/2011 09:25 AM, Avi Kivity wrote:
>>> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
>>>> Asynchronous image format code is becoming very complex.  Let's try
>>>> using coroutines to write sequential code without callbacks but use
>>>> coroutines to switch stacks under the hood.
>>>>
>>>>
>>>> +
>>>> +int cc_swap(struct continuation *from, struct continuation *to)
>>>> +{
>>>> +    to->exited = 0;
>>>> +    if (getcontext(&to->last) == -1)
>>>> +        return -1;
>>>> +    else if (to->exited == 0)
>>>> +        to->exited = 1;
>>>> +    else if (to->exited == 1)
>>>> +        return 1;
>>>> +
>>>> +    return swapcontext(&from->uc,&to->uc);
>>>> +}
>>>
>>> swapcontext() is very slow, involving the fpu and a syscall.
>>>
>>> A nice trick I've used in the past is to use getcontext/makecontext 
>>> for the initial setup and setjmp/longjmp for switching.  Of course 
>>> this can be done later, as an optimization.
>>
>> Yeah, there are further optimizations that can be had but really, 
>> it's not important here.  In fact, I'd rather start with the threaded 
>> version just because it's far more portable than ucontext.
>>
>
> What do you mean by threaded version?

http://git.gnome.org/browse/gtk-vnc/tree/src/coroutine_gthread.c

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-26 16:19         ` Anthony Liguori
@ 2011-01-26 16:22           ` Avi Kivity
  2011-01-26 16:29             ` Anthony Liguori
  0 siblings, 1 reply; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 16:22 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, Anthony Liguori, Stefan Hajnoczi,
	qemu-devel

On 01/26/2011 06:19 PM, Anthony Liguori wrote:
>> What do you mean by threaded version?
>
>
> Stefan didn't post it, but the original code also has a GThread based 
> implementation when ucontext isn't available (like on Windows).  It 
> uses a mutex to control the execution of the coroutines.

Ah ok.  These can all be hidden under a single API.

btw, I think Windows does provide support for user-level threads under 
the name Fibers.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 16:13         ` Avi Kivity
@ 2011-01-26 16:28           ` Anthony Liguori
  2011-01-26 16:38             ` Avi Kivity
  0 siblings, 1 reply; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:28 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 01/26/2011 10:13 AM, Avi Kivity wrote:
> Serializing against a global mutex has the advantage that it can be 
> treated as a global lock that is decomposed into fine-grained locks.
>
> For example, we can start the code conversion from an explicit async 
> model to a threaded sync model by converting the mutex into a 
> shared/exclusive lock.  Operations like read and write take the lock 
> for shared access (and take a fine-grained mutex on the metadata cache 
> entry), while operations like creating a snapshot take the lock for 
> exclusive access.  That doesn't work with freeze/thaw.

The trouble with this is that you increase the amount of re-entrance 
whereas freeze/thaw doesn't.

The code from the beginning of the request to where the mutex is 
acquired will be executed for every single request even while requests 
are blocked at the mutex acquisition.

With freeze/thaw, you freeze the queue and prevent any request from 
starting until you thaw.  You only thaw and return control to allow 
another request to execute when you begin executing an asynchronous I/O 
callback.

I think my previous example was wrong, you really want to do:

qcow2_aio_writev() {
    coroutine {
        freeze();
        sync_io(); // existing qcow2 code
        thaw();
        // existing non I/O code
        bdrv_aio_writev(callback); // no explicit freeze/thaw needed
    }
}

This is equivalent to our existing code because no new re-entrance is 
introduced.  The only re-entrancy points are in the 
bdrv_aio_{readv,writev} calls.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library
  2011-01-26 16:22           ` Avi Kivity
@ 2011-01-26 16:29             ` Anthony Liguori
  0 siblings, 0 replies; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 16:29 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/26/2011 10:22 AM, Avi Kivity wrote:
> On 01/26/2011 06:19 PM, Anthony Liguori wrote:
>>> What do you mean by threaded version?
>>
>>
>> Stefan didn't post it, but the original code also has a GThread based 
>> implementation when ucontext isn't available (like on Windows).  It 
>> uses a mutex to control the execution of the coroutines.
>
> Ah ok.  These can all be hidden under a single API.

It is, that's the point of the coroutine abstraction :-)


> btw, I think Windows does provide support for user-level threads under 
> the name Fibers.

Yes, I never got around to implementing it though.  There was something 
odd about them that I thought would be difficult to use but I can't 
remember the details.

Regards,

Anthony Liguori


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 16:28           ` Anthony Liguori
@ 2011-01-26 16:38             ` Avi Kivity
  2011-01-26 17:12               ` Anthony Liguori
  0 siblings, 1 reply; 46+ messages in thread
From: Avi Kivity @ 2011-01-26 16:38 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 01/26/2011 06:28 PM, Anthony Liguori wrote:
> On 01/26/2011 10:13 AM, Avi Kivity wrote:
>> Serializing against a global mutex has the advantage that it can be 
>> treated as a global lock that is decomposed into fine-grained locks.
>>
>> For example, we can start the code conversion from an explicit async 
>> model to a threaded sync model by converting the mutex into a 
>> shared/exclusive lock.  Operations like read and write take the lock 
>> for shared access (and take a fine-grained mutex on the metadata 
>> cache entry), while operations like creating a snapshot take the lock 
>> for exclusive access.  That doesn't work with freeze/thaw.
>
> The trouble with this is that you increase the amount of re-entrance 
> whereas freeze/thaw doesn't.
>
> The code from the beginning of the request to where the mutex is 
> acquired will be executed for every single request even while requests 
> are blocked at the mutex acquisition.

It's just a few instructions.

>
> With freeze/thaw, you freeze the queue and prevent any request from 
> starting until you thaw.  You only thaw and return control to allow 
> another request to execute when you begin executing an asynchronous 
> I/O callback.
>

What do you actually save?  The longjmp() to the coroutine code, linking 
into the mutex wait queue, and another switch back to the main 
coroutine?  Given that we don't expect to block often, it hardly seems a 
cost worth optimizing.

> I think my previous example was wrong, you really want to do:
>
> qcow2_aio_writev() {
>    coroutine {
>        freeze();
>        sync_io(); // existing qcow2 code
>        thaw();
>        // existing non I/O code
>        bdrv_aio_writev(callback); // no explicit freeze/thaw needed
>    }
> }
>
> This is equivalent to our existing code because no new re-entrance is 
> introduced.  The only re-entrancy points are in the 
> bdrv_aio_{readv,writev} calls.

This requires you to know which code is sync, and which code is async.  
My conversion allows you to wrap the code blindly with a mutex, and have 
it do the right thing automatically.  This is most useful where the code 
can be sync or async depending on data (which is the case for qcow2).

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 16:38             ` Avi Kivity
@ 2011-01-26 17:12               ` Anthony Liguori
  2011-01-27  9:25                 ` Avi Kivity
  2011-01-27  9:27                 ` Kevin Wolf
  0 siblings, 2 replies; 46+ messages in thread
From: Anthony Liguori @ 2011-01-26 17:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 01/26/2011 10:38 AM, Avi Kivity wrote:
> On 01/26/2011 06:28 PM, Anthony Liguori wrote:
>> On 01/26/2011 10:13 AM, Avi Kivity wrote:
>>> Serializing against a global mutex has the advantage that it can be 
>>> treated as a global lock that is decomposed into fine-grained locks.
>>>
>>> For example, we can start the code conversion from an explicit async 
>>> model to a threaded sync model by converting the mutex into a 
>>> shared/exclusive lock.  Operations like read and write take the lock 
>>> for shared access (and take a fine-grained mutex on the metadata 
>>> cache entry), while operations like creating a snapshot take the lock 
>>> for exclusive access.  That doesn't work with freeze/thaw.
>>
>> The trouble with this is that you increase the amount of re-entrance 
>> whereas freeze/thaw doesn't.
>>
>> The code from the beginning of the request to where the mutex is 
>> acquired will be executed for every single request even while 
>> requests are blocked at the mutex acquisition.
>
> It's just a few instructions.
>
>>
>> With freeze/thaw, you freeze the queue and prevent any request from 
>> starting until you thaw.  You only thaw and return control to allow 
>> another request to execute when you begin executing an asynchronous 
>> I/O callback.
>>
>
> What do you actually save?  The longjmp() to the coroutine code, 
> linking in to the mutex wait queue, and another switch back to the 
> main coroutine?  Given that we don't expect to block often, it seems 
> hardly a cost worth optimizing.

It's a matter of correctness, not optimization.

Consider the following example:

coroutine {
    l2 = find_l2(offset);
    // allocates but does not increment max cluster offset
    l2[l2_offset(offset)] = alloc_cluster();

    co_mutex_lock(&lock);
    write_l2(l2);
    co_mutex_unlock(&lock);

    l1[l1_offset(offset)] = l2;

    co_mutex_lock(&lock);
    write_l1(l1);
    co_mutex_unlock(&lock);

    commit_cluster(l2[l2_offset(offset)]);
}

This code is incorrect.  The code before the first co_mutex_lock() may 
be called a dozen times before anyone finishes this routine.  
That means the same cluster is reused multiple times.

This code was correct before we introduced coroutines because we 
effectively had one big lock around the whole thing.

Regards,

Anthony Liguori

>
>> I think my previous example was wrong, you really want to do:
>>
>> qcow2_aio_writev() {
>>    coroutine {
>>        freeze();
>>        sync_io(); // existing qcow2 code
>>        thaw();
>>        // existing non I/O code
>>        bdrv_aio_writev(callback); // no explicit freeze/thaw needed
>>    }
>> }
>>
>> This is equivalent to our existing code because no new re-entrance is 
>> introduced.  The only re-entrancy points are in the 
>> bdrv_aio_{readv,writev} calls.
>
> This requires you to know which code is sync, and which code is 
> async.  My conversion allows you to wrap the code blindly with a 
> mutex, and have it do the right thing automatically.  This is most 
> useful where the code can be sync or async depending on data (which is 
> the case for qcow2).
>


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 17:12               ` Anthony Liguori
@ 2011-01-27  9:25                 ` Avi Kivity
  2011-01-27  9:27                 ` Kevin Wolf
  1 sibling, 0 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-27  9:25 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel

On 01/26/2011 07:12 PM, Anthony Liguori wrote:
>> What do you actually save?  The longjmp() to the coroutine code, 
>> linking in to the mutex wait queue, and another switch back to the 
>> main coroutine?  Given that we don't expect to block often, it seems 
>> hardly a cost worth optimizing.
>
>
> It's a matter of correctness, not optimization.
>
> Consider the following example:
>
> coroutine {
>    l2 = find_l2(offset);
>    // allocates but does not increment max cluster offset
>    l2[l2_offset(offset)] = alloc_cluster();
>
>    co_mutex_lock(&lock);
>    write_l2(l2);
>    co_mutex_unlock(&lock);
>
>    l1[l1_offset(offset)] = l2;
>
>    co_mutex_lock(&lock);
>    write_l1(l1);
>    co_mutex_unlock(&lock);
>
>    commit_cluster(l2[l2_offset(offset)]);
> }
>
> This code is incorrect.


Yes it's incorrect, but it isn't what I'm proposing.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 17:12               ` Anthony Liguori
  2011-01-27  9:25                 ` Avi Kivity
@ 2011-01-27  9:27                 ` Kevin Wolf
  2011-01-27  9:49                   ` Avi Kivity
  1 sibling, 1 reply; 46+ messages in thread
From: Kevin Wolf @ 2011-01-27  9:27 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Avi Kivity, Stefan Hajnoczi

Am 26.01.2011 18:12, schrieb Anthony Liguori:
> On 01/26/2011 10:38 AM, Avi Kivity wrote:
>> On 01/26/2011 06:28 PM, Anthony Liguori wrote:
>>> On 01/26/2011 10:13 AM, Avi Kivity wrote:
>>>> Serializing against a global mutex has the advantage that it can be 
>>>> treated as a global lock that is decomposed into fine-grained locks.
>>>>
>>>> For example, we can start the code conversion from an explicit async 
>>>> model to a threaded sync model by converting the mutex into a 
>>>> shared/exclusive lock.  Operations like read and write take the lock 
>>>> for shared access (and take a fine-grained mutex on the metadata 
>>>> cache entry), while operations like creating a snapshot take the lock 
>>>> for exclusive access.  That doesn't work with freeze/thaw.
>>>
>>> The trouble with this is that you increase the amount of re-entrance 
>>> whereas freeze/thaw doesn't.
>>>
>>> The code from the beginning of the request to where the mutex is 
>>> acquired will be executed for every single request even while 
>>> requests are blocked at the mutex acquisition.
>>
>> It's just a few instructions.
>>
>>>
>>> With freeze/thaw, you freeze the queue and prevent any request from 
>>> starting until you thaw.  You only thaw and return control to allow 
>>> another request to execute when you begin executing an asynchronous 
>>> I/O callback.
>>>
>>
>> What do you actually save?  The longjmp() to the coroutine code, 
>> linking in to the mutex wait queue, and another switch back to the 
>> main coroutine?  Given that we don't expect to block often, it seems 
>> hardly a cost worth optimizing.

And if you cared, you could probably optimize it in the mutex
implementation.

> It's a matter of correctness, not optimization.
> 
> Consider the following example:
> 
> coroutine {
>     l2 = find_l2(offset);
>     // allocates but does not increment max cluster offset
>     l2[l2_offset(offset)] = alloc_cluster();
> 
>     co_mutex_lock(&lock);
>     write_l2(l2);
>     co_mutex_unlock(&lock);
> 
>     l1[l1_offset(offset)] = l2;
> 
>     co_mutex_lock(&lock);
>     write_l1(l1);
>     co_mutex_unlock(&lock);
> 
>     commit_cluster(l2[l2_offset(offset)]);
> }
> 
> This code is incorrect.  The code before the first co_mutex_lock() may 
> be called a dozen times before anyone finishes this routine.  
> That means the same cluster is reused multiple times.

But this is not a new problem. In qcow2 we have this today:

cluster_offset = alloc_cluster();
bdrv_aio_write()
...
update_l2()

There is already code to handle this case, even though it could probably
be simplified for coroutines. I don't think we want to make the code
worse during the conversion by basically serializing everything.

> This code was correct before we introduced coroutines because we 
> effectively had one big lock around the whole thing.

So what's the lesson to learn? You need to take care when you convert a
big lock into smaller ones?

>>> I think my previous example was wrong, you really want to do:
>>>
>>> qcow2_aio_writev() {
>>>    coroutine {
>>>        freeze();
>>>        sync_io(); // existing qcow2 code
>>>        thaw();
>>>        // existing non I/O code
>>>        bdrv_aio_writev(callback); // no explicit freeze/thaw needed
>>>    }
>>> }
>>>
>>> This is equivalent to our existing code because no new re-entrance is 
>>> introduced.  The only re-entrancy points are in the 
>>> bdrv_aio_{readv,writev} calls.
>>
>> This requires you to know which code is sync, and which code is 
>> async.  My conversion allows you to wrap the code blindly with a 
>> mutex, and have it do the right thing automatically.  This is most 
>> useful where the code can be sync or async depending on data (which is 
>> the case for qcow2).

Well, but in the case of qcow2, you don't want to have a big mutex
around everything. We know perfectly well which parts are asynchronous and
which are synchronous, so we'd want to make it finer-grained from the
beginning.

Kevin


* Re: [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines
  2011-01-26 15:29   ` Avi Kivity
  2011-01-26 16:00     ` Anthony Liguori
@ 2011-01-27  9:40     ` Stefan Hajnoczi
  1 sibling, 0 replies; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-27  9:40 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On Wed, Jan 26, 2011 at 05:29:23PM +0200, Avi Kivity wrote:
> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
> >Add functions to create coroutines and transfer control into a coroutine
> >and back out again.
> >
> >
> >+
> >+struct Coroutine {
> >+    struct coroutine co;
> >+};
> >+
> >
> >+/**
> >+ * Coroutine entry point
> >+ *
> >+ * When the coroutine is entered for the first time, opaque is passed in as an
> >+ * argument.
> >+ *
> >+ * When this function returns, the coroutine is destroyed automatically and the
> >+ * return value is passed back to the caller who last entered the coroutine.
> >+ */
> >+typedef void * coroutine_fn CoroutineEntry(void *opaque);
> 
> The more modern style is to use the Coroutine structure as argument,
> and let the coroutine function use container_of() to obtain access
> to its own data structures.  Similarly it can store any return value
> there, avoiding casts to and from void pointers.

Yes, container_of() would be nice but we need to be careful to support
pooling Coroutine structs.  Or maybe just pool the mmapped stacks.

Stefan


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-27  9:27                 ` Kevin Wolf
@ 2011-01-27  9:49                   ` Avi Kivity
  2011-01-27 10:34                     ` Kevin Wolf
  0 siblings, 1 reply; 46+ messages in thread
From: Avi Kivity @ 2011-01-27  9:49 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

On 01/27/2011 11:27 AM, Kevin Wolf wrote:
> Well, but in the case of qcow2, you don't want to have a big mutex
> around everything. We perfectly know which parts are asynchronous and
> which are synchronous, so we'd want to do it finer grained from the
> beginning.

Yes we do.  And the way I proposed it, the new mutex does not introduce 
any new serialization.

To repeat, for every qcow2 callback or completion X (not a qcow2 read or 
write operation), we transform it in the following manner:

1. Rename X into do_X.  If X is called directly from within qcow2, 
change the call to do_X.  Create a new X which simply calls do_X().  
This change is purely textual and doesn't affect runtime at all.

2. Change X to

X()
{
    create a coroutine
    pack arguments to X in a structure
    schedule the coroutine to execute a new function call_X with the 
structure as argument
    wait for the coroutine to complete
    unpack the result (if any)
    dispose of the coroutine
    return the result
}

call_X()
{
    unpack arguments
    co_mutex_lock(&bs->mutex)
    result = do_X()
    co_mutex_unlock(&bs->mutex)
    pack result
}

(in the case of aio_X callbacks, we return a fake AIOCB:

aio_X()
{
       allocate AIOCB
       allocate coroutine
       pack arguments to X, and fake AIOCB in a structure
       schedule the coroutine to execute call_X with the structure as 
argument
       return fake AIOCB
}

call_aio_X()
{
       unpack arguments
       co_mutex_lock(&bs->mutex)
       do_X()
       co_mutex_unlock(&bs->mutex)
}

and change the completion to complete the fake AIOCB
)

The result of this transformation is:

- if X1 executed without blocking, it still executes without blocking, 
except for the case below
- if X2 blocked, it will serialize any X1 which doesn't, but it will no 
longer block the vcpu thread

We could do this transformation in the block layer, as it doesn't 
involve qcow2 at all.  I don't think that's a good idea, though, as 
qcow2 would be the only beneficiary and it would be harder to further 
evolve qcow2.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-26 15:40   ` Avi Kivity
  2011-01-26 15:50     ` Kevin Wolf
@ 2011-01-27 10:09     ` Stefan Hajnoczi
  2011-01-27 10:46       ` Avi Kivity
  1 sibling, 1 reply; 46+ messages in thread
From: Stefan Hajnoczi @ 2011-01-27 10:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On Wed, Jan 26, 2011 at 05:40:27PM +0200, Avi Kivity wrote:
> On 01/22/2011 11:29 AM, Stefan Hajnoczi wrote:
> >Converting qcow2 to use coroutines is fairly simple since most of qcow2
> >is synchronous.  The synchronous I/O functions like bdrv_pread() now
> >transparently work when called from a coroutine, so all the synchronous
> >code just works.
> >
> >The explicitly asynchronous code is adjusted to repeatedly call
> >qcow2_aio_read_cb() or qcow2_aio_write_cb() until the request completes.
> >At that point the coroutine will return from its entry function and its
> >resources are freed.
> >
> >The bdrv_aio_readv() and bdrv_aio_writev() user callback is now invoked
> >from a BH.  This is necessary since the user callback code does not
> >expect to be executed from a coroutine.
> >
> >This conversion is not completely correct because the safety of the
> >synchronous code does not carry over to the coroutine version.
> >Previously, a synchronous code path could assume that it will never be
> >interleaved with another request executing.  This is no longer true
> >because bdrv_pread() and bdrv_pwrite() cause the coroutine to yield and
> >other requests can be processed during that time.
> >
> >The solution is to carefully introduce checks so that pending requests
> >do not step on each other's toes.  That is left for a future patch...
> >
> 
> 
> The way I thought of doing this is:
> 
> qcow_aio_write(...)
> {
>     execute_in_coroutine {
>         co_mutex_lock(&bs->mutex);
>         do_qcow_aio_write(...); // original qcow code
>         co_mutex_release(&bs->mutex);
>     }
> }

I think this approach makes sense as the next step towards a
fine-grained strategy.

Remember that qcow2 is not just:
qcow2_aio_write(...)
{
    sync_io(...);
    aio_write(...);
    complete_request(...); /* in callback */
}

It is actually:
qcow2_aio_write(...)
{
    sync_io(...);
    aio_write(...);
    more_sync_io(...); /* in callback */
    complete_request(...);
}

We need to make sure the synchronous I/O after aio write completion is
also guarded.

Stefan


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-27  9:49                   ` Avi Kivity
@ 2011-01-27 10:34                     ` Kevin Wolf
  2011-01-27 10:41                       ` Avi Kivity
  0 siblings, 1 reply; 46+ messages in thread
From: Kevin Wolf @ 2011-01-27 10:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stefan Hajnoczi, qemu-devel

Am 27.01.2011 10:49, schrieb Avi Kivity:
> On 01/27/2011 11:27 AM, Kevin Wolf wrote:
>> Well, but in the case of qcow2, you don't want to have a big mutex
>> around everything. We perfectly know which parts are asynchronous and
>> which are synchronous, so we'd want to do it finer grained from the
>> beginning.
> 
> Yes we do.  And the way I proposed it, the new mutex does not introduce 
> any new serialization.
> 
> To repeat, for every qcow2 callback or completion X (not qcow2 read or 
> write operation), we transform it in the following manner:
> [...]

This works fine for code that is completely synchronous today (and you
can't serialize it more than it already is anyway).

It doesn't work for qemu_aio_readv/writev because these use AIO for
reading/writing the data. So you definitely need to rewrite that part,
or the AIO callback will cause the code to run outside its coroutine.
And during this rewrite you'll want to pay attention that you don't hold
the mutex for the bdrv_co_readv that was an AIO request before, or
you'll introduce additional serialization.

Kevin


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-27 10:34                     ` Kevin Wolf
@ 2011-01-27 10:41                       ` Avi Kivity
  2011-01-27 11:27                         ` Kevin Wolf
  0 siblings, 1 reply; 46+ messages in thread
From: Avi Kivity @ 2011-01-27 10:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

On 01/27/2011 12:34 PM, Kevin Wolf wrote:
> Am 27.01.2011 10:49, schrieb Avi Kivity:
> >  On 01/27/2011 11:27 AM, Kevin Wolf wrote:
> >>  Well, but in the case of qcow2, you don't want to have a big mutex
> >>  around everything. We perfectly know which parts are asynchronous and
> >>  which are synchronous, so we'd want to do it finer grained from the
> >>  beginning.
> >
> >  Yes we do.  And the way I proposed it, the new mutex does not introduce
> >  any new serialization.
> >
> >  To repeat, for every qcow2 callback or completion X (not qcow2 read or
> >  write operation), we transform it in the following manner:
> >  [...]
>
> This works fine for code that is completely synchronous today (and you
> can't serialize it more than it already is anyway).
>
> It doesn't work for qemu_aio_readv/writev because these use AIO for
> reading/writing the data. So you definitely need to rewrite that part,
> or the AIO callback will cause the code to run outside its coroutine.

The callbacks need to be wrapped in the same way.  Schedule a coroutine 
to run the true callback.

> And during this rewrite you'll want to pay attention that you don't hold
> the mutex for the bdrv_co_readv that was an AIO request before, or
> you'll introduce additional serialization.

I don't follow.  Please elaborate.

The idea behind this is, we keep exactly the same serialization 
characteristics we had before.  If an aio_X callback or its intermediate 
completion executed without blocking, it will continue to do so.  
Holding a co_mutex over a non-blocking code sequence does not serialize, 
except if a blocking sequence is executing concurrently (in which case 
the serialization was present before).

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-27 10:09     ` Stefan Hajnoczi
@ 2011-01-27 10:46       ` Avi Kivity
  0 siblings, 0 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-27 10:46 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, qemu-devel

On 01/27/2011 12:09 PM, Stefan Hajnoczi wrote:
> >
> >
> >  The way I thought of doing this is:
> >
> >  qcow_aio_write(...)
> >  {
> >      execute_in_coroutine {
> >          co_mutex_lock(&bs->mutex);
> >          do_qcow_aio_write(...); // original qcow code
> >          co_mutex_release(&bs->mutex);
> >      }
> >  }
>
> I think this approach makes sense as the next step towards a
> fine-grained strategy.
>
> Remember that qcow2 is not just:
> qcow2_aio_write(...)
> {
>      sync_io(...);
>      aio_write(...);
>      complete_request(...); /* in callback */
> }
>
> It is actually:
> qcow2_aio_write(...)
> {
>      sync_io(...);
>      aio_write(...);
>      more_sync_io(...); /* in callback */
>      complete_request(...);
> }
>
> We need to make sure the synchronous I/O after aio write completion is
> also guarded.

Correct.  Every entry point into qcow2 needs to be wrapped.

Maybe the best way forward is to implement it in the block layer.  While 
that makes additional hacking harder, it emphasises the fact that qcow2 
logic need not be changed in any way in order to remove vcpu blocking 
without compromising concurrency.

The block layer could examine BlockDriver::not_really_async, and if set, 
instead of calling BlockDriver::bdrv_aio_writev, it can call a wrapper 
which schedules a coroutine and returns a fake AIOCB.  The wrapper would 
also wrap the completion with a wrapper, so that it too would execute 
under a coroutine.  The mutex would be managed by the wrapper, and qcow2 
would be completely unchanged except for setting not_really_async.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-27 10:41                       ` Avi Kivity
@ 2011-01-27 11:27                         ` Kevin Wolf
  2011-01-27 12:21                           ` Avi Kivity
  0 siblings, 1 reply; 46+ messages in thread
From: Kevin Wolf @ 2011-01-27 11:27 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Stefan Hajnoczi, qemu-devel

Am 27.01.2011 11:41, schrieb Avi Kivity:
> On 01/27/2011 12:34 PM, Kevin Wolf wrote:
>> Am 27.01.2011 10:49, schrieb Avi Kivity:
>>>  On 01/27/2011 11:27 AM, Kevin Wolf wrote:
>>>>  Well, but in the case of qcow2, you don't want to have a big mutex
>>>>  around everything. We perfectly know which parts are asynchronous and
>>>>  which are synchronous, so we'd want to do it finer grained from the
>>>>  beginning.
>>>
>>>  Yes we do.  And the way I proposed it, the new mutex does not introduce
>>>  any new serialization.
>>>
>>>  To repeat, for every qcow2 callback or completion X (not qcow2 read or
>>>  write operation), we transform it in the following manner:
>>>  [...]
>>
>> This works fine for code that is completely synchronous today (and you
>> can't serialize it more than it already is anyway).
>>
>> It doesn't work for qemu_aio_readv/writev because these use AIO for
>> reading/writing the data. So you definitely need to rewrite that part,
>> or the AIO callback will cause the code to run outside its coroutine.
> 
> The callbacks need to be wrapped in the same way.  Schedule a coroutine 
> to run the true callback.

Okay, I see what you're proposing. You could schedule a new coroutine
for callbacks indeed.

But I think it's actually easier to convert the bdrv_aio_readv into a
bdrv_co_readv (and by that removing the callback) and just make sure
that you don't hold the mutex during this call - basically what Stefan's
code does, just with mutexes instead of a request queue.

>> And during this rewrite you'll want to pay attention that you don't hold
>> the mutex for the bdrv_co_readv that was an AIO request before, or
>> you'll introduce additional serialization.
> 
> I don't follow.  Please elaborate.

We were thinking of different approaches. I hope it's clearer now.

Kevin


* Re: [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-27 11:27                         ` Kevin Wolf
@ 2011-01-27 12:21                           ` Avi Kivity
  0 siblings, 0 replies; 46+ messages in thread
From: Avi Kivity @ 2011-01-27 12:21 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel

On 01/27/2011 01:27 PM, Kevin Wolf wrote:
> Am 27.01.2011 11:41, schrieb Avi Kivity:
> >  On 01/27/2011 12:34 PM, Kevin Wolf wrote:
> >>  Am 27.01.2011 10:49, schrieb Avi Kivity:
> >>>   On 01/27/2011 11:27 AM, Kevin Wolf wrote:
> >>>>   Well, but in the case of qcow2, you don't want to have a big mutex
> >>>>   around everything. We perfectly know which parts are asynchronous and
> >>>>   which are synchronous, so we'd want to do it finer grained from the
> >>>>   beginning.
> >>>
> >>>   Yes we do.  And the way I proposed it, the new mutex does not introduce
> >>>   any new serialization.
> >>>
> >>>   To repeat, for every qcow2 callback or completion X (not qcow2 read or
> >>>   write operation), we transform it in the following manner:
> >>>   [...]
> >>
> >>  This works fine for code that is completely synchronous today (and you
> >>  can't serialize it more than it already is anyway).
> >>
> >>  It doesn't work for qemu_aio_readv/writev because these use AIO for
> >>  reading/writing the data. So you definitely need to rewrite that part,
> >>  or the AIO callback will cause the code to run outside its coroutine.
> >
> >  The callbacks need to be wrapped in the same way.  Schedule a coroutine
> >  to run the true callback.
>
> Okay, I see what you're proposing. You could schedule a new coroutine
> for callbacks indeed.
>
> But I think it's actually easier to convert the bdrv_aio_readv into a
> bdrv_co_readv (and by that removing the callback) and just make sure
> that you don't hold the mutex during this call - basically what Stefan's
> code does, just with mutexes instead of a request queue.

My approach was for someone who isn't too familiar with qcow2 - a 
mindless conversion.  A more integrated approach is better, since it 
will lead to tighter code, and if Stefan or you are able to do it 
without impacting concurrency, I'm all for it.

> >>  And during this rewrite you'll want to pay attention that you don't hold
> >>  the mutex for the bdrv_co_readv that was an AIO request before, or
> >>  you'll introduce additional serialization.
> >
> >  I don't follow.  Please elaborate.
>
> We were thinking of different approaches. I hope it's clearer now.

I think so.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O
  2011-01-23 23:31 ` [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Anthony Liguori
@ 2011-02-01 13:23   ` Kevin Wolf
  0 siblings, 0 replies; 46+ messages in thread
From: Kevin Wolf @ 2011-02-01 13:23 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Stefan Hajnoczi, qemu-devel

Am 24.01.2011 00:31, schrieb Anthony Liguori:
> On 01/22/2011 03:29 AM, Stefan Hajnoczi wrote:
>> This patch series prototypes making QCOW2 fully asynchronous to eliminate the
>> timing jitter and poor performance that has been observed.  QCOW2 has
>> asynchronous I/O code paths for some of the read/write common cases but
>> metadata access is always synchronous.
>>
>> One solution is to rewrite QCOW2 to be fully asynchronous by splitting all
>> functions that perform blocking I/O into a series of callbacks.  Due to the
>> complexity of QCOW2, this conversion and the maintenance prospects are
>> unattractive.
>>
>> This patch series prototypes an alternative solution to make QCOW2
>> asynchronous.  It introduces coroutines, cooperative userspace threads of
>> control, so that each QCOW2 request has its own call stack.  To perform I/O,
>> the coroutine submits an asynchronous I/O request and then yields back to QEMU.
>> The coroutine stays suspended while the I/O operation is being processed by
>> lower layers of the stack.  When the asynchronous I/O completes, the coroutine
>> is resumed.
>>
>> The upshot of this is that QCOW2 can be implemented in a sequential fashion
>> without explicit callbacks but all I/O actually happens asynchronously under
>> the covers.
>>
>> This prototype implements reads, writes, and flushes.  It should install or boot
>> VMs successfully.  However, it has the following limitations:
>>
>> 1. QCOW2 requests are serialized because the code is not yet safe for
>>     concurrent requests.  See the last patch for details.
>>
>> 2. Coroutines are unoptimized.  We should pool coroutines (and their mmapped
>>     stacks) to avoid the cost of coroutine creation.
>>
>> 3. The qcow2_aio_read_cb() and qcow2_aio_write_cb() functions should be
>>     refactored into sequential code now that callbacks are no longer needed.
>>
>> I think this approach can solve the performance and functional problems of the
>> current QCOW2 implementation.  It does not require invasive changes, much of
>> QCOW2 works unmodified.
>>
>> Kevin: Do you like this approach and do you want to develop it further?
>>    
> 
> Couple of additional comments:
> 
> 1) The coroutine code is copied in a number of projects.  I plan on 
> splitting it into a shared library.

When are you going to do so? Should we wait with using coroutines in
qemu until then or just start with maintaining a copy and remove it later?

I think we'll have to carry a copy of the library in the qemu source
anyway for a while if we depend on it and distributions don't have a
package for the library yet.

Kevin


end of thread, other threads:[~2011-02-01 13:22 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
2011-01-22  9:29 [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 01/12] coroutine: Add gtk-vnc coroutines library Stefan Hajnoczi
2011-01-26 15:25   ` Avi Kivity
2011-01-26 16:00     ` Anthony Liguori
2011-01-26 16:13       ` Avi Kivity
2011-01-26 16:19         ` Anthony Liguori
2011-01-26 16:22           ` Avi Kivity
2011-01-26 16:29             ` Anthony Liguori
2011-01-26 16:21         ` Anthony Liguori
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 02/12] continuation: Fix container_of() redefinition Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 03/12] Make sure to release allocated stack when coroutine is released Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 04/12] coroutine: Use thread-local leader and current variables Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 05/12] coroutine: Add coroutines Stefan Hajnoczi
2011-01-26 15:29   ` Avi Kivity
2011-01-26 16:00     ` Anthony Liguori
2011-01-27  9:40     ` Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 06/12] coroutine: Add qemu_coroutine_self() Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 07/12] coroutine: Add coroutine_is_leader() Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 08/12] coroutine: Add qemu_in_coroutine() Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 09/12] block: Add bdrv_co_readv() and bdrv_co_writev() Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 10/12] block: Add coroutine support to synchronous I/O functions Stefan Hajnoczi
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 11/12] qcow2: Convert qcow2 to use coroutines for async I/O Stefan Hajnoczi
2011-01-23 23:40   ` Anthony Liguori
2011-01-24 11:09     ` Stefan Hajnoczi
2011-01-26 15:40   ` Avi Kivity
2011-01-26 15:50     ` Kevin Wolf
2011-01-26 16:08       ` Anthony Liguori
2011-01-26 16:13         ` Avi Kivity
2011-01-26 16:28           ` Anthony Liguori
2011-01-26 16:38             ` Avi Kivity
2011-01-26 17:12               ` Anthony Liguori
2011-01-27  9:25                 ` Avi Kivity
2011-01-27  9:27                 ` Kevin Wolf
2011-01-27  9:49                   ` Avi Kivity
2011-01-27 10:34                     ` Kevin Wolf
2011-01-27 10:41                       ` Avi Kivity
2011-01-27 11:27                         ` Kevin Wolf
2011-01-27 12:21                           ` Avi Kivity
2011-01-26 16:08       ` Avi Kivity
2011-01-27 10:09     ` Stefan Hajnoczi
2011-01-27 10:46       ` Avi Kivity
2011-01-22  9:29 ` [Qemu-devel] [RFC][PATCH 12/12] qcow2: Serialize all requests Stefan Hajnoczi
2011-01-23 23:31 ` [Qemu-devel] [RFC][PATCH 00/12] qcow2: Convert qcow2 to use coroutines for async I/O Anthony Liguori
2011-02-01 13:23   ` Kevin Wolf
2011-01-24 11:58 ` [Qemu-devel] " Kevin Wolf
2011-01-24 13:10   ` Stefan Hajnoczi
