* [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
@ 2020-05-12 14:43 Kevin Wolf
  2020-05-12 14:43 ` [RFC PATCH 1/3] block: Factor out bdrv_run_co() Kevin Wolf
                   ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Kevin Wolf @ 2020-05-12 14:43 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, qemu-devel, s.reiter, armbru, mreitz, stefanha, t.lamprecht

Stefan (Reiter), after looking a bit closer at this, I think there is no
bug in QEMU, but the bug is in your coroutine code that calls block
layer functions without moving into the right AioContext first. I've
written this series anyway as it potentially makes the life of callers
easier and would probably make your buggy code correct.

However, it doesn't feel right to commit something like patch 2 without
having a user for it. Is there a reason why you can't upstream your
async snapshot code?

The series would also happen to fix a bug in my recent patch to convert
qmp_block_resize() to coroutines, but I feel it's not how I would
naturally fix it. Switching to the right thread already in the QMP handler,
before calling bdrv_truncate(), would feel more appropriate. I wonder if
the same wouldn't actually be true for your snapshot code.
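
To illustrate, a rough sketch of that approach (hypothetical helper, not
code from this series; error handling and the actual qmp_block_resize()
plumbing are omitted):

#include "qemu/osdep.h"
#include "block/aio.h"
#include "block/block.h"
#include "qemu/coroutine.h"

/* Hop into the node's home AioContext before the synchronous call, then
 * hop back. All names here are illustrative only. */
static int coroutine_fn example_truncate_in_node_context(BdrvChild *child,
                                                         int64_t offset,
                                                         Error **errp)
{
    Coroutine *self = qemu_coroutine_self();
    AioContext *old_ctx = qemu_coroutine_get_aio_context(self);
    AioContext *bs_ctx = bdrv_get_aio_context(child->bs);
    int ret;

    if (bs_ctx != old_ctx) {
        /* Move this coroutine into the home context of the node */
        aio_co_schedule(bs_ctx, self);
        qemu_coroutine_yield();
    }

    ret = bdrv_truncate(child, offset, true, PREALLOC_MODE_OFF, 0, errp);

    if (bs_ctx != old_ctx) {
        /* Move back to the context the handler was running in before */
        aio_co_schedule(old_ctx, self);
        qemu_coroutine_yield();
    }

    return ret;
}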

Kevin Wolf (3):
  block: Factor out bdrv_run_co()
  block: Allow bdrv_run_co() from different AioContext
  block: Assert we're running in the right thread

 block/io.c | 122 ++++++++++++++++++++---------------------------------
 1 file changed, 45 insertions(+), 77 deletions(-)

-- 
2.25.3




* [RFC PATCH 1/3] block: Factor out bdrv_run_co()
  2020-05-12 14:43 [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Kevin Wolf
@ 2020-05-12 14:43 ` Kevin Wolf
  2020-05-12 15:37   ` Eric Blake
  2020-05-12 14:43 ` [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext Kevin Wolf
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-12 14:43 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, qemu-devel, s.reiter, armbru, mreitz, stefanha, t.lamprecht

We have a few bdrv_*() functions that can either spawn a new coroutine
and wait for it with BDRV_POLL_WHILE() or use a fastpath if they are
alreeady running in a coroutine. All of them duplicate basically the
same code.

Factor the common code into a new function bdrv_run_co().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 104 +++++++++++++++--------------------------------------
 1 file changed, 28 insertions(+), 76 deletions(-)

diff --git a/block/io.c b/block/io.c
index 7d30e61edc..c1badaadc9 100644
--- a/block/io.c
+++ b/block/io.c
@@ -891,6 +891,22 @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
     return 0;
 }
 
+static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
+                       void *opaque, int *ret)
+{
+    if (qemu_in_coroutine()) {
+        /* Fast-path if already in coroutine context */
+        entry(opaque);
+    } else {
+        Coroutine *co = qemu_coroutine_create(entry, opaque);
+        *ret = NOT_DONE;
+        bdrv_coroutine_enter(bs, co);
+        BDRV_POLL_WHILE(bs, *ret == NOT_DONE);
+    }
+
+    return *ret;
+}
+
 typedef struct RwCo {
     BdrvChild *child;
     int64_t offset;
@@ -923,25 +939,15 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
                         QEMUIOVector *qiov, bool is_write,
                         BdrvRequestFlags flags)
 {
-    Coroutine *co;
     RwCo rwco = {
         .child = child,
         .offset = offset,
         .qiov = qiov,
         .is_write = is_write,
-        .ret = NOT_DONE,
         .flags = flags,
     };
 
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_rw_co_entry(&rwco);
-    } else {
-        co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
-        bdrv_coroutine_enter(child->bs, co);
-        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
-    }
-    return rwco.ret;
+    return bdrv_run_co(child->bs, bdrv_rw_co_entry, &rwco, &rwco.ret);
 }
 
 int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
@@ -2230,7 +2236,6 @@ typedef struct BdrvCoBlockStatusData {
     int64_t *map;
     BlockDriverState **file;
     int ret;
-    bool done;
 } BdrvCoBlockStatusData;
 
 int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
@@ -2492,7 +2497,6 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
                                            data->want_zero,
                                            data->offset, data->bytes,
                                            data->pnum, data->map, data->file);
-    data->done = true;
     aio_wait_kick();
 }
 
@@ -2508,7 +2512,6 @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
                                           int64_t *map,
                                           BlockDriverState **file)
 {
-    Coroutine *co;
     BdrvCoBlockStatusData data = {
         .bs = bs,
         .base = base,
@@ -2518,18 +2521,9 @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
         .pnum = pnum,
         .map = map,
         .file = file,
-        .done = false,
     };
 
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_block_status_above_co_entry(&data);
-    } else {
-        co = qemu_coroutine_create(bdrv_block_status_above_co_entry, &data);
-        bdrv_coroutine_enter(bs, co);
-        BDRV_POLL_WHILE(bs, !data.done);
-    }
-    return data.ret;
+    return bdrv_run_co(bs, bdrv_block_status_above_co_entry, &data, &data.ret);
 }
 
 int bdrv_block_status_above(BlockDriverState *bs, BlockDriverState *base,
@@ -2669,22 +2663,13 @@ static inline int
 bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                 bool is_read)
 {
-    if (qemu_in_coroutine()) {
-        return bdrv_co_rw_vmstate(bs, qiov, pos, is_read);
-    } else {
-        BdrvVmstateCo data = {
-            .bs         = bs,
-            .qiov       = qiov,
-            .pos        = pos,
-            .is_read    = is_read,
-            .ret        = -EINPROGRESS,
-        };
-        Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
-
-        bdrv_coroutine_enter(bs, co);
-        BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
-        return data.ret;
-    }
+    BdrvVmstateCo data = {
+        .bs         = bs,
+        .qiov       = qiov,
+        .pos        = pos,
+        .is_read    = is_read,
+    };
+    return bdrv_run_co(bs, bdrv_co_rw_vmstate_entry, &data, &data.ret);
 }
 
 int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
@@ -2890,22 +2875,11 @@ early_exit:
 
 int bdrv_flush(BlockDriverState *bs)
 {
-    Coroutine *co;
     FlushCo flush_co = {
         .bs = bs,
-        .ret = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_flush_co_entry(&flush_co);
-    } else {
-        co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
-        bdrv_coroutine_enter(bs, co);
-        BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
-    }
-
-    return flush_co.ret;
+    return bdrv_run_co(bs, bdrv_flush_co_entry, &flush_co, &flush_co.ret);
 }
 
 typedef struct DiscardCo {
@@ -3038,24 +3012,13 @@ out:
 
 int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
 {
-    Coroutine *co;
     DiscardCo rwco = {
         .child = child,
         .offset = offset,
         .bytes = bytes,
-        .ret = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_pdiscard_co_entry(&rwco);
-    } else {
-        co = qemu_coroutine_create(bdrv_pdiscard_co_entry, &rwco);
-        bdrv_coroutine_enter(child->bs, co);
-        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
-    }
-
-    return rwco.ret;
+    return bdrv_run_co(child->bs, bdrv_pdiscard_co_entry, &rwco, &rwco.ret);
 }
 
 int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf)
@@ -3477,7 +3440,6 @@ static void coroutine_fn bdrv_truncate_co_entry(void *opaque)
 int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
                   PreallocMode prealloc, BdrvRequestFlags flags, Error **errp)
 {
-    Coroutine *co;
     TruncateCo tco = {
         .child      = child,
         .offset     = offset,
@@ -3485,17 +3447,7 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
         .prealloc   = prealloc,
         .flags      = flags,
         .errp       = errp,
-        .ret        = NOT_DONE,
     };
 
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_truncate_co_entry(&tco);
-    } else {
-        co = qemu_coroutine_create(bdrv_truncate_co_entry, &tco);
-        bdrv_coroutine_enter(child->bs, co);
-        BDRV_POLL_WHILE(child->bs, tco.ret == NOT_DONE);
-    }
-
-    return tco.ret;
+    return bdrv_run_co(child->bs, bdrv_truncate_co_entry, &tco, &tco.ret);
 }
-- 
2.25.3




* [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-12 14:43 [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Kevin Wolf
  2020-05-12 14:43 ` [RFC PATCH 1/3] block: Factor out bdrv_run_co() Kevin Wolf
@ 2020-05-12 14:43 ` Kevin Wolf
  2020-05-12 16:02   ` Thomas Lamprecht
  2020-05-25 14:18   ` Stefan Reiter
  2020-05-12 14:43 ` [RFC PATCH 3/3] block: Assert we're running in the right thread Kevin Wolf
  2020-05-14 13:21 ` [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Thomas Lamprecht
  3 siblings, 2 replies; 26+ messages in thread
From: Kevin Wolf @ 2020-05-12 14:43 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, qemu-devel, s.reiter, armbru, mreitz, stefanha, t.lamprecht

Coroutine functions that are entered through bdrv_run_co() are already
safe to call from synchronous code in a different AioContext because
bdrv_coroutine_enter() will schedule them in the context of the node.

However, the coroutine fastpath still requires that we're already in the
right AioContext when called in coroutine context.

In order to make the behaviour more consistent and to make life a bit
easier for callers, let's check the AioContext and automatically move
the current coroutine around if we're not in the right context yet.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index c1badaadc9..7808e8bdc0 100644
--- a/block/io.c
+++ b/block/io.c
@@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
                        void *opaque, int *ret)
 {
     if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
+        Coroutine *self = qemu_coroutine_self();
+        AioContext *bs_ctx = bdrv_get_aio_context(bs);
+        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
+
+        if (bs_ctx != co_ctx) {
+            /* Move to the iothread of the node */
+            aio_co_schedule(bs_ctx, self);
+            qemu_coroutine_yield();
+        }
         entry(opaque);
+        if (bs_ctx != co_ctx) {
+            /* Move back to the original AioContext */
+            aio_co_schedule(bs_ctx, self);
+            qemu_coroutine_yield();
+        }
     } else {
         Coroutine *co = qemu_coroutine_create(entry, opaque);
         *ret = NOT_DONE;
-- 
2.25.3




* [RFC PATCH 3/3] block: Assert we're running in the right thread
  2020-05-12 14:43 [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Kevin Wolf
  2020-05-12 14:43 ` [RFC PATCH 1/3] block: Factor out bdrv_run_co() Kevin Wolf
  2020-05-12 14:43 ` [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext Kevin Wolf
@ 2020-05-12 14:43 ` Kevin Wolf
  2020-05-14 13:52   ` Stefan Reiter
  2020-05-14 13:21 ` [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Thomas Lamprecht
  3 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-12 14:43 UTC (permalink / raw)
  To: qemu-block
  Cc: kwolf, qemu-devel, s.reiter, armbru, mreitz, stefanha, t.lamprecht

tracked_request_begin() is called for most I/O operations, so it's a
good place to assert that we're indeed running in the home thread of the
node's AioContext.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 7808e8bdc0..924bf5ba46 100644
--- a/block/io.c
+++ b/block/io.c
@@ -695,14 +695,17 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
                                   uint64_t bytes,
                                   enum BdrvTrackedRequestType type)
 {
+    Coroutine *self = qemu_coroutine_self();
+
     assert(bytes <= INT64_MAX && offset <= INT64_MAX - bytes);
+    assert(bs->aio_context == qemu_coroutine_get_aio_context(self));
 
     *req = (BdrvTrackedRequest){
         .bs = bs,
         .offset         = offset,
         .bytes          = bytes,
         .type           = type,
-        .co             = qemu_coroutine_self(),
+        .co             = self,
         .serialising    = false,
         .overlap_offset = offset,
         .overlap_bytes  = bytes,
-- 
2.25.3




* Re: [RFC PATCH 1/3] block: Factor out bdrv_run_co()
  2020-05-12 14:43 ` [RFC PATCH 1/3] block: Factor out bdrv_run_co() Kevin Wolf
@ 2020-05-12 15:37   ` Eric Blake
  2020-05-20  9:09     ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 26+ messages in thread
From: Eric Blake @ 2020-05-12 15:37 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block
  Cc: Vladimir Sementsov-Ogievskiy, armbru, s.reiter, qemu-devel,
	mreitz, stefanha, t.lamprecht

On 5/12/20 9:43 AM, Kevin Wolf wrote:
> We have a few bdrv_*() functions that can either spawn a new coroutine
> and wait for it with BDRV_POLL_WHILE() or use a fastpath if they are
> alreeady running in a coroutine. All of them duplicate basically the

already

> same code.
> 
> Factor the common code into a new function bdrv_run_co().
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/io.c | 104 +++++++++++++++--------------------------------------
>   1 file changed, 28 insertions(+), 76 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 7d30e61edc..c1badaadc9 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -891,6 +891,22 @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
>       return 0;
>   }
>   
> +static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
> +                       void *opaque, int *ret)
> +{
> +    if (qemu_in_coroutine()) {
> +        /* Fast-path if already in coroutine context */
> +        entry(opaque);
> +    } else {
> +        Coroutine *co = qemu_coroutine_create(entry, opaque);
> +        *ret = NOT_DONE;
> +        bdrv_coroutine_enter(bs, co);
> +        BDRV_POLL_WHILE(bs, *ret == NOT_DONE);

For my reference, NOT_DONE is defined as INT_MAX, which does not seem to 
be used as a return value in other situations.
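
For reference, the definition near the top of block/io.c is, if I recall
correctly:

#define NOT_DONE 0x7fffffff /* used while emulated sync operation in progress */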

> @@ -923,25 +939,15 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
>                           QEMUIOVector *qiov, bool is_write,
>                           BdrvRequestFlags flags)
>   {
> -    Coroutine *co;
>       RwCo rwco = {
>           .child = child,
>           .offset = offset,
>           .qiov = qiov,
>           .is_write = is_write,
> -        .ret = NOT_DONE,
>           .flags = flags,
>       };
>   
> -    if (qemu_in_coroutine()) {
> -        /* Fast-path if already in coroutine context */
> -        bdrv_rw_co_entry(&rwco);
> -    } else {
> -        co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
> -        bdrv_coroutine_enter(child->bs, co);
> -        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
> -    }
> -    return rwco.ret;
> +    return bdrv_run_co(child->bs, bdrv_rw_co_entry, &rwco, &rwco.ret);

So code that previously looped on NOT_DONE is obviously safe, while...

>   }
>   
>   int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
> @@ -2230,7 +2236,6 @@ typedef struct BdrvCoBlockStatusData {
>       int64_t *map;
>       BlockDriverState **file;
>       int ret;
> -    bool done;
>   } BdrvCoBlockStatusData;
>   
>   int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
> @@ -2492,7 +2497,6 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
>                                              data->want_zero,
>                                              data->offset, data->bytes,
>                                              data->pnum, data->map, data->file);
> -    data->done = true;
>       aio_wait_kick();

...code that looped on something else now has to be checked that 
data->ret is still being set to something useful.  Fortunately that is 
true here.

> @@ -2669,22 +2663,13 @@ static inline int
>   bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>                   bool is_read)
>   {
> -    if (qemu_in_coroutine()) {
> -        return bdrv_co_rw_vmstate(bs, qiov, pos, is_read);
> -    } else {
> -        BdrvVmstateCo data = {
> -            .bs         = bs,
> -            .qiov       = qiov,
> -            .pos        = pos,
> -            .is_read    = is_read,
> -            .ret        = -EINPROGRESS,
> -        };
> -        Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
> -
> -        bdrv_coroutine_enter(bs, co);
> -        BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
> -        return data.ret;

It's a little harder to see here whether -EINPROGRESS might ever be
returned by a driver, but again this looks safe.

Reviewed-by: Eric Blake <eblake@redhat.com>

Conflicts with Vladimir's patches which try to add more coroutine 
wrappers (but those need a rebase anyway):
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg04559.html

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




* Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-12 14:43 ` [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext Kevin Wolf
@ 2020-05-12 16:02   ` Thomas Lamprecht
  2020-05-12 19:29     ` Kevin Wolf
  2020-05-25 14:18   ` Stefan Reiter
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Lamprecht @ 2020-05-12 16:02 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, s.reiter, armbru, stefanha, mreitz

On 5/12/20 4:43 PM, Kevin Wolf wrote:
> Coroutine functions that are entered through bdrv_run_co() are already
> safe to call from synchronous code in a different AioContext because
> bdrv_coroutine_enter() will schedule them in the context of the node.
> 
> However, the coroutine fastpath still requires that we're already in the
> right AioContext when called in coroutine context.
> 
> In order to make the behaviour more consistent and to make life a bit
> easier for callers, let's check the AioContext and automatically move
> the current coroutine around if we're not in the right context yet.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/io.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index c1badaadc9..7808e8bdc0 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
>                         void *opaque, int *ret)
>  {
>      if (qemu_in_coroutine()) {
> -        /* Fast-path if already in coroutine context */
> +        Coroutine *self = qemu_coroutine_self();
> +        AioContext *bs_ctx = bdrv_get_aio_context(bs);
> +        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
> +
> +        if (bs_ctx != co_ctx) {
> +            /* Move to the iothread of the node */
> +            aio_co_schedule(bs_ctx, self);
> +            qemu_coroutine_yield();
> +        }
>          entry(opaque);
> +        if (bs_ctx != co_ctx) {
> +            /* Move back to the original AioContext */
> +            aio_co_schedule(bs_ctx, self);

shouldn't it use co_ctx here, because otherwise it's just scheduled again
on the one from bs?

Looks OK to me besides that.

> +            qemu_coroutine_yield();
> +        }
>      } else {
>          Coroutine *co = qemu_coroutine_create(entry, opaque);
>          *ret = NOT_DONE;
> 





* Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-12 16:02   ` Thomas Lamprecht
@ 2020-05-12 19:29     ` Kevin Wolf
  0 siblings, 0 replies; 26+ messages in thread
From: Kevin Wolf @ 2020-05-12 19:29 UTC (permalink / raw)
  To: Thomas Lamprecht
  Cc: qemu-block, s.reiter, qemu-devel, armbru, stefanha, mreitz

On 12.05.2020 at 18:02, Thomas Lamprecht wrote:
> On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > Coroutine functions that are entered through bdrv_run_co() are already
> > safe to call from synchronous code in a different AioContext because
> > bdrv_coroutine_enter() will schedule them in the context of the node.
> > 
> > However, the coroutine fastpath still requires that we're already in the
> > right AioContext when called in coroutine context.
> > 
> > In order to make the behaviour more consistent and to make life a bit
> > easier for callers, let's check the AioContext and automatically move
> > the current coroutine around if we're not in the right context yet.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block/io.c | 15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/io.c b/block/io.c
> > index c1badaadc9..7808e8bdc0 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
> >                         void *opaque, int *ret)
> >  {
> >      if (qemu_in_coroutine()) {
> > -        /* Fast-path if already in coroutine context */
> > +        Coroutine *self = qemu_coroutine_self();
> > +        AioContext *bs_ctx = bdrv_get_aio_context(bs);
> > +        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
> > +
> > +        if (bs_ctx != co_ctx) {
> > +            /* Move to the iothread of the node */
> > +            aio_co_schedule(bs_ctx, self);
> > +            qemu_coroutine_yield();
> > +        }
> >          entry(opaque);
> > +        if (bs_ctx != co_ctx) {
> > +            /* Move back to the original AioContext */
> > +            aio_co_schedule(bs_ctx, self);
> 
> shouldn't it use co_ctx here, because otherwise it's just scheduled
> again on the one from bs?

Oops, you're right, of course.
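
For clarity, the corrected hunk would presumably schedule back on co_ctx
instead (sketch only, not the committed fix):

        if (bs_ctx != co_ctx) {
            /* Move back to the original AioContext */
            aio_co_schedule(co_ctx, self);
            qemu_coroutine_yield();
        }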

> Looks OK to me besides that.

Thanks!

Kevin




* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-12 14:43 [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Kevin Wolf
                   ` (2 preceding siblings ...)
  2020-05-12 14:43 ` [RFC PATCH 3/3] block: Assert we're running in the right thread Kevin Wolf
@ 2020-05-14 13:21 ` Thomas Lamprecht
  2020-05-14 14:26   ` Kevin Wolf
  3 siblings, 1 reply; 26+ messages in thread
From: Thomas Lamprecht @ 2020-05-14 13:21 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, s.reiter, armbru, stefanha, mreitz

On 5/12/20 4:43 PM, Kevin Wolf wrote:
> Stefan (Reiter), after looking a bit closer at this, I think there is no
> bug in QEMU, but the bug is in your coroutine code that calls block
> layer functions without moving into the right AioContext first. I've
> written this series anyway as it potentially makes the life of callers
> easier and would probably make your buggy code correct.

> However, it doesn't feel right to commit something like patch 2 without
> having a user for it. Is there a reason why you can't upstream your
> async snapshot code?

I understand what you mean, but it would make the interface IMO so much
easier to use; if one wants to schedule it explicitly beforehand, they can
still do so. But that would open the way for two styles of doing things,
and I'm not sure whether that would be seen as bad. The assert from patch
3/3 alone would already help a lot, though.

Regarding upstreaming, there was a historical attempt by Dietmar to
upstream it, but that was roughly 8 to 10 years ago. I'm not quite sure
why it didn't go through then; I'll see if I can get some time to search
the mailing list archive.

We'd naturally be open to and glad about upstreaming it; what it
effectively allows us to do is to not block the VM too much while
snapshotting it live.

I pushed a tree[0] with mostly just that specific code squashed together
(hope I did not break anything); most of the actual code is in commit [1].
It'd need to be cleaned up a bit and checked for coding style issues, but
it works well here.

Anyway, thanks for your help and pointers!

[0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
[1]: https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121




* Re: [RFC PATCH 3/3] block: Assert we're running in the right thread
  2020-05-12 14:43 ` [RFC PATCH 3/3] block: Assert we're running in the right thread Kevin Wolf
@ 2020-05-14 13:52   ` Stefan Reiter
  2020-05-14 14:30     ` Kevin Wolf
  0 siblings, 1 reply; 26+ messages in thread
From: Stefan Reiter @ 2020-05-14 13:52 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, t.lamprecht, armbru, stefanha, mreitz

On 5/12/20 4:43 PM, Kevin Wolf wrote:
> tracked_request_begin() is called for most I/O operations, so it's a
> good place to assert that we're indeed running in the home thread of the
> node's AioContext.
> 

Is this patch supposed to be always correct or only together with nr. 2?

I changed our code to call bdrv_flush_all from the main AIO context and 
it certainly works just fine (even without this series, so I suppose 
that would be the 'correct' way to fix it you mention on the cover), 
though of course it trips this assert without patch 2.

> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/io.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 7808e8bdc0..924bf5ba46 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -695,14 +695,17 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
>                                     uint64_t bytes,
>                                     enum BdrvTrackedRequestType type)
>   {
> +    Coroutine *self = qemu_coroutine_self();
> +
>       assert(bytes <= INT64_MAX && offset <= INT64_MAX - bytes);
> +    assert(bs->aio_context == qemu_coroutine_get_aio_context(self));
>   
>       *req = (BdrvTrackedRequest){
>           .bs = bs,
>           .offset         = offset,
>           .bytes          = bytes,
>           .type           = type,
> -        .co             = qemu_coroutine_self(),
> +        .co             = self,
>           .serialising    = false,
>           .overlap_offset = offset,
>           .overlap_bytes  = bytes,
> 




* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-14 13:21 ` [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Thomas Lamprecht
@ 2020-05-14 14:26   ` Kevin Wolf
  2020-05-19 12:32     ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-14 14:26 UTC (permalink / raw)
  To: Thomas Lamprecht
  Cc: vsementsov, qemu-block, s.reiter, qemu-devel, armbru, stefanha, mreitz

On 14.05.2020 at 15:21, Thomas Lamprecht wrote:
> On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > Stefan (Reiter), after looking a bit closer at this, I think there is no
> > bug in QEMU, but the bug is in your coroutine code that calls block
> > layer functions without moving into the right AioContext first. I've
> > written this series anyway as it potentially makes the life of callers
> > easier and would probably make your buggy code correct.
> 
> > However, it doesn't feel right to commit something like patch 2 without
> > having a user for it. Is there a reason why you can't upstream your
> > async snapshot code?
> 
> I mean I understand what you mean, but it would make the interface IMO so
> much easier to use, if one wants to explicit schedule it beforehand they
> can still do. But that would open the way for two styles doing things, not
> sure if this would seen as bad. The assert about from patch 3/3 would be
> already really helping a lot, though.

I think patches 1 and 3 are good to be committed either way if people
think they are useful. They make sense without the async snapshot code.

My concern with the interface in patch 2 is both that it could give
people a false sense of security and that it would be tempting to write
inefficient code.

Usually, you won't have just a single call into the block layer for a
given block node, but you'll perform multiple operations. Switching to
the target context once rather than switching back and forth in every
operation is obviously more efficient.

But chances are that even if one of these functions is bdrv_flush(),
which now works correctly from a different thread, you might need
another function that doesn't implement the same magic. So you always
need to be aware of which functions support cross-context calls and
which ones don't.

I feel we'd see a few bugs related to this.

> Regarding upstreaming, there was some historical attempt to upstream it
> from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
> I'm not quite sure why it didn't went through then, I see if I can get
> some time searching the mailing list archive.
> 
> We'd be naturally open and glad to upstream it, what it effectively
> allow us to do is to not block the VM to much during snapshoting it
> live.

Yes, there is no doubt that this is useful functionality. There has been
talk about this every now and then, but I don't think we ever got to a
point where it actually could be implemented.

Vladimir, I seem to remember you (or someone else from your team?) were
interested in async snapshots as well a while ago?

> I pushed a tree[0] with mostly just that specific code squashed together (hope
> I did not break anything), most of the actual code is in commit [1].
> It'd be cleaned up a bit and checked for coding style issues, but works good
> here.
> 
> Anyway, thanks for your help and pointers!
> 
> [0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
> [1]: https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121

It doesn't even look that bad in terms of patch size. I had imagined it
a bit larger.

But it seems this is not really just an async 'savevm' (which would save
the VM state in a qcow2 file), but you store the state in a separate
raw file. What is the difference between this and regular migration into
a file?

I remember people talking about how snapshotting can store things in a
way that a normal migration stream can't do, like overwriting outdated
RAM state instead of just appending the new state, but you don't seem to
implement something like this.

Kevin




* Re: [RFC PATCH 3/3] block: Assert we're running in the right thread
  2020-05-14 13:52   ` Stefan Reiter
@ 2020-05-14 14:30     ` Kevin Wolf
  2020-05-20  9:12       ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-14 14:30 UTC (permalink / raw)
  To: Stefan Reiter
  Cc: qemu-block, qemu-devel, armbru, stefanha, mreitz, t.lamprecht

On 14.05.2020 at 15:52, Stefan Reiter wrote:
> On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > tracked_request_begin() is called for most I/O operations, so it's a
> > good place to assert that we're indeed running in the home thread of the
> > node's AioContext.
> > 
> 
> Is this patch supposed to be always correct or only together with nr. 2?
> 
> I changed our code to call bdrv_flush_all from the main AIO context and it
> certainly works just fine (even without this series, so I suppose that would
> be the 'correct' way to fix it you mention on the cover), though of course
> it trips this assert without patch 2.

Yes, I think this is a basic assumption that should always be true.
This series shouldn't fix anything for upstream QEMU (at least I'm not
aware of anything that needs it), so the assertion could be added even
without the other patches.

In fact, I'm currently thinking that committing just patch 1 (because
it's a nice cleanup anyway) and patch 3 (because it will let us know
when we mess up) might make sense.

Kevin

> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block/io.c | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/io.c b/block/io.c
> > index 7808e8bdc0..924bf5ba46 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -695,14 +695,17 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
> >                                     uint64_t bytes,
> >                                     enum BdrvTrackedRequestType type)
> >   {
> > +    Coroutine *self = qemu_coroutine_self();
> > +
> >       assert(bytes <= INT64_MAX && offset <= INT64_MAX - bytes);
> > +    assert(bs->aio_context == qemu_coroutine_get_aio_context(self));
> >       *req = (BdrvTrackedRequest){
> >           .bs = bs,
> >           .offset         = offset,
> >           .bytes          = bytes,
> >           .type           = type,
> > -        .co             = qemu_coroutine_self(),
> > +        .co             = self,
> >           .serialising    = false,
> >           .overlap_offset = offset,
> >           .overlap_bytes  = bytes,
> > 
> 




* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-14 14:26   ` Kevin Wolf
@ 2020-05-19 12:32     ` Vladimir Sementsov-Ogievskiy
  2020-05-19 13:54       ` Denis Plotnikov
  0 siblings, 1 reply; 26+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 12:32 UTC (permalink / raw)
  To: Kevin Wolf, Thomas Lamprecht
  Cc: qemu-block, s.reiter, qemu-devel, armbru, Denis Plotnikov,
	stefanha, mreitz

14.05.2020 17:26, Kevin Wolf wrote:
> On 14.05.2020 at 15:21, Thomas Lamprecht wrote:
>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>> Stefan (Reiter), after looking a bit closer at this, I think there is no
>>> bug in QEMU, but the bug is in your coroutine code that calls block
>>> layer functions without moving into the right AioContext first. I've
>>> written this series anyway as it potentially makes the life of callers
>>> easier and would probably make your buggy code correct.
>>
>>> However, it doesn't feel right to commit something like patch 2 without
>>> having a user for it. Is there a reason why you can't upstream your
>>> async snapshot code?
>>
>> I mean I understand what you mean, but it would make the interface IMO so
>> much easier to use, if one wants to explicit schedule it beforehand they
>> can still do. But that would open the way for two styles doing things, not
>> sure if this would seen as bad. The assert about from patch 3/3 would be
>> already really helping a lot, though.
> 
> I think patches 1 and 3 are good to be committed either way if people
> think they are useful. They make sense without the async snapshot code.
> 
> My concern with the interface in patch 2 is both that it could give
> people a false sense of security and that it would be tempting to write
> inefficient code.
> 
> Usually, you won't have just a single call into the block layer for a
> given block node, but you'll perform multiple operations. Switching to
> the target context once rather than switching back and forth in every
> operation is obviously more efficient.
> 
> But chances are that even if one of these function is bdrv_flush(),
> which now works correctly from a different thread, you might need
> another function that doesn't implement the same magic. So you always
> need to be aware which functions support cross-context calls which
> ones don't.
> 
> I feel we'd see a few bugs related to this.
> 
>> Regarding upstreaming, there was some historical attempt to upstream it
>> from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
>> I'm not quite sure why it didn't went through then, I see if I can get
>> some time searching the mailing list archive.
>>
>> We'd be naturally open and glad to upstream it, what it effectively
>> allow us to do is to not block the VM to much during snapshoting it
>> live.
> 
> Yes, there is no doubt that this is useful functionality. There has been
> talk about this every now and then, but I don't think we ever got to a
> point where it actually could be implemented.
> 
> Vladimir, I seem to remember you (or someone else from your team?) were
> interested in async snapshots as well a while ago?

Den is working on this (add him to CC)

> 
>> I pushed a tree[0] with mostly just that specific code squashed together (hope
>> I did not break anything), most of the actual code is in commit [1].
>> It'd be cleaned up a bit and checked for coding style issues, but works good
>> here.
>>
>> Anyway, thanks for your help and pointers!
>>
>> [0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
>> [1]: https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121
> 
> It doesn't even look that bad in terms of patch size. I had imagined it
> a bit larger.
> 
> But it seems this is not really just an async 'savevm' (which would save
> the VM state in a qcow2 file), but you store the state in a separate
> raw file. What is the difference between this and regular migration into
> a file?
> 
> I remember people talking about how snapshotting can store things in a
> way that a normal migration stream can't do, like overwriting outdated
> RAM state instead of just appending the new state, but you don't seem to
> implement something like this.
> 
> Kevin
> 


-- 
Best regards,
Vladimir



* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 12:32     ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 13:54       ` Denis Plotnikov
  2020-05-19 14:18         ` Kevin Wolf
  0 siblings, 1 reply; 26+ messages in thread
From: Denis Plotnikov @ 2020-05-19 13:54 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Thomas Lamprecht
  Cc: qemu-block, s.reiter, qemu-devel, armbru, stefanha, mreitz



On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
> 14.05.2020 17:26, Kevin Wolf wrote:
>> On 14.05.2020 at 15:21, Thomas Lamprecht wrote:
>>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>>> Stefan (Reiter), after looking a bit closer at this, I think there 
>>>> is no
>>>> bug in QEMU, but the bug is in your coroutine code that calls block
>>>> layer functions without moving into the right AioContext first. I've
>>>> written this series anyway as it potentially makes the life of callers
>>>> easier and would probably make your buggy code correct.
>>>
>>>> However, it doesn't feel right to commit something like patch 2 
>>>> without
>>>> having a user for it. Is there a reason why you can't upstream your
>>>> async snapshot code?
>>>
>>> I mean I understand what you mean, but it would make the interface 
>>> IMO so
>>> much easier to use, if one wants to explicit schedule it beforehand 
>>> they
>>> can still do. But that would open the way for two styles doing 
>>> things, not
>>> sure if this would seen as bad. The assert about from patch 3/3 
>>> would be
>>> already really helping a lot, though.
>>
>> I think patches 1 and 3 are good to be committed either way if people
>> think they are useful. They make sense without the async snapshot code.
>>
>> My concern with the interface in patch 2 is both that it could give
>> people a false sense of security and that it would be tempting to write
>> inefficient code.
>>
>> Usually, you won't have just a single call into the block layer for a
>> given block node, but you'll perform multiple operations. Switching to
>> the target context once rather than switching back and forth in every
>> operation is obviously more efficient.
>>
>> But chances are that even if one of these function is bdrv_flush(),
>> which now works correctly from a different thread, you might need
>> another function that doesn't implement the same magic. So you always
>> need to be aware which functions support cross-context calls which
>> ones don't.
>>
>> I feel we'd see a few bugs related to this.
>>
>>> Regarding upstreaming, there was some historical attempt to upstream it
>>> from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
>>> I'm not quite sure why it didn't went through then, I see if I can get
>>> some time searching the mailing list archive.
>>>
>>> We'd be naturally open and glad to upstream it, what it effectively
>>> allow us to do is to not block the VM to much during snapshoting it
>>> live.
>>
>> Yes, there is no doubt that this is useful functionality. There has been
>> talk about this every now and then, but I don't think we ever got to a
>> point where it actually could be implemented.
>>
>> Vladimir, I seem to remember you (or someone else from your team?) were
>> interested in async snapshots as well a while ago?
>
> Den is working on this (add him to CC)
Yes, I was working on that.

What I've done can be found here: 
https://github.com/denis-plotnikov/qemu/commits/bgs_uffd

The idea was to save a snapshot (state+ram) asynchronously to a separate 
(raw) file using the existing infrastructure.
The goal of that was to reduce the VM downtime on snapshot.

We decided to postpone this work until "userfaultfd write protected
mode" made it into the Linux mainline.
Now userfaultfd-wp is merged into Linux, so we have plans to continue
this work.
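
As a rough illustration of what userfaultfd-wp provides for this use
case, here is a minimal sketch against the Linux uffd API (this is not
the code from the branch linked above, and error handling is omitted):

#include <stdbool.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Register a guest RAM region for write-protect tracking. Writes to the
 * region then raise uffd events until the page is un-protected again. */
static int uffd_wp_setup(void *addr, size_t len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
    };
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };

    ioctl(uffd, UFFDIO_API, &api);
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    return uffd;
}

/* Toggle write protection for a range, e.g. un-protect a page once its
 * content has been copied out to the snapshot file. */
static int uffd_wp_set(int uffd, void *addr, size_t len, bool protect)
{
    struct uffdio_writeprotect wp = {
        .range = { .start = (uintptr_t)addr, .len = len },
        .mode  = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
    };
    return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
}

In the real implementation a separate thread would read fault events from
the uffd and copy the faulting pages out before un-protecting them; that
part is omitted here.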

Regarding saving the "internal" snapshot to qcow2, I still have a
question. Maybe this is the right place and time to ask.

If I remember correctly, in qcow2 the snapshot is stored at the end of
the address space of the current block-placement-table.
We switch to the new block-placement-table after the snapshot storing is
complete. In case of a sync snapshot, we should switch the
table before the snapshot is written; in other words, we should be able
to preallocate the space for the snapshot and keep a link
to the space until the snapshot writing is completed.
The question is whether this could be done without modifying qcow2 and,
if not, could you please give some ideas of how to implement that?

Denis
>
>>
>>> I pushed a tree[0] with mostly just that specific code squashed 
>>> together (hope
>>> I did not break anything), most of the actual code is in commit [1].
>>> It'd be cleaned up a bit and checked for coding style issues, but 
>>> works good
>>> here.
>>>
>>> Anyway, thanks for your help and pointers!
>>>
>>> [0]: https://github.com/ThomasLamprecht/qemu/tree/savevm-async
>>> [1]: 
>>> https://github.com/ThomasLamprecht/qemu/commit/ffb9531f370ef0073e4b6f6021f4c47ccd702121
>>
>> It doesn't even look that bad in terms of patch size. I had imagined it
>> a bit larger.
>>
>> But it seems this is not really just an async 'savevm' (which would save
>> the VM state in a qcow2 file), but you store the state in a separate
>> raw file. What is the difference between this and regular migration into
>> a file?
>>
>> I remember people talking about how snapshotting can store things in a
>> way that a normal migration stream can't do, like overwriting outdated
>> RAM state instead of just appending the new state, but you don't seem to
>> implement something like this.
>>
>> Kevin
>>
>
>




* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 13:54       ` Denis Plotnikov
@ 2020-05-19 14:18         ` Kevin Wolf
  2020-05-19 15:05           ` Denis Plotnikov
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-19 14:18 UTC (permalink / raw)
  To: Denis Plotnikov
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel, s.reiter,
	armbru, mreitz, stefanha, Thomas Lamprecht

On 19.05.2020 at 15:54, Denis Plotnikov wrote:
> 
> 
> On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
> > 14.05.2020 17:26, Kevin Wolf wrote:
> > > On 14.05.2020 at 15:21, Thomas Lamprecht wrote:
> > > > On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > > > > Stefan (Reiter), after looking a bit closer at this, I think
> > > > > there is no
> > > > > bug in QEMU, but the bug is in your coroutine code that calls block
> > > > > layer functions without moving into the right AioContext first. I've
> > > > > written this series anyway as it potentially makes the life of callers
> > > > > easier and would probably make your buggy code correct.
> > > > 
> > > > > However, it doesn't feel right to commit something like
> > > > > patch 2 without
> > > > > having a user for it. Is there a reason why you can't upstream your
> > > > > async snapshot code?
> > > > 
> > > > I mean I understand what you mean, but it would make the
> > > > interface IMO so
> > > > much easier to use, if one wants to explicit schedule it
> > > > beforehand they
> > > > can still do. But that would open the way for two styles doing
> > > > things, not
> > > > sure if this would seen as bad. The assert about from patch 3/3
> > > > would be
> > > > already really helping a lot, though.
> > > 
> > > I think patches 1 and 3 are good to be committed either way if people
> > > think they are useful. They make sense without the async snapshot code.
> > > 
> > > My concern with the interface in patch 2 is both that it could give
> > > people a false sense of security and that it would be tempting to write
> > > inefficient code.
> > > 
> > > Usually, you won't have just a single call into the block layer for a
> > > given block node, but you'll perform multiple operations. Switching to
> > > the target context once rather than switching back and forth in every
> > > operation is obviously more efficient.
> > > 
> > > But chances are that even if one of these function is bdrv_flush(),
> > > which now works correctly from a different thread, you might need
> > > another function that doesn't implement the same magic. So you always
> > > need to be aware which functions support cross-context calls which
> > > ones don't.
> > > 
> > > I feel we'd see a few bugs related to this.
> > > 
> > > > Regarding upstreaming, there was some historical attempt to upstream it
> > > > from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
> > > > I'm not quite sure why it didn't went through then, I see if I can get
> > > > some time searching the mailing list archive.
> > > > 
> > > > We'd be naturally open and glad to upstream it, what it effectively
> > > > allow us to do is to not block the VM to much during snapshoting it
> > > > live.
> > > 
> > > Yes, there is no doubt that this is useful functionality. There has been
> > > talk about this every now and then, but I don't think we ever got to a
> > > point where it actually could be implemented.
> > > 
> > > Vladimir, I seem to remember you (or someone else from your team?) were
> > > interested in async snapshots as well a while ago?
> > 
> > Den is working on this (add him to CC)
> Yes, I was working on that.
> 
> What I've done can be found here:
> https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
> 
> The idea was to save a snapshot (state+ram) asynchronously to a separate
> (raw) file using the existing infrastructure.
> The goal of that was to reduce the VM downtime on snapshot.
> 
> We decided to postpone this work until "userfaultfd write protected mode"
> wasn't in the linux mainstream.
> Now, userfaultfd-wp is merged to linux. So we have plans to continue this
> work.
> 
> According to the saving the "internal" snapshot to qcow2 I still have a
> question. May be this is the right place and time to ask.
> 
> If I remember correctly, in qcow2 the snapshot is stored at the end of
> the address space of the current block-placement-table.

Yes. Basically the way snapshots with VM state work is that you write
the VM state to some offset after the end of the virtual disk, when the
VM state is completely written you snapshot the current state (including
both content of the virtual disk and VM state) and finally discard the
VM state again in the active L1 table.
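
In rough code terms, a minimal sketch of that sequence (the real flow
lives in migration/savevm.c and block/snapshot.c; the helper name below
is made up and error/cleanup paths are left out):

#include "qemu/osdep.h"
#include "block/block.h"
#include "block/snapshot.h"

/* sn is assumed to be pre-filled with id_str/name by the caller. */
static int example_internal_snapshot(BlockDriverState *bs,
                                     const uint8_t *vmstate_buf,
                                     int vmstate_size,
                                     QEMUSnapshotInfo *sn)
{
    int ret;

    /* 1. Write the VM state after the end of the virtual disk, using the
     *    currently active L1 table. */
    ret = bdrv_save_vmstate(bs, vmstate_buf, 0, vmstate_size);
    if (ret < 0) {
        return ret;
    }

    /* 2. Create the internal snapshot; it captures both the virtual disk
     *    content and the VM state that was just written. */
    sn->vm_state_size = vmstate_size;
    ret = bdrv_snapshot_create(bs, sn);
    if (ret < 0) {
        return ret;
    }

    /* 3. The VM state area is then dropped from the active L1 table
     *    again, so the running image does not keep it allocated. */
    return 0;
}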

> We switch to the new block-placement-table after the snapshot storing
> is complete. In case of sync snapshot, we should switch the table
> before the snapshot is written, another words, we should be able to
> preallocate the the space for the snapshot and keep a link to the
> space until snapshot writing is completed.

I don't see a fundamental difference between sync and async in this
respect. Why can't you write the VM state to the current L1 table first
like we usually do?

We always have only one active L1 table at a time, which simplifies
cluster allocation a bit, so it would be preferable to keep it this way.

Kevin




* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 14:18         ` Kevin Wolf
@ 2020-05-19 15:05           ` Denis Plotnikov
  2020-05-19 15:29             ` Kevin Wolf
  0 siblings, 1 reply; 26+ messages in thread
From: Denis Plotnikov @ 2020-05-19 15:05 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel, s.reiter,
	armbru, mreitz, stefanha, Thomas Lamprecht



On 19.05.2020 17:18, Kevin Wolf wrote:
> On 19.05.2020 at 15:54, Denis Plotnikov wrote:
>>
>> On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>> 14.05.2020 17:26, Kevin Wolf wrote:
>>>> On 14.05.2020 at 15:21, Thomas Lamprecht wrote:
>>>>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>>>>> Stefan (Reiter), after looking a bit closer at this, I think
>>>>>> there is no
>>>>>> bug in QEMU, but the bug is in your coroutine code that calls block
>>>>>> layer functions without moving into the right AioContext first. I've
>>>>>> written this series anyway as it potentially makes the life of callers
>>>>>> easier and would probably make your buggy code correct.
>>>>>> However, it doesn't feel right to commit something like
>>>>>> patch 2 without
>>>>>> having a user for it. Is there a reason why you can't upstream your
>>>>>> async snapshot code?
>>>>> I mean I understand what you mean, but it would make the
>>>>> interface IMO so
>>>>> much easier to use, if one wants to explicit schedule it
>>>>> beforehand they
>>>>> can still do. But that would open the way for two styles doing
>>>>> things, not
>>>>> sure if this would seen as bad. The assert about from patch 3/3
>>>>> would be
>>>>> already really helping a lot, though.
>>>> I think patches 1 and 3 are good to be committed either way if people
>>>> think they are useful. They make sense without the async snapshot code.
>>>>
>>>> My concern with the interface in patch 2 is both that it could give
>>>> people a false sense of security and that it would be tempting to write
>>>> inefficient code.
>>>>
>>>> Usually, you won't have just a single call into the block layer for a
>>>> given block node, but you'll perform multiple operations. Switching to
>>>> the target context once rather than switching back and forth in every
>>>> operation is obviously more efficient.
>>>>
>>>> But chances are that even if one of these function is bdrv_flush(),
>>>> which now works correctly from a different thread, you might need
>>>> another function that doesn't implement the same magic. So you always
>>>> need to be aware which functions support cross-context calls which
>>>> ones don't.
>>>>
>>>> I feel we'd see a few bugs related to this.
>>>>
>>>>> Regarding upstreaming, there was some historical attempt to upstream it
>>>>> from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
>>>>> I'm not quite sure why it didn't went through then, I see if I can get
>>>>> some time searching the mailing list archive.
>>>>>
>>>>> We'd be naturally open and glad to upstream it, what it effectively
>>>>> allow us to do is to not block the VM to much during snapshoting it
>>>>> live.
>>>> Yes, there is no doubt that this is useful functionality. There has been
>>>> talk about this every now and then, but I don't think we ever got to a
>>>> point where it actually could be implemented.
>>>>
>>>> Vladimir, I seem to remember you (or someone else from your team?) were
>>>> interested in async snapshots as well a while ago?
>>> Den is working on this (add him to CC)
>> Yes, I was working on that.
>>
>> What I've done can be found here:
>> https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
>>
>> The idea was to save a snapshot (state+ram) asynchronously to a separate
>> (raw) file using the existing infrastructure.
>> The goal of that was to reduce the VM downtime on snapshot.
>>
>> We decided to postpone this work until "userfaultfd write protected mode"
>> wasn't in the linux mainstream.
>> Now, userfaultfd-wp is merged to linux. So we have plans to continue this
>> work.
>>
>> According to the saving the "internal" snapshot to qcow2 I still have a
>> question. May be this is the right place and time to ask.
>>
>> If I remember correctly, in qcow2 the snapshot is stored at the end of
>> the address space of the current block-placement-table.
> Yes. Basically the way snapshots with VM state work is that you write
> the VM state to some offset after the end of the virtual disk, when the
> VM state is completely written you snapshot the current state (including
> both content of the virtual disk and VM state) and finally discard the
> VM state again in the active L1 table.
>
>> We switch to the new block-placement-table after the snapshot storing
>> is complete. In case of sync snapshot, we should switch the table
>> before the snapshot is written, another words, we should be able to
>> preallocate the the space for the snapshot and keep a link to the
>> space until snapshot writing is completed.
> I don't see a fundamental difference between sync and async in this
> respect. Why can't you write the VM state to the current L1 table first
> like we usually do?

I'm not quite sure I understand the point.
Let's look at the whole picture of the async snapshot: our goal is to
minimize the VM downtime during the snapshot.
When we do an async snapshot, we save the vmstate except RAM while the VM
is stopped, using the current L1 table (the initial L1 table from here on).
Then we want the VM to start running again while we write the RAM content.
At this point all RAM is write-protected, and we unprotect each RAM page
once it has been written out.
All those RAM pages should go to the snapshot using the initial L1 table.
Since the VM is running, it may want to write new disk blocks,
so we need to use a NEW L1 table to provide this ability. (Am I correct
so far?)
Thus, if I understand correctly, we need to use two L1 tables: the
initial one to store RAM pages in the vmstate, and the new one to allow
disk writes.

Maybe I just can't see a better way to achieve that. Please correct me if
I'm wrong.

Denis
>
> We always have only one active L1 table at a time, which simplifies
> cluster allocation a bit, so it would be preferable to keep it this way.
>
> Kevin
>





* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 15:05           ` Denis Plotnikov
@ 2020-05-19 15:29             ` Kevin Wolf
  2020-05-19 15:48               ` Vladimir Sementsov-Ogievskiy
  2020-05-20  7:23               ` Denis Plotnikov
  0 siblings, 2 replies; 26+ messages in thread
From: Kevin Wolf @ 2020-05-19 15:29 UTC (permalink / raw)
  To: Denis Plotnikov
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel, s.reiter,
	armbru, mreitz, stefanha, Thomas Lamprecht

Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
> On 19.05.2020 17:18, Kevin Wolf wrote:
> > Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:
> > > 
> > > On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
> > > > 14.05.2020 17:26, Kevin Wolf wrote:
> > > > > Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
> > > > > > On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > > > > > > Stefan (Reiter), after looking a bit closer at this, I think
> > > > > > > there is no
> > > > > > > bug in QEMU, but the bug is in your coroutine code that calls block
> > > > > > > layer functions without moving into the right AioContext first. I've
> > > > > > > written this series anyway as it potentially makes the life of callers
> > > > > > > easier and would probably make your buggy code correct.
> > > > > > > However, it doesn't feel right to commit something like
> > > > > > > patch 2 without
> > > > > > > having a user for it. Is there a reason why you can't upstream your
> > > > > > > async snapshot code?
> > > > > > I mean I understand what you mean, but it would make the
> > > > > > interface IMO so
> > > > > > much easier to use, if one wants to explicit schedule it
> > > > > > beforehand they
> > > > > > can still do. But that would open the way for two styles doing
> > > > > > things, not
> > > > > > sure if this would seen as bad. The assert about from patch 3/3
> > > > > > would be
> > > > > > already really helping a lot, though.
> > > > > I think patches 1 and 3 are good to be committed either way if people
> > > > > think they are useful. They make sense without the async snapshot code.
> > > > > 
> > > > > My concern with the interface in patch 2 is both that it could give
> > > > > people a false sense of security and that it would be tempting to write
> > > > > inefficient code.
> > > > > 
> > > > > Usually, you won't have just a single call into the block layer for a
> > > > > given block node, but you'll perform multiple operations. Switching to
> > > > > the target context once rather than switching back and forth in every
> > > > > operation is obviously more efficient.
> > > > > 
> > > > > But chances are that even if one of these function is bdrv_flush(),
> > > > > which now works correctly from a different thread, you might need
> > > > > another function that doesn't implement the same magic. So you always
> > > > > need to be aware which functions support cross-context calls which
> > > > > ones don't.
> > > > > 
> > > > > I feel we'd see a few bugs related to this.
> > > > > 
> > > > > > Regarding upstreaming, there was some historical attempt to upstream it
> > > > > > from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
> > > > > > I'm not quite sure why it didn't went through then, I see if I can get
> > > > > > some time searching the mailing list archive.
> > > > > > 
> > > > > > We'd be naturally open and glad to upstream it, what it effectively
> > > > > > allow us to do is to not block the VM to much during snapshoting it
> > > > > > live.
> > > > > Yes, there is no doubt that this is useful functionality. There has been
> > > > > talk about this every now and then, but I don't think we ever got to a
> > > > > point where it actually could be implemented.
> > > > > 
> > > > > Vladimir, I seem to remember you (or someone else from your team?) were
> > > > > interested in async snapshots as well a while ago?
> > > > Den is working on this (add him to CC)
> > > Yes, I was working on that.
> > > 
> > > What I've done can be found here:
> > > https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
> > > 
> > > The idea was to save a snapshot (state+ram) asynchronously to a separate
> > > (raw) file using the existing infrastructure.
> > > The goal of that was to reduce the VM downtime on snapshot.
> > > 
> > > We decided to postpone this work until "userfaultfd write protected mode"
> > > wasn't in the linux mainstream.
> > > Now, userfaultfd-wp is merged to linux. So we have plans to continue this
> > > work.
> > > 
> > > According to the saving the "internal" snapshot to qcow2 I still have a
> > > question. May be this is the right place and time to ask.
> > > 
> > > If I remember correctly, in qcow2 the snapshot is stored at the end of
> > > the address space of the current block-placement-table.
> > Yes. Basically the way snapshots with VM state work is that you write
> > the VM state to some offset after the end of the virtual disk, when the
> > VM state is completely written you snapshot the current state (including
> > both content of the virtual disk and VM state) and finally discard the
> > VM state again in the active L1 table.
> > 
> > > We switch to the new block-placement-table after the snapshot storing
> > > is complete. In case of sync snapshot, we should switch the table
> > > before the snapshot is written, another words, we should be able to
> > > preallocate the the space for the snapshot and keep a link to the
> > > space until snapshot writing is completed.
> > I don't see a fundamental difference between sync and async in this
> > respect. Why can't you write the VM state to the current L1 table first
> > like we usually do?
> 
> I'm not quite sure I understand the point.
> Let's see all the picture of async snapshot: our goal is to minimize a VM
> downtime during the snapshot.
> When we do async snapshot we save vmstate except RAM when a VM is stopped
> using the current L1 table (further initial L1 table). Then, we want the VM
> start running
> and write RAM content. At this time all RAM is write-protected.
> We unprotect each RAM page once it has been written.

Oh, I see, you're basically doing something like postcopy migration. I
was assuming it was more like regular live migration, except that you
would overwrite updated RAM blocks instead of appending them.

I can see your requirement then.

> All those RAM pages should go to the snapshot using the initial L1 table.
> Since the VM is running, it may want to write new disk blocks,
> so we need to use a NEW L1 table to provide this ability. (Am I correct so
> far?)
> Thus, if I understand correctly, we need to use two L1 tables: the initial
> one to store RAM pages
> to the vmstate and the new one to allow disk writings.
> 
> May be I can't see a better way to achieve that. Please, correct me if I'm
> wrong.

I guess I could imagine a different, though probably not better way: We
could internally have a separate low-level operation that moves the VM
state from the active layer to an already existing disk snapshot. Then
you would snapshot the disk and start writing the VM to the active
layer, and when the VM state write has completed you move it into the
snapshot.

The other option is doing what you suggested. There is nothing in the
qcow2 on-disk format that would prevent this, but we would have to
extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
like a non-trivial amount of code changes, though it would potentially
enable more use cases we never implemented ((read-only) access to
internal snapshots as block nodes, so you could e.g. use block jobs to
export a snapshot).

Kevin



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 15:29             ` Kevin Wolf
@ 2020-05-19 15:48               ` Vladimir Sementsov-Ogievskiy
  2020-05-19 16:06                 ` Eric Blake
  2020-05-20  7:23               ` Denis Plotnikov
  1 sibling, 1 reply; 26+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 15:48 UTC (permalink / raw)
  To: Kevin Wolf, Denis Plotnikov
  Cc: qemu-block, qemu-devel, s.reiter, armbru, mreitz, stefanha,
	Thomas Lamprecht

19.05.2020 18:29, Kevin Wolf wrote:
> Am 19.05.2020 um 17:05 hat Denis Plotnikov geschrieben:
>> On 19.05.2020 17:18, Kevin Wolf wrote:
>>> Am 19.05.2020 um 15:54 hat Denis Plotnikov geschrieben:
>>>>
>>>> On 19.05.2020 15:32, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 14.05.2020 17:26, Kevin Wolf wrote:
>>>>>> Am 14.05.2020 um 15:21 hat Thomas Lamprecht geschrieben:
>>>>>>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>>>>>>> Stefan (Reiter), after looking a bit closer at this, I think
>>>>>>>> there is no
>>>>>>>> bug in QEMU, but the bug is in your coroutine code that calls block
>>>>>>>> layer functions without moving into the right AioContext first. I've
>>>>>>>> written this series anyway as it potentially makes the life of callers
>>>>>>>> easier and would probably make your buggy code correct.
>>>>>>>> However, it doesn't feel right to commit something like
>>>>>>>> patch 2 without
>>>>>>>> having a user for it. Is there a reason why you can't upstream your
>>>>>>>> async snapshot code?
>>>>>>> I mean I understand what you mean, but it would make the
>>>>>>> interface IMO so
>>>>>>> much easier to use, if one wants to explicit schedule it
>>>>>>> beforehand they
>>>>>>> can still do. But that would open the way for two styles doing
>>>>>>> things, not
>>>>>>> sure if this would seen as bad. The assert about from patch 3/3
>>>>>>> would be
>>>>>>> already really helping a lot, though.
>>>>>> I think patches 1 and 3 are good to be committed either way if people
>>>>>> think they are useful. They make sense without the async snapshot code.
>>>>>>
>>>>>> My concern with the interface in patch 2 is both that it could give
>>>>>> people a false sense of security and that it would be tempting to write
>>>>>> inefficient code.
>>>>>>
>>>>>> Usually, you won't have just a single call into the block layer for a
>>>>>> given block node, but you'll perform multiple operations. Switching to
>>>>>> the target context once rather than switching back and forth in every
>>>>>> operation is obviously more efficient.
>>>>>>
>>>>>> But chances are that even if one of these function is bdrv_flush(),
>>>>>> which now works correctly from a different thread, you might need
>>>>>> another function that doesn't implement the same magic. So you always
>>>>>> need to be aware which functions support cross-context calls which
>>>>>> ones don't.
>>>>>>
>>>>>> I feel we'd see a few bugs related to this.
>>>>>>
>>>>>>> Regarding upstreaming, there was some historical attempt to upstream it
>>>>>>> from Dietmar, but in the time frame of ~ 8 to 10 years ago or so.
>>>>>>> I'm not quite sure why it didn't went through then, I see if I can get
>>>>>>> some time searching the mailing list archive.
>>>>>>>
>>>>>>> We'd be naturally open and glad to upstream it, what it effectively
>>>>>>> allow us to do is to not block the VM to much during snapshoting it
>>>>>>> live.
>>>>>> Yes, there is no doubt that this is useful functionality. There has been
>>>>>> talk about this every now and then, but I don't think we ever got to a
>>>>>> point where it actually could be implemented.
>>>>>>
>>>>>> Vladimir, I seem to remember you (or someone else from your team?) were
>>>>>> interested in async snapshots as well a while ago?
>>>>> Den is working on this (add him to CC)
>>>> Yes, I was working on that.
>>>>
>>>> What I've done can be found here:
>>>> https://github.com/denis-plotnikov/qemu/commits/bgs_uffd
>>>>
>>>> The idea was to save a snapshot (state+ram) asynchronously to a separate
>>>> (raw) file using the existing infrastructure.
>>>> The goal of that was to reduce the VM downtime on snapshot.
>>>>
>>>> We decided to postpone this work until "userfaultfd write protected mode"
>>>> wasn't in the linux mainstream.
>>>> Now, userfaultfd-wp is merged to linux. So we have plans to continue this
>>>> work.
>>>>
>>>> According to the saving the "internal" snapshot to qcow2 I still have a
>>>> question. May be this is the right place and time to ask.
>>>>
>>>> If I remember correctly, in qcow2 the snapshot is stored at the end of
>>>> the address space of the current block-placement-table.
>>> Yes. Basically the way snapshots with VM state work is that you write
>>> the VM state to some offset after the end of the virtual disk, when the
>>> VM state is completely written you snapshot the current state (including
>>> both content of the virtual disk and VM state) and finally discard the
>>> VM state again in the active L1 table.
>>>
>>>> We switch to the new block-placement-table after the snapshot storing
>>>> is complete. In case of sync snapshot, we should switch the table
>>>> before the snapshot is written, another words, we should be able to
>>>> preallocate the the space for the snapshot and keep a link to the
>>>> space until snapshot writing is completed.
>>> I don't see a fundamental difference between sync and async in this
>>> respect. Why can't you write the VM state to the current L1 table first
>>> like we usually do?
>>
>> I'm not quite sure I understand the point.
>> Let's see all the picture of async snapshot: our goal is to minimize a VM
>> downtime during the snapshot.
>> When we do async snapshot we save vmstate except RAM when a VM is stopped
>> using the current L1 table (further initial L1 table). Then, we want the VM
>> start running
>> and write RAM content. At this time all RAM is write-protected.
>> We unprotect each RAM page once it has been written.
> 
> Oh, I see, you're basically doing something like postcopy migration. I
> was assuming it was more like regular live migration, except that you
> would overwrite updated RAM blocks instead of appending them.
> 
> I can see your requirement then.
> 
>> All those RAM pages should go to the snapshot using the initial L1 table.
>> Since the VM is running, it may want to write new disk blocks,
>> so we need to use a NEW L1 table to provide this ability. (Am I correct so
>> far?)
>> Thus, if I understand correctly, we need to use two L1 tables: the initial
>> one to store RAM pages
>> to the vmstate and the new one to allow disk writings.
>>
>> May be I can't see a better way to achieve that. Please, correct me if I'm
>> wrong.
> 
> I guess I could imagine a different, though probably not better way: We
> could internally have a separate low-level operation that moves the VM
> state from the active layer to an already existing disk snapshot. Then
> you would snapshot the disk and start writing the VM to the active
> layer, and when the VM state write has completed you move it into the
> snapshot.
> 
> The other options is doing what you suggested. There is nothing in the
> qcow2 on-disk format that would prevent this, but we would have to
> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
> like a non-trivial amount of code changes, though it would potentially
> enable more use cases we never implemented ((read-only) access to
> internal snapshots as block nodes, so you could e.g. use block jobs to
> export a snapshot).

Or export a snapshot through NBD.

Still, I have one more idea; probably we already discussed it?
Honestly, I don't like the fact that we store the vmstate in guest-data
space. After EOF, invisible, but still...

Maybe it would be good to add a qcow2 extension for storing the vmstate
separately? The snapshot metadata would then include two more fields,
vmstate_offset and vmstate_length (hmm, actually we already have the
second one), which would be allocated as normal qcow2 metadata. Or we
could add one or two levels of layered allocation if needed, but keep it
separate from the L1/L2 tables for guest clusters.
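Just to make it concrete, something like this (purely illustrative;
nothing like it exists in the spec or in block/qcow2-snapshot.c today,
and the field names are made up):

#include <stdint.h>

/* Hypothetical extra fields for the in-memory snapshot description
 * (cf. QCowSnapshot in block/qcow2.h); the on-disk snapshot table entry
 * would grow matching big-endian fields. */
typedef struct QCowSnapshotVmstateExt {
    uint64_t vmstate_offset;   /* host offset of a vmstate area that is
                                  allocated and refcounted like normal
                                  qcow2 metadata */
    uint64_t vmstate_length;   /* size of the stored VM state in bytes */
} QCowSnapshotVmstateExt;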


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 15:48               ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 16:06                 ` Eric Blake
  0 siblings, 0 replies; 26+ messages in thread
From: Eric Blake @ 2020-05-19 16:06 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Kevin Wolf, Denis Plotnikov
  Cc: qemu-block, armbru, s.reiter, qemu-devel, mreitz, stefanha,
	Thomas Lamprecht

On 5/19/20 10:48 AM, Vladimir Sementsov-Ogievskiy wrote:

>> The other options is doing what you suggested. There is nothing in the
>> qcow2 on-disk format that would prevent this, but we would have to
>> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
>> like a non-trivial amount of code changes, though it would potentially
>> enable more use cases we never implemented ((read-only) access to
>> internal snapshots as block nodes, so you could e.g. use block jobs to
>> export a snapshot).
> 
> Or export a snapshot through NBD.
> 
> Still, I have one more idea, probably we already discussed it?
> Honestly, I don't like the fact that we store vmstate into guest-data 
> space. After EOF, invisible, but still..
> 
> Maybe, it would be good to make a qcow2 extension for storing vmstate 
> separately?

The existing location of internal snapshots IS already stored separately 
from guest-data space, precisely because it is beyond EOF.

> So snapshot metadata will include two more fields: 
> vmstate_offset and vmstate_length (hmm, actually we already have the 
> second one), which will be allocated as normal qcow2 metadata?

How will adding redundant fields help?  Both fields are already present 
in the snapshot table of v3 images (even if indirectly) by virtue of:

         32 - 35:    Size of the VM state in bytes. 0 if no VM state is saved.
                     If there is VM state, it starts at the first cluster
                     described by first L1 table entry that doesn't describe a
                     regular guest cluster (i.e. VM state is stored like guest
                     disk content, except that it is stored at offsets that are
                     larger than the virtual disk presented to the guest)

                     Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
                                     state is saved. If this field is present,
                                     the 32-bit value in bytes 32-35 is ignored.

                     Byte 48 - 55:   Virtual disk size of the snapshot in bytes

which gives you both the 64-bit size (in order to compute the last 
cluster accessible to the guest, and thus the next cluster available to 
the vmstate beyond EOF) and the 64-bit length of that vmstate.
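
Purely as an illustration of how those two numbers fall out of the quoted
fields (not QEMU code; the helper names are made up, and it assumes
standard 8-byte L2 entries plus a known cluster size):

#include <stdint.h>

static uint64_t round_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) / align * align;
}

/* Guest offset at which the VM state starts: the first offset covered by
 * an L1 entry beyond the snapshot's virtual disk size (bytes 48-55). */
static uint64_t vmstate_start(uint64_t snap_disk_size, uint64_t cluster_size)
{
    uint64_t l2_entries    = cluster_size / 8;            /* 8-byte entries */
    uint64_t l1_entry_span = l2_entries * cluster_size;   /* bytes per L1 entry */

    return round_up(snap_disk_size, l1_entry_span);
}

/* End of the VM state, using its size from bytes 40-47 (or 32-35 on old
 * images). */
static uint64_t vmstate_end(uint64_t snap_disk_size, uint64_t cluster_size,
                            uint64_t vm_state_size)
{
    return vmstate_start(snap_disk_size, cluster_size) + vm_state_size;
}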

> Or we can 
> add one-two levels of layered allocation if needed, but keep it separate 
> from L1/L2 tables for guest clusters.
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext
  2020-05-19 15:29             ` Kevin Wolf
  2020-05-19 15:48               ` Vladimir Sementsov-Ogievskiy
@ 2020-05-20  7:23               ` Denis Plotnikov
  1 sibling, 0 replies; 26+ messages in thread
From: Denis Plotnikov @ 2020-05-20  7:23 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel, s.reiter,
	armbru, mreitz, stefanha, Thomas Lamprecht


>> I'm not quite sure I understand the point.
>> Let's see all the picture of async snapshot: our goal is to minimize a VM
>> downtime during the snapshot.
>> When we do async snapshot we save vmstate except RAM when a VM is stopped
>> using the current L1 table (further initial L1 table). Then, we want the VM
>> start running
>> and write RAM content. At this time all RAM is write-protected.
>> We unprotect each RAM page once it has been written.
> Oh, I see, you're basically doing something like postcopy migration. I
> was assuming it was more like regular live migration, except that you
> would overwrite updated RAM blocks instead of appending them.
>
> I can see your requirement then.
>
>> All those RAM pages should go to the snapshot using the initial L1 table.
>> Since the VM is running, it may want to write new disk blocks,
>> so we need to use a NEW L1 table to provide this ability. (Am I correct so
>> far?)
>> Thus, if I understand correctly, we need to use two L1 tables: the initial
>> one to store RAM pages
>> to the vmstate and the new one to allow disk writings.
>>
>> May be I can't see a better way to achieve that. Please, correct me if I'm
>> wrong.
> I guess I could imagine a different, though probably not better way: We
> could internally have a separate low-level operation that moves the VM
> state from the active layer to an already existing disk snapshot. Then
> you would snapshot the disk and start writing the VM to the active
> layer, and when the VM state write has completed you move it into the
> snapshot.
>
> The other options is doing what you suggested. There is nothing in the
> qcow2 on-disk format that would prevent this, but we would have to
> extend the qcow2 driver to allow I/O to inactive L1 tables. This sounds
> like a non-trivial amount of code changes, though it would potentially
> enable more use cases we never implemented ((read-only) access to
> internal snapshots as block nodes, so you could e.g. use block jobs to
> export a snapshot).
>
> Kevin

Ok, thanks for validating the possibilities and suggesting further
implementation ideas.
I think I should start by trying to post my background snapshot version
that stores the vmstate to an external file, because write-protected
userfaultfd is now available in Linux.
And if it's accepted, I'll try to come up with an internal version for
qcow2 (it seems this is the only format supporting this).

Denis


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] block: Factor out bdrv_run_co()
  2020-05-12 15:37   ` Eric Blake
@ 2020-05-20  9:09     ` Philippe Mathieu-Daudé
  2020-05-20 11:14       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 26+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-05-20  9:09 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, qemu-block
  Cc: Vladimir Sementsov-Ogievskiy, qemu-devel, s.reiter, armbru,
	mreitz, stefanha, t.lamprecht

On 5/12/20 5:37 PM, Eric Blake wrote:
> On 5/12/20 9:43 AM, Kevin Wolf wrote:
>> We have a few bdrv_*() functions that can either spawn a new coroutine
>> and wait for it with BDRV_POLL_WHILE() or use a fastpath if they are
>> alreeady running in a coroutine. All of them duplicate basically the
> 
> already
> 
>> same code.
>>
>> Factor the common code into a new function bdrv_run_co().
>>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>>   block/io.c | 104 +++++++++++++++--------------------------------------
>>   1 file changed, 28 insertions(+), 76 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index 7d30e61edc..c1badaadc9 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -891,6 +891,22 @@ static int 
>> bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
>>       return 0;
>>   }
>> +static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
>> +                       void *opaque, int *ret)
>> +{
>> +    if (qemu_in_coroutine()) {
>> +        /* Fast-path if already in coroutine context */
>> +        entry(opaque);
>> +    } else {
>> +        Coroutine *co = qemu_coroutine_create(entry, opaque);
>> +        *ret = NOT_DONE;
>> +        bdrv_coroutine_enter(bs, co);
>> +        BDRV_POLL_WHILE(bs, *ret == NOT_DONE);
> 
> For my reference, NOT_DONE is defined as INT_MAX, which does not seem to 
> be used as a return value in other situations.
> 
>> @@ -923,25 +939,15 @@ static int bdrv_prwv_co(BdrvChild *child, 
>> int64_t offset,
>>                           QEMUIOVector *qiov, bool is_write,
>>                           BdrvRequestFlags flags)
>>   {
>> -    Coroutine *co;
>>       RwCo rwco = {
>>           .child = child,
>>           .offset = offset,
>>           .qiov = qiov,
>>           .is_write = is_write,
>> -        .ret = NOT_DONE,
>>           .flags = flags,
>>       };
>> -    if (qemu_in_coroutine()) {
>> -        /* Fast-path if already in coroutine context */
>> -        bdrv_rw_co_entry(&rwco);
>> -    } else {
>> -        co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
>> -        bdrv_coroutine_enter(child->bs, co);
>> -        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
>> -    }
>> -    return rwco.ret;
>> +    return bdrv_run_co(child->bs, bdrv_rw_co_entry, &rwco, &rwco.ret);
> 
> So code that previously looped on NOT_DONE is obviously safe, while...
> 
>>   }
>>   int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>> @@ -2230,7 +2236,6 @@ typedef struct BdrvCoBlockStatusData {
>>       int64_t *map;
>>       BlockDriverState **file;
>>       int ret;
>> -    bool done;
>>   } BdrvCoBlockStatusData;
>>   int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
>> @@ -2492,7 +2497,6 @@ static void coroutine_fn 
>> bdrv_block_status_above_co_entry(void *opaque)
>>                                              data->want_zero,
>>                                              data->offset, data->bytes,
>>                                              data->pnum, data->map, 
>> data->file);
>> -    data->done = true;
>>       aio_wait_kick();
> 
> ...code that looped on something else now has to be checked that 
> data->ret is still being set to something useful.  Fortunately that is 
> true here.
> 
>> @@ -2669,22 +2663,13 @@ static inline int
>>   bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>>                   bool is_read)
>>   {
>> -    if (qemu_in_coroutine()) {
>> -        return bdrv_co_rw_vmstate(bs, qiov, pos, is_read);
>> -    } else {
>> -        BdrvVmstateCo data = {
>> -            .bs         = bs,
>> -            .qiov       = qiov,
>> -            .pos        = pos,
>> -            .is_read    = is_read,
>> -            .ret        = -EINPROGRESS,
>> -        };
>> -        Coroutine *co = 
>> qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
>> -
>> -        bdrv_coroutine_enter(bs, co);
>> -        BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
>> -        return data.ret;
> 
> It's a little harder to see whether -EINPROGRESS might ever be returned 
> by a driver, but again this looks safe.

Maybe add a comment regarding -EINPROGRESS before calling bdrv_run_co() 
in bdrv_rw_vmstate()?

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

> 
> Here, it's a little less obvious whether any driver might return 
> -EINPROGRESS, but it looks like if they did that,
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> Conflicts with Vladimir's patches which try to add more coroutine 
> wrappers (but those need a rebase anyway):
> https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg04559.html
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] block: Assert we're running in the right thread
  2020-05-14 14:30     ` Kevin Wolf
@ 2020-05-20  9:12       ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 26+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-05-20  9:12 UTC (permalink / raw)
  To: Kevin Wolf, Stefan Reiter
  Cc: qemu-block, armbru, qemu-devel, mreitz, stefanha, t.lamprecht

On 5/14/20 4:30 PM, Kevin Wolf wrote:
> Am 14.05.2020 um 15:52 hat Stefan Reiter geschrieben:
>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>> tracked_request_begin() is called for most I/O operations, so it's a
>>> good place to assert that we're indeed running in the home thread of the
>>> node's AioContext.
>>>
>>
>> Is this patch supposed to be always correct or only together with nr. 2?
>>
>> I changed our code to call bdrv_flush_all from the main AIO context and it
>> certainly works just fine (even without this series, so I suppose that would
>> be the 'correct' way to fix it you mention on the cover), though of course
>> it trips this assert without patch 2.
> 
> Yes, I think this is a basic assumption that should always be true.
> This series shouldn't fix anything for upstream QEMU (at least I'm not
> aware of anything that needs it), so the assertion could be added even
> without the other patches.
> 
> In fact, I'm currently thinking that committing just patch 1 (because
> it's a nice cleanup anyway) and patch 3 (because it will let us know
> when we mess up) might make sense.

FWIW applying 1+3 as no-RFC LGTM.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

> 
> Kevin
> 
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>    block/io.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index 7808e8bdc0..924bf5ba46 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -695,14 +695,17 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
>>>                                      uint64_t bytes,
>>>                                      enum BdrvTrackedRequestType type)
>>>    {
>>> +    Coroutine *self = qemu_coroutine_self();
>>> +
>>>        assert(bytes <= INT64_MAX && offset <= INT64_MAX - bytes);
>>> +    assert(bs->aio_context == qemu_coroutine_get_aio_context(self));
>>>        *req = (BdrvTrackedRequest){
>>>            .bs = bs,
>>>            .offset         = offset,
>>>            .bytes          = bytes,
>>>            .type           = type,
>>> -        .co             = qemu_coroutine_self(),
>>> +        .co             = self,
>>>            .serialising    = false,
>>>            .overlap_offset = offset,
>>>            .overlap_bytes  = bytes,
>>>
>>
> 
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] block: Factor out bdrv_run_co()
  2020-05-20  9:09     ` Philippe Mathieu-Daudé
@ 2020-05-20 11:14       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 26+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-20 11:14 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Eric Blake, Kevin Wolf, qemu-block
  Cc: qemu-devel, s.reiter, armbru, mreitz, stefanha, t.lamprecht

20.05.2020 12:09, Philippe Mathieu-Daudé wrote:
> On 5/12/20 5:37 PM, Eric Blake wrote:
>> On 5/12/20 9:43 AM, Kevin Wolf wrote:
>>> We have a few bdrv_*() functions that can either spawn a new coroutine
>>> and wait for it with BDRV_POLL_WHILE() or use a fastpath if they are
>>> alreeady running in a coroutine. All of them duplicate basically the
>>
>> already
>>
>>> same code.
>>>
>>> Factor the common code into a new function bdrv_run_co().
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>   block/io.c | 104 +++++++++++++++--------------------------------------
>>>   1 file changed, 28 insertions(+), 76 deletions(-)
>>>
>>> diff --git a/block/io.c b/block/io.c
>>> index 7d30e61edc..c1badaadc9 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>> @@ -891,6 +891,22 @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
>>>       return 0;
>>>   }
>>> +static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
>>> +                       void *opaque, int *ret)
>>> +{
>>> +    if (qemu_in_coroutine()) {
>>> +        /* Fast-path if already in coroutine context */
>>> +        entry(opaque);
>>> +    } else {
>>> +        Coroutine *co = qemu_coroutine_create(entry, opaque);
>>> +        *ret = NOT_DONE;
>>> +        bdrv_coroutine_enter(bs, co);
>>> +        BDRV_POLL_WHILE(bs, *ret == NOT_DONE);
>>
>> For my reference, NOT_DONE is defined as INT_MAX, which does not seem to be used as a return value in other situations.
>>
>>> @@ -923,25 +939,15 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
>>>                           QEMUIOVector *qiov, bool is_write,
>>>                           BdrvRequestFlags flags)
>>>   {
>>> -    Coroutine *co;
>>>       RwCo rwco = {
>>>           .child = child,
>>>           .offset = offset,
>>>           .qiov = qiov,
>>>           .is_write = is_write,
>>> -        .ret = NOT_DONE,
>>>           .flags = flags,
>>>       };
>>> -    if (qemu_in_coroutine()) {
>>> -        /* Fast-path if already in coroutine context */
>>> -        bdrv_rw_co_entry(&rwco);
>>> -    } else {
>>> -        co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
>>> -        bdrv_coroutine_enter(child->bs, co);
>>> -        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
>>> -    }
>>> -    return rwco.ret;
>>> +    return bdrv_run_co(child->bs, bdrv_rw_co_entry, &rwco, &rwco.ret);
>>
>> So code that previously looped on NOT_DONE is obviously safe, while...
>>
>>>   }
>>>   int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
>>> @@ -2230,7 +2236,6 @@ typedef struct BdrvCoBlockStatusData {
>>>       int64_t *map;
>>>       BlockDriverState **file;
>>>       int ret;
>>> -    bool done;
>>>   } BdrvCoBlockStatusData;
>>>   int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
>>> @@ -2492,7 +2497,6 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
>>>                                              data->want_zero,
>>>                                              data->offset, data->bytes,
>>>                                              data->pnum, data->map, data->file);
>>> -    data->done = true;
>>>       aio_wait_kick();
>>
>> ...code that looped on something else now has to be checked that data->ret is still being set to something useful.  Fortunately that is true here.
>>
>>> @@ -2669,22 +2663,13 @@ static inline int
>>>   bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
>>>                   bool is_read)
>>>   {
>>> -    if (qemu_in_coroutine()) {
>>> -        return bdrv_co_rw_vmstate(bs, qiov, pos, is_read);
>>> -    } else {
>>> -        BdrvVmstateCo data = {
>>> -            .bs         = bs,
>>> -            .qiov       = qiov,
>>> -            .pos        = pos,
>>> -            .is_read    = is_read,
>>> -            .ret        = -EINPROGRESS,
>>> -        };
>>> -        Coroutine *co = qemu_coroutine_create(bdrv_co_rw_vmstate_entry, &data);
>>> -
>>> -        bdrv_coroutine_enter(bs, co);
>>> -        BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
>>> -        return data.ret;
>>
>> It's a little harder to see whether -EINPROGRESS might ever be returned by a driver, but again this looks safe.
> 
> Maybe add a comment regarding -EINPROGRESS before calling bdrv_run_co() in bdrv_rw_vmstate()?
> 
> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
> 

Hi!

Actually, I've sent a v2 of this patch:
"[PATCH v2] block: Factor out bdrv_run_co()"
https://lists.gnu.org/archive/html/qemu-devel/2020-05/msg05437.html


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-12 14:43 ` [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext Kevin Wolf
  2020-05-12 16:02   ` Thomas Lamprecht
@ 2020-05-25 14:18   ` Stefan Reiter
  2020-05-25 16:41     ` Kevin Wolf
  1 sibling, 1 reply; 26+ messages in thread
From: Stefan Reiter @ 2020-05-25 14:18 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, t.lamprecht, armbru, stefanha, mreitz

On 5/12/20 4:43 PM, Kevin Wolf wrote:
> Coroutine functions that are entered through bdrv_run_co() are already
> safe to call from synchronous code in a different AioContext because
> bdrv_coroutine_enter() will schedule them in the context of the node.
> 
> However, the coroutine fastpath still requires that we're already in the
> right AioContext when called in coroutine context.
> 
> In order to make the behaviour more consistent and to make life a bit
> easier for callers, let's check the AioContext and automatically move
> the current coroutine around if we're not in the right context yet.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>   block/io.c | 15 ++++++++++++++-
>   1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index c1badaadc9..7808e8bdc0 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
>                          void *opaque, int *ret)
>   {
>       if (qemu_in_coroutine()) {
> -        /* Fast-path if already in coroutine context */
> +        Coroutine *self = qemu_coroutine_self();
> +        AioContext *bs_ctx = bdrv_get_aio_context(bs);
> +        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
> +
> +        if (bs_ctx != co_ctx) {
> +            /* Move to the iothread of the node */
> +            aio_co_schedule(bs_ctx, self);
> +            qemu_coroutine_yield();

I'm pretty sure this can lead to a race: When the thread we're 
re-scheduling to is faster to schedule us than we can reach 
qemu_coroutine_yield, then we'll get an abort ("Co-routine re-entered 
recursively"), since co->caller is still set.

I've seen this happen in our code when I try to do the scheduling 
fandangle there.

Is there a safer way to have a coroutine reschedule itself? Some lock 
missing?

> +        }
>           entry(opaque);
> +        if (bs_ctx != co_ctx) {
> +            /* Move back to the original AioContext */
> +            aio_co_schedule(bs_ctx, self);
> +            qemu_coroutine_yield();
> +        }
>       } else {
>           Coroutine *co = qemu_coroutine_create(entry, opaque);
>           *ret = NOT_DONE;
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-25 14:18   ` Stefan Reiter
@ 2020-05-25 16:41     ` Kevin Wolf
  2020-05-26 16:42       ` Kevin Wolf
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-25 16:41 UTC (permalink / raw)
  To: Stefan Reiter
  Cc: qemu-block, qemu-devel, armbru, stefanha, mreitz, t.lamprecht

Am 25.05.2020 um 16:18 hat Stefan Reiter geschrieben:
> On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > Coroutine functions that are entered through bdrv_run_co() are already
> > safe to call from synchronous code in a different AioContext because
> > bdrv_coroutine_enter() will schedule them in the context of the node.
> > 
> > However, the coroutine fastpath still requires that we're already in the
> > right AioContext when called in coroutine context.
> > 
> > In order to make the behaviour more consistent and to make life a bit
> > easier for callers, let's check the AioContext and automatically move
> > the current coroutine around if we're not in the right context yet.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >   block/io.c | 15 ++++++++++++++-
> >   1 file changed, 14 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/io.c b/block/io.c
> > index c1badaadc9..7808e8bdc0 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
> >                          void *opaque, int *ret)
> >   {
> >       if (qemu_in_coroutine()) {
> > -        /* Fast-path if already in coroutine context */
> > +        Coroutine *self = qemu_coroutine_self();
> > +        AioContext *bs_ctx = bdrv_get_aio_context(bs);
> > +        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
> > +
> > +        if (bs_ctx != co_ctx) {
> > +            /* Move to the iothread of the node */
> > +            aio_co_schedule(bs_ctx, self);
> > +            qemu_coroutine_yield();
> 
> I'm pretty sure this can lead to a race: When the thread we're re-scheduling
> to is faster to schedule us than we can reach qemu_coroutine_yield, then
> we'll get an abort ("Co-routine re-entered recursively"), since co->caller
> is still set.
> 
> I've seen this happen in our code when I try to do the scheduling fandangle
> there.

Ah, crap. I guess letting a coroutine re-schedule itself is only safe
within the same thread then.

> Is there a safer way to have a coroutine reschedule itself? Some lock
> missing?

There is no problem that can't be solved by adding another level of
indirection... We would have to schedule a BH in the original thread
that will only schedule the coroutine in its new thread after it has
yielded.

Maybe we should actually introduce a helper function that moves the
current coroutine to a different AioContext this way.
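
Something along these lines (only a sketch of the idea as described
above, not tested and not taken from any existing commit; QEMU headers
assumed):

#include "qemu/osdep.h"
#include "qemu/coroutine.h"
#include "block/aio.h"

typedef struct AioCoRescheduleSelf {
    Coroutine *co;
    AioContext *new_ctx;
} AioCoRescheduleSelf;

/* Runs as a BH in the old AioContext, i.e. only after the coroutine
 * below has yielded, so the new thread can't re-enter it too early. */
static void aio_co_reschedule_self_bh(void *opaque)
{
    AioCoRescheduleSelf *data = opaque;
    aio_co_schedule(data->new_ctx, data->co);
}

/* Move the calling coroutine to new_ctx and yield; it resumes in the
 * home thread of new_ctx. */
static void coroutine_fn aio_co_reschedule_self(AioContext *new_ctx)
{
    AioContext *old_ctx = qemu_get_current_aio_context();

    if (old_ctx != new_ctx) {
        AioCoRescheduleSelf data = {
            .co      = qemu_coroutine_self(),
            .new_ctx = new_ctx,
        };
        aio_bh_schedule_oneshot(old_ctx, aio_co_reschedule_self_bh, &data);
        qemu_coroutine_yield();
    }
}

bdrv_run_co() could then call such a helper before and after entry()
instead of scheduling itself directly.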

Kevin



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-25 16:41     ` Kevin Wolf
@ 2020-05-26 16:42       ` Kevin Wolf
  2020-05-27  8:56         ` Stefan Reiter
  0 siblings, 1 reply; 26+ messages in thread
From: Kevin Wolf @ 2020-05-26 16:42 UTC (permalink / raw)
  To: Stefan Reiter
  Cc: qemu-block, armbru, qemu-devel, mreitz, stefanha, t.lamprecht

Am 25.05.2020 um 18:41 hat Kevin Wolf geschrieben:
> Am 25.05.2020 um 16:18 hat Stefan Reiter geschrieben:
> > On 5/12/20 4:43 PM, Kevin Wolf wrote:
> > > Coroutine functions that are entered through bdrv_run_co() are already
> > > safe to call from synchronous code in a different AioContext because
> > > bdrv_coroutine_enter() will schedule them in the context of the node.
> > > 
> > > However, the coroutine fastpath still requires that we're already in the
> > > right AioContext when called in coroutine context.
> > > 
> > > In order to make the behaviour more consistent and to make life a bit
> > > easier for callers, let's check the AioContext and automatically move
> > > the current coroutine around if we're not in the right context yet.
> > > 
> > > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > > ---
> > >   block/io.c | 15 ++++++++++++++-
> > >   1 file changed, 14 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/block/io.c b/block/io.c
> > > index c1badaadc9..7808e8bdc0 100644
> > > --- a/block/io.c
> > > +++ b/block/io.c
> > > @@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
> > >                          void *opaque, int *ret)
> > >   {
> > >       if (qemu_in_coroutine()) {
> > > -        /* Fast-path if already in coroutine context */
> > > +        Coroutine *self = qemu_coroutine_self();
> > > +        AioContext *bs_ctx = bdrv_get_aio_context(bs);
> > > +        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
> > > +
> > > +        if (bs_ctx != co_ctx) {
> > > +            /* Move to the iothread of the node */
> > > +            aio_co_schedule(bs_ctx, self);
> > > +            qemu_coroutine_yield();
> > 
> > I'm pretty sure this can lead to a race: When the thread we're re-scheduling
> > to is faster to schedule us than we can reach qemu_coroutine_yield, then
> > we'll get an abort ("Co-routine re-entered recursively"), since co->caller
> > is still set.
> > 
> > I've seen this happen in our code when I try to do the scheduling fandangle
> > there.
> 
> Ah, crap. I guess letting a coroutine re-schedule itself is only safe
> within the same thread then.
> 
> > Is there a safer way to have a coroutine reschedule itself? Some lock
> > missing?
> 
> There is no problem that can't be solved by adding another level of
> indirection... We would have to schedule a BH in the original thread
> that will only schedule the coroutine in its new thread after it has
> yielded.
> 
> Maybe we should actually introduce a helper function that moves the
> current coroutine to a different AioContext this way.

Like this:

https://repo.or.cz/qemu/kevin.git/commitdiff/ed0244ba4ac699f7e8eaf7512ff25645cf43bda2

The series for which I need this isn't quite ready yet, so I haven't
sent it as a patch yet, but if it proves useful in other contexts, we
can always commit it without the rest.

Kevin



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext
  2020-05-26 16:42       ` Kevin Wolf
@ 2020-05-27  8:56         ` Stefan Reiter
  0 siblings, 0 replies; 26+ messages in thread
From: Stefan Reiter @ 2020-05-27  8:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, armbru, qemu-devel, mreitz, stefanha, t.lamprecht

On 5/26/20 6:42 PM, Kevin Wolf wrote:
> Am 25.05.2020 um 18:41 hat Kevin Wolf geschrieben:
>> Am 25.05.2020 um 16:18 hat Stefan Reiter geschrieben:
>>> On 5/12/20 4:43 PM, Kevin Wolf wrote:
>>>> Coroutine functions that are entered through bdrv_run_co() are already
>>>> safe to call from synchronous code in a different AioContext because
>>>> bdrv_coroutine_enter() will schedule them in the context of the node.
>>>>
>>>> However, the coroutine fastpath still requires that we're already in the
>>>> right AioContext when called in coroutine context.
>>>>
>>>> In order to make the behaviour more consistent and to make life a bit
>>>> easier for callers, let's check the AioContext and automatically move
>>>> the current coroutine around if we're not in the right context yet.
>>>>
>>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>>> ---
>>>>    block/io.c | 15 ++++++++++++++-
>>>>    1 file changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index c1badaadc9..7808e8bdc0 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>> @@ -895,8 +895,21 @@ static int bdrv_run_co(BlockDriverState *bs, CoroutineEntry *entry,
>>>>                           void *opaque, int *ret)
>>>>    {
>>>>        if (qemu_in_coroutine()) {
>>>> -        /* Fast-path if already in coroutine context */
>>>> +        Coroutine *self = qemu_coroutine_self();
>>>> +        AioContext *bs_ctx = bdrv_get_aio_context(bs);
>>>> +        AioContext *co_ctx = qemu_coroutine_get_aio_context(self);
>>>> +
>>>> +        if (bs_ctx != co_ctx) {
>>>> +            /* Move to the iothread of the node */
>>>> +            aio_co_schedule(bs_ctx, self);
>>>> +            qemu_coroutine_yield();
>>>
>>> I'm pretty sure this can lead to a race: When the thread we're re-scheduling
>>> to is faster to schedule us than we can reach qemu_coroutine_yield, then
>>> we'll get an abort ("Co-routine re-entered recursively"), since co->caller
>>> is still set.
>>>
>>> I've seen this happen in our code when I try to do the scheduling fandangle
>>> there.
>>
>> Ah, crap. I guess letting a coroutine re-schedule itself is only safe
>> within the same thread then.
>>
>>> Is there a safer way to have a coroutine reschedule itself? Some lock
>>> missing?
>>
>> There is no problem that can't be solved by adding another level of
>> indirection... We would have to schedule a BH in the original thread
>> that will only schedule the coroutine in its new thread after it has
>> yielded.
>>
>> Maybe we should actually introduce a helper function that moves the
>> current coroutine to a different AioContext this way.
> 
> Like this:
> 
> https://repo.or.cz/qemu/kevin.git/commitdiff/ed0244ba4ac699f7e8eaf7512ff25645cf43bda2
> 

Commit looks good to me; using aio_co_reschedule_self fixes all the
aborts I've been seeing.

> The series for which I need this isn't quite ready yet, so I haven't
> sent it as a patch yet, but if it proves useful in other contexts, we
> can always commit it without the rest.
> 

I did a quick search for places where a similar pattern is used and
found 'hw/9pfs/coth.h', where this behavior is already described (though
the BH seems to be scheduled using the threadpool API, which I'm not
really familiar with). All other places where qemu_coroutine_yield() is
preceded by an aio_co_schedule() only do so in the same AioContext,
which should be safe.

> Kevin
> 
> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-05-27  8:57 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-12 14:43 [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Kevin Wolf
2020-05-12 14:43 ` [RFC PATCH 1/3] block: Factor out bdrv_run_co() Kevin Wolf
2020-05-12 15:37   ` Eric Blake
2020-05-20  9:09     ` Philippe Mathieu-Daudé
2020-05-20 11:14       ` Vladimir Sementsov-Ogievskiy
2020-05-12 14:43 ` [RFC PATCH 2/3] block: Allow bdrv_run_co() from different AioContext Kevin Wolf
2020-05-12 16:02   ` Thomas Lamprecht
2020-05-12 19:29     ` Kevin Wolf
2020-05-25 14:18   ` Stefan Reiter
2020-05-25 16:41     ` Kevin Wolf
2020-05-26 16:42       ` Kevin Wolf
2020-05-27  8:56         ` Stefan Reiter
2020-05-12 14:43 ` [RFC PATCH 3/3] block: Assert we're running in the right thread Kevin Wolf
2020-05-14 13:52   ` Stefan Reiter
2020-05-14 14:30     ` Kevin Wolf
2020-05-20  9:12       ` Philippe Mathieu-Daudé
2020-05-14 13:21 ` [RFC PATCH 0/3] block: Synchronous bdrv_*() from coroutine in different AioContext Thomas Lamprecht
2020-05-14 14:26   ` Kevin Wolf
2020-05-19 12:32     ` Vladimir Sementsov-Ogievskiy
2020-05-19 13:54       ` Denis Plotnikov
2020-05-19 14:18         ` Kevin Wolf
2020-05-19 15:05           ` Denis Plotnikov
2020-05-19 15:29             ` Kevin Wolf
2020-05-19 15:48               ` Vladimir Sementsov-Ogievskiy
2020-05-19 16:06                 ` Eric Blake
2020-05-20  7:23               ` Denis Plotnikov
