* [PATCH v2 0/9] block/io: safer inc/dec in_flight sections
@ 2020-04-27 14:38 Vladimir Sementsov-Ogievskiy
  2020-04-27 14:38 ` [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine Vladimir Sementsov-Ogievskiy
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:38 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

Hi all!

This is inspired by Kevin's
"block: Fix blk->in_flight during blk_wait_while_drained()" series.

So, like it's now done for block-backends, let's expand the
in_flight-protected sections of the bdrv_ interfaces, so that
coroutine_enter and the BDRV_POLL_WHILE loop are included in these
sections.
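
Schematically, a synchronous bdrv_ wrapper after this series has the
shape sketched below (a simplified sketch only; bdrv_sync_wrapper,
SomeCoData and some_co_entry are placeholder names, not functions from
the series):

int bdrv_sync_wrapper(BlockDriverState *bs)
{
    SomeCoData data = { .bs = bs, .done = false };

    bdrv_inc_in_flight(bs);                /* before entering the coroutine */

    if (qemu_in_coroutine()) {
        /* Fast-path if already in coroutine context */
        some_co_entry(&data);
    } else {
        Coroutine *co = qemu_coroutine_create(some_co_entry, &data);
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, !data.done);   /* in_flight is still held here */
    }

    bdrv_dec_in_flight(bs);                /* only after the poll loop */

    return data.ret;
}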

v2:
01: drop coroutine_fn from bdrv_is_allocated_above declaration
02-04: add Stefan's r-b
05: improve commit message
    fix typo in bdrv_do_pdiscard name
    add more "To be called between exactly one pair of bdrv_inc/dec_in_flight()" comments
    drop unused bs variable
    fix bdrv_pdiscard_co_entry to use bdrv_do_pdiscard
06: similarly to 05, add additional comment to bdrv_is_allocated_above_co_entry
07-08: add Stefan's r-b
09: add "To be called between exactly one pair of bdrv_inc/dec_in_flight()" for bdrv_make_zero_co_entry

Vladimir Sementsov-Ogievskiy (9):
  block/io: refactor bdrv_is_allocated_above to run only one coroutine
  block/io: refactor bdrv_co_ioctl: move aio stuff to corresponding
    block
  block/io: move flush and pdiscard stuff down
  block/io: move bdrv_rw_co_entry and friends down
  block/io: expand in_flight inc/dec section: simple cases
  block/io: expand in_flight inc/dec section: block-status
  block/io: add bdrv_do_pwrite_zeroes
  block/io: move bdrv_make_zero under block-status
  block/io: expand in_flight inc/dec section: bdrv_make_zero

 block/io.c | 789 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 525 insertions(+), 264 deletions(-)

-- 
2.21.0




* [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:38 ` Vladimir Sementsov-Ogievskiy
  2020-05-01 21:25   ` Eric Blake
  2020-04-27 14:39 ` [PATCH v2 2/9] block/io: refactor bdrv_co_ioctl: move aio stuff to corresponding block Vladimir Sementsov-Ogievskiy
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:38 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

bdrv_is_allocated_above creates a new coroutine on each iteration if
called from non-coroutine context. To simplify the expansion of the
in_flight inc/dec sections in a further patch, refactor it so that only
one coroutine is created.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 5 deletions(-)

diff --git a/block/io.c b/block/io.c
index aba67f66b9..94ab8eaa0f 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2482,6 +2482,22 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
     return ret;
 }
 
+static int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum)
+{
+    int ret;
+    int64_t dummy;
+
+    ret = bdrv_co_block_status_above(bs, backing_bs(bs), false, offset,
+                                     bytes, pnum ? pnum : &dummy, NULL,
+                                     NULL);
+    if (ret < 0) {
+        return ret;
+    }
+    return !!(ret & BDRV_BLOCK_ALLOCATED);
+}
+
 /* Coroutine wrapper for bdrv_block_status_above() */
 static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
 {
@@ -2578,10 +2594,10 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
  * but 'pnum' will only be 0 when end of file is reached.
  *
  */
-int bdrv_is_allocated_above(BlockDriverState *top,
-                            BlockDriverState *base,
-                            bool include_base, int64_t offset,
-                            int64_t bytes, int64_t *pnum)
+static int coroutine_fn
+bdrv_co_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
+                           bool include_base, int64_t offset, int64_t bytes,
+                           int64_t *pnum)
 {
     BlockDriverState *intermediate;
     int ret;
@@ -2595,7 +2611,7 @@ int bdrv_is_allocated_above(BlockDriverState *top,
         int64_t size_inter;
 
         assert(intermediate);
-        ret = bdrv_is_allocated(intermediate, offset, bytes, &pnum_inter);
+        ret = bdrv_co_is_allocated(intermediate, offset, bytes, &pnum_inter);
         if (ret < 0) {
             return ret;
         }
@@ -2624,6 +2640,56 @@ int bdrv_is_allocated_above(BlockDriverState *top,
     return 0;
 }
 
+typedef struct BdrvCoIsAllocatedAboveData {
+    BlockDriverState *top;
+    BlockDriverState *base;
+    bool include_base;
+    int64_t offset;
+    int64_t bytes;
+    int64_t *pnum;
+    int ret;
+    bool done;
+} BdrvCoIsAllocatedAboveData;
+
+static void coroutine_fn bdrv_is_allocated_above_co_entry(void *opaque)
+{
+    BdrvCoIsAllocatedAboveData *data = opaque;
+
+    data->ret = bdrv_co_is_allocated_above(data->top, data->base,
+                                           data->include_base,
+                                           data->offset, data->bytes,
+                                           data->pnum);
+    data->done = true;
+    aio_wait_kick();
+}
+
+int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
+                            bool include_base, int64_t offset, int64_t bytes,
+                            int64_t *pnum)
+{
+    Coroutine *co;
+    BdrvCoIsAllocatedAboveData data = {
+        .top = top,
+        .base = base,
+        .include_base = include_base,
+        .offset = offset,
+        .bytes = bytes,
+        .pnum = pnum,
+        .done = false,
+    };
+
+    if (qemu_in_coroutine()) {
+        /* Fast-path if already in coroutine context */
+        bdrv_is_allocated_above_co_entry(&data);
+    } else {
+        co = qemu_coroutine_create(bdrv_is_allocated_above_co_entry, &data);
+        bdrv_coroutine_enter(top, co);
+        BDRV_POLL_WHILE(top, !data.done);
+    }
+
+    return data.ret;
+}
+
 typedef struct BdrvVmstateCo {
     BlockDriverState    *bs;
     QEMUIOVector        *qiov;
-- 
2.21.0




* [PATCH v2 2/9] block/io: refactor bdrv_co_ioctl: move aio stuff to corresponding block
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
  2020-04-27 14:38 ` [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-04-27 14:39 ` [PATCH v2 3/9] block/io: move flush and pdiscard stuff down Vladimir Sementsov-Ogievskiy
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/io.c b/block/io.c
index 94ab8eaa0f..880871e691 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3125,31 +3125,38 @@ int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
 
 int bdrv_co_ioctl(BlockDriverState *bs, int req, void *buf)
 {
+    int ret;
     BlockDriver *drv = bs->drv;
-    CoroutineIOCompletion co = {
-        .coroutine = qemu_coroutine_self(),
-    };
-    BlockAIOCB *acb;
 
     bdrv_inc_in_flight(bs);
+
     if (!drv || (!drv->bdrv_aio_ioctl && !drv->bdrv_co_ioctl)) {
-        co.ret = -ENOTSUP;
+        ret = -ENOTSUP;
         goto out;
     }
 
     if (drv->bdrv_co_ioctl) {
-        co.ret = drv->bdrv_co_ioctl(bs, req, buf);
+        ret = drv->bdrv_co_ioctl(bs, req, buf);
     } else {
+        CoroutineIOCompletion co = {
+            .coroutine = qemu_coroutine_self(),
+        };
+        BlockAIOCB *acb;
+
         acb = drv->bdrv_aio_ioctl(bs, req, buf, bdrv_co_io_em_complete, &co);
         if (!acb) {
-            co.ret = -ENOTSUP;
+            ret = -ENOTSUP;
             goto out;
         }
+
         qemu_coroutine_yield();
+        ret = co.ret;
     }
+
 out:
     bdrv_dec_in_flight(bs);
-    return co.ret;
+
+    return ret;
 }
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
-- 
2.21.0




* [PATCH v2 3/9] block/io: move flush and pdiscard stuff down
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
  2020-04-27 14:38 ` [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine Vladimir Sementsov-Ogievskiy
  2020-04-27 14:39 ` [PATCH v2 2/9] block/io: refactor bdrv_co_ioctl: move aio stuff to corresponding block Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-04-27 14:39 ` [PATCH v2 4/9] block/io: move bdrv_rw_co_entry and friends down Vladimir Sementsov-Ogievskiy
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

bdrv_co_flush and bdrv_co_pdiscard will become static in a further
patch, so move their usage down.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 56 +++++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/block/io.c b/block/io.c
index 880871e691..1134f8144a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2827,20 +2827,6 @@ void bdrv_aio_cancel_async(BlockAIOCB *acb)
 /**************************************************************/
 /* Coroutine block device emulation */
 
-typedef struct FlushCo {
-    BlockDriverState *bs;
-    int ret;
-} FlushCo;
-
-
-static void coroutine_fn bdrv_flush_co_entry(void *opaque)
-{
-    FlushCo *rwco = opaque;
-
-    rwco->ret = bdrv_co_flush(rwco->bs);
-    aio_wait_kick();
-}
-
 int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
 {
     int current_gen;
@@ -2953,6 +2939,19 @@ early_exit:
     return ret;
 }
 
+typedef struct FlushCo {
+    BlockDriverState *bs;
+    int ret;
+} FlushCo;
+
+static void coroutine_fn bdrv_flush_co_entry(void *opaque)
+{
+    FlushCo *rwco = opaque;
+
+    rwco->ret = bdrv_co_flush(rwco->bs);
+    aio_wait_kick();
+}
+
 int bdrv_flush(BlockDriverState *bs)
 {
     Coroutine *co;
@@ -2973,20 +2972,6 @@ int bdrv_flush(BlockDriverState *bs)
     return flush_co.ret;
 }
 
-typedef struct DiscardCo {
-    BdrvChild *child;
-    int64_t offset;
-    int64_t bytes;
-    int ret;
-} DiscardCo;
-static void coroutine_fn bdrv_pdiscard_co_entry(void *opaque)
-{
-    DiscardCo *rwco = opaque;
-
-    rwco->ret = bdrv_co_pdiscard(rwco->child, rwco->offset, rwco->bytes);
-    aio_wait_kick();
-}
-
 int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
                                   int64_t bytes)
 {
@@ -3101,6 +3086,21 @@ out:
     return ret;
 }
 
+typedef struct DiscardCo {
+    BdrvChild *child;
+    int64_t offset;
+    int64_t bytes;
+    int ret;
+} DiscardCo;
+
+static void coroutine_fn bdrv_pdiscard_co_entry(void *opaque)
+{
+    DiscardCo *rwco = opaque;
+
+    rwco->ret = bdrv_co_pdiscard(rwco->child, rwco->offset, rwco->bytes);
+    aio_wait_kick();
+}
+
 int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
 {
     Coroutine *co;
-- 
2.21.0




* [PATCH v2 4/9] block/io: move bdrv_rw_co_entry and friends down
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (2 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 3/9] block/io: move flush and pdiscard stuff down Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-04-27 14:39 ` [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases Vladimir Sementsov-Ogievskiy
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

We are going to use bdrv_co_pwritev_part and bdrv_co_preadv_part in
bdrv_rw_co_entry, so move bdrv_rw_co_entry and friends down.

Note: comment formatting was changed to conform to the coding style,
and the function order was changed. Otherwise the code is unmodified.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 361 +++++++++++++++++++++++++++--------------------------
 1 file changed, 181 insertions(+), 180 deletions(-)

diff --git a/block/io.c b/block/io.c
index 1134f8144a..061f3f2590 100644
--- a/block/io.c
+++ b/block/io.c
@@ -891,186 +891,6 @@ static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
     return 0;
 }
 
-typedef struct RwCo {
-    BdrvChild *child;
-    int64_t offset;
-    QEMUIOVector *qiov;
-    bool is_write;
-    int ret;
-    BdrvRequestFlags flags;
-} RwCo;
-
-static void coroutine_fn bdrv_rw_co_entry(void *opaque)
-{
-    RwCo *rwco = opaque;
-
-    if (!rwco->is_write) {
-        rwco->ret = bdrv_co_preadv(rwco->child, rwco->offset,
-                                   rwco->qiov->size, rwco->qiov,
-                                   rwco->flags);
-    } else {
-        rwco->ret = bdrv_co_pwritev(rwco->child, rwco->offset,
-                                    rwco->qiov->size, rwco->qiov,
-                                    rwco->flags);
-    }
-    aio_wait_kick();
-}
-
-/*
- * Process a vectored synchronous request using coroutines
- */
-static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
-                        QEMUIOVector *qiov, bool is_write,
-                        BdrvRequestFlags flags)
-{
-    Coroutine *co;
-    RwCo rwco = {
-        .child = child,
-        .offset = offset,
-        .qiov = qiov,
-        .is_write = is_write,
-        .ret = NOT_DONE,
-        .flags = flags,
-    };
-
-    if (qemu_in_coroutine()) {
-        /* Fast-path if already in coroutine context */
-        bdrv_rw_co_entry(&rwco);
-    } else {
-        co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
-        bdrv_coroutine_enter(child->bs, co);
-        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
-    }
-    return rwco.ret;
-}
-
-int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
-                       int bytes, BdrvRequestFlags flags)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, NULL, bytes);
-
-    return bdrv_prwv_co(child, offset, &qiov, true,
-                        BDRV_REQ_ZERO_WRITE | flags);
-}
-
-/*
- * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
- * The operation is sped up by checking the block status and only writing
- * zeroes to the device if they currently do not return zeroes. Optional
- * flags are passed through to bdrv_pwrite_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
- * BDRV_REQ_FUA).
- *
- * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
- */
-int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
-{
-    int ret;
-    int64_t target_size, bytes, offset = 0;
-    BlockDriverState *bs = child->bs;
-
-    target_size = bdrv_getlength(bs);
-    if (target_size < 0) {
-        return target_size;
-    }
-
-    for (;;) {
-        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
-        if (bytes <= 0) {
-            return 0;
-        }
-        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
-        if (ret < 0) {
-            return ret;
-        }
-        if (ret & BDRV_BLOCK_ZERO) {
-            offset += bytes;
-            continue;
-        }
-        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
-        if (ret < 0) {
-            return ret;
-        }
-        offset += bytes;
-    }
-}
-
-int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
-{
-    int ret;
-
-    ret = bdrv_prwv_co(child, offset, qiov, false, 0);
-    if (ret < 0) {
-        return ret;
-    }
-
-    return qiov->size;
-}
-
-/* See bdrv_pwrite() for the return codes */
-int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-
-    if (bytes < 0) {
-        return -EINVAL;
-    }
-
-    return bdrv_preadv(child, offset, &qiov);
-}
-
-int bdrv_pwritev(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
-{
-    int ret;
-
-    ret = bdrv_prwv_co(child, offset, qiov, true, 0);
-    if (ret < 0) {
-        return ret;
-    }
-
-    return qiov->size;
-}
-
-/* Return no. of bytes on success or < 0 on error. Important errors are:
-  -EIO         generic I/O error (may happen for all errors)
-  -ENOMEDIUM   No media inserted.
-  -EINVAL      Invalid offset or number of bytes
-  -EACCES      Trying to write a read-only device
-*/
-int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf, int bytes)
-{
-    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
-
-    if (bytes < 0) {
-        return -EINVAL;
-    }
-
-    return bdrv_pwritev(child, offset, &qiov);
-}
-
-/*
- * Writes to the file and ensures that no writes are reordered across this
- * request (acts as a barrier)
- *
- * Returns 0 on success, -errno in error cases.
- */
-int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
-                     const void *buf, int count)
-{
-    int ret;
-
-    ret = bdrv_pwrite(child, offset, buf, count);
-    if (ret < 0) {
-        return ret;
-    }
-
-    ret = bdrv_flush(child->bs);
-    if (ret < 0) {
-        return ret;
-    }
-
-    return 0;
-}
-
 typedef struct CoroutineIOCompletion {
     Coroutine *coroutine;
     int ret;
@@ -2185,6 +2005,187 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
                            BDRV_REQ_ZERO_WRITE | flags);
 }
 
+typedef struct RwCo {
+    BdrvChild *child;
+    int64_t offset;
+    QEMUIOVector *qiov;
+    bool is_write;
+    int ret;
+    BdrvRequestFlags flags;
+} RwCo;
+
+static void coroutine_fn bdrv_rw_co_entry(void *opaque)
+{
+    RwCo *rwco = opaque;
+
+    if (!rwco->is_write) {
+        rwco->ret = bdrv_co_preadv(rwco->child, rwco->offset,
+                                   rwco->qiov->size, rwco->qiov,
+                                   rwco->flags);
+    } else {
+        rwco->ret = bdrv_co_pwritev(rwco->child, rwco->offset,
+                                    rwco->qiov->size, rwco->qiov,
+                                    rwco->flags);
+    }
+    aio_wait_kick();
+}
+
+/*
+ * Process a vectored synchronous request using coroutines
+ */
+static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
+                        QEMUIOVector *qiov, bool is_write,
+                        BdrvRequestFlags flags)
+{
+    Coroutine *co;
+    RwCo rwco = {
+        .child = child,
+        .offset = offset,
+        .qiov = qiov,
+        .is_write = is_write,
+        .ret = NOT_DONE,
+        .flags = flags,
+    };
+
+    if (qemu_in_coroutine()) {
+        /* Fast-path if already in coroutine context */
+        bdrv_rw_co_entry(&rwco);
+    } else {
+        co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);
+        bdrv_coroutine_enter(child->bs, co);
+        BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
+    }
+    return rwco.ret;
+}
+
+int bdrv_pwrite_zeroes(BdrvChild *child, int64_t offset,
+                       int bytes, BdrvRequestFlags flags)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, NULL, bytes);
+
+    return bdrv_prwv_co(child, offset, &qiov, true,
+                        BDRV_REQ_ZERO_WRITE | flags);
+}
+
+/* See bdrv_pwrite() for the return codes */
+int bdrv_pread(BdrvChild *child, int64_t offset, void *buf, int bytes)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
+
+    if (bytes < 0) {
+        return -EINVAL;
+    }
+
+    return bdrv_preadv(child, offset, &qiov);
+}
+
+int bdrv_pwritev(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
+{
+    int ret;
+
+    ret = bdrv_prwv_co(child, offset, qiov, true, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return qiov->size;
+}
+
+/*
+ * Return no. of bytes on success or < 0 on error. Important errors are:
+ * -EIO         generic I/O error (may happen for all errors)
+ * -ENOMEDIUM   No media inserted.
+ * -EINVAL      Invalid offset or number of bytes
+ * -EACCES      Trying to write a read-only device
+ */
+int bdrv_pwrite(BdrvChild *child, int64_t offset, const void *buf, int bytes)
+{
+    QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, buf, bytes);
+
+    if (bytes < 0) {
+        return -EINVAL;
+    }
+
+    return bdrv_pwritev(child, offset, &qiov);
+}
+
+/*
+ * Writes to the file and ensures that no writes are reordered across this
+ * request (acts as a barrier)
+ *
+ * Returns 0 on success, -errno in error cases.
+ */
+int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
+                     const void *buf, int count)
+{
+    int ret;
+
+    ret = bdrv_pwrite(child, offset, buf, count);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = bdrv_flush(child->bs);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
+ * The operation is sped up by checking the block status and only writing
+ * zeroes to the device if they currently do not return zeroes. Optional
+ * flags are passed through to bdrv_pwrite_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
+ * BDRV_REQ_FUA).
+ *
+ * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
+ */
+int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = child->bs;
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (ret & BDRV_BLOCK_ZERO) {
+            offset += bytes;
+            continue;
+        }
+        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
+        if (ret < 0) {
+            return ret;
+        }
+        offset += bytes;
+    }
+}
+
+int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
+{
+    int ret;
+
+    ret = bdrv_prwv_co(child, offset, qiov, false, 0);
+    if (ret < 0) {
+        return ret;
+    }
+
+    return qiov->size;
+}
+
 /*
  * Flush ALL BDSes regardless of if they are reachable via a BlkBackend or not.
  */
-- 
2.21.0




* [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (3 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 4/9] block/io: move bdrv_rw_co_entry and friends down Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-05-01 21:43   ` Eric Blake
                     ` (2 more replies)
  2020-04-27 14:39 ` [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status Vladimir Sementsov-Ogievskiy
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

It's safer to expand the in_flight section so that it starts before we
enter the coroutine in synchronous wrappers, due to the following
(theoretical) problem:

Consider a write.
It's possible that qemu_coroutine_enter only schedules execution;
assume such a case.

Then we may possibly have the following:

1. The outer code somehow checks that we are not in a drained section.

2. It calls bdrv_pwritev(), assuming that it will increase in_flight,
which will protect us from a drained section being started.

3. That calls bdrv_prwv_co() -> bdrv_coroutine_enter() (in_flight is
not yet increased).

4. Assume the coroutine is not actually entered yet, only scheduled,
and we go to some code which starts a drained section (as in_flight is
still zero).

5. The scheduled coroutine finally runs and blindly increases
in_flight, and now we have an in_flight request inside a drained
section (sketched below).
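
For illustration, this is the pre-series shape of bdrv_prwv_co()
(abridged, the in-coroutine fast path omitted), with the steps above
marked as comments; the in_flight section lives entirely inside the
coroutine:

static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
                        QEMUIOVector *qiov, bool is_write,
                        BdrvRequestFlags flags)
{
    RwCo rwco = {
        .child = child, .offset = offset, .qiov = qiov,
        .is_write = is_write, .ret = NOT_DONE, .flags = flags,
    };
    Coroutine *co = qemu_coroutine_create(bdrv_rw_co_entry, &rwco);

    bdrv_coroutine_enter(child->bs, co);   /* may only schedule (step 3) */
    /*
     * Window: in_flight may still be zero here, so a drained section
     * can start right now (step 4).  When the scheduled coroutine
     * finally runs, it blindly increments in_flight inside that
     * drained section (step 5).
     */
    BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);

    return rwco.ret;
}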

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 161 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 124 insertions(+), 37 deletions(-)

diff --git a/block/io.c b/block/io.c
index 061f3f2590..a91d8c1e21 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1511,7 +1511,8 @@ int coroutine_fn bdrv_co_preadv(BdrvChild *child,
     return bdrv_co_preadv_part(child, offset, bytes, qiov, 0, flags);
 }
 
-int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
+static int coroutine_fn bdrv_do_preadv_part(BdrvChild *child,
     int64_t offset, unsigned int bytes,
     QEMUIOVector *qiov, size_t qiov_offset,
     BdrvRequestFlags flags)
@@ -1540,8 +1541,6 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
         return 0;
     }
 
-    bdrv_inc_in_flight(bs);
-
     /* Don't do copy-on-read if we read data before write operation */
     if (atomic_read(&bs->copy_on_read)) {
         flags |= BDRV_REQ_COPY_ON_READ;
@@ -1554,13 +1553,26 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
                               bs->bl.request_alignment,
                               qiov, qiov_offset, flags);
     tracked_request_end(&req);
-    bdrv_dec_in_flight(bs);
 
     bdrv_padding_destroy(&pad);
 
     return ret;
 }
 
+int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
+    int64_t offset, unsigned int bytes,
+    QEMUIOVector *qiov, size_t qiov_offset,
+    BdrvRequestFlags flags)
+{
+    int ret;
+
+    bdrv_inc_in_flight(child->bs);
+    ret = bdrv_do_preadv_part(child, offset, bytes, qiov, qiov_offset, flags);
+    bdrv_dec_in_flight(child->bs);
+
+    return ret;
+}
+
 static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int64_t offset, int bytes, BdrvRequestFlags flags)
 {
@@ -1922,7 +1934,8 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
     return bdrv_co_pwritev_part(child, offset, bytes, qiov, 0, flags);
 }
 
-int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
+static int coroutine_fn bdrv_do_pwritev_part(BdrvChild *child,
     int64_t offset, unsigned int bytes, QEMUIOVector *qiov, size_t qiov_offset,
     BdrvRequestFlags flags)
 {
@@ -1962,7 +1975,6 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
         return 0;
     }
 
-    bdrv_inc_in_flight(bs);
     /*
      * Align write if necessary by performing a read-modify-write cycle.
      * Pad qiov with the read parts and be sure to have a tracked request not
@@ -1987,7 +1999,19 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
 
 out:
     tracked_request_end(&req);
-    bdrv_dec_in_flight(bs);
+
+    return ret;
+}
+
+int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
+    int64_t offset, unsigned int bytes, QEMUIOVector *qiov, size_t qiov_offset,
+    BdrvRequestFlags flags)
+{
+    int ret;
+
+    bdrv_inc_in_flight(child->bs);
+    ret = bdrv_do_pwritev_part(child, offset, bytes, qiov, qiov_offset, flags);
+    bdrv_dec_in_flight(child->bs);
 
     return ret;
 }
@@ -2014,17 +2038,18 @@ typedef struct RwCo {
     BdrvRequestFlags flags;
 } RwCo;
 
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
 static void coroutine_fn bdrv_rw_co_entry(void *opaque)
 {
     RwCo *rwco = opaque;
 
     if (!rwco->is_write) {
-        rwco->ret = bdrv_co_preadv(rwco->child, rwco->offset,
-                                   rwco->qiov->size, rwco->qiov,
+        rwco->ret = bdrv_do_preadv_part(rwco->child, rwco->offset,
+                                   rwco->qiov->size, rwco->qiov, 0,
                                    rwco->flags);
     } else {
-        rwco->ret = bdrv_co_pwritev(rwco->child, rwco->offset,
-                                    rwco->qiov->size, rwco->qiov,
+        rwco->ret = bdrv_do_pwritev_part(rwco->child, rwco->offset,
+                                    rwco->qiov->size, rwco->qiov, 0,
                                     rwco->flags);
     }
     aio_wait_kick();
@@ -2047,6 +2072,8 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
         .flags = flags,
     };
 
+    bdrv_inc_in_flight(child->bs);
+
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
         bdrv_rw_co_entry(&rwco);
@@ -2055,6 +2082,9 @@ static int bdrv_prwv_co(BdrvChild *child, int64_t offset,
         bdrv_coroutine_enter(child->bs, co);
         BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
     }
+
+    bdrv_dec_in_flight(child->bs);
+
     return rwco.ret;
 }
 
@@ -2699,15 +2729,14 @@ typedef struct BdrvVmstateCo {
     int                 ret;
 } BdrvVmstateCo;
 
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
 static int coroutine_fn
-bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
+bdrv_do_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                    bool is_read)
 {
     BlockDriver *drv = bs->drv;
     int ret = -ENOTSUP;
 
-    bdrv_inc_in_flight(bs);
-
     if (!drv) {
         ret = -ENOMEDIUM;
     } else if (drv->bdrv_load_vmstate) {
@@ -2717,17 +2746,19 @@ bdrv_co_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
             ret = drv->bdrv_save_vmstate(bs, qiov, pos);
         }
     } else if (bs->file) {
-        ret = bdrv_co_rw_vmstate(bs->file->bs, qiov, pos, is_read);
+        bdrv_inc_in_flight(bs->file->bs);
+        ret = bdrv_do_rw_vmstate(bs->file->bs, qiov, pos, is_read);
+        bdrv_dec_in_flight(bs->file->bs);
     }
 
-    bdrv_dec_in_flight(bs);
     return ret;
 }
 
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
 static void coroutine_fn bdrv_co_rw_vmstate_entry(void *opaque)
 {
     BdrvVmstateCo *co = opaque;
-    co->ret = bdrv_co_rw_vmstate(co->bs, co->qiov, co->pos, co->is_read);
+    co->ret = bdrv_do_rw_vmstate(co->bs, co->qiov, co->pos, co->is_read);
     aio_wait_kick();
 }
 
@@ -2735,8 +2766,12 @@ static inline int
 bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
                 bool is_read)
 {
+    int ret;
+
+    bdrv_inc_in_flight(bs);
+
     if (qemu_in_coroutine()) {
-        return bdrv_co_rw_vmstate(bs, qiov, pos, is_read);
+        ret = bdrv_do_rw_vmstate(bs, qiov, pos, is_read);
     } else {
         BdrvVmstateCo data = {
             .bs         = bs,
@@ -2749,8 +2784,12 @@ bdrv_rw_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos,
 
         bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, data.ret == -EINPROGRESS);
-        return data.ret;
+        ret = data.ret;
     }
+
+    bdrv_dec_in_flight(bs);
+
+    return ret;
 }
 
 int bdrv_save_vmstate(BlockDriverState *bs, const uint8_t *buf,
@@ -2828,16 +2867,14 @@ void bdrv_aio_cancel_async(BlockAIOCB *acb)
 /**************************************************************/
 /* Coroutine block device emulation */
 
-int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
+static int coroutine_fn bdrv_do_flush(BlockDriverState *bs)
 {
     int current_gen;
-    int ret = 0;
-
-    bdrv_inc_in_flight(bs);
+    int ret;
 
-    if (!bdrv_is_inserted(bs) || bdrv_is_read_only(bs) ||
-        bdrv_is_sg(bs)) {
-        goto early_exit;
+    if (!bdrv_is_inserted(bs) || bdrv_is_read_only(bs) || bdrv_is_sg(bs)) {
+        return 0;
     }
 
     qemu_co_mutex_lock(&bs->reqs_lock);
@@ -2935,8 +2972,17 @@ out:
     qemu_co_queue_next(&bs->flush_queue);
     qemu_co_mutex_unlock(&bs->reqs_lock);
 
-early_exit:
+    return ret;
+}
+
+int coroutine_fn bdrv_co_flush(BlockDriverState *bs)
+{
+    int ret;
+
+    bdrv_inc_in_flight(bs);
+    ret = bdrv_do_flush(bs);
     bdrv_dec_in_flight(bs);
+
     return ret;
 }
 
@@ -2945,11 +2991,12 @@ typedef struct FlushCo {
     int ret;
 } FlushCo;
 
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
 static void coroutine_fn bdrv_flush_co_entry(void *opaque)
 {
     FlushCo *rwco = opaque;
 
-    rwco->ret = bdrv_co_flush(rwco->bs);
+    rwco->ret = bdrv_do_flush(rwco->bs);
     aio_wait_kick();
 }
 
@@ -2961,6 +3008,8 @@ int bdrv_flush(BlockDriverState *bs)
         .ret = NOT_DONE,
     };
 
+    bdrv_inc_in_flight(bs);
+
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
         bdrv_flush_co_entry(&flush_co);
@@ -2970,11 +3019,14 @@ int bdrv_flush(BlockDriverState *bs)
         BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);
     }
 
+    bdrv_dec_in_flight(bs);
+
     return flush_co.ret;
 }
 
-int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
-                                  int64_t bytes)
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
+static int coroutine_fn bdrv_do_pdiscard(BdrvChild *child, int64_t offset,
+                                         int64_t bytes)
 {
     BdrvTrackedRequest req;
     int max_pdiscard, ret;
@@ -3012,7 +3064,6 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
     head = offset % align;
     tail = (offset + bytes) % align;
 
-    bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_DISCARD);
 
     ret = bdrv_co_write_req_prepare(child, offset, bytes, &req, 0);
@@ -3083,7 +3134,18 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
 out:
     bdrv_co_write_req_finish(child, req.offset, req.bytes, &req, ret);
     tracked_request_end(&req);
-    bdrv_dec_in_flight(bs);
+    return ret;
+}
+
+int coroutine_fn bdrv_co_pdiscard(BdrvChild *child,
+                                  int64_t offset, int64_t bytes)
+{
+    int ret;
+
+    bdrv_inc_in_flight(child->bs);
+    ret = bdrv_do_pdiscard(child, offset, bytes);
+    bdrv_dec_in_flight(child->bs);
+
     return ret;
 }
 
@@ -3094,11 +3156,12 @@ typedef struct DiscardCo {
     int ret;
 } DiscardCo;
 
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
 static void coroutine_fn bdrv_pdiscard_co_entry(void *opaque)
 {
     DiscardCo *rwco = opaque;
 
-    rwco->ret = bdrv_co_pdiscard(rwco->child, rwco->offset, rwco->bytes);
+    rwco->ret = bdrv_do_pdiscard(rwco->child, rwco->offset, rwco->bytes);
     aio_wait_kick();
 }
 
@@ -3112,6 +3175,8 @@ int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
         .ret = NOT_DONE,
     };
 
+    bdrv_inc_in_flight(child->bs);
+
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
         bdrv_pdiscard_co_entry(&rwco);
@@ -3121,6 +3186,8 @@ int bdrv_pdiscard(BdrvChild *child, int64_t offset, int64_t bytes)
         BDRV_POLL_WHILE(child->bs, rwco.ret == NOT_DONE);
     }
 
+    bdrv_dec_in_flight(child->bs);
+
     return rwco.ret;
 }
 
@@ -3411,9 +3478,12 @@ static void bdrv_parent_cb_resize(BlockDriverState *bs)
  * If 'exact' is true, the file must be resized to exactly the given
  * 'offset'.  Otherwise, it is sufficient for the node to be at least
  * 'offset' bytes in length.
+ *
+ * To be called between exactly one pair of bdrv_inc/dec_in_flight()
  */
-int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
-                                  PreallocMode prealloc, Error **errp)
+static int coroutine_fn bdrv_do_truncate(BdrvChild *child,
+                                         int64_t offset, bool exact,
+                                         PreallocMode prealloc, Error **errp)
 {
     BlockDriverState *bs = child->bs;
     BlockDriver *drv = bs->drv;
@@ -3444,7 +3514,6 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
         new_bytes = 0;
     }
 
-    bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset - new_bytes, new_bytes,
                           BDRV_TRACKED_TRUNCATE);
 
@@ -3493,6 +3562,19 @@ int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
 
 out:
     tracked_request_end(&req);
+
+    return ret;
+}
+
+int coroutine_fn bdrv_co_truncate(BdrvChild *child,
+                                  int64_t offset, bool exact,
+                                  PreallocMode prealloc, Error **errp)
+{
+    int ret;
+    BlockDriverState *bs = child->bs;
+
+    bdrv_inc_in_flight(bs);
+    ret = bdrv_do_truncate(child, offset, exact, prealloc, errp);
     bdrv_dec_in_flight(bs);
 
     return ret;
@@ -3507,10 +3589,11 @@ typedef struct TruncateCo {
     int ret;
 } TruncateCo;
 
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
 static void coroutine_fn bdrv_truncate_co_entry(void *opaque)
 {
     TruncateCo *tco = opaque;
-    tco->ret = bdrv_co_truncate(tco->child, tco->offset, tco->exact,
+    tco->ret = bdrv_do_truncate(tco->child, tco->offset, tco->exact,
                                 tco->prealloc, tco->errp);
     aio_wait_kick();
 }
@@ -3528,6 +3611,8 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
         .ret        = NOT_DONE,
     };
 
+    bdrv_inc_in_flight(child->bs);
+
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
         bdrv_truncate_co_entry(&tco);
@@ -3537,5 +3622,7 @@ int bdrv_truncate(BdrvChild *child, int64_t offset, bool exact,
         BDRV_POLL_WHILE(child->bs, tco.ret == NOT_DONE);
     }
 
+    bdrv_dec_in_flight(child->bs);
+
     return tco.ret;
 }
-- 
2.21.0




* [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (4 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-05-01 22:00   ` Eric Blake
  2020-04-27 14:39 ` [PATCH v2 7/9] block/io: add bdrv_do_pwrite_zeroes Vladimir Sementsov-Ogievskiy
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

It's safer to expand the in_flight section so that it starts before we
enter the coroutine in synchronous wrappers and ends after the
BDRV_POLL_WHILE loop. Note that qemu_coroutine_enter may only schedule
the coroutine in some circumstances.

block-status requests are complex: they involve querying different
block driver states across the backing chain. Let's expand only the
in_flight section for the top bs, keeping the other sections as they
are.
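
Roughly, the resulting counting scheme looks like this (a simplified
sketch of the call flow, not compilable code and not literal code from
the patch):

    bdrv_common_block_status_above()           /* synchronous wrapper */
        bdrv_inc_in_flight(bs)                 /* top bs, before coroutine_enter */
        /* ... create/enter coroutine, BDRV_POLL_WHILE ... */
            bdrv_co_block_status_above(bs, base, want_zero,
                                       false /* inc_in_flight */, ...)
                bdrv_co_block_status(p, want_zero, inc_in_flight, ...)
                    /* inc_in_flight is false only for the top node; every
                     * lower node in the backing chain still gets its own
                     * bdrv_inc/dec_in_flight() pair in here */
        bdrv_dec_in_flight(bs)                 /* after the poll loop */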

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 65 ++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 51 insertions(+), 14 deletions(-)

diff --git a/block/io.c b/block/io.c
index a91d8c1e21..1cb6f433e5 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2303,6 +2303,10 @@ int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
  * _ZERO where possible; otherwise, the result favors larger 'pnum',
  * with a focus on accurate BDRV_BLOCK_ALLOCATED.
  *
+ * If 'inc_in_flight' is true, in_flight counter will be increased for bs during
+ * the operation. All nested block_status calls will increase the counter for
+ * corresponding bs anyway.
+ *
  * If 'offset' is beyond the end of the disk image the return value is
  * BDRV_BLOCK_EOF and 'pnum' is set to 0.
  *
@@ -2321,7 +2325,7 @@ int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
  * set to the host mapping and BDS corresponding to the guest offset.
  */
 static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
-                                             bool want_zero,
+                                             bool want_zero, bool inc_in_flight,
                                              int64_t offset, int64_t bytes,
                                              int64_t *pnum, int64_t *map,
                                              BlockDriverState **file)
@@ -2372,7 +2376,9 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
         goto early_out;
     }
 
-    bdrv_inc_in_flight(bs);
+    if (inc_in_flight) {
+        bdrv_inc_in_flight(bs);
+    }
 
     /* Round out to request_alignment boundaries */
     align = bs->bl.request_alignment;
@@ -2409,7 +2415,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
 
     if (ret & BDRV_BLOCK_RAW) {
         assert(ret & BDRV_BLOCK_OFFSET_VALID && local_file);
-        ret = bdrv_co_block_status(local_file, want_zero, local_map,
+        ret = bdrv_co_block_status(local_file, want_zero, true, local_map,
                                    *pnum, pnum, &local_map, &local_file);
         goto out;
     }
@@ -2436,7 +2442,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
         int64_t file_pnum;
         int ret2;
 
-        ret2 = bdrv_co_block_status(local_file, want_zero, local_map,
+        ret2 = bdrv_co_block_status(local_file, want_zero, true, local_map,
                                     *pnum, &file_pnum, NULL, NULL);
         if (ret2 >= 0) {
             /* Ignore errors.  This is just providing extra information, it
@@ -2459,7 +2465,9 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
     }
 
 out:
-    bdrv_dec_in_flight(bs);
+    if (inc_in_flight) {
+        bdrv_dec_in_flight(bs);
+    }
     if (ret >= 0 && offset + *pnum == total_size) {
         ret |= BDRV_BLOCK_EOF;
     }
@@ -2473,9 +2481,15 @@ early_out:
     return ret;
 }
 
+/*
+ * If 'inc_in_flight' is true, in_flight counter will be increased for bs during
+ * the operation. All block_status calls to the backing chain of bs will
+ * increase the counter for corresponding bs anyway.
+ */
 static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
                                                    BlockDriverState *base,
                                                    bool want_zero,
+                                                   bool inc_in_flight,
                                                    int64_t offset,
                                                    int64_t bytes,
                                                    int64_t *pnum,
@@ -2488,11 +2502,13 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
 
     assert(bs != base);
     for (p = bs; p != base; p = backing_bs(p)) {
-        ret = bdrv_co_block_status(p, want_zero, offset, bytes, pnum, map,
-                                   file);
+        ret = bdrv_co_block_status(p, want_zero, inc_in_flight,
+                                   offset, bytes, pnum, map, file);
         if (ret < 0) {
             break;
         }
+        inc_in_flight = true;
+
         if (ret & BDRV_BLOCK_ZERO && ret & BDRV_BLOCK_EOF && !first) {
             /*
              * Reading beyond the end of the file continues to read
@@ -2514,15 +2530,16 @@ static int coroutine_fn bdrv_co_block_status_above(BlockDriverState *bs,
 }
 
 static int coroutine_fn bdrv_co_is_allocated(BlockDriverState *bs,
+                                             bool inc_in_flight,
                                              int64_t offset, int64_t bytes,
                                              int64_t *pnum)
 {
     int ret;
     int64_t dummy;
 
-    ret = bdrv_co_block_status_above(bs, backing_bs(bs), false, offset,
-                                     bytes, pnum ? pnum : &dummy, NULL,
-                                     NULL);
+    ret = bdrv_co_block_status_above(bs, backing_bs(bs), false, inc_in_flight,
+                                     offset, bytes, pnum ? pnum : &dummy,
+                                     NULL, NULL);
     if (ret < 0) {
         return ret;
     }
@@ -2535,7 +2552,7 @@ static void coroutine_fn bdrv_block_status_above_co_entry(void *opaque)
     BdrvCoBlockStatusData *data = opaque;
 
     data->ret = bdrv_co_block_status_above(data->bs, data->base,
-                                           data->want_zero,
+                                           data->want_zero, false,
                                            data->offset, data->bytes,
                                            data->pnum, data->map, data->file);
     data->done = true;
@@ -2567,6 +2584,8 @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
         .done = false,
     };
 
+    bdrv_inc_in_flight(bs);
+
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
         bdrv_block_status_above_co_entry(&data);
@@ -2575,6 +2594,9 @@ static int bdrv_common_block_status_above(BlockDriverState *bs,
         bdrv_coroutine_enter(bs, co);
         BDRV_POLL_WHILE(bs, !data.done);
     }
+
+    bdrv_dec_in_flight(bs);
+
     return data.ret;
 }
 
@@ -2624,15 +2646,19 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
  * words, the result is not necessarily the maximum possible range);
  * but 'pnum' will only be 0 when end of file is reached.
  *
+ * To be called between exactly one pair of bdrv_inc/dec_in_flight() for top bs.
+ * bdrv_do_is_allocated_above takes care of increasing in_flight for other block
+ * driver states from bs backing chain.
  */
 static int coroutine_fn
-bdrv_co_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
+bdrv_do_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
                            bool include_base, int64_t offset, int64_t bytes,
                            int64_t *pnum)
 {
     BlockDriverState *intermediate;
     int ret;
     int64_t n = bytes;
+    bool inc_in_flight = false;
 
     assert(base || !include_base);
 
@@ -2642,10 +2668,12 @@ bdrv_co_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
         int64_t size_inter;
 
         assert(intermediate);
-        ret = bdrv_co_is_allocated(intermediate, offset, bytes, &pnum_inter);
+        ret = bdrv_co_is_allocated(intermediate, inc_in_flight, offset, bytes,
+                                   &pnum_inter);
         if (ret < 0) {
             return ret;
         }
+        inc_in_flight = true;
         if (ret) {
             *pnum = pnum_inter;
             return 1;
@@ -2682,11 +2710,16 @@ typedef struct BdrvCoIsAllocatedAboveData {
     bool done;
 } BdrvCoIsAllocatedAboveData;
 
+/*
+ * To be called between exactly one pair of bdrv_inc/dec_in_flight() for top bs.
+ * bdrv_do_is_allocated_above takes care of increasing in_flight for other block
+ * driver states from the backing chain.
+ */
 static void coroutine_fn bdrv_is_allocated_above_co_entry(void *opaque)
 {
     BdrvCoIsAllocatedAboveData *data = opaque;
 
-    data->ret = bdrv_co_is_allocated_above(data->top, data->base,
+    data->ret = bdrv_do_is_allocated_above(data->top, data->base,
                                            data->include_base,
                                            data->offset, data->bytes,
                                            data->pnum);
@@ -2709,6 +2742,8 @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
         .done = false,
     };
 
+    bdrv_inc_in_flight(top);
+
     if (qemu_in_coroutine()) {
         /* Fast-path if already in coroutine context */
         bdrv_is_allocated_above_co_entry(&data);
@@ -2718,6 +2753,8 @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
         BDRV_POLL_WHILE(top, !data.done);
     }
 
+    bdrv_dec_in_flight(top);
+
     return data.ret;
 }
 
-- 
2.21.0




* [PATCH v2 7/9] block/io: add bdrv_do_pwrite_zeroes
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (5 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-05-01 22:05   ` Eric Blake
  2020-04-27 14:39 ` [PATCH v2 8/9] block/io: move bdrv_make_zero under block-status Vladimir Sementsov-Ogievskiy
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

We'll need a bdrv_co_pwrite_zeroes version without the in_flight
inc/dec, to be used in the further implementation of bdrv_make_zero.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/block/io.c b/block/io.c
index 1cb6f433e5..e6a8ead46c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2016,8 +2016,10 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
     return ret;
 }
 
-int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
-                                       int bytes, BdrvRequestFlags flags)
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
+static int coroutine_fn
+bdrv_do_pwrite_zeroes(BdrvChild *child, int64_t offset, int bytes,
+                      BdrvRequestFlags flags)
 {
     trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
 
@@ -2025,8 +2027,21 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
         flags &= ~BDRV_REQ_MAY_UNMAP;
     }
 
-    return bdrv_co_pwritev(child, offset, bytes, NULL,
-                           BDRV_REQ_ZERO_WRITE | flags);
+    return bdrv_do_pwritev_part(child, offset, bytes, NULL, 0,
+                                BDRV_REQ_ZERO_WRITE | flags);
+}
+
+int coroutine_fn
+bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset, int bytes,
+                      BdrvRequestFlags flags)
+{
+    int ret;
+
+    bdrv_inc_in_flight(child->bs);
+    ret = bdrv_do_pwrite_zeroes(child, offset, bytes, flags);
+    bdrv_dec_in_flight(child->bs);
+
+    return ret;
 }
 
 typedef struct RwCo {
-- 
2.21.0




* [PATCH v2 8/9] block/io: move bdrv_make_zero under block-status
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (6 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 7/9] block/io: add bdrv_do_pwrite_zeroes Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-04-27 14:39 ` [PATCH v2 9/9] block/io: expand in_flight inc/dec section: bdrv_make_zero Vladimir Sementsov-Ogievskiy
  2020-05-19 11:18 ` [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
  9 siblings, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

We are going to use bdrv_co_block_status in bdrv_make_zero, so move
the latter down now.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io.c | 82 +++++++++++++++++++++++++++---------------------------
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/block/io.c b/block/io.c
index e6a8ead46c..3bc0daec33 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2178,47 +2178,6 @@ int bdrv_pwrite_sync(BdrvChild *child, int64_t offset,
     return 0;
 }
 
-/*
- * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
- * The operation is sped up by checking the block status and only writing
- * zeroes to the device if they currently do not return zeroes. Optional
- * flags are passed through to bdrv_pwrite_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
- * BDRV_REQ_FUA).
- *
- * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
- */
-int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
-{
-    int ret;
-    int64_t target_size, bytes, offset = 0;
-    BlockDriverState *bs = child->bs;
-
-    target_size = bdrv_getlength(bs);
-    if (target_size < 0) {
-        return target_size;
-    }
-
-    for (;;) {
-        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
-        if (bytes <= 0) {
-            return 0;
-        }
-        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
-        if (ret < 0) {
-            return ret;
-        }
-        if (ret & BDRV_BLOCK_ZERO) {
-            offset += bytes;
-            continue;
-        }
-        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
-        if (ret < 0) {
-            return ret;
-        }
-        offset += bytes;
-    }
-}
-
 int bdrv_preadv(BdrvChild *child, int64_t offset, QEMUIOVector *qiov)
 {
     int ret;
@@ -2773,6 +2732,47 @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
     return data.ret;
 }
 
+/*
+ * Completely zero out a block device with the help of bdrv_pwrite_zeroes.
+ * The operation is sped up by checking the block status and only writing
+ * zeroes to the device if they currently do not return zeroes. Optional
+ * flags are passed through to bdrv_pwrite_zeroes (e.g. BDRV_REQ_MAY_UNMAP,
+ * BDRV_REQ_FUA).
+ *
+ * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
+ */
+int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
+{
+    int ret;
+    int64_t target_size, bytes, offset = 0;
+    BlockDriverState *bs = child->bs;
+
+    target_size = bdrv_getlength(bs);
+    if (target_size < 0) {
+        return target_size;
+    }
+
+    for (;;) {
+        bytes = MIN(target_size - offset, BDRV_REQUEST_MAX_BYTES);
+        if (bytes <= 0) {
+            return 0;
+        }
+        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        if (ret < 0) {
+            return ret;
+        }
+        if (ret & BDRV_BLOCK_ZERO) {
+            offset += bytes;
+            continue;
+        }
+        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
+        if (ret < 0) {
+            return ret;
+        }
+        offset += bytes;
+    }
+}
+
 typedef struct BdrvVmstateCo {
     BlockDriverState    *bs;
     QEMUIOVector        *qiov;
-- 
2.21.0




* [PATCH v2 9/9] block/io: expand in_flight inc/dec section: bdrv_make_zero
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (7 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 8/9] block/io: move bdrv_make_zero under block-status Vladimir Sementsov-Ogievskiy
@ 2020-04-27 14:39 ` Vladimir Sementsov-Ogievskiy
  2020-05-01 22:08   ` Eric Blake
  2020-05-19 11:18 ` [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
  9 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-04-27 14:39 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, vsementsov, qemu-devel, mreitz, stefanha, den

It's safer to expand the in_flight section so that it starts before we
enter the coroutine in synchronous wrappers and ends after the
BDRV_POLL_WHILE loop. Note that qemu_coroutine_enter may only schedule
the coroutine in some circumstances.

The bdrv_make_zero update includes a refactoring: the whole loop is
moved into the coroutine, which has the additional benefit of not
creating/entering a new coroutine on each iteration.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block/io.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/block/io.c b/block/io.c
index 3bc0daec33..cd5374e6c7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2740,8 +2740,11 @@ int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
  * BDRV_REQ_FUA).
  *
  * Returns < 0 on error, 0 on success. For error codes see bdrv_write().
+ *
+ * To be called between exactly one pair of bdrv_inc/dec_in_flight()
  */
-int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
+static int coroutine_fn
+bdrv_do_make_zero(BdrvChild *child, BdrvRequestFlags flags)
 {
     int ret;
     int64_t target_size, bytes, offset = 0;
@@ -2757,7 +2760,8 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
         if (bytes <= 0) {
             return 0;
         }
-        ret = bdrv_block_status(bs, offset, bytes, &bytes, NULL, NULL);
+        ret = bdrv_co_block_status(bs, true, false,
+                                   offset, bytes, &bytes, NULL, NULL);
         if (ret < 0) {
             return ret;
         }
@@ -2765,7 +2769,7 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
             offset += bytes;
             continue;
         }
-        ret = bdrv_pwrite_zeroes(child, offset, bytes, flags);
+        ret = bdrv_do_pwrite_zeroes(child, offset, bytes, flags);
         if (ret < 0) {
             return ret;
         }
@@ -2773,6 +2777,50 @@ int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
     }
 }
 
+typedef struct BdrvDoMakeZeroData {
+    BdrvChild *child;
+    BdrvRequestFlags flags;
+    int ret;
+    bool done;
+} BdrvDoMakeZeroData;
+
+/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
+static void coroutine_fn bdrv_make_zero_co_entry(void *opaque)
+{
+    BdrvDoMakeZeroData *data = opaque;
+
+    data->ret = bdrv_do_make_zero(data->child, data->flags);
+    data->done = true;
+    aio_wait_kick();
+}
+
+int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
+{
+    int ret;
+
+    bdrv_inc_in_flight(child->bs);
+
+    if (qemu_in_coroutine()) {
+        /* Fast-path if already in coroutine context */
+        ret = bdrv_do_make_zero(child, flags);
+    } else {
+        BdrvDoMakeZeroData data = {
+            .child = child,
+            .flags = flags,
+            .done = false,
+        };
+        Coroutine *co = qemu_coroutine_create(bdrv_make_zero_co_entry, &data);
+
+        bdrv_coroutine_enter(child->bs, co);
+        BDRV_POLL_WHILE(child->bs, !data.done);
+        ret = data.ret;
+    }
+
+    bdrv_dec_in_flight(child->bs);
+
+    return ret;
+}
+
 typedef struct BdrvVmstateCo {
     BlockDriverState    *bs;
     QEMUIOVector        *qiov;
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine
  2020-04-27 14:38 ` [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine Vladimir Sementsov-Ogievskiy
@ 2020-05-01 21:25   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2020-05-01 21:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 4/27/20 9:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_is_allocated_above creates new coroutine on each iteration if
> called from non-coroutine context. To simplify expansion of in_flight
> inc/dec sections in further patch let's refactor it.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/io.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 71 insertions(+), 5 deletions(-)
> 

Quite a lot of lines added, but it fits the mechanical boilerplate 
we have elsewhere.

> diff --git a/block/io.c b/block/io.c
> index aba67f66b9..94ab8eaa0f 100644
> --- a/block/io.c
> +++ b/block/io.c

> +int bdrv_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
> +                            bool include_base, int64_t offset, int64_t bytes,
> +                            int64_t *pnum)
> +{
> +    Coroutine *co;
> +    BdrvCoIsAllocatedAboveData data = {
> +        .top = top,
> +        .base = base,
> +        .include_base = include_base,
> +        .offset = offset,
> +        .bytes = bytes,
> +        .pnum = pnum,
> +        .done = false,
> +    };

Omitting the line '.done = false,' has the same effect, since once you 
use a designated initializer, all remaining unspecified fields are 
0-initialized.  But explicitly mentioning it doesn't hurt.
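
A minimal self-contained illustration of that rule (hypothetical struct, 
nothing taken from the patch):

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Once any designated initializer is used, every member not explicitly
 * named is zero-initialized, as if the object had static storage. */
struct Example {
    int x;
    bool done;
    void *p;
};

int main(void)
{
    struct Example e = { .x = 42 };

    assert(e.done == false);    /* implicitly zero-initialized */
    assert(e.p == NULL);        /* implicitly zero-initialized */
    return 0;
}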

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-04-27 14:39 ` [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases Vladimir Sementsov-Ogievskiy
@ 2020-05-01 21:43   ` Eric Blake
  2020-05-06  7:02   ` Vladimir Sementsov-Ogievskiy
  2020-05-19 11:04   ` Kevin Wolf
  2 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2020-05-01 21:43 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 4/27/20 9:39 AM, Vladimir Sementsov-Ogievskiy wrote:
> It's safer to expand in_flight request to start before enter to
> coroutine in synchronous wrappers, due to the following (theoretical)
> problem:
> 
> Consider write.
> It's possible, that qemu_coroutine_enter only schedules execution,
> assume such case.
> 
> Then we may possibly have the following:
> 
> 1. Somehow check that we are not in drained section in outer code.
> 
> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
> will protect us from starting drained section.
> 
> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
> in_flight).
> 
> 4. Assume coroutine not yet actually entered, only scheduled, and we go
> to some code, which starts drained section (as in_flight is zero).
> 
> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
> are in drained section with in_flight request.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/io.c | 161 +++++++++++++++++++++++++++++++++++++++++------------
>   1 file changed, 124 insertions(+), 37 deletions(-)

>   
> +int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
> +    int64_t offset, unsigned int bytes,
> +    QEMUIOVector *qiov, size_t qiov_offset,
> +    BdrvRequestFlags flags)
> +{

Doesn't seem to be the usual indentation in this file.

> @@ -1922,7 +1934,8 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>       return bdrv_co_pwritev_part(child, offset, bytes, qiov, 0, flags);
>   }
>   
> -int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
> +/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
> +static int coroutine_fn bdrv_do_pwritev_part(BdrvChild *child,
>       int64_t offset, unsigned int bytes, QEMUIOVector *qiov, size_t qiov_offset,
>       BdrvRequestFlags flags)
>   {

then again, it was in use here, and saves reindenting the remaining 
lines.  I'll let the maintainer decide which style is preferred.

> @@ -2014,17 +2038,18 @@ typedef struct RwCo {
>       BdrvRequestFlags flags;
>   } RwCo;
>   
> +/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
>   static void coroutine_fn bdrv_rw_co_entry(void *opaque)
>   {
>       RwCo *rwco = opaque;
>   
>       if (!rwco->is_write) {
> -        rwco->ret = bdrv_co_preadv(rwco->child, rwco->offset,
> -                                   rwco->qiov->size, rwco->qiov,
> +        rwco->ret = bdrv_do_preadv_part(rwco->child, rwco->offset,
> +                                   rwco->qiov->size, rwco->qiov, 0,
>                                      rwco->flags);

Indentation is now off.

>       } else {
> -        rwco->ret = bdrv_co_pwritev(rwco->child, rwco->offset,
> -                                    rwco->qiov->size, rwco->qiov,
> +        rwco->ret = bdrv_do_pwritev_part(rwco->child, rwco->offset,
> +                                    rwco->qiov->size, rwco->qiov, 0,
>                                       rwco->flags);

and again

> @@ -3411,9 +3478,12 @@ static void bdrv_parent_cb_resize(BlockDriverState *bs)
>    * If 'exact' is true, the file must be resized to exactly the given
>    * 'offset'.  Otherwise, it is sufficient for the node to be at least
>    * 'offset' bytes in length.
> + *
> + * To be called between exactly one pair of bdrv_inc/dec_in_flight()
>    */
> -int coroutine_fn bdrv_co_truncate(BdrvChild *child, int64_t offset, bool exact,
> -                                  PreallocMode prealloc, Error **errp)
> +static int coroutine_fn bdrv_do_truncate(BdrvChild *child,
> +                                         int64_t offset, bool exact,
> +                                         PreallocMode prealloc, Error **errp)

Needs to be rebased, now that master has Kevin's patches adding a 
'BdrvRequestFlags flags' parameter.  But the rebase should be obvious.

Otherwise looks sane to me, but I may be missing one of the finer points 
on which functions should be decorated with 'coroutine_fn'.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status
  2020-04-27 14:39 ` [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status Vladimir Sementsov-Ogievskiy
@ 2020-05-01 22:00   ` Eric Blake
  2020-05-19 10:57     ` Kevin Wolf
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Blake @ 2020-05-01 22:00 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 4/27/20 9:39 AM, Vladimir Sementsov-Ogievskiy wrote:
> It's safer to expand in_flight request to start before enter to
> coroutine in synchronous wrappers and end after BDRV_POLL_WHILE loop.
> Note that qemu_coroutine_enter may only schedule the coroutine in some
> circumstances.

Wording suggestion:

It's safer to expand the region protected by an in_flight request to 
begin in the synchronous wrapper and end after the BDRV_POLL_WHILE loop. 
  Leaving the in_flight request in the coroutine itself risks a race 
where calling qemu_coroutine_enter() may have only scheduled, rather 
than started, the coroutine, allowing some other thread a chance to not 
realize an operation is in flight.

> 
> block-status requests are complex, they involve querying different
> block driver states across backing chain. Let's expand only in_flight
> section for the top bs, keeping other sections as is.

block-status requests are complex, involving a query of different block 
driver states across the backing chain.  Let's expand only the in_flight 
section for the top bs, and keep the other sections as-is.

I'd welcome Kevin's review on my next comment, but if I'm correct, I 
think we can further add the following justification to the commit message:

Gathering block status only requires reads from the block device, and 
backing devices are typically read-only, so losing any in_flight race on 
a backing device is less likely to cause problems with concurrent 
modifications on the overall backing chain.

> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/io.c | 65 ++++++++++++++++++++++++++++++++++++++++++------------
>   1 file changed, 51 insertions(+), 14 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index a91d8c1e21..1cb6f433e5 100644
> --- a/block/io.c

> @@ -2624,15 +2646,19 @@ int coroutine_fn bdrv_is_allocated(BlockDriverState *bs, int64_t offset,
>    * words, the result is not necessarily the maximum possible range);
>    * but 'pnum' will only be 0 when end of file is reached.
>    *
> + * To be called between exactly one pair of bdrv_inc/dec_in_flight() for top bs.
> + * bdrv_do_is_allocated_above takes care of increasing in_fligth for other block

in_flight

> + * driver states from bs backing chain.
>    */
>   static int coroutine_fn
> -bdrv_co_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
> +bdrv_do_is_allocated_above(BlockDriverState *top, BlockDriverState *base,
>                              bool include_base, int64_t offset, int64_t bytes,
>                              int64_t *pnum)

> @@ -2682,11 +2710,16 @@ typedef struct BdrvCoIsAllocatedAboveData {
>       bool done;
>   } BdrvCoIsAllocatedAboveData;
>   
> +/*
> + * To be called between exactly one pair of bdrv_inc/dec_in_flight() for top bs.
> + * bdrv_do_is_allocated_above takes care of increasing in_fligth for other block
> + * driver states from the backing chain.
> + */
>   static void coroutine_fn bdrv_is_allocated_above_co_entry(void *opaque)

and again

Otherwise looks reasonable to me.  Fixing typos is trivial, so:

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 7/9] block/io: add bdrv_do_pwrite_zeroes
  2020-04-27 14:39 ` [PATCH v2 7/9] block/io: add bdrv_do_pwrite_zeroes Vladimir Sementsov-Ogievskiy
@ 2020-05-01 22:05   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2020-05-01 22:05 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 4/27/20 9:39 AM, Vladimir Sementsov-Ogievskiy wrote:
> We'll need a bdrv_co_pwrite_zeroes version without inc/dec in_flight to
> be used in further implementation of bdrv_make_zero.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   block/io.c | 23 +++++++++++++++++++----
>   1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 1cb6f433e5..e6a8ead46c 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2016,8 +2016,10 @@ int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
>       return ret;
>   }
>   
> -int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
> -                                       int bytes, BdrvRequestFlags flags)
> +/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */
> +static int coroutine_fn
> +bdrv_do_pwrite_zeroes(BdrvChild *child, int64_t offset, int bytes,
> +                      BdrvRequestFlags flags)

I assume your 64-bit conversion series is based on top of this one, and 
therefore this gets cleaned up there to take a 64-bit bytes request.  In 
the meantime, sticking to 32-bit is fine.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 9/9] block/io: expand in_flight inc/dec section: bdrv_make_zero
  2020-04-27 14:39 ` [PATCH v2 9/9] block/io: expand in_flight inc/dec section: bdrv_make_zero Vladimir Sementsov-Ogievskiy
@ 2020-05-01 22:08   ` Eric Blake
  0 siblings, 0 replies; 27+ messages in thread
From: Eric Blake @ 2020-05-01 22:08 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

On 4/27/20 9:39 AM, Vladimir Sementsov-Ogievskiy wrote:
> It's safer to expand in_flight request to start before enter to
> coroutine in synchronous wrappers and end after BDRV_POLL_WHILE loop.
> Note that qemu_coroutine_enter may only schedule the coroutine in some
> circumstances.

See my wording suggestions earlier in the series.

> 
> bdrv_make_zero update includes refactoring: move the whole loop into
> coroutine, which has additional benefit of not create/enter new
> coroutine on each iteration.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>   block/io.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 51 insertions(+), 3 deletions(-)
> 

> +int bdrv_make_zero(BdrvChild *child, BdrvRequestFlags flags)
> +{
> +    int ret;
> +
> +    bdrv_inc_in_flight(child->bs);
> +
> +    if (qemu_in_coroutine()) {
> +        /* Fast-path if already in coroutine context */
> +        ret = bdrv_do_make_zero(child, flags);
> +    } else {
> +        BdrvDoMakeZeroData data = {
> +            .child = child,
> +            .flags = flags,
> +            .done = false,

Another case where the line '.done = false,' is optional, thanks to C 
semantics, but does not hurt to leave it in.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-04-27 14:39 ` [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases Vladimir Sementsov-Ogievskiy
  2020-05-01 21:43   ` Eric Blake
@ 2020-05-06  7:02   ` Vladimir Sementsov-Ogievskiy
  2020-05-18 18:21     ` Vladimir Sementsov-Ogievskiy
  2020-05-19 10:52     ` Kevin Wolf
  2020-05-19 11:04   ` Kevin Wolf
  2 siblings, 2 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-06  7:02 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
> It's safer to expand in_flight request to start before enter to
> coroutine in synchronous wrappers, due to the following (theoretical)
> problem:
> 
> Consider write.
> It's possible, that qemu_coroutine_enter only schedules execution,
> assume such case.
> 
> Then we may possibly have the following:
> 
> 1. Somehow check that we are not in drained section in outer code.
> 
> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
> will protect us from starting drained section.
> 
> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
> in_flight).
> 
> 4. Assume coroutine not yet actually entered, only scheduled, and we go
> to some code, which starts drained section (as in_flight is zero).
> 
> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
> are in drained section with in_flight request.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

Very interesting: this patch breaks test-replication. It hangs:

(gdb) thr a a bt

Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
#0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
#1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
#2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
#3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
#4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
#5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
#6  0x00007eff2f602553 in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
#0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
#1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
#2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
#3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
#4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
#5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
#6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
#7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
#8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
#9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
#10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
#11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
#12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
#13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
#14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
#15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
#16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
#17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
#18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
#19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
#20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
#21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
#22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
#23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
#24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
#25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
#26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
#27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
#28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
#29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
#30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
#31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
#32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
#33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
#34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645


(gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
$5 = 0
(gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
$6 = 0
(gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
$7 = 1
(gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
$8 = 0
(gdb) fr 20
#20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
4252        bdrv_flush(bs);
(gdb) p bs->node_name
$9 = "#block5317", '\000' <repeats 21 times>
(gdb) p bs->drv
$10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
(gdb) p bs->in_flight
$11 = 1
(gdb) p bs->tracked_requests
$12 = {lh_first = 0x0}


So, we entered bdrv_flush at frame 19 and increased in_flight. Then we go to aio_poll and into a nested event loop, and we never return to decrement the in_flight field.
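
In other words, the shape of the hang (an annotated sketch of the call
nesting, derived from the backtrace above, not actual code) is roughly:

bdrv_flush(bs)                 /* inc_in_flight(bs) now done here, in the wrapper */
  BDRV_POLL_WHILE(bs, ...)     /* frame 19: nested event loop */
    job_exit -> mirror_exit_common -> blk_drain -> bdrv_drained_begin
      aio_poll(...)            /* waits for requests to drain, but our inc is
                                  only undone after BDRV_POLL_WHILE returns,
                                  so the loop never finishes */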

Hmm. I'm afraid I don't know what to do with that. Kevin, could you take a look? And could a similar thing happen with the blk layer, because of your recent similar patch?


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-06  7:02   ` Vladimir Sementsov-Ogievskiy
@ 2020-05-18 18:21     ` Vladimir Sementsov-Ogievskiy
  2020-05-19 10:52     ` Kevin Wolf
  1 sibling, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-18 18:21 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

06.05.2020 10:02, Vladimir Sementsov-Ogievskiy wrote:
> 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
>> It's safer to expand in_flight request to start before enter to
>> coroutine in synchronous wrappers, due to the following (theoretical)
>> problem:
>>
>> Consider write.
>> It's possible, that qemu_coroutine_enter only schedules execution,
>> assume such case.
>>
>> Then we may possibly have the following:
>>
>> 1. Somehow check that we are not in drained section in outer code.
>>
>> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
>> will protect us from starting drained section.
>>
>> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
>> in_flight).
>>
>> 4. Assume coroutine not yet actually entered, only scheduled, and we go
>> to some code, which starts drained section (as in_flight is zero).
>>
>> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
>> are in drained section with in_flight request.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Very interesting: this patch breaks test-replication. It hangs:
> 
> (gdb) thr a a bt
> 
> Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
> #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
> #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
> #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
> #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
> #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
> #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
> #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
> #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
> #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
> #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
> #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
> #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
> #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
> #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
> #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
> #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
> #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
> #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
> #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
> #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
> #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
> #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
> #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
> #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
> #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
> #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
> #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
> #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
> #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
> #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
> #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
> #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
> #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
> #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
> #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
> #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
> #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
> 
> 
> (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
> $5 = 0
> (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
> $6 = 0
> (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
> $7 = 1
> (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
> $8 = 0
> (gdb) fr 20
> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> 4252        bdrv_flush(bs);
> (gdb) p bs->node_name
> $9 = "#block5317", '\000' <repeats 21 times>
> (gdb) p bs->drv
> $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
> (gdb) p bs->in_flight
> $11 = 1
> (gdb) p bs->tracked_requests
> $12 = {lh_first = 0x0}
> 
> 
> So, we entered bdrv_flush at frame 19, and increased in_flight. Then we go to aio_poll and to nested event loop, and we never return to decrease in_flight field.
> 
> Hmm. I'm afraid, I don't know what to do with that. Kevin, could you take a look? And could similar thing happen with blk layer, because of you recent similar patch?
> 
> 

ping...

Does someone know how to prevent the nested polling loop? Is my patch wrong, or is it replication, or something more generic?

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-06  7:02   ` Vladimir Sementsov-Ogievskiy
  2020-05-18 18:21     ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 10:52     ` Kevin Wolf
  2020-05-19 11:06       ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 27+ messages in thread
From: Kevin Wolf @ 2020-05-19 10:52 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
> > It's safer to expand in_flight request to start before enter to
> > coroutine in synchronous wrappers, due to the following (theoretical)
> > problem:
> > 
> > Consider write.
> > It's possible, that qemu_coroutine_enter only schedules execution,
> > assume such case.
> > 
> > Then we may possibly have the following:
> > 
> > 1. Somehow check that we are not in drained section in outer code.
> > 
> > 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
> > will protect us from starting drained section.
> > 
> > 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
> > in_flight).
> > 
> > 4. Assume coroutine not yet actually entered, only scheduled, and we go
> > to some code, which starts drained section (as in_flight is zero).
> > 
> > 5. Scheduled coroutine starts, and blindly increases in_flight, and we
> > are in drained section with in_flight request.
> > 
> > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> 
> Very interesting: this patch breaks test-replication. It hangs:
> 
> (gdb) thr a a bt
> 
> Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
> #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
> #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
> #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
> #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
> #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
> #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
> #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
> #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
> #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
> #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
> #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
> #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
> #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
> #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
> #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
> #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
> #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
> #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
> #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
> #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
> #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
> #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
> #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
> #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
> #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
> #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
> #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
> #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
> #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
> #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
> #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
> #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
> #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
> #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
> #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
> #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
> #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
> 
> 
> (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
> $5 = 0
> (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
> $6 = 0
> (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
> $7 = 1
> (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
> $8 = 0
> (gdb) fr 20
> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> 4252        bdrv_flush(bs);
> (gdb) p bs->node_name
> $9 = "#block5317", '\000' <repeats 21 times>
> (gdb) p bs->drv
> $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
> (gdb) p bs->in_flight
> $11 = 1
> (gdb) p bs->tracked_requests
> $12 = {lh_first = 0x0}
> 
> 
> So, we entered bdrv_flush at frame 19, and increased in_flight. Then
> we go to aio_poll and to nested event loop, and we never return to
> decrease in_flight field.
> 
> Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
> take a look? And could similar thing happen with blk layer, because of
> you recent similar patch?

Hmm... You mean blk_prw(), right? Looks like it could have the same
problem, indeed.

Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
(probably to the place where we currently have aio_wait_kick(), which
would already be built in for bdrv_dec_in_flight). This is the last
thing the coroutine does, so presumably it will still be late enough.

Kevin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status
  2020-05-01 22:00   ` Eric Blake
@ 2020-05-19 10:57     ` Kevin Wolf
  0 siblings, 0 replies; 27+ messages in thread
From: Kevin Wolf @ 2020-05-19 10:57 UTC (permalink / raw)
  To: Eric Blake
  Cc: fam, Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel,
	mreitz, stefanha, den

Am 02.05.2020 um 00:00 hat Eric Blake geschrieben:
> On 4/27/20 9:39 AM, Vladimir Sementsov-Ogievskiy wrote:
> > It's safer to expand in_flight request to start before enter to
> > coroutine in synchronous wrappers and end after BDRV_POLL_WHILE loop.
> > Note that qemu_coroutine_enter may only schedule the coroutine in some
> > circumstances.
> 
> Wording suggestion:
> 
> It's safer to expand the region protected by an in_flight request to begin
> in the synchronous wrapper and end after the BDRV_POLL_WHILE loop.  Leaving
> the in_flight request in the coroutine itself risks a race where calling
> qemu_coroutine_enter() may have only scheduled, rather than started, the
> coroutine, allowing some other thread a chance to not realize an operation
> is in flight.
> 
> > 
> > block-status requests are complex, they involve querying different
> > block driver states across backing chain. Let's expand only in_flight
> > section for the top bs, keeping other sections as is.
> 
> block-status requests are complex, involving a query of different block
> driver states across the backing chain.  Let's expand only the in_flight
> section for the top bs, and keep the other sections as-is.
> 
> I'd welcome Kevin's review on my next comment, but if I'm correct, I think
> we can further add the following justification to the commit message:
> 
> Gathering block status only requires reads from the block device, and
> backing devices are typically read-only, so losing any in_flight race on a
> backing device is less likely to cause problems with concurrent
> modifications on the overall backing chain.

Actually, my question is what we gain by increasing in_flight only for
the top level. It feels wrong to me, though maybe it doesn't actually
lead to bugs because in practice, we completely drain the parents
instead of just draining requests going to one specific child.

But as this patch shows, not increasing in_flight in some cases is a lot
more work than doing it, and it's harder to understand why it's correct.
So why not simply increase it unconditionally?

This is how other requests work as well. If you make a read request to a
qcow2 image, you'll get in_flight increased for both the qcow2 node and
the file-posix node.

Kevin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-04-27 14:39 ` [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases Vladimir Sementsov-Ogievskiy
  2020-05-01 21:43   ` Eric Blake
  2020-05-06  7:02   ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 11:04   ` Kevin Wolf
  2 siblings, 0 replies; 27+ messages in thread
From: Kevin Wolf @ 2020-05-19 11:04 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 27.04.2020 um 16:39 hat Vladimir Sementsov-Ogievskiy geschrieben:
> It's safer to expand in_flight request to start before enter to
> coroutine in synchronous wrappers, due to the following (theoretical)
> problem:
> 
> Consider write.
> It's possible, that qemu_coroutine_enter only schedules execution,
> assume such case.
> 
> Then we may possibly have the following:
> 
> 1. Somehow check that we are not in drained section in outer code.
> 
> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
> will protect us from starting drained section.
> 
> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
> in_flight).
> 
> 4. Assume coroutine not yet actually entered, only scheduled, and we go
> to some code, which starts drained section (as in_flight is zero).
> 
> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
> are in drained section with in_flight request.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

> diff --git a/block/io.c b/block/io.c
> index 061f3f2590..a91d8c1e21 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -1511,7 +1511,8 @@ int coroutine_fn bdrv_co_preadv(BdrvChild *child,
>      return bdrv_co_preadv_part(child, offset, bytes, qiov, 0, flags);
>  }
>  
> -int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
> +/* To be called between exactly one pair of bdrv_inc/dec_in_flight() */

You have lots of comments like this one. Isn't this condition too
strict, though?

In the BlockBackend layer, it needs to be true because
blk_wait_while_drained() decreases in_flight only once (which is an ugly
hack, honestly, but it works...). It's comparable to how
AIO_WAIT_WHILE() relies on having locked the context exactly once even
though it is a recursive lock, because it wants to drop the lock
temporarily.
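
For reference, the pattern that makes "exactly one pair" necessary on the
BlockBackend side is roughly the following (paraphrased sketch from memory,
not the verbatim function):

/* blk_wait_while_drained(), approximately: a queued request temporarily
 * gives up its single in_flight reference while the BlockBackend is
 * quiesced and takes it back afterwards -- which only balances if the
 * caller holds exactly one such reference. */
if (blk->quiesce_counter) {
    blk_dec_in_flight(blk);
    qemu_co_queue_wait(&blk->queued_requests, NULL);
    blk_inc_in_flight(blk);
}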

I don't think the same reasoning applies to BDS in_flight, does it?

We can potentially simplify the code if we don't have to fulfill the
condition.

Kevin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-19 10:52     ` Kevin Wolf
@ 2020-05-19 11:06       ` Vladimir Sementsov-Ogievskiy
  2020-05-19 11:16         ` Kevin Wolf
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 11:06 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

19.05.2020 13:52, Kevin Wolf wrote:
> Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
>>> It's safer to expand in_flight request to start before enter to
>>> coroutine in synchronous wrappers, due to the following (theoretical)
>>> problem:
>>>
>>> Consider write.
>>> It's possible, that qemu_coroutine_enter only schedules execution,
>>> assume such case.
>>>
>>> Then we may possibly have the following:
>>>
>>> 1. Somehow check that we are not in drained section in outer code.
>>>
>>> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
>>> will protect us from starting drained section.
>>>
>>> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
>>> in_flight).
>>>
>>> 4. Assume coroutine not yet actually entered, only scheduled, and we go
>>> to some code, which starts drained section (as in_flight is zero).
>>>
>>> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
>>> are in drained section with in_flight request.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>
>> Very interesting: this patch breaks test-replication. It hangs:
>>
>> (gdb) thr a a bt
>>
>> Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
>> #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
>> #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
>> #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
>> #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
>> #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
>> #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
>> #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
>>
>> Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
>> #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
>> #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
>> #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
>> #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
>> #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
>> #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
>> #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
>> #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
>> #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
>> #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
>> #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
>> #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
>> #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
>> #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
>> #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
>> #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
>> #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
>> #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
>> #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
>> #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>> #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
>> #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
>> #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
>> #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
>> #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
>> #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
>> #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
>> #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
>> #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>> #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>> #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>> #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
>> #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
>> #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
>>
>>
>> (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
>> $5 = 0
>> (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
>> $6 = 0
>> (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
>> $7 = 1
>> (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
>> $8 = 0
>> (gdb) fr 20
>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>> 4252        bdrv_flush(bs);
>> (gdb) p bs->node_name
>> $9 = "#block5317", '\000' <repeats 21 times>
>> (gdb) p bs->drv
>> $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
>> (gdb) p bs->in_flight
>> $11 = 1
>> (gdb) p bs->tracked_requests
>> $12 = {lh_first = 0x0}
>>
>>
>> So, we entered bdrv_flush at frame 19, and increased in_flight. Then
>> we go to aio_poll and to nested event loop, and we never return to
>> decrease in_flight field.
>>
>> Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
>> take a look? And could similar thing happen with blk layer, because of
>> you recent similar patch?
> 
> Hmm... You mean blk_prw(), right? Looks like it could have the same
> problem, indeed.
> 
> Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
> (probably to the place where we currently have aio_wait_kick(), which
> would already be built in for bdrv_dec_in_flight). This is the last
> thing the coroutine does, so presumably it will still be late enough.
> 

But moving the "inc" into the coroutine is dangerous too: as we discussed, coroutine_enter may only schedule the coroutine, and something may call drain before the actual "inc".

So, we probably need to move inc/dec into the coroutine and protect them with some proper locking mechanism.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-19 11:06       ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 11:16         ` Kevin Wolf
  2020-05-19 11:25           ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Wolf @ 2020-05-19 11:16 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 19.05.2020 um 13:06 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.05.2020 13:52, Kevin Wolf wrote:
> > Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
> > > > It's safer to expand in_flight request to start before enter to
> > > > coroutine in synchronous wrappers, due to the following (theoretical)
> > > > problem:
> > > > 
> > > > Consider write.
> > > > It's possible, that qemu_coroutine_enter only schedules execution,
> > > > assume such case.
> > > > 
> > > > Then we may possibly have the following:
> > > > 
> > > > 1. Somehow check that we are not in drained section in outer code.
> > > > 
> > > > 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
> > > > will protect us from starting drained section.
> > > > 
> > > > 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
> > > > in_flight).
> > > > 
> > > > 4. Assume coroutine not yet actually entered, only scheduled, and we go
> > > > to some code, which starts drained section (as in_flight is zero).
> > > > 
> > > > 5. Scheduled coroutine starts, and blindly increases in_flight, and we
> > > > are in drained section with in_flight request.
> > > > 
> > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > 
> > > Very interesting: this patch breaks test-replication. It hangs:
> > > 
> > > (gdb) thr a a bt
> > > 
> > > Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
> > > #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
> > > #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
> > > #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
> > > #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
> > > #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
> > > #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
> > > #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
> > > 
> > > Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
> > > #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
> > > #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
> > > #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
> > > #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
> > > #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
> > > #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
> > > #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
> > > #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
> > > #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
> > > #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
> > > #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
> > > #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
> > > #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
> > > #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
> > > #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
> > > #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
> > > #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
> > > #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
> > > #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
> > > #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
> > > #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> > > #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
> > > #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
> > > #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
> > > #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
> > > #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
> > > #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
> > > #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
> > > #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
> > > #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> > > #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> > > #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> > > #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
> > > #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
> > > #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
> > > 
> > > 
> > > (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
> > > $5 = 0
> > > (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
> > > $6 = 0
> > > (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
> > > $7 = 1
> > > (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
> > > $8 = 0
> > > (gdb) fr 20
> > > #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> > > 4252        bdrv_flush(bs);
> > > (gdb) p bs->node_name
> > > $9 = "#block5317", '\000' <repeats 21 times>
> > > (gdb) p bs->drv
> > > $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
> > > (gdb) p bs->in_flight
> > > $11 = 1
> > > (gdb) p bs->tracked_requests
> > > $12 = {lh_first = 0x0}
> > > 
> > > 
> > > So, we entered bdrv_flush at frame 19, and increased in_flight. Then
> > > we go to aio_poll and to nested event loop, and we never return to
> > > decrease in_flight field.
> > > 
> > > Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
> > > take a look? And could similar thing happen with blk layer, because of
> > > you recent similar patch?
> > 
> > Hmm... You mean blk_prw(), right? Looks like it could have the same
> > problem, indeed.
> > 
> > Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
> > (probably to the place where we currently have aio_wait_kick(), which
> > would already be built in for bdrv_dec_in_flight). This is the last
> > thing the coroutine does, so presumably it will still be late enough.
> > 
> 
> But moving "inc" into coroutine is dangerous too, as we discussed that
> coroutine_enter may only schedule the coroutine, and something may
> call drain before actual "inc".

No, I mean moving only the dec, not inc. So inc before entering the
coroutine (outside of it), and dec at the end, but still inside the
coroutine.
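
To make the suggested ordering concrete, a minimal sketch, not the actual
patch: RwCo, bdrv_do_pwritev() and the *_sketch names are simplified
stand-ins, and rwco->ret is assumed to be initialized to NOT_DONE by the
caller. The inc happens in the synchronous wrapper before the coroutine is
created; the dec is the very last thing the coroutine itself does.

static void coroutine_fn bdrv_rw_co_entry_sketch(void *opaque)
{
    RwCo *rwco = opaque;

    rwco->ret = bdrv_do_pwritev(rwco->child, rwco->offset, rwco->qiov,
                                rwco->flags);

    /*
     * dec is the very last thing the coroutine does;
     * bdrv_dec_in_flight() already does the aio_wait_kick()
     */
    bdrv_dec_in_flight(rwco->child->bs);
}

static int bdrv_pwritev_sketch(BdrvChild *child, RwCo *rwco)
{
    Coroutine *co;

    /*
     * inc before entering the coroutine, so a drained section cannot
     * start in the window between scheduling the coroutine and it
     * actually running
     */
    bdrv_inc_in_flight(child->bs);

    co = qemu_coroutine_create(bdrv_rw_co_entry_sketch, rwco);
    bdrv_coroutine_enter(child->bs, co);
    BDRV_POLL_WHILE(child->bs, rwco->ret == NOT_DONE);

    return rwco->ret;
}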

Kevin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 0/9] block/io: safer inc/dec in_flight sections
  2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
                   ` (8 preceding siblings ...)
  2020-04-27 14:39 ` [PATCH v2 9/9] block/io: expand in_flight inc/dec section: bdrv_make_zero Vladimir Sementsov-Ogievskiy
@ 2020-05-19 11:18 ` Vladimir Sementsov-Ogievskiy
  9 siblings, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 11:18 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, fam, qemu-devel, mreitz, stefanha, den

27.04.2020 17:38, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> This is inspired by Kevin's
> "block: Fix blk->in_flight during blk_wait_while_drained()" series.
> 
> So, like it's now done for block-backends, let's expand
> in_flight-protected sections for bdrv_ interfaces, including
> coroutine_enter and BDRV_POLL_WHILE loop into these sections.
> 

OK, let's postpone this thing.

1. The idea to move inc/dec out of the coroutine seems wrong: it leads to a deadlock, as shown in the backtrace in my answer to 5/9 (see the sketch after this list).

2. The idea to keep a request inside exactly one pair of inc/dec is probably an extra restriction in the bdrv layer (I just blindly followed how Kevin did it in the blk layer).

3. We still may have a theoretical race between request start and drained-section start, but that needs another audit and a smarter solution.
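
To illustrate point 1, a rough sketch of the problematic ordering (purely
illustrative; FlushCo, bdrv_flush_co_entry and the _sketch name are
simplified stand-ins for the real code):

int bdrv_flush_sketch(BlockDriverState *bs)
{
    FlushCo flush_co = { .bs = bs, .ret = NOT_DONE };
    Coroutine *co;

    bdrv_inc_in_flight(bs);             /* inc outside the coroutine */

    co = qemu_coroutine_create(bdrv_flush_co_entry, &flush_co);
    bdrv_coroutine_enter(bs, co);

    /*
     * The nested event loop below can dispatch a BH (job_exit in the
     * test-replication backtrace) that starts a drained section.  That
     * drain waits, directly or through the parent/child graph, for our
     * in_flight to reach zero, but it blocks inside this loop, so the
     * dec below is never reached: deadlock.
     */
    BDRV_POLL_WHILE(bs, flush_co.ret == NOT_DONE);

    bdrv_dec_in_flight(bs);             /* dec outside: too late */

    return flush_co.ret;
}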

So it seems we should not apply this series as it is, sorry for the noise. I think I'll resend my 64bit-block-layer series based on master instead of this one.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-19 11:16         ` Kevin Wolf
@ 2020-05-19 11:25           ` Vladimir Sementsov-Ogievskiy
  2020-05-19 14:01             ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 11:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

19.05.2020 14:16, Kevin Wolf wrote:
> Am 19.05.2020 um 13:06 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 19.05.2020 13:52, Kevin Wolf wrote:
>>> Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>> 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
>>>>> It's safer to expand in_flight request to start before enter to
>>>>> coroutine in synchronous wrappers, due to the following (theoretical)
>>>>> problem:
>>>>>
>>>>> Consider write.
>>>>> It's possible, that qemu_coroutine_enter only schedules execution,
>>>>> assume such case.
>>>>>
>>>>> Then we may possibly have the following:
>>>>>
>>>>> 1. Somehow check that we are not in drained section in outer code.
>>>>>
>>>>> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
>>>>> will protect us from starting drained section.
>>>>>
>>>>> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
>>>>> in_flight).
>>>>>
>>>>> 4. Assume coroutine not yet actually entered, only scheduled, and we go
>>>>> to some code, which starts drained section (as in_flight is zero).
>>>>>
>>>>> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
>>>>> are in drained section with in_flight request.
>>>>>
>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>
>>>> Very interesting: this patch breaks test-replication. It hangs:
>>>>
>>>> (gdb) thr a a bt
>>>>
>>>> Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
>>>> #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
>>>> #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
>>>> #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
>>>> #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
>>>> #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
>>>> #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
>>>> #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
>>>>
>>>> Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
>>>> #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
>>>> #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
>>>> #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
>>>> #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
>>>> #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
>>>> #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
>>>> #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
>>>> #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
>>>> #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
>>>> #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
>>>> #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
>>>> #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
>>>> #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
>>>> #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
>>>> #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
>>>> #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
>>>> #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
>>>> #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
>>>> #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
>>>> #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
>>>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>>>> #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
>>>> #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
>>>> #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
>>>> #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
>>>> #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
>>>> #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
>>>> #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
>>>> #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
>>>> #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>> #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>> #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>> #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
>>>> #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
>>>> #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
>>>>
>>>>
>>>> (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
>>>> $5 = 0
>>>> (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
>>>> $6 = 0
>>>> (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
>>>> $7 = 1
>>>> (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
>>>> $8 = 0
>>>> (gdb) fr 20
>>>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>>>> 4252        bdrv_flush(bs);
>>>> (gdb) p bs->node_name
>>>> $9 = "#block5317", '\000' <repeats 21 times>
>>>> (gdb) p bs->drv
>>>> $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
>>>> (gdb) p bs->in_flight
>>>> $11 = 1
>>>> (gdb) p bs->tracked_requests
>>>> $12 = {lh_first = 0x0}
>>>>
>>>>
>>>> So, we entered bdrv_flush at frame 19, and increased in_flight. Then
>>>> we go to aio_poll and to nested event loop, and we never return to
>>>> decrease in_flight field.
>>>>
>>>> Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
>>>> take a look? And could similar thing happen with blk layer, because of
>>>> you recent similar patch?
>>>
>>> Hmm... You mean blk_prw(), right? Looks like it could have the same
>>> problem, indeed.
>>>
>>> Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
>>> (probably to the place where we currently have aio_wait_kick(), which
>>> would already be built in for bdrv_dec_in_flight). This is the last
>>> thing the coroutine does, so presumably it will still be late enough.
>>>
>>
>> But moving "inc" into coroutine is dangerous too, as we discussed that
>> coroutine_enter may only schedule the coroutine, and something may
>> call drain before actual "inc".
> 
> No, I mean moving only the dec, not inc. So inc before entering the
> coroutine (outside of it), and dec at the end, but still inside the
> coroutine.
> 

Hmm, it probably makes sense. Ha, I hastened to answer on the cover letter that this should all be dropped. OK, I'll give it a roll and check your idea, thanks!


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-19 11:25           ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 14:01             ` Vladimir Sementsov-Ogievskiy
  2020-05-19 14:33               ` Kevin Wolf
  0 siblings, 1 reply; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 14:01 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

19.05.2020 14:25, Vladimir Sementsov-Ogievskiy wrote:
> 19.05.2020 14:16, Kevin Wolf wrote:
>> Am 19.05.2020 um 13:06 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>> 19.05.2020 13:52, Kevin Wolf wrote:
>>>> Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> It's safer to expand in_flight request to start before enter to
>>>>>> coroutine in synchronous wrappers, due to the following (theoretical)
>>>>>> problem:
>>>>>>
>>>>>> Consider write.
>>>>>> It's possible, that qemu_coroutine_enter only schedules execution,
>>>>>> assume such case.
>>>>>>
>>>>>> Then we may possibly have the following:
>>>>>>
>>>>>> 1. Somehow check that we are not in drained section in outer code.
>>>>>>
>>>>>> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
>>>>>> will protect us from starting drained section.
>>>>>>
>>>>>> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
>>>>>> in_flight).
>>>>>>
>>>>>> 4. Assume coroutine not yet actually entered, only scheduled, and we go
>>>>>> to some code, which starts drained section (as in_flight is zero).
>>>>>>
>>>>>> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
>>>>>> are in drained section with in_flight request.
>>>>>>
>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>
>>>>> Very interesting: this patch breaks test-replication. It hangs:
>>>>>
>>>>> (gdb) thr a a bt
>>>>>
>>>>> Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
>>>>> #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
>>>>> #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
>>>>> #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
>>>>> #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
>>>>> #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
>>>>> #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
>>>>> #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
>>>>>
>>>>> Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
>>>>> #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
>>>>> #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
>>>>> #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
>>>>> #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
>>>>> #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
>>>>> #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
>>>>> #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
>>>>> #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
>>>>> #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
>>>>> #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
>>>>> #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
>>>>> #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
>>>>> #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
>>>>> #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
>>>>> #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
>>>>> #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
>>>>> #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
>>>>> #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
>>>>> #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
>>>>> #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
>>>>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>>>>> #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
>>>>> #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
>>>>> #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
>>>>> #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
>>>>> #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
>>>>> #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
>>>>> #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
>>>>> #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
>>>>> #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>>> #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>>> #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>>> #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
>>>>> #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
>>>>> #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
>>>>>
>>>>>
>>>>> (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
>>>>> $5 = 0
>>>>> (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
>>>>> $6 = 0
>>>>> (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
>>>>> $7 = 1
>>>>> (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
>>>>> $8 = 0
>>>>> (gdb) fr 20
>>>>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>>>>> 4252        bdrv_flush(bs);
>>>>> (gdb) p bs->node_name
>>>>> $9 = "#block5317", '\000' <repeats 21 times>
>>>>> (gdb) p bs->drv
>>>>> $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
>>>>> (gdb) p bs->in_flight
>>>>> $11 = 1
>>>>> (gdb) p bs->tracked_requests
>>>>> $12 = {lh_first = 0x0}
>>>>>
>>>>>
>>>>> So, we entered bdrv_flush at frame 19, and increased in_flight. Then
>>>>> we go to aio_poll and to nested event loop, and we never return to
>>>>> decrease in_flight field.
>>>>>
>>>>> Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
>>>>> take a look? And could similar thing happen with blk layer, because of
>>>>> you recent similar patch?
>>>>
>>>> Hmm... You mean blk_prw(), right? Looks like it could have the same
>>>> problem, indeed.
>>>>
>>>> Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
>>>> (probably to the place where we currently have aio_wait_kick(), which
>>>> would already be built in for bdrv_dec_in_flight). This is the last
>>>> thing the coroutine does, so presumably it will still be late enough.
>>>>
>>>
>>> But moving "inc" into coroutine is dangerous too, as we discussed that
>>> coroutine_enter may only schedule the coroutine, and something may
>>> call drain before actual "inc".
>>
>> No, I mean moving only the dec, not inc. So inc before entering the
>> coroutine (outside of it), and dec at the end, but still inside the
>> coroutine.
>>
> 
> Hmm. it probably make sense. Ha, I hastened to answer on cover-letter that this all to be dropped. Ok, I'll give it a roll and check your idea, thanks!
> 
> 

I checked, and it helps. I think I'll try to rebase it onto your "[RFC PATCH 1/3] block: Factor out bdrv_run_co()"
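
For reference, a guessed sketch of what such a factored-out helper could
look like; the RFC body is not quoted in this thread, so every detail
below is an assumption rather than Kevin's actual patch:

static int bdrv_run_co_sketch(BlockDriverState *bs, CoroutineEntry *entry,
                              void *opaque, int *ret)
{
    if (qemu_in_coroutine()) {
        /* Already in coroutine context: just call the entry directly */
        entry(opaque);
    } else {
        Coroutine *co = qemu_coroutine_create(entry, opaque);

        *ret = NOT_DONE;
        bdrv_coroutine_enter(bs, co);
        BDRV_POLL_WHILE(bs, *ret == NOT_DONE);
    }

    return *ret;
}

With such a helper, the inc could sit right before bdrv_run_co_sketch()
and the dec at the end of each entry function, matching the pattern
discussed above.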

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-19 14:01             ` Vladimir Sementsov-Ogievskiy
@ 2020-05-19 14:33               ` Kevin Wolf
  2020-05-19 16:54                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 27+ messages in thread
From: Kevin Wolf @ 2020-05-19 14:33 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy
  Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

Am 19.05.2020 um 16:01 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 19.05.2020 14:25, Vladimir Sementsov-Ogievskiy wrote:
> > 19.05.2020 14:16, Kevin Wolf wrote:
> > > Am 19.05.2020 um 13:06 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > 19.05.2020 13:52, Kevin Wolf wrote:
> > > > > Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
> > > > > > 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
> > > > > > > It's safer to expand in_flight request to start before enter to
> > > > > > > coroutine in synchronous wrappers, due to the following (theoretical)
> > > > > > > problem:
> > > > > > > 
> > > > > > > Consider write.
> > > > > > > It's possible, that qemu_coroutine_enter only schedules execution,
> > > > > > > assume such case.
> > > > > > > 
> > > > > > > Then we may possibly have the following:
> > > > > > > 
> > > > > > > 1. Somehow check that we are not in drained section in outer code.
> > > > > > > 
> > > > > > > 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
> > > > > > > will protect us from starting drained section.
> > > > > > > 
> > > > > > > 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
> > > > > > > in_flight).
> > > > > > > 
> > > > > > > 4. Assume coroutine not yet actually entered, only scheduled, and we go
> > > > > > > to some code, which starts drained section (as in_flight is zero).
> > > > > > > 
> > > > > > > 5. Scheduled coroutine starts, and blindly increases in_flight, and we
> > > > > > > are in drained section with in_flight request.
> > > > > > > 
> > > > > > > Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> > > > > > 
> > > > > > Very interesting: this patch breaks test-replication. It hangs:
> > > > > > 
> > > > > > (gdb) thr a a bt
> > > > > > 
> > > > > > Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
> > > > > > #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
> > > > > > #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
> > > > > > #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
> > > > > > #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
> > > > > > #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
> > > > > > #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
> > > > > > #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
> > > > > > 
> > > > > > Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
> > > > > > #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
> > > > > > #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
> > > > > > #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
> > > > > > #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
> > > > > > #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
> > > > > > #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
> > > > > > #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
> > > > > > #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
> > > > > > #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
> > > > > > #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
> > > > > > #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
> > > > > > #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
> > > > > > #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
> > > > > > #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
> > > > > > #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
> > > > > > #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
> > > > > > #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
> > > > > > #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
> > > > > > #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
> > > > > > #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
> > > > > > #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> > > > > > #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
> > > > > > #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
> > > > > > #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
> > > > > > #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
> > > > > > #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
> > > > > > #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
> > > > > > #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
> > > > > > #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
> > > > > > #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> > > > > > #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> > > > > > #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
> > > > > > #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
> > > > > > #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
> > > > > > #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
> > > > > > 
> > > > > > 
> > > > > > (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
> > > > > > $5 = 0
> > > > > > (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
> > > > > > $6 = 0
> > > > > > (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
> > > > > > $7 = 1
> > > > > > (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
> > > > > > $8 = 0
> > > > > > (gdb) fr 20
> > > > > > #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
> > > > > > 4252        bdrv_flush(bs);
> > > > > > (gdb) p bs->node_name
> > > > > > $9 = "#block5317", '\000' <repeats 21 times>
> > > > > > (gdb) p bs->drv
> > > > > > $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
> > > > > > (gdb) p bs->in_flight
> > > > > > $11 = 1
> > > > > > (gdb) p bs->tracked_requests
> > > > > > $12 = {lh_first = 0x0}
> > > > > > 
> > > > > > 
> > > > > > So, we entered bdrv_flush at frame 19, and increased in_flight. Then
> > > > > > we go to aio_poll and to nested event loop, and we never return to
> > > > > > decrease in_flight field.
> > > > > > 
> > > > > > Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
> > > > > > take a look? And could similar thing happen with blk layer, because of
> > > > > > you recent similar patch?
> > > > > 
> > > > > Hmm... You mean blk_prw(), right? Looks like it could have the same
> > > > > problem, indeed.
> > > > > 
> > > > > Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
> > > > > (probably to the place where we currently have aio_wait_kick(), which
> > > > > would already be built in for bdrv_dec_in_flight). This is the last
> > > > > thing the coroutine does, so presumably it will still be late enough.
> > > > > 
> > > > 
> > > > But moving "inc" into coroutine is dangerous too, as we discussed that
> > > > coroutine_enter may only schedule the coroutine, and something may
> > > > call drain before actual "inc".
> > > 
> > > No, I mean moving only the dec, not inc. So inc before entering the
> > > coroutine (outside of it), and dec at the end, but still inside the
> > > coroutine.
> > > 
> > 
> > Hmm. it probably make sense. Ha, I hastened to answer on
> > cover-letter that this all to be dropped. Ok, I'll give it a roll
> > and check your idea, thanks!
> 
> Checked this helps. I think, I'd try to rebase it onto your "[RFC
> PATCH 1/3] block: Factor out bdrv_run_co()"

I haven't received much feedback for that series yet. But I assume
patch 1 won't be contentious, so that makes sense to me.

Kevin



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases
  2020-05-19 14:33               ` Kevin Wolf
@ 2020-05-19 16:54                 ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 27+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-05-19 16:54 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: fam, qemu-block, qemu-devel, mreitz, stefanha, den

19.05.2020 17:33, Kevin Wolf wrote:
> Am 19.05.2020 um 16:01 hat Vladimir Sementsov-Ogievskiy geschrieben:
>> 19.05.2020 14:25, Vladimir Sementsov-Ogievskiy wrote:
>>> 19.05.2020 14:16, Kevin Wolf wrote:
>>>> Am 19.05.2020 um 13:06 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>> 19.05.2020 13:52, Kevin Wolf wrote:
>>>>>> Am 06.05.2020 um 09:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
>>>>>>> 27.04.2020 17:39, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> It's safer to expand in_flight request to start before enter to
>>>>>>>> coroutine in synchronous wrappers, due to the following (theoretical)
>>>>>>>> problem:
>>>>>>>>
>>>>>>>> Consider write.
>>>>>>>> It's possible, that qemu_coroutine_enter only schedules execution,
>>>>>>>> assume such case.
>>>>>>>>
>>>>>>>> Then we may possibly have the following:
>>>>>>>>
>>>>>>>> 1. Somehow check that we are not in drained section in outer code.
>>>>>>>>
>>>>>>>> 2. Call bdrv_pwritev(), assuming that it will increase in_flight, which
>>>>>>>> will protect us from starting drained section.
>>>>>>>>
>>>>>>>> 3. It calls bdrv_prwv_co() -> bdrv_coroutine_enter() (not yet increased
>>>>>>>> in_flight).
>>>>>>>>
>>>>>>>> 4. Assume coroutine not yet actually entered, only scheduled, and we go
>>>>>>>> to some code, which starts drained section (as in_flight is zero).
>>>>>>>>
>>>>>>>> 5. Scheduled coroutine starts, and blindly increases in_flight, and we
>>>>>>>> are in drained section with in_flight request.
>>>>>>>>
>>>>>>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>>
>>>>>>> Very interesting: this patch breaks test-replication. It hangs:
>>>>>>>
>>>>>>> (gdb) thr a a bt
>>>>>>>
>>>>>>> Thread 2 (Thread 0x7eff256cd700 (LWP 2843)):
>>>>>>> #0  0x00007eff2f5fd1fd in syscall () from /lib64/libc.so.6
>>>>>>> #1  0x000055af9a9a4f11 in qemu_futex_wait (f=0x55af9aa6f758 <rcu_call_ready_event>, val=4294967295) at /work/src/qemu/up-expand-bdrv-in_flight-bounds/include/qemu/futex.h:29
>>>>>>> #2  0x000055af9a9a50d5 in qemu_event_wait (ev=0x55af9aa6f758 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
>>>>>>> #3  0x000055af9a9bd20d in call_rcu_thread (opaque=0x0) at util/rcu.c:260
>>>>>>> #4  0x000055af9a9a5288 in qemu_thread_start (args=0x55af9c4f1b80) at util/qemu-thread-posix.c:519
>>>>>>> #5  0x00007eff2f6d44c0 in start_thread () from /lib64/libpthread.so.0
>>>>>>> #6  0x00007eff2f602553 in clone () from /lib64/libc.so.6
>>>>>>>
>>>>>>> Thread 1 (Thread 0x7eff25820a80 (LWP 2842)):
>>>>>>> #0  0x00007eff2f5f7bd6 in ppoll () from /lib64/libc.so.6
>>>>>>> #1  0x000055af9a99e405 in qemu_poll_ns (fds=0x55af9c52a830, nfds=1, timeout=-1) at util/qemu-timer.c:335
>>>>>>> #2  0x000055af9a9a1cab in fdmon_poll_wait (ctx=0x55af9c526890, ready_list=0x7ffc73e8c5d0, timeout=-1) at util/fdmon-poll.c:79
>>>>>>> #3  0x000055af9a9a160c in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:600
>>>>>>> #4  0x000055af9a8f0bb0 in bdrv_do_drained_begin (bs=0x55af9c52a8d0, recursive=false, parent=0x0, ignore_bds_parents=false, poll=true) at block/io.c:429
>>>>>>> #5  0x000055af9a8f0c95 in bdrv_drained_begin (bs=0x55af9c52a8d0) at block/io.c:435
>>>>>>> #6  0x000055af9a8dc6a8 in blk_drain (blk=0x55af9c542c10) at block/block-backend.c:1681
>>>>>>> #7  0x000055af9a8da0b6 in blk_unref (blk=0x55af9c542c10) at block/block-backend.c:473
>>>>>>> #8  0x000055af9a8eb5e7 in mirror_exit_common (job=0x55af9c6c45c0) at block/mirror.c:667
>>>>>>> #9  0x000055af9a8eb9c1 in mirror_prepare (job=0x55af9c6c45c0) at block/mirror.c:765
>>>>>>> #10 0x000055af9a87cd65 in job_prepare (job=0x55af9c6c45c0) at job.c:781
>>>>>>> #11 0x000055af9a87b62a in job_txn_apply (job=0x55af9c6c45c0, fn=0x55af9a87cd28 <job_prepare>) at job.c:158
>>>>>>> #12 0x000055af9a87cdee in job_do_finalize (job=0x55af9c6c45c0) at job.c:798
>>>>>>> #13 0x000055af9a87cfb5 in job_completed_txn_success (job=0x55af9c6c45c0) at job.c:852
>>>>>>> #14 0x000055af9a87d055 in job_completed (job=0x55af9c6c45c0) at job.c:865
>>>>>>> #15 0x000055af9a87d0a8 in job_exit (opaque=0x55af9c6c45c0) at job.c:885
>>>>>>> #16 0x000055af9a99b981 in aio_bh_call (bh=0x55af9c547440) at util/async.c:136
>>>>>>> #17 0x000055af9a99ba8b in aio_bh_poll (ctx=0x55af9c526890) at util/async.c:164
>>>>>>> #18 0x000055af9a9a17ff in aio_poll (ctx=0x55af9c526890, blocking=true) at util/aio-posix.c:650
>>>>>>> #19 0x000055af9a8f7011 in bdrv_flush (bs=0x55af9c53b900) at block/io.c:3019
>>>>>>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>>>>>>> #21 0x000055af9a874ca3 in bdrv_delete (bs=0x55af9c53b900) at block.c:4498
>>>>>>> #22 0x000055af9a877862 in bdrv_unref (bs=0x55af9c53b900) at block.c:5866
>>>>>>> #23 0x000055af9a870837 in bdrv_root_unref_child (child=0x55af9c6c4430) at block.c:2684
>>>>>>> #24 0x000055af9a8da9a2 in blk_remove_bs (blk=0x55af9c547bd0) at block/block-backend.c:803
>>>>>>> #25 0x000055af9a8d9e54 in blk_delete (blk=0x55af9c547bd0) at block/block-backend.c:422
>>>>>>> #26 0x000055af9a8da0f8 in blk_unref (blk=0x55af9c547bd0) at block/block-backend.c:477
>>>>>>> #27 0x000055af9a86a6f1 in teardown_secondary () at tests/test-replication.c:392
>>>>>>> #28 0x000055af9a86aac1 in test_secondary_stop () at tests/test-replication.c:490
>>>>>>> #29 0x00007eff2fd7df7e in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>>>>> #30 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>>>>> #31 0x00007eff2fd7dd24 in g_test_run_suite_internal () from /lib64/libglib-2.0.so.0
>>>>>>> #32 0x00007eff2fd7e46a in g_test_run_suite () from /lib64/libglib-2.0.so.0
>>>>>>> #33 0x00007eff2fd7e485 in g_test_run () from /lib64/libglib-2.0.so.0
>>>>>>> #34 0x000055af9a86b19c in main (argc=1, argv=0x7ffc73e8d088) at tests/test-replication.c:645
>>>>>>>
>>>>>>>
>>>>>>> (gdb) p ((BlockBackend *)0x55af9c547bd0)->in_flight
>>>>>>> $5 = 0
>>>>>>> (gdb) p ((BlockBackend *)0x55af9c542c10)->in_flight
>>>>>>> $6 = 0
>>>>>>> (gdb) p ((BlockDriverState *)0x55af9c53b900)->in_flight
>>>>>>> $7 = 1
>>>>>>> (gdb) p ((BlockDriverState *)0x55af9c52a8d0)->in_flight
>>>>>>> $8 = 0
>>>>>>> (gdb) fr 20
>>>>>>> #20 0x000055af9a874351 in bdrv_close (bs=0x55af9c53b900) at block.c:4252
>>>>>>> 4252        bdrv_flush(bs);
>>>>>>> (gdb) p bs->node_name
>>>>>>> $9 = "#block5317", '\000' <repeats 21 times>
>>>>>>> (gdb) p bs->drv
>>>>>>> $10 = (BlockDriver *) 0x55af9aa63c40 <bdrv_replication>
>>>>>>> (gdb) p bs->in_flight
>>>>>>> $11 = 1
>>>>>>> (gdb) p bs->tracked_requests
>>>>>>> $12 = {lh_first = 0x0}
>>>>>>>
>>>>>>>
>>>>>>> So, we entered bdrv_flush at frame 19, and increased in_flight. Then
>>>>>>> we go to aio_poll and to nested event loop, and we never return to
>>>>>>> decrease in_flight field.
>>>>>>>
>>>>>>> Hmm. I'm afraid, I don't know what to do with that. Kevin, could you
>>>>>>> take a look? And could similar thing happen with blk layer, because of
>>>>>>> you recent similar patch?
>>>>>>
>>>>>> Hmm... You mean blk_prw(), right? Looks like it could have the same
>>>>>> problem, indeed.
>>>>>>
>>>>>> Maybe we need to move the blk/bdrv_dec_in_flight to inside the coroutine
>>>>>> (probably to the place where we currently have aio_wait_kick(), which
>>>>>> would already be built in for bdrv_dec_in_flight). This is the last
>>>>>> thing the coroutine does, so presumably it will still be late enough.
>>>>>>
>>>>>
>>>>> But moving "inc" into coroutine is dangerous too, as we discussed that
>>>>> coroutine_enter may only schedule the coroutine, and something may
>>>>> call drain before actual "inc".
>>>>
>>>> No, I mean moving only the dec, not inc. So inc before entering the
>>>> coroutine (outside of it), and dec at the end, but still inside the
>>>> coroutine.
>>>>
>>>
>>> Hmm. it probably make sense. Ha, I hastened to answer on
>>> cover-letter that this all to be dropped. Ok, I'll give it a roll
>>> and check your idea, thanks!
>>
>> Checked this helps. I think, I'd try to rebase it onto your "[RFC
>> PATCH 1/3] block: Factor out bdrv_run_co()"
> 
> I haven't received much feedback for that series yet. But I assume
> patch 1 won't be contentious, so that makes sense to me.
> 


I still have a question.

OK, what do we achieve with this series?

We guarantee that if someone calls e.g. bdrv_pwrite, it increases in_flight immediately, and a drained section will not start until the request has finished. But what guarantees that the user will not call bdrv_pwrite when already inside a drained section? The blk layer guarantees it: it has the blk_wait_while_drained() guard. But what about all other users?

Some users just issue a nested request on a child node: this should be safe anyway, as the drained section will not start until the parent request finishes.

blk users are protected by blk_wait_while_drained() and by blk->in_flight
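
For context, roughly how that guard works in block/block-backend.c after
Kevin's series (reproduced from memory, so details may differ slightly):
a request that arrives while the backend is quiesced drops its in_flight
reference and parks on a queue until the drained section ends.

static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
{
    /* Called with blk->in_flight already incremented by the caller */
    assert(blk->in_flight > 0);

    if (blk->quiesce_counter && !blk->disable_request_queuing) {
        /*
         * Drop our in_flight reference so the drain can finish, then
         * wait on the queue until the drained section ends.
         */
        blk_dec_in_flight(blk);
        qemu_co_queue_wait(&blk->queued_requests, NULL);
        blk_inc_in_flight(blk);
    }
}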

And the other users are most probably broken anyway, and this series doesn't fix them.

So, I again doubt that we need this series.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-05-19 16:56 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-27 14:38 [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
2020-04-27 14:38 ` [PATCH v2 1/9] block/io: refactor bdrv_is_allocated_above to run only one coroutine Vladimir Sementsov-Ogievskiy
2020-05-01 21:25   ` Eric Blake
2020-04-27 14:39 ` [PATCH v2 2/9] block/io: refactor bdrv_co_ioctl: move aio stuff to corresponding block Vladimir Sementsov-Ogievskiy
2020-04-27 14:39 ` [PATCH v2 3/9] block/io: move flush and pdiscard stuff down Vladimir Sementsov-Ogievskiy
2020-04-27 14:39 ` [PATCH v2 4/9] block/io: move bdrv_rw_co_entry and friends down Vladimir Sementsov-Ogievskiy
2020-04-27 14:39 ` [PATCH v2 5/9] block/io: expand in_flight inc/dec section: simple cases Vladimir Sementsov-Ogievskiy
2020-05-01 21:43   ` Eric Blake
2020-05-06  7:02   ` Vladimir Sementsov-Ogievskiy
2020-05-18 18:21     ` Vladimir Sementsov-Ogievskiy
2020-05-19 10:52     ` Kevin Wolf
2020-05-19 11:06       ` Vladimir Sementsov-Ogievskiy
2020-05-19 11:16         ` Kevin Wolf
2020-05-19 11:25           ` Vladimir Sementsov-Ogievskiy
2020-05-19 14:01             ` Vladimir Sementsov-Ogievskiy
2020-05-19 14:33               ` Kevin Wolf
2020-05-19 16:54                 ` Vladimir Sementsov-Ogievskiy
2020-05-19 11:04   ` Kevin Wolf
2020-04-27 14:39 ` [PATCH v2 6/9] block/io: expand in_flight inc/dec section: block-status Vladimir Sementsov-Ogievskiy
2020-05-01 22:00   ` Eric Blake
2020-05-19 10:57     ` Kevin Wolf
2020-04-27 14:39 ` [PATCH v2 7/9] block/io: add bdrv_do_pwrite_zeroes Vladimir Sementsov-Ogievskiy
2020-05-01 22:05   ` Eric Blake
2020-04-27 14:39 ` [PATCH v2 8/9] block/io: move bdrv_make_zero under block-status Vladimir Sementsov-Ogievskiy
2020-04-27 14:39 ` [PATCH v2 9/9] block/io: expand in_flight inc/dec section: bdrv_make_zero Vladimir Sementsov-Ogievskiy
2020-05-01 22:08   ` Eric Blake
2020-05-19 11:18 ` [PATCH v2 0/9] block/io: safer inc/dec in_flight sections Vladimir Sementsov-Ogievskiy
