* [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel
@ 2021-08-06  9:38 Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort() Max Reitz
                   ` (11 more replies)
  0 siblings, 12 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Hi,

v1 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00705.html

v2 cover letter:
https://lists.nongnu.org/archive/html/qemu-block/2021-07/msg00747.html

Changes in v3:
- Patch 1: After adding patch 11, I got a failed assertion in
  tests/unit/test-block-iothread (failing qemu_mutex_unlock_impl()).
  That is because before patch 11, for zero-length source devices,
  mirror clears .cancelled unconditionally before exiting.  So even
  force-cancelled jobs are considered to be completed normally, which
  doesn’t seem quite right.
  Anyway, test-block-iothread does some iothread switching, and
  cancelling jobs is not really prepared for that.  This patch fixes
  that (I hope...).

- Patch 4: Split off from patch 5

- Patch 7:
  - Added a long section in the commit message detailing every choice
    for every job_is_cancelled() invocation
  - Use job_cancel_requested() in the assertion in
    job_completed_txn_abort(), because it is not quite clear whether
    soft-cancelled mirror jobs can end up in this path (it seems like a
    bug if that happens, but I think that’s something to fix in some
    other series)

- Patch 8: Added: This is kind of preparation for patch 9, but also just
  a bug fix in itself, I believe

- Patch 9: Moved the job_is_cancelled() check after the last yield point
  before the mirror_iteration() call

- Patch 10: Added: If force-cancelled jobs should not generate new I/O
  requests at all (except for forwarding something to the source
  device), then we need to stop doing active mirroring once the mirror
  job is force-cancelled

- Patch 11: Added: Clearing .cancelled seemed like a hack, so getting
  rid of it seems like a good thing to do
  (And only with this patch, I can assert that .force_cancel can only be
  true when .cancelled is true also; if we tried it before this patch,
  tests/unit/test-block-iothread would fail.)


The discussion around v2 has shown that there are probably more bugs in
the job code, but I think this series is becoming long enough that we
should tackle those in a different series.


git-backport-diff against v1:

Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/12:[down] 'job: Context changes in job_completed_txn_abort()'
002/12:[----] [--] 'mirror: Keep s->synced on error'
003/12:[----] [--] 'mirror: Drop s->synced'
004/12:[down] 'job: Force-cancel jobs in a failed transaction'
005/12:[0007] [FC] 'job: @force parameter for job_cancel_sync{,_all}()'
006/12:[----] [--] 'jobs: Give Job.force_cancel more meaning'
007/12:[0002] [FC] 'job: Add job_cancel_requested()'
008/12:[down] 'mirror: Use job_is_cancelled()'
009/12:[0007] [FC] 'mirror: Check job_is_cancelled() earlier'
010/12:[down] 'mirror: Stop active mirroring after force-cancel'
011/12:[down] 'mirror: Do not clear .cancelled'
012/12:[----] [--] 'iotests: Add mirror-ready-cancel-error test'


Max Reitz (12):
  job: Context changes in job_completed_txn_abort()
  mirror: Keep s->synced on error
  mirror: Drop s->synced
  job: Force-cancel jobs in a failed transaction
  job: @force parameter for job_cancel_sync{,_all}()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Use job_is_cancelled()
  mirror: Check job_is_cancelled() earlier
  mirror: Stop active mirroring after force-cancel
  mirror: Do not clear .cancelled
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h                            |  29 +++-
 block/backup.c                                |   3 +-
 block/mirror.c                                |  56 ++++---
 block/replication.c                           |   4 +-
 blockdev.c                                    |   4 +-
 job.c                                         |  67 ++++++--
 qemu-nbd.c                                    |   2 +-
 softmmu/runstate.c                            |   2 +-
 storage-daemon/qemu-storage-daemon.c          |   2 +-
 tests/unit/test-block-iothread.c              |   2 +-
 tests/unit/test-blockjob.c                    |   2 +-
 tests/qemu-iotests/109.out                    |  60 +++-----
 .../tests/mirror-ready-cancel-error           | 143 ++++++++++++++++++
 .../tests/mirror-ready-cancel-error.out       |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out         |   2 +-
 15 files changed, 292 insertions(+), 91 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1




* [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 19:16   ` Eric Blake
  2021-09-01 10:05   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 02/12] mirror: Keep s->synced on error Max Reitz
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Finalizing the job may cause its AioContext to change.  This is noted by
job_exit(), which points at job_txn_apply() to take this fact into
account.

However, job_completed() does not necessarily invoke job_txn_apply()
(through job_completed_txn_success()), but potentially also
job_completed_txn_abort().  The latter stores the context in a local
variable, and so always acquires the same context at its end that it has
released in the beginning -- which may be a different context from the
one that job_exit() releases at its end.  If it is different, qemu
aborts ("qemu_mutex_unlock_impl: Operation not permitted").

Drop the local @outer_ctx variable from job_completed_txn_abort(), and
instead re-acquire the actual job's context at the end of the function,
so job_exit() will release the same.
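
The mismatch described above can be reproduced in miniature. The following is a minimal sketch with hypothetical stand-in types (ToyContext, ToyJob and every toy_* function are assumptions made for illustration, not QEMU's AioContext or Job API). It contrasts re-acquiring a context remembered at entry with re-acquiring the context the job has at the end:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not QEMU's AioContext or Job. */
typedef struct ToyContext {
    int held;               /* how often the current thread holds the lock */
} ToyContext;

typedef struct ToyJob {
    ToyContext *ctx;        /* may be changed by finalization */
} ToyJob;

static void toy_acquire(ToyContext *c)
{
    c->held++;
}

/* Returns false where a real mutex unlock would fail (lock not held). */
static bool toy_release(ToyContext *c)
{
    if (c->held == 0) {
        return false;
    }
    c->held--;
    return true;
}

/* Finalization moves the job to a different context. */
static void toy_finalize(ToyJob *job, ToyContext *new_ctx)
{
    job->ctx = new_ctx;
}

/* Old pattern: re-acquire the context that was remembered at entry. */
static void toy_abort_stale(ToyJob *job, ToyContext *new_ctx)
{
    ToyContext *outer = job->ctx;

    (void)toy_release(outer);
    toy_finalize(job, new_ctx);
    toy_acquire(outer);
}

/* New pattern: re-acquire whatever context the job has at the end. */
static void toy_abort_fresh(ToyJob *job, ToyContext *new_ctx)
{
    (void)toy_release(job->ctx);
    toy_finalize(job, new_ctx);
    toy_acquire(job->ctx);
}

/*
 * Models the caller: it locks the job's context, runs the abort path,
 * and finally releases the job's *current* context.  The return value
 * says whether that final release was legal.
 */
static bool toy_exit(ToyJob *job, ToyContext *new_ctx, bool fresh)
{
    toy_acquire(job->ctx);
    if (fresh) {
        toy_abort_fresh(job, new_ctx);
    } else {
        toy_abort_stale(job, new_ctx);
    }
    return toy_release(job->ctx);
}
```

In this model, toy_exit(..., false) ends up releasing a context it never acquired, which corresponds to the failed unlock quoted above, while toy_exit(..., true) releases the context that the abort path re-acquired.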

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 job.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/job.c b/job.c
index e7a5d28854..3fe23bb77e 100644
--- a/job.c
+++ b/job.c
@@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force)
 
 static void job_completed_txn_abort(Job *job)
 {
-    AioContext *outer_ctx = job->aio_context;
     AioContext *ctx;
     JobTxn *txn = job->txn;
     Job *other_job;
@@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job)
     txn->aborting = true;
     job_txn_ref(txn);
 
-    /* We can only hold the single job's AioContext lock while calling
+    /*
+     * We can only hold the single job's AioContext lock while calling
      * job_finalize_single() because the finalization callbacks can involve
-     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */
-    aio_context_release(outer_ctx);
+     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
+     * Note that the job's AioContext may change when it is finalized.
+     */
+    job_ref(job);
+    aio_context_release(job->aio_context);
 
     /* Other jobs are effectively cancelled by us, set the status for
      * them; this job, however, may or may not be cancelled, depending
@@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job)
     }
     while (!QLIST_EMPTY(&txn->jobs)) {
         other_job = QLIST_FIRST(&txn->jobs);
+        /*
+         * The job's AioContext may change, so store it in @ctx so we
+         * release the same context that we have acquired before.
+         */
         ctx = other_job->aio_context;
         aio_context_acquire(ctx);
         if (!job_is_completed(other_job)) {
@@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job)
         aio_context_release(ctx);
     }
 
-    aio_context_acquire(outer_ctx);
+    /*
+     * Use job_ref()/job_unref() so we can read the AioContext here
+     * even if the job went away during job_finalize_single().
+     */
+    ctx = job->aio_context;
+    job_unref(job);
+    aio_context_acquire(ctx);
 
     job_txn_unref(txn);
 }
-- 
2.31.1




* [PATCH for-6.2 v3 02/12] mirror: Keep s->synced on error
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort() Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 03/12] mirror: Drop s->synced Max Reitz
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_synced is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.
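
To make the forbidden transition concrete, here is a minimal sketch of a two-state job. ToyMirror, ToyStatus and the toy_* functions are hypothetical and heavily simplified; they only model the interaction between the synced flag and the READY transition, not the real mirror_run() loop:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_RUNNING, TOY_READY } ToyStatus;

/* Hypothetical, simplified model of the mirror job state. */
typedef struct ToyMirror {
    ToyStatus status;
    bool synced;            /* "READY has been reached at some point" */
    bool actively_synced;   /* "source and target are in sync right now" */
    int bad_transitions;    /* attempted READY -> READY transitions */
} ToyMirror;

static void toy_transition_to_ready(ToyMirror *m)
{
    if (m->status == TOY_READY) {
        /* Stands in for the disallowed READY -> READY transition. */
        m->bad_transitions++;
        return;
    }
    m->status = TOY_READY;
}

/* One loop pass in which nothing is dirty and nothing is in flight. */
static void toy_iteration_in_sync(ToyMirror *m)
{
    if (!m->synced) {
        toy_transition_to_ready(m);
        m->synced = true;
    }
    m->actively_synced = true;
}

/* Error handling, with or without the reset that the patch removes. */
static void toy_error(ToyMirror *m, bool reset_synced)
{
    if (reset_synced) {
        m->synced = false;
    }
    m->actively_synced = false;
}

/* Reach READY, hit an error, get back in sync; count bad transitions. */
static int toy_error_after_ready(bool reset_synced)
{
    ToyMirror m = { TOY_RUNNING, false, false, 0 };

    toy_iteration_in_sync(&m);      /* RUNNING -> READY */
    toy_error(&m, reset_synced);
    toy_iteration_in_sync(&m);      /* error resolved, in sync again */
    return m.bad_transitions;
}
```

With the reset in place, toy_error_after_ready(true) records one attempted READY to READY transition; without it, toy_error_after_ready(false) records none.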

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
                                             int error)
 {
-    s->synced = false;
     s->actively_synced = false;
     if (read) {
         return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1




* [PATCH for-6.2 v3 03/12] mirror: Drop s->synced
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort() Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 02/12] mirror: Keep s->synced on error Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction Max Reitz
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

As of HEAD^, there is no meaning to s->synced other than whether the job
is READY or not.  job_is_ready() gives us that information, too.

Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 block/mirror.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..fcb7b65f93 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -56,7 +56,6 @@ typedef struct MirrorBlockJob {
     bool zero_target;
     MirrorCopyMode copy_mode;
     BlockdevOnError on_source_error, on_target_error;
-    bool synced;
     /* Set when the target is synced (dirty bitmap is clean, nothing
      * in flight) and the job is running in active mode */
     bool actively_synced;
@@ -936,7 +935,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
     if (s->bdev_length == 0) {
         /* Transition to the READY state and wait for complete. */
         job_transition_to_ready(&s->common.job);
-        s->synced = true;
         s->actively_synced = true;
         while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
             job_yield(&s->common.job);
@@ -1028,7 +1026,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         should_complete = false;
         if (s->in_flight == 0 && cnt == 0) {
             trace_mirror_before_flush(s);
-            if (!s->synced) {
+            if (!job_is_ready(&s->common.job)) {
                 if (mirror_flush(s) < 0) {
                     /* Go check s->ret.  */
                     continue;
@@ -1039,7 +1037,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
                  * the target in a consistent state.
                  */
                 job_transition_to_ready(&s->common.job);
-                s->synced = true;
                 if (s->copy_mode != MIRROR_COPY_MODE_BACKGROUND) {
                     s->actively_synced = true;
                 }
@@ -1083,14 +1080,15 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
         ret = 0;
 
-        if (s->synced && !should_complete) {
+        if (job_is_ready(&s->common.job) && !should_complete) {
             delay_ns = (s->in_flight == 0 &&
                         cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
         }
-        trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
+        trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
+                                  delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
         if (job_is_cancelled(&s->common.job) &&
-            (!s->synced || s->common.job.force_cancel))
+            (!job_is_ready(&s->common.job) || s->common.job.force_cancel))
         {
             break;
         }
@@ -1103,8 +1101,9 @@ immediate_exit:
          * or it was cancelled prematurely so that we do not guarantee that
          * the target is a copy of the source.
          */
-        assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
-               job_is_cancelled(&s->common.job)));
+        assert(ret < 0 ||
+               ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+                job_is_cancelled(&s->common.job)));
         assert(need_drain);
         mirror_wait_for_all_io(s);
     }
@@ -1127,7 +1126,7 @@ static void mirror_complete(Job *job, Error **errp)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
 
-    if (!s->synced) {
+    if (!job_is_ready(job)) {
         error_setg(errp, "The active block job '%s' cannot be completed",
                    job->id);
         return;
-- 
2.31.1




* [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (2 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 03/12] mirror: Drop s->synced Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 19:22   ` Eric Blake
  2021-09-01 10:08   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Max Reitz
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

When a transaction is aborted, no result matters, and so all jobs within
should be force-cancelled.
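
As an illustration of that rule, the following minimal sketch uses a hypothetical ToyJob with only the two flags involved. The toy_* names are assumptions made for this example; the real code additionally takes each job's AioContext lock around the call:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in carrying only the two cancel flags. */
typedef struct ToyJob {
    bool cancelled;
    bool force_cancel;
} ToyJob;

static void toy_cancel_async(ToyJob *job, bool force)
{
    job->cancelled = true;
    /* A later force=false request must not undo an earlier force=true. */
    job->force_cancel |= force;
}

/* Abort a transaction of @n jobs in which job number @failed has failed. */
static void toy_txn_abort(ToyJob *jobs, size_t n, size_t failed)
{
    for (size_t i = 0; i < n; i++) {
        if (i != failed) {
            /* No result matters any more, so terminate quickly. */
            toy_cancel_async(&jobs[i], true);
        }
    }
}
```

After toy_txn_abort(), every job except the failed one has both flags set; the failed job itself is left untouched, because it may or may not have been cancelled.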

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 job.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/job.c b/job.c
index 3fe23bb77e..24e7c4fcb7 100644
--- a/job.c
+++ b/job.c
@@ -766,7 +766,12 @@ static void job_completed_txn_abort(Job *job)
         if (other_job != job) {
             ctx = other_job->aio_context;
             aio_context_acquire(ctx);
-            job_cancel_async(other_job, false);
+            /*
+             * This is a transaction: If one job failed, no result will matter.
+             * Therefore, pass force=true to terminate all other jobs as quickly
+             * as possible.
+             */
+            job_cancel_async(other_job, true);
             aio_context_release(ctx);
         }
     }
-- 
2.31.1




* [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (3 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 19:39   ` Eric Blake
                     ` (2 more replies)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 06/12] jobs: Give Job.force_cancel more meaning Max Reitz
                   ` (6 subsequent siblings)
  11 siblings, 3 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)
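
The difference the @force parameter makes for a caller can be sketched as follows. This is a minimal model under stated assumptions: ToyJob and toy_cancel_sync() are hypothetical, and the job is assumed to behave like a mirror job, where a soft cancel in the READY phase lets the job complete instead of cancelling it:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mirror-like job. */
typedef struct ToyJob {
    bool ready;     /* job has reached the READY phase */
    int ret;        /* result the job would report if it completed */
} ToyJob;

/*
 * Returns the job's own result if it completed during the call, or
 * -ECANCELED if it was cancelled.
 */
static int toy_cancel_sync(const ToyJob *job, bool force)
{
    if (!force && job->ready) {
        /* Soft cancel of a READY job: it completes instead. */
        return job->ret;
    }
    return -ECANCELED;
}
```

A shutdown path that only wants the job gone would call toy_cancel_sync(job, true) and always see -ECANCELED. A caller that cares about a consistent result passes false and may get the job's own return value back.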

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/qemu/job.h                    | 10 ++---
 block/replication.c                   |  4 +-
 blockdev.c                            |  4 +-
 job.c                                 | 20 +++++++--
 qemu-nbd.c                            |  2 +-
 softmmu/runstate.c                    |  2 +-
 storage-daemon/qemu-storage-daemon.c  |  2 +-
 tests/unit/test-block-iothread.c      |  2 +-
 tests/unit/test-blockjob.c            |  2 +-
 tests/qemu-iotests/109.out            | 60 +++++++++++----------------
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 11 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..5e8edbc2c8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
 /** Synchronously cancels all jobs using job_cancel_sync(). */
-void job_cancel_sync_all(void);
+void job_cancel_sync_all(bool force);
 
 /**
  * @job: The job to be completed.
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..e7a9327b12 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
     if (s->stage == BLOCK_REPLICATION_FAILOVER) {
         commit_job = &s->commit_job->job;
         assert(commit_job->aio_context == qemu_get_current_aio_context());
-        job_cancel_sync(commit_job);
+        job_cancel_sync(commit_job, false);
     }
 
     if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
          * disk, secondary disk in backup_job_completed().
          */
         if (s->backup_job) {
-            job_cancel_sync(&s->backup_job->job);
+            job_cancel_sync(&s->backup_job->job, false);
         }
 
         if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index 3d8ac368a1..aa95918c02 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1848,7 +1848,7 @@ static void drive_backup_abort(BlkActionState *common)
         aio_context = bdrv_get_aio_context(state->bs);
         aio_context_acquire(aio_context);
 
-        job_cancel_sync(&state->job->job);
+        job_cancel_sync(&state->job->job, true);
 
         aio_context_release(aio_context);
     }
@@ -1949,7 +1949,7 @@ static void blockdev_backup_abort(BlkActionState *common)
         aio_context = bdrv_get_aio_context(state->bs);
         aio_context_acquire(aio_context);
 
-        job_cancel_sync(&state->job->job);
+        job_cancel_sync(&state->job->job, true);
 
         aio_context_release(aio_context);
     }
diff --git a/job.c b/job.c
index 24e7c4fcb7..1b68a7a983 100644
--- a/job.c
+++ b/job.c
@@ -982,12 +982,24 @@ static void job_cancel_err(Job *job, Error **errp)
     job_cancel(job, false);
 }
 
-int job_cancel_sync(Job *job)
+/**
+ * Same as job_cancel_err(), but force-cancel.
+ */
+static void job_force_cancel_err(Job *job, Error **errp)
 {
-    return job_finish_sync(job, &job_cancel_err, NULL);
+    job_cancel(job, true);
+}
+
+int job_cancel_sync(Job *job, bool force)
+{
+    if (force) {
+        return job_finish_sync(job, &job_force_cancel_err, NULL);
+    } else {
+        return job_finish_sync(job, &job_cancel_err, NULL);
+    }
 }
 
-void job_cancel_sync_all(void)
+void job_cancel_sync_all(bool force)
 {
     Job *job;
     AioContext *aio_context;
@@ -995,7 +1007,7 @@ void job_cancel_sync_all(void)
     while ((job = job_next(NULL))) {
         aio_context = job->aio_context;
         aio_context_acquire(aio_context);
-        job_cancel_sync(job);
+        job_cancel_sync(job, force);
         aio_context_release(aio_context);
     }
 }
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 26ffbf15af..7fadfcfd23 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -479,7 +479,7 @@ static const char *socket_activation_validate_opts(const char *device,
 
 static void qemu_nbd_shutdown(void)
 {
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
     blk_exp_close_all();
     bdrv_close_all();
 }
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365a..cf239e3b4c 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -812,7 +812,7 @@ void qemu_cleanup(void)
     vm_shutdown();
     replay_finish();
 
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
     bdrv_close_all();
 
     /* vhost-user must be cleaned up before chardevs.  */
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index fc8b150629..6c7142574c 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -347,7 +347,7 @@ int main(int argc, char *argv[])
 
     blk_exp_close_all();
     bdrv_drain_all_begin();
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
     bdrv_close_all();
 
     monitor_cleanup();
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index c39e70b2f5..09807fd2ca 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -662,7 +662,7 @@ static void test_propagate_mirror(void)
     g_assert(bdrv_get_aio_context(target) == ctx);
     g_assert(bdrv_get_aio_context(filter) == ctx);
 
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
 
     aio_context_acquire(ctx);
     blk_set_aio_context(blk, main_ctx, &error_abort);
diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
index dcacfa6c7c..4c9e1bf1e5 100644
--- a/tests/unit/test-blockjob.c
+++ b/tests/unit/test-blockjob.c
@@ -230,7 +230,7 @@ static void cancel_common(CancelJob *s)
     ctx = job->job.aio_context;
     aio_context_acquire(ctx);
 
-    job_cancel_sync(&job->job);
+    job_cancel_sync(&job->job, true);
     if (sts != JOB_STATUS_CREATED && sts != JOB_STATUS_CONCLUDED) {
         Job *dummy = &job->job;
         job_dismiss(&dummy, &error_abort);
diff --git a/tests/qemu-iotests/109.out b/tests/qemu-iotests/109.out
index 8f839b4b7f..e29280015e 100644
--- a/tests/qemu-iotests/109.out
+++ b/tests/qemu-iotests/109.out
@@ -44,9 +44,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -95,9 +94,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 197120, "offset": 197120, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 197120, "offset": 197120, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -146,9 +144,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -197,9 +194,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -248,9 +244,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -299,9 +294,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -349,9 +343,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -399,9 +392,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 31457280, "offset": 31457280, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 31457280, "offset": 31457280, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -449,9 +441,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -499,9 +490,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 2048, "offset": 2048, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 2048, "offset": 2048, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -529,9 +519,8 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -552,9 +541,8 @@ Images are identical.
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
diff --git a/tests/qemu-iotests/tests/qsd-jobs.out b/tests/qemu-iotests/tests/qsd-jobs.out
index 189423354b..c1bc9b8356 100644
--- a/tests/qemu-iotests/tests/qsd-jobs.out
+++ b/tests/qemu-iotests/tests/qsd-jobs.out
@@ -8,7 +8,7 @@ QMP_VERSION
 {"return": {}}
 {"return": {}}
 {"return": {}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
 
 === Streaming can't get permission on base node ===
 
-- 
2.31.1




* [PATCH for-6.2 v3 06/12] jobs: Give Job.force_cancel more meaning
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (4 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{, _all}() Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested() Max Reitz
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled; cancelling it is just a different completion mode.  The
job can still run for an indefinite amount of time while it tries to
reach a consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true.  Clearly, though, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.
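
As an illustration only, the contract described above can be modelled
outside of QEMU.  The following is a minimal sketch, not the actual
job.c code: ToyJob, ToyJobDriver and toy_job_cancel() are hypothetical
stand-ins, and the way .cancelled and .force_cancel are updated is an
assumption made for the sake of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for Job and JobDriver; not the real QEMU definitions. */
typedef struct ToyJob ToyJob;

typedef struct ToyJobDriver {
    /* Returns true if the job will terminate without further I/O. */
    bool (*cancel)(ToyJob *job, bool force);
} ToyJobDriver;

struct ToyJob {
    const ToyJobDriver *driver;
    bool ready;         /* models job_is_ready() */
    bool cancelled;
    bool force_cancel;
};

/* Mirror-like driver: a soft cancel after READY is a completion mode. */
static bool toy_mirror_cancel(ToyJob *job, bool force)
{
    return force || !job->ready;
}

static const ToyJobDriver toy_mirror_driver = { .cancel = toy_mirror_cancel };

/* Driver without .cancel(): assumed to terminate as soon as possible. */
static const ToyJobDriver toy_plain_driver = { .cancel = NULL };

/* Simplified model of the generic cancel path (assumed bookkeeping). */
static void toy_job_cancel(ToyJob *job, bool force)
{
    assert(job && job->driver);

    if (job->driver->cancel) {
        /* The driver decides whether this is a true force-cancel. */
        force = job->driver->cancel(job, force);
    } else {
        /* No callback: behave as if force-cancelled. */
        force = true;
    }

    job->cancelled = true;
    job->force_cancel |= force;
}
```

In this model, cancelling a READY mirror-like job with force=false sets
.cancelled but leaves .force_cancel unset, whereas the same request
before READY, or any request to a job without .cancel(), sets both.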

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
---
 include/qemu/job.h | 11 ++++++++++-
 block/backup.c     |  3 ++-
 block/mirror.c     | 24 ++++++++++++++++++------
 job.c              |  6 +++++-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5e8edbc2c8..8aa90f7395 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
     /**
      * If the callback is not NULL, it will be invoked in job_cancel_async
+     *
+     * This function must return true if the job will be cancelled
+     * immediately without any further I/O (mandatory if @force is
+     * true), and false otherwise.  This lets the generic job layer
+     * know whether a job has been truly (force-)cancelled, or whether
+     * it is just in a special completion mode (like mirror after
+     * READY).
+     * (If the callback is NULL, the job is assumed to terminate
+     * without I/O.)
      */
-    void (*cancel)(Job *job, bool force);
+    bool (*cancel)(Job *job, bool force);
 
 
     /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..513e1c8a0b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,11 +331,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
     }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
     bdrv_cancel_in_flight(s->target_bs);
+    return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index fcb7b65f93..e93631a9f6 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1087,9 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
                                   delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job) &&
-            (!job_is_ready(&s->common.job) || s->common.job.force_cancel))
-        {
+        if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
             break;
         }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1102,7 +1100,7 @@ immediate_exit:
          * the target is a copy of the source.
          */
         assert(ret < 0 ||
-               ((s->common.job.force_cancel || !job_is_ready(&s->common.job)) &&
+               (s->common.job.force_cancel &&
                 job_is_cancelled(&s->common.job)));
         assert(need_drain);
         mirror_wait_for_all_io(s);
@@ -1188,14 +1186,27 @@ static bool mirror_drained_poll(BlockJob *job)
     return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
     BlockDriverState *target = blk_bs(s->target);
 
-    if (force || !job_is_ready(job)) {
+    /*
+     * Before the job is READY, we treat any cancellation like a
+     * force-cancellation.
+     */
+    force = force || !job_is_ready(job);
+
+    if (force) {
         bdrv_cancel_in_flight(target);
     }
+    return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+    /* Same as above in mirror_cancel() */
+    return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1225,6 +1236,7 @@ static const BlockJobDriver commit_active_job_driver = {
         .abort                  = mirror_abort,
         .pause                  = mirror_pause,
         .complete               = mirror_complete,
+        .cancel                 = commit_active_cancel,
     },
     .drained_poll           = mirror_drained_poll,
 };
diff --git a/job.c b/job.c
index 1b68a7a983..9e82b128ff 100644
--- a/job.c
+++ b/job.c
@@ -719,8 +719,12 @@ static int job_finalize_single(Job *job)
 static void job_cancel_async(Job *job, bool force)
 {
     if (job->driver->cancel) {
-        job->driver->cancel(job, force);
+        force = job->driver->cancel(job, force);
+    } else {
+        /* No .cancel() means the job will behave as if force-cancelled */
+        force = true;
     }
+
     if (job->user_paused) {
         /* Do not call job_enter here, the caller will handle it.  */
         if (job->driver->user_resume) {
-- 
2.31.1




* [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested()
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (5 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 06/12] jobs: Give Job.force_cancel more meaning Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 20:34   ` Eric Blake
  2021-09-01 11:44   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled() Max Reitz
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_requested() as the general variant, which returns true for
any jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.
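
To make the distinction concrete, here is a small standalone model of
the two predicates.  It is a sketch under assumptions, not QEMU code:
ToyJob and the two toy_job_may_*() helpers are hypothetical and only
mirror the pause and complete rules described in this message.

```c
#include <assert.h>   /* only needed for checks written against the model */
#include <stdbool.h>

/* Toy stand-in for the two relevant Job fields. */
typedef struct ToyJob {
    bool cancelled;     /* some cancel request has been made */
    bool force_cancel;  /* the job will terminate as soon as possible */
} ToyJob;

/* Narrow variant: true only for force-cancelled jobs. */
static bool toy_job_is_cancelled(const ToyJob *job)
{
    return job->cancelled && job->force_cancel;
}

/* General variant: true for any cancel request, soft or forced. */
static bool toy_job_cancel_requested(const ToyJob *job)
{
    return job->cancelled;
}

/* Hypothetical helper: only force-cancelled jobs refuse to pause. */
static bool toy_job_may_pause(const ToyJob *job)
{
    return !toy_job_is_cancelled(job);
}

/* Hypothetical helper: any cancel request rules out job-complete. */
static bool toy_job_may_complete(const ToyJob *job)
{
    return !toy_job_cancel_requested(job);
}
```

For a soft-cancelled job (cancelled=true, force_cancel=false), this
model allows pausing but refuses completion; for a force-cancelled job
it refuses both.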

Finally, here is a justification for how different job_is_cancelled()
invocations are treated by this patch:

- block/mirror.c (mirror_run()):
  - The first invocation is a while loop that should loop until the job
    has been cancelled or scheduled for completion.  What kind of cancel
    does not matter, only the fact that the job is supposed to end.

  - The second invocation wants to know whether the job has been
    soft-cancelled.  Calling job_cancel_requested() is a bit too broad,
    but if the job were force-cancelled, we should leave the main loop
    as soon as possible anyway, so this should not matter here.

  - The last two invocations already check force_cancel, so they should
    continue to use job_is_cancelled().

- block/backup.c, block/commit.c, block/stream.c, anything in tests/:
  These jobs know only force-cancel, so there is no difference between
  job_is_cancelled() and job_cancel_requested().  We can continue using
  job_is_cancelled().

- job.c:
  - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
    jobs should be prevented from being paused.  Continue using
    job_is_cancelled().

  - job_update_rc(), job_finalize_single(), job_finish_sync(): These
    functions are all called after the job has left its main loop.  The
    mirror job (the only job that can be soft-cancelled) will clear
    .cancelled before leaving the main loop if it has been
    soft-cancelled.  Therefore, these functions will observe .cancelled
    to be true only if the job has been force-cancelled.  We can
    continue to use job_is_cancelled().
    (Furthermore, conceptually, a soft-cancelled mirror job should not
    report to have been cancelled.  It should report completion (see
    also the block-job-cancel QAPI documentation).  Therefore, it makes
    sense for these functions not to distinguish between a
    soft-cancelled mirror job and a job that has completed as normal.)

  - job_completed_txn_abort(): All jobs other than @job have been
    force-cancelled.  job_is_cancelled() must be true for them.
    Regarding @job itself: job_completed_txn_abort() is mostly called
    when the job's return value is not 0.  A soft-cancelled mirror has a
    return value of 0, and so will not end up here then.
    However, job_cancel() invokes job_completed_txn_abort() if the job
    has been deferred to the main loop.  That is mostly the case for
    completed jobs (which skip the assertion), but this is not
    guaranteed.
    To be safe, use job_cancel_requested() in this assertion.

  - job_complete(): This is a function eventually invoked by the user
    (through qmp_block_job_complete() or qmp_job_complete(), or
    job_complete_sync(), which comes from qemu-img).  The intention here
    is to prevent a user from invoking job-complete after the job has
    been cancelled.  This should also apply to soft cancelling: After a
    mirror job has been soft-cancelled, the user should not be able to
    decide otherwise and have it complete as normal (i.e. pivoting to
    the target).

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/qemu/job.h |  8 +++++++-
 block/mirror.c     | 10 ++++------
 job.c              |  9 +++++++--
 3 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
 /** Returns whether the job is in a completed state. */
 bool job_is_completed(Job *job);
 
diff --git a/block/mirror.c b/block/mirror.c
index e93631a9f6..72e02fa34e 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -936,7 +936,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         /* Transition to the READY state and wait for complete. */
         job_transition_to_ready(&s->common.job);
         s->actively_synced = true;
-        while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
+        while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
             job_yield(&s->common.job);
         }
         s->common.job.cancelled = false;
@@ -1043,7 +1043,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             }
 
             should_complete = s->should_complete ||
-                job_is_cancelled(&s->common.job);
+                job_cancel_requested(&s->common.job);
             cnt = bdrv_get_dirty_count(s->dirty_bitmap);
         }
 
@@ -1087,7 +1087,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
                                   delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
+        if (job_is_cancelled(&s->common.job)) {
             break;
         }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1099,9 +1099,7 @@ immediate_exit:
          * or it was cancelled prematurely so that we do not guarantee that
          * the target is a copy of the source.
          */
-        assert(ret < 0 ||
-               (s->common.job.force_cancel &&
-                job_is_cancelled(&s->common.job)));
+        assert(ret < 0 || job_is_cancelled(&s->common.job));
         assert(need_drain);
         mirror_wait_for_all_io(s);
     }
diff --git a/job.c b/job.c
index 9e82b128ff..2bd3c946a7 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
 }
 
 bool job_is_cancelled(Job *job)
+{
+    return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
 {
     return job->cancelled;
 }
@@ -788,7 +793,7 @@ static void job_completed_txn_abort(Job *job)
         ctx = other_job->aio_context;
         aio_context_acquire(ctx);
         if (!job_is_completed(other_job)) {
-            assert(job_is_cancelled(other_job));
+            assert(job_cancel_requested(other_job));
             job_finish_sync(other_job, NULL, NULL);
         }
         job_finalize_single(other_job);
@@ -1028,7 +1033,7 @@ void job_complete(Job *job, Error **errp)
     if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
         return;
     }
-    if (job_is_cancelled(job) || !job->driver->complete) {
+    if (job_cancel_requested(job) || !job->driver->complete) {
         error_setg(errp, "The active block job '%s' cannot be completed",
                    job->id);
         return;
-- 
2.31.1




* [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled()
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (6 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested() Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 20:35   ` Eric Blake
  2021-09-01 11:45   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier Max Reitz
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

mirror_drained_poll() returns true whenever the job is cancelled,
because "we [can] be sure that it won't issue more requests".  However,
this is only true for force-cancelled jobs, so use job_is_cancelled().

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 72e02fa34e..024fa2dcea 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1177,7 +1177,7 @@ static bool mirror_drained_poll(BlockJob *job)
      * from one of our own drain sections, to avoid a deadlock waiting for
      * ourselves.
      */
-    if (!s->common.job.paused && !s->common.job.cancelled && !s->in_drain) {
+    if (!s->common.job.paused && !job_is_cancelled(&job->job) && !s->in_drain) {
         return true;
     }
 
-- 
2.31.1




* [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (7 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled() Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 20:36   ` Eric Blake
  2021-09-01 12:11   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel Max Reitz
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

Jobs can be cancelled while they yield, and once they are
(force-cancelled), they should not generate new I/O requests.
Therefore, we should put the check after the last yield before
mirror_iteration() is invoked.
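
The effect of the check's position can be shown with a deliberately
simplified loop.  This is a toy model, not mirror_run(): ToyMirror,
toy_yield(), toy_flush_ok() and toy_mirror_loop() are hypothetical
names, and the loop body is reduced to one yield, one flush and one
I/O request per iteration.

```c
#include <assert.h>   /* only needed for checks written against the model */
#include <stdbool.h>

typedef struct ToyMirror {
    int cancel_at_iter;       /* iteration whose yield delivers the cancel */
    bool flush_always_fails;  /* models mirror_flush() failing repeatedly */
    bool force_cancelled;     /* models job_is_cancelled() */
    int io_requests;          /* I/O requests issued by the loop */
} ToyMirror;

/* The job can only be cancelled while it yields. */
static void toy_yield(ToyMirror *s, int iter)
{
    if (iter == s->cancel_at_iter) {
        s->force_cancelled = true;
    }
}

static bool toy_flush_ok(const ToyMirror *s)
{
    return !s->flush_always_fails;
}

/* Returns the number of completed iterations, at most max_iters. */
static int toy_mirror_loop(ToyMirror *s, bool check_early, int max_iters)
{
    int iter;

    for (iter = 0; iter < max_iters; iter++) {
        toy_yield(s, iter);

        /* Early check: right after the yield, before any `continue`. */
        if (check_early && s->force_cancelled) {
            break;
        }

        if (!toy_flush_ok(s)) {
            continue;   /* skips everything below, including a late check */
        }

        s->io_requests++;

        /* Late check: unreachable while the flush keeps failing. */
        if (!check_early && s->force_cancelled) {
            break;
        }
    }
    return iter;
}
```

With a permanently failing flush, the late check is never reached and
the toy loop only stops at max_iters, while the early check leaves the
loop in the iteration where the cancel arrives.  With a working flush,
the late check still issues one more I/O request after the cancel.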

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 024fa2dcea..bf1d50ff1c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1000,6 +1000,11 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
 
         job_pause_point(&s->common.job);
 
+        if (job_is_cancelled(&s->common.job)) {
+            ret = 0;
+            goto immediate_exit;
+        }
+
         cnt = bdrv_get_dirty_count(s->dirty_bitmap);
         /* cnt is the number of dirty bytes remaining and s->bytes_in_flight is
          * the number of bytes currently being processed; together those are
@@ -1078,8 +1083,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             break;
         }
 
-        ret = 0;
-
         if (job_is_ready(&s->common.job) && !should_complete) {
             delay_ns = (s->in_flight == 0 &&
                         cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
@@ -1087,9 +1090,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         trace_mirror_before_sleep(s, cnt, job_is_ready(&s->common.job),
                                   delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job)) {
-            break;
-        }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }
 
-- 
2.31.1




* [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (8 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 20:37   ` Eric Blake
  2021-09-01 12:16   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled Max Reitz
  2021-08-06  9:38 ` [PATCH for-6.2 v3 12/12] iotests: Add mirror-ready-cancel-error test Max Reitz
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Once the mirror job is force-cancelled (job_is_cancelled() is true), we
should not generate new I/O requests.  This applies to active mirroring,
too, so stop it once the job is cancelled.

(We must still forward all I/O requests to the source, though, of
course, but those are not really I/O requests generated by the job, so
this is fine.)

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/mirror.c b/block/mirror.c
index bf1d50ff1c..af89c1716a 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1418,6 +1418,7 @@ static int coroutine_fn bdrv_mirror_top_do_write(BlockDriverState *bs,
     bool copy_to_target;
 
     copy_to_target = s->job->ret >= 0 &&
+                     !job_is_cancelled(&s->job->common.job) &&
                      s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
 
     if (copy_to_target) {
@@ -1466,6 +1467,7 @@ static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     bool copy_to_target;
 
     copy_to_target = s->job->ret >= 0 &&
+                     !job_is_cancelled(&s->job->common.job) &&
                      s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
 
     if (copy_to_target) {
-- 
2.31.1




* [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (9 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  2021-08-06 20:42   ` Eric Blake
  2021-09-01 12:22   ` Vladimir Sementsov-Ogievskiy
  2021-08-06  9:38 ` [PATCH for-6.2 v3 12/12] iotests: Add mirror-ready-cancel-error test Max Reitz
  11 siblings, 2 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Clearing .cancelled before leaving the main loop when the job has been
soft-cancelled is no longer necessary since job_is_cancelled() only
returns true for jobs that have been force-cancelled.

Therefore, this only makes a difference in places that call
job_cancel_requested().  In block/mirror.c, this is done only before
.cancelled was cleared.

In job.c, there are two callers:
- job_completed_txn_abort() asserts that .cancelled is true, so keeping
  it true will not affect this place.

- job_complete() refuses to let a job complete that has .cancelled set.
  It is correct to refuse to let the user invoke job-complete on mirror
  jobs that have already been soft-cancelled.

With this change, there are no places that reset .cancelled to false and
so we can be sure that .force_cancel can only be true of .cancelled is
true as well.  Assert this in job_is_cancelled().
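
The resulting relationship between the two flags can be sketched with a
toy Python model (ToyJob and cancel() are hypothetical helpers; the
sketch assumes a mirror-like job for which a non-forced cancel is not
turned into a forced one):

```python
from dataclasses import dataclass


@dataclass
class ToyJob:
    """Hypothetical model of the two flags discussed above."""
    cancelled: bool = False
    force_cancel: bool = False


def cancel(job, force):
    # Assumption: nothing ever resets these flags to False again
    job.cancelled = True
    if force:
        job.force_cancel = True


def job_cancel_requested(job):
    # True for any cancel request, soft or forced
    return job.cancelled


def job_is_cancelled(job):
    # force_cancel may be true only if cancelled is true, too
    assert job.cancelled or not job.force_cancel
    return job.force_cancel


soft = ToyJob()
cancel(soft, force=False)

hard = ToyJob()
cancel(hard, force=True)
```

In this model a soft-cancelled job reports job_cancel_requested() but
not job_is_cancelled(), and a force-cancelled job reports both.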

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 2 --
 job.c          | 4 +++-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index af89c1716a..f94aa52fae 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -939,7 +939,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
             job_yield(&s->common.job);
         }
-        s->common.job.cancelled = false;
         goto immediate_exit;
     }
 
@@ -1078,7 +1077,6 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
              * completion.
              */
             assert(QLIST_EMPTY(&bs->tracked_requests));
-            s->common.job.cancelled = false;
             need_drain = false;
             break;
         }
diff --git a/job.c b/job.c
index 2bd3c946a7..2ce6865ab2 100644
--- a/job.c
+++ b/job.c
@@ -217,7 +217,9 @@ const char *job_type_str(const Job *job)
 
 bool job_is_cancelled(Job *job)
 {
-    return job->cancelled && job->force_cancel;
+    /* force_cancel may be true only if cancelled is true, too */
+    assert(job->cancelled || !job->force_cancel);
+    return job->force_cancel;
 }
 
 bool job_cancel_requested(Job *job)
-- 
2.31.1




* [PATCH for-6.2 v3 12/12] iotests: Add mirror-ready-cancel-error test
  2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
                   ` (10 preceding siblings ...)
  2021-08-06  9:38 ` [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled Max Reitz
@ 2021-08-06  9:38 ` Max Reitz
  11 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-06  9:38 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, Max Reitz

Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.
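
For orientation, the blkdebug rules that the test configures can be
modelled roughly as follows (a simplified Python sketch; ToyBlkdebug is
a made-up name and only mimics the state and `once` handling that the
test relies on, not blkdebug's actual implementation):

```python
class ToyBlkdebug:
    """Hypothetical model of one set-state and one inject-error rule."""
    def __init__(self, once):
        self.state = 1           # initial state
        self.once = once
        self.rule_active = True  # inject-error rule still installed?

    def flush(self):
        """Return True if the flush succeeds, False if it is failed."""
        # inject-error rule: only applies while in state 2
        fail = self.rule_active and self.state == 2
        if fail and self.once:
            self.rule_active = False  # a `once` rule fires a single time

        # set-state rule: state 1 -> state 2 on a flush event
        if self.state == 1:
            self.state = 2
        return not fail


transient = ToyBlkdebug(once=True)
transient_results = [transient.flush() for _ in range(4)]

persistent = ToyBlkdebug(once=False)
persistent_results = [persistent.flush() for _ in range(4)]
```

In this model the first flush always succeeds and only moves to state 2;
with once=True exactly one later flush fails, and with once=False every
later flush fails.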

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 .../tests/mirror-ready-cancel-error           | 143 ++++++++++++++++++
 .../tests/mirror-ready-cancel-error.out       |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 0000000000..f2dc88881f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+    def setUp(self) -> None:
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+                                       str(image_size)) == 0
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+                                       str(image_size)) == 0
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+    def tearDown(self) -> None:
+        self.vm.shutdown()
+        os.remove(source)
+        os.remove(target)
+
+    def add_blockdevs(self, once: bool) -> None:
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'source',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'file',
+                                 'filename': source
+                             }})
+        self.assert_qmp(res, 'return', {})
+
+        # blkdebug notes:
+        # Enter state 2 on the first flush, which happens before the
+        # job enters the READY state.  The second flush will happen
+        # when the job is about to complete, and we want that one to
+        # fail.
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'target',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'blkdebug',
+                                 'image': {
+                                     'driver': 'file',
+                                     'filename': target
+                                 },
+                                 'set-state': [{
+                                     'event': 'flush_to_disk',
+                                     'state': 1,
+                                     'new_state': 2
+                                 }],
+                                 'inject-error': [{
+                                     'event': 'flush_to_disk',
+                                     'once': once,
+                                     'immediately': True,
+                                     'state': 2
+                                 }]}})
+        self.assert_qmp(res, 'return', {})
+
+    def start_mirror(self) -> None:
+        res = self.vm.qmp('blockdev-mirror',
+                          job_id='mirror',
+                          device='source',
+                          target='target',
+                          filter_node_name='mirror-top',
+                          sync='full',
+                          on_target_error='stop')
+        self.assert_qmp(res, 'return', {})
+
+    def cancel_mirror_with_error(self) -> None:
+        self.vm.event_wait('BLOCK_JOB_READY')
+
+        # Write something so the job will not exit immediately, but
+        # flush first (which will fail, thanks to blkdebug)
+        res = self.vm.qmp('human-monitor-command',
+                          command_line='qemu-io mirror-top "write 0 64k"')
+        self.assert_qmp(res, 'return', '')
+
+        # Drain status change events
+        while self.vm.event_wait('JOB_STATUS_CHANGE', timeout=0.0) is not None:
+            pass
+
+        res = self.vm.qmp('block-job-cancel', device='mirror')
+        self.assert_qmp(res, 'return', {})
+
+        self.vm.event_wait('BLOCK_JOB_ERROR')
+
+    def test_transient_error(self) -> None:
+        self.add_blockdevs(True)
+        self.start_mirror()
+        self.cancel_mirror_with_error()
+
+        while True:
+            e = self.vm.event_wait('JOB_STATUS_CHANGE')
+            if e['data']['status'] == 'standby':
+                # Transient error, try again
+                self.vm.qmp('block-job-resume', device='mirror')
+            elif e['data']['status'] == 'null':
+                break
+
+    def test_persistent_error(self) -> None:
+        self.add_blockdevs(False)
+        self.start_mirror()
+        self.cancel_mirror_with_error()
+
+        while True:
+            e = self.vm.event_wait('JOB_STATUS_CHANGE')
+            if e['data']['status'] == 'standby':
+                # Persistent error, no point in continuing
+                self.vm.qmp('block-job-cancel', device='mirror', force=True)
+            elif e['data']['status'] == 'null':
+                break
+
+
+if __name__ == '__main__':
+    # LUKS would require special key-secret handling in add_blockdevs()
+    iotests.main(supported_fmts=['generic'],
+                 unsupported_fmts=['luks'],
+                 supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error.out b/tests/qemu-iotests/tests/mirror-ready-cancel-error.out
new file mode 100644
index 0000000000..fbc63e62f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error.out
@@ -0,0 +1,5 @@
+..
+----------------------------------------------------------------------
+Ran 2 tests
+
+OK
-- 
2.31.1




* Re: [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort() Max Reitz
@ 2021-08-06 19:16   ` Eric Blake
  2021-08-09 10:04     ` Max Reitz
  2021-09-01 10:05   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Blake @ 2021-08-06 19:16 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:48AM +0200, Max Reitz wrote:
> Finalizing the job may cause its AioContext to change.  This is noted by
> job_exit(), which points at job_txn_apply() to take this fact into
> account.
> 
> However, job_completed() does not necessarily invoke job_txn_apply()
> (through job_completed_txn_success()), but potentially also
> job_completed_txn_abort().  The latter stores the context in a local
> variable, and so always acquires the same context at its end that it has
> released in the beginning -- which may be a different context from the
> one that job_exit() releases at its end.  If it is different, qemu
> aborts ("qemu_mutex_unlock_impl: Operation not permitted").

Is this a bug fix that needs to make it into 6.1?

> 
> Drop the local @outer_ctx variable from job_completed_txn_abort(), and
> instead re-acquire the actual job's context at the end of the function,
> so job_exit() will release the same.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  job.c | 23 ++++++++++++++++++-----
>  1 file changed, 18 insertions(+), 5 deletions(-)

The commit message makes sense, and does a good job at explaining the
change.  I'm still a bit fuzzy on how jobs are supposed to play nice
with contexts, but since your patch matches the commit message, I'm
happy to give:

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction
  2021-08-06  9:38 ` [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction Max Reitz
@ 2021-08-06 19:22   ` Eric Blake
  2021-09-01 10:08   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Blake @ 2021-08-06 19:22 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:51AM +0200, Max Reitz wrote:
> When a transaction is aborted, no result matters, and so all jobs within
> should be force-cancelled.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  job.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

> 
> diff --git a/job.c b/job.c
> index 3fe23bb77e..24e7c4fcb7 100644
> --- a/job.c
> +++ b/job.c
> @@ -766,7 +766,12 @@ static void job_completed_txn_abort(Job *job)
>          if (other_job != job) {
>              ctx = other_job->aio_context;
>              aio_context_acquire(ctx);
> -            job_cancel_async(other_job, false);
> +            /*
> +             * This is a transaction: If one job failed, no result will matter.
> +             * Therefore, pass force=true to terminate all other jobs as quickly
> +             * as possible.
> +             */
> +            job_cancel_async(other_job, true);
>              aio_context_release(ctx);
>          }
>      }
> -- 
> 2.31.1
> 
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{, _all}()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{, _all}() Max Reitz
@ 2021-08-06 19:39   ` Eric Blake
  2021-08-09 10:09     ` Max Reitz
  2021-09-01 10:20   ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Vladimir Sementsov-Ogievskiy
  2021-09-01 11:04   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 35+ messages in thread
From: Eric Blake @ 2021-08-06 19:39 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:52AM +0200, Max Reitz wrote:
> Callers should be able to specify whether they want job_cancel_sync() to
> force-cancel the job or not.
> 
> In fact, almost all invocations do not care about consistency of the
> result and just want the job to terminate as soon as possible, so they
> should pass force=true.  The replication block driver is the exception.
> 
> This changes some iotest outputs, because quitting qemu while a mirror
> job is active will now lead to it being cancelled instead of completed,
> which is what we want.  (Cancelling a READY mirror job with force=false
> may take an indefinite amount of time, which we do not want when
> quitting.  If users want consistent results, they must have all jobs be
> done before they quit qemu.)

Feels somewhat like a bug fix, but I also understand why you'd prefer
to delay this to 6.2 (it is not a fresh regression, but a longstanding
issue).

> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---

> +++ b/job.c
> @@ -982,12 +982,24 @@ static void job_cancel_err(Job *job, Error **errp)
>      job_cancel(job, false);
>  }
>  
> -int job_cancel_sync(Job *job)
> +/**
> + * Same as job_cancel_err(), but force-cancel.
> + */
> +static void job_force_cancel_err(Job *job, Error **errp)
>  {
> -    return job_finish_sync(job, &job_cancel_err, NULL);
> +    job_cancel(job, true);
> +}

In isolation, it looks odd that errp is passed but not used.  But
looking further, it's because this is a callback that must have a
given signature, so it's okay.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested() Max Reitz
@ 2021-08-06 20:34   ` Eric Blake
  2021-09-01 11:44   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Blake @ 2021-08-06 20:34 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:54AM +0200, Max Reitz wrote:
> Most callers of job_is_cancelled() actually want to know whether the job
> is on its way to immediate termination.  For example, we refuse to pause
> jobs that are cancelled; but this only makes sense for jobs that are
> really actually cancelled.
> 
> A mirror job that is cancelled during READY with force=false should
> absolutely be allowed to pause.  This "cancellation" (which is actually
> a kind of completion) may take an indefinite amount of time, and so
> should behave like any job during normal operation.  For example, with
> on-target-error=stop, the job should stop on write errors.  (In
> contrast, force-cancelled jobs should not get write errors, as they
> should just terminate and not do further I/O.)
> 
> Therefore, redefine job_is_cancelled() to only return true for jobs that
> are force-cancelled (which as of HEAD^ means any job that interprets the
> cancellation request as a request for immediate termination), and add
> job_cancel_requested() as the general variant, which returns true for
> any jobs which have been requested to be cancelled, whether it be
> immediately or after an arbitrarily long completion phase.
> 
> Finally, here is a justification for how different job_is_cancelled()
> invocations are treated by this patch:

Thanks for this list; it's really thorough and helpful.

> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Although it is fixing a bug, the bug has been long-standing, so I
agree with your claim that this is 6.2 material.

> ---
>  include/qemu/job.h |  8 +++++++-
>  block/mirror.c     | 10 ++++------
>  job.c              |  9 +++++++--
>  3 files changed, 18 insertions(+), 9 deletions(-)
> 

> +++ b/job.c
> @@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
>  }
>  
>  bool job_is_cancelled(Job *job)
> +{
> +    return job->cancelled && job->force_cancel;
> +}
> +
> +bool job_cancel_requested(Job *job)
>  {
>      return job->cancelled;
>  }

Works out rather nicely.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled() Max Reitz
@ 2021-08-06 20:35   ` Eric Blake
  2021-09-01 11:45   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Blake @ 2021-08-06 20:35 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:55AM +0200, Max Reitz wrote:
> mirror_drained_poll() returns true whenever the job is cancelled,
> because "we [can] be sure that it won't issue more requests".  However,
> this is only true for force-cancelled jobs, so use job_is_cancelled().
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 72e02fa34e..024fa2dcea 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1177,7 +1177,7 @@ static bool mirror_drained_poll(BlockJob *job)
>       * from one of our own drain sections, to avoid a deadlock waiting for
>       * ourselves.
>       */
> -    if (!s->common.job.paused && !s->common.job.cancelled && !s->in_drain) {
> +    if (!s->common.job.paused && !job_is_cancelled(&job->job) && !s->in_drain) {
>          return true;
>      }

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier
  2021-08-06  9:38 ` [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier Max Reitz
@ 2021-08-06 20:36   ` Eric Blake
  2021-09-01 12:11   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Blake @ 2021-08-06 20:36 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:56AM +0200, Max Reitz wrote:
> We must check whether the job is force-cancelled early in our main loop,
> most importantly before any `continue` statement.  For example, we used
> to have `continue`s before our current checking location that are
> triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
> failing, force-cancelling the job would not terminate it.
> 
> Jobs can be cancelled while they yield, and once they are
> (force-cancelled), they should not generate new I/O requests.
> Therefore, we should put the check after the last yield before
> mirror_iteration() is invoked.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel
  2021-08-06  9:38 ` [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel Max Reitz
@ 2021-08-06 20:37   ` Eric Blake
  2021-09-01 12:16   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Blake @ 2021-08-06 20:37 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:57AM +0200, Max Reitz wrote:
> Once the mirror job is force-cancelled (job_is_cancelled() is true), we
> should not generate new I/O requests.  This applies to active mirroring,
> too, so stop it once the job is cancelled.
> 
> (We must still forward all I/O requests to the source, though, of
> course, but those are not really I/O requests generated by the job, so
> this is fine.)
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled
  2021-08-06  9:38 ` [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled Max Reitz
@ 2021-08-06 20:42   ` Eric Blake
  2021-09-01 12:22   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Blake @ 2021-08-06 20:42 UTC (permalink / raw)
  To: Max Reitz
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On Fri, Aug 06, 2021 at 11:38:58AM +0200, Max Reitz wrote:
> Clearing .cancelled before leaving the main loop when the job has been
> soft-cancelled is no longer necessary since job_is_cancelled() only
> returns true for jobs that have been force-cancelled.
> 
> Therefore, this only makes a difference in places that call
> job_cancel_requested().  In block/mirror.c, this is done only before
> .cancelled was cleared.
> 
> In job.c, there are two callers:
> - job_completed_txn_abort() asserts that .cancelled is true, so keeping
>   it true will not affect this place.
> 
> - job_complete() refuses to let a job complete that has .cancelled set.
>   It is correct to refuse to let the user invoke job-complete on mirror
>   jobs that have already been soft-cancelled.
> 
> With this change, there are no places that reset .cancelled to false and
> so we can be sure that .force_cancel can only be true of .cancelled is

s/of/if/

> true as well.  Assert this in job_is_cancelled().
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>  block/mirror.c | 2 --
>  job.c          | 4 +++-
>  2 files changed, 3 insertions(+), 3 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()
  2021-08-06 19:16   ` Eric Blake
@ 2021-08-09 10:04     ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-09 10:04 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 06.08.21 21:16, Eric Blake wrote:
> On Fri, Aug 06, 2021 at 11:38:48AM +0200, Max Reitz wrote:
>> Finalizing the job may cause its AioContext to change.  This is noted by
>> job_exit(), which points at job_txn_apply() to take this fact into
>> account.
>>
>> However, job_completed() does not necessarily invoke job_txn_apply()
>> (through job_completed_txn_success()), but potentially also
>> job_completed_txn_abort().  The latter stores the context in a local
>> variable, and so always acquires the same context at its end that it has
>> released in the beginning -- which may be a different context from the
>> one that job_exit() releases at its end.  If it is different, qemu
>> aborts ("qemu_mutex_unlock_impl: Operation not permitted").
> Is this a bug fix that needs to make it into 6.1?

Well, I only encountered it as part of this series (which I really don’t 
think is 6.1 material at this point), and so I don’t know.

Can’t hurt, I suppose, but if we wanted this to be in 6.1, we’d better 
have a specific test for it, I think.

>> Drop the local @outer_ctx variable from job_completed_txn_abort(), and
>> instead re-acquire the actual job's context at the end of the function,
>> so job_exit() will release the same.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   job.c | 23 ++++++++++++++++++-----
>>   1 file changed, 18 insertions(+), 5 deletions(-)
> The commit message makes sense, and does a good job at explaining the
> change.  I'm still a bit fuzzy on how jobs are supposed to play nice
> with contexts,

I can relate :)

> but since your patch matches the commit message, I'm
> happy to give:
>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks!




* Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{, _all}()
  2021-08-06 19:39   ` Eric Blake
@ 2021-08-09 10:09     ` Max Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Max Reitz @ 2021-08-09 10:09 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, qemu-devel, qemu-block

On 06.08.21 21:39, Eric Blake wrote:
> On Fri, Aug 06, 2021 at 11:38:52AM +0200, Max Reitz wrote:
>> Callers should be able to specify whether they want job_cancel_sync() to
>> force-cancel the job or not.
>>
>> In fact, almost all invocations do not care about consistency of the
>> result and just want the job to terminate as soon as possible, so they
>> should pass force=true.  The replication block driver is the exception.
>>
>> This changes some iotest outputs, because quitting qemu while a mirror
>> job is active will now lead to it being cancelled instead of completed,
>> which is what we want.  (Cancelling a READY mirror job with force=false
>> may take an indefinite amount of time, which we do not want when
>> quitting.  If users want consistent results, they must have all jobs be
>> done before they quit qemu.)
> Feels somewhat like a bug fix, but I also understand why you'd prefer
> to delay this to 6.2 (it is not a fresh regression, but a longstanding
> issue).

It is, hence the “Buglink” tag below.  However, only all of this series
together really fixes that bug (or at least patches 5+7+9 together);
just taking one wouldn’t help much.  And together, it’s just too much
for 6.1 at this point.

>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>> +++ b/job.c
>> @@ -982,12 +982,24 @@ static void job_cancel_err(Job *job, Error **errp)
>>       job_cancel(job, false);
>>   }
>>   
>> -int job_cancel_sync(Job *job)
>> +/**
>> + * Same as job_cancel_err(), but force-cancel.
>> + */
>> +static void job_force_cancel_err(Job *job, Error **errp)
>>   {
>> -    return job_finish_sync(job, &job_cancel_err, NULL);
>> +    job_cancel(job, true);
>> +}
> In isolation, it looks odd that errp is passed but not used.  But
> looking further, it's because this is a callback that must have a
> given signature, so it's okay.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
>
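
To illustrate the point about the unused errp: below is a minimal,
self-contained sketch of a finish callback with a fixed signature, and of
a cancel function that picks the callback based on @force.  All Toy*
names are hypothetical stand-ins and not QEMU's real types or functions;
the return value is a simplification (a soft cancel is modelled as a
normal completion returning 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for Error and Job; not QEMU's definitions. */
typedef struct ToyError ToyError;

typedef struct ToyJob {
    bool cancelled;     /* some cancel request was made */
    bool force_cancel;  /* the request asks for immediate termination */
} ToyJob;

/* Every finish callback has this shape, whether or not it can fail. */
typedef void ToyFinishFn(ToyJob *job, ToyError **errp);

static void toy_cancel(ToyJob *job, bool force)
{
    job->cancelled = true;
    if (force) {
        job->force_cancel = true;
    }
}

/* Soft cancel: errp exists only to match ToyFinishFn. */
static void toy_cancel_err(ToyJob *job, ToyError **errp)
{
    (void)errp;
    toy_cancel(job, false);
}

/* Force cancel: same signature, errp equally unused. */
static void toy_force_cancel_err(ToyJob *job, ToyError **errp)
{
    (void)errp;
    toy_cancel(job, true);
}

/* Toy model: invoke the callback, then report how the job "ended". */
static int toy_finish_sync(ToyJob *job, ToyFinishFn *finish, ToyError **errp)
{
    finish(job, errp);
    return job->force_cancel ? -ECANCELED : 0;
}

/* The caller chooses the kind of cancellation through @force. */
static int toy_cancel_sync(ToyJob *job, bool force)
{
    return toy_finish_sync(job,
                           force ? toy_force_cancel_err : toy_cancel_err,
                           NULL);
}
```

In this model, toy_cancel_sync(&job, true) yields -ECANCELED, while
force=false yields 0 and leaves force_cancel unset.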




* Re: [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort() Max Reitz
  2021-08-06 19:16   ` Eric Blake
@ 2021-09-01 10:05   ` Vladimir Sementsov-Ogievskiy
  2021-09-01 12:47     ` Hanna Reitz
  1 sibling, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 10:05 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> Finalizing the job may cause its AioContext to change.  This is noted by
> job_exit(), which points at job_txn_apply() to take this fact into
> account.
> 
> However, job_completed() does not necessarily invoke job_txn_apply()
> (through job_completed_txn_success()), but potentially also
> job_completed_txn_abort().  The latter stores the context in a local
> variable, and so always acquires the same context at its end that it has
> released in the beginning -- which may be a different context from the
> one that job_exit() releases at its end.  If it is different, qemu
> aborts ("qemu_mutex_unlock_impl: Operation not permitted").
> 
> Drop the local @outer_ctx variable from job_completed_txn_abort(), and
> instead re-acquire the actual job's context at the end of the function,
> so job_exit() will release the same.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   job.c | 23 ++++++++++++++++++-----
>   1 file changed, 18 insertions(+), 5 deletions(-)
> 
> diff --git a/job.c b/job.c
> index e7a5d28854..3fe23bb77e 100644
> --- a/job.c
> +++ b/job.c
> @@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force)
>   
>   static void job_completed_txn_abort(Job *job)
>   {
> -    AioContext *outer_ctx = job->aio_context;
>       AioContext *ctx;
>       JobTxn *txn = job->txn;
>       Job *other_job;
> @@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job)
>       txn->aborting = true;
>       job_txn_ref(txn);
>   
> -    /* We can only hold the single job's AioContext lock while calling
> +    /*
> +     * We can only hold the single job's AioContext lock while calling
>        * job_finalize_single() because the finalization callbacks can involve
> -     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */
> -    aio_context_release(outer_ctx);
> +     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
> +     * Note that the job's AioContext may change when it is finalized.
> +     */
> +    job_ref(job);
> +    aio_context_release(job->aio_context);
>   
>       /* Other jobs are effectively cancelled by us, set the status for
>        * them; this job, however, may or may not be cancelled, depending
> @@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job)
>       }
>       while (!QLIST_EMPTY(&txn->jobs)) {
>           other_job = QLIST_FIRST(&txn->jobs);
> +        /*
> +         * The job's AioContext may change, so store it in @ctx so we
> +         * release the same context that we have acquired before.
> +         */
>           ctx = other_job->aio_context;
>           aio_context_acquire(ctx);
>           if (!job_is_completed(other_job)) {
> @@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job)
>           aio_context_release(ctx);
>       }
>   
> -    aio_context_acquire(outer_ctx);
> +    /*
> +     * Use job_ref()/job_unref() so we can read the AioContext here
> +     * even if the job went away during job_finalize_single().
> +     */
> +    ctx = job->aio_context;
> +    job_unref(job);
> +    aio_context_acquire(ctx);


Why use the ctx variable instead of doing exactly the same as in job_txn_apply():

    aio_context_acquire(job->aio_context);
    job_unref(job);

?

anyway:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir
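
For readers following along, the lock mismatch described in the commit
message can be reproduced with a small toy model.  This is only a sketch
under stated assumptions: ToyContext, ToyJob and the toy_* helpers are
hypothetical and merely count lock acquisitions; they are not QEMU's
AioContext API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for AioContext: only counts lock holders. */
typedef struct ToyContext {
    int lock_depth;
} ToyContext;

/* Hypothetical stand-in for Job: its context may change on finalize. */
typedef struct ToyJob {
    ToyContext *ctx;
} ToyJob;

static void toy_acquire(ToyContext *c)
{
    c->lock_depth++;
}

/* Returns false where a real mutex unlock would fail (lock not held). */
static bool toy_release(ToyContext *c)
{
    if (c->lock_depth == 0) {
        return false;
    }
    c->lock_depth--;
    return true;
}

/* Finalization moves the job to another context. */
static void toy_finalize(ToyJob *job, ToyContext *new_ctx)
{
    job->ctx = new_ctx;
}

/* Old pattern: cache the context and re-acquire that same one. */
static void abort_with_cached_ctx(ToyJob *job, ToyContext *new_ctx)
{
    ToyContext *outer_ctx = job->ctx;

    toy_release(outer_ctx);
    toy_finalize(job, new_ctx);
    toy_acquire(outer_ctx);
}

/* New pattern: re-acquire whatever context the job has at the end. */
static void abort_with_current_ctx(ToyJob *job, ToyContext *new_ctx)
{
    toy_release(job->ctx);
    toy_finalize(job, new_ctx);
    toy_acquire(job->ctx);
}

/* Models the caller, which releases job->ctx after the abort path. */
static bool caller_release_succeeds(bool use_cached)
{
    ToyContext a = { 0 }, b = { 0 };
    ToyJob job = { &a };

    toy_acquire(job.ctx);
    if (use_cached) {
        abort_with_cached_ctx(&job, &b);
    } else {
        abort_with_current_ctx(&job, &b);
    }
    return toy_release(job.ctx);
}
```

With the cached pointer, the caller ends up releasing a context it does
not hold; re-reading job->ctx avoids that in this model.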



* Re: [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction
  2021-08-06  9:38 ` [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction Max Reitz
  2021-08-06 19:22   ` Eric Blake
@ 2021-09-01 10:08   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 10:08 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> When a transaction is aborted, no result matters, and so all jobs within
> should be force-cancelled.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   job.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/job.c b/job.c
> index 3fe23bb77e..24e7c4fcb7 100644
> --- a/job.c
> +++ b/job.c
> @@ -766,7 +766,12 @@ static void job_completed_txn_abort(Job *job)
>           if (other_job != job) {
>               ctx = other_job->aio_context;
>               aio_context_acquire(ctx);
> -            job_cancel_async(other_job, false);
> +            /*
> +             * This is a transaction: If one job failed, no result will matter.
> +             * Therefore, pass force=true to terminate all other jobs as quickly
> +             * as possible.
> +             */
> +            job_cancel_async(other_job, true);
>               aio_context_release(ctx);
>           }
>       }
> 

Anyway, only backup jobs may be in a transaction, and backup doesn't distinguish between force and soft cancelling.  So this doesn't change any logic.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir
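
As a rough illustration of the change in this patch: when one job in a
transaction fails, every other job gets a force-cancel request.  The
sketch below is a toy model with hypothetical names (ToyJob,
toy_cancel_async, toy_txn_abort); it ignores locking and finalization
entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for Job; not QEMU's definition. */
typedef struct ToyJob {
    bool cancelled;
    bool force_cancel;
} ToyJob;

static void toy_cancel_async(ToyJob *job, bool force)
{
    job->cancelled = true;
    if (force) {
        job->force_cancel = true;
    }
}

/*
 * The job at index @failed has failed.  No result in the transaction
 * matters any more, so force-cancel all other jobs.
 */
static void toy_txn_abort(ToyJob *jobs, size_t n, size_t failed)
{
    for (size_t i = 0; i < n; i++) {
        if (i != failed) {
            toy_cancel_async(&jobs[i], true);
        }
    }
}
```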



* Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Max Reitz
  2021-08-06 19:39   ` Eric Blake
@ 2021-09-01 10:20   ` Vladimir Sementsov-Ogievskiy
  2021-09-01 12:49     ` Hanna Reitz
  2021-09-01 11:04   ` Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 10:20 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> Callers should be able to specify whether they want job_cancel_sync() to
> force-cancel the job or not.
> 
> In fact, almost all invocations do not care about consistency of the
> result and just want the job to terminate as soon as possible, so they
> should pass force=true.  The replication block driver is the exception.
> 
> This changes some iotest outputs, because quitting qemu while a mirror
> job is active will now lead to it being cancelled instead of completed,
> which is what we want.  (Cancelling a READY mirror job with force=false
> may take an indefinite amount of time, which we do not want when
> quitting.  If users want consistent results, they must have all jobs be
> done before they quit qemu.)
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   include/qemu/job.h                    | 10 ++---
>   block/replication.c                   |  4 +-
>   blockdev.c                            |  4 +-
>   job.c                                 | 20 +++++++--
>   qemu-nbd.c                            |  2 +-
>   softmmu/runstate.c                    |  2 +-
>   storage-daemon/qemu-storage-daemon.c  |  2 +-
>   tests/unit/test-block-iothread.c      |  2 +-
>   tests/unit/test-blockjob.c            |  2 +-
>   tests/qemu-iotests/109.out            | 60 +++++++++++----------------
>   tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
>   11 files changed, 55 insertions(+), 55 deletions(-)
> 
> diff --git a/include/qemu/job.h b/include/qemu/job.h
> index 41162ed494..5e8edbc2c8 100644
> --- a/include/qemu/job.h
> +++ b/include/qemu/job.h
> @@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
>   
>   /**
>    * Synchronously cancel the @job.  The completion callback is called
> - * before the function returns.  The job may actually complete
> - * instead of canceling itself; the circumstances under which this
> - * happens depend on the kind of job that is active.
> + * before the function returns.  If @force is false, the job may
> + * actually complete instead of canceling itself; the circumstances
> + * under which this happens depend on the kind of job that is active.
>    *
>    * Returns the return value from the job if the job actually completed
>    * during the call, or -ECANCELED if it was canceled.
>    *
>    * Callers must hold the AioContext lock of job->aio_context.
>    */
> -int job_cancel_sync(Job *job);
> +int job_cancel_sync(Job *job, bool force);
>   
>   /** Synchronously cancels all jobs using job_cancel_sync(). */
> -void job_cancel_sync_all(void);
> +void job_cancel_sync_all(bool force);

I think it would be better to keep the job_cancel_sync_all(void)
prototype and just change its behavior to force-cancel.  This patch
always passes true to it anyway.  And it would be strange to do a
soft-cancel-all, keeping in mind that soft cancelling only makes sense
for a mirror job in the READY state.

Anyway:
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Max Reitz
  2021-08-06 19:39   ` Eric Blake
  2021-09-01 10:20   ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Vladimir Sementsov-Ogievskiy
@ 2021-09-01 11:04   ` Vladimir Sementsov-Ogievskiy
  2021-09-01 12:50     ` Hanna Reitz
  2 siblings, 1 reply; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 11:04 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> @@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
>            * disk, secondary disk in backup_job_completed().
>            */
>           if (s->backup_job) {
> -            job_cancel_sync(&s->backup_job->job);
> +            job_cancel_sync(&s->backup_job->job, false);

That's not quite correct, as backup is always force-cancelled.

-- 
Best regards,
Vladimir



* Re: [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested() Max Reitz
  2021-08-06 20:34   ` Eric Blake
@ 2021-09-01 11:44   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 11:44 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> Most callers of job_is_cancelled() actually want to know whether the job
> is on its way to immediate termination.  For example, we refuse to pause
> jobs that are cancelled; but this only makes sense for jobs that are
> really actually cancelled.
> 
> A mirror job that is cancelled during READY with force=false should
> absolutely be allowed to pause.  This "cancellation" (which is actually
> a kind of completion) may take an indefinite amount of time, and so
> should behave like any job during normal operation.  For example, with
> on-target-error=stop, the job should stop on write errors.  (In
> contrast, force-cancelled jobs should not get write errors, as they
> should just terminate and not do further I/O.)
> 
> Therefore, redefine job_is_cancelled() to only return true for jobs that
> are force-cancelled (which as of HEAD^ means any job that interprets the
> cancellation request as a request for immediate termination), and add
> job_cancel_requested() as the general variant, which returns true for
> any jobs which have been requested to be cancelled, whether it be
> immediately or after an arbitrarily long completion phase.
> 
> Finally, here is a justification for how different job_is_cancelled()
> invocations are treated by this patch:
> 
> - block/mirror.c (mirror_run()):
>    - The first invocation is a while loop that should loop until the job
>      has been cancelled or scheduled for completion.  What kind of cancel
>      does not matter, only the fact that the job is supposed to end.
> 
>    - The second invocation wants to know whether the job has been
>      soft-cancelled.  Calling job_cancel_requested() is a bit too broad,
>      but if the job were force-cancelled, we should leave the main loop
>      as soon as possible anyway, so this should not matter here.
> 
>    - The last two invocations already check force_cancel, so they should
>      continue to use job_is_cancelled().
> 
> - block/backup.c, block/commit.c, block/stream.c, anything in tests/:
>    These jobs know only force-cancel, so there is no difference between
>    job_is_cancelled() and job_cancel_requested().  We can continue using
>    job_is_cancelled().
> 
> - job.c:
>    - job_pause_point(), job_yield(), job_sleep_ns(): Only force-cancelled
>      jobs should be prevented from being paused.  Continue using job_is_cancelled().
> 
>    - job_update_rc(), job_finalize_single(), job_finish_sync(): These
>      functions are all called after the job has left its main loop.  The
>      mirror job (the only job that can be soft-cancelled) will clear
>      .cancelled before leaving the main loop if it has been
>      soft-cancelled.  Therefore, these functions will observe .cancelled
>      to be true only if the job has been force-cancelled.  We can
>      continue to use job_is_cancelled().
>      (Furthermore, conceptually, a soft-cancelled mirror job should not
>      report to have been cancelled.  It should report completion (see
>      also the block-job-cancel QAPI documentation).  Therefore, it makes
>      sense for these functions not to distinguish between a
>      soft-cancelled mirror job and a job that has completed as normal.)
> 
>    - job_completed_txn_abort(): All jobs other than @job have been
>      force-cancelled.  job_is_cancelled() must be true for them.
>      Regarding @job itself: job_completed_txn_abort() is mostly called
>      when the job's return value is not 0.  A soft-cancelled mirror has a
>      return value of 0, and so will not end up here then.
>      However, job_cancel() invokes job_completed_txn_abort() if the job
>      has been deferred to the main loop, which is mostly the case for
>      completed jobs (which skip the assertion), but not for sure.
>      To be safe, use job_cancel_requested() in this assertion.
> 
>    - job_complete(): This function is eventually invoked by the user
>      (through qmp_block_job_complete() or qmp_job_complete(), or
>      job_complete_sync(), which comes from qemu-img).  The intention here
>      is to prevent a user from invoking job-complete after the job has
>      been cancelled.  This should also apply to soft cancelling: After a
>      mirror job has been soft-cancelled, the user should not be able to
>      decide otherwise and have it complete as normal (i.e. pivoting to
>      the target).
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir
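
The distinction drawn in the commit message can be summarized in a small
sketch.  The struct and the toy_* helpers below are hypothetical and only
model the two flags discussed in this thread (.cancelled and
.force_cancel); they are not copied from job.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for Job, reduced to the two flags discussed. */
typedef struct ToyJob {
    bool cancelled;     /* any cancel request, soft or forced */
    bool force_cancel;  /* the request means "terminate immediately" */
} ToyJob;

/* General variant: true for any cancel request. */
static bool toy_cancel_requested(const ToyJob *job)
{
    return job->cancelled;
}

/* Narrow variant: true only for jobs on their way to termination. */
static bool toy_is_cancelled(const ToyJob *job)
{
    return job->cancelled && job->force_cancel;
}

/* Pause points: only force-cancelled jobs are prevented from pausing. */
static bool toy_may_pause(const ToyJob *job)
{
    return !toy_is_cancelled(job);
}

/* job-complete: refused after any cancel request, soft or forced. */
static bool toy_may_complete(const ToyJob *job)
{
    return !toy_cancel_requested(job);
}
```

In this model, a soft-cancelled READY mirror job (cancelled=true,
force_cancel=false) may still pause but can no longer be completed by the
user, matching the per-caller reasoning in the list above.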



* Re: [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled()
  2021-08-06  9:38 ` [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled() Max Reitz
  2021-08-06 20:35   ` Eric Blake
@ 2021-09-01 11:45   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 11:45 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> mirror_drained_poll() returns true whenever the job is cancelled,
> because "we [can] be sure that it won't issue more requests".  However,
> this is only true for force-cancelled jobs, so use job_is_cancelled().
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier
  2021-08-06  9:38 ` [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier Max Reitz
  2021-08-06 20:36   ` Eric Blake
@ 2021-09-01 12:11   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 12:11 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> We must check whether the job is force-cancelled early in our main loop,
> most importantly before any `continue` statement.  For example, we used
> to have `continue`s before our current checking location that are
> triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
> failing, force-cancelling the job would not terminate it.
> 
> Jobs can be cancelled while they yield, and once they are
> (force-cancelled), they should not generate new I/O requests.
> Therefore, we should put the check after the last yield before
> mirror_iteration() is invoked.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir
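
Why the position of the check matters can be seen in a toy loop.  This is
a simplified sketch, not mirror_run(): ToyMirror and toy_run() are
hypothetical, and a persistently failing flush is modelled as a flag that
triggers `continue` on every iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the relevant mirror state. */
typedef struct ToyMirror {
    bool force_cancelled;  /* job has been force-cancelled */
    bool flush_fails;      /* the flush step keeps failing */
} ToyMirror;

/*
 * Runs the toy main loop and returns the number of iterations executed,
 * capped at @max_iters so that a stuck loop still terminates here.
 */
static int toy_run(const ToyMirror *m, bool check_early, int max_iters)
{
    int iters = 0;

    while (iters < max_iters) {
        iters++;

        /* New location: before any `continue`. */
        if (check_early && m->force_cancelled) {
            break;
        }

        /* Error path: restart the loop without reaching later code. */
        if (m->flush_fails) {
            continue;
        }

        /* Old location: never reached while the flush keeps failing. */
        if (!check_early && m->force_cancelled) {
            break;
        }
    }
    return iters;
}
```

With both flags set, the late check spins until the cap, whereas the
early check leaves the loop in the first iteration.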



* Re: [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel
  2021-08-06  9:38 ` [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel Max Reitz
  2021-08-06 20:37   ` Eric Blake
@ 2021-09-01 12:16   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 12:16 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> Once the mirror job is force-cancelled (job_is_cancelled() is true), we
> should not generate new I/O requests.  This applies to active mirroring,
> too, so stop it once the job is cancelled.
> 
> (We must still forward all I/O requests to the source, though, of
> course, but those are not really I/O requests generated by the job, so
> this is fine.)
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled
  2021-08-06  9:38 ` [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled Max Reitz
  2021-08-06 20:42   ` Eric Blake
@ 2021-09-01 12:22   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-09-01 12:22 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf

06.08.2021 12:38, Max Reitz wrote:
> Clearing .cancelled before leaving the main loop when the job has been
> soft-cancelled is no longer necessary since job_is_cancelled() only
> returns true for jobs that have been force-cancelled.
> 
> Therefore, this only makes a difference in places that call
> job_cancel_requested().  In block/mirror.c, this is done only before
> .cancelled was cleared.
> 
> In job.c, there are two callers:
> - job_completed_txn_abort() asserts that .cancelled is true, so keeping
>    it true will not affect this place.
> 
> - job_complete() refuses to let a job complete that has .cancelled set.
>    It is correct to refuse to let the user invoke job-complete on mirror
>    jobs that have already been soft-cancelled.
> 
> With this change, there are no places that reset .cancelled to false and
> so we can be sure that .force_cancel can only be true if .cancelled is
> true as well.  Assert this in job_is_cancelled().
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir



* Re: [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort()
  2021-09-01 10:05   ` Vladimir Sementsov-Ogievskiy
@ 2021-09-01 12:47     ` Hanna Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Hanna Reitz @ 2021-09-01 12:47 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-block
  Cc: Kevin Wolf, qemu-devel

On 01.09.21 12:05, Vladimir Sementsov-Ogievskiy wrote:
> 06.08.2021 12:38, Max Reitz wrote:
>> Finalizing the job may cause its AioContext to change.  This is noted by
>> job_exit(), which points at job_txn_apply() to take this fact into
>> account.
>>
>> However, job_completed() does not necessarily invoke job_txn_apply()
>> (through job_completed_txn_success()), but potentially also
>> job_completed_txn_abort().  The latter stores the context in a local
>> variable, and so always acquires the same context at its end that it has
>> released in the beginning -- which may be a different context from the
>> one that job_exit() releases at its end.  If it is different, qemu
>> aborts ("qemu_mutex_unlock_impl: Operation not permitted").
>>
>> Drop the local @outer_ctx variable from job_completed_txn_abort(), and
>> instead re-acquire the actual job's context at the end of the function,
>> so job_exit() will release the same.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   job.c | 23 ++++++++++++++++++-----
>>   1 file changed, 18 insertions(+), 5 deletions(-)
>>
>> diff --git a/job.c b/job.c
>> index e7a5d28854..3fe23bb77e 100644
>> --- a/job.c
>> +++ b/job.c
>> @@ -737,7 +737,6 @@ static void job_cancel_async(Job *job, bool force)
>>   
>>   static void job_completed_txn_abort(Job *job)
>>   {
>> -    AioContext *outer_ctx = job->aio_context;
>>       AioContext *ctx;
>>       JobTxn *txn = job->txn;
>>       Job *other_job;
>> @@ -751,10 +750,14 @@ static void job_completed_txn_abort(Job *job)
>>       txn->aborting = true;
>>       job_txn_ref(txn);
>>   
>> -    /* We can only hold the single job's AioContext lock while calling
>> +    /*
>> +     * We can only hold the single job's AioContext lock while calling
>>        * job_finalize_single() because the finalization callbacks can involve
>> -     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise. */
>> -    aio_context_release(outer_ctx);
>> +     * calls of AIO_WAIT_WHILE(), which could deadlock otherwise.
>> +     * Note that the job's AioContext may change when it is finalized.
>> +     */
>> +    job_ref(job);
>> +    aio_context_release(job->aio_context);
>>   
>>       /* Other jobs are effectively cancelled by us, set the status for
>>        * them; this job, however, may or may not be cancelled, depending
>> @@ -769,6 +772,10 @@ static void job_completed_txn_abort(Job *job)
>>       }
>>       while (!QLIST_EMPTY(&txn->jobs)) {
>>           other_job = QLIST_FIRST(&txn->jobs);
>> +        /*
>> +         * The job's AioContext may change, so store it in @ctx so we
>> +         * release the same context that we have acquired before.
>> +         */
>>           ctx = other_job->aio_context;
>>           aio_context_acquire(ctx);
>>           if (!job_is_completed(other_job)) {
>> @@ -779,7 +786,13 @@ static void job_completed_txn_abort(Job *job)
>>           aio_context_release(ctx);
>>       }
>>   
>> -    aio_context_acquire(outer_ctx);
>> +    /*
>> +     * Use job_ref()/job_unref() so we can read the AioContext here
>> +     * even if the job went away during job_finalize_single().
>> +     */
>> +    ctx = job->aio_context;
>> +    job_unref(job);
>> +    aio_context_acquire(ctx);
>
>
> Why use the ctx variable instead of doing exactly the same as in
> job_txn_apply():
>
>    aio_context_acquire(job->aio_context);
>    job_unref(job);
>
> ?

Oh, I just didn’t think of that.  Sounds good, thanks!

Hanna

> anyway:
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>




* Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()
  2021-09-01 10:20   ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Vladimir Sementsov-Ogievskiy
@ 2021-09-01 12:49     ` Hanna Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Hanna Reitz @ 2021-09-01 12:49 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-block
  Cc: Kevin Wolf, qemu-devel

On 01.09.21 12:20, Vladimir Sementsov-Ogievskiy wrote:
> 06.08.2021 12:38, Max Reitz wrote:
>> Callers should be able to specify whether they want job_cancel_sync() to
>> force-cancel the job or not.
>>
>> In fact, almost all invocations do not care about consistency of the
>> result and just want the job to terminate as soon as possible, so they
>> should pass force=true.  The replication block driver is the exception.
>>
>> This changes some iotest outputs, because quitting qemu while a mirror
>> job is active will now lead to it being cancelled instead of completed,
>> which is what we want.  (Cancelling a READY mirror job with force=false
>> may take an indefinite amount of time, which we do not want when
>> quitting.  If users want consistent results, they must have all jobs be
>> done before they quit qemu.)
>>
>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   include/qemu/job.h                    | 10 ++---
>>   block/replication.c                   |  4 +-
>>   blockdev.c                            |  4 +-
>>   job.c                                 | 20 +++++++--
>>   qemu-nbd.c                            |  2 +-
>>   softmmu/runstate.c                    |  2 +-
>>   storage-daemon/qemu-storage-daemon.c  |  2 +-
>>   tests/unit/test-block-iothread.c      |  2 +-
>>   tests/unit/test-blockjob.c            |  2 +-
>>   tests/qemu-iotests/109.out            | 60 +++++++++++----------------
>>   tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
>>   11 files changed, 55 insertions(+), 55 deletions(-)
>>
>> diff --git a/include/qemu/job.h b/include/qemu/job.h
>> index 41162ed494..5e8edbc2c8 100644
>> --- a/include/qemu/job.h
>> +++ b/include/qemu/job.h
>> @@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
>>   
>>   /**
>>    * Synchronously cancel the @job.  The completion callback is called
>> - * before the function returns.  The job may actually complete
>> - * instead of canceling itself; the circumstances under which this
>> - * happens depend on the kind of job that is active.
>> + * before the function returns.  If @force is false, the job may
>> + * actually complete instead of canceling itself; the circumstances
>> + * under which this happens depend on the kind of job that is active.
>>    *
>>    * Returns the return value from the job if the job actually completed
>>    * during the call, or -ECANCELED if it was canceled.
>>    *
>>    * Callers must hold the AioContext lock of job->aio_context.
>>    */
>> -int job_cancel_sync(Job *job);
>> +int job_cancel_sync(Job *job, bool force);
>>   
>>   /** Synchronously cancels all jobs using job_cancel_sync(). */
>> -void job_cancel_sync_all(void);
>> +void job_cancel_sync_all(bool force);
>
> I think it would be better to keep job_cancel_sync_all(void) prototype 
> and just change its behavior to do force-cancel. Anyway, this patch 
> always pass true to it. And it would be strange to do soft-cancel-all, 
> keeping in mind that soft cancelling only make sense for mirror in 
> ready state.

Actually, yes, that’s true.  I’ll drop the parameter.

Hanna




* Re: [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}()
  2021-09-01 11:04   ` Vladimir Sementsov-Ogievskiy
@ 2021-09-01 12:50     ` Hanna Reitz
  0 siblings, 0 replies; 35+ messages in thread
From: Hanna Reitz @ 2021-09-01 12:50 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, Max Reitz, qemu-block
  Cc: Kevin Wolf, qemu-devel

On 01.09.21 13:04, Vladimir Sementsov-Ogievskiy wrote:
> 06.08.2021 12:38, Max Reitz wrote:
>> @@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
>>            * disk, secondary disk in backup_job_completed().
>>            */
>>           if (s->backup_job) {
>> -            job_cancel_sync(&s->backup_job->job);
>> +            job_cancel_sync(&s->backup_job->job, false);
>
> That's not quite correct, as backup is always force cancelled..

Good point.  I think functionally it shouldn’t make a difference, right? 
– but it’s better to be explicit about it and only use force=false where 
it actually makes a difference.

Hanna





Thread overview: 35+ messages
2021-08-06  9:38 [PATCH for-6.2 v3 00/12] mirror: Handle errors after READY cancel Max Reitz
2021-08-06  9:38 ` [PATCH for-6.2 v3 01/12] job: Context changes in job_completed_txn_abort() Max Reitz
2021-08-06 19:16   ` Eric Blake
2021-08-09 10:04     ` Max Reitz
2021-09-01 10:05   ` Vladimir Sementsov-Ogievskiy
2021-09-01 12:47     ` Hanna Reitz
2021-08-06  9:38 ` [PATCH for-6.2 v3 02/12] mirror: Keep s->synced on error Max Reitz
2021-08-06  9:38 ` [PATCH for-6.2 v3 03/12] mirror: Drop s->synced Max Reitz
2021-08-06  9:38 ` [PATCH for-6.2 v3 04/12] job: Force-cancel jobs in a failed transaction Max Reitz
2021-08-06 19:22   ` Eric Blake
2021-09-01 10:08   ` Vladimir Sementsov-Ogievskiy
2021-08-06  9:38 ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Max Reitz
2021-08-06 19:39   ` Eric Blake
2021-08-09 10:09     ` Max Reitz
2021-09-01 10:20   ` [PATCH for-6.2 v3 05/12] job: @force parameter for job_cancel_sync{,_all}() Vladimir Sementsov-Ogievskiy
2021-09-01 12:49     ` Hanna Reitz
2021-09-01 11:04   ` Vladimir Sementsov-Ogievskiy
2021-09-01 12:50     ` Hanna Reitz
2021-08-06  9:38 ` [PATCH for-6.2 v3 06/12] jobs: Give Job.force_cancel more meaning Max Reitz
2021-08-06  9:38 ` [PATCH for-6.2 v3 07/12] job: Add job_cancel_requested() Max Reitz
2021-08-06 20:34   ` Eric Blake
2021-09-01 11:44   ` Vladimir Sementsov-Ogievskiy
2021-08-06  9:38 ` [PATCH for-6.2 v3 08/12] mirror: Use job_is_cancelled() Max Reitz
2021-08-06 20:35   ` Eric Blake
2021-09-01 11:45   ` Vladimir Sementsov-Ogievskiy
2021-08-06  9:38 ` [PATCH for-6.2 v3 09/12] mirror: Check job_is_cancelled() earlier Max Reitz
2021-08-06 20:36   ` Eric Blake
2021-09-01 12:11   ` Vladimir Sementsov-Ogievskiy
2021-08-06  9:38 ` [PATCH for-6.2 v3 10/12] mirror: Stop active mirroring after force-cancel Max Reitz
2021-08-06 20:37   ` Eric Blake
2021-09-01 12:16   ` Vladimir Sementsov-Ogievskiy
2021-08-06  9:38 ` [PATCH for-6.2 v3 11/12] mirror: Do not clear .cancelled Max Reitz
2021-08-06 20:42   ` Eric Blake
2021-09-01 12:22   ` Vladimir Sementsov-Ogievskiy
2021-08-06  9:38 ` [PATCH for-6.2 v3 12/12] iotests: Add mirror-ready-cancel-error test Max Reitz
