qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel
@ 2021-07-22 12:26 Max Reitz
  2021-07-22 12:26 ` [PATCH for-6.1? 1/6] mirror: Keep s->synced on error Max Reitz
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

Hi,

This is a rather complex series with changes that aren’t exactly local
to the mirror job, so maybe it’s too complex for 6.1.

However, it is a bug fix, and not an insignificant one, though probably
not a regression of any kind.

Bug report:
https://gitlab.com/qemu-project/qemu/-/issues/462

(I didn’t put any “Fixes:” or “Resolves:” into the commit messages,
because there is no single patch here that fixes the bug.)

The root of the problem is that if you cancel a mirror job during its
READY phase, any kind of I/O error (with the error action 'stop') is
likely to not be handled gracefully, which means that perhaps the job
will just loop forever without pausing, doing nothing but emitting
errors.  There is no way to stop the job, or cancel it with force=true,
and so you also cannot quit qemu normally, because, well, cancelling the
job doesn’t do anything.  So you have to kill qemu to stop the mess.

If you’re lucky, the error is transient.  Then qemu will just kill
itself with a failed assertion, because it’ll try a READY -> READY
transition, which isn’t allowed.

There are a couple of problems contributing to it all:

(1) The READY -> READY transition comes from the fact that we will enter
    the READY state whenever source and target are synced, and whenever
    s->synced is false.  I/O errors reset s->synced.  I believe they
    shouldn’t.
    (Patch 1)

(2) Quitting qemu doesn’t force-cancel jobs.  I don’t understand why.
    If for all jobs but mirror we want them to be cancelled and not
    properly completed, why do we want mirror to get a consistent
    result?  (Which is what cancel with force=false gives you.)
    I believe we actually don’t care, and so on many occasions where we
    invoke job_cancel_sync() and job_cancel_sync_all(), we want to
    force-cancel the job(s) in question.
    (Patch 2)

(3) Cancelling mirror post-READY with force=false is actually not really
    cancelling the job.  It’s a different completion mode.  The job
    should run like any normal job, it shouldn’t be special-cased.
    However, we have a couple of places that special-case cancelled job
    because they believe that such jobs are on their way to definite
    termination.  For example, we don’t allow pausing cancelled jobs.
    We definitely do want to allow pausing a mirror post-READY job that
    is being non-force-cancelled.  The job may still take an arbitrary
    amount of time, so it absolutely should be pausable.
    (Patches 3, 4)

(4) Mirror only checks whether it’s been force-cancelled at the bottom
    of its main loop, after several `continue`s.  Therefore, if flushing
    fails (and it then `continue`s), that check will be skipped.  If
    flushing fails continuously, the job cannot be force-cancelled.
    (Patch 5)


Max Reitz (6):
  mirror: Keep s->synced on error
  job: @force parameter for job_cancel_sync{,_all}()
  jobs: Give Job.force_cancel more meaning
  job: Add job_cancel_requested()
  mirror: Check job_is_cancelled() earlier
  iotests: Add mirror-ready-cancel-error test

 include/qemu/job.h                            |  29 +++-
 block/backup.c                                |   3 +-
 block/mirror.c                                |  35 +++--
 block/replication.c                           |   4 +-
 blockdev.c                                    |   4 +-
 job.c                                         |  46 ++++--
 qemu-nbd.c                                    |   2 +-
 softmmu/runstate.c                            |   2 +-
 storage-daemon/qemu-storage-daemon.c          |   2 +-
 tests/unit/test-block-iothread.c              |   2 +-
 tests/unit/test-blockjob.c                    |   2 +-
 tests/qemu-iotests/109.out                    |  60 +++-----
 .../tests/mirror-ready-cancel-error           | 143 ++++++++++++++++++
 .../tests/mirror-ready-cancel-error.out       |   5 +
 tests/qemu-iotests/tests/qsd-jobs.out         |   2 +-
 15 files changed, 262 insertions(+), 79 deletions(-)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

-- 
2.31.1



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH for-6.1? 1/6] mirror: Keep s->synced on error
  2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
@ 2021-07-22 12:26 ` Max Reitz
  2021-07-22 16:25   ` Vladimir Sementsov-Ogievskiy
  2021-07-26 10:24   ` Vladimir Sementsov-Ogievskiy
  2021-07-22 12:26 ` [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{, _all}() Max Reitz
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

An error does not take us out of the READY phase, which is what
s->synced signifies.  It does of course mean that source and target are
no longer in sync, but that is what s->actively_sync is for -- s->synced
never meant that source and target are in sync, only that they were at
some point (and at that point we transitioned into the READY phase).

The tangible problem is that we transition to READY once we are in sync
and s->synced is false.  By resetting s->synced here, we will transition
from READY to READY once the error is resolved (if the job keeps
running), and that transition is not allowed.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/mirror.c b/block/mirror.c
index 98fc66eabf..d73b704473 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -121,7 +121,6 @@ typedef enum MirrorMethod {
 static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
                                             int error)
 {
-    s->synced = false;
     s->actively_synced = false;
     if (read) {
         return block_job_error_action(&s->common, s->on_source_error,
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{, _all}()
  2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
  2021-07-22 12:26 ` [PATCH for-6.1? 1/6] mirror: Keep s->synced on error Max Reitz
@ 2021-07-22 12:26 ` Max Reitz
  2021-07-22 17:00   ` [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{,_all}() Vladimir Sementsov-Ogievskiy
  2021-07-22 12:26 ` [PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning Max Reitz
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

Callers should be able to specify whether they want job_cancel_sync() to
force-cancel the job or not.

In fact, almost all invocations do not care about consistency of the
result and just want the job to terminate as soon as possible, so they
should pass force=true.  The replication block driver is the exception.

This changes some iotest outputs, because quitting qemu while a mirror
job is active will now lead to it being cancelled instead of completed,
which is what we want.  (Cancelling a READY mirror job with force=false
may take an indefinite amount of time, which we do not want when
quitting.  If users want consistent results, they must have all jobs be
done before they quit qemu.)

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/qemu/job.h                    | 10 ++---
 block/replication.c                   |  4 +-
 blockdev.c                            |  4 +-
 job.c                                 | 27 +++++++++---
 qemu-nbd.c                            |  2 +-
 softmmu/runstate.c                    |  2 +-
 storage-daemon/qemu-storage-daemon.c  |  2 +-
 tests/unit/test-block-iothread.c      |  2 +-
 tests/unit/test-blockjob.c            |  2 +-
 tests/qemu-iotests/109.out            | 60 +++++++++++----------------
 tests/qemu-iotests/tests/qsd-jobs.out |  2 +-
 11 files changed, 61 insertions(+), 56 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 41162ed494..5e8edbc2c8 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -506,19 +506,19 @@ void job_user_cancel(Job *job, bool force, Error **errp);
 
 /**
  * Synchronously cancel the @job.  The completion callback is called
- * before the function returns.  The job may actually complete
- * instead of canceling itself; the circumstances under which this
- * happens depend on the kind of job that is active.
+ * before the function returns.  If @force is false, the job may
+ * actually complete instead of canceling itself; the circumstances
+ * under which this happens depend on the kind of job that is active.
  *
  * Returns the return value from the job if the job actually completed
  * during the call, or -ECANCELED if it was canceled.
  *
  * Callers must hold the AioContext lock of job->aio_context.
  */
-int job_cancel_sync(Job *job);
+int job_cancel_sync(Job *job, bool force);
 
 /** Synchronously cancels all jobs using job_cancel_sync(). */
-void job_cancel_sync_all(void);
+void job_cancel_sync_all(bool force);
 
 /**
  * @job: The job to be completed.
diff --git a/block/replication.c b/block/replication.c
index 32444b9a8f..e7a9327b12 100644
--- a/block/replication.c
+++ b/block/replication.c
@@ -149,7 +149,7 @@ static void replication_close(BlockDriverState *bs)
     if (s->stage == BLOCK_REPLICATION_FAILOVER) {
         commit_job = &s->commit_job->job;
         assert(commit_job->aio_context == qemu_get_current_aio_context());
-        job_cancel_sync(commit_job);
+        job_cancel_sync(commit_job, false);
     }
 
     if (s->mode == REPLICATION_MODE_SECONDARY) {
@@ -726,7 +726,7 @@ static void replication_stop(ReplicationState *rs, bool failover, Error **errp)
          * disk, secondary disk in backup_job_completed().
          */
         if (s->backup_job) {
-            job_cancel_sync(&s->backup_job->job);
+            job_cancel_sync(&s->backup_job->job, false);
         }
 
         if (!failover) {
diff --git a/blockdev.c b/blockdev.c
index 3d8ac368a1..aa95918c02 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1848,7 +1848,7 @@ static void drive_backup_abort(BlkActionState *common)
         aio_context = bdrv_get_aio_context(state->bs);
         aio_context_acquire(aio_context);
 
-        job_cancel_sync(&state->job->job);
+        job_cancel_sync(&state->job->job, true);
 
         aio_context_release(aio_context);
     }
@@ -1949,7 +1949,7 @@ static void blockdev_backup_abort(BlkActionState *common)
         aio_context = bdrv_get_aio_context(state->bs);
         aio_context_acquire(aio_context);
 
-        job_cancel_sync(&state->job->job);
+        job_cancel_sync(&state->job->job, true);
 
         aio_context_release(aio_context);
     }
diff --git a/job.c b/job.c
index e7a5d28854..9e971d64cf 100644
--- a/job.c
+++ b/job.c
@@ -763,7 +763,12 @@ static void job_completed_txn_abort(Job *job)
         if (other_job != job) {
             ctx = other_job->aio_context;
             aio_context_acquire(ctx);
-            job_cancel_async(other_job, false);
+            /*
+             * This is a transaction: If one job failed, no result will matter.
+             * Therefore, pass force=true to terminate all other jobs as quickly
+             * as possible.
+             */
+            job_cancel_async(other_job, true);
             aio_context_release(ctx);
         }
     }
@@ -964,12 +969,24 @@ static void job_cancel_err(Job *job, Error **errp)
     job_cancel(job, false);
 }
 
-int job_cancel_sync(Job *job)
+/**
+ * Same as job_cancel_err(), but force-cancel.
+ */
+static void job_force_cancel_err(Job *job, Error **errp)
 {
-    return job_finish_sync(job, &job_cancel_err, NULL);
+    job_cancel(job, true);
+}
+
+int job_cancel_sync(Job *job, bool force)
+{
+    if (force) {
+        return job_finish_sync(job, &job_force_cancel_err, NULL);
+    } else {
+        return job_finish_sync(job, &job_cancel_err, NULL);
+    }
 }
 
-void job_cancel_sync_all(void)
+void job_cancel_sync_all(bool force)
 {
     Job *job;
     AioContext *aio_context;
@@ -977,7 +994,7 @@ void job_cancel_sync_all(void)
     while ((job = job_next(NULL))) {
         aio_context = job->aio_context;
         aio_context_acquire(aio_context);
-        job_cancel_sync(job);
+        job_cancel_sync(job, force);
         aio_context_release(aio_context);
     }
 }
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 26ffbf15af..7fadfcfd23 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -479,7 +479,7 @@ static const char *socket_activation_validate_opts(const char *device,
 
 static void qemu_nbd_shutdown(void)
 {
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
     blk_exp_close_all();
     bdrv_close_all();
 }
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index 10d9b7365a..cf239e3b4c 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -812,7 +812,7 @@ void qemu_cleanup(void)
     vm_shutdown();
     replay_finish();
 
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
     bdrv_close_all();
 
     /* vhost-user must be cleaned up before chardevs.  */
diff --git a/storage-daemon/qemu-storage-daemon.c b/storage-daemon/qemu-storage-daemon.c
index fc8b150629..6c7142574c 100644
--- a/storage-daemon/qemu-storage-daemon.c
+++ b/storage-daemon/qemu-storage-daemon.c
@@ -347,7 +347,7 @@ int main(int argc, char *argv[])
 
     blk_exp_close_all();
     bdrv_drain_all_begin();
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
     bdrv_close_all();
 
     monitor_cleanup();
diff --git a/tests/unit/test-block-iothread.c b/tests/unit/test-block-iothread.c
index c39e70b2f5..09807fd2ca 100644
--- a/tests/unit/test-block-iothread.c
+++ b/tests/unit/test-block-iothread.c
@@ -662,7 +662,7 @@ static void test_propagate_mirror(void)
     g_assert(bdrv_get_aio_context(target) == ctx);
     g_assert(bdrv_get_aio_context(filter) == ctx);
 
-    job_cancel_sync_all();
+    job_cancel_sync_all(true);
 
     aio_context_acquire(ctx);
     blk_set_aio_context(blk, main_ctx, &error_abort);
diff --git a/tests/unit/test-blockjob.c b/tests/unit/test-blockjob.c
index dcacfa6c7c..4c9e1bf1e5 100644
--- a/tests/unit/test-blockjob.c
+++ b/tests/unit/test-blockjob.c
@@ -230,7 +230,7 @@ static void cancel_common(CancelJob *s)
     ctx = job->job.aio_context;
     aio_context_acquire(ctx);
 
-    job_cancel_sync(&job->job);
+    job_cancel_sync(&job->job, true);
     if (sts != JOB_STATUS_CREATED && sts != JOB_STATUS_CONCLUDED) {
         Job *dummy = &job->job;
         job_dismiss(&dummy, &error_abort);
diff --git a/tests/qemu-iotests/109.out b/tests/qemu-iotests/109.out
index 8f839b4b7f..e29280015e 100644
--- a/tests/qemu-iotests/109.out
+++ b/tests/qemu-iotests/109.out
@@ -44,9 +44,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -95,9 +94,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 197120, "offset": 197120, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 197120, "offset": 197120, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -146,9 +144,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -197,9 +194,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 1024, "offset": 1024, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -248,9 +244,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 65536, "offset": 65536, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -299,9 +294,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -349,9 +343,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 2560, "offset": 2560, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -399,9 +392,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 31457280, "offset": 31457280, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 31457280, "offset": 31457280, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -449,9 +441,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 327680, "offset": 327680, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -499,9 +490,8 @@ read 512/512 bytes at offset 0
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 2048, "offset": 2048, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 2048, "offset": 2048, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -529,9 +519,8 @@ WARNING: Image format was not specified for 'TEST_DIR/t.raw' and probing guessed
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
@@ -552,9 +541,8 @@ Images are identical.
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-qmp-quit"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "standby", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "ready", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "src"}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "aborting", "id": "src"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "src", "len": 512, "offset": 512, "speed": 0, "type": "mirror"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "src"}}
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "src"}}
 Images are identical.
diff --git a/tests/qemu-iotests/tests/qsd-jobs.out b/tests/qemu-iotests/tests/qsd-jobs.out
index 189423354b..c1bc9b8356 100644
--- a/tests/qemu-iotests/tests/qsd-jobs.out
+++ b/tests/qemu-iotests/tests/qsd-jobs.out
@@ -8,7 +8,7 @@ QMP_VERSION
 {"return": {}}
 {"return": {}}
 {"return": {}}
-{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
+{"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "BLOCK_JOB_CANCELLED", "data": {"device": "job0", "len": 0, "offset": 0, "speed": 0, "type": "commit"}}
 
 === Streaming can't get permission on base node ===
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning
  2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
  2021-07-22 12:26 ` [PATCH for-6.1? 1/6] mirror: Keep s->synced on error Max Reitz
  2021-07-22 12:26 ` [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{, _all}() Max Reitz
@ 2021-07-22 12:26 ` Max Reitz
  2021-07-22 18:16   ` Vladimir Sementsov-Ogievskiy
  2021-07-22 12:26 ` [PATCH for-6.1? 4/6] job: Add job_cancel_requested() Max Reitz
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

We largely have two cancel modes for jobs:

First, there is actual cancelling.  The job is terminated as soon as
possible, without trying to reach a consistent result.

Second, we have mirror in the READY state.  Technically, the job is not
really cancelled, but it just is a different completion mode.  The job
can still run for an indefinite amount of time while it tries to reach a
consistent result.

We want to be able to clearly distinguish which cancel mode a job is in
(when it has been cancelled).  We can use Job.force_cancel for this, but
right now it only reflects cancel requests from the user with
force=true, but clearly, jobs that do not even distinguish between
force=false and force=true are effectively always force-cancelled.

So this patch has Job.force_cancel signify whether the job will
terminate as soon as possible (force_cancel=true) or whether it will
effectively remain running despite being "cancelled"
(force_cancel=false).

To this end, we let jobs that provide JobDriver.cancel() tell the
generic job code whether they will terminate as soon as possible or not,
and for jobs that do not provide that method we assume they will.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/qemu/job.h | 11 ++++++++++-
 block/backup.c     |  3 ++-
 block/mirror.c     | 24 ++++++++++++++++++------
 job.c              |  6 +++++-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 5e8edbc2c8..8aa90f7395 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -253,8 +253,17 @@ struct JobDriver {
 
     /**
      * If the callback is not NULL, it will be invoked in job_cancel_async
+     *
+     * This function must return true if the job will be cancelled
+     * immediately without any further I/O (mandatory if @force is
+     * true), and false otherwise.  This lets the generic job layer
+     * know whether a job has been truly (force-)cancelled, or whether
+     * it is just in a special completion mode (like mirror after
+     * READY).
+     * (If the callback is NULL, the job is assumed to terminate
+     * without I/O.)
      */
-    void (*cancel)(Job *job, bool force);
+    bool (*cancel)(Job *job, bool force);
 
 
     /** Called when the job is freed */
diff --git a/block/backup.c b/block/backup.c
index bd3614ce70..513e1c8a0b 100644
--- a/block/backup.c
+++ b/block/backup.c
@@ -331,11 +331,12 @@ static void coroutine_fn backup_set_speed(BlockJob *job, int64_t speed)
     }
 }
 
-static void backup_cancel(Job *job, bool force)
+static bool backup_cancel(Job *job, bool force)
 {
     BackupBlockJob *s = container_of(job, BackupBlockJob, common.job);
 
     bdrv_cancel_in_flight(s->target_bs);
+    return true;
 }
 
 static const BlockJobDriver backup_job_driver = {
diff --git a/block/mirror.c b/block/mirror.c
index d73b704473..c3514f4196 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1089,9 +1089,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         }
         trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job) &&
-            (!s->synced || s->common.job.force_cancel))
-        {
+        if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
             break;
         }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1103,7 +1101,7 @@ immediate_exit:
          * or it was cancelled prematurely so that we do not guarantee that
          * the target is a copy of the source.
          */
-        assert(ret < 0 || ((s->common.job.force_cancel || !s->synced) &&
+        assert(ret < 0 || (s->common.job.force_cancel &&
                job_is_cancelled(&s->common.job)));
         assert(need_drain);
         mirror_wait_for_all_io(s);
@@ -1189,14 +1187,27 @@ static bool mirror_drained_poll(BlockJob *job)
     return !!s->in_flight;
 }
 
-static void mirror_cancel(Job *job, bool force)
+static bool mirror_cancel(Job *job, bool force)
 {
     MirrorBlockJob *s = container_of(job, MirrorBlockJob, common.job);
     BlockDriverState *target = blk_bs(s->target);
 
-    if (force || !job_is_ready(job)) {
+    /*
+     * Before the job is READY, we treat any cancellation like a
+     * force-cancellation.
+     */
+    force = force || !job_is_ready(job);
+
+    if (force) {
         bdrv_cancel_in_flight(target);
     }
+    return force;
+}
+
+static bool commit_active_cancel(Job *job, bool force)
+{
+    /* Same as above in mirror_cancel() */
+    return force || !job_is_ready(job);
 }
 
 static const BlockJobDriver mirror_job_driver = {
@@ -1226,6 +1237,7 @@ static const BlockJobDriver commit_active_job_driver = {
         .abort                  = mirror_abort,
         .pause                  = mirror_pause,
         .complete               = mirror_complete,
+        .cancel                 = commit_active_cancel,
     },
     .drained_poll           = mirror_drained_poll,
 };
diff --git a/job.c b/job.c
index 9e971d64cf..e78d893a9c 100644
--- a/job.c
+++ b/job.c
@@ -719,8 +719,12 @@ static int job_finalize_single(Job *job)
 static void job_cancel_async(Job *job, bool force)
 {
     if (job->driver->cancel) {
-        job->driver->cancel(job, force);
+        force = job->driver->cancel(job, force);
+    } else {
+        /* No .cancel() means the job will behave as if force-cancelled */
+        force = true;
     }
+
     if (job->user_paused) {
         /* Do not call job_enter here, the caller will handle it.  */
         if (job->driver->user_resume) {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH for-6.1? 4/6] job: Add job_cancel_requested()
  2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
                   ` (2 preceding siblings ...)
  2021-07-22 12:26 ` [PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning Max Reitz
@ 2021-07-22 12:26 ` Max Reitz
  2021-07-22 17:58   ` Vladimir Sementsov-Ogievskiy
  2021-07-22 12:26 ` [PATCH for-6.1? 5/6] mirror: Check job_is_cancelled() earlier Max Reitz
  2021-07-22 12:26 ` [PATCH for-6.1? 6/6] iotests: Add mirror-ready-cancel-error test Max Reitz
  5 siblings, 1 reply; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

Most callers of job_is_cancelled() actually want to know whether the job
is on its way to immediate termination.  For example, we refuse to pause
jobs that are cancelled; but this only makes sense for jobs that are
really actually cancelled.

A mirror job that is cancelled during READY with force=false should
absolutely be allowed to pause.  This "cancellation" (which is actually
a kind of completion) may take an indefinite amount of time, and so
should behave like any job during normal operation.  For example, with
on-target-error=stop, the job should stop on write errors.  (In
contrast, force-cancelled jobs should not get write errors, as they
should just terminate and not do further I/O.)

Therefore, redefine job_is_cancelled() to only return true for jobs that
are force-cancelled (which as of HEAD^ means any job that interprets the
cancellation request as a request for immediate termination), and add
job_cancel_request() as the general variant, which returns true for any
jobs which have been requested to be cancelled, whether it be
immediately or after an arbitrarily long completion phase.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 include/qemu/job.h |  8 +++++++-
 block/mirror.c     |  9 ++++-----
 job.c              | 13 +++++++++----
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/qemu/job.h b/include/qemu/job.h
index 8aa90f7395..032edf3c5f 100644
--- a/include/qemu/job.h
+++ b/include/qemu/job.h
@@ -436,9 +436,15 @@ const char *job_type_str(const Job *job);
 /** Returns true if the job should not be visible to the management layer. */
 bool job_is_internal(Job *job);
 
-/** Returns whether the job is scheduled for cancellation. */
+/** Returns whether the job is being cancelled. */
 bool job_is_cancelled(Job *job);
 
+/**
+ * Returns whether the job is scheduled for cancellation (at an
+ * indefinite point).
+ */
+bool job_cancel_requested(Job *job);
+
 /** Returns whether the job is in a completed state. */
 bool job_is_completed(Job *job);
 
diff --git a/block/mirror.c b/block/mirror.c
index c3514f4196..291d2ed040 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -938,7 +938,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         job_transition_to_ready(&s->common.job);
         s->synced = true;
         s->actively_synced = true;
-        while (!job_is_cancelled(&s->common.job) && !s->should_complete) {
+        while (!job_cancel_requested(&s->common.job) && !s->should_complete) {
             job_yield(&s->common.job);
         }
         s->common.job.cancelled = false;
@@ -1046,7 +1046,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             }
 
             should_complete = s->should_complete ||
-                job_is_cancelled(&s->common.job);
+                job_cancel_requested(&s->common.job);
             cnt = bdrv_get_dirty_count(s->dirty_bitmap);
         }
 
@@ -1089,7 +1089,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
         }
         trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job) && s->common.job.force_cancel) {
+        if (job_is_cancelled(&s->common.job)) {
             break;
         }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
@@ -1101,8 +1101,7 @@ immediate_exit:
          * or it was cancelled prematurely so that we do not guarantee that
          * the target is a copy of the source.
          */
-        assert(ret < 0 || (s->common.job.force_cancel &&
-               job_is_cancelled(&s->common.job)));
+        assert(ret < 0 || job_is_cancelled(&s->common.job));
         assert(need_drain);
         mirror_wait_for_all_io(s);
     }
diff --git a/job.c b/job.c
index e78d893a9c..c51c8077cb 100644
--- a/job.c
+++ b/job.c
@@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
 }
 
 bool job_is_cancelled(Job *job)
+{
+    return job->cancelled && job->force_cancel;
+}
+
+bool job_cancel_requested(Job *job)
 {
     return job->cancelled;
 }
@@ -650,7 +655,7 @@ static void job_conclude(Job *job)
 
 static void job_update_rc(Job *job)
 {
-    if (!job->ret && job_is_cancelled(job)) {
+    if (!job->ret && job_cancel_requested(job)) {
         job->ret = -ECANCELED;
     }
     if (job->ret) {
@@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
 
     /* Emit events only if we actually started */
     if (job_started(job)) {
-        if (job_is_cancelled(job)) {
+        if (job_cancel_requested(job)) {
             job_event_cancelled(job);
         } else {
             job_event_completed(job);
@@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
     if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
         return;
     }
-    if (job_is_cancelled(job) || !job->driver->complete) {
+    if (job_cancel_requested(job) || !job->driver->complete) {
         error_setg(errp, "The active block job '%s' cannot be completed",
                    job->id);
         return;
@@ -1043,7 +1048,7 @@ int job_finish_sync(Job *job, void (*finish)(Job *, Error **errp), Error **errp)
     AIO_WAIT_WHILE(job->aio_context,
                    (job_enter(job), !job_is_completed(job)));
 
-    ret = (job_is_cancelled(job) && job->ret == 0) ? -ECANCELED : job->ret;
+    ret = (job_cancel_requested(job) && job->ret == 0) ? -ECANCELED : job->ret;
     job_unref(job);
     return ret;
 }
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH for-6.1? 5/6] mirror: Check job_is_cancelled() earlier
  2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
                   ` (3 preceding siblings ...)
  2021-07-22 12:26 ` [PATCH for-6.1? 4/6] job: Add job_cancel_requested() Max Reitz
@ 2021-07-22 12:26 ` Max Reitz
  2021-07-22 12:26 ` [PATCH for-6.1? 6/6] iotests: Add mirror-ready-cancel-error test Max Reitz
  5 siblings, 0 replies; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

We must check whether the job is force-cancelled early in our main loop,
most importantly before any `continue` statement.  For example, we used
to have `continue`s before our current checking location that are
triggered by `mirror_flush()` failing.  So, if `mirror_flush()` kept
failing, force-cancelling the job would not terminate it.

A job being force-cancelled should be treated the same as the job having
failed, so put the check in the same place where we check `s->ret < 0`.

Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 block/mirror.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 291d2ed040..a993ed37d0 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -995,7 +995,7 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             mirror_wait_for_any_operation(s, true);
         }
 
-        if (s->ret < 0) {
+        if (s->ret < 0 || job_is_cancelled(&s->common.job)) {
             ret = s->ret;
             goto immediate_exit;
         }
@@ -1081,17 +1081,12 @@ static int coroutine_fn mirror_run(Job *job, Error **errp)
             break;
         }
 
-        ret = 0;
-
         if (s->synced && !should_complete) {
             delay_ns = (s->in_flight == 0 &&
                         cnt == 0 ? BLOCK_JOB_SLICE_TIME : 0);
         }
         trace_mirror_before_sleep(s, cnt, s->synced, delay_ns);
         job_sleep_ns(&s->common.job, delay_ns);
-        if (job_is_cancelled(&s->common.job)) {
-            break;
-        }
         s->last_pause_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     }
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH for-6.1? 6/6] iotests: Add mirror-ready-cancel-error test
  2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
                   ` (4 preceding siblings ...)
  2021-07-22 12:26 ` [PATCH for-6.1? 5/6] mirror: Check job_is_cancelled() earlier Max Reitz
@ 2021-07-22 12:26 ` Max Reitz
  5 siblings, 0 replies; 16+ messages in thread
From: Max Reitz @ 2021-07-22 12:26 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Vladimir Sementsov-Ogievskiy, John Snow, qemu-devel,
	Max Reitz

Test what happens when there is an I/O error after a mirror job in the
READY phase has been cancelled.

Signed-off-by: Max Reitz <mreitz@redhat.com>
---
 .../tests/mirror-ready-cancel-error           | 143 ++++++++++++++++++
 .../tests/mirror-ready-cancel-error.out       |   5 +
 2 files changed, 148 insertions(+)
 create mode 100755 tests/qemu-iotests/tests/mirror-ready-cancel-error
 create mode 100644 tests/qemu-iotests/tests/mirror-ready-cancel-error.out

diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error b/tests/qemu-iotests/tests/mirror-ready-cancel-error
new file mode 100755
index 0000000000..f2dc88881f
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error
@@ -0,0 +1,143 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when errors occur to a mirror job after it has
+# been cancelled in the READY phase
+#
+# Copyright (C) 2021 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+
+
+image_size = 1 * 1024 * 1024
+source = os.path.join(iotests.test_dir, 'source.img')
+target = os.path.join(iotests.test_dir, 'target.img')
+
+
+class TestMirrorReadyCancelError(iotests.QMPTestCase):
+    def setUp(self) -> None:
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, source,
+                                       str(image_size)) == 0
+        assert iotests.qemu_img_create('-f', iotests.imgfmt, target,
+                                       str(image_size)) == 0
+
+        self.vm = iotests.VM()
+        self.vm.launch()
+
+    def tearDown(self) -> None:
+        self.vm.shutdown()
+        os.remove(source)
+        os.remove(target)
+
+    def add_blockdevs(self, once: bool) -> None:
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'source',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'file',
+                                 'filename': source
+                             }})
+        self.assert_qmp(res, 'return', {})
+
+        # blkdebug notes:
+        # Enter state 2 on the first flush, which happens before the
+        # job enters the READY state.  The second flush will happen
+        # when the job is about to complete, and we want that one to
+        # fail.
+        res = self.vm.qmp('blockdev-add',
+                          **{'node-name': 'target',
+                             'driver': iotests.imgfmt,
+                             'file': {
+                                 'driver': 'blkdebug',
+                                 'image': {
+                                     'driver': 'file',
+                                     'filename': target
+                                 },
+                                 'set-state': [{
+                                     'event': 'flush_to_disk',
+                                     'state': 1,
+                                     'new_state': 2
+                                 }],
+                                 'inject-error': [{
+                                     'event': 'flush_to_disk',
+                                     'once': once,
+                                     'immediately': True,
+                                     'state': 2
+                                 }]}})
+        self.assert_qmp(res, 'return', {})
+
+    def start_mirror(self) -> None:
+        res = self.vm.qmp('blockdev-mirror',
+                          job_id='mirror',
+                          device='source',
+                          target='target',
+                          filter_node_name='mirror-top',
+                          sync='full',
+                          on_target_error='stop')
+        self.assert_qmp(res, 'return', {})
+
+    def cancel_mirror_with_error(self) -> None:
+        self.vm.event_wait('BLOCK_JOB_READY')
+
+        # Write something so will not leave the job immediately, but
+        # flush first (which will fail, thanks to blkdebug)
+        res = self.vm.qmp('human-monitor-command',
+                          command_line='qemu-io mirror-top "write 0 64k"')
+        self.assert_qmp(res, 'return', '')
+
+        # Drain status change events
+        while self.vm.event_wait('JOB_STATUS_CHANGE', timeout=0.0) is not None:
+            pass
+
+        res = self.vm.qmp('block-job-cancel', device='mirror')
+        self.assert_qmp(res, 'return', {})
+
+        self.vm.event_wait('BLOCK_JOB_ERROR')
+
+    def test_transient_error(self) -> None:
+        self.add_blockdevs(True)
+        self.start_mirror()
+        self.cancel_mirror_with_error()
+
+        while True:
+            e = self.vm.event_wait('JOB_STATUS_CHANGE')
+            if e['data']['status'] == 'standby':
+                # Transient error, try again
+                self.vm.qmp('block-job-resume', device='mirror')
+            elif e['data']['status'] == 'null':
+                break
+
+    def test_persistent_error(self) -> None:
+        self.add_blockdevs(False)
+        self.start_mirror()
+        self.cancel_mirror_with_error()
+
+        while True:
+            e = self.vm.event_wait('JOB_STATUS_CHANGE')
+            if e['data']['status'] == 'standby':
+                # Persistent error, no point in continuing
+                self.vm.qmp('block-job-cancel', device='mirror', force=True)
+            elif e['data']['status'] == 'null':
+                break
+
+
+if __name__ == '__main__':
+    # LUKS would require special key-secret handling in add_blockdevs()
+    iotests.main(supported_fmts=['generic'],
+                 unsupported_fmts=['luks'],
+                 supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/mirror-ready-cancel-error.out b/tests/qemu-iotests/tests/mirror-ready-cancel-error.out
new file mode 100644
index 0000000000..fbc63e62f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/mirror-ready-cancel-error.out
@@ -0,0 +1,5 @@
+..
+----------------------------------------------------------------------
+Ran 2 tests
+
+OK
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 1/6] mirror: Keep s->synced on error
  2021-07-22 12:26 ` [PATCH for-6.1? 1/6] mirror: Keep s->synced on error Max Reitz
@ 2021-07-22 16:25   ` Vladimir Sementsov-Ogievskiy
  2021-07-26  7:11     ` Max Reitz
  2021-07-26 10:24   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-07-22 16:25 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, John Snow

22.07.2021 15:26, Max Reitz wrote:
> An error does not take us out of the READY phase, which is what
> s->synced signifies.  It does of course mean that source and target are
> no longer in sync, but that is what s->actively_sync is for -- s->synced
> never meant that source and target are in sync, only that they were at
> some point (and at that point we transitioned into the READY phase).
> 
> The tangible problem is that we transition to READY once we are in sync
> and s->synced is false.  By resetting s->synced here, we will transition
> from READY to READY once the error is resolved (if the job keeps
> running), and that transition is not allowed.
> 
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---
>   block/mirror.c | 1 -
>   1 file changed, 1 deletion(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 98fc66eabf..d73b704473 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -121,7 +121,6 @@ typedef enum MirrorMethod {
>   static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool read,
>                                               int error)
>   {
> -    s->synced = false;
>       s->actively_synced = false;
>       if (read) {
>           return block_job_error_action(&s->common, s->on_source_error,
> 

Looked through.. Yes, seems s->synced used as "is ready". Isn't it better to drop s->synced at all and use job_is_read() instead?

Hmm, s->actively_synced used only for assertion in active_write_settle().. That's not wrong, just interesting.

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{,_all}()
  2021-07-22 12:26 ` [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{, _all}() Max Reitz
@ 2021-07-22 17:00   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-07-22 17:00 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, John Snow

22.07.2021 15:26, Max Reitz wrote:
> Callers should be able to specify whether they want job_cancel_sync() to
> force-cancel the job or not.
> 
> In fact, almost all invocations do not care about consistency of the
> result and just want the job to terminate as soon as possible, so they
> should pass force=true.  The replication block driver is the exception.
> 
> This changes some iotest outputs, because quitting qemu while a mirror
> job is active will now lead to it being cancelled instead of completed,
> which is what we want.  (Cancelling a READY mirror job with force=false
> may take an indefinite amount of time, which we do not want when
> quitting.  If users want consistent results, they must have all jobs be
> done before they quit qemu.)
> 
> Buglink:https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz<mreitz@redhat.com>

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()
  2021-07-22 12:26 ` [PATCH for-6.1? 4/6] job: Add job_cancel_requested() Max Reitz
@ 2021-07-22 17:58   ` Vladimir Sementsov-Ogievskiy
  2021-07-26  7:09     ` Max Reitz
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-07-22 17:58 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, John Snow

22.07.2021 15:26, Max Reitz wrote:
> Most callers of job_is_cancelled() actually want to know whether the job
> is on its way to immediate termination.  For example, we refuse to pause
> jobs that are cancelled; but this only makes sense for jobs that are
> really actually cancelled.
> 
> A mirror job that is cancelled during READY with force=false should
> absolutely be allowed to pause.  This "cancellation" (which is actually
> a kind of completion) 

You have to repeat that this "cancel" is not "cancel".

So, the whole problem is that feature of mirror, on cancel in READY state do not cancel but do some specific kind of completion.

You try to make this thing correctly handled on generic layer..

Did you consider instead just drop the feature from generic layer? So that all *cancel* functions always do force-cancel. Then the internal implementation become a lot clearer.

But we have to support the qmp block-job-cancel of READY mirror (and commit) with force=false.

We can do it as an exclusion in qmp_block_job_cancel, something like:

if (job is mirror or commit AND it's ready AND force = false)
    mirror_soft_cancel(...);
else
    job_cancel(...);


> may take an indefinite amount of time, and so
> should behave like any job during normal operation.  For example, with
> on-target-error=stop, the job should stop on write errors.  (In
> contrast, force-cancelled jobs should not get write errors, as they
> should just terminate and not do further I/O.)
> 
> Therefore, redefine job_is_cancelled() to only return true for jobs that
> are force-cancelled (which as of HEAD^ means any job that interprets the
> cancellation request as a request for immediate termination), and add
> job_cancel_request() as the general variant, which returns true for any
> jobs which have been requested to be cancelled, whether it be
> immediately or after an arbitrarily long completion phase.
> 
> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
> Signed-off-by: Max Reitz <mreitz@redhat.com>
> ---

[..]

> --- a/job.c
> +++ b/job.c
> @@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
>   }
>   
>   bool job_is_cancelled(Job *job)
> +{
> +    return job->cancelled && job->force_cancel;
> +}
> +
> +bool job_cancel_requested(Job *job)
>   {
>       return job->cancelled;
>   }
> @@ -650,7 +655,7 @@ static void job_conclude(Job *job)
>   
>   static void job_update_rc(Job *job)
>   {
> -    if (!job->ret && job_is_cancelled(job)) {
> +    if (!job->ret && job_cancel_requested(job)) {

Why not job_is_cancelled() here?

So in case of mirror other kind of completion we set ret to -ECANCELED?

>           job->ret = -ECANCELED;
>       }
>       if (job->ret) {
> @@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
>   
>       /* Emit events only if we actually started */
>       if (job_started(job)) {
> -        if (job_is_cancelled(job)) {
> +        if (job_cancel_requested(job)) {
>               job_event_cancelled(job);

Same question here.. Shouldn't mirror report COMPLETED event in case of not-force cancelled in READY state?

>           } else {
>               job_event_completed(job);
> @@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
>       if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
>           return;
>       }
> -    if (job_is_cancelled(job) || !job->driver->complete) {
> +    if (job_cancel_requested(job) || !job->driver->complete) {
>           error_setg(errp, "The active block job '%s' cannot be completed",
>                      job->id);
>           return;
> @@ -1043,7 +1048,7 @@ int job_finish_sync(Job *job, void (*finish)(Job *, Error **errp), Error **errp)
>       AIO_WAIT_WHILE(job->aio_context,
>                      (job_enter(job), !job_is_completed(job)));
>   
> -    ret = (job_is_cancelled(job) && job->ret == 0) ? -ECANCELED : job->ret;
> +    ret = (job_cancel_requested(job) && job->ret == 0) ? -ECANCELED : job->ret;
>       job_unref(job);
>       return ret;
>   }
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning
  2021-07-22 12:26 ` [PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning Max Reitz
@ 2021-07-22 18:16   ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-07-22 18:16 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, John Snow

22.07.2021 15:26, Max Reitz wrote:
> We largely have two cancel modes for jobs:
> 
> First, there is actual cancelling.  The job is terminated as soon as
> possible, without trying to reach a consistent result.
> 
> Second, we have mirror in the READY state.  Technically, the job is not
> really cancelled, but it just is a different completion mode.  The job
> can still run for an indefinite amount of time while it tries to reach a
> consistent result.
> 
> We want to be able to clearly distinguish which cancel mode a job is in
> (when it has been cancelled).  We can use Job.force_cancel for this, but
> right now it only reflects cancel requests from the user with
> force=true, but clearly, jobs that do not even distinguish between
> force=false and force=true are effectively always force-cancelled.
> 
> So this patch has Job.force_cancel signify whether the job will
> terminate as soon as possible (force_cancel=true) or whether it will
> effectively remain running despite being "cancelled"
> (force_cancel=false).
> 
> To this end, we let jobs that provide JobDriver.cancel() tell the
> generic job code whether they will terminate as soon as possible or not,
> and for jobs that do not provide that method we assume they will.
> 
> Signed-off-by: Max Reitz<mreitz@redhat.com>


In isolation this patch is rather strange: force_cancel is used only by mirror. But we keep in generic job layer. And make a handler to set a value to this variable. So in generic layer we ask mirror which value it want to set to generic variable, which is used only by mirror.. This probably shows that this feature of mirror should be mirror only and generic layer shouldn't take care of it (see also my answer to next commit).

But at the end of the series the variable is not more used by mirror directly. So, technically the commit is not wrong, and it is a preparation for the following ones.

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()
  2021-07-22 17:58   ` Vladimir Sementsov-Ogievskiy
@ 2021-07-26  7:09     ` Max Reitz
  2021-07-27 14:47       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 16+ messages in thread
From: Max Reitz @ 2021-07-26  7:09 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, John Snow, qemu-devel

On 22.07.21 19:58, Vladimir Sementsov-Ogievskiy wrote:
> 22.07.2021 15:26, Max Reitz wrote:
>> Most callers of job_is_cancelled() actually want to know whether the job
>> is on its way to immediate termination.  For example, we refuse to pause
>> jobs that are cancelled; but this only makes sense for jobs that are
>> really actually cancelled.
>>
>> A mirror job that is cancelled during READY with force=false should
>> absolutely be allowed to pause.  This "cancellation" (which is actually
>> a kind of completion) 
>
> You have to repeat that this "cancel" is not "cancel".
>
> So, the whole problem is that feature of mirror, on cancel in READY 
> state do not cancel but do some specific kind of completion.
>
> You try to make this thing correctly handled on generic layer..
>
> Did you consider instead just drop the feature from generic layer? So 
> that all *cancel* functions always do force-cancel. Then the internal 
> implementation become a lot clearer.

Yes, I considered that, and I’ve decided against it (for now), because 
such a change would obviously be an incompatible change.  It would 
require a deprecation period, and so we would need to fix this bug now 
anyway.

> But we have to support the qmp block-job-cancel of READY mirror (and 
> commit) with force=false.
>
> We can do it as an exclusion in qmp_block_job_cancel, something like:
>
> if (job is mirror or commit AND it's ready AND force = false)
>    mirror_soft_cancel(...);
> else
>    job_cancel(...);

I didn’t consider such a hack, though.  I don’t like it.  If we think 
that we should change our approach because mirror’s soft cancel is 
actually a completion mode, and the current situation is too confusing, 
such a change should be user-visible, too.  (I think there was this idea 
of having job-specific flags or parameters you could change at runtime, 
and so you’d just change the “pivot” parameter between true or false.)

Also, I don’t know whether this would really make anything “a lot” 
easier.  After this series job_is_cancelled() already tells the true 
story, so all we could really change is to drop force_cancel and unify 
the “s->should_complete || job_cancel_requested()” conditions in 
block/mirror.c into one variable.  So when I considered making cancel 
exclusively force-cancel jobs, I thought it wouldn’t actually be worth 
it in practice.

>> may take an indefinite amount of time, and so
>> should behave like any job during normal operation.  For example, with
>> on-target-error=stop, the job should stop on write errors.  (In
>> contrast, force-cancelled jobs should not get write errors, as they
>> should just terminate and not do further I/O.)
>>
>> Therefore, redefine job_is_cancelled() to only return true for jobs that
>> are force-cancelled (which as of HEAD^ means any job that interprets the
>> cancellation request as a request for immediate termination), and add
>> job_cancel_request() as the general variant, which returns true for any
>> jobs which have been requested to be cancelled, whether it be
>> immediately or after an arbitrarily long completion phase.
>>
>> Buglink: https://gitlab.com/qemu-project/qemu/-/issues/462
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>
> [..]
>
>> --- a/job.c
>> +++ b/job.c
>> @@ -216,6 +216,11 @@ const char *job_type_str(const Job *job)
>>   }
>>     bool job_is_cancelled(Job *job)
>> +{
>> +    return job->cancelled && job->force_cancel;
>> +}
>> +
>> +bool job_cancel_requested(Job *job)
>>   {
>>       return job->cancelled;
>>   }
>> @@ -650,7 +655,7 @@ static void job_conclude(Job *job)
>>     static void job_update_rc(Job *job)
>>   {
>> -    if (!job->ret && job_is_cancelled(job)) {
>> +    if (!job->ret && job_cancel_requested(job)) {
>
> Why not job_is_cancelled() here?
>
> So in case of mirror other kind of completion we set ret to -ECANCELED?

I thought the return value is a user-visible thing, so I left it as-is.

Seems I was wrong, more below.

>>           job->ret = -ECANCELED;
>>       }
>>       if (job->ret) {
>> @@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
>>         /* Emit events only if we actually started */
>>       if (job_started(job)) {
>> -        if (job_is_cancelled(job)) {
>> +        if (job_cancel_requested(job)) {
>>               job_event_cancelled(job);
>
> Same question here.. Shouldn't mirror report COMPLETED event in case 
> of not-force cancelled in READY state?

Same here, I thought this is user-visible, nothing internal, so I should 
leave it as-is.

Now I see that cancelling mirror post-READY indeed should result in a 
COMPLETED event.  So I’m actually not exactly sure how mirror does that, 
despite this code here (which functionally isn’t changed by this patch), 
but it’s absolutely true that job_is_cancelled() would be more 
appropriate here.

(No iotest failed, so I thought this change was right.  Well.)

>>           } else {
>>               job_event_completed(job);
>> @@ -1015,7 +1020,7 @@ void job_complete(Job *job, Error **errp)
>>       if (job_apply_verb(job, JOB_VERB_COMPLETE, errp)) {
>>           return;
>>       }
>> -    if (job_is_cancelled(job) || !job->driver->complete) {
>> +    if (job_cancel_requested(job) || !job->driver->complete) {
>>           error_setg(errp, "The active block job '%s' cannot be 
>> completed",
>>                      job->id);
>>           return;
>> @@ -1043,7 +1048,7 @@ int job_finish_sync(Job *job, void 
>> (*finish)(Job *, Error **errp), Error **errp)
>>       AIO_WAIT_WHILE(job->aio_context,
>>                      (job_enter(job), !job_is_completed(job)));
>>   -    ret = (job_is_cancelled(job) && job->ret == 0) ? -ECANCELED : 
>> job->ret;
>> +    ret = (job_cancel_requested(job) && job->ret == 0) ? -ECANCELED 
>> : job->ret;

So here it should probably stay a job_is_cancelled(), too.

Max

>>       job_unref(job);
>>       return ret;
>>   }
>>
>
>



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 1/6] mirror: Keep s->synced on error
  2021-07-22 16:25   ` Vladimir Sementsov-Ogievskiy
@ 2021-07-26  7:11     ` Max Reitz
  0 siblings, 0 replies; 16+ messages in thread
From: Max Reitz @ 2021-07-26  7:11 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, John Snow, qemu-devel

On 22.07.21 18:25, Vladimir Sementsov-Ogievskiy wrote:
> 22.07.2021 15:26, Max Reitz wrote:
>> An error does not take us out of the READY phase, which is what
>> s->synced signifies.  It does of course mean that source and target are
>> no longer in sync, but that is what s->actively_sync is for -- s->synced
>> never meant that source and target are in sync, only that they were at
>> some point (and at that point we transitioned into the READY phase).
>>
>> The tangible problem is that we transition to READY once we are in sync
>> and s->synced is false.  By resetting s->synced here, we will transition
>> from READY to READY once the error is resolved (if the job keeps
>> running), and that transition is not allowed.
>>
>> Signed-off-by: Max Reitz <mreitz@redhat.com>
>> ---
>>   block/mirror.c | 1 -
>>   1 file changed, 1 deletion(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 98fc66eabf..d73b704473 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -121,7 +121,6 @@ typedef enum MirrorMethod {
>>   static BlockErrorAction mirror_error_action(MirrorBlockJob *s, bool 
>> read,
>>                                               int error)
>>   {
>> -    s->synced = false;
>>       s->actively_synced = false;
>>       if (read) {
>>           return block_job_error_action(&s->common, s->on_source_error,
>>
>
> Looked through.. Yes, seems s->synced used as "is ready". Isn't it 
> better to drop s->synced at all and use job_is_read() instead?

Sounds good, though I think for the change to be clear, I’d like to keep 
this patch and then drop s->synced on top.

Max

> Hmm, s->actively_synced used only for assertion in 
> active_write_settle().. That's not wrong, just interesting.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 1/6] mirror: Keep s->synced on error
  2021-07-22 12:26 ` [PATCH for-6.1? 1/6] mirror: Keep s->synced on error Max Reitz
  2021-07-22 16:25   ` Vladimir Sementsov-Ogievskiy
@ 2021-07-26 10:24   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 0 replies; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-07-26 10:24 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, John Snow

22.07.2021 15:26, Max Reitz wrote:
> An error does not take us out of the READY phase, which is what
> s->synced signifies.  It does of course mean that source and target are
> no longer in sync, but that is what s->actively_sync is for -- s->synced
> never meant that source and target are in sync, only that they were at
> some point (and at that point we transitioned into the READY phase).
> 
> The tangible problem is that we transition to READY once we are in sync
> and s->synced is false.  By resetting s->synced here, we will transition
> from READY to READY once the error is resolved (if the job keeps
> running), and that transition is not allowed.
> 
> Signed-off-by: Max Reitz<mreitz@redhat.com>


Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()
  2021-07-26  7:09     ` Max Reitz
@ 2021-07-27 14:47       ` Vladimir Sementsov-Ogievskiy
  2021-07-27 15:30         ` Max Reitz
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2021-07-27 14:47 UTC (permalink / raw)
  To: Max Reitz, qemu-block; +Cc: qemu-devel, Kevin Wolf, John Snow

26.07.2021 10:09, Max Reitz wrote:
> 
>>>           job->ret = -ECANCELED;
>>>       }
>>>       if (job->ret) {
>>> @@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
>>>         /* Emit events only if we actually started */
>>>       if (job_started(job)) {
>>> -        if (job_is_cancelled(job)) {
>>> +        if (job_cancel_requested(job)) {
>>>               job_event_cancelled(job);
>>
>> Same question here.. Shouldn't mirror report COMPLETED event in case of not-force cancelled in READY state?
> 
> Same here, I thought this is user-visible, nothing internal, so I should leave it as-is.
> 
> Now I see that cancelling mirror post-READY indeed should result in a COMPLETED event.  So I’m actually not exactly sure how mirror does that, despite this code here


Hmm. Now looking at mirror code, I see that it does "s->common.job.cancelled = false"

-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-6.1? 4/6] job: Add job_cancel_requested()
  2021-07-27 14:47       ` Vladimir Sementsov-Ogievskiy
@ 2021-07-27 15:30         ` Max Reitz
  0 siblings, 0 replies; 16+ messages in thread
From: Max Reitz @ 2021-07-27 15:30 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block
  Cc: Kevin Wolf, John Snow, qemu-devel

On 27.07.21 16:47, Vladimir Sementsov-Ogievskiy wrote:
> 26.07.2021 10:09, Max Reitz wrote:
>>
>>>>           job->ret = -ECANCELED;
>>>>       }
>>>>       if (job->ret) {
>>>> @@ -704,7 +709,7 @@ static int job_finalize_single(Job *job)
>>>>         /* Emit events only if we actually started */
>>>>       if (job_started(job)) {
>>>> -        if (job_is_cancelled(job)) {
>>>> +        if (job_cancel_requested(job)) {
>>>>               job_event_cancelled(job);
>>>
>>> Same question here.. Shouldn't mirror report COMPLETED event in case 
>>> of not-force cancelled in READY state?
>>
>> Same here, I thought this is user-visible, nothing internal, so I 
>> should leave it as-is.
>>
>> Now I see that cancelling mirror post-READY indeed should result in a 
>> COMPLETED event.  So I’m actually not exactly sure how mirror does 
>> that, despite this code here
>
>
> Hmm. Now looking at mirror code, I see that it does 
> "s->common.job.cancelled = false"

*lol*, that’s nice.

So since we’ve missed the rc1 boat now, I think this is 6.2 material.  
I’ll look into whether we can drop that then, that would be nice.

Max



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-07-27 15:31 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-22 12:26 [PATCH for-6.1? 0/6] mirror: Handle errors after READY cancel Max Reitz
2021-07-22 12:26 ` [PATCH for-6.1? 1/6] mirror: Keep s->synced on error Max Reitz
2021-07-22 16:25   ` Vladimir Sementsov-Ogievskiy
2021-07-26  7:11     ` Max Reitz
2021-07-26 10:24   ` Vladimir Sementsov-Ogievskiy
2021-07-22 12:26 ` [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{, _all}() Max Reitz
2021-07-22 17:00   ` [PATCH for-6.1? 2/6] job: @force parameter for job_cancel_sync{,_all}() Vladimir Sementsov-Ogievskiy
2021-07-22 12:26 ` [PATCH for-6.1? 3/6] jobs: Give Job.force_cancel more meaning Max Reitz
2021-07-22 18:16   ` Vladimir Sementsov-Ogievskiy
2021-07-22 12:26 ` [PATCH for-6.1? 4/6] job: Add job_cancel_requested() Max Reitz
2021-07-22 17:58   ` Vladimir Sementsov-Ogievskiy
2021-07-26  7:09     ` Max Reitz
2021-07-27 14:47       ` Vladimir Sementsov-Ogievskiy
2021-07-27 15:30         ` Max Reitz
2021-07-22 12:26 ` [PATCH for-6.1? 5/6] mirror: Check job_is_cancelled() earlier Max Reitz
2021-07-22 12:26 ` [PATCH for-6.1? 6/6] iotests: Add mirror-ready-cancel-error test Max Reitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).