* [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
@ 2015-12-22 16:46 Kevin Wolf
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images Kevin Wolf
                   ` (13 more replies)
  0 siblings, 14 replies; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

Enough innocent images have died because users called 'qemu-img snapshot' while
the VM was still running. Educating the users doesn't seem to be a working
strategy, so this series adds locking to qcow2 that refuses to access the image
read-write from two processes.

Eric, this will require a libvirt update to deal with qemu crashes which leave
locked images behind. The simplest conceivable way would be to unconditionally
override the lock in libvirt whenever the option is present. In that case,
libvirt VMs would be protected against concurrent non-libvirt accesses, but not
the other way round. If you want more than that, libvirt would have to check
somehow if it was its own VM that used the image and left the lock behind. I
imagine that can't be too hard either.

Also note that this kind of depends on Max's bdrv_close_all() series, but only
in order to pass test case 142. This is not a bug in this series, but a
preexisting one (bs->file can be closed before bs), and it becomes apparent
when qemu fails to unlock an image due to this bug. Max's series fixes this.

Kevin Wolf (10):
  qcow2: Write feature table only for v3 images
  qcow2: Write full header on image creation
  block: Assert no write requests under BDRV_O_INCOMING
  block: Fix error path in bdrv_invalidate_cache()
  block: Inactivate BDS when migration completes
  qemu-img: Prepare for locked images
  qcow2: Implement .bdrv_inactivate
  qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache()
  qcow2: Make image inaccessible after failed qcow2_invalidate_cache()
  qcow2: Add image locking

 block.c                    |  36 +++++++++
 block/io.c                 |   2 +
 block/qcow2.c              | 190 +++++++++++++++++++++++++++++++++++----------
 block/qcow2.h              |   7 +-
 docs/specs/qcow2.txt       |   7 +-
 include/block/block.h      |   2 +
 include/block/block_int.h  |   1 +
 include/qapi/error.h       |   1 +
 migration/migration.c      |   7 ++
 qapi/common.json           |   3 +-
 qemu-img-cmds.hx           |  10 ++-
 qemu-img.c                 |  96 +++++++++++++++++++----
 qemu-img.texi              |  20 ++++-
 qmp.c                      |  12 +++
 tests/qemu-iotests/026     |   2 +-
 tests/qemu-iotests/026.out |  60 ++++++++++++--
 tests/qemu-iotests/031.out |  23 +++---
 tests/qemu-iotests/036     |   2 +
 tests/qemu-iotests/036.out |   7 +-
 tests/qemu-iotests/039     |   4 +-
 tests/qemu-iotests/061     |   2 +
 tests/qemu-iotests/061.out |  43 +++++-----
 tests/qemu-iotests/071     |   7 ++
 tests/qemu-iotests/071.out |   4 +
 tests/qemu-iotests/089     |   2 +-
 tests/qemu-iotests/089.out |   2 -
 tests/qemu-iotests/091     |   2 +-
 tests/qemu-iotests/098     |   2 +-
 28 files changed, 445 insertions(+), 111 deletions(-)

-- 
1.8.3.1


* [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 20:20   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 02/10] qcow2: Write full header on image creation Kevin Wolf
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

Version 2 images don't have feature bits, so writing a feature table to
those images is kind of pointless.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c              | 48 ++++++++++++++++++++++++----------------------
 tests/qemu-iotests/031.out | 12 +-----------
 tests/qemu-iotests/061.out | 15 ---------------
 3 files changed, 26 insertions(+), 49 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 1789af4..5f22e18 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1895,31 +1895,33 @@ int qcow2_update_header(BlockDriverState *bs)
     }
 
     /* Feature table */
-    Qcow2Feature features[] = {
-        {
-            .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
-            .bit  = QCOW2_INCOMPAT_DIRTY_BITNR,
-            .name = "dirty bit",
-        },
-        {
-            .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
-            .bit  = QCOW2_INCOMPAT_CORRUPT_BITNR,
-            .name = "corrupt bit",
-        },
-        {
-            .type = QCOW2_FEAT_TYPE_COMPATIBLE,
-            .bit  = QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
-            .name = "lazy refcounts",
-        },
-    };
+    if (s->qcow_version >= 3) {
+        Qcow2Feature features[] = {
+            {
+                .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
+                .bit  = QCOW2_INCOMPAT_DIRTY_BITNR,
+                .name = "dirty bit",
+            },
+            {
+                .type = QCOW2_FEAT_TYPE_INCOMPATIBLE,
+                .bit  = QCOW2_INCOMPAT_CORRUPT_BITNR,
+                .name = "corrupt bit",
+            },
+            {
+                .type = QCOW2_FEAT_TYPE_COMPATIBLE,
+                .bit  = QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
+                .name = "lazy refcounts",
+            },
+        };
 
-    ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
-                         features, sizeof(features), buflen);
-    if (ret < 0) {
-        goto fail;
+        ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
+                             features, sizeof(features), buflen);
+        if (ret < 0) {
+            goto fail;
+        }
+        buf += ret;
+        buflen -= ret;
     }
-    buf += ret;
-    buflen -= ret;
 
     /* Keep unknown header extensions */
     QLIST_FOREACH(uext, &s->unknown_header_ext, next) {
diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index fce3ce0..f065404 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -53,11 +53,6 @@ refcount_order            4
 header_length             72
 
 Header extension:
-magic                     0x6803f857
-length                    144
-data                      <binary>
-
-Header extension:
 magic                     0x12345678
 length                    31
 data                      'This is a test header extension'
@@ -68,7 +63,7 @@ No errors were found on the image.
 
 magic                     0x514649fb
 version                   2
-backing_file_offset       0x128
+backing_file_offset       0x90
 backing_file_size         0x17
 cluster_bits              16
 size                      67108864
@@ -91,11 +86,6 @@ length                    11
 data                      'host_device'
 
 Header extension:
-magic                     0x6803f857
-length                    144
-data                      <binary>
-
-Header extension:
 magic                     0x12345678
 length                    31
 data                      'This is a test header extension'
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 57aae28..d604682 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -43,11 +43,6 @@ autoclear_features        0x0
 refcount_order            4
 header_length             72
 
-Header extension:
-magic                     0x6803f857
-length                    144
-data                      <binary>
-
 read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
@@ -105,11 +100,6 @@ autoclear_features        0x0
 refcount_order            4
 header_length             72
 
-Header extension:
-magic                     0x6803f857
-length                    144
-data                      <binary>
-
 read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
@@ -155,11 +145,6 @@ autoclear_features        0x0
 refcount_order            4
 header_length             72
 
-Header extension:
-magic                     0x6803f857
-length                    144
-data                      <binary>
-
 No errors were found on the image.
 
 === Testing version upgrade and resize ===
-- 
1.8.3.1


* [Qemu-devel] [PATCH 02/10] qcow2: Write full header on image creation
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 20:25   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 03/10] block: Assert no write requests under BDRV_O_INCOMING Kevin Wolf
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

When creating a qcow2 image, we didn't necessarily call
qcow2_update_header(), but could end up with the basic header that
qcow2_create2() created manually. One thing that this basic header
lacks is the feature table. Let's make sure that it's always present.

This requires a few updates to test cases as well.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c              |  7 +++++++
 tests/qemu-iotests/031.out |  5 +++++
 tests/qemu-iotests/036     |  2 ++
 tests/qemu-iotests/036.out |  5 +++++
 tests/qemu-iotests/061.out | 20 ++++++++++++++++++++
 5 files changed, 39 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 5f22e18..01f1fe3 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2239,6 +2239,13 @@ static int qcow2_create2(const char *filename, int64_t total_size,
         abort();
     }
 
+    /* Create a full header (including things like feature table) */
+    ret = qcow2_update_header(bs);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret, "Could not update qcow2 header");
+        goto out;
+    }
+
     /* Okay, now that we have a valid image, let's give it the right size */
     ret = bdrv_truncate(bs, total_size);
     if (ret < 0) {
diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index f065404..7f5050b 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -116,6 +116,11 @@ refcount_order            4
 header_length             104
 
 Header extension:
+magic                     0x6803f857
+length                    144
+data                      <binary>
+
+Header extension:
 magic                     0x12345678
 length                    31
 data                      'This is a test header extension'
diff --git a/tests/qemu-iotests/036 b/tests/qemu-iotests/036
index 392f1ef..c4cc91b 100755
--- a/tests/qemu-iotests/036
+++ b/tests/qemu-iotests/036
@@ -57,6 +57,7 @@ _make_test_img 64M
 $PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 63
 
 # Without feature table
+$PYTHON qcow2.py "$TEST_IMG" del-header-ext 0x6803f857
 $PYTHON qcow2.py "$TEST_IMG" dump-header
 _img_info
 
@@ -73,6 +74,7 @@ $PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 62
 $PYTHON qcow2.py "$TEST_IMG" set-feature-bit incompatible 63
 
 # Without feature table
+$PYTHON qcow2.py "$TEST_IMG" del-header-ext 0x6803f857
 _img_info
 
 # With feature table containing bit 63
diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index 5616e37..f443635 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -56,6 +56,11 @@ autoclear_features        0x8000000000000000
 refcount_order            4
 header_length             104
 
+Header extension:
+magic                     0x6803f857
+length                    144
+data                      <binary>
+
 
 === Repair image ===
 
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index d604682..a03732e 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -24,6 +24,11 @@ autoclear_features        0x0
 refcount_order            4
 header_length             104
 
+Header extension:
+magic                     0x6803f857
+length                    144
+data                      <binary>
+
 magic                     0x514649fb
 version                   2
 backing_file_offset       0x0
@@ -76,6 +81,11 @@ autoclear_features        0x0
 refcount_order            4
 header_length             104
 
+Header extension:
+magic                     0x6803f857
+length                    144
+data                      <binary>
+
 ERROR cluster 5 refcount=0 reference=1
 ERROR cluster 6 refcount=0 reference=1
 Rebuilding refcount structure
@@ -126,6 +136,11 @@ autoclear_features        0x40000000000
 refcount_order            4
 header_length             104
 
+Header extension:
+magic                     0x6803f857
+length                    144
+data                      <binary>
+
 magic                     0x514649fb
 version                   2
 backing_file_offset       0x0
@@ -228,6 +243,11 @@ autoclear_features        0x0
 refcount_order            4
 header_length             104
 
+Header extension:
+magic                     0x6803f857
+length                    144
+data                      <binary>
+
 ERROR cluster 5 refcount=0 reference=1
 ERROR cluster 6 refcount=0 reference=1
 Rebuilding refcount structure
-- 
1.8.3.1


* [Qemu-devel] [PATCH 03/10] block: Assert no write requests under BDRV_O_INCOMING
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images Kevin Wolf
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 02/10] qcow2: Write full header on image creation Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 20:27   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 04/10] block: Fix error path in bdrv_invalidate_cache() Kevin Wolf
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

As long as BDRV_O_INCOMING is set, the image file is only opened so we
have a file descriptor for it. We're definitely not supposed to modify
the image; it's still owned by the migration source.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/io.c b/block/io.c
index 841f5b5..e893be8 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1293,6 +1293,7 @@ static int coroutine_fn bdrv_co_do_pwritev(BlockDriverState *bs,
     if (bs->read_only) {
         return -EPERM;
     }
+    assert(!(bs->open_flags & BDRV_O_INCOMING));
 
     ret = bdrv_check_byte_request(bs, offset, bytes);
     if (ret < 0) {
@@ -2453,6 +2454,7 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
     } else if (bs->read_only) {
         return -EPERM;
     }
+    assert(!(bs->open_flags & BDRV_O_INCOMING));
 
     /* Do nothing if disabled.  */
     if (!(bs->open_flags & BDRV_O_UNMAP)) {
-- 
1.8.3.1


* [Qemu-devel] [PATCH 04/10] block: Fix error path in bdrv_invalidate_cache()
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (2 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 03/10] block: Assert no write requests under BDRV_O_INCOMING Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 20:31   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 05/10] block: Inactivate BDS when migration completes Kevin Wolf
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

We can only clear BDRV_O_INCOMING if the caches were actually
invalidated.
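
As a rough illustration of the rule above, here is a minimal standalone
sketch. It is not the QEMU code: Node, FLAG_INCOMING and refresh_fails are
hypothetical stand-ins for BlockDriverState, BDRV_O_INCOMING and a failing
refresh_total_sectors(). The flag is cleared optimistically and set again on
every error path, so a failed invalidation leaves the node inactive.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for BDRV_O_INCOMING. */
#define FLAG_INCOMING 0x1u

/* Hypothetical stand-in for a BlockDriverState. */
typedef struct Node {
    unsigned flags;
    bool refresh_fails;   /* simulates a failing refresh step */
} Node;

/* Returns 0 on success, a negative value on failure. */
static int invalidate(Node *n)
{
    if (!(n->flags & FLAG_INCOMING)) {
        return 0;                      /* already active, nothing to do */
    }

    n->flags &= ~FLAG_INCOMING;        /* optimistically take ownership */

    if (n->refresh_fails) {
        n->flags |= FLAG_INCOMING;     /* roll back: caches are still stale */
        return -1;
    }

    return 0;
}
```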

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block.c b/block.c
index 411edbf..554ed64 100644
--- a/block.c
+++ b/block.c
@@ -3273,12 +3273,14 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
         bdrv_invalidate_cache(bs->file->bs, &local_err);
     }
     if (local_err) {
+        bs->open_flags |= BDRV_O_INCOMING;
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
+        bs->open_flags |= BDRV_O_INCOMING;
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
-- 
1.8.3.1


* [Qemu-devel] [PATCH 05/10] block: Inactivate BDS when migration completes
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (3 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 04/10] block: Fix error path in bdrv_invalidate_cache() Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 20:43   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images Kevin Wolf
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

So far, live migration with shared storage meant that the image was in
a not-really-ready don't-touch-me state on the destination while the
source was still actively using it, but after completing the migration,
the image was fully opened on both sides. This is bad.

This patch adds a block driver callback to inactivate images on the
source before completing the migration. Inactivation puts the image
into the same state as if it had just been live migrated to the qemu
instance on the source (i.e. BDRV_O_INCOMING is set). You're then
supposed to continue either on the source or on the destination, which
takes ownership of the image.

A typical migration looks like this now with respect to disk images:

1. Destination qemu is started, the image is opened with
   BDRV_O_INCOMING. The image is fully opened on the source.

2. Migration is about to complete. The source flushes the image and
   inactivates it. Now both sides have the image opened with
   BDRV_O_INCOMING and are expecting the other side to still modify it.

3. One side (the destination on success) continues and calls
   bdrv_invalidate_cache_all() in order to take ownership of the image
   again. This removes BDRV_O_INCOMING on the resuming side; the flag
   remains set on the other side.

This ensures that the same image isn't written to by both instances
(unless both are resumed, but then you get what you deserve). This is
important because .bdrv_close for non-BDRV_O_INCOMING images could write
to the image file, which is definitely forbidden while another host is
using the image.
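
The three steps above can be sketched as a toy model. This is only an
illustration under stated assumptions: Instance, inactivate(),
invalidate_cache() and may_write() are hypothetical names, with the
"incoming" field standing in for BDRV_O_INCOMING. The point is the invariant
that at most one side may write at any time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one qemu instance; not QEMU's real data structure. */
typedef struct Instance {
    bool incoming;   /* true: this instance does not own the image */
} Instance;

/* Step 2: the source gives up ownership before migration completes. */
static void inactivate(Instance *inst)
{
    inst->incoming = true;
}

/* Step 3: whichever side resumes takes ownership back. */
static void invalidate_cache(Instance *inst)
{
    inst->incoming = false;
}

/* An instance may only write while it owns the image. */
static bool may_write(const Instance *inst)
{
    return !inst->incoming;
}

/* The invariant the handoff is meant to maintain. */
static bool at_most_one_writer(const Instance *src, const Instance *dst)
{
    return !(may_write(src) && may_write(dst));
}
```

Starting with the source active and the destination incoming, calling
inactivate() on the source and then invalidate_cache() on exactly one side
keeps at_most_one_writer() true throughout; resuming both sides breaks it,
which is the "you get what you deserve" case.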

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c                   | 34 ++++++++++++++++++++++++++++++++++
 include/block/block.h     |  1 +
 include/block/block_int.h |  1 +
 migration/migration.c     |  7 +++++++
 qmp.c                     | 12 ++++++++++++
 5 files changed, 55 insertions(+)

diff --git a/block.c b/block.c
index 554ed64..8b4288e 100644
--- a/block.c
+++ b/block.c
@@ -3304,6 +3304,40 @@ void bdrv_invalidate_cache_all(Error **errp)
     }
 }
 
+static int bdrv_inactivate(BlockDriverState *bs)
+{
+    int ret;
+
+    if (bs->drv->bdrv_inactivate) {
+        ret = bs->drv->bdrv_inactivate(bs);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    bs->open_flags |= BDRV_O_INCOMING;
+    return 0;
+}
+
+int bdrv_inactivate_all(void)
+{
+    BlockDriverState *bs;
+    int ret;
+
+    QTAILQ_FOREACH(bs, &bdrv_states, device_list) {
+        AioContext *aio_context = bdrv_get_aio_context(bs);
+
+        aio_context_acquire(aio_context);
+        ret = bdrv_inactivate(bs);
+        aio_context_release(aio_context);
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 /**************************************************************/
 /* removable device support */
 
diff --git a/include/block/block.h b/include/block/block.h
index db8e096..0d00ac1 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -368,6 +368,7 @@ BlockAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
 /* Invalidate any cached metadata used by image formats */
 void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp);
 void bdrv_invalidate_cache_all(Error **errp);
+int bdrv_inactivate_all(void);
 
 /* Ensure contents are flushed to disk.  */
 int bdrv_flush(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 9a1c466..a35ddc9 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -172,6 +172,7 @@ struct BlockDriver {
      * Invalidate any cached meta-data.
      */
     void (*bdrv_invalidate_cache)(BlockDriverState *bs, Error **errp);
+    int (*bdrv_inactivate)(BlockDriverState *bs);
 
     /*
      * Flushes all data that was already written to the OS all the way down to
diff --git a/migration/migration.c b/migration/migration.c
index c842499..309aa98 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1416,7 +1416,11 @@ static int postcopy_start(MigrationState *ms, bool *old_vm_running)
     *old_vm_running = runstate_is_running();
     global_state_store();
     ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
+    if (ret < 0) {
+        goto fail;
+    }
 
+    ret = bdrv_inactivate_all();
     if (ret < 0) {
         goto fail;
     }
@@ -1536,6 +1540,9 @@ static void migration_completion(MigrationState *s, int current_active_state,
         if (!ret) {
             ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
             if (ret >= 0) {
+                ret = bdrv_inactivate_all();
+            }
+            if (ret >= 0) {
                 qemu_file_set_rate_limit(s->file, INT64_MAX);
                 qemu_savevm_state_complete_precopy(s->file, false);
             }
diff --git a/qmp.c b/qmp.c
index 0a1fa19..39b8da1 100644
--- a/qmp.c
+++ b/qmp.c
@@ -192,6 +192,18 @@ void qmp_cont(Error **errp)
         }
     }
 
+    /* Continuing after completed migration. Images have been inactivated to
+     * allow the destination to take control. Need to get control back now. */
+    if (runstate_check(RUN_STATE_FINISH_MIGRATE) ||
+        runstate_check(RUN_STATE_POSTMIGRATE))
+    {
+        bdrv_invalidate_cache_all(&local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+
     if (runstate_check(RUN_STATE_INMIGRATE)) {
         autostart = 1;
     } else {
-- 
1.8.3.1


* [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (4 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 05/10] block: Inactivate BDS when migration completes Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 16:57   ` Daniel P. Berrange
                     ` (2 more replies)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate Kevin Wolf
                   ` (7 subsequent siblings)
  13 siblings, 3 replies; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

This patch extends qemu-img for working with locked images. It prints a
helpful error message when trying to access a locked image read-write,
and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
-r all --force' option in order to override a lock left behind after a
qemu crash.
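
The open-then-retry behaviour described here can be sketched in isolation.
The following is a simplified model, not the qemu-img code: Image,
OpenResult, try_open() and the other helpers are hypothetical, and the lock
is reduced to a single boolean. It assumes that a successful read-write open
takes the lock and that closing releases it, which is why force-opening and
closing is enough to unlock.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OPEN_OK, OPEN_ERR_LOCKED, OPEN_ERR_OTHER } OpenResult;

/* Hypothetical image: only tracks existence and the lock state. */
typedef struct Image {
    bool exists;
    bool locked;
} Image;

/* One open attempt; override_lock models the "override-lock" option. */
static OpenResult try_open(Image *img, bool override_lock)
{
    if (!img->exists) {
        return OPEN_ERR_OTHER;
    }
    if (img->locked && !override_lock) {
        return OPEN_ERR_LOCKED;
    }
    img->locked = true;            /* the opener now holds the lock */
    return OPEN_OK;
}

/* Open normally first; retry with the override only if forced. */
static OpenResult open_image(Image *img, bool force)
{
    OpenResult r = try_open(img, false);
    if (r == OPEN_ERR_LOCKED && force) {
        r = try_open(img, true);
    }
    return r;
}

static void close_image(Image *img)
{
    img->locked = false;           /* a clean close releases the lock */
}

/* Force-open and close again, which leaves the image unlocked. */
static OpenResult force_unlock(Image *img)
{
    OpenResult r = open_image(img, true);
    if (r == OPEN_OK) {
        close_image(img);
    }
    return r;
}
```

Unlike this sketch, which reports a failed open to the caller, the patch
below returns 0 from img_force_unlock() regardless of whether the open
succeeded.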

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 include/block/block.h |  1 +
 include/qapi/error.h  |  1 +
 qapi/common.json      |  3 +-
 qemu-img-cmds.hx      | 10 ++++--
 qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
 qemu-img.texi         | 20 ++++++++++-
 6 files changed, 113 insertions(+), 18 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 0d00ac1..1ae655c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -101,6 +101,7 @@ typedef struct HDGeometry {
 #define BDRV_OPT_CACHE_DIRECT   "cache.direct"
 #define BDRV_OPT_CACHE_NO_FLUSH "cache.no-flush"
 
+#define BDRV_OPT_OVERRIDE_LOCK  "override-lock"
 
 #define BDRV_SECTOR_BITS   9
 #define BDRV_SECTOR_SIZE   (1ULL << BDRV_SECTOR_BITS)
diff --git a/include/qapi/error.h b/include/qapi/error.h
index 6285cf5..53591bc 100644
--- a/include/qapi/error.h
+++ b/include/qapi/error.h
@@ -102,6 +102,7 @@ typedef enum ErrorClass {
     ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
     ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
     ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
+    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
 } ErrorClass;
 
 /*
diff --git a/qapi/common.json b/qapi/common.json
index 9353a7b..1bf6e46 100644
--- a/qapi/common.json
+++ b/qapi/common.json
@@ -27,7 +27,8 @@
 { 'enum': 'QapiErrorClass',
   # Keep this in sync with ErrorClass in error.h
   'data': [ 'GenericError', 'CommandNotFound', 'DeviceEncrypted',
-            'DeviceNotActive', 'DeviceNotFound', 'KVMMissingCap' ] }
+            'DeviceNotActive', 'DeviceNotFound', 'KVMMissingCap',
+            'ImageFileLocked' ] }
 
 ##
 # @VersionTriple
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index 9567774..dd4aebc 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -10,9 +10,9 @@ STEXI
 ETEXI
 
 DEF("check", img_check,
-    "check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
+    "check [-q] [-f fmt] [--force] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
 STEXI
-@item check [-q] [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
+@item check [-q] [-f @var{fmt}] [--force] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
 ETEXI
 
 DEF("create", img_create,
@@ -39,6 +39,12 @@ STEXI
 @item convert [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-T @var{src_cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_id_or_name}] [-l @var{snapshot_param}] [-S @var{sparse_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 ETEXI
 
+DEF("force-unlock", img_force_unlock,
+    "force-unlock [-f fmt] filename")
+STEXI
+@item force-unlock [-f @var{fmt}] @var{filename}
+ETEXI
+
 DEF("info", img_info,
     "info [-f fmt] [--output=ofmt] [--backing-chain] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 3d48b4f..4fe15cd 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -47,6 +47,7 @@ typedef struct img_cmd_t {
 enum {
     OPTION_OUTPUT = 256,
     OPTION_BACKING_CHAIN = 257,
+    OPTION_FORCE = 258,
 };
 
 typedef enum OutputFormat {
@@ -198,7 +199,7 @@ static int print_block_option_help(const char *filename, const char *fmt)
 
 static BlockBackend *img_open(const char *id, const char *filename,
                               const char *fmt, int flags,
-                              bool require_io, bool quiet)
+                              bool require_io, bool quiet, bool force)
 {
     BlockBackend *blk;
     BlockDriverState *bs;
@@ -206,12 +207,34 @@ static BlockBackend *img_open(const char *id, const char *filename,
     Error *local_err = NULL;
     QDict *options = NULL;
 
+    options = qdict_new();
     if (fmt) {
-        options = qdict_new();
         qdict_put(options, "driver", qstring_from_str(fmt));
     }
+    QINCREF(options);
 
     blk = blk_new_open(id, filename, NULL, options, flags, &local_err);
+    if (!blk && error_get_class(local_err) == ERROR_CLASS_IMAGE_FILE_LOCKED) {
+        if (force) {
+            qdict_put(options, BDRV_OPT_OVERRIDE_LOCK, qstring_from_str("on"));
+            blk = blk_new_open(id, filename, NULL, options, flags, NULL);
+            if (blk) {
+                error_free(local_err);
+            }
+        } else {
+            error_report("The image file '%s' is locked and cannot be "
+                         "opened for write access as this may cause image "
+                         "corruption.", filename);
+            error_report("If it is locked in error (e.g. because "
+                         "of an unclean shutdown) and you are sure that no "
+                         "other processes are working on the image file, you "
+                         "can use 'qemu-img force-unlock' or the --force flag "
+                         "for 'qemu-img check' in order to override this "
+                         "check.");
+            error_free(local_err);
+            goto fail;
+        }
+    }
     if (!blk) {
         error_report("Could not open '%s': %s", filename,
                      error_get_pretty(local_err));
@@ -234,6 +257,7 @@ static BlockBackend *img_open(const char *id, const char *filename,
     return blk;
 fail:
     blk_unref(blk);
+    QDECREF(options);
     return NULL;
 }
 
@@ -492,6 +516,7 @@ static int img_check(int argc, char **argv)
     int flags = BDRV_O_FLAGS | BDRV_O_CHECK;
     ImageCheck *check;
     bool quiet = false;
+    bool force = false;
 
     fmt = NULL;
     output = NULL;
@@ -500,6 +525,7 @@ static int img_check(int argc, char **argv)
         int option_index = 0;
         static const struct option long_options[] = {
             {"help", no_argument, 0, 'h'},
+            {"force", no_argument, 0, OPTION_FORCE},
             {"format", required_argument, 0, 'f'},
             {"repair", required_argument, 0, 'r'},
             {"output", required_argument, 0, OPTION_OUTPUT},
@@ -515,6 +541,9 @@ static int img_check(int argc, char **argv)
         case 'h':
             help();
             break;
+        case OPTION_FORCE:
+            force = true;
+            break;
         case 'f':
             fmt = optarg;
             break;
@@ -561,7 +590,7 @@ static int img_check(int argc, char **argv)
         return 1;
     }
 
-    blk = img_open("image", filename, fmt, flags, true, quiet);
+    blk = img_open("image", filename, fmt, flags, true, quiet, force);
     if (!blk) {
         return 1;
     }
@@ -633,6 +662,44 @@ fail:
     return ret;
 }
 
+static int img_force_unlock(int argc, char **argv)
+{
+    BlockBackend *blk;
+    const char *format = NULL;
+    const char *filename;
+    int c;
+
+    for (;;) {
+        c = getopt(argc, argv, "hf:");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case '?':
+        case 'h':
+            help();
+            break;
+        case 'f':
+            format = optarg;
+            break;
+        }
+    }
+
+    if (optind != argc - 1) {
+        error_exit("Expecting one image file name");
+    }
+    filename = argv[optind];
+
+    /* Just force-opening and closing the image is enough to unlock it */
+    blk = img_open("image", filename, format, BDRV_O_FLAGS | BDRV_O_RDWR,
+                   false, false, true);
+    if (blk) {
+        blk_unref(blk);
+    }
+
+    return 0;
+}
+
 typedef struct CommonBlockJobCBInfo {
     BlockDriverState *bs;
     Error **errp;
@@ -727,7 +794,7 @@ static int img_commit(int argc, char **argv)
         return 1;
     }
 
-    blk = img_open("image", filename, fmt, flags, true, quiet);
+    blk = img_open("image", filename, fmt, flags, true, quiet, false);
     if (!blk) {
         return 1;
     }
@@ -1032,14 +1099,14 @@ static int img_compare(int argc, char **argv)
         goto out3;
     }
 
-    blk1 = img_open("image_1", filename1, fmt1, flags, true, quiet);
+    blk1 = img_open("image_1", filename1, fmt1, flags, true, quiet, false);
     if (!blk1) {
         ret = 2;
         goto out3;
     }
     bs1 = blk_bs(blk1);
 
-    blk2 = img_open("image_2", filename2, fmt2, flags, true, quiet);
+    blk2 = img_open("image_2", filename2, fmt2, flags, true, quiet, false);
     if (!blk2) {
         ret = 2;
         goto out2;
@@ -1679,7 +1746,7 @@ static int img_convert(int argc, char **argv)
         char *id = bs_n > 1 ? g_strdup_printf("source_%d", bs_i)
                             : g_strdup("source");
         blk[bs_i] = img_open(id, argv[optind + bs_i], fmt, src_flags,
-                             true, quiet);
+                             true, quiet, false);
         g_free(id);
         if (!blk[bs_i]) {
             ret = -1;
@@ -1823,7 +1890,8 @@ static int img_convert(int argc, char **argv)
         goto out;
     }
 
-    out_blk = img_open("target", out_filename, out_fmt, flags, true, quiet);
+    out_blk = img_open("target", out_filename, out_fmt, flags, true, quiet,
+                       false);
     if (!out_blk) {
         ret = -1;
         goto out;
@@ -2015,7 +2083,7 @@ static ImageInfoList *collect_image_info_list(const char *filename,
         g_hash_table_insert(filenames, (gpointer)filename, NULL);
 
         blk = img_open("image", filename, fmt,
-                       BDRV_O_FLAGS | BDRV_O_NO_BACKING, false, false);
+                       BDRV_O_FLAGS | BDRV_O_NO_BACKING, false, false, false);
         if (!blk) {
             goto err;
         }
@@ -2279,7 +2347,7 @@ static int img_map(int argc, char **argv)
         return 1;
     }
 
-    blk = img_open("image", filename, fmt, BDRV_O_FLAGS, true, false);
+    blk = img_open("image", filename, fmt, BDRV_O_FLAGS, true, false, false);
     if (!blk) {
         return 1;
     }
@@ -2401,7 +2469,7 @@ static int img_snapshot(int argc, char **argv)
     filename = argv[optind++];
 
     /* Open the image */
-    blk = img_open("image", filename, NULL, bdrv_oflags, true, quiet);
+    blk = img_open("image", filename, NULL, bdrv_oflags, true, quiet, false);
     if (!blk) {
         return 1;
     }
@@ -2545,7 +2613,7 @@ static int img_rebase(int argc, char **argv)
      * Ignore the old backing file for unsafe rebase in case we want to correct
      * the reference to a renamed or moved backing file.
      */
-    blk = img_open("image", filename, fmt, flags, true, quiet);
+    blk = img_open("image", filename, fmt, flags, true, quiet, false);
     if (!blk) {
         ret = -1;
         goto out;
@@ -2858,7 +2926,7 @@ static int img_resize(int argc, char **argv)
     qemu_opts_del(param);
 
     blk = img_open("image", filename, fmt, BDRV_O_FLAGS | BDRV_O_RDWR,
-                   true, quiet);
+                   true, quiet, false);
     if (!blk) {
         ret = -1;
         goto out;
@@ -2989,7 +3057,7 @@ static int img_amend(int argc, char **argv)
         goto out;
     }
 
-    blk = img_open("image", filename, fmt, flags, true, quiet);
+    blk = img_open("image", filename, fmt, flags, true, quiet, false);
     if (!blk) {
         ret = -1;
         goto out;
diff --git a/qemu-img.texi b/qemu-img.texi
index 55c6be3..70efac2 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -117,7 +117,7 @@ Skip the creation of the target volume
 Command description:
 
 @table @option
-@item check [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
+@item check [-q] [-f @var{fmt}] [--force] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
 
 Perform a consistency check on the disk image @var{filename}. The command can
 output in the format @var{ofmt} which is either @code{human} or @code{json}.
@@ -153,6 +153,12 @@ If @code{-r} is specified, exit codes representing the image state refer to the
 state after (the attempt at) repairing it. That is, a successful @code{-r all}
 will yield the exit code 0, independently of the image state before.
 
+The @code{-r} option requires read-write access to the image, which is
+prohibited if a qcow2 file is still marked as in use. If you know for sure that
+the information is outdated and the image is in fact not in use by another
+process any more (e.g. because the QEMU process crashed), specifying
+@code{--force} overrides this check.
+
 @item create [-f @var{fmt}] [-o @var{options}] @var{filename} [@var{size}]
 
 Create the new disk image @var{filename} of size @var{size} and format
@@ -261,6 +267,18 @@ skipped. This is useful for formats such as @code{rbd} if the target
 volume has already been created with site specific options that cannot
 be supplied through qemu-img.
 
+@item force-unlock [-f @var{fmt}] @var{filename}
+
+Read-write disk images can generally be safely opened only from a single
+process at the same time. In order to protect against corruption from
+neglecting to follow this rule, qcow2 images are automatically flagged as
+in use when they are opened and the flag is removed again on a clean
+shutdown.
+
+However, in cases of an unclean shutdown, the image might still be marked as in
+use so that any further read-write access is prohibited. You can use the
+@code{force-unlock} command to manually remove the in-use flag then.
+
 @item info [-f @var{fmt}] [--output=@var{ofmt}] [--backing-chain] @var{filename}
 
 Give information about the disk image @var{filename}. Use it in
-- 
1.8.3.1

* [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (5 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 21:17   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 08/10] qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache() Kevin Wolf
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

The callback has to ensure that closing or flushing the image afterwards
wouldn't cause a write access to the image files. This means that just
the caches have to be written out, which is part of the existing
.bdrv_close implementation.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 01f1fe3..2cba276 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1685,6 +1685,32 @@ fail:
     return ret;
 }
 
+static int qcow2_inactivate(BlockDriverState *bs)
+{
+    BDRVQcow2State *s = bs->opaque;
+    int ret, result = 0;
+
+    ret = qcow2_cache_flush(bs, s->l2_table_cache);
+    if (ret) {
+        result = ret;
+        error_report("Failed to flush the L2 table cache: %s",
+                     strerror(-ret));
+    }
+
+    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
+    if (ret) {
+        result = ret;
+        error_report("Failed to flush the refcount block cache: %s",
+                     strerror(-ret));
+    }
+
+    if (result == 0) {
+        qcow2_mark_clean(bs);
+    }
+
+    return result;
+}
+
 static void qcow2_close(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
@@ -1693,23 +1719,7 @@ static void qcow2_close(BlockDriverState *bs)
     s->l1_table = NULL;
 
     if (!(bs->open_flags & BDRV_O_INCOMING)) {
-        int ret1, ret2;
-
-        ret1 = qcow2_cache_flush(bs, s->l2_table_cache);
-        ret2 = qcow2_cache_flush(bs, s->refcount_block_cache);
-
-        if (ret1) {
-            error_report("Failed to flush the L2 table cache: %s",
-                         strerror(-ret1));
-        }
-        if (ret2) {
-            error_report("Failed to flush the refcount block cache: %s",
-                         strerror(-ret2));
-        }
-
-        if (!ret1 && !ret2) {
-            qcow2_mark_clean(bs);
-        }
+        qcow2_inactivate(bs);
     }
 
     cache_clean_timer_del(bs);
@@ -3340,6 +3350,7 @@ BlockDriver bdrv_qcow2 = {
 
     .bdrv_refresh_limits        = qcow2_refresh_limits,
     .bdrv_invalidate_cache      = qcow2_invalidate_cache,
+    .bdrv_inactivate            = qcow2_inactivate,
 
     .create_opts         = &qcow2_create_opts,
     .bdrv_check          = qcow2_check,
-- 
1.8.3.1

* [Qemu-devel] [PATCH 08/10] qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache()
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (6 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 21:22   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 09/10] qcow2: Make image inaccessible after failed qcow2_invalidate_cache() Kevin Wolf
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

What qcow2_invalidate_cache() should do is close the image with
BDRV_O_INCOMING set and reopen it with the flag cleared. In fact, it
used to do exactly the opposite: qcow2_close() relied on bs->open_flags,
which is already updated to have cleared BDRV_O_INCOMING at this point,
whereas qcow2_open() was called with s->flags, which has the flag still
set. Fix this.
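
As a rough illustration of the intended ordering (this is not code from
the patch; the struct, the sketch_ names and the flag value are made up
for the example), close has to look at the flags the image was opened
with, and only the following open gets the flag cleared:

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit for illustration only; QEMU defines the real
 * BDRV_O_INCOMING. */
#define SKETCH_O_INCOMING 0x0800

/* Hypothetical stand-in for the driver state. */
struct sketch_state {
    int flags;          /* flags the image was opened with */
    bool inactivated;   /* set if close wrote out its caches */
    bool open;
};

static void sketch_close(struct sketch_state *s)
{
    assert(s->open);
    /* Decide based on the flags of the current open, not on flags the
     * caller has already updated for the next one. */
    if (!(s->flags & SKETCH_O_INCOMING)) {
        s->inactivated = true;
    }
    s->open = false;
}

static void sketch_open(struct sketch_state *s, int flags)
{
    s->flags = flags;
    s->inactivated = false;
    s->open = true;
}

/* Close with the old flags, then reopen with INCOMING cleared.
 * Returns true if the close path tried to write to the image. */
static bool sketch_invalidate_cache(struct sketch_state *s)
{
    int flags = s->flags;
    bool wrote;

    sketch_close(s);
    wrote = s->inactivated;

    flags &= ~SKETCH_O_INCOMING;
    sketch_open(s, flags);
    return wrote;
}
```

With an image that was opened with the incoming flag set, the close
step writes nothing and the reopened state no longer carries the flag.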

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 2cba276..de50b80 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1718,7 +1718,7 @@ static void qcow2_close(BlockDriverState *bs)
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
 
-    if (!(bs->open_flags & BDRV_O_INCOMING)) {
+    if (!(s->flags & BDRV_O_INCOMING)) {
         qcow2_inactivate(bs);
     }
 
@@ -1769,6 +1769,7 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
     memset(s, 0, sizeof(BDRVQcow2State));
     options = qdict_clone_shallow(bs->options);
 
+    flags &= ~BDRV_O_INCOMING;
     ret = qcow2_open(bs, options, flags, &local_err);
     QDECREF(options);
     if (local_err) {
-- 
1.8.3.1

* [Qemu-devel] [PATCH 09/10] qcow2: Make image inaccessible after failed qcow2_invalidate_cache()
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (7 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 08/10] qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache() Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 21:24   ` Eric Blake
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 10/10] qcow2: Add image locking Kevin Wolf
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

If qcow2_invalidate_cache() fails, we are in a state where qcow2_close()
has already been completed, but the image hasn't been reopened yet.
Calling into any qcow2 function for an image in this state will cause
crashes.

The real solution would be to get rid of the close/open pair and instead
do an atomic reset of the involved data structures, but this isn't
trivial, so let's just make the image inaccessible for now.
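
The effect of clearing bs->drv can be sketched in isolation. This is
only a toy model under the assumption that every entry point checks for
a driver before dispatching; the sketch_ types and the -EIO error code
are invented for the example and are not the block layer's real ones:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical driver table with a single callback. */
struct sketch_driver {
    int (*read)(void *opaque);
};

/* Hypothetical node: a NULL drv means "no usable driver". */
struct sketch_bs {
    const struct sketch_driver *drv;
    void *opaque;
};

static int sketch_ok_read(void *opaque)
{
    (void)opaque;
    return 0;
}

static const struct sketch_driver sketch_drv = {
    .read = sketch_ok_read,
};

/* Entry point: refuse to call into the driver once drv is gone. */
static int sketch_read(struct sketch_bs *bs)
{
    assert(bs != NULL);
    if (!bs->drv) {
        return -EIO;
    }
    return bs->drv->read(bs->opaque);
}

/* After a failed reopen the driver state is half torn down, so drop
 * the driver pointer instead of leaving it callable. */
static int sketch_reopen(struct sketch_bs *bs, int reopen_result)
{
    if (reopen_result < 0) {
        bs->drv = NULL;
        return reopen_result;
    }
    return 0;
}
```

Once the reopen has failed, later requests get an error back instead of
reaching driver code that would operate on freed state.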

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index de50b80..544c124 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1763,6 +1763,7 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
     bdrv_invalidate_cache(bs->file->bs, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
+        bs->drv = NULL;
         return;
     }
 
@@ -1776,9 +1777,11 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
         error_setg(errp, "Could not reopen qcow2 layer: %s",
                    error_get_pretty(local_err));
         error_free(local_err);
+        bs->drv = NULL;
         return;
     } else if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not reopen qcow2 layer");
+        bs->drv = NULL;
         return;
     }
 
-- 
1.8.3.1

* [Qemu-devel] [PATCH 10/10] qcow2: Add image locking
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (8 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 09/10] qcow2: Make image inaccessible after failed qcow2_invalidate_cache() Kevin Wolf
@ 2015-12-22 16:46 ` Kevin Wolf
  2015-12-22 22:04   ` Eric Blake
  2015-12-23  3:14 ` [Qemu-devel] [PATCH 00/10] qcow2: Implement " Fam Zheng
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 16:46 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, qemu-devel, mreitz

People keep destroying their qcow2 images by using
'qemu-img snapshot' on images that were in use by a VM. This patch adds
some image locking that protects them against this mistake.

In order to achieve this, a new compatible header flag is introduced
that indicates that the image is currently in use. It is (almost) always set
when qcow2 considers the image to be active and in a read-write mode.
During live migration, the source considers the image active until the
VM stops on migration completion. The destination considers it active as
soon as it starts running the VM.
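
For tools that only want to look at the flag, a minimal sketch of
testing the bit in a raw header buffer might look like the following.
It assumes the version 3 header layout from docs/specs/qcow2.txt
(big-endian fields, compatible_features at bytes 80 - 87) and bit 1 as
proposed by this patch; the sketch_ names are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QCOW_MAGIC             0x514649fbu  /* "QFI\xfb" */
#define SKETCH_COMPAT_FEATURES_OFFSET 80
#define SKETCH_COMPAT_FEATURES_END    88
#define SKETCH_IN_USE_BITNR           1            /* as proposed here */

/* Read a big-endian 32 bit value from a byte buffer. */
static uint32_t sketch_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64 bit value from a byte buffer. */
static uint64_t sketch_be64(const uint8_t *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Returns true only for a v3 (or later) header with the in-use bit
 * set. v2 images have no feature bits, so they never count as locked. */
static bool sketch_header_in_use(const uint8_t *hdr, size_t len)
{
    uint64_t compat;

    assert(hdr != NULL);
    if (len < SKETCH_COMPAT_FEATURES_END) {
        return false;
    }
    if (sketch_be32(hdr) != SKETCH_QCOW_MAGIC || sketch_be32(hdr + 4) < 3) {
        return false;
    }
    compat = sketch_be64(hdr + SKETCH_COMPAT_FEATURES_OFFSET);
    return (compat & (UINT64_C(1) << SKETCH_IN_USE_BITNR)) != 0;
}
```

Since this is a compatible feature bit, implementations that don't know
it can still open the image; they just don't take part in the locking.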

In cases where qemu wasn't shut down cleanly, the flag is left set and
subsequent read-write opens are refused even though nobody is using the
image. An option override-lock=on is provided to force opening the image
(this is the option that qemu-img uses for 'force-unlock' and
'check --force').
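
The open-time rules can be summarised as two small predicates. This is
a simplified restatement of the checks in qcow2_open() below, not a
replacement for them; the function and parameter names are made up for
the example:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Should a read-write open be refused because the image looks locked?
 * Returns 0 if the open may proceed, -EBUSY otherwise. */
static int sketch_check_lock(bool read_only, bool in_use,
                             bool incoming, bool override_lock)
{
    if (read_only || !in_use) {
        return 0;       /* read-only opens never conflict */
    }
    if (incoming || override_lock) {
        return 0;       /* shared storage during migration, or forced */
    }
    return -EBUSY;
}

/* Should this open set the in-use flag in the header itself? */
static bool sketch_should_set_lock(bool read_only, bool incoming,
                                   int qcow_version)
{
    assert(qcow_version == 2 || qcow_version == 3);
    /* Feature bits only exist in version 3 headers. */
    return !read_only && !incoming && qcow_version >= 3;
}
```

Note that an open with override-lock=on passes the check and then sets
the flag again like any other read-write open, so it is the clean close
afterwards that actually removes the stale lock.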

A few test cases have to be adjusted, either to update error messages,
to use read-only mode to avoid the check, or to override the lock where
necessary.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2.c              | 84 ++++++++++++++++++++++++++++++++++++++++++++++
 block/qcow2.h              |  7 +++-
 docs/specs/qcow2.txt       |  7 +++-
 tests/qemu-iotests/026     |  2 +-
 tests/qemu-iotests/026.out | 60 ++++++++++++++++++++++++++++-----
 tests/qemu-iotests/031.out |  8 ++---
 tests/qemu-iotests/036.out |  4 +--
 tests/qemu-iotests/039     |  4 +--
 tests/qemu-iotests/061     |  2 ++
 tests/qemu-iotests/061.out | 16 ++++-----
 tests/qemu-iotests/071     |  7 ++++
 tests/qemu-iotests/071.out |  4 +++
 tests/qemu-iotests/089     |  2 +-
 tests/qemu-iotests/089.out |  2 --
 tests/qemu-iotests/091     |  2 +-
 tests/qemu-iotests/098     |  2 +-
 16 files changed, 181 insertions(+), 32 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 544c124..c07a078 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -307,6 +307,8 @@ int qcow2_mark_corrupt(BlockDriverState *bs)
     BDRVQcow2State *s = bs->opaque;
 
     s->incompatible_features |= QCOW2_INCOMPAT_CORRUPT;
+    s->compatible_features &= ~QCOW2_COMPAT_IN_USE;
+
     return qcow2_update_header(bs);
 }
 
@@ -472,6 +474,11 @@ static QemuOptsList qcow2_runtime_opts = {
             .type = QEMU_OPT_NUMBER,
             .help = "Clean unused cache entries after this time (in seconds)",
         },
+        {
+            .name = BDRV_OPT_OVERRIDE_LOCK,
+            .type = QEMU_OPT_BOOL,
+            .help = "Open the image read-write even if it is locked",
+        },
         { /* end of list */ }
     },
 };
@@ -593,6 +600,7 @@ typedef struct Qcow2ReopenState {
     Qcow2Cache *l2_table_cache;
     Qcow2Cache *refcount_block_cache;
     bool use_lazy_refcounts;
+    bool override_lock;
     int overlap_check;
     bool discard_passthrough[QCOW2_DISCARD_MAX];
     uint64_t cache_clean_interval;
@@ -754,6 +762,9 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
     r->discard_passthrough[QCOW2_DISCARD_OTHER] =
         qemu_opt_get_bool(opts, QCOW2_OPT_DISCARD_OTHER, false);
 
+    r->override_lock = qemu_opt_get_bool(opts, BDRV_OPT_OVERRIDE_LOCK,
+                                         s->override_lock);
+
     ret = 0;
 fail:
     qemu_opts_del(opts);
@@ -788,6 +799,8 @@ static void qcow2_update_options_commit(BlockDriverState *bs,
         s->cache_clean_interval = r->cache_clean_interval;
         cache_clean_timer_init(bs, bdrv_get_aio_context(bs));
     }
+
+    s->override_lock = r->override_lock;
 }
 
 static void qcow2_update_options_abort(BlockDriverState *bs,
@@ -1080,6 +1093,22 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail;
     }
 
+    /* Protect against opening the image r/w twice at the same time */
+    if (!bs->read_only && (s->compatible_features & QCOW2_COMPAT_IN_USE)) {
+        /* Shared storage is expected during migration */
+        bool migrating = (flags & BDRV_O_INCOMING);
+
+        if (!migrating && !s->override_lock) {
+            error_set(errp, ERROR_CLASS_IMAGE_FILE_LOCKED,
+                      "Image is already in use");
+            error_append_hint(errp, "This check can be disabled "
+                              "with override-lock=on. Caution: Opening an "
+                              "image twice can cause corruption!");
+            ret = -EBUSY;
+            goto fail;
+        }
+    }
+
     s->cluster_cache = g_malloc(s->cluster_size);
     /* one more sector for decompressed data alignment */
     s->cluster_data = qemu_try_blockalign(bs->file->bs, QCOW_MAX_CRYPT_CLUSTERS
@@ -1164,6 +1193,17 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
         }
     }
 
+    /* Set advisory lock in the header (do this as the final step so that
+     * failure doesn't leave a locked image around) */
+    if (!bs->read_only && !(flags & BDRV_O_INCOMING) && s->qcow_version >= 3) {
+        s->compatible_features |= QCOW2_COMPAT_IN_USE;
+        ret = qcow2_update_header(bs);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Could not update qcow2 header");
+            goto fail;
+        }
+    }
+
 #ifdef DEBUG_ALLOC
     {
         BdrvCheckResult result = {0};
@@ -1260,6 +1300,15 @@ static int qcow2_reopen_prepare(BDRVReopenState *state,
         if (ret < 0) {
             goto fail;
         }
+
+        if (!state->bs->read_only) {
+            BDRVQcow2State *s = state->bs->opaque;
+            s->compatible_features &= ~QCOW2_COMPAT_IN_USE;
+            ret = qcow2_update_header(state->bs);
+            if (ret < 0) {
+                goto fail;
+            }
+        }
     }
 
     return 0;
@@ -1272,12 +1321,32 @@ fail:
 
 static void qcow2_reopen_commit(BDRVReopenState *state)
 {
+    /* We can't fail the commit, so if the header update fails, we may end up
+     * not protecting the image even though it is writable now. This is okay,
+     * the lock is a best-effort service to protect the user from shooting
+     * themselves in the foot. */
+    if (state->bs->read_only && (state->flags & BDRV_O_RDWR)) {
+        BDRVQcow2State *s = state->bs->opaque;
+        s->compatible_features |= QCOW2_COMPAT_IN_USE;
+        (void) qcow2_update_header(state->bs);
+    }
+
     qcow2_update_options_commit(state->bs, state->opaque);
     g_free(state->opaque);
 }
 
 static void qcow2_reopen_abort(BDRVReopenState *state)
 {
+    /* We can't fail the abort, so if the header update fails, we may end up
+     * not protecting the image any more. This is okay, the lock is a
+     * best-effort service to protect the user from shooting themselves in
+     * the foot. */
+    if (!state->bs->read_only && !(state->flags & BDRV_O_RDWR)) {
+        BDRVQcow2State *s = state->bs->opaque;
+        s->compatible_features |= QCOW2_COMPAT_IN_USE;
+        (void) qcow2_update_header(state->bs);
+    }
+
     qcow2_update_options_abort(state->bs, state->opaque);
     g_free(state->opaque);
 }
@@ -1708,6 +1777,16 @@ static int qcow2_inactivate(BlockDriverState *bs)
         qcow2_mark_clean(bs);
     }
 
+    if (!bs->read_only) {
+        s->flags |= BDRV_O_INCOMING;
+        s->compatible_features &= ~QCOW2_COMPAT_IN_USE;
+        ret = qcow2_update_header(bs);
+        if (ret < 0) {
+            result = ret;
+            error_report("Could not update qcow2 header: %s", strerror(-ret));
+        }
+    }
+
     return result;
 }
 
@@ -1926,6 +2005,11 @@ int qcow2_update_header(BlockDriverState *bs)
                 .bit  = QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
                 .name = "lazy refcounts",
             },
+            {
+                .type = QCOW2_FEAT_TYPE_COMPATIBLE,
+                .bit  = QCOW2_COMPAT_IN_USE_BITNR,
+                .name = "image is in use (locked)",
+            },
         };
 
         ret = header_ext_add(buf, QCOW2_EXT_MAGIC_FEATURE_TABLE,
diff --git a/block/qcow2.h b/block/qcow2.h
index a063a3c..7fc12f8 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -190,9 +190,12 @@ enum {
 /* Compatible feature bits */
 enum {
     QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR = 0,
+    QCOW2_COMPAT_IN_USE_BITNR         = 1,
     QCOW2_COMPAT_LAZY_REFCOUNTS       = 1 << QCOW2_COMPAT_LAZY_REFCOUNTS_BITNR,
+    QCOW2_COMPAT_IN_USE               = 1 << QCOW2_COMPAT_IN_USE_BITNR,
 
-    QCOW2_COMPAT_FEAT_MASK            = QCOW2_COMPAT_LAZY_REFCOUNTS,
+    QCOW2_COMPAT_FEAT_MASK            = QCOW2_COMPAT_LAZY_REFCOUNTS
+                                      | QCOW2_COMPAT_IN_USE,
 };
 
 enum qcow2_discard_type {
@@ -293,6 +296,8 @@ typedef struct BDRVQcow2State {
      * override) */
     char *image_backing_file;
     char *image_backing_format;
+
+    bool override_lock;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index f236d8c..7732a23 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -96,7 +96,12 @@ in the description of a field.
                                 marking the image file dirty and postponing
                                 refcount metadata updates.
 
-                    Bits 1-63:  Reserved (set to 0)
+                    Bit 1:      Locking bit. If this bit is set, then the
+                                image is supposedly in use by some process and
+                                shouldn't be opened read-write by another
+                                process.
+
+                    Bits 2-63:  Reserved (set to 0)
 
          88 -  95:  autoclear_features
                     Bitmask of auto-clear features. An implementation may only
diff --git a/tests/qemu-iotests/026 b/tests/qemu-iotests/026
index ba1047c..32e1b0c 100755
--- a/tests/qemu-iotests/026
+++ b/tests/qemu-iotests/026
@@ -107,7 +107,7 @@ $QEMU_IO -c "write $vmstate 0 128k " "$BLKDBG_TEST_IMG" | _filter_qemu_io
 # Reads are another path to trigger l2_load, so do a read, too
 if [ "$event" == "l2_load" ]; then
     $QEMU_IO -c "write $vmstate 0 128k " "$BLKDBG_TEST_IMG" | _filter_qemu_io
-    $QEMU_IO -c "read $vmstate 0 128k " "$BLKDBG_TEST_IMG" | _filter_qemu_io
+    $QEMU_IO -r -c "read $vmstate 0 128k " "$BLKDBG_TEST_IMG" | _filter_qemu_io
 fi
 
 _check_test_img 2>&1 | grep -v "refcount=1 reference=0"
diff --git a/tests/qemu-iotests/026.out b/tests/qemu-iotests/026.out
index d84d82c..86ae16f 100644
--- a/tests/qemu-iotests/026.out
+++ b/tests/qemu-iotests/026.out
@@ -16,6 +16,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_update; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 
 1 leaked clusters were found on the image.
@@ -25,6 +26,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_update; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 
 1 leaked clusters were found on the image.
@@ -44,6 +46,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_update; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 1 leaked clusters were found on the image.
@@ -53,6 +56,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_update; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 1 leaked clusters were found on the image.
@@ -80,9 +84,8 @@ wrote 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
-Failed to flush the L2 table cache: Input/output error
-Failed to flush the refcount block cache: Input/output error
 read failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -92,9 +95,8 @@ wrote 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
-Failed to flush the L2 table cache: Input/output error
-Failed to flush the refcount block cache: Input/output error
 read failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -120,9 +122,8 @@ wrote 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
-Failed to flush the L2 table cache: No space left on device
-Failed to flush the refcount block cache: No space left on device
 read failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -132,9 +133,8 @@ wrote 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
-Failed to flush the L2 table cache: No space left on device
-Failed to flush the refcount block cache: No space left on device
 read failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -152,6 +152,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_update; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 
 127 leaked clusters were found on the image.
@@ -161,6 +162,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_update; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 
 127 leaked clusters were found on the image.
@@ -180,6 +182,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_update; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 127 leaked clusters were found on the image.
@@ -189,6 +192,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_update; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 127 leaked clusters were found on the image.
@@ -208,6 +212,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_alloc_write; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -215,6 +220,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_alloc_write; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 
 1 leaked clusters were found on the image.
@@ -234,6 +240,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_alloc_write; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -241,6 +248,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l2_alloc_write; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 1 leaked clusters were found on the image.
@@ -260,6 +268,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: write_aio; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -267,6 +276,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: write_aio; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -284,6 +294,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: write_aio; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -291,6 +302,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: write_aio; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -308,6 +320,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_load; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -315,6 +328,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_load; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -332,6 +346,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_load; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -339,6 +354,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_load; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -356,6 +372,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_update_part; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -363,6 +380,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_update_part; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -380,6 +398,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_update_part; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -387,6 +406,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_update_part; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -404,6 +424,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -411,6 +432,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -428,6 +450,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -435,6 +458,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -452,6 +476,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: cluster_alloc; errno: 5; imm: off; once: off; write
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -459,6 +484,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: cluster_alloc; errno: 5; imm: off; once: off; write -b
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -476,6 +502,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: cluster_alloc; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -483,6 +510,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: cluster_alloc; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 
@@ -503,6 +531,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 55 leaked clusters were found on the image.
@@ -512,6 +541,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_hookup; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 251 leaked clusters were found on the image.
@@ -531,6 +561,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -538,6 +569,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_write; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -555,6 +587,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 11 leaked clusters were found on the image.
@@ -564,6 +597,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_write_blocks; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 23 leaked clusters were found on the image.
@@ -583,6 +617,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 11 leaked clusters were found on the image.
@@ -592,6 +627,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_write_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 23 leaked clusters were found on the image.
@@ -611,6 +647,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 11 leaked clusters were found on the image.
@@ -620,6 +657,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: refblock_alloc_switch_table; errno: 28; imm: off; once: off; write -b
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 23 leaked clusters were found on the image.
@@ -637,6 +675,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_grow_alloc_table; errno: 5; imm: off; once: off
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -649,6 +688,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_grow_alloc_table; errno: 28; imm: off; once: off
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -661,6 +701,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_grow_write_table; errno: 5; imm: off; once: off
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -673,6 +714,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_grow_write_table; errno: 28; imm: off; once: off
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 No errors were found on the image.
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
@@ -685,6 +727,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_grow_activate_table; errno: 5; imm: off; once: off
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 write failed: Input/output error
 
 96 leaked clusters were found on the image.
@@ -699,6 +742,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824
 Event: l1_grow_activate_table; errno: 28; imm: off; once: off
 Failed to flush the L2 table cache: No space left on device
 Failed to flush the refcount block cache: No space left on device
+Could not update qcow2 header: No space left on device
 write failed: No space left on device
 
 96 leaked clusters were found on the image.
diff --git a/tests/qemu-iotests/031.out b/tests/qemu-iotests/031.out
index 7f5050b..68a74d0 100644
--- a/tests/qemu-iotests/031.out
+++ b/tests/qemu-iotests/031.out
@@ -117,7 +117,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 Header extension:
@@ -150,7 +150,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 Header extension:
@@ -164,7 +164,7 @@ No errors were found on the image.
 
 magic                     0x514649fb
 version                   3
-backing_file_offset       0x148
+backing_file_offset       0x178
 backing_file_size         0x17
 cluster_bits              16
 size                      67108864
@@ -188,7 +188,7 @@ data                      'host_device'
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 Header extension:
diff --git a/tests/qemu-iotests/036.out b/tests/qemu-iotests/036.out
index f443635..69c5756 100644
--- a/tests/qemu-iotests/036.out
+++ b/tests/qemu-iotests/036.out
@@ -58,7 +58,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 
@@ -86,7 +86,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 *** done
diff --git a/tests/qemu-iotests/039 b/tests/qemu-iotests/039
index 9e9b379..6f6f7c3 100755
--- a/tests/qemu-iotests/039
+++ b/tests/qemu-iotests/039
@@ -86,7 +86,7 @@ $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
 echo
 echo "== Repairing the image file must succeed =="
 
-_check_test_img -r all
+_check_test_img -r all --force
 
 # The dirty bit must not be set
 $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
@@ -109,7 +109,7 @@ $QEMU_IO -c "write -P 0x5a 0 512" \
 # The dirty bit must be set
 $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
 
-$QEMU_IO -c "write 0 512" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "open -o override-lock=on $TEST_IMG" -c "write 0 512" | _filter_qemu_io
 
 # The dirty bit must not be set
 $PYTHON qcow2.py "$TEST_IMG" dump-header | grep incompatible_features
diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index e191e65..9178e60 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -61,6 +61,7 @@ IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
 $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
 $PYTHON qcow2.py "$TEST_IMG" dump-header
+$QEMU_IMG force-unlock "$TEST_IMG"
 $QEMU_IMG amend -o "compat=0.10" "$TEST_IMG"
 $PYTHON qcow2.py "$TEST_IMG" dump-header
 $QEMU_IO -c "read -P 0x2a 0 128k" "$TEST_IMG" | _filter_qemu_io
@@ -95,6 +96,7 @@ IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
 $QEMU_IO -c "write -P 0x2a 0 128k" -c flush \
          -c "sigraise $(kill -l KILL)" "$TEST_IMG" 2>&1 | _filter_qemu_io
 $PYTHON qcow2.py "$TEST_IMG" dump-header
+$QEMU_IMG force-unlock "$TEST_IMG"
 $QEMU_IMG amend -o "lazy_refcounts=off" "$TEST_IMG"
 $PYTHON qcow2.py "$TEST_IMG" dump-header
 $QEMU_IO -c "read -P 0x2a 0 128k" "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index a03732e..41c9df9 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -26,7 +26,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 magic                     0x514649fb
@@ -76,14 +76,14 @@ refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     0x1
-compatible_features       0x1
+compatible_features       0x3
 autoclear_features        0x0
 refcount_order            4
 header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 ERROR cluster 5 refcount=0 reference=1
@@ -138,7 +138,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 magic                     0x514649fb
@@ -207,7 +207,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 read 65536/65536 bytes at offset 44040192
@@ -238,14 +238,14 @@ refcount_table_clusters   1
 nb_snapshots              0
 snapshot_offset           0x0
 incompatible_features     0x1
-compatible_features       0x1
+compatible_features       0x3
 autoclear_features        0x0
 refcount_order            4
 header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 ERROR cluster 5 refcount=0 reference=1
@@ -274,7 +274,7 @@ header_length             104
 
 Header extension:
 magic                     0x6803f857
-length                    144
+length                    192
 data                      <binary>
 
 read 131072/131072 bytes at offset 0
diff --git a/tests/qemu-iotests/071 b/tests/qemu-iotests/071
index 92ab991..5ad48dd 100755
--- a/tests/qemu-iotests/071
+++ b/tests/qemu-iotests/071
@@ -85,6 +85,7 @@ $QEMU_IO -c 'write -P 42 0 512' "$TEST_IMG" | _filter_qemu_io
 
 $QEMU_IO -c "open -o driver=raw,file.driver=blkverify,file.raw.filename=$TEST_IMG.base $TEST_IMG" \
          -c 'read -P 42 0 512' | _filter_qemu_io
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 echo
 echo "=== Testing blkdebug through filename ==="
@@ -92,6 +93,7 @@ echo
 
 $QEMU_IO -c "open -o file.driver=blkdebug,file.inject-error.event=l2_load $TEST_IMG" \
          -c 'read -P 42 0x38000 512'
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 echo
 echo "=== Testing blkdebug through file blockref ==="
@@ -99,6 +101,7 @@ echo
 
 $QEMU_IO -c "open -o driver=$IMGFMT,file.driver=blkdebug,file.inject-error.event=l2_load,file.image.filename=$TEST_IMG" \
          -c 'read -P 42 0x38000 512'
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 echo
 echo "=== Testing blkdebug on existing block device ==="
@@ -137,6 +140,7 @@ run_qemu <<EOF
 }
 { "execute": "quit" }
 EOF
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 echo
 echo "=== Testing blkverify on existing block device ==="
@@ -176,6 +180,7 @@ run_qemu <<EOF
 }
 { "execute": "quit" }
 EOF
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 echo
 echo "=== Testing blkverify on existing raw block device ==="
@@ -215,6 +220,7 @@ run_qemu <<EOF
 }
 { "execute": "quit" }
 EOF
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 echo
 echo "=== Testing blkdebug's set-state through QMP ==="
@@ -268,6 +274,7 @@ run_qemu <<EOF
 }
 { "execute": "quit" }
 EOF
+$QEMU_IMG force-unlock "$TEST_IMG"
 
 # success, all done
 echo "*** done"
diff --git a/tests/qemu-iotests/071.out b/tests/qemu-iotests/071.out
index 2b40ead..b6afd6b 100644
--- a/tests/qemu-iotests/071.out
+++ b/tests/qemu-iotests/071.out
@@ -32,12 +32,14 @@ blkverify: read sector_num=0 nb_sectors=1 contents mismatch in sector 0
 
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 read failed: Input/output error
 
 === Testing blkdebug through file blockref ===
 
 Failed to flush the L2 table cache: Input/output error
 Failed to flush the refcount block cache: Input/output error
+Could not update qcow2 header: Input/output error
 read failed: Input/output error
 
 === Testing blkdebug on existing block device ===
@@ -53,6 +55,7 @@ read failed: Input/output error
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN"}
 QEMU_PROG: Failed to flush the L2 table cache: Input/output error
 QEMU_PROG: Failed to flush the refcount block cache: Input/output error
+QEMU_PROG: Could not update qcow2 header: Input/output error
 
 
 === Testing blkverify on existing block device ===
@@ -94,5 +97,6 @@ read failed: Input/output error
 {"timestamp": {"seconds":  TIMESTAMP, "microseconds":  TIMESTAMP}, "event": "SHUTDOWN"}
 QEMU_PROG: Failed to flush the L2 table cache: Input/output error
 QEMU_PROG: Failed to flush the refcount block cache: Input/output error
+QEMU_PROG: Could not update qcow2 header: Input/output error
 
 *** done
diff --git a/tests/qemu-iotests/089 b/tests/qemu-iotests/089
index 3e0038d..509ac9b 100755
--- a/tests/qemu-iotests/089
+++ b/tests/qemu-iotests/089
@@ -94,7 +94,7 @@ $QEMU_IO -c 'write -P 42 0x38000 512' "$TEST_IMG" | _filter_qemu_io
 
 # The "image.filename" part tests whether "a": { "b": "c" } and "a.b": "c" do
 # the same (which they should).
-$QEMU_IO_PROG --cache $CACHEMODE \
+$QEMU_IO_PROG -r --cache $CACHEMODE \
      -c 'read -P 42 0x38000 512' "json:{
     \"driver\": \"$IMGFMT\",
     \"file\": {
diff --git a/tests/qemu-iotests/089.out b/tests/qemu-iotests/089.out
index 5b541a3..18f5fdd 100644
--- a/tests/qemu-iotests/089.out
+++ b/tests/qemu-iotests/089.out
@@ -24,8 +24,6 @@ read 512/512 bytes at offset 0
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
 wrote 512/512 bytes at offset 229376
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-Failed to flush the L2 table cache: Input/output error
-Failed to flush the refcount block cache: Input/output error
 read failed: Input/output error
 
 === Testing qemu-img info output ===
diff --git a/tests/qemu-iotests/091 b/tests/qemu-iotests/091
index 32bbd56..b7b0120 100755
--- a/tests/qemu-iotests/091
+++ b/tests/qemu-iotests/091
@@ -97,7 +97,7 @@ _send_qemu_cmd $h2 'qemu-io disk flush' "(qemu)"
 _send_qemu_cmd $h2 'quit' ""
 
 echo "Check image pattern"
-${QEMU_IO} -c "read -P 0x22 0 4M" "${TEST_IMG}" | _filter_testdir | _filter_qemu_io
+${QEMU_IO} -r -c "read -P 0x22 0 4M" "${TEST_IMG}" | _filter_testdir | _filter_qemu_io
 
 echo "Running 'qemu-img check -r all \$TEST_IMG'"
 "${QEMU_IMG}" check -r all "${TEST_IMG}" 2>&1 | _filter_testdir | _filter_qemu
diff --git a/tests/qemu-iotests/098 b/tests/qemu-iotests/098
index e2230ad..71fbf524 100755
--- a/tests/qemu-iotests/098
+++ b/tests/qemu-iotests/098
@@ -67,7 +67,7 @@ $QEMU_IMG commit "blkdebug:$TEST_DIR/blkdebug.conf:$TEST_IMG" 2>&1 \
     | _filter_testdir | _filter_imgfmt
 
 # There may be errors, but they should be fixed by opening the image
-$QEMU_IO -c close "$TEST_IMG"
+$QEMU_IO -c "open -o override-lock=on $TEST_IMG" -c close
 
 _check_test_img
 
-- 
1.8.3.1

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images Kevin Wolf
@ 2015-12-22 16:57   ` Daniel P. Berrange
  2015-12-22 17:00     ` Kevin Wolf
  2015-12-22 21:06   ` Eric Blake
  2015-12-22 21:41   ` Eric Blake
  2 siblings, 1 reply; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-22 16:57 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

On Tue, Dec 22, 2015 at 05:46:22PM +0100, Kevin Wolf wrote:
> This patch extends qemu-img for working with locked images. It prints a
> helpful error message when trying to access a locked image read-write,
> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> -r all --force' option in order to override a lock left behind after a
> qemu crash.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>


> diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
> index 9567774..dd4aebc 100644
> --- a/qemu-img-cmds.hx
> +++ b/qemu-img-cmds.hx
> @@ -10,9 +10,9 @@ STEXI
>  ETEXI
>  
>  DEF("check", img_check,
> -    "check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
> +    "check [-q] [-f fmt] [--force] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
>  STEXI
> -@item check [-q] [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
> +@item check [-q] [-f @var{fmt}] [--force] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
>  ETEXI

FWIW, my patch to add a new --source arg to qemu-img commands
means you wouldn't need to keep inventing new command line arguments
for each new block driver option you want to support usage of in
qemu-img - they'll all be accessible:

  https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg04021.html

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 16:57   ` Daniel P. Berrange
@ 2015-12-22 17:00     ` Kevin Wolf
  0 siblings, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2015-12-22 17:00 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: qemu-devel, qemu-block, mreitz

Am 22.12.2015 um 17:57 hat Daniel P. Berrange geschrieben:
> On Tue, Dec 22, 2015 at 05:46:22PM +0100, Kevin Wolf wrote:
> > This patch extends qemu-img for working with locked images. It prints a
> > helpful error message when trying to access a locked image read-write,
> > and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> > -r all --force' option in order to override a lock left behind after a
> > qemu crash.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
> 
> > diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
> > index 9567774..dd4aebc 100644
> > --- a/qemu-img-cmds.hx
> > +++ b/qemu-img-cmds.hx
> > @@ -10,9 +10,9 @@ STEXI
> >  ETEXI
> >  
> >  DEF("check", img_check,
> > -    "check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
> > +    "check [-q] [-f fmt] [--force] [--output=ofmt] [-r [leaks | all]] [-T src_cache] filename")
> >  STEXI
> > -@item check [-q] [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
> > +@item check [-q] [-f @var{fmt}] [--force] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
> >  ETEXI
> 
> FWIW, my patch to add a new --source arg to qemu-img commands
> means you wouldn't need to keep inventing new command line arguments
> for each new block driver option you want to support usage of in
> qemu-img - they'll all be accessible:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg04021.html

Good point. I think I'd keep 'check --force' anyway because it's kind of
intuitive. Your series will be a good solution for all other qemu-img
commands, though, so I'll refrain from adding specific options there.

Kevin

* Re: [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images Kevin Wolf
@ 2015-12-22 20:20   ` Eric Blake
  2016-01-11 15:20     ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Blake @ 2015-12-22 20:20 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> Version 2 images don't have feature bits, so writing a feature table to
> those images is kind of pointless.

Fortunately, it is also harmless; even the v2 spec allowed for unknown
extension headers.

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/qcow2.c              | 48 ++++++++++++++++++++++++----------------------
>  tests/qemu-iotests/031.out | 12 +-----------
>  tests/qemu-iotests/061.out | 15 ---------------
>  3 files changed, 26 insertions(+), 49 deletions(-)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

Did you test that amend'ing an image from v2 to v3 adds the table, and
downgrading from v3 to v2 drops the table?
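
One way to check this by hand is sketched below. It is an illustration only, not part of the series or of qcow2.py: the function names are made up here, and it assumes the header layout from the qcow2 spec (big-endian fields, version at byte 4, header_length at byte 100 for v3, a 72-byte header for v2, and extensions stored as magic, length, then data padded to a multiple of 8). The feature table magic 0x6803f857 is the one shown in the test outputs of this series.

```python
import struct

FEATURE_TABLE_MAGIC = 0x6803F857  # feature name table header extension
END_MAGIC = 0x00000000            # end-of-extensions marker
V2_HEADER_LENGTH = 72             # v2 images have a fixed-size header


def header_extensions(buf, header_length):
    """Yield (magic, length) for each header extension found in buf."""
    offset = header_length
    while offset + 8 <= len(buf):
        magic, length = struct.unpack_from(">II", buf, offset)
        if magic == END_MAGIC:
            break
        yield magic, length
        # Extension data is padded to the next multiple of 8 bytes.
        offset += 8 + ((length + 7) & ~7)


def has_feature_table(buf):
    """Return True if the image header in buf carries a feature table."""
    (version,) = struct.unpack_from(">I", buf, 4)
    header_length = V2_HEADER_LENGTH
    if version >= 3:
        (header_length,) = struct.unpack_from(">I", buf, 100)
    return any(magic == FEATURE_TABLE_MAGIC
               for magic, _ in header_extensions(buf, header_length))


def image_has_feature_table(path):
    """Hypothetical helper: read the first cluster and inspect it."""
    with open(path, "rb") as f:
        return has_feature_table(f.read(65536))
```

Calling image_has_feature_table() on the image before and after 'qemu-img amend -o compat=1.1' (and again after 'compat=0.10') would show whether the table appears and disappears as expected.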

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 02/10] qcow2: Write full header on image creation
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 02/10] qcow2: Write full header on image creation Kevin Wolf
@ 2015-12-22 20:25   ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 20:25 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> When creating a qcow2 image, we didn't necessarily call
> qcow2_update_header(), but could end up with the basic header that
> qcow2_create2() created manually. One thing that this basic header
> lacks is the feature table. Let's make sure that it's always present.

Aha, this helps answer my question on 1/10.

> 
> This requires a few updates to test cases as well.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 03/10] block: Assert no write requests under BDRV_O_INCOMING
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 03/10] block: Assert no write requests under BDRV_O_INCOMING Kevin Wolf
@ 2015-12-22 20:27   ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 20:27 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> As long as BDRV_O_INCOMING is set, the image file is only opened so we
> have a file descriptor for it. We're definitely not supposed to modify
> the image, it's still owned by the migration source.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 

Reviewed-by: Eric Blake <eblake@redhat.com>

Based on the other thread going on, we may actually be able to hit this
assertion due to a latent bug:
https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg01126.html
and maybe this assert will help us get a better stack trace to pinpoint
the culprit?

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 04/10] block: Fix error path in bdrv_invalidate_cache()
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 04/10] block: Fix error path in bdrv_invalidate_cache() Kevin Wolf
@ 2015-12-22 20:31   ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 20:31 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> We can only clear BDRV_O_INCOMING if the caches were actually
> invalidated.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 05/10] block: Inactivate BDS when migration completes
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 05/10] block: Inactivate BDS when migration completes Kevin Wolf
@ 2015-12-22 20:43   ` Eric Blake
  2016-01-05 20:21     ` [Qemu-devel] [Qemu-block] " John Snow
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Blake @ 2015-12-22 20:43 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> So far, live migration with shared storage meant that the image is in a
> not-really-ready don't-touch-me state on the destination while the
> source is still actively using it, but after completing the migration,
> the image was fully opened on both sides. This is bad.
> 
> This patch adds a block driver callback to inactivate images on the
> source before completing the migration. Inactivation means that it goes
> to a state as if it was just live migrated to the qemu instance on the
> source (i.e. BDRV_O_INCOMING is set). You're then supposed to continue
> either on the source or on the destination, which takes ownership of the
> image.
> 
> A typical migration looks like this now with respect to disk images:
> 
> 1. Destination qemu is started, the image is opened with
>    BDRV_O_INCOMING. The image is fully opened on the source.
> 
> 2. Migration is about to complete. The source flushes the image and
>    inactivates it. Now both sides have the image opened with
>    BDRV_O_INCOMING and are expecting the other side to still modify it.

The name BDRV_O_INCOMING now doesn't quite match semantics on the
source, but I don't have any better suggestions.  BDRV_O_LIMITED_USE?
BDRV_O_HANDOFF?  At any rate, I fully agree with your logic of locking
things down on the source to mark that the destination is about to take
over write access to the file.

> 
> 3. One side (the destination on success) continues and calls
>    bdrv_invalidate_all() in order to take ownership of the image again.
>    This removes BDRV_O_INCOMING on the resuming side; the flag remains
>    set on the other side.
> 
> This ensures that the same image isn't written to by both instances
> (unless both are resumed, but then you get what you deserve). This is
> important because .bdrv_close for non-BDRV_O_INCOMING images could write
> to the image file, which is definitely forbidden while another host is
> using the image.

And indeed, this is a prereq to your patch that modifies the file on
close to clear the new 'open-for-writing' flag :)
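To double-check that I read the handoff correctly, here is a toy model
of steps 1-3.  It is only a sketch, not qemu code: struct instance,
inactivate(), invalidate() and may_write() are made-up names, with the
'incoming' field standing in for BDRV_O_INCOMING and the two helpers
standing in for bdrv_inactivate_all() and bdrv_invalidate_all().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: one struct per qemu instance sharing the image. */
struct instance {
    bool incoming;   /* models BDRV_O_INCOMING: the image is owned elsewhere */
};

/* Models inactivation on the source: give up ownership of the image. */
static void inactivate(struct instance *i)
{
    i->incoming = true;
}

/* Models invalidation on the side that resumes: take ownership again. */
static void invalidate(struct instance *i)
{
    i->incoming = false;
}

/* An instance may write (including on close) only while it owns the image. */
static bool may_write(const struct instance *i)
{
    return !i->incoming;
}

/* Walk through steps 1-3; returns how many instances may write at the end. */
static int writers_after_migration(bool success)
{
    struct instance src = { .incoming = false };   /* step 1 */
    struct instance dst = { .incoming = true };

    assert(may_write(&src) && !may_write(&dst));

    inactivate(&src);                              /* step 2 */
    assert(!may_write(&src) && !may_write(&dst));

    invalidate(success ? &dst : &src);             /* step 3 */
    return may_write(&src) + may_write(&dst);
}
```

In this model exactly one side ends up as a writer whether the
migration succeeds or fails, and during step 2 neither side is one.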

> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block.c                   | 34 ++++++++++++++++++++++++++++++++++
>  include/block/block.h     |  1 +
>  include/block/block_int.h |  1 +
>  migration/migration.c     |  7 +++++++
>  qmp.c                     | 12 ++++++++++++
>  5 files changed, 55 insertions(+)
> 

> @@ -1536,6 +1540,9 @@ static void migration_completion(MigrationState *s, int current_active_state,
>          if (!ret) {
>              ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>              if (ret >= 0) {
> +                ret = bdrv_inactivate_all();
> +            }
> +            if (ret >= 0) {
>                  qemu_file_set_rate_limit(s->file, INT64_MAX);

Isn't the point of the rate limit change to allow any pending operations
to flush without artificial slow limits?  Will inactivating the device
be too slow if the rate limit is still in place?

But offhand, I don't have any strong proof that a different order is
required, so yours makes sense to me.

You may want a second opinion, but I'm okay if you add:
Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images Kevin Wolf
  2015-12-22 16:57   ` Daniel P. Berrange
@ 2015-12-22 21:06   ` Eric Blake
  2016-01-11 15:49     ` Markus Armbruster
  2016-01-11 16:22     ` Kevin Wolf
  2015-12-22 21:41   ` Eric Blake
  2 siblings, 2 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 21:06 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> This patch extends qemu-img for working with locked images. It prints a
> helpful error message when trying to access a locked image read-write,
> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> -r all --force' option in order to override a lock left behind after a
> qemu crash.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/block/block.h |  1 +
>  include/qapi/error.h  |  1 +
>  qapi/common.json      |  3 +-
>  qemu-img-cmds.hx      | 10 ++++--
>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
>  qemu-img.texi         | 20 ++++++++++-
>  6 files changed, 113 insertions(+), 18 deletions(-)
> 

> +++ b/include/qapi/error.h
> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
>  } ErrorClass;

Wow - a new ErrorClass.  It's been a while since we could justify one of
these, but I think you might have found a case.

>  
>  /*
> diff --git a/qapi/common.json b/qapi/common.json
> index 9353a7b..1bf6e46 100644
> --- a/qapi/common.json
> +++ b/qapi/common.json
> @@ -27,7 +27,8 @@
>  { 'enum': 'QapiErrorClass',
>    # Keep this in sync with ErrorClass in error.h
>    'data': [ 'GenericError', 'CommandNotFound', 'DeviceEncrypted',
> -            'DeviceNotActive', 'DeviceNotFound', 'KVMMissingCap' ] }
> +            'DeviceNotActive', 'DeviceNotFound', 'KVMMissingCap',
> +            'ImageFileLocked' ] }

Missing documentation of the new value; should be something like:

# @ImageFileLocked: the requested operation attempted to write to an
#    image locked for writing by another process (since 2.6)

>  
> +DEF("force-unlock", img_force_unlock,
> +    "force_unlock [-f fmt] filename")

So is it force-unlock or force_unlock?  It's our first two-word command
on the qemu-img CLI, but I strongly prefer '-' (hitting the shift key
mid-word is a bother for CLI usage).

> +++ b/qemu-img.c
> @@ -47,6 +47,7 @@ typedef struct img_cmd_t {
>  enum {
>      OPTION_OUTPUT = 256,
>      OPTION_BACKING_CHAIN = 257,
> +    OPTION_FORCE = 258,
>  };

May conflict with Daniel's proposed patches; I'm sure you two can sort
out the problems.

> @@ -206,12 +207,34 @@ static BlockBackend *img_open(const char *id, const char *filename,
>      Error *local_err = NULL;
>      QDict *options = NULL;
>  
> +    options = qdict_new();
>      if (fmt) {
> -        options = qdict_new();
>          qdict_put(options, "driver", qstring_from_str(fmt));
>      }
> +    QINCREF(options);
>  
>      blk = blk_new_open(id, filename, NULL, options, flags, &local_err);
> +    if (!blk && error_get_class(local_err) == ERROR_CLASS_IMAGE_FILE_LOCKED) {
> +        if (force) {
> +            qdict_put(options, BDRV_OPT_OVERRIDE_LOCK, qstring_from_str("on"));

I guess it's safer to try without the override first and re-issue the
open with it only when needed, rather than having 'force' blindly turn
on the override up front just to save the second open.  And the user
probably can't observe which of the two approaches you use (same end
results).

> +            blk = blk_new_open(id, filename, NULL, options, flags, NULL);

Can't the second attempt still fail, for some other reason?  I think
passing NULL for errp is risky here.  I guess you're saved by the fact
that blk_new_open() should always return NULL if an error would have
been set, and that you want to favor reporting the original failure
(with the class ERROR_CLASS_IMAGE_FILE_LOCKED) rather than the
second-attempt failure.
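
If you did want to keep the second error around instead of passing
NULL, one possible shape is sketched below.  This is standalone code
under my own assumptions, not a proposed patch: enum err_class, open_fn
and the two fake openers are hypothetical stand-ins for ErrorClass and
blk_new_open(), and the policy (report the retry's error only when it
is a different kind of failure) is just one option.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error classes; only ERR_LOCKED mirrors the new ErrorClass. */
enum err_class { ERR_NONE, ERR_GENERIC, ERR_LOCKED };

/* Hypothetical stand-in for blk_new_open(): returns non-NULL on success,
 * otherwise NULL with the failure class stored in *err. */
typedef void *(*open_fn)(int override_lock, enum err_class *err);

static void *open_with_force(open_fn do_open, int force, enum err_class *err)
{
    enum err_class first = ERR_NONE, second = ERR_NONE;
    void *blk;

    assert(err != NULL);
    blk = do_open(0, &first);
    if (!blk && first == ERR_LOCKED && force) {
        /* Retry with the override, but collect the retry's error too. */
        blk = do_open(1, &second);
        if (!blk && second != ERR_NONE && second != first) {
            /* The retry failed for a different reason; report that one. */
            first = second;
        }
    }
    *err = blk ? ERR_NONE : first;
    return blk;
}

static int fake_image;

/* Fake opener: a stale lock that the override gets past. */
static void *open_stale_lock(int override_lock, enum err_class *err)
{
    if (override_lock) {
        return &fake_image;
    }
    *err = ERR_LOCKED;
    return NULL;
}

/* Fake opener: locked, and the overridden open fails for another reason. */
static void *open_broken(int override_lock, enum err_class *err)
{
    *err = override_lock ? ERR_GENERIC : ERR_LOCKED;
    return NULL;
}
```

With open_broken() and force set, the caller sees ERR_GENERIC rather
than a misleading "locked" report; without force it still sees
ERR_LOCKED.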

> +            if (blk) {
> +                error_free(local_err);
> +            }
> +        } else {
> +            error_report("The image file '%s' is locked and cannot be "
> +                         "opened for write access as this may cause image "
> +                         "corruption.", filename);

This completely discards the information in local_err.  Of course, I
don't know what information you are proposing to store for the actual
advisory lock extension header.  But let's suppose it were to include
hostname+pid information on who claimed the lock, rather than just a
single lock bit.  That additional information in local_err may well be
worth reporting here rather than just discarding it all.

> +            error_report("If it is locked in error (e.g. because "
> +                         "of an unclean shutdown) and you are sure that no "
> +                         "other processes are working on the image file, you "
> +                         "can use 'qemu-img force-unlock' or the --force flag "
> +                         "for 'qemu-img check' in order to override this "
> +                         "check.");

Long line; I don't know if we want to insert intermediate line breaks.
Markus may have more opinions on what this should look like.

> +static int img_force_unlock(int argc, char **argv)
> +{
> +    BlockBackend *blk;
> +    const char *format = NULL;
> +    const char *filename;
> +    char c;
> +
> +    for (;;) {
> +        c = getopt(argc, argv, "hf:");
> +        if (c == -1) {
> +            break;
> +        }
> +        switch (c) {
> +        case '?':
> +        case 'h':
> +            help();
> +            break;
> +        case 'f':
> +            format = optarg;
> +            break;

Depending on what we decide for Daniel's patches, you may not even want
a -f here, but always treat this as a new-style command that only takes
QemuOpts style parsing of a positional parameter.  Right now, I'm
leaning towards his v3 design (all older sub-commands gain a boolean
flag that says whether the positional parameters are literal filenames
or specific QemuOpts strings), but since your subcommand is new, we
don't have to cater to the older style.

> +++ b/qemu-img.texi
> @@ -117,7 +117,7 @@ Skip the creation of the target volume
>  Command description:
>  
>  @table @option
> -@item check [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
> +@item check [-q] [-f @var{fmt}] [--force] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}

Where did -q come from?

>  
> +@item force-unlock [-f @var{fmt}] @var{filename}

Okay - most of your patch used the sane spelling; it was just the one
spot I found that used force_unlock incorrectly.

> +
> +Read-write disk images can generally be safely opened only from a single
> +process at the same time. In order to protect against corruption from
> +neglecting to follow this rule, qcow2 images are automatically flagged as
> +in use when they are opened and the flag is removed again on a clean
> +shutdown.
> +
> +However, in cases of an unclean shutdown, the image might be still marked as in
> +use so that any further read-write access is prohibited. You can use the
> +@code{force-unlock} command to manually remove the in-use flag then.
> +

Looks reasonable.  I do think I found enough things, though, that it
will require a v2 (perhaps rebased on some other patches) before I give R-b.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate Kevin Wolf
@ 2015-12-22 21:17   ` Eric Blake
  2016-01-11 15:34     ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Blake @ 2015-12-22 21:17 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> The callback has to ensure that closing or flushing the image afterwards
> wouldn't cause a write access to the image files. This means that just
> the caches have to be written out, which is part of the existing
> .bdrv_close implementation.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/qcow2.c | 45 ++++++++++++++++++++++++++++-----------------
>  1 file changed, 28 insertions(+), 17 deletions(-)
> 

Mostly code motion, but I still ended up with questions.

>  
> +static int qcow2_inactivate(BlockDriverState *bs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int ret, result = 0;
> +
> +    ret = qcow2_cache_flush(bs, s->l2_table_cache);
> +    if (ret) {
> +        result = ret;
> +        error_report("Failed to flush the L2 table cache: %s",
> +                     strerror(-ret));

I asked Markus if we want error_report_errno() - and ever since then, I
keep finding more and more uses that would benefit from it :)

> +    }
> +
> +    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
> +    if (ret) {
> +        result = ret;
> +        error_report("Failed to flush the refcount block cache: %s",
> +                     strerror(-ret));
> +    }
> +

If the media fails in between these two statements,...

> +    if (result == 0) {
> +        qcow2_mark_clean(bs);

...can't qcow2_mark_clean() fail due to an EIO or other write error?  Do
we care?  I guess the worst is that we didn't mark the image clean after
all, which is no worse than if qemu[-img] had been SIGKILL'd at the same
point where I hypothesized that the media could fail.

> +    }
> +
> +    return result;

If both flushes failed, you return the result set to the return value of
the second flush.  Is it ever possible that the return value of the
first flush might be more useful?
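
For illustration, keeping the first failure instead of the last would
look roughly like the sketch below.  It is standalone code rather than
a change to the patch; keep_first_error() and flush_both() are
hypothetical names, and the two int parameters stand in for the return
values of the two qcow2_cache_flush() calls.

```c
#include <errno.h>

/* Remember the first failure among several negative-errno results. */
static int keep_first_error(int result, int ret)
{
    return result ? result : ret;
}

/* Hypothetical stand-in for the two flushes in qcow2_inactivate(). */
static int flush_both(int l2_ret, int refcount_ret)
{
    int result = 0;

    result = keep_first_error(result, l2_ret);
    result = keep_first_error(result, refcount_ret);
    return result;
}
```

Both flushes are still attempted; only the choice of which error to
return changes.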

> @@ -1693,23 +1719,7 @@ static void qcow2_close(BlockDriverState *bs)
>      s->l1_table = NULL;
>  
>      if (!(bs->open_flags & BDRV_O_INCOMING)) {

> -        if (!ret1 && !ret2) {
> -            qcow2_mark_clean(bs);
> -        }
> +        qcow2_inactivate(bs);
>      }

Then again, the lone existing caller in this patch doesn't even care
about the return value.  The only other caller is the code you added
earlier in the series, in 5/10's migration_completion(), which uses the
value as a conditional but doesn't try to call strerror(-ret).

Since this is mostly code motion, any semantic changes you would want to
make based on my questions above probably belong in their own patches.
Therefore, for this patch as written,

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 08/10] qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache()
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 08/10] qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache() Kevin Wolf
@ 2015-12-22 21:22   ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 21:22 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

[-- Attachment #1: Type: text/plain, Size: 791 bytes --]

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> What qcow2_invalidate_cache() should do is closing the image with
> BDRV_O_INCOMING set and reopening it with the flag cleared. In fact, it

Might read better with s/closing/close/, s/reopening/reopen/

> used to do exactly the opposite: qcow2_close() relied on bs->open_flags,
> which is already updated to have cleared BDRV_O_INCOMING at this point,
> whereas qcow2_open() was called with s->flags, which has the flag still
> set. Fix this.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/qcow2.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 09/10] qcow2: Make image inaccessible after failed qcow2_invalidate_cache()
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 09/10] qcow2: Make image inaccessible after failed qcow2_invalidate_cache() Kevin Wolf
@ 2015-12-22 21:24   ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 21:24 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> If qcow2_invalidate_cache() fails, we are in a state where qcow2_close()
> has already been completed, but the image hasn't been reopened yet.
> Calling into any qcow2 function for an image in this state will cause
> crashes.
> 
> The real solution would be to get rid of the close/open pair and instead
> do an atomic reset of the involved data structures, but this isn't
> trivial, so let's just make the image inaccessible for now.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/qcow2.c | 3 +++
>  1 file changed, 3 insertions(+)

Band-aids are better than nothing.

Reviewed-by: Eric Blake <eblake@redhat.com>

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images Kevin Wolf
  2015-12-22 16:57   ` Daniel P. Berrange
  2015-12-22 21:06   ` Eric Blake
@ 2015-12-22 21:41   ` Eric Blake
  2 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 21:41 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> This patch extends qemu-img for working with locked images. It prints a
> helpful error message when trying to access a locked image read-write,
> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> -r all --force' option in order to override a lock left behind after a
> qemu crash.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  include/block/block.h |  1 +
>  include/qapi/error.h  |  1 +
>  qapi/common.json      |  3 +-
>  qemu-img-cmds.hx      | 10 ++++--
>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
>  qemu-img.texi         | 20 ++++++++++-
>  6 files changed, 113 insertions(+), 18 deletions(-)
> 
> diff --git a/include/block/block.h b/include/block/block.h
> index 0d00ac1..1ae655c 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -101,6 +101,7 @@ typedef struct HDGeometry {
>  #define BDRV_OPT_CACHE_DIRECT   "cache.direct"
>  #define BDRV_OPT_CACHE_NO_FLUSH "cache.no-flush"
>  
> +#define BDRV_OPT_OVERRIDE_LOCK  "override-lock"

New dict key here...

>  
>      blk = blk_new_open(id, filename, NULL, options, flags, &local_err);
> +    if (!blk && error_get_class(local_err) == ERROR_CLASS_IMAGE_FILE_LOCKED) {
> +        if (force) {
> +            qdict_put(options, BDRV_OPT_OVERRIDE_LOCK, qstring_from_str("on"));

...but not supported by any of the block drivers until 10/10 adds it for
qcow2.  I guess what happens is that...

> +            blk = blk_new_open(id, filename, NULL, options, flags, NULL);

...the second blk_new_open() fails if the option is unrecognized, but we
ignore the second failure; and therefore the option makes a difference
only if the block driver understands the option.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 10/10] qcow2: Add image locking
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 10/10] qcow2: Add image locking Kevin Wolf
@ 2015-12-22 22:04   ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-22 22:04 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> People have been keeping destroying their qcow2 images by using

Any of these sound better:

People have kept on destroying
People have been destroying
People keep on destroying

> 'qemu-img snapshot' on images that were in use by a VM. This patch adds
> some image locking that protects them against this mistake.
> 
> In order to achieve this, a new compatible header flag is introduced
> that tells that the image is currently in use. It is (almost) always set
> when qcow2 considers the image to be active and in a read-write mode.
> During live migration, the source considers the image active until the
> VM stops on migration completion. The destination considers it active as
> soon as it starts running the VM.
> 
> In cases where qemu wasn't shut down cleanly, images may incorrectly
> refuse to open. An option override-lock=on is provided to force opening
> the image (this is the option that qemu-img uses for 'force-unlock' and
> 'check --force').
> 
> A few test cases have to be adjusted, either to update error messages,
> to use read-only mode to avoid the check, or to override the lock where
> necessary.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---

> diff --git a/block/qcow2.c b/block/qcow2.c
> index 544c124..c07a078 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -307,6 +307,8 @@ int qcow2_mark_corrupt(BlockDriverState *bs)
>      BDRVQcow2State *s = bs->opaque;
>  
>      s->incompatible_features |= QCOW2_INCOMPAT_CORRUPT;
> +    s->compatible_features &= ~QCOW2_COMPAT_IN_USE;
> +

So the moment we detect something is wrong, we (attempt to) write the
corrupt bit but promise to do no further writes, so it makes sense that
we can claim we are no longer using the image.

>  
> @@ -472,6 +474,11 @@ static QemuOptsList qcow2_runtime_opts = {
>              .type = QEMU_OPT_NUMBER,
>              .help = "Clean unused cache entries after this time (in seconds)",
>          },
> +        {
> +            .name = BDRV_OPT_OVERRIDE_LOCK,
> +            .type = QEMU_OPT_BOOL,
> +            .help = "Open the image read-write even if it is locked",
> +        },

Missing counterpart documentation in qapi/block-core.json
BlockdevOptionsQcow2.

> +    /* Protect against opening the image r/w twice at the same time */
> +    if (!bs->read_only && (s->compatible_features & QCOW2_COMPAT_IN_USE)) {
> +        /* Shared storage is expected during migration */
> +        bool migrating = (flags & BDRV_O_INCOMING);
> +
> +        if (!migrating && !s->override_lock) {
> +            error_set(errp, ERROR_CLASS_IMAGE_FILE_LOCKED,
> +                      "Image is already in use");
> +            error_append_hint(errp, "This check can be disabled "
> +                              "with override-lock=on. Caution: Opening an "
> +                              "image twice can cause corruption!");

Here's where I wondered in 6/10 if it is worth providing additional
information about the current lock owner; and that information would
come from [1]...

> @@ -1164,6 +1193,17 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
>          }
>      }
>  
> +    /* Set advisory lock in the header (do this as the final step so that
> +     * failure doesn't leave a locked image around) */
> +    if (!bs->read_only && !(flags & BDRV_O_INCOMING) && s->qcow_version >= 3) {
> +        s->compatible_features |= QCOW2_COMPAT_IN_USE;
> +        ret = qcow2_update_header(bs);

This says that we set the advisory bit even when override-lock is on;
the only purpose of override-lock is to allow us to write to the image
even if the bit was already set.

I suppose the other choice would be that override-lock=on means that we
don't bother to set the bit at all, leaving us with the qemu 2.5
behavior of not claiming the lock and making it easier to stomp on the
image - perhaps useful for regression testing, but probably not as safe
as a default.  So I can agree with how you implemented the override.
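
Spelled out as a standalone sketch, this is how I read the two hunks
(the check on open and the setting of the bit).  The struct and
function names are made up, and COMPAT_IN_USE is my assumption of the
value, based on the spec hunk below assigning bit 1 of the compatible
features; this is not the qcow2 code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: bit 1 of the compatible feature bits. */
#define COMPAT_IN_USE (UINT64_C(1) << 1)

/* Hypothetical summary of what an open request looks like. */
struct open_req {
    int version;          /* qcow2 version of the image (2 or 3) */
    bool read_only;
    bool incoming;        /* models BDRV_O_INCOMING */
    bool override_lock;   /* models override-lock=on */
};

/* Returns true if the open has to be refused because of the lock. */
static bool open_refused(uint64_t compat, const struct open_req *r)
{
    if (r->read_only || !(compat & COMPAT_IN_USE)) {
        return false;
    }
    return !r->incoming && !r->override_lock;
}

/* Returns the compatible feature bits after a successful open. */
static uint64_t compat_after_open(uint64_t compat, const struct open_req *r)
{
    if (!r->read_only && !r->incoming && r->version >= 3) {
        compat |= COMPAT_IN_USE;
    }
    return compat;
}
```

Note that override_lock only shows up in open_refused(): it changes
whether we get in, not whether we claim the lock afterwards.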

> @@ -1272,12 +1321,32 @@ fail:
>  
>  static void qcow2_reopen_commit(BDRVReopenState *state)
>  {
> +    /* We can't fail the commit, so if the header update fails, we may end up
> +     * not protecting the image even though it is writable now. This is okay,
> +     * the lock is a best-effort service to protect the user from shooting
> +     * themselves into the foot. */

s/into/in/

> +    if (state->bs->read_only && (state->flags & BDRV_O_RDWR)) {
> +        BDRVQcow2State *s = state->bs->opaque;
> +        s->compatible_features |= QCOW2_COMPAT_IN_USE;
> +        (void) qcow2_update_header(state->bs);
> +    }
> +
>      qcow2_update_options_commit(state->bs, state->opaque);
>      g_free(state->opaque);
>  }
>  
>  static void qcow2_reopen_abort(BDRVReopenState *state)
>  {
> +    /* We can't fail the abort, so if the header update fails, we may end up
> +     * not protecting the image any more. This is okay, the lock is a
> +     * best-effort service to protect the user from shooting themselves into

s/into/in/

> @@ -1708,6 +1777,16 @@ static int qcow2_inactivate(BlockDriverState *bs)
>          qcow2_mark_clean(bs);
>      }
>  
> +    if (!bs->read_only) {
> +        s->flags |= BDRV_O_INCOMING;
> +        s->compatible_features &= ~QCOW2_COMPAT_IN_USE;
> +        ret = qcow2_update_header(bs);
> +        if (ret < 0) {
> +            result = ret;
> +            error_report("Could not update qcow2 header: %s", strerror(-ret));
> +        }
> +    }
> +
>      return result;
>  }
>  

> +++ b/docs/specs/qcow2.txt
> @@ -96,7 +96,12 @@ in the description of a field.
>                                  marking the image file dirty and postponing
>                                  refcount metadata updates.
>  
> -                    Bits 1-63:  Reserved (set to 0)
> +                    Bit 1:      Locking bit. If this bit is set, then the
> +                                image is supposedly in use by some process and
> +                                shouldn't be opened read-write by another
> +                                process.

...[1] I'm wondering if we should add a new optional extension header
here, which records the hostname+pid (and maybe also the argv[0]) of the
process that set this bit.

If this bit is clear, the extension header can be ignored/deleted as
useless (or more likely, rewritten the moment we open the file
read-write because we'll be setting the bit again); if this bit is set
but the extension header is not present, we have no further information
to report beyond "file is locked".  But if this bit is set and the
extension header is present, then we can attempt to tell the user more
details about who last claimed to be writing to the file (which may help
the user decide if it really still is in use, or if the lock is left
over in error due to an abrupt exit).  That also implies that the act of
setting this bit should also default to adding the new extension header,
populated with useful information.
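
To make the idea a bit more concrete, here is a rough sketch of what
such a record and the resulting message could look like.  Everything
in it is hypothetical: the struct layout, the field sizes and the
function names are mine, nothing like this exists in the qcow2 spec,
and a real on-disk extension would also need a fixed byte order and a
defined length.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of the lock owner record. */
struct lock_owner {
    uint32_t pid;
    char hostname[64];
    char argv0[64];
};

/* Fill in the record; over-long strings are truncated. */
static void lock_owner_set(struct lock_owner *o, const char *host,
                           uint32_t pid, const char *argv0)
{
    memset(o, 0, sizeof(*o));
    o->pid = pid;
    snprintf(o->hostname, sizeof(o->hostname), "%s", host);
    snprintf(o->argv0, sizeof(o->argv0), "%s", argv0);
}

/* Build the error text; o == NULL models "bit set, extension absent". */
static void lock_describe(const struct lock_owner *o, char *buf, size_t len)
{
    if (!o) {
        snprintf(buf, len, "Image is already in use");
    } else {
        snprintf(buf, len, "Image is already in use by %s (pid %u) on host %s",
                 o->argv0, (unsigned)o->pid, o->hostname);
    }
}
```

A writer would call something like lock_owner_set() with its own
hostname and pid when it sets the bit; a reader that finds the bit set
falls back to the plain message if the record is missing.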

Overall, I like the direction this series is headed in!

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (9 preceding siblings ...)
  2015-12-22 16:46 ` [Qemu-devel] [PATCH 10/10] qcow2: Add image locking Kevin Wolf
@ 2015-12-23  3:14 ` Fam Zheng
  2015-12-23  7:35   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
                     ` (2 more replies)
  2015-12-23 14:57 ` [Qemu-devel] " Vasiliy Tolstov
                   ` (2 subsequent siblings)
  13 siblings, 3 replies; 99+ messages in thread
From: Fam Zheng @ 2015-12-23  3:14 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

On Tue, 12/22 17:46, Kevin Wolf wrote:
> Enough innocent images have died because users called 'qemu-img snapshot' while
> the VM was still running. Educating the users doesn't seem to be a working
> strategy, so this series adds locking to qcow2 that refuses to access the image
> read-write from two processes.
> 
> Eric, this will require a libvirt update to deal with qemu crashes which leave
> locked images behind. The simplest thinkable way would be to unconditionally
> override the lock in libvirt whenever the option is present. In that case,
> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> the other way round. If you want more than that, libvirt would have to check
> somehow if it was its own VM that used the image and left the lock behind. I
> imagine that can't be too hard either.

The motivation is great, but I'm not sure I like the side-effect that an
unclean shutdown will require a "forced" open, because it makes using qcow2 in
development cumbersome, and like you said, management/user also needs to handle
this explicitly. This is a bit of a personal preference, but it's strong enough
that I want to speak up.

As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
and a program crash will automatically drop the lock.
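
For reference, a minimal sketch of the flock(2) behaviour this relies
on. The helper name image_try_lock is hypothetical and is not part of
any patch in this thread. The lock belongs to the open file description,
so the kernel releases it when the last descriptor referring to it is
closed, which includes process exit or a crash.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Try to take an exclusive lock without blocking.
 * Returns 0 on success or a negative errno value on failure
 * (-EWOULDBLOCK if another open file description holds the lock).
 */
static int image_try_lock(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        return -errno;   /* flock() returns -1 and sets errno */
    }
    return 0;
}
```

A second open() of the same file gets its own open file description, so
a second image_try_lock() on it fails until the first descriptor is
closed; no "forced" open is needed after an unclean shutdown.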

Fam

> 
> Also note that this kind of depends on Max's bdrv_close_all() series, but only
> in order to pass test case 142. This is not a bug in this series, but a
> preexisting one (bs->file can be closed before bs), and it becomes apparent
> when qemu fails to unlock an image due to this bug. Max's series fixes this.
> 
> Kevin Wolf (10):
>   qcow2: Write feature table only for v3 images
>   qcow2: Write full header on image creation
>   block: Assert no write requests under BDRV_O_INCOMING
>   block: Fix error path in bdrv_invalidate_cache()
>   block: Inactivate BDS when migration completes
>   qemu-img: Prepare for locked images
>   qcow2: Implement .bdrv_inactivate
>   qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache()
>   qcow2: Make image inaccessible after failed qcow2_invalidate_cache()
>   qcow2: Add image locking
> 
>  block.c                    |  36 +++++++++
>  block/io.c                 |   2 +
>  block/qcow2.c              | 190 +++++++++++++++++++++++++++++++++++----------
>  block/qcow2.h              |   7 +-
>  docs/specs/qcow2.txt       |   7 +-
>  include/block/block.h      |   2 +
>  include/block/block_int.h  |   1 +
>  include/qapi/error.h       |   1 +
>  migration/migration.c      |   7 ++
>  qapi/common.json           |   3 +-
>  qemu-img-cmds.hx           |  10 ++-
>  qemu-img.c                 |  96 +++++++++++++++++++----
>  qemu-img.texi              |  20 ++++-
>  qmp.c                      |  12 +++
>  tests/qemu-iotests/026     |   2 +-
>  tests/qemu-iotests/026.out |  60 ++++++++++++--
>  tests/qemu-iotests/031.out |  23 +++---
>  tests/qemu-iotests/036     |   2 +
>  tests/qemu-iotests/036.out |   7 +-
>  tests/qemu-iotests/039     |   4 +-
>  tests/qemu-iotests/061     |   2 +
>  tests/qemu-iotests/061.out |  43 +++++-----
>  tests/qemu-iotests/071     |   7 ++
>  tests/qemu-iotests/071.out |   4 +
>  tests/qemu-iotests/089     |   2 +-
>  tests/qemu-iotests/089.out |   2 -
>  tests/qemu-iotests/091     |   2 +-
>  tests/qemu-iotests/098     |   2 +-
>  28 files changed, 445 insertions(+), 111 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> 


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23  3:14 ` [Qemu-devel] [PATCH 00/10] qcow2: Implement " Fam Zheng
@ 2015-12-23  7:35   ` Denis V. Lunev
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
  2015-12-23 10:47   ` [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Daniel P. Berrange
  2015-12-23 23:19   ` [Qemu-devel] " Max Reitz
  2 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:35 UTC (permalink / raw)
  To: Fam Zheng, Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

On 12/23/2015 06:14 AM, Fam Zheng wrote:
> On Tue, 12/22 17:46, Kevin Wolf wrote:
>> Enough innocent images have died because users called 'qemu-img snapshot' while
>> the VM was still running. Educating the users doesn't seem to be a working
>> strategy, so this series adds locking to qcow2 that refuses to access the image
>> read-write from two processes.
>>
>> Eric, this will require a libvirt update to deal with qemu crashes which leave
>> locked images behind. The simplest thinkable way would be to unconditionally
>> override the lock in libvirt whenever the option is present. In that case,
>> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
>> the other way round. If you want more than that, libvirt would have to check
>> somehow if it was its own VM that used the image and left the lock behind. I
>> imagine that can't be too hard either.
> The motivation is great, but I'm not sure I like the side-effect that an
> unclean shutdown will require a "forced" open, because it makes using qcow2 in
> development cumbersome, and like you said, management/user also needs to handle
> this explicitly. This is a bit of a personal preference, but it's strong enough
> that I want to speak up.
>
> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> and a program crash will automatically drop the lock.
>
> Fam

Exactly!

I have a patch set in progress for this purpose.

In this case we will be able to recover automatically from a crash
(with the lock taken) by calling bdrv_check.

Den


* [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery
  2015-12-23  7:35   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-12-23  7:46     ` Denis V. Lunev
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 1/5] block: added lock image option and callback Denis V. Lunev
                         ` (5 more replies)
  0 siblings, 6 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:46 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Max Reitz, Olga Krishtal,
	Denis V. Lunev

This series of patches aims to prevent an image file from being
used by several qemu instances at once. If we are the first
instance and the lock option is set to lockfile, we lock the image
file; if the check option is on, we also check the file and repair
it if necessary. If either of these two operations fails, the image
is closed with an error.
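
The intended open sequence can be sketched as follows. This is only an
illustration under stated assumptions: Image, OpenOpts, image_open_rw
and the stub callbacks are hypothetical stand-ins and not the
BlockDriverState code from the patches.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an image; not QEMU's BlockDriverState. */
typedef struct Image {
    bool read_only;
    bool unclean;                  /* previous user did not close cleanly */
    bool open;
    int (*lock)(struct Image *);   /* returns 0 or -errno */
    int (*repair)(struct Image *); /* returns 0 or -errno */
} Image;

typedef struct OpenOpts {
    bool lockfile;                 /* lock=lockfile */
    bool check;                    /* check=on */
} OpenOpts;

/* Example callbacks used to exercise the flow. */
static int lock_ok(Image *img)   { (void)img; return 0; }
static int lock_busy(Image *img) { (void)img; return -EAGAIN; }
static int repair_ok(Image *img) { (void)img; return 0; }

/* Lock first, then check and repair; any failure closes the image. */
static int image_open_rw(Image *img, const OpenOpts *opts)
{
    int ret;

    img->open = true;
    if (img->read_only) {
        return 0;                  /* neither step applies when read-only */
    }
    if (opts->lockfile) {
        ret = img->lock(img);
        if (ret < 0) {
            goto fail;
        }
    }
    if (opts->check && img->unclean) {
        ret = img->repair(img);
        if (ret < 0) {
            goto fail;
        }
        img->unclean = false;
    }
    return 0;

fail:
    img->open = false;             /* the image is closed with the error */
    return ret;
}
```

Taking the lock before the repair step matters: only the instance that
owns the image should ever rewrite its metadata.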

Patchset is not polished at all! Sent for a discussion as an alternative
approach.

Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Fam Zheng <famz@redhat.com>

Olga Krishtal (5):
  block: added lock image option and callback
  block: implemented bdrv_lock_image for raw file
  block: added check image option and callback bdrv_is_opened_unclean
  qcow2: implemented bdrv_is_opened_unclean
  block/paralels: added paralles implementation for
    bdrv_is_opened_unclean

 block.c                   | 73 +++++++++++++++++++++++++++++++++++++++++++++++
 block/parallels.c         |  7 ++++-
 block/qcow2.c             | 11 ++++++-
 block/qcow2.h             |  1 +
 block/raw-posix.c         | 15 ++++++++++
 block/raw-win32.c         | 19 ++++++++++++
 include/block/block.h     |  2 ++
 include/block/block_int.h |  2 ++
 qapi/block-core.json      |  9 ++++++
 9 files changed, 137 insertions(+), 2 deletions(-)

-- 
2.1.4


* [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
@ 2015-12-23  7:46       ` Denis V. Lunev
  2015-12-23 23:48         ` Eric Blake
  2016-01-11 17:31         ` Kevin Wolf
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 2/5] block: implemented bdrv_lock_image for raw file Denis V. Lunev
                         ` (4 subsequent siblings)
  5 siblings, 2 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:46 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Max Reitz, Olga Krishtal,
	Denis V. Lunev

From: Olga Krishtal <okrishtal@virtuozzo.com>

While opening the image we want to be sure that we are the
only one working with it; if that is not the case,
opening the image for writing should fail.

There are two modes at the moment: no lock at all, and locking
the image file.

Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Fam Zheng <famz@redhat.com>
---
 block.c                   | 41 +++++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |  1 +
 include/block/block_int.h |  1 +
 qapi/block-core.json      |  9 +++++++++
 4 files changed, 52 insertions(+)

diff --git a/block.c b/block.c
index 411edbf..74228b8 100644
--- a/block.c
+++ b/block.c
@@ -40,6 +40,8 @@
 #include "qemu/timer.h"
 #include "qapi-event.h"
 #include "block/throttle-groups.h"
+#include "qapi-types.h"
+#include "qapi/util.h"
 
 #ifdef CONFIG_BSD
 #include <sys/types.h>
@@ -895,6 +897,12 @@ static QemuOptsList bdrv_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "Ignore flush requests",
         },
+        {
+            .name = "lock",
+            .type = QEMU_OPT_STRING,
+            .help = "How to lock the image: none or lockfile",
+            .def_value_str = "none",
+        },
         { /* end of list */ }
     },
 };
@@ -914,6 +922,8 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file,
     QemuOpts *opts;
     BlockDriver *drv;
     Error *local_err = NULL;
+    const char *lock_image;
+    int lock;
 
     assert(bs->file == NULL);
     assert(options != NULL && bs->options != options);
@@ -1020,6 +1030,18 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file,
         goto free_and_fail;
     }
 
+    lock_image = qemu_opt_get(opts, "lock");
+    lock = qapi_enum_parse(BdrvLockImage_lookup, lock_image,
+                BDRV_LOCK_IMAGE__MAX, BDRV_LOCK_IMAGE_NONE, &local_err);
+    if (!bs->read_only && lock != BDRV_LOCK_IMAGE_NONE) {
+        ret = bdrv_lock_image(bs, lock);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Could not lock the image");
+            bdrv_close(bs);
+            goto fail_opts;
+        }
+    }
+
     if (bs->encrypted) {
         error_report("Encrypted images are deprecated");
         error_printf("Support for them will be removed in a future release.\n"
@@ -4320,3 +4342,22 @@ void bdrv_refresh_filename(BlockDriverState *bs)
         QDECREF(json);
     }
 }
+
+int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock)
+{
+    BlockDriver *drv = bs->drv;
+    if (lock != BDRV_LOCK_IMAGE_LOCKFILE) {
+        return -EOPNOTSUPP;
+    }
+    if (drv != NULL && drv->bdrv_lock_image != NULL) {
+        return bs->drv->bdrv_lock_image(bs, lock);
+    }
+    if (bs->file == NULL) {
+        return -EOPNOTSUPP;
+    }
+    drv = bs->file->bs->drv;
+    if (drv != NULL && drv->bdrv_lock_image != NULL) {
+        return drv->bdrv_lock_image(bs->file->bs, lock);
+    }
+    return -EOPNOTSUPP;
+}
diff --git a/include/block/block.h b/include/block/block.h
index db8e096..27fc434 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -506,6 +506,7 @@ void bdrv_reset_dirty_bitmap(BdrvDirtyBitmap *bitmap,
 void bdrv_dirty_iter_init(BdrvDirtyBitmap *bitmap, struct HBitmapIter *hbi);
 void bdrv_set_dirty_iter(struct HBitmapIter *hbi, int64_t offset);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
+int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock);
 
 void bdrv_enable_copy_on_read(BlockDriverState *bs);
 void bdrv_disable_copy_on_read(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 256609d..755f342 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -135,6 +135,7 @@ struct BlockDriver {
     int (*bdrv_create)(const char *filename, QemuOpts *opts, Error **errp);
     int (*bdrv_set_key)(BlockDriverState *bs, const char *key);
     int (*bdrv_make_empty)(BlockDriverState *bs);
+    int (*bdrv_lock_image)(BlockDriverState *bs, BdrvLockImage lock);
 
     void (*bdrv_refresh_filename)(BlockDriverState *bs, QDict *options);
 
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1a5d9ce..b82589b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2408,3 +2408,12 @@
 ##
 { 'command': 'block-set-write-threshold',
   'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
+
+##
+# @BdrvLockImage:
+# An enumeration of lock types used to lock the image file.
+# @none: do not lock the image file
+# @lockfile: lock the image file itself
+# Since: 2.6
+##
+{ 'enum': 'BdrvLockImage', 'data':['none', 'lockfile']}
-- 
2.1.4


* [Qemu-devel] [PATCH 2/5] block: implemented bdrv_lock_image for raw file
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 1/5] block: added lock image option and callback Denis V. Lunev
@ 2015-12-23  7:46       ` Denis V. Lunev
  2015-12-23 12:40         ` Daniel P. Berrange
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean Denis V. Lunev
                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:46 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Max Reitz, Olga Krishtal,
	Denis V. Lunev

From: Olga Krishtal <okrishtal@virtuozzo.com>

To lock the image file, flock() (LockFileEx() on Windows) is used.
We lock the file descriptor/handle. If locking fails,
an error is returned.

In the win32 implementation we can only lock a region of bytes within the
file. For this reason we first have to get the file size and only then lock it.

Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 15 +++++++++++++++
 block/raw-win32.c | 19 +++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 076d070..6226a5c 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -33,6 +33,7 @@
 #include "raw-aio.h"
 #include "qapi/util.h"
 #include "qapi/qmp/qstring.h"
+#include <sys/file.h>
 
 #if defined(__APPLE__) && (__MACH__)
 #include <paths.h>
@@ -576,6 +577,19 @@ fail:
     return ret;
 }
 
+static int raw_lock_image(BlockDriverState *bs, BdrvLockImage lock)
+{
+    int ret;
+    if (lock != BDRV_LOCK_IMAGE_LOCKFILE) {
+        return -ENOTSUP;
+    }
+    ret = flock(((BDRVRawState *)(bs->opaque))->fd, LOCK_EX|LOCK_NB);
+    if (ret != 0) {
+        return -errno;
+    }
+    return ret;
+}
+
 static int raw_open(BlockDriverState *bs, QDict *options, int flags,
                     Error **errp)
 {
@@ -1946,6 +1960,7 @@ BlockDriver bdrv_file = {
     .bdrv_co_get_block_status = raw_co_get_block_status,
     .bdrv_co_write_zeroes = raw_co_write_zeroes,
 
+    .bdrv_lock_image = raw_lock_image,
     .bdrv_aio_readv = raw_aio_readv,
     .bdrv_aio_writev = raw_aio_writev,
     .bdrv_aio_flush = raw_aio_flush,
diff --git a/block/raw-win32.c b/block/raw-win32.c
index 2d0907a..d05160a 100644
--- a/block/raw-win32.c
+++ b/block/raw-win32.c
@@ -370,6 +370,24 @@ fail:
     return ret;
 }
 
+static int raw_lock_image(BlockDriverState *bs, BdrvLockImage lock)
+{
+    DWORD size_high = 0, size_low = 0;
+    BDRVRawState *s = bs->opaque;
+    if (lock != BDRV_LOCK_IMAGE_LOCKFILE) {
+        return -ENOTSUP;
+    }
+    size_low = GetFileSize(s->hfile, &size_high);
+    if (size_low == INVALID_FILE_SIZE && GetLastError() != NO_ERROR) {
+        return -EINVAL;
+    }
+    if (!LockFileEx(s->hfile, LOCKFILE_EXCLUSIVE_LOCK|LOCKFILE_FAIL_IMMEDIATELY,
+                    0, size_low, size_high, NULL)) {
+        return -EINVAL;
+    }
+    }
+    return 0;
+}
+
 static BlockAIOCB *raw_aio_readv(BlockDriverState *bs,
                          int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
                          BlockCompletionFunc *cb, void *opaque)
@@ -552,6 +570,7 @@ BlockDriver bdrv_file = {
     .bdrv_create        = raw_create,
     .bdrv_has_zero_init = bdrv_has_zero_init_1,
 
+    .bdrv_lock_image    = raw_lock_image,
     .bdrv_aio_readv     = raw_aio_readv,
     .bdrv_aio_writev    = raw_aio_writev,
     .bdrv_aio_flush     = raw_aio_flush,
-- 
2.1.4


* [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 1/5] block: added lock image option and callback Denis V. Lunev
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 2/5] block: implemented bdrv_lock_image for raw file Denis V. Lunev
@ 2015-12-23  7:46       ` Denis V. Lunev
  2015-12-23  9:09         ` Fam Zheng
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 4/5] qcow2: implemented bdrv_is_opened_unclean Denis V. Lunev
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:46 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Max Reitz, Olga Krishtal,
	Denis V. Lunev

From: Olga Krishtal <okrishtal@virtuozzo.com>

If the image is opened for writing and was not closed correctly
(the image is dirty), we have to check and repair it. By default
the option is off.

bdrv_is_opened_unclean checks whether the image is dirty.
This callback will be used to ensure that the image was
closed correctly and, if not, to check and repair it.

Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Fam Zheng <famz@redhat.com>
---
 block.c                   | 32 ++++++++++++++++++++++++++++++++
 include/block/block.h     |  1 +
 include/block/block_int.h |  1 +
 3 files changed, 34 insertions(+)

diff --git a/block.c b/block.c
index 74228b8..1f704f5 100644
--- a/block.c
+++ b/block.c
@@ -903,6 +903,12 @@ static QemuOptsList bdrv_runtime_opts = {
             .help = "How to lock the image: none or lockfile",
             .def_value_str = "none",
         },
+        {
+            .name = "check",
+            .type = QEMU_OPT_BOOL,
+            .help = "Check and repair the image if it is unclean",
+            .def_value_str = "off",
+        },
         { /* end of list */ }
     },
 };
@@ -1042,6 +1048,16 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file,
         }
     }
 
+    if (!bs->read_only && qemu_opt_get_bool_del(opts, "check", false) &&
+                bdrv_is_opened_unclean(bs) && !(flags & BDRV_O_CHECK)) {
+        BdrvCheckResult result = {0};
+        ret = bdrv_check(bs, &result, BDRV_FIX_ERRORS);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Could not repair dirty image");
+            bdrv_close(bs);
+            goto fail_opts;
+        }
+    }
     if (bs->encrypted) {
         error_report("Encrypted images are deprecated");
         error_printf("Support for them will be removed in a future release.\n"
@@ -4361,3 +4377,19 @@ int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock)
     }
     return -EOPNOTSUPP;
 }
+
+bool bdrv_is_opened_unclean(BlockDriverState *bs)
+{
+    BlockDriver *drv = bs->drv;
+    if (drv != NULL && drv->bdrv_is_opened_unclean != NULL) {
+        return drv->bdrv_is_opened_unclean(bs);
+    }
+    if (bs->file == NULL) {
+        return false;
+    }
+    drv = bs->file->bs->drv;
+    if (drv != NULL && drv->bdrv_is_opened_unclean != NULL) {
+        return drv->bdrv_is_opened_unclean(bs->file->bs);
+    }
+    return false;
+}
diff --git a/include/block/block.h b/include/block/block.h
index 27fc434..c366990 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -507,6 +507,7 @@ void bdrv_dirty_iter_init(BdrvDirtyBitmap *bitmap, struct HBitmapIter *hbi);
 void bdrv_set_dirty_iter(struct HBitmapIter *hbi, int64_t offset);
 int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
 int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock);
+bool bdrv_is_opened_unclean(BlockDriverState *bs);
 
 void bdrv_enable_copy_on_read(BlockDriverState *bs);
 void bdrv_disable_copy_on_read(BlockDriverState *bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 755f342..fc3f4a6 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -137,6 +137,7 @@ struct BlockDriver {
     int (*bdrv_make_empty)(BlockDriverState *bs);
     int (*bdrv_lock_image)(BlockDriverState *bs, BdrvLockImage lock);
 
+    bool (*bdrv_is_opened_unclean)(BlockDriverState *bs);
     void (*bdrv_refresh_filename)(BlockDriverState *bs, QDict *options);
 
     /* aio */
-- 
2.1.4


* [Qemu-devel] [PATCH 4/5] qcow2: implemented bdrv_is_opened_unclean
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
                         ` (2 preceding siblings ...)
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean Denis V. Lunev
@ 2015-12-23  7:46       ` Denis V. Lunev
  2016-01-11 17:37         ` Kevin Wolf
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 5/5] block/paralels: added paralles implementation for bdrv_is_opened_unclean Denis V. Lunev
  2015-12-23  8:09       ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Fam Zheng
  5 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:46 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Max Reitz, Olga Krishtal,
	Denis V. Lunev

From: Olga Krishtal <okrishtal@virtuozzo.com>

While opening the image we save the dirty state in header_unclean.
If the image was closed incorrectly, we can retrieve this fact
using the bdrv_is_opened_unclean callback.

This is necessary when we want to call bdrv_check to
repair a dirty image.

Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Fam Zheng <famz@redhat.com>
---
 block/qcow2.c | 11 ++++++++++-
 block/qcow2.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 1789af4..de3b97f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -817,6 +817,11 @@ static int qcow2_update_options(BlockDriverState *bs, QDict *options,
     return ret;
 }
 
+static bool qcow2_is_opened_unclean(BlockDriverState *bs)
+{
+    return ((BDRVQcow2State *)(bs->opaque))->header_unclean;
+}
+
 static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
                       Error **errp)
 {
@@ -1156,7 +1161,6 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
     if (!(flags & (BDRV_O_CHECK | BDRV_O_INCOMING)) && !bs->read_only &&
         (s->incompatible_features & QCOW2_INCOMPAT_DIRTY)) {
         BdrvCheckResult result = {0};
-
         ret = qcow2_check(bs, &result, BDRV_FIX_ERRORS | BDRV_FIX_LEAKS);
         if (ret < 0) {
             error_setg_errno(errp, -ret, "Could not repair dirty image");
@@ -1170,6 +1174,9 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags,
         qcow2_check_refcounts(bs, &result, 0);
     }
 #endif
+    if (flags & BDRV_O_RDWR) {
+        s->header_unclean = true;
+    }
     return ret;
 
  fail:
@@ -1691,6 +1698,7 @@ static void qcow2_close(BlockDriverState *bs)
     qemu_vfree(s->l1_table);
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
+    s->header_unclean = false;
 
     if (!(bs->open_flags & BDRV_O_INCOMING)) {
         int ret1, ret2;
@@ -3305,6 +3313,7 @@ BlockDriver bdrv_qcow2 = {
     .bdrv_co_get_block_status = qcow2_co_get_block_status,
     .bdrv_set_key       = qcow2_set_key,
 
+    .bdrv_is_opened_unclean  = qcow2_is_opened_unclean,
     .bdrv_co_readv          = qcow2_co_readv,
     .bdrv_co_writev         = qcow2_co_writev,
     .bdrv_co_flush_to_os    = qcow2_co_flush_to_os,
diff --git a/block/qcow2.h b/block/qcow2.h
index a063a3c..c743d66 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -278,6 +278,7 @@ typedef struct BDRVQcow2State {
     int overlap_check; /* bitmask of Qcow2MetadataOverlap values */
     bool signaled_corruption;
 
+    bool header_unclean;
     uint64_t incompatible_features;
     uint64_t compatible_features;
     uint64_t autoclear_features;
-- 
2.1.4


* [Qemu-devel] [PATCH 5/5] block/paralels: added paralles implementation for bdrv_is_opened_unclean
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
                         ` (3 preceding siblings ...)
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 4/5] qcow2: implemented bdrv_is_opened_unclean Denis V. Lunev
@ 2015-12-23  7:46       ` Denis V. Lunev
  2015-12-23  8:09       ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Fam Zheng
  5 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  7:46 UTC (permalink / raw)
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Max Reitz, Olga Krishtal,
	Denis V. Lunev

From: Olga Krishtal <okrishtal@virtuozzo.com>

Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Fam Zheng <famz@redhat.com>
---
 block/parallels.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/parallels.c b/block/parallels.c
index e4a56a5..618b609 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -359,6 +359,10 @@ static coroutine_fn int parallels_co_readv(BlockDriverState *bs,
     return ret;
 }
 
+static bool parallels_is_opened_unclean(BlockDriverState *bs)
+{
+    return ((BDRVParallelsState *)(bs->opaque))->header_unclean;
+}
 
 static int parallels_check(BlockDriverState *bs, BdrvCheckResult *res,
                            BdrvCheckMode fix)
@@ -376,7 +380,7 @@ static int parallels_check(BlockDriverState *bs, BdrvCheckResult *res,
         return size;
     }
 
-    if (s->header_unclean) {
+    if (parallels_is_opened_unclean(bs)) {
         fprintf(stderr, "%s image was not closed correctly\n",
                 fix & BDRV_FIX_ERRORS ? "Repairing" : "ERROR");
         res->corruptions++;
@@ -743,6 +747,7 @@ static BlockDriver bdrv_parallels = {
     .bdrv_close		= parallels_close,
     .bdrv_co_get_block_status = parallels_co_get_block_status,
     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
+    .bdrv_is_opened_unclean        = parallels_is_opened_unclean,
     .bdrv_co_flush_to_os      = parallels_co_flush_to_os,
     .bdrv_co_readv  = parallels_co_readv,
     .bdrv_co_writev = parallels_co_writev,
-- 
2.1.4


* Re: [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery
  2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
                         ` (4 preceding siblings ...)
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 5/5] block/paralels: added paralles implementation for bdrv_is_opened_unclean Denis V. Lunev
@ 2015-12-23  8:09       ` Fam Zheng
  2015-12-23  8:36         ` Denis V. Lunev
  5 siblings, 1 reply; 99+ messages in thread
From: Fam Zheng @ 2015-12-23  8:09 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, Olga Krishtal, qemu-devel, Max Reitz

On Wed, 12/23 10:46, Denis V. Lunev wrote:
> This series of patches aims to prevent an image file from being
> used by several qemu instances at once. If we are the first
> instance and the lock option is set to lockfile, we lock the image
> file; if the check option is on, we also check the file and repair
> it if necessary. If either of these two operations fails, the image
> is closed with an error.
> 
> Patchset is not polished at all! Sent for a discussion as an alternative
> approach.

I like this approach. The first two patches match what I was thinking of.

Patch 5 is okay: the unclean flag reflects HEADER_INUSE_MAGIC in the parallels
header. Unfortunately, patch 4 is wrong because qcow2 lacks a counterpart flag
in the format, and the patch only modifies an in-memory variable. We have to
add this as a compatible_features bit in order to support this operation.
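
A minimal sketch of what persisting such a flag could look like,
assuming a raw copy of a version 3 qcow2 header in memory, where
compatible_features is a big-endian 64-bit field at bytes 80-87. The
bit number and all names below are hypothetical; no such bit is
allocated in the spec.

```c
#include <stdbool.h>
#include <stdint.h>

#define QCOW2_COMPAT_FEATURES_OFFSET 80          /* bytes 80-87, v3 header */
#define QCOW2_COMPAT_UNCLEAN_SKETCH (1ULL << 1)  /* hypothetical bit */

/* Read a big-endian 64-bit value. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Write a big-endian 64-bit value. */
static void store_be64(uint8_t *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (uint8_t)v;
        v >>= 8;
    }
}

/* Set or clear the flag in the header buffer, keeping all other bits. */
static void header_set_unclean(uint8_t *header, bool unclean)
{
    uint8_t *field = header + QCOW2_COMPAT_FEATURES_OFFSET;
    uint64_t features = load_be64(field);

    if (unclean) {
        features |= QCOW2_COMPAT_UNCLEAN_SKETCH;
    } else {
        features &= ~QCOW2_COMPAT_UNCLEAN_SKETCH;
    }
    store_be64(field, features);
}

/* Report whether the flag is set in the header buffer. */
static bool header_is_unclean(const uint8_t *header)
{
    uint64_t features = load_be64(header + QCOW2_COMPAT_FEATURES_OFFSET);
    return (features & QCOW2_COMPAT_UNCLEAN_SKETCH) != 0;
}
```

The updated header would still have to be written back to the image and
flushed before the first guest write; otherwise the flag is no more
persistent than the in-memory variable in patch 4.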

Didn't review very closely because at least one patch doesn't seem to compile.
:)

Fam

> 
> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> 
> Olga Krishtal (5):
>   block: added lock image option and callback
>   block: implemented bdrv_lock_image for raw file
>   block: added check image option and callback bdrv_is_opened_unclean
>   qcow2: implemented bdrv_is_opened_unclean
>   block/paralels: added paralles implementation for
>     bdrv_is_opened_unclean
> 
>  block.c                   | 73 +++++++++++++++++++++++++++++++++++++++++++++++
>  block/parallels.c         |  7 ++++-
>  block/qcow2.c             | 11 ++++++-
>  block/qcow2.h             |  1 +
>  block/raw-posix.c         | 15 ++++++++++
>  block/raw-win32.c         | 19 ++++++++++++
>  include/block/block.h     |  2 ++
>  include/block/block_int.h |  2 ++
>  qapi/block-core.json      |  9 ++++++
>  9 files changed, 137 insertions(+), 2 deletions(-)
> 
> -- 
> 2.1.4
> 


* Re: [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery
  2015-12-23  8:09       ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Fam Zheng
@ 2015-12-23  8:36         ` Denis V. Lunev
  0 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  8:36 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Kevin Wolf, Olga Krishtal, qemu-devel, Max Reitz

On 12/23/2015 11:09 AM, Fam Zheng wrote:
> On Wed, 12/23 10:46, Denis V. Lunev wrote:
>> This series of patches aims to prevent an image file from being
>> used by several qemu instances at once. If we are the first
>> instance and the lock option is set to lockfile, we lock the image
>> file; if the check option is on, we also check the file and repair
>> it if necessary. If either of these two operations fails, the image
>> is closed with an error.
>>
>> Patchset is not polished at all! Sent for a discussion as an alternative
>> approach.
> I like this approach. The first two patches match what I was thinking of.
>
> Patch 5 is okay, the unclean flag reflects HEADER_INUSE_MAGIC in parallels
> header; unfortunately patch 4 is wrong because qcow2 lacks a counterpart flag
> in the format, and the patch only modified an in memory variable.  we have to
> add this as a compatible_features bit in order to support this operation.
yep :) this could be done, no problem.

> Didn't review very closely because at least one patch doesn't seem to compile.
> :)
>
> Fam
>
Which compile error do you see?

I have double-checked that it compiles OK on

commit 5dc42c186d63b7b338594fc071cf290805dcc5a5
Merge: c595b21 7236975
Author: Peter Maydell <peter.maydell@linaro.org>
Date:   Tue Dec 22 14:21:42 2015 +0000

     Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging

     # gpg: Signature made Tue 22 Dec 2015 08:52:55 GMT using RSA key ID 81AB73C8
     # gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
     # gpg:                 aka "Stefan Hajnoczi <stefanha@gmail.com>"

     * remotes/stefanha/tags/block-pull-request:
       sdhci: add optional quirk property to disable card insertion/removal interrupts
       sdhci: don't raise a command index error for an unexpected response
       sd: sdhci: Delete over-zealous power check
       scripts/gdb: Fix a python exception in mtree.py
       parallels: add format spec
       block/mirror: replace IOV_MAX with blk_get_max_iov()
       block: replace IOV_MAX with BlockLimits.max_iov
       block-backend: add blk_get_max_iov()
       block: add BlockLimits.max_iov field
       virtio-blk: trivial code optimization

     Signed-off-by: Peter Maydell <peter.maydell@linaro.org>

Den

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean Denis V. Lunev
@ 2015-12-23  9:09         ` Fam Zheng
  2015-12-23  9:14           ` Denis V. Lunev
  0 siblings, 1 reply; 99+ messages in thread
From: Fam Zheng @ 2015-12-23  9:09 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, Olga Krishtal, qemu-devel, Max Reitz

On Wed, 12/23 10:46, Denis V. Lunev wrote:
> From: Olga Krishtal <okrishtal@virtuozzo.com>
> 
> If an image is opened for writing and it was not closed correctly
> (the image is dirty), we have to check and repair it. By default
> the option is off.
> 
> bdrv_is_opened_unclean - checks if the image is dirty.
> This callback will be used to ensure that the image was
> closed correctly, and if not, to check and repair it.
> 
> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> ---
>  block.c                   | 32 ++++++++++++++++++++++++++++++++
>  include/block/block.h     |  1 +
>  include/block/block_int.h |  1 +
>  3 files changed, 34 insertions(+)
> 
> diff --git a/block.c b/block.c
> index 74228b8..1f704f5 100644
> --- a/block.c
> +++ b/block.c
> @@ -903,6 +903,12 @@ static QemuOptsList bdrv_runtime_opts = {
>              .help = "How to lock the image: none or lockfile",
>              .def_value_str = "none",
>          },
> +        {
> +            .name = "check",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "Check and repair the image if it is unclean",
> +            .def_value_str = "off",
> +        },
>          { /* end of list */ }
>      },
>  };
> @@ -1042,6 +1048,16 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file,
>          }
>      }
>  
> +    if (!bs->read_only && qemu_opt_get_bool_del(opts, "check", true) &&
> +                bdrv_is_opened_unclean(bs) && !(flags | BDRV_O_CHECK)) {
> +        BdrvCheckResult result = {0};
> +        ret = bdrv_check(bs, &result, BDRV_FIX_ERRORS);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Could not repair dirty image");
> +            bdrv_close(bs);
> +            goto fail_opts;
> +        }
> +    }
>      if (bs->encrypted) {
>          error_report("Encrypted images are deprecated");
>          error_printf("Support for them will be removed in a future release.\n"
> @@ -4361,3 +4377,19 @@ int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock)
>      }
>      return -EOPNOTSUPP;
>  }
> +
> +bool bdrv_is_opened_unclean(BlockDriverState *bs)
> +{
> +    BlockDriver *drv = bs->drv;
> +    if (drv != NULL && drv->bdrv_is_opened_unclean != NULL) {
> +        return drv->bdrv_is_opened_unclean(bs);
> +    }
> +    if (bs->file == NULL) {
> +        return false;
> +    }
> +    drv = bs->file->bs->drv;
> +    if (drv != NULL && drv->bdrv_is_opened_unclean != NULL) {
> +        return drv->bdrv_is_opened_unclean;

Should this be

           return drv->bdrv_is_opened_unclean(bs);

?

(well, I may be wrong in saying this doesn't compile :)

Fam

> +    }
> +    return false;
> +}
> diff --git a/include/block/block.h b/include/block/block.h
> index 27fc434..c366990 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -507,6 +507,7 @@ void bdrv_dirty_iter_init(BdrvDirtyBitmap *bitmap, struct HBitmapIter *hbi);
>  void bdrv_set_dirty_iter(struct HBitmapIter *hbi, int64_t offset);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock);
> +bool bdrv_is_opened_unclean(BlockDriverState *bs);
>  
>  void bdrv_enable_copy_on_read(BlockDriverState *bs);
>  void bdrv_disable_copy_on_read(BlockDriverState *bs);
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 755f342..fc3f4a6 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -137,6 +137,7 @@ struct BlockDriver {
>      int (*bdrv_make_empty)(BlockDriverState *bs);
>      int (*bdrv_lock_image)(BlockDriverState *bs, BdrvLockImage lock);
>  
> +    bool (*bdrv_is_opened_unclean)(BlockDriverState *bs);
>      void (*bdrv_refresh_filename)(BlockDriverState *bs, QDict *options);
>  
>      /* aio */
> -- 
> 2.1.4
> 
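
The point raised in this review can be reproduced in isolation. The following
is only a sketch with hypothetical stand-in types (Node, Driver and the two
drivers are not QEMU names): it shows why returning the callback pointer still
compiles, since a non-NULL pointer converts to true, and what changes once the
call is restored. Passing the child node to the child's callback is an
assumption of this sketch; the review above suggests (bs).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for BlockDriverState/BlockDriver. */
typedef struct Node Node;

typedef struct Driver {
    bool (*is_opened_unclean)(Node *bs);
} Driver;

struct Node {
    const Driver *drv;
    Node *file;   /* child node (bs->file->bs in the patch), or NULL */
    bool dirty;   /* what the driver callback reports in this sketch */
};

/* A driver callback that simply reports the node's dirty flag. */
bool report_dirty(Node *bs)
{
    assert(bs != NULL);
    return bs->dirty;
}

const Driver format_driver = { NULL };            /* no callback: fall through */
const Driver protocol_driver = { report_dirty };  /* child implements it */

/* As posted: the fallback returns the function pointer itself. */
bool is_unclean_as_posted(Node *bs)
{
    const Driver *drv = bs->drv;
    if (drv != NULL && drv->is_opened_unclean != NULL) {
        return drv->is_opened_unclean(bs);
    }
    if (bs->file == NULL) {
        return false;
    }
    drv = bs->file->drv;
    if (drv != NULL && drv->is_opened_unclean != NULL) {
        return drv->is_opened_unclean;  /* non-NULL pointer converts to true */
    }
    return false;
}

/* With the call restored, the child's callback decides the result. */
bool is_unclean_fixed(Node *bs)
{
    const Driver *drv = bs->drv;
    if (drv != NULL && drv->is_opened_unclean != NULL) {
        return drv->is_opened_unclean(bs);
    }
    if (bs->file == NULL) {
        return false;
    }
    drv = bs->file->drv;
    if (drv != NULL && drv->is_opened_unclean != NULL) {
        return drv->is_opened_unclean(bs->file);
    }
    return false;
}
```

With a clean child node whose driver implements the callback, the as-posted
variant reports "unclean" regardless of the image state, while the fixed
variant follows the child's dirty flag.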

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean
  2015-12-23  9:09         ` Fam Zheng
@ 2015-12-23  9:14           ` Denis V. Lunev
  0 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23  9:14 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Kevin Wolf, Olga Krishtal, qemu-devel, Max Reitz

On 12/23/2015 12:09 PM, Fam Zheng wrote:
> On Wed, 12/23 10:46, Denis V. Lunev wrote:
>> From: Olga Krishtal <okrishtal@virtuozzo.com>
>>
>> If an image is opened for writing and it was not closed correctly
>> (the image is dirty), we have to check and repair it. By default
>> the option is off.
>>
>> bdrv_is_opened_unclean - checks if the image is dirty.
>> This callback will be used to ensure that the image was
>> closed correctly, and if not, to check and repair it.
>>
>> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
>> ---
>>   block.c                   | 32 ++++++++++++++++++++++++++++++++
>>   include/block/block.h     |  1 +
>>   include/block/block_int.h |  1 +
>>   3 files changed, 34 insertions(+)
>>
>> diff --git a/block.c b/block.c
>> index 74228b8..1f704f5 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -903,6 +903,12 @@ static QemuOptsList bdrv_runtime_opts = {
>>               .help = "How to lock the image: none or lockfile",
>>               .def_value_str = "none",
>>           },
>> +        {
>> +            .name = "check",
>> +            .type = QEMU_OPT_BOOL,
>> +            .help = "Check and repair the image if it is unclean",
>> +            .def_value_str = "off",
>> +        },
>>           { /* end of list */ }
>>       },
>>   };
>> @@ -1042,6 +1048,16 @@ static int bdrv_open_common(BlockDriverState *bs, BdrvChild *file,
>>           }
>>       }
>>   
>> +    if (!bs->read_only && qemu_opt_get_bool_del(opts, "check", true) &&
>> +                bdrv_is_opened_unclean(bs) && !(flags | BDRV_O_CHECK)) {
>> +        BdrvCheckResult result = {0};
>> +        ret = bdrv_check(bs, &result, BDRV_FIX_ERRORS);
>> +        if (ret < 0) {
>> +            error_setg_errno(errp, -ret, "Could not repair dirty image");
>> +            bdrv_close(bs);
>> +            goto fail_opts;
>> +        }
>> +    }
>>       if (bs->encrypted) {
>>           error_report("Encrypted images are deprecated");
>>           error_printf("Support for them will be removed in a future release.\n"
>> @@ -4361,3 +4377,19 @@ int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock)
>>       }
>>       return -EOPNOTSUPP;
>>   }
>> +
>> +bool bdrv_is_opened_unclean(BlockDriverState *bs)
>> +{
>> +    BlockDriver *drv = bs->drv;
>> +    if (drv != NULL && drv->bdrv_is_opened_unclean != NULL) {
>> +        return drv->bdrv_is_opened_unclean(bs);
>> +    }
>> +    if (bs->file == NULL) {
>> +        return false;
>> +    }
>> +    drv = bs->file->bs->drv;
>> +    if (drv != NULL && drv->bdrv_is_opened_unclean != NULL) {
>> +        return drv->bdrv_is_opened_unclean;
> Should this be
>
>             return drv->bdrv_is_opened_unclean(bs);
>
> ?
>
> (well, I may be wrong in saying this doesn't compile :)
>
> Fam

exactly :)

>> +    }
>> +    return false;
>> +}
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 27fc434..c366990 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -507,6 +507,7 @@ void bdrv_dirty_iter_init(BdrvDirtyBitmap *bitmap, struct HBitmapIter *hbi);
>>   void bdrv_set_dirty_iter(struct HBitmapIter *hbi, int64_t offset);
>>   int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>>   int bdrv_lock_image(BlockDriverState *bs, BdrvLockImage lock);
>> +bool bdrv_is_opened_unclean(BlockDriverState *bs);
>>   
>>   void bdrv_enable_copy_on_read(BlockDriverState *bs);
>>   void bdrv_disable_copy_on_read(BlockDriverState *bs);
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index 755f342..fc3f4a6 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -137,6 +137,7 @@ struct BlockDriver {
>>       int (*bdrv_make_empty)(BlockDriverState *bs);
>>       int (*bdrv_lock_image)(BlockDriverState *bs, BdrvLockImage lock);
>>   
>> +    bool (*bdrv_is_opened_unclean)(BlockDriverState *bs);
>>       void (*bdrv_refresh_filename)(BlockDriverState *bs, QDict *options);
>>   
>>       /* aio */
>> -- 
>> 2.1.4
>>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23  3:14 ` [Qemu-devel] [PATCH 00/10] qcow2: Implement " Fam Zheng
  2015-12-23  7:35   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-12-23 10:47   ` Daniel P. Berrange
  2015-12-23 12:15     ` [Qemu-devel] [Qemu-block] " Roman Kagan
  2016-01-11 17:14     ` [Qemu-devel] " Kevin Wolf
  2015-12-23 23:19   ` [Qemu-devel] " Max Reitz
  2 siblings, 2 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-23 10:47 UTC (permalink / raw)
  To: Fam Zheng; +Cc: Kevin Wolf, qemu-devel, qemu-block, mreitz

On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> On Tue, 12/22 17:46, Kevin Wolf wrote:
> > Enough innocent images have died because users called 'qemu-img snapshot' while
> > the VM was still running. Educating the users doesn't seem to be a working
> > strategy, so this series adds locking to qcow2 that refuses to access the image
> > read-write from two processes.
> > 
> > Eric, this will require a libvirt update to deal with qemu crashes which leave
> > locked images behind. The simplest thinkable way would be to unconditionally
> > override the lock in libvirt whenever the option is present. In that case,
> > libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> > the other way round. If you want more than that, libvirt would have to check
> > somehow if it was its own VM that used the image and left the lock behind. I
> > imagine that can't be too hard either.
> 
> The motivation is great, but I'm not sure I like the side-effect that an
> unclean shutdown will require a "forced" open, because it makes using qcow2 in
> development cumbersome, and like you said, management/user also needs to handle
> this explicitly. This is a bit of a personal preference, but it's strong enough
> that I want to speak up.

Yeah, I am also not really a big fan of locking mechanisms which are not
automatically cleaned up on process exit. On the other hand you could
say that people who choose to run qemu-img manually are already taking
fate into their own hands, and ending up with a dirty image on unclean
exit is still miles better than losing all your data.

> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> and a program crash will automatically drop the lock.

FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
out locks using fcntl()/lockf() on all disk images associated with a VM.

This only protects against two QEMU emulators running at the same time
though, and also only if they're using libvirt APIs. So it doesn't
protect if someone runs qemu-img manually, or indeed if libvirt runs
qemu-img, though we could fairly easily address the latter.

A problem with lockf is that it is almost unusable by design: if you
have 2 (or more) file descriptors open against the same file, closing
*any* of them releases all locks on the file, even if the locks were
acquired on a different file descriptor than the one being closed :-(
This is why we put our locking code into a
completely separate process (virtlockd), to guarantee nothing else might
accidentally open/close file descriptors on the same file we had locked.
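
The close() behaviour described above can be observed with a small probe. This
is only a sketch: the function names and the /tmp template are made up for
illustration, and it assumes a Linux host with a local filesystem. It takes a
lock on one descriptor, opens and closes a second descriptor for the same
file, then lets a child process try to take the lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take an exclusive, non-blocking lock on fd; returns 0 on success. */
static int take_lock(int fd, int use_flock)
{
    if (use_flock) {
        return flock(fd, LOCK_EX | LOCK_NB);
    }
    /* l_start = 0 and l_len = 0 cover the whole file */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    return fcntl(fd, F_SETLK, &fl);
}

/*
 * Returns 1 if a lock taken on one descriptor is still held after an
 * unrelated descriptor for the same file was opened and closed, 0 if
 * the lock was dropped, and -1 if the probe itself failed.
 */
int lock_survives_unrelated_close(int use_flock)
{
    char path[] = "/tmp/lock-demo-XXXXXX";
    int result = -1;
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    if (take_lock(fd, use_flock) == 0) {
        int other = open(path, O_RDWR);
        if (other >= 0) {
            close(other);   /* the "accidental" open+close */

            pid_t pid = fork();
            if (pid == 0) {
                /* Child: exit 0 if the lock is free, 1 if refused. */
                int probe = open(path, O_RDWR);
                if (probe < 0) {
                    _exit(2);
                }
                _exit(take_lock(probe, use_flock) == 0 ? 0 : 1);
            }
            int status = 0;
            if (pid > 0 && waitpid(pid, &status, 0) == pid &&
                WIFEXITED(status) && WEXITSTATUS(status) <= 1) {
                result = WEXITSTATUS(status);
            }
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

Under those assumptions the fcntl() variant is expected to report the lock as
dropped, while the flock() variant keeps it, because flock() locks belong to
the open file description rather than to the process.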

A second problem with using flock/lockf() is that on block devices the
locks are only scoped to the local host, so if you have shared block
storage the locks are not all that useful. To deal with this, virtlockd
has the concept of a "lockspace". The default lockspace is associated
directly with the disk files, but alternate lockspaces are possible
which are indirectly associated. For example, we have lockspaces that
are keyed off the SCSI unique volume ID, and the LVM volume UUID, which
can be placed on a shared filesystem. This lets us get cross-host locking
even for block storage. We have a future desire to be able to make use
of storage native locking mechanisms too, such as SCSI reservations.

So while QEMU could add a bdrv_lock() driver method, it will have some
limitations & implementation complexity (ensuring nothing else in QEMU
can ever accidentally open+close the same file that QEMU has locked),
though it could offer better protection than we have with libvirt for
cases where people run qemu-img manually.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 10:47   ` [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Daniel P. Berrange
@ 2015-12-23 12:15     ` Roman Kagan
  2015-12-23 12:29       ` Daniel P. Berrange
  2015-12-23 12:34       ` Daniel P. Berrange
  2016-01-11 17:14     ` [Qemu-devel] " Kevin Wolf
  1 sibling, 2 replies; 99+ messages in thread
From: Roman Kagan @ 2015-12-23 12:15 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: Kevin Wolf, Fam Zheng, qemu-devel, qemu-block, mreitz

On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
> On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> > As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> > similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> > and a program crash will automatically drop the lock.
> 
> FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> out locks using fcntl()/lockf() on all disk images associated with a VM.

Is it even possible without QEMU cooperating?  In particular in complex
cases with e.g. backing chains?

This was exactly the reason why we designed the "lock" option to take an
argument describing the locking mechanism to be used (see the tentative
patchset Denis posted in this thread).  The only one currently
implemented is flock()-based; however it can be extended to other
mechanisms like network / cluster / SAN lock managers, etc.  In
particular, it can be made to talk to virtlockd.
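
As a rough illustration of an option whose argument names the locking
mechanism, here is a table-driven sketch. The names LockMode and
parse_lock_mode are hypothetical and are not taken from the posted patches;
only the option values "none" and "lockfile" come from the patch help text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the patch's "none or lockfile" choices. */
typedef enum LockMode {
    LOCK_MODE_NONE,
    LOCK_MODE_LOCKFILE,
} LockMode;

static const struct {
    const char *name;
    LockMode mode;
} lock_modes[] = {
    { "none",     LOCK_MODE_NONE },
    { "lockfile", LOCK_MODE_LOCKFILE },
    /* a cluster lock manager or virtlockd backend would be added here */
};

/* Maps the option string to a mode; returns 0 or -EINVAL if unknown. */
int parse_lock_mode(const char *value, LockMode *out)
{
    assert(value != NULL && out != NULL);
    for (size_t i = 0; i < sizeof(lock_modes) / sizeof(lock_modes[0]); i++) {
        if (strcmp(value, lock_modes[i].name) == 0) {
            *out = lock_modes[i].mode;
            return 0;
        }
    }
    return -EINVAL;
}
```

Adding another mechanism then means one new enum value, one table row and one
backend implementation, without touching the callers.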

Roman.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 12:15     ` [Qemu-devel] [Qemu-block] " Roman Kagan
@ 2015-12-23 12:29       ` Daniel P. Berrange
  2015-12-23 12:41         ` Denis V. Lunev
  2015-12-23 12:34       ` Daniel P. Berrange
  1 sibling, 1 reply; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-23 12:29 UTC (permalink / raw)
  To: Roman Kagan, Fam Zheng, Kevin Wolf, qemu-devel, qemu-block, mreitz

On Wed, Dec 23, 2015 at 03:15:50PM +0300, Roman Kagan wrote:
> On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
> > On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> > > As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> > > similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> > > and a program crash will automatically drop the lock.
> > 
> > FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> > out locks using fcntl()/lockf() on all disk images associated with a VM.
> 
> Is it even possible without QEMU cooperating?  In particular in complex
> cases with e.g. backing chains?

Yes, libvirt already has to know & understand exactly what chains are
in use in order to grant correct permissions via SELinux/AppArmor.
Once it knows that it can also deal with acquiring suitable locks.

> This was exactly the reason why we designed the "lock" option to take an
> argument describing the locking mechanism to be used (see the tentative
> patchset Denis posted in this thread).  The only one currently
> implemented is flock()-based; however it can be extended to other
> mechanisms like network / cluster / SAN lock managers, etc.  In
> particular, it can be made to talk to virtlockd.

NB flock() doesn't work reliably / portably on NFS. Many impls
would treat it as a no-op. Other impls would only acquire the
lock on the local NFS client, not the server. Apparently Linux
now[1] transparently converts flock() into fcntl() locks on NFS
only, so you now have the problem that any close() will release
the lock. So IMHO flock() is even less usable than fcntl() as
a result.

Regards,
Daniel

[1]http://0pointer.de/blog/projects/locking.html
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 12:15     ` [Qemu-devel] [Qemu-block] " Roman Kagan
  2015-12-23 12:29       ` Daniel P. Berrange
@ 2015-12-23 12:34       ` Daniel P. Berrange
  2015-12-23 12:47         ` Denis V. Lunev
  1 sibling, 1 reply; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-23 12:34 UTC (permalink / raw)
  To: Roman Kagan, Fam Zheng, Kevin Wolf, qemu-devel, qemu-block, mreitz

On Wed, Dec 23, 2015 at 03:15:50PM +0300, Roman Kagan wrote:
> On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
> > On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> > > As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> > > similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> > > and a program crash will automatically drop the lock.
> > 
> > FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> > out locks using fcntl()/lockf() on all disk images associated with a VM.
> 
> Is it even possible without QEMU cooperating?  In particular in complex
> cases with e.g. backing chains?
> 
> This was exactly the reason why we designed the "lock" option to take an
> argument describing the locking mechanism to be used (see the tentative
> patchset Denis posted in this thread).  The only one currently
> implemented is flock()-based; however it can be extended to other
> mechanisms like network / cluster / SAN lock managers, etc.  In
> particular, it can be made to talk to virtlockd.

NB, libvirt generally considers QEMU to be untrustworthy, which is
another reason why we use virtlockd to acquire the locks *prior*
to granting QEMU any access to the file(s). On this basis we would
not really trust QEMU to acquire/release locks itself by talking
to virtlockd. Indeed, we'd not really trust QEMU locking at all, no
matter what mechanism it used - we want a strong guarantee of locking
regardless of whether QEMU is broken / compromised.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 2/5] block: implemented bdrv_lock_image for raw file
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 2/5] block: implemented bdrv_lock_image for raw file Denis V. Lunev
@ 2015-12-23 12:40         ` Daniel P. Berrange
  0 siblings, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-23 12:40 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Olga Krishtal, Fam Zheng, qemu-devel, Max Reitz

On Wed, Dec 23, 2015 at 10:46:53AM +0300, Denis V. Lunev wrote:
> From: Olga Krishtal <okrishtal@virtuozzo.com>
> 
> flock (LockFileEx on win32) is used to lock the image file.
> We lock the file handle/descriptor. If locking fails,
> an error is returned.
> 
> In the win32 implementation we can lock a region of bytes within the file.
> For this reason we first have to get the file size and only then lock it.
> 
> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 15 +++++++++++++++
>  block/raw-win32.c | 19 +++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 076d070..6226a5c 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -33,6 +33,7 @@
>  #include "raw-aio.h"
>  #include "qapi/util.h"
>  #include "qapi/qmp/qstring.h"
> +#include <sys/file.h>
>  
>  #if defined(__APPLE__) && (__MACH__)
>  #include <paths.h>
> @@ -576,6 +577,19 @@ fail:
>      return ret;
>  }
>  
> +static int raw_lock_image(BlockDriverState *bs, BdrvLockImage lock)
> +{
> +    int ret;
> +    if (lock != BDRV_LOCK_IMAGE_LOCKFILE) {
> +        return -ENOTSUP;
> +    }
> +    ret = flock(((BDRVRawState *)(bs->opaque))->fd, LOCK_EX|LOCK_NB);
> +    if (ret != 0) {
> +        return -ret;
> +    }
> +    return ret;
> +}

flock() is a pretty bad choice wrt NFS. Historically it was often
a no-op. Some impls treat it as a client-local lock and not a server
side lock. Linux apparently now[1] converts flock locks to fcntl locks,
but on NFS only. As a result they'll suffer from the problem that
a close() on any file descriptor pointing to the file will release
the lock. So IMHO both flock() and fcntl() are unusable in practice
from within QEMU, as I don't think it is practical to guarantee
QEMU won't accidentally release the lock by closing another file
descriptor pointing to the same file. If you want to use flock or
fcntl() and provide a strong safety guarantee the only option is to
acquire the locks in a dedicated process prior to giving QEMU access
to the files, which is what libvirt does with its virtlockd daemon.
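
Separately from the NFS concerns, note that flock() returns -1 and sets errno
on failure, so the "-ret" in the quoted hunk yields 1 rather than a negative
errno value. A minimal sketch of a wrapper that follows the negative-errno
convention is given here; the helper names and the /tmp template are
hypothetical, and the sketch assumes a Linux host with a local filesystem.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Returns 0 on success, or a negative errno value if the lock is refused. */
int try_lock_fd(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {
        return -errno;
    }
    return 0;
}

/*
 * Demo helper: locks a temporary file through one descriptor, then
 * returns what a second, independently opened descriptor gets back.
 */
int second_locker_result(void)
{
    char path[] = "/tmp/flock-demo-XXXXXX";
    int first = mkstemp(path);
    if (first < 0) {
        return -errno;
    }
    int result = try_lock_fd(first);
    if (result == 0) {
        int second = open(path, O_RDWR);
        if (second < 0) {
            result = -errno;
        } else {
            result = try_lock_fd(second);
            close(second);
        }
    }
    close(first);
    unlink(path);
    return result;
}
```

A caller can then tell "already locked" (-EWOULDBLOCK) apart from other
failures such as -EBADF.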

Regards,
Daniel

[1] http://0pointer.de/blog/projects/locking.html
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 12:29       ` Daniel P. Berrange
@ 2015-12-23 12:41         ` Denis V. Lunev
  2015-12-23 12:46           ` Daniel P. Berrange
  0 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23 12:41 UTC (permalink / raw)
  To: Daniel P. Berrange, Roman Kagan, Fam Zheng, Kevin Wolf,
	qemu-devel, qemu-block, mreitz

On 12/23/2015 03:29 PM, Daniel P. Berrange wrote:
> On Wed, Dec 23, 2015 at 03:15:50PM +0300, Roman Kagan wrote:
>> On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
>>> On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
>>>> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
>>>> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
>>>> and a program crash will automatically drop the lock.
>>> FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
>>> out locks using fcntl()/lockf() on all disk images associated with a VM.
>> Is it even possible without QEMU cooperating?  In particular in complex
>> cases with e.g. backing chains?
> Yes, libvirt already has to know & understand exactly what chains are
> in use in order to grant correct permissions via SELinux/AppArmor.
> Once it knows that it can also deal with acquiring suitable locks.
>
>> This was exactly the reason why we designed the "lock" option to take an
>> argument describing the locking mechanism to be used (see the tentative
>> patchset Denis posted in this thread).  The only one currently
>> implemented is flock()-based; however it can be extended to other
>> mechanisms like network / cluster / SAN lock managers, etc.  In
>> particular, it can be made to talk to virtlockd.
> NB flock() doesn't work reliably / portably on NFS. Many impls
> would treat it as a no-op. Other impls would only acquire the
> lock on the local NFS client, not the server. Apparently Linux
> now[1] transparently converts flock() into fcntl() locks on NFS
> only, so you now have the problem that any close() will release
> the lock. So IMHO flock() is even less usable than fcntl() as
> a result.
>
> Regards,
> Daniel
>
> [1]http://0pointer.de/blog/projects/locking.html
Do you suggest implementing a connection FROM qemu-img and friends
to libvirt, and querying libvirt about this?

This is absolutely fine actually, why not. We can make a lock mechanism
of type 'libvirt' in exactly the same way as flock. This approach
should satisfy all needs and users.

Shouldn't it?

Den

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 12:41         ` Denis V. Lunev
@ 2015-12-23 12:46           ` Daniel P. Berrange
  0 siblings, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-23 12:46 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, qemu-block, qemu-devel, mreitz, Roman Kagan

On Wed, Dec 23, 2015 at 03:41:01PM +0300, Denis V. Lunev wrote:
> On 12/23/2015 03:29 PM, Daniel P. Berrange wrote:
> >On Wed, Dec 23, 2015 at 03:15:50PM +0300, Roman Kagan wrote:
> >>On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
> >>>On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> >>>>As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> >>>>similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> >>>>and a program crash will automatically drop the lock.
> >>>FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> >>>out locks using fcntl()/lockf() on all disk images associated with a VM.
> >>Is it even possible without QEMU cooperating?  In particular in complex
> >>cases with e.g. backing chains?
> >Yes, libvirt already has to know & understand exactly what chains are
> >in use in order to grant correct permissions via SELinux/AppArmor.
> >Once it knows that it can also deal with acquiring suitable locks.
> >
> >>This was exactly the reason why we designed the "lock" option to take an
> >>argument describing the locking mechanism to be used (see the tentative
> >>patchset Denis posted in this thread).  The only one currently
> >>implemented is flock()-based; however it can be extended to other
> >>mechanisms like network / cluster / SAN lock managers, etc.  In
> >>particular, it can be made to talk to virtlockd.
> >NB flock() doesn't work reliably / portably on NFS. Many impls
> >would treat it as a no-op. Other impls would only acquire the
> >lock on the local NFS client, not the server. Apparently Linux
> >now[1] transparently converts flock() into fcntl() locks on NFS
> >only, so you now have the problem that any close() will release
> >the lock. So IMHO flock() is even less usable than fcntl() as
> >a result.
> >
> >Regards,
> >Daniel
> >
> >[1]http://0pointer.de/blog/projects/locking.html
> Do you suggest to implement connection FROM qemu-img etc stuff
> to libvirt and query libvirt about this?
> 
> This is absolutely fine actually, why not. We can made lock mechanics
> with type 'libvirt' exactly in the same way as flock. This approach
> should satisfy all needs and users.

You want libvirt to use its locking APIs any time it has to invoke
qemu-img.

For the case where people are not using libvirt, but running qemu-img
directly, having qemu-img call back into libvirt isn't going to
help IMHO. Those people are already not paying attention to the
docs, so they're also not going to remember to add the command
line option to tell qemu-img to talk to libvirt. So it's pretty pointless
to have qemu-img talk to libvirt IMHO, as well as adding complexity
by creating a mutual two-way dependency, which is undesirable.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 12:34       ` Daniel P. Berrange
@ 2015-12-23 12:47         ` Denis V. Lunev
  2015-12-23 12:56           ` Daniel P. Berrange
  0 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23 12:47 UTC (permalink / raw)
  To: Daniel P. Berrange, Roman Kagan, Fam Zheng, Kevin Wolf,
	qemu-devel, qemu-block, mreitz

On 12/23/2015 03:34 PM, Daniel P. Berrange wrote:
> On Wed, Dec 23, 2015 at 03:15:50PM +0300, Roman Kagan wrote:
>> On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
>>> On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
>>>> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
>>>> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
>>>> and a program crash will automatically drop the lock.
>>> FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
>>> out locks using fcntl()/lockf() on all disk images associated with a VM.
>> Is it even possible without QEMU cooperating?  In particular in complex
>> cases with e.g. backing chains?
>>
>> This was exactly the reason why we designed the "lock" option to take an
>> argument describing the locking mechanism to be used (see the tentative
>> patchset Denis posted in this thread).  The only one currently
>> implemented is flock()-based; however it can be extended to other
>> mechanisms like network / cluster / SAN lock managers, etc.  In
>> particular, it can be made to talk to virtlockd.
> NB, libvirt generally considers QEMU to be untrustworthy, which is
> another reason why we use virtlockd to acquire the locks *prior*
> to granting QEMU any access to the file(s). On this basis we would
> not really trust QEMU to do acquire/release locks itself by talking
> to virtlockd. Indeed, we'd not really trust QEMU locking at all, no
> matter what mechanism it used - we want strong guarantee of locking
> regardless of whether QEMU is broken / compromised.
>
> Regards,
> Daniel
This is not the case we are trying to solve here. Here a customer accidentally
called 'qemu-img snapshot' and met his doom in a ruined image.

How would we be able to find the proper libvirtd in the case of a network
filesystem shared by a swarm of clients? This daemon is local to the host.
Filesystem locking can be used in the hope that the setup is consistent.

Den


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 12:47         ` Denis V. Lunev
@ 2015-12-23 12:56           ` Daniel P. Berrange
  0 siblings, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2015-12-23 12:56 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, qemu-block, qemu-devel, mreitz, Roman Kagan

On Wed, Dec 23, 2015 at 03:47:00PM +0300, Denis V. Lunev wrote:
> On 12/23/2015 03:34 PM, Daniel P. Berrange wrote:
> >On Wed, Dec 23, 2015 at 03:15:50PM +0300, Roman Kagan wrote:
> >>On Wed, Dec 23, 2015 at 10:47:22AM +0000, Daniel P. Berrange wrote:
> >>>On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> >>>>As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> >>>>similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> >>>>and a program crash will automatically drop the lock.
> >>>FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> >>>out locks using fcntl()/lockf() on all disk images associated with a VM.
> >>Is it even possible without QEMU cooperating?  In particular in complex
> >>cases with e.g. backing chains?
> >>
> >>This was exactly the reason why we designed the "lock" option to take an
> >>argument describing the locking mechanism to be used (see the tentative
> >>patchset Denis posted in this thread).  The only one currently
> >>implemented is flock()-based; however it can be extended to other
> >>mechanisms like network / cluster / SAN lock managers, etc.  In
> >>particular, it can be made to talk to virtlockd.
> >NB, libvirt generally considers QEMU to be untrustworthy, which is
> >another reason why we use virtlockd to acquire the locks *prior*
> >to granting QEMU any access to the file(s). On this basis we would
> >not really trust QEMU to do acquire/release locks itself by talking
> >to virtlockd. Indeed, we'd not really trust QEMU locking at all, no
> >matter what mechanism it used - we want strong guarantee of locking
> >regardless of whether QEMU is broken / compromised.
> >
> this is not the case we are trying to solve here. Here customer accidentally
> called 'qemu-img snapshot' and face his doom in ruined image.

You're merely describing one out of many possible ways to ruin the
image. Running multiple QEMU system emulators pointing to the same
image will equally trash it. Or an admin mistakenly adding the same
image to the same QEMU twice eg once as a primary image, and once
mistakenly via a backing file. Or a broken / compromised QEMU
mistakenly/intentionally acquiring the wrong locks or not any locks.
Any locking mechanism has to consider all the possible ways of doom.

> How can we will be able to find proper libvirtd in the case of network
> filesystem inside client swarm? This daemon is local to the host.
> Filesystem locking can be used in the hope that setup is consistent.

We have one virtlockd on each host, and they would be configured to use
a common lockspace on the shared filesystem, so the locks acquired on
one host would be visible to the other host & vice versa. This works
reasonably reliably with fcntl(), at least more so than with flock().
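
For reference, the close() caveat with fcntl() locks raised earlier in the
thread can be reproduced with a small sketch. It assumes Linux and a local
filesystem; the function names and the scratch path are made up for
illustration and come from neither libvirt nor QEMU.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ask from a child process whether a whole-file write lock would conflict.
 * fcntl() locks are not inherited across fork(), so the child sees the
 * parent's lock as a foreign one.
 * Returns 1 if the file is locked, 0 if it is free, -1 on error.
 */
static int parent_lock_visible(const char *path)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        int fd = open(path, O_RDWR);
        if (fd < 0 || fcntl(fd, F_GETLK, &fl) < 0) {
            _exit(2);
        }
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status) == 2 ? -1 : WEXITSTATUS(status);
}

/* Returns 1 if closing an unrelated descriptor dropped the lock. */
int fcntl_lock_lost_on_close(const char *path)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int locked_fd = open(path, O_RDWR | O_CREAT, 0600);
    int other_fd = open(path, O_RDONLY);

    if (locked_fd < 0 || other_fd < 0 || fcntl(locked_fd, F_SETLK, &fl) < 0) {
        return -1;
    }
    int before = parent_lock_visible(path);
    close(other_fd);    /* this descriptor was never used for locking */
    int after = parent_lock_visible(path);
    close(locked_fd);
    return before == 1 && after == 0;
}
```

Calling fcntl_lock_lost_on_close() on a scratch file is expected to show
the lock vanishing as soon as the second descriptor is closed, which is why
a process using fcntl() locks has to be careful about every open()/close()
pair on the same file.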

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (10 preceding siblings ...)
  2015-12-23  3:14 ` [Qemu-devel] [PATCH 00/10] qcow2: Implement " Fam Zheng
@ 2015-12-23 14:57 ` Vasiliy Tolstov
  2015-12-23 15:08   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  2015-12-23 15:09   ` Denis V. Lunev
  2015-12-24  5:43 ` Denis V. Lunev
  2016-01-14 14:01 ` Max Reitz
  13 siblings, 2 replies; 99+ messages in thread
From: Vasiliy Tolstov @ 2015-12-23 14:57 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

2015-12-22 19:46 GMT+03:00 Kevin Wolf <kwolf@redhat.com>:
> Enough innocent images have died because users called 'qemu-img snapshot' while
> the VM was still running. Educating the users doesn't seem to be a working
> strategy, so this series adds locking to qcow2 that refuses to access the image
> read-write from two processes.
>
> Eric, this will require a libvirt update to deal with qemu crashes which leave
> locked images behind. The simplest thinkable way would be to unconditionally
> override the lock in libvirt whenever the option is present. In that case,
> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> the other way round. If you want more than that, libvirt would have to check
> somehow if it was its own VM that used the image and left the lock behind. I
> imagine that can't be too hard either.


This breaks the ability to create a disk-only snapshot while the VM is running.
Or am I missing something?

-- 
Vasiliy Tolstov,
e-mail: v.tolstov@selfip.ru


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 14:57 ` [Qemu-devel] " Vasiliy Tolstov
@ 2015-12-23 15:08   ` Denis V. Lunev
  2015-12-23 15:11     ` Vasiliy Tolstov
  2015-12-23 15:09   ` Denis V. Lunev
  1 sibling, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23 15:08 UTC (permalink / raw)
  To: Vasiliy Tolstov, Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

On 12/23/2015 05:57 PM, Vasiliy Tolstov wrote:
> 2015-12-22 19:46 GMT+03:00 Kevin Wolf <kwolf@redhat.com>:
>> Enough innocent images have died because users called 'qemu-img snapshot' while
>> the VM was still running. Educating the users doesn't seem to be a working
>> strategy, so this series adds locking to qcow2 that refuses to access the image
>> read-write from two processes.
>>
>> Eric, this will require a libvirt update to deal with qemu crashes which leave
>> locked images behind. The simplest thinkable way would be to unconditionally
>> override the lock in libvirt whenever the option is present. In that case,
>> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
>> the other way round. If you want more than that, libvirt would have to check
>> somehow if it was its own VM that used the image and left the lock behind. I
>> imagine that can't be too hard either.
>
> This breaks ability to create disk only snapshot while vm is running.
> Or i miss something?
>
You should do this by asking the running QEMU, not by using
qemu-img, which is badly wrong.

Den


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 14:57 ` [Qemu-devel] " Vasiliy Tolstov
  2015-12-23 15:08   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-12-23 15:09   ` Denis V. Lunev
  1 sibling, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-23 15:09 UTC (permalink / raw)
  To: Vasiliy Tolstov, Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

On 12/23/2015 05:57 PM, Vasiliy Tolstov wrote:
> 2015-12-22 19:46 GMT+03:00 Kevin Wolf <kwolf@redhat.com>:
>> Enough innocent images have died because users called 'qemu-img snapshot' while
>> the VM was still running. Educating the users doesn't seem to be a working
>> strategy, so this series adds locking to qcow2 that refuses to access the image
>> read-write from two processes.
>>
>> Eric, this will require a libvirt update to deal with qemu crashes which leave
>> locked images behind. The simplest thinkable way would be to unconditionally
>> override the lock in libvirt whenever the option is present. In that case,
>> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
>> the other way round. If you want more than that, libvirt would have to check
>> somehow if it was its own VM that used the image and left the lock behind. I
>> imagine that can't be too hard either.
>
> This breaks ability to create disk only snapshot while vm is running.
> Or i miss something?
>
You should do this by asking the running QEMU, not by using
qemu-img, which is badly wrong.

Den


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 15:08   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-12-23 15:11     ` Vasiliy Tolstov
  2016-01-11 16:25       ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: Vasiliy Tolstov @ 2015-12-23 15:11 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, qemu-block, mreitz

2015-12-23 18:08 GMT+03:00 Denis V. Lunev <den-lists@parallels.com>:
> you should do this by asking running QEMU not by
> qemu-img, which is badly wrong.
>
> Den


OK, if this is possible via QMP/HMP in QEMU, no problem.

-- 
Vasiliy Tolstov,
e-mail: v.tolstov@selfip.ru


* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23  3:14 ` [Qemu-devel] [PATCH 00/10] qcow2: Implement " Fam Zheng
  2015-12-23  7:35   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  2015-12-23 10:47   ` [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Daniel P. Berrange
@ 2015-12-23 23:19   ` Max Reitz
  2015-12-24  5:41     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  2 siblings, 1 reply; 99+ messages in thread
From: Max Reitz @ 2015-12-23 23:19 UTC (permalink / raw)
  To: Fam Zheng, Kevin Wolf; +Cc: qemu-devel, qemu-block


On 23.12.2015 04:14, Fam Zheng wrote:
> On Tue, 12/22 17:46, Kevin Wolf wrote:
>> Enough innocent images have died because users called 'qemu-img snapshot' while
>> the VM was still running. Educating the users doesn't seem to be a working
>> strategy, so this series adds locking to qcow2 that refuses to access the image
>> read-write from two processes.
>>
>> Eric, this will require a libvirt update to deal with qemu crashes which leave
>> locked images behind. The simplest thinkable way would be to unconditionally
>> override the lock in libvirt whenever the option is present. In that case,
>> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
>> the other way round. If you want more than that, libvirt would have to check
>> somehow if it was its own VM that used the image and left the lock behind. I
>> imagine that can't be too hard either.
> 
> The motivation is great, but I'm not sure I like the side-effect that an
> unclean shutdown will require a "forced" open, because it makes using qcow2 in
> development cumbersome,

How so?

Just extend your "x86_64-softmmu/qemu-system-x86_64
-these-options-will-make-qemu-crash" invocation by "; qemu-img
force-unlock foo.qcow2".

>                         and like you said, management/user also needs to handle
> this explicitly. This is a bit of a personal preference, but it's strong enough
> that I want to speak up.

Well, I personally always had the opposite preference. I see Denis's
series works on Windows, too, which is good. However, it won't work with
every backend, whereas a qcow2 flag will.

Also, Denis's/Olga's series will by default not lock the file. This is
not an issue if one uses libvirt to run a VM; but there are people who
invoke qemu directly and then try to run qemu-img concurrently, and I
doubt those people will manually set the locking option. This might be
addressed by automatically setting the option if a certain format like
qcow2 is used, but it may be pretty difficult to get that implemented
nicely.

So the benefits of a qcow2 flag are only minor ones. However, I
personally believe that automatic unlock on crash is a very minor
benefit as well. That should never happen in practice anyway, and a
crashing qemu is such a great inconvenience that I as a user wouldn't
really mind having to unlock the image afterwards.

In fact, libvirt could even do that manually, couldn't it? If qemu
crashes, it just invokes qemu-img force-unlock on any qcow2 image which
was attached R/W to the VM.

> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> and a program crash will automatically drop the lock.

Making other formats benefit from addressing this issue is a good point,
but it too is a minor point. Formats other than qcow2 and raw are only
supported for compatibility anyway, and we don't need this feature for raw.


I feel like most of the question of which approach to take revolves around
"But what if qemu crashes?". You (and others) are right that having
to manually unlock the image then is cumbersome; however, I think that:
(1) qemu should never crash anyway.
(2) If qemu does crash, having to unlock the image is probably the
    least of your worries.
(3) If you are using libvirt, it should actually be possible for
    libvirt to automatically force-unlock images on qemu crash.

This is why I don't think that keeping a locked image behind on qemu
crash is actually an issue.
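
To make the trade-off concrete, here is a toy sketch of an in-file lock
flag. Everything in it is hypothetical: the one-byte flag at offset 0 and
the toy_* names are illustration only and are not the qcow2 header layout
or anything from the series. It only shows why a process that dies without
cleaning up leaves the image locked until something like a force-unlock
clears the flag.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical one-byte "in use" flag at offset 0; NOT the qcow2 layout. */
enum { TOY_FREE = 0, TOY_IN_USE = 1 };

/*
 * Open the file and mark it as in use.
 * Returns the fd, or -1 if the flag is already set or on error.
 * The read-then-write below is not atomic; a real design has to handle
 * two openers racing each other.
 */
int toy_open(const char *path)
{
    unsigned char flag = TOY_FREE;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0) {
        return -1;
    }
    /* pread() returns 0 on a brand-new empty file, which counts as free */
    if (pread(fd, &flag, 1, 0) < 0 || flag == TOY_IN_USE) {
        close(fd);
        return -1;
    }
    flag = TOY_IN_USE;
    if (pwrite(fd, &flag, 1, 0) != 1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Clean shutdown: clear the flag, then close. Returns 0 on success. */
int toy_close(int fd)
{
    unsigned char flag = TOY_FREE;
    int cleared = pwrite(fd, &flag, 1, 0) == 1;

    return (close(fd) == 0 && cleared) ? 0 : -1;
}

/* Clear the flag without checking who set it. Returns 0 on success. */
int toy_force_unlock(const char *path)
{
    int fd = open(path, O_WRONLY);

    return fd < 0 ? -1 : toy_close(fd);
}
```

A crash corresponds to a plain close(fd) without toy_close(): the flag
stays set, later toy_open() calls fail, and only toy_force_unlock() makes
the file usable again.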

Max



* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 1/5] block: added lock image option and callback Denis V. Lunev
@ 2015-12-23 23:48         ` Eric Blake
  2016-01-11 17:31         ` Kevin Wolf
  1 sibling, 0 replies; 99+ messages in thread
From: Eric Blake @ 2015-12-23 23:48 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Olga Krishtal, Fam Zheng, qemu-devel, Max Reitz


On 12/23/2015 12:46 AM, Denis V. Lunev wrote:
> From: Olga Krishtal <okrishtal@virtuozzo.com>
> 
> While opening the image we want to be sure that we are the
> one who works with image, anf if it is not true -

s/anf/and/

> opening the image for writing should fail.
> 
> There are 2 ways at the moment: no lock at all and lock the file
> image.
> 
> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> CC: Fam Zheng <famz@redhat.com>
> ---

> @@ -895,6 +897,12 @@ static QemuOptsList bdrv_runtime_opts = {
>              .type = QEMU_OPT_BOOL,
>              .help = "Ignore flush requests",
>          },
> +        {
> +            .name = "lock",
> +            .type = QEMU_OPT_STRING,
> +            .help = "How to lock the image: none or lockfile",
> +            .def_value_str = "none",

If locking is not on by default, then it is not providing the protection
we want.  Having multiple lock styles doesn't help much - advisory
locking is only useful if all clients use the same style.
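
A minimal sketch of the "same style" problem, assuming Linux and a local
filesystem, where flock() and fcntl() locks are kept separately and do not
see each other. The function name and the scratch path are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Take an exclusive flock() through one descriptor, then try both lock
 * styles through a second, independently opened descriptor.
 * Returns 1 if the second flock() was refused while the fcntl() write
 * lock was still granted, 0 otherwise, -1 on setup errors.
 */
int lock_styles_independent(const char *path)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int a = open(path, O_RDWR | O_CREAT, 0600);
    int b = open(path, O_RDWR);

    if (a < 0 || b < 0 || flock(a, LOCK_EX | LOCK_NB) < 0) {
        return -1;
    }
    /* Same style: the second open file description is refused. */
    int flock_refused = flock(b, LOCK_EX | LOCK_NB) < 0 &&
                        errno == EWOULDBLOCK;
    /* Different style: the fcntl() lock does not notice the flock(). */
    int fcntl_granted = fcntl(b, F_SETLK, &fl) == 0;

    close(b);
    close(a);
    return flock_refused && fcntl_granted;
}
```

On such a setup the function is expected to return 1: a client that uses
fcntl() is not stopped by a client that uses flock(), so a mix of styles
gives no protection.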


> +++ b/qapi/block-core.json
> @@ -2408,3 +2408,12 @@
>  ##
>  { 'command': 'block-set-write-threshold',
>    'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }
> +
> +##
> +# @BdrvLockImage:
> +#
> +#An enumeration of lock types to lock image file.
> +# @none - do not lock the image file
> +# @lockfile

I know you said this series was rough, but you didn't document much
about what lockfile means.

> +## Since 2.6
> +{ 'enum': 'BdrvLockImage', 'data':['none', 'lockfile']}
> 

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 23:19   ` [Qemu-devel] " Max Reitz
@ 2015-12-24  5:41     ` Denis V. Lunev
  2015-12-24  5:42       ` Denis V. Lunev
                         ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-24  5:41 UTC (permalink / raw)
  To: Max Reitz, Fam Zheng, Kevin Wolf; +Cc: qemu-devel, qemu-block

On 12/24/2015 02:19 AM, Max Reitz wrote:
> On 23.12.2015 04:14, Fam Zheng wrote:
>> On Tue, 12/22 17:46, Kevin Wolf wrote:
>>> Enough innocent images have died because users called 'qemu-img snapshot' while
>>> the VM was still running. Educating the users doesn't seem to be a working
>>> strategy, so this series adds locking to qcow2 that refuses to access the image
>>> read-write from two processes.
>>>
>>> Eric, this will require a libvirt update to deal with qemu crashes which leave
>>> locked images behind. The simplest thinkable way would be to unconditionally
>>> override the lock in libvirt whenever the option is present. In that case,
>>> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
>>> the other way round. If you want more than that, libvirt would have to check
>>> somehow if it was its own VM that used the image and left the lock behind. I
>>> imagine that can't be too hard either.
>> The motivation is great, but I'm not sure I like the side-effect that an
>> unclean shutdown will require a "forced" open, because it makes using qcow2 in
>> development cumbersome,
> How so?
>
> Just extend your "x86_64-softmmu/qemu-system-x86_64
> -these-options-will-make-qemu-crash" invocation by "; qemu-img
> force-unlock foo.qcow2".
>
>>                          and like you said, management/user also needs to handle
>> this explicitly. This is a bit of a personal preference, but it's strong enough
>> that I want to speak up.
> Well, I personally always had the opposite preference. I see Denis's
> series works on Windows, too, which is good. However, it won't work with
> any backend, which a qcow2 flag will.
>
> Also, Denis's/Olga's series will by default not lock the file. This is
> not an issue if one uses libvirt to run a VM; but there are people who
> invoke qemu directly and then try to run qemu-img concurrently, and I
> doubt those people will manually set the locking option. This might be
> addressed by automatically setting the option if a certain format like
> qcow2 is used, but it may be pretty difficult to get that implemented
> nicely.
>
> So the benefits of a qcow2 flag are only minor ones. However, I
> personally believe that automatic unlock on crash is a very minor
> benefit as well. That should never happen in practice anyway, and a
> crashing qemu is such a great inconvenience that I as a user wouldn't
> really mind having to unlock the image afterwards.
IMHO you are wrong. This is VERY important. The situation would be exactly
the same after a node poweroff, which can and really does happen in
real life from time to time.

In these cases VMs should start automatically and ASAP if configured this
way. Any manual interaction here is a REAL pain.

> In fact, libvirt could even do that manually, couldn't it? If qemu
> crashes, it just invokes qemu-img force-unlock on any qcow2 image which
> was attached R/W to the VM.

In the situation above libvirt does not have this information, or the
information could be unreliable.

>> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
>> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
>> and a program crash will automatically drop the lock.
> Making other formats benefit from addressing this issue is a good point,
> but it too is a minor point. Formats other than qcow2 and raw are only
> supported for compatibility anyway, and we don't need this feature for raw.
I would like to have this covered by flock, which has indeed been working for
years with Parallels.

>
> I feel like most of the question which approach to take revolves around
> "But what if qemu crashes?". You (and others) are right in that having
> to manually unlock the image then is cumbersome, however, I think that:
> (1) qemu should never crash anyway.
> (2) If qemu does crash, having to unlock the image is probably the
>      least of your worries.
> (3) If you are using libvirt, it should actually be possible for
>      libvirt to automatically force-unlock images on qemu crash.
>
> This is why I don't think that keeping a locked image behind on qemu
> crash is actually an issue.
>
> Max
>
Please see above. Node failure and unexpected power loss really matter.

Den


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-24  5:41     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
@ 2015-12-24  5:42       ` Denis V. Lunev
  2016-01-04 17:02       ` Max Reitz
  2016-01-11 16:47       ` Kevin Wolf
  2 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-24  5:42 UTC (permalink / raw)
  To: Max Reitz, Fam Zheng, Kevin Wolf; +Cc: qemu-devel, qemu-block

On 12/24/2015 08:41 AM, Denis V. Lunev wrote:
> On 12/24/2015 02:19 AM, Max Reitz wrote:
>> On 23.12.2015 04:14, Fam Zheng wrote:
>>> On Tue, 12/22 17:46, Kevin Wolf wrote:
>>>> Enough innocent images have died because users called 'qemu-img 
>>>> snapshot' while
>>>> the VM was still running. Educating the users doesn't seem to be a 
>>>> working
>>>> strategy, so this series adds locking to qcow2 that refuses to 
>>>> access the image
>>>> read-write from two processes.
>>>>
>>>> Eric, this will require a libvirt update to deal with qemu crashes 
>>>> which leave
>>>> locked images behind. The simplest thinkable way would be to 
>>>> unconditionally
>>>> override the lock in libvirt whenever the option is present. In 
>>>> that case,
>>>> libvirt VMs would be protected against concurrent non-libvirt 
>>>> accesses, but not
>>>> the other way round. If you want more than that, libvirt would have 
>>>> to check
>>>> somehow if it was its own VM that used the image and left the lock 
>>>> behind. I
>>>> imagine that can't be too hard either.
>>> The motivation is great, but I'm not sure I like the side-effect 
>>> that an
>>> unclean shutdown will require a "forced" open, because it makes 
>>> using qcow2 in
>>> development cumbersome,
>> How so?
>>
>> Just extend your "x86_64-softmmu/qemu-system-x86_64
>> -these-options-will-make-qemu-crash" invocation by "; qemu-img
>> force-unlock foo.qcow2".
>>
>>>                          and like you said, management/user also 
>>> needs to handle
>>> this explicitly. This is a bit of a personal preference, but it's 
>>> strong enough
>>> that I want to speak up.
>> Well, I personally always had the opposite preference. I see Denis's
>> series works on Windows, too, which is good. However, it won't work with
>> any backend, which a qcow2 flag will.
>>
>> Also, Denis's/Olga's series will by default not lock the file. This is
>> not an issue if one uses libvirt to run a VM; but there are people who
>> invoke qemu directly and then try to run qemu-img concurrently, and I
>> doubt those people will manually set the locking option. This might be
>> addressed by automatically setting the option if a certain format like
>> qcow2 is used, but it may be pretty difficult to get that implemented
>> nicely.
>>
>> So the benefits of a qcow2 flag are only minor ones. However, I
>> personally believe that automatic unlock on crash is a very minor
>> benefit as well. That should never happen in practice anyway, and a
>> crashing qemu is such a great inconvenience that I as a user wouldn't
>> really mind having to unlock the image afterwards.
> IMHO you are wrong. This is VERY important. The situation would be 
> exactly
> the same after node poweroff, which could happen and really happens in
> the real life from time to time.
>
> In this cases VMs should start automatically and ASAP if configured this
> way. Any manual interaction here is a REAL pain.
>
>> In fact, libvirt could even do that manually, couldn't it? If qemu
>> crashes, it just invokes qemu-img force-unlock on any qcow2 image which
>> was attached R/W to the VM.
>
> in the situation above libvirt does not have the information or this
> information could be unreliable.
>
>>> As an alternative, can we introduce .bdrv_flock() in protocol 
>>> drivers, with
>>> similar semantics to flock(2) or lockf(3)? That way all formats can 
>>> benefit,
>>> and a program crash will automatically drop the lock.
>> Making other formats benefit from addressing this issue is a good point,
>> but it too is a minor point. Formats other than qcow2 and raw are only
>> supported for compatibility anyway, and we don't need this feature 
>> for raw.
> I would like to have this covered by flock and this indeed working for
> years with Parallels.
>
>>
>> I feel like most of the question which approach to take revolves around
>> "But what if qemu crashes?". You (and others) are right in that having
>> to manually unlock the image then is cumbersome, however, I think that:
>> (1) qemu should never crash anyway.
>> (2) If qemu does crash, having to unlock the image is probably the
>>      least of your worries.
>> (3) If you are using libvirt, it should actually be possible for
>>      libvirt to automatically force-unlock images on qemu crash.
>>
>> This is why I don't think that keeping a locked image behind on qemu
>> crash is actually an issue.
>>
>> Max
>>
> pls see above. Node failure and unexpected power loss really matters.
>
> Den
Sorry, I forgot one more bullet here: the OOM killer.
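
A small sketch of how a kernel-managed lock behaves in the SIGKILL case,
which is the signal the OOM killer delivers. It assumes Linux and a local
filesystem; the function names and the scratch path are hypothetical, and
the child process is only a stand-in for a QEMU holding the image.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* 1 = lock was free (taken and dropped again), 0 = held elsewhere, -1 = error */
static int lock_is_free(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0) {
        return -1;
    }
    int rc = flock(fd, LOCK_EX | LOCK_NB);
    int busy = rc < 0 && errno == EWOULDBLOCK;
    close(fd);    /* also drops the lock if we just got it */
    return rc == 0 ? 1 : (busy ? 0 : -1);
}

/* Returns 1 if the lock was held while the child lived and free afterwards. */
int flock_released_on_sigkill(const char *path)
{
    int ready[2];
    char byte;

    if (pipe(ready) < 0) {
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: take the lock, report readiness, then wait to be killed. */
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0 || flock(fd, LOCK_EX) < 0 ||
            write(ready[1], "x", 1) != 1) {
            _exit(1);
        }
        for (;;) {
            pause();
        }
    }
    close(ready[1]);
    if (read(ready[0], &byte, 1) != 1) {
        /* Child failed before taking the lock. */
        waitpid(pid, NULL, 0);
        close(ready[0]);
        return -1;
    }
    int free_while_alive = lock_is_free(path);
    kill(pid, SIGKILL);         /* no chance for the child to clean up */
    waitpid(pid, NULL, 0);
    int free_after_kill = lock_is_free(path);
    close(ready[0]);
    return free_while_alive == 0 && free_after_kill == 1;
}
```

The kernel releases the flock() when the killed process's descriptors are
closed, so nothing has to be force-unlocked afterwards. A flag stored in
the image file itself has no such cleanup path.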


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (11 preceding siblings ...)
  2015-12-23 14:57 ` [Qemu-devel] " Vasiliy Tolstov
@ 2015-12-24  5:43 ` Denis V. Lunev
  2016-01-11 16:33   ` Kevin Wolf
  2016-01-14 14:01 ` Max Reitz
  13 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2015-12-24  5:43 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz

On 12/22/2015 07:46 PM, Kevin Wolf wrote:
> Enough innocent images have died because users called 'qemu-img snapshot' while
> the VM was still running. Educating the users doesn't seem to be a working
> strategy, so this series adds locking to qcow2 that refuses to access the image
> read-write from two processes.
>
> Eric, this will require a libvirt update to deal with qemu crashes which leave
> locked images behind. The simplest thinkable way would be to unconditionally
> override the lock in libvirt whenever the option is present. In that case,
> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> the other way round. If you want more than that, libvirt would have to check
> somehow if it was its own VM that used the image and left the lock behind. I
> imagine that can't be too hard either.
>
> Also note that this kind of depends on Max's bdrv_close_all() series, but only
> in order to pass test case 142. This is not a bug in this series, but a
> preexisting one (bs->file can be closed before bs), and it becomes apparent
> when qemu fails to unlock an image due to this bug. Max's series fixes this.
>
>
This approach has a hole with qcow2_invalidate_cache().
The lock is released (and by design can't be kept) in
between the qcow2_close() and qcow2_open() calls, if
I understand this correctly.

Den


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-24  5:41     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  2015-12-24  5:42       ` Denis V. Lunev
@ 2016-01-04 17:02       ` Max Reitz
  2016-01-11 16:47       ` Kevin Wolf
  2 siblings, 0 replies; 99+ messages in thread
From: Max Reitz @ 2016-01-04 17:02 UTC (permalink / raw)
  To: Denis V. Lunev, Fam Zheng, Kevin Wolf; +Cc: qemu-devel, qemu-block


On 24.12.2015 06:41, Denis V. Lunev wrote:
> On 12/24/2015 02:19 AM, Max Reitz wrote:
>> So the benefits of a qcow2 flag are only minor ones. However, I
>> personally believe that automatic unlock on crash is a very minor
>> benefit as well. That should never happen in practice anyway, and a
>> crashing qemu is such a great inconvenience that I as a user wouldn't
>> really mind having to unlock the image afterwards.
> IMHO you are wrong. This is VERY important. The situation would be exactly
> the same after node poweroff, which could happen and really happens in
> the real life from time to time.
> 
> In this cases VMs should start automatically and ASAP if configured this
> way. Any manual interaction here is a REAL pain.

Thanks, that's a good example.

However, I don't know much about management at that layer, so this is
probably where I'm out of the discussion.

(For instance, I don't know which kind of node you are talking about; I
presume it is a physical node, because if it was a virtual node, you'd
just kill the qemu instance in question by sending a QMP quit command.)

>> In fact, libvirt could even do that manually, couldn't it? If qemu
>> crashes, it just invokes qemu-img force-unlock on any qcow2 image which
>> was attached R/W to the VM.
> 
> in the situation above libvirt does not have the information or this
> information could be unreliable.

Well, then s/libvirt/any of the management layers/. As far as I know,
qemu-img commands are still used pretty high up in the stack.

>>> As an alternative, can we introduce .bdrv_flock() in protocol
>>> drivers, with
>>> similar semantics to flock(2) or lockf(3)? That way all formats can
>>> benefit,
>>> and a program crash will automatically drop the lock.
>> Making other formats benefit from addressing this issue is a good point,
>> but it too is a minor point. Formats other than qcow2 and raw are only
>> supported for compatibility anyway, and we don't need this feature for
>> raw.
> I would like to have this covered by flock, and this has indeed been working
> for years with Parallels.
> 
>>
>> I feel like most of the question which approach to take revolves around
>> "But what if qemu crashes?". You (and others) are right in that having
>> to manually unlock the image then is cumbersome, however, I think that:
>> (1) qemu should never crash anyway.
>> (2) If qemu does crash, having to unlock the image is probably the
>>      least of your worries.
>> (3) If you are using libvirt, it should actually be possible for
>>      libvirt to automatically force-unlock images on qemu crash.
>>
>> This is why I don't think that keeping a locked image behind on qemu
>> crash is actually an issue.
>>
>> Max
>>
> Please see above. Node failure and unexpected power loss really matter.

Good points indeed (maybe, I can't actually judge, but I'll trust you).

Max
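
For readers following the flock(2) suggestion quoted above, here is a minimal sketch of the property being argued for: an advisory lock that belongs to the open file description and therefore disappears when the holder exits or crashes. The helper names `image_try_lock` and `image_open_locked` are hypothetical and are not part of the series or of any proposed `.bdrv_flock()` interface; only the POSIX/Linux calls `open`, `flock` and `close` are real.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: take an exclusive advisory lock on an open image
 * file descriptor without blocking. Returns 0 on success, -EWOULDBLOCK
 * if another open file description holds the lock, or another negative
 * errno value on failure. */
static int image_try_lock(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        return -errno;
    }
    return 0;
}

/* Hypothetical helper: open an image read-write and lock it.
 * Returns the file descriptor, or a negative errno value. */
static int image_open_locked(const char *path)
{
    int fd = open(path, O_RDWR);
    int ret;

    if (fd < 0) {
        return -errno;
    }
    ret = image_try_lock(fd);
    if (ret < 0) {
        close(fd);
        return ret;
    }
    /* The kernel drops the lock when the last descriptor referring to
     * this open file description is closed, which includes process
     * exit after a crash. No explicit unlock step is needed then. */
    return fd;
}
```

A second `image_open_locked()` on the same path fails with `-EWOULDBLOCK` while the first descriptor is open and succeeds once it has been closed. How well this carries over to network file systems is outside the scope of the sketch.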


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 05/10] block: Inactivate BDS when migration completes
  2015-12-22 20:43   ` Eric Blake
@ 2016-01-05 20:21     ` John Snow
  2016-01-13 14:25       ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: John Snow @ 2016-01-05 20:21 UTC (permalink / raw)
  To: Eric Blake, Kevin Wolf, qemu-block; +Cc: qemu-devel, mreitz



On 12/22/2015 03:43 PM, Eric Blake wrote:
> On 12/22/2015 09:46 AM, Kevin Wolf wrote:
>> So far, live migration with shared storage meant that the image is in a
>> not-really-ready don't-touch-me state on the destination while the
>> source is still actively using it, but after completing the migration,
>> the image was fully opened on both sides. This is bad.
>>
>> This patch adds a block driver callback to inactivate images on the
>> source before completing the migration. Inactivation means that it goes
>> to a state as if it was just live migrated to the qemu instance on the
>> source (i.e. BDRV_O_INCOMING is set). You're then supposed to continue
>> either on the source or on the destination, which takes ownership of the
>> image.
>>
>> A typical migration looks like this now with respect to disk images:
>>
>> 1. Destination qemu is started, the image is opened with
>>    BDRV_O_INCOMING. The image is fully opened on the source.
>>
>> 2. Migration is about to complete. The source flushes the image and
>>    inactivates it. Now both sides have the image opened with
>>    BDRV_O_INCOMING and are expecting the other side to still modify it.
> 
> The name BDRV_O_INCOMING now doesn't quite match semantics on the
> source, but I don't have any better suggestions.  BDRV_O_LIMITED_USE?
> BDRV_O_HANDOFF?  At any rate, I fully agree with your logic of locking
> things down on the source to mark that the destination is about to take
> over write access to the file.
> 

INCOMING is handy as it keeps the code simple, even if it's weird to
read. Is it worth adding the extra ifs/case statements everywhere to add
in BDRV_O_HANDOFF? Maybe in the future someone will use BDRV_O_INCOMING
to mean something more specific (data is incoming, not just in the
process of being handed off) that could cause problems.

Maybe even just renaming BDRV_O_INCOMING right now to be BDRV_O_HANDOFF
would accomplish the semantics we want on both source and destination
without needing two flags.

Follow your dreams, go with what you feel.

>>
>> 3. One side (the destination on success) continues and calls
>>    bdrv_invalidate_all() in order to take ownership of the image again.
>>    This removes BDRV_O_INCOMING on the resuming side; the flag remains
>>    set on the other side.
>>
>> This ensures that the same image isn't written to by both instances
>> (unless both are resumed, but then you get what you deserve). This is
>> important because .bdrv_close for non-BDRV_O_INCOMING images could write
>> to the image file, which is definitely forbidden while another host is
>> using the image.
> 
> And indeed, this is a prereq to your patch that modifies the file on
> close to clear the new 'open-for-writing' flag :)
> 
>>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>>  block.c                   | 34 ++++++++++++++++++++++++++++++++++
>>  include/block/block.h     |  1 +
>>  include/block/block_int.h |  1 +
>>  migration/migration.c     |  7 +++++++
>>  qmp.c                     | 12 ++++++++++++
>>  5 files changed, 55 insertions(+)
>>
> 
>> @@ -1536,6 +1540,9 @@ static void migration_completion(MigrationState *s, int current_active_state,
>>          if (!ret) {
>>              ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>>              if (ret >= 0) {
>> +                ret = bdrv_inactivate_all();
>> +            }
>> +            if (ret >= 0) {
>>                  qemu_file_set_rate_limit(s->file, INT64_MAX);
> 
> Isn't the point of the rate limit change to allow any pending operations
> to flush without artificial slow limits?  Will inactivating the device
> be too slow if rate limits are still slow?
> 

This sets the rate limit for the migration pipe, doesn't it? My reading
was that this removes any artificial limits for the sake of post-copy,
but we shouldn't be flushing any writes to disk at this point, so I
think this order won't interfere with anything.

> But offhand, I don't have any strong proof that a different order is
> required, so yours makes sense to me.
> 
> You may want a second opinion, but I'm okay if you add:
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
Reviewed-by: John Snow <jsnow@redhat.com>
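
The three-step handoff in the quoted commit message can be modelled with a toy state machine. This is only a sketch: `Instance`, `inactivate()`, `invalidate()` and `may_write()` are hypothetical stand-ins for a qemu process, `bdrv_inactivate_all()`, `bdrv_invalidate_all()` and the BDRV_O_INCOMING check, and none of it is QEMU code.

```c
#include <stdbool.h>

/* Hypothetical model of one qemu instance's view of a shared image.
 * 'inactive' plays the role of BDRV_O_INCOMING being set. */
typedef struct Instance {
    bool inactive;
} Instance;

/* An instance may write to the image only while it owns it. */
static bool may_write(const Instance *i)
{
    return !i->inactive;
}

/* Step 2: the source gives up ownership before migration completes. */
static void inactivate(Instance *i)
{
    i->inactive = true;
}

/* Step 3: whichever side resumes takes ownership again. */
static void invalidate(Instance *i)
{
    i->inactive = false;
}
```

Starting with the source active and the destination inactive, calling `inactivate()` on the source leaves no writer at all, and `invalidate()` on exactly one side leaves exactly one writer. Calling it on both sides is the "you get what you deserve" case from the commit message.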

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images
  2015-12-22 20:20   ` Eric Blake
@ 2016-01-11 15:20     ` Kevin Wolf
  0 siblings, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 15:20 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, mreitz

Am 22.12.2015 um 21:20 hat Eric Blake geschrieben:
> On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> > Version 2 images don't have feature bits, so writing a feature table to
> > those images is kind of pointless.
> 
> Fortunately, it is also harmless; even the v2 spec allowed for unknown
> extension headers.

With 512 byte clusters it could use up important space that you wanted
to use for the backing file path!

Okay, okay, maybe not that critical... ;-)

> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block/qcow2.c              | 48 ++++++++++++++++++++++++----------------------
> >  tests/qemu-iotests/031.out | 12 +-----------
> >  tests/qemu-iotests/061.out | 15 ---------------
> >  3 files changed, 26 insertions(+), 49 deletions(-)
> > 
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> 
> Did you test that amend'ing an image from v2 to v3 adds the table, and
> downgrading from v3 to v2 drops the table?

I'm not sure if I tested it manually, but I'm updating the results of
test case 061, which tests both upgrades and downgrades, so if your
review was thorough enough, the answer is yes.

Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate
  2015-12-22 21:17   ` Eric Blake
@ 2016-01-11 15:34     ` Kevin Wolf
  0 siblings, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 15:34 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, mreitz

Am 22.12.2015 um 22:17 hat Eric Blake geschrieben:
> On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> > The callback has to ensure that closing or flushing the image afterwards
> > wouldn't cause a write access to the image files. This means that just
> > the caches have to be written out, which is part of the existing
> > .bdrv_close implementation.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block/qcow2.c | 45 ++++++++++++++++++++++++++++-----------------
> >  1 file changed, 28 insertions(+), 17 deletions(-)
> > 
> 
> Mostly code motion, but I still ended up with questions.
> 
> >  
> > +static int qcow2_inactivate(BlockDriverState *bs)
> > +{
> > +    BDRVQcow2State *s = bs->opaque;
> > +    int ret, result = 0;
> > +
> > +    ret = qcow2_cache_flush(bs, s->l2_table_cache);
> > +    if (ret) {
> > +        result = ret;
> > +        error_report("Failed to flush the L2 table cache: %s",
> > +                     strerror(-ret));
> 
> I asked Markus if we want error_report_errno() - and ever since then, I
> keep finding more and more uses that would benefit from it :)

At least they should be easy to find if we ever do introduce it.

> > +    }
> > +
> > +    ret = qcow2_cache_flush(bs, s->refcount_block_cache);
> > +    if (ret) {
> > +        result = ret;
> > +        error_report("Failed to flush the refcount block cache: %s",
> > +                     strerror(-ret));
> > +    }
> > +
> 
> If the media fails in between these two statements,...
> 
> > +    if (result == 0) {
> > +        qcow2_mark_clean(bs);
> 
> ...can't qcow2_mark_clean() fail due to an EIO or other write error?  Do
> we care?  I guess the worst is that we didn't mark the image clean after
> all, which is no worse than if qemu[-img] had been SIGKILL'd at the same
> point where I hypothesized that the media could fail.

Exactly. We can't prevent disks from failing, but we can make sure that we do
things in the right order so that the failure can't result in
corruption. Marking the image clean as the last thing should be the
right order to ensure this.

> > +    }
> > +
> > +    return result;
> 
> If both flushes failed, you return the result set to the return value of
> the second flush.  Is it ever possible that the return value of the
> first flush might be more useful?

I can imagine that there are such cases. But if both fail, we don't have
a way to tell which error code is more useful, so we can just pick any.
We have an error_report() for both, at least, in case the admin needs to
check what happened.

> > @@ -1693,23 +1719,7 @@ static void qcow2_close(BlockDriverState *bs)
> >      s->l1_table = NULL;
> >  
> >      if (!(bs->open_flags & BDRV_O_INCOMING)) {
> 
> > -        if (!ret1 && !ret2) {
> > -            qcow2_mark_clean(bs);
> > -        }
> > +        qcow2_inactivate(bs);
> >      }
> 
> Then again, the lone existing caller in this addition doesn't even care
> about the return value.  The only other caller was your new code added
> earlier in the series; in 5/10 migration_completion(), which uses the
> value as a conditional but doesn't try to call strerror(-ret).
> 
> Since this is mostly code motion, any semantic changes you would want to
> make based on my questions above probably belong in their own patches.
> Therefore, for this patch as written,
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>

Thanks.

Kevin
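
The ordering and error handling discussed in this reply can be tried out in isolation with a small model. It is a sketch under assumptions: `ImageModel` and `model_inactivate()` are hypothetical, and the two flush results are injected as plain integers instead of calling `qcow2_cache_flush()`. The control flow mirrors the quoted `qcow2_inactivate()` hunk.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the image state; not QEMU's BDRVQcow2State. */
typedef struct ImageModel {
    int l2_flush_ret;        /* 0, or the negative errno the L2 flush returns */
    int refcount_flush_ret;  /* same for the refcount block cache flush */
    bool marked_clean;       /* set where qcow2_mark_clean() would be called */
} ImageModel;

static int model_inactivate(ImageModel *img)
{
    int ret, result = 0;

    ret = img->l2_flush_ret;
    if (ret) {
        result = ret;
        fprintf(stderr, "Failed to flush the L2 table cache: %s\n",
                strerror(-ret));
    }

    ret = img->refcount_flush_ret;
    if (ret) {
        /* If both flushes fail, the second error code is returned;
         * both are reported on stderr. */
        result = ret;
        fprintf(stderr, "Failed to flush the refcount block cache: %s\n",
                strerror(-ret));
    }

    /* Marking the image clean comes last, and only if both flushes
     * succeeded, so a failure in between cannot leave an image that
     * is flagged clean but has unwritten metadata. */
    if (result == 0) {
        img->marked_clean = true;
    }

    return result;
}
```

With `{ -EIO, -ENOSPC }` as the injected results, the function returns `-ENOSPC` and leaves `marked_clean` false, which is the "pick any" behaviour described above.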

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 21:06   ` Eric Blake
@ 2016-01-11 15:49     ` Markus Armbruster
  2016-01-11 16:05       ` Kevin Wolf
  2016-01-11 16:22     ` Kevin Wolf
  1 sibling, 1 reply; 99+ messages in thread
From: Markus Armbruster @ 2016-01-11 15:49 UTC (permalink / raw)
  To: Eric Blake; +Cc: Kevin Wolf, qemu-devel, qemu-block, mreitz

Eric Blake <eblake@redhat.com> writes:

> On 12/22/2015 09:46 AM, Kevin Wolf wrote:
>> This patch extends qemu-img for working with locked images. It prints a
>> helpful error message when trying to access a locked image read-write,
>> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
>> -r all --force' option in order to override a lock left behind after a
>> qemu crash.
>> 
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>>  include/block/block.h |  1 +
>>  include/qapi/error.h  |  1 +
>>  qapi/common.json      |  3 +-
>>  qemu-img-cmds.hx      | 10 ++++--
>>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
>>  qemu-img.texi         | 20 ++++++++++-
>>  6 files changed, 113 insertions(+), 18 deletions(-)
>> 
>
>> +++ b/include/qapi/error.h
>> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
>>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
>>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
>>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
>> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
>>  } ErrorClass;
>
> Wow - a new ErrorClass.  It's been a while since we could justify one of
> these, but I think you might have found a case.

Spell out the rationale for the new ErrorClass, please.

[...]

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-11 15:49     ` Markus Armbruster
@ 2016-01-11 16:05       ` Kevin Wolf
  2016-01-12 15:20         ` Markus Armbruster
  0 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 16:05 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel, qemu-block, mreitz

Am 11.01.2016 um 16:49 hat Markus Armbruster geschrieben:
> Eric Blake <eblake@redhat.com> writes:
> 
> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> >> This patch extends qemu-img for working with locked images. It prints a
> >> helpful error message when trying to access a locked image read-write,
> >> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> >> -r all --force' option in order to override a lock left behind after a
> >> qemu crash.
> >> 
> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >> ---
> >>  include/block/block.h |  1 +
> >>  include/qapi/error.h  |  1 +
> >>  qapi/common.json      |  3 +-
> >>  qemu-img-cmds.hx      | 10 ++++--
> >>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
> >>  qemu-img.texi         | 20 ++++++++++-
> >>  6 files changed, 113 insertions(+), 18 deletions(-)
> >> 
> >
> >> +++ b/include/qapi/error.h
> >> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
> >>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
> >>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
> >>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
> >> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
> >>  } ErrorClass;
> >
> > Wow - a new ErrorClass.  It's been a while since we could justify one of
> > these, but I think you might have found a case.
> 
> Spell out the rationale for the new ErrorClass, please.

Action to be taken for this error class: Decide whether the lock is a
leftover from a previous qemu run that ended in an unclean shutdown. If
so, retry with overriding the lock.

Currently used by qemu-img when ordered to override a lock. libvirt
will need to do the same.

Alternative design without a new error class: Accept the
override-lock=on option on the block.c level (QAPI BlockdevOptionsBase)
and ignore it silently for image formats that don't support locking
(i.e. anything but qcow2), so that qemu-img and libvirt can set this
option unconditionally even if they don't know that the image is locked
(but libvirt would have to check first if qemu supports the option at
all in order to maintain compatibility with old qemu versions).

In my opinion, this would be worse design than adding an error class. It
would also prevent libvirt from logging when it forcefully unlocks an
image.

Kevin
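
To make the "action to be taken" above concrete, here is a sketch of the decision a management layer might make when it sees the new error class. Everything in it is hypothetical: `ImageRecord` stands for whatever bookkeeping the management tool keeps about its own VMs, and `decide_on_locked_image()` is not a libvirt or qemu function.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum LockDecision {
    DECIDE_RESPECT,   /* treat the image as in use and fail the open */
    DECIDE_OVERRIDE   /* retry the open with the lock override set */
} LockDecision;

/* Hypothetical bookkeeping a management tool keeps per image. */
typedef struct ImageRecord {
    bool opened_by_our_vm;      /* one of our VMs had it open read-write */
    bool our_vm_still_running;  /* and that VM is still alive */
} ImageRecord;

/* Called after an open attempt failed with the ImageFileLocked class. */
static LockDecision decide_on_locked_image(const char *name,
                                           const ImageRecord *rec)
{
    if (rec->opened_by_our_vm && !rec->our_vm_still_running) {
        /* Leftover lock from an unclean shutdown of our own VM. A
         * distinct error class lets the tool log that it is forcing
         * the unlock. */
        fprintf(stderr, "overriding stale lock on %s\n", name);
        return DECIDE_OVERRIDE;
    }
    return DECIDE_RESPECT;
}
```

The sketch assumes the bookkeeping is reliable, which is exactly what is questioned elsewhere in this thread for the node power loss case.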

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2015-12-22 21:06   ` Eric Blake
  2016-01-11 15:49     ` Markus Armbruster
@ 2016-01-11 16:22     ` Kevin Wolf
  1 sibling, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 16:22 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, mreitz

Am 22.12.2015 um 22:06 hat Eric Blake geschrieben:
> On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> > This patch extends qemu-img for working with locked images. It prints a
> > helpful error message when trying to access a locked image read-write,
> > and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> > -r all --force' option in order to override a lock left behind after a
> > qemu crash.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  include/block/block.h |  1 +
> >  include/qapi/error.h  |  1 +
> >  qapi/common.json      |  3 +-
> >  qemu-img-cmds.hx      | 10 ++++--
> >  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
> >  qemu-img.texi         | 20 ++++++++++-
> >  6 files changed, 113 insertions(+), 18 deletions(-)
> > 
> 
> > +++ b/include/qapi/error.h
> > @@ -102,6 +102,7 @@ typedef enum ErrorClass {
> >      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
> >      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
> >      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
> > +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
> >  } ErrorClass;
> 
> Wow - a new ErrorClass.  It's been a while since we could justify one of
> these, but I think you might have found a case.

Yes, I think libvirt will need it as well.

That's the whole reason why I consider your input so important for this
series. If a crashed qemu can leave locked images around, libvirt needs
to know how to deal with it. I don't think it's hard to implement at
least something basic, but ideally it should be there before the qemu
release that introduces the locks.

> >  /*
> > diff --git a/qapi/common.json b/qapi/common.json
> > index 9353a7b..1bf6e46 100644
> > --- a/qapi/common.json
> > +++ b/qapi/common.json
> > @@ -27,7 +27,8 @@
> >  { 'enum': 'QapiErrorClass',
> >    # Keep this in sync with ErrorClass in error.h
> >    'data': [ 'GenericError', 'CommandNotFound', 'DeviceEncrypted',
> > -            'DeviceNotActive', 'DeviceNotFound', 'KVMMissingCap' ] }
> > +            'DeviceNotActive', 'DeviceNotFound', 'KVMMissingCap',
> > +            'ImageFileLocked' ] }
> 
> Missing documentation of the new value; should be something like:
> 
> # @ImageFileLocked: the requested operation attempted to write to an
> #    image locked for writing by another process (since 2.6)

Right, thanks for catching that.

> >  
> > +DEF("force-unlock", img_force_unlock,
> > +    "force_unlock [-f fmt] filename")
> 
> So is it force-unlock or force_unlock?  It's our first two-word command
> on the qemu-img CLI, but I strongly prefer '-' (hitting the shift key
> mid-word is a bother for CLI usage).

Just a typo (the underscore is only in the help message).

> > +++ b/qemu-img.c
> > @@ -47,6 +47,7 @@ typedef struct img_cmd_t {
> >  enum {
> >      OPTION_OUTPUT = 256,
> >      OPTION_BACKING_CHAIN = 257,
> > +    OPTION_FORCE = 258,
> >  };
> 
> May conflict with Daniel's proposed patches; I'm sure you two can sort
> out the problems.
> 
> > @@ -206,12 +207,34 @@ static BlockBackend *img_open(const char *id, const char *filename,
> >      Error *local_err = NULL;
> >      QDict *options = NULL;
> >  
> > +    options = qdict_new();
> >      if (fmt) {
> > -        options = qdict_new();
> >          qdict_put(options, "driver", qstring_from_str(fmt));
> >      }
> > +    QINCREF(options);
> >  
> >      blk = blk_new_open(id, filename, NULL, options, flags, &local_err);
> > +    if (!blk && error_get_class(local_err) == ERROR_CLASS_IMAGE_FILE_LOCKED) {
> > +        if (force) {
> > +            qdict_put(options, BDRV_OPT_OVERRIDE_LOCK, qstring_from_str("on"));
> 
> I guess it's safer to try without override and then re-issue with it,
> only when needed, rather than treating 'force' as blindly turning on
> override even when it is not needed to avoid the need for reissuing
> commands.  And probably not observable to the user which of the two
> approaches you use (same end results).

The reason here is that only qcow2 supports locks (and therefore the
lock override). Any other image format would result in an "unknown
option" error.

> > +            blk = blk_new_open(id, filename, NULL, options, flags, NULL);
> 
> Can't the second attempt still fail, for some other reason?  I think
> passing NULL for errp is risky here.  I guess you're saved by the fact
> that blk_new_open() should always return NULL if an error would have
> been set, and that you want to favor reporting the original failure
> (with the class ERROR_CLASS_IMAGE_FILE_LOCKED) rather than the
> second-attempt failure.

I think this might be from before I introduced the error class, and I
wanted to return the original error instead of a potential "unknown
option" error that is introduced here. But I think we can assume that
any driver that returns ERROR_CLASS_IMAGE_FILE_LOCKED also supports the
image override option.

So it seems to me that I should now favour the later error.

> > +            if (blk) {
> > +                error_free(local_err);
> > +            }
> > +        } else {
> > +            error_report("The image file '%s' is locked and cannot be "
> > +                         "opened for write access as this may cause image "
> > +                         "corruption.", filename);
> 
> This completely discards the information in local_err.  Of course, I
> don't know what information you are proposing to store for the actual
> advisory lock extension header.  But let's suppose it were to include
> hostname+pid information on who claimed the lock, rather than just a
> single lock bit.  That additional information in local_err may well be
> worth reporting here rather than just discarding it all.

Hm. Yes. I did it this way to avoid the override-lock=on message, which
isn't helpful in a qemu-img context. Maybe I can somehow drop/replace
the hint of the errp and then use that?

> > +            error_report("If it is locked in error (e.g. because "
> > +                         "of an unclean shutdown) and you are sure that no "
> > +                         "other processes are working on the image file, you "
> > +                         "can use 'qemu-img force-unlock' or the --force flag "
> > +                         "for 'qemu-img check' in order to override this "
> > +                         "check.");
> 
> Long line; I don't know if we want to insert intermediate line breaks.
> Markus may have more opinions on what this should look like.

I don't think we assume any specific terminal size and line breaks in
the wrong place are probably worse than no line breaks at all.

> > +static int img_force_unlock(int argc, char **argv)
> > +{
> > +    BlockBackend *blk;
> > +    const char *format = NULL;
> > +    const char *filename;
> > +    char c;
> > +
> > +    for (;;) {
> > +        c = getopt(argc, argv, "hf:");
> > +        if (c == -1) {
> > +            break;
> > +        }
> > +        switch (c) {
> > +        case '?':
> > +        case 'h':
> > +            help();
> > +            break;
> > +        case 'f':
> > +            format = optarg;
> > +            break;
> 
> Depending on what we decide for Daniel's patches, you may not even want
> a -f here, but always treat this as a new-style command that only takes
> QemuOpts style parsing of a positional parameter.  Right now, I'm
> leaning towards his v3 design (all older sub-commands gain a boolean
> flag that says whether the positional parameters are literal filenames
> or specific QemuOpts strings), but since your subcommand is new, we
> don't have to cater to the older style.

I think the old syntax is friendlier for human users; or at least for
the less experienced ones.

> > +++ b/qemu-img.texi
> > @@ -117,7 +117,7 @@ Skip the creation of the target volume
> >  Command description:
> >  
> >  @table @option
> > -@item check [-f @var{fmt}] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
> > +@item check [-q] [-f @var{fmt}] [--force] [--output=@var{ofmt}] [-r [leaks | all]] [-T @var{src_cache}] @var{filename}
> 
> Where did -q come from?

From an earlier patch that forgot to update the docs. Should I make that
a separate patch?

> >  
> > +@item force-unlock [-f @var{fmt}] @var{filename}
> 
> Okay - most of your patch used the sane spelling; it was just the one
> spot I found that used force_unlock incorrectly.
> 
> > +
> > +Read-write disk images can generally be safely opened only from a single
> > +process at the same time. In order to protect against corruption from
> > +neglecting to follow this rule, qcow2 images are automatically flagged as
> > +in use when they are opened and the flag is removed again on a clean
> > +shutdown.
> > +
> > +However, in cases of an unclean shutdown, the image might be still marked as in
> > +use so that any further read-write access is prohibited. You can use the
> > +@code{force-unlock} command to manually remove the in-use flag then.
> > +
> 
> Looks reasonable.  I do think I found enough things, though, that it
> will require a v2 (perhaps rebased on some other patches) before I give R-b.

Kevin
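
The retry logic discussed for img_open() can be summarised in a self-contained model: try without the override first, retry with it only when the failure is the locked-image class and the user asked for force, and report the error of the later attempt. `FakeImage`, `fake_open()` and `open_maybe_forced()` are hypothetical names for illustration. The real code goes through blk_new_open() and the BDRV_OPT_OVERRIDE_LOCK option shown in the quoted hunk, and the order of checks inside `fake_open()` is an assumption.

```c
#include <stdbool.h>

typedef enum OpenStatus {
    OPEN_OK = 0,
    OPEN_ERR_LOCKED,  /* models ERROR_CLASS_IMAGE_FILE_LOCKED */
    OPEN_ERR_OTHER    /* any other failure */
} OpenStatus;

/* Hypothetical image: 'locked' is the in-use flag left behind,
 * 'broken' models an unrelated failure that surfaces once the lock
 * check has been passed. */
typedef struct FakeImage {
    bool locked;
    bool broken;
} FakeImage;

static OpenStatus fake_open(FakeImage *img, bool override_lock)
{
    if (img->locked && !override_lock) {
        return OPEN_ERR_LOCKED;
    }
    if (img->broken) {
        return OPEN_ERR_OTHER;
    }
    img->locked = true;  /* the new opener now holds the lock */
    return OPEN_OK;
}

static OpenStatus open_maybe_forced(FakeImage *img, bool force)
{
    OpenStatus st = fake_open(img, false);

    if (st == OPEN_ERR_LOCKED && force) {
        /* Only now pass the override, so formats without locking never
         * see an option they would reject as unknown. The result of
         * this second attempt is what gets reported. */
        st = fake_open(img, true);
    }
    return st;
}
```

For an image that is both locked and broken, a forced open returns `OPEN_ERR_OTHER`, which corresponds to favouring the later error as concluded above.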

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 15:11     ` Vasiliy Tolstov
@ 2016-01-11 16:25       ` Kevin Wolf
  0 siblings, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 16:25 UTC (permalink / raw)
  To: Vasiliy Tolstov; +Cc: mreitz, qemu-devel, qemu-block, Denis V. Lunev

Am 23.12.2015 um 16:11 hat Vasiliy Tolstov geschrieben:
> 2015-12-23 18:08 GMT+03:00 Denis V. Lunev <den-lists@parallels.com>:
> > you should do this by asking running QEMU not by
> > qemu-img, which is badly wrong.
> >
> > Den
> 
> 
> Ok, if this is possible via qmp/hmp qemu, no problem.

It's not only possible, but it's the only correct way. If you use
qemu-img, you're bound to corrupt your images sooner or later. This is
exactly what this series wants to protect you against.

Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-24  5:43 ` Denis V. Lunev
@ 2016-01-11 16:33   ` Kevin Wolf
  2016-01-11 16:38     ` Denis V. Lunev
  0 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 16:33 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: qemu-devel, qemu-block, mreitz

Am 24.12.2015 um 06:43 hat Denis V. Lunev geschrieben:
> On 12/22/2015 07:46 PM, Kevin Wolf wrote:
> >Enough innocent images have died because users called 'qemu-img snapshot' while
> >the VM was still running. Educating the users doesn't seem to be a working
> >strategy, so this series adds locking to qcow2 that refuses to access the image
> >read-write from two processes.
> >
> >Eric, this will require a libvirt update to deal with qemu crashes which leave
> >locked images behind. The simplest thinkable way would be to unconditionally
> >override the lock in libvirt whenever the option is present. In that case,
> >libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> >the other way round. If you want more than that, libvirt would have to check
> >somehow if it was its own VM that used the image and left the lock behind. I
> >imagine that can't be too hard either.
> >
> >Also note that this kind of depends on Max's bdrv_close_all() series, but only
> >in order to pass test case 142. This is not a bug in this series, but a
> >preexisting one (bs->file can be closed before bs), and it becomes apparent
> >when qemu fails to unlock an image due to this bug. Max's series fixes this.
> >
> >
> This approach has a hole with qcow2_invalidate_cache()
> The lock is released (and can't be kept by design) in
> between qcow2_close()/qcow2_open() sequences if
> I understand this correctly.

qcow2_invalidate_cache() is only called with BDRV_O_INCOMING set, i.e.
the instance isn't the current user and doesn't release the lock. It
requires, however, that a previous user has released the lock, otherwise
the qcow2_open() would fail.

But even if qcow2_invalidate_cache() was called for an image that is a
user, the behaviour wouldn't be wrong. In the period while the flag is
cleared, there is no write access to the image file. If someone were
opening the image for another process at this point, they would get the
image, but again qcow2_open() would fail and we wouldn't corrupt the
image.

And finally, the goal of this series is to protect the users against
stupid mistakes, not against malicious acts. Even if there were some
windows in which the image wouldn't be protected (though I don't think
they exist), having the common case safe would already be a huge
improvement over the current state.

Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2016-01-11 16:33   ` Kevin Wolf
@ 2016-01-11 16:38     ` Denis V. Lunev
  0 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-11 16:38 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

On 01/11/2016 07:33 PM, Kevin Wolf wrote:
> Am 24.12.2015 um 06:43 hat Denis V. Lunev geschrieben:
>> On 12/22/2015 07:46 PM, Kevin Wolf wrote:
>>> Enough innocent images have died because users called 'qemu-img snapshot' while
>>> the VM was still running. Educating the users doesn't seem to be a working
>>> strategy, so this series adds locking to qcow2 that refuses to access the image
>>> read-write from two processes.
>>>
>>> Eric, this will require a libvirt update to deal with qemu crashes which leave
>>> locked images behind. The simplest thinkable way would be to unconditionally
>>> override the lock in libvirt whenever the option is present. In that case,
>>> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
>>> the other way round. If you want more than that, libvirt would have to check
>>> somehow if it was its own VM that used the image and left the lock behind. I
>>> imagine that can't be too hard either.
>>>
>>> Also note that this kind of depends on Max's bdrv_close_all() series, but only
>>> in order to pass test case 142. This is not a bug in this series, but a
>>> preexisting one (bs->file can be closed before bs), and it becomes apparent
>>> when qemu fails to unlock an image due to this bug. Max's series fixes this.
>>>
>>>
>> This approach has a hole with qcow2_invalidate_cache()
>> The lock is released (and can't be kept by design) in
>> between qcow2_close()/qcow2_open() sequences if
>> I understand this correctly.
> qcow2_invalidate_cache() is only called with BDRV_O_INCOMING set, i.e.
> the instance isn't the current user and doesn't release the lock. It
> requires, however, that a previous user has released the lock, otherwise
> the qcow2_open() would fail.
>
> But even if qcow2_invalidate_cache() was called for an image that is a
> user, the behaviour wouldn't be wrong. In the period while the flag is
> cleared, there is no write access to the image file. If someone were
> opening the image for another process at this point, they would get the
> image, but again qcow2_open() would fail and we wouldn't corrupt the
> image.
>
> And finally, the goal of this series is to protect the users against
> stupid mistakes, not against malicious acts. Even if there were some
> windows in which the image wouldn't be protected (though I don't think
> they exist), having the common case safe would already be a huge
> improvement over the current state.
>
> Kevin
But how will we recover after a node power loss (which will happen
sooner or later)?

In this case we are doomed to make a blind guess whether
we are allowed to clear the flag or not.

Den

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-24  5:41     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  2015-12-24  5:42       ` Denis V. Lunev
  2016-01-04 17:02       ` Max Reitz
@ 2016-01-11 16:47       ` Kevin Wolf
  2016-01-11 17:56         ` Daniel P. Berrange
  2 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 16:47 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Fam Zheng, qemu-devel, qemu-block, Max Reitz

Am 24.12.2015 um 06:41 hat Denis V. Lunev geschrieben:
> On 12/24/2015 02:19 AM, Max Reitz wrote:
> >So the benefits of a qcow2 flag are only minor ones. However, I
> >personally believe that automatic unlock on crash is a very minor
> >benefit as well. That should never happen in practice anyway, and a
> >crashing qemu is such a great inconvenience that I as a user wouldn't
> >really mind having to unlock the image afterwards.
> IMHO you are wrong. This is VERY important. The situation would be exactly
> the same after node poweroff, which could happen and really happens in
> the real life from time to time.
> 
> In this cases VMs should start automatically and ASAP if configured this
> way. Any manual interaction here is a REAL pain.

Yes. Your management tool should be able to cope with it.

> >In fact, libvirt could even do that manually, couldn't it? If qemu
> >crashes, it just invokes qemu-img force-unlock on any qcow2 image which
> >was attached R/W to the VM.
> 
> in the situation above libvirt does not have the information or this
> information could be unreliable.

That would be a libvirt bug then. Did you check?

A good management tool knows which VMs it had running before a host
crash. For all I know, libvirt does.

Kevin


* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2015-12-23 10:47   ` [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Daniel P. Berrange
  2015-12-23 12:15     ` [Qemu-devel] [Qemu-block] " Roman Kagan
@ 2016-01-11 17:14     ` Kevin Wolf
  2016-01-11 17:54       ` Daniel P. Berrange
  2016-01-13  8:56       ` Markus Armbruster
  1 sibling, 2 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 17:14 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: Fam Zheng, qemu-devel, qemu-block, mreitz

Am 23.12.2015 um 11:47 hat Daniel P. Berrange geschrieben:
> On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> > On Tue, 12/22 17:46, Kevin Wolf wrote:
> > > Enough innocent images have died because users called 'qemu-img snapshot' while
> > > the VM was still running. Educating the users doesn't seem to be a working
> > > strategy, so this series adds locking to qcow2 that refuses to access the image
> > > read-write from two processes.
> > > 
> > > Eric, this will require a libvirt update to deal with qemu crashes which leave
> > > locked images behind. The simplest thinkable way would be to unconditionally
> > > override the lock in libvirt whenever the option is present. In that case,
> > > libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> > > the other way round. If you want more than that, libvirt would have to check
> > > somehow if it was its own VM that used the image and left the lock behind. I
> > > imagine that can't be too hard either.
> > 
> > The motivation is great, but I'm not sure I like the side-effect that an
> > unclean shutdown will require a "forced" open, because it makes using qcow2 in
> > development cumbersome, and like you said, management/user also needs to handle
> > this explicitly. This is a bit of a personal preference, but it's strong enough
> > that I want to speak up.
> 
> Yeah, I am also not really a big fan of locking mechanisms which are not
> automatically cleaned up on process exit. On the other hand you could
> say that people who choose to run qemu-img manually are already taking
> fate into their own hands, and ending up with a dirty image on unclean
> exit is still miles better than loosing all your data.
> 
> > As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> > similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> > and a program crash will automatically drop the lock.
> 
> FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> out locks using fcntl()/lockf() on all disk images associated with a VM.

Does this actually mean that if QEMU did try to use flock(), it would
fail because libvirt is already holding the lock?

I considered adding both locking schemes (the qcow2 flag for qcow2 on
any backend; flock() for anything else on local files), but if this is
true, that's game over for any flock() based patches.

Kevin


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 1/5] block: added lock image option and callback Denis V. Lunev
  2015-12-23 23:48         ` Eric Blake
@ 2016-01-11 17:31         ` Kevin Wolf
  2016-01-11 17:58           ` Daniel P. Berrange
  2016-01-12  5:38           ` Denis V. Lunev
  1 sibling, 2 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 17:31 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Fam Zheng, Olga Krishtal, qemu-devel, Max Reitz

Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> From: Olga Krishtal <okrishtal@virtuozzo.com>
> 
> While opening the image we want to be sure that we are the
> one who works with image, anf if it is not true -
> opening the image for writing should fail.
> 
> There are 2 ways at the moment: no lock at all and lock the file
> image.
> 
> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> CC: Fam Zheng <famz@redhat.com>

As long as locking is disabled by default, it's useless and won't
prevent people from corrupting their images. These corruptions happen
exactly because people don't know how to use qemu properly. You can't
expect them to enable locking manually.

Also, you probably need to consider bdrv_reopen() and live migration.
I think live migration would be blocked if source and destination both
see the lock; which is admittedly less likely than with the qcow2 patch
(and generally a problem of this series), but with localhost migration
and potentially with some NFS setups it can be the case.

Kevin


* Re: [Qemu-devel] [PATCH 4/5] qcow2: implemented bdrv_is_opened_unclean
  2015-12-23  7:46       ` [Qemu-devel] [PATCH 4/5] qcow2: implemented bdrv_is_opened_unclean Denis V. Lunev
@ 2016-01-11 17:37         ` Kevin Wolf
  0 siblings, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 17:37 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Fam Zheng, Olga Krishtal, qemu-devel, Max Reitz

Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> From: Olga Krishtal <okrishtal@virtuozzo.com>
> 
> While opening image we save dirty state in header_unclean.
> If the image was closed incorrectly we can retrieve this fact
> using bdrv_is_opened_unclean callback.
> 
> This is necessary in case when we want to call brdv_check to
> repair dirty image.
> 
> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Max Reitz <mreitz@redhat.com>
> CC: Eric Blake <eblake@redhat.com>
> CC: Fam Zheng <famz@redhat.com>

What does this fix?

qcow2 isn't supposed to require a repair after an unclean shutdown. The
only exception is with lazy-refcounts=on, which already sets a dirty
flag in the header and triggers the repair in qcow2_open().
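
As an illustration of the dirty flag mentioned above, here is a minimal
Python sketch that reads the feature bits from a qcow2 header. It is not
taken from this thread: the offsets follow the published qcow2 format
specification (incompatible_features at bytes 72-79, compatible_features
at bytes 80-87, both big-endian), and the function name `header_flags` is
made up for the example.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"
INCOMPAT_DIRTY = 1 << 0          # incompatible_features, bit 0
COMPAT_LAZY_REFCOUNTS = 1 << 0   # compatible_features, bit 0


def header_flags(header):
    """Return version, dirty bit and lazy-refcounts bit of a qcow2 header."""
    if len(header) < 8 or header[:4] != QCOW2_MAGIC:
        raise ValueError("not a qcow2 header")
    (version,) = struct.unpack_from(">I", header, 4)
    if version < 3:
        # Version 2 headers end at byte 72 and carry no feature bits.
        return {"version": version, "dirty": False, "lazy_refcounts": False}
    if len(header) < 88:
        raise ValueError("truncated version 3 header")
    incompatible, compatible = struct.unpack_from(">QQ", header, 72)
    return {
        "version": version,
        "dirty": bool(incompatible & INCOMPAT_DIRTY),
        "lazy_refcounts": bool(compatible & COMPAT_LAZY_REFCOUNTS),
    }
```

Passing the first 104 bytes of an image file to `header_flags` is enough
for a version 3 image; a set "dirty" bit is what would make an open
trigger the refcount repair.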

Kevin


* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2016-01-11 17:14     ` [Qemu-devel] " Kevin Wolf
@ 2016-01-11 17:54       ` Daniel P. Berrange
  2016-01-13  8:56       ` Markus Armbruster
  1 sibling, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2016-01-11 17:54 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Fam Zheng, qemu-devel, qemu-block, mreitz

On Mon, Jan 11, 2016 at 06:14:15PM +0100, Kevin Wolf wrote:
> Am 23.12.2015 um 11:47 hat Daniel P. Berrange geschrieben:
> > On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
> > > On Tue, 12/22 17:46, Kevin Wolf wrote:
> > > > Enough innocent images have died because users called 'qemu-img snapshot' while
> > > > the VM was still running. Educating the users doesn't seem to be a working
> > > > strategy, so this series adds locking to qcow2 that refuses to access the image
> > > > read-write from two processes.
> > > > 
> > > > Eric, this will require a libvirt update to deal with qemu crashes which leave
> > > > locked images behind. The simplest thinkable way would be to unconditionally
> > > > override the lock in libvirt whenever the option is present. In that case,
> > > > libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> > > > the other way round. If you want more than that, libvirt would have to check
> > > > somehow if it was its own VM that used the image and left the lock behind. I
> > > > imagine that can't be too hard either.
> > > 
> > > The motivation is great, but I'm not sure I like the side-effect that an
> > > unclean shutdown will require a "forced" open, because it makes using qcow2 in
> > > development cumbersome, and like you said, management/user also needs to handle
> > > this explicitly. This is a bit of a personal preference, but it's strong enough
> > > that I want to speak up.
> > 
> > Yeah, I am also not really a big fan of locking mechanisms which are not
> > automatically cleaned up on process exit. On the other hand you could
> > say that people who choose to run qemu-img manually are already taking
> > fate into their own hands, and ending up with a dirty image on unclean
> > exit is still miles better than loosing all your data.
> > 
> > > As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
> > > similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
> > > and a program crash will automatically drop the lock.
> > 
> > FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
> > out locks using fcntl()/lockf() on all disk images associated with a VM.
> 
> Does this actually mean that if QEMU did try to use flock(), it would
> fail because libvirt is already holding the lock?

It depends on the configuration of virtlockd, but out of the box
it currently uses fcntl() against the disk image directly. fcntl()
uses a different lockspace from flock(), so their locks are invisible
to each other, except on NFS, where Linux apparently rewrites flock()
into fcntl(). So yeah, for at least NFS it would fail because libvirt
already holds the lock out of the box.
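
To make the lockspace point concrete, here is a minimal Python sketch
(not from this thread; it assumes Linux and a local file system). The
helper names `try_lock_in_child` and `demo` are made up for the
example. The parent holds a flock() lock while a child process first
tries flock() and then an fcntl()-style lock through Python's
`fcntl.lockf`, which wraps the fcntl() locking calls.

```python
import fcntl
import os
import tempfile


def try_lock_in_child(path, use_flock):
    """Fork a child that tries a non-blocking exclusive lock on path.

    Returns True if the child got the lock, False if it was refused.
    """
    pid = os.fork()
    if pid == 0:
        # Child: open its own file description and try to lock it.
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            if use_flock:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            else:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0
        except OSError:
            status = 1
        finally:
            os._exit(status)  # never return into the parent's code
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


def demo():
    """Return (flock_conflicts, fcntl_conflicts) against a held flock()."""
    with tempfile.NamedTemporaryFile() as image:
        fcntl.flock(image.fileno(), fcntl.LOCK_EX)  # parent holds flock()
        flock_conflicts = not try_lock_in_child(image.name, use_flock=True)
        fcntl_conflicts = not try_lock_in_child(image.name, use_flock=False)
        return flock_conflicts, fcntl_conflicts
```

On a local Linux file system the second flock() is expected to be
refused while the fcntl() lock is granted; on NFS both attempts may be
refused, for the reason given above.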

> I considered adding both locking schemes (the qcow2 flag for qcow2 on
> any backend; flock() for anything else on local files), but if this is
> true, that's game over for any flock() based patches.

Yeah if QEMU attempts to flock/fcntl that's going to cause trouble
unless it is disabled by default, in which case libvirt could simply
never enable it in order to avoid the clash.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2016-01-11 16:47       ` Kevin Wolf
@ 2016-01-11 17:56         ` Daniel P. Berrange
  0 siblings, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2016-01-11 17:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Max Reitz, Fam Zheng, qemu-devel, qemu-block, Denis V. Lunev

On Mon, Jan 11, 2016 at 05:47:06PM +0100, Kevin Wolf wrote:
> Am 24.12.2015 um 06:41 hat Denis V. Lunev geschrieben:
> > On 12/24/2015 02:19 AM, Max Reitz wrote:
> > >So the benefits of a qcow2 flag are only minor ones. However, I
> > >personally believe that automatic unlock on crash is a very minor
> > >benefit as well. That should never happen in practice anyway, and a
> > >crashing qemu is such a great inconvenience that I as a user wouldn't
> > >really mind having to unlock the image afterwards.
> > IMHO you are wrong. This is VERY important. The situation would be exactly
> > the same after node poweroff, which could happen and really happens in
> > the real life from time to time.
> > 
> > In this cases VMs should start automatically and ASAP if configured this
> > way. Any manual interaction here is a REAL pain.
> 
> Yes. Your management tool should be able to cope with it.
> 
> > >In fact, libvirt could even do that manually, couldn't it? If qemu
> > >crashes, it just invokes qemu-img force-unlock on any qcow2 image which
> > >was attached R/W to the VM.
> > 
> > in the situation above libvirt does not have the information or this
> > information could be unreliable.
> 
> That would be a libvirt bug then. Did you check?
> 
> A good management tool knows which VMs it had running before a host
> crash. For all I know, libvirt does.

Dealing with recovery after a host crash is out of scope for libvirt. This
is the responsibility of the higher level mgmt tool, which should be using
some kind of reliable clustering & fencing technology to ensure safety
(i.e. via STONITH capability) even in the face of misbehaving hardware.

Regards,
Daniel


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-11 17:31         ` Kevin Wolf
@ 2016-01-11 17:58           ` Daniel P. Berrange
  2016-01-11 18:35             ` Kevin Wolf
  2016-01-12  5:38           ` Denis V. Lunev
  1 sibling, 1 reply; 99+ messages in thread
From: Daniel P. Berrange @ 2016-01-11 17:58 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Denis V. Lunev, Max Reitz, Fam Zheng, qemu-devel, Olga Krishtal

On Mon, Jan 11, 2016 at 06:31:06PM +0100, Kevin Wolf wrote:
> Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> > From: Olga Krishtal <okrishtal@virtuozzo.com>
> > 
> > While opening the image we want to be sure that we are the
> > one who works with image, anf if it is not true -
> > opening the image for writing should fail.
> > 
> > There are 2 ways at the moment: no lock at all and lock the file
> > image.
> > 
> > Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Max Reitz <mreitz@redhat.com>
> > CC: Eric Blake <eblake@redhat.com>
> > CC: Fam Zheng <famz@redhat.com>
> 
> As long as locking is disabled by default, it's useless and won't
> prevent people from corrupting their images. These corruptions happen
> exactly because people don't know how to use qemu properly. You can't
> expect them to enable locking manually.
> 
> Also, you probably need to consider bdrv_reopen() and live migration.
> I think live migration would be blocked if source and destination both
> see the lock; which is admittedly less likely than with the qcow2 patch
> (and generally a problem of this series), but with localhost migration
> and potentially with some NFS setups it can be the case.

Note that when libvirt does locking it will release locks when a VM
is paused, and acquire locks prior to resuming CPUs. This allows live
migration to work because you never have CPUs running on both src + dst
at the same time. This does mean that libvirt does not allow QEMU to
automatically re-start CPUs when migration completes, as it needs to
take some action to acquire locks before allowing the dst to start
running again.
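
The hand-over described above can be sketched as a toy model in Python.
This is only an illustration under stated assumptions: `LockSpace`,
`Guest` and `migrate` are hypothetical names, the lock manager is an
in-memory dictionary, and nothing here is libvirt's or virtlockd's
actual API.

```python
class LockSpace:
    """Toy lock manager: at most one owner per image."""

    def __init__(self):
        self.owners = {}

    def acquire(self, image, owner):
        holder = self.owners.get(image)
        if holder not in (None, owner):
            raise RuntimeError(f"{image} is locked by {holder}")
        self.owners[image] = owner

    def release(self, image, owner):
        if self.owners.get(image) == owner:
            del self.owners[image]


class Guest:
    """Toy guest whose image lock follows its CPU run state."""

    def __init__(self, name, image, locks):
        self.name = name
        self.image = image
        self.locks = locks
        self.cpus_running = False

    def resume(self):
        # Take the lock first; CPUs only start if that succeeded.
        self.locks.acquire(self.image, self.name)
        self.cpus_running = True

    def pause(self):
        # Stop CPUs first, then give up the lock.
        self.cpus_running = False
        self.locks.release(self.image, self.name)


def migrate(src, dst):
    """Hand the image over: CPUs never run on both sides at once."""
    src.pause()
    dst.resume()
```

In this model the destination cannot resume while the source still
holds the lock, which is why the management layer, not QEMU, has to
restart the CPUs after migration.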

Regards,
Daniel


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-11 17:58           ` Daniel P. Berrange
@ 2016-01-11 18:35             ` Kevin Wolf
  2016-01-13  8:52               ` Markus Armbruster
  2016-01-13  9:51               ` Daniel P. Berrange
  0 siblings, 2 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-11 18:35 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Denis V. Lunev, Max Reitz, Fam Zheng, qemu-devel, Olga Krishtal

Am 11.01.2016 um 18:58 hat Daniel P. Berrange geschrieben:
> On Mon, Jan 11, 2016 at 06:31:06PM +0100, Kevin Wolf wrote:
> > Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> > > From: Olga Krishtal <okrishtal@virtuozzo.com>
> > > 
> > > While opening the image we want to be sure that we are the
> > > one who works with image, anf if it is not true -
> > > opening the image for writing should fail.
> > > 
> > > There are 2 ways at the moment: no lock at all and lock the file
> > > image.
> > > 
> > > Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > CC: Kevin Wolf <kwolf@redhat.com>
> > > CC: Max Reitz <mreitz@redhat.com>
> > > CC: Eric Blake <eblake@redhat.com>
> > > CC: Fam Zheng <famz@redhat.com>
> > 
> > As long as locking is disabled by default, it's useless and won't
> > prevent people from corrupting their images. These corruptions happen
> > exactly because people don't know how to use qemu properly. You can't
> > expect them to enable locking manually.
> > 
> > Also, you probably need to consider bdrv_reopen() and live migration.
> > I think live migration would be blocked if source and destination both
> > see the lock; which is admittedly less likely than with the qcow2 patch
> > (and generally a problem of this series), but with localhost migration
> > and potentially with some NFS setups it can be the case.
> 
> Note that when libvirt does locking it will release locks when a VM
> is paused, and acquire locks prior to resuming CPUs. This allows live
> migration to work because you never have CPUs running on both src + dst
> at the same time. This does mean that libvirt does not allow QEMU to
> automatically re-start CPUs when migration completes, as it needs to
> take some action to acquire locks before allowing the dst to start
> running again.

This assumes that block devices can only be written to if CPUs are
running. In the days of qemu 0.9, this was probably right, but with
things like block jobs and built-in NBD servers, I wouldn't be as sure
these days.

Kevin


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-11 17:31         ` Kevin Wolf
  2016-01-11 17:58           ` Daniel P. Berrange
@ 2016-01-12  5:38           ` Denis V. Lunev
  2016-01-12 10:10             ` Kevin Wolf
  1 sibling, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-12  5:38 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Fam Zheng, Olga Krishtal, qemu-devel, Max Reitz

On 01/11/2016 08:31 PM, Kevin Wolf wrote:
> Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
>> From: Olga Krishtal <okrishtal@virtuozzo.com>
>>
>> While opening the image we want to be sure that we are the
>> one who works with image, anf if it is not true -
>> opening the image for writing should fail.
>>
>> There are 2 ways at the moment: no lock at all and lock the file
>> image.
>>
>> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Max Reitz <mreitz@redhat.com>
>> CC: Eric Blake <eblake@redhat.com>
>> CC: Fam Zheng <famz@redhat.com>
> As long as locking is disabled by default, it's useless and won't
> prevent people from corrupting their images. These corruptions happen
> exactly because people don't know how to use qemu properly. You can't
> expect them to enable locking manually.
You are right, though enabling locking is not a big problem.
If there are several mechanisms, we could record which one to use
in the qcow2 header.


> Also, you probably need to consider bdrv_reopen() and live migration.
> I think live migration would be blocked if source and destination both
> see the lock; which is admittedly less likely than with the qcow2 patch
> (and generally a problem of this series), but with localhost migration
> and potentially with some NFS setups it can be the case.
>
> Kevin
For live migration the situation should not be that problematic.
The disk is opened in RW mode only on one node. Am I right?

The lock is taken only when the image is opened in RW mode.

Den


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12  5:38           ` Denis V. Lunev
@ 2016-01-12 10:10             ` Kevin Wolf
  2016-01-12 11:33               ` Fam Zheng
  0 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-12 10:10 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Fam Zheng, Olga Krishtal, qemu-devel, Max Reitz

Am 12.01.2016 um 06:38 hat Denis V. Lunev geschrieben:
> On 01/11/2016 08:31 PM, Kevin Wolf wrote:
> >Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> >>From: Olga Krishtal <okrishtal@virtuozzo.com>
> >>
> >>While opening the image we want to be sure that we are the
> >>one who works with image, anf if it is not true -
> >>opening the image for writing should fail.
> >>
> >>There are 2 ways at the moment: no lock at all and lock the file
> >>image.
> >>
> >>Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> >>Signed-off-by: Denis V. Lunev <den@openvz.org>
> >>CC: Kevin Wolf <kwolf@redhat.com>
> >>CC: Max Reitz <mreitz@redhat.com>
> >>CC: Eric Blake <eblake@redhat.com>
> >>CC: Fam Zheng <famz@redhat.com>
> >As long as locking is disabled by default, it's useless and won't
> >prevent people from corrupting their images. These corruptions happen
> >exactly because people don't know how to use qemu properly. You can't
> >expect them to enable locking manually.
> You are right. Though this is not a big problem to enable locking.
> If there are several mechanics, we could save one into the
> qcow2 header.

The problem is that libvirt already takes a lock, as Dan mentioned in
another reply in this thread, so we can't enable locking in qemu by
default. It would always fail when run under libvirt.

Unless I'm seriously mistaken, this means that flock() inside qemu is
dead.

> >Also, you probably need to consider bdrv_reopen() and live migration.
> >I think live migration would be blocked if source and destination both
> >see the lock; which is admittedly less likely than with the qcow2 patch
> >(and generally a problem of this series), but with localhost migration
> >and potentially with some NFS setups it can be the case.
> >
> >Kevin
> for live migration the situation should be not that problematic.
> The disk is opened in RW mode only on one node. Am I right?
> The lock is taken when the image is opened in RW mode only.

You wish. :-)

That's exactly the reason why my patch series contains all these patches
with inactivate/invalidate_cache. These patches still don't avoid that
the file descriptor is opened r/w on both sides, but at least they
ensure that it's clearly defined who actually uses it to write.
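
A toy Python model of that idea, for illustration only: both sides keep
a handle that is open read-write, but a flag decides which one may
actually write. The class `ImageHandle` and the function
`complete_migration` are hypothetical; the method names merely echo the
patch titles, and this is not how the QEMU block layer is implemented.

```python
class ImageHandle:
    """Toy handle: always open read-write, but only writable when active."""

    def __init__(self, name, incoming=False):
        self.name = name
        # A migration destination starts out inactive.
        self.inactive = incoming
        self.writes = []

    def write(self, data):
        if self.inactive:
            raise PermissionError(f"{self.name} is inactive, write refused")
        self.writes.append(data)

    def inactivate(self):
        # Source side: stop being the writer once migration completes.
        self.inactive = True

    def invalidate_cache(self):
        # Destination side: drop stale state and become the writer.
        self.inactive = False


def complete_migration(src, dst):
    """Transfer write ownership; at no point are both handles active."""
    src.inactivate()
    dst.invalidate_cache()
```

The ordering in `complete_migration` is the point of the sketch: the
source gives up write ownership before the destination takes it.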

Kevin


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 10:10             ` Kevin Wolf
@ 2016-01-12 11:33               ` Fam Zheng
  2016-01-12 12:24                 ` Denis V. Lunev
                                   ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Fam Zheng @ 2016-01-12 11:33 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Max Reitz, Olga Krishtal, Denis V. Lunev

On Tue, 01/12 11:10, Kevin Wolf wrote:
> 
> The problem is that libvirt already takes a lock, as Dan mentioned in
> another reply in this thread, so we can't enable locking in qemu by
> default. It would always fail when run under libvirt.
> 
> Unless I'm seriously mistaken, this means that flock() inside qemu is
> dead.

Yes, I see the problem with libvirt, but can we instead do these?

  1) Do a soft flock() in QEMU invocation. If it fails, silently ignore.
  2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.

This way, if libvirt is holding flock, we can assume libvirt is actually
"using" the image: 1) just works as before, but 2) will not break the qcow2.
That is still a slight improvement, and does solve the reckless "qemu-img
snapshot create" user's problem.
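
The two policies above can be sketched in a few lines of Python. This
is a minimal illustration, not proposed QEMU code: `lock_image` and its
`hard` parameter are made-up names, and the sketch assumes Linux
flock() semantics.

```python
import fcntl


def lock_image(fd, hard):
    """Try a non-blocking exclusive flock() on fd.

    hard=False (the "QEMU" policy): return False on conflict and carry on.
    hard=True (the "qemu-img" policy): let the error propagate to the caller.
    """
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        if hard:
            raise
        return False
```

A caller following policy 2) would report the `BlockingIOError` and
exit, while a caller following policy 1) would ignore a `False` return
and open the image as before.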

Fam


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 11:33               ` Fam Zheng
@ 2016-01-12 12:24                 ` Denis V. Lunev
  2016-01-12 12:28                 ` Kevin Wolf
  2016-01-12 15:59                 ` Denis V. Lunev
  2 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-12 12:24 UTC (permalink / raw)
  To: Fam Zheng, Kevin Wolf; +Cc: Olga Krishtal, qemu-devel, Max Reitz

On 01/12/2016 02:33 PM, Fam Zheng wrote:
> On Tue, 01/12 11:10, Kevin Wolf wrote:
>> The problem is that libvirt already takes a lock, as Dan mentioned in
>> another reply in this thread, so we can't enable locking in qemu by
>> default. It would always fail when run under libvirt.
>>
>> Unless I'm seriously mistaken, this means that flock() inside qemu is
>> dead.
> Yes, I see the problem with libvirt, but can we instead do these?
>
>    1) Do a soft flock() in QEMU invocation. If it fails, sliently ignore.
>    2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
>
> This way, if libvirt is holding flock, we can assume libvirt is actually
> "using" the image: 1) just works as before, but 2) will not break the qcow2.
> That is still a slight improvement, and does solve the reckless "qemu-img
> snapshot create" user's problem.
>
> Fam
assuming that we should lock "by default".

This looks good to me.

Den


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 11:33               ` Fam Zheng
  2016-01-12 12:24                 ` Denis V. Lunev
@ 2016-01-12 12:28                 ` Kevin Wolf
  2016-01-12 13:17                   ` Fam Zheng
  2016-01-12 15:59                 ` Denis V. Lunev
  2 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-12 12:28 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Max Reitz, Olga Krishtal, Denis V. Lunev

Am 12.01.2016 um 12:33 hat Fam Zheng geschrieben:
> On Tue, 01/12 11:10, Kevin Wolf wrote:
> > 
> > The problem is that libvirt already takes a lock, as Dan mentioned in
> > another reply in this thread, so we can't enable locking in qemu by
> > default. It would always fail when run under libvirt.
> > 
> > Unless I'm seriously mistaken, this means that flock() inside qemu is
> > dead.
> 
> Yes, I see the problem with libvirt, but can we instead do these?
> 
>   1) Do a soft flock() in QEMU invocation. If it fails, sliently ignore.
>   2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
> 
> This way, if libvirt is holding flock, we can assume libvirt is actually
> "using" the image: 1) just works as before, but 2) will not break the qcow2.
> That is still a slight improvement, and does solve the reckless "qemu-img
> snapshot create" user's problem.

This makes two assumptions:

1. qemu is only ever invoked by libvirt
2. qemu-img is only ever invoked by human users

Both of them are wrong. 1. just means that manually started QEMUs are
unprotected (which is already bad), but 2. means that qemu-img called by
libvirt fails (which is obviously not acceptable).

Kevin


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 12:28                 ` Kevin Wolf
@ 2016-01-12 13:17                   ` Fam Zheng
  2016-01-12 13:24                     ` Daniel P. Berrange
  0 siblings, 1 reply; 99+ messages in thread
From: Fam Zheng @ 2016-01-12 13:17 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Max Reitz, Olga Krishtal, Denis V. Lunev

On Tue, 01/12 13:28, Kevin Wolf wrote:
> Am 12.01.2016 um 12:33 hat Fam Zheng geschrieben:
> > On Tue, 01/12 11:10, Kevin Wolf wrote:
> > > 
> > > The problem is that libvirt already takes a lock, as Dan mentioned in
> > > another reply in this thread, so we can't enable locking in qemu by
> > > default. It would always fail when run under libvirt.
> > > 
> > > Unless I'm seriously mistaken, this means that flock() inside qemu is
> > > dead.
> > 
> > Yes, I see the problem with libvirt, but can we instead do these?
> > 
> >   1) Do a soft flock() in QEMU invocation. If it fails, sliently ignore.
> >   2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
> > 
> > This way, if libvirt is holding flock, we can assume libvirt is actually
> > "using" the image: 1) just works as before, but 2) will not break the qcow2.
> > That is still a slight improvement, and does solve the reckless "qemu-img
> > snapshot create" user's problem.
> 
> This makes two assumptions:
> 
> 1. qemu is only ever invoked by libvirt
> 2. qemu-img is only ever invoked by human users
> 
> Both of them are wrong. 1. just means that manually started QEMUs are
> unprotected (which is already bad), but 2. means that qemu-img called by
> libvirt fails (which is obviously not acceptable).

No, my assumptions are:

a. libvirt calls flock() when invoking qemu;
b. libvirt doesn't call flock() when invoking qemu-img; (if I read the libvirt
   code correctly, input from libvirt folks needed);

So, 1. means that multiple manually started QEMU instances writing to the same
image are NOT protected, that's the limitation. 2. means that qemu-img called
by either libvirt or a human user is prevented from corrupting an in-use qcow2.

As long as I'm not wrong about b.

Fam


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 13:17                   ` Fam Zheng
@ 2016-01-12 13:24                     ` Daniel P. Berrange
  2016-01-13  0:08                       ` Fam Zheng
  0 siblings, 1 reply; 99+ messages in thread
From: Daniel P. Berrange @ 2016-01-12 13:24 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Kevin Wolf, qemu-devel, Max Reitz, Olga Krishtal, Denis V. Lunev

On Tue, Jan 12, 2016 at 09:17:51PM +0800, Fam Zheng wrote:
> On Tue, 01/12 13:28, Kevin Wolf wrote:
> > Am 12.01.2016 um 12:33 hat Fam Zheng geschrieben:
> > > On Tue, 01/12 11:10, Kevin Wolf wrote:
> > > > 
> > > > The problem is that libvirt already takes a lock, as Dan mentioned in
> > > > another reply in this thread, so we can't enable locking in qemu by
> > > > default. It would always fail when run under libvirt.
> > > > 
> > > > Unless I'm seriously mistaken, this means that flock() inside qemu is
> > > > dead.
> > > 
> > > Yes, I see the problem with libvirt, but can we instead do these?
> > > 
> > >   1) Do a soft flock() in QEMU invocation. If it fails, silently ignore.
> > >   2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
> > > 
> > > This way, if libvirt is holding flock, we can assume libvirt is actually
> > > "using" the image: 1) just works as before, but 2) will not break the qcow2.
> > > That is still a slight improvement, and does solve the reckless "qemu-img
> > > snapshot create" user's problem.
> > 
> > This makes two assumptions:
> > 
> > 1. qemu is only ever invoked by libvirt
> > 2. qemu-img is only ever invoked by human users
> > 
> > Both of them are wrong. 1. just means that manually started QEMUs are
> > unprotected (which is already bad), but 2. means that qemu-img called by
> > libvirt fails (which is obviously not acceptable).
> 
> No, my assumptions are:
> 
> a. libvirt calls flock() when invoking qemu;
> b. libvirt doesn't call flock() when invoking qemu-img; (if I read the libvirt
>    code correctly, input from libvirt folks needed);

b. is /currently/ true, but I wouldn't guarantee that will always be
true, because we have (vague) plans to extend our locking infrastructure
to cover our storage pool APIs too, at which point we'd likely
have locking around qemu-img based API calls too. There's also a likelihood
we'll make our locking API public, in which case it is possible that
an app using libvirt could have acquired locks on the file.

> So, 1. means that multiple manually started QEMU instances writing to the same
> image are NOT protected, that's the limitation. 2. means that qemu-img called
> by either libvirt or a human user is prevented from corrupting an in-use qcow2.
> 
> As long as I'm not wrong about b.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-11 16:05       ` Kevin Wolf
@ 2016-01-12 15:20         ` Markus Armbruster
  2016-01-12 17:36           ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: Markus Armbruster @ 2016-01-12 15:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

Kevin Wolf <kwolf@redhat.com> writes:

> Am 11.01.2016 um 16:49 hat Markus Armbruster geschrieben:
>> Eric Blake <eblake@redhat.com> writes:
>> 
>> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
>> >> This patch extends qemu-img for working with locked images. It prints a
>> >> helpful error message when trying to access a locked image read-write,
>> >> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
>> >> -r all --force' option in order to override a lock left behind after a
>> >> qemu crash.
>> >> 
>> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> >> ---
>> >>  include/block/block.h |  1 +
>> >>  include/qapi/error.h  |  1 +
>> >>  qapi/common.json      |  3 +-
>> >>  qemu-img-cmds.hx      | 10 ++++--
>> >>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
>> >>  qemu-img.texi         | 20 ++++++++++-
>> >>  6 files changed, 113 insertions(+), 18 deletions(-)
>> >> 
>> >
>> >> +++ b/include/qapi/error.h
>> >> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
>> >>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
>> >>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
>> >>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
>> >> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
>> >>  } ErrorClass;
>> >
>> > Wow - a new ErrorClass.  It's been a while since we could justify one of
>> > these, but I think you might have found a case.
>> 
>> Spell out the rationale for the new ErrorClass, please.
>
> Action to be taken for this error class: Decide whether the lock is a
> leftover from a previous qemu run that ended in an unclean shutdown. If
> so, retry with overriding the lock.
>
> Currently used by qemu-img when ordered to override a lock. libvirt
> will need to do the same.

Let's see whether I understand the intended use:

    open image
    if open fails with ImageFileLocked:
        guess whether the lock is stale
        if guessing not stale:
            error out
        open image with lock override

Correct?

Obvious troublespots:

1. If you guess wrong, you destroy the image.  No worse than before, so
   okay, declare documentation problem.

2. TOCTTOU open to open with lock override

   I understand this lock cannot be foolproof, but I dislike the "if
   lock stale then steal lock" TOCTTOU trap all the same.

        Process A                       Process B
        open image, ImageFileLocked
        guess stale (correct)
                                        open image, ImageFileLocked
                                        guess stale (correct)
                                        open image with lock override
        open image with lock override

   To avoid, you need to add a suitable unique identifier to the lock,
   then change the override to "override the lock with this unique ID".

   A smartly chosen identifier might even help with guessing whether the
   lock is stale, at least in some cases.

3. TOCTTOU within open (hypothetical, haven't read your code)

   Obviously, "open image" needs to acquire the lock.  Is open+lock
   atomic with respect to concurrent open+lock?
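
The race in trouble spot 2 can be replayed deterministically with a toy
model. Everything here (ToyImage, the override flag, replay_race()) is
hypothetical and only mimics the "open, get ImageFileLocked, retry with
override" protocol sketched above; it is not QEMU code.

```python
class ImageFileLocked(Exception):
    """Raised when the image carries a lock and no override was requested."""


class ToyImage:
    def __init__(self, stale_lock=False):
        self.locked = stale_lock
        self.writers = []  # everyone who believes they own the image

    def open(self, who, override=False):
        if self.locked and not override:
            raise ImageFileLocked(who)
        self.locked = True
        self.writers.append(who)


def replay_race():
    image = ToyImage(stale_lock=True)  # lock left behind by a crashed qemu
    decided_stale = []

    # Step 1: A and B each try a normal open and both see the stale lock.
    for who in ("A", "B"):
        try:
            image.open(who)
        except ImageFileLocked:
            decided_stale.append(who)  # both guesses are correct right now

    # Step 2: B overrides first; A then overrides B's lock, which is live.
    for who in ("B", "A"):
        if who in decided_stale:
            image.open(who, override=True)

    return image.writers
```

In this model replay_race() ends with two writers on the image, because
a bare "override" cannot tell the stale lock from the one B just took.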

> Alternative design without a new error class: Accept the
> override-lock=on option on the block.c level (QAPI BlockdevOptionsBase)
> and ignore it silently for image formats that don't support locking
> (i.e. anything but qcow2), so that qemu-img and libvirt can set this
> option unconditionally even if they don't know that the image is locked
> (but libvirt would have to check first if qemu supports the option at
> all in order to maintain compatibility with old qemu versions).
>
> In my opinion, this would be worse design than adding an error class. It
> would also prevent libvirt from logging when it forcefully unlocks an
> image.

Well, QEMU could (and probably should) say something when overriding a
lock, and libvirt should certainly log whatever QEMU says.

Let me try a different tack.  It may well be unworkable.

An image may support a lock.  If it does, it carries an ID uniquely
identifying a particular lock - unlock period.

When you open an image supporting locking, you atomically acquire its
lock and set its lock ID.  If the user specifies the ID, we use that,
else we make one up.

To open with lock override, you need to specify something like
override=ID.

Intended use by management application, ignoring the need for lock
override:

    generate lock ID for this image and save it in persistent management
      storage
    open and lock image, setting the lock ID
    ...
    unlock and close image
    delete lock ID in persistent management storage

Now extend to override the lock when necessary:

    if persistent management storage has lock ID for this image
        open with lock override=ID
    else
        generate lock ID for this image and save...
        open and lock...

This lets the management application override its own stale locks after
a crash, but it won't override anybody else's locks, and it doesn't
engage in any guesswork.

Obviously, the management application will also need to be able to
override stale locks from opens by someone else, say a human user
bypassing the management application.  Perhaps like this:

    examine the image
    if it's locked:
        get the lock ID
        let next higher level in the stack decide whether it's stale
            (this level might be a human user)
        if it's stale:
            open with lock override=ID

No open-to-open-with-override TOCTTOU.
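
A minimal sketch of why naming the lock ID in the override closes that
window. The ToyImage class and its open()/close() methods are invented
for illustration, and a threading.Lock merely stands in for whatever
would make open+lock atomic in a real implementation.

```python
import threading
import uuid


class ImageLocked(Exception):
    def __init__(self, lock_id):
        super().__init__(f"image is locked, lock ID {lock_id}")
        self.lock_id = lock_id


class ToyImage:
    def __init__(self):
        self._mutex = threading.Lock()  # stands in for an atomic open+lock
        self.lock_id = None

    def open(self, lock_id=None, override=None):
        """Lock the image and return the lock ID now in effect.

        Succeeds if the image is unlocked, or if `override` names exactly
        the lock ID that is currently set. A caller-supplied `lock_id` is
        used for the new lock; otherwise one is made up.
        """
        with self._mutex:
            if self.lock_id is not None and self.lock_id != override:
                raise ImageLocked(self.lock_id)
            self.lock_id = lock_id or uuid.uuid4().hex
            return self.lock_id

    def close(self, lock_id):
        """Unlock, but only if the caller still owns the lock."""
        with self._mutex:
            if self.lock_id == lock_id:
                self.lock_id = None
```

If A and B both read the same stale ID and both try override=ID, the
first one wins and installs a fresh ID; the second one's override no
longer matches and fails with ImageLocked instead of stealing a live
lock.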

Thoughts?

* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 11:33               ` Fam Zheng
  2016-01-12 12:24                 ` Denis V. Lunev
  2016-01-12 12:28                 ` Kevin Wolf
@ 2016-01-12 15:59                 ` Denis V. Lunev
  2016-01-13  0:10                   ` Fam Zheng
  2 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-12 15:59 UTC (permalink / raw)
  To: Fam Zheng, Kevin Wolf; +Cc: Olga Krishtal, qemu-devel, Max Reitz

On 01/12/2016 02:33 PM, Fam Zheng wrote:
> On Tue, 01/12 11:10, Kevin Wolf wrote:
>> The problem is that libvirt already takes a lock, as Dan mentioned in
>> another reply in this thread, so we can't enable locking in qemu by
>> default. It would always fail when run under libvirt.
>>
>> Unless I'm seriously mistaken, this means that flock() inside qemu is
>> dead.
> Yes, I see the problem with libvirt, but can we instead do these?
>
>    1) Do a soft flock() in QEMU invocation. If it fails, silently ignore.
>    2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
>
> This way, if libvirt is holding flock, we can assume libvirt is actually
> "using" the image: 1) just works as before, but 2) will not break the qcow2.
> That is still a slight improvement, and does solve the reckless "qemu-img
> snapshot create" user's problem.
>
> Fam
There is a better way though.

If we switch the default in my patch from 'nolock' to 'lock', then the
poor guys who call qemu-img etc. by hand will get the lock they need,
while 'proper management software' aka libvirt will be able to call
qemu/qemu-img etc. with the proper 'nolock' flag, as it takes care of
the locking itself.

Though from my POV all locks should be taken by the responsible
entity, i.e. qemu or qemu-img: if locks are held by libvirt, then
they have to be re-taken on e.g. daemon restart, which could happen.
This is not that convenient.

Den

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-12 15:20         ` Markus Armbruster
@ 2016-01-12 17:36           ` Kevin Wolf
  2016-01-13  8:44             ` Markus Armbruster
  0 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-12 17:36 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel, qemu-block, mreitz

Am 12.01.2016 um 16:20 hat Markus Armbruster geschrieben:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
> > Am 11.01.2016 um 16:49 hat Markus Armbruster geschrieben:
> >> Eric Blake <eblake@redhat.com> writes:
> >> 
> >> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> >> >> This patch extends qemu-img for working with locked images. It prints a
> >> >> helpful error message when trying to access a locked image read-write,
> >> >> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> >> >> -r all --force' option in order to override a lock left behind after a
> >> >> qemu crash.
> >> >> 
> >> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >> >> ---
> >> >>  include/block/block.h |  1 +
> >> >>  include/qapi/error.h  |  1 +
> >> >>  qapi/common.json      |  3 +-
> >> >>  qemu-img-cmds.hx      | 10 ++++--
> >> >>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
> >> >>  qemu-img.texi         | 20 ++++++++++-
> >> >>  6 files changed, 113 insertions(+), 18 deletions(-)
> >> >> 
> >> >
> >> >> +++ b/include/qapi/error.h
> >> >> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
> >> >>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
> >> >>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
> >> >>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
> >> >> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
> >> >>  } ErrorClass;
> >> >
> >> > Wow - a new ErrorClass.  It's been a while since we could justify one of
> >> > these, but I think you might have found a case.
> >> 
> >> Spell out the rationale for the new ErrorClass, please.
> >
> > Action to be taken for this error class: Decide whether the lock is a
> > leftover from a previous qemu run that ended in an unclean shutdown. If
> > so, retry with overriding the lock.
> >
> > Currently used by qemu-img when ordered to override a lock. libvirt
> > will need to do the same.
> 
> Let's see whether I understand the intended use:
> 
>     open image
>     if open fails with ImageFileLocked:
>         guess whether the lock is stale
>         if guessing not stale:
>             error out
>         open image with lock override
> 
> Correct?

Yes. Where "guess" is more or less "check whether the management tool
started qemu with this image, but didn't cleanly shut it down". This can
guess wrong if, and only if, some other user used a different algorithm
and forced an unlock even though the image didn't belong to them before
the crash.

> Obvious troublespots:
> 
> 1. If you guess wrong, you destroy the image.  No worse than before, so
>    okay, declare documentation problem.
> 
> 2. TOCTTOU open to open with lock override
>    [...]
> 
> 3. TOCTTOU within open (hypothetical, haven't read your code)
>    [...]

Yes, these exist in theory. The question is what scenarios you want to
protect against and whether improving the mechanism to cover these cases
is worth the effort.

The answer for what I wanted to protect is a manual action on an image
that is already in use. The user isn't quick enough to manually let two
processes open the same image at the same time, so I didn't consider
that scenario relevant.

But assuming that everyone (including the human user) follows the above
protocol (force-unlock only what was yours before the crash), at least
cases 1 and 2 don't happen anyway.

> Let me try a different tack.  It may well be unworkable.
> [...]

It doesn't sound unworkable, but it might be overengineered if the goal
is just to protect people against 'qemu-img snapshot' on running VMs.

> Obviously, the management application will also need to be able to
> override stale locks from opens by someone else, say a human user
> bypassing the management application.

That's not obvious. If another user messed it up, this other user can
also clean it up. But yes, asking a higher level would work, too.

Kevin

* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 13:24                     ` Daniel P. Berrange
@ 2016-01-13  0:08                       ` Fam Zheng
  0 siblings, 0 replies; 99+ messages in thread
From: Fam Zheng @ 2016-01-13  0:08 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kevin Wolf, qemu-devel, Max Reitz, Olga Krishtal, Denis V. Lunev

On Tue, 01/12 13:24, Daniel P. Berrange wrote:
> On Tue, Jan 12, 2016 at 09:17:51PM +0800, Fam Zheng wrote:
> > On Tue, 01/12 13:28, Kevin Wolf wrote:
> > > Am 12.01.2016 um 12:33 hat Fam Zheng geschrieben:
> > > > On Tue, 01/12 11:10, Kevin Wolf wrote:
> > > > > 
> > > > > The problem is that libvirt already takes a lock, as Dan mentioned in
> > > > > another reply in this thread, so we can't enable locking in qemu by
> > > > > default. It would always fail when run under libvirt.
> > > > > 
> > > > > Unless I'm seriously mistaken, this means that flock() inside qemu is
> > > > > dead.
> > > > 
> > > > Yes, I see the problem with libvirt, but can we instead do these?
> > > > 
> > > >   1) Do a soft flock() in QEMU invocation. If it fails, silently ignore.
> > > >   2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
> > > > 
> > > > This way, if libvirt is holding flock, we can assume libvirt is actually
> > > > "using" the image: 1) just works as before, but 2) will not break the qcow2.
> > > > That is still a slight improvement, and does solve the reckless "qemu-img
> > > > snapshot create" user's problem.
> > > 
> > > This makes two assumptions:
> > > 
> > > 1. qemu is only ever invoked by libvirt
> > > 2. qemu-img is only ever invoked by human users
> > > 
> > > Both of them are wrong. 1. just means that manually started QEMUs are
> > > unprotected (which is already bad), but 2. means that qemu-img called by
> > > libvirt fails (which is obviously not acceptable).
> > 
> > No, my assumptions are:
> > 
> > a. libvirt calls flock() when invoking qemu;
> > b. libvirt doesn't call flock() when invoking qemu-img; (if I read the libvirt
> >    code correctly, input from libvirt folks needed);
> 
> b. is /currently/ true, but I wouldn't guarantee that will always be
> true, because we have (vague) plans to extend our locking infrastructure
> to cover our storage pool APIs too, at which point we'd likely
> have locking around qemu-img based API calls too. There's also a likelihood
> we'll make our locking API public, in which case it is possible that
> an app using libvirt could have acquired locks on the file.
> 

This is not a problem. When you extend that in libvirt, you can at the same
time modify it to always specify "nolock=on" when invoking the new qemu-img,
so that it doesn't check flock().

Fam

* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-12 15:59                 ` Denis V. Lunev
@ 2016-01-13  0:10                   ` Fam Zheng
  2016-01-13 16:44                     ` Eric Blake
  0 siblings, 1 reply; 99+ messages in thread
From: Fam Zheng @ 2016-01-13  0:10 UTC (permalink / raw)
  To: Denis V. Lunev; +Cc: Kevin Wolf, qemu-devel, Max Reitz, Olga Krishtal

On Tue, 01/12 18:59, Denis V. Lunev wrote:
> On 01/12/2016 02:33 PM, Fam Zheng wrote:
> >On Tue, 01/12 11:10, Kevin Wolf wrote:
> >>The problem is that libvirt already takes a lock, as Dan mentioned in
> >>another reply in this thread, so we can't enable locking in qemu by
> >>default. It would always fail when run under libvirt.
> >>
> >>Unless I'm seriously mistaken, this means that flock() inside qemu is
> >>dead.
> >Yes, I see the problem with libvirt, but can we instead do these?
> >
> >   1) Do a soft flock() in QEMU invocation. If it fails, silently ignore.
> >   2) Do a hard flock() in qemu-img invocation. If it fails, report and exit.
> >
> >This way, if libvirt is holding flock, we can assume libvirt is actually
> >"using" the image: 1) just works as before, but 2) will not break the qcow2.
> >That is still a slight improvement, and does solve the reckless "qemu-img
> >snapshot create" user's problem.
> >
> >Fam
> There is a better way though.
> 
> If we switch the default in my patch from 'nolock' to 'lock', then the
> poor guys who call qemu-img etc. by hand will get the lock they need,
> while 'proper management software' aka libvirt will be able to call
> qemu/qemu-img etc. with the proper 'nolock' flag, as it takes care of
> the locking itself.

That is wrong because then we break old libvirt with the new qemu-img (acquires
lock by default), which is IMO a breakage of backward compatibility.
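
A small sketch of that compatibility problem, assuming the manager's
lock and the tool's lock conflict (modelled here with flock(2) on both
sides). qemu_img_like() and its nolock parameter are hypothetical
stand-ins, not the real qemu-img interface.

```python
import fcntl
import os
import tempfile


def qemu_img_like(path, nolock=False):
    """Hypothetical tool: takes an exclusive flock() unless nolock is set."""
    fd = os.open(path, os.O_RDWR)
    try:
        if not nolock:
            # Raises BlockingIOError if someone else holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return "ok"
    except BlockingIOError:
        return "failed: image is locked"
    finally:
        os.close(fd)


def run_demo():
    with tempfile.NamedTemporaryFile() as image:
        # The manager holds its own lock on the image, as libvirt would.
        manager_fd = os.open(image.name, os.O_RDWR)
        fcntl.flock(manager_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # An old manager knows nothing about 'nolock'.
            old_manager = qemu_img_like(image.name)
            # An updated manager passes the opt-out explicitly.
            new_manager = qemu_img_like(image.name, nolock=True)
        finally:
            os.close(manager_fd)
        return old_manager, new_manager
```

Under these assumptions the call made the old way fails once locking is
on by default, and only a manager that has been updated to pass the
opt-out keeps working.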

Fam

> 
> Though from my POV all locks should be taken by the responsible
> entity, i.e. qemu or qemu-img: if locks are held by libvirt, then
> they have to be re-taken on e.g. daemon restart, which could happen.
> This is not that convenient.
> 
> Den

* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-12 17:36           ` Kevin Wolf
@ 2016-01-13  8:44             ` Markus Armbruster
  2016-01-13 14:19               ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: Markus Armbruster @ 2016-01-13  8:44 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

Kevin Wolf <kwolf@redhat.com> writes:

> Am 12.01.2016 um 16:20 hat Markus Armbruster geschrieben:
>> Kevin Wolf <kwolf@redhat.com> writes:
>> 
>> > Am 11.01.2016 um 16:49 hat Markus Armbruster geschrieben:
>> >> Eric Blake <eblake@redhat.com> writes:
>> >> 
>> >> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
>> >> >> This patch extends qemu-img for working with locked images. It prints a
>> >> >> helpful error message when trying to access a locked image read-write,
>> >> >> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
>> >> >> -r all --force' option in order to override a lock left behind after a
>> >> >> qemu crash.
>> >> >> 
>> >> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> >> >> ---
>> >> >>  include/block/block.h |  1 +
>> >> >>  include/qapi/error.h  |  1 +
>> >> >>  qapi/common.json      |  3 +-
>> >> >>  qemu-img-cmds.hx      | 10 ++++--
>> >> >>  qemu-img.c            | 96 +++++++++++++++++++++++++++++++++++++++++++--------
>> >> >>  qemu-img.texi         | 20 ++++++++++-
>> >> >>  6 files changed, 113 insertions(+), 18 deletions(-)
>> >> >> 
>> >> >
>> >> >> +++ b/include/qapi/error.h
>> >> >> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
>> >> >>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
>> >> >>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
>> >> >>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
>> >> >> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
>> >> >>  } ErrorClass;
>> >> >
>> >> > Wow - a new ErrorClass.  It's been a while since we could justify one of
>> >> > these, but I think you might have found a case.
>> >> 
>> >> Spell out the rationale for the new ErrorClass, please.
>> >
>> > Action to be taken for this error class: Decide whether the lock is a
>> > leftover from a previous qemu run that ended in an unclean shutdown. If
>> > so, retry with overriding the lock.
>> >
>> > Currently used by qemu-img when ordered to override a lock. libvirt
>> > will need to do the same.
>> 
>> Let's see whether I understand the intended use:
>> 
>>     open image
>>     if open fails with ImageFileLocked:
>>         guess whether the lock is stale
>>         if guessing not stale:
>>             error out
>>         open image with lock override
>> 
>> Correct?
>
> Yes. Where "guess" is more or less "check whether the management tool
> started qemu with this image, but didn't cleanly shut it down". This can
> guess wrong if, and only if, some other user used a different algorithm
> and forced an unlock even though the image didn't belong to them before
> the crash.
>
>> Obvious troublespots:
>> 
>> 1. If you guess wrong, you destroy the image.  No worse than before, so
>>    okay, declare documentation problem.
>> 
>> 2. TOCTTOU open to open with lock override
>>    [...]
>> 
>> 3. TOCTTOU within open (hypothetical, haven't read your code)
>>    [...]
>
> Yes, these exist in theory. The question is what scenarios you want to
> protect against and whether improving the mechanism to cover these cases
> is worth the effort.
>
> The answer for what I wanted to protect is a manual action on an image
> that is already in use. The user isn't quick enough to manually let two
> processes open the same image at the same time, so I didn't consider
> that scenario relevant.
>
> But assuming that everyone (including the human user) follows the above
> protocol (force-unlock only what was yours before the crash), at least
> cases 1 and 2 don't happen anyway.

"Force-unlock only what you locked yourself" is easier to stipulate than
to adhere to when the tools can't give you a hint on who did the
locking.  This is particularly true when "you" is a human, with
imperfect human memory.

I understand that this locking can't provide complete protection, and
merely aims to catch certain common accidents.

However, to avoid a false sense of security, its limitations need to be
clearly documented.  This very much includes the rule "force-unlock only
what you locked yourself".  In my opinion, it should also include the
raciness.

Sometimes, solving a problem is easier than documenting it.

>> Let me try a different tack.  It may well be unworkable.
>> [...]
>
> It doesn't sound unworkable, but it might be overengineered if the goal
> is just to protect people against 'qemu-img snapshot' on running VMs.
>
>> Obviously, the management application will also need to be able to
>> override stale locks from opens by someone else, say a human user
>> bypassing the management application.
>
> That's not obvious. If another user messed it up, this other user can
> also clean it up. But yes, asking a higher level would work, too.
>
> Kevin

* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-11 18:35             ` Kevin Wolf
@ 2016-01-13  8:52               ` Markus Armbruster
  2016-01-13  9:12                 ` Denis V. Lunev
  2016-01-13  9:51               ` Daniel P. Berrange
  1 sibling, 1 reply; 99+ messages in thread
From: Markus Armbruster @ 2016-01-13  8:52 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Fam Zheng, qemu-devel, Olga Krishtal, Denis V. Lunev, Max Reitz

Kevin Wolf <kwolf@redhat.com> writes:

> Am 11.01.2016 um 18:58 hat Daniel P. Berrange geschrieben:
>> On Mon, Jan 11, 2016 at 06:31:06PM +0100, Kevin Wolf wrote:
>> > Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
>> > > From: Olga Krishtal <okrishtal@virtuozzo.com>
>> > > 
>> > > While opening the image we want to be sure that we are the
>> > > one who works with the image, and if it is not true -
>> > > opening the image for writing should fail.
>> > > 
>> > > There are 2 ways at the moment: no lock at all and lock the file
>> > > image.
>> > > 
>> > > Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
>> > > Signed-off-by: Denis V. Lunev <den@openvz.org>
>> > > CC: Kevin Wolf <kwolf@redhat.com>
>> > > CC: Max Reitz <mreitz@redhat.com>
>> > > CC: Eric Blake <eblake@redhat.com>
>> > > CC: Fam Zheng <famz@redhat.com>
>> > 
>> > As long as locking is disabled by default, it's useless and won't
>> > prevent people from corrupting their images. These corruptions happen
>> > exactly because people don't know how to use qemu properly. You can't
>> > expect them to enable locking manually.
>> > 
>> > Also, you probably need to consider bdrv_reopen() and live migration.
>> > I think live migration would be blocked if source and destination both
>> > see the lock; which is admittedly less likely than with the qcow2 patch
>> > (and generally a problem of this series), but with localhost migration
>> > and potentially with some NFS setups it can be the case.
>> 
>> Note that when libvirt does locking it will release locks when a VM
>> is paused, and acquire locks prior to resuming CPUs. This allows live
>> migration to work because you never have CPUs running on both src + dst
>> at the same time. This does mean that libvirt does not allow QEMU to
>> automatically re-start CPUs when migration completes, as it needs to
>> take some action to acquire locks before allowing the dst to start
>> running again.
>
> This assumes that block devices can only be written to if CPUs are
> running. In the days of qemu 0.9, this was probably right, but with
> things like block jobs and built-in NBD servers, I wouldn't be as sure
> these days.

Sounds like QEMU and libvirt should cooperate more closely to get the
locking less wrong.

QEMU should have more accurate knowledge on how it is using the image.
Libvirt may be able to provide better locks, with the help of its
virtlockd daemon.

* Re: [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking
  2016-01-11 17:14     ` [Qemu-devel] " Kevin Wolf
  2016-01-11 17:54       ` Daniel P. Berrange
@ 2016-01-13  8:56       ` Markus Armbruster
  2016-01-13  9:11         ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
  1 sibling, 1 reply; 99+ messages in thread
From: Markus Armbruster @ 2016-01-13  8:56 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Fam Zheng, qemu-devel, qemu-block, mreitz

Kevin Wolf <kwolf@redhat.com> writes:

> Am 23.12.2015 um 11:47 hat Daniel P. Berrange geschrieben:
>> On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
>> > On Tue, 12/22 17:46, Kevin Wolf wrote:
>> > > Enough innocent images have died because users called 'qemu-img
>> > > snapshot' while the VM was still running. Educating the users doesn't
>> > > seem to be a working strategy, so this series adds locking to qcow2
>> > > that refuses to access the image read-write from two processes.
>> > > 
>> > > Eric, this will require a libvirt update to deal with qemu crashes
>> > > which leave locked images behind. The simplest thinkable way would be
>> > > to unconditionally override the lock in libvirt whenever the option is
>> > > present. In that case, libvirt VMs would be protected against
>> > > concurrent non-libvirt accesses, but not the other way round. If you
>> > > want more than that, libvirt would have to check somehow if it was its
>> > > own VM that used the image and left the lock behind. I imagine that
>> > > can't be too hard either.
>> > 
>> > The motivation is great, but I'm not sure I like the side-effect that an
>> > unclean shutdown will require a "forced" open, because it makes using
>> > qcow2 in development cumbersome, and like you said, management/user also
>> > needs to handle this explicitly. This is a bit of a personal preference,
>> > but it's strong enough that I want to speak up.
>> 
>> Yeah, I am also not really a big fan of locking mechanisms which are not
>> automatically cleaned up on process exit. On the other hand you could
>> say that people who choose to run qemu-img manually are already taking
>> fate into their own hands, and ending up with a dirty image on unclean
>> exit is still miles better than losing all your data.
>> 
>> > As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
>> > similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
>> > and a program crash will automatically drop the lock.
>> 
>> FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
>> out locks using fcntl()/lockf() on all disk images associated with a VM.
>
> Does this actually mean that if QEMU did try to use flock(), it would
> fail because libvirt is already holding the lock?
>
> I considered adding both locking schemes (the qcow2 flag for qcow2 on
> any backend; flock() for anything else on local files), but if this is
> true, that's game over for any flock() based patches.

"Game over" for any patches that use the same locking mechanism as
libvirt without coordinating with libvirt.

Of course, patches that use a new locking mechanism will almost
certainly need some libvirt work, too.

Can we come up with a more integrated solution where QEMU cooperates
with libvirt on locking when libvirt is in play, and does a more limited
job itself when libvirt isn't in play?

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2016-01-13  8:56       ` Markus Armbruster
@ 2016-01-13  9:11         ` Denis V. Lunev
  0 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-13  9:11 UTC (permalink / raw)
  To: Markus Armbruster, Kevin Wolf; +Cc: Fam Zheng, qemu-devel, qemu-block, mreitz

On 01/13/2016 11:56 AM, Markus Armbruster wrote:
> Kevin Wolf <kwolf@redhat.com> writes:
>
>> Am 23.12.2015 um 11:47 hat Daniel P. Berrange geschrieben:
>>> On Wed, Dec 23, 2015 at 11:14:12AM +0800, Fam Zheng wrote:
>>>> On Tue, 12/22 17:46, Kevin Wolf wrote:
>>>>> Enough innocent images have died because users called 'qemu-img
>>>>> snapshot' while
>>>>> the VM was still running. Educating the users doesn't seem to be a working
>>>>> strategy, so this series adds locking to qcow2 that refuses to
>>>>> access the image
>>>>> read-write from two processes.
>>>>>
>>>>> Eric, this will require a libvirt update to deal with qemu
>>>>> crashes which leave
>>>>> locked images behind. The simplest thinkable way would be to
>>>>> unconditionally
>>>>> override the lock in libvirt whenever the option is present. In that case,
>>>>> libvirt VMs would be protected against concurrent non-libvirt
>>>>> accesses, but not
>>>>> the other way round. If you want more than that, libvirt would
>>>>> have to check
>>>>> somehow if it was its own VM that used the image and left the
>>>>> lock behind. I
>>>>> imagine that can't be too hard either.
>>>> The motivation is great, but I'm not sure I like the side-effect that an
>>>> unclean shutdown will require a "forced" open, because it makes
>>>> using qcow2 in
>>>> development cumbersome, and like you said, management/user also
>>>> needs to handle
>>>> this explicitly. This is a bit of a personal preference, but it's
>>>> strong enough
>>>> that I want to speak up.
>>> Yeah, I am also not really a big fan of locking mechanisms which are not
>>> automatically cleaned up on process exit. On the other hand you could
>>> say that people who choose to run qemu-img manually are already taking
>>> fate into their own hands, and ending up with a dirty image on unclean
>>> exit is still miles better than loosing all your data.
>>>
>>>> As an alternative, can we introduce .bdrv_flock() in protocol drivers, with
>>>> similar semantics to flock(2) or lockf(3)? That way all formats can benefit,
>>>> and a program crash will automatically drop the lock.
>>> FWIW, the libvirt locking daemon (virtlockd) will already attempt to take
>>> out locks using fcntl()/lockf() on all disk images associated with a VM.
>> Does this actually mean that if QEMU did try to use flock(), it would
>> fail because libvirt is already holding the lock?
>>
>> I considered adding both locking schemes (the qcow2 flag for qcow2 on
>> any backend; flock() for anything else on local files), but if this is
>> true, that's game over for any flock() based patches.
> "Game over" for any patches that use the same locking mechanism as
> libvirt without coordinating with libvirt.
>
> Of course, patches that use a new locking mechanism will almost
> certainly need some libvirt work, too.
>
> Can we come up with a more integrated solution where QEMU cooperates
> with libvirt on locking when libvirt is in play, and does a more limited
> job itself when libvirt isn't in play?
>
To me this seems viable, but it requires serious effort and serious
coordination.

How could this be done, and how could it be coordinated, given the
current state of uncertainty?

Den


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-13  8:52               ` Markus Armbruster
@ 2016-01-13  9:12                 ` Denis V. Lunev
  2016-01-13  9:50                   ` Daniel P. Berrange
  0 siblings, 1 reply; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-13  9:12 UTC (permalink / raw)
  To: Markus Armbruster, Kevin Wolf
  Cc: Fam Zheng, Olga Krishtal, qemu-devel, Max Reitz

On 01/13/2016 11:52 AM, Markus Armbruster wrote:
> Kevin Wolf <kwolf@redhat.com> writes:
>
>> Am 11.01.2016 um 18:58 hat Daniel P. Berrange geschrieben:
>>> On Mon, Jan 11, 2016 at 06:31:06PM +0100, Kevin Wolf wrote:
>>>> Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
>>>>> From: Olga Krishtal <okrishtal@virtuozzo.com>
>>>>>
>>>>> While opening the image we want to be sure that we are the
>>>>> one who works with image, anf if it is not true -
>>>>> opening the image for writing should fail.
>>>>>
>>>>> There are 2 ways at the moment: no lock at all and lock the file
>>>>> image.
>>>>>
>>>>> Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
>>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>>> CC: Max Reitz <mreitz@redhat.com>
>>>>> CC: Eric Blake <eblake@redhat.com>
>>>>> CC: Fam Zheng <famz@redhat.com>
>>>> As long as locking is disabled by default, it's useless and won't
>>>> prevent people from corrupting their images. These corruptions happen
>>>> exactly because people don't know how to use qemu properly. You can't
>>>> expect them to enable locking manually.
>>>>
>>>> Also, you probably need to consider bdrv_reopen() and live migration.
>>>> I think live migration would be blocked if source and destination both
>>>> see the lock; which is admittedly less likely than with the qcow2 patch
>>>> (and generally a problem of this series), but with localhost migration
>>>> and potentially with some NFS setups it can be the case.
>>> Note that when libvirt does locking it will release locks when a VM
>>> is paused, and acquire locks prior to resuming CPUs. This allows live
>>> migration to work because you never have CPUs running on both src + dst
>>> at the same time. This does mean that libvirt does not allow QEMU to
>>> automatically re-start CPUs when migration completes, as it needs to
>>> take some action to acquire locks before allowing the dst to start
>>> running again.
>> This assumes that block devices can only be written to if CPUs are
>> running. In the days of qemu 0.9, this was probably right, but with
>> things like block jobs and built-in NBD servers, I wouldn't be as sure
>> these days.
> Sounds like QEMU and libvirt should cooperate more closely to get the
> locking less wrong.
>
> QEMU should have more accurate knowledge on how it is using the image.
> Libvirt may be able to provide better locks, with the help of its
> virtlockd daemon.
A daemon owning the locks is a problem:
- there are distributed cases
- daemons restart from time to time


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-13  9:12                 ` Denis V. Lunev
@ 2016-01-13  9:50                   ` Daniel P. Berrange
  0 siblings, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2016-01-13  9:50 UTC (permalink / raw)
  To: Denis V. Lunev
  Cc: Kevin Wolf, Fam Zheng, qemu-devel, Markus Armbruster,
	Olga Krishtal, Max Reitz

On Wed, Jan 13, 2016 at 12:12:10PM +0300, Denis V. Lunev wrote:
> On 01/13/2016 11:52 AM, Markus Armbruster wrote:
> >Kevin Wolf <kwolf@redhat.com> writes:
> >
> >>Am 11.01.2016 um 18:58 hat Daniel P. Berrange geschrieben:
> >>>On Mon, Jan 11, 2016 at 06:31:06PM +0100, Kevin Wolf wrote:
> >>>>Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> >>>>>From: Olga Krishtal <okrishtal@virtuozzo.com>
> >>>>>
> >>>>>While opening the image we want to be sure that we are the
> >>>>>one who works with image, anf if it is not true -
> >>>>>opening the image for writing should fail.
> >>>>>
> >>>>>There are 2 ways at the moment: no lock at all and lock the file
> >>>>>image.
> >>>>>
> >>>>>Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> >>>>>Signed-off-by: Denis V. Lunev <den@openvz.org>
> >>>>>CC: Kevin Wolf <kwolf@redhat.com>
> >>>>>CC: Max Reitz <mreitz@redhat.com>
> >>>>>CC: Eric Blake <eblake@redhat.com>
> >>>>>CC: Fam Zheng <famz@redhat.com>
> >>>>As long as locking is disabled by default, it's useless and won't
> >>>>prevent people from corrupting their images. These corruptions happen
> >>>>exactly because people don't know how to use qemu properly. You can't
> >>>>expect them to enable locking manually.
> >>>>
> >>>>Also, you probably need to consider bdrv_reopen() and live migration.
> >>>>I think live migration would be blocked if source and destination both
> >>>>see the lock; which is admittedly less likely than with the qcow2 patch
> >>>>(and generally a problem of this series), but with localhost migration
> >>>>and potentially with some NFS setups it can be the case.
> >>>Note that when libvirt does locking it will release locks when a VM
> >>>is paused, and acquire locks prior to resuming CPUs. This allows live
> >>>migration to work because you never have CPUs running on both src + dst
> >>>at the same time. This does mean that libvirt does not allow QEMU to
> >>>automatically re-start CPUs when migration completes, as it needs to
> >>>take some action to acquire locks before allowing the dst to start
> >>>running again.
> >>This assumes that block devices can only be written to if CPUs are
> >>running. In the days of qemu 0.9, this was probably right, but with
> >>things like block jobs and built-in NBD servers, I wouldn't be as sure
> >>these days.
> >Sounds like QEMU and libvirt should cooperate more closely to get the
> >locking less wrong.
> >
> >QEMU should have more accurate knowledge on how it is using the image.
> >Libvirt may be able to provide better locks, with the help of its
> >virtlockd daemon.
> daemon owning locks is a problem:
> - there are distributed cases
> - daemons restart from time to time

The virtlockd daemon copes with both of these cases just fine. There is
one daemon per virtualization host, and they can be configured to acquire
locks in a way that they will be enforced across all hosts. The reason
we do it in a separate virtlockd daemon instead of libvirtd is that we
designed it to be able to re-exec() itself while maintaining all locks
to allow for seamless upgrades.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-11 18:35             ` Kevin Wolf
  2016-01-13  8:52               ` Markus Armbruster
@ 2016-01-13  9:51               ` Daniel P. Berrange
  1 sibling, 0 replies; 99+ messages in thread
From: Daniel P. Berrange @ 2016-01-13  9:51 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Denis V. Lunev, Max Reitz, Fam Zheng, qemu-devel, Olga Krishtal

On Mon, Jan 11, 2016 at 07:35:57PM +0100, Kevin Wolf wrote:
> Am 11.01.2016 um 18:58 hat Daniel P. Berrange geschrieben:
> > On Mon, Jan 11, 2016 at 06:31:06PM +0100, Kevin Wolf wrote:
> > > Am 23.12.2015 um 08:46 hat Denis V. Lunev geschrieben:
> > > > From: Olga Krishtal <okrishtal@virtuozzo.com>
> > > > 
> > > > While opening the image we want to be sure that we are the
> > > > one who works with image, anf if it is not true -
> > > > opening the image for writing should fail.
> > > > 
> > > > There are 2 ways at the moment: no lock at all and lock the file
> > > > image.
> > > > 
> > > > Signed-off-by: Olga Krishtal <okrishtal@virtuozzo.com>
> > > > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > > > CC: Kevin Wolf <kwolf@redhat.com>
> > > > CC: Max Reitz <mreitz@redhat.com>
> > > > CC: Eric Blake <eblake@redhat.com>
> > > > CC: Fam Zheng <famz@redhat.com>
> > > 
> > > As long as locking is disabled by default, it's useless and won't
> > > prevent people from corrupting their images. These corruptions happen
> > > exactly because people don't know how to use qemu properly. You can't
> > > expect them to enable locking manually.
> > > 
> > > Also, you probably need to consider bdrv_reopen() and live migration.
> > > I think live migration would be blocked if source and destination both
> > > see the lock; which is admittedly less likely than with the qcow2 patch
> > > (and generally a problem of this series), but with localhost migration
> > > and potentially with some NFS setups it can be the case.
> > 
> > Note that when libvirt does locking it will release locks when a VM
> > is paused, and acquire locks prior to resuming CPUs. This allows live
> > migration to work because you never have CPUs running on both src + dst
> > at the same time. This does mean that libvirt does not allow QEMU to
> > automatically re-start CPUs when migration completes, as it needs to
> > take some action to acquire locks before allowing the dst to start
> > running again.
> 
> This assumes that block devices can only be written to if CPUs are
> running. In the days of qemu 0.9, this was probably right, but with
> things like block jobs and built-in NBD servers, I wouldn't be as sure
> these days.

True, libvirt knows when it is using block jobs & NBD servers, so it
should not be difficult to address this.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-13  8:44             ` Markus Armbruster
@ 2016-01-13 14:19               ` Kevin Wolf
  2016-01-14 13:07                 ` Markus Armbruster
  0 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-13 14:19 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel, qemu-block, mreitz

Am 13.01.2016 um 09:44 hat Markus Armbruster geschrieben:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
> > Am 12.01.2016 um 16:20 hat Markus Armbruster geschrieben:
> >> Kevin Wolf <kwolf@redhat.com> writes:
> >> 
> >> > Am 11.01.2016 um 16:49 hat Markus Armbruster geschrieben:
> >> >> Eric Blake <eblake@redhat.com> writes:
> >> >> 
> >> >> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> >> >> >> This patch extends qemu-img for working with locked images. It prints a
> >> >> >> helpful error message when trying to access a locked image read-write,
> >> >> >> and adds a 'qemu-img force-unlock' command as well as a 'qemu-img check
> >> >> >> -r all --force' option in order to override a lock left behind after a
> >> >> >> qemu crash.
> >> >> >> 
> >> >> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> >> >> >> ---
> >> >> >>  include/block/block.h |  1 +
> >> >> >>  include/qapi/error.h  |  1 +
> >> >> >>  qapi/common.json      |  3 +-
> >> >> >>  qemu-img-cmds.hx      | 10 ++++--
> >> >> >>  qemu-img.c | 96
> >> >> >> +++++++++++++++++++++++++++++++++++++++++++--------
> >> >> >>  qemu-img.texi         | 20 ++++++++++-
> >> >> >>  6 files changed, 113 insertions(+), 18 deletions(-)
> >> >> >> 
> >> >> >
> >> >> >> +++ b/include/qapi/error.h
> >> >> >> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
> >> >> >>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
> >> >> >>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
> >> >> >>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
> >> >> >> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
> >> >> >>  } ErrorClass;
> >> >> >
> >> >> > Wow - a new ErrorClass.  It's been a while since we could justify one of
> >> >> > these, but I think you might have found a case.
> >> >> 
> >> >> Spell out the rationale for the new ErrorClass, please.
> >> >
> >> > Action to be taken for this error class: Decide whether the lock is a
> >> > leftover from a previous qemu run that ended in an unclean shutdown. If
> >> > so, retry with overriding the lock.
> >> >
> >> > Currently used by qemu-img when ordered to override a lock. libvirt
> >> > will need to do the same.
> >> 
> >> Let's see whether I understand the intended use:
> >> 
> >>     open image
> >>     if open fails with ImageFileLocked:
> >>         guess whether the lock is stale
> >>         if guessing not stale:
> >>             error out
> >>         open image with lock override
> >> 
> >> Correct?
> >
> > Yes. Where "guess" is more or less "check whether the management tool
> > started qemu with this image, but didn't cleanly shut it down". This can
> > guess wrong if, and only if, some other user used a different algorithm
> > and forced an unlock even though the image didn't belong to them before
> > the crash.
> >
> >> Obvious troublespots:
> >> 
> >> 1. If you guess wrong, you destroy the image.  No worse than before, so
> >>    okay, declare documentation problem.
> >> 
> >> 2. TOCTTOU open to open with lock override
> >>    [...]
> >> 
> >> 3. TOCTTOU within open (hypothetical, haven't read your code)
> >>    [...]
> >
> > Yes, these exist in theory. The question is what scenarios you want to
> > protect against and whether improving the mechanism to cover these cases
> > is worth the effort.
> >
> > The answer for what I wanted to protect is a manual action on an image
> > that is already in use. The user isn't quick enough to manually let two
> > processes open the same image at the same time, so I didn't consider
> > that scenario relevant.
> >
> > But assuming that everyone (including the human user) follows the above
> > protocol (force-unlock only what was yours before the crash), at least
> > cases 1 and 2 don't happen anyway.
> 
> "Force-unlock only what you locked yourself" is easier to stipulate than
> to adhere to when the tools can't give you a hint on who did the
> locking.  This is particularly true when "you" is a human, with human
> imperfect memory.
> 
> I understand that this locking can't provide complete protection, and
> merely aims to catch certain common accidents.
> 
> However, to avoid a false sense of security, its limitations need to be
> clearly documented.  This very much includes the rule "force-unlock only
> what you locked yourself".  In my opinion, it should also include the
> raciness.
> 
> Sometimes, solving a problem is easier than documenting it.

Maybe I'll just merge the migration fixes and shelve the rest of the
series until I'm bored enough to implement the "real thing" with an
incompatible feature flag, lock IDs with an autogenerated part and
another part from the user, saved host name and PID, and a qcow2 driver
that refuses to write anything to an image it doesn't hold the lock for
even in corner cases. For now, I've already used more time for this than
I intended (didn't expect all that live migration fun initially...).
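
A minimal sketch of what such a lock record might look like, purely as an
illustration. Everything here is hypothetical: the Image class is an
in-memory stand-in for an image header, ImageLockedError stands in for the
ImageFileLocked error class, and the staleness guess only compares the
user-supplied part of the lock ID. Host name and PID are recorded but not
checked. The window between the failed open and the override is still
racy, as discussed above.

```python
import os
import socket
import uuid
from dataclasses import dataclass, field


class ImageLockedError(Exception):
    """Hypothetical stand-in for the ImageFileLocked error class."""

    def __init__(self, record):
        super().__init__("image is locked")
        self.record = record


@dataclass(frozen=True)
class LockRecord:
    user_id: str  # part of the lock ID chosen by the user or management tool
    auto_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    host: str = field(default_factory=socket.gethostname)
    pid: int = field(default_factory=os.getpid)


class Image:
    """In-memory stand-in for an image that stores at most one lock record."""

    def __init__(self):
        self.lock = None

    def open_rw(self, record, override=False):
        if self.lock is not None and not override:
            raise ImageLockedError(self.lock)
        self.lock = record

    def write(self, record):
        # Refuse any write from a caller that does not hold the current lock.
        if self.lock != record:
            raise PermissionError("not the lock holder")


def looks_stale(record, my_user_id):
    """Guess: only a lock carrying our own user part may be overridden."""
    return record.user_id == my_user_id


def open_with_recovery(image, my_user_id):
    """Open read-write; override a leftover lock only if it looks like ours."""
    record = LockRecord(user_id=my_user_id)
    try:
        image.open_rw(record)
    except ImageLockedError as err:
        if not looks_stale(err.record, my_user_id):
            raise
        image.open_rw(record, override=True)
    return record
```

Because the autogenerated part differs on every open, a process that lost
its lock to an override can no longer write, even if it is still running.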

And after all, it's not my VMs that get corrupted.

If you think that solving the problem "for real" is too easy to take
shortcuts, I'll happily create a BZ and assign it to you if you'd like
me to.

Kevin


* Re: [Qemu-devel] [Qemu-block] [PATCH 05/10] block: Inactivate BDS when migration completes
  2016-01-05 20:21     ` [Qemu-devel] [Qemu-block] " John Snow
@ 2016-01-13 14:25       ` Kevin Wolf
  2016-01-13 16:35         ` Eric Blake
  0 siblings, 1 reply; 99+ messages in thread
From: Kevin Wolf @ 2016-01-13 14:25 UTC (permalink / raw)
  To: John Snow; +Cc: qemu-devel, qemu-block, mreitz

Am 05.01.2016 um 21:21 hat John Snow geschrieben:
> 
> 
> On 12/22/2015 03:43 PM, Eric Blake wrote:
> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
> >> So far, live migration with shared storage meant that the image is in a
> >> not-really-ready don't-touch-me state on the destination while the
> >> source is still actively using it, but after completing the migration,
> >> the image was fully opened on both sides. This is bad.
> >>
> >> This patch adds a block driver callback to inactivate images on the
> >> source before completing the migration. Inactivation means that it goes
> >> to a state as if it was just live migrated to the qemu instance on the
> >> source (i.e. BDRV_O_INCOMING is set). You're then supposed to continue
> >> either on the source or on the destination, which takes ownership of the
> >> image.
> >>
> >> A typical migration looks like this now with respect to disk images:
> >>
> >> 1. Destination qemu is started, the image is opened with
> >>    BDRV_O_INCOMING. The image is fully opened on the source.
> >>
> >> 2. Migration is about to complete. The source flushes the image and
> >>    inactivates it. Now both sides have the image opened with
> >>    BDRV_O_INCOMING and are expecting the other side to still modify it.
> > 
> > The name BDRV_O_INCOMING now doesn't quite match semantics on the
> > source, but I don't have any better suggestions.  BDRV_O_LIMITED_USE?
> > BDRV_O_HANDOFF?  At any rate, I fully agree with your logic of locking
> > things down on the source to mark that the destination is about to take
> > over write access to the file.
> > 
> 
> INCOMING is handy as it keeps the code simple, even if it's weird to
> read. Is it worth adding the extra ifs/case statements everywhere to add
> in BDRV_O_HANDOFF? Maybe in the future someone will use BDRV_O_INCOMING
> to mean something more specific (data is incoming, not just in the
> process of being handed off) that could cause problems.
> 
> Maybe even just renaming BDRV_O_INCOMING right now to be BDRV_O_HANDOFF
> would accomplish the semantics we want on both source and destination
> without needing two flags.
> 
> Follow your dreams, Go with what you feel.

How about renaming BDRV_O_INCOMING to BDRV_O_INACTIVE?
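
For illustration, the handoff quoted above can be modelled with a single
"inactive" flag per side. This is only a toy sketch under assumed names
(BlockNode, migrate() and the list used as a shared disk are all made up);
it is not the block layer API.

```python
class BlockNode:
    """Toy model of one side's view of a shared image (hypothetical names)."""

    def __init__(self, name, inactive=False):
        self.name = name
        self.inactive = inactive
        self.flushed = True

    def write(self, shared_disk, data):
        # An inactive node must never write to the shared image.
        if self.inactive:
            raise RuntimeError(f"{self.name}: write while inactive")
        shared_disk.append((self.name, data))
        self.flushed = False

    def inactivate(self):
        # Flush first, then give up ownership of the image.
        self.flushed = True
        self.inactive = True

    def activate(self):
        # Take ownership; from now on this side may write.
        self.inactive = False


def migrate(source, destination):
    """Hand the image over; report whether both sides were inactive in between."""
    source.inactivate()
    both_inactive = source.inactive and destination.inactive
    destination.activate()
    return both_inactive
```

The point of the flag name is visible in the middle of migrate(): for a
moment both sides are inactive, which "incoming" describes poorly on the
source.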

Kevin


* Re: [Qemu-devel] [Qemu-block] [PATCH 05/10] block: Inactivate BDS when migration completes
  2016-01-13 14:25       ` Kevin Wolf
@ 2016-01-13 16:35         ` Eric Blake
  0 siblings, 0 replies; 99+ messages in thread
From: Eric Blake @ 2016-01-13 16:35 UTC (permalink / raw)
  To: Kevin Wolf, John Snow; +Cc: qemu-devel, qemu-block, mreitz

On 01/13/2016 07:25 AM, Kevin Wolf wrote:

>>> The name BDRV_O_INCOMING now doesn't quite match semantics on the
>>> source, but I don't have any better suggestions.  BDRV_O_LIMITED_USE?
>>> BDRV_O_HANDOFF?  At any rate, I fully agree with your logic of locking
>>> things down on the source to mark that the destination is about to take
>>> over write access to the file.
>>>
>>
>> INCOMING is handy as it keeps the code simple, even if it's weird to
>> read. Is it worth adding the extra ifs/case statements everywhere to add
>> in BDRV_O_HANDOFF? Maybe in the future someone will use BDRV_O_INCOMING
>> to mean something more specific (data is incoming, not just in the
>> process of being handed off) that could cause problems.
>>
>> Maybe even just renaming BDRV_O_INCOMING right now to be BDRV_O_HANDOFF
>> would accomplish the semantics we want on both source and destination
>> without needing two flags.
>>
>> Follow your dreams, Go with what you feel.
> 
> How about renaming BDRV_O_INCOMING to BDRV_O_INACTIVE?

BDRV_O_INACTIVE works for me.  Do the rename as a separate mechanical
patch, obviously.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-13  0:10                   ` Fam Zheng
@ 2016-01-13 16:44                     ` Eric Blake
  2016-01-14  7:23                       ` Denis V. Lunev
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Blake @ 2016-01-13 16:44 UTC (permalink / raw)
  To: Fam Zheng, Denis V. Lunev
  Cc: Kevin Wolf, Olga Krishtal, qemu-devel, Max Reitz

On 01/12/2016 05:10 PM, Fam Zheng wrote:

>> If we will switch default in my patch from 'nolock' to 'lock' then
>> pour guys which are calling qemu-img etc stuff will see the lock
>> as necessary while 'proper management software' aka libvirt
>> will be able to call qemu/qemu-img etc with proper 'nolock'
>> flag as they do care about the locking.
> 
> That is wrong because then we break old libvirt with the new qemu-img (acquires
> lock by default), which is IMO a breakage of backward compatibility.

In the big software stack picture, it is okay to reject 'old libvirt/new
qemu' as an invalid combination.  Upgrade-wise, we specifically support
'new libvirt/old qemu' - but it is fair game to say that 'if you want to
run new qemu, you must first upgrade to new libvirt that knows how to
drive it'.

That said, minimizing back-compat breaks, so that old libvirt can
(usually) correctly drive new qemu, is a worthy design goal for qemu.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org



* Re: [Qemu-devel] [PATCH 1/5] block: added lock image option and callback
  2016-01-13 16:44                     ` Eric Blake
@ 2016-01-14  7:23                       ` Denis V. Lunev
  0 siblings, 0 replies; 99+ messages in thread
From: Denis V. Lunev @ 2016-01-14  7:23 UTC (permalink / raw)
  To: Eric Blake, Fam Zheng; +Cc: Kevin Wolf, Olga Krishtal, qemu-devel, Max Reitz

On 01/13/2016 07:44 PM, Eric Blake wrote:
> On 01/12/2016 05:10 PM, Fam Zheng wrote:
>
>>> If we will switch default in my patch from 'nolock' to 'lock' then
>>> pour guys which are calling qemu-img etc stuff will see the lock
>>> as necessary while 'proper management software' aka libvirt
>>> will be able to call qemu/qemu-img etc with proper 'nolock'
>>> flag as they do care about the locking.
>> That is wrong because then we break old libvirt with the new qemu-img (acquires
>> lock by default), which is IMO a breakage of backward compatibility.
> In the big software stack picture, it is okay to reject 'old libvirt/new
> qemu' as an invalid combination.  Upgrade-wise, we specifically support
> 'new libvirt/old qemu' - but it is fair game to say that 'if you want to
> run new qemu, you must first upgrade to new libvirt that knows how to
> drive it'.
>
> That said, minimizing back-compat breaks, so that old libvirt can
> (usually) correctly drive new qemu, is a worthy design goal for qemu.
>
There is one other thing that I originally failed to add to the
picture.

Locking can be complex and format-specific. In the original
Parallels disk format (not a single image but an entire bundle), the
lock is taken on a special, dedicated file.

Thus either libvirt must know the exact format details, or it must
rely on something that really does know them, i.e. QEMU/qemu-img.

Den


* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-13 14:19               ` Kevin Wolf
@ 2016-01-14 13:07                 ` Markus Armbruster
  2016-01-14 14:19                   ` Kevin Wolf
  0 siblings, 1 reply; 99+ messages in thread
From: Markus Armbruster @ 2016-01-14 13:07 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, qemu-block, mreitz

Kevin Wolf <kwolf@redhat.com> writes:

> Am 13.01.2016 um 09:44 hat Markus Armbruster geschrieben:
>> Kevin Wolf <kwolf@redhat.com> writes:
>> 
>> > Am 12.01.2016 um 16:20 hat Markus Armbruster geschrieben:
>> >> Kevin Wolf <kwolf@redhat.com> writes:
>> >> 
>> >> > Am 11.01.2016 um 16:49 hat Markus Armbruster geschrieben:
>> >> >> Eric Blake <eblake@redhat.com> writes:
>> >> >> 
>> >> >> > On 12/22/2015 09:46 AM, Kevin Wolf wrote:
>> >> >> >> This patch extends qemu-img for working with locked
>> >> >> >> images. It prints a
>> >> >> >> helpful error message when trying to access a locked image
>> >> >> >> read-write,
>> >> >> >> and adds a 'qemu-img force-unlock' command as well as a
>> >> >> >> 'qemu-img check
>> >> >> >> -r all --force' option in order to override a lock left
>> >> >> >> behind after a
>> >> >> >> qemu crash.
>> >> >> >> 
>> >> >> >> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> >> >> >> ---
>> >> >> >>  include/block/block.h |  1 +
>> >> >> >>  include/qapi/error.h  |  1 +
>> >> >> >>  qapi/common.json      |  3 +-
>> >> >> >>  qemu-img-cmds.hx      | 10 ++++--
>> >> >> >>  qemu-img.c | 96
>> >> >> >> +++++++++++++++++++++++++++++++++++++++++++--------
>> >> >> >>  qemu-img.texi         | 20 ++++++++++-
>> >> >> >>  6 files changed, 113 insertions(+), 18 deletions(-)
>> >> >> >> 
>> >> >> >
>> >> >> >> +++ b/include/qapi/error.h
>> >> >> >> @@ -102,6 +102,7 @@ typedef enum ErrorClass {
>> >> >> >>      ERROR_CLASS_DEVICE_NOT_ACTIVE = QAPI_ERROR_CLASS_DEVICENOTACTIVE,
>> >> >> >>      ERROR_CLASS_DEVICE_NOT_FOUND = QAPI_ERROR_CLASS_DEVICENOTFOUND,
>> >> >> >>      ERROR_CLASS_KVM_MISSING_CAP = QAPI_ERROR_CLASS_KVMMISSINGCAP,
>> >> >> >> +    ERROR_CLASS_IMAGE_FILE_LOCKED = QAPI_ERROR_CLASS_IMAGEFILELOCKED,
>> >> >> >>  } ErrorClass;
>> >> >> >
>> >> >> > Wow - a new ErrorClass.  It's been a while since we could
>> >> >> > justify one of
>> >> >> > these, but I think you might have found a case.
>> >> >> 
>> >> >> Spell out the rationale for the new ErrorClass, please.
>> >> >
>> >> > Action to be taken for this error class: Decide whether the lock is a
>> >> > leftover from a previous qemu run that ended in an unclean shutdown. If
>> >> > so, retry with overriding the lock.
>> >> >
>> >> > Currently used by qemu-img when ordered to override a lock. libvirt
>> >> > will need to do the same.
>> >> 
>> >> Let's see whether I understand the intended use:
>> >> 
>> >>     open image
>> >>     if open fails with ImageFileLocked:
>> >>         guess whether the lock is stale
>> >>         if guessing not stale:
>> >>             error out
>> >>         open image with lock override
>> >> 
>> >> Correct?
>> >
>> > Yes. Where "guess" is more or less "check whether the management tool
>> > started qemu with this image, but didn't cleanly shut it down". This can
>> > guess wrong if, and only if, some other user used a different algorithm
>> > and forced an unlock even though the image didn't belong to them before
>> > the crash.
>> >
>> >> Obvious troublespots:
>> >> 
>> >> 1. If you guess wrong, you destroy the image.  No worse than before, so
>> >>    okay, declare documentation problem.
>> >> 
>> >> 2. TOCTTOU open to open with lock override
>> >>    [...]
>> >> 
>> >> 3. TOCTTOU within open (hypothetical, haven't read your code)
>> >>    [...]
>> >
>> > Yes, these exist in theory. The question is what scenarios you want to
>> > protect against and whether improving the mechanism to cover these cases
>> > is worth the effort.
>> >
>> > The answer for what I wanted to protect is a manual action on an image
>> > that is already in use. The user isn't quick enough to manually let two
>> > processes open the same image at the same time, so I didn't consider
>> > that scenario relevant.
>> >
>> > But assuming that everyone (including the human user) follows the above
>> > protocol (force-unlock only what was yours before the crash), at least
>> > cases 1 and 2 don't happen anyway.
>> 
>> "Force-unlock only what you locked yourself" is easier to stipulate than
>> to adhere to when the tools can't give you a hint on who did the
>> locking.  This is particularly true when "you" is a human, with human
>> imperfect memory.
>> 
>> I understand that this locking can't provide complete protection, and
>> merely aims to catch certain common accidents.
>> 
>> However, to avoid a false sense of security, its limitations need to be
>> clearly documented.  This very much includes the rule "force-unlock only
>> what you locked yourself".  In my opinion, it should also include the
>> raciness.
>> 
>> Sometimes, solving a problem is easier than documenting it.
>
> Maybe I'll just merge the migration fixes and shelve the rest of the
> series until I'm bored enough to implement the "real thing" with an
> incompatible feature flag, lock IDs with an autogenerated part and
> another part from the user, saved host name and PID, and a qcow2 driver
> that refuses to write anything to an image it doesn't hold the lock for
> even in corner cases. For now, I've already used more time for this than
> I intended (didn't expect all that live migration fun initially...).
>
> And after all, it's not my VMs that get corrupted.
>
> If you think that solving the problem "for real" is too easy to take
> shortcuts, I'll happily create a BZ and assign it to you if you'd like
> me to.

I'm not sure what to do with this message.  Did I tick you off?

I took the trouble to understand what you're trying to do, in particular
the limitations.  I also explored a possible alternative that might be
less limited.  I didn't ask you to adopt this alternative.  I didn't
call your patch a shortcut.  I did point out that its limitations need
to be documented, and encouraged you to weigh documentation effort
against implementation effort.  Additionally, I wondered how this
affects libvirt, and whether we need to cooperate more closely on
locking.

If that's review gone too far, then I fail to see why we bother with it.

* Re: [Qemu-devel] [Qemu-block] [PATCH 00/10] qcow2: Implement image locking
  2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
                   ` (12 preceding siblings ...)
  2015-12-24  5:43 ` Denis V. Lunev
@ 2016-01-14 14:01 ` Max Reitz
  13 siblings, 0 replies; 99+ messages in thread
From: Max Reitz @ 2016-01-14 14:01 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: qemu-devel

On 22.12.2015 17:46, Kevin Wolf wrote:
> Enough innocent images have died because users called 'qemu-img snapshot' while
> the VM was still running. Educating the users doesn't seem to be a working
> strategy, so this series adds locking to qcow2 that refuses to access the image
> read-write from two processes.
> 
> Eric, this will require a libvirt update to deal with qemu crashes which leave
> locked images behind. The simplest thinkable way would be to unconditionally
> override the lock in libvirt whenever the option is present. In that case,
> libvirt VMs would be protected against concurrent non-libvirt accesses, but not
> the other way round. If you want more than that, libvirt would have to check
> somehow if it was its own VM that used the image and left the lock behind. I
> imagine that can't be too hard either.
> 
> Also note that this kind of depends on Max's bdrv_close_all() series, but only
> in order to pass test case 142. This is not a bug in this series, but a
> preexisting one (bs->file can be closed before bs), and it becomes apparent
> when qemu fails to unlock an image due to this bug. Max's series fixes this.

Here's another crazy idea around the issue of which kind of locking to
choose: We could implement both.

Our main issue is that we currently do not protect users not using any
management stack against wrongly invoking a second qemu/qemu-img
instance on a qcow2 image that's already in use. We have to assume that
these users will not specify any locking option, so our default choice
needs to work here. Both flock() and format locking would work fine here.

Then there are users who use old libvirt versions and new qemu
versions. I know Eric doesn't care about this, but I do, because if I
had a full-blown management stack, I'd rather update qemu for better
performance or security fixes or whatever than the rest of the whole
stack. Also, if we need coordination between libvirt and qemu, that will
delay release of whatever implementation we do, which is not horrible,
but not desirable either. Therefore, it would be nice if the default
choice would work with libvirt today, which only leaves format locking.

Then there are users for whom format locking is not the right choice.
They often do have a management layer, so they don't actually need the
features provided by a locking mechanism in qemu (libvirt does it
anyway), and I do think we can assume them willing to set some runtime
options. Therefore, the default does not matter here, but we should
implement flock() because once libvirt can cope with qemu invoking that
by itself, it may offer them better protection.

This is why I think implementing both format and protocol locking may be
a good idea. I think we should enable format locking and disable flock()
by default, but of course you should be able to override this choice at
runtime.
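
For readers unfamiliar with the protocol-level option, here is a minimal sketch of what flock()-based locking could look like. It is not taken from this series or from QEMU; the helper name open_image_locked() is hypothetical, and it assumes a local POSIX file system where flock(2) is honoured.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Hypothetical helper, not QEMU code: open an image read-write and take an
 * exclusive advisory lock on it without blocking.
 *
 * Returns the file descriptor on success. Returns -1 with errno set on
 * failure; errno == EWOULDBLOCK means someone else already holds the lock.
 */
int open_image_locked(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) {
        return -1;
    }

    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        int saved_errno = errno;    /* close() may overwrite errno */
        close(fd);
        errno = saved_errno;
        return -1;
    }

    return fd;
}
```

The lock is released when the descriptor is closed, which also happens when the process exits or crashes, so this variant leaves no stale lock behind in the image. It is advisory only: a program that never calls flock() is not stopped from writing to the file.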

Thoughts?

Max


* Re: [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images
  2016-01-14 13:07                 ` Markus Armbruster
@ 2016-01-14 14:19                   ` Kevin Wolf
  0 siblings, 0 replies; 99+ messages in thread
From: Kevin Wolf @ 2016-01-14 14:19 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel, qemu-block, mreitz

Am 14.01.2016 um 14:07 hat Markus Armbruster geschrieben:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
> > Am 13.01.2016 um 09:44 hat Markus Armbruster geschrieben:
> >> Kevin Wolf <kwolf@redhat.com> writes:
> >> > Yes, these exist in theory. The question is what scenarios you want to
> >> > protect against and whether improving the mechanism to cover these cases
> >> > is worth the effort.
> >> >
> >> > The answer for what I wanted to protect is a manual action on an image
> >> > that is already in use. The user isn't quick enough to manually let two
> >> > processes open the same image at the same time, so I didn't consider
> >> > that scenario relevant.
> >> >
> >> > But assuming that everyone (including the human user) follows the above
> >> > protocol (force-unlock only what was yours before the crash), at least
> >> > cases 1 and 2 don't happen anyway.
> >> 
> >> "Force-unlock only what you locked yourself" is easier to stipulate than
> >> to adhere to when the tools can't give you a hint on who did the
> >> locking.  This is particularly true when "you" is a human, with human
> >> imperfect memory.
> >> 
> >> I understand that this locking can't provide complete protection, and
> >> merely aims to catch certain common accidents.
> >> 
> >> However, to avoid a false sense of security, its limitations need to be
> >> clearly documented.  This very much includes the rule "force-unlock only
> >> what you locked yourself".  In my opinion, it should also include the
> >> raciness.
> >> 
> >> Sometimes, solving a problem is easier than documenting it.
> >
> > Maybe I'll just merge the migration fixes and shelve the rest of the
> > series until I'm bored enough to implement the "real thing" with an
> > incompatible feature flag, lock IDs with an autogenerated part and
> > another part from the user, saved host name and PID, and a qcow2 driver
> > that refuses to write anything to an image it doesn't hold the lock for
> > even in corner cases. For now, I've already used more time for this than
> > I intended (didn't expect all that live migration fun initially...).
> >
> > And after all, it's not my VMs that get corrupted.
> >
> > If you think that solving the problem "for real" is too easy to take
> > shortcuts, I'll happily create a BZ and assign it to you if you'd like
> > me to.
> 
> I'm not sure what to do with this message.  Did I tick you off?
> 
> I took the trouble to understand what you're trying to do, in particular
> the limitations.  I also explored a possible alternative that might be
> less limited.  I didn't ask you to adopt this alternative.  I didn't
> call your patch a shortcut.  I did point out that its limitations need
> to be documented, and encouraged you to weigh documentation effort
> against implementation effort.  Additionally, I wondered how this
> affects libvirt, and whether we need to cooperate more closely on
> locking.
> 
> If that's review gone too far, then I fail to see why we bother with it.

No, it's fine. I just don't like the outcome of the review and was a bit
disappointed about what that means (see below), but that doesn't mean
that you did anything wrong.

I had a few spare minutes and wanted to hack up a quick solution just to
prevent the snapshot case. But now it seems that people (not just you)
expect something more complete instead. And you might be right,
if we do the incomplete thing now and something better later, we might
end up with two mechanisms in the image format. So perhaps we should do
the "real thing".
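
As an illustration of why a saved host name and PID would help with the "who did the locking" question, here is a rough sketch of a staleness check. It is purely hypothetical: struct lock_owner and both functions are invented for this example and are not a proposed qcow2 on-disk format.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical lock owner record; NOT the qcow2 on-disk layout. */
struct lock_owner {
    char host[256];     /* host name of the locking machine */
    pid_t pid;          /* PID of the locking process on that host */
};

/* Fill in the record for the calling process. Returns 0, or -1 on error. */
int lock_owner_init(struct lock_owner *o)
{
    memset(o, 0, sizeof(*o));
    /* Leave the last byte untouched so the name is always NUL-terminated. */
    if (gethostname(o->host, sizeof(o->host) - 1) < 0) {
        return -1;
    }
    o->pid = getpid();
    return 0;
}

/*
 * Returns true only if the lock was taken on this host by a process that no
 * longer exists. A lock from another host, or any case we cannot decide, is
 * reported as not stale. PID reuse can still make a dead owner look alive.
 */
bool lock_owner_is_stale(const struct lock_owner *o)
{
    char here[256] = "";

    if (gethostname(here, sizeof(here) - 1) < 0 ||
        strcmp(here, o->host) != 0) {
        return false;
    }
    if (o->pid <= 0) {
        return false;   /* kill() treats such values specially */
    }
    /* Signal 0 performs only the existence and permission checks. */
    if (kill(o->pid, 0) == 0) {
        return false;   /* process exists */
    }
    return errno == ESRCH;  /* EPERM means it exists but is not ours */
}
```

A tool could use such a check to refuse a force-unlock unless the recorded owner is provably gone, rather than relying on the user's memory of who locked the image.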

But I've used up the spare minutes I had before Christmas to hack on
locking and now I need to handle more important projects and all the
other patch series on the list first. At the same time, something like
this needs to be merged early in a release cycle, so that libvirt can
implement it in time.

Combine everything and you get that if we want the real thing, and we
want me to do it, it might not be for 2.6.

Kevin

Thread overview: 99+ messages
2015-12-22 16:46 [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Kevin Wolf
2015-12-22 16:46 ` [Qemu-devel] [PATCH 01/10] qcow2: Write feature table only for v3 images Kevin Wolf
2015-12-22 20:20   ` Eric Blake
2016-01-11 15:20     ` Kevin Wolf
2015-12-22 16:46 ` [Qemu-devel] [PATCH 02/10] qcow2: Write full header on image creation Kevin Wolf
2015-12-22 20:25   ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 03/10] block: Assert no write requests under BDRV_O_INCOMING Kevin Wolf
2015-12-22 20:27   ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 04/10] block: Fix error path in bdrv_invalidate_cache() Kevin Wolf
2015-12-22 20:31   ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 05/10] block: Inactivate BDS when migration completes Kevin Wolf
2015-12-22 20:43   ` Eric Blake
2016-01-05 20:21     ` [Qemu-devel] [Qemu-block] " John Snow
2016-01-13 14:25       ` Kevin Wolf
2016-01-13 16:35         ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 06/10] qemu-img: Prepare for locked images Kevin Wolf
2015-12-22 16:57   ` Daniel P. Berrange
2015-12-22 17:00     ` Kevin Wolf
2015-12-22 21:06   ` Eric Blake
2016-01-11 15:49     ` Markus Armbruster
2016-01-11 16:05       ` Kevin Wolf
2016-01-12 15:20         ` Markus Armbruster
2016-01-12 17:36           ` Kevin Wolf
2016-01-13  8:44             ` Markus Armbruster
2016-01-13 14:19               ` Kevin Wolf
2016-01-14 13:07                 ` Markus Armbruster
2016-01-14 14:19                   ` Kevin Wolf
2016-01-11 16:22     ` Kevin Wolf
2015-12-22 21:41   ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 07/10] qcow2: Implement .bdrv_inactivate Kevin Wolf
2015-12-22 21:17   ` Eric Blake
2016-01-11 15:34     ` Kevin Wolf
2015-12-22 16:46 ` [Qemu-devel] [PATCH 08/10] qcow2: Fix BDRV_O_INCOMING handling in qcow2_invalidate_cache() Kevin Wolf
2015-12-22 21:22   ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 09/10] qcow2: Make image inaccessible after failed qcow2_invalidate_cache() Kevin Wolf
2015-12-22 21:24   ` Eric Blake
2015-12-22 16:46 ` [Qemu-devel] [PATCH 10/10] qcow2: Add image locking Kevin Wolf
2015-12-22 22:04   ` Eric Blake
2015-12-23  3:14 ` [Qemu-devel] [PATCH 00/10] qcow2: Implement " Fam Zheng
2015-12-23  7:35   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
2015-12-23  7:46     ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Denis V. Lunev
2015-12-23  7:46       ` [Qemu-devel] [PATCH 1/5] block: added lock image option and callback Denis V. Lunev
2015-12-23 23:48         ` Eric Blake
2016-01-11 17:31         ` Kevin Wolf
2016-01-11 17:58           ` Daniel P. Berrange
2016-01-11 18:35             ` Kevin Wolf
2016-01-13  8:52               ` Markus Armbruster
2016-01-13  9:12                 ` Denis V. Lunev
2016-01-13  9:50                   ` Daniel P. Berrange
2016-01-13  9:51               ` Daniel P. Berrange
2016-01-12  5:38           ` Denis V. Lunev
2016-01-12 10:10             ` Kevin Wolf
2016-01-12 11:33               ` Fam Zheng
2016-01-12 12:24                 ` Denis V. Lunev
2016-01-12 12:28                 ` Kevin Wolf
2016-01-12 13:17                   ` Fam Zheng
2016-01-12 13:24                     ` Daniel P. Berrange
2016-01-13  0:08                       ` Fam Zheng
2016-01-12 15:59                 ` Denis V. Lunev
2016-01-13  0:10                   ` Fam Zheng
2016-01-13 16:44                     ` Eric Blake
2016-01-14  7:23                       ` Denis V. Lunev
2015-12-23  7:46       ` [Qemu-devel] [PATCH 2/5] block: implemented bdrv_lock_image for raw file Denis V. Lunev
2015-12-23 12:40         ` Daniel P. Berrange
2015-12-23  7:46       ` [Qemu-devel] [PATCH 3/5] block: added check image option and callback bdrv_is_opened_unclean Denis V. Lunev
2015-12-23  9:09         ` Fam Zheng
2015-12-23  9:14           ` Denis V. Lunev
2015-12-23  7:46       ` [Qemu-devel] [PATCH 4/5] qcow2: implemented bdrv_is_opened_unclean Denis V. Lunev
2016-01-11 17:37         ` Kevin Wolf
2015-12-23  7:46       ` [Qemu-devel] [PATCH 5/5] block/paralels: added paralles implementation for bdrv_is_opened_unclean Denis V. Lunev
2015-12-23  8:09       ` [Qemu-devel] [PATCH RFC 0/5] generic image locking and crash recovery Fam Zheng
2015-12-23  8:36         ` Denis V. Lunev
2015-12-23 10:47   ` [Qemu-devel] [PATCH 00/10] qcow2: Implement image locking Daniel P. Berrange
2015-12-23 12:15     ` [Qemu-devel] [Qemu-block] " Roman Kagan
2015-12-23 12:29       ` Daniel P. Berrange
2015-12-23 12:41         ` Denis V. Lunev
2015-12-23 12:46           ` Daniel P. Berrange
2015-12-23 12:34       ` Daniel P. Berrange
2015-12-23 12:47         ` Denis V. Lunev
2015-12-23 12:56           ` Daniel P. Berrange
2016-01-11 17:14     ` [Qemu-devel] " Kevin Wolf
2016-01-11 17:54       ` Daniel P. Berrange
2016-01-13  8:56       ` Markus Armbruster
2016-01-13  9:11         ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
2015-12-23 23:19   ` [Qemu-devel] " Max Reitz
2015-12-24  5:41     ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
2015-12-24  5:42       ` Denis V. Lunev
2016-01-04 17:02       ` Max Reitz
2016-01-11 16:47       ` Kevin Wolf
2016-01-11 17:56         ` Daniel P. Berrange
2015-12-23 14:57 ` [Qemu-devel] " Vasiliy Tolstov
2015-12-23 15:08   ` [Qemu-devel] [Qemu-block] " Denis V. Lunev
2015-12-23 15:11     ` Vasiliy Tolstov
2016-01-11 16:25       ` Kevin Wolf
2015-12-23 15:09   ` Denis V. Lunev
2015-12-24  5:43 ` Denis V. Lunev
2016-01-11 16:33   ` Kevin Wolf
2016-01-11 16:38     ` Denis V. Lunev
2016-01-14 14:01 ` Max Reitz
