All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 00/14] block: Image locking series
@ 2016-10-31 15:38 Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd Fam Zheng
                   ` (14 more replies)
  0 siblings, 15 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

This is v9 of the image locking series. I redid the whole series, adopting the
"two locks" approach from Kevin and Max.

Depends on "[Qemu-devel] [PATCH] raw-posix: Rename 'raw_s' to 'rs'" in Max's
block branch.

Fam Zheng (14):
  osdep: Add qemu_lock_fd and qemu_unlock_fd
  block: Define BDRV_O_SHARE_RW
  qemu-io: Set "share-rw" flag together with read-only
  qemu-img: Set "share-rw" flag in read-only commands
  block: Set "share-rw" flag in drive-backup when sync=none
  block: Set "share-rw" flag for incoming migration
  iotests: 055: Don't attach the drive to vm for drive-backup
  iotests: 030: Read-only open image for getting map
  iotests: 087: Don't attch test image twice
  iotests: 085: Avoid image locking conflict
  iotests: 091: Quit QEMU before checking image
  tests: Use null-co:// instead of /dev/null as the dummy image
  raw-posix: Implement image locking
  tests: Add test-image-lock

 block/raw-posix.c          | 710 +++++++++++++++++++++++++++++++++++++++++----
 blockdev.c                 |   5 +-
 include/block/block.h      |   2 +
 include/qemu/osdep.h       |   2 +
 qemu-img.c                 |  10 +-
 qemu-io.c                  |   2 +
 tests/Makefile.include     |   2 +
 tests/drive_del-test.c     |   2 +-
 tests/nvme-test.c          |   2 +-
 tests/qemu-iotests/030     |   4 +-
 tests/qemu-iotests/055     |  32 +-
 tests/qemu-iotests/085     |  32 +-
 tests/qemu-iotests/085.out |   3 +-
 tests/qemu-iotests/087     |   6 +-
 tests/qemu-iotests/091     |   2 +
 tests/test-image-lock.c    | 179 ++++++++++++
 tests/usb-hcd-uhci-test.c  |   2 +-
 tests/usb-hcd-xhci-test.c  |   2 +-
 tests/virtio-blk-test.c    |   2 +-
 tests/virtio-scsi-test.c   |   4 +-
 util/osdep.c               |  29 ++
 21 files changed, 939 insertions(+), 95 deletions(-)
 create mode 100644 tests/test-image-lock.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-12-02  0:30   ` Max Reitz
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 02/14] block: Define BDRV_O_SHARE_RW Fam Zheng
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

They are wrappers of POSIX fcntl "file private locking".

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/qemu/osdep.h |  2 ++
 util/osdep.c         | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 0e3c330..f15e122 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -294,6 +294,8 @@ int qemu_close(int fd);
 #ifndef _WIN32
 int qemu_dup(int fd);
 #endif
+int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive);
+int qemu_unlock_fd(int fd, int64_t start, int64_t len);
 
 #if defined(__HAIKU__) && defined(__i386__)
 #define FMT_pid "%ld"
diff --git a/util/osdep.c b/util/osdep.c
index 06fb1cf..b85a490 100644
--- a/util/osdep.c
+++ b/util/osdep.c
@@ -140,6 +140,35 @@ static int qemu_parse_fdset(const char *param)
 {
     return qemu_parse_fd(param);
 }
+
+static int qemu_lock_fcntl(int fd, int64_t start, int64_t len, int fl_type)
+{
+#ifdef F_OFD_SETLK
+    int ret;
+    struct flock fl = {
+        .l_whence = SEEK_SET,
+        .l_start  = start,
+        .l_len    = len,
+        .l_type   = fl_type,
+    };
+    do {
+        ret = fcntl(fd, F_OFD_SETLK, &fl);
+    } while (ret == -1 && errno == EINTR);
+    return ret == -1 ? -errno : 0;
+#else
+    return -ENOTSUP;
+#endif
+}
+
+int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive)
+{
+    return qemu_lock_fcntl(fd, start, len, exclusive ? F_WRLCK : F_RDLCK);
+}
+
+int qemu_unlock_fd(int fd, int64_t start, int64_t len)
+{
+    return qemu_lock_fcntl(fd, start, len, F_UNLCK);
+}
 #endif
 
 /*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 02/14] block: Define BDRV_O_SHARE_RW
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 03/14] qemu-io: Set "share-rw" flag together with read-only Fam Zheng
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/block/block.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/block/block.h b/include/block/block.h
index 107c603..d09a932 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -98,6 +98,8 @@ typedef struct HDGeometry {
                                       select an appropriate protocol driver,
                                       ignoring the format layer */
 #define BDRV_O_NO_IO       0x10000 /* don't initialize for I/O */
+#define BDRV_O_SHARE_RW    0x20000 /* accept shared read-write from other users
+                                      of the image */
 
 #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_NO_FLUSH)
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 03/14] qemu-io: Set "share-rw" flag together with read-only
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 02/14] block: Define BDRV_O_SHARE_RW Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands Fam Zheng
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

qemu-io is a low level tool to read or modify guest visible data, which
implies the user knows very well what is being done. Allowing reading
from a locked image is harmless in most cases, so do it.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 qemu-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/qemu-io.c b/qemu-io.c
index 23a229f..f000504 100644
--- a/qemu-io.c
+++ b/qemu-io.c
@@ -585,6 +585,8 @@ int main(int argc, char **argv)
     /* open the device */
     if (!readonly) {
         flags |= BDRV_O_RDWR;
+    } else {
+        flags |= BDRV_O_SHARE_RW;
     }
 
     if ((argc - optind) == 1) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (2 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 03/14] qemu-io: Set "share-rw" flag together with read-only Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-12-02  0:52   ` Max Reitz
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 05/14] block: Set "share-rw" flag in drive-backup when sync=none Fam Zheng
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Checking the status of an image when it is being used by guest is
usually useful, and there is no risk of corrupting data, so don't let
the upcoming image locking feature limit this use case.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 qemu-img.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index afcd51f..b2f4077 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -679,6 +679,10 @@ static int img_check(int argc, char **argv)
             break;
         }
     }
+
+    if (!(flags & BDRV_O_RDWR)) {
+        flags |= BDRV_O_SHARE_RW;
+    }
     if (optind != argc - 1) {
         error_exit("Expecting one image file name");
     }
@@ -1231,6 +1235,7 @@ static int img_compare(int argc, char **argv)
         goto out3;
     }
 
+    flags |= BDRV_O_SHARE_RW;
     blk1 = img_open(image_opts, filename1, fmt1, flags, writethrough, quiet);
     if (!blk1) {
         ret = 2;
@@ -2279,7 +2284,8 @@ static ImageInfoList *collect_image_info_list(bool image_opts,
         g_hash_table_insert(filenames, (gpointer)filename, NULL);
 
         blk = img_open(image_opts, filename, fmt,
-                       BDRV_O_NO_BACKING | BDRV_O_NO_IO, false, false);
+                       BDRV_O_NO_BACKING | BDRV_O_NO_IO | BDRV_O_SHARE_RW,
+                       false, false);
         if (!blk) {
             goto err;
         }
@@ -2605,7 +2611,7 @@ static int img_map(int argc, char **argv)
         return 1;
     }
 
-    blk = img_open(image_opts, filename, fmt, 0, false, false);
+    blk = img_open(image_opts, filename, fmt, BDRV_O_SHARE_RW, false, false);
     if (!blk) {
         return 1;
     }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 05/14] block: Set "share-rw" flag in drive-backup when sync=none
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (3 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-12-02  1:01   ` Max Reitz
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 06/14] block: Set "share-rw" flag for incoming migration Fam Zheng
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

In this case we may open the source's backing image chain multiple
times. Setting share flag means the new open won't try to acquire or
check any lock, once we implement image locking.

Signed-off-by: Fam Zheng <famz@redhat.com>

---

An alternative is reusing (and bdrv_ref) the existing source's backing
bs instead of opening another one. If we decide that approach is better,
it's better to do it in a separate series.
---
 blockdev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/blockdev.c b/blockdev.c
index d11a74f..9992c5d 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3160,6 +3160,7 @@ static void do_drive_backup(DriveBackup *backup, BlockJobTxn *txn, Error **errp)
     }
     if (backup->sync == MIRROR_SYNC_MODE_NONE) {
         source = bs;
+        flags |= BDRV_O_SHARE_RW;
     }
 
     size = bdrv_getlength(bs);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 06/14] block: Set "share-rw" flag for incoming migration
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (4 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 05/14] block: Set "share-rw" flag in drive-backup when sync=none Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-12-02  1:22   ` Max Reitz
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 07/14] iotests: 055: Don't attach the drive to vm for drive-backup Fam Zheng
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

In this way the source side QEMU is writing to the image. We need to
open it, so share-rw is required.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 blockdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 9992c5d..230c7c1 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -576,7 +576,7 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
         assert((bdrv_flags & BDRV_O_CACHE_MASK) == 0);
 
         if (runstate_check(RUN_STATE_INMIGRATE)) {
-            bdrv_flags |= BDRV_O_INACTIVE;
+            bdrv_flags |= BDRV_O_INACTIVE | BDRV_O_SHARE_RW;
         }
 
         blk = blk_new_open(file, NULL, bs_opts, bdrv_flags, errp);
@@ -646,7 +646,7 @@ static BlockDriverState *bds_tree_init(QDict *bs_opts, Error **errp)
     qdict_set_default_str(bs_opts, BDRV_OPT_READ_ONLY, "off");
 
     if (runstate_check(RUN_STATE_INMIGRATE)) {
-        bdrv_flags |= BDRV_O_INACTIVE;
+        bdrv_flags |= BDRV_O_INACTIVE | BDRV_O_SHARE_RW;
     }
 
     return bdrv_open(NULL, NULL, bs_opts, bdrv_flags, errp);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 07/14] iotests: 055: Don't attach the drive to vm for drive-backup
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (5 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 06/14] block: Set "share-rw" flag for incoming migration Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 08/14] iotests: 030: Read-only open image for getting map Fam Zheng
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/qemu-iotests/055 | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/tests/qemu-iotests/055 b/tests/qemu-iotests/055
index 1d3fd04..20a7596 100755
--- a/tests/qemu-iotests/055
+++ b/tests/qemu-iotests/055
@@ -455,17 +455,18 @@ class TestDriveCompression(iotests.QMPTestCase):
         except OSError:
             pass
 
-    def do_prepare_drives(self, fmt, args):
+    def do_prepare_drives(self, fmt, args, attach):
         self.vm = iotests.VM().add_drive(test_img)
 
         qemu_img('create', '-f', fmt, blockdev_target_img,
                  str(TestDriveCompression.image_len), *args)
-        self.vm.add_drive(blockdev_target_img, format=fmt)
+        if attach:
+            self.vm.add_drive(blockdev_target_img, format=fmt)
 
         self.vm.launch()
 
-    def do_test_compress_complete(self, cmd, format, **args):
-        self.do_prepare_drives(format['type'], format['args'])
+    def do_test_compress_complete(self, cmd, format, attach, **args):
+        self.do_prepare_drives(format['type'], format['args'], attach)
 
         self.assert_no_active_block_jobs()
 
@@ -481,15 +482,16 @@ class TestDriveCompression(iotests.QMPTestCase):
 
     def test_complete_compress_drive_backup(self):
         for format in TestDriveCompression.fmt_supports_compression:
-            self.do_test_compress_complete('drive-backup', format,
+            self.do_test_compress_complete('drive-backup', format, attach=False,
                                            target=blockdev_target_img, mode='existing')
 
     def test_complete_compress_blockdev_backup(self):
         for format in TestDriveCompression.fmt_supports_compression:
-            self.do_test_compress_complete('blockdev-backup', format, target='drive1')
+            self.do_test_compress_complete('blockdev-backup', format,
+                                           attach=True, target='drive1')
 
-    def do_test_compress_cancel(self, cmd, format, **args):
-        self.do_prepare_drives(format['type'], format['args'])
+    def do_test_compress_cancel(self, cmd, format, attach, **args):
+        self.do_prepare_drives(format['type'], format['args'], attach)
 
         self.assert_no_active_block_jobs()
 
@@ -503,15 +505,16 @@ class TestDriveCompression(iotests.QMPTestCase):
 
     def test_compress_cancel_drive_backup(self):
         for format in TestDriveCompression.fmt_supports_compression:
-            self.do_test_compress_cancel('drive-backup', format,
+            self.do_test_compress_cancel('drive-backup', format, attach=False,
                                          target=blockdev_target_img, mode='existing')
 
     def test_compress_cancel_blockdev_backup(self):
        for format in TestDriveCompression.fmt_supports_compression:
-            self.do_test_compress_cancel('blockdev-backup', format, target='drive1')
+            self.do_test_compress_cancel('blockdev-backup', format, attach=True,
+                                         target='drive1')
 
-    def do_test_compress_pause(self, cmd, format, **args):
-        self.do_prepare_drives(format['type'], format['args'])
+    def do_test_compress_pause(self, cmd, format, attach, **args):
+        self.do_prepare_drives(format['type'], format['args'], attach)
 
         self.assert_no_active_block_jobs()
 
@@ -543,12 +546,13 @@ class TestDriveCompression(iotests.QMPTestCase):
 
     def test_compress_pause_drive_backup(self):
         for format in TestDriveCompression.fmt_supports_compression:
-            self.do_test_compress_pause('drive-backup', format,
+            self.do_test_compress_pause('drive-backup', format, attach=False,
                                         target=blockdev_target_img, mode='existing')
 
     def test_compress_pause_blockdev_backup(self):
         for format in TestDriveCompression.fmt_supports_compression:
-            self.do_test_compress_pause('blockdev-backup', format, target='drive1')
+            self.do_test_compress_pause('blockdev-backup', format, attach=True,
+                                        target='drive1')
 
 if __name__ == '__main__':
     iotests.main(supported_fmts=['raw', 'qcow2'])
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 08/14] iotests: 030: Read-only open image for getting map
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (6 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 07/14] iotests: 055: Don't attach the drive to vm for drive-backup Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 09/14] iotests: 087: Don't attch test image twice Fam Zheng
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/qemu-iotests/030 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 107049b..c0ab4df 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -95,7 +95,7 @@ class TestSingleDrive(iotests.QMPTestCase):
         self.assert_no_active_block_jobs()
 
         # The image map is empty before the operation
-        empty_map = qemu_io('-f', iotests.imgfmt, '-c', 'map', test_img)
+        empty_map = qemu_io('-f', iotests.imgfmt, '-c', 'map', '-r', test_img)
 
         # This is a no-op: no data should ever be copied from the base image
         result = self.vm.qmp('block-stream', device='drive0', base=mid_img)
@@ -106,7 +106,7 @@ class TestSingleDrive(iotests.QMPTestCase):
         self.assert_no_active_block_jobs()
         self.vm.shutdown()
 
-        self.assertEqual(qemu_io('-f', iotests.imgfmt, '-c', 'map', test_img),
+        self.assertEqual(qemu_io('-f', iotests.imgfmt, '-c', 'map', '-r', test_img),
                          empty_map, 'image file map changed after a no-op')
 
     def test_stream_partial(self):
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 09/14] iotests: 087: Don't attch test image twice
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (7 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 08/14] iotests: 030: Read-only open image for getting map Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 10/14] iotests: 085: Avoid image locking conflict Fam Zheng
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

The test scenario doesn't require the same image, instead it focuses on
the duplicated node-name, so use null-co to avoid locking conflict.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/qemu-iotests/087 | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/tests/qemu-iotests/087 b/tests/qemu-iotests/087
index 9de57dd..6d52f7d 100755
--- a/tests/qemu-iotests/087
+++ b/tests/qemu-iotests/087
@@ -82,8 +82,7 @@ run_qemu -drive driver=$IMGFMT,id=disk,node-name=test-node,file="$TEST_IMG" <<EO
       "driver": "$IMGFMT",
       "node-name": "disk",
       "file": {
-          "driver": "file",
-          "filename": "$TEST_IMG"
+          "driver": "null-co"
       }
     }
   }
@@ -92,8 +91,7 @@ run_qemu -drive driver=$IMGFMT,id=disk,node-name=test-node,file="$TEST_IMG" <<EO
       "driver": "$IMGFMT",
       "node-name": "test-node",
       "file": {
-          "driver": "file",
-          "filename": "$TEST_IMG"
+          "driver": "null-co"
       }
     }
   }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 10/14] iotests: 085: Avoid image locking conflict
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (8 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 09/14] iotests: 087: Don't attch test image twice Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 11/14] iotests: 091: Quit QEMU before checking image Fam Zheng
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

In the case where we test the expected error when a blockdev-snapshot
target already has a backing image, backing chain is opened multiple
times. This will be a problem when we use image locking, so use a
different backing file that is not already open.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/qemu-iotests/085     | 32 +++++++++++++++++++-------------
 tests/qemu-iotests/085.out |  3 ++-
 2 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/tests/qemu-iotests/085 b/tests/qemu-iotests/085
index c53e97f..c91d78d 100755
--- a/tests/qemu-iotests/085
+++ b/tests/qemu-iotests/085
@@ -87,24 +87,26 @@ function create_group_snapshot()
 }
 
 # ${1}: unique identifier for the snapshot filename
-# ${2}: true: open backing images; false: don't open them (default)
+# ${2}: extra_params to the blockdev-add command
+# ${3}: filename
+function do_blockdev_add()
+{
+    cmd="{ 'execute': 'blockdev-add', 'arguments':
+           { 'driver': 'qcow2', 'node-name': 'snap_${1}', ${2}
+             'file':
+             { 'driver': 'file', 'filename': '${3}',
+               'node-name': 'file_${1}' } } }"
+    _send_qemu_cmd $h "${cmd}" "return"
+}
+
+# ${1}: unique identifier for the snapshot filename
 function add_snapshot_image()
 {
-    if [ "${2}" = "true" ]; then
-        extra_params=""
-    else
-        extra_params="'backing': '', "
-    fi
     base_image="${TEST_DIR}/$((${1}-1))-${snapshot_virt0}"
     snapshot_file="${TEST_DIR}/${1}-${snapshot_virt0}"
     _make_test_img -b "${base_image}" "$size"
     mv "${TEST_IMG}" "${snapshot_file}"
-    cmd="{ 'execute': 'blockdev-add', 'arguments':
-           { 'driver': 'qcow2', 'node-name': 'snap_${1}', ${extra_params}
-             'file':
-             { 'driver': 'file', 'filename': '${snapshot_file}',
-               'node-name': 'file_${1}' } } }"
-    _send_qemu_cmd $h "${cmd}" "return"
+    do_blockdev_add "$1" "'backing': '', " "${snapshot_file}"
 }
 
 # ${1}: unique identifier for the snapshot filename
@@ -222,7 +224,11 @@ echo === Invalid command - snapshot node has a backing image ===
 echo
 
 SNAPSHOTS=$((${SNAPSHOTS}+1))
-add_snapshot_image ${SNAPSHOTS} true
+
+_make_test_img "$size"
+mv "${TEST_IMG}" "${TEST_IMG}.base"
+_make_test_img -b "${TEST_IMG}.base" "$size"
+do_blockdev_add ${SNAPSHOTS} "" "${TEST_IMG}"
 blockdev_snapshot ${SNAPSHOTS} error
 
 echo
diff --git a/tests/qemu-iotests/085.out b/tests/qemu-iotests/085.out
index 08e4bb7..7bbf84d 100644
--- a/tests/qemu-iotests/085.out
+++ b/tests/qemu-iotests/085.out
@@ -78,7 +78,8 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/
 
 === Invalid command - snapshot node has a backing image ===
 
-Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/12-snapshot-v0.IMGFMT
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 backing_file=TEST_DIR/t.IMGFMT.base
 {"return": {}}
 {"error": {"class": "GenericError", "desc": "The snapshot already has a backing image"}}
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 11/14] iotests: 091: Quit QEMU before checking image
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (9 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 10/14] iotests: 085: Avoid image locking conflict Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 12/14] tests: Use null-co:// instead of /dev/null as the dummy image Fam Zheng
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/qemu-iotests/091 | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tests/qemu-iotests/091 b/tests/qemu-iotests/091
index 32bbd56..10ac4a8 100755
--- a/tests/qemu-iotests/091
+++ b/tests/qemu-iotests/091
@@ -95,7 +95,9 @@ echo "vm2: qemu process running successfully"
 echo "vm2: flush io, and quit"
 _send_qemu_cmd $h2 'qemu-io disk flush' "(qemu)"
 _send_qemu_cmd $h2 'quit' ""
+_send_qemu_cmd $h1 'quit' ""
 
+wait
 echo "Check image pattern"
 ${QEMU_IO} -c "read -P 0x22 0 4M" "${TEST_IMG}" | _filter_testdir | _filter_qemu_io
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 12/14] tests: Use null-co:// instead of /dev/null as the dummy image
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (10 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 11/14] iotests: 091: Quit QEMU before checking image Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking Fam Zheng
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/drive_del-test.c    | 2 +-
 tests/nvme-test.c         | 2 +-
 tests/usb-hcd-uhci-test.c | 2 +-
 tests/usb-hcd-xhci-test.c | 2 +-
 tests/virtio-blk-test.c   | 2 +-
 tests/virtio-scsi-test.c  | 4 ++--
 6 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/tests/drive_del-test.c b/tests/drive_del-test.c
index 121b9c9..2175139 100644
--- a/tests/drive_del-test.c
+++ b/tests/drive_del-test.c
@@ -92,7 +92,7 @@ static void test_after_failed_device_add(void)
 static void test_drive_del_device_del(void)
 {
     /* Start with a drive used by a device that unplugs instantaneously */
-    qtest_start("-drive if=none,id=drive0,file=/dev/null,format=raw"
+    qtest_start("-drive if=none,id=drive0,file=null-co://,format=raw"
                 " -device virtio-scsi-pci"
                 " -device scsi-hd,drive=drive0,id=dev0");
 
diff --git a/tests/nvme-test.c b/tests/nvme-test.c
index c8bece4..7674a44 100644
--- a/tests/nvme-test.c
+++ b/tests/nvme-test.c
@@ -22,7 +22,7 @@ int main(int argc, char **argv)
     g_test_init(&argc, &argv, NULL);
     qtest_add_func("/nvme/nop", nop);
 
-    qtest_start("-drive id=drv0,if=none,file=/dev/null,format=raw "
+    qtest_start("-drive id=drv0,if=none,file=null-co://,format=raw "
                 "-device nvme,drive=drv0,serial=foo");
     ret = g_test_run();
 
diff --git a/tests/usb-hcd-uhci-test.c b/tests/usb-hcd-uhci-test.c
index e956b9c..24c3ea9 100644
--- a/tests/usb-hcd-uhci-test.c
+++ b/tests/usb-hcd-uhci-test.c
@@ -78,7 +78,7 @@ int main(int argc, char **argv)
 {
     const char *arch = qtest_get_arch();
     const char *cmd = "-device piix3-usb-uhci,id=uhci,addr=1d.0"
-                      " -drive id=drive0,if=none,file=/dev/null,format=raw"
+                      " -drive id=drive0,if=none,file=null-co://,format=raw"
                       " -device usb-tablet,bus=uhci.0,port=1";
     int ret;
 
diff --git a/tests/usb-hcd-xhci-test.c b/tests/usb-hcd-xhci-test.c
index 22513e9..031764d 100644
--- a/tests/usb-hcd-xhci-test.c
+++ b/tests/usb-hcd-xhci-test.c
@@ -89,7 +89,7 @@ int main(int argc, char **argv)
     qtest_add_func("/xhci/pci/hotplug/usb-uas", test_usb_uas_hotplug);
 
     qtest_start("-device nec-usb-xhci,id=xhci"
-                " -drive id=drive0,if=none,file=/dev/null,format=raw");
+                " -drive id=drive0,if=none,file=null-co://,format=raw");
     ret = g_test_run();
     qtest_end();
 
diff --git a/tests/virtio-blk-test.c b/tests/virtio-blk-test.c
index 0506917..874a9c5 100644
--- a/tests/virtio-blk-test.c
+++ b/tests/virtio-blk-test.c
@@ -66,7 +66,7 @@ static QPCIBus *pci_test_start(void)
     tmp_path = drive_create();
 
     cmdline = g_strdup_printf("-drive if=none,id=drive0,file=%s,format=raw "
-                        "-drive if=none,id=drive1,file=/dev/null,format=raw "
+                        "-drive if=none,id=drive1,file=null-co://,format=raw "
                         "-device virtio-blk-pci,id=drv0,drive=drive0,"
                         "addr=%x.%x",
                         tmp_path, PCI_SLOT, PCI_FN);
diff --git a/tests/virtio-scsi-test.c b/tests/virtio-scsi-test.c
index 79088bb..58a2796 100644
--- a/tests/virtio-scsi-test.c
+++ b/tests/virtio-scsi-test.c
@@ -40,7 +40,7 @@ static void qvirtio_scsi_start(const char *extra_opts)
     char *cmdline;
 
     cmdline = g_strdup_printf(
-                "-drive id=drv0,if=none,file=/dev/null,format=raw "
+                "-drive id=drv0,if=none,file=null-co://format=raw "
                 "-device virtio-scsi-pci,id=vs0 "
                 "-device scsi-hd,bus=vs0.0,drive=drv0 %s",
                 extra_opts ? : "");
@@ -192,7 +192,7 @@ static void hotplug(void)
 {
     QDict *response;
 
-    qvirtio_scsi_start("-drive id=drv1,if=none,file=/dev/null,format=raw");
+    qvirtio_scsi_start("-drive id=drv1,if=none,file=null-co://,format=raw");
     response = qmp("{\"execute\": \"device_add\","
                    " \"arguments\": {"
                    "   \"driver\": \"scsi-hd\","
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (11 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 12/14] tests: Use null-co:// instead of /dev/null as the dummy image Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-10-31 22:01   ` Eric Blake
                     ` (2 more replies)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock Fam Zheng
  2016-12-02  3:10 ` [Qemu-devel] [PATCH 00/14] block: Image locking series Max Reitz
  14 siblings, 3 replies; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

This implements open flag sensible image locking for local file
and host device protocol.

virtlockd in libvirt locks the first byte, so we start looking at the
file bytes from 1.

Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
four locking modes by combining two bits (BDRV_O_RDWR and
BDRV_O_SHARE_RW), and implemented by taking two locks:

Lock bytes:

* byte 1: I can't allow other processes to write to the image
* byte 2: I am writing to the image

Lock modes:

* shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
  byte 2. Test whether byte 1 is locked using an exclusive lock, and
  fail if so.

* exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
  whether byte 1 is locked using an exclusive lock, and fail if so. Then
  take shared lock on byte 1. I suppose this is racy, but we can
  probably tolerate that.

* reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything

* reader that can't tolerate writers (neither bit is set): Take shared
  lock on byte 1. Test whether byte 2 is locked, and fail if so.

The complication is in the transactional reopen.  To make the reopen
logic managable, and allow better reuse, the code is internally
organized with a table from old mode to the new one.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 660 insertions(+), 50 deletions(-)

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7c62fc3..07ab117 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -131,8 +131,44 @@ do { \
 
 #define MAX_BLOCKSIZE	4096
 
+/* Posix file locking bytes. Libvirt takes byte 0, so start from byte 1. */
+#define RAW_LOCK_BYTE_MIN 1
+#define RAW_LOCK_BYTE_NO_OTHER_WRITER 1
+#define RAW_LOCK_BYTE_WRITE     2
+#define RAW_LOCK_BYTE_MAX 2
+
+/*
+ ** shared writer: Take shared lock on byte 2. Test whether byte 1 is
+ *  locked using an exclusive lock, and fail if so.
+ *
+ ** exclusive writer: Take shared lock on byte 2. Test whether byte 1 is
+ *  locked using an exclusive lock, and fail if so. Then take shared lock
+ *  on byte 1. I suppose this is racy, but we can probably tolerate that.
+ *
+ ** reader that can tolerate writers: Don't do anything
+ *
+ ** reader that can't tolerate writers: Take shared lock on byte 1. Test
+ *  whether byte 2 is locked, and fail if so.
+ */
+
+typedef enum {
+    /* Read only and accept other writers. */
+    RAW_L_READ_SHARE_RW,
+    /* Read only and try to forbid other writers. */
+    RAW_L_READ,
+    /* Read write and accept other writers. */
+    RAW_L_WRITE_SHARE_RW,
+    /* Read write and try to forbit other writers. */
+    RAW_L_WRITE,
+} BDRVRawLockMode;
+
 typedef struct BDRVRawState {
     int fd;
+    /* A dup of @fd to make manipulating lock easier, especially during reopen,
+     * where this will accept BDRVRawReopenState.lock_fd. */
+    int lock_fd;
+    bool disable_lock;
+    bool lock_on_invalidate;
     int type;
     int open_flags;
     size_t buf_align;
@@ -146,10 +182,13 @@ typedef struct BDRVRawState {
     bool use_linux_aio:1;
     bool has_fallocate;
     bool needs_alignment;
+    BDRVRawLockMode cur_lock_mode;
 } BDRVRawState;
 
 typedef struct BDRVRawReopenState {
     int fd;
+    /* A dup of @fd used for acquiring lock. */
+    int lock_fd;
     int open_flags;
 } BDRVRawReopenState;
 
@@ -368,6 +407,77 @@ static void raw_parse_flags(int bdrv_flags, int *open_flags)
     }
 }
 
+static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
+{
+    int ret;
+    assert(fd >= 0);
+    /* Locking byte 1 avoids interfereing with virtlockd. */
+    switch (mode) {
+    case RAW_L_READ_SHARE_RW:
+        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_MIN,
+                             RAW_LOCK_BYTE_MAX - RAW_LOCK_BYTE_MIN + 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock fd");
+            goto fail;
+        }
+        break;
+    case RAW_L_READ:
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock share byte");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock write byte exclusively");
+            goto fail;
+        }
+        qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
+        break;
+    case RAW_L_WRITE_SHARE_RW:
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock write byte");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
+            goto fail;
+        }
+        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock share byte");
+            goto fail;
+        }
+        break;
+    case RAW_L_WRITE:
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock write byte");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
+            goto fail;
+        }
+        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to downgrade share byte");
+            goto fail;
+        }
+        break;
+    default:
+        abort();
+    }
+    return 0;
+fail:
+    qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+    qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
+    return -errno;
+}
+
 static void raw_parse_filename(const char *filename, QDict *options,
                                Error **errp)
 {
@@ -393,10 +503,115 @@ static QemuOptsList raw_runtime_opts = {
             .type = QEMU_OPT_STRING,
             .help = "host AIO implementation (threads, native)",
         },
+        {
+            .name = "disable-lock",
+            .type = QEMU_OPT_BOOL,
+            .help = "don't lock the file",
+        },
         { /* end of list */ }
     },
 };
 
+static BDRVRawLockMode raw_get_lock_mode(int flags)
+{
+    switch (flags & (BDRV_O_RDWR | BDRV_O_SHARE_RW)) {
+    case BDRV_O_RDWR:
+        return RAW_L_WRITE;
+    case BDRV_O_RDWR | BDRV_O_SHARE_RW:
+        return RAW_L_WRITE_SHARE_RW;
+    case BDRV_O_SHARE_RW:
+        return RAW_L_READ_SHARE_RW;
+    case 0:
+        return RAW_L_READ;
+    default:
+        abort();
+    }
+}
+
+static int raw_dup_flags(int fd, int old_flags, int new_flags,
+                         const char *filename, Error **errp)
+{
+    int ret = -1;
+    int fcntl_flags = O_APPEND | O_NONBLOCK;
+#ifdef O_NOATIME
+    fcntl_flags |= O_NOATIME;
+#endif
+
+#ifdef O_ASYNC
+    /* Not all operating systems have O_ASYNC, and those that don't
+     * will not let us track the state into rs->open_flags (typically
+     * you achieve the same effect with an ioctl, for example I_SETSIG
+     * on Solaris). But we do not use O_ASYNC, so that's fine.
+     */
+    assert((old_flags & O_ASYNC) == 0);
+#endif
+
+    if ((new_flags & ~fcntl_flags) == (old_flags & ~fcntl_flags)) {
+        /* dup the original fd */
+        ret = qemu_dup(fd);
+        if (ret >= 0) {
+            if (fcntl_setfl(ret, new_flags)) {
+                int new_fd = ret;
+                ret = -errno;
+                qemu_close(new_fd);
+            }
+        }
+    }
+
+    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
+    if (ret < 0) {
+        const char *normalized_filename = filename;
+        ret = raw_normalize_devicepath(&normalized_filename);
+        if (ret < 0) {
+            error_setg_errno(errp, -ret, "Could not normalize device path");
+        } else {
+            assert(!(new_flags & O_CREAT));
+            ret = qemu_open(normalized_filename, new_flags);
+            if (ret == -1) {
+                error_setg_errno(errp, errno, "Could not open file with new flags");
+                ret = -errno;
+            }
+        }
+    }
+    return ret;
+}
+
+static int raw_lock_image(BlockDriverState *bs, int bdrv_flags, Error **errp)
+{
+    int ret;
+    BDRVRawState *s = bs->opaque;
+    BDRVRawLockMode lock_mode;
+
+    if (bdrv_flags & BDRV_O_INACTIVE) {
+        s->disable_lock = true;
+        s->lock_on_invalidate = true;
+    }
+    if (!s->disable_lock) {
+        lock_mode = raw_get_lock_mode(bdrv_flags);
+        if (!(bdrv_flags & BDRV_O_RDWR) && access(bs->filename, W_OK) != 0) {
+            s->disable_lock = true;
+        }
+    }
+    if (!s->disable_lock && lock_mode != RAW_L_READ_SHARE_RW) {
+        int lock_flags = s->open_flags;
+        if (!(bdrv_flags & BDRV_O_SHARE_RW)) {
+            lock_flags |= O_RDWR;
+        }
+        ret = raw_dup_flags(s->fd, s->open_flags, lock_flags, bs->filename,
+                                   errp);
+        if (ret < 0) {
+            return ret;
+        }
+        s->lock_fd = ret;
+        ret = raw_lock_fd(s->lock_fd, lock_mode, errp);
+        if (ret) {
+            return ret;
+        }
+        s->cur_lock_mode = lock_mode;
+    }
+    return 0;
+}
+
 static int raw_open_common(BlockDriverState *bs, QDict *options,
                            int bdrv_flags, int open_flags, Error **errp)
 {
@@ -440,6 +655,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     raw_parse_flags(bdrv_flags, &s->open_flags);
 
     s->fd = -1;
+    s->lock_fd = -1;
     fd = qemu_open(filename, s->open_flags, 0644);
     if (fd < 0) {
         ret = -errno;
@@ -451,6 +667,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
     }
     s->fd = fd;
 
+    s->disable_lock = qemu_opt_get_bool(opts, "disable-lock", false);
+
+    if (!s->disable_lock) {
+        ret = raw_lock_image(bs, bdrv_flags, errp);
+        if (ret) {
+            goto fail;
+        }
+    }
+
 #ifdef CONFIG_LINUX_AIO
      /* Currently Linux does AIO only for files opened with O_DIRECT */
     if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) {
@@ -538,6 +763,398 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
     return raw_open_common(bs, options, flags, 0, errp);
 }
 
+typedef enum {
+    RAW_LT_PREPARE,
+    RAW_LT_COMMIT,
+    RAW_LT_ABORT
+} RawLockTransOp;
+
+typedef int (*RawReopenFunc)(RawLockTransOp op,
+                             int old_lock_fd, int new_lock_fd,
+                             BDRVRawLockMode old_lock,
+                             BDRVRawLockMode new_lock,
+                             Error **errp);
+
+static int raw_lt_nop(RawLockTransOp op,
+                            int old_lock_fd, int new_lock_fd,
+                            BDRVRawLockMode old_lock,
+                            BDRVRawLockMode new_lock,
+                            Error **errp)
+{
+    assert(old_lock == new_lock || new_lock == RAW_L_READ_SHARE_RW);
+    return 0;
+}
+
+static int raw_lt_from_unlock(RawLockTransOp op,
+                              int old_lock_fd, int new_lock_fd,
+                              BDRVRawLockMode old_lock,
+                              BDRVRawLockMode new_lock,
+                              Error **errp)
+{
+    assert(old_lock != new_lock);
+    assert(old_lock == RAW_L_READ_SHARE_RW);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        return raw_lock_fd(new_lock_fd, new_lock, errp);
+        break;
+    case RAW_LT_COMMIT:
+    case RAW_LT_ABORT:
+        break;
+    }
+
+    return 0;
+}
+
+static int raw_lt_read_to_write_share_rw(RawLockTransOp op,
+                                         int old_lock_fd, int new_lock_fd,
+                                         BDRVRawLockMode old_lock,
+                                         BDRVRawLockMode new_lock,
+                                         Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_READ);
+    assert(new_lock == RAW_L_WRITE_SHARE_RW);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            /* This is very unlikely, but catch it anyway. */
+            error_setg_errno(errp, errno, "Failed to downgrade new fd (share byte)");
+        }
+        break;
+    case RAW_LT_COMMIT:
+        break;
+    case RAW_LT_ABORT:
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_report("Failed to restore lock on old fd (share byte)");
+        }
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_read_to_write(RawLockTransOp op,
+                                int old_lock_fd, int new_lock_fd,
+                                BDRVRawLockMode old_lock,
+                                BDRVRawLockMode new_lock,
+                                Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_READ);
+    assert(new_lock == RAW_L_WRITE);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to restore old fd (share byte) b");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        break;
+    case RAW_LT_ABORT:
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_report("Failed to restore lock on old fd (share byte)");
+        }
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_share_rw_to_read(RawLockTransOp op,
+                                         int old_lock_fd, int new_lock_fd,
+                                         BDRVRawLockMode old_lock,
+                                         BDRVRawLockMode new_lock,
+                                         Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE_SHARE_RW);
+    assert(new_lock == RAW_L_READ);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        /* Make sure there are no other writers. */
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
+        if (ret) {
+            error_report("Failed to unlock old fd (write byte)");
+            break;
+        }
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_share_rw_to_write(RawLockTransOp op,
+                                          int old_lock_fd, int new_lock_fd,
+                                          BDRVRawLockMode old_lock,
+                                          BDRVRawLockMode new_lock,
+                                          Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE_SHARE_RW);
+    assert(new_lock == RAW_L_WRITE);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        /* Make sure there are no other writers. */
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to downgrade old fd (write byte)");
+            break;
+        }
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_to_read(RawLockTransOp op,
+                                int old_lock_fd, int new_lock_fd,
+                                BDRVRawLockMode old_lock,
+                                BDRVRawLockMode new_lock,
+                                Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE);
+    assert(new_lock == RAW_L_READ);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
+        if (ret) {
+            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_COMMIT:
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
+        if (ret) {
+            error_report("Failed to unlock old fd (write byte)");
+            break;
+        }
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+static int raw_lt_write_to_write_share_rw(RawLockTransOp op,
+                                          int old_lock_fd, int new_lock_fd,
+                                          BDRVRawLockMode old_lock,
+                                          BDRVRawLockMode new_lock,
+                                          Error **errp)
+{
+    int ret = 0;
+
+    assert(old_lock == RAW_L_WRITE);
+    assert(new_lock == RAW_L_WRITE_SHARE_RW);
+    switch (op) {
+    case RAW_LT_PREPARE:
+        break;
+    case RAW_LT_COMMIT:
+        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
+        if (ret) {
+            error_report("Failed to unlock old fd (share byte)");
+            break;
+        }
+        break;
+    case RAW_LT_ABORT:
+        break;
+    }
+    return ret ? -errno : 0;
+}
+
+/**
+ * Transactionally moving between possible locking states is tricky and must be
+ * done carefully. That is mostly because downgrading an exclusive lock to
+ * shared or unlocked is not guaranteed to be revertable. As a result, in such
+ * cases we have to defer the downgraing to "commit", given that no revert will
+ * happen after that point, and that downgrading a lock should never fail.
+ *
+ * On the other hand, upgrading a lock (e.g. from unlocked or shared to
+ * exclusive lock) must happen in "prepare" because it may fail.
+ *
+ * Manage the operation matrix with this state transition table to make
+ * fulfulling above conditions easier.
+ */
+static const struct RawReopenFuncRecord {
+    BDRVRawLockMode old_lock;
+    BDRVRawLockMode new_lock;
+    RawReopenFunc func;
+    bool need_lock_fd;
+    bool close_old_lock_fd;
+} reopen_functions[] = {
+
+    {RAW_L_READ_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, false},
+    {RAW_L_READ_SHARE_RW, RAW_L_READ, raw_lt_from_unlock, true},
+    {RAW_L_READ_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_from_unlock, true},
+    {RAW_L_READ_SHARE_RW, RAW_L_WRITE, raw_lt_from_unlock, true},
+
+    {RAW_L_READ, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
+    {RAW_L_READ, RAW_L_READ, raw_lt_nop, false, false},
+    {RAW_L_READ, RAW_L_WRITE_SHARE_RW, raw_lt_read_to_write_share_rw, true},
+    {RAW_L_READ, RAW_L_WRITE, raw_lt_read_to_write, true},
+
+    {RAW_L_WRITE_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
+    {RAW_L_WRITE_SHARE_RW, RAW_L_READ, raw_lt_write_share_rw_to_read, true},
+    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_nop, false, false},
+    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE, raw_lt_write_share_rw_to_write, true},
+
+    {RAW_L_WRITE, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
+    {RAW_L_WRITE, RAW_L_READ, raw_lt_write_to_read, true},
+    {RAW_L_WRITE, RAW_L_WRITE_SHARE_RW, raw_lt_write_to_write_share_rw, true},
+    {RAW_L_WRITE, RAW_L_WRITE, raw_lt_nop, false, false},
+};
+
+static int raw_reopen_handle_lock(BDRVReopenState *state,
+                                  RawLockTransOp op,
+                                  Error **errp)
+{
+    BDRVRawReopenState *rs = state->opaque;
+    BDRVRawState *s = state->bs->opaque;
+    BDRVRawLockMode old_lock, new_lock;
+    const struct RawReopenFuncRecord *rec;
+    int ret;
+
+    old_lock = s->cur_lock_mode;
+    if (qdict_get_try_bool(state->options, "disable-lock", false)) {
+        new_lock = RAW_L_READ_SHARE_RW;
+    } else {
+        new_lock = raw_get_lock_mode(state->flags);
+    }
+    qdict_del(state->options, "disable-lock");
+
+    for (rec = &reopen_functions[0];
+         rec < &reopen_functions[ARRAY_SIZE(reopen_functions)];
+         rec++) {
+        if (rec->old_lock == old_lock && rec->new_lock == new_lock) {
+            break;
+        }
+    }
+    assert(rec != &reopen_functions[ARRAY_SIZE(reopen_functions)]);
+
+    switch (op) {
+    case RAW_LT_PREPARE:
+        if (rec->need_lock_fd) {
+            int lock_flags = rs->open_flags;
+            if (!(state->flags & BDRV_O_SHARE_RW)) {
+                lock_flags |= O_RDWR;
+            }
+            ret = raw_dup_flags(rs->fd, s->open_flags, lock_flags,
+                                state->bs->filename, errp);
+            if (ret < 0) {
+                return ret;
+            }
+            rs->lock_fd = ret;
+        } else {
+            rs->lock_fd = -1;
+        }
+        return rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
+    case RAW_LT_COMMIT:
+        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
+        if ((rec->need_lock_fd || rec->close_old_lock_fd) && s->lock_fd >= 0) {
+            qemu_close(s->lock_fd);
+        }
+        if (rec->need_lock_fd) {
+            s->lock_fd = rs->lock_fd;
+        }
+        s->cur_lock_mode = new_lock;
+        break;
+    case RAW_LT_ABORT:
+        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);
+        if (rec->need_lock_fd) {
+            if (rs->lock_fd >= 0) {
+                qemu_close(rs->lock_fd);
+                rs->lock_fd = -1;
+            }
+        }
+        break;
+    }
+    return 0;
+}
+
 static int raw_reopen_prepare(BDRVReopenState *state,
                               BlockReopenQueue *queue, Error **errp)
 {
@@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
 
     raw_parse_flags(state->flags, &rs->open_flags);
 
-    rs->fd = -1;
-
-    int fcntl_flags = O_APPEND | O_NONBLOCK;
-#ifdef O_NOATIME
-    fcntl_flags |= O_NOATIME;
-#endif
-
-#ifdef O_ASYNC
-    /* Not all operating systems have O_ASYNC, and those that don't
-     * will not let us track the state into rs->open_flags (typically
-     * you achieve the same effect with an ioctl, for example I_SETSIG
-     * on Solaris). But we do not use O_ASYNC, so that's fine.
-     */
-    assert((s->open_flags & O_ASYNC) == 0);
-#endif
-
-    if ((rs->open_flags & ~fcntl_flags) == (s->open_flags & ~fcntl_flags)) {
-        /* dup the original fd */
-        rs->fd = qemu_dup(s->fd);
-        if (rs->fd >= 0) {
-            ret = fcntl_setfl(rs->fd, rs->open_flags);
-            if (ret) {
-                qemu_close(rs->fd);
-                rs->fd = -1;
-            }
-        }
+    ret = raw_dup_flags(s->fd, s->open_flags, rs->open_flags,
+                        state->bs->filename, errp);
+    if (ret < 0) {
+        return ret;
     }
 
-    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
-    if (rs->fd == -1) {
-        const char *normalized_filename = state->bs->filename;
-        ret = raw_normalize_devicepath(&normalized_filename);
-        if (ret < 0) {
-            error_setg_errno(errp, -ret, "Could not normalize device path");
-        } else {
-            assert(!(rs->open_flags & O_CREAT));
-            rs->fd = qemu_open(normalized_filename, rs->open_flags);
-            if (rs->fd == -1) {
-                error_setg_errno(errp, errno, "Could not reopen file");
-                ret = -1;
-            }
-        }
-    }
+    rs->fd = ret;
 
     /* Fail already reopen_prepare() if we can't get a working O_DIRECT
      * alignment with the new fd. */
-    if (rs->fd != -1) {
-        raw_probe_alignment(state->bs, rs->fd, &local_err);
-        if (local_err) {
-            qemu_close(rs->fd);
-            rs->fd = -1;
-            error_propagate(errp, local_err);
-            ret = -EINVAL;
-        }
+    raw_probe_alignment(state->bs, rs->fd, &local_err);
+    if (local_err) {
+        qemu_close(rs->fd);
+        rs->fd = -1;
+        error_propagate(errp, local_err);
+        return -EINVAL;
     }
+    ret = raw_reopen_handle_lock(state, RAW_LT_PREPARE, errp);
 
     return ret;
 }
@@ -626,6 +1206,8 @@ static void raw_reopen_commit(BDRVReopenState *state)
 
     s->open_flags = rs->open_flags;
 
+    raw_reopen_handle_lock(state, RAW_LT_COMMIT, &error_abort);
+
     qemu_close(s->fd);
     s->fd = rs->fd;
 
@@ -643,6 +1225,8 @@ static void raw_reopen_abort(BDRVReopenState *state)
         return;
     }
 
+    raw_reopen_handle_lock(state, RAW_LT_ABORT, &error_abort);
+
     if (rs->fd >= 0) {
         qemu_close(rs->fd);
         rs->fd = -1;
@@ -1332,6 +1916,10 @@ static void raw_close(BlockDriverState *bs)
         qemu_close(s->fd);
         s->fd = -1;
     }
+    if (s->lock_fd >= 0) {
+        qemu_close(s->lock_fd);
+        s->lock_fd = -1;
+    }
 }
 
 static int raw_truncate(BlockDriverState *bs, int64_t offset)
@@ -1832,6 +2420,27 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
     return 0;
 }
 
+static int raw_inactivate(BlockDriverState *bs)
+{
+    BDRVRawState *s = bs->opaque;
+    int r = 0;
+
+    if (s->cur_lock_mode != RAW_L_READ_SHARE_RW) {
+        r = raw_lock_fd(s->lock_fd, RAW_L_READ_SHARE_RW, NULL);
+    }
+    return r;
+}
+
+static void raw_invalidate_cache(BlockDriverState *bs, Error **errp)
+{
+    BDRVRawState *s = bs->opaque;
+
+    if (s->lock_on_invalidate) {
+        s->disable_lock = false;
+        raw_lock_image(bs, bdrv_get_flags(bs), errp);
+    }
+}
+
 static QemuOptsList raw_create_opts = {
     .name = "raw-create-opts",
     .head = QTAILQ_HEAD_INITIALIZER(raw_create_opts.head),
@@ -1885,7 +2494,8 @@ BlockDriver bdrv_file = {
     .bdrv_get_info = raw_get_info,
     .bdrv_get_allocated_file_size
                         = raw_get_allocated_file_size,
-
+    .bdrv_inactivate = raw_inactivate,
+    .bdrv_invalidate_cache = raw_invalidate_cache,
     .create_opts = &raw_create_opts,
 };
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (12 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking Fam Zheng
@ 2016-10-31 15:38 ` Fam Zheng
  2016-12-02 16:30   ` Max Reitz
  2016-12-02  3:10 ` [Qemu-devel] [PATCH 00/14] block: Image locking series Max Reitz
  14 siblings, 1 reply; 32+ messages in thread
From: Fam Zheng @ 2016-10-31 15:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, Max Reitz, qemu-block, rjones

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 tests/Makefile.include  |   2 +
 tests/test-image-lock.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 181 insertions(+)
 create mode 100644 tests/test-image-lock.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index cd058ef..65f8500 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -56,6 +56,7 @@ check-unit-y += tests/test-hbitmap$(EXESUF)
 gcov-files-test-hbitmap-y = blockjob.c
 check-unit-y += tests/test-blockjob$(EXESUF)
 check-unit-y += tests/test-blockjob-txn$(EXESUF)
+check-unit-y += tests/test-image-lock$(EXESUF)
 check-unit-y += tests/test-x86-cpuid$(EXESUF)
 # all code tested by test-x86-cpuid is inside topology.h
 gcov-files-test-x86-cpuid-y =
@@ -493,6 +494,7 @@ tests/test-rfifolock$(EXESUF): tests/test-rfifolock.o $(test-util-obj-y)
 tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
 tests/test-blockjob$(EXESUF): tests/test-blockjob.o $(test-block-obj-y) $(test-util-obj-y)
 tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
+tests/test-image-lock$(EXESUF): tests/test-image-lock.o $(test-block-obj-y) $(libqos-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
diff --git a/tests/test-image-lock.c b/tests/test-image-lock.c
new file mode 100644
index 0000000..db5a723
--- /dev/null
+++ b/tests/test-image-lock.c
@@ -0,0 +1,179 @@
+/*
+ * Image lock tests
+ *
+ * Copyright 2016 Red Hat, Inc.
+ *
+ * Authors:
+ *  Fam Zheng <famz@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "qapi/qmp/qbool.h"
+#include "sysemu/block-backend.h"
+
+#define TEST_IMAGE_SIZE 4096
+static char test_image[] = "/tmp/qtest.XXXXXX";
+static int test_image_fd;
+
+static BlockBackend *open_test_image(int flags, bool disable_lock)
+{
+    QDict *opts = qdict_new();
+
+    qdict_set_default_str(opts, "filename", test_image);
+    qdict_set_default_str(opts, "driver", "file");
+    if (disable_lock) {
+        qdict_put(opts, "disable-lock", qbool_from_bool(true));
+    }
+
+    return blk_new_open(NULL, NULL, opts, flags | BDRV_O_ALLOW_RDWR, NULL);
+}
+
+#define RW true
+#define RO false
+#define SHARE true
+#define EXCLU false
+
+static struct CompatData {
+    bool write_1;
+    bool share_1;
+    bool write_2;
+    bool share_2;
+    bool compatible;
+} compat_data[] = {
+    /* Write 1, Share 1, Write 2, Share 2, Compatible. */
+    { RO,       SHARE,   RO,      SHARE,   true,  },
+    { RO,       SHARE,   RO,      EXCLU,   true,  },
+    { RO,       SHARE,   RW,      SHARE,   true,  },
+    { RO,       SHARE,   RW,      EXCLU,   true,  },
+    { RO,       EXCLU,   RO,      SHARE,   true,  },
+    { RO,       EXCLU,   RO,      EXCLU,   true,  },
+    { RO,       EXCLU,   RW,      SHARE,   false, },
+    { RO,       EXCLU,   RW,      EXCLU,   false, },
+};
+
+/* Test one combination scenario.
+ *
+ * @flags1: The flags of the first blk.
+ * @flags2: The flags of the second blk.
+ * @disable1: The value for raw-posix disable-lock option of the first blk.
+ * @disable2: The value for raw-posix disable-lock option of the second blk.
+ * @from_reopen: Whether or not the first blk should get flags1 from a reopen.
+ * @initial: The source flags from which the blk1 is reopened, only
+ *                effective if @from_reopen is true.
+ */
+static void do_test_compat_one(int flags1, int flags2,
+                               bool disable1, bool disable2,
+                               bool from_reopen, int initial_flags,
+                               bool compatible)
+{
+    BlockBackend *blk1, *blk2;
+
+    if (!from_reopen) {
+        blk1 = open_test_image(flags1, disable1);
+    } else {
+        int ret;
+        blk1 = open_test_image(initial_flags, disable1);
+        BlockReopenQueue *rq = NULL;
+
+        rq = bdrv_reopen_queue(rq, blk_bs(blk1), NULL, flags1);
+        ret = bdrv_reopen_multiple(rq, &error_abort);
+        g_assert_cmpint(ret, ==, 0);
+    }
+    g_assert_nonnull(blk1);
+    g_assert_cmphex(blk_get_flags(blk1) & (BDRV_O_SHARE_RW | BDRV_O_RDWR),
+                    ==, flags1);
+    blk2 = open_test_image(flags2, disable2);
+    if (compatible) {
+        g_assert_nonnull(blk2);
+    } else {
+        g_assert_null(blk2);
+    }
+    blk_unref(blk1);
+    blk_unref(blk2);
+}
+
+static void do_test_compat(bool test_disable, bool from_reopen,
+                           int initial_flags)
+{
+    int i;
+    int flags1, flags2;
+
+    for (i = 0; i < ARRAY_SIZE(compat_data); i++) {
+        struct CompatData *data = &compat_data[i];
+        bool compat = data->compatible;
+
+        flags1 = (data->write_1 ? BDRV_O_RDWR : 0) |
+                 (data->share_1 ? BDRV_O_SHARE_RW : 0);
+        flags2 = (data->write_2 ? BDRV_O_RDWR : 0) |
+                 (data->share_2 ? BDRV_O_SHARE_RW : 0);
+        if (!test_disable) {
+            do_test_compat_one(flags1, flags2, false, false,
+                               from_reopen, initial_flags, compat);
+
+            do_test_compat_one(flags2, flags1, false, false,
+                               from_reopen, initial_flags, compat);
+        } else {
+            compat = true;
+            do_test_compat_one(flags1, flags2, true, false,
+                               from_reopen, initial_flags, compat);
+            do_test_compat_one(flags1, flags2, false, true,
+                               from_reopen, initial_flags, compat);
+            do_test_compat_one(flags2, flags1, true, false,
+                               from_reopen, initial_flags, compat);
+            do_test_compat_one(flags2, flags1, false, true,
+                               from_reopen, initial_flags, compat);
+            do_test_compat_one(flags1, flags2, true, true,
+                               from_reopen, initial_flags, compat);
+        }
+    }
+}
+
+static void test_compat(void)
+{
+    do_test_compat(false, false, 0);
+}
+
+static void test_compat_after_reopen(void)
+{
+    do_test_compat(false, true, 0);
+    do_test_compat(false, true, BDRV_O_SHARE_RW);
+    do_test_compat(false, true, BDRV_O_RDWR);
+    do_test_compat(false, true, BDRV_O_RDWR | BDRV_O_SHARE_RW);
+}
+
+static void test_0bytefile(void)
+{
+    ftruncate(test_image_fd, 0);
+    do_test_compat(false, false, 0);
+}
+
+static void test_disable(void)
+{
+    do_test_compat(true, false, 0);
+    do_test_compat(true, true, 0);
+    do_test_compat(true, true, BDRV_O_SHARE_RW);
+    do_test_compat(true, true, BDRV_O_RDWR);
+    do_test_compat(true, true, BDRV_O_RDWR | BDRV_O_SHARE_RW);
+}
+
+int main(int argc, char **argv)
+{
+    test_image_fd = mkstemp(test_image);
+
+    qemu_init_main_loop(&error_fatal);
+    bdrv_init();
+
+    g_assert(test_image_fd >= 0);
+    ftruncate(test_image_fd, TEST_IMAGE_SIZE);
+    g_test_init(&argc, &argv, NULL);
+    g_test_add_func("/image-lock/compat", test_compat);
+    g_test_add_func("/image-lock/compat_after_reopen", test_compat_after_reopen);
+    g_test_add_func("/image-lock/compat_0bytefile", test_0bytefile);
+    g_test_add_func("/image-lock/disable", test_disable);
+    return g_test_run();
+}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking Fam Zheng
@ 2016-10-31 22:01   ` Eric Blake
  2016-10-31 22:39     ` Richard W.M. Jones
  2016-11-01  2:06     ` Fam Zheng
  2016-12-02  2:58   ` Max Reitz
  2016-12-02 16:13   ` Max Reitz
  2 siblings, 2 replies; 32+ messages in thread
From: Eric Blake @ 2016-10-31 22:01 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Kevin Wolf, rjones, qemu-block, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 4233 bytes --]

On 10/31/2016 10:38 AM, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.

What happens if we try to use a raw file with less than 3 bytes? There's
not much to be locked in that case (especially if we round down to
sector sizes - the file is effectively empty) - but it's probably a
corner case you have to be aware of.

> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.
> 
> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.
> 
> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally

s/managable/manageable/

> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 

> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbit other writers. */

s/forbit/forbid/


>  
> +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> +{
> +    int ret;
> +    assert(fd >= 0);
> +    /* Locking byte 1 avoids interfereing with virtlockd. */

s/interfereing/interfering/


> +/**
> + * Transactionally moving between possible locking states is tricky and must be
> + * done carefully. That is mostly because downgrading an exclusive lock to
> + * shared or unlocked is not guaranteed to be revertable. As a result, in such

s/revertable/revertible/

> + * cases we have to defer the downgraing to "commit", given that no revert will

s/downgraing/downgrading/

> + * happen after that point, and that downgrading a lock should never fail.
> + *
> + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> + * exclusive lock) must happen in "prepare" because it may fail.
> + *
> + * Manage the operation matrix with this state transition table to make
> + * fulfulling above conditions easier.

s/fulfulling/fulfilling/


> @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>  
>      raw_parse_flags(state->flags, &rs->open_flags);
>  
> -    rs->fd = -1;
> -
> -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> -#ifdef O_NOATIME
> -    fcntl_flags |= O_NOATIME;
> -#endif
> -
> -#ifdef O_ASYNC
> -    /* Not all operating systems have O_ASYNC, and those that don't
> -     * will not let us track the state into rs->open_flags (typically
> -     * you achieve the same effect with an ioctl, for example I_SETSIG
> -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> -     */
> -    assert((s->open_flags & O_ASYNC) == 0);
> -#endif

It looks like you are doing some code motion (refactoring into a helper
function) mixed in with everything else; it might be worth splitting
that into a separate commit for ease of review.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 604 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-10-31 22:01   ` Eric Blake
@ 2016-10-31 22:39     ` Richard W.M. Jones
  2016-11-01  2:06     ` Fam Zheng
  1 sibling, 0 replies; 32+ messages in thread
From: Richard W.M. Jones @ 2016-10-31 22:39 UTC (permalink / raw)
  To: Eric Blake; +Cc: Fam Zheng, qemu-devel, Kevin Wolf, qemu-block, Max Reitz


On Mon, Oct 31, 2016 at 05:01:44PM -0500, Eric Blake wrote:
> On 10/31/2016 10:38 AM, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> 
> What happens if we try to use a raw file with less than 3 bytes? There's
> not much to be locked in that case (especially if we round down to
> sector sizes - the file is effectively empty) - but it's probably a
> corner case you have to be aware of.

It would be nice if qemu could handle zero-length or even < 512 byte
drives, but we (libguestfs) gave up that battle a long time ago after
encountering several bugs in this area.  These days when you try to
add an empty file[1] through the libguestfs API, the implementation
replaces it with a 4096 byte raw-format file of zeroes.

More here:
https://github.com/libguestfs/libguestfs/blob/master/src/drives.c#L395

Rich.

[1] It has the name "/dev/null" for historical reasons, but is not
that actual device for the reasons given above.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-10-31 22:01   ` Eric Blake
  2016-10-31 22:39     ` Richard W.M. Jones
@ 2016-11-01  2:06     ` Fam Zheng
  1 sibling, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-11-01  2:06 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, Kevin Wolf, rjones, qemu-block, Max Reitz

On Mon, 10/31 17:01, Eric Blake wrote:
> On 10/31/2016 10:38 AM, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> 
> What happens if we try to use a raw file with less than 3 bytes? There's
> not much to be locked in that case (especially if we round down to
> sector sizes - the file is effectively empty) - but it's probably a
> corner case you have to be aware of.

Yes.  Not sure if it is complete (and always true) but patch 14 covers a 0 byte
test case and it seems the lock just works the same a a large file.

So I look at the kernel implementation of fcntl locks, to see if it limits lock
range to file size, but I don't see any.

I also manually tested with "touch /var/tmp/zerofile && qemu-io
/var/tmp/zerofile" and "lslocks", it indeed report the locked bytes tough the
file is 0 byte.

As another example, /dev/null is also lockable by this series, that's why I have
to add a patch to change it to null-co://.

So, I think where a zero byte file cannot be locked, if any, is a corner case.

> 
> > 
> > Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> > four locking modes by combining two bits (BDRV_O_RDWR and
> > BDRV_O_SHARE_RW), and implemented by taking two locks:
> > 
> > Lock bytes:
> > 
> > * byte 1: I can't allow other processes to write to the image
> > * byte 2: I am writing to the image
> > 
> > Lock modes:
> > 
> > * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
> >   byte 2. Test whether byte 1 is locked using an exclusive lock, and
> >   fail if so.
> > 
> > * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
> >   whether byte 1 is locked using an exclusive lock, and fail if so. Then
> >   take shared lock on byte 1. I suppose this is racy, but we can
> >   probably tolerate that.
> > 
> > * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> > 
> > * reader that can't tolerate writers (neither bit is set): Take shared
> >   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> > 
> > The complication is in the transactional reopen.  To make the reopen
> > logic managable, and allow better reuse, the code is internally
> 
> s/managable/manageable/
> 
> > organized with a table from old mode to the new one.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 660 insertions(+), 50 deletions(-)
> > 
> 
> > +typedef enum {
> > +    /* Read only and accept other writers. */
> > +    RAW_L_READ_SHARE_RW,
> > +    /* Read only and try to forbid other writers. */
> > +    RAW_L_READ,
> > +    /* Read write and accept other writers. */
> > +    RAW_L_WRITE_SHARE_RW,
> > +    /* Read write and try to forbit other writers. */
> 
> s/forbit/forbid/
> 
> 
> >  
> > +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> > +{
> > +    int ret;
> > +    assert(fd >= 0);
> > +    /* Locking byte 1 avoids interfereing with virtlockd. */
> 
> s/interfereing/interfering/
> 
> 
> > +/**
> > + * Transactionally moving between possible locking states is tricky and must be
> > + * done carefully. That is mostly because downgrading an exclusive lock to
> > + * shared or unlocked is not guaranteed to be revertable. As a result, in such
> 
> s/revertable/revertible/
> 
> > + * cases we have to defer the downgraing to "commit", given that no revert will
> 
> s/downgraing/downgrading/
> 
> > + * happen after that point, and that downgrading a lock should never fail.
> > + *
> > + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> > + * exclusive lock) must happen in "prepare" because it may fail.
> > + *
> > + * Manage the operation matrix with this state transition table to make
> > + * fulfulling above conditions easier.
> 
> s/fulfulling/fulfilling/

Thanks, will fix these typos and misspells.

> 
> 
> > @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
> >  
> >      raw_parse_flags(state->flags, &rs->open_flags);
> >  
> > -    rs->fd = -1;
> > -
> > -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> > -#ifdef O_NOATIME
> > -    fcntl_flags |= O_NOATIME;
> > -#endif
> > -
> > -#ifdef O_ASYNC
> > -    /* Not all operating systems have O_ASYNC, and those that don't
> > -     * will not let us track the state into rs->open_flags (typically
> > -     * you achieve the same effect with an ioctl, for example I_SETSIG
> > -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> > -     */
> > -    assert((s->open_flags & O_ASYNC) == 0);
> > -#endif
> 
> It looks like you are doing some code motion (refactoring into a helper
> function) mixed in with everything else; it might be worth splitting
> that into a separate commit for ease of review.

Yes, good idea.

Fam

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd Fam Zheng
@ 2016-12-02  0:30   ` Max Reitz
  2016-12-08  6:53     ` Fam Zheng
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2016-12-02  0:30 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 1963 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> They are wrappers of POSIX fcntl "file private locking".
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  include/qemu/osdep.h |  2 ++
>  util/osdep.c         | 29 +++++++++++++++++++++++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 0e3c330..f15e122 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -294,6 +294,8 @@ int qemu_close(int fd);
>  #ifndef _WIN32
>  int qemu_dup(int fd);
>  #endif
> +int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive);
> +int qemu_unlock_fd(int fd, int64_t start, int64_t len);
>  
>  #if defined(__HAIKU__) && defined(__i386__)
>  #define FMT_pid "%ld"
> diff --git a/util/osdep.c b/util/osdep.c
> index 06fb1cf..b85a490 100644
> --- a/util/osdep.c
> +++ b/util/osdep.c
> @@ -140,6 +140,35 @@ static int qemu_parse_fdset(const char *param)
>  {
>      return qemu_parse_fd(param);
>  }
> +
> +static int qemu_lock_fcntl(int fd, int64_t start, int64_t len, int fl_type)
> +{
> +#ifdef F_OFD_SETLK
> +    int ret;
> +    struct flock fl = {
> +        .l_whence = SEEK_SET,
> +        .l_start  = start,
> +        .l_len    = len,
> +        .l_type   = fl_type,
> +    };
> +    do {
> +        ret = fcntl(fd, F_OFD_SETLK, &fl);
> +    } while (ret == -1 && errno == EINTR);

As I've asked in the last version: Can EINTR happen at all? My man page
tells me it's possible only with F(_OFD)_SETLKW.

Max

> +    return ret == -1 ? -errno : 0;
> +#else
> +    return -ENOTSUP;
> +#endif
> +}
> +
> +int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive)
> +{
> +    return qemu_lock_fcntl(fd, start, len, exclusive ? F_WRLCK : F_RDLCK);
> +}
> +
> +int qemu_unlock_fd(int fd, int64_t start, int64_t len)
> +{
> +    return qemu_lock_fcntl(fd, start, len, F_UNLCK);
> +}
>  #endif
>  
>  /*
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands Fam Zheng
@ 2016-12-02  0:52   ` Max Reitz
  2016-12-08  7:19     ` Fam Zheng
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2016-12-02  0:52 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 2813 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> Checking the status of an image when it is being used by guest is
> usually useful,

True for qemu-img info and maybe even qemu-img compare (and qemu-img map
is just a debugging tool, so that's fine, too), but I don't think
qemu-img check is very useful. You're not unlikely to see leaks and
maybe even errors (with lazy_refcounts=on) which don't mean anything
because they will go away once the VM is shut down.

>                 and there is no risk of corrupting data, so don't let
> the upcoming image locking feature limit this use case.

I agree that there is no harm in doing it, but for qemu-img check I also
think it isn't very useful either.

Anyway, you can keep it since I think it should not be doing anything:
The formats implementing qemu-img check should clear BDRV_O_SHARE_RW
anyway (unless overridden however that may work).

> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  qemu-img.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/qemu-img.c b/qemu-img.c
> index afcd51f..b2f4077 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -679,6 +679,10 @@ static int img_check(int argc, char **argv)
>              break;
>          }
>      }
> +
> +    if (!(flags & BDRV_O_RDWR)) {
> +        flags |= BDRV_O_SHARE_RW;
> +    }

If you want to keep this for img_check() (and I'm not going to stop you
if you do), I think it would be better to put this right in front of
img_open() to make it really clear that both are not set at the same
time (without having to look into bdrv_parse_cache_mode()).

Max

>      if (optind != argc - 1) {
>          error_exit("Expecting one image file name");
>      }
> @@ -1231,6 +1235,7 @@ static int img_compare(int argc, char **argv)
>          goto out3;
>      }
>  
> +    flags |= BDRV_O_SHARE_RW;
>      blk1 = img_open(image_opts, filename1, fmt1, flags, writethrough, quiet);
>      if (!blk1) {
>          ret = 2;
> @@ -2279,7 +2284,8 @@ static ImageInfoList *collect_image_info_list(bool image_opts,
>          g_hash_table_insert(filenames, (gpointer)filename, NULL);
>  
>          blk = img_open(image_opts, filename, fmt,
> -                       BDRV_O_NO_BACKING | BDRV_O_NO_IO, false, false);
> +                       BDRV_O_NO_BACKING | BDRV_O_NO_IO | BDRV_O_SHARE_RW,
> +                       false, false);
>          if (!blk) {
>              goto err;
>          }
> @@ -2605,7 +2611,7 @@ static int img_map(int argc, char **argv)
>          return 1;
>      }
>  
> -    blk = img_open(image_opts, filename, fmt, 0, false, false);
> +    blk = img_open(image_opts, filename, fmt, BDRV_O_SHARE_RW, false, false);
>      if (!blk) {
>          return 1;
>      }
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 05/14] block: Set "share-rw" flag in drive-backup when sync=none
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 05/14] block: Set "share-rw" flag in drive-backup when sync=none Fam Zheng
@ 2016-12-02  1:01   ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2016-12-02  1:01 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 1717 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> In this case we may open the source's backing image chain multiple
> times. Setting share flag means the new open won't try to acquire or
> check any lock, once we implement image locking.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> 
> ---
> 
> An alternative is reusing (and bdrv_ref) the existing source's backing
> bs instead of opening another one. If we decide that approach is better,
> it's better to do it in a separate series.

Yes, it is better (there's a reason why we do it for drive-mirror), and
while I somehow agree that we could put it off until later, I'm not sure
we should. Opening an image with both BDRV_O_RDWR and BDRV_O_SHARE_RW at
the same time just because our implementation is lacking is not ideal.

Anyway, the whole issue becomes more complex when involving format
drivers. I'll write more to that in a response to the cover letter.

> ---
>  blockdev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/blockdev.c b/blockdev.c
> index d11a74f..9992c5d 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -3160,6 +3160,7 @@ static void do_drive_backup(DriveBackup *backup, BlockJobTxn *txn, Error **errp)
>      }
>      if (backup->sync == MIRROR_SYNC_MODE_NONE) {
>          source = bs;
> +        flags |= BDRV_O_SHARE_RW;

In any case, there should be a comment explaining the situation here
(having to go through git blame is a bit tedious...), possibly involving
a TODO or FIXME regarding that we really shouldn't be using
BDRV_O_SHARE_RW (or maybe we should? I'm not sure, I'll explore it in
said cover letter response).

Max

>      }
>  
>      size = bdrv_getlength(bs);
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 06/14] block: Set "share-rw" flag for incoming migration
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 06/14] block: Set "share-rw" flag for incoming migration Fam Zheng
@ 2016-12-02  1:22   ` Max Reitz
  0 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2016-12-02  1:22 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 682 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> In this way the source side QEMU is writing to the image. We need to
> open it, so share-rw is required.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  blockdev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Should we make some effort to clear the flag once migration is done
(i.e. via bdrv_reopen())? Also, should we set it everywhere where
BDRV_O_INACTIVE is set (e.g. in postcopy migration, too)?

Is it possible for these images to be opened R/W? If so, we have to make
sure the source has also opened the image in shared mode, otherwise
opening it will fail despite of BDRV_O_SHARE_RW.

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking Fam Zheng
  2016-10-31 22:01   ` Eric Blake
@ 2016-12-02  2:58   ` Max Reitz
  2017-01-18 10:48     ` Fam Zheng
  2016-12-02 16:13   ` Max Reitz
  2 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2016-12-02  2:58 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 16240 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.
> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.

Testing whether something is locked would be easier by using F_OFD_GETLK
instead of actually creating an exclusive lock and then releasing it.
The reason why I proposed exclusive locks for this in the first place
was because I had the order reversed (first take the exclusive lock for
testing, then take the permanent shared lock, then release the exclusive
lock).

(And so you don't overlook it: You should be using F_OFD_GETLK, see [2])

> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.

As I said in my response to Kevin, we can make this non-racy in two ways:

(1) Just take an exclusive lock on both bytes.
(2) First start as an exclusive reader. Then upgrade the lock to that of
an exclusive writer by taking an exclusive lock on byte 2 and then
downgrading it to a shared lock.

The latter can be optimized by taking exclusive locks on both bytes and
then downgrading both to shared locks. But at that point you could just
as well do (1).

> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally
> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7c62fc3..07ab117 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c
> @@ -131,8 +131,44 @@ do { \
>  
>  #define MAX_BLOCKSIZE	4096
>  
> +/* Posix file locking bytes. Libvirt takes byte 0, so start from byte 1. */
> +#define RAW_LOCK_BYTE_MIN 1
> +#define RAW_LOCK_BYTE_NO_OTHER_WRITER 1
> +#define RAW_LOCK_BYTE_WRITE     2
> +#define RAW_LOCK_BYTE_MAX 2
> +
> +/*
> + ** shared writer: Take shared lock on byte 2. Test whether byte 1 is
> + *  locked using an exclusive lock, and fail if so.
> + *
> + ** exclusive writer: Take shared lock on byte 2. Test whether byte 1 is
> + *  locked using an exclusive lock, and fail if so. Then take shared lock
> + *  on byte 1. I suppose this is racy, but we can probably tolerate that.
> + *
> + ** reader that can tolerate writers: Don't do anything
> + *
> + ** reader that can't tolerate writers: Take shared lock on byte 1. Test
> + *  whether byte 2 is locked, and fail if so.
> + */

I guess it would make sense to bring this comment into the same order as
the names are given in BDRVRawLockMode.

> +
> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbit other writers. */
> +    RAW_L_WRITE,
> +} BDRVRawLockMode;
> +
>  typedef struct BDRVRawState {
>      int fd;
> +    /* A dup of @fd to make manipulating lock easier, especially during reopen,
> +     * where this will accept BDRVRawReopenState.lock_fd. */
> +    int lock_fd;
> +    bool disable_lock;
> +    bool lock_on_invalidate;
>      int type;
>      int open_flags;
>      size_t buf_align;
> @@ -146,10 +182,13 @@ typedef struct BDRVRawState {
>      bool use_linux_aio:1;
>      bool has_fallocate;
>      bool needs_alignment;
> +    BDRVRawLockMode cur_lock_mode;
>  } BDRVRawState;
>  
>  typedef struct BDRVRawReopenState {
>      int fd;
> +    /* A dup of @fd used for acquiring lock. */
> +    int lock_fd;
>      int open_flags;
>  } BDRVRawReopenState;
>  
> @@ -368,6 +407,77 @@ static void raw_parse_flags(int bdrv_flags, int *open_flags)
>      }
>  }
>  
> +static int raw_lock_fd(int fd, BDRVRawLockMode mode, Error **errp)
> +{
> +    int ret;
> +    assert(fd >= 0);
> +    /* Locking byte 1 avoids interfereing with virtlockd. */

This comment should be superseded by the one above the definition of
RAW_LOCK_BYTE_*.

> +    switch (mode) {
> +    case RAW_L_READ_SHARE_RW:
> +        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_MIN,
> +                             RAW_LOCK_BYTE_MAX - RAW_LOCK_BYTE_MIN + 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock fd");

Why not use -ret instead of errno?

> +            goto fail;
> +        }
> +        break;
> +    case RAW_L_READ:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte exclusively");

I think we should drop errno in favor of a more useful hint like "(maybe
already opened R/W by a different process?)", but that seems a bit long
to include...

> +            goto fail;
> +        }
> +        qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
> +        break;
> +    case RAW_L_WRITE_SHARE_RW:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
> +            goto fail;
> +        }
> +        ret = qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock share byte");
> +            goto fail;
> +        }
> +        break;
> +    case RAW_L_WRITE:
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock write byte");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock share byte exclusively");
> +            goto fail;
> +        }
> +        ret = qemu_lock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to downgrade share byte");
> +            goto fail;
> +        }
> +        break;
> +    default:
> +        abort();
> +    }
> +    return 0;
> +fail:
> +    qemu_unlock_fd(fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +    qemu_unlock_fd(fd, RAW_LOCK_BYTE_WRITE, 1);
> +    return -errno;

qemu_unlock_fd() may overwrite errno. Not sure if that is intended.
Also, just returning ret would be fine.

> +}
> +
>  static void raw_parse_filename(const char *filename, QDict *options,
>                                 Error **errp)
>  {
> @@ -393,10 +503,115 @@ static QemuOptsList raw_runtime_opts = {
>              .type = QEMU_OPT_STRING,
>              .help = "host AIO implementation (threads, native)",
>          },
> +        {
> +            .name = "disable-lock",
> +            .type = QEMU_OPT_BOOL,
> +            .help = "don't lock the file",
> +        },
>          { /* end of list */ }
>      },
>  };
>  
> +static BDRVRawLockMode raw_get_lock_mode(int flags)
> +{
> +    switch (flags & (BDRV_O_RDWR | BDRV_O_SHARE_RW)) {
> +    case BDRV_O_RDWR:
> +        return RAW_L_WRITE;
> +    case BDRV_O_RDWR | BDRV_O_SHARE_RW:
> +        return RAW_L_WRITE_SHARE_RW;
> +    case BDRV_O_SHARE_RW:
> +        return RAW_L_READ_SHARE_RW;
> +    case 0:
> +        return RAW_L_READ;
> +    default:
> +        abort();
> +    }
> +}
> +
> +static int raw_dup_flags(int fd, int old_flags, int new_flags,
> +                         const char *filename, Error **errp)
> +{
> +    int ret = -1;
> +    int fcntl_flags = O_APPEND | O_NONBLOCK;
> +#ifdef O_NOATIME
> +    fcntl_flags |= O_NOATIME;
> +#endif
> +
> +#ifdef O_ASYNC
> +    /* Not all operating systems have O_ASYNC, and those that don't
> +     * will not let us track the state into rs->open_flags (typically
> +     * you achieve the same effect with an ioctl, for example I_SETSIG
> +     * on Solaris). But we do not use O_ASYNC, so that's fine.
> +     */
> +    assert((old_flags & O_ASYNC) == 0);
> +#endif
> +
> +    if ((new_flags & ~fcntl_flags) == (old_flags & ~fcntl_flags)) {
> +        /* dup the original fd */
> +        ret = qemu_dup(fd);
> +        if (ret >= 0) {
> +            if (fcntl_setfl(ret, new_flags)) {
> +                int new_fd = ret;
> +                ret = -errno;
> +                qemu_close(new_fd);
> +            }
> +        }
> +    }
> +
> +    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
> +    if (ret < 0) {
> +        const char *normalized_filename = filename;
> +        ret = raw_normalize_devicepath(&normalized_filename);
> +        if (ret < 0) {
> +            error_setg_errno(errp, -ret, "Could not normalize device path");
> +        } else {
> +            assert(!(new_flags & O_CREAT));
> +            ret = qemu_open(normalized_filename, new_flags);
> +            if (ret == -1) {
> +                error_setg_errno(errp, errno, "Could not open file with new flags");
> +                ret = -errno;

I personally prefer it the other way round (ret = -errno;
error_setg_errno(errp, -ret, ...);) but error_setg_errno() now saves
errno, it's not wrong this way.

> +            }
> +        }
> +    }
> +    return ret;
> +}
> +
> +static int raw_lock_image(BlockDriverState *bs, int bdrv_flags, Error **errp)
> +{
> +    int ret;
> +    BDRVRawState *s = bs->opaque;
> +    BDRVRawLockMode lock_mode;
> +
> +    if (bdrv_flags & BDRV_O_INACTIVE) {
> +        s->disable_lock = true;
> +        s->lock_on_invalidate = true;
> +    }

Ah, I see, that's interesting (regarding my comments on when and how an
inactive BDS should be shared).

Anyway, using s->disable_lock both for the user-supplied option and for
internal stuff seems wrong. It's not so bad here, but it seems wrong in
raw_invalidate_cache().

> +    if (!s->disable_lock) {
> +        lock_mode = raw_get_lock_mode(bdrv_flags);
> +        if (!(bdrv_flags & BDRV_O_RDWR) && access(bs->filename, W_OK) != 0) {
> +            s->disable_lock = true;
> +        }

What is this code for (the above three lines)? Why do you disable the
lock if the image cannot be written to and it's opened read-only?

I suppose it's because of [1], but: What if it's just this process that
cannot write to it but others (e.g. invoked by root) can? Then this is
not a no-op.

> +    }
> +    if (!s->disable_lock && lock_mode != RAW_L_READ_SHARE_RW) {
> +        int lock_flags = s->open_flags;
> +        if (!(bdrv_flags & BDRV_O_SHARE_RW)) {
> +            lock_flags |= O_RDWR;
> +        }

This will result in the next function call trying to open an image R/W,
even if the original image was opened in read-only mode. That doesn't
seem quite right to me.

[1] So I guess above you are setting disable_lock to true so that
raw_dup_flags() will at least succeed.

In any case, it seems wrong to me to open the image R/W if the user has
requested read-only, even if it succeeds and even if we are not actually
writing to it.

[2] I guess this is required because otherwise we can't place exclusive
locks? Would using F_OFD_GETLK instead of trying to place an exclusive
lock fix this issue? (A quick test on my part says: Yes, it would;
F_OFD_GETLK ignores permissions and just seems to care about whether
there already is a lock.)

> +        ret = raw_dup_flags(s->fd, s->open_flags, lock_flags, bs->filename,

I'd propose using bs->exact_filename instead of bs->filename, because
the former is guaranteed not to be a json: filename.

While for raw-posix you are guaranteed to never to have a json: filename
in bs->filename, it's still a bit more clear.

> +                                   errp);

The indentation is off.

> +        if (ret < 0) {
> +            return ret;
> +        }
> +        s->lock_fd = ret;
> +        ret = raw_lock_fd(s->lock_fd, lock_mode, errp);
> +        if (ret) {
> +            return ret;
> +        }
> +        s->cur_lock_mode = lock_mode;
> +    }
> +    return 0;
> +}
> +
>  static int raw_open_common(BlockDriverState *bs, QDict *options,
>                             int bdrv_flags, int open_flags, Error **errp)
>  {
> @@ -440,6 +655,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      raw_parse_flags(bdrv_flags, &s->open_flags);
>  
>      s->fd = -1;
> +    s->lock_fd = -1;
>      fd = qemu_open(filename, s->open_flags, 0644);
>      if (fd < 0) {
>          ret = -errno;
> @@ -451,6 +667,15 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>      }
>      s->fd = fd;
>  
> +    s->disable_lock = qemu_opt_get_bool(opts, "disable-lock", false);
> +
> +    if (!s->disable_lock) {
> +        ret = raw_lock_image(bs, bdrv_flags, errp);
> +        if (ret) {
> +            goto fail;
> +        }
> +    }
> +
>  #ifdef CONFIG_LINUX_AIO
>       /* Currently Linux does AIO only for files opened with O_DIRECT */
>      if (s->use_linux_aio && !(s->open_flags & O_DIRECT)) {

Pausing review here because it's already 4 am...

(I'll try to look at it later today, after I've had some sleep.)

> @@ -538,6 +763,398 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,

[...]

> @@ -1832,6 +2420,27 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)

And continuing here.

>      return 0;
>  }
>  
> +static int raw_inactivate(BlockDriverState *bs)
> +{
> +    BDRVRawState *s = bs->opaque;
> +    int r = 0;
> +
> +    if (s->cur_lock_mode != RAW_L_READ_SHARE_RW) {
> +        r = raw_lock_fd(s->lock_fd, RAW_L_READ_SHARE_RW, NULL);
> +    }
> +    return r;
> +}
> +
> +static void raw_invalidate_cache(BlockDriverState *bs, Error **errp)
> +{
> +    BDRVRawState *s = bs->opaque;
> +
> +    if (s->lock_on_invalidate) {
> +        s->disable_lock = false;

As I hinted at above, this looks wrong. If the user has specified that
this image should be unlocked, it should be unlocked.

Max

> +        raw_lock_image(bs, bdrv_get_flags(bs), errp);
> +    }
> +}
> +
>  static QemuOptsList raw_create_opts = {
>      .name = "raw-create-opts",
>      .head = QTAILQ_HEAD_INITIALIZER(raw_create_opts.head),
> @@ -1885,7 +2494,8 @@ BlockDriver bdrv_file = {
>      .bdrv_get_info = raw_get_info,
>      .bdrv_get_allocated_file_size
>                          = raw_get_allocated_file_size,
> -
> +    .bdrv_inactivate = raw_inactivate,
> +    .bdrv_invalidate_cache = raw_invalidate_cache,
>      .create_opts = &raw_create_opts,
>  };
>  
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 00/14] block: Image locking series
  2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
                   ` (13 preceding siblings ...)
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock Fam Zheng
@ 2016-12-02  3:10 ` Max Reitz
  14 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2016-12-02  3:10 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 3070 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> This is v9 of the image locking series. I redid the whole series, adopting the
> "two locks" approach from Kevin and Max.
> 
> Depends on "[Qemu-devel] [PATCH] raw-posix: Rename 'raw_s' to 'rs'" in Max's
> block branch.
> 
> Fam Zheng (14):
>   osdep: Add qemu_lock_fd and qemu_unlock_fd
>   block: Define BDRV_O_SHARE_RW
>   qemu-io: Set "share-rw" flag together with read-only
>   qemu-img: Set "share-rw" flag in read-only commands
>   block: Set "share-rw" flag in drive-backup when sync=none
>   block: Set "share-rw" flag for incoming migration
>   iotests: 055: Don't attach the drive to vm for drive-backup
>   iotests: 030: Read-only open image for getting map
>   iotests: 087: Don't attch test image twice
>   iotests: 085: Avoid image locking conflict
>   iotests: 091: Quit QEMU before checking image
>   tests: Use null-co:// instead of /dev/null as the dummy image
>   raw-posix: Implement image locking
>   tests: Add test-image-lock

One issue I have with the series in the current state is that it does
not involve the format layer. For raw images, it's fine to be shared if
BDRV_O_SHARE_RW comes from the block backend, but for qcow2 images it's
not. Therefore, most drivers in the format layer should force
BDRV_O_SHARE_RW to be off for the protocol layer.

In fact, BDRV_O_SHARE_RW does not mean that an image has to allow
sharing, but it means that the block backend (i.e. the user of the block
device) is fine with sharing. Therefore, it's actually very fine to set
this flag for the drive-backup target because we can easily justify not
to care about concurrent writes there.

However, we have to have a way to override locking, and you have four
patches in your series that look like they're trying to do this:

We don't really have to override locking in patches 3 and 4, it's enough
to give a hint to the block layer that sharing is fine (i.e. to set
BDRV_O_SHARE_RW). If something in the block layer overrides this
decision, we can just let the user override it again (by setting
disable-lock).

Patch 5 can easily justify setting BDRV_O_SHARE_RW (as said above), but
it cannot justify setting disable-lock, and there is no user to do so.
If the target image is e.g. in qcow2 format, we should not just
force-share the image but we should fix drive-backup.

Patch 6 on the other hand is very justified in force-sharing the image.
Unfortunately, it doesn't really do that. Setting BDRV_O_SHARE_RW should
not do that, as I said above. It should ideally set disable-lock. But
then again, patch 13 does all kinds of funny force-sharing based on
whether BDRV_O_INACTIVE is set, so maybe this could be the general
model: Would it work to drop patch 6 and just make raw-posix always keep
the image unlocked if BDRV_O_INACTIVE is set?

(The issue with setting disable-lock internally is furthermore that it's
raw-posix-specific (and I think that is justified) -- you cannot just
set it blindly and hope it gets to the right BDS eventually.)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking Fam Zheng
  2016-10-31 22:01   ` Eric Blake
  2016-12-02  2:58   ` Max Reitz
@ 2016-12-02 16:13   ` Max Reitz
  2 siblings, 0 replies; 32+ messages in thread
From: Max Reitz @ 2016-12-02 16:13 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 25890 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.
> 
> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks:
> 
> Lock bytes:
> 
> * byte 1: I can't allow other processes to write to the image
> * byte 2: I am writing to the image
> 
> Lock modes:
> 
> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>   fail if so.
> 
> * exclusive writer (BDRV_O_RDWR only): Take shared lock on byte 2. Test
>   whether byte 1 is locked using an exclusive lock, and fail if so. Then
>   take shared lock on byte 1. I suppose this is racy, but we can
>   probably tolerate that.
> 
> * reader that can tolerate writers (BDRV_O_SHARE_RW only): Don't do anything
> 
> * reader that can't tolerate writers (neither bit is set): Take shared
>   lock on byte 1. Test whether byte 2 is locked, and fail if so.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally
> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/raw-posix.c | 710 ++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 660 insertions(+), 50 deletions(-)
> 
> diff --git a/block/raw-posix.c b/block/raw-posix.c
> index 7c62fc3..07ab117 100644
> --- a/block/raw-posix.c
> +++ b/block/raw-posix.c

Resuming review!

> @@ -538,6 +763,398 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
>      return raw_open_common(bs, options, flags, 0, errp);
>  }
>  
> +typedef enum {
> +    RAW_LT_PREPARE,
> +    RAW_LT_COMMIT,
> +    RAW_LT_ABORT
> +} RawLockTransOp;
> +
> +typedef int (*RawReopenFunc)(RawLockTransOp op,
> +                             int old_lock_fd, int new_lock_fd,
> +                             BDRVRawLockMode old_lock,
> +                             BDRVRawLockMode new_lock,
> +                             Error **errp);
> +
> +static int raw_lt_nop(RawLockTransOp op,
> +                            int old_lock_fd, int new_lock_fd,
> +                            BDRVRawLockMode old_lock,
> +                            BDRVRawLockMode new_lock,
> +                            Error **errp)

Indentation is off.

> +{
> +    assert(old_lock == new_lock || new_lock == RAW_L_READ_SHARE_RW);
> +    return 0;
> +}
> +

Note and question before everything (after I had read basically
everything...): A small test shows me that locks are apparently shared
between dup'ed file descriptors, i.e.:

fd = open("foo");

lock(fd);
dup(fd);
unlock(fd);
close(fd);

new_fd = open("foo");
lock(new_fd); // fails

It works the other way around, too:

fd = open("foo");

lock(fd);
duped_fd = dup(fd);
unlock(duped_fd);
close(duped_fd);

new_fd = open("foo");
lock(new_fd); // fails


(The first lock() call has to use a shared lock and the second an
exclusive lock so that it works within a single process.)

This is pretty horrible because we don't know how old_lock_fd and
new_lock_fd are related. old_lock_fd was created somewhere through
raw_dup_flags() from s->fd; we have no idea whether that was actually
dup'ed or reopened. Then, rs->fd is created from s->fd, again through
raw_dup_flags(). Finally, new_lock_fd is created from rs->fd, once more
through said raw_dup_flags().

In the end, we have no idea whether new_lock_fd() is the end of a long
dup() chain from old_lock_fd or whether it was reopened somewhere in
between (well, we know that it has to be reopened when changing from R/W
to RO or the other way around, but we have no idea when keeping R/W or
RO (because dup() or fcntl() may have simply failed)).

In my opinion, this makes reviewing the code pretty difficult because
it's hard to keep in mind what cases may happen everywhere.

Would it be better to always fully re-open() the lock_fd instead of
duplicating it? Then we would at least know that old_lock_fd and
new_lock_fd never share the same lock configuration.

If you don't want to do that, there should be at least comments in the
RawReopenFuncs explaining whether new_lock_fd may share locks with
old_lock_fd or whether that's impossible.

> +static int raw_lt_from_unlock(RawLockTransOp op,
> +                              int old_lock_fd, int new_lock_fd,
> +                              BDRVRawLockMode old_lock,
> +                              BDRVRawLockMode new_lock,
> +                              Error **errp)
> +{
> +    assert(old_lock != new_lock);
> +    assert(old_lock == RAW_L_READ_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        return raw_lock_fd(new_lock_fd, new_lock, errp);
> +        break;

That break is a tad superfluous.

> +    case RAW_LT_COMMIT:

Maybe a "fall through" comment or something here? I think Coverity might
complain without.

> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +
> +    return 0;
> +}
> +
> +static int raw_lt_read_to_write_share_rw(RawLockTransOp op,
> +                                         int old_lock_fd, int new_lock_fd,
> +                                         BDRVRawLockMode old_lock,
> +                                         BDRVRawLockMode new_lock,
> +                                         Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_READ);
> +    assert(new_lock == RAW_L_WRITE_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            /* This is very unlikely, but catch it anyway. */
> +            error_setg_errno(errp, errno, "Failed to downgrade new fd (share byte)");

I'd put a "break;" even into the last error block.

However, maybe a "break;" is not correct: When a reopening operation
fails, its abort function does not get called (as far as I can see). So
if we fail somewhere here, the old lock won't be restored. I think you
should either manually invoke the abort branch or in
raw_reopen_handle_lock() just always invoke the function again with
RAW_LT_ABORT if RAW_LT_PREPARE failed.

> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_report("Failed to restore lock on old fd (share byte)");
> +        }
> +        break;
> +    }
> +    return ret ? -errno : 0;

What's wrong with "return ret;"?

> +}
> +
> +static int raw_lt_read_to_write(RawLockTransOp op,
> +                                int old_lock_fd, int new_lock_fd,
> +                                BDRVRawLockMode old_lock,
> +                                BDRVRawLockMode new_lock,
> +                                Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_READ);
> +    assert(new_lock == RAW_L_WRITE);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to unlock old fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to upgrade new fd (share byte)");
> +            break;
> +        }

I'd suggest upgrading RAW_LOCK_BYTE_WRITE, too, to avoid race conditions.

(By the way, I don't think we can use F_OFD_GETLK here, we have to
upgrade (and then optionally downgrade) the locks because both locks
have to be held at the same time. Then again, this isn't so bad because
new_lock_fd has write access anyway.)

> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to restore old fd (share byte) b");

block/raw-posix.c:892:81: warning: trailing " b" at the end of string
[-Wpedantic]

> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }

This is exactly the same as the qemu_lock_fd() invocation above it
(without the trailing " b" though).

> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_report("Failed to restore lock on old fd (share byte)");
> +        }
> +        break;
> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_share_rw_to_read(RawLockTransOp op,
> +                                         int old_lock_fd, int new_lock_fd,
> +                                         BDRVRawLockMode old_lock,
> +                                         BDRVRawLockMode new_lock,
> +                                         Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE_SHARE_RW);
> +    assert(new_lock == RAW_L_READ);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        /* Make sure there are no other writers. */
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
> +        if (ret) {
> +            error_report("Failed to unlock old fd (write byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_ABORT:
> +        break;

How about restoring the shared lock on RAW_LOCK_BYTE_WRITE in old_lock_fd?

> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_share_rw_to_write(RawLockTransOp op,
> +                                          int old_lock_fd, int new_lock_fd,
> +                                          BDRVRawLockMode old_lock,
> +                                          BDRVRawLockMode new_lock,
> +                                          Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE_SHARE_RW);
> +    assert(new_lock == RAW_L_WRITE);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        /* Make sure there are no other writers. */
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock old fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to downgrade old fd (write byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (write byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        break;

As I said above, as far as I can see, the abort branch is not invoked if
the prepare branch failed. However, it should be, and then it's not
correct for this to be empty: old_lock_fd may still have an exclusive
lock on RAW_LOCK_BYTE_WRITE, this should be downgraded to a shared lock.

> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_to_read(RawLockTransOp op,
> +                                int old_lock_fd, int new_lock_fd,
> +                                BDRVRawLockMode old_lock,
> +                                BDRVRawLockMode new_lock,
> +                                Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE);
> +    assert(new_lock == RAW_L_READ);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, errno, "Failed to lock new fd (share byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1);
> +        if (ret) {
> +            error_report("Failed to unlock old fd (write byte)");
> +            break;
> +        }

old_lock_fd is closed automatically which should take care of this.
Better safe than sorry, though, of course.

(But there are a couple of places above where you are relying on
old_lock_fd being closed automatically, and also on new_lock_fd being
closed automatically on abort.)

> +        break;
> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +static int raw_lt_write_to_write_share_rw(RawLockTransOp op,
> +                                          int old_lock_fd, int new_lock_fd,
> +                                          BDRVRawLockMode old_lock,
> +                                          BDRVRawLockMode new_lock,
> +                                          Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE);
> +    assert(new_lock == RAW_L_WRITE_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:

You're not locking the write byte of new_lock here. Maybe it gets
inherited from old_lock_fd automatically, but we don't know whether
new_lock_fd is a dup() of old_lock_fd or whether it is not.

> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_report("Failed to unlock old fd (share byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +    return ret ? -errno : 0;
> +}
> +
> +/**
> + * Transactionally moving between possible locking states is tricky and must be
> + * done carefully. That is mostly because downgrading an exclusive lock to
> + * shared or unlocked is not guaranteed to be revertable. As a result, in such
> + * cases we have to defer the downgraing to "commit", given that no revert will
> + * happen after that point, and that downgrading a lock should never fail.
> + *
> + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> + * exclusive lock) must happen in "prepare" because it may fail.
> + *
> + * Manage the operation matrix with this state transition table to make
> + * fulfulling above conditions easier.
> + */
> +static const struct RawReopenFuncRecord {
> +    BDRVRawLockMode old_lock;
> +    BDRVRawLockMode new_lock;
> +    RawReopenFunc func;
> +    bool need_lock_fd;
> +    bool close_old_lock_fd;
> +} reopen_functions[] = {
> +
> +    {RAW_L_READ_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, false},
> +    {RAW_L_READ_SHARE_RW, RAW_L_READ, raw_lt_from_unlock, true},
> +    {RAW_L_READ_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_from_unlock, true},
> +    {RAW_L_READ_SHARE_RW, RAW_L_WRITE, raw_lt_from_unlock, true},
> +
> +    {RAW_L_READ, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
> +    {RAW_L_READ, RAW_L_READ, raw_lt_nop, false, false},
> +    {RAW_L_READ, RAW_L_WRITE_SHARE_RW, raw_lt_read_to_write_share_rw, true},
> +    {RAW_L_READ, RAW_L_WRITE, raw_lt_read_to_write, true},
> +
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_READ, raw_lt_write_share_rw_to_read, true},
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE_SHARE_RW, raw_lt_nop, false, false},
> +    {RAW_L_WRITE_SHARE_RW, RAW_L_WRITE, raw_lt_write_share_rw_to_write, true},
> +
> +    {RAW_L_WRITE, RAW_L_READ_SHARE_RW, raw_lt_nop, false, true},
> +    {RAW_L_WRITE, RAW_L_READ, raw_lt_write_to_read, true},
> +    {RAW_L_WRITE, RAW_L_WRITE_SHARE_RW, raw_lt_write_to_write_share_rw, true},
> +    {RAW_L_WRITE, RAW_L_WRITE, raw_lt_nop, false, false},
> +};
> +
> +static int raw_reopen_handle_lock(BDRVReopenState *state,
> +                                  RawLockTransOp op,
> +                                  Error **errp)
> +{
> +    BDRVRawReopenState *rs = state->opaque;
> +    BDRVRawState *s = state->bs->opaque;
> +    BDRVRawLockMode old_lock, new_lock;
> +    const struct RawReopenFuncRecord *rec;
> +    int ret;
> +
> +    old_lock = s->cur_lock_mode;
> +    if (qdict_get_try_bool(state->options, "disable-lock", false)) {

Putting a disable_lock into BDRVRawReopenState, setting that in prepare
and then copying it over to s->disable_lock on commit seems better to me.

> +        new_lock = RAW_L_READ_SHARE_RW;
> +    } else {
> +        new_lock = raw_get_lock_mode(state->flags);
> +    }
> +    qdict_del(state->options, "disable-lock");

So you're reading it on prepare, then deleting it, and on commit you're
just getting new_lock from raw_get_lock_mode() because you already
deleted it from state->options? That seems wrong to me.

> +
> +    for (rec = &reopen_functions[0];
> +         rec < &reopen_functions[ARRAY_SIZE(reopen_functions)];
> +         rec++) {
> +        if (rec->old_lock == old_lock && rec->new_lock == new_lock) {
> +            break;
> +        }
> +    }
> +    assert(rec != &reopen_functions[ARRAY_SIZE(reopen_functions)]);
> +
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        if (rec->need_lock_fd) {
> +            int lock_flags = rs->open_flags;
> +            if (!(state->flags & BDRV_O_SHARE_RW)) {
> +                lock_flags |= O_RDWR;
> +            }

Again, I think we can and should do without this.

> +            ret = raw_dup_flags(rs->fd, s->open_flags, lock_flags,
> +                                state->bs->filename, errp);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +            rs->lock_fd = ret;
> +        } else {
> +            rs->lock_fd = -1;
> +        }
> +        return rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);

As I said above somewhere, it might make sense to fall through to
RAW_LT_ABORT if this returns an error.

> +    case RAW_LT_COMMIT:
> +        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);

Since you're ignoring the return value, you can also pass &error_abort
(I know it doesn't matter because this function is invoked with errp ==
&error_abort, but it looks cleaner).

> +        if ((rec->need_lock_fd || rec->close_old_lock_fd) && s->lock_fd >= 0) {
> +            qemu_close(s->lock_fd);

I'd propose setting it to -1 afterwards (relevant when
!rec->need_lock_fd && rec->close_old_lock_fd).

> +        }
> +        if (rec->need_lock_fd) {
> +            s->lock_fd = rs->lock_fd;
> +        }
> +        s->cur_lock_mode = new_lock;
> +        break;
> +    case RAW_LT_ABORT:
> +        rec->func(op, s->lock_fd, rs->lock_fd, old_lock, new_lock, errp);

Again, could pass &error_abort instead of errp.

> +        if (rec->need_lock_fd) {
> +            if (rs->lock_fd >= 0) {

Actually, rs->lock_fd >= 0 should be equivalent to rec->need_lock_fd.

> +                qemu_close(rs->lock_fd);
> +                rs->lock_fd = -1;
> +            }
> +        }
> +        break;
> +    }
> +    return 0;
> +}
> +
>  static int raw_reopen_prepare(BDRVReopenState *state,
>                                BlockReopenQueue *queue, Error **errp)
>  {
> @@ -560,61 +1177,24 @@ static int raw_reopen_prepare(BDRVReopenState *state,
>  
>      raw_parse_flags(state->flags, &rs->open_flags);
>  
> -    rs->fd = -1;
> -
> -    int fcntl_flags = O_APPEND | O_NONBLOCK;
> -#ifdef O_NOATIME
> -    fcntl_flags |= O_NOATIME;
> -#endif
> -
> -#ifdef O_ASYNC
> -    /* Not all operating systems have O_ASYNC, and those that don't
> -     * will not let us track the state into rs->open_flags (typically
> -     * you achieve the same effect with an ioctl, for example I_SETSIG
> -     * on Solaris). But we do not use O_ASYNC, so that's fine.
> -     */
> -    assert((s->open_flags & O_ASYNC) == 0);
> -#endif
> -
> -    if ((rs->open_flags & ~fcntl_flags) == (s->open_flags & ~fcntl_flags)) {
> -        /* dup the original fd */
> -        rs->fd = qemu_dup(s->fd);
> -        if (rs->fd >= 0) {
> -            ret = fcntl_setfl(rs->fd, rs->open_flags);
> -            if (ret) {
> -                qemu_close(rs->fd);
> -                rs->fd = -1;
> -            }
> -        }
> +    ret = raw_dup_flags(s->fd, s->open_flags, rs->open_flags,
> +                        state->bs->filename, errp);
> +    if (ret < 0) {
> +        return ret;
>      }
>  
> -    /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
> -    if (rs->fd == -1) {
> -        const char *normalized_filename = state->bs->filename;
> -        ret = raw_normalize_devicepath(&normalized_filename);
> -        if (ret < 0) {
> -            error_setg_errno(errp, -ret, "Could not normalize device path");
> -        } else {
> -            assert(!(rs->open_flags & O_CREAT));
> -            rs->fd = qemu_open(normalized_filename, rs->open_flags);
> -            if (rs->fd == -1) {
> -                error_setg_errno(errp, errno, "Could not reopen file");
> -                ret = -1;
> -            }
> -        }
> -    }
> +    rs->fd = ret;
>  
>      /* Fail already reopen_prepare() if we can't get a working O_DIRECT
>       * alignment with the new fd. */
> -    if (rs->fd != -1) {
> -        raw_probe_alignment(state->bs, rs->fd, &local_err);
> -        if (local_err) {
> -            qemu_close(rs->fd);
> -            rs->fd = -1;
> -            error_propagate(errp, local_err);
> -            ret = -EINVAL;
> -        }
> +    raw_probe_alignment(state->bs, rs->fd, &local_err);
> +    if (local_err) {
> +        qemu_close(rs->fd);
> +        rs->fd = -1;
> +        error_propagate(errp, local_err);
> +        return -EINVAL;
>      }
> +    ret = raw_reopen_handle_lock(state, RAW_LT_PREPARE, errp);

Shouldn't this have the same clean-up routine on error as above, i.e.
close the new FD?

>  
>      return ret;
>  }
> @@ -626,6 +1206,8 @@ static void raw_reopen_commit(BDRVReopenState *state)
>  
>      s->open_flags = rs->open_flags;
>  
> +    raw_reopen_handle_lock(state, RAW_LT_COMMIT, &error_abort);
> +
>      qemu_close(s->fd);
>      s->fd = rs->fd;
>  
> @@ -643,6 +1225,8 @@ static void raw_reopen_abort(BDRVReopenState *state)
>          return;
>      }
>  
> +    raw_reopen_handle_lock(state, RAW_LT_ABORT, &error_abort);
> +
>      if (rs->fd >= 0) {
>          qemu_close(rs->fd);
>          rs->fd = -1;
> @@ -1332,6 +1916,10 @@ static void raw_close(BlockDriverState *bs)
>          qemu_close(s->fd);
>          s->fd = -1;
>      }
> +    if (s->lock_fd >= 0) {
> +        qemu_close(s->lock_fd);
> +        s->lock_fd = -1;
> +    }
>  }
>  
>  static int raw_truncate(BlockDriverState *bs, int64_t offset)
> @@ -1832,6 +2420,27 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)

*Resumed review complete* :-)

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock
  2016-10-31 15:38 ` [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock Fam Zheng
@ 2016-12-02 16:30   ` Max Reitz
  2016-12-09  7:39     ` Fam Zheng
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2016-12-02 16:30 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel; +Cc: Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

[-- Attachment #1: Type: text/plain, Size: 3629 bytes --]

On 31.10.2016 16:38, Fam Zheng wrote:
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  tests/Makefile.include  |   2 +
>  tests/test-image-lock.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 181 insertions(+)
>  create mode 100644 tests/test-image-lock.c
> 
> diff --git a/tests/Makefile.include b/tests/Makefile.include
> index cd058ef..65f8500 100644
> --- a/tests/Makefile.include
> +++ b/tests/Makefile.include
> @@ -56,6 +56,7 @@ check-unit-y += tests/test-hbitmap$(EXESUF)
>  gcov-files-test-hbitmap-y = blockjob.c
>  check-unit-y += tests/test-blockjob$(EXESUF)
>  check-unit-y += tests/test-blockjob-txn$(EXESUF)
> +check-unit-y += tests/test-image-lock$(EXESUF)
>  check-unit-y += tests/test-x86-cpuid$(EXESUF)
>  # all code tested by test-x86-cpuid is inside topology.h
>  gcov-files-test-x86-cpuid-y =
> @@ -493,6 +494,7 @@ tests/test-rfifolock$(EXESUF): tests/test-rfifolock.o $(test-util-obj-y)
>  tests/test-throttle$(EXESUF): tests/test-throttle.o $(test-block-obj-y)
>  tests/test-blockjob$(EXESUF): tests/test-blockjob.o $(test-block-obj-y) $(test-util-obj-y)
>  tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o $(test-block-obj-y) $(test-util-obj-y)
> +tests/test-image-lock$(EXESUF): tests/test-image-lock.o $(test-block-obj-y) $(libqos-obj-y)
>  tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
>  tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
>  tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y)
> diff --git a/tests/test-image-lock.c b/tests/test-image-lock.c
> new file mode 100644
> index 0000000..db5a723
> --- /dev/null
> +++ b/tests/test-image-lock.c
> @@ -0,0 +1,179 @@
> +/*
> + * Image lock tests
> + *
> + * Copyright 2016 Red Hat, Inc.
> + *
> + * Authors:
> + *  Fam Zheng <famz@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING.LIB file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qapi/error.h"
> +#include "qemu/error-report.h"
> +#include "qapi/qmp/qbool.h"
> +#include "sysemu/block-backend.h"
> +
> +#define TEST_IMAGE_SIZE 4096
> +static char test_image[] = "/tmp/qtest.XXXXXX";
> +static int test_image_fd;
> +
> +static BlockBackend *open_test_image(int flags, bool disable_lock)
> +{
> +    QDict *opts = qdict_new();
> +
> +    qdict_set_default_str(opts, "filename", test_image);
> +    qdict_set_default_str(opts, "driver", "file");
> +    if (disable_lock) {
> +        qdict_put(opts, "disable-lock", qbool_from_bool(true));
> +    }
> +
> +    return blk_new_open(NULL, NULL, opts, flags | BDRV_O_ALLOW_RDWR, NULL);
> +}
> +
> +#define RW true
> +#define RO false
> +#define SHARE true
> +#define EXCLU false
> +
> +static struct CompatData {
> +    bool write_1;
> +    bool share_1;
> +    bool write_2;
> +    bool share_2;
> +    bool compatible;
> +} compat_data[] = {
> +    /* Write 1, Share 1, Write 2, Share 2, Compatible. */
> +    { RO,       SHARE,   RO,      SHARE,   true,  },
> +    { RO,       SHARE,   RO,      EXCLU,   true,  },
> +    { RO,       SHARE,   RW,      SHARE,   true,  },
> +    { RO,       SHARE,   RW,      EXCLU,   true,  },
> +    { RO,       EXCLU,   RO,      SHARE,   true,  },
> +    { RO,       EXCLU,   RO,      EXCLU,   true,  },
> +    { RO,       EXCLU,   RW,      SHARE,   false, },
> +    { RO,       EXCLU,   RW,      EXCLU,   false, },
> +};

Without having looked closer at the test, what about RW/RW compatibility?

Max


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd
  2016-12-02  0:30   ` Max Reitz
@ 2016-12-08  6:53     ` Fam Zheng
  0 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-12-08  6:53 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

On Fri, 12/02 01:30, Max Reitz wrote:
> On 31.10.2016 16:38, Fam Zheng wrote:
> > They are wrappers of POSIX fcntl "file private locking".
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  include/qemu/osdep.h |  2 ++
> >  util/osdep.c         | 29 +++++++++++++++++++++++++++++
> >  2 files changed, 31 insertions(+)
> > 
> > diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > index 0e3c330..f15e122 100644
> > --- a/include/qemu/osdep.h
> > +++ b/include/qemu/osdep.h
> > @@ -294,6 +294,8 @@ int qemu_close(int fd);
> >  #ifndef _WIN32
> >  int qemu_dup(int fd);
> >  #endif
> > +int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive);
> > +int qemu_unlock_fd(int fd, int64_t start, int64_t len);
> >  
> >  #if defined(__HAIKU__) && defined(__i386__)
> >  #define FMT_pid "%ld"
> > diff --git a/util/osdep.c b/util/osdep.c
> > index 06fb1cf..b85a490 100644
> > --- a/util/osdep.c
> > +++ b/util/osdep.c
> > @@ -140,6 +140,35 @@ static int qemu_parse_fdset(const char *param)
> >  {
> >      return qemu_parse_fd(param);
> >  }
> > +
> > +static int qemu_lock_fcntl(int fd, int64_t start, int64_t len, int fl_type)
> > +{
> > +#ifdef F_OFD_SETLK
> > +    int ret;
> > +    struct flock fl = {
> > +        .l_whence = SEEK_SET,
> > +        .l_start  = start,
> > +        .l_len    = len,
> > +        .l_type   = fl_type,
> > +    };
> > +    do {
> > +        ret = fcntl(fd, F_OFD_SETLK, &fl);
> > +    } while (ret == -1 && errno == EINTR);
> 
> As I've asked in the last version: Can EINTR happen at all? My man page
> tells me it's possible only with F(_OFD)_SETLKW.

I think you are right, let's drop the condition.

Fam

> 
> Max
> 
> > +    return ret == -1 ? -errno : 0;
> > +#else
> > +    return -ENOTSUP;
> > +#endif
> > +}
> > +
> > +int qemu_lock_fd(int fd, int64_t start, int64_t len, bool exclusive)
> > +{
> > +    return qemu_lock_fcntl(fd, start, len, exclusive ? F_WRLCK : F_RDLCK);
> > +}
> > +
> > +int qemu_unlock_fd(int fd, int64_t start, int64_t len)
> > +{
> > +    return qemu_lock_fcntl(fd, start, len, F_UNLCK);
> > +}
> >  #endif
> >  
> >  /*
> > 
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands
  2016-12-02  0:52   ` Max Reitz
@ 2016-12-08  7:19     ` Fam Zheng
  0 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-12-08  7:19 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

On Fri, 12/02 01:52, Max Reitz wrote:
> > diff --git a/qemu-img.c b/qemu-img.c
> > index afcd51f..b2f4077 100644
> > --- a/qemu-img.c
> > +++ b/qemu-img.c
> > @@ -679,6 +679,10 @@ static int img_check(int argc, char **argv)
> >              break;
> >          }
> >      }
> > +
> > +    if (!(flags & BDRV_O_RDWR)) {
> > +        flags |= BDRV_O_SHARE_RW;
> > +    }
> 
> If you want to keep this for img_check() (and I'm not going to stop you
> if you do), I think it would be better to put this right in front of
> img_open() to make it really clear that both are not set at the same
> time (without having to look into bdrv_parse_cache_mode()).

Sounds good, will move it.

Fam

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock
  2016-12-02 16:30   ` Max Reitz
@ 2016-12-09  7:39     ` Fam Zheng
  0 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2016-12-09  7:39 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

On Fri, 12/02 17:30, Max Reitz wrote:
> > +static struct CompatData {
> > +    bool write_1;
> > +    bool share_1;
> > +    bool write_2;
> > +    bool share_2;
> > +    bool compatible;
> > +} compat_data[] = {
> > +    /* Write 1, Share 1, Write 2, Share 2, Compatible. */
> > +    { RO,       SHARE,   RO,      SHARE,   true,  },
> > +    { RO,       SHARE,   RO,      EXCLU,   true,  },
> > +    { RO,       SHARE,   RW,      SHARE,   true,  },
> > +    { RO,       SHARE,   RW,      EXCLU,   true,  },
> > +    { RO,       EXCLU,   RO,      SHARE,   true,  },
> > +    { RO,       EXCLU,   RO,      EXCLU,   true,  },
> > +    { RO,       EXCLU,   RW,      SHARE,   false, },
> > +    { RO,       EXCLU,   RW,      EXCLU,   false, },
> > +};
> 
> Without having looked closer at the test, what about RW/RW compatibility?

Good catch, I will fix this matrix in next version.

Fam

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2016-12-02  2:58   ` Max Reitz
@ 2017-01-18 10:48     ` Fam Zheng
  2017-01-18 13:02       ` Max Reitz
  0 siblings, 1 reply; 32+ messages in thread
From: Fam Zheng @ 2017-01-18 10:48 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

On Fri, 12/02 03:58, Max Reitz wrote:
> On 31.10.2016 16:38, Fam Zheng wrote:
> > This implements open flag sensible image locking for local file
> > and host device protocol.
> > 
> > virtlockd in libvirt locks the first byte, so we start looking at the
> > file bytes from 1.
> > 
> > Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
> > four locking modes by combining two bits (BDRV_O_RDWR and
> > BDRV_O_SHARE_RW), and implemented by taking two locks:
> > 
> > Lock bytes:
> > 
> > * byte 1: I can't allow other processes to write to the image
> > * byte 2: I am writing to the image
> > 
> > Lock modes:
> > 
> > * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
> >   byte 2. Test whether byte 1 is locked using an exclusive lock, and
> >   fail if so.
> 
> Testing whether something is locked would be easier by using F_OFD_GETLK
> instead of actually creating an exclusive lock and then releasing it.

My attempt to do this shows it doesn't work: fcntl forces the tested lock type
to read lock for some unknown reason, so it cannot be used to test other shared
lockers.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2017-01-18 10:48     ` Fam Zheng
@ 2017-01-18 13:02       ` Max Reitz
  2017-01-18 13:19         ` Fam Zheng
  0 siblings, 1 reply; 32+ messages in thread
From: Max Reitz @ 2017-01-18 13:02 UTC (permalink / raw)
  To: Fam Zheng; +Cc: qemu-devel, Daniel P. Berrange, Kevin Wolf, qemu-block, rjones


[-- Attachment #1.1: Type: text/plain, Size: 1913 bytes --]

On 18.01.2017 11:48, Fam Zheng wrote:
> On Fri, 12/02 03:58, Max Reitz wrote:
>> On 31.10.2016 16:38, Fam Zheng wrote:
>>> This implements open flag sensible image locking for local file
>>> and host device protocol.
>>>
>>> virtlockd in libvirt locks the first byte, so we start looking at the
>>> file bytes from 1.
>>>
>>> Quoting what was proposed by Kevin Wolf <kwolf@redhat.com>, there are
>>> four locking modes by combining two bits (BDRV_O_RDWR and
>>> BDRV_O_SHARE_RW), and implemented by taking two locks:
>>>
>>> Lock bytes:
>>>
>>> * byte 1: I can't allow other processes to write to the image
>>> * byte 2: I am writing to the image
>>>
>>> Lock modes:
>>>
>>> * shared writer (BDRV_O_RDWR | BDRV_O_SHARE_RW): Take shared lock on
>>>   byte 2. Test whether byte 1 is locked using an exclusive lock, and
>>>   fail if so.
>>
>> Testing whether something is locked would be easier by using F_OFD_GETLK
>> instead of actually creating an exclusive lock and then releasing it.
> 
> My attempt to do this shows it doesn't work: fcntl forces the tested lock type
> to read lock for some unknown reason, so it cannot be used to test other shared
> lockers.

Do you have a reproducer? The attached quick and dirty test case works
for me (compile test.c to test and test2.c to test2), producing the
following output (when running ./test):

set rd lock: 0, Success
get lock: 0, Success; read lock in place
set wr lock: -1, Resource temporarily unavailable
unset lock: 0, Resource temporarily unavailable
===
unset lock: 0, Success
get lock: 0, Success; unlocked
set wr lock: 0, Success
unset lock: 0, Success
===
set wr lock: 0, Success
get lock: 0, Success; write lock in place
set wr lock: -1, Resource temporarily unavailable
unset lock: 0, Resource temporarily unavailable

So the l_type field is correctly set after every F_OFD_GETLK query call.

Max

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.2: test.c --]
[-- Type: text/x-csrc; name="test.c", Size: 804 bytes --]

#define _GNU_SOURCE

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>


int main(void)
{
    system("touch foo");


    int fd = open("foo", O_RDWR);

    int ret = fcntl(fd, F_OFD_SETLK, &(struct flock){ .l_type = F_RDLCK });
    printf("set rd lock: %i, %s\n", ret, strerror(errno));

    system("./test2");

    printf("===\n");

    ret = fcntl(fd, F_OFD_SETLK, &(struct flock){ .l_type = F_UNLCK });
    printf("unset lock: %i, %s\n", ret, strerror(errno));

    system("./test2");

    printf("===\n");

    ret = fcntl(fd, F_OFD_SETLK, &(struct flock){ .l_type = F_WRLCK });
    printf("set wr lock: %i, %s\n", ret, strerror(errno));

    system("./test2");

    close(fd);
    return 0;
}

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1.3: test2.c --]
[-- Type: text/x-csrc; name="test2.c", Size: 683 bytes --]

#define _GNU_SOURCE

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>


static const char *lock_names[] = {
    [F_UNLCK] = "unlocked",
    [F_RDLCK] = "read lock in place",
    [F_WRLCK] = "write lock in place",
};


int main(void)
{
    int fd = open("foo", O_RDWR);

    struct flock fl = { .l_type = F_WRLCK };
    int ret = fcntl(fd, F_OFD_GETLK, &fl);
    printf("get lock: %i, %s; %s\n", ret, strerror(errno), lock_names[fl.l_type]);

    ret = fcntl(fd, F_OFD_SETLK, &(struct flock){ .l_type = F_WRLCK });
    printf("set wr lock: %i, %s\n", ret, strerror(errno));

    close(fd);
    return 0;
}

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking
  2017-01-18 13:02       ` Max Reitz
@ 2017-01-18 13:19         ` Fam Zheng
  0 siblings, 0 replies; 32+ messages in thread
From: Fam Zheng @ 2017-01-18 13:19 UTC (permalink / raw)
  To: Max Reitz; +Cc: qemu-devel, Daniel P. Berrange, Kevin Wolf, qemu-block, rjones

On Wed, 01/18 14:02, Max Reitz wrote:
> >> Testing whether something is locked would be easier by using F_OFD_GETLK
> >> instead of actually creating an exclusive lock and then releasing it.
> > 
> > My attempt to do this shows it doesn't work: fcntl forces the tested lock type
> > to read lock for some unknown reason, so it cannot be used to test other shared
> > lockers.
> 
> Do you have a reproducer? The attached quick and dirty test case works
> for me (compile test.c to test and test2.c to test2), producing the
> following output (when running ./test):
> 
> set rd lock: 0, Success
> get lock: 0, Success; read lock in place
> set wr lock: -1, Resource temporarily unavailable
> unset lock: 0, Resource temporarily unavailable
> ===
> unset lock: 0, Success
> get lock: 0, Success; unlocked
> set wr lock: 0, Success
> unset lock: 0, Success
> ===
> set wr lock: 0, Success
> get lock: 0, Success; write lock in place
> set wr lock: -1, Resource temporarily unavailable
> unset lock: 0, Resource temporarily unavailable

You are right, now I see why I was confused with the F_OFD_GETLK interface.

Fam

> 
> So the l_type field is correctly set after every F_OFD_GETLK query call.
> 
> Max

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-01-18 13:19 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-31 15:38 [Qemu-devel] [PATCH 00/14] block: Image locking series Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 01/14] osdep: Add qemu_lock_fd and qemu_unlock_fd Fam Zheng
2016-12-02  0:30   ` Max Reitz
2016-12-08  6:53     ` Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 02/14] block: Define BDRV_O_SHARE_RW Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 03/14] qemu-io: Set "share-rw" flag together with read-only Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 04/14] qemu-img: Set "share-rw" flag in read-only commands Fam Zheng
2016-12-02  0:52   ` Max Reitz
2016-12-08  7:19     ` Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 05/14] block: Set "share-rw" flag in drive-backup when sync=none Fam Zheng
2016-12-02  1:01   ` Max Reitz
2016-10-31 15:38 ` [Qemu-devel] [PATCH 06/14] block: Set "share-rw" flag for incoming migration Fam Zheng
2016-12-02  1:22   ` Max Reitz
2016-10-31 15:38 ` [Qemu-devel] [PATCH 07/14] iotests: 055: Don't attach the drive to vm for drive-backup Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 08/14] iotests: 030: Read-only open image for getting map Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 09/14] iotests: 087: Don't attch test image twice Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 10/14] iotests: 085: Avoid image locking conflict Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 11/14] iotests: 091: Quit QEMU before checking image Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 12/14] tests: Use null-co:// instead of /dev/null as the dummy image Fam Zheng
2016-10-31 15:38 ` [Qemu-devel] [PATCH 13/14] raw-posix: Implement image locking Fam Zheng
2016-10-31 22:01   ` Eric Blake
2016-10-31 22:39     ` Richard W.M. Jones
2016-11-01  2:06     ` Fam Zheng
2016-12-02  2:58   ` Max Reitz
2017-01-18 10:48     ` Fam Zheng
2017-01-18 13:02       ` Max Reitz
2017-01-18 13:19         ` Fam Zheng
2016-12-02 16:13   ` Max Reitz
2016-10-31 15:38 ` [Qemu-devel] [PATCH 14/14] tests: Add test-image-lock Fam Zheng
2016-12-02 16:30   ` Max Reitz
2016-12-09  7:39     ` Fam Zheng
2016-12-02  3:10 ` [Qemu-devel] [PATCH 00/14] block: Image locking series Max Reitz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.