[Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format

* [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format
@ 2010-10-08 15:48 Stefan Hajnoczi
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 1/7] qcow2: Make get_bits_from_size() common Stefan Hajnoczi
                   ` (8 more replies)
  0 siblings, 9 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

QEMU Enhanced Disk format is a disk image format that forgoes features
found in qcow2 in favor of better levels of performance and data
integrity.  Due to its simpler on-disk layout, it is possible to safely
perform metadata updates more efficiently.

Installations, suspend-to-disk, and other allocation-heavy I/O workloads
will see increased performance due to fewer I/Os and syncs.  Workloads
that do not cause new clusters to be allocated will perform similarly to
raw images due to in-memory metadata caching.

The format supports sparse disk images.  It does not rely on the host
filesystem holes feature, making it a good choice for sparse disk images
that need to be transferred over channels where holes are not supported.

Backing files are supported so only deltas against a base image can be
stored.

The file format is extensible so that additional features can be added
later with graceful compatibility handling.

Internal snapshots are not supported.  This eliminates the need for
additional metadata to track copy-on-write clusters.

Compression and encryption are not supported.  They add complexity and can be
implemented at other layers in the stack (i.e. inside the guest or on the
host).  Encryption has been identified as a potential future extension and the
file format allows for this.

This patchset implements the base functionality.

Later patches will address the following points:
 * Resizing the disk image.  The capability has been designed in but the
 code has not been written yet.
 * Resetting the image after backing file commit completes.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
This code is also available from git:

http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed

v2:
 * Add QED format specification to documentation
 * Use __builtin_ctzl() for get_bits_from_size()
 * Fine-grained table locking to allow concurrent allocating write requests
 * Fix qemu_free() instead of qemu_vfree() in qed_unref_l2_cache_entry()
 * Comment clean-ups


* [Qemu-devel] [PATCH v2 1/7] qcow2: Make get_bits_from_size() common
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-08 18:01   ` [Qemu-devel] " Anthony Liguori
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values Stefan Hajnoczi
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

The get_bits_from_size() function calculates the log base-2 of a power
of 2.  This is useful in bit manipulation code working with powers of 2.

Currently used by qcow2 and needed by qed in a follow-on patch.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 block/qcow2.c |   22 ----------------------
 cutils.c      |   18 ++++++++++++++++++
 qemu-common.h |    1 +
 3 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ee3481b..6e25812 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -794,28 +794,6 @@ static int qcow2_change_backing_file(BlockDriverState *bs,
     return qcow2_update_ext_header(bs, backing_file, backing_fmt);
 }
 
-static int get_bits_from_size(size_t size)
-{
-    int res = 0;
-
-    if (size == 0) {
-        return -1;
-    }
-
-    while (size != 1) {
-        /* Not a power of two */
-        if (size & 1) {
-            return -1;
-        }
-
-        size >>= 1;
-        res++;
-    }
-
-    return res;
-}
-
-
 static int preallocate(BlockDriverState *bs)
 {
     uint64_t nb_sectors;
diff --git a/cutils.c b/cutils.c
index 5883737..6c32198 100644
--- a/cutils.c
+++ b/cutils.c
@@ -283,3 +283,21 @@ int fcntl_setfl(int fd, int flag)
 }
 #endif
 
+/**
+ * Get the number of bits for a power of 2
+ *
+ * The following is true for powers of 2:
+ *   n == 1 << get_bits_from_size(n)
+ */
+int get_bits_from_size(size_t size)
+{
+    if (size == 0 || (size & (size - 1))) {
+        return -1;
+    }
+
+#if defined(_WIN32) && defined(__x86_64__)
+    return __builtin_ctzll(size);
+#else
+    return __builtin_ctzl(size);
+#endif
+}
diff --git a/qemu-common.h b/qemu-common.h
index 81aafa0..e0ca398 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -153,6 +153,7 @@ time_t mktimegm(struct tm *tm);
 int qemu_fls(int i);
 int qemu_fdatasync(int fd);
 int fcntl_setfl(int fd, int flag);
+int get_bits_from_size(size_t size);
 
 /* path.c */
 void init_paths(const char *prefix);
-- 
1.7.1
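
For readers without a GCC-compatible compiler, the contract of
get_bits_from_size() can be sketched without __builtin_ctzl().  This is only
an illustration, not part of the patch: bits_from_size_portable() and
bits_from_size_selfcheck() are hypothetical names, and the shift loop stands
in for the builtin used in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable variant of get_bits_from_size(): same contract as
 * the patch (log2 for powers of 2, -1 otherwise), but it counts shifts
 * instead of calling __builtin_ctzl(). */
static int bits_from_size_portable(size_t size)
{
    int bits = 0;

    if (size == 0 || (size & (size - 1))) {
        return -1; /* zero, or more than one bit set */
    }
    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}

/* Usage sketch: the identity n == 1 << bits holds for powers of 2 */
static void bits_from_size_selfcheck(void)
{
    size_t n = 65536;

    assert(bits_from_size_portable(n) == 16);
    assert(n == ((size_t)1 << bits_from_size_portable(n)));
    assert(bits_from_size_portable(12) == -1);
}
```

The (size & (size - 1)) test is the same power-of-2 check the patch uses; only
the counting step differs.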


* [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 1/7] qcow2: Make get_bits_from_size() common Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-11 11:09   ` [Qemu-devel] " Kevin Wolf
                     ` (2 more replies)
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 3/7] docs: Add QED image format specification Stefan Hajnoczi
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

From: Anthony Liguori <aliguori@us.ibm.com>

This common function converts byte counts to human-readable strings with
proper units.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 cutils.c      |   15 +++++++++++++++
 qemu-common.h |    1 +
 2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/cutils.c b/cutils.c
index 6c32198..5041203 100644
--- a/cutils.c
+++ b/cutils.c
@@ -301,3 +301,18 @@ int get_bits_from_size(size_t size)
     return __builtin_ctzl(size);
 #endif
 }
+
+void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)
+{
+    if (size < (1ULL << 10)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
+    } else if (size < (1ULL << 20)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
+    } else if (size < (1ULL << 30)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
+    } else if (size < (1ULL << 40)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
+    } else {
+        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
+    }
+}
diff --git a/qemu-common.h b/qemu-common.h
index e0ca398..80ae834 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -154,6 +154,7 @@ int qemu_fls(int i);
 int qemu_fdatasync(int fd);
 int fcntl_setfl(int fd, int flag);
 int get_bits_from_size(size_t size);
+void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size);
 
 /* path.c */
 void init_paths(const char *prefix);
-- 
1.7.1
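
A short usage sketch may help show what callers get back.  The function body
below is copied from the patch so that the example builds on its own;
bytes_to_str_examples() is a hypothetical caller.  Note that the shifts
truncate, so 1536 bytes is reported as 1 KB(s) rather than 1.5.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Copy of bytes_to_str() from the patch so the sketch builds standalone */
static void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)
{
    if (size < (1ULL << 10)) {
        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
    } else if (size < (1ULL << 20)) {
        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
    } else if (size < (1ULL << 30)) {
        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
    } else if (size < (1ULL << 40)) {
        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
    } else {
        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
    }
}

/* Hypothetical caller: checks a value in three of the unit ranges */
static void bytes_to_str_examples(void)
{
    char buf[64];

    bytes_to_str(buf, sizeof(buf), 512);
    assert(strcmp(buf, "512 byte(s)") == 0);

    bytes_to_str(buf, sizeof(buf), 1536);        /* 1.5 KB is truncated */
    assert(strcmp(buf, "1 KB(s)") == 0);

    bytes_to_str(buf, sizeof(buf), 64ULL << 40); /* 2^46 bytes */
    assert(strcmp(buf, "64 TB(s)") == 0);
}
```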


* [Qemu-devel] [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 1/7] qcow2: Make get_bits_from_size() common Stefan Hajnoczi
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-10  9:20   ` [Qemu-devel] " Avi Kivity
  2010-10-11 13:58   ` Kevin Wolf
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format Stefan Hajnoczi
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 94 insertions(+), 0 deletions(-)
 create mode 100644 docs/specs/qed_spec.txt

diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
new file mode 100644
index 0000000..c942b8e
--- /dev/null
+++ b/docs/specs/qed_spec.txt
@@ -0,0 +1,94 @@
+=Specification=
+
+The file format looks like this:
+
+ +----------+----------+----------+-----+
+ | cluster0 | cluster1 | cluster2 | ... |
+ +----------+----------+----------+-----+
+
+The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2 table''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
+
+Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
+
+All fields are little-endian.
+
+==Header==
+ Header {
+     uint32_t magic;               /* QED\0 */
+ 
+     uint32_t cluster_size;        /* in bytes */
+     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
+     uint32_t header_size;         /* in clusters */
+ 
+     uint64_t features;            /* format feature bits */
+     uint64_t compat_features;     /* compat feature bits */
+     uint64_t l1_table_offset;     /* in bytes */
+     uint64_t image_size;          /* total logical image size, in bytes */
+ 
+     /* if (features & QED_F_BACKING_FILE) */
+     uint32_t backing_filename_offset; /* in bytes from start of header */
+     uint32_t backing_filename_size;   /* in bytes */
+ 
+     /* if (compat_features & QED_CF_BACKING_FORMAT) */
+     uint32_t backing_fmt_offset;  /* in bytes from start of header */
+     uint32_t backing_fmt_size;    /* in bytes */
+ }
+
+Field descriptions:
+* cluster_size must be a power of 2 in range [2^12, 2^26].
+* table_size must be a power of 2 in range [1, 16].
+* header_size is the number of clusters used by the header and any additional information stored before regular clusters.
+* features and compat_features are bitmaps where active file format features can be selectively enabled.  The difference between the two is that an image file that uses unknown compat_features bits can be safely opened without knowing how to interpret those bits.  If an image file has an unsupported features bit set then it is not possible to open that image (the image is not backwards-compatible).
+* l1_table_offset must be a multiple of cluster_size.
+* image_size is the block device size seen by the guest and must be a multiple of cluster_size.
+* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.
+
+Feature bits:
+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
+* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
+
+==Tables==
+
+Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
+
+ #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
+  
+ Table {
+     uint64_t offsets[TABLE_NOFFSETS];
+ }
+
+The tables are organized as follows:
+
+                    +----------+
+                    | L1 table |
+                    +----------+
+               ,------'  |  '------.
+          +----------+   |    +----------+
+          | L2 table |  ...   | L2 table |
+          +----------+        +----------+
+      ,------'  |  '------.
+ +----------+   |    +----------+
+ |   Data   |  ...   |   Data   |
+ +----------+        +----------+
+
+A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
+
+The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
+ header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
+
+Logical offsets are translated into cluster offsets as follows:
+
+  table_bits table_bits    cluster_bits
+  <--------> <--------> <--------------->
+ +----------+----------+-----------------+
+ | L1 index | L2 index |     byte offset |
+ +----------+----------+-----------------+
+ 
+       Structure of a logical offset
+
+ def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
+   l2_offset = l1_table[l1_index]
+   l2_table = load_table(l2_offset)
+   cluster_offset = l2_table[l2_index]
+   return cluster_offset + byte_offset
-- 
1.7.1
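
As a worked illustration of the offset translation and the image size limit
in the specification, here is a sketch in C.  It is not code from this
series: QedOffsetParts, table_noffsets(), log2_pow2(), split_logical_offset()
and max_image_size_sketch() are hypothetical names.  It covers only the step
of splitting a logical offset into its L1 index, L2 index and byte offset;
loading the tables is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three fields of a logical offset */
typedef struct {
    uint64_t l1_index;
    uint64_t l2_index;
    uint64_t byte_offset;
} QedOffsetParts;

/* TABLE_NOFFSETS from the spec: number of 64-bit entries per table */
static uint64_t table_noffsets(uint32_t cluster_size, uint32_t table_size)
{
    return (uint64_t)table_size * cluster_size / sizeof(uint64_t);
}

/* log2 of a power of 2 (the input is assumed to be valid) */
static unsigned log2_pow2(uint64_t n)
{
    unsigned bits = 0;

    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

static QedOffsetParts split_logical_offset(uint64_t pos,
                                           uint32_t cluster_size,
                                           uint32_t table_size)
{
    uint64_t noffsets = table_noffsets(cluster_size, table_size);
    unsigned cluster_bits = log2_pow2(cluster_size);
    unsigned table_bits = log2_pow2(noffsets);
    QedOffsetParts parts;

    parts.byte_offset = pos & ((uint64_t)cluster_size - 1);
    parts.l2_index = (pos >> cluster_bits) & (noffsets - 1);
    parts.l1_index = pos >> (cluster_bits + table_bits);
    return parts;
}

/* Upper bound on image_size.  Note: this can overflow uint64_t for the
 * largest cluster_size/table_size combinations; the sketch does not guard
 * against that. */
static uint64_t max_image_size_sketch(uint32_t cluster_size,
                                      uint32_t table_size)
{
    uint64_t noffsets = table_noffsets(cluster_size, table_size);

    return noffsets * noffsets * cluster_size;
}

/* Worked example: 64 KB clusters, table_size = 4 */
static void offset_split_selfcheck(void)
{
    uint64_t pos = (1ULL << 31) + 5 * 65536 + 123;
    QedOffsetParts parts = split_logical_offset(pos, 65536, 4);

    assert(table_noffsets(65536, 4) == 32768);
    assert(parts.l1_index == 1);
    assert(parts.l2_index == 5);
    assert(parts.byte_offset == 123);
    assert(max_image_size_sketch(65536, 4) == (1ULL << 46));
}
```

With cluster_size=64 KB and table_size=4, each 256 KB table holds 32768
offsets, so cluster_bits is 16, table_bits is 15, and the largest image is
2^46 bytes (64 TB).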

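The difference between features and compat_features in the specification can
be sketched as follows.  The bit values are taken from the spec; the
KNOWN_FEATURES mask and can_open_image() are hypothetical and assume a reader
that understands exactly the bits listed there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit values from the specification */
#define QED_F_BACKING_FILE     0x01
#define QED_F_NEED_CHECK       0x02
#define QED_CF_BACKING_FORMAT  0x01

/* Assumption: this reader understands exactly the bits listed in the spec */
#define KNOWN_FEATURES         (QED_F_BACKING_FILE | QED_F_NEED_CHECK)

/* An image can be opened only if every bit set in features is understood.
 * compat_features is deliberately not consulted. */
static bool can_open_image(uint64_t features, uint64_t compat_features)
{
    (void)compat_features; /* unknown compat bits never prevent opening */
    return (features & ~(uint64_t)KNOWN_FEATURES) == 0;
}

/* Usage sketch */
static void feature_bits_selfcheck(void)
{
    /* Known feature bit plus an unknown compat bit: still openable */
    assert(can_open_image(QED_F_BACKING_FILE, 0x80));
    /* Unknown feature bit: the image must be rejected */
    assert(!can_open_image(0x04, QED_CF_BACKING_FORMAT));
}
```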

* [Qemu-devel] [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
                   ` (2 preceding siblings ...)
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 3/7] docs: Add QED image format specification Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-11 15:16   ` [Qemu-devel] " Kevin Wolf
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions Stefan Hajnoczi
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

This patch introduces the qed on-disk layout and implements image
creation.  Later patches add read/write and other functionality.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
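A note on the on-disk layout: because all header fields are little-endian,
the first two fields can be checked from a raw byte buffer without the
le32_to_cpu() helpers.  The sketch below is not part of the patch.
load_le32() and looks_like_qed() are hypothetical names, and it assumes the
magic "QED\0" is stored as the bytes 'Q', 'E', 'D', '\0' at offset 0, with
cluster_size following at offset 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer, independent of host
 * endianness and alignment */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Assumption: the magic is the byte sequence 'Q', 'E', 'D', '\0' */
#define SKETCH_QED_MAGIC \
    ((uint32_t)'Q' | (uint32_t)'E' << 8 | (uint32_t)'D' << 16)

/* Returns 1 if buf starts with the magic and a valid cluster_size, else 0 */
static int looks_like_qed(const uint8_t *buf, size_t len)
{
    uint32_t cluster_size;

    if (len < 8) {
        return 0; /* too short to hold magic and cluster_size */
    }
    if (load_le32(buf) != SKETCH_QED_MAGIC) {
        return 0;
    }
    cluster_size = load_le32(buf + 4);
    /* cluster_size must be a power of 2 in [2^12, 2^26] */
    if (cluster_size < (1u << 12) || cluster_size > (1u << 26)) {
        return 0;
    }
    return (cluster_size & (cluster_size - 1)) == 0;
}

/* Usage sketch: a header prefix with 64 KB clusters */
static void looks_like_qed_selfcheck(void)
{
    uint8_t hdr[8] = { 'Q', 'E', 'D', 0, 0x00, 0x00, 0x01, 0x00 };

    assert(load_le32(hdr + 4) == 65536);
    assert(looks_like_qed(hdr, sizeof(hdr)) == 1);
}
```

This is stricter than bdrv_qed_probe() in the patch, which only compares the
magic; the cluster_size check mirrors qed_is_cluster_size_valid().
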
 Makefile.objs |    1 +
 block/qed.c   |  530 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/qed.h   |  148 ++++++++++++++++
 3 files changed, 679 insertions(+), 0 deletions(-)
 create mode 100644 block/qed.c
 create mode 100644 block/qed.h

diff --git a/Makefile.objs b/Makefile.objs
index 816194a..ff15795 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,6 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
+block-nested-y += qed.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/qed.c b/block/qed.c
new file mode 100644
index 0000000..ea03798
--- /dev/null
+++ b/block/qed.c
@@ -0,0 +1,530 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
+                          const char *filename)
+{
+    const QEDHeader *header = (const void *)buf;
+
+    if (buf_size < sizeof(*header)) {
+        return 0;
+    }
+    if (le32_to_cpu(header->magic) != QED_MAGIC) {
+        return 0;
+    }
+    return 100;
+}
+
+static void qed_header_le_to_cpu(const QEDHeader *le, QEDHeader *cpu)
+{
+    cpu->magic = le32_to_cpu(le->magic);
+    cpu->cluster_size = le32_to_cpu(le->cluster_size);
+    cpu->table_size = le32_to_cpu(le->table_size);
+    cpu->header_size = le32_to_cpu(le->header_size);
+    cpu->features = le64_to_cpu(le->features);
+    cpu->compat_features = le64_to_cpu(le->compat_features);
+    cpu->l1_table_offset = le64_to_cpu(le->l1_table_offset);
+    cpu->image_size = le64_to_cpu(le->image_size);
+    cpu->backing_filename_offset = le32_to_cpu(le->backing_filename_offset);
+    cpu->backing_filename_size = le32_to_cpu(le->backing_filename_size);
+    cpu->backing_fmt_offset = le32_to_cpu(le->backing_fmt_offset);
+    cpu->backing_fmt_size = le32_to_cpu(le->backing_fmt_size);
+}
+
+static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le)
+{
+    le->magic = cpu_to_le32(cpu->magic);
+    le->cluster_size = cpu_to_le32(cpu->cluster_size);
+    le->table_size = cpu_to_le32(cpu->table_size);
+    le->header_size = cpu_to_le32(cpu->header_size);
+    le->features = cpu_to_le64(cpu->features);
+    le->compat_features = cpu_to_le64(cpu->compat_features);
+    le->l1_table_offset = cpu_to_le64(cpu->l1_table_offset);
+    le->image_size = cpu_to_le64(cpu->image_size);
+    le->backing_filename_offset = cpu_to_le32(cpu->backing_filename_offset);
+    le->backing_filename_size = cpu_to_le32(cpu->backing_filename_size);
+    le->backing_fmt_offset = cpu_to_le32(cpu->backing_fmt_offset);
+    le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size);
+}
+
+static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
+{
+    uint64_t table_entries;
+    uint64_t l2_size;
+
+    table_entries = (table_size * cluster_size) / sizeof(uint64_t);
+    l2_size = table_entries * cluster_size;
+
+    return l2_size * table_entries;
+}
+
+static bool qed_is_cluster_size_valid(uint32_t cluster_size)
+{
+    if (cluster_size < QED_MIN_CLUSTER_SIZE ||
+        cluster_size > QED_MAX_CLUSTER_SIZE) {
+        return false;
+    }
+    if (cluster_size & (cluster_size - 1)) {
+        return false; /* not power of 2 */
+    }
+    return true;
+}
+
+static bool qed_is_table_size_valid(uint32_t table_size)
+{
+    if (table_size < QED_MIN_TABLE_SIZE ||
+        table_size > QED_MAX_TABLE_SIZE) {
+        return false;
+    }
+    if (table_size & (table_size - 1)) {
+        return false; /* not power of 2 */
+    }
+    return true;
+}
+
+static bool qed_is_image_size_valid(uint64_t image_size, uint32_t cluster_size,
+                                    uint32_t table_size)
+{
+    if (image_size == 0) {
+        /* Supporting zero size images makes life harder because even the L1
+         * table is not needed.  Make life simple and forbid zero size images.
+         */
+        return false;
+    }
+    if (image_size & (cluster_size - 1)) {
+        return false; /* not multiple of cluster size */
+    }
+    if (image_size > qed_max_image_size(cluster_size, table_size)) {
+        return false; /* image is too large */
+    }
+    return true;
+}
+
+/**
+ * Read a string of known length from the image file
+ *
+ * @file:       Image file
+ * @offset:     File offset to start of string, in bytes
+ * @n:          String length in bytes
+ * @buf:        Destination buffer
+ * @buflen:     Destination buffer length in bytes
+ *
+ * The string is NUL-terminated.
+ */
+static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
+                           char *buf, size_t buflen)
+{
+    int ret;
+    if (n >= buflen) {
+        return -EINVAL;
+    }
+    ret = bdrv_pread(file, offset, buf, n);
+    if (ret != n) {
+        return ret;
+    }
+    buf[n] = '\0';
+    return 0;
+}
+
+static int bdrv_qed_open(BlockDriverState *bs, int flags)
+{
+    BDRVQEDState *s = bs->opaque;
+    QEDHeader le_header;
+    int64_t file_size;
+    int ret;
+
+    s->bs = bs;
+
+    ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
+    if (ret != sizeof(le_header)) {
+        return ret;
+    }
+    qed_header_le_to_cpu(&le_header, &s->header);
+
+    if (s->header.magic != QED_MAGIC) {
+        return -ENOENT;
+    }
+    if (s->header.features & ~QED_FEATURE_MASK) {
+        return -ENOTSUP; /* image uses unsupported feature bits */
+    }
+    if (!qed_is_cluster_size_valid(s->header.cluster_size)) {
+        return -EINVAL;
+    }
+
+    /* Round down file size to the start of the last cluster */
+    file_size = bdrv_getlength(bs->file);
+    if (file_size < 0) {
+        return file_size;
+    }
+    s->file_size = qed_start_of_cluster(s, file_size);
+
+    if (!qed_is_table_size_valid(s->header.table_size)) {
+        return -EINVAL;
+    }
+    if (!qed_is_image_size_valid(s->header.image_size,
+                                 s->header.cluster_size,
+                                 s->header.table_size)) {
+        return -EINVAL;
+    }
+    if (!qed_check_table_offset(s, s->header.l1_table_offset)) {
+        return -EINVAL;
+    }
+
+    s->table_nelems = (s->header.cluster_size * s->header.table_size) /
+                      sizeof(uint64_t);
+    s->l2_shift = get_bits_from_size(s->header.cluster_size);
+    s->l2_mask = s->table_nelems - 1;
+    s->l1_shift = s->l2_shift + get_bits_from_size(s->l2_mask + 1);
+
+    if ((s->header.features & QED_F_BACKING_FILE)) {
+        ret = qed_read_string(bs->file, s->header.backing_filename_offset,
+                              s->header.backing_filename_size, bs->backing_file,
+                              sizeof(bs->backing_file));
+        if (ret < 0) {
+            return ret;
+        }
+
+        if ((s->header.compat_features & QED_CF_BACKING_FORMAT)) {
+            ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
+                                  s->header.backing_fmt_size,
+                                  bs->backing_format,
+                                  sizeof(bs->backing_format));
+            if (ret < 0) {
+                return ret;
+            }
+        }
+    }
+    return ret;
+}
+
+static void bdrv_qed_close(BlockDriverState *bs)
+{
+}
+
+static void bdrv_qed_flush(BlockDriverState *bs)
+{
+    bdrv_flush(bs->file);
+}
+
+static int qed_create(const char *filename, uint32_t cluster_size,
+                      uint64_t image_size, uint32_t table_size,
+                      const char *backing_file, const char *backing_fmt)
+{
+    QEDHeader header = {
+        .magic = QED_MAGIC,
+        .cluster_size = cluster_size,
+        .table_size = table_size,
+        .header_size = 1,
+        .features = 0,
+        .compat_features = 0,
+        .l1_table_offset = cluster_size,
+        .image_size = image_size,
+    };
+    QEDHeader le_header;
+    uint8_t *l1_table = NULL;
+    size_t l1_size = header.cluster_size * header.table_size;
+    int ret = 0;
+    int fd;
+
+    fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644);
+    if (fd < 0) {
+        return -errno;
+    }
+
+    if (backing_file) {
+        header.features |= QED_F_BACKING_FILE;
+        header.backing_filename_offset = sizeof(le_header);
+        header.backing_filename_size = strlen(backing_file);
+        if (backing_fmt) {
+            header.compat_features |= QED_CF_BACKING_FORMAT;
+            header.backing_fmt_offset = header.backing_filename_offset +
+                                        header.backing_filename_size;
+            header.backing_fmt_size = strlen(backing_fmt);
+        }
+    }
+
+    qed_header_cpu_to_le(&header, &le_header);
+    if (qemu_write_full(fd, &le_header, sizeof(le_header)) != sizeof(le_header)) {
+        ret = -errno;
+        goto out;
+    }
+    if (qemu_write_full(fd, backing_file, header.backing_filename_size) != header.backing_filename_size) {
+        ret = -errno;
+        goto out;
+    }
+    if (qemu_write_full(fd, backing_fmt, header.backing_fmt_size) != header.backing_fmt_size) {
+        ret = -errno;
+        goto out;
+    }
+
+    l1_table = qemu_mallocz(l1_size);
+    lseek(fd, header.l1_table_offset, SEEK_SET);
+    if (qemu_write_full(fd, l1_table, l1_size) != l1_size) {
+        ret = -errno;
+        goto out;
+    }
+
+out:
+    qemu_free(l1_table);
+    close(fd);
+    return ret;
+}
+
+static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
+{
+    uint64_t image_size = 0;
+    uint32_t cluster_size = QED_DEFAULT_CLUSTER_SIZE;
+    uint32_t table_size = QED_DEFAULT_TABLE_SIZE;
+    const char *backing_file = NULL;
+    const char *backing_fmt = NULL;
+
+    while (options && options->name) {
+        if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
+            image_size = options->value.n;
+        } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
+            backing_file = options->value.s;
+        } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FMT)) {
+            backing_fmt = options->value.s;
+        } else if (!strcmp(options->name, BLOCK_OPT_CLUSTER_SIZE)) {
+            if (options->value.n) {
+                cluster_size = options->value.n;
+            }
+        } else if (!strcmp(options->name, "table_size")) {
+            if (options->value.n) {
+                table_size = options->value.n;
+            }
+        }
+        options++;
+    }
+
+    if (!qed_is_cluster_size_valid(cluster_size)) {
+        fprintf(stderr, "QED cluster size must be within range [%u, %u] and power of 2\n",
+                QED_MIN_CLUSTER_SIZE, QED_MAX_CLUSTER_SIZE);
+        return -EINVAL;
+    }
+    if (!qed_is_table_size_valid(table_size)) {
+        fprintf(stderr, "QED table size must be within range [%u, %u] and power of 2\n",
+                QED_MIN_TABLE_SIZE, QED_MAX_TABLE_SIZE);
+        return -EINVAL;
+    }
+    if (!qed_is_image_size_valid(image_size, cluster_size, table_size)) {
+        char buffer[64];
+
+        bytes_to_str(buffer, sizeof(buffer),
+                     qed_max_image_size(cluster_size, table_size));
+
+        fprintf(stderr,
+                "QED image size must be a non-zero multiple of cluster size and less than %s\n",
+                buffer);
+        return -EINVAL;
+    }
+
+    return qed_create(filename, cluster_size, image_size, table_size,
+                      backing_file, backing_fmt);
+}
+
+static int bdrv_qed_is_allocated(BlockDriverState *bs, int64_t sector_num,
+                                  int nb_sectors, int *pnum)
+{
+    return -ENOTSUP;
+}
+
+static int bdrv_qed_make_empty(BlockDriverState *bs)
+{
+    return -ENOTSUP;
+}
+
+static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
+                                            int64_t sector_num,
+                                            QEMUIOVector *qiov, int nb_sectors,
+                                            BlockDriverCompletionFunc *cb,
+                                            void *opaque)
+{
+    return NULL;
+}
+
+static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
+                                             int64_t sector_num,
+                                             QEMUIOVector *qiov, int nb_sectors,
+                                             BlockDriverCompletionFunc *cb,
+                                             void *opaque)
+{
+    return NULL;
+}
+
+static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
+                                            BlockDriverCompletionFunc *cb,
+                                            void *opaque)
+{
+    return bdrv_aio_flush(bs->file, cb, opaque);
+}
+
+static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
+{
+    return -ENOTSUP;
+}
+
+static int64_t bdrv_qed_getlength(BlockDriverState *bs)
+{
+    BDRVQEDState *s = bs->opaque;
+    return s->header.image_size;
+}
+
+static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
+{
+    BDRVQEDState *s = bs->opaque;
+
+    memset(bdi, 0, sizeof(*bdi));
+    bdi->cluster_size = s->header.cluster_size;
+    return 0;
+}
+
+static int bdrv_qed_change_backing_file(BlockDriverState *bs,
+                                        const char *backing_file,
+                                        const char *backing_fmt)
+{
+    BDRVQEDState *s = bs->opaque;
+    QEDHeader new_header, le_header;
+    void *buffer;
+    size_t buffer_len, backing_file_len, backing_fmt_len;
+    int ret;
+
+    /* Refuse to set backing filename if unknown compat feature bits are
+     * active.  If the image uses an unknown compat feature then we may not
+     * know the layout of data following the header structure and cannot safely
+     * add a new string.
+     */
+    if (backing_file && (s->header.compat_features &
+                         ~QED_COMPAT_FEATURE_MASK)) {
+        return -ENOTSUP;
+    }
+
+    memcpy(&new_header, &s->header, sizeof(new_header));
+
+    new_header.features &= ~QED_F_BACKING_FILE;
+    new_header.compat_features &= ~QED_CF_BACKING_FORMAT;
+
+    /* Adjust feature flags */
+    if (backing_file) {
+        new_header.features |= QED_F_BACKING_FILE;
+        if (backing_fmt) {
+            new_header.compat_features |= QED_CF_BACKING_FORMAT;
+        }
+    }
+
+    /* Calculate new header size */
+    backing_file_len = backing_fmt_len = 0;
+
+    if (backing_file) {
+        backing_file_len = strlen(backing_file);
+        if (backing_fmt) {
+            backing_fmt_len = strlen(backing_fmt);
+        }
+    }
+
+    buffer_len = sizeof(new_header);
+    new_header.backing_filename_offset = buffer_len;
+    new_header.backing_filename_size = backing_file_len;
+    buffer_len += backing_file_len;
+    new_header.backing_fmt_offset = buffer_len;
+    new_header.backing_fmt_size = backing_fmt_len;
+    buffer_len += backing_fmt_len;
+
+    /* Make sure we can rewrite header without failing */
+    if (buffer_len > new_header.header_size * new_header.cluster_size) {
+        return -ENOSPC;
+    }
+
+    /* Prepare new header */
+    buffer = qemu_malloc(buffer_len);
+
+    qed_header_cpu_to_le(&new_header, &le_header);
+    memcpy(buffer, &le_header, sizeof(le_header));
+    buffer_len = sizeof(le_header);
+
+    memcpy(buffer + buffer_len, backing_file, backing_file_len);
+    buffer_len += backing_file_len;
+
+    memcpy(buffer + buffer_len, backing_fmt, backing_fmt_len);
+    buffer_len += backing_fmt_len;
+
+    /* Write new header */
+    ret = bdrv_pwrite_sync(bs->file, 0, buffer, buffer_len);
+    qemu_free(buffer);
+    if (ret == 0) {
+        memcpy(&s->header, &new_header, sizeof(new_header));
+    }
+    return ret;
+}
+
+static int bdrv_qed_check(BlockDriverState *bs, BdrvCheckResult *result)
+{
+    return -ENOTSUP;
+}
+
+static QEMUOptionParameter qed_create_options[] = {
+    {
+        .name = BLOCK_OPT_SIZE,
+        .type = OPT_SIZE,
+        .help = "Virtual disk size (in bytes)"
+    }, {
+        .name = BLOCK_OPT_BACKING_FILE,
+        .type = OPT_STRING,
+        .help = "File name of a base image"
+    }, {
+        .name = BLOCK_OPT_BACKING_FMT,
+        .type = OPT_STRING,
+        .help = "Image format of the base image"
+    }, {
+        .name = BLOCK_OPT_CLUSTER_SIZE,
+        .type = OPT_SIZE,
+        .help = "Cluster size (in bytes)"
+    }, {
+        .name = "table_size",
+        .type = OPT_SIZE,
+        .help = "L1/L2 table size (in clusters)"
+    },
+    { /* end of list */ }
+};
+
+static BlockDriver bdrv_qed = {
+    .format_name = "qed",
+    .instance_size = sizeof(BDRVQEDState),
+    .create_options = qed_create_options,
+
+    .bdrv_probe = bdrv_qed_probe,
+    .bdrv_open = bdrv_qed_open,
+    .bdrv_close = bdrv_qed_close,
+    .bdrv_create = bdrv_qed_create,
+    .bdrv_flush = bdrv_qed_flush,
+    .bdrv_is_allocated = bdrv_qed_is_allocated,
+    .bdrv_make_empty = bdrv_qed_make_empty,
+    .bdrv_aio_readv = bdrv_qed_aio_readv,
+    .bdrv_aio_writev = bdrv_qed_aio_writev,
+    .bdrv_aio_flush = bdrv_qed_aio_flush,
+    .bdrv_truncate = bdrv_qed_truncate,
+    .bdrv_getlength = bdrv_qed_getlength,
+    .bdrv_get_info = bdrv_qed_get_info,
+    .bdrv_change_backing_file = bdrv_qed_change_backing_file,
+    .bdrv_check = bdrv_qed_check,
+};
+
+static void bdrv_qed_init(void)
+{
+    bdrv_register(&bdrv_qed);
+}
+
+block_init(bdrv_qed_init);
diff --git a/block/qed.h b/block/qed.h
new file mode 100644
index 0000000..7ce95a7
--- /dev/null
+++ b/block/qed.h
@@ -0,0 +1,148 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#ifndef BLOCK_QED_H
+#define BLOCK_QED_H
+
+#include "block_int.h"
+
+/* The layout of a QED file is as follows:
+ *
+ * +--------+----------+----------+----------+-----+
+ * | header | L1 table | cluster0 | cluster1 | ... |
+ * +--------+----------+----------+----------+-----+
+ *
+ * There is a 2-level pagetable for cluster allocation:
+ *
+ *                     +----------+
+ *                     | L1 table |
+ *                     +----------+
+ *                ,------'  |  '------.
+ *           +----------+   |    +----------+
+ *           | L2 table |  ...   | L2 table |
+ *           +----------+        +----------+
+ *       ,------'  |  '------.
+ *  +----------+   |    +----------+
+ *  |   Data   |  ...   |   Data   |
+ *  +----------+        +----------+
+ *
+ * The L1 table is fixed size and always present.  L2 tables are allocated on
+ * demand.  The L1 table size determines the maximum possible image size; it
+ * can be influenced using the cluster_size and table_size values.
+ *
+ * All fields are little-endian on disk.
+ */
+
+enum {
+    QED_MAGIC = 'Q' | 'E' << 8 | 'D' << 16 | '\0' << 24,
+
+    /* The image supports a backing file */
+    QED_F_BACKING_FILE = 0x01,
+
+    /* The image records the format of its backing file */
+    QED_CF_BACKING_FORMAT = 0x01,
+
+    /* Feature bits must be used when the on-disk format changes */
+    QED_FEATURE_MASK = QED_F_BACKING_FILE,            /* supported feature bits */
+    QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT,  /* supported compat feature bits */
+
+    /* Data is stored in groups of sectors called clusters.  Cluster size must
+     * be large to avoid keeping too much metadata.  I/O requests that have
+     * sub-cluster size will require read-modify-write.
+     */
+    QED_MIN_CLUSTER_SIZE = 4 * 1024, /* in bytes */
+    QED_MAX_CLUSTER_SIZE = 64 * 1024 * 1024,
+    QED_DEFAULT_CLUSTER_SIZE = 64 * 1024,
+
+    /* Allocated clusters are tracked using a 2-level pagetable.  Table size is
+     * a multiple of clusters so large maximum image sizes can be supported
+     * without jacking up the cluster size too much.
+     */
+    QED_MIN_TABLE_SIZE = 1,        /* in clusters */
+    QED_MAX_TABLE_SIZE = 16,
+    QED_DEFAULT_TABLE_SIZE = 4,
+};
+
+typedef struct {
+    uint32_t magic;                 /* QED\0 */
+
+    uint32_t cluster_size;          /* in bytes */
+    uint32_t table_size;            /* for L1 and L2 tables, in clusters */
+    uint32_t header_size;           /* in clusters */
+
+    uint64_t features;              /* format feature bits */
+    uint64_t compat_features;       /* compatible feature bits */
+    uint64_t l1_table_offset;       /* in bytes */
+    uint64_t image_size;            /* total logical image size, in bytes */
+
+    /* if (features & QED_F_BACKING_FILE) */
+    uint32_t backing_filename_offset; /* in bytes from start of header */
+    uint32_t backing_filename_size;   /* in bytes */
+
+    /* if (compat_features & QED_CF_BACKING_FORMAT) */
+    uint32_t backing_fmt_offset;    /* in bytes from start of header */
+    uint32_t backing_fmt_size;      /* in bytes */
+} QEDHeader;
+
+typedef struct {
+    BlockDriverState *bs;           /* device */
+    uint64_t file_size;             /* length of image file, in bytes */
+
+    QEDHeader header;               /* always cpu-endian */
+    uint32_t table_nelems;
+    uint32_t l1_shift;
+    uint32_t l2_shift;
+    uint32_t l2_mask;
+} BDRVQEDState;
+
+/**
+ * Utility functions
+ */
+static inline uint64_t qed_start_of_cluster(BDRVQEDState *s, uint64_t offset)
+{
+    return offset & ~(uint64_t)(s->header.cluster_size - 1);
+}
+
+/**
+ * Test if a cluster offset is valid
+ */
+static inline bool qed_check_cluster_offset(BDRVQEDState *s, uint64_t offset)
+{
+    uint64_t header_size = (uint64_t)s->header.header_size *
+                           s->header.cluster_size;
+
+    if (offset & (s->header.cluster_size - 1)) {
+        return false;
+    }
+    return offset >= header_size && offset < s->file_size;
+}
+
+/**
+ * Test if a table offset is valid
+ */
+static inline bool qed_check_table_offset(BDRVQEDState *s, uint64_t offset)
+{
+    uint64_t end_offset = offset + (s->header.table_size - 1) *
+                          s->header.cluster_size;
+
+    /* Overflow check; end_offset == offset is valid when table_size == 1 */
+    if (end_offset < offset) {
+        return false;
+    }
+
+    return qed_check_cluster_offset(s, offset) &&
+           qed_check_cluster_offset(s, end_offset);
+}
+
+#endif /* BLOCK_QED_H */
-- 
1.7.1
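
For a feel of the numbers behind the layout comment in block/qed.h ("The L1
table size determines the maximum possible image size"), here is a rough
standalone sketch of the arithmetic.  It is not part of the patch: the helper
names are made up for illustration, and it assumes every table is a packed
array of 64-bit offsets that is cluster_size * table_size bytes long.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; only the meaning of cluster_size (bytes) and
 * table_size (clusters) is taken from qed.h.
 */

/* Number of 64-bit offsets held by one L1 or L2 table */
static uint64_t table_nelems(uint32_t cluster_size, uint32_t table_size)
{
    assert(cluster_size != 0 && table_size != 0);
    return ((uint64_t)cluster_size * table_size) / sizeof(uint64_t);
}

/* Guest bytes mapped by a single, fully populated L2 table */
static uint64_t bytes_per_l2_table(uint32_t cluster_size, uint32_t table_size)
{
    return table_nelems(cluster_size, table_size) * cluster_size;
}

/* Largest image addressable through the fixed-size L1 table.
 * Note: this overflows 64 bits at the largest permitted settings.
 */
static uint64_t max_image_size(uint32_t cluster_size, uint32_t table_size)
{
    return table_nelems(cluster_size, table_size) *
           bytes_per_l2_table(cluster_size, table_size);
}
```

With the defaults (64 KB clusters, 4-cluster tables) each table holds 32768
offsets, one L2 table maps 2 GB, and the L1 table can address 64 TB.  With the
minimum settings (4 KB clusters, 1-cluster tables) the limit drops to 1 GB.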


* [Qemu-devel] [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
                   ` (3 preceding siblings ...)
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-12 14:44   ` [Qemu-devel] " Kevin Wolf
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 6/7] qed: Read/write support Stefan Hajnoczi
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

This patch adds code to look up data cluster offsets in the image via
the L1/L2 tables.  The L2 tables are kept in a write-through in-memory
cache for performance (every read/write requires a lookup, so caching
the tables is essential).

With cluster lookup code in place it is possible to implement
bdrv_is_allocated() to query the number of contiguous
allocated/unallocated clusters.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
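
The lookup described above splits a guest byte position into an L1 index, an
L2 index, and an offset into the cluster.  The sketch below shows that split
in isolation.  It is only an illustration: the Geometry struct and the helper
names are hypothetical stand-ins for the shift/mask fields in BDRVQEDState,
and the way the shifts are derived from cluster_size and table_size is an
assumption, since that code is not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the shift/mask fields of BDRVQEDState */
typedef struct {
    uint32_t cluster_size;   /* in bytes, power of two */
    uint32_t table_nelems;   /* offsets per table, power of two */
    uint32_t l1_shift;
    uint32_t l2_shift;
    uint32_t l2_mask;
} Geometry;

/* log2 of a power of two */
static uint32_t log2_pow2(uint32_t v)
{
    uint32_t shift = 0;

    assert(v != 0 && (v & (v - 1)) == 0);
    while ((v >>= 1) != 0) {
        shift++;
    }
    return shift;
}

/* Assumed derivation: one table entry is a 64-bit offset */
static Geometry geometry_init(uint32_t cluster_size, uint32_t table_size)
{
    Geometry g;

    g.cluster_size = cluster_size;
    g.table_nelems = (uint32_t)(((uint64_t)cluster_size * table_size) /
                                sizeof(uint64_t));
    g.l2_shift = log2_pow2(cluster_size);
    g.l2_mask = g.table_nelems - 1;
    g.l1_shift = g.l2_shift + log2_pow2(g.table_nelems);
    return g;
}

static unsigned int l1_index(const Geometry *g, uint64_t pos)
{
    return (unsigned int)(pos >> g->l1_shift);
}

static unsigned int l2_index(const Geometry *g, uint64_t pos)
{
    return (unsigned int)((pos >> g->l2_shift) & g->l2_mask);
}

static uint64_t offset_into_cluster(const Geometry *g, uint64_t pos)
{
    return pos & (g->cluster_size - 1);
}

/* Bytes from pos up to the end of the range covered by its L2 table */
static uint64_t bytes_to_l2_boundary(const Geometry *g, uint64_t pos)
{
    return (((pos >> g->l1_shift) + 1) << g->l1_shift) - pos;
}
```

With 64 KB clusters and 4-cluster tables this gives l2_shift = 16 and
l1_shift = 31, so each L1 entry covers 2 GB of guest address space, and a
request is clamped so that it never crosses into the next L2 table.
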
 Makefile.objs        |    2 +-
 block/qed-cluster.c  |  145 +++++++++++++++++++++++
 block/qed-gencb.c    |   32 +++++
 block/qed-l2-cache.c |  132 +++++++++++++++++++++
 block/qed-table.c    |  316 ++++++++++++++++++++++++++++++++++++++++++++++++++
 block/qed.c          |   57 +++++++++-
 block/qed.h          |  108 +++++++++++++++++
 trace-events         |    6 +
 8 files changed, 796 insertions(+), 2 deletions(-)
 create mode 100644 block/qed-cluster.c
 create mode 100644 block/qed-gencb.c
 create mode 100644 block/qed-l2-cache.c
 create mode 100644 block/qed-table.c
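
A note on block/qed-table.c: qed_write_table() does not rewrite a whole table
when a few entries change.  It widens the [index, index + n) range to sector
boundaries and writes only those sectors.  The rounding can be sketched on its
own as follows; the WriteRange struct and function name are made up for
illustration, and BDRV_SECTOR_SIZE is assumed to be 512.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SECTOR_SIZE = 512 };   /* assumed value of BDRV_SECTOR_SIZE */

/* Hypothetical result type for this sketch */
typedef struct {
    unsigned int start;    /* first table element written */
    unsigned int end;      /* one past the last element written */
    size_t byte_offset;    /* offset of the write within the table */
    size_t len_bytes;      /* length of the write */
} WriteRange;

/* Round the element range [index, index + n) out to whole sectors */
static WriteRange table_write_range(unsigned int index, unsigned int n)
{
    /* 64 offsets fit in one 512-byte sector, so the mask is 63 */
    const unsigned int sector_mask = SECTOR_SIZE / sizeof(uint64_t) - 1;
    WriteRange r;

    r.start = index & ~sector_mask;
    r.end = (index + n + sector_mask) & ~sector_mask;
    r.byte_offset = r.start * sizeof(uint64_t);
    r.len_bytes = (r.end - r.start) * sizeof(uint64_t);
    return r;
}
```

Updating the single entry at index 70 therefore writes elements 64..127, that
is, one 512-byte sector at byte offset 512 within the table.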

diff --git a/Makefile.objs b/Makefile.objs
index ff15795..7b3b19c 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
-block-nested-y += qed.o
+block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/qed-cluster.c b/block/qed-cluster.c
new file mode 100644
index 0000000..af65e5a
--- /dev/null
+++ b/block/qed-cluster.c
@@ -0,0 +1,145 @@
+/*
+ * QEMU Enhanced Disk Format Cluster functions
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/**
+ * Count the number of contiguous data clusters
+ *
+ * @s:              QED state
+ * @table:          L2 table
+ * @index:          First cluster index
+ * @n:              Maximum number of clusters
+ * @offset:         Set to first cluster offset
+ *
+ * This function scans tables for contiguous allocated or free clusters.
+ */
+static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
+                                                  QEDTable *table,
+                                                  unsigned int index,
+                                                  unsigned int n,
+                                                  uint64_t *offset)
+{
+    unsigned int end = MIN(index + n, s->table_nelems);
+    uint64_t last = table->offsets[index];
+    unsigned int i;
+
+    *offset = last;
+
+    for (i = index + 1; i < end; i++) {
+        if (last == 0) {
+            /* Counting free clusters */
+            if (table->offsets[i] != 0) {
+                break;
+            }
+        } else {
+            /* Counting allocated clusters */
+            if (table->offsets[i] != last + s->header.cluster_size) {
+                break;
+            }
+            last = table->offsets[i];
+        }
+    }
+    return i - index;
+}
+
+typedef struct {
+    BDRVQEDState *s;
+    uint64_t pos;
+    size_t len;
+
+    QEDRequest *request;
+
+    /* User callback */
+    QEDFindClusterFunc *cb;
+    void *opaque;
+} QEDFindClusterCB;
+
+static void qed_find_cluster_cb(void *opaque, int ret)
+{
+    QEDFindClusterCB *find_cluster_cb = opaque;
+    BDRVQEDState *s = find_cluster_cb->s;
+    QEDRequest *request = find_cluster_cb->request;
+    uint64_t offset = 0;
+    size_t len = 0;
+    unsigned int index;
+    unsigned int n;
+
+    if (ret) {
+        ret = QED_CLUSTER_ERROR;
+        goto out;
+    }
+
+    index = qed_l2_index(s, find_cluster_cb->pos);
+    n = qed_bytes_to_clusters(s,
+                              qed_offset_into_cluster(s, find_cluster_cb->pos) +
+                              find_cluster_cb->len);
+    n = qed_count_contiguous_clusters(s, request->l2_table->table,
+                                      index, n, &offset);
+
+    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
+    len = MIN(find_cluster_cb->len, (uint64_t)n * s->header.cluster_size -
+              qed_offset_into_cluster(s, find_cluster_cb->pos));
+
+    if (offset && !qed_check_cluster_offset(s, offset)) {
+        ret = QED_CLUSTER_ERROR;
+        goto out;
+    }
+
+out:
+    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
+    qemu_free(find_cluster_cb);
+}
+
+/**
+ * Find the offset of a data cluster
+ *
+ * @s:          QED state
+ * @pos:        Byte position in device
+ * @len:        Number of bytes
+ * @cb:         Completion function
+ * @opaque:     User data for completion function
+ */
+void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+                      size_t len, QEDFindClusterFunc *cb, void *opaque)
+{
+    QEDFindClusterCB *find_cluster_cb;
+    uint64_t l2_offset;
+
+    /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
+     * so that a request acts on one L2 table at a time.
+     */
+    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
+
+    l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
+    if (!l2_offset) {
+        cb(opaque, QED_CLUSTER_L1, 0, len);
+        return;
+    }
+    if (!qed_check_table_offset(s, l2_offset)) {
+        cb(opaque, QED_CLUSTER_ERROR, 0, 0);
+        return;
+    }
+
+    find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb));
+    find_cluster_cb->s = s;
+    find_cluster_cb->pos = pos;
+    find_cluster_cb->len = len;
+    find_cluster_cb->cb = cb;
+    find_cluster_cb->opaque = opaque;
+    find_cluster_cb->request = request;
+
+    qed_read_l2_table(s, request, l2_offset,
+                      qed_find_cluster_cb, find_cluster_cb);
+}
diff --git a/block/qed-gencb.c b/block/qed-gencb.c
new file mode 100644
index 0000000..d389e12
--- /dev/null
+++ b/block/qed-gencb.c
@@ -0,0 +1,32 @@
+/*
+ * QEMU Enhanced Disk Format
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque)
+{
+    GenericCB *gencb = qemu_malloc(len);
+    gencb->cb = cb;
+    gencb->opaque = opaque;
+    return gencb;
+}
+
+void gencb_complete(void *opaque, int ret)
+{
+    GenericCB *gencb = opaque;
+    BlockDriverCompletionFunc *cb = gencb->cb;
+    void *user_opaque = gencb->opaque;
+
+    qemu_free(gencb);
+    cb(user_opaque, ret);
+}
diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c
new file mode 100644
index 0000000..3b2bf6e
--- /dev/null
+++ b/block/qed-l2-cache.c
@@ -0,0 +1,132 @@
+/*
+ * QEMU Enhanced Disk Format L2 Cache
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qed.h"
+
+/* Each L2 holds 2GB so this lets us fully cache a 100GB disk */
+#define MAX_L2_CACHE_SIZE 50
+
+/**
+ * Initialize the L2 cache
+ */
+void qed_init_l2_cache(L2TableCache *l2_cache,
+                       L2TableAllocFunc *alloc_l2_table,
+                       void *alloc_l2_table_opaque)
+{
+    QTAILQ_INIT(&l2_cache->entries);
+    l2_cache->n_entries = 0;
+    l2_cache->alloc_l2_table = alloc_l2_table;
+    l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque;
+}
+
+/**
+ * Free the L2 cache
+ */
+void qed_free_l2_cache(L2TableCache *l2_cache)
+{
+    CachedL2Table *entry, *next_entry;
+
+    QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) {
+        qemu_vfree(entry->table);
+        qemu_free(entry);
+    }
+}
+
+/**
+ * Allocate an uninitialized entry from the cache
+ *
+ * The returned entry has a reference count of 1 and is owned by the caller.
+ */
+CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache)
+{
+    CachedL2Table *entry;
+
+    entry = qemu_mallocz(sizeof(*entry));
+    entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque);
+    entry->ref++;
+
+    return entry;
+}
+
+/**
+ * Decrease an entry's reference count and free the entry when the count
+ * drops to zero.
+ */
+void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry)
+{
+    if (!entry) {
+        return;
+    }
+
+    entry->ref--;
+    if (entry->ref == 0) {
+        qemu_vfree(entry->table);
+        qemu_free(entry);
+    }
+}
+
+/**
+ * Find an entry in the L2 cache.  This may return NULL and it's up to the
+ * caller to satisfy the cache miss.
+ *
+ * For a cached entry, this function increases the reference count and returns
+ * the entry.
+ */
+CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset)
+{
+    CachedL2Table *entry;
+
+    QTAILQ_FOREACH(entry, &l2_cache->entries, node) {
+        if (entry->offset == offset) {
+            entry->ref++;
+            return entry;
+        }
+    }
+    return NULL;
+}
+
+/**
+ * Commit an L2 cache entry into the cache.  This is meant to be used as part of
+ * satisfying a cache miss.  The caller allocates an entry that is not yet in
+ * the L2 cache and, once the entry is valid and present on disk, commits it
+ * into the cache.
+ *
+ * Since the cache is write-through, it's important that this function is not
+ * called until the entry is present on disk and the L1 has been updated to
+ * point to the entry.
+ *
+ * N.B. This function steals a reference to the l2_table from the caller so the
+ * caller must obtain a new reference by issuing a call to
+ * qed_find_l2_cache_entry().
+ */
+void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table)
+{
+    CachedL2Table *entry;
+
+    entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset);
+    if (entry) {
+        qed_unref_l2_cache_entry(l2_cache, entry);
+        qed_unref_l2_cache_entry(l2_cache, l2_table);
+        return;
+    }
+
+    if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) {
+        entry = QTAILQ_FIRST(&l2_cache->entries);
+        QTAILQ_REMOVE(&l2_cache->entries, entry, node);
+        l2_cache->n_entries--;
+        qed_unref_l2_cache_entry(l2_cache, entry);
+    }
+
+    l2_cache->n_entries++;
+    QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node);
+}
diff --git a/block/qed-table.c b/block/qed-table.c
new file mode 100644
index 0000000..ba6faf0
--- /dev/null
+++ b/block/qed-table.c
@@ -0,0 +1,316 @@
+/*
+ * QEMU Enhanced Disk Format Table I/O
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
+ *  Anthony Liguori   <aliguori@us.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "trace.h"
+#include "qemu_socket.h" /* for EINPROGRESS on Windows */
+#include "qed.h"
+
+typedef struct {
+    GenericCB gencb;
+    BDRVQEDState *s;
+    QEDTable *table;
+
+    struct iovec iov;
+    QEMUIOVector qiov;
+} QEDReadTableCB;
+
+static void qed_read_table_cb(void *opaque, int ret)
+{
+    QEDReadTableCB *read_table_cb = opaque;
+    QEDTable *table = read_table_cb->table;
+    int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t);
+    int i;
+
+    /* Handle I/O error */
+    if (ret) {
+        goto out;
+    }
+
+    /* Byteswap offsets */
+    for (i = 0; i < noffsets; i++) {
+        table->offsets[i] = le64_to_cpu(table->offsets[i]);
+    }
+
+out:
+    /* Completion */
+    trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
+    gencb_complete(&read_table_cb->gencb, ret);
+}
+
+static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+                           BlockDriverCompletionFunc *cb, void *opaque)
+{
+    QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb),
+                                                cb, opaque);
+    QEMUIOVector *qiov = &read_table_cb->qiov;
+    BlockDriverAIOCB *aiocb;
+
+    trace_qed_read_table(s, offset, table);
+
+    read_table_cb->s = s;
+    read_table_cb->table = table;
+    read_table_cb->iov.iov_base = table->offsets;
+    read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size;
+
+    qemu_iovec_init_external(qiov, &read_table_cb->iov, 1);
+    aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov,
+                           read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
+                           qed_read_table_cb, read_table_cb);
+    if (!aiocb) {
+        qed_read_table_cb(read_table_cb, -EIO);
+    }
+}
+
+typedef struct {
+    GenericCB gencb;
+    BDRVQEDState *s;
+    QEDTable *orig_table;
+    QEDTable *table;
+    bool flush;             /* flush after write? */
+
+    struct iovec iov;
+    QEMUIOVector qiov;
+} QEDWriteTableCB;
+
+static void qed_write_table_cb(void *opaque, int ret)
+{
+    QEDWriteTableCB *write_table_cb = opaque;
+
+    trace_qed_write_table_cb(write_table_cb->s,
+                              write_table_cb->orig_table, ret);
+
+    if (ret) {
+        goto out;
+    }
+
+    if (write_table_cb->flush) {
+        /* We still need to flush first */
+        write_table_cb->flush = false;
+        bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
+                       write_table_cb);
+        return;
+    }
+
+out:
+    qemu_vfree(write_table_cb->table);
+    gencb_complete(&write_table_cb->gencb, ret);
+    return;
+}
+
+/**
+ * Write out an updated part or all of a table
+ *
+ * @s:          QED state
+ * @offset:     Offset of table in image file, in bytes
+ * @table:      Table
+ * @index:      Index of first element
+ * @n:          Number of elements
+ * @flush:      Whether or not to sync to disk
+ * @cb:         Completion function
+ * @opaque:     Argument for completion function
+ */
+static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
+                            unsigned int index, unsigned int n, bool flush,
+                            BlockDriverCompletionFunc *cb, void *opaque)
+{
+    QEDWriteTableCB *write_table_cb;
+    BlockDriverAIOCB *aiocb;
+    unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
+    unsigned int start, end, i;
+    size_t len_bytes;
+
+    trace_qed_write_table(s, offset, table, index, n);
+
+    /* Calculate indices of the first and one after last elements */
+    start = index & ~sector_mask;
+    end = (index + n + sector_mask) & ~sector_mask;
+
+    len_bytes = (end - start) * sizeof(uint64_t);
+
+    write_table_cb = gencb_alloc(sizeof(*write_table_cb), cb, opaque);
+    write_table_cb->s = s;
+    write_table_cb->orig_table = table;
+    write_table_cb->flush = flush;
+    write_table_cb->table = qemu_blockalign(s->bs, len_bytes);
+    write_table_cb->iov.iov_base = write_table_cb->table->offsets;
+    write_table_cb->iov.iov_len = len_bytes;
+    qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1);
+
+    /* Byteswap table */
+    for (i = start; i < end; i++) {
+        uint64_t le_offset = cpu_to_le64(table->offsets[i]);
+        write_table_cb->table->offsets[i - start] = le_offset;
+    }
+
+    /* Adjust for offset into table */
+    offset += start * sizeof(uint64_t);
+
+    aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+                            &write_table_cb->qiov,
+                            write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
+                            qed_write_table_cb, write_table_cb);
+    if (!aiocb) {
+        qed_write_table_cb(write_table_cb, -EIO);
+    }
+}
+
+/**
+ * Propagate return value from async callback
+ */
+static void qed_sync_cb(void *opaque, int ret)
+{
+    *(int *)opaque = ret;
+}
+
+int qed_read_l1_table_sync(BDRVQEDState *s)
+{
+    int ret = -EINPROGRESS;
+
+    async_context_push();
+
+    qed_read_table(s, s->header.l1_table_offset,
+                   s->l1_table, qed_sync_cb, &ret);
+    while (ret == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
+
+    async_context_pop();
+
+    return ret;
+}
+
+void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
+                        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
+    qed_write_table(s, s->header.l1_table_offset,
+                    s->l1_table, index, n, false, cb, opaque);
+}
+
+int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
+                            unsigned int n)
+{
+    int ret = -EINPROGRESS;
+
+    async_context_push();
+
+    qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
+    while (ret == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
+
+    async_context_pop();
+
+    return ret;
+}
+
+typedef struct {
+    GenericCB gencb;
+    BDRVQEDState *s;
+    uint64_t l2_offset;
+    QEDRequest *request;
+} QEDReadL2TableCB;
+
+static void qed_read_l2_table_cb(void *opaque, int ret)
+{
+    QEDReadL2TableCB *read_l2_table_cb = opaque;
+    QEDRequest *request = read_l2_table_cb->request;
+    BDRVQEDState *s = read_l2_table_cb->s;
+    CachedL2Table *l2_table = request->l2_table;
+
+    if (ret) {
+        /* can't trust loaded L2 table anymore */
+        qed_unref_l2_cache_entry(&s->l2_cache, l2_table);
+        request->l2_table = NULL;
+    } else {
+        l2_table->offset = read_l2_table_cb->l2_offset;
+
+        qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
+
+        /* This is guaranteed to succeed because we just committed the entry
+         * to the cache.
+         */
+        request->l2_table = qed_find_l2_cache_entry(&s->l2_cache,
+                                                    l2_table->offset);
+        assert(request->l2_table != NULL);
+    }
+
+    gencb_complete(&read_l2_table_cb->gencb, ret);
+}
+
+void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+                       BlockDriverCompletionFunc *cb, void *opaque)
+{
+    QEDReadL2TableCB *read_l2_table_cb;
+
+    qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
+
+    /* Check for cached L2 entry */
+    request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
+    if (request->l2_table) {
+        cb(opaque, 0);
+        return;
+    }
+
+    request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+
+    read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque);
+    read_l2_table_cb->s = s;
+    read_l2_table_cb->l2_offset = offset;
+    read_l2_table_cb->request = request;
+
+    BLKDBG_EVENT(s->bs->file, BLKDBG_L2_LOAD);
+    qed_read_table(s, offset, request->l2_table->table,
+                   qed_read_l2_table_cb, read_l2_table_cb);
+}
+
+int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
+{
+    int ret = -EINPROGRESS;
+
+    async_context_push();
+
+    qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
+    while (ret == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
+
+    async_context_pop();
+    return ret;
+}
+
+void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+                        unsigned int index, unsigned int n, bool flush,
+                        BlockDriverCompletionFunc *cb, void *opaque)
+{
+    BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
+    qed_write_table(s, request->l2_table->offset,
+                    request->l2_table->table, index, n, flush, cb, opaque);
+}
+
+int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
+                            unsigned int index, unsigned int n, bool flush)
+{
+    int ret = -EINPROGRESS;
+
+    async_context_push();
+
+    qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
+    while (ret == -EINPROGRESS) {
+        qemu_aio_wait();
+    }
+
+    async_context_pop();
+    return ret;
+}
diff --git a/block/qed.c b/block/qed.c
index ea03798..6d7f4d7 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -139,6 +139,15 @@ static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
     return 0;
 }
 
+static QEDTable *qed_alloc_table(void *opaque)
+{
+    BDRVQEDState *s = opaque;
+
+    /* Honor O_DIRECT memory alignment requirements */
+    return qemu_blockalign(s->bs,
+                           s->header.cluster_size * s->header.table_size);
+}
+
 static int bdrv_qed_open(BlockDriverState *bs, int flags)
 {
     BDRVQEDState *s = bs->opaque;
@@ -207,11 +216,24 @@ static int bdrv_qed_open(BlockDriverState *bs, int flags)
             }
         }
     }
+
+    s->l1_table = qed_alloc_table(s);
+    qed_init_l2_cache(&s->l2_cache, qed_alloc_table, s);
+
+    ret = qed_read_l1_table_sync(s);
+    if (ret) {
+        qed_free_l2_cache(&s->l2_cache);
+        qemu_vfree(s->l1_table);
+    }
     return ret;
 }
 
 static void bdrv_qed_close(BlockDriverState *bs)
 {
+    BDRVQEDState *s = bs->opaque;
+
+    qed_free_l2_cache(&s->l2_cache);
+    qemu_vfree(s->l1_table);
 }
 
 static void bdrv_qed_flush(BlockDriverState *bs)
@@ -336,10 +358,43 @@ static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
                       backing_file, backing_fmt);
 }
 
+typedef struct {
+    int is_allocated;
+    int *pnum;
+} QEDIsAllocatedCB;
+
+static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t len)
+{
+    QEDIsAllocatedCB *cb = opaque;
+    *cb->pnum = len / BDRV_SECTOR_SIZE;
+    cb->is_allocated = ret == QED_CLUSTER_FOUND;
+}
+
 static int bdrv_qed_is_allocated(BlockDriverState *bs, int64_t sector_num,
                                   int nb_sectors, int *pnum)
 {
-    return -ENOTSUP;
+    BDRVQEDState *s = bs->opaque;
+    uint64_t pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
+    size_t len = (size_t)nb_sectors * BDRV_SECTOR_SIZE;
+    QEDIsAllocatedCB cb = {
+        .is_allocated = -1,
+        .pnum = pnum,
+    };
+    QEDRequest request = { .l2_table = NULL };
+
+    async_context_push();
+
+    qed_find_cluster(s, &request, pos, len, qed_is_allocated_cb, &cb);
+
+    while (cb.is_allocated == -1) {
+        qemu_aio_wait();
+    }
+
+    async_context_pop();
+
+    qed_unref_l2_cache_entry(&s->l2_cache, request.l2_table);
+
+    return cb.is_allocated;
 }
 
 static int bdrv_qed_make_empty(BlockDriverState *bs)
diff --git a/block/qed.h b/block/qed.h
index 7ce95a7..9ea288f 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -96,16 +96,103 @@ typedef struct {
 } QEDHeader;
 
 typedef struct {
+    uint64_t offsets[0];            /* in bytes */
+} QEDTable;
+
+/* The L2 cache is a simple write-through cache for L2 structures */
+typedef struct CachedL2Table {
+    QEDTable *table;
+    uint64_t offset;    /* offset=0 indicates an invalid entry */
+    QTAILQ_ENTRY(CachedL2Table) node;
+    int ref;
+} CachedL2Table;
+
+/**
+ * Allocate an L2 table
+ *
+ * This callback is used by the L2 cache to allocate tables without knowing
+ * their size or alignment requirements.
+ */
+typedef QEDTable *L2TableAllocFunc(void *opaque);
+
+typedef struct {
+    QTAILQ_HEAD(, CachedL2Table) entries;
+    unsigned int n_entries;
+    L2TableAllocFunc *alloc_l2_table;
+    void *alloc_l2_table_opaque;
+} L2TableCache;
+
+typedef struct QEDRequest {
+    CachedL2Table *l2_table;
+} QEDRequest;
+
+typedef struct {
     BlockDriverState *bs;           /* device */
     uint64_t file_size;             /* length of image file, in bytes */
 
     QEDHeader header;               /* always cpu-endian */
+    QEDTable *l1_table;
+    L2TableCache l2_cache;          /* l2 table cache */
     uint32_t table_nelems;
     uint32_t l1_shift;
     uint32_t l2_shift;
     uint32_t l2_mask;
 } BDRVQEDState;
 
+enum {
+    QED_CLUSTER_FOUND,         /* cluster found */
+    QED_CLUSTER_L2,            /* cluster missing in L2 */
+    QED_CLUSTER_L1,            /* cluster missing in L1 */
+    QED_CLUSTER_ERROR,         /* error looking up cluster */
+};
+
+typedef void QEDFindClusterFunc(void *opaque, int ret, uint64_t offset, size_t len);
+
+/**
+ * Generic callback for chaining async callbacks
+ */
+typedef struct {
+    BlockDriverCompletionFunc *cb;
+    void *opaque;
+} GenericCB;
+
+void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque);
+void gencb_complete(void *opaque, int ret);
+
+/**
+ * L2 cache functions
+ */
+void qed_init_l2_cache(L2TableCache *l2_cache, L2TableAllocFunc *alloc_l2_table, void *alloc_l2_table_opaque);
+void qed_free_l2_cache(L2TableCache *l2_cache);
+CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache);
+void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry);
+CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset);
+void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
+
+/**
+ * Table I/O functions
+ */
+int qed_read_l1_table_sync(BDRVQEDState *s);
+void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
+                        BlockDriverCompletionFunc *cb, void *opaque);
+int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
+                            unsigned int n);
+int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
+                           uint64_t offset);
+void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
+                       BlockDriverCompletionFunc *cb, void *opaque);
+void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
+                        unsigned int index, unsigned int n, bool flush,
+                        BlockDriverCompletionFunc *cb, void *opaque);
+int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
+                            unsigned int index, unsigned int n, bool flush);
+
+/**
+ * Cluster functions
+ */
+void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
+                      size_t len, QEDFindClusterFunc *cb, void *opaque);
+
 /**
  * Utility functions
  */
@@ -114,6 +201,27 @@ static inline uint64_t qed_start_of_cluster(BDRVQEDState *s, uint64_t offset)
     return offset & ~(uint64_t)(s->header.cluster_size - 1);
 }
 
+static inline uint64_t qed_offset_into_cluster(BDRVQEDState *s, uint64_t offset)
+{
+    return offset & (s->header.cluster_size - 1);
+}
+
+static inline unsigned int qed_bytes_to_clusters(BDRVQEDState *s, size_t bytes)
+{
+    return qed_start_of_cluster(s, bytes + (s->header.cluster_size - 1)) /
+           s->header.cluster_size;
+}
+
+static inline unsigned int qed_l1_index(BDRVQEDState *s, uint64_t pos)
+{
+    return pos >> s->l1_shift;
+}
+
+static inline unsigned int qed_l2_index(BDRVQEDState *s, uint64_t pos)
+{
+    return (pos >> s->l2_shift) & s->l2_mask;
+}
+
 /**
  * Test if a cluster offset is valid
  */
diff --git a/trace-events b/trace-events
index f32c83f..a390196 100644
--- a/trace-events
+++ b/trace-events
@@ -67,3 +67,9 @@ disable cpu_out(unsigned int addr, unsigned int val) "addr %#x value %u"
 # balloon.c
 # Since requests are raised via monitor, not many tracepoints are needed.
 disable balloon_event(void *opaque, unsigned long addr) "opaque %p addr %lu"
+
+# block/qed-table.c
+disable qed_read_table(void *s, uint64_t offset, void *table) "s %p offset %"PRIu64" table %p"
+disable qed_read_table_cb(void *s, void *table, int ret) "s %p table %p ret %d"
+disable qed_write_table(void *s, uint64_t offset, void *table, unsigned int index, unsigned int n) "s %p offset %"PRIu64" table %p index %u n %u"
+disable qed_write_table_cb(void *s, void *table, int ret) "s %p table %p ret %d"
-- 
1.7.1


* [Qemu-devel] [PATCH v2 6/7] qed: Read/write support
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
                   ` (4 preceding siblings ...)
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-10  9:10   ` [Qemu-devel] " Avi Kivity
  2010-10-12 15:08   ` Kevin Wolf
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 7/7] qed: Consistency check support Stefan Hajnoczi
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

This patch implements the read/write state machine.  Operations are
fully asynchronous and multiple operations may be active at any time.

Allocating writes lock tables so that metadata updates do not interfere
with each other.  Two allocating writes that need to update the same L2
table run sequentially.  Two allocating writes that need to update
different L2 tables run in parallel.
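
The queueing behaviour described above can be illustrated with a small
standalone sketch.  It is not the code from block/qed-lock.c: the names
KeyLock, LockEntry, Request, key_lock_*() and record_wakeup() are
hypothetical, and plain linked lists stand in for the QSIMPLEQ/QTAILQ
macros.  It assumes a request waits on at most one key at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct LockEntry LockEntry;

/* Hypothetical stand-in for a request; only the fields the lock needs. */
typedef struct Request {
    int id;
    struct Request *next;   /* next waiter on the same key */
    LockEntry *entry;       /* entry this request holds or waits on */
} Request;

struct LockEntry {
    uint64_t key;           /* e.g. the image file offset of a table */
    Request *head, *tail;   /* FIFO of requests; head is the holder */
    LockEntry *next;
};

typedef void WakeupFn(Request *req);

typedef struct {
    LockEntry *entries;     /* one entry per key that is currently locked */
    WakeupFn *wakeup;       /* called when a waiter becomes the holder */
} KeyLock;

/* Example wakeup callback: just remembers who was woken last. */
static int last_woken_id = -1;
static void record_wakeup(Request *req)
{
    last_woken_id = req->id;
}

static void key_lock_init(KeyLock *lock, WakeupFn *wakeup)
{
    lock->entries = NULL;
    lock->wakeup = wakeup;
}

/* Returns true if req now holds the lock, false if it was queued. */
static bool key_lock_acquire(KeyLock *lock, uint64_t key, Request *req)
{
    LockEntry *e;
    bool acquired;

    if (req->entry) {
        /* Retry after wakeup: req holds the lock once it is the head. */
        assert(req->entry->key == key);
        return req->entry->head == req;
    }

    for (e = lock->entries; e; e = e->next) {
        if (e->key == key) {
            break;
        }
    }

    acquired = (e == NULL);     /* nobody holds this key yet */
    if (!e) {
        e = calloc(1, sizeof(*e));
        if (!e) {
            abort();
        }
        e->key = key;
        e->next = lock->entries;
        lock->entries = e;
    }

    /* Append to the FIFO; the first request in the queue is the holder. */
    req->next = NULL;
    req->entry = e;
    if (e->tail) {
        e->tail->next = req;
    } else {
        e->head = req;
    }
    e->tail = req;
    return acquired;
}

static void key_lock_release(KeyLock *lock, Request *req)
{
    LockEntry *e = req->entry;
    LockEntry **p;

    if (!e) {
        return;
    }
    assert(e->head == req);     /* only the holder may release */
    req->entry = NULL;
    e->head = req->next;
    req->next = NULL;

    if (e->head) {
        lock->wakeup(e->head);  /* next waiter becomes the holder */
        return;
    }

    /* No waiters left: unlink and free the entry. */
    for (p = &lock->entries; *p != e; p = &(*p)->next) {
    }
    *p = e->next;
    free(e);
}
```

In this sketch two requests that use the same key are serialized in
arrival order, while a request that uses a different key acquires its
lock immediately.  A woken request is expected to call
key_lock_acquire() again before it continues.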

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 Makefile.objs    |    1 +
 block/qed-lock.c |  124 ++++++++++++
 block/qed.c      |  593 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 block/qed.h      |   43 ++++
 trace-events     |   10 +
 5 files changed, 769 insertions(+), 2 deletions(-)
 create mode 100644 block/qed-lock.c

diff --git a/Makefile.objs b/Makefile.objs
index 7b3b19c..24d734f 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,6 +15,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
+block-nested-y += qed-lock.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/qed-lock.c b/block/qed-lock.c
new file mode 100644
index 0000000..bd91729
--- /dev/null
+++ b/block/qed-lock.c
@@ -0,0 +1,124 @@
+/*
+ * QEMU Enhanced Disk Format Lock
+ *
+ * Copyright IBM, Corp. 2010
+ *
+ * Authors:
+ *  Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+/*
+ * Table locking works as follows:
+ *
+ * Reads and non-allocating writes do not acquire locks because they do not
+ * modify tables and only see committed L2 cache entries.
+ *
+ * An allocating write request that needs to update an existing L2 table
+ * acquires a lock on the table.  This serializes requests that touch the same
+ * L2 table.
+ *
+ * An allocating write request that needs to create a new L2 table and update
+ * the L1 table acquires a lock on the L1 table.  This serializes requests that
+ * create new L2 tables.
+ *
+ * When a request is unable to acquire a lock, it is put to sleep and must
+ * return.  When the lock it attempted to acquire becomes available, a wakeup
+ * function is invoked to activate it again.
+ *
+ * A request must retry its cluster lookup after waking up because the tables
+ * may have changed.  For example, an allocating write may no longer need to
+ * allocate if the previous request already allocated the cluster.
+ */
+
+#include "qed.h"
+
+struct QEDLockEntry {
+    uint64_t key;
+    QSIMPLEQ_HEAD(, QEDAIOCB) reqs;
+    QTAILQ_ENTRY(QEDLockEntry) next;
+};
+
+/**
+ * Initialize a lock
+ *
+ * @lock:           Lock
+ * @wakeup_fn:      Callback to reactivate a sleeping request
+ */
+void qed_lock_init(QEDLock *lock, BlockDriverCompletionFunc *wakeup_fn)
+{
+    QTAILQ_INIT(&lock->entries);
+    lock->wakeup_fn = wakeup_fn;
+}
+
+/**
+ * Acquire a lock on a given key
+ *
+ * @lock:           Lock
+ * @key:            Key to lock on
+ * @acb:            Request
+ * @ret:            true if lock was acquired, false if request needs to sleep
+ *
+ * If the request currently has another lock held, that lock will be released.
+ */
+bool qed_lock(QEDLock *lock, uint64_t key, QEDAIOCB *acb)
+{
+    QEDLockEntry *entry = acb->lock_entry;
+
+    if (entry) {
+        /* Lock already held */
+        if (entry->key == key) {
+            return true;
+        }
+
+        /* Release old lock */
+        qed_unlock(lock, acb);
+    }
+
+    /* Find held lock */
+    QTAILQ_FOREACH(entry, &lock->entries, next) {
+        if (entry->key == key) {
+            QSIMPLEQ_INSERT_TAIL(&entry->reqs, acb, next);
+            acb->lock_entry = entry;
+            return false;
+        }
+    }
+
+    /* Allocate new lock entry */
+    entry = qemu_malloc(sizeof(*entry));
+    entry->key = key;
+    QSIMPLEQ_INIT(&entry->reqs);
+    QSIMPLEQ_INSERT_TAIL(&entry->reqs, acb, next);
+    QTAILQ_INSERT_TAIL(&lock->entries, entry, next);
+    acb->lock_entry = entry;
+    return true;
+}
+
+/**
+ * Release a held lock
+ */
+void qed_unlock(QEDLock *lock, QEDAIOCB *acb)
+{
+    QEDLockEntry *entry = acb->lock_entry;
+
+    if (!entry) {
+        return;
+    }
+
+    acb->lock_entry = NULL;
+    QSIMPLEQ_REMOVE_HEAD(&entry->reqs, next);
+
+    /* Wake up next lock holder */
+    acb = QSIMPLEQ_FIRST(&entry->reqs);
+    if (acb) {
+        lock->wakeup_fn(acb, 0);
+        return;
+    }
+
+    /* Free lock entry */
+    QTAILQ_REMOVE(&lock->entries, entry, next);
+    qemu_free(entry);
+}
diff --git a/block/qed.c b/block/qed.c
index 6d7f4d7..4fded31 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -12,8 +12,26 @@
  *
  */
 
+#include "trace.h"
 #include "qed.h"
 
+static void qed_aio_cancel(BlockDriverAIOCB *blockacb)
+{
+    QEDAIOCB *acb = (QEDAIOCB *)blockacb;
+    bool finished = false;
+
+    /* Wait for the request to finish */
+    acb->finished = &finished;
+    while (!finished) {
+        qemu_aio_wait();
+    }
+}
+
+static AIOPool qed_aio_pool = {
+    .aiocb_size         = sizeof(QEDAIOCB),
+    .cancel             = qed_aio_cancel,
+};
+
 static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
                           const char *filename)
 {
@@ -139,6 +157,20 @@ static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
     return 0;
 }
 
+/**
+ * Allocate new clusters
+ *
+ * @s:          QED state
+ * @n:          Number of contiguous clusters to allocate
+ * @offset:     Offset of first allocated cluster, filled in on success
+ */
+static int qed_alloc_clusters(BDRVQEDState *s, unsigned int n, uint64_t *offset)
+{
+    *offset = s->file_size;
+    s->file_size += n * s->header.cluster_size;
+    return 0;
+}
+
 static QEDTable *qed_alloc_table(void *opaque)
 {
     BDRVQEDState *s = opaque;
@@ -148,6 +180,30 @@ static QEDTable *qed_alloc_table(void *opaque)
                            s->header.cluster_size * s->header.table_size);
 }
 
+/**
+ * Allocate a new zeroed L2 table
+ */
+static CachedL2Table *qed_new_l2_table(BDRVQEDState *s)
+{
+    uint64_t offset;
+    int ret;
+    CachedL2Table *l2_table;
+
+    ret = qed_alloc_clusters(s, s->header.table_size, &offset);
+    if (ret) {
+        return NULL;
+    }
+
+    l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
+    l2_table->offset = offset;
+
+    memset(l2_table->table->offsets, 0,
+           s->header.cluster_size * s->header.table_size);
+    return l2_table;
+}
+
+static void qed_aio_next_io(void *opaque, int ret);
+
 static int bdrv_qed_open(BlockDriverState *bs, int flags)
 {
     BDRVQEDState *s = bs->opaque;
@@ -156,6 +212,7 @@ static int bdrv_qed_open(BlockDriverState *bs, int flags)
     int ret;
 
     s->bs = bs;
+    qed_lock_init(&s->lock, qed_aio_next_io);
 
     ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
     if (ret != sizeof(le_header)) {
@@ -402,13 +459,545 @@ static int bdrv_qed_make_empty(BlockDriverState *bs)
     return -ENOTSUP;
 }
 
+static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
+{
+    return acb->common.bs->opaque;
+}
+
+typedef struct {
+    GenericCB gencb;
+    BDRVQEDState *s;
+    QEMUIOVector qiov;
+    struct iovec iov;
+    uint64_t offset;
+} CopyFromBackingFileCB;
+
+static void qed_copy_from_backing_file_cb(void *opaque, int ret)
+{
+    CopyFromBackingFileCB *copy_cb = opaque;
+    qemu_vfree(copy_cb->iov.iov_base);
+    gencb_complete(&copy_cb->gencb, ret);
+}
+
+static void qed_copy_from_backing_file_write(void *opaque, int ret)
+{
+    CopyFromBackingFileCB *copy_cb = opaque;
+    BDRVQEDState *s = copy_cb->s;
+    BlockDriverAIOCB *aiocb;
+
+    if (ret) {
+        qed_copy_from_backing_file_cb(copy_cb, ret);
+        return;
+    }
+
+    BLKDBG_EVENT(s->bs->file, BLKDBG_COW_WRITE);
+    aiocb = bdrv_aio_writev(s->bs->file, copy_cb->offset / BDRV_SECTOR_SIZE,
+                            &copy_cb->qiov,
+                            copy_cb->qiov.size / BDRV_SECTOR_SIZE,
+                            qed_copy_from_backing_file_cb, copy_cb);
+    if (!aiocb) {
+        qed_copy_from_backing_file_cb(copy_cb, -EIO);
+    }
+}
+
+/**
+ * Copy data from backing file into the image
+ *
+ * @s:          QED state
+ * @pos:        Byte position in device
+ * @len:        Number of bytes
+ * @offset:     Byte offset in image file
+ * @cb:         Completion function
+ * @opaque:     User data for completion function
+ */
+static void qed_copy_from_backing_file(BDRVQEDState *s, uint64_t pos,
+                                       uint64_t len, uint64_t offset,
+                                       BlockDriverCompletionFunc *cb,
+                                       void *opaque)
+{
+    CopyFromBackingFileCB *copy_cb;
+    BlockDriverAIOCB *aiocb;
+
+    /* Skip copy entirely if there is no work to do */
+    if (len == 0) {
+        cb(opaque, 0);
+        return;
+    }
+
+    copy_cb = gencb_alloc(sizeof(*copy_cb), cb, opaque);
+    copy_cb->s = s;
+    copy_cb->offset = offset;
+    copy_cb->iov.iov_base = qemu_blockalign(s->bs, len);
+    copy_cb->iov.iov_len = len;
+    qemu_iovec_init_external(&copy_cb->qiov, &copy_cb->iov, 1);
+
+    /* Zero sectors if there is no backing file */
+    if (!s->bs->backing_hd) {
+        /* Note that it is possible to skip writing zeroes for prefill if the
+         * cluster is not yet allocated and the file guarantees new space is
+         * zeroed.  Don't take this shortcut for now since it also forces us to
+         * handle the special case of rounding file size down on open, which
+         * can be solved by also doing a truncate to free any extra data at the
+         * end of the file.
+         */
+        memset(copy_cb->iov.iov_base, 0, len);
+        qed_copy_from_backing_file_write(copy_cb, 0);
+        return;
+    }
+
+    BLKDBG_EVENT(s->bs->file, BLKDBG_READ_BACKING);
+    aiocb = bdrv_aio_readv(s->bs->backing_hd, pos / BDRV_SECTOR_SIZE,
+                           &copy_cb->qiov, len / BDRV_SECTOR_SIZE,
+                           qed_copy_from_backing_file_write, copy_cb);
+    if (!aiocb) {
+        qed_copy_from_backing_file_cb(copy_cb, -EIO);
+    }
+}
+
+/**
+ * Link one or more contiguous clusters into a table
+ *
+ * @s:              QED state
+ * @table:          L2 table
+ * @index:          First cluster index
+ * @n:              Number of contiguous clusters
+ * @cluster:        First cluster byte offset in image file
+ */
+static void qed_update_l2_table(BDRVQEDState *s, QEDTable *table, int index,
+                                unsigned int n, uint64_t cluster)
+{
+    int i;
+    for (i = index; i < index + n; i++) {
+        table->offsets[i] = cluster;
+        cluster += s->header.cluster_size;
+    }
+}
+
+static void qed_aio_complete_bh(void *opaque)
+{
+    QEDAIOCB *acb = opaque;
+    BlockDriverCompletionFunc *cb = acb->common.cb;
+    void *user_opaque = acb->common.opaque;
+    int ret = acb->bh_ret;
+    bool *finished = acb->finished;
+
+    qemu_bh_delete(acb->bh);
+    qemu_aio_release(acb);
+
+    /* Invoke callback */
+    cb(user_opaque, ret);
+
+    /* Signal cancel completion */
+    if (finished) {
+        *finished = true;
+    }
+}
+
+static void qed_aio_complete(QEDAIOCB *acb, int ret)
+{
+    BDRVQEDState *s = acb_to_s(acb);
+
+    trace_qed_aio_complete(s, acb, ret);
+
+    /* Free resources */
+    qemu_iovec_destroy(&acb->cur_qiov);
+    qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+    qed_unlock(&s->lock, acb);
+
+    /* Arrange for a bh to invoke the completion function */
+    acb->bh_ret = ret;
+    acb->bh = qemu_bh_new(qed_aio_complete_bh, acb);
+    qemu_bh_schedule(acb->bh);
+}
+
+/**
+ * Construct an iovec array for a given length
+ *
+ * @acb:        I/O request
+ * @len:        Maximum number of bytes
+ *
+ * This function can be called several times to build subset iovec arrays of
+ * acb->qiov.  For example:
+ *
+ *   acb->qiov->iov[] = {{0x100000, 1024},
+ *                       {0x200000, 1024}}
+ *
+ *   qed_acb_build_qiov(acb, 512) =>
+ *                      {{0x100000, 512}}
+ *
+ *   qed_acb_build_qiov(acb, 1024) =>
+ *                      {{0x100200, 512},
+ *                       {0x200000, 512}}
+ *
+ *   qed_acb_build_qiov(acb, 512) =>
+ *                      {{0x200200, 512}}
+ */
+static void qed_acb_build_qiov(QEDAIOCB *acb, size_t len)
+{
+    struct iovec *iov_end = &acb->qiov->iov[acb->qiov->niov];
+    size_t iov_offset = acb->cur_iov_offset;
+    struct iovec *iov = acb->cur_iov;
+
+    while (iov != iov_end && len > 0) {
+        size_t nbytes = MIN(iov->iov_len - iov_offset, len);
+
+        qemu_iovec_add(&acb->cur_qiov, iov->iov_base + iov_offset, nbytes);
+        iov_offset += nbytes;
+        len -= nbytes;
+
+        if (iov_offset >= iov->iov_len) {
+            iov_offset = 0;
+            iov++;
+        }
+    }
+
+    /* Stash state for next time */
+    acb->cur_iov = iov;
+    acb->cur_iov_offset = iov_offset;
+}
+
+/**
+ * Commit the current L2 table to the cache
+ */
+static void qed_commit_l2_update(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    CachedL2Table *l2_table = acb->request.l2_table;
+
+    qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
+
+    /* This is guaranteed to succeed because we just committed the entry to the
+     * cache.
+     */
+    acb->request.l2_table = qed_find_l2_cache_entry(&s->l2_cache,
+                                                    l2_table->offset);
+    assert(acb->request.l2_table != NULL);
+
+    qed_aio_next_io(opaque, ret);
+}
+
+/**
+ * Update L1 table with new L2 table offset and write it out
+ */
+static void qed_aio_write_l1_update(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    int index;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    index = qed_l1_index(s, acb->cur_pos);
+    s->l1_table->offsets[index] = acb->request.l2_table->offset;
+
+    qed_write_l1_table(s, index, 1, qed_commit_l2_update, acb);
+}
+
+/**
+ * Update L2 table with new cluster offsets and write them out
+ */
+static void qed_aio_write_l2_update(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    bool need_alloc = acb->find_cluster_ret == QED_CLUSTER_L1;
+    int index;
+
+    if (ret) {
+        goto err;
+    }
+
+    if (need_alloc) {
+        qed_unref_l2_cache_entry(&s->l2_cache, acb->request.l2_table);
+        acb->request.l2_table = qed_new_l2_table(s);
+        if (!acb->request.l2_table) {
+            ret = -EIO;
+            goto err;
+        }
+    }
+
+    index = qed_l2_index(s, acb->cur_pos);
+    qed_update_l2_table(s, acb->request.l2_table->table, index, acb->cur_nclusters,
+                         acb->cur_cluster);
+
+    if (need_alloc) {
+        /* Write out the whole new L2 table */
+        qed_write_l2_table(s, &acb->request, 0, s->table_nelems, true,
+                            qed_aio_write_l1_update, acb);
+    } else {
+        /* Write out only the updated part of the L2 table */
+        qed_write_l2_table(s, &acb->request, index, acb->cur_nclusters, false,
+                            qed_aio_next_io, acb);
+    }
+    return;
+
+err:
+    qed_aio_complete(acb, ret);
+}
+
+/**
+ * Write data to the image file
+ */
+static void qed_aio_write_main(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    bool need_alloc = acb->find_cluster_ret != QED_CLUSTER_FOUND;
+    uint64_t offset = acb->cur_cluster;
+    BlockDriverAIOCB *file_acb;
+
+    trace_qed_aio_write_main(s, acb, ret, offset, acb->cur_qiov.size);
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    offset += qed_offset_into_cluster(s, acb->cur_pos);
+    BLKDBG_EVENT(s->bs->file, BLKDBG_WRITE_AIO);
+    file_acb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
+                               &acb->cur_qiov,
+                               acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+                               need_alloc ? qed_aio_write_l2_update :
+                                            qed_aio_next_io,
+                               acb);
+    if (!file_acb) {
+        qed_aio_complete(acb, -EIO);
+    }
+}
+
+/**
+ * Populate back untouched region of new data cluster
+ */
+static void qed_aio_write_postfill(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    uint64_t start = acb->cur_pos + acb->cur_qiov.size;
+    uint64_t len = qed_start_of_cluster(s, start + s->header.cluster_size - 1) - start;
+    uint64_t offset = acb->cur_cluster + qed_offset_into_cluster(s, acb->cur_pos) + acb->cur_qiov.size;
+
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    trace_qed_aio_write_postfill(s, acb, start, len, offset);
+    qed_copy_from_backing_file(s, start, len, offset,
+                                qed_aio_write_main, acb);
+}
+
+/**
+ * Populate front untouched region of new data cluster
+ */
+static void qed_aio_write_prefill(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    uint64_t start = qed_start_of_cluster(s, acb->cur_pos);
+    uint64_t len = qed_offset_into_cluster(s, acb->cur_pos);
+
+    trace_qed_aio_write_prefill(s, acb, start, len, acb->cur_cluster);
+    qed_copy_from_backing_file(s, start, len, acb->cur_cluster,
+                                qed_aio_write_postfill, acb);
+}
+
+/**
+ * Write data cluster
+ *
+ * @opaque:     Write request
+ * @ret:        QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1,
+ *              or QED_CLUSTER_ERROR
+ * @offset:     Cluster offset in bytes
+ * @len:        Length in bytes
+ *
+ * Callback from qed_find_cluster().
+ */
+static void qed_aio_write_data(void *opaque, int ret,
+                               uint64_t offset, size_t len)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    bool need_alloc = ret != QED_CLUSTER_FOUND;
+
+    trace_qed_aio_write_data(s, acb, ret, offset, len);
+
+    if (ret == QED_CLUSTER_ERROR) {
+        goto err;
+    }
+
+    if (need_alloc) {
+        /* If a new L2 table is being allocated, lock the L1 table.  Otherwise
+         * just lock the L2 table.
+         */
+        uint64_t lock_key = ret == QED_CLUSTER_L1 ?
+                            s->header.l1_table_offset :
+                            acb->request.l2_table->offset;
+
+        if (!qed_lock(&s->lock, lock_key, acb)) {
+            return; /* sleep until woken up again */
+        }
+    } else {
+        /* If we're still holding a lock, release it */
+        qed_unlock(&s->lock, acb);
+    }
+
+    acb->cur_nclusters = qed_bytes_to_clusters(s,
+                             qed_offset_into_cluster(s, acb->cur_pos) + len);
+
+    if (need_alloc) {
+        if (qed_alloc_clusters(s, acb->cur_nclusters, &offset) != 0) {
+            goto err;
+        }
+    }
+
+    acb->find_cluster_ret = ret;
+    acb->cur_cluster = offset;
+    qed_acb_build_qiov(acb, len);
+
+    /* Write data in place if the cluster already exists */
+    if (!need_alloc) {
+        qed_aio_write_main(acb, 0);
+        return;
+    }
+
+    /* Write new cluster */
+    qed_aio_write_prefill(acb, 0);
+    return;
+
+err:
+    qed_aio_complete(acb, -EIO);
+}
+
+/**
+ * Read data cluster
+ *
+ * @opaque:     Read request
+ * @ret:        QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1,
+ *              or QED_CLUSTER_ERROR
+ * @offset:     Cluster offset in bytes
+ * @len:        Length in bytes
+ *
+ * Callback from qed_find_cluster().
+ */
+static void qed_aio_read_data(void *opaque, int ret,
+                              uint64_t offset, size_t len)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    BlockDriverState *bs = acb->common.bs;
+    BlockDriverState *file = bs->file;
+    BlockDriverAIOCB *file_acb;
+
+    trace_qed_aio_read_data(s, acb, ret, offset, len);
+
+    if (ret == QED_CLUSTER_ERROR) {
+        goto err;
+    }
+
+    qed_acb_build_qiov(acb, len);
+
+    /* Adjust offset into cluster */
+    offset += qed_offset_into_cluster(s, acb->cur_pos);
+
+    /* Handle backing file and unallocated sparse hole reads */
+    if (ret != QED_CLUSTER_FOUND) {
+        if (!bs->backing_hd) {
+            qemu_iovec_memset(&acb->cur_qiov, 0, acb->cur_qiov.size);
+            qed_aio_next_io(acb, 0);
+            return;
+        }
+
+        /* Pass through read to backing file */
+        offset = acb->cur_pos;
+        file = bs->backing_hd;
+    }
+
+    BLKDBG_EVENT(bs->file, BLKDBG_READ_AIO);
+    file_acb = bdrv_aio_readv(file, offset / BDRV_SECTOR_SIZE,
+                              &acb->cur_qiov,
+                              acb->cur_qiov.size / BDRV_SECTOR_SIZE,
+                              qed_aio_next_io, acb);
+    if (!file_acb) {
+        goto err;
+    }
+    return;
+
+err:
+    qed_aio_complete(acb, -EIO);
+}
+
+/**
+ * Begin next I/O or complete the request
+ */
+static void qed_aio_next_io(void *opaque, int ret)
+{
+    QEDAIOCB *acb = opaque;
+    BDRVQEDState *s = acb_to_s(acb);
+    QEDFindClusterFunc *io_fn =
+        acb->is_write ? qed_aio_write_data : qed_aio_read_data;
+
+    trace_qed_aio_next_io(s, acb, ret, acb->cur_pos + acb->cur_qiov.size);
+
+    /* Handle I/O error */
+    if (ret) {
+        qed_aio_complete(acb, ret);
+        return;
+    }
+
+    acb->cur_pos += acb->cur_qiov.size;
+    qemu_iovec_reset(&acb->cur_qiov);
+
+    /* Complete request */
+    if (acb->cur_pos >= acb->end_pos) {
+        qed_aio_complete(acb, 0);
+        return;
+    }
+
+    /* Find next cluster and start I/O */
+    qed_find_cluster(s, &acb->request,
+                      acb->cur_pos, acb->end_pos - acb->cur_pos,
+                      io_fn, acb);
+}
+
+static BlockDriverAIOCB *qed_aio_setup(BlockDriverState *bs,
+                                       int64_t sector_num,
+                                       QEMUIOVector *qiov, int nb_sectors,
+                                       BlockDriverCompletionFunc *cb,
+                                       void *opaque, bool is_write)
+{
+    QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
+
+    trace_qed_aio_setup(bs->opaque, acb, sector_num, nb_sectors,
+                         opaque, is_write);
+
+    acb->is_write = is_write;
+    acb->finished = NULL;
+    acb->qiov = qiov;
+    acb->cur_iov = acb->qiov->iov;
+    acb->cur_iov_offset = 0;
+    acb->cur_pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE;
+    acb->end_pos = acb->cur_pos + nb_sectors * BDRV_SECTOR_SIZE;
+    acb->request.l2_table = NULL;
+    acb->lock_entry = NULL;
+    qemu_iovec_init(&acb->cur_qiov, qiov->niov);
+
+    /* Start request */
+    qed_aio_next_io(acb, 0);
+    return &acb->common;
+}
+
 static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
                                             int64_t sector_num,
                                             QEMUIOVector *qiov, int nb_sectors,
                                             BlockDriverCompletionFunc *cb,
                                             void *opaque)
 {
-    return NULL;
+    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, false);
 }
 
 static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
@@ -417,7 +1006,7 @@ static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
                                              BlockDriverCompletionFunc *cb,
                                              void *opaque)
 {
-    return NULL;
+    return qed_aio_setup(bs, sector_num, qiov, nb_sectors, cb, opaque, true);
 }
 
 static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
diff --git a/block/qed.h b/block/qed.h
index 9ea288f..91c91f4 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -126,6 +126,41 @@ typedef struct QEDRequest {
     CachedL2Table *l2_table;
 } QEDRequest;
 
+typedef struct QEDLockEntry QEDLockEntry;
+
+typedef struct QEDAIOCB {
+    BlockDriverAIOCB common;
+    QEMUBH *bh;
+    int bh_ret;                     /* final return status for completion bh */
+    QSIMPLEQ_ENTRY(QEDAIOCB) next;  /* next request */
+    QEDLockEntry *lock_entry;       /* held lock */
+    bool is_write;                  /* false - read, true - write */
+    bool *finished;                 /* signal for cancel completion */
+    uint64_t end_pos;               /* request end on block device, in bytes */
+
+    /* User scatter-gather list */
+    QEMUIOVector *qiov;
+    struct iovec *cur_iov;          /* current iovec to process */
+    size_t cur_iov_offset;          /* byte count already processed in iovec */
+
+    /* Current cluster scatter-gather list */
+    QEMUIOVector cur_qiov;
+    uint64_t cur_pos;               /* position on block device, in bytes */
+    uint64_t cur_cluster;           /* cluster offset in image file */
+    unsigned int cur_nclusters;     /* number of clusters being accessed */
+    int find_cluster_ret;           /* used for L1/L2 update */
+
+    QEDRequest request;
+} QEDAIOCB;
+
+/**
+ * Lock used to serialize requests touching the same table
+ */
+typedef struct {
+    QTAILQ_HEAD(, QEDLockEntry) entries;
+    BlockDriverCompletionFunc *wakeup_fn;
+} QEDLock;
+
 typedef struct {
     BlockDriverState *bs;           /* device */
     uint64_t file_size;             /* length of image file, in bytes */
@@ -133,6 +168,7 @@ typedef struct {
     QEDHeader header;               /* always cpu-endian */
     QEDTable *l1_table;
     L2TableCache l2_cache;          /* l2 table cache */
+    QEDLock lock;                   /* table lock */
     uint32_t table_nelems;
     uint32_t l1_shift;
     uint32_t l2_shift;
@@ -170,6 +206,13 @@ CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset);
 void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table);
 
 /**
+ * Lock functions
+ */
+void qed_lock_init(QEDLock *lock, BlockDriverCompletionFunc *wakeup_fn);
+bool qed_lock(QEDLock *lock, uint64_t key, QEDAIOCB *acb);
+void qed_unlock(QEDLock *lock, QEDAIOCB *acb);
+
+/**
  * Table I/O functions
  */
 int qed_read_l1_table_sync(BDRVQEDState *s);
diff --git a/trace-events b/trace-events
index a390196..86a1a75 100644
--- a/trace-events
+++ b/trace-events
@@ -73,3 +73,13 @@ disable qed_read_table(void *s, uint64_t offset, void *table) "s %p offset %"PRI
 disable qed_read_table_cb(void *s, void *table, int ret) "s %p table %p ret %d"
 disable qed_write_table(void *s, uint64_t offset, void *table, unsigned int index, unsigned int n) "s %p offset %"PRIu64" table %p index %u n %u"
 disable qed_write_table_cb(void *s, void *table, int ret) "s %p table %p ret %d"
+
+# block/qed.c
+disable qed_aio_complete(void *s, void *acb, int ret) "s %p acb %p ret %d"
+disable qed_aio_setup(void *s, void *acb, int64_t sector_num, int nb_sectors, void *opaque, int is_write) "s %p acb %p sector_num %"PRId64" nb_sectors %d opaque %p is_write %d"
+disable qed_aio_next_io(void *s, void *acb, int ret, uint64_t cur_pos) "s %p acb %p ret %d cur_pos %"PRIu64""
+disable qed_aio_read_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+disable qed_aio_write_data(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
+disable qed_aio_write_prefill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64""
+disable qed_aio_write_postfill(void *s, void *acb, uint64_t start, size_t len, uint64_t offset) "s %p acb %p start %"PRIu64" len %zu offset %"PRIu64""
+disable qed_aio_write_main(void *s, void *acb, int ret, uint64_t offset, size_t len) "s %p acb %p ret %d offset %"PRIu64" len %zu"
-- 
1.7.1


* [Qemu-devel] [PATCH v2 7/7] qed: Consistency check support
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
                   ` (5 preceding siblings ...)
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 6/7] qed: Read/write support Stefan Hajnoczi
@ 2010-10-08 15:48 ` Stefan Hajnoczi
  2010-10-11 13:21 ` [Qemu-devel] Re: [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Kevin Wolf
  2010-10-16  7:51 ` [Qemu-devel] " Stefan Hajnoczi
  8 siblings, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-08 15:48 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

This patch adds support for the qemu-img check command.  It also
introduces a dirty bit in the qed header to mark modified images as
needing a check.  This bit is cleared when the image file is closed
cleanly.

If an image file is opened and it has the dirty bit set, a consistency
check will run and try to fix corrupted table offsets.  These
corruptions may occur if there is power loss while an allocating write
is performed.  Once the image is fixed it opens as normal again.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
 Makefile.objs     |    2 +-
 block/qed-check.c |  197 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/qed.c       |  138 ++++++++++++++++++++++++++++++++++++-
 block/qed.h       |   11 +++-
 4 files changed, 343 insertions(+), 5 deletions(-)
 create mode 100644 block/qed-check.c

diff --git a/Makefile.objs b/Makefile.objs
index 24d734f..bc6fd38 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -15,7 +15,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
-block-nested-y += qed-lock.o
+block-nested-y += qed-lock.o qed-check.o
 block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
diff --git a/block/qed-check.c b/block/qed-check.c
new file mode 100644
index 0000000..1c2ad81
--- /dev/null
+++ b/block/qed-check.c
@@ -0,0 +1,197 @@
+#include "qed.h"
+
+typedef struct {
+    BDRVQEDState *s;
+    BdrvCheckResult *result;
+    bool fix;                           /* whether to fix invalid offsets */
+
+    size_t nclusters;
+    uint32_t *used_clusters;            /* referenced cluster bitmap */
+
+    QEDRequest request;
+} QEDCheck;
+
+static bool qed_test_bit(uint32_t *bitmap, uint64_t n) {
+    return !!(bitmap[n / 32] & (1U << (n % 32)));
+}
+
+static void qed_set_bit(uint32_t *bitmap, uint64_t n) {
+    bitmap[n / 32] |= 1U << (n % 32);
+}
+
+/**
+ * Set bitmap bits for clusters
+ *
+ * @check:          Check structure
+ * @offset:         Starting offset in bytes
+ * @n:              Number of clusters
+ */
+static bool qed_set_used_clusters(QEDCheck *check, uint64_t offset,
+                                  unsigned int n)
+{
+    uint64_t cluster = qed_bytes_to_clusters(check->s, offset);
+    unsigned int corruptions = 0;
+
+    while (n-- != 0) {
+        /* Clusters should only be referenced once */
+        if (qed_test_bit(check->used_clusters, cluster)) {
+            corruptions++;
+        }
+
+        qed_set_bit(check->used_clusters, cluster);
+        cluster++;
+    }
+
+    check->result->corruptions += corruptions;
+    return corruptions == 0;
+}
+
+/**
+ * Check an L2 table
+ *
+ * @ret:            Number of invalid cluster offsets
+ */
+static unsigned int qed_check_l2_table(QEDCheck *check, QEDTable *table)
+{
+    BDRVQEDState *s = check->s;
+    unsigned int i, num_invalid = 0;
+
+    for (i = 0; i < s->table_nelems; i++) {
+        uint64_t offset = table->offsets[i];
+
+        if (!offset) {
+            continue;
+        }
+
+        /* Detect invalid cluster offset */
+        if (!qed_check_cluster_offset(s, offset)) {
+            if (check->fix) {
+                table->offsets[i] = 0;
+            } else {
+                check->result->corruptions++;
+            }
+
+            num_invalid++;
+            continue;
+        }
+
+        qed_set_used_clusters(check, offset, 1);
+    }
+
+    return num_invalid;
+}
+
+/**
+ * Descend tables and check each cluster is referenced once only
+ */
+static int qed_check_l1_table(QEDCheck *check, QEDTable *table)
+{
+    BDRVQEDState *s = check->s;
+    unsigned int i, num_invalid_l1 = 0;
+    int ret, last_error = 0;
+
+    /* Mark L1 table clusters used */
+    qed_set_used_clusters(check, s->header.l1_table_offset,
+                          s->header.table_size);
+
+    for (i = 0; i < s->table_nelems; i++) {
+        unsigned int num_invalid_l2;
+        uint64_t offset = table->offsets[i];
+
+        if (!offset) {
+            continue;
+        }
+
+        /* Detect invalid L2 offset */
+        if (!qed_check_table_offset(s, offset)) {
+            /* Clear invalid offset */
+            if (check->fix) {
+                table->offsets[i] = 0;
+            } else {
+                check->result->corruptions++;
+            }
+
+            num_invalid_l1++;
+            continue;
+        }
+
+        if (!qed_set_used_clusters(check, offset, s->header.table_size)) {
+            continue; /* skip an invalid table */
+        }
+
+        ret = qed_read_l2_table_sync(s, &check->request, offset);
+        if (ret) {
+            check->result->check_errors++;
+            last_error = ret;
+            continue;
+        }
+
+        num_invalid_l2 = qed_check_l2_table(check,
+                                            check->request.l2_table->table);
+
+        /* Write out fixed L2 table */
+        if (num_invalid_l2 > 0 && check->fix) {
+            ret = qed_write_l2_table_sync(s, &check->request, 0,
+                                          s->table_nelems, false);
+            if (ret) {
+                check->result->check_errors++;
+                last_error = ret;
+                continue;
+            }
+        }
+    }
+
+    /* Drop reference to final table */
+    qed_unref_l2_cache_entry(&s->l2_cache, check->request.l2_table);
+    check->request.l2_table = NULL;
+
+    /* Write out fixed L1 table */
+    if (num_invalid_l1 > 0 && check->fix) {
+        ret = qed_write_l1_table_sync(s, 0, s->table_nelems);
+        if (ret) {
+            check->result->check_errors++;
+            last_error = ret;
+        }
+    }
+
+    return last_error;
+}
+
+/**
+ * Check for unreferenced (leaked) clusters
+ */
+static void qed_check_for_leaks(QEDCheck *check)
+{
+    BDRVQEDState *s = check->s;
+    size_t i;
+
+    for (i = s->header.header_size; i < check->nclusters; i++) {
+        if (!qed_test_bit(check->used_clusters, i)) {
+            check->result->leaks++;
+        }
+    }
+}
+
+int qed_check(BDRVQEDState *s, BdrvCheckResult *result, bool fix)
+{
+    QEDCheck check = {
+        .s = s,
+        .result = result,
+        .nclusters = qed_bytes_to_clusters(s, s->file_size),
+        .request = { .l2_table = NULL },
+        .fix = fix,
+    };
+    int ret;
+
+    check.used_clusters = qemu_mallocz(((check.nclusters + 31) / 32) *
+                                       sizeof(check.used_clusters[0]));
+
+    ret = qed_check_l1_table(&check, s->l1_table);
+    if (ret == 0) {
+        /* Only check for leaks if entire image was scanned successfully */
+        qed_check_for_leaks(&check);
+    }
+
+    qemu_free(check.used_clusters);
+    return ret;
+}
diff --git a/block/qed.c b/block/qed.c
index 4fded31..5632887 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -78,6 +78,94 @@ static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le)
     le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size);
 }
 
+static int qed_write_header_sync(BDRVQEDState *s)
+{
+    QEDHeader le;
+    int ret;
+
+    qed_header_cpu_to_le(&s->header, &le);
+    ret = bdrv_pwrite(s->bs->file, 0, &le, sizeof(le));
+    if (ret != sizeof(le)) {
+        return ret;
+    }
+    return 0;
+}
+
+typedef struct {
+    GenericCB gencb;
+    BDRVQEDState *s;
+    struct iovec iov;
+    QEMUIOVector qiov;
+    int nsectors;
+    uint8_t *buf;
+} QEDWriteHeaderCB;
+
+static void qed_write_header_cb(void *opaque, int ret)
+{
+    QEDWriteHeaderCB *write_header_cb = opaque;
+
+    qemu_vfree(write_header_cb->buf);
+    gencb_complete(write_header_cb, ret);
+}
+
+static void qed_write_header_read_cb(void *opaque, int ret)
+{
+    QEDWriteHeaderCB *write_header_cb = opaque;
+    BDRVQEDState *s = write_header_cb->s;
+    BlockDriverAIOCB *acb;
+
+    if (ret) {
+        qed_write_header_cb(write_header_cb, ret);
+        return;
+    }
+
+    /* Update header */
+    qed_header_cpu_to_le(&s->header, (QEDHeader *)write_header_cb->buf);
+
+    acb = bdrv_aio_writev(s->bs->file, 0, &write_header_cb->qiov,
+                          write_header_cb->nsectors, qed_write_header_cb,
+                          write_header_cb);
+    if (!acb) {
+        qed_write_header_cb(write_header_cb, -EIO);
+    }
+}
+
+/**
+ * Update header in-place (does not rewrite backing filename or other strings)
+ *
+ * This function only updates known header fields in-place and does not affect
+ * extra data after the QED header.
+ */
+static void qed_write_header(BDRVQEDState *s, BlockDriverCompletionFunc cb,
+                             void *opaque)
+{
+    /* We must write full sectors for O_DIRECT but cannot necessarily generate
+     * the data following the header if an unrecognized compat feature is
+     * active.  Therefore, first read the sectors containing the header, update
+     * them, and write back.
+     */
+
+    BlockDriverAIOCB *acb;
+    int nsectors = (sizeof(QEDHeader) + BDRV_SECTOR_SIZE - 1) /
+                   BDRV_SECTOR_SIZE;
+    size_t len = nsectors * BDRV_SECTOR_SIZE;
+    QEDWriteHeaderCB *write_header_cb = gencb_alloc(sizeof(*write_header_cb),
+                                                    cb, opaque);
+
+    write_header_cb->s = s;
+    write_header_cb->nsectors = nsectors;
+    write_header_cb->buf = qemu_blockalign(s->bs, len);
+    write_header_cb->iov.iov_base = write_header_cb->buf;
+    write_header_cb->iov.iov_len = len;
+    qemu_iovec_init_external(&write_header_cb->qiov, &write_header_cb->iov, 1);
+
+    acb = bdrv_aio_readv(s->bs->file, 0, &write_header_cb->qiov, nsectors,
+                         qed_write_header_read_cb, write_header_cb);
+    if (!acb) {
+        qed_write_header_cb(write_header_cb, -EIO);
+    }
+}
+
 static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
 {
     uint64_t table_entries;
@@ -279,6 +367,32 @@ static int bdrv_qed_open(BlockDriverState *bs, int flags)
 
     ret = qed_read_l1_table_sync(s);
     if (ret) {
+        goto out;
+    }
+
+    /* If image was not closed cleanly, check consistency */
+    if (s->header.features & QED_F_NEED_CHECK) {
+        /* Read-only images cannot be fixed.  There is no risk of corruption
+         * since write operations are not possible.  Therefore, allow
+         * potentially inconsistent images to be opened read-only.  This can
+         * aid data recovery from an otherwise inconsistent image.
+         */
+        if (!bdrv_is_read_only(bs->file)) {
+            BdrvCheckResult result = {0};
+
+            ret = qed_check(s, &result, true);
+            if (!ret && !result.corruptions && !result.check_errors) {
+                /* Ensure fixes reach storage before clearing check bit */
+                bdrv_flush(s->bs);
+
+                s->header.features &= ~QED_F_NEED_CHECK;
+                qed_write_header_sync(s);
+            }
+        }
+    }
+
+out:
+    if (ret) {
         qed_free_l2_cache(&s->l2_cache);
         qemu_vfree(s->l1_table);
     }
@@ -289,6 +403,15 @@ static void bdrv_qed_close(BlockDriverState *bs)
 {
     BDRVQEDState *s = bs->opaque;
 
+    /* Ensure writes reach stable storage */
+    bdrv_flush(bs->file);
+
+    /* Clean shutdown, no check required on next open */
+    if (s->header.features & QED_F_NEED_CHECK) {
+        s->header.features &= ~QED_F_NEED_CHECK;
+        qed_write_header_sync(s);
+    }
+
     qed_free_l2_cache(&s->l2_cache);
     qemu_vfree(s->l1_table);
 }
@@ -865,8 +988,15 @@ static void qed_aio_write_data(void *opaque, int ret,
         return;
     }
 
-    /* Write new cluster */
-    qed_aio_write_prefill(acb, 0);
+    /* Write new cluster if the image is already marked dirty */
+    if (s->header.features & QED_F_NEED_CHECK) {
+        qed_aio_write_prefill(acb, 0);
+        return;
+    }
+
+    /* Mark the image dirty before writing the new cluster */
+    s->header.features |= QED_F_NEED_CHECK;
+    qed_write_header(s, qed_aio_write_prefill, acb);
     return;
 
 err:
@@ -1116,7 +1246,9 @@ static int bdrv_qed_change_backing_file(BlockDriverState *bs,
 
 static int bdrv_qed_check(BlockDriverState *bs, BdrvCheckResult *result)
 {
-    return -ENOTSUP;
+    BDRVQEDState *s = bs->opaque;
+
+    return qed_check(s, result, false);
 }
 
 static QEMUOptionParameter qed_create_options[] = {
diff --git a/block/qed.h b/block/qed.h
index 91c91f4..a26bcde 100644
--- a/block/qed.h
+++ b/block/qed.h
@@ -50,11 +50,15 @@ enum {
     /* The image supports a backing file */
     QED_F_BACKING_FILE = 0x01,
 
+    /* The image needs a consistency check before use */
+    QED_F_NEED_CHECK = 0x02,
+
     /* The image has the backing file format */
     QED_CF_BACKING_FORMAT = 0x01,
 
     /* Feature bits must be used when the on-disk format changes */
-    QED_FEATURE_MASK = QED_F_BACKING_FILE,            /* supported feature bits */
+    QED_FEATURE_MASK = QED_F_BACKING_FILE |           /* supported feature bits */
+                       QED_F_NEED_CHECK,
     QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT,  /* supported compat feature bits */
 
     /* Data is stored in groups of sectors called clusters.  Cluster size must
@@ -237,6 +241,11 @@ void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
                       size_t len, QEDFindClusterFunc *cb, void *opaque);
 
 /**
+ * Consistency check
+ */
+int qed_check(BDRVQEDState *s, BdrvCheckResult *result, bool fix);
+
+/**
  * Utility functions
  */
 static inline uint64_t qed_start_of_cluster(BDRVQEDState *s, uint64_t offset)
-- 
1.7.1
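
The bitmap bookkeeping in qed-check.c can be looked at in isolation.
The following is a minimal standalone sketch, not the QEMU code: the
names ClusterCheck, cluster_check_init(), mark_used() and count_leaks()
are hypothetical, cluster indices are passed in directly instead of
being derived from byte offsets, and the caller is assumed to have
validated the range first (the patch does this with
qed_check_cluster_offset() and qed_check_table_offset()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEDCheck: one bit per cluster in the file. */
typedef struct {
    size_t nclusters;
    uint32_t *used;             /* referenced-cluster bitmap */
    unsigned int corruptions;   /* clusters referenced more than once */
    unsigned int leaks;         /* clusters never referenced */
} ClusterCheck;

static bool sketch_test_bit(const uint32_t *bitmap, uint64_t n)
{
    return (bitmap[n / 32] >> (n % 32)) & 1u;
}

static void sketch_set_bit(uint32_t *bitmap, uint64_t n)
{
    /* Unsigned constant so that setting bit 31 is well defined. */
    bitmap[n / 32] |= 1u << (n % 32);
}

/* Returns 0 on success, -1 if the bitmap cannot be allocated. */
int cluster_check_init(ClusterCheck *c, size_t nclusters)
{
    c->nclusters = nclusters;
    c->corruptions = 0;
    c->leaks = 0;
    c->used = calloc((nclusters + 31) / 32, sizeof(c->used[0]));
    return c->used ? 0 : -1;
}

/*
 * Mark n clusters starting at 'first'.  Every cluster that was already
 * marked counts as one corruption.  Returns true if none were marked.
 */
bool mark_used(ClusterCheck *c, uint64_t first, unsigned int n)
{
    unsigned int dup = 0;

    assert(first + n <= c->nclusters);
    while (n-- != 0) {
        if (sketch_test_bit(c->used, first)) {
            dup++;
        }
        sketch_set_bit(c->used, first);
        first++;
    }
    c->corruptions += dup;
    return dup == 0;
}

/* Count clusters at or after first_data_cluster that nothing refers to. */
void count_leaks(ClusterCheck *c, size_t first_data_cluster)
{
    size_t i;

    for (i = first_data_cluster; i < c->nclusters; i++) {
        if (!sketch_test_bit(c->used, i)) {
            c->leaks++;
        }
    }
}

void cluster_check_free(ClusterCheck *c)
{
    free(c->used);
    c->used = NULL;
}
```

As in the patch, leak counting only makes sense after every table has
been walked successfully; otherwise clusters behind an unreadable L2
table would be reported as leaked.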


* [Qemu-devel] Re: [PATCH v2 1/7] qcow2: Make get_bits_from_size() common
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 1/7] qcow2: Make get_bits_from_size() common Stefan Hajnoczi
@ 2010-10-08 18:01   ` Anthony Liguori
  0 siblings, 0 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-08 18:01 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Avi Kivity, qemu-devel, Christoph Hellwig

On 10/08/2010 10:48 AM, Stefan Hajnoczi wrote:
> The get_bits_from_size() calculates the log base-2 of a number.  This is
> useful in bit manipulation code working with power-of-2s.
>
> Currently used by qcow2 and needed by qed in a follow-on patch.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>    

Acked-by: Anthony Liguori <aliguori@us.ibm.com>

Regards,

Anthony Liguori

> ---
>   block/qcow2.c |   22 ----------------------
>   cutils.c      |   18 ++++++++++++++++++
>   qemu-common.h |    1 +
>   3 files changed, 19 insertions(+), 22 deletions(-)
>
> diff --git a/block/qcow2.c b/block/qcow2.c
> index ee3481b..6e25812 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -794,28 +794,6 @@ static int qcow2_change_backing_file(BlockDriverState *bs,
>       return qcow2_update_ext_header(bs, backing_file, backing_fmt);
>   }
>
> -static int get_bits_from_size(size_t size)
> -{
> -    int res = 0;
> -
> -    if (size == 0) {
> -        return -1;
> -    }
> -
> -    while (size != 1) {
> -        /* Not a power of two */
> -        if (size & 1) {
> -            return -1;
> -        }
> -
> -        size >>= 1;
> -        res++;
> -    }
> -
> -    return res;
> -}
> -
> -
>   static int preallocate(BlockDriverState *bs)
>   {
>       uint64_t nb_sectors;
> diff --git a/cutils.c b/cutils.c
> index 5883737..6c32198 100644
> --- a/cutils.c
> +++ b/cutils.c
> @@ -283,3 +283,21 @@ int fcntl_setfl(int fd, int flag)
>   }
>   #endif
>
> +/**
> + * Get the number of bits for a power of 2
> + *
> + * The following is true for powers of 2:
> + *   n == 1 << get_bits_from_size(n)
> + */
> +int get_bits_from_size(size_t size)
> +{
> +    if (size == 0 || (size & (size - 1))) {
> +        return -1;
> +    }
> +
> +#if defined(_WIN32) && defined(__x86_64__)
> +    return __builtin_ctzll(size);
> +#else
> +    return __builtin_ctzl(size);
> +#endif
> +}
> diff --git a/qemu-common.h b/qemu-common.h
> index 81aafa0..e0ca398 100644
> --- a/qemu-common.h
> +++ b/qemu-common.h
> @@ -153,6 +153,7 @@ time_t mktimegm(struct tm *tm);
>   int qemu_fls(int i);
>   int qemu_fdatasync(int fd);
>   int fcntl_setfl(int fd, int flag);
> +int get_bits_from_size(size_t size);
>
>   /* path.c */
>   void init_paths(const char *prefix);
>    
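
For readers who want to try the helper outside the tree, here is a
minimal portable sketch of the same contract: log2 of an exact power of
two, -1 for anything else.  It is an illustration only.  The name
bits_from_size_sketch() is hypothetical, and a shift loop stands in for
the compiler builtins used in the patch so that it builds with any C
compiler.

```c
#include <stddef.h>

/*
 * Returns log2(size) when size is an exact power of two, -1 otherwise.
 * Hypothetical stand-in for get_bits_from_size(); not the QEMU function.
 */
int bits_from_size_sketch(size_t size)
{
    int bits = 0;

    /* Zero has no set bit; more than one set bit is not a power of two. */
    if (size == 0 || (size & (size - 1)) != 0) {
        return -1;
    }

    /* Exactly one bit is set, so count how far it must be shifted down. */
    while (size > 1) {
        size >>= 1;
        bits++;
    }
    return bits;
}
```

A cluster size of 65536 yields 16, so dividing a byte offset by the
cluster size can be written as offset >> 16.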


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 6/7] qed: Read/write support Stefan Hajnoczi
@ 2010-10-10  9:10   ` Avi Kivity
  2010-10-11 10:37     ` Stefan Hajnoczi
  2010-10-12 15:08   ` Kevin Wolf
  1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-10  9:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> This patch implements the read/write state machine.  Operations are
> fully asynchronous and multiple operations may be active at any time.
>
> Allocating writes lock tables to ensure metadata updates do not
> interfere with each other.  If two allocating writes need to update the
> same L2 table they will run sequentially.  If two allocating writes need
> to update different L2 tables they will run in parallel.
>

Shouldn't there be a flush between an allocating write and an L2 
update?  Otherwise a reuse of a cluster can move logical sectors from 
one place to another, causing a data disclosure.

Can be skipped if the new cluster is beyond the physical image size.

> +
> +/*
> + * Table locking works as follows:
> + *
> + * Reads and non-allocating writes do not acquire locks because they do not
> + * modify tables and only see committed L2 cache entries.

What about a non-allocating write that follows an allocating write?

1 Guest writes to sector 0
2 Host reads backing image (or supplies zeros), sectors 1-127
3 Host writes sectors 0-127
4 Guest writes sector 1
5 Host writes sector 1

There needs to be a barrier that prevents the host and the disk from 
reordering operations 3 and 5, or guest operation 4 is lost.  As far as 
the guest is concerned no overlapping writes were issued, so it isn't 
required to provide any barriers.

(based on the comment only, haven't read the code)
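
The five steps above can be modelled with a toy in-memory cluster to
show what is at stake.  This is only an illustration under simplifying
assumptions: SECTORS, fill_cluster_write(), sector_write() and
final_sector1() are hypothetical names, each sector is a single byte,
the backing image is all zeros, and "reordering" is simulated by simply
applying the two host writes in the opposite order.

```c
#include <assert.h>
#include <string.h>

enum { SECTORS = 128 };     /* sectors per cluster, as in the example */

/*
 * Operation 3: the allocating write covers the whole cluster.  Sector 0
 * is guest data, sectors 1-127 are filled from the backing image.
 */
static void fill_cluster_write(unsigned char *disk, unsigned char guest)
{
    memset(disk, 0, SECTORS);
    disk[0] = guest;
}

/* Operation 5: a non-allocating write that touches a single sector. */
static void sector_write(unsigned char *disk, int sector,
                         unsigned char value)
{
    assert(sector >= 0 && sector < SECTORS);
    disk[sector] = value;
}

/* Returns the final content of sector 1 for the chosen write order. */
unsigned char final_sector1(int reordered)
{
    unsigned char disk[SECTORS];

    memset(disk, 0xff, sizeof(disk));   /* stale, unallocated cluster */

    if (!reordered) {
        fill_cluster_write(disk, 'A');  /* operation 3 */
        sector_write(disk, 1, 'B');     /* operation 5 */
    } else {
        sector_write(disk, 1, 'B');     /* operation 5 lands first */
        fill_cluster_write(disk, 'A');  /* operation 3 overwrites it */
    }
    return disk[1];
}
```

In the original order sector 1 ends up holding the guest's 'B'; in the
swapped order the fill data overwrites it and the guest write is lost.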

-- 
error compiling committee.c: too many arguments to function


* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 3/7] docs: Add QED image format specification Stefan Hajnoczi
@ 2010-10-10  9:20   ` Avi Kivity
  2010-10-11 10:09     ` Stefan Hajnoczi
  2010-10-11 13:58   ` Kevin Wolf
  1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-10  9:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>   docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 94 insertions(+), 0 deletions(-)
>
> +Feature bits:
> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.

Great that QED_F_NEED_CHECK is now non-optional.  However, suppose we 
add a new structure (e.g. persistent freelist); it's now impossible to 
tell whether the structure is updated or not:

1 new qemu opens image
2 writes persistent freelist
3 clears need_check
4 shuts down
5 old qemu opens image
6 doesn't update persistent freelist
7 clears need_check
8 shuts down

The image is now inconsistent, but has need_check cleared.

We can address this by having a third feature bitmask that is 
autocleared by guests that don't recognize various bits; so the sequence 
becomes:

1 new qemu opens image
2 writes persistent freelist
3 clears need_check
4 sets persistent_freelist
5 shuts down
6 old qemu opens image
7 clears persistent_freelist (and any other bits it doesn't recognize)
8 doesn't update persistent freelist
9 clears need_check
10 shuts down

The image is now consistent, since the persistent freelist has disappeared.
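
A rough sketch of how the open path could treat the masks, to make the
second sequence concrete.  Everything here is an assumption for
illustration: the autoclear mask does not exist in the v2 format, and
the SKETCH_* constants, sketch_can_open() and
sketch_autoclear_on_open() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    /* Incompatible feature bits, mirroring the two bits in the spec. */
    SKETCH_F_BACKING_FILE = 0x01,
    SKETCH_F_NEED_CHECK = 0x02,
    SKETCH_FEATURE_MASK = SKETCH_F_BACKING_FILE | SKETCH_F_NEED_CHECK,

    /* Hypothetical autoclear bit: the persistent freelist is current. */
    SKETCH_AF_PERSISTENT_FREELIST = 0x01,
};

/* An unknown incompatible feature bit means the image must be refused. */
bool sketch_can_open(uint64_t features)
{
    return (features & ~(uint64_t)SKETCH_FEATURE_MASK) == 0;
}

/*
 * On a read-write open, keep only the autoclear bits this implementation
 * maintains.  An old qemu passes known_autoclear == 0 and so drops the
 * freelist bit before it modifies the image.
 */
uint64_t sketch_autoclear_on_open(uint64_t autoclear,
                                  uint64_t known_autoclear)
{
    return autoclear & known_autoclear;
}
```

The resulting mask would have to reach the disk before the first
metadata update, otherwise a crash in between leaves the bit set on an
image whose freelist is already out of date.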

> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
> +

It was suggested to have just a bit saying whether the backing format is 
raw or not.  This way you don't need to store the format.

-- 
error compiling committee.c: too many arguments to function


* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-10  9:20   ` [Qemu-devel] " Avi Kivity
@ 2010-10-11 10:09     ` Stefan Hajnoczi
  2010-10-11 13:04       ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 10:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
>  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> >---
> >  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 94 insertions(+), 0 deletions(-)
> >
> >+Feature bits:
> >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> 
> Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
> we add a new structure (e.g. persistent freelist); it's now
> impossible to tell whether the structure is updated or not:
> 
> 1 new qemu opens image
> 2 writes persistent freelist
> 3 clears need_check
> 4 shuts down
> 5 old qemu opens image
> 6 doesn't update persistent freelist
> 7 clears need_check
> 8 shuts down
> 
> The image is now inconsistent, but has need_check cleared.
> 
> We can address this by having a third feature bitmask that is
> autocleared by guests that don't recognize various bits; so the
> sequence becomes:
> 
> 1 new qemu opens image
> 2 writes persistent freelist
> 3 clears need_check
> 4 sets persistent_freelist
> 5 shuts down
> 6 old qemu opens image
> 7 clears persistent_freelist (and any other bits it doesn't recognize)
> 8 doesn't update persistent freelist
> 9 clears need_check
> 10 shuts down
> 
> The image is now consistent, since the persistent freelist has disappeared.

It is more complicated than just the feature bit.  The freelist requires
space in the image file.  Clearing the persistent_freelist bit leaks the
freelist.

We can solve this by using a compat feature bit and an autoclear feature
bit.  The autoclear bit specifies whether or not the freelist is valid
and the compat feature bit specifies whether or not the freelist
exists.

When the new qemu opens the image again it notices that the autoclear
bit is unset but the compat bit is set.  This means the freelist is
out-of-date and its space can be reclaimed.

I don't like the idea of doing these feature bit acrobatics because they
add complexity.  On the other hand your proposal ensures backward
compatibility in the case of an optional data structure that needs to
stay in sync with the image.  I'm just not 100% convinced it's worth it.
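
Spelled out as code, the two-bit combination gives three states.  This
is a minimal sketch of the idea only: the bit values, the FreelistState
enum and sketch_freelist_state() are hypothetical and not part of the
v2 format.

```c
#include <stdint.h>

enum {
    /* Hypothetical compat bit: space for a freelist is allocated. */
    SKETCH_CF_FREELIST_EXISTS = 0x02,
    /* Hypothetical autoclear bit: the freelist contents are current. */
    SKETCH_AF_FREELIST_VALID = 0x01,
};

typedef enum {
    FREELIST_ABSENT,    /* no freelist in the image */
    FREELIST_VALID,     /* freelist exists and can be trusted */
    FREELIST_STALE      /* freelist exists but is out of date */
} FreelistState;

FreelistState sketch_freelist_state(uint64_t compat, uint64_t autoclear)
{
    if (!(compat & SKETCH_CF_FREELIST_EXISTS)) {
        /* Without the compat bit there is nothing to validate. */
        return FREELIST_ABSENT;
    }
    if (autoclear & SKETCH_AF_FREELIST_VALID) {
        return FREELIST_VALID;
    }
    /* An older qemu cleared the autoclear bit: reclaim the space. */
    return FREELIST_STALE;
}
```

Only the FREELIST_STALE case needs extra work on open, which is the
added complexity being weighed here.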

> >+* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
> >+
> 
> It was suggested to have just a bit saying whether the backing
> format is raw or not.  This way you don't need to store the format.

Will fix in v3.

Stefan


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-10  9:10   ` [Qemu-devel] " Avi Kivity
@ 2010-10-11 10:37     ` Stefan Hajnoczi
  2010-10-11 13:10       ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 10:37 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
>  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >This patch implements the read/write state machine.  Operations are
> >fully asynchronous and multiple operations may be active at any time.
> >
> >Allocating writes lock tables to ensure metadata updates do not
> >interfere with each other.  If two allocating writes need to update the
> >same L2 table they will run sequentially.  If two allocating writes need
> >to update different L2 tables they will run in parallel.
> >
> 
> Shouldn't there be a flush between an allocating write and an L2
> update?  Otherwise a reuse of a cluster can move logical sectors
> from one place to another, causing a data disclosure.
> 
> Can be skipped if the new cluster is beyond the physical image size.

Currently clusters are never reused and new clusters are always beyond
physical image size.  The only exception I can think of is when the
image file size is not a multiple of the cluster size and we round down
to the start of cluster.

In an implementation that supports TRIM or otherwise reuses clusters
this is a cost.

> >+
> >+/*
> >+ * Table locking works as follows:
> >+ *
> >+ * Reads and non-allocating writes do not acquire locks because they do not
> >+ * modify tables and only see committed L2 cache entries.
> 
> What about a non-allocating write that follows an allocating write?
> 
> 1 Guest writes to sector 0
> 2 Host reads backing image (or supplies zeros), sectors 1-127
> 3 Host writes sectors 0-127
> 4 Guest writes sector 1
> 5 Host writes sector 1
> 
> There needs to be a barrier that prevents the host and the disk from
> reordering operations 3 and 5, or guest operation 4 is lost.  As far
> as the guest is concerned no overlapping writes were issued, so it
> isn't required to provide any barriers.
> 
> (based on the comment only, haven't read the code)

There is no barrier between operations 3 and 5.  However, operation 5
only starts after operation 3 has completed because of table locking.
It is my understanding that *independent* requests may be reordered but
two writes to the *same* sector will not be reordered if write A
completes before write B is issued.

Imagine a test program that uses pwrite() to rewrite a counter many
times on disk.  When the program finishes it prints the counter
variable's last value.  This scenario is like operations 3 and 5 above.
If we read the counter back from disk it will be the final value, not
some intermediate value.  The writes will not be reordered.
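
Such a test program might look like the sketch below.  It is a minimal
illustration assuming a POSIX host: rewrite_counter() and the temporary
file name are hypothetical, and it only exercises the page cache, so it
demonstrates the ordering of completed writes as seen by a later read,
not durability across power loss.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Rewrite a counter at offset 0 n times, issuing each pwrite() only
 * after the previous one has returned, then read the value back.
 * Returns the value read, or -1 on error.
 */
int64_t rewrite_counter(int64_t n)
{
    char path[] = "/tmp/qed-counter-XXXXXX";
    int64_t i, readback = -1;
    int fd;

    assert(n > 0);

    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);   /* the file disappears once fd is closed */

    for (i = 0; i < n; i++) {
        if (pwrite(fd, &i, sizeof(i), 0) != (ssize_t)sizeof(i)) {
            close(fd);
            return -1;
        }
    }

    if (pread(fd, &readback, sizeof(readback), 0) !=
        (ssize_t)sizeof(readback)) {
        readback = -1;
    }
    close(fd);
    return readback;
}
```

Calling rewrite_counter(1000) returns 999, the last value written.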

Stefan


* [Qemu-devel] Re: [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values Stefan Hajnoczi
@ 2010-10-11 11:09   ` Kevin Wolf
  2010-10-13  9:15   ` [Qemu-devel] " Markus Armbruster
  2010-10-13 10:25   ` [Qemu-devel] " Avi Kivity
  2 siblings, 0 replies; 72+ messages in thread
From: Kevin Wolf @ 2010-10-11 11:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> From: Anthony Liguori <aliguori@us.ibm.com>
> 
> This common function converts byte counts to human-readable strings with
> proper units.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

I wonder how many similar implementations we have in the tree. I know at
least of cvtstr() in cmd.c, which is used by qemu-io, but I would be
surprised if it was the only one.

Adding such a function to cutils.c sounds like a good idea, but we
should probably try to change everything to use this common function.

Kevin


* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 10:09     ` Stefan Hajnoczi
@ 2010-10-11 13:04       ` Avi Kivity
  2010-10-11 13:42         ` Stefan Hajnoczi
  2010-10-11 14:54         ` Anthony Liguori
  0 siblings, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 13:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
> On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >  >Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> >  >---
> >  >   docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  >   1 files changed, 94 insertions(+), 0 deletions(-)
> >  >
> >  >+Feature bits:
> >  >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> >  >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> >
> >  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
> >  we add a new structure (e.g. persistent freelist); it's now
> >  impossible to tell whether the structure is updated or not:
> >
> >  1 new qemu opens image
> >  2 writes persistent freelist
> >  3 clears need_check
> >  4 shuts down
> >  5 old qemu opens image
> >  6 doesn't update persistent freelist
> >  7 clears need_check
> >  8 shuts down
> >
> >  The image is now inconsistent, but has need_check cleared.
> >
> >  We can address this by having a third feature bitmask that is
> >  autocleared by guests that don't recognize various bits; so the
> >  sequence becomes:
> >
> >  1 new qemu opens image
> >  2 writes persistent freelist
> >  3 clears need_check
> >  4 sets persistent_freelist
> >  5 shuts down
> >  6 old qemu opens image
> >  7 clears persistent_freelist (and any other bits it doesn't recognize)
> >  8 doesn't update persistent freelist
> >  9 clears need_check
> >  10 shuts down
> >
> >  The image is now consistent, since the persistent freelist has disappeared.
>
> It is more complicated than just the feature bit.  The freelist requires
> space in the image file.  Clearing the persistent_freelist bit leaks the
> freelist.
>
> We can solve this by using a compat feature bit and an autoclear feature
> bit.  The autoclear bit specifies whether or not the freelist is valid
> and the compat feature bit specifies whether or not the freelist
> exists.
>
> When the new qemu opens the image again it notices that the autoclear
> bit is unset but the compat bit is set.  This means the freelist is
> out-of-date and its space can be reclaimed.
>
> I don't like the idea of doing these feature bit acrobatics because they
> add complexity.  On the other hand your proposal ensures backward
> compatibility in the case of an optional data structure that needs to
> stay in sync with the image.  I'm just not 100% convinced it's worth it.

My scenario ends up with data corruption if we move to an old qemu and 
then back again, without any aborts.

A leak is acceptable (it won't grow; it's just an unused, incorrect 
freelist), but data corruption is not.

-- 
error compiling committee.c: too many arguments to function


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-11 10:37     ` Stefan Hajnoczi
@ 2010-10-11 13:10       ` Avi Kivity
  2010-10-11 13:55         ` Stefan Hajnoczi
  2010-10-11 14:57         ` Anthony Liguori
  0 siblings, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 13:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:
> On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >  >This patch implements the read/write state machine.  Operations are
> >  >fully asynchronous and multiple operations may be active at any time.
> >  >
> >  >Allocating writes lock tables to ensure metadata updates do not
> >  >interfere with each other.  If two allocating writes need to update the
> >  >same L2 table they will run sequentially.  If two allocating writes need
> >  >to update different L2 tables they will run in parallel.
> >  >
> >
> >  Shouldn't there be a flush between an allocating write and an L2
> >  update?  Otherwise a reuse of a cluster can move logical sectors
> >  from one place to another, causing a data disclosure.
> >
> >  Can be skipped if the new cluster is beyond the physical image size.
>
> Currently clusters are never reused and new clusters are always beyond
> physical image size.  The only exception I can think of is when the
> image file size is not a multiple of the cluster size and we round down
> to the start of cluster.
>
> In an implementation that supports TRIM or otherwise reuses clusters
> this is a cost.
>
> >  >+
> >  >+/*
> >  >+ * Table locking works as follows:
> >  >+ *
> >  >+ * Reads and non-allocating writes do not acquire locks because they do not
> >  >+ * modify tables and only see committed L2 cache entries.
> >
> >  What about a non-allocating write that follows an allocating write?
> >
> >  1 Guest writes to sector 0
> >  2 Host reads backing image (or supplies zeros), sectors 1-127
> >  3 Host writes sectors 0-127
> >  4 Guest writes sector 1
> >  5 Host writes sector 1
> >
> >  There needs to be a barrier that prevents the host and the disk from
> >  reordering operations 3 and 5, or guest operation 4 is lost.  As far
> >  as the guest is concerned no overlapping writes were issued, so it
> >  isn't required to provide any barriers.
> >
> >  (based on the comment only, haven't read the code)
>
> There is no barrier between operations 3 and 5.  However, operation 5
> only starts after operation 3 has completed because of table locking.
> It is my understanding that *independent* requests may be reordered but
> two writes to the *same* sector will not be reordered if write A
> completes before write B is issued.
>
> Imagine a test program that uses pwrite() to rewrite a counter many
> times on disk.  When the program finishes it prints the counter
> variable's last value.  This scenario is like operations 3 and 5 above.
> If we read the counter back from disk it will be the final value, not
> some intermediate value.  The writes will not be reordered.

Yes, all that is needed is a barrier in program order.  So, operation 5 
is an allocating write as long as 3 hasn't returned? (at which point it 
becomes a non-allocating write)?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
                   ` (6 preceding siblings ...)
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 7/7] qed: Consistency check support Stefan Hajnoczi
@ 2010-10-11 13:21 ` Kevin Wolf
  2010-10-11 15:37   ` Stefan Hajnoczi
  2010-10-16  7:51 ` [Qemu-devel] " Stefan Hajnoczi
  8 siblings, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-11 13:21 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> This code is also available from git:
> 
> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed

This doesn't seem to be the same as the latest patches you posted to
qemu-devel. Forgot to push?

Kevin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 13:04       ` Avi Kivity
@ 2010-10-11 13:42         ` Stefan Hajnoczi
  2010-10-11 13:44           ` Avi Kivity
  2010-10-11 14:54         ` Anthony Liguori
  1 sibling, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 13:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Mon, Oct 11, 2010 at 03:04:03PM +0200, Avi Kivity wrote:
>  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
> >On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
> >>   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >>  >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
> >>  >---
> >>  >   docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  >   1 files changed, 94 insertions(+), 0 deletions(-)
> >>  >
> >>  >+Feature bits:
> >>  >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> >>  >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> >>
> >>  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
> >>  we add a new structure (e.g. persistent freelist); it's now
> >>  impossible to tell whether the structure is updated or not:
> >>
> >>  1 new qemu opens image
> >>  2 writes persistent freelist
> >>  3 clears need_check
> >>  4 shuts down
> >>  5 old qemu opens image
> >>  6 doesn't update persistent freelist
> >>  7 clears need_check
> >>  8 shuts down
> >>
> >>  The image is now inconsistent, but has need_check cleared.
> >>
> >>  We can address this by having a third feature bitmask that is
> >>  autocleared by guests that don't recognize various bits; so the
> >>  sequence becomes:
> >>
> >>  1 new qemu opens image
> >>  2 writes persistent freelist
> >>  3 clears need_check
> >>  4 sets persistent_freelist
> >>  5 shuts down
> >>  6 old qemu opens image
> >>  7 clears persistent_freelist (and any other bits it doesn't recognize)
> >>  8 doesn't update persistent freelist
> >>  9 clears need_check
> >>  10 shuts down
> >>
> >>  The image is now consistent, since the persistent freelist has disappeared.
> >
> >It is more complicated than just the feature bit.  The freelist requires
> >space in the image file.  Clearing the persistent_freelist bit leaks the
> >freelist.
> >
> >We can solve this by using a compat feature bit and an autoclear feature
> >bit.  The autoclear bit specifies whether or not the freelist is valid
> >and the compat feature bit specifies whether or not the freelist
> >exists.
> >
> >When the new qemu opens the image again it notices that the autoclear
> >bit is unset but the compat bit is set.  This means the freelist is
> >out-of-date and its space can be reclaimed.
> >
> >I don't like the idea of doing these feature bit acrobatics because they
> >add complexity.  On the other hand your proposal ensures backward
> >compatibility in the case of an optional data structure that needs to
> >stay in sync with the image.  I'm just not 100% convinced it's worth it.
> 
> My scenario ends up with data corruption if we move to an old qemu
> and then back again, without any aborts.
> 
> A leak is acceptable (it won't grow; it's just an unused, incorrect
> freelist), but data corruption is not.

The alternative is for the freelist to be a non-compat feature bit.
That means older QEMU binaries cannot use a QED image that has enabled
the freelist.

Stefan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 13:42         ` Stefan Hajnoczi
@ 2010-10-11 13:44           ` Avi Kivity
  2010-10-11 14:06             ` Stefan Hajnoczi
  2010-10-11 15:02             ` Anthony Liguori
  0 siblings, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 13:44 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
> >
> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
> >  freelist), but data corruption is not.
>
> The alternative is for the freelist to be a non-compat feature bit.
> That means older QEMU binaries cannot use a QED image that has enabled
> the freelist.

For this one feature.  What about others?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-11 13:10       ` Avi Kivity
@ 2010-10-11 13:55         ` Stefan Hajnoczi
  2010-10-11 14:57         ` Anthony Liguori
  1 sibling, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 13:55 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Mon, Oct 11, 2010 at 03:10:16PM +0200, Avi Kivity wrote:
>  On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:
> >On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
> >>   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> >>  >This patch implements the read/write state machine.  Operations are
> >>  >fully asynchronous and multiple operations may be active at any time.
> >>  >
> >>  >Allocating writes lock tables to ensure metadata updates do not
> >>  >interfere with each other.  If two allocating writes need to update the
> >>  >same L2 table they will run sequentially.  If two allocating writes need
> >>  >to update different L2 tables they will run in parallel.
> >>  >
> >>
> >>  Shouldn't there be a flush between an allocating write and an L2
> >>  update?  Otherwise a reuse of a cluster can move logical sectors
> >>  from one place to another, causing a data disclosure.
> >>
> >>  Can be skipped if the new cluster is beyond the physical image size.
> >
> >Currently clusters are never reused and new clusters are always beyond
> >physical image size.  The only exception I can think of is when the
> >image file size is not a multiple of the cluster size and we round down
> >to the start of cluster.
> >
> >In an implementation that supports TRIM or otherwise reuses clusters
> >this is a cost.
> >
> >>  >+
> >>  >+/*
> >>  >+ * Table locking works as follows:
> >>  >+ *
> >>  >+ * Reads and non-allocating writes do not acquire locks because they do not
> >>  >+ * modify tables and only see committed L2 cache entries.
> >>
> >>  What about a non-allocating write that follows an allocating write?
> >>
> >>  1 Guest writes to sector 0
> >>  2 Host reads backing image (or supplies zeros), sectors 1-127
> >>  3 Host writes sectors 0-127
> >>  4 Guest writes sector 1
> >>  5 Host writes sector 1
> >>
> >>  There needs to be a barrier that prevents the host and the disk from
> >>  reordering operations 3 and 5, or guest operation 4 is lost.  As far
> >>  as the guest is concerned no overlapping writes were issued, so it
> >>  isn't required to provide any barriers.
> >>
> >>  (based on the comment only, haven't read the code)
> >
> >There is no barrier between operations 3 and 5.  However, operation 5
> >only starts after operation 3 has completed because of table locking.
> >It is my understanding that *independent* requests may be reordered but
> >two writes to the *same* sector will not be reordered if write A
> >completes before write B is issued.
> >
> >Imagine a test program that uses pwrite() to rewrite a counter many
> >times on disk.  When the program finishes it prints the counter
> >variable's last value.  This scenario is like operations 3 and 5 above.
> >If we read the counter back from disk it will be the final value, not
> >some intermediate value.  The writes will not be reordered.
> 
> Yes, all that is needed is a barrier in program order.  So,
> operation 5 is an allocating write as long as 3 hasn't returned? (at
> which point it becomes a non-allocating write)?

Yes, operation 5 waits until operation 3 completes.  After waking up on
the lock, the request looks up the cluster again because it may now be
allocated - operation 5 switches to a non-allocating write.
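
The wait-then-look-up-again behaviour can be sketched as a small decision
function in C.  This is a simplified, synchronous model rather than the code
in the patch: the struct, the names, and the convention that an offset of 0
means "unallocated" are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_L2_ENTRIES 4   /* tiny table, for illustration only */

/* Hypothetical in-memory L2 table.  An offset of 0 is assumed to mean
 * "unallocated". */
struct sketch_l2_table {
    uint64_t offsets[SKETCH_L2_ENTRIES];
    bool locked;              /* held by an in-flight allocating write */
};

enum sketch_write_path {
    SKETCH_WRITE_IN_PLACE,    /* cluster allocated: no table lock needed */
    SKETCH_WRITE_ALLOCATE,    /* take the table lock and allocate */
    SKETCH_WRITE_WAIT         /* queue behind the current lock holder */
};

/* Decide what a write to L2 entry `index` does next.  A request that had
 * to wait calls this again after waking up, so it sees the updated table. */
static enum sketch_write_path
sketch_write_next_step(const struct sketch_l2_table *t, unsigned index)
{
    assert(index < SKETCH_L2_ENTRIES);
    if (t->offsets[index] != 0) {
        return SKETCH_WRITE_IN_PLACE;
    }
    return t->locked ? SKETCH_WRITE_WAIT : SKETCH_WRITE_ALLOCATE;
}

/* Completion of an allocating write: the data, including fill data, has
 * already been written.  Publish the mapping, then release the lock. */
static void sketch_allocation_done(struct sketch_l2_table *t, unsigned index,
                                   uint64_t cluster_offset)
{
    assert(index < SKETCH_L2_ENTRIES);
    t->offsets[index] = cluster_offset;
    t->locked = false;
}
```

In the five-step scenario, the write for sector 0 gets ALLOCATE and takes the
lock, the write for sector 1 gets WAIT, and once sketch_allocation_done() has
run the second call for sector 1 returns IN_PLACE.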

Stefan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 3/7] docs: Add QED image format specification Stefan Hajnoczi
  2010-10-10  9:20   ` [Qemu-devel] " Avi Kivity
@ 2010-10-11 13:58   ` Kevin Wolf
  2010-10-11 15:30     ` Stefan Hajnoczi
  1 sibling, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-11 13:58 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 94 insertions(+), 0 deletions(-)
>  create mode 100644 docs/specs/qed_spec.txt
> 
> diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> new file mode 100644
> index 0000000..c942b8e
> --- /dev/null
> +++ b/docs/specs/qed_spec.txt
> @@ -0,0 +1,94 @@
> +=Specification=
> +
> +The file format looks like this:
> +
> + +----------+----------+----------+-----+
> + | cluster0 | cluster1 | cluster2 | ... |
> + +----------+----------+----------+-----+
> +
> +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
> +
> +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
> +
> +All fields are little-endian.
> +
> +==Header==
> + Header {
> +     uint32_t magic;               /* QED\0 */
> + 
> +     uint32_t cluster_size;        /* in bytes */
> +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
> +     uint32_t header_size;         /* in clusters */
> + 
> +     uint64_t features;            /* format feature bits */
> +     uint64_t compat_features;     /* compat feature bits */
> +     uint64_t l1_table_offset;     /* in bytes */
> +     uint64_t image_size;          /* total logical image size, in bytes */
> + 
> +     /* if (features & QED_F_BACKING_FILE) */
> +     uint32_t backing_filename_offset; /* in bytes from start of header */
> +     uint32_t backing_filename_size;   /* in bytes */
> + 
> +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
> +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
> +     uint32_t backing_fmt_size;    /* in bytes */

It was discussed before, but I don't think we came to a conclusion. Are
there any circumstances under which you don't want to set the
QED_CF_BACKING_FORMAT flag?

Also it's unclear what this "if" actually means: If the flag isn't set,
are the fields zero, are they undefined or are they even completely
missing and the offsets of the following fields must be adjusted?

> + }
> +
> +Field descriptions:
> +* cluster_size must be a power of 2 in range [2^12, 2^26].
> +* table_size must be a power of 2 in range [1, 16].

Is there a reason why this must be a power of two?
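
For reference, the two constraints as currently written could be checked like
this.  It is only a sketch in C with made-up helper names, not code from the
patch series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool sketch_is_power_of_2(uint32_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* cluster_size must be a power of 2 in range [2^12, 2^26]. */
static bool sketch_cluster_size_ok(uint32_t cluster_size)
{
    return sketch_is_power_of_2(cluster_size) &&
           cluster_size >= (UINT32_C(1) << 12) &&
           cluster_size <= (UINT32_C(1) << 26);
}

/* table_size (in clusters) must be a power of 2 in range [1, 16]. */
static bool sketch_table_size_ok(uint32_t table_size)
{
    return sketch_is_power_of_2(table_size) && table_size <= 16;
}

/* TABLE_NOFFSETS from the spec: number of 64-bit offsets per table. */
static uint64_t sketch_table_noffsets(uint32_t cluster_size,
                                      uint32_t table_size)
{
    assert(sketch_cluster_size_ok(cluster_size));
    assert(sketch_table_size_ok(table_size));
    return (uint64_t)table_size * cluster_size / sizeof(uint64_t);
}
```

When both values are powers of two, TABLE_NOFFSETS is one as well, which is
what lets the L1 and L2 indexes be fixed-width bit fields of the logical
offset.  Whether that is the intended reason is a guess on my part.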

> +* header_size is the number of clusters used by the header and any additional information stored before regular clusters.
> +* features and compat_features are bitmaps where active file format features can be selectively enabled.  The difference between the two is that an image file that uses unknown compat_features bits can be safely opened without knowing how to interpret those bits.  If an image file has an unsupported features bit set then it is not possible to open that image (the image is not backwards-compatible).
> +* l1_table_offset must be a multiple of cluster_size.

And it is the offset of the first byte of the L1 table in the image file.

> +* image_size is the block device size seen by the guest and must be a multiple of cluster_size.

So there are image sizes that can't be accurately represented in QED? I
think that's a bad idea. Even more so because I can't see how it greatly
simplifies implementation (you save the operation for rounding up on
open/create, that's it) - it looks like a completely arbitrary restriction.
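
The rounding operation in question is small.  A sketch in C, assuming the
cluster size is a power of two as the spec requires (the function name is
made up):

```c
#include <assert.h>
#include <stdint.h>

/* Round `size` up to the next multiple of `cluster_size`, which must be a
 * power of 2.  The caller must ensure size + cluster_size - 1 does not
 * overflow 64 bits. */
static uint64_t sketch_round_up_to_cluster(uint64_t size,
                                           uint32_t cluster_size)
{
    uint64_t mask = (uint64_t)cluster_size - 1;

    assert(cluster_size != 0 && (cluster_size & (cluster_size - 1)) == 0);
    return (size + mask) & ~mask;
}
```

With this, the header could keep the exact image_size and the implementation
would round up only where it needs a whole number of clusters.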

> +* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.

A description of the meaning of these strings is missing.

> +
> +Feature bits:
> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.

I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
twice is confusing at first sight.

> +
> +==Tables==
> +
> +Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
> +
> + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> +  
> + Table {
> +     uint64_t offsets[TABLE_NOFFSETS];
> + }
> +
> +The tables are organized as follows:
> +
> +                    +----------+
> +                    | L1 table |
> +                    +----------+
> +               ,------'  |  '------.
> +          +----------+   |    +----------+
> +          | L2 table |  ...   | L2 table |
> +          +----------+        +----------+
> +      ,------'  |  '------.
> + +----------+   |    +----------+
> + |   Data   |  ...   |   Data   |
> + +----------+        +----------+
> +
> +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
> +
> +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
> + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> +
> +Logical offsets are translated into cluster offsets as follows:
> +
> +  table_bits table_bits    cluster_bits
> +  <--------> <--------> <--------------->
> + +----------+----------+-----------------+
> + | L1 index | L2 index |     byte offset |
> + +----------+----------+-----------------+
> + 
> +       Structure of a logical offset
> +
> + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> +   l2_offset = l1_table[l1_index]
> +   l2_table = load_table(l2_offset)
> +   cluster_offset = l2_table[l2_index]
> +   return cluster_offset + byte_offset

Should we reserve some bits in the table entries in case we need some
flags later? Also, I suppose all table entries must be cluster aligned?
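
To illustrate both questions, here is a sketch in C of how a logical offset
splits into the three fields from the diagram, plus an alignment check for
table entries.  The struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_geometry {
    unsigned cluster_bits;    /* log2(cluster_size) */
    unsigned table_bits;      /* log2(TABLE_NOFFSETS) */
};

struct sketch_position {
    uint64_t l1_index;
    uint64_t l2_index;
    uint64_t byte_offset;
};

/* log2 of a power of 2. */
static unsigned sketch_log2(uint64_t v)
{
    unsigned bits = 0;

    assert(v != 0 && (v & (v - 1)) == 0);
    while (v > 1) {
        v >>= 1;
        bits++;
    }
    return bits;
}

static struct sketch_geometry sketch_geometry_init(uint32_t cluster_size,
                                                   uint32_t table_size)
{
    struct sketch_geometry g;
    uint64_t noffsets = (uint64_t)table_size * cluster_size / sizeof(uint64_t);

    g.cluster_bits = sketch_log2(cluster_size);
    g.table_bits = sketch_log2(noffsets);
    return g;
}

/* Split a logical offset into L1 index, L2 index and byte offset. */
static struct sketch_position sketch_split(const struct sketch_geometry *g,
                                           uint64_t logical)
{
    struct sketch_position p;
    uint64_t table_mask = (UINT64_C(1) << g->table_bits) - 1;

    p.byte_offset = logical & ((UINT64_C(1) << g->cluster_bits) - 1);
    p.l2_index = (logical >> g->cluster_bits) & table_mask;
    p.l1_index = (logical >> (g->cluster_bits + g->table_bits)) & table_mask;
    return p;
}

/* True if a table entry points at the start of a cluster. */
static bool sketch_entry_is_aligned(const struct sketch_geometry *g,
                                    uint64_t entry)
{
    return (entry & ((UINT64_C(1) << g->cluster_bits) - 1)) == 0;
}
```

If entries are required to be cluster aligned, the low cluster_bits bits of
every entry are zero (at least 12 bits with the minimum cluster size), so
they would be available for flags without shrinking the offset field.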

What happened to the other sections that older versions of the spec
contained? For example, this version no longer specifies what the
semantics of unallocated clusters and backing files are.

Kevin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 13:44           ` Avi Kivity
@ 2010-10-11 14:06             ` Stefan Hajnoczi
  2010-10-11 14:12               ` Avi Kivity
  2010-10-11 15:02             ` Anthony Liguori
  1 sibling, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 14:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Mon, Oct 11, 2010 at 03:44:38PM +0200, Avi Kivity wrote:
>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
> >>
> >>  A leak is acceptable (it won't grow; it's just an unused, incorrect
> >>  freelist), but data corruption is not.
> >
> >The alternative is for the freelist to be a non-compat feature bit.
> >That means older QEMU binaries cannot use a QED image that has enabled
> >the freelist.
> 
> For this one feature.  What about others?

Compat features that need to be in sync with the image state will either
require specific checks (e.g. checksum or shadow of the state) or they
need to be non-compat features and are not backwards compatible.

I'm not opposing autoclear feature bits themselves, they are a neat
idea.  However, they will initially have no users so is this something
we really want to carry?

Stefan

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 14:06             ` Stefan Hajnoczi
@ 2010-10-11 14:12               ` Avi Kivity
  0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 14:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/11/2010 04:06 PM, Stefan Hajnoczi wrote:
> On Mon, Oct 11, 2010 at 03:44:38PM +0200, Avi Kivity wrote:
> >   On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
> >  >>
> >  >>   A leak is acceptable (it won't grow; it's just an unused, incorrect
> >  >>   freelist), but data corruption is not.
> >  >
> >  >The alternative is for the freelist to be a non-compat feature bit.
> >  >That means older QEMU binaries cannot use a QED image that has enabled
> >  >the freelist.
> >
> >  For this one feature.  What about others?
>
> Compat features that need to be in sync with the image state will either
> require specific checks (e.g. checksum or shadow of the state) or they
> need to be non-compat features and are not backwards compatible.
>
> I'm not opposing autoclear feature bits themselves, they are a neat
> idea.  However, they will initially have no users so is this something
> we really want to carry?

Hard to tell in advance.  It seems like a simple feature with potential 
to make our lives easier in the future.

Anything we develop in the next few months can be rolled back into the 
baseline, assuming we declare the format unstable while 0.14 is in 
development, so this is really about post 0.14 features.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 13:04       ` Avi Kivity
  2010-10-11 13:42         ` Stefan Hajnoczi
@ 2010-10-11 14:54         ` Anthony Liguori
  2010-10-11 14:58           ` Avi Kivity
  1 sibling, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 14:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

On 10/11/2010 08:04 AM, Avi Kivity wrote:
>  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
>> On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
>> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
>> > >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>> > >---
>> > >   docs/specs/qed_spec.txt |   94 
>> +++++++++++++++++++++++++++++++++++++++++++++++
>> > >   1 files changed, 94 insertions(+), 0 deletions(-)
>> > >
>> > >+Feature bits:
>> > >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>> > >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check 
>> before use.
>> >
>> >  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
>> >  we add a new structure (e.g. persistent freelist); it's now
>> >  impossible to tell whether the structure is updated or not:
>> >
>> >  1 new qemu opens image
>> >  2 writes persistent freelist
>> >  3 clears need_check
>> >  4 shuts down
>> >  5 old qemu opens image
>> >  6 doesn't update persistent freelist
>> >  7 clears need_check
>> >  8 shuts down
>> >
>> >  The image is now inconsistent, but has need_check cleared.
>> >
>> >  We can address this by having a third feature bitmask that is
>> >  autocleared by guests that don't recognize various bits; so the
>> >  sequence becomes:
>> >
>> >  1 new qemu opens image
>> >  2 writes persistent freelist
>> >  3 clears need_check
>> >  4 sets persistent_freelist
>> >  5 shuts down
>> >  6 old qemu opens image
>> >  7 clears persistent_freelist (and any other bits it doesn't 
>> recognize)
>> >  8 doesn't update persistent freelist
>> >  9 clears need_check
>> >  10 shuts down
>> >
>> >  The image is now consistent, since the persistent freelist has 
>> disappeared.
>>
>> It is more complicated than just the feature bit.  The freelist requires
>> space in the image file.  Clearing the persistent_freelist bit leaks the
>> freelist.
>>
>> We can solve this by using a compat feature bit and an autoclear feature
>> bit.  The autoclear bit specifies whether or not the freelist is valid
>> and the compat feature bit specifies whether or not the freelist
>> exists.
>>
>> When the new qemu opens the image again it notices that the autoclear
>> bit is unset but the compat bit is set.  This means the freelist is
>> out-of-date and its space can be reclaimed.
>>
>> I don't like the idea of doing these feature bit acrobatics because they
>> add complexity.  On the other hand your proposal ensures backward
>> compatibility in the case of an optional data structure that needs to
>> stay in sync with the image.  I'm just not 100% convinced it's worth it.
>
> My scenario ends up with data corruption if we move to an old qemu and 
> then back again, without any aborts.
>
> A leak is acceptable (it won't grow; it's just an unused, incorrect 
> freelist), but data corruption is not.

A leak is unacceptable.  It means an image can grow to an unbounded 
size.  If you are a server provider offering multitenancy, then a 
malicious guest can potentially grow the image beyond its allotted size 
causing a Denial of Service attack against another tenant.

A freelist has to be a non-optional feature.  When the freelist bit is 
set, an older QEMU cannot read the image.  If the freelist is completely 
used, the freelist bit can be cleared and the image is then usable by 
older QEMUs.
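
A rough sketch in C of that lifecycle.  The bit value 0x04 and all names here
are hypothetical; no freelist bit is defined in the posted spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_F_FREELIST 0x04ULL   /* hypothetical non-compat feature bit */

struct sketch_image {
    uint64_t features;        /* non-compat feature bits from the header */
    uint64_t free_clusters;   /* clusters currently on the freelist */
};

/* Keep the feature bit in step with the freelist: set while any cluster is
 * on the list, cleared once the list has been used up. */
static void sketch_sync_freelist_bit(struct sketch_image *img)
{
    assert(img != NULL);
    if (img->free_clusters > 0) {
        img->features |= SKETCH_F_FREELIST;
    } else {
        img->features &= ~SKETCH_F_FREELIST;
    }
}

/* A binary may open the image only if it knows every non-compat bit. */
static bool sketch_can_open(const struct sketch_image *img,
                            uint64_t known_features)
{
    assert(img != NULL);
    return (img->features & ~known_features) == 0;
}
```

An older QEMU that knows only 0x01 and 0x02 is refused while the bit is set
and can open the image again once the freelist is empty and the bit has been
cleared.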

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-11 13:10       ` Avi Kivity
  2010-10-11 13:55         ` Stefan Hajnoczi
@ 2010-10-11 14:57         ` Anthony Liguori
  1 sibling, 0 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 14:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

On 10/11/2010 08:10 AM, Avi Kivity wrote:
>  On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:
>> On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
>> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
>> > >This patch implements the read/write state machine.  Operations are
>> > >fully asynchronous and multiple operations may be active at any time.
>> > >
>> > >Allocating writes lock tables to ensure metadata updates do not
>> > >interfere with each other.  If two allocating writes need to 
>> update the
>> > >same L2 table they will run sequentially.  If two allocating 
>> writes need
>> > >to update different L2 tables they will run in parallel.
>> > >
>> >
>> >  Shouldn't there be a flush between an allocating write and an L2
>> >  update?  Otherwise a reuse of a cluster can move logical sectors
>> >  from one place to another, causing a data disclosure.
>> >
>> >  Can be skipped if the new cluster is beyond the physical image size.
>>
>> Currently clusters are never reused and new clusters are always beyond
>> physical image size.  The only exception I can think of is when the
>> image file size is not a multiple of the cluster size and we round down
>> to the start of cluster.
>>
>> In an implementation that supports TRIM or otherwise reuses clusters
>> this is a cost.
>>
>> > >+
>> > >+/*
>> > >+ * Table locking works as follows:
>> > >+ *
>> > >+ * Reads and non-allocating writes do not acquire locks because 
>> they do not
>> > >+ * modify tables and only see committed L2 cache entries.
>> >
>> >  What about a non-allocating write that follows an allocating write?
>> >
>> >  1 Guest writes to sector 0
>> >  2 Host reads backing image (or supplies zeros), sectors 1-127
>> >  3 Host writes sectors 0-127
>> >  4 Guest writes sector 1
>> >  5 Host writes sector 1
>> >
>> >  There needs to be a barrier that prevents the host and the disk from
>> >  reordering operations 3 and 5, or guest operation 4 is lost.  As far
>> >  as the guest is concerned no overlapping writes were issued, so it
>> >  isn't required to provide any barriers.
>> >
>> >  (based on the comment only, haven't read the code)
>>
>> There is no barrier between operations 3 and 5.  However, operation 5
>> only starts after operation 3 has completed because of table locking.
>> It is my understanding that *independent* requests may be reordered but
>> two writes to the *same* sector will not be reordered if write A
>> completes before write B is issued.
>>
>> Imagine a test program that uses pwrite() to rewrite a counter many
>> times on disk.  When the program finishes it prints the counter
>> variable's last value.  This scenario is like operations 3 and 5 above.
>> If we read the counter back from disk it will be the final value, not
>> some intermediate value.  The writes will not be reordered.
>
> Yes, all that is needed is a barrier in program order.  So, operation 
> 5 is an allocating write as long as 3 hasn't returned? (at which point 
> it becomes a non-allocating write)?

Correct.  The table lock in 0 is held until the request completes fully 
(including the write out of all of the fill data--step 3) which means 5 
will not begin until 3 has completed.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 14:54         ` Anthony Liguori
@ 2010-10-11 14:58           ` Avi Kivity
  2010-10-11 15:49             ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 14:58 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

  On 10/11/2010 04:54 PM, Anthony Liguori wrote:
> On 10/11/2010 08:04 AM, Avi Kivity wrote:
>>  On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:
>>> On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
>>> >   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
>>> > >Signed-off-by: Stefan Hajnoczi<stefanha@linux.vnet.ibm.com>
>>> > >---
>>> > >   docs/specs/qed_spec.txt |   94 
>>> +++++++++++++++++++++++++++++++++++++++++++++++
>>> > >   1 files changed, 94 insertions(+), 0 deletions(-)
>>> > >
>>> > >+Feature bits:
>>> > >+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>>> > >+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check 
>>> before use.
>>> >
>>> >  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
>>> >  we add a new structure (e.g. persistent freelist); it's now
>>> >  impossible to tell whether the structure is updated or not:
>>> >
>>> >  1 new qemu opens image
>>> >  2 writes persistent freelist
>>> >  3 clears need_check
>>> >  4 shuts down
>>> >  5 old qemu opens image
>>> >  6 doesn't update persistent freelist
>>> >  7 clears need_check
>>> >  8 shuts down
>>> >
>>> >  The image is now inconsistent, but has need_check cleared.
>>> >
>>> >  We can address this by having a third feature bitmask that is
>>> >  autocleared by guests that don't recognize various bits; so the
>>> >  sequence becomes:
>>> >
>>> >  1 new qemu opens image
>>> >  2 writes persistent freelist
>>> >  3 clears need_check
>>> >  4 sets persistent_freelist
>>> >  5 shuts down
>>> >  6 old qemu opens image
>>> >  7 clears persistent_freelist (and any other bits it doesn't 
>>> recognize)
>>> >  8 doesn't update persistent freelist
>>> >  9 clears need_check
>>> >  10 shuts down
>>> >
>>> >  The image is now consistent, since the persistent freelist has
>>> >  disappeared.
>>>
>>> It is more complicated than just the feature bit.  The freelist requires
>>> space in the image file.  Clearing the persistent_freelist bit leaks the
>>> freelist.
>>>
>>> We can solve this by using a compat feature bit and an autoclear feature
>>> bit.  The autoclear bit specifies whether or not the freelist is valid
>>> and the compat feature bit specifies whether or not the freelist
>>> exists.
>>>
>>> When the new qemu opens the image again it notices that the autoclear
>>> bit is unset but the compat bit is set.  This means the freelist is
>>> out-of-date and its space can be reclaimed.
>>>
>>> I don't like the idea of doing these feature bit acrobatics because they
>>> add complexity.  On the other hand your proposal ensures backward
>>> compatibility in the case of an optional data structure that needs to
>>> stay in sync with the image.  I'm just not 100% convinced it's worth it.
>>
>> My scenario ends up with data corruption if we move to an old qemu 
>> and then back again, without any aborts.
>>
>> A leak is acceptable (it won't grow; it's just an unused, incorrect 
>> freelist), but data corruption is not.
>
> A leak is unacceptable.  It means an image can grow to an unbounded 
> size.  If you are a server provider offering multitenancy, then a 
> malicious guest can potentially grow the image beyond its allotted 
> size causing a Denial of Service attack against another tenant.

This particular leak cannot grow, and is not controlled by the guest.

> A freelist has to be a non-optional feature.  When the freelist bit is 
> set, an older QEMU cannot read the image.  If the freelist is 
> completely used, the freelist bit can be cleared and the image is then 
> usable by older QEMUs.

Once we support TRIM (or detect zeros) we'll never have a clean freelist.
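
For reference, the autoclear scheme discussed above, combined with
Stefan's compat bit, might be sketched roughly as follows.  This is only
an illustration: the ExHeader struct, the ex_* functions and the bit
values are all hypothetical and are not part of the posted patch or spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define EX_CF_FREELIST_EXISTS  0x02ULL  /* compat: freelist space is allocated */
#define EX_AF_FREELIST_VALID   0x01ULL  /* autoclear: freelist is up to date */

/* Hypothetical header with a third, autoclear feature bitmask. */
typedef struct {
    uint64_t compat_features;
    uint64_t autoclear_features;
} ExHeader;

/* On open for writing, drop every autoclear bit this qemu does not know.
 * An old qemu passes known_autoclear == 0 and so clears all of them. */
static void ex_open(ExHeader *h, uint64_t known_autoclear)
{
    h->autoclear_features &= known_autoclear;
}

/* A freelist that exists but is no longer marked valid is out of date;
 * a new qemu could reclaim its space instead of trusting it. */
static bool ex_freelist_is_stale(const ExHeader *h)
{
    return (h->compat_features & EX_CF_FREELIST_EXISTS) &&
           !(h->autoclear_features & EX_AF_FREELIST_VALID);
}
```

With this shape an old qemu never needs to understand the freelist; it
only masks off bits it does not recognize, and the compat bit is left
alone so the space can still be found and reclaimed later.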

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 13:44           ` Avi Kivity
  2010-10-11 14:06             ` Stefan Hajnoczi
@ 2010-10-11 15:02             ` Anthony Liguori
  2010-10-11 15:24               ` Avi Kivity
  1 sibling, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 15:02 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

On 10/11/2010 08:44 AM, Avi Kivity wrote:
>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>> >
>> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
>> >  freelist), but data corruption is not.
>>
>> The alternative is for the freelist to be a non-compat feature bit.
>> That means older QEMU binaries cannot use a QED image that has enabled
>> the freelist.
>
> For this one feature.  What about others?

A compat feature is one where the feature can be completely ignored 
(meaning that the QEMU does not have to understand the data format).

An example of a compat feature is copy-on-read.  It's merely a 
suggestion and there is no additional metadata.  If a QEMU doesn't 
understand it, it doesn't affect its ability to read the image.

An example of a non-compat feature would be zero cluster entries.  Zero 
cluster entries are a special L2 table entry that indicates that a 
cluster's on-disk data is all zeros.  As long as there is at least 1 ZCE 
in the L2 tables, this feature bit must be set.  As soon as all of the 
ZCE bits are cleared, the feature bit can be unset.

An older QEMU will gracefully fail when presented with an image using 
ZCE bits.  An image with no ZCEs will work on older QEMUs.
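
As a rough sketch of that rule at open time (the ex_* names and the ZCE
bit value are made up for illustration; only the "unknown non-compat bit
means -ENOTSUP" behaviour mirrors what the posted bdrv_qed_open() does):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define EX_F_BACKING_FILE   0x01ULL  /* same value as QED_F_BACKING_FILE */
#define EX_F_ZERO_CLUSTERS  0x04ULL  /* hypothetical non-compat ZCE bit */

/* Returns 0 if the image may be opened, -ENOTSUP otherwise.
 * Unknown non-compat bits are fatal; unknown compat bits are ignored. */
static int ex_check_features(uint64_t features, uint64_t compat_features,
                             uint64_t supported_features)
{
    (void)compat_features; /* never a reason to refuse the image */

    if (features & ~supported_features) {
        return -ENOTSUP;
    }
    return 0;
}
```

An older QEMU that only supports EX_F_BACKING_FILE would refuse an image
with EX_F_ZERO_CLUSTERS set, and accept it again once the bit is cleared.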

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format Stefan Hajnoczi
@ 2010-10-11 15:16   ` Kevin Wolf
  0 siblings, 0 replies; 72+ messages in thread
From: Kevin Wolf @ 2010-10-11 15:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> This patch introduces the qed on-disk layout and implements image
> creation.  Later patches add read/write and other functionality.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  Makefile.objs |    1 +
>  block/qed.c   |  530 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/qed.h   |  148 ++++++++++++++++
>  3 files changed, 679 insertions(+), 0 deletions(-)
>  create mode 100644 block/qed.c
>  create mode 100644 block/qed.h
> 
> diff --git a/Makefile.objs b/Makefile.objs
> index 816194a..ff15795 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -14,6 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  
>  block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
> +block-nested-y += qed.o
>  block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>  block-nested-$(CONFIG_POSIX) += raw-posix.o
> diff --git a/block/qed.c b/block/qed.c
> new file mode 100644
> index 0000000..ea03798
> --- /dev/null
> +++ b/block/qed.c
> @@ -0,0 +1,530 @@
> +/*
> + * QEMU Enhanced Disk Format
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +static int bdrv_qed_probe(const uint8_t *buf, int buf_size,
> +                          const char *filename)
> +{
> +    const QEDHeader *header = (const void *)buf;

Too lazy to type the real type name? ;-)

> +
> +    if (buf_size < sizeof(*header)) {
> +        return 0;
> +    }
> +    if (le32_to_cpu(header->magic) != QED_MAGIC) {
> +        return 0;
> +    }
> +    return 100;
> +}
> +
> +static void qed_header_le_to_cpu(const QEDHeader *le, QEDHeader *cpu)
> +{
> +    cpu->magic = le32_to_cpu(le->magic);
> +    cpu->cluster_size = le32_to_cpu(le->cluster_size);
> +    cpu->table_size = le32_to_cpu(le->table_size);
> +    cpu->header_size = le32_to_cpu(le->header_size);
> +    cpu->features = le64_to_cpu(le->features);
> +    cpu->compat_features = le64_to_cpu(le->compat_features);
> +    cpu->l1_table_offset = le64_to_cpu(le->l1_table_offset);
> +    cpu->image_size = le64_to_cpu(le->image_size);
> +    cpu->backing_filename_offset = le32_to_cpu(le->backing_filename_offset);
> +    cpu->backing_filename_size = le32_to_cpu(le->backing_filename_size);
> +    cpu->backing_fmt_offset = le32_to_cpu(le->backing_fmt_offset);
> +    cpu->backing_fmt_size = le32_to_cpu(le->backing_fmt_size);
> +}
> +
> +static void qed_header_cpu_to_le(const QEDHeader *cpu, QEDHeader *le)
> +{
> +    le->magic = cpu_to_le32(cpu->magic);
> +    le->cluster_size = cpu_to_le32(cpu->cluster_size);
> +    le->table_size = cpu_to_le32(cpu->table_size);
> +    le->header_size = cpu_to_le32(cpu->header_size);
> +    le->features = cpu_to_le64(cpu->features);
> +    le->compat_features = cpu_to_le64(cpu->compat_features);
> +    le->l1_table_offset = cpu_to_le64(cpu->l1_table_offset);
> +    le->image_size = cpu_to_le64(cpu->image_size);
> +    le->backing_filename_offset = cpu_to_le32(cpu->backing_filename_offset);
> +    le->backing_filename_size = cpu_to_le32(cpu->backing_filename_size);
> +    le->backing_fmt_offset = cpu_to_le32(cpu->backing_fmt_offset);
> +    le->backing_fmt_size = cpu_to_le32(cpu->backing_fmt_size);
> +}
> +

/** Returns the maximum virtual disk size in bytes */

> +static uint64_t qed_max_image_size(uint32_t cluster_size, uint32_t table_size)
> +{
> +    uint64_t table_entries;
> +    uint64_t l2_size;
> +
> +    table_entries = (table_size * cluster_size) / sizeof(uint64_t);
> +    l2_size = table_entries * cluster_size;
> +
> +    return l2_size * table_entries;
> +}
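
A worked example could go into that comment as well.  The sketch below
restates the calculation under a hypothetical name (ex_max_image_size is
not in the patch) and widens the first multiplication to 64 bits, which
is my own addition:

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as qed_max_image_size() in the patch: one L1 table of
 * table_entries pointers, each to an L2 table of table_entries clusters. */
static uint64_t ex_max_image_size(uint32_t cluster_size, uint32_t table_size)
{
    uint64_t table_entries = ((uint64_t)table_size * cluster_size) /
                             sizeof(uint64_t);
    uint64_t l2_size = table_entries * cluster_size;

    return l2_size * table_entries;
}
```

With the defaults (64 KB clusters, 4-cluster tables) this gives
4 * 65536 / 8 = 32768 entries per table, 32768 * 64 KB = 2 GB covered per
L2 table, and 2 GB * 32768 = 64 TB maximum virtual disk size.
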
> +
> +static bool qed_is_cluster_size_valid(uint32_t cluster_size)
> +{
> +    if (cluster_size < QED_MIN_CLUSTER_SIZE ||
> +        cluster_size > QED_MAX_CLUSTER_SIZE) {
> +        return false;
> +    }
> +    if (cluster_size & (cluster_size - 1)) {
> +        return false; /* not power of 2 */
> +    }
> +    return true;
> +}
> +
> +static bool qed_is_table_size_valid(uint32_t table_size)
> +{
> +    if (table_size < QED_MIN_TABLE_SIZE ||
> +        table_size > QED_MAX_TABLE_SIZE) {
> +        return false;
> +    }
> +    if (table_size & (table_size - 1)) {
> +        return false; /* not power of 2 */
> +    }
> +    return true;
> +}
> +
> +static bool qed_is_image_size_valid(uint64_t image_size, uint32_t cluster_size,
> +                                    uint32_t table_size)
> +{
> +    if (image_size == 0) {
> +        /* Supporting zero size images makes life harder because even the L1
> +         * table is not needed.  Make life simple and forbid zero size images.
> +         */
> +        return false;
> +    }

Should the spec be updated to forbid zero size images?

> +    if (image_size & (cluster_size - 1)) {
> +        return false; /* not multiple of cluster size */
> +    }
> +    if (image_size > qed_max_image_size(cluster_size, table_size)) {
> +        return false; /* image is too large */
> +    }
> +    return true;
> +}
> +
> +/**
> + * Read a string of known length from the image file
> + *
> + * @file:       Image file
> + * @offset:     File offset to start of string, in bytes
> + * @n:          String length in bytes
> + * @buf:        Destination buffer
> + * @buflen:     Destination buffer length in bytes
> + *
> + * The string is NUL-terminated.

Return value?

> + */
> +static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
> +                           char *buf, size_t buflen)
> +{
> +    int ret;
> +    if (n >= buflen) {
> +        return -EINVAL;
> +    }
> +    ret = bdrv_pread(file, offset, buf, n);
> +    if (ret != n) {
> +        return ret;
> +    }
> +    buf[n] = '\0';
> +    return 0;
> +}
> +
> +static int bdrv_qed_open(BlockDriverState *bs, int flags)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +    QEDHeader le_header;
> +    int64_t file_size;
> +    int ret;
> +
> +    s->bs = bs;
> +
> +    ret = bdrv_pread(bs->file, 0, &le_header, sizeof(le_header));
> +    if (ret != sizeof(le_header)) {
> +        return ret;
> +    }

While this is correct because bdrv_pread never returns short reads, I
think "if (ret < 0)" would be easier to read.

> +    qed_header_le_to_cpu(&le_header, &s->header);
> +
> +    if (s->header.magic != QED_MAGIC) {
> +        return -ENOENT;
> +    }
> +    if (s->header.features & ~QED_FEATURE_MASK) {
> +        return -ENOTSUP; /* image uses unsupported feature bits */
> +    }
> +    if (!qed_is_cluster_size_valid(s->header.cluster_size)) {
> +        return -EINVAL;
> +    }
> +
> +    /* Round up file size to the next cluster */
> +    file_size = bdrv_getlength(bs->file);
> +    if (file_size < 0) {
> +        return file_size;
> +    }
> +    s->file_size = qed_start_of_cluster(s, file_size);

Aren't you rounding down despite the comment?
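
To illustrate the difference, assuming cluster_size is a power of two
(the ex_* helpers are hypothetical; ex_round_down uses the same mask as
qed_start_of_cluster() in this patch):

```c
#include <assert.h>
#include <stdint.h>

/* Round offset down to the start of its cluster. */
static uint64_t ex_round_down(uint64_t offset, uint64_t cluster_size)
{
    return offset & ~(cluster_size - 1);
}

/* Round offset up to the next cluster boundary; an aligned offset is
 * returned unchanged. */
static uint64_t ex_round_up(uint64_t offset, uint64_t cluster_size)
{
    return (offset + cluster_size - 1) & ~(cluster_size - 1);
}
```

For a 65537 byte file with 64 KB clusters, rounding down yields 65536
while rounding up yields 131072.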

> +
> +    if (!qed_is_table_size_valid(s->header.table_size)) {
> +        return -EINVAL;
> +    }
> +    if (!qed_is_image_size_valid(s->header.image_size,
> +                                 s->header.cluster_size,
> +                                 s->header.table_size)) {
> +        return -EINVAL;
> +    }
> +    if (!qed_check_table_offset(s, s->header.l1_table_offset)) {
> +        return -EINVAL;
> +    }
> +
> +    s->table_nelems = (s->header.cluster_size * s->header.table_size) /
> +                      sizeof(uint64_t);
> +    s->l2_shift = get_bits_from_size(s->header.cluster_size);
> +    s->l2_mask = s->table_nelems - 1;
> +    s->l1_shift = s->l2_shift + get_bits_from_size(s->l2_mask + 1);
> +
> +    if ((s->header.features & QED_F_BACKING_FILE)) {
> +        ret = qed_read_string(bs->file, s->header.backing_filename_offset,
> +                              s->header.backing_filename_size, bs->backing_file,
> +                              sizeof(bs->backing_file));
> +        if (ret < 0) {
> +            return ret;
> +        }
> +
> +        if ((s->header.compat_features & QED_CF_BACKING_FORMAT)) {
> +            ret = qed_read_string(bs->file, s->header.backing_fmt_offset,
> +                                  s->header.backing_fmt_size,
> +                                  bs->backing_format,
> +                                  sizeof(bs->backing_format));
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +    }
> +    return ret;

I think this should be a return 0 instead of returning a random
non-negative number.

> +}
> +
> +static void bdrv_qed_close(BlockDriverState *bs)
> +{
> +}
> +
> +static void bdrv_qed_flush(BlockDriverState *bs)
> +{
> +    bdrv_flush(bs->file);
> +}
> +
> +static int qed_create(const char *filename, uint32_t cluster_size,
> +                      uint64_t image_size, uint32_t table_size,
> +                      const char *backing_file, const char *backing_fmt)
> +{
> +    QEDHeader header = {
> +        .magic = QED_MAGIC,
> +        .cluster_size = cluster_size,
> +        .table_size = table_size,
> +        .header_size = 1,
> +        .features = 0,
> +        .compat_features = 0,
> +        .l1_table_offset = cluster_size,
> +        .image_size = image_size,
> +    };
> +    QEDHeader le_header;
> +    uint8_t *l1_table = NULL;
> +    size_t l1_size = header.cluster_size * header.table_size;
> +    int ret = 0;
> +    int fd;
> +
> +    fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644);

Please use block.c functions (bdrv_create_file etc.) instead of POSIX
functions.

> +    if (fd < 0) {
> +        return -errno;
> +    }
> +
> +    if (backing_file) {
> +        header.features |= QED_F_BACKING_FILE;
> +        header.backing_filename_offset = sizeof(le_header);
> +        header.backing_filename_size = strlen(backing_file);
> +        if (backing_fmt) {
> +            header.compat_features |= QED_CF_BACKING_FORMAT;
> +            header.backing_fmt_offset = header.backing_filename_offset +
> +                                        header.backing_filename_size;
> +            header.backing_fmt_size = strlen(backing_fmt);
> +        }
> +    }
> +
> +    qed_header_cpu_to_le(&header, &le_header);
> +    if (qemu_write_full(fd, &le_header, sizeof(le_header)) != sizeof(le_header)) {
> +        ret = -errno;
> +        goto out;
> +    }
> +    if (qemu_write_full(fd, backing_file, header.backing_filename_size) != header.backing_filename_size) {
> +        ret = -errno;
> +        goto out;
> +    }
> +    if (qemu_write_full(fd, backing_fmt, header.backing_fmt_size) != header.backing_fmt_size) {
> +        ret = -errno;
> +        goto out;
> +    }
> +
> +    l1_table = qemu_mallocz(l1_size);
> +    lseek(fd, header.l1_table_offset, SEEK_SET);
> +    if (qemu_write_full(fd, l1_table, l1_size) != l1_size) {
> +        ret = -errno;
> +        goto out;
> +    }
> +
> +out:
> +    qemu_free(l1_table);
> +    close(fd);
> +    return ret;
> +}
> +
> +static int bdrv_qed_create(const char *filename, QEMUOptionParameter *options)
> +{
> +    uint64_t image_size = 0;
> +    uint32_t cluster_size = QED_DEFAULT_CLUSTER_SIZE;
> +    uint32_t table_size = QED_DEFAULT_TABLE_SIZE;
> +    const char *backing_file = NULL;
> +    const char *backing_fmt = NULL;
> +
> +    while (options && options->name) {
> +        if (!strcmp(options->name, BLOCK_OPT_SIZE)) {
> +            image_size = options->value.n;
> +        } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FILE)) {
> +            backing_file = options->value.s;
> +        } else if (!strcmp(options->name, BLOCK_OPT_BACKING_FMT)) {
> +            backing_fmt = options->value.s;
> +        } else if (!strcmp(options->name, BLOCK_OPT_CLUSTER_SIZE)) {
> +            if (options->value.n) {
> +                cluster_size = options->value.n;
> +            }
> +        } else if (!strcmp(options->name, "table_size")) {
> +            if (options->value.n) {
> +                table_size = options->value.n;
> +            }
> +        }
> +        options++;
> +    }
> +
> +    if (!qed_is_cluster_size_valid(cluster_size)) {
> +        fprintf(stderr, "QED cluster size must be within range [%u, %u] and power of 2\n",
> +                QED_MIN_CLUSTER_SIZE, QED_MAX_CLUSTER_SIZE);
> +        return -EINVAL;
> +    }
> +    if (!qed_is_table_size_valid(table_size)) {
> +        fprintf(stderr, "QED table size must be within range [%u, %u] and power of 2\n",
> +                QED_MIN_TABLE_SIZE, QED_MAX_TABLE_SIZE);
> +        return -EINVAL;
> +    }
> +    if (!qed_is_image_size_valid(image_size, cluster_size, table_size)) {
> +        char buffer[64];
> +
> +        bytes_to_str(buffer, sizeof(buffer),
> +                     qed_max_image_size(cluster_size, table_size));
> +
> +        fprintf(stderr,
> +                "QED image size must be a non-zero multiple of cluster size and less than %s\n",
> +                buffer);
> +        return -EINVAL;
> +    }
> +
> +    return qed_create(filename, cluster_size, image_size, table_size,
> +                      backing_file, backing_fmt);
> +}
> +
> +static int bdrv_qed_is_allocated(BlockDriverState *bs, int64_t sector_num,
> +                                  int nb_sectors, int *pnum)
> +{
> +    return -ENOTSUP;
> +}
> +
> +static int bdrv_qed_make_empty(BlockDriverState *bs)
> +{
> +    return -ENOTSUP;
> +}
> +
> +static BlockDriverAIOCB *bdrv_qed_aio_readv(BlockDriverState *bs,
> +                                            int64_t sector_num,
> +                                            QEMUIOVector *qiov, int nb_sectors,
> +                                            BlockDriverCompletionFunc *cb,
> +                                            void *opaque)
> +{
> +    return NULL;
> +}
> +
> +static BlockDriverAIOCB *bdrv_qed_aio_writev(BlockDriverState *bs,
> +                                             int64_t sector_num,
> +                                             QEMUIOVector *qiov, int nb_sectors,
> +                                             BlockDriverCompletionFunc *cb,
> +                                             void *opaque)
> +{
> +    return NULL;
> +}
> +
> +static BlockDriverAIOCB *bdrv_qed_aio_flush(BlockDriverState *bs,
> +                                            BlockDriverCompletionFunc *cb,
> +                                            void *opaque)
> +{
> +    return bdrv_aio_flush(bs->file, cb, opaque);
> +}
> +
> +static int bdrv_qed_truncate(BlockDriverState *bs, int64_t offset)
> +{
> +    return -ENOTSUP;
> +}
> +
> +static int64_t bdrv_qed_getlength(BlockDriverState *bs)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +    return s->header.image_size;
> +}
> +
> +static int bdrv_qed_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +
> +    memset(bdi, 0, sizeof(*bdi));
> +    bdi->cluster_size = s->header.cluster_size;
> +    return 0;
> +}
> +
> +static int bdrv_qed_change_backing_file(BlockDriverState *bs,
> +                                        const char *backing_file,
> +                                        const char *backing_fmt)
> +{
> +    BDRVQEDState *s = bs->opaque;
> +    QEDHeader new_header, le_header;
> +    void *buffer;
> +    size_t buffer_len, backing_file_len, backing_fmt_len;
> +    int ret;
> +
> +    /* Refuse to set backing filename if unknown compat feature bits are
> +     * active.  If the image uses an unknown compat feature then we may not
> +     * know the layout of data following the header structure and cannot safely
> +     * add a new string.
> +     */
> +    if (backing_file && (s->header.compat_features &
> +                         ~QED_COMPAT_FEATURE_MASK)) {
> +        return -ENOTSUP;
> +    }
> +
> +    memcpy(&new_header, &s->header, sizeof(new_header));
> +
> +    new_header.features &= ~QED_F_BACKING_FILE;
> +    new_header.compat_features &= ~QED_CF_BACKING_FORMAT;
> +
> +    /* Adjust feature flags */
> +    if (backing_file) {
> +        new_header.features |= QED_F_BACKING_FILE;
> +        if (backing_fmt) {
> +            new_header.compat_features |= QED_CF_BACKING_FORMAT;
> +        }
> +    }
> +
> +    /* Calculate new header size */
> +    backing_file_len = backing_fmt_len = 0;
> +
> +    if (backing_file) {
> +        backing_file_len = strlen(backing_file);
> +        if (backing_fmt) {
> +            backing_fmt_len = strlen(backing_fmt);
> +        }
> +    }
> +
> +    buffer_len = sizeof(new_header);
> +    new_header.backing_filename_offset = buffer_len;
> +    new_header.backing_filename_size = backing_file_len;
> +    buffer_len += backing_file_len;
> +    new_header.backing_fmt_offset = buffer_len;
> +    new_header.backing_fmt_size = backing_fmt_len;
> +    buffer_len += backing_fmt_len;
> +
> +    /* Make sure we can rewrite header without failing */
> +    if (buffer_len > new_header.header_size * new_header.cluster_size) {
> +        return -ENOSPC;
> +    }
> +
> +    /* Prepare new header */
> +    buffer = qemu_malloc(buffer_len);
> +
> +    qed_header_cpu_to_le(&new_header, &le_header);
> +    memcpy(buffer, &le_header, sizeof(le_header));
> +    buffer_len = sizeof(le_header);
> +
> +    memcpy(buffer + buffer_len, backing_file, backing_file_len);
> +    buffer_len += backing_file_len;
> +
> +    memcpy(buffer + buffer_len, backing_fmt, backing_fmt_len);
> +    buffer_len += backing_fmt_len;
> +
> +    /* Write new header */
> +    ret = bdrv_pwrite_sync(bs->file, 0, buffer, buffer_len);
> +    qemu_free(buffer);
> +    if (ret == 0) {
> +        memcpy(&s->header, &new_header, sizeof(new_header));
> +    }
> +    return ret;
> +}
> +
> +static int bdrv_qed_check(BlockDriverState *bs, BdrvCheckResult *result)
> +{
> +    return -ENOTSUP;
> +}
> +
> +static QEMUOptionParameter qed_create_options[] = {
> +    {
> +        .name = BLOCK_OPT_SIZE,
> +        .type = OPT_SIZE,
> +        .help = "Virtual disk size (in bytes)"
> +    }, {
> +        .name = BLOCK_OPT_BACKING_FILE,
> +        .type = OPT_STRING,
> +        .help = "File name of a base image"
> +    }, {
> +        .name = BLOCK_OPT_BACKING_FMT,
> +        .type = OPT_STRING,
> +        .help = "Image format of the base image"
> +    }, {
> +        .name = BLOCK_OPT_CLUSTER_SIZE,
> +        .type = OPT_SIZE,
> +        .help = "Cluster size (in bytes)"
> +    }, {
> +        .name = "table_size",

What about introducing a constant for this?

> +        .type = OPT_SIZE,
> +        .help = "L1/L2 table size (in clusters)"
> +    },
> +    { /* end of list */ }
> +};
> +
> +static BlockDriver bdrv_qed = {
> +    .format_name = "qed",
> +    .instance_size = sizeof(BDRVQEDState),
> +    .create_options = qed_create_options,
> +
> +    .bdrv_probe = bdrv_qed_probe,
> +    .bdrv_open = bdrv_qed_open,
> +    .bdrv_close = bdrv_qed_close,
> +    .bdrv_create = bdrv_qed_create,
> +    .bdrv_flush = bdrv_qed_flush,
> +    .bdrv_is_allocated = bdrv_qed_is_allocated,
> +    .bdrv_make_empty = bdrv_qed_make_empty,
> +    .bdrv_aio_readv = bdrv_qed_aio_readv,
> +    .bdrv_aio_writev = bdrv_qed_aio_writev,
> +    .bdrv_aio_flush = bdrv_qed_aio_flush,
> +    .bdrv_truncate = bdrv_qed_truncate,
> +    .bdrv_getlength = bdrv_qed_getlength,
> +    .bdrv_get_info = bdrv_qed_get_info,
> +    .bdrv_change_backing_file = bdrv_qed_change_backing_file,
> +    .bdrv_check = bdrv_qed_check,

Please align the = of all definitions vertically.

> +};
> +
> +static void bdrv_qed_init(void)
> +{
> +    bdrv_register(&bdrv_qed);
> +}
> +
> +block_init(bdrv_qed_init);
> diff --git a/block/qed.h b/block/qed.h
> new file mode 100644
> index 0000000..7ce95a7
> --- /dev/null
> +++ b/block/qed.h
> @@ -0,0 +1,148 @@
> +/*
> + * QEMU Enhanced Disk Format
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#ifndef BLOCK_QED_H
> +#define BLOCK_QED_H
> +
> +#include "block_int.h"
> +
> +/* The layout of a QED file is as follows:
> + *
> + * +--------+----------+----------+----------+-----+
> + * | header | L1 table | cluster0 | cluster1 | ... |
> + * +--------+----------+----------+----------+-----+
> + *
> + * There is a 2-level pagetable for cluster allocation:
> + *
> + *                     +----------+
> + *                     | L1 table |
> + *                     +----------+
> + *                ,------'  |  '------.
> + *           +----------+   |    +----------+
> + *           | L2 table |  ...   | L2 table |
> + *           +----------+        +----------+
> + *       ,------'  |  '------.
> + *  +----------+   |    +----------+
> + *  |   Data   |  ...   |   Data   |
> + *  +----------+        +----------+
> + *
> + * The L1 table is fixed size and always present.  L2 tables are allocated on
> + * demand.  The L1 table size determines the maximum possible image size; it
> + * can be influenced using the cluster_size and table_size values.
> + *
> + * All fields are little-endian on disk.
> + */
> +
> +enum {
> +    QED_MAGIC = 'Q' | 'E' << 8 | 'D' << 16 | '\0' << 24,
> +
> +    /* The image supports a backing file */
> +    QED_F_BACKING_FILE = 0x01,
> +
> +    /* The image has the backing file format */
> +    QED_CF_BACKING_FORMAT = 0x01,
> +
> +    /* Feature bits must be used when the on-disk format changes */
> +    QED_FEATURE_MASK = QED_F_BACKING_FILE,            /* supported feature bits */
> +    QED_COMPAT_FEATURE_MASK = QED_CF_BACKING_FORMAT,  /* supported compat feature bits */
> +
> +    /* Data is stored in groups of sectors called clusters.  Cluster size must
> +     * be large to avoid keeping too much metadata.  I/O requests that have
> +     * sub-cluster size will require read-modify-write.
> +     */
> +    QED_MIN_CLUSTER_SIZE = 4 * 1024, /* in bytes */
> +    QED_MAX_CLUSTER_SIZE = 64 * 1024 * 1024,
> +    QED_DEFAULT_CLUSTER_SIZE = 64 * 1024,
> +
> +    /* Allocated clusters are tracked using a 2-level pagetable.  Table size is
> +     * a multiple of clusters so large maximum image sizes can be supported
> +     * without jacking up the cluster size too much.
> +     */
> +    QED_MIN_TABLE_SIZE = 1,        /* in clusters */
> +    QED_MAX_TABLE_SIZE = 16,
> +    QED_DEFAULT_TABLE_SIZE = 4,
> +};
> +
> +typedef struct {
> +    uint32_t magic;                 /* QED\0 */
> +
> +    uint32_t cluster_size;          /* in bytes */
> +    uint32_t table_size;            /* for L1 and L2 tables, in clusters */
> +    uint32_t header_size;           /* in clusters */
> +
> +    uint64_t features;              /* format feature bits */
> +    uint64_t compat_features;       /* compatible feature bits */
> +    uint64_t l1_table_offset;       /* in bytes */
> +    uint64_t image_size;            /* total logical image size, in bytes */
> +
> +    /* if (features & QED_F_BACKING_FILE) */
> +    uint32_t backing_filename_offset; /* in bytes from start of header */
> +    uint32_t backing_filename_size;   /* in bytes */
> +
> +    /* if (compat_features & QED_CF_BACKING_FORMAT) */
> +    uint32_t backing_fmt_offset;    /* in bytes from start of header */
> +    uint32_t backing_fmt_size;      /* in bytes */
> +} QEDHeader;
> +
> +typedef struct {
> +    BlockDriverState *bs;           /* device */
> +    uint64_t file_size;             /* length of image file, in bytes */
> +
> +    QEDHeader header;               /* always cpu-endian */
> +    uint32_t table_nelems;
> +    uint32_t l1_shift;
> +    uint32_t l2_shift;
> +    uint32_t l2_mask;
> +} BDRVQEDState;
> +
> +/**
> + * Utility functions

Sure that this is the right description for this function? ;-)

> + */
> +static inline uint64_t qed_start_of_cluster(BDRVQEDState *s, uint64_t offset)
> +{
> +    return offset & ~(uint64_t)(s->header.cluster_size - 1);
> +}
> +
> +/**
> + * Test if a cluster offset is valid
> + */
> +static inline bool qed_check_cluster_offset(BDRVQEDState *s, uint64_t offset)
> +{
> +    uint64_t header_size = (uint64_t)s->header.header_size *
> +                           s->header.cluster_size;
> +
> +    if (offset & (s->header.cluster_size - 1)) {
> +        return false;
> +    }
> +    return offset >= header_size && offset < s->file_size;
> +}
> +
> +/**
> + * Test if a table offset is valid
> + */
> +static inline bool qed_check_table_offset(BDRVQEDState *s, uint64_t offset)
> +{
> +    uint64_t end_offset = offset + (s->header.table_size - 1) *
> +                          s->header.cluster_size;
> +
> +    /* Overflow check */
> +    if (end_offset <= offset) {
> +        return false;
> +    }
> +
> +    return qed_check_cluster_offset(s, offset) &&
> +           qed_check_cluster_offset(s, end_offset);
> +}
> +
> +#endif /* BLOCK_QED_H */

Kevin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:02             ` Anthony Liguori
@ 2010-10-11 15:24               ` Avi Kivity
  2010-10-11 15:41                 ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 15:24 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

  On 10/11/2010 05:02 PM, Anthony Liguori wrote:
> On 10/11/2010 08:44 AM, Avi Kivity wrote:
>>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>>> >
>>> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
>>> >  freelist), but data corruption is not.
>>>
>>> The alternative is for the freelist to be a non-compat feature bit.
>>> That means older QEMU binaries cannot use a QED image that has enabled
>>> the freelist.
>>
>> For this one feature.  What about others?
>
> A compat feature is one where the feature can be completely ignored 
> (meaning that the QEMU does not have to understand the data format).
>
> An example of a compat feature is copy-on-read.  It's merely a 
> suggestion and there is no additional metadata.  If a QEMU doesn't 
> understand it, it doesn't affect its ability to read the image.
>
> An example of a non-compat feature would be zero cluster entries.  
> Zero cluster entries are a special L2 table entry that indicates that 
> a cluster's on-disk data is all zeros.  As long as there is at least 1 
> ZCE in the L2 tables, this feature bit must be set.  As soon as all of 
> the ZCE bits are cleared, the feature bit can be unset.
>
> An older QEMU will gracefully fail when presented with an image using 
> ZCE bits.  An image with no ZCEs will work on older QEMUs.
>

What's the motivation behind ZCE?

There is yet a third type of feature, one which is not strictly needed 
in order to use the image, but if used, must be kept synchronized.  An 
example is the freelist.  Another example is a directory index for a 
filesystem.  I can't think of another example which would be relevant to 
QED -- metadata checksums perhaps? -- we can always declare it a 
non-compatible feature, but of course, it reduces compatibility.



-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 13:58   ` Kevin Wolf
@ 2010-10-11 15:30     ` Stefan Hajnoczi
  2010-10-11 15:39       ` Avi Kivity
  2010-10-11 15:50       ` Kevin Wolf
  0 siblings, 2 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 15:30 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> > Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> > ---
> >  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 94 insertions(+), 0 deletions(-)
> >  create mode 100644 docs/specs/qed_spec.txt
> > 
> > diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
> > new file mode 100644
> > index 0000000..c942b8e
> > --- /dev/null
> > +++ b/docs/specs/qed_spec.txt
> > @@ -0,0 +1,94 @@
> > +=Specification=
> > +
> > +The file format looks like this:
> > +
> > + +----------+----------+----------+-----+
> > + | cluster0 | cluster1 | cluster2 | ... |
> > + +----------+----------+----------+-----+
> > +
> > +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
> > +
> > +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
> > +
> > +All fields are little-endian.
> > +
> > +==Header==
> > + Header {
> > +     uint32_t magic;               /* QED\0 */
> > + 
> > +     uint32_t cluster_size;        /* in bytes */
> > +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
> > +     uint32_t header_size;         /* in clusters */
> > + 
> > +     uint64_t features;            /* format feature bits */
> > +     uint64_t compat_features;     /* compat feature bits */
> > +     uint64_t l1_table_offset;     /* in bytes */
> > +     uint64_t image_size;          /* total logical image size, in bytes */
> > + 
> > +     /* if (features & QED_F_BACKING_FILE) */
> > +     uint32_t backing_filename_offset; /* in bytes from start of header */
> > +     uint32_t backing_filename_size;   /* in bytes */
> > + 
> > +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
> > +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
> > +     uint32_t backing_fmt_size;    /* in bytes */
> 
> It was discussed before, but I don't think we came to a conclusion. Are
> there any circumstances under which you don't want to set the
> QED_CF_BACKING_FORMAT flag?

I suggest the following:

QED_CF_BACKING_FORMAT_RAW = 0x1

When set, the backing file is a raw image and should not be probed for
its file format.  The default (unset) means that the backing image file
format may be probed.

Now the backing_fmt_{offset,size} are no longer necessary.

> 
> Also it's unclear what this "if" actually means: If the flag isn't set,
> are the fields zero, are they undefined or are they even completely
> missing and the offsets of the following fields must be adjusted?

I have updated the wiki:

"Fields predicated on a feature bit are only used when that feature is
set. The fields always take up header space, regardless of whether or
not the feature bit is set."
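
A minimal sketch of what "always take up header space" means for a reader, assuming the v2 field order quoted above and packed little-endian fields.  The names parse_header and backing_filename are made up for illustration and are not part of the patchset:

```python
import struct

# v2 header layout as quoted above: four uint32, four uint64, four uint32,
# all little-endian and packed.  The layout is fixed regardless of which
# feature bits are set.
HEADER_FMT = "<4I4Q4I"
HEADER_LEN = struct.calcsize(HEADER_FMT)          # 64 bytes
QED_MAGIC = int.from_bytes(b"QED\0", "little")
QED_F_BACKING_FILE = 0x01

FIELDS = ("magic", "cluster_size", "table_size", "header_size",
          "features", "compat_features", "l1_table_offset", "image_size",
          "backing_filename_offset", "backing_filename_size",
          "backing_fmt_offset", "backing_fmt_size")


def parse_header(buf):
    """Unpack every field; none are skipped when a feature bit is clear."""
    hdr = dict(zip(FIELDS, struct.unpack_from(HEADER_FMT, buf)))
    if hdr["magic"] != QED_MAGIC:
        raise ValueError("not a QED image")
    return hdr


def backing_filename(hdr, buf):
    """The offset/size pair is only interpreted when the feature bit is set."""
    if not hdr["features"] & QED_F_BACKING_FILE:
        return None
    off = hdr["backing_filename_offset"]
    return buf[off:off + hdr["backing_filename_size"]].decode()
```

The offsets of later fields never shift, so a reader can unpack the whole structure first and decide afterwards which fields are meaningful.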

> 
> > + }
> > +
> > +Field descriptions:
> > +* cluster_size must be a power of 2 in range [2^12, 2^26].
> > +* table_size must be a power of 2 in range [1, 16].
> 
> Is there a reason why this must be a power of two?

The power of two makes logical-to-cluster offset translation easy and
cheap:

l2_table = get_l2_table(l1_table[logical >> l1_shift])
cluster = l2_table[(logical >> l2_shift) & l2_mask] + (logical & cluster_mask)
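
For concreteness, here is a sketch of how the shifts and masks could be derived from the header fields when both sizes are powers of two.  It assumes the L1 index sits in the top bits, as in the "Structure of a logical offset" diagram; split_logical_offset is an illustrative name only:

```python
def split_logical_offset(logical, cluster_size, table_size):
    """Return (l1_index, l2_index, byte_offset) using only shifts and masks."""
    if cluster_size & (cluster_size - 1) or table_size & (table_size - 1):
        raise ValueError("cluster_size and table_size must be powers of two")

    entries = table_size * cluster_size // 8      # TABLE_NOFFSETS, 8-byte offsets
    cluster_bits = cluster_size.bit_length() - 1  # log2(cluster_size)
    table_bits = entries.bit_length() - 1         # log2(TABLE_NOFFSETS)

    l2_shift = cluster_bits
    l1_shift = cluster_bits + table_bits
    l2_mask = entries - 1
    cluster_mask = cluster_size - 1

    l1_index = logical >> l1_shift
    l2_index = (logical >> l2_shift) & l2_mask
    byte_offset = logical & cluster_mask
    return l1_index, l2_index, byte_offset
```

With cluster_size=64 KB and table_size=4 each table holds 32768 offsets, so cluster_bits is 16 and table_bits is 15.  Without the power-of-two restriction the same split would need a division and a modulo per level.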

> 
> > +* header_size is the number of clusters used by the header and any additional information stored before regular clusters.
> > +* features and compat_features are bitmaps where active file format features can be selectively enabled.  The difference between the two is that an image file that uses unknown compat_features bits can be safely opened without knowing how to interpret those bits.  If an image file has an unsupported features bit set then it is not possible to open that image (the image is not backwards-compatible).
> > +* l1_table_offset must be a multiple of cluster_size.
> 
> And it is the offset of the first byte of the L1 table in the image file.

Updated, thanks.

> 
> > +* image_size is the block device size seen by the guest and must be a multiple of cluster_size.
> 
> So there are image sizes that can't be accurately represented in QED? I
> think that's a bad idea. Even more so because I can't see how it greatly
> simplifies implementation (you save the operation for rounding up on
> open/create, that's it) - it looks like a completely arbitrary restriction.

Good point.  I will try to lift this restriction in v3.

> 
> > +* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.
> 
> A description of the meaning of these strings is missing.

Update:
"The backing filename string is given in the
backing_filename_{offset,size} fields and may be an absolute path or
relative to the image file."

> 
> > +
> > +Feature bits:
> > +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
> > +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
> > +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
> 
> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
> twice is confusing at first sight.

Updated, thanks.

> 
> > +
> > +==Tables==
> > +
> > +Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
> > +
> > + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
> > +  
> > + Table {
> > +     uint64_t offsets[TABLE_NOFFSETS];
> > + }
> > +
> > +The tables are organized as follows:
> > +
> > +                    +----------+
> > +                    | L1 table |
> > +                    +----------+
> > +               ,------'  |  '------.
> > +          +----------+   |    +----------+
> > +          | L2 table |  ...   | L2 table |
> > +          +----------+        +----------+
> > +      ,------'  |  '------.
> > + +----------+   |    +----------+
> > + |   Data   |  ...   |   Data   |
> > + +----------+        +----------+
> > +
> > +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
> > +
> > +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
> > + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
> > +
> > +Logical offsets are translated into cluster offsets as follows:
> > +
> > +  table_bits table_bits    cluster_bits
> > +  <--------> <--------> <--------------->
> > + +----------+----------+-----------------+
> > + | L1 index | L2 index |     byte offset |
> > + +----------+----------+-----------------+
> > + 
> > +       Structure of a logical offset
> > +
> > + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
> > +   l2_offset = l1_table[l1_index]
> > +   l2_table = load_table(l2_offset)
> > +   cluster_offset = l2_table[l2_index]
> > +   return cluster_offset + byte_offset
> 
> Should we reserve some bits in the table entries in case we need some
> flags later? Also, I suppose all table entries must be cluster aligned?

Yes, let's do that.  At least for sparse zero cluster tracking we need a
bit.  The minimum 4k cluster size gives us 12 bits to play with.
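
A rough sketch of how the low bits could be carved out, assuming all cluster offsets are cluster-aligned and the minimum cluster size stays at 4k.  The flag name and value below are hypothetical; nothing has been assigned yet:

```python
FLAG_BITS = 12                        # log2 of the minimum 4k cluster size
FLAG_MASK = (1 << FLAG_BITS) - 1
OFFSET_MASK = 0xFFFFFFFFFFFFFFFF & ~FLAG_MASK

QED_ENTRY_ZERO = 0x1                  # hypothetical "zero cluster" flag


def make_entry(offset, flags=0):
    """Combine a cluster-aligned file offset with flag bits."""
    if offset & FLAG_MASK:
        raise ValueError("cluster offset must be at least 4k-aligned")
    if flags & ~FLAG_MASK:
        raise ValueError("flags do not fit in the reserved bits")
    return offset | flags


def split_entry(entry):
    """Return (cluster_offset, flags) for a 64-bit table entry."""
    return entry & OFFSET_MASK, entry & FLAG_MASK
```

Because the bits are only free when offsets are aligned, this would make cluster alignment of table entries a hard requirement in the spec.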

> 
> What happened to the other sections that older versions of the spec
> contained? For example, this version doesn't specify any more what the
> semantics of unallocated clusters and backing files is.

I removed them because they don't describe the on-disk layout and were
more of a way to think through the implementation than a format
specification.  It was more a decision to focus my effort on improving
the on-disk layout specification than anything else.

Do you want the semantics in the specification, or is it okay to leave
that part on the wiki only?

Stefan

* Re: [Qemu-devel] Re: [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format
  2010-10-11 13:21 ` [Qemu-devel] Re: [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Kevin Wolf
@ 2010-10-11 15:37   ` Stefan Hajnoczi
  0 siblings, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 15:37 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	Avi Kivity

On Mon, Oct 11, 2010 at 2:21 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
>> This code is also available from git:
>>
>> http://repo.or.cz/w/qemu/stefanha.git/shortlog/refs/heads/qed
>
> This doesn't seem to be the same as the latest patches you posted to
> qemu-devel. Forgot to push?

Sorry for the hassle.  The v2 patchset is there now.

Stefan

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:30     ` Stefan Hajnoczi
@ 2010-10-11 15:39       ` Avi Kivity
  2010-10-11 15:46         ` Stefan Hajnoczi
  2010-10-11 15:50       ` Kevin Wolf
  1 sibling, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 15:39 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
> >
> >  It was discussed before, but I don't think we came to a conclusion. Are
> >  there any circumstances under which you don't want to set the
> >  QED_CF_BACKING_FORMAT flag?
>
> I suggest the following:
>
> QED_CF_BACKING_FORMAT_RAW = 0x1
>
> When set, the backing file is a raw image and should not be probed for
> its file format.  The default (unset) means that the backing image file
> format may be probed.
>
> Now the backing_fmt_{offset,size} are no longer necessary.

Should it not be an incompatible option?  If the backing disk starts 
with a format magic, it will be probed by an older qemu, incorrectly.

-- 
error compiling committee.c: too many arguments to function

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:24               ` Avi Kivity
@ 2010-10-11 15:41                 ` Anthony Liguori
  2010-10-11 15:47                   ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 15:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

On 10/11/2010 10:24 AM, Avi Kivity wrote:
>  On 10/11/2010 05:02 PM, Anthony Liguori wrote:
>> On 10/11/2010 08:44 AM, Avi Kivity wrote:
>>>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>>>> >
>>>> >  A leak is acceptable (it won't grow; it's just an unused, incorrect
>>>> >  freelist), but data corruption is not.
>>>>
>>>> The alternative is for the freelist to be a non-compat feature bit.
>>>> That means older QEMU binaries cannot use a QED image that has enabled
>>>> the freelist.
>>>
>>> For this one feature.  What about others?
>>
>> A compat feature is one where the feature can be completely ignored 
>> (meaning that the QEMU does not have to understand the data format).
>>
>> An example of a compat feature is copy-on-read.  It's merely a 
>> suggestion and there is no additional metadata.  If a QEMU doesn't 
>> understand it, it doesn't affect its ability to read the image.
>>
>> An example of a non-compat feature would be zero cluster entries.  
>> Zero cluster entries are a special L2 table entry that indicates that 
>> a cluster's on-disk data is all zeros.  As long as there is at least 
>> 1 ZCE in the L2 tables, this feature bit must be set.  As soon as all 
>> of the ZCE bits are cleared, the feature bit can be unset.
>>
>> An older QEMU will gracefully fail when presented with an image using 
>> ZCE bits.  An image with no ZCEs will work on older QEMUs.
>>
>
> What's the motivation behind ZCE?

It's very useful for Copy-on-Read.  If the cluster in the backing file 
is unallocated, then when you do a copy-on-read, you don't want to write 
out a zero cluster since you'd expand the image to its maximum size.

It's also useful for operations like compaction in the absence of TRIM.  
The common implementation on platforms like VMware is to open a file and 
write zeros to it until it fills up the filesystem.  You then delete the 
file.  The result is that any unallocated data on the disk is written as 
zero and combined with zero-detection in the image format, you can 
compact the image size by marking unallocated blocks as ZCE.

> There is yet a third type of feature, one which is not strictly needed 
> in order to use the image, but if used, must be kept synchronized.  An 
> example is the freelist.  Another example is a directory index for a 
> filesystem.  I can't think of another example which would be relevant 
> to QED -- metadata checksums perhaps? -- we can always declare it a 
> non-compatible feature, but of course, it reduces compatibility.

You're suggesting a feature that is not strictly needed, but that needs 
to be kept up to date.  If it can't be kept up to date, something needs 
to happen to remove it.  Let's call this a transient feature.

Most of the transient features can be removed given some bit of code.  
For instance, ZCE can be removed by writing out zero clusters or writing 
an unallocated L2 entry if there is no backing file.

I think we could add a qemu-img demote command or something like that 
that attempted to remove features when possible.  That doesn't give you 
instant compatibility but I'm doubtful that you can come up with a 
generic way to remove a feature from an image without knowing anything 
about the image.
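
As a rough illustration of the ZCE case only, assuming a ZCE is an L2 entry with a zero offset plus a (hypothetical) flag bit, and that the caller supplies a function that allocates and writes one zero-filled cluster:

```python
ZERO_CLUSTER = 0x1   # hypothetical encoding of a zero cluster entry


def demote_zero_entries(l2_table, has_backing, write_zero_cluster):
    """Rewrite every ZCE so the table no longer needs the ZCE feature bit.

    With a backing file the zeros must really be written out, otherwise
    reads would fall through to the backing file.  Without one, an
    unallocated entry (0) already reads back as zeros.
    Returns the number of entries that were rewritten.
    """
    rewritten = 0
    for i, entry in enumerate(l2_table):
        if entry == ZERO_CLUSTER:
            l2_table[i] = write_zero_cluster() if has_backing else 0
            rewritten += 1
    return rewritten
```

Once every L2 table has been processed this way, the feature bit can be cleared.  Other features would each need their own removal code.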

Regards,

Anthony Liguori

>
>
>

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:39       ` Avi Kivity
@ 2010-10-11 15:46         ` Stefan Hajnoczi
  2010-10-11 16:18           ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-11 15:46 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>  On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
> >>
> >>  It was discussed before, but I don't think we came to a conclusion. Are
> >>  there any circumstances under which you don't want to set the
> >>  QED_CF_BACKING_FORMAT flag?
> >
> >I suggest the following:
> >
> >QED_CF_BACKING_FORMAT_RAW = 0x1
> >
> >When set, the backing file is a raw image and should not be probed for
> >its file format.  The default (unset) means that the backing image file
> >format may be probed.
> >
> >Now the backing_fmt_{offset,size} are no longer necessary.
> 
> Should it not be an incompatible option?  If the backing disk starts
> with a format magic, it will be probed by an older qemu,
> incorrectly.

Agreed, it should be a non-compat feature bit.
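
To spell out the difference, here is a sketch of the open-time check an older reader would perform.  The 0x04 value for the raw-backing bit is purely hypothetical, as is the function name:

```python
QED_F_BACKING_FILE = 0x01
QED_F_NEED_CHECK = 0x02
KNOWN_FEATURES = QED_F_BACKING_FILE | QED_F_NEED_CHECK   # what an older reader knows

QED_F_BACKING_FORMAT_RAW = 0x04   # hypothetical value, unknown to the older reader


def check_features(features, compat_features, known=KNOWN_FEATURES, known_compat=0):
    """Refuse unknown non-compat bits; silently ignore unknown compat bits."""
    unknown = features & ~known
    if unknown:
        raise ValueError("unsupported feature bits: 0x%x" % unknown)
    # Only compat bits the reader understands have any effect.
    return compat_features & known_compat
```

As a compat bit the flag is dropped on the floor and the older reader goes on to probe the raw backing file.  As a non-compat bit the open fails instead, which is the safe outcome.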

Stefan

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:41                 ` Anthony Liguori
@ 2010-10-11 15:47                   ` Avi Kivity
  0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 15:47 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

  On 10/11/2010 05:41 PM, Anthony Liguori wrote:
> On 10/11/2010 10:24 AM, Avi Kivity wrote:
>>  On 10/11/2010 05:02 PM, Anthony Liguori wrote:
>>> On 10/11/2010 08:44 AM, Avi Kivity wrote:
>>>>  On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
>>>>> >
>>>>> >  A leak is acceptable (it won't grow; it's just an unused, 
>>>>> incorrect
>>>>> >  freelist), but data corruption is not.
>>>>>
>>>>> The alternative is for the freelist to be a non-compat feature bit.
>>>>> That means older QEMU binaries cannot use a QED image that has 
>>>>> enabled
>>>>> the freelist.
>>>>
>>>> For this one feature.  What about others?
>>>
>>> A compat feature is one where the feature can be completely ignored 
>>> (meaning that the QEMU does not have to understand the data format).
>>>
>>> An example of a compat feature is copy-on-read.  It's merely a 
>>> suggestion and there is no additional metadata.  If a QEMU doesn't 
>>> understand it, it doesn't affect its ability to read the image.
>>>
>>> An example of a non-compat feature would be zero cluster entries.  
>>> Zero cluster entries are a special L2 table entry that indicates 
>>> that a cluster's on-disk data is all zeros.  As long as there is at 
>>> least 1 ZCE in the L2 tables, this feature bit must be set.  As soon 
>>> as all of the ZCE bits are cleared, the feature bit can be unset.
>>>
>>> An older QEMU will gracefully fail when presented with an image 
>>> using ZCE bits.  An image with no ZCEs will work on older QEMUs.
>>>
>>
>> What's the motivation behind ZCE?
>
> It's very useful for Copy-on-Read.  If the cluster in the backing file 
> is unallocated, then when you do a copy-on-read, you don't want to 
> write out a zero cluster since you'd expand the image to its maximum 
> size.
>
> It's also useful for operations like compaction in the absence of 
> TRIM.  The common implementation on platforms like VMware is to open a 
> file and write zeros to it until it fills up the filesystem.  You then 
> delete the file.  The result is that any unallocated data on the disk 
> is written as zero and combined with zero-detection in the image 
> format, you can compact the image size by marking unallocated blocks 
> as ZCE.

Both make sense.  The latter is also useful with TRIM: if you have a 
backing image it's better to implement TRIM with ZCE rather than 
exposing the cluster from the backing file; it saves you a COW when you 
later reallocate the cluster.

>
>> There is yet a third type of feature, one which is not strictly 
>> needed in order to use the image, but if used, must be kept 
>> synchronized.  An example is the freelist.  Another example is a 
>> directory index for a filesystem.  I can't think of another example 
>> which would be relevant to QED -- metadata checksums perhaps? -- we 
>> can always declare it a non-compatible feature, but of course, it 
>> reduces compatibility.
>
> You're suggesting a feature that is not strictly needed, but that 
> needs to be kept up to date.  If it can't be kept up to date, 
> something needs to happen to remove it.  Let's call this a transient 
> feature.
>
> Most of the transient features can be removed given some bit of code.  
> For instance, ZCE can be removed by writing out zero clusters or 
> writing an unallocated L2 entry if there is no backing file.
>
> I think we could add a qemu-img demote command or something like that 
> that attempted to remove features when possible.  That doesn't give 
> you instant compatibility but I'm doubtful that you can come up with a 
> generic way to remove a feature from an image without knowing anything 
> about the image.
>

That should work, and in the worst case there is qemu-img convert (which 
should be taught about format options).

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 14:58           ` Avi Kivity
@ 2010-10-11 15:49             ` Anthony Liguori
  2010-10-11 16:02               ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 15:49 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

On 10/11/2010 09:58 AM, Avi Kivity wrote:
>> A leak is unacceptable.  It means an image can grow to an unbounded 
>> size.  If you are a server provider offering multitenancy, then a 
>> malicious guest can potentially grow the image beyond its allotted 
>> size causing a Denial of Service attack against another tenant.
>
>
> This particular leak cannot grow, and is not controlled by the guest.

As the image gets moved from hypervisor to hypervisor, it can keep 
growing if given a chance to fill up the disk, then trim it all away.

In a mixed hypervisor environment, it just becomes a numbers game.

>> A freelist has to be a non-optional feature.  When the freelist bit 
>> is set, an older QEMU cannot read the image.  If the freelist is 
>> completely used, the freelist bit can be cleared and the image is then 
>> usable by older QEMUs.
>
> Once we support TRIM (or detect zeros) we'll never have a clean freelist.

Zero detection doesn't add to the free list.

A potential solution here is to treat TRIM a little differently than 
we've been discussing.

When TRIM happens, don't immediately write an unallocated cluster entry 
for the L2.  Leave the L2 entry intact.  Don't actually write a UCE to 
the L2 until you actually allocate the block.

This implies a cost because you'll need to do metadata syncs to make 
this work.  However, that eliminates leakage.

Regards,

Anthony Liguori

* [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:30     ` Stefan Hajnoczi
  2010-10-11 15:39       ` Avi Kivity
@ 2010-10-11 15:50       ` Kevin Wolf
  1 sibling, 0 replies; 72+ messages in thread
From: Kevin Wolf @ 2010-10-11 15:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 11.10.2010 17:30, schrieb Stefan Hajnoczi:
> On Mon, Oct 11, 2010 at 03:58:07PM +0200, Kevin Wolf wrote:
>> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
>>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>>> ---
>>>  docs/specs/qed_spec.txt |   94 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  1 files changed, 94 insertions(+), 0 deletions(-)
>>>  create mode 100644 docs/specs/qed_spec.txt
>>>
>>> diff --git a/docs/specs/qed_spec.txt b/docs/specs/qed_spec.txt
>>> new file mode 100644
>>> index 0000000..c942b8e
>>> --- /dev/null
>>> +++ b/docs/specs/qed_spec.txt
>>> @@ -0,0 +1,94 @@
>>> +=Specification=
>>> +
>>> +The file format looks like this:
>>> +
>>> + +----------+----------+----------+-----+
>>> + | cluster0 | cluster1 | cluster2 | ... |
>>> + +----------+----------+----------+-----+
>>> +
>>> +The first cluster begins with the '''header'''.  The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file.  A regular cluster may be a '''data cluster''', an '''L2''', or an '''L1 table'''.  L1 and L2 tables are composed of one or more contiguous clusters.
>>> +
>>> +Normally the file size will be a multiple of the cluster size.  If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written.  Legitimate extra information should use space between the header and the first regular cluster.
>>> +
>>> +All fields are little-endian.
>>> +
>>> +==Header==
>>> + Header {
>>> +     uint32_t magic;               /* QED\0 */
>>> + 
>>> +     uint32_t cluster_size;        /* in bytes */
>>> +     uint32_t table_size;          /* for L1 and L2 tables, in clusters */
>>> +     uint32_t header_size;         /* in clusters */
>>> + 
>>> +     uint64_t features;            /* format feature bits */
>>> +     uint64_t compat_features;     /* compat feature bits */
>>> +     uint64_t l1_table_offset;     /* in bytes */
>>> +     uint64_t image_size;          /* total logical image size, in bytes */
>>> + 
>>> +     /* if (features & QED_F_BACKING_FILE) */
>>> +     uint32_t backing_filename_offset; /* in bytes from start of header */
>>> +     uint32_t backing_filename_size;   /* in bytes */
>>> + 
>>> +     /* if (compat_features & QED_CF_BACKING_FORMAT) */
>>> +     uint32_t backing_fmt_offset;  /* in bytes from start of header */
>>> +     uint32_t backing_fmt_size;    /* in bytes */
>>
>> It was discussed before, but I don't think we came to a conclusion. Are
>> there any circumstances under which you don't want to set the
>> QED_CF_BACKING_FORMAT flag?
> 
> I suggest the following:
> 
> QED_CF_BACKING_FORMAT_RAW = 0x1
> 
> When set, the backing file is a raw image and should not be probed for
> its file format.  The default (unset) means that the backing image file
> format may be probed.

And it must be set for raw image formats because we'll want to avoid
considering raw for the probing.

Works for me.

>>> + }
>>> +
>>> +Field descriptions:
>>> +* cluster_size must be a power of 2 in range [2^12, 2^26].
>>> +* table_size must be a power of 2 in range [1, 16].
>>
>> Is there a reason why this must be a power of two?
> 
> The power of two makes logical-to-cluster offset translation easy and
> cheap:
> 
> l2_table = get_l2_table(l1_table[logical >> l1_shift])
> cluster = l2_table[(logical >> l2_shift) & l2_mask] + (logical & cluster_mask)

Right, good point.

>>> +* image_size is the block device size seen by the guest and must be a multiple of cluster_size.
>>
>> So there are image sizes that can't be accurately represented in QED? I
>> think that's a bad idea. Even more so because I can't see how it greatly
>> simplifies implementation (you save the operation for rounding up on
>> open/create, that's it) - it looks like a completely arbitrary restriction.
> 
> Good point.  I will try to lift this restriction in v3.

Thanks.

>>> +* backing_filename and backing_fmt are both strings in (byte offset, byte size) form.  They are not NUL-terminated and do not have alignment constraints.
>>
>> A description of the meaning of these strings is missing.
> 
> Update:
> "The backing filename string is given in the
> backing_filename_{offset,size} fields and may be an absolute path or
> relative to the image file."

Looks good.

>>
>>> +
>>> +Feature bits:
>>> +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
>>> +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.
>>> +* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format stored.
>>
>> I suggest adding a headline "Compatibility Feature Bits". Seeing 0x01
>> twice is confusing at first sight.
> 
> Updated, thanks.
> 
>>
>>> +
>>> +==Tables==
>>> +
>>> +Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
>>> +
>>> + #define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
>>> +  
>>> + Table {
>>> +     uint64_t offsets[TABLE_NOFFSETS];
>>> + }
>>> +
>>> +The tables are organized as follows:
>>> +
>>> +                    +----------+
>>> +                    | L1 table |
>>> +                    +----------+
>>> +               ,------'  |  '------.
>>> +          +----------+   |    +----------+
>>> +          | L2 table |  ...   | L2 table |
>>> +          +----------+        +----------+
>>> +      ,------'  |  '------.
>>> + +----------+   |    +----------+
>>> + |   Data   |  ...   |   Data   |
>>> + +----------+        +----------+
>>> +
>>> +A table is made up of one or more contiguous clusters.  The table_size header field determines table size for an image file.  For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
>>> +
>>> +The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
>>> + header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
>>> +
>>> +Logical offsets are translated into cluster offsets as follows:
>>> +
>>> +  table_bits table_bits    cluster_bits
>>> +  <--------> <--------> <--------------->
>>> + +----------+----------+-----------------+
>>> + | L1 index | L2 index |     byte offset |
>>> + +----------+----------+-----------------+
>>> + 
>>> +       Structure of a logical offset
>>> +
>>> + def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
>>> +   l2_offset = l1_table[l1_index]
>>> +   l2_table = load_table(l2_offset)
>>> +   cluster_offset = l2_table[l2_index]
>>> +   return cluster_offset + byte_offset
>>
>> Should we reserve some bits in the table entries in case we need some
>> flags later? Also, I suppose all table entries must be cluster aligned?
> 
> Yes, let's do that.  At least for sparse zero cluster tracking we need a
> bit.  The minimum 4k cluster size gives us 12 bits to play with.

Yes, 12 bits should be enough for a while.

>> What happened to the other sections that older versions of the spec
>> contained? For example, this version doesn't specify any more what the
>> semantics of unallocated clusters and backing files is.
> 
> I removed them because they don't describe the on-disk layout and were
> more of a way to think through the implementation than a format
> specification.  It was more a decision to focus my effort on improving
> the on-disk layout specification than anything else.
> 
> Do you want the semantics in the specification, or is it okay to leave
> that part on the wiki only?

I think, at least we need to describe what an unallocated table entry
looks like (it has a zero cluster offset - but may it have flags?) and
what this means for the virtual disk content (read from the backing
file, zero if there is no backing file). This definitely belongs in the
specification.
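
In code form, the read semantics I have in mind look roughly like this.  It is only a sketch: it assumes an unallocated entry is exactly zero (no flags), reads whole clusters, and pads with zeros when the backing file is shorter than the image, which the spec does not say yet:

```python
import io


def read_cluster(entry, image, backing, logical_offset, cluster_size):
    """Read one cluster of guest-visible data.

    entry           L2 table value for the cluster at logical_offset
    image, backing  seekable binary file objects; backing may be None
    """
    if entry != 0:
        # Allocated: the data lives in the image file at the entry's offset.
        image.seek(entry)
        return image.read(cluster_size)
    if backing is not None:
        # Unallocated with a backing file: read through at the same
        # logical offset, padding a short read with zeros (assumption).
        backing.seek(logical_offset)
        return backing.read(cluster_size).ljust(cluster_size, b"\0")
    # Unallocated and no backing file: the guest sees zeros.
    return bytes(cluster_size)
```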

Other parts that describe qemu's implementation or best practices can be
left on the wiki only.

Kevin

* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:49             ` Anthony Liguori
@ 2010-10-11 16:02               ` Avi Kivity
  2010-10-11 16:10                 ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-11 16:02 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

  On 10/11/2010 05:49 PM, Anthony Liguori wrote:
> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>> A leak is unacceptable.  It means an image can grow to an unbounded 
>>> size.  If you are a server provider offering multitenancy, then a 
>>> malicious guest can potentially grow the image beyond its allotted 
>>> size causing a Denial of Service attack against another tenant.
>>
>>
>> This particular leak cannot grow, and is not controlled by the guest.
>
> As the image gets moved from hypervisor to hypervisor, it can keep 
> growing if given a chance to fill up the disk, then trim it all away.
>
> In a mixed hypervisor environment, it just becomes a numbers game.

I don't see how it can grow.  Both the freelist and the clusters it 
points to consume space, which becomes a leak once you move it to a 
hypervisor that doesn't understand the freelist.  The older hypervisor 
then allocates new blocks.  As soon as it performs a metadata scan (if 
ever), the freelist is reclaimed.

You could only get a growing leak if you moved it to a hypervisor that 
doesn't perform metadata scans, but then that is independent of the 
freelist.

>
>>> A freelist has to be a non-optional feature.  When the freelist bit 
>>> is set, an older QEMU cannot read the image.  If the freelist is 
>>> completely used, the freelist bit can be cleared and the image is
>>> then usable by older QEMUs.
>>
>> Once we support TRIM (or detect zeros) we'll never have a clean 
>> freelist.
>
> Zero detection doesn't add to the free list.

Why not?  If a cluster is zero filled, you may drop it (assuming no 
backing image).

>
> A potential solution here is to treat TRIM a little differently than 
> we've been discussing.
>
> When TRIM happens, don't immediately write an unallocated cluster 
> entry for the L2.  Leave the L2 entry intact.  Don't actually write a
> UCE to the L2 until you actually allocate the block.
>
> This implies a cost because you'll need to do metadata syncs to make 
> this work.  However, that eliminates leakage.

The information is lost on shutdown; and you can have a large number of 
unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user 
expecting a visit from RIAA).

A slight twist on your proposal is to have an allocated-but-may-drop bit 
in a L2 entry.  TRIM or zero detection sets the bit (leaving the cluster 
number intact).  A following write to the cluster needs to clear the 
bit; if we reallocate the cluster we need to replace it with a ZCE.

This makes the freelist all L2 entries with the bit set; it may be less 
efficient than a custom data structure though.
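
A minimal sketch of the bit idea, to make the three transitions concrete.
The flag position (bit 0), the ZCE encoding and all helper names are
assumptions for illustration; none of them come from the spec or the
patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encodings: cluster offsets are cluster-aligned, so the low
 * bits of an L2 entry are assumed to be available for flags. */
#define L2_MAY_DROP     ((uint64_t)1 << 0)  /* allocated-but-may-drop */
#define L2_ZCE          ((uint64_t)1 << 1)  /* assumed zero cluster entry */
#define L2_OFFSET_MASK  (~(L2_MAY_DROP | L2_ZCE))

/* TRIM or zero detection: keep the cluster number, set the bit. */
static uint64_t l2_trim(uint64_t entry)
{
    return (entry & L2_OFFSET_MASK) ? (entry | L2_MAY_DROP) : entry;
}

/* Guest write to the cluster: the data is live again, clear the bit. */
static uint64_t l2_write(uint64_t entry)
{
    return entry & ~L2_MAY_DROP;
}

static bool l2_may_drop(uint64_t entry)
{
    return (entry & L2_MAY_DROP) != 0;
}

/* Reallocation: hand the cluster to a new owner and leave a ZCE behind. */
static uint64_t l2_reclaim(uint64_t *entry)
{
    uint64_t cluster;

    assert(l2_may_drop(*entry));
    cluster = *entry & L2_OFFSET_MASK;
    *entry = L2_ZCE;
    return cluster;
}
```

With this layout the "freelist" is simply every L2 entry for which
l2_may_drop() is true, which is why finding a free cluster may need a
table walk.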

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 16:02               ` Avi Kivity
@ 2010-10-11 16:10                 ` Anthony Liguori
  2010-10-12 10:25                   ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 16:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

On 10/11/2010 11:02 AM, Avi Kivity wrote:
>  On 10/11/2010 05:49 PM, Anthony Liguori wrote:
>> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>>> A leak is unacceptable.  It means an image can grow to an unbounded 
>>>> size.  If you are a server provider offering multitenancy, then a 
>>>> malicious guest can potentially grow the image beyond its allotted
>>>> size causing a Denial of Service attack against another tenant.
>>>
>>>
>>> This particular leak cannot grow, and is not controlled by the guest.
>>
>> As the image gets moved from hypervisor to hypervisor, it can keep 
>> growing if given a chance to fill up the disk, then trim it all away.
>>
>> In a mixed hypervisor environment, it just becomes a numbers game.
>
> I don't see how it can grow.  Both the freelist and the clusters it 
> points to consume space, which becomes a leak once you move it to a 
> hypervisor that doesn't understand the freelist.  The older hypervisor 
> then allocates new blocks.  As soon as it performs a metadata scan (if 
> ever), the freelist is reclaimed.

Assume you don't ever do a metadata scan (which is really our design point).

If you move to a hypervisor that doesn't support it, then move to a 
hypervisor that does, you create a brand new freelist and start leaking 
more space.  This isn't a contrived scenario if you have a cloud 
environment with a mix of hosts.

You might not be able to get a ping-pong every time you provision, but 
with enough effort, you could create serious problems.

It's really an issue of correctness.  Making correctness trade-offs for 
the purpose of compatibility is a policy decision and not something we 
should bake into an image format.  If a tool feels strongly that it's a 
reasonable trade off to make, it can always fudge the feature bits itself.

>>
>>>> A freelist has to be a non-optional feature.  When the freelist bit 
>>>> is set, an older QEMU cannot read the image.  If the freelist is 
>>>> completely used, the freelist bit can be cleared and the image is
>>>> then usable by older QEMUs.
>>>
>>> Once we support TRIM (or detect zeros) we'll never have a clean 
>>> freelist.
>>
>> Zero detection doesn't add to the free list.
>
> Why not?  If a cluster is zero filled, you may drop it (assuming no 
> backing image).

Sorry, I was thinking about the case of copy-on-read.  When you 
transition from UCE -> ZCE, nothing gets added to the free list.  But if 
you go from allocated -> ZCE, then you would add to the free list.
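
Roughly, in code. This is a sketch only: the ZCE encoding and the function
name are made up for illustration, and a UCE is assumed to be an entry with
offset zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define L2_UCE ((uint64_t)0)  /* unallocated cluster entry (offset zero) */
#define L2_ZCE ((uint64_t)1)  /* assumed encoding of a zero cluster entry */

/*
 * Turn an L2 entry into a ZCE.  Returns true and stores the old cluster
 * offset in *freed if an allocated cluster became reusable.
 */
static bool l2_make_zero(uint64_t *entry, uint64_t *freed)
{
    uint64_t old = *entry;

    assert(freed != NULL);
    *entry = L2_ZCE;
    if (old == L2_UCE || old == L2_ZCE) {
        return false;   /* UCE -> ZCE: nothing to put on the free list */
    }
    *freed = old;       /* allocated -> ZCE: cluster goes on the free list */
    return true;
}
```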

>>
>> A potential solution here is to treat TRIM a little differently than 
>> we've been discussing.
>>
>> When TRIM happens, don't immediately write an unallocated cluster 
>> entry for the L2.  Leave the L2 entry intact.  Don't actually write
>> a UCE to the L2 until you actually allocate the block.
>>
>> This implies a cost because you'll need to do metadata syncs to make 
>> this work.  However, that eliminates leakage.
>
> The information is lost on shutdown; and you can have a large number 
> of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a 
> user expecting a visit from RIAA).
>
> A slight twist on your proposal is to have an allocated-but-may-drop 
> bit in a L2 entry.  TRIM or zero detection sets the bit (leaving the 
> cluster number intact).  A following write to the cluster needs to 
> clear the bit; if we reallocate the cluster we need to replace it with 
> a ZCE.

Yeah, this is sort of what I was thinking.  You would still want a free 
list but it becomes totally optional because if it's lost, no data is 
leaked (assuming that the older version understands the bit).

I was suggesting that we store that bit in the free list though because 
that lets us support having older QEMUs with absolutely no knowledge
still work.

> This makes the freelist all L2 entries with the bit set; it may be 
> less efficient than a custom data structure though.

We still want the freelist to avoid recreating it.  We also want to 
store the allocated-but-may-drop bit in the free list.

Regards,

Anthony Liguori


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 15:46         ` Stefan Hajnoczi
@ 2010-10-11 16:18           ` Anthony Liguori
  2010-10-11 17:14             ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 16:18 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Christoph Hellwig, Avi Kivity, qemu-devel

On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>    
>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
>>      
>>>>   It was discussed before, but I don't think we came to a conclusion. Are
>>>>   there any circumstances under which you don't want to set the
>>>>   QED_CF_BACKING_FORMAT flag?
>>>>          
>>> I suggest the following:
>>>
>>> QED_CF_BACKING_FORMAT_RAW = 0x1
>>>
>>> When set, the backing file is a raw image and should not be probed for
>>> its file format.  The default (unset) means that the backing image file
>>> format may be probed.
>>>
>>> Now the backing_fmt_{offset,size} are no longer necessary.
>>>        
>> Should it not be an incompatible option?  If the backing disk starts
>> with a format magic, it will be probed by an older qemu,
>> incorrectly.
>>      
> Agreed, it should be a non-compat feature bit.
>    

If it's just raw or not raw, then I agree it should be non-compat.

I think we just need a feature bit then that indicates that the backing 
file is non-probeable which certainly simplifies the implementation.

QED_F_BACKING_FORMAT_NOPROBE maybe?
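
Something along these lines, as a sketch.  The bit value, the mask name and
the helper names are assumptions; the point is only that an implementation
which does not know a non-compat bit refuses the image instead of probing
the backing file.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define QED_F_BACKING_FORMAT_NOPROBE 0x1ULL  /* hypothetical bit value */
#define QED_KNOWN_FEATURES (QED_F_BACKING_FORMAT_NOPROBE)

/* Refuse to open an image that uses non-compat features we don't know. */
static int qed_check_features(uint64_t features, uint64_t known)
{
    return (features & ~known) ? -ENOTSUP : 0;
}

/* Only probe the backing file when the image does not forbid it. */
static bool qed_may_probe_backing(uint64_t features)
{
    assert(qed_check_features(features, QED_KNOWN_FEATURES) == 0);
    return !(features & QED_F_BACKING_FORMAT_NOPROBE);
}
```

An older qemu would call qed_check_features() with known == 0 and fail the
open, which is the behaviour Avi asked for.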

Regards,

Anthony Liguori

> Stefan
>
>    


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 16:18           ` Anthony Liguori
@ 2010-10-11 17:14             ` Anthony Liguori
  2010-10-12  8:07               ` Kevin Wolf
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-11 17:14 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Christoph Hellwig, Avi Kivity, qemu-devel

On 10/11/2010 11:18 AM, Anthony Liguori wrote:
> On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
>> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
>>>>>   It was discussed before, but I don't think we came to a 
>>>>> conclusion. Are
>>>>>   there any circumstances under which you don't want to set the
>>>>>   QED_CF_BACKING_FORMAT flag?
>>>> I suggest the following:
>>>>
>>>> QED_CF_BACKING_FORMAT_RAW = 0x1
>>>>
>>>> When set, the backing file is a raw image and should not be probed for
>>>> its file format.  The default (unset) means that the backing image 
>>>> file
>>>> format may be probed.
>>>>
>>>> Now the backing_fmt_{offset,size} are no longer necessary.
>>> Should it not be an incompatible option?  If the backing disk starts
>>> with a format magic, it will be probed by an older qemu,
>>> incorrectly.
>> Agreed, it should be a non-compat feature bit.
>
> If it's just raw or not raw, then I agree it should be non-compat.
>
> I think we just need a feature bit then that indicates that the 
> backing file is non-probeable which certainly simplifies the 
> implementation.
>
> QED_F_BACKING_FORMAT_NOPROBE maybe?

Er, thinking more, this is still a good idea but we still need 
QED_CF_BACKING_FORMAT because we specifically need to know when a 
protocol is specified.  Otherwise, we have no way of doing nbd as a 
backing file.

Regards,

Anthony Liguori

> Regards,
>
> Anthony Liguori
>
>> Stefan
>>
>


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 17:14             ` Anthony Liguori
@ 2010-10-12  8:07               ` Kevin Wolf
  2010-10-12 13:16                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-12  8:07 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: qemu-devel, Christoph Hellwig, Stefan Hajnoczi, Avi Kivity

Am 11.10.2010 19:14, schrieb Anthony Liguori:
> On 10/11/2010 11:18 AM, Anthony Liguori wrote:
>> On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
>>> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
>>>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
>>>>>>   It was discussed before, but I don't think we came to a 
>>>>>> conclusion. Are
>>>>>>   there any circumstances under which you don't want to set the
>>>>>>   QED_CF_BACKING_FORMAT flag?
>>>>> I suggest the following:
>>>>>
>>>>> QED_CF_BACKING_FORMAT_RAW = 0x1
>>>>>
>>>>> When set, the backing file is a raw image and should not be probed for
>>>>> its file format.  The default (unset) means that the backing image 
>>>>> file
>>>>> format may be probed.
>>>>>
>>>>> Now the backing_fmt_{offset,size} are no longer necessary.
>>>> Should it not be an incompatible option?  If the backing disk starts
>>>> with a format magic, it will be probed by an older qemu,
>>>> incorrectly.
>>> Agreed, it should be a non-compat feature bit.
>>
>> If it's just raw or not raw, then I agree it should be non-compat.
>>
>> I think we just need a feature bit then that indicates that the 
>> backing file is non-probeable which certainly simplifies the 
>> implementation.
>>
>> QED_F_BACKING_FORMAT_NOPROBE maybe?
> 
> Er, thinking more, this is still a good idea but we still need 
> QED_CF_BACKING_FORMAT because we specifically need to know when a 
> protocol is specified.  Otherwise, we have no way of doing nbd as a 
> backing file.

Well, the protocol is currently encoded in the file name, separated by a
colon. Of course, we want to get rid of that, but we still don't know
what we want instead. It's completely unrelated to the backing file
format, though, it's about the format of the backing file name.
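
For illustration, the current convention amounts to something like the
following.  This is a simplified sketch with a made-up helper name, not the
actual parser in qemu, and it ignores corner cases such as plain file names
that happen to contain a colon.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the protocol prefix of a backing file name (the part before the
 * first colon) into proto.  Returns false if there is no usable prefix.
 */
static bool split_protocol(const char *filename, char *proto, size_t proto_len)
{
    const char *colon = strchr(filename, ':');
    size_t n;

    assert(proto != NULL);
    if (!colon) {
        return false;               /* plain file name, no protocol */
    }
    n = (size_t)(colon - filename);
    if (n == 0 || n >= proto_len) {
        return false;               /* empty prefix or buffer too small */
    }
    memcpy(proto, filename, n);
    proto[n] = '\0';
    return true;
}
```

Note that nothing here says anything about the image format of whatever the
name refers to; that is a separate question.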

Kevin


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-11 16:10                 ` Anthony Liguori
@ 2010-10-12 10:25                   ` Avi Kivity
  0 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-12 10:25 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

  On 10/11/2010 06:10 PM, Anthony Liguori wrote:
> On 10/11/2010 11:02 AM, Avi Kivity wrote:
>>  On 10/11/2010 05:49 PM, Anthony Liguori wrote:
>>> On 10/11/2010 09:58 AM, Avi Kivity wrote:
>>>>> A leak is unacceptable.  It means an image can grow to an 
>>>>> unbounded size.  If you are a server provider offering 
>>>>> multitenancy, then a malicious guest can potentially grow the 
>>>>> image beyond its allotted size causing a Denial of Service attack
>>>>> against another tenant.
>>>>
>>>>
>>>> This particular leak cannot grow, and is not controlled by the guest.
>>>
>>> As the image gets moved from hypervisor to hypervisor, it can keep 
>>> growing if given a chance to fill up the disk, then trim it all away.
>>>
>>> In a mixed hypervisor environment, it just becomes a numbers game.
>>
>> I don't see how it can grow.  Both the freelist and the clusters it 
>> points to consume space, which becomes a leak once you move it to a 
>> hypervisor that doesn't understand the freelist.  The older 
>> hypervisor then allocates new blocks.  As soon as it performs a 
>> metadata scan (if ever), the freelist is reclaimed.
>
> Assume you don't ever do a metadata scan (which is really our design 
> point).

What about crashes?

>
> If you move to a hypervisor that doesn't support it, then move to a 
> hypervisor that does, you create a brand new freelist and start 
> leaking more space.  This isn't a contrived scenario if you have a 
> cloud environment with a mix of hosts.

It's only a leak if you don't do a metadata scan.

>
> You might not be able to get a ping-pong every time you provision, but 
> with enough effort, you could create serious problems.
>
> It's really an issue of correctness.  Making correctness trade-offs 
> for the purpose of compatibility is a policy decision and not 
> something we should bake into an image format.  If a tool feels 
> strongly that it's a reasonable trade off to make, it can always fudge 
> the feature bits itself.

I think the effort here is reasonable, clearing a bit on startup is not 
that complicated.

>>>
>>> A potential solution here is to treat TRIM a little differently than 
>>> we've been discussing.
>>>
>>> When TRIM happens, don't immediately write an unallocated cluster 
>>> entry for the L2.  Leave the L2 entry intact.  Don't actually write
>>> a UCE to the L2 until you actually allocate the block.
>>>
>>> This implies a cost because you'll need to do metadata syncs to make 
>>> this work.  However, that eliminates leakage.
>>
>> The information is lost on shutdown; and you can have a large number 
>> of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a 
>> user expecting a visit from RIAA).
>>
>> A slight twist on your proposal is to have an allocated-but-may-drop 
>> bit in a L2 entry.  TRIM or zero detection sets the bit (leaving the 
>> cluster number intact).  A following write to the cluster needs to 
>> clear the bit; if we reallocate the cluster we need to replace it 
>> with a ZCE.
>
> Yeah, this is sort of what I was thinking.  You would still want a 
> free list but it becomes totally optional because if it's lost, no 
> data is leaked (assuming that the older version understands the bit).
>
> I was suggesting that we store that bit in the free list though 
> because that lets us support having older QEMUs with absolutely no
> knowledge still work.

It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu 
would think it's still free.

The autoclear bit solves it nicely - the old qemu automatically drops 
the allocated-but-may-drop bits, undoing any TRIMs (which is 
unfortunate) but preserving consistency.
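
A sketch of how that could work.  The bit name and the helpers are
hypothetical; the assumed rule is that an implementation clears every
autoclear bit it does not understand when it opens the image for writing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QED_AF_MAY_DROP_VALID 0x1ULL  /* hypothetical autoclear bit */

/* Opening read-write: keep only the autoclear bits this version knows. */
static uint64_t qed_open_autoclear(uint64_t autoclear, uint64_t known)
{
    return autoclear & known;
}

/* A newer qemu trusts the may-drop bits only if the flag survived. */
static bool qed_may_drop_bits_valid(uint64_t autoclear)
{
    return (autoclear & QED_AF_MAY_DROP_VALID) != 0;
}
```

If the flag is gone, the newer qemu has to treat every may-drop bit in the
L2 tables as stale, i.e. the clusters are simply allocated again.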



-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-12  8:07               ` Kevin Wolf
@ 2010-10-12 13:16                 ` Stefan Hajnoczi
  2010-10-12 13:32                   ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-12 13:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Christoph Hellwig, Avi Kivity, qemu-devel

On Tue, Oct 12, 2010 at 10:07:53AM +0200, Kevin Wolf wrote:
> Am 11.10.2010 19:14, schrieb Anthony Liguori:
> > On 10/11/2010 11:18 AM, Anthony Liguori wrote:
> >> On 10/11/2010 10:46 AM, Stefan Hajnoczi wrote:
> >>> On Mon, Oct 11, 2010 at 05:39:01PM +0200, Avi Kivity wrote:
> >>>>   On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:
> >>>>>>   It was discussed before, but I don't think we came to a 
> >>>>>> conclusion. Are
> >>>>>>   there any circumstances under which you don't want to set the
> >>>>>>   QED_CF_BACKING_FORMAT flag?
> >>>>> I suggest the following:
> >>>>>
> >>>>> QED_CF_BACKING_FORMAT_RAW = 0x1
> >>>>>
> >>>>> When set, the backing file is a raw image and should not be probed for
> >>>>> its file format.  The default (unset) means that the backing image 
> >>>>> file
> >>>>> format may be probed.
> >>>>>
> >>>>> Now the backing_fmt_{offset,size} are no longer necessary.
> >>>> Should it not be an incompatible option?  If the backing disk starts
> >>>> with a format magic, it will be probed by an older qemu,
> >>>> incorrectly.
> >>> Agreed, it should be a non-compat feature bit.
> >>
> >> If it's just raw or not raw, then I agree it should be non-compat.
> >>
> >> I think we just need a feature bit then that indicates that the 
> >> backing file is non-probeable which certainly simplifies the 
> >> implementation.
> >>
> >> QED_F_BACKING_FORMAT_NOPROBE maybe?
> > 
> > Er, thinking more, this is still a good idea but we still need 
> > QED_CF_BACKING_FORMAT because we specifically need to know when a 
> > protocol is specified.  Otherwise, we have no way of doing nbd as a 
> > backing file.
> 
> Well, the protocol is currently encoded in the file name, separated by a
> colon. Of course, we want to get rid of that, but we still don't know
> what we want instead. It's completely unrelated to the backing file
> format, though, it's about the format of the backing file name.

I agree with Kevin.  There's no need to have the ill-defined backing
format AFAICT.

Stefan


* Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification
  2010-10-12 13:16                 ` Stefan Hajnoczi
@ 2010-10-12 13:32                   ` Anthony Liguori
  0 siblings, 0 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-12 13:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Christoph Hellwig, Avi Kivity, qemu-devel

On 10/12/2010 08:16 AM, Stefan Hajnoczi wrote:
>
>> Well, the protocol is currently encoded in the file name, separated by a
>> colon. Of course, we want to get rid of that, but we still don't know
>> what we want instead. It's completely unrelated to the backing file
>> format, though, it's about the format of the backing file name.
>>      
> I agree with Kevin.  There's no need to have the ill-defined backing
> format AFAICT.
>    

Yeah, I've now convinced myself we don't need the backing format name either.

Regards,

Anthony Liguori

> Stefan
>    


* [Qemu-devel] Re: [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions Stefan Hajnoczi
@ 2010-10-12 14:44   ` Kevin Wolf
  2010-10-13 13:41     ` Stefan Hajnoczi
  0 siblings, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-12 14:44 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> This patch adds code to look up data cluster offsets in the image via
> the L1/L2 tables.  The L2 tables are writethrough cached in memory for
> performance (each read/write requires a lookup so it is essential to
> cache the tables).
> 
> With cluster lookup code in place it is possible to implement
> bdrv_is_allocated() to query the number of contiguous
> allocated/unallocated clusters.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  Makefile.objs        |    2 +-
>  block/qed-cluster.c  |  145 +++++++++++++++++++++++
>  block/qed-gencb.c    |   32 +++++
>  block/qed-l2-cache.c |  132 +++++++++++++++++++++
>  block/qed-table.c    |  316 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/qed.c          |   57 +++++++++-
>  block/qed.h          |  108 +++++++++++++++++
>  trace-events         |    6 +
>  8 files changed, 796 insertions(+), 2 deletions(-)
>  create mode 100644 block/qed-cluster.c
>  create mode 100644 block/qed-gencb.c
>  create mode 100644 block/qed-l2-cache.c
>  create mode 100644 block/qed-table.c
> 
> diff --git a/Makefile.objs b/Makefile.objs
> index ff15795..7b3b19c 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -14,7 +14,7 @@ block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
>  
>  block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o
>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o
> -block-nested-y += qed.o
> +block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>  block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>  block-nested-$(CONFIG_POSIX) += raw-posix.o
> diff --git a/block/qed-cluster.c b/block/qed-cluster.c
> new file mode 100644
> index 0000000..af65e5a
> --- /dev/null
> +++ b/block/qed-cluster.c
> @@ -0,0 +1,145 @@
> +/*
> + * QEMU Enhanced Disk Format Cluster functions
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.

Hm, just noticed it here: COPYING is the text of the GPL, not LGPL. The
same comment applies to all other QED files, too.

> + *
> + */
> +
> +#include "qed.h"
> +
> +/**
> + * Count the number of contiguous data clusters
> + *
> + * @s:              QED state
> + * @table:          L2 table
> + * @index:          First cluster index
> + * @n:              Maximum number of clusters
> + * @offset:         Set to first cluster offset
> + *
> + * This function scans tables for contiguous allocated or free clusters.
> + */
> +static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
> +                                                  QEDTable *table,
> +                                                  unsigned int index,
> +                                                  unsigned int n,
> +                                                  uint64_t *offset)
> +{
> +    unsigned int end = MIN(index + n, s->table_nelems);
> +    uint64_t last = table->offsets[index];
> +    unsigned int i;
> +
> +    *offset = last;
> +
> +    for (i = index + 1; i < end; i++) {
> +        if (last == 0) {
> +            /* Counting free clusters */
> +            if (table->offsets[i] != 0) {
> +                break;
> +            }
> +        } else {
> +            /* Counting allocated clusters */
> +            if (table->offsets[i] != last + s->header.cluster_size) {
> +                break;
> +            }
> +            last = table->offsets[i];
> +        }
> +    }
> +    return i - index;
> +}
> +
> +typedef struct {
> +    BDRVQEDState *s;
> +    uint64_t pos;
> +    size_t len;
> +
> +    QEDRequest *request;
> +
> +    /* User callback */
> +    QEDFindClusterFunc *cb;
> +    void *opaque;
> +} QEDFindClusterCB;
> +
> +static void qed_find_cluster_cb(void *opaque, int ret)
> +{
> +    QEDFindClusterCB *find_cluster_cb = opaque;
> +    BDRVQEDState *s = find_cluster_cb->s;
> +    QEDRequest *request = find_cluster_cb->request;
> +    uint64_t offset = 0;
> +    size_t len = 0;
> +    unsigned int index;
> +    unsigned int n;
> +
> +    if (ret) {
> +        ret = QED_CLUSTER_ERROR;

Can ret be anything else here? If so, why would we return a more generic
error value instead of passing down the original one?

[Okay, after having read more code, this is the place where we throw
errno away. We shouldn't do that.]

I also wonder, if reading from the disk failed, is the errno value lost?
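
One possible convention, as a sketch: negative values are -errno from the
I/O layer and are passed through unchanged, non-negative values are the
lookup result.  The enum values and the helper name are assumptions; qed.h
may number them differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed values; only the "non-negative" property matters here. */
enum {
    QED_CLUSTER_FOUND,  /* data cluster is allocated */
    QED_CLUSTER_L2,     /* L2 table present, cluster unallocated */
    QED_CLUSTER_L1,     /* L2 table itself unallocated */
};

/* Map the result of the L2 table read to the value given to the callback. */
static int qed_find_cluster_result(int io_ret, uint64_t offset)
{
    if (io_ret < 0) {
        return io_ret;  /* keep the original -errno */
    }
    return offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
}
```

QED_CLUSTER_ERROR would then no longer be needed as a catch-all.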

> +        goto out;
> +    }
> +
> +    index = qed_l2_index(s, find_cluster_cb->pos);
> +    n = qed_bytes_to_clusters(s,
> +                              qed_offset_into_cluster(s, find_cluster_cb->pos) +
> +                              find_cluster_cb->len);
> +    n = qed_count_contiguous_clusters(s, request->l2_table->table,
> +                                      index, n, &offset);
> +
> +    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
> +    len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
> +              qed_offset_into_cluster(s, find_cluster_cb->pos));
> +
> +    if (offset && !qed_check_cluster_offset(s, offset)) {
> +        ret = QED_CLUSTER_ERROR;
> +        goto out;
> +    }
> +
> +out:
> +    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
> +    qemu_free(find_cluster_cb);
> +}
> +
> +/**
> + * Find the offset of a data cluster
> + *
> + * @s:          QED state
> + * @pos:        Byte position in device
> + * @len:        Number of bytes
> + * @cb:         Completion function
> + * @opaque:     User data for completion function
> + */

If we add header comments (which I think we should), we shouldn't do
them only pro forma, but try to make them actually useful, i.e. describe
all inputs and outputs.

I'm reading this code for the first time and all these callbacks are
really confusing. What I know is that in the state that I pass (s
and request) _something_ changes and that the cb is called with _some_
parameters whose meaning I don't know.

So a good first step would be adding a description of the arguments to cb.
At least in qed_read_l2_table, which actually does directly change the
state, we should additionally state that it returns the new L2 table in
request->l2_table. Things like this are not obvious if you didn't write
the code.
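
For example, something like this for the callback type.  The parameter
meanings are a reading of qed_find_cluster_cb() above and may be wrong, and
ClusterResult/record_cluster() are made-up names that only show the shape
of a caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/**
 * Completion function for qed_find_cluster()
 *
 * @opaque:  User data passed to qed_find_cluster()
 * @ret:     QED_CLUSTER_FOUND, QED_CLUSTER_L2, QED_CLUSTER_L1 or
 *           QED_CLUSTER_ERROR
 * @offset:  Offset of the first data cluster in the image file; only
 *           meaningful when @ret is QED_CLUSTER_FOUND
 * @len:     Number of contiguous bytes the result applies to, never more
 *           than requested and never crossing an L2 table boundary
 */
typedef void QEDFindClusterFunc(void *opaque, int ret,
                                uint64_t offset, size_t len);

/* Hypothetical caller state, filled in by the callback below. */
typedef struct {
    int ret;
    uint64_t offset;
    size_t len;
} ClusterResult;

static void record_cluster(void *opaque, int ret, uint64_t offset, size_t len)
{
    ClusterResult *res = opaque;

    assert(res != NULL);
    res->ret = ret;
    res->offset = offset;
    res->len = len;
}
```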

> +void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
> +                      size_t len, QEDFindClusterFunc *cb, void *opaque)
> +{
> +    QEDFindClusterCB *find_cluster_cb;
> +    uint64_t l2_offset;
> +
> +    /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
> +     * so that a request acts on one L2 table at a time.
> +     */
> +    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
> +
> +    l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
> +    if (!l2_offset) {
> +        cb(opaque, QED_CLUSTER_L1, 0, len);
> +        return;
> +    }
> +    if (!qed_check_table_offset(s, l2_offset)) {
> +        cb(opaque, QED_CLUSTER_ERROR, 0, 0);
> +        return;
> +    }
> +
> +    find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb));
> +    find_cluster_cb->s = s;
> +    find_cluster_cb->pos = pos;
> +    find_cluster_cb->len = len;
> +    find_cluster_cb->cb = cb;
> +    find_cluster_cb->opaque = opaque;
> +    find_cluster_cb->request = request;
> +
> +    qed_read_l2_table(s, request, l2_offset,
> +                      qed_find_cluster_cb, find_cluster_cb);
> +}
> diff --git a/block/qed-gencb.c b/block/qed-gencb.c
> new file mode 100644
> index 0000000..d389e12
> --- /dev/null
> +++ b/block/qed-gencb.c
> @@ -0,0 +1,32 @@
> +/*
> + * QEMU Enhanced Disk Format
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    GenericCB *gencb = qemu_malloc(len);
> +    gencb->cb = cb;
> +    gencb->opaque = opaque;
> +    return gencb;
> +}
> +
> +void gencb_complete(void *opaque, int ret)
> +{
> +    GenericCB *gencb = opaque;
> +    BlockDriverCompletionFunc *cb = gencb->cb;
> +    void *user_opaque = gencb->opaque;
> +
> +    qemu_free(gencb);
> +    cb(user_opaque, ret);
> +}
> diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c
> new file mode 100644
> index 0000000..3b2bf6e
> --- /dev/null
> +++ b/block/qed-l2-cache.c
> @@ -0,0 +1,132 @@
> +/*
> + * QEMU Enhanced Disk Format L2 Cache
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "qed.h"
> +
> +/* Each L2 holds 2GB so this lets us fully cache a 100GB disk */
> +#define MAX_L2_CACHE_SIZE 50
> +
> +/**
> + * Initialize the L2 cache
> + */
> +void qed_init_l2_cache(L2TableCache *l2_cache,
> +                       L2TableAllocFunc *alloc_l2_table,

What is this function pointer meant for? So far I can only see one call
to qed_init_l2_cache(), so I guess this indirection is just in
preparation for some future extension? Maybe add a comment?

> +                       void *alloc_l2_table_opaque)
> +{
> +    QTAILQ_INIT(&l2_cache->entries);
> +    l2_cache->n_entries = 0;
> +    l2_cache->alloc_l2_table = alloc_l2_table;
> +    l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque;
> +}
> +
> +/**
> + * Free the L2 cache
> + */
> +void qed_free_l2_cache(L2TableCache *l2_cache)
> +{
> +    CachedL2Table *entry, *next_entry;
> +
> +    QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) {
> +        qemu_vfree(entry->table);
> +        qemu_free(entry);
> +    }
> +}
> +
> +/**
> + * Allocate an uninitialized entry from the cache
> + *
> + * The returned entry has a reference count of 1 and is owned by the caller.
> + */
> +CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache)
> +{
> +    CachedL2Table *entry;
> +
> +    entry = qemu_mallocz(sizeof(*entry));
> +    entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque);
> +    entry->ref++;
> +
> +    return entry;
> +}

Hm, what references are counted by ref? Do you have more than one L2
cache and an entry can be referenced by multiple of them?

> +
> +/**
> + * Decrease an entry's reference count and free if necessary when the reference
> + * count drops to zero.
> + */
> +void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry)
> +{
> +    if (!entry) {
> +        return;
> +    }
> +
> +    entry->ref--;
> +    if (entry->ref == 0) {
> +        qemu_vfree(entry->table);
> +        qemu_free(entry);
> +    }
> +}

The l2_cache argument looks unused. Do we need it?

> +
> +/**
> + * Find an entry in the L2 cache.  This may return NULL and it's up to the
> + * caller to satisfy the cache miss.
> + *
> + * For a cached entry, this function increases the reference count and returns
> + * the entry.
> + */
> +CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset)
> +{
> +    CachedL2Table *entry;
> +
> +    QTAILQ_FOREACH(entry, &l2_cache->entries, node) {
> +        if (entry->offset == offset) {
> +            entry->ref++;
> +            return entry;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/**
> + * Commit an L2 cache entry into the cache.  This is meant to be used as part of
> + * the process to satisfy a cache miss.  A caller would allocate an entry which
> + * is not actually in the L2 cache and then once the entry was valid and
> + * present on disk, the entry can be committed into the cache.
> + *
> + * Since the cache is write-through, it's important that this function is not
> + * called until the entry is present on disk and the L1 has been updated to
> + * point to the entry.
> + *
> + * N.B. This function steals a reference to the l2_table from the caller so the
> + * caller must obtain a new reference by issuing a call to
> + * qed_find_l2_cache_entry().
> + */
> +void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table)
> +{
> +    CachedL2Table *entry;
> +
> +    entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset);
> +    if (entry) {
> +        qed_unref_l2_cache_entry(l2_cache, entry);

Maybe the qed_find_l2_cache_entry semantics aren't really right if we
need to decrease the refcount here just because that function just
increased it and we don't actually want that?

> +        qed_unref_l2_cache_entry(l2_cache, l2_table);
> +        return;
> +    }
> +
> +    if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) {
> +        entry = QTAILQ_FIRST(&l2_cache->entries);
> +        QTAILQ_REMOVE(&l2_cache->entries, entry, node);
> +        l2_cache->n_entries--;
> +        qed_unref_l2_cache_entry(l2_cache, entry);
> +    }
> +
> +    l2_cache->n_entries++;
> +    QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node);
> +}

Okay, so the table has the right refcount because we steal a refcount
from the caller, and if you don't reuse this, we explicitly unref it. Am
I the only one to find this interface confusing?
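
For comparison, here is a minimal sketch of an interface in which commit
takes its own reference instead of stealing the caller's, and a lookup
without a refcount bump is available. All names (ToyEntry, ToyCache,
toy_*) are hypothetical and this is not what the patch does; eviction is
left out to keep it short.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins for CachedL2Table / L2TableCache; not the QED structures. */
typedef struct ToyEntry {
    uint64_t offset;
    int ref;
    struct ToyEntry *next;
} ToyEntry;

typedef struct {
    ToyEntry *head;
} ToyCache;

/* New entry: one reference, owned by the caller, not yet in the cache. */
static ToyEntry *toy_alloc(uint64_t offset)
{
    ToyEntry *e = calloc(1, sizeof(*e));
    if (!e) {
        abort();
    }
    e->offset = offset;
    e->ref = 1;
    return e;
}

/* Drop one reference; free the entry when the last one goes away. */
static void toy_unref(ToyEntry *e)
{
    if (e && --e->ref == 0) {
        free(e);
    }
}

/* Lookup that does not touch the refcount. */
static ToyEntry *toy_peek(ToyCache *c, uint64_t offset)
{
    for (ToyEntry *e = c->head; e; e = e->next) {
        if (e->offset == offset) {
            return e;
        }
    }
    return NULL;
}

/* Lookup that hands a new reference to the caller. */
static ToyEntry *toy_find(ToyCache *c, uint64_t offset)
{
    ToyEntry *e = toy_peek(c, offset);
    if (e) {
        e->ref++;
    }
    return e;
}

/* Commit: the cache takes its own reference, the caller keeps theirs. */
static void toy_commit(ToyCache *c, ToyEntry *e)
{
    if (toy_peek(c, e->offset)) {
        return;             /* already cached, nothing to do */
    }
    e->ref++;
    e->next = c->head;
    c->head = e;
}
```

With this shape the caller can keep using its entry after the commit and
does not need a second find call to get a reference back.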

> diff --git a/block/qed-table.c b/block/qed-table.c
> new file mode 100644
> index 0000000..ba6faf0
> --- /dev/null
> +++ b/block/qed-table.c
> @@ -0,0 +1,316 @@
> +/*
> + * QEMU Enhanced Disk Format Table I/O
> + *
> + * Copyright IBM, Corp. 2010
> + *
> + * Authors:
> + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> + *  Anthony Liguori   <aliguori@us.ibm.com>
> + *
> + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "trace.h"
> +#include "qemu_socket.h" /* for EINPROGRESS on Windows */
> +#include "qed.h"
> +
> +typedef struct {
> +    GenericCB gencb;
> +    BDRVQEDState *s;
> +    QEDTable *table;
> +
> +    struct iovec iov;
> +    QEMUIOVector qiov;
> +} QEDReadTableCB;
> +
> +static void qed_read_table_cb(void *opaque, int ret)
> +{
> +    QEDReadTableCB *read_table_cb = opaque;
> +    QEDTable *table = read_table_cb->table;
> +    int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t);
> +    int i;
> +
> +    /* Handle I/O error */
> +    if (ret) {
> +        goto out;
> +    }
> +
> +    /* Byteswap offsets */
> +    for (i = 0; i < noffsets; i++) {
> +        table->offsets[i] = le64_to_cpu(table->offsets[i]);
> +    }
> +
> +out:
> +    /* Completion */
> +    trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
> +    gencb_complete(&read_table_cb->gencb, ret);
> +}
> +
> +static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
> +                           BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb),
> +                                                cb, opaque);
> +    QEMUIOVector *qiov = &read_table_cb->qiov;
> +    BlockDriverAIOCB *aiocb;
> +
> +    trace_qed_read_table(s, offset, table);
> +
> +    read_table_cb->s = s;
> +    read_table_cb->table = table;
> +    read_table_cb->iov.iov_base = table->offsets;
> +    read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size;
> +
> +    qemu_iovec_init_external(qiov, &read_table_cb->iov, 1);
> +    aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov,
> +                           read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
> +                           qed_read_table_cb, read_table_cb);
> +    if (!aiocb) {
> +        qed_read_table_cb(read_table_cb, -EIO);
> +    }
> +}
> +
> +typedef struct {
> +    GenericCB gencb;
> +    BDRVQEDState *s;
> +    QEDTable *orig_table;
> +    QEDTable *table;
> +    bool flush;             /* flush after write? */
> +
> +    struct iovec iov;
> +    QEMUIOVector qiov;
> +} QEDWriteTableCB;
> +
> +static void qed_write_table_cb(void *opaque, int ret)
> +{
> +    QEDWriteTableCB *write_table_cb = opaque;
> +
> +    trace_qed_write_table_cb(write_table_cb->s,
> +                              write_table_cb->orig_table, ret);
> +
> +    if (ret) {
> +        goto out;
> +    }
> +
> +    if (write_table_cb->flush) {
> +        /* We still need to flush first */
> +        write_table_cb->flush = false;
> +        bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
> +                       write_table_cb);
> +        return;
> +    }
> +
> +out:
> +    qemu_vfree(write_table_cb->table);
> +    gencb_complete(&write_table_cb->gencb, ret);
> +    return;
> +}
> +
> +/**
> + * Write out an updated part or all of a table
> + *
> + * @s:          QED state
> + * @offset:     Offset of table in image file, in bytes
> + * @table:      Table
> + * @index:      Index of first element
> + * @n:          Number of elements
> + * @flush:      Whether or not to sync to disk
> + * @cb:         Completion function
> + * @opaque:     Argument for completion function
> + */
> +static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
> +                            unsigned int index, unsigned int n, bool flush,
> +                            BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    QEDWriteTableCB *write_table_cb;
> +    BlockDriverAIOCB *aiocb;
> +    unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
> +    unsigned int start, end, i;
> +    size_t len_bytes;
> +
> +    trace_qed_write_table(s, offset, table, index, n);
> +
> +    /* Calculate indices of the first and one after last elements */
> +    start = index & ~sector_mask;
> +    end = (index + n + sector_mask) & ~sector_mask;
> +
> +    len_bytes = (end - start) * sizeof(uint64_t);
> +
> +    write_table_cb = gencb_alloc(sizeof(*write_table_cb), cb, opaque);
> +    write_table_cb->s = s;
> +    write_table_cb->orig_table = table;
> +    write_table_cb->flush = flush;
> +    write_table_cb->table = qemu_blockalign(s->bs, len_bytes);
> +    write_table_cb->iov.iov_base = write_table_cb->table->offsets;
> +    write_table_cb->iov.iov_len = len_bytes;
> +    qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1);
> +
> +    /* Byteswap table */
> +    for (i = start; i < end; i++) {
> +        uint64_t le_offset = cpu_to_le64(table->offsets[i]);
> +        write_table_cb->table->offsets[i - start] = le_offset;
> +    }
> +
> +    /* Adjust for offset into table */
> +    offset += start * sizeof(uint64_t);
> +
> +    aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
> +                            &write_table_cb->qiov,
> +                            write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
> +                            qed_write_table_cb, write_table_cb);
> +    if (!aiocb) {
> +        qed_write_table_cb(write_table_cb, -EIO);
> +    }
> +}
> +
> +/**
> + * Propagate return value from async callback
> + */
> +static void qed_sync_cb(void *opaque, int ret)
> +{
> +    *(int *)opaque = ret;
> +}
> +
> +int qed_read_l1_table_sync(BDRVQEDState *s)
> +{
> +    int ret = -EINPROGRESS;
> +
> +    async_context_push();
> +
> +    qed_read_table(s, s->header.l1_table_offset,
> +                   s->l1_table, qed_sync_cb, &ret);
> +    while (ret == -EINPROGRESS) {
> +        qemu_aio_wait();
> +    }
> +
> +    async_context_pop();
> +
> +    return ret;
> +}
> +
> +void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
> +                        BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
> +    qed_write_table(s, s->header.l1_table_offset,
> +                    s->l1_table, index, n, false, cb, opaque);
> +}
> +
> +int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
> +                            unsigned int n)
> +{
> +    int ret = -EINPROGRESS;
> +
> +    async_context_push();
> +
> +    qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
> +    while (ret == -EINPROGRESS) {
> +        qemu_aio_wait();
> +    }
> +
> +    async_context_pop();
> +
> +    return ret;
> +}
> +
> +typedef struct {
> +    GenericCB gencb;
> +    BDRVQEDState *s;
> +    uint64_t l2_offset;
> +    QEDRequest *request;
> +} QEDReadL2TableCB;
> +
> +static void qed_read_l2_table_cb(void *opaque, int ret)
> +{
> +    QEDReadL2TableCB *read_l2_table_cb = opaque;
> +    QEDRequest *request = read_l2_table_cb->request;
> +    BDRVQEDState *s = read_l2_table_cb->s;
> +    CachedL2Table *l2_table = request->l2_table;
> +
> +    if (ret) {
> +        /* can't trust loaded L2 table anymore */
> +        qed_unref_l2_cache_entry(&s->l2_cache, l2_table);
> +        request->l2_table = NULL;

Is decreasing the refcount by one and clearing request->l2_table enough?
Didn't we destroy it for all references? Unless, of course, there is at
most one reference, but then the refcount is useless.

Hm, or do we just increase the refcount before the cache entry is
actually used, and we shouldn't do that? Not sure I understand the
purpose of this refcount thing yet.

> +    } else {
> +        l2_table->offset = read_l2_table_cb->l2_offset;
> +
> +        qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
> +
> +        /* This is guaranteed to succeed because we just committed the entry
> +         * to the cache.
> +         */
> +        request->l2_table = qed_find_l2_cache_entry(&s->l2_cache,
> +                                                    l2_table->offset);
> +        assert(request->l2_table != NULL);
> +    }
> +
> +    gencb_complete(&read_l2_table_cb->gencb, ret);
> +}
> +
> +void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
> +                       BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    QEDReadL2TableCB *read_l2_table_cb;
> +
> +    qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
> +
> +    /* Check for cached L2 entry */
> +    request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
> +    if (request->l2_table) {
> +        cb(opaque, 0);
> +        return;
> +    }
> +
> +    request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
> +
> +    read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque);
> +    read_l2_table_cb->s = s;
> +    read_l2_table_cb->l2_offset = offset;
> +    read_l2_table_cb->request = request;
> +
> +    BLKDBG_EVENT(s->bs->file, BLKDBG_L2_LOAD);
> +    qed_read_table(s, offset, request->l2_table->table,
> +                   qed_read_l2_table_cb, read_l2_table_cb);
> +}
> +
> +int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
> +{
> +    int ret = -EINPROGRESS;
> +
> +    async_context_push();
> +
> +    qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
> +    while (ret == -EINPROGRESS) {
> +        qemu_aio_wait();
> +    }
> +
> +    async_context_pop();
> +    return ret;
> +}
> +
> +void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
> +                        unsigned int index, unsigned int n, bool flush,
> +                        BlockDriverCompletionFunc *cb, void *opaque)
> +{
> +    BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
> +    qed_write_table(s, request->l2_table->offset,
> +                    request->l2_table->table, index, n, flush, cb, opaque);
> +}
> +
> +int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
> +                            unsigned int index, unsigned int n, bool flush)
> +{
> +    int ret = -EINPROGRESS;
> +
> +    async_context_push();
> +
> +    qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
> +    while (ret == -EINPROGRESS) {
> +        qemu_aio_wait();
> +    }
> +
> +    async_context_pop();
> +    return ret;
> +}
> diff --git a/block/qed.c b/block/qed.c
> index ea03798..6d7f4d7 100644
> --- a/block/qed.c
> +++ b/block/qed.c
> @@ -139,6 +139,15 @@ static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
>      return 0;
>  }
>  
> +static QEDTable *qed_alloc_table(void *opaque)
> +{
> +    BDRVQEDState *s = opaque;
> +
> +    /* Honor O_DIRECT memory alignment requirements */
> +    return qemu_blockalign(s->bs,
> +                           s->header.cluster_size * s->header.table_size);
> +}
> +
>  static int bdrv_qed_open(BlockDriverState *bs, int flags)
>  {
>      BDRVQEDState *s = bs->opaque;
> @@ -207,11 +216,24 @@ static int bdrv_qed_open(BlockDriverState *bs, int flags)
>              }
>          }
>      }
> +
> +    s->l1_table = qed_alloc_table(s);
> +    qed_init_l2_cache(&s->l2_cache, qed_alloc_table, s);
> +
> +    ret = qed_read_l1_table_sync(s);
> +    if (ret) {
> +        qed_free_l2_cache(&s->l2_cache);

Why not initialize the L2 cache only if the read succeeded?

Kevin


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 6/7] qed: Read/write support Stefan Hajnoczi
  2010-10-10  9:10   ` [Qemu-devel] " Avi Kivity
@ 2010-10-12 15:08   ` Kevin Wolf
  2010-10-12 15:22     ` Anthony Liguori
  1 sibling, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-12 15:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> This patch implements the read/write state machine.  Operations are
> fully asynchronous and multiple operations may be active at any time.
> 
> Allocating writes lock tables to ensure metadata updates do not
> interfere with each other.  If two allocating writes need to update the
> same L2 table they will run sequentially.  If two allocating writes need
> to update different L2 tables they will run in parallel.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

I hope to review this in more detail tomorrow, but there's one thing I
already noticed: When allocating a cluster, but not writing the whole
cluster (i.e. requests involving COW), I think we need to flush after
the COW and before the cluster allocation is written to the L2 table to
maintain the right order. Otherwise we might destroy data that isn't
even touched by the guest request in case of a crash.

Kevin


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-12 15:08   ` Kevin Wolf
@ 2010-10-12 15:22     ` Anthony Liguori
  2010-10-12 15:39       ` Kevin Wolf
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-12 15:22 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, Avi Kivity, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>   Otherwise we might destroy data that isn't
> even touched by the guest request in case of a crash.
>    

The failure scenarios are either that the cluster is leaked, in which 
case the old version of the data is still present, or that the cluster 
is orphaned because the L2 entry is written, in which case the old 
version of the data is present.

Are you referring to a scenario where the cluster is partially written 
because the data is present in the write cache and the write cache isn't 
flushed on power failure?

Regards,

Anthony Liguori

> Kevin
>    


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-12 15:22     ` Anthony Liguori
@ 2010-10-12 15:39       ` Kevin Wolf
  2010-10-12 15:59         ` Stefan Hajnoczi
  0 siblings, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-12 15:39 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Anthony Liguori, Avi Kivity, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

Am 12.10.2010 17:22, schrieb Anthony Liguori:
> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>   Otherwise we might destroy data that isn't
>> even touched by the guest request in case of a crash.
>>    
> 
> The failure scenarios are either that the cluster is leaked in which 
> case, the old version of the data is still present or the cluster is 
> orphaned because the L2 entry is written, in which case the old version 
> of the data is present.

Hm, how does the latter case work? Or rather, what do you mean by "orphaned"?

> Are you referring to a scenario where the cluster is partially written 
> because the data is present in the write cache and the write cache isn't 
> flushed on power failure?

The case I'm referring to is a COW. So let's assume a partial write to
an unallocated cluster, we then need to do a COW in pre/postfill. Then
we do a normal write and link the new cluster in the L2 table.

Assume that the write to the L2 table is already on the disk, but the
pre/postfill data isn't yet. At this point we have a bad state because
if we crash now we have lost the data that should have been copied from
the backing file.

If we can't guarantee that a new cluster is all zeros, the same happens
without a backing file. So as soon as we start reusing freed clusters,
we get this case for all QED images.
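
The ordering problem can be illustrated with a toy model. This is only a
sketch under assumptions: ToyDisk and the toy_* helpers are hypothetical,
the disk has a volatile write cache, and cached writes may reach stable
storage in any order until a flush completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy disk with a volatile write cache: a write is only guaranteed to be
 * durable after a flush.  Hypothetical model, not QEMU's block layer. */
typedef struct {
    bool data_cached, data_durable;   /* COW pre/postfill plus guest data */
    bool l2_cached, l2_durable;       /* L2 entry linking the new cluster */
} ToyDisk;

static void toy_write_data(ToyDisk *d) { d->data_cached = true; }
static void toy_write_l2(ToyDisk *d)   { d->l2_cached = true; }

/* Flush: everything written so far becomes durable. */
static void toy_flush(ToyDisk *d)
{
    if (d->data_cached) {
        d->data_durable = true;
    }
    if (d->l2_cached) {
        d->l2_durable = true;
    }
}

/* A crash right now is unsafe if the L2 entry may already be on disk
 * while the cluster contents are not guaranteed to be. */
static bool toy_unsafe_if_crash_now(const ToyDisk *d)
{
    bool l2_may_be_on_disk = d->l2_durable || d->l2_cached;
    return l2_may_be_on_disk && !d->data_durable;
}

/* Run one allocating write and report whether any crash point could
 * leave the L2 entry on disk without the data it points to. */
static bool toy_allocating_write(bool flush_before_l2)
{
    ToyDisk d = { false, false, false, false };
    bool unsafe = false;

    toy_write_data(&d);
    unsafe |= toy_unsafe_if_crash_now(&d);
    if (flush_before_l2) {
        toy_flush(&d);
        unsafe |= toy_unsafe_if_crash_now(&d);
    }
    toy_write_l2(&d);
    unsafe |= toy_unsafe_if_crash_now(&d);
    toy_flush(&d);
    unsafe |= toy_unsafe_if_crash_now(&d);
    return unsafe;
}
```

In this model toy_allocating_write(false) reports an unsafe crash window
and toy_allocating_write(true) does not, which is the reason for the
flush between the COW and the L2 update.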

Kevin


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-12 15:39       ` Kevin Wolf
@ 2010-10-12 15:59         ` Stefan Hajnoczi
  2010-10-12 16:16           ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-12 15:59 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, Anthony Liguori, Avi Kivity, qemu-devel,
	Christoph Hellwig

On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
> Am 12.10.2010 17:22, schrieb Anthony Liguori:
> > On 10/12/2010 10:08 AM, Kevin Wolf wrote:
> >>   Otherwise we might destroy data that isn't
> >> even touched by the guest request in case of a crash.
> >>    
> > 
> > The failure scenarios are either that the cluster is leaked in which 
> > case, the old version of the data is still present or the cluster is 
> > orphaned because the L2 entry is written, in which case the old version 
> > of the data is present.
> 
> > Hm, how does the latter case work? Or rather, what do you mean by "orphaned"?
> 
> > Are you referring to a scenario where the cluster is partially written 
> > because the data is present in the write cache and the write cache isn't 
> > flushed on power failure?
> 
> The case I'm referring to is a COW. So let's assume a partial write to
> an unallocated cluster, we then need to do a COW in pre/postfill. Then
> we do a normal write and link the new cluster in the L2 table.
> 
> Assume that the write to the L2 table is already on the disk, but the
> pre/postfill data isn't yet. At this point we have a bad state because
> if we crash now we have lost the data that should have been copied from
> the backing file.

In this case QED_F_NEED_CHECK is set and the invalid cluster offset
should be reset to zero on open.

However, I think we can get into a state where the pre/postfill data
isn't on the disk yet but another allocation has increased the file
size, making the unwritten cluster "valid".  This fools the consistency
check into thinking the data cluster (which was never written to on
disk) is valid.
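
A rough sketch of that situation, assuming the check treats an offset as
valid when it is nonzero, cluster-aligned and the whole cluster lies
inside the file. The rule and the toy_* names are assumptions made for
illustration, not the actual check code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed validity rule for this sketch: nonzero, cluster-aligned, and
 * the whole cluster lies within the image file.  cluster_size must be
 * nonzero. */
static bool toy_offset_looks_valid(uint64_t offset, uint64_t cluster_size,
                                   uint64_t file_size)
{
    if (offset == 0 || offset % cluster_size != 0) {
        return false;
    }
    return offset + cluster_size <= file_size;
}

/* What a check-on-open pass might do with a single L2 entry: keep it if
 * it looks valid, otherwise reset it to zero (unallocated). */
static uint64_t toy_check_entry(uint64_t offset, uint64_t cluster_size,
                                uint64_t file_size)
{
    return toy_offset_looks_valid(offset, cluster_size, file_size)
           ? offset : 0;
}
```

If the file ends at the entry's offset because the data never reached the
disk, toy_check_entry() resets the entry. If a later allocation has grown
the file past that cluster, the same entry passes even though its data
was never written.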

Will think about this more tonight.

Stefan


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-12 15:59         ` Stefan Hajnoczi
@ 2010-10-12 16:16           ` Anthony Liguori
  2010-10-12 16:21             ` Avi Kivity
  2010-10-13 12:13             ` Stefan Hajnoczi
  0 siblings, 2 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-12 16:16 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Avi Kivity, qemu-devel, Christoph Hellwig

On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
> On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
>    
>> Am 12.10.2010 17:22, schrieb Anthony Liguori:
>>      
>>> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>>        
>>>>    Otherwise we might destroy data that isn't
>>>> even touched by the guest request in case of a crash.
>>>>
>>>>          
>>> The failure scenarios are either that the cluster is leaked in which
>>> case, the old version of the data is still present or the cluster is
>>> orphaned because the L2 entry is written, in which case the old version
>>> of the data is present.
>>>        
>> Hm, how does the latter case work? Or rather, what do you mean by "orphaned"?
>>
>>      
>>> Are you referring to a scenario where the cluster is partially written
>>> because the data is present in the write cache and the write cache isn't
>>> flushed on power failure?
>>>        
>> The case I'm referring to is a COW. So let's assume a partial write to
>> an unallocated cluster, we then need to do a COW in pre/postfill. Then
>> we do a normal write and link the new cluster in the L2 table.
>>
>> Assume that the write to the L2 table is already on the disk, but the
>> pre/postfill data isn't yet. At this point we have a bad state because
>> if we crash now we have lost the data that should have been copied from
>> the backing file.
>>      
> In this case QED_F_NEED_CHECK is set and the invalid cluster offset
> should be reset to zero on open.
>
> However, I think we can get into a state where the pre/postfill data
> isn't on the disk yet but another allocation has increased the file
> size, making the unwritten cluster "valid".  This fools the consistency
> check into thinking the data cluster (which was never written to on
> disk) is valid.
>
> Will think about this more tonight.
>    

It's fairly simple to add a sync to this path.  It's probably worth 
checking the prefill/postfill for zeros and avoiding the write/sync if 
that's the case.  That should optimize the common cases of allocating 
new space within a file.
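
A minimal sketch of such a zero check, with hypothetical helper names.
It only helps under the assumption that the newly allocated cluster
already reads back as zeros (for example, space appended to the file);
as noted earlier in the thread, that no longer holds once freed clusters
are reused.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if every byte in the buffer is zero. */
static bool toy_is_all_zeros(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}

/* Decide whether a prefill/postfill region has to be written (and hence
 * synced) before the L2 update.  An empty or all-zero region is skipped,
 * assuming the new cluster already reads as zeros. */
static bool toy_fill_needs_write(const uint8_t *fill, size_t len)
{
    return len > 0 && !toy_is_all_zeros(fill, len);
}
```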

My intuition is that we can avoid the sync entirely but we'll need to 
think about it further.

Regards,

Anthony Liguori

> Stefan
>    


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-12 16:16           ` Anthony Liguori
@ 2010-10-12 16:21             ` Avi Kivity
  2010-10-13 12:13             ` Stefan Hajnoczi
  1 sibling, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-12 16:21 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

  On 10/12/2010 06:16 PM, Anthony Liguori wrote:
>
> It's fairly simple to add a sync to this path.  It's probably worth 
> checking the prefill/postfill for zeros and avoiding the write/sync if 
> that's the case.  That should optimize the common cases of allocating 
> new space within a file.
>
> My intuition is that we can avoid the sync entirely but we'll need to 
> think about it further.
>

I don't think so.  This isn't a guest initiated write so we can't shift 
responsibility to the guest.

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values Stefan Hajnoczi
  2010-10-11 11:09   ` [Qemu-devel] " Kevin Wolf
@ 2010-10-13  9:15   ` Markus Armbruster
  2010-10-13  9:28     ` Kevin Wolf
  2010-10-13 10:25   ` [Qemu-devel] " Avi Kivity
  2 siblings, 1 reply; 72+ messages in thread
From: Markus Armbruster @ 2010-10-13  9:15 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> writes:

> From: Anthony Liguori <aliguori@us.ibm.com>
>
> This common function converts byte counts to human-readable strings with
> proper units.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>  cutils.c      |   15 +++++++++++++++
>  qemu-common.h |    1 +
>  2 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/cutils.c b/cutils.c
> index 6c32198..5041203 100644
> --- a/cutils.c
> +++ b/cutils.c
> @@ -301,3 +301,18 @@ int get_bits_from_size(size_t size)
>      return __builtin_ctzl(size);
>  #endif
>  }
> +
> +void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)

Why is the size argument uint64_t and not size_t?

The name bytes_to_str() suggests you're formatting a sequence of bytes.
What about sztostr()?  Matches Jes's strtosz().

> +{
> +    if (size < (1ULL << 10)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
> +    } else if (size < (1ULL << 20)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
> +    } else if (size < (1ULL << 30)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
> +    } else if (size < (1ULL << 40)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
> +    } else {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
> +    }

Sure you want to truncate rather than round?

The "(s)" sure are ugly.  We don't usually add plural-s after a unit: we
write ten milliseconds as 10ms, not 10ms(s).

I suggest returning the length of the resulting string, as returned by
snprintf().

> +}
> diff --git a/qemu-common.h b/qemu-common.h
> index e0ca398..80ae834 100644
> --- a/qemu-common.h
> +++ b/qemu-common.h
> @@ -154,6 +154,7 @@ int qemu_fls(int i);
>  int qemu_fdatasync(int fd);
>  int fcntl_setfl(int fd, int flag);
>  int get_bits_from_size(size_t size);
> +void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size);
>  
>  /* path.c */
>  void init_paths(const char *prefix);


* Re: [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
  2010-10-13  9:15   ` [Qemu-devel] " Markus Armbruster
@ 2010-10-13  9:28     ` Kevin Wolf
  2010-10-13 10:58       ` Stefan Hajnoczi
  0 siblings, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-13  9:28 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Anthony Liguori, Avi Kivity, Christoph Hellwig, Stefan Hajnoczi,
	qemu-devel

Am 13.10.2010 11:15, schrieb Markus Armbruster:
> Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> writes:
> 
>> From: Anthony Liguori <aliguori@us.ibm.com>
>>
>> This common function converts byte counts to human-readable strings with
>> proper units.
>>
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
>> ---
>>  cutils.c      |   15 +++++++++++++++
>>  qemu-common.h |    1 +
>>  2 files changed, 16 insertions(+), 0 deletions(-)
>>
>> diff --git a/cutils.c b/cutils.c
>> index 6c32198..5041203 100644
>> --- a/cutils.c
>> +++ b/cutils.c
>> @@ -301,3 +301,18 @@ int get_bits_from_size(size_t size)
>>      return __builtin_ctzl(size);
>>  #endif
>>  }
>> +
>> +void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)
> 
> Why is the size argument uint64_t and not size_t?

size_t would be rather small for disk images on 32 bit hosts.

> The name bytes_to_str() suggests you're formatting a sequence of bytes.
> What about sztostr()?  Matches Jes's strtosz().
> 
>> +{
>> +    if (size < (1ULL << 10)) {
>> +        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
>> +    } else if (size < (1ULL << 20)) {
>> +        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
>> +    } else if (size < (1ULL << 30)) {
>> +        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
>> +    } else if (size < (1ULL << 40)) {
>> +        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
>> +    } else {
>> +        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
>> +    }
> 
> Sure you want to truncate rather than round?
> 
> The "(s)" sure are ugly.  We don't usually add plural-s after a unit: we
> write ten milliseconds as 10ms, not 10ms(s).

I suggest taking the output format from cvtstr in cmd.c, so that qemu-io
output stays the same when switching to a common function (it says "10
KiB" and "1 bytes").

Kevin


* [Qemu-devel] Re: [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
  2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values Stefan Hajnoczi
  2010-10-11 11:09   ` [Qemu-devel] " Kevin Wolf
  2010-10-13  9:15   ` [Qemu-devel] " Markus Armbruster
@ 2010-10-13 10:25   ` Avi Kivity
  2 siblings, 0 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-13 10:25 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
> From: Anthony Liguori <aliguori@us.ibm.com>
>
> This common function converts byte counts to human-readable strings with
> proper units.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> ---
>   cutils.c      |   15 +++++++++++++++
>   qemu-common.h |    1 +
>   2 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/cutils.c b/cutils.c
> index 6c32198..5041203 100644
> --- a/cutils.c
> +++ b/cutils.c
> @@ -301,3 +301,18 @@ int get_bits_from_size(size_t size)
>       return __builtin_ctzl(size);
>   #endif
>   }
> +
> +void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)
> +{
> +    if (size < (1ULL << 10)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
> +    } else if (size < (1ULL << 20)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
> +    } else if (size < (1ULL << 30)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
> +    } else if (size < (1ULL << 40)) {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
> +    } else {
> +        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
> +    }
> +}

This will show 1.5GB as 1GB.  Need either floating point with a couple 
of digits of precision, or show 1.5GB as 1500MB.

It also misuses SI prefixes. 1 GB means 10^9 bytes, not 2^30 bytes (with 
a common exception for RAM which is usually packaged in 1.125 times some 
power of two).
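
One possible shape, as a sketch only: binary (IEC) prefixes, one decimal
digit computed with integer arithmetic and truncated rather than rounded,
and the snprintf() result returned to the caller. The name toy_sztostr
and the exact output format are assumptions, not an existing QEMU
function.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a byte count with binary prefixes and one
 * decimal digit (truncated).  Returns what snprintf() returns. */
static int toy_sztostr(char *buf, size_t buf_len, uint64_t size)
{
    static const char *const units[] = {
        "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"
    };
    int shift = 10;
    int unit = 0;
    uint64_t whole;
    unsigned int tenths;

    if (size < 1024) {
        return snprintf(buf, buf_len, "%" PRIu64 " bytes", size);
    }

    /* Pick the largest unit that keeps the integer part below 1024. */
    while (unit < 5 && (size >> shift) >= 1024) {
        shift += 10;
        unit++;
    }

    whole = size >> shift;
    /* The top 10 bits of the remainder give the fraction in 1/1024ths. */
    tenths = (unsigned int)(((size >> (shift - 10)) & 1023) * 10 / 1024);

    return snprintf(buf, buf_len, "%" PRIu64 ".%u %s",
                    whole, tenths, units[unit]);
}
```

With this sketch 1.5 * 2^30 bytes is formatted as "1.5 GiB" instead of
being cut down to 1.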

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values
  2010-10-13  9:28     ` Kevin Wolf
@ 2010-10-13 10:58       ` Stefan Hajnoczi
  0 siblings, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-13 10:58 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, Avi Kivity, Christoph Hellwig,
	Markus Armbruster, qemu-devel

On Wed, Oct 13, 2010 at 11:28:42AM +0200, Kevin Wolf wrote:
> Am 13.10.2010 11:15, schrieb Markus Armbruster:
> > Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> writes:
> > 
> >> From: Anthony Liguori <aliguori@us.ibm.com>
> >>
> >> This common function converts byte counts to human-readable strings with
> >> proper units.
> >>
> >> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> >> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
> >> ---
> >>  cutils.c      |   15 +++++++++++++++
> >>  qemu-common.h |    1 +
> >>  2 files changed, 16 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/cutils.c b/cutils.c
> >> index 6c32198..5041203 100644
> >> --- a/cutils.c
> >> +++ b/cutils.c
> >> @@ -301,3 +301,18 @@ int get_bits_from_size(size_t size)
> >>      return __builtin_ctzl(size);
> >>  #endif
> >>  }
> >> +
> >> +void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)
> > 
> > Why is the size argument uint64_t and not size_t?
> 
> size_t would be rather small for disk images on 32 bit hosts.
> 
> > The name bytes_to_str() suggests you're formatting a sequence of bytes.
> > What about sztostr()?  Matches Jes's strtosz().
> > 
> >> +{
> >> +    if (size < (1ULL << 10)) {
> >> +        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
> >> +    } else if (size < (1ULL << 20)) {
> >> +        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
> >> +    } else if (size < (1ULL << 30)) {
> >> +        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
> >> +    } else if (size < (1ULL << 40)) {
> >> +        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
> >> +    } else {
> >> +        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
> >> +    }
> > 
> > Sure you want to truncate rather than round?
> > 
> > The "(s)" sure are ugly.  We don't usually add plural-s after a unit: we
> > write ten milliseconds as 10ms, not 10ms(s).
> 
> I suggest taking the output format from cvtstr in cmd.c, so that qemu-io
> output stays the same when switching to a common function (it says "10
> KiB" and "1 bytes").

I have a patch to replace bytes_to_str() with cvtstr().  We should
probably rename cvtstr() to sztostr() as suggested because that name is
more descriptive.

Stefan


* [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-12 16:16           ` Anthony Liguori
  2010-10-12 16:21             ` Avi Kivity
@ 2010-10-13 12:13             ` Stefan Hajnoczi
  2010-10-13 13:07               ` Kevin Wolf
  1 sibling, 1 reply; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-13 12:13 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Avi Kivity, qemu-devel, Christoph Hellwig

On Tue, Oct 12, 2010 at 11:16:17AM -0500, Anthony Liguori wrote:
> On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
> >On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
> >>Am 12.10.2010 17:22, schrieb Anthony Liguori:
> >>>On 10/12/2010 10:08 AM, Kevin Wolf wrote:
> >>>>   Otherwise we might destroy data that isn't
> >>>>even touched by the guest request in case of a crash.
> >>>>
> >>>The failure scenarios are either that the cluster is leaked in which
> >>>case, the old version of the data is still present or the cluster is
> >>>orphaned because the L2 entry is written, in which case the old version
> >>>of the data is present.
> >>Hm, how does the latter case work? Or rather, what do you mean by "orphaned"?
> >>
> >>>Are you referring to a scenario where the cluster is partially written
> >>>because the data is present in the write cache and the write cache isn't
> >>>flushed on power failure?
> >>The case I'm referring to is a COW. So let's assume a partial write to
> >>an unallocated cluster, we then need to do a COW in pre/postfill. Then
> >>we do a normal write and link the new cluster in the L2 table.
> >>
> >>Assume that the write to the L2 table is already on the disk, but the
> >>pre/postfill data isn't yet. At this point we have a bad state because
> >>if we crash now we have lost the data that should have been copied from
> >>the backing file.
> >In this case QED_F_NEED_CHECK is set and the invalid cluster offset
> >should be reset to zero on open.
> >
> >However, I think we can get into a state where the pre/postfill data
> >isn't on the disk yet but another allocation has increased the file
> >size, making the unwritten cluster "valid".  This fools the consistency
> >check into thinking the data cluster (which was never written to on
> >disk) is valid.
> >
> >Will think about this more tonight.
> 
> It's fairly simple to add a sync to this path.  It's probably worth
> checking the prefill/postfill for zeros and avoiding the write/sync
> if that's the case.  That should optimize the common cases of
> allocating new space within a file.
> 
> My intuition is that we can avoid the sync entirely but we'll need
> to think about it further.

We can avoid it when a backing image is not used.  Your idea to check
for zeroes in the backing image is neat too; it may well avoid the sync
in the common case even for backing images.
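
As a rough sketch of the zero check, assuming the prefill/postfill data
has already been read into a buffer: the helper name fill_is_zero() is
hypothetical and a plain byte loop is used for clarity, not speed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if all len bytes at buf are zero.
 * An empty buffer counts as all-zero. */
static bool fill_is_zero(const uint8_t *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}
```

If this returns true for both the prefill and the postfill region, the
copy-on-write write (and the sync that would follow it) might be skipped,
subject to the other conditions discussed in this thread.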

Stefan


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 12:13             ` Stefan Hajnoczi
@ 2010-10-13 13:07               ` Kevin Wolf
  2010-10-13 13:24                 ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Kevin Wolf @ 2010-10-13 13:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Christoph Hellwig, Avi Kivity, qemu-devel

Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> On Tue, Oct 12, 2010 at 11:16:17AM -0500, Anthony Liguori wrote:
>> On 10/12/2010 10:59 AM, Stefan Hajnoczi wrote:
>>> On Tue, Oct 12, 2010 at 05:39:48PM +0200, Kevin Wolf wrote:
>>>> Am 12.10.2010 17:22, schrieb Anthony Liguori:
>>>>> On 10/12/2010 10:08 AM, Kevin Wolf wrote:
>>>>>>   Otherwise we might destroy data that isn't
>>>>>> even touched by the guest request in case of a crash.
>>>>>>
>>>>> The failure scenarios are either that the cluster is leaked in which
>>>>> case, the old version of the data is still present or the cluster is
>>>>> orphaned because the L2 entry is written, in which case the old version
>>>>> of the data is present.
>>>> Hm, how does the latter case work? Or rather, what do you mean by "orphaned"?
>>>>
>>>>> Are you referring to a scenario where the cluster is partially written
>>>>> because the data is present in the write cache and the write cache isn't
>>>>> flushed on power failure?
>>>> The case I'm referring to is a COW. So let's assume a partial write to
>>>> an unallocated cluster, we then need to do a COW in pre/postfill. Then
>>>> we do a normal write and link the new cluster in the L2 table.
>>>>
>>>> Assume that the write to the L2 table is already on the disk, but the
>>>> pre/postfill data isn't yet. At this point we have a bad state because
>>>> if we crash now we have lost the data that should have been copied from
>>>> the backing file.
>>> In this case QED_F_NEED_CHECK is set and the invalid cluster offset
>>> should be reset to zero on open.
>>>
>>> However, I think we can get into a state where the pre/postfill data
>>> isn't on the disk yet but another allocation has increased the file
>>> size, making the unwritten cluster "valid".  This fools the consistency
>>> check into thinking the data cluster (which was never written to on
>>> disk) is valid.
>>>
>>> Will think about this more tonight.
>>
>> It's fairly simple to add a sync to this path.  It's probably worth
>> checking the prefill/postfill for zeros and avoiding the write/sync
>> if that's the case.  That should optimize the common cases of
>> allocating new space within a file.
>>
>> My intuition is that we can avoid the sync entirely but we'll need
>> to think about it further.
> 
> We can avoid it when a backing image is not used.  Your idea to check
> for zeroes in the backing image is neat too; it may well avoid the sync
> in the common case even for backing images.

The additional requirement is that we're extending the file and not
reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
doesn't work on host_devices anyway)
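
One possible reading of the conditions collected so far, written as a
predicate.  The struct and function names are made up for illustration,
and this is only a sketch of the reasoning in this thread, not a proven
rule for the format.

```c
#include <stdbool.h>

/* Hypothetical summary of the state at allocation time */
struct alloc_ctx_sketch {
    bool has_backing_file;  /* image has a backing file */
    bool fill_is_zero;      /* prefill/postfill data read back as zeroes */
    bool extends_file;      /* new cluster grows the file, no reuse */
    bool has_zero_init;     /* cf. bdrv_has_zero_init() == true */
};

/* Returns true if the pre/postfill write and sync look avoidable under
 * the assumptions above. */
static bool can_skip_fill_sync(const struct alloc_ctx_sketch *c)
{
    /* The data to copy is known to be zero ... */
    bool fill_reads_zero = !c->has_backing_file || c->fill_is_zero;
    /* ... and the freshly allocated space already reads as zero. */
    bool new_space_reads_zero = c->extends_file && c->has_zero_init;

    return fill_reads_zero && new_space_reads_zero;
}
```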

Kevin


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 13:07               ` Kevin Wolf
@ 2010-10-13 13:24                 ` Anthony Liguori
  2010-10-13 13:50                   ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-13 13:24 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Christoph Hellwig, Stefan Hajnoczi, Avi Kivity

On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
>    
>> We can avoid it when a backing image is not used.  Your idea to check
>> for zeroes in the backing image is neat too; it may well avoid the sync
>> in the common case even for backing images.
>>      
> The additional requirement is that we're extending the file and not
> reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> doesn't work on host_devices anyway)
>    

Yes, that's a good point.

BTW, I think we've decided that making it work on host_devices is not 
that bad.

We can add an additional feature called QED_F_PHYSICAL_SIZE.

This feature will add another field to the header that contains an 
offset immediately following the last cluster allocation.

During a metadata scan, we can accurately recreate this field so we only 
need to update this field whenever we clear the header dirty bit (which 
means during an fsync()).

That means we can maintain the physical size without introducing 
additional fsync()s in the allocation path.  Since we're already writing 
out the header anyway, the write operation is basically free too.
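
A minimal sketch of how a metadata scan could recreate that field,
assuming the scan yields a flat array of allocated offsets.  The function
name and parameters are hypothetical, and every entry is treated as
occupying exactly one cluster; tables that span several clusters would
need their real size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the offset immediately following the
 * last allocation found by a metadata scan.
 *
 * offsets:      offsets collected during the scan, 0 means unallocated
 * n:            number of entries in offsets
 * cluster_size: cluster size in bytes
 * header_end:   lower bound, e.g. the end of the header
 */
static uint64_t recompute_physical_size(const uint64_t *offsets, size_t n,
                                        uint64_t cluster_size,
                                        uint64_t header_end)
{
    uint64_t end = header_end;
    size_t i;

    for (i = 0; i < n; i++) {
        if (offsets[i] != 0 && offsets[i] + cluster_size > end) {
            end = offsets[i] + cluster_size;
        }
    }
    return end;
}
```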

Regards,

Anthony Liguori

> Kevin
>    


* [Qemu-devel] Re: [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions
  2010-10-12 14:44   ` [Qemu-devel] " Kevin Wolf
@ 2010-10-13 13:41     ` Stefan Hajnoczi
  0 siblings, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-13 13:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

On Tue, Oct 12, 2010 at 04:44:34PM +0200, Kevin Wolf wrote:
> Am 08.10.2010 17:48, schrieb Stefan Hajnoczi:
> > diff --git a/block/qed-cluster.c b/block/qed-cluster.c
> > new file mode 100644
> > index 0000000..af65e5a
> > --- /dev/null
> > +++ b/block/qed-cluster.c
> > @@ -0,0 +1,145 @@
> > +/*
> > + * QEMU Enhanced Disk Format Cluster functions
> > + *
> > + * Copyright IBM, Corp. 2010
> > + *
> > + * Authors:
> > + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> > + *  Anthony Liguori   <aliguori@us.ibm.com>
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> 
> Hm, just noticed it here: COPYING is the text of the GPL, not LGPL. The
> same comment applies to all other QED files, too.

It should be COPYING.LIB, thanks for pointing this out.

> 
> > + *
> > + */
> > +
> > +#include "qed.h"
> > +
> > +/**
> > + * Count the number of contiguous data clusters
> > + *
> > + * @s:              QED state
> > + * @table:          L2 table
> > + * @index:          First cluster index
> > + * @n:              Maximum number of clusters
> > + * @offset:         Set to first cluster offset
> > + *
> > + * This function scans tables for contiguous allocated or free clusters.
> > + */
> > +static unsigned int qed_count_contiguous_clusters(BDRVQEDState *s,
> > +                                                  QEDTable *table,
> > +                                                  unsigned int index,
> > +                                                  unsigned int n,
> > +                                                  uint64_t *offset)
> > +{
> > +    unsigned int end = MIN(index + n, s->table_nelems);
> > +    uint64_t last = table->offsets[index];
> > +    unsigned int i;
> > +
> > +    *offset = last;
> > +
> > +    for (i = index + 1; i < end; i++) {
> > +        if (last == 0) {
> > +            /* Counting free clusters */
> > +            if (table->offsets[i] != 0) {
> > +                break;
> > +            }
> > +        } else {
> > +            /* Counting allocated clusters */
> > +            if (table->offsets[i] != last + s->header.cluster_size) {
> > +                break;
> > +            }
> > +            last = table->offsets[i];
> > +        }
> > +    }
> > +    return i - index;
> > +}
> > +
> > +typedef struct {
> > +    BDRVQEDState *s;
> > +    uint64_t pos;
> > +    size_t len;
> > +
> > +    QEDRequest *request;
> > +
> > +    /* User callback */
> > +    QEDFindClusterFunc *cb;
> > +    void *opaque;
> > +} QEDFindClusterCB;
> > +
> > +static void qed_find_cluster_cb(void *opaque, int ret)
> > +{
> > +    QEDFindClusterCB *find_cluster_cb = opaque;
> > +    BDRVQEDState *s = find_cluster_cb->s;
> > +    QEDRequest *request = find_cluster_cb->request;
> > +    uint64_t offset = 0;
> > +    size_t len = 0;
> > +    unsigned int index;
> > +    unsigned int n;
> > +
> > +    if (ret) {
> > +        ret = QED_CLUSTER_ERROR;
> 
> Can ret be anything else here? If so, why would we return a more generic
> error value instead of passing down the original one?
> 
> [Okay, after having read more code, this is the place where we throw
> errno away. We shouldn't do that.]
> 
> I also wonder, if reading from the disk failed, is the errno value lost?
> 
> > +        goto out;
> > +    }
> > +
> > +    index = qed_l2_index(s, find_cluster_cb->pos);
> > +    n = qed_bytes_to_clusters(s,
> > +                              qed_offset_into_cluster(s, find_cluster_cb->pos) +
> > +                              find_cluster_cb->len);
> > +    n = qed_count_contiguous_clusters(s, request->l2_table->table,
> > +                                      index, n, &offset);
> > +
> > +    ret = offset ? QED_CLUSTER_FOUND : QED_CLUSTER_L2;
> > +    len = MIN(find_cluster_cb->len, n * s->header.cluster_size -
> > +              qed_offset_into_cluster(s, find_cluster_cb->pos));
> > +
> > +    if (offset && !qed_check_cluster_offset(s, offset)) {
> > +        ret = QED_CLUSTER_ERROR;
> > +        goto out;
> > +    }
> > +
> > +out:
> > +    find_cluster_cb->cb(find_cluster_cb->opaque, ret, offset, len);
> > +    qemu_free(find_cluster_cb);
> > +}
> > +
> > +/**
> > + * Find the offset of a data cluster
> > + *
> > + * @s:          QED state
> > + * @pos:        Byte position in device
> > + * @len:        Number of bytes
> > + * @cb:         Completion function
> > + * @opaque:     User data for completion function
> > + */
> 
> If we add header comments (which I think we should), we shouldn't do
> them only pro forma, but try to make them actually useful, i.e. describe
> all inputs and outputs.
> 
> I'm reading this code for the first time and all these callbacks are
> really confusing. What I know is that in all the state that I pass (s
> and request) _something_ changes, and the cb is called with _some_
> parameters whose meaning I don't know.
> 
> So a good first step would be adding a description of the arguments to
> cb. At least in qed_read_l2_table, which actually does directly change
> the state, we should additionally state that it returns the new L2 table
> in request->l2_table. Things like this are not obvious if you didn't
> write the code.

Right, I will improve the doc comments for v3.
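
As a sketch of the level of detail such a comment could carry, here is a
standalone variant of the contiguous-cluster scan with every input and
output described.  The name count_contiguous() and the plain-array
signature are assumptions for illustration; this is not the v3 code.

```c
#include <stdint.h>

/**
 * Count a run of contiguous table entries
 *
 * @offsets:      Table of cluster offsets; 0 means unallocated
 * @nelems:       Number of entries in @offsets
 * @index:        First entry to examine; must be < @nelems
 * @n:            Maximum number of entries to count; must be >= 1
 * @cluster_size: Cluster size in bytes
 * @first:        Output, set to offsets[@index] (0 for an unallocated run)
 *
 * Returns the length of the run starting at @index.  The run consists
 * either of unallocated entries or of allocated entries whose offsets
 * increase by exactly @cluster_size.  The result is at least 1 and at
 * most MIN(@n, @nelems - @index).
 */
static unsigned int count_contiguous(const uint64_t *offsets,
                                     unsigned int nelems,
                                     unsigned int index, unsigned int n,
                                     uint64_t cluster_size, uint64_t *first)
{
    unsigned int end = index + n < nelems ? index + n : nelems;
    uint64_t last = offsets[index];
    unsigned int i;

    *first = last;

    for (i = index + 1; i < end; i++) {
        if (last == 0) {
            /* Counting unallocated entries */
            if (offsets[i] != 0) {
                break;
            }
        } else {
            /* Counting allocated entries */
            if (offsets[i] != last + cluster_size) {
                break;
            }
            last = offsets[i];
        }
    }
    return i - index;
}
```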

> 
> > +void qed_find_cluster(BDRVQEDState *s, QEDRequest *request, uint64_t pos,
> > +                      size_t len, QEDFindClusterFunc *cb, void *opaque)
> > +{
> > +    QEDFindClusterCB *find_cluster_cb;
> > +    uint64_t l2_offset;
> > +
> > +    /* Limit length to L2 boundary.  Requests are broken up at the L2 boundary
> > +     * so that a request acts on one L2 table at a time.
> > +     */
> > +    len = MIN(len, (((pos >> s->l1_shift) + 1) << s->l1_shift) - pos);
> > +
> > +    l2_offset = s->l1_table->offsets[qed_l1_index(s, pos)];
> > +    if (!l2_offset) {
> > +        cb(opaque, QED_CLUSTER_L1, 0, len);
> > +        return;
> > +    }
> > +    if (!qed_check_table_offset(s, l2_offset)) {
> > +        cb(opaque, QED_CLUSTER_ERROR, 0, 0);
> > +        return;
> > +    }
> > +
> > +    find_cluster_cb = qemu_malloc(sizeof(*find_cluster_cb));
> > +    find_cluster_cb->s = s;
> > +    find_cluster_cb->pos = pos;
> > +    find_cluster_cb->len = len;
> > +    find_cluster_cb->cb = cb;
> > +    find_cluster_cb->opaque = opaque;
> > +    find_cluster_cb->request = request;
> > +
> > +    qed_read_l2_table(s, request, l2_offset,
> > +                      qed_find_cluster_cb, find_cluster_cb);
> > +}
> > diff --git a/block/qed-gencb.c b/block/qed-gencb.c
> > new file mode 100644
> > index 0000000..d389e12
> > --- /dev/null
> > +++ b/block/qed-gencb.c
> > @@ -0,0 +1,32 @@
> > +/*
> > + * QEMU Enhanced Disk Format
> > + *
> > + * Copyright IBM, Corp. 2010
> > + *
> > + * Authors:
> > + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "qed.h"
> > +
> > +void *gencb_alloc(size_t len, BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > +    GenericCB *gencb = qemu_malloc(len);
> > +    gencb->cb = cb;
> > +    gencb->opaque = opaque;
> > +    return gencb;
> > +}
> > +
> > +void gencb_complete(void *opaque, int ret)
> > +{
> > +    GenericCB *gencb = opaque;
> > +    BlockDriverCompletionFunc *cb = gencb->cb;
> > +    void *user_opaque = gencb->opaque;
> > +
> > +    qemu_free(gencb);
> > +    cb(user_opaque, ret);
> > +}
> > diff --git a/block/qed-l2-cache.c b/block/qed-l2-cache.c
> > new file mode 100644
> > index 0000000..3b2bf6e
> > --- /dev/null
> > +++ b/block/qed-l2-cache.c
> > @@ -0,0 +1,132 @@
> > +/*
> > + * QEMU Enhanced Disk Format L2 Cache
> > + *
> > + * Copyright IBM, Corp. 2010
> > + *
> > + * Authors:
> > + *  Anthony Liguori   <aliguori@us.ibm.com>
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "qed.h"
> > +
> > +/* Each L2 holds 2GB so this lets us fully cache a 100GB disk */
> > +#define MAX_L2_CACHE_SIZE 50
> > +
> > +/**
> > + * Initialize the L2 cache
> > + */
> > +void qed_init_l2_cache(L2TableCache *l2_cache,
> > +                       L2TableAllocFunc *alloc_l2_table,
> 
> What is this function pointer meant for? So far I can only see one call
> to qed_init_l2_cache(), so I guess this indirection is just in
> preparation for some future extension? Maybe add a comment?

I will review this function pointer and address it with a comment: the
L2TableAllocFunc decouples qed-l2-cache.c from the qemu_blockalign() and
table sizing.  Maybe it's not worth having this if it just complicates
things.

> 
> > +                       void *alloc_l2_table_opaque)
> > +{
> > +    QTAILQ_INIT(&l2_cache->entries);
> > +    l2_cache->n_entries = 0;
> > +    l2_cache->alloc_l2_table = alloc_l2_table;
> > +    l2_cache->alloc_l2_table_opaque = alloc_l2_table_opaque;
> > +}
> > +
> > +/**
> > + * Free the L2 cache
> > + */
> > +void qed_free_l2_cache(L2TableCache *l2_cache)
> > +{
> > +    CachedL2Table *entry, *next_entry;
> > +
> > +    QTAILQ_FOREACH_SAFE(entry, &l2_cache->entries, node, next_entry) {
> > +        qemu_vfree(entry->table);
> > +        qemu_free(entry);
> > +    }
> > +}
> > +
> > +/**
> > + * Allocate an uninitialized entry from the cache
> > + *
> > + * The returned entry has a reference count of 1 and is owned by the caller.
> > + */
> > +CachedL2Table *qed_alloc_l2_cache_entry(L2TableCache *l2_cache)
> > +{
> > +    CachedL2Table *entry;
> > +
> > +    entry = qemu_mallocz(sizeof(*entry));
> > +    entry->table = l2_cache->alloc_l2_table(l2_cache->alloc_l2_table_opaque);
> > +    entry->ref++;
> > +
> > +    return entry;
> > +}
> 
> Hm, what references are counted by ref? Do you have more than one L2
> cache and an entry can be referenced by multiple of them?

I'll add comments describing how the cache works and is used.  I hope
this will make refcounts and caching clearer.
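
In the meantime, a stripped-down sketch of the intended ownership rules:
the allocator hands out one reference, committing transfers that
reference to the cache, and every successful lookup takes a new one.  All
names below are hypothetical, and eviction and duplicate handling are
left out on purpose.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct Entry {
    uint64_t offset;      /* key: table offset in the image file */
    unsigned int ref;     /* references held by the cache and by users */
    struct Entry *next;
} Entry;

typedef struct {
    Entry *head;
} Cache;

/* The returned entry has a reference count of 1, owned by the caller */
static Entry *entry_alloc(uint64_t offset)
{
    Entry *e = calloc(1, sizeof(*e));

    if (!e) {
        abort();
    }
    e->offset = offset;
    e->ref = 1;
    return e;
}

/* Drop one reference and free the entry when the last one goes away */
static void entry_unref(Entry *e)
{
    if (e && --e->ref == 0) {
        free(e);
    }
}

/* Lookup: on a hit the caller receives its own new reference */
static Entry *cache_find(Cache *c, uint64_t offset)
{
    Entry *e;

    for (e = c->head; e; e = e->next) {
        if (e->offset == offset) {
            e->ref++;
            return e;
        }
    }
    return NULL;
}

/* Commit: the cache takes over the caller's reference, so the caller
 * must call cache_find() again if it wants to keep using the entry. */
static void cache_commit(Cache *c, Entry *e)
{
    e->next = c->head;
    c->head = e;
}
```

Under these rules a reference count of 1 means "only the cache holds the
entry", and anything above that counts requests currently using it.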

> 
> > +
> > +/**
> > + * Decrease an entry's reference count and free if necessary when the reference
> > + * count drops to zero.
> > + */
> > +void qed_unref_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *entry)
> > +{
> > +    if (!entry) {
> > +        return;
> > +    }
> > +
> > +    entry->ref--;
> > +    if (entry->ref == 0) {
> > +        qemu_vfree(entry->table);
> > +        qemu_free(entry);
> > +    }
> > +}
> 
> The l2_cache argument looks unused. Do we need it?

It's not needed, will remove.

> 
> > +
> > +/**
> > + * Find an entry in the L2 cache.  This may return NULL and it's up to the
> > + * caller to satisfy the cache miss.
> > + *
> > + * For a cached entry, this function increases the reference count and returns
> > + * the entry.
> > + */
> > +CachedL2Table *qed_find_l2_cache_entry(L2TableCache *l2_cache, uint64_t offset)
> > +{
> > +    CachedL2Table *entry;
> > +
> > +    QTAILQ_FOREACH(entry, &l2_cache->entries, node) {
> > +        if (entry->offset == offset) {
> > +            entry->ref++;
> > +            return entry;
> > +        }
> > +    }
> > +    return NULL;
> > +}
> > +
> > +/**
> > + * Commit an L2 cache entry into the cache.  This is meant to be used as part of
> > + * the process to satisfy a cache miss.  A caller would allocate an entry which
> > + * is not actually in the L2 cache and then once the entry was valid and
> > + * present on disk, the entry can be committed into the cache.
> > + *
> > + * Since the cache is write-through, it's important that this function is not
> > + * called until the entry is present on disk and the L1 has been updated to
> > + * point to the entry.
> > + *
> > + * N.B. This function steals a reference to the l2_table from the caller so the
> > + * caller must obtain a new reference by issuing a call to
> > + * qed_find_l2_cache_entry().
> > + */
> > +void qed_commit_l2_cache_entry(L2TableCache *l2_cache, CachedL2Table *l2_table)
> > +{
> > +    CachedL2Table *entry;
> > +
> > +    entry = qed_find_l2_cache_entry(l2_cache, l2_table->offset);
> > +    if (entry) {
> > +        qed_unref_l2_cache_entry(l2_cache, entry);
> 
> Maybe the qed_find_l2_cache_entry semantics aren't really the right ones
> if we need to decrease the refcount here only because that function
> increased it and we don't actually want that?
> 
> > +        qed_unref_l2_cache_entry(l2_cache, l2_table);
> > +        return;
> > +    }
> > +
> > +    if (l2_cache->n_entries >= MAX_L2_CACHE_SIZE) {
> > +        entry = QTAILQ_FIRST(&l2_cache->entries);
> > +        QTAILQ_REMOVE(&l2_cache->entries, entry, node);
> > +        l2_cache->n_entries--;
> > +        qed_unref_l2_cache_entry(l2_cache, entry);
> > +    }
> > +
> > +    l2_cache->n_entries++;
> > +    QTAILQ_INSERT_TAIL(&l2_cache->entries, l2_table, node);
> > +}
> 
> Okay, so the table has the right refcount because we steal a refcount
> from the caller, and if you don't reuse this, we explicitly unref it. Am
> I the only one to find this interface confusing?
> 
> > diff --git a/block/qed-table.c b/block/qed-table.c
> > new file mode 100644
> > index 0000000..ba6faf0
> > --- /dev/null
> > +++ b/block/qed-table.c
> > @@ -0,0 +1,316 @@
> > +/*
> > + * QEMU Enhanced Disk Format Table I/O
> > + *
> > + * Copyright IBM, Corp. 2010
> > + *
> > + * Authors:
> > + *  Stefan Hajnoczi   <stefanha@linux.vnet.ibm.com>
> > + *  Anthony Liguori   <aliguori@us.ibm.com>
> > + *
> > + * This work is licensed under the terms of the GNU LGPL, version 2 or later.
> > + * See the COPYING file in the top-level directory.
> > + *
> > + */
> > +
> > +#include "trace.h"
> > +#include "qemu_socket.h" /* for EINPROGRESS on Windows */
> > +#include "qed.h"
> > +
> > +typedef struct {
> > +    GenericCB gencb;
> > +    BDRVQEDState *s;
> > +    QEDTable *table;
> > +
> > +    struct iovec iov;
> > +    QEMUIOVector qiov;
> > +} QEDReadTableCB;
> > +
> > +static void qed_read_table_cb(void *opaque, int ret)
> > +{
> > +    QEDReadTableCB *read_table_cb = opaque;
> > +    QEDTable *table = read_table_cb->table;
> > +    int noffsets = read_table_cb->iov.iov_len / sizeof(uint64_t);
> > +    int i;
> > +
> > +    /* Handle I/O error */
> > +    if (ret) {
> > +        goto out;
> > +    }
> > +
> > +    /* Byteswap offsets */
> > +    for (i = 0; i < noffsets; i++) {
> > +        table->offsets[i] = le64_to_cpu(table->offsets[i]);
> > +    }
> > +
> > +out:
> > +    /* Completion */
> > +    trace_qed_read_table_cb(read_table_cb->s, read_table_cb->table, ret);
> > +    gencb_complete(&read_table_cb->gencb, ret);
> > +}
> > +
> > +static void qed_read_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
> > +                           BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > +    QEDReadTableCB *read_table_cb = gencb_alloc(sizeof(*read_table_cb),
> > +                                                cb, opaque);
> > +    QEMUIOVector *qiov = &read_table_cb->qiov;
> > +    BlockDriverAIOCB *aiocb;
> > +
> > +    trace_qed_read_table(s, offset, table);
> > +
> > +    read_table_cb->s = s;
> > +    read_table_cb->table = table;
> > +    read_table_cb->iov.iov_base = table->offsets;
> > +    read_table_cb->iov.iov_len = s->header.cluster_size * s->header.table_size;
> > +
> > +    qemu_iovec_init_external(qiov, &read_table_cb->iov, 1);
> > +    aiocb = bdrv_aio_readv(s->bs->file, offset / BDRV_SECTOR_SIZE, qiov,
> > +                           read_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
> > +                           qed_read_table_cb, read_table_cb);
> > +    if (!aiocb) {
> > +        qed_read_table_cb(read_table_cb, -EIO);
> > +    }
> > +}
> > +
> > +typedef struct {
> > +    GenericCB gencb;
> > +    BDRVQEDState *s;
> > +    QEDTable *orig_table;
> > +    QEDTable *table;
> > +    bool flush;             /* flush after write? */
> > +
> > +    struct iovec iov;
> > +    QEMUIOVector qiov;
> > +} QEDWriteTableCB;
> > +
> > +static void qed_write_table_cb(void *opaque, int ret)
> > +{
> > +    QEDWriteTableCB *write_table_cb = opaque;
> > +
> > +    trace_qed_write_table_cb(write_table_cb->s,
> > +                              write_table_cb->orig_table, ret);
> > +
> > +    if (ret) {
> > +        goto out;
> > +    }
> > +
> > +    if (write_table_cb->flush) {
> > +        /* We still need to flush first */
> > +        write_table_cb->flush = false;
> > +        bdrv_aio_flush(write_table_cb->s->bs, qed_write_table_cb,
> > +                       write_table_cb);
> > +        return;
> > +    }
> > +
> > +out:
> > +    qemu_vfree(write_table_cb->table);
> > +    gencb_complete(&write_table_cb->gencb, ret);
> > +    return;
> > +}
> > +
> > +/**
> > + * Write out an updated part or all of a table
> > + *
> > + * @s:          QED state
> > + * @offset:     Offset of table in image file, in bytes
> > + * @table:      Table
> > + * @index:      Index of first element
> > + * @n:          Number of elements
> > + * @flush:      Whether or not to sync to disk
> > + * @cb:         Completion function
> > + * @opaque:     Argument for completion function
> > + */
> > +static void qed_write_table(BDRVQEDState *s, uint64_t offset, QEDTable *table,
> > +                            unsigned int index, unsigned int n, bool flush,
> > +                            BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > +    QEDWriteTableCB *write_table_cb;
> > +    BlockDriverAIOCB *aiocb;
> > +    unsigned int sector_mask = BDRV_SECTOR_SIZE / sizeof(uint64_t) - 1;
> > +    unsigned int start, end, i;
> > +    size_t len_bytes;
> > +
> > +    trace_qed_write_table(s, offset, table, index, n);
> > +
> > +    /* Calculate indices of the first and one after last elements */
> > +    start = index & ~sector_mask;
> > +    end = (index + n + sector_mask) & ~sector_mask;
> > +
> > +    len_bytes = (end - start) * sizeof(uint64_t);
> > +
> > +    write_table_cb = gencb_alloc(sizeof(*write_table_cb), cb, opaque);
> > +    write_table_cb->s = s;
> > +    write_table_cb->orig_table = table;
> > +    write_table_cb->flush = flush;
> > +    write_table_cb->table = qemu_blockalign(s->bs, len_bytes);
> > +    write_table_cb->iov.iov_base = write_table_cb->table->offsets;
> > +    write_table_cb->iov.iov_len = len_bytes;
> > +    qemu_iovec_init_external(&write_table_cb->qiov, &write_table_cb->iov, 1);
> > +
> > +    /* Byteswap table */
> > +    for (i = start; i < end; i++) {
> > +        uint64_t le_offset = cpu_to_le64(table->offsets[i]);
> > +        write_table_cb->table->offsets[i - start] = le_offset;
> > +    }
> > +
> > +    /* Adjust for offset into table */
> > +    offset += start * sizeof(uint64_t);
> > +
> > +    aiocb = bdrv_aio_writev(s->bs->file, offset / BDRV_SECTOR_SIZE,
> > +                            &write_table_cb->qiov,
> > +                            write_table_cb->iov.iov_len / BDRV_SECTOR_SIZE,
> > +                            qed_write_table_cb, write_table_cb);
> > +    if (!aiocb) {
> > +        qed_write_table_cb(write_table_cb, -EIO);
> > +    }
> > +}
> > +
> > +/**
> > + * Propagate return value from async callback
> > + */
> > +static void qed_sync_cb(void *opaque, int ret)
> > +{
> > +    *(int *)opaque = ret;
> > +}
> > +
> > +int qed_read_l1_table_sync(BDRVQEDState *s)
> > +{
> > +    int ret = -EINPROGRESS;
> > +
> > +    async_context_push();
> > +
> > +    qed_read_table(s, s->header.l1_table_offset,
> > +                   s->l1_table, qed_sync_cb, &ret);
> > +    while (ret == -EINPROGRESS) {
> > +        qemu_aio_wait();
> > +    }
> > +
> > +    async_context_pop();
> > +
> > +    return ret;
> > +}
> > +
> > +void qed_write_l1_table(BDRVQEDState *s, unsigned int index, unsigned int n,
> > +                        BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > +    BLKDBG_EVENT(s->bs->file, BLKDBG_L1_UPDATE);
> > +    qed_write_table(s, s->header.l1_table_offset,
> > +                    s->l1_table, index, n, false, cb, opaque);
> > +}
> > +
> > +int qed_write_l1_table_sync(BDRVQEDState *s, unsigned int index,
> > +                            unsigned int n)
> > +{
> > +    int ret = -EINPROGRESS;
> > +
> > +    async_context_push();
> > +
> > +    qed_write_l1_table(s, index, n, qed_sync_cb, &ret);
> > +    while (ret == -EINPROGRESS) {
> > +        qemu_aio_wait();
> > +    }
> > +
> > +    async_context_pop();
> > +
> > +    return ret;
> > +}
> > +
> > +typedef struct {
> > +    GenericCB gencb;
> > +    BDRVQEDState *s;
> > +    uint64_t l2_offset;
> > +    QEDRequest *request;
> > +} QEDReadL2TableCB;
> > +
> > +static void qed_read_l2_table_cb(void *opaque, int ret)
> > +{
> > +    QEDReadL2TableCB *read_l2_table_cb = opaque;
> > +    QEDRequest *request = read_l2_table_cb->request;
> > +    BDRVQEDState *s = read_l2_table_cb->s;
> > +    CachedL2Table *l2_table = request->l2_table;
> > +
> > +    if (ret) {
> > +        /* can't trust loaded L2 table anymore */
> > +        qed_unref_l2_cache_entry(&s->l2_cache, l2_table);
> > +        request->l2_table = NULL;
> 
> Is decreasing the refcount by one and clearing request->l2_table enough?
> Didn't we destroy it for all references? Unless, of course, there is at
> most one reference, but then the refcount is useless.
> 
> Hm, or do we just increase the refcount before the cache entry is
> actually used, and we shouldn't do that? Not sure I understand the
> purpose of this refcount thing yet.
> 
> > +    } else {
> > +        l2_table->offset = read_l2_table_cb->l2_offset;
> > +
> > +        qed_commit_l2_cache_entry(&s->l2_cache, l2_table);
> > +
> > +        /* This is guaranteed to succeed because we just committed the entry
> > +         * to the cache.
> > +         */
> > +        request->l2_table = qed_find_l2_cache_entry(&s->l2_cache,
> > +                                                    l2_table->offset);
> > +        assert(request->l2_table != NULL);
> > +    }
> > +
> > +    gencb_complete(&read_l2_table_cb->gencb, ret);
> > +}
> > +
> > +void qed_read_l2_table(BDRVQEDState *s, QEDRequest *request, uint64_t offset,
> > +                       BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > +    QEDReadL2TableCB *read_l2_table_cb;
> > +
> > +    qed_unref_l2_cache_entry(&s->l2_cache, request->l2_table);
> > +
> > +    /* Check for cached L2 entry */
> > +    request->l2_table = qed_find_l2_cache_entry(&s->l2_cache, offset);
> > +    if (request->l2_table) {
> > +        cb(opaque, 0);
> > +        return;
> > +    }
> > +
> > +    request->l2_table = qed_alloc_l2_cache_entry(&s->l2_cache);
> > +
> > +    read_l2_table_cb = gencb_alloc(sizeof(*read_l2_table_cb), cb, opaque);
> > +    read_l2_table_cb->s = s;
> > +    read_l2_table_cb->l2_offset = offset;
> > +    read_l2_table_cb->request = request;
> > +
> > +    BLKDBG_EVENT(s->bs->file, BLKDBG_L2_LOAD);
> > +    qed_read_table(s, offset, request->l2_table->table,
> > +                   qed_read_l2_table_cb, read_l2_table_cb);
> > +}
> > +
> > +int qed_read_l2_table_sync(BDRVQEDState *s, QEDRequest *request, uint64_t offset)
> > +{
> > +    int ret = -EINPROGRESS;
> > +
> > +    async_context_push();
> > +
> > +    qed_read_l2_table(s, request, offset, qed_sync_cb, &ret);
> > +    while (ret == -EINPROGRESS) {
> > +        qemu_aio_wait();
> > +    }
> > +
> > +    async_context_pop();
> > +    return ret;
> > +}
> > +
> > +void qed_write_l2_table(BDRVQEDState *s, QEDRequest *request,
> > +                        unsigned int index, unsigned int n, bool flush,
> > +                        BlockDriverCompletionFunc *cb, void *opaque)
> > +{
> > +    BLKDBG_EVENT(s->bs->file, BLKDBG_L2_UPDATE);
> > +    qed_write_table(s, request->l2_table->offset,
> > +                    request->l2_table->table, index, n, flush, cb, opaque);
> > +}
> > +
> > +int qed_write_l2_table_sync(BDRVQEDState *s, QEDRequest *request,
> > +                            unsigned int index, unsigned int n, bool flush)
> > +{
> > +    int ret = -EINPROGRESS;
> > +
> > +    async_context_push();
> > +
> > +    qed_write_l2_table(s, request, index, n, flush, qed_sync_cb, &ret);
> > +    while (ret == -EINPROGRESS) {
> > +        qemu_aio_wait();
> > +    }
> > +
> > +    async_context_pop();
> > +    return ret;
> > +}
> > diff --git a/block/qed.c b/block/qed.c
> > index ea03798..6d7f4d7 100644
> > --- a/block/qed.c
> > +++ b/block/qed.c
> > @@ -139,6 +139,15 @@ static int qed_read_string(BlockDriverState *file, uint64_t offset, size_t n,
> >      return 0;
> >  }
> >  
> > +static QEDTable *qed_alloc_table(void *opaque)
> > +{
> > +    BDRVQEDState *s = opaque;
> > +
> > +    /* Honor O_DIRECT memory alignment requirements */
> > +    return qemu_blockalign(s->bs,
> > +                           s->header.cluster_size * s->header.table_size);
> > +}
> > +
> >  static int bdrv_qed_open(BlockDriverState *bs, int flags)
> >  {
> >      BDRVQEDState *s = bs->opaque;
> > @@ -207,11 +216,24 @@ static int bdrv_qed_open(BlockDriverState *bs, int flags)
> >              }
> >          }
> >      }
> > +
> > +    s->l1_table = qed_alloc_table(s);
> > +    qed_init_l2_cache(&s->l2_cache, qed_alloc_table, s);
> > +
> > +    ret = qed_read_l1_table_sync(s);
> > +    if (ret) {
> > +        qed_free_l2_cache(&s->l2_cache);
> 
> Why not initialize the L2 cache only if the read succeeded?

The consistency check patch adds more code after the call to
qed_read_l1_table_sync() where we really need the L2 cache, so it
will either look like this or need more goto labels.  I don't think it
matters.

Stefan


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 13:24                 ` Anthony Liguori
@ 2010-10-13 13:50                   ` Avi Kivity
  2010-10-13 14:07                     ` Stefan Hajnoczi
  2010-10-13 14:10                     ` Anthony Liguori
  0 siblings, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-13 13:50 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

  On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> On 10/13/2010 08:07 AM, Kevin Wolf wrote:
>> Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
>>> We can avoid it when a backing image is not used.  Your idea to check
>>> for zeroes in the backing image is neat too, it may well reduce the
>>> common case even for backing images.
>> The additional requirement is that we're extending the file and not
>> reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
>> doesn't work on host_devices anyway)
>
> Yes, that's a good point.
>
> BTW, I think we've decided that making it work on host_devices is not 
> that bad.
>
> We can add an additional feature called QED_F_PHYSICAL_SIZE.
>
> This feature will add another field to the header that contains an 
> offset immediately following the last cluster allocation.
>
> During a metadata scan, we can accurately recreate this field so we 
> only need to update this field whenever we clear the header dirty bit 
> (which means during an fsync()).

If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the 
header dirty bit.


>
> That means we can maintain the physical size without introducing 
> additional fsync()s in the allocation path.  Since we're already 
> writing out the header anyway, the write operation is basically free too.

I don't see how it is free.  It's an extra write.  The good news is that 
it's very easy to amortize.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 13:50                   ` Avi Kivity
@ 2010-10-13 14:07                     ` Stefan Hajnoczi
  2010-10-13 14:08                       ` Anthony Liguori
  2010-10-13 14:10                       ` Avi Kivity
  2010-10-13 14:10                     ` Anthony Liguori
  1 sibling, 2 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-13 14:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
>  On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> >On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> >>Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> >>>We can avoid it when a backing image is not used.  Your idea to check
> >>>for zeroes in the backing image is neat too, it may well reduce the
> >>>common case even for backing images.
> >>The additional requirement is that we're extending the file and not
> >>reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> >>doesn't work on host_devices anyway)
> >
> >Yes, that's a good point.
> >
> >BTW, I think we've decided that making it work on host_devices is
> >not that bad.
> >
> >We can add an additional feature called QED_F_PHYSICAL_SIZE.
> >
> >This feature will add another field to the header that contains an
> >offset immediately following the last cluster allocation.
> >
> >During a metadata scan, we can accurately recreate this field so
> >we only need to update this field whenever we clear the header
> >dirty bit (which means during an fsync()).
> 
> If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
> header dirty bit.

Do you mean we just need to check the physical size header field against
the actual file size?  If the two are different, then a consistency
check is forced.
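
A minimal sketch of the check being asked about, assuming a hypothetical
header field named physical_size and a caller that already knows the
file length.  The struct and function names are made up for
illustration and are not part of this patch series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down header: only the field under discussion. */
typedef struct {
    uint64_t physical_size;  /* offset just past the last allocated cluster */
} SketchHeader;

/* Returns true when the recorded size disagrees with the real file length,
 * i.e. when the metadata should not be trusted without a consistency check.
 */
static bool sketch_needs_check(const SketchHeader *h, uint64_t file_length)
{
    assert(h != NULL);
    return h->physical_size != file_length;
}
```

The comparison is only meaningful where the file length tracks
allocation, which is not the case on a host device.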

> >
> >That means we can maintain the physical size without introducing
> >additional fsync()s in the allocation path.  Since we're already
> >writing out the header anyway, the write operation is basically
> >free too.
> 
> I don't see how it is free.  It's an extra write.  The good news is
> that it's very easy to amortize.

We only need to update the header field on disk when we're already
updating the header, so it's not even an extra write operation.

Stefan


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:07                     ` Stefan Hajnoczi
@ 2010-10-13 14:08                       ` Anthony Liguori
  2010-10-13 14:10                       ` Avi Kivity
  1 sibling, 0 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-13 14:08 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, Christoph Hellwig, Avi Kivity, qemu-devel

On 10/13/2010 09:07 AM, Stefan Hajnoczi wrote:
>
>>> That means we can maintain the physical size without introducing
>>> additional fsync()s in the allocation path.  Since we're already
>>> writing out the header anyway, the write operation is basically
>>> free too.
>>>        
>> I don't see how it is free.  It's an extra write.  The good news is
>> that it's very easy to amortize.
>>      
> We only need to update the header field on disk when we're already
> updating the header, so it's not even an extra write operation.
>    

Because we're already writing out the sector that contains that field in 
the header.

Regards,

Anthony Liguori

> Stefan
>    


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 13:50                   ` Avi Kivity
  2010-10-13 14:07                     ` Stefan Hajnoczi
@ 2010-10-13 14:10                     ` Anthony Liguori
  1 sibling, 0 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-13 14:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig,
	Stefan Hajnoczi

On 10/13/2010 08:50 AM, Avi Kivity wrote:
>  On 10/13/2010 03:24 PM, Anthony Liguori wrote:
>> On 10/13/2010 08:07 AM, Kevin Wolf wrote:
>>> Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
>>>> We can avoid it when a backing image is not used.  Your idea to check
>>>> for zeroes in the backing image is neat too, it may well reduce the
>>>> common case even for backing images.
>>> The additional requirement is that we're extending the file and not
>>> reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
>>> doesn't work on host_devices anyway)
>>
>> Yes, that's a good point.
>>
>> BTW, I think we've decided that making it work on host_devices is not 
>> that bad.
>>
>> We can add an additional feature called QED_F_PHYSICAL_SIZE.
>>
>> This feature will add another field to the header that contains an 
>> offset immediately following the last cluster allocation.
>>
>> During a metadata scan, we can accurately recreate this field so we 
>> only need to update this field whenever we clear the header dirty bit 
>> (which means during an fsync()).
>
> If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the 
> header dirty bit.

Yes, autoclear bits are essentially granular header dirty bits.
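
To make that concrete, here is a rough sketch.  It assumes the usual
autoclear rule: an implementation that opens the image read-write
clears every autoclear bit it does not understand.  The bit value and
all names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignment, for illustration only. */
#define SKETCH_AF_PHYSICAL_SIZE  (UINT64_C(1) << 0)

/* Autoclear bits this (imaginary) implementation understands. */
#define SKETCH_AF_KNOWN          SKETCH_AF_PHYSICAL_SIZE

/* Called when opening read-write: drop every autoclear bit we do not
 * understand, so a later reader knows the data guarded by that bit may
 * be stale and has to be rebuilt.
 */
static uint64_t sketch_filter_autoclear(uint64_t autoclear_features)
{
    uint64_t kept = autoclear_features & SKETCH_AF_KNOWN;

    assert((kept & ~SKETCH_AF_KNOWN) == 0);
    return kept;
}

/* The physical size field is trusted only while its bit is still set. */
static bool sketch_physical_size_valid(uint64_t autoclear_features)
{
    return (autoclear_features & SKETCH_AF_PHYSICAL_SIZE) != 0;
}
```

Each bit then covers one piece of derived metadata, instead of one
dirty bit covering the whole header.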

Regards,

Anthony Liguori

>
>>
>> That means we can maintain the physical size without introducing 
>> additional fsync()s in the allocation path.  Since we're already 
>> writing out the header anyway, the write operation is basically free 
>> too.
>
> I don't see how it is free.  It's an extra write.  The good news is 
> that it's very easy to amortize.
>


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:07                     ` Stefan Hajnoczi
  2010-10-13 14:08                       ` Anthony Liguori
@ 2010-10-13 14:10                       ` Avi Kivity
  2010-10-13 14:11                         ` Anthony Liguori
  2010-10-14 11:06                         ` Stefan Hajnoczi
  1 sibling, 2 replies; 72+ messages in thread
From: Avi Kivity @ 2010-10-13 14:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

  On 10/13/2010 04:07 PM, Stefan Hajnoczi wrote:
> On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
> >   On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> >  >On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> >  >>Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> >  >>>We can avoid it when a backing image is not used.  Your idea to check
> >  >>>for zeroes in the backing image is neat too, it may well reduce the
> >  >>>common case even for backing images.
> >  >>The additional requirement is that we're extending the file and not
> >  >>reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> >  >>doesn't work on host_devices anyway)
> >  >
> >  >Yes, that's a good point.
> >  >
> >  >BTW, I think we've decided that making it work on host_devices is
> >  >not that bad.
> >  >
> >  >We can add an additional feature called QED_F_PHYSICAL_SIZE.
> >  >
> >  >This feature will add another field to the header that contains an
> >  >offset immediately following the last cluster allocation.
> >  >
> >  >During a metadata scan, we can accurately recreate this field so
> >  >we only need to update this field whenever we clear the header
> >  >dirty bit (which means during an fsync()).
> >
> >  If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
> >  header dirty bit.
>
> Do you mean we just need to check the physical size header field against
> the actual file size?  If the two are different, then a consistency
> check is forced.

I thought you'd only use a header size field when you don't have a real 
file size.  Why do you need both?

> >  >
> >  >That means we can maintain the physical size without introducing
> >  >additional fsync()s in the allocation path.  Since we're already
> >  >writing out the header anyway, the write operation is basically
> >  >free too.
> >
> >  I don't see how it is free.  It's an extra write.  The good news is
> >  that it's very easy to amortize.
>
> We only need to update the header field on disk when we're already
> updating the header, so it's not even an extra write operation.

Why would you ever update the header, apart from relocating L1 for some 
reason?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:10                       ` Avi Kivity
@ 2010-10-13 14:11                         ` Anthony Liguori
  2010-10-13 14:16                           ` Avi Kivity
  2010-10-14 11:06                         ` Stefan Hajnoczi
  1 sibling, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-13 14:11 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel


>> > >
>> > >That means we can maintain the physical size without introducing
>> > >additional fsync()s in the allocation path.  Since we're already
>> > >writing out the header anyway, the write operation is basically
>> > >free too.
>> >
>> >  I don't see how it is free.  It's an extra write.  The good news is
>> >  that it's very easy to amortize.
>>
>> We only need to update the header field on disk when we're already
>> updating the header, so it's not even an extra write operation.
>
> Why would you ever update the header, apart from relocating L1 for 
> some reason?

To update the L1/L2 tables clean bit.  That's what prevents a check in 
the normal case where you have a clean shutdown.

Regards,

Anthony Liguori


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:11                         ` Anthony Liguori
@ 2010-10-13 14:16                           ` Avi Kivity
  2010-10-13 14:53                             ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-13 14:16 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>> Why would you ever update the header, apart from relocating L1 for 
>> some reason?
>
>
> To update the L1/L2 tables clean bit.  That's what prevents a check in 
> the normal case where you have a clean shutdown.

I see - so you wouldn't update it every allocation, only when the disk 
has been quiet for a while.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:16                           ` Avi Kivity
@ 2010-10-13 14:53                             ` Anthony Liguori
  2010-10-13 15:08                               ` Avi Kivity
  0 siblings, 1 reply; 72+ messages in thread
From: Anthony Liguori @ 2010-10-13 14:53 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

On 10/13/2010 09:16 AM, Avi Kivity wrote:
>  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>>> Why would you ever update the header, apart from relocating L1 for 
>>> some reason?
>>
>>
>> To update the L1/L2 tables clean bit.  That's what prevents a check 
>> in the normal case where you have a clean shutdown.
>
> I see - so you wouldn't update it every allocation, only when the disk 
> has been quiet for a while.

Right, the current plan is to flush the header dirty bit on shutdown or 
whenever there is an explicit flush of the device.  Currently that is
caused by either a guest-initiated flush or an L1 update.  We also plan
to add a timer-based flush such that a flush is scheduled for some 
period of time (like 5 minutes) after the dirty bit is set.

The end result should be that the only window for requiring a metadata 
scan is if a crash occurs within 5 minutes of a cluster allocation and 
an explicit flush has not occurred for some other reason.

Regards,

Anthony Liguori


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:53                             ` Anthony Liguori
@ 2010-10-13 15:08                               ` Avi Kivity
  2010-10-13 15:42                                 ` Anthony Liguori
  0 siblings, 1 reply; 72+ messages in thread
From: Avi Kivity @ 2010-10-13 15:08 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

  On 10/13/2010 04:53 PM, Anthony Liguori wrote:
> On 10/13/2010 09:16 AM, Avi Kivity wrote:
>>  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>>>> Why would you ever update the header, apart from relocating L1 for 
>>>> some reason?
>>>
>>>
>>> To update the L1/L2 tables clean bit.  That's what prevents a check 
>>> in the normal case where you have a clean shutdown.
>>
>> I see - so you wouldn't update it every allocation, only when the 
>> disk has been quiet for a while.
>
> Right, the current plan is to flush the header dirty bit on shutdown 
> or whenever there is an explicit flush of the device.  Currently that is
> caused by either a guest-initiated flush or an L1 update.

That does add an extra write (and a new write+flush later to mark the 
header dirty again when you start allocating).  I'd drop it and only use 
the timer.

In fact, it adds an extra flush too.  The sequence

1 L1 update
2 mark clean
3 flush

is unsafe since you can crash between 2 and 3, and only 2 makes it.  So
I'd do something like

1 opportunistic flush (for whatever reason)
2 set timer
3 no intervening metadata changes
4 mark clean
5 no intervening metadata changes
6 mark dirty
7 flush
8 metadata changes


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 15:08                               ` Avi Kivity
@ 2010-10-13 15:42                                 ` Anthony Liguori
  0 siblings, 0 replies; 72+ messages in thread
From: Anthony Liguori @ 2010-10-13 15:42 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Christoph Hellwig, Stefan Hajnoczi, qemu-devel

On 10/13/2010 10:08 AM, Avi Kivity wrote:
>  On 10/13/2010 04:53 PM, Anthony Liguori wrote:
>> On 10/13/2010 09:16 AM, Avi Kivity wrote:
>>>  On 10/13/2010 04:11 PM, Anthony Liguori wrote:
>>>>> Why would you ever update the header, apart from relocating L1 for 
>>>>> some reason?
>>>>
>>>>
>>>> To update the L1/L2 tables clean bit.  That's what prevents a check 
>>>> in the normal case where you have a clean shutdown.
>>>
>>> I see - so you wouldn't update it every allocation, only when the 
>>> disk has been quiet for a while.
>>
>> Right, the current plan is to flush the header dirty bit on shutdown 
>> or whenever there is an explicit flush of the device.  Currently that
>> is caused by either a guest-initiated flush or an L1 update.
>
> That does add an extra write (and a new write+flush later to mark the 
> header dirty again when you start allocating).  I'd drop it and only 
> use the timer.
>
> In fact, it adds an extra flush too.  The sequence
>
> 1 L1 update
> 2 mark clean
> 3 flush
>
> is unsafe since you can crash between 2 and 3, and only 2 makes it.  So
> I'd do something like

You've got the order wrong.

1. L1 update
2. flush()
3. mark clean

If (3) doesn't make it to disk, that's okay.  It just causes an extra scan.
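
The difference between the two orderings can be checked with a toy
model.  This is only a sketch: it assumes a write cache where, on a
crash, any unflushed write may or may not have reached the disk, and
every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool l1_updated;   /* new L1 contents are present */
    bool clean;        /* header says "no scan needed" */
} SketchImage;

typedef struct {
    SketchImage durable;  /* what is certain to survive a crash */
    SketchImage cached;   /* latest contents, possibly not yet on disk */
} SketchDisk;

static void sketch_flush(SketchDisk *d)
{
    d->durable = d->cached;
}

/* Crash: each unflushed field independently may or may not have landed. */
static SketchImage sketch_crash(const SketchDisk *d, bool l1_lands,
                                bool clean_lands)
{
    SketchImage out;

    assert(d != NULL);
    out = d->durable;
    if (l1_lands) {
        out.l1_updated = d->cached.l1_updated;
    }
    if (clean_lands) {
        out.clean = d->cached.clean;
    }
    return out;
}

/* True if some crash outcome leaves a clean header without the L1 update. */
static bool sketch_any_crash_corrupts(const SketchDisk *d)
{
    for (int l1 = 0; l1 <= 1; l1++) {
        for (int cl = 0; cl <= 1; cl++) {
            SketchImage img = sketch_crash(d, l1 != 0, cl != 0);
            if (img.clean && !img.l1_updated) {
                return true;
            }
        }
    }
    return false;
}

/* Walk one ordering and test a crash after every step. */
static bool sketch_sequence_can_corrupt(bool flush_before_clean)
{
    SketchDisk d = { { false, false }, { false, false } };
    bool bad = false;

    d.cached.l1_updated = true;             /* 1. L1 update */
    bad |= sketch_any_crash_corrupts(&d);

    if (flush_before_clean) {
        sketch_flush(&d);                   /* 2. flush */
        bad |= sketch_any_crash_corrupts(&d);
        d.cached.clean = true;              /* 3. mark clean */
    } else {
        d.cached.clean = true;              /* 2. mark clean */
        bad |= sketch_any_crash_corrupts(&d);
        sketch_flush(&d);                   /* 3. flush */
    }
    bad |= sketch_any_crash_corrupts(&d);
    return bad;
}
```

In this model sketch_sequence_can_corrupt(true) finds no bad crash
point, while sketch_sequence_can_corrupt(false) does, which matches
the concern about marking clean before the flush.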

> 1 opportunistic flush (for whatever reason)
> 2 set timer
> 3 no intervening metadata changes
> 4 mark clean
> 5 no intervening metadata changes
> 6 mark dirty
> 7 flush
> 8 metadata changes

Not sure I see why we set the timer in step 2 as opposed to:

0 clear scheduled flush (if necessary)
1 opportunistic flush (for whatever reason)
2 mark clean
3 no intervening metadata changes
4 mark dirty
5 flush
6 schedule flush (in 5 minutes)
7 metadata changes

Which is now recorded at http://wiki.qemu.org/Features/QED/ScanAvoidance 
so we can keep track of this.
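
Roughly, in code, the sequence could be driven by two entry points.
This is a sketch only: the operation log stands in for real I/O, and
all names are hypothetical rather than taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum SketchOp {
    OP_CANCEL_TIMER, OP_FLUSH, OP_MARK_CLEAN,
    OP_MARK_DIRTY, OP_SCHEDULE_TIMER, OP_METADATA_WRITE
};

typedef struct {
    bool header_clean;      /* in-memory view of the header bit */
    bool timer_armed;       /* a delayed flush is scheduled */
    enum SketchOp log[16];  /* operations issued so far, in order */
    size_t n;
} SketchState;

static void sketch_emit(SketchState *s, enum SketchOp op)
{
    assert(s->n < sizeof(s->log) / sizeof(s->log[0]));
    s->log[s->n++] = op;
}

/* Steps 0-2: any flush (guest, timer, shutdown) is a chance to go clean. */
static void sketch_opportunistic_flush(SketchState *s)
{
    if (s->timer_armed) {
        sketch_emit(s, OP_CANCEL_TIMER);
        s->timer_armed = false;
    }
    sketch_emit(s, OP_FLUSH);
    if (!s->header_clean) {
        sketch_emit(s, OP_MARK_CLEAN);
        s->header_clean = true;
    }
}

/* Steps 4-7: the dirty mark must be on disk before metadata changes. */
static void sketch_metadata_change(SketchState *s)
{
    if (s->header_clean) {
        sketch_emit(s, OP_MARK_DIRTY);
        sketch_emit(s, OP_FLUSH);
        sketch_emit(s, OP_SCHEDULE_TIMER);
        s->header_clean = false;
        s->timer_armed = true;
    }
    sketch_emit(s, OP_METADATA_WRITE);
}
```

Only the first metadata change after a clean period pays for the extra
write and flush.  Later changes go straight through until the next
flush marks the header clean again.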

Regards,

Anthony Liguori

>
>


* Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support
  2010-10-13 14:10                       ` Avi Kivity
  2010-10-13 14:11                         ` Anthony Liguori
@ 2010-10-14 11:06                         ` Stefan Hajnoczi
  1 sibling, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-14 11:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Christoph Hellwig

On Wed, Oct 13, 2010 at 04:10:25PM +0200, Avi Kivity wrote:
>  On 10/13/2010 04:07 PM, Stefan Hajnoczi wrote:
> >On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
> >>   On 10/13/2010 03:24 PM, Anthony Liguori wrote:
> >>  >On 10/13/2010 08:07 AM, Kevin Wolf wrote:
> >>  >>Am 13.10.2010 14:13, schrieb Stefan Hajnoczi:
> >>  >>>We can avoid it when a backing image is not used.  Your idea to check
> >>  >>>for zeroes in the backing image is neat too, it may well reduce the
> >>  >>>common case even for backing images.
> >>  >>The additional requirement is that we're extending the file and not
> >>  >>reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
> >>  >>doesn't work on host_devices anyway)
> >>  >
> >>  >Yes, that's a good point.
> >>  >
> >>  >BTW, I think we've decided that making it work on host_devices is
> >>  >not that bad.
> >>  >
> >>  >We can add an additional feature called QED_F_PHYSICAL_SIZE.
> >>  >
> >>  >This feature will add another field to the header that contains an
> >>  >offset immediately following the last cluster allocation.
> >>  >
> >>  >During a metadata scan, we can accurately recreate this field so
> >>  >we only need to update this field whenever we clear the header
> >>  >dirty bit (which means during an fsync()).
> >>
> >>  If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
> >>  header dirty bit.
> >
> >Do you mean we just need to check the physical size header field against
> >the actual file size?  If the two are different, then a consistency
> >check is forced.
> 
> I thought you'd only use a header size field when you don't have a
> real file size.  Why do you need both?

I probably didn't understand correctly :).  You said with
QED_F_PHYSICAL_SIZE autoclear you don't need the header dirty bit.  I
don't see how it eliminates the need for the header dirty bit.

Stefan


* Re: [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format
  2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
                   ` (7 preceding siblings ...)
  2010-10-11 13:21 ` [Qemu-devel] Re: [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Kevin Wolf
@ 2010-10-16  7:51 ` Stefan Hajnoczi
  8 siblings, 0 replies; 72+ messages in thread
From: Stefan Hajnoczi @ 2010-10-16  7:51 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, Avi Kivity, qemu-devel, Christoph Hellwig

Thanks for the review comments and discussions.  I will send v3 when I
finish making the changes from your comments.

Stefan


Thread overview: 72+ messages
2010-10-08 15:48 [Qemu-devel] [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Stefan Hajnoczi
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 1/7] qcow2: Make get_bits_from_size() common Stefan Hajnoczi
2010-10-08 18:01   ` [Qemu-devel] " Anthony Liguori
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values Stefan Hajnoczi
2010-10-11 11:09   ` [Qemu-devel] " Kevin Wolf
2010-10-13  9:15   ` [Qemu-devel] " Markus Armbruster
2010-10-13  9:28     ` Kevin Wolf
2010-10-13 10:58       ` Stefan Hajnoczi
2010-10-13 10:25   ` [Qemu-devel] " Avi Kivity
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 3/7] docs: Add QED image format specification Stefan Hajnoczi
2010-10-10  9:20   ` [Qemu-devel] " Avi Kivity
2010-10-11 10:09     ` Stefan Hajnoczi
2010-10-11 13:04       ` Avi Kivity
2010-10-11 13:42         ` Stefan Hajnoczi
2010-10-11 13:44           ` Avi Kivity
2010-10-11 14:06             ` Stefan Hajnoczi
2010-10-11 14:12               ` Avi Kivity
2010-10-11 15:02             ` Anthony Liguori
2010-10-11 15:24               ` Avi Kivity
2010-10-11 15:41                 ` Anthony Liguori
2010-10-11 15:47                   ` Avi Kivity
2010-10-11 14:54         ` Anthony Liguori
2010-10-11 14:58           ` Avi Kivity
2010-10-11 15:49             ` Anthony Liguori
2010-10-11 16:02               ` Avi Kivity
2010-10-11 16:10                 ` Anthony Liguori
2010-10-12 10:25                   ` Avi Kivity
2010-10-11 13:58   ` Kevin Wolf
2010-10-11 15:30     ` Stefan Hajnoczi
2010-10-11 15:39       ` Avi Kivity
2010-10-11 15:46         ` Stefan Hajnoczi
2010-10-11 16:18           ` Anthony Liguori
2010-10-11 17:14             ` Anthony Liguori
2010-10-12  8:07               ` Kevin Wolf
2010-10-12 13:16                 ` Stefan Hajnoczi
2010-10-12 13:32                   ` Anthony Liguori
2010-10-11 15:50       ` Kevin Wolf
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 4/7] qed: Add QEMU Enhanced Disk image format Stefan Hajnoczi
2010-10-11 15:16   ` [Qemu-devel] " Kevin Wolf
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 5/7] qed: Table, L2 cache, and cluster functions Stefan Hajnoczi
2010-10-12 14:44   ` [Qemu-devel] " Kevin Wolf
2010-10-13 13:41     ` Stefan Hajnoczi
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 6/7] qed: Read/write support Stefan Hajnoczi
2010-10-10  9:10   ` [Qemu-devel] " Avi Kivity
2010-10-11 10:37     ` Stefan Hajnoczi
2010-10-11 13:10       ` Avi Kivity
2010-10-11 13:55         ` Stefan Hajnoczi
2010-10-11 14:57         ` Anthony Liguori
2010-10-12 15:08   ` Kevin Wolf
2010-10-12 15:22     ` Anthony Liguori
2010-10-12 15:39       ` Kevin Wolf
2010-10-12 15:59         ` Stefan Hajnoczi
2010-10-12 16:16           ` Anthony Liguori
2010-10-12 16:21             ` Avi Kivity
2010-10-13 12:13             ` Stefan Hajnoczi
2010-10-13 13:07               ` Kevin Wolf
2010-10-13 13:24                 ` Anthony Liguori
2010-10-13 13:50                   ` Avi Kivity
2010-10-13 14:07                     ` Stefan Hajnoczi
2010-10-13 14:08                       ` Anthony Liguori
2010-10-13 14:10                       ` Avi Kivity
2010-10-13 14:11                         ` Anthony Liguori
2010-10-13 14:16                           ` Avi Kivity
2010-10-13 14:53                             ` Anthony Liguori
2010-10-13 15:08                               ` Avi Kivity
2010-10-13 15:42                                 ` Anthony Liguori
2010-10-14 11:06                         ` Stefan Hajnoczi
2010-10-13 14:10                     ` Anthony Liguori
2010-10-08 15:48 ` [Qemu-devel] [PATCH v2 7/7] qed: Consistency check support Stefan Hajnoczi
2010-10-11 13:21 ` [Qemu-devel] Re: [PATCH v2 0/7] qed: Add QEMU Enhanced Disk format Kevin Wolf
2010-10-11 15:37   ` Stefan Hajnoczi
2010-10-16  7:51 ` [Qemu-devel] " Stefan Hajnoczi
