Subject: qcow2 corruption observed, fixed by reverting old change
From: Jamie Lokier @ 2009-02-11 7:00 UTC (permalink / raw)
To: qemu-devel; +Cc: kvm-devel
Hi,
As you see from the subject, I'm getting qcow2 corruption.
I have a Windows 2000 guest which boots and runs fine in kvm-72, fails
with a blue-screen indicating file corruption errors in kvm-73 through
to kvm-83 (the latest), and succeeds if I replace block-qcow2.c with
the version from kvm-72.
The blue screen appears towards the end of the boot sequence, and
shows only briefly before rebooting. It says:
STOP: c0000218 (Registry File Failure)
The registry cannot load the hive (file):
\SystemRoot\System32\Config\SOFTWARE
or its log or alternate.
It is corrupt, absent, or not writable.
Beginning dump of physical memory
Physical memory dump complete. Contact your system administrator or
technical support [...?]
Although there are many ways to make Windows blue screen in KVM, in
this case I've narrowed it down to the difference in
qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
If I build kvm-83, except with block-qcow2.c reverted back to the
kvm-72 version, my guest boots and runs.
Since the changes to block-qcow2.c only affect how it reads and writes
the format, not device emulation, this suggests a bug in the new version.
From kvm-73 to kvm-83 there have been further changes to block-qcow2.c,
but I still see the same symptom.
The bug isn't in reading, so it must be caused by writing. When I use
"qemu-img convert" to convert my qcow2 file to a raw file, with broken
and fixed versions, it produces the same raw file. Also, when I use
"-snapshot", kvm-83 works without the patch! So there's no evident
corruption if the qcow2 file is only read.
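The read-path check above can be sketched as a small script. Everything here is a placeholder sketch, not my actual setup: the image name and the two qemu-img binary paths are assumptions you would adjust to your own builds, and the script just skips if they are missing.

```shell
#!/bin/sh
# Compare raw conversions of the same qcow2 image made by two builds.
# Placeholders: adjust IMG and the two binary paths to your setup.
IMG=win2000.qcow2
OLD_QEMU_IMG=./kvm-72/qemu/qemu-img
NEW_QEMU_IMG=./kvm-83/qemu/qemu-img

if [ ! -x "$OLD_QEMU_IMG" ] || [ ! -x "$NEW_QEMU_IMG" ] || [ ! -f "$IMG" ]; then
    echo "skipping: builds or test image not available"
    exit 0
fi

"$OLD_QEMU_IMG" convert -O raw "$IMG" raw-old.img
"$NEW_QEMU_IMG" convert -O raw "$IMG" raw-new.img

# Identical output means the corruption is not on the read path.
if cmp -s raw-old.img raw-new.img; then
    echo "read path identical"
else
    echo "read path differs"
fi
```

If the two raw files match, as they did for me, the write path is the only remaining suspect.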
Applying the patch below to kvm-83 fixes it. You probably don't want
to apply this patch as-is, since it reverts some good-looking qcow2
optimisations, but take it as a big clue that the optimisations
introduced a bug which shows up as data corruption in the guest when
some writing is done. I don't know more than that.
If an official QEMU release is about to happen, this should probably
be fixed before it's unleashed on existing guests :-)
Regards,
-- Jamie
--- kvm-83-release/qemu/block-qcow2.c 2009-01-13 13:29:42.000000000 +0000
+++ kvm-83-fixed/qemu/block-qcow2.c 2009-02-11 05:37:53.000000000 +0000
@@ -52,8 +52,6 @@
#define QCOW_CRYPT_NONE 0
#define QCOW_CRYPT_AES 1
-#define QCOW_MAX_CRYPT_CLUSTERS 32
-
/* indicate that the refcount of the referenced cluster is exactly one. */
#define QCOW_OFLAG_COPIED (1LL << 63)
/* indicate that the cluster is compressed (they never have the copied flag) */
@@ -61,6 +59,10 @@
#define REFCOUNT_SHIFT 1 /* refcount size is 2 bytes */
+#ifndef offsetof
+#define offsetof(type, field) ((size_t) &((type *)0)->field)
+#endif
+
typedef struct QCowHeader {
uint32_t magic;
uint32_t version;
@@ -269,8 +271,7 @@
if (!s->cluster_cache)
goto fail;
/* one more sector for decompressed data alignment */
- s->cluster_data = qemu_malloc(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size
- + 512);
+ s->cluster_data = qemu_malloc(s->cluster_size + 512);
if (!s->cluster_data)
goto fail;
s->cluster_cache_offset = -1;
@@ -437,7 +438,8 @@
int new_l1_size, new_l1_size2, ret, i;
uint64_t *new_l1_table;
uint64_t new_l1_table_offset;
- uint8_t data[12];
+ uint64_t data64;
+ uint32_t data32;
new_l1_size = s->l1_size;
if (min_size <= new_l1_size)
@@ -467,10 +469,13 @@
new_l1_table[i] = be64_to_cpu(new_l1_table[i]);
/* set new table */
- cpu_to_be32w((uint32_t*)data, new_l1_size);
- cpu_to_be64w((uint64_t*)(data + 4), new_l1_table_offset);
- if (bdrv_pwrite(s->hd, offsetof(QCowHeader, l1_size), data,
- sizeof(data)) != sizeof(data))
+ data64 = cpu_to_be64(new_l1_table_offset);
+ if (bdrv_pwrite(s->hd, offsetof(QCowHeader, l1_table_offset),
+ &data64, sizeof(data64)) != sizeof(data64))
+ goto fail;
+ data32 = cpu_to_be32(new_l1_size);
+ if (bdrv_pwrite(s->hd, offsetof(QCowHeader, l1_size),
+ &data32, sizeof(data32)) != sizeof(data32))
goto fail;
qemu_free(s->l1_table);
free_clusters(bs, s->l1_table_offset, s->l1_size * sizeof(uint64_t));
@@ -483,549 +488,169 @@
return -EIO;
}
-/*
- * seek_l2_table
- *
- * seek l2_offset in the l2_cache table
- * if not found, return NULL,
- * if found,
- * increments the l2 cache hit count of the entry,
- * if counter overflow, divide by two all counters
- * return the pointer to the l2 cache entry
- *
- */
-
-static uint64_t *seek_l2_table(BDRVQcowState *s, uint64_t l2_offset)
-{
- int i, j;
-
- for(i = 0; i < L2_CACHE_SIZE; i++) {
- if (l2_offset == s->l2_cache_offsets[i]) {
- /* increment the hit count */
- if (++s->l2_cache_counts[i] == 0xffffffff) {
- for(j = 0; j < L2_CACHE_SIZE; j++) {
- s->l2_cache_counts[j] >>= 1;
- }
- }
- return s->l2_cache + (i << s->l2_bits);
- }
- }
- return NULL;
-}
-
-/*
- * l2_load
- *
- * Loads a L2 table into memory. If the table is in the cache, the cache
- * is used; otherwise the L2 table is loaded from the image file.
- *
- * Returns a pointer to the L2 table on success, or NULL if the read from
- * the image file failed.
- */
-
-static uint64_t *l2_load(BlockDriverState *bs, uint64_t l2_offset)
-{
- BDRVQcowState *s = bs->opaque;
- int min_index;
- uint64_t *l2_table;
-
- /* seek if the table for the given offset is in the cache */
-
- l2_table = seek_l2_table(s, l2_offset);
- if (l2_table != NULL)
- return l2_table;
-
- /* not found: load a new entry in the least used one */
-
- min_index = l2_cache_new_entry(bs);
- l2_table = s->l2_cache + (min_index << s->l2_bits);
- if (bdrv_pread(s->hd, l2_offset, l2_table, s->l2_size * sizeof(uint64_t)) !=
- s->l2_size * sizeof(uint64_t))
- return NULL;
- s->l2_cache_offsets[min_index] = l2_offset;
- s->l2_cache_counts[min_index] = 1;
-
- return l2_table;
-}
-
-/*
- * l2_allocate
- *
- * Allocate a new l2 entry in the file. If l1_index points to an already
- * used entry in the L2 table (i.e. we are doing a copy on write for the L2
- * table) copy the contents of the old L2 table into the newly allocated one.
- * Otherwise the new table is initialized with zeros.
- *
- */
-
-static uint64_t *l2_allocate(BlockDriverState *bs, int l1_index)
-{
- BDRVQcowState *s = bs->opaque;
- int min_index;
- uint64_t old_l2_offset, tmp;
- uint64_t *l2_table, l2_offset;
-
- old_l2_offset = s->l1_table[l1_index];
-
- /* allocate a new l2 entry */
-
- l2_offset = alloc_clusters(bs, s->l2_size * sizeof(uint64_t));
-
- /* update the L1 entry */
-
- s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED;
-
- tmp = cpu_to_be64(l2_offset | QCOW_OFLAG_COPIED);
- if (bdrv_pwrite(s->hd, s->l1_table_offset + l1_index * sizeof(tmp),
- &tmp, sizeof(tmp)) != sizeof(tmp))
- return NULL;
-
- /* allocate a new entry in the l2 cache */
-
- min_index = l2_cache_new_entry(bs);
- l2_table = s->l2_cache + (min_index << s->l2_bits);
-
- if (old_l2_offset == 0) {
- /* if there was no old l2 table, clear the new table */
- memset(l2_table, 0, s->l2_size * sizeof(uint64_t));
- } else {
- /* if there was an old l2 table, read it from the disk */
- if (bdrv_pread(s->hd, old_l2_offset,
- l2_table, s->l2_size * sizeof(uint64_t)) !=
- s->l2_size * sizeof(uint64_t))
- return NULL;
- }
- /* write the l2 table to the file */
- if (bdrv_pwrite(s->hd, l2_offset,
- l2_table, s->l2_size * sizeof(uint64_t)) !=
- s->l2_size * sizeof(uint64_t))
- return NULL;
-
- /* update the l2 cache entry */
-
- s->l2_cache_offsets[min_index] = l2_offset;
- s->l2_cache_counts[min_index] = 1;
-
- return l2_table;
-}
-
-static int size_to_clusters(BDRVQcowState *s, int64_t size)
-{
- return (size + (s->cluster_size - 1)) >> s->cluster_bits;
-}
-
-static int count_contiguous_clusters(uint64_t nb_clusters, int cluster_size,
- uint64_t *l2_table, uint64_t start, uint64_t mask)
-{
- int i;
- uint64_t offset = be64_to_cpu(l2_table[0]) & ~mask;
-
- if (!offset)
- return 0;
-
- for (i = start; i < start + nb_clusters; i++)
- if (offset + i * cluster_size != (be64_to_cpu(l2_table[i]) & ~mask))
- break;
-
- return (i - start);
-}
-
-static int count_contiguous_free_clusters(uint64_t nb_clusters, uint64_t *l2_table)
-{
- int i = 0;
-
- while(nb_clusters-- && l2_table[i] == 0)
- i++;
-
- return i;
-}
-
-/*
- * get_cluster_offset
+/* 'allocate' is:
*
- * For a given offset of the disk image, return cluster offset in
- * qcow2 file.
+ * 0 not to allocate.
*
- * on entry, *num is the number of contiguous clusters we'd like to
- * access following offset.
+ * 1 to allocate a normal cluster (for sector indexes 'n_start' to
+ * 'n_end')
*
- * on exit, *num is the number of contiguous clusters we can read.
- *
- * Return 1, if the offset is found
- * Return 0, otherwise.
+ * 2 to allocate a compressed cluster of size
+ * 'compressed_size'. 'compressed_size' must be > 0 and <
+ * cluster_size
*
+ * return 0 if not allocated.
*/
-
static uint64_t get_cluster_offset(BlockDriverState *bs,
- uint64_t offset, int *num)
+ uint64_t offset, int allocate,
+ int compressed_size,
+ int n_start, int n_end)
{
BDRVQcowState *s = bs->opaque;
- int l1_index, l2_index;
- uint64_t l2_offset, *l2_table, cluster_offset;
- int l1_bits, c;
- int index_in_cluster, nb_available, nb_needed, nb_clusters;
-
- index_in_cluster = (offset >> 9) & (s->cluster_sectors - 1);
- nb_needed = *num + index_in_cluster;
-
- l1_bits = s->l2_bits + s->cluster_bits;
-
- /* compute how many bytes there are between the offset and
- * the end of the l1 entry
- */
-
- nb_available = (1 << l1_bits) - (offset & ((1 << l1_bits) - 1));
-
- /* compute the number of available sectors */
-
- nb_available = (nb_available >> 9) + index_in_cluster;
-
- cluster_offset = 0;
-
- /* seek the the l2 offset in the l1 table */
-
- l1_index = offset >> l1_bits;
- if (l1_index >= s->l1_size)
- goto out;
-
- l2_offset = s->l1_table[l1_index];
-
- /* seek the l2 table of the given l2 offset */
-
- if (!l2_offset)
- goto out;
-
- /* load the l2 table in memory */
-
- l2_offset &= ~QCOW_OFLAG_COPIED;
- l2_table = l2_load(bs, l2_offset);
- if (l2_table == NULL)
- return 0;
-
- /* find the cluster offset for the given disk offset */
-
- l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
- cluster_offset = be64_to_cpu(l2_table[l2_index]);
- nb_clusters = size_to_clusters(s, nb_needed << 9);
-
- if (!cluster_offset) {
- /* how many empty clusters ? */
- c = count_contiguous_free_clusters(nb_clusters, &l2_table[l2_index]);
- } else {
- /* how many allocated clusters ? */
- c = count_contiguous_clusters(nb_clusters, s->cluster_size,
- &l2_table[l2_index], 0, QCOW_OFLAG_COPIED);
- }
-
- nb_available = (c * s->cluster_sectors);
-out:
- if (nb_available > nb_needed)
- nb_available = nb_needed;
-
- *num = nb_available - index_in_cluster;
-
- return cluster_offset & ~QCOW_OFLAG_COPIED;
-}
-
-/*
- * free_any_clusters
- *
- * free clusters according to its type: compressed or not
- *
- */
-
-static void free_any_clusters(BlockDriverState *bs,
- uint64_t cluster_offset, int nb_clusters)
-{
- BDRVQcowState *s = bs->opaque;
-
- /* free the cluster */
-
- if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
- int nb_csectors;
- nb_csectors = ((cluster_offset >> s->csize_shift) &
- s->csize_mask) + 1;
- free_clusters(bs, (cluster_offset & s->cluster_offset_mask) & ~511,
- nb_csectors * 512);
- return;
- }
-
- free_clusters(bs, cluster_offset, nb_clusters << s->cluster_bits);
-
- return;
-}
-
-/*
- * get_cluster_table
- *
- * for a given disk offset, load (and allocate if needed)
- * the l2 table.
- *
- * the l2 table offset in the qcow2 file and the cluster index
- * in the l2 table are given to the caller.
- *
- */
-
-static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
- uint64_t **new_l2_table,
- uint64_t *new_l2_offset,
- int *new_l2_index)
-{
- BDRVQcowState *s = bs->opaque;
- int l1_index, l2_index, ret;
- uint64_t l2_offset, *l2_table;
-
- /* seek the the l2 offset in the l1 table */
+ int min_index, i, j, l1_index, l2_index, ret;
+ uint64_t l2_offset, *l2_table, cluster_offset, tmp, old_l2_offset;
l1_index = offset >> (s->l2_bits + s->cluster_bits);
if (l1_index >= s->l1_size) {
- ret = grow_l1_table(bs, l1_index + 1);
- if (ret < 0)
+ /* outside l1 table is allowed: we grow the table if needed */
+ if (!allocate)
+ return 0;
+ if (grow_l1_table(bs, l1_index + 1) < 0)
return 0;
}
l2_offset = s->l1_table[l1_index];
+ if (!l2_offset) {
+ if (!allocate)
+ return 0;
+ l2_allocate:
+ old_l2_offset = l2_offset;
+ /* allocate a new l2 entry */
+ l2_offset = alloc_clusters(bs, s->l2_size * sizeof(uint64_t));
+ /* update the L1 entry */
+ s->l1_table[l1_index] = l2_offset | QCOW_OFLAG_COPIED;
+ tmp = cpu_to_be64(l2_offset | QCOW_OFLAG_COPIED);
+ if (bdrv_pwrite(s->hd, s->l1_table_offset + l1_index * sizeof(tmp),
+ &tmp, sizeof(tmp)) != sizeof(tmp))
+ return 0;
+ min_index = l2_cache_new_entry(bs);
+ l2_table = s->l2_cache + (min_index << s->l2_bits);
- /* seek the l2 table of the given l2 offset */
-
- if (l2_offset & QCOW_OFLAG_COPIED) {
- /* load the l2 table in memory */
- l2_offset &= ~QCOW_OFLAG_COPIED;
- l2_table = l2_load(bs, l2_offset);
- if (l2_table == NULL)
+ if (old_l2_offset == 0) {
+ memset(l2_table, 0, s->l2_size * sizeof(uint64_t));
+ } else {
+ if (bdrv_pread(s->hd, old_l2_offset,
+ l2_table, s->l2_size * sizeof(uint64_t)) !=
+ s->l2_size * sizeof(uint64_t))
+ return 0;
+ }
+ if (bdrv_pwrite(s->hd, l2_offset,
+ l2_table, s->l2_size * sizeof(uint64_t)) !=
+ s->l2_size * sizeof(uint64_t))
return 0;
} else {
- if (l2_offset)
- free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t));
- l2_table = l2_allocate(bs, l1_index);
- if (l2_table == NULL)
+ if (!(l2_offset & QCOW_OFLAG_COPIED)) {
+ if (allocate) {
+ free_clusters(bs, l2_offset, s->l2_size * sizeof(uint64_t));
+ goto l2_allocate;
+ }
+ } else {
+ l2_offset &= ~QCOW_OFLAG_COPIED;
+ }
+ for(i = 0; i < L2_CACHE_SIZE; i++) {
+ if (l2_offset == s->l2_cache_offsets[i]) {
+ /* increment the hit count */
+ if (++s->l2_cache_counts[i] == 0xffffffff) {
+ for(j = 0; j < L2_CACHE_SIZE; j++) {
+ s->l2_cache_counts[j] >>= 1;
+ }
+ }
+ l2_table = s->l2_cache + (i << s->l2_bits);
+ goto found;
+ }
+ }
+ /* not found: load a new entry in the least used one */
+ min_index = l2_cache_new_entry(bs);
+ l2_table = s->l2_cache + (min_index << s->l2_bits);
+ if (bdrv_pread(s->hd, l2_offset, l2_table, s->l2_size * sizeof(uint64_t)) !=
+ s->l2_size * sizeof(uint64_t))
return 0;
- l2_offset = s->l1_table[l1_index] & ~QCOW_OFLAG_COPIED;
}
-
- /* find the cluster offset for the given disk offset */
-
+ s->l2_cache_offsets[min_index] = l2_offset;
+ s->l2_cache_counts[min_index] = 1;
+ found:
l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1);
-
- *new_l2_table = l2_table;
- *new_l2_offset = l2_offset;
- *new_l2_index = l2_index;
-
- return 1;
-}
-
-/*
- * alloc_compressed_cluster_offset
- *
- * For a given offset of the disk image, return cluster offset in
- * qcow2 file.
- *
- * If the offset is not found, allocate a new compressed cluster.
- *
- * Return the cluster offset if successful,
- * Return 0, otherwise.
- *
- */
-
-static uint64_t alloc_compressed_cluster_offset(BlockDriverState *bs,
- uint64_t offset,
- int compressed_size)
-{
- BDRVQcowState *s = bs->opaque;
- int l2_index, ret;
- uint64_t l2_offset, *l2_table, cluster_offset;
- int nb_csectors;
-
- ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index);
- if (ret == 0)
- return 0;
-
cluster_offset = be64_to_cpu(l2_table[l2_index]);
- if (cluster_offset & QCOW_OFLAG_COPIED)
- return cluster_offset & ~QCOW_OFLAG_COPIED;
-
- if (cluster_offset)
- free_any_clusters(bs, cluster_offset, 1);
-
- cluster_offset = alloc_bytes(bs, compressed_size);
- nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
- (cluster_offset >> 9);
-
- cluster_offset |= QCOW_OFLAG_COMPRESSED |
- ((uint64_t)nb_csectors << s->csize_shift);
-
- /* update L2 table */
-
- /* compressed clusters never have the copied flag */
-
- l2_table[l2_index] = cpu_to_be64(cluster_offset);
- if (bdrv_pwrite(s->hd,
- l2_offset + l2_index * sizeof(uint64_t),
- l2_table + l2_index,
- sizeof(uint64_t)) != sizeof(uint64_t))
- return 0;
-
- return cluster_offset;
-}
-
-typedef struct QCowL2Meta
-{
- uint64_t offset;
- int n_start;
- int nb_available;
- int nb_clusters;
-} QCowL2Meta;
-
-static int alloc_cluster_link_l2(BlockDriverState *bs, uint64_t cluster_offset,
- QCowL2Meta *m)
-{
- BDRVQcowState *s = bs->opaque;
- int i, j = 0, l2_index, ret;
- uint64_t *old_cluster, start_sect, l2_offset, *l2_table;
-
- if (m->nb_clusters == 0)
- return 0;
-
- if (!(old_cluster = qemu_malloc(m->nb_clusters * sizeof(uint64_t))))
- return -ENOMEM;
-
- /* copy content of unmodified sectors */
- start_sect = (m->offset & ~(s->cluster_size - 1)) >> 9;
- if (m->n_start) {
- ret = copy_sectors(bs, start_sect, cluster_offset, 0, m->n_start);
- if (ret < 0)
- goto err;
+ if (!cluster_offset) {
+ if (!allocate)
+ return cluster_offset;
+ } else if (!(cluster_offset & QCOW_OFLAG_COPIED)) {
+ if (!allocate)
+ return cluster_offset;
+ /* free the cluster */
+ if (cluster_offset & QCOW_OFLAG_COMPRESSED) {
+ int nb_csectors;
+ nb_csectors = ((cluster_offset >> s->csize_shift) &
+ s->csize_mask) + 1;
+ free_clusters(bs, (cluster_offset & s->cluster_offset_mask) & ~511,
+ nb_csectors * 512);
+ } else {
+ free_clusters(bs, cluster_offset, s->cluster_size);
+ }
+ } else {
+ cluster_offset &= ~QCOW_OFLAG_COPIED;
+ return cluster_offset;
}
-
- if (m->nb_available & (s->cluster_sectors - 1)) {
- uint64_t end = m->nb_available & ~(uint64_t)(s->cluster_sectors - 1);
- ret = copy_sectors(bs, start_sect + end, cluster_offset + (end << 9),
- m->nb_available - end, s->cluster_sectors);
- if (ret < 0)
- goto err;
+ if (allocate == 1) {
+ /* allocate a new cluster */
+ cluster_offset = alloc_clusters(bs, s->cluster_size);
+
+ /* we must initialize the cluster content which won't be
+ written */
+ if ((n_end - n_start) < s->cluster_sectors) {
+ uint64_t start_sect;
+
+ start_sect = (offset & ~(s->cluster_size - 1)) >> 9;
+ ret = copy_sectors(bs, start_sect,
+ cluster_offset, 0, n_start);
+ if (ret < 0)
+ return 0;
+ ret = copy_sectors(bs, start_sect,
+ cluster_offset, n_end, s->cluster_sectors);
+ if (ret < 0)
+ return 0;
+ }
+ tmp = cpu_to_be64(cluster_offset | QCOW_OFLAG_COPIED);
+ } else {
+ int nb_csectors;
+ cluster_offset = alloc_bytes(bs, compressed_size);
+ nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
+ (cluster_offset >> 9);
+ cluster_offset |= QCOW_OFLAG_COMPRESSED |
+ ((uint64_t)nb_csectors << s->csize_shift);
+ /* compressed clusters never have the copied flag */
+ tmp = cpu_to_be64(cluster_offset);
}
-
- ret = -EIO;
/* update L2 table */
- if (!get_cluster_table(bs, m->offset, &l2_table, &l2_offset, &l2_index))
- goto err;
-
- for (i = 0; i < m->nb_clusters; i++) {
- if(l2_table[l2_index + i] != 0)
- old_cluster[j++] = l2_table[l2_index + i];
-
- l2_table[l2_index + i] = cpu_to_be64((cluster_offset +
- (i << s->cluster_bits)) | QCOW_OFLAG_COPIED);
- }
-
- if (bdrv_pwrite(s->hd, l2_offset + l2_index * sizeof(uint64_t),
- l2_table + l2_index, m->nb_clusters * sizeof(uint64_t)) !=
- m->nb_clusters * sizeof(uint64_t))
- goto err;
-
- for (i = 0; i < j; i++)
- free_any_clusters(bs, old_cluster[i], 1);
-
- ret = 0;
-err:
- qemu_free(old_cluster);
- return ret;
- }
-
-/*
- * alloc_cluster_offset
- *
- * For a given offset of the disk image, return cluster offset in
- * qcow2 file.
- *
- * If the offset is not found, allocate a new cluster.
- *
- * Return the cluster offset if successful,
- * Return 0, otherwise.
- *
- */
-
-static uint64_t alloc_cluster_offset(BlockDriverState *bs,
- uint64_t offset,
- int n_start, int n_end,
- int *num, QCowL2Meta *m)
-{
- BDRVQcowState *s = bs->opaque;
- int l2_index, ret;
- uint64_t l2_offset, *l2_table, cluster_offset;
- int nb_clusters, i = 0;
-
- ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index);
- if (ret == 0)
+ l2_table[l2_index] = tmp;
+ if (bdrv_pwrite(s->hd,
+ l2_offset + l2_index * sizeof(tmp), &tmp, sizeof(tmp)) != sizeof(tmp))
return 0;
-
- nb_clusters = size_to_clusters(s, n_end << 9);
-
- nb_clusters = MIN(nb_clusters, s->l2_size - l2_index);
-
- cluster_offset = be64_to_cpu(l2_table[l2_index]);
-
- /* We keep all QCOW_OFLAG_COPIED clusters */
-
- if (cluster_offset & QCOW_OFLAG_COPIED) {
- nb_clusters = count_contiguous_clusters(nb_clusters, s->cluster_size,
- &l2_table[l2_index], 0, 0);
-
- cluster_offset &= ~QCOW_OFLAG_COPIED;
- m->nb_clusters = 0;
-
- goto out;
- }
-
- /* for the moment, multiple compressed clusters are not managed */
-
- if (cluster_offset & QCOW_OFLAG_COMPRESSED)
- nb_clusters = 1;
-
- /* how many available clusters ? */
-
- while (i < nb_clusters) {
- i += count_contiguous_clusters(nb_clusters - i, s->cluster_size,
- &l2_table[l2_index], i, 0);
-
- if(be64_to_cpu(l2_table[l2_index + i]))
- break;
-
- i += count_contiguous_free_clusters(nb_clusters - i,
- &l2_table[l2_index + i]);
-
- cluster_offset = be64_to_cpu(l2_table[l2_index + i]);
-
- if ((cluster_offset & QCOW_OFLAG_COPIED) ||
- (cluster_offset & QCOW_OFLAG_COMPRESSED))
- break;
- }
- nb_clusters = i;
-
- /* allocate a new cluster */
-
- cluster_offset = alloc_clusters(bs, nb_clusters * s->cluster_size);
-
- /* save info needed for meta data update */
- m->offset = offset;
- m->n_start = n_start;
- m->nb_clusters = nb_clusters;
-
-out:
- m->nb_available = MIN(nb_clusters << (s->cluster_bits - 9), n_end);
-
- *num = m->nb_available - n_start;
-
return cluster_offset;
}
static int qcow_is_allocated(BlockDriverState *bs, int64_t sector_num,
int nb_sectors, int *pnum)
{
+ BDRVQcowState *s = bs->opaque;
+ int index_in_cluster, n;
uint64_t cluster_offset;
- *pnum = nb_sectors;
- cluster_offset = get_cluster_offset(bs, sector_num << 9, pnum);
-
+ cluster_offset = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0);
+ index_in_cluster = sector_num & (s->cluster_sectors - 1);
+ n = s->cluster_sectors - index_in_cluster;
+ if (n > nb_sectors)
+ n = nb_sectors;
+ *pnum = n;
return (cluster_offset != 0);
}
@@ -1102,9 +727,11 @@
uint64_t cluster_offset;
while (nb_sectors > 0) {
- n = nb_sectors;
- cluster_offset = get_cluster_offset(bs, sector_num << 9, &n);
+ cluster_offset = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0);
index_in_cluster = sector_num & (s->cluster_sectors - 1);
+ n = s->cluster_sectors - index_in_cluster;
+ if (n > nb_sectors)
+ n = nb_sectors;
if (!cluster_offset) {
if (bs->backing_hd) {
/* read from the base image */
@@ -1143,18 +770,15 @@
BDRVQcowState *s = bs->opaque;
int ret, index_in_cluster, n;
uint64_t cluster_offset;
- int n_end;
- QCowL2Meta l2meta;
while (nb_sectors > 0) {
index_in_cluster = sector_num & (s->cluster_sectors - 1);
- n_end = index_in_cluster + nb_sectors;
- if (s->crypt_method &&
- n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
- n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
- cluster_offset = alloc_cluster_offset(bs, sector_num << 9,
- index_in_cluster,
- n_end, &n, &l2meta);
+ n = s->cluster_sectors - index_in_cluster;
+ if (n > nb_sectors)
+ n = nb_sectors;
+ cluster_offset = get_cluster_offset(bs, sector_num << 9, 1, 0,
+ index_in_cluster,
+ index_in_cluster + n);
if (!cluster_offset)
return -1;
if (s->crypt_method) {
@@ -1165,10 +789,8 @@
} else {
ret = bdrv_pwrite(s->hd, cluster_offset + index_in_cluster * 512, buf, n * 512);
}
- if (ret != n * 512 || alloc_cluster_link_l2(bs, cluster_offset, &l2meta) < 0) {
- free_any_clusters(bs, cluster_offset, l2meta.nb_clusters);
+ if (ret != n * 512)
return -1;
- }
nb_sectors -= n;
sector_num += n;
buf += n * 512;
@@ -1186,33 +808,8 @@
uint64_t cluster_offset;
uint8_t *cluster_data;
BlockDriverAIOCB *hd_aiocb;
- QEMUBH *bh;
- QCowL2Meta l2meta;
} QCowAIOCB;
-static void qcow_aio_read_cb(void *opaque, int ret);
-static void qcow_aio_read_bh(void *opaque)
-{
- QCowAIOCB *acb = opaque;
- qemu_bh_delete(acb->bh);
- acb->bh = NULL;
- qcow_aio_read_cb(opaque, 0);
-}
-
-static int qcow_schedule_bh(QEMUBHFunc *cb, QCowAIOCB *acb)
-{
- if (acb->bh)
- return -EIO;
-
- acb->bh = qemu_bh_new(cb, acb);
- if (!acb->bh)
- return -EIO;
-
- qemu_bh_schedule(acb->bh);
-
- return 0;
-}
-
static void qcow_aio_read_cb(void *opaque, int ret)
{
QCowAIOCB *acb = opaque;
@@ -1222,12 +819,13 @@
acb->hd_aiocb = NULL;
if (ret < 0) {
-fail:
+ fail:
acb->common.cb(acb->common.opaque, ret);
qemu_aio_release(acb);
return;
}
+ redo:
/* post process the read buffer */
if (!acb->cluster_offset) {
/* nothing to do */
@@ -1253,9 +851,12 @@
}
/* prepare next AIO request */
- acb->n = acb->nb_sectors;
- acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9, &acb->n);
+ acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9,
+ 0, 0, 0, 0);
index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
+ acb->n = s->cluster_sectors - index_in_cluster;
+ if (acb->n > acb->nb_sectors)
+ acb->n = acb->nb_sectors;
if (!acb->cluster_offset) {
if (bs->backing_hd) {
@@ -1268,16 +869,12 @@
if (acb->hd_aiocb == NULL)
goto fail;
} else {
- ret = qcow_schedule_bh(qcow_aio_read_bh, acb);
- if (ret < 0)
- goto fail;
+ goto redo;
}
} else {
/* Note: in this case, no need to wait */
memset(acb->buf, 0, 512 * acb->n);
- ret = qcow_schedule_bh(qcow_aio_read_bh, acb);
- if (ret < 0)
- goto fail;
+ goto redo;
}
} else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) {
/* add AIO support for compressed blocks ? */
@@ -1285,9 +882,7 @@
goto fail;
memcpy(acb->buf,
s->cluster_cache + index_in_cluster * 512, 512 * acb->n);
- ret = qcow_schedule_bh(qcow_aio_read_bh, acb);
- if (ret < 0)
- goto fail;
+ goto redo;
} else {
if ((acb->cluster_offset & 511) != 0) {
ret = -EIO;
@@ -1316,7 +911,6 @@
acb->nb_sectors = nb_sectors;
acb->n = 0;
acb->cluster_offset = 0;
- acb->l2meta.nb_clusters = 0;
return acb;
}
@@ -1340,8 +934,8 @@
BlockDriverState *bs = acb->common.bs;
BDRVQcowState *s = bs->opaque;
int index_in_cluster;
+ uint64_t cluster_offset;
const uint8_t *src_buf;
- int n_end;
acb->hd_aiocb = NULL;
@@ -1352,11 +946,6 @@
return;
}
- if (alloc_cluster_link_l2(bs, acb->cluster_offset, &acb->l2meta) < 0) {
- free_any_clusters(bs, acb->cluster_offset, acb->l2meta.nb_clusters);
- goto fail;
- }
-
acb->nb_sectors -= acb->n;
acb->sector_num += acb->n;
acb->buf += acb->n * 512;
@@ -1369,22 +958,19 @@
}
index_in_cluster = acb->sector_num & (s->cluster_sectors - 1);
- n_end = index_in_cluster + acb->nb_sectors;
- if (s->crypt_method &&
- n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors)
- n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors;
-
- acb->cluster_offset = alloc_cluster_offset(bs, acb->sector_num << 9,
- index_in_cluster,
- n_end, &acb->n, &acb->l2meta);
- if (!acb->cluster_offset || (acb->cluster_offset & 511) != 0) {
+ acb->n = s->cluster_sectors - index_in_cluster;
+ if (acb->n > acb->nb_sectors)
+ acb->n = acb->nb_sectors;
+ cluster_offset = get_cluster_offset(bs, acb->sector_num << 9, 1, 0,
+ index_in_cluster,
+ index_in_cluster + acb->n);
+ if (!cluster_offset || (cluster_offset & 511) != 0) {
ret = -EIO;
goto fail;
}
if (s->crypt_method) {
if (!acb->cluster_data) {
- acb->cluster_data = qemu_mallocz(QCOW_MAX_CRYPT_CLUSTERS *
- s->cluster_size);
+ acb->cluster_data = qemu_mallocz(s->cluster_size);
if (!acb->cluster_data) {
ret = -ENOMEM;
goto fail;
@@ -1397,7 +983,7 @@
src_buf = acb->buf;
}
acb->hd_aiocb = bdrv_aio_write(s->hd,
- (acb->cluster_offset >> 9) + index_in_cluster,
+ (cluster_offset >> 9) + index_in_cluster,
src_buf, acb->n,
qcow_aio_write_cb, acb);
if (acb->hd_aiocb == NULL)
@@ -1571,7 +1157,7 @@
memset(s->l1_table, 0, l1_length);
if (bdrv_pwrite(s->hd, s->l1_table_offset, s->l1_table, l1_length) < 0)
- return -1;
+ return -1;
ret = bdrv_truncate(s->hd, s->l1_table_offset + l1_length);
if (ret < 0)
return ret;
@@ -1637,10 +1223,8 @@
/* could not compress: write normal cluster */
qcow_write(bs, sector_num, buf, s->cluster_sectors);
} else {
- cluster_offset = alloc_compressed_cluster_offset(bs, sector_num << 9,
- out_len);
- if (!cluster_offset)
- return -1;
+ cluster_offset = get_cluster_offset(bs, sector_num << 9, 2,
+ out_len, 0, 0);
cluster_offset &= s->cluster_offset_mask;
if (bdrv_pwrite(s->hd, cluster_offset, out_buf, out_len) != out_len) {
qemu_free(out_buf);
@@ -2225,19 +1809,26 @@
BDRVQcowState *s = bs->opaque;
int i, nb_clusters;
- nb_clusters = size_to_clusters(s, size);
-retry:
- for(i = 0; i < nb_clusters; i++) {
- int64_t i = s->free_cluster_index++;
- if (get_refcount(bs, i) != 0)
- goto retry;
- }
+ nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits;
+ for(;;) {
+ if (get_refcount(bs, s->free_cluster_index) == 0) {
+ s->free_cluster_index++;
+ for(i = 1; i < nb_clusters; i++) {
+ if (get_refcount(bs, s->free_cluster_index) != 0)
+ goto not_found;
+ s->free_cluster_index++;
+ }
#ifdef DEBUG_ALLOC2
- printf("alloc_clusters: size=%lld -> %lld\n",
- size,
- (s->free_cluster_index - nb_clusters) << s->cluster_bits);
+ printf("alloc_clusters: size=%lld -> %lld\n",
+ size,
+ (s->free_cluster_index - nb_clusters) << s->cluster_bits);
#endif
- return (s->free_cluster_index - nb_clusters) << s->cluster_bits;
+ return (s->free_cluster_index - nb_clusters) << s->cluster_bits;
+ } else {
+ not_found:
+ s->free_cluster_index++;
+ }
+ }
}
static int64_t alloc_clusters(BlockDriverState *bs, int64_t size)
@@ -2301,7 +1892,8 @@
int new_table_size, new_table_size2, refcount_table_clusters, i, ret;
uint64_t *new_table;
int64_t table_offset;
- uint8_t data[12];
+ uint64_t data64;
+ uint32_t data32;
int old_table_size;
int64_t old_table_offset;
@@ -2340,10 +1932,13 @@
for(i = 0; i < s->refcount_table_size; i++)
be64_to_cpus(&new_table[i]);
- cpu_to_be64w((uint64_t*)data, table_offset);
- cpu_to_be32w((uint32_t*)(data + 8), refcount_table_clusters);
+ data64 = cpu_to_be64(table_offset);
if (bdrv_pwrite(s->hd, offsetof(QCowHeader, refcount_table_offset),
- data, sizeof(data)) != sizeof(data))
+ &data64, sizeof(data64)) != sizeof(data64))
+ goto fail;
+ data32 = cpu_to_be32(refcount_table_clusters);
+ if (bdrv_pwrite(s->hd, offsetof(QCowHeader, refcount_table_clusters),
+ &data32, sizeof(data32)) != sizeof(data32))
goto fail;
qemu_free(s->refcount_table);
old_table_offset = s->refcount_table_offset;
@@ -2572,7 +2167,7 @@
uint16_t *refcount_table;
size = bdrv_getlength(s->hd);
- nb_clusters = size_to_clusters(s, size);
+ nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits;
refcount_table = qemu_mallocz(nb_clusters * sizeof(uint16_t));
/* header */
@@ -2624,7 +2219,7 @@
int refcount;
size = bdrv_getlength(s->hd);
- nb_clusters = size_to_clusters(s, size);
+ nb_clusters = (size + s->cluster_size - 1) >> s->cluster_bits;
for(k = 0; k < nb_clusters;) {
k1 = k;
refcount = get_refcount(bs, k);
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
  2009-02-11  7:00 ` [Qemu-devel] " Jamie Lokier
@ 2009-02-11  9:57   ` Kevin Wolf
  2009-02-11 11:27     ` Jamie Lokier
  2009-02-11 11:41     ` Jamie Lokier
  -1 siblings, 2 replies; 45+ messages in thread
From: Kevin Wolf @ 2009-02-11 9:57 UTC (permalink / raw)
To: Jamie Lokier; +Cc: qemu-devel, kvm-devel

Jamie Lokier schrieb:
> Although there are many ways to make Windows blue screen in KVM, in
> this case I've narrowed it down to the difference in
> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).

This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you
narrow it down to one of these? I certainly don't feel like reviewing
all of them once again.

Do I understand correctly that your image fails with the kvm-73 version
of block-qcow2.c and afterwards boots with kvm-72? So the image is not
really corrupted but it's more likely an IO error which brings the OS down?

Kevin
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change
  2009-02-11  9:57 ` Kevin Wolf
@ 2009-02-11 11:27   ` Jamie Lokier
  2009-02-11 11:41   ` Jamie Lokier
  1 sibling, 0 replies; 45+ messages in thread
From: Jamie Lokier @ 2009-02-11 11:27 UTC (permalink / raw)
To: Kevin Wolf; +Cc: qemu-devel, kvm-devel

Kevin Wolf wrote:
> Jamie Lokier schrieb:
> > Although there are many ways to make Windows blue screen in KVM, in
> > this case I've narrowed it down to the difference in
> > qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83).
>
> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you
> narrow it down to one of these? I certainly don't feel like reviewing
> all of them once again.

That's helpful, thanks. It's a production mail server which was
affected, and it's being used at the moment. Not sure if I can narrow
it down that easily :-)

> Do I understand correctly that your image fails with the kvm-73 version
> of block-qcow2.c and afterwards boots with kvm-72? So the image is not
> really corrupted but it's more likely an IO error which brings the OS down?

That's correct, it's always booted when trying again with kvm-72, or a
later kvm with block-qcow2.c reverted.

It might be an I/O error rather than corruption. Up to kvm-76, I/O
errors aren't reported over the IDE driver.

-- Jamie
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-11 9:57 ` Kevin Wolf @ 2009-02-11 11:41 ` Jamie Lokier 2009-02-11 11:41 ` Jamie Lokier 1 sibling, 0 replies; 45+ messages in thread From: Jamie Lokier @ 2009-02-11 11:41 UTC (permalink / raw) To: Kevin Wolf; +Cc: qemu-devel, kvm-devel Kevin Wolf wrote: > Jamie Lokier schrieb: > > Although there are many ways to make Windows blue screen in KVM, in > > this case I've narrowed it down to the difference in > > qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83). > > This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you > narrow it down to one of these? I certainly don't feel like reviewing > all of them once again. It's QEMU SVN delta 5005-5006, copied below. I've tested by applying the diffs up to QEMU SVN revs 5003 to 500, onto kvm-72, and 5005-5006 is the diff which triggers the failed guest boot, consistently. Aside from logic, the code mixes signed 32-bit with unsigned 64-bit with unclear naming which would make me nervous. My host is 64-bit, by the way. -- Jamie --- trunk/block-qcow2.c 2008/08/14 18:09:32 5005 +++ trunk/block-qcow2.c 2008/08/14 18:10:28 5006 @@ -52,6 +52,8 @@ #define QCOW_CRYPT_NONE 0 #define QCOW_CRYPT_AES 1 +#define QCOW_MAX_CRYPT_CLUSTERS 32 + /* indicate that the refcount of the referenced cluster is exactly one. */ #define QCOW_OFLAG_COPIED (1LL << 63) /* indicate that the cluster is compressed (they never have the copied flag) */ @@ -263,7 +265,8 @@ if (!s->cluster_cache) goto fail; /* one more sector for decompressed data alignment */ - s->cluster_data = qemu_malloc(s->cluster_size + 512); + s->cluster_data = qemu_malloc(QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + + 512); if (!s->cluster_data) goto fail; s->cluster_cache_offset = -1; @@ -612,43 +615,98 @@ * For a given offset of the disk image, return cluster offset in * qcow2 file. * + * on entry, *num is the number of contiguous clusters we'd like to + * access following offset. 
+ * + * on exit, *num is the number of contiguous clusters we can read. + * * Return 1, if the offset is found * Return 0, otherwise. * */ -static uint64_t get_cluster_offset(BlockDriverState *bs, uint64_t offset) +static uint64_t get_cluster_offset(BlockDriverState *bs, + uint64_t offset, int *num) { BDRVQcowState *s = bs->opaque; int l1_index, l2_index; - uint64_t l2_offset, *l2_table, cluster_offset; + uint64_t l2_offset, *l2_table, cluster_offset, next; + int l1_bits; + int index_in_cluster, nb_available, nb_needed; + + index_in_cluster = (offset >> 9) & (s->cluster_sectors - 1); + nb_needed = *num + index_in_cluster; + + l1_bits = s->l2_bits + s->cluster_bits; + + /* compute how many bytes there are between the offset and + * and the end of the l1 entry + */ + + nb_available = (1 << l1_bits) - (offset & ((1 << l1_bits) - 1)); + + /* compute the number of available sectors */ + + nb_available = (nb_available >> 9) + index_in_cluster; + + cluster_offset = 0; /* seek the the l2 offset in the l1 table */ - l1_index = offset >> (s->l2_bits + s->cluster_bits); + l1_index = offset >> l1_bits; if (l1_index >= s->l1_size) - return 0; + goto out; l2_offset = s->l1_table[l1_index]; /* seek the l2 table of the given l2 offset */ if (!l2_offset) - return 0; + goto out; /* load the l2 table in memory */ l2_offset &= ~QCOW_OFLAG_COPIED; l2_table = l2_load(bs, l2_offset); if (l2_table == NULL) - return 0; + goto out; /* find the cluster offset for the given disk offset */ l2_index = (offset >> s->cluster_bits) & (s->l2_size - 1); cluster_offset = be64_to_cpu(l2_table[l2_index]); + nb_available = s->cluster_sectors; + l2_index++; + + if (!cluster_offset) { - return cluster_offset & ~QCOW_OFLAG_COPIED; + /* how many empty clusters ? */ + + while (nb_available < nb_needed && !l2_table[l2_index]) { + l2_index++; + nb_available += s->cluster_sectors; + } + } else { + + /* how many allocated clusters ? 
*/ + + cluster_offset &= ~QCOW_OFLAG_COPIED; + while (nb_available < nb_needed) { + next = be64_to_cpu(l2_table[l2_index]) & ~QCOW_OFLAG_COPIED; + if (next != cluster_offset + (nb_available << 9)) + break; + l2_index++; + nb_available += s->cluster_sectors; + } + } + +out: + if (nb_available > nb_needed) + nb_available = nb_needed; + + *num = nb_available - index_in_cluster; + + return cluster_offset; } /* @@ -659,13 +717,10 @@ */ static void free_any_clusters(BlockDriverState *bs, - uint64_t cluster_offset) + uint64_t cluster_offset, int nb_clusters) { BDRVQcowState *s = bs->opaque; - if (cluster_offset == 0) - return; - /* free the cluster */ if (cluster_offset & QCOW_OFLAG_COMPRESSED) { @@ -677,7 +732,9 @@ return; } - free_clusters(bs, cluster_offset, s->cluster_size); + free_clusters(bs, cluster_offset, nb_clusters << s->cluster_bits); + + return; } /* @@ -768,7 +825,8 @@ if (cluster_offset & QCOW_OFLAG_COPIED) return cluster_offset & ~QCOW_OFLAG_COPIED; - free_any_clusters(bs, cluster_offset); + if (cluster_offset) + free_any_clusters(bs, cluster_offset, 1); cluster_offset = alloc_bytes(bs, compressed_size); nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) - @@ -806,68 +864,136 @@ static uint64_t alloc_cluster_offset(BlockDriverState *bs, uint64_t offset, - int n_start, int n_end) + int n_start, int n_end, + int *num) { BDRVQcowState *s = bs->opaque; int l2_index, ret; uint64_t l2_offset, *l2_table, cluster_offset; + int nb_available, nb_clusters, i; + uint64_t start_sect, current; ret = get_cluster_table(bs, offset, &l2_table, &l2_offset, &l2_index); if (ret == 0) return 0; + nb_clusters = ((n_end << 9) + s->cluster_size - 1) >> + s->cluster_bits; + if (nb_clusters > s->l2_size - l2_index) + nb_clusters = s->l2_size - l2_index; + cluster_offset = be64_to_cpu(l2_table[l2_index]); - if (cluster_offset & QCOW_OFLAG_COPIED) - return cluster_offset & ~QCOW_OFLAG_COPIED; - free_any_clusters(bs, cluster_offset); + /* We keep all QCOW_OFLAG_COPIED clusters 
*/ + + if (cluster_offset & QCOW_OFLAG_COPIED) { + + for (i = 1; i < nb_clusters; i++) { + current = be64_to_cpu(l2_table[l2_index + i]); + if (cluster_offset + (i << s->cluster_bits) != current) + break; + } + nb_clusters = i; + + nb_available = nb_clusters << (s->cluster_bits - 9); + if (nb_available > n_end) + nb_available = n_end; + + cluster_offset &= ~QCOW_OFLAG_COPIED; + + goto out; + } + + /* for the moment, multiple compressed clusters are not managed */ + + if (cluster_offset & QCOW_OFLAG_COMPRESSED) + nb_clusters = 1; + + /* how many empty or how many to free ? */ + + if (!cluster_offset) { + + /* how many free clusters ? */ + + i = 1; + while (i < nb_clusters && + l2_table[l2_index + i] == 0) { + i++; + } + nb_clusters = i; + + } else { + + /* how many contiguous clusters ? */ + + for (i = 1; i < nb_clusters; i++) { + current = be64_to_cpu(l2_table[l2_index + i]); + if (cluster_offset + (i << s->cluster_bits) != current) + break; + } + nb_clusters = i; + + free_any_clusters(bs, cluster_offset, i); + } /* allocate a new cluster */ - cluster_offset = alloc_clusters(bs, s->cluster_size); + cluster_offset = alloc_clusters(bs, nb_clusters * s->cluster_size); /* we must initialize the cluster content which won't be written */ - if ((n_end - n_start) < s->cluster_sectors) { - uint64_t start_sect; - - start_sect = (offset & ~(s->cluster_size - 1)) >> 9; - ret = copy_sectors(bs, start_sect, - cluster_offset, 0, n_start); + nb_available = nb_clusters << (s->cluster_bits - 9); + if (nb_available > n_end) + nb_available = n_end; + + /* copy content of unmodified sectors */ + + start_sect = (offset & ~(s->cluster_size - 1)) >> 9; + if (n_start) { + ret = copy_sectors(bs, start_sect, cluster_offset, 0, n_start); if (ret < 0) return 0; - ret = copy_sectors(bs, start_sect, - cluster_offset, n_end, s->cluster_sectors); + } + + if (nb_available & (s->cluster_sectors - 1)) { + uint64_t end = nb_available & ~(uint64_t)(s->cluster_sectors - 1); + ret = copy_sectors(bs, 
start_sect + end, + cluster_offset + (end << 9), + nb_available - end, + s->cluster_sectors); if (ret < 0) return 0; } /* update L2 table */ - l2_table[l2_index] = cpu_to_be64(cluster_offset | QCOW_OFLAG_COPIED); + for (i = 0; i < nb_clusters; i++) + l2_table[l2_index + i] = cpu_to_be64((cluster_offset + + (i << s->cluster_bits)) | + QCOW_OFLAG_COPIED); + if (bdrv_pwrite(s->hd, l2_offset + l2_index * sizeof(uint64_t), l2_table + l2_index, - sizeof(uint64_t)) != sizeof(uint64_t)) + nb_clusters * sizeof(uint64_t)) != + nb_clusters * sizeof(uint64_t)) return 0; +out: + *num = nb_available - n_start; + return cluster_offset; } static int qcow_is_allocated(BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum) { - BDRVQcowState *s = bs->opaque; - int index_in_cluster, n; uint64_t cluster_offset; - cluster_offset = get_cluster_offset(bs, sector_num << 9); - index_in_cluster = sector_num & (s->cluster_sectors - 1); - n = s->cluster_sectors - index_in_cluster; - if (n > nb_sectors) - n = nb_sectors; - *pnum = n; + *pnum = nb_sectors; + cluster_offset = get_cluster_offset(bs, sector_num << 9, pnum); + return (cluster_offset != 0); } @@ -944,11 +1070,9 @@ uint64_t cluster_offset; while (nb_sectors > 0) { - cluster_offset = get_cluster_offset(bs, sector_num << 9); + n = nb_sectors; + cluster_offset = get_cluster_offset(bs, sector_num << 9, &n); index_in_cluster = sector_num & (s->cluster_sectors - 1); - n = s->cluster_sectors - index_in_cluster; - if (n > nb_sectors) - n = nb_sectors; if (!cluster_offset) { if (bs->backing_hd) { /* read from the base image */ @@ -987,15 +1111,17 @@ BDRVQcowState *s = bs->opaque; int ret, index_in_cluster, n; uint64_t cluster_offset; + int n_end; while (nb_sectors > 0) { index_in_cluster = sector_num & (s->cluster_sectors - 1); - n = s->cluster_sectors - index_in_cluster; - if (n > nb_sectors) - n = nb_sectors; + n_end = index_in_cluster + nb_sectors; + if (s->crypt_method && + n_end > QCOW_MAX_CRYPT_CLUSTERS * 
s->cluster_sectors) + n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors; cluster_offset = alloc_cluster_offset(bs, sector_num << 9, index_in_cluster, - index_in_cluster + n); + n_end, &n); if (!cluster_offset) return -1; if (s->crypt_method) { @@ -1068,11 +1194,9 @@ } /* prepare next AIO request */ - acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9); + acb->n = acb->nb_sectors; + acb->cluster_offset = get_cluster_offset(bs, acb->sector_num << 9, &acb->n); index_in_cluster = acb->sector_num & (s->cluster_sectors - 1); - acb->n = s->cluster_sectors - index_in_cluster; - if (acb->n > acb->nb_sectors) - acb->n = acb->nb_sectors; if (!acb->cluster_offset) { if (bs->backing_hd) { @@ -1152,6 +1276,7 @@ int index_in_cluster; uint64_t cluster_offset; const uint8_t *src_buf; + int n_end; acb->hd_aiocb = NULL; @@ -1174,19 +1299,22 @@ } index_in_cluster = acb->sector_num & (s->cluster_sectors - 1); - acb->n = s->cluster_sectors - index_in_cluster; - if (acb->n > acb->nb_sectors) - acb->n = acb->nb_sectors; + n_end = index_in_cluster + acb->nb_sectors; + if (s->crypt_method && + n_end > QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors) + n_end = QCOW_MAX_CRYPT_CLUSTERS * s->cluster_sectors; + cluster_offset = alloc_cluster_offset(bs, acb->sector_num << 9, index_in_cluster, - index_in_cluster + acb->n); + n_end, &acb->n); if (!cluster_offset || (cluster_offset & 511) != 0) { ret = -EIO; goto fail; } if (s->crypt_method) { if (!acb->cluster_data) { - acb->cluster_data = qemu_mallocz(s->cluster_size); + acb->cluster_data = qemu_mallocz(QCOW_MAX_CRYPT_CLUSTERS * + s->cluster_size); if (!acb->cluster_data) { ret = -ENOMEM; goto fail; ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-11 11:41 ` Jamie Lokier @ 2009-02-11 12:41 ` Kevin Wolf -1 siblings, 0 replies; 45+ messages in thread From: Kevin Wolf @ 2009-02-11 12:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: qemu-devel, kvm-devel, Laurent Vivier Jamie Lokier schrieb: > Kevin Wolf wrote: >> Jamie Lokier schrieb: >>> Although there are many ways to make Windows blue screen in KVM, in >>> this case I've narrowed it down to the difference in >>> qemu/block-qcow2.c between kvm-72 and kvm-73 (not -83). >> This must be one of SVN revisions 5003 to 5008 in upstream qemu. Can you >> narrow it down to one of these? I certainly don't feel like reviewing >> all of them once again. > > It's QEMU SVN delta 5005-5006, copied below. > > I've tested by applying the diffs up to QEMU SVN revs 5003 to 500, > onto kvm-72, and 5005-5006 is the diff which triggers the failed guest > boot, consistently. That's exactly what I was afraid of... It's the most difficult patch of the series. I'm adding Laurent to CC who wrote the patch series then, but I can imagine he wants to do different things in his spare time. Besides reviewing the code over and over again, I think the only real chance is that you can get a non-productive copy of your image and add some debug code so that we can see at least which code path is causing problems. > Aside from logic, the code mixes signed 32-bit with unsigned 64-bit > with unclear naming which would make me nervous. My host is 64-bit, > by the way. I would suspect that simply having a 64 bit host isn't enough to trigger the problem. These patches were in for half a year now without anyone noticing such failure. By the way and completely off-topic: Have you already tried to use the VHD patches? I would really like to know if they fix your problems. Kevin ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-11 12:41 ` Kevin Wolf @ 2009-02-11 16:48 ` Jamie Lokier -1 siblings, 0 replies; 45+ messages in thread From: Jamie Lokier @ 2009-02-11 16:48 UTC (permalink / raw) To: Kevin Wolf; +Cc: qemu-devel, kvm-devel, Laurent Vivier Kevin Wolf wrote: > Besides reviewing the code over and over again, I think the only real > chance is that you can get a non-productive copy of your image and add > some debug code so that we can see at least which code path is causing > problems. I have a copy of my image to reproduce the bug, so I can test patches including diagnostic patches. That's what I did to narrow it down. Being a company mail server, I can't send you the image of course. > > Aside from logic, the code mixes signed 32-bit with unsigned 64-bit > > with unclear naming which would make me nervous. My host is 64-bit, > > by the way. > > I would suspect that simply having a 64 bit host isn't enough to trigger > the problem. These patches were in for half a year now without anyone > noticing such failure. It was just for clarity. If there are any bugs it's more likely to be truncation on a 32 bit host :-) I didn't see any mention of "long", so the code should behave the same on 64-bit and 32-bit hosts. > By the way and completely off-topic: Have you already tried to use the > VHD patches? I would really like to know if they fix your problems. Are those patches in kvm-83? I still have the image that was causing problems way back, and I'm converting it to raw now with kvm-83 to see if it now matches the raw image produced by VPC's own tool. Beyond checking that it reads ok, which it didn't before, I don't know how to test the VPC support properly, but I can try booting the image and see if it at least doesn't fsck^H^H^H^Hscandisk like it used to. I'm not using VPC images any more, we just install Windows into empty QCOW2 or raw images, like everyone else. 
:-) At some point I may test the VPC support with prebuilt images downloaded from Microsoft - you can too! -- Jamie ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-11 16:48 ` Jamie Lokier @ 2009-02-12 22:57 ` Consul -1 siblings, 0 replies; 45+ messages in thread From: Consul @ 2009-02-12 22:57 UTC (permalink / raw) To: qemu-devel; +Cc: kvm Jamie Lokier wrote: > > It was just for clarity. If there are any bugs it's more likely to be > truncation on a 32 bit host :-) > Maybe not a proper fix, do you see the same "corruption" with this patch? I don't know if it causes any memory leaks, but it certainly clears the segfaults while running my old qcow2 windows images. Perhaps this is a wrong place to free() or it needs a condition? $ svn diff block.c Index: block.c =================================================================== --- block.c (revision 6618) +++ block.c (working copy) @@ -1263,7 +1263,7 @@ if (!s->is_write) { qemu_iovec_from_buffer(s->iov, s->bounce, s->iov->size); } - qemu_free(s->bounce); + //qemu_free(s->bounce); s->this_aiocb->cb(s->this_aiocb->opaque, ret); qemu_aio_release(s->this_aiocb); } ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-12 22:57 ` [Qemu-devel] " Consul @ 2009-02-12 23:19 ` Consul -1 siblings, 0 replies; 45+ messages in thread From: Consul @ 2009-02-12 23:19 UTC (permalink / raw) To: qemu-devel; +Cc: kvm Consul wrote: > Jamie Lokier wrote: >> >> It was just for clarity. If there are any bugs it's more likely to be >> truncation on a 32 bit host :-) >> > Maybe not a proper fix, do you see the same "corruption" with this patch? > I don't know if it causes any memory leaks, but it certainly clears the > segfaults while running my old qcow2 windows images. Perhaps this is a > wrong place to free() or it needs a condition? > > $ svn diff block.c > Index: block.c > =================================================================== > --- block.c (revision 6618) > +++ block.c (working copy) > @@ -1263,7 +1263,7 @@ > if (!s->is_write) { > qemu_iovec_from_buffer(s->iov, s->bounce, s->iov->size); > } > - qemu_free(s->bounce); > + //qemu_free(s->bounce); > s->this_aiocb->cb(s->this_aiocb->opaque, ret); > qemu_aio_release(s->this_aiocb); > } > > In fact, s->bounce == 0xf270000 looks suspicious to me. Perhaps there is a 64/32 bit conversion error somewhere? (gdb) Num Type Disp Enb Address What 1 breakpoint keep y 0x004035ac in qemu_bh_poll at c:/test/qemu/vl.c:3342 stop only if bh->opaque==0xee9d440 (gdb) The program being debugged has been started already. Start it from the beginning? 
(y or n) [answered Y; input not from terminal] Starting program: c:\test\qemu/i386-softmmu/qemu.exe -L c:\\qemu-dist -hda C:\\qemu-img\\qem8D.tmp -m 512 -boot c -loadvm 1 [New thread 5188.0x180] [New thread 5188.0xea0] [New thread 5188.0x1608] Breakpoint 1, qemu_bh_poll () at c:/test/qemu/vl.c:3342 3342 bh->cb(bh->opaque); (gdb) #0 qemu_bh_poll () at c:/test/qemu/vl.c:3342 #1 0x00403a9a in main_loop_wait (timeout=0) at c:/test/qemu/vl.c:3745 #2 0x00407bf5 in main (argc=11, argv=0x3e27c0, envp=0xccc359ff) at c:/test/qemu/vl.c:3888 (gdb) qcow_aio_read_bh (opaque=0xee9d440) at block-qcow2.c:1194 1194 qemu_bh_delete(acb->bh); (gdb) qemu_bh_delete (bh=0xe41b2f8) at c:/test/qemu/vl.c:3391 3391 bh->scheduled = 0; (gdb) 3392 bh->deleted = 1; (gdb) 3393 } (gdb) qcow_aio_read_bh (opaque=0xee9d440) at block-qcow2.c:1195 1195 acb->bh = NULL; (gdb) 1196 qcow_aio_read_cb(opaque, 0); (gdb) qcow_aio_read_cb (opaque=0xee9d440, ret=0) at block-qcow2.c:1215 1215 QCowAIOCB *acb = opaque; (gdb) 1216 BlockDriverState *bs = acb->common.bs; (gdb) 1217 BDRVQcowState *s = bs->opaque; (gdb) 1220 acb->hd_aiocb = NULL; (gdb) 1221 if (ret < 0) { (gdb) 1217 BDRVQcowState *s = bs->opaque; (gdb) 1229 if (!acb->cluster_offset) { (gdb) 1231 } else if (acb->cluster_offset & QCOW_OFLAG_COMPRESSED) { (gdb) 1241 acb->nb_sectors -= acb->n; (gdb) 1242 acb->sector_num += acb->n; (gdb) 1243 acb->buf += acb->n * 512; (gdb) 1245 if (acb->nb_sectors == 0) { (gdb) 1247 acb->common.cb(acb->common.opaque, 0); (gdb) qcow_aio_read_cb (opaque=0xee9d3d0, ret=0) at block-qcow2.c:1215 1215 QCowAIOCB *acb = opaque; (gdb) 1216 BlockDriverState *bs = acb->common.bs; (gdb) 1217 BDRVQcowState *s = bs->opaque; (gdb) 1220 acb->hd_aiocb = NULL; (gdb) 1221 if (ret < 0) { (gdb) 1217 BDRVQcowState *s = bs->opaque; (gdb) 1229 if (!acb->cluster_offset) { (gdb) 1241 acb->nb_sectors -= acb->n; (gdb) 1242 acb->sector_num += acb->n; (gdb) 1243 acb->buf += acb->n * 512; (gdb) 1245 if (acb->nb_sectors == 0) { (gdb) 1247 
acb->common.cb(acb->common.opaque, 0); (gdb) bdrv_aio_rw_vector_cb (opaque=0xe41b2c8, ret=0) at block.c:1261 1261 VectorTranslationState *s = opaque; (gdb) 1263 if (!s->is_write) { (gdb) 1264 qemu_iovec_from_buffer(s->iov, s->bounce, s->iov->size); (gdb) qemu_iovec_from_buffer (qiov=0xe412b9c, buf=0xf270000, count=512) at cutils.c:155 155 for (i = 0; i < qiov->niov && count; ++i) { (gdb) 151 const uint8_t *p = (const uint8_t *)buf; (gdb) 157 if (copy > qiov->iov[i].iov_len) (gdb) 159 memcpy(qiov->iov[i].iov_base, p, copy); (gdb) 160 p += copy; (gdb) 155 for (i = 0; i < qiov->niov && count; ++i) { (gdb) 163 } (gdb) bdrv_aio_rw_vector_cb (opaque=0xe41b2c8, ret=0) at block.c:1266 1266 qemu_free(s->bounce); (gdb) qemu_free (ptr=0xf270000) at qemu-malloc.c:41 41 free(ptr); (gdb) Program received signal SIGSEGV, Segmentation fault. 0x7c96d811 in ntdll!RtlpNtMakeTemporaryKey () from C:\WINNT\system32\ntdll.dll ^ permalink raw reply [flat|nested] 45+ messages in thread
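For reference, the qemu_iovec_from_buffer() that the trace steps through scatters a contiguous bounce buffer back into the request's I/O vector. A simplified, self-contained sketch of that behaviour (hypothetical type names; the effect matches what the trace shows in cutils.c):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct iov_entry { void *iov_base; size_t iov_len; };
typedef struct { struct iov_entry *iov; int niov; size_t size; } IOVec;

/* Copy `count` bytes from the contiguous buffer `buf` into the
 * vector's elements in order, filling each element completely
 * before moving on to the next. */
static void iovec_from_buffer(IOVec *qiov, const void *buf, size_t count)
{
    const uint8_t *p = buf;
    for (int i = 0; i < qiov->niov && count; ++i) {
        size_t copy = count;
        if (copy > qiov->iov[i].iov_len)
            copy = qiov->iov[i].iov_len;
        memcpy(qiov->iov[i].iov_base, p, copy);
        p += copy;
        count -= copy;
    }
}
```

Note that in the trace the copy itself completes; it is the qemu_free(s->bounce) immediately afterwards that faults, pointing at the bounce buffer's lifetime rather than at the copy.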
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-12 23:19 ` [Qemu-devel] " Consul (?) @ 2009-02-13 7:50 ` Marc Bevand -1 siblings, 0 replies; 45+ messages in thread From: Marc Bevand @ 2009-02-13 7:50 UTC (permalink / raw) To: qemu-devel Forwarding a post I sent to the kvm ML: I have got a massive KVM installation with hundreds of guests running dozens of different OSes, and have also noticed multiple qcow2 corruption bugs. All my guests are using the qcow2 format, and my hosts are running vanilla linux 2.6.28 x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors). My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to run kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the version from kvm-72 to fix the BSOD. kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact registry corruption error: http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_id=180599 This bug is also fixed by reverting block-qcow2.c to the version from kvm-72. I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the qcow2 performance regression caused by the default writethrough caching policy) but they randomly trigger an even worse bug: the moment I shut down a guest by typing "quit" in the monitor, it sometimes overwrites the first 4kB of the disk image with mostly NUL bytes (!) which completely destroys it. I am familiar with the qcow2 format and this 4kB block appears to be an L2 table with most entries set to zero. I have had to restore at least 6 or 7 disk images from backup after occurrences of that bug. My intuition tells me this may be the qcow2 code trying to allocate a cluster to write a new L2 table, but not noticing the allocation failed (represented by a 0 offset), and writing the L2 table at that 0 offset, overwriting the qcow2 header. 
Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted to its kvm-72 version. Basically qcow2 in kvm-73 or newer is completely unreliable. -marc ^ permalink raw reply [flat|nested] 45+ messages in thread
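Marc's hypothesis — a failed cluster allocation returning offset 0 and the caller then writing an L2 table there — would correspond to a missing error check of roughly this shape. This is an illustrative sketch only (invented names and a toy allocator, not the actual qcow2 code):

```c
#include <assert.h>
#include <stdint.h>

/* Toy allocator in which 0 encodes "allocation failed"; in a real
 * qcow2 image the first cluster holds the header, so no valid
 * allocation ever returns offset 0. */
static uint64_t alloc_cluster(uint64_t *next_free, uint64_t limit,
                              uint64_t cluster_size)
{
    if (*next_free + cluster_size > limit)
        return 0;                 /* failure is encoded as offset 0 */
    uint64_t off = *next_free;
    *next_free += cluster_size;
    return off;
}

/* The caller must treat 0 as an error instead of writing there. */
static int write_l2_table(uint64_t *next_free, uint64_t limit)
{
    uint64_t off = alloc_cluster(next_free, limit, 4096);
    if (!off)
        return -1;    /* without this check, the table goes to offset 0 */
    /* ... write the L2 table at `off` ... */
    return 0;
}
```

Since offset 0 in a qcow2 file is where the image header lives, a table written through an unchecked zero offset would land exactly where Marc saw the mostly-NUL 4kB block.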
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-11 16:48 ` Jamie Lokier (?) (?) @ 2009-02-16 12:44 ` Kevin Wolf 2009-02-17 0:43 ` Jamie Lokier -1 siblings, 1 reply; 45+ messages in thread From: Kevin Wolf @ 2009-02-16 12:44 UTC (permalink / raw) To: qemu-devel; +Cc: Laurent Vivier, kvm-devel Jamie Lokier schrieb: > Kevin Wolf wrote: >> Besides reviewing the code over and over again, I think the only real >> chance is that you can get a non-productive copy of your image and add >> some debug code so that we can see at least which code path is causing >> problems. > > I have a copy of my image to reproduce the bug, so I can test patches > including diagnostic patches. That's what I did to narrow it down. > > Being a company mail server, I can't send you the image of course. I perfectly understand that, it just makes debugging hard when you don't have a specific suspicion nor the problematic image. Do you need those diagnostic patches from me or will you try to put debug messages into the code yourself? I would just start putting in random printfs into alloc_cluster_offset() and the functions it calls to get some information about the failing write. I guess we'd need some iterations with my diagnostic patch and it wouldn't be any better than ad-hoc hacks done by yourself. >> By the way and completely off-topic: Have you already tried to use the >> VHD patches? I would really like to know if they fix your problems. > > Are those patches in kvm-83? I still have the image that was causing > problems way back, and I'm converting it to raw now with kvm-83 to see > if it now matches the raw image produced by VPC's own tool. Avi mentioned the patches in the kvm-84 announcement yesterday, so it seems they are not in kvm-83. Kevin ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-16 12:44 ` [Qemu-devel] " Kevin Wolf @ 2009-02-17 0:43 ` Jamie Lokier 0 siblings, 0 replies; 45+ messages in thread From: Jamie Lokier @ 2009-02-17 0:43 UTC (permalink / raw) To: qemu-devel; +Cc: Laurent Vivier, kvm-devel Kevin Wolf wrote: > >> By the way and completely off-topic: Have you already tried to use the > >> VHD patches? I would really like to know if they fix your problems. > > > > Are those patches in kvm-83? I still have the image that was causing > > problems way back, and I'm converting it to raw now with kvm-83 to see > > if it now matches the raw image produced by VPC's own tool. > > Avi mentioned the patches in the kvm-84 announcement yesterday, so it > seems they are not in kvm-83. Ok. I did the conversion from VHD to raw with "kvm-83-img convert" and it gave a different result to MSVPC - i.e. wrong. I'll try it again sometime with kvm-84. -- Jamie ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-11 16:48 ` Jamie Lokier @ 2009-03-06 22:37 ` Filip Navara -1 siblings, 0 replies; 45+ messages in thread From: Filip Navara @ 2009-03-06 22:37 UTC (permalink / raw) To: qemu-devel; +Cc: Kevin Wolf, Laurent Vivier, kvm-devel [-- Attachment #1: Type: text/plain, Size: 659 bytes --] On Wed, Feb 11, 2009 at 5:48 PM, Jamie Lokier <jamie@shareable.org> wrote: > Kevin Wolf wrote: >> Besides reviewing the code over and over again, I think the only real >> chance is that you can get a non-productive copy of your image and add >> some debug code so that we can see at least which code path is causing >> problems. > > I have a copy of my image to reproduce the bug, so I can test patches > including diagnostic patches. That's what I did to narrow it down. Let's see. I have looked at the change in revision 5006 back and forth and this is the only bug that I can see... Does the patch help any? Best regards, Filip Navara [-- Attachment #2: block-qcow2.diff --] [-- Type: application/octet-stream, Size: 623 bytes --] Index: block-qcow2.c =================================================================== --- block-qcow2.c (revision 6729) +++ block-qcow2.c (working copy) @@ -621,7 +621,7 @@ if (!offset) return 0; - for (i = start; i < start + nb_clusters; i++) + for (i = start; i < start + nb_clusters && i < s->l2_size; i++) if (offset + i * cluster_size != (be64_to_cpu(l2_table[i]) & ~mask)) break; @@ -632,7 +632,7 @@ { int i = 0; - while(nb_clusters-- && l2_table[i] == 0) + while(nb_clusters-- && i < s->l2_size && l2_table[i] == 0) i++; return i; ^ permalink raw reply [flat|nested] 45+ messages in thread
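Filip's patch bounds both scan loops by s->l2_size so they cannot run past the end of the L2 table when nb_clusters is larger than the number of remaining entries. The second helper, with the bound applied, looks roughly like this as a standalone sketch (simplified; the first hunk's helper additionally masks flag bits out of each entry):

```c
#include <assert.h>
#include <stdint.h>

/* Count leading unallocated (zero) entries in an L2 table, never
 * looking at more than `l2_size` entries even if the caller asks
 * for more — the unbounded scan is the bug the patch addresses. */
static int count_contiguous_free_clusters(const uint64_t *l2_table,
                                          int l2_size, int nb_clusters)
{
    int i = 0;
    while (nb_clusters-- && i < l2_size && l2_table[i] == 0)
        i++;
    return i;
}
```

Without the i < l2_size bound, a request spanning the end of an L2 table would count whatever memory lies beyond the table as free clusters, which is exactly the kind of off-the-end read that could corrupt cluster accounting.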
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-11 12:41 ` Kevin Wolf @ 2009-02-12 5:45 ` Chris Wright -1 siblings, 0 replies; 45+ messages in thread From: Chris Wright @ 2009-02-12 5:45 UTC (permalink / raw) To: Kevin Wolf; +Cc: Jamie Lokier, qemu-devel, kvm-devel, Laurent Vivier * Kevin Wolf (kwolf@suse.de) wrote: > I would suspect that simply having a 64 bit host isn't enough to trigger > the problem. These patches were in for half a year now without anyone > noticing such failure. BTW, we've seen similar corruption, but not narrowed it down successfully as it's been quite difficult to trigger. thanks, -chris ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] qcow2 corruption observed, fixed by reverting old change 2009-02-12 5:45 ` Chris Wright @ 2009-02-12 11:08 ` Johannes Schindelin -1 siblings, 0 replies; 45+ messages in thread From: Johannes Schindelin @ 2009-02-12 11:08 UTC (permalink / raw) To: qemu-devel; +Cc: Kevin Wolf, Laurent Vivier, kvm-devel Hi, On Wed, 11 Feb 2009, Chris Wright wrote: > * Kevin Wolf (kwolf@suse.de) wrote: > > I would suspect that simply having a 64 bit host isn't enough to trigger > > the problem. These patches were in for half a year now without anyone > > noticing such failure. > > BTW, we've seen similar corruption, but not narrowed it down successfully as > it's been quite difficult to trigger. Same here, Dscho ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-11 7:00 ` [Qemu-devel] " Jamie Lokier (?) (?) @ 2009-02-13 6:41 ` Marc Bevand 2009-02-13 11:16 ` [Qemu-devel] " Kevin Wolf -1 siblings, 1 reply; 45+ messages in thread From: Marc Bevand @ 2009-02-13 6:41 UTC (permalink / raw) To: kvm Jamie Lokier <jamie <at> shareable.org> writes: > > As you see from the subject, I'm getting qcow2 corruption. > > I have a Windows 2000 guest which boots and runs fine in kvm-72, fails > with a blue-screen indicating file corruption errors in kvm-73 through > to kvm-83 (the latest), and succeeds if I replace block-qcow2.c with > the version from kvm-72. > > The blue screen appears towards the end of the boot sequence, and > shows only briefly before rebooting. It says: > > STOP: c0000218 (Registry File Failure) > The registry cannot load the hive (file): > \SystemRoot\System32\Config\SOFTWARE > or its log or alternate. > It is corrupt, absent, or not writable. > > Beginning dump of physical memory > Physical memory dump complete. Contact your system administrator or > technical support [...?] I have got a massive KVM installation with hundreds of guests running dozens of different OSes, and have also noticed multiple qcow2 corruption bugs. All my guests are using the qcow2 format, and my hosts are running vanilla linux 2.6.28 x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors). My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to run kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the version from kvm-72 to fix the BSOD. kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact registry corruption error: http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_id=180599 This bug is also fixed by reverting block-qcow2.c to the version from kvm-72. 
I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the qcow2 performance regression caused by the default writethrough caching policy) but they randomly trigger an even worse bug: the moment I shut down a guest by typing "quit" in the monitor, it sometimes overwrites the first 4kB of the disk image with mostly NUL bytes (!) which completely destroys it. I am familiar with the qcow2 format and this 4kB block appears to be an L2 table with most entries set to zero. I have had to restore at least 6 or 7 disk images from backup after occurrences of that bug. My intuition tells me this may be the qcow2 code trying to allocate a cluster to write a new L2 table, but not noticing the allocation failed (represented by a 0 offset), and writing the L2 table at that 0 offset, overwriting the qcow2 header. Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted to its kvm-72 version. Basically qcow2 in kvm-73 or newer is completely unreliable. -marc ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-13 6:41 ` Marc Bevand @ 2009-02-13 11:16 ` Kevin Wolf 0 siblings, 0 replies; 45+ messages in thread From: Kevin Wolf @ 2009-02-13 11:16 UTC (permalink / raw) To: Marc Bevand; +Cc: kvm, qemu-devel, Gleb Natapov Hi Marc, You should not take qemu-devel out of the CC list. This is where the bugs need to be fixed, they aren't KVM specific. I'm quoting your complete mail to forward it to where it belongs. Marc Bevand schrieb: > Jamie Lokier <jamie <at> shareable.org> writes: >> As you see from the subject, I'm getting qcow2 corruption. >> >> I have a Windows 2000 guest which boots and runs fine in kvm-72, fails >> with a blue-screen indicating file corruption errors in kvm-73 through >> to kvm-83 (the latest), and succeeds if I replace block-qcow2.c with >> the version from kvm-72. >> >> The blue screen appears towards the end of the boot sequence, and >> shows only briefly before rebooting. It says: >> >> STOP: c0000218 (Registry File Failure) >> The registry cannot load the hive (file): >> \SystemRoot\System32\Config\SOFTWARE >> or its log or alternate. >> It is corrupt, absent, or not writable. >> >> Beginning dump of physical memory >> Physical memory dump complete. Contact your system administrator or >> technical support [...?] > > I have got a massive KVM installation with hundreds of guests runnings dozens of > different OSes, and have also noticed multiple qcow2 corruption bugs. All my > guests are using the qcow2 format, and my hosts are running vanilla linux 2.6.28 > x86_64 kernels and use NPT (Opteron 'Barcelona' 23xx processors). > > My Windows 2000 guests BSOD just like yours with kvm-73 or newer. I have to run > kvm-75 (I need the NPT fixes it contains) with block-qcow2.c reverted to the > version from kvm-72 to fix the BSOD. 
> > kvm-73+ also causes some of my Windows 2003 guests to exhibit this exact > registry corruption error: > http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2001452&group_id=180599 > This bug is also fixed by reverting block-qcow2.c to the version from kvm-72. > > I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the > qcow2 performance regression caused by the default writethrough caching policy) > but it randomly triggers an even worse bug: the moment I shut down a guest by > typing "quit" in the monitor, it sometimes overwrite the first 4kB of the disk > image with mostly NUL bytes (!) which completely destroys it. I am familiar with > the qcow2 format and apparently this 4kB block seems to be an L2 table with most > entries set to zero. I have had to restore at least 6 or 7 disk images from > backup after occurences of that bug. My intuition tells me this may be the qcow2 > code trying to allocate a cluster to write a new L2 table, but not noticing the > allocation failed (represented by a 0 offset), and writing the L2 table at that > 0 offset, overwriting the qcow2 header. > > Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted > to its kvm-72 version. > > Basically qcow2 in kvm-73 or newer is completely unreliable. > > -marc I think the corruption is a completely unrelated bug. I would suspect it was introduced in one of Gleb's patches in December. Adding him to CC. Kevin ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-13 11:16 ` [Qemu-devel] " Kevin Wolf @ 2009-02-13 16:23 ` Jamie Lokier -1 siblings, 0 replies; 45+ messages in thread From: Jamie Lokier @ 2009-02-13 16:23 UTC (permalink / raw) To: qemu-devel; +Cc: Marc Bevand, Gleb Natapov, kvm Marc Bevand schrieb: > I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older > because of the qcow2 performance regression caused by the default > writethrough caching policy) but it randomly triggers an even worse > bug: the moment I shut down a guest by typing "quit" in the monitor, > it sometimes overwrite the first 4kB of the disk image with mostly > NUL bytes (!) which completely destroys it. I am familiar with the > qcow2 format and apparently this 4kB block seems to be an L2 table > with most entries set to zero. I have had to restore at least 6 or 7 > disk images from backup after occurences of that bug. Ow! That's a really serious bug. How many of us have regular hourly backups of our disk images? And how many of us are running databases or mail servers on our VMs, where even restoring from a recent backup is a harmful event? I've not noticed this bug reported by Marc, probably because I nearly always finish a KVM session by killing it, either because I'm testing or because KVM locks up occasionally and needs kill -9 :-( And because I've not used any KVM since kvm-72 in production until recently, only for testing my personal VMs. I must say, _thank goodness_ that the bug I reported occurs at boot time, and caused me to revert the qcow2 code. I'm now running a critical VM on kvm-83 with reverted qcow2. Sure it's risky as there's no reason to believe kvm-83 is "stable", but there's no reason to believe any other version of KVM is especially stable either - there's no stabilising bug-fix-only branch that I'm aware of. If I hadn't had the boot time bug which I reported, I could have unrecoverable corruption instead from Marc's bug. 
For the time being, I'm going to _strongly_ advise my VM-using
professional clients to never, *ever* use qcow2 except for snapshot
testing. Unfortunately the other delta/growable formats seem to be
even less reliable, because they're not used much, so they should be
avoided too. This corruption, plus the data integrity/durability
issues on host failure, is a big deal.

Even with kvm-72, I'm nervous about qcow2 now. Just because a bug
hasn't caused obvious guest failures doesn't mean it's not happening.
Is there a way to restructure the code and/or how it works so it's
more clearly correct?

> My intuition tells me this may be the qcow2 code trying to allocate
> a cluster to write a new L2 table, but not noticing the allocation
> failed (represented by a 0 offset), and writing the L2 table at that
> 0 offset, overwriting the qcow2 header.

My intuition says it's important to identify the cause of this, as it
might not be qcow2 but the AIO code going awry with a random offset
when closing down, e.g. if there's a use-after-free bug.

Marc... this is quite a serious bug you've reported. Is there a reason
you didn't report it earlier?

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-13 16:23 ` Jamie Lokier @ 2009-02-13 18:43 ` Chris Wright -1 siblings, 0 replies; 45+ messages in thread
From: Chris Wright @ 2009-02-13 18:43 UTC (permalink / raw)
To: Jamie Lokier; +Cc: qemu-devel, Marc Bevand, Gleb Natapov, kvm

* Jamie Lokier (jamie@shareable.org) wrote:
> no reason to believe kvm-83 is "stable", but there's no reason to
> believe any other version of KVM is especially stable either - there's
> no stabilising bug fix only branch that I'm aware of.

There's an ad-hoc one w/out formal releases. But... never been closer ;-)

http://thread.gmane.org/gmane.comp.emulators.kvm.devel/28179

thanks,
-chris

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-13 16:23 ` Jamie Lokier (?) (?) @ 2009-02-14 6:31 ` Marc Bevand 2009-02-14 22:28 ` Dor Laor 2009-02-15 2:37 ` Jamie Lokier -1 siblings, 2 replies; 45+ messages in thread
From: Marc Bevand @ 2009-02-14 6:31 UTC (permalink / raw)
To: Jamie Lokier; +Cc: qemu-devel, Gleb Natapov, kvm

On Fri, Feb 13, 2009 at 8:23 AM, Jamie Lokier <jamie@shareable.org> wrote:
>
> Marc.. this is quite a serious bug you've reported. Is there a
> reason you didn't report it earlier?

Because I only started hitting that bug a couple of weeks ago, after
having upgraded to a buggy kvm version.

> Is there a way to restructure the code and/or how it works so it's
> more clearly correct?

I am seriously concerned about the general design of qcow2. The code
base is more complex than it needs to be, the format itself is
susceptible to race conditions causing cluster leaks when updating
some internal data structures, it gets easily fragmented, etc.

I am considering implementing a new disk image format that supports
base images, snapshots (of the guest state), and clones (of the disk
content); that has a radically simpler design & code base; that is
always consistent "on disk"; that is friendly to delta diffing (i.e.
space-efficient when used with ZFS snapshots or rsync); and that makes
use of checksumming & replication to detect & fix corruption of
critical data structures (ideally this should be implemented by the
filesystem, but unfortunately ZFS is not available everywhere :D).

I believe the key to achieving these (seemingly utopian) goals is to
represent a disk "image" as a set of sparse files, 1 per
snapshot/clone.

-marc

^ permalink raw reply	[flat|nested] 45+ messages in thread
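The sparse-file-per-snapshot idea Marc describes can be modelled as one copy-on-write layer per snapshot/clone, with reads falling through to the parent chain. The following is only a toy in-memory sketch of the proposal, not a real on-disk format:

```python
# Toy sketch of Marc's proposal: each snapshot/clone is its own sparse
# mapping of cluster index -> data; reads fall through to the parent.
CLUSTER = 4096

class Layer:
    def __init__(self, parent=None):
        self.parent = parent
        self.clusters = {}           # sparse: only written clusters stored

    def write(self, idx, data):
        assert len(data) == CLUSTER
        self.clusters[idx] = bytes(data)

    def read(self, idx):
        layer = self
        while layer is not None:     # walk the snapshot chain
            if idx in layer.clusters:
                return layer.clusters[idx]
            layer = layer.parent
        return b"\x00" * CLUSTER     # unallocated clusters read as zeros

base = Layer()
base.write(0, b"B" * CLUSTER)
clone = Layer(parent=base)           # writable snapshot sharing base data
clone.write(1, b"C" * CLUSTER)
assert clone.read(0) == b"B" * CLUSTER    # inherited from base
assert clone.read(1) == b"C" * CLUSTER    # local write
assert base.read(1) == b"\x00" * CLUSTER  # base unaffected
```

Each `Layer` would correspond to one sparse file on disk, which is what makes the scheme friendly to rsync and filesystem-level snapshots.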
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-14 6:31 ` Marc Bevand @ 2009-02-14 22:28 ` Dor Laor 2009-02-15 2:37 ` Jamie Lokier 1 sibling, 0 replies; 45+ messages in thread
From: Dor Laor @ 2009-02-14 22:28 UTC (permalink / raw)
To: Marc Bevand; +Cc: Jamie Lokier, qemu-devel, Gleb Natapov, kvm

Marc Bevand wrote:
> On Fri, Feb 13, 2009 at 8:23 AM, Jamie Lokier <jamie@shareable.org> wrote:
>
>> Marc.. this is quite a serious bug you've reported. Is there a
>> reason you didn't report it earlier?
>>
>
> Because I only started hitting that bug a couple of weeks ago, after
> having upgraded to a buggy kvm version.
>
>> Is there a way to restructure the code and/or how it works so it's
>> more clearly correct?
>>
>
> I am seriously concerned about the general design of qcow2. The code
> base is more complex than it needs to be, the format itself is
> susceptible to race conditions causing cluster leaks when updating
> some internal data structures, it gets easily fragmented, etc.
>
> I am considering implementing a new disk image format that supports
> base images, snapshots (of the guest state), clones (of the disk
> content); that has a radically simpler design & code base; that is
> always consistent "on disk"; that is friendly to delta diffing (i.e.
> space-efficient when used with ZFS snapshots or rsync); and that makes
> use of checksumming & replication to detect & fix corruption of
> critical data structures (ideally this should be implemented by the
> filesystem, unfortunately ZFS is not available everywhere :D).
>
> I believe the key to achieving these (seemingly utopian) goals is to
> represent a disk "image" as a set of sparse files, 1 per
> snapshot/clone.
>

Both qcow2 and vmdk have the ability to keep 'external' snapshots. In
addition to what you wrote, qcow2 is missing a journal for its metadata,
and it also performs poorly because of complex metadata and sync calls.

We might use the vmdk format or VHD as a base for a future
high-performing, safe image format for qemu.

-dor

> -marc
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-14 22:28 ` Dor Laor (?) @ 2009-02-15 2:27 ` Jamie Lokier -1 siblings, 0 replies; 45+ messages in thread
From: Jamie Lokier @ 2009-02-15 2:27 UTC (permalink / raw)
To: Dor Laor; +Cc: Marc Bevand, qemu-devel, Gleb Natapov, kvm

Dor Laor wrote:
> Both qcow2 and vmdk have the ability to keep 'external' snapshots.

I didn't see any mention of this in QEMU's documentation. One of the
most annoying features of qcow2 is "savevm" storing all VM snapshots
in the same qcow2 file. Is this not true?

> In addition to what you wrote, qcow2 is missing journal for its meta
> data and also performs poorly because of complex meta data and sync
> calls.
>
> We might use vmdk format or VHD as a base for the future high
> performing, safe image format for qemu

You'll want to validate VHD carefully. I tested it just yesterday
(with kvm-83), and "qemu-img convert" does not correctly unpack my VHD
image (from Microsoft Virtual PC) to raw, compared with the unpacked
version from MSVPC's own conversion tool. There are some patches that
greatly improve the VHD support; I'm not sure if they're in kvm-83.

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-14 22:28 ` Dor Laor @ 2009-02-15 7:56 ` Marc Bevand -1 siblings, 0 replies; 45+ messages in thread
From: Marc Bevand @ 2009-02-15 7:56 UTC (permalink / raw)
To: dlaor; +Cc: Jamie Lokier, qemu-devel, Gleb Natapov, kvm

On Sat, Feb 14, 2009 at 2:28 PM, Dor Laor <dlaor@redhat.com> wrote:
>
> Both qcow2 and vmdk have the ability to keep 'external' snapshots.

I know, but they don't implement one feature I cited: clones, or
"writable snapshots", which I would like implemented with support for
deduplication. Base images / backing files are too limited because
they have to be managed by the end user, and there is no deduplication
done between multiple images based on the same backing file.

> We might use vmdk format or VHD as a base for the future high performing,
> safe image format for qemu

Neither vmdk nor vhd satisfies my requirements: not always consistent
on disk, no possibility of detecting/correcting errors, susceptible to
fragmentation (affects vmdk, not sure about vhd), and possibly others.

Jamie: yes, in an ideal world the storage virtualization layer could
make use of the host's filesystem or block layer snapshotting/cloning
features, but in the real world too few OSes implement these.

-marc

^ permalink raw reply	[flat|nested] 45+ messages in thread
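The error-detection requirement Marc raises can be sketched minimally: store each critical metadata block with a checksum, keep a second copy, and recover from whichever copy still verifies. This is an illustrative sketch of the idea, not any existing format's layout:

```python
# Illustrative sketch of checksummed, replicated metadata: a torn or
# corrupted primary copy is detected by CRC and the replica is used.
import zlib

def seal(block: bytes) -> bytes:
    # prepend a 4-byte CRC32 so corruption of the block is detectable
    return zlib.crc32(block).to_bytes(4, "big") + block

def open_sealed(sealed: bytes):
    crc, block = sealed[:4], sealed[4:]
    return block if zlib.crc32(block).to_bytes(4, "big") == crc else None

def read_metadata(primary: bytes, replica: bytes) -> bytes:
    for copy in (primary, replica):
        block = open_sealed(copy)
        if block is not None:
            return block
    raise IOError("both copies of the metadata block are corrupt")

meta = b"L1 table: cluster map v1"
primary, replica = seal(meta), seal(meta)
corrupted = primary[:4] + b"X" + primary[5:]   # flip one data byte
assert open_sealed(corrupted) is None          # corruption detected
assert read_metadata(corrupted, replica) == meta
```

CRC32 is only a sketch-level choice here; a real format would likely use a stronger hash, but the detect-then-fall-back structure is the same.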
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change @ 2009-02-15 7:56 ` Marc Bevand 0 siblings, 0 replies; 45+ messages in thread From: Marc Bevand @ 2009-02-15 7:56 UTC (permalink / raw) To: dlaor; +Cc: qemu-devel, Gleb Natapov, kvm On Sat, Feb 14, 2009 at 2:28 PM, Dor Laor <dlaor@redhat.com> wrote: > > Both qcow2 and vmdk have the ability to keep 'external' snapshots. I know but they don't implement one feature I cited: clones, or "writable snapshots", which I would like implemented with support for deduplication. Base images / backing files are too limited because they have to be managed by the enduser and there is no deduplication done between multiple images based on the same backing file. > We might use vmdk format or VHD as a base for the future high performing, > safe image format for qemu Neither vmdk nor vhd satisfy my requirements: not always consistent on disk, no possibility of detecting/correcting errors, susceptible to fragmentation (affects vmdk, not sure about vhd), and possibly others. Jamie: yes in an ideal world, the storage virtualization layer could make use of the host's filesystem or block layer snapshotting/cloning features, but in the real world too few OSes implement these. -marc ^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [Qemu-devel] Re: qcow2 corruption observed, fixed by reverting old change 2009-02-14 6:31 ` Marc Bevand 2009-02-14 22:28 ` Dor Laor @ 2009-02-15 2:37 ` Jamie Lokier 1 sibling, 0 replies; 45+ messages in thread From: Jamie Lokier @ 2009-02-15 2:37 UTC (permalink / raw) To: Marc Bevand; +Cc: qemu-devel, Gleb Natapov, kvm Marc Bevand wrote: > On Fri, Feb 13, 2009 at 8:23 AM, Jamie Lokier <jamie@shareable.org> wrote: > > > > Marc.. this is quite a serious bug you've reported. Is there a > > reason you didn't report it earlier? > > Because I only started hitting that bug a couple weeks ago after > having upgraded to a buggy kvm version. > > > Is there a way to restructure the code and/or how it works so it's > > more clearly correct? > > I am seriously concerned about the general design of qcow2. The code > base is more complex than it needs to be, the format itself is > susceptible to race conditions causing cluster leaks when updating > some internal datastructures, it gets easily fragmented, etc. When I read it, I thought the code was remarkably compact for what it does, although I agree that the leaks, fragmentation and inconsistency on crashes are serious. From elsewhere it sounds like the refcount update cost is significant too. > I am considering implementing a new disk image format that supports > base images, snapshots (of the guest state), clones (of the disk > content); that has a radically simpler design & code base; that is > always consistent "on disk"; that is friendly to delta diffing (ie. > space-efficient when used with ZFS snapshots or rsync); and that makes > use of checksumming & replication to detect & fix corruption of > critical data structures (ideally this should be implemented by the > filesystem, unfortunately ZFS is not available everywhere :D). You have just described a high quality modern filesystem or database engine; both would certainly be far more complex than qcow2's code. 
Especially with checksumming and replication :) ZFS isn't everywhere,
but it looks like everyone wants to clone ZFS's best features
everywhere (but not its worst feature: lots of memory required). I've
had similar thoughts myself, by the way :-)

> I believe the key to achieve these (seemingly utopian) goals is to
> represent a disk "image" as a set of sparse files, 1 per
> snapshot/clone.

You can already do this, if your filesystem supports snapshotting. On
Linux hosts, any filesystem can snapshot by using LVM underneath it
(although it's not pretty to do). A few experimental Linux filesystems
let you snapshot at the filesystem level.

A feature you missed in the utopian vision is sharing backing store
for equal parts of files between different snapshots _after_ they've
been written in separate branches (with the same data), and also among
different VMs. It's becoming stylish to put similarity detection in
the filesystem somewhere too :-)

-- Jamie

^ permalink raw reply	[flat|nested] 45+ messages in thread
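The similarity detection Jamie mentions is commonly implemented as content-addressed storage: address each cluster by a hash of its contents, so equal data written independently in different snapshots or VMs is stored once. A minimal sketch, illustrative only:

```python
# Minimal content-addressed cluster store: identical clusters written
# by different snapshots or VMs share one stored copy.
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}                   # hash -> data, shared by all images

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(key, data)  # equal content stored only once
        return key                         # images keep keys, not data

    def get(self, key: str) -> bytes:
        return self.blocks[key]

store = DedupStore()
a = store.put(b"same cluster contents")    # written by VM A
b = store.put(b"same cluster contents")    # written independently by VM B
c = store.put(b"different contents")
assert a == b and len(store.blocks) == 2   # A's and B's cluster deduplicated
```

The trade-off is hashing every write; this is why, as Jamie notes, the feature tends to migrate into the filesystem rather than the image format.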
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-13 11:16 ` [Qemu-devel] " Kevin Wolf @ 2009-02-15 10:57 ` Gleb Natapov -1 siblings, 0 replies; 45+ messages in thread
From: Gleb Natapov @ 2009-02-15 10:57 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Marc Bevand, kvm, qemu-devel

> > I tested kvm-81 and kvm-83 as well (can't test kvm-80 or older because of the
> > qcow2 performance regression caused by the default writethrough caching policy)
> > but it randomly triggers an even worse bug: the moment I shut down a guest by
> > typing "quit" in the monitor, it sometimes overwrites the first 4kB of the disk
> > image with mostly NUL bytes (!), which completely destroys it. I am familiar with
> > the qcow2 format, and this 4kB block appears to be an L2 table with most
> > entries set to zero. I have had to restore at least 6 or 7 disk images from
> > backup after occurrences of that bug. My intuition tells me this may be the qcow2
> > code trying to allocate a cluster to write a new L2 table, not noticing the
> > allocation failed (represented by a 0 offset), and writing the L2 table at that
> > 0 offset, overwriting the qcow2 header.
> >
> > Fortunately this bug is also fixed by running kvm-75 with block-qcow2.c reverted
> > to its kvm-72 version.
> >
> > Basically qcow2 in kvm-73 or newer is completely unreliable.
> >
> > -marc
>
> I think the corruption is a completely unrelated bug. I would suspect it
> was introduced in one of Gleb's patches in December. Adding him to CC.
>
I am not able to reproduce this. After more than a hundred "boot linux;
generate disk I/O; quit" loops, all I've got is an image with 7 leaked
blocks and a couple of filesystem corruptions that were fixed by fsck.

--
Gleb.

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-15 10:57 ` [Qemu-devel] " Gleb Natapov @ 2009-02-15 11:46 ` Marc Bevand -1 siblings, 0 replies; 45+ messages in thread
From: Marc Bevand @ 2009-02-15 11:46 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Kevin Wolf, kvm, qemu-devel

On Sun, Feb 15, 2009 at 2:57 AM, Gleb Natapov <gleb@redhat.com> wrote:
>
> I am not able to reproduce this. After more than a hundred "boot linux;
> generate disk I/O; quit" loops, all I've got is an image with 7 leaked
> blocks and a couple of filesystem corruptions that were fixed by fsck.

The type of activity occurring in the guest is most likely an important
factor determining the probability of the bug occurring. So you should
try running a guest OS that I remember having been affected by it:
Windows 2003 SP2 x64. And now that I think about it, I don't recall any
other guest OS having been a victim of that bug... coincidence?

Other factors you might consider when trying to reproduce: the qcow2
images that ended up being corrupted had a backing file (a read-only
qcow2 image); NPT was in use; the host filesystem was xfs; my command
line was:

$ qemu-system-x86_64 -name xxx -monitor stdio -vnc xxx:xxx -hda hda \
    -net nic,macaddr=xx:xx:xx:xx:xx:xx,model=rtl8139 -net tap -boot c \
    -cdrom "" -cpu qemu64 -m 1024 -usbdevice tablet

-marc

^ permalink raw reply	[flat|nested] 45+ messages in thread
* Re: qcow2 corruption observed, fixed by reverting old change 2009-02-15 11:46 ` [Qemu-devel] " Marc Bevand @ 2009-02-15 11:54 ` Marc Bevand -1 siblings, 0 replies; 45+ messages in thread
From: Marc Bevand @ 2009-02-15 11:54 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Kevin Wolf, kvm, qemu-devel

On Sun, Feb 15, 2009 at 3:46 AM, Marc Bevand <m.bevand@gmail.com> wrote:
> Other factors you might consider when trying to reproduce: [...]

And the probability of that bug occurring seems less than 1% (I only
witnessed 6 or 7 occurrences out of about a thousand shutdown events).
Also, contrary to what I said, I am *not* sure whether the "quit"
monitor command was used or not. Instead the guests may have been
using ACPI to shut themselves down.

-marc

^ permalink raw reply	[flat|nested] 45+ messages in thread