* [Qemu-devel] [PATCH 0/2]: vmdk: Add read-only support for the new seSparse format
From: Sam Eiderman @ 2019-04-24  7:48 UTC
  To: fam, kwolf, mreitz, qemu-block
  Cc: arbel.moshe, karl.heubaum, eyal.moscovici, liran.alon,
	qemu-devel, shmuel.eiderman

VMware introduced a new snapshot format in VMFS6 - seSparse (Space
Efficient Sparse), which is the default format available in ESXi 6.7.
Add read-only support for the new snapshot format.

* [Qemu-devel] [PATCH 1/2] vmdk: Fix comment regarding max l1_size coverage
From: Sam Eiderman @ 2019-04-24  7:49 UTC
  To: fam, kwolf, mreitz, qemu-block
  Cc: arbel.moshe, karl.heubaum, eyal.moscovici, liran.alon,
	qemu-devel, shmuel.eiderman

Commit b0651b8c246d ("vmdk: Move l1_size check into vmdk_add_extent")
extended the l1_size check from VMDK4 to VMDK3 but did not update the
default coverage in the moved comment.

The previous vmdk4 calculation:

    (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB

The added vmdk3 calculation:

    (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB

Add the vmdk3 calculation to the comment.

In any case, VMware does not offer virtual disks larger than 2TB for
vmdk4/vmdk3, or larger than 64TB for the new undocumented seSparse
format, which is not yet implemented in QEMU.
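
The same arithmetic, as a standalone sanity check (a minimal sketch
using the default geometries quoted above; these constants are not read
from the driver):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        /* 512M entries - the L1 size limit enforced in vmdk_add_extent() */
        uint64_t l1_entries = 512ULL * 1024 * 1024;

        /* VMDK4 defaults: 512 L2 entries, 64KB grain */
        uint64_t vmdk4_max = l1_entries * 512 * 65536;

        /* VMDK3 defaults: 4096 L2 entries, 512B grain */
        uint64_t vmdk3_max = l1_entries * 4096 * 512;

        printf("vmdk4: %" PRIu64 " PB\n", vmdk4_max >> 50);   /* 16 */
        printf("vmdk3: %" PRIu64 " PB\n", vmdk3_max >> 50);   /* 1 */
        return 0;
    }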

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
---
 block/vmdk.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index de8cb859f8..fc7378da78 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -426,10 +426,15 @@ static int vmdk_add_extent(BlockDriverState *bs,
         return -EFBIG;
     }
     if (l1_size > 512 * 1024 * 1024) {
-        /* Although with big capacity and small l1_entry_sectors, we can get a
+        /*
+         * Although with big capacity and small l1_entry_sectors, we can get a
          * big l1_size, we don't want unbounded value to allocate the table.
-         * Limit it to 512M, which is 16PB for default cluster and L2 table
-         * size */
+         * Limit it to 512M, which is:
+         *     16PB - for default "Hosted Sparse Extent" (VMDK4)
+         *            cluster size: 64KB, L2 table size: 512 entries
+         *     1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
+         *            cluster size: 512B, L2 table size: 4096 entries
+         */
         error_setg(errp, "L1 size too big");
         return -EFBIG;
     }
-- 
2.13.3

* [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
From: Sam Eiderman @ 2019-04-24  7:49 UTC
  To: fam, kwolf, mreitz, qemu-block
  Cc: arbel.moshe, karl.heubaum, eyal.moscovici, liran.alon,
	qemu-devel, shmuel.eiderman

Until ESXi 6.5, VMware used the vmfsSparse format for snapshots (VMDK3
in QEMU).

This format had the following shortcomings:

    * Grain directory (L1) and grain table (L2) entries were 32-bit,
      allowing access to only 2TB (slightly less) of data.
    * The grain size (default) was 512 bytes - leading to data
      fragmentation and many grain tables.
    * For space reclamation purposes, it was necessary to find all the
      grains which are not pointed to by any grain table - so a reverse
      mapping of "offset of grain in vmdk" to "grain table" had to be
      constructed - which took large amounts of CPU/RAM.

The format specification can be found in VMware's documentation:
https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf

In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
introduced: SESparse (Space Efficient).

This format fixes the above issues:

    * All entries are now 64-bit.
    * The grain size (default) is 4KB.
    * Grain directory and grain tables are now located at the beginning
      of the file.
      + seSparse format reserves space for all grain tables.
      + Grain tables can be addressed using an index.
      + Grains are located at the end of the file and can also be
        addressed with an index.
      - seSparse vmdks of large disks (64TB) have huge preallocated
        headers - mainly due to L2 tables, even for empty snapshots.
    * The header contains a reverse mapping ("backmap") of "offset of
      grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
      specifies, for each grain, whether it is allocated or not.
      Using these data structures, space reclamation can be implemented
      efficiently.
    * Since the header now maintains two mappings:
        * The regular one (grain directory & grain tables)
        * A reverse one (backmap and free bitmap)
      these data structures can lose consistency upon a crash and result
      in a corrupted VMDK.
      Therefore, a journal is also added to the VMDK and is replayed
      when VMware reopens the file after a crash.

Since ESXi 6.7, seSparse is the only snapshot format available.

Unfortunately, VMware does not provide documentation regarding the new
seSparse format.

This commit is based on black-box research of the seSparse format.
Various in-guest block operations and their effect on the snapshot file
were tested.

The only VMware-provided source of information regarding the underlying
implementation was a log file on the ESXi host:

    /var/log/hostd.log

Whenever an seSparse snapshot is created, the log is populated with
seSparse records.

Relevant log records are of the form:

[...] Const Header:
[...]  constMagic     = 0xcafebabe
[...]  version        = 2.1
[...]  capacity       = 204800
[...]  grainSize      = 8
[...]  grainTableSize = 64
[...]  flags          = 0
[...] Extents:
[...]  Header         : <1 : 1>
[...]  JournalHdr     : <2 : 2>
[...]  Journal        : <2048 : 2048>
[...]  GrainDirectory : <4096 : 2048>
[...]  GrainTables    : <6144 : 2048>
[...]  FreeBitmap     : <8192 : 2048>
[...]  BackMap        : <10240 : 2048>
[...]  Grain          : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic     = 0xcafecafe
[...] FreeGTNumber      = 0
[...] nextTxnSeqNumber  = 0
[...] replayJournal     = 0

The sizes seen in the log file are in sectors.
Extents are listed in the format <offset : size>.
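
For example, the GrainDirectory extent above, <4096 : 2048>, starts at
sector 4096 (byte offset 4096 * 512 = 2MB) and spans 2048 sectors
(1MB).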

This commit is a strict implementation which enforces:
    * magics
    * version number 2.1
    * grain size of 8 sectors (4KB)
    * grain table size of 64 sectors
    * zero flags
    * extent locations

Additionally, this commit provides only a subset of the functionality
offered by the seSparse format:
    * Read-only
    * No journal replay
    * No space reclamation
    * No unmap support

Hence, the journal header, journal, free bitmap and backmap extents are
unused; only the "classic" (L1 -> L2 -> data) grain access is
implemented.

However, there are several differences in the grain access itself
(summarized in the C sketch after the lists below).
Grain directory (L1):
    * Grain directory entries are indexes (not offsets) to grain
      tables.
    * Valid grain directory entries have their highest nibble set to
      0x1.
    * Since grain tables are always located at the beginning of the
      file, the index fits into 32 bits - so we can use the entry's low
      part if it's valid.
Grain table (L2):
    * Grain table entries are indexes (not offsets) to grains.
    * If the highest nibble of the entry is:
        0x0:
            The grain is not allocated.
            The rest of the bytes are 0.
        0x1:
            The grain is unmapped - the guest sees a zero grain.
            The rest of the bits point to the previously mapped grain,
            see 0x3 case.
        0x2:
            The grain is zero.
        0x3:
            The grain is allocated - to get the index calculate:
            ((entry & 0x0fff000000000000) >> 48) |
            ((entry & 0x0000ffffffffffff) << 12)
    * The difference between 0x1 and 0x2 is that 0x1 results from the
      guest using sg_unmap to unmap the grain - the grain itself still
      exists in the grain extent, and a space reclamation procedure
      should delete it.
      Unmapping a zero grain has no effect (0x2 will not change to 0x1),
      but unmapping an unallocated grain will (0x0 changes to 0x1) -
      naturally.
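
A minimal C sketch of the two decoding steps described above (the
helper names are hypothetical; the actual logic lives in
get_cluster_offset() in block/vmdk.c, and the driver's L1 check is
stricter - it requires the whole top half to be 0x10000000):

    #include <stdbool.h>
    #include <stdint.h>

    /* Grain directory (L1) entry -> grain table index.
     * Returns false if the grain table is unallocated or the entry is
     * invalid (highest nibble not 0x1). */
    static bool se_sparse_l1_to_gt_index(uint64_t entry, uint32_t *gt_index)
    {
        if (entry == 0) {
            return false;                       /* unallocated */
        }
        if ((entry & 0xf000000000000000ULL) != 0x1000000000000000ULL) {
            return false;                       /* invalid entry */
        }
        *gt_index = (uint32_t)(entry & 0xffffffffULL); /* low 32 bits */
        return true;
    }

    /* Grain table (L2) entry -> grain index.  Only the 0x3 (allocated)
     * case yields an index; 0x0/0x1/0x2 read as unallocated or zero,
     * anything else is invalid - all return false here. */
    static bool se_sparse_l2_to_grain_index(uint64_t entry, uint64_t *grain)
    {
        if ((entry & 0xf000000000000000ULL) != 0x3000000000000000ULL) {
            return false;
        }
        *grain = ((entry & 0x0fff000000000000ULL) >> 48) |
                 ((entry & 0x0000ffffffffffffULL) << 12);
        return true;
    }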

In order to implement seSparse, some fields had to be changed to
support both 32-bit and 64-bit entry sizes.

Read-only support is implemented by failing on pwrite, since qemu-img
opens the vmdk with read-write flags even for "convert" (which is a
read-only operation).

Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
---
 block/vmdk.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 459 insertions(+), 16 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index fc7378da78..e599c08b95 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -91,6 +91,44 @@ typedef struct {
     uint16_t compressAlgorithm;
 } QEMU_PACKED VMDK4Header;
 
+typedef struct VMDKSESparseConstHeader {
+    uint64_t magic;
+    uint64_t version;
+    uint64_t capacity;
+    uint64_t grain_size;
+    uint64_t grain_table_size;
+    uint64_t flags;
+    uint64_t reserved1;
+    uint64_t reserved2;
+    uint64_t reserved3;
+    uint64_t reserved4;
+    uint64_t volatile_header_offset;
+    uint64_t volatile_header_size;
+    uint64_t journal_header_offset;
+    uint64_t journal_header_size;
+    uint64_t journal_offset;
+    uint64_t journal_size;
+    uint64_t grain_dir_offset;
+    uint64_t grain_dir_size;
+    uint64_t grain_tables_offset;
+    uint64_t grain_tables_size;
+    uint64_t free_bitmap_offset;
+    uint64_t free_bitmap_size;
+    uint64_t backmap_offset;
+    uint64_t backmap_size;
+    uint64_t grains_offset;
+    uint64_t grains_size;
+    uint8_t pad[304];
+} QEMU_PACKED VMDKSESparseConstHeader;
+
+typedef struct VMDKSESparseVolatileHeader {
+    uint64_t magic;
+    uint64_t free_gt_number;
+    uint64_t next_txn_seq_number;
+    uint64_t replay_journal;
+    uint8_t pad[480];
+} QEMU_PACKED VMDKSESparseVolatileHeader;
+
 #define L2_CACHE_SIZE 16
 
 typedef struct VmdkExtent {
@@ -99,19 +137,23 @@ typedef struct VmdkExtent {
     bool compressed;
     bool has_marker;
     bool has_zero_grain;
+    bool sesparse;
+    uint64_t sesparse_l2_tables_offset;
+    uint64_t sesparse_clusters_offset;
+    int32_t entry_size;
     int version;
     int64_t sectors;
     int64_t end_sector;
     int64_t flat_start_offset;
     int64_t l1_table_offset;
     int64_t l1_backup_table_offset;
-    uint32_t *l1_table;
+    void *l1_table;
     uint32_t *l1_backup_table;
     unsigned int l1_size;
     uint32_t l1_entry_sectors;
 
     unsigned int l2_size;
-    uint32_t *l2_cache;
+    void *l2_cache;
     uint32_t l2_cache_offsets[L2_CACHE_SIZE];
     uint32_t l2_cache_counts[L2_CACHE_SIZE];
 
@@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
          *            cluster size: 64KB, L2 table size: 512 entries
          *     1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
          *            cluster size: 512B, L2 table size: 4096 entries
+         *     8PB  - for default "ESXi seSparse Extent"
+         *            cluster size: 4KB,  L2 table size: 4096 entries
          */
         error_setg(errp, "L1 size too big");
         return -EFBIG;
@@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
     extent->l2_size = l2_size;
     extent->cluster_sectors = flat ? sectors : cluster_sectors;
     extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
+    extent->entry_size = sizeof(uint32_t);
 
     if (s->num_extents > 1) {
         extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
@@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
     int i;
 
     /* read the L1 table */
-    l1_size = extent->l1_size * sizeof(uint32_t);
+    l1_size = extent->l1_size * extent->entry_size;
     extent->l1_table = g_try_malloc(l1_size);
     if (l1_size && extent->l1_table == NULL) {
         return -ENOMEM;
@@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
         goto fail_l1;
     }
     for (i = 0; i < extent->l1_size; i++) {
-        le32_to_cpus(&extent->l1_table[i]);
+        if (extent->sesparse) {
+            le64_to_cpus((uint64_t *)extent->l1_table + i);
+        } else {
+            le32_to_cpus((uint32_t *)extent->l1_table + i);
+        }
     }
 
     if (extent->l1_backup_table_offset) {
+        assert(!extent->sesparse);
         extent->l1_backup_table = g_try_malloc(l1_size);
         if (l1_size && extent->l1_backup_table == NULL) {
             ret = -ENOMEM;
@@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
     }
 
     extent->l2_cache =
-        g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
+        g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
     return 0;
  fail_l1b:
     g_free(extent->l1_backup_table);
@@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
     return ret;
 }
 
+#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
+#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe
+
+static const char zero_pad[BDRV_SECTOR_SIZE];
+
+/* Strict checks - format not officially documented */
+static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
+                                        Error **errp)
+{
+    uint64_t grain_table_coverage;
+    uint64_t needed_grain_tables;
+    uint64_t needed_grain_dir_size;
+    uint64_t needed_grain_tables_size;
+    uint64_t needed_free_bitmap_size;
+
+    header->magic = le64_to_cpu(header->magic);
+    header->version = le64_to_cpu(header->version);
+    header->grain_size = le64_to_cpu(header->grain_size);
+    header->grain_table_size = le64_to_cpu(header->grain_table_size);
+    header->flags = le64_to_cpu(header->flags);
+    header->reserved1 = le64_to_cpu(header->reserved1);
+    header->reserved2 = le64_to_cpu(header->reserved2);
+    header->reserved3 = le64_to_cpu(header->reserved3);
+    header->reserved4 = le64_to_cpu(header->reserved4);
+
+    header->volatile_header_offset =
+        le64_to_cpu(header->volatile_header_offset);
+    header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
+
+    header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
+    header->journal_header_size = le64_to_cpu(header->journal_header_size);
+
+    header->journal_offset = le64_to_cpu(header->journal_offset);
+    header->journal_size = le64_to_cpu(header->journal_size);
+
+    header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
+    header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
+
+    header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
+    header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
+
+    header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
+    header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
+
+    header->backmap_offset = le64_to_cpu(header->backmap_offset);
+    header->backmap_size = le64_to_cpu(header->backmap_size);
+
+    header->grains_offset = le64_to_cpu(header->grains_offset);
+    header->grains_size = le64_to_cpu(header->grains_size);
+
+    if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
+        error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
+                   header->magic);
+        return -EINVAL;
+    }
+
+    if (header->version != 0x0000000200000001) {
+        error_setg(errp, "Unsupported version: 0x%016" PRIx64,
+                   header->version);
+        return -ENOTSUP;
+    }
+
+    if (header->grain_size != 8) {
+        error_setg(errp, "Unsupported grain size: %" PRIu64,
+                   header->grain_size);
+        return -ENOTSUP;
+    }
+
+    if (header->grain_table_size != 64) {
+        error_setg(errp, "Unsupported grain table size: %" PRIu64,
+                   header->grain_table_size);
+        return -ENOTSUP;
+    }
+
+    if (header->flags != 0) {
+        error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
+                   header->flags);
+        return -ENOTSUP;
+    }
+
+    if (header->reserved1 != 0 || header->reserved2 != 0 ||
+        header->reserved3 != 0 || header->reserved4 != 0) {
+        error_setg(errp, "Unsupported reserved bits:"
+                   " 0x%016" PRIx64 " 0x%016" PRIx64
+                   " 0x%016" PRIx64 " 0x%016" PRIx64,
+                   header->reserved1, header->reserved2,
+                   header->reserved3, header->reserved4);
+        return -ENOTSUP;
+    }
+
+    if (header->volatile_header_offset != 1) {
+        error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
+                   header->volatile_header_offset);
+        return -ENOTSUP;
+    }
+
+    if (header->volatile_header_size != 1) {
+        error_setg(errp, "Unsupported volatile header size: %" PRIu64,
+                   header->volatile_header_size);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_header_offset != 2) {
+        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
+                   header->journal_header_offset);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_header_size != 2) {
+        error_setg(errp, "Unsupported journal header size: %" PRIu64,
+                   header->journal_header_size);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_offset != 2048) {
+        error_setg(errp, "Unsupported journal offset: %" PRIu64,
+                   header->journal_offset);
+        return -ENOTSUP;
+    }
+
+    if (header->journal_size != 2048) {
+        error_setg(errp, "Unsupported journal size: %" PRIu64,
+                   header->journal_size);
+        return -ENOTSUP;
+    }
+
+    if (header->grain_dir_offset != 4096) {
+        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
+                   header->grain_dir_offset);
+        return -ENOTSUP;
+    }
+
+    /* in sectors */
+    grain_table_coverage = ((header->grain_table_size *
+                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
+                            header->grain_size);
+    needed_grain_tables = header->capacity / grain_table_coverage;
+    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
+                                         BDRV_SECTOR_SIZE);
+    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
+
+    if (header->grain_dir_size != needed_grain_dir_size) {
+        error_setg(errp, "Invalid grain directory size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->grain_dir_size, needed_grain_dir_size);
+        return -EINVAL;
+    }
+
+    if (header->grain_tables_offset !=
+        header->grain_dir_offset + header->grain_dir_size) {
+        error_setg(errp, "Grain directory must be followed by grain tables");
+        return -EINVAL;
+    }
+
+    needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
+    needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
+
+    if (header->grain_tables_size != needed_grain_tables_size) {
+        error_setg(errp, "Invalid grain tables size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->grain_tables_size, needed_grain_tables_size);
+        return -EINVAL;
+    }
+
+    if (header->free_bitmap_offset !=
+        header->grain_tables_offset + header->grain_tables_size) {
+        error_setg(errp, "Grain tables must be followed by free bitmap");
+        return -EINVAL;
+    }
+
+    /* in bits */
+    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
+                                           header->grain_size);
+    /* in bytes */
+    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
+                                           BITS_PER_BYTE);
+    /* in sectors */
+    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
+                                           BDRV_SECTOR_SIZE);
+    needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);
+
+    if (header->free_bitmap_size != needed_free_bitmap_size) {
+        error_setg(errp, "Invalid free bitmap size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->free_bitmap_size, needed_free_bitmap_size);
+        return -EINVAL;
+    }
+
+    if (header->backmap_offset !=
+        header->free_bitmap_offset + header->free_bitmap_size) {
+        error_setg(errp, "Free bitmap must be followed by backmap");
+        return -EINVAL;
+    }
+
+    if (header->backmap_size != header->grain_tables_size) {
+        error_setg(errp, "Invalid backmap size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->backmap_size, header->grain_tables_size);
+        return -EINVAL;
+    }
+
+    if (header->grains_offset !=
+        header->backmap_offset + header->backmap_size) {
+        error_setg(errp, "Backmap must be followed by grains");
+        return -EINVAL;
+    }
+
+    if (header->grains_size != header->capacity) {
+        error_setg(errp, "Invalid grains size: %" PRIu64
+                   ", needed: %" PRIu64,
+                   header->grains_size, header->capacity);
+        return -EINVAL;
+    }
+
+    /* check that padding is 0 */
+    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
+        error_setg(errp, "Unsupported non-zero const header padding");
+        return -ENOTSUP;
+    }
+
+    return 0;
+}
+
+static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
+                                           Error **errp)
+{
+    header->magic = le64_to_cpu(header->magic);
+    header->free_gt_number = le64_to_cpu(header->free_gt_number);
+    header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
+    header->replay_journal = le64_to_cpu(header->replay_journal);
+
+    if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
+        error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
+                   header->magic);
+        return -EINVAL;
+    }
+
+    if (header->replay_journal) {
+        error_setg(errp, "Replaying journal not supported");
+        return -ENOTSUP;
+    }
+
+    /* check that padding is 0 */
+    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
+        error_setg(errp, "Unsupported non-zero volatile header padding");
+        return -ENOTSUP;
+    }
+
+    return 0;
+}
+
+static int vmdk_open_se_sparse(BlockDriverState *bs,
+                               BdrvChild *file,
+                               int flags, Error **errp)
+{
+    int ret;
+    VMDKSESparseConstHeader const_header;
+    VMDKSESparseVolatileHeader volatile_header;
+    VmdkExtent *extent;
+
+    assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
+
+    ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
+    if (ret < 0) {
+        bdrv_refresh_filename(file->bs);
+        error_setg_errno(errp, -ret,
+                         "Could not read const header from file '%s'",
+                         file->bs->filename);
+        return ret;
+    }
+
+    /* check const header */
+    ret = check_se_sparse_const_header(&const_header, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
+
+    ret = bdrv_pread(file,
+                     const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
+                     &volatile_header, sizeof(volatile_header));
+    if (ret < 0) {
+        bdrv_refresh_filename(file->bs);
+        error_setg_errno(errp, -ret,
+                         "Could not read volatile header from file '%s'",
+                         file->bs->filename);
+        return ret;
+    }
+
+    /* check volatile header */
+    ret = check_se_sparse_volatile_header(&volatile_header, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    ret = vmdk_add_extent(bs, file, false,
+                          const_header.capacity,
+                          const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
+                          0,
+                          const_header.grain_dir_size *
+                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
+                          const_header.grain_table_size *
+                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
+                          const_header.grain_size,
+                          &extent,
+                          errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    extent->sesparse = true;
+    extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
+    extent->sesparse_clusters_offset = const_header.grains_offset;
+    extent->entry_size = sizeof(uint64_t);
+
+    ret = vmdk_init_tables(bs, extent, errp);
+    if (ret) {
+        /* free extent allocated by vmdk_add_extent */
+        vmdk_free_last_extent(bs);
+    }
+
+    return ret;
+}
+
 static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
                                QDict *options, Error **errp);
 
@@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
          * RW [size in sectors] SPARSE "file-name.vmdk"
          * RW [size in sectors] VMFS "file-name.vmdk"
          * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
+         * RW [size in sectors] SESPARSE "file-name.vmdk"
          */
         flat_offset = -1;
         matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
@@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
 
         if (sectors <= 0 ||
             (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
-             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
+             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
+             strcmp(type, "SESPARSE")) ||
             (strcmp(access, "RW"))) {
             continue;
         }
@@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
                 return ret;
             }
             extent = &s->extents[s->num_extents - 1];
+        } else if (!strcmp(type, "SESPARSE")) {
+            ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
+            if (ret) {
+                bdrv_unref_child(bs, extent_file);
+                return ret;
+            }
+            extent = &s->extents[s->num_extents - 1];
         } else {
             error_setg(errp, "Unsupported extent type '%s'", type);
             bdrv_unref_child(bs, extent_file);
@@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
     if (strcmp(ct, "monolithicFlat") &&
         strcmp(ct, "vmfs") &&
         strcmp(ct, "vmfsSparse") &&
+        strcmp(ct, "seSparse") &&
         strcmp(ct, "twoGbMaxExtentSparse") &&
         strcmp(ct, "twoGbMaxExtentFlat")) {
         error_setg(errp, "Unsupported image type '%s'", ct);
@@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
 {
     unsigned int l1_index, l2_offset, l2_index;
     int min_index, i, j;
-    uint32_t min_count, *l2_table;
+    uint32_t min_count;
+    void *l2_table;
     bool zeroed = false;
     int64_t ret;
     int64_t cluster_sector;
+    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
 
     if (m_data) {
         m_data->valid = 0;
@@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
     if (l1_index >= extent->l1_size) {
         return VMDK_ERROR;
     }
-    l2_offset = extent->l1_table[l1_index];
+    if (extent->sesparse) {
+        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
+        if (l2_offset_u64 == 0) {
+            l2_offset = 0;
+        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
+            /*
+             * Top most nibble is 0x1 if grain table is allocated.
+             * strict check - top most 4 bytes must be 0x10000000 since max
+             * supported size is 64TB for disk - so no more than 64TB / 16MB
+             * grain directories which is smaller than uint32,
+             * where 16MB is the only supported default grain table coverage.
+             */
+            return VMDK_ERROR;
+        } else {
+            l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
+            l2_offset_u64 = extent->sesparse_l2_tables_offset +
+                l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
+            if (l2_offset_u64 > 0x00000000ffffffff) {
+                return VMDK_ERROR;
+            }
+            l2_offset = (unsigned int)(l2_offset_u64);
+        }
+    } else {
+        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
+    }
     if (!l2_offset) {
         return VMDK_UNALLOC;
     }
@@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
                     extent->l2_cache_counts[j] >>= 1;
                 }
             }
-            l2_table = extent->l2_cache + (i * extent->l2_size);
+            l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
             goto found;
         }
     }
@@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
             min_index = i;
         }
     }
-    l2_table = extent->l2_cache + (min_index * extent->l2_size);
+    l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
     BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
     if (bdrv_pread(extent->file,
                 (int64_t)l2_offset * 512,
                 l2_table,
-                extent->l2_size * sizeof(uint32_t)
-            ) != extent->l2_size * sizeof(uint32_t)) {
+                l2_size_bytes
+            ) != l2_size_bytes) {
         return VMDK_ERROR;
     }
 
@@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
     extent->l2_cache_counts[min_index] = 1;
  found:
     l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
-    cluster_sector = le32_to_cpu(l2_table[l2_index]);
 
-    if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
-        zeroed = true;
+    if (extent->sesparse) {
+        cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
+        switch (cluster_sector & 0xf000000000000000) {
+        case 0x0000000000000000:
+            /* unallocated grain */
+            if (cluster_sector != 0) {
+                return VMDK_ERROR;
+            }
+            break;
+        case 0x1000000000000000:
+            /* scsi-unmapped grain - fallthrough */
+        case 0x2000000000000000:
+            /* zero grain */
+            zeroed = true;
+            break;
+        case 0x3000000000000000:
+            /* allocated grain */
+            cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
+                              ((cluster_sector & 0x0000ffffffffffff) << 12));
+            cluster_sector = extent->sesparse_clusters_offset +
+                cluster_sector * extent->cluster_sectors;
+            break;
+        default:
+            return VMDK_ERROR;
+        }
+    } else {
+        cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
+
+        if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
+            zeroed = true;
+        }
     }
 
     if (!cluster_sector || zeroed) {
         if (!allocate) {
             return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
         }
+        assert(!extent->sesparse);
 
         if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
             return VMDK_ERROR;
@@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
             m_data->l1_index = l1_index;
             m_data->l2_index = l2_index;
             m_data->l2_offset = l2_offset;
-            m_data->l2_cache_entry = &l2_table[l2_index];
+            m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
         }
     }
     *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
@@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
         if (!extent) {
             return -EIO;
         }
+        if (extent->sesparse) {
+            return -ENOTSUP;
+        }
         offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
         n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
                              - offset_in_cluster);
-- 
2.13.3

+                return ret;
+            }
+            extent = &s->extents[s->num_extents - 1];
         } else {
             error_setg(errp, "Unsupported extent type '%s'", type);
             bdrv_unref_child(bs, extent_file);
@@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
     if (strcmp(ct, "monolithicFlat") &&
         strcmp(ct, "vmfs") &&
         strcmp(ct, "vmfsSparse") &&
+        strcmp(ct, "seSparse") &&
         strcmp(ct, "twoGbMaxExtentSparse") &&
         strcmp(ct, "twoGbMaxExtentFlat")) {
         error_setg(errp, "Unsupported image type '%s'", ct);
@@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
 {
     unsigned int l1_index, l2_offset, l2_index;
     int min_index, i, j;
-    uint32_t min_count, *l2_table;
+    uint32_t min_count;
+    void *l2_table;
     bool zeroed = false;
     int64_t ret;
     int64_t cluster_sector;
+    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
 
     if (m_data) {
         m_data->valid = 0;
@@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
     if (l1_index >= extent->l1_size) {
         return VMDK_ERROR;
     }
-    l2_offset = extent->l1_table[l1_index];
+    if (extent->sesparse) {
+        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
+        if (l2_offset_u64 == 0) {
+            l2_offset = 0;
+        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
+            /*
+             * Topmost nibble is 0x1 if the grain table is allocated.
+             * Strict check - the topmost 4 bytes must be 0x10000000, since
+             * the max supported disk size is 64TB - so there are no more
+             * than 64TB / 16MB grain tables, which fits in a uint32, where
+             * 16MB is the only supported default grain table coverage.
+             */
+            return VMDK_ERROR;
+        } else {
+            l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
+            l2_offset_u64 = extent->sesparse_l2_tables_offset +
+                l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
+            if (l2_offset_u64 > 0x00000000ffffffff) {
+                return VMDK_ERROR;
+            }
+            l2_offset = (unsigned int)(l2_offset_u64);
+        }
+    } else {
+        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
+    }
     if (!l2_offset) {
         return VMDK_UNALLOC;
     }
@@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
                     extent->l2_cache_counts[j] >>= 1;
                 }
             }
-            l2_table = extent->l2_cache + (i * extent->l2_size);
+            l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
             goto found;
         }
     }
@@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
             min_index = i;
         }
     }
-    l2_table = extent->l2_cache + (min_index * extent->l2_size);
+    l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
     BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
     if (bdrv_pread(extent->file,
                 (int64_t)l2_offset * 512,
                 l2_table,
-                extent->l2_size * sizeof(uint32_t)
-            ) != extent->l2_size * sizeof(uint32_t)) {
+                l2_size_bytes
+            ) != l2_size_bytes) {
         return VMDK_ERROR;
     }
 
@@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
     extent->l2_cache_counts[min_index] = 1;
  found:
     l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
-    cluster_sector = le32_to_cpu(l2_table[l2_index]);
 
-    if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
-        zeroed = true;
+    if (extent->sesparse) {
+        cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
+        switch (cluster_sector & 0xf000000000000000) {
+        case 0x0000000000000000:
+            /* unallocated grain */
+            if (cluster_sector != 0) {
+                return VMDK_ERROR;
+            }
+            break;
+        case 0x1000000000000000:
+            /* scsi-unmapped grain - fallthrough */
+        case 0x2000000000000000:
+            /* zero grain */
+            zeroed = true;
+            break;
+        case 0x3000000000000000:
+            /* allocated grain */
+            cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
+                              ((cluster_sector & 0x0000ffffffffffff) << 12));
+            cluster_sector = extent->sesparse_clusters_offset +
+                cluster_sector * extent->cluster_sectors;
+            break;
+        default:
+            return VMDK_ERROR;
+        }
+    } else {
+        cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
+
+        if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
+            zeroed = true;
+        }
     }
 
     if (!cluster_sector || zeroed) {
         if (!allocate) {
             return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
         }
+        assert(!extent->sesparse);
 
         if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
             return VMDK_ERROR;
@@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
             m_data->l1_index = l1_index;
             m_data->l2_index = l2_index;
             m_data->l2_offset = l2_offset;
-            m_data->l2_cache_entry = &l2_table[l2_index];
+            m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
         }
     }
     *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
@@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
         if (!extent) {
             return -EIO;
         }
+        if (extent->sesparse) {
+            return -ENOTSUP;
+        }
         offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
         n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
                              - offset_in_cluster);
-- 
2.13.3



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [Qemu-block] [PATCH 1/2] vmdk: Fix comment regarding max l1_size coverage
@ 2019-04-24 10:19     ` yuchenlin via Qemu-devel
  0 siblings, 0 replies; 13+ messages in thread
From: yuchenlin @ 2019-04-24 10:19 UTC (permalink / raw)
  To: Sam Eiderman
  Cc: fam, kwolf, mreitz, qemu-block, eyal.moscovici, arbel.moshe,
	qemu-devel, liran.alon, Qemu-block

On 2019-04-24 15:49, Sam Eiderman wrote:
> Commit b0651b8c246d ("vmdk: Move l1_size check into vmdk_add_extent")
> extended the l1_size check from VMDK4 to VMDK3 but did not update the
> default coverage in the moved comment.
> 
> The previous vmdk4 calculation:
> 
>     (512 * 1024 * 1024) * 512(l2 entries) * 65536(grain) = 16PB
> 
> The added vmdk3 calculation:
> 
>     (512 * 1024 * 1024) * 4096(l2 entries) * 512(grain) = 1PB
> 
> Adding the calculation of vmdk3 to the comment.
> 
> In any case, VMware does not offer virtual disks more than 2TB for
> vmdk4/vmdk3 or 64TB for the new undocumented seSparse format which is
> not implemented yet in qemu.
> 
> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
> Reviewed-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
> ---
>  block/vmdk.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index de8cb859f8..fc7378da78 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -426,10 +426,15 @@ static int vmdk_add_extent(BlockDriverState *bs,
>          return -EFBIG;
>      }
>      if (l1_size > 512 * 1024 * 1024) {
> -        /* Although with big capacity and small l1_entry_sectors, we 
> can get a
> +        /*
> +         * Although with big capacity and small l1_entry_sectors, we 
> can get a
>           * big l1_size, we don't want unbounded value to allocate the 
> table.
> -         * Limit it to 512M, which is 16PB for default cluster and L2 
> table
> -         * size */
> +         * Limit it to 512M, which is:
> +         *     16PB - for default "Hosted Sparse Extent" (VMDK4)
> +         *            cluster size: 64KB, L2 table size: 512 entries
> +         *     1PB  - for default "ESXi Host Sparse Extent" 
> (VMDK3/vmfsSparse)
> +         *            cluster size: 512B, L2 table size: 4096 entries
> +         */
>          error_setg(errp, "L1 size too big");
>          return -EFBIG;
>      }

The calculation for VMDK3 can be verified at the end of page 9 of the spec
(https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf).

The VMDK4 calculation can be checked in the "Grain Table and Grain"
section on page 8 of the spec.
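
For reference, in powers of two: 2^29 L1 entries * 2^12 L2 entries *
2^9 bytes per grain = 2^50 bytes = 1PB, matching the new comment.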

Reviewed-by: yuchenlin <yuchenlin@synology.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2]: vmdk: Add read-only support for the new seSparse format
  2019-04-24  7:48 ` Sam Eiderman
@ 2019-05-12  8:14 ` Sam
  2019-05-27  8:26   ` Sam Eiderman
  0 siblings, 1 reply; 13+ messages in thread
From: Sam @ 2019-05-12  8:14 UTC (permalink / raw)
  To: fam, kwolf, Max Reitz, qemu-block
  Cc: eyal.moscovici, Arbel Moshe, yuchenlin, qemu-devel, liran.alon,
	Karl Heubaum

Gentle ping on "[PATCH 2/2] vmdk: Add read-only support for seSparse snapshots".
Yuchenlin reviewed "[PATCH 1/2] vmdk: Fix comment regarding max l1_size coverage".

Thanks, Sam

> On 24 Apr 2019, at 10:48, Sam Eiderman <shmuel.eiderman@oracle.com> wrote:
> 
> VMware introduced a new snapshot format in VMFS6 - seSparse (Space
> Efficient Sparse) which is the default format available in ESXi 6.7.
> Add read-only support for the new snapshot format.
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2]: vmdk: Add read-only support for the new seSparse format
  2019-05-12  8:14 ` [Qemu-devel] [PATCH 0/2]: vmdk: Add read-only support for the new seSparse format Sam
@ 2019-05-27  8:26   ` Sam Eiderman
  0 siblings, 0 replies; 13+ messages in thread
From: Sam Eiderman @ 2019-05-27  8:26 UTC (permalink / raw)
  To: fam, kwolf, Max Reitz, qemu-block
  Cc: eyal.moscovici, Arbel Moshe, yuchenlin, qemu-devel, liran.alon,
	Karl Heubaum

Gentle ping

> On 12 May 2019, at 11:14, Sam <shmuel.eiderman@oracle.com> wrote:
> 
> Gentle ping on "[PATCH 2/2] vmdk: Add read-only support for seSparse snapshots".
> Yuchenlin reviewed "[PATCH 1/2] vmdk: Fix comment regarding max l1_size coverage".
> 
> Thanks, Sam
> 
>> On 24 Apr 2019, at 10:48, Sam Eiderman <shmuel.eiderman@oracle.com> wrote:
>> 
>> VMware introduced a new snapshot format in VMFS6 - seSparse (Space
>> Efficient Sparse) which is the default format available in ESXi 6.7.
>> Add read-only support for the new snapshot format.
>> 
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
  2019-04-24  7:49   ` Sam Eiderman
@ 2019-05-27 17:39   ` Max Reitz
  2019-05-28  7:57     ` Sam Eiderman
  0 siblings, 1 reply; 13+ messages in thread
From: Max Reitz @ 2019-05-27 17:39 UTC (permalink / raw)
  To: Sam Eiderman, fam, kwolf, qemu-block
  Cc: eyal.moscovici, qemu-devel, liran.alon, karl.heubaum, arbel.moshe

On 24.04.19 09:49, Sam Eiderman wrote:
> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
> QEMU).
> 
> This format was lacking in the following:
> 
>     * Grain directory (L1) and grain table (L2) entries were 32-bit,
>       allowing access to only 2TB (slightly less) of data.
>     * The grain size (default) was 512 bytes - leading to data
>       fragmentation and many grain tables.
>     * For space reclamation purposes, it was necessary to find all the
>       grains which are not pointed to by any grain table - so a reverse
>       mapping of "offset of grain in vmdk" to "grain table" must be
>       constructed - which takes large amounts of CPU/RAM.
> 
> The format specification can be found in VMware's documentation:
> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
> 
> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
> introduced: SESparse (Space Efficient).
> 
> This format fixes the above issues:
> 
>     * All entries are now 64-bit.
>     * The grain size (default) is 4KB.
>     * Grain directory and grain tables are now located at the beginning
>       of the file.
>       + seSparse format reserves space for all grain tables.
>       + Grain tables can be addressed using an index.
>       + Grains are located in the end of the file and can also be
>         addressed with an index.
>       - seSparse vmdks of large disks (64TB) have huge preallocated
>         headers - mainly due to L2 tables, even for empty snapshots.
>     * The header contains a reverse mapping ("backmap") of "offset of
>       grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>       specifies for each grain - whether it is allocated or not.
>       Using these data structures we can implement space reclamation
>       efficiently.
>     * Due to the fact that the header now maintains two mappings:
>         * The regular one (grain directory & grain tables)
>         * A reverse one (backmap and free bitmap)
>       These data structures can lose consistency upon crash and result
>       in a corrupted VMDK.
>       Therefore, a journal is also added to the VMDK and is replayed
>       when the VMware reopens the file after a crash.
> 
> Since ESXi 6.7 - SESparse is the only snapshot format available.
> 
> Unfortunately, VMware does not provide documentation regarding the new
> seSparse format.
> 
> This commit is based on black-box research of the seSparse format.
> Various in-guest block operations and their effect on the snapshot file
> were tested.
> 
> The only VMware provided source of information (regarding the underlying
> implementation) was a log file on the ESXi:
> 
>     /var/log/hostd.log
> 
> Whenever an seSparse snapshot is created - the log is being populated
> with seSparse records.
> 
> Relevant log records are of the form:
> 
> [...] Const Header:
> [...]  constMagic     = 0xcafebabe
> [...]  version        = 2.1
> [...]  capacity       = 204800
> [...]  grainSize      = 8
> [...]  grainTableSize = 64
> [...]  flags          = 0
> [...] Extents:
> [...]  Header         : <1 : 1>
> [...]  JournalHdr     : <2 : 2>
> [...]  Journal        : <2048 : 2048>
> [...]  GrainDirectory : <4096 : 2048>
> [...]  GrainTables    : <6144 : 2048>
> [...]  FreeBitmap     : <8192 : 2048>
> [...]  BackMap        : <10240 : 2048>
> [...]  Grain          : <12288 : 204800>
> [...] Volatile Header:
> [...] volatileMagic     = 0xcafecafe
> [...] FreeGTNumber      = 0
> [...] nextTxnSeqNumber  = 0
> [...] replayJournal     = 0
> 
> The sizes that are seen in the log file are in sectors.
> Extents are of the following format: <offset : size>
> 
> This commit is a strict implementation which enforces:
>     * magics
>     * version number 2.1
>     * grain size of 8 sectors  (4KB)
>     * grain table size of 64 sectors
>     * zero flags
>     * extent locations
> 
> Additionally, this commit provides only a subset of the functionality
> offered by seSparse's format:
>     * Read-only
>     * No journal replay
>     * No space reclamation
>     * No unmap support
> 
> Hence, journal header, journal, free bitmap and backmap extents are
> unused, only the "classic" (L1 -> L2 -> data) grain access is
> implemented.
> 
> However there are several differences in the grain access itself.
> Grain directory (L1):
>     * Grain directory entries are indexes (not offsets) to grain
>       tables.
>     * Valid grain directory entries have their highest nibble set to
>       0x1.
>     * Since grain tables are always located in the beginning of the
>       file - the index can fit into 32 bits - so we can use its low
>       part if it's valid.
> Grain table (L2):
>     * Grain table entries are indexes (not offsets) to grains.
>     * If the highest nibble of the entry is:
>         0x0:
>             The grain is not allocated.
>             The rest of the bytes are 0.
>         0x1:
>             The grain is unmapped - guest sees a zero grain.
>             The rest of the bits point to the previously mapped grain,
>             see 0x3 case.
>         0x2:
>             The grain is zero.
>         0x3:
>             The grain is allocated - to get the index calculate:
>             ((entry & 0x0fff000000000000) >> 48) |
>             ((entry & 0x0000ffffffffffff) << 12)
>     * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
>       grain which results from the guest using sg_unmap to unmap the
>       grain - but the grain itself still exists in the grain extent - a
>       space reclamation procedure should delete it.
>       Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>       but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
> 
> In order to implement seSparse some fields had to be changed to support
> both 32-bit and 64-bit entry sizes.
> 
> Read-only is implemented by failing on pwrite since qemu-img opens the
> vmdk with read-write flags for "convert" (which is a read-only)
> operation.

Does it?  The code doesn’t look like it (in img_convert(), src_flags is
never set to anything with BDRV_O_RDWR).

In theory, we have bdrv_apply_auto_read_only() for this case.  More on
that below. [1]

> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
> ---
>  block/vmdk.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 459 insertions(+), 16 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index fc7378da78..e599c08b95 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c

First, a general observation: BDRV_SECTOR_SIZE is indeed equal to VMDK’s
sector size (both are 512), but I’d consider that a coincidence.  This
file defines a SECTOR_SIZE macro.  I think you should use that instead
of BDRV_SECTOR_SIZE whenever referring to VMDK’s sector size.

(BDRV_SECTOR_SIZE is actually the sector size the qemu block layer uses.
 Yes, I know, there are plenty of places in vmdk.c that use it.  I still
think it isn’t entirely correct to use it.)
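
Purely as an illustration (same value, 512, just vmdk.c's own macro),
e.g. the volatile header read would then be:

    ret = bdrv_pread(file,
                     const_header.volatile_header_offset * SECTOR_SIZE,
                     &volatile_header, sizeof(volatile_header));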

> @@ -91,6 +91,44 @@ typedef struct {
>      uint16_t compressAlgorithm;
>  } QEMU_PACKED VMDK4Header;
>  
> +typedef struct VMDKSESparseConstHeader {
> +    uint64_t magic;
> +    uint64_t version;
> +    uint64_t capacity;
> +    uint64_t grain_size;
> +    uint64_t grain_table_size;
> +    uint64_t flags;
> +    uint64_t reserved1;
> +    uint64_t reserved2;
> +    uint64_t reserved3;
> +    uint64_t reserved4;
> +    uint64_t volatile_header_offset;
> +    uint64_t volatile_header_size;
> +    uint64_t journal_header_offset;
> +    uint64_t journal_header_size;
> +    uint64_t journal_offset;
> +    uint64_t journal_size;
> +    uint64_t grain_dir_offset;
> +    uint64_t grain_dir_size;
> +    uint64_t grain_tables_offset;
> +    uint64_t grain_tables_size;
> +    uint64_t free_bitmap_offset;
> +    uint64_t free_bitmap_size;
> +    uint64_t backmap_offset;
> +    uint64_t backmap_size;
> +    uint64_t grains_offset;
> +    uint64_t grains_size;
> +    uint8_t pad[304];
> +} QEMU_PACKED VMDKSESparseConstHeader;
> +
> +typedef struct VMDKSESparseVolatileHeader {

Is this the official name?  (I know you don’t have a specification, but
maybe you do have some half-official information, I don’t know.)
Generally, I’d call the opposite of “const” “mutable”.

> +    uint64_t magic;
> +    uint64_t free_gt_number;
> +    uint64_t next_txn_seq_number;
> +    uint64_t replay_journal;
> +    uint8_t pad[480];
> +} QEMU_PACKED VMDKSESparseVolatileHeader;
> +
>  #define L2_CACHE_SIZE 16
>  
>  typedef struct VmdkExtent {
> @@ -99,19 +137,23 @@ typedef struct VmdkExtent {
>      bool compressed;
>      bool has_marker;
>      bool has_zero_grain;
> +    bool sesparse;
> +    uint64_t sesparse_l2_tables_offset;
> +    uint64_t sesparse_clusters_offset;
> +    int32_t entry_size;
>      int version;
>      int64_t sectors;
>      int64_t end_sector;
>      int64_t flat_start_offset;
>      int64_t l1_table_offset;
>      int64_t l1_backup_table_offset;
> -    uint32_t *l1_table;
> +    void *l1_table;
>      uint32_t *l1_backup_table;
>      unsigned int l1_size;
>      uint32_t l1_entry_sectors;
>  
>      unsigned int l2_size;
> -    uint32_t *l2_cache;
> +    void *l2_cache;
>      uint32_t l2_cache_offsets[L2_CACHE_SIZE];
>      uint32_t l2_cache_counts[L2_CACHE_SIZE];
>  
> @@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
>           *            cluster size: 64KB, L2 table size: 512 entries
>           *     1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
>           *            cluster size: 512B, L2 table size: 4096 entries
> +         *     8PB  - for default "ESXI seSparse Extent"
> +         *            cluster size: 4KB,  L2 table size: 4096 entries
>           */
>          error_setg(errp, "L1 size too big");
>          return -EFBIG;

Hm, now I wonder about this limit.  With this new format, it’d be 512M *
8 = 4 GB of RAM usage.  That seems an awful lot, and I don’t think we
really need to support 8 PB images.  Like, if we want to prevent
unbounded allocations, 4 GB is far above the limit of what I’d consider
sane.  (And I don’t know too much about vmdk, but can’t a user also
specify multiple extents with a single descriptor file, and thus make
qemu allocate these 4 GB multiple times?)

Some comment in this file says the maximum supported size is 64 TB
anyway.  Shouldn’t we just limit the L1 size accordingly, then?
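
(For the record: 64 TB / 16 MB of grain table coverage = 4 Mi L1
entries, i.e. a 32 MiB L1 table at 8 bytes per entry, instead of up
to 4 GB.)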

> @@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>      extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
> +    extent->entry_size = sizeof(uint32_t);
>  
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
> @@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>      int i;
>  
>      /* read the L1 table */
> -    l1_size = extent->l1_size * sizeof(uint32_t);
> +    l1_size = extent->l1_size * extent->entry_size;

Related to the above: This can overflow.  (512M * 8 == 4G)
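
A possible sketch, assuming l1_size can simply be widened here:

    uint64_t l1_size = (uint64_t)extent->l1_size * extent->entry_size;

(plus a sanity bound on the result before the allocation).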

>      extent->l1_table = g_try_malloc(l1_size);
>      if (l1_size && extent->l1_table == NULL) {
>          return -ENOMEM;
> @@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>          goto fail_l1;
>      }
>      for (i = 0; i < extent->l1_size; i++) {
> -        le32_to_cpus(&extent->l1_table[i]);
> +        if (extent->sesparse) {

Shouldn’t matter in practice, but I think evaluating extent->entry_size
would be more natural.
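
(I.e. something like "if (extent->entry_size == sizeof(uint64_t))"
instead of testing extent->sesparse.)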

> +            le64_to_cpus((uint64_t *)extent->l1_table + i);
> +        } else {
> +            le32_to_cpus((uint32_t *)extent->l1_table + i);
> +        }
>      }
>  
>      if (extent->l1_backup_table_offset) {
> +        assert(!extent->sesparse);
>          extent->l1_backup_table = g_try_malloc(l1_size);
>          if (l1_size && extent->l1_backup_table == NULL) {
>              ret = -ENOMEM;
> @@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>      }
>  
>      extent->l2_cache =
> -        g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
> +        g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
>      return 0;
>   fail_l1b:
>      g_free(extent->l1_backup_table);
> @@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
>      return ret;
>  }
>  
> +#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
> +#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe

The upper 32 bits are 0, so it doesn’t really matter, but still, if these
are 64-bit constants (which they are), they should have a 64-bit type
(e.g. UINT64_C(0x00000000cafebabe)).  (According to my very personal
taste at least.)
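
That is, just:

    #define SESPARSE_CONST_HEADER_MAGIC    UINT64_C(0x00000000cafebabe)
    #define SESPARSE_VOLATILE_HEADER_MAGIC UINT64_C(0x00000000cafecafe)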

> +
> +static const char zero_pad[BDRV_SECTOR_SIZE];
> +
> +/* Strict checks - format not officially documented */
> +static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
> +                                        Error **errp)
> +{
> +    uint64_t grain_table_coverage;
> +    uint64_t needed_grain_tables;
> +    uint64_t needed_grain_dir_size;
> +    uint64_t needed_grain_tables_size;
> +    uint64_t needed_free_bitmap_size;
> +
> +    header->magic = le64_to_cpu(header->magic);
> +    header->version = le64_to_cpu(header->version);
> +    header->grain_size = le64_to_cpu(header->grain_size);
> +    header->grain_table_size = le64_to_cpu(header->grain_table_size);
> +    header->flags = le64_to_cpu(header->flags);
> +    header->reserved1 = le64_to_cpu(header->reserved1);
> +    header->reserved2 = le64_to_cpu(header->reserved2);
> +    header->reserved3 = le64_to_cpu(header->reserved3);
> +    header->reserved4 = le64_to_cpu(header->reserved4);
> +
> +    header->volatile_header_offset =
> +        le64_to_cpu(header->volatile_header_offset);
> +    header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
> +
> +    header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
> +    header->journal_header_size = le64_to_cpu(header->journal_header_size);
> +
> +    header->journal_offset = le64_to_cpu(header->journal_offset);
> +    header->journal_size = le64_to_cpu(header->journal_size);
> +
> +    header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
> +    header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
> +
> +    header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
> +    header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
> +
> +    header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
> +    header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
> +
> +    header->backmap_offset = le64_to_cpu(header->backmap_offset);
> +    header->backmap_size = le64_to_cpu(header->backmap_size);
> +
> +    header->grains_offset = le64_to_cpu(header->grains_offset);
> +    header->grains_size = le64_to_cpu(header->grains_size);
> +
> +    if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
> +        error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
> +                   header->magic);
> +        return -EINVAL;
> +    }
> +
> +    if (header->version != 0x0000000200000001) {
> +        error_setg(errp, "Unsupported version: 0x%016" PRIx64,
> +                   header->version);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_size != 8) {
> +        error_setg(errp, "Unsupported grain size: %" PRIu64,
> +                   header->grain_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_table_size != 64) {
> +        error_setg(errp, "Unsupported grain table size: %" PRIu64,
> +                   header->grain_table_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->flags != 0) {
> +        error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
> +                   header->flags);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->reserved1 != 0 || header->reserved2 != 0 ||
> +        header->reserved3 != 0 || header->reserved4 != 0) {
> +        error_setg(errp, "Unsupported reserved bits:"
> +                   " 0x%016" PRIx64 " 0x%016" PRIx64
> +                   " 0x%016" PRIx64 " 0x%016" PRIx64,
> +                   header->reserved1, header->reserved2,
> +                   header->reserved3, header->reserved4);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->volatile_header_offset != 1) {
> +        error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
> +                   header->volatile_header_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->volatile_header_size != 1) {
> +        error_setg(errp, "Unsupported volatile header size: %" PRIu64,
> +                   header->volatile_header_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_header_offset != 2) {
> +        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
> +                   header->journal_header_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_header_size != 2) {
> +        error_setg(errp, "Unsupported journal header size: %" PRIu64,
> +                   header->journal_header_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_offset != 2048) {
> +        error_setg(errp, "Unsupported journal offset: %" PRIu64,
> +                   header->journal_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_size != 2048) {
> +        error_setg(errp, "Unsupported journal size: %" PRIu64,
> +                   header->journal_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_dir_offset != 4096) {
> +        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
> +                   header->grain_dir_offset);
> +        return -ENOTSUP;
> +    }

Do these values (metadata structure offsets and sizes) really matter?

> +    /* in sectors */
> +    grain_table_coverage = ((header->grain_table_size *

Hm, if header->grain_table_size is measured in sectors, it’d probably be
better to rename it to grain_table_sectors or something.

(“size” is usually in bytes; sometimes it’s the number of entries, but
that’s already kind of weird, I think)

> +                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
> +                            header->grain_size);
> +    needed_grain_tables = header->capacity / grain_table_coverage;
> +    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
> +                                         BDRV_SECTOR_SIZE);
> +    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
> +
> +    if (header->grain_dir_size != needed_grain_dir_size) {

Wouldn’t it suffice to check that header->grain_dir_size >=
needed_grain_dir_size?  (The ROUND_UP() would become unnecessary, then.)

(And also, maybe s/grain_dir_size/grain_dir_sectors/)

> +        error_setg(errp, "Invalid grain directory size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grain_dir_size, needed_grain_dir_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->grain_tables_offset !=
> +        header->grain_dir_offset + header->grain_dir_size) {
> +        error_setg(errp, "Grain directory must be followed by grain tables");
> +        return -EINVAL;
> +    }
> +
> +    needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
> +    needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
> +
> +    if (header->grain_tables_size != needed_grain_tables_size) {
> +        error_setg(errp, "Invalid grain tables size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grain_tables_size, needed_grain_tables_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->free_bitmap_offset !=
> +        header->grain_tables_offset + header->grain_tables_size) {
> +        error_setg(errp, "Grain tables must be followed by free bitmap");
> +        return -EINVAL;
> +    }
> +
> +    /* in bits */
> +    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
> +                                           header->grain_size);
> +    /* in bytes */
> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> +                                           BITS_PER_BYTE);
> +    /* in sectors */
> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> +                                           BDRV_SECTOR_SIZE);
> +    needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);

Er, now this is really confusing.  I’d start with the size in bytes:

> needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
>                                        header->grain_size * BITS_PER_BYTE);

and then translate it to sectors, but use a new variable for it
(needed_free_bitmap_sectors).
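
Something like (sketch):

> needed_free_bitmap_sectors = DIV_ROUND_UP(needed_free_bitmap_size,
>                                           SECTOR_SIZE);
> needed_free_bitmap_sectors = ROUND_UP(needed_free_bitmap_sectors, 2048);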

> +
> +    if (header->free_bitmap_size != needed_free_bitmap_size) {

Again, I suppose just checking that header->free_bitmap_size >=
needed_free_bitmap_size should be enough, I think.

> +        error_setg(errp, "Invalid free bitmap size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->free_bitmap_size, needed_free_bitmap_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->backmap_offset !=
> +        header->free_bitmap_offset + header->free_bitmap_size) {
> +        error_setg(errp, "Free bitmap must be followed by backmap");
> +        return -EINVAL;
> +    }
> +
> +    if (header->backmap_size != header->grain_tables_size) {
> +        error_setg(errp, "Invalid backmap size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->backmap_size, header->grain_tables_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->grains_offset !=
> +        header->backmap_offset + header->backmap_size) {
> +        error_setg(errp, "Backmap must be followed by grains");
> +        return -EINVAL;
> +    }
> +
> +    if (header->grains_size != header->capacity) {
> +        error_setg(errp, "Invalid grains size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grains_size, header->capacity);
> +        return -EINVAL;
> +    }
> +
> +    /* check that padding is 0 */
> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {

buffer_is_zero() should be sufficient.
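
I.e.:

    if (!buffer_is_zero(header->pad, sizeof(header->pad))) {

and (if the volatile header check does the same) the zero_pad array
can go away.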

> +        error_setg(errp, "Unsupported non-zero const header padding");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
> +                                           Error **errp)
> +{
> +    header->magic = le64_to_cpu(header->magic);
> +    header->free_gt_number = le64_to_cpu(header->free_gt_number);
> +    header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
> +    header->replay_journal = le64_to_cpu(header->replay_journal);
> +
> +    if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
> +        error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
> +                   header->magic);
> +        return -EINVAL;
> +    }
> +
> +    if (header->replay_journal) {
> +        error_setg(errp, "Replaying journal not supported");

Hmmm, maybe "Image is dirty, replaying journal not supported"?

> +        return -ENOTSUP;
> +    }
> +
> +    /* check that padding is 0 */
> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
> +        error_setg(errp, "Unsupported non-zero volatile header padding");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmdk_open_se_sparse(BlockDriverState *bs,
> +                               BdrvChild *file,
> +                               int flags, Error **errp)
> +{
> +    int ret;
> +    VMDKSESparseConstHeader const_header;
> +    VMDKSESparseVolatileHeader volatile_header;
> +    VmdkExtent *extent;
> +
> +    assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
> +

[1] I think if you put a

ret = bdrv_apply_auto_read_only(bs,
    "No write support for seSparse images available", errp);
if (ret < 0) {
    return ret;
}

here, that should do the trick.

> +    ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
> +    if (ret < 0) {
> +        bdrv_refresh_filename(file->bs);
> +        error_setg_errno(errp, -ret,
> +                         "Could not read const header from file '%s'",
> +                         file->bs->filename);
> +        return ret;
> +    }
> +
> +    /* check const header */
> +    ret = check_se_sparse_const_header(&const_header, errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
> +
> +    ret = bdrv_pread(file,
> +                     const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
> +                     &volatile_header, sizeof(volatile_header));
> +    if (ret < 0) {
> +        bdrv_refresh_filename(file->bs);
> +        error_setg_errno(errp, -ret,
> +                         "Could not read volatile header from file '%s'",
> +                         file->bs->filename);
> +        return ret;
> +    }
> +
> +    /* check volatile header */
> +    ret = check_se_sparse_volatile_header(&volatile_header, errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vmdk_add_extent(bs, file, false,
> +                          const_header.capacity,
> +                          const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
> +                          0,
> +                          const_header.grain_dir_size *
> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
> +                          const_header.grain_table_size *
> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
> +                          const_header.grain_size,
> +                          &extent,
> +                          errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    extent->sesparse = true;
> +    extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
> +    extent->sesparse_clusters_offset = const_header.grains_offset;
> +    extent->entry_size = sizeof(uint64_t);
> +
> +    ret = vmdk_init_tables(bs, extent, errp);
> +    if (ret) {
> +        /* free extent allocated by vmdk_add_extent */
> +        vmdk_free_last_extent(bs);
> +    }
> +
> +    return ret;
> +}
> +
>  static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>                                 QDict *options, Error **errp);
>  
> @@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>           * RW [size in sectors] SPARSE "file-name.vmdk"
>           * RW [size in sectors] VMFS "file-name.vmdk"
>           * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
> +         * RW [size in sectors] SESPARSE "file-name.vmdk"
>           */
>          flat_offset = -1;
>          matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
> @@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>  
>          if (sectors <= 0 ||
>              (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
> -             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
> +             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
> +             strcmp(type, "SESPARSE")) ||
>              (strcmp(access, "RW"))) {
>              continue;
>          }
> @@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>                  return ret;
>              }
>              extent = &s->extents[s->num_extents - 1];
> +        } else if (!strcmp(type, "SESPARSE")) {
> +            ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
> +            if (ret) {
> +                bdrv_unref_child(bs, extent_file);
> +                return ret;
> +            }
> +            extent = &s->extents[s->num_extents - 1];
>          } else {
>              error_setg(errp, "Unsupported extent type '%s'", type);
>              bdrv_unref_child(bs, extent_file);
> @@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>      if (strcmp(ct, "monolithicFlat") &&
>          strcmp(ct, "vmfs") &&
>          strcmp(ct, "vmfsSparse") &&
> +        strcmp(ct, "seSparse") &&
>          strcmp(ct, "twoGbMaxExtentSparse") &&
>          strcmp(ct, "twoGbMaxExtentFlat")) {
>          error_setg(errp, "Unsupported image type '%s'", ct);
> @@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
>  {
>      unsigned int l1_index, l2_offset, l2_index;
>      int min_index, i, j;
> -    uint32_t min_count, *l2_table;
> +    uint32_t min_count;
> +    void *l2_table;
>      bool zeroed = false;
>      int64_t ret;
>      int64_t cluster_sector;
> +    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
>  
>      if (m_data) {
>          m_data->valid = 0;
> @@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
>      if (l1_index >= extent->l1_size) {
>          return VMDK_ERROR;
>      }
> -    l2_offset = extent->l1_table[l1_index];
> +    if (extent->sesparse) {

Maybe add assertions that extent->entry_size == 8 here and... [2]
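
(E.g. assert(extent->entry_size == sizeof(uint64_t)); in this branch,
and sizeof(uint32_t) in the other.)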

> +        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
> +        if (l2_offset_u64 == 0) {
> +            l2_offset = 0;
> +        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
> +            /*
> +             * Topmost nibble is 0x1 if the grain table is allocated.
> +             * Strict check - the topmost 4 bytes must be 0x10000000, since
> +             * the max supported disk size is 64TB - so there are no more
> +             * than 64TB / 16MB grain tables, which fits in a uint32, where
> +             * 16MB is the only supported default grain table coverage.
> +             */
> +            return VMDK_ERROR;
> +        } else {
> +            l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
> +            l2_offset_u64 = extent->sesparse_l2_tables_offset +
> +                l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
> +            if (l2_offset_u64 > 0x00000000ffffffff) {
> +                return VMDK_ERROR;
> +            }
> +            l2_offset = (unsigned int)(l2_offset_u64);
> +        }
> +    } else {

[2] ...that extent->entry_size == 4 here?

> +        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
> +    }
>      if (!l2_offset) {
>          return VMDK_UNALLOC;
>      }
> @@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>                      extent->l2_cache_counts[j] >>= 1;
>                  }
>              }
> -            l2_table = extent->l2_cache + (i * extent->l2_size);
> +            l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
>              goto found;
>          }
>      }
> @@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
>              min_index = i;
>          }
>      }
> -    l2_table = extent->l2_cache + (min_index * extent->l2_size);
> +    l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
>      BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
>      if (bdrv_pread(extent->file,
>                  (int64_t)l2_offset * 512,
>                  l2_table,
> -                extent->l2_size * sizeof(uint32_t)
> -            ) != extent->l2_size * sizeof(uint32_t)) {
> +                l2_size_bytes
> +            ) != l2_size_bytes) {
>          return VMDK_ERROR;
>      }
>  
> @@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
>      extent->l2_cache_counts[min_index] = 1;
>   found:
>      l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
> -    cluster_sector = le32_to_cpu(l2_table[l2_index]);
>  
> -    if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> -        zeroed = true;
> +    if (extent->sesparse) {
> +        cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
> +        switch (cluster_sector & 0xf000000000000000) {
> +        case 0x0000000000000000:
> +            /* unallocated grain */
> +            if (cluster_sector != 0) {
> +                return VMDK_ERROR;
> +            }
> +            break;
> +        case 0x1000000000000000:
> +            /* scsi-unmapped grain - fallthrough */
> +        case 0x2000000000000000:
> +            /* zero grain */
> +            zeroed = true;
> +            break;
> +        case 0x3000000000000000:
> +            /* allocated grain */
> +            cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
> +                              ((cluster_sector & 0x0000ffffffffffff) << 12));

That’s just insane.  I trust you, though.
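
(For reference, if I read the shuffle right: bits 48..59 of the entry
hold the low 12 bits of the grain index and bits 0..47 hold the rest,
so e.g. an allocated entry 0x3005000000000002 decodes to
0x5 | (0x2 << 12) = 0x2005.)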

> +            cluster_sector = extent->sesparse_clusters_offset +
> +                cluster_sector * extent->cluster_sectors;
> +            break;
> +        default:
> +            return VMDK_ERROR;
> +        }
> +    } else {
> +        cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
> +
> +        if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> +            zeroed = true;
> +        }
>      }
>  
>      if (!cluster_sector || zeroed) {
>          if (!allocate) {
>              return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>          }
> +        assert(!extent->sesparse);
>  
>          if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
>              return VMDK_ERROR;
> @@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>              m_data->l1_index = l1_index;
>              m_data->l2_index = l2_index;
>              m_data->l2_offset = l2_offset;
> -            m_data->l2_cache_entry = &l2_table[l2_index];
> +            m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
>          }
>      }
>      *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
> @@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
>          if (!extent) {
>              return -EIO;
>          }
> +        if (extent->sesparse) {
> +            return -ENOTSUP;
> +        }

([1] This should still stay here, even with bdrv_apply_auto_read_only()
above.)

>          offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
>          n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
>                               - offset_in_cluster);
> 

Overall, looks reasonable to me.  Then again, I have no clue about the
format and I trust you that it works.

Max


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
  2019-05-27 17:39   ` Max Reitz
@ 2019-05-28  7:57     ` Sam Eiderman
  2019-05-28 12:17       ` Max Reitz
  0 siblings, 1 reply; 13+ messages in thread
From: Sam Eiderman @ 2019-05-28  7:57 UTC (permalink / raw)
  To: Max Reitz
  Cc: fam, kwolf, eyal.moscovici, qemu-block, arbel.moshe, qemu-devel,
	liran.alon, Karl Heubaum

Comments inline

> On 27 May 2019, at 20:39, Max Reitz <mreitz@redhat.com> wrote:
> 
> On 24.04.19 09:49, Sam Eiderman wrote:
>> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
>> QEMU).
>> 
>> This format was lacking in the following:
>> 
>>    * Grain directory (L1) and grain table (L2) entries were 32-bit,
>>      allowing access to only 2TB (slightly less) of data.
>>    * The grain size (default) was 512 bytes - leading to data
>>      fragmentation and many grain tables.
>>    * For space reclamation purposes, it was necessary to find all the
>>      grains which are not pointed to by any grain table - so a reverse
>>      mapping of "offset of grain in vmdk" to "grain table" must be
>>      constructed - which takes large amounts of CPU/RAM.
>> 
>> The format specification can be found in VMware's documentation:
>> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
>> 
>> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
>> introduced: SESparse (Space Efficient).
>> 
>> This format fixes the above issues:
>> 
>>    * All entries are now 64-bit.
>>    * The grain size (default) is 4KB.
>>    * Grain directory and grain tables are now located at the beginning
>>      of the file.
>>      + seSparse format reserves space for all grain tables.
>>      + Grain tables can be addressed using an index.
>>      + Grains are located in the end of the file and can also be
>>        addressed with an index.
>>      - seSparse vmdks of large disks (64TB) have huge preallocated
>>        headers - mainly due to L2 tables, even for empty snapshots.
>>    * The header contains a reverse mapping ("backmap") of "offset of
>>      grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>>      specifies for each grain - whether it is allocated or not.
>>      Using these data structures we can implement space reclamation
>>      efficiently.
>>    * Due to the fact that the header now maintains two mappings:
>>        * The regular one (grain directory & grain tables)
>>        * A reverse one (backmap and free bitmap)
>>      These data structures can lose consistency upon crash and result
>>      in a corrupted VMDK.
>>      Therefore, a journal is also added to the VMDK and is replayed
>>      when the VMware reopens the file after a crash.
>> 
>> Since ESXi 6.7 - SESparse is the only snapshot format available.
>> 
>> Unfortunately, VMware does not provide documentation regarding the new
>> seSparse format.
>> 
>> This commit is based on black-box research of the seSparse format.
>> Various in-guest block operations and their effect on the snapshot file
>> were tested.
>> 
>> The only VMware provided source of information (regarding the underlying
>> implementation) was a log file on the ESXi:
>> 
>>    /var/log/hostd.log
>> 
>> Whenever an seSparse snapshot is created - the log is being populated
>> with seSparse records.
>> 
>> Relevant log records are of the form:
>> 
>> [...] Const Header:
>> [...]  constMagic     = 0xcafebabe
>> [...]  version        = 2.1
>> [...]  capacity       = 204800
>> [...]  grainSize      = 8
>> [...]  grainTableSize = 64
>> [...]  flags          = 0
>> [...] Extents:
>> [...]  Header         : <1 : 1>
>> [...]  JournalHdr     : <2 : 2>
>> [...]  Journal        : <2048 : 2048>
>> [...]  GrainDirectory : <4096 : 2048>
>> [...]  GrainTables    : <6144 : 2048>
>> [...]  FreeBitmap     : <8192 : 2048>
>> [...]  BackMap        : <10240 : 2048>
>> [...]  Grain          : <12288 : 204800>
>> [...] Volatile Header:
>> [...] volatileMagic     = 0xcafecafe
>> [...] FreeGTNumber      = 0
>> [...] nextTxnSeqNumber  = 0
>> [...] replayJournal     = 0
>> 
>> The sizes that are seen in the log file are in sectors.
>> Extents are of the following format: <offset : size>
>> 
>> This commit is a strict implementation which enforces:
>>    * magics
>>    * version number 2.1
>>    * grain size of 8 sectors  (4KB)
>>    * grain table size of 64 sectors
>>    * zero flags
>>    * extent locations
>> 
>> Additionally, this commit provides only a subset of the functionality
>> offered by seSparse's format:
>>    * Read-only
>>    * No journal replay
>>    * No space reclamation
>>    * No unmap support
>> 
>> Hence, journal header, journal, free bitmap and backmap extents are
>> unused; only the "classic" (L1 -> L2 -> data) grain access is
>> implemented.
>> 
>> However there are several differences in the grain access itself.
>> Grain directory (L1):
>>    * Grain directory entries are indexes (not offsets) to grain
>>      tables.
>>    * Valid grain directory entries have their highest nibble set to
>>      0x1.
>>    * Since grain tables are always located in the beginning of the
>>      file - the index can fit into 32 bits - so we can use its low
>>      part if it's valid.
>> Grain table (L2):
>>    * Grain table entries are indexes (not offsets) to grains.
>>    * If the highest nibble of the entry is:
>>        0x0:
>>            The grain is not allocated.
>>            The rest of the bytes are 0.
>>        0x1:
>>            The grain is unmapped - guest sees a zero grain.
>>            The rest of the bits point to the previously mapped grain,
>>            see 0x3 case.
>>        0x2:
>>            The grain is zero.
>>        0x3:
>>            The grain is allocated - to get the index calculate:
>>            ((entry & 0x0fff000000000000) >> 48) |
>>            ((entry & 0x0000ffffffffffff) << 12)
>>    * The difference between 0x1 and 0x2 is that 0x1 marks a grain
>>      which the guest has unmapped (e.g. using sg_unmap) - the grain
>>      itself still exists in the grain extent, and a space reclamation
>>      procedure should delete it.
>>      Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>>      but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
>> 
>> In order to implement seSparse some fields had to be changed to support
>> both 32-bit and 64-bit entry sizes.
>> 
>> Read-only is implemented by failing on pwrite since qemu-img opens the
>> vmdk with read-write flags for "convert" (which is a read-only
>> operation).
> 
> Does it?  The code doesn’t look like it (in img_convert(), src_flags is
> never set to anything with BDRV_O_RDWR).
> 
> In theory, we have bdrv_apply_auto_read_only() for this case.  More on
> that below. [1]

Yes, I added a check on the flags when I ran the convert operation and they weren’t
read-only. I’ll double-check that and try switching to bdrv_apply_auto_read_only().

I’m not sure how it will behave now when the user tries to write to the image - I guess
that opening with write flags will still succeed and the failure will only happen on the
actual write operations.

A better fix would be to change qemu-img convert to not open the image as RDWR, but I
didn’t want to make a global change in this commit.

> 
>> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
>> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
>> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
>> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
>> ---
>> block/vmdk.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 459 insertions(+), 16 deletions(-)
>> 
>> diff --git a/block/vmdk.c b/block/vmdk.c
>> index fc7378da78..e599c08b95 100644
>> --- a/block/vmdk.c
>> +++ b/block/vmdk.c
> 
> First, a general observation: BDRV_SECTOR_SIZE is indeed equal to VMDK’s
> sector size (both are 512), but I’d consider that a coincidence.  This
> file defines a SECTOR_SIZE macro.  I think you should use that instead
> of BDRV_SECTOR_SIZE whenever referring to VMDK’s sector size.
> 
> (BDRV_SECTOR_SIZE is actually the sector size the qemu block layer uses.
> Yes, I know, there are plenty of places in vmdk.c that use it.  I still
> think it isn’t entirely correct to use it.)

Yes, I got confused by all the mixing; I’ll change to SECTOR_SIZE.

> 
>> @@ -91,6 +91,44 @@ typedef struct {
>>     uint16_t compressAlgorithm;
>> } QEMU_PACKED VMDK4Header;
>> 
>> +typedef struct VMDKSESparseConstHeader {
>> +    uint64_t magic;
>> +    uint64_t version;
>> +    uint64_t capacity;
>> +    uint64_t grain_size;
>> +    uint64_t grain_table_size;
>> +    uint64_t flags;
>> +    uint64_t reserved1;
>> +    uint64_t reserved2;
>> +    uint64_t reserved3;
>> +    uint64_t reserved4;
>> +    uint64_t volatile_header_offset;
>> +    uint64_t volatile_header_size;
>> +    uint64_t journal_header_offset;
>> +    uint64_t journal_header_size;
>> +    uint64_t journal_offset;
>> +    uint64_t journal_size;
>> +    uint64_t grain_dir_offset;
>> +    uint64_t grain_dir_size;
>> +    uint64_t grain_tables_offset;
>> +    uint64_t grain_tables_size;
>> +    uint64_t free_bitmap_offset;
>> +    uint64_t free_bitmap_size;
>> +    uint64_t backmap_offset;
>> +    uint64_t backmap_size;
>> +    uint64_t grains_offset;
>> +    uint64_t grains_size;
>> +    uint8_t pad[304];
>> +} QEMU_PACKED VMDKSESparseConstHeader;
>> +
>> +typedef struct VMDKSESparseVolatileHeader {
> 
> Is this the official name?  (I know you don’t have a specification, but
> maybe you do have some half-official information, I don’t know.)
> Generally, I’d call the opposite of “const” “mutable”.

It’s the official name - it can be seen in /var/log/hostd.log on an ESXi host
whenever an seSparse snapshot is created.

> 
>> +    uint64_t magic;
>> +    uint64_t free_gt_number;
>> +    uint64_t next_txn_seq_number;
>> +    uint64_t replay_journal;
>> +    uint8_t pad[480];
>> +} QEMU_PACKED VMDKSESparseVolatileHeader;
>> +
>> #define L2_CACHE_SIZE 16
>> 
>> typedef struct VmdkExtent {
>> @@ -99,19 +137,23 @@ typedef struct VmdkExtent {
>>     bool compressed;
>>     bool has_marker;
>>     bool has_zero_grain;
>> +    bool sesparse;
>> +    uint64_t sesparse_l2_tables_offset;
>> +    uint64_t sesparse_clusters_offset;
>> +    int32_t entry_size;
>>     int version;
>>     int64_t sectors;
>>     int64_t end_sector;
>>     int64_t flat_start_offset;
>>     int64_t l1_table_offset;
>>     int64_t l1_backup_table_offset;
>> -    uint32_t *l1_table;
>> +    void *l1_table;
>>     uint32_t *l1_backup_table;
>>     unsigned int l1_size;
>>     uint32_t l1_entry_sectors;
>> 
>>     unsigned int l2_size;
>> -    uint32_t *l2_cache;
>> +    void *l2_cache;
>>     uint32_t l2_cache_offsets[L2_CACHE_SIZE];
>>     uint32_t l2_cache_counts[L2_CACHE_SIZE];
>> 
>> @@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
>>          *            cluster size: 64KB, L2 table size: 512 entries
>>          *     1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
>>          *            cluster size: 512B, L2 table size: 4096 entries
>> +         *     8PB  - for default "ESXI seSparse Extent"
>> +         *            cluster size: 4KB,  L2 table size: 4096 entries
>>          */
>>         error_setg(errp, "L1 size too big");
>>         return -EFBIG;
> 
> Hm, now I wonder about this limit.  With this new format, it’d be 512M *
> 8 = 4 GB of RAM usage.  That seems an awful lot, and I don’t think we
> really need to support 8 PB images.  Like, if we want to prevent
> unbounded allocations, 4 GB is far above the limit of what I’d consider
> sane.  (And I don’t know too much about vmdk, but can’t a user also
> specify multiple extents with a single descriptor file, and thus make
> qemu allocate these 4 GB multiple times?)
> 
> Some comment in this file says the maximum supported size is 64 TB
> anyway.  Shouldn’t we just limit the L1 size accordingly, then?

This is true - vmdk.c assumes that the L1 is relatively small (L2 tables can
reach hundreds of GBs in size) and loads it all into memory.
I agree that 512M * 8 (or even * 4) is too large; I’ll see if we can reduce it
dramatically.
I just need to take into consideration that theoretically a user can override
the number of sectors in a single grain and reduce it to 1 - which means that
the L1’s size can increase.
So for each case I’ll just assume that the grain size is 512 bytes (1 sector)
and check for the minimal L1 size we should allow.
seSparse descriptors (and vmfsSparse) can only point to a single extent,
so we are good here. However, we might have a chain of seSparse files
for multiple consecutive snapshots, so it’s really important to reduce the
size of the L1 tables since they are all loaded into memory.
In any case, this is only a sanity check and we probably never reached this
limit in practice, but it is true that we need to make it stricter.
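
Back-of-envelope for that worst case (just a sketch - assuming the 64TB cap
mentioned above, seSparse’s 4096-entry grain tables and a one-sector grain):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t capacity_sectors = (64ULL << 40) >> 9; /* 64TB in 512B sectors */
        uint64_t l2_entries = 4096;     /* grain table: 64 sectors * 512 / 8 */
        uint64_t grain_sectors = 1;     /* worst case a user can force */

        /* each L1 entry covers l2_entries * grain_sectors sectors */
        uint64_t l1_entries = capacity_sectors / (l2_entries * grain_sectors);
        printf("%" PRIu64 " L1 entries -> %" PRIu64 " MB at 8 bytes each\n",
               l1_entries, (l1_entries * 8) >> 20);
        /* prints: 33554432 L1 entries -> 256 MB */
        return 0;
    }

So even in the worst case the limit can come down from 512M entries to 32M.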

> 
>> @@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
>>     extent->l2_size = l2_size;
>>     extent->cluster_sectors = flat ? sectors : cluster_sectors;
>>     extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
>> +    extent->entry_size = sizeof(uint32_t);
>> 
>>     if (s->num_extents > 1) {
>>         extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
>> @@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>>     int i;
>> 
>>     /* read the L1 table */
>> -    l1_size = extent->l1_size * sizeof(uint32_t);
>> +    l1_size = extent->l1_size * extent->entry_size;
> 
> Related to the above: This can overflow.  (512M * 8 == 4G)

Good catch, reducing the L1 table size will fix this
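
For the record, the wrap-around is easy to demonstrate (standalone illustration,
untested):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int l1_size = 512u * 1024 * 1024; /* current 512M-entry limit */
        int32_t entry_size = 8;                    /* seSparse: 64-bit entries */

        /* the multiplication happens in 32 bits and wraps around to 0 */
        unsigned int wrapped = l1_size * entry_size;
        uint64_t correct = (uint64_t)l1_size * entry_size;

        printf("%u vs %llu\n", wrapped, (unsigned long long)correct);
        /* prints: 0 vs 4294967296 */
        return 0;
    }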

> 
>>     extent->l1_table = g_try_malloc(l1_size);
>>     if (l1_size && extent->l1_table == NULL) {
>>         return -ENOMEM;
>> @@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>>         goto fail_l1;
>>     }
>>     for (i = 0; i < extent->l1_size; i++) {
>> -        le32_to_cpus(&extent->l1_table[i]);
>> +        if (extent->sesparse) {
> 
> Shouldn’t matter in practice, but I think evaluating extent->entry_size
> would be more natural.

I just wanted to make it clearer that we are dealing with the seSparse format.
But maybe in this case I should evaluate extent->entry_size and add an assert on
extent->sesparse inside the uint64_t case?
WDYT?

> 
>> +            le64_to_cpus((uint64_t *)extent->l1_table + i);
>> +        } else {
>> +            le32_to_cpus((uint32_t *)extent->l1_table + i);
>> +        }
>>     }
>> 
>>     if (extent->l1_backup_table_offset) {
>> +        assert(!extent->sesparse);
>>         extent->l1_backup_table = g_try_malloc(l1_size);
>>         if (l1_size && extent->l1_backup_table == NULL) {
>>             ret = -ENOMEM;
>> @@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>>     }
>> 
>>     extent->l2_cache =
>> -        g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
>> +        g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
>>     return 0;
>>  fail_l1b:
>>     g_free(extent->l1_backup_table);
>> @@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
>>     return ret;
>> }
>> 
>> +#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
>> +#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe
> 
> The upper 32 bit are 0, so it doesn’t really matter, but still, if these
> are 64-bit constants (which they are), they should have a 64-bit type
> (e.g. UINT64_C(0x00000000cafebabe)).  (According to my very personal
> taste at least.)

Sure, I’ll do that
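
I.e.:

    #define SESPARSE_CONST_HEADER_MAGIC    UINT64_C(0x00000000cafebabe)
    #define SESPARSE_VOLATILE_HEADER_MAGIC UINT64_C(0x00000000cafecafe)

(UINT64_C comes from stdint.h, so no new include is needed.)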

> 
>> +
>> +static const char zero_pad[BDRV_SECTOR_SIZE];
>> +
>> +/* Strict checks - format not officially documented */
>> +static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
>> +                                        Error **errp)
>> +{
>> +    uint64_t grain_table_coverage;
>> +    uint64_t needed_grain_tables;
>> +    uint64_t needed_grain_dir_size;
>> +    uint64_t needed_grain_tables_size;
>> +    uint64_t needed_free_bitmap_size;
>> +
>> +    header->magic = le64_to_cpu(header->magic);
>> +    header->version = le64_to_cpu(header->version);
>> +    header->grain_size = le64_to_cpu(header->grain_size);
>> +    header->grain_table_size = le64_to_cpu(header->grain_table_size);
>> +    header->flags = le64_to_cpu(header->flags);
>> +    header->reserved1 = le64_to_cpu(header->reserved1);
>> +    header->reserved2 = le64_to_cpu(header->reserved2);
>> +    header->reserved3 = le64_to_cpu(header->reserved3);
>> +    header->reserved4 = le64_to_cpu(header->reserved4);
>> +
>> +    header->volatile_header_offset =
>> +        le64_to_cpu(header->volatile_header_offset);
>> +    header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
>> +
>> +    header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
>> +    header->journal_header_size = le64_to_cpu(header->journal_header_size);
>> +
>> +    header->journal_offset = le64_to_cpu(header->journal_offset);
>> +    header->journal_size = le64_to_cpu(header->journal_size);
>> +
>> +    header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
>> +    header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
>> +
>> +    header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
>> +    header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
>> +
>> +    header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
>> +    header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
>> +
>> +    header->backmap_offset = le64_to_cpu(header->backmap_offset);
>> +    header->backmap_size = le64_to_cpu(header->backmap_size);
>> +
>> +    header->grains_offset = le64_to_cpu(header->grains_offset);
>> +    header->grains_size = le64_to_cpu(header->grains_size);
>> +
>> +    if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
>> +        error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
>> +                   header->magic);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->version != 0x0000000200000001) {
>> +        error_setg(errp, "Unsupported version: 0x%016" PRIx64,
>> +                   header->version);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->grain_size != 8) {
>> +        error_setg(errp, "Unsupported grain size: %" PRIu64,
>> +                   header->grain_size);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->grain_table_size != 64) {
>> +        error_setg(errp, "Unsupported grain table size: %" PRIu64,
>> +                   header->grain_table_size);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->flags != 0) {
>> +        error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
>> +                   header->flags);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->reserved1 != 0 || header->reserved2 != 0 ||
>> +        header->reserved3 != 0 || header->reserved4 != 0) {
>> +        error_setg(errp, "Unsupported reserved bits:"
>> +                   " 0x%016" PRIx64 " 0x%016" PRIx64
>> +                   " 0x%016" PRIx64 " 0x%016" PRIx64,
>> +                   header->reserved1, header->reserved2,
>> +                   header->reserved3, header->reserved4);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->volatile_header_offset != 1) {
>> +        error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
>> +                   header->volatile_header_offset);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->volatile_header_size != 1) {
>> +        error_setg(errp, "Unsupported volatile header size: %" PRIu64,
>> +                   header->volatile_header_size);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->journal_header_offset != 2) {
>> +        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
>> +                   header->journal_header_offset);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->journal_header_size != 2) {
>> +        error_setg(errp, "Unsupported journal header size: %" PRIu64,
>> +                   header->journal_header_size);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->journal_offset != 2048) {
>> +        error_setg(errp, "Unsupported journal offset: %" PRIu64,
>> +                   header->journal_offset);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->journal_size != 2048) {
>> +        error_setg(errp, "Unsupported journal size: %" PRIu64,
>> +                   header->journal_size);
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    if (header->grain_dir_offset != 4096) {
>> +        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
>> +                   header->grain_dir_offset);
>> +        return -ENOTSUP;
>> +    }
> 
> Do these values (metadata structure offsets and sizes) really matter?

Not really - these are the values I saw in 10MB, 150GB and 6TB
snapshots; they never change.
All of the checks in this function are just for “strictness” and can be removed.
I wanted to include them at least in the v1 series to make the format clearer,
and to not support vmdks that report other values, under the assumption that
such a format might misbehave in other parts too.
They can be removed (maybe I’ll keep the version check in any case) or we
can keep them and remove them at any later point.
WDYT?

> 
>> +    /* in sectors */
>> +    grain_table_coverage = ((header->grain_table_size *
> 
> Hm, if header->grain_table_size is measured in sectors, it’d probably be
> better to rename it to grain_table_sectors or something.
> 
> (“size” is usually in bytes; sometimes it’s the number of entries, but
> that’s already kind of weird, I think)

grain_table_sectors is not a good name since it is not the number of sectors
a grain table uses but the number of sectors it covers.
grain_table_coverage_sectors is a better name but a bit too long.
I can change grain_table_* to gt_* and grain_dir_* to gd_* but they seem too
similar and confusing.
WDYT?

> 
>> +                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
>> +                            header->grain_size);
>> +    needed_grain_tables = header->capacity / grain_table_coverage;
>> +    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
>> +                                         BDRV_SECTOR_SIZE);
>> +    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
>> +
>> +    if (header->grain_dir_size != needed_grain_dir_size) {
> 
> Wouldn’t it suffice to check that header->grain_dir_size >=
> needed_grain_dir_size?  (The ROUND_UP() would become unnecessary, then.)
> 
> (And also, maybe s/grain_dir_size/grain_dir_sectors/)

This is how VMware works - they round up the size to 2048 sectors for the
needed_grain_dir_sectors - so I just make sure the math adds up.
As said before, all of these checks are just for “strictness”.
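
For what it’s worth, plugging the numbers from the hostd.log excerpt in the
commit message (capacity 204800, grainSize 8, grainTableSize 64) into this math
reproduces the extent sizes seen there.  A standalone re-check, with local
stand-ins for QEMU’s macros:

    #include <stdint.h>
    #include <stdio.h>

    #define SECTOR_SIZE 512
    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
    #define ROUND_UP(n, d) (DIV_ROUND_UP(n, d) * (d))

    int main(void)
    {
        uint64_t capacity = 204800, grain_size = 8, grain_table_size = 64;

        /* sectors covered by one grain table */
        uint64_t coverage = grain_table_size * SECTOR_SIZE / sizeof(uint64_t)
                            * grain_size;                   /* 32768 */
        uint64_t tables = capacity / coverage;              /* 6 */
        uint64_t dir_size = ROUND_UP(DIV_ROUND_UP(tables * sizeof(uint64_t),
                                                  SECTOR_SIZE), 2048);
        uint64_t tables_size = ROUND_UP(tables * grain_table_size, 2048);

        /* both print 2048, matching GrainDirectory and GrainTables above */
        printf("%llu %llu\n", (unsigned long long)dir_size,
               (unsigned long long)tables_size);
        return 0;
    }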

> 
>> +        error_setg(errp, "Invalid grain directory size: %" PRIu64
>> +                   ", needed: %" PRIu64,
>> +                   header->grain_dir_size, needed_grain_dir_size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->grain_tables_offset !=
>> +        header->grain_dir_offset + header->grain_dir_size) {
>> +        error_setg(errp, "Grain directory must be followed by grain tables");
>> +        return -EINVAL;
>> +    }
>> +
>> +    needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
>> +    needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
>> +
>> +    if (header->grain_tables_size != needed_grain_tables_size) {
>> +        error_setg(errp, "Invalid grain tables size: %" PRIu64
>> +                   ", needed: %" PRIu64,
>> +                   header->grain_tables_size, needed_grain_tables_size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->free_bitmap_offset !=
>> +        header->grain_tables_offset + header->grain_tables_size) {
>> +        error_setg(errp, "Grain tables must be followed by free bitmap");
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* in bits */
>> +    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
>> +                                           header->grain_size);
>> +    /* in bytes */
>> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
>> +                                           BITS_PER_BYTE);
>> +    /* in sectors */
>> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
>> +                                           BDRV_SECTOR_SIZE);
>> +    needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);
> 
> Er, now this is really confusing.  I’d start with the size in bytes:
> 
>> needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
>>                                       header->grain_size * BITS_PER_BYTE);
> 
> and then translate it to sectors, but use a new variable for it
> (needed_free_bitmap_sectors).

I’ll do that
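
So, roughly (untested sketch, in the context of check_se_sparse_const_header(),
with needed_free_bitmap_sectors as a new uint64_t):

    /* one bit per grain, so this is the size in bytes */
    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
                                           header->grain_size * BITS_PER_BYTE);
    /* translated to sectors */
    needed_free_bitmap_sectors = DIV_ROUND_UP(needed_free_bitmap_size,
                                              SECTOR_SIZE);
    needed_free_bitmap_sectors = ROUND_UP(needed_free_bitmap_sectors, 2048);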

> 
>> +
>> +    if (header->free_bitmap_size != needed_free_bitmap_size) {
> 
> Again, I suppose just checking that header->free_bitmap_size >=
> needed_free_bitmap_size should be enough, I think.

Just for strictness, as said above.

> 
>> +        error_setg(errp, "Invalid free bitmap size: %" PRIu64
>> +                   ", needed: %" PRIu64,
>> +                   header->free_bitmap_size, needed_free_bitmap_size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->backmap_offset !=
>> +        header->free_bitmap_offset + header->free_bitmap_size) {
>> +        error_setg(errp, "Free bitmap must be followed by backmap");
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->backmap_size != header->grain_tables_size) {
>> +        error_setg(errp, "Invalid backmap size: %" PRIu64
>> +                   ", needed: %" PRIu64,
>> +                   header->backmap_size, header->grain_tables_size);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->grains_offset !=
>> +        header->backmap_offset + header->backmap_size) {
>> +        error_setg(errp, "Backmap must be followed by grains");
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->grains_size != header->capacity) {
>> +        error_setg(errp, "Invalid grains size: %" PRIu64
>> +                   ", needed: %" PRIu64,
>> +                   header->grains_size, header->capacity);
>> +        return -EINVAL;
>> +    }
>> +
>> +    /* check that padding is 0 */
>> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
> 
> buffer_is_zero() should be sufficient.

Great! I’ll use it
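
I.e. (sketch):

    /* check that padding is 0 */
    if (!buffer_is_zero(header->pad, sizeof(header->pad))) {
        error_setg(errp, "Unsupported non-zero const header padding");
        return -ENOTSUP;
    }

which also makes the static zero_pad array unnecessary.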

> 
>> +        error_setg(errp, "Unsupported non-zero const header padding");
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
>> +                                           Error **errp)
>> +{
>> +    header->magic = le64_to_cpu(header->magic);
>> +    header->free_gt_number = le64_to_cpu(header->free_gt_number);
>> +    header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
>> +    header->replay_journal = le64_to_cpu(header->replay_journal);
>> +
>> +    if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
>> +        error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
>> +                   header->magic);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (header->replay_journal) {
>> +        error_setg(errp, "Replaying journal not supported");
> 
> Hmmm, maybe "Image is dirty, replaying journal not supported”?

Even better

> 
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    /* check that padding is 0 */
>> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
>> +        error_setg(errp, "Unsupported non-zero volatile header padding");
>> +        return -ENOTSUP;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int vmdk_open_se_sparse(BlockDriverState *bs,
>> +                               BdrvChild *file,
>> +                               int flags, Error **errp)
>> +{
>> +    int ret;
>> +    VMDKSESparseConstHeader const_header;
>> +    VMDKSESparseVolatileHeader volatile_header;
>> +    VmdkExtent *extent;
>> +
>> +    assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
>> +
> 
> [1] I think if you put a
> 
> ret = bdrv_apply_auto_read_only(bs,
>    "No write support for seSparse images available", errp);
> if (ret < 0) {
>    return ret;
> }
> 
> here, that should do the trick.

I’ll do that, see first comment.

> 
>> +    ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
>> +    if (ret < 0) {
>> +        bdrv_refresh_filename(file->bs);
>> +        error_setg_errno(errp, -ret,
>> +                         "Could not read const header from file '%s'",
>> +                         file->bs->filename);
>> +        return ret;
>> +    }
>> +
>> +    /* check const header */
>> +    ret = check_se_sparse_const_header(&const_header, errp);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
>> +
>> +    ret = bdrv_pread(file,
>> +                     const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
>> +                     &volatile_header, sizeof(volatile_header));
>> +    if (ret < 0) {
>> +        bdrv_refresh_filename(file->bs);
>> +        error_setg_errno(errp, -ret,
>> +                         "Could not read volatile header from file '%s'",
>> +                         file->bs->filename);
>> +        return ret;
>> +    }
>> +
>> +    /* check volatile header */
>> +    ret = check_se_sparse_volatile_header(&volatile_header, errp);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    ret = vmdk_add_extent(bs, file, false,
>> +                          const_header.capacity,
>> +                          const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
>> +                          0,
>> +                          const_header.grain_dir_size *
>> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
>> +                          const_header.grain_table_size *
>> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
>> +                          const_header.grain_size,
>> +                          &extent,
>> +                          errp);
>> +    if (ret < 0) {
>> +        return ret;
>> +    }
>> +
>> +    extent->sesparse = true;
>> +    extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
>> +    extent->sesparse_clusters_offset = const_header.grains_offset;
>> +    extent->entry_size = sizeof(uint64_t);
>> +
>> +    ret = vmdk_init_tables(bs, extent, errp);
>> +    if (ret) {
>> +        /* free extent allocated by vmdk_add_extent */
>> +        vmdk_free_last_extent(bs);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>>                                QDict *options, Error **errp);
>> 
>> @@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>>          * RW [size in sectors] SPARSE "file-name.vmdk"
>>          * RW [size in sectors] VMFS "file-name.vmdk"
>>          * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
>> +         * RW [size in sectors] SESPARSE "file-name.vmdk"
>>          */
>>         flat_offset = -1;
>>         matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
>> @@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>> 
>>         if (sectors <= 0 ||
>>             (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
>> -             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
>> +             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
>> +             strcmp(type, "SESPARSE")) ||
>>             (strcmp(access, "RW"))) {
>>             continue;
>>         }
>> @@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>>                 return ret;
>>             }
>>             extent = &s->extents[s->num_extents - 1];
>> +        } else if (!strcmp(type, "SESPARSE")) {
>> +            ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
>> +            if (ret) {
>> +                bdrv_unref_child(bs, extent_file);
>> +                return ret;
>> +            }
>> +            extent = &s->extents[s->num_extents - 1];
>>         } else {
>>             error_setg(errp, "Unsupported extent type '%s'", type);
>>             bdrv_unref_child(bs, extent_file);
>> @@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>>     if (strcmp(ct, "monolithicFlat") &&
>>         strcmp(ct, "vmfs") &&
>>         strcmp(ct, "vmfsSparse") &&
>> +        strcmp(ct, "seSparse") &&
>>         strcmp(ct, "twoGbMaxExtentSparse") &&
>>         strcmp(ct, "twoGbMaxExtentFlat")) {
>>         error_setg(errp, "Unsupported image type '%s'", ct);
>> @@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
>> {
>>     unsigned int l1_index, l2_offset, l2_index;
>>     int min_index, i, j;
>> -    uint32_t min_count, *l2_table;
>> +    uint32_t min_count;
>> +    void *l2_table;
>>     bool zeroed = false;
>>     int64_t ret;
>>     int64_t cluster_sector;
>> +    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
>> 
>>     if (m_data) {
>>         m_data->valid = 0;
>> @@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
>>     if (l1_index >= extent->l1_size) {
>>         return VMDK_ERROR;
>>     }
>> -    l2_offset = extent->l1_table[l1_index];
>> +    if (extent->sesparse) {
> 
> Maybe add assertions that extent->entry_size == 8 here and... [2]

Sure

> 
>> +        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
>> +        if (l2_offset_u64 == 0) {
>> +            l2_offset = 0;
>> +        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
>> +            /*
>> +             * Topmost nibble is 0x1 if the grain table is allocated.
>> +             * Strict check - the topmost 4 bytes must be 0x10000000, since
>> +             * the max supported disk size is 64TB - so there are at most
>> +             * 64TB / 16MB grain tables, which fits in a uint32,
>> +             * where 16MB is the only supported default grain table coverage.
>> +             */
>> +            return VMDK_ERROR;
>> +        } else {
>> +            l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
>> +            l2_offset_u64 = extent->sesparse_l2_tables_offset +
>> +                l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
>> +            if (l2_offset_u64 > 0x00000000ffffffff) {
>> +                return VMDK_ERROR;
>> +            }
>> +            l2_offset = (unsigned int)(l2_offset_u64);
>> +        }
>> +    } else {
> 
> [2] ...that extent->entry_size == 4 here?

Sure
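
So the top of the lookup would become (sketch):

    if (extent->sesparse) {
        uint64_t l2_offset_u64;

        assert(extent->entry_size == sizeof(uint64_t));
        l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
        /* ...validation of l2_offset_u64 as in the patch... */
    } else {
        assert(extent->entry_size == sizeof(uint32_t));
        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
    }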

> 
>> +        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
>> +    }
>>     if (!l2_offset) {
>>         return VMDK_UNALLOC;
>>     }
>> @@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>>                     extent->l2_cache_counts[j] >>= 1;
>>                 }
>>             }
>> -            l2_table = extent->l2_cache + (i * extent->l2_size);
>> +            l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
>>             goto found;
>>         }
>>     }
>> @@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
>>             min_index = i;
>>         }
>>     }
>> -    l2_table = extent->l2_cache + (min_index * extent->l2_size);
>> +    l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
>>     BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
>>     if (bdrv_pread(extent->file,
>>                 (int64_t)l2_offset * 512,
>>                 l2_table,
>> -                extent->l2_size * sizeof(uint32_t)
>> -            ) != extent->l2_size * sizeof(uint32_t)) {
>> +                l2_size_bytes
>> +            ) != l2_size_bytes) {
>>         return VMDK_ERROR;
>>     }
>> 
>> @@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
>>     extent->l2_cache_counts[min_index] = 1;
>>  found:
>>     l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
>> -    cluster_sector = le32_to_cpu(l2_table[l2_index]);
>> 
>> -    if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
>> -        zeroed = true;
>> +    if (extent->sesparse) {
>> +        cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
>> +        switch (cluster_sector & 0xf000000000000000) {
>> +        case 0x0000000000000000:
>> +            /* unallocated grain */
>> +            if (cluster_sector != 0) {
>> +                return VMDK_ERROR;
>> +            }
>> +            break;
>> +        case 0x1000000000000000:
>> +            /* scsi-unmapped grain - fallthrough */
>> +        case 0x2000000000000000:
>> +            /* zero grain */
>> +            zeroed = true;
>> +            break;
>> +        case 0x3000000000000000:
>> +            /* allocated grain */
>> +            cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
>> +                              ((cluster_sector & 0x0000ffffffffffff) << 12));
> 
> That’s just insane.  I trust you, though.

Yeah, very weird math. I installed several operating systems on seSparse
snapshots, then used qemu-img convert/commit and tested the results -
all of them booted, so I guess this really works :-)
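
For anyone else trying to follow the bit-twiddling, here it is as a standalone
snippet (hypothetical entry value, untested sketch):

    #include <assert.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* decode an seSparse L2 entry of type 0x3 (allocated grain) into a
     * grain index - the same math as in get_cluster_offset() above */
    static uint64_t se_sparse_grain_index(uint64_t entry)
    {
        assert((entry & 0xf000000000000000ULL) == 0x3000000000000000ULL);
        return ((entry & 0x0fff000000000000ULL) >> 48) |
               ((entry & 0x0000ffffffffffffULL) << 12);
    }

    int main(void)
    {
        /* the low 12 bits of the index live in bits 48..59 of the entry,
         * the remaining bits in bits 0..47 */
        printf("%" PRIu64 "\n", se_sparse_grain_index(0x3005000000000001ULL));
        /* prints 4101, i.e. 0x005 | (0x1 << 12) */
        return 0;
    }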

> 
>> +            cluster_sector = extent->sesparse_clusters_offset +
>> +                cluster_sector * extent->cluster_sectors;
>> +            break;
>> +        default:
>> +            return VMDK_ERROR;
>> +        }
>> +    } else {
>> +        cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
>> +
>> +        if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
>> +            zeroed = true;
>> +        }
>>     }
>> 
>>     if (!cluster_sector || zeroed) {
>>         if (!allocate) {
>>             return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>>         }
>> +        assert(!extent->sesparse);
>> 
>>         if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
>>             return VMDK_ERROR;
>> @@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>>             m_data->l1_index = l1_index;
>>             m_data->l2_index = l2_index;
>>             m_data->l2_offset = l2_offset;
>> -            m_data->l2_cache_entry = &l2_table[l2_index];
>> +            m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
>>         }
>>     }
>>     *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
>> @@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
>>         if (!extent) {
>>             return -EIO;
>>         }
>> +        if (extent->sesparse) {
>> +            return -ENOTSUP;
>> +        }
> 
> ([1] This should still stay here, even with bdrv_apply_auto_read_only()
> above.)

Just as I thought, great.

> 
>>         offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
>>         n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
>>                              - offset_in_cluster);
>> 
> 
> Overall, looks reasonable to me.  Then again, I have no clue about the
> format and I trust you that it works.
> 
> Max

Sam



* Re: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
  2019-05-28  7:57     ` Sam Eiderman
@ 2019-05-28 12:17       ` Max Reitz
  0 siblings, 0 replies; 13+ messages in thread
From: Max Reitz @ 2019-05-28 12:17 UTC (permalink / raw)
  To: Sam Eiderman
  Cc: fam, kwolf, eyal.moscovici, qemu-block, arbel.moshe, qemu-devel,
	liran.alon, Karl Heubaum

On 28.05.19 09:57, Sam Eiderman wrote:
> Comments inline
> 
>> On 27 May 2019, at 20:39, Max Reitz <mreitz@redhat.com> wrote:
>>
>> On 24.04.19 09:49, Sam Eiderman wrote:

[...]

>>> @@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>>>         goto fail_l1;
>>>     }
>>>     for (i = 0; i < extent->l1_size; i++) {
>>> -        le32_to_cpus(&extent->l1_table[i]);
>>> +        if (extent->sesparse) {
>>
>> Shouldn’t matter in practice, but I think evaluating extent->entry_size
>> would be more natural.
> 
> I just wanted to make it clearer that we are dealing with the
> seSparse format.
> But maybe in this case I should evaluate extent->entry_size and add an
> assert on extent->sesparse inside the uint64_t case?
> WDYT?

I think that this code works independently of whether it’s dealing with
the seSparse format or something else.  All that matters is the L1 entry
size.  If some day there is another VMDK sub format that has 64-bit L1
entries, too, this code will be ready for it.

(It’s a different matter when interpreting e.g. L2 entries, because
their encoding may differ depending on the sub format.)

>>> +            le64_to_cpus((uint64_t *)extent->l1_table + i);
>>> +        } else {
>>> +            le32_to_cpus((uint32_t *)extent->l1_table + i);
>>> +        }
>>>     }
>>>
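
So, keyed off the entry size, with the assertion you proposed (sketch):

    for (i = 0; i < extent->l1_size; i++) {
        if (extent->entry_size == sizeof(uint64_t)) {
            assert(extent->sesparse);
            le64_to_cpus((uint64_t *)extent->l1_table + i);
        } else {
            assert(extent->entry_size == sizeof(uint32_t));
            le32_to_cpus((uint32_t *)extent->l1_table + i);
        }
    }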

[...]

>>> +    if (header->journal_header_offset != 2) {
>>> +        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
>>> +                   header->journal_header_offset);
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    if (header->journal_header_size != 2) {
>>> +        error_setg(errp, "Unsupported journal header size: %" PRIu64,
>>> +                   header->journal_header_size);
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    if (header->journal_offset != 2048) {
>>> +        error_setg(errp, "Unsupported journal offset: %" PRIu64,
>>> +                   header->journal_offset);
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    if (header->journal_size != 2048) {
>>> +        error_setg(errp, "Unsupported journal size: %" PRIu64,
>>> +                   header->journal_size);
>>> +        return -ENOTSUP;
>>> +    }
>>> +
>>> +    if (header->grain_dir_offset != 4096) {
>>> +        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
>>> +                   header->grain_dir_offset);
>>> +        return -ENOTSUP;
>>> +    }
>>
>> Do these values (metadata structure offsets and sizes) really matter?
> 
> Not really - these are the values I saw in 10MB, 150GB and 6TB
> snapshots; they never change.
> All of the checks in this function are just for “strictness” and can be
> removed.
> I wanted to include them at least in the v1 series to make the format
> clearer, and to not support vmdks that report other values, under the
> assumption that such a format might misbehave in other parts too.
> They can be removed (maybe I’ll keep the version check in any case) or we
> can keep them and remove them at any later point.
> WDYT?

I’d remove them.  Because this patch only adds read-only support, there
is no danger of corrupting data.  The worst that can happen is reading
wrong data from some file that, with these checks, we just wouldn’t read
at all.

>>> +    /* in sectors */
>>> +    grain_table_coverage = ((header->grain_table_size *
>>
>> Hm, if header->grain_table_size is measured in sectors, it’d probably be
>> better to rename it to grain_table_sectors or something.
>>
>> (“size” is usually in bytes; sometimes it’s the number of entries, but
>> that’s already kind of weird, I think)
> 
> grain_table_sectors is not a good name since it is not the number of sectors
> a grain table uses but the number of sectors it covers.

I meant renaming header->grain_table_size to
header->grain_table_sectors.  grain_table_coverage seems good to me.

But from your answer about the volatile header it appears like
header->grain_table_size might be the official name.  If so, I guess
it’ll be fine as it is.

> grain_table_coverage_sectors is a better name but a bit too long.
> I can change grain_table_* to gt_* and grain_dir_* to gd_* but they seem too
> similar and confusing.
> WDYT?
> 
>>
>>> +                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
>>> +                            header->grain_size);
>>> +    needed_grain_tables = header->capacity / grain_table_coverage;
>>> +    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
>>> +                                         BDRV_SECTOR_SIZE);
>>> +    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
>>> +
>>> +    if (header->grain_dir_size != needed_grain_dir_size) {
>>
>> Wouldn’t it suffice to check that header->grain_dir_size >=
>> needed_grain_dir_size?  (The ROUND_UP() would become unnecessary, then.)
>>
>> (And also, maybe s/grain_dir_size/grain_dir_sectors/)
> 
> This is how VMware works - they round up the size to 2048 sectors for the
> needed_grain_dir_sectors - so I just make sure the math adds up.
> As said before, all of these checks are just for “strictness”.

The thing is, I don’t know whether this is part of the specification of
seSparse or just part of VMware’s implementation.  When qemu increases
the qcow2 L1 table size, it has a specific algorithm for that
(current size * 1.5).  But that doesn’t make it part of the qcow2
specification, any size is fine.  Similarly, I can imagine that VMware
simply chose to increase the L1 size in units of 1 MB (so it doesn’t
need to be grown very often), but that it can work with any size.

Max


