From: Max Reitz <mreitz@redhat.com>
To: Sam Eiderman <shmuel.eiderman@oracle.com>,
	fam@euphon.net, kwolf@redhat.com, qemu-block@nongnu.org
Cc: eyal.moscovici@oracle.com, qemu-devel@nongnu.org,
	liran.alon@oracle.com, karl.heubaum@oracle.com,
	arbel.moshe@oracle.com
Subject: Re: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
Date: Mon, 27 May 2019 19:39:16 +0200	[thread overview]
Message-ID: <8a27479f-3948-df17-5abc-0c22811ae4b2@redhat.com> (raw)
In-Reply-To: <20190424074901.31430-3-shmuel.eiderman@oracle.com>


On 24.04.19 09:49, Sam Eiderman wrote:
> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
> QEMU).
> 
> This format was lacking in the following:
> 
>     * Grain directory (L1) and grain table (L2) entries were 32-bit,
>       allowing access to only 2TB (slightly less) of data.
>     * The grain size (default) was 512 bytes - leading to data
>       fragmentation and many grain tables.
>     * For space reclamation purposes, it was necessary to find all the
>       grains which are not pointed to by any grain table - so a reverse
>       mapping of "offset of grain in vmdk" to "grain table" must be
>       constructed - which takes large amounts of CPU/RAM.
> 
> The format specification can be found in VMware's documentation:
> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
> 
> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
> introduced: SESparse (Space Efficient).
> 
> This format fixes the above issues:
> 
>     * All entries are now 64-bit.
>     * The grain size (default) is 4KB.
>     * Grain directory and grain tables are now located at the beginning
>       of the file.
>       + seSparse format reserves space for all grain tables.
>       + Grain tables can be addressed using an index.
>       + Grains are located at the end of the file and can also be
>         addressed with an index.
>       - seSparse vmdks of large disks (64TB) have huge preallocated
>         headers - mainly due to L2 tables, even for empty snapshots.
>     * The header contains a reverse mapping ("backmap") of "offset of
>       grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>       specifies for each grain - whether it is allocated or not.
>       Using these data structures we can implement space reclamation
>       efficiently.
>     * Because the header now maintains two mappings:
>         * The regular one (grain directory & grain tables)
>         * A reverse one (backmap and free bitmap)
>       these data structures can lose consistency upon a crash and
>       result in a corrupted VMDK.
>       Therefore, a journal is also added to the VMDK and is replayed
>       when VMware reopens the file after a crash.
> 
> Since ESXi 6.7, SESparse is the only snapshot format available.
> 
> Unfortunately, VMware does not provide documentation regarding the new
> seSparse format.
> 
> This commit is based on black-box research of the seSparse format.
> Various in-guest block operations and their effect on the snapshot file
> were tested.
> 
> The only VMware provided source of information (regarding the underlying
> implementation) was a log file on the ESXi:
> 
>     /var/log/hostd.log
> 
> Whenever an seSparse snapshot is created, the log is populated with
> seSparse records.
> 
> Relevant log records are of the form:
> 
> [...] Const Header:
> [...]  constMagic     = 0xcafebabe
> [...]  version        = 2.1
> [...]  capacity       = 204800
> [...]  grainSize      = 8
> [...]  grainTableSize = 64
> [...]  flags          = 0
> [...] Extents:
> [...]  Header         : <1 : 1>
> [...]  JournalHdr     : <2 : 2>
> [...]  Journal        : <2048 : 2048>
> [...]  GrainDirectory : <4096 : 2048>
> [...]  GrainTables    : <6144 : 2048>
> [...]  FreeBitmap     : <8192 : 2048>
> [...]  BackMap        : <10240 : 2048>
> [...]  Grain          : <12288 : 204800>
> [...] Volatile Header:
> [...] volatileMagic     = 0xcafecafe
> [...] FreeGTNumber      = 0
> [...] nextTxnSeqNumber  = 0
> [...] replayJournal     = 0
> 
> The sizes that are seen in the log file are in sectors.
> Extents are of the following format: <offset : size>
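
For concreteness, the `<offset : size>` pairs above translate to byte
ranges by multiplying with VMDK's 512-byte sector (a small illustrative
sketch; the struct and names are mine, not from the patch):

```c
#include <stdint.h>

#define VMDK_SECTOR_SIZE 512ULL

/* One extent record from hostd.log: <offset : size>, both in sectors */
typedef struct {
    uint64_t offset_sectors;
    uint64_t size_sectors;
} LogExtent;

/* Byte offset of the extent's first byte within the .vmdk file */
static uint64_t extent_start_bytes(LogExtent e)
{
    return e.offset_sectors * VMDK_SECTOR_SIZE;
}

/* Byte length of the extent */
static uint64_t extent_len_bytes(LogExtent e)
{
    return e.size_sectors * VMDK_SECTOR_SIZE;
}
```

E.g. the GrainDirectory extent <4096 : 2048> above starts at byte
2097152 (2 MiB) and is 1 MiB long.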
> 
> This commit is a strict implementation which enforces:
>     * magics
>     * version number 2.1
>     * grain size of 8 sectors  (4KB)
>     * grain table size of 64 sectors
>     * zero flags
>     * extent locations
> 
> Additionally, this commit provides only a subset of the functionality
> offered by seSparse's format:
>     * Read-only
>     * No journal replay
>     * No space reclamation
>     * No unmap support
> 
> Hence, journal header, journal, free bitmap and backmap extents are
> unused, only the "classic" (L1 -> L2 -> data) grain access is
> implemented.
> 
> However, there are several differences in the grain access itself.
> Grain directory (L1):
>     * Grain directory entries are indexes (not offsets) to grain
>       tables.
>     * Valid grain directory entries have their highest nibble set to
>       0x1.
>     * Since grain tables are always located at the beginning of the
>       file - the index can fit into 32 bits - so we can use its low
>       part if it's valid.
> Grain table (L2):
>     * Grain table entries are indexes (not offsets) to grains.
>     * If the highest nibble of the entry is:
>         0x0:
>             The grain is not allocated.
>             The rest of the bytes are 0.
>         0x1:
>             The grain is unmapped - guest sees a zero grain.
>             The rest of the bits point to the previously mapped grain,
>             see 0x3 case.
>         0x2:
>             The grain is zero.
>         0x3:
>             The grain is allocated - to get the index calculate:
>             ((entry & 0x0fff000000000000) >> 48) |
>             ((entry & 0x0000ffffffffffff) << 12)
>     * The difference between 0x1 and 0x2 is that 0x1 is an unmapped
>       grain which results from the guest using sg_unmap to unmap the
>       grain - but the grain itself still exists in the grain extent - a
>       space reclamation procedure should delete it.
>       Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>       but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
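
The grain-access rules above can be summarized in a small decoder (an
illustrative sketch based purely on this commit message, not code from
the patch; all names are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Decoded state of an seSparse grain table (L2) entry */
typedef enum {
    GRAIN_UNALLOCATED, /* 0x0: never written */
    GRAIN_UNMAPPED,    /* 0x1: unmapped by the guest, reads as zero */
    GRAIN_ZERO,        /* 0x2: zero grain */
    GRAIN_ALLOCATED,   /* 0x3: encodes a grain index */
    GRAIN_INVALID      /* anything else */
} GrainState;

static GrainState se_sparse_l2_state(uint64_t entry)
{
    switch (entry >> 60) { /* highest nibble */
    case 0x0:
        /* the rest of the bytes must be 0 */
        return entry == 0 ? GRAIN_UNALLOCATED : GRAIN_INVALID;
    case 0x1:
        return GRAIN_UNMAPPED;
    case 0x2:
        return GRAIN_ZERO;
    case 0x3:
        return GRAIN_ALLOCATED;
    default:
        return GRAIN_INVALID;
    }
}

/* Grain index of an allocated (0x3) entry, per the formula above */
static uint64_t se_sparse_grain_index(uint64_t entry)
{
    assert(se_sparse_l2_state(entry) == GRAIN_ALLOCATED);
    return ((entry & 0x0fff000000000000ULL) >> 48) |
           ((entry & 0x0000ffffffffffffULL) << 12);
}
```

So the low 12 bits of the index live in bits 48-59 of the entry, and the
remaining bits in bits 0-47.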
> 
> In order to implement seSparse some fields had to be changed to support
> both 32-bit and 64-bit entry sizes.
> 
> Read-only is implemented by failing on pwrite since qemu-img opens the
> vmdk with read-write flags for "convert" (which is a read-only
> operation).

Does it?  The code doesn’t look like it (in img_convert(), src_flags is
never set to anything with BDRV_O_RDWR).

In theory, we have bdrv_apply_auto_read_only() for this case.  More on
that below. [1]

> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
> ---
>  block/vmdk.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 459 insertions(+), 16 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index fc7378da78..e599c08b95 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c

First, a general observation: BDRV_SECTOR_SIZE is indeed equal to VMDK’s
sector size (both are 512), but I’d consider that a coincidence.  This
file defines a SECTOR_SIZE macro.  I think you should use that instead
of BDRV_SECTOR_SIZE whenever referring to VMDK’s sector size.

(BDRV_SECTOR_SIZE is actually the sector size the qemu block layer uses.
 Yes, I know, there are plenty of places in vmdk.c that use it.  I still
think it isn’t entirely correct to use it.)

> @@ -91,6 +91,44 @@ typedef struct {
>      uint16_t compressAlgorithm;
>  } QEMU_PACKED VMDK4Header;
>  
> +typedef struct VMDKSESparseConstHeader {
> +    uint64_t magic;
> +    uint64_t version;
> +    uint64_t capacity;
> +    uint64_t grain_size;
> +    uint64_t grain_table_size;
> +    uint64_t flags;
> +    uint64_t reserved1;
> +    uint64_t reserved2;
> +    uint64_t reserved3;
> +    uint64_t reserved4;
> +    uint64_t volatile_header_offset;
> +    uint64_t volatile_header_size;
> +    uint64_t journal_header_offset;
> +    uint64_t journal_header_size;
> +    uint64_t journal_offset;
> +    uint64_t journal_size;
> +    uint64_t grain_dir_offset;
> +    uint64_t grain_dir_size;
> +    uint64_t grain_tables_offset;
> +    uint64_t grain_tables_size;
> +    uint64_t free_bitmap_offset;
> +    uint64_t free_bitmap_size;
> +    uint64_t backmap_offset;
> +    uint64_t backmap_size;
> +    uint64_t grains_offset;
> +    uint64_t grains_size;
> +    uint8_t pad[304];
> +} QEMU_PACKED VMDKSESparseConstHeader;
> +
> +typedef struct VMDKSESparseVolatileHeader {

Is this the official name?  (I know you don’t have a specification, but
maybe you do have some half-official information, I don’t know.)
Generally, I’d call the opposite of “const” “mutable”.

> +    uint64_t magic;
> +    uint64_t free_gt_number;
> +    uint64_t next_txn_seq_number;
> +    uint64_t replay_journal;
> +    uint8_t pad[480];
> +} QEMU_PACKED VMDKSESparseVolatileHeader;
> +
>  #define L2_CACHE_SIZE 16
>  
>  typedef struct VmdkExtent {
> @@ -99,19 +137,23 @@ typedef struct VmdkExtent {
>      bool compressed;
>      bool has_marker;
>      bool has_zero_grain;
> +    bool sesparse;
> +    uint64_t sesparse_l2_tables_offset;
> +    uint64_t sesparse_clusters_offset;
> +    int32_t entry_size;
>      int version;
>      int64_t sectors;
>      int64_t end_sector;
>      int64_t flat_start_offset;
>      int64_t l1_table_offset;
>      int64_t l1_backup_table_offset;
> -    uint32_t *l1_table;
> +    void *l1_table;
>      uint32_t *l1_backup_table;
>      unsigned int l1_size;
>      uint32_t l1_entry_sectors;
>  
>      unsigned int l2_size;
> -    uint32_t *l2_cache;
> +    void *l2_cache;
>      uint32_t l2_cache_offsets[L2_CACHE_SIZE];
>      uint32_t l2_cache_counts[L2_CACHE_SIZE];
>  
> @@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
>           *            cluster size: 64KB, L2 table size: 512 entries
>           *     1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
>           *            cluster size: 512B, L2 table size: 4096 entries
> +         *     8PB  - for default "ESXI seSparse Extent"
> +         *            cluster size: 4KB,  L2 table size: 4096 entries
>           */
>          error_setg(errp, "L1 size too big");
>          return -EFBIG;

Hm, now I wonder about this limit.  With this new format, it’d be 512M *
8 = 4 GB of RAM usage.  That seems an awful lot, and I don’t think we
really need to support 8 PB images.  Like, if we want to prevent
unbounded allocations, 4 GB is far above the limit of what I’d consider
sane.  (And I don’t know too much about vmdk, but can’t a user also
specify multiple extents with a single descriptor file, and thus make
qemu allocate these 4 GB multiple times?)

Some comment in this file says the maximum supported size is 64 TB
anyway.  Shouldn’t we just limit the L1 size accordingly, then?
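
For reference, with the default seSparse geometry (4 KiB grains,
4096-entry grain tables, so 16 MiB covered per grain table), the L1
footprint works out as follows (back-of-the-envelope sketch, not QEMU
code; all names are mine):

```c
#include <stdint.h>

/* Default seSparse geometry */
#define SE_SECTOR      512ULL
#define SE_GRAIN       (8 * SE_SECTOR)                      /* 4 KiB */
#define SE_GT_ENTRIES  (64 * SE_SECTOR / sizeof(uint64_t))  /* 4096 */
#define SE_GT_COVERAGE (SE_GT_ENTRIES * SE_GRAIN)           /* 16 MiB */

/* L1 (grain directory) entries needed for a given disk size */
static uint64_t se_l1_entries(uint64_t disk_bytes)
{
    return (disk_bytes + SE_GT_COVERAGE - 1) / SE_GT_COVERAGE;
}

/* ... and the RAM the in-memory L1 table costs (8 bytes per entry) */
static uint64_t se_l1_bytes(uint64_t disk_bytes)
{
    return se_l1_entries(disk_bytes) * sizeof(uint64_t);
}
```

A 64 TB cap would bound the L1 table at 32 MiB of RAM, versus the 4 GiB
that the 8 PB limit allows.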

> @@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>      extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
> +    extent->entry_size = sizeof(uint32_t);
>  
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
> @@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>      int i;
>  
>      /* read the L1 table */
> -    l1_size = extent->l1_size * sizeof(uint32_t);
> +    l1_size = extent->l1_size * extent->entry_size;

Related to the above: This can overflow.  (512M * 8 == 4G)
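
The wrap-around is easy to demonstrate in isolation (a hypothetical
helper mirroring the multiplication in question, not the patch's actual
code):

```c
#include <stdint.h>

/* Mirrors `l1_size = extent->l1_size * extent->entry_size` when the
 * result is kept in a 32-bit value: 512 Mi entries * 8 wraps to 0. */
static uint32_t l1_bytes_narrow(uint32_t l1_entries, int32_t entry_size)
{
    return l1_entries * (uint32_t)entry_size;  /* 32-bit wrap-around */
}

/* Widening before the multiplication (or capping l1_size much earlier,
 * as suggested above) avoids the overflow. */
static uint64_t l1_bytes_wide(uint32_t l1_entries, int32_t entry_size)
{
    return (uint64_t)l1_entries * (uint64_t)entry_size;
}
```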

>      extent->l1_table = g_try_malloc(l1_size);
>      if (l1_size && extent->l1_table == NULL) {
>          return -ENOMEM;
> @@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>          goto fail_l1;
>      }
>      for (i = 0; i < extent->l1_size; i++) {
> -        le32_to_cpus(&extent->l1_table[i]);
> +        if (extent->sesparse) {

Shouldn’t matter in practice, but I think evaluating extent->entry_size
would be more natural.

> +            le64_to_cpus((uint64_t *)extent->l1_table + i);
> +        } else {
> +            le32_to_cpus((uint32_t *)extent->l1_table + i);
> +        }
>      }
>  
>      if (extent->l1_backup_table_offset) {
> +        assert(!extent->sesparse);
>          extent->l1_backup_table = g_try_malloc(l1_size);
>          if (l1_size && extent->l1_backup_table == NULL) {
>              ret = -ENOMEM;
> @@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>      }
>  
>      extent->l2_cache =
> -        g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
> +        g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
>      return 0;
>   fail_l1b:
>      g_free(extent->l1_backup_table);
> @@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
>      return ret;
>  }
>  
> +#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
> +#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe

The upper 32 bits are 0, so it doesn’t really matter, but still, if these
are 64-bit constants (which they are), they should have a 64-bit type
(e.g. UINT64_C(0x00000000cafebabe)).  (According to my very personal
taste at least.)

> +
> +static const char zero_pad[BDRV_SECTOR_SIZE];
> +
> +/* Strict checks - format not officially documented */
> +static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
> +                                        Error **errp)
> +{
> +    uint64_t grain_table_coverage;
> +    uint64_t needed_grain_tables;
> +    uint64_t needed_grain_dir_size;
> +    uint64_t needed_grain_tables_size;
> +    uint64_t needed_free_bitmap_size;
> +
> +    header->magic = le64_to_cpu(header->magic);
> +    header->version = le64_to_cpu(header->version);
> +    header->grain_size = le64_to_cpu(header->grain_size);
> +    header->grain_table_size = le64_to_cpu(header->grain_table_size);
> +    header->flags = le64_to_cpu(header->flags);
> +    header->reserved1 = le64_to_cpu(header->reserved1);
> +    header->reserved2 = le64_to_cpu(header->reserved2);
> +    header->reserved3 = le64_to_cpu(header->reserved3);
> +    header->reserved4 = le64_to_cpu(header->reserved4);
> +
> +    header->volatile_header_offset =
> +        le64_to_cpu(header->volatile_header_offset);
> +    header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
> +
> +    header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
> +    header->journal_header_size = le64_to_cpu(header->journal_header_size);
> +
> +    header->journal_offset = le64_to_cpu(header->journal_offset);
> +    header->journal_size = le64_to_cpu(header->journal_size);
> +
> +    header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
> +    header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
> +
> +    header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
> +    header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
> +
> +    header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
> +    header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
> +
> +    header->backmap_offset = le64_to_cpu(header->backmap_offset);
> +    header->backmap_size = le64_to_cpu(header->backmap_size);
> +
> +    header->grains_offset = le64_to_cpu(header->grains_offset);
> +    header->grains_size = le64_to_cpu(header->grains_size);
> +
> +    if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
> +        error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
> +                   header->magic);
> +        return -EINVAL;
> +    }
> +
> +    if (header->version != 0x0000000200000001) {
> +        error_setg(errp, "Unsupported version: 0x%016" PRIx64,
> +                   header->version);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_size != 8) {
> +        error_setg(errp, "Unsupported grain size: %" PRIu64,
> +                   header->grain_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_table_size != 64) {
> +        error_setg(errp, "Unsupported grain table size: %" PRIu64,
> +                   header->grain_table_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->flags != 0) {
> +        error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
> +                   header->flags);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->reserved1 != 0 || header->reserved2 != 0 ||
> +        header->reserved3 != 0 || header->reserved4 != 0) {
> +        error_setg(errp, "Unsupported reserved bits:"
> +                   " 0x%016" PRIx64 " 0x%016" PRIx64
> +                   " 0x%016" PRIx64 " 0x%016" PRIx64,
> +                   header->reserved1, header->reserved2,
> +                   header->reserved3, header->reserved4);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->volatile_header_offset != 1) {
> +        error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
> +                   header->volatile_header_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->volatile_header_size != 1) {
> +        error_setg(errp, "Unsupported volatile header size: %" PRIu64,
> +                   header->volatile_header_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_header_offset != 2) {
> +        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
> +                   header->journal_header_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_header_size != 2) {
> +        error_setg(errp, "Unsupported journal header size: %" PRIu64,
> +                   header->journal_header_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_offset != 2048) {
> +        error_setg(errp, "Unsupported journal offset: %" PRIu64,
> +                   header->journal_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_size != 2048) {
> +        error_setg(errp, "Unsupported journal size: %" PRIu64,
> +                   header->journal_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_dir_offset != 4096) {
> +        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
> +                   header->grain_dir_offset);
> +        return -ENOTSUP;
> +    }

Do these values (metadata structure offsets and sizes) really matter?

> +    /* in sectors */
> +    grain_table_coverage = ((header->grain_table_size *

Hm, if header->grain_table_size is measured in sectors, it’d probably be
better to rename it to grain_table_sectors or something.

(“size” is usually in bytes; sometimes it’s the number of entries, but
that’s already kind of weird, I think)

> +                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
> +                            header->grain_size);
> +    needed_grain_tables = header->capacity / grain_table_coverage;
> +    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
> +                                         BDRV_SECTOR_SIZE);
> +    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
> +
> +    if (header->grain_dir_size != needed_grain_dir_size) {

Wouldn’t it suffice to check that header->grain_dir_size >=
needed_grain_dir_size?  (The ROUND_UP() would become unnecessary, then.)

(And also, maybe s/grain_dir_size/grain_dir_sectors/)

> +        error_setg(errp, "Invalid grain directory size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grain_dir_size, needed_grain_dir_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->grain_tables_offset !=
> +        header->grain_dir_offset + header->grain_dir_size) {
> +        error_setg(errp, "Grain directory must be followed by grain tables");
> +        return -EINVAL;
> +    }
> +
> +    needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
> +    needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
> +
> +    if (header->grain_tables_size != needed_grain_tables_size) {
> +        error_setg(errp, "Invalid grain tables size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grain_tables_size, needed_grain_tables_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->free_bitmap_offset !=
> +        header->grain_tables_offset + header->grain_tables_size) {
> +        error_setg(errp, "Grain tables must be followed by free bitmap");
> +        return -EINVAL;
> +    }
> +
> +    /* in bits */
> +    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
> +                                           header->grain_size);
> +    /* in bytes */
> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> +                                           BITS_PER_BYTE);
> +    /* in sectors */
> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> +                                           BDRV_SECTOR_SIZE);
> +    needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);

Er, now this is really confusing.  I’d start with the size in bytes:

> needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
>                                        header->grain_size * BITS_PER_BYTE);

and then translate it to sectors, but use a new variable for it
(needed_free_bitmap_sectors).
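
In other words, something along these lines (a sketch of the suggested
computation; the macro and function names are invented, and the
2048-sector alignment is carried over from the patch's strict checks):

```c
#include <stdint.h>

#define SE_SECTOR_SIZE 512ULL
#define BITS_PER_BYTE_ 8ULL

#define DIV_ROUND_UP_(n, d) (((n) + (d) - 1) / (d))
#define ROUND_UP_(n, d)     (DIV_ROUND_UP_(n, d) * (d))

/* capacity and grain_size in sectors, as in the const header */
static uint64_t free_bitmap_sectors(uint64_t capacity, uint64_t grain_size)
{
    /* one bit per grain, rounded up to whole bytes */
    uint64_t bytes = DIV_ROUND_UP_(capacity, grain_size * BITS_PER_BYTE_);

    /* then to sectors, padded to a 2048-sector (1 MiB) boundary */
    return ROUND_UP_(DIV_ROUND_UP_(bytes, SE_SECTOR_SIZE), 2048);
}
```

With the hostd.log sample above (capacity 204800, grain size 8) this
yields 2048 sectors, matching the FreeBitmap extent <8192 : 2048>.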

> +
> +    if (header->free_bitmap_size != needed_free_bitmap_size) {

Again, just checking that header->free_bitmap_size >=
needed_free_bitmap_size should be enough, I think.

> +        error_setg(errp, "Invalid free bitmap size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->free_bitmap_size, needed_free_bitmap_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->backmap_offset !=
> +        header->free_bitmap_offset + header->free_bitmap_size) {
> +        error_setg(errp, "Free bitmap must be followed by backmap");
> +        return -EINVAL;
> +    }
> +
> +    if (header->backmap_size != header->grain_tables_size) {
> +        error_setg(errp, "Invalid backmap size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->backmap_size, header->grain_tables_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->grains_offset !=
> +        header->backmap_offset + header->backmap_size) {
> +        error_setg(errp, "Backmap must be followed by grains");
> +        return -EINVAL;
> +    }
> +
> +    if (header->grains_size != header->capacity) {
> +        error_setg(errp, "Invalid grains size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grains_size, header->capacity);
> +        return -EINVAL;
> +    }
> +
> +    /* check that padding is 0 */
> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {

buffer_is_zero() should be sufficient.

> +        error_setg(errp, "Unsupported non-zero const header padding");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
> +                                           Error **errp)
> +{
> +    header->magic = le64_to_cpu(header->magic);
> +    header->free_gt_number = le64_to_cpu(header->free_gt_number);
> +    header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
> +    header->replay_journal = le64_to_cpu(header->replay_journal);
> +
> +    if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
> +        error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
> +                   header->magic);
> +        return -EINVAL;
> +    }
> +
> +    if (header->replay_journal) {
> +        error_setg(errp, "Replaying journal not supported");

Hmmm, maybe "Image is dirty, replaying journal not supported"?

> +        return -ENOTSUP;
> +    }
> +
> +    /* check that padding is 0 */
> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
> +        error_setg(errp, "Unsupported non-zero volatile header padding");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmdk_open_se_sparse(BlockDriverState *bs,
> +                               BdrvChild *file,
> +                               int flags, Error **errp)
> +{
> +    int ret;
> +    VMDKSESparseConstHeader const_header;
> +    VMDKSESparseVolatileHeader volatile_header;
> +    VmdkExtent *extent;
> +
> +    assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
> +

[1] I think if you put a

ret = bdrv_apply_auto_read_only(bs,
    "No write support for seSparse images available", errp);
if (ret < 0) {
    return ret;
}

here, that should do the trick.

> +    ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
> +    if (ret < 0) {
> +        bdrv_refresh_filename(file->bs);
> +        error_setg_errno(errp, -ret,
> +                         "Could not read const header from file '%s'",
> +                         file->bs->filename);
> +        return ret;
> +    }
> +
> +    /* check const header */
> +    ret = check_se_sparse_const_header(&const_header, errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
> +
> +    ret = bdrv_pread(file,
> +                     const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
> +                     &volatile_header, sizeof(volatile_header));
> +    if (ret < 0) {
> +        bdrv_refresh_filename(file->bs);
> +        error_setg_errno(errp, -ret,
> +                         "Could not read volatile header from file '%s'",
> +                         file->bs->filename);
> +        return ret;
> +    }
> +
> +    /* check volatile header */
> +    ret = check_se_sparse_volatile_header(&volatile_header, errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vmdk_add_extent(bs, file, false,
> +                          const_header.capacity,
> +                          const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
> +                          0,
> +                          const_header.grain_dir_size *
> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
> +                          const_header.grain_table_size *
> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
> +                          const_header.grain_size,
> +                          &extent,
> +                          errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    extent->sesparse = true;
> +    extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
> +    extent->sesparse_clusters_offset = const_header.grains_offset;
> +    extent->entry_size = sizeof(uint64_t);
> +
> +    ret = vmdk_init_tables(bs, extent, errp);
> +    if (ret) {
> +        /* free extent allocated by vmdk_add_extent */
> +        vmdk_free_last_extent(bs);
> +    }
> +
> +    return ret;
> +}
> +
>  static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>                                 QDict *options, Error **errp);
>  
> @@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>           * RW [size in sectors] SPARSE "file-name.vmdk"
>           * RW [size in sectors] VMFS "file-name.vmdk"
>           * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
> +         * RW [size in sectors] SESPARSE "file-name.vmdk"
>           */
>          flat_offset = -1;
>          matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
> @@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>  
>          if (sectors <= 0 ||
>              (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
> -             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
> +             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
> +             strcmp(type, "SESPARSE")) ||
>              (strcmp(access, "RW"))) {
>              continue;
>          }
> @@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>                  return ret;
>              }
>              extent = &s->extents[s->num_extents - 1];
> +        } else if (!strcmp(type, "SESPARSE")) {
> +            ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
> +            if (ret) {
> +                bdrv_unref_child(bs, extent_file);
> +                return ret;
> +            }
> +            extent = &s->extents[s->num_extents - 1];
>          } else {
>              error_setg(errp, "Unsupported extent type '%s'", type);
>              bdrv_unref_child(bs, extent_file);
> @@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>      if (strcmp(ct, "monolithicFlat") &&
>          strcmp(ct, "vmfs") &&
>          strcmp(ct, "vmfsSparse") &&
> +        strcmp(ct, "seSparse") &&
>          strcmp(ct, "twoGbMaxExtentSparse") &&
>          strcmp(ct, "twoGbMaxExtentFlat")) {
>          error_setg(errp, "Unsupported image type '%s'", ct);
> @@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
>  {
>      unsigned int l1_index, l2_offset, l2_index;
>      int min_index, i, j;
> -    uint32_t min_count, *l2_table;
> +    uint32_t min_count;
> +    void *l2_table;
>      bool zeroed = false;
>      int64_t ret;
>      int64_t cluster_sector;
> +    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
>  
>      if (m_data) {
>          m_data->valid = 0;
> @@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
>      if (l1_index >= extent->l1_size) {
>          return VMDK_ERROR;
>      }
> -    l2_offset = extent->l1_table[l1_index];
> +    if (extent->sesparse) {

Maybe add assertions that extent->entry_size == 8 here and... [2]

> +        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
> +        if (l2_offset_u64 == 0) {
> +            l2_offset = 0;
> +        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
> +            /*
> +             * Top most nibble is 0x1 if grain table is allocated.
> +             * strict check - top most 4 bytes must be 0x10000000 since max
> +             * supported size is 64TB for disk - so no more than 64TB / 16MB
> +             * grain tables, which is smaller than uint32,
> +             * where 16MB is the only supported default grain table coverage.
> +             */
> +            return VMDK_ERROR;
> +        } else {
> +            l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
> +            l2_offset_u64 = extent->sesparse_l2_tables_offset +
> +                l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
> +            if (l2_offset_u64 > 0x00000000ffffffff) {
> +                return VMDK_ERROR;
> +            }
> +            l2_offset = (unsigned int)(l2_offset_u64);
> +        }
> +    } else {

[2] ...that extent->entry_size == 4 here?

> +        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
> +    }
>      if (!l2_offset) {
>          return VMDK_UNALLOC;
>      }
> @@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>                      extent->l2_cache_counts[j] >>= 1;
>                  }
>              }
> -            l2_table = extent->l2_cache + (i * extent->l2_size);
> +            l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
>              goto found;
>          }
>      }
> @@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
>              min_index = i;
>          }
>      }
> -    l2_table = extent->l2_cache + (min_index * extent->l2_size);
> +    l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
>      BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
>      if (bdrv_pread(extent->file,
>                  (int64_t)l2_offset * 512,
>                  l2_table,
> -                extent->l2_size * sizeof(uint32_t)
> -            ) != extent->l2_size * sizeof(uint32_t)) {
> +                l2_size_bytes
> +            ) != l2_size_bytes) {
>          return VMDK_ERROR;
>      }
>  
> @@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
>      extent->l2_cache_counts[min_index] = 1;
>   found:
>      l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
> -    cluster_sector = le32_to_cpu(l2_table[l2_index]);
>  
> -    if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> -        zeroed = true;
> +    if (extent->sesparse) {
> +        cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
> +        switch (cluster_sector & 0xf000000000000000) {
> +        case 0x0000000000000000:
> +            /* unallocated grain */
> +            if (cluster_sector != 0) {
> +                return VMDK_ERROR;
> +            }
> +            break;
> +        case 0x1000000000000000:
> +            /* scsi-unmapped grain - fallthrough */
> +        case 0x2000000000000000:
> +            /* zero grain */
> +            zeroed = true;
> +            break;
> +        case 0x3000000000000000:
> +            /* allocated grain */
> +            cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
> +                              ((cluster_sector & 0x0000ffffffffffff) << 12));

That’s just insane.  I trust you, though.
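(For my own sanity, I restated the shuffle as a standalone helper - my
reading, not the patch's code: it is effectively a 12-bit rotate, where
the low 48 bits become the high part of the sector number and the 12
bits just below the 0x3 type nibble become the low part:

```c
#include <stdint.h>

/* Decode an allocated (type 0x3) seSparse L2 entry into a sector
 * number, before sesparse_clusters_offset is applied.  The caller has
 * already checked the type nibble. */
static uint64_t sesparse_grain_sector(uint64_t entry)
{
    return ((entry & 0x0fff000000000000ull) >> 48) |
           ((entry & 0x0000ffffffffffffull) << 12);
}
```

So bit 0 of the stored value ends up at bit 12 of the sector number,
and the 12 bits under the nibble wrap around to the bottom.)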

> +            cluster_sector = extent->sesparse_clusters_offset +
> +                cluster_sector * extent->cluster_sectors;
> +            break;
> +        default:
> +            return VMDK_ERROR;
> +        }
> +    } else {
> +        cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
> +
> +        if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> +            zeroed = true;
> +        }
>      }
>  
>      if (!cluster_sector || zeroed) {
>          if (!allocate) {
>              return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>          }
> +        assert(!extent->sesparse);
>  
>          if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
>              return VMDK_ERROR;
> @@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>              m_data->l1_index = l1_index;
>              m_data->l2_index = l2_index;
>              m_data->l2_offset = l2_offset;
> -            m_data->l2_cache_entry = &l2_table[l2_index];
> +            m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
>          }
>      }
>      *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
> @@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
>          if (!extent) {
>              return -EIO;
>          }
> +        if (extent->sesparse) {
> +            return -ENOTSUP;
> +        }

([1] This should still stay here, even with bdrv_apply_auto_read_only()
above.)

>          offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
>          n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
>                               - offset_in_cluster);
> 

Overall, looks reasonable to me.  Then again, I have no clue of the
format and I trust you that it works.

Max


