From: Max Reitz <mreitz@redhat.com>
To: Sam Eiderman <shmuel.eiderman@oracle.com>,
	fam@euphon.net, kwolf@redhat.com, qemu-block@nongnu.org
Cc: eyal.moscovici@oracle.com, qemu-devel@nongnu.org,
	liran.alon@oracle.com, karl.heubaum@oracle.com,
	arbel.moshe@oracle.com
Subject: Re: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
Date: Mon, 27 May 2019 19:39:16 +0200	[thread overview]
Message-ID: <8a27479f-3948-df17-5abc-0c22811ae4b2@redhat.com> (raw)
In-Reply-To: <20190424074901.31430-3-shmuel.eiderman@oracle.com>


On 24.04.19 09:49, Sam Eiderman wrote:
> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
> QEMU).
> 
> This format was lacking in the following:
> 
>     * Grain directory (L1) and grain table (L2) entries were 32-bit,
>       allowing access to only 2TB (slightly less) of data.
>     * The grain size (default) was 512 bytes - leading to data
>       fragmentation and many grain tables.
>     * For space reclamation purposes, it was necessary to find all the
>       grains which are not pointed to by any grain table - so a reverse
>       mapping of "offset of grain in vmdk" to "grain table" must be
>       constructed - which takes large amounts of CPU/RAM.
> 
> The format specification can be found in VMware's documentation:
> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
> 
> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
> introduced: SESparse (Space Efficient).
> 
> This format fixes the above issues:
> 
>     * All entries are now 64-bit.
>     * The grain size (default) is 4KB.
>     * Grain directory and grain tables are now located at the beginning
>       of the file.
>       + seSparse format reserves space for all grain tables.
>       + Grain tables can be addressed using an index.
>       + Grains are located at the end of the file and can also be
>         addressed with an index.
>       - seSparse vmdks of large disks (64TB) have huge preallocated
>         headers - mainly due to L2 tables, even for empty snapshots.
>     * The header contains a reverse mapping ("backmap") of "offset of
>       grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>       specifies for each grain - whether it is allocated or not.
>       Using these data structures we can implement space reclamation
>       efficiently.
>     * Because the header now maintains two mappings:
>         * The regular one (grain directory & grain tables)
>         * A reverse one (backmap and free bitmap)
>       these data structures can lose consistency upon a crash and
>       result in a corrupted VMDK.
>       Therefore, a journal is also added to the VMDK and is replayed
>       when VMware reopens the file after a crash.
> 
> Since ESXi 6.7, SESparse is the only snapshot format available.
> 
> Unfortunately, VMware does not provide documentation regarding the new
> seSparse format.
> 
> This commit is based on black-box research of the seSparse format.
> Various in-guest block operations and their effect on the snapshot file
> were tested.
> 
> The only VMware provided source of information (regarding the underlying
> implementation) was a log file on the ESXi:
> 
>     /var/log/hostd.log
> 
> Whenever an seSparse snapshot is created, the log is populated with
> seSparse records.
> 
> Relevant log records are of the form:
> 
> [...] Const Header:
> [...]  constMagic     = 0xcafebabe
> [...]  version        = 2.1
> [...]  capacity       = 204800
> [...]  grainSize      = 8
> [...]  grainTableSize = 64
> [...]  flags          = 0
> [...] Extents:
> [...]  Header         : <1 : 1>
> [...]  JournalHdr     : <2 : 2>
> [...]  Journal        : <2048 : 2048>
> [...]  GrainDirectory : <4096 : 2048>
> [...]  GrainTables    : <6144 : 2048>
> [...]  FreeBitmap     : <8192 : 2048>
> [...]  BackMap        : <10240 : 2048>
> [...]  Grain          : <12288 : 204800>
> [...] Volatile Header:
> [...] volatileMagic     = 0xcafecafe
> [...] FreeGTNumber      = 0
> [...] nextTxnSeqNumber  = 0
> [...] replayJournal     = 0
> 
> The sizes that are seen in the log file are in sectors.
> Extents are of the following format: <offset : size>
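
For concreteness, the `<offset : size>` pairs above translate to byte
ranges by multiplying with VMDK's 512-byte sector (a small illustrative
sketch; the struct and names are mine, not from the patch):

```c
#include <stdint.h>

#define VMDK_SECTOR_SIZE 512ULL

/* One extent record from hostd.log: <offset : size>, both in sectors */
typedef struct {
    uint64_t offset_sectors;
    uint64_t size_sectors;
} LogExtent;

/* Byte offset of the extent's first byte within the .vmdk file */
static uint64_t extent_start_bytes(LogExtent e)
{
    return e.offset_sectors * VMDK_SECTOR_SIZE;
}

/* Byte length of the extent */
static uint64_t extent_len_bytes(LogExtent e)
{
    return e.size_sectors * VMDK_SECTOR_SIZE;
}
```

E.g. the GrainDirectory extent <4096 : 2048> above starts at byte
2097152 (2 MiB) and is 1 MiB long.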
> 
> This commit is a strict implementation which enforces:
>     * magics
>     * version number 2.1
>     * grain size of 8 sectors  (4KB)
>     * grain table size of 64 sectors
>     * zero flags
>     * extent locations
> 
> Additionally, this commit provides only a subset of the functionality
> offered by seSparse's format:
>     * Read-only
>     * No journal replay
>     * No space reclamation
>     * No unmap support
> 
> Hence, journal header, journal, free bitmap and backmap extents are
> unused, only the "classic" (L1 -> L2 -> data) grain access is
> implemented.
> 
> However, there are several differences in the grain access itself.
> Grain directory (L1):
>     * Grain directory entries are indexes (not offsets) to grain
>       tables.
>     * Valid grain directory entries have their highest nibble set to
>       0x1.
>     * Since grain tables are always located at the beginning of the
>       file - the index can fit into 32 bits - so we can use its low
>       part if it's valid.
> Grain table (L2):
>     * Grain table entries are indexes (not offsets) to grains.
>     * If the highest nibble of the entry is:
>         0x0:
>             The grain is not allocated.
>             The rest of the bytes are 0.
>         0x1:
>             The grain is unmapped - guest sees a zero grain.
>             The rest of the bits point to the previously mapped grain,
>             see 0x3 case.
>         0x2:
>             The grain is zero.
>         0x3:
>             The grain is allocated - to get the index calculate:
>             ((entry & 0x0fff000000000000) >> 48) |
>             ((entry & 0x0000ffffffffffff) << 12)
>     * The difference between 0x1 and 0x2 is that 0x1 is an unmapped
>       grain which results from the guest using sg_unmap to unmap the
>       grain - but the grain itself still exists in the grain extent - a
>       space reclamation procedure should delete it.
>       Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>       but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
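
The grain-access rules above can be summarized in a small decoder (an
illustrative sketch based purely on this commit message, not code from
the patch; all names are invented):

```c
#include <assert.h>
#include <stdint.h>

/* Decoded state of an seSparse grain table (L2) entry */
typedef enum {
    GRAIN_UNALLOCATED, /* 0x0: never written */
    GRAIN_UNMAPPED,    /* 0x1: unmapped by the guest, reads as zero */
    GRAIN_ZERO,        /* 0x2: zero grain */
    GRAIN_ALLOCATED,   /* 0x3: encodes a grain index */
    GRAIN_INVALID      /* anything else */
} GrainState;

static GrainState se_sparse_l2_state(uint64_t entry)
{
    switch (entry >> 60) { /* highest nibble */
    case 0x0:
        /* the rest of the bytes must be 0 */
        return entry == 0 ? GRAIN_UNALLOCATED : GRAIN_INVALID;
    case 0x1:
        return GRAIN_UNMAPPED;
    case 0x2:
        return GRAIN_ZERO;
    case 0x3:
        return GRAIN_ALLOCATED;
    default:
        return GRAIN_INVALID;
    }
}

/* Grain index of an allocated (0x3) entry, per the formula above */
static uint64_t se_sparse_grain_index(uint64_t entry)
{
    assert(se_sparse_l2_state(entry) == GRAIN_ALLOCATED);
    return ((entry & 0x0fff000000000000ULL) >> 48) |
           ((entry & 0x0000ffffffffffffULL) << 12);
}
```

So the low 12 bits of the index live in bits 48-59 of the entry, and the
remaining bits in bits 0-47.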
> 
> In order to implement seSparse some fields had to be changed to support
> both 32-bit and 64-bit entry sizes.
> 
> Read-only is implemented by failing on pwrite since qemu-img opens the
> vmdk with read-write flags for "convert" (which is a read-only
> operation).

Does it?  The code doesn’t look like it (in img_convert(), src_flags is
never set to anything with BDRV_O_RDWR).

In theory, we have bdrv_apply_auto_read_only() for this case.  More on
that below. [1]

> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
> Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
> Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
> Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
> ---
>  block/vmdk.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 459 insertions(+), 16 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index fc7378da78..e599c08b95 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c

First, a general observation: BDRV_SECTOR_SIZE is indeed equal to VMDK’s
sector size (both are 512), but I’d consider that a coincidence.  This
file defines a SECTOR_SIZE macro.  I think you should use that instead
of BDRV_SECTOR_SIZE whenever referring to VMDK’s sector size.

(BDRV_SECTOR_SIZE is actually the sector size the qemu block layer uses.
 Yes, I know, there are plenty of places in vmdk.c that use it.  I still
think it isn’t entirely correct to use it.)

> @@ -91,6 +91,44 @@ typedef struct {
>      uint16_t compressAlgorithm;
>  } QEMU_PACKED VMDK4Header;
>  
> +typedef struct VMDKSESparseConstHeader {
> +    uint64_t magic;
> +    uint64_t version;
> +    uint64_t capacity;
> +    uint64_t grain_size;
> +    uint64_t grain_table_size;
> +    uint64_t flags;
> +    uint64_t reserved1;
> +    uint64_t reserved2;
> +    uint64_t reserved3;
> +    uint64_t reserved4;
> +    uint64_t volatile_header_offset;
> +    uint64_t volatile_header_size;
> +    uint64_t journal_header_offset;
> +    uint64_t journal_header_size;
> +    uint64_t journal_offset;
> +    uint64_t journal_size;
> +    uint64_t grain_dir_offset;
> +    uint64_t grain_dir_size;
> +    uint64_t grain_tables_offset;
> +    uint64_t grain_tables_size;
> +    uint64_t free_bitmap_offset;
> +    uint64_t free_bitmap_size;
> +    uint64_t backmap_offset;
> +    uint64_t backmap_size;
> +    uint64_t grains_offset;
> +    uint64_t grains_size;
> +    uint8_t pad[304];
> +} QEMU_PACKED VMDKSESparseConstHeader;
> +
> +typedef struct VMDKSESparseVolatileHeader {

Is this the official name?  (I know you don’t have a specification, but
maybe you do have some half-official information, I don’t know.)
Generally, I’d call the opposite of “const” “mutable”.

> +    uint64_t magic;
> +    uint64_t free_gt_number;
> +    uint64_t next_txn_seq_number;
> +    uint64_t replay_journal;
> +    uint8_t pad[480];
> +} QEMU_PACKED VMDKSESparseVolatileHeader;
> +
>  #define L2_CACHE_SIZE 16
>  
>  typedef struct VmdkExtent {
> @@ -99,19 +137,23 @@ typedef struct VmdkExtent {
>      bool compressed;
>      bool has_marker;
>      bool has_zero_grain;
> +    bool sesparse;
> +    uint64_t sesparse_l2_tables_offset;
> +    uint64_t sesparse_clusters_offset;
> +    int32_t entry_size;
>      int version;
>      int64_t sectors;
>      int64_t end_sector;
>      int64_t flat_start_offset;
>      int64_t l1_table_offset;
>      int64_t l1_backup_table_offset;
> -    uint32_t *l1_table;
> +    void *l1_table;
>      uint32_t *l1_backup_table;
>      unsigned int l1_size;
>      uint32_t l1_entry_sectors;
>  
>      unsigned int l2_size;
> -    uint32_t *l2_cache;
> +    void *l2_cache;
>      uint32_t l2_cache_offsets[L2_CACHE_SIZE];
>      uint32_t l2_cache_counts[L2_CACHE_SIZE];
>  
> @@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
>           *            cluster size: 64KB, L2 table size: 512 entries
>           *     1PB  - for default "ESXi Host Sparse Extent" (VMDK3/vmfsSparse)
>           *            cluster size: 512B, L2 table size: 4096 entries
> +         *     8PB  - for default "ESXI seSparse Extent"
> +         *            cluster size: 4KB,  L2 table size: 4096 entries
>           */
>          error_setg(errp, "L1 size too big");
>          return -EFBIG;

Hm, now I wonder about this limit.  With this new format, it’d be 512M *
8 = 4 GB of RAM usage.  That seems an awful lot, and I don’t think we
really need to support 8 PB images.  Like, if we want to prevent
unbounded allocations, 4 GB is far above the limit of what I’d consider
sane.  (And I don’t know too much about vmdk, but can’t a user also
specify multiple extents with a single descriptor file, and thus make
qemu allocate these 4 GB multiple times?)

Some comment in this file says the maximum supported size is 64 TB
anyway.  Shouldn’t we just limit the L1 size accordingly, then?
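
For reference, with the default seSparse geometry (4 KiB grains,
4096-entry grain tables, so 16 MiB covered per grain table), the L1
footprint works out as follows (back-of-the-envelope sketch, not QEMU
code; all names are mine):

```c
#include <stdint.h>

/* Default seSparse geometry */
#define SE_SECTOR      512ULL
#define SE_GRAIN       (8 * SE_SECTOR)                      /* 4 KiB */
#define SE_GT_ENTRIES  (64 * SE_SECTOR / sizeof(uint64_t))  /* 4096 */
#define SE_GT_COVERAGE (SE_GT_ENTRIES * SE_GRAIN)           /* 16 MiB */

/* L1 (grain directory) entries needed for a given disk size */
static uint64_t se_l1_entries(uint64_t disk_bytes)
{
    return (disk_bytes + SE_GT_COVERAGE - 1) / SE_GT_COVERAGE;
}

/* ... and the RAM the in-memory L1 table costs (8 bytes per entry) */
static uint64_t se_l1_bytes(uint64_t disk_bytes)
{
    return se_l1_entries(disk_bytes) * sizeof(uint64_t);
}
```

A 64 TB cap would bound the L1 table at 32 MiB of RAM, versus the 4 GiB
that the 8 PB limit allows.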

> @@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
>      extent->l2_size = l2_size;
>      extent->cluster_sectors = flat ? sectors : cluster_sectors;
>      extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
> +    extent->entry_size = sizeof(uint32_t);
>  
>      if (s->num_extents > 1) {
>          extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
> @@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>      int i;
>  
>      /* read the L1 table */
> -    l1_size = extent->l1_size * sizeof(uint32_t);
> +    l1_size = extent->l1_size * extent->entry_size;

Related to the above: This can overflow.  (512M * 8 == 4G)
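
The wrap-around is easy to demonstrate in isolation (a hypothetical
helper mirroring the multiplication in question, not the patch's actual
code):

```c
#include <stdint.h>

/* Mirrors `l1_size = extent->l1_size * extent->entry_size` when the
 * result is kept in a 32-bit value: 512 Mi entries * 8 wraps to 0. */
static uint32_t l1_bytes_narrow(uint32_t l1_entries, int32_t entry_size)
{
    return l1_entries * (uint32_t)entry_size;  /* 32-bit wrap-around */
}

/* Widening before the multiplication (or capping l1_size much earlier,
 * as suggested above) avoids the overflow. */
static uint64_t l1_bytes_wide(uint32_t l1_entries, int32_t entry_size)
{
    return (uint64_t)l1_entries * (uint64_t)entry_size;
}
```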

>      extent->l1_table = g_try_malloc(l1_size);
>      if (l1_size && extent->l1_table == NULL) {
>          return -ENOMEM;
> @@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>          goto fail_l1;
>      }
>      for (i = 0; i < extent->l1_size; i++) {
> -        le32_to_cpus(&extent->l1_table[i]);
> +        if (extent->sesparse) {

Shouldn’t matter in practice, but I think evaluating extent->entry_size
would be more natural.

> +            le64_to_cpus((uint64_t *)extent->l1_table + i);
> +        } else {
> +            le32_to_cpus((uint32_t *)extent->l1_table + i);
> +        }
>      }
>  
>      if (extent->l1_backup_table_offset) {
> +        assert(!extent->sesparse);
>          extent->l1_backup_table = g_try_malloc(l1_size);
>          if (l1_size && extent->l1_backup_table == NULL) {
>              ret = -ENOMEM;
> @@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
>      }
>  
>      extent->l2_cache =
> -        g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
> +        g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
>      return 0;
>   fail_l1b:
>      g_free(extent->l1_backup_table);
> @@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
>      return ret;
>  }
>  
> +#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
> +#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe

The upper 32 bits are 0, so it doesn’t really matter, but still, if these
are 64-bit constants (which they are), they should have a 64-bit type
(e.g. UINT64_C(0x00000000cafebabe)).  (According to my very personal
taste at least.)

> +
> +static const char zero_pad[BDRV_SECTOR_SIZE];
> +
> +/* Strict checks - format not officially documented */
> +static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
> +                                        Error **errp)
> +{
> +    uint64_t grain_table_coverage;
> +    uint64_t needed_grain_tables;
> +    uint64_t needed_grain_dir_size;
> +    uint64_t needed_grain_tables_size;
> +    uint64_t needed_free_bitmap_size;
> +
> +    header->magic = le64_to_cpu(header->magic);
> +    header->version = le64_to_cpu(header->version);
> +    header->grain_size = le64_to_cpu(header->grain_size);
> +    header->grain_table_size = le64_to_cpu(header->grain_table_size);
> +    header->flags = le64_to_cpu(header->flags);
> +    header->reserved1 = le64_to_cpu(header->reserved1);
> +    header->reserved2 = le64_to_cpu(header->reserved2);
> +    header->reserved3 = le64_to_cpu(header->reserved3);
> +    header->reserved4 = le64_to_cpu(header->reserved4);
> +
> +    header->volatile_header_offset =
> +        le64_to_cpu(header->volatile_header_offset);
> +    header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
> +
> +    header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
> +    header->journal_header_size = le64_to_cpu(header->journal_header_size);
> +
> +    header->journal_offset = le64_to_cpu(header->journal_offset);
> +    header->journal_size = le64_to_cpu(header->journal_size);
> +
> +    header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
> +    header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
> +
> +    header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
> +    header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
> +
> +    header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
> +    header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
> +
> +    header->backmap_offset = le64_to_cpu(header->backmap_offset);
> +    header->backmap_size = le64_to_cpu(header->backmap_size);
> +
> +    header->grains_offset = le64_to_cpu(header->grains_offset);
> +    header->grains_size = le64_to_cpu(header->grains_size);
> +
> +    if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
> +        error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
> +                   header->magic);
> +        return -EINVAL;
> +    }
> +
> +    if (header->version != 0x0000000200000001) {
> +        error_setg(errp, "Unsupported version: 0x%016" PRIx64,
> +                   header->version);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_size != 8) {
> +        error_setg(errp, "Unsupported grain size: %" PRIu64,
> +                   header->grain_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_table_size != 64) {
> +        error_setg(errp, "Unsupported grain table size: %" PRIu64,
> +                   header->grain_table_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->flags != 0) {
> +        error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
> +                   header->flags);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->reserved1 != 0 || header->reserved2 != 0 ||
> +        header->reserved3 != 0 || header->reserved4 != 0) {
> +        error_setg(errp, "Unsupported reserved bits:"
> +                   " 0x%016" PRIx64 " 0x%016" PRIx64
> +                   " 0x%016" PRIx64 " 0x%016" PRIx64,
> +                   header->reserved1, header->reserved2,
> +                   header->reserved3, header->reserved4);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->volatile_header_offset != 1) {
> +        error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
> +                   header->volatile_header_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->volatile_header_size != 1) {
> +        error_setg(errp, "Unsupported volatile header size: %" PRIu64,
> +                   header->volatile_header_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_header_offset != 2) {
> +        error_setg(errp, "Unsupported journal header offset: %" PRIu64,
> +                   header->journal_header_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_header_size != 2) {
> +        error_setg(errp, "Unsupported journal header size: %" PRIu64,
> +                   header->journal_header_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_offset != 2048) {
> +        error_setg(errp, "Unsupported journal offset: %" PRIu64,
> +                   header->journal_offset);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->journal_size != 2048) {
> +        error_setg(errp, "Unsupported journal size: %" PRIu64,
> +                   header->journal_size);
> +        return -ENOTSUP;
> +    }
> +
> +    if (header->grain_dir_offset != 4096) {
> +        error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
> +                   header->grain_dir_offset);
> +        return -ENOTSUP;
> +    }

Do these values (metadata structure offsets and sizes) really matter?

> +    /* in sectors */
> +    grain_table_coverage = ((header->grain_table_size *

Hm, if header->grain_table_size is measured in sectors, it’d probably be
better to rename it to grain_table_sectors or something.

(“size” is usually in bytes; sometimes it’s the number of entries, but
that’s already kind of weird, I think)

> +                             BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
> +                            header->grain_size);
> +    needed_grain_tables = header->capacity / grain_table_coverage;
> +    needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
> +                                         BDRV_SECTOR_SIZE);
> +    needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
> +
> +    if (header->grain_dir_size != needed_grain_dir_size) {

Wouldn’t it suffice to check that header->grain_dir_size >=
needed_grain_dir_size?  (The ROUND_UP() would become unnecessary, then.)

(And also, maybe s/grain_dir_size/grain_dir_sectors/)

> +        error_setg(errp, "Invalid grain directory size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grain_dir_size, needed_grain_dir_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->grain_tables_offset !=
> +        header->grain_dir_offset + header->grain_dir_size) {
> +        error_setg(errp, "Grain directory must be followed by grain tables");
> +        return -EINVAL;
> +    }
> +
> +    needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
> +    needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
> +
> +    if (header->grain_tables_size != needed_grain_tables_size) {
> +        error_setg(errp, "Invalid grain tables size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grain_tables_size, needed_grain_tables_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->free_bitmap_offset !=
> +        header->grain_tables_offset + header->grain_tables_size) {
> +        error_setg(errp, "Grain tables must be followed by free bitmap");
> +        return -EINVAL;
> +    }
> +
> +    /* in bits */
> +    needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
> +                                           header->grain_size);
> +    /* in bytes */
> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> +                                           BITS_PER_BYTE);
> +    /* in sectors */
> +    needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> +                                           BDRV_SECTOR_SIZE);
> +    needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);

Er, now this is really confusing.  I’d start with the size in bytes:

> needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
>                                        header->grain_size * BITS_PER_BYTE);

and then translate it to sectors, but use a new variable for it
(needed_free_bitmap_sectors).
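
In other words, something along these lines (a sketch of the suggested
computation; the macro and function names are invented, and the
2048-sector alignment is carried over from the patch's strict checks):

```c
#include <stdint.h>

#define SE_SECTOR_SIZE 512ULL
#define BITS_PER_BYTE_ 8ULL

#define DIV_ROUND_UP_(n, d) (((n) + (d) - 1) / (d))
#define ROUND_UP_(n, d)     (DIV_ROUND_UP_(n, d) * (d))

/* capacity and grain_size in sectors, as in the const header */
static uint64_t free_bitmap_sectors(uint64_t capacity, uint64_t grain_size)
{
    /* one bit per grain, rounded up to whole bytes */
    uint64_t bytes = DIV_ROUND_UP_(capacity, grain_size * BITS_PER_BYTE_);

    /* then to sectors, padded to a 2048-sector (1 MiB) boundary */
    return ROUND_UP_(DIV_ROUND_UP_(bytes, SE_SECTOR_SIZE), 2048);
}
```

With the hostd.log sample above (capacity 204800, grain size 8) this
yields 2048 sectors, matching the FreeBitmap extent <8192 : 2048>.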

> +
> +    if (header->free_bitmap_size != needed_free_bitmap_size) {

Again, just checking that header->free_bitmap_size >=
needed_free_bitmap_size should be enough, I think.

> +        error_setg(errp, "Invalid free bitmap size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->free_bitmap_size, needed_free_bitmap_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->backmap_offset !=
> +        header->free_bitmap_offset + header->free_bitmap_size) {
> +        error_setg(errp, "Free bitmap must be followed by backmap");
> +        return -EINVAL;
> +    }
> +
> +    if (header->backmap_size != header->grain_tables_size) {
> +        error_setg(errp, "Invalid backmap size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->backmap_size, header->grain_tables_size);
> +        return -EINVAL;
> +    }
> +
> +    if (header->grains_offset !=
> +        header->backmap_offset + header->backmap_size) {
> +        error_setg(errp, "Backmap must be followed by grains");
> +        return -EINVAL;
> +    }
> +
> +    if (header->grains_size != header->capacity) {
> +        error_setg(errp, "Invalid grains size: %" PRIu64
> +                   ", needed: %" PRIu64,
> +                   header->grains_size, header->capacity);
> +        return -EINVAL;
> +    }
> +
> +    /* check that padding is 0 */
> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {

buffer_is_zero() should be sufficient.

> +        error_setg(errp, "Unsupported non-zero const header padding");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
> +                                           Error **errp)
> +{
> +    header->magic = le64_to_cpu(header->magic);
> +    header->free_gt_number = le64_to_cpu(header->free_gt_number);
> +    header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
> +    header->replay_journal = le64_to_cpu(header->replay_journal);
> +
> +    if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
> +        error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
> +                   header->magic);
> +        return -EINVAL;
> +    }
> +
> +    if (header->replay_journal) {
> +        error_setg(errp, "Replaying journal not supported");

Hmmm, maybe "Image is dirty, replaying journal not supported"?

> +        return -ENOTSUP;
> +    }
> +
> +    /* check that padding is 0 */
> +    if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
> +        error_setg(errp, "Unsupported non-zero volatile header padding");
> +        return -ENOTSUP;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vmdk_open_se_sparse(BlockDriverState *bs,
> +                               BdrvChild *file,
> +                               int flags, Error **errp)
> +{
> +    int ret;
> +    VMDKSESparseConstHeader const_header;
> +    VMDKSESparseVolatileHeader volatile_header;
> +    VmdkExtent *extent;
> +
> +    assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
> +

[1] I think if you put a

ret = bdrv_apply_auto_read_only(bs,
    "No write support for seSparse images available", errp);
if (ret < 0) {
    return ret;
}

here, that should do the trick.

> +    ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
> +    if (ret < 0) {
> +        bdrv_refresh_filename(file->bs);
> +        error_setg_errno(errp, -ret,
> +                         "Could not read const header from file '%s'",
> +                         file->bs->filename);
> +        return ret;
> +    }
> +
> +    /* check const header */
> +    ret = check_se_sparse_const_header(&const_header, errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
> +
> +    ret = bdrv_pread(file,
> +                     const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
> +                     &volatile_header, sizeof(volatile_header));
> +    if (ret < 0) {
> +        bdrv_refresh_filename(file->bs);
> +        error_setg_errno(errp, -ret,
> +                         "Could not read volatile header from file '%s'",
> +                         file->bs->filename);
> +        return ret;
> +    }
> +
> +    /* check volatile header */
> +    ret = check_se_sparse_volatile_header(&volatile_header, errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    ret = vmdk_add_extent(bs, file, false,
> +                          const_header.capacity,
> +                          const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
> +                          0,
> +                          const_header.grain_dir_size *
> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
> +                          const_header.grain_table_size *
> +                          BDRV_SECTOR_SIZE / sizeof(uint64_t),
> +                          const_header.grain_size,
> +                          &extent,
> +                          errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +
> +    extent->sesparse = true;
> +    extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
> +    extent->sesparse_clusters_offset = const_header.grains_offset;
> +    extent->entry_size = sizeof(uint64_t);
> +
> +    ret = vmdk_init_tables(bs, extent, errp);
> +    if (ret) {
> +        /* free extent allocated by vmdk_add_extent */
> +        vmdk_free_last_extent(bs);
> +    }
> +
> +    return ret;
> +}
> +
>  static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>                                 QDict *options, Error **errp);
>  
> @@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>           * RW [size in sectors] SPARSE "file-name.vmdk"
>           * RW [size in sectors] VMFS "file-name.vmdk"
>           * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
> +         * RW [size in sectors] SESPARSE "file-name.vmdk"
>           */
>          flat_offset = -1;
>          matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
> @@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>  
>          if (sectors <= 0 ||
>              (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
> -             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
> +             strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
> +             strcmp(type, "SESPARSE")) ||
>              (strcmp(access, "RW"))) {
>              continue;
>          }
> @@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>                  return ret;
>              }
>              extent = &s->extents[s->num_extents - 1];
> +        } else if (!strcmp(type, "SESPARSE")) {
> +            ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
> +            if (ret) {
> +                bdrv_unref_child(bs, extent_file);
> +                return ret;
> +            }
> +            extent = &s->extents[s->num_extents - 1];
>          } else {
>              error_setg(errp, "Unsupported extent type '%s'", type);
>              bdrv_unref_child(bs, extent_file);
> @@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
>      if (strcmp(ct, "monolithicFlat") &&
>          strcmp(ct, "vmfs") &&
>          strcmp(ct, "vmfsSparse") &&
> +        strcmp(ct, "seSparse") &&
>          strcmp(ct, "twoGbMaxExtentSparse") &&
>          strcmp(ct, "twoGbMaxExtentFlat")) {
>          error_setg(errp, "Unsupported image type '%s'", ct);
> @@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
>  {
>      unsigned int l1_index, l2_offset, l2_index;
>      int min_index, i, j;
> -    uint32_t min_count, *l2_table;
> +    uint32_t min_count;
> +    void *l2_table;
>      bool zeroed = false;
>      int64_t ret;
>      int64_t cluster_sector;
> +    unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
>  
>      if (m_data) {
>          m_data->valid = 0;
> @@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
>      if (l1_index >= extent->l1_size) {
>          return VMDK_ERROR;
>      }
> -    l2_offset = extent->l1_table[l1_index];
> +    if (extent->sesparse) {

Maybe add assertions that extent->entry_size == 8 here and... [2]

> +        uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
> +        if (l2_offset_u64 == 0) {
> +            l2_offset = 0;
> +        } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
> +            /*
> +             * Top most nibble is 0x1 if grain table is allocated.
> +             * strict check - top most 4 bytes must be 0x10000000 since max
> +             * supported size is 64TB for disk - so no more than 64TB / 16MB
> +             * grain tables, which is smaller than uint32,
> +             * where 16MB is the only supported default grain table coverage.
> +             */
> +            return VMDK_ERROR;
> +        } else {
> +            l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
> +            l2_offset_u64 = extent->sesparse_l2_tables_offset +
> +                l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
> +            if (l2_offset_u64 > 0x00000000ffffffff) {
> +                return VMDK_ERROR;
> +            }
> +            l2_offset = (unsigned int)(l2_offset_u64);
> +        }
> +    } else {

[2] ...that extent->entry_size == 4 here?

> +        l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
> +    }
>      if (!l2_offset) {
>          return VMDK_UNALLOC;
>      }
> @@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>                      extent->l2_cache_counts[j] >>= 1;
>                  }
>              }
> -            l2_table = extent->l2_cache + (i * extent->l2_size);
> +            l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
>              goto found;
>          }
>      }
> @@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
>              min_index = i;
>          }
>      }
> -    l2_table = extent->l2_cache + (min_index * extent->l2_size);
> +    l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
>      BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
>      if (bdrv_pread(extent->file,
>                  (int64_t)l2_offset * 512,
>                  l2_table,
> -                extent->l2_size * sizeof(uint32_t)
> -            ) != extent->l2_size * sizeof(uint32_t)) {
> +                l2_size_bytes
> +            ) != l2_size_bytes) {
>          return VMDK_ERROR;
>      }
>  
> @@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
>      extent->l2_cache_counts[min_index] = 1;
>   found:
>      l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
> -    cluster_sector = le32_to_cpu(l2_table[l2_index]);
>  
> -    if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> -        zeroed = true;
> +    if (extent->sesparse) {
> +        cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
> +        switch (cluster_sector & 0xf000000000000000) {
> +        case 0x0000000000000000:
> +            /* unallocated grain */
> +            if (cluster_sector != 0) {
> +                return VMDK_ERROR;
> +            }
> +            break;
> +        case 0x1000000000000000:
> +            /* scsi-unmapped grain - fallthrough */
> +        case 0x2000000000000000:
> +            /* zero grain */
> +            zeroed = true;
> +            break;
> +        case 0x3000000000000000:
> +            /* allocated grain */
> +            cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
> +                              ((cluster_sector & 0x0000ffffffffffff) << 12));

That’s just insane.  I trust you, though.
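(For my own sanity, I restated the shuffle as a standalone helper - my
reading, not the patch's code: it is effectively a 12-bit rotate, where
the low 48 bits become the high part of the sector number and the 12
bits just below the 0x3 type nibble become the low part:

```c
#include <stdint.h>

/* Decode an allocated (type 0x3) seSparse L2 entry into a sector
 * number, before sesparse_clusters_offset is applied.  The caller has
 * already checked the type nibble. */
static uint64_t sesparse_grain_sector(uint64_t entry)
{
    return ((entry & 0x0fff000000000000ull) >> 48) |
           ((entry & 0x0000ffffffffffffull) << 12);
}
```

So bit 0 of the stored value ends up at bit 12 of the sector number,
and the 12 bits under the nibble wrap around to the bottom.)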

> +            cluster_sector = extent->sesparse_clusters_offset +
> +                cluster_sector * extent->cluster_sectors;
> +            break;
> +        default:
> +            return VMDK_ERROR;
> +        }
> +    } else {
> +        cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
> +
> +        if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> +            zeroed = true;
> +        }
>      }
>  
>      if (!cluster_sector || zeroed) {
>          if (!allocate) {
>              return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
>          }
> +        assert(!extent->sesparse);
>  
>          if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
>              return VMDK_ERROR;
> @@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
>              m_data->l1_index = l1_index;
>              m_data->l2_index = l2_index;
>              m_data->l2_offset = l2_offset;
> -            m_data->l2_cache_entry = &l2_table[l2_index];
> +            m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
>          }
>      }
>      *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
> @@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
>          if (!extent) {
>              return -EIO;
>          }
> +        if (extent->sesparse) {
> +            return -ENOTSUP;
> +        }

([1] This should still stay here, even with bdrv_apply_auto_read_only()
above.)

>          offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
>          n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
>                               - offset_in_cluster);
> 

Overall, looks reasonable to me.  Then again, I have no clue of the
format and I trust you that it works.

Max


