* Regarding SESparse support in QEMU
@ 2020-03-08 16:52 Tirthankar Saha
  2020-03-09 17:25 ` Sam Eiderman
  0 siblings, 1 reply; 2+ messages in thread
From: Tirthankar Saha @ 2020-03-08 16:52 UTC (permalink / raw)
  To: sameid, mreitz, kwolf, qemu-block, qemu-devel


Hi Sam,

Can you please share any notes that you have regarding the structure of the
SESparse journal? This will help in adding "read-write" support for
SESparse snapshots.

Thanks,

Tirthankar



* Re: Regarding SESparse support in QEMU
  2020-03-08 16:52 Regarding SESparse support in QEMU Tirthankar Saha
@ 2020-03-09 17:25 ` Sam Eiderman
  0 siblings, 0 replies; 2+ messages in thread
From: Sam Eiderman @ 2020-03-09 17:25 UTC (permalink / raw)
  To: Tirthankar Saha; +Cc: Max Reitz, kwolf, qemu-block, qemu-devel

Hi,

This is regarding the new VMDK snapshot format, SESparse, which has
been available since ESXi 6.5 (for disks > 2 TB) and is the default
since ESXi 6.7 (for all disks).
Unlike the previous format (VMFSSparse), the SESparse format is not
publicly documented by VMware.

Even so, I believe the format itself is not too complicated, and I'll
try to give some pointers on how read-write support could be
implemented.

In commit 98eb9733f (vmdk: Add read-only support for seSparse
snapshots) I added read-only support. As stated in the commit message,
I did not implement the following features that VMware implements:
    * read-write
    * journal replay
    * space reclamation
    * unmap support

I don't fully understand what you're trying to implement when you say
"read-write", since in some scenarios you will not need a fully
compatible SESparse implementation.

----

After creating an SESparse snapshot (by invoking "Take Snapshot" in
VMware), the following file:

    /var/log/hostd.log

contains messages like the following:

[...] Const Header:
[...]  constMagic     = 0xcafebabe
[...]  version        = 2.1
[...]  capacity       = 204800
[...]  grainSize      = 8
[...]  grainTableSize = 64
[...]  flags          = 0
[...] Extents:
[...]  Header         : <1 : 1>
[...]  JournalHdr     : <2 : 2>
[...]  Journal        : <2048 : 2048>
[...]  GrainDirectory : <4096 : 2048>
[...]  GrainTables    : <6144 : 2048>
[...]  FreeBitmap     : <8192 : 2048>
[...]  BackMap        : <10240 : 2048>
[...]  Grain          : <12288 : 204800>
[...] Volatile Header:
[...] volatileMagic     = 0xcafecafe
[...] FreeGTNumber      = 0
[...] nextTxnSeqNumber  = 0
[...] replayJournal     = 0

The sizes seen in the log file are in sectors (512 bytes each).
Extents are given in the format <offset : size>.
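
For illustration, the constant header fields in that dump map
naturally onto a structure like the following. This is a sketch only:
the field names and 64-bit widths are my guesses, based on the log
output and on what QEMU's block/vmdk.c decodes, not on a VMware
specification.

    #include <stdint.h>

    /* Hypothetical on-disk layout matching the hostd.log dump above.
     * Offsets and sizes are in 512-byte sectors. */
    typedef struct SESparseConstHeader {
        uint64_t magic;            /* constMagic = 0xcafebabe */
        uint64_t version;          /* 2.1 in the log above */
        uint64_t capacity;         /* disk size, in sectors */
        uint64_t grain_size;       /* 8 sectors = 4 KB */
        uint64_t grain_table_size; /* 64 sectors per grain table */
        uint64_t flags;
        /* ... presumably followed by the <offset : size> pairs (in
         * sectors) for the Header, JournalHdr, Journal,
         * GrainDirectory, GrainTables, FreeBitmap, BackMap and Grain
         * extents ... */
    } SESparseConstHeader;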

Just from the strings in this message and from information publicly
disclosed by VMware, it is fairly clear how SESparse is implemented:

The problem with VMFSSparse (the previous format) is that, given the
offset of a grain in the file, you don't know whether that grain is
allocated; you must scan all grain tables to perform space
reclamation. And even if you somehow know that a grain is not
allocated, you don't have the offset of the grain table that
references it, so you cannot unmap it from there.
This was solved using two new structures: "FreeBitmap" and "BackMap".

FreeBitmap is simply a very large bitmap that tells you whether the
grain at the offset of the bit is allocated.
Meaning: if bit X is set to 1, then Grain[X] is allocated.
(Note that each grain is "grainSize" sectors, usually 4 KB.)
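
In code, that lookup could be a plain bit test. A sketch, assuming
the FreeBitmap extent has been read into memory and that bits are
LSB-first within each byte (the actual bit order would need to be
verified against a real image):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: is Grain[x] allocated? free_bitmap is the FreeBitmap
     * extent loaded into memory, one bit per grain. */
    static bool grain_allocated(const uint8_t *free_bitmap, uint64_t x)
    {
        return free_bitmap[x / 8] & (1u << (x % 8));
    }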

Now we can simply iterate through FreeBitmap to find grains we can
reclaim.

Problem: we don't know where the grain table that references a given
grain is, and scanning all grain tables to find it is slow.

Solution: BackMap.

For Grain[X], look at BackMap[X]: it contains an index into
GrainTables at which the referencing grain table resides (multiply by
the size of a grain table to get the real offset).
(Note that BackMap scales with the size of the VMDK, but at only 8
bytes per entry instead of 4 KB.)
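
Locating the referencing grain table could then look like this (a
sketch with names of my own; the extent offsets come from the const
header, and everything is in 512-byte sectors as in the log above):

    #include <stdint.h>

    /* Sketch: byte offset of the grain table referencing Grain[x].
     * backmap is the BackMap extent loaded into memory, 8 bytes per
     * grain; grain_tables_offset and grain_table_size are in
     * sectors. */
    static uint64_t grain_table_offset(const uint64_t *backmap,
                                       uint64_t x,
                                       uint64_t grain_tables_offset,
                                       uint64_t grain_table_size)
    {
        uint64_t gt_index = backmap[x];
        return (grain_tables_offset + gt_index * grain_table_size) * 512;
    }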

Problem: now that we also write to the BackMap and FreeBitmap
structures, a crash mid-update could corrupt the SESparse metadata.

Solution: Journal.

Write the intended updates ("write what where") to the journal, mark
the journal dirty, execute the instructions in the journal, then
clear the dirty mark. If you crash in the middle, you simply
re-execute the instructions when opening the VMDK file.
(Note this was not a problem in VMFSSparse, since we could first
write the data in the grain, then update the grain table to point to
the grain, then update the grain directory to point to the grain
table; the data only becomes visible on the last operation, so the
format structures are never corrupted.)
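
The replay on open would then conceptually be something like the
following. The record layout here is pure speculation (as said, the
journal format is undisclosed); I am only assuming simple "write what
where" records:

    #include <stdint.h>

    /* Speculative journal record: "write what where". The real
     * layout is not disclosed by VMware. */
    typedef struct JournalRecord {
        uint64_t file_offset; /* where to write, in bytes (assumed) */
        uint64_t length;      /* how many bytes follow (assumed)    */
        /* ... followed by 'length' bytes of data to write ... */
    } JournalRecord;

    /* On open: if replayJournal in the volatile header is non-zero,
     * walk the journal extent, apply each record to the file, sync,
     * then clear the flag. Re-executing records on a second crash
     * must be safe, which plain "write what where" records are. */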

In any case, now we need to implement the following:

When opening the file:
* Execute journal (not implemented)

When reading:
* Follow Grain Directory -> Grain Table -> Grain (this is implemented
now; a sketch follows)
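
A simplified sketch of that lookup (names are mine; note that in the
real format the directory and table entries also encode a type in
their topmost bits, which QEMU's read-only code handles and this
sketch glosses over by treating 0 as "unallocated"):

    #include <stdint.h>

    /* Sketch of the read path: sector -> grain directory -> grain
     * table -> grain. Returns the grain's offset in sectors, or 0 if
     * unallocated (the read then falls through to the backing
     * file). */
    static uint64_t sesparse_lookup(const uint64_t *grain_dir,
                                    const uint64_t *grain_tables,
                                    uint64_t grains_offset, /* sectors */
                                    uint64_t sector,
                                    uint64_t grain_size,    /* sectors */
                                    uint64_t gt_entries)    /* per table */
    {
        uint64_t grain_nr = sector / grain_size;
        uint64_t gd_entry = grain_dir[grain_nr / gt_entries];
        if (gd_entry == 0) {          /* grain table not allocated */
            return 0;
        }
        uint64_t gt_entry = grain_tables[gd_entry * gt_entries +
                                         grain_nr % gt_entries];
        if (gt_entry == 0) {          /* grain not allocated */
            return 0;
        }
        return grains_offset + gt_entry * grain_size;
    }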

For writing (not implemented; probably something like this - see the
skeleton after this list):
* Follow Grain Directory
  * If grain table allocated -> get it
  * Otherwise, allocate new grain table (this probably uses
FreeGTNumber in the volatile header to know where to do so)
    * Extend the file by the size needed to store all the grains in
the grain table (I think this is 16MB by default)
  * Follow Grain Table
    * If grain allocated, get it
    * Otherwise, allocate it
* Write to the grain (you may need to read it first if you write only
part of it; since grains are now 4 KB, it is possible to read/write
less than a full grain)
* Update BackMap and FreeBitmap accordingly.
* All of the above writes should go through the Journal for consistency.
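
Put together, the write path would be a skeleton along these lines
(speculative; this is how I would expect it to work, not a documented
algorithm - error handling and the actual journal writes omitted):

    #include <stdint.h>

    /* Speculative write-path skeleton for one grain-sized write. */
    static int sesparse_write_grain(uint64_t sector, const uint8_t *buf)
    {
        /* 1. Look up the grain table via the grain directory; if it
         *    is missing, allocate one (presumably using FreeGTNumber
         *    from the volatile header) and extend the file by room
         *    for all of its grains (16 MB by default, I believe). */
        /* 2. Look up the grain in the grain table; allocate it if
         *    missing. */
        /* 3. Read-modify-write the 4 KB grain if the write is
         *    partial. */
        /* 4. Update BackMap[grain] and set its FreeBitmap bit. */
        /* 5. Stage all metadata updates as journal records, mark the
         *    journal dirty, apply them, then clear the dirty mark. */
        (void)sector;
        (void)buf;
        return 0;
    }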

When unmapping:
* Update bits in FreeBitmap (1 -> 0)

When reclaiming space:
Go through FreeBitmap and decide whether cleaning should be performed
(you must have a completely empty grain table); if so, reclaim - a
sketch follows.
(You can only clean when a full grain table is unmapped; you copy
another grain table into its place, so you will have to update the
grains and the grain directory, and all operations should go through
the journal.)
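
Finding a reclaimable grain table could be a scan like this, building
on the grain_allocated() helper sketched earlier and assuming the
grains behind one grain table occupy a contiguous run of FreeBitmap
bits (an assumption that follows from the allocation scheme above):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: index of a grain table whose grains are all free, so
     * its grain region can be reclaimed; -1 if none. gt_entries is
     * the number of grains covered by one grain table. */
    static int64_t find_reclaimable_gt(const uint8_t *free_bitmap,
                                       uint64_t nr_grain_tables,
                                       uint64_t gt_entries)
    {
        for (uint64_t t = 0; t < nr_grain_tables; t++) {
            bool all_free = true;
            for (uint64_t g = 0; g < gt_entries; g++) {
                if (grain_allocated(free_bitmap, t * gt_entries + g)) {
                    all_free = false;
                    break;
                }
            }
            if (all_free) {
                return (int64_t)t;  /* candidate table to reclaim */
            }
        }
        return -1;
    }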

----

I'm not sure how the journal is implemented or what its structures
look like ("write what where"), but from what I understand the size
of the journal is constant (2048 sectors); it does not change with
the size of the disk. This makes sense: the journal is simply a
buffer of pending writes, and its size only affects how fast you can
write. (2048 sectors was probably enough for VMware.)

Note that your implementation matters: you can conform fully to the
SESparse format and still get poor performance. If you execute the
journal for every sector you write, performance will be bad, so you
will want to accumulate writes in the journal until it is full or
some time has passed, and only then flush it.
----

In the end it really depends on what you're trying to achieve.

If you want to have a "read-write" VM working on top of an SESparse
snapshot, you can always use a qcow2 overlay, for example (the file
names here are placeholders):
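
    qemu-img create -f qcow2 \
        -o backing_file=sesparse-snapshot.vmdk,backing_fmt=vmdk \
        overlay.qcow2

All writes then go to the qcow2 overlay while the SESparse file stays
read-only underneath.
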
If you want to support "read-only" but also handle a dirty journal,
you don't need to care much about performance; just focus on the
"write what where" format of the journal.
If you want to support "read-write" on top of SESparse and then
expect a VM on ESXi to run with it, you only need to understand the
format of the journal if you want to support a dirty journal on open,
or if your changes to the VMDK are not crash consistent (you will
have to implement BackMap and FreeBitmap though).

Hope this helps,
Sam


