Re: call for slideware ;)

From: Heinz Mauelshagen <heinzm@redhat.com>
To: Mike Snitzer <snitzer@redhat.com>
Cc: Joe Thornber <thornber@redhat.com>
Subject: Re: call for slideware ;)
Date: Wed, 23 Feb 2011 13:18:48 +0100
Message-ID: <1298463528.2965.19.camel@o>
In-Reply-To: <20110223012159.GA13983@redhat.com>

On Tue, 2011-02-22 at 20:22 -0500, Mike Snitzer wrote:
> On Thu, Feb 10 2011 at  9:59am -0500,
> Joe Thornber <thornber@redhat.com> wrote:
> 
> > Hi Mike,
> > 
> > On Wed, 2011-02-09 at 18:16 -0500, Mike Snitzer wrote:
> > > Joe and/or Heinz,
> > > 
> > > Could you provide a few slides on the thinp and shared snapshot
> > > infrastructure and targets?  Planned features and performance
> > > benefits,
> > > etc.
> > > 
> > > 
> > 
> > I've started a new project on GitHub:
> > 
> > https://github.com/jthornber/storage-papers
> > 
> > Heinz and I have started putting stuff in there.
> 
> I just had a look at the latest content and have some questions (way
> more than I'd imagine you'd like to see.. means I'm clearly missing a
> lot):
> 
> 1) from "Solution" slide:
>    "Space comes from a preallocated ‘pool’, which is itself just another
>    logical volume, thus can be resized on demand."
>    ...
>    "Separate metadata device simplifies extension, this is hidden by the
>     LVM system so sys admin unlikely to be aware of it."
>     Q: Can you elaborate on the role of the metadata?  It maps between
>        physical "area" (allocated from pool) for all writes to the
>        logical address space?

Yes.

>     Q: can thinp and snapshot metadata coexist in the same pool? -- ask
>        similar question below.

Theoretically yes, if they were able to share the same block size.
Practically no, because thinp (and hsm) will use rather large block
sizes whereas snapshots use small ones.

> 
> 2) from "Block size choice" slide:
>    The larger the block size:
>    - the less chance there is of fragmentation (describe this)
>      Q: can you please "describe this"? :)

With large blocks, the thinp user (e.g. a filesystem) is less likely to
fragment its metadata and data.

>    - the less frequently we need the expensive mapping operation

>      Q: "expensive" is all relative, seems you contradict the expense of
>         the mapping operation in the "Performance" slide?

No, fewer lookups reduce the memory footprint needed to keep the btree in
memory and save CPU cycles in general, so larger block sizes help here.

>    - the smaller the metadata tables are, so more of them can be held in core
>      at a time. Leading to faster access to the provisioned blocks by
>      minimizing reading in mapping information
>      Q: "more of them" -- "them" being metadata tables?  So the take
>         away is more thinp devices available on the same host?

Not more thinp devices but more of their mapping tables in memory.
These are btrees with a partial working set in memory, i.e. with smaller
tables there are fewer nodes and thus a higher likelihood of having them
in core.
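
To put rough numbers on the block size argument, here is a minimal
sketch that counts the mapping entries a fully provisioned device would
need at different block sizes. The function name and the chosen sizes
are illustrative assumptions, not taken from the slides or the target
code, and the count ignores btree node overhead.

```python
def mapping_entries(device_bytes: int, block_bytes: int) -> int:
    """Upper bound on mapping entries for a fully provisioned device.

    One entry per block; a partially filled last block still needs one.
    """
    return -(-device_bytes // block_bytes)  # ceiling division


TIB = 1 << 40  # example device size, chosen only for illustration

for block_kib in (4, 64, 1024):
    entries = mapping_entries(TIB, block_kib * 1024)
    print(f"{block_kib:>5} KiB blocks: {entries:>12,} entries")
```

Each step up in block size divides the number of entries the btree has
to hold by the same factor, which is why more of the tree fits in core.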

> 
> 3) from "Performance" slide:
>    "Expensive operation is mapping in a new ‘area’"
>    Q: is area the same as a block in the pool?  Why not call block size:
>    "area size"?  "Block size" is familiar to people?  Original snapshot
>    had "chunk size".

Yes, it's the allocation unit.
I think block size fits well because it names the
logical allocation entity.

> 
> 4) Q: what did you decide to run with for reads to logical address space
>       that weren't previously mapped?  Just return zeroes like was
>       discussed on lvm-team?

Well, we're returning zeroes for initial reads before writes in order to
prevent any discovery from provisioning blocks unnecessarily. After the
first write, any data in the block will be returned, which doesn't cause
any harm: the application never wrote to the block before and thus can
never expect to retrieve meaningful data from other segments of the
block it never wrote to. Discovery after such initial provisioning of a
block should be fine too, because we have to assume that the application
initialized the thinp dev properly for future discovery.
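
The read/write behaviour described above can be modelled in a few lines.
This is a toy sketch in userspace Python, not the kernel target: the
class name, the in-memory dict standing in for the mapping btree, and
the 0xAA fill standing in for "whatever was already in the pool block"
are all assumptions made for illustration.

```python
class ThinDevice:
    """Toy model of a thin device: blocks are provisioned on first write."""

    STALE = 0xAA  # hypothetical stand-in for leftover pool contents

    def __init__(self, block_size: int):
        self.block_size = block_size
        self.mapping = {}  # logical block number -> bytearray (backing block)

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        while length > 0:
            blk, off = divmod(offset, self.block_size)
            n = min(length, self.block_size - off)
            data = self.mapping.get(blk)
            if data is None:
                out += bytes(n)  # unmapped: return zeroes, provision nothing
            else:
                out += data[off:off + n]  # mapped: return the block as-is
            offset += n
            length -= n
        return bytes(out)

    def write(self, offset: int, payload: bytes) -> None:
        pos = 0
        while pos < len(payload):
            blk, off = divmod(offset + pos, self.block_size)
            n = min(len(payload) - pos, self.block_size - off)
            data = self.mapping.get(blk)
            if data is None:
                # First write to this block: map it in without zeroing.
                data = bytearray([self.STALE]) * self.block_size
                self.mapping[blk] = data
            data[off:off + n] = payload[pos:pos + n]
            pos += n


dev = ThinDevice(block_size=8)
before = dev.read(0, 4)       # zeroes, and no block gets provisioned
dev.write(2, b"hi")           # provisions logical block 0 only
written = dev.read(2, 2)      # the data just written
neighbour = dev.read(0, 2)    # same block, never written by the application
```

Reading `neighbour` returns the stand-in stale bytes rather than zeroes,
which is the harmless case described above: the application never wrote
there, so it has no expectation about the contents.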

> 
> The "Metadata object" section is where you lose me:
> 
> 5) I'm not clear on the notion of "external" vs "internal" snapshots.
>    Q: can you elaborate on their characteristics?
>    Maybe the following question has some relation to external vs
>    internal?

Joe?

> 
> 6) I'm not clear on how you're going to clone the metadata tree for
>    userspace to walk (for snapshot merge, etc).  Is that "clone" really
>    a snapshot of the metadata device? -- seems unlikely as you'd need a
>    metadata device for your metadata device's snapshots?
>    - you said: "Userland will be given the location of an alternative
>      superblock for the metadata device. This is the root of a tree of
>      blocks referring to other blocks in a variety of data structures
>      (btrees, space maps etc.). Blocks will be shared with the ‘live’
>      version of the metadata, their reference counts may change as
>      sharing is broken, but we know the blocks will never be updated."
>      - Q: is this describing an "internal snapshot"?

Joe?

> 
> 7) from the "thin' target section:
> "All devices stored within a metadata object are instanced with this
> target. Be they fully mapped devices, thin provisioned devices, internal
> snapshots or external snapshots."
> Q: what is a fully mapped device?

All blocks mapped.

> 
> 8) "The target line:
> 
> thin <pool object> <internal device id>"
> Q: so by <pool object>, that is the _id_ of a pool object that was
> returned from the 'create virtual device' message?
> 
> 
> In general my understanding of all this shared store infrastructure is a
> muddled.  I need the audience to take away big concepts not get tripped
> up (or trip me up!) on the minutia.
> 
> Subtle inconsistencies and/or opaque explanation aren't helping, e.g.:
> 1) the detail of "Configuration/Use" for thinp volume
>    - "Allocate (empty) logical volume for the thin provisioning pool"
>       Q: how can it be "empty"?  Isn't it the data volume you hand to
>          the pool target?

It can start out with zero size, and userspace will grow it at the thinp
target's request.
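
As a rough sketch of that flow, the model below starts a pool at zero
size and calls out to "userspace" whenever it runs out of free blocks.
The class, the callback and the extension size of four blocks are
hypothetical; how the real target signals userspace is not shown here.

```python
class GrowablePool:
    """Toy pool that starts empty and asks userspace for more space."""

    def __init__(self, request_extension):
        self.size_blocks = 0  # the pool volume may start with zero size
        self.used_blocks = 0
        # Hypothetical callback standing in for userspace extending the LV;
        # it receives the current size and returns the number of blocks added.
        self.request_extension = request_extension

    def allocate_block(self) -> int:
        if self.used_blocks == self.size_blocks:
            granted = self.request_extension(self.size_blocks)
            if granted <= 0:
                raise RuntimeError("pool exhausted and not extended")
            self.size_blocks += granted
        block = self.used_blocks
        self.used_blocks += 1
        return block


requests = []  # records the pool size at each extension request


def extend_by_four(current_size: int) -> int:
    requests.append(current_size)
    return 4


pool = GrowablePool(extend_by_four)
blocks = [pool.allocate_block() for _ in range(5)]
```

Five allocations trigger two extension requests here, one from the empty
pool and one when the first four blocks are used up.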

>    - "Allocate small logical volume for the thin provisioning metadata"
>       Q: before in "Solution" slide you said "Separate metadata device
>          simplifies extension", can the metadata volume be extended too?

Planned, not yet.

>    - "Set up thin provisioning mapped device on aforementioned 2 LVs"
>       Q: so there is no distinct step for creating a pool?

Not yet, but we agreed to have 2 distinct steps with the multi-dev thinp
targets, i.e. one target for the pool, responsible for all pool
properties such as creating it in an initial step, and another target
responsible for thin provisioned device properties.
Same with shared snapshots.

>       Q: pool is implicitly created at the time the thinp device is
>          created? (doubtful but how you enumerated the steps makes it
> 	 misleading/confusing).

This describes the single pool/device target, not the shared pool one
which is WIP to settle the interfaces as mentioned above.

>       Q: can snapshot and thinp volumes share the same pool?
>          (if possible I could see it being brittle?)
>          (but expressing such capability will help the audience "get"
> 	 the fact that the pool is nicely abstracted/sound design,
> 	 etc).

See my theoretical/practical remark above.

> 
> versus:
> 
> 2) the description of the 'pool' and 'thin' targets
>    - "This (pool) target ties together a shared metadata volume and a
>      shared data volume."
>      Q: when does the "block size" get defined if it isn't provided in
>         the target line of "pool"?

Target lines are subject to change as mentioned above. Ie. pool
properties are handled by the pool target and thin provisioned device
related ones by the device target.

>    - "Be they fully mapped devices, thin provisioned devices, internal
>      snapshots or external snapshots."
>      Q: where does the notion of a thinp-snapshot (or whatever you are
>         calling it) get expressed as a distinct target?  This is all
> 	very opaque to me...

Joe?

> 
> 
> p.s. I was going to hold off sending this and take another pass of your
> slides but decided your feedback to all my Q:s would likely be much more
> helpful than me trying to parse the slides again.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel