* RBD layering design draft
@ 2012-06-15 20:48 Josh Durgin
  2012-06-16  0:46 ` Sage Weil
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Josh Durgin @ 2012-06-15 20:48 UTC (permalink / raw)
  To: ceph-devel

Here's a draft of a patch to the docs outlining the rbd layering
design. Is anything unclear? Any suggestions for improvement?

Josh

============
RBD Layering
============

RBD layering refers to the creation of copy-on-write clones of block
devices. This allows for fast image creation, for example to clone a
golden master image of a virtual machine into a new instance. To
simplify the semantics, you can only create a clone of a snapshot -
snapshots are always read-only, so the rest of the image is
unaffected, and there's no possibility of writing to them
accidentally.

Note: the terms `child` and `parent` below mean, respectively, an rbd image
created by cloning and the rbd image snapshot that the child was cloned from.

Command line interface
----------------------

Before cloning a snapshot, you must mark it as preserved, to prevent
it from being deleted while child images refer to it:
::

     $ rbd preserve pool/image@snap

Then you can perform the clone:
::

     $ rbd clone --parent pool/parent@snap pool2/child1

You can create a clone with different object sizes from the parent:
::

     $ rbd clone --parent pool/parent@snap --order 25 pool2/child2

To delete the parent, you must first mark it unpreserved, which checks
that there are no children left:
::

     $ rbd unpreserve pool/image@snap
     Error unpreserving: child images rely on this image
     $ rbd list_children pool/image@snap
     pool2/child1
     pool2/child2
     $ rbd copyup pool2/child1
     $ rbd rm pool2/child2
     $ rbd unpreserve pool/image@snap

Then the snapshot can be deleted like normal:
::

     $ rbd snap rm pool/image@snap

Implementation
--------------

Data Flow
^^^^^^^^^

In the initial implementation, called 'trivial layering', there will
be no tracking of which objects exist in a clone. A read that hits a
non-existent object will attempt to read from the parent object, and
this will continue recursively until an object exists or an image with
no parent is found.
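
As a rough illustration of the read path just described, here is a small
in-memory model. It is only a sketch: the `Image` class, its `objects` dict and
`read_object` are hypothetical names (not librbd API), every image in the chain
is assumed to use the same object size, and returning zeros when no ancestor
has the object is an assumption.

```python
from typing import Dict, Optional


class Image:
    """Hypothetical in-memory stand-in for an rbd image (not librbd API)."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.objects: Dict[int, bytes] = {}  # object number -> data
        self.parent = parent                 # None for an image with no parent


def read_object(image: Optional[Image], objno: int, size: int) -> bytes:
    """Read one object, falling back to ancestors when it does not exist."""
    while image is not None:
        if objno in image.objects:
            return image.objects[objno]
        image = image.parent  # object missing here: try the parent
    # Assumption: if no image in the chain has the object, return zeros.
    return b"\0" * size
```

The loop is the iterative form of the recursion in the text: it stops at the
first image that has the object, or when it runs out of parents.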

Before a write is performed, the object is checked for existence. If
it doesn't exist, a copy-up operation is performed, which means
reading the relevant range of data from the parent image and writing
it (plus the original write) to the child image. To prevent races with
multiple writes trying to copy-up the same object, this copy-up
operation will include an atomic create. If the atomic create fails,
the original write is done instead. This copy-up operation is
implemented as a class method so that extra metadata can be stored by
it in the future.
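
A minimal sketch of the copy-up semantics, assuming a plain dict stands in for
the child's data objects. The function name mirrors the proposed class method,
but the signature and the dict are hypothetical. In the real design the
atomicity would come from running the whole method on the OSD; this
single-threaded model only shows the two outcomes.

```python
from typing import Dict


def copy_up(objects: Dict[int, bytearray], objno: int, parent_data: bytes,
            child_data: bytes, child_offset: int) -> bool:
    """Model of copy-up; returns True if this call created the object."""
    created = objno not in objects
    if created:
        # The exclusive create succeeded: seed the object with parent data.
        objects[objno] = bytearray(parent_data)
    # In both cases the original write is applied on top.
    obj = objects[objno]
    end = child_offset + len(child_data)
    if len(obj) < end:
        obj.extend(b"\0" * (end - len(obj)))  # grow the object if needed
    obj[child_offset:end] = child_data
    return created
```

If two writers race, only the first call copies the parent data; the second
sees the object already exists and just performs its own write, so it cannot
overwrite the first writer's data with stale parent contents.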

A future optimization could be storing a bitmap of which objects
actually exist in a child. This would obviate the check for existence
before each write, and let reads go directly to the parent if needed.

Parent/Child relationships
^^^^^^^^^^^^^^^^^^^^^^^^^^

Children store a reference to their parent in their header, as a tuple
of (pool id, image id, snapshot id). This is enough information to
open the parent and read from it.

In addition to knowing which parent a given image has, we want to be
able to tell if a preserved image still has children. This is
accomplished with a new per-pool object, `rbd_children`, which maps
(parent pool, parent id, parent snapshot id) to a list of child
image ids. This is stored in the same pool as the child image
because the client creating a clone already has read/write access to
everything in this pool. This lets a client with read-only access to
one pool clone a snapshot from that pool into a pool they have full
access to. It increases the cost of unpreserving an image, since this
needs to check for children in every pool, but this is a rare
operation. It would likely only be done before removing old images,
which is already much more expensive because it involves deleting
every data object in the image.
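
The bookkeeping above might be modeled as follows. This is a sketch under
stated assumptions: `Pool`, `ParentKey` and `find_children` are hypothetical
names, and a Python set stands in for the list of child image ids.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

# (parent pool id, parent image id, parent snapshot id)
ParentKey = Tuple[int, str, int]


class Pool:
    """Hypothetical pool that holds only its rbd_children mapping."""

    def __init__(self) -> None:
        self.rbd_children: DefaultDict[ParentKey, Set[str]] = defaultdict(set)

    def add_child(self, parent: ParentKey, child_id: str) -> None:
        self.rbd_children[parent].add(child_id)

    def remove_child(self, parent: ParentKey, child_id: str) -> None:
        self.rbd_children[parent].discard(child_id)


def find_children(pools: Dict[str, Pool],
                  parent: ParentKey) -> List[Tuple[str, str]]:
    """Check every pool for children of `parent`, as unpreserve would."""
    found = []
    for pool_name, pool in sorted(pools.items()):
        for child_id in sorted(pool.rbd_children.get(parent, ())):
            found.append((pool_name, child_id))
    return found
```

Note that the clone only ever writes to the child's pool; the cost is paid by
`find_children`, which has to visit every pool.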

Preservation
^^^^^^^^^^^^

Internally, preservation_state is a field in the header object that
can be in one of three states: "preserved", "unpreserved", and
"unpreserving". The first two are set as the result of "rbd
preserve/unpreserve". The "unpreserving" state is set while the "rbd
unpreserve" command checks for any child images. Only snapshots in the
"preserved" state may be cloned, so the "unpreserving" state prevents
a race like:

1. A: walk through all pools, look for clones, find none
2. B: create a clone
3. A: unpreserve parent
4. A: rbd snap rm pool/parent@snap
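
A sketch of how the intermediate state closes that race. The `Snapshot` class
and its method names are hypothetical, the `children` set stands in for the
search across all pools, and falling back to "preserved" when children are
found is an assumption rather than something stated above.

```python
PRESERVED = "preserved"
UNPRESERVED = "unpreserved"
UNPRESERVING = "unpreserving"


class Snapshot:
    """Hypothetical model of a snapshot's preservation_state."""

    def __init__(self) -> None:
        self.state = UNPRESERVED
        self.children = set()  # stand-in for the per-pool children lookup

    def preserve(self) -> None:
        self.state = PRESERVED

    def clone(self, child_id: str) -> bool:
        # Only snapshots in the "preserved" state may be cloned.
        if self.state != PRESERVED:
            return False
        self.children.add(child_id)
        return True

    def begin_unpreserve(self) -> None:
        # Set before searching for children, so no new clone can start.
        self.state = UNPRESERVING

    def finish_unpreserve(self) -> bool:
        if self.children:
            self.state = PRESERVED  # assumption: revert if children exist
            return False
        self.state = UNPRESERVED
        return True
```

With this ordering, B's clone in step 2 is refused because A has already moved
the snapshot to "unpreserving" before walking the pools.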

Resizing
^^^^^^^^

To support resizing of layered images, we need to keep track of the
minimum size the image ever was, so that if a child image is shrunk
and then expanded, the re-expanded space is treated as unused instead
of being read from the parent image. Since this can change over time,
we need to store this for each snapshot as well.
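
The shrink-then-expand case can be illustrated with a small model. The
`ChildSize` class and `may_read_parent` are hypothetical names, and the sketch
ignores snapshots.

```python
class ChildSize:
    """Hypothetical tracker for a child image's size and min_size."""

    def __init__(self, size: int) -> None:
        self.size = size       # current size in bytes
        self.min_size = size   # smallest size the image has ever had

    def resize(self, new_size: int) -> None:
        self.size = new_size
        self.min_size = min(self.min_size, new_size)

    def may_read_parent(self, offset: int) -> bool:
        # Offsets at or beyond min_size were cut off at some point, so they
        # are treated as unused even if the image has since grown again.
        return offset < self.min_size
```

For example, a child created at 100 bytes, shrunk to 40 and grown back to 80
would only fall through to the parent for offsets below 40.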

Renaming
^^^^^^^^

Currently the rbd header object (that stores all the metadata about an
image) is named after the name of the image. This makes renaming
disrupt clients who have the image open (such as children reading from
a parent image). To avoid this, we can name the header object by the
id of the image, which does not change. That is, the name of the
header object could be `rbd_header.$id`, where $id is a unique id for
the image in the pool.

When a client opens an image, all it knows is the name. There is
already a per-pool `rbd_directory` object that maps image names to
ids, but if we relied on it to get the id, we could not open any
images in that pool if that single object was unavailable. To avoid
this dependency, we can store the id of an image in an object called
`rbd_id.$image_name`, where $image_name is the name of the image. The
per-pool `rbd_directory` object is still useful for listing all images
in a pool, however.
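
A sketch of the resulting lookup, assuming a dict from object name to contents
stands in for a pool. `header_name` and `rename_image` are hypothetical
helpers, and updating `rbd_directory` on rename is left out.

```python
from typing import Dict


def header_name(pool: Dict[str, object], image_name: str) -> str:
    """Resolve an image name to the name of its header object."""
    image_id = pool["rbd_id." + image_name]  # per-image id object
    return "rbd_header." + str(image_id)


def rename_image(pool: Dict[str, object], old: str, new: str) -> None:
    """Rename by moving only the small rbd_id object; the header is untouched."""
    pool["rbd_id." + new] = pool.pop("rbd_id." + old)
```

Because the header object's name depends only on the id, clients that already
have the image open keep working across a rename.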

Header changes
--------------

The header needs a few new fields:

* uint64_t parent_pool_id
* string parent_image_id
* uint64_t parent_snap_id
* uint64_t min_size (smallest size the image ever was in bytes)
* bool has_parent

Note that all the image ids are strings instead of uint64_t to let us
easily switch to uuids in the future.

cls_rbd
^^^^^^^

Some new methods are needed:
::

     /***************** methods on the rbd header *********************/
     /**
      * Sets the parent, min_size, and has_parent keys.
      * Fails if any of these keys exist, since the image already
      * had a parent.
      */
     set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

     /**
      * Returns the parent pool id, image id, and snap id, or -ENOENT
      * if has_parent is false
      */
     get_parent(uint64_t snapid)

     /**
      * Set has_parent to false.
      */
     remove_parent() // after all parent data is copied to the child

     /*************** methods on the rbd_children object *****************/

     add_child(uint64_t parent_pool_id, string parent_image_id,
               uint64_t parent_snap_id, string image_id);
     remove_child(uint64_t parent_pool_id, string parent_image_id,
                  uint64_t parent_snap_id, string image_id);
     /**
      * List image ids of a given parent
      */
     get_children(uint64_t parent_pool_id, string parent_image_id,
                  uint64_t parent_snap_id, uint64_t max_return,
                  string start);
     /**
      * List parent images
      */
     get_parents(uint64_t max_return, uint64_t start_pool_id,
                 string start_image_id, string start_snap_id);


     /************ methods on the rbd_id.$image_name object **************/
     /**
      * Create the object and set the id. Fail and return -EEXIST if
      * the object exists.
      */
     create_id(string id)
     get_id()

     /***************** methods on the rbd_data objects ******************/
     /**
      * Create an object with parent_data as its contents,
      * then write child_data to it. If the exclusive create fails,
      * just write the child_data.
      */
     copy_up(char *parent_data, uint64_t parent_data_len,
             char *child_data, uint64_t child_data_offset,
             uint64_t child_data_length)

One existing method will change if the image supports
layering:
::

     snapshot_add - stores current min_size and has_parent with
                    other snapshot metadata (images that don't have
                    layering enabled aren't affected)

librbd
^^^^^^

Opening a child image opens its parent (and this will continue
recursively as needed). This means that an ImageCtx will contain a
pointer to the parent image context. Differing object sizes won't
matter, since reading from the parent will go through the parent
image context.

Discard will need to change for layered images so that it only
truncates objects, and does not remove them. If we removed objects, we
could not tell if we needed to read them from the parent.
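
A small sketch of why truncation matters, again using a dict as a stand-in for
an image's data objects. `discard_object` and `must_read_parent` are
hypothetical helpers, and only whole-object discards of objects that already
exist in the image are modeled.

```python
from typing import Dict


def discard_object(objects: Dict[int, bytes], objno: int, layered: bool) -> None:
    """Discard a whole object that already exists in the image."""
    if objno not in objects:
        return  # nothing stored here; this case is outside the sketch
    if layered:
        objects[objno] = b""  # truncate: the (empty) object still exists
    else:
        del objects[objno]    # no parent to fall back to, so removal is safe


def must_read_parent(objects: Dict[int, bytes], objno: int) -> bool:
    """With trivial layering, only a missing object sends reads to the parent."""
    return objno not in objects
```

After a discard on a layered image the object is empty but present, so reads
do not fall through to stale parent data.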

A new clone method will be added, which takes the same arguments as
create except size (size of the parent image is used).


* Re: RBD layering design draft
  2012-06-15 20:48 RBD layering design draft Josh Durgin
@ 2012-06-16  0:46 ` Sage Weil
  2012-06-16  2:00   ` Yehuda Sadeh
  2012-06-18 16:25   ` Tommi Virtanen
  2012-06-18 17:00 ` Tommi Virtanen
  2012-06-21 21:51 ` Alex Elder
  2 siblings, 2 replies; 19+ messages in thread
From: Sage Weil @ 2012-06-16  0:46 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

Looks good!  Couple small things:

On Fri, 15 Jun 2012, Josh Durgin wrote:
> Here's a draft of a patch to the docs outlining the rbd layering
> design. Is anything unclear? Any suggestions for improvement?
> 
> Josh
> 
> ============
> RBD Layering
> ============
> 
> RBD layering refers to the creation of copy-on-write clones of block
> devices. This allows for fast image creation, for example to clone a
> golden master image of a virtual machine into a new instance. To
> simplify the semantics, you can only create a clone of a snapshot -
> snapshots are always read-only, so the rest of the image is
> unaffected, and there's no possibility of writing to them
> accidentally.
> 
> Note: the terms `child` and `parent` below mean an rbd image created
> by cloning, and the rbd image snapshot a child was cloned from.
> 
> Command line interface
> ----------------------
> 
> Before cloning a snapshot, you must mark it as preserved, to prevent
> it from being deleted while child images refer to it:
> ::
> 
>     $ rbd preserve pool/image@snap
> 
> Then you can perform the clone:
> ::
> 
>     $ rbd clone --parent pool/parent@snap pool2/child1
> 
> You can create a clone with different object sizes from the parent:
> ::
> 
>     $ rbd clone --parent pool/parent@snap --order 25 pool2/child2
> 
> To delete the parent, you must first mark it unpreserved, which checks
> that there are no children left:
> ::
> 
>     $ rbd unpreserve pool/image@snap
>     Error unpreserving: child images rely on this image
>     $ rbd list_children pool/image@snap
>     pool2/child1
>     pool2/child2
>     $ rbd copyup pool2/child1
>     $ rbd rm pool2/child2
>     $ rbd unpreserve pool/image@snap

Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure 
I have a better suggestion, but preserve is unusual.  
 
> Then the snapshot can be deleted like normal:
> ::
> 
>     $ rbd snap rm pool/image@snap
> 
> Implementation
> --------------
> 
> Data Flow
> ^^^^^^^^^
> 
> In the initial implementation, called 'trivial layering', there will
> be no tracking of which objects exist in a clone. A read that hits a
> non-existent object will attempt to read from the parent object, and
> this will continue recursively until an object exists or an image with
> no parent is found.
> 
> Before a write is performed, the object is checked for existence. If
> it doesn't exist, a copy-up operation is performed, which means
> reading the relevant range of data from the parent image and writing
> it (plus the original write) to the child image. To prevent races with
> multiple writes trying to copy-up the same object, this copy-up
> operation will include an atomic create. If the atomic create fails,
> the original write is done instead. This copy-up operation is
> implemented as a class method so that extra metadata can be stored by
> it in the future.
> 
> A future optimization could be storing a bitmap of which objects
> actually exist in a child. This would obviate the check for existence
> before each write, and let reads go directly to the parent if needed.
> 
> Parent/Child relationships
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Children store a reference to their parent in their header, as a tuple
> of (pool id, image id, snapshot id). This is enough information to
> open the parent and read from it.
> 
> In addition to knowing which parent a given image has, we want to be
> able to tell if a preserved image still has children. This is
> accomplished with a new per-pool object, `rbd_children`, which maps
> (parent pool, parent id, parent snapshot id) to a list of child
> image ids. This is stored in the same pool as the child image
> because the client creating a clone already has read/write access to
> everything in this pool. This lets a client with read-only access to
> one pool clone a snapshot from that pool into a pool they have full
> access to. It increases the cost of unpreserving an image, since this
> needs to check for children in every pool, but this is a rare
> operation. It would likely only be done before removing old images,
> which is already much more expensive because it involves deleting
> every data object in the image.
> 
> Preservation
> ^^^^^^^^^^^^
> 
> Internally, preservation_state is a field in the header object that
> can be in three states. "preserved", "unpreserved", and
> "unpreserving". The first two are set as the result of "rbd
> preserve/unpreserve". The "unpreserving" state is set while the "rbd
> unpreserve" command checks for any child images. Only snapshots in the
> "preserved" state may be cloned, so the "unpreserving" state prevents
> a race like:
> 
> 1. A: walk through all pools, look for clones, find none
> 2. B: create a clone
> 3. A: unpreserve parent
> 4. A: rbd snap rm pool/parent@snap
> 
> Resizing
> ^^^^^^^^
> 
> To support resizing of layered images, we need to keep track of the
> minimum size the image ever was, so that if a child image is shrunk
> and then expanded, the re-expanded space is treated as unused instead
> of being read from the parent image. Since this can change over time,
> we need to store this for each snapshot as well.
> 
> Renaming
> ^^^^^^^^
> 
> Currently the rbd header object (that stores all the metadata about an
> image) is named after the name of the image. This makes renaming
> disrupt clients who have the image open (such as children reading from
> a parent image). To avoid this, we can name the header object by the
> id of the image, which does not change. That is, the name of the
> header object could be `rbd_header.$id`, where $id is a unique id for
> the image in the pool.
> 
> When a client opens an image, all it knows is the name. There is
> already a per-pool `rbd_directory` object that maps image names to
> ids, but if we relied on it to get the id, we could not open any
> images in that pool if that single object was unavailable. To avoid
> this dependency, we can store the id of an image in an object called
> `rbd_id.$image_name`, where $image_name is the name of the image. The
> per-pool `rbd_directory` object is still useful for listing all images
> in a pool, however.
> 
> Header changes
> --------------
> 
> The header needs a few new fields:
> 
> * uint64_t parent_pool_id
> * string parent_image_id
> * uint64_t parent_snap_id
> * uint64_t min_size (smallest size the image ever was in bytes)
> * bool has_parent
> 
> Note that all the image ids are strings instead of uint64_t to let us
> easily switch to uuids in the future.
> 
> cls_rbd
> ^^^^^^^
> 
> Some new methods are needed:
> ::
> 
>     /***************** methods on the rbd header *********************/
>     /**
>      * Sets the parent, min_size, and has_parent keys.
>      * Fails if any of these keys exist, since the image already
>      * had a parent.
>      */
>     set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

     set_parent(uint64_t pool_id, string image_id, uint64_t snap_id,
                uint64_t parent_size)

The overlap that the image actually stores will be the minimum of parent_size
and the image's own size.

> 
>     /**
>      * Returns the parent pool id, image id, and snap id, or -ENOENT

and overlap

>      * if has_parent is false
>      */
>     get_parent(uint64_t snapid)
> 
>     /**
>      * Set has_parent to false.
>      */
>     remove_parent() // after all parent data is copied to the child
> 
>     /*************** methods on the rbd_children object *****************/
> 
>     add_child(uint64_t parent_pool_id, string parent_image_id,
>               uint64_t parent_snap_id, string image_id);
>     remove_child(uint64_t parent_pool_id, string parent_image_id,
>                  uint64_t parent_snap_id, string image_id);
>     /**
>      * List image ids of a given parent
>      */
>     get_children(uint64_t parent_pool_id, string parent_image_id,
>                  uint64_t parent_snap_id, uint64_t max_return,
>                  string start);
>     /**
>      * List parent images
>      */
>     get_parents(uint64_t max_return, uint64_t start_pool_id,
>                 string start_image_id, string start_snap_id);
> 
> 
>     /************ methods on the rbd_id.$image_name object **************/
>     /**
>      * Create the object and set the id. Fail and return -EEXIST if
>      * the object exists.
>      */
>     create_id(string id)
>     get_id()
> 
>     /***************** methods on the rbd_data objects ******************/
>     /**
>      * Create an object with parent_data as its contents,
>      * then write child_data to it. If the exclusive create fails,
>      * just write the child_data.
>      */
>      copy_up(char *parent_data, uint64_t parent_data_len,
>              char *child_data, uint64_t child_data_offset,
>              uint64_t child_data_length)
> 
> One existing method will change if the image supports
> layering:
> ::
> 
>     snapshot_add - stores current min_size and has_parent with
>                    other snapshot metadata (images that don't have
>                    layering enabled aren't affected)

Also

      set_size   - will adjust the parent overlap down as needed.

> 
> librbd
> ^^^^^^
> 
> Opening a child image opens its parent (and this will continue
> recursively as needed). This means that an ImageCtx will contain a
> pointer to the parent image context. Differing object sizes won't
> matter, since reading from the parent will go through the parent
> image context.
> 
> Discard will need to change for layered images so that it only
> truncates objects, and does not remove them. If we removed objects, we
> could not tell if we needed to read them from the parent.
> 
> A new clone method will be added, which takes the same arguments as
> create except size (size of the parent image is used).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: RBD layering design draft
  2012-06-16  0:46 ` Sage Weil
@ 2012-06-16  2:00   ` Yehuda Sadeh
  2012-06-16 15:11     ` Sage Weil
  2012-06-18 16:25   ` Tommi Virtanen
  1 sibling, 1 reply; 19+ messages in thread
From: Yehuda Sadeh @ 2012-06-16  2:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, ceph-devel

On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil <sage@inktank.com> wrote:
> Looks good!  Couple small things:
>
>>     $ rbd unpreserve pool/image@snap
>
> Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
> I have a better suggestion, but preserve is unusual.
>

freeze, thaw/unfreeze?

* Re: RBD layering design draft
  2012-06-16  2:00   ` Yehuda Sadeh
@ 2012-06-16 15:11     ` Sage Weil
  2012-06-17 13:42       ` Martin Mailand
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2012-06-16 15:11 UTC (permalink / raw)
  To: Yehuda Sadeh; +Cc: Josh Durgin, ceph-devel

On Fri, 15 Jun 2012, Yehuda Sadeh wrote:
> On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil <sage@inktank.com> wrote:
> > Looks good!  Couple small things:
> >
> >>     $ rbd unpreserve pool/image@snap
> >
> > Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
> > I have a better suggestion, but preserve is unusual.
> >
> 
> freeze, thaw/unfreeze?

Freeze/thaw usually mean something like quiesce I/O or read-only, usually 
temporarily.  What we actually mean is "you can't delete this".  Maybe
pin/unpin?  preserve/unpreserve may be fine, too!

sage


* Re: RBD layering design draft
  2012-06-16 15:11     ` Sage Weil
@ 2012-06-17 13:42       ` Martin Mailand
  2012-06-18 18:04         ` Gregory Farnum
  0 siblings, 1 reply; 19+ messages in thread
From: Martin Mailand @ 2012-06-17 13:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: Yehuda Sadeh, Josh Durgin, ceph-devel

Hi,
what about locked, unlocked, unlocking?

-martin

On 16.06.2012 17:11, Sage Weil wrote:
> On Fri, 15 Jun 2012, Yehuda Sadeh wrote:
>> On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil<sage@inktank.com>  wrote:
>>> Looks good!  Couple small things:
>>>
>>>>      $ rbd unpreserve pool/image@snap
>>>
>>> Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
>>> I have a better suggestion, but preserve is unusual.
>>>
>>
>> freeze, thaw/unfreeze?
>
> Freeze/thaw usually mean something like quiesce I/O or read-only, usually
> temporarily.  What we actually mean is "you can't delete this".  Maybe
> pin/unpin?  preserve/unpreserve may be fine, too!
>
> sage


* Re: RBD layering design draft
  2012-06-16  0:46 ` Sage Weil
  2012-06-16  2:00   ` Yehuda Sadeh
@ 2012-06-18 16:25   ` Tommi Virtanen
  2012-06-18 23:10     ` Dan Mick
  1 sibling, 1 reply; 19+ messages in thread
From: Tommi Virtanen @ 2012-06-18 16:25 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, ceph-devel

On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil <sage@inktank.com> wrote:
> Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
> I have a better suggestion, but preserve is unusual.

protect/unprotect? The flag protects the image snapshot from being deleted.

* Re: RBD layering design draft
  2012-06-15 20:48 RBD layering design draft Josh Durgin
  2012-06-16  0:46 ` Sage Weil
@ 2012-06-18 17:00 ` Tommi Virtanen
  2012-06-18 17:14   ` Josh Durgin
  2012-06-22 14:36   ` Guido Winkelmann
  2012-06-21 21:51 ` Alex Elder
  2 siblings, 2 replies; 19+ messages in thread
From: Tommi Virtanen @ 2012-06-18 17:00 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
>    $ rbd unpreserve pool/image@snap
>    Error unpreserving: child images rely on this image

UX nit: this should also say what image it found.

rbd: Cannot unpreserve: Still in use by pool2/image2

>    $ rbd list_children pool/image@snap
>    pool2/child1
>    pool2/child2

How about just "rbd children"? Especially the underscore makes me unhappy.

>    $ rbd copyup pool2/child1

Does "copyup" make sense to everyone? Every time you say it, my brain
needs to flip the image inside the other way around -- I naturally
imagine a tree with the parent at the top, and children and
grandchildren down from it, but then I can't call that operation
"copyup" without wrecking my mental image.

I also can't seem to google good evidence that the term would be in
widespread use in the enterprisey block storage world, outside of the
unionfs world.. What do people call the un-dedupping, un-thinning of
copy-on-write thin provisioning?

"unshare"?

> In addition to knowing which parent a given image has, we want to be
> able to tell if a preserved image still has children. This is
> accomplished with a new per-pool object, `rbd_children`, which maps
> (parent pool, parent id, parent snapshot id) to a list of child
> image ids.

So the omap value is a list, and you need to support atomic add/remove
on the list members? Are you thinking of using an rbd class method
that does read-modify-write for that?

My instincts would have gone for (parent_pool, parent_id,
parent_snapshot_id, child_id) -> None, to get atomic operations for
free.

* Re: RBD layering design draft
  2012-06-18 17:00 ` Tommi Virtanen
@ 2012-06-18 17:14   ` Josh Durgin
  2012-06-18 18:01     ` Sage Weil
  2012-06-22 14:36   ` Guido Winkelmann
  1 sibling, 1 reply; 19+ messages in thread
From: Josh Durgin @ 2012-06-18 17:14 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On 06/18/2012 10:00 AM, Tommi Virtanen wrote:
> On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin<josh.durgin@inktank.com>  wrote:
>>     $ rbd unpreserve pool/image@snap
>>     Error unpreserving: child images rely on this image
>
> UX nit: this should also say what image it found.
>
> rbd: Cannot unpreserve: Still in use by pool2/image2

Agreed.

>>     $ rbd list_children pool/image@snap
>>     pool2/child1
>>     pool2/child2
>
> How about just "rbd children"? Especially the underscore makes me unhappy.

Yeah, that sounds better.

>>     $ rbd copyup pool2/child1
>
> Does "copyup" make sense to everyone? Every time you say it, my brain
> needs to flip the image inside the other way around -- I naturally
> imagine a tree with the parent at the top, and children and
> grandchildren down from it, but then I can't call that operation
> "copyup" without wrecking my mental image.
>
> I also can't seem to google good evidence that the term would be in
> widespread use in the enterprisey block storage world, outside of the
> unionfs world.. What do people call the un-dedupping, un-thinning of
> copy-on-write thin provisioning?
>
> "unshare"?

I'm not sure what the best term is, but there's probably something better
than copyup.

>> In addition to knowing which parent a given image has, we want to be
>> able to tell if a preserved image still has children. This is
>> accomplished with a new per-pool object, `rbd_children`, which maps
>> (parent pool, parent id, parent snapshot id) to a list of child
>> image ids.
>
> So the omap value is a list, and you need to support atomic add/remove
> on the list members? Are you thinking of using an rbd class method
> that does read-modify-write for that?
>
> My instincts would have gone for (parent_pool, parent_id,
> parent_snapshot_id, child_id) ->  None, to get atomic operations for
> free.

The reason for making it a class method is more about hiding the
implementation from clients. It could be the mapping you describe in
an omap.
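
For illustration, the flat mapping suggested above could be modeled like this.
It is only a sketch: a Python dict stands in for the omap, tuples stand in for
whatever key encoding would really be used, and the function signatures are
simplified from the cls_rbd proposal.

```python
from typing import Dict, List, Tuple

# (parent pool id, parent image id, parent snap id, child image id) -> None
Key = Tuple[int, str, int, str]


def add_child(omap: Dict[Key, None], pool_id: int, image_id: str,
              snap_id: int, child_id: str) -> None:
    # One key per child: adding is a single key insert, no read-modify-write.
    omap[(pool_id, image_id, snap_id, child_id)] = None


def remove_child(omap: Dict[Key, None], pool_id: int, image_id: str,
                 snap_id: int, child_id: str) -> None:
    omap.pop((pool_id, image_id, snap_id, child_id), None)


def get_children(omap: Dict[Key, None], pool_id: int, image_id: str,
                 snap_id: int) -> List[str]:
    """List children by scanning keys that share the parent prefix."""
    prefix = (pool_id, image_id, snap_id)
    return sorted(key[3] for key in omap if key[:3] == prefix)
```

Either representation can sit behind the same class methods, which is the
point of hiding it from clients.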


* Re: RBD layering design draft
  2012-06-18 17:14   ` Josh Durgin
@ 2012-06-18 18:01     ` Sage Weil
  2012-06-18 23:07       ` Dan Mick
  0 siblings, 1 reply; 19+ messages in thread
From: Sage Weil @ 2012-06-18 18:01 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Tommi Virtanen, ceph-devel

On Mon, 18 Jun 2012, Josh Durgin wrote:
> > >     $ rbd copyup pool2/child1
> > 
> > Does "copyup" make sense to everyone? Every time you say it, my brain
> > needs to flip the image inside the other way around -- I naturally
> > imagine a tree with the parent at the top, and children and
> > grandchildren down from it, but then I can't call that operation
> > "copyup" without wrecking my mental image.
> > 
> > I also can't seem to google good evidence that the term would be in
> > widespread use in the enterprisey block storage world, outside of the
> > unionfs world.. What do people call the un-dedupping, un-thinning of
> > copy-on-write thin provisioning?
> > 
> > "unshare"?
> 
> I'm not sure what the best term is, but there's probably something better than
> copyup.

"flatten"?  My mental model is stuck on the "layering" analogy, where the 
child is a copy-on-write layer on top of a read-only parent.

Someday we may want to support the ability to add a parent to an existing 
image and do a sort of "dedup", so having an opposite for whatever term we 
pick would be a bonus.

sage


* Re: RBD layering design draft
  2012-06-17 13:42       ` Martin Mailand
@ 2012-06-18 18:04         ` Gregory Farnum
  0 siblings, 0 replies; 19+ messages in thread
From: Gregory Farnum @ 2012-06-18 18:04 UTC (permalink / raw)
  To: martin; +Cc: Sage Weil, Yehuda Sadeh, Josh Durgin, ceph-devel

Locking is a separate mechanism we're already working on, which will
"lock" images so that they can't accidentally be mounted at more than
one location. :)
-Greg

On Sun, Jun 17, 2012 at 6:42 AM, Martin Mailand <martin@tuxadero.com> wrote:
> Hi,
> what about locked, unlocked, unlocking?
>
> -martin
>
> On 16.06.2012 17:11, Sage Weil wrote:
>
>> On Fri, 15 Jun 2012, Yehuda Sadeh wrote:
>>>
>>> On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil<sage@inktank.com>  wrote:
>>>>
>>>> Looks good!  Couple small things:
>>>>
>>>>>     $ rbd unpreserve pool/image@snap
>>>>
>>>>
>>>> Is 'preserve' and 'unpreserve' the verbiage we want to use here?  Not
>>>> sure
>>>> I have a better suggestion, but preserve is unusual.
>>>>
>>>
>>> freeze, thaw/unfreeze?
>>
>>
>> Freeze/thaw usually mean something like quiesce I/O or read-only, usually
>> temporarily.  What we actually mean is "you can't delete this".  Maybe
>> pin/unpin?  preserve/unpreserve may be fine, too!
>>
>> sage
>

* Re: RBD layering design draft
  2012-06-18 18:01     ` Sage Weil
@ 2012-06-18 23:07       ` Dan Mick
  2012-06-22  2:02         ` Alex Elsayed
  0 siblings, 1 reply; 19+ messages in thread
From: Dan Mick @ 2012-06-18 23:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, Tommi Virtanen, ceph-devel

On 06/18/2012 11:01 AM, Sage Weil wrote:
> On Mon, 18 Jun 2012, Josh Durgin wrote:
>>>>      $ rbd copyup pool2/child1
>>>
>>> Does "copyup" make sense to everyone? Every time you say it, my brain
>>> needs to flip the image inside the other way around -- I naturally
>>> imagine a tree with the parent at the top, and children and
>>> grandchildren down from it, but then I can't call that operation
>>> "copyup" without wrecking my mental image.
>>>
>>> I also can't seem to google good evidence that the term would be in
>>> widespread use in the enterprisey block storage world, outside of the
>>> unionfs world.. What do people call the un-dedupping, un-thinning of
>>> copy-on-write thin provisioning?
>>>
>>> "unshare"?
>>
>> I'm not sure what the best term is, but there's probably something better than
>> copyup.
>
> "flatten"?  My mental model is stuck on the "layering" analogy, where the
> child is a copy-on-write layer on top of a read-only parent.
>
> Someday we may want to support the ability to add a parent to an existing
> image and do a sort of "dedup", so having an opposite for whatever term we
> pick would be a bonus.

"disown" and "adopt"?  :)  (actually I started that as a joke, but really I
kinda like it; fits with the parent-child name)




* Re: RBD layering design draft
  2012-06-18 16:25   ` Tommi Virtanen
@ 2012-06-18 23:10     ` Dan Mick
  0 siblings, 0 replies; 19+ messages in thread
From: Dan Mick @ 2012-06-18 23:10 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: Sage Weil, Josh Durgin, ceph-devel



On 06/18/2012 09:25 AM, Tommi Virtanen wrote:
> On Fri, Jun 15, 2012 at 5:46 PM, Sage Weil<sage@inktank.com>  wrote:
>> Are 'preserve' and 'unpreserve' the verbiage we want to use here?  Not sure
>> I have a better suggestion, but preserve is unusual.
>
> protect/unprotect? The flag protects the image snapshot from being deleted.

unremovable/removable?

undeletable/deletable?



* Re: RBD layering design draft
  2012-06-15 20:48 RBD layering design draft Josh Durgin
  2012-06-16  0:46 ` Sage Weil
  2012-06-18 17:00 ` Tommi Virtanen
@ 2012-06-21 21:51 ` Alex Elder
  2012-06-22 14:37   ` Guido Winkelmann
  2012-06-22 16:27   ` Tommi Virtanen
  2 siblings, 2 replies; 19+ messages in thread
From: Alex Elder @ 2012-06-21 21:51 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

On 06/15/2012 03:48 PM, Josh Durgin wrote:
> Here's a draft of a patch to the docs outlining the rbd layering
> design. Is anything unclear? Any suggestions for improvement?
> 
> Josh

I'm going to try to take into account the comments others have made
but I may end up duplicating--and if so, I apologize in advance.  I
also have a lot of questions and suggestions.  They may just show
my ignorance more than anything, so I just may need to get better
educated about this...

> ============
> RBD Layering
> ============
> 
> RBD layering refers to the creation of copy-on-write clones of block
> devices. This allows for fast image creation, for example to clone a
> golden master image of a virtual machine into a new instance. To
> simplify the semantics, you can only create a clone of a snapshot -
> snapshots are always read-only, so the rest of the image is
> unaffected, and there's no possibility of writing to them
> accidentally.

I think this is a good restriction.  However the rest of your
description doesn't seem to be very clear about that.  In
particular, if there can be a chain of "parents" that suggests
that maybe a parent could be something other than a (read-only)
snapshot of a top-level RBD image.

I just learned though that a clone can itself be treated as if
it were a top-level RBD image.  (So some of my comments may
show that I didn't get that before.)

> Note: the terms `child` and `parent` below mean an rbd image created
> by cloning, and the rbd image snapshot a child was cloned from.

I went through this with the following understanding, and I'll
lay it out here because it may inform some of the comments that
follow.

RBD Image
    Top-level RBD image.  Uniquely defined by (pool id, rbd id).
    All storage for an RBD image comes from a single pool.  An
    RBD image has a fixed order, which defines the power-of-two
    size of the segments the RBD's storage is broken into.

RBD Snapshot
    Read-only snapshot of the state/content of a (parent) RBD image
    at a particular instant in time.  Uniquely defined by either
    (pool id, rbd id, snapshot id) or, because each snapshot also
    optionally has a unique user-provided name, (pool id, rbd id, name).
    Storage for a snapshot always comes from the same pool as its
    associated RBD image, and its segment size (object order) also
    matches that of its image.

RBD Clone Image
    Read/write, copy-on-write version of a particular RBD snapshot.
    Uniquely defined by (pool id, image id); its content is also
    permanently tied to the RBD snapshot on which it's based.
    Initial contents are identical to its snapshot, but any write
    to the content will result in making a copy of an affected
    range from the snapshot's content, updating it based on the
    write operation, saving the new copy and associating the updated
    portion with the clone.  A clone must have read access to the
    snapshot it is based on, but can itself use a different pool to
    which it has read/write access to store its updated data.  A
    clone can have a different object order from the snapshot it's
    based on.

    Note that a clone can itself be snapshotted, and those snapshots
    can then have their own clones.  This leads to the possibility of
    chains of parents, mentioned elsewhere.

OK, based on that understanding, I'd recommend using terminology
more like what I use above rather than "parent" and "child."  That
is, an image, a snapshot, and a clone all play different roles and
have different semantics.

(Even though a clone can be treated as if it were a top-level RBD
image I think it's useful to have a term that distinguishes it as
dependent on another image for its data.)
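
To make the three roles concrete, here is a minimal sketch of the
terminology above.  The class and field names are hypothetical (they
are not librbd types); the only things taken from the draft are the
(pool id, image id, snapshot id) tuple and the per-image order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnapshotRef:
    """Identifies one read-only snapshot: (pool id, image id, snapshot id)."""
    pool_id: int
    image_id: str
    snap_id: int


@dataclass
class Image:
    """An RBD image; a clone is simply an image whose parent is set."""
    pool_id: int
    image_id: str
    order: int                            # objects are 2**order bytes
    parent: Optional[SnapshotRef] = None  # None for a top-level image

    @property
    def is_clone(self) -> bool:
        return self.parent is not None

    @property
    def object_size(self) -> int:
        return 1 << self.order


# A top-level image, one of its snapshots, and a clone of that snapshot
# living in a different pool with a different object order.
base = Image(pool_id=1, image_id="golden", order=22)
snap = SnapshotRef(pool_id=1, image_id="golden", snap_id=7)
clone = Image(pool_id=2, image_id="child1", order=25, parent=snap)
```

In this model the only difference between a clone and a top-level
image is the parent reference, which matches the point that a clone
can be treated as if it were a top-level image.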

> Command line interface
> ----------------------
> 
> Before cloning a snapshot, you must mark it as preserved, to prevent
> it from being deleted while child images refer to it:
> ::
> 
>     $ rbd preserve pool/image@snap

Why is it necessary to do this?  I think it may be desirable
(i.e., to mark a particular snapshot as having some significance).
But I think this ought to be an optional feature, and one in
which you might even give it a name, rather than something that's
required.  The name would be distinct from the snapshot name, to
allow snapshot "Tuesday_4pm" to be preserved as "Ubuntu_12.04-image".

> Then you can perform the clone:
> ::
> 
>     $ rbd clone --parent pool/parent@snap pool2/child1

Based on my comments above, if the parent had not been "preserved"
it would automatically be at this point, by virtue of the fact it
has a clone associated with it.

Since there is always exactly one parent and one child, I'd say
drop the "--parent" and just have the parent and child be
defined by their position.  If the parent could be optionally
skipped for some reason, then make *it* be the second one.

> You can create a clone with different object sizes from the parent:
> ::
> 
>     $ rbd clone --parent pool/parent@snap --order 25 pool2/child2

Are there any restrictions on the relationship between the orders
of the parent and child?  (I don't think there has to be, and this
is actually a very interesting feature.)

> To delete the parent, you must first mark it unpreserved, which checks
> that there are no children left:
> ::
> 

Please show what happens here if this is done at this point:

      $ rbd snap rm pool/image@snap

>     $ rbd unpreserve pool/image@snap
>     Error unpreserving: child images rely on this image
>     $ rbd list_children pool/image@snap
>     pool2/child1
>     pool2/child2
>     $ rbd copyup pool2/child1

The term "copyup" does not resonate with me at all--I find it
offers no clues about what it does (and I can think of a few
contradictory interpretations).

My best guess is that you mean to be promoting a clone to be
a free-standing RBD image, re-writing the entire content of
the parent snapshot (recursively) into the clone.  And in
doing so it disassociates itself from the original.  So I
assume that from here forward.

What happens to snapshots of clones that have been the
subject of this operation?  Do they all need to be rewritten
to reflect the new objects backing the top-level image?  Do
they remain dependent on the previous parent snapshot?

>     $ rbd rm pool2/child2
>     $ rbd unpreserve pool/image@snap
> 
> Then the snapshot can be deleted like normal:
> ::
> 
>     $ rbd snap rm pool/image@snap

Note that the "preserve" and "unpreserve" operations are
valid on snapshots, not RBD images or clones.

> Implementation
> --------------
> 
> Data Flow
> ^^^^^^^^^
> 
> In the initial implementation, called 'trivial layering', there will
> be no tracking of which objects exist in a clone. A read that hits a
> non-existent object will attempt to read from the parent object, and
> this will continue recursively until an object exists or an image with
> no parent is found.

So a non-existent object in a clone is a bit like a hole in a file, but
instead of implicitly backing it with zeroes it backs it with the data
found at the same range as the snapshot the clone was based on?

If a clone had snapshots, does this mean a snapshot can include
non-existent objects in it?

Does this mean that an attempt to read beyond the end of an RBD snapshot
is not an error if the read is being done for a clone whose size has
been increased from what it was originally?  (In that case, the correct
action would be to read the range as zeroes.)

> Before a write is performed, the object is checked for existence. If
> it doesn't exist, a copy-up operation is performed, which means
> reading the relevant range of data from the parent image and writing
> it (plus the original write) to the child image. To prevent races with
> multiple writes trying to copy-up the same object, this copy-up
> operation will include an atomic create. If the atomic create fails,
> the original write is done instead. This copy-up operation is
> implemented as a class method so that extra metadata can be stored by
> it in the future.

I think we need to expand on this existence check/atomic create/copy
up business.  I'm not sure I know what "the original write is done"
means in this context.
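
One possible reading, sketched under assumptions: the function below
follows the copy_up() comment in the cls_rbd section (create the
object with the parent data, then apply the child's write; if the
object already exists, just apply the write).  The dict-based store
and the return value are hypothetical, and the atomicity is assumed to
come from the class method running on the object, which a
single-threaded dict only stands in for.

```python
from typing import Dict


def copy_up(store: Dict[str, bytearray], name: str, parent_data: bytes,
            child_data: bytes, child_offset: int) -> bool:
    """Return True if this call created the object, False if it existed."""
    created = name not in store
    if created:
        # Exclusive create succeeded: seed the object with the parent's data.
        store[name] = bytearray(parent_data)
    obj = store[name]
    end = child_offset + len(child_data)
    if len(obj) < end:
        # Zero-fill if the write extends past the current object length.
        obj.extend(bytes(end - len(obj)))
    # Either way, the write that triggered the copy-up is applied.
    obj[child_offset:end] = child_data
    return created
```

Under this reading, a writer that loses the race discards its copy of
the parent data and only its own write lands on the object the winner
created.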

> A future optimization could be storing a bitmap of which objects
> actually exist in a child. This would obviate the check for existence
> before each write, and let reads go directly to the parent if needed.

This may not be very difficult to do.

> Parent/Child relationships
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> Children store a reference to their parent in their header, as a tuple
> of (pool id, image id, snapshot id). This is enough information to
> open the parent and read from it.

Do we have an abstract entity that uniquely defines a snapshot?  I mean,
can we define a "snapshot name" which basically encapsulates the
(pool id, image id, snapshot id) tuple?  Maybe that doesn't matter, but
I think the abstraction might help clarify the interface a bit.  I.e.,
you can't just pass arbitrary combinations of the three components,
a snapshot name is very well defined as a unit.

> In addition to knowing which parent a given image has, we want to be
> able to tell if a preserved image still has children. This is
> accomplished with a new per-pool object, `rbd_children`, which maps
> (parent pool, parent id, parent snapshot id) to a list of child

My first thought was, why does the parent snapshot need to know the
*identity* of its descendant clones?  The main thing it seems to need
is a count of the number of clones it has.

The other thing though is that you shouldn't store the mapping
in the "rbd_children" object.  Instead, you should only store
the child object ids there, and consult those objects to identify
their parents.  Otherwise you end up with problems related to
possible discrepancy between what a child points to and what the
"rbd_children" mapping says.

> image ids. This is stored in the same pool as the child image
> because the client creating a clone already has read/write access to
> everything in this pool. This lets a client with read-only access to
> one pool clone a snapshot from that pool into a pool they have full
> access to. It increases the cost of unpreserving an image, since this

This is really a bad feature of this design because it doesn't scale.
So we ought to be thinking about a better way to do it if possible.

> needs to check for children in every pool, but this is a rare
> operation. It would likely only be done before removing old images,
> which is already much more expensive because it involves deleting
> every data object in the image.
> 
> Preservation
> ^^^^^^^^^^^^
> 
> Internally, preservation_state is a field in the header object that
> can be in three states. "preserved", "unpreserved", and
> "unpreserving". The first two are set as the result of "rbd
> preserve/unpreserve". The "unpreserving" state is set while the "rbd

The "unpreserved" state is the initial state of any snapshot.  The
"preserved" state is set immediately as a result of "rbd preserve".
The "unpreserving" state is set immediately to avoid a race; after
it is verified there are no child images, an image in "unpreserving"
state is converted to "unpreserved".
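
The transitions can be sketched as a small state machine.  This is
only an illustration: the class and function names are hypothetical,
and going back to "preserved" when children are found is my
assumption (the draft only shows the command failing).

```python
class SnapshotHeader:
    """Hypothetical stand-in for the preservation_state header field."""

    def __init__(self) -> None:
        self.state = "unpreserved"  # initial state of any snapshot


def preserve(header: SnapshotHeader) -> None:
    header.state = "preserved"


def can_clone(header: SnapshotHeader) -> bool:
    # Only snapshots in the "preserved" state may be cloned.
    return header.state == "preserved"


def unpreserve(header: SnapshotHeader, has_children) -> bool:
    """has_children is a callable that does the slow scan across pools."""
    if header.state != "preserved":
        raise ValueError("snapshot is not preserved")
    header.state = "unpreserving"    # blocks new clones before the scan
    if has_children():
        header.state = "preserved"   # assumption: revert on failure
        return False
    header.state = "unpreserved"
    return True
```

Because the state changes before the scan starts, a clone attempt
that arrives during the scan is refused instead of slipping in
between the scan and the final state change.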

> unpreserve" command checks for any child images. Only snapshots in the
> "preserved" state may be cloned, so the "unpreserving" state prevents
> a race like:
> 
> 1. A: walk through all pools, look for clones, find none
> 2. B: create a clone
> 3. A: unpreserve parent
> 4. A: rbd snap rm pool/parent@snap
> 
> Resizing
> ^^^^^^^^
> 
> To support resizing of layered images, we need to keep track of the
> minimum size the image ever was, so that if a child image is shrunk

We don't want the minimum size.  We want to know the highest valid
offset in the image:
- Upon cloning, the last valid offset of the clone is set to the last
  valid offset of the snapshot.
- If an image is resized larger, the last valid offset remains the same.
- If an image is resized smaller, the last valid offset is reduced to
  the new, smaller size.
- If data is written to an image at an offset between the last valid
  offset and the image size, the last valid offset is updated to
  reflect the newly-written data.
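
A rough sketch of these four rules, with hypothetical names.  Here
valid_end is an exclusive end offset (one past the last valid byte),
and using min() on shrink is my reading of "reduced to the new,
smaller size".

```python
class CloneExtent:
    """Tracks the image size and the end of the range that holds valid data."""

    def __init__(self, snapshot_valid_end: int, size: int) -> None:
        self.size = size
        self.valid_end = snapshot_valid_end  # inherited from the snapshot

    def resize(self, new_size: int) -> None:
        if new_size < self.size:
            # Shrinking cuts off anything above the new size.
            self.valid_end = min(self.valid_end, new_size)
        # Growing leaves valid_end untouched.
        self.size = new_size

    def write(self, offset: int, length: int) -> None:
        end = offset + length
        if end > self.size:
            raise ValueError("write past end of image")
        # A write above valid_end extends the valid range.
        self.valid_end = max(self.valid_end, end)


c = CloneExtent(snapshot_valid_end=100, size=100)
c.resize(200)      # grow: valid_end stays at 100
c.resize(50)       # shrink: valid_end drops to 50
c.resize(200)      # grow again: valid_end stays at 50
c.write(120, 10)   # write above valid_end: valid_end becomes 130
```

Note that after the last step the range 50..120 sits below valid_end
without ever having been written, so the rules as stated still need
to say what a read there returns.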

> and then expanded, the re-expanded space is treated as unused instead
> of being read from the parent image. Since this can change over time,
> we need to store this for each snapshot as well.
> 
> Renaming
> ^^^^^^^^
> 
> Currently the rbd header object (that stores all the metadata about an
> image) is named after the name of the image. This makes renaming
> disrupt clients who have the image open (such as children reading from
> a parent image). To avoid this, we can name the header object by the
> id of the image, which does not change. That is, the name of the
> header object could be `rbd_header.$id`, where $id is a unique id for
> the image in the pool.

This is very good.

> When a client opens an image, all it knows is the name. There is
> already a per-pool `rbd_directory` object that maps image names to
> ids, but if we relied on it to get the id, we could not open any
> images in that pool if that single object was unavailable. To avoid
> this dependency, we can store the id of an image in an object called
> `rbd_id.$image_name`, where $image_name is the name of the image. The
> per-pool `rbd_directory` object is still useful for listing all images
> in a pool, however.
> 
> Header changes
> --------------
> 
> The header needs a few new fields:
> 
> * uint64_t parent_pool_id
> * string parent_image_id
> * uint64_t parent_snap_id
> * uint64_t min_size (smallest size the image ever was in bytes)
> * bool has_parent

Can't we avoid the Boolean here and just designate some sort of
well-known parent image id to be used to indicate "no parent"?

> Note that all the image ids are strings instead of uint64_t to let us
> easily switch to uuids in the future.

Are we planning to begin this sort of conversion any time soon?

> cls_rbd
> ^^^^^^^
> 
> Some new methods are needed:
> ::
> 
>     /***************** methods on the rbd header *********************/
>     /**
>      * Sets the parent, min_size, and has_parent keys.
>      * Fails if any of these keys exist, since the image already
>      * had a parent.
>      */
>     set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)

    set_parent(string snap_name)  (if we had a snap_name abstraction)

>     /**
>      * Returns the parent pool id, image id, and snap id, or -ENOENT
>      * if has_parent is false
>      */
>     get_parent(uint64_t snapid)
> 
>     /**
>      * Set has_parent to false.
>      */
>     remove_parent() // after all parent data is copied to the child

Is this saying that the image has no parent once the asynchronous
copying of parent data to child has completed?  (Or would it be
synchronous?)  Or is this saying that the caller has to be sure
the data is copied before calling remove_parent()?

>     /*************** methods on the rbd_children object *****************/
> 
>     add_child(uint64_t parent_pool_id, string parent_image_id,
>               uint64_t parent_snap_id, string image_id);

    add_child(string snap_name, string image_id) (?)

>     remove_child(uint64_t parent_pool_id, string parent_image_id,
>                  uint64_t parent_snap_id, string image_id);
>     /**
>      * List image ids of a given parent
>      */
>     get_children(uint64_t parent_pool_id, string parent_image_id,
>                  uint64_t parent_snap_id, uint64_t max_return,
>                  string start);

This is the one that requires an exhaustive query across all pools.

This kind of interface implies there is a well-defined ordering of
image ids.  What does "start" look like?  I guess this generally
raises a lot of questions about whether this (or perhaps some other)
interface can reliably produce an accurate list.  (It's like the
directory interfaces--opendir(), readdir(), seekdir(), etc.)
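
For what it's worth, one common convention (an assumption on my part,
not something the draft states) is that "start" is the last id the
caller has already seen, results come back in sorted order, and an
empty string means "from the beginning":

```python
from typing import List


def get_children(child_ids: List[str], max_return: int,
                 start: str = "") -> List[str]:
    """Return up to max_return ids strictly greater than start, sorted."""
    later = sorted(i for i in child_ids if i > start)
    return later[:max_return]


def list_all(child_ids: List[str], page_size: int) -> List[str]:
    """Page through the whole list using the last id seen as the marker."""
    out: List[str] = []
    start = ""
    while True:
        page = get_children(child_ids, page_size, start)
        if not page:
            return out
        out.extend(page)
        start = page[-1]
```

With that convention an id added or removed between calls can only
affect pages that have not been fetched yet, but the full listing is
still not a consistent snapshot.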

>     /**
>      * List parent images
>      */
>     get_parents(uint64_t max_return, uint64_t start_pool_id,
>                 string start_image_id, string start_snap_id);

This interface implies an ordering across pool ids, image ids,
and snapshot ids that I'm not sure we want to rely on.


>     /************ methods on the rbd_id.$image_name object **************/
>     /**
>      * Create the object and set the id. Fail and return -EEXIST if
>      * the object exists.
>      */
>     create_id(string id)
>     get_id()
> 
>     /***************** methods on the rbd_data objects ******************/
>     /**
>      * Create an object with parent_data as its contents,
>      * then write child_data to it. If the exclusive create fails,
>      * just write the child_data.
>      */
>      copy_up(char *parent_data, uint64_t parent_data_len,
>              char *child_data, uint64_t child_data_offset,
>              uint64_t child_data_length)
> 
> One existing method will change if the image supports
> layering:
> ::
> 
>     snapshot_add - stores current min_size and has_parent with
>                    other snapshot metadata (images that don't have
>                    layering enabled aren't affected)

OK, that's all I've got.

					-Alex
> 
> librbd
> ^^^^^^
> 
> Opening a child image opens its parent (and this will continue
> recursively as needed). This means that an ImageCtx will contain a
> pointer to the parent image context. Differing object sizes won't
> matter, since reading from the parent will go through the parent
> image context.
> 
> Discard will need to change for layered images so that it only
> truncates objects, and does not remove them. If we removed objects, we
> could not tell if we needed to read them from the parent.
> 
> A new clone method will be added, which takes the same arguments as
> create except size (size of the parent image is used).



* Re: RBD layering design draft
  2012-06-18 23:07       ` Dan Mick
@ 2012-06-22  2:02         ` Alex Elsayed
  2012-06-22 14:41           ` Guido Winkelmann
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Elsayed @ 2012-06-22  2:02 UTC (permalink / raw)
  To: ceph-devel

Dan Mick <dan.mick <at> inktank.com> writes:

> 
> On 06/18/2012 11:01 AM, Sage Weil wrote:
> > On Mon, 18 Jun 2012, Josh Durgin wrote:
> >>>>      $ rbd copyup pool2/child1

> "disown" and "adopt"?  :)  (actually I started as a joke, but really I 
> kinda like that; fits with the parent-child name)

The issue I see with that is that the argument refers to the child rather than
the parent, so it doesn't match. I personally like 'unshare' since it'll also
work in the dedup case, but if we stick with the parent/child terminology
'emancipate' might work (although it lacks a good reverse).




* Re: RBD layering design draft
  2012-06-18 17:00 ` Tommi Virtanen
  2012-06-18 17:14   ` Josh Durgin
@ 2012-06-22 14:36   ` Guido Winkelmann
  2012-06-22 16:00     ` Tommi Virtanen
  1 sibling, 1 reply; 19+ messages in thread
From: Guido Winkelmann @ 2012-06-22 14:36 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On Monday, 18 June 2012, 10:00:32, you wrote:
> On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> >    $ rbd unpreserve pool/image@snap
> >    Error unpreserving: child images rely on this image
> 
> UX nit: this should also say what image it found.
> 
> rbd: Cannot unpreserve: Still in use by pool2/image2

What if it's in use by a lot of images? Should it print them all, or should it 
print something like "Still in use by pool2/image2 and 50 others, use 
list_children to see them all"?

	Guido


* Re: RBD layering design draft
  2012-06-21 21:51 ` Alex Elder
@ 2012-06-22 14:37   ` Guido Winkelmann
  2012-06-22 16:27   ` Tommi Virtanen
  1 sibling, 0 replies; 19+ messages in thread
From: Guido Winkelmann @ 2012-06-22 14:37 UTC (permalink / raw)
  To: elder, ceph-devel

> On 06/15/2012 03:48 PM, Josh Durgin wrote:

> > Then you can perform the clone:
> >     $ rbd clone --parent pool/parent@snap pool2/child1
> 
> Based on my comments above, if the parent had not been "preserved"
> it would automatically be at this point, by virtue of the fact it
> has a clone associated with it.
> 
> Since there is always exactly one parent and one child, I'd say
> drop the "--parent" and just have the parent and child be
> defined by their position.  If the parent could be optionally
> skipped for some reason, then make it be the second one.

I think that would be a very bad idea. clone <source> <target> would be a good 
idea; nearly all similar commandline utilities (cp, mv, ln) work like that. 
clone <target> <source> would be counterintuitive and probably lead to 
otherwise avoidable mistakes.

        Guido


* Re: RBD layering design draft
  2012-06-22  2:02         ` Alex Elsayed
@ 2012-06-22 14:41           ` Guido Winkelmann
  0 siblings, 0 replies; 19+ messages in thread
From: Guido Winkelmann @ 2012-06-22 14:41 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

On Friday, 22 June 2012, 02:02:38, Alex Elsayed wrote:
> Dan Mick <dan.mick <at> inktank.com> writes:
> > On 06/18/2012 11:01 AM, Sage Weil wrote:
> > > On Mon, 18 Jun 2012, Josh Durgin wrote:
> > >>>>      $ rbd copyup pool2/child1
> > 
> > "disown" and "adopt"?  :)  (actually I started as a joke, but really I
> > kinda like that; fits with the parent-child name)
> 
> The issue I see with that is that the argument refers to the child rather
> than the parent, so it doesn't match. I personally like 'unshare' since
> it'll also work in the dedup case, but if we stick with the parent/child
> terminology 'emancipate' might work (although it lacks a good reverse).

AFAIK the word started in ancient Rome as meaning to release slaves into 
freedom, so I suppose the opposite would be enslave?

	Guido


* Re: RBD layering design draft
  2012-06-22 14:36   ` Guido Winkelmann
@ 2012-06-22 16:00     ` Tommi Virtanen
  0 siblings, 0 replies; 19+ messages in thread
From: Tommi Virtanen @ 2012-06-22 16:00 UTC (permalink / raw)
  To: Guido Winkelmann; +Cc: ceph-devel

On Fri, Jun 22, 2012 at 7:36 AM, Guido Winkelmann
<guido-ceph@thisisnotatest.de> wrote:
>> rbd: Cannot unpreserve: Still in use by pool2/image2
>
> What if it's in use by a lot of images? Should it print them all, or should it
> print something like "Still in use by pool2/image2 and 50 others, use
> list_children to see them all"?

As walking through all the (potential) clones is an expensive
operation, this should abort as soon as possible, and just complain
about the one encountered so far. That could easily be a difference of
a few seconds vs tens of seconds. We don't even know the count,
without paying that cost, so that can't be printed either.
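
Roughly, as a sketch: the nested dict below is a hypothetical
stand-in for the per-pool `rbd_children` objects, and the function
stops at the first child it finds instead of visiting every pool.

```python
from typing import Dict, Optional, Set, Tuple

# (parent pool id, parent image id, parent snapshot id)
Parent = Tuple[int, str, int]


def first_child(pools: Dict[str, Dict[Parent, Set[str]]],
                parent: Parent) -> Optional[str]:
    """Return "pool/child" for the first child found, or None."""
    for pool_name, rbd_children in pools.items():
        for child_id in sorted(rbd_children.get(parent, ())):
            return f"{pool_name}/{child_id}"  # stop scanning right here
    return None


parent = (1, "image", 7)
pools = {
    "pool1": {},
    "pool2": {parent: {"image2", "image3"}},
    "pool3": {parent: {"image9"}},
}
child = first_child(pools, parent)
if child is not None:
    message = f"rbd: Cannot unpreserve: Still in use by {child}"
```

Here pool3 is never looked at, which is exactly why no total count is
available for the error message.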


* Re: RBD layering design draft
  2012-06-21 21:51 ` Alex Elder
  2012-06-22 14:37   ` Guido Winkelmann
@ 2012-06-22 16:27   ` Tommi Virtanen
  1 sibling, 0 replies; 19+ messages in thread
From: Tommi Virtanen @ 2012-06-22 16:27 UTC (permalink / raw)
  To: elder; +Cc: Josh Durgin, ceph-devel

On Thu, Jun 21, 2012 at 2:51 PM, Alex Elder <elder@dreamhost.com> wrote:
>> Before cloning a snapshot, you must mark it as preserved, to prevent
>> it from being deleted while child images refer to it:
>> ::
>>
>>     $ rbd preserve pool/image@snap
>
> Why is it necessary to do this?  I think it may be desirable to

So the snapshot will not be removed.

See this: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6595/focus=6675

>>     $ rbd clone --parent pool/parent@snap pool2/child1
>
> Based on my comments above, if the parent had not been "preserved"
> it would automatically be at this point, by virtue of the fact it
> has a clone associated with it.

The client creating the child typically has no write access to the
parent, and cannot do anything to it.

>> To delete the parent, you must first mark it unpreserved, which checks
>> that there are no children left:
>> ::
>
> Please show what happens here if this is done at this point:
>
>      $ rbd snap rm pool/image@snap

rbd: Cannot remove a preserved snapshot: pool/image@snap

or something like that.

> Note that the "preserve" and "unpreserve" operations are
> valid on snapshots, not RBD images or clones.

That's a very good point. Perhaps the command should be "rbd snap
preserve" and "rbd snap unpreserve".

>> In the initial implementation, called 'trivial layering', there will
>> be no tracking of which objects exist in a clone. A read that hits a
>> non-existent object will attempt to read from the parent object, and
>> this will continue recursively until an object exists or an image with
>> no parent is found.
>
> So a non-existent object in a clone is a bit like a hole in a file, but
> instead of implicitly backing it with zeroes it backs it with the data
> found at the same range as the snapshot the clone was based on?

Yes.

Continuation of that: will the clone store sparse objects, or always
copy all the data for that object from the parent? That is, what
happens if I write 1 byte to a fresh clone? (And remember that block
sizes can differ.)

> If a clone had snapshots, does this mean a snapshot can include
> non-existent objects in it?

I don't like the phrase "include non-existent objects", and find that
an overambitious topological exercise, but yes, a snapshot may be
sparse.

Reads fall through toward parents until they find something -- or run
out of parents, in which case they read zeros.
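
As a sketch of that fall-through (the Layer class is hypothetical, and
it assumes every layer uses the same object size, which sidesteps the
differing-order question):

```python
from typing import Dict, Optional


class Layer:
    """Stand-in for an image or snapshot: a sparse map of object index to data."""

    def __init__(self, objects: Dict[int, bytes],
                 parent: Optional["Layer"] = None) -> None:
        self.objects = objects
        self.parent = parent


def read_object(layer: Optional[Layer], index: int, size: int) -> bytes:
    """Walk toward the parents until some layer has the object."""
    while layer is not None:
        data = layer.objects.get(index)
        if data is not None:
            return data
        layer = layer.parent
    return bytes(size)  # ran out of parents: read zeros


base = Layer({0: b"AAAA"})
middle = Layer({1: b"BBBB"}, parent=base)
top = Layer({}, parent=middle)
```

Reading object 0 through `top` falls through two layers to `base`,
object 1 stops at `middle`, and object 2 exists nowhere and reads as
zeros.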

> Does this mean that an attempt to read beyond the end of an RBD snapshot
> is not an error if the read is being done for a clone whose size has
> been increased from what it was originally?  (In that case, the correct
> action would be to read the range as zeroes.)

This was discussed later in the email, and I see you responded to that part.

>> In addition to knowing which parent a given image has, we want to be
>> able to tell if a preserved image still has children. This is
>> accomplished with a new per-pool object, `rbd_children`, which maps
>> (parent pool, parent id, parent snapshot id) to a list of child
>
> My first thought was, why does the parent snapshot need to know the
> *identity* of its descendant clones?  The main thing it seems to need
> is a count of the number of clones it has.

Maintaining that count in a distributed system, without listing the
things that are in it, gets challenging. Idempotent counters are
challenging. Maintaining it as a set is easier, significantly more
debuggable, and unlikely to be too costly. Plus it lets us serve "rbd
children" faster.
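
A toy illustration of the difference, with hypothetical in-memory
structures: if a client retries add_child after a timeout, the set
ends up the same, while a bare counter is now off by one.

```python
from collections import defaultdict

# (parent pool id, parent image id, parent snap id) -> child image ids
children = defaultdict(set)
count = defaultdict(int)


def add_child(parent, child_id):
    children[parent].add(child_id)  # adding twice is a no-op


def add_child_counted(parent):
    count[parent] += 1              # adding twice counts twice


parent = (1, "golden", 7)
for _ in range(2):                  # the client retries after a timeout
    add_child(parent, "child1")
    add_child_counted(parent)
```

The set also says which child is still holding the parent, which is
what makes it debuggable.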

> The other thing though is that you shouldn't store the mapping
> in the "rbd_children" object.  Instead, you should only store
> the child object ids there, and consult those objects to identify
> their parents.  Otherwise you end up with problems related to
> possible discrepancy between what a child points to and what the
> "rbd_children" mapping says.

The question we need to ask is "who here is a child of $FOO". Needing
an indirection for every member makes that cost a lot more.

>> image ids. This is stored in the same pool as the child image
>> because the client creating a clone already has read/write access to
>> everything in this pool. This lets a client with read-only access to
>> one pool clone a snapshot from that pool into a pool they have full
>> access to. It increases the cost of unpreserving an image, since this
>
> This is really a bad feature of this design because it doesn't scale.
> So we ought to be thinking about a better way to do it if possible.

That would be nice. Good luck! We await your email, though not holding
our breath ;)

>> To support resizing of layered images, we need to keep track of the
>> minimum size the image ever was, so that if a child image is shrunk
>
> We don't want the minimum size.  We want to know the highest valid
> offset in the image:
> - Upon cloning, the last valid offset of the clone is set to the last
>  valid offset of the snapshot.
> - If an image is resized larger, the last valid offset remains the same.
> - If an image is resized smaller, the last valid offset is reduced to
>  the new, smaller size.
> - If data is written to an image at an offset between the last valid
>  offset and the image size, the last valid offset is updated to
>  reflect the newly-written data.

If I resize the child down, then resize it up again, and write in the
middle of the resized range, will the non-written parts above your
valid_offset be zero? That sounds like a difference in your & Josh's
designs, and something you two need to sort out.

>>     get_children(uint64_t parent_pool_id, string parent_image_id,
>>                  uint64_t parent_snap_id, uint64_t max_return,
>>                  string start);
>
> This is the one that requires an exhaustive query across all pools.
>
> This kind of interface implies there is a well-defined ordering of
> image ids.  What does "start" look like?  I guess this generally
> raises a lot of questions about whether this (or perhaps some other)
> interface can reliably produce an accurate list.  (It's like the
> directory interfaces--opendir(), readdir(), seekdir(), etc.)

It's racy against concurrent changes, sure. But we only care about the
races when the parent is preserved, and that guarantees there won't be
new (relevant) children created.

>>     get_parents(uint64_t max_return, uint64_t start_pool_id,
>>                 string start_image_id, string start_snap_id);
>
> This interface implies an ordering across pool ids, image ids,
> and snapshot ids that I'm not sure we want to rely on.

They're all either numbers or strings, and have a clear correct
hierarchical order of (pool, image, snap).
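
For illustration only, with made-up ids: comparing the tuples
element by element gives that hierarchical order directly.  The one
thing to watch is that numeric ids need to be compared as numbers,
since as strings "10" sorts before "9".

```python
# (pool id, image id, snapshot id) tuples; the values are made up.
parents = [
    (2, "img-a", 1),
    (1, "img-b", 9),
    (1, "img-b", 3),
    (1, "img-a", 5),
]

# Tuples compare element by element: pool first, then image, then snap.
ordered = sorted(parents)

# Numeric ids compared as strings would sort differently.
string_order_differs = "10" < "9"
```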


end of thread, other threads:[~2012-06-22 16:27 UTC | newest]

Thread overview: 19+ messages
2012-06-15 20:48 RBD layering design draft Josh Durgin
2012-06-16  0:46 ` Sage Weil
2012-06-16  2:00   ` Yehuda Sadeh
2012-06-16 15:11     ` Sage Weil
2012-06-17 13:42       ` Martin Mailand
2012-06-18 18:04         ` Gregory Farnum
2012-06-18 16:25   ` Tommi Virtanen
2012-06-18 23:10     ` Dan Mick
2012-06-18 17:00 ` Tommi Virtanen
2012-06-18 17:14   ` Josh Durgin
2012-06-18 18:01     ` Sage Weil
2012-06-18 23:07       ` Dan Mick
2012-06-22  2:02         ` Alex Elsayed
2012-06-22 14:41           ` Guido Winkelmann
2012-06-22 14:36   ` Guido Winkelmann
2012-06-22 16:00     ` Tommi Virtanen
2012-06-21 21:51 ` Alex Elder
2012-06-22 14:37   ` Guido Winkelmann
2012-06-22 16:27   ` Tommi Virtanen
