* rbd layering
From: Sage Weil @ 2011-02-01  6:08 UTC (permalink / raw)
  To: ceph-devel

One idea we've talked about a fair bit is layering RBD images.  The idea 
would be to create a new image in O(1) time that mirrors an old image and 
gets copy-on-write semantics, like a writeable snapshot.

We've come up with a few different approaches for doing this, each with 
somewhat different performance characteristics.  The main consideration is 
that RBD images do not (currently) have an "allocation table."  Image data 
is simply striped over objects (that may or may not exist).  You read the 
object for a given block to see if it exists; if it doesn't (a "hole"), 
the content is defined to be zero-filled.

(I'll use the terms "block" and "object" interchangeably to mean the object 
that stores each RBD block.  They're 4MB by default, but can be set to any 
size you want at image creation time.)

1- copy-up on first write
  - reads
    - read child image object.  if it doesn't exist, read parent block.
    -> reads to unchanged data are slower
  - writes
    - write to child image block.  if it doesn't exist, OSD will return 
      ENOENT.  the client would do a copy up (copy parent block to child 
      block), and then redo the write.
    -> first writes are slow, especially if the block existed in the parent
       (a client-side sketch of this follows approach 2 below).
  - trim/discard
    - truncate the child object to zero, but do not delete it.

2- sparse objects
  - make the OSDs maintain allocation metadata for each object so that we 
    know which parts of the object are defined and which are holes (a 
    relatively easy thing to do).
  - writes
    - write to modified region of child object.
  - reads
    - read child image object AND allocation map.  read parent object for 
      any holes (or when child object doesn't exist)
    -> more efficient data transfer when objects are sparse.
    -> reads to unchanged data are slower (as above)
  - trim/discard
    - need to somehow distinguish between a hole that falls through to the 
      parent and a hole that is defined to be zero by the child image.
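
Here's a minimal, self-contained sketch of approach 1 from the client's 
point of view.  The dict-backed "images" stand in for the real client/OSD 
interaction (in reality the OSD returns ENOENT and the client redoes the 
write); none of this is actual librbd/librados API:

    BLOCK = 4 << 20                      # 4MB default block size

    def read(child, parent, blk):
        if blk in child:                 # child object exists
            return bytes(child[blk]).ljust(BLOCK, b"\0")  # short reads as zeros
        return parent.get(blk, b"\0" * BLOCK)  # hole: fall through to parent

    def write(child, parent, blk, off, data):
        if blk not in child:             # the OSD would return ENOENT here
            # copy-up: copy the parent block (or zeros) into the child,
            # then redo the original write below
            child[blk] = bytearray(parent.get(blk, b"\0" * BLOCK))
        obj = child[blk]
        if len(obj) < off + len(data):   # zero-extend a truncated object
            obj.extend(b"\0" * (off + len(data) - len(obj)))
        obj[off:off + len(data)] = data

    def discard(child, blk):
        child[blk] = bytearray(0)        # truncate to zero, but keep the object

Note that discard leaves an empty child object behind, so later reads 
return zeros instead of falling through to the parent.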

In both cases, we could add an (optional) allocation bitmap to the parent 
image, so that a read that isn't satisfied by the child can skip the 
fall-through lookup when the parent block doesn't exist either.  That 
could be an explicit step taken by an administrator (e.g. after marking 
the parent read-only) to improve performance for overlaid images.  
(Maintaining a consistent bitmap for all images is non-trivial, and would 
slow things down considerably.)

A few use cases for all of this:
 - "golden" VM images
 - writeable snapshots
 - image migration between pools
   - pause io
   - mark parent read-only
   - create "child" image
   - unpause io, redirect to the new child
   (these steps are all fast and O(1)!)
   - asynchronously copy-up parent blocks to the child (this is O(n))
   - once this is done, remove the child's parent reference and discard 
     the parent
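
A rough sketch of that flow; every call here (pause_io, create_image, 
copyup_block, and so on) is hypothetical, since none of these interfaces 
exist yet:

    def migrate_image(cluster, src_pool, dst_pool, name):
        parent = cluster.open_image(src_pool, name)
        cluster.pause_io(parent)                    # O(1)
        parent.set_read_only()                      # O(1)
        child = cluster.create_image(dst_pool, name,
                                     parent=parent) # O(1)
        cluster.redirect_io(name, child)            # O(1)
        cluster.unpause_io(child)                   # clients resume here
        for blk in range(child.num_blocks()):       # O(n), in the background
            child.copyup_block(blk)                 # no-op if already written
        child.clear_parent()                        # drop the overlay
        cluster.remove_image(parent)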

sage


* Re: rbd layering
From: Tommi Virtanen @ 2011-02-01 17:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Mon, Jan 31, 2011 at 10:08:31PM -0800, Sage Weil wrote:
> A few use cases for all of this:
>  - "golden" VM images
>  - writeable snapshots
>  - image migration between pools
>    - pause io
>    - mark parent read-only
>    - create "child" image
>    - unpause io, redirect to the new child
>    (these steps are all fast and O(1)!)
>    - asynchronously copy-up parent blocks to the child (this is O(n))
>    - once this is done, remove the child's parent reference and discard 
>      the parent

Two things to add:

- creating something that can be later used to do lazy deduplication
  in the background would be good; you don't always "start from the
  golden master", but you still have images that are 99% identical.

- the "child has zero extents that should not be read from master"
  case would happen mostly by TRIM operations, these days.

-- 
:(){ :|:&};:


* Re: rbd layering
From: Colin McCabe @ 2011-02-02  7:13 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@newdream.net> wrote:
> One idea we've talked about a fair bit is layering RBD images.  The idea
> would be to create a new image in O(1) time that mirrors an old image and
> gets copy-on-write semantics, like a writeable snapshot.
>
> We've come up with a few different approaches for doing this, each with
> somewhat different performance characteristics.  The main consideration is
> that RBD images do not (currently) have an "allocation table."  Image data
> is simply striped over objects (that may or may not exist).  You read the
> object for a given block to see if it exists; if it doesn't (a "hole"),
> the content is defined to be zero-filled.

Have we thought about the hash-table-based approach yet, where every
block gets hashed and we only store one copy of each? I guess this is
basically how git works, except instead of fixed-size blocks, it
tracks variable-sized blobs. This is also how ZFS dedupe works.

The nice thing about the hash-table-based approach is that you don't
have to track parent-child relationships explicitly. If two users
happen to both install CentOS 5.5 with the same settings on the same
sized image, they'll both be deduped automatically.

The disadvantage, of course, is that you need to hash the blocks.
There is also some tiny probability of a hash collision, which you
could mitigate with a longer hash or with hash chaining.
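
As a toy, dict-backed illustration of what collision chaining could look
like (a real version would live inside RADOS; all names here are made up):

    import hashlib

    def cas_put(store, data):
        """Store a block under its content hash; return the object name."""
        digest = hashlib.sha256(data).hexdigest()
        n = 0
        while True:
            name = "cas_%s_%d" % (digest, n)
            if name not in store:
                store[name] = data       # first copy of this content
                return name
            if store[name] == data:
                return name              # dedup hit: share the existing object
            n += 1                       # genuine collision: chain to next slot

    store = {}
    a = cas_put(store, b"x" * 4096)
    b = cas_put(store, b"x" * 4096)
    assert a == b and len(store) == 1    # identical blocks are stored once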

The big disadvantage of the allocation table-based approaches, at
least in my mind, is that they don't feel very block device-y.
Allocation maps are things that normally go in a file system rather
than in a block device.

If we do go with an allocation-table based approach, what would the
API look like from the administrator's point of view? I guess I
imagine some kind of API where I create a child RBD block device from
a parent RBD device. Then whenever I wrote to the child image, it
would "re-dupe" the two block devices. (It seems like the amount of
sharing would start at 100% and just go down from there... unless my
analysis is missing something?)

Another possibility is that we could simply run qcow2 over rbd. qcow2
already implements copy-on-write at a higher level of the stack.

I took a quick look at the qcow2 image format at:
http://people.gnome.org/~markmc/qcow-image-format.html

It looks suspiciously like something I've seen before :)
http://en.wikipedia.org/wiki/Inode_pointer_structure

sincerely,
Colin




* Re: rbd layering
From: Gregory Farnum @ 2011-02-02  7:24 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Sage Weil, ceph-devel

On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> Have we thought about the hash-table-based approach yet, where every
> block gets hashed and we only store one copy of each? I guess this is
> basically how git works, except instead of fixed-size blocks, it
> tracks variable-sized blobs. This is also how ZFS dedupe works.
>
> The nice thing about the hash-table-based approach is that you don't
> have to track parent-child relationships explicitly. If two users
> happen to both install CentOS 5.5 with the same settings on the same
> sized image, they'll both be deduped automatically.
How would you place the blocks in a CAS-based block device like this?
An allocation table might feel ugly, but when you're doing
cluster-wide block sharing you're going to need the extra metadata
somewhere. Better to store an allocation table than try to maintain
the coherency required for dynamic de-dup like that.

I guess I should say that de-dup would be a nice feature to support,
but I don't think it's appropriate to implement as part of RBD.
Anything that powerful needs to be a core RADOS feature.


* Re: rbd layering
From: Yehuda Sadeh Weinraub @ 2011-02-02  7:34 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Sage Weil, ceph-devel

On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> Have we thought about the hash-table-based approach yet, where every
> block gets hashed and we only store one copy of each? I guess this is
> basically how git works, except instead of fixed-size blocks, it
> tracks variable-sized blobs. This is also how ZFS dedupe works.
>

Long ago there were some plans to introduce content addressable
storage at the osd level. We will probably want to have something like
that sometime, but we'd rather introduce it as a proper osd/rados
feature and not as some hack tailored specifically for rbd. I don't
want to start digging into the architectural requirements, but my gut
feeling says that it's not going to be trivial (as an understatement)
and its benefits compared to what we'd lose (simplicity, performance)
are marginal.

Yehuda


* Re: rbd layering
From: Kiran Patil @ 2011-02-02  7:41 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Hello,

Josef has added Offline Deduplication for Btrfs.

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07777.html



--
Kind Regards,
---------------------

Kiran T Patil


* Re: rbd layering
From: Colin McCabe @ 2011-02-02  7:51 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Sage Weil, ceph-devel

On Tue, Feb 1, 2011 at 11:24 PM, Gregory Farnum <gregf@hq.newdream.net> wrote:
> How would you place the blocks in a CAS-based block device like this?
> An allocation table might feel ugly, but when you're doing
> cluster-wide block sharing you're going to need the extra metadata
> somewhere. Better to store an allocation table than try to maintain
> the coherency required for dynamic de-dup like that.

You could chunk the hash table over several OSDs. Then you only need
to worry about doing atomic operations on a given hash table entry,
which will of course be protected by a single PG lock.

Yehuda is probably right though... it's not 100% clear that the
benefits outweigh the disadvantages, given that it would need an extra
lookup for every operation. In the end it's something that probably
will take some experimentation to get right.

Colin


* Re: rbd layering
From: Sage Weil @ 2011-02-02 17:47 UTC (permalink / raw)
  To: Colin McCabe; +Cc: Gregory Farnum, ceph-devel

On Tue, 1 Feb 2011, Colin McCabe wrote:
> Yehuda is probably right though... it's not 100% clear that the
> benefits outweigh the disadvantages, given that it would need an extra
> lookup for every operation. In the end it's something that probably
> will take some experimentation to get right.

Right.  The nice thing about RBD is its simplicity: there is almost no 
metadata.  Just the block size, image size, and object name prefix.  
That's enough to name the object with the data you want, and that object 
may or may not exist, depending on whether it's been written to.  There 
are no consistency concerns.

When I mentioned an allocation bitmap before, I meant simply a bitmap 
specifying whether each block exists, which would let us avoid looking 
for an object in the parent image.  In its simplest form, you would mark 
the image read-only, then generate the bitmap once.  
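
A sketch of that one-shot generation step; exists() and the object 
naming here are placeholders rather than the actual librados calls or 
rbd object layout:

    def build_allocation_bitmap(exists, prefix, num_blocks):
        bitmap = bytearray((num_blocks + 7) // 8)
        for b in range(num_blocks):
            if exists("%s.%d" % (prefix, b)):    # does the block object exist?
                bitmap[b // 8] |= 1 << (b % 8)   # mark block b allocated
        return bytes(bitmap)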

Anything more complicated than that and you have to worry about keeping 
the metadata consistent with the data.  CAS, for example, requires lots of 
metadata: if I want to read block 1234, I have to look up in some table 
that says 1234 has sha1 FOO, and then go read that object.  Then writing 
is a whole other story.  Again, doing CAS on a read-only image simplifies 
things greatly, but I don't think we should go down that road now.

Mainly I'm interested in feedback on the simple layering use-case...

sage


* Re: rbd layering
From: Gregory Farnum @ 2011-02-02 18:15 UTC (permalink / raw)
  To: ceph-devel; +Cc: Sage Weil

On Wed, Feb 2, 2011 at 9:47 AM, Sage Weil <sage@newdream.net> wrote:
> When I mentioned an allocation bitmap before, I meant simply a bitmap
> specifying whether each block exists, which would let us avoid looking
> for an object in the parent image.  In its simplest form, you would mark
> the image read-only, then generate the bitmap once.
> ...
> Mainly I'm interested in feedback on the simple layering use-case...
So my thought with the bitmap was that it might make more sense for
rbd to maintain a bitmap specifying whether the child has overwritten
the parent block device. Then when doing a read in the parent region,
rbd defaults to reading the parent block device unless the bitmap says
the child has overwritten it.
This is reasonably fast in terms of avoiding failed reads: assuming
the bitmap is kept in memory, a read never needs multiple attempts,
and if the client zeroes a block, or overwrites a block and then
deletes it, there's no concern about inappropriate fall-through to the
parent, since the bitmap still has that block marked as overwritten.

The other advantage to something like this is that it could allow
overwriting at a finer level than the size of the child's blocks. For
instance, you might store in 4MB chunks, but that's a bit large for
some things that are going to commonly change between images like
config files. So with a bitmap with, say, 1KB resolution you could
change a config file and have rbd read the block from the parent and
then plug in the 1KB containing the config file that the child
overwrote. This doesn't require much space: a 1KB-granularity bitmap
for a 1GB image is only 2^20 bits, i.e. 128KB.
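
A sketch of how that fine-grained read could work, with the bitmap
tracked per object for simplicity (illustrative names only):

    GRAIN = 1024                         # 1KB overwrite granularity

    def overlay_read(parent_block, child_block, overwritten):
        """overwritten[i] is True if the child wrote 1KB piece i."""
        out = bytearray(parent_block)    # start from the parent's data
        for i, dirty in enumerate(overwritten):
            if dirty:                    # patch in the child's 1KB pieces
                out[i*GRAIN:(i+1)*GRAIN] = child_block[i*GRAIN:(i+1)*GRAIN]
        return bytes(out)
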
-Greg


* Re: rbd layering
From: Christian Brunner @ 2011-02-03 20:36 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil


I would go for this kind of allocation bitmap, too.  However, I wonder
whether we could add TRIM support this way as well.

In my scenario the bitmap would be present in every image, with a
512-byte resolution to match the sector size of common hard disks, and
it would need to support three states:

0: Block is not allocated
1: Block is allocated in this image (child)
2: Block is allocated in the parent image

- When we create a new image the bitmap is filled with zero.
- When we clone an image we have to copy the bitmap and switch every
allocated block from state 1 to state 2.
- When we are writing to a block in state 0 or state 2 we have to set
it to state 1 and we will have to sync the bitmap to disk.
- When a block is discarded we set the state to 0 and we will have to
sync the bitmap to disk.
- When all blocks of an object are set to 0 we can delete the object.

This way the only performance impact would be at the first write to a block.
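
Transcribing those state transitions directly (illustrative code, with
sync() standing in for whatever persists the bitmap):

    UNALLOCATED, IN_CHILD, IN_PARENT = 0, 1, 2

    def clone_bitmap(parent_map):
        # cloning: every allocated block is now "allocated in the parent"
        return [IN_PARENT if s == IN_CHILD else s for s in parent_map]

    def on_write(bmap, blk, sync):
        if bmap[blk] != IN_CHILD:        # first write to this block
            bmap[blk] = IN_CHILD
            sync()                       # persist before acking the write
        # ... then write the data itself

    def on_discard(bmap, blk, sync):
        bmap[blk] = UNALLOCATED          # zeros, with no parent fall-through
        sync()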

Regards
Christian


* rbd layering
From: Sage Weil @ 2011-02-25 22:27 UTC (permalink / raw)
  To: ceph-devel

I wanted to follow up on the thread from a couple of weeks back and 
summarize where we're currently at.  The goal is to be flexible, so that 
we don't impose any performance penalty for features we don't use.  

The use cases are:

 - (fast) image creation from gold master (probably followed by growing 
the image/fs)
 - image migration (create child in new location; copyup old data 
asynchronously)


Here are the pieces we currently have:

(image == rbd image
 object == one object in the image, normally 4MB)

- Parent image pointer

Each image has an optional parent pointer that names a parent image.  The 
parent must be part of the same cluster, but can be in a different pool.  
It can be larger or smaller than the current image. 

It is assumed the parent is read-only.  I don't think anything sane can 
come out of doing a COW overlay over something that is changing.

- Object Bitmap

Each object in an image may have an OPTIONAL bitmap that represents 
transparency.  If a bit is set, then that part is defined by this image 
layer (it can be either object data or, if the object has a hole, zeros).  
If the bit is not set, then the content is defined by the parent image.  
The resolution can be a sector, a 4KB block, or anything else; at 4KB 
resolution, for example, the bitmap for a 4MB object is 1024 bits (128 
bytes).  If the resolution is larger than the smallest write unit, a 
write may require copy-up from the lower layer, so using the block size 
is recommended.

If the object bitmap does not exist, we assume the object is NOT 
transparent (i.e. bitmap is fully colored).  That gives us compatibility 
with old images, and lets us drop the bitmap once it gets fully colored.  
Only new images that support layering will create/use it.  

- Image bitmap

Each image may have an OPTIONAL bitmap that indicates which image objects 
(may) exist.  On write, a bit is set prior to creating each object.  On 
read, if a bitmap exists but the bit for an object is not set, we can go 
directly to the parent image.  If the bitmap does not exist, reads must 
always check for the child object before falling through to the parent 
image.  Writes in the no-bitmap case go straight to the child object.  
The bitmap size need not match the image size; it may, e.g., match the 
size of a smaller parent image.

Having two bitmaps is a design tradeoff.  We could use a sector/block 
resolution bitmap for the whole image, but it would increase memory use, 
and would require more "update image bitmap, wait, then write to object" 
cycles.  Having a per-object bitmap means we can atomically update the 
object bitmap for free when we do the write, and minimize the image bitmap 
updates to the first time each object is touched.

On read:
	if there is an image bitmap
		if bit is set
			read child object
			if there's an object bitmap that indicates transparency
				read holes from parent object
		else
			read parent object (*)
	else
		read child object
		if there is no child object, or bitmap indicates transparency
			read holes from parent object (*)

On write:
	if there is an image bitmap and bit is not set
		color image bitmap bit for this object
	if object bitmaps are enabled
		write to object
		color object bits too
	else
		if we are not writing the entire object    (*)
			read unwritten parts from parent   (*)
		write our data (+ copyup data from parent)

(*) These steps can be skipped if the parent image has holes here.  We 
would know that if the parent image bitmap bits are not set, or if we are 
past the end of the parent image size.

On trim/discard:
	if there is an image bitmap
		if bit is not set
			set image bitmap bit		
	truncate or zero object
	if object bitmap
		color appropriate bits
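
For concreteness, here's the read logic above transcribed into Python, 
for a whole-object read (parent_read/child_read and the bitmap accessors 
are hypothetical helpers, not real interfaces):

    def layered_read(img, obj):
        if img.image_bitmap is not None and not img.image_bitmap[obj]:
            return img.parent_read(obj)               # (*)
        data, obj_bitmap = img.child_read(obj)        # data None if no object
        if data is None:
            return img.parent_read(obj)               # (*)
        if obj_bitmap is not None:
            parent = img.parent_read(obj)             # (*)
            for i, defined in enumerate(obj_bitmap):  # per-chunk bits
                if not defined:                       # transparent: use parent
                    lo, hi = i * img.chunk, (i + 1) * img.chunk
                    data[lo:hi] = parent[lo:hi]
        return data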


Also: the image bitmap could be created after the fact.  I.e. once we 
decide to use something as a gold image/parent, we would generate the 
image bitmap (just check which objects exist) so that overlays would 
operate more efficiently.  We'll probably want a read-only flag in the 
image header too to help keep admins from shooting themselves in the foot.


- OSD copyup/merge operation

The last piece would be an OSD method to atomically copy a parent object 
up to the overlay image.  The goal is for the copyup to be a background, 
maybe low-priority process.  We would read the parent object, submit it 
to the child object, write only the parts that correspond to unset bits 
in the object bitmap, and then color in all the bits.
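
A sketch of that merge, as it might run on the OSD (illustrative only):

    def copyup(data, bitmap, parent_data, chunk):
        for i, defined in enumerate(bitmap):
            if not defined:                           # still transparent
                lo, hi = i * chunk, (i + 1) * chunk
                data[lo:hi] = parent_data[lo:hi]      # fill from the parent
        for i in range(len(bitmap)):
            bitmap[i] = True                          # now fully colored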


That's the current design.  Thoughts on, or corrections to, the above?

sage


