* rbd layering
@ 2011-02-01 6:08 Sage Weil
2011-02-01 17:43 ` Tommi Virtanen
2011-02-02 7:13 ` Colin McCabe
0 siblings, 2 replies; 11+ messages in thread
From: Sage Weil @ 2011-02-01 6:08 UTC (permalink / raw)
To: ceph-devel
One idea we've talked a fair bit about is layering RBD images. The idea
would be to create a new image in O(1) time that mirrors an old image and
get copy-on-write type semantics, like a writeable snapshot.
We've come up with a few different approaches for doing this, each with
somewhat different performance characteristics. The main consideration is
that RBD images do not (currently) have an "allocation table." Image data
is simply striped over objects (that may or may not exist). You read the
object for a given block to see if it exists; if it doesn't (a "hole"),
the content is defined to be zero-filled.
(I'll use the term "block" and "object" interchangeably to mean the object
that stores each RBD block. They're 4MB by default, but can be set to any
size you want at image creation time.)
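As a concrete sketch of that striping (names are hypothetical; a dict stands in for the OSDs, and a missing key models a nonexistent object):

```python
OBJECT_SIZE = 4 * 1024 * 1024  # default 4MB block/object size

def object_name(prefix: str, offset: int) -> str:
    """Name of the object that stores the block at the given image offset."""
    return f"{prefix}.{offset // OBJECT_SIZE:016x}"

def read_block(objects: dict, prefix: str, offset: int) -> bytes:
    """Read one whole block; a nonexistent object (a "hole") reads as zeros."""
    data = objects.get(object_name(prefix, offset))  # None models ENOENT
    if data is None:
        return b"\x00" * OBJECT_SIZE
    return data
```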
1- copy-up on first write
 - reads
   - read child image object. if it doesn't exist, read parent block.
     -> reads to unchanged data are slower
 - writes
   - write to child image block. if it doesn't exist, OSD will return
     ENOENT. the client would do a copy up (copy parent block to child
     block), and then redo the write.
     -> first writes are slow, especially if the block existed in the parent.
 - trim/discard
   - truncate the child object to zero, but do not delete it.
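A toy model of approach 1, with dicts standing in for the child and parent object stores (a missing key models ENOENT; names are hypothetical):

```python
def read(child, parent, name, size):
    data = child.get(name)
    if data is None:                   # no child object: fall thru to parent
        data = parent.get(name)
    if data is None:
        return b"\x00" * size          # hole in both layers: zeros
    return data.ljust(size, b"\x00")   # truncated object reads zero-padded

def write(child, parent, name, off, buf, size):
    if name not in child:              # first write: the OSD returns ENOENT,
        base = parent.get(name, b"\x00" * size)
        child[name] = base             # so copy the parent block up...
    blk = bytearray(read(child, {}, name, size))
    blk[off:off + len(buf)] = buf      # ...then redo the write
    child[name] = bytes(blk)

def discard(child, name):
    child[name] = b""                  # truncate to zero; do NOT delete it,
                                       # so it still masks the parent block
```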
2- sparse objects
 - make the OSDs maintain allocation metadata for each object so that we
   know which parts of the object are defined and which are holes (a
   relatively easy thing to do).
 - writes
   - write to modified region of child object.
 - reads
   - read child image object AND allocation map. read parent object for
     any holes (or when child object doesn't exist)
     -> more efficient data transfer when objects are sparse.
     -> reads to unchanged data are slower (as above)
 - trim/discard
   - need to somehow distinguish between a hole that falls thru to the
     parent and a hole that is defined to be zero by the child image.
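The read side of approach 2 can be sketched similarly; here the allocation map is a hypothetical list of (start, end) byte extents defined in the child:

```python
def read_sparse(child_data, alloc_map, parent_data, size):
    """alloc_map: sorted, non-overlapping (start, end) extents that are
    defined in the child; everything else falls thru to the parent."""
    out = bytearray(parent_data if parent_data is not None else b"\x00" * size)
    for start, end in alloc_map:
        out[start:end] = child_data[start:end]  # defined extents win
    return bytes(out)
```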
In both cases, we could add a(n optional) allocation bitmap to the parent
image to avoid the fall-thru for parts of the images that aren't defined
by the child image. That could be an explicit step taken by an
administrator (e.g. after marking the parent read-only) to improve
performance for overlaid images. (Maintaining a consistent bitmap for
all images is non-trivial, and would slow things down considerably.)
A few use cases for all of this:
 - "golden" VM images
 - writeable snapshots
 - image migration between pools
   - pause io
   - mark parent read-only
   - create "child" image
   - unpause io, redirect to the new child
     (these steps are all fast and O(1)!)
   - asynchronously copy-up parent blocks to the child (this is O(n))
   - once this is done, remove the child's parent reference and discard
     the parent
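The asynchronous O(n) copy-up step can be sketched with the same toy dict-backed model (hypothetical names; real code would go through librbd/librados):

```python
def flatten(child: dict, parent: dict) -> None:
    """Copy up every parent object the child has not written; afterwards
    the child's parent reference can be removed and the parent discarded."""
    for name, data in parent.items():
        if name not in child:       # the child's version, if any, wins
            child[name] = data
```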
sage
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: rbd layering
2011-02-01 6:08 rbd layering Sage Weil
@ 2011-02-01 17:43 ` Tommi Virtanen
2011-02-02 7:13 ` Colin McCabe
1 sibling, 0 replies; 11+ messages in thread
From: Tommi Virtanen @ 2011-02-01 17:43 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Mon, Jan 31, 2011 at 10:08:31PM -0800, Sage Weil wrote:
> A few use cases for all of this:
> - "golden" VM images
> - writeable snapshots
> - image migration between pools
> - pause io
> - mark parent read-only
> - create "child" image
> - unpause io, redirect to the new child
> (these steps are all fast and O(1)!)
> - asynchronously copy-up parent blocks to the child (this is O(n))
> - once this is done, remove the child's parent reference and discard
> the parent
Two things to add:
- creating something that can be later used to do lazy deduplication
in the background would be good; you don't always "start from the
golden master", but you still have images that are 99% identical.
- the "child has zero extents that should not be read from master"
case would mostly arise from TRIM operations these days.
--
:(){ :|:&};:
* Re: rbd layering
2011-02-01 6:08 rbd layering Sage Weil
2011-02-01 17:43 ` Tommi Virtanen
@ 2011-02-02 7:13 ` Colin McCabe
2011-02-02 7:24 ` Gregory Farnum
2011-02-02 7:34 ` Yehuda Sadeh Weinraub
1 sibling, 2 replies; 11+ messages in thread
From: Colin McCabe @ 2011-02-02 7:13 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@newdream.net> wrote:
> One idea we've talked a fair bit about is layering RBD images. The idea
> would be to create a new image in O(1) time that mirrors an old image and
> get copy-on-write type semantics, like a writeable snapshot.
>
> We've come up with a few different approaches for doing this, each with
> somewhat different performance characteristics. The main consideration is
> that RBD images do not (currently) have an "allocation table." Image data
> is simply striped over objects (that may or may not exist). You read the
> object for a given block to see if it exists; if it doesn't (a "hole"),
> the content is defined to be zero-filled.
Have we thought about the hash table based approach yet? Where every
block gets hashed and we only store one copy for each? I guess this is
basically how git works, except instead of fixed-size blocks, it
tracks variable-sized blobs. This is also how ZFS dedupe works.
The nice thing about the hash table based approach is that you don't
have to track parent-child relationships explicitly. If two users
happen to both install CentOS 5.5 with the same settings on the
same-sized image, they'll both be deduped automatically.
The disadvantage, of course, is that you need to hash the blocks.
Also, there's some tiny probability that there will be a hash
collision. You could use a long hash key or do hash chaining to
mitigate this, of course.
The big disadvantage of the allocation table-based approaches, at
least in my mind, is that they don't feel very block device-y.
Allocation maps are things that normally go in a file system rather
than in a block device.
If we do go with an allocation-table based approach, what would the
API look like from the administrator's point of view? I guess I
imagine some kind of API where I create a child RBD block device from
a parent RBD device. Then whenever I wrote to the child image, it
would "re-dupe" the two block devices. (It seems like the amount of
sharing would start at 100% and just go down from there... unless my
analysis is missing something?)
Another possibility is that we could simply run qcow2 over rbd. qcow2
already implements copy-on-write at a higher level of the stack.
I took a quick look at the qcow2 image format at:
http://people.gnome.org/~markmc/qcow-image-format.html
It looks suspiciously like something I've seen before :)
http://en.wikipedia.org/wiki/Inode_pointer_structure
sincerely,
Colin
>
> (I'll use the term "block" and "object" interchangeably to mean the object
> that stores each RBD block. They're 4MB by default, but can be set to any
> size you want at image creation time.)
>
> 1- copy-up on first write
> - reads
> - read child image object. if it doesn't exist, read parent block.
> -> reads to unchanged data are slower
> - writes
> - write to child image block. if it doesn't exist, OSD will return
> ENOENT. the client would do a copy up (copy parent block to child
> block), and then redo the write.
> -> first writes are slow, especially if the block existed in the parent.
> - trim/discard
> - truncate the child object to zero, but do not delete it.
>
> 2- sparse objects
> - make the OSDs maintain allocation metadata for each object so that we
> know which parts of the object are defined and which are holes (a
> relatively easy thing to do).
> - writes
> - write to modified region of child object.
> - reads
> - read child image object AND allocation map. read parent object for
> any holes (or when child object doesn't exist)
> -> more efficient data transfer when objects are sparse.
> -> reads to unchanged data are slower (as above)
> - trim/discard
> - need to somehow distinguish between a hole that falls-thru to parent
> and a hole that is defined to be zero by the child image.
>
> In both cases, we could add a(n optional) allocation bitmap to the parent
> image to avoid the fall-thru for parts of the images that aren't defined
> by the child image. That could be an explicit step taken by an
> administrator (e.g. after marking the parent read-only) to improve
> performance for overlaid images. (Maintaining a consistent bitmap for
> all images is non-trivial, and would slow things down considerably.)
>
> A few use cases for all of this:
> - "golden" VM images
> - writeable snapshots
> - image migration between pools
> - pause io
> - mark parent read-only
> - create "child" image
> - unpause io, redirect to the new child
> (these steps are all fast and O(1)!)
> - asynchronously copy-up parent blocks to the child (this is O(n))
> - once this is done, remove the child's parent reference and discard
> the parent
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: rbd layering
2011-02-02 7:13 ` Colin McCabe
@ 2011-02-02 7:24 ` Gregory Farnum
2011-02-02 7:41 ` Kiran Patil
2011-02-02 7:51 ` Colin McCabe
2011-02-02 7:34 ` Yehuda Sadeh Weinraub
1 sibling, 2 replies; 11+ messages in thread
From: Gregory Farnum @ 2011-02-02 7:24 UTC (permalink / raw)
To: Colin McCabe; +Cc: Sage Weil, ceph-devel
On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@newdream.net> wrote:
>> One idea we've talked a fair bit about is layering RBD images. The idea
>> would be to create a new image in O(1) time that mirrors an old image and
>> get copy-on-write type semantics, like a writeable snapshot.
>>
>> We've come up with a few different approaches for doing this, each with
>> somewhat different performance characteristics. The main consideration is
>> that RBD images do not (currently) have an "allocation table." Image data
>> is simply striped over objects (that may or may not exist). You read the
>> object for a given block to see if it exists; if it doesn't (a "hole"),
>> the content is defined to be zero-filled.
>
> Have we thought about the hash table based approach yet? Where every
> block gets hashed and we only store one copy for each? I guess this is
> basically how git works, except instead of fixed-size blocks, it
> tracks variable-sized blobs. This is also how ZFS dedupe works.
>
> The nice thing about the hash table based approach is that you don't
> have to track parent-child relationships explicitly. If two users
> happen to both install Centos 5.5 with the same settings on the same
> sized-image, they'll both be deduped automatically.
How would you place the blocks in a CAS-based block device like this?
An allocation table might feel ugly, but when you're doing
cluster-wide block sharing you're going to need the extra metadata
somewhere. Better to store an allocation table than try and maintain
the coherency required for dynamic de-dup like that.
I guess I should say that de-dup would be a nice feature to support,
but I don't think it's appropriate to implement as part of RBD.
Anything that powerful needs to be a core RADOS feature.
* Re: rbd layering
2011-02-02 7:13 ` Colin McCabe
2011-02-02 7:24 ` Gregory Farnum
@ 2011-02-02 7:34 ` Yehuda Sadeh Weinraub
1 sibling, 0 replies; 11+ messages in thread
From: Yehuda Sadeh Weinraub @ 2011-02-02 7:34 UTC (permalink / raw)
To: Colin McCabe; +Cc: Sage Weil, ceph-devel
On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>
> On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@newdream.net> wrote:
> > One idea we've talked a fair bit about is layering RBD images. The idea
> > would be to create a new image in O(1) time that mirrors an old image and
> > get copy-on-write type semantics, like a writeable snapshot.
> >
> > We've come up with a few different approaches for doing this, each with
> > somewhat different performance characteristics. The main consideration is
> > that RBD images do not (currently) have an "allocation table." Image data
> > is simply striped over objects (that may or may not exist). You read the
> > object for a given block to see if it exists; if it doesn't (a "hole"),
> > the content is defined to be zero-filled.
>
> Have we thought about the hash table based approach yet? Where every
> block gets hashed and we only store one copy for each? I guess this is
> basically how git works, except instead of fixed-size blocks, it
> tracks variable-sized blobs. This is also how ZFS dedupe works.
>
Long ago there were some plans to introduce content addressable
storage at the osd level. We will probably want to have something like
that sometime, but we'd rather introduce it as a proper osd/rados
feature and not as some hack tailored specifically for rbd. I don't
want to start digging into the architectural requirements, but my gut
feeling says that it's not going to be trivial (as an understatement)
and its benefits compared to what we'd lose (simplicity, performance)
are marginal.
Yehuda
* Re: rbd layering
2011-02-02 7:24 ` Gregory Farnum
@ 2011-02-02 7:41 ` Kiran Patil
2011-02-02 7:51 ` Colin McCabe
1 sibling, 0 replies; 11+ messages in thread
From: Kiran Patil @ 2011-02-02 7:41 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel
Hello,
Josef has added Offline Deduplication for Btrfs.
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg07777.html
On Wed, Feb 2, 2011 at 12:54 PM, Gregory Farnum <gregf@hq.newdream.net> wrote:
>
> On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
> > On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@newdream.net> wrote:
> >> One idea we've talked a fair bit about is layering RBD images. The idea
> >> would be to create a new image in O(1) time that mirrors an old image and
> >> get copy-on-write type semantics, like a writeable snapshot.
> >>
> >> We've come up with a few different approaches for doing this, each with
> >> somewhat different performance characteristics. The main consideration is
> >> that RBD images do not (currently) have an "allocation table." Image data
> >> is simply striped over objects (that may or may not exist). You read the
> >> object for a given block to see if it exists; if it doesn't (a "hole"),
> >> the content is defined to be zero-filled.
> >
> > Have we thought about the hash table based approach yet? Where every
> > block gets hashed and we only store one copy for each? I guess this is
> > basically how git works, except instead of fixed-size blocks, it
> > tracks variable-sized blobs. This is also how ZFS dedupe works.
> >
> > The nice thing about the hash table based approach is that you don't
> > have to track parent-child relationships explicitly. If two users
> > happen to both install Centos 5.5 with the same settings on the same
> > sized-image, they'll both be deduped automatically.
> How would you place the blocks in a CAS-based block device like this?
> An allocation table might feel ugly, but when you're doing
> cluster-wide block sharing you're going to need the extra metadata
> somewhere. Better to store an allocation table than try and maintain
> the coherency required for dynamic de-dup like that.
>
> I guess I should say that de-dup would be a nice feature to support,
> but I don't think it's appropriate to implement as part of RBD.
> Anything that powerful needs to be a core RADOS feature.
--
Kind Regards,
---------------------
Kiran T Patil
* Re: rbd layering
2011-02-02 7:24 ` Gregory Farnum
2011-02-02 7:41 ` Kiran Patil
@ 2011-02-02 7:51 ` Colin McCabe
2011-02-02 17:47 ` Sage Weil
1 sibling, 1 reply; 11+ messages in thread
From: Colin McCabe @ 2011-02-02 7:51 UTC (permalink / raw)
To: Gregory Farnum; +Cc: Sage Weil, ceph-devel
On Tue, Feb 1, 2011 at 11:24 PM, Gregory Farnum <gregf@hq.newdream.net> wrote:
> On Tue, Feb 1, 2011 at 11:13 PM, Colin McCabe <cmccabe@alumni.cmu.edu> wrote:
>> On Mon, Jan 31, 2011 at 10:08 PM, Sage Weil <sage@newdream.net> wrote:
>>> One idea we've talked a fair bit about is layering RBD images. The idea
>>> would be to create a new image in O(1) time that mirrors an old image and
>>> get copy-on-write type semantics, like a writeable snapshot.
>>>
>>> We've come up with a few different approaches for doing this, each with
>>> somewhat different performance characteristics. The main consideration is
>>> that RBD images do not (currently) have an "allocation table." Image data
>>> is simply striped over objects (that may or may not exist). You read the
>>> object for a given block to see if it exists; if it doesn't (a "hole"),
>>> the content is defined to be zero-filled.
>>
>> Have we thought about the hash table based approach yet? Where every
>> block gets hashed and we only store one copy for each? I guess this is
>> basically how git works, except instead of fixed-size blocks, it
>> tracks variable-sized blobs. This is also how ZFS dedupe works.
>>
>> The nice thing about the hash table based approach is that you don't
>> have to track parent-child relationships explicitly. If two users
>> happen to both install Centos 5.5 with the same settings on the same
>> sized-image, they'll both be deduped automatically.
> How would you place the blocks in a CAS-based block device like this?
> An allocation table might feel ugly, but when you're doing
> cluster-wide block sharing you're going to need the extra metadata
> somewhere. Better to store an allocation table than try and maintain
> the coherency required for dynamic de-dup like that.
You could chunk the hash table over several OSDs. Then you only need
to worry about doing atomic operations on a given hash table entry,
which will of course be protected by a single PG lock.
Yehuda is probably right though... it's not 100% clear that the
benefits outweigh the disadvantages, given that it would need an extra
lookup for every operation. In the end it's something that probably
will take some experimentation to get right.
Colin
* Re: rbd layering
2011-02-02 7:51 ` Colin McCabe
@ 2011-02-02 17:47 ` Sage Weil
2011-02-02 18:15 ` Gregory Farnum
0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-02-02 17:47 UTC (permalink / raw)
To: Colin McCabe; +Cc: Gregory Farnum, ceph-devel
On Tue, 1 Feb 2011, Colin McCabe wrote:
> Yehuda is probably right though... it's not 100% clear that the
> benefits outweigh the disadvantages, given that it would need an extra
> lookup for every operation. In the end it's something that probably
> will take some experimentation to get right.
Right. The nice thing about RBD is its simplicity: there is almost no
metadata. Just the block size, image size, and object name prefix.
That's enough to name the object with the data you want, and that object
may or may not exist, depending on whether it's been written to. There
are no consistency concerns.
When I mentioned allocation bitmap before, I meant simply a bitmap
specifying whether the block exists, which would let us avoid looking for
an object in the parent image. In its simplest form, you would mark the
image read-only, then generate the bitmap once.
Anything more complicated than that and you have to worry about keeping
the metadata consistent with the data. CAS, for example, requires lots of
metadata: if I want to read block 1234, I have to look up in some table
that says 1234 has sha1 FOO, and then go read that object. Then writing
is a whole other story. Again, doing CAS on a read-only image simplifies
things greatly, but I don't think we should go down that road now.
Mainly I'm interested in feedback on the simple layering use-case...
sage
* Re: rbd layering
2011-02-02 17:47 ` Sage Weil
@ 2011-02-02 18:15 ` Gregory Farnum
2011-02-03 20:36 ` Christian Brunner
0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2011-02-02 18:15 UTC (permalink / raw)
To: ceph-devel; +Cc: Sage Weil
On Wed, Feb 2, 2011 at 9:47 AM, Sage Weil <sage@newdream.net> wrote:
> When I mentioned allocation bitmap before, I meant simply a bitmap
> specifying whether the block exists, that would let us avoid looking for
> an object in the parent image. In its simplest form, you would mark the
> image read-only, then generate the bitmap once.
> ...
> Mainly I'm interested in feedback on the simple layering use-case...
So my thought with the bitmap was that it might make more sense for
rbd to maintain a bitmap specifying whether the child has overwritten
the parent block device. Then when doing a read in the parent region,
rbd defaults to reading the parent block device unless the bitmap says
the child has overwritten it.
This is reasonably fast in terms of failed reads and such: assuming
the bitmap is kept in-memory, you don't need to do multiple attempts
to do a read, and if the client wants to zero a block, or overwrites a
block and then deletes it, there's no concern about preventing
inappropriate fall-through to the parent since the bitmap still has
that block set as overwritten.
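A minimal sketch of that read path, assuming an in-memory list of booleans (one per block) marking blocks the child has overwritten; dicts stand in for the two block devices:

```python
def read_with_bitmap(overwritten, child, parent, idx):
    """overwritten[idx] says whether the child overwrote block idx; one
    lookup picks the device to read, with no failed probes or retries."""
    store = child if overwritten[idx] else parent
    return store.get(idx, b"\x00")  # an absent block still reads as zeros
```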
The other advantage to something like this is that it could allow
overwriting at a finer level than the size of the child's blocks. For
instance, you might store in 4MB chunks, but that's a bit large for
some things that are going to commonly change between images like
config files. So with a bitmap with, say, 1KB resolution you could
change a config file and have rbd read the block from the parent and
then plug in the 1KB containing the config file that the child
overwrote. This doesn't require too much space: storing a
1KB-granularity bitmap for a 1GB image only requires 2^20 bits, i.e. 128KB.
-Greg
* Re: rbd layering
2011-02-02 18:15 ` Gregory Farnum
@ 2011-02-03 20:36 ` Christian Brunner
0 siblings, 0 replies; 11+ messages in thread
From: Christian Brunner @ 2011-02-03 20:36 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, Sage Weil
2011/2/2 Gregory Farnum <gregf@hq.newdream.net>:
> On Wed, Feb 2, 2011 at 9:47 AM, Sage Weil <sage@newdream.net> wrote:
>> When I mentioned allocation bitmap before, I meant simply a bitmap
>> specifying whether the block exists, that would let us avoid looking for
>> an object in the parent image. In its simplest form, you would mark the
>> image read-only, then generate the bitmap once.
>> ...
>> Mainly I'm interested in feedback on the simple layering use-case...
> So my thought with the bitmap was that it might make more sense for
> rbd to maintain a bitmap specifying whether the child has overwritten
> the parent block device. Then when doing a read in the parent region,
> rbd defaults to reading the parent block device unless the bitmap says
> the child has overwritten it.
> This is reasonably fast in terms of failed reads and such: assuming
> the bitmap is kept in-memory, you don't need to do multiple attempts
> to do a read, and if the client wants to zero a block, or overwrites a
> block and then deletes it, there's no concern about preventing
> inappropriate fall-through to the parent since the bitmap still has
> that block set as overwritten.
>
> The other advantage to something like this is that it could allow
> overwriting at a finer level than the size of the child's blocks. For
> instance, you might store in 4MB chunks, but that's a bit large for
> some things that are going to commonly change between images like
> config files. So with a bitmap with, say, 1KB resolution you could
> change a config file and have rbd read the block from the parent and
> then plug in the 1KB containing the config file that the child
> overwrote. This doesn't require too much space: storing a
> 1KB-granularity bitmap for a 1GB image only requires 2^20 bits, i.e. 128KB.
> -Greg
I would go for this kind of allocation bitmap, too. However, I'm
wondering whether we could add TRIM support this way as well.
In my scenario the bitmap would be available in every image, with a
512-byte resolution to match the sector size of common hard disks. The
bitmap needs to support three states:
0: Block is not allocated
1: Block is allocated in this image (child)
2: Block is allocated in the parent image
- When we create a new image the bitmap is filled with zero.
- When we clone an image we have to copy the bitmap and switch every
allocated block from state 1 to state 2.
- When we are writing to a block in state 0 or state 2 we have to set
it to state 1 and we will have to sync the bitmap to disk.
- When a block is discarded we set the state to 0 and we will have to
sync the bitmap to disk.
- When all blocks of an object are set to 0 we can delete the object.
This way the only performance impact would be at the first write to a block.
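The state transitions above can be sketched as a toy model (syncing the bitmap to disk is only noted in comments):

```python
UNALLOC, CHILD, PARENT = 0, 1, 2    # the three per-block states

def clone(parent_map):
    """Cloning copies the bitmap, demoting every state-1 block to state 2
    (allocated in the parent image)."""
    return [PARENT if s == CHILD else s for s in parent_map]

def on_write(bmap, i):
    if bmap[i] != CHILD:            # first write to this block:
        bmap[i] = CHILD             # flip to state 1, sync bitmap to disk

def on_discard(bmap, i):
    bmap[i] = UNALLOC               # back to state 0, sync bitmap to disk
```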
Regards
Christian
* rbd layering
@ 2011-02-25 22:27 Sage Weil
0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2011-02-25 22:27 UTC (permalink / raw)
To: ceph-devel
I wanted to follow up on the thread a couple weeks back and summarize
where we're currently at. The goal is to be flexible, so that we don't
impose any performance cost for features we don't use.
The use cases are:
- (fast) image creation from gold master (probably followed by growing
the image/fs)
- image migration (create child in new location; copyup old data
asynchronously)
Here are the pieces we currently have:
(image == rbd image
object == one object in the image, normally 4MB)
- Parent image pointer
Each image has an optional parent pointer that names a parent image. The
parent must be part of the same cluster, but can be in a different pool.
It can be larger or smaller than the current image.
It is assumed the parent is read-only. I don't think anything sane can
come out of doing a COW overlay over something that is changing.
- Object Bitmap
Each object in an image may have an OPTIONAL bitmap that represents
transparency. If the bit is set, then it is defined by this image layer
(it can be either object data or, if the object has a hole, zeros). If
the bit is not set, then the content is defined by the parent image. The
resolution can be sector, 4KB block, or anything else. If it is larger
than the smallest write unit, a write may require copy-up from the lower
layer, so using the block size is recommended.
If the object bitmap does not exist, we assume the object is NOT
transparent (i.e. bitmap is fully colored). That gives us compatibility
with old images, and lets us drop the bitmap once it gets fully colored.
Only new images that support layering will create/use it.
- Image bitmap
Each image may have an OPTIONAL bitmap that indicates which image objects
(may) exist. On write, a bit is set prior to creating each object.
On read, if a bitmap exists but the bit for an object is not set, we can
go directly to the parent image. If the bitmap does not exist, reads must
always check for the child object before falling through to the parent
image. Writes in the no-bitmap case write to the child object. The
bitmap size need not match the image size; it may, e.g., match the size of
a smaller parent image.
Having two bitmaps is a design tradeoff. We could use a sector/block
resolution bitmap for the whole image, but it would increase memory use,
and would require more "update image bitmap, wait, then write to object"
cycles. Having a per-object bitmap means we can atomically update the
object bitmap for free when we do the write, and minimize the image bitmap
updates to the first time each object is touched.
On read:
  if there is an image bitmap
    if bit is set
      read child object
      if there's an object bitmap that indicates transparency
        read holes from parent object
    else
      read parent object (*)
  else
    read child object
    if there is no child object, or bitmap indicates transparency
      read holes from parent object (*)

On write:
  if there is an image bitmap and bit is not set
    color image bitmap bit for this object
  if object bitmaps are enabled
    write to object
    color object bits too
  else
    if we are not writing the entire object (*)
      read unwritten parts from parent (*)
    write our data (+ copyup data from parent)

(*) These steps can be skipped if the parent image has holes here. We
would know that if the parent image bitmap bits are not set, or if we are
past the end of the parent image size.

On trim/discard:
  if there is an image bitmap
    if bit is not set
      set image bitmap bit
  truncate or zero object
  if object bitmap
    color appropriate bits
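A Python rendering of the read path, simplified by omitting the per-object transparency bitmap (child objects fully opaque); dicts stand in for object stores and an absent key models a missing object:

```python
def layered_read(image_bitmap, idx, child, parent, name, size):
    """Read one object through the layering, consulting the optional
    image bitmap first to skip probing the child when possible."""
    if image_bitmap is not None and not image_bitmap[idx]:
        return parent.get(name, b"\x00" * size)  # straight to parent (*)
    data = child.get(name)
    if data is None:
        return parent.get(name, b"\x00" * size)  # fall thru to parent
    return data
```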
Also: the image bitmap could be created after the fact. I.e. once we
decide to use something as a gold image/parent, we would generate the
image bitmap (just check which objects exist) so that overlays would
operate more efficiently. We'll probably want a read-only flag in the
image header too to help keep admins from shooting themselves in the foot.
- OSD copyup/merge operation
The last piece would be an OSD method to atomically copy a parent object
up to the overlay image. The goal is for the copyup to be a background,
maybe low-priority process. We would read the parent object, then submit
it to the child object, writing only the parts that correspond to non-set
bits in the object bitmap, and then coloring in all bits.
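A client-side sketch of what that copyup method would do, with the object bitmap modeled at byte granularity for simplicity (hypothetical; the real primitive would run atomically on the OSD):

```python
def copyup(child: bytearray, bitmap: list, parent: bytes) -> None:
    """Merge parent data into the child for every non-set (transparent)
    bit, then color all bits so the object is fully opaque."""
    for i in range(len(child)):
        if not bitmap[i]:          # transparent byte: take parent's data
            child[i] = parent[i]
        bitmap[i] = True           # fully colored afterwards
```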
That's the current design. Thoughts on or errors with the above?
sage