* Cloning rados block devices
From: Chris Webb @ 2011-01-23 14:07 UTC (permalink / raw)
  To: ceph-devel

I have a hosting product which consists of qemu-kvm virtual machines backed
by LVM2 logical volumes as virtual drives, accessed either locally or over
iscsi. I'm thinking of migrating in time to a distributed block store, such
as Ceph's rbd or Sheepdog.

One feature I would really like to be able to export to users is an ability
to make copy-on-write clones of virtual hard drives, in a Ceph context
generating a new rbd image from an existing one, or from a snapshot of an
existing image if that's easier.

I've seen Ceph's snapshot support, and in particular the rbd snapshot
support, which lets me make read-only clones of a rados block device.

What I'm after is not quite the same as writeable snapshots, as I'd also
like to be able to offer the user the ability to delete the original block
device independently of the clone, potentially before the clone itself is
deleted, so the clone is properly independent of the source apart from some
shared blocks. (If I stored my images as files in a local btrfs filesystem,
I could get exactly the behaviour I'm imagining by cloning the image file.)
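
(For what it's worth, the btrfs version really is a one-liner; the filenames here are invented for illustration:)

```shell
# Clone an image file. On btrfs, --reflink makes the copy share extents
# with the original (copy-on-write); either file can then be modified or
# deleted independently of the other.
dd if=/dev/zero of=base-image.img bs=1M count=4 status=none
cp --reflink=auto base-image.img clone-image.img   # =always to require CoW
cmp base-image.img clone-image.img && echo "clone matches"
```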

I don't see any mention of a feature like this on the Ceph roadmap, and I'm
not familiar enough with the internal design yet to know whether this is an
easy extension given the book-keeping already in place for snapshots, or
whether what I'm proposing is much harder. Is anyone working on this sort of
thing already, or does the feature even already exist and I've failed to
find it? If not, I'd be very interested in any thoughts on how difficult
this would be to implement given the infrastructure that is already in
place.

Best wishes,

Chris.


* Re: Cloning rados block devices
From: Gregory Farnum @ 2011-01-24 14:39 UTC (permalink / raw)
  To: Chris Webb; +Cc: ceph-devel, Yehuda Sadeh

On Sun, Jan 23, 2011 at 6:07 AM, Chris Webb <chris@arachsys.com> wrote:
> One feature I would really like to be able to export to users is an ability
> to make copy-on-write clones of virtual hard drives, in a Ceph context
> generating a new rbd image from an existing one, or from a snapshot of an
> existing image if that's easier.
> ....
> I don't see any mention of a feature like this on the Ceph roadmap, and I'm
> not familiar enough with the internal design yet to know whether this is an
> easy extension given the book-keeping already in place for snapshots, or
> whether what I'm proposing is much harder. Is anyone working on this sort of
> thing already, or does the feature even already exist and I've failed to
> find it? If not, I'd be very interested in any thoughts on how difficult
> this would be to implement given the infrastructure that is already in
> place.
We've discussed similar things, but this isn't on the roadmap, and I
don't think anything like it is either. There are a few problems with
simply re-using the existing snapshot mechanism. The first is that it
doesn't support branching snapshots at all; that's a hard enough
problem that we've talked about doing it for other reasons in the past
and have always gone with alternative solutions. (It's not impossible,
though.) The second is that right now, all versions of an object are
stored together, on the same OSD. That makes it pretty likely that a
lot of people would clone, say, your Ubuntu base image and modify the
same 16 blocks, and you'd end up with one completely full OSD and a
fairly empty cluster. (There are mechanisms in RADOS to deal with
overloaded OSDs, but this uneven distribution is an issue I would
worry about even so.)
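
To see why, here's a toy placement sketch (name-hashing, not real
CRUSH, and the object names are invented): because every version of an
object keeps the same object name, all the clones' copies of a given
block land on the same OSD.

```python
import hashlib

def place(object_name: str, num_osds: int = 8) -> int:
    """Toy placement: hash the object name to pick an OSD (not real CRUSH)."""
    h = hashlib.md5(object_name.encode()).hexdigest()
    return int(h, 16) % num_osds

# If clones are stored as versions of the *same* object, every clone's
# copy of block 5 shares one name, so every copy maps to one OSD:
base = place("ubuntu-base.block5")
clones = {place("ubuntu-base.block5") for _ in range(1000)}
assert clones == {base}   # one hot OSD

# If each clone gets its own object names (an RBD-layer approach),
# the copies spread across the cluster:
spread = {place(f"clone{i}.block5") for i in range(1000)}
assert len(spread) > 1
```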

So with that said, if I were going to implement copy-on-write RBD
images, I'd probably do so in the RBD layer rather than via the RADOS
commands. Yehuda would have a better idea of how to deal with this
than I do, but I'd probably modify the header to store an index
indicating the blocks contained in the parent image and which blocks
in that range have been written to. Then set up the child image as its
own image (with its own header and rados naming scheme, etc) and
whenever one block does get written to, copy the object from the
parent image to the child's space and mark it as written in the
header. I'm not sure how this would impact performance, but presumably
most writes would be in areas of the disk not contained in the parent
image, and I don't think it would be too difficult to implement. This
wouldn't be as space-efficient as cloning for small changes like a
config file (since it would modify the whole block, which defaults to
4MB), but I bet it's better than storing 3000 installs of an Ubuntu
LTS release.
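
Roughly, the bookkeeping I'm imagining looks like this (all names and
structures invented for illustration; this is not the real librbd
layout):

```python
class CowImage:
    """Sketch of an RBD-style copy-on-write child image.

    The child's header records which parent blocks have already been
    copied into the child's own object namespace.
    """

    def __init__(self, parent_objects: dict):
        self.parent = parent_objects          # block index -> object data
        self.copied = set()                   # blocks copied into the child
        self.objects = {}                     # the child's own objects

    def read(self, block: int) -> bytes:
        if block in self.copied:
            return self.objects[block]
        return self.parent.get(block, b"\0")  # fall through to the parent

    def write(self, block: int, data: bytes):
        if block not in self.copied:
            self.copied.add(block)            # mark as written in the header
        self.objects[block] = data            # whole-object copy-up in practice

parent = {0: b"boot", 1: b"root"}
child = CowImage(parent)
child.write(1, b"ROOT")
assert child.read(1) == b"ROOT" and child.read(0) == b"boot"
assert parent[1] == b"root"                   # parent untouched
```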
-Greg


* Re: Cloning rados block devices
From: Yehuda Sadeh Weinraub @ 2011-01-25 23:41 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Chris Webb, ceph-devel

On Mon, Jan 24, 2011 at 6:39 AM, Gregory Farnum <gregf@hq.newdream.net> wrote:
> On Sun, Jan 23, 2011 at 6:07 AM, Chris Webb <chris@arachsys.com> wrote:
>> One feature I would really like to be able to export to users is an ability
>> to make copy-on-write clones of virtual hard drives, in a Ceph context
>> generating a new rbd image from an existing one, or from a snapshot of an
>> existing image if that's easier.
>> ....
>> I don't see any mention of a feature like this on the Ceph roadmap, and I'm
>> not familiar enough with the internal design yet to know whether this is an
>> easy extension given the book-keeping already in place for snapshots, or
>> whether what I'm proposing is much harder. Is anyone working on this sort of
>> thing already, or does the feature even already exist and I've failed to
>> find it? If not, I'd be very interested in any thoughts on how difficult
>> this would be to implement given the infrastructure that is already in
>> place.
> We've discussed similar things, but this isn't on the roadmap, and I
> don't think anything like it is either. There are a few problems with
> simply re-using the existing snapshot mechanism. The first is that it
> doesn't support branching snapshots at all; that's a hard enough
> problem that we've talked about doing it for other reasons in the past
> and have always gone with alternative solutions. (It's not impossible,
> though.) The second is that right now, all versions of an object are
> stored together, on the same OSD. That makes it pretty likely that a
> lot of people would clone, say, your Ubuntu base image and modify the
> same 16 blocks, and you'd end up with one completely full OSD and a
> fairly empty cluster. (There are mechanisms in RADOS to deal with
> overloaded OSDs, but this uneven distribution is an issue I would
> worry about even so.)
>
> So with that said, if I were going to implement copy-on-write RBD
> images, I'd probably do so in the RBD layer rather than via the RADOS
> commands. Yehuda would have a better idea of how to deal with this
> than I do, but I'd probably modify the header to store an index
> indicating the blocks contained in the parent image and which blocks
> in that range have been written to. Then set up the child image as its
> own image (with its own header and rados naming scheme, etc) and
> whenever one block does get written to, copy the object from the
> parent image to the child's space and mark it as written in the
> header. I'm not sure how this would impact performance, but presumably
> most writes would be in areas of the disk not contained in the parent
> image, and I don't think it would be too difficult to implement. This
> wouldn't be as space-efficient as cloning for small changes like a
> config file (since it would modify the whole block, which defaults to
> 4MB), but I bet it's better than storing 3000 installs of an Ubuntu
> LTS release.

Overlaying images is something we've discussed and considered
implementing. The easiest approach would probably be the one Greg
describes here, at block granularity: when writing to the overlaying
image, you copy the entire block's data into that image. Note that the
overlaying image isn't required to have the same block size as the
parent image, so it might make sense to use smaller block sizes for
the overlay. On top of that we can add optimizations (e.g., bitmaps
that specify which blocks exist), but those are orthogonal to the
basic requirements.
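
(As a sketch of that optimization, the "which blocks exist" bitmap can
be as simple as one bit per block in the header; names here are
illustrative only:)

```python
# Toy "blocks present" bitmap: one bit per block, stored as an int.
def mark_written(bitmap: int, block: int) -> int:
    return bitmap | (1 << block)

def is_written(bitmap: int, block: int) -> bool:
    return bool(bitmap >> block & 1)

bitmap = mark_written(0, 3)
assert is_written(bitmap, 3) and not is_written(bitmap, 2)
```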

We're in the process of implementing a new userspace library to access
rbd images (librbd) and probably any new development in that area
should go through that library once it's ready. The next stages would
be modifying the qemu-rbd code to use that library, and implementing
the kernel rbd side.

Yehuda


* Re: Cloning rados block devices
From: Chris Webb @ 2011-02-04 14:31 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub, Gregory Farnum; +Cc: ceph-devel

Yehuda Sadeh Weinraub <yehudasa@gmail.com> writes:

> On Mon, Jan 24, 2011 at 6:39 AM, Gregory Farnum <gregf@hq.newdream.net> wrote:
>
> > So with that said, if I were going to implement copy-on-write RBD
> > images, I'd probably do so in the RBD layer rather than via the RADOS
> > commands. Yehuda would have a better idea of how to deal with this
> > than I do, but I'd probably modify the header to store an index
> > indicating the blocks contained in the parent image and which blocks
> > in that range have been written to. Then set up the child image as its
> > own image (with its own header and rados naming scheme, etc) and
> > whenever one block does get written to, copy the object from the
> > parent image to the child's space and mark it as written in the
> > header. I'm not sure how this would impact performance, but presumably
> > most writes would be in areas of the disk not contained in the parent
> > image, and I don't think it would be too difficult to implement. This
> > wouldn't be as space-efficient as cloning for small changes like a
> > config file (since it would modify the whole block, which defaults to
> > 4MB), but I bet it's better than storing 3000 installs of an Ubuntu
> > LTS release.
> 
> Overlaying images is something we've discussed and considered
> implementing. The easiest approach would probably be the one Greg
> describes here, at block granularity: when writing to the overlaying
> image, you copy the entire block's data into that image. Note that the
> overlaying image isn't required to have the same block size as the
> parent image, so it might make sense to use smaller block sizes for
> the overlay. On top of that we can add optimizations (e.g., bitmaps
> that specify which blocks exist), but those are orthogonal to the
> basic requirements.
> 
> We're in the process of implementing a new userspace library to access
> rbd images (librbd) and probably any new development in that area
> should go through that library once it's ready. The next stages would
> be modifying the qemu-rbd code to use that library, and implementing
> the kernel rbd side.

Thanks Greg and Yehuda for the prompt, detailed and helpful feedback on what
would be needed to implement this feature, and apologies for the slow
follow-up.

When I wrote my original email, I hadn't dug very far into the
underlying structure of Ceph, and didn't realise the implications of
implementing this sort of thing at the RADOS layer. Given the hotspot
issues you highlight, implementing it in the RBD layer does sound like
it makes much more sense than trying to use RADOS versioning.

I suspect that for realistic workloads, a 4MB copy-on-write chunk size
isn't going to be particularly evil, especially after the machine has
been 'run in'. I may be able to get a better handle on how block
modifications are distributed with a bit of instrumentation of the
block layer on some of our existing virtual machine images, though...
I'll have a play!
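
Something like this is what I have in mind for crunching a write
trace, once it's been reduced to byte offsets (the trace values below
are invented):

```python
from collections import Counter

CHUNK = 4 * 1024 * 1024   # 4MB copy-on-write granularity

def chunks_touched(write_offsets, chunk=CHUNK):
    """Map each written byte offset to its chunk and count hits per chunk."""
    return Counter(off // chunk for off in write_offsets)

# Invented trace: small writes clustered in a few areas of the disk.
trace = [512, 4096, 8192, CHUNK + 100, 10 * CHUNK, 10 * CHUNK + 2048]
hist = chunks_touched(trace)
assert len(hist) == 3          # only three 4MB chunks would be copied up
assert hist[0] == 3            # three writes fell in the first chunk
```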

Cheers,

Chris.

