* Replacing DRBD use with RBD
@ 2010-05-04 23:46 Martin Fick
  2010-05-05  7:30 ` Alex Elsayed
  2010-05-05 20:00 ` Yehuda Sadeh Weinraub
  0 siblings, 2 replies; 8+ messages in thread
From: Martin Fick @ 2010-05-04 23:46 UTC (permalink / raw)
  To: ceph-devel

Hello,

I have a few questions with respect to RADOS, RBD, and the cluster monitor daemons.

1) Is there any chance that the cluster monitor protocol will be enhanced to work practically with only 2 monitor daemons?  I ask since this seems like it would allow a 2 node RBD based device to effectively replace a DRBD based device and yet be much more easily expandable to more nodes than DRBD.  Many HA systems (say telco racks) only have two nodes and it seems silly to miss out on the opportunity to be able to use RBD in those systems.

One suggestion for doing this would be to use some of the same techniques that heartbeat uses to determine whether a node has gone down or whether there is instead a network partition: a serial port connection, common ping nodes (such as a router)...

I suspect that if reliable 2 node operation were designed into RBD, it would eventually replace some of the uses of DRBD.


2) Is there any way of preventing two users of an RBD device from using the device concurrently?  Is there some way to create "locks" with RADOS that would be released if a node dies?  If so, this would allow an RBD device to be safely mounted with a non-distributed FS such as ext3 exclusively on one of many hosts.  This would open up the use of RBD devices for linux containers or linux vservers which could run on any machine in a cluster (similar to the idea of using it with kvm/qemu).

Thanks, I look forward to playing with RBD and ceph!

-Martin



      


* Re: Replacing DRBD use with RBD
  2010-05-04 23:46 Replacing DRBD use with RBD Martin Fick
@ 2010-05-05  7:30 ` Alex Elsayed
  2010-05-05 20:02   ` Alex Elsayed
  2010-05-05 20:00 ` Yehuda Sadeh Weinraub
  1 sibling, 1 reply; 8+ messages in thread
From: Alex Elsayed @ 2010-05-05  7:30 UTC (permalink / raw)
  To: ceph-devel

Martin Fick wrote:

> Hello,
> 
> I have a few questions with respect to RADOS, RBD, and the cluster monitor
> daemons.

I'm not one of the developers, but I've been following for a while and one 
of your sub-questions intrigued me; specifically:

>...This would open up the use of RBD devices for linux
> containers or linux vservers which could run on any machine in a cluster
> (similar to the idea of using it with kvm/qemu).

As it currently stands you could likely run a vserver or an 
OpenVZ/Virtuozzo/LXC container on Ceph (the distributed FS) directly, rather 
than layering a local FS over RBD. Also, this would probably provide better 
performance in the end. As a side benefit, you would gain the ability to 
make fine-grained snapshots of the guests' filesystems, access them directly 
from the host (or another Ceph client), and adjust quotas for the guest 
while running. This is probably a better solution for container-based 
virtualization than RBD-based options, due to the advantage one can take of 
all guests sharing a kernel with the host. RBD is more likely to be useful 
for full virtualization like KVM, but even in that case you could probably 
make a specialized initramfs that mounts a rootfs over Ceph with a prefix 
(taking advantage of the fact that Ceph allows mounting a subdirectory as if 
it were the whole FS).

Hope this helps!



* Re: Replacing DRBD use with RBD
  2010-05-04 23:46 Replacing DRBD use with RBD Martin Fick
  2010-05-05  7:30 ` Alex Elsayed
@ 2010-05-05 20:00 ` Yehuda Sadeh Weinraub
  1 sibling, 0 replies; 8+ messages in thread
From: Yehuda Sadeh Weinraub @ 2010-05-05 20:00 UTC (permalink / raw)
  To: Martin Fick; +Cc: ceph-devel

On Tue, May 4, 2010 at 4:46 PM, Martin Fick <mogulguy@yahoo.com> wrote:
> Hello,

Hi!

>
> I have a few questions with respect to RADOS, RBD, and the cluster monitor daemons.
>
> 1) Is there any chance that the cluster monitor protocol will be enhanced to work practically with only 2 monitor daemons?  I ask since this seems like it would allow a 2 node RBD based device to effectively replace a DRBD based device and yet be much more easily expandable to more nodes than DRBD.  Many HA systems (say telco racks) only have two nodes and it seems silly to miss out on the opportunity to be able to use RBD in those systems.

The problem is that the ceph monitors require a quorum in order to
decide on the cluster state. The way the system works right now, a
2-way monitor setup would be less stable than a system with a single
monitor, since it wouldn't work whenever either of the two monitors
crashes. A possible workaround would be to have a special case for
2-way mon clusters, where a single mon would be enough for a
majority. I'm not sure whether this is actually feasible. As usual,
the devil is in the details.
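
To make the arithmetic concrete, here is a rough sketch of the quorum
rule and of what such a special case might look like (plain Python,
purely illustrative, not actual ceph code):

def has_quorum(total_mons, live_mons, two_mon_special_case=False):
    # Normally a mon cluster of size n needs floor(n/2) + 1 live monitors.
    if two_mon_special_case and total_mons == 2:
        return live_mons >= 1   # hypothetical: degrade to single-mon behaviour
    return live_mons >= total_mons // 2 + 1

# With the strict majority rule, a 2-mon cluster stops as soon as either
# monitor dies; with the special case it keeps going:
assert has_quorum(2, 1) is False
assert has_quorum(2, 1, two_mon_special_case=True) is True
assert has_quorum(3, 2) is True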

>
> One suggestion for doing this would be to use some of the same techniques that heartbeat uses to determine whether a node has gone down or whether there is instead a network partition: a serial port connection, common ping nodes (such as a router)...
There is a heartbeat mechanism within the mon cluster, and it's used
by the monitors to keep track of their peer status. It might be
a good idea to add different configurable types of heartbeats.

>
> I suspect that if reliable 2 node operation were designed into RBD, it would eventually replace some of the uses of DRBD.
>
>
> 2) Is there any way of preventing two users of an RBD device from using the device concurrently?  Is there some way to create "locks" with RADOS that would be released if a node dies?  If so, this would allow an RBD device to be safely mounted with a non-distributed FS such as ext3 exclusively on one of many hosts.  This would open up the use of RBD devices for linux containers or linux vservers which could run on any machine in a cluster (similar to the idea of using it with kvm/qemu).

We were just thinking about the proper solution to this problem
ourselves. There are a few options. One is to add some kind of
locking mechanism to the osd, which would allow doing just that. E.g.,
a client would take a lock, do whatever it needs to do, and a second
client trying to get the lock would be able to hold it only
after the first one has released it. Another option would be to have
the clients handle the mutual exclusion themselves (hence not enforced
by the osd) by setting flags and leases on the rbd header. There are
other options, but the latter would be much easier to implement, so
we'll start from there.
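
As a toy illustration of the first option (plain Python, not a real osd
interface; all the names here are made up):

import threading

class ImageLock:
    # Stands in for a hypothetical per-image lock object held by an osd.
    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def acquire(self, client_id):
        self._lock.acquire()        # a second client blocks here
        self.owner = client_id

    def release(self, client_id):
        if self.owner != client_id:
            raise RuntimeError("lock not held by this client")
        self.owner = None
        self._lock.release()

lock = ImageLock()
lock.acquire("node-a")              # node-a maps and mounts the image
# node-b's acquire("node-b") would block until node-a calls release()
lock.release("node-a")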

>
> Thanks, I look forward to playing with RBD and ceph!
>

Thank you!

Yehuda


* Re: Replacing DRBD use with RBD
  2010-05-05  7:30 ` Alex Elsayed
@ 2010-05-05 20:02   ` Alex Elsayed
  2010-05-05 20:13     ` Yehuda Sadeh Weinraub
  2010-05-05 20:59     ` Martin Fick
  0 siblings, 2 replies; 8+ messages in thread
From: Alex Elsayed @ 2010-05-05 20:02 UTC (permalink / raw)
  To: ceph-devel

Replying to a response that was off-list:

On Wed, May 5, 2010 at 9:02 AM, Martin Fick <mogulguy@yahoo.com> wrote:
>
> --- On Wed, 5/5/10, Alex Elsayed <eternaleye@gmail.com> wrote:
>
> > >...This would open up the use of RBD devices for linux
> > > containers or linux vservers which could run on any
> > > machine in a cluster (similar to the idea of using it
> > > with kvm/qemu).
> >
> > As it currently stands you could likely run a vserver or an
> > OpenVZ/Virtuozzo/LXC container on Ceph (the distributed FS)
> > directly, rather than layering a local FS over RBD. Also,
> > this would probably provide better performance in the end.
>
> Could you please explain why you would think that this
> would provide better performance in the end?  I would think
> that a simpler local filesystem (with remote reads/writes)
> could outperform ceph in most situations that would matter
> for virtual systems (i.e. low latencies for small
> reads/writes), would it not?

I would recommend benchmarking to have empirical results rather
than going with my presumptions, but in Ceph the metadata servers
cache the metadata and the OSDs journal writes, so any writes which
fit in the journal will be quite fast. Also, RBD has no way of
knowing what reads/writes are 'small' in the RBD block device,
because it works by splitting the disk image into 4MB chunks and
deals with those. That means that even small reads and writes
have a minimum size of 4MB.

>
> > ...This is probably a better solution for container-based
> > virtualization than RBD-based options, due to the advantage
> > one can take of all guests sharing a kernel with the host.
>
> I am not sure I understand why you are saying the guest/host
> sharing thing is an advantage that would benefit using ceph
> over RBD, could you please expound?

This is an advantage in the container virtualization case because
you can (say) mount the entire Ceph FS on the host and simply run
the containers from a very basic LXC or other container config,
treating the Ceph filesystem as just another directory tree from
the point of view of the container. This simplifies your container
config and gives the advantages I named earlier (online resize, etc.).

> > RBD is more likely to be useful for full virtualization
> > like KVM,
>
> Again, why so specifically?

Because for containers, the config is simplest when you can hand
them a directory tree, but for full virtualization, the config is
simplest when you can hand them a block device. Simplicity reduces
the number of potential points where errors can be introduced.

> I agree that ceph would also have its advantages, but
> RBD based solutions would likely have some advantages
> that ceph will never have.  RBD allows one to use any
> local filesystem with any semantics/features that one
> wishes. RBD is simpler.  RBD is likely currently more
> mature than ceph?

Ceph has POSIX (or as close as possible) semantics, matching local
filesystems, and provides more features than any local FS except
BtrFS, which is similarly under heavy development.

RBD is actually a rather recent addition - the first mailing
list message about it was on March 7th, 2010, whereas Ceph has
been in development since 2007.

I am posting this to the mailing list as well, as others may find it 
interesting.



* Re: Replacing DRBD use with RBD
  2010-05-05 20:02   ` Alex Elsayed
@ 2010-05-05 20:13     ` Yehuda Sadeh Weinraub
  2010-05-05 20:59     ` Martin Fick
  1 sibling, 0 replies; 8+ messages in thread
From: Yehuda Sadeh Weinraub @ 2010-05-05 20:13 UTC (permalink / raw)
  To: Alex Elsayed; +Cc: ceph-devel

On Wed, May 5, 2010 at 1:02 PM, Alex Elsayed <eternaleye@gmail.com> wrote:
> Replying to a response that was off-list:
>
> On Wed, May 5, 2010 at 9:02 AM, Martin Fick <mogulguy@yahoo.com> wrote:
>>
>> --- On Wed, 5/5/10, Alex Elsayed <eternaleye@gmail.com> wrote:
>>
>> > >...This would open up the use of RBD devices for linux
>> > > containers or linux vservers which could run on any
>> > > machine in a cluster (similar to the idea of using it
>> > > with kvm/qemu).
>> >
>> > As it currently stands you could likely run a vserver or an
>> > OpenVZ/Virtuozzo/LXC container on Ceph (the distributed FS)
>> > directly, rather than layering a local FS over RBD. Also,
>> > this would probably provide better performance in the end.
>>
>> Could you please explain why you would think that this
>> would provide better performance in the end?  I would think
>> that a simpler local filesystem (with remote reads/writes)
>> could outperform ceph in most situations that would matter
>> for virtual systems (i.e. low latencies for small
>> reads/writes), would it not?
>
> I would recommend benchmarking to have empirical results rather
> than going with my presumptions, but in Ceph the metadata servers
> cache the metadata and the OSDs journal writes, so any writes which
> fit in the journal will be quite fast. Also, RBD has no way of
> knowing what reads/writes are 'small' in the RBD block device,
> because it works by splitting the disk image into 4MB chunks and
> deals with those. That means that even small reads and writes
> have a minimum size of 4MB.

Just a correction: although rbd stripes data over 4MB objects, it can
do reads and writes at sector-size granularity, that is, 512 bytes.
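
To illustrate, here is a rough sketch of the layout (plain Python, not
the actual rbd code; the constants are just the values mentioned above):

OBJECT_SIZE = 4 * 1024 * 1024    # the image is striped over 4MB objects
SECTOR_SIZE = 512                # request granularity

def map_request(offset, length):
    # Yield (object_index, offset_in_object, chunk_length) for an I/O request.
    assert offset % SECTOR_SIZE == 0 and length % SECTOR_SIZE == 0
    while length > 0:
        obj = offset // OBJECT_SIZE
        off = offset % OBJECT_SIZE
        chunk = min(length, OBJECT_SIZE - off)
        yield obj, off, chunk
        offset += chunk
        length -= chunk

# A single 4KB write touches only part of one object and moves 4KB, not 4MB:
print(list(map_request(offset=5 * 1024 * 1024, length=4096)))
# -> [(1, 1048576, 4096)]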

>
>>
>> > ...This is probably a better solution for container-based
>> > virtualization than RBD-based options, due to the advantage
>> > one can take of all guests sharing a kernel with the host.
>>
>> I am not sure I understand why you are saying the guest/host
>> sharing thing is an advantage that would benefit using ceph
>> over RBD, could you please expound?
>
> This is an advantage in the container virtualization case because
> you can (say) mount the entire Ceph FS on the host and simply run
> the containers from a very basic LXC or other container config,
> treating the Ceph filesystem as just another directory tree from
> the point of view of the container. This simplifies your container
> config and gives the advantages I named earlier (online resize, etc.).
>
>> > RBD is more likely to be useful for full virtualization
>> > like KVM,
>>
>> Again, why so specifically?
>
> Because for containers, the config is simplest when you can hand
> them a directory tree, but for full virtualization, the config is
> simplest when you can hand them a block device. Simplicity reduces
> the number of potential points where errors can be introduced.
>
>> I agree that ceph would also have its advantages, but
>> RBD based solutions would likely have some advantages
>> that ceph will never have.  RBD allows one to use any
>> local filesystem with any semantics/features that one
>> wishes. RBD is simpler.  RBD is likely currently more
>> mature than ceph?
>
> Ceph has POSIX (or as close as possible) semantics, matching local
> filesystems, and provides more features than any local FS except
> BtrFS, which is similarly under heavy development.
>
> RBD is actually a rather recent addition - the first mailing
> list message about it was on March 7th, 2010, whereas Ceph has
> been in development since 2007.

It is a recent addition, although most of it uses the same ceph
filesystem infrastructure that has been in development since 2007, so
in a sense rbd is just a small extension of the ceph filesystem. The
ceph filesystem is indeed much more mature and has undergone much more
extensive testing. Hopefully, rbd is simple enough that it won't take
too long to get it on par with the fs.

Thanks,
Yehuda


* Re: Replacing DRBD use with RBD
  2010-05-05 20:02   ` Alex Elsayed
  2010-05-05 20:13     ` Yehuda Sadeh Weinraub
@ 2010-05-05 20:59     ` Martin Fick
  1 sibling, 0 replies; 8+ messages in thread
From: Martin Fick @ 2010-05-05 20:59 UTC (permalink / raw)
  To: ceph-devel, Alex Elsayed

--- On Wed, 5/5/10, Alex Elsayed <eternaleye@gmail.com> wrote:

Sorry about the accidental off-list reply, thanks 
for replying on list...
 

> I would recommend benchmarking to have empirical results
> rather than going with my presumptions, 

Or my presumptions, :) agreed.

> but in Ceph the metadata servers cache the metadata and
> the OSDs journal writes, so any writes which fit in the
> journal will be quite fast.

Yes, but the kernel will do that also (albeit differently)
for local file systems even if the block device is remote.

> Also, RBD has no way of knowing what reads/writes are
> 'small' in the RBD block device, because it works by splitting
> the disk image into 4MB chunks and deals with those.

Good point. I assume that is tunable, at least by editing
the rbd driver, no?


> This is an advantage in the container virtualization case
> because you can (say) mount the entire Ceph FS on the
> host and simply run the containers from a very basic LXC
> or other container config, treating the Ceph filesystem
> as just another directory tree from the point of view of
> the container. This simplifies your container config and
> gives the advantages I named earlier (online resize, etc.).

Hmm, while some of those are good advantages (and
some may not be, depending on your mindset), I am missing
the main point as to why this is different than with
"real VMs", except for maybe your claim of "not being 
usual" ...

 
> Ceph has POSIX (or as close as possible) semantics,
> matching local filesystems, and provides more features
> than any local FS except BtrFS, which is similarly 
> under heavy development.

Perhaps it has many of the features that you want,
but there are many things that other FSs can do
(good and bad, depending again on your mindset)
that ceph cannot and will likely never be able to
do.  For example, can it be case-insensitive like
a DOS FS?


> RBD is actually a rather recent addition - the first
> mailing list message about it was on March 7th, 2010,
> whereas Ceph has been in development since 2007.

True, I guess I meant that using the OSDs will likely
always be simpler and more stable than using them 
through ceph.

Thanks,

-Martin



      


* Re: Replacing DRBD use with RBD
  2010-05-05 20:34 Martin Fick
@ 2010-05-06  5:10 ` Thomas Mueller
  0 siblings, 0 replies; 8+ messages in thread
From: Thomas Mueller @ 2010-05-06  5:10 UTC (permalink / raw)
  To: ceph-devel


>> There is a heartbeat mechanism within the mon cluster, and it's used
>> by the monitors to keep track of their peer status. It might be a
>> good idea to add different configurable types of heartbeats.
> 
> Yes, specifically, I meant using some of the techniques that the
> heartbeat project uses:
> 
> http://www.linux-ha.org/wiki/Heartbeat
> 
> Ideally (my suggestion) they would make some of them available in a
> library so that other projects like RADOS could use them independently
> without having to rewrite them from scratch.

IMHO Heartbeat is not developed anymore; the new thing (some parts 
derived from Heartbeat) is pacemaker/corosync. It is/will be standard on 
RH (Fedora, EL), SuSE, Ubuntu and Debian.

I'm not a developer, but IMHO ceph could (/should?) take advantage of 
these projects so as not to reinvent the wheel. ;)

- Thomas

* http://www.clusterlabs.org
* http://en.wikipedia.org/wiki/Corosync_%28project%29
* http://www.openais.org



* Re: Replacing DRBD use with RBD
@ 2010-05-05 20:34 Martin Fick
  2010-05-06  5:10 ` Thomas Mueller
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Fick @ 2010-05-05 20:34 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub; +Cc: ceph-devel

--- On Wed, 5/5/10, Yehuda Sadeh Weinraub <yehudasa@gmail.com> wrote:
> The problem is that the ceph monitors require a quorum in
> order to decide on the cluster state. The way the system 
> works right now, a 2-way monitor setup would be less stable 
> than a system with a single monitor, since it wouldn't work
> whenever either of the two monitors crashes. 

Right, that is indeed not nice. :)

> A possible workaround would be to have a special case for
> 2-way mon clusters, where a single mon would be enough for
> a majority. I'm not sure whether this is actually 
> feasible. As usual, the devil is in the details.

Yes. One simple way is to use a ping node.  If a node can
reach the ping node, but not its peer, it should be able
to assume "lone operation" and thus effectively degrade to
a single monitor situation temporarily. I guess my question
is, "is this something that the ceph project is 
potentially willing to support for OSDs?"
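
To sketch what I mean (toy Python, purely illustrative; this is not
something ceph does today):

def decide_two_node_state(peer_reachable, ping_node_reachable):
    if peer_reachable:
        return "normal"           # both monitors up, regular quorum
    if ping_node_reachable:
        return "lone-operation"   # peer looks dead; degrade temporarily
    return "suspend"              # we are probably the isolated side; stop

assert decide_two_node_state(True, True) == "normal"
assert decide_two_node_state(False, True) == "lone-operation"
assert decide_two_node_state(False, False) == "suspend"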

I suspect that supporting dynamic reconfiguration:
http://en.wikipedia.org/wiki/Paxos_algorithm#Cheap_Paxos
would also help a great deal in making clusters more
adaptable.


> > One suggestion for doing this would be to use some of
> > the same techniques that heartbeat uses to determine
> > whether a node has gone down or whether there is instead
> > a network partition: a serial port connection, common
> > ping nodes (such as a router)...

> There is a heartbeat mechanism within the mon cluster, and
> it's used by the monitors to keep track of their peer
> status. It might be a good idea to add different configurable 
> types of heartbeats.

Yes, specifically, I meant using some of the techniques
that the heartbeat project uses:

http://www.linux-ha.org/wiki/Heartbeat

Ideally (my suggestion) they would make some of them 
available in a library so that other projects like 
RADOS could use them independently without having to 
rewrite them from scratch.



> > 2) Is there any way of preventing two users of an RBD
> > device from using the device concurrently?  ...
> 
> We were just thinking about the proper solution to this
> problem ourselves. There are a few options. One is to 
> add some kind of locking mechanism to the osd, which
> would allow doing just that. E.g., a client would take 
> a lock, do whatever it needs to do, and a second client 
> trying to get the lock would be able to hold it only
> after the first one has released it. Another option would
> be to have the clients handle the mutual exclusion 
> themselves (hence not enforced by the osd) by setting 
> flags and leases on the rbd header.

I'm curious, do you mean a scheme such as regularly writing
the name of the node "locking" the image, along with a
timestamp, to the header as a heartbeat?  Along with some
lock acquisition logic?
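
Something like this toy sketch is what I have in mind (plain Python,
just to illustrate the question; the lease length and field names are
made up):

import time

LEASE_SECONDS = 30
header = {"holder": None, "stamp": 0.0}   # stands in for fields in the rbd header

def try_acquire(node, now=None):
    now = time.time() if now is None else now
    expired = header["holder"] is None or now - header["stamp"] > LEASE_SECONDS
    if expired or header["holder"] == node:
        header["holder"], header["stamp"] = node, now   # (re)claim or renew
        return True
    return False

assert try_acquire("node-a", now=100.0)       # first node takes the lease
assert not try_acquire("node-b", now=110.0)   # still held and fresh
assert try_acquire("node-b", now=140.0)       # lease expired; node-b takes over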

Thanks for the replies!

-Martin



      

