* Ceph distributed over slow link: possible?
@ 2011-01-21 19:55 Matthias Urlichs
  2011-01-22  8:45 ` DongJin Lee
  2011-01-22  9:58 ` Ravi Pinjala
  0 siblings, 2 replies; 5+ messages in thread
From: Matthias Urlichs @ 2011-01-21 19:55 UTC (permalink / raw)
  To: ceph-devel

Hello ceph people,

My situation is this: my Ceph cluster is distributed over multiple
sites. The links between sites are rather slow. :-/

Storing one copy of a file at each site should not be a problem with
a reasonable crushmap, but ..:

* how can I verify on which devices a file is stored?

* is it possible to teach clients to read/write from "their", i.e.
  the local site's, copy of a file, instead of pulling stuff from
  a remote site? Or does ceph notice the speed difference by itself?

* My crushmap looks like this:
type 0  device
type 1  host
type 2  site
type 3  root
... (root => 2 sites => 2 hosts each => 3 devices each)
rule  data {
        ruleset 0
        type replicated
        min_size 2
        max_size 2
        step take  root
        step chooseleaf firstn 2 type site
        step emit
}

but when only one site is reachable, will there be one or two
copies of a file? If the former, how do I fix that? If the latter,
will the copy be redistributed when (the link to) the second site
comes back?

-- 


* Re: Ceph distributed over slow link: possible?
  2011-01-21 19:55 Ceph distributed over slow link: possible? Matthias Urlichs
@ 2011-01-22  8:45 ` DongJin Lee
  2011-01-22  9:58 ` Ravi Pinjala
  1 sibling, 0 replies; 5+ messages in thread
From: DongJin Lee @ 2011-01-22  8:45 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: ceph-devel

On Sat, Jan 22, 2011 at 8:55 AM, Matthias Urlichs <matthias@urlichs.de> wrote:
> Hello ceph people,
>
> My situation is this: my Ceph cluster is distributed over multiple
> sites. The links between sites are rather slow. :-/
>
> Storing one copy of a file at each site should not be a problem with
> a reasonable crushmap, but ..:
>
> * how can I verify on which devices a file is stored?
>
> * is it possible to teach clients to read/write from "their", i.e.
>  the local site's, copy of a file, instead of pulling stuff from
>  a remote site? Or does ceph notice the speed difference by itself?
>
> * My crushmap looks like this:
> type 0  device
> type 1  host
> type 2  site
> type 3  root
> ... (root => 2 sites => 2 hosts each => 3 devices each)
> rule  data {
>        ruleset 0
>        type replicated
>        min_size 2
>        max_size 2
>        step take  root
>        step chooseleaf firstn 2 type site
>        step emit
> }
>
> but when only one site is reachable, will there be one or two
> copies of a file? If the former, how do I fix that? If the latter,
> will the copy be redistributed when (the link to) the second site
> comes back?

I've got similar questions. Using the above example, we'd have a 2 x
(2x3) setup:
- e.g., with 2x replication, each site would store the same replicas
(i.e., mirrored);
- but I'm unsure what happens when one of the sites goes down and later
comes back up;
- and how are the objects laid out across the remaining 2 hosts with 3
OSDs each (2x3)?
I think the rule needs multiple take or choose steps to get more
specific behaviour; e.g., to keep a single failed host from taking its
3 OSDs' worth of objects with it, do we add another 'step chooseleaf
type host'?
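
For illustration only, such a rule might look like this (untested
sketch in the same syntax as the crushmap above; the rule name,
ruleset number and min/max sizes are my guesses):

rule data_site_host {
        ruleset 1
        type replicated
        min_size 2
        max_size 4
        step take root
        step choose firstn 0 type site
        step chooseleaf firstn 2 type host
        step emit
}

i.e. take every site first, then two hosts' worth of leaf devices
within each site, so losing a single host only costs one of the two
local copies.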

Thanks, DJ

* Re: Ceph distributed over slow link: possible?
  2011-01-21 19:55 Ceph distributed over slow link: possible? Matthias Urlichs
  2011-01-22  8:45 ` DongJin Lee
@ 2011-01-22  9:58 ` Ravi Pinjala
  2011-01-22 11:06   ` Matthias Urlichs
  1 sibling, 1 reply; 5+ messages in thread
From: Ravi Pinjala @ 2011-01-22  9:58 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: ceph-devel

On Fri, Jan 21, 2011 at 11:55 AM, Matthias Urlichs <matthias@urlichs.de> wrote:
> Hello ceph people,
>
> My situation is this: my Ceph cluster is distributed over multiple
> sites. The links between sites are rather slow. :-/
>
> Storing one copy of a file at each site should not be a problem with
> a reasonable crushmap, but ..:
>
> * how can I verify on which devices a file is stored?
>
> * is it possible to teach clients to read/write from "their", i.e.
>  the local site's, copy of a file, instead of pulling stuff from
>  a remote site? Or does ceph notice the speed difference by itself?
>
> * My crushmap looks like this:
> type 0  device
> type 1  host
> type 2  site
> type 3  root
> ... (root => 2 sites => 2 hosts each => 3 devices each)
> rule  data {
>        ruleset 0
>        type replicated
>        min_size 2
>        max_size 2
>        step take  root
>        step chooseleaf firstn 2 type site
>        step emit
> }
>
> but when only one site is reachable, will there be one or two
> copies of a file? If the former, how do I fix that? If the latter,
> will the copy be redistributed when (the link to) the second site
> comes back?
>

Not an expert by any stretch, but here goes:

* I don't know of a solid way to verify where your file data is going,
but if you just want to test your replication strategy, you can write
a large file to the cluster and see which OSDs grow. (OSDs store data
in a file-based structure, so 'df' on your storage nodes will actually
give you an accurate account of where the space is used.)
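
For what it's worth, here is a rough sketch of checking placement more
directly, assuming the stock 'ceph' and 'crushtool' tools (exact flags
differ between releases, and the pool/object names below are made up):

# Ask the cluster which PG and which OSDs a given object maps to.
# CephFS stripes a file into objects named after its inode number, so
# "10000000000.00000000" stands in for the first object of some file.
ceph osd map data 10000000000.00000000

# Or test a crushmap offline: compile the text map, then show which
# OSDs rule 0 would pick for each input with 2 replicas.
crushtool -c crushmap.txt -o crushmap.bin
crushtool --test -i crushmap.bin --rule 0 --num-rep 2 --show-mappings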

* I don't think it's possible to control where clients go for reads -
Ceph is pretty much optimized to the case where all nodes are in a
single datacenter, over a more or less homogeneous network. For
writes, though, you'd be stuck with WAN speeds no matter what, because
the data has to go out to all replicas before the writes complete. So
unless your workload is very read-heavy, your performance would still
suffer even if Ceph could read from the closest replica.

* With that crushmap, there'll be only one copy of your data at each
site. If you want higher replication, I think you have to put another
layer in there, something like:

min_size 4
max_size 8
step take  root
step choose firstn 0 type site
step chooseleaf firstn 2 type host
step emit

'chooseleaf' goes straight down to the device level, so we only want
to use it in the last step. Choosing 0 in a step selects all the
buckets that are available, so this should place two copies of your
data at each of your sites, even if you add more sites later on.
(min_size and max_size are updated accordingly.) CRUSH is a pretty
neat system; you could probably get fancier with the data placement
rules if you want.
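
One way to sanity-check the rule before trusting it with data is
crushtool's offline test mode. Sketch only, assuming the flags below
exist in your release; device ids 0-5 for site A and 6-11 for site B
are invented to match the 2x2x3 layout:

# Compile the edited map and show the OSDs chosen for each input.
crushtool -c crushmap.txt -o crushmap.bin
crushtool --test -i crushmap.bin --rule 0 --num-rep 4 --show-mappings

# Simulate site B being unreachable by forcing its devices to weight 0,
# then check whether the rule still emits four OSDs or only two.
crushtool --test -i crushmap.bin --rule 0 --num-rep 4 --show-mappings \
        --weight 6 0 --weight 7 0 --weight 8 0 \
        --weight 9 0 --weight 10 0 --weight 11 0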

Again: I'm hardly an expert on this, so I'm hoping that people with
more experience will come along and correct whatever glaring errors
I've made. :)

--Ravi

* Re: Ceph distributed over slow link: possible?
  2011-01-22  9:58 ` Ravi Pinjala
@ 2011-01-22 11:06   ` Matthias Urlichs
  2011-01-23  3:09     ` Gregory Farnum
  0 siblings, 1 reply; 5+ messages in thread
From: Matthias Urlichs @ 2011-01-22 11:06 UTC (permalink / raw)
  To: ceph-devel

Hi,
> * I don't think it's possible to control where clients go for reads -
> Ceph is pretty much optimized to the case where all nodes are in a
> single datacenter, over a more or less homogeneous network.

So, what piece(s) of code decides which replica is read from?

Note that even in the single-datacenter case, I'd want to avoid crossing
rack boundaries if my client is in the same rack as one of the possible
OSDs.

> For writes,
> though, you'd be stuck with WAN speeds no matter what, because the data
> has to go out to all replicas before the writes complete.

Hmm. Too bad; I'd be more than happy with writing to one or two replicas,
and trusting Ceph to manage the rest of the copying in the background.

-- 
-- Matthias Urlichs


* Re: Ceph distributed over slow link: possible?
  2011-01-22 11:06   ` Matthias Urlichs
@ 2011-01-23  3:09     ` Gregory Farnum
  0 siblings, 0 replies; 5+ messages in thread
From: Gregory Farnum @ 2011-01-23  3:09 UTC (permalink / raw)
  To: Matthias Urlichs; +Cc: ceph-devel

On Sat, Jan 22, 2011 at 3:06 AM, Matthias Urlichs <matthias@urlichs.de> wrote:
> Hi,
>> * I don't think it's possible to control where clients go for reads -
>> Ceph is pretty much optimized to the case where all nodes are in a
>> single datacenter, over a more or less homogeneous network.
>
> So, what piece(s) of code decides which replica is read from?
Clients always read from the primary OSD housing the data, for
consistency purposes. We're working on implementing read-from-replicas
at the librados level, but anybody doing that is probably going to
have to manage their own consistency guarantees. We haven't discussed
implementing it in Ceph at all yet.

>> For writes,
>> though, you'd be stuck with WAN speeds no matter what, because the data
>> has to go out to all replicas before the writes complete.
>
> Hmm. Too bad; I'd be more than happy with writing to one or two replicas,
> and trusting CEPH to manage the rest of the copying in the background.
Unfortunately there's no provision for asynchronous replication in
Ceph's protocols -- it just doesn't fit within Ceph's overall design.
After all, with asynchronous replication, you don't have the right
number of data copies at all times. This is something that's unlikely
to change.

> My situation is this: my Ceph cluster is distributed over multiple
> sites. The links between sites are rather slow. :-/
>
> Storing one copy of a file at each site should not be a problem with
> a reasonable crushmap, but ..:
>
> * how can I verify on which devices a file is stored?
>
> * is it possible to teach clients to read/write from "their", i.e.
>  the local site's, copy of a file, instead of pulling stuff from
>  a remote site? Or does ceph notice the speed difference by itself?
From what I've read, the best (only?) solution designed for this
situation is xtreemfs, and it's my standard recommendation. I haven't
heard back from the people who ask about it whether it actually works,
though.

That said, depending on your exact needs there are one or two possible
solutions with Ceph. If most of your data spends most of its life in
one data center, you could set up an OSD pool that lives in each data
center and set the appropriate parts of the filesystem to use the
appropriate pool (you can specify default layouts, which include the
pool, on directories; those layouts then apply to the whole subtree they root).
You'd have to manage off-site backups yourself in this case, perhaps
via something nasty like rsyncing across the FS at night? Then the
current copy of the data would always be available (albeit at slow
speed) from anywhere and you'd have local backups and off-site
nightlies.
I'm not sure how the metadata cluster would handle this in terms of
dividing authority intelligently, but hopefully it's smart enough or
could be adjusted to do so reasonably easily.
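
If it helps, the mechanics of the pool-per-site idea look roughly like
this on a current release (sketch only: the pool, filesystem, rule and
mount point names are invented, and some of these commands postdate the
releases discussed in this thread):

# Create a pool and bind it to a CRUSH rule that keeps data in one site,
# then let the filesystem place file data in it.
ceph osd pool create site-a-data 128
ceph osd pool set site-a-data crush_rule site-a-rule
ceph fs add_data_pool cephfs site-a-data

# Point a directory at that pool; new files beneath it inherit the layout.
setfattr -n ceph.dir.layout.pool -v site-a-data /mnt/cephfs/site-a
getfattr -n ceph.dir.layout /mnt/cephfs/site-a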
-Greg