* Client Location
       [not found] <1253073523.409.1349787553083.JavaMail.root@corellia.pncl.co.uk>
@ 2012-10-09 13:14 ` James Horner
  2012-10-09 13:30   ` Wido den Hollander
  2012-10-09 16:43   ` Mark Kampe
  0 siblings, 2 replies; 7+ messages in thread
From: James Horner @ 2012-10-09 13:14 UTC (permalink / raw)
  To: ceph-devel

Hi There 




I have a simple test cluster spread across 2 datacenters, set up as follows:

DC1: 
mon.w 
mon.x 
mds.w 
mds.x 
osd1 

DC2: 
mon.e 
mds.e 
osd2 

Each DC has a hypervisor (Proxmox running qemu 1.1.1) which can connect to the cluster fine. I think I have the crush map set up to replicate between the datacenters, but when I run a VM with a disk on the cluster the hypervisors connect to the OSDs in the other datacenter. Is there a way to tell qemu that it is in DC1 or DC2 and to prefer those OSDs?

Thanks. 
James 



# begin crush map

# devices
device 0 osd.0
device 1 osd.1

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host ceph-test-dc1-osd1 {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host ceph-test-dc2-osd1 {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
rack dc1-rack1 {
	id -3		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item ceph-test-dc1-osd1 weight 1.000
}
rack dc2-rack1 {
	id -5
	alg straw
	hash 0
	item ceph-test-dc2-osd1 weight 1.000
}
datacenter dc1 {
	id -6
	alg straw
	hash 0
	item dc1-rack1 weight 1.000
}
datacenter dc2 {
	id -7
	alg straw
	hash 0
	item dc2-rack1 weight 1.000
}
pool proxmox {
	id -1		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item dc1 weight 2.000
	item dc2 weight 2.000
}

# rules
rule proxmox {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}

# end crush map




* Re: Client Location
  2012-10-09 13:14 ` Client Location James Horner
@ 2012-10-09 13:30   ` Wido den Hollander
  2012-10-09 13:33     ` james.horner
  2012-10-09 16:43   ` Mark Kampe
  1 sibling, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2012-10-09 13:30 UTC (permalink / raw)
  To: James Horner; +Cc: ceph-devel

On 10/09/2012 03:14 PM, James Horner wrote:
> Hi There
>
>
>
>
> I have a simple test cluster spread across 2 datacenters setup as follows
>
> DC1:
> mon.w
> mon.x
> mds.w
> mds.x
> osd1
>
> DC2:
> mon.e
> mds.e
> osd2
>
> Each DC has a hypervisor(Proxmox running qemu 1.1.1) which can connect to the cluster fine. I think I have the crush map setup to replicate between the datacenters but when I run a VM with a disk on the cluster the hv's connect to the OSD's in the other datacenter. Is there a way to tell qemu that it is DC1 or DC2 and to prefer those osd's?
>

No, there is no such way. Ceph is designed to work on a local network 
where it doesn't matter where the nodes are or how the client connects.

You are not the first to ask this question. People have been thinking 
about localizing data, but there have been no concrete plans.

(See note on crushmap below btw)

> Thanks.
> James
>
>
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 pool
>
> # buckets
> host ceph-test-dc1-osd1 {
> id -2 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 1.000
> }
> host ceph-test-dc2-osd1 {
> id -4 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.1 weight 1.000
> }
> rack dc1-rack1 {
> id -3 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item ceph-test-dc1-osd1 weight 1.000
> }
>

You don't need to specify a weight for the rack in this case; it will 
take the accumulated weight of all the hosts in it.

> rack dc2-rack1 {
> id -5
> alg straw
> hash 0
> item ceph-test-dc2-osd1 weight 1.000
> }
>
> datacenter dc1 {
> id -6
> alg straw
> hash 0
> item dc1-rack1 weight 1.000
> }
>
> datacenter dc2 {
> id -7
> alg straw
> hash 0
> item dc2-rack1 weight 1.000
> }
>
> pool proxmox {
> id -1 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item dc1 weight 2.000
> item dc2 weight 2.000
> }
>


The same goes here: the datacenters get their weight by summing up the 
racks and hosts.

While in your case it doesn't matter that much, you should let CRUSH do 
the calculating when possible.
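
For illustration, here is a sketch of how that root bucket might look 
with item weights matching what CRUSH would compute from the subtrees 
(each datacenter currently holds a single 1.000-weight OSD, hence 1.000 
rather than 2.000):

pool proxmox {
	id -1		# do not change unnecessarily
	alg straw
	hash 0	# rjenkins1
	item dc1 weight 1.000
	item dc2 weight 1.000
}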

Wido

> # rules
> rule proxmox {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type datacenter
>
> step emit
> }
>
>
> # end crush map
>
>



* Re: Client Location
  2012-10-09 13:30   ` Wido den Hollander
@ 2012-10-09 13:33     ` james.horner
  0 siblings, 0 replies; 7+ messages in thread
From: james.horner @ 2012-10-09 13:33 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel, James Horner

Hi Wido

Thanks for the response and the advice. It's a shame, as otherwise Ceph meets all our needs.

James

----- Original Message -----
From: "Wido den Hollander" <wido@widodh.nl>
To: "James Horner" <james.horner@precedent.co.uk>
Cc: ceph-devel@vger.kernel.org
Sent: Tuesday, October 9, 2012 2:30:30 PM
Subject: Re: Client Location

On 10/09/2012 03:14 PM, James Horner wrote:
> Hi There
>
>
>
>
> I have a simple test cluster spread across 2 datacenters setup as follows
>
> DC1:
> mon.w
> mon.x
> mds.w
> mds.x
> osd1
>
> DC2:
> mon.e
> mds.e
> osd2
>
> Each DC has a hypervisor(Proxmox running qemu 1.1.1) which can connect to the cluster fine. I think I have the crush map setup to replicate between the datacenters but when I run a VM with a disk on the cluster the hv's connect to the OSD's in the other datacenter. Is there a way to tell qemu that it is DC1 or DC2 and to prefer those osd's?
>

No, there is no such way. Ceph is designed to work on a local network 
where it doesn't matter where the nodes are or how the client connects.

You are not the first to ask this question. People having been thinking 
about localizing data, but there have been no concrete plans.

(See note on crushmap below btw)

> Thanks.
> James
>
>
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 pool
>
> # buckets
> host ceph-test-dc1-osd1 {
> id -2 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 1.000
> }
> host ceph-test-dc2-osd1 {
> id -4 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.1 weight 1.000
> }
> rack dc1-rack1 {
> id -3 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item ceph-test-dc1-osd1 weight 1.000
> }
>

You don't need to specify a weight to the rack in this case, it will 
take the accumulated weight of all the hosts it has in it.

> rack dc2-rack1 {
> id -5
> alg straw
> hash 0
> item ceph-test-dc2-osd1 weight 1.000
> }
>
> datacenter dc1 {
> id -6
> alg straw
> hash 0
> item dc1-rack1 weight 1.000
> }
>
> datacenter dc2 {
> id -7
> alg straw
> hash 0
> item dc2-rack1 weight 1.000
> }
>
> pool proxmox {
> id -1 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item dc1 weight 2.000
> item dc2 weight 2.000
> }
>


Same goes here, the dc's get their weight by summing up the racks and hosts.

While in your case it doesn't matter that much, you should let crush do 
the calculating when possible.

Wido

> # rules
> rule proxmox {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type datacenter
>
> step emit
> }
>
>
> # end crush map
>
>



* Re: Client Location
  2012-10-09 13:14 ` Client Location James Horner
  2012-10-09 13:30   ` Wido den Hollander
@ 2012-10-09 16:43   ` Mark Kampe
  2012-10-09 16:48     ` Gregory Farnum
  1 sibling, 1 reply; 7+ messages in thread
From: Mark Kampe @ 2012-10-09 16:43 UTC (permalink / raw)
  To: James Horner; +Cc: ceph-devel

I'm not a real engineer, so please forgive me if I misunderstand,
but can't you create a separate rule for each data center (choosing
first a local copy, and then remote copies)? That should ensure
that the primary is always local. Each data center would then
use a different pool, associated with the appropriate
location-sensitive rule.

Does this approach get you the desired locality preference?
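
For concreteness, a rough sketch of what such a rule might look like for 
DC1, reusing the bucket names from the map above (the DC2 counterpart 
would simply swap dc1 and dc2; the ruleset number and the firstn counts 
are assumptions for a two-copy pool):

rule proxmox-dc1 {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	# first replica (the primary) from the local datacenter
	step take dc1
	step chooseleaf firstn 1 type host
	step emit
	# remaining replicas from the other datacenter
	step take dc2
	step chooseleaf firstn -1 type host
	step emit
}

A pool used by the DC1 hypervisors could then be pointed at that ruleset 
with something like "ceph osd pool set <pool> crush_ruleset 1".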



* Re: Client Location
  2012-10-09 16:43   ` Mark Kampe
@ 2012-10-09 16:48     ` Gregory Farnum
  2012-10-10  9:16       ` James Horner
  0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2012-10-09 16:48 UTC (permalink / raw)
  To: Mark Kampe; +Cc: James Horner, ceph-devel

On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe <mark.kampe@inktank.com> wrote:
> I'm not a real engineer, so please forgive me if I misunderstand,
> but can't you create a separate rule for each data center (choosing
> first a local copy, and then remote copies), which should ensure
> that the primary is always local.  Each data center would then
> use a different pool, associated with the appropriate location-
> sensitive rule.
>
> Does this approach get you the desired locality preference?

This sounds right to me — I think maybe there's a misunderstanding
about how CRUSH works. What precisely are you after, James?
-Greg


* Re: Client Location
  2012-10-09 16:48     ` Gregory Farnum
@ 2012-10-10  9:16       ` James Horner
  2012-10-10 16:39         ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: James Horner @ 2012-10-10  9:16 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: James Horner, ceph-devel, Mark Kampe

Hi There

The basic setup I'm trying to get is a backend for a hypervisor cluster, so that auto-failover and live migration work. The main thing is that we have a number of datacenters with a gigabit interconnect that is not always 100% reliable. In the event of a failure we want all the virtual machines to fail over to the remaining datacenters, so we need all the data in each location.
The other issue is that within each datacenter we can use link aggregation to increase the bandwidth between the hypervisors and the Ceph cluster, but between the datacenters we only have the gigabit link, so it becomes essential to have the hypervisors looking at the storage in the same datacenter.
Another consideration is that the virtual machines might get migrated between datacenters without any failure, and the main problem I see with Mark's suggestion is that in this case the migrated VM would still be connecting to the OSDs in the remote datacenter.

To be honest, I'm fairly new to Ceph and I know I'm asking for everything and the kitchen sink! Any thoughts would be very helpful though.

Thanks
James

----- Original Message -----
From: "Gregory Farnum" <greg@inktank.com>
To: "Mark Kampe" <mark.kampe@inktank.com>
Cc: "James Horner" <james.horner@precedent.co.uk>, ceph-devel@vger.kernel.org
Sent: Tuesday, October 9, 2012 5:48:37 PM
Subject: Re: Client Location

On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe <mark.kampe@inktank.com> wrote:
> I'm not a real engineer, so please forgive me if I misunderstand,
> but can't you create a separate rule for each data center (choosing
> first a local copy, and then remote copies), which should ensure
> that the primary is always local.  Each data center would then
> use a different pool, associated with the appropriate location-
> sensitive rule.
>
> Does this approach get you the desired locality preference?

This sounds right to me — I think maybe there's a misunderstanding
about how CRUSH works. What precisely are you after, James?
-Greg


* Re: Client Location
  2012-10-10  9:16       ` James Horner
@ 2012-10-10 16:39         ` Sage Weil
  0 siblings, 0 replies; 7+ messages in thread
From: Sage Weil @ 2012-10-10 16:39 UTC (permalink / raw)
  To: James Horner; +Cc: Gregory Farnum, ceph-devel, Mark Kampe

On Wed, 10 Oct 2012, James Horner wrote:
> Hi There
> 
> The basic setup Im trying to get is a backend to a Hypervisor cluster, 
> so that auto-failover and live migration works. The mail thing is that 
> we have a number of datacenters with a gigabit interconnect that is not 
> always 100% reliable. In the event of a failure we want all the virtual 
> machines to fail over to the remaining datacenters, so we need all the 
> data in each location.
>
> The other issue is that within each datacenter we can use link 
> aggregation to increase the bandwidth between hypervisors and the ceph 
> cluster but between the datacenters we only have the gigabit so it 
> become essential to have the hyperviors looking at the storage in the 
> same datacenter.

Ceph replication is synchronous, so even if you are writing to a local 
OSD, it will be updating the replica at the remote DC. The 1 Gbps link 
may quickly become a bottleneck. This is a matter of having your cake 
and eating it too... you can't seamlessly fail over to another DC if you 
don't synchronously replicate to it.
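
(As a back-of-the-envelope figure: with two replicas, every write to the 
local OSD is also shipped to the remote replica, so aggregate write 
throughput is capped at roughly 1 Gbps, i.e. about 125 MB/s before 
overhead, regardless of the link aggregation inside each DC.)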

> Another consideration is that the virtual machines might get migrated 
> between datacenters without any failure, and the main problem I see with 
> Mark suggests is that in this mode the migrated VM would still be 
> connecting to the OSD's in the remote datacenter.

The new rbd cloning functionality can be used to 'migrate' an image by 
cloning it to a different pool (the new local DC) and then later (in the 
background, whenever) doing a 'flatten' to migrate the data from the 
parent to the clone. Performance will be slower initially but will 
improve once the data is migrated.
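
For illustration, a rough sketch of that workflow on the rbd command 
line (pool and image names here are hypothetical; cloning requires 
format 2 images):

  # snapshot and protect the source image, then clone it into the pool
  # that favours the new local DC
  rbd snap create dc2-pool/vm-disk@migrate
  rbd snap protect dc2-pool/vm-disk@migrate
  rbd clone dc2-pool/vm-disk@migrate dc1-pool/vm-disk

  # repoint the VM at dc1-pool/vm-disk, then later, in the background:
  rbd flatten dc1-pool/vm-disk

  # once flattened, the snapshot can be unprotected and removed
  rbd snap unprotect dc2-pool/vm-disk@migrate
  rbd snap rm dc2-pool/vm-disk@migrate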

This isn't a perfect solution for your use case, but it would work.

sage

> Tbh Im fairly new to ceph and I know im asking for everything and the 
> kitchen sink! Any thoughts would be very helpful though.
> 
> Thanks
> James
> 
> ----- Original Message -----
> From: "Gregory Farnum" <greg@inktank.com>
> To: "Mark Kampe" <mark.kampe@inktank.com>
> Cc: "James Horner" <james.horner@precedent.co.uk>, ceph-devel@vger.kernel.org
> Sent: Tuesday, October 9, 2012 5:48:37 PM
> Subject: Re: Client Location
> 
> On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe <mark.kampe@inktank.com> wrote:
> > I'm not a real engineer, so please forgive me if I misunderstand,
> > but can't you create a separate rule for each data center (choosing
> > first a local copy, and then remote copies), which should ensure
> > that the primary is always local.  Each data center would then
> > use a different pool, associated with the appropriate location-
> > sensitive rule.
> >
> > Does this approach get you the desired locality preference?
> 
> This sounds right to me — I think maybe there's a misunderstanding
> about how CRUSH works. What precisely are you after, James?
> -Greg

