* Client Location
[not found] <1253073523.409.1349787553083.JavaMail.root@corellia.pncl.co.uk>
@ 2012-10-09 13:14 ` James Horner
2012-10-09 13:30 ` Wido den Hollander
2012-10-09 16:43 ` Mark Kampe
0 siblings, 2 replies; 7+ messages in thread
From: James Horner @ 2012-10-09 13:14 UTC (permalink / raw)
To: ceph-devel
Hi There
I have a simple test cluster spread across 2 datacenters, set up as follows:
DC1:
mon.w
mon.x
mds.w
mds.x
osd1
DC2:
mon.e
mds.e
osd2
Each DC has a hypervisor (Proxmox running qemu 1.1.1) which can connect to the cluster fine. I think I have the CRUSH map set up to replicate between the datacenters, but when I run a VM with a disk on the cluster the hypervisors connect to the OSDs in the other datacenter. Is there a way to tell qemu that it is in DC1 or DC2 and to prefer those OSDs?
Thanks.
James
# begin crush map
# devices
device 0 osd.0
device 1 osd.1
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool
# buckets
host ceph-test-dc1-osd1 {
id -2 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
}
host ceph-test-dc2-osd1 {
id -4 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.1 weight 1.000
}
rack dc1-rack1 {
id -3 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item ceph-test-dc1-osd1 weight 1.000
}
rack dc2-rack1 {
id -5
alg straw
hash 0
item ceph-test-dc2-osd1 weight 1.000
}
datacenter dc1 {
id -6
alg straw
hash 0
item dc1-rack1 weight 1.000
}
datacenter dc2 {
id -7
alg straw
hash 0
item dc2-rack1 weight 1.000
}
pool proxmox {
id -1 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item dc1 weight 2.000
item dc2 weight 2.000
}
# rules
rule proxmox {
ruleset 0
type replicated
min_size 1
max_size 10
step take proxmox # note: the root bucket above is named 'proxmox'; there is no bucket named 'default' in this map
step chooseleaf firstn 0 type datacenter
step emit
}
# end crush map
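For reference, a map like this is normally round-tripped with the standard CRUSH tooling (filenames here are placeholders):

```shell
ceph osd getcrushmap -o crushmap.bin        # fetch the compiled map from the cluster
crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
# ... edit crushmap.txt ...
crushtool -c crushmap.txt -o crushmap.new   # recompile; fails on syntax errors
ceph osd setcrushmap -i crushmap.new        # inject the new map into the cluster
```

These commands assume a running cluster and admin keyring; they are not runnable in isolation.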
* Re: Client Location
2012-10-09 13:14 ` Client Location James Horner
@ 2012-10-09 13:30 ` Wido den Hollander
2012-10-09 13:33 ` james.horner
2012-10-09 16:43 ` Mark Kampe
1 sibling, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2012-10-09 13:30 UTC (permalink / raw)
To: James Horner; +Cc: ceph-devel
On 10/09/2012 03:14 PM, James Horner wrote:
> Hi There
>
>
>
>
> I have a simple test cluster spread across 2 datacenters setup as follows
>
> DC1:
> mon.w
> mon.x
> mds.w
> mds.x
> osd1
>
> DC2:
> mon.e
> mds.e
> osd2
>
> Each DC has a hypervisor(Proxmox running qemu 1.1.1) which can connect to the cluster fine. I think I have the crush map setup to replicate between the datacenters but when I run a VM with a disk on the cluster the hv's connect to the OSD's in the other datacenter. Is there a way to tell qemu that it is DC1 or DC2 and to prefer those osd's?
>
No, there is no such way. Ceph is designed to work on a local network
where it doesn't matter where the nodes are or how the client connects.
You are not the first to ask this question: people have been thinking
about localizing data, but there are no concrete plans.
(See the note on your crushmap below, btw.)
> Thanks.
> James
>
>
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 pool
>
> # buckets
> host ceph-test-dc1-osd1 {
> id -2 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 1.000
> }
> host ceph-test-dc2-osd1 {
> id -4 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.1 weight 1.000
> }
> rack dc1-rack1 {
> id -3 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item ceph-test-dc1-osd1 weight 1.000
> }
>
You don't need to specify a weight for the rack in this case; it will
take the accumulated weight of all the hosts in it.
> rack dc2-rack1 {
> id -5
> alg straw
> hash 0
> item ceph-test-dc2-osd1 weight 1.000
> }
>
> datacenter dc1 {
> id -6
> alg straw
> hash 0
> item dc1-rack1 weight 1.000
> }
>
> datacenter dc2 {
> id -7
> alg straw
> hash 0
> item dc2-rack1 weight 1.000
> }
>
> pool proxmox {
> id -1 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item dc1 weight 2.000
> item dc2 weight 2.000
> }
>
Same goes here: the DCs get their weight by summing up the racks and hosts.
While in your case it doesn't matter that much, you should let CRUSH do
the calculating when possible.
Wido
> # rules
> rule proxmox {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type datacenter
>
> step emit
> }
>
>
> # end crush map
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: Client Location
2012-10-09 13:30 ` Wido den Hollander
@ 2012-10-09 13:33 ` james.horner
0 siblings, 0 replies; 7+ messages in thread
From: james.horner @ 2012-10-09 13:33 UTC (permalink / raw)
To: Wido den Hollander; +Cc: ceph-devel, James Horner
Hi Wido
Thanks for the response and the advice. It's a shame, as otherwise Ceph meets all our needs.
James
----- Original Message -----
From: "Wido den Hollander" <wido@widodh.nl>
To: "James Horner" <james.horner@precedent.co.uk>
Cc: ceph-devel@vger.kernel.org
Sent: Tuesday, October 9, 2012 2:30:30 PM
Subject: Re: Client Location
On 10/09/2012 03:14 PM, James Horner wrote:
> Hi There
>
>
>
>
> I have a simple test cluster spread across 2 datacenters setup as follows
>
> DC1:
> mon.w
> mon.x
> mds.w
> mds.x
> osd1
>
> DC2:
> mon.e
> mds.e
> osd2
>
> Each DC has a hypervisor(Proxmox running qemu 1.1.1) which can connect to the cluster fine. I think I have the crush map setup to replicate between the datacenters but when I run a VM with a disk on the cluster the hv's connect to the OSD's in the other datacenter. Is there a way to tell qemu that it is DC1 or DC2 and to prefer those osd's?
>
No, there is no such way. Ceph is designed to work on a local network
where it doesn't matter where the nodes are or how the client connects.
You are not the first to ask this question. People having been thinking
about localizing data, but there have been no concrete plans.
(See note on crushmap below btw)
> Thanks.
> James
>
>
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 pool
>
> # buckets
> host ceph-test-dc1-osd1 {
> id -2 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.0 weight 1.000
> }
> host ceph-test-dc2-osd1 {
> id -4 # do not change unnecessarily
> # weight 1.000
> alg straw
> hash 0 # rjenkins1
> item osd.1 weight 1.000
> }
> rack dc1-rack1 {
> id -3 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item ceph-test-dc1-osd1 weight 1.000
> }
>
You don't need to specify a weight to the rack in this case, it will
take the accumulated weight of all the hosts it has in it.
> rack dc2-rack1 {
> id -5
> alg straw
> hash 0
> item ceph-test-dc2-osd1 weight 1.000
> }
>
> datacenter dc1 {
> id -6
> alg straw
> hash 0
> item dc1-rack1 weight 1.000
> }
>
> datacenter dc2 {
> id -7
> alg straw
> hash 0
> item dc2-rack1 weight 1.000
> }
>
> pool proxmox {
> id -1 # do not change unnecessarily
> # weight 2.000
> alg straw
> hash 0 # rjenkins1
> item dc1 weight 2.000
> item dc2 weight 2.000
> }
>
Same goes here, the dc's get their weight by summing up the racks and hosts.
While in your case it doesn't matter that much, you should let crush do
the calculating when possible.
Wido
> # rules
> rule proxmox {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type datacenter
>
> step emit
> }
>
>
> # end crush map
>
>
* Re: Client Location
2012-10-09 13:14 ` Client Location James Horner
2012-10-09 13:30 ` Wido den Hollander
@ 2012-10-09 16:43 ` Mark Kampe
2012-10-09 16:48 ` Gregory Farnum
1 sibling, 1 reply; 7+ messages in thread
From: Mark Kampe @ 2012-10-09 16:43 UTC (permalink / raw)
To: James Horner; +Cc: ceph-devel
I'm not a real engineer, so please forgive me if I misunderstand,
but can't you create a separate rule for each data center (choosing
first a local copy, and then remote copies), which should ensure
that the primary is always local? Each data center would then
use a different pool, associated with the appropriate
location-sensitive rule.
Does this approach get you the desired locality preference?
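A hypothetical rule along those lines, written against the map posted above (the dc1/dc2 bucket names are taken from it; this is an untested sketch, not a recommendation), might look like:

```
# Rule for the pool used by DC1 clients:
# take the primary from dc1, then the remaining replicas from dc2.
rule dc1-primary {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take dc1
	step chooseleaf firstn 1 type host
	step emit
	step take dc2
	step chooseleaf firstn -1 type host
	step emit
}
```

A mirror-image rule (step take dc2 first) would serve a second pool for DC2's hypervisors.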
* Re: Client Location
2012-10-09 16:43 ` Mark Kampe
@ 2012-10-09 16:48 ` Gregory Farnum
2012-10-10 9:16 ` James Horner
0 siblings, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2012-10-09 16:48 UTC (permalink / raw)
To: Mark Kampe; +Cc: James Horner, ceph-devel
On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe <mark.kampe@inktank.com> wrote:
> I'm not a real engineer, so please forgive me if I misunderstand,
> but can't you create a separate rule for each data center (choosing
> first a local copy, and then remote copies), which should ensure
> that the primary is always local. Each data center would then
> use a different pool, associated with the appropriate location-
> sensitive rule.
>
> Does this approach get you the desired locality preference?
This sounds right to me — I think maybe there's a misunderstanding
about how CRUSH works. What precisely are you after, James?
-Greg
* Re: Client Location
2012-10-09 16:48 ` Gregory Farnum
@ 2012-10-10 9:16 ` James Horner
2012-10-10 16:39 ` Sage Weil
0 siblings, 1 reply; 7+ messages in thread
From: James Horner @ 2012-10-10 9:16 UTC (permalink / raw)
To: Gregory Farnum; +Cc: James Horner, ceph-devel, Mark Kampe
Hi There
The basic setup I'm trying to get is a backend to a hypervisor cluster, so that auto-failover and live migration work. The main thing is that we have a number of datacenters with a gigabit interconnect that is not always 100% reliable. In the event of a failure we want all the virtual machines to fail over to the remaining datacenters, so we need all the data in each location.
The other issue is that within each datacenter we can use link aggregation to increase the bandwidth between hypervisors and the Ceph cluster, but between the datacenters we only have the gigabit link, so it becomes essential to have the hypervisors looking at the storage in the same datacenter.
Another consideration is that the virtual machines might get migrated between datacenters without any failure, and the main problem I see with what Mark suggests is that in this case the migrated VM would still be connecting to the OSDs in the remote datacenter.
To be honest, I'm fairly new to Ceph and I know I'm asking for everything and the kitchen sink! Any thoughts would be very helpful though.
Thanks
James
----- Original Message -----
From: "Gregory Farnum" <greg@inktank.com>
To: "Mark Kampe" <mark.kampe@inktank.com>
Cc: "James Horner" <james.horner@precedent.co.uk>, ceph-devel@vger.kernel.org
Sent: Tuesday, October 9, 2012 5:48:37 PM
Subject: Re: Client Location
On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe <mark.kampe@inktank.com> wrote:
> I'm not a real engineer, so please forgive me if I misunderstand,
> but can't you create a separate rule for each data center (choosing
> first a local copy, and then remote copies), which should ensure
> that the primary is always local. Each data center would then
> use a different pool, associated with the appropriate location-
> sensitive rule.
>
> Does this approach get you the desired locality preference?
This sounds right to me — I think maybe there's a misunderstanding
about how CRUSH works. What precisely are you after, James?
-Greg
* Re: Client Location
2012-10-10 9:16 ` James Horner
@ 2012-10-10 16:39 ` Sage Weil
0 siblings, 0 replies; 7+ messages in thread
From: Sage Weil @ 2012-10-10 16:39 UTC (permalink / raw)
To: James Horner; +Cc: Gregory Farnum, ceph-devel, Mark Kampe
On Wed, 10 Oct 2012, James Horner wrote:
> Hi There
>
> The basic setup Im trying to get is a backend to a Hypervisor cluster,
> so that auto-failover and live migration works. The mail thing is that
> we have a number of datacenters with a gigabit interconnect that is not
> always 100% reliable. In the event of a failure we want all the virtual
> machines to fail over to the remaining datacenters, so we need all the
> data in each location.
>
> The other issue is that within each datacenter we can use link
> aggregation to increase the bandwidth between hypervisors and the ceph
> cluster but between the datacenters we only have the gigabit so it
> become essential to have the hyperviors looking at the storage in the
> same datacenter.
Ceph replication is synchronous, so even if you are writing to a local
OSD, it will be updating the replica at the remote DC. The 1 Gbps link may
quickly become a bottleneck. This is a matter of having your cake and
eating it too... you can't seamlessly fail over to another DC if you don't
synchronously replicate to it.
> Another consideration is that the virtual machines might get migrated
> between datacenters without any failure, and the main problem I see with
> Mark suggests is that in this mode the migrated VM would still be
> connecting to the OSD's in the remote datacenter.
The new rbd cloning functionality can be used to 'migrate' an image by
cloning it to a different pool (the new local DC) and then later (in the
background, whenever) doing a 'flatten' to migrate the data from the
parent to the clone. Performance will be slower initially but improve
once the data is migrated.
This isn't a perfect solution for your use-case, but it would work.
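As a rough sketch of that workflow (pool and image names here are made up; clones require a protected snapshot of the parent):

```shell
rbd snap create dc2-pool/vm-disk@migrate    # snapshot the parent image
rbd snap protect dc2-pool/vm-disk@migrate   # clones can only be made from protected snaps
rbd clone dc2-pool/vm-disk@migrate dc1-pool/vm-disk   # clone into the new local DC's pool
# ... repoint the VM at dc1-pool/vm-disk ...
rbd flatten dc1-pool/vm-disk                # later, in the background: copy parent data into the clone
```

Until the flatten completes, reads of unmigrated blocks still go to the parent in the remote pool, which matches the "slower initially" behavior described above.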
sage
> Tbh Im fairly new to ceph and I know im asking for everything and the
> kitchen sink! Any thoughts would be very helpful though.
>
> Thanks
> James
>
> ----- Original Message -----
> From: "Gregory Farnum" <greg@inktank.com>
> To: "Mark Kampe" <mark.kampe@inktank.com>
> Cc: "James Horner" <james.horner@precedent.co.uk>, ceph-devel@vger.kernel.org
> Sent: Tuesday, October 9, 2012 5:48:37 PM
> Subject: Re: Client Location
>
> On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe <mark.kampe@inktank.com> wrote:
> > I'm not a real engineer, so please forgive me if I misunderstand,
> > but can't you create a separate rule for each data center (choosing
> > first a local copy, and then remote copies), which should ensure
> > that the primary is always local. Each data center would then
> > use a different pool, associated with the appropriate location-
> > sensitive rule.
> >
> > Does this approach get you the desired locality preference?
>
> This sounds right to me — I think maybe there's a misunderstanding
> about how CRUSH works. What precisely are you after, James?
> -Greg