* uneven placement
@ 2012-07-27 13:07 Yann Dupont
  2012-07-30 17:53 ` Tommi Virtanen
  0 siblings, 1 reply; 3+ messages in thread
From: Yann Dupont @ 2012-07-27 13:07 UTC (permalink / raw)
  To: ceph-devel

Hello.
I've been running ceph with great success for 3 weeks now (the key was
using xfs instead of btrfs on the OSD nodes).

I'm using it with rbd volumes for lots of things (backups, etc.). My setup
has already been detailed on the list, so I'll just summarize again:

My ceph cluster is made of 8 OSDs with quite big storage attached.
All OSD nodes are equal, except that 4 OSDs have 6.2 TB and 4 have 8 TB of storage.



Everything is really running well, except that placement seems suboptimal:

One OSD is now near_full (93%), two others are above 86%, while the others
are only around 50% full.

This morning I tried:

ceph osd reweight-by-utilization 110
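
(If I understand the documentation correctly, this reweights any OSD whose
utilization is more than 110% of the cluster average. With my numbers that
would be roughly:

    42865 GB used / 58128 GB total ~= 74% average utilization
    74% x 1.10 ~= 81% threshold

so only the OSDs currently above ~81% should have their weight reduced.)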


The rebalancing is still in progress:

ceph -s
    health HEALTH_WARN 83 pgs backfill; 83 pgs recovering; 86 pgs stuck 
unclean; recovery 623428/11876870 degraded (5.249%); 3 near full osd(s)
    monmap e1: 3 mons at 
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, 
election epoch 46, quorum 0,1,2 chichibu,glenesk,karuizawa
    osdmap e731: 8 osds: 8 up, 8 in
     pgmap v792090: 1728 pgs: 1641 active+clean, 3 active+remapped, 1 
active+clean+scrubbing, 83 active+recovering+remapped+backfill; 21334 GB 
data, 42865 GB used, 15262 GB / 58128 GB avail; 623428/11876870 degraded 
(5.249%)
    mdsmap e31: 1/1/1 up {0=glenesk=up:active}, 2 up:standby


But it seems to lead to worse behaviour (filling up the already near-full OSDs):

See the 8 OSDs:
/dev/mapper/xceph--chichibu-data
                       8,0T  5,4T  2,7T  68% /XCEPH-PROD/data
/dev/mapper/xceph--glenesk-data
                       6,2T  3,2T  3,1T  51% /XCEPH-PROD/data
/dev/mapper/xceph--karuizawa-data
                       8,0T  7,0T  1,1T  87% /XCEPH-PROD/data
/dev/mapper/xceph--hazelburn-data
                       6,2T  5,9T  373G  95% /XCEPH-PROD/data
/dev/mapper/xceph--carsebridge-data
                       8,0T  6,9T  1,2T  86% /XCEPH-PROD/data
/dev/mapper/xceph--cameronbridge-data
                       6,2T  5,1T  1,2T  83% /XCEPH-PROD/data
/dev/mapper/xceph--braeval-data
                       8,0T  4,6T  3,5T  57% /XCEPH-PROD/data
/dev/mapper/xceph--hanyu-data
                       6,2T  4,2T  2,1T  67% /XCEPH-PROD/data



Now the crush map: you'll notice that my 8 OSD nodes are placed in 3
datacenters, and that the hosts with 8 TB have a different weight than the
6.2 TB nodes.


# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host carsebridge {
     id -7        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.5 weight 1.000
}
host cameronbridge {
     id -8        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.6 weight 1.000
}
datacenter chantrerie {
     id -12        # do not change unnecessarily
     # weight 2.330
     alg straw
     hash 0    # rjenkins1
     item carsebridge weight 1.330
     item cameronbridge weight 1.000
}
host karuizawa {
     id -5        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.2 weight 1.000
}
host hazelburn {
     id -6        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.3 weight 1.000
}
datacenter loire {
     id -11        # do not change unnecessarily
     # weight 2.330
     alg straw
     hash 0    # rjenkins1
     item karuizawa weight 1.330
     item hazelburn weight 1.000
}
host chichibu {
     id -2        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.0 weight 1.000
}
host glenesk {
     id -4        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.1 weight 1.000
}
host braeval {
     id -9        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.7 weight 1.000
}
host hanyu {
     id -10        # do not change unnecessarily
     # weight 1.000
     alg straw
     hash 0    # rjenkins1
     item osd.8 weight 1.000
}
datacenter lombarderie {
     id -13        # do not change unnecessarily
     # weight 4.660
     alg straw
     hash 0    # rjenkins1
     item chichibu weight 1.330
     item glenesk weight 1.000
     item braeval weight 1.330
     item hanyu weight 1.000
}
pool default {
     id -1        # do not change unnecessarily
     # weight 8.000
     alg straw
     hash 0    # rjenkins1
     item chantrerie weight 2.000
     item loire weight 2.000
     item lombarderie weight 4.000
}
rack unknownrack {
     id -3        # do not change unnecessarily
     # weight 8.000
     alg straw
     hash 0    # rjenkins1
     item chichibu weight 1.000
     item glenesk weight 1.000
     item karuizawa weight 1.000
     item hazelburn weight 1.000
     item carsebridge weight 1.000
     item cameronbridge weight 1.000
     item braeval weight 1.000
     item hanyu weight 1.000
}

# rules
rule data {
     ruleset 0
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type datacenter
     step emit
}
rule metadata {
     ruleset 1
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type datacenter
     step emit
}
rule rbd {
     ruleset 2
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type datacenter
     step emit
}

# end crush map

There is probably something I'm doing wrong, but what?
(BTW, I'm running 0.49 right now; it doesn't change this problem.)

Any hints will be appreciated,
Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr



* Re: uneven placement
  2012-07-27 13:07 uneven placement Yann Dupont
@ 2012-07-30 17:53 ` Tommi Virtanen
  2012-08-01  7:55   ` Yann Dupont
  0 siblings, 1 reply; 3+ messages in thread
From: Tommi Virtanen @ 2012-07-30 17:53 UTC (permalink / raw)
  To: Yann Dupont; +Cc: ceph-devel

On Fri, Jul 27, 2012 at 6:07 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
> My ceph cluster is made of 8 OSDs with quite big storage attached.
> All OSD nodes are equal, except that 4 OSDs have 6.2 TB and 4 have 8 TB of storage.

Sounds like you should just set the weights yourself, based on the
capacities you listed here.

Even then, you only have 8 OSDs. The data placement is essentially
stochastic; you may not get perfect balance with a small cluster.
CRUSH evens out on larger clusters quite nicely, but there's still a
lot of statistical variation in the picture.
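
If you want to set them by hand, the usual route is to edit the crush map
directly, roughly like this (sketch only; the file names are arbitrary and
the weights should match your real capacities, e.g. 8.0 and 6.2):

  # fetch and decompile the current crush map
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # edit the item weights in crushmap.txt, then recompile and inject it
  crushtool -c crushmap.txt -o crushmap.new
  ceph osd setcrushmap -i crushmap.new

(If your version has it, "ceph osd crush reweight osd.0 8.0" and friends can
adjust a single item without the decompile/recompile round trip.)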


* Re: uneven placement
  2012-07-30 17:53 ` Tommi Virtanen
@ 2012-08-01  7:55   ` Yann Dupont
  0 siblings, 0 replies; 3+ messages in thread
From: Yann Dupont @ 2012-08-01  7:55 UTC (permalink / raw)
  To: Tommi Virtanen; +Cc: ceph-devel

On 30/07/2012 19:53, Tommi Virtanen wrote:

> On Fri, Jul 27, 2012 at 6:07 AM, Yann Dupont <Yann.Dupont@univ-nantes.fr> wrote:
>> My ceph cluster is made of 8 OSDs with quite big storage attached.
>> All OSD nodes are equal, except that 4 OSDs have 6.2 TB and 4 have 8 TB of storage.
> Sounds like you should just set the weights yourself, based on the
> capacities you listed here.

Hi Tommi.

In my previous crush map I was already more or less doing that; I thought
it was sufficient:

datacenter chantrerie {
     ...
     item carsebridge weight 1.330
     item cameronbridge weight 1.000
}
datacenter loire {
     ...
     item karuizawa weight 1.330
     item hazelburn weight 1.000
}

datacenter lombarderie {
     ...
     item chichibu weight 1.330
     item glenesk weight 1.000
     item braeval weight 1.330
     item hanyu weight 1.000
}

pool default {
     ...
     item chantrerie weight 2.000
     item loire weight 2.000
     item lombarderie weight 4.000
}

I've since been able to grow all my volumes a little more, giving 8.6 TB
for 4 nodes and 6.8 TB for the 4 others.
This time I've tried to be more precise; here is the crush map I'm now using:


# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host chichibu {
     id -2        # do not change unnecessarily
     # weight 8.600
     alg straw
     hash 0    # rjenkins1
     item osd.0 weight 8.600
}
host glenesk {
     id -4        # do not change unnecessarily
     # weight 6.800
     alg straw
     hash 0    # rjenkins1
     item osd.1 weight 6.800
}
host braeval {
     id -9        # do not change unnecessarily
     # weight 8.600
     alg straw
     hash 0    # rjenkins1
     item osd.7 weight 8.600
}
host hanyu {
     id -10        # do not change unnecessarily
     # weight 6.800
     alg straw
     hash 0    # rjenkins1
     item osd.8 weight 6.800
}
datacenter lombarderie {
     id -13        # do not change unnecessarily
     # weight 30.800
     alg straw
     hash 0    # rjenkins1
     item chichibu weight 8.600
     item glenesk weight 6.800
     item braeval weight 8.600
     item hanyu weight 6.800
}
host carsebridge {
     id -7        # do not change unnecessarily
     # weight 8.600
     alg straw
     hash 0    # rjenkins1
     item osd.5 weight 8.600
}
host cameronbridge {
     id -8        # do not change unnecessarily
     # weight 6.800
     alg straw
     hash 0    # rjenkins1
     item osd.6 weight 6.800
}
datacenter chantrerie {
     id -12        # do not change unnecessarily
     # weight 15.400
     alg straw
     hash 0    # rjenkins1
     item carsebridge weight 8.600
     item cameronbridge weight 6.800
}
host karuizawa {
     id -5        # do not change unnecessarily
     # weight 8.600
     alg straw
     hash 0    # rjenkins1
     item osd.2 weight 8.600
}
host hazelburn {
     id -6        # do not change unnecessarily
     # weight 6.800
     alg straw
     hash 0    # rjenkins1
     item osd.3 weight 6.800
}
datacenter loire {
     id -11        # do not change unnecessarily
     # weight 15.400
     alg straw
     hash 0    # rjenkins1
     item karuizawa weight 8.600
     item hazelburn weight 6.800
}
pool default {
     id -1        # do not change unnecessarily
     # weight 61.600
     alg straw
     hash 0    # rjenkins1
     item lombarderie weight 30.800
     item chantrerie weight 15.400
     item loire weight 15.400
}
rack unknownrack {
     id -3        # do not change unnecessarily
     # weight 8.000
     alg straw
     hash 0    # rjenkins1
     item chichibu weight 1.000
     item glenesk weight 1.000
     item karuizawa weight 1.000
     item hazelburn weight 1.000
     item carsebridge weight 1.000
     item cameronbridge weight 1.000
     item braeval weight 1.000
     item hanyu weight 1.000
}

# rules
rule data {
     ruleset 0
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type datacenter
     step emit
}
rule metadata {
     ruleset 1
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type datacenter
     step emit
}
rule rbd {
     ruleset 2
     type replicated
     min_size 1
     max_size 10
     step take default
     step chooseleaf firstn 0 type datacenter
     step emit
}

# end crush map



I suppose the individual OSD weight is effectively unused, since I only
have 1 OSD per host?
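
(If I want to double-check which weights are actually in effect, I suppose

    ceph osd tree

should print the whole hierarchy with the weights CRUSH is using.)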


It took several hours to rebalance the data; the result is, not
surprisingly, more or less the same:

/dev/mapper/xceph--chichibu-data
                       8,6T  5,3T  3,4T  61% /XCEPH-PROD/data
/dev/mapper/xceph--glenesk-data
                       6,8T  3,3T  3,6T  48% /XCEPH-PROD/data
/dev/mapper/xceph--braeval-data
                       8,6T  4,4T  4,3T  51% /XCEPH-PROD/data
/dev/mapper/xceph--hanyu-data
                       6,8T  4,3T  2,6T  63% /XCEPH-PROD/data
/dev/mapper/xceph--karuizawa-data
                       8,6T  6,7T  2,0T  78% /XCEPH-PROD/data
/dev/mapper/xceph--hazelburn-data
                       6,8T  6,0T  864G  88% /XCEPH-PROD/data
/dev/mapper/xceph--carsebridge-data
                       8,6T  6,9T  1,8T  81% /XCEPH-PROD/data
/dev/mapper/xceph--cameronbridge-data
                       6,8T  5,2T  1,6T  77% /XCEPH-PROD/data

In your previous message, did you mean I should manually tweak the weights
based on what I observe in these results?
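
(For example, something along the lines of

    ceph osd reweight 3 0.9

to push some data off osd.3 (hazelburn), currently the fullest? As far as I
understand, that is the same temporary 0-1 override that
reweight-by-utilization adjusts, separate from the crush weight.)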
> Even then, you only have 8 OSDs. The data placement is essentially
> stochastic; you may not get perfect balance with a small cluster.

OK, I understand. I suppose my situation is even worse because I use
datacenters, so placement by "firstn" only happens across the 3
datacenters, which gives:

17.3 TB used out of 30.8 TB (56%) in datacenter lombarderie;
12.7 TB used out of 15.4 TB (82%) in datacenter loire;
12.1 TB used out of 15.4 TB (78%) in datacenter chantrerie;

Which is not so bad.

> CRUSH evens out on larger clusters quite nicely, but there's still a
> lot of statistical variation in the picture.
I need to keep the notion of 3 datacenters; all my data must be replicated
in 2 distinct places (read: some kilometers apart).
So even if I artificially multiply the OSDs (by using lots of small LVM
volumes on my arrays I could reach 32 OSDs, for example), I'd probably get
better placement inside each datacenter, BUT I'd still only have 3
datacenters. Since the firstn choice only operates on those 3 items, it
will lead to a similar problem. Am I wrong?
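
(To illustrate what I mean: if I did split each node into many OSDs, I
imagine the rule would have to look roughly like the sketch below, assuming
2 replicas; it chooses the datacenters first, then one host inside each:

rule rbd {
     ruleset 2
     type replicated
     min_size 1
     max_size 10
     step take default
     step choose firstn 2 type datacenter
     step chooseleaf firstn 1 type host
     step emit
}

But the first step would still only ever pick among the same 3 datacenter
buckets, which is my worry.)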

Cheers,

-- 
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : Yann.Dupont@univ-nantes.fr


