* mgr balancer module
@ 2017-07-28 3:51 Sage Weil
2017-07-28 21:48 ` Douglas Fuller
2017-07-29 17:48 ` Spandan Kumar Sahu
0 siblings, 2 replies; 10+ messages in thread
From: Sage Weil @ 2017-07-28 3:51 UTC (permalink / raw)
To: ceph-devel
Hi all,
I've been working off and on on a mgr module 'balancer' that will
automatically optimize the pg distribution. The idea is that you'll
eventually be able to just turn it on and it will slowly and continuously
optimize the layout without having to think about it.
I got something basic implemented pretty quickly that wraps around the new
pg-upmap optimizer embedded in OSDMap.cc and osdmaptool. And I had
something that adjusts the compat weight-set (optimizing crush weights in a
backward-compatible way) that sort of kind of worked, but its problem was
that it worked against the actual cluster instead of a model of the
cluster, which meant it didn't always know whether a change it was making
was going to be a good one until it tried it (and moved a bunch of data
around). The conclusion from that was that the optimizer, regardless of
what method it was using (upmap, crush weights, osd weights), had to
operate against a model of the system so that it could check whether its
changes were good ones before making them.
I got enough of the OSDMap, OSDMap::Incremental, and CrushWrapper exposed
to mgr modules in python-land to allow this. Modules can get a handle for
the current osdmap, create an incremental and propose changes to it (osd
weights, upmap entries, crush weights), and apply it to get a new test
osdmap. And I have a preliminary eval function that will analyze the
distribution for a map (original or proposed) so that they can be
compared. In order to make sense of this and test it I made up a simple
interface to interact with it, but I want to run it by people to make sure
it makes sense.
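To make that concrete, the model-first loop can be sketched in miniature
(a toy illustration with made-up numbers, not the real mgr API; the real
module works on OSDMap/Incremental handles):

```python
# Toy illustration only: the essential loop is "copy the model, apply a
# candidate change, score both, and only act if the model improves".
import statistics

def score(pgs_per_osd):
    # Unitless imbalance measure: stddev of pg counts divided by the mean.
    counts = list(pgs_per_osd.values())
    return statistics.pstdev(counts) / statistics.mean(counts)

def propose_move(model, src, dst):
    # Return a new model with one pg moved from osd src to osd dst.
    trial = dict(model)
    trial[src] -= 1
    trial[dst] += 1
    return trial

current = {0: 40, 1: 30, 2: 26}       # made-up pg counts per osd
trial = propose_move(current, 0, 2)

if score(trial) < score(current):     # check against the model first
    print("apply")                    # real module: issue mon commands
```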
The basics:
ceph balancer mode <none,upmap,crush-compat,...>
- which optimization method to use
ceph balancer on
- run automagically
ceph balancer off
- stop running automagically
ceph balancer status
- see current mode, any plans, whether it's enabled
The useful bits:
ceph balancer eval
- show analysis of current data distribution
ceph balancer optimize <plan>
- create a new optimization plan named <plan> based on the current
mode
- ceph balancer status will include a list of plans in memory
(these currently go away if ceph-mgr daemon restarts)
ceph balancer eval <plan>
- analyse resulting distribution if plan is executed
ceph balancer show <plan>
- show what the plan would do (basically a dump of cli commands to
adjust weights etc)
ceph balancer execute <plan>
- execute plan (and then discard it)
ceph balancer rm <plan>
- discard plan
A normal user will be expected to just set the mode and turn it on:
ceph balancer mode crush-compat
ceph balancer on
An advanced user can play with different optimizer modes etc and see what
they will actually do before making any changes to their cluster.
Does this seem like a reasonable direction for an operator interface?
--
The other part of this exercise is to set up the infrastructure to do the
optimization "right". All of the current code floating around to reweight
by utilization etc is deficient when you do any non-trivial CRUSH things.
I'm trying to get the infrastructure in place from the get-go so that this
will work with multiple roots and device classes.
There will be some restrictions depending on the mode. Notably, the
crush-compat only has a single set of weights to adjust, so it can't do
much if there are multiple hierarchies being balanced that overlap over
any of the same devices (we should make the balancer refuse to continue in
that case).
Similarly, we can't do projections and what utilization will look like
with a proposed change when balancing based on actual osd utilization
(what each osd reports as its total usage). Instead, we need to model the
size of each pg so that we can tell how things change when we move pgs.
Initially this will use the pg stats, but that is an incomplete solution
because we don't properly account for omap data. There is also some
storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps,
per-pg metadata). I think eventually we'll probably want to build a model
around pg size based on what the stats say, what the osds report, and a
model for unknown variables (omap cost per pg, per-object overhead, etc).
Until then, we can just make do with the pg stats (should work reasonably
well as long as you're not mixing omap and non-omap pools on the same
devices but via different subtrees).
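A minimal sketch of the "model the size of each pg" idea (the pg names,
byte counts, and single-copy mapping below are all made up; real pgs map
to multiple OSDs):

```python
# Minimal sketch: project per-OSD usage from modeled pg sizes (pg stats),
# so the effect of moving a pg can be computed without touching the cluster.
from collections import defaultdict

pg_bytes = {'1.0': 500, '1.1': 700, '1.2': 300}   # modeled pg sizes
pg_to_osd = {'1.0': 0, '1.1': 0, '1.2': 1}        # simplified: one copy per pg

def projected_usage(mapping):
    usage = defaultdict(int)
    for pg, osd in mapping.items():
        usage[osd] += pg_bytes[pg]
    return dict(usage)

before = projected_usage(pg_to_osd)
after = projected_usage({**pg_to_osd, '1.1': 1})  # what if pg 1.1 moved?
print(before)   # {0: 1200, 1: 300}
print(after)    # {0: 500, 1: 1000}
```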
sage
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: mgr balancer module
2017-07-28 3:51 mgr balancer module Sage Weil
@ 2017-07-28 21:48 ` Douglas Fuller
2017-07-30 18:12 ` Sage Weil
2017-07-29 17:48 ` Spandan Kumar Sahu
1 sibling, 1 reply; 10+ messages in thread
From: Douglas Fuller @ 2017-07-28 21:48 UTC (permalink / raw)
To: Sage Weil, Ceph Development
> On Jul 27, 2017, at 11:51 PM, Sage Weil <sage@redhat.com> wrote:
>
> The basics:
>
> ceph balancer mode <none,upmap,crush-compat,...>
> - which optimization method to use
> ceph balancer on
> - run automagically
> ceph balancer off
> - stop running automagically
So this would leave the last CRUSH map installed by the balancer in place? Or would it restore the CRUSH map that was in place before running the balancer? If the former (which seems more reasonable, perhaps), is there a command to revert to the last user-applied map?
> Does this seem like a reasonable direction for an operator interface?
I think it makes sense.
Possibly unrelated: is there a way to adjust the rate of map changes manually or do we just expect the balancer to handle that automagically?
* Re: mgr balancer module
2017-07-28 3:51 mgr balancer module Sage Weil
2017-07-28 21:48 ` Douglas Fuller
@ 2017-07-29 17:48 ` Spandan Kumar Sahu
2017-07-30 18:16 ` Sage Weil
1 sibling, 1 reply; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-07-29 17:48 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
On Fri, Jul 28, 2017 at 9:21 AM, Sage Weil <sage@redhat.com> wrote:
> ceph balancer mode <none,upmap,crush-compat,...>
> - which optimization method to use
Regarding the implementation of 'do_osd_weight', can we move the
existing 'reweight-by-utilization' and 'reweight-by-pg' to MonCommands
from MgrCommands? And then we can simply send a command to "mon"? Or is
there a way to call something like "send_command(result, 'mgr', ...)"?
--
Spandan Kumar Sahu
IIT Kharagpur
* Re: mgr balancer module
2017-07-28 21:48 ` Douglas Fuller
@ 2017-07-30 18:12 ` Sage Weil
0 siblings, 0 replies; 10+ messages in thread
From: Sage Weil @ 2017-07-30 18:12 UTC (permalink / raw)
To: Douglas Fuller; +Cc: Ceph Development
On Fri, 28 Jul 2017, Douglas Fuller wrote:
>
> > On Jul 27, 2017, at 11:51 PM, Sage Weil <sage@redhat.com> wrote:
> >
> > The basics:
> >
> > ceph balancer mode <none,upmap,crush-compat,...>
> > - which optimization method to use
> > ceph balancer on
> > - run automagically
> > ceph balancer off
> > - stop running automagically
>
> So this would leave the last CRUSH map installed by the balancer in
> place? Or would it restore the CRUSH map that was in place before
> applying running the balancer? If the former (which seems more
> reasonable, perhaps), is there a command to revert to the last
> user-applied map?
I haven't thought about a revert. I don't think it will be very practical
because the balancer will make small adjustments to weights over time, but
mixed in with that may be various other changes due to cluster
expansion/contraction or admin changes or whatever.
It's probably worth mentioning that the balancer wouldn't actually
export/import crush maps. Instead, once it decides what weight
adjustments to make, it will just issue normal monitor commands to
initiate those changes.
> I think it makes sense.
>
> Possibly unrelated: is there a way to adjust the rate of map changes
> manually or do we just expect the balancer to handle that automagically?
There will be a single threshold, max_misplaced, that controls what
fraction of the PGs/objects can be misplaced at once. Default will be
something like 3%.
I think this is equivalent to what people are after when they do gradual
reweighting over time?
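The threshold check might look roughly like this (the function name is
hypothetical; only the ~3% default comes from the above):

```python
# Hypothetical sketch: gate each optimization step on the current fraction
# of misplaced objects staying under the max_misplaced threshold (~3%).
MAX_MISPLACED = 0.03

def may_take_step(misplaced, total):
    return total > 0 and (misplaced / total) < MAX_MISPLACED

print(may_take_step(100, 10000))   # 1% misplaced -> True, proceed
print(may_take_step(500, 10000))   # 5% misplaced -> False, wait
```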
sage
* Re: mgr balancer module
2017-07-29 17:48 ` Spandan Kumar Sahu
@ 2017-07-30 18:16 ` Sage Weil
2017-08-03 7:56 ` Spandan Kumar Sahu
0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-07-30 18:16 UTC (permalink / raw)
To: Spandan Kumar Sahu; +Cc: Ceph Development
On Sat, 29 Jul 2017, Spandan Kumar Sahu wrote:
> On Fri, Jul 28, 2017 at 9:21 AM, Sage Weil <sage@redhat.com> wrote:
> > ceph balancer mode <none,upmap,crush-compat,...>
> > - which optimization method to use
>
> Regarding the implementation of 'do_osd_weight', can we move the
> existing 'reweight-by-utilization' and 'reweight-by-pg' to MonCommands
> from MgrCommands? And then, we can simply send a command to "mon"? Or
> is there way to call something like "send_command(result, 'mgr',
> ''...)" ?
Yes and no... I think the inner loop doing the arithmetic can be copied,
but part of what I've done so far in balancer has built most (I think) of
the surrounding infrastructure so that we are reweighting the right osds
to match the right target distribution. The current reweight-by-* doesn't
understand multiple crush rules/roots (which are easy to create now with
the new device classes). It should be pretty easy to slot it in now...
sage
* Re: mgr balancer module
2017-07-30 18:16 ` Sage Weil
@ 2017-08-03 7:56 ` Spandan Kumar Sahu
2017-08-03 15:53 ` Sage Weil
0 siblings, 1 reply; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-08-03 7:56 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
Sage,
I think it would be a good idea to include a command in the balancer
module itself that would optimize the crushmap using python-crush and
set the optimized crushmap.
As far as I can tell, uneven distributions can mainly be attributed to
these factors:
* using an unoptimized crushmap
* unevenness that occurs due to the (pseudo) random nature of CRUSH
* objects having different sizes.
If we set an optimized crushmap at the very beginning, much less data
has to be moved later on to maintain a proper distribution. Hence the
case for including it in the balancer module. Please take a look at the
PR[1] I sent in this regard, and let me know if I am moving in the
right direction.
[1] : https://github.com/liewegas/ceph/pull/57
--
Regards
Spandan Kumar Sahu
* Re: mgr balancer module
2017-08-03 7:56 ` Spandan Kumar Sahu
@ 2017-08-03 15:53 ` Sage Weil
2017-08-03 18:35 ` Spandan Kumar Sahu
0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-08-03 15:53 UTC (permalink / raw)
To: Spandan Kumar Sahu; +Cc: Ceph Development
Hi Spandan,
On Thu, 3 Aug 2017, Spandan Kumar Sahu wrote:
> Sage
>
> I think it would be a good idea to include a command in the balancer
> module itself, that would optimize the crushmap using the
> python-crush, and set the optimized crushmap.
>
> As far as I believe, uneven distributions can be majorly attributed to
> the factors:
> * using an unoptimized crushmap
> * unevenness that occurs due to the (pseudo) random nature of CRUSH
> * objects having different sizes.
>
> If we set an optimized crushmap, at the very initial stages, we have
> to move very less data in the due course, in order to maintain a
> proper distribution. Hence the necessity of including it in the
> balancer module. Please give a look at the PR[1], I sent in this
> regard, and let me know if I am moving in the right direction.
There are a few problems with using python-crush, the main one being that
the dependencies are problematic: it's built from a forked repo and is not
packaged properly (has to be installed with pip). It also may not
match the CRUSH version being used by the cluster.
The larger issue though is that it doesn't address all of the other
problems I highlighted in my earlier email. The main thing it *does* do
properly is base the optimization on a model; this was the main
problem with the old reweight-by-utilization. The new framework in
balancer.py has all the pieces now to let you do that.
I think the main value in the python-crush optimize code is that it
demonstrably works, which means we know that the cost/score function being
used and the descent method work together. I think the best path forward
is to look at the core of what those two pieces are doing and port it into
the balancer environment. Most recently I've been working on the
'eval' method that will generate a score for a given distribution; I'm
working from first principles (just calculating the layout, its deviation
from the target, the standard deviation, etc.), but I'm not sure what
Loic's optimizer was doing. Also, my first attempt at a descent function
to correct weights was pretty broken, and I know a lot of experimentation
went into Loic's method.
Do you see any problems with that approach, or things that the
balancer framework does not cover?
Thanks!
sage
* Re: mgr balancer module
2017-08-03 15:53 ` Sage Weil
@ 2017-08-03 18:35 ` Spandan Kumar Sahu
2017-08-04 3:50 ` Sage Weil
0 siblings, 1 reply; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-08-03 18:35 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@redhat.com> wrote:
> Hi Spandan,
>
> On Thu, 3 Aug 2017, Spandan Kumar Sahu wrote:
>> Sage
>>
>> I think it would be a good idea to include a command in the balancer
>> module itself, that would optimize the crushmap using the
>> python-crush, and set the optimized crushmap.
>>
>> As far as I believe, uneven distributions can be majorly attributed to
>> the factors:
>> * using an unoptimized crushmap
>> * unevenness that occurs due to the (pseudo) random nature of CRUSH
>> * objects having different sizes.
>>
>> If we set an optimized crushmap, at the very initial stages, we have
>> to move very less data in the due course, in order to maintain a
>> proper distribution. Hence the necessity of including it in the
>> balancer module. Please give a look at the PR[1], I sent in this
>> regard, and let me know if I am moving in the right direction.
>
> There are a few problems with using python-crush, the main one being that
> the dependencies are problematic: it's built from a forked repo and is not
> packaged properly (has to be installed with pip). It also may not
> match the CRUSH version being used by the cluster.
>
I tried adding it to install-deps.sh, but I missed the fact that
python-crush will not get updated with changes in the crush directory
of the Ceph source.
> The larger issue though is that it doesn't address all of the other
> problems I highlighted in my earlier email. The main thing it *does* do
> properly is base the optimization on a model; this was the main
> problem with the old reweight-by-utilization. The new framework in
> balancer.py has all the pieces now to let you do that.
>
Okay, maybe then I will try to port only the logic of Loic's work and
see how that works.
> I think the main value in the python-crush optimize code is that it
> demonstrably works, which means we know that the cost/score function being
> used and the descent method work together. I think the best path forward
> is to look at the core of what those two pieces are doing and port it into
> the balancer environment. Most recently I've been working on the
> 'eval' method that will generate a score for a given distribution, but I'm
> working from first principles (just calculating the layout, its deviation
> from the target, the standard deviation, etc.) but I'm not sure what
I had done some work on assigning a score to the distribution in
this[1] PR. It was, however, done in the pre-existing
reweight-by-utilization. Would you take a look at it and let me know if
I should proceed to port it into the balancer module?
> Loic's optimizer was doing. Also, my first attempt at a descent function
> to correct weights was pretty broken, and I know a lot of experimentation
> went into Loic's method.
>
Loic's optimizer only fixed defects in the crushmap and was not, in the
true sense, a reweight-by-utilization.
In short, Loic's optimizer optimized a pool on the basis of a rule and
then ran a simulation to determine the new weights. Using the `take` in
rules, it determined a list of OSDs and moved weight (about 1% of the
overload%) from one OSD to another. This way, the weights of the
buckets at the next hierarchical level of the crush tree weren't
affected. I went through Loic's optimizer in detail and also added my
own improvements.
I will try to port the logic, but I am not sure where the optimizer
would fit in. Would it go in as a separate function in module.py, or
would it have different implementations for each of upmap, crush, and
crush-compat? Loic's python-crush didn't take upmaps into account, but
the logic will apply in the case of upmaps too.
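The weight-moving step described above might look roughly like this (a
simplified sketch with made-up numbers; the real optimizer works per
pool/rule and runs placement simulations):

```python
# Simplified sketch: shift a small fraction of crush weight from the most
# overfull OSD to the most underfull one, keeping the bucket total constant
# so parent buckets in the hierarchy are unaffected.
def rebalance_step(weights, utilization, fraction=0.01):
    over = max(utilization, key=utilization.get)    # most overfull osd
    under = min(utilization, key=utilization.get)   # most underfull osd
    delta = weights[over] * fraction
    adjusted = dict(weights)
    adjusted[over] -= delta
    adjusted[under] += delta
    return adjusted

weights = {0: 1.0, 1: 1.0, 2: 1.0}
utilization = {0: 0.80, 1: 0.60, 2: 0.70}
new_weights = rebalance_step(weights, utilization)
print(new_weights)
```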
> Do you see any problems with that approach, or things that the
> balancer framework does not cover?
>
I was hoping we would have an optimizer that fixes the faults in the
crushmap whenever a crushmap is set and/or a device gets added or
removed. The current balancer would also fix them, but it would take
much more time and much more data movement to achieve a better
distribution than if we had fixed the crushmap itself at the very
beginning. Nevertheless, the balancer module will eventually reach a
reasonably good distribution.
Correct me if I am wrong. :)
[1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559
> Thanks!
> sage
>
--
Spandan Kumar Sahu
IIT Kharagpur
* Re: mgr balancer module
2017-08-03 18:35 ` Spandan Kumar Sahu
@ 2017-08-04 3:50 ` Sage Weil
2017-08-07 8:55 ` Spandan Kumar Sahu
0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-08-04 3:50 UTC (permalink / raw)
To: Spandan Kumar Sahu; +Cc: Ceph Development
On Fri, 4 Aug 2017, Spandan Kumar Sahu wrote:
> On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@redhat.com> wrote:
> > I think the main value in the python-crush optimize code is that it
> > demonstrably works, which means we know that the cost/score function being
> > used and the descent method work together. I think the best path forward
> > is to look at the core of what those two pieces are doing and port it into
> > the balancer environment. Most recently I've been working on the
> > 'eval' method that will generate a score for a given distribution, but I'm
> > working from first principles (just calculating the layout, its deviation
> > from the target, the standard deviation, etc.) but I'm not sure what
>
> I had done some-work regarding assigning a score to the distribution
> at this[1] PR. It was however done in the pre-existing
> reweight-by-utilization. Would you give a look over it and let me
> know, if I should proceed to port it into the balancer module?
This seems reasonable... I'm not sure we can really tell what the best
function is without trying it in combination with some optimization
method, though.
I just pushed a semi-complete/working eval function in the wip-balancer
branch that uses a normalized standard deviation for pgs, objects, and
bytes. (Normalized meaning the standard deviation is divided by the
total count of pgs or objects or whatever so that it is unitless.) The
final score is just the average of those three values. Pretty sure that's
not the most sensible thing but it's a start. FWIW I can do
bin/init-ceph stop
MON=1 OSD=8 MDS=0 ../src/vstart.sh -d -n -x -l
bin/ceph osd pool create foo 64
bin/ceph osd set-require-min-compat-client luminous
bin/ceph balancer mode upmap
bin/rados -p foo bench 10 write -b 4096 --no-cleanup
bin/ceph balancer eval
bin/ceph balancer optimize foo
bin/ceph balancer eval foo
bin/ceph balancer execute foo
bin/ceph balancer eval
and the score goes from .02 to .001 (and pgs get balanced).
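The normalized-stddev scoring described above could be sketched roughly like this (an illustration only; the names and shapes here are my assumptions, not the actual wip-balancer code):

```python
import math

def normalized_stddev(counts):
    """Stddev of per-OSD counts divided by the total count, so the
    result is unitless and comparable across pgs/objects/bytes."""
    total = sum(counts)
    if total == 0:
        return 0.0
    mean = total / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return math.sqrt(var) / total

def eval_score(pgs, objects, bytes_):
    """Final score: the average of the three normalized deviations.
    0.0 means perfectly balanced; skew raises the score."""
    return (normalized_stddev(pgs) +
            normalized_stddev(objects) +
            normalized_stddev(bytes_)) / 3.0

# A perfectly even layout scores 0.0; any imbalance scores higher.
print(eval_score([8, 8, 8, 8], [100] * 4, [4096] * 4))  # -> 0.0
```

This also shows why the raw score values (like .02 vs .001 above) are small fractions rather than absolute counts.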
> > Loic's optimizer was doing. Also, my first attempt at a descent function
> > to correct weights was pretty broken, and I know a lot of experimentation
> > went into Loic's method.
> >
>
> Loic's optimizer only fixed defects in the crushmap, and was not (in
> the true sense) a reweight-by-utilization.
> In short, Loic's optimizer optimized a pool on the basis of a rule,
> and then ran a simulation to determine the new weights. Using the
> `take` in the rules, it determined a list of OSDs and moved weight
> (about 1% of the overload%) from one OSD to another. This way the
> weights of the buckets at the next hierarchical level of the
> crush-tree weren't affected. I went through Loic's optimizer in
> detail and also added my own improvements.
>
> I will try to port the logic, but I am not sure where the optimizer
> would fit in. Would it go in as a separate function in module.py, or
> would it have different implementations for each of upmaps, crush,
> crush-compat? Loic's python-crush didn't take upmaps into account. But
> the logic will apply in case of upmaps too.
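The weight-moving idea quoted above could be sketched like this (a hypothetical illustration of the scheme, not Loic's actual python-crush code; the dict shapes, the utilization metric, and the 1% step are my assumptions):

```python
def move_weight(weights, utilization, step=0.01):
    """Shift a small fraction of weight from the most overloaded OSD to
    the most underloaded one, keeping the bucket's total weight -- and
    hence the weight seen by the next hierarchical level -- unchanged.

    weights/utilization: dicts keyed by OSD id; step ~1% per iteration."""
    src = max(utilization, key=utilization.get)   # most overloaded OSD
    dst = min(utilization, key=utilization.get)   # most underloaded OSD
    delta = weights[src] * step
    new = dict(weights)
    new[src] -= delta
    new[dst] += delta
    # The bucket total is preserved, so parent buckets are unaffected.
    assert abs(sum(new.values()) - sum(weights.values())) < 1e-9
    return new
```

Iterating this against a simulated placement (rather than the live cluster) is what lets the optimizer check a change before committing it.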
The 'crush-compat' mode in balancer is the one to target. There is a
partial implementation there that needs to be updated to use the new
framework; I'll fiddle with it a bit more to make it use the new Plan
approach (currently it makes changes directly to the cluster,
which doesn't work well!). For now the latest is at
https://github.com/ceph/ceph/pull/16272
You can ignore the other modes (upmap etc) for now. Eventually we could
make it so that transitioning from one mode to another will somehow phase
out the old changes, but that's complicated and not needed yet.
> > Do you see any problems with that approach, or things that the
> > balancer framework does not cover?
> >
>
> I was hoping we would have an optimizer that fixes the faults in the
> crushmap whenever a crushmap is set and/or a new device gets added or
> deleted. The current balancer would also fix them, but it would take
> much more time and much more movement of data to achieve a better
> distribution than if we had fixed the crushmap itself at the very
> beginning. Nevertheless, the balancer module will eventually reach a
> reasonably good distribution.
>
> Correct me, if I am wrong. :)
No, I think you're right. I don't expect that people will be importing
crush maps that often, though... and if they do they are hopefully clever
enough to do their own thing. The goal is for everything to be manageable
via the CLI or (better yet) simply handled automatically by the system.
I think the main thing to worry about is the specific cases that people
are likely to encounter (and tend to complain about), like adding new
devices and wanting the system to weight them in gradually.
sage
>
> [1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559
>
> > Thanks!
> > sage
> >
>
>
>
> --
> Spandan Kumar Sahu
> IIT Kharagpur
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: mgr balancer module
2017-08-04 3:50 ` Sage Weil
@ 2017-08-07 8:55 ` Spandan Kumar Sahu
0 siblings, 0 replies; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-08-07 8:55 UTC (permalink / raw)
To: Sage Weil; +Cc: Ceph Development
On Fri, Aug 4, 2017 at 9:20 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 4 Aug 2017, Spandan Kumar Sahu wrote:
>> On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@redhat.com> wrote:
>> > I think the main value in the python-crush optimize code is that it
>> > demonstrably works, which means we know that the cost/score function being
>> > used and the descent method work together. I think the best path forward
>> > is to look at the core of what those two pieces are doing and port it into
>> > the balancer environment. Most recently I've been working on the
>> > 'eval' method that will generate a score for a given distribution, but I'm
>> > working from first principles (just calculating the layout, its deviation
>> > from the target, the standard deviation, etc.) but I'm not sure what
>>
>> I had done some work regarding assigning a score to the distribution
>> at this[1] PR. It was however done in the pre-existing
>> reweight-by-utilization. Would you take a look at it and let me
>> know if I should proceed to port it into the balancer module?
>
> This seems reasonable... I'm not sure we can really tell what the best
> function is without trying it in combination with some optimization
> method, though.
>
> I just pushed a semi-complete/working eval function in the wip-balancer
> branch that uses a normalized standard deviation for pgs, objects, and
> bytes. (Normalized meaning the standard deviation is divided by the
> total count of pgs or objects or whatever so that it is unitless.) The
> final score is just the average of those three values. Pretty sure that's
> not the most sensible thing but it's a start. FWIW I can do
>
> bin/init-ceph stop
> MON=1 OSD=8 MDS=0 ../src/vstart.sh -d -n -x -l
> bin/ceph osd pool create foo 64
> bin/ceph osd set-require-min-compat-client luminous
> bin/ceph balancer mode upmap
> bin/rados -p foo bench 10 write -b 4096 --no-cleanup
> bin/ceph balancer eval
> bin/ceph balancer optimize foo
> bin/ceph balancer eval foo
> bin/ceph balancer execute foo
> bin/ceph balancer eval
>
> and the score goes from .02 to .001 (and pgs get balanced).
>
I have sent a PR for a better scoring method at
https://github.com/liewegas/ceph/pull/59.
Standard deviation is unbounded, and it depends significantly on the
'key' ('pg' or 'objects' or 'bytes'). This is not the case with the
scoring method that I suggest. Plus, there are additional benefits
like:
1. The new method can distinguish between the [5 over-weighted + 1
heavily under-weighted] and [5 under-weighted + 1 heavily over-weighted]
cases. (Discussed in previous mails.)
2. It gives scores in a similar range for all keys.
3. It takes only the over-weighted devices into consideration.
The need for such a scoring algorithm is explained in the comments
and the commit message.
>> > Loic's optimizer was doing. Also, my first attempt at a descent function
>> > to correct weights was pretty broken, and I know a lot of experimentation
>> > went into Loic's method.
>> >
>>
>> Loic's optimizer only fixed defects in the crushmap, and was not (in
>> the true sense) a reweight-by-utilization.
>> In short, Loic's optimizer optimized a pool on the basis of a rule,
>> and then ran a simulation to determine the new weights. Using the
>> `take` in the rules, it determined a list of OSDs and moved weight
>> (about 1% of the overload%) from one OSD to another. This way the
>> weights of the buckets at the next hierarchical level of the
>> crush-tree weren't affected. I went through Loic's optimizer in
>> detail and also added my own improvements.
>>
>> I will try to port the logic, but I am not sure where the optimizer
>> would fit in. Would it go in as a separate function in module.py, or
>> would it have different implementations for each of upmaps, crush,
>> crush-compat? Loic's python-crush didn't take upmaps into account. But
>> the logic will apply in case of upmaps too.
>
> The 'crush-compat' mode in balancer is the one to target. There is a
> partial implementation there that needs to be updated to use the new
> framework; I'll fiddle with it a bit more to make it use the new Plan
> approach (currently it makes changes directly to the cluster,
> which doesn't work well!). For now the latest is at
>
> https://github.com/ceph/ceph/pull/16272
>
> You can ignore the other modes (upmap etc) for now. Eventually we could
> make it so that transitioning from one mode to another will somehow phase
> out the old changes, but that's complicated and not needed yet.
>
>> > Do you see any problems with that approach, or things that the
>> > balancer framework does not cover?
>> >
>>
>> I was hoping we would have an optimizer that fixes the faults in the
>> crushmap whenever a crushmap is set and/or a new device gets added or
>> deleted. The current balancer would also fix them, but it would take
>> much more time and much more movement of data to achieve a better
>> distribution than if we had fixed the crushmap itself at the very
>> beginning. Nevertheless, the balancer module will eventually reach a
>> reasonably good distribution.
>>
>> Correct me, if I am wrong. :)
>
> No, I think you're right. I don't expect that people will be importing
> crush maps that often, though... and if they do they are hopefully clever
> enough to do their own thing. The goal is for everything to be manageable
> via the CLI or (better yet) simply handled automatically by the system.
>
> I think the main thing to worry about is the specific cases that people
> are likely to encounter (and tend to complain about), like adding new
> devices and wanting the system to weight them in gradually.
>
> sage
>
>
>
>>
>> [1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559
>>
>> > Thanks!
>> > sage
>> >
>>
>>
>>
>> --
>> Spandan Kumar Sahu
>> IIT Kharagpur
--
Regards
Spandan Kumar Sahu
end of thread, other threads:[~2017-08-07 8:55 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-28 3:51 mgr balancer module Sage Weil
2017-07-28 21:48 ` Douglas Fuller
2017-07-30 18:12 ` Sage Weil
2017-07-29 17:48 ` Spandan Kumar Sahu
2017-07-30 18:16 ` Sage Weil
2017-08-03 7:56 ` Spandan Kumar Sahu
2017-08-03 15:53 ` Sage Weil
2017-08-03 18:35 ` Spandan Kumar Sahu
2017-08-04 3:50 ` Sage Weil
2017-08-07 8:55 ` Spandan Kumar Sahu