* mgr balancer module
@ 2017-07-28  3:51 Sage Weil
  2017-07-28 21:48 ` Douglas Fuller
  2017-07-29 17:48 ` Spandan Kumar Sahu
  0 siblings, 2 replies; 10+ messages in thread
From: Sage Weil @ 2017-07-28  3:51 UTC (permalink / raw)
  To: ceph-devel

Hi all,

I've been working off and on on a mgr module 'balancer' that will 
automatically optimize the pg distribution.  The idea is that you'll 
eventually be able to just turn it on and it will slowly and continuously 
optimize the layout without having to think about it.

I got something basic implemented pretty quickly that wraps around the new 
pg-upmap optimizer embedded in OSDMap.cc and osdmaptool.  And I had 
something that adjusts the compat weight-set (optimizing crush weights in a 
backward-compatible way) that sort of worked, but its problem was 
that it worked against the actual cluster instead of a model of the 
cluster, which meant it didn't always know whether a change it was making 
was going to be a good one until it tried it (and moved a bunch of data 
around).  The conclusion from that was that the optimizer, regardless of 
which method it was using (upmap, crush weights, osd weights), had to operate 
against a model of the system so that it could check whether its changes 
were good ones before making them.

I got enough of the OSDMap, OSDMap::Incremental, and CrushWrapper exposed 
to mgr modules in python-land to allow this.  Modules can get a handle for 
the current osdmap, create an incremental and propose changes to it (osd 
weights, upmap entries, crush weights), and apply it to get a new test 
osdmap.  And I have a preliminary eval function that will analyze the 
distribution for a map (original or proposed) so that they can be 
compared.  In order to make sense of this and test it I made up a simple 
interface to interact with it, but I want to run it by people to make sure 
it makes sense.
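
In rough pseudo-python, the flow I have in mind looks something like this 
(the exact method names are still in flux, so treat them all as 
placeholders):

  # Placeholder names; the real mgr interface is still being settled.
  osdmap = self.get_osdmap()                  # handle for the current map
  inc = osdmap.new_incremental()              # start a proposed change
  inc.set_osd_reweights({3: 0.95})            # e.g. nudge osd.3 down a bit
  test_map = osdmap.apply_incremental(inc)    # model of the proposed cluster
  before = self.calc_eval(osdmap)             # analyze current distribution
  after = self.calc_eval(test_map)            # analyze proposed distribution
  if after.score < before.score:              # lower score == more balanced
      self.execute(inc)                       # only then touch the real cluster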

The basics:

  ceph balancer mode <none,upmap,crush-compat,...>
	- which optimization method to use
  ceph balancer on
	- run automagically
  ceph balancer off
	- stop running automagically
  ceph balancer status
	- see current mode, any plans, whether it's enabled

The useful bits:

  ceph balancer eval
	- show analysis of current data distribution
  ceph balancer optimize <plan>
	- create a new optimization plan named <plan> based on the current 
	  mode
	- ceph balancer status will include a list of plans in memory 
          (these currently go away if the ceph-mgr daemon restarts)
  ceph balancer eval <plan>
	- analyse resulting distribution if plan is executed
  ceph balancer show <plan>
	- show what the plan would do (basically a dump of cli commands to 
	  adjust weights etc)
  ceph balancer execute <plan>
	- execute plan (and then discard it)
  ceph balancer rm <plan>
	- discard plan

A normal user will be expected to just set the mode and turn it on:

  ceph balancer mode crush-compat
  ceph balancer on

An advanced user can play with different optimizer modes etc and see what 
they will actually do before making any changes to their cluster.
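
For example, a cautious dry run with the commands above might look 
something like:

  ceph balancer mode upmap
  ceph balancer eval              # score the current distribution
  ceph balancer optimize myplan   # build a plan; changes nothing yet
  ceph balancer show myplan       # inspect the commands it would run
  ceph balancer eval myplan       # score the projected distribution
  ceph balancer execute myplan    # apply it only if you like the result
  ceph balancer rm myplan         # ...or just throw the plan away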

Does this seem like a reasonable direction for an operator interface?

--

The other part of this exercise is to set up the infrastructure to do the 
optimization "right".  All of the current code floating around to reweight 
by utilization etc is deficient when you do anything non-trivial with CRUSH.  
I'm trying to get the infrastructure in place from the get-go so that this 
will work with multiple roots and device classes.

There will be some restrictions depending on the mode.  Notably, 
crush-compat only has a single set of weights to adjust, so it can't do 
much if multiple hierarchies being balanced overlap on any of the same 
devices (we should make the balancer refuse to continue in that case).

Similarly, we can't project what utilization will look like under a 
proposed change when balancing based on actual osd utilization 
(what each osd reports as its total usage).  Instead, we need to model the 
size of each pg so that we can tell how things change when we move pgs.  
Initially this will use the pg stats, but that is an incomplete solution 
because we don't properly account for omap data.  There is also some 
storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps, 
per-pg metadata).  I think eventually we'll probably want to build a model 
of pg size based on what the stats say, what the osds report, and a 
model for unknown variables (omap cost per pg, per-object overhead, etc.).  
Until then, we can just make do with the pg stats (which should work 
reasonably well as long as you're not mixing omap and non-omap pools on the 
same devices via different subtrees).
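
To illustrate the sort of model I mean, here is a deliberately simplified 
sketch (the real thing would fold in omap and per-OSD overhead):

  # Simplified sketch: project per-OSD bytes from per-PG stats for a
  # (possibly proposed) pg->osd mapping, ignoring omap and OSD overhead.
  def projected_osd_bytes(pg_bytes, pg_up):
      # pg_bytes: {pgid: bytes used}; pg_up: {pgid: [osd ids]}
      usage = {}
      for pgid, osds in pg_up.items():
          for osd in osds:
              usage[osd] = usage.get(osd, 0) + pg_bytes.get(pgid, 0)
      return usage  # divide by each OSD's capacity to get projected utilization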

sage


* Re: mgr balancer module
  2017-07-28  3:51 mgr balancer module Sage Weil
@ 2017-07-28 21:48 ` Douglas Fuller
  2017-07-30 18:12   ` Sage Weil
  2017-07-29 17:48 ` Spandan Kumar Sahu
  1 sibling, 1 reply; 10+ messages in thread
From: Douglas Fuller @ 2017-07-28 21:48 UTC (permalink / raw)
  To: Sage Weil, Ceph Development


> On Jul 27, 2017, at 11:51 PM, Sage Weil <sage@redhat.com> wrote:
> 
> The basics:
> 
>  ceph balancer mode <none,upmap,crush-compat,...>
> 	- which optimiation method to use
>  ceph balancer on
> 	- run automagically
>  ceph balancer off
> 	- stop running automagically

So this would leave the last CRUSH map installed by the balancer in place? Or would it restore the CRUSH map that was in place before running the balancer? If the former (which seems more reasonable, perhaps), is there a command to revert to the last user-applied map?
 
>  ceph balancer status
> 	- see curent mode, any plans, whehter it's enabled
> 
> The useful bits:
> 
>  ceph balancer eval
> 	- show analysis of current data distribution
>  ceph balancer optimize <plan>
> 	- create a new plan to optimize named <plan> based on the current 
> 	  mode
> 	- ceph balancer status will include a list of plans in memory 
>          (these currently go away if ceph-mgr daemon restarts)
>  ceph balancer eval <plan>
> 	- analyse resulting distribution if plan is executed
>  ceph balancer show <plan>
> 	- show what the plan would do (basically a dump of cli commands to 
> 	  adjust weights etc)
>  ceph balancer execute <plan>
> 	- execute plan (and then discard it)
>  ceph balancer rm <plan>
> 	- discard plan
> 
> A normal user will be expected to just set the mode and turn it on:
> 
>  ceph balancer mode crush-compat
>  ceph balancer on
> 
> An advanced user can play with different optimizer modes etc and see what 
> they will actually do before making any changes to their cluster.
> 
> Does this seem like a reasonable direction for an operator interface?

I think it makes sense.

Possibly unrelated: is there a way to adjust the rate of map changes manually or do we just expect the balancer to handle that automagically?



* Re: mgr balancer module
  2017-07-28  3:51 mgr balancer module Sage Weil
  2017-07-28 21:48 ` Douglas Fuller
@ 2017-07-29 17:48 ` Spandan Kumar Sahu
  2017-07-30 18:16   ` Sage Weil
  1 sibling, 1 reply; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-07-29 17:48 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

On Fri, Jul 28, 2017 at 9:21 AM, Sage Weil <sage@redhat.com> wrote:
> Hi all,
>
> I've been working off and on on a mgr module 'balancer' that will do
> automatically optimization of the pg distribution.  The idea is you'll
> eventually be able to just turn it on and it will slowly and continuously
> optimize the layout without having to think about it.
>
> I got something basic implemented pretty quickly that wraps around the new
> pg-upmap optimizer embedded in OSDMap.cc and osdmaptool.  And I had
> something that adjust the compat weight-set (optimizing crush weights in a
> backward-compatible way) that sort of kind of worked, but its problem was
> that it worked against the actual cluster instead of a model of the
> cluster, which meant it didn't always know whether a change it was making
> was going to be a good one until it tried it (and moved a bunch of data
> round). The conclusion from that was that the optmizer, regardless of what
> method it was using (upmap, crush weights, osd weights) had to operate
> against a model of the system so that it could check whether its changes
> were good ones before making them.
>
> I got enough of the OSDMap, OSDMap::Incremental, and CrushWrapper exposed
> to mgr modules in python-land to allow this.  Modules can get a handle for
> the current osdmap, create an incremental and propose changes to it (osd
> weights, upmap entries, crush weights), and apply it to get a new test
> osdmap.  And I have a preliminary eval vfunction that will analyze the
> distribution for a map (original or proposed) so that they can be
> compared.  In order to make sense of this and test it I made up a simple
> interface to interact with it, but I want to run it by people to make sure
> it makes sense.
>
> The basics:
>
>   ceph balancer mode <none,upmap,crush-compat,...>
>         - which optimiation method to use

Regarding the implementation of 'do_osd_weight', can we move the
existing 'reweight-by-utilization' and 'reweight-by-pg' from
MgrCommands to MonCommands?  Then we could simply send a command to
"mon".  Or is there a way to call something like
"send_command(result, 'mgr', '...')"?
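
(To be concrete, I mean something like the following from inside the
module -- assuming I have the CommandResult helper from mgr_module right:)

  # inside a MgrModule method; CommandResult/send_command usage is my
  # reading of the current mgr_module interface
  import json
  from mgr_module import CommandResult

  result = CommandResult('')
  self.send_command(result, 'mon', '', json.dumps({
      'prefix': 'osd reweight',
      'id': 3,
      'weight': 0.95,
  }), '')
  r, outb, outs = result.wait()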

>   ceph balancer on
>         - run automagically
>   ceph balancer off
>         - stop running automagically
>   ceph balancer status
>         - see curent mode, any plans, whehter it's enabled
>
> The useful bits:
>
>   ceph balancer eval
>         - show analysis of current data distribution
>   ceph balancer optimize <plan>
>         - create a new plan to optimize named <plan> based on the current
>           mode
>         - ceph balancer status will include a list of plans in memory
>           (these currently go away if ceph-mgr daemon restarts)
>   ceph balancer eval <plan>
>         - analyse resulting distribution if plan is executed
>   ceph balancer show <plan>
>         - show what the plan would do (basically a dump of cli commands to
>           adjust weights etc)
>   ceph balancer execute <plan>
>         - execute plan (and then discard it)
>   ceph balancer rm <plan>
>         - discard plan
>
> A normal user will be expected to just set the mode and turn it on:
>
>   ceph balancer mode crush-compat
>   ceph balancer on
>
> An advanced user can play with different optimizer modes etc and see what
> they will actually do before making any changes to their cluster.
>
> Does this seem like a reasonable direction for an operator interface?
>
> --
>
> The other part of this exercise is to set up the infrastructure to do the
> optimization "right".  All of the current code floating around to reweight
> by utilization etc is deficient when you do any non-trivial CRUSH things.
> I'm trying to get the infrastructure in place from the get-go so that this
> will work with multiple roots and device classes.
>
> There will be some restrictions depending on the mode.  Notably, the
> crush-compat only has a single set of weights to adjust, so it can't do
> much if there are multiple hierarchies being balanced that overlap over
> any of the same devices (we should make the balancer refuse to continue in
> that case).
>
> Similarly, we can't do projections and what utilization will look like
> with a proposed change when balancing based on actual osd utilization
> (what each osd reports as its total usage).  Instead, we need to model the
> size of each pg so that we can tell how things change when we move pgs.
> Initially this will use the pg stats, but that is an incomplete solution
> because we don't properly account for omap data.  There is also some
> storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps,
> per-pg metatata).  I think eventually we'll probably want to build a model
> around pg size based on what the stats say, what the osds report, and a
> model for unknown variables (omap cost per pg, per-object overhead, etc).
> Until then, we can just make do with the pg stats (should work reasonable
> well as long as you're not mixing omap and non-omap pools on the same
> devices but via different subtrees).
>
> sage



-- 
Spandan Kumar Sahu
IIT Kharagpur


* Re: mgr balancer module
  2017-07-28 21:48 ` Douglas Fuller
@ 2017-07-30 18:12   ` Sage Weil
  0 siblings, 0 replies; 10+ messages in thread
From: Sage Weil @ 2017-07-30 18:12 UTC (permalink / raw)
  To: Douglas Fuller; +Cc: Ceph Development

On Fri, 28 Jul 2017, Douglas Fuller wrote:
> 
> > On Jul 27, 2017, at 11:51 PM, Sage Weil <sage@redhat.com> wrote:
> > 
> > The basics:
> > 
> >  ceph balancer mode <none,upmap,crush-compat,...>
> > 	- which optimiation method to use
> >  ceph balancer on
> > 	- run automagically
> >  ceph balancer off
> > 	- stop running automagically
> 
> So this would leave the last CRUSH map installed by the balancer in 
> place? Or would it restore the CRUSH map that was in place before 
> applying running the balancer? If the former (which seems more 
> reasonable, perhaps), is there a command to revert to the last 
> user-applied map?

I haven't thought about a revert.  I don't think it will be very practical 
because the balancer will make small adjustments to weights over time, but 
mixed in with that may be various other changes due to cluster 
expansion/contraction or admin changes or whatever.

It's probably worth mentioning that the balancer wouldn't actually 
export/import crush maps.  Instead, once it decides what weight 
adjustments to make, it will just issue normal monitor commands to 
initiate those changes.
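
For example, depending on the mode, that means ordinary commands along 
the lines of:

  ceph osd crush weight-set reweight-compat osd.3 1.05   # crush-compat
  ceph osd reweight 3 0.95                               # osd weights
  ceph osd pg-upmap-items 1.7 3 5                        # upmap: remap pg 1.7 from osd.3 to osd.5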
  
> >  ceph balancer status
> > 	- see curent mode, any plans, whehter it's enabled
> > 
> > The useful bits:
> > 
> >  ceph balancer eval
> > 	- show analysis of current data distribution
> >  ceph balancer optimize <plan>
> > 	- create a new plan to optimize named <plan> based on the current 
> > 	  mode
> > 	- ceph balancer status will include a list of plans in memory 
> >          (these currently go away if ceph-mgr daemon restarts)
> >  ceph balancer eval <plan>
> > 	- analyse resulting distribution if plan is executed
> >  ceph balancer show <plan>
> > 	- show what the plan would do (basically a dump of cli commands to 
> > 	  adjust weights etc)
> >  ceph balancer execute <plan>
> > 	- execute plan (and then discard it)
> >  ceph balancer rm <plan>
> > 	- discard plan
> > 
> > A normal user will be expected to just set the mode and turn it on:
> > 
> >  ceph balancer mode crush-compat
> >  ceph balancer on
> > 
> > An advanced user can play with different optimizer modes etc and see what 
> > they will actually do before making any changes to their cluster.
> > 
> > Does this seem like a reasonable direction for an operator interface?
> 
> I think it makes sense.
> 
> Possibly unrelated: is there a way to adjust the rate of map changes 
> manually or do we just expect the balancer to handle that automagically?

There will be a single threshold, max_misplaced, that controls what 
fraction of the PGs/objects can be misplaced at once.  Default will be 
something like 3%.
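
Roughly, the idea (a sketch only, not the final code) is:

  # Sketch of the throttle: only push the next batch of adjustments when
  # the currently misplaced fraction is below the threshold.
  max_misplaced = 0.03          # default, ~3%

  def maybe_step(misplaced, total, plan):
      if total and float(misplaced) / total >= max_misplaced:
          return                      # let the cluster catch up first
      apply_next_adjustments(plan)    # placeholder for issuing the next mon commands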

I think this is equivalent to what people are after when they do gradual 
reweighting over time?

sage


* Re: mgr balancer module
  2017-07-29 17:48 ` Spandan Kumar Sahu
@ 2017-07-30 18:16   ` Sage Weil
  2017-08-03  7:56     ` Spandan Kumar Sahu
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-07-30 18:16 UTC (permalink / raw)
  To: Spandan Kumar Sahu; +Cc: Ceph Development

On Sat, 29 Jul 2017, Spandan Kumar Sahu wrote:
> On Fri, Jul 28, 2017 at 9:21 AM, Sage Weil <sage@redhat.com> wrote:
> > Hi all,
> >
> > I've been working off and on on a mgr module 'balancer' that will do
> > automatically optimization of the pg distribution.  The idea is you'll
> > eventually be able to just turn it on and it will slowly and continuously
> > optimize the layout without having to think about it.
> >
> > I got something basic implemented pretty quickly that wraps around the new
> > pg-upmap optimizer embedded in OSDMap.cc and osdmaptool.  And I had
> > something that adjust the compat weight-set (optimizing crush weights in a
> > backward-compatible way) that sort of kind of worked, but its problem was
> > that it worked against the actual cluster instead of a model of the
> > cluster, which meant it didn't always know whether a change it was making
> > was going to be a good one until it tried it (and moved a bunch of data
> > round). The conclusion from that was that the optmizer, regardless of what
> > method it was using (upmap, crush weights, osd weights) had to operate
> > against a model of the system so that it could check whether its changes
> > were good ones before making them.
> >
> > I got enough of the OSDMap, OSDMap::Incremental, and CrushWrapper exposed
> > to mgr modules in python-land to allow this.  Modules can get a handle for
> > the current osdmap, create an incremental and propose changes to it (osd
> > weights, upmap entries, crush weights), and apply it to get a new test
> > osdmap.  And I have a preliminary eval vfunction that will analyze the
> > distribution for a map (original or proposed) so that they can be
> > compared.  In order to make sense of this and test it I made up a simple
> > interface to interact with it, but I want to run it by people to make sure
> > it makes sense.
> >
> > The basics:
> >
> >   ceph balancer mode <none,upmap,crush-compat,...>
> >         - which optimiation method to use
> 
> Regarding the implementation of 'do_osd_weight', can we move the
> existing 'reweight-by-utilization' and 'reweight-by-pg' to MonCommands
> from MgrCommands? And then, we can simply send a command to "mon"? Or
> is there way to call something like "send_command(result, 'mgr',
> ''...)" ?

Yes and no... I think the inner loop doing the arithmetic can be copied, 
but part of what I've done so far in balancer has built most (I think) of 
the surrounding infrastructure so that we are reweighting the right osds 
to match the right target distribution.  The current reweight-by-* doesn't 
understand multiple crush rules/roots (which are easy to create now with 
the new device classes).  It should be pretty easy to slot it in now...

sage

 > 
> >   ceph balancer on
> >         - run automagically
> >   ceph balancer off
> >         - stop running automagically
> >   ceph balancer status
> >         - see curent mode, any plans, whehter it's enabled
> >
> > The useful bits:
> >
> >   ceph balancer eval
> >         - show analysis of current data distribution
> >   ceph balancer optimize <plan>
> >         - create a new plan to optimize named <plan> based on the current
> >           mode
> >         - ceph balancer status will include a list of plans in memory
> >           (these currently go away if ceph-mgr daemon restarts)
> >   ceph balancer eval <plan>
> >         - analyse resulting distribution if plan is executed
> >   ceph balancer show <plan>
> >         - show what the plan would do (basically a dump of cli commands to
> >           adjust weights etc)
> >   ceph balancer execute <plan>
> >         - execute plan (and then discard it)
> >   ceph balancer rm <plan>
> >         - discard plan
> >
> > A normal user will be expected to just set the mode and turn it on:
> >
> >   ceph balancer mode crush-compat
> >   ceph balancer on
> >
> > An advanced user can play with different optimizer modes etc and see what
> > they will actually do before making any changes to their cluster.
> >
> > Does this seem like a reasonable direction for an operator interface?
> >
> > --
> >
> > The other part of this exercise is to set up the infrastructure to do the
> > optimization "right".  All of the current code floating around to reweight
> > by utilization etc is deficient when you do any non-trivial CRUSH things.
> > I'm trying to get the infrastructure in place from the get-go so that this
> > will work with multiple roots and device classes.
> >
> > There will be some restrictions depending on the mode.  Notably, the
> > crush-compat only has a single set of weights to adjust, so it can't do
> > much if there are multiple hierarchies being balanced that overlap over
> > any of the same devices (we should make the balancer refuse to continue in
> > that case).
> >
> > Similarly, we can't do projections and what utilization will look like
> > with a proposed change when balancing based on actual osd utilization
> > (what each osd reports as its total usage).  Instead, we need to model the
> > size of each pg so that we can tell how things change when we move pgs.
> > Initially this will use the pg stats, but that is an incomplete solution
> > because we don't properly account for omap data.  There is also some
> > storage overhead in the OSD itself (e.g., bluestore metadata, osdmaps,
> > per-pg metatata).  I think eventually we'll probably want to build a model
> > around pg size based on what the stats say, what the osds report, and a
> > model for unknown variables (omap cost per pg, per-object overhead, etc).
> > Until then, we can just make do with the pg stats (should work reasonable
> > well as long as you're not mixing omap and non-omap pools on the same
> > devices but via different subtrees).
> >
> > sage
> 
> 
> 
> -- 
> Spandan Kumar Sahu
> IIT Kharagpur
> 
> 


* Re: mgr balancer module
  2017-07-30 18:16   ` Sage Weil
@ 2017-08-03  7:56     ` Spandan Kumar Sahu
  2017-08-03 15:53       ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-08-03  7:56 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Sage

I think it would be a good idea to include a command in the balancer
module itself that would optimize the crushmap using python-crush and
set the optimized crushmap.

As far as I can tell, uneven distributions can mainly be attributed to
these factors:
* using an unoptimized crushmap
* unevenness that occurs due to the (pseudo)random nature of CRUSH
* objects having different sizes.

If we set an optimized crushmap at the very beginning, we have to move
much less data later on in order to maintain a proper distribution.
Hence the value of including it in the balancer module.  Please take a
look at the PR[1] I sent in this regard and let me know if I am moving
in the right direction.

[1] : https://github.com/liewegas/ceph/pull/57

-- 
Regards
Spandan Kumar Sahu


* Re: mgr balancer module
  2017-08-03  7:56     ` Spandan Kumar Sahu
@ 2017-08-03 15:53       ` Sage Weil
  2017-08-03 18:35         ` Spandan Kumar Sahu
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-08-03 15:53 UTC (permalink / raw)
  To: Spandan Kumar Sahu; +Cc: Ceph Development

Hi Spandan,

On Thu, 3 Aug 2017, Spandan Kumar Sahu wrote:
> Sage
> 
> I think it would be a good idea to include a command in the balancer
> module itself, that would optimize the crushmap using the
> python-crush, and set the optimized crushmap.
> 
> As far as I believe, uneven distributions can be majorly attributed to
> the factors:
> * using an unoptimized crushmap
> * unevenness that occurs due to the (pseudo) random nature of CRUSH
> * objects having different sizes.
> 
> If we set an optimized crushmap, at the very initial stages, we have
> to move very less data in the due course, in order to maintain a
> proper distribution. Hence the necessity of including it in the
> balancer module. Please give a look at the PR[1], I sent in this
> regard, and let me know if I am moving in the right direction.

There are a few problems with using python-crush, the main one being that 
the dependencies are problematic: it's built from a forked repo and is not 
packaged properly (has to be installed with pip).  It also may not 
match the CRUSH version being used by the cluster.

The larger issue though is that it doesn't address all of the other 
problems I highlighted in my earlier email.  The main thing it *does* do 
properly is base the optimization on a model; this was the main 
problem with the old reweight-by-utilization.  The new framework in 
balancer.py now has all the pieces to let you do that.

I think the main value in the python-crush optimize code is that it 
demonstrably works, which means we know that the cost/score function being 
used and the descent method work together.  I think the best path forward 
is to look at the core of what those two pieces are doing and port it into 
the balancer environment.  Most recently I've been working on the 
'eval' method that will generate a score for a given distribution; I'm 
working from first principles (just calculating the layout, its deviation 
from the target, the standard deviation, etc.), and I'm not sure what 
Loic's optimizer was doing.  Also, my first attempt at a descent function 
to correct weights was pretty broken, and I know a lot of experimentation 
went into Loic's method.

Do you see any problems with that approach, or things that the 
balancer framework does not cover?

Thanks!
sage



* Re: mgr balancer module
  2017-08-03 15:53       ` Sage Weil
@ 2017-08-03 18:35         ` Spandan Kumar Sahu
  2017-08-04  3:50           ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-08-03 18:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@redhat.com> wrote:
> Hi Spandan,
>
> On Thu, 3 Aug 2017, Spandan Kumar Sahu wrote:
>> Sage
>>
>> I think it would be a good idea to include a command in the balancer
>> module itself, that would optimize the crushmap using the
>> python-crush, and set the optimized crushmap.
>>
>> As far as I believe, uneven distributions can be majorly attributed to
>> the factors:
>> * using an unoptimized crushmap
>> * unevenness that occurs due to the (pseudo) random nature of CRUSH
>> * objects having different sizes.
>>
>> If we set an optimized crushmap, at the very initial stages, we have
>> to move very less data in the due course, in order to maintain a
>> proper distribution. Hence the necessity of including it in the
>> balancer module. Please give a look at the PR[1], I sent in this
>> regard, and let me know if I am moving in the right direction.
>
> There are a few problems with using python-crush, the main one being that
> the dependencies are problematic: it's built from a forked repo and is not
> packaged properly (has to be installed with pip).  It also may not
> match the CRUSH version being used by the cluster.
>
I tried adding it to install-deps.sh, but I missed the fact that
python-crush will not get updated with changes in the crush directory
of the Ceph source.

> The larger issue though is that it doesn't address all of the other
> problems I highlighted in my earlier email.  The main thing it *does* to
> properly is it does the optimization based on a model; this was the main
> problem with the old reweight-by-utilization.  The new framework in
> balancer.py has all the pieces now to let you do that.
>

Okay, maybe then I will try to port only the logic of Loic's work and
see how that works.

> I think the main value in the python-crush optimize code is that it
> demonstrably works, which means we know that the cost/score fuction being
> used and the descent method work together.  I think the best path forward
> is to look at the core of what those two pieces are doing and port it into
> the balancer environment.  Most recently I've been working on the
> 'eval' method that will generate a score for a given distribution, but I'm
> working from first principles (just calculating the layout, its deviation
> from the target, the standard deviation, etc.) but I'm not sure what

I had done some work on assigning a score to the distribution in
this[1] PR.  It was, however, done in the pre-existing
reweight-by-utilization.  Would you take a look at it and let me know
if I should proceed to port it into the balancer module?

> Loic's optimizer was doing.  Also, my first attempt at a descent function
> to correct weights was pretty broken, and I know a lot of experimentation
> went into Loic's method.
>

Loic's optimizer only fixed defects in the crushmap and was not (in
the true sense) a reweight-by-utilization.  In short, it optimized a
pool on the basis of a rule and then ran a simulation to determine the
new weights.  Using the `take` in the rules, it determined a list of
OSDs and moved weight (about 1% of the overload%) from one OSD to
another.  This way, the weights of the buckets at the next hierarchical
level of the crush tree weren't affected.  I went through Loic's
optimizer in detail and also added my own improvements.
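
Schematically (my own paraphrase of the idea, not Loic's actual code):

  # Move a small slice of weight from the most overloaded OSD to the
  # least loaded one within a single 'take' set; the total stays the
  # same, so the parent bucket weights are untouched.
  def shift_weight(weights, utilization, step=0.01):
      # weights / utilization: {osd_id: value} for the OSDs under one 'take'
      worst = max(utilization, key=utilization.get)
      best = min(utilization, key=utilization.get)
      delta = weights[worst] * step
      weights[worst] -= delta
      weights[best] += delta
      return weights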

I will try to port the logic, but I am not sure where the optimizer
would fit in.  Would it go in as a separate function in module.py, or
would it have different implementations for each of upmap, crush, and
crush-compat?  Loic's python-crush didn't take upmaps into account,
but the logic will apply in the case of upmaps too.

> Do you see any problems with that approach, or things that the
> balancer framework does not cover?
>

I was hoping that we would have an optimizer that fixes the faults in
the crushmap whenever a crushmap is set and/or a new device gets added
or deleted.  The current balancer would also fix it, but it would take
much more time, and much more data movement, to achieve a better
distribution than if we had fixed the crushmap itself at the very
beginning.  Nevertheless, the balancer module will eventually reach a
reasonably good distribution.

Correct me if I am wrong. :)

[1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559

> Thanks!
> sage
>



-- 
Spandan Kumar Sahu
IIT Kharagpur


* Re: mgr balancer module
  2017-08-03 18:35         ` Spandan Kumar Sahu
@ 2017-08-04  3:50           ` Sage Weil
  2017-08-07  8:55             ` Spandan Kumar Sahu
  0 siblings, 1 reply; 10+ messages in thread
From: Sage Weil @ 2017-08-04  3:50 UTC (permalink / raw)
  To: Spandan Kumar Sahu; +Cc: Ceph Development

On Fri, 4 Aug 2017, Spandan Kumar Sahu wrote:
> On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@redhat.com> wrote:
> > I think the main value in the python-crush optimize code is that it
> > demonstrably works, which means we know that the cost/score fuction being
> > used and the descent method work together.  I think the best path forward
> > is to look at the core of what those two pieces are doing and port it into
> > the balancer environment.  Most recently I've been working on the
> > 'eval' method that will generate a score for a given distribution, but I'm
> > working from first principles (just calculating the layout, its deviation
> > from the target, the standard deviation, etc.) but I'm not sure what
> 
> I had done some-work regarding assigning a score to the distribution
> at this[1] PR. It was however done in the pre-existing
> reweight-by-utilization. Would you give a look over it and let me
> know, if I should proceed to port it into the balancer module?

This seems reasonable...  I'm not sure we can really tell what the best 
function is without trying it in combination with some optimization 
method, though.

I just pushed a semi-complete/working eval function in the wip-balancer 
branch that uses a normalized standard deviation for pgs, objects, and 
bytes.  (Normalized meaning the standard deviation is divided by the 
total count of pgs or objects or whatever so that it is unitless.)  The 
final score is just the average of those three values.  Pretty sure that's 
not the most sensible thing but it's a start.  FWIW I can do

 bin/init-ceph stop
 MON=1 OSD=8 MDS=0 ../src/vstart.sh -d -n -x -l 
 bin/ceph osd pool create foo 64 
 bin/ceph osd set-require-min-compat-client luminous
 bin/ceph balancer mode upmap
 bin/rados -p foo bench 10 write -b 4096 --no-cleanup
 bin/ceph balancer eval
 bin/ceph balancer optimize foo
 bin/ceph balancer eval foo
 bin/ceph balancer execute foo
 bin/ceph balancer eval 

and the score goes from .02 to .001 (and pgs get balanced).
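
In sketch form, the scoring is roughly (not the exact code):

  # Rough sketch of the eval score described above: normalized standard
  # deviation per key, averaged over the keys.
  def normalized_stddev(per_osd):
      vals = list(per_osd.values())
      total = float(sum(vals)) or 1.0
      mean = total / len(vals)
      var = sum((v - mean) ** 2 for v in vals) / len(vals)
      return (var ** 0.5) / total     # divide by total count -> unitless

  def score(counts):
      # counts: {'pg': {osd: n}, 'objects': {osd: n}, 'bytes': {osd: n}}
      return sum(normalized_stddev(d) for d in counts.values()) / len(counts)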

> > Loic's optimizer was doing.  Also, my first attempt at a descent function
> > to correct weights was pretty broken, and I know a lot of experimentation
> > went into Loic's method.
> >
> 
> Loic's optimizer only fixed defects in the crushmap, and was not (in
> the true sense) a reweight-by-utilization.
> In short, Loic's optimizer was optimizing a pool, on the basis of a
> rule, and then, ran a simulation to determine the new weights. Using
> the `take` in rules, it used to determine a list of OSDs, and move
> weights (about 1% of the overload%) from one OSD to another. This way,
> the weights of the buckets on the next hierarchical level in
> crush-tree wasn't affected. I went through the Loic's optimizer in
> details and also added my own improvisations.
> 
> I will try to port the logic, but I am not sure, where would I fit the
> optimizer in? Would that go in as a separate function in module.py or
> would it have different implementations for each of upmaps, crush,
> crush-compat? Loic's python-crush didn't take upmaps into account. But
> the logic will apply in case of upmaps too.

The 'crush-compat' mode in balancer is the one to target.  There is a 
partial implementation there that needs to be updated to use the new 
framework; I'll fiddle with it a bit more to make it use the new Plan 
approach (currently it makes changes directly to the cluster, 
which doesn't work well!).  For now the latest is at

	https://github.com/ceph/ceph/pull/16272

You can ignore the other modes (upmap etc) for now.  Eventually we could 
make it so that transitioning from one mode to another will somehow phase 
out the old changes, but that's complicated and not needed yet.

> > Do you see any problems with that approach, or things that the
> > balancer framework does not cover?
> >
> 
> I was hoping that we have an optimizer that fixes the faults in
> crushmap, whenever, a crushmap is set and/or a new device get added or
> deleted. The current balancer would also fix it, but it would take
> much more time, and much more movement of data to achieve better
> distribution, compared to if we had fixed the crushmap itself, in the
> very beginning. Nevertheless, the balancer module, will eventually
> reach a reasonably good distribution.
> 
> Correct me, if I am wrong. :)

No, I think you're right.  I don't expect that people will be importing 
crush maps that often, though... and if they do they are hopefully clever 
enough to do their own thing.  The goal is for everything to be manageable 
via the CLI or (better yet) simply handled automatically by the system.

I think the main thing to worry about is the specific cases that people 
are likely to encounter (and tend to complain about), like adding new 
devices and wanting the system to weight them in gradually.

sage



> 
> [1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559
> 
> > Thanks!
> > sage
> >
> 
> 
> 
> -- 
> Spandan Kumar Sahu
> IIT Kharagpur
> 
> 


* Re: mgr balancer module
  2017-08-04  3:50           ` Sage Weil
@ 2017-08-07  8:55             ` Spandan Kumar Sahu
  0 siblings, 0 replies; 10+ messages in thread
From: Spandan Kumar Sahu @ 2017-08-07  8:55 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

On Fri, Aug 4, 2017 at 9:20 AM, Sage Weil <sweil@redhat.com> wrote:
> On Fri, 4 Aug 2017, Spandan Kumar Sahu wrote:
>> On Thu, Aug 3, 2017 at 9:23 PM, Sage Weil <sweil@redhat.com> wrote:
>> > I think the main value in the python-crush optimize code is that it
>> > demonstrably works, which means we know that the cost/score fuction being
>> > used and the descent method work together.  I think the best path forward
>> > is to look at the core of what those two pieces are doing and port it into
>> > the balancer environment.  Most recently I've been working on the
>> > 'eval' method that will generate a score for a given distribution, but I'm
>> > working from first principles (just calculating the layout, its deviation
>> > from the target, the standard deviation, etc.) but I'm not sure what
>>
>> I had done some-work regarding assigning a score to the distribution
>> at this[1] PR. It was however done in the pre-existing
>> reweight-by-utilization. Would you give a look over it and let me
>> know, if I should proceed to port it into the balancer module?
>
> This seems reasonable...  I'm not sure we can really tell what the best
> function is without trying it in combination with some optimization
> method, though.
>
> I just pushed a semi-complete/working eval function in the wip-balancer
> branch that uses a normalized standard deviation for pgs, objects, and
> bytes.  (Normalized meaning the standard deviation is divided by the
> total count of pgs or objects or whatever so that it is unitless.)  The
> final score is just the average of those three values.  Pretty sure that's
> not the most sensible thing but its a start.  FWIW I can do
>
>  bin/init-ceph stop
>  MON=1 OSD=8 MDS=0 ../src/vstart.sh -d -n -x -l
>  bin/ceph osd pool create foo 64
>  bin/ceph osd set-require-min-compat-client luminous
>  bin/ceph balancer mode upmap
>  bin/rados -p foo bench 10 write -b 4096 --no-cleanup
>  bin/ceph balancer eval
>  bin/ceph balancer optimize foo
>  bin/ceph balancer eval foo
>  bin/ceph balancer execute foo
>  bin/ceph balancer eval
>
> and the score goes from .02 to .001 (and pgs get balanced).
>

I have sent a PR for a better scoring method at
https://github.com/liewegas/ceph/pull/59.

Standard deviation is unbounded, and it depends significantly on the
'key' ('pg' or 'objects' or 'bytes').  This is not the case with the
scoring method I suggest.  There are also additional benefits:
1. The new method can distinguish between [5 overweighted + 1 heavily
underweighted] and [5 underweighted + 1 heavily overweighted].
(Discussed in previous mails.)
2. It gives scores in a similar range for all keys.
3. It takes only the over-weighted devices into consideration.

The need for such a scoring algorithm is explained in the comments and
the commit message.
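
(For illustration only -- this is not the exact formula in the PR, just
an example of a score with those properties:)

  # Example of a bounded, overload-only score: the fraction of the total
  # that sits above each device's fair share.  Always in [0, 1) and
  # unitless, so 'pg', 'objects' and 'bytes' land in a similar range.
  def overload_score(actual, expected):
      # actual / expected: {osd_id: count} for one key
      total = float(sum(actual.values())) or 1.0
      excess = sum(max(0, actual[o] - expected.get(o, 0)) for o in actual)
      return excess / total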

>> > Loic's optimizer was doing.  Also, my first attempt at a descent function
>> > to correct weights was pretty broken, and I know a lot of experimentation
>> > went into Loic's method.
>> >
>>
>> Loic's optimizer only fixed defects in the crushmap, and was not (in
>> the true sense) a reweight-by-utilization.
>> In short, Loic's optimizer was optimizing a pool, on the basis of a
>> rule, and then, ran a simulation to determine the new weights. Using
>> the `take` in rules, it used to determine a list of OSDs, and move
>> weights (about 1% of the overload%) from one OSD to another. This way,
>> the weights of the buckets on the next hierarchical level in
>> crush-tree wasn't affected. I went through the Loic's optimizer in
>> details and also added my own improvisations.
>>
>> I will try to port the logic, but I am not sure, where would I fit the
>> optimizer in? Would that go in as a separate function in module.py or
>> would it have different implementations for each of upmaps, crush,
>> crush-compat? Loic's python-crush didn't take upmaps into account. But
>> the logic will apply in case of upmaps too.
>
> The 'crush-compat' mode in balancer is the one to target.  There is a
> partial implementation there that needs to be updated to use the new
> framework; I'll fiddle with it a bit more to make it use the new Plan
> approach (currently it makes changes to the changes to the cluster,
> which doesn't work well!).  For now the latest is at
>
>         https://github.com/ceph/ceph/pull/16272
>
> You can ignore the other modes (upmap etc) for now.  Eventually we could
> make it so that transitioning from one mode to another will somehow phase
> out the old changes, but that's complicated and not needed yet.
>
>> > Do you see any problems with that approach, or things that the
>> > balancer framework does not cover?
>> >
>>
>> I was hoping that we have an optimizer that fixes the faults in
>> crushmap, whenever, a crushmap is set and/or a new device get added or
>> deleted. The current balancer would also fix it, but it would take
>> much more time, and much more movement of data to achieve better
>> distribution, compared to if we had fixed the crushmap itself, in the
>> very beginning. Nevertheless, the balancer module, will eventually
>> reach a reasonably good distribution.
>>
>> Correct me, if I am wrong. :)
>
> No, I think you're right.  I don't expect that people will be importing
> crush maps that often, though... and if they do they are hopefully clever
> enough to do their own thing.  The goal is for everything to be manageable
> via the CLI or (better yet) simply handled automatically by the system.
>
> I think the main thing to worry about is the specific cases that people
> are likely to encounter (and tend ot complain about), like adding new
> devices and wanting the system to weight them in gradually.
>
> sage
>
>
>
>>
>> [1]: https://github.com/ceph/ceph/pull/16361/files#diff-ecab4c883be988760d61a8a883ddc23fR4559
>>
>> > Thanks!
>> > sage
>> >
>>
>>
>>
>> --
>> Spandan Kumar Sahu
>> IIT Kharagpur
>>
>>



-- 
Regards
Spandan Kumar Sahu


end of thread, newest message: 2017-08-07  8:55 UTC

Thread overview: 10+ messages
2017-07-28  3:51 mgr balancer module Sage Weil
2017-07-28 21:48 ` Douglas Fuller
2017-07-30 18:12   ` Sage Weil
2017-07-29 17:48 ` Spandan Kumar Sahu
2017-07-30 18:16   ` Sage Weil
2017-08-03  7:56     ` Spandan Kumar Sahu
2017-08-03 15:53       ` Sage Weil
2017-08-03 18:35         ` Spandan Kumar Sahu
2017-08-04  3:50           ` Sage Weil
2017-08-07  8:55             ` Spandan Kumar Sahu
