* crush optimization targets
@ 2017-05-10 13:00 Sage Weil
  2017-05-10 16:46 ` Loic Dachary
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Sage Weil @ 2017-05-10 13:00 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Peter Maloney, Ceph Development

This is slightly off topic, but I want to throw out one more thing into 
this discussion: we ultimately (with all of these methods) want to address 
CRUSH rules that only target a subset of the overall hierarchy.  I tried 
to do this in the pg-upmap improvements PR (which, incidentally, Loic, 
could use a review :) at

	https://github.com/ceph/ceph/pull/14902

In this commit

	https://github.com/ceph/ceph/pull/14902/commits/a9ba66c46e76b5ef8e5184a5100a37598a7e7695

it uses the get_rule_weight_osd_map() method, which returns a weighted map 
of how much "weight" a rule is trying to store on each of the OSDs that 
are potentially targeted by the CRUSH rule.  This helper is currently 
used by the 'df' code when trying to calculate the MAX AVAIL value and 
is not quite perfect (it doesn't factor in complex crush rules with 
multiple 'take' ops, for one) but for basic rules it works fine.
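
To make that concrete, here is a rough sketch of the idea in Python (the 
names and the tree format here are made up for illustration; the actual 
helper is more involved than this):

    # Hypothetical sketch: walk the subtree a simple single-'take' rule
    # targets and split the weight down to the leaves.

    def rule_weight_osd_map(tree, take_root):
        """Return {osd_id: fraction of the rule's data expected on that OSD}.

        tree maps a bucket name to a list of (child, weight) pairs;
        leaves are OSD ids (ints)."""
        weights = {}

        def walk(node, share):
            children = tree.get(node)
            if children is None:                 # leaf: an OSD
                weights[node] = weights.get(node, 0.0) + share
                return
            total = sum(w for _, w in children)
            for child, w in children:
                walk(child, share * w / total)

        walk(take_root, 1.0)
        return weights

    # Example: a root with two hosts of unequal weight.
    tree = {
        "default": [("host-a", 3.0), ("host-b", 1.0)],
        "host-a": [(0, 1.5), (1, 1.5)],
        "host-b": [(2, 1.0)],
    }
    print(rule_weight_osd_map(tree, "default"))
    # {0: 0.375, 1: 0.375, 2: 0.25}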

Anyway, that upmap code will take the set of pools you're balancing, look 
at how much they collectively *should* be putting on the target OSDs, and 
optimize against that (as opposed to the raw device CRUSH weight).
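
The optimization step itself can be pictured as a greedy loop over expected 
vs actual placement; this is only a toy version, not the actual balancer 
code:

    # Toy greedy pass: compare the PGs each OSD holds for the pools being
    # balanced against the share it *should* hold, and move PGs from the
    # most overfull OSD to the most underfull one.

    def propose_upmaps(expected_share, pgs_by_osd, max_moves=10):
        """expected_share: {osd: fraction of all mappings it should hold}
        pgs_by_osd: {osd: set of pg ids}, one (possibly empty) entry per OSD."""
        total = sum(len(pgs) for pgs in pgs_by_osd.values())
        moves = []
        for _ in range(max_moves):
            deviation = {osd: len(pgs_by_osd[osd]) - share * total
                         for osd, share in expected_share.items()}
            src = max(deviation, key=deviation.get)      # most overfull
            dst = min(deviation, key=deviation.get)      # most underfull
            if deviation[src] < 1.0 or deviation[dst] > -1.0:
                break                                    # within one PG of target
            candidates = pgs_by_osd[src] - pgs_by_osd[dst]   # don't double-map a PG
            if not candidates:
                break
            pg = candidates.pop()
            pgs_by_osd[src].remove(pg)
            pgs_by_osd[dst].add(pg)
            moves.append((pg, src, dst))                 # would become a pg-upmap item
        return moves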

I *think* this is the simplest way to approach this (at least currently), 
although it is not in fact perfect.  We basically assume that the crush 
rules the admin has set up "make sense."  For example, if you have two 
crush rules that target a subset of the hierarchy, but they are 
overlapping (e.g., one is a subset of the other), then the subset that is 
covered by both will get utilized by the sum of the two and have a higher 
utilization--and the optimizer will not care (in fact, it will expect it).
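
In other words, the target the optimizer sees is just the per-OSD sum of the 
rules' weight maps (toy numbers; this assumes the two rules' pools store 
comparable amounts of data):

    rule_a = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # targets the whole root
    rule_b = {2: 0.50, 3: 0.50}                     # targets only a sub-bucket

    combined = {}
    for weight_map in (rule_a, rule_b):
        for osd, share in weight_map.items():
            combined[osd] = combined.get(osd, 0.0) + share
    print(combined)   # {0: 0.25, 1: 0.25, 2: 0.75, 3: 0.75} -- osds 2 and 3 run hotter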

That rules out at least one potential use-case, though: say you have a 
pool and rule defined that target a single host.  Depending on how much 
you store in that pool, those devices will be that much more utilized.  
One could imagine wanting Ceph to automatically monitor that pool's 
utilization (directly or indirectly) and push other pools' data out of 
those devices as the host-local pool fills.  I don't really like this 
scenario, though, so I can't tell if it is a "valid" one we should care 
about.

In any case, my hope is that at the end of the day we have a suite of 
optimization mechanisms: crush weights via the new choose_args, pg-upmap, 
and (if we don't deprecate it entirely) the osd reweight; and pg-based or 
osd utilization-based optimization (expected usage vs actual usage, or 
however you want to put it).  Ideally, they could use a common setup 
framework that handles the calculation of the expected/optimal 
targets we're optimizing against (using something like the above) so that 
it isn't reinvented/reimplemented (or, more likely, not!) for each one.
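
A sketch of what that shared setup step could look like (every name below is 
invented):

    # Hypothetical shared setup: every optimizer consumes the same
    # expected-vs-actual target map and only differs in how it nudges the
    # cluster toward it.

    def optimization_target(pools, expected_by_osd, actual_by_osd):
        """Bundle what we're optimizing against for a set of pools."""
        return {"pools": pools,
                "expected": expected_by_osd,    # {osd: share it should hold}
                "actual": actual_by_osd}        # {osd: share it does hold}

    def deviations(target):
        """Positive = overfull, negative = underfull, in units of 'share'."""
        return {osd: target["actual"].get(osd, 0.0) - share
                for osd, share in target["expected"].items()}

    # A choose_args optimizer would turn the deviations into weight
    # adjustments, an upmap optimizer into pg-upmap entries, an osd-reweight
    # optimizer into reweight values -- all reusing the same calculation.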

Is it possible?
sage


* Re: crush optimization targets
  2017-05-10 13:00 crush optimization targets Sage Weil
@ 2017-05-10 16:46 ` Loic Dachary
  2017-05-10 17:04 ` Loic Dachary
  2017-05-11  5:52 ` Xavier Villaneau
  2 siblings, 0 replies; 5+ messages in thread
From: Loic Dachary @ 2017-05-10 16:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Hi Sage,

On 05/10/2017 03:00 PM, Sage Weil wrote:
> This is slightly off topic, but I want to throw out one more thing into 
> this discussion: we ultimately (with all of these methods) want to address 
> CRUSH rules that only target a subset of the overall hierarchy.  I tried 
> to do this in the pg-upmap improvements PR (which, incidentally, Loic, 
> could use a review :) at
> 
> 	https://github.com/ceph/ceph/pull/14902
> 
> In this commit
> 
> 	https://github.com/ceph/ceph/pull/14902/commits/a9ba66c46e76b5ef8e5184a5100a37598a7e7695
> 
> it uses the get_rule_weight_osd_map() method, which returns a weighted map 
> of how much "weight" a rule is trying to store on each of the OSDs that 
> are potentially targeted by the CRUSH rule.  This helper is currently 
> used by the 'df' code when trying to calculate the MAX AVAIL value and 
> is not quite perfect (it doesn't factor in complex crush rules with 
> multiple 'take' ops, for one) but for basic rules it works fine.
> 
> Anyway, that upmap code will take the set of pools you're balancing, look 
> at how much they collectively *should* be putting on the target OSDs, and 
> optimize against that (as opposed to the raw device CRUSH weight).
> 
> I *think* this is the simplest way to approach this (at least currently), 
> although it is not in fact perfect.  We basically assume that the crush 
> rules the admin has set up "make sense."  For example, if you have two 
> crush rules that target a subset of the hierarchy, but they are 
> overlapping (e.g., one is a subset of the other), then the subset that is 
> covered by both will get utilized by the sum of the two and have a higher 
> utilization--and the optimizer will not care (in fact, it will expect it).
> 
> That rules out at least one potential use-case, though: say you have a 
> pool and rule defined that target a single host.  Depending on how much 
> you store in that pool, those devices will be that much more utilized.  
> One could imagine wanting Ceph to automatically monitor that pool's 
> utilization (directly or indirectly) and push other pools' data out of 
> those devices as the host-local pool fills.  I don't really like this 
> scenario, though, so I can't tell if it is a "valid" one we should care 
> about.
> 
> In any case, my hope is that at the end of the day we have a suite of 
> optimization mechanisms: crush weights via the new choose_args, pg-upmap, 
> and (if we don't deprecate it entirely) the osd reweight; and pg-based or 
> osd utilization-based optimization (expected usage vs actual usage, or 
> however you want to put it).  Ideally, they could use a common setup 
> framework that handles the calculation of the expected/optimal 
> targets we're optimizing against (using something like the above) so that 
> it isn't reinvented/reimplemented (or, more likely, not!) for each one.
> 
> Is it possible?

I'm afraid I'm not able to usefully comment on this because I've not thought about ways to address the multi-pool / varying PG size problems. I'll think about it.

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush optimization targets
  2017-05-10 13:00 crush optimization targets Sage Weil
  2017-05-10 16:46 ` Loic Dachary
@ 2017-05-10 17:04 ` Loic Dachary
  2017-05-11  5:52 ` Xavier Villaneau
  2 siblings, 0 replies; 5+ messages in thread
From: Loic Dachary @ 2017-05-10 17:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development



On 05/10/2017 03:00 PM, Sage Weil wrote:
> This is slightly off topic, but I want to throw out one more thing into 
> this discussion: we ultimately (with all of these methods) want to address 
> CRUSH rules that only target a subset of the overall hierarchy.  I tried 
> to do this in the pg-upmap improvements PR (which, incidentally, Loic, 
> could use a review :) at
> 
> 	https://github.com/ceph/ceph/pull/14902
> 
> In this commit
> 
> 	https://github.com/ceph/ceph/pull/14902/commits/a9ba66c46e76b5ef8e5184a5100a37598a7e7695
> 
> it uses the get_rule_weight_osd_map() method, which returns a weighted map 
> of how much "weight" a rule is trying to store on each of the OSDs that 
> are potentially targeted by the CRUSH rule.  This helper is currently 
> used by the 'df' code when trying to calculate the MAX AVAIL value and 
> is not quite perfect (it doesn't factor in complex crush rules with 
> multiple 'take' ops, for one) but for basic rules it works fine.

A few weeks ago I was very confused by what df shows. get_rule_weight_osd_map assumes an even distribution, but we now know that is usually not the case because there are not enough PGs. When one pool uses a given OSD more than it should and another pool uses the same OSD less than it should, the sum is a variance that only makes sense if all pools occupy the same amount of space, which is rarely the case. It is not uncommon to see df report an OSD as under-used (variance < 1) while the actual usage of the OSD shows it is filled X% more than it should be.
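
A toy example of the kind of mismatch I mean (made-up numbers, not how df 
actually computes anything): two pools with very different bytes-per-PG can 
look balanced by PG count while the OSD is well over its fair share in bytes.

    # Two pools spread 4 PGs each over 4 OSDs, so ideally 1 PG per pool per
    # OSD, but the pools occupy very different amounts of space per PG.
    bytes_per_pg = {"big": 25, "small": 1}          # GB per PG (toy numbers)

    # On one OSD, pool "big" landed 2 PGs instead of 1 and pool "small"
    # landed 0 instead of 1: counting PGs the OSD looks exactly on target ...
    actual_pgs   = {"big": 2, "small": 0}
    expected_pgs = {"big": 1, "small": 1}
    print(sum(actual_pgs.values()) == sum(expected_pgs.values()))   # True

    # ... but in bytes it holds 50 GB instead of 26 GB, ~92% over its share.
    actual_bytes   = sum(actual_pgs[p] * bytes_per_pg[p] for p in bytes_per_pg)
    expected_bytes = sum(expected_pgs[p] * bytes_per_pg[p] for p in bytes_per_pg)
    print(round(actual_bytes / expected_bytes, 2))                  # 1.92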

I did not think to file a bug at the time (maybe there already is one), but reading your mail reminded me, so... here it is ;-)

Cheers

> 
> Anyway, that upmap code will take the set of pools you're balancing, look 
> at how much they collectively *should* be putting on the target OSDs, and 
> optimize against that (as opposed to the raw device CRUSH weight).
> 
> I *think* this is the simplest way to approach this (at least currently), 
> although it is not in fact perfect.  We basically assume that the crush 
> rules the admin has set up "make sense."  For example, if you have two 
> crush rules that target a subset of the hierarchy, but they are 
> overlapping (e.g., one is a subset of the other), then the subset that is 
> covered by both will get utilized by the sum of the two and have a higher 
> utilization--and the optimizer will not care (in fact, it will expect it).
> 
> That rules out at least one potential use-case, though: say you have a 
> pool and rule defined that target a single host.  Depending on how much 
> you store in that pool, those devices will be that much more utilized.  
> One could imagine wanting Ceph to automatically monitor that pool's 
> utilization (directly or indirectly) and push other pools' data out of 
> those devices as the host-local pool fills.  I don't really like this 
> scenario, though, so I can't tell if it is a "valid" one we should care 
> about.
> 
> In any case, my hope is that at the end of the day we have a suite of 
> optimization mechanisms: crush weights via the new choose_args, pg-upmap, 
> and (if we don't deprecate it entirely) the osd reweight; and pg-based or 
> osd utilization-based optimization (expected usage vs actual usage, or 
> however you want to put it).  Ideally, they could use a common setup 
> framework that handles the calculation of the expected/optimal 
> targets we're optimizing against (using something like the above) so that 
> it isn't reinvented/reimplemented (or, more likely, not!) for each one.
> 
> Is it possible?
> sage
> 

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush optimization targets
  2017-05-10 13:00 crush optimization targets Sage Weil
  2017-05-10 16:46 ` Loic Dachary
  2017-05-10 17:04 ` Loic Dachary
@ 2017-05-11  5:52 ` Xavier Villaneau
  2017-05-11  6:41   ` Loic Dachary
  2 siblings, 1 reply; 5+ messages in thread
From: Xavier Villaneau @ 2017-05-11  5:52 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, Ceph Development

Hello Sage,

On 05/10/2017 at 09:00 AM, Sage Weil wrote:
> I *think* this is the simplest way to approach this (at least currently),
> although it is not in fact perfect.  We basically assume that the crush
> rules the admin has set up "make sense."  For example, if you have two
> crush rules that target a subset of the hierarchy, but they are
> overlapping (e.g., one is a subset of the other), then the subset that is
> covered by both will get utilized by the sum of the two and have a higher
> utilization--and the optimizer will not care (in fact, it will expect it).
>
> That rules out at least one potential use-case, though: say you have a
> pool and rule defined that target a single host.  Depending on how much
> you store in that pool, those devices will be that much more utilized.
> One could imagine wanting Ceph to automatically monitor that pool's
> utilization (directly or indirectly) and push other pools' data out of
> those devices as the host-local pool fills.  I don't really like this
> scenario, though, so I can't tell if it is a "valid" one we should care
> about.
It looks like upmap calculation works by counting placement groups, so 
"weird" maps and rules are mostly a problem if the overlapping pools 
have different ratios of bytes per PG. Maybe that data could be used in 
the algorithm, but I don't know if the added complexity would be worth 
it. At this point, it is probably fair to think those corner cases are 
only seen on maps created by knowledgeable users.
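
If that data were used, the minimal version would probably be something like 
weighting each PG by its pool's average bytes-per-PG (just a sketch, nothing 
that exists today):

    def weighted_pg_load(pgs_on_osd, bytes_per_pg_by_pool):
        """pgs_on_osd: iterable of (pool, pg) pairs mapped to one OSD."""
        return sum(bytes_per_pg_by_pool[pool] for pool, _pg in pgs_on_osd)

    bytes_per_pg = {"rbd": 25, "meta": 1}            # GB, made-up numbers
    osd_a = [("rbd", 1), ("rbd", 2), ("meta", 7)]    # 3 PGs
    osd_b = [("rbd", 3), ("meta", 8), ("meta", 9)]   # 3 PGs
    print(weighted_pg_load(osd_a, bytes_per_pg))     # 51
    print(weighted_pg_load(osd_b, bytes_per_pg))     # 27
    # Same PG count, very different load -- which the count-based view misses.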

> In any case, my hope is that at the end of the day we have a suite of
> optimization mechanisms: crush weights via the new choose_args, pg-upmap,
> and (if we don't deprecate it entirely) the osd reweight; and pg-based or
> osd utilization-based optimization (expected usage vs actual usage, or
> however you want to put it).  Ideally, they could use a common setup
> framework that handles the calculation of the expected/optimal
> targets we're optimizing against (using something like the above) so that
> it isn't reinvented/reimplemented (or, more likely, not!) for each one.
I like pg-upmap, it looks great for fine control and has a small impact 
(compared to manipulating weights, which could move thousands of PGs 
around). This means it is ideal for re-balancing running clusters.
It also addresses the "weight spike" issue (that 5 1 1 1 1 case) since 
it can just move placement groups until it runs out of options. This allows 
the theoretical limit cases to be reached eventually, whereas with 
weights it's only asymptotic.
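
For the record, the back-of-the-envelope arithmetic behind that limit in the 
5 1 1 1 1 case with two replicas (assumed numbers):

    # Five hosts weighted 5,1,1,1,1 with 2 replicas per PG: the big host
    # "wants" 5/9 of all replicas by weight, but can never hold more than one
    # replica of any PG, so it is capped at 1/2 of them.
    weights = [5, 1, 1, 1, 1]
    size = 2
    want = weights[0] / sum(weights)   # ~0.556 share by weight
    cap = 1 / size                     # 0.5: at most one replica of each PG
    print(round(want, 3), cap)
    # Shrinking the other hosts' weights only approaches the cap
    # asymptotically; moving individual PGs with pg-upmap can hit it exactly.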

Hopefully that won't be too confusing; reweight is already quite 
difficult to explain to new users, so pg-upmap will probably be too. 
There's also the question of how those tools interact with each other, 
for instance if PG-based optimization is run on top of utilization-based 
optimization.

(Hopefully I did not say anything too wrong; I haven't been able to 
follow those developments closely for the past couple of weeks.)

Regards,
-- 
Xavier Villaneau
Software Engineer, Concurrent Computer Corp.


* Re: crush optimization targets
  2017-05-11  5:52 ` Xavier Villaneau
@ 2017-05-11  6:41   ` Loic Dachary
  0 siblings, 0 replies; 5+ messages in thread
From: Loic Dachary @ 2017-05-11  6:41 UTC (permalink / raw)
  To: Xavier Villaneau; +Cc: Ceph Development

Hi Xavier,

On 05/11/2017 07:52 AM, Xavier Villaneau wrote:
> Hello Sage,
> 
> On 05/10/2017 at 09:00 AM, Sage Weil wrote:
>> I *think* this is the simplest way to approach this (at least currently),
>> although it is not in fact perfect.  We basically assume that the crush
>> rules the admin has set up "make sense."  For example, if you have two
>> crush rules that target a subset of the hierarchy, but they are
>> overlapping (e.g., one is a subset of the other), then the subset that is
>> covered by both will get utilized by the sum of the two and have a higher
>> utilization--and the optimizer will not care (in fact, it will expect it).
>>
>> That rules out at least one potential use-case, though: say you have a
>> pool and rule defined that target a single host.  Depending on how much
>> you store in that pool, those devices will be that much more utilized.
>> One could imagine wanting Ceph to automatically monitor that pool's
>> utilization (directly or indirectly) and push other pools' data out of
>> those devices as the host-local pool fills.  I don't really like this
>> scenario, though, so I can't tell if it is a "valid" one we should care
>> about.
> It looks like upmap calculation works by counting placement groups, so "weird" maps and rules are mostly a problem if the overlapping pools have different ratios of bytes per PG. Maybe that data could be used in the algorithm, but I don't know if the added complexity would be worth it. At this point, it is probably fair to think those corner cases are only seen on maps created by knowledgeable users.
> 
>> In any case, my hope is that at the end of the day we have a suite of
>> optimization mechanisms: crush weights via the new choose_args, pg-upmap,
>> and (if we don't deprecate it entirely) the osd reweight; and pg-based or
>> osd utilization-based optimization (expected usage vs actual usage, or
>> however you want to put it).  Ideally, they could use a common setup
>> framework that handles the calculation of the expected/optimal
>> targets we're optimizing against (using something like the above) so that
>> it isn't reinvented/reimplemented (or, more likely, not!) for each one.
> I like pg-upmap, it looks great for fine control and has a small impact (compared to manipulating weights, which could move thousands of PGs around). This means it is ideal for re-balancing running clusters.

The weight manipulation is done in small increments. I believe each increment can be applied individually to throttle PG movements (i.e. a crushmap with slightly modified weights can be produced & uploaded at each step). It's not as fine-grained as pg-upmap, but it does not need to be an all-or-nothing optimization either.
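
Roughly what I have in mind, as a sketch with made-up names:

    # Walk the weights from their current values toward the optimized ones in
    # a few steps, producing one crushmap per step so the amount of data in
    # flight stays bounded (hypothetical helper, not actual code).

    def weight_steps(current, target, steps=5):
        """current, target: {osd: crush weight}. Yields one map per step."""
        for i in range(1, steps + 1):
            frac = i / steps
            yield {osd: current[osd] + (target[osd] - current[osd]) * frac
                   for osd in current}

    current = {0: 1.00, 1: 1.00, 2: 1.00}
    target  = {0: 0.85, 1: 1.05, 2: 1.10}
    for step in weight_steps(current, target, steps=3):
        print({osd: round(w, 3) for osd, w in step.items()})
        # each intermediate map would be compiled and injected before the next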

> It also addresses the "weight spike" issue (that 5 1 1 1 1 case) since it can just move placement groups until it runs out of options. This allows the theoretical limit cases to be reached eventually, whereas with weights it's only asymptotic.
> 
> Hopefully that won't be too confusing; reweight is already quite difficult to explain to new users, so pg-upmap will probably be too. There's also the question of how those tools interact with each other, for instance if PG-based optimization is run on top of utilization-based optimization.
> 
> (Hopefully I did not say anything too wrong, I haven't been able to follow those developments closely for the past couple of weeks),
> 
> Regards,

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

