* explicitly mapping pgs in OSDMap
@ 2017-03-01 19:44 Sage Weil
  2017-03-01 20:49 ` Dan van der Ster
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-01 19:44 UTC (permalink / raw)
  To: ceph-devel

There's been a longstanding desire to improve the balance of PGs and data 
across OSDs to better utilize storage and balance workload.  We had a few 
ideas about this in a meeting last week and I wrote up a summary/proposal 
here:

	http://pad.ceph.com/p/osdmap-explicit-mapping

The basic idea is to have the ability to explicitly map individual PGs 
to certain OSDs so that we can move PGs from overfull to underfull 
devices.  The idea is that the mon or mgr would do this based on some 
heuristics or policy, and it should result in a better distribution than the 
current osd weight adjustments we make with reweight-by-utilization.

The other key property is that one reason we need as many PGs as we do 
now is to get a good balance; if we can remap some of them explicitly, we 
can get a better balance with fewer.  In essence, CRUSH gives an 
approximate distribution, and then we correct it to make it perfect (or 
close to it).

The main challenge is less about figuring out when/how to remap PGs to 
correct the balance than about figuring out when to remove those 
remappings after CRUSH map changes.  Some simple greedy strategies are 
obvious starting points (e.g., to move PGs off OSD X, first adjust or 
remove existing remap entries targeting OSD X before adding new ones), but 
there are a few ways we could structure the remap entries themselves so 
that they more gracefully disappear after a change.

For example, a remap entry might move a PG from OSD A to B if it maps to 
A; if the CRUSH topology changes and the PG no longer maps to A, the entry 
would be removed or ignored.  There are a few ways to do this in the pad; 
I'm sure there are other options.
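
To make that concrete, here is a minimal sketch of what such a conditional 
remap entry and its application could look like (the type and field names 
are illustrative stand-ins, not the actual OSDMap code):

    // Hypothetical sketch only: a remap entry that applies while CRUSH still
    // maps the PG to the "from" OSD, and silently becomes a no-op otherwise.
    #include <cstdint>
    #include <map>
    #include <vector>

    using pg_id = uint64_t;          // simplified stand-in for Ceph's pg_t

    struct pg_remap_t {
      int from_osd;                  // only applies if CRUSH maps the PG here
      int to_osd;                    // ...in which case place it here instead
    };

    std::map<pg_id, pg_remap_t> pg_remap;   // carried in the OSDMap

    // Adjust the CRUSH result ("up" set) for one PG.
    void apply_remap(pg_id pg, std::vector<int>& up) {
      auto it = pg_remap.find(pg);
      if (it == pg_remap.end())
        return;
      for (int& osd : up) {
        if (osd == it->second.from_osd) {
          osd = it->second.to_osd;   // entry still valid: apply it
          return;
        }
      }
      // CRUSH no longer maps the PG to from_osd, so the entry is stale and
      // is ignored here (and could be pruned from the map).
    }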

I put this on the agenda for CDM tonight.  If anyone has any other ideas 
about this we'd love to hear them!

sage


* Re: explicitly mapping pgs in OSDMap
  2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
@ 2017-03-01 20:49 ` Dan van der Ster
  2017-03-01 22:10   ` Sage Weil
  2017-03-02  3:09 ` Matthew Sedam
  2017-03-02 17:33 ` Kamble, Nitin A
  2 siblings, 1 reply; 13+ messages in thread
From: Dan van der Ster @ 2017-03-01 20:49 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
> There's been a longstanding desire to improve the balance of PGs and data
> across OSDs to better utilize storage and balance workload.  We had a few
> ideas about this in a meeting last week and I wrote up a summary/proposal
> here:
>
>         http://pad.ceph.com/p/osdmap-explicit-mapping
>
> The basic idea is to have the ability to explicitly map individual PGs
> to certain OSDs so that we can move PGs from overfull to underfull
> devices.  The idea is that the mon or mgr would do this based on some
> heuristics or policy and should result in a better distribution than teh
> current osd weight adjustments we make now with reweight-by-utilization.
>
> The other key property is that one reason why we need as many PGs as we do
> now is to get a good balance; if we can remap some of them explicitly, we
> can get a better balance with fewer.  In essense, CRUSH gives an
> approximate distribution, and then we correct to make it perfect (or close
> to it).
>
> The main challenge is less about figuring out when/how to remap PGs to
> correct balance, but figuring out when to remove those remappings after
> CRUSH map changes.  Some simple greedy strategies are obvious starting
> points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> entries targetting OSD X before adding new ones), but there are a few
> ways we could structure the remap entries themselves so that they
> more gracefully disappear after a change.
>
> For example, a remap entry might move a PG from OSD A to B if it maps to
> A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> would be removed or ignored.  There are a few ways to do this in the pad;
> I'm sure there are other options.
>
> I put this on the agenda for CDM tonight.  If anyone has any other ideas
> about this we'd love to hear them!
>

Hi Sage. This would be awesome! Seriously, it would let us run multi-PB
clusters closer to being full -- 1% of imbalance on a 100 PB cluster is
*expensive*!

I can't join the meeting, but here are my first thoughts on the implementation.

First, since this is a new feature, it would be cool if it supported
non-trivial topologies from the beginning -- i.e., the trivial topology is
the one where all OSDs should have equal PGs/weight, while a non-trivial
topology is one where only the OSDs beneath a user-defined CRUSH bucket
are balanced.
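
For what it's worth, the per-subtree target could be as simple as the
sketch below (OsdStat and the map layout are just illustrative; this is
not existing Ceph code):

    // Sketch: within the chosen CRUSH subtree, each OSD's fair share is
    // (its weight / total weight) of the subtree's PGs.  Positive delta
    // means overfull, negative means underfull.
    #include <map>

    struct OsdStat {
      double weight;
      int pgs;
    };

    std::map<int, double> pg_imbalance(const std::map<int, OsdStat>& subtree) {
      double total_w = 0, total_pgs = 0;
      for (const auto& kv : subtree) {
        total_w += kv.second.weight;
        total_pgs += kv.second.pgs;
      }
      std::map<int, double> delta;
      for (const auto& kv : subtree)
        delta[kv.first] = kv.second.pgs - total_pgs * (kv.second.weight / total_w);
      return delta;
    }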

And something I didn't understand about the remap format options --
when the cluster topology changes, couldn't we just remove all
remappings and start again? If we use some consistent method for the
remaps, shouldn't the result remain similar anyway after incremental
topology changes?

In the worst case, remaps will only cover maybe 10-20% of PGs. Clearly
we don't want to shuffle those for small topology changes, but for
larger changes perhaps we can accept that they move.

-- Dan

> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: explicitly mapping pgs in OSDMap
  2017-03-01 20:49 ` Dan van der Ster
@ 2017-03-01 22:10   ` Sage Weil
  2017-03-01 23:18     ` Allen Samuels
  2017-03-02  3:42     ` Xiaoxi Chen
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-01 22:10 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel

On Wed, 1 Mar 2017, Dan van der Ster wrote:
> On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > There's been a longstanding desire to improve the balance of PGs and data
> > across OSDs to better utilize storage and balance workload.  We had a few
> > ideas about this in a meeting last week and I wrote up a summary/proposal
> > here:
> >
> >         http://pad.ceph.com/p/osdmap-explicit-mapping
> >
> > The basic idea is to have the ability to explicitly map individual PGs
> > to certain OSDs so that we can move PGs from overfull to underfull
> > devices.  The idea is that the mon or mgr would do this based on some
> > heuristics or policy and should result in a better distribution than teh
> > current osd weight adjustments we make now with reweight-by-utilization.
> >
> > The other key property is that one reason why we need as many PGs as we do
> > now is to get a good balance; if we can remap some of them explicitly, we
> > can get a better balance with fewer.  In essense, CRUSH gives an
> > approximate distribution, and then we correct to make it perfect (or close
> > to it).
> >
> > The main challenge is less about figuring out when/how to remap PGs to
> > correct balance, but figuring out when to remove those remappings after
> > CRUSH map changes.  Some simple greedy strategies are obvious starting
> > points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> > entries targetting OSD X before adding new ones), but there are a few
> > ways we could structure the remap entries themselves so that they
> > more gracefully disappear after a change.
> >
> > For example, a remap entry might move a PG from OSD A to B if it maps to
> > A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> > would be removed or ignored.  There are a few ways to do this in the pad;
> > I'm sure there are other options.
> >
> > I put this on the agenda for CDM tonight.  If anyone has any other ideas
> > about this we'd love to hear them!
> >
> 
> Hi Sage. This would be awesome! Seriously, it would let us run multi
> PB clusters closer to being full -- 1% of imbalance on a 100 PB
> cluster is *expensive* !!
> 
> I can't join the meeting, but here are my first thoughts on the implementation.
> 
> First, since this is a new feature, it would be cool if it supported
> non-trivial topologies from the beginning -- i.e. the trivial topology
> is when all OSDs should have equal PGs/weight. A non-trivial topology
> is where only OSDs beneath a user-defined crush bucket are balanced.
> 
> And something I didn't understand about the remap format options --
> when the cluster topology changes, couldn't we just remove all
> remappings and start again? If you use some consistent method for the
> remaps, shouldn't the result anyway remain similar after incremental
> topology changes?
> 
> In the worst case, remaps will only form maybe 10-20% of PGs. Clearly
> we don't want to shuffle those for small topology changes, but for
> larger changes perhaps we can accept those to move.

If there is a small change, then we don't want to toss the mappings, but 
if there is a large change we do; which ones we toss should probably 
depend on whether the original mapping they were overriding has changed.

The other hard part of this, now that I think about it, is that you have 
placement constraints encoded into the CRUSH rules (e.g., separate 
replicas across racks).  Whatever is installing new mappings needs to 
understand those constraints so that the remap entries also respect the 
policy.  That means we need to either (1) understand the CRUSH rule 
constraints, or (2) encode the (simplified?) set of placement constraints 
in the Ceph OSDMap and auto-generate the CRUSH rule from that.

For (1), I bet we can make a simple "CRUSH rule interpreter" that, instead 
of making pseudorandom choices, picks each child based on the minimum 
utilization (or the original mapping value) in order to generate the remap 
entries...
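
A very rough sketch of what that interpreter could do (Bucket and the
utilization field are illustrative stand-ins, not the real crush
structures):

    // Sketch: descend the hierarchy the rule walks, but at each step take the
    // least-utilized child instead of the pseudorandom (straw) choice, so the
    // generated remap targets still respect the rule's failure-domain shape.
    #include <limits>
    #include <vector>

    struct Bucket {
      int id;
      double utilization;              // assumed precomputed fraction full
      std::vector<Bucket*> children;   // empty => this is an OSD (leaf)
    };

    int pick_underfull_osd(const Bucket& root) {
      const Bucket* cur = &root;
      while (!cur->children.empty()) {
        const Bucket* best = nullptr;
        double best_util = std::numeric_limits<double>::max();
        for (const Bucket* c : cur->children) {
          if (c->utilization < best_util) {
            best_util = c->utilization;
            best = c;
          }
        }
        cur = best;
      }
      return cur->id;
    }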

sage


* RE: explicitly mapping pgs in OSDMap
  2017-03-01 22:10   ` Sage Weil
@ 2017-03-01 23:18     ` Allen Samuels
  2017-03-02  3:42     ` Xiaoxi Chen
  1 sibling, 0 replies; 13+ messages in thread
From: Allen Samuels @ 2017-03-01 23:18 UTC (permalink / raw)
  To: Sage Weil, Dan van der Ster; +Cc: ceph-devel

> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, March 01, 2017 2:11 PM
> To: Dan van der Ster <dan@vanderster.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: explicitly mapping pgs in OSDMap
> 
> On Wed, 1 Mar 2017, Dan van der Ster wrote:
> > On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > > There's been a longstanding desire to improve the balance of PGs and
> > > data across OSDs to better utilize storage and balance workload.  We
> > > had a few ideas about this in a meeting last week and I wrote up a
> > > summary/proposal
> > > here:
> > >
> > >         http://pad.ceph.com/p/osdmap-explicit-mapping
> > >
> > > The basic idea is to have the ability to explicitly map individual
> > > PGs to certain OSDs so that we can move PGs from overfull to
> > > underfull devices.  The idea is that the mon or mgr would do this
> > > based on some heuristics or policy and should result in a better
> > > distribution than teh current osd weight adjustments we make now with
> reweight-by-utilization.
> > >
> > > The other key property is that one reason why we need as many PGs as
> > > we do now is to get a good balance; if we can remap some of them
> > > explicitly, we can get a better balance with fewer.  In essense,
> > > CRUSH gives an approximate distribution, and then we correct to make
> > > it perfect (or close to it).
> > >
> > > The main challenge is less about figuring out when/how to remap PGs
> > > to correct balance, but figuring out when to remove those remappings
> > > after CRUSH map changes.  Some simple greedy strategies are obvious
> > > starting points (e.g., to move PGs off OSD X, first adjust or remove
> > > existing remap entries targetting OSD X before adding new ones), but
> > > there are a few ways we could structure the remap entries themselves
> > > so that they more gracefully disappear after a change.
> > >
> > > For example, a remap entry might move a PG from OSD A to B if it
> > > maps to A; if the CRUSH topology changes and the PG no longer maps
> > > to A, the entry would be removed or ignored.  There are a few ways
> > > to do this in the pad; I'm sure there are other options.
> > >
> > > I put this on the agenda for CDM tonight.  If anyone has any other
> > > ideas about this we'd love to hear them!
> > >
> >
> > Hi Sage. This would be awesome! Seriously, it would let us run multi
> > PB clusters closer to being full -- 1% of imbalance on a 100 PB
> > cluster is *expensive* !!

Whoa there. This will reduce the variance in fullness between OSDs, which is a really good thing.

However, you will still have a drop-off in individual OSD performance as each OSD fills up. Nothing here addresses that.

> >
> > I can't join the meeting, but here are my first thoughts on the
> implementation.
> >
> > First, since this is a new feature, it would be cool if it supported
> > non-trivial topologies from the beginning -- i.e. the trivial topology
> > is when all OSDs should have equal PGs/weight. A non-trivial topology
> > is where only OSDs beneath a user-defined crush bucket are balanced.
> >
> > And something I didn't understand about the remap format options --
> > when the cluster topology changes, couldn't we just remove all
> > remappings and start again? If you use some consistent method for the
> > remaps, shouldn't the result anyway remain similar after incremental
> > topology changes?
> >
> > In the worst case, remaps will only form maybe 10-20% of PGs. Clearly
> > we don't want to shuffle those for small topology changes, but for
> > larger changes perhaps we can accept those to move.
> 
> If there is a small change, then we don't want to toss the mappings.. but if
> there is a large change you do; which ones you toss should probably depend
> on whether the original mapping they were overriding changed.
> 
> The other hard part of this, now that I think about it, is that you have
> placement constraints encoded into the CRUSH rules (e.g., separate replicas
> across racks).  Whatever is installing new mappings needs to understand
> those constraints so tha the remap entries also respect the policy.  That
> means either we need to (1) understand the CRUSH rule constraints, or (2)
> encode the (simplified?) set of placement constraints in the Ceph OSDMap
> and auto-generate the CRUSH rule from that.
> 
> For (1), I bet we can make a simple "CRUSH rule intepreter" that, instead of
> making pseudorandom choices, pick each child based on the minimum
> utilization (or the original mapping value) in order to generate the remap
> entries...
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html


* Re: explicitly mapping pgs in OSDMap
  2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
  2017-03-01 20:49 ` Dan van der Ster
@ 2017-03-02  3:09 ` Matthew Sedam
  2017-03-02  6:17   ` Sage Weil
  2017-03-02 17:33 ` Kamble, Nitin A
  2 siblings, 1 reply; 13+ messages in thread
From: Matthew Sedam @ 2017-03-02  3:09 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Sage,

Hi! I am a potential GSOC 2017 student, and I am interested in the
Ceph-mgr: Smarter Reweight_by_Utilization project. However, when
reading this I wondered if this proposed idea would make my GSOC
project effectively null and void. Could you elaborate on this?

Matthew Sedam

On Wed, Mar 1, 2017 at 1:44 PM, Sage Weil <sweil@redhat.com> wrote:
> There's been a longstanding desire to improve the balance of PGs and data
> across OSDs to better utilize storage and balance workload.  We had a few
> ideas about this in a meeting last week and I wrote up a summary/proposal
> here:
>
>         http://pad.ceph.com/p/osdmap-explicit-mapping
>
> The basic idea is to have the ability to explicitly map individual PGs
> to certain OSDs so that we can move PGs from overfull to underfull
> devices.  The idea is that the mon or mgr would do this based on some
> heuristics or policy and should result in a better distribution than teh
> current osd weight adjustments we make now with reweight-by-utilization.
>
> The other key property is that one reason why we need as many PGs as we do
> now is to get a good balance; if we can remap some of them explicitly, we
> can get a better balance with fewer.  In essense, CRUSH gives an
> approximate distribution, and then we correct to make it perfect (or close
> to it).
>
> The main challenge is less about figuring out when/how to remap PGs to
> correct balance, but figuring out when to remove those remappings after
> CRUSH map changes.  Some simple greedy strategies are obvious starting
> points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> entries targetting OSD X before adding new ones), but there are a few
> ways we could structure the remap entries themselves so that they
> more gracefully disappear after a change.
>
> For example, a remap entry might move a PG from OSD A to B if it maps to
> A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> would be removed or ignored.  There are a few ways to do this in the pad;
> I'm sure there are other options.
>
> I put this on the agenda for CDM tonight.  If anyone has any other ideas
> about this we'd love to hear them!
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: explicitly mapping pgs in OSDMap
  2017-03-01 22:10   ` Sage Weil
  2017-03-01 23:18     ` Allen Samuels
@ 2017-03-02  3:42     ` Xiaoxi Chen
  1 sibling, 0 replies; 13+ messages in thread
From: Xiaoxi Chen @ 2017-03-02  3:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel

It looks like the mappings need to be generated automatically by some
external/internal tool rather than by the administrator; an admin could do
it by hand, but I don't think a human can properly balance thousands of PGs
across thousands of OSDs.

Assuming we have such a tool, can we simplify the "remove" to "regenerate"
the mappings? I.e., when the osdmap changes, all previous mappings are
cleared, the CRUSH rule is applied and the pgmap is generated, then the
**balance tool** jumps in and generates new mappings, and we end up with a
*balanced* pgmap.

> That means either we need to (1) understand the CRUSH rule
> constraints, or (2) encode the (simplified?) set of placement constraints
> in the Ceph OSDMap and auto-generate the CRUSH rule from that.

Would it be simpler to just extend the CrushWrapper to have a
validate_pg_map(crush *map, vector<int> pg_map)? It would just hack the
"random" logic in crush_choose and see whether CRUSH can generate the same
mapping as the wanted one.
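
A hedged sketch of that validation idea (validate_pg_map is not an
existing CrushWrapper method, and the constraint check here is
deliberately reduced to "distinct failure domains"):

    // Sketch: check that a proposed mapping is one the rule's constraints
    // would allow, here simplified to "no two replicas in the same failure
    // domain".  failure_domain_of() is a placeholder, not a real API.
    #include <set>
    #include <vector>

    inline int failure_domain_of(int osd) {
      return osd / 10;                 // placeholder: pretend 10 OSDs per rack
    }

    bool validate_pg_map(const std::vector<int>& proposed) {
      std::set<int> domains;
      for (int osd : proposed) {
        if (!domains.insert(failure_domain_of(osd)).second)
          return false;                // duplicate failure domain
      }
      return true;
    }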


Xiaoxi

2017-03-02 6:10 GMT+08:00 Sage Weil <sweil@redhat.com>:
> On Wed, 1 Mar 2017, Dan van der Ster wrote:
>> On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
>> > There's been a longstanding desire to improve the balance of PGs and data
>> > across OSDs to better utilize storage and balance workload.  We had a few
>> > ideas about this in a meeting last week and I wrote up a summary/proposal
>> > here:
>> >
>> >         http://pad.ceph.com/p/osdmap-explicit-mapping
>> >
>> > The basic idea is to have the ability to explicitly map individual PGs
>> > to certain OSDs so that we can move PGs from overfull to underfull
>> > devices.  The idea is that the mon or mgr would do this based on some
>> > heuristics or policy and should result in a better distribution than teh
>> > current osd weight adjustments we make now with reweight-by-utilization.
>> >
>> > The other key property is that one reason why we need as many PGs as we do
>> > now is to get a good balance; if we can remap some of them explicitly, we
>> > can get a better balance with fewer.  In essense, CRUSH gives an
>> > approximate distribution, and then we correct to make it perfect (or close
>> > to it).
>> >
>> > The main challenge is less about figuring out when/how to remap PGs to
>> > correct balance, but figuring out when to remove those remappings after
>> > CRUSH map changes.  Some simple greedy strategies are obvious starting
>> > points (e.g., to move PGs off OSD X, first adjust or remove existing remap
>> > entries targetting OSD X before adding new ones), but there are a few
>> > ways we could structure the remap entries themselves so that they
>> > more gracefully disappear after a change.
>> >
>> > For example, a remap entry might move a PG from OSD A to B if it maps to
>> > A; if the CRUSH topology changes and the PG no longer maps to A, the entry
>> > would be removed or ignored.  There are a few ways to do this in the pad;
>> > I'm sure there are other options.
>> >
>> > I put this on the agenda for CDM tonight.  If anyone has any other ideas
>> > about this we'd love to hear them!
>> >
>>
>> Hi Sage. This would be awesome! Seriously, it would let us run multi
>> PB clusters closer to being full -- 1% of imbalance on a 100 PB
>> cluster is *expensive* !!
>>
>> I can't join the meeting, but here are my first thoughts on the implementation.
>>
>> First, since this is a new feature, it would be cool if it supported
>> non-trivial topologies from the beginning -- i.e. the trivial topology
>> is when all OSDs should have equal PGs/weight. A non-trivial topology
>> is where only OSDs beneath a user-defined crush bucket are balanced.
>>
>> And something I didn't understand about the remap format options --
>> when the cluster topology changes, couldn't we just remove all
>> remappings and start again? If you use some consistent method for the
>> remaps, shouldn't the result anyway remain similar after incremental
>> topology changes?
>>
>> In the worst case, remaps will only form maybe 10-20% of PGs. Clearly
>> we don't want to shuffle those for small topology changes, but for
>> larger changes perhaps we can accept those to move.
>
> If there is a small change, then we don't want to toss the mappings.. but
> if there is a large change you do; which ones you toss should
> probably depend on whether the original mapping they were overriding
> changed.
>
> The other hard part of this, now that I think about it, is that you have
> placement constraints encoded into the CRUSH rules (e.g., separate
> replicas across racks).  Whatever is installing new mappings needs to
> understand those constraints so tha the remap entries also respect the
> policy.  That means either we need to (1) understand the CRUSH rule
> constraints, or (2) encode the (simplified?) set of placement constraints
> in the Ceph OSDMap and auto-generate the CRUSH rule from that.
>
> For (1), I bet we can make a simple "CRUSH rule intepreter" that, instead
> of making pseudorandom choices, pick each child based on the minimum
> utilization (or the original mapping value) in order to generate the remap
> entries...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: explicitly mapping pgs in OSDMap
  2017-03-02  3:09 ` Matthew Sedam
@ 2017-03-02  6:17   ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-02  6:17 UTC (permalink / raw)
  To: Matthew Sedam; +Cc: ceph-devel

On Wed, 1 Mar 2017, Matthew Sedam wrote:
> Sage,
> 
> Hi! I am a potential GSOC 2017 student, and I am interested in the
> Ceph-mgr: Smarter Reweight_by_Utilization project. However, when
> reading this I wondered if this proposed idea would make my GSOC
> project effectively null and void. Could you elaborate on this?

I would consider the below an evolution of the role of 
reweight-by-utilization.  The complexity in improving the approach is less 
around the mechanism it uses to make the adjustment, and more around 
reasoning about how the current utilization is attributed to PGs, how 
PG sizes are estimated, and so on.  The problem is pretty straightforward 
when you have a uniform CRUSH hierarchy and all data is spread across 
the cluster; when you have different CRUSH rules that distribute to only 
some devices, especially when those devices overlap, things get tricky.

In any case, this just makes the project more interesting, with several 
possible avenues for improvement and optimization.  :)

sage



 > 
> Matthew Sedam
> 
> On Wed, Mar 1, 2017 at 1:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > There's been a longstanding desire to improve the balance of PGs and data
> > across OSDs to better utilize storage and balance workload.  We had a few
> > ideas about this in a meeting last week and I wrote up a summary/proposal
> > here:
> >
> >         http://pad.ceph.com/p/osdmap-explicit-mapping
> >
> > The basic idea is to have the ability to explicitly map individual PGs
> > to certain OSDs so that we can move PGs from overfull to underfull
> > devices.  The idea is that the mon or mgr would do this based on some
> > heuristics or policy and should result in a better distribution than teh
> > current osd weight adjustments we make now with reweight-by-utilization.
> >
> > The other key property is that one reason why we need as many PGs as we do
> > now is to get a good balance; if we can remap some of them explicitly, we
> > can get a better balance with fewer.  In essense, CRUSH gives an
> > approximate distribution, and then we correct to make it perfect (or close
> > to it).
> >
> > The main challenge is less about figuring out when/how to remap PGs to
> > correct balance, but figuring out when to remove those remappings after
> > CRUSH map changes.  Some simple greedy strategies are obvious starting
> > points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> > entries targetting OSD X before adding new ones), but there are a few
> > ways we could structure the remap entries themselves so that they
> > more gracefully disappear after a change.
> >
> > For example, a remap entry might move a PG from OSD A to B if it maps to
> > A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> > would be removed or ignored.  There are a few ways to do this in the pad;
> > I'm sure there are other options.
> >
> > I put this on the agenda for CDM tonight.  If anyone has any other ideas
> > about this we'd love to hear them!
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: explicitly mapping pgs in OSDMap
  2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
  2017-03-01 20:49 ` Dan van der Ster
  2017-03-02  3:09 ` Matthew Sedam
@ 2017-03-02 17:33 ` Kamble, Nitin A
  2017-03-02 17:40   ` Sage Weil
  2 siblings, 1 reply; 13+ messages in thread
From: Kamble, Nitin A @ 2017-03-02 17:33 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

Hi Sage,
  The CRUSH algorithm handles the mapping of PGs, and it still will even
with the addition of explicit mappings. I presume that finding which PGs
belong to which OSDs will involve additional computation for each
additional explicit mapping.

What would be the penalty of this additional computation?

For a small number of explicit mappings such a penalty would be small;
IMO it could get quite expensive with a large number of explicit mappings.
The implementation will need to manage the count of explicit mappings
by reverting some of them as the distribution changes. Understanding the
additional overhead of the explicit mappings would have a great influence
on the implementation.

Nitin




On 3/1/17, 11:44 AM, "ceph-devel-owner@vger.kernel.org on behalf of Sage Weil" <ceph-devel-owner@vger.kernel.org on behalf of sweil@redhat.com> wrote:

    There's been a longstanding desire to improve the balance of PGs and data 
    across OSDs to better utilize storage and balance workload.  We had a few 
    ideas about this in a meeting last week and I wrote up a summary/proposal 
    here:
    
    	http://pad.ceph.com/p/osdmap-explicit-mapping
    
    The basic idea is to have the ability to explicitly map individual PGs 
    to certain OSDs so that we can move PGs from overfull to underfull 
    devices.  The idea is that the mon or mgr would do this based on some 
    heuristics or policy and should result in a better distribution than teh 
    current osd weight adjustments we make now with reweight-by-utilization.
    
    The other key property is that one reason why we need as many PGs as we do 
    now is to get a good balance; if we can remap some of them explicitly, we 
    can get a better balance with fewer.  In essense, CRUSH gives an 
    approximate distribution, and then we correct to make it perfect (or close 
    to it).
    
    The main challenge is less about figuring out when/how to remap PGs to 
    correct balance, but figuring out when to remove those remappings after 
    CRUSH map changes.  Some simple greedy strategies are obvious starting 
    points (e.g., to move PGs off OSD X, first adjust or remove existing remap 
    entries targetting OSD X before adding new ones), but there are a few 
    ways we could structure the remap entries themselves so that they 
    more gracefully disappear after a change.
    
    For example, a remap entry might move a PG from OSD A to B if it maps to 
    A; if the CRUSH topology changes and the PG no longer maps to A, the entry 
    would be removed or ignored.  There are a few ways to do this in the pad; 
    I'm sure there are other options.
    
    I put this on the agenda for CDM tonight.  If anyone has any other ideas 
    about this we'd love to hear them!
    
    sage
    --
    To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    



* Re: explicitly mapping pgs in OSDMap
  2017-03-02 17:33 ` Kamble, Nitin A
@ 2017-03-02 17:40   ` Sage Weil
       [not found]     ` <F6F90D3B-5362-48D1-B786-2191E5B98331@gmail.com>
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2017-03-02 17:40 UTC (permalink / raw)
  To: Kamble, Nitin A; +Cc: ceph-devel

On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
> Hi Sage,
>   The crush algorithm handles mapping of pgs, and it will even with the
> addition of explicit mappings. I presume, finding which pgs belong to
> which OSDs will involve addition computation for each additional
> explicit mapping. 
> 
> What would be penalty of this additional computation? 
> 
> For small number of explicit mappings such penalty would be small, 
> IMO it can get quite expensive with large number of explicit mappings.
> The implementation will need to manage the count of explicit mappings,
> by reverting some of the explicit mappings as the distribution changes.
> The understanding of additional overhead of the explicit mappings would
> had great influence on the implementation.

Yeah, we'll want to be careful with the implementation, e.g., by using an 
unordered_map (hash table) for looking up the mappings (an rbtree won't 
scale particularly well).

Note that pg_temp is already a map<pg_t,vector<int>> and can get big when 
you're doing a lot of rebalancing, resulting in O(log n) lookups.  If we 
do something optimized here we should use the same strategy for pg_temp 
too.
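
A minimal sketch of that lookup structure (the key type here is a
simplified stand-in for pg_t, with a trivial hash for illustration):

    // Sketch: keep the explicit mappings (and potentially pg_temp) in a hash
    // table so per-PG lookups are O(1) instead of O(log n).
    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <vector>

    struct pg_key_t {
      uint64_t pool;
      uint32_t seed;
      bool operator==(const pg_key_t& o) const {
        return pool == o.pool && seed == o.seed;
      }
    };

    struct pg_key_hash {
      size_t operator()(const pg_key_t& k) const {
        return std::hash<uint64_t>()(k.pool) ^ (std::hash<uint64_t>()(k.seed) << 1);
      }
    };

    // explicit remaps / pg_temp-style overrides, keyed by PG
    std::unordered_map<pg_key_t, std::vector<int>, pg_key_hash> pg_overrides;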

sage


> 
> Nitin
> 
> 
> 
> 
> On 3/1/17, 11:44 AM, "ceph-devel-owner@vger.kernel.org on behalf of Sage Weil" <ceph-devel-owner@vger.kernel.org on behalf of sweil@redhat.com> wrote:
> 
>     There's been a longstanding desire to improve the balance of PGs and data 
>     across OSDs to better utilize storage and balance workload.  We had a few 
>     ideas about this in a meeting last week and I wrote up a summary/proposal 
>     here:
>     
>     	http://pad.ceph.com/p/osdmap-explicit-mapping
>     
>     The basic idea is to have the ability to explicitly map individual PGs 
>     to certain OSDs so that we can move PGs from overfull to underfull 
>     devices.  The idea is that the mon or mgr would do this based on some 
>     heuristics or policy and should result in a better distribution than teh 
>     current osd weight adjustments we make now with reweight-by-utilization.
>     
>     The other key property is that one reason why we need as many PGs as we do 
>     now is to get a good balance; if we can remap some of them explicitly, we 
>     can get a better balance with fewer.  In essense, CRUSH gives an 
>     approximate distribution, and then we correct to make it perfect (or close 
>     to it).
>     
>     The main challenge is less about figuring out when/how to remap PGs to 
>     correct balance, but figuring out when to remove those remappings after 
>     CRUSH map changes.  Some simple greedy strategies are obvious starting 
>     points (e.g., to move PGs off OSD X, first adjust or remove existing remap 
>     entries targetting OSD X before adding new ones), but there are a few 
>     ways we could structure the remap entries themselves so that they 
>     more gracefully disappear after a change.
>     
>     For example, a remap entry might move a PG from OSD A to B if it maps to 
>     A; if the CRUSH topology changes and the PG no longer maps to A, the entry 
>     would be removed or ignored.  There are a few ways to do this in the pad; 
>     I'm sure there are other options.
>     
>     I put this on the agenda for CDM tonight.  If anyone has any other ideas 
>     about this we'd love to hear them!
>     
>     sage
>     --
>     To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>     the body of a message to majordomo@vger.kernel.org
>     More majordomo info at  http://vger.kernel.org/majordomo-info.html
>     
> 
> 


* Re: explicitly mapping pgs in OSDMap
       [not found]     ` <F6F90D3B-5362-48D1-B786-2191E5B98331@gmail.com>
@ 2017-03-02 18:09       ` Sage Weil
  2017-03-02 19:40         ` Tiger Hu
  2017-03-03  9:20         ` Bartłomiej Święcki
  0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-02 18:09 UTC (permalink / raw)
  To: Tiger Hu; +Cc: Kamble, Nitin A, ceph-devel


On Fri, 3 Mar 2017, Tiger Hu wrote:
> Hi Sage,
> I am very glad to know you raised this issue. In some cases, user may want
> to accurately control the PG numbers in every OSDs. Is it possible to
> add/implement a new policy to support fixed-mapping? This may be useful for
> performance tuning or test purpose. Thanks.

In principle this could be used to map every pg explicitly without regard 
for CRUSH.  I think in practice, though, we can have CRUSH map 80-90% of 
the PGs, and then have explicit mappings for 10-20% in order to achieve 
the same "perfect" balance of PGs.

The result is a smaller OSDMap.  It might not matter that much for small 
clusters, but for large clusters, it is helpful to keep the OSDMap small.

OTOH, maybe through all of this we end up in a place where the OSDMap is 
just an explicit map and the mgr agent that's managing it is using CRUSH 
to generate it; who knows!  That might make for faster mappings on the 
clients since it's a table lookup instead of a mapping calculation.
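
In client terms that would look roughly like the sketch below
(crush_map_pg() is an assumed placeholder for the normal calculation, not
a real function name):

    // Sketch of the client mapping path once explicit entries exist: consult
    // the table first, and only fall back to the CRUSH calculation when the
    // PG has no override.
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using pg_id = uint64_t;            // simplified stand-in for pg_t

    // Placeholder for the usual CRUSH calculation; the real client computes
    // this from the crush map.
    std::vector<int> crush_map_pg(pg_id /*pg*/) { return {}; }

    std::vector<int> map_pg(
        const std::unordered_map<pg_id, std::vector<int>>& explicit_map,
        pg_id pg) {
      auto it = explicit_map.find(pg);
      if (it != explicit_map.end())
        return it->second;             // pure table lookup, no CRUSH walk
      return crush_map_pg(pg);         // fall back to the calculated mapping
    }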

I'm thinking that if we implement the ability to have the explicit 
mappings in the OSDMap we open up both possibilities.  The hard part is 
always getting the mapping compatibility into the client, so if we do that 
right, luminous+ clients will support explicit mappings (overrides) in the 
OSDMap and would be able to work with future versions that go all-in on 
explicit...

sage


 > 
> Tiger
>       On Mar 3, 2017, at 1:40 AM, Sage Weil <sweil@redhat.com> wrote:
> 
> On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
>       Hi Sage,
>        The crush algorithm handles mapping of pgs, and it will
>       even with the
>       addition of explicit mappings. I presume, finding which
>       pgs belong to
>       which OSDs will involve addition computation for each
>       additional
>       explicit mapping. 
> 
>       What would be penalty of this additional computation? 
> 
>       For small number of explicit mappings such penalty would
>       be small, 
>       IMO it can get quite expensive with large number of
>       explicit mappings.
>       The implementation will need to manage the count of
>       explicit mappings,
>       by reverting some of the explicit mappings as the
>       distribution changes.
>       The understanding of additional overhead of the explicit
>       mappings would
>       had great influence on the implementation.
> 
> 
> Yeah, we'll want to be careful with the implementation, e.g., by using
> an 
> unordered_map (hash table) for looking up the mappings (an rbtree
> won't 
> scale particularly well).
> 
> Note that pg_temp is already a map<pg_t,vector<int>> and can get big
> when 
> you're doing a lot of rebalancing, resulting on O(log n) lookups.  If
> we 
> do something optimized here we should use the same strategy for
> pg_temp 
> too.
> 
> sage
> 
> 
> 
>       Nitin
> 
> 
> 
> 
>       On 3/1/17, 11:44 AM, "ceph-devel-owner@vger.kernel.org on
>       behalf of Sage Weil" <ceph-devel-owner@vger.kernel.org on
>       behalf of sweil@redhat.com> wrote:
> 
>          There's been a longstanding desire to improve the
>       balance of PGs and data 
>          across OSDs to better utilize storage and balance
>       workload.  We had a few 
>          ideas about this in a meeting last week and I wrote up
>       a summary/proposal 
>          here:
> 
>           http://pad.ceph.com/p/osdmap-explicit-mapping
> 
>          The basic idea is to have the ability to explicitly map
>       individual PGs 
>          to certain OSDs so that we can move PGs from overfull
>       to underfull 
>          devices.  The idea is that the mon or mgr would do this
>       based on some 
>          heuristics or policy and should result in a better
>       distribution than teh 
>          current osd weight adjustments we make now with
>       reweight-by-utilization.
> 
>          The other key property is that one reason why we need
>       as many PGs as we do 
>          now is to get a good balance; if we can remap some of
>       them explicitly, we 
>          can get a better balance with fewer.  In essense, CRUSH
>       gives an 
>          approximate distribution, and then we correct to make
>       it perfect (or close 
>          to it).
> 
>          The main challenge is less about figuring out when/how
>       to remap PGs to 
>          correct balance, but figuring out when to remove those
>       remappings after 
>          CRUSH map changes.  Some simple greedy strategies are
>       obvious starting 
>          points (e.g., to move PGs off OSD X, first adjust or
>       remove existing remap 
>          entries targetting OSD X before adding new ones), but
>       there are a few 
>          ways we could structure the remap entries themselves so
>       that they 
>          more gracefully disappear after a change.
> 
>          For example, a remap entry might move a PG from OSD A
>       to B if it maps to 
>          A; if the CRUSH topology changes and the PG no longer
>       maps to A, the entry 
>          would be removed or ignored.  There are a few ways to
>       do this in the pad; 
>          I'm sure there are other options.
> 
>          I put this on the agenda for CDM tonight.  If anyone
>       has any other ideas 
>          about this we'd love to hear them!
> 
>          sage
>          --
>          To unsubscribe from this list: send the line
>       "unsubscribe ceph-devel" in
>          the body of a message to majordomo@vger.kernel.org
>          More majordomo info at
>        http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 


* Re: explicitly mapping pgs in OSDMap
  2017-03-02 18:09       ` Sage Weil
@ 2017-03-02 19:40         ` Tiger Hu
  2017-03-03  9:20         ` Bartłomiej Święcki
  1 sibling, 0 replies; 13+ messages in thread
From: Tiger Hu @ 2017-03-02 19:40 UTC (permalink / raw)
  To: Sage Weil; +Cc: Kamble, Nitin A, ceph-devel

Sage,

Thanks for your reply. Looking forward to explicit mapping.

Tiger
> On Mar 3, 2017, at 2:09 AM, Sage Weil <sweil@redhat.com> wrote:
> 
> On Fri, 3 Mar 2017, Tiger Hu wrote:
>> Hi Sage,
>> I am very glad to know you raised this issue. In some cases, user may want
>> to accurately control the PG numbers in every OSDs. Is it possible to
>> add/implement a new policy to support fixed-mapping? This may be useful for
>> performance tuning or test purpose. Thanks.
> 
> In principle this could be used to map every pg explicitly without regard 
> for CRUSH.  In think in practice, though, we can have CRUSH map 80-90% of 
> the PGs, and then have explicit mappings for 10-20% in order to achieve 
> the same "perfect" balance of PGs.
> 
> The result is a smaller OSDMap.  It might not matter that much for small 
> clusters, but for large clusters, it is helpful to keep the OSDMap small.
> 
> OTOH, maybe through all of this we end up in a place where the OSDMap is 
> just an explicit map and the mgr agent that's managing it is using CRUSH 
> to generate it; who knows!  That might make for faster mappings on the 
> clients since it's a table lookup instead of a mapping calculation.
> 
> I'm thinking that if we implement the ability to have the explicit 
> mappings in the OSDMap we open up both possibilities.  The hard part is 
> always getting the mapping compatbility into the client, so if we do that 
> right, luminous+ clients will support explicit mapping (overrides) in 
> OSDMap and would be able to work with future versions that go all-in on 
> explicit...
> 
> sage
> 
> 
>> 
>> Tiger
>>      On Mar 3, 2017, at 1:40 AM, Sage Weil <sweil@redhat.com> wrote:
>> 
>> On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
>>      Hi Sage,
>>       The crush algorithm handles mapping of pgs, and it will
>>      even with the
>>      addition of explicit mappings. I presume, finding which
>>      pgs belong to
>>      which OSDs will involve addition computation for each
>>      additional
>>      explicit mapping. 
>> 
>>      What would be penalty of this additional computation? 
>> 
>>      For small number of explicit mappings such penalty would
>>      be small, 
>>      IMO it can get quite expensive with large number of
>>      explicit mappings.
>>      The implementation will need to manage the count of
>>      explicit mappings,
>>      by reverting some of the explicit mappings as the
>>      distribution changes.
>>      The understanding of additional overhead of the explicit
>>      mappings would
>>      had great influence on the implementation.
>> 
>> 
>> Yeah, we'll want to be careful with the implementation, e.g., by using
>> an 
>> unordered_map (hash table) for looking up the mappings (an rbtree
>> won't 
>> scale particularly well).
>> 
>> Note that pg_temp is already a map<pg_t,vector<int>> and can get big
>> when 
>> you're doing a lot of rebalancing, resulting on O(log n) lookups.  If
>> we 
>> do something optimized here we should use the same strategy for
>> pg_temp 
>> too.
>> 
>> sage
>> 
>> 
>> 
>>      Nitin
>> 
>> 
>> 
>> 
>>      On 3/1/17, 11:44 AM, "ceph-devel-owner@vger.kernel.org on
>>      behalf of Sage Weil" <ceph-devel-owner@vger.kernel.org on
>>      behalf of sweil@redhat.com> wrote:
>> 
>>         There's been a longstanding desire to improve the
>>      balance of PGs and data 
>>         across OSDs to better utilize storage and balance
>>      workload.  We had a few 
>>         ideas about this in a meeting last week and I wrote up
>>      a summary/proposal 
>>         here:
>> 
>>          http://pad.ceph.com/p/osdmap-explicit-mapping
>> 
>>         The basic idea is to have the ability to explicitly map
>>      individual PGs 
>>         to certain OSDs so that we can move PGs from overfull
>>      to underfull 
>>         devices.  The idea is that the mon or mgr would do this
>>      based on some 
>>         heuristics or policy and should result in a better
>>      distribution than teh 
>>         current osd weight adjustments we make now with
>>      reweight-by-utilization.
>> 
>>         The other key property is that one reason why we need
>>      as many PGs as we do 
>>         now is to get a good balance; if we can remap some of
>>      them explicitly, we 
>>         can get a better balance with fewer.  In essense, CRUSH
>>      gives an 
>>         approximate distribution, and then we correct to make
>>      it perfect (or close 
>>         to it).
>> 
>>         The main challenge is less about figuring out when/how
>>      to remap PGs to 
>>         correct balance, but figuring out when to remove those
>>      remappings after 
>>         CRUSH map changes.  Some simple greedy strategies are
>>      obvious starting 
>>         points (e.g., to move PGs off OSD X, first adjust or
>>      remove existing remap 
>>         entries targetting OSD X before adding new ones), but
>>      there are a few 
>>         ways we could structure the remap entries themselves so
>>      that they 
>>         more gracefully disappear after a change.
>> 
>>         For example, a remap entry might move a PG from OSD A
>>      to B if it maps to 
>>         A; if the CRUSH topology changes and the PG no longer
>>      maps to A, the entry 
>>         would be removed or ignored.  There are a few ways to
>>      do this in the pad; 
>>         I'm sure there are other options.
>> 
>>         I put this on the agenda for CDM tonight.  If anyone
>>      has any other ideas 
>>         about this we'd love to hear them!
>> 
>>         sage
>>         --
>>         To unsubscribe from this list: send the line
>>      "unsubscribe ceph-devel" in
>>         the body of a message to majordomo@vger.kernel.org
>>         More majordomo info at
>>       http://vger.kernel.org/majordomo-info.html
>> 
>> 
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
>> 



* Re: explicitly mapping pgs in OSDMap
  2017-03-02 18:09       ` Sage Weil
  2017-03-02 19:40         ` Tiger Hu
@ 2017-03-03  9:20         ` Bartłomiej Święcki
  2017-03-03 14:55           ` Sage Weil
  1 sibling, 1 reply; 13+ messages in thread
From: Bartłomiej Święcki @ 2017-03-03  9:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

The more I think about explicit-only mapping (so clients and OSDs
not doing CRUSH computations anymore), the more interesting
use cases I can see -- better balance being only one of them.

Another major advantage I see is that the client would become independent
of internal CRUSH calculation changes (straw->straw2 was a bit
problematic on our production clusters, for example), which I believe
would also simplify keeping backward compatibility, especially for the
kernel stuff. It also looks like the code would become simpler.

Such explicit mapping could also help with all kinds of cluster
reconfiguration -- e.g., growing the cluster by a significant number of
OSDs could be spread over time to reduce backfill impact, and the same
goes for increasing pgp_num. I also believe peering could gain here too,
because the mon/mgr could ensure that only a limited number of PGs is
blocked in peering at the same time.

About memory overhead - I agree that it is the most important factor to
consider. I don't know the Ceph internals well enough yet to be
authoritative here, but I believe that in the case of a cluster rebalance,
the amount of data that has to be maintained can already grow way beyond
what explicit mapping would need. Is it possible that such an explicit map
would also store the list of potential OSDs as long as the PG is not fully
recovered? That would mean the osdmap history could be trimmed much
faster, it would be easier to predict mon resource requirements, and
monitors would sync much faster during recovery.
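
For a rough sense of scale, a back-of-the-envelope estimate (the numbers
below are assumptions for illustration, not measurements):

    // Assumed figures: ~100k PGs, 20% of them explicitly remapped, and
    // roughly 24 bytes per entry (8-byte pg key, two 4-byte OSD ids, plus
    // some container overhead).  That puts the extra OSDMap state in the
    // hundreds-of-kilobytes range.
    constexpr unsigned long total_pgs       = 100000;
    constexpr unsigned long remapped_pgs    = total_pgs / 5;               // 20%
    constexpr unsigned long bytes_per_entry = 24;
    constexpr unsigned long approx_bytes    = remapped_pgs * bytes_per_entry;  // ~480 KB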

Regards,
Bartek


W dniu 02.03.2017 o 19:09, Sage Weil pisze:
> On Fri, 3 Mar 2017, Tiger Hu wrote:
>> Hi Sage,
>> I am very glad to know you raised this issue. In some cases, user may want
>> to accurately control the PG numbers in every OSDs. Is it possible to
>> add/implement a new policy to support fixed-mapping? This may be useful for
>> performance tuning or test purpose. Thanks.
> In principle this could be used to map every pg explicitly without regard
> for CRUSH.  In think in practice, though, we can have CRUSH map 80-90% of
> the PGs, and then have explicit mappings for 10-20% in order to achieve
> the same "perfect" balance of PGs.
>
> The result is a smaller OSDMap.  It might not matter that much for small
> clusters, but for large clusters, it is helpful to keep the OSDMap small.
>
> OTOH, maybe through all of this we end up in a place where the OSDMap is
> just an explicit map and the mgr agent that's managing it is using CRUSH
> to generate it; who knows!  That might make for faster mappings on the
> clients since it's a table lookup instead of a mapping calculation.
>
> I'm thinking that if we implement the ability to have the explicit
> mappings in the OSDMap we open up both possibilities.  The hard part is
> always getting the mapping compatbility into the client, so if we do that
> right, luminous+ clients will support explicit mapping (overrides) in
> OSDMap and would be able to work with future versions that go all-in on
> explicit...
>
> sage
>
>
>   >
>> Tiger
>>        On Mar 3, 2017, at 1:40 AM, Sage Weil <sweil@redhat.com> wrote:
>>
>> On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
>>        Hi Sage,
>>         The crush algorithm handles mapping of pgs, and it will
>>        even with the
>>        addition of explicit mappings. I presume, finding which
>>        pgs belong to
>>        which OSDs will involve addition computation for each
>>        additional
>>        explicit mapping.
>>
>>        What would be penalty of this additional computation?
>>
>>        For small number of explicit mappings such penalty would
>>        be small,
>>        IMO it can get quite expensive with large number of
>>        explicit mappings.
>>        The implementation will need to manage the count of
>>        explicit mappings,
>>        by reverting some of the explicit mappings as the
>>        distribution changes.
>>        The understanding of additional overhead of the explicit
>>        mappings would
>>        had great influence on the implementation.
>>
>>
>> Yeah, we'll want to be careful with the implementation, e.g., by using
>> an
>> unordered_map (hash table) for looking up the mappings (an rbtree
>> won't
>> scale particularly well).
>>
>> Note that pg_temp is already a map<pg_t,vector<int>> and can get big
>> when
>> you're doing a lot of rebalancing, resulting on O(log n) lookups.  If
>> we
>> do something optimized here we should use the same strategy for
>> pg_temp
>> too.
>>
>> sage
>>
>>
>>
>>        Nitin
>>
>>
>>
>>
>>        On 3/1/17, 11:44 AM, "ceph-devel-owner@vger.kernel.org on
>>        behalf of Sage Weil" <ceph-devel-owner@vger.kernel.org on
>>        behalf of sweil@redhat.com> wrote:
>>
>>           There's been a longstanding desire to improve the
>>        balance of PGs and data
>>           across OSDs to better utilize storage and balance
>>        workload.  We had a few
>>           ideas about this in a meeting last week and I wrote up
>>        a summary/proposal
>>           here:
>>
>>            http://pad.ceph.com/p/osdmap-explicit-mapping
>>
>>           The basic idea is to have the ability to explicitly map
>>        individual PGs
>>           to certain OSDs so that we can move PGs from overfull
>>        to underfull
>>           devices.  The idea is that the mon or mgr would do this
>>        based on some
>>           heuristics or policy and should result in a better
>>        distribution than teh
>>           current osd weight adjustments we make now with
>>        reweight-by-utilization.
>>
>>           The other key property is that one reason why we need
>>        as many PGs as we do
>>           now is to get a good balance; if we can remap some of
>>        them explicitly, we
>>           can get a better balance with fewer.  In essense, CRUSH
>>        gives an
>>           approximate distribution, and then we correct to make
>>        it perfect (or close
>>           to it).
>>
>>           The main challenge is less about figuring out when/how
>>        to remap PGs to
>>           correct balance, but figuring out when to remove those
>>        remappings after
>>           CRUSH map changes.  Some simple greedy strategies are
>>        obvious starting
>>           points (e.g., to move PGs off OSD X, first adjust or
>>        remove existing remap
>>           entries targetting OSD X before adding new ones), but
>>        there are a few
>>           ways we could structure the remap entries themselves so
>>        that they
>>           more gracefully disappear after a change.
>>
>>           For example, a remap entry might move a PG from OSD A
>>        to B if it maps to
>>           A; if the CRUSH topology changes and the PG no longer
>>        maps to A, the entry
>>           would be removed or ignored.  There are a few ways to
>>        do this in the pad;
>>           I'm sure there are other options.
>>
>>           I put this on the agenda for CDM tonight.  If anyone
>>        has any other ideas
>>           about this we'd love to hear them!
>>
>>           sage
>>           --
>>           To unsubscribe from this list: send the line
>>        "unsubscribe ceph-devel" in
>>           the body of a message to majordomo@vger.kernel.org
>>           More majordomo info at
>>         http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>



* Re: explicitly mapping pgs in OSDMap
  2017-03-03  9:20         ` Bartłomiej Święcki
@ 2017-03-03 14:55           ` Sage Weil
  0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-03 14:55 UTC (permalink / raw)
  To: Bartłomiej Święcki; +Cc: ceph-devel


On Fri, 3 Mar 2017, Bartłomiej Święcki wrote:
> Hi Sage,
> 
> The more I think about explicit-only mapping (so clients and OSDs
> not doing CRUSH computations anymore) the more interesting
> use cases I can see, better balance being only one of them.
> 
> Other major advantage I see is that Client would become independent
> of internal CRUSH calculation changes (straw->straw2 was a bit
> problematic on our production for example) which I believe would
> also simplify keeping backward compatibility, especially for the
> kernel stuff. Also it looks like the code would simplify.
> 
> Such explicit mapping could also help with all kinds of cluster
> reconfiguration stuff - i.e. growing cluster by significant amount of
> OSDs could be spread over time to reduce backfill impact, same with
> increasing pgp_num. I also believe peering could gain here too
> because mon/mgr could ensure only limited number of PGs is blocked
> peering at the same time.

Yeah
 
> About memory overhead - I agree that it is the most important factor
> to consider. I don't know the ceph internals well enough yet to be
> authoritative here, but I believe that during a cluster rebalance the
> amount of data that has to be maintained can already grow way beyond
> what explicit mapping would need. Is it possible that such an
> explicit map would also store the list of potential OSDs for as long
> as the PG is not fully recovered? That would mean the osdmap history
> could be trimmed much faster, it would be easier to predict mon
> resource requirements, and monitors would sync much faster during
> recovery.

Keeping the old osdmaps around is a somewhat orthogonal problem, and if we 
address it I suspect it will be with a different solution.  There is a 
'past intervals' PR in flight that makes the OSDs' tracking of this more 
efficient, and I suspect we could come up with something that would let us 
trim sooner than we currently do.  It'll take some careful planning, 
though!

sage


> 
> Regards,
> Bartek
> 
> 
> On 02.03.2017 at 19:09, Sage Weil wrote:
> > On Fri, 3 Mar 2017, Tiger Hu wrote:
> > > Hi Sage,
> > > I am very glad to know you raised this issue. In some cases, users
> > > may want to accurately control the PG numbers on every OSD. Is it
> > > possible to add/implement a new policy to support fixed mapping?
> > > This may be useful for performance tuning or test purposes. Thanks.
> > In principle this could be used to map every pg explicitly without regard
> > for CRUSH.  I think in practice, though, we can have CRUSH map 80-90% of
> > the PGs, and then have explicit mappings for 10-20% in order to achieve
> > the same "perfect" balance of PGs.
> > 
> > The result is a smaller OSDMap.  It might not matter that much for small
> > clusters, but for large clusters, it is helpful to keep the OSDMap small.
> > 
> > OTOH, maybe through all of this we end up in a place where the OSDMap is
> > just an explicit map and the mgr agent that's managing it is using CRUSH
> > to generate it; who knows!  That might make for faster mappings on the
> > clients since it's a table lookup instead of a mapping calculation.
> > 
> > I'm thinking that if we implement the ability to have the explicit
> > mappings in the OSDMap we open up both possibilities.  The hard part is
> > always getting the mapping compatibility into the client, so if we do that
> > right, luminous+ clients will support explicit mapping (overrides) in
> > OSDMap and would be able to work with future versions that go all-in on
> > explicit...
> > 
> > sage
> > 
> > 
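To make the table-lookup idea above concrete, here is a minimal sketch of a
client-side lookup that consults the explicit overrides first and falls back
to CRUSH for everything else.  All of the names (pg_key, explicit_pg_map,
crush_map_pg) are hypothetical stand-ins for illustration, not existing Ceph
code:

    #include <cstdint>
    #include <map>
    #include <tuple>
    #include <vector>

    // Hypothetical PG identity; the real pg_t carries more than this.
    struct pg_key {
      uint64_t pool;
      uint32_t seed;
      bool operator<(const pg_key& o) const {
        return std::tie(pool, seed) < std::tie(o.pool, o.seed);
      }
    };

    // Stand-in for the normal CRUSH calculation.
    std::vector<int> crush_map_pg(const pg_key& pg) {
      return {};  // placeholder; the real code would run CRUSH here
    }

    // Explicit overrides carried in the OSDMap: only the remapped
    // 10-20% of PGs appear here, so the table stays small.
    std::map<pg_key, std::vector<int>> explicit_pg_map;

    // Client-side mapping: a table lookup first, CRUSH as the fallback.
    std::vector<int> map_pg(const pg_key& pg) {
      auto it = explicit_pg_map.find(pg);
      if (it != explicit_pg_map.end())
        return it->second;        // explicit mapping wins
      return crush_map_pg(pg);    // otherwise use the CRUSH result
    }

With this shape, removing an override is just erasing the entry, after which
the PG falls back to whatever CRUSH currently computes.
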
> > >
> > > Tiger
> > >        On March 3, 2017, at 1:40 AM, Sage Weil <sweil@redhat.com> wrote:
> > > 
> > > On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
> > >        Hi Sage,
> > >         The CRUSH algorithm handles the mapping of PGs, and it will
> > >        continue to do so even with the addition of explicit mappings.
> > >        I presume finding which PGs belong to which OSDs will involve
> > >        additional computation for each additional explicit mapping.
> > > 
> > >        What would be the penalty of this additional computation?
> > > 
> > >        For a small number of explicit mappings such a penalty would
> > >        be small, but IMO it can get quite expensive with a large
> > >        number of explicit mappings. The implementation will need to
> > >        manage the count of explicit mappings by reverting some of
> > >        them as the distribution changes. Understanding the additional
> > >        overhead of the explicit mappings would have a great influence
> > >        on the implementation.
> > > 
> > > 
> > > Yeah, we'll want to be careful with the implementation, e.g., by
> > > using an unordered_map (hash table) for looking up the mappings (an
> > > rbtree won't scale particularly well).
> > > 
> > > Note that pg_temp is already a map<pg_t,vector<int>> and can get
> > > big when you're doing a lot of rebalancing, resulting in O(log n)
> > > lookups.  If we do something optimized here we should use the same
> > > strategy for pg_temp too.
> > > 
> > > sage
> > > 
> > > 
> > > 
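To make the container point above concrete, here is a small sketch of giving
the PG key a hash so that both the explicit mappings and a pg_temp-style
table can use an unordered_map rather than an rbtree-backed std::map.  The
pg_key type and pg_key_hash functor are made-up illustrations, not the actual
pg_t or its hash:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical PG key; the real pg_t is richer than this.
    struct pg_key {
      uint64_t pool;
      uint32_t seed;
      bool operator==(const pg_key& o) const {
        return pool == o.pool && seed == o.seed;
      }
    };

    // A simple hash so the tables can be hash maps: average O(1)
    // lookups instead of the O(log n) walks of an rbtree-backed std::map.
    struct pg_key_hash {
      size_t operator()(const pg_key& k) const {
        return std::hash<uint64_t>()(k.pool) ^
               (std::hash<uint64_t>()(k.seed) * 0x9e3779b97f4a7c15ULL);
      }
    };

    // The same container strategy could serve both the explicit
    // mappings and a pg_temp-style table.
    std::unordered_map<pg_key, std::vector<int>, pg_key_hash> explicit_pg_map;
    std::unordered_map<pg_key, std::vector<int>, pg_key_hash> pg_temp;

Lookups then stay roughly constant-time even when a lot of rebalancing
inflates the tables.
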
> > >        Nitin

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-03-03 15:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
2017-03-01 20:49 ` Dan van der Ster
2017-03-01 22:10   ` Sage Weil
2017-03-01 23:18     ` Allen Samuels
2017-03-02  3:42     ` Xiaoxi Chen
2017-03-02  3:09 ` Matthew Sedam
2017-03-02  6:17   ` Sage Weil
2017-03-02 17:33 ` Kamble, Nitin A
2017-03-02 17:40   ` Sage Weil
     [not found]     ` <F6F90D3B-5362-48D1-B786-2191E5B98331@gmail.com>
2017-03-02 18:09       ` Sage Weil
2017-03-02 19:40         ` Tiger Hu
2017-03-03  9:20         ` Bartłomiej Święcki
2017-03-03 14:55           ` Sage Weil
