* explicitly mapping pgs in OSDMap
@ 2017-03-01 19:44 Sage Weil
2017-03-01 20:49 ` Dan van der Ster
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-01 19:44 UTC (permalink / raw)
To: ceph-devel
There's been a longstanding desire to improve the balance of PGs and data
across OSDs to better utilize storage and balance workload. We had a few
ideas about this in a meeting last week and I wrote up a summary/proposal
here:
http://pad.ceph.com/p/osdmap-explicit-mapping
The basic idea is to have the ability to explicitly map individual PGs
to certain OSDs so that we can move PGs from overfull to underfull
devices. The idea is that the mon or mgr would do this based on some
heuristics or policy and should result in a better distribution than the
osd weight adjustments we currently make with reweight-by-utilization.
The other key property is that one reason we need as many PGs as we do
now is to get a good balance; if we can remap some of them explicitly, we
can get a better balance with fewer. In essence, CRUSH gives an
approximate distribution, and then we correct it to make it perfect (or close
to it).
The main challenge is less about figuring out when/how to remap PGs to
correct balance than about figuring out when to remove those remappings after
CRUSH map changes. Some simple greedy strategies are obvious starting
points (e.g., to move PGs off OSD X, first adjust or remove existing remap
entries targeting OSD X before adding new ones), but there are a few
ways we could structure the remap entries themselves so that they
more gracefully disappear after a change.
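A removal-first greedy pass could be sketched roughly like this (hypothetical names and data structure; this is not actual Ceph code, just an illustration of the idea):

```cpp
#include <map>

// Hypothetical remap table: pg id -> explicit target OSD.
using PgId = int;
using RemapTable = std::map<PgId, int>;

// To relieve an overfull OSD X, first delete any remap entries that
// currently *target* X; only afterwards would new entries be added to
// move additional PGs elsewhere.  Returns the number of entries dropped.
int drop_remaps_targeting(RemapTable& remaps, int overfull_osd) {
    int removed = 0;
    for (auto it = remaps.begin(); it != remaps.end(); ) {
        if (it->second == overfull_osd) {
            it = remaps.erase(it);   // erase returns the next iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```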
For example, a remap entry might move a PG from OSD A to B if it maps to
A; if the CRUSH topology changes and the PG no longer maps to A, the entry
would be removed or ignored. There are a few ways to do this in the pad;
I'm sure there are other options.
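One way to picture such a self-retiring entry (purely illustrative; the struct and function below are not from the Ceph tree):

```cpp
#include <vector>
#include <algorithm>

// Hypothetical conditional remap entry: replace OSD `from` with `to`,
// but only while CRUSH still places the PG on `from`.  If a topology
// change means the PG no longer maps to `from`, the entry becomes a
// no-op and can be garbage-collected.
struct RemapEntry {
    int from, to;
};

// Apply an entry to the up set CRUSH produced.  Returns true if the
// entry matched (i.e., it is still live); false means it should be
// removed or ignored.
bool apply_remap(std::vector<int>& up_set, const RemapEntry& e) {
    auto it = std::find(up_set.begin(), up_set.end(), e.from);
    if (it == up_set.end())
        return false;          // topology changed; entry is stale
    *it = e.to;
    return true;
}
```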
I put this on the agenda for CDM tonight. If anyone has any other ideas
about this we'd love to hear them!
sage
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: explicitly mapping pgs in OSDMap
2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
@ 2017-03-01 20:49 ` Dan van der Ster
2017-03-01 22:10 ` Sage Weil
2017-03-02 3:09 ` Matthew Sedam
2017-03-02 17:33 ` Kamble, Nitin A
2 siblings, 1 reply; 13+ messages in thread
From: Dan van der Ster @ 2017-03-01 20:49 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
> There's been a longstanding desire to improve the balance of PGs and data
> across OSDs to better utilize storage and balance workload. We had a few
> ideas about this in a meeting last week and I wrote up a summary/proposal
> here:
>
> http://pad.ceph.com/p/osdmap-explicit-mapping
>
> The basic idea is to have the ability to explicitly map individual PGs
> to certain OSDs so that we can move PGs from overfull to underfull
> devices. The idea is that the mon or mgr would do this based on some
> heuristics or policy and should result in a better distribution than teh
> current osd weight adjustments we make now with reweight-by-utilization.
>
> The other key property is that one reason why we need as many PGs as we do
> now is to get a good balance; if we can remap some of them explicitly, we
> can get a better balance with fewer. In essense, CRUSH gives an
> approximate distribution, and then we correct to make it perfect (or close
> to it).
>
> The main challenge is less about figuring out when/how to remap PGs to
> correct balance, but figuring out when to remove those remappings after
> CRUSH map changes. Some simple greedy strategies are obvious starting
> points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> entries targetting OSD X before adding new ones), but there are a few
> ways we could structure the remap entries themselves so that they
> more gracefully disappear after a change.
>
> For example, a remap entry might move a PG from OSD A to B if it maps to
> A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> would be removed or ignored. There are a few ways to do this in the pad;
> I'm sure there are other options.
>
> I put this on the agenda for CDM tonight. If anyone has any other ideas
> about this we'd love to hear them!
>
Hi Sage. This would be awesome! Seriously, it would let us run multi-PB
clusters closer to full -- 1% of imbalance on a 100 PB
cluster is *expensive*!!
I can't join the meeting, but here are my first thoughts on the implementation.
First, since this is a new feature, it would be cool if it supported
non-trivial topologies from the beginning -- i.e. the trivial topology
is when all OSDs should have equal PGs/weight. A non-trivial topology
is where only OSDs beneath a user-defined crush bucket are balanced.
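For what it's worth, the per-subtree target is just a weight-proportional share; a toy sketch (illustrative names only, not a real interface):

```cpp
#include <map>
#include <string>

// Sketch: compute the ideal PG count for each OSD beneath one crush
// bucket, proportional to its weight.  Balancing a non-trivial
// topology would run this per user-defined bucket rather than once
// for the whole cluster.
std::map<std::string, double>
ideal_pgs_in_subtree(const std::map<std::string, double>& osd_weights,
                     int pgs_in_subtree) {
    double total = 0;
    for (const auto& [osd, w] : osd_weights)
        total += w;
    std::map<std::string, double> ideal;
    for (const auto& [osd, w] : osd_weights)
        ideal[osd] = pgs_in_subtree * (w / total);
    return ideal;
}
```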
And something I didn't understand about the remap format options --
when the cluster topology changes, couldn't we just remove all
remappings and start again? If you use some consistent method for the
remaps, shouldn't the result remain similar anyway after incremental
topology changes?
In the worst case, remaps will only affect maybe 10-20% of PGs. Clearly
we don't want to shuffle those for small topology changes, but for
larger changes perhaps we can accept that they will move.
-- Dan
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: explicitly mapping pgs in OSDMap
2017-03-01 20:49 ` Dan van der Ster
@ 2017-03-01 22:10 ` Sage Weil
2017-03-01 23:18 ` Allen Samuels
2017-03-02 3:42 ` Xiaoxi Chen
0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-01 22:10 UTC (permalink / raw)
To: Dan van der Ster; +Cc: ceph-devel
On Wed, 1 Mar 2017, Dan van der Ster wrote:
> On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > There's been a longstanding desire to improve the balance of PGs and data
> > across OSDs to better utilize storage and balance workload. We had a few
> > ideas about this in a meeting last week and I wrote up a summary/proposal
> > here:
> >
> > http://pad.ceph.com/p/osdmap-explicit-mapping
> >
> > The basic idea is to have the ability to explicitly map individual PGs
> > to certain OSDs so that we can move PGs from overfull to underfull
> > devices. The idea is that the mon or mgr would do this based on some
> > heuristics or policy and should result in a better distribution than teh
> > current osd weight adjustments we make now with reweight-by-utilization.
> >
> > The other key property is that one reason why we need as many PGs as we do
> > now is to get a good balance; if we can remap some of them explicitly, we
> > can get a better balance with fewer. In essense, CRUSH gives an
> > approximate distribution, and then we correct to make it perfect (or close
> > to it).
> >
> > The main challenge is less about figuring out when/how to remap PGs to
> > correct balance, but figuring out when to remove those remappings after
> > CRUSH map changes. Some simple greedy strategies are obvious starting
> > points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> > entries targetting OSD X before adding new ones), but there are a few
> > ways we could structure the remap entries themselves so that they
> > more gracefully disappear after a change.
> >
> > For example, a remap entry might move a PG from OSD A to B if it maps to
> > A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> > would be removed or ignored. There are a few ways to do this in the pad;
> > I'm sure there are other options.
> >
> > I put this on the agenda for CDM tonight. If anyone has any other ideas
> > about this we'd love to hear them!
> >
>
> Hi Sage. This would be awesome! Seriously, it would let us run multi
> PB clusters closer to being full -- 1% of imbalance on a 100 PB
> cluster is *expensive* !!
>
> I can't join the meeting, but here are my first thoughts on the implementation.
>
> First, since this is a new feature, it would be cool if it supported
> non-trivial topologies from the beginning -- i.e. the trivial topology
> is when all OSDs should have equal PGs/weight. A non-trivial topology
> is where only OSDs beneath a user-defined crush bucket are balanced.
>
> And something I didn't understand about the remap format options --
> when the cluster topology changes, couldn't we just remove all
> remappings and start again? If you use some consistent method for the
> remaps, shouldn't the result anyway remain similar after incremental
> topology changes?
>
> In the worst case, remaps will only form maybe 10-20% of PGs. Clearly
> we don't want to shuffle those for small topology changes, but for
> larger changes perhaps we can accept those to move.
If there is a small change, then we don't want to toss the mappings, but
if there is a large change we do; which ones we toss should
probably depend on whether the original mapping they were overriding
changed.
The other hard part of this, now that I think about it, is that you have
placement constraints encoded into the CRUSH rules (e.g., separate
replicas across racks). Whatever is installing new mappings needs to
understand those constraints so that the remap entries also respect the
policy. That means either we need to (1) understand the CRUSH rule
constraints, or (2) encode the (simplified?) set of placement constraints
in the Ceph OSDMap and auto-generate the CRUSH rule from that.
For (1), I bet we can make a simple "CRUSH rule interpreter" that, instead
of making pseudorandom choices, picks each child based on the minimum
utilization (or the original mapping value) in order to generate the remap
entries...
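As a rough illustration of the deterministic choice step (hypothetical helper, not a real CRUSH interface):

```cpp
#include <vector>
#include <cstddef>

// Sketch of the "CRUSH rule interpreter" idea: instead of a
// pseudorandom draw among a bucket's children, pick the child with
// the lowest current utilization.  Utilizations are fractions of
// fullness here; the function is illustrative only.
std::size_t pick_least_utilized(const std::vector<double>& utilization) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < utilization.size(); ++i)
        if (utilization[i] < utilization[best])
            best = i;
    return best;
}
```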
sage
* RE: explicitly mapping pgs in OSDMap
2017-03-01 22:10 ` Sage Weil
@ 2017-03-01 23:18 ` Allen Samuels
2017-03-02 3:42 ` Xiaoxi Chen
1 sibling, 0 replies; 13+ messages in thread
From: Allen Samuels @ 2017-03-01 23:18 UTC (permalink / raw)
To: Sage Weil, Dan van der Ster; +Cc: ceph-devel
> -----Original Message-----
> From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-
> owner@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Wednesday, March 01, 2017 2:11 PM
> To: Dan van der Ster <dan@vanderster.com>
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: explicitly mapping pgs in OSDMap
>
> On Wed, 1 Mar 2017, Dan van der Ster wrote:
> > On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > > There's been a longstanding desire to improve the balance of PGs and
> > > data across OSDs to better utilize storage and balance workload. We
> > > had a few ideas about this in a meeting last week and I wrote up a
> > > summary/proposal
> > > here:
> > >
> > > http://pad.ceph.com/p/osdmap-explicit-mapping
> > >
> > > The basic idea is to have the ability to explicitly map individual
> > > PGs to certain OSDs so that we can move PGs from overfull to
> > > underfull devices. The idea is that the mon or mgr would do this
> > > based on some heuristics or policy and should result in a better
> > > distribution than teh current osd weight adjustments we make now with
> reweight-by-utilization.
> > >
> > > The other key property is that one reason why we need as many PGs as
> > > we do now is to get a good balance; if we can remap some of them
> > > explicitly, we can get a better balance with fewer. In essense,
> > > CRUSH gives an approximate distribution, and then we correct to make
> > > it perfect (or close to it).
> > >
> > > The main challenge is less about figuring out when/how to remap PGs
> > > to correct balance, but figuring out when to remove those remappings
> > > after CRUSH map changes. Some simple greedy strategies are obvious
> > > starting points (e.g., to move PGs off OSD X, first adjust or remove
> > > existing remap entries targetting OSD X before adding new ones), but
> > > there are a few ways we could structure the remap entries themselves
> > > so that they more gracefully disappear after a change.
> > >
> > > For example, a remap entry might move a PG from OSD A to B if it
> > > maps to A; if the CRUSH topology changes and the PG no longer maps
> > > to A, the entry would be removed or ignored. There are a few ways
> > > to do this in the pad; I'm sure there are other options.
> > >
> > > I put this on the agenda for CDM tonight. If anyone has any other
> > > ideas about this we'd love to hear them!
> > >
> >
> > Hi Sage. This would be awesome! Seriously, it would let us run multi
> > PB clusters closer to being full -- 1% of imbalance on a 100 PB
> > cluster is *expensive* !!
Whoa there. This will reduce the variance in fullness across OSDs, which is a really good thing.
However, you will still see a dropoff in individual OSD performance as it fills up. Nothing here addresses that.
> >
> > I can't join the meeting, but here are my first thoughts on the
> implementation.
> >
> > First, since this is a new feature, it would be cool if it supported
> > non-trivial topologies from the beginning -- i.e. the trivial topology
> > is when all OSDs should have equal PGs/weight. A non-trivial topology
> > is where only OSDs beneath a user-defined crush bucket are balanced.
> >
> > And something I didn't understand about the remap format options --
> > when the cluster topology changes, couldn't we just remove all
> > remappings and start again? If you use some consistent method for the
> > remaps, shouldn't the result anyway remain similar after incremental
> > topology changes?
> >
> > In the worst case, remaps will only form maybe 10-20% of PGs. Clearly
> > we don't want to shuffle those for small topology changes, but for
> > larger changes perhaps we can accept those to move.
>
> If there is a small change, then we don't want to toss the mappings.. but if
> there is a large change you do; which ones you toss should probably depend
> on whether the original mapping they were overriding changed.
>
> The other hard part of this, now that I think about it, is that you have
> placement constraints encoded into the CRUSH rules (e.g., separate replicas
> across racks). Whatever is installing new mappings needs to understand
> those constraints so tha the remap entries also respect the policy. That
> means either we need to (1) understand the CRUSH rule constraints, or (2)
> encode the (simplified?) set of placement constraints in the Ceph OSDMap
> and auto-generate the CRUSH rule from that.
>
> For (1), I bet we can make a simple "CRUSH rule intepreter" that, instead of
> making pseudorandom choices, pick each child based on the minimum
> utilization (or the original mapping value) in order to generate the remap
> entries...
>
> sage
* Re: explicitly mapping pgs in OSDMap
2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
2017-03-01 20:49 ` Dan van der Ster
@ 2017-03-02 3:09 ` Matthew Sedam
2017-03-02 6:17 ` Sage Weil
2017-03-02 17:33 ` Kamble, Nitin A
2 siblings, 1 reply; 13+ messages in thread
From: Matthew Sedam @ 2017-03-02 3:09 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Sage,
Hi! I am a potential GSOC 2017 student, and I am interested in the
Ceph-mgr: Smarter Reweight_by_Utilization project. However, when
reading this I wondered if this proposed idea would make my GSOC
project effectively null and void. Could you elaborate on this?
Matthew Sedam
On Wed, Mar 1, 2017 at 1:44 PM, Sage Weil <sweil@redhat.com> wrote:
> There's been a longstanding desire to improve the balance of PGs and data
> across OSDs to better utilize storage and balance workload. We had a few
> ideas about this in a meeting last week and I wrote up a summary/proposal
> here:
>
> http://pad.ceph.com/p/osdmap-explicit-mapping
>
> The basic idea is to have the ability to explicitly map individual PGs
> to certain OSDs so that we can move PGs from overfull to underfull
> devices. The idea is that the mon or mgr would do this based on some
> heuristics or policy and should result in a better distribution than teh
> current osd weight adjustments we make now with reweight-by-utilization.
>
> The other key property is that one reason why we need as many PGs as we do
> now is to get a good balance; if we can remap some of them explicitly, we
> can get a better balance with fewer. In essense, CRUSH gives an
> approximate distribution, and then we correct to make it perfect (or close
> to it).
>
> The main challenge is less about figuring out when/how to remap PGs to
> correct balance, but figuring out when to remove those remappings after
> CRUSH map changes. Some simple greedy strategies are obvious starting
> points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> entries targetting OSD X before adding new ones), but there are a few
> ways we could structure the remap entries themselves so that they
> more gracefully disappear after a change.
>
> For example, a remap entry might move a PG from OSD A to B if it maps to
> A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> would be removed or ignored. There are a few ways to do this in the pad;
> I'm sure there are other options.
>
> I put this on the agenda for CDM tonight. If anyone has any other ideas
> about this we'd love to hear them!
>
> sage
* Re: explicitly mapping pgs in OSDMap
2017-03-01 22:10 ` Sage Weil
2017-03-01 23:18 ` Allen Samuels
@ 2017-03-02 3:42 ` Xiaoxi Chen
1 sibling, 0 replies; 13+ messages in thread
From: Xiaoxi Chen @ 2017-03-02 3:42 UTC (permalink / raw)
To: Sage Weil; +Cc: Dan van der Ster, ceph-devel
It looks like the mappings need to be generated automatically by some
external/internal tool, not by the administrator; they could do it by hand, but I
don't think a human can properly balance thousands of PGs across thousands of
OSDs correctly.
Assuming we have such a tool, can we simplify the "remove" problem to "regenerate" the
mappings? I.e., when the osdmap changes, all previous mappings are cleared, the crush
rule applies and the pgmap is generated, then the **balance tool** jumps in and
generates new mappings, and we end up with a *balanced* pgmap.
> That means either we need to (1) understand the CRUSH rule
constraints, or (2) encode the (simplified?) set of placement constraints
in the Ceph OSDMap and auto-generate the CRUSH rule from that.
Would it be simpler to just extend the CrushWrapper to have a
validate_pg_map(crush *map, vector<int> pg_map)? It would hack the "random"
logic in crush_choose and see if crush can generate the same mapping as the
wanted mapping.
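Something like the following check is what I have in mind for the failure-domain part (sketch only; no such function exists in CrushWrapper today, and the domain map is a simplification):

```cpp
#include <vector>
#include <map>
#include <set>

// Sketch of a validate_pg_map()-style check.  Given a candidate
// acting set and a map from OSD -> failure-domain id (e.g. rack id),
// verify that no two replicas land in the same failure domain.
bool respects_failure_domains(const std::vector<int>& acting,
                              const std::map<int, int>& osd_to_domain) {
    std::set<int> seen;
    for (int osd : acting) {
        auto it = osd_to_domain.find(osd);
        if (it == osd_to_domain.end())
            return false;                    // unknown OSD
        if (!seen.insert(it->second).second)
            return false;                    // two replicas in one domain
    }
    return true;
}
```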
Xiaoxi
2017-03-02 6:10 GMT+08:00 Sage Weil <sweil@redhat.com>:
> On Wed, 1 Mar 2017, Dan van der Ster wrote:
>> On Wed, Mar 1, 2017 at 8:44 PM, Sage Weil <sweil@redhat.com> wrote:
>> > There's been a longstanding desire to improve the balance of PGs and data
>> > across OSDs to better utilize storage and balance workload. We had a few
>> > ideas about this in a meeting last week and I wrote up a summary/proposal
>> > here:
>> >
>> > http://pad.ceph.com/p/osdmap-explicit-mapping
>> >
>> > The basic idea is to have the ability to explicitly map individual PGs
>> > to certain OSDs so that we can move PGs from overfull to underfull
>> > devices. The idea is that the mon or mgr would do this based on some
>> > heuristics or policy and should result in a better distribution than teh
>> > current osd weight adjustments we make now with reweight-by-utilization.
>> >
>> > The other key property is that one reason why we need as many PGs as we do
>> > now is to get a good balance; if we can remap some of them explicitly, we
>> > can get a better balance with fewer. In essense, CRUSH gives an
>> > approximate distribution, and then we correct to make it perfect (or close
>> > to it).
>> >
>> > The main challenge is less about figuring out when/how to remap PGs to
>> > correct balance, but figuring out when to remove those remappings after
>> > CRUSH map changes. Some simple greedy strategies are obvious starting
>> > points (e.g., to move PGs off OSD X, first adjust or remove existing remap
>> > entries targetting OSD X before adding new ones), but there are a few
>> > ways we could structure the remap entries themselves so that they
>> > more gracefully disappear after a change.
>> >
>> > For example, a remap entry might move a PG from OSD A to B if it maps to
>> > A; if the CRUSH topology changes and the PG no longer maps to A, the entry
>> > would be removed or ignored. There are a few ways to do this in the pad;
>> > I'm sure there are other options.
>> >
>> > I put this on the agenda for CDM tonight. If anyone has any other ideas
>> > about this we'd love to hear them!
>> >
>>
>> Hi Sage. This would be awesome! Seriously, it would let us run multi
>> PB clusters closer to being full -- 1% of imbalance on a 100 PB
>> cluster is *expensive* !!
>>
>> I can't join the meeting, but here are my first thoughts on the implementation.
>>
>> First, since this is a new feature, it would be cool if it supported
>> non-trivial topologies from the beginning -- i.e. the trivial topology
>> is when all OSDs should have equal PGs/weight. A non-trivial topology
>> is where only OSDs beneath a user-defined crush bucket are balanced.
>>
>> And something I didn't understand about the remap format options --
>> when the cluster topology changes, couldn't we just remove all
>> remappings and start again? If you use some consistent method for the
>> remaps, shouldn't the result anyway remain similar after incremental
>> topology changes?
>>
>> In the worst case, remaps will only form maybe 10-20% of PGs. Clearly
>> we don't want to shuffle those for small topology changes, but for
>> larger changes perhaps we can accept those to move.
>
> If there is a small change, then we don't want to toss the mappings.. but
> if there is a large change you do; which ones you toss should
> probably depend on whether the original mapping they were overriding
> changed.
>
> The other hard part of this, now that I think about it, is that you have
> placement constraints encoded into the CRUSH rules (e.g., separate
> replicas across racks). Whatever is installing new mappings needs to
> understand those constraints so tha the remap entries also respect the
> policy. That means either we need to (1) understand the CRUSH rule
> constraints, or (2) encode the (simplified?) set of placement constraints
> in the Ceph OSDMap and auto-generate the CRUSH rule from that.
>
> For (1), I bet we can make a simple "CRUSH rule intepreter" that, instead
> of making pseudorandom choices, pick each child based on the minimum
> utilization (or the original mapping value) in order to generate the remap
> entries...
>
> sage
* Re: explicitly mapping pgs in OSDMap
2017-03-02 3:09 ` Matthew Sedam
@ 2017-03-02 6:17 ` Sage Weil
0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-02 6:17 UTC (permalink / raw)
To: Matthew Sedam; +Cc: ceph-devel
On Wed, 1 Mar 2017, Matthew Sedam wrote:
> Sage,
>
> Hi! I am a potential GSOC 2017 student, and I am interested in the
> Ceph-mgr: Smarter Reweight_by_Utilization project. However, when
> reading this I wondered if this proposed idea would make my GSOC
> project effectively null and void. Could you elaborate on this?
I would consider this proposal an evolution of the role of
reweight-by-utilization. The complexity in improving the approach lies less
in the mechanism used to make the adjustment, and more
in reasoning about how the current utilization is attributed to PGs, how
PG sizes are estimated, and so on. The problem is pretty straightforward
when you have a uniform CRUSH hierarchy and all data is spread across
the cluster; when you have different CRUSH rules that distribute to only
some devices, especially when those devices overlap, things get tricky.
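As a toy sketch of the utilization-attribution bookkeeping involved (illustrative names, not mgr code):

```cpp
#include <map>
#include <vector>

// Sketch: attribute estimated PG sizes back to the OSDs that hold
// them -- the kind of bookkeeping a smarter reweight-by-utilization
// needs.  Pools whose CRUSH rules overlap on the same devices simply
// add their PGs' contributions to the same OSDs.
std::map<int, double>
estimate_osd_utilization(const std::map<int, std::vector<int>>& pg_to_osds,
                         const std::map<int, double>& pg_size_estimate) {
    std::map<int, double> used;  // osd -> estimated bytes
    for (const auto& [pg, osds] : pg_to_osds) {
        double sz = pg_size_estimate.at(pg);
        for (int osd : osds)
            used[osd] += sz;     // each replica consumes a full copy
    }
    return used;
}
```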
In any case, this just makes the project more interesting, with several
possible avenues for improvement and optimization. :)
sage
>
> Matthew Sedam
>
> On Wed, Mar 1, 2017 at 1:44 PM, Sage Weil <sweil@redhat.com> wrote:
> > There's been a longstanding desire to improve the balance of PGs and data
> > across OSDs to better utilize storage and balance workload. We had a few
> > ideas about this in a meeting last week and I wrote up a summary/proposal
> > here:
> >
> > http://pad.ceph.com/p/osdmap-explicit-mapping
> >
> > The basic idea is to have the ability to explicitly map individual PGs
> > to certain OSDs so that we can move PGs from overfull to underfull
> > devices. The idea is that the mon or mgr would do this based on some
> > heuristics or policy and should result in a better distribution than teh
> > current osd weight adjustments we make now with reweight-by-utilization.
> >
> > The other key property is that one reason why we need as many PGs as we do
> > now is to get a good balance; if we can remap some of them explicitly, we
> > can get a better balance with fewer. In essense, CRUSH gives an
> > approximate distribution, and then we correct to make it perfect (or close
> > to it).
> >
> > The main challenge is less about figuring out when/how to remap PGs to
> > correct balance, but figuring out when to remove those remappings after
> > CRUSH map changes. Some simple greedy strategies are obvious starting
> > points (e.g., to move PGs off OSD X, first adjust or remove existing remap
> > entries targetting OSD X before adding new ones), but there are a few
> > ways we could structure the remap entries themselves so that they
> > more gracefully disappear after a change.
> >
> > For example, a remap entry might move a PG from OSD A to B if it maps to
> > A; if the CRUSH topology changes and the PG no longer maps to A, the entry
> > would be removed or ignored. There are a few ways to do this in the pad;
> > I'm sure there are other options.
> >
> > I put this on the agenda for CDM tonight. If anyone has any other ideas
> > about this we'd love to hear them!
> >
> > sage
>
>
* Re: explicitly mapping pgs in OSDMap
2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
2017-03-01 20:49 ` Dan van der Ster
2017-03-02 3:09 ` Matthew Sedam
@ 2017-03-02 17:33 ` Kamble, Nitin A
2017-03-02 17:40 ` Sage Weil
2 siblings, 1 reply; 13+ messages in thread
From: Kamble, Nitin A @ 2017-03-02 17:33 UTC (permalink / raw)
To: Sage Weil, ceph-devel
Hi Sage,
The crush algorithm handles the mapping of pgs, and it still will with the
addition of explicit mappings. I presume that finding which pgs belong to
which OSDs will involve additional computation for each
explicit mapping.
What would be the penalty of this additional computation?
For a small number of explicit mappings the penalty would be small, but
IMO it can get quite expensive with a large number of them.
The implementation will need to manage the count of explicit mappings,
reverting some of them as the distribution changes.
Understanding the additional overhead of the explicit mappings would
have a great influence on the implementation.
Nitin
On 3/1/17, 11:44 AM, "ceph-devel-owner@vger.kernel.org on behalf of Sage Weil" <ceph-devel-owner@vger.kernel.org on behalf of sweil@redhat.com> wrote:
There's been a longstanding desire to improve the balance of PGs and data
across OSDs to better utilize storage and balance workload. We had a few
ideas about this in a meeting last week and I wrote up a summary/proposal
here:
http://pad.ceph.com/p/osdmap-explicit-mapping
The basic idea is to have the ability to explicitly map individual PGs
to certain OSDs so that we can move PGs from overfull to underfull
devices. The idea is that the mon or mgr would do this based on some
heuristics or policy and should result in a better distribution than teh
current osd weight adjustments we make now with reweight-by-utilization.
The other key property is that one reason why we need as many PGs as we do
now is to get a good balance; if we can remap some of them explicitly, we
can get a better balance with fewer. In essense, CRUSH gives an
approximate distribution, and then we correct to make it perfect (or close
to it).
The main challenge is less about figuring out when/how to remap PGs to
correct balance, but figuring out when to remove those remappings after
CRUSH map changes. Some simple greedy strategies are obvious starting
points (e.g., to move PGs off OSD X, first adjust or remove existing remap
entries targetting OSD X before adding new ones), but there are a few
ways we could structure the remap entries themselves so that they
more gracefully disappear after a change.
For example, a remap entry might move a PG from OSD A to B if it maps to
A; if the CRUSH topology changes and the PG no longer maps to A, the entry
would be removed or ignored. There are a few ways to do this in the pad;
I'm sure there are other options.
I put this on the agenda for CDM tonight. If anyone has any other ideas
about this we'd love to hear them!
sage
* Re: explicitly mapping pgs in OSDMap
2017-03-02 17:33 ` Kamble, Nitin A
@ 2017-03-02 17:40 ` Sage Weil
[not found] ` <F6F90D3B-5362-48D1-B786-2191E5B98331@gmail.com>
0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2017-03-02 17:40 UTC (permalink / raw)
To: Kamble, Nitin A; +Cc: ceph-devel
On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
> Hi Sage,
> The crush algorithm handles mapping of pgs, and it will even with the
> addition of explicit mappings. I presume, finding which pgs belong to
> which OSDs will involve addition computation for each additional
> explicit mapping.
>
> What would be penalty of this additional computation?
>
> For small number of explicit mappings such penalty would be small,
> IMO it can get quite expensive with large number of explicit mappings.
> The implementation will need to manage the count of explicit mappings,
> by reverting some of the explicit mappings as the distribution changes.
> The understanding of additional overhead of the explicit mappings would
> had great influence on the implementation.
Yeah, we'll want to be careful with the implementation, e.g., by using an
unordered_map (hash table) for looking up the mappings (an rbtree won't
scale particularly well).
Note that pg_temp is already a map<pg_t,vector<int>> and can get big when
you're doing a lot of rebalancing, resulting in O(log n) lookups. If we
do something optimized here we should use the same strategy for pg_temp
too.
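A rough sketch of the hash-table lookup path (pg ids simplified to plain integers; not the actual OSDMap code):

```cpp
#include <unordered_map>
#include <vector>
#include <cstdint>

// With an unordered_map keyed by pg id, consulting the explicit remap
// table is an O(1) average-case hash lookup per mapping, regardless
// of how many entries exist -- versus O(log n) for a map/rbtree.
using RemapIndex = std::unordered_map<uint64_t, std::vector<int>>;

// Return the explicit mapping if one exists, else the CRUSH result.
std::vector<int> final_mapping(const RemapIndex& remaps,
                               uint64_t pgid,
                               const std::vector<int>& crush_result) {
    auto it = remaps.find(pgid);
    return it != remaps.end() ? it->second : crush_result;
}
```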
sage
* Re: explicitly mapping pgs in OSDMap
[not found] ` <F6F90D3B-5362-48D1-B786-2191E5B98331@gmail.com>
@ 2017-03-02 18:09 ` Sage Weil
2017-03-02 19:40 ` Tiger Hu
2017-03-03 9:20 ` Bartłomiej Święcki
0 siblings, 2 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-02 18:09 UTC (permalink / raw)
To: Tiger Hu; +Cc: Kamble, Nitin A, ceph-devel
On Fri, 3 Mar 2017, Tiger Hu wrote:
> Hi Sage,
> I am very glad you raised this issue. In some cases, users may want
> to accurately control the number of PGs on every OSD. Would it be
> possible to add a new policy supporting such a fixed mapping? This
> may be useful for performance tuning or testing purposes. Thanks.
In principle this could be used to map every pg explicitly without regard
for CRUSH. I think in practice, though, we can have CRUSH map 80-90% of
the PGs and then have explicit mappings for the remaining 10-20% in order
to achieve the same "perfect" balance of PGs.
The result is a smaller OSDMap. It might not matter that much for small
clusters, but for large clusters, it is helpful to keep the OSDMap small.
OTOH, maybe through all of this we end up in a place where the OSDMap is
just an explicit map and the mgr agent that's managing it is using CRUSH
to generate it; who knows! That might make for faster mappings on the
clients since it's a table lookup instead of a mapping calculation.
I'm thinking that if we implement the ability to have the explicit
mappings in the OSDMap we open up both possibilities. The hard part is
always getting the mapping compatibility into the client, so if we do that
right, luminous+ clients will support explicit mapping (overrides) in
OSDMap and would be able to work with future versions that go all-in on
explicit...
sage
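The hybrid scheme described here can be sketched as an override table layered over CRUSH (a toy sketch with invented names; `crush_fn` stands in for the real CRUSH calculation, and `overrides` for the explicit mapping table):

```python
# Toy sketch of the hybrid scheme (invented names, not the OSDMap API):
# consult the explicit override table first; fall back to CRUSH for
# every pg that has no override.
def map_pg(pg, crush_fn, overrides):
    """Return the acting set for a pg: the explicit override if one
    exists, otherwise whatever CRUSH computes."""
    if pg in overrides:
        return overrides[pg]
    return crush_fn(pg)

# Stand-in for CRUSH: any deterministic pg -> osds function will do here.
def fake_crush(pg):
    return [pg % 4, (pg + 1) % 4, (pg + 2) % 4]
```

With an empty override table every pg is placed purely by CRUSH; adding entries for the worst-balanced 10-20% of PGs corrects the distribution without touching the rest.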
* Re: explicitly mapping pgs in OSDMap
2017-03-02 18:09 ` Sage Weil
@ 2017-03-02 19:40 ` Tiger Hu
2017-03-03 9:20 ` Bartłomiej Święcki
1 sibling, 0 replies; 13+ messages in thread
From: Tiger Hu @ 2017-03-02 19:40 UTC (permalink / raw)
To: Sage Weil; +Cc: Kamble, Nitin A, ceph-devel
Sage,
Thanks for your reply. Looking forward to explicit mapping.
Tiger
* Re: explicitly mapping pgs in OSDMap
2017-03-02 18:09 ` Sage Weil
2017-03-02 19:40 ` Tiger Hu
@ 2017-03-03 9:20 ` Bartłomiej Święcki
2017-03-03 14:55 ` Sage Weil
1 sibling, 1 reply; 13+ messages in thread
From: Bartłomiej Święcki @ 2017-03-03 9:20 UTC (permalink / raw)
To: Sage Weil; +Cc: ceph-devel
Hi Sage,
The more I think about explicit-only mapping (so clients and OSDs
not doing CRUSH computations anymore), the more interesting
use cases I can see, better balance being only one of them.
Another major advantage I see is that the client would become
independent of internal CRUSH calculation changes (straw->straw2 was a
bit problematic on our production cluster, for example), which I
believe would also simplify keeping backward compatibility, especially
for the kernel code. It also looks like the code would get simpler.
Such explicit mapping could also help with all kinds of cluster
reconfiguration - e.g. growing the cluster by a significant number of
OSDs could be spread over time to reduce backfill impact, and the same
goes for increasing pgp_num. I also believe peering could gain here
too, because the mon/mgr could ensure that only a limited number of
PGs is blocked in peering at the same time.
About memory overhead - I agree that it is the most important factor to
consider. I don't know Ceph internals well enough yet to be
authoritative here, but I believe that during a cluster rebalance the
amount of data that has to be maintained can already grow well beyond
what explicit mapping would need. Would it be possible for such an
explicit map to also store the list of potential OSDs for as long as
the PG is not fully recovered? That would mean the osdmap history could
be trimmed much faster, it would be easier to predict mon resource
requirements, and monitors would sync much faster during recovery.
Regards,
Bartek
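The gradual-reconfiguration idea could look something like this (a toy sketch, all names invented): the mon/mgr installs or retires override entries in bounded batches per osdmap epoch, capping how many PGs are backfilling or peering at once.

```python
# Toy sketch of throttled remap changes (invented names): apply at most
# batch_size changes per osdmap epoch, so only a bounded number of PGs
# is backfilling or blocked in peering at any one time.
def plan_batches(pending_changes, batch_size):
    """Split a list of pending remap changes into per-epoch batches."""
    return [pending_changes[i:i + batch_size]
            for i in range(0, len(pending_changes), batch_size)]
```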
* Re: explicitly mapping pgs in OSDMap
2017-03-03 9:20 ` Bartłomiej Święcki
@ 2017-03-03 14:55 ` Sage Weil
0 siblings, 0 replies; 13+ messages in thread
From: Sage Weil @ 2017-03-03 14:55 UTC (permalink / raw)
To: Bartłomiej Święcki; +Cc: ceph-devel
On Fri, 3 Mar 2017, Bartłomiej Święcki wrote:
> Hi Sage,
>
> The more I think about explicit-only mapping (so clients and OSDs
> not doing CRUSH computations anymore), the more interesting
> use cases I can see, better balance being only one of them.
>
> Another major advantage I see is that the client would become
> independent of internal CRUSH calculation changes (straw->straw2 was a
> bit problematic on our production cluster, for example), which I
> believe would also simplify keeping backward compatibility, especially
> for the kernel code. It also looks like the code would get simpler.
>
> Such explicit mapping could also help with all kinds of cluster
> reconfiguration - e.g. growing the cluster by a significant number of
> OSDs could be spread over time to reduce backfill impact, and the same
> goes for increasing pgp_num. I also believe peering could gain here
> too, because the mon/mgr could ensure that only a limited number of
> PGs is blocked in peering at the same time.
Yeah
> About memory overhead - I agree that it is the most important factor to
> consider. I don't know Ceph internals well enough yet to be
> authoritative here, but I believe that during a cluster rebalance the
> amount of data that has to be maintained can already grow well beyond
> what explicit mapping would need. Would it be possible for such an
> explicit map to also store the list of potential OSDs for as long as
> the PG is not fully recovered? That would mean the osdmap history could
> be trimmed much faster, it would be easier to predict mon resource
> requirements, and monitors would sync much faster during recovery.
Keeping the old osdmaps around is a somewhat orthogonal problem, and if we
address that I suspect it'd be a different solution. There is a 'past
intervals' PR in flight that makes the OSDs' tracking of this more
efficient, and I suspect we could come up with something that would let us
trim sooner than we currently do. It'll take some careful planning,
though!
sage
end of thread, other threads:[~2017-03-03 15:20 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 19:44 explicitly mapping pgs in OSDMap Sage Weil
2017-03-01 20:49 ` Dan van der Ster
2017-03-01 22:10 ` Sage Weil
2017-03-01 23:18 ` Allen Samuels
2017-03-02 3:42 ` Xiaoxi Chen
2017-03-02 3:09 ` Matthew Sedam
2017-03-02 6:17 ` Sage Weil
2017-03-02 17:33 ` Kamble, Nitin A
2017-03-02 17:40 ` Sage Weil
[not found] ` <F6F90D3B-5362-48D1-B786-2191E5B98331@gmail.com>
2017-03-02 18:09 ` Sage Weil
2017-03-02 19:40 ` Tiger Hu
2017-03-03 9:20 ` Bartłomiej Święcki
2017-03-03 14:55 ` Sage Weil